A brief history of speech synthesis (text to speech)

All our services rely on server-based software that performs the speech synthesis, known as text-to-speech software. The voices we use come from different providers, but the techniques behind them have much in common. We would therefore like to tell you briefly about the history and development of speech synthesis.

The history of speech synthesis

Over the last few years the quality of speech produced with text to speech has improved greatly. Many people think that synthetic speech, as it is also called, sounds like the robots in older movies. The truth is that some voices now sound almost like recorded speech, and as a result the user groups for our services have grown strongly in recent years.
When we invented the talking web in 2001, the target group was people with reading difficulties, but today the user group is much broader.

What you may not know is that the first synthetic speech was produced as early as the late 18th century. The machine was built of wood and leather, and operating it to produce audible speech was very complicated. It was constructed by Wolfgang von Kempelen and was of great importance to the early study of phonetics. The picture to the right shows the original construction as it can be seen at the Deutsches Museum (von Meisterwerken der Naturwissenschaft und Technik) in Munich, Germany.

Von Kempelen's speech synthesis machine

Here’s an audio sample of the synthetic speech the machine produced (WAV file, 776 kB).

(First a human says a sentence, then the machine tries to say the same thing. The recording was made with a reconstruction of von Kempelen's machine.)

In the early 20th century it became possible to use electricity to create synthetic speech. The first known electric speech synthesizer was the “Voder”, which its creator Homer Dudley presented to a broad audience in 1939 at the World's Fair in New York.

Here’s an audio sample of the Voder, the first electronic speech synthesizer (WAV file, 381 kB).

One of the pioneers of speech synthesis in Sweden was Gunnar Fant. During the 1950s he was responsible for the development of the first Swedish speech synthesizer, OVE (Orator Verbis Electris). At that time, only Walter Lawrence's Parametric Artificial Talker (PAT) could compete with OVE in speech quality.

Here’s a sample of OVE speech synthesis (WAV file, 77 kB).

And here’s a sample of PAT speech synthesis (WAV file, 117 kB).

Both OVE and PAT were based on formant synthesis.

Speech synthesis becomes more human-like

The greatest improvements in the naturalness of synthetic speech have come during the last 10 years. The first voices we used for ReadSpeaker back in 2001 were produced using diphone synthesis. These voices are built from recordings of real speech that are cut into diphones, short units that each span the transition from one phoneme (a basic sound of the language) to the next. This is a form of concatenation synthesis: the recorded units are joined together to form new utterances. However, diphone voices still have an artificial, synthetic sound. We still use them for some smaller languages, and they are widely used to speech-enable handheld computers and mobile phones because they need little memory and CPU.

It wasn’t until the introduction of a technique called unit selection that voices began to sound very natural. This is still concatenation synthesis, but the units used can be larger than single phonemes, sometimes as large as a complete sentence. We use different providers for different languages to ensure that we always offer the best voices available for each language.
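
To make the difference between the two concatenation approaches a bit more concrete, here is a minimal Python sketch. It is a toy illustration only, not how ReadSpeaker or any voice provider actually implements synthesis. The function names, the phoneme symbols and the tiny inventory are all made up for this example, and the units are just labels, not audio. Real unit selection systems typically choose among many candidate recordings by weighing how well each one fits the target and how smoothly it joins its neighbours; the greedy longest-match rule below is a deliberate simplification.

```python
def to_diphones(phonemes, silence="_"):
    """Diphone view: pad the utterance with silence and pair each
    sound with the one that follows it."""
    padded = [silence, *phonemes, silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def select_units(phonemes, inventory):
    """Toy unit selection: cover the phoneme sequence with the longest
    recorded units available, falling back to single phonemes.
    (Assumption: greedy longest match stands in for the cost-based
    search a real system would use.)"""
    units, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):
            candidate = tuple(phonemes[i:i + length])
            # A single phoneme is always accepted as a last resort.
            if length == 1 or candidate in inventory:
                units.append(candidate)
                i += length
                break
    return units


# Hypothetical phoneme labels and recorded inventory.
hello = ["h", "e", "l", "o"]
recorded = {("h", "e"), ("l", "o"), ("h", "e", "l", "o")}

print(to_diphones(hello))
print(select_units(hello, recorded))
print(select_units(["h", "e", "l", "p"], recorded))
```

In the diphone view every utterance is assembled from many small pieces, so there are many joins. In the unit selection view a word that was recorded as a whole can be reused in one piece, which leaves fewer joins and is one reason such voices can sound more natural.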

In an upcoming post we will cover the different techniques behind speech synthesis.

Thanks to Professor Hartmut Traunmüller, Dept. of Linguistics at the University of Stockholm, for many of the facts, the picture and the sound samples on this page.

Posted in: Speech synthesis

© 2012 ReadSpeaker Holding B.V. | www.readspeaker.com