Pronunciation corrections in online text to speech

Although text-to-speech technology has made a lot of progress, it can still stumble on certain terms, such as acronyms, abbreviations, date formats or number representations, to name a few. Some of our customers need the pronunciation fine-tuned, for example in the pharmaceutical sector, where it is especially important that every term is read correctly.
Every account that we open comes with a dictionary specific to that customer, and we offer each customer a service that helps with pronunciation issues when they arise. Some pronunciation corrections are only relevant to a particular customer's dictionary, but in some instances a correction can also go into the default dictionary and benefit our entire customer base. We have a very knowledgeable network of linguists who can help our customers in many parts of the world when they encounter pronunciation difficulties.
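The way a customer dictionary sits on top of the default dictionary can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the actual implementation: the function name `apply_corrections`, the dictionary contents and the respellings are all invented for the example. The sketch assumes that an entry in the customer dictionary takes priority over the default entry for the same term.

```python
import re

# Hypothetical entries, invented for illustration only.
DEFAULT_DICTIONARY = {"mg": "milligrams", "IV": "four"}
CUSTOMER_DICTIONARY = {"IV": "intravenously"}  # customer-specific override


def apply_corrections(text, customer_dictionary, default_dictionary):
    """Replace known terms before the text is sent to the synthesizer.

    Assumption: customer entries override default entries for the same term.
    """
    merged = {**default_dictionary, **customer_dictionary}
    if not merged:
        return text
    # Try longer terms first so a short term never shadows a longer one.
    terms = sorted(merged, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"
    )
    return pattern.sub(lambda match: merged[match.group(1)], text)


print(apply_corrections("Take 5 mg IV", CUSTOMER_DICTIONARY, DEFAULT_DICTIONARY))
# prints "Take 5 milligrams intravenously"
```

A correction that proves useful for everybody would simply move from the customer dictionary to the default one in this model.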
We have prepared a few online text to speech demos that show the before-and-after effect of our work on some types of words that speech synthesis can mispronounce.
Photo Credit: Jim Linwood
Text to Speech Online Worldwide

We get requests for speech-enabling web sites and mobile apps from all kinds of places. To meet this demand, we offer a large (and growing) range of languages and voices for our customers' online text to speech needs.
- To date we offer 35 languages and 88 voices!
- Female voices are in the majority: 54 of them versus 34 male voices.
- English has the most voices, with a total of 13 (6 US English, 5 UK English, 1 Scottish English and 1 Australian English), followed by Spanish with 7 voices (4 Castilian Spanish, 3 American Spanish), Dutch with 7 voices, and French and German with 5 voices each.
- Did you know that we offer 2 variants of Norwegian: Bokmål (used by 85 to 90% of the Norwegian population) and Nynorsk?
- Apart from Spanish, we also offer Catalan, Valencian, Galician and Basque. Did you know that in the Catalan, Balearic Islands, Valencian, Basque, Navarra and Galician areas of Spain, only part of the population speaks Spanish exclusively?
- If you have web sites and/or mobile apps in Arabic, we can provide you with a choice of 4 voices (2 female and 2 male). We also have 3 Turkish voices. Did you know that the earliest known Arabic texts go back to the 8th century BC?
- For web sites in Eastern Europe (as defined by the United Nations), we cover Czech, Romanian, Polish and Russian.
- Did you know that we also offer Faroese and Finland Swedish? The latter is a group of Swedish dialects spoken by the Swedish-speaking Finns in Finland.
- We also have Welsh, known as Cymraeg in the language itself.
- Did you know that as many as 15% of words differ between Portuguese and Brazilian Portuguese, both of which we also have in our portfolio?
How text to speech is made

Following yesterday’s post about a brief history of text to speech, today we list some of the techniques involved in creating speech synthesis.
Articulatory synthesis
In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and the vocal folds are used to simulate how air flows through the vocal tract and to calculate the resulting sound. Finding good mathematical models is a great challenge, so articulatory synthesis is still at the research stage. The technique is very computation-intensive, but its memory requirements are almost nil.
Formant
Formant synthesis is a kind of source-filter method based on mathematical models of the human speech organs.
The vocal tract is modelled as a number of resonances that resemble the formants (frequency bands with high energy in the voice) of natural speech.
The first electronic voices, Voder and later OVE and PAT, spoke with entirely synthetic, electronically produced sounds using formant synthesis. As with articulatory synthesis, memory consumption is small but CPU usage is high.
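The source-filter idea can be illustrated with a small Python sketch: a pulse train (the source) is passed through a chain of two-pole resonators (the filter), one per formant. This is a simplified illustration and not how any of the voices mentioned here are built. The function names, the pitch, the bandwidths and the formant frequencies are all assumed, rough values chosen for the example.

```python
import math


def resonator(signal, frequency_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: y[n] = x[n] + b1*y[n-1] + b2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    b1 = 2 * r * math.cos(2 * math.pi * frequency_hz / sample_rate)
    b2 = -r * r
    y1 = y2 = 0.0
    output = []
    for x in signal:
        y = x + b1 * y1 + b2 * y2
        output.append(y)
        y2, y1 = y1, y
    return output


def synthesize_vowel(formants, pitch_hz=120, duration_s=0.2, sample_rate=16000):
    """Filter a pulse train through one resonator per (frequency, bandwidth) pair."""
    sample_count = round(duration_s * sample_rate)
    period = int(sample_rate / pitch_hz)
    # Source: one pulse per pitch period, silence in between.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(sample_count)]
    # Filter: cascade of resonances standing in for the vocal tract.
    for frequency_hz, bandwidth_hz in formants:
        signal = resonator(signal, frequency_hz, bandwidth_hz, sample_rate)
    return signal


# Assumed, approximate formants for an "a"-like vowel: (frequency, bandwidth) in Hz.
samples = synthesize_vowel([(700, 80), (1100, 90), (2450, 120)])
```

The sketch leaves out gain normalization and any time-varying formants, so the result is a raw, unscaled list of sample values for a steady buzz-like vowel.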
Concatenative synthesis
Concatenative synthesis is built from recorded pieces of speech (sound clips) that are cut into units and joined together to form speech. Depending on the length of the sound clips used, the result is a diphone or a polyphonic synthesis. The latter, in a more developed version, is also called unit selection synthesis, where the synthesizer has access to both long and short segments of speech and chooses the best segments for the context at hand.
Diphone
In diphone synthesis, the units taken from the recorded speech are very small.
The strength of this approach is that almost any sentence or expression can be read. However, pronunciation errors are quite common, and if the prosody model is poor, or modelling is difficult, the speech may sound rather monotonous.
Diphone synthesis does not work that well for languages with many inconsistencies in their pronunciation rules (English, Swedish, etc.) or in special cases where letters are pronounced differently than usual. It works better for languages with highly consistent pronunciation (Spanish, Finnish, etc.). Another advantage is that the prosody, the intonation, can be described in great detail.
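To make the size of these units concrete, here is a minimal Python sketch that turns a sequence of speech sounds into the diphones a synthesizer would need to look up. It assumes the usual definition of a diphone as the transition from one sound to the next. The function name `to_diphones`, the `_` silence marker and the simplified sound symbols are our own choices for the example.

```python
def to_diphones(phones):
    """List the sound-to-sound transitions needed to speak `phones`.

    "_" marks the silence before and after the utterance (an assumed convention).
    """
    padded = ["_"] + list(phones) + ["_"]
    return [f"{first}-{second}" for first, second in zip(padded, padded[1:])]


print(to_diphones(["h", "e", "l", "o"]))
# prints ['_-h', 'h-e', 'e-l', 'l-o', 'o-_']
```

Because every possible transition can be recorded once and reused, a fairly small inventory is enough to read almost any text, which matches the strength described above.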
Unit selection
The greatest difference between a unit selection voice and a diphone voice is the length of the speech segments used. Entire words and phrases are stored in the unit database. This means that the database for a unit selection voice is many times bigger than for a diphone voice. Thus, memory consumption is huge while CPU consumption is low.
The most important challenge is still to get a natural and smooth prosody. This is hard because the units carry both intonation and pronunciation, since entire phrases are used almost directly from the recorded data. Since the first unit selection voice was released, over eight years ago, there have been many improvements with each new voice and every release. This is by far the most widely used technique among our providers.
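The idea of choosing among long and short stored segments can be sketched as a small search problem. The Python example below covers a sentence with as few stored units as possible, so that long recorded phrases are preferred over single words. This is a deliberate simplification: counting joins is our stand-in for the cost measures a real system would use, which also take the acoustic fit of each segment into account. The function name `select_units` and the contents of the unit database are invented for the example.

```python
def select_units(words, unit_database):
    """Cover `words` with the fewest stored units (that is, the fewest joins)."""
    word_count = len(words)
    # best[i] holds (number of units, units) for the best cover of words[:i].
    best = [None] * (word_count + 1)
    best[0] = (0, [])
    for end in range(1, word_count + 1):
        for start in range(end):
            if best[start] is None:
                continue
            unit = " ".join(words[start:end])
            if unit in unit_database:
                candidate = (best[start][0] + 1, best[start][1] + [unit])
                if best[end] is None or candidate[0] < best[end][0]:
                    best[end] = candidate
    if best[word_count] is None:
        raise ValueError("no combination of stored units covers the input")
    return best[word_count][1]


# Hypothetical unit database of recorded words and phrases.
database = {"good", "morning", "good morning", "everyone"}
print(select_units("good morning everyone".split(), database))
# prints ['good morning', 'everyone']
```

The recorded phrase "good morning" wins over the two separate words because it needs one join fewer, which is the intuition behind keeping long units in the database.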
HMM synthesis
A fairly new technology is speech synthesis based on HMMs, a mathematical concept called hidden Markov models. It is a statistical method in which the text-to-speech system is based on a model that is not known beforehand but is refined through continuous training. The technique consumes a lot of CPU resources but very little memory. This approach seems to give a better prosody, without glitches, while still producing very natural-sounding, human-like speech. We collaborate with providers offering this technique as well.
Customizations and improvements
On top of using the best voices available, we add our own layer of improvements, both general and customer-specific customizations. Our linguists have long experience of speech synthesis and work with transcriptions to tweak the pronunciation and reading of the spoken text. This lets us help customers who want to optimize the quality of the text to speech on their web pages. Sometimes a quality check of a couple of hours, listening to your website and correcting the errors we find, is enough. In other cases, customers have industry-specific words (think of the pharmaceutical industry, for example) that must be pronounced correctly.
One of the largest customizations we have made so far was for a customer who sent us a list of over 3000 words that had to be quality checked. Another was for a site with about 200 000 pages, where the same acronym or abbreviation had to be expanded differently depending on which part of the site it appeared in. Many users wonder why the same voice reads so much better in our services than when the same voice, or text-to-speech system, reads similar or identical content in other software or services. The answer is the customizations described above.
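The second customization, where one acronym is expanded differently in different parts of a site, can be pictured with a short Python sketch. It is an illustration only and makes several assumptions: that the part of the site can be recognized from the page path, and that the most specific matching path prefix should win. The function name `expand_for_page`, the paths and the expansions are all invented for the example.

```python
import re

# Hypothetical rules: path prefix -> acronym -> expansion to be read aloud.
SECTION_RULES = {
    "/medical/": {"MS": "multiple sclerosis"},
    "/software/": {"MS": "Microsoft"},
}


def expand_for_page(text, page_path, section_rules):
    """Expand acronyms using the rules of the site section the page belongs to."""
    matching_prefixes = [p for p in section_rules if page_path.startswith(p)]
    if not matching_prefixes:
        return text  # no section-specific rules: leave the text untouched
    # Assumption: the longest (most specific) matching prefix wins.
    rules = section_rules[max(matching_prefixes, key=len)]
    if not rules:
        return text
    acronyms = sorted(rules, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(a) for a in acronyms) + r")\b")
    return pattern.sub(lambda match: rules[match.group(1)], text)


print(expand_for_page("New MS research", "/medical/news/1", SECTION_RULES))
# prints "New multiple sclerosis research"
print(expand_for_page("New MS research", "/software/blog", SECTION_RULES))
# prints "New Microsoft research"
```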
Thanks to Professor Hartmut Traunmüller, Dept. of Linguistics at the University of Stockholm for a lot of the facts, the picture and the sound samples on this page.
A brief history of speech synthesis (text to speech)

Behind all our services there is server-based software, called text-to-speech software, that performs the speech synthesis. The voices we use come from different providers, but the techniques behind the different voices have many similarities. We would therefore like to tell you briefly about the development of speech synthesis and its history.










