How text to speech is made

Following yesterday’s post about a brief history of text to speech, today we list some of the techniques involved in creating speech synthesis.

Articulatory synthesis

In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and the vocal cords are used to simulate how air flows through the vocal tract and to calculate what the resulting sound will be. Finding good mathematical models is a great challenge, so articulatory synthesis is still at the research stage. The technique is very computation-intensive, but its memory requirements are almost nil.

Formant

Formant synthesis is a kind of source-filter method based on mathematical models of the human speech organs.
The vocal tract is modelled as a number of resonances that resemble the formants (frequency bands with high energy) of natural speech.
The first electronic voices, Voder and later OVE and PAT, spoke with entirely synthetic, electronically produced sounds using formant synthesis. As with articulatory synthesis, memory consumption is small but CPU usage is high.
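
The source-filter idea can be illustrated with a few lines of Python. This is only a minimal sketch under our own assumptions, not the method used by Voder, OVE, PAT or any commercial voice: the source is a plain pulse train at the pitch frequency, and each formant is a simple two-pole resonator. The function names and the formant values (roughly those of an "ah"-like vowel) are illustrative choices.

```python
import math

def pulse_train(f0_hz, duration_s, sample_rate):
    """Source: one unit pulse per pitch period, zeros in between."""
    period = int(round(sample_rate / f0_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter: two-pole resonator that boosts energy around freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * freq_hz / sample_rate
    b = 2.0 * r * math.cos(theta)
    c = -r * r
    a = 1.0 - b - c          # keeps the gain at 0 Hz equal to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.2, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = pulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative formant values (Hz) and bandwidths (Hz); assumptions, not measurements.
samples = synthesize_vowel([(700, 80), (1200, 90), (2600, 120)])
```

Note that the whole "voice" here is a handful of numbers, which is why a synthesizer of this kind needs so little memory.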

Concatenative synthesis

Concatenative synthesis is built from recorded pieces of speech (sound clips) that are cut into units and then joined together to form speech. Depending on the length of the clips used, the result is either a diphone synthesis or a polyphone synthesis. The latter, in its more developed form, is also called unit selection synthesis: the synthesizer has access to both long and short segments of speech and chooses the best segments for the context at hand.

Diphone

In diphone synthesis, the elements taken from the recorded speech are very small.
The strength of this approach is that almost any sentence or expression can be read. However, pronunciation errors are fairly common, and if the prosody model is poor, or the prosody is hard to model, the speech may sound rather monotonous.
Diphone synthesis does not work well for languages with many inconsistencies in their pronunciation rules (English, Swedish, etc.), or in special cases where letters are pronounced differently from the general rule. It works better for languages with highly consistent pronunciation (Spanish, Finnish, etc.). Another advantage is that the prosody (the intonation) can be described in great detail.
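
To make the size of the units concrete: a diphone is usually taken to run from the middle of one speech sound to the middle of the next, so a word with four sounds needs five diphones, including the transitions from and to silence. The small Python sketch below shows that bookkeeping only. The phoneme symbols, the "_" silence marker and the function name are our own illustrative assumptions, and real systems use a proper phonetic alphabet.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into the diphone units needed to speak it."""
    padded = [silence] + list(phonemes) + [silence]
    # Each diphone covers the transition between two neighbouring sounds.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

units = to_diphones(["h", "e", "l", "o"])
print(units)  # ['_-h', 'h-e', 'e-l', 'l-o', 'o-_']
```

Because every pair of neighbouring sounds needs its own recording, a language with N distinct sounds needs on the order of N × N diphones, which is still a small database compared with one that stores whole words and phrases.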

Unit selection

The greatest difference between a unit selection voice and a diphone voice is the length of the speech segments used. Entire words and phrases are stored in the unit database. This means that the database for a unit selection voice is many times bigger than for a diphone voice. Thus, memory consumption is huge while CPU consumption is low.

The most important challenge is still to get natural and smooth prosody. This is hard because the units carry both intonation and pronunciation, since entire phrases are used almost directly from the recorded data. Since the first unit selection voice was released over eight years ago, there have been great improvements with every new voice and every release. This is by far the most widely used technique among our providers.
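
Choosing "the best segments for the context" is commonly described as a search that balances two things: how well each recorded unit matches what is wanted (a target cost) and how smoothly neighbouring units join (a join cost). The Python sketch below shows that idea with a small dynamic-programming search. It is a simplified illustration under our own assumptions, not the algorithm of any particular provider: the Unit fields, the pitch-only cost functions and the toy database are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    label: str          # the word or sound this recording contains
    start_pitch: float  # pitch in Hz at the start of the recording
    end_pitch: float    # pitch in Hz at the end of the recording

def target_cost(unit, wanted_pitch):
    """How far the unit's average pitch is from the wanted pitch."""
    return abs((unit.start_pitch + unit.end_pitch) / 2 - wanted_pitch)

def join_cost(left, right):
    """How big the pitch jump is where two units meet."""
    return abs(left.end_pitch - right.start_pitch)

def select_units(targets, database):
    """targets: list of (label, wanted_pitch). Returns the cheapest unit sequence."""
    if not targets:
        return []
    candidates = [[u for u in database if u.label == label] for label, _ in targets]
    if any(not options for options in candidates):
        raise ValueError("no recorded unit for at least one target")
    costs = [target_cost(u, targets[0][1]) for u in candidates[0]]
    backpointers = []
    for i in range(1, len(targets)):
        previous = candidates[i - 1]
        new_costs, pointers = [], []
        for unit in candidates[i]:
            totals = [costs[k] + join_cost(previous[k], unit) for k in range(len(previous))]
            best = min(range(len(totals)), key=totals.__getitem__)
            new_costs.append(totals[best] + target_cost(unit, targets[i][1]))
            pointers.append(best)
        costs = new_costs
        backpointers.append(pointers)
    # Walk back from the cheapest final candidate to recover the full path.
    k = min(range(len(costs)), key=costs.__getitem__)
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)]

# Toy database: two recordings of each word with different intonation.
database = [
    Unit("hello", 120.0, 110.0), Unit("hello", 130.0, 140.0),
    Unit("world", 112.0, 100.0), Unit("world", 150.0, 120.0),
]
chosen = select_units([("hello", 115.0), ("world", 105.0)], database)
```

The search itself is cheap; the cost of the approach lies in storing enough recordings that a good candidate exists for every context.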

HMM synthesis

A fairly new technology is speech synthesis based on hidden Markov models (HMMs), a mathematical concept. It is a statistical method in which the text-to-speech system is built on a model that is not known beforehand but is refined through continuous training. The technique consumes a lot of CPU resources but very little memory. This approach seems to give better prosody, without glitches, while still producing very natural-sounding, human-like speech. We collaborate with providers offering this technique as well.

Customizations and improvements

On top of using the best voices available, we add our own layer of improvements, both general and customer-specific. We have linguists with long experience of speech synthesis who work with transcriptions to tweak the pronunciation and reading of the spoken text. We can therefore greatly help customers who want to optimize the quality of the text to speech on their web pages. Sometimes a quality check of a couple of hours, in which we listen to your website and correct the errors we find, is enough. In other cases, customers have industry-specific words (think of the pharmaceutical industry, for example) that must be pronounced correctly.

One of the largest customizations we have made so far was for a customer who sent us a list of over 3000 words that had to be quality-checked. Another was for a site with about 200 000 pages, where the same acronym or abbreviation had to be expanded differently depending on the part of the site in which it appeared. Many users wonder why the same voice reads so much better in our services than when the same voice, or text-to-speech system, reads similar or identical content in other software or services. The answer is the customizations described above.
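
As a rough illustration of what a section-dependent acronym rule can look like, here is a minimal Python sketch. It is not our production implementation: the rule table, the URL prefixes, the example acronym and the function name are all made up for the example.

```python
import re

# Hypothetical rules: (URL path prefix, acronym) -> how it should be spoken.
SECTION_EXPANSIONS = {
    ("/health/", "MS"): "multiple sclerosis",
    ("/it/", "MS"): "Microsoft",
}
# Hypothetical fallback when no section rule applies: spell the letters out.
DEFAULT_EXPANSIONS = {"MS": "M S"}

def expand_acronyms(text, page_path):
    """Replace known acronyms in text according to the page's section."""
    def replace(match):
        word = match.group(0)
        for (prefix, acronym), spoken in SECTION_EXPANSIONS.items():
            if acronym == word and page_path.startswith(prefix):
                return spoken
        return DEFAULT_EXPANSIONS.get(word, word)
    # Only look at runs of two or more capital letters standing as a word.
    return re.sub(r"\b[A-Z]{2,}\b", replace, text)

print(expand_acronyms("MS research update", "/health/news"))  # multiple sclerosis research update
print(expand_acronyms("MS Office tips", "/it/guides"))        # Microsoft Office tips
```

The text sent to the voice is rewritten before synthesis, so the same rule table works regardless of which voice reads the page.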

Thanks to Professor Hartmut Traunmüller, Dept. of Linguistics at the University of Stockholm, for many of the facts, the picture and the sound samples on this page.

A brief history of speech synthesis (text to speech)

Behind all our services is server-based software that performs the speech synthesis, called text-to-speech software. The voices we use come from different providers, but the techniques behind them have many similarities. We would therefore like to tell you briefly about the development of speech synthesis and its history.

Posted in: Speech synthesis

The ReadSpeaker formReader story

With 10 years of experience in speech-enabling the web, it is high time to broaden our scope beyond just making content speak on the web and on mobile phones. In these 10 years, as you all know, the web has gone through a number of dramatic changes. From being all about information, it is now about transaction, interaction and socializing. How can web-based speech enabling improve these areas? To start with, text is still a problem for a lot of people. Statistics on reading difficulties, for example, have not changed just because the web has moved forward. In fact, the more day-to-day activities move online, the wider the digital divide gets. Exclusion rather than inclusion. That doesn’t feel so 2010.

Sure, speech-enabling the web is not the answer to every question or every prayer, but it certainly is a means of reducing the digital divide.

Online banking and other financial services, government and company e-services, e-commerce, surveys, etc. all interact with users through some kind of online form, where users can carry out various tasks whenever they like. Apart from being very convenient for the user, this also saves costs for the organization offering the services. Automated processes, case handling systems and online customer support services make a large number of organizations more efficient. However, have they made all the necessary efforts to make the front end as usable and accessible as possible?

Not making a form accessible and usable is as wise as putting a door 1.76 meters tall and 0.48 meters wide, 50 centimeters above the ground, as the only entrance to a supermarket. With average height, width and gymnastic skills you can get in; without them, you can’t.

Since we know that speech enabling helps a lot of people, we developed a prototype of what came to be ReadSpeaker formReader. We implemented it on a few forms (e-services) on a municipality website in Sweden. We also gathered a test group of people from different disability groups (plus a few elderly people and some non-native speakers). After the test phase we did what we normally do when developing a new product: we went back to the drawing board and incorporated the results of the user tests. Speech-enabling forms helps. Having audio prompts that tell you what to fill in, and a voice that reads back what you have written or chosen, proved to be very useful. With more people able to fill out the forms themselves, and to fill them in accurately (thanks to the “proof listening”), the organization offering the service gets better value for its investment.

And since formReader works pretty much like a screen reader, the requirements on the forms are the same: they should be properly coded according to the W3C/WCAG guidelines.

During the implementation of formReader on the municipality website, a couple of accessibility issues became very obvious and were easily corrected by the municipality’s web developers. So the result, regardless of whether you choose to activate formReader or not, was a better and more accessible web form.
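
One typical example of a "properly coded" form is that every input field has a label tied to it, so that a screen reader (or a form reader) knows what to announce. The Python sketch below checks only that one point, using the standard library HTML parser. It is a rough illustration, not part of formReader and nowhere near a full WCAG audit: the class and function names are our own, and other valid ways of labelling a field (such as wrapping it in the label element, or ARIA attributes) are deliberately ignored.

```python
from html.parser import HTMLParser

class LabelCheck(HTMLParser):
    """Collects form field ids and the ids that <label for="..."> points at."""

    def __init__(self):
        super().__init__()
        self.label_targets = set()
        self.field_ids = []  # id of each visible field, or None if it has no id

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label" and attrs.get("for"):
            self.label_targets.add(attrs["for"])
        elif tag in ("input", "select", "textarea"):
            # Hidden fields and buttons do not need a separate label.
            if attrs.get("type") not in ("hidden", "submit", "button"):
                self.field_ids.append(attrs.get("id"))

def unlabeled_fields(html):
    """Return the ids of fields that no <label for="..."> refers to."""
    parser = LabelCheck()
    parser.feed(html)
    return [fid for fid in parser.field_ids if fid not in parser.label_targets]

form = (
    '<form><label for="name">Name</label><input id="name" type="text">'
    '<input id="email" type="text">'
    '<input type="submit" value="Send"></form>'
)
print(unlabeled_fields(form))  # ['email']
```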

CSUN2010_ReadSpeaker_formReader_Presentation

webReader module for Drupal

Yesterday Drupal.org announced a new module that implements ReadSpeaker webReader for the Drupal CMS. We have not yet fully tested the module ourselves, but it is available on the Drupal website.

http://drupal.org/project/webreader

Posted in: webReader

ReadSpeaker webReader now in German!

Hey, the list keeps getting longer and longer! After US English, British English, French, Swedish, Italian, Spanish and Portuguese, this morning we launched the German version of ReadSpeaker webReader. Germany is one of the most active markets in terms of the number of websites and blogs, and we are very excited about giving German website owners and bloggers both male and female voices to speech-enable their content.

Posted in: webReader

The Official San Francisco Website, now talking to you!

City of San Francisco by night 

VoiceCorp has done it again! The official website of the City of San Francisco is one of the latest websites to make its content more accessible by adding the ReadSpeaker read-aloud text-to-speech service to its web pages. Most pages on the website now have a “Listen” button in the toolbar, right next to the “Print”, “Text Only” and “Font size” functions. Listen for yourself at http://www.sfgov.org/site/mayor_index.asp

Posted in: Customers

ReadSpeaker in the Press

If you have a minute, please read this great article from “Insurance & Technology” about one of our recent ReadSpeaker implementations. / Niclas

© 2012 ReadSpeaker Holding B.V. | www.readspeaker.com | Powered by WordPress