Pronunciation corrections in online text to speech

Listen to this page using ReadSpeaker

Moving World Artwork showing acronyms, Heathrow Terminal 5, London.

Although text-to-speech technology has made a lot of progress, it can still stumble on certain terms, such as acronyms, abbreviations, date formats or number representations, to name a few. Some of our customers need the pronunciation fine-tuned, for example in the pharmaceutical sector, where it is especially important that every term is read correctly.

Every account that we open comes with a dictionary specific to that customer. We provide a service that helps each of our customers with pronunciation issues when they arise. Some pronunciation corrections are only relevant to the dictionary of a particular customer, but in some instances the corrections can also be added to the default dictionary and benefit our entire customer base. We have a very knowledgeable network of linguists who can help our customers in many parts of the world when they encounter pronunciation difficulties.
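As a rough illustration of how a customer dictionary can sit on top of a default dictionary, here is a minimal Python sketch. It is not our production code: the function name `apply_corrections`, the dictionary contents and the respellings are all made up for the example.

```python
import re

# Hypothetical entries, invented for this example only.
DEFAULT_DICTIONARY = {"T5": "Terminal five", "Dr.": "Doctor"}
CUSTOMER_DICTIONARY = {"mg": "milligrams", "Dr.": "Drive"}


def apply_corrections(text, customer_dict, default_dict):
    """Replace known terms, letting customer entries override default ones."""
    merged = {**default_dict, **customer_dict}
    if not merged:
        return text
    # Longest terms first, so a longer entry wins over a shorter one.
    terms = sorted(merged, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"
    )
    # Each match is swapped for its respelling before the text is synthesized.
    return pattern.sub(lambda m: merged[m.group(1)], text)


print(apply_corrections("Take 5 mg at T5", CUSTOMER_DICTIONARY, DEFAULT_DICTIONARY))
```

In this sketch a correction that proves useful for everyone would simply be moved from the customer dictionary to the default one.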

We have prepared a few online text to speech demos that show the before and after effect of our work on some types of words that can get mispronounced by speech synthesis.

Photo Credit: Jim Linwood

How text to speech is made

Following yesterday’s post about a brief history of text to speech, today we list some of the techniques involved in creating speech synthesis.

Articulatory synthesis

In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and vocal cords are used to simulate how air flows through them and to calculate what the resulting sound will be. It is a great challenge to find good mathematical models, so articulatory synthesis is still at the research stage. The technique is very computation-intensive, but its memory requirements are almost nil.

Formant

Formant synthesis is a kind of source-filter method based on mathematical models of the human speech organs.
The vocal tract is modelled as a number of resonances resembling the formants (frequency bands with high energy in voices) of natural speech.
The first electronic voices, Voder and later OVE and PAT, spoke with entirely synthetic, electronically produced sounds using formant synthesis. As with articulatory synthesis, memory consumption is small but CPU usage is large.
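To make the source-filter idea concrete, here is a minimal Python sketch. It assumes an impulse train as the source and a cascade of standard two-pole digital resonators as the filter. The formant frequencies and bandwidths are rough, illustrative values for an "ah"-like vowel, and the function names are our own; this is not how Voder, OVE or PAT worked internally.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a two-pole digital resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one resonance: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.5, sample_rate=16000):
    """Source: an impulse train. Filter: one resonator per formant, in cascade."""
    n = int(duration_s * sample_rate)
    period = int(sample_rate / pitch_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative (frequency, bandwidth) pairs in Hz, not measured data.
samples = synthesize_vowel([(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)])
```

Nothing is stored except a handful of numbers per formant, which is why memory use is so small; every output sample, on the other hand, has to be computed.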

Concatenative synthesis

Concatenative synthesis is built from recorded pieces of speech (sound clips) that are cut into units and then joined to form speech. Depending on the length of the sound clips used, it is called a diphone or a polyphonic synthesis. The latter, in a more developed version, is also called unit selection synthesis, where the synthesizer has access to both long and short segments of speech and chooses the best segments for the context at hand.

Diphone

In diphone synthesis the elements taken from the recorded speech are very small.
The strength of this approach is that almost any sentence or expression can be read. However, pronunciation errors are quite common, and if the prosody model is poor, or the modelling is difficult, the speech may sound rather monotonous.
Diphone synthesis does not work well for languages whose pronunciation rules are highly inconsistent (English, Swedish, etc.), or in special cases where letters are pronounced differently than usual. It works better for languages with largely consistent pronunciation (Spanish, Finnish, etc.). Another advantage is that the prosody, the intonation, can be described in great detail.
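A diphone is a unit that runs from the middle of one sound to the middle of the next. As a small illustration, the Python sketch below turns a sequence of phonemes into the list of diphones that would have to be fetched from the recordings. The function name, the `_` symbol for silence and the simplified phoneme labels are assumptions made for the example.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into the diphone units that span it."""
    # Pad with silence so the first and last sounds get a transition too.
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Simplified, made-up phoneme labels for a short word.
print(to_diphones(["h", "e", "l", "o"]))
```

Because every pair of neighbouring sounds is covered by a small unit, almost any text can be read, but the sketch also shows why a wrong phoneme sequence for an irregularly spelled word leads straight to a pronunciation error.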

Unit selection

The greatest difference between a unit selection voice and a diphone voice is the length of the speech segments used. Entire words and phrases are stored in the unit database. This means that the database for a unit selection voice is many times bigger than for a diphone voice. Thus, memory consumption is huge while CPU consumption is low.

The most important issue is still to get natural and smooth prosody. This is hard because the units carry both intonation and pronunciation, since entire phrases are used almost directly from the recorded data. Since the first unit selection voice was released, over eight years ago, each new voice and release has brought many improvements. This is by far the most widely used technique among our providers.
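Choosing "the best segments" is often described as a search that balances how well each unit fits what is wanted (a target cost) against how smoothly neighbouring units join (a join cost). The Python sketch below shows that idea with dynamic programming. It is only an illustration: the cost functions, the unit names and the pitch values are invented, and our providers' engines may well work differently.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target so that total target + join cost is minimal."""
    # best[i][j] = (cheapest total cost ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], c), prev))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


# Hypothetical data: wanted pitch per position, and (name, pitch) per recorded unit.
targets = [100, 110]
candidates = [[("a1", 100), ("a2", 140)], [("b1", 150), ("b2", 112)]]
chosen = select_units(
    targets,
    candidates,
    target_cost=lambda wanted, unit: abs(wanted - unit[1]),
    join_cost=lambda left, right: abs(left[1] - right[1]),
)
print([name for name, _ in chosen])
```

The search itself is cheap compared with computing every sample from scratch; the cost lies in keeping a large database of candidates available.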

HMM synthesis

A fairly new technology is speech synthesis based on HMMs, a mathematical concept called hidden Markov models. It is a statistical method in which the text-to-speech system is based on a model that is not known beforehand but is refined by continuous training. The technique consumes a lot of CPU resources but very little memory. This approach seems to give better prosody, without glitches, while still producing very natural-sounding, human-like speech. We collaborate with providers offering this technique as well.
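As a heavily simplified illustration of generating speech parameters from a statistical model rather than from stored recordings, the Python sketch below walks through a left-to-right chain of states. Each state has a mean pitch value and a probability of staying in the state, and the expected time spent in a state is 1 / (1 - that probability). The numbers are invented, in a real system they would be learned from training data, and real HMM synthesizers are considerably more elaborate.

```python
def expected_duration(self_transition_prob):
    """Expected number of frames spent in a state before leaving it."""
    return 1.0 / (1.0 - self_transition_prob)


def generate_trajectory(states):
    """Emit each state's mean value for its expected duration, in order."""
    trajectory = []
    for mean, self_prob in states:
        frames = max(1, round(expected_duration(self_prob)))
        trajectory.extend([mean] * frames)
    return trajectory


# Hypothetical model: (mean pitch in Hz, self-transition probability) per state.
states = [(110.0, 0.5), (130.0, 0.75), (100.0, 0.5)]
pitch = generate_trajectory(states)
print(pitch)
```

The whole "voice" in this sketch is three pairs of numbers, which hints at why memory use is so low while the work is shifted to computation.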

Customizations and improvements

On top of using the best voices available, we also add our own layer of improvement, with both general and customer-specific customizations. We have linguists with long experience of speech synthesis who work with transcriptions to tweak the pronunciation and reading of the spoken text. This means we can greatly help customers who want to optimize the quality of the text to speech on their web pages. Sometimes it is enough to spend a couple of hours on quality control, listening to your website and correcting the errors we find. In other cases customers have industry-specific words (think of the pharmaceutical industry, for example) that must be pronounced correctly.

One of the largest customizations we have made so far was for a customer who sent us a list of over 3000 words that had to be quality controlled. Another was for a site with about 200,000 pages, where the same acronym or abbreviation had to be expanded differently depending on the part of the site in which it appeared. Many users wonder why the same voice reads so much better in our services than when the same voice, or text-to-speech system, is used to read similar, or the same, content with other software or services. The answer is the customizations mentioned above.
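Expanding the same acronym differently in different parts of a site can be pictured as a set of rules keyed by URL path. The Python sketch below is a made-up example of that idea, not the customization we actually delivered: the paths, the acronym and the function name `expand_for_page` are all hypothetical.

```python
import re

# Hypothetical rules: path prefix -> expansions that apply under it.
SECTION_RULES = {
    "/health/": {"MS": "multiple sclerosis"},
    "/software/": {"MS": "Microsoft"},
}


def expand_for_page(text, path, section_rules):
    """Expand acronyms using the rules of the longest matching path prefix."""
    matches = [prefix for prefix in section_rules if path.startswith(prefix)]
    if not matches:
        return text
    rules = section_rules[max(matches, key=len)]
    for acronym, expansion in rules.items():
        # Match the acronym only as a whole word.
        text = re.sub(r"\b" + re.escape(acronym) + r"\b", lambda m: expansion, text)
    return text


print(expand_for_page("Living with MS", "/health/ms.html", SECTION_RULES))
print(expand_for_page("Living with MS", "/software/ms.html", SECTION_RULES))
```

Pages outside the listed sections are left untouched in this sketch, so the voice's default reading applies there.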

Thanks to Professor Hartmut Traunmüller, Dept. of Linguistics at the University of Stockholm for a lot of the facts, the picture and the sound samples on this page.

A brief history of speech synthesis (text to speech)

Behind all our services there is server-based software performing the speech synthesis, called text-to-speech software. The voices we use come from different providers, but the techniques behind them have many similarities. We would therefore like to tell you briefly about the development of speech synthesis and its history.

Posted in: Speech synthesis

Speech-enabling for the long-tail

As you might remember, I wrote a post called From birth of the talking web and into the future, and I owed you a follow-up note, so here it is! As I discussed then, we started out with a focused approach to which customers we should target and which end users would benefit most from a server-side speech-enabling solution for web sites. On the user side, we have seen the use of our technology increase over the past years, making it appealing to a greater number of users. On the customer side, we have also witnessed a greater variety of sectors interested in speech-enabling their web content, ranging from public sites to banks, insurance companies, non-profit organisations and many others.
 
Over the past months another change has happened. We started getting an increasing number of incoming leads from much smaller web sites and blogs that are also interested in speech-enabling their content. These range from the mom-and-pop store with a web site to the blogger interested in space technology, and are typically organisations of 1 to 10 people. Some are purely personal initiatives, i.e. someone pursuing a hobby, while others are freelancers, consultants, designers or other small companies and non-profit organisations. Since our company is set up to deal with mid-sized and bigger organisations, we needed to find an easy way for all these smaller web sites and blogs to speech-enable their content. The idea was to focus on the essential features that matter to this segment and leave out the bells and whistles that serve no purpose for it. Then we thought about how to make the implementation as easy as possible, so that these new small customers can integrate our solution either by using plug-ins we have developed for some popular CMS and blog platforms or by simply copying and pasting our HTML code directly into the source code of the page. The last point was to create a new web shop where personal web sites and blogs, as well as small companies and organisations, can easily choose the most suitable package for their needs, sign up and subscribe as seamlessly as possible.
 
We are now proud to announce that we are ready to launch this new venture! Our new product for this segment is called webReader and you can find out all about it by going to www.readspeaker.com. We hope you will enjoy this new service and find it useful and we will dedicate our maximum attention to support you in the best way possible. We are starting off with American and British English, Swedish and French voices and will be adding more very shortly.

Good article on text-to-speech

My colleague and co-blogger Daniel Erkstam has just published a good article about the history of text-to-speech technology. Click here to read the full article about the history of TTS.

Posted in: TTS

Speech enabling for the masses!

In the past couple of months we have been getting an increasing number of requests from smaller personal or business web sites and blogs interested in speech-enabling their web content using the award-winning ReadSpeaker text-to-speech services. Until now we simply could not dedicate enough time to presenting and selling our applications to them. To meet their needs, we have decided to open a dedicated web shop in the coming days and sell our new application, called webReader, at affordable monthly or yearly rates, or as a free ad-financed version.

Please stay tuned, as we will announce the launch on the blog soon.

If you want to be contacted as soon as webReader is available, please register at http://www.rspeak.com/wr_signup/

Google Knol – now with text-to-speech

A few days ago Google announced that they are beginning to experiment with text-to-speech on their “Knol”.

Quote from their site: “We are experimenting with Audio Playback as an option for some knols, starting with a handful of English language featured knols. You can listen using our Flash player, or by downloading an mp3 file and using any mp3 player.”

If a listen button is shown next to the “print” and “share” buttons, you know that the knol is also available as audio.

Read all about it and try it out here: http://knol.google.com/k/knol-help/knol-audio-playback/

Posted in: TTS

Listen function as Universal design

The other day I was standing in a hotel bar watching TV. The volume was turned down completely, but thanks to the real-time captioning I was able to follow the news broadcast. The day after, I spent some hours at Heathrow airport waiting for my delayed flight to get ready for departure. There was a TV in the waiting area, again with the volume turned down. This time there was no captioning. However, there was a sign-language interpreter in the bottom right corner of the screen. That didn’t help me much, since I can’t understand sign language. I was experiencing “situational disability”. In this case, text would have helped everybody who could read.

Now, what about audio? There are a great many reasons why an audio version of text is as universal as a text version of audio. Take reading a news article as an example. It is fairly difficult (not to say dangerous) to read today’s edition of the International Herald Tribune while driving a car. Text simply doesn’t do very well in that situation. Reading it on a small mobile display is not the best way to consume the article either. If you have some kind of disability that makes it difficult to read ANY text, you are in about the same situation. The fact that we want to consume written text in a situation where that is not possible (or convenient) in a sense makes us all disabled. It is the situation that creates the handicap, not necessarily our abilities.

Many people are helped by a speech function integrated into a website. I would dare to say that being able to listen to a web page is Universal Design.

In recent years more and more websites have subscribed to our ReadSpeaker services, which speech-enable websites for anyone who would rather listen than read. We are currently working very hard to make the services more usable in any kind of situation, regardless of what device you happen to use. It is a question of both usability and mobile user experience. ReadSpeaker is in itself completely device independent, since it is a server-side service, and we are now finalizing our new implementation instructions, which will ensure that it works on any computer, handheld, mobile phone or other device that could possibly have a web browser installed. The number of people using a mobile phone to browse the Internet is increasing dramatically, and analysts expect that within the next 2-3 years almost 3 billion people will have web access through their mobile phones. It is time to get ready for this: first by creating websites that work on all these devices and then, since we will probably not see any 17-inch displays on them, by speech-enabling the sites. For everybody who would rather listen than read.

The Official San Francisco Website, now talking to you!

City of San Francisco by night 

VoiceCorp has done it again! The official website for the City of San Francisco is one of the latest web sites to make its content more accessible by adding the ReadSpeaker read-aloud text-to-speech service to its web pages. Most of the pages on the website now have a “Listen” button in the tool bar right next to the “Print”, “Text Only” and “Font size” functions. Listen for yourself at http://www.sfgov.org/site/mayor_index.asp

Posted in: Customers

Guest blogger: Speech syntheses – one for each purpose

This is a post from today’s guest blogger: Daniel Erkstam, Nordic Sales Director for VoiceCorp.

 Two robots

The picture shows two robots. The left one is an industrial robot from ABB that is probably used to build cars or something similar. The right one is one of the most advanced AI robots that can be found today. It is possible to converse with it, and it is very human-like.

Both robots serve their purpose and do it well. And it is the same with speech syntheses.

When we launched the first speaking web services back in 2001, the only available voices were very robotic and became rather boring to listen to on longer texts. Today we use voices made with a completely different technique, and the quality is getting closer and closer to recorded speech.

But the older voices are still used by a lot of people and are even preferred over the newer ones for some purposes. For example, people with visual impairments often prefer the older voices for screen-reading software such as JAWS. The reason is that the older voices are more consistent in how they read the text, and you can get used to the odd, robotic character of the voice. The older voices also read out the text in a more detailed way. The voices we use today are a lot more human-like but also more “forgiving” when it comes to spelling errors, words from foreign languages and so on. The secret behind that is a database of phonemes that is many times bigger.

We know that the less a person needs synthetic speech, the harsher a judge he or she will be. Those of us who don’t have reading difficulties or visual impairments can read the text and compare it with the voice speaking. Then we react to every slight error in the synthesizer’s pronunciation.

We put a lot of effort into making the reading as good as possible, with many customizations so that the speech synthesis pronounces each website’s vocabulary as well as possible. We know that there is a strong connection between how good it sounds and how many people will use the service.

Back to the robots again: They might both serve their purposes well. But I guess it would be an easy choice which one you would pick to serve visitors at the reception desk, right?

Posted in: TTS
© 2012 ReadSpeaker Holding B.V. | www.readspeaker.com | Powered by WordPress