How text to speech is made

Listen to this page using ReadSpeaker

Following yesterday’s post about a brief history of text to speech, today we list some of the techniques involved in creating speech synthesis.

Articulatory synthesis

In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and the vocal cords are used to simulate how air flows through the vocal tract and to calculate what the resulting sound will be like. Finding good mathematical models is a great challenge, so articulatory synthesis is still at the research stage. The technique is very computation-intensive, but its memory requirements are almost nil.

Formant

Formant synthesis is a kind of source-filter method based on mathematical models of the human speech organs.
The vocal tract is modelled as a number of resonances that resemble the formants (frequency bands with high energy in voices) of natural speech.
The first electronic voices, Voder and later OVE and PAT, spoke with entirely synthetic, electronically produced sounds using formant synthesis. As with articulatory synthesis, the memory consumption is small but CPU usage is large.
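
The source-filter idea can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not how Voder, OVE or PAT worked internally: an impulse train stands in for the voice source, and a cascade of two-pole resonators stands in for the vocal tract, one resonator per formant. The formant frequencies and bandwidths are illustrative values for an "a"-like vowel, and all function names are assumptions of this sketch.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Pass a signal through one two-pole resonator (a single formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * freq_hz / sample_rate
    b1 = 2.0 * r * math.cos(theta)  # feedback weight on y[n-1]
    b2 = -r * r                     # feedback weight on y[n-2]
    a0 = 1.0 - b1 - b2              # scales the gain to 1 at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a0 * x + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(pitch_hz, duration_s, sample_rate):
    """The 'source': one pulse per pitch period, silence in between."""
    period = int(sample_rate / pitch_hz)
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(formants, pitch_hz=110.0, duration_s=0.2, sample_rate=16000):
    """The 'filter': run the source through one resonator per formant."""
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalise to the range -1..1

# Illustrative (frequency, bandwidth) pairs in Hz, assumed for this sketch.
VOWEL_A = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(VOWEL_A)
```

Nothing is recorded or stored here apart from a handful of numbers per vowel, which is why the memory footprint of this family of techniques is so small.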

Concatenative synthesis

A concatenative synthesis is built from recorded pieces of speech (sound clips) that are cut into units and joined together to form speech. Depending on the length of the sound clips used, it becomes either a diphone or a polyphonic synthesis. The latter, in a more developed version, is also called Unit Selection synthesis, where the synthesizer has access to both long and short segments of speech and the best segments for the actual context are chosen.

Diphone

For a diphone synthesis the elements taken from the recorded speech are very small.
The strength of this approach is that almost any sentence or expression can be read, but there are quite often errors in the pronunciation, and if the prosody model is poor, or the modelling is difficult, the speech may sound a bit monotonous.
Diphone synthesis doesn’t work that well for languages with many inconsistencies in their pronunciation rules (English, Swedish etc.) or in special cases where letters are pronounced differently from the general rule. It works better for languages with highly consistent pronunciation (Spanish, Finnish etc.). Another advantage is that the prosody, the intonation, can be described in great detail.
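
To make "very small elements" concrete, here is a minimal Python sketch of how a phoneme sequence maps onto diphone units, each of which covers the transition from one sound to the next. The phoneme symbols, the "_" silence marker and the function name are assumptions made for illustration only.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone units needed to speak a phoneme sequence."""
    # Pad with silence so the first and last sounds get a transition too.
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

units = to_diphones(["h", "e", "l", "o"])
print(units)  # ['_-h', 'h-e', 'e-l', 'l-o', 'o-_']
```

Because every unit is just a pair of neighbouring sounds, a language with N distinct sounds needs at most N × N recorded units, which keeps the database small and lets the voice read almost any text.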

Unit selection

The greatest difference between a Unit selection voice and a diphone voice is the length of the speech segments used. Entire words and phrases are stored in the unit database. This implies that the database for a Unit selection voice is many times bigger than for a diphone voice. Thus, the memory consumption is huge while the CPU consumption is low.

The most important issue is still to get a natural and smooth prosody. This is hard because the units contain both intonation and pronunciation, since entire phrases are used almost directly from the recorded data. Since the first Unit selection voice was released, over eight years ago, there have been many improvements with every new voice and release. This is by far the most widely used technique among our providers.
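
Choosing "the best segments for the actual context" is commonly framed as a search that balances how well each unit fits the wanted sound against how smoothly neighbouring units join. The Python sketch below illustrates that idea with a small dynamic-programming search. It is not any provider's actual algorithm: the `Unit` fields, the pitch-only cost functions and the unit names are all simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    name: str
    start_pitch: float  # Hz at the start of the recorded segment
    end_pitch: float    # Hz at the end of the recorded segment

def target_cost(unit, wanted_pitch):
    """How far the unit's average pitch is from the pitch we want."""
    return abs((unit.start_pitch + unit.end_pitch) / 2.0 - wanted_pitch)

def join_cost(left, right):
    """How audible the pitch jump is where two units meet."""
    return abs(left.end_pitch - right.start_pitch)

def select_units(candidates, wanted_pitches):
    """candidates[i] lists the database units that could fill slot i."""
    # best[i][j] = (cheapest total cost ending in candidates[i][j], back-pointer)
    best = [[(target_cost(u, wanted_pitches[0]), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, unit), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(unit, wanted_pitches[i]), prev))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total

candidates = [
    [Unit("hel#1", 120.0, 130.0), Unit("hel#2", 100.0, 105.0)],
    [Unit("lo#1", 180.0, 170.0), Unit("lo#2", 128.0, 110.0)],
]
path, total = select_units(candidates, [120.0, 120.0])
print([u.name for u in path], total)  # ['hel#1', 'lo#2'] 8.0
```

The search itself is cheap; the cost lies in storing enough recorded candidates for every slot, which matches the large memory and low CPU profile described above.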

HMM synthesis

A fairly new technology is speech synthesis based on HMMs (Hidden Markov Models), a mathematical concept. It is a statistical method where the text-to-speech system is based on a model that is not known beforehand but is refined by continuous training. The technique consumes a lot of CPU resources but very little memory. This approach seems to give better prosody, without glitches, while still producing very natural-sounding, human-like speech. We collaborate with providers offering this technique as well.

Customizations and improvements

On top of using the best voices available we also add our own layer of improvement, with both general and customer-specific customizations. We have linguists with long experience of speech synthesis working with transcriptions to tweak the pronunciation and reading of the spoken text. We can therefore greatly help customers who want to optimize the quality of the text to speech on their web pages. Sometimes a quality check of a couple of hours, in which we listen to your website and correct the errors we find, is enough. In other cases customers have industry-specific words (think of the pharmaceutical industry, for example) where it is very important that they are pronounced correctly.

One of the largest customizations we have made so far was for a customer who sent us a list of over 3,000 words that had to be quality controlled. Another was for a site with about 200,000 pages where the same acronym or abbreviation had to be expanded differently depending on which part of the site it appeared in. Many users wonder why the same voice reads so much better in our services than when the same voice, or text-to-speech system, is used to read similar, or the same, content with other software or services. The answer is the customizations mentioned above.
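
The site-dependent expansion of abbreviations can be pictured as a lookup keyed by where on the site the text appears. The following Python sketch shows one way to do that and is not a description of our production system: the rule table, the URL prefixes and the function names are all hypothetical.

```python
import re

# Hypothetical rules: URL path prefix -> {abbreviation: spoken form}.
RULES = {
    "/": {"Dr.": "Doctor"},
    "/maps/": {"Dr.": "Drive", "St.": "Street"},
    "/church/": {"St.": "Saint"},
}

def lexicon_for(path):
    """Merge every matching rule set; longer (more specific) prefixes win."""
    merged = {}
    for prefix in sorted(RULES, key=len):
        if path.startswith(prefix):
            merged.update(RULES[prefix])
    return merged

def expand(text, path):
    """Replace abbreviations in text using the rules for this part of the site."""
    lexicon = lexicon_for(path)
    if not lexicon:
        return text
    keys = sorted(lexicon, key=len, reverse=True)
    # Only match an abbreviation that is not glued to a preceding word character.
    pattern = re.compile(r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + ")")
    return pattern.sub(lambda m: lexicon[m.group(0)], text)

print(expand("Dr. Smith", "/staff/"))       # Doctor Smith
print(expand("12 Oak Dr.", "/maps/route"))  # 12 Oak Drive
print(expand("St. Mary", "/church/"))       # Saint Mary
```

The expanded text is what gets sent to the voice, so the same voice can read "Dr." correctly in both sections without any change to the text-to-speech engine itself.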

Thanks to Professor Hartmut Traunmüller, Dept. of Linguistics at the University of Stockholm for a lot of the facts, the picture and the sound samples on this page.

You can help improve the text to speech voices!

We are currently working very hard to fix a lot of pronunciation errors (words that are read in a strange/funny/wrong way) in the text-to-speech voices. No text-to-speech engine is perfect, so sometimes it needs to be “taught” how certain words should be pronounced.

Foreign words, brand names, technical expressions, abbreviations and personal names can be a bit tricky for the TTS to pronounce since they are usually exceptions that do not follow the “rules” of the language. All languages are made up of rules and exceptions, and the list of the latter is almost endless.

You can help!

If you come across mispronounced words that you would say belong to the “common” category, please let us know. You can use the Customer Service/Feedback menu on http://webreader.readspeaker.com

Posted in: TTS webReader

Speech-enabling for the long-tail

As you might remember, I wrote a post called From birth of the talking web and into the future and owed you a follow-up note, so here it is! As I discussed there, we started out with a focused view of which customers we should approach and which end users would benefit most from a server-side speech-enabling solution for web sites. On the user side we have seen the use of our technology increase over the past years, making it appealing to a greater number of users. On the customer side, we have also witnessed a greater variety of sectors interested in speech-enabling their web content, ranging from public sites to banks, insurance companies, non-profit organisations and many others.
 
Now, over the past months, another change has happened. We started getting an increasing number of incoming leads from much smaller web sites and blogs that are also interested in speech-enabling their content. This could range from the mom-and-pop store with a web site to the blogger interested in space technology. These are typically organisations of 1 to 10 people. Some of them are purely personal initiatives, i.e. someone interested in a hobby, while others might be freelancers, consultants, designers or any other small company or non-profit organisation. Since our company is set up to deal with mid-sized and bigger organisations, we needed to see how we could offer all these smaller web sites and blogs an easy way to speech-enable their content. The idea was to get a real grasp of the essential features that matter to this segment and not throw in all the bells and whistles that serve no purpose at all. Then we thought about how to make the implementation process as easy as possible, so that all these new small customers could integrate our solution as a no-brainer, either by using plug-ins we have developed for some popular CMS and blog platforms or by simply copying and pasting our HTML code directly into the source code of the page. The last point was to create a new web shop where personal web sites and blogs as well as small companies and organisations could easily choose the most suitable package for their needs, sign up and subscribe as seamlessly as possible.
 
We are now proud to announce that we are ready to launch this new venture! Our new product for this segment is called webReader and you can find out all about it by going to www.readspeaker.com. We hope you will enjoy this new service and find it useful and we will dedicate our maximum attention to support you in the best way possible. We are starting off with American and British English, Swedish and French voices and will be adding more very shortly.

Good article on text-to-speech

My colleague and co-blogger Daniel Erkstam has just published a good article about the history of text-to-speech technology. Click here to read the full article about the history of TTS.

Posted in: TTS

From birth of the talking web and into the future (part 1)

We have been in the business of speech-enabling web sites since 1999, when I had the idea of bringing text-to-speech into the arena of web sites. What was my motivation for doing so? I realised that a number of people around me had problems or felt uncomfortable reading text found on web sites. Sure, screen readers were already around and TTS had been built into operating systems, but these options were simply not used by these users, whom I questioned closely about how they would like web sites to function. On the other hand, I thought that for a web site owner it would be useful to give users easy and free access to an audio version of the content without having to worry about developing, installing, maintaining and updating this themselves. The combination of those two findings gave birth to ReadSpeaker, which was commercially launched in Sweden back in 2001.

At the beginning I had a very focused idea of which web site owners this would appeal to. I started by approaching the public sector as well as web sites aimed at disability groups. The end users I had in mind were mainly people with dyslexia and various other reading disabilities. Then a strange thing happened. I started getting feedback from users I had not even thought would use ReadSpeaker. These were senior citizens who appreciated the comfort of having the choice between reading and listening to the text content of web sites. These were foreigners living in Sweden who liked to be able to listen to Swedish instead of reading it. These were students who could listen to lessons by saving the mp3 file to their mobile devices. These were “information workers” who in their fast-paced environments needed to listen to web content while taking care of other tasks at the same time. These were… well, you get the idea: the circle of users kept, and keeps, getting bigger and bigger. This trend also had an effect on the customers that we started approaching and that were increasingly contacting us. From the narrower group of public and disability web sites, we started implementing ReadSpeaker in a greater variety of areas such as the banking sector, insurance companies, transport organisations, online media, etc.

What happened next was very interesting, but more on that in another (soon to come) post :-)

Posted in: TTS

Google Knol – now with text-to-speech

A few days ago Google announced that they are beginning to experiment with text-to-speech on their “Knol”.

Quote from their site: “We are experimenting with Audio Playback as an option for some knols, starting with a handful of English language featured knols. You can listen using our Flash player, or by downloading an mp3 file and using any mp3 player.”

If a listen button is shown next to the “print” and “share” buttons, you know that the Knol is also available as audio.

Read all about it and try it out here: http://knol.google.com/k/knol-help/knol-audio-playback/

Posted in: TTS

Non-Latin support for ReadSpeaker Enterprise

In August we are releasing support for Chinese and Arabic on the ReadSpeaker Enterprise Services platform. Please contact info@voice-corp.com or stay tuned to this blog to hear more about it.

Posted in: TTS

Guest blogger: Speech syntheses – one for each purpose

This is a post from today’s guest blogger: Daniel Erkstam, Nordic Sales Director for VoiceCorp.

 Two robots

The picture shows two robots. The left one is an industrial robot from ABB that is probably used to build cars or something similar. The right one is one of the most advanced AI robots that can be found today. It is possible to converse with it and it is very human-like.

Both robots serve their purpose and do it well. And it is the same with speech syntheses.

When we launched the first speaking web services back in 2001, the only available voices were very robotic and became rather boring to listen to on longer texts. Today we use voices made with a completely different technique, and the quality is getting closer and closer to recorded speech.

But the thing is that the older voices are still used by a lot of people and are even preferred to the newer ones for some purposes. For example, people with visual impairment often prefer the older voices for screen-reading software like Jaws. The reason is that the older voices are more consistent in how they read the text, and you can get used to the odd and robotic character of the voice. The older voices also read out the text in a more detailed way. The voices we use today are a lot more human-like but also more “forgiving” when it comes to spelling errors, words from foreign languages etc. The secret behind that is a phoneme database that is many times bigger.

We know that the less a person needs synthetic speech, the harsher a judge he/she will be. Those of us who don’t have reading difficulties or visual impairment can read the text and compare it to the voice speaking. Then we react to every slight error in the pronunciation by the synthesis.

We put a lot of effort into making the reading as good as possible, making many customizations so that the speech synthesis pronounces the vocabulary of the website in question as well as possible, because we know that there is a strong connection between how good it sounds and how many people will use the service.

Back to the robots again: They might both serve their purposes well. But I guess it would be an easy choice which one you would pick to serve visitors at the reception desk, right?

Posted in: TTS

SpeechMachine text-to-speech in Viral Marketing

A sausage says more than a thousand words.

Scan, one of the leading Swedish brands, just launched a really great viral marketing campaign using VoiceCorp’s SpeechMachine solution. The idea for the campaign is quite cool. The campaign markets Scan’s new line of spicy sausages. They wanted to add some nice interaction with the user, so they added the strongest medium around. Speech.

The core functionality is that users can send “speech-cards” to each other. They enter the text, listen to check that it sounds good and send the speech card to a friend.

The cool thing is that we used a Spanish voice but with Swedish speech rules. The result is a Spanish guy speaking Swedish. It’s brilliant! It really sounds like a guy from Spain who has only lived a few years in Sweden. Enough time to learn the language but keeping a strong Spanish accent. The speech solution itself was delivered in just a couple of hours thanks to SpeechMachine’s ability to integrate with all the TTS engines on the market.

SpeechMachine is provided by VoiceCorp as a 100% hosted service that allows creative web developers to easily add text-to-speech functionality to their web apps without requiring any knowledge of text-to-speech technology. The communication between the customer’s web-based app and SpeechMachine is based on standard HTTP requests, and it is therefore really easy to integrate into any web app.
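
To give a feel for what "standard HTTP requests" means in practice, the Python sketch below builds the kind of URL a web app might request to get a piece of text spoken. It does not document the real SpeechMachine interface: the endpoint, the parameter names (`text`, `voice`, `format`) and the voice identifier are placeholders invented for this example.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
SPEECH_ENDPOINT = "https://speech.example.com/synthesize"

def build_speech_url(text, voice, audio_format="mp3"):
    """Return a GET URL asking the hosted service to speak the given text."""
    query = urlencode({"text": text, "voice": voice, "format": audio_format})
    return f"{SPEECH_ENDPOINT}?{query}"

url = build_speech_url("Hej, vill du ha en korv?", voice="es-male")
print(url)
```

A web app would then fetch that URL with any ordinary HTTP client, or simply hand it to an audio player, and receive audio back. No speech technology needs to live in the app itself.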

Want to try out the app? Go to http://www.scan.se/kryddigakorvar/

Posted in: TTS Viral Marketing

Podcasting made simple! rSpeak VocalFruits

VoiceCorp announced today, together with VocalFruits, that they are launching the rSpeak VocalFruits Information Composing System.

It is a “Web 2.0” web application where anyone can create a podcast from any RSS source and where content owners such as bloggers can offer their audience a speaking version of their content!

The blog posts? Bang! Right into iTunes.

rSpeak VocalFruits will basically replace AudioFeed (www.audiofeedcreator.com), a not very social, but very appreciated free web service that I created about a year and a half ago.

With the new web-based podcasting service, any registered user can create a podcast from any RSS feed in no time! There are also a couple of really cool features, like aggregating a number of RSS feeds into one podcast, or why not create a personal podcast that you can update (by adding posts) just by emailing your personal vocalfruits email address.

In addition to the podcast RSS feed it also creates a web browser version and a mobile version ideal for mobile devices such as mobile phones and PDAs.

In this release there will be support for US English, French and Spanish. Support for more languages (Swedish, Dutch, German, UK English and Portuguese) will be available within a month or two from what I heard. Also it will be changed so that you do not need to be a registered user to be able to listen… Anybody should be able to listen. That is key.

Check it out at www.vocalfruits.com. Stay tuned!

Posted in: TTS
© 2012 ReadSpeaker Holding B.V. | www.readspeaker.com | Powered by WordPress