Speech-enabling services for all

This post is a follow-up to our views on accessibility.
Device Independence
When developing our web-based text-to-speech products, we pay special attention to concepts such as cross-platform support and device independence. Supporting as broad a spectrum of devices, software versions and operating systems as possible gives users the greatest freedom and flexibility to choose what suits them best and to use our services in the most advantageous ways. Some of our users are extremely advanced and up to date, using the latest technology with a fast computer and a large screen. They may even use our services for purposes we didn’t have in mind when we developed our solutions.
Others use our services from mobile devices with small screens or low bandwidth. In fact, they share much of their user experience and conditions with visually impaired users who rely on magnification software or braille displays and can therefore only take in small amounts of information at a time. Low bandwidth can also be the result of living in places where Internet connections are not yet that fast. Our goal is for the same product to be usable and to provide good service everywhere, from game consoles, mobile devices and text-based terminals to state-of-the-art desktop computers running the latest of everything, even software that didn’t exist when our development took place. Forward compatibility is as important as backward compatibility.
The ReadSpeaker point of view on accessibility

Introduction
This post and subsequent ones will describe the accessibility policy applied to web services developed by ReadSpeaker. It covers our definition of accessibility and gives examples of how our web-based products comply with this view. It also discusses how our view conforms to our interpretation of existing international accessibility standards, with emphasis on web accessibility standards, especially those from the W3C. In short, concepts like “graceful transformation”, “open standards” and “device/platform independence” are of key importance to us. Another important principle is universal design, which means that we should not assume anything about the end users of our products, or about the situations or circumstances in which they use them.
How text to speech is made

Following yesterday’s post about a brief history of text to speech, today we list some of the techniques involved in creating speech synthesis.
Articulatory synthesis
In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and the vocal cords are used to simulate how air flows through them and to calculate what the resulting sound will be. Finding good mathematical models is a great challenge, so articulatory synthesis is still at the research stage. The technique is very computation-intensive, but its memory requirements are almost negligible.
Formant
Formant synthesis is a kind of source-filter method based on mathematical models of the human speech organs.
The vocal tract is modelled by a number of resonances that resemble the formants (frequency bands with high energy in voices) of natural speech.
The first electronic voices, Voder and later OVE and PAT, spoke with entirely synthetic, electronically produced sounds using formant synthesis. As with articulatory synthesis, the memory consumption is small but CPU usage is large.
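The source-filter idea can be sketched in a few lines of Python. This is only an illustration, not how any of the voices mentioned here were built: an impulse train stands in for the voice source, and each formant is a simple two-pole resonator applied in cascade. The function names, the formant frequencies and the bandwidths are all assumptions chosen for the example.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator centred on freq_hz (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a = 1.0 - b - c  # normalises the gain to 1 at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz, duration_s, sample_rate):
    """Very crude voice source: one impulse per pitch period."""
    period = int(sample_rate / pitch_hz)
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.2, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Rough, assumed formant values for an "a"-like vowel (illustrative only).
VOWEL_A = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(VOWEL_A)
```

Changing the formant list changes the vowel quality, while the pitch parameter changes the perceived tone of the voice. Nothing is recorded or stored, which is why memory use stays small.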
Concatenative synthesis
Concatenative synthesis is built from recorded pieces of speech (sound clips) that are split into units and then joined together to form speech. Depending on the length of the sound clips used, the result is a diphone or a polyphone synthesis. The latter, in a more developed version, is also called Unit Selection synthesis: the synthesizer has access to both long and short segments of speech, and the best segments for the actual context are chosen.
Diphone
In diphone synthesis the elements taken from the recorded speech are very small.
The strength of this approach is that almost any sentence or expression can be read. However, pronunciation errors are quite common, and if the prosody model is poor, or the prosody is difficult to model, the speech may sound rather monotonous.
Diphone synthesis doesn’t work that well in languages with many inconsistencies in the pronunciation rules (English, Swedish etc.) or in special cases where letters are pronounced differently than usual. It works better for languages with very consistent pronunciation (Spanish, Finnish etc.). Another advantage is that the prosody, the intonation, can be described in great detail.
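To make the idea of small units concrete, here is a minimal Python sketch of how a phoneme sequence can be turned into the list of diphones a synthesizer would have to look up. A diphone covers the transition from one sound to the next. The function name, the silence marker and the phoneme spelling are assumptions made for the example, not part of any particular product.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone names needed to speak the given phoneme sequence."""
    # Pad with silence so the first and last sounds get a transition too.
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["h", "e", "l", "o"]))
# prints ['_-h', 'h-e', 'e-l', 'l-o', 'o-_']
```

Because every pair of neighbouring sounds maps to one small recorded unit, a fairly small inventory is enough to read almost any text, at the price of many joins in the output.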
Unit selection
The greatest difference between a Unit selection voice and a diphone voice is the length of the speech segments used. Entire words and phrases are stored in the unit database. This means that the database for a Unit selection voice is many times bigger than for a diphone voice. Thus, the memory consumption is huge while the CPU consumption is low.
The most important issue is still to get a natural and smooth prosody. This is hard because the units carry both intonation and pronunciation, since entire phrases are used almost directly from the recorded data. Since the first Unit selection voice was released, over eight years ago, each new voice and each new release has brought many improvements. This is by far the most widely used technique among our providers.
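Choosing the best segments for the context is often described as a search that balances how well each unit fits what should be said (a target cost) against how smoothly neighbouring units join (a join cost). The following Python sketch shows that idea with a small dynamic-programming search. It is an assumption-laden illustration, not the algorithm of any specific provider: the unit representation (name, pitch), the two cost functions and all the numbers are made up for the example.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target, minimising total target + join cost."""
    if not targets:
        return []
    # best[i][j] = (cheapest total cost ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev = [
                best[i - 1][k][0] + join_cost(p, c)
                for k, p in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(prev)), key=prev.__getitem__)
            row.append((prev[k_best] + target_cost(targets[i], c), k_best))
        best.append(row)
    # Walk the back-pointers from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path


# Hypothetical data: each unit is (name, pitch in Hz); targets are wanted pitches.
targets = [100, 110, 120]
candidates = [
    [("a1", 100), ("a2", 130)],
    [("b1", 90), ("b2", 112)],
    [("c1", 121), ("c2", 150)],
]
chosen = select_units(
    targets,
    candidates,
    target_cost=lambda wanted, unit: abs(wanted - unit[1]),
    join_cost=lambda left, right: abs(left[1] - right[1]),
)
print([name for name, _ in chosen])
# prints ['a1', 'b2', 'c1']
```

The search itself is cheap compared with generating sound from a model, but it only works well when the database holds many candidates per unit, which matches the memory and CPU trade-off described above.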
HMM synthesis
A fairly new technology is speech synthesis based on Hidden Markov models (HMMs), a mathematical concept. It is a statistical method in which the text-to-speech system is built on a model that is not known beforehand but is refined through continuous training. The technique consumes considerable CPU resources but very little memory. This approach seems to give better prosody, without glitches, while still producing very natural-sounding, human-like speech. We collaborate with providers offering this technique as well.
Customizations and improvements
On top of using the best voices available, we also add our own layer of improvements, both general and customer-specific customizations. We have linguists with long experience of speech synthesis who work with transcriptions to tweak the pronunciation and reading of the spoken text. We can therefore greatly help customers who want to optimize the quality of the text to speech on their web pages. Sometimes a quality check of a couple of hours, listening to your website and correcting the errors we find, is enough. In other cases, customers have industry-specific words (think of the pharmaceutical industry, for example) that must be pronounced correctly.
One of the largest customizations we have made so far was for a customer who sent us a list of over 3000 words that had to be quality controlled. Another was for a site with about 200 000 pages where the same acronym or abbreviation had to be expanded differently depending on which part of the site it appeared in. Many users wonder why the same voice reads so much better in our services than when the same voice, or text-to-speech system, is used to read similar, or the same, content with other software or services. The answer is the customizations described above.
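As an illustration of the second kind of customization, the Python sketch below expands the same abbreviation differently depending on the section of the site a page belongs to. It is a simplified assumption of how such a rule could look, not a description of our production system: the paths, the lexicon entries and the function names are all hypothetical.

```python
import re

# Hypothetical per-section lexicons: URL path prefix -> {abbreviation: spoken form}.
SECTION_LEXICONS = {
    "/medicine/": {"MS": "multiple sclerosis"},
    "/education/": {"MS": "Master of Science"},
}
# Fallback used when no section-specific entry applies (also hypothetical).
DEFAULT_LEXICON = {"MS": "M S"}


def lexicon_for(path):
    """Merge the default lexicon with the longest matching section lexicon."""
    matches = [prefix for prefix in SECTION_LEXICONS if path.startswith(prefix)]
    if not matches:
        return dict(DEFAULT_LEXICON)
    return {**DEFAULT_LEXICON, **SECTION_LEXICONS[max(matches, key=len)]}


def expand_abbreviations(text, path):
    """Replace whole-word abbreviations with the spoken form for this section."""
    lexicon = lexicon_for(path)
    keys = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda match: lexicon[match.group(1)], text)


print(expand_abbreviations("New MS study", "/medicine/news.html"))
# prints New multiple sclerosis study
print(expand_abbreviations("New MS study", "/education/courses.html"))
# prints New Master of Science study
```

Keeping the rules in data rather than in code means a linguist can add or correct entries for one section without affecting how the rest of the site is read.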
Thanks to Professor Hartmut Traunmüller, Dept. of Linguistics at the University of Stockholm for a lot of the facts, the picture and the sound samples on this page.
A brief history of speech synthesis (text to speech)

Behind all our services there is server-based software that performs the speech synthesis, called text-to-speech software. The voices we use come from different providers, but the techniques behind the different voices have many similarities. We would therefore like to tell you briefly about the development of speech synthesis and its history.
A new easy way to speech-enable Flash applications

We have developed a new way to implement ReadSpeaker® speechMachine™, which should be of great interest to potential customers.
By simply adding our API package to your project, existing or new, you can have your Flash application talking in just three lines of code. The instructions are easy to follow and get you up and running in a matter of minutes.
The API is built so that you can make use of the powerful ReadSpeaker® speechMachine™ service without getting your hands dirty with extensive ActionScript programming. If you require more control over the sound process, you can have the API communicate with your project through callbacks.








