Convert Text to Voice in your Skype Bot

Wouldn’t it be cool if your bot could read the news to you every morning?

Or maybe your bot could read a bedtime story out loud or recite your favorite movie quotes to you?

There are an infinite number of use cases for text-to-speech (TTS) technology in bots, and here we will show you how you can do it in Skype.

Introduction

In the previous blog post, we showed you how to build a Skype bot using the Real-Time Media Platform. In this second post, we will teach you how to use TTS (text-to-speech) technology in your Skype bots. At the end of this post, you will be able to build automated Skype bots which synthesize speech from text.

First, a few core concepts…

TTS (text-to-speech, also known as speech synthesis) is the artificial production of human speech. Essentially, when you supply a character string to the TTS engine, you get back an audio stream of the spoken text generated using a specific voice font (yeah, you read that right – voice font). Voice fonts can be configured for specific locales, or to use male or female voices and different intonation, pitch and speed.

For this example, we will generate the audio buffers needed by the Real-Time Media Platform using two different TTS engines: the local TTS engine on the Azure Virtual Machine running your cloud service, and the Bing Text to Speech engine from Microsoft Cognitive Services.

Try me!

Want to take a test drive? Just open Skype and add the bot from this sample by clicking on this link! Simply start an audio or video call to hear the bot read text from one of our favorite movies!

Code Repository

You can get the complete solution for this and other blog posts in git:

> git clone https://github.com/Microsoft/skype-dev-bots.git

Using the local TTS engine

As we mentioned, for this example we will generate the audio streams using two different TTS engines.

To begin, we are going to use the Speech Synthesis APIs. These APIs make use of the local TTS engine installed on the machine running the application. This can be your local machine or the Azure VM you will be deploying the bot to. The advantage of this approach is that there is no need for additional network requests. The disadvantage is that the synthesis is limited to the voice fonts installed in the “Region & Language” section of the Windows settings on that machine.
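
If you are not sure which voice fonts are available on the host, you can enumerate them with the same APIs. Here is a minimal sketch (the helper class below is just for illustration and is not part of the sample):

// List the voice fonts installed on this machine
// (requires a reference to System.Speech)
using System;
using System.Speech.Synthesis;

public static class InstalledVoiceLister
{
    public static void PrintVoices()
    {
        using (var synth = new SpeechSynthesizer())
        {
            foreach (InstalledVoice voice in synth.GetInstalledVoices())
            {
                VoiceInfo info = voice.VoiceInfo;
                Console.WriteLine($"{info.Name} ({info.Culture}, {info.Gender})");
            }
        }
    }
}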

Let’s look at the code in TtsEngineLocal:

private readonly SpeechSynthesizer _synth; 
 
public TtsEngineLocal() 
{ 
    // Initialize the speech synthesizer 
    _synth = new SpeechSynthesizer(); 
    _synth.SelectVoiceByHints(VoiceGender.Female); 
} 

You can fully personalize your synthesizer and define things like gender, speed and volume to best suit your bot scenarios.
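
For example, here are a few optional tweaks (these exact hints and values are illustrative, not part of the sample):

// Pick an adult female en-US voice and adjust delivery
_synth.SelectVoiceByHints(
    VoiceGender.Female,
    VoiceAge.Adult,
    0, // take the first matching voice
    System.Globalization.CultureInfo.GetCultureInfo("en-US"));
_synth.Rate = 2;    // speaking rate, from -10 (slowest) to 10 (fastest)
_synth.Volume = 80; // output volume, from 0 to 100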

Once the synthesizer is ready we can generate the audio stream:

public MemoryStream SynthesizeText(string text)
{
    var audioStream = new MemoryStream();

    _synth.SetOutputToAudioStream(
        audioStream,
        new SpeechAudioFormatInfo(
            samplesPerSecond: 16000,
            bitsPerSample: AudioBitsPerSample.Sixteen,
            channel: AudioChannel.Mono));

    _synth.Speak(text);

    return audioStream;
}

NOTE: when we set the output of the synthesizer to an audio stream, we need to make sure that the output format matches the audio socket settings we specify when we initialize the MediaSession of the bot (shown below):

... 
_audioSocket = new AudioSocket( 
    new AudioSocketSettings 
    { 
        StreamDirections = StreamDirection.Sendrecv, 
        SupportedAudioFormat = AudioFormat.Pcm16K, 
        CallId = correlationId 
    }); 

In this case the audio format of the audio socket is Pcm16K, which means 16,000 samples per second, with each sample consisting of 16 bits.

That’s it! Our audio stream now contains the generated audio. Let’s see how we can do the same thing using the Bing Speech APIs; then we’ll learn how to convert the stream into an array of audio buffers that we’ll feed to the bot’s audio socket.

Using the Bing Text to Speech engine

Now, as an alternative, let’s try using the Bing Text to Speech API. In this case we will make an HTTP request to a cloud service which will do the speech synthesis and return the audio to us. This has the advantage that there is no machine setup needed, and it gives us much more flexibility, since new voice fonts and locales are constantly added and updated in the cloud.

The first step is to create a new Bing Speech API endpoint in your Azure subscription and take note of the API key (you will find it under “Resource Management -> Keys” in the Azure Portal); we will need it to perform our HTTP requests. This code is in the TtsEngineService class.

// key and name of the Bing Speech Api azure subscription 
private const string BingSpeechApiKey = @"<api key hex value>"; 
private const string BingSpeechAppName = @"ttsbot";

Once we have the key, we need to handle the authentication. For the full details on how to set up the Cognitive Services APIs, please refer to the “Get Started” documentation. In our sample we handle the authentication in the CognitiveServicesAuthentication class:

var authentication = new CognitiveServicesAuthentication(BingSpeechApiKey);
_token = authentication.GetAccessToken();
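
Under the hood, obtaining the token is a simple HTTP POST against the token endpoint with the subscription key in a header. A minimal sketch of what GetAccessToken might do (assuming the classic Bing Speech token endpoint; tokens are short-lived, so refresh them periodically):

// Request an access token for the Bing Speech API
// (requires System.Net and System.IO)
var tokenRequest = (HttpWebRequest)WebRequest.Create(
    "https://api.cognitive.microsoft.com/sts/v1.0/issueToken");
tokenRequest.Method = "POST";
tokenRequest.ContentLength = 0;
tokenRequest.Headers["Ocp-Apim-Subscription-Key"] = BingSpeechApiKey;

using (var tokenResponse = tokenRequest.GetResponse())
using (var reader = new StreamReader(tokenResponse.GetResponseStream()))
{
    // the response body is the raw token string
    _token = reader.ReadToEnd();
}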

We can now use the authentication token to create the HTTP request. Again, make sure the output format matches the audio socket audio format.

var request = (HttpWebRequest) HttpWebRequest.Create(RequestUri); 
 
request.Method = "POST"; 
request.ProtocolVersion = HttpVersion.Version11; 
request.ContentType = "application/ssml+xml"; 
request.UserAgent = BingSpeechAppName; 
request.Headers["X-Microsoft-OutputFormat"] = @"raw-16khz-16bit-mono-pcm"; 
request.Headers["X-Search-AppId"] = _appId; 
request.Headers["X-Search-ClientID"] = _clientId; 
request.Headers["Authorization"] = "Bearer " + _token

The Bing Speech APIs use the standard Speech Synthesis Markup Language (SSML) to control aspects of the speech synthesis, such as the voice, language and prosody. Let’s create a simple SSML string to send as the body of our request:

var ssml = new StringBuilder(); 
 
ssml.Append(@"<speak version='1.0'"); 
ssml.Append("<voice xml:lang='en-US' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaRUS)'>"); 
ssml.Append(text); 
ssml.Append("</voice>"); 
ssml.Append("</speak>"); 
 
// send the request 
byte[] requestBody = Encoding.UTF8.GetBytes(ssml.ToString()); 
using (var stream = request.GetRequestStream()) 
{ 
   stream.Write(requestBody, 0, requestBody.Length);   
   stream.Flush(); 
} 

Now the only thing left is to get the response stream and copy it to our local audio stream:

var audioStream = new MemoryStream(); 
var responseStream = request.GetResponse().GetResponseStream(); 
            
responseStream?.CopyTo(audioStream); 
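
In production code you would also want to dispose of the response once the copy is done. A slightly more defensive variant of the same download (same behavior, just explicit cleanup):

// Copy the synthesized audio, disposing the response afterwards
var audioStream = new MemoryStream();
using (var response = request.GetResponse())
using (var responseStream = response.GetResponseStream())
{
    responseStream?.CopyTo(audioStream);
}
audioStream.Position = 0; // rewind so the stream is ready to be read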

And that’s it! Just as with the local TTS engine, our audio stream now contains the synthesized audio. Now let’s learn how to convert it to a list of audio buffers.

Creating the audio buffers

The Real-Time Media Platform works by enqueuing lists of synchronized audio/video buffers. Our bot will not use video, though, so we start by creating a media configuration with only the audio socket. Let’s look at the code in the MediaSession constructor:

// create the audio socket 
_audioSocket = new AudioSocket(new AudioSocketSettings 
{ 
    StreamDirections = StreamDirection.Sendrecv, 
    SupportedAudioFormat = AudioFormat.Pcm16K, 
    CallId = correlationId 
}); 
              
// create the mediaconfiguration with only audio channel 
MediaConfiguration = MediaPlatform.CreateMediaConfiguration(_audioSocket); 

The next step is to create the AudioVideoFramePlayer that we will use to enqueue our audio buffers. In this case, too, we will ignore the video parameters:

 
AudioVideoFramePlayer audioVideoFramePlayer = new AudioVideoFramePlayer(
    _audioSocket,
    null,
    new AudioVideoFramePlayerSettings(
        new AudioSettings(20),
        new VideoSettings(),
        1000));

Notice how we are setting the audio buffer length to 20ms; this is important now that we have to build the audio buffers. The function that takes care of this is populateAudioBuffersFromStream:

private long populateAudioBuffersFromStream( 
    Stream stream,  
    List<AudioMediaBuffer> audioBuffers,  
    long referenceTimeTick) 

This function takes as input the stream containing the synthesized audio, the list of audio buffers we will populate, and a reference time tick that tells us the start time for the audio playback.

Let’s start by defining a byte array that will contain our data. Since each buffer will be 20ms, we need to figure out how big this array will be in bytes. Our audio format has 16,000 samples per second, or 16 samples per ms, so for 20ms we have 16*20=320 samples. Each sample is 16 bits, or 2 bytes, so the total size of our byte array is 320*2=640 bytes.

// a 20ms buffer is 640 bytes 
int bufferSize = 640; 
byte[] bytesToRead = new byte[bufferSize]; 

The next thing to do is to read from our stream into the byte array and create the audio buffer objects with the correct time tick reference:

// create 20ms buffers from the input stream 
while (stream.Read(bytesToRead, 0, bytesToRead.Length) >= bufferSize)
{ 
    IntPtr unmanagedBuffer = Marshal.AllocHGlobal(bufferSize); 
    Marshal.Copy(bytesToRead, 0, unmanagedBuffer, bufferSize); 
                 
    // move the reference time by 20ms (there are 10K ticks in 1 ms) 
    referenceTimeTick += 20 * 10000;  
 
    // create the audio buffer and add it to the list 
    var audioBuffer = new AudioSendBuffer( 
        unmanagedBuffer,  
        bufferSize,    
        AudioFormat.Pcm16K,  
        referenceTimeTick); 
    audioBuffers.Add(audioBuffer); 
} 
 
// return the reference time tick of the last buffer  
// so we can queue new data if needed 
return referenceTimeTick; 

Now we can simply feed the list of Audio Buffers to the AudioVideoFramePlayer. The list of Video Buffers can be empty as we are using only the audio socket.

await audioVideoFramePlayer.EnqueueBuffersAsync( 
    audioMediaBuffers,  
    new List<VideoMediaBuffer>()); 
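
Putting it all together, the flow looks roughly like this (a sketch using the names from this post; the sample wires these pieces slightly differently):

// 1. synthesize the text into a PCM stream
var ttsEngine = new TtsEngineLocal();
MemoryStream audioStream = ttsEngine.SynthesizeText("It's alive! It's alive!");
audioStream.Position = 0; // rewind before reading

// 2. slice the stream into 20ms audio buffers
var audioMediaBuffers = new List<AudioMediaBuffer>();
long referenceTimeTick = DateTime.Now.Ticks;
referenceTimeTick = populateAudioBuffersFromStream(
    audioStream, audioMediaBuffers, referenceTimeTick);

// 3. hand the buffers to the player for playback on the call
await audioVideoFramePlayer.EnqueueBuffersAsync(
    audioMediaBuffers, new List<VideoMediaBuffer>());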

Pros and cons of the two engines

We saw two different ways of synthesizing speech, each one of them with some pros and cons:

  • Local TTS engine: this is the simplest way to generate speech. It uses the local speech engine, which means you will have to set up the machine that will run the app, installing the needed locales and voice fonts. The advantage is good performance with no need for network connectivity, at the cost of limited quality of the final synthesized speech.
  • Bing Text to Speech API: these APIs use the full power of Microsoft Cognitive Services through a fast and scalable REST service. You have a large and constantly updated set of locales and high-quality voice fonts to choose from, but you have to consider the limitation that a maximum of 15 seconds of audio is returned per request (see the sketch below for one way to work around this).
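
One simple way to stay under the 15-second cap is to split longer text into smaller pieces (for example, one sentence per request) and synthesize each piece separately. A naive, illustrative splitter (real sentence segmentation is more subtle):

// Split text on sentence-ending punctuation
// (requires System.Collections.Generic and System.Linq)
private static IEnumerable<string> SplitIntoSentences(string text)
{
    return text.Split(new[] { '.', '!', '?' },
                      StringSplitOptions.RemoveEmptyEntries)
               .Select(sentence => sentence.Trim())
               .Where(sentence => sentence.Length > 0);
}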

Conclusions

With this sample you learned how to convert text into voice for your bot. This opens up a whole world of new and exciting scenarios where users can listen to your bot just by calling and using natural language. Stay tuned for the next post where we will learn how to add a talking avatar to your bot!

Happy Skype coding! For more information on developing for the Skype platform, check out dev.skype.com.