Add a talking face to your Skype bots

…or how to build a TTVS (text-to-video-speech) bot in Skype.

Introduction

In our previous post we showed you how to build a simple TTS (text-to-speech) bot. But Skype is a video platform! Wouldn’t it be great if you could build a Skype bot with a face, one you could have a conversation or a video call with?

Well, you can do that in Skype! We’ll show you how below:

Try me!

You can add the bot from this sample by clicking on this link! Simply start an audio or video call with the bot to enjoy the generated audio and video.

Code Repository

You can get the complete solution for this and other blog posts from GitHub:

 > git clone https://github.com/Microsoft/skype-dev-bots.git

First, the core concepts…

To animate our bot avatar we will use visemes. A viseme is the visual equivalent of a phoneme: it defines the shape and position of your mouth and face when you make the sound associated with a phoneme. For example, the sounds /p/, /b/ and /m/ all look the same on the lips, so they share a single viseme.

For this sample we will use the same TTS engine from the Speech Synthesis APIs that we used in the previous blog post. We need to do that because the Bing Text to Speech service does not support visemes just yet. We will generate an audio stream exactly as we did for our previous sample, and in parallel we will generate a synchronized list of Video Buffers using a set of static frames associated with each viseme. We like to call this a TTVS (text-to-video-speech) engine.

Setting up the video call

This bot needs both the audio and video sockets to be enabled, so let’s see how to set that up.

First of all we’ll set up the audio socket, something we already learned in the previous blog post:

// create the audio socket 
_audioSocket = new AudioSocket(new AudioSocketSettings 
{ 
      StreamDirections = StreamDirection.Sendrecv, 
      SupportedAudioFormat = AudioFormat.Pcm16K, 
      CallId = correlationId 
}); 

Then let’s set up the video socket:

// define the video format 
_defaultVideoFormat = VideoFormat.Rgb24_1280x720_30Fps; 
 
// create the video socket 
_videoSocket = new VideoSocket(new VideoSocketSettings 
{ 
      StreamDirections = StreamDirection.Sendrecv, 
      ReceiveColorFormat = VideoColorFormat.NV12, 
      SupportedSendVideoFormats = new List<VideoFormat>() { 
            _defaultVideoFormat 
      }, 
      CallId = correlationId 
}); 

It’s important to keep track of the video format we use to initialize the video socket (Rgb24_1280x720_30Fps) since we will need it later to determine the size and duration of the video frames.

We can then create the MediaConfiguration and the AudioVideoFramePlayer we will use to send our audio and video buffers:

// create the mediaconfiguration 
MediaConfiguration = MediaPlatform.CreateMediaConfiguration( 
                          _audioSocket,  
                          _videoSocket); 
 
// create an audio/video frame player 
audioVideoFramePlayer = new AudioVideoFramePlayer( 
                            _audioSocket,  
                            _videoSocket,  
                            new AudioVideoFramePlayerSettings( 
                                new AudioSettings(20),  
                                new VideoSettings(),  
                                1000)); 

Let’s create visemes!

All the magic happens in the TtvsEngine class. In the constructor we simply initialize the SpeechSynthesizer and preload the visemes in the PreloadVisemes() function. This function loads the avatar frames into a data structure that maps viseme values to byte arrays containing the bitmaps.

private Dictionary<int, byte[]> _visemeBitmaps = new Dictionary<int, byte[]>();

Each image is converted to a byte array by the BitmapToByteArray function, which also makes sure the size of the bitmap matches the size of the call video format.
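
Such a conversion helper might look like the sketch below, which resizes the bitmap to the video format’s dimensions and copies out its raw 24bpp pixel data. This is a minimal sketch based on System.Drawing; the BitmapToByteArray implementation in the sample may differ in details:

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

public static class Utilities
{
    // resize the bitmap to the call video format and return its raw
    // 24bpp pixel data as a byte array
    public static byte[] BitmapToByteArray(Bitmap bitmap, VideoFormat videoFormat)
    {
        using (var resized = new Bitmap(bitmap, (int)videoFormat.Width, (int)videoFormat.Height))
        {
            var rect = new Rectangle(0, 0, resized.Width, resized.Height);
            var data = resized.LockBits(rect, ImageLockMode.ReadOnly,
                                        PixelFormat.Format24bppRgb);
            try
            {
                // for a 1280-pixel-wide frame the stride has no padding,
                // so this is exactly width * height * 3 bytes
                var bytes = new byte[data.Stride * data.Height];
                Marshal.Copy(data.Scan0, bytes, 0, bytes.Length);
                return bytes;
            }
            finally
            {
                resized.UnlockBits(data);
            }
        }
    }
}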

In this example we load the avatar_neutral image for the viseme value 0:

_visemeBitmaps.Add(0, Utilities.BitmapToByteArray(Properties.Resources.avatar_neutral, _videoFormat)); 

The full list of the 21 viseme values for US English is described here. For simplicity, in this sample we use only 10 avatar frames, so we have to map the same frame to different visemes. Extending the avatar frames to the full list of 21 visemes will greatly increase the quality and smoothness of the final rendered video.
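
Sharing frames is straightforward: for example, several of the open-mouth vowel visemes could reuse a single bitmap. Here is a sketch of such a grouping inside PreloadVisemes() (the avatar_mouth_open resource name and the particular viseme grouping are illustrative assumptions, not the sample’s actual choices):

// map several viseme values to the same bitmap when we have fewer
// avatar frames than viseme values
var mouthOpen = Utilities.BitmapToByteArray(
                    Properties.Resources.avatar_mouth_open, _videoFormat);

foreach (var viseme in new[] { 1, 2, 9, 11 })
{
      _visemeBitmaps.Add(viseme, mouthOpen);
}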

The SynthesizeText method will generate the audio and video buffers for the given text. This step is very similar to what we did in the previous bot sample, with the exception that now we subscribe to the VisemeReached event in order to build the visemes timeline:

// observe the synthesizer and generate the visemes timeline   
VisemesTimeline timeline = new VisemesTimeline(); 
_synth.VisemeReached += (sender, visemeReachedEventArgs) => 
{ 
      timeline.Add(visemeReachedEventArgs.Viseme, 
                   visemeReachedEventArgs.Duration.Milliseconds); 
}; 

The VisemesTimeline is a simple data structure that keeps track of which viseme is active at any point in time during the synthesized text. For example, timeline.Get(0) will return 0 (the viseme value for the neutral position), since we are starting from a neutral position of silence; timeline.Get(1000) will return the viseme value for whatever sound is produced at 1 second of playback; and so on.
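
A minimal sketch of such a timeline could look like this (an illustrative implementation; the actual class in the sample may differ in details):

using System.Collections.Generic;

public class VisemesTimeline
{
    private readonly List<int> _visemes = new List<int>();
    private readonly List<int> _durationsInMs = new List<int>();

    // total length of the timeline in milliseconds
    public int Length { get; private set; }

    // append a viseme that stays active for the given duration
    public void Add(int viseme, int durationInMs)
    {
        _visemes.Add(viseme);
        _durationsInMs.Add(durationInMs);
        Length += durationInMs;
    }

    // return the viseme active at the given playback time
    public int Get(int timeInMs)
    {
        var elapsedMs = 0;
        for (var i = 0; i < _visemes.Count; i++)
        {
            elapsedMs += _durationsInMs[i];
            if (timeInMs < elapsedMs)
            {
                return _visemes[i];
            }
        }

        // past the end of the timeline: fall back to the neutral viseme
        return 0;
    }
}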

Creating the Audio Buffers

The Audio Buffers are generated by the CreateAudioBuffers function, and the code is the same as in our previous bot, so please refer to <link to previous blog post> for the full implementation details.

Creating the Video Buffers

The new Video Buffers are generated in the CreateVideoBuffers function. The idea is very simple: for each video frame we are about to build, we first look in the visemes timeline for the viseme that is active at that specific point in time, then we look up the bitmap for that viseme in the preloaded list of bitmaps and copy it into the buffer.

First we need to define the size of the frame in bytes. This depends entirely on the video format we used to initialize the video socket: it is the size (width x height) of the frame multiplied by the number of bytes per pixel.

// compute the frame buffer size in bytes for the current video format 
var frameSize = (int) (_videoFormat.Width * 
                       _videoFormat.Height * 
                       Helper.GetBitsPerPixel(_videoFormat.VideoColorFormat) / 8); 
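
For Rgb24_1280x720_30Fps this works out to 1280 x 720 x 24 / 8 = 2,764,800 bytes per frame.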

Next, let’s find out the duration of each frame in milliseconds, which again depends on the video format and its framerate:

// compute the frame duration for the current framerate 
var frameDurationInMs = (int) (1000.0 / (double) _videoFormat.FrameRate); 
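
At 30 frames per second this gives 1000 / 30, or about 33 ms per frame.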

Then similarly to the Audio Buffers, we create and fill the Video Buffers with the bitmap data and the correct time tick reference.

var durationInMs = 0; 
 
// create video frames for the whole visemes timeline length 
while (durationInMs < visemesTimeline.Length) 
{ 
      // get the bitmap of the viseme active at the current time 
      byte[] visemeBitmap = _visemeBitmaps[visemesTimeline.Get(durationInMs)]; 
 
      // copy the bitmap into a new unmanaged buffer 
      IntPtr unmanagedBuffer = Marshal.AllocHGlobal(frameSize); 
      Marshal.Copy(visemeBitmap, 0, unmanagedBuffer, frameSize); 
 
      // increase the current duration by one frame 
      durationInMs += frameDurationInMs; 
 
      // create the video buffer and add it to the list 
      // (time ticks are in 100-ns units, hence the ms * 10000) 
      var videoSendBuffer = new VideoSendBuffer( 
                                unmanagedBuffer,  
                                (uint) frameSize, 
                                _videoFormat,  
                                referenceTimeTick + durationInMs * 10000); 
 
      videoBuffers.Add(videoSendBuffer); 
} 

Now we can simply feed both lists of Audio and Video Buffers to the AudioVideoFramePlayer, which will take care of synchronizing them based on the reference time ticks of each buffer.

await audioVideoFramePlayer.EnqueueBuffersAsync( 
    audioBuffers,  
    videoBuffers); 

Conclusion

We showed you how to use the Speech Synthesis APIs to generate an audio stream and the associated viseme values; from there it was easy to generate the video content and keep it synchronized with the audio through the Real-Time Media Platform. Now you can add a face to your bot!

Happy Skype coding! For more information on developing for the Skype platform, check out dev.skype.com.