How to build a real-time media-enabled Skype bot

Introduction

During Microsoft Build 2017, we announced that Skype bots now support 1:1 real-time media calling. This is a unique capability available on Skype.

In this post, we will create and deploy a bot that plays back a video upon receiving a call. We will create our sample bot using C# and the Real-time Media Platform. We will name the bot Video Player. You can try the Video Player bot by using this join link to add it to your Skype client and then giving it a Skype call to play the video.

Deployment

Here are the 5 steps to create and deploy your own version of the Video Player bot.

1. Create a Cloud Service and a Storage Account on Azure

First, you will need to create a Cloud Service and a Storage Account on http://portal.azure.com. Your content will be stored on the Storage Account, whereas the bot lives on the Cloud Service.

2. Upload the content to Azure

Before starting to look at the code, you need to upload the content you want to stream to your Storage Account. You need to split the audio track from the video track and convert each of them to raw format.

In this sample, we will use:

  • NV12 640×360 at 30fps for the video
  • PCM 16 for the audio

To convert an existing video, you can use the popular ffmpeg tool:

Command to convert and extract the video:
> ffmpeg -i myvideo.mp4 -s 640x360 -r 30 -f rawvideo -pix_fmt nv12 video.yuv

Command to convert and extract the audio:
> ffmpeg -i myvideo.mp4 -acodec pcm_s16le -ac 1 -ar 16000 audio.wav

Now that the “.yuv” and “.wav” files have been generated, upload them to your Storage Account. Place each file in the corresponding container specified in your configuration file (see step 4. Update Config).
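
As a sanity check before uploading, the expected size of the raw video can be worked out from the formats listed above. The sketch below is illustrative only and is not part of the sample: the RawMediaMath class and its method names are hypothetical. It assumes NV12 stores 12 bits per pixel (a full-resolution luma plane followed by a half-size interleaved chroma plane) and that the audio is mono 16-bit PCM at 16 kHz, as requested in the ffmpeg commands.

```csharp
using System;

// Hypothetical helper for sanity-checking raw media sizes; not part of the VideoPlayer sample.
public static class RawMediaMath
{
    // NV12: width*height luma bytes plus width*height/2 interleaved chroma bytes.
    public static long Nv12FrameSize(int width, int height)
    {
        return (long)width * height * 3 / 2;
    }

    // Raw video bytes produced per second at the given frame rate.
    public static long VideoBytesPerSecond(int width, int height, int framesPerSecond)
    {
        return Nv12FrameSize(width, height) * framesPerSecond;
    }

    // Mono 16-bit PCM: two bytes per sample.
    public static long AudioBytesPerSecond(int sampleRate)
    {
        return (long)sampleRate * 2;
    }

    // A well-formed raw video file contains a whole number of frames.
    public static bool IsWholeNumberOfFrames(long fileSizeInBytes, int width, int height)
    {
        return fileSizeInBytes % Nv12FrameSize(width, height) == 0;
    }
}
```

For 640×360 at 30fps, this gives 345,600 bytes per frame and 10,368,000 bytes per second of video, and 32,000 bytes per second of audio. Note that a .wav file also carries a small container header in front of the PCM data, so its size is slightly larger than the PCM payload alone.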

3. Get the code

To get this bot, clone the following Git repo: https://github.com/Microsoft/skype-dev-bots.git
> git clone https://github.com/Microsoft/skype-dev-bots.git

The source code is under the “Samples/Csharp/RealtimeMedia/VideoPlayer” directory.

4. Update Config

In app.config of the WorkerRole, replace $BotHandle$, $MicrosoftAppId$ and $BotSecret$ with values obtained during bot registration.

<!-- app.config --> 
<!-- update app settings with your bot id, app id, and app password --> 
<appSettings> 
  <add key="BotId" value="$BotHandle$" /> 
  <add key="MicrosoftAppId" value="$MicrosoftAppId$" /> 
  <add key="MicrosoftAppPassword" value="$BotSecret$" /> 
</appSettings>

In service configuration (ServiceConfiguration.Cloud.cscfg file), make the replacements below:

<!-- Replace $ConnectionString$ with the Connection String found on the ‘Access Keys’ of your Storage Account. --> 
<Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString" value="$ConnectionString$" /> 
... 
<!-- Replace $CertificateThumbprint$ with the thumbprint value found on the ‘Certificate’ tab of your Cloud Service. -->
<Setting name="DefaultCertificate" value="$CertificateThumbprint$" /> 
... 
<!-- Replace $audioFileName$ and $videoFileName$ with the name of your audio and video file --> 
<Setting name="AudioContainer" value="audio" /> 
<Setting name="AudioFile" value="$audioFileName$" /> 
<Setting name="VideoContainer" value="video" /> 
<Setting name="VideoFile" value="$videoFileName$" />

Note: the default containers are ‘video’ for video files and ‘audio’ for audio files, but you can modify those settings as well.

5. Build and deploy the bot

Prerequisites and instructions for deploying are available here. Update the configuration before deploying the sample per the instructions above.

And voila! Congratulations, you just deployed your first real-time media Skype bot!

Architecture

When the bot receives a call, it creates a video socket and an audio socket to the user’s device, using the Real-time Media Platform. The bot also connects to the Storage Account, where the content resides, using the Microsoft Storage Account SDK.

A few seconds of audio and video (2s in this sample) are then downloaded from the Storage Account into bot memory buffers. The buffers are then sent to the Real-time Media Platform to be streamed to the user.

When the data available in the buffers goes below a certain threshold, the bot downloads more content from the Storage Account to feed the sockets.
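
The refill-on-low-buffer behaviour described above can be modelled without any SDK. The sketch below is a simplified, hypothetical model (RefillingQueue and its members are not part of the sample or of the Real-time Media Platform): chunks are represented only by their duration in milliseconds, and a delegate stands in for the download from the Storage Account.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of the buffering loop; not code from the sample.
public sealed class RefillingQueue
{
    private readonly Queue<int> _chunkDurationsMs = new Queue<int>();
    private readonly int _lowWatermarkMs;
    private readonly Func<IReadOnlyList<int>> _downloadMore;

    public RefillingQueue(int lowWatermarkMs, Func<IReadOnlyList<int>> downloadMore)
    {
        _lowWatermarkMs = lowWatermarkMs;
        _downloadMore = downloadMore;
    }

    // Total duration of the media currently queued.
    public int BufferedMs { get; private set; }

    // Number of downloads that actually returned content.
    public int RefillCount { get; private set; }

    // Consumes one chunk, refilling first if the queue is below the threshold.
    public bool TryPlayNext()
    {
        if (BufferedMs < _lowWatermarkMs)
        {
            Refill();
        }
        if (_chunkDurationsMs.Count == 0)
        {
            return false; // nothing buffered and nothing left to download
        }
        BufferedMs -= _chunkDurationsMs.Dequeue();
        return true;
    }

    private void Refill()
    {
        IReadOnlyList<int> downloaded = _downloadMore();
        if (downloaded.Count == 0)
        {
            return; // end of content
        }
        RefillCount++;
        foreach (int durationMs in downloaded)
        {
            _chunkDurationsMs.Enqueue(durationMs);
            BufferedMs += durationMs;
        }
    }
}
```

With a 2000 ms threshold and a download delegate that returns 2s of 20 ms chunks at a time, the queue tops itself up each time it dips below 2s and stops once the delegate returns an empty list.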

Here is a generic architecture of the different components used in this bot:

Let’s look at the code!

The streaming logic is contained in two .cs files: BlobDownloader.cs and MediaSession.cs.

BlobDownloader.cs — This class is in charge of connecting and downloading data from your Storage Account.  

The ConnectToStorageAccount method uses the Microsoft Storage Account SDK to connect to your Storage Account and initialize references to your video and audio files.

public void ConnectToStorageAccount() 
{ 
  var config = Service.Instance.Configuration; 
  CloudStorageAccount storageAccount = CloudStorageAccount.Parse(config.StorageAccountConnection); 
  CloudBlobClient client = storageAccount.CreateCloudBlobClient(); 
  CloudBlobContainer videoContainer = client.GetContainerReference(config.VideoContainer);             
  CloudBlobContainer audioContainer = client.GetContainerReference(config.AudioContainer); 
  _videoBlob = videoContainer.GetBlockBlobReference(config.VideoFile);
  _audioBlob = audioContainer.GetBlockBlobReference(config.AudioFile);
} 

Downloading video from the Storage Account is handled by the GetVideoMediaBuffers method. This method downloads a fixed amount of content (2s in our sample) using the Microsoft Storage Account SDK, wraps each frame in a VideoMediaBuffer object, and returns the resulting list of VideoMediaBuffer objects. The method is synchronous, so be careful which thread calls it!

The BlobDownloader keeps track of which frames have already been downloaded and always returns the next 2s. For example, the first call to GetVideoMediaBuffers returns seconds 1 and 2 of your video, the second call returns seconds 3 and 4, and so on. When the end of the video has been reached, the method returns an empty list of VideoMediaBuffer.

public List<VideoMediaBuffer> GetVideoMediaBuffers(long currentTick) 
{ 
  ... 
  // 1. Download _nbSecondToLoad seconds of content from the storage account 
  long bufferSize = _frameSize * _videoFormat.FrameRate * _nbSecondToLoad; 
  byte[] bytesToRead = new byte[bufferSize]; 
  var nbByteRead = _videoBlob.DownloadRangeToByteArray(bytesToRead, 0, _videoOffset, bytesToRead.Length, null, null); 
  //2. Extract each video frame in a VideoMediaBuffer object 
  List<VideoMediaBuffer> videoMediaBuffers = new List<VideoMediaBuffer>(); 
  long  referenceTime = currentTick; 
  for (int index = 0; index < nbByteRead; index += _frameSize) 
  { 
    IntPtr unmanagedBuffer = Marshal.AllocHGlobal(_frameSize); 
    Marshal.Copy(bytesToRead, index, unmanagedBuffer, _frameSize); 
    referenceTime += _frameDurationInTicks; 
    var videoSendBuffer = new VideoSendBuffer(unmanagedBuffer, (uint)_frameSize, _videoFormat, referenceTime); 
    videoMediaBuffers.Add(videoSendBuffer); 
    _videoOffset += _frameSize; 
  } 
  ... 
  return videoMediaBuffers; 
} 
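
The offset bookkeeping used by GetVideoMediaBuffers can be sketched without the Storage SDK or unmanaged memory. The example below is a hypothetical stand-in, not the sample's code: FrameChunkReader reads from any seekable Stream instead of a blob, returns managed byte arrays instead of VideoSendBuffer objects, and drops a trailing partial frame, which is an assumption about the desired behaviour.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the blob download: reads whole frames from a seekable stream.
public sealed class FrameChunkReader
{
    private readonly Stream _source;
    private readonly int _frameSize;
    private readonly int _framesPerChunk;
    private long _offset; // plays the role of _videoOffset in the sample

    public FrameChunkReader(Stream source, int frameSize, int framesPerChunk)
    {
        _source = source;
        _frameSize = frameSize;
        _framesPerChunk = framesPerChunk;
    }

    // Returns the next chunk of whole frames, or an empty list at the end of the stream.
    public List<byte[]> ReadNextFrames()
    {
        var frames = new List<byte[]>();
        _source.Seek(_offset, SeekOrigin.Begin);
        for (int i = 0; i < _framesPerChunk; i++)
        {
            var frame = new byte[_frameSize];
            if (ReadFully(_source, frame) < _frameSize)
            {
                break; // end of stream, or a trailing partial frame that we drop
            }
            frames.Add(frame);
            _offset += _frameSize;
        }
        return frames;
    }

    // Stream.Read may return fewer bytes than requested, so loop until the buffer is full.
    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
            {
                break;
            }
            total += read;
        }
        return total;
    }
}
```

Each call picks up where the previous one stopped, so callers only need to keep calling ReadNextFrames until it returns an empty list.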

The GetAudioMediaBuffers function follows the same logic for the audio content and returns a list of AudioMediaBuffer.

public List<AudioMediaBuffer> GetAudioMediaBuffers(long currentTick)
{
  ...
  // 1. Download _nbSecondToLoad seconds of audio content from the storage account
  long bufferSize = 16000 * 2 * _nbSecondToLoad; // Pcm16K is 16000 samples per second, each sample is 2 bytes
  byte[] bytesToRead = new byte[bufferSize];
  var nbByteRead = _audioBlob.DownloadRangeToByteArray(bytesToRead, 0, _audioOffset, bytesToRead.Length, null, null);
  // 2. Extract each 20ms chunk of audio into an AudioMediaBuffer object
  List<AudioMediaBuffer> audioMediaBuffers = new List<AudioMediaBuffer>();
  int audioBufferSize = (int)(16000 * 2 * 0.02); // the Real-time Media Platform expects an audio buffer duration of 20ms
  long referenceTime = currentTick;
  for (int index = 0; index < nbByteRead; index += audioBufferSize)
  {
    IntPtr unmanagedBuffer = Marshal.AllocHGlobal(audioBufferSize);
    Marshal.Copy(bytesToRead, index, unmanagedBuffer, audioBufferSize);
    // 10000 ticks in a ms
    referenceTime += 20 * 10000;
    var audioBuffer = new AudioSendBuffer(unmanagedBuffer, audioBufferSize, _audioFormat, referenceTime);
    audioMediaBuffers.Add(audioBuffer);
    _audioOffset += audioBufferSize;
  }
  ...
  return audioMediaBuffers;
}

Be careful to specify accurate timestamps for audio and video to ensure they stay in sync. The AudioVideoFramePlayer tolerates up to 150ms of deviation; beyond that, it might drop a video frame.
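
One way to keep the two timelines aligned is to derive every timestamp from a buffer index rather than repeatedly adding a rounded per-frame duration: at 30fps a frame lasts 333,333.33 ticks, so an integer duration loses a fraction of a tick on every frame. The sketch below is a hypothetical helper (MediaTimestamps is not part of the sample) and assumes timestamps are expressed in 100 ns ticks, as in the code above.

```csharp
using System;

// Hypothetical helper; timestamps are in 100 ns ticks, as in the sample.
public static class MediaTimestamps
{
    // Timestamp of a video frame, computed from its index to avoid accumulated rounding.
    public static long VideoTimestamp(long startTick, long frameIndex, int framesPerSecond)
    {
        return startTick + frameIndex * TimeSpan.TicksPerSecond / framesPerSecond;
    }

    // Timestamp of an audio buffer of fixed duration (20 ms in the sample).
    public static long AudioTimestamp(long startTick, long bufferIndex, int bufferDurationMs)
    {
        return startTick + bufferIndex * bufferDurationMs * TimeSpan.TicksPerMillisecond;
    }

    // Absolute distance between an audio and a video timestamp, in milliseconds.
    public static double DriftInMs(long audioTimestamp, long videoTimestamp)
    {
        return Math.Abs(audioTimestamp - videoTimestamp) / (double)TimeSpan.TicksPerMillisecond;
    }
}
```

With these formulas, video frame 30 at 30fps and audio buffer 50 at 20 ms both land exactly one second (10,000,000 ticks) after the start tick, well inside the 150ms tolerance mentioned above.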

MediaSession.cs — This class is in charge of initializing the media player and loading content.

This logic is contained in the StartAudioVideoFramePlayer function:

private async Task StartAudioVideoFramePlayer()
{
  ...
  AudioVideoFramePlayerSettings settings = new AudioVideoFramePlayerSettings(
    new AudioSettings(20),
    new VideoSettings(),
    _mediaBufferToLoadInSeconds * 1000);
  _audioVideoFramePlayer = new AudioVideoFramePlayer(_audioSocket, _videoSocket, settings);
  _downloadManager.ConnectToStorageAccount();
  _audioVideoFramePlayer.LowOnFrames += OnLowOnFrames;
  var currentTick = DateTime.Now.Ticks;
  _videoMediaBuffers = _downloadManager.GetVideoMediaBuffers(currentTick);
  _audioMediaBuffers = _downloadManager.GetAudioMediaBuffers(currentTick);
  //update the tick for next iteration 
  _mediaTick = Math.Max(_audioMediaBuffers.Last().Timestamp, _videoMediaBuffers.Last().Timestamp);
  await _audioVideoFramePlayer.EnqueueBuffersAsync(_audioMediaBuffers, _videoMediaBuffers);
  ...
}

The last parameter of the AudioVideoFramePlayerSettings constructor is the minimum amount of buffered media, in milliseconds, below which the LowOnFrames event is raised. In our sample, the player raises LowOnFrames when it has less than 2s of media in the queue. The rest of the function downloads the first video and audio buffers and sends them to the player with the EnqueueBuffersAsync function.

The OnLowOnFrames handler just downloads more content for the media player:

private void OnLowOnFrames(object sender, LowOnFramesEventArgs e) 
{ 
  ... 
  _videoMediaBuffers = _downloadManager.GetVideoMediaBuffers(_mediaTick); 
  _audioMediaBuffers = _downloadManager.GetAudioMediaBuffers(_mediaTick); 
  _mediaTick = Math.Max(
    _audioMediaBuffers.Last().Timestamp,
    _videoMediaBuffers.Last().Timestamp);
  _audioVideoFramePlayer
    .EnqueueBuffersAsync(_audioMediaBuffers, _videoMediaBuffers)
    .ForgetAndLogException();
  ... 
} 
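
Because the download methods return an empty list once the end of the media is reached, and Enumerable.Last() throws an InvalidOperationException on an empty sequence, the tick update needs some guard for that case (it may well live in the elided parts of the handler). The sketch below shows one possible guard; MediaTick is a hypothetical helper that works on plain timestamp lists rather than the platform's buffer types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; works on raw timestamps rather than media buffer objects.
public static class MediaTick
{
    // Returns the latest timestamp across both lists, or currentTick when both are empty.
    public static long Next(
        long currentTick,
        IReadOnlyList<long> audioTimestamps,
        IReadOnlyList<long> videoTimestamps)
    {
        long lastAudio = audioTimestamps.Count > 0
            ? audioTimestamps[audioTimestamps.Count - 1]
            : currentTick;
        long lastVideo = videoTimestamps.Count > 0
            ? videoTimestamps[videoTimestamps.Count - 1]
            : currentTick;
        return Math.Max(lastAudio, lastVideo);
    }
}
```

When both lists are empty the tick is left unchanged, which gives the caller a simple signal that there is nothing more to enqueue.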

Further considerations

The Real-time Media Platform is still in preview, so there are a couple of things to keep in mind.

  1. If you want to create more than 10 video sockets at any single time, be sure to spin up additional VMs to handle the volume.
  2. The Real-time Media Platform only accepts raw input. Once converted to raw format, your video may be quite large: at NV12 640×360, each frame is 345,600 bytes, which is about 10 MB per second at 30fps. If you use a very high resolution, you might reach the maximum bandwidth between your Storage Account and your Cloud Service.

Stay tuned as we continue to make updates!

Powerful and engaging!

We hope you had fun following this “How-to” and we can’t wait to see what you will create with this unique Skype capability. We would love to hear your questions and feedback in the comment section below; it is also your chance to tell us what you need to create awesome video and audio bots!

Happy Skype coding! For more information on developing for the Skype platform, check out dev.skype.com.