Microsoft Azure Speech to Text

Introduction

Transcribe audio to text from a range of sources, including microphones and audio files, in more than 85 languages and variants.

Azure’s speech-to-text service defaults to the Universal language model. This model was trained using Microsoft-owned data and is deployed in the cloud, and it is optimal for conversational and dictation scenarios. You can add customisation and training to your model by uploading audio data and transcripts, which is particularly useful for addressing ambient noise or industry-specific vocabulary.

Additionally, you can reference your custom model through the Speech SDK, the Speech CLI, or the REST APIs in a programming language of your choice.

Fees

Pricing depends on usage and requires a Microsoft account, an Azure account, and a Speech service subscription. The free Speech service instance allows a limited number of speech-to-text audio hours each month. For more information on pricing, visit Speech Services Pricing.

Setting up Speech-to-Text

First, set up the account, subscription, and resources required.

  1. Sign up for a Microsoft account at the Microsoft account portal if you do not have one yet.
  2. Once you have a Microsoft account, go to the Azure sign-up page, select Start free, and create a new Azure account using a Microsoft account.
  3. Go to portal.azure.com and sign in with your account.
  4. Create a new Azure subscription. If you already have the Azure subscription you want to use (including a free trial), skip this step.
    1. Search for “Subscriptions” in the top search bar
      Screenshot of searching for subscription on the top search bar
    2. Select Add
      Screenshot of selecting add under subscriptions to add a new subscription
    3. Select the relevant billing account, fill in the form, and click Create. For more information on the different fields, visit Azure subscription.
  5. Create an Azure Speech service resource
    1. Open the collapsed menu in the upper left corner of the screen and click Create a resource
      Screenshot of create a Resource from the collapsable side menu option
    2. In the new window, type "speech" in the search box and press enter.
    3. In the search results, select Speech.
      Screenshot of searching and selecting speech
    4. Select Create, then fill in the form. For more information on the different fields visit Speech service.
    5. Select Create. This will take you to the deployment overview and display progress. It will take a few moments to deploy the new Speech resource.

Option 1: Out-of-the-box Speech-to-text Service

The out-of-the-box speech-to-text service provides quick real-time transcription from a microphone or from WAV audio files (8kHz or 16kHz, 16-bit, mono PCM).
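If you are unsure whether a WAV file meets these requirements, you can check it before uploading. Below is a minimal sketch using Python's standard wave module; the file name is a placeholder.

    import wave

    # Placeholder path: replace with your own recording.
    with wave.open("recording.wav", "rb") as wav_file:
        channels = wav_file.getnchannels()      # 1 = mono
        sample_width = wav_file.getsampwidth()  # 2 bytes = 16-bit
        sample_rate = wav_file.getframerate()   # expected 8000 or 16000 Hz

    print(f"channels={channels}, bits={sample_width * 8}, rate={sample_rate} Hz")
    if channels == 1 and sample_width == 2 and sample_rate in (8000, 16000):
        print("File meets the 8/16 kHz, 16-bit, mono PCM requirement.")
    else:
        print("File needs conversion before use with this service.")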

  1. Sign in to Speech Studio with your Azure account.
  2. Select the speech service resource you need to get started.
  3. Select Real-time Speech-to-text. 
    Screenshot of selecting Real-time speech-to-text
  4. Select the language of the speech or set it to auto-detect.
  5. Choose audio files by clicking Browse files or click the microphone icon to start recording audio live.
    Screenshot of uploading an audio file or select to record audio with a microphone
  6. The output of the recorded or uploaded audio will be displayed on the right of the screen.
    Output of recorded audio

Option 2: Implement Speech services through Speech SDK, Speech CLI, or REST APIs (coding required)

Azure Speech service is also available via the Speech SDK, the REST API, and the Speech CLI. You can reference an out-of-the-box model or your own custom model through the keys and location/region of a completed deployment.
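For example, a minimal recognize-once call with the Python Speech SDK looks like the sketch below. The key and region values are placeholders (step 1 below shows where to find yours), and the SDK is installed with pip install azure-cognitiveservices-speech.

    import azure.cognitiveservices.speech as speechsdk

    # Placeholders: copy these from your Speech resource (see step 1 below).
    speech_key = "YOUR_SPEECH_KEY"
    service_region = "YOUR_REGION"  # e.g. "westus"

    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

    print("Say something into your microphone...")
    result = speech_recognizer.recognize_once()  # listens for a single utterance

    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(result.text))
    elif result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized.")
    elif result.reason == speechsdk.ResultReason.Canceled:
        print("Canceled: {}".format(result.cancellation_details.reason))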

  1. Find keys and location/region
    1. Open the collapsed menu in the upper left corner of the screen and click All resources.
      Screenshot of selecting all resources
    2. Select the name of your Cognitive Service resource.
    3. On the left pane, select Keys and Endpoint under Resource Management.
      Screenshot of selecting keys and endpoint
    4. Copy one of the two keys available (and also the location/region value for SDK calls).
  2. Visit the Speech-to-text overview for quickstarts, sample code, and how-to guides on customisations. Some relevant links (also available from the overview page):
    • Speech-to-text REST API: Speech-to-text has two different REST APIs. Each API serves a specific purpose and uses a different set of endpoints (a minimal example of calling the short-audio endpoint appears after this list).
    • Speech SDK: Quickstart guides are available in C#, C++, Java, JavaScript, Objective-C/Swift, and Python.
    • Speech SDK GitHub Repository: Sample code is available in C#, C++, Java, JavaScript, Objective-C, Swift, and Python. The sample code can be edited to extract more information from each speech-to-text utterance. For example, the additions below to the Python from-microphone sample print the model's alternative interpretations (NBest) of each utterance, together with the confidence level of each.
      • At the start of the code:

        import json

      • At the speech_config options:

        speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
        speech_config.set_service_property(name='wordLevelConfidence', value='true', channel=speechsdk.ServicePropertyChannel.UriQueryParameter)
        speech_config.set_service_property(name='format', value='detailed', channel=speechsdk.ServicePropertyChannel.UriQueryParameter)

      • After print("Recognized: {}".format(result.text)):

        my_json = json.loads(result.json)
        for i in my_json.get('NBest'):
            print(str(i.get('Lexical')) + ": " + str(i.get('Confidence')))

        Screenshot of code as described above and output
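The short-audio REST endpoint mentioned in the list above can also be called directly without the SDK. Below is a minimal sketch using the requests library; the key, region, and file name are placeholders, and this endpoint is intended for brief recordings such as a single utterance.

    import requests

    # Placeholders: copy these from your Speech resource.
    speech_key = "YOUR_SPEECH_KEY"
    service_region = "YOUR_REGION"  # e.g. "westus"

    url = f"https://{service_region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": speech_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    params = {"language": "en-US", "format": "detailed"}  # 'detailed' adds NBest alternatives

    # Placeholder path: a 16 kHz, 16-bit, mono PCM WAV file.
    with open("recording.wav", "rb") as audio_file:
        response = requests.post(url, params=params, headers=headers, data=audio_file)

    response.raise_for_status()
    result = response.json()
    print(result.get("DisplayText"))
    for alternative in result.get("NBest", []):
        print(alternative.get("Lexical"), alternative.get("Confidence"))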


Option 3: Batch Transcription from File (coding required)

Batch transcription is a set of REST API operations that enables transcription of a large amount of audio in storage. Azure blob storage is used for this service. The batch transcription API supports the following formats:

  • WAV, PCM, 16-bit, 8kHz/16kHz mono or stereo
  • MP3, PCM, 16-bit, 8kHz/16kHz mono or stereo
  • OGG, OPUS, 16-bit, 8kHz/16kHz mono or stereo
  1. Create an Azure blob storage
    1. Open the collapsed menu in the upper left corner of the screen and click Storage accounts
      Screenshot of clicking storage accounts on side menu
    2. Click Create
      Screenshot of clicking create under storage accounts
    3. Fill in the relevant fields, then click Review + create.
  2. Wait for the storage account to deploy, then click on it to open the overview.
  3. Create a new Storage Container
    1. Click on Containers under Data Storage
      Screenshot of clicking containers under data storage side menu
    2. Click + Container
      Screenshot of creating a new container
    3. Fill in container name and change Public access level to Blob (anonymous read access for blobs only)
    4. Click Create
  4. Open the newly created container and upload your audio files.
  5. Generate SAS URL
    1. Click on an uploaded audio file
    2. Click the Generate SAS tab.
      Screenshot of the generate SAS tab for an audio file
    3. Update relevant information and click Generate SAS token and URL
    4. Copy the Blob SAS URL to be used in the code.
  6. Find keys and location/region
    1. Open the collapsed menu in the upper left corner of the screen and click All resources.
      Screenshot of selecting all resources
    2. Select the name of your Cognitive Service resource.
    3. On the left pane, select Keys and Endpoint under Resource Management.
      Screenshot of selecting keys and endpoint
    4. Copy one of the two keys available and the region value.
  7. Download and install the API client library through Swagger. (The steps below are taken from the Azure GitHub repository.)
    1. Go to https://editor.swagger.io.
    2. Click File, then click Import URL
    3. Enter the Swagger URL for the Speech Services API, replacing <your-region> with your Speech resource's region: https://<your-region>.dev.cognitive.microsoft.com/docs/services/speech-to-text-api-v3-0/export?DocumentFormat=Swagger&ApiName=Speech%20to%20Text%20API%20v3.0
    4. Click Generate Client and select Python.
    5. Save the client library.
    6. Extract the downloaded python-client-generated.zip somewhere in your file system.
    7. Install the extracted python-client module in your Python environment using pip: pip install path/to/package/python-client.
    8. The installed package has the name swagger_client. You can check that the installation worked using the command python -c "import swagger_client".
  8. Install other dependencies
    1. Install the requests library with the pip install requests command.
  9. Run the sample code, making sure to update the following:
    1. Subscription key and region (See step 6)
    2. URL of audio recordings in blob storage (See step 5)
  10. Following all of the steps above and running the code should generate transcription results.
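As an alternative to the generated swagger_client library, the same v3.0 operations can be performed with plain REST calls using the requests library installed in step 8. Below is a minimal sketch of the create-poll-download flow; the key, region, and SAS URL values are placeholders (see steps 5 and 6).

    import time
    import requests

    # Placeholders: see steps 5 and 6 above.
    speech_key = "YOUR_SPEECH_KEY"
    service_region = "YOUR_REGION"
    blob_sas_url = "YOUR_BLOB_SAS_URL"

    base_url = f"https://{service_region}.api.cognitive.microsoft.com/speechtotext/v3.0"
    headers = {"Ocp-Apim-Subscription-Key": speech_key, "Content-Type": "application/json"}

    # Create the transcription job.
    body = {
        "contentUrls": [blob_sas_url],
        "locale": "en-US",
        "displayName": "Batch transcription example",
    }
    response = requests.post(f"{base_url}/transcriptions", json=body, headers=headers)
    response.raise_for_status()
    transcription_url = response.json()["self"]

    # Poll until the job finishes.
    while True:
        status = requests.get(transcription_url, headers=headers).json()["status"]
        if status in ("Succeeded", "Failed"):
            break
        time.sleep(30)

    # List the result files and print each transcription.
    if status == "Succeeded":
        files = requests.get(f"{transcription_url}/files", headers=headers).json()
        for f in files["values"]:
            if f["kind"] == "Transcription":
                print(requests.get(f["links"]["contentUrl"]).text)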

Additional Option: Improve Accuracy through training your own Custom model

Custom Speech allows you to evaluate and improve Microsoft’s speech-to-text accuracy through further model training. This option requires additional test and training data, such as human-labelled transcripts and related text. For more information, visit Custom Speech.

Learn More