This microservice takes in an audio stream (.wav format, 8 kHz or 16 kHz) and transcribes commands across various domains in near real time.

The ASR model is a neural network that maps sound features to English words. It is trained to perform well on Singaporean-accented English speech and is expected to perform best when the speech content relates to generic news topics.

The model should also work for other content types and accents, but with lower accuracy.

Suggested use cases

One potential use case is live transcription of call-center recordings.

Websocket API

[Figure: Websocket API events sequence]



1. Establish connection (websocket)

Client connects to the server with an extra header field.

API endpoint (websocket)

wss://onlinecommandasr.sentient.io

Authentication

Authentication is done using the x-api-key field in the connect header.

Content-Type

application/json

Output

JSON

Server returns HTTP status code 101 (Switching Protocols) if the connection is upgraded and authenticated successfully. If the status code is not 101, refer to the Sentient Standard Errors for troubleshooting.




2. Connected

Client sends the JSON message below. Input parameters:

Field          Type     Description
x-api-key      String   Your API key string.
action         String   "start"
model          String   Model name: "en" (English) or "zh" (Chinese).
wordlist       String   Space-delimited list of command words.
sampling-rate  Integer  Optional. Sampling rate of the input audio. 8000 (default) or 16000.

Sample input JSON

{
  "x-api-key": "Replace this with your APIKey",
  "action": "start",
  "model": "en",
  "sampling-rate": 8000,
  "wordlist": "play stop pause review"
}

Server returns JSON:

Field   Type    Description
status  String  "listening"

Sample output JSON

{
  "status": "listening"
}
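As a sketch of this step, the "start" request can be assembled and sanity-checked locally before it is sent. The field names come from the table above; the helper function itself is illustrative, not part of the service:

```python
import json

def build_start_request(api_key, model, wordlist=None, sampling_rate=None):
    """Assemble the JSON "start" message from step 2 (illustrative helper)."""
    req = {"x-api-key": api_key, "action": "start", "model": model}
    if wordlist is not None:
        req["wordlist"] = wordlist            # space-delimited command words
    if sampling_rate is not None:
        req["sampling-rate"] = sampling_rate  # 8000 (default) or 16000
    return json.dumps(req)

msg = build_start_request("MY_KEY", "en",
                          wordlist="play stop pause review",
                          sampling_rate=8000)
decoded = json.loads(msg)
```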


3. Message exchanges

Client sends byte stream of audio:

  • Format: wav

  • Channel Type: mono

  • Sample Rate: 8 kHz or 16 kHz
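The format requirements above can be verified on the client side before streaming, using the standard-library wave module. This is an optional client-side check, not part of the API:

```python
import io
import wave

def check_wav(fileobj):
    """Raise if the audio is not mono at 8 or 16 kHz; return (channels, rate)."""
    with wave.open(fileobj, 'rb') as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
    if channels != 1:
        raise ValueError('audio must be mono')
    if rate not in (8000, 16000):
        raise ValueError('sample rate must be 8000 or 16000 Hz')
    return channels, rate

# Build a tiny mono 16 kHz wav in memory just to exercise the check.
buf = io.BytesIO()
with wave.open(buf, 'wb') as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(b'\x00\x00' * 160)
buf.seek(0)
channels, rate = check_wav(buf)
```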

Server returns 2 possible results, depending on whether the utterance is partial or complete:

  • If utterance is partial, server returns:

Field    Type    Description
partial  String  Partial results of the received audio stream.

Sample output JSON

{
  "partial": "HELLO WORLD THESE"
}
  • If utterance is complete, server returns:

Field   Type           Description
text    String         Predicted text for the utterance.
result  Array of dict  Predicted words and metadata for the utterance. Each dict contains:
  word   String        Predicted word.
  start  Float         Start time of the word, in seconds, offset from the start of the stream.
  end    Float         End time of the word, in seconds, offset from the start of the stream.
  conf   Float         Confidence score, 0.0 to 1.0.

Sample output JSON

{
  "result": [
    {"word": "HELLO", "start": 3.42, "end": 3.63, "conf": 1},
    {"word": "WORLD", "start": 3.63, "end": 4.02, "conf": 1},
    {"word": "THIS", "start": 4.02, "end": 4.14, "conf": 1},
    {"word": "IS", "start": 4.14, "end": 4.26, "conf": 0.9976},
    {"word": "A", "start": 4.26, "end": 4.439, "conf": 0.838751},
    {"word": "TEST", "start": 4.47, "end": 4.92, "conf": 0.96717}
  ],
  "text": "HELLO WORLD THIS IS A TEST"
}
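A receive loop has to distinguish the two message shapes above by their keys. A minimal dispatcher might look like the following (the function name is ours, not the API's):

```python
import json

def classify_message(raw):
    """Return ('partial', text) or ('final', text, words) based on the keys."""
    msg = json.loads(raw)
    if 'result' in msg:
        # Complete utterance: full text plus per-word metadata.
        return ('final', msg.get('text', ''), msg['result'])
    if 'partial' in msg:
        # In-progress hypothesis for the current utterance.
        return ('partial', msg['partial'])
    raise ValueError('unrecognised server message: %r' % msg)

kind_p = classify_message('{"partial": "HELLO WORLD THESE"}')
kind_f = classify_message(
    '{"result": [{"word": "GOOD", "start": 3.42, "end": 3.63, "conf": 1}],'
    ' "text": "GOOD"}')
```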


4. End of stream

Client sends a 0-byte message to indicate the end of the stream.

Server returns the final (partial or complete) result in JSON.

Field    Type           Description
text     String         Predicted text for the utterance.
partial  String         Only for a partial result. Partial results of the received audio stream.
result   Array of dict  Only for a complete utterance. Predicted words and metadata for the utterance. Each dict contains:
  word   String         Predicted word.
  start  Float          Start time of the word, in seconds, offset from the start of the stream.
  end    Float          End time of the word, in seconds, offset from the start of the stream.
  conf   Float          Confidence score, 0.0 to 1.0.

Sample output JSON

{
  "text": "GOOD JOB",
  "result": [
    {"word": "GOOD", "start": 3.42, "end": 3.63, "conf": 1},
    {"word": "JOB", "start": 3.63, "end": 4.02, "conf": 1}
  ]
}
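As an illustration of consuming the final payload, a small helper (ours, not part of the API) can flag words whose confidence score falls below a chosen threshold, e.g. for manual review:

```python
def low_confidence_words(result, threshold=0.9):
    """Return words from a final `result` array whose conf is below threshold."""
    return [w['word'] for w in result if w['conf'] < threshold]

# Word entries in the shape documented above.
result = [
    {"word": "HELLO", "start": 3.42, "end": 3.63, "conf": 1},
    {"word": "A", "start": 4.26, "end": 4.439, "conf": 0.838751},
    {"word": "TEST", "start": 4.47, "end": 4.92, "conf": 0.96717},
]
flagged = low_confidence_words(result)
```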

Sample Python client file

#!/usr/bin/env python3
import asyncio
import websockets
import json
import time
import logging
import argparse

logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)

parser = argparse.ArgumentParser(description='Simple ASR websocket client.')
parser.add_argument('uri', type=str, help='Websocket URI.')
parser.add_argument('apikey', type=str, help='API key.')
parser.add_argument('model', type=str, help='Model name.')
parser.add_argument('in_wav', type=str, help='Input wav file.')
parser.add_argument('-v', dest='verbose', action='store_true', help='Verbose mode')
parser.add_argument('-q', dest='quiet', action='store_true', help='Quiet mode')


async def remote_asr(uri, apikey, model, in_wav):
    # Step 1: the API key goes in an extra header on the websocket handshake.
    extra_headers = {'x-api-key': apikey}
    try:
        async with websockets.connect(uri, extra_headers=extra_headers) as websocket:
            start = time.time()
            # Step 2: send the "start" action and wait for the "listening" status.
            json_req = json.dumps({'action': 'start', 'model': model})
            await websocket.send(json_req)
            data = await websocket.recv()
            jsondata = json.loads(data)
            logger.debug(jsondata)
            if 'status' in jsondata and jsondata['status'] == 'listening':
                # Step 3: stream the audio in chunks, printing partial and
                # complete results as they arrive.
                with open(in_wav, 'rb') as wf:
                    while True:
                        data = wf.read(16000)
                        if len(data) == 0:
                            break
                        await websocket.send(data)
                        data = await websocket.recv()
                        logger.debug(data)
                        jsondata = json.loads(data)
                        if logger.level == logging.INFO and 'partial' in jsondata:
                            print('*', jsondata['partial'].rstrip('\n'), end='\r')
                        if 'result' in jsondata:
                            logger.info(jsondata['text'].rstrip('\n'))
                # Step 4: a 0-byte message signals end of stream.
                stopmessage = b''
                await websocket.send(stopmessage)
                data = await websocket.recv()
                jsondata = json.loads(data)
                logger.debug(jsondata)
                if 'result' in jsondata:
                    logger.info('\n' + jsondata['text'].rstrip('\n'))
                end = time.time()
                logger.info('Total elapsed time: %2.2f sec' % (end - start))
            else:
                logger.error(jsondata)
    except websockets.exceptions.WebSocketException as e:
        logger.error('error: {}'.format(e))


if __name__ == '__main__':
    kwargs = vars(parser.parse_args())
    if kwargs['verbose']:
        logger.setLevel(logging.DEBUG)
    elif kwargs['quiet']:
        logger.setLevel(logging.ERROR)

    asyncio.run(remote_asr(kwargs['uri'], kwargs['apikey'],
                           kwargs['model'], kwargs['in_wav']))

Response errors

401 Unauthorized
{
  "message": "Missing Authentication Token",
  "status": "Failure"
}

402 Payment Required
{
  "message": "Insufficient Credits Kindly Top Up",
  "status": "Failure"
}

403 Forbidden
{
  "message": "Access Denied Unauthorized User",
  "status": "Failure"
}

404 Not Found
{
  "message": "Invalid Request URL",
  "status": "Failure"
}

500 Internal Server Error
{
  "message": "Internal Server Error",
  "status": "Failure"
}
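All of the documented error payloads share the same shape, so a single client-side handler can cover them. The helper below is a sketch of ours, not part of the service; which codes are worth retrying is an assumption (here, only 500):

```python
import json

# Assumption for this sketch: only server-side errors are worth retrying;
# auth, credit, and URL errors (401/402/403/404) need operator action.
RETRYABLE = {500}

def should_retry(status_code, body):
    """Decide whether to retry, based on the documented error payload shape."""
    payload = json.loads(body)
    if payload.get('status') != 'Failure':
        return False  # not an error payload
    return status_code in RETRYABLE

r1 = should_retry(500, '{"message": "Internal Server Error", "status": "Failure"}')
r2 = should_retry(401, '{"message": "Missing Authentication Token", "status": "Failure"}')
```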