This microservice takes in an audio stream (.wav format, 8 kHz or 16 kHz) and transcribes commands across various domains in near real time.

The ASR model is a neural network that maps sound features to English words. It is trained to perform well on Singaporean-accented English speech and is expected to perform best when the speech content relates to generic news topics.

The model should also work for other content types and accents, but with lower accuracy.

Suggested use cases

One potential use case is live transcription of call-center recordings.

Websocket API

[Figure: Websocket API events sequence]



1. Establish connection (websocket)

Client connects to the server with an extra header field.

API endpoint (websocket)

wss://onlinecommandasr.sentient.io

Authentication

Authentication is done using the x-api-key field in the connect header.

Content-Type

application/json

Output

JSON

Server returns HTTP status code 101 (Switching Protocols) if the connection is upgraded and authenticated successfully. If the status code is not 101, refer to the Sentient Standard Errors for troubleshooting.




2. Connected

Client sends the JSON message below. Input parameters:

Field          Type     Description
x-api-key      String   Your API key string.
action         String   "start"
model          String   Model name: "en" (English) or "zh" (Chinese).
wordlist       String   Space-delimited list of command words.
sampling-rate  Integer  Optional. Sampling rate of the input audio. 8000 (default) or 16000.

Sample input JSON

{
  "x-api-key": "Replace this with your APIKey",
  "action": "start",
  "model": "en",
  "sampling-rate": 8000,
  "wordlist": "play stop pause review"
}

Server returns JSON:

Field   Type    Description
status  String  "listening"

Sample output JSON

{
  "status": "listening"
}
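As a sketch of this step, the "start" request can be assembled and sanity-checked locally before it is sent. The field names come from the table above; the helper function itself is illustrative, not part of the service:

```python
import json

def build_start_request(api_key, model, wordlist=None, sampling_rate=None):
    """Assemble the JSON "start" message from step 2 (illustrative helper)."""
    req = {"x-api-key": api_key, "action": "start", "model": model}
    if wordlist is not None:
        req["wordlist"] = wordlist            # space-delimited command words
    if sampling_rate is not None:
        req["sampling-rate"] = sampling_rate  # 8000 (default) or 16000
    return json.dumps(req)

msg = build_start_request("MY_KEY", "en",
                          wordlist="play stop pause review",
                          sampling_rate=8000)
decoded = json.loads(msg)
```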


3. Message exchanges

Client sends byte stream of audio:

  • Format: wav

  • Channel Type: mono

  • Sample Rate: 8 kHz or 16 kHz
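The format requirements above can be verified on the client side before streaming, using the standard-library wave module. This is an optional client-side check, not part of the API:

```python
import io
import wave

def check_wav(fileobj):
    """Raise if the audio is not mono at 8 or 16 kHz; return (channels, rate)."""
    with wave.open(fileobj, 'rb') as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
    if channels != 1:
        raise ValueError('audio must be mono')
    if rate not in (8000, 16000):
        raise ValueError('sample rate must be 8000 or 16000 Hz')
    return channels, rate

# Build a tiny mono 16 kHz wav in memory just to exercise the check.
buf = io.BytesIO()
with wave.open(buf, 'wb') as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(b'\x00\x00' * 160)
buf.seek(0)
channels, rate = check_wav(buf)
```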

Server returns 2 possible results, depending on whether the utterance is partial or complete:

  • If utterance is partial, server returns:

Field    Type    Description
partial  String  Partial results of the received audio stream.

Sample output JSON

{
  "partial": "HELLO WORLD THESE"
}
  • If utterance is complete, server returns:

Field   Type           Description
text    String         Predicted text for the utterance.
result  Array of dict  Predicted words and metadata for the utterance. Each dict contains:
  word   String        Predicted word.
  start  Float         Start time of the word, in seconds, offset from the start of the stream.
  end    Float         End time of the word, in seconds, offset from the start of the stream.
  conf   Float         Confidence score, 0.0 to 1.0.

Sample output JSON

{
  "result": [
    {"word": "HELLO", "start": 3.42, "end": 3.63, "conf": 1},
    {"word": "WORLD", "start": 3.63, "end": 4.02, "conf": 1},
    {"word": "THIS", "start": 4.02, "end": 4.14, "conf": 1},
    {"word": "IS", "start": 4.14, "end": 4.26, "conf": 0.9976},
    {"word": "A", "start": 4.26, "end": 4.439, "conf": 0.838751},
    {"word": "TEST", "start": 4.47, "end": 4.92, "conf": 0.96717}
  ],
  "text": "HELLO WORLD THIS IS A TEST"
}
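A receive loop has to distinguish the two message shapes above by their keys. A minimal dispatcher might look like the following (the function name is ours, not the API's):

```python
import json

def classify_message(raw):
    """Return ('partial', text) or ('final', text, words) based on the keys."""
    msg = json.loads(raw)
    if 'result' in msg:
        # Complete utterance: full text plus per-word metadata.
        return ('final', msg.get('text', ''), msg['result'])
    if 'partial' in msg:
        # In-progress hypothesis for the current utterance.
        return ('partial', msg['partial'])
    raise ValueError('unrecognised server message: %r' % msg)

kind_p = classify_message('{"partial": "HELLO WORLD THESE"}')
kind_f = classify_message(
    '{"result": [{"word": "GOOD", "start": 3.42, "end": 3.63, "conf": 1}],'
    ' "text": "GOOD"}')
```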


4. End of stream

Client sends a 0-byte message to indicate the end of the stream.

Server returns the final (partial or complete) result in JSON.

Field    Type           Description
text     String         Predicted text for the utterance.
partial  String         Only for a partial result. Partial results of the received audio stream.
result   Array of dict  Only for a complete utterance. Predicted words and metadata for the utterance. Each dict contains:
  word   String         Predicted word.
  start  Float          Start time of the word, in seconds, offset from the start of the stream.
  end    Float          End time of the word, in seconds, offset from the start of the stream.
  conf   Float          Confidence score, 0.0 to 1.0.

Sample output JSON

{
  "text": "GOOD JOB",
  "result": [
    {"word": "GOOD", "start": 3.42, "end": 3.63, "conf": 1},
    {"word": "JOB", "start": 3.63, "end": 4.02, "conf": 1}
  ]
}
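As an illustration of consuming the final payload, a small helper (ours, not part of the API) can flag words whose confidence score falls below a chosen threshold, e.g. for manual review:

```python
def low_confidence_words(result, threshold=0.9):
    """Return words from a final `result` array whose conf is below threshold."""
    return [w['word'] for w in result if w['conf'] < threshold]

# Word entries in the shape documented above.
result = [
    {"word": "HELLO", "start": 3.42, "end": 3.63, "conf": 1},
    {"word": "A", "start": 4.26, "end": 4.439, "conf": 0.838751},
    {"word": "TEST", "start": 4.47, "end": 4.92, "conf": 0.96717},
]
flagged = low_confidence_words(result)
```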

Sample Python client file

#!/usr/bin/env python3
import asyncio
import websockets
import json
import time
import logging
import argparse

logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)

parser = argparse.ArgumentParser(description='Simple ASR websocket client.')
parser.add_argument('uri', type=str, help='Websocket URI.')
parser.add_argument('apikey', type=str, help='API key.')
parser.add_argument('model', type=str, help='Model name.')
parser.add_argument('in_wav', type=str, help='Input wav file.')
parser.add_argument('-v', dest='verbose', action='store_true', help='Verbose mode')
parser.add_argument('-q', dest='quiet', action='store_true', help='Quiet mode')


async def remote_asr(uri, apikey, model, in_wav):
    # Step 1: the API key goes in an extra header on the websocket handshake.
    extra_headers = {'x-api-key': apikey}
    try:
        async with websockets.connect(uri, extra_headers=extra_headers) as websocket:
            start = time.time()
            # Step 2: send the "start" action and wait for the "listening" status.
            json_req = json.dumps({'action': 'start', 'model': model})
            await websocket.send(json_req)
            data = await websocket.recv()
            jsondata = json.loads(data)
            logger.debug(jsondata)
            if 'status' in jsondata and jsondata['status'] == 'listening':
                # Step 3: stream the audio in chunks, printing partial and
                # complete results as they arrive.
                with open(in_wav, 'rb') as wf:
                    while True:
                        data = wf.read(16000)
                        if len(data) == 0:
                            break
                        await websocket.send(data)
                        data = await websocket.recv()
                        logger.debug(data)
                        jsondata = json.loads(data)
                        if logger.level == logging.INFO and 'partial' in jsondata:
                            print('*', jsondata['partial'].rstrip('\n'), end='\r')
                        if 'result' in jsondata:
                            logger.info(jsondata['text'].rstrip('\n'))
                # Step 4: a 0-byte message signals end of stream.
                stopmessage = b''
                await websocket.send(stopmessage)
                data = await websocket.recv()
                jsondata = json.loads(data)
                logger.debug(jsondata)
                if 'result' in jsondata:
                    logger.info('\n' + jsondata['text'].rstrip('\n'))
                end = time.time()
                logger.info('Total elapsed time: %2.2f sec' % (end - start))
            else:
                logger.error(jsondata)
    except websockets.exceptions.WebSocketException as e:
        logger.error('error: {}'.format(e))


if __name__ == '__main__':
    kwargs = vars(parser.parse_args())
    if kwargs['verbose']:
        logger.setLevel(logging.DEBUG)
    elif kwargs['quiet']:
        logger.setLevel(logging.ERROR)

    asyncio.run(remote_asr(kwargs['uri'], kwargs['apikey'],
                           kwargs['model'], kwargs['in_wav']))

Response errors

401 Unauthorized
{
  "message": "Missing Authentication Token",
  "status": "Failure"
}

402 Payment Required
{
  "message": "Insufficient Credits Kindly Top Up",
  "status": "Failure"
}

403 Forbidden
{
  "message": "Access Denied Unauthorized User",
  "status": "Failure"
}

404 Not Found
{
  "message": "Invalid Request URL",
  "status": "Failure"
}

500 Internal Server Error
{
  "message": "Internal Server Error",
  "status": "Failure"
}
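All of the documented error payloads share the same shape, so a single client-side handler can cover them. The helper below is a sketch of ours, not part of the service; which codes are worth retrying is an assumption (here, only 500):

```python
import json

# Assumption for this sketch: only server-side errors are worth retrying;
# auth, credit, and URL errors (401/402/403/404) need operator action.
RETRYABLE = {500}

def should_retry(status_code, body):
    """Decide whether to retry, based on the documented error payload shape."""
    payload = json.loads(body)
    if payload.get('status') != 'Failure':
        return False  # not an error payload
    return status_code in RETRYABLE

r1 = should_retry(500, '{"message": "Internal Server Error", "status": "Failure"}')
r2 = should_retry(401, '{"message": "Missing Authentication Token", "status": "Failure"}')
```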