This microservice takes in an audio stream (WAV format, mono, 8 kHz or 16 kHz) and transcribes commands in various domains in near real time.
The ASR model is a neural network that maps sound features to words in English. It is trained to perform well on Singaporean-accented English speakers and is expected to perform best when the speech content relates to generic news topics. The model should also work for other content types and accents, but with lower accuracy. One potential use case is the live transcription of call center recordings.
Suggested use cases
Extraction of text from voice (both in real time and from sound files), especially Singaporean-accented English
Extraction of text from video, especially Singaporean-accented English
Websocket API
Events Sequence
1. Establish connection (websocket)
Client connects to the server with an extra header field.

API Endpoint (websocket): wss://onlinecommandasr.sentient.io
Authentication: via the x-api-key field in the connect header
Content-Type: application/json
Output: JSON

Server returns HTTP status code 100 if connected and authenticated successfully. If the status code is not 100, refer to the Sentient Standard Errors for troubleshooting.
2. Connected
Client sends the JSON message below.

Input parameters

Field         | Type    | Description
x-api-key     | String  | Your API key string
action        | String  | "start"
model         | String  | Model name (the sample below uses "en")
wordlist      | String  | Space-delimited list of command words
sampling-rate | Integer | Optional. Sampling rate of the input wave file: 8000 (default) or 16000
Sample input JSON
{
"x-api-key": "Replace this with your APIKey",
"action": "start",
"model": "en",
"sampling-rate": 8000,
"wordlist": "play stop pause review"
}
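As a sketch, the start message can be assembled programmatically. The helper below mirrors the input parameter table above; the function name and its defaults are illustrative, not part of the API.

```python
import json

def build_start_message(api_key, model="en", wordlist=None, sampling_rate=8000):
    """Assemble the JSON 'start' message from the input parameter table.

    The "en" model and the default sampling rate of 8000 are taken
    from the sample input JSON above.
    """
    msg = {
        "x-api-key": api_key,
        "action": "start",
        "model": model,
        "sampling-rate": sampling_rate,
    }
    if wordlist:
        # wordlist must be a space-delimited string of command words
        if isinstance(wordlist, (list, tuple)):
            wordlist = " ".join(wordlist)
        msg["wordlist"] = wordlist
    return json.dumps(msg)

print(build_start_message("YOUR_API_KEY", wordlist=["play", "stop", "pause", "review"]))
```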
Server returns JSON:

Field  | Type   | Description
status | String | "listening"
Sample output JSON
{
"status": "listening"
}
3. Message exchanges
Client sends a byte stream of audio:
Format: wav
Channel Type: mono
Sample Rate: 8 kHz or 16 kHz
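Before streaming, a client can verify that its audio meets these requirements with the Python standard-library `wave` module. This is a minimal sketch, not part of the service; it builds a synthetic 1-second 16 kHz mono file in memory to demonstrate.

```python
import io
import wave

def check_wav(path_or_file):
    """Return (channels, sample_rate), raising ValueError if the audio
    does not meet the service requirements (mono, 8 kHz or 16 kHz)."""
    with wave.open(path_or_file, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
    if channels != 1:
        raise ValueError(f"expected mono audio, got {channels} channels")
    if rate not in (8000, 16000):
        raise ValueError(f"expected 8000 or 16000 Hz, got {rate}")
    return channels, rate

# Demonstrate with a synthetic 1-second 16 kHz mono file in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)
buf.seek(0)
print(check_wav(buf))  # (1, 16000)
```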
Server returns one of two possible results, depending on whether the utterance is partial or complete.
If the utterance is partial, the server returns:
Field   | Type   | Description
partial | String | Partial result of the received audio stream
Sample output JSON
{
"partial": "HELLO WORLD THESE"
}
If the utterance is complete, the server returns:

Field  | Type          | Description
text   | String        | Predicted text for the utterance.
result | Array of dict | Predicted words and metadata of the utterance. Each dict contains:
  word  | String | Predicted word.
  start | Float  | Start time of the word, in seconds, offset from the start of the stream.
  end   | Float  | End time of the word, in seconds, offset from the start of the stream.
  conf  | Float  | Confidence score, 0.0 to 1.0.
Sample output JSON
{
"result": [{"word": "HELLO", "start": 3.42, "end": 3.63, "conf": 1},
{"word": "WORLD", "start": 3.63, "end": 4.02, "conf": 1},
{"word": "THIS", "start": 4.02, "end": 4.14, "conf": 1},
{"word": "IS", "start": 4.14, "end": 4.26, "conf": 0.9976},
{"word": "A", "start": 4.26, "end": 4.439, "conf": 0.838751},
{"word": "TEST", "start": 4.47, "end": 4.92, "conf": 0.96717}],
"text": "HELLO WORLD THIS IS A TEST"
}
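The `result` array can be post-processed client-side, for example to keep only high-confidence words. The helper below is a sketch using the sample output above; the 0.9 threshold is an arbitrary illustration, not a service recommendation.

```python
import json

def confident_words(message, threshold=0.9):
    """From a complete-utterance message, return (word, conf) pairs
    whose confidence meets the threshold."""
    return [(w["word"], w["conf"])
            for w in message.get("result", [])
            if w["conf"] >= threshold]

# Sample taken from the output JSON above (abridged).
sample = json.loads('''
{"result": [{"word": "HELLO", "start": 3.42, "end": 3.63, "conf": 1},
            {"word": "A", "start": 4.26, "end": 4.439, "conf": 0.838751},
            {"word": "TEST", "start": 4.47, "end": 4.92, "conf": 0.96717}],
 "text": "HELLO A TEST"}
''')
print(confident_words(sample))  # [('HELLO', 1), ('TEST', 0.96717)]
```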
4. End of stream
Client sends a zero-byte message to indicate the end of the stream.
Server returns the final (partial or complete) result in JSON.
Field   | Type          | Description
text    | String        | Predicted text for the utterance.
partial | String        | Only for a partial result. Partial result of the received audio stream.
result  | Array of dict | Only for a complete utterance. Predicted words and metadata of the utterance. Each dict contains:
  word  | String | Predicted word.
  start | Float  | Start time of the word, in seconds, offset from the start of the stream.
  end   | Float  | End time of the word, in seconds, offset from the start of the stream.
  conf  | Float  | Confidence score, 0.0 to 1.0.
Sample output JSON
{
"text": "GOOD JOB",
"result": [{"word": "GOOD", "start": 3.42, "end": 3.63, "conf": 1},
{"word": "JOB", "start": 3.63, "end": 4.02, "conf": 1}]
}
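The streaming framing described in steps 3 and 4 (fixed-size audio chunks followed by a zero-byte end-of-stream message) can be sketched without a network connection. The generator below uses the 16000-byte chunk size from the sample client; the function name is illustrative.

```python
import io

def audio_chunks(stream, chunk_size=16000):
    """Yield fixed-size chunks from an audio stream, then a final
    zero-byte chunk as the end-of-stream marker."""
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield data
    yield b""  # zero-byte message: end of stream

# 40000 bytes of fake audio -> two full chunks, one partial, then the marker.
fake_audio = io.BytesIO(b"\x00" * 40000)
sizes = [len(chunk) for chunk in audio_chunks(fake_audio)]
print(sizes)  # [16000, 16000, 8000, 0]
```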
Sample Python client file
#!/usr/bin/env python3
import argparse
import asyncio
import json
import logging
import time

import websockets

logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)

parser = argparse.ArgumentParser(description='Simple ASR websocket client.')
parser.add_argument('uri', type=str, help='Websocket URI.')
parser.add_argument('apikey', type=str, help='API key.')
parser.add_argument('model', type=str, help='Model name.')
parser.add_argument('in_wav', type=str, help='Input wav file (8 kHz or 16 kHz).')
parser.add_argument('-v', dest='verbose', action='store_true', help='Verbose mode')
parser.add_argument('-q', dest='quiet', action='store_true', help='Quiet mode')


async def remote_asr(uri, apikey, model, in_wav):
    # Step 1: authenticate via the x-api-key field in the connect header.
    extra_headers = {'x-api-key': apikey}
    try:
        async with websockets.connect(uri, extra_headers=extra_headers) as websocket:
            start = time.time()
            # Step 2: send the start message.
            json_req = json.dumps({'action': 'start', 'model': model})
            await websocket.send(json_req)
            data = await websocket.recv()
            jsondata = json.loads(data)
            logger.debug(jsondata)
            if jsondata.get('status') == 'listening':
                # Step 3: stream the audio in chunks.
                with open(in_wav, 'rb') as wf:
                    while True:
                        data = wf.read(16000)
                        if len(data) == 0:
                            break
                        await websocket.send(data)
                        data = await websocket.recv()
                        logger.debug(data)
                        jsondata = json.loads(data)
                        if logger.level == logging.INFO and 'partial' in jsondata:
                            print('*', jsondata['partial'].rstrip('\n'), end='\r')
                        if 'result' in jsondata:
                            logger.info(jsondata['text'].rstrip('\n'))
                # Step 4: a zero-byte message signals end of stream.
                await websocket.send(b'')
                data = await websocket.recv()
                jsondata = json.loads(data)
                logger.debug(jsondata)
                if 'result' in jsondata:
                    logger.info('\n' + jsondata['text'].rstrip('\n'))
                end = time.time()
                logger.info('Total elapsed time: %2.2f sec' % (end - start))
            else:
                logger.error(jsondata)
    except websockets.exceptions.WebSocketException as e:
        logger.error('error: {}'.format(e))


if __name__ == '__main__':
    kwargs = vars(parser.parse_args())
    if kwargs['verbose']:
        logger.setLevel(logging.DEBUG)
    elif kwargs['quiet']:
        logger.setLevel(logging.ERROR)
    asyncio.run(remote_asr(kwargs['uri'], kwargs['apikey'],
                           kwargs['model'], kwargs['in_wav']))
Response errors
401 Unauthorized
{
"message":"Missing Authentication Token",
"status":"Failure"
}
403 Forbidden
{
"message":"Access Denied Unauthorized User",
"status":"Failure"
}
402 Payment Required
{
"message":"Insufficient Credits Kindly Top Up",
"status":"Failure"
}
404 Not Found
{
"message":"Invalid Request URL",
"status":"Failure"
}
500 Internal Server error
{
"message":"Internal Server Error",
"status":"Failure"
}
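Since all error bodies above share the same `{"message", "status"}` shape, a client can handle them uniformly. The exception class and helper below are a sketch assumed for illustration; their names are not part of the API.

```python
import json

class SentientAPIError(Exception):
    """Raised for a Failure response. The class name is illustrative."""
    def __init__(self, status_code, message):
        self.status_code = status_code
        self.message = message
        super().__init__(f"{status_code}: {message}")

def raise_for_failure(status_code, body):
    """Parse a response body and raise if its status is 'Failure'."""
    payload = json.loads(body)
    if payload.get("status") == "Failure":
        raise SentientAPIError(status_code, payload.get("message", ""))
    return payload

try:
    raise_for_failure(402, '{"message":"Insufficient Credits Kindly Top Up","status":"Failure"}')
except SentientAPIError as e:
    print(e)  # 402: Insufficient Credits Kindly Top Up
```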