Create your own Voice-based application using Python

Sundar Krishnan
4 min read · Jul 23, 2018

Voice-based devices and applications are growing rapidly. Today, assistants such as Google Assistant and Alexa take our voice as input, process it, and perform actions based on it. They rely on state-of-the-art techniques in speech-to-text, natural language understanding, deep learning, and text-to-speech. Before we dive into the code, let us understand a voice-based application at a high level.

The first step in building a voice-based application is to listen for the user's voice continuously and then transcribe that voice to text. The Python code shared in this article covers this part. The rest varies based on the application, so you can add your own application logic at the end of the code snippet.

It is difficult to build a speech-to-text engine with high accuracy on your own, because the model must be trained on large amounts of data from both clean and noisy environments. Industry leaders such as Google, Amazon, Baidu, IBM Watson, Wit, and Microsoft provide API-based services that can be easily integrated into applications. Google also offers Voice Actions, an API-based service for performing actions within an app seamlessly using voice. Annyang, a tiny JavaScript library, lets you add voice recognition to websites easily. If you are interested in developing your own speech-to-text application, please look at the links below.

Here are the steps to follow before we build the Python-based application.

  1. Create a Google cloud account.
  2. Click on “Select a project” to create a project in Google Cloud. Click on “New project” and provide a name.
  3. Type “Cloud Speech API” on the project search page. We need to enable this API to use the Speech-to-Text service. In addition, we need to provide credit/debit card or bank account details to use the free API service. There is no automatic charge after the free trial ends, so please provide the details and enable the API. The limitations of the free trial are described in the Google Speech API documentation.
  4. In the API page, click on the “Credentials” section and then click on “Create Credentials”.
  5. Once done, we need to create a service account so that we can download the key as a JSON file. In the search bar, type “service accounts” and create a new service account. Provide a name and assign “Owner” as the project role. Click on “Furnish a new private key”, select “JSON” as the key type, and click “Save”. This will download the API key in JSON format.
  6. We are almost done. All we need to do now is set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the downloaded API key. This can be done with the command below in the Terminal/Command Prompt.
export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"

### For example
export GOOGLE_APPLICATION_CREDENTIALS="/Users/Downloads/[FILE_NAME].json"
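If you prefer to do this from inside the script instead of the shell, you can set the same environment variable in Python before any Google client code runs. This is just a convenience sketch; the path is a placeholder for your own key file.

import os

# Point the Google client libraries at the downloaded service-account key.
# Replace the placeholder path with the location of your own JSON key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/Downloads/[FILE_NAME].json"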

And now, here is the Python code.

Note: You will need to enable the microphone and speakers on your laptop.
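The snippets that follow rely on a few imports and module-level constants (FORMAT, CHANNELS, RATE, CHUNK, THRESHOLD, SILENCE_LIMIT) that are not shown inside the functions themselves. A minimal skeleton is sketched below; the specific values and the entry point are assumptions that you may need to tune for your microphone.

import math
import audioop
import threading
from os import system

import pyaudio
import speech_recognition as rs

# Audio stream parameters (assumed values; adjust for your hardware).
FORMAT = pyaudio.paInt16   # 16-bit samples
CHANNELS = 1               # mono input
RATE = 16000               # sampling rate in Hz
CHUNK = 1024               # frames read per buffer

THRESHOLD = 2500           # audio intensity that counts as "speech" (tune this)
SILENCE_LIMIT = 1          # seconds to wait before sampling the mic again

# ... audio_int(), listen(x) and process(text) are defined below ...

if __name__ == '__main__':
    audio_int()            # start the listening loop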

Let us break the code into pieces to understand each function.

audio_int()

def audio_int(num_samples=50):
    """ Gets average audio intensity of your mic sound. You can use it to get
    average intensities while you're talking and/or silent. The average
    is the avg of the 20% largest intensities recorded.
    """

    p = pyaudio.PyAudio()

    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)

    values = [math.sqrt(abs(audioop.avg(stream.read(CHUNK), 4)))
              for x in range(num_samples)]
    values = sorted(values, reverse=True)
    r = sum(values[:int(num_samples * 0.2)]) / int(num_samples * 0.2)
    print(" Average audio intensity is ", r)
    stream.close()
    p.terminate()

    if r > THRESHOLD:
        listen(0)

    threading.Timer(SILENCE_LIMIT, audio_int).start()

The audio_int() function constantly samples the microphone. The level above which it should invoke an action is determined by the variable THRESHOLD; any audio intensity below this threshold is ignored, which ensures that background noise is not processed. You can adjust this variable to suit your environment, as shown in the sketch below.
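To pick a sensible THRESHOLD, you can reuse the same measurement logic without triggering listen(): run it once while staying silent and once while talking, then choose a value between the two averages. The helper below is a hypothetical sketch, not part of the original script.

def measure_intensity(num_samples=50):
    """Measure average mic intensity without invoking listen(); handy for tuning THRESHOLD."""
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)
    values = sorted((math.sqrt(abs(audioop.avg(stream.read(CHUNK), 4)))
                     for _ in range(num_samples)), reverse=True)
    stream.close()
    p.terminate()
    # Same "average of the 20% largest intensities" as audio_int()
    return sum(values[:int(num_samples * 0.2)]) / int(num_samples * 0.2)

# silent_level = measure_intensity()      # stay quiet while this runs
# speaking_level = measure_intensity()    # talk normally while this runs
# THRESHOLD = (silent_level + speaking_level) / 2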

listen(x)

def listen(x):
    r = rs.Recognizer()
    if x == 0:
        system('say Hi. How can I help?')
    with rs.Microphone() as source:
        audio = r.listen(source)
    try:
        text = r.recognize_google(audio)
        y = process(text.lower())
        return(y)
    except:
        if x == 1:
            system('say Good Bye!')
        else:
            system('say I did not get that. Please say again.')
            listen(1)

The listen(x) function is invoked once the audio intensity is above the set threshold. For ease of use, I have used the system function (which calls the macOS say command) instead of a Text-to-Speech API. So, let’s say you begin your conversation with “Hi there”; the code will then respond through the system call with “Hi. How can I help?”. You can begin your conversation with any opener; as long as the audio intensity is high enough, it will work. And you can change the voice greeting too:

system('say "Put your command here"')
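Note that say is a macOS command, so the system calls above will not speak on Windows or Linux. If you are not on a Mac, a cross-platform text-to-speech library such as pyttsx3 can play the same role; the snippet below is a sketch assuming pyttsx3 is installed (pip install pyttsx3), not something the original script uses.

import pyttsx3

engine = pyttsx3.init()

def speak(message):
    """Cross-platform replacement for system('say ...')."""
    engine.say(message)
    engine.runAndWait()

speak('Hi. How can I help?')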

This is just the start. After this, whatever you speak is transcribed by the recognize_google(audio) function, and the transcribed text is stored in the variable text. If recognition fails at this step (for example, because your audio is too quiet or unclear), the code asks you once more to repeat what you said. If it fails again, it ends the conversation and you have to start all over again.
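The bare except in listen(x) treats every failure the same way. If you want to distinguish "speech could not be understood" from "the API call itself failed", the speech_recognition library raises UnknownValueError and RequestError respectively. A slightly more explicit variant of the function could look like this sketch:

def listen(x):
    r = rs.Recognizer()
    if x == 0:
        system('say Hi. How can I help?')
    with rs.Microphone() as source:
        audio = r.listen(source)
    try:
        text = r.recognize_google(audio)
        return process(text.lower())
    except rs.UnknownValueError:
        # Speech was detected but could not be transcribed.
        if x == 1:
            system('say Good Bye!')
        else:
            system('say I did not get that. Please say again.')
            listen(1)
    except rs.RequestError:
        # The request to the Google speech service failed (e.g. no network).
        system('say Sorry, the speech service is unavailable.')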

process(text)

def process(text):
    # ''''''''''''''''''''''''''''''''''''''''''''''''
    # '''''''  Your application goes here  '''''''''''
    # ''''''''''''''''''''''''''''''''''''''''''''''''
    pass

The final step is to tie your application into the process(text) function. I have left this function empty; you can customize this portion as you wish and process the text to respond to user queries.
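As a purely hypothetical illustration (not part of the original article), process(text) could answer a simple time question and echo everything else back, reusing the same system('say ...') trick:

from os import system
from datetime import datetime

def process(text):
    """Toy example: answer a time question, otherwise repeat what was heard."""
    if 'time' in text:
        now = datetime.now().strftime('%I:%M %p')
        system('say The time is {}'.format(now))
    else:
        system('say You said {}'.format(text))
    return text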

Voila! You have created your own voice-based application. Have fun!
