
Building a Voice Assistant using ChatGPT API

In today’s world, creating a custom application that can perform tasks specific to your needs has become an essential skill. With so many technologies and tools available, it can be challenging to decide where to start. In this blog post, we will build a Voice Assistant as a Streamlit app that makes OpenAI API calls to Whisper and ChatGPT, two of the most popular natural language processing models available.

ChatGPT is a cutting-edge conversational AI model that can understand and respond to a wide range of topics. Whisper is a powerful speech recognition model that can transcribe audio input into text with high accuracy. Both of these models were developed by OpenAI and are essential tools for building chatbots and virtual assistants.

In addition to ChatGPT and Whisper, we’ll be using Streamlit - a powerful framework for building interactive data science applications. With Streamlit, you can create custom web applications that allow users to explore and interact with data in a dynamic and intuitive way.

Our application consists of four main components: audio input recording, speech recognition using Whisper API, natural language processing via ChatGPT API to generate a response, and audio output synthesis using Google’s text-to-speech (gTTS). The schematic figure below illustrates these components.

Four components of our voice-assistant app
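
At a high level, the whole pipeline boils down to four calls. The functions named below are the ones we will build step by step in the rest of this post, so treat this as a roadmap rather than runnable-yet code:

from audio_recorder import record
from whisper_transcriber import get_transcription
from chatbot import get_response
from text_to_speech import run_tts

# High-level flow of the app (each function is defined later in this post)
record(seconds=5, filename='prompt.wav')                # 1. record the user's voice
prompt = get_transcription('prompt.wav')                # 2. speech -> text via the Whisper API
response = get_response(prompt)                         # 3. text -> reply via the ChatGPT API
run_tts(response['choices'][0]['message']['content'])   # 4. reply -> speech via gTTS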

Create an API Key

Setting up your OpenAI API key is a simple process that can open up a world of possibilities for your projects! With the recent release of API access to Whisper and ChatGPT (see announcement), you now have a powerful suite of tools to help you build amazing things. Plus, OpenAI currently offers $18 of free trial credit to new users, so there’s never been a better time to get started!

To get your API key set up, all you need to do is follow a few simple steps. First, log in (or sign up) to create an OpenAI account at https://platform.openai.com. Once you’re logged in, just click on your account name at the top-left of the screen to bring up the drop-down menu and select “View API keys”. From there, you can create a new secret key and copy it to a file saved locally.

  • Step 1: Log in to your OpenAI account
  • Step 2: Select “View API keys”
  • Step 3: Create a new secret key
  • Step 4: Copy the key and save it locally

To test the API key, save it in a plain-text file in your home folder at the path ~/OPENAI_API_KEY. Once that’s done, you can activate the key in Python by importing the necessary libraries and using the following code:

import os
import openai

# Set up the API key
home_dir = os.path.expanduser("~")
openai.api_key_path = os.path.join(home_dir, 'OPENAI_API_KEY')
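
As an optional sanity check before calling Whisper, you can list the models your account has access to; if this request succeeds without an authentication error, the key file is being picked up correctly:

import os
import openai

# Assumes the key file from the step above exists at ~/OPENAI_API_KEY
openai.api_key_path = os.path.join(os.path.expanduser("~"), 'OPENAI_API_KEY')

# A valid key lets this request go through and return the available models
models = openai.Model.list()
print(f"{len(models.data)} models available")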

After activating the key in your Python session, you can test the Whisper API by transcribing an audio file of your own. Here’s some sample code that shows you how to do this:

with open('myaudio.mp3', 'rb') as fp:
    transcript = openai.Audio.translate("whisper-1", fp)

It’s that easy! And if you ever need to check how much credit you have left, just go to “Manage Account” on the drop-down menu and select “Usage” on the left panel. With your OpenAI API key set up, you’re now ready to take your projects to the next level. Don’t hesitate to explore the amazing possibilities offered by OpenAI’s API keys and start building something incredible today!

Create a conda environment and install packages


To get started with building our Voice Assistant, we need to create a new conda environment and install some essential packages. Let’s call our new environment VoiceAssistant and use Python 3.9. Here’s how to set up the environment:

conda create -n VoiceAssistant python=3.9
conda activate VoiceAssistant

Now that we’re in the new environment, let’s install the following packages that we’ll need:

  • openai: This package allows us to use OpenAI’s natural language processing APIs. You can install it with pip install openai.
  • playsound and PyObjC: These packages allow us to play audio files (PyObjC is required by playsound on macOS). You can install them with pip install playsound PyObjC.
  • gTTS: This package allows us to convert text to speech. You can install it with pip install gTTS. (The wave module we use to save WAV recordings is part of Python’s standard library, so it needs no installation.)
  • streamlit and streamlit_lottie: These packages allow us to build a simple web interface for our Voice Assistant (watchdog is recommended by Streamlit for faster reloads during development). You can install them with pip install streamlit streamlit_lottie watchdog.
  • pyaudio: This package allows us to record audio input. You can install it with conda install pyaudio.

Here’s the complete list of package installations:

pip install openai
pip install playsound PyObjC
pip install gTTS
pip install streamlit streamlit_lottie watchdog

conda install pyaudio

Record audio from Python

The first component of our Voice Assistant app is to record the user’s voice prompt. For this, we’ll use the pyaudio and wave packages. The following code defines a function to record audio from the microphone and save it to a WAV file. You can write this function in a separate module in a file called audio_recorder.py:

import pyaudio
import wave

# Set audio parameters
FORMAT = pyaudio.paInt16  # Audio format
CHANNELS = 1  # Number of audio channels
RATE = 16000  # Sampling rate
CHUNK = 1024  # Number of audio frames per buffer
RECORD_SECONDS = 5  # Duration of recording in seconds
RECORDING_FILENAME = 'recording.wav'  # Name of output file


def record(seconds=RECORD_SECONDS, filename=RECORDING_FILENAME):
    # Initialize PyAudio object
    audio = pyaudio.PyAudio()

    # Open audio stream
    stream = audio.open(
        format=FORMAT, channels=CHANNELS,
        rate=RATE, input=True,
        frames_per_buffer=CHUNK)
        
    # Record audio
    frames = []
    for i in range(0, int(RATE / CHUNK * seconds)):
        data = stream.read(CHUNK)
        frames.append(data)

    # Stop audio stream and PyAudio object
    stream.stop_stream()
    stream.close()
    audio.terminate()

    # Write frames to a WAV file
    wave_file = wave.open(filename, 'wb')
    wave_file.setnchannels(CHANNELS)
    wave_file.setsampwidth(audio.get_sample_size(FORMAT))
    wave_file.setframerate(RATE)
    wave_file.writeframes(b''.join(frames))
    wave_file.close()

if __name__ == '__main__':
    # Run the record function with default parameters
    record()

In this code, we define the record() function, which uses the pyaudio package to open a microphone audio stream and record raw audio frames for the specified number of seconds. The function then saves the recorded frames to a WAV file with the specified filename.

To use the record() function, you can import it into your main script and call it as follows:

from audio_recorder import record

# Record audio for 10 seconds and save to 'my_recording.wav'
record(seconds=10, filename='my_recording.wav')

In this example, we call the record() function for a duration of 10 seconds and save the output to a file named 'my_recording.wav'. You can modify the function parameters to suit your needs, such as changing the recording duration or output filename.
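
Before wiring this into the rest of the app, it is worth a quick microphone check: record a short clip and immediately play it back with playsound (installed earlier). A minimal sketch, assuming playsound can handle WAV playback on your platform (it can on macOS and Windows):

from audio_recorder import record
from playsound import playsound

# Record a short 3-second test clip and play it back to confirm the mic works
record(seconds=3, filename='mic_test.wav')
playsound('mic_test.wav')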

Call the Whisper API to transcribe the audio

Next, we’ll use OpenAI’s Whisper API to transcribe the user’s voice prompt from the recorded audio file. To do this, we first need to set up our API key and then define a function that calls openai.Audio.translate(). Here’s the code:

import os
import openai

# setup the API key
home_dir = os.path.expanduser("~")
openai.api_key_path = os.path.join(home_dir, 'OPENAI_API_KEY')


def get_transcription(filename):
    with open(filename, 'rb') as fp:
        transcript = openai.Audio.translate("whisper-1", fp)
    return transcript['text']

In this code, we define the get_transcription() function, which takes an audio file filename as input and returns the text transcription of the audio using the OpenAI Whisper API. The function opens the audio file in read-binary mode ('rb') and then calls openai.Audio.translate() with the whisper-1 model and the file object fp; this endpoint transcribes the speech and, if it is not in English, translates it into English. The function returns the transcribed text as a string.
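
Note that the translate endpoint always returns English text. If you would rather keep the transcription in whatever language was actually spoken, Whisper’s sibling endpoint openai.Audio.transcribe() can be used instead; here is a minimal variant (the name get_transcription_native is just for illustration):

def get_transcription_native(filename):
    # Like get_transcription(), but keeps the original spoken language
    # instead of translating the speech into English.
    with open(filename, 'rb') as fp:
        transcript = openai.Audio.transcribe("whisper-1", fp)
    return transcript['text']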

To use the get_transcription() function, you can import it into your main script and call it as follows:

from audio_recorder import record
from whisper_transcriber import get_transcription

# Record audio for 10 seconds and save to 'my_recording.wav'
record(seconds=10, filename='my_recording.wav')

# Get the text transcription of the audio file
transcription = get_transcription('my_recording.wav')
print(transcription)

Call the ChatGPT API to get a response

Now that we have the text version of the user’s prompt obtained through the Whisper API, our next step is to send this text prompt to ChatGPT and get a response message. To do this, we’ll use OpenAI’s Chat API and pass the current date to ChatGPT as context.

Here’s the code to define the get_response() function:

import os
import datetime
import openai

# Set up the API key
home_dir = os.path.expanduser("~")
openai.api_key_path = os.path.join(home_dir, 'OPENAI_API_KEY')

today_date = datetime.date.today()


def get_response(prompt):
    response = openai.ChatCompletion.create(
      model="gpt-3.5-turbo",
      messages=[
            {
                "role": "system",
                "content": (
                    f'You are ChatGPT, a large language model trained by OpenAI. '
                    f'Answer as concisely as possible. Current date: {today_date}.'
                )
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
    )
    return response

In this code, we define the get_response() function, which takes a text prompt as input and queries the OpenAI Chat API. The function calls openai.ChatCompletion.create() with the gpt-3.5-turbo model and two messages: a system message that gives ChatGPT its role and the current date as context, and a user message containing the prompt. Finally, the function returns the full response object; the assistant’s reply text lives under response['choices'][0]['message']['content'].

To use the get_response() function, you can import it into your main script and call it as follows:

from audio_recorder import record
from whisper_transcriber import get_transcription
from chatbot import get_response

# Record audio for 10 seconds and save to 'my_recording.wav'
record(seconds=10, filename='my_recording.wav')

# Get the text transcription of the audio file
transcription = get_transcription('my_recording.wav')

# Get the chatbot response to the transcription
response = get_response(transcription)
print(response['choices'][0]['message']['content'])
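
Because get_response() returns the full response object, you can also inspect the token usage the Chat API reports for each call, which is handy for keeping an eye on cost. A small illustration (the field names follow the Chat Completions response format):

from chatbot import get_response

response = get_response("What is the capital of France?")

message = response['choices'][0]['message']['content']  # assistant's reply text
usage = response['usage']                                # token counts for this call

print(message)
print(f"Total tokens used: {usage['total_tokens']}")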

Convert the response to audio using text-to-speech

The final component of our VoiceAssistant app is to convert the response message to speech using Google’s TTS package (gTTS) and then play the speech audio using the playsound package. Here’s the code to define the run_tts() function:

import os
import tempfile
from io import BytesIO
from gtts import gTTS
from playsound import playsound


def run_tts(text):
    # Create a gTTS object with the specified text
    tts = gTTS(text)
    
    # Create a BytesIO object to hold the audio data
    mp3_fp = BytesIO()
    
    # Write the audio data to the BytesIO object
    tts.write_to_fp(mp3_fp)

    # Extract the byte string from the BytesIO object
    mp3_bytes = mp3_fp.getvalue()

    # Save the audio data to a temporary file
    with tempfile.NamedTemporaryFile(suffix='.mp3', delete=False) as f:
        f.write(mp3_bytes)
        audio_file = f.name

    # Play the audio data using the playsound module
    playsound(audio_file)

    # Delete the temporary audio file
    os.unlink(audio_file)

if __name__ == '__main__':
    # Run the run_tts function with the specified text
    run_tts(text='this is a test')

In this code, we define the run_tts() function, which takes a text message as input, generates speech audio with gTTS, and plays it back. The function creates a gTTS object from the text, writes the MP3 data to a BytesIO buffer, saves the bytes to a temporary file with a .mp3 suffix using tempfile.NamedTemporaryFile(), plays that file with playsound(), and finally deletes the temporary file with os.unlink().
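
gTTS also accepts a lang argument (and a slow flag) in case you want the response spoken in another language or at a slower pace. Here is a minimal variant of the function above with the language exposed as a parameter (run_tts_lang is just an illustrative name):

import os
import tempfile
from gtts import gTTS
from playsound import playsound


def run_tts_lang(text, lang='en', slow=False):
    # Same flow as run_tts(), but with the gTTS voice language exposed
    tts = gTTS(text=text, lang=lang, slow=slow)
    with tempfile.NamedTemporaryFile(suffix='.mp3', delete=False) as f:
        tts.write_to_fp(f)   # write the MP3 bytes straight to the temp file
        audio_file = f.name
    playsound(audio_file)    # play the synthesized speech
    os.unlink(audio_file)    # clean up the temporary file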

To use the run_tts() function, you can import it into your main script and call it with the response message as follows:

from audio_recorder import record
from whisper_transcriber import get_transcription
from chatbot import get_response
from text_to_speech import run_tts

# Record audio for 10 seconds and save to 'my_recording.wav'
record(seconds=10, filename='my_recording.wav')

# Get the text transcription of the audio file
transcription = get_transcription('my_recording.wav')

# Get the chatbot response to the transcription
response = get_response(transcription)

# Convert the response message to speech and play it
run_tts(response)

Build the app interface using Streamlit

In this section, we create the interface of our Voice Assistant application using Streamlit. To make the application visually appealing, we use an animation from the LottieFiles website, which we display with the streamlit_lottie package. In the code, we define a load_lottie(url) function to load the animation from the given URL.

Next, we initialize the session state, a dictionary-like object that persists data across reruns of the Streamlit script. We initialize three variables to store the recording state, the user’s prompt text, and ChatGPT’s response text.

After that, we define the callback_record() function, which is called when the user clicks on the “Record” button. This function sets the is_recording flag to True, which disables the record button and displays a message indicating that the recording has started. It then records the user’s voice prompt, processes the recording, and sends it to the Whisper API to get the text transcription. Finally, it gets ChatGPT’s response to the user’s prompt.

The next part of the code defines the layout of the interface using Streamlit’s columns() function. The left column displays the animation, while the right column displays the header, a message prompting the user to press the Record button, the Record button, and a message box to display the user’s prompt.

The second container displays ChatGPT’s response to the user’s prompt. If a response is available, it splits the message into lines at newline characters, writes each line to the message box one by one, and plays the text-to-speech audio for that line via run_tts() (which uses gTTS and playsound). Once all lines have been spoken, the entire message is written to the box.

We name this script app.py, and in the next section, we will run our app.

import json
import requests

import streamlit as st
from streamlit_lottie import st_lottie

# local modules
import audio_recorder
import chatbot
import text_to_speech
import whisper_transcriber

# constants
LOTTIE_URL = 'https://assets6.lottiefiles.com/packages/lf20_6e0qqtpa.json'
PROMPT_WAVFILE = 'prompt.wav'


# Create the animation
def load_lottie(url):
    r = requests.get(url)
    if r.status_code != 200:
        return
    return r.json()

lottie_anim = load_lottie(LOTTIE_URL)

st.set_page_config(page_title="ChatGPT-VA", page_icon='', layout='centered')

# Initialize session state
if "is_recording" not in st.session_state:
    st.session_state.is_recording = False
if "prompt_text" not in st.session_state:
    st.session_state.prompt_text = None
if "chat_text" not in st.session_state:
    st.session_state.chat_text = None


# Define button callbacks
def callback_record():
    st.session_state.is_recording = True
    prompt_box.write("Recording started ...")

    # record the prompt
    audio_recorder.record(filename=PROMPT_WAVFILE)
    prompt_box.write("Processing the prompt ...")

    # Process recording
    prompt = whisper_transcriber.get_transcription(PROMPT_WAVFILE)

    st.session_state.is_recording = False
    st.session_state.prompt_text = prompt

    response = chatbot.get_response(prompt)
    with open('response.json', 'wt') as fj:
        json.dump(response, fj)

    st.session_state.chat_text = response


##########################
with st.container():
    left, right = st.columns([2, 3])
    with left:
        st_lottie(lottie_anim, height=300, key='coding')

    with right:
        st.subheader('Hi, I am ChatGPT Voice Assistant!')

        st.write('Press Record to start recording your prompt')

        rec_button = st.button(
            label="Record :microphone:", type='primary',
            on_click=callback_record,
            disabled=st.session_state.is_recording)

        prompt_box = st.empty()
        if st.session_state.prompt_text:
            prompt_box.write(f'Prompt: {st.session_state.prompt_text}')


##########################
with st.container():
    st.write('---')

    message_box = st.empty()
    if st.session_state.chat_text:
        choice = st.session_state.chat_text['choices'][0]
        # write and play line by line
        for line in choice['message']['content'].split('\n'):
            if not line:
                continue
            message_box.write(line)
            text_to_speech.run_tts(line)

        # write the entire message
        message_box.write(choice['message']['content'])

Running the Voice Assistant app

Now that we have put together all four components we defined earlier, we can run the Voice Assistant app and give it a try. To run the app, use the following command in your terminal:

streamlit run app.py

This command will start the Streamlit application, and you will be able to interact with the Voice Assistant app through your web browser.

Check out the following video for an example use case of the Voice Assistant app:

In this blog post, we showed you how to create a cool Voice Assistant app using Python and some amazing APIs and packages. With our app, you can record your voice prompt, and our Voice Assistant will transcribe it, generate a response using ChatGPT, and convert it to speech. We even used Streamlit to make a user-friendly interface so that you can easily interact with the Voice Assistant through your web browser. I hope you enjoyed reading this and that you’re inspired to build your own Voice Assistant app!

This post is licensed under CC BY 4.0 by the author.