How to Create Audio Clips with the Gemini 2.0 Flash Model

Ever imagined turning your ideas directly into audio clips? Now it’s possible with Google’s innovative Gemini 2.0 Flash model. Building on the foundations of the 1.x models, Gemini 2.0 brings a revolutionary “idea-to-speech” capability. This guide will walk you through harnessing its advanced text-to-speech (TTS) features using Python, transforming your prompts into vivid audio experiences.


Gemini’s API brings generative AI capabilities directly into your applications, enabling seamless integration of advanced AI functionality. This guide focuses specifically on the new text-to-audio feature introduced in Gemini 2.0 Flash: creating audio from text.

Table of Contents

  1. Prerequisites
  2. Installation and Setup
  3. Building the Audio Clip Generator
  4. Running the Script
  5. Summary

Prerequisites

Before diving into the implementation, ensure you have the following:

  • Google Cloud Account: Access to Google APIs requires a Google Cloud account.
  • API Key: You need an API key with access to the Gemini 2.0 Flash model.
  • Python 3 Environment: The client library is available for Python 3.
  • Basic Python Knowledge: Familiarity with Python programming and virtual environments.

Installation and Setup

Follow these steps to install the necessary client library and set up your environment:

  • Install the Gemini Client Library:
pip install -U google-genai
  • Create an API Key: If you don’t already have one, create an API key through the Google Cloud Console.
  • Secure Your API Key: Save your API key securely. For this guide, we’ll store it in a settings.py file:
# settings.py
API_KEY = 'YOUR_API_KEY'

Note: For production environments, avoid hard-coding API keys. Instead, use environment variables or a secrets manager to keep your keys secure.
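As a minimal sketch of the environment-variable approach, you could load the key with a small helper. The variable name GEMINI_API_KEY and the function name are illustrative choices, not an official convention:

```python
import os

def load_api_key(env_var='GEMINI_API_KEY', fallback=None):
    'Return the API key from the environment, or a fallback value.'
    key = os.environ.get(env_var)
    if key:
        return key
    if fallback is not None:
        return fallback
    raise RuntimeError(f'No API key found; set the {env_var} environment variable')
```

With this in place, `API_KEY = load_api_key()` fails fast when the variable is missing instead of silently shipping a hard-coded key.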

    Building the Audio Clip Generator

    We’ll create a Python script that sends a text prompt to Gemini and saves the generated audio response.

    Imports

    First, import the necessary libraries:

    import asyncio
    import contextlib
    import wave
    
    from google import genai
    from settings import API_KEY
    • asyncio: Handles asynchronous operations.
    • contextlib: Manages context managers.
    • wave: Processes WAVE audio files.
    • genai: Google’s Gemini client library.
    • API_KEY: Your secured API key from settings.py.

    Constants and Audio File Writer

    Define constants for the API client, model configuration, prompt, and output filename:

    CLIENT = genai.Client(api_key=API_KEY, http_options={'api_version': 'v1alpha'})
    MODEL = 'gemini-2.0-flash-exp'
    CONFIG = {'generation_config': {'response_modalities': ['AUDIO']}}
    PROMPT = 'Describe a dog sound in a few sentences'
    FILENAME = 'dog_sound.wav'

    Next, set up a context manager for writing WAV files:

    @contextlib.contextmanager
    def wave_file(filename, channels=1, rate=24000, sample_width=2):
        'Set up .wav file writer'
        with wave.open(filename, 'wb') as wf:
            wf.setnchannels(channels)
            wf.setsampwidth(sample_width)
            wf.setframerate(rate)
            yield wf
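To sanity-check the writer before involving the API, you can write one second of silence and read the header back. The values below simply follow the defaults above (mono, 24 kHz, 16-bit samples); the filename is illustrative:

```python
import contextlib
import wave

@contextlib.contextmanager
def wave_file(filename, channels=1, rate=24000, sample_width=2):
    'Set up .wav file writer'
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        yield wf

# Write one second of silence: 24000 frames x 2 bytes per 16-bit sample.
with wave_file('silence.wav') as f:
    f.writeframes(b'\x00' * 24000 * 2)

# Read the header back to confirm the parameters took effect.
with wave.open('silence.wav', 'rb') as wf:
    print(wf.getnchannels(), wf.getframerate(), wf.getnframes())  # → 1 24000 24000
```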

    Core Functionality

    Implement the asynchronous function to request audio from Gemini:

    async def request_audio(prompt=PROMPT, filename=FILENAME):
        'Request LLM to generate audio file given prompt'
        print(f'\n** LLM prompt: "{prompt}"')
        async with CLIENT.aio.live.connect(model=MODEL, config=CONFIG) as session:
            with wave_file(filename) as f:
                await session.send(prompt, end_of_turn=True)
                async for response in session.receive():
                    if response.data:
                        f.writeframes(response.data)
        print(f'** Saved audio to "{filename}"')

    This function:

    1. Connects to the Gemini 2.0 Flash model.
    2. Sends the text prompt.
    3. Receives audio data in chunks.
    4. Writes the data to a WAV file.
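Steps 3 and 4 — receiving chunks and writing frames — can be factored into a small helper that also reports how much audio arrived. This is an illustrative refactor, not part of the client library; `chunks` stands in for the `response.data` payloads yielded by the receive loop:

```python
def write_chunks(wav_writer, chunks):
    'Write streamed audio chunks to an open WAV writer; return total bytes written.'
    total = 0
    for data in chunks:
        if data:  # skip empty/None payloads, as the receive loop does
            wav_writer.writeframes(data)
            total += len(data)
    return total
```

Logging the returned byte count is a quick way to confirm that the session actually streamed audio rather than an empty response.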

    Full Script: gem20-audio.py

    Combine all components into a single script:

    import asyncio
    import contextlib
    import wave
    
    from google import genai
    from settings import API_KEY
    
    CLIENT = genai.Client(api_key=API_KEY, http_options={'api_version': 'v1alpha'})
    MODEL = 'gemini-2.0-flash-exp'
    CONFIG = {'generation_config': {'response_modalities': ['AUDIO']}}
    PROMPT = 'Describe a dog sound in a few sentences'
    FILENAME = 'dog_sound.wav'
    
    @contextlib.contextmanager
    def wave_file(filename, channels=1, rate=24000, sample_width=2):
        'Set up .wav file writer'
        with wave.open(filename, 'wb') as wf:
            wf.setnchannels(channels)
            wf.setsampwidth(sample_width)
            wf.setframerate(rate)
            yield wf
    
    async def request_audio(prompt=PROMPT, filename=FILENAME):
        'Request LLM to generate audio file given prompt'
        print(f'\n** LLM prompt: "{prompt}"')
        async with CLIENT.aio.live.connect(model=MODEL, config=CONFIG) as session:
            with wave_file(filename) as f:
                await session.send(prompt, end_of_turn=True)
                async for response in session.receive():
                    if response.data:
                        f.writeframes(response.data)
        print(f'** Saved audio to "{filename}"')
    
    if __name__ == "__main__":
        asyncio.run(request_audio())

    Running the Script

    Execute the script to generate the audio clip:

    python3 gem20-audio.py

    Expected Output:

    ** LLM prompt: "Describe a dog sound in a few sentences"
    ** Saved audio to "dog_sound.wav"

    You should now have a dog_sound.wav file in your directory containing the generated audio describing the dog sound.
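To verify the result beyond the console message, you can read the file's header back with the standard wave module. A small helper (the name clip_duration is ours, not from the guide's script):

```python
import wave

def clip_duration(filename):
    'Return the duration of a WAV file in seconds.'
    with wave.open(filename, 'rb') as wf:
        return wf.getnframes() / wf.getframerate()

# e.g. print(f'{clip_duration("dog_sound.wav"):.1f} seconds of audio')
```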

    Summary

    Developers are increasingly eager to integrate AI/ML capabilities into their applications. Google’s Gemini 2.0 Flash model provides powerful generative AI features accessible via its API, enabling functionality like text-to-audio conversion. This guide walked you through setting up the environment, installing the client library, and implementing a Python script that generates audio clips from text prompts. As you build on it, keep these security practices in mind:

    • Protect Your API Keys: Never hard-code API keys in production code. Use environment variables or secret managers to store sensitive information.
    • Limit API Key Permissions: Assign only the necessary permissions to your API keys to minimize potential misuse.
    • Monitor Usage: Regularly monitor your API usage to detect any unauthorized access or anomalies.

    Now that you’ve successfully generated audio clips using Gemini 2.0 Flash, consider exploring the following:

    • Advanced Prompts: Experiment with different prompts to generate varied audio outputs.
    • Integration: Incorporate the audio generation feature into web or mobile applications.
    • Multimodal Applications: Combine audio with other modalities like text or images for richer user experiences.

    Follow us for more tutorials on deploying generative AI web apps to Google Cloud and other exciting features from Google’s suite of APIs!
