Speech-to-Text and Speaker Diarization with Wav2Vec2 and pyannote.audio on Ubuntu 24.04

In this post, I walk through how to use Wav2Vec2 for speech-to-text (STT) and pyannote.audio 3.1 for speaker diarization on Ubuntu 24.04. I'll use a Python virtual environment with Python 3.10, CUDA GPU acceleration, and Hugging Face models.
Ubuntu 24.04 ships with Python 3.12 by default. pyannote.audio 3.1.1 was trained with torch==1.13.1 and CUDA 11.7, a combination that supports Python 3.10, so I installed Python 3.10 and created the virtual environment with it:
stt@GU502DU:~$ python3.10 -m venv wav2vec2_env
stt@GU502DU:~$ source wav2vec2_env/bin/activate
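To confirm that the activated environment really runs Python 3.10, a quick check along these lines can help. This is only a sketch; the helper name interpreter_summary is my own and not part of any library.

```python
import sys


def interpreter_summary():
    """Return the (major, minor) Python version and whether a venv is active."""
    version = sys.version_info[:2]
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the interpreter it was created from.
    in_venv = sys.prefix != sys.base_prefix
    return version, in_venv


version, in_venv = interpreter_summary()
print(f"Python {version[0]}.{version[1]}, virtual environment active: {in_venv}")
```

Run inside wav2vec2_env, this should report Python 3.10 and an active virtual environment.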
My laptop is equipped with an NVIDIA GeForce GTX 1660, so I chose to install and use CUDA. Since the GPU driver reports CUDA Version 12.8, I installed the cu121 build of PyTorch, which is compatible with it.
(wav2vec2_env) stt@GU502DU:~$ pip install torch==2.2.1+cu121 torchaudio==2.2.1+cu121 \
--extra-index-url https://download.pytorch.org/whl/cu121
To check that the CUDA version supported by your GPU driver is compatible with the PyTorch build, run the following command:
(wav2vec2_env) stt@GU502DU:~$ nvidia-smi
+------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
+------------------------------------------------------------------------+
Make sure that the CUDA version reported here (e.g., CUDA Version: 12.8) is greater than or equal to the CUDA version used by the installed PyTorch build (in this case, cu121 = CUDA 12.1).
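The comparison above can be written down as a small check. This is a minimal sketch: the helper names parse_cuda_tag and driver_supports_build are hypothetical, and it assumes wheel tags of the form cuXYZ where the last digit is the minor version.

```python
def parse_cuda_tag(tag):
    """Turn a PyTorch wheel tag such as 'cu121' into a (major, minor) tuple."""
    digits = tag.removeprefix("cu")
    # Assumption: the last digit is the minor version,
    # so 'cu121' -> (12, 1) and 'cu118' -> (11, 8).
    return int(digits[:-1]), int(digits[-1])


def driver_supports_build(driver_cuda, wheel_tag):
    """True if the driver's CUDA version is >= the CUDA version of the wheel."""
    major, minor = (int(part) for part in driver_cuda.split("."))
    return (major, minor) >= parse_cuda_tag(wheel_tag)


# Values from this post: nvidia-smi reports 12.8, the wheel is cu121.
print(driver_supports_build("12.8", "cu121"))  # True
```

Once PyTorch is installed, torch.version.cuda and torch.cuda.is_available() give the same information from inside Python.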
(wav2vec2_env) stt@GU502DU:~$ pip install pyannote.audio==3.1.1 numpy scipy librosa huggingface_hub
(wav2vec2_env) stt@GU502DU:~$ pip install transformers datasets --no-deps
To make sure my specific versions of packages like torch are not overridden, I add the --no-deps option.
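After installing, it is worth confirming that the pinned versions survived. The sketch below uses the standard library's importlib.metadata; the helper name installed_versions and the package list are my own choices for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the pip commands in this post.
to_check = ["torch", "torchaudio", "pyannote.audio", "transformers", "numpy"]
for name, ver in installed_versions(to_check).items():
    print(f"{name}: {ver or 'not installed'}")
```

If torch no longer shows the +cu121 build here, another install has replaced it.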
Create a Hugging Face access token from hf.co/settings/tokens and use it in your code:
from huggingface_hub import login
login("hf_your_token_here")
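To keep the token out of the source file, one option is to read it from an environment variable. This is a sketch under my own assumptions: the helper read_hf_token is hypothetical, and the variable name HF_TOKEN is simply the one I chose to export in the shell.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the given environment variable, or None."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = read_hf_token()
print("token found" if token else "HF_TOKEN is not set")
# With a token in hand, the call shown above becomes: login(token)
```

The same value can then be passed wherever the script expects the token string.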
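Visit the Hugging Face page of each gated pyannote model the script uses (for example pyannote/speaker-diarization-3.1, which is loaded below) and click “Agree” to accept the terms.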
Below is a Python script that performs STT with Wav2Vec2 and diarization with pyannote.audio. It includes GPU memory optimization and merges word-level results into full sentences, without saving an intermediate file:
(wav2vec2_env) stt@GU502DU:~$ python3 speaker_test6.py
import os

# Set before importing torch: stops the CUDA allocator from splitting
# blocks larger than 64 MB, which can reduce memory fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"

from pyannote.audio import Pipeline
from transformers import pipeline as hf_pipeline
import json
import torch

HUGGINGFACE_TOKEN = "your_token_here"

# Step 1: speaker diarization on the GPU.
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HUGGINGFACE_TOKEN
)
diarization_pipeline.to(torch.device("cuda"))
diarization = diarization_pipeline("day1.wav")

# Return cached, unused GPU memory before loading the STT model.
torch.cuda.empty_cache()

# Step 2: speech-to-text with word-level timestamps.
stt = hf_pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", chunk_length_s=10)
stt_result = stt("day1.wav", return_timestamps="word")
segments = stt_result["chunks"] if "chunks" in stt_result else [stt_result]


def get_speaker(start, end, diarization):
    """Return the first speaker whose turn covers more than half of [start, end]."""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = max(0, min(end, turn.end) - max(start, turn.start))
        if overlap > (end - start) * 0.5:
            return speaker
    return "Unknown"


# Step 3: merge consecutive words from the same speaker into one entry,
# starting a new entry when the speaker changes or the pause is 1.0 s or longer.
merged_sentences = []
current = None
for seg in segments:
    speaker = get_speaker(seg["timestamp"][0], seg["timestamp"][1], diarization)
    start = seg["timestamp"][0]
    end = seg["timestamp"][1]
    text = seg["text"]
    if current is None:
        current = {"speaker": speaker, "start": start, "end": end, "text": text}
        continue
    if speaker == current["speaker"] and start - current["end"] < 1.0:
        current["end"] = end
        current["text"] += " " + text
    else:
        merged_sentences.append(current)
        current = {"speaker": speaker, "start": start, "end": end, "text": text}
if current:
    merged_sentences.append(current)

with open("wav2vec2_stt_sentences.json", "w") as f:
    json.dump(merged_sentences, f, indent=2)
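To see how the overlap rule in get_speaker behaves without running any model, here is a toy version that works on plain tuples. It is a sketch only: pick_speaker and the hand-written turns list stand in for the pyannote diarization object and are not part of its API.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def pick_speaker(start, end, turns):
    """Return the first speaker whose turn covers more than half of the word.

    turns is a list of (turn_start, turn_end, speaker) tuples.
    """
    for t_start, t_end, speaker in turns:
        if overlap(start, end, t_start, t_end) > (end - start) * 0.5:
            return speaker
    return "Unknown"


# Made-up speaker turns for illustration.
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]

print(pick_speaker(0.5, 1.0, turns))  # SPEAKER_00: fully inside the first turn
print(pick_speaker(1.8, 2.6, turns))  # SPEAKER_01: most of the word is after 2.0
print(pick_speaker(6.0, 6.5, turns))  # Unknown: no turn overlaps the word
```

A word that straddles a speaker change goes to whichever turn holds more than half of it, and a word outside every turn is labeled "Unknown".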
The script will generate a wav2vec2_stt_sentences.json file containing aligned transcriptions and speaker labels in sentence units like:
{
  "speaker": "SPEAKER_00",
  "start": 1.25,
  "end": 5.6,
  "text": "Hello, I just had diarrhea for three days. It affected my daily life."
}
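The merging step that produces these entries can be tried on its own with a few hand-written words. This is a minimal sketch: merge_words is a hypothetical helper that mirrors the loop in the script, and the word list is made up.

```python
def merge_words(words, max_gap=1.0):
    """Merge consecutive words from the same speaker if the pause is short."""
    merged = []
    for word in words:
        last = merged[-1] if merged else None
        same_speaker = last is not None and word["speaker"] == last["speaker"]
        if same_speaker and word["start"] - last["end"] < max_gap:
            # Extend the current entry instead of starting a new one.
            last["end"] = word["end"]
            last["text"] += " " + word["text"]
        else:
            merged.append(dict(word))  # copy so the input stays untouched
    return merged


# Made-up word-level results for illustration.
words = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 0.4, "text": "HELLO"},
    {"speaker": "SPEAKER_00", "start": 0.5, "end": 0.9, "text": "THERE"},
    {"speaker": "SPEAKER_01", "start": 1.0, "end": 1.3, "text": "HI"},
    {"speaker": "SPEAKER_01", "start": 3.0, "end": 3.4, "text": "AGAIN"},
]

for entry in merge_words(words):
    print(entry)
```

The first two words merge into one entry, the speaker change starts a second one, and the 1.7-second pause before the last word starts a third even though the speaker is the same.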
If you encounter NumPy 2.x errors such as "np.NaN was removed", downgrade NumPy to 1.x. I recommend pinning NumPy to 1.26.x when using pyannote.audio and PyTorch.
(wav2vec2_env) stt@GU502DU:~$ pip install numpy==1.26.4