Generating Speaker-Labeled Subtitles with Whisper and pyannote.audio on Ubuntu 24.04

If you want to generate subtitles with speaker labels from an audio file using Whisper and pyannote.audio on Ubuntu 24.04, this blog post walks you through the full setup process.
Ubuntu 24.04 ships with Python 3.12 by default, but pyannote.audio 3.1.1 was trained with torch==1.13.1 and CUDA 11.7, a combination that supports Python 3.10 rather than 3.12. So I need to install Python 3.10 first.
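Python 3.10 isn't in the Ubuntu 24.04 repositories, so one common route is the deadsnakes PPA (an assumption about your setup; any working Python 3.10 install is fine):
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.10 python3.10-venv
With Python 3.10 in place, I create and activate a virtual environment: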
stt@GU502DU:~$ python3.10 -m venv whisper_env
stt@GU502DU:~$ source whisper_env/bin/activate
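A quick version check confirms the venv picked up the right interpreter; it should report a 3.10.x release:
(whisper_env) stt@GU502DU:~$ python --version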
My laptop is equipped with an NVIDIA GeForce GTX 1660, so I chose to install and use CUDA. Since the GPU driver reports CUDA Version 12.8, I installed the cu121 build of PyTorch, which is compatible with it.
(whisper_env) stt@GU502DU:~$ pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 \
--extra-index-url https://download.pytorch.org/whl/cu121
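Before going further, it's worth verifying that this build actually sees the GPU; the following should print True if the CUDA build is working:
(whisper_env) stt@GU502DU:~$ python3 -c "import torch; print(torch.cuda.is_available())"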
To ensure compatibility between your installed CUDA version and the PyTorch build, run the following command:
(whisper_env) stt@GU502DU:~$ nvidia-smi
+------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
+------------------------------------------------------------------------+
Make sure that the CUDA version reported here (e.g., CUDA Version: 12.8) is greater than or equal to the CUDA version used by the installed PyTorch build (in this case, cu121 = CUDA 12.1).
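You can also read the CUDA version a PyTorch build was compiled against directly from Python; for the cu121 wheels this should print 12.1:
(whisper_env) stt@GU502DU:~$ python3 -c "import torch; print(torch.version.cuda)"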
(whisper_env) stt@GU502DU:~$ pip install pyannote.audio==3.1.1 numpy scipy librosa \
huggingface_hub
(whisper_env) stt@GU502DU:~$ pip install git+https://github.com/openai/whisper.git --upgrade --no-deps
I use the --no-deps option to prevent pip from overriding already-installed packages like torch.
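Because --no-deps skips dependency resolution, pip check is a quick way to spot anything Whisper still needs (for example tiktoken or numba); anything it flags can be installed individually without touching torch:
(whisper_env) stt@GU502DU:~$ pip check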
I create a Hugging Face access token from hf.co/settings/tokens and use it in my code:
from huggingface_hub import login
login("hf_your_token_here")
The pyannote models are gated, so I visit the following URLs and click “Agree” to accept the terms (speaker-diarization-3.1 also relies on the gated segmentation model):
hf.co/pyannote/speaker-diarization-3.1
hf.co/pyannote/segmentation-3.0
Next, I put the whole flow into a script, speaker_test4.py:
from pyannote.audio import Pipeline
import whisper
import torch
import json

# Diarization: who speaks when
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_your_token_here"
)
pipeline.to(torch.device("cuda"))  # pyannote runs on CPU unless moved to the GPU
diarization = pipeline("test.wav")

# STT: transcribe the same file with Whisper
model = whisper.load_model("base")
whisper_result = model.transcribe("test.wav")

# Merge: label each Whisper segment with the diarization speaker
# whose turn covers more than half of that segment
def get_speaker(start_time, end_time, diarization):
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = max(0, min(end_time, turn.end) - max(start_time, turn.start))
        if overlap > (end_time - start_time) * 0.5:
            return speaker
    return "Unknown"

merged = []
for seg in whisper_result["segments"]:
    speaker = get_speaker(seg["start"], seg["end"], diarization)
    merged.append({
        "speaker": speaker,
        "start": seg["start"],
        "end": seg["end"],
        "text": seg["text"]
    })

with open("test_stt_merged.json", "w") as f:
    json.dump(merged, f, indent=2)
(whisper_env) stt@GU502DU:~$ python3 speaker_test4.py
Running it generates test_stt_merged.json, which combines the speaker labels with the transcribed text:
  {
    "speaker": "SPEAKER_00",
    "start": 11.200000000000001,
    "end": 19.36,
    "text": " you this morning? Um, I just had some, um, diary for the last three days. Um, and it's"
  },
  {
    "speaker": "SPEAKER_00",
    "start": 19.36,
    "end": 24.16,
    "text": " been affecting me. I need to stay close to the toilet and, um, yeah, it's been affecting"
  },
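Since the end goal is subtitles, here is a minimal sketch (my own addition, not part of the script above) that turns the merged JSON into an .srt file, prefixing each line with its speaker label:

import json

def srt_time(t):
    # Convert seconds to the SRT timestamp format HH:MM:SS,mmm
    total_ms = int(round(t * 1000))
    h, rem = divmod(total_ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("test_stt_merged.json") as f:
    merged = json.load(f)

with open("test.srt", "w") as f:
    for i, seg in enumerate(merged, start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(f"[{seg['speaker']}] {seg['text'].strip()}\n\n")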