Generating Speaker-Labeled Subtitles with Whisper and pyannote.audio on Ubuntu 24.04

If you want to generate subtitles with speaker labels from an audio file using Whisper and pyannote.audio on Ubuntu 24.04, this blog post walks you through the full setup process.
Ubuntu 24.04 ships with Python 3.12 by default, but pyannote.audio 3.1.1 was trained with torch==1.13.1 and CUDA 11.7, a combination that supports Python 3.10 rather than 3.12. So I need to install Python 3.10 first.
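Python 3.10 isn't in the Ubuntu 24.04 repositories, so one common route is the deadsnakes PPA (an assumption about your setup; any working Python 3.10 install is fine):
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.10 python3.10-venv
With Python 3.10 in place, I create and activate a virtual environment: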
stt@GU502DU:~$ python3.10 -m venv whisper_env
stt@GU502DU:~$ source whisper_env/bin/activate
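A quick version check confirms the venv picked up the right interpreter; it should report a 3.10.x release:
(whisper_env) stt@GU502DU:~$ python --version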
My laptop is equipped with an NVIDIA GeForce GTX 1660, so I chose to install and use CUDA. Since the GPU driver reports CUDA Version 12.8, I installed the cu121 build of PyTorch, which is compatible with it.
(whisper_env) stt@GU502DU:~$ pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 \
--extra-index-url https://download.pytorch.org/whl/cu121
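Before going further, it's worth verifying that this build actually sees the GPU; the following should print True if the CUDA build is working:
(whisper_env) stt@GU502DU:~$ python3 -c "import torch; print(torch.cuda.is_available())"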
To ensure compatibility between your installed CUDA version and the PyTorch build, run the following command:
(whisper_env) stt@GU502DU:~$ nvidia-smi
+------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
+------------------------------------------------------------------------+
Make sure that the CUDA version reported here (e.g., CUDA Version: 12.8) is greater than or equal to the CUDA version used by the installed PyTorch build (in this case, cu121 = CUDA 12.1).
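You can also read the CUDA version a PyTorch build was compiled against directly from Python; for the cu121 wheels this should print 12.1:
(whisper_env) stt@GU502DU:~$ python3 -c "import torch; print(torch.version.cuda)"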
(whisper_env) stt@GU502DU:~$ pip install pyannote.audio==3.1.1 numpy scipy librosa \
huggingface_hub
(whisper_env) stt@GU502DU:~$ pip install git+https://github.com/openai/whisper.git --upgrade --no-deps
I use the --no-deps option to prevent pip from overriding already-installed packages like torch.
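Because --no-deps skips dependency resolution, pip check is a quick way to spot anything Whisper still needs (for example tiktoken or numba); anything it flags can be installed individually without touching torch:
(whisper_env) stt@GU502DU:~$ pip check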
I create a Hugging Face access token from hf.co/settings/tokens and use it in my code:
from huggingface_hub import login
login("hf_your_token_here")
The pyannote models are gated, so I visit the following URLs and click “Agree” to accept the terms (speaker-diarization-3.1 also relies on the gated segmentation model):
hf.co/pyannote/speaker-diarization-3.1
hf.co/pyannote/segmentation-3.0
Next, I put the whole flow into a script, speaker_test4.py:
from pyannote.audio import Pipeline
import whisper
import torch
import json

# Diarization: who speaks when
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_your_token_here"
)
pipeline.to(torch.device("cuda"))  # pyannote runs on CPU unless moved to the GPU
diarization = pipeline("test.wav")

# STT: transcribe the same file with Whisper
model = whisper.load_model("base")
whisper_result = model.transcribe("test.wav")

# Merge: label each Whisper segment with the diarization speaker
# whose turn covers more than half of that segment
def get_speaker(start_time, end_time, diarization):
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = max(0, min(end_time, turn.end) - max(start_time, turn.start))
        if overlap > (end_time - start_time) * 0.5:
            return speaker
    return "Unknown"

merged = []
for seg in whisper_result["segments"]:
    speaker = get_speaker(seg["start"], seg["end"], diarization)
    merged.append({
        "speaker": speaker,
        "start": seg["start"],
        "end": seg["end"],
        "text": seg["text"]
    })

with open("test_stt_merged.json", "w") as f:
    json.dump(merged, f, indent=2)
(whisper_env) stt@GU502DU:~$ python3 speaker_test4.py
Running it generates test_stt_merged.json, which combines the speaker labels with the transcribed text:
  {
    "speaker": "SPEAKER_00",
    "start": 11.200000000000001,
    "end": 19.36,
    "text": " you this morning? Um, I just had some, um, diary for the last three days. Um, and it's"
  },
  {
    "speaker": "SPEAKER_00",
    "start": 19.36,
    "end": 24.16,
    "text": " been affecting me. I need to stay close to the toilet and, um, yeah, it's been affecting"
  },
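Since the end goal is subtitles, here is a minimal sketch (my own addition, not part of the script above) that turns the merged JSON into an .srt file, prefixing each line with its speaker label:

import json

def srt_time(t):
    # Convert seconds to the SRT timestamp format HH:MM:SS,mmm
    total_ms = int(round(t * 1000))
    h, rem = divmod(total_ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("test_stt_merged.json") as f:
    merged = json.load(f)

with open("test.srt", "w") as f:
    for i, seg in enumerate(merged, start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(f"[{seg['speaker']}] {seg['text'].strip()}\n\n")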