Speech-to-Text and Speaker Diarization with Wav2Vec2 and pyannote.audio 3.1 on Ubuntu 24.04

- May 07, 2025

In this post, I walk through how to use Wav2Vec2 for speech-to-text (STT) and pyannote.audio 3.1 for speaker diarization on Ubuntu 24.04. I'll use a Python virtual environment with Python 3.10, CUDA GPU acceleration, and Hugging Face models.

Step 1: Python 3.10 Installation

Ubuntu 24.04 ships with Python 3.12 by default. pyannote.audio 3.1.1 was trained with torch==1.13.1 and CUDA 11.7, a combination that supports Python 3.10, so I install Python 3.10 first, following this guide:

How To Install Python 3.10 on Ubuntu 24.04

Step 2: Create Virtual Environment

stt@GU502DU:~$ python3.10 -m venv wav2vec2_env
stt@GU502DU:~$ source wav2vec2_env/bin/activate
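
As a quick sanity check after activation, the small sketch below confirms that the active interpreter is Python 3.10 and that it runs inside a virtual environment. The helper names is_expected_python and in_virtualenv are hypothetical and only for illustration; they rely on the standard sys module.

```python
import sys


def is_expected_python(version_info=sys.version_info, expected=(3, 10)):
    """Return True when the interpreter's major.minor matches `expected`."""
    return tuple(version_info[:2]) == tuple(expected)


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Inside a venv, sys.prefix differs from sys.base_prefix."""
    return prefix != base_prefix


if __name__ == "__main__":
    print("Python 3.10:", is_expected_python())
    print("Inside venv:", in_virtualenv())
```

If either line prints False, the environment was probably not activated or was created with a different interpreter.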

Step 3: Install PyTorch 2.5.1 (CUDA 12.1 Build)

My laptop has an NVIDIA GeForce GTX 1660, so I use a CUDA build of PyTorch. The GPU driver reports CUDA Version 12.8, which is new enough to run the cu121 (CUDA 12.1) build, so that is the one I install.

(wav2vec2_env) stt@GU502DU:~$ pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 \
  --extra-index-url https://download.pytorch.org/whl/cu121

Note on CUDA Compatibility

To check which CUDA version your GPU driver supports, run nvidia-smi and read the CUDA Version field in the header:

(wav2vec2_env) stt@GU502DU:~$ nvidia-smi
+------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07  Driver Version: 570.133.07  CUDA Version: 12.8  |
+------------------------------------------------------------------------+

Make sure that the CUDA version reported here (e.g., CUDA Version: 12.8) is greater than or equal to the CUDA version used by the installed PyTorch build (in this case, cu121 = CUDA 12.1).
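
The rule above can be sketched as a small comparison. This is only an illustration: the function names are hypothetical, the driver version is typed in by hand from the nvidia-smi header, and the wheel tag is assumed to follow the cuXYZ pattern used on the PyTorch download index.

```python
def parse_cuda_tag(tag):
    """Convert a PyTorch wheel tag such as 'cu121' into (12, 1)."""
    digits = tag[2:] if tag.startswith("cu") else tag
    # The last digit is the minor version, everything before it the major.
    return int(digits[:-1]), int(digits[-1])


def parse_version(text):
    """Convert a version string such as '12.8' into (12, 8)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def driver_supports_build(driver_cuda, wheel_tag):
    """True when the driver's CUDA version is >= the build's CUDA version."""
    return parse_version(driver_cuda) >= parse_cuda_tag(wheel_tag)


if __name__ == "__main__":
    # Values taken from this post: driver reports 12.8, wheel is cu121.
    print(driver_supports_build("12.8", "cu121"))
```

Once PyTorch is installed, torch.version.cuda reports the CUDA version of the build and torch.cuda.is_available() shows whether the GPU is actually usable.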

Step 4: Install pyannote.audio and Dependencies

(wav2vec2_env) stt@GU502DU:~$ pip install pyannote.audio==3.1.1 numpy scipy librosa \
huggingface_hub

Step 5: Install Transformers for Wav2Vec2

(wav2vec2_env) stt@GU502DU:~$ pip install transformers datasets --no-deps

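I add the --no-deps option so that pip does not replace the versions of packages I already installed, such as torch. The trade-off is that pip also skips any transformers or datasets dependencies that are not already present, so if an import fails later, install the missing package manually.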
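
Because --no-deps skips dependency resolution, it can help to confirm which versions actually ended up in the environment. The sketch below uses the standard importlib.metadata module; the helper name installed_versions and the package list are my own choices for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    names = ["torch", "torchaudio", "pyannote.audio", "transformers", "numpy"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A package reported as NOT INSTALLED is a candidate for a manual pip install.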

Step 6: Hugging Face Access Token

Create a Hugging Face access token from hf.co/settings/tokens and use it in your code:

from huggingface_hub import login
login("hf_your_token_here")
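
To keep the token out of the source file, one option is to read it from an environment variable. This is a minimal sketch, not part of the original setup: the helper read_hf_token is hypothetical, and HF_TOKEN is assumed to be the variable name you exported in your shell.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or raise if it is not set."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token
```

With this helper, the call above becomes login(read_hf_token()), and the same value can be passed wherever the script expects the token.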

Step 7: Accept User Conditions for the Gated Models

On Hugging Face, open the page of each gated model the script depends on, including pyannote/speaker-diarization-3.1, and click “Agree” to accept its terms.

Step 8: Run the following code (Wav2Vec2 + pyannote.audio)

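Below is a Python script that performs STT with Wav2Vec2 and diarization with pyannote.audio. It limits GPU memory fragmentation, frees cached GPU memory after diarization, and merges word-level results into sentence-like segments per speaker, without saving an intermediate file. I saved it as speaker_test6.py and ran it with:

(wav2vec2_env) stt@GU502DU:~$ python3 speaker_test6.py

The contents of speaker_test6.py: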

import os
# Set before torch initializes CUDA so the caching allocator picks it up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"

from pyannote.audio import Pipeline
from transformers import pipeline as hf_pipeline
import json
import torch

HUGGINGFACE_TOKEN = "your_token_here"

diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HUGGINGFACE_TOKEN
)
diarization_pipeline.to(torch.device("cuda"))
diarization = diarization_pipeline("day1.wav")
torch.cuda.empty_cache()

stt = hf_pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", chunk_length_s=10)
stt_result = stt("day1.wav", return_timestamps="word")
segments = stt_result["chunks"] if "chunks" in stt_result else [stt_result]

def get_speaker(start, end, diarization):
    # Return the first speaker whose turn covers more than half of the word.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = max(0, min(end, turn.end) - max(start, turn.start))
        if overlap > (end - start) * 0.5:
            return speaker
    # No turn covers the majority of the word.
    return "Unknown"

merged_sentences = []
current = None

for seg in segments:
    start, end = seg["timestamp"]
    speaker = get_speaker(start, end, diarization)
    text = seg["text"]

    # The first word opens the first segment.
    if current is None:
        current = {"speaker": speaker, "start": start, "end": end, "text": text}
        continue

    # Extend the segment while the speaker is unchanged and the pause is under 1 s.
    if speaker == current["speaker"] and start - current["end"] < 1.0:
        current["end"] = end
        current["text"] += " " + text
    else:
        # Speaker change or long pause: close the segment and start a new one.
        merged_sentences.append(current)
        current = {"speaker": speaker, "start": start, "end": end, "text": text}

if current:
    merged_sentences.append(current)

with open("wav2vec2_stt_sentences.json", "w") as f:
    json.dump(merged_sentences, f, indent=2)
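
The get_speaker function in the script returns the first turn that covers more than half of a word. A possible variation, sketched here and not what the script does, is to pick the turn with the largest overlap, which also labels words that straddle two turns. The helper best_speaker is hypothetical and works on plain (start, end, label) tuples so it can be tried without loading any model.

```python
def best_speaker(start, end, turns, default="Unknown"):
    """Return the label of the turn that overlaps [start, end] the most.

    `turns` is an iterable of (turn_start, turn_end, label) tuples.
    """
    best_label, best_overlap = default, 0.0
    for turn_start, turn_end, label in turns:
        # Negative values mean the intervals do not intersect at all.
        overlap = min(end, turn_end) - max(start, turn_start)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label


if __name__ == "__main__":
    turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
    # The word spans 1.8 s to 2.6 s and lies mostly in the second turn.
    print(best_speaker(1.8, 2.6, turns))
```

The tuples can be built from the pyannote result with [(turn.start, turn.end, speaker) for turn, _, speaker in diarization.itertracks(yield_label=True)].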

Output

The script writes wav2vec2_stt_sentences.json, a JSON list in which each entry pairs a merged segment of the transcript with its speaker label and its start and end times in seconds. A single entry looks like this:

{
  "speaker": "SPEAKER_00",
  "start": 1.25,
  "end": 5.6,
  "text": "Hello, I just had diarrhea for three days. It affected my daily life."
}
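
To read the result back as a plain transcript, a short sketch like the following may be enough. It assumes the file has the structure shown above (a list of objects with speaker, start, end, and text keys); the helper names format_entry and load_transcript are hypothetical.

```python
import json


def format_entry(entry):
    """Render one merged segment as '[start-end] SPEAKER: text'."""
    return (
        f"[{entry['start']:.2f}-{entry['end']:.2f}] "
        f"{entry['speaker']}: {entry['text']}"
    )


def load_transcript(path="wav2vec2_stt_sentences.json"):
    """Load the JSON written by the script and format every entry."""
    with open(path) as f:
        return [format_entry(entry) for entry in json.load(f)]
```

Calling print("\n".join(load_transcript())) after the script has run prints one line per merged segment.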

NumPy Compatibility Note

If you encounter NumPy 2.x errors such as "np.NaN was removed", downgrade NumPy to 1.x. I recommend pinning NumPy to 1.26.x when using pyannote.audio and PyTorch.

(wav2vec2_env) stt@GU502DU:~$ pip install numpy==1.26.4
