Speech-to-Text and Speaker Diarization with Wav2Vec2 and pyannote.audio 3.1 on Ubuntu 24.04

Wav2Vec2

In this post, I walk through how to use Wav2Vec2 for speech-to-text (STT) and pyannote.audio 3.1 for speaker diarization on Ubuntu 24.04. I'll use a Python virtual environment with Python 3.10, CUDA GPU acceleration, and Hugging Face models.

Step 1: Install Python 3.10

Ubuntu 24.04 comes with Python 3.12 by default, but since pyannote.audio 3.1.1 was trained with torch==1.13.1 and CUDA 11.7, which support Python 3.10, I need to install Python 3.10.
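The deadsnakes PPA is one convenient way to get Python 3.10 on Ubuntu 24.04 (any other Python 3.10 installation works just as well):

stt@GU502DU:~$ sudo add-apt-repository ppa:deadsnakes/ppa
stt@GU502DU:~$ sudo apt update
stt@GU502DU:~$ sudo apt install python3.10 python3.10-venv python3.10-dev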

Step 2: Create Virtual Environment

stt@GU502DU:~$ python3.10 -m venv wav2vec2_env
stt@GU502DU:~$ source wav2vec2_env/bin/activate
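
Before installing anything, you can confirm that the virtual environment picked up the right interpreter and, optionally, upgrade pip:

(wav2vec2_env) stt@GU502DU:~$ python --version   # should print Python 3.10.x
(wav2vec2_env) stt@GU502DU:~$ pip install --upgrade pip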

Step 3: Install PyTorch 2.5.1 with CUDA 12.1

My laptop is equipped with an NVIDIA GeForce GTX 1660, so I chose to install and use CUDA. Since the GPU driver reports CUDA Version 12.8, I installed the cu121 build of PyTorch, which is compatible with it.

(wav2vec2_env) stt@GU502DU:~$ pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 \
  --extra-index-url https://download.pytorch.org/whl/cu121

Note on CUDA Compatibility

To ensure compatibility between your installed CUDA version and the PyTorch build, run the following command:

(wav2vec2_env) stt@GU502DU:~$ nvidia-smi
+------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07  Driver Version: 570.133.07  CUDA Version: 12.8  |
+------------------------------------------------------------------------+

Make sure that the CUDA version reported here (e.g., CUDA Version: 12.8) is greater than or equal to the CUDA version used by the installed PyTorch build (in this case, cu121 = CUDA 12.1).
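
You can also confirm from the Python side which CUDA version your installed PyTorch build was compiled against, and whether the GPU is actually visible to it (for the install above this should print 12.1 and True):

(wav2vec2_env) stt@GU502DU:~$ python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"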

Step 4: Install pyannote.audio and Dependencies

(wav2vec2_env) stt@GU502DU:~$ pip install pyannote.audio==3.1.1 numpy scipy librosa \
huggingface_hub
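
If you want to confirm which version actually landed in the environment:

(wav2vec2_env) stt@GU502DU:~$ pip show pyannote.audio | grep Version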

Step 5: Install Transformers for Wav2Vec2

(wav2vec2_env) stt@GU502DU:~$ pip install transformers datasets --no-deps

To make sure my specific versions of packages like torch are not overridden, I add the --no-deps option.
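
Because --no-deps also skips transformers' own runtime dependencies, a few of them may need to be installed by hand if imports fail. The exact list depends on your transformers version; these are the usual ones (my assumption here, adjust as needed):

(wav2vec2_env) stt@GU502DU:~$ pip install tokenizers safetensors regex requests tqdm pyyaml filelock packaging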

Step 6: Hugging Face Access Token

Create a Hugging Face access token from hf.co/settings/tokens and use it in your code:

from huggingface_hub import login
login("hf_your_token_here")

Step 7: Accept User Conditions for the Following Models

Visit the following model pages and click “Agree” to accept the user conditions (the diarization pipeline cannot be downloaded otherwise):

https://huggingface.co/pyannote/speaker-diarization-3.1
https://huggingface.co/pyannote/segmentation-3.0 (used internally by the diarization pipeline)

Step 8: Run the Following Code (Wav2Vec2 + pyannote.audio)

Below is a Python script that performs STT with Wav2Vec2 and diarization with pyannote.audio. It includes GPU memory optimization and merges word-level results into full sentences, without saving an intermediate file:

(wav2vec2_env) stt@GU502DU:~$ python3 speaker_test6.py
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"  # limit allocator split size to reduce GPU memory fragmentation

from pyannote.audio import Pipeline
from transformers import pipeline as hf_pipeline
import json
import torch

HUGGINGFACE_TOKEN = "your_token_here"

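# Load the gated pyannote diarization pipeline (requires the accepted user conditions above),
# run it on the GPU, then release cached GPU memory before loading the STT model.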
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HUGGINGFACE_TOKEN
)
diarization_pipeline.to(torch.device("cuda"))
diarization = diarization_pipeline("day1.wav")
torch.cuda.empty_cache()

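# Word-level speech recognition with Wav2Vec2; chunk_length_s=10 processes the audio in 10-second chunks to keep memory usage low.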
stt = hf_pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", chunk_length_s=10)
stt_result = stt("day1.wav", return_timestamps="word")
segments = stt_result["chunks"] if "chunks" in stt_result else [stt_result]

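# Assign a word's time span to the diarization turn that overlaps more than half of it.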
def get_speaker(start, end, diarization):
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = max(0, min(end, turn.end) - max(start, turn.start))
        if overlap > (end - start) * 0.5:
            return speaker
    return "Unknown"

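# Merge consecutive words from the same speaker (gaps shorter than 1 second) into sentence-level segments.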
merged_sentences = []
current = None

for seg in segments:
    speaker = get_speaker(seg["timestamp"][0], seg["timestamp"][1], diarization)
    start = seg["timestamp"][0]
    end = seg["timestamp"][1]
    text = seg["text"]

    if current is None:
        current = {"speaker": speaker, "start": start, "end": end, "text": text}
        continue

    if speaker == current["speaker"] and start - current["end"] < 1.0:
        current["end"] = end
        current["text"] += " " + text
    else:
        merged_sentences.append(current)
        current = {"speaker": speaker, "start": start, "end": end, "text": text}

if current:
    merged_sentences.append(current)

with open("wav2vec2_stt_sentences.json", "w") as f:
    json.dump(merged_sentences, f, indent=2)

Output

The script generates a wav2vec2_stt_sentences.json file containing sentence-level segments with aligned transcriptions and speaker labels, for example:

[
  {
    "speaker": "SPEAKER_00",
    "start": 1.25,
    "end": 5.6,
    "text": "Hello, I just had diarrhea for three days. It affected my daily life."
  }
]
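
The JSON array is easy to post-process. As an optional follow-up (not part of the pipeline above), a small script like this prints a readable, speaker-labeled transcript:

import json

# Load the merged segments produced by the script above and print them as a transcript.
with open("wav2vec2_stt_sentences.json") as f:
    sentences = json.load(f)

for s in sentences:
    print(f'[{s["start"]:7.2f}-{s["end"]:7.2f}] {s["speaker"]}: {s["text"]}')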

NumPy Compatibility Note

If you encounter NumPy 2.x errors such as "np.NaN was removed", downgrade NumPy to 1.x. I recommend keeping NumPy pinned at 1.26.x when using pyannote.audio and PyTorch:

(wav2vec2_env) stt@GU502DU:~$ pip install numpy==1.26.4
