ASR_test

ASR Test - Speech-to-Text + Diarization + Translation

A powerful Automatic Speech Recognition (ASR) pipeline built with WhisperX, supporting:


Features


Installation

1. Install PyTorch (CUDA 12.8)

pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

2. Install dependencies

pip install git+https://github.com/m-bain/whisperx.git
pip install google-generativeai
pip install deep-translator
pip install scikit-learn librosa soundfile

3. Install FFmpeg

Required for audio preprocessing Download: https://ffmpeg.org/


Usage

Basic command

python new.py --input "audio.mp3"
python new.py \
  --input "test2.mp3" \
  --model large-v3 \
  --batch_size 8 \
  --diarize \
  --gemini_batch_size 30 \
  --out_dir "outputs"

Optional: Gemini Correction

Add your API key to enable grammar correction:

python new.py --input "audio.mp3" --gemini_key YOUR_API_KEY

If not provided:

System will skip correction step.


Speaker Diarization

Enable with:

--diarize

Optional:

--num_speakers 2

If not set:

System automatically estimates number of speakers using clustering.


Language Settings

Inside code:

INPUT_LANG = "vi"
OUTPUT_LANG = "en"

Output Files

All outputs are saved in --out_dir:

File Description
.json Full structured segments
.srt Subtitles (EN + VI)
.vtt Web subtitles
_raw.txt Original transcription
_corrected.txt Grammar-corrected text
_translated_en.txt Translated text

Subtitle Format Example

1
00:00:01,000 --> 00:00:03,000
[SPEAKER_00] Hello everyone
[SPEAKER_00] Xin chào mọi người

Notes


Pipeline Overview

Audio → FFmpeg → WhisperX → Alignment
       → Diarization → (Gemini Correction)
       → Translation → Export (SRT/VTT/TXT/JSON)

Example

python new.py \
  --input "meeting.mp3" \
  --diarize \
  --gemini_key YOUR_KEY \
  --out_dir results

TODO

```