A powerful Automatic Speech Recognition (ASR) pipeline built with WhisperX, supporting:

- Transcription with WhisperX (large-v3) and word-level alignment
- Speaker diarization (automatic or fixed speaker count)
- Optional grammar correction with Gemini
- Vietnamese-to-English translation
- Export to SRT, VTT, TXT, and JSON
## Installation

```bash
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/m-bain/whisperx.git
pip install google-generativeai
pip install deep-translator
pip install scikit-learn librosa soundfile
```

FFmpeg is required for audio preprocessing. Download it from https://ffmpeg.org/.
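Before running the pipeline, it can help to confirm the dependencies above actually resolved. A hedged sketch (this helper is illustrative and not part of `new.py` itself):

```python
# Check that the packages installed above can be imported and that FFmpeg
# is on PATH. Purely a convenience sketch, not the script's own code.
import importlib.util
import shutil

def can_import(name: str) -> bool:
    """True if `name` resolves to an installed module."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:  # parent package missing (e.g. "google")
        return False

def check_dependencies() -> dict:
    packages = ["torch", "whisperx", "google.generativeai", "deep_translator"]
    status = {pkg: can_import(pkg) for pkg in packages}
    status["ffmpeg"] = shutil.which("ffmpeg") is not None
    return status

print(check_dependencies())
```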
## Basic Usage

```bash
python new.py --input "audio.mp3"
```
Full options:

```bash
python new.py \
  --input "test2.mp3" \
  --model large-v3 \
  --batch_size 8 \
  --diarize \
  --gemini_batch_size 30 \
  --out_dir "outputs"
```
## Gemini Grammar Correction (Optional)

Add your API key to enable grammar correction:

```bash
python new.py --input "audio.mp3" --gemini_key YOUR_API_KEY
```

If no key is provided, the system skips the correction step.
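The skip-or-correct logic can be sketched as below. The function name, the batching scheme (matching `--gemini_batch_size`), and the injected `correct_fn` are illustrative assumptions, not the script's actual API:

```python
# Hedged sketch of the optional correction step: with no API key, raw
# segments pass through unchanged; with a key, they are corrected in
# batches (as with --gemini_batch_size 30).
from typing import Callable, Optional

def correct_segments(
    segments: list[str],
    gemini_key: Optional[str],
    correct_fn: Callable[[list[str]], list[str]],
    batch_size: int = 30,
) -> list[str]:
    """Apply grammar correction in batches, or skip when no key is set."""
    if not gemini_key:
        return segments  # no key: skip the correction step entirely
    corrected: list[str] = []
    for i in range(0, len(segments), batch_size):
        corrected.extend(correct_fn(segments[i : i + batch_size]))
    return corrected
```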
## Speaker Diarization

Enable with:

```bash
--diarize
```

Optionally fix the speaker count:

```bash
--num_speakers 2
```

If `--num_speakers` is not set, the system estimates the number of speakers automatically via clustering.
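One common way to estimate a speaker count by clustering is to score several candidate counts and keep the best. A minimal sketch using scikit-learn (already in the install list); the real script may use a different clustering criterion:

```python
# Cluster speaker embeddings with KMeans for each candidate count and
# keep the count with the highest silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_num_speakers(embeddings: np.ndarray, max_speakers: int = 8) -> int:
    """Pick the speaker count whose clustering scores highest."""
    best_k, best_score = 1, -1.0
    upper = min(max_speakers, len(embeddings) - 1)
    for k in range(2, upper + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```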
## Languages

Input and output languages are set inside the code:

```python
INPUT_LANG = "vi"
OUTPUT_LANG = "en"
```
## Outputs

All outputs are saved in `--out_dir`:

| File | Description |
|---|---|
| `.json` | Full structured segments |
| `.srt` | Subtitles (EN + VI) |
| `.vtt` | Web subtitles |
| `_raw.txt` | Original transcription |
| `_corrected.txt` | Grammar-corrected text |
| `_translated_en.txt` | Translated text |
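A hedged sketch of what one entry in the `.json` output might look like; the exact field names depend on the script:

```python
# One structured segment: timing, speaker label, original text, and
# translation. Field names are assumptions, not the script's schema.
import json

segment = {
    "start": 1.0,
    "end": 3.0,
    "speaker": "SPEAKER_00",
    "text": "Xin chào mọi người",
    "text_en": "Hello everyone",
}
print(json.dumps(segment, ensure_ascii=False))
```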
Example SRT output:

```
1
00:00:01,000 --> 00:00:03,000
[SPEAKER_00] Hello everyone
[SPEAKER_00] Xin chào mọi người
```
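The SRT sample above uses `HH:MM:SS,mmm` timestamps. A small helper showing how a time in seconds maps to that format (illustrative; not necessarily how the exporter does it):

```python
# Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
def srt_timestamp(seconds: float) -> str:
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)   # whole hours
    m, ms = divmod(ms, 60_000)      # whole minutes
    s, ms = divmod(ms, 1_000)       # whole seconds, leftover milliseconds
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(1.0))  # 00:00:01,000
```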
## Pipeline

```
Audio → FFmpeg → WhisperX → Alignment
      → Diarization → (Gemini Correction)
      → Translation → Export (SRT/VTT/TXT/JSON)
```
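The stages above can be sketched as a simple function chain; each stage below is a stand-in (the real script calls FFmpeg, WhisperX, the diarizer, Gemini, and the translator at these points):

```python
# Pass data through each pipeline stage in order. The stages here are
# illustrative lambdas, not the actual pipeline implementations.
from typing import Any, Callable

def run_pipeline(data: Any, stages: list[Callable[[Any], Any]]) -> Any:
    """Feed `data` through each stage and return the final result."""
    for stage in stages:
        data = stage(data)
    return data

stages = [
    lambda path: f"decoded({path})",        # FFmpeg
    lambda audio: f"transcribed({audio})",  # WhisperX + alignment
    lambda text: f"translated({text})",     # deep-translator
]
print(run_pipeline("audio.mp3", stages))
```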
## Example

```bash
python new.py \
  --input "meeting.mp3" \
  --diarize \
  --gemini_key YOUR_KEY \
  --out_dir results
```