Transcribing Like a Boss, For No Cost
One question I’ve been asked a few times in the past
year is whether I know of a good tool to transcribe text from a video or audio
file. AWS offers Amazon Transcribe for this, but its free tier has a monthly
limit, after which it starts charging. There is now a fantastic free option in
the form of OpenAI’s Whisper.
With more and more audio and video content being
generated and published online, the ability to transcribe it quickly and
accurately is becoming increasingly important. OpenAI's Whisper
audio-to-text capability offers a powerful solution to this problem.
Whisper is a deep learning-based model trained on large
amounts of data to produce high-quality text transcriptions from audio. It has
been specifically designed to transcribe speech in various settings, including
noisy environments, and to handle multiple speakers and accents. The model has
been trained on a wide range of data, including publicly available audio
content, which means that it is well-suited for use in the field of OSINT.
Whisper is capable of processing large amounts of audio data
quickly and accurately and, currently at least, for free. Another advantage of
using Whisper for OSINT is its ability to handle multiple languages and
accents. This makes it possible to transcribe audio content from various
sources, regardless of the language spoken.
I installed Whisper on my Windows host system using the
command:
pip install -U openai-whisper
You can view the code for the project here: https://github.com/openai/whisper
Whisper also requires ffmpeg, the audio and video processing tool,
to be installed on your system and available on your PATH: https://ffmpeg.org/
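Before running anything, it can save some head-scratching to confirm that both pieces are actually in place. The snippet below is a minimal sketch, not part of Whisper itself: the function name check_whisper_prerequisites is my own, and it only checks that an ffmpeg executable is on the PATH and that a Python package named whisper can be imported.

```python
import importlib.util
import shutil


def check_whisper_prerequisites():
    """Return a dict saying which prerequisites were found on this machine."""
    return {
        # True if an "ffmpeg" executable can be located on the PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # True if a package named "whisper" is importable
        # (this is the package that openai-whisper installs).
        "whisper": importlib.util.find_spec("whisper") is not None,
    }


if __name__ == "__main__":
    for name, found in check_whisper_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either line reports MISSING, fix that first; Whisper relies on ffmpeg to decode the audio track from video files.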
Once installed, I tested it on a video from my personal
trainer Ben Canning. By default, Whisper uses the first 30 seconds of audio to
determine what language to use. Here it correctly detects English and starts
transcribing the audio.
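I ran Whisper from the command line, but the same package can also be called from Python. The following is a rough sketch of what that might look like, assuming the "base" model is good enough for your audio; transcribe_file and summarize are illustrative names of my own, while load_model and transcribe come from the Whisper package.

```python
def transcribe_file(path, model_name="base"):
    """Transcribe one audio or video file and return Whisper's result dict."""
    # Imported inside the function so that summarize() below can be used
    # and tested on a machine where Whisper is not installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads the model on first use
    return model.transcribe(path)  # language is auto-detected by default


def summarize(result, max_chars=80):
    """Build a one-line summary from a dict with 'language' and 'text' keys."""
    text = result["text"].strip()
    if len(text) > max_chars:
        text = text[:max_chars].rstrip() + "..."
    return f"[{result['language']}] {text}"


if __name__ == "__main__":
    import sys

    # Usage: python transcribe_sketch.py path/to/video.mp4
    print(summarize(transcribe_file(sys.argv[1])))
```

The result dictionary also carries a "segments" list with start and end times, which is where the timestamps come from.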
Once Whisper was finished processing the video, it generated
multiple text files with the transcription. Some have just the text; others
contain the text along with the timestamps, similar to the view produced in the
terminal window.
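To give a feel for how the timestamped files relate to the plain text, here is a small sketch that turns a list of segments into subtitle-style text. It assumes each segment is a dictionary with "start", "end" (in seconds) and "text" keys, which is the shape of the segments in the result returned by Whisper's Python API; the function names are mine, and this is not Whisper's own writer.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (the SubRip timestamp style)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn segments (dicts with 'start', 'end', 'text') into SRT-style text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        times = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{times}\n{seg['text'].strip()}\n")
    # Blocks are separated by a blank line, as subtitle players expect.
    return "\n".join(blocks)
```

Having the segments in this form is handy when you want to jump straight to the point in a video where something was said.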
Whisper handled the video with zero issues, so I decided to
try one in a language other than English and with lower-quality audio. I picked
one of the videos of Juan Joya Borja, AKA “Spanish Laughing Guy”.
Here, Whisper incorrectly identifies the language as
Galician, which is understandable considering its similarity to Spanish and
Portuguese.
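If you are curious how close a call like this was, the Python API exposes the language probabilities directly. The sketch below follows the language-detection example in the Whisper README; detect_language_probs and top_languages are names I made up, and the choice of the "base" model is an assumption.

```python
def detect_language_probs(path, model_name="base"):
    """Return Whisper's language probabilities for the start of a file."""
    # Imported inside the function so that top_languages() below can be
    # used on a machine where Whisper is not installed.
    import whisper

    model = whisper.load_model(model_name)
    # Load the audio and pad or cut it to the 30-second window Whisper expects.
    audio = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)
    return probs  # dict mapping language code -> probability


def top_languages(probs, k=3):
    """Return the k most likely (language code, probability) pairs."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    import sys

    for code, prob in top_languages(detect_language_probs(sys.argv[1])):
        print(f"{code}: {prob:.2%}")
```

When two closely related languages show similar probabilities, that is a good hint to set the language yourself rather than trust the guess.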
We can force Whisper to use a specific language with the “--language” option, for example “--language Spanish”.
Whisper is capable of handling a large number of languages; the full
list appears in the output of “whisper --help”.
As if all this wasn’t enough, Whisper can translate the audio into English for you instead of transcribing it.
For most uses, I would stick to having Whisper transcribe and
then using a dedicated translation engine like Google Translate or DeepL to
translate the text, but I can see use cases where having the translation taken
care of in “real-time” would be advantageous.
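If you end up running Whisper over many files, it may be worth wrapping the command line in a few lines of Python. This is only a sketch under my own assumptions: build_whisper_command and run_whisper are hypothetical helper names, and the only Whisper options used are --language, --task and --model from the command-line tool.

```python
import subprocess


def build_whisper_command(path, language=None, task="transcribe", model=None):
    """Build the argument list for one run of the whisper command-line tool."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    command = ["whisper", path, "--task", task]
    if language is not None:
        # Skips auto-detection, e.g. language="Spanish".
        command += ["--language", language]
    if model is not None:
        command += ["--model", model]
    return command


def run_whisper(path, **options):
    """Run whisper on one file and raise if it exits with an error."""
    return subprocess.run(build_whisper_command(path, **options), check=True)


if __name__ == "__main__":
    # Transcribe in Spanish, then produce an English translation of the same file.
    print(build_whisper_command("clip.mp4", language="Spanish"))
    print(build_whisper_command("clip.mp4", language="Spanish", task="translate"))
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces. The file name clip.mp4 is just a placeholder.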
This capability has countless uses, including transcribing
audio interviews or statements and transcribing or translating video and audio
content posted online. Having this level of capability available for free
makes Whisper an extremely handy addition to the OSINT practitioner’s toolbox.