Introducing Ancestral Audio

Using whisper and AI to transcribe audio and generate thematic clips

Two decades ago I recorded a conversation with my grandmother on an old answering machine as part of an undergraduate Anthropology project. That audio has been bit-rotting away on numerous hard drives since then, shared with only a handful of people through some cassette copies I made at the time. It's a project that has stuck with me all these years later. I have always thought it would be awesome if more people could listen to our conversation, but sifting through thirty minutes of garbled audio is tedious. Then came whisper and ChatGPT.

Using these tools allowed me to quickly create a series of short audio snippets, complete with their transcription. I then re-used a UI I built for another project and loaded in the data. You can see the final output at Ancestral Audio. Read on to learn more about the process.

Cleaning the Audio and Generating the Clips

I attempted to clean up the scratchy, analog audio using AI, but the results were unimpressive. I ended up cleaning it with some simple filters in Audacity, mainly Noise Reduction, Click Removal, and Normalize. I then fed this audio through whisper to generate a basic transcription.

~/main -m ./models/ggml-small.en-tdrz.bin -f "${input_audio}" > "${output_file}"

This creates a text file with timestamps, broken out for each speaker.

[00:00:00.000 --> 00:00:07.000] [Phone rings]
[00:00:07.000 --> 00:00:11.480] Hello?
[00:00:11.480 --> 00:00:15.480] Hello, Grandma.
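Lines in this shape are easy to slice with standard tools. The following is a minimal sketch, assuming every line follows the exact `[HH:MM:SS.mmm --> HH:MM:SS.mmm] text` layout shown above; the function names are hypothetical and not part of whisper.

```shell
# Hypothetical helpers for a transcript whose lines look like:
#   [HH:MM:SS.mmm --> HH:MM:SS.mmm] text

# Print only the text, dropping the leading "[start --> end]" bracket.
strip_timestamps() {
  sed -E 's/^\[[0-9:.]+ --> [0-9:.]+\][[:space:]]*//'
}

# Print the lines whose segment lies entirely inside [from, to].
# Fixed-width HH:MM:SS.mmm timestamps compare correctly as plain strings.
segments_between() {
  awk -v from="$1" -v to="$2" '
    /^\[[0-9:.]+ --> [0-9:.]+\]/ {
      start = substr($0, 2, 12)    # characters 2-13 hold the start time
      end   = substr($0, 19, 12)   # characters 19-30 hold the end time
      if (start >= from && end <= to) print
    }
  '
}
```

For example, `segments_between 00:00:07.000 00:00:15.480 < transcript.txt | strip_timestamps` would print just the two spoken lines from the sample, where `transcript.txt` stands in for whatever file the transcription was written to.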

Using the transcription, I manually fed chunks into ChatGPT to generate thematic snippets and timestamps. This process gave me a good starting point, but it needed some refining. I then adjusted the timestamps by hand to better reflect the conversation and to cut out extraneous text.
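Splitting the transcript into paste-sized pieces can be scripted. This is only a sketch: chunking by line count, the default of 100 lines, and the `chunk_transcript` name are all assumptions for illustration, not a record of how the chunks were actually chosen.

```shell
# Hypothetical helper: split a transcript into chunks of at most N lines,
# written as <prefix>aa, <prefix>ab, and so on.
# N defaults to 100, an arbitrary guess at a comfortable prompt size.
chunk_transcript() {
  transcript=$1
  prefix=${2:-chunk_}
  lines_per_chunk=${3:-100}
  split -l "$lines_per_chunk" "$transcript" "$prefix"
}
```

Because `split` breaks on line boundaries, each timestamped segment stays whole, although a theme that straddles two chunks would still need to be stitched together by hand.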

## Sample ChatGPT Output

Life on the Farm and Daily Routine

 - Start: 00:25:51.520
 - Stop: 00:29:17.520
 - Short Name: Farm Life
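Output in this shape can be turned into machine-readable rows. Here is a rough sketch, assuming every theme uses exactly the `Start`, `Stop`, and `Short Name` bullets shown above, in that order; `parse_clips` and the slug format are hypothetical choices.

```shell
# Hypothetical helper: read ChatGPT-style theme blocks on stdin and print one
# "start<TAB>stop<TAB>slug" row per theme. The slug is the short name,
# lowercased, with runs of other characters collapsed to hyphens.
parse_clips() {
  awk '
    /^[[:space:]]*- Start:/      { start = $NF }
    /^[[:space:]]*- Stop:/       { stop = $NF }
    /^[[:space:]]*- Short Name:/ {
      name = $0
      sub(/^[[:space:]]*- Short Name:[[:space:]]*/, "", name)
      slug = tolower(name)
      gsub(/[^a-z0-9]+/, "-", slug)
      gsub(/^-|-$/, "", slug)
      printf "%s\t%s\t%s\n", start, stop, slug
    }
  '
}
```

Timestamps produced by a language model still deserve a manual check against the audio before anything is cut.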

Finally, I created the actual audio clips from the main Audacity export using ffmpeg, and re-generated the transcription for each clip, without the timestamps this time.

ffmpeg -i "${input_audio}" -ss "${start_time}" -to "${end_time}" -c copy "${output_file}"
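With more than a few clips, the same command can be run in a loop. The sketch below assumes a tab-separated list of `start`, `stop`, and a file-name slug on standard input; that list format, the `cut_clips` name, and the `DRY_RUN` switch are hypothetical conveniences.

```shell
# Hypothetical helper: read "start<TAB>stop<TAB>slug" rows on stdin and cut
# one clip per row from the audio file given as $1.
# Set DRY_RUN=1 to print the ffmpeg commands instead of running them.
cut_clips() (
  input_audio=$1
  ext=${input_audio##*.}          # reuse the source extension for each clip
  tab=$(printf '\t')
  while IFS="$tab" read -r start_time end_time slug; do
    [ -n "$slug" ] || continue    # skip blank or incomplete rows
    output_file="${slug}.${ext}"
    if [ "${DRY_RUN:-0}" = 1 ]; then
      printf 'ffmpeg -nostdin -i "%s" -ss "%s" -to "%s" -c copy "%s"\n' \
        "$input_audio" "$start_time" "$end_time" "$output_file"
    else
      # -nostdin keeps ffmpeg from consuming the rows still waiting on stdin.
      ffmpeg -nostdin -i "$input_audio" -ss "$start_time" -to "$end_time" \
        -c copy "$output_file"
    fi
  done
)
```

Running `DRY_RUN=1 cut_clips main.wav < clips.tsv` first (both file names are placeholders) shows what would be executed. Note that with `-c copy` the cut points land on packet boundaries rather than exact samples, which is usually close enough for spoken audio.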

Again, the final product is at Ancestral Audio.

Libraries across the globe store thousands, if not tens of thousands, of hours of archival audio. Using AI tools to split this audio into manageable clips could help bring these voices to the masses and preserve their message for future generations. And while AI only assisted the manual process behind this project, a fully automated pipeline is definitely feasible.

~ Malcolm Meyer