Best Audio Chunk Sizes for Transcription Services

If you're using an audio chunker for transcription, choosing the right segment size is critical. Too large, and your transcription API may fail or produce poor results. Too small, and you'll waste time and money on unnecessary API calls.

This guide covers the optimal audio chunk sizes for transcription across all major services, including OpenAI Whisper, Google Speech-to-Text, AWS Transcribe, and more.

Transcription API Limits: Quick Reference

Here's a comprehensive comparison of file limits for popular transcription services:

Service	Max File Size	Max Duration	Recommended Chunk
OpenAI Whisper API	25 MB	Bounded by the 25 MB limit (varies with bitrate)	10-15 minutes
Google Speech-to-Text	10 MB (sync)	1 min (sync) / 480 min (async)	Up to 1 min (sync) / 30-60 min (async)
AWS Transcribe	2 GB	4 hours	30-60 minutes
AssemblyAI	5 GB	Unlimited	30-60 minutes
Rev.ai	2 GB	17 hours	30-60 minutes
Deepgram	2 GB	Unlimited	30-60 minutes

OpenAI Whisper API: Optimal Chunk Size

The Whisper API is one of the most popular transcription services, but it has a strict 25 MB file size limit. Here's how to optimize your audio for Whisper:

Whisper API Recommendations

Optimal chunk size: 10-15 minutes of audio
File format: MP3 at 128kbps (best size/quality ratio)
Max file size: 25 MB

At 128kbps MP3, audio takes up about 0.96 MB per minute, so roughly 26 minutes fit into the 25 MB limit. However, we recommend 10-15 minute chunks for several reasons:

Better error handling - if one chunk fails, you only need to retry that segment
Accuracy - in practice, shorter segments often transcribe more reliably, though this varies with the recording
Easier to manage timestamps when combining transcripts
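
The arithmetic behind the size limit can be sketched as follows. This is a minimal example that assumes constant-bitrate audio, treats 1 MB as 1,000,000 bytes, and ignores container overhead; the function name `max_minutes` is only illustrative.

```python
def max_minutes(limit_mb: float, bitrate_kbps: float) -> float:
    """Minutes of constant-bitrate audio that fit in limit_mb.

    Assumes 1 MB = 1,000,000 bytes and ignores container/metadata overhead.
    """
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_mb * 1_000_000 / bytes_per_second / 60


if __name__ == "__main__":
    whisper_limit_mb = 25  # file size limit quoted in this guide
    for kbps in (64, 128, 320):
        print(f"{kbps} kbps: {max_minutes(whisper_limit_mb, kbps):.1f} min")
```

Under these assumptions, 128 kbps works out to about 26 minutes, which leaves a 10-15 minute chunk comfortably below the limit.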

Pro Tip: When using ChunkAudio as your audio chunker for transcription, enable Smart Silence Detection. This ensures your chunks don't cut off mid-sentence, which improves transcription accuracy.

Google Speech-to-Text: Chunk Size Guide

Google offers two transcription modes with very different limits:

Synchronous Recognition

Max duration: 1 minute
Max file size: 10 MB
Best for: Short clips, real-time applications

Asynchronous Recognition

Max duration: 480 minutes (8 hours)
File must be in Google Cloud Storage
Best for: Long-form content

For synchronous transcription, you'll need to cut audio into segments of 1 minute or less. For async, larger chunks of 30-60 minutes work well.

AWS Transcribe: Chunk Size Guide

AWS Transcribe is more lenient with file sizes but has a 4-hour duration limit:

AWS Transcribe Recommendations

Optimal chunk size: 30-60 minutes
Max file size: 2 GB
Max duration: 4 hours
Supported formats: MP3, MP4, WAV, FLAC, OGG, AMR, WebM

How to Split Audio for Transcription

Here's the recommended workflow for preparing long audio files for transcription:

Analyze your audio: Check the total duration and file size
Choose your transcription service: Different services have different limits
Calculate chunk size: Use the table above to determine optimal segment length
Use an audio chunker: Split your audio into equal parts using ChunkAudio
Enable silence detection: Ensure cuts happen at natural pauses
Process and combine: Transcribe each chunk, then combine the results
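
As a rough illustration of the "calculate chunk size" and "split into equal parts" steps, the sketch below plans equal-length chunks that stay under a chosen maximum. The function name `plan_chunks` is hypothetical and is not part of ChunkAudio; it only computes boundaries and does not touch the audio itself.

```python
import math


def plan_chunks(total_seconds: float, max_chunk_seconds: float) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into the fewest equal parts
    that are each no longer than max_chunk_seconds."""
    if total_seconds <= 0 or max_chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / max_chunk_seconds)
    length = total_seconds / count
    # Build the boundaries once so the last chunk ends exactly at total_seconds.
    boundaries = [i * length for i in range(count)] + [total_seconds]
    return list(zip(boundaries[:-1], boundaries[1:]))


if __name__ == "__main__":
    # A 100-minute recording with a 15-minute ceiling per chunk.
    for start, end in plan_chunks(100 * 60, 15 * 60):
        print(f"{start:8.1f}s -> {end:8.1f}s")
```

When silence detection is enabled, the real cut points will shift slightly from these planned boundaries, so it helps to leave some headroom below the service limit.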

Important: When combining transcripts from multiple chunks, pay attention to the segment boundaries. Even with silence detection, you may need to manually check for repeated or cut-off words at chunk boundaries.
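
One way to catch repeated words at a boundary is to compare the last few words of one transcript with the first few words of the next. The sketch below is a simple heuristic, not a feature of any transcription API; `join_without_overlap` is an illustrative name, and a genuine repetition in the speech would also be removed, so a manual check is still worthwhile.

```python
def join_without_overlap(left: str, right: str, max_overlap: int = 5) -> str:
    """Join two transcripts, dropping words at the start of `right`
    that repeat the words at the end of `left`."""
    left_words = left.split()
    right_words = right.split()
    limit = min(max_overlap, len(left_words), len(right_words))

    def normalize(words):
        # Compare case-insensitively and ignore trailing punctuation.
        return [w.lower().strip(".,!?") for w in words]

    # Try the longest possible overlap first.
    for size in range(limit, 0, -1):
        if normalize(left_words[-size:]) == normalize(right_words[:size]):
            right_words = right_words[size:]
            break
    return " ".join(left_words + right_words)


if __name__ == "__main__":
    print(join_without_overlap("we shipped the new release",
                               "the new release on Friday"))
```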

File Format Considerations

The file format affects both size and compatibility:

Format	Size per Minute	Compatibility	Recommendation
MP3 128kbps	~1 MB	Universal	Best for most APIs
MP3 320kbps	~2.4 MB	Universal	Better quality if needed
WAV 16-bit (44.1 kHz stereo)	~10 MB	Universal	Avoid - too large
FLAC	~5 MB	Most APIs	Good for quality priority

For most transcription use cases, MP3 at 128kbps offers the best balance of file size and audio quality. Transcription accuracy is rarely affected by lossy compression at this bitrate for speech.
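
The WAV figure in the table depends heavily on sample rate and channel count. As a minimal sketch, uncompressed PCM size can be estimated as sample rate × bytes per sample × channels; the function name below is illustrative, and the result ignores the small WAV header.

```python
def pcm_mb_per_minute(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Approximate size of one minute of uncompressed PCM audio,
    in MB (1 MB = 1,000,000 bytes), ignoring the file header."""
    bytes_per_second = sample_rate_hz * (bit_depth / 8) * channels
    return bytes_per_second * 60 / 1_000_000


if __name__ == "__main__":
    print(f"44.1 kHz stereo, 16-bit: {pcm_mb_per_minute(44_100, 16, 2):.2f} MB/min")
    print(f"16 kHz mono, 16-bit:     {pcm_mb_per_minute(16_000, 16, 1):.2f} MB/min")
```

CD-quality stereo comes out at roughly 10.6 MB per minute, while 16 kHz mono, a common choice for speech, is under 2 MB per minute. If a service requires WAV, downmixing and resampling may keep chunks within its limits.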

Handling Long Recordings

For very long recordings (4+ hours), consider this workflow:

Split into 30-minute chunks using ChunkAudio
Transcribe chunks in parallel (most APIs accept concurrent requests, subject to their rate limits)
Use timestamps to align and merge transcripts
Review boundary points for accuracy
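
The parallel step might look like the sketch below. `transcribe_chunk` is a hypothetical placeholder that you would replace with a real call to your transcription service, and the worker count is an assumption that should be tuned to your provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_chunk(path: str) -> str:
    """Hypothetical placeholder: swap in a real request to your
    transcription service and return its text."""
    return f"[transcript of {path}]"


def transcribe_all(paths: list[str], max_workers: int = 4) -> list[str]:
    """Transcribe chunks concurrently and return results in input order."""
    # Executor.map yields results in the order of `paths`,
    # even when the underlying calls finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe_chunk, paths))


if __name__ == "__main__":
    chunk_paths = [f"chunk_{i:03d}.mp3" for i in range(1, 5)]
    for text in transcribe_all(chunk_paths):
        print(text)
```

Threads suit this workload because each call mostly waits on the network. Keeping results in input order makes the later merge step straightforward.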

API-Specific Requirements and Recommendations

OpenAI Whisper API

Whisper accepts files up to 25 MB, so the maximum duration depends on the bitrate of your file. Accuracy can degrade on long segments, so the sweet spot for Whisper is 10-15 minute chunks: long enough for context but short enough to stay accurate and cheap to retry. Whisper also benefits from chunks that start and end at sentence boundaries, which ChunkAudio's silence detection helps achieve.

Google Cloud Speech-to-Text

Google's API supports synchronous recognition for audio up to 1 minute and asynchronous recognition for longer files (up to 480 minutes). For best results, split audio into chunks of 1 minute or less for synchronous recognition (fastest response) or 30-60 minute chunks for asynchronous processing. Google handles multiple formats but recommends FLAC or LINEAR16 for best accuracy.

Amazon Transcribe

Amazon accepts files up to 2 GB and 4 hours in length. While it handles long files well, splitting into 30-60 minute segments (or 15-30 minutes when turnaround time matters most) allows parallel processing, which can substantially reduce total transcription time. This is especially valuable for multi-hour recordings like conferences or depositions.

AssemblyAI

AssemblyAI is optimized for longer audio and handles files up to several hours. However, for real-time or near-real-time transcription needs, splitting into 5-minute chunks enables streaming-style processing with faster initial results.

Optimizing Transcription Accuracy Through Smart Splitting

Why Chunk Boundaries Matter

Where you cut your audio directly impacts transcription quality. Cutting mid-sentence forces the transcription engine to start a new segment without context, often producing errors in the first few words. ChunkAudio's Smart Silence Detection finds natural pauses, such as the gaps between sentences or speakers, for cleaner cuts and better transcription results.
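
To illustrate the general idea of cutting at a pause, the sketch below looks for the quietest short window near a planned cut point. It is a simplified energy-based heuristic and not ChunkAudio's actual algorithm; the function name, the window sizes, and the assumption that samples are floats between -1 and 1 are all illustrative.

```python
import math


def quietest_cut(samples, sample_rate, target_s, search_s=5.0, window_s=0.2):
    """Return the time (in seconds) at the center of the lowest-energy
    window within +/- search_s of target_s."""
    window = max(1, int(window_s * sample_rate))
    start = max(0, int((target_s - search_s) * sample_rate))
    stop = min(len(samples) - window, int((target_s + search_s) * sample_rate))
    if stop < start:
        raise ValueError("search range falls outside the audio")

    best_index, best_energy = start, math.inf
    # Step one window at a time to keep the sketch simple and fast.
    for i in range(start, stop + 1, window):
        energy = sum(s * s for s in samples[i:i + window]) / window
        if energy < best_energy:
            best_index, best_energy = i, energy
    return (best_index + window / 2) / sample_rate


if __name__ == "__main__":
    rate = 1000                       # low rate keeps the demo small
    audio = [0.5] * (20 * rate)       # 20 s of constant "speech" level
    audio[11_000:11_400] = [0.0] * 400  # a 0.4 s pause starting at 11.0 s
    print(quietest_cut(audio, rate, target_s=10.0))
```

A production splitter would typically also apply a minimum pause length and a loudness threshold, but the principle is the same: move the cut from the planned time to the nearest quiet spot.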

Handling Multi-Speaker Audio

For meetings or interviews with multiple speakers, consider longer chunks (15-20 minutes) to give the transcription engine enough context for speaker diarization (identifying who said what). Very short chunks make it harder for AI models to distinguish between speakers.

Post-Split Workflow

After splitting and transcribing, you'll need to merge the text results. Most transcription APIs return timestamped text, but the timestamps are relative to the start of each chunk, so add each chunk's start offset before concatenating the results in order. A short Python script, or simple copy-paste for a handful of chunks, can reassemble the full transcript from chunked outputs.
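
A merge step could be sketched as follows. Timestamps returned for a chunk are normally relative to that chunk, so each one is shifted by the chunk's start time in the original recording. The segment shape used here (dicts with `start`, `end`, and `text`) is an assumption for illustration; adapt the field names to whatever your service returns.

```python
def merge_segments(chunks):
    """Merge per-chunk segments into one timeline.

    `chunks` is a list of (chunk_start_seconds, segments) pairs, where each
    segment is a dict with 'start', 'end' (seconds relative to the chunk)
    and 'text'. This shape is assumed for illustration.
    """
    merged = []
    for chunk_start, segments in sorted(chunks, key=lambda c: c[0]):
        for seg in segments:
            merged.append({
                "start": seg["start"] + chunk_start,
                "end": seg["end"] + chunk_start,
                "text": seg["text"],
            })
    return merged


if __name__ == "__main__":
    results = [
        (900.0, [{"start": 0.0, "end": 2.5, "text": "Second chunk."}]),
        (0.0, [{"start": 1.0, "end": 3.0, "text": "First chunk."}]),
    ]
    for seg in merge_segments(results):
        print(f"{seg['start']:7.1f}-{seg['end']:7.1f}  {seg['text']}")
```

Sorting by chunk start means the results can arrive in any order, which is convenient when chunks are transcribed in parallel.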

Split Audio for Transcription Now

Use ChunkAudio to prepare your audio files for any transcription service. Free, private, and instant.

Tim

Founder, ChunkAudio

Tim built ChunkAudio to make audio splitting fast, free, and private. No uploads, no signups — just results.