Fun Audio Chat powers natural, low-latency voice interaction at scale.
Built by Alibaba's Tongyi Fun Team, Fun Audio Chat delivers top-tier Spoken QA, audio understanding, and voice empathy. The 8B model uses a dual-resolution 5Hz architecture to cut GPU cost by nearly 50% while keeping speech quality crisp.
Performance that speaks for itself
Fun Audio Chat ranks at the top of OpenAudioBench, VoiceBench, and UltraEval-Audio. Full-duplex interaction, function calling, and bilingual Chinese/English support are production-ready from day one.
Why teams choose Fun Audio Chat
A purpose-built audio model with open-source freedom, efficiency, and a full-stack speech interaction toolkit.
Exceptional benchmark results
Top-tier Spoken QA and audio understanding across OpenAudioBench, VoiceBench, and UltraEval-Audio.
Nearly 50% lower GPU cost
Dual-resolution 5Hz backbone reduces GPU usage while maintaining premium speech quality.
Full feature coverage
Speech function calling, instruction-following, and voice empathy for real-world assistants.
Open source, commercial-ready
Apache 2.0 license enables private deployment and zero API call fees.
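Speech function calling, listed among the features above, means the model turns a spoken request into a structured tool call that your application then executes. The sketch below covers only the application side. It assumes the model emits a JSON object with `name` and `arguments` fields. That format, the `dispatch` helper and the `get_weather` tool are all hypothetical, so check the repository for the model's actual tool-call format.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical tool: a real assistant would call a weather service here.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # placeholder result for illustration


# Registry mapping tool names (as the model would emit them) to Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> Any:
    """Run one tool call, assuming the JSON shape {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    # Keyword arguments come straight from model output; validate them in production.
    return TOOLS[name](**call.get("arguments", {}))


if __name__ == "__main__":
    print(dispatch('{"name": "get_weather", "arguments": {"city": "Hangzhou"}}'))
```

In a typical assistant loop, the tool's return value is passed back to the model so it can phrase the spoken answer.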
Benchmark leadership
Fun Audio Chat outperforms similarly sized competitors on spoken QA, audio understanding, and speech function calling.
| Dimension | Fun Audio Chat | Similar-size competitors |
|---|---|---|
| Speech frame rate (lower is cheaper) | 5Hz | 12.5Hz to 25Hz |
| Spoken QA | Best in class | Excellent / Good |
| Speech function calling | SOTA | Supported / Partial |
| Voice empathy | Top-tier | Supported / Partial |
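The frame rate determines how many speech positions the language-model backbone processes per second of audio. Here is a quick calculation for one minute of audio at the rates in the table. The helper name `frames_for` is purely illustrative.

```python
import math


def frames_for(seconds: float, frame_rate_hz: float) -> int:
    """Speech frames a backbone running at `frame_rate_hz` processes for `seconds` of audio.
    A partial trailing frame still occupies a position, hence the ceiling."""
    return math.ceil(seconds * frame_rate_hz)


# Compare a 5Hz backbone with the 12.5Hz to 25Hz range listed in the table.
for rate in (5, 12.5, 25):
    print(f"{rate:>4} Hz: {frames_for(60, rate):>4} frames per minute of audio")
```

This prints 300, 750, and 1500 frames. Shorter sequences mean fewer backbone steps and a smaller attention context. The reported GPU saving of nearly 50% is smaller than the raw 5:25 frame ratio would suggest, presumably because the 25Hz output head still runs at the higher rate.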
Core technologies
Dual-resolution speech
A shared backbone runs at 5Hz for speed, while a 25Hz refined output head preserves speech fidelity.
Core-cocktail training
Preserves strong language reasoning while optimizing for speech-specific tasks.
Full-duplex interaction
Natural interruption handling and contextual memory for real-time, bidirectional conversation.
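With a 5Hz backbone and a 25Hz head, each backbone step in the dual-resolution design must yield 25 / 5 = 5 output frames. The toy below demonstrates only that rate bookkeeping. `refine_head` is a hypothetical stand-in that copies each backbone state five times and tags each copy with a sub-step index. The real head is a learned network, and this sketch does not describe its internals.

```python
from typing import List

BACKBONE_HZ = 5   # shared backbone frame rate
HEAD_HZ = 25      # refined output head frame rate
UPSAMPLE = HEAD_HZ // BACKBONE_HZ  # output frames per backbone step (5)


def refine_head(backbone_states: List[List[float]]) -> List[List[float]]:
    """Hypothetical stand-in for the 25Hz head: expand every 5Hz state into
    UPSAMPLE frames, appending the sub-step position so frames stay distinguishable."""
    frames = []
    for state in backbone_states:
        for sub_step in range(UPSAMPLE):
            frames.append(state + [sub_step / UPSAMPLE])
    return frames


if __name__ == "__main__":
    seconds = 2
    states = [[float(t)] for t in range(seconds * BACKBONE_HZ)]  # 10 backbone steps
    frames = refine_head(states)
    print(len(states), "backbone steps ->", len(frames), "output frames")
```

For two seconds of speech, the backbone runs 10 steps while the head emits 50 frames. This is how the large backbone can stay at the low rate.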
Built for voice-first products
Deploy Fun Audio Chat across customer service, education, healthcare, and high-concurrency voice assistants.
Intelligent customer service
24/7 support with empathetic, natural dialogue.
Voice assistants
Multi-turn dialogue and task execution with low latency.
Education & training
Real-time feedback and emotional support for learners.
Healthcare & companionship
Voice consultation assistance and comfort-focused experiences.
Quick start
Download the model and run speech-to-text or speech-to-speech inference in minutes.
HuggingFace
```bash
pip install huggingface-hub
hf download FunAudioLLM/Fun-Audio-Chat-8B --local-dir ./pretrained_models/Fun-Audio-Chat-8B
```
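To fetch the weights from Python instead, `huggingface_hub.snapshot_download` downloads the same repository. The wrapper `ensure_model` below is a hypothetical helper of our own. It skips the network call when the target directory already contains files. That check is a rough heuristic and does not verify that an earlier download finished.

```python
from pathlib import Path

REPO_ID = "FunAudioLLM/Fun-Audio-Chat-8B"
LOCAL_DIR = Path("./pretrained_models/Fun-Audio-Chat-8B")


def ensure_model(local_dir: Path = LOCAL_DIR, repo_id: str = REPO_ID) -> Path:
    """Download `repo_id` into `local_dir` unless the directory already has content."""
    if local_dir.is_dir() and any(local_dir.iterdir()):
        return local_dir  # heuristic: a non-empty directory is treated as downloaded
    # Imported lazily so the check above works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir


if __name__ == "__main__":
    print("model files in", ensure_model())
```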
ModelScope
```bash
pip install modelscope
modelscope download --model FunAudioLLM/Fun-Audio-Chat-8B --local_dir pretrained_models/Fun-Audio-Chat-8B
```
Run inference
From the root of the cloned repository:
```bash
export PYTHONPATH=$(pwd)
python examples/infer_s2t.py  # speech-to-text
python examples/infer_s2s.py  # speech-to-speech
```
Resources
Technical report
Open source repository
FAQ
What makes Fun Audio Chat different from GPT-4o voice mode?
Fun Audio Chat is open source (Apache 2.0), supports private deployment, and runs its speech backbone at 5Hz to cut compute costs while maintaining top-tier voice performance.
What hardware is required?
Inference typically requires a GPU with about 24GB of VRAM, such as an RTX 4090, A10, or A100. Training uses four GPUs with 80GB of VRAM each.
Is commercial use allowed?
Yes. The Apache 2.0 license allows commercial use, modification, and private deployment without API fees.
Which languages are supported?
Fun Audio Chat is optimized for Chinese and English voice interaction.
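The ~24GB inference figure in the hardware answer above matches a back-of-envelope estimate: weight memory is roughly parameter count times bytes per parameter. This sketch assumes exactly 8 billion parameters stored in bf16 (2 bytes each). The real checkpoint's size and precision may differ, and activations, the KV cache, and any speech encoder or decoder add to the total.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


if __name__ == "__main__":
    params = 8e9  # assumption: "8B" taken as exactly 8 billion parameters
    for dtype, nbytes in (("bf16", 2), ("fp32", 4)):
        print(f"{dtype}: {weight_gib(params, nbytes):.1f} GiB of weights")
```

In bf16 the weights take about 14.9 GiB, leaving roughly 9 GiB of a 24GB card for everything else. In fp32 the weights alone (about 29.8 GiB) would not fit.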