Next-Generation Large Audio Language Model

Fun Audio Chat powers natural, low-latency voice interaction at scale.

Built by Alibaba's Tongyi Fun Team, Fun Audio Chat delivers top-tier spoken QA, audio understanding, and voice empathy. The 8B model uses a dual-resolution architecture, a shared 5Hz backbone paired with a 25Hz output head, to cut GPU cost by nearly 50% compared with models that run at 12.5Hz to 25Hz, while keeping speech quality crisp.

Performance that speaks for itself

Fun Audio Chat ranks at the top of OpenAudioBench, VoiceBench, and UltraEval-Audio among models of similar size. Full-duplex interaction, function calling, and bilingual Chinese/English support are production-ready from day one.

8B parameters
5Hz frame rate
Apache 2.0 license

Why teams choose Fun Audio Chat

A purpose-built audio model with open-source freedom, efficiency, and a full-stack speech interaction toolkit.

Exceptional benchmark results

Top-tier spoken QA and audio understanding across OpenAudioBench, VoiceBench, and UltraEval-Audio.

Nearly 50% lower GPU cost

Dual-resolution 5Hz backbone reduces GPU usage while maintaining premium speech quality.

Full feature coverage

Speech function calling, instruction-following, and voice empathy for real-world assistants.
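Speech function calling means the model turns a spoken request into a structured call that the host application then executes. The sketch below covers only the application side and rests on stated assumptions: the JSON shape (`name` plus `arguments`) and the `get_weather` tool are hypothetical placeholders, not Fun Audio Chat's documented output format.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool: the application, not the model, decides what is callable.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# Registry of tools the model is allowed to invoke (illustrative only).
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}

def dispatch(call_json: str) -> Any:
    """Execute one tool call, ASSUMING the model emits JSON shaped like
    {"name": ..., "arguments": {...}}; this format is an assumption."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))

# Text the model might produce after hearing "What's the weather in Hangzhou?"
print(dispatch('{"name": "get_weather", "arguments": {"city": "Hangzhou"}}'))
```

Keeping the registry on the application side means a transcription or reasoning error can at worst request an unknown tool, which is rejected rather than executed.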

Open source, commercial-ready

Apache 2.0 license enables private deployment and zero API call fees.

Benchmark leadership

Fun Audio Chat outperforms similar-size competitors on spoken QA, audio understanding, and speech function calling.

Dimension                      Fun Audio Chat   Competitors
Frame rate (lower is cheaper)  5Hz              12.5Hz to 25Hz
Spoken QA                      Best in class    Excellent / Good
Speech function calling        SOTA             Supported / Partial
Voice empathy                  Top-tier         Supported / Partial
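The frame rate sets how many positions the model must process per second of audio. The arithmetic below compares sequence lengths only. The nearly 50% GPU saving quoted on this page is the project's own figure and also reflects the 25Hz output head, so it is not derived here.

```python
# Number of frames needed to represent a clip at a given frame rate.
def frames(duration_s: float, frame_rate_hz: float) -> int:
    return round(duration_s * frame_rate_hz)

clip_s = 60  # one minute of speech
for hz in (5, 12.5, 25):
    print(f"{hz:>5} Hz -> {frames(clip_s, hz):>5} frames")
```

A one-minute clip needs 300 frames at 5Hz, versus 750 at 12.5Hz and 1,500 at 25Hz, so the backbone handles a sequence 2.5 to 5 times shorter.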

Core technologies

Dual-resolution speech

A shared backbone runs at 5Hz for speed, while a refined output head generates speech at 25Hz to preserve fidelity.
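One way to picture the two resolutions: every five consecutive 25Hz speech frames fold into a single 5Hz step for the shared backbone, and the output head expands each step back into five 25Hz frames. This is a minimal sketch assuming plain fixed-size grouping with padding over integer token IDs. The real model works on learned embeddings, and its exact grouping and refinement layers are not described on this page.

```python
from typing import List, Sequence, Tuple

FINE_HZ = 25                  # output-head resolution
COARSE_HZ = 5                 # shared-backbone resolution
GROUP = FINE_HZ // COARSE_HZ  # 5 fine frames per backbone step

def group_frames(tokens: Sequence[int], k: int = GROUP, pad: int = 0) -> List[Tuple[int, ...]]:
    """Fold a 25Hz token stream into 5Hz steps of k tokens (right-padded)."""
    padded = list(tokens) + [pad] * (-len(tokens) % k)
    return [tuple(padded[i:i + k]) for i in range(0, len(padded), k)]

def ungroup_frames(groups: Sequence[Tuple[int, ...]], n_tokens: int) -> List[int]:
    """Expand 5Hz steps back into a 25Hz stream and drop the padding."""
    flat = [tok for group in groups for tok in group]
    return flat[:n_tokens]

fine = list(range(12))       # 12 frames at 25Hz = 0.48 s of audio
coarse = group_frames(fine)  # 3 backbone steps instead of 12
assert ungroup_frames(coarse, len(fine)) == fine
print(len(fine), "->", len(coarse))
```

The backbone sees one-fifth as many positions, while the head still has a full-resolution slot for every output frame.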

Core-cocktail training

Preserves strong language reasoning while optimizing for speech-specific tasks.

Full-duplex interaction

Natural interruption handling and contextual memory for real-time, bidirectional conversation.
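Full-duplex means the assistant keeps listening while it speaks. The toy event loop below illustrates interruption handling in general terms. When user speech arrives mid-reply, playback stops, and only the part actually spoken is kept in the history so the next turn has accurate context. The event names and history format are hypothetical and are not taken from Fun Audio Chat's implementation.

```python
from typing import List, Tuple

Event = Tuple[str, str]  # (kind, payload); the kind labels are hypothetical

def run_duplex(events: List[Event]) -> List[Tuple[str, str]]:
    """Replay listening/speaking events and return the dialogue history.

    'user' is a finished user utterance, 'reply_chunk' is a piece of the
    assistant's speech that has been played, and 'reply_end' closes a reply.
    A 'user' event arriving mid-reply is a barge-in: playback stops and only
    the chunks already played are committed to history.
    """
    history: List[Tuple[str, str]] = []
    spoken: List[str] = []  # chunks of the current reply played so far
    for kind, payload in events:
        if kind == "user":
            if spoken:  # barge-in: record the truncated reply
                history.append(("assistant", " ".join(spoken) + " [interrupted]"))
                spoken = []
            history.append(("user", payload))
        elif kind == "reply_chunk":
            spoken.append(payload)
        elif kind == "reply_end" and spoken:
            history.append(("assistant", " ".join(spoken)))
            spoken = []
    if spoken:  # reply still in progress when the stream ended
        history.append(("assistant", " ".join(spoken)))
    return history

events = [
    ("user", "Tell me about Hangzhou."),
    ("reply_chunk", "Hangzhou is"),
    ("reply_chunk", "a city in"),
    ("user", "Actually, how is the weather there?"),
    ("reply_chunk", "It is sunny today."),
    ("reply_end", ""),
]
for role, text in run_duplex(events):
    print(f"{role}: {text}")
```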

Built for voice-first products

Deploy Fun Audio Chat across customer service, education, healthcare, and high-concurrency voice assistants.

Intelligent customer service

24/7 support with empathetic, natural dialogue.

Voice assistants

Multi-turn dialogue and task execution with low latency.

Education & training

Real-time feedback and emotional support for learners.

Healthcare & companionship

Voice consultation assistance and comfort-focused experiences.

Quick start

Download the model and run speech-to-text or speech-to-speech inference in minutes.

HuggingFace

pip install huggingface-hub
hf download FunAudioLLM/Fun-Audio-Chat-8B --local-dir ./pretrained_models/Fun-Audio-Chat-8B

ModelScope

pip install modelscope
modelscope download --model FunAudioLLM/Fun-Audio-Chat-8B --local_dir pretrained_models/Fun-Audio-Chat-8B

Run inference

export PYTHONPATH="$(pwd)"      # run from the repository root
python examples/infer_s2t.py    # speech-to-text
python examples/infer_s2s.py    # speech-to-speech
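The same steps can be scripted from Python, for example in a CI job. This sketch assumes it runs from the repository root (where `examples/` lives). It prepends that directory to `PYTHONPATH` rather than overwriting the variable as the shell `export` does. The helper names `build_env` and `run_example` are hypothetical.

```python
import os
import subprocess
import sys
from typing import Dict, Optional

# Example scripts shipped in the repository's examples/ directory.
EXAMPLES = {"s2t": "examples/infer_s2t.py", "s2s": "examples/infer_s2s.py"}

def build_env(repo_root: str) -> Dict[str, str]:
    """Copy the current environment with repo_root first on PYTHONPATH."""
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = repo_root if not existing else repo_root + os.pathsep + existing
    return env

def run_example(mode: str, repo_root: Optional[str] = None) -> None:
    """Run one example script ('s2t' or 's2s') with the repo importable."""
    root = os.path.abspath(repo_root or os.getcwd())
    subprocess.run(
        [sys.executable, EXAMPLES[mode]],
        cwd=root,
        env=build_env(root),
        check=True,  # raise CalledProcessError if the script fails
    )
```

Calling `run_example("s2t")` from the repository root mirrors the shell commands above. Using `sys.executable` keeps the scripts on the same interpreter and virtual environment as the caller.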

FAQ

What makes Fun Audio Chat different from GPT-4o voice mode?

Fun Audio Chat is open source (Apache 2.0), supports private deployment, and runs its backbone at a 5Hz frame rate to cut compute costs while maintaining top-tier voice performance.

What hardware is required?

Inference typically needs a GPU with about 24GB of VRAM, such as an RTX 4090 or A10, or a larger card such as an A100. Training uses four GPUs with 80GB of VRAM each.
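A back-of-the-envelope check shows why about 24GB suffices for inference. Counting only the 8B parameters, 16-bit weights take about 16GB, which leaves headroom for activations and the KV cache. The 16 bytes per parameter used for training is a common rule of thumb for mixed-precision Adam, not a figure published for Fun Audio Chat, and it ignores activations.

```python
# Rough GPU memory estimates in decimal gigabytes (1 GB = 1e9 bytes).
PARAMS = 8e9  # 8B parameters, as stated on this page

def weights_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

# Inference: bf16/fp16 weights only (2 bytes per parameter).
inference = weights_gb(PARAMS, 2)
# Training rule of thumb (assumption): 2 (fp16 weights) + 2 (fp16 grads)
# + 4 (fp32 master weights) + 4 + 4 (Adam moments) = 16 bytes per parameter.
training = weights_gb(PARAMS, 16)

print(f"inference weights: {inference:.0f} GB")
print(f"training states:   {training:.0f} GB total, "
      f"{training / 4:.0f} GB per GPU if evenly sharded over 4")
```

At roughly 128GB, the training state alone exceeds a single 80GB card, which is consistent with spreading training across four GPUs.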

Is commercial use allowed?

Yes. The Apache 2.0 license allows commercial use, modification, and private deployment without API fees.

Which languages are supported?

Fun Audio Chat is optimized for Chinese and English voice interaction.