Conversational voice agents are advancing rapidly, but evaluating them remains difficult because an evaluation must account for both task accuracy and conversational quality. Existing methods tend to isolate these factors, measuring either task success or conversational dynamics, but rarely both at once. To address this gap, researchers have introduced EVA, an end-to-end framework for evaluating multi-turn spoken conversations with voice agents that captures both task accuracy and user experience.
EVA operates using a realistic bot-to-bot audio architecture composed of five components: a User Simulator that mimics human callers with natural speech using high-quality text-to-speech models; the Voice Agent under evaluation, supporting both cascade and audio-native architectures; a Tool Executor that provides deterministic responses through scenario-specific databases; Validators that ensure conversation validity without human annotation; and a comprehensive Metrics Suite that analyzes conversation recordings, transcripts, and tool interactions. This structure enables EVA to detect nuanced interaction dynamics such as interruptions, error recovery, and latency effects that impact user experience.
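To make the division of labor between the five components concrete, here is a minimal text-only sketch of a bot-to-bot loop. It is not EVA's implementation: every name in it (`ToolExecutor`, `ScriptedUser`, `toy_agent`, `run_conversation`, `is_valid`, `metrics`, the `get_booking` tool, and the booking reference format) is a hypothetical stand-in, and plain strings take the place of the synthesized and recorded audio that the real framework exchanges.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolExecutor:
    """Answers tool calls deterministically from a scenario-specific database."""
    database: dict
    log: list = field(default_factory=list)

    def call(self, tool, key):
        result = self.database.get(tool, {}).get(key)
        self.log.append((tool, key, result))  # kept for later metric computation
        return result


@dataclass
class ScriptedUser:
    """Stand-in for the User Simulator; a real one would speak via text-to-speech."""
    lines: list

    def next_utterance(self, turn):
        return self.lines[turn] if turn < len(self.lines) else None


def toy_agent(utterance, tools):
    """Hypothetical agent under test: looks up a booking once a reference is given."""
    match = re.search(r"\b[A-Z]{3}\d{3}\b", utterance)
    if match is None:
        return "Could you give me your booking reference?"
    ref = match.group(0)
    booking = tools.call("get_booking", ref)
    if booking is None:
        return f"I could not find booking {ref}."
    return f"Booking {ref} is {booking['status']}."


def run_conversation(user, agent, tools, max_turns=10):
    """Alternate user and agent turns until the user has nothing left to say."""
    transcript = []
    for turn in range(max_turns):
        utterance = user.next_utterance(turn)
        if utterance is None:
            break
        transcript.append(("user", utterance))
        transcript.append(("agent", agent(utterance, tools)))
    return transcript


def is_valid(transcript):
    """Validator stand-in: the conversation is non-empty and turns alternate."""
    roles = [role for role, _ in transcript]
    expected = ["user" if i % 2 == 0 else "agent" for i in range(len(roles))]
    return bool(roles) and roles == expected


def metrics(transcript, tools):
    """Metrics stand-in: counts derived from the transcript and the tool log."""
    return {"turns": len(transcript) // 2, "tool_calls": len(tools.log)}


# Assumed scenario data, for illustration only.
tools = ToolExecutor({"get_booking": {"ABC123": {"status": "confirmed"}}})
user = ScriptedUser(["Hi, I need help with my flight.", "It's ABC123."])
transcript = run_conversation(user, toy_agent, tools)
summary = metrics(transcript, tools)
```

Because the tool executor only reads from a fixed database, rerunning the same scenario against the same agent produces the same tool log, which is what makes this kind of scoring reproducible without human annotation.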
The framework was initially applied to an airline domain dataset featuring 50 scenarios including flight rebooking, cancellations, and voucher issuance. Benchmarking across 20 voice agent systems, including speech-to-speech and large audio language models, revealed a consistent tradeoff: agents excelling in task accuracy often delivered poorer conversational experiences, and vice versa. This finding highlights the importance of integrated evaluation metrics to inform future voice agent development.
EVA extends earlier evaluation efforts, which focused either on single-turn speech understanding or on isolated conversational behaviors without considering task completion. By jointly scoring EVA-A (Accuracy) and EVA-X (Experience), the framework gives a combined view of voice agent performance in realistic, multi-step conversational workflows. The aim is to enable assessments that better reflect real-world deployment conditions.
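One way to see why the two scores are reported side by side rather than collapsed into one number is to look for agents that no other agent beats on both axes at once. The sketch below assumes each agent already has a single accuracy score and a single experience score. How EVA computes or aggregates EVA-A and EVA-X is not described here, and the agent names and values are invented placeholders, not benchmark results.

```python
def pareto_front(scores):
    """Return the agents that no other agent matches or beats on both axes."""
    front = []
    for name, (acc, exp) in scores.items():
        dominated = any(
            a >= acc and e >= exp and (a > acc or e > exp)
            for other, (a, e) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical (accuracy, experience) pairs, for illustration only.
scores = {
    "agent_a": (0.90, 0.55),  # strong on tasks, weaker conversationally
    "agent_b": (0.70, 0.80),  # the reverse profile
    "agent_c": (0.65, 0.60),  # worse than agent_b on both axes
}
front = pareto_front(scores)
```

With these placeholder values, `agent_a` and `agent_b` both remain on the front. Neither is strictly better than the other, which is the kind of tradeoff a single averaged score would hide.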
Read more: huggingface.co