Accuracy comparison on search-intensive and agentic tasks. 7B-Base refers to Qwen-2.5-7B-Base and 7B-Inst refers to Qwen-2.5-7B-Instruct. AutoGen and our AgentFlow method are agentic systems; for a fair comparison, both use Qwen-2.5-7B-Instruct for their LLM-powered agents and tools. We visualize the gains of AgentFlow over each baseline in the Δ columns.
Baselines: We compare against four categories of baselines: (1) Open-source LLMs: Qwen-2.5 (7B, 14B, 32B) and Llama-3.3-70B; (2) Proprietary LLMs: GPT-4o-mini and GPT-4o; (3) Tool-integrated reasoning LLMs: Supervised Fine-Tuning (SFT), Iter-RetGen, Search-R1, ZeroSearch, ReSearch, StepSearch, and VerlTool; (4) Training-free agentic system: AutoGen.
Accuracy comparison on mathematical and scientific reasoning tasks. We visualize the gains of AgentFlow over each baseline in the Δ columns.
Baselines: We compare against five categories of baselines: (1) Open-source LLMs: Qwen-2.5 (7B, 14B), Llama-3.3-70B, and Llama-3.1-405B; (2) Proprietary LLMs: GPT-4o-mini and GPT-4o; (3) Reasoning LLMs: Supervised Fine-Tuning (SFT), SimpleRL-reason, Open-Reasoner-Zero, General-Reasoner, and Luffy; (4) Tool-integrated reasoning LLMs: TIR and ToRL; (5) Training-free agentic system: AutoGen.
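The Δ columns in both tables report the gain of AgentFlow over each baseline. The following is a minimal sketch of how such a column could be computed, assuming Δ is the absolute difference in accuracy, in percentage points. That definition, the function name `delta_column`, and the numbers used are illustrative assumptions, not values or code from the paper.

```python
def delta_column(ours: float, baselines: dict[str, float]) -> dict[str, float]:
    """Return the absolute gain (percentage points) of `ours` over each baseline.

    Assumption: delta = accuracy(ours) - accuracy(baseline), rounded to one decimal.
    """
    return {name: round(ours - acc, 1) for name, acc in baselines.items()}


# Placeholder accuracies in percent; these are NOT results from the paper.
placeholder_baselines = {"baseline_a": 40.0, "baseline_b": 55.5}
placeholder_ours = 60.0

deltas = delta_column(placeholder_ours, placeholder_baselines)
for name, gain in deltas.items():
    # A positive value means the method under test outperforms the baseline.
    print(f"{name}: {gain:+.1f}")
```

Under this reading, a positive Δ means AgentFlow outperforms the baseline on that benchmark, and a negative Δ means the baseline is stronger.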