
AgentFlow

In-the-Flow Agentic System Optimization

1 Stanford University, 2 Texas A&M University, 3 UC San Diego, 4 Lambda
* Equal Contribution † Co-senior authors

Performance comparison across 10 diverse benchmarks. AgentFlow with a 7B-scale backbone achieves substantial improvements over top-performing baselines across search, agentic, mathematical, and scientific reasoning tasks.

YouTube Video

Thanks to Discover AI for featuring AgentFlow!

Introduction

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction.

We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages.

Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.

One case study example. The baseline run initially fails with repetitive errors (left); AgentFlow, trained with Flow-GRPO, explores a new solution pathway at turn 4 after two failed attempts (right).

AgentFlow: An In-the-Flow Agentic System

(a) Overview of AgentFlow, a trainable agentic system for in-the-flow planning and tool use. Four modules—planner, executor, verifier, and generator—interact via evolving memory $M$ and toolset $K$, given query $q$. The planner policy is optimized on-policy inside the system's multi-turn loop for adaptive reasoning. (b) A single state transition: $a^t$, $e^t$, and $v^t$ update memory from $M^t$ to $M^{t+1}$.

AgentFlow is a general-purpose tool-integrated agentic framework for solving complex reasoning tasks through fine-grained planning and effective tool use. It comprises four specialized modules—Planner $\mathcal{P}$, Executor $\mathcal{E}$, Verifier $\mathcal{V}$, and Generator $\mathcal{G}$—coordinated by shared memory $M$ and a toolset $K$. We formalize AgentFlow's problem-solving process as a multi-turn Markov Decision Process (MDP): given query $q$ and toolset $K$, the planner $\mathcal{P}$ (a trainable policy $\pi_\theta$) produces an action $a^t \sim \pi_\theta(a^t \mid q, K, M^t)$ that formulates a sub-goal, selects a tool $k \in K$, and retrieves relevant context from memory $M^t$. The executor $\mathcal{E}$ invokes tools according to $a^t$, yielding execution results $e^t \sim \mathcal{E}(e^t \mid a^t, K)$. The verifier $\mathcal{V}$ evaluates $e^t$, producing a binary verification signal $v^t \sim \mathcal{V}(v^t \mid q, e^t, M^t)$. If $v^t = 0$, the memory is updated deterministically: $M^{t+1} = f_{\text{mem}}(M^t, a^t, e^t, v^t)$. This process repeats until $v^t = 1$ (termination) or a maximum turn budget is reached. Upon termination at turn $T$, the generator $\mathcal{G}$ produces the final solution $o \sim \mathcal{G}(o \mid q, M^T)$. After $T$ turns, the trajectory $\tau = \{(a^t, e^t, v^t)\}_{t=1}^T$ records planning, execution, and verification steps. The joint generative process is:

$$p_\theta(\{a^t,e^t,v^t\}_{1:T}, o \mid q) = \Big[\prod_{t=1}^T \pi_\theta(a^t \mid q,K,M^t)\; \mathcal{E}(e^t \mid a^t,K)\; \mathcal{V}(v^t \mid q,e^t,M^t)\Big]\; \mathcal{G}(o \mid q,M^T).$$
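To make the loop concrete, the following minimal Python sketch runs one rollout of this process, treating the four modules and the deterministic memory update as abstract callables. The names, signatures, and list-based memory here are illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Trajectory:
    """Records the planning, execution, and verification steps of one rollout."""
    actions: list = field(default_factory=list)      # a^1, ..., a^T
    executions: list = field(default_factory=list)   # e^1, ..., e^T
    verdicts: list = field(default_factory=list)     # v^1, ..., v^T
    output: Any = None                               # final solution o

def run_agentflow(query, toolset, planner, executor, verifier, generator,
                  update_memory, max_turns: int = 10) -> Trajectory:
    """One rollout of the AgentFlow loop: plan -> execute -> verify -> update memory."""
    memory = []                                       # evolving memory M^t (representation is an assumption)
    traj = Trajectory()
    for _ in range(max_turns):
        action = planner(query, toolset, memory)      # a^t ~ pi_theta(. | q, K, M^t)
        result = executor(action, toolset)            # e^t ~ E(. | a^t, K)
        verdict = verifier(query, result, memory)     # v^t in {0, 1}
        traj.actions.append(action)
        traj.executions.append(result)
        traj.verdicts.append(verdict)
        if verdict == 1:                              # verifier signals termination at turn T
            break
        memory = update_memory(memory, action, result, verdict)  # M^{t+1} = f_mem(M^t, a^t, e^t, v^t)
    traj.output = generator(query, memory)            # o ~ G(. | q, M^T)
    return traj
```

In practice each callable would wrap an LLM prompt (planner, verifier, generator) or a tool invocation (executor); only the planner is updated by training.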

Flow-based Group Refined Policy Optimization

Optimization of AgentFlow. Given a query $q$, memory $M$, and toolset $K$, the policy generates actions for sub-goals and tool selection. It is trained via Flow-GRPO, a reinforcement learning method that enables stable multi-turn optimization under the collaborative dynamics of the system.

Training Objective

We optimize the planner policy $\pi_\theta$ online within the AgentFlow system. For each query $(q,y^*)$, we sample $G$ on-policy trajectories $\{\tau_i\}_{i=1}^G$ where $\tau_i = \{a_i^1, \ldots, a_i^{T_i}, o_i\}$. The planner maximizes: $$\mathcal{J}(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)], \quad \theta^\star=\arg\max_\theta \mathcal{J}(\theta).$$

We use a final-outcome reward: every action receives the same trajectory-level signal based on solution correctness: $$r = R(a^t) = \bar{R}(o, q, y^*), \quad \forall t = 1,\dots,T,$$ where $\bar{R}(o, q, y^*) \in \{0, 1\}$ is determined by an LLM-as-judge. This broadcasts the global success signal to all intermediate decisions.
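As a sketch under these definitions, the broadcast itself is a one-liner; `judge` stands in for the LLM-as-judge and is a hypothetical callable.

```python
def turn_rewards(trajectory, judge, query, gold_answer) -> list[float]:
    """Broadcast the verifiable final-outcome reward to all T planner turns.

    `judge(output, query, gold_answer)` is assumed to return 1 if the final
    solution is judged correct and 0 otherwise (an LLM-as-judge in the paper).
    """
    r = judge(trajectory.output, query, gold_answer)   # R_bar(o, q, y*) in {0, 1}
    return [float(r)] * len(trajectory.actions)        # r_t = r for t = 1, ..., T
```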

Flow-GRPO Formulation

Let $s_i^t=(q, K, M_i^t)$ be the state at turn $t$ of rollout $i$, and $a_i^t$ the planner's action (token sequence of length $|a_i^t|$). The objective is:

$$\begin{aligned} \mathcal{J}_{\text{Flow-GRPO}}(\theta) &= \mathbb{E}_{(q, y^*) \sim \mathcal{D}, \; \{\tau_i\}_{i=1}^{G} \sim \pi_{\theta_\text{old}}} \\ & \Bigg[ \frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\frac{1}{|a_i^t|}\sum_{j=1}^{|a_i^t|} \min\!\Big\{ \rho_{i,j}^t A_i^t,\, \mathrm{clip}(\rho_{i,j}^t,\,1-\epsilon,\,1+\epsilon)\,A_i^t \Big\} \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) \Bigg], \end{aligned}$$
where $\rho_{i,j}^t = \frac{\pi_\theta(a_{i,j}^t \mid s_i^t, a_{i,1:j-1}^t)}{\pi_{\theta_\text{old}}(a_{i,j}^t \mid s_i^t, a_{i,1:j-1}^t)}$ is the token-level importance ratio, $\epsilon>0$ is the PPO clipping parameter, and $\beta>0$ controls the strength of the KL penalty toward the reference policy $\pi_{\text{ref}}$.

The advantage is group-normalized to reduce variance: $$A_i^t = \frac{\bar{R}(o_i, q, y^*) - \mathrm{mean}\left( \{ \bar{R}(o_k, q, y^*) \}_{k=1}^{G} \right)}{\mathrm{std}\left( \{ \bar{R}(o_k, q, y^*) \}_{k=1}^{G} \right)}.$$ By broadcasting a single trajectory-level reward to all turns, we decompose multi-turn RL into tractable single-turn policy updates.
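Below is a minimal PyTorch sketch of the per-turn pieces of this objective. The group-normalized advantage and the clipped surrogate follow the formulas above; the k3-style KL estimator, the stabilizing epsilon in the normalization, and how per-token log-probs are obtained are assumptions, and the outer averages over turns ($1/T_i$) and rollouts ($1/G$) would be taken by the surrounding training loop.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages: one scalar per rollout, shared by all its turns.

    `rewards` holds R_bar(o_i, q, y*) for the G rollouts of a single query.
    The small eps avoids division by zero when all rewards agree (an assumption;
    the formula above divides by the raw standard deviation).
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def flow_grpo_turn_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                        logp_ref: torch.Tensor, advantage: torch.Tensor,
                        clip_eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    """Clipped surrogate for one planner turn (one action token sequence a_i^t).

    logp_new / logp_old / logp_ref: per-token log-probs of a_i^t under the
    current, behavior, and reference policies (1-D tensors of length |a_i^t|).
    `advantage` is the trajectory-level A_i^t, identical across turns of rollout i.
    """
    ratio = torch.exp(logp_new - logp_old)                           # rho_{i,j}^t
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    surrogate = torch.minimum(unclipped, clipped).mean()             # (1/|a_i^t|) sum_j min(...)
    # k3 estimator of KL(pi_theta || pi_ref); the paper only specifies a KL penalty.
    kl = (torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1).mean()
    return -(surrogate - beta * kl)   # negated so a gradient-descent optimizer maximizes J
```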

Featured Tools

AgentFlow leverages a diverse set of specialized tools to accomplish complex reasoning tasks.

Case Study Visualization

Experimental Results

Main Results

To comprehensively evaluate the tool-use capabilities of AgentFlow, we conduct experiments on four types of reasoning tasks: (1) knowledge-intensive search, including Bamboogle, 2Wiki, HotpotQA, and Musique; (2) agentic reasoning on GAIA (where we adopt the textual split); (3) logic-dense mathematical reasoning, including AIME 2024, AMC 23, and Game of 24; and (4) scientific reasoning, including GPQA and MedQA.

In-Depth Analysis

We conduct comprehensive analyses to understand the effectiveness of Flow-GRPO and the behavior of AgentFlow across various dimensions.


BibTeX

@article{li2025flow,
    title={In-the-Flow Agentic System Optimization for Effective Planning and Tool Use},
    author={Li, Zhuofeng and Zhang, Haoxiang and Han, Seungju and Liu, Sheng and Xie, Jianwen and Zhang, Yu and Choi, Yejin and Zou, James and Lu, Pan},
    journal={arXiv preprint arXiv:2510.05592},
    year={2025}
}