
AgentFlow

In-the-Flow Agentic System Optimization

1 Stanford University, 2 Texas A&M University, 3 UC San Diego, 4 Lambda
* Equal Contribution † Co-senior authors

Performance comparison across 10 diverse benchmarks. AgentFlow with a 7B-scale backbone achieves substantial improvements over top-performing baselines across search, agentic, mathematical, and scientific reasoning tasks.

YouTube Video

Thanks to Discover AI for featuring AgentFlow!

Introduction

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction.

We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages.

Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.

One case study example. The baseline run initially fails with repetitive errors (left); AgentFlow, trained with Flow-GRPO, explores a new solution pathway at turn 4 after two failed attempts (right).

AgentFlow: An In-the-Flow Agentic System

(a) Overview of AgentFlow, a trainable agentic system for in-the-flow planning and tool use. Four modules—planner, executor, verifier, and generator—interact via evolving memory $M$ and toolset $K$, given query $q$. The planner policy is optimized on-policy inside the system's multi-turn loop for adaptive reasoning. (b) A single state transition: $a^t$, $e^t$, and $v^t$ update memory from $M^t$ to $M^{t+1}$.

AgentFlow is a general-purpose tool-integrated agentic framework for solving complex reasoning tasks through fine-grained planning and effective tool use. It comprises four specialized modules—Planner $\mathcal{P}$, Executor $\mathcal{E}$, Verifier $\mathcal{V}$, and Generator $\mathcal{G}$—coordinated by shared memory $M$ and a toolset $K$. We formalize AgentFlow's problem-solving process as a multi-turn Markov Decision Process (MDP): given query $q$ and toolset $K$, the planner $\mathcal{P}$ (a trainable policy $\pi_\theta$) produces an action $a^t \sim \pi_\theta(a^t \mid q, K, M^t)$ that formulates a sub-goal, selects a tool $k \in K$, and retrieves relevant context from memory $M^t$. The executor $\mathcal{E}$ invokes tools according to $a^t$, yielding execution results $e^t \sim \mathcal{E}(e^t \mid a^t, K)$. The verifier $\mathcal{V}$ evaluates $e^t$, producing a binary verification signal $v^t \sim \mathcal{V}(v^t \mid q, e^t, M^t)$. If $v^t = 0$, the memory is updated deterministically: $M^{t+1} = f_{\text{mem}}(M^t, a^t, e^t, v^t)$. This process repeats until $v^t = 1$ (termination) or a maximum turn budget is reached. Upon termination at turn $T$, the generator $\mathcal{G}$ produces the final solution $o \sim \mathcal{G}(o \mid q, M^T)$. After $T$ turns, the trajectory $\tau = \{(a^t, e^t, v^t)\}_{t=1}^T$ records planning, execution, and verification steps. The joint generative process is:

$$p_\theta(\{a^t,e^t,v^t\}_{1:T}, o \mid q) = \Big[\prod_{t=1}^T \pi_\theta(a^t \mid q,K,M^t)\; \mathcal{E}(e^t \mid a^t,K)\; \mathcal{V}(v^t \mid q,e^t,M^t)\Big]\; \mathcal{G}(o \mid q,M^T).$$
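To make the loop concrete, the following minimal Python sketch runs one rollout of this process, treating the four modules and the deterministic memory update as abstract callables. The names, signatures, and list-based memory here are illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Trajectory:
    """Records the planning, execution, and verification steps of one rollout."""
    actions: list = field(default_factory=list)      # a^1, ..., a^T
    executions: list = field(default_factory=list)   # e^1, ..., e^T
    verdicts: list = field(default_factory=list)     # v^1, ..., v^T
    output: Any = None                               # final solution o

def run_agentflow(query, toolset, planner, executor, verifier, generator,
                  update_memory, max_turns: int = 10) -> Trajectory:
    """One rollout of the AgentFlow loop: plan -> execute -> verify -> update memory."""
    memory = []                                       # evolving memory M^t (representation is an assumption)
    traj = Trajectory()
    for _ in range(max_turns):
        action = planner(query, toolset, memory)      # a^t ~ pi_theta(. | q, K, M^t)
        result = executor(action, toolset)            # e^t ~ E(. | a^t, K)
        verdict = verifier(query, result, memory)     # v^t in {0, 1}
        traj.actions.append(action)
        traj.executions.append(result)
        traj.verdicts.append(verdict)
        if verdict == 1:                              # verifier signals termination at turn T
            break
        memory = update_memory(memory, action, result, verdict)  # M^{t+1} = f_mem(M^t, a^t, e^t, v^t)
    traj.output = generator(query, memory)            # o ~ G(. | q, M^T)
    return traj
```

In practice each callable would wrap an LLM prompt (planner, verifier, generator) or a tool invocation (executor); only the planner is updated by training.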

Flow-based Group Refined Policy Optimization

Optimization of AgentFlow. Given a query $q$, memory $M$, and toolset $K$, the policy generates actions for sub-goals and tool selection. It is trained via Flow-GRPO, a reinforcement learning method that enables stable multi-turn optimization under the collaborative dynamics of the system.

Training Objective

We optimize the planner policy $\pi_\theta$ online within the AgentFlow system. For each query $(q,y^*)$, we sample $G$ on-policy trajectories $\{\tau_i\}_{i=1}^G$ where $\tau_i = \{a_i^1, \ldots, a_i^{T_i}, o_i\}$. The planner maximizes: $$\mathcal{J}(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)], \quad \theta^\star=\arg\max_\theta \mathcal{J}(\theta).$$

We use a final-outcome reward: every action receives the same trajectory-level signal based on solution correctness: $$r = R(a^t) = \bar{R}(o, q, y^*), \quad \forall t = 1,\dots,T,$$ where $\bar{R}(o, q, y^*) \in \{0, 1\}$ is determined by an LLM-as-judge. This broadcasts the global success signal to all intermediate decisions.
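As a sketch under these definitions, the broadcast itself is a one-liner; `judge` stands in for the LLM-as-judge and is a hypothetical callable.

```python
def turn_rewards(trajectory, judge, query, gold_answer) -> list[float]:
    """Broadcast the verifiable final-outcome reward to all T planner turns.

    `judge(output, query, gold_answer)` is assumed to return 1 if the final
    solution is judged correct and 0 otherwise (an LLM-as-judge in the paper).
    """
    r = judge(trajectory.output, query, gold_answer)   # R_bar(o, q, y*) in {0, 1}
    return [float(r)] * len(trajectory.actions)        # r_t = r for t = 1, ..., T
```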

Flow-GRPO Formulation

Let $s_i^t=(q, K, M_i^t)$ be the state at turn $t$ of rollout $i$, and $a_i^t$ the planner's action (token sequence of length $|a_i^t|$). The objective is:

$$\begin{aligned} \mathcal{J}_{\text{Flow-GRPO}}(\theta) &= \mathbb{E}_{(q, y^*) \sim \mathcal{D}, \; \{\tau_i\}_{i=1}^{G} \sim \pi_{\theta_\text{old}}} \\ & \Bigg[ \frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\frac{1}{|a_i^t|}\sum_{j=1}^{|a_i^t|} \min\!\Big\{ \rho_{i,j}^t A_i^t,\, \mathrm{clip}(\rho_{i,j}^t,\,1-\epsilon,\,1+\epsilon)\,A_i^t \Big\} \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) \Bigg], \end{aligned}$$
where $\rho_{i,j}^t = \frac{\pi_\theta(a_{i,j}^t \mid s_i^t, a_{i,1:j-1}^t)}{\pi_{\theta_\text{old}}(a_{i,j}^t \mid s_i^t, a_{i,1:j-1}^t)}$ is the token-level importance ratio, $\epsilon>0$ is the PPO clipping parameter, and $\beta>0$ controls the strength of the KL penalty toward the reference policy $\pi_{\text{ref}}$.

The advantage is group-normalized to reduce variance: $$A_i^t = \frac{\bar{R}(o_i, q, y^*) - \mathrm{mean}\left( \{ \bar{R}(o_k, q, y^*) \}_{k=1}^{G} \right)}{\mathrm{std}\left( \{ \bar{R}(o_k, q, y^*) \}_{k=1}^{G} \right)}.$$ By broadcasting a single trajectory-level reward to all turns, we decompose multi-turn RL into tractable single-turn policy updates.
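Below is a minimal PyTorch sketch of the per-turn pieces of this objective. The group-normalized advantage and the clipped surrogate follow the formulas above; the k3-style KL estimator, the stabilizing epsilon in the normalization, and how per-token log-probs are obtained are assumptions, and the outer averages over turns ($1/T_i$) and rollouts ($1/G$) would be taken by the surrounding training loop.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages: one scalar per rollout, shared by all its turns.

    `rewards` holds R_bar(o_i, q, y*) for the G rollouts of a single query.
    The small eps avoids division by zero when all rewards agree (an assumption;
    the formula above divides by the raw standard deviation).
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def flow_grpo_turn_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                        logp_ref: torch.Tensor, advantage: torch.Tensor,
                        clip_eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    """Clipped surrogate for one planner turn (one action token sequence a_i^t).

    logp_new / logp_old / logp_ref: per-token log-probs of a_i^t under the
    current, behavior, and reference policies (1-D tensors of length |a_i^t|).
    `advantage` is the trajectory-level A_i^t, identical across turns of rollout i.
    """
    ratio = torch.exp(logp_new - logp_old)                           # rho_{i,j}^t
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    surrogate = torch.minimum(unclipped, clipped).mean()             # (1/|a_i^t|) sum_j min(...)
    # k3 estimator of KL(pi_theta || pi_ref); the paper only specifies a KL penalty.
    kl = (torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1).mean()
    return -(surrogate - beta * kl)   # negated so a gradient-descent optimizer maximizes J
```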

Featured Tools

AgentFlow leverages a diverse set of specialized tools to accomplish complex reasoning tasks.

Case Study Visualization

Experimental Results

Main Results

To comprehensively evaluate the tool-use capabilities of AgentFlow, we conduct experiments on four types of reasoning tasks: (1) knowledge-intensive search, including Bamboogle, 2Wiki, HotpotQA, and Musique; (2) agentic reasoning on GAIA (where we adopt the textual split); (3) logic-dense mathematical reasoning, including AIME 2024, AMC 23, and Game of 24; and (4) scientific reasoning, including GPQA and MedQA.

In-Depth Analysis

We conduct comprehensive analyses to understand the effectiveness of Flow-GRPO and the behavior of AgentFlow across various dimensions.


BibTeX

@article{li2025flow,
    title={In-the-Flow Agentic System Optimization for Effective Planning and Tool Use},
    author={Li, Zhuofeng and Zhang, Haoxiang and Han, Seungju and Liu, Sheng and Xie, Jianwen and Zhang, Yu and Choi, Yejin and Zou, James and Lu, Pan},
    journal={arXiv preprint arXiv:2510.05592},
    year={2025}
}