<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ReinforcementLearning | Hao Yan</title><link>https://hyan46.github.io/tag/reinforcementlearning/</link><atom:link href="https://hyan46.github.io/tag/reinforcementlearning/index.xml" rel="self" type="application/rss+xml"/><description>ReinforcementLearning</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-US</language><copyright>© 2026 Hao Yan</copyright><lastBuildDate>Thu, 01 Jan 2026 00:00:00 +0000</lastBuildDate><image><url>https://hyan46.github.io/media/icon_hudffdcafa99c609c7e4dfde01dba38f93_35970_512x512_fill_lanczos_center_3.png</url><title>ReinforcementLearning</title><link>https://hyan46.github.io/tag/reinforcementlearning/</link></image><item><title>Path-Coupled Bellman Flows for Distributional Reinforcement Learning</title><link>https://hyan46.github.io/xu-path-coupled-icml-2026/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://hyan46.github.io/xu-path-coupled-icml-2026/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>&lt;strong>Path-Coupled Bellman Flows (PCBF)&lt;/strong> is a continuous-time distributional reinforcement
learning method that learns return distributions with &lt;strong>flow matching&lt;/strong> using
&lt;strong>source-consistent Bellman-coupled paths&lt;/strong>: the current path starts from the required base
prior at $t{=}0$, reaches the Bellman target at $t{=}1$, and maintains a pathwise affine
relation to the successor flow at intermediate times. PCBF couples current and successor
return flows through &lt;strong>shared base noise&lt;/strong> and uses a &lt;strong>$\lambda$-parameterized control
variate&lt;/strong> that trades controlled bias for variance reduction in critic training.&lt;/p>
&lt;p>Accepted at &lt;strong>&lt;a href="https://icml.cc" target="_blank" rel="noopener">ICML 2026&lt;/a>&lt;/strong> as a &lt;strong>regular-track presentation&lt;/strong>.&lt;/p>
&lt;figure id="figure-figure-1-path-coupled-bellman-geometry-each-panel-shows-a-single-current-blue-and-successor-orange-return-flow-a-uncoupled-independent-source-noise--flows-are-unrelated-except-in-distribution-b-source-inconsistent-the-successor-starts-from-rgamma-x_0-violating-the-base-prior-at-t0-c-pcbf-shared-noise-drives-both-flows-preserving-the-base-prior-at-t0-and-the-bellman-endpoint-at-t1">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" style="width: 100%; ">&lt;img alt="Path-coupled Bellman geometry: uncoupled flows use independent noise; source-inconsistent flows violate the base prior at t=0; PCBF uses shared noise to preserve both the Gaussian source and the Bellman endpoint." srcset="
/xu-path-coupled-icml-2026/figures/comparison_hud14d972c15fc2473c8ae6fc483bd09b9_239790_67af11229e2c97b2751b66d1160f6599.webp 400w,
/xu-path-coupled-icml-2026/figures/comparison_hud14d972c15fc2473c8ae6fc483bd09b9_239790_dafef5a62ceff156ea1c4a126825fd14.webp 760w,
/xu-path-coupled-icml-2026/figures/comparison_hud14d972c15fc2473c8ae6fc483bd09b9_239790_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://hyan46.github.io/xu-path-coupled-icml-2026/figures/comparison_hud14d972c15fc2473c8ae6fc483bd09b9_239790_67af11229e2c97b2751b66d1160f6599.webp"
loading="lazy"
style="width: 100%; height: auto; display: block;" />&lt;/div>
&lt;/div>&lt;figcaption>
&lt;span class="figure-number">Figure 1: &lt;/span>&lt;strong>Path-coupled Bellman geometry.&lt;/strong> Each panel shows a single current (blue) and successor (orange) return flow. &lt;strong>(a)&lt;/strong> Uncoupled: independent source noise — flows are unrelated except in distribution. &lt;strong>(b)&lt;/strong> Source-inconsistent: the successor starts from $R+gamma X_0$, violating the base prior at $t{=}0$. &lt;strong>(c)&lt;/strong> &lt;strong>PCBF:&lt;/strong> shared noise drives both flows, preserving the base prior at $t{=}0$ and the Bellman endpoint at $t{=}1$.
&lt;/figcaption>&lt;/figure>
&lt;hr>
&lt;h2 id="animation">Animated Demo&lt;/h2>
&lt;p>The animation below visualizes learned return transport on the &lt;strong>Discrete Monte Carlo&lt;/strong>
toy environment: particles flow from a Gaussian source at $t{=}0$ to the learned return
distribution at $t{=}1$ along PCBF Bellman-coupled trajectories.&lt;/p>
&lt;figure id="figure-learned-pcbf-return-transport-on-the-discrete-monte-carlo-environment-individual-particles-colored-trajectories-are-transported-from-the-base-noise-distribution-at-t0-to-state-dependent-return-outcomes-at-t1">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" style="max-width: 980px; width: 100%; ">&lt;img alt="Demonstration of PCBF learned return transport on the Discrete MC environment"
src="https://hyan46.github.io/xu-path-coupled-icml-2026/figures/demo.gif"
loading="lazy"
style="width: 100%; height: auto; display: block;" />&lt;/div>
&lt;/div>&lt;figcaption>
Learned PCBF return transport on the Discrete Monte Carlo environment. Individual particles (colored trajectories) are transported from the base noise distribution at $t{=}0$ to state-dependent return outcomes at $t{=}1$.
&lt;/figcaption>&lt;/figure>
&lt;hr>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Distributional reinforcement learning (DRL) models the full distribution of returns rather
than only their expectation, enabling richer uncertainty representations and often better
empirical performance. Most practical DRL algorithms, however, rely on &lt;strong>finite-dimensional
approximations&lt;/strong> — categorical projections or quantile assignments — that introduce bias
when the Bellman update does not align with fixed support points.&lt;/p>
&lt;p>Reframing DRL as &lt;strong>continuous probability transport&lt;/strong> makes flow matching a natural
framework: the distributional Bellman equation defines an affine transport relationship,
and a neural velocity field can transport samples from a simple Gaussian prior to the
return law without heuristic projections.&lt;/p>
&lt;p>Directly enforcing an uncorrected pointwise Bellman map inside flow composition fails in
two critical ways:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Source boundary mismatch.&lt;/strong> Flow matching requires generation to start from a fixed
simple prior (e.g., $\mathcal{N}(0,1)$), but an uncorrected Bellman update
$Z_t = R + \gamma Z'_t$ starts from $R + \gamma X_0 \neq X_0$.&lt;/li>
&lt;li>&lt;strong>High-variance bootstrapping.&lt;/strong> When current and successor noises are sampled
independently, intermediate trajectories are not pathwise aligned; Bellman consistency
can only be enforced at the endpoint, yielding unstable per-sample targets.&lt;/li>
&lt;/ul>
&lt;p>PCBF resolves both issues through &lt;strong>source-consistent Bellman path correction&lt;/strong> and
&lt;strong>shared-noise path coupling&lt;/strong>, cleanly separating geometric flow requirements from
Bellman bootstrapping variance.&lt;/p>
&lt;hr>
&lt;h2 id="method">Method: Path-Coupled Bellman Flows&lt;/h2>
&lt;h3 id="shared-noise-paths">Shared-noise Bellman paths&lt;/h3>
&lt;p>Given shared base noise $X_0 \sim \mathcal{N}(0,1)$ and a successor return sample
$X' = \psi_{\theta^-}^{1}(X_0 \mid s', a')$ from the target flow map, PCBF defines
time-synchronized linear interpolation paths:&lt;/p>
$$
Z^{s'}_t = (1-t)X_0 + t X'
\qquad\text{(successor path)},
$$
$$
Z^{s}_t = (1-t)X_0 + t\bigl(R + \gamma X'\bigr)
\qquad\text{(current path)}.
$$
&lt;p>An equivalent form that reveals the Bellman geometry is:&lt;/p>
$$
Z^s_t = t R + \gamma Z^{s'}_t + (1-t)(1-\gamma)X_0.
$$
&lt;p>The residual anchor $(1-t)(1-\gamma)X_0$ guarantees $Z^s_0 = X_0$, so the current path
starts exactly at the base prior regardless of $\gamma$, while $Z^s_1 = R + \gamma X'$
satisfies the distributional Bellman boundary at $t{=}1$. Differentiating with respect to
$t$ yields the unbiased BCFM target $\dot Z^s_t = R + \gamma X' - X_0$, which is constant
along the path.&lt;/p>
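&lt;p>As a quick numerical check of the identities above, the following minimal Python sketch builds both paths from one shared noise draw and compares the direct and affine forms of the current path. The function names and the scalar values chosen for $R$, $\gamma$, and $X'$ are illustrative assumptions, not taken from the reference implementation; in PCBF, $X'$ would come from the target flow map rather than being fixed by hand.&lt;/p>

```python
import random


def successor_path(t, x0, x_next):
    """Z^{s'}_t = (1 - t) X_0 + t X'."""
    return (1.0 - t) * x0 + t * x_next


def current_path(t, x0, x_next, reward, gamma):
    """Z^s_t = (1 - t) X_0 + t (R + gamma X')."""
    return (1.0 - t) * x0 + t * (reward + gamma * x_next)


def current_path_affine(t, x0, x_next, reward, gamma):
    """Equivalent form: t R + gamma Z^{s'}_t + (1 - t)(1 - gamma) X_0."""
    z_next = successor_path(t, x0, x_next)
    return t * reward + gamma * z_next + (1.0 - t) * (1.0 - gamma) * x0


def bellman_velocity(x0, x_next, reward, gamma):
    """Time derivative of the current path: R + gamma X' - X_0."""
    return reward + gamma * x_next - x0


rng = random.Random(0)
x0 = rng.gauss(0.0, 1.0)      # shared base noise X_0 ~ N(0, 1)
x_next = 1.7                  # stand-in for X' (assumed value)
reward, gamma = 0.5, 0.99     # assumed reward and discount

for t in (0.0, 0.25, 0.5, 1.0):
    direct = current_path(t, x0, x_next, reward, gamma)
    affine = current_path_affine(t, x0, x_next, reward, gamma)
    print(f"t={t:.2f}  direct={direct:+.6f}  affine={affine:+.6f}")
```

&lt;p>The two columns should agree at every $t$; the $t{=}0$ row equals the noise draw itself and the $t{=}1$ row equals $R + \gamma X'$.&lt;/p>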
&lt;h3 id="lambda-target">Lambda-parameterized control variates&lt;/h3>
&lt;p>To reduce variance from the noisy successor sample $X'$, PCBF forms the training target
$u_t^\lambda$ from two pieces:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Sample Bellman velocity (baseline):&lt;/strong> $Y = R + \gamma X' - X_0$. This is unbiased but
can have high variance because it depends directly on the bootstrapped successor return
$X'$.&lt;/li>
&lt;li>&lt;strong>Control-variate correction:&lt;/strong> $\lambda \cdot \bigl( v_{\theta^-}(t, Z^{s'}_t \mid s', a') - (X' - X_0) \bigr)$,
where $v_{\theta^-}$ is the lagged target velocity field along the successor path
$Z^{s'}_t$.&lt;/li>
&lt;/ul>
&lt;p>Putting them together,&lt;/p>
&lt;p>$u_t^\lambda = Y + \lambda \bigl( v_{\theta^-}(t, Z^{s'}_t \mid s', a') - (X' - X_0) \bigr)$.&lt;/p>
&lt;p>Setting $\lambda = 0$ recovers the unbiased sample Bellman target. Values $\lambda > 0$
introduce a variance-reducing correction using successor-flow velocity predictions. With
shared-noise coupling, the induced bias stays small: in a linear–Gaussian model, shared
noise ($\rho = 1$) gives bias on the order of $(1-\gamma)(1-t)$, which vanishes when
$\gamma \approx 1$ and at the flow endpoints $t \in \{0, 1\}$.&lt;/p>
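&lt;p>The target construction can be sketched in a few lines of Python. This is only an illustration of the formula for $u_t^\lambda$: &lt;code>toy_target_velocity&lt;/code> is a hypothetical stand-in for the lagged field $v_{\theta^-}$ (not a trained network), and the reward, discount, flow time, and successor sample are assumed values.&lt;/p>

```python
import random


def lambda_target(x0, x_next, reward, gamma, lam, v_target):
    """u^lambda_t = Y + lam * (v_target - (X' - X_0)), with Y = R + gamma X' - X_0.

    v_target is the lagged target velocity evaluated at (t, Z^{s'}_t); here it
    is passed in as a plain number.
    """
    y = reward + gamma * x_next - x0              # sample Bellman velocity
    correction = v_target - (x_next - x0)         # control-variate residual
    return y + lam * correction


def toy_target_velocity(t, z):
    """Hypothetical stand-in for v_{theta^-}; NOT a trained network."""
    return 1.0 - 0.5 * z


rng = random.Random(0)
x0 = rng.gauss(0.0, 1.0)                          # shared base noise
x_next = 1.7                                      # assumed successor sample X'
reward, gamma, t = 0.5, 0.99, 0.3                 # assumed values

z_next = (1.0 - t) * x0 + t * x_next              # successor path Z^{s'}_t
v = toy_target_velocity(t, z_next)

for lam in (0.0, 0.5, 1.0):
    u = lambda_target(x0, x_next, reward, gamma, lam, v)
    print(f"lambda={lam:.1f}  target={u:+.6f}")
```

&lt;p>The $\lambda = 0$ row reproduces the sample Bellman velocity $Y$. If the target velocity happened to equal $X' - X_0$ exactly, the correction would vanish and every $\lambda$ would return $Y$.&lt;/p>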
&lt;h3 id="policy-extraction">Policy extraction for offline RL&lt;/h3>
&lt;p>At deployment, a behavior-cloned proposal policy samples $K{=}16$ candidate actions; each
is scored by the mean terminal return under the learned flow
$\hat Q_\theta(s,a) = \frac{1}{M}\sum_m \psi_\theta^{1}(X_{0,m}\mid s,a)$, and the
highest-scoring action is executed.&lt;/p>
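&lt;p>A minimal sketch of this best-of-$K$ selection rule, in plain Python. The velocity field, the uniform candidate sampler standing in for the behavior-cloned proposal policy, the scalar state and action, and the choice $M = 32$ are all assumptions made for illustration; reusing the same noise samples for every candidate is likewise a choice of this sketch, not something stated above.&lt;/p>

```python
import random


def euler_flow_map(velocity, x0, state, action, num_steps=10):
    """Approximate psi^1(x0 | s, a) with fixed-step Euler integration on [0, 1]."""
    z, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity(k * dt, z, state, action)
    return z


def q_estimate(velocity, state, action, noises, num_steps=10):
    """Mean terminal return over M base-noise samples."""
    total = sum(euler_flow_map(velocity, x0, state, action, num_steps) for x0 in noises)
    return total / len(noises)


def select_action(velocity, state, candidates, noises):
    """Score each candidate action and return the highest-scoring one."""
    return max(candidates, key=lambda a: q_estimate(velocity, state, a, noises))


def toy_velocity(t, z, state, action):
    """Hypothetical velocity field whose flow is pulled toward -(a - s)^2."""
    target = -(action - state) ** 2
    return 3.0 * (target - z)


rng = random.Random(0)
state = 0.4
candidates = [rng.uniform(-1.0, 1.0) for _ in range(16)]   # K = 16 proposals
noises = [rng.gauss(0.0, 1.0) for _ in range(32)]          # M = 32 (assumed)

best = select_action(toy_velocity, state, candidates, noises)
print("selected action:", round(best, 4))
```

&lt;p>With this toy field, the score is highest for the candidate closest to the state, so the selected action is simply the nearest proposal.&lt;/p>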
&lt;hr>
&lt;h2 id="toy-environments">Toy Environments: Distributional Fidelity&lt;/h2>
&lt;p>We validate PCBF on three analytically tractable environments with known return laws:
&lt;strong>Solitaire Dice&lt;/strong> (heavy-tailed discrete returns), &lt;strong>Bernoulli MRP&lt;/strong> (uniform return on
$[0,2]$), and &lt;strong>Discrete Monte Carlo Chain&lt;/strong> (multimodal finite-horizon returns).&lt;/p>
&lt;figure id="figure-figure-2-learned-pcbf-maps-on-toy-environments-solitaire-top-left-bernoulli-top-right-discrete-mc-bottom-pcbf-recovers-heavy-tailed-uniform-and-multimodal-return-structures-and-closely-matches-ground-truth-histograms">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" style="width: 100%; ">&lt;img alt="Learned PCBF maps on Solitaire, Bernoulli, and Discrete MC toy environments" srcset="
/xu-path-coupled-icml-2026/figures/physics_combined_hufca0a029fbcaa665ca59a9f8c7acda01_1272620_097b50046d8c7f8176e55e305adb21b2.webp 400w,
/xu-path-coupled-icml-2026/figures/physics_combined_hufca0a029fbcaa665ca59a9f8c7acda01_1272620_853ad70ad0e59d0c8b0d1a8e726e0b9c.webp 760w,
/xu-path-coupled-icml-2026/figures/physics_combined_hufca0a029fbcaa665ca59a9f8c7acda01_1272620_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://hyan46.github.io/xu-path-coupled-icml-2026/figures/physics_combined_hufca0a029fbcaa665ca59a9f8c7acda01_1272620_097b50046d8c7f8176e55e305adb21b2.webp"
loading="lazy"
style="width: 90%; height: auto; display: block;" />&lt;/div>
&lt;/div>&lt;figcaption>
&lt;span class="figure-number">Figure 2: &lt;/span>&lt;strong>Learned PCBF maps on toy environments.&lt;/strong> Solitaire (top left), Bernoulli (top right), Discrete MC (bottom). PCBF recovers heavy-tailed, uniform, and multimodal return structures and closely matches ground-truth histograms.
&lt;/figcaption>&lt;/figure>
&lt;figure id="figure-figure-3-distributional-accuracy-on-toy-environments-learned-return-cdfs-for-pcbf-and-value-flows-dcfm-in-0-05-1-versus-ground-truth-references-pcbf-consistently-tracks-the-reference-cdfs-value-flows-degrades-as-dcfm-increases-systematically-underestimating-return-variance">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" style="width: 100%; ">&lt;img alt="CDF comparison of PCBF vs Value Flows on toy environments" srcset="
/xu-path-coupled-icml-2026/figures/toy22_hu0dcfbadcbff05a3ab4013d9c2dd219a9_131428_123c4ba7a42f2001c8d2f13204355c1a.webp 400w,
/xu-path-coupled-icml-2026/figures/toy22_hu0dcfbadcbff05a3ab4013d9c2dd219a9_131428_ecb6f40bb862be0f6c885ae09ee3e7d0.webp 760w,
/xu-path-coupled-icml-2026/figures/toy22_hu0dcfbadcbff05a3ab4013d9c2dd219a9_131428_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://hyan46.github.io/xu-path-coupled-icml-2026/figures/toy22_hu0dcfbadcbff05a3ab4013d9c2dd219a9_131428_123c4ba7a42f2001c8d2f13204355c1a.webp"
loading="lazy"
style="width: 90%; height: auto; display: block;" />&lt;/div>
&lt;/div>&lt;figcaption>
&lt;span class="figure-number">Figure 3: &lt;/span>&lt;strong>Distributional accuracy on toy environments.&lt;/strong> Learned return CDFs for PCBF and Value Flows (dcfm $in {0, 0.5, 1}$) versus ground-truth references. PCBF consistently tracks the reference CDFs; Value Flows degrades as dcfm increases, systematically underestimating return variance.
&lt;/figcaption>&lt;/figure>
&lt;figure id="figure-figure-4-hyperparameter-sensitivity-pcbf-vs-value-flows-on-solitaire-and-discrete-mc-increasing-value-flows-dcfm-coefficient-degrades-wasserstein-error-while-pcbfs-lambda-target-remains-robust-across-a-wide-range-of-values">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" style="width: 100%; ">&lt;img alt="Hyperparameter sensitivity of PCBF vs Value Flows on Solitaire and Discrete MC" srcset="
/xu-path-coupled-icml-2026/figures/two_ablation_hu1b966fd1ebe0e14665f7c6108986d77b_137262_5a027b03153b94dd54b96d6a35e57e56.webp 400w,
/xu-path-coupled-icml-2026/figures/two_ablation_hu1b966fd1ebe0e14665f7c6108986d77b_137262_efe33ea10aa36415e9e4a98d88243e49.webp 760w,
/xu-path-coupled-icml-2026/figures/two_ablation_hu1b966fd1ebe0e14665f7c6108986d77b_137262_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://hyan46.github.io/xu-path-coupled-icml-2026/figures/two_ablation_hu1b966fd1ebe0e14665f7c6108986d77b_137262_5a027b03153b94dd54b96d6a35e57e56.webp"
loading="lazy"
style="width: 90%; height: auto; display: block;" />&lt;/div>
&lt;/div>&lt;figcaption>
&lt;span class="figure-number">Figure 4: &lt;/span>&lt;strong>Hyperparameter sensitivity (PCBF vs. Value Flows).&lt;/strong> On Solitaire and Discrete MC, increasing Value Flows&amp;rsquo; dcfm coefficient degrades Wasserstein error, while PCBF&amp;rsquo;s $lambda$-target remains robust across a wide range of values.
&lt;/figcaption>&lt;/figure>
&lt;figure id="figure-figure-5-variance-reduction-via-lambda-parameterized-control-variates-larger-lambda-yields-smoother-bellman-velocity-regression-loss-trajectories-lower-within-run-standard-deviation-validating-the-control-variate-mechanism">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" style="width: 100%; ">&lt;img alt="Variance reduction via lambda control variates during training" srcset="
/xu-path-coupled-icml-2026/figures/variance_reduction_hu1a747344d4fac88e0ddace86b41e5b7e_127284_55fa780575e8f5c6d50cd9fd502fc76f.webp 400w,
/xu-path-coupled-icml-2026/figures/variance_reduction_hu1a747344d4fac88e0ddace86b41e5b7e_127284_08d444d5a1287cc5044dd0a02c28aede.webp 760w,
/xu-path-coupled-icml-2026/figures/variance_reduction_hu1a747344d4fac88e0ddace86b41e5b7e_127284_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://hyan46.github.io/xu-path-coupled-icml-2026/figures/variance_reduction_hu1a747344d4fac88e0ddace86b41e5b7e_127284_55fa780575e8f5c6d50cd9fd502fc76f.webp"
loading="lazy"
style="width: 80%; height: auto; display: block;" />&lt;/div>
&lt;/div>&lt;figcaption>
&lt;span class="figure-number">Figure 5: &lt;/span>&lt;strong>Variance reduction via $lambda$-parameterized control variates.&lt;/strong> Larger $lambda$ yields smoother Bellman velocity regression loss trajectories (lower within-run standard deviation), validating the control-variate mechanism.
&lt;/figcaption>&lt;/figure>
&lt;hr>
&lt;h2 id="path-consistency">Pathwise Bellman Residual and Discretization&lt;/h2>
&lt;p>PCBF enforces the Bellman endpoint at $t{=}1$ by construction, but training uses a
finite-step Euler solver with 10 function evaluations (NFE), so the integrated flow only
approximates that endpoint. Shared-noise coupling yields smaller &lt;strong>corrected
Bellman residuals&lt;/strong> $r_{\mathrm{corr}}(t,N)$ than independent-noise ablations across
solver budgets $N \in \{4,8,16,32\}$:&lt;/p>
&lt;figure id="figure-figure-6-corrected-bellman-residual-r_mathrmcorrtn-on-solitaire-dice-shared-noise-pcbf-blue-maintains-lower-residuals-than-independent-noise-coupling-orange-across-flow-times-and-euler-budgets">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" style="width: 100%; ">&lt;img alt="Corrected Bellman residual on Solitaire Dice for shared vs independent noise coupling" srcset="
/xu-path-coupled-icml-2026/figures/nfe_hua7d3c173ccc1cb1fc20db9d794571924_101634_5a9ef4c9f72e14fa5f15897607e3b55d.webp 400w,
/xu-path-coupled-icml-2026/figures/nfe_hua7d3c173ccc1cb1fc20db9d794571924_101634_6e9df35176ecf054e979ec0790f83267.webp 760w,
/xu-path-coupled-icml-2026/figures/nfe_hua7d3c173ccc1cb1fc20db9d794571924_101634_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://hyan46.github.io/xu-path-coupled-icml-2026/figures/nfe_hua7d3c173ccc1cb1fc20db9d794571924_101634_5a9ef4c9f72e14fa5f15897607e3b55d.webp"
loading="lazy"
style="width: 80%; height: auto; display: block;" />&lt;/div>
&lt;/div>&lt;figcaption>
&lt;span class="figure-number">Figure 6: &lt;/span>&lt;strong>Corrected Bellman residual&lt;/strong> $r_{mathrm{corr}}(t,N)$ on Solitaire Dice. Shared-noise PCBF (blue) maintains lower residuals than independent-noise coupling (orange) across flow times and Euler budgets.
&lt;/figcaption>&lt;/figure>
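&lt;p>To give a feel for how the Euler budget $N$ affects terminal accuracy, the sketch below integrates a hand-picked velocity field with a known closed-form flow and reports the terminal error for each budget. The field $v(t,z) = -z$ and the function names are illustrative assumptions. This is not the paper's corrected Bellman residual $r_{\mathrm{corr}}(t,N)$, only a generic picture of discretization error in a fixed-step solver.&lt;/p>

```python
import math


def euler_integrate(velocity, z0, num_steps):
    """Fixed-step Euler integration of dz/dt = velocity(t, z) from t=0 to t=1."""
    z, dt = z0, 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity(k * dt, z)
    return z


def toy_velocity(t, z):
    """Assumed test field with exact terminal value z0 * exp(-1)."""
    return -z


z0 = 1.0
exact = z0 * math.exp(-1.0)
errors = {}
for n in (4, 8, 16, 32):
    errors[n] = abs(euler_integrate(toy_velocity, z0, n) - exact)
    print(f"N={n:2d}  terminal error={errors[n]:.5f}")
```

&lt;p>For this field the terminal error shrinks as the budget grows, which is the usual first-order behavior of Euler integration.&lt;/p>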
&lt;hr>
&lt;h2 id="offline-rl-benchmarks">Offline RL Benchmarks&lt;/h2>
&lt;p>We evaluate PCBF on &lt;strong>38 offline RL tasks&lt;/strong>: 30 OGBench single-task variants (four
state-based manipulation domains and two pixel-based domains) plus eight D4RL Adroit tasks.
Baselines include distributional methods (IQN, CODAC, Value Flows), flow-based scalar
critics (FloQ, FQL), and IQL.&lt;/p>
&lt;figure id="figure-figure-7-ogbench-tasks-state-based-cube-scene-and-puzzle-manipulation-environments-and-pixel-based-visual-variants-used-in-our-evaluation">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" style="width: 100%; ">&lt;img alt="OGBench task illustrations" srcset="
/xu-path-coupled-icml-2026/figures/ogbench_hub9bc7e2659a4678c92a9f8dd67bd8f62_503234_e186c92b9267115b0189a2b2e0111064.webp 400w,
/xu-path-coupled-icml-2026/figures/ogbench_hub9bc7e2659a4678c92a9f8dd67bd8f62_503234_502ececf3d83746f0ce32528258521fa.webp 760w,
/xu-path-coupled-icml-2026/figures/ogbench_hub9bc7e2659a4678c92a9f8dd67bd8f62_503234_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://hyan46.github.io/xu-path-coupled-icml-2026/figures/ogbench_hub9bc7e2659a4678c92a9f8dd67bd8f62_503234_e186c92b9267115b0189a2b2e0111064.webp"
loading="lazy"
style="width: 70%; height: auto; display: block;" />&lt;/div>
&lt;/div>&lt;figcaption>
&lt;span class="figure-number">Figure 7: &lt;/span>&lt;strong>OGBench tasks.&lt;/strong> State-based cube, scene, and puzzle manipulation environments and pixel-based visual variants used in our evaluation.
&lt;/figcaption>&lt;/figure>
&lt;h3 id="quantitative">Aggregated results&lt;/h3>
&lt;style>
.pcbf-results-wrap { overflow-x: auto; margin: 1.25rem 0; }
table.pcbf-results {
width: 100%;
border-collapse: collapse;
font-size: 0.92rem;
font-family: 'Noto Sans', sans-serif;
background: #fff;
}
table.pcbf-results th, table.pcbf-results td {
padding: 8px 10px;
text-align: center;
border-bottom: 1px solid #e6e6e6;
}
table.pcbf-results thead tr.group th {
background: #f5f7fa;
font-weight: 700;
border-bottom: 1px solid #d6d9df;
}
table.pcbf-results td.domain, table.pcbf-results th.domain {
text-align: left;
font-weight: 500;
white-space: nowrap;
}
table.pcbf-results tr.proposed {
background: #eaf3ff;
font-weight: 700;
}
table.pcbf-results tr.proposed td { border-bottom: 1px solid #c9def5; }
table.pcbf-results td.best { color: #0a66c2; font-weight: 700; }
table.pcbf-results caption {
caption-side: top;
text-align: left;
padding: 0.25rem 0 0.75rem 0;
font-size: 0.95rem;
color: #444;
}
&lt;/style>
&lt;div class="pcbf-results-wrap">
&lt;table class="pcbf-results">
&lt;caption>&lt;strong>Table 1.&lt;/strong> Offline RL results on OGBench and D4RL Adroit.
Success rates (%) for OGBench domains (5 tasks each) and normalized scores for D4RL.
Results averaged over 8 seeds (4 for pixel tasks). Bold values are within 95% of the
best method on each domain; &lt;em>PCBF (Ours)&lt;/em> is highlighted.&lt;/caption>
&lt;thead>
&lt;tr class="group">
&lt;th class="domain">Domain&lt;/th>
&lt;th>IQN&lt;/th>
&lt;th>CODAC&lt;/th>
&lt;th>FloQ&lt;/th>
&lt;th>FQL&lt;/th>
&lt;th>IQL&lt;/th>
&lt;th>Value Flows&lt;/th>
&lt;th>PCBF (Ours)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td class="domain">cube-double-play (5 tasks)&lt;/td>
&lt;td>42 ± 8&lt;/td>&lt;td>61 ± 6&lt;/td>&lt;td>47 ± 14&lt;/td>&lt;td>29 ± 6&lt;/td>&lt;td>7 ± 1&lt;/td>&lt;td>69 ± 4&lt;/td>
&lt;td class="best">71 ± 5&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="domain">scene-play (5 tasks)&lt;/td>
&lt;td>40 ± 1&lt;/td>&lt;td>55 ± 1&lt;/td>&lt;td class="best">58 ± 4&lt;/td>&lt;td>56 ± 2&lt;/td>&lt;td>28 ± 3&lt;/td>&lt;td class="best">59 ± 4&lt;/td>
&lt;td>54 ± 4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="domain">puzzle-4×4-play (5 tasks)&lt;/td>
&lt;td>27 ± 4&lt;/td>&lt;td>20 ± 18&lt;/td>&lt;td>28 ± 6&lt;/td>&lt;td>17 ± 5&lt;/td>&lt;td>7 ± 2&lt;/td>&lt;td>27 ± 4&lt;/td>
&lt;td class="best">30 ± 4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="domain">cube-triple-play (5 tasks)&lt;/td>
&lt;td>6 ± 0&lt;/td>&lt;td>2 ± 1&lt;/td>&lt;td>8 ± 3&lt;/td>&lt;td>4 ± 2&lt;/td>&lt;td>1 ± 1&lt;/td>&lt;td class="best">14 ± 3&lt;/td>
&lt;td>4 ± 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="domain">D4RL adroit (8 tasks)&lt;/td>
&lt;td>66 ± 5&lt;/td>&lt;td>69 ± 0&lt;/td>&lt;td>70 ± 5&lt;/td>&lt;td class="best">71 ± 4&lt;/td>&lt;td>70&lt;/td>&lt;td>65 ± 2&lt;/td>
&lt;td class="best">69 ± 2&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="domain">visual-antmaze-teleport (5 tasks)&lt;/td>
&lt;td>4 ± 2&lt;/td>&lt;td>—&lt;/td>&lt;td>—&lt;/td>&lt;td>5 ± 2&lt;/td>&lt;td>6 ± 4&lt;/td>&lt;td>13 ± 4&lt;/td>
&lt;td class="best">14 ± 4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="domain">visual-cube-double-play (5 tasks)&lt;/td>
&lt;td>1 ± 0&lt;/td>&lt;td>—&lt;/td>&lt;td>—&lt;/td>&lt;td>6 ± 1&lt;/td>&lt;td>11 ± 6&lt;/td> &lt;td class="best">13 ± 2&lt;/td>
&lt;td>3 ± 0&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;p>&lt;strong>Takeaways.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Selective but strong gains.&lt;/strong> PCBF achieves best or near-best aggregate performance on
&lt;strong>cube-double-play&lt;/strong>, &lt;strong>puzzle-4×4-play&lt;/strong>, &lt;strong>D4RL Adroit&lt;/strong>, and
&lt;strong>visual-antmaze-teleport&lt;/strong>, where critic-side return-law fidelity and variance-controlled
bootstrapping affect action ranking.&lt;/li>
&lt;li>&lt;strong>Best distributional fidelity on toys.&lt;/strong> On analytically tractable MRPs, PCBF closely
tracks ground-truth CDFs and remains robust to $\lambda$, while Value Flows degrades as
the DCFM consistency weight increases.&lt;/li>
&lt;li>&lt;strong>Honest limitations.&lt;/strong> On &lt;strong>cube-triple-play&lt;/strong> and &lt;strong>visual-cube-double-play&lt;/strong>, PCBF
underperforms Value Flows — long-horizon sparse-reward and pixel-based settings remain
challenging when policy extraction, visual encoders, or $\lambda$ selection become
bottlenecks.&lt;/li>
&lt;li>&lt;strong>Similar cost to Value Flows.&lt;/strong> PCBF uses about 60 GB of GPU memory and about
2.5× the wall-clock time of scalar critics on OGBench (single A100, $10^6$ steps); training
requires 10-step Euler integration of the velocity field.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="key-ideas">Key Contributions&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Source-consistent Bellman-interpolated paths&lt;/strong> that resolve the $t{=}0$ boundary mismatch
of uncorrected pointwise Bellman paths while preserving the Bellman endpoint at $t{=}1$.&lt;/li>
&lt;li>&lt;strong>Shared-noise path coupling&lt;/strong> that aligns current and successor return flows pathwise,
inducing a geometric Bellman relation between velocity fields.&lt;/li>
&lt;li>&lt;strong>$\lambda$-parameterized control-variate target&lt;/strong> with a distribution-free $L_2$ bias
bound and a linear–Gaussian closed form explaining why shared-noise coupling shrinks
intrinsic bias.&lt;/li>
&lt;li>&lt;strong>Population velocity identification&lt;/strong>, shared-noise Bellman contraction, and Euler
integration sensitivity analysis supporting stable flow-based distributional critics.&lt;/li>
&lt;li>&lt;strong>Comprehensive evaluation&lt;/strong> on Solitaire Dice, Bernoulli, and Discrete MC toy MRPs plus
38 OGBench and D4RL offline RL tasks.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="quickstart">Quick Start&lt;/h2>
&lt;p>The reference implementation is available on GitHub:
&lt;a href="https://github.com/BoyangASU/path-coupled-bellman-flows" target="_blank" rel="noopener">&lt;strong>BoyangASU/path-coupled-bellman-flows&lt;/strong>&lt;/a>.&lt;/p>
&lt;p>PCBF is implemented in JAX, adapted from the FQL codebase. Key hyperparameters: 10 Euler
integration steps, batch size 256, learning rate $3\times10^{-4}$, and domain-tuned
$\lambda$ (see the tables in the paper for per-domain values). State-based tasks train for
1M gradient steps; pixel-based tasks train for 500K steps.&lt;/p>
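&lt;p>For reference, the settings quoted above can be collected in a small Python dictionary. The key names and the helper function are hypothetical and are not the repository's actual configuration flags; $\lambda$ is left unset because its per-domain values are given only in the paper.&lt;/p>

```python
# Hypothetical key names; values are the ones quoted in the paragraph above.
PCBF_HPARAMS = {
    "euler_steps": 10,
    "batch_size": 256,
    "learning_rate": 3e-4,
    "lam": None,  # domain-tuned; per-domain values are listed in the paper
    "train_steps": {"state": 1_000_000, "pixel": 500_000},
}


def train_steps_for(task_kind):
    """Return the gradient-step budget for 'state' or 'pixel' tasks."""
    return PCBF_HPARAMS["train_steps"][task_kind]


print(train_steps_for("state"), train_steps_for("pixel"))
```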
&lt;hr>
&lt;h2 id="resources">Resources&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Paper (arXiv):&lt;/strong> &lt;a href="https://arxiv.org/abs/2605.08253" target="_blank" rel="noopener">arXiv:2605.08253&lt;/a>&lt;/li>
&lt;li>&lt;strong>Code:&lt;/strong> &lt;a href="https://github.com/BoyangASU/path-coupled-bellman-flows" target="_blank" rel="noopener">github.com/BoyangASU/path-coupled-bellman-flows&lt;/a>&lt;/li>
&lt;li>&lt;strong>Venue:&lt;/strong> ICML 2026 (regular track)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="bibtex">BibTeX&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bibtex" data-lang="bibtex">&lt;span class="line">&lt;span class="cl">&lt;span class="nc">@inproceedings&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="nl">xu2026pathcoupled&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="na">title&lt;/span> &lt;span class="p">=&lt;/span> &lt;span class="s">{Path-Coupled Bellman Flows for Distributional Reinforcement Learning}&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="na">author&lt;/span> &lt;span class="p">=&lt;/span> &lt;span class="s">{Xu, Boyang and Zou, Qing and Yang, Siqin and Yan, Hao}&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="na">booktitle&lt;/span> &lt;span class="p">=&lt;/span> &lt;span class="s">{Proceedings of the International Conference on Machine Learning (ICML)}&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="na">year&lt;/span> &lt;span class="p">=&lt;/span> &lt;span class="s">{2026}&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="na">note&lt;/span> &lt;span class="p">=&lt;/span> &lt;span class="s">{Regular track}&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="na">eprint&lt;/span> &lt;span class="p">=&lt;/span> &lt;span class="s">{2605.08253}&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="na">archivePrefix&lt;/span> &lt;span class="p">=&lt;/span> &lt;span class="s">{arXiv}&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="na">primaryClass&lt;/span> &lt;span class="p">=&lt;/span> &lt;span class="s">{cs.LG}&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="na">url&lt;/span> &lt;span class="p">=&lt;/span> &lt;span class="s">{https://arxiv.org/abs/2605.08253}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item></channel></rss>