Spartan: Outer Sumcheck
Remember the named constraints from Day 3. For each time step we got a witness vector \(z_t\) with 19 locations. Here’s the mental model you should have. TODO picture
Each of these constraints had an \(A\) part and a \(B\) part, and together we wanted \(A \circ B\) to be zero. See this example for a refresher. For every cycle of the trace, we wanted the prover to satisfy all 19 constraints. Now let’s say we split the 19 constraints into two groups \(G_0\) of size 10 and \(G_1\) of size 9.
- Group 0 (b=0): Constraints 1-10 (y ∈ {-4,…,5})
  - Az: booleans
  - Bz: i128 values
- Group 1 (b=1): Constraints 11-19 (9 real + 1 padding)
  - Az: u8 values
  - Bz: S160 values
Algebraically, this is simply saying:
\[\widetilde{A}(t, b, y) \cdot \widetilde{B}(t, b, y) = 0 \quad \forall t \in \{0,1\}^{\log T},\ \forall b \in \{0,1\},\ \forall y \in \{-4, ..., 5\}\]
where \(\widetilde{A}(t, b, y)\) and \(\widetilde{B}(t, b, y)\) give you one specific cell from the table above. Now we do the same thing we have done since the beginning of time: to check that this identity holds at every point of the domain, we compress all of the checks into one random evaluation. We sample \((\tau_t \,\|\, \tau_b \,\|\, \tau_y) \xleftarrow[]{\$}\mathbb{F}^{\log T+2}\) and run a sumcheck to verify that
\[\sum_{t \in \{0,1\}^{\log T}} \sum_{b \in \{0,1\}} \sum_{y \in \{-4,...,5\}} \text{eq}(\tau_t, t) \cdot \text{eq}(\tau_b, b) \cdot \text{eq}(\tau_y, y) \cdot \widetilde{A}(t, b, y) \cdot \widetilde{B}(t, b, y) = 0\]
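To make the claim concrete before any sumcheck machinery, here is a minimal sketch of the naive check the sumcheck compresses. The accessors a_cell and b_cell are hypothetical stand-ins (not the real codebase API) that return one cell of the A and B tables.

```rust
/// Naive check the outer sumcheck replaces: every cell of the A∘B table is zero.
/// `a_cell` / `b_cell` are hypothetical accessors mapping (t, b, y) to a table cell.
fn all_constraints_hold<A, B>(num_steps: usize, a_cell: A, b_cell: B) -> bool
where
    A: Fn(usize, usize, i64) -> i128,
    B: Fn(usize, usize, i64) -> i128,
{
    for t in 0..num_steps {
        for b in 0..2usize {
            for y in -4i64..=5 {
                if a_cell(t, b, y) * b_cell(t, b, y) != 0 {
                    return false;
                }
            }
        }
    }
    true
}
```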
First Round Polynomial
Now in the first round we have to bind \(y\), which works a little differently from the later rounds: the prover sends the verifier the following univariate polynomial
\[g_1(Y) = \sum_{t \in \{0,1\}^{\log T}} \sum_{b \in \{0,1\}} \text{eq}(\tau_t, t) \cdot \text{eq}(\tau_b, b) \cdot \widetilde{A}(t, b, Y) \cdot \widetilde{B}(t, b, Y)\]
after which the verifier samples \(r \xleftarrow[]{\$}\mathbb{F}\) and binds \(y\) to it.
Observe that the degree of \(g_1\) is at most 18. To see why, note that for a fixed \((t, b)\), \(\widetilde{A}(t, b, Y)\) is the unique univariate polynomial through 10 evaluation points, so it has degree at most 9; the same holds for \(\widetilde{B}(t, b, Y)\), so their product has degree at most 18. A degree-18 polynomial is pinned down by 19 evaluations, and we already have 10 of them: if the prover is honest, \(g_1\) evaluates to 0 at every point of the base window \(\{-4, -3, \ldots, 4, 5\}\). To fully describe \(g_1\) we therefore need to evaluate \(\widetilde{A}(t, b, Y)\) and \(\widetilde{B}(t, b, Y)\) at 9 more locations, and the locations chosen in the code are simply points on either side of the base window.
- \(g_1(y) = 0\) for all \(y \in \{-4, ..., 5\}\) (base window) when constraints are satisfied
- We use Lagrange interpolation to extend to arbitrary \(Y \in \mathbb{F}\) (extended domain); see the sketch after this list
- This function computes \(g_1(Y_j)\) for evaluation points outside the base window
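As a minimal sketch of that extension step (exact integer arithmetic, not the optimized routine from the codebase): given the 10 values of a degree-at-most-9 polynomial on the base window, we can evaluate it at any integer point outside the window. For equispaced integer nodes, the Lagrange coefficients at integer points are themselves integers, so the division below is exact.

```rust
/// Evaluate the degree-≤9 polynomial with values `base_vals` on {-4, ..., 5}
/// at an integer point `y`, via Lagrange interpolation in exact integer arithmetic.
fn lagrange_extend(base_vals: &[i128; 10], y: i64) -> i128 {
    let nodes: [i64; 10] = core::array::from_fn(|i| i as i64 - 4); // base window -4..=5
    let mut acc: i128 = 0;
    for i in 0..10 {
        // L_i(y) = prod_{k != i} (y - y_k) / (y_i - y_k); integral for integer y.
        let (mut num, mut den) = (1i128, 1i128);
        for k in 0..10 {
            if k != i {
                num *= (y - nodes[k]) as i128;
                den *= (nodes[i] - nodes[k]) as i128;
            }
        }
        debug_assert_eq!(num % den, 0);
        acc += (num / den) * base_vals[i];
    }
    acc
}

fn main() {
    // Toy check: the values of p(y) = y^2 on the base window extend to p(7) = 49.
    let vals: [i128; 10] = core::array::from_fn(|i| (i as i128 - 4).pow(2));
    assert_eq!(lagrange_extend(&vals, 7), 49);
}
```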
Code Walkthrough
Section 1: Precompute Lagrange Coefficients
let base_left: i64 = -((UNIVARIATE_SKIP_DOMAIN_SIZE as i64 - 1) / 2);
let targets: [i64; UNIVARIATE_SKIP_DEGREE] =
    uniskip_targets::<UNIVARIATE_SKIP_DOMAIN_SIZE, UNIVARIATE_SKIP_DEGREE>();
let target_shifts: [i64; UNIVARIATE_SKIP_DEGREE] =
    core::array::from_fn(|j| targets[j] - base_left);
let coeffs_per_j: [[i32; UNIVARIATE_SKIP_DOMAIN_SIZE]; UNIVARIATE_SKIP_DEGREE] =
    core::array::from_fn(|j| {
        LagrangeHelper::shift_coeffs_i32::<UNIVARIATE_SKIP_DOMAIN_SIZE>(target_shifts[j])
    });
What’s happening:

- base_left = -4: leftmost point of the base window {-4, -3, …, 5}
- targets[j]: the extended-domain evaluation points \(Y_j\) (outside the base window)
- target_shifts[j]: distance from base_left to each target point
- coeffs_per_j[j][i]: Lagrange coefficients \(\mathcal{L}_i(Y_j)\) for interpolating to point \(Y_j\) from base point \(i\)

Purpose: to compute \(Az(Y_j)\) and \(Bz(Y_j)\) at the extended points (see the sketch below) using: \[\widetilde{A}(t, b, Y_j) = \sum_{i=0}^{9} \text{coeffs\_per\_j}[j][i] \cdot \widetilde{A}(t, b, y_i)\]
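As a hedged illustration of that equation (assuming coeffs_per_j[j][i] really is the integer Lagrange basis value \(\mathcal{L}_i(Y_j)\), which is what the formula above implies), extending a vector of base-window values to one target point is just a dot product with the corresponding coefficient row; the real code fuses this with the Az/Bz accumulators shown below.

```rust
/// Apply one row of the Lagrange coefficient table to base-window evaluations:
/// value(Y_j) = sum_i coeffs_row[i] * value(y_i). Illustrative sketch only.
fn extend_to_target(coeffs_row: &[i32; 10], base_vals: &[i128; 10]) -> i128 {
    coeffs_row
        .iter()
        .zip(base_vals.iter())
        .map(|(&c, &v)| c as i128 * v)
        .sum()
}
```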
Section 2: Split Challenge and Compute Eq Polynomials
let m = tau_low.len() / 2;
let (tau_out, tau_in) = tau_low.split_at(m);
let (E_out, E_in) = rayon::join(
    || EqPolynomial::evals_with_scaling(tau_out, Some(F::MONTGOMERY_R_SQUARE)),
    || EqPolynomial::evals(tau_in),
);
What’s happening:

- tau_low has length \(\log T + 1\) (the binary hypercube variables)
- It is split into tau_out (length \(\log T/2\)) and tau_in (length \(\log T/2 + 1\))
- E_out[x_out] = \(\text{eq}(\tau_{out}, x_{out})\) for all \(x_{out} \in \{0,1\}^{\log T/2}\)
- E_in[x_in] = \(\text{eq}(\tau_{in}, x_{in})\) for all \(x_{in} \in \{0,1\}^{\log T/2 + 1}\)
Note: E_out is scaled by Montgomery R² for optimization in later accumulation.
Purpose: Precompute all eq polynomial evaluations to use in the sum over the binary hypercube.
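For intuition, here is a minimal sketch of how such an eq table can be built. It is the standard doubling construction, generic over a field-like type F; the bit ordering (tau[0] as the most significant index bit) and the trait bounds are assumptions of this sketch, not necessarily how EqPolynomial::evals is written.

```rust
/// Build the table eq(tau, x) = prod_i (tau_i * x_i + (1 - tau_i)(1 - x_i))
/// for all x in {0,1}^m, doubling the table one variable at a time.
fn eq_evals<F>(tau: &[F]) -> Vec<F>
where
    F: Copy + core::ops::Mul<Output = F> + core::ops::Sub<Output = F> + From<u64>,
{
    let mut evals = vec![F::from(1u64)];
    for &t in tau {
        let mut next = Vec::with_capacity(evals.len() * 2);
        for &e in &evals {
            let hi = e * t;    // contribution of the x_i = 1 branch
            next.push(e - hi); // x_i = 0 branch: e * (1 - t)
            next.push(hi);     // x_i = 1 branch
        }
        evals = next;
    }
    evals
}
```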
Section 3: Setup Iteration Parameters
let num_x_out_vals = E_out.len();
let num_x_in_vals = E_in.len();
assert!(
    num_x_in_vals >= 2,
    "univariate skip expects at least 2 x_in values (last bit is group index)"
);
let num_x_in_half = num_x_in_vals >> 1;
What’s happening:

- num_x_out_vals = 2^(log T/2): number of x_out values
- num_x_in_vals = 2^(log T/2 + 1): number of x_in values
- The last bit of x_in is the group selector \(b\)
- num_x_in_half = num_x_in_vals / 2: number of \(x_{in}'\) values (ignoring the group bit)

Indexing scheme (illustrated in the snippet below):

- Even x_in indices (last bit = 0) → Group 0
- Odd x_in indices (last bit = 1) → Group 1
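A tiny illustration of that indexing with hypothetical example values:

```rust
fn main() {
    let x_in_prime = 0b101usize; // some x_in' value
    let b = 1usize;              // group selector bit
    let x_in = (x_in_prime << 1) | b; // full index into E_in: the group bit is the low bit
    assert_eq!(x_in & 1, b);
    assert_eq!(x_in >> 1, x_in_prime);
}
```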
Section 4: Parallel Processing Setup
let num_parallel_chunks = core::cmp::min(
    num_x_out_vals,
    rayon::current_num_threads().next_power_of_two() * 8,
);
let x_out_chunk_size = if num_x_out_vals > 0 {
    core::cmp::max(1, num_x_out_vals.div_ceil(num_parallel_chunks))
} else {
    0
};
let iter_num_x_in_vars = num_x_in_vals.log_2();
let iter_num_x_in_prime_vars = iter_num_x_in_vars - 1;
What’s happening:

- num_parallel_chunks: how many parallel tasks to create (up to 8× the thread count, capped at num_x_out_vals)
- x_out_chunk_size: how many x_out indices each parallel chunk will process
- iter_num_x_in_vars = log₂(num_x_in_vals): number of bits in the full x_in index (including the group bit)
- iter_num_x_in_prime_vars = iter_num_x_in_vars - 1: number of bits in x_in’ (excluding the group bit)

Purpose: split the sum over x_out across multiple threads for parallelization.

Example: if num_x_out_vals = 1024 and there are 8 threads, we create 64 chunks of 16 x_out values each.
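The same arithmetic, worked out for that example:

```rust
fn main() {
    let num_x_out_vals = 1024usize;
    let threads = 8usize;
    let num_parallel_chunks =
        core::cmp::min(num_x_out_vals, threads.next_power_of_two() * 8); // 64 chunks
    let x_out_chunk_size = num_x_out_vals.div_ceil(num_parallel_chunks); // 16 x_out values each
    assert_eq!((num_parallel_chunks, x_out_chunk_size), (64, 16));
}
```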
Section 5: Main Parallel Computation Loop
(0..num_parallel_chunks)
    .into_par_iter()
    .map(|chunk_idx| {
        let x_out_start = chunk_idx * x_out_chunk_size;
        let x_out_end = core::cmp::min((chunk_idx + 1) * x_out_chunk_size, num_x_out_vals);
        let mut acc_field: [F; UNIVARIATE_SKIP_DEGREE] = [F::zero(); UNIVARIATE_SKIP_DEGREE];
What’s happening:

- Parallel iteration: each thread processes one chunk
- Each chunk handles the x_out values in the range [x_out_start, x_out_end)
- acc_field[j]: accumulator for \(g_1(Y_j)\) at each extended point \(Y_j\)
- This will accumulate the sum over all (x_out, x_in’, b) in this chunk
Mathematical notation: We’re computing a partial sum: \[\text{acc\_field}[j] = \sum_{x_{out} \in \text{chunk}} \sum_{x_{in}'} \left[\text{eq}(\tau_{out}, x_{out}) \cdot \left(\text{eq}(\tau_{in}, (x_{in}', 0)) \cdot Az_0(Y_j) \cdot Bz_0(Y_j) + \text{eq}(\tau_{in}, (x_{in}', 1)) \cdot Az_1(Y_j) \cdot Bz_1(Y_j)\right)\right]\]
Outer Loop: Iterate Over x_out Values
for x_out_val in x_out_start..x_out_end {
    let mut inner_acc: [Acc8S<F>; UNIVARIATE_SKIP_DEGREE] = [Acc8S::<F>::new(); UNIVARIATE_SKIP_DEGREE];
What’s happening:

- Process each x_out value in this chunk
- inner_acc[j]: bigint accumulator for summing over all (x_in’, b) at this x_out
- Uses an 8-limb signed accumulator to avoid frequent modular reductions
Inner Loop: Iterate Over x_in’ Values
for x_in_prime in 0..num_x_in_half {
    let base_step_idx = (x_out_val << iter_num_x_in_prime_vars) | x_in_prime;
    let row_inputs = R1CSCycleInputs::from_trace::<F>(preprocess, trace, base_step_idx);
What’s happening:

- base_step_idx = (x_out_val << iter_num_x_in_prime_vars) | x_in_prime: constructs the timestep index \(t\)
- Bit layout: [x_out bits | x_in’ bits]
- This encodes the position in the execution trace
- row_inputs: the 36 witness fields for this timestep \(t\)

Key insight: we fetch the trace data once per (x_out, x_in’) and use it for both constraint groups!
Processing Group 0 (b=0)
let x_in_even = x_in_prime << 1;
let e_in_even = E_in[x_in_even];
let az0_bool = eval_az_first_group(&row_inputs);
let bz0_i128 = eval_bz_first_group(&row_inputs);
What’s happening:

- x_in_even = x_in_prime << 1: append bit 0 to get the full x_in index for Group 0
- e_in_even = eq(tau_in, (x_in', 0)): the eq polynomial evaluation for this group
- az0_bool[i]: Az values for Group 0 at all 10 base domain points \(y_i \in \{-4,...,5\}\)
- bz0_i128[i]: Bz values for Group 0 at all 10 base domain points

Invariant check: az0_bool[i] * bz0_i128[i] = 0 for all i (the constraints are satisfied on the base domain); a standalone version is sketched below.
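Written as a standalone check (illustrative only, not code from the prover hot path):

```rust
/// Group 0 invariant on the base window: Az and Bz never share support,
/// i.e. whenever az0[i] is set, bz0[i] must be zero.
fn group0_invariant_holds(az0: &[bool; 10], bz0: &[i128; 10]) -> bool {
    az0.iter().zip(bz0.iter()).all(|(&a, &b)| !a || b == 0)
}
```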
Lagrange Interpolation for Group 0
for j in 0..UNIVARIATE_SKIP_DEGREE {
    let coeffs = &coeffs_per_j[j];
    let mut sum_c_az0_i64: i64 = 0;
    let mut sum_c_bz0_s128 = S128::from(0i128);
    for i in 0..UNIVARIATE_SKIP_DOMAIN_SIZE {
        let c = coeffs[i] as i64;
        if az0_bool[i] {
            sum_c_az0_i64 += c;
        } else {
            let term = S128::from_i128(c as i128 * bz0_i128[i]);
            sum_c_bz0_s128 += term;
        }
    }
    let sum_az0_s64 = S64::from_i64(sum_c_az0_i64);
    let prod_s192 = sum_az0_s64.mul_trunc::<2, 3>(&sum_c_bz0_s128);
    inner_acc[j].fmadd(&e_in_even, &prod_s192);
}
What’s happening: for each extended point \(Y_j\):

1. Compute \(Az_0(Y_j)\) via Lagrange interpolation: \[Az_0(Y_j) = \sum_{i=0}^{9} c_{j,i} \cdot Az_0(y_i)\] Since \(Az_0(y_i)\) is boolean (0 or 1), this simplifies to \[Az_0(Y_j) = \sum_{i: Az_0(y_i)=1} c_{j,i}\] This is sum_c_az0_i64.
2. Compute \(Bz_0(Y_j)\) via Lagrange interpolation: \[Bz_0(Y_j) = \sum_{i=0}^{9} c_{j,i} \cdot Bz_0(y_i)\] This is sum_c_bz0_s128.
3. Key optimization: thanks to the invariant \(Az_0(y_i) \cdot Bz_0(y_i) = 0\):
   - If \(Az_0(y_i) = 1\), then \(Bz_0(y_i) = 0\), so we only add \(c_{j,i}\) to the Az sum
   - If \(Az_0(y_i) = 0\), then we only add \(c_{j,i} \cdot Bz_0(y_i)\) to the Bz sum
   - The two sums have disjoint support!
4. Compute the product \(Az_0(Y_j) \cdot Bz_0(Y_j)\) as prod_s192.
5. Accumulate: inner_acc[j] += e_in_even * prod_s192, which adds \(\text{eq}(\tau_{in}, (x_{in}', 0)) \cdot Az_0(Y_j) \cdot Bz_0(Y_j)\) to the accumulator.
Processing Group 1 (b=1)
let x_in_odd = x_in_even + 1;
let e_in_odd = E_in[x_in_odd];
let az1_bool = eval_az_second_group(&row_inputs);
let bz1 = eval_bz_second_group(&row_inputs);
What’s happening:

- x_in_odd = x_in_even + 1: append bit 1 to get the full x_in index for Group 1
- e_in_odd = eq(tau_in, (x_in', 1)): the eq polynomial evaluation for this group
- Group 1 has 9 real constraints + 1 padding constraint to reach 10
Padding logic:
let g2_len = core::cmp::min(NUM_REMAINING_R1CS_CONSTRAINTS, UNIVARIATE_SKIP_DOMAIN_SIZE);
let mut az1_bool_padded: [bool; UNIVARIATE_SKIP_DOMAIN_SIZE] = [false; UNIVARIATE_SKIP_DOMAIN_SIZE];
let mut bz1_s160_padded: [S160; UNIVARIATE_SKIP_DOMAIN_SIZE] = [S160::from(0i128); UNIVARIATE_SKIP_DOMAIN_SIZE];
az1_bool_padded[..g2_len].copy_from_slice(&az1_bool[..g2_len]);
bz1_s160_padded[..g2_len].copy_from_slice(&bz1[..g2_len]);
Pads Group 1 to 10 elements with zeros (satisfies constraints trivially).
Lagrange interpolation for Group 1: same logic as Group 0, but with S160 bigints for Bz.

- Result: prod_s256
- Accumulate: inner_acc[j] += e_in_odd * prod_s256
Fold with E_out and Reduce
let e_out = E_out[x_out_val];
for j in 0..UNIVARIATE_SKIP_DEGREE {
    let reduced = inner_acc[j].reduce();
    acc_field[j] += e_out * reduced;
}
What’s happening: - After summing over all (x_in’, b), we have computed: \[\text{inner\_acc}[j] = \sum_{x_{in}', b} \text{eq}(\tau_{in}, (x_{in}', b)) \cdot Az_{t,b}(Y_j) \cdot Bz_{t,b}(Y_j)\] where \(t = (x_{out}, x_{in}')\)
Multiply by \(\text{eq}(\tau_{out}, x_{out})\) and add to accumulator: \[\text{acc\_field}[j] \mathrel{+}= \text{eq}(\tau_{out}, x_{out}) \cdot \text{inner\_acc}[j]\]
reduce(): Convert bigint accumulator back to field element (modular reduction)
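A toy illustration of why delayed reduction is sound (using i128 and a small toy prime rather than the real field and the 8-limb accumulator): accumulating exact integer products and reducing once gives the same residue as reducing after every multiply-add.

```rust
fn main() {
    const P: i128 = 2_147_483_647; // toy prime (2^31 - 1), stand-in for the field modulus
    let a = [123_456_789i128, -987_654_321, 555_555_555];
    let b = [42i128, -7, 1_000_003];

    // Lazy: accumulate exact products, reduce once at the end.
    let lazy = a
        .iter()
        .zip(&b)
        .map(|(&x, &y)| x * y)
        .sum::<i128>()
        .rem_euclid(P);

    // Eager: reduce after every term.
    let eager = a
        .iter()
        .zip(&b)
        .fold(0i128, |acc, (&x, &y)| (acc + (x * y).rem_euclid(P)).rem_euclid(P));

    assert_eq!(lazy, eager);
}
```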
Parallel Reduction
.reduce(
    || [F::zero(); UNIVARIATE_SKIP_DEGREE],
    |mut a, b| {
        for j in 0..UNIVARIATE_SKIP_DEGREE {
            a[j] += b[j];
        }
        a
    },
)
What’s happening:

- Combine the results from all parallel chunks
- Sum acc_field[j] across all chunks to get the final \(g_1(Y_j)\) values

Final result: the function returns [F; UNIVARIATE_SKIP_DEGREE] containing \[g_1(Y_j) = \sum_{t \in \{0,1\}^{\log T}} \sum_{b \in \{0,1\}} \text{eq}(\tau_t, t) \cdot \text{eq}(\tau_b, b) \cdot Az_{t,b}(Y_j) \cdot Bz_{t,b}(Y_j)\] for each extended evaluation point \(Y_j\).
The Extended Evaluation Points
Constants:
UNIVARIATE_SKIP_DEGREE = (NUM_R1CS_CONSTRAINTS - 1) / 2 = (19 - 1) / 2 = 9
UNIVARIATE_SKIP_DOMAIN_SIZE = UNIVARIATE_SKIP_DEGREE + 1 = 10
UNIVARIATE_SKIP_EXTENDED_DOMAIN_SIZE = 2 * UNIVARIATE_SKIP_DEGREE + 1 = 19
So:

- Base window: 10 points, \(\{-4, -3, -2, -1, 0, 1, 2, 3, 4, 5\}\)
- Extended domain: 19 points total, \(\{-9, -8, ..., 8, 9\}\)
- Extended points (outside the base window): 9 points, \(\{-9, -8, -7, -6, -5\} \cup \{6, 7, 8, 9\}\)
The function uniskip_targets determines which 9 points to evaluate at.

Algorithm: alternately select points just outside the base window:

1. Start at n = -5 (one left of the base window) and p = 6 (one right of it)
2. Alternate: add n, then p, then decrement n and increment p
3. Continue until 9 points have been collected

Result:

targets = [-5, 6, -6, 7, -7, 8, -8, 9, -9]

The points spiral outward from the base window, alternating left and right!
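A minimal sketch of that spiral selection (presumably what uniskip_targets computes for these constants, written as a plain function rather than the const-generic version in the codebase):

```rust
fn spiral_targets(domain_size: i64, degree: usize) -> Vec<i64> {
    let base_left = -((domain_size - 1) / 2);     // -4 for a window of size 10
    let base_right = base_left + domain_size - 1; // 5
    let (mut n, mut p) = (base_left - 1, base_right + 1); // -5 and 6
    let mut out = Vec::with_capacity(degree);
    while out.len() < degree {
        out.push(n); // point just left of the window
        n -= 1;
        if out.len() < degree {
            out.push(p); // point just right of the window
            p += 1;
        }
    }
    out
}

fn main() {
    assert_eq!(spiral_targets(10, 9), vec![-5, 6, -6, 7, -7, 8, -8, 9, -9]);
}
```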