The Jolt Parallelism Problem

Problem Statement

Earlier last week, we described how changing the set from which we sample sum-check challenges could lead to 1.3x faster sum-checks. We benchmarked this optimisation, and the results were quite conclusive – a particular kind of multiplication was nearly 2x faster, which could make binding 1.3-1.6x faster (accommodating for variance). Furthermore, the empirical timings from criterion were well aligned with our theoretical expectations.
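For context, the micro-benchmarks had roughly the following shape: a criterion harness timing the two multiplication routines in isolation. The functions below are hypothetical stand-ins (plain u64 arithmetic) for the real field multiplications – multiplication by a full challenge versus by a small challenge – shown only to illustrate the structure of the comparison, not the actual code.

```rust
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical stand-ins for the two routines under test. In the real
// benchmark these would be the field multiplication by a full-width
// challenge versus by a small challenge.
fn mul_full_challenge(a: u64, b: u64) -> u64 {
    a.wrapping_mul(b)
}

fn mul_small_challenge(a: u64, b: u64) -> u64 {
    a.wrapping_mul(b & 0xffff)
}

fn bench_challenge_mul(c: &mut Criterion) {
    c.bench_function("mul by full challenge", |bench| {
        bench.iter(|| mul_full_challenge(black_box(3), black_box(5)))
    });
    c.bench_function("mul by small challenge", |bench| {
        bench.iter(|| mul_small_challenge(black_box(3), black_box(5)))
    });
}

criterion_group!(benches, bench_challenge_mul);
criterion_main!(benches);
```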

So we went ahead and implemented the changes in Jolt. This led to us changing nearly 145 files and re-designing certain parts of Jolt, all so we could manifest this speedup in our end to end traces. We run

cargo run --release -p jolt-core profile --name sha3 --format chrome

and focused on every polynomial bind operation in Jolt1, hoping to see at least a 5-10% end to end speedup.

A paltry 1% change, in the negative direction. Essentially, all that extra work made no difference whatsoever. Furthermore, over half of the binding operations in Jolt are slower (or no different)?! All our micro-benches indicate there should be a speedup, but the end to end results indicate otherwise. The remainder of this post is a deep dive into what could possibly be the reason for such a discrepancy.

Jolt’s Parallelisation Problem

Conjecture: Polynomial binding is an embarrassingly parallel algorithm. In polynomial binding, we iterate through an array, and each iteration is independent of every other iteration. This is exactly the use case the author of rayon had in mind when designing the library, so it makes a lot of sense that Jolt makes heavy use of rayon for all parallelisation.
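As a reference point, here is a minimal sequential sketch of one binding round, assuming only a generic `Copy` field-like type `F` (the function name and bounds are illustrative, not Jolt's actual API). Each index is updated independently of every other index, which is what makes the loop embarrassingly parallel on paper.

```rust
fn bind_round<F>(coeffs: &mut Vec<F>, r: F)
where
    F: Copy
        + core::ops::Add<Output = F>
        + core::ops::Sub<Output = F>
        + core::ops::Mul<Output = F>,
{
    let n = coeffs.len() / 2;
    for i in 0..n {
        // Fold the top variable: P'(i) = P(i) + r * (P(i + n) - P(i)).
        // No iteration reads or writes any other iteration's index.
        coeffs[i] = coeffs[i] + r * (coeffs[i + n] - coeffs[i]);
    }
    coeffs.truncate(n);
}
```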

Despite rayon’s brilliance, parallel code is incredibly hard to get right. So as a sanity check we make the following change: we turn multilinear::bind_parallel into a sequential algorithm, so that all of the binding work gets done in a single thread. Our hope is that end to end we should see a slowdown (a lot of binds are now sequential), but that removing the system overhead of parallelism will let our low level optimisations manifest. Remember, the optimisations here are extremely low level – the difference between the baseline and our change is roughly 16 mul instructions. Shown below are the lines of code we change for this experiment in jolt-core/src/poly/multilinear_polynomial.rs.


```rust
#[tracing::instrument(skip_all, name = "MultilinearPolynomial::bind_parallel")]
fn bind_parallel(&mut self, r: F::Challenge, order: BindingOrder) {
    // Sequential dispatch: every variant now calls the single-threaded `bind`.
    match self {
        MultilinearPolynomial::LargeScalars(poly) => poly.bind(r, order),
        MultilinearPolynomial::U8Scalars(poly) => poly.bind(r, order),
        MultilinearPolynomial::U16Scalars(poly) => poly.bind(r, order),
        MultilinearPolynomial::U32Scalars(poly) => poly.bind(r, order),
        MultilinearPolynomial::U64Scalars(poly) => poly.bind(r, order),
        MultilinearPolynomial::I64Scalars(poly) => poly.bind(r, order),
        MultilinearPolynomial::I128Scalars(poly) => poly.bind(r, order),
        MultilinearPolynomial::U128Scalars(poly) => poly.bind(r, order),
        _ => unimplemented!("Unexpected MultilinearPolynomial variant"),
    }
    // Original parallel dispatch, left commented out for reference.
    //match self {
    //    MultilinearPolynomial::LargeScalars(poly) => poly.bind_parallel(r, order),
    //    MultilinearPolynomial::U8Scalars(poly) => poly.bind_parallel(r, order),
    //    MultilinearPolynomial::U16Scalars(poly) => poly.bind_parallel(r, order),
    //    MultilinearPolynomial::U32Scalars(poly) => poly.bind_parallel(r, order),
    //    MultilinearPolynomial::U64Scalars(poly) => poly.bind_parallel(r, order),
    //    MultilinearPolynomial::I64Scalars(poly) => poly.bind_parallel(r, order),
    //    MultilinearPolynomial::I128Scalars(poly) => poly.bind_parallel(r, order),
    //    MultilinearPolynomial::U128Scalars(poly) => poly.bind_parallel(r, order),
    //    _ => unimplemented!("Unexpected MultilinearPolynomial variant"),
    //}
}
```

How To Read The Screenshots Below

In what follows, unless otherwise stated, Trace 1 always refers to running Jolt without small challenges, and Trace 2 refers to our small challenge optimisation. We run both settings with parallel and sequential binding.

Jolt does not see the benefits of our optimisation.

Current version of Jolt, where binding is implemented using rayon.

Changing binding to be single threaded leads to a 20% empirical improvement in sum-check binding in the end to end traces (in the direction our original analysis predicted). This hints that something fishy is going on with how we use rayon. Based on the analysis, this number should be higher than 20% – our micro-benches indicated it should be closer to 40-50%. This 20% loss in performance can be attributed to caching.

In the figure above, multilinear polynomial binding is a single threaded algorithm.
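One way to sanity-check the caching attribution is to compare hardware counters between the parallel and sequential runs, for example with Linux perf wrapped around the same profiling command (event names vary by platform, so treat this as a sketch rather than the exact command we use):

perf stat -e cycles,instructions,cache-references,cache-misses cargo run --release -p jolt-core profile --name sha3 --format chrome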

The Actual Problem

Diving deep into what’s actually happening, we show that the issue is not parallelisation per se, but rather poor utilisation of cache lines. Our starting point is the following piece of code for binding multilinear polynomials. From the outset it looks perfectly fine, and aligns exactly with what rayon experts would recommend we do.

```rust
pub fn bound_poly_var_top_zero_optimized(&mut self, r: &F::Challenge) {
    let n = self.len() / 2;

    let (left, right) = self.Z.split_at_mut(n);

    left.par_iter_mut()
        .zip(right.par_iter())
        .filter(|(&mut a, &b)| a != b)
        .for_each(|(a, b)| {
            *a += *r * (*b - *a);
        });

    self.num_vars -= 1;
    self.len = n;
}
```

We make a one line change:

```rust
pub fn bound_poly_var_top_zero_optimized(&mut self, r: &F::Challenge) {
    let n = self.len() / 2;

    let (left, right) = self.Z.split_at_mut(n);

    left.par_iter_mut()
        .zip(right.par_iter())
        .with_min_len(4096) // <-- Only change to the code base: rayon will not split the work into jobs smaller than 4096 pairs
        .filter(|(&mut a, &b)| a != b)
        .for_each(|(a, b)| {
            *a += *r * (*b - *a);
        });

    self.num_vars -= 1;
    self.len = n;
}
```
  1. There is really nothing we can do – parallel code is too hard (possibly, but I am not willing to give in just yet).

So we went ahead and sequentialised the expensive path of the code. The results are shown below. Unsurprisingly, the overall bind time is worse. However, it’s only 2x worse.

What’s next

After all this work, the one thing that is clear is that nothing is clear. I do not have a definitive answer as to what about our use of Rayon nullifies our low level multiplication optimisations.

  1. A low hanging fruit is to have a sequential version of every bind, and then check which ones improve the overall end to end speed. This might work in the short term, but it is not sustainable: as Jolt evolves and improves, the number of permutations of which binds to keep sequential and which to keep parallel will explode exponentially.

  2. Over the last week, we went ahead and studied the Rayon source code. We will first outline our working understanding of how Rayon works (this will shortly be added), with experiments to back it up. A critical feature of rayon is that it is optimistically parallel: given how “work-stealing” is implemented, if the threads are busy, rayon behaves like a sequential algorithm with very minimal overhead (see the sketch after this list). However, our observations described above do not align with this claim. Two things are possible – either the overhead is really not that minimal (especially when our optimisations are at the multiplication level and we are counting assembly instructions), or the way we are using rayon is sub-optimal. There is a third possibility, that rayon is not the right tool for Jolt, but I find this highly unlikely.

  3. With our working understanding, we have to develop the tools to measure sub-optimal parallelisation. Right now, we can only measure elapsed time. We need to look at CPU counters and strace reports to see how often the threads are spinning, or whether there are stragglers. Perhaps we can implement some dynamic instrumentation to focus our CPU counter collection on specific functions. This would give us a much clearer picture of how multi-threading could be a bottleneck.
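To make the “optimistically parallel” claim in (2) concrete, here is a toy divide-and-conquer sum built directly on rayon::join, the primitive underneath par_iter. Each call offers half of the work for stealing; if no idle thread takes it, the calling thread simply runs both halves itself, so in theory the whole thing degrades to a sequential loop with little overhead – exactly the property our traces seem to contradict. This is only an illustration of the mechanism, not Jolt code.

```rust
use rayon::join;

// Toy divide-and-conquer sum. Each call splits the slice in two and offers
// the second half to rayon's work-stealing scheduler; if no idle worker
// steals it, the calling thread executes both closures back to back.
fn par_sum(xs: &[u64]) -> u64 {
    if xs.len() <= 4096 {
        // Base case: small slices are summed sequentially.
        return xs.iter().sum();
    }
    let (lo, hi) = xs.split_at(xs.len() / 2);
    let (a, b) = join(|| par_sum(lo), || par_sum(hi));
    a + b
}

fn main() {
    let xs: Vec<u64> = (0..1_000_000).collect();
    println!("{}", par_sum(&xs));
}
```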


  1. The figures are generated using jolt-perfetto, a light wrapper on top of trace_processor shell.