Faster Polynomial Evaluation
Tracked Data Types
Before describing optimisations to polynomial evaluation, we discuss an additional tool for debugging numerical algebra libraries. In Jolt's sum-check algorithms, the bottleneck in computation is almost invariably the number of field multiplications by large scalars. So it is helpful to be able to count the number of multiplications your algorithm does and check whether we're under the theoretical limit.
Towards this we introduce the TrackedFr data type, which is exactly the same as ark_bn254::Fr except for one critical difference. Implementation here.
Before performing a multiplication operation, we increment a global counter atomically (in a thread-safe way). Rust makes this relatively easy to do (it's remarkable how far libraries have come in preventing us from writing semaphores and mutexes from scratch); see the [official docs](https://doc.rust-lang.org/std/sync/atomic/struct.AtomicUsize.html).
```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A global counter that tracks how many times we compute a*b where a, b are ark_bn254::Fr values.
// See jolt-core::field::tracked_fr::TrackedFr for more details.
pub static MULT_COUNT: AtomicUsize = AtomicUsize::new(0);
```
We also have the ability to reset this counter at any time. So to see how many multiplications transpired, all we need to do is wrap the function being examined with reset_mult_count() and get_mult_count(). See an example of how to use this here.
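For instance, a minimal usage sketch might look like the following (the wrapper function and the jolt_core::utils::counters import path are illustrative assumptions; only reset_mult_count, get_mult_count, and MULT_COUNT come from the text and snippets above):

```rust
// Hypothetical usage sketch: the module path and the wrapper are illustrative.
use jolt_core::utils::counters::{get_mult_count, reset_mult_count};

/// Runs `f` and reports how many tracked field multiplications it performed.
fn count_mults<F: FnOnce()>(label: &str, f: F) {
    reset_mult_count(); // zero the global MULT_COUNT
    f();                // run the code under measurement
    println!("{label}: {} field multiplications", get_mult_count());
}

fn main() {
    count_mults("example", || {
        // ... call the function being examined here, using TrackedFr values ...
    });
}
```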
That's it! Right now it just tracks multiplications, but you may extend it to any operation ark_bn254::Fr supports by simply wrapping the operation. For every flavour of multiplication, we wrap the underlying operation and increment the counter as follows:
```rust
// crate is jolt-core
use std::ops::Mul;
use std::sync::atomic::Ordering;

use crate::utils::counters::MULT_COUNT;

// TrackedFr is a thin newtype wrapper around ark_bn254::Fr (see
// jolt-core::field::tracked_fr). Every impl below bumps the global
// counter and then delegates to the underlying Fr multiplication.

// Owned * owned
impl Mul<TrackedFr> for TrackedFr {
    type Output = TrackedFr;
    fn mul(self, rhs: TrackedFr) -> Self::Output {
        MULT_COUNT.fetch_add(1, Ordering::Relaxed);
        TrackedFr(self.0 * rhs.0)
    }
}

// Owned * borrowed
impl<'a> Mul<&'a TrackedFr> for TrackedFr {
    type Output = TrackedFr;
    fn mul(self, rhs: &'a TrackedFr) -> Self::Output {
        MULT_COUNT.fetch_add(1, Ordering::Relaxed);
        TrackedFr(self.0 * rhs.0)
    }
}

// Borrowed * owned
impl Mul<TrackedFr> for &TrackedFr {
    type Output = TrackedFr;
    fn mul(self, rhs: TrackedFr) -> Self::Output {
        MULT_COUNT.fetch_add(1, Ordering::Relaxed);
        TrackedFr(self.0 * rhs.0)
    }
}

// Borrowed * borrowed
#[allow(clippy::needless_lifetimes)]
impl<'a, 'b> Mul<&'b TrackedFr> for &'a TrackedFr {
    type Output = TrackedFr;
    fn mul(self, rhs: &'b TrackedFr) -> Self::Output {
        MULT_COUNT.fetch_add(1, Ordering::Relaxed);
        TrackedFr(self.0 * rhs.0)
    }
}
```
Problem Statement
Fix \(n \in \mathbb{N}\) and define \(N = 2^n\). In the Jolt codebase there is an algorithm that takes as input a vector \(f \in \mathbb{F}^N\) and a point \(\vec{z} \in \mathbb{F}^n\) and outputs \(\widetilde{f}(\vec{z})\), where \(\widetilde{f}\) is the multilinear extension (MLE) of \(f\). See Chapter 3 of (Thaler, 2020) for details on how to compute the MLE of a vector.
More specifically, the current algorithm does the following:
- Given \(\vec{z} \in \mathbb{F}^n\) it computes all \(N = 2^n\) evaluations of the Lagrange basis polynomials \(\Big\{\chi_{\vec{w}}(\vec{z})\Big\}_{\vec{w} \in \{0,1 \}^n}\), where \(\chi_{\vec{w}}(\vec{z}) = \prod_{k=1}^{n} \big((1 -z_k)(1-w_k) + z_k w_k\big) = eq(\vec{w}, \vec{z})\).
Using the memoization algorithm described in (Vu et al., 2013), this computation requires \(N\) multiplications. See Lemma 3.8 of (Thaler, 2020) for a simple exposition and analysis of this algorithm. An implementation can be found here.
- Then to compute the answer we compute a dot product, which requires another \(N\) multiplications. \[\widetilde{f}(\vec{z}) = \sum_{\vec{w} \in \{0,1\}^n} f(\vec{w})\,\chi_{\vec{w}}(\vec{z})\] A sketch of this two-step evaluation is given right below.
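To make the baseline concrete, here is a minimal sketch of this two-step evaluation over a generic ark_ff field (the function names and the ark_ff dependency are illustrative, not the exact Jolt code):

```rust
use ark_ff::{Field, One, Zero};

/// Step 1: evaluate all Lagrange basis polynomials chi_w(z) for w in {0,1}^n.
/// Memoized as in Lemma 3.8 of Thaler: each round splits every table entry e
/// into (e - e*z_k, e*z_k), i.e. one multiplication per new pair, for roughly
/// N = 2^n multiplications in total. Indices are big-endian (w_1 is the MSB).
fn eq_evals<F: Field>(z: &[F]) -> Vec<F> {
    let mut table = vec![F::one()];
    for &z_k in z {
        let mut next = Vec::with_capacity(table.len() * 2);
        for &e in &table {
            let hi = e * z_k;   // entry for w_k = 1: one multiplication
            next.push(e - hi);  // entry for w_k = 0: e * (1 - z_k), no extra mult
            next.push(hi);
        }
        table = next;
    }
    table
}

/// Step 2: dot product of f's evaluations with the eq table,
/// another N multiplications.
fn evaluate_mle<F: Field>(f: &[F], z: &[F]) -> F {
    let chi = eq_evals(z);
    f.iter()
        .zip(chi.iter())
        .fold(F::zero(), |acc, (a, b)| acc + *a * *b)
}
```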
In what follows, we describe how to speed up the implementation by a factor of 2.
For a single polynomial evaluation, even if \(f(\vec{w})\) is 0, the current codebase still performs the multiplication instead of just returning 0. This is fixed in batched evaluation, but we mention it here as it will be reflected in the benchmarks.
A Constant Factor Improvement
Re-writing the definition of the MLE of \(f\) we get \[ \begin{align} \widetilde{f}(\vec{z}) &= \sum_{\vec{w} \in \{0,1\}^n}f(\vec{w})\cdot \chi_{\vec{w}}(\vec{z})\\[10pt] &=\sum_{\vec{w} \in \{0,1\}^n} f(\vec{w})\cdot eq(\vec{z},\vec{w}) \\[10pt] &= \sum_{w_1 \in \{0,1\} } eq(z_1,w_1) \sum_{w_2 \in \{0,1\} } eq(z_2,w_2) \cdots \sum_{w_n \in \{0, 1\}} eq(z_n,w_n) \cdot f(w_1,\ldots,w_n) \end{align} \]
Notice that for any \((w_1, \ldots, w_{n-1}) \in \{ 0, 1\}^{n-1}\) \[ \begin{align} \sum_{w_n \in \{0,1\}} \operatorname{eq}(z_n, w_n) \cdot f(w_1, \ldots, w_n) &= f(w_1, \ldots, w_{n-1}, 0) \notag \\ &\quad + z_n \cdot \left( f(w_1, \ldots, w_{n-1}, 1) - f(w_1, \ldots, w_{n-1}, 0) \right) \end{align} \]
which costs one multiplication for every instantiation of \((w_1, \ldots, w_{n-1})\), for a total of \(2^{n-1}\) multiplications.
Now if we compute things inside out, i.e. bind the variables in the order \(j = n, \ldots, 1\):
- The first iteration of the loop is \(2^{n-1}\) multiplications
- The second iteration will be \(2^{n-2}\) multiplications
- …
- The \(j\)'th iteration will be \(2^{n-j}\) multiplications
- …
- The last iteration is just 1 multiplication.
Thus, there is a total of roughly \(N = 2^n\) multiplications, instead of the \(2N\) multiplications of the two-step process described above. A sketch of this inside-out evaluation follows.
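Here is a minimal sketch of the inside-out evaluation over a generic ark_ff field (the function name is illustrative and this is not the exact Jolt implementation); each pass folds the last remaining variable and halves the table:

```rust
use ark_ff::Field;

/// Inside-out evaluation of the MLE of `f` at `z`.
/// Indexing is big-endian: w_1 is the MSB of the index, w_n the LSB.
/// Folding z_n first touches adjacent pairs (f[2i], f[2i+1]); each fold
/// halves the table and spends one multiplication per surviving entry,
/// so the total is 2^{n-1} + 2^{n-2} + ... + 1 = N - 1 multiplications.
fn evaluate_mle_inside_out<F: Field>(f: &[F], z: &[F]) -> F {
    let mut evals = f.to_vec();
    // Fold variables from the last (j = n) down to the first (j = 1).
    for &z_j in z.iter().rev() {
        let half = evals.len() / 2;
        for i in 0..half {
            let lo = evals[2 * i];     // value at w_j = 0
            let hi = evals[2 * i + 1]; // value at w_j = 1
            // f(..., z_j) = lo + z_j * (hi - lo): one multiplication per pair
            evals[i] = lo + z_j * (hi - lo);
        }
        evals.truncate(half);
    }
    evals[0]
}
```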
Fully Worked Out Example For Inside Out Algorithm
Notation: bits are written in big-endian order, where the left-most bit is the MSB.
Given all evaluations of the multilinear polynomial \(f\): \[ \begin{aligned} f(0, 0, 0) &= 0 \\ f(0, 0, 1) &= 0 \\ f(0, 1, 0) &= 1 \\ f(0, 1, 1) &= 0 \\ f(1, 0, 0) &= 0 \\ f(1, 0, 1) &= 0 \\ f(1, 1, 0) &= 0 \\ f(1, 1, 1) &= 1 \end{aligned} \]
we are asked to compute \(f(4, 3, 2)\).
Step 1
\[ \begin{align*} f(0, 0, 2) &= f(0, 0, 0) + 2(f(0, 0, 1) - f(0, 0, 0)) &= 0 + 2(0 - 0) &= 0 \\ f(0, 1, 2) &= f(0, 1, 0) + 2(f(0, 1, 1) - f(0, 1, 0)) &= 1 + 2(0 - 1) &= -1 \\ f(1, 0, 2) &= f(1, 0, 0) + 2(f(1, 0, 1) - f(1, 0, 0)) &= 0 + 2(0 - 0) &= 0 \\ f(1, 1, 2) &= f(1, 1, 0) + 2(f(1, 1, 1) - f(1, 1, 0)) &= 0 + 2(1 - 0) &= 2 \end{align*} \]
Step 2
\[ \begin{align*} f(0, 3, 2) &= f(0, 0, 2) + 3(f(0, 1, 2) - f(0, 0, 2)) &= 0 + 3(-1 - 0) &= -3 \\ f(1, 3, 2) &= f(1, 0, 2) + 3(f(1, 1, 2) - f(1, 0, 2)) &= 0 + 3(2 - 0) &= 6 \end{align*} \]
Step 3
\[ \begin{align*} f(4, 3, 2) &= f(0, 3, 2) + 4(f(1, 3, 2) - f(0, 3, 2)) \\ &= -3 + 4(6 - (-3)) = -3 + 4 \cdot 9 = -3 + 36 = \boxed{33} \end{align*} \]
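As a sanity check, the worked example can be reproduced with the inside-out sketch above (ark_bn254::Fr is used only for concreteness; evaluate_mle_inside_out refers to the illustrative function defined earlier):

```rust
use ark_bn254::Fr;

fn main() {
    // Evaluations of f over {0,1}^3 in big-endian order:
    // f(0,0,0), f(0,0,1), f(0,1,0), f(0,1,1), f(1,0,0), f(1,0,1), f(1,1,0), f(1,1,1)
    let f: Vec<Fr> = [0u64, 0, 1, 0, 0, 0, 0, 1].iter().map(|&x| Fr::from(x)).collect();
    let z: Vec<Fr> = [4u64, 3, 2].iter().map(|&x| Fr::from(x)).collect();
    // Matches the hand computation above: f(4, 3, 2) = 33.
    assert_eq!(evaluate_mle_inside_out(&f, &z), Fr::from(33u64));
}
```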
Leveraging Sparsity
Now what if I told you that a \(1 - \gamma\) fraction of the evaluations of \(\widetilde{f}\) over \(\{0, 1\}^n\) are zero, i.e. only a \(\gamma\) fraction are non-zero. Then the dot-product implementation requires:
- \(N + \gamma N\) multiplications (\(N\) to build the \(eq\) table plus \(\gamma N\) for the non-zero terms of the dot product). A sketch of such a zero-skipping dot product follows.
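For concreteness, here is a minimal sketch of a zero-skipping dot product over a generic ark_ff field (the function name is illustrative; the eq table `chi` is assumed to have been built as before):

```rust
use ark_ff::{Field, Zero};

/// Dot product of f's evaluations with a precomputed eq table that skips
/// the multiplication whenever the f entry is zero, so it spends only
/// gamma * N multiplications when a gamma fraction of f is non-zero.
fn sparse_dot<F: Field>(f: &[F], chi: &[F]) -> F {
    f.iter().zip(chi.iter()).fold(F::zero(), |acc, (a, b)| {
        if a.is_zero() {
            acc // zero entry: the term contributes nothing, skip the multiplication
        } else {
            acc + *a * *b
        }
    })
}
```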
The current Jolt implementation does not leverage sparsity at all. That is, even when multiplying a*b it does not check whether a or b is zero in order to simply return 0; it goes ahead and performs the Montgomery multiplication. This is not the case with the batched evaluation we will discuss later.
An alternative way to implement this dot-product algorithm is to write \(\vec{z} = \vec{z_1} \,\|\, \vec{z_2}\), splitting it into two halves of size \(n/2\), and compute
\[\sum_{\vec{x_1} \in \{ 0, 1\}^{n/2}}\mathsf{eq}(\vec{z_1}, \vec{x_1})\sum_{\vec{x_2} \in \{ 0, 1\}^{n/2}}\mathsf{eq}(\vec{z_2}, \vec{x_2})\,\widetilde{f}(\vec{x_1}\|\vec{x_2})\]
This takes
- \(2\sqrt{N} + \sqrt{N} + \gamma N = 3\sqrt{N} + \gamma N\) multiplications (\(2\sqrt{N}\) to build the two \(eq\) tables, \(\sqrt{N}\) for the outer multiplications, and \(\gamma N\) for the non-zero terms of the inner sums). A sketch of this split-eq evaluation is given right below.
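A minimal sketch of this split-eq evaluation (the function name is illustrative and it reuses the eq_evals helper sketched earlier; this is not the exact Jolt code):

```rust
use ark_ff::{Field, Zero};

/// Split-eq evaluation: z = z1 || z2 with |z1| = |z2| = n/2.
/// Builds the two eq tables of size sqrt(N) each, then only multiplies by
/// eq2 where f is non-zero, for roughly 3*sqrt(N) + gamma*N multiplications.
/// `eq_evals` is the memoized table builder sketched earlier.
fn evaluate_mle_split_eq<F: Field>(f: &[F], z: &[F]) -> F {
    let (z1, z2) = z.split_at(z.len() / 2);
    let eq1 = eq_evals(z1); // about sqrt(N) multiplications
    let eq2 = eq_evals(z2); // about sqrt(N) multiplications
    eq1.iter()
        .zip(f.chunks(eq2.len()))
        .fold(F::zero(), |acc, (e1, block)| {
            // Inner sum over x2: skip the multiplication whenever f is zero.
            let inner = block.iter().zip(eq2.iter()).fold(F::zero(), |s, (a, e2)| {
                if a.is_zero() { s } else { s + *a * *e2 }
            });
            acc + *e1 * inner // one outer multiplication per x1: sqrt(N) total
        })
}
```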
The inside-out evaluation also leverages sparsity, but in a way that depends on the distribution of zeros. Imagine \(N=8\) and the following two arrays:
[12, 32, 55, 121, 0, 0, 0, 0]
[0, 32, 55, 0, 0, 21, 11, 0]
In the first case, despite half the evaluations being set to 0, \(z\,(f[i] - f[m2+i])\) is never zero (here \(m2\) denotes the half-length \(N/2\)). In the second case, the zeros align well and we can get rid of two out of the four multiplications. A sketch of this fold step with the zero check is given below.
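For illustration, here is a sketch of a single fold step written with the f[i] / f[m2 + i] pairing used above (i.e. binding the first variable; the total multiplication count is the same as before), skipping the multiplication whenever the difference is zero:

```rust
use ark_ff::{Field, Zero};

/// One fold step with the f[i] / f[m2 + i] pairing (binding the first
/// variable); the multiplication is skipped whenever f[m2 + i] - f[i] is zero.
fn fold_first_var_sparse<F: Field>(evals: &[F], z_1: F) -> Vec<F> {
    let m2 = evals.len() / 2;
    (0..m2)
        .map(|i| {
            let diff = evals[m2 + i] - evals[i];
            if diff.is_zero() {
                evals[i] // zero difference: no multiplication needed
            } else {
                evals[i] + z_1 * diff
            }
        })
        .collect()
}
```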
Below are two plots that show the impact of sparsity on a single polynomial evaluation. The plot on top shows exactly how many multiplications occur for each algorithm (note that dot_product does not check for zeros, so it is always \(2N\)). Also shown as a dashed line is the theoretical upper bound for the algorithm that splits eq, and this confirms that our implementation is efficient.
The plot below shows the actual running times. Notice how the two trends are almost identical, which confirms that the bottleneck in polynomial evaluation really is field multiplication.
How to get these plots yourself:
cargo run -p jolt-core --example polynomial_evaluation --release
This should generate a file called results.csv from which the above numbers can easily be computed.
Batched Polynomial Evaluation
Now we get to batched evaluation, that is, we want to evaluate multilinear polynomials \(f_1, \ldots, f_s\) at the same point \(\vec{z}\).
Now the cost of computing \(eq\) is not as bad, as it can be re-used for all \(\{f_i\}_{i \in [s]}\). Still, we show that splitting \(eq\) into \(eq_1\) and \(eq_2\) can provide time savings.
Once again we will measure both the exact number of multiplications, and the actual time it took to multiply things.
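A minimal sketch of the batched version, which builds the two eq tables once and reuses them across the whole batch (again illustrative, reusing the eq_evals helper from earlier):

```rust
use ark_ff::{Field, Zero};

/// Batched evaluation of f_1, ..., f_s at the same point z.
/// The two eq tables are built once (about 2*sqrt(N) multiplications total)
/// and re-used for every polynomial; each polynomial then pays only for its
/// non-zero entries plus sqrt(N) outer multiplications.
fn evaluate_batch_split_eq<F: Field>(polys: &[Vec<F>], z: &[F]) -> Vec<F> {
    let (z1, z2) = z.split_at(z.len() / 2);
    let eq1 = eq_evals(z1); // built once for the whole batch
    let eq2 = eq_evals(z2); // built once for the whole batch
    polys
        .iter()
        .map(|f| {
            eq1.iter()
                .zip(f.chunks(eq2.len()))
                .fold(F::zero(), |acc, (e1, block)| {
                    let inner = block.iter().zip(eq2.iter()).fold(F::zero(), |s, (a, e2)| {
                        if a.is_zero() { s } else { s + *a * *e2 }
                    });
                    acc + *e1 * inner
                })
        })
        .collect()
}
```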
To get these plots, run
cargo run -p jolt-core --example polynomial_evaluation --release
and you shall receive a file called batch_results.csv.