Notes From The Montgomery Underground
Background
In an earlier post, we analysed why this implementation of the improved Montgomery multiplication algorithm by Yuval Domb was not empirically faster out of the box. After disassembling the code using the Intel Pin tool, we identified large memory reads and writes as the primary source of slower run times. Although the analysis was specific to x86 architectures, the empirical findings were consistent across MacBook M2 laptops and Intel x86 Digital Ocean servers. This loss of performance, however, is an artefact of the implementation, not the algorithm itself1. The new algorithm uses fewer multiplication operations than the CIOS algorithm, and therefore a faster implementation should exist (at least in theory).
The large chunk reads and writes were likely due to the high-level Rust code storing intermediate \(2n\)-limb values in contiguous regions of memory, which likely led the compiler to demand the same of the generated assembly code as well. The key thing to note here is that the algorithm specification does not require contiguous memory storage; it’s just an implementation detail. As a resolution, we found an alternate implementation in which all intermediate values are stored as individual Rust variables, as opposed to constant-sized vecs/arrays. The code explicitly does school-book multiplication, storing all intermediate values in registers, before adding them in a long sequence using school-book addition. The code came with cargo bench, and when compared to the current Arkworks CIOS implementation, it seemed to be 5-10% faster (at least using their bench setup).
The lack of arrays/contiguous memory is not the only thing that is different about this alternate implementation when compared to the earlier implementation for Worldcoin we benchmarked. We will get to that in a bit.
For the remainder of the post, we refer to the original Arkworks CIOS algorithm for Montgomery multiplication as c-mul, the previous implementation of the new multiplication as s-mul, and the faster implementation as h-mul. c stands for CIOS, s stands for single step, and h stands for hybrid. The naming convention will soon become clear.
Putting this faster code into Jolt
Ignoring speed for a minute, we plugged the s-mul and h-mul multiplication algorithms in their native form2 into Jolt. Then we ran cargo test -p jolt-core -- sha3_e2e_hyperkzg, and the unit tests fail for both s-mul and h-mul.
thread '<unnamed>' panicked at /Users/aribiswas3/Work-With-A16z/arkworks-algebra/ff/src/fields/models/fp/montgomery_backend.rs:1360:9:
High limb expected to be 0 if S >= 1 before 2p comparison
failures:
jolt::vm::rv32i_vm::tests::sha3_e2e_hyperkzg
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 219 filtered out; finished in 4.79s
Digging deeper, the part of the Jolt codebase that fails is code for subtraction. The Conditional Subtraction For Barrett Reduction throws an error claiming that the number we passed in was too large. If you look at line 1360, it specifically needs the big integer listed as r_tmp not to have its high bit set to 1. This error goes away when using Arkworks-CIOS as the multiplication algorithm.
Why does this happen?
Scanning the lines of code for h-mul, we observe that the final answer is not reduced \(\mod p\). When you multiply two numbers that are both less than \(p\), the result may still cross over the modulus \(p\). In Arkworks-CIOS, the last line of multiplication involves subtracting \(p\) from the final answer if the answer is greater than \(p\). The line __subtract_modulus(a) simply outputs \(a-p\) if the final output \(a \ge p\), and otherwise leaves \(a\) unchanged.
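To make this concrete, the tail of c-mul behaves like the following minimal sketch, assuming a 4-limb little-endian representation (our own rendering of the idea, not the actual Arkworks code):

```rust
// A sketch of a __subtract_modulus-style conditional subtraction.
fn subtract_modulus(a: &mut [u64; 4], p: &[u64; 4]) {
    // compare a with p, most significant limb first
    let a_ge_p = (0..4)
        .rev()
        .find_map(|i| if a[i] != p[i] { Some(a[i] > p[i]) } else { None })
        .unwrap_or(true); // all limbs equal means a == p

    if a_ge_p {
        // a = a - p, with borrow propagation across limbs
        let mut borrow = 0u64;
        for i in 0..4 {
            let (t, b1) = a[i].overflowing_sub(p[i]);
            let (t, b2) = t.overflowing_sub(borrow);
            a[i] = t;
            borrow = (b1 as u64) + (b2 as u64);
        }
    }
}
```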
In the multiplication code for s-mul, notice they have a reduce_ct function. In this function, we find that they subtract \(2p\) from \(a\) only if the most significant bit of \(a\) is 1. Thus, we have three algorithms, and although the final answer for each algorithm is the same (modulo \(p\)), the exact numerical value differs between the three algorithms. They are, however, not always different. Sometimes they are the same, sometimes two out of three of them are the same, and at times all three are different.
To summarise:
- c-mul: passes unit tests, and subtracts \(p\) from \(a\) when \(a \ge p\).
- h-mul: fails unit tests, and does not do any reduction.
- s-mul: fails unit tests, and subtracts \(2p\) from \(a\) when the msb of \(a\) is 1.
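For contrast, the s-mul tail looks roughly like this sketch (again our own rendering; here two_p is the 4-limb constant \(2p\), which still fits in 4 limbs because the bn-254 scalar modulus has its top bits clear):

```rust
// A sketch of a reduce_ct-style tail: subtract 2p only when the msb is set.
fn reduce_ct(a: &mut [u64; 4], two_p: &[u64; 4]) {
    if a[3] >> 63 == 1 {
        // a = a - 2p, with borrow propagation across limbs
        let mut borrow = 0u64;
        for i in 0..4 {
            let (t, b1) = a[i].overflowing_sub(two_p[i]);
            let (t, b2) = t.overflowing_sub(borrow);
            a[i] = t;
            borrow = (b1 as u64) + (b2 as u64);
        }
    }
}
```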
First attempt: We stick a __subtract_modulus(a) at the end of h-mul, matching the behaviour in c-mul. The tests still panic with the same error message. A few hours of debugging follow, and we observe that even after subtracting \(p\) from the answer, the final answer is still sometimes greater than \(p\); i.e., h-mul sometimes outputs something greater than \(2p\).
Ok, what happens if we put in two instances of __subtract_modulus(a)? Jolt passes!
Did we get lucky? One way to check is to try a lot of random inputs and try to break the test. Over \(100000\) randomly seeded inputs, with \(10\) chained multiplications in each trial, the numerical values of c-mul and h-mul are EXACTLY the same. However, adding two conditional subtractions to s-mul – Jolt tests fail again. But adding 3 conditional subtractions to s-mul makes Jolt pass with s-mul, and now all 3 algorithms are the same.
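Schematically, the check looks like the sketch below, where c_mul, h_mul, and random_fe are hypothetical placeholders for the two implementations under test and a sampler of reduced 4-limb values:

```rust
// Schematic fuzz harness: chained multiplications on random seeded inputs,
// comparing exact limb values (not just equality mod p).
use rand_chacha::rand_core::SeedableRng;
use rand_chacha::ChaCha8Rng;

fn c_mul(_a: &[u64; 4], _b: &[u64; 4]) -> [u64; 4] { unimplemented!() }
fn h_mul(_a: &[u64; 4], _b: &[u64; 4]) -> [u64; 4] { unimplemented!() }
fn random_fe(_rng: &mut ChaCha8Rng) -> [u64; 4] { unimplemented!() }

fn fuzz(trials: u64, chain: usize) {
    for seed in 0..trials {
        let mut rng = ChaCha8Rng::seed_from_u64(seed);
        // accumulators for the two implementations, starting equal
        let mut acc_c = random_fe(&mut rng);
        let mut acc_h = acc_c;
        for _ in 0..chain {
            let m = random_fe(&mut rng);
            acc_c = c_mul(&acc_c, &m);
            acc_h = h_mul(&acc_h, &m);
        }
        // compare exact limbs, not just values mod p
        assert_eq!(acc_c, acc_h, "limb mismatch at seed {seed}");
    }
}
```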
What is going on? Why do we need one subtraction for one algorithm, and two and three for the others? How do we know we don’t need 3 or 4 or 5? \(100000\) trials are minuscule when compared to the input space of 256-bit numbers. What if there are edge cases that we can’t catch via fuzzing?
In what follows, we derive a formal guarantee on how large the answer of these algorithms can be, therefore establishing the maximum number of conditional subtracts we must employ. In the process, we establish that there is not just 1 new algorithm for Montgomery multiplication, but a family of algorithms that are all related to each other, each demonstrating speed for different input distributions.
Montgomery Multiplication Maths
We begin with the basics of Montgomery multiplication, and then describe the CIOS algorithm. This gives all the background needed to figure out what is going on.
Below, we actually describe the Separated Operand Scanning Algorithm (SOS), but deceptively refer to it as the Coarse Integrated Operand Scanning (CIOS) algorithm. This is done for ease of exposition only; the claims would hold regardless. See Acar’s thesis Chapter 2 for more details. We also do not discuss the optimisations made by the Gnark team as they are not relevant in this context.
Given an \(n\)-limb3 modulus \(p\), and integers \(a\) and \(b\) that are at most \(p-1\), Montgomery multiplication computes
\[ c \equiv abR^{-1} \mod p\]
where \(R=r^n\) and \(r = 2^w\), where \(w\) is the word size (typically 64 bits). To avoid explicitly computing modulo \(p\) by long division, we instead compute \((c + mp)/R \equiv cR^{-1}\mod p\), where \(m\) is an integer such that \(c + mp \equiv 0 \mod R\). A simple calculation solving for \(m\) gives us \(m = -p^{-1}c \mod R\). For our purposes, \(p\) is prime, so \(p\) and \(r\) are co-prime. Therefore, \(-p^{-1}\) is well defined, and thus so is \(m\).
Key Idea: As \(R\) is a power of 2, the last division can be done by bit shift operations native to most instruction architectures.
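To make this concrete, here is a toy instance with made-up parameters (\(w=4\), so \(r=16\), \(n=2\), \(R=256\), and \(p=13\)); real code uses 64-bit limbs, but the arithmetic is identical:

```rust
// Toy Montgomery reduction, small enough to check by hand.
fn main() {
    let p: u64 = 13;
    let big_r: u64 = 256; // R = r^n = 16^2
    // mu = -p^{-1} mod R: since 13 * 197 = 2561 = 10*256 + 1, p^{-1} = 197, mu = 59
    let mu: u64 = 59;
    let c: u64 = 100; // the value to reduce
    let m = (c * mu) % big_r; // m = -p^{-1} c mod R; here m = 12
    let reduced = (c + m * p) / big_r; // exact division: c + m*p = 256, so reduced = 1
    // the division is really a bit shift, since R is a power of two
    assert_eq!(reduced, (c + m * p) >> 8);
    // sanity check: reduced ≡ c * R^{-1} mod p
    assert_eq!((reduced * big_r) % p, c % p);
    println!("c = {c}, m = {m}, cR^-1 mod p = {reduced}");
}
```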
The CIOS algorithm
What we describe above is the pseudocode description of the Montgomery reduction algorithm. In practice, we do not actually compute \(m\) as a whole, then multiply it by \(p\), and then add \(mp\) to \(c\). We perform the computation limb by limb. \(c\) can be written as the following sum:
\[ \begin{align} c &\equiv (c_{2n-1} r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r + c_0) \mod p \\[10pt] cr^{-1} &\equiv ( c_{2n-1}r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r)r^{-1} + c_0 r^{-1} \mod p \\[10pt] \end{align} \]
The limb by limb process follows by repeating the following sub-routine \(n\) times.
Important Sub-Routine
Observe that \(c_0r^{-1} \equiv (c_0 + mp)r^{-1} \mod p\), where \(m\) is such that \(c_0 + mp \equiv 0 \mod r\). Solving for \(m\) gives \(m = -p^{-1}c_0 \equiv \mu c_0 \mod r\). Also, as we are working modulo \(r\), it suffices to use \(m = \mu_{0}c_0\), discarding any carries, where \(\mu_{0}\) is the least significant word of \(\mu\).
We overload \(m\) a little in notation. This \(m\) is NOT the same \(m\) as the one defined in the pseudocode description on top. This \(m\) is just to make \(c_0 + mp\) divisible by \(r\).
Once we compute \(mp\) and add this to \(c\) to get \(c^{(1)}\), this should clear the least significant digit, based on how we picked \(m\). See below:
\[ \begin{align} c &\equiv (c_{2n-1} r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r + c_0) \mod p \\[10pt] cr^{-1} &\equiv ( c_{2n-1}r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r)r^{-1} + c_0 r^{-1} \mod p \\[10pt] &\equiv ( c_{2n-1}r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r)r^{-1} + (c_0+ mp)r^{-1} \mod p \\[10pt] &\equiv (c + mp)r^{-1} \mod p \\[10pt] &\equiv ( c^{(1)}_{2n-1}r^{2n-1} + \ldots + c^{(1)}_1r + 0)r^{-1} \mod p \\[10pt] &\equiv ( c^{(1)}_{2n-1}r^{2n-2} + \ldots + c^{(1)}_1) \mod p \\[10pt] \end{align} \]
Repeat The Subroutine \(n\) times
Now we can play the subroutine again with \(c^{(1)}_1\) replacing the role of \(c_0\), and the updated value of \(m = \mu_0c_1^{(1)}\) (NOTE the same \(\mu_0\) is always used)
\[ \begin{align} cr^{-1} &\equiv ( c^{(1)}_{2n-1}r^{2n-2} + \ldots + c^{(1)}_1 ) \mod p \\ cr^{-2} &\equiv ( c^{(1)}_{2n-1}r^{2n-3} + \ldots + c^{(1)}_2 + c^{(1)}_1r^{-1} ) \mod p \\ &\equiv ( c^{(2)}_{2n-1}r^{2n-3} + \ldots + c^{(2)}_2 ) \mod p \end{align} \]
In fact, we play this game \(n\) times to get the final answer:
\[cR^{-1} \equiv cr^{-n} \equiv c^{(n)}_{2n-1}r^{n-1} + \ldots + c^{(n)}_{n+1}r + c^{(n)}_{n} \mod p\]
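Putting the \(n\) rounds together, the whole reduction looks roughly like the following sketch for \(n=4\) 64-bit limbs (our own rendering of the textbook loop, with mu0 \(= \mu_0\) and c holding the already-computed double-width product \(a \cdot b\)):

```rust
/// Limb-by-limb Montgomery reduction sketch for n = 4.
/// Returns c * R^{-1} mod p, possibly still in [p, 2p).
fn mont_reduce(mut c: [u64; 8], p: [u64; 4], mu0: u64) -> [u64; 4] {
    for i in 0..4 {
        // m clears limb i: c[i] + m * p ≡ 0 mod 2^64
        let m = c[i].wrapping_mul(mu0);
        let mut carry = 0u64;
        for j in 0..4 {
            // c[i + j] += m * p[j] + carry, keeping the high word as the new carry
            let t = (c[i + j] as u128) + (m as u128) * (p[j] as u128) + (carry as u128);
            c[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        // propagate the final carry into the upper limbs
        // (for bn-254, 2Rp fits in 8 limbs, so the carry never escapes)
        for j in (i + 4)..8 {
            let t = (c[j] as u128) + (carry as u128);
            c[j] = t as u64;
            carry = (t >> 64) as u64;
            if carry == 0 { break; }
        }
    }
    // the low 4 limbs are now zero; the answer is the high 4 limbs (< 2p)
    [c[4], c[5], c[6], c[7]]
}
```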
Why Can’t This Answer Be Greater Than \(2p\)?
What did we actually do in the above process? Over \(n\) iterations, in each iteration, we computed \(m\), added \(mp\) to \(c\) and divided by \(r\). We denote the \(m\) in iteration \(i\) by \(m_i\) here. More explicitly,
\[ \begin{align} cR^{-1} &\equiv (\ldots((c + m_0p)r^{-1} + m_1p)r^{-1} + \dots + m_{n-1}p)r^{-1} \\[10pt] &= \frac{c + p\sum_{i=0}^{n-1} m_ir^{i}}{r^n} \\ &= \frac{c + m'p }{R} \\ &< \frac{Rp + pR}{R}\\ &= 2p \end{align} \]
Here \(m'\) is some integer expressed using \(n\) limbs, therefore strictly less than \(R\). This is exactly why, if the answer is greater than \(p\), one subtraction suffices for c-mul. We NEVER need more than one subtraction!
Yuval’s New Multiplication Algorithm
The new algorithm is based on a subtle observation. Quoting the original blog:
Per round of the Montgomery reduction, we previously computed \(c_i r^{-1} \mod p\) using \(n + 1\) digit multiplications. An alternative method to achieve this, using only \(n\) digit multiplications, is by computing \(\rho_1 = r^{-1} \mod p\). One way to understand this algorithm is that, in each round, the current sum is divided by \(r\), and the resulting fraction is converted into an \(n + 1\)-digit number \(c_i \rho_1\), which is then added to the integer part of the division.
In simple words, in CIOS we were adding and then dividing4; now we first divide and then add things. See below:
\[ c \equiv (c_{2n-1} r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r + c_0) \mod p\]
Dividing \(c\) by \(r\), \(n-1\) times:
\[ \begin{align} cr^{-(n-1)} &\equiv c_{2n-1}r^{n} + \ldots + c_{n-1} + c_{n-2}r^{-1} \ldots + c_1r^{-(n-2)}+ c_0r^{-(n-1)} \mod p \end{align} \]
If we pre-computed5 \(\rho_i \equiv r^{-i} \mod p\) for all \(i \in [n-1]\), we can re-write the above equation as
\[cr^{-(n-1)} \equiv \Big(c_{2n-1}r^{n} + \ldots +c_{n-1} \Big) + \Big(c_{n-2}\rho_1\Big) \ldots + \Big(c_1\rho_{n-2}\Big)+ \Big(c_0\rho_{n-1}\Big) \mod p\]
Barring the leftmost bracketed term, the remaining bracketed terms are \(n+1\)-limb numbers, referred to as the fractional part of the division in the quoted text, and \(\Big(c_{2n-1}r^{n} + \ldots +c_{n-1} \Big)\) is the integer part of the division. More simply, the remainder and the quotient.
These can be calculated independently and in parallel.
Let \[s = \Big(c_{2n-1}r^{n} + \ldots +c_{n-1} \Big) + \Big(c_{n-2}\rho_1\Big) \ldots + \Big(c_1\rho_{n-2}\Big)+ \Big(c_0\rho_{n-1}\Big)\]
To compute the final answer, we do the standard thing of adding \(mp\) and dividing by \(r\) for the last step, where \(m= s_0\mu_0\) and \(s_0\) is the least significant word of \(s\). This is akin to playing the CIOS game for one round.
\[ cR^{-1} \mod p \equiv (s+ mp)/r\]
The code for the s-mul algorithm implements the above algorithm exactly as described.
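Structurally, and with our own helper names rather than theirs, the accumulation step looks like this for \(n=4\):

```rust
// A structural sketch of the s-mul accumulation for n = 4 limbs.
// rho[k] holds rho_{k+1} = r^{-(k+1)} mod p as a 4-limb constant
// (these correspond to the code's precomputed U64_I* constants).

// multiply a single limb by a 4-limb constant, producing 5 limbs
fn limb_times_bigint(d: u64, x: [u64; 4]) -> [u64; 5] {
    let mut out = [0u64; 5];
    let mut carry = 0u64;
    for j in 0..4 {
        let t = (d as u128) * (x[j] as u128) + (carry as u128);
        out[j] = t as u64;
        carry = (t >> 64) as u64;
    }
    out[4] = carry;
    out
}

fn s_mul_accumulate(c: [u64; 8], rho: [[u64; 4]; 3]) -> [u64; 6] {
    // integer part Q = c_7 r^4 + ... + c_3, put into a 6-limb accumulator
    let mut s = [c[3], c[4], c[5], c[6], c[7], 0u64];
    // fractional parts c_2*rho_1, c_1*rho_2, c_0*rho_3: independent, so the
    // three multiplications can run in parallel (out-of-order) on the CPU
    for k in 0..3 {
        let part = limb_times_bigint(c[2 - k], rho[k]);
        let mut carry = 0u64;
        for j in 0..5 {
            let t = (s[j] as u128) + (part[j] as u128) + (carry as u128);
            s[j] = t as u64;
            carry = (t >> 64) as u64;
        }
        s[5] += carry;
    }
    s // s < 4rp; one last (s + mp)/r round plus conditional subtractions follow
}
```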
Theorem
We claim that \(s < n \cdot rp\).
Dividing \(c\) by \(r^{n-1}\), we get
\[ \begin{align} cr^{-(n-1)} &\equiv Q + R \mod p \end{align} \] where \(R = \Big(c_{n-2}\rho_1\Big) + \ldots + \Big(c_1\rho_{n-2}\Big)+ \Big(c_0\rho_{n-1}\Big)\) and \(Q = \Big(c_{2n-1}r^{n} + \ldots +c_{n-1} \Big)\)
Now \(R < (n-1)rp\) as each bracketed term is at most \(rp\).
Also \(Q \le \lfloor \frac{c}{r^{n-1}}\rfloor < \frac{p^2}{r^{n-1}} < \frac{r^np}{r^{n-1}} = rp\)
Therefore, \(Q+R < n\cdot rp\)
From this, we can conclude that
\[ cR^{-1} \equiv (s+ mp)/r \mod p \quad \text{with} \quad (s+mp)/r < (n\cdot rp + rp)/r = (n+1)p\]
Therefore, it is often not sufficient to subtract \(2p\) from the answer; you might have to go as high as subtracting \(4p\).
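For our case \(n=4\), the bound instantiates to
\[ (s+mp)/r < (4\cdot rp + rp)/r = 5p \]
so the un-reduced output of s-mul can land anywhere below \(5p\), which is why up to four conditional subtractions of \(p\) may be needed.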
Why Does h-mul Pass?
It makes sense why s-mul fails. From the above bounds, it’s very possible that the final answer is much bigger than \(p\). But this does not explain why h-mul, which computes the same thing, passes with only one more subtraction. The answer is that although h-mul and s-mul are based on the same algorithm, they are implemented slightly differently.
A hybrid algorithm
More specifically, h-mul is a hybrid between the completely expanded s-mul algorithm and the c-mul algorithm. We explain this difference in the context of \(n=4\). The h-mul algorithm can be described as an iteration of \(\log_2 n\) division operations. Contrast this with s-mul, where we do \(n-1\) division operations by \(r\) before adding things, or with c-mul, where we NEVER divide before adding.
In h-mul, at each iteration, we reduce half of the remaining digits. If \(n=8\), we divide by \(r^{4}\), then \(r^{2}\), then \(r\). If \(n=4\), we first divide by \(r^{2}\) using \(\rho_2 \equiv r^{-2} \mod p\), and then we divide by \(r\) using \(\rho_1 \equiv r^{-1} \mod p\).
So, for our bn-254-specific implementation, where \(n=4\), we first reduce by two digits
\[ \begin{align} cr^{-2} &\equiv \Big(c_{7}r^{5} + \ldots +c_{3}r + c_2 \Big) + \rho_2(c_1r + c_0) \end{align} \]
In the next step, we reduce by 1 digit
\[ \begin{align} cr^{-3} &\equiv \Big(c_{7}r^{4} + \ldots +c_{3} \Big) + \rho_1(c_2 + \rho_2(c_1r + c_0)) \\ &\equiv Q + \text{Remainder} \end{align} \]
Note that the total effect is the same as in s-mul, where we also divide by \(r^{3}\) before adding things. The key difference between h-mul and s-mul is that in s-mul we do all 3 divisions in parallel, whereas here we must first divide by \(r^{2}\), and only then can we divide by \(r\). So we lose some parallelism in h-mul, but, as we will soon see, we minimise the amount of overflow.
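In code, the halving schedule amounts to repeated applications of a single "fold" step; below is a minimal sketch under our own helper names, not the actual h-mul code:

```rust
// One "divide by r^k, then add" step:
// out = (acc >> k limbs) + rho * (acc mod r^k),
// where rho is the 4-limb constant r^{-k} mod p.
fn fold(acc: &[u64], k: usize, rho: &[u64; 4]) -> Vec<u64> {
    let (low, high) = acc.split_at(k);
    let mut out = vec![0u64; acc.len() + 1]; // headroom for the final carry
    out[..high.len()].copy_from_slice(high);
    for (i, &d) in low.iter().enumerate() {
        let mut carry = 0u64;
        for j in 0..4 {
            let t = (out[i + j] as u128) + (d as u128) * (rho[j] as u128) + (carry as u128);
            out[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        // propagate the leftover carry
        let mut idx = i + 4;
        while carry != 0 {
            let t = (out[idx] as u128) + (carry as u128);
            out[idx] = t as u64;
            carry = (t >> 64) as u64;
            idx += 1;
        }
    }
    out
}

// h-mul schedule for n = 4 (RHO_2, RHO_1 are the precomputed constants):
// let acc = fold(&c, 2, &RHO_2);   // c * r^{-2}
// let acc = fold(&acc, 1, &RHO_1); // c * r^{-3}
// then one CIOS-style round: m = mu0 * acc[0], answer = (acc + m*p) / r
```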
From here on, we just do CIOS.
For each division operation, we blow up the current sum by at most \(rp\). See below for \(n=4\):
\[ \begin{align} cr^{-2} &< \frac{p^2}{r^2} + \rho_2(c_1r + c_0) \\[10pt] &< \frac{p^2}{r^2} + \rho_2(r^2) \\[10pt] cr^{-3} &< \frac{p^2}{r^3} + \rho_1r + \rho_2r \end{align} \]
In h-mul, we only divide twice to get \(Q\), whereas in s-mul we divide three times before adding. This is why the number of subtractions in h-mul will always be one less than the number needed for s-mul.
If you stare at the code long enough, you’ll confirm that s-mul uses 3 precomputed constants, U64_I3, U64_I2, U64_I1. On the other hand, h-mul uses just U64_I2, U64_I1. These are the \(\rho\)’s.
Note that this strategy gives rise to a family of algorithms – we can employ yet another division strategy: we divide once by \(r^{2}\), and then play CIOS.
\[ \begin{align} cr^{-2} &\equiv \Big(c_{7}r^{5} + \ldots +c_{3}r + c_2 \Big) + \rho_2(c_1r + c_0) \\[10pt] &\equiv Q +R \end{align} \]
In this case, the number of subtractions is always one less than that of h-mul but one more than c-mul.
Thus, s-mul can be thought of as maximising parallel division operations before adding, and c-mul can be thought of as minimising division operations before an add. They are the two extremes of this family of algorithms. Using the above analysis, one can easily show the following:
Theorem
Let \(u\) be the number of multiplication operations used by the CIOS algorithm with inputs \(a\) and \(b\) that are each at most \(p\), and \(v\) the number of addition operations (including the reduction phase). Then, for any \(0 \le k \le n - 1\), if we divide \(ab\) by \(r^k\) before adding, we use \(k\) fewer multiplication operations than CIOS. If we have \(m\) rounds of division, then we use \(m\) more addition operations than CIOS (due to the reduction step of making the answer less than \(p\)). For example, with \(n=4\), a single round of division by \(r^2\) before adding saves \(k=2\) multiplications at the cost of \(m=1\) extra addition – exactly the trade-off we will exploit below.
Missing Factor Of \(1p\)
At this point, there is one question left. We proved that, in theory:
- s-mul needs at most 4 subtractions and saves 3 multiplications.
- h-mul needs at most 3 subtractions and saves 3 multiplications.
- c-mul needs at most 1 subtraction.

But why do two subtractions work out for h-mul? Shouldn’t it be three? Similarly, why did three work for s-mul?
Re-examining the analysis for h-mul:
\[cr^{-4} < \Big(\frac{Q}{r^3} + r(\rho_1 + \rho_2) + mp \Big)/r < 4p\]
The \(4p\) is a \(1p\) improvement on the \(5p\) we proved for s-mul. The reason we get away with at most three subtractions in practice for s-mul and two for h-mul has to do with the specific value of \(p\) for the bn-254 scalar field. If we tighten our analysis, \(\rho_1\) and \(\rho_2\) are fixed, and the largest value for \(Q\) is \((p-1)^2\). Plugging in exact numbers:
\[cr^{-4} < \frac{p^2}{r^4} + (\rho_1 + \rho_2) + p < 2.25p\]
It turns out to be less than \(3.25p\) for s-mul.
It’s the bn-254 prime modulus \(p\) that allows us to do one subtraction less for h-mul and s-mul than what theory would suggest for a general prime \(p\). If \(p\) were only a little smaller than \(r^n\), then we’d have to put in one extra reduction for each of these algorithms.
The Need For Speed
Remember, we began with the goal of speeding up Montgomery multiplication. So far, we have only understood correctness to make Jolt unit tests pass. We had to add extra instructions to handle overflow. Do the speedups previously observed still exist?
Without Subtractions
The first step is to remove all subtractions from all algorithms and observe how fast things are.
warning: `minimal_mult` (example "ark-multiplication") generated 1 warning
Finished `release` profile [optimized] target(s) in 3.76s
Running `target/release/examples/ark-multiplication`
100000 Trials of 100 chained multiplications
Benchmarking done
Average Advantage: 11.905494513610297
C-mult 1779.19279
H-mult 1567.37109
dtype: float64
We are 11% faster. Note that it is very difficult to pinpoint whether this is due to fewer multiplications or due to how we wrote the code without contiguous memory, but with 100,000 random inputs and 100 chained multiplications, we appear to be consistently faster. Although fast, this code will not pass the tests – we need the answer to be reduced \(\mod p\).
Putting In The Subtractions
Now, for starters, ignoring the fact that h-mul actually needs two conditional subtractions, we stick one conditional subtraction onto both c-mul and h-mul. We should still have a speed advantage, right? After all, we had one algorithm faster than the second algorithm, and we added the same line of code to both.
Finished `release` profile [optimized] target(s) in 0.80s
Running `target/release/examples/ark-multiplication`
100000 Trials of 100 chained multiplications
Benchmarking done
Average Advantage: -5.434536875745394
C-mult 1791.88866
H-mult 1889.26951
dtype: float64
We are 5% slower. With a little thought and listening to Matt Godbolt, this observation is not surprising. As we are writing CPU code, the answer is branch prediction. The CPU gets it wrong a lot for h-mul. Why? It’s because the line of code we added was
if a >= p {
    a = a - p
}
Based on our theory so far, we know that c-mul overflows much less frequently than h-mul – with random inputs \(\mod p\), we see that CIOS only overflows above \(p\) 5% of the time, whereas h-mul overflows 68% of the time. If things were distributed uniformly6, this would imply that 95% of the time, the if statement evaluates to false for c-mul. So the CPU can pretty much always skip this line of code, and be wrong only 5% of the time.
If we wish to double down on this theory, we just keep the if check but remove the actual subtraction from the reduction (this will break correctness, but we’re only checking whether our theory is right). All our speed is recovered. The branch prediction is now the same for both code bases, as the compiler will just get rid of the empty check in the first place.
100000 Trials of 100 chained multiplications
Benchmarking done
Average Advantage: 11.633622123165752
C-mult 1804.97929
H-mult 1594.99482
C-overflow 0.05428
H-overflow 0.68088
dtype: float64
Aside: Same Code Two Conclusions
As a brief distraction, in this section, we give an example of how finicky performance benchmarking can be.
One of the reasons we embarked on this journey was that preliminary experimentation showed 10-15% speedups7. Without the subtractions, we have a clear understanding of why this was the case (fewer multiplications and no big read/write chunks). However, even if we add the requisite subtractions to the Ingo skyscraper code and run cargo bench based on the Ingo skyscraper benches, we see something different from what’s discussed above.
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running benches/my_benchmark.rs (target/release/deps/my_benchmark-8968c072d5fb1ae5)
Arkworks CIOS (10 chained c-mul, looped)
time: [1.9783 µs 1.9847 µs 1.9929 µs]
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
h-mul (10 chained h-mul, looped)
time: [1.8386 µs 1.8411 µs 1.8442 µs]
h-mul is FASTER even with the subtractions!!!! How can this be? This contradicts everything we have discussed so far. In terms of the body of the algorithms, this bench and our tests run the EXACT same code, and yet we have totally different outcomes. A lazy conclusion might be variance – but that’s not true. We can randomise, bootstrap, and build stratified experiments; h-mul is still marginally faster.
The real answer can be found by thinking about the CPU’s branch prediction algorithm and then staring at the inputs to the algorithm.
All of Montgomery’s theory relies on the fact that the initial inputs \(a\) and \(b\) are at most \(p-1\). In the Ingo skyscraper code, the inputs are random \(\mod r^4\). This strips away the overflow-prevention property of CIOS – therefore, we lose the branch-prediction gains that make CIOS fast in practice. Now the CPU cannot predict the branch, and this is much closer to the no-subtraction situation. Of course, we still lose a bit, as h-mul needs 2 subtractions and 2 checks as opposed to 1 – hence we don’t see the 11% speedup, but we still see some speedup.
In conclusion, something as innocuous as random inputs can lead to completely different conclusions about the real-world performance of code.
So Where Do We Go Now?
At this point, where do we stand? We have a family of algorithms for Montgomery multiplication. If we are not bothered about reducing \(\mod p\), then s-mul should be the fastest algorithm, owing to its maximal parallelism (allowing maximal out-of-order execution on the CPU). But we need the final answer to be less than \(p\) for the rest of the Jolt setup to work. Does this mean that for Jolt, CIOS remains the best algorithm to use?
We try a middle ground. We use our theory framework so far to come up with an alternate strategy: we will save 2 multiplications (instead of 3), but we also need to reduce one fewer time compared to h-mul. By our analysis and the actual value of \(p\) in bn-254, we will need at most one subtraction now, as opposed to the two for h-mul (just like CIOS). We hope the number of overflows will shrink, and that we therefore save on branching. Note, it will never be as small as CIOS, but we also have 2 fewer multiplications. We hope that, on average, this gives us the speedups we are after.
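Schematically, for \(n=4\) this is one \(\rho_2\) fold followed by two CIOS-style rounds:
\[ \begin{align} cr^{-2} &\equiv \Big(c_{7}r^{5} + \ldots + c_{3}r + c_2 \Big) + \rho_2(c_1r + c_0) \mod p \\ cR^{-1} \equiv cr^{-4} &\equiv ((cr^{-2} + m_0p)r^{-1} + m_1p)r^{-1} \mod p \end{align} \]
where \(m_0, m_1\) are the usual per-round CIOS constants.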
Unfortunately, we do not have a prior implementation of this algorithm that we can fork – so we must write one from scratch. Our implementation can be found here, and it sticks to the code formatting of the Ingo skyscraper implementation, which we found to be fast. Let’s call this code a-mul.
Without subtractions, a-mul is 9% faster. As expected, we are a little slower than h-mul. But the percentage of the time we overflow is much smaller – 27% as opposed to 68% with random inputs \(\mod p\).
100000 Trials of 100 chained multiplications
Benchmarking done
Average Advantage: 9.704649409540085
C-mult 1778.47723
H-mult 1605.88225
C-overflow 0.05365
H-overflow 0.27255
dtype: float64
Putting the subtractions back in, we are marginally faster than c-mul in the worst case with fully random inputs.
300000 Trials of 100 chained multiplications of random values
300000 Trials of 1 chained multiplications of small values
Scanning file: arkworks_multiplication.csv
Average Advantage (%): 2.011834252517198
C-mult 2213.471443
H-mult 2168.940067
C-overflow 0.000000
H-overflow 0.000000
dtype: float64
Now, what would happen if we could control the inputs to be small? This will prevent overflows. We should recover speed, right?
Scanning file: arkworks_multiplication_small.csv
Average Advantage (%): 16.30465277384584
C-mult 28.788490
H-mult 24.094627
C-overflow 0.000000
H-overflow 0.000000
dtype: float64
It’s exactly what we expect. When we force the inputs \(a\) and \(b\) to small values (note: small in Montgomery form, so \(a = a'r \mod p\) and \(b=b'r \mod p\), where \(a'\) and \(b'\) may or may not be small), we kill overflows, and we get the speedups we were after. Now the CPU’s branch prediction, which is roughly based on what’s happened most recently, gets it right.
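For reference, forcing the Montgomery representation itself to be small can be done along these lines, assuming arkworks’ ark_bn254::Fr and its new_unchecked constructor, which treats its argument as already being in Montgomery form (the exact sampling in our benchmark may differ):

```rust
// Sketch: build field elements whose *stored* (Montgomery-form) limbs are tiny.
// Because new_unchecked skips the conversion, the logical value
// a' = a * R^{-1} mod p is essentially arbitrary; only the representation is small.
use ark_bn254::Fr;
use ark_ff::BigInt;
use ark_std::{rand::Rng, test_rng};

fn small_montgomery_inputs(n: usize) -> Vec<Fr> {
    let mut rng = test_rng();
    (0..n)
        .map(|_| Fr::new_unchecked(BigInt::from(rng.gen::<u16>() as u64)))
        .collect()
}
```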
But when we run with random numbers again, we’re back to essentially the performance of CIOS. In conclusion, the number of multiplications is one source of slowdown in the real world, but there are factors where CIOS does better. Now, a priori, if you knew the input ranges to your algorithm, then you could predict how often each of them will overflow. This would allow you to use the right level of parallelism and pre-computation to get maximal speed. For example, if you know that the inputs are near \(p-1\), you’d likely still use CIOS. If you knew the answers would be very small, then you’d use s-mul. For anything in between, you’d want to use the mini-theorem above and figure out how to balance the number of multiplications vs overflow.
Instructions For Running Benchmarks Discussed Above
To get individual times on chained multiplications:
git clone https://github.com/abiswas3/Montgomery-Benchmarks.git
cd Montgomery-Benchmarks
cargo bench
To get time comparisons inside of arkworks, run the following command:
cargo run --example ark-multiplication --release && python3 plotting.py arkworks_multiplication.csv
For squaring, run the following (we have not yet optimised the squaring library to use a-mul):
cargo run --example ark-squaring --release && python3 plotting.py arkworks_squaring.csv
This specific implementation is optimised for parallelism instead – at least, looking at all the SIMD code, that’s what it seems.↩︎
unchanged from how we found them in the wild↩︎
Limbs here just mean digits. A 4-limb number is a 4-digit number where each digit is \(w\) bits large, i.e. each digit is between 0 and \(2^w -1\).↩︎
First add \(mp\) to \(c\), then divide \((c+mp)/r\).↩︎
We guarantee that \(\rho_i < p\)↩︎
Which they are not.↩︎
Remember the bench code in the alternate implementation?↩︎