Notes From The Montgomery Underground
Background
In an earlier post, we analysed why this implementation of the improved Montgomery multiplication algorithm by Yuval Domb was not empirically faster out of the box. After disassembling the code using the Intel Pin tool, we identified large memory reads and writes as the primary source of slower run times. Although the analysis was specific to x86 architectures, the empirical findings were consistent across MacBook M2 laptops and Intel x86 Digital Ocean servers. This loss of performance, however, is an artefact of the implementation, not the algorithm itself1. The new algorithm uses fewer multiplication operations than the CIOS algorithm, and therefore a faster implementation should exist (at least in theory).
The large chunk reads and writes were likely due to the high-level Rust code storing intermediate \(2n\)-limb values in contiguous regions of memory, which likely led the compiler to demand the same of the generated assembly code as well. The key thing to note here is that the algorithm specification does not require contiguous memory storage; it’s just an implementation detail. As a resolution, we found an alternate implementation in which all intermediate values are stored as individual Rust variables, as opposed to constant-sized vecs/arrays. The code explicitly does school-book multiplication, storing all intermediate values in registers, before adding them in a long sequence using school-book addition. The code came with cargo bench, and when compared to the current Arkworks CIOS implementation, it seemed to be 5-10% faster (at least using their bench setup).
The lack of arrays/contiguous memory is not the only thing that is different about this alternate implementation when compared to the earlier implementation for Worldcoin we benchmarked. We will get to that in a bit.
For the remainder of the post, we refer to the original Arkworks CIOS algorithm for Montgomery multiplication as c-mul, the previous implementation of the new multiplication as s-mul, and the faster implementation as h-mul. c stands for CIOS, s stands for single step, and h stands for hybrid. The naming convention will soon become clear.
Putting this faster code into Jolt
Ignoring speed for a minute, we plugged the s-mul and h-mul multiplication algorithms in their native form2 into Jolt. Then we ran cargo test -p jolt-core -- sha3_e2e_hyperkzg, and the unit tests fail for both s-mul and h-mul.
thread '<unnamed>' panicked at /Users/aribiswas3/Work-With-A16z/arkworks-algebra/ff/src/fields/models/fp/montgomery_backend.rs:1360:9:
High limb expected to be 0 if S >= 1 before 2p comparison
failures:
jolt::vm::rv32i_vm::tests::sha3_e2e_hyperkzg
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 219 filtered out; finished in 4.79s
Digging deeper, the part of the Jolt codebase that fails is code for subtraction. The Conditional Subtraction For Barrett Reduction throws an error claiming that the number we passed in was too large. If you look at line 1360, it specifically needs the big integer listed as r_tmp not to have its high bit set to 1. This error goes away when using Arkworks-CIOS as the multiplication algorithm.
Why does this happen?
Scanning the lines of code for h-mul, we observe that the final answer is not reduced \(\mod p\). When you multiply two numbers that are both less than \(p\), the result may still cross over the modulus \(p\). In Arkworks-CIOS, the last line of multiplication involves subtracting \(p\) from the final answer if the answer is greater than \(p\). The line __subtract_modulus(a) simply outputs \(a-p\) if the final output \(a \ge p\), and otherwise leaves \(a\) unchanged.
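To make this concrete, the tail of c-mul behaves like the following minimal sketch, assuming a 4-limb little-endian representation (our own rendering of the idea, not the actual Arkworks code):

```rust
// A sketch of a __subtract_modulus-style conditional subtraction.
fn subtract_modulus(a: &mut [u64; 4], p: &[u64; 4]) {
    // compare a with p, most significant limb first
    let a_ge_p = (0..4)
        .rev()
        .find_map(|i| if a[i] != p[i] { Some(a[i] > p[i]) } else { None })
        .unwrap_or(true); // all limbs equal means a == p

    if a_ge_p {
        // a = a - p, with borrow propagation across limbs
        let mut borrow = 0u64;
        for i in 0..4 {
            let (t, b1) = a[i].overflowing_sub(p[i]);
            let (t, b2) = t.overflowing_sub(borrow);
            a[i] = t;
            borrow = (b1 as u64) + (b2 as u64);
        }
    }
}
```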
In the multiplication code for s-mul, notice they have a reduce_ct function. In this function, we find that they subtract \(2p\) from \(a\) only if the most significant bit of \(a\) is 1. Thus, we have three algorithms, and although the final answer for each algorithm is the same (modulo \(p\)), the exact numerical value differs between the three algorithms. They are, however, not always different. Sometimes they are the same, sometimes two out of three of them are the same, and at times all three are different.
To summarise:
- c-mul: passes unit tests, and subtracts \(p\) from \(a\) when \(a \ge p\).
- h-mul: fails unit tests, and does not do any reduction.
- s-mul: fails unit tests, and subtracts \(2p\) from \(a\) when the msb of \(a\) is 1.
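For contrast, the s-mul tail looks roughly like this sketch (again our own rendering; here two_p is the 4-limb constant \(2p\), which still fits in 4 limbs because the bn-254 scalar modulus has its top bits clear):

```rust
// A sketch of a reduce_ct-style tail: subtract 2p only when the msb is set.
fn reduce_ct(a: &mut [u64; 4], two_p: &[u64; 4]) {
    if a[3] >> 63 == 1 {
        // a = a - 2p, with borrow propagation across limbs
        let mut borrow = 0u64;
        for i in 0..4 {
            let (t, b1) = a[i].overflowing_sub(two_p[i]);
            let (t, b2) = t.overflowing_sub(borrow);
            a[i] = t;
            borrow = (b1 as u64) + (b2 as u64);
        }
    }
}
```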
First attempt: We stick a __subtract_modulus(a) at the end of h-mul, matching the behaviour in c-mul. The tests still panic with the same error message. A few hours of debugging follow, and we observe that even after subtracting \(p\) from the answer, the final answer is still sometimes greater than \(p\); i.e., h-mul sometimes outputs something greater than \(2p\).
Ok, what happens if we put in two instances of __subtract_modulus(a)? Jolt passes!
Did we get lucky? One way to check is to try a lot of random inputs and try to break the test. Over \(100000\) randomly seeded inputs, with \(10\) chained multiplications in each trial, the numerical values of c-mul and h-mul are EXACTLY the same. However, adding two conditional subtractions to s-mul – Jolt tests fail again. But adding 3 conditional subtractions to s-mul makes Jolt pass with s-mul, and now all 3 algorithms are the same.
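Schematically, the check looks like the sketch below, where c_mul, h_mul, and random_fe are hypothetical placeholders for the two implementations under test and a sampler of reduced 4-limb values:

```rust
// Schematic fuzz harness: chained multiplications on random seeded inputs,
// comparing exact limb values (not just equality mod p).
use rand_chacha::rand_core::SeedableRng;
use rand_chacha::ChaCha8Rng;

fn c_mul(_a: &[u64; 4], _b: &[u64; 4]) -> [u64; 4] { unimplemented!() }
fn h_mul(_a: &[u64; 4], _b: &[u64; 4]) -> [u64; 4] { unimplemented!() }
fn random_fe(_rng: &mut ChaCha8Rng) -> [u64; 4] { unimplemented!() }

fn fuzz(trials: u64, chain: usize) {
    for seed in 0..trials {
        let mut rng = ChaCha8Rng::seed_from_u64(seed);
        // accumulators for the two implementations, starting equal
        let mut acc_c = random_fe(&mut rng);
        let mut acc_h = acc_c;
        for _ in 0..chain {
            let m = random_fe(&mut rng);
            acc_c = c_mul(&acc_c, &m);
            acc_h = h_mul(&acc_h, &m);
        }
        // compare exact limbs, not just values mod p
        assert_eq!(acc_c, acc_h, "limb mismatch at seed {seed}");
    }
}
```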
What is going on? Why do we need one subtraction for one algorithm, and two and three for the others? How do we know we don’t need 3 or 4 or 5? \(100000\) trials are minuscule when compared to the input space of 256-bit numbers. What if there are edge cases that we can’t catch via fuzzing?
In what follows, we derive a formal guarantee on how large the answer of these algorithms can be, therefore establishing the maximum number of conditional subtracts we must employ. In the process, we establish that there is not just 1 new algorithm for Montgomery multiplication, but a family of algorithms that are all related to each other, each demonstrating speed for different input distributions.
Montgomery Multiplication Maths
We begin with the basics of Montgomery multiplication, and then describe the CIOS algorithm. This gives all the background needed to figure out what is going on.
Below, we actually describe the Separated Operand Scanning Algorithm (SOS), but deceptively refer to it as the Coarse Integrated Operand Scanning (CIOS) algorithm. This is done for ease of exposition only; the claims would hold regardless. See Acar’s thesis Chapter 2 for more details. We also do not discuss the optimisations made by the Gnark team as they are not relevant in this context.
Given an \(n\)-limb3 modulus \(p\), and integers \(a\) and \(b\) that are at most \(p-1\), Montgomery multiplication computes
\[ c \equiv abR^{-1} \mod p\]
where \(R=r^n\) and \(r = 2^w\), where \(w\) is the word size (typically 64 bits). To avoid explicitly computing modulo \(p\) by long division, we instead compute \((c + mp)/R \equiv cR^{-1}\mod p\), where \(m\) is an integer such that \(c + mp \equiv 0 \mod R\). A simple calculation solving for \(m\) gives us \(m = -p^{-1}c \mod R\). For our purposes, \(p\) is prime, so \(p\) and \(r\) are co-prime. Therefore, \(-p^{-1}\) is well defined, and thus so is \(m\).
Key Idea: As \(R\) is a power of 2, the last division can be done by bit shift operations native to most instruction architectures.
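To make this concrete, here is a toy instance with made-up parameters (\(w=4\), so \(r=16\), \(n=2\), \(R=256\), and \(p=13\)); real code uses 64-bit limbs, but the arithmetic is identical:

```rust
// Toy Montgomery reduction, small enough to check by hand.
fn main() {
    let p: u64 = 13;
    let big_r: u64 = 256; // R = r^n = 16^2
    // mu = -p^{-1} mod R: since 13 * 197 = 2561 = 10*256 + 1, p^{-1} = 197, mu = 59
    let mu: u64 = 59;
    let c: u64 = 100; // the value to reduce
    let m = (c * mu) % big_r; // m = -p^{-1} c mod R; here m = 12
    let reduced = (c + m * p) / big_r; // exact division: c + m*p = 256, so reduced = 1
    // the division is really a bit shift, since R is a power of two
    assert_eq!(reduced, (c + m * p) >> 8);
    // sanity check: reduced ≡ c * R^{-1} mod p
    assert_eq!((reduced * big_r) % p, c % p);
    println!("c = {c}, m = {m}, cR^-1 mod p = {reduced}");
}
```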
The CIOS algorithm
What we describe above is the pseudocode description of the Montgomery reduction algorithm. In practice, we do not actually compute \(m\) as a whole, then multiply it by \(p\), and then add \(mp\) to \(c\). We perform the computation limb by limb. \(c\) can be written as the following sum:
\[ \begin{align} c &\equiv (c_{2n-1} r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r + c_0) \mod p \\[10pt] cr^{-1} &\equiv ( c_{2n-1}r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r)r^{-1} + c_0 r^{-1} \mod p \\[10pt] \end{align} \]
The limb by limb process follows by repeating the following sub-routine \(n\) times.
Important Sub-Routine
Observe that \(c_0r^{-1} \equiv (c_0 + mp)r^{-1} \mod p\), where \(m\) is such that \(c_0 + mp \equiv 0 \mod r\). Solving for \(m\) gives \(m = -p^{-1}c_0 \equiv \mu c_0 \mod r\). Also, as we are working modulo \(r\), it suffices to use \(m = \mu_{0}c_0\), discarding any carries, where \(\mu_{0}\) is the least significant word of \(\mu\).
We overload \(m\) a little in notation. This \(m\) is NOT the same \(m\) as the one defined in the pseudocode description on top. This \(m\) is just to make \(c_0 + mp\) divisible by \(r\).
Once we compute \(mp\) and add this to \(c\) to get \(c^{(1)}\), this should clear the least significant digit, based on how we picked \(m\). See below:
\[ \begin{align} c &\equiv (c_{2n-1} r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r + c_0) \mod p \\[10pt] cr^{-1} &\equiv ( c_{2n-1}r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r)r^{-1} + c_0 r^{-1} \mod p \\[10pt] &\equiv ( c_{2n-1}r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r)r^{-1} + (c_0+ mp)r^{-1} \mod p \\[10pt] &\equiv (c + mp)r^{-1} \mod p \\[10pt] &\equiv ( c^{(1)}_{2n-1}r^{2n-1} + \ldots + c^{(1)}_1r + 0)r^{-1} \mod p \\[10pt] &\equiv ( c^{(1)}_{2n-1}r^{2n-2} + \ldots + c^{(1)}_1) \mod p \\[10pt] \end{align} \]
Repeat The Subroutine \(n\) times
Now we can play the subroutine again with \(c^{(1)}_1\) replacing the role of \(c_0\), and the updated value of \(m = \mu_0c_1^{(1)}\) (NOTE the same \(\mu_0\) is always used)
\[ \begin{align} cr^{-1} &\equiv ( c^{(1)}_{2n-1}r^{2n-2} + \ldots + c^{(1)}_1 ) \mod p \\ cr^{-2} &\equiv ( c^{(1)}_{2n-1}r^{2n-3} + \ldots + c^{(1)}_2 + c^{(1)}_1r^{-1} ) \mod p \\ &\equiv ( c^{(2)}_{2n-1}r^{2n-3} + \ldots + c^{(2)}_2 ) \mod p \end{align} \]
In fact, we play this game \(n\) times to get the final answer:
\[cR^{-1} \equiv cr^{-n} \equiv c^{(n)}_{2n-1}r^{n-1} + \ldots + c^{(n)}_{n+1}r + c^{(n)}_{n} \mod p\]
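Putting the \(n\) rounds together, the whole reduction looks roughly like the following sketch for \(n=4\) 64-bit limbs (our own rendering of the textbook loop, with mu0 \(= \mu_0\) and c holding the already-computed double-width product \(a \cdot b\)):

```rust
/// Limb-by-limb Montgomery reduction sketch for n = 4.
/// Returns c * R^{-1} mod p, possibly still in [p, 2p).
fn mont_reduce(mut c: [u64; 8], p: [u64; 4], mu0: u64) -> [u64; 4] {
    for i in 0..4 {
        // m clears limb i: c[i] + m * p ≡ 0 mod 2^64
        let m = c[i].wrapping_mul(mu0);
        let mut carry = 0u64;
        for j in 0..4 {
            // c[i + j] += m * p[j] + carry, keeping the high word as the new carry
            let t = (c[i + j] as u128) + (m as u128) * (p[j] as u128) + (carry as u128);
            c[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        // propagate the final carry into the upper limbs
        // (for bn-254, 2Rp fits in 8 limbs, so the carry never escapes)
        for j in (i + 4)..8 {
            let t = (c[j] as u128) + (carry as u128);
            c[j] = t as u64;
            carry = (t >> 64) as u64;
            if carry == 0 { break; }
        }
    }
    // the low 4 limbs are now zero; the answer is the high 4 limbs (< 2p)
    [c[4], c[5], c[6], c[7]]
}
```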
Why Can’t This Answer Be Greater Than \(2p\)?
What did we actually do in the above process? Over \(n\) iterations, in each iteration, we computed \(m\), added \(mp\) to \(c\) and divided by \(r\). We denote the \(m\) in iteration \(i\) by \(m_i\) here. More explicitly,
\[ \begin{align} cR^{-1} &\equiv (\ldots((c + m_0p)r^{-1} + m_1p)r^{-1} + \dots + m_{n-1}p)r^{-1} \\[10pt] &= \frac{c + p\sum_{i=0}^{n-1} m_ir^{i}}{r^n} \\ &= \frac{c + m'p }{R} \\ &< \frac{Rp + pR}{R}\\ &= 2p \end{align} \]
Here \(m'\) is some integer expressed using \(n\) limbs, therefore strictly less than \(R\). This is exactly why, if the answer is greater than \(p\), one subtraction suffices for c-mul. We NEVER need more than one subtraction!
Yuval’s New Multiplication Algorithm
The new algorithm is based on a subtle observation. Quoting the original blog:
Per round of the Montgomery reduction, we previously computed \(c_i r^{-1} \mod p\) using \(n + 1\) digit multiplications. An alternative method to achieve this, using only \(n\) digit multiplications, is by computing \(\rho_1 = r^{-1} \mod p\). One way to understand this algorithm is that, in each round, the current sum is divided by \(r\), and the resulting fraction is converted into an \(n + 1\)-digit number \(c_i \rho_1\), which is then added to the integer part of the division.
In simple words, in CIOS we were adding and then dividing4; now we first divide and then add things. See below:
\[ c \equiv (c_{2n-1} r^{2n-1} + c_{2n-2} r^{2n-2} + \ldots + c_1r + c_0) \mod p\]
Dividing \(c\) by \(r\), \(n-1\) times:
\[ \begin{align} cr^{-(n-1)} &\equiv c_{2n-1}r^{n} + \ldots + c_{n-1} + c_{n-2}r^{-1} \ldots + c_1r^{-(n-2)}+ c_0r^{-(n-1)} \mod p \end{align} \]
If we pre-computed5 \(\rho_i \equiv r^{-i} \mod p\) for all \(i \in [n-1]\), we can re-write the above equation as
\[cr^{-(n-1)} \equiv \Big(c_{2n-1}r^{n} + \ldots +c_{n-1} \Big) + \Big(c_{n-2}\rho_1\Big) \ldots + \Big(c_1\rho_{n-2}\Big)+ \Big(c_0\rho_{n-1}\Big) \mod p\]
Barring the leftmost bracketed term, the remaining bracketed terms are \(n+1\)-limb numbers, referred to as the fractional part of the division in the quoted text, and \(\Big(c_{2n-1}r^{n} + \ldots +c_{n-1} \Big)\) is the integer part of the division. More simply, the remainder and the quotient.
These can be calculated independently and in parallel.
Let \[s = \Big(c_{2n-1}r^{n} + \ldots +c_{n-1} \Big) + \Big(c_{n-2}\rho_1\Big) \ldots + \Big(c_1\rho_{n-2}\Big)+ \Big(c_0\rho_{n-1}\Big)\]
To compute the final answer, we do the standard thing of adding \(mp\) and dividing by \(r\) for the last step, where \(m= s_0\mu_0\) and \(s_0\) is the least significant word of \(s\). This is akin to playing the CIOS game for one round.
\[ cR^{-1} \mod p \equiv (s+ mp)/r\]
The code for the s-mul algorithm implements the above algorithm exactly as described.
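Structurally, and with our own helper names rather than theirs, the accumulation step looks like this for \(n=4\):

```rust
// A structural sketch of the s-mul accumulation for n = 4 limbs.
// rho[k] holds rho_{k+1} = r^{-(k+1)} mod p as a 4-limb constant
// (these correspond to the code's precomputed U64_I* constants).

// multiply a single limb by a 4-limb constant, producing 5 limbs
fn limb_times_bigint(d: u64, x: [u64; 4]) -> [u64; 5] {
    let mut out = [0u64; 5];
    let mut carry = 0u64;
    for j in 0..4 {
        let t = (d as u128) * (x[j] as u128) + (carry as u128);
        out[j] = t as u64;
        carry = (t >> 64) as u64;
    }
    out[4] = carry;
    out
}

fn s_mul_accumulate(c: [u64; 8], rho: [[u64; 4]; 3]) -> [u64; 6] {
    // integer part Q = c_7 r^4 + ... + c_3, put into a 6-limb accumulator
    let mut s = [c[3], c[4], c[5], c[6], c[7], 0u64];
    // fractional parts c_2*rho_1, c_1*rho_2, c_0*rho_3: independent, so the
    // three multiplications can run in parallel (out-of-order) on the CPU
    for k in 0..3 {
        let part = limb_times_bigint(c[2 - k], rho[k]);
        let mut carry = 0u64;
        for j in 0..5 {
            let t = (s[j] as u128) + (part[j] as u128) + (carry as u128);
            s[j] = t as u64;
            carry = (t >> 64) as u64;
        }
        s[5] += carry;
    }
    s // s < 4rp; one last (s + mp)/r round plus conditional subtractions follow
}
```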
Theorem
We claim that \(s < n \cdot rp\).
Dividing \(c\) by \(r^{n-1}\), we get
\[ \begin{align} cr^{-(n-1)} &\equiv Q + R \mod p \end{align} \] where \(R = \Big(c_{n-2}\rho_1\Big) + \ldots + \Big(c_1\rho_{n-2}\Big)+ \Big(c_0\rho_{n-1}\Big)\) and \(Q = \Big(c_{2n-1}r^{n} + \ldots +c_{n-1} \Big)\)
Now \(R < (n-1)rp\) as each bracketed term is at most \(rp\).
Also \(Q \le \lfloor \frac{c}{r^{n-1}}\rfloor < \frac{p^2}{r^{n-1}} < \frac{r^np}{r^{n-1}} = rp\)
Therefore, \(Q+R < n\cdot rp\)
From this, we can conclude that
\[ cR^{-1} \equiv (s+ mp)/r \mod p \quad \text{with} \quad (s+mp)/r < (n\cdot rp + rp)/r = (n+1)p\]
Therefore, it is often not sufficient to subtract \(2p\) from the answer; you might have to go as high as subtracting \(4p\).
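For our case \(n=4\), the bound instantiates to
\[ (s+mp)/r < (4\cdot rp + rp)/r = 5p \]
so the un-reduced output of s-mul can land anywhere below \(5p\), which is why up to four conditional subtractions of \(p\) may be needed.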
Why Does h-mul Pass?
It makes sense why s-mul fails. From the above bounds, it’s very possible that the final answer is much bigger than \(p\). But this does not explain why h-mul, which computes the same thing, passes with only one more subtraction. The answer is that although h-mul and s-mul are based on the same algorithm, they are implemented slightly differently.
A hybrid algorithm
More specifically, h-mul is a hybrid between the completely expanded s-mul algorithm and the c-mul algorithm. We explain this difference in the context of \(n=4\). The h-mul algorithm can be described as an iteration of \(\log_2 n\) division operations. Contrast this with s-mul, where we do \(n-1\) division operations by \(r\) before adding things, or with c-mul, where we NEVER divide before adding.
In h-mul, at each iteration, we reduce half of the remaining digits. If \(n=8\), we divide by \(r^{4}\), then \(r^{2}\), then \(r\). If \(n=4\), we first divide by \(r^{2}\) using \(\rho_2 \equiv r^{-2} \mod p\), and then we divide by \(r\) using \(\rho_1 \equiv r^{-1} \mod p\).
So, for our bn-254-specific implementation, where \(n=4\), we first reduce by two digits
\[ \begin{align} cr^{-2} &\equiv \Big(c_{7}r^{5} + \ldots +c_{3}r + c_2 \Big) + \rho_2(c_1r + c_0) \end{align} \]
In the next step, we reduce by 1 digit
\[ \begin{align} cr^{-3} &\equiv \Big(c_{7}r^{4} + \ldots +c_{3} \Big) + \rho_1(c_2 + \rho_2(c_1r + c_0)) \\ &\equiv Q + \text{Remainder} \end{align} \]
Note that the total effect is the same as in s-mul, where we also divide by \(r^{3}\) before adding things. The key difference between h-mul and s-mul is that in s-mul we do all 3 divisions in parallel, whereas here we must first divide by \(r^{2}\), and only then can we divide by \(r\). So we lose some parallelism in h-mul, but, as we will soon see, we minimise the amount of overflow.
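In code, the halving schedule amounts to repeated applications of a single "fold" step; below is a minimal sketch under our own helper names, not the actual h-mul code:

```rust
// One "divide by r^k, then add" step:
// out = (acc >> k limbs) + rho * (acc mod r^k),
// where rho is the 4-limb constant r^{-k} mod p.
fn fold(acc: &[u64], k: usize, rho: &[u64; 4]) -> Vec<u64> {
    let (low, high) = acc.split_at(k);
    let mut out = vec![0u64; acc.len() + 1]; // headroom for the final carry
    out[..high.len()].copy_from_slice(high);
    for (i, &d) in low.iter().enumerate() {
        let mut carry = 0u64;
        for j in 0..4 {
            let t = (out[i + j] as u128) + (d as u128) * (rho[j] as u128) + (carry as u128);
            out[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        // propagate the leftover carry
        let mut idx = i + 4;
        while carry != 0 {
            let t = (out[idx] as u128) + (carry as u128);
            out[idx] = t as u64;
            carry = (t >> 64) as u64;
            idx += 1;
        }
    }
    out
}

// h-mul schedule for n = 4 (RHO_2, RHO_1 are the precomputed constants):
// let acc = fold(&c, 2, &RHO_2);   // c * r^{-2}
// let acc = fold(&acc, 1, &RHO_1); // c * r^{-3}
// then one CIOS-style round: m = mu0 * acc[0], answer = (acc + m*p) / r
```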
From here on, we just do CIOS.
For each division operation, we blow up the current sum by at most \(rp\). See below for \(n=4\):
\[ \begin{align} cr^{-2} &< \frac{p^2}{r^2} + \rho_2(c_1r + c_0) \\[10pt] &< \frac{p^2}{r^2} + \rho_2(r^2) \\[10pt] cr^{-3} &< \frac{p^2}{r^3} + \rho_1r + \rho_2r \end{align} \]
In h-mul, we only divide twice to get \(Q\), whereas in s-mul we divide three times before adding. This is why the number of subtractions in h-mul will always be one less than the number needed for s-mul.
If you stare at the code long enough, you’ll confirm that s-mul uses 3 precomputed constants, U64_I3, U64_I2, U64_I1. On the other hand, h-mul uses just U64_I2, U64_I1. These are the \(\rho\)’s.
Note that this strategy gives rise to a family of algorithms – we can employ yet another division strategy: we divide once by \(r^{2}\), and then play CIOS.
\[ \begin{align} cr^{-2} &\equiv \Big(c_{7}r^{5} + \ldots +c_{3}r + c_2 \Big) + \rho_2(c_1r + c_0) \\[10pt] &\equiv Q +R \end{align} \]
In this case, the number of subtractions is always one less than that of h-mul but one more than c-mul.
Thus, s-mul can be thought of as maximising parallel division operations before adding, and c-mul can be thought of as minimising division operations before an add. They are the two extremes of this family of algorithms. Using the above analysis, one can easily show the following:
Theorem
Let \(u\) be the number of multiplication operations used by the CIOS algorithm with inputs \(a\) and \(b\) that are each at most \(p\), and \(v\) the number of addition operations (including the reduction phase). Then, for any \(0 \le k \le n - 1\), if we divide \(ab\) by \(r^k\) before adding, we use \(k\) fewer multiplication operations than CIOS. If we have \(m\) rounds of division, then we use \(m\) more addition operations than CIOS (due to the reduction step of making the answer less than \(p\)). For example, with \(n=4\), a single round of division by \(r^2\) before adding saves \(k=2\) multiplications at the cost of \(m=1\) extra addition – exactly the trade-off we will exploit below.
Missing Factor Of \(1p\)
At this point, there is one question left. We proved that, in theory:
- s-mul needs at most 4 subtractions and saves 3 multiplications.
- h-mul needs at most 3 subtractions and saves 3 multiplications.
- c-mul needs at most 1 subtraction.

But why do two subtractions work out for h-mul? Shouldn’t it be three? Similarly, why did three work for s-mul?
Re-examining the analysis for h-mul:
\[cr^{-4} < \Big(\frac{Q}{r^3} + r(\rho_1 + \rho_2) + mp \Big)/r < 4p\]
The \(4p\) is a \(1p\) improvement on the \(5p\) we proved for s-mul. The reason we get away with at most three subtractions in practice for s-mul and two for h-mul has to do with the specific value of \(p\) for the bn-254 scalar field. If we tighten our analysis, \(\rho_1\) and \(\rho_2\) are fixed, and the largest value for \(Q\) is \((p-1)^2\). Plugging in exact numbers:
\[cr^{-4} < \frac{p^2}{r^4} + (\rho_1 + \rho_2) + p < 2.25p\]
It turns out to be less than \(3.25p\) for s-mul.
It’s the bn-254 prime modulus \(p\) that allows us to do one subtraction less for h-mul and s-mul than what theory would suggest for a general prime \(p\). If \(p\) were only a little smaller than \(r^n\), then we’d have to put in one extra reduction for each of these algorithms.
The Need For Speed
Remember, we began with the goal of speeding up Montgomery multiplication. So far, we have only understood correctness to make Jolt unit tests pass. We had to add extra instructions to handle overflow. Do the speedups previously observed still exist?
Without Subtractions
The first step is to remove all subtractions from all algorithms and observe how fast things are.
warning: `minimal_mult` (example "ark-multiplication") generated 1 warning
Finished `release` profile [optimized] target(s) in 3.76s
Running `target/release/examples/ark-multiplication`
100000 Trials of 100 chained multiplications
Benchmarking done
Average Advantage: 11.905494513610297
C-mult 1779.19279
H-mult 1567.37109
dtype: float64
We are 11% faster. Note that it is very difficult to pinpoint whether this is due to fewer multiplications or due to how we wrote the code without contiguous memory, but with 100,000 random inputs and 100 chained multiplications, we appear to be consistently faster. Although fast, this code will not pass the tests – we need the answer to be reduced \(\mod p\).
Putting In The Subtractions
Now, for starters, ignoring the fact that h-mul actually needs two conditional subtractions, we stick one conditional subtraction onto both c-mul and h-mul. We should still have a speed advantage, right? After all, we had one algorithm faster than the second algorithm, and we added the same line of code to both.
Finished `release` profile [optimized] target(s) in 0.80s
Running `target/release/examples/ark-multiplication`
100000 Trials of 100 chained multiplications
Benchmarking done
Average Advantage: -5.434536875745394
C-mult 1791.88866
H-mult 1889.26951
dtype: float64
We are 5% slower. With a little thought and listening to Matt Godbolt, this observation is not surprising. As we are writing CPU code, the answer is branch prediction. The CPU gets it wrong a lot for h-mul. Why? It’s because the line of code we added was
if a >= p {
    a = a - p
}
Based on our theory so far, we know that c-mul overflows much less frequently than h-mul – with random inputs \(\mod p\), we see that CIOS only overflows above \(p\) 5% of the time, whereas h-mul overflows 68% of the time. If things were distributed uniformly6, this would imply that 95% of the time, the if statement evaluates to false for c-mul. So the CPU can pretty much always skip this line of code, and be wrong only 5% of the time.
If we wish to double down on this theory, we just keep the if check but remove the actual subtraction from the reduction (this will break correctness, but we’re only checking whether our theory is right). All our speed is recovered. The branch prediction is now the same for both code bases, as the compiler will just get rid of the empty check in the first place.
100000 Trials of 100 chained multiplications
Benchmarking done
Average Advantage: 11.633622123165752
C-mult 1804.97929
H-mult 1594.99482
C-overflow 0.05428
H-overflow 0.68088
dtype: float64
Aside: Same Code Two Conclusions
As a brief distraction, in this section, we give an example of how finicky performance benchmarking can be.
One of the reasons we embarked on this journey was that preliminary experimentation showed 10-15% speedups7. Without the subtractions, we have a clear understanding of why this was the case (fewer multiplications and no big read/write chunks). However, even if we add the requisite subtractions to the Ingo skyscraper code and run cargo bench based on the Ingo skyscraper benches, we see something different from what’s discussed above.
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running benches/my_benchmark.rs (target/release/deps/my_benchmark-8968c072d5fb1ae5)
Arkworks CIOS (10 chained c-mul, looped)
time: [1.9783 µs 1.9847 µs 1.9929 µs]
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
h-mul (10 chained h-mul, looped)
time: [1.8386 µs 1.8411 µs 1.8442 µs]
h-mul is FASTER even with the subtractions!!!! How can this be? This contradicts everything we have discussed so far. In terms of the body of the algorithms, this bench and our tests run the EXACT same code, and yet we have totally different outcomes. A lazy conclusion might be variance – but that’s not true. We can randomise, bootstrap, and build stratified experiments; h-mul is still marginally faster.
The real answer can be found by thinking about the CPU’s branch prediction algorithm and then staring at the inputs to the algorithm.
All of Montgomery’s theory relies on the fact that the initial inputs \(a\) and \(b\) are at most \(p-1\). In the Ingo skyscraper code, the inputs are random \(\mod r^4\). This strips away the overflow-prevention property of CIOS – therefore, we lose the branch-prediction gains that make CIOS fast in practice. Now the CPU cannot predict the branch, and this is much closer to the no-subtraction situation. Of course, we still lose a bit, as h-mul needs 2 subtractions and 2 checks as opposed to 1 – hence we don’t see the 11% speedup, but we still see some speedup.
In conclusion, something as innocuous as random inputs can lead to completely different conclusions about the real-world performance of code.
So Where Do We Go Now?
At this point, where do we stand? We have a family of algorithms for Montgomery multiplication. If we are not bothered about reducing \(\mod p\), then s-mul should be the fastest algorithm, owing to its maximal parallelism (allowing maximal out-of-order execution on the CPU). But we need the final answer to be less than \(p\) for the rest of the Jolt setup to work. Does this mean that for Jolt, CIOS remains the best algorithm to use?
We try a middle ground. We use our theory framework so far to come up with an alternate strategy: we will save 2 multiplications (instead of 3), but we also need to reduce one fewer time compared to h-mul. By our analysis and the actual value of \(p\) in bn-254, we will need at most one subtraction now, as opposed to the two for h-mul (just like CIOS). We hope the number of overflows will shrink, and that we therefore save on branching. Note, it will never be as small as CIOS, but we also have 2 fewer multiplications. We hope that, on average, this gives us the speedups we are after.
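Schematically, for \(n=4\) this is one \(\rho_2\) fold followed by two CIOS-style rounds:
\[ \begin{align} cr^{-2} &\equiv \Big(c_{7}r^{5} + \ldots + c_{3}r + c_2 \Big) + \rho_2(c_1r + c_0) \mod p \\ cR^{-1} \equiv cr^{-4} &\equiv ((cr^{-2} + m_0p)r^{-1} + m_1p)r^{-1} \mod p \end{align} \]
where \(m_0, m_1\) are the usual per-round CIOS constants.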
Unfortunately, we do not have a prior implementation of this algorithm that we can fork – so we must write one from scratch. Our implementation can be found here, and it sticks to the code formatting of the Ingo skyscraper implementation, which we found to be fast. Let’s call this code a-mul.
Without subtractions, a-mul is 9% faster. As expected, we are a little slower than h-mul. But the percentage of the time we overflow is much smaller – 27% as opposed to 68% with random inputs \(\mod p\).
100000 Trials of 100 chained multiplications
Benchmarking done
Average Advantage: 9.704649409540085
C-mult 1778.47723
H-mult 1605.88225
C-overflow 0.05365
H-overflow 0.27255
dtype: float64
Putting the subtractions back in, we are marginally faster than c-mul in the worst case with fully random inputs.
300000 Trials of 100 chained multiplications of random values
300000 Trials of 1 chained multiplications of small values
Scanning file: arkworks_multiplication.csv
Average Advantage (%): 2.011834252517198
C-mult 2213.471443
H-mult 2168.940067
C-overflow 0.000000
H-overflow 0.000000
dtype: float64
Now, what would happen if we could control the inputs to be small? This will prevent overflows. We should recover speed, right?
Scanning file: arkworks_multiplication_small.csv
Average Advantage (%): 16.30465277384584
C-mult 28.788490
H-mult 24.094627
C-overflow 0.000000
H-overflow 0.000000
dtype: float64
It’s exactly what we expect. When we force the inputs \(a\) and \(b\) to small values (note: small in Montgomery form, so \(a = a'r \mod p\) and \(b=b'r \mod p\), where \(a'\) and \(b'\) may or may not be small), we kill overflows, and we get the speedups we were after. Now the CPU’s branch prediction, which is roughly based on what’s happened most recently, gets it right.
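For reference, forcing the Montgomery representation itself to be small can be done along these lines, assuming arkworks’ ark_bn254::Fr and its new_unchecked constructor, which treats its argument as already being in Montgomery form (the exact sampling in our benchmark may differ):

```rust
// Sketch: build field elements whose *stored* (Montgomery-form) limbs are tiny.
// Because new_unchecked skips the conversion, the logical value
// a' = a * R^{-1} mod p is essentially arbitrary; only the representation is small.
use ark_bn254::Fr;
use ark_ff::BigInt;
use ark_std::{rand::Rng, test_rng};

fn small_montgomery_inputs(n: usize) -> Vec<Fr> {
    let mut rng = test_rng();
    (0..n)
        .map(|_| Fr::new_unchecked(BigInt::from(rng.gen::<u16>() as u64)))
        .collect()
}
```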
But when we run with random numbers again, we’re back to essentially the performance of CIOS. In conclusion, the number of multiplications is one source of slowdown in the real world, but there are factors where CIOS does better. Now, a priori, if you knew the input ranges to your algorithm, then you could predict how often each of them will overflow. This would allow you to use the right level of parallelism and pre-computation to get maximal speed. For example, if you know that the inputs are near \(p-1\), you’d likely still use CIOS. If you knew the answers would be very small, then you’d use s-mul. For anything in between, you’d want to use the mini-theorem above and figure out how to balance the number of multiplications vs overflow.
Instructions For Running Benchmarks Discussed Above
To get individual times on chained multiplications:
git clone https://github.com/abiswas3/Montgomery-Benchmarks.git
cd Montgomery-Benchmarks
cargo bench
To get time comparisons inside of arkworks, run the following command:
cargo run --example ark-multiplication --release && python3 plotting.py arkworks_multiplication.csv
For squaring, run the following (we have not yet optimised the squaring library to use a-mul):
cargo run --example ark-squaring --release && python3 plotting.py arkworks_squaring.csv
This specific implementation is optimised for parallelism instead – at least, looking at all the SIMD code, that’s what it seems.↩︎
unchanged from how we found them in the wild↩︎
Limbs here just mean digits. A 4-limb number is a 4-digit number where each digit is \(w\) bits large, i.e. each digit is between 0 and \(2^w -1\).↩︎
First add \(mp\) to \(c\), then divide \((c+mp)/r\).↩︎
We guarantee that \(\rho_i < p\)↩︎
Which they are not.↩︎
Remember the bench code in the alternate implementation?↩︎