Introduction

On April 7th, 2025, NEAR requested zkSecurity to perform a security assessment of the Rust p256 crate (https://crates.io/crates/p256). No major issues were found; the codebase is thoroughly tested and well-architected.

In-Scope Components

The scope of the audit included:

  • Elliptic Curve. The p256 elliptic curve group operations, including (de)serialization and exposed functionalities relevant to NEAR.
  • ECDSA. ECDSA-related code, focused on signature verification (as it is NEAR’s main use case).
  • Field Arithmetic. The internal scalar and base field implementations for the curve, including dependencies on the crypto-bigint crate (https://github.com/RustCrypto/crypto-bigint).

Out-of-Scope Components

As the usage mainly focuses on signature verification on 32-bit machines, the following items were not the focus of this audit:

  • Signature generation. The audit focused on signature verification.
  • Constant-time code. Due to the previous point, while we did look at the constant-time code to ensure correctness of implementation, we did not focus on ensuring that the code was indeed constant-time (as signature verification does not involve secret data).
  • 64-bit specific code. This matters because many algorithms are implemented separately for 32-bit and 64-bit machines, so the absence of bugs in the 32-bit implementation does not necessarily mean that the 64-bit implementation is bug-free.
  • Unrelated primitives. We did not look at ECDH or other unrelated primitive implementations, as they are not used by NEAR.

Code Base Reviewed

Note that the audit focused on the latest commit of the p256 crate at the time of the audit, 32343a78f1522aa5bd856556f114053d4bb938e0, which integrated changes such as the move from generic-array to hybrid-array (https://github.com/RustCrypto/elliptic-curves/pull/1011) that are not yet used in NEAR's code. We expect NEAR to upgrade to newer releases as they become available.

Reference Integrations Provided by NEAR

In addition, we were provided the following PRs as examples of NEAR's usage:

Overview Of The p256 Crate

The p256 crate implements the NIST P-256 elliptic curve along with a number of cryptographic schemes that make use of P-256. The P-256 curve is a widely used elliptic curve defined over a prime field and with a prime order. We define the curve later in this document.

Organization Of The Code

The p256 crate is part of the elliptic-curves repository and provides high-level APIs to access primitives (like ECDSA and ECDH) built on top of P-256 as well as low-level P-256 functionalities. The crate includes local code but also builds on top of many other dependencies from the same ecosystem of RustCrypto crates:

The signature primitives rely on:

Note that, as mentioned in the introduction, we did not look into the generic-array crate (https://github.com/fizyk20/generic-array) as it is not used anymore in the latest commit of the p256 crate. Furthermore, while we looked at parts of the subtle crate (https://github.com/dalek-cryptography/subtle) to understand its logic, we did not focus on ensuring that it indeed provides constant-time operations, as this was not the focus of the audit.

The P-256 Curve

The P-256 elliptic curve is a widely used elliptic curve defined over a prime field, and with a prime order. It is known under different names and specified in different standards: P-256 in NIST’s SP 800-186 (https://csrc.nist.gov/pubs/sp/800/186/final), secp256r1 in SECG’s SEC2v1 (https://www.secg.org/SEC2-Ver-1.0.pdf), prime256v1 in ANSI X9.62 (https://www.x9.org/).

The curve is defined by the short Weierstrass equation:

y^2 = x^3 + ax + b

over the finite field Fp, where:

  • p = 2^256 - 2^224 + 2^192 + 2^96 - 1
  • a = -3
  • b=41058363725152142129326129780047268409114441015993725554835256314039467401291

The following is an implementation of the P-256 curve in SageMath:

p = 0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff
a = 0xffffffff00000001000000000000000000000000fffffffffffffffffffffffc
b = 0x5ac635d8aa3a93e7b3ebbd55769886bc651d06b0cc53b0f63bce3c3e27d2604b
G = (0x6b17d1f2e12c4247f8bce6e563a440f277037d812deb33a0f4a13945d898c296, 0x4fe342e2fe1a7f9b8ee7eb4a7c0f9e162bce33576b315ececbb6406837bf51f5)
n = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551
h = 0x1

# Create the base field
Fp = FiniteField(p)

# Define the elliptic curve E: y^2 = x^3 + ax + b over Fp
E = EllipticCurve(Fp, [a, b])

# Define the generator point G
G = E(G[0], G[1])

Some of these properties have implications for the performance and security of the implementation.

Optimize multiplication operations. The constant coefficient a = -3 allows performance optimizations for point addition and doubling, by avoiding full field multiplications by the constant a.

Optimize modular arithmetic. The sparse structure of the prime field p can be represented as 32-bit limbs in little-endian format:

[0xffffffff, 0xffffffff, 0xffffffff, 0x00000000, 0x00000000, 0x00000000, 0x00000001, 0xffffffff]

These simplified values in each limb allow for efficient modular arithmetic using bitwise operations during the reduction process.

Avoid small subgroup attack. When the cofactor h is 1, all points on the curve belong to a single subgroup of order n. As there are no small subgroups, extra checks against small subgroup attacks are unnecessary.

BigInt Implementation Overview

In order to take advantage of the architecture’s native word size, the crypto-bigint crate uses a representation of integers based on machine words. This allows for efficient arithmetic operations on large integers.

To implement this, the crate defines the type Word (and the wrapper type Limb(Word)) as u32 or u64, depending on the target_pointer_width Rust configuration.

A Uint<LIMBS> type is then defined generically as a fixed-size array of Limbs. For example, the p256 code will make use of Uint<8> (resp. Uint<4>) to represent 256-bit integers on 32-bit (resp. 64-bit) architectures.

Furthermore, limbs are ordered in a little-endian way, meaning the least-significant limb comes first (on the left). This allows for efficient serialization and deserialization of integers, as well as efficient arithmetic operations, without the need to reorder limbs on little-endian architecture (which makes up the majority of modern systems).

For example, any u8 value n can be represented as [n, 0, 0, 0, 0, 0, 0, 0] on a 32-bit architecture.
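
To make the limb representation concrete, here is a small Python sketch (ours, not the crate's code) that splits a 256-bit integer into eight little-endian 32-bit limbs and recombines it, reproducing the limb representation of the P-256 prime p shown earlier:

# Python sketch: little-endian 32-bit limb decomposition of a 256-bit integer,
# mirroring the idea behind Uint<8> on a 32-bit target (not the crate's code).

def to_limbs(x, num_limbs=8, bits=32):
    mask = (1 << bits) - 1
    return [(x >> (bits * i)) & mask for i in range(num_limbs)]

def from_limbs(limbs, bits=32):
    return sum(limb << (bits * i) for i, limb in enumerate(limbs))

p = 0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff
limbs = to_limbs(p)
assert limbs == [0xffffffff, 0xffffffff, 0xffffffff, 0x00000000,
                 0x00000000, 0x00000000, 0x00000001, 0xffffffff]
assert from_limbs(limbs) == p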

Field Implementation

As seen above, the P-256 curve is defined over a prime field that we call the base field (implemented in the field submodule). The order of the curve (i.e. the number of points) is prime, producing a field we call the scalar field (implemented in the scalar submodule).

Both fields are of size 256-bit, and have similar implementations relying on the same U256 = Uint<8> type (on a 32-bit machine).

A notable difference is that:

  • the base field implementation is based on the Montgomery reduction
  • the scalar field implementation is based on Barrett reduction

As Montgomery reduction requires converting inputs to Montgomery form (and then back to canonical form), it is generally best suited for applications that perform many multiplications in a row. Barrett reduction, on the other hand, is more efficient for applications that perform only a few multiplications. ECDSA signature verification follows the latter approach, and thus Barrett reduction is used.

We also note that while the underlying implementation supports Karatsuba multiplication for larger integers (i.e. with at least 16 limbs), the P-256 implementation uses schoolbook multiplication over the integers.

In the next sections we explain the implementations based on Montgomery and Barrett reduction.

Montgomery Form and Montgomery Reduction

Montgomery reduction is explained at a low level in section 14.3.2 of the Handbook of Applied Cryptography. Below, we provide a more approachable explanation of the algorithm and its implementation.

What Is The Montgomery Form?

For a modulus m and an element x modulo m, we can shift them with some known Montgomery value R>m in the following way:

x̄ = x·R mod m

This is the Montgomery form of a value x!

Naturally, we can unshift to recover the value:

x̄·R^(-1) = x mod m

Montgomery reduction does exactly this: it multiplies a value with the inverse of R and produces the result modulo m. As such, it should be immediately clear that Montgomery reduction is at least useful to convert a value in Montgomery form back to its canonical form.

Note that for the inverse of R modulo m to exist we need R and m to be coprime. In other words, we need gcd(m,R)=1.

When we perform field operations (add and multiply) we constantly need to perform modular reductions (to reduce the result of an operation if it became larger than m). Modular reductions can get quite expensive. In fact, they are the most expensive part of handling field elements. For example, reducing a large number modulo a 256-bit prime involves multiple division and subtraction steps, which are computationally intensive.

On the other hand, it turns out that modular reductions in the Montgomery form can be faster to compute. For example, consider two elements in Montgomery form, a·R and b·R, each 256 bits. When we multiply them, the result is c = a·b·R^2, a 512-bit value. Using Montgomery reduction, we can efficiently reduce c modulo m to obtain d = c·R^(-1) mod m. This result, d = a·b·R mod m, is still in Montgomery form and ready for further operations.

Note that for addition things are much simpler, adding two values in Montgomery form still gets us a value in Montgomery form which can be easily reduced by subtracting m if needed. As such, Montgomery form is useful mainly for optimizing modular reductions after multiplications.
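
As a small numeric sanity check (a Python sketch with toy values, not the crate's code), multiplying two values in Montgomery form and then multiplying by R^(-1) indeed yields the product, still in Montgomery form:

# Python sketch: the product of two Montgomery-form values, once reduced by
# R^(-1), is the Montgomery form of the product (illustration only).

m = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551  # P-256 order
R = 2**256
R_inv = pow(R, -1, m)

a, b = 1234567, 7654321
a_mont = (a * R) % m
b_mont = (b * R) % m

c = a_mont * b_mont                     # = a*b*R^2, a 512-bit value
d = (c * R_inv) % m                     # Montgomery reduction, done naively here
assert d == (a * b * R) % m             # still in Montgomery form
assert (d * R_inv) % m == (a * b) % m   # converting back yields the product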

It remains to show: why is modular reduction faster in Montgomery form, and how to perform that modular reduction.

Let’s tl;dr the first question first: why is it faster? This is because the Montgomery value is usually chosen as R = 2^n, for n big enough such that R > m, which makes operations extremely straightforward and fast, as we can use bitwise operations which are cheap in hardware. Indeed, if you didn’t know that yet, a left shift << is a multiplication by 2 and a right shift >> is a division by 2.

Next, we explain how the Montgomery reduction works.

How Does The Montgomery Reduction Work?

Notice that we would like to perform the following division by R over the integers to simulate the multiplication by R^(-1):

x̄ / R

This will only work if x¯ is divisible by R in the integers. As most likely it is not, we need to find a different element of the same residue class that is divisible by R. That is, a k such that x¯+k·m is divisible by R in the integers. Or in other words:

x̄ + k·m = 0 mod R

We can directly solve for k:

k = -m^(-1)·x̄ mod R

Finally, if this all makes sense, you now understand that we can calculate x¯R1 modulo m using the following formula in the integers:

(x̄ + (-m^(-1)·x̄ mod R)·m) / R

This should give us a value that is equivalent to x̄·R^(-1) mod m, but not necessarily the standard representative in [0, m).

In the bound analysis we will see that the above formula is in [0, 2m), producing either exactly x̄·R^(-1) mod m (the “canonical representative”) or that value plus m (which is easy to correct).

Here’s a quick way to see that in SageMath:

class Montgomery:
    def __init__(self, modulus, R):
        self.modulus = modulus
        self.R = R
        assert R > modulus
        assert gcd(modulus, R) == 1

    def to_montgomery(self, x):
        return (x * self.R) % self.modulus

    def from_montgomery(self, x):
        R_inv = inverse_mod(self.R, self.modulus)
        return (x * R_inv) % self.modulus

    def redc(self, x):
        assert x >= 0
        assert x < self.modulus * self.R

        # this can be precomputed
        inv_mod_R = inverse_mod(-self.modulus, self.R)

        # we find a representative that is divisible by R
        k = (x * inv_mod_R) & (self.R - 1)  # k = (x * -m^(-1)) % R
        numerator = x + k * self.modulus
        assert numerator % self.R == 0

        # numerator / R
        res = numerator >> self.R.log(2)

        # we correct the representative we found
        if res >= self.modulus:
            return res - self.modulus
        else:
            return res

# examples
M = Montgomery(modulus=13 * 17, R=2^8) # we use an RSA modulus in this example

# (notice that it works for 0 and 1 as well which are special cases)
for x in range(0, 100):
    assert M.redc(M.to_montgomery(x)) == x

Bound analysis

In Montgomery reduction, x̄ is usually the result of multiplying two field elements. Thus, we assume x̄ < m·m < R·m. Then we have

(x̄ + (-m^(-1)·x̄ mod R)·m) / R < (R·m + R·m) / R = 2m

This shows that the above formula is within the range of [0,2m).

Word-by-Word Montgomery Reduction

In practice, Montgomery reduction is usually implemented in a word-by-word fashion. Rather than computing with large integers directly, the algorithm operates over fixed-size words—typically 32 or 64 bits wide—and processes them one at a time. This enables the use of fast hardware operations like word multiplication, and allows division to be replaced with bit shifts, which are significantly more efficient.

Let the word size (or base) be b (usually 2^32 or 2^64), and suppose the modulus m is an n-word integer. Define R = b^n and assume gcd(R, m) = 1. The goal is to compute the Montgomery reduction of a 2n-word integer T (we assume T < m·R), i.e., compute:

T·R^(-1) mod m

The idea is to reduce T by one word at a time until it becomes an n-word integer. Notice that

T·R^(-1) mod m = T·b^(-n) mod m = ((T·b^(-1))·b^(-1))⋯·b^(-1) mod m = ((T·b^(-1) mod m)·b^(-1) mod m)⋯·b^(-1) mod m

We can just divide T by b in each round, and repeat for n rounds to produce the final T·b^(-n). In each round, the limb size of T decreases by one, so after n rounds T becomes an n-word integer, which is close to the final target. Below is the process:

  • Step 0. Initialize T_0 ← T.

  • Step 1. Compute T_1 ← (T_0 + k_1·m) / b, where k_1 = -m^(-1)·T_0 mod b. This ensures that the numerator is divisible by b. Now, T_1 = T_0·b^(-1) mod m, and T_1 is a (2n-1)-word integer.

  • Step i (from 1 to n). Compute T_i ← (T_{i-1} + k_i·m) / b, where k_i = -m^(-1)·T_{i-1} mod b. This reduces the word size of T_i by one in each step, ensuring T_i = T_0·b^(-i) mod m.

  • Final Step. After n iterations, T_n is an n-word integer equivalent to T·R^(-1) mod m. If T_n ≥ m, perform a final subtraction: T_n ← T_n - m.

Note that the calculation of k_i can be simplified. First, precompute m' = -m^(-1) mod b. Then, T_{i-1} mod b is simply the least significant limb of T_{i-1}, denoted as T_{i-1}[0]. Thus, k_i can be computed as:

k_i = m'·T_{i-1}[0] mod b

The overall algorithm proceeds as follows:

  1. For each step i from 1 to n, reduce T_{i-1} by one word:
     k_i ← T_{i-1}[0]·m' mod b
     T_i ← (T_{i-1} + k_i·m) / b
  2. If T_n ≥ m, then T_n ← T_n - m.
  3. Output T_n.

This algorithm is efficient because it avoids costly division operations. Instead, it uses word-sized multiplication and addition, which are highly optimized on modern CPUs, making it well-suited for large integers.
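
The following Python sketch (ours, not the crate's Rust code) implements this word-by-word reduction with b = 2^32 over the P-256 base field, and checks it on a couple of values:

# Python sketch of word-by-word Montgomery reduction with b = 2^32 (illustration only).

b = 2**32
p = 0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff
n_limbs = 8                       # p fits in 8 32-bit words
R = b**n_limbs                    # R = b^n = 2^256
m_prime = (-pow(p, -1, b)) % b    # m' = -p^(-1) mod b, precomputed

def montgomery_reduce(T):
    """Compute T * R^(-1) mod p for 0 <= T < p*R."""
    assert 0 <= T < p * R
    for _ in range(n_limbs):
        k = ((T & (b - 1)) * m_prime) % b   # k = T[0] * m' mod b
        T = (T + k * p) // b                # exact division by b
    return T - p if T >= p else T           # final conditional subtraction

x = 0x1234567890abcdef
assert montgomery_reduce(x * R) == x                  # converts back from Montgomery form
u, v = 3141592653589793, 2718281828459045
T = ((u * R) % p) * ((v * R) % p)                     # product of two Montgomery forms
assert montgomery_reduce(T) == (u * v * R) % p        # result is u*v in Montgomery form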

Barrett Reduction

Barrett reduction is an algorithm to reduce a value modulo m. We have a value x in [0, m^2), because it is typically the result of multiplying two values modulo m. We would like to reduce this value modulo m.

Intuition on why it’s fast

Barrett Reduction is a way to do that efficiently, relying on the same principles of the Montgomery reduction algorithm:

  • we work with limbs of size b, the size of the machine words on which we're executing the algorithm
  • we mostly use divisions, multiplications, and modular operations where one operand is a power of 2, because these are easy to compute on binary machines (x / 2^n translates to x >> n, x·2^n is x << n, and x mod 2^n is x & (2^n - 1)).

To make things even more optimal, we also define limbs using a base b corresponding to the machine word size (say b = 2^32 or b = 2^64), and k such that b^k > m. In other words, k is the smallest integer such that m fits in 32·k or 64·k bits.

Main idea

We want to compute r = x mod m, which is the same as writing x = q·m + r for some q.

One technique to compute r is to compute q as:

q = ⌊x / m⌋

And once you find q, compute r = x - q·m. Barrett reduction allows us to compute q (or, as you will see, an approximation of q) more efficiently than with the formula above.

First, let’s write the formula above as:

q = ⌊x/m⌋ = ⌊(x/m)·(b^(2k)/b^(2k))⌋ = ⌊(x/m)·b^(2k) / (b^(k+1)·b^(k-1))⌋ = ⌊(b^(2k)/m) · (x/b^(k-1)) · (1/b^(k+1))⌋

This allows us to precompute μ = b^(2k)/m (in practice, its integer part ⌊b^(2k)/m⌋) and to rewrite the above as:

q = ⌊(x / b^(k-1)) · μ / b^(k+1)⌋

So far our computation is exact. But instead of computing q exactly, we’ll approximate it by computing:

q̃ = ⌊⌊x / b^(k-1)⌋ · ⌊b^(2k) / m⌋ / b^(k+1)⌋

Note: Notice that ⌊· / b^(k-1)⌋ and ⌊· / b^(k+1)⌋ are fast to compute with right shifts.

Since the two approximations (the inner floors) can be smaller than their exact counterparts, we have that q̃ ≤ q. The analysis section will show that q̃ ∈ [q-2, q].

Computing r in x ≈ q̃·m + r

If we had computed exactly q then what would be left for us to do would be to compute the result as:

r = x - q·m

and since we know that r ∈ [0, m), we also know that only the least significant part (up to the size of m) matters in that subtraction. As such, we can also compute it faster by only involving the least significant k words, i.e., working modulo b^k:

r = (x - q·m) mod b^k = ((x mod b^k) - (q·m mod b^k)) mod b^k

(where reduction modulo b^k can be computed efficiently)

But remember, we have computed an approximation q̃ of q, that is q̃ ∈ [q-2, q].

We know that either:

  1. r = x - q̃·m
  2. r = x - (q̃+1)·m
  3. r = x - (q̃+2)·m

So what we do is that we compute 1 first, then if we don’t get something smaller than m we subtract by m once or twice to get there.

This means that we might not have r̃ = x - q̃·m < m right away, but instead a value that is larger by m or 2m.

Since m < b^k, we have that 2·m < 2·b^k. This allows us to upper-bound our approximation:

r ≤ r̃ ≤ r + 2m < b^k + 2·b^k < b^(k+1)

Then we can calculate r~ in a more efficient way:

r̃ = ((x mod b^(k+1)) - (q̃·m mod b^(k+1))) mod b^(k+1)

Again, the mod b^(k+1) operation here is very efficient on binary machines.
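
The following Python sketch (ours, for illustration only) implements this Barrett reduction with b = 2^32 and k = 8, and checks it against the P-256 scalar field order:

# Python sketch of Barrett reduction with b = 2^32 and k = 8 (illustration only).

import random

b = 2**32
k = 8
m = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551  # P-256 order
mu = b**(2 * k) // m     # precomputed mu = floor(b^(2k) / m)

def barrett_reduce(x):
    assert 0 <= x < m * m
    q_approx = ((x >> (32 * (k - 1))) * mu) >> (32 * (k + 1))
    r = x - q_approx * m
    # q_approx is at most 2 below the true quotient, so at most two corrections
    while r >= m:
        r -= m
    return r

for _ in range(1000):
    x = random.randrange(m * m)
    assert barrett_reduce(x) == x % m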

Analysis of the bound of Barrett reduction

Let’s analyze the bound of the computed q̃ relative to the actual quotient q, where

q = ⌊x / m⌋
q̃ = ⌊⌊x / b^(k-1)⌋ · ⌊b^(2k) / m⌋ / b^(k+1)⌋

First, since q̃ is derived from a truncated approximation of x/m, it naturally satisfies q̃ ≤ q.

Let’s denote α = x mod b^(k-1), where α < b^(k-1). Denote β = b^(2k) mod m, where β < m. Then we can remove the inner floor operations:

⌊x / b^(k-1)⌋ = (x - α) / b^(k-1)
⌊b^(2k) / m⌋ = (b^(2k) - β) / m

Then we can simplify q̃:

q̃ = ⌊((x - α) / b^(k-1)) · ((b^(2k) - β) / m) / b^(k+1)⌋ = ⌊(x - α)·(b^(2k) - β) / (m·b^(2k))⌋ = ⌊x/m - (α·b^(2k) + β·(x - α)) / (m·b^(2k))⌋

We use z to denote the subtracted term (α·b^(2k) + β·(x - α)) / (m·b^(2k)) (note that z ≥ 0). Then we have q̃ = ⌊x/m - z⌋. By the floor function inequality ⌊x⌋ + ⌊y⌋ + 1 ≥ ⌊x + y⌋, we have:

⌊x/m - z⌋ + ⌊z⌋ + 1 ≥ ⌊(x/m - z) + z⌋ = ⌊x/m⌋ = q

That is saying q̃ + ⌊z⌋ + 1 ≥ q.

The bound of z is the key to analyzing the bound of q̃. If we can prove that 0 ≤ z < 2, then we will have ⌊z⌋ ≤ 1 and finally q̃ + 2 ≥ q̃ + ⌊z⌋ + 1 ≥ q. That's exactly our goal. Let's analyze the bound of z.

Prove that z<2

We have:

z = (α·b^(2k) + β·(x - α)) / (m·b^(2k)) < (b^(k-1)·b^(2k) + β·b^(2k)) / (m·b^(2k)) = (b^(k-1) + β) / m

We know that b^(k-1) < m (because m is a k-limb integer) and β < m. Then we have:

z < (b^(k-1) + β) / m < (m + m) / m = 2

And thus:

⌊z⌋ ≤ 1

That’s exactly what we want. Therefore, we conclude that q̃ + 2 ≥ q̃ + ⌊z⌋ + 1 ≥ q.

Tighter bound: z<1 in practice

We have proved that z < 2 under all circumstances. However, that is a loose bound. Here we claim that for almost all moduli m in practice, we have a tighter bound: z < 1, and thus q̃ + 1 ≥ q̃ + ⌊z⌋ + 1 ≥ q.

Let’s see what kind of m gives that tighter bound. Remembering that z < (b^(k-1) + β) / m, z < 1 holds if the following holds:

(b^(k-1) + β) / m ≤ 1

Which means:

b^(k-1) + β ≤ m

Equivalently:

β ≤ m - b^(k-1)

That is, if β is in the range [0, m - b^(k-1)], then z < 1.

We know that β = b^(2k) mod m, so β is always in [0, m). In practice, b is 2^32 or 2^64 and m is close to b^k. Therefore m - b^(k-1) is very close to m, which means the range [0, m - b^(k-1)] covers most of the range [0, m).

Since β = b^(2k) mod m behaves like a uniformly random number in [0, m) for a random modulus m, it's highly likely that β ≤ m - b^(k-1). As a result, for most moduli m, we have z < 1 and thus q̃ + 1 ≥ q.

We assume that b^k / 2 < m < b^k (which is the common case in practice) and that β is uniformly random. Then

Pr[β ≤ m - b^(k-1)] = (m - b^(k-1)) / m > 1 - 2/b

  • If b = 2^32, then the probability is greater than 1 - 1/2^31.
  • If b = 2^64, then the probability is greater than 1 - 1/2^63.

That is, for nearly all moduli in practice, the tighter bound q̃ + 1 ≥ q holds.

The bound for P-256 scalar field

We have proven that, given a modulus m, if β ≤ m - b^(k-1) holds (where β = b^(2k) mod m), then the tighter bound q̃ + 1 ≥ q holds.

For the P-256 scalar field, it is not surprising that it satisfies β ≤ m - b^(k-1) for b = 2^32 or 2^64. This immediately implies the tighter bound:

q̃ + 1 ≥ q

That is, the computed q̃ is at most 1 less than the actual quotient q.

Therefore, the second subtraction (here and here) in the Barrett reduction, which conditionally subtracts another copy of the modulus, is unnecessary and can be safely removed in this case.
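
The condition can be checked directly. The short Python snippet below (ours) computes β for the P-256 scalar field with both b = 2^32 and b = 2^64 and reports whether β ≤ m - b^(k-1) holds:

# Python check (ours) of the condition implying the tighter Barrett bound
# for the P-256 scalar field.

m = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551

for bits, k in [(32, 8), (64, 4)]:
    b = 2**bits
    beta = pow(b, 2 * k, m)
    print(f"b = 2^{bits}: beta <= m - b^(k-1) is {beta <= m - b**(k - 1)}")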

Elliptic Curve Implementation Detail

The p256 module defines the basic properties of the curve shown in the P-256 section above, as well as the arithmetic for both the base field and the scalar field. The primeorder module provides the curve arithmetic specialized for the curve property a = -3.

Encoding Points

The primeorder module uses the crate sec1 to define how the points on the curve should be encoded and decoded. Depending on the encoding tag, a point can be encoded in different formats: identity, compressed, uncompressed, and compact.

Figure: point encoding.

Identity (0x00) represents the point at infinity (or the zero point), which has no affine coordinates. The encoding only uses one byte, which is the tag 0x00.

Compressed (0x02, 0x03) represents a point on the curve with only the x-coordinate provided. The y-coordinate is computed from the x-coordinate using the curve equation. Substituting the x-coordinate into the curve equation yields two possible y values, y and -y, resulting from the square root y = ±√(x^3 - 3x + b). Because the prime p is odd, one of y and p - y is even, and the other is odd.

The tag 0x02 is used to choose the even y-coordinate, while 0x03 is used to choose the odd one. Whether the y-coordinate is even or odd can be determined by computing y mod 2.

The total length of the compressed point is 33 bytes, which includes the tag byte and the x-coordinate (32 bytes).

Uncompressed (0x04) represents a point on the curve with both x and y-coordinates provided. The tag is followed by both the x-coordinate (32 bytes) and the y-coordinate (32 bytes). The total length of the uncompressed point is 65 bytes.

Compact (0x05) is similar to the compressed tags, but it always chooses the larger y-coordinate resulting from the square root. This encoding is not specified in the SEC1 v2 standard, but it is implemented in the sec1 crate dependency.
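
To make the compressed decoding concrete, here is a Python sketch (ours, not the sec1 crate): since p ≡ 3 (mod 4) for P-256, a square root can be computed with a single exponentiation, and the tag selects the parity of y:

# Python sketch (ours) of decoding a SEC1 compressed P-256 point.
# Since p % 4 == 3, sqrt(v) = v^((p+1)/4) mod p whenever v is a square.

p = 0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff
b = 0x5ac635d8aa3a93e7b3ebbd55769886bc651d06b0cc53b0f63bce3c3e27d2604b

def decode_compressed(encoded):
    assert len(encoded) == 33 and encoded[0] in (0x02, 0x03)
    x = int.from_bytes(encoded[1:], "big")
    rhs = (pow(x, 3, p) - 3 * x + b) % p     # x^3 - 3x + b
    y = pow(rhs, (p + 1) // 4, p)            # candidate square root
    assert (y * y) % p == rhs, "x is not on the curve"
    if (y & 1) != (encoded[0] & 1):          # tag 0x02 selects even y, 0x03 odd y
        y = p - y
    return x, y

# example: round-trip the P-256 generator point
gx = 0x6b17d1f2e12c4247f8bce6e563a440f277037d812deb33a0f4a13945d898c296
gy = 0x4fe342e2fe1a7f9b8ee7eb4a7c0f9e162bce33576b315ececbb6406837bf51f5
tag = 0x02 if gy % 2 == 0 else 0x03
assert decode_compressed(bytes([tag]) + gx.to_bytes(32, "big")) == (gx, gy)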

Formulas In Affine Coordinates

To perform curve arithmetic we need to implement one main low-level operation: point addition, and sometimes a distinct point doubling operation as well. The latter is needed because, when the two points are identical, the vanilla point addition formula would lead to a division by zero (unlike complete addition formulas, which handle this case).

When Q ≠ P, the formula for adding two distinct points is:

λ = (y2 - y1) / (x2 - x1),    x3 = λ^2 - x1 - x2,    y3 = λ(x1 - x3) - y1.

This formula can’t be applied to two identical points, because the denominator x2 - x1 in λ would be zero.

Therefore, when Q=P, the formula should be based on the tangent line of the curve at the point P, instead of the secant line between P and Q:

λ = (3·x1^2 + a) / (2·y1),    x3 = λ^2 - 2·x1,    y3 = λ(x1 - x3) - y1.

In affine form, these addition methods are sufficient to compute any point resulting from the scalar multiplication [k]P of a point on the curve.
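
A direct Python transcription of these affine formulas (ours, for illustration; it ignores the point at infinity, whereas the crate works in projective coordinates as described next):

# Python sketch (ours) of affine point addition and doubling on P-256.
# Division is implemented as multiplication by a modular inverse.

p = 0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff
a = p - 3
b = 0x5ac635d8aa3a93e7b3ebbd55769886bc651d06b0cc53b0f63bce3c3e27d2604b

def affine_add(P, Q):
    (x1, y1), (x2, y2) = P, Q
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # secant slope
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

G = (0x6b17d1f2e12c4247f8bce6e563a440f277037d812deb33a0f4a13945d898c296,
     0x4fe342e2fe1a7f9b8ee7eb4a7c0f9e162bce33576b315ececbb6406837bf51f5)

# sanity check: [3]G computed as (G + G) + G lies on the curve
x3, y3 = affine_add(affine_add(G, G), G)
assert (y3 * y3 - (x3**3 + a * x3 + b)) % p == 0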

Projective Formulas And Optimizations

The problem is that the division operation (realized as a multiplication by a field inverse) needed to compute λ is expensive and harder to perform in constant time. By converting the points to projective form, these divisions can be replaced with multiplications and additions, which are more efficient and easier to implement in constant time.

The implementation uses the algorithms from Renes-Costello-Batina 2015, which take advantage of the fact that the constant a = -3, allowing further optimizations.

Point addition follows Algorithm 4 from [RCB 2015], which expresses the result (X3, Y3, Z3) in projective coordinates:

X3 = (X1·Y2 + X2·Y1)(Y1·Y2 + 3(X1·Z2 + X2·Z1 - b·Z1·Z2)) - 3(Y1·Z2 + Y2·Z1)(b(X1·Z2 + X2·Z1) - X1·X2 - 3·Z1·Z2),
Y3 = 3(3·X1·X2 - 3·Z1·Z2)(b(X1·Z2 + X2·Z1) - X1·X2 - 3·Z1·Z2) + (Y1·Y2 - 3(X1·Z2 + X2·Z1 - b·Z1·Z2))(Y1·Y2 + 3(X1·Z2 + X2·Z1 - b·Z1·Z2)),
Z3 = (Y1·Z2 + Y2·Z1)(Y1·Y2 - 3(X1·Z2 + X2·Z1 - b·Z1·Z2)) + (X1·Y2 + X2·Y1)(3·X1·X2 - 3·Z1·Z2).

The constants 3 come from the curve property a = -3. Because the constant 3 is small, instead of a full field multiplication, the algorithm uses a doubling and an addition on the intermediate result to compute the multiplication by 3.

For example, it applies a doubling and an addition to the intermediate result X1·Z2 + X2·Z1 - b·Z1·Z2 to obtain the multiplication by 3, as shown in the code:

let bzz_part = xz_pairs - (C::EQUATION_B * zz); // 19, 20
let bzz3_part = bzz_part.double() + bzz_part; // 21, 22

For the point doubling formula, the addition formula can be simplified to the following, because the two points (X1, Y1, Z1) and (X2, Y2, Z2) are the same point (X, Y, Z):

X3 = 2·X·Y(Y^2 + 3(2·X·Z - b·Z^2)) - 6·Y·Z(2·b·X·Z - X^2 - 3·Z^2),
Y3 = (Y^2 - 3(2·X·Z - b·Z^2))(Y^2 + 3(2·X·Z - b·Z^2)) + 3(3·X^2 - 3·Z^2)(2·b·X·Z - X^2 - 3·Z^2),
Z3 = 8·Y^3·Z.

Similarly, instead of performing full multiplications by these small constants, the implementation uses field additions on the intermediate results.

Mixed Addition

Mixed addition is another optimization for point addition, in which one of the points is in affine coordinates and the other is in projective coordinates. When converting the affine point to projective coordinates, the converted point has Z2 = 1. Plugging this into the addition formula cancels the Z2 terms:

X3 = (X1·Y2 + X2·Y1)(Y1·Y2 + 3(X1 + X2·Z1 - b·Z1)) - 3(Y1 + Y2·Z1)(b(X1 + X2·Z1) - X1·X2 - 3·Z1),
Y3 = 3(3·X1·X2 - 3·Z1)(b(X1 + X2·Z1) - X1·X2 - 3·Z1) + (Y1·Y2 - 3(X1 + X2·Z1 - b·Z1))(Y1·Y2 + 3(X1 + X2·Z1 - b·Z1)),
Z3 = (Y1 + Y2·Z1)(Y1·Y2 - 3(X1 + X2·Z1 - b·Z1)) + (X1·Y2 + X2·Y1)(3·X1·X2 - 3·Z1).

This avoids the multiplication operations for the terms involving Z2.

Scalar Multiplication

In Elliptic Curve cryptography, the scalar multiplication

[k]P = P + P + ⋯ + P   (k times)

is the fundamental building block for popular cryptographic schemes, such as ECDSA and ECDH. It is crucial to implement the scalar multiplication in constant time to prevent side-channel attacks, while keeping performance efficient even though it is done in constant time.

Typically, the scalar value k can be represented as an n-bit value:

k = Σ_{i=0}^{n-1} b_i·2^i

where b_i is the i-th bit of the scalar value k.

The scalar multiplication [k]P can be computed using the constant-time double-and-add algorithm, in which the running point is always doubled (2·P) at each bit iteration i and accumulated after multiplying by the bit value b_i:

k·P = Σ_{i=0}^{n-1} b_i·2^i·P

Instead of relying on this typical Double and Add algorithm, the implementation uses a windowed method to speed up the computation while preserving the constant time property.

The windowed method precomputes a table of points [j]P for j ∈ [0, 2^m - 1]. The scalar k is written in base 2^m, with d_i its i-th digit, and P_i = [d_i]P denotes the table entry selected by that digit.

With the precomputed points, the typical Double and Add algorithm can be generalized as:

[k]P = Σ_{i=0}^{n/m - 1} 2^(i·m)·P_i.

The implementation uses a precomputed table of 16 points, which means that m = 4, and the scalar multiplication can be computed as:

[k]P = Σ_{i=0}^{n/4 - 1} 16^i·P_i = Σ_{i=0}^{n/4 - 1} 16^i·[d_i]P.

With Horner’s method, it computes the scalar multiplication from the most significant digit to the least significant digit:

k = ((d_{n/4-1}·16 + d_{n/4-2})·16 + d_{n/4-3})·16 + ⋯ + d_0

In the implementation, the n-bit scalar value is first encoded as a byte array. Because the precomputed table holds 16 points, the scalar needs to be processed 4 bits at a time; thus, the byte array is traversed like an array of 4-bit digits.

The variable pos tracks the current bit position in the byte array B_i, where i ∈ [0, n/8). It starts at the most significant window and decreases by 4 bit positions after each windowed step:

pos = n - 4·i

With the current position, it determines the index of the byte within the array:

i = ⌊pos / 8⌋

Then it determines the bit offset of the position within that byte B_i, which is either 0 or 4:

offset = pos mod 8

Right-shifting the byte by the offset yields the slot value, which is either the top or the bottom 4 bits of the byte:

(B_i >> offset) mod 16

The slot value is computed as:

let slot = (k[pos >> 3] >> (pos & 7)) & 0xf;

To choose the correct point from the table, it compares the slot value, which is in [0, 15], with each index i of the precomputed table. When the indices match, slot ^ i is zero, so subtracting 1 wraps around to the all-ones value; after the shift and mask, the resulting bit is 1 (and 0 otherwise), which is used as the Choice for conditional_assign.

for i in 1..16 {
      t.conditional_assign(
         &pc[i],
         Choice::from(((slot as usize ^ i).wrapping_sub(1) >> 8) as u8 & 1),
      );
}
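
The following Python sketch (ours, not the crate's code) mirrors the 4-bit window extraction and the Horner-style accumulation, checking the digit extraction by rebuilding the scalar:

# Python sketch (ours) of the 4-bit window extraction used by the scalar
# multiplication, checked by rebuilding the scalar with Horner's method.

def window_digits(k_bytes):
    """Return the 4-bit digits of a big-endian scalar, most significant first."""
    n_bits = len(k_bytes) * 8
    digits = []
    pos = n_bits - 4
    while pos >= 0:
        # same indexing idea as the Rust code: (k[pos >> 3] >> (pos & 7)) & 0xf,
        # here applied to a big-endian byte string for readability
        byte = k_bytes[len(k_bytes) - 1 - (pos >> 3)]
        digits.append((byte >> (pos & 7)) & 0xF)
        pos -= 4
    return digits

k = 0x1f2e3d4c5b6a79880123456789abcdef00112233445566778899aabbccddeeff
digits = window_digits(k.to_bytes(32, "big"))

# Horner's rule: k = ((d_{n/4-1}*16 + d_{n/4-2})*16 + ...)*16 + d_0
acc = 0
for d in digits:
    acc = acc * 16 + d
assert acc == k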

ECDSA Overview

ECDSA has a number of standards, such as FIPS 186-5 (https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-5.pdf) and SEC1 (https://www.secg.org/sec1-v2.pdf), which specify the algorithm and its parameters.

RustCrypto’s ecdsa crate follows RFC 6979 for deterministic nonce generation instead of relying on random nonce generation, which can be vulnerable to attacks if the same nonce is reused for different messages.

It also handles DER-encoded signatures and OIDs for compatibility with certificate systems, as specified in RFCs 3279, 5758, 5480, 5280 and 5912.

In addition, the crate supports signature normalization to avoid signature malleability attacks, and supports recovering public keys from signatures.

Note that the crate partially implements the hash truncation (bits2int) specified in RFC 6979. Even though it follows the truncation from the standards, this does not guarantee resistance to hash collisions. The sections below explain the implementation details and the security implications.

Signing Algorithm

At the core, the crate uses the following ECDSA signing algorithm:

1. Hash the Message

Compute the cryptographic hash of the message m:

e=HASH(m).

Let L_n = ⌈log2(n)⌉ be the bit length of the curve order n, and define

z=the leftmost Ln bits of e,

and reduce z modulo n:

z2 = z mod n

2. Generate Nonce k

Deterministically generate k based on z2, private key d and an optional entropy via HMAC, such that

0<k<n.

3. Compute Point R

Multiply the generator point G by k:

R=k×G=(x1,y1).

4. Compute r

Reduce the x-coordinate modulo n:

r = x1 mod n

5. Compute s

Compute the signature’s second component:

s = k^(-1)·(z2 + r·d) mod n,

where d is your private key.

6. Output the Signature

The final signature is the pair (r,s).

Hazmat Functions

The hazmat functions in the RustCrypto ecdsa crate are low-level functions that provide the fundamental algorithms for the crate’s higher-level ECDSA functions. These hazmat functions require careful handling due to their security assumptions. These functions include:

  • bits2field: Converts a hash value e into field bytes to represent z. It handles truncation and padding to ensure the hash is within a certain range on a specific curve.

  • sign_prehashed: Signs a pre-hashed message z2 using the private key d and nonce k. It is up to the caller to ensure that the nonce k is not reused.

  • sign_prehashed_rfc6979: Uses RFC 6979’s deterministic nonce generation, ensuring that the generated nonce k is unique for each combination of truncated message z and private key d.

  • verify_prehashed: Verifies a pre-hashed signature (r,s) against the public key dG and message hash z.

Hash Truncation

At step 1, the hash value e is truncated to its leftmost L_n bits to obtain z.

The bits2field implementation partially follows the bits2int function defined in [RFC 6979 § 2.3.2]. The bits2field function requires the hash value to be at least half the field size (e.g., 128 bits for the P-256 curve). This is an additional security check, deviating from the standards, which do not mandate a minimum hash size.

pub fn bits2field<C: EcdsaCurve>(bits: &[u8]) -> Result<FieldBytes<C>> {
   // Minimum allowed bits size is half the field size
   if bits.len() < C::FieldBytesSize::USIZE / 2 {
       return Err(Error::new());
   }

   let mut field_bytes = FieldBytes::<C>::default();

   match bits.len().cmp(&C::FieldBytesSize::USIZE) {
       cmp::Ordering::Equal => field_bytes.copy_from_slice(bits),
       cmp::Ordering::Less => {
           // If bits is smaller than the field size, pad with zeroes on the left
           field_bytes[(C::FieldBytesSize::USIZE - bits.len())..].copy_from_slice(bits);
       }
       cmp::Ordering::Greater => {
           // If bits is larger than the field size, truncate
           field_bytes.copy_from_slice(&bits[..C::FieldBytesSize::USIZE]);
       }
   }

   Ok(field_bytes)
}

The truncation ensures that the bit length of z is not greater than that of the scalar field of the elliptic curve. But this does not entirely avoid hash collisions caused by the hash value overflowing the scalar field, as explained below.

L_n is determined by the FieldBytesSize defined for a curve. For P-256, the field size in bits happens to be a multiple of 8, so the FieldBytesSize is 32 bytes.

impl elliptic_curve::Curve for NistP256 {
   /// 32-byte serialized field elements.
   type FieldBytesSize = U32;

   TRUNCATED...
}

Even though the truncated hash value z has the same bit length as the scalar modulus n, it is not guaranteed to be in the range [0, n). For example, z is in the range [0, 2^256), but the scalar modulus n is less than 2^256, so z may overflow the scalar field.
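
This can be seen with plain integer arithmetic. The Python sketch below (ours) uses the same two digests as the Rust test shown next; both reduce to the same scalar modulo n:

# Python sketch: two distinct 256-bit "hash" values that reduce to the same
# scalar modulo the P-256 order n, illustrating the overflow ambiguity.

n = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551

digest1 = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632552  # n + 1
digest2 = 0x0000000000000000000000000000000000000000000000000000000000000001  # 1

assert digest1 != digest2
assert digest1 % n == digest2 % n == 1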

This overflow issue can be demonstrated in the following test:

#[test]
fn rfc6979() {
   let x = hex!("c9afa9d845ba75166b5c215767b1d6934e50c3db36e89b127b8a622b120f6721");
   let signer = SigningKey::from_bytes(&x.into()).unwrap();
   let digest1 = hex!("ffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632552");
   let digest2 = hex!("0000000000000000000000000000000000000000000000000000000000000001");
   let signature1: Signature = signer.sign_prehash(&digest1).unwrap();
   let signature2: Signature = signer.sign_prehash(&digest2).unwrap();

   assert_eq!(
       signature1.to_bytes().as_slice(),
       signature2.to_bytes().as_slice()
   );
}

For P-256, the set of 256-bit integers greater than or equal to n has size roughly

2^224 (since n ≈ 2^256 - 2^224 + ⋯),

so about one in 2^32 random SHA-256 outputs will lie above n and “wrap around”. Collisions do exist, but for a uniformly random 256-bit hash they’re vanishingly rare in practice.

Therefore, the truncation specified in the standards, such as RFC6979, is to reduce the chances of hash collisions rather than entirely eliminate them.

The security implication also depends on how the verifier incorporates the hash value z. If the application directly uses the hash value supplied by the prover instead of computing the hash itself, then the prover can take advantage of this overflow issue to mount replay attacks. For example, the prover can reuse a signature (r, s) with a different hash value if the application relies on the hash value for identity purposes, such as a key for a nullifier.

Furthermore, note that for the curve P-521, the field size is not a multiple of 8, so the FieldBytesSize is 66 bytes, which is equivalent to 528 bits, leaving 7 extra bits (528 - 521) of room for hash collisions. Besides the higher chance of hash collisions, this does not follow the RFC 6979 standard, which truncates the hash to the exact bit length of the curve order.

impl elliptic_curve::Curve for NistP521 {
   /// 66-byte serialized field elements.
   type FieldBytesSize = U66;

   TRUNCATED...
}

Deterministic Nonce Generation (RFC 6979)

One of the potential problems of directly using the hazmat function sign_prehashed is the risk of nonce reuse, which will lead to private key recovery. The nonce k must be unique for each signature.

The hazmat function sign_prehashed_rfc6979 implements the deterministic nonce method specified in the RFC 6979:

pub fn sign_prehashed_rfc6979<C, D>(
   d: &NonZeroScalar<C>,
   z: &FieldBytes<C>,
   ad: &[u8],
) -> Result<(Signature<C>, RecoveryId)>
where
   C: EcdsaCurve + CurveArithmetic,
   D: Digest + BlockSizeUser + FixedOutput + FixedOutputReset,
   SignatureSize<C>: ArraySize,
{
   let z2 = <Scalar<C> as Reduce<C::Uint>>::reduce_bytes(z);
   let k = NonZeroScalar::<C>::from_repr(rfc6979::generate_k::<D, _>(
       &d.to_repr(),
       &C::ORDER.encode_field_bytes(),
       &z2.to_repr(),
       ad,
   ))
   .unwrap();
   sign_prehashed(d, &k, z)
}

The nonce k is generated using HMAC, seeded with the private key d, curve order n, reduced hash value z2, and optional additional data ad (extra entropy or domain separation). This ensures k is unique per message and private key, eliminating reuse risks.

Recovery ID

The sign_prehashed function also returns a recovery ID along with the signature. The recovery ID is used to recover the public key from the signature and message. It is particularly useful for checking whether a signature corresponds to a specific public key.

Recall that in the public key recovery algorithm, the candidate point used to recover the public key can be calculated as:

R = (x1, y1)

where x1 is congruent to r modulo n, and y1 can be calculated by plugging x1 into the elliptic curve equation y^2 = x^3 + ax + b, in which a and b are defined for the specific curve.

Because the x-coordinate is an element of the field Fp, which can be greater than the curve order n, x1 can be either r or r + n. Meanwhile, the y-coordinate can be either y1 or -y1.

To obtain the correct public key, the recovery ID indicates which of the possible R points corresponds to the signature:

let recovery_id = RecoveryId::new(R.y_is_odd().into(), x_is_reduced);

Normalize Signature

ECDSA signatures are malleable because the signatures (r, s) and (r, n - s) are both valid for the same message. To prevent a verifier from accepting both signatures, the sign_prehashed function normalizes the s value to ensure it is in the lower half of the range. This is done by checking whether s is greater than half the curve order n and adjusting it accordingly.

The crate provides an interface NORMALIZE_S for the curve implementation to define whether sign_prehashed should normalize the s value.

if C::NORMALIZE_S {
   recovery_id.0 ^= s.is_high().unwrap_u8();
   signature = signature.normalize_s();
}

The is_high() method determines whether s > n/2, and normalize_s() adjusts s. Since adjusting s to s = n - s is equivalent to replacing the nonce k by -k, the new nonce produces the point R = (-k)·G, which flips the y-coordinate. The recovery ID needs to be adjusted accordingly to indicate the new point R.
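
A minimal Python sketch of this low-s normalization logic (ours; the crate performs it in constant time on scalar types):

# Python sketch (ours) of low-s normalization for P-256 signatures.

n = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551

def normalize_s(s):
    """Return the low-s value and whether the recovery id parity must flip."""
    if s > n // 2:           # s is "high"
        return n - s, True   # equivalent signature with nonce -k, flipped parity of R.y
    return s, False

assert normalize_s(5) == (5, False)
assert normalize_s(n - 5) == (5, True)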

Signature Format

The signature (r,s) is encoded in a format called DER (Distinguished Encoding Rules), ensuring that the signature can be transmitted and stored in a standardized way.

OID (Object Identifier) is a unique identifier used to specify the algorithm used for signing. In the context of ECDSA, the OID indicates the specific elliptic curve and hash function used in the signature. For example, the OID for ECDSA with SHA-256 is 1.2.840.10045.4.3.2, while the P-256 curve itself is identified by 1.2.840.10045.3.1.7.

In the RustCrypto ecdsa crate, the combination of the signature and the OID can be represented by the struct SignatureWithOid<C>.

pub struct SignatureWithOid<C: EcdsaCurve> {
   signature: Signature<C>,

   oid: ObjectIdentifier,
}

Verification Algorithm

Given a public key Q, message m or pre-hash e, and signature (r,s), the verification algorithm is as follows:

1. Validate the ranges (r,s)

0<r,s<n

2. Hash and truncate

If not pre-hashed, compute e=HASH(m) and set

z=leftmost Ln bits of e.

3. Inverse

w = s^(-1) (mod n).

4. Twin scalars

u1 = z·w (mod n),   u2 = r·w (mod n).

5. Joint multiplication

(x1, y1) = u1·G + u2·Q.

Reject the point at infinity.

6. Final check

Accept only if

r ≡ x1 (mod n).

Whether to reject signatures with s > n/2 to prevent malleability is up to the NORMALIZE_S definition of the curve, or to how the caller uses the verify API.
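
Putting the steps together, here is a self-contained Python sketch of the verification algorithm over P-256 (ours, for illustration only; it uses naive, variable-time arithmetic and assumes the public key is a valid curve point):

# Python sketch (ours) of ECDSA verification over P-256 on a pre-hashed message.
# Naive, variable-time arithmetic; for illustration only.

p = 0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff
n = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551
a = p - 3
b = 0x5ac635d8aa3a93e7b3ebbd55769886bc651d06b0cc53b0f63bce3c3e27d2604b
G = (0x6b17d1f2e12c4247f8bce6e563a440f277037d812deb33a0f4a13945d898c296,
     0x4fe342e2fe1a7f9b8ee7eb4a7c0f9e162bce33576b315ececbb6406837bf51f5)

def point_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                              # P + (-P) = identity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # doubling
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # addition
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def scalar_mul(k, P):
    acc = None
    while k:
        if k & 1:
            acc = point_add(acc, P)
        P = point_add(P, P)
        k >>= 1
    return acc

def verify_prehashed(Q, z, r, s):
    if not (0 < r < n and 0 < s < n):
        return False
    w = pow(s, -1, n)                       # step 3: inverse
    u1, u2 = (z * w) % n, (r * w) % n       # step 4: twin scalars
    R = point_add(scalar_mul(u1, G), scalar_mul(u2, Q))
    return R is not None and R[0] % n == r  # steps 5 and 6

# round-trip check against a naively generated signature
d = 0xc9afa9d845ba75166b5c215767b1d6934e50c3db36e89b127b8a622b120f6721
Q = scalar_mul(d, G)
z, k = 12345, 67890
r = scalar_mul(k, G)[0] % n
s = pow(k, -1, n) * (z + r * d) % n
assert verify_prehashed(Q, z, r, s)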

Verification Implementations

The verification can be done either by hashing the message and then verifying the signature, or by providing the hash and verifying the signature directly.

When SignatureWithOid is used to represent the signature together with the hashing algorithm, the message is hashed with the hashing algorithm corresponding to the OID before the signature is verified:

impl<C> Verifier<SignatureWithOid<C>> for VerifyingKey<C>
where
   C: EcdsaCurve + CurveArithmetic + DigestPrimitive,
   SignatureSize<C>: ArraySize,
{
   fn verify(&self, msg: &[u8], sig: &SignatureWithOid<C>) -> Result<()> {
       match sig.oid() {
           ECDSA_SHA224_OID => self.verify_prehash(&Sha224::digest(msg), sig.signature()),
           ECDSA_SHA256_OID => self.verify_prehash(&Sha256::digest(msg), sig.signature()),
           ECDSA_SHA384_OID => self.verify_prehash(&Sha384::digest(msg), sig.signature()),
           ECDSA_SHA512_OID => self.verify_prehash(&Sha512::digest(msg), sig.signature()),
           _ => Err(Error::new()),
       }
   }
}

Otherwise, it verifies a pre-hashed message directly:

impl<C> PrehashVerifier<Signature<C>> for VerifyingKey<C>
where
   C: EcdsaCurve + CurveArithmetic,
   SignatureSize<C>: ArraySize,
{
   fn verify_prehash(&self, prehash: &[u8], signature: &Signature<C>) -> Result<()> {
       if C::NORMALIZE_S && signature.s().is_high().into() {
           return Err(Error::new());
       }

       hazmat::verify_prehashed::<C>(
           &ProjectivePoint::<C>::from(*self.inner.as_affine()),
           &bits2field::<C>(prehash)?,
           signature,
       )
   }
}

Performance Optimizations

The arithmetic operations involved in ECDSA signing and verification can be categorized into two types: scalar multiplication and field element operations.

The point additions and doublings, which are performed in calculations like the public key d·G and the joint multiplication u1·G + u2·Q, can be optimized through Montgomery reduction. This is because projective coordinates involve many multiplications in a row, and with Montgomery reduction the overhead of the repeated reductions is amortized.

The scalar field operations, such as s = k^(-1)·(z + r·d), can be optimized with Barrett reduction.