Poly1305 Proof for Monocypher
A semi-formal proof that Monocypher’s Poly1305 algorithm works. Only the multiplication and carry propagations are addressed there because they’re the hardest to test.
All numbers are written in hexadecimal, except those at the
right-hand side of arithmetic shift operators. Formalism otherwise
follow C’s syntax. Note that 2 to the power of x is noted
“1 << x
”.
Compression function
The main function repeatedly performs a sum and a product:
h = (h + c) * r
r
is a random number with some of its bits chopped off,
c
is a 16-byte block worth of the input message (plus
2^128), and h
starts at zero, then is fed back to
itself.
The preconditions of the compression function are:
(1) h <= 4_ffffffff_ffffffff_ffffffff_ffffffff
(2) c <= 1_ffffffff_ffffffff_ffffffff_ffffffff
(3) r <= 0ffffffc_0ffffffc_0ffffffc_0fffffff
(4) r & f0000003_f0000003_f0000003_f0000000 == 0
That last condition for r
reflects the fact that some of
its bits are chopped off.
To represent big numbers, we use arrays of integers. The context used in the code is this:
typedef struct {
uint32_t r[4];
uint32_t h[5];
uint32_t c[5];
uint32_t pad[4];
size_t c_idx;
} crypto_poly1305_ctx;
Numbers are represented in little-endian. Let h
,
c
, and r
be:
(4) h == h0
+ (h1 << 32)
+ (h2 << 64)
+ (h3 << 96)
+ (h4 << 128)
(6) c == c0
+ (c1 << 32)
+ (c2 << 64)
+ (c3 << 96)
+ (c4 << 128)
(7) r == r0
+ (r1 << 32)
+ (r2 << 64)
+ (r3 << 96)
From (1) and (4), and the representation of h:
(8) h0 <= ffffffff
(9) h1 <= ffffffff
(10) h2 <= ffffffff
(11) h3 <= ffffffff
(12) h4 <= 4
From (2) and (5), and the representation of c:
(13) c0 <= ffffffff
(14) c1 <= ffffffff
(15) c2 <= ffffffff
(16) c3 <= ffffffff
(17) c4 <= 1
From (3), (4), (7), and the representation of r:
(18) r0 <= 0fffffff
(19) r1 <= 0ffffffc
(20) r2 <= 0ffffffc
(21) r3 <= 0ffffffc
(22) r0 & f0000000 == 0
(23) r1 & f0000003 == 0
(24) r2 & f0000003 == 0
(25) r3 & f0000003 == 0
Now the compression function proper:
static void poly_block(crypto_poly1305_ctx *ctx)
{
// s = h + c
const u64 s0 = ctx->h[0] + (u64)ctx->c[0];
const u64 s1 = ctx->h[1] + (u64)ctx->c[1];
const u64 s2 = ctx->h[2] + (u64)ctx->c[2];
const u64 s3 = ctx->h[3] + (u64)ctx->c[3];
const u32 s4 = ctx->h[4] + (u64)ctx->c[4];
// Local all the things!
const u32 r0 = ctx->r[0];
const u32 r1 = ctx->r[1];
const u32 r2 = ctx->r[2];
const u32 r3 = ctx->r[3];
const u32 rr0 = (r0 >> 2) * 5;
const u32 rr1 = (r1 >> 2) + r1;
const u32 rr2 = (r2 >> 2) + r2;
const u32 rr3 = (r3 >> 2) + r3;
// (h + c) * r, without carry propagation
const u64 x0 = s0*r0 + s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0;
const u64 x1 = s0*r1 + s1*r0 + s2*rr3 + s3*rr2 + s4*rr1;
const u64 x2 = s0*r2 + s1*r1 + s2*r0 + s3*rr3 + s4*rr2;
const u64 x3 = s0*r3 + s1*r2 + s2*r1 + s3*r0 + s4*rr3;
const u32 x4 = s4 * (r0 & 3);
// partial reduction modulo 2^130 - 5
const u32 u5 = x4 + (x3 >> 32);
const u64 u0 = (u5 >> 2) * 5 + (x0 & 0xffffffff);
const u64 u1 = (u0 >> 32) + (x1 & 0xffffffff) + (x0 >> 32);
const u64 u2 = (u1 >> 32) + (x2 & 0xffffffff) + (x1 >> 32);
const u64 u3 = (u2 >> 32) + (x3 & 0xffffffff) + (x2 >> 32);
const u64 u4 = (u3 >> 32) + (u5 & 3);
// Update the hash
ctx->h[0] = u0 & 0xffffffff;
ctx->h[1] = u1 & 0xffffffff;
ctx->h[2] = u2 & 0xffffffff;
ctx->h[3] = u3 & 0xffffffff;
ctx->h[4] = u4;
}
inputs
First, we compute the sum of h
and
c
, s
:
// s = h + c
const u64 s0 = ctx->h[0] + (u64)ctx->c[0];
const u64 s1 = ctx->h[1] + (u64)ctx->c[1];
const u64 s2 = ctx->h[2] + (u64)ctx->c[2];
const u64 s3 = ctx->h[3] + (u64)ctx->c[3];
const u32 s4 = ctx->h[4] + (u64)ctx->c[4];
From the above:
(26) s0 == h0 + c0
(27) s1 == h1 + c1
(28) s2 == h2 + c2
(29) s3 == h3 + c3
(30) s4 == h4 + c4
let s
be:
(31) s == s0
+ (s1 << 32)
+ (s2 << 64)
+ (s3 << 96)
+ (s4 << 128)
From (27), (28), (29), (30), and (31):
s == (h0 + c0)
+ ((h1 + c1) << 32)
+ ((h2 + c2) << 64)
+ ((h3 + c3) << 96)
+ ((h4 + c4) << 128)
s == h0 + c0
+ (h1 << 32) + (c1 << 32)
+ (h2 << 64) + (c2 << 64)
+ (h3 << 96) + (c3 << 96)
+ (h4 << 128) + (c4 << 128)
(32) s == h + c
We have the intended sum.
In addition, from (8) and (13), (9) and (14), (10) and (15), (11) and (16), (12) and (17):
(33) s0 <= 1_fffffffe
(34) s1 <= 1_fffffffe
(35) s2 <= 1_fffffffe
(36) s3 <= 1_fffffffe
(37) s4 <= 5
The limbs of s
do not overflow the 64-bit integers they
are represented with. (For s4, this is a 32-bit integer).
Second, we load r
to local
variables:
const u32 r0 = ctx->r[0];
const u32 r1 = ctx->r[1];
const u32 r2 = ctx->r[2];
const u32 r3 = ctx->r[3];
Third, we divide r
by four, then
multiply it by 5, to help with modular reduction:
const u32 rr0 = (r0 >> 2) * 5;
const u32 rr1 = (r1 >> 2) + r1;
const u32 rr2 = (r2 >> 2) + r2;
const u32 rr3 = (r3 >> 2) + r3;
From (23), (24), (25), and the above:
r1 = (r1 >> 2) * 4
r2 = (r2 >> 2) * 4
r3 = (r3 >> 2) * 4
r1 + (r1 >> 2) == (r1 >> 2) * 4 + (r1 >> 2)
r2 + (r2 >> 2) == (r2 >> 2) * 4 + (r2 >> 2)
r3 + (r3 >> 2) == (r3 >> 2) * 4 + (r3 >> 2)
r1 + (r1 >> 2) == (r1 >> 2) * 5
r2 + (r2 >> 2) == (r2 >> 2) * 5
r3 + (r3 >> 2) == (r3 >> 2) * 5
(38) rr1 == (r1 >> 2) * 5
(39) rr2 == (r2 >> 2) * 5
(40) rr3 == (r3 >> 2) * 5
let rr
be:
(41) rr == rr0
+ (rr1 << 32)
+ (rr2 << 64)
+ (rr3 << 96)
rr == ((r0 >> 2) * 5)
+ (((r1 >> 2) * 5) << 32)
+ (((r2 >> 2) * 5) << 64)
+ (((r3 >> 2) * 5) << 96)
rr == (( r0
+ (r1 << 32)
+ (r2 << 64)
+ (r3 << 96)
) >> 2) * 5
(42) rr = (r >> 2) * 5
In addition, from (18), (19), (20), (21), (38), (39), and (40):
(r0 >> 2) * 5 <= 13fffffb
(r1 >> 2) * 5 <= 13fffffb
(r2 >> 2) * 5 <= 13fffffb
(r3 >> 2) * 5 <= 13fffffb
(r0 >> 2) * 5 <= 13fffffb
(r1 >> 2) + r1 <= 13fffffb
(r2 >> 2) + r2 <= 13fffffb
(r3 >> 2) + r3 <= 13fffffb
(43) rr0 <= 13fffffb
(44) rr1 <= 13fffffb
(45) rr2 <= 13fffffb
(46) rr3 <= 13fffffb
The limbs of rr
do not overflow the 32-bit integers they
are represented with.
Multiplication
const u64 x0 = s0*r0 + s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0;
const u64 x1 = s0*r1 + s1*r0 + s2*rr3 + s3*rr2 + s4*rr1;
const u64 x2 = s0*r2 + s1*r1 + s2*r0 + s3*rr3 + s4*rr2;
const u64 x3 = s0*r3 + s1*r2 + s2*r1 + s3*r0 + s4*rr3;
const u32 x4 = s4 * (r0 & 3);
Let x
be:
(47) x == x0
+ (x1 << 32)
+ (x2 << 64)
+ (x3 << 96)
+ (x4 << 128)
From (7), (31), and the schoolbook multiplication algorithm:
(48) s * r == s0*r0
+ (s0*r1 + s1*r0 ) << 32
+ (s0*r2 + s1*r1 + s2*r0 ) << 64
+ (s0*r3 + s1*r2 + s2*r1 + s3*r0 ) << 96
+ ( s1*r3 + s2*r2 + s3*r1 + s4*r0) << 128
+ ( s2*r3 + s3*r2 + s4*r1) << 160
+ ( s3*r3 + s4*r2) << 192
+ ( s4*r3) << 224
Now, we’re doing the multiplication modulo 2^130-5.
Let srl
and srh
be:
(49) srl == s0*r0
+ (s0*r1 + s1*r0 ) << 32
+ (s0*r2 + s1*r1 + s2*r0 ) << 64
+ (s0*r3 + s1*r2 + s2*r1 + s3*r0 ) << 96
(50) srh == s1*r3 + s2*r2 + s3*r1 + s4*r0
+ ( s2*r3 + s3*r2 + s4*r1) << 32
+ ( s3*r3 + s4*r2) << 64
+ ( s4*r3) << 96
From (48), (49), and (50):
(51) s * r == srl + (srh << 128);
Let v
and w
be:
(52) v == s4 * (r0 & 3)
(53) w == srh - v
From (52) and (53):
s4 * r0 - s4 * (r0 & 3)) == s4* (r0 - (r0 & 3))
s4 * r0 - s4 * (r0 & 3)) == s4* (r0 & fffffffc)
w == s1*r3 + s2*r2 + s3*r1 + s4*r0
+ ( s2*r3 + s3*r2 + s4*r1) << 32
+ ( s3*r3 + s4*r2) << 64
+ ( s4*r3) << 96
- (s4 * (r0 & 3))
w == s1*r3 + s2*r2 + s3*r1 + s4*r0 - (s4 * (r0 & 3))
+ ( s2*r3 + s3*r2 + s4*r1) << 32
+ ( s3*r3 + s4*r2) << 64
+ ( s4*r3) << 96
(54) w == s1*r3 + s2*r2 + s3*r1 + s4* (r0 & fffffffc)
+ ( s2*r3 + s3*r2 + s4*r1) << 32
+ ( s3*r3 + s4*r2) << 64
+ ( s4*r3) << 96
let ww
be (w >> 2) * 5
. From (23),
(24), (25), (38), (39), (40), and (54):
ww == (( s1*r3 + s2*r2 + s3*r1 + s4* (r0 & fffffffc)
+ ( s2*r3 + s3*r2 + s4*r1) << 32
+ ( s3*r3 + s4*r2) << 64
+ ( s4*r3) << 96
) >> 2) * 5
ww == s1 * (( r3 ) >> 2) * 5
+ s2 * (( r2 ) >> 2) * 5
+ s3 * (( r1 ) >> 2) * 5
+ s4 * (((r0 & fffffffc)) >> 2) * 5
+ ( s2 * (r3 >> 2) * 5
+ s3 * (r2 >> 2) * 5
+ s4 * (r1 >> 2) * 5) << 32
+ ( s3 * (r3 >> 2) * 5
+ s4 * (r2 >> 2) * 5) << 64
+ ( s4 * (r3 >> 2) * 5) << 96
ww == s1 * rr3
+ s2 * rr2
+ s3 * rr1
+ s4 * rr0
+ ( s2 * rr3
+ s3 * rr2
+ s4 * rr1) << 32
+ ( s3 * rr3
+ s4 * rr2) << 64
+ ( s4 * rr3) << 96
(55) ww == s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0
+ ( s2*rr3 + s3*rr2 + s4*rr1) << 32
+ ( s3*rr3 + s4*rr2) << 64
+ ( s4*rr3) << 96
let p
be (1 << 130) - 5
.
From (49), (50), (51), (52), (53), and (55)
(s*r) % p == (srl + (srh << 128)) % p
(s*r) % p == srl % p + (srh << 128) % p
(s*r) % p == srl % p + ((v + w) << 128) % p
(s*r) % p == srl % p + (v << 128) % p + (w << 128) % p
(s*r) % p == srl % p + (v << 128) % p + (((w << 128) >> 130) * 5) % p
(s*r) % p == srl % p + (v << 128) % p + (w >> 2) * 5 % p
(s*r) % p == srl % p + (v << 128) % p + ww % p
(s*r) % p == (srl + (v << 128) + ww) % p
(s * r) % p == ( s0*r0
+ (s0*r1 + s1*r0 ) << 32
+ (s0*r2 + s1*r1 + s2*r0 ) << 64
+ (s0*r3 + s1*r2 + s2*r1 + s3*r0 ) << 96
+ (s4 * (r0 & 3) ) << 128
+ s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0
+ ( s2*rr3 + s3*rr2 + s4*rr1) << 32
+ ( s3*rr3 + s4*rr2) << 64
+ ( s4*rr3) << 96
) %p
(56) (s*r) % p == ( s0*r0 s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0
+ (s0*r1 + s1*r0 + s2*rr3 + s3*rr2 + s4*rr1) << 32
+ (s0*r2 + s1*r1 + s2*r0 + s3*rr3 + s4*rr2) << 64
+ (s0*r3 + s1*r2 + s2*r1 + s3*r0 + s4*rr3) << 96
+ (s4 * (r0 & 3) ) << 128
) % p
From (47) and (56):
(57) x % p == (s * r) % p
From (18), (19), (20), (21), (33), (34), (35), (36), (37), (43), (44), (45), and (46):
x0 == s0*r0 s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0
x1 == s0*r1 + s1*r0 + s2*rr3 + s3*rr2 + s4*rr1
x2 == s0*r2 + s1*r1 + s2*r0 + s3*rr3 + s4*rr2
x3 == s0*r3 + s1*r2 + s2*r1 + s3*r0 + s4*rr3
x4 == s4 * (r0 & 3)
x0 <= 1_fffffffe * 0fffffff
+ 1_fffffffe * 13fffffb
+ 1_fffffffe * 13fffffb
+ 1_fffffffe * 13fffffb
+ 8 * 13fffffb
x1 <= 1_fffffffe * 0ffffffc
+ 1_fffffffe * 0fffffff
+ 1_fffffffe * 13fffffb
+ 1_fffffffe * 13fffffb
+ 8 * 13fffffb
x2 <= 1_fffffffe * 0ffffffc
+ 1_fffffffe * 0ffffffc
+ 1_fffffffe * 0fffffff
+ 1_fffffffe * 13fffffb
+ 8 * 13fffffb
x3 <= 1_fffffffe * 0ffffffc
+ 1_fffffffe * 0ffffffc
+ 1_fffffffe * 0ffffffc
+ 1_fffffffe * 0fffffff
+ 8 * 13fffffb
x4 <= 8 * 3
(58) x0 <= 97ffffe0_07fffff8
(59) x1 <= 8fffffe2_0ffffff6
(60) x2 <= 87ffffe4_17fffff4
(61) x3 <= 7fffffe6_1ffffff2
(62) x4 <= f
The limbs of x
do not overflow the 64-bit integers they
are represented with.
Carry propagation
const u32 u5 = x4 + (x3 >> 32);
const u64 u0 = (u5 >> 2) * 5 + (x0 & 0xffffffff);
const u64 u1 = (u0 >> 32) + (x1 & 0xffffffff) + (x0 >> 32);
const u64 u2 = (u1 >> 32) + (x2 & 0xffffffff) + (x1 >> 32);
const u64 u3 = (u2 >> 32) + (x3 & 0xffffffff) + (x2 >> 32);
const u64 u4 = (u3 >> 32) + (u5 & 3);
Let xl0
, xl1
, xl2
,
xl3
, xh0
, xh1
, xh2
,
and xh3
be:
(63) xl0 == x0 & ffffffff
(64) xl1 == x1 & ffffffff
(65) xl2 == x2 & ffffffff
(66) xl3 == x3 & ffffffff
(67) xh0 == x0 >> 32
(68) xh1 == x1 >> 32
(69) xh2 == x2 >> 32
(70) xh3 == x3 >> 32
Let ul0
, ul1
, ul2
,
ul3
, ul5
, uh0
, uh1
,
uh2
, uh3
, and uh5
be:
(71) ul0 == u0 & ffffffff
(72) ul1 == u1 & ffffffff
(73) ul2 == u2 & ffffffff
(74) ul3 == u3 & ffffffff
(76) ul5 == u5 & 3
(77) uh0 == u0 >> 32
(78) uh1 == u1 >> 32
(79) uh2 == u2 >> 32
(80) uh3 == u3 >> 32
(82) uh5 == u5 >> 2
From (57) and (62) to 82):
(s * r) % p == ( x0
+ x1 << 32
+ x2 << 64
+ x3 << 96
+ x4 << 128
) % p
(s * r) % p == ( xl0 + (xh0 << 32)
+ (xl1 + (xh1 << 32)) << 32
+ (xl2 + (xh2 << 32)) << 64
+ (xl3 + (xh3 << 32)) << 96
+ x4 << 128
) % p
(s * r) % p == ( xl0
+ (xl1 + xh0) << 32
+ (xl2 + xh1) << 64
+ (xl3 + xh2) << 96
+ (x4 + xh3) << 128
) % p
(s * r) % p == ( xl0
+ (xl1 + xh0) << 32
+ (xl2 + xh1) << 64
+ (xl3 + xh2) << 96
+ u5 << 128
) % p
(s * r) % p == ( xl0
+ (xl1 + xh0) << 32
+ (xl2 + xh1) << 64
+ (xl3 + xh2) << 96
+ (ul5 + (u5h << 2)) << 128
) % p
(s * r) % p == ( xl0 + u5h * 5
+ (xl1 + xh0) << 32
+ (xl2 + xh1) << 64
+ (xl3 + xh2) << 96
+ ul5 << 128
) % p
(s * r) % p == ( u0
+ (xl1 + xh0) << 32
+ (xl2 + xh1) << 64
+ (xl3 + xh2) << 96
+ ul5 << 128
) % p
(s * r) % p == ( ul0
+ (uh0 + xl1 + xh0) << 32
+ ( xl2 + xh1) << 64
+ ( xl3 + xh2) << 96
+ ul5 << 128
) % p
(s * r) % p == ( ul0
+ ul1 << 32
+ (uh1 + xl2 + xh1) << 64
+ ( xl3 + xh2) << 96
+ ul5 << 128
) % p
(s * r) % p == ( ul0
+ ul1 << 32
+ ul2 << 64
+ (uh2 + xl3 + xh2) << 96
+ ul5 << 128
) % p
(s * r) % p == ( ul0
+ ul1 << 32
+ ul2 << 64
+ ul3 << 96
+ (uh3 + ul5) << 128
) % p
(83) (s * r) % p == ( ul0
+ ul1 << 32
+ ul2 << 64
+ ul3 << 96
+ u4 << 128
) % p
Update the hash
ctx->h[0] = u0 & 0xffffffff;
ctx->h[1] = u1 & 0xffffffff;
ctx->h[2] = u2 & 0xffffffff;
ctx->h[3] = u3 & 0xffffffff;
ctx->h[4] = u4;
From the above, (71), (72), (73), (74), and (83):
H0 == ul0;
H1 == ul1;
H2 == ul2;
H3 == ul3;
H4 == u4;
Let H
be:
(84) H == H0
+ H1 << 32
+ H2 << 64
+ H3 << 96
+ H4 << 128
From (83) and (84):
H % p == (s * r) % p
It Works!
From (58) to (62):
u5 <= x4 + (x3 >> 32)
u5 <= f + 7fffffe6
u5 <= 7ffffff5
u5
does not overflow the 32-bit integer it is
represented with.
u0 <= (u5 >> 2) * 5 + (x0 & 0xffffffff)
u0 <= (7ffffff5 >> 2) * 5 + ffffffff
u0 <= 1_9ffffff0
u1 <= (u0 >> 32) + (x1 & 0xffffffff) + (x0 >> 32)
u1 <= (1_9ffffff0 >> 32) + ffffffff + 97ffffe0
u1 <= 1_97ffffe0
u2 <= (u1 >> 32) + (x2 & 0xffffffff) + (x1 >> 32)
u2 <= (1_97ffffe0 >> 32) + ffffffff + 8fffffe2
u2 <= 1_8fffffe2
u3 <= (u2 >> 32) + (x3 & 0xffffffff) + (x2 >> 32)
u3 <= (1_8fffffe2 >> 32) + ffffffff + 87ffffe4
u3 <= 1_87ffffe4
u4 <= (u3 >> 32) + (u5 & 3)
u4 <= (1_87ffffe4 >> 32) + 3
(85) u4 <= 4
u0
, u1
, u2
, u3
,
and u4
, do not overflow the 64-bit integers they are
represented with.
From (84) and (85):
H <= 4_ffffffff_ffffffff_ffffffff_ffffffff
Since h
starts at zero and is only modified through the
compression function, we deduce by induction that it never exceeds the
number above.
Final reduction modulo p
const u64 u0 = (u64)5 + ctx->h[0];
const u64 u1 = (u0 >> 32) + ctx->h[1];
const u64 u2 = (u1 >> 32) + ctx->h[2];
const u64 u3 = (u2 >> 32) + ctx->h[3];
const u64 u4 = (u3 >> 32) + ctx->h[4];
const u64 uu0 = (u4 >> 2) * 5 + ctx->h[0] + ctx->pad[0];
const u64 uu1 = (uu0 >> 32) + ctx->h[1] + ctx->pad[1];
const u64 uu2 = (uu1 >> 32) + ctx->h[2] + ctx->pad[2];
const u64 uu3 = (uu2 >> 32) + ctx->h[3] + ctx->pad[3];
store32_le(mac , uu0);
store32_le(mac + 4, uu1);
store32_le(mac + 8, uu2);
store32_le(mac + 12, uu3);
Subtract or not subtract
const u64 u0 = (u64)5 + ctx->h[0];
const u64 u1 = (u0 >> 32) + ctx->h[1];
const u64 u2 = (u1 >> 32) + ctx->h[2];
const u64 u3 = (u2 >> 32) + ctx->h[3];
const u64 u4 = (u3 >> 32) + ctx->h[4];
From the above:
h == h0
+ h1 << 32
+ h2 << 64
+ h3 << 96
+ h4 << 128
h + 5 == h0 + 5
+ h1 << 32
+ h2 << 64
+ h3 << 96
+ h4 << 128
h + 5 == u0
+ h1 << 32
+ h2 << 64
+ h3 << 96
+ h4 << 128
h + 5 == u0 & ffffffff
+ (u0 >> 32) + h1 << 32
+ h2 << 64
+ h3 << 96
+ h4 << 128
h + 5 == u0 & ffffffff
+ u1 & ffffffff << 32
+ (u1 >> 32) + h2 << 64
+ h3 << 96
+ h4 << 128
h + 5 == u0 & ffffffff
+ u1 & ffffffff << 32
+ u2 & ffffffff << 64
+ (u2 >> 32) + h3 << 96
+ h4 << 128
h + 5 == u0 & ffffffff
+ u1 & ffffffff << 32
+ u2 & ffffffff << 64
+ u3 & ffffffff << 96
+ (u3 >> 32) + h4 << 128
h + 5 == u0 & ffffffff
+ u1 & ffffffff << 32
+ u2 & ffffffff << 64
+ u3 & ffffffff << 96
+ u4 << 128
Recall that p
equals
(1 << 130) - 5
.
(h + 5) >= 1 << 130 iff u4 >> 2 >= 1
h >= 1 << 130 - 5 iff u4 >> 2 >= 1
h >= p iff u4 >> 2 >= 1
u
exceeds zero iff h
exceeds
p
. Additionally, since:
h <= 4_ffffffff_ffffffff_ffffffff_ffffffff
Then:
h + 5 <= 5_00000000_00000000_00000000_00000004
h + 5 < (1 << 130) * 2
u4 >> 2 <= 1
u4 >> 2
never exceeds 1. Thus:
u4 >> 2 == 1 if h >= p
u4 >> 2 == 0 if h < p
u4 >> 2 == 1 if h % p == h - p
u4 >> 2 == 0 if h % p == p
To obtain a fully reduced representation of h
modulo
p
, we must subtract p
iff
u4 >> 2
is not zero.
final carry propagation (if any)
const u64 uu0 = (u4 >> 2) * 5 + ctx->h[0] + ctx->pad[0];
const u64 uu1 = (uu0 >> 32) + ctx->h[1] + ctx->pad[1];
const u64 uu2 = (uu1 >> 32) + ctx->h[2] + ctx->pad[2];
const u64 uu3 = (uu2 >> 32) + ctx->h[3] + ctx->pad[3];
Let u5
, uu4
, and pad
be:
u5 == (u4 >> 2) * 5
uu4 == (uu3 >> 32) + h4
pad == pad0
+ pad1 << 32
+ pad2 << 64
+ pad3 << 128
Recall that:
u5 == 5 if h % p == h - p
u5 == 0 if h % p == p
Now:
h + pad == h0 + pad0
+ h1 + pad1 << 32
+ h2 + pad2 << 64
+ h3 + pad3 << 96
+ h4 << 128
h + pad == u5 + h0 + pad0
+ h1 + pad1 << 32
+ h2 + pad2 << 64
+ h3 + pad3 << 96
+ h4 << 128
- u5
h + pad == uu0
+ h1 + pad1 << 32
+ h2 + pad2 << 64
+ h3 + pad3 << 96
+ h4 << 128
- u5
h + pad == uu0 & ffffffff
+ (uu0 >> 32) + h1 + pad1 << 32
+ h2 + pad2 << 64
+ h3 + pad3 << 96
+ h4 << 128
- u5
h + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ (uu1 >> 32) + h2 + pad2 << 64
+ h3 + pad3 << 96
+ h4 << 128
- u5
h + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ (uu3 >> 32) + h4 << 128
- u5
h + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 << 128
- u5
If u5
equals zero, then h
was fully reduced
to begin with, and no computation needed to be done–and none was:
h == h % p
(h % p) + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 << 128
- u5
(h % p) + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 << 128
If, however, u5
equals five, then:
h >= p
h >= 1 << 130 - 5
h + 5 >= 1 << 130
uu4 >> 2 <= 1
uu4 >> 2 - 1 <= 1
h + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 << 128
- 5
h + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 & 3 << 128
+ (uu4 >> 2) << 130
- 5
h + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 & 3 << 128
+ (uu4 >> 2) - 1 << 130
+ (1 << 130) - 5
h + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 & 3 << 128
+ (uu4 >> 2) - 1 << 130
+ p
(h - p) + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 & 3 << 128
+ (uu4 >> 2) - 1 << 130
(h % p) + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 & 3 << 128
+ (uu4 >> 2) - 1 << 130
Recall that in this case:
h % p == h - p
(uu4 >> 2) - 1 <= 0
No funny underflow there.
serialisation
store32_le(mac , uu0);
store32_le(mac + 4, uu1);
store32_le(mac + 8, uu2);
store32_le(mac + 12, uu3);
We store the 128 least significant bits of (h % p) + pad
in little-endian. Depending on the carry propagation above, one of the
following is true:
(h % p) + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 & 3 << 128
+ (uu4 >> 2) - 1 << 130
Or:
(h % p) + pad == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
+ uu4 << 128
In both cases, though:
((h % p) + pad) % (1 << 128) == uu0 & ffffffff
+ uu1 & ffffffff << 32
+ uu2 & ffffffff << 64
+ uu3 & ffffffff << 96
So, we store the right bits either way.
overflow
From (1), and the fact that pad
is represented with
32-bit limbs:
u0 <= 5 + ffffffff
u1 <= (u0 >> 32) + ffffffff
u2 <= (u1 >> 32) + ffffffff
u3 <= (u2 >> 32) + ffffffff
u4 <= (u3 >> 32) + 4
uu0 <= (u4 >> 2) * 5 + ffffffff + ffffffff
uu1 <= (uu0 >> 32) + ffffffff + ffffffff
uu2 <= (uu1 >> 32) + ffffffff + ffffffff
uu3 <= (uu2 >> 32) + ffffffff + ffffffff
u0 <= 1_00000004
u1 <= 1_00000000
u2 <= 1_00000000
u3 <= 1_00000000
u4 <= 5
uu0 <= 2_00000003
uu1 <= 2_00000000
uu2 <= 2_00000000
uu3 <= 2_00000000
None the variables above overflow the 64-bit integers they are represented with.
QED. I think. Can I implement my own crypto now?