Monocypher

Boring crypto that simply works

Poly1305 Proof for Monocypher

A semi-formal proof that Monocypher’s Poly1305 algorithm works. Only the multiplication and carry propagations are addressed there because they’re the hardest to test.

All numbers are written in hexadecimal, except those at the right-hand side of arithmetic shift operators. Formalism otherwise follow C’s syntax. Note that 2 to the power of x is noted “1 << x”.

Compression function

The main function repeatedly performs a sum and a product:

h = (h + c) * r

r is a random number with some of its bits chopped off, c is a 16-byte block worth of the input message (plus 2^128), and h starts at zero, then is fed back to itself.

The preconditions of the compression function are:

(1)  h <= 4_ffffffff_ffffffff_ffffffff_ffffffff
(2)  c <= 1_ffffffff_ffffffff_ffffffff_ffffffff
(3)  r <=   0ffffffc_0ffffffc_0ffffffc_0fffffff

(4)  r & f0000003_f0000003_f0000003_f0000000 == 0

That last condition for r reflects the fact that some of its bits are chopped off.

To represent big numbers, we use arrays of integers. The context used in the code is this:

typedef struct {
    uint32_t r[4];
    uint32_t h[5];
    uint32_t c[5];
    uint32_t pad[4];
    size_t   c_idx;
} crypto_poly1305_ctx;

Numbers are represented in little-endian. Let h, c, and r be:

(4)  h == h0
       + (h1 <<  32)
       + (h2 <<  64)
       + (h3 <<  96)
       + (h4 << 128)

(6)  c == c0
       + (c1 <<  32)
       + (c2 <<  64)
       + (c3 <<  96)
       + (c4 << 128)

(7)  r == r0
       + (r1 << 32)
       + (r2 << 64)
       + (r3 << 96)

From (1) and (4), and the representation of h:

(8)   h0 <= ffffffff
(9)   h1 <= ffffffff
(10)  h2 <= ffffffff
(11)  h3 <= ffffffff
(12)  h4 <= 4

From (2) and (5), and the representation of c:

(13)  c0 <= ffffffff
(14)  c1 <= ffffffff
(15)  c2 <= ffffffff
(16)  c3 <= ffffffff
(17)  c4 <= 1

From (3), (4), (7), and the representation of r:

(18)  r0 <= 0fffffff
(19)  r1 <= 0ffffffc
(20)  r2 <= 0ffffffc
(21)  r3 <= 0ffffffc
(22)  r0 & f0000000 == 0
(23)  r1 & f0000003 == 0
(24)  r2 & f0000003 == 0
(25)  r3 & f0000003 == 0

Now the compression function proper:

static void poly_block(crypto_poly1305_ctx *ctx)
{
    // s = h + c
    const u64 s0 = ctx->h[0] + (u64)ctx->c[0];
    const u64 s1 = ctx->h[1] + (u64)ctx->c[1];
    const u64 s2 = ctx->h[2] + (u64)ctx->c[2];
    const u64 s3 = ctx->h[3] + (u64)ctx->c[3];
    const u32 s4 = ctx->h[4] + (u64)ctx->c[4];

    // Local all the things!
    const u32 r0 = ctx->r[0];
    const u32 r1 = ctx->r[1];
    const u32 r2 = ctx->r[2];
    const u32 r3 = ctx->r[3];
    const u32 rr0 = (r0 >> 2) * 5;
    const u32 rr1 = (r1 >> 2) + r1;
    const u32 rr2 = (r2 >> 2) + r2;
    const u32 rr3 = (r3 >> 2) + r3;

    // (h + c) * r, without carry propagation
    const u64 x0 = s0*r0 + s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0;
    const u64 x1 = s0*r1 + s1*r0  + s2*rr3 + s3*rr2 + s4*rr1;
    const u64 x2 = s0*r2 + s1*r1  + s2*r0  + s3*rr3 + s4*rr2;
    const u64 x3 = s0*r3 + s1*r2  + s2*r1  + s3*r0  + s4*rr3;
    const u32 x4 = s4 * (r0 & 3);

    // partial reduction modulo 2^130 - 5
    const u32 u5 = x4 + (x3 >> 32);
    const u64 u0 = (u5 >>  2) * 5 + (x0 & 0xffffffff);
    const u64 u1 = (u0 >> 32)     + (x1 & 0xffffffff) + (x0 >> 32);
    const u64 u2 = (u1 >> 32)     + (x2 & 0xffffffff) + (x1 >> 32);
    const u64 u3 = (u2 >> 32)     + (x3 & 0xffffffff) + (x2 >> 32);
    const u64 u4 = (u3 >> 32)     + (u5 & 3);

    // Update the hash
    ctx->h[0] = u0 & 0xffffffff;
    ctx->h[1] = u1 & 0xffffffff;
    ctx->h[2] = u2 & 0xffffffff;
    ctx->h[3] = u3 & 0xffffffff;
    ctx->h[4] = u4;
}

inputs

First, we compute the sum of h and c, s:

// s = h + c
const u64 s0 = ctx->h[0] + (u64)ctx->c[0];
const u64 s1 = ctx->h[1] + (u64)ctx->c[1];
const u64 s2 = ctx->h[2] + (u64)ctx->c[2];
const u64 s3 = ctx->h[3] + (u64)ctx->c[3];
const u32 s4 = ctx->h[4] + (u64)ctx->c[4];

From the above:

(26)  s0 == h0 + c0
(27)  s1 == h1 + c1
(28)  s2 == h2 + c2
(29)  s3 == h3 + c3
(30)  s4 == h4 + c4

let s be:

(31)  s == s0
        + (s1 <<  32)
        + (s2 <<  64)
        + (s3 <<  96)
        + (s4 << 128)

From (27), (28), (29), (30), and (31):

s == (h0 + c0)
  + ((h1 + c1) <<  32)
  + ((h2 + c2) <<  64)
  + ((h3 + c3) <<  96)
  + ((h4 + c4) << 128)

s == h0          +  c0
  + (h1 <<  32)  + (c1 <<  32)
  + (h2 <<  64)  + (c2 <<  64)
  + (h3 <<  96)  + (c3 <<  96)
  + (h4 << 128)  + (c4 << 128)

(32)  s == h + c

We have the intended sum.

In addition, from (8) and (13), (9) and (14), (10) and (15), (11) and (16), (12) and (17):

(33)  s0 <= 1_fffffffe
(34)  s1 <= 1_fffffffe
(35)  s2 <= 1_fffffffe
(36)  s3 <= 1_fffffffe
(37)  s4 <= 5

The limbs of s do not overflow the 64-bit integers they are represented with. (For s4, this is a 32-bit integer).

Second, we load r to local variables:

const u32 r0 = ctx->r[0];
const u32 r1 = ctx->r[1];
const u32 r2 = ctx->r[2];
const u32 r3 = ctx->r[3];

Third, we divide r by four, then multiply it by 5, to help with modular reduction:

const u32 rr0 = (r0 >> 2) * 5;
const u32 rr1 = (r1 >> 2) + r1;
const u32 rr2 = (r2 >> 2) + r2;
const u32 rr3 = (r3 >> 2) + r3;

From (23), (24), (25), and the above:

r1 = (r1 >> 2) * 4
r2 = (r2 >> 2) * 4
r3 = (r3 >> 2) * 4

r1 + (r1 >> 2) == (r1 >> 2) * 4 + (r1 >> 2)
r2 + (r2 >> 2) == (r2 >> 2) * 4 + (r2 >> 2)
r3 + (r3 >> 2) == (r3 >> 2) * 4 + (r3 >> 2)

r1 + (r1 >> 2) == (r1 >> 2) * 5
r2 + (r2 >> 2) == (r2 >> 2) * 5
r3 + (r3 >> 2) == (r3 >> 2) * 5

(38)  rr1 == (r1 >> 2) * 5
(39)  rr2 == (r2 >> 2) * 5
(40)  rr3 == (r3 >> 2) * 5

let rr be:

(41) rr == rr0
        + (rr1 <<  32)
        + (rr2 <<  64)
        + (rr3 <<  96)

     rr == ((r0 >> 2) * 5)
        + (((r1 >> 2) * 5) <<  32)
        + (((r2 >> 2) * 5) <<  64)
        + (((r3 >> 2) * 5) <<  96)

     rr == ((   r0
             + (r1 <<  32)
             + (r2 <<  64)
             + (r3 <<  96)
            ) >> 2) * 5

(42) rr = (r >> 2) * 5

In addition, from (18), (19), (20), (21), (38), (39), and (40):

(r0 >> 2) * 5 <= 13fffffb
(r1 >> 2) * 5 <= 13fffffb
(r2 >> 2) * 5 <= 13fffffb
(r3 >> 2) * 5 <= 13fffffb

(r0 >> 2) * 5  <= 13fffffb
(r1 >> 2) + r1 <= 13fffffb
(r2 >> 2) + r2 <= 13fffffb
(r3 >> 2) + r3 <= 13fffffb

(43)  rr0 <= 13fffffb
(44)  rr1 <= 13fffffb
(45)  rr2 <= 13fffffb
(46)  rr3 <= 13fffffb

The limbs of rr do not overflow the 32-bit integers they are represented with.

Multiplication

const u64 x0 = s0*r0 + s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0;
const u64 x1 = s0*r1 + s1*r0  + s2*rr3 + s3*rr2 + s4*rr1;
const u64 x2 = s0*r2 + s1*r1  + s2*r0  + s3*rr3 + s4*rr2;
const u64 x3 = s0*r3 + s1*r2  + s2*r1  + s3*r0  + s4*rr3;
const u32 x4 = s4 * (r0 & 3);

Let x be:

(47)  x == x0
        + (x1 <<  32)
        + (x2 <<  64)
        + (x3 <<  96)
        + (x4 << 128)

From (7), (31), and the schoolbook multiplication algorithm:

(48)  s * r == s0*r0
            + (s0*r1 + s1*r0                        ) <<  32
            + (s0*r2 + s1*r1 + s2*r0                ) <<  64
            + (s0*r3 + s1*r2 + s2*r1 + s3*r0        ) <<  96
            + (        s1*r3 + s2*r2 + s3*r1 + s4*r0) << 128
            + (                s2*r3 + s3*r2 + s4*r1) << 160
            + (                        s3*r3 + s4*r2) << 192
            + (                                s4*r3) << 224

Now, we’re doing the multiplication modulo 2^130-5.

Let srl and srh be:

(49)  srl == s0*r0
          + (s0*r1 + s1*r0                        ) << 32
          + (s0*r2 + s1*r1 + s2*r0                ) << 64
          + (s0*r3 + s1*r2 + s2*r1 + s3*r0        ) << 96

(50)  srh ==         s1*r3 + s2*r2 + s3*r1 + s4*r0
          + (                s2*r3 + s3*r2 + s4*r1) << 32
          + (                        s3*r3 + s4*r2) << 64
          + (                                s4*r3) << 96

From (48), (49), and (50):

(51)  s * r == srl + (srh << 128);

Let v and w be:

(52)  v == s4 * (r0 & 3)
(53)  w == srh - v

From (52) and (53):

s4 * r0 - s4 * (r0 & 3)) ==  s4* (r0 - (r0 & 3))
s4 * r0 - s4 * (r0 & 3)) ==  s4* (r0 & fffffffc)

      w == s1*r3 + s2*r2 + s3*r1 + s4*r0
         + (       s2*r3 + s3*r2 + s4*r1) << 32
         + (               s3*r3 + s4*r2) << 64
         + (                       s4*r3) << 96
         - (s4 * (r0 & 3))

      w == s1*r3 + s2*r2 + s3*r1 + s4*r0 - (s4 * (r0 & 3))
         + (       s2*r3 + s3*r2 + s4*r1) << 32
         + (               s3*r3 + s4*r2) << 64
         + (                       s4*r3) << 96

(54)  w == s1*r3 + s2*r2 + s3*r1 + s4* (r0 & fffffffc)
         + (       s2*r3 + s3*r2 + s4*r1) << 32
         + (               s3*r3 + s4*r2) << 64
         + (                       s4*r3) << 96

let ww be (w >> 2) * 5. From (23), (24), (25), (38), (39), (40), and (54):

      ww  == ((  s1*r3 + s2*r2 + s3*r1 + s4* (r0 & fffffffc)
               + (       s2*r3 + s3*r2 + s4*r1) << 32
               + (               s3*r3 + s4*r2) << 64
               + (                       s4*r3) << 96
              ) >> 2) * 5

      ww  == s1 * (( r3            ) >> 2) * 5
           + s2 * (( r2            ) >> 2) * 5
           + s3 * (( r1            ) >> 2) * 5
           + s4 * (((r0 & fffffffc)) >> 2) * 5
           + (  s2 * (r3 >> 2) * 5
              + s3 * (r2 >> 2) * 5
              + s4 * (r1 >> 2) * 5) << 32
           + (  s3 * (r3 >> 2) * 5
              + s4 * (r2 >> 2) * 5) << 64
           + (  s4 * (r3 >> 2) * 5) << 96

      ww  == s1 * rr3
           + s2 * rr2
           + s3 * rr1
           + s4 * rr0
           + (  s2 * rr3
              + s3 * rr2
              + s4 * rr1) << 32
           + (  s3 * rr3
              + s4 * rr2) << 64
           + (  s4 * rr3) << 96

(55)  ww  == s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0
           + (        s2*rr3 + s3*rr2 + s4*rr1) << 32
           + (                 s3*rr3 + s4*rr2) << 64
           + (                          s4*rr3) << 96

let p be (1 << 130) - 5.

From (49), (50), (51), (52), (53), and (55)

(s*r) % p == (srl + (srh << 128)) % p
(s*r) % p == srl % p + (srh << 128) % p
(s*r) % p == srl % p + ((v + w) << 128) % p
(s*r) % p == srl % p + (v << 128) % p + (w << 128) % p
(s*r) % p == srl % p + (v << 128) % p + (((w << 128) >> 130) * 5) % p
(s*r) % p == srl % p + (v << 128) % p + (w >> 2) * 5 % p
(s*r) % p == srl % p + (v << 128) % p + ww % p
(s*r) % p == (srl + (v << 128) + ww) % p

(s * r) % p == (   s0*r0
                + (s0*r1 + s1*r0                            ) << 32
                + (s0*r2 + s1*r1  + s2*r0                   ) << 64
                + (s0*r3 + s1*r2  + s2*r1  + s3*r0          ) << 96
                + (s4 * (r0 & 3)                            ) << 128
                +          s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0
                + (                 s2*rr3 + s3*rr2 + s4*rr1) << 32
                + (                          s3*rr3 + s4*rr2) << 64
                + (                                   s4*rr3) << 96
               ) %p

(56) (s*r) % p == (  s0*r0   s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0
                  + (s0*r1 + s1*r0  + s2*rr3 + s3*rr2 + s4*rr1) << 32
                  + (s0*r2 + s1*r1  + s2*r0  + s3*rr3 + s4*rr2) << 64
                  + (s0*r3 + s1*r2  + s2*r1  + s3*r0  + s4*rr3) << 96
                  + (s4 * (r0 & 3)                            ) << 128
                  ) % p

From (47) and (56):

(57)  x % p == (s * r) % p

From (18), (19), (20), (21), (33), (34), (35), (36), (37), (43), (44), (45), and (46):

x0 == s0*r0   s1*rr3 + s2*rr2 + s3*rr1 + s4*rr0
x1 == s0*r1 + s1*r0  + s2*rr3 + s3*rr2 + s4*rr1
x2 == s0*r2 + s1*r1  + s2*r0  + s3*rr3 + s4*rr2
x3 == s0*r3 + s1*r2  + s2*r1  + s3*r0  + s4*rr3
x4 == s4 * (r0 & 3)

x0 <= 1_fffffffe * 0fffffff
    + 1_fffffffe * 13fffffb
    + 1_fffffffe * 13fffffb
    + 1_fffffffe * 13fffffb
    +          8 * 13fffffb

x1 <= 1_fffffffe * 0ffffffc
    + 1_fffffffe * 0fffffff
    + 1_fffffffe * 13fffffb
    + 1_fffffffe * 13fffffb
    +          8 * 13fffffb

x2 <= 1_fffffffe * 0ffffffc
    + 1_fffffffe * 0ffffffc
    + 1_fffffffe * 0fffffff
    + 1_fffffffe * 13fffffb
    +          8 * 13fffffb

x3 <= 1_fffffffe * 0ffffffc
    + 1_fffffffe * 0ffffffc
    + 1_fffffffe * 0ffffffc
    + 1_fffffffe * 0fffffff
    +          8 * 13fffffb

x4 <=          8 * 3

(58)  x0 <= 97ffffe0_07fffff8
(59)  x1 <= 8fffffe2_0ffffff6
(60)  x2 <= 87ffffe4_17fffff4
(61)  x3 <= 7fffffe6_1ffffff2
(62)  x4 <=                 f

The limbs of x do not overflow the 64-bit integers they are represented with.

Carry propagation

const u32 u5 = x4 + (x3 >> 32);
const u64 u0 = (u5 >>  2) * 5 + (x0 & 0xffffffff);
const u64 u1 = (u0 >> 32)     + (x1 & 0xffffffff) + (x0 >> 32);
const u64 u2 = (u1 >> 32)     + (x2 & 0xffffffff) + (x1 >> 32);
const u64 u3 = (u2 >> 32)     + (x3 & 0xffffffff) + (x2 >> 32);
const u64 u4 = (u3 >> 32)     + (u5 & 3);

Let xl0, xl1, xl2, xl3, xh0, xh1, xh2, and xh3 be:

(63)  xl0 == x0 & ffffffff
(64)  xl1 == x1 & ffffffff
(65)  xl2 == x2 & ffffffff
(66)  xl3 == x3 & ffffffff

(67)  xh0 == x0 >> 32
(68)  xh1 == x1 >> 32
(69)  xh2 == x2 >> 32
(70)  xh3 == x3 >> 32

Let ul0, ul1, ul2, ul3, ul5, uh0, uh1, uh2, uh3, and uh5 be:

(71)  ul0 == u0 & ffffffff
(72)  ul1 == u1 & ffffffff
(73)  ul2 == u2 & ffffffff
(74)  ul3 == u3 & ffffffff
(76)  ul5 == u5 &        3

(77)  uh0 == u0 >> 32
(78)  uh1 == u1 >> 32
(79)  uh2 == u2 >> 32
(80)  uh3 == u3 >> 32
(82)  uh5 == u5 >>  2

From (57) and (62) to 82):

      (s * r) % p == (  x0
                      + x1 <<  32
                       + x2 <<  64
                       + x3 <<  96
                       + x4 << 128
                      ) % p

      (s * r) % p  == (   xl0 + (xh0 << 32)
                       + (xl1 + (xh1 << 32)) <<  32
                       + (xl2 + (xh2 << 32)) <<  64
                       + (xl3 + (xh3 << 32)) <<  96
                       + x4 << 128
                      ) % p

      (s * r) % p  == (   xl0
                       + (xl1 + xh0) <<  32
                       + (xl2 + xh1) <<  64
                       + (xl3 + xh2) <<  96
                       + (x4  + xh3) << 128
                      ) % p

      (s * r) % p  == (   xl0
                       + (xl1 + xh0) <<  32
                       + (xl2 + xh1) <<  64
                       + (xl3 + xh2) <<  96
                       +  u5         << 128
                      ) % p

      (s * r) % p  == (   xl0
                       + (xl1 + xh0)        <<  32
                       + (xl2 + xh1)        <<  64
                       + (xl3 + xh2)        <<  96
                       + (ul5 + (u5h << 2)) << 128
                      ) % p

      (s * r) % p  == (   xl0 + u5h * 5
                       + (xl1 + xh0) <<  32
                       + (xl2 + xh1) <<  64
                       + (xl3 + xh2) <<  96
                       +  ul5        << 128
                      ) % p

      (s * r) % p  == (   u0
                       + (xl1 + xh0) <<  32
                       + (xl2 + xh1) <<  64
                       + (xl3 + xh2) <<  96
                       +  ul5        << 128
                      ) % p

      (s * r) % p  == (   ul0
                       + (uh0 + xl1 + xh0) <<  32
                       + (      xl2 + xh1) <<  64
                       + (      xl3 + xh2) <<  96
                       +  ul5              << 128
                      ) % p

      (s * r) % p  == (   ul0
                       +  ul1              <<  32
                       + (uh1 + xl2 + xh1) <<  64
                       + (      xl3 + xh2) <<  96
                       +  ul5              << 128
                      ) % p

      (s * r) % p  == (   ul0
                       +  ul1              <<  32
                       +  ul2              <<  64
                       + (uh2 + xl3 + xh2) <<  96
                       +  ul5              << 128
                      ) % p

      (s * r) % p  == (   ul0
                       +  ul1        <<  32
                       +  ul2        <<  64
                       +  ul3        <<  96
                       + (uh3 + ul5) << 128
                      ) % p

(83)  (s * r) % p  == (  ul0
                       + ul1 <<  32
                       + ul2 <<  64
                       + ul3 <<  96
                       + u4  << 128
                      ) % p

Update the hash

ctx->h[0] = u0 & 0xffffffff;
ctx->h[1] = u1 & 0xffffffff;
ctx->h[2] = u2 & 0xffffffff;
ctx->h[3] = u3 & 0xffffffff;
ctx->h[4] = u4;

From the above, (71), (72), (73), (74), and (83):

H0 == ul0;
H1 == ul1;
H2 == ul2;
H3 == ul3;
H4 == u4;

Let H be:

 (84) H == H0
         + H1 <<  32
         + H2 <<  64
         + H3 <<  96
         + H4 << 128

From (83) and (84):

H % p == (s * r) % p

It Works!

From (58) to (62):

u5 <= x4 + (x3 >> 32)
u5 <= f + 7fffffe6
u5 <= 7ffffff5

u5 does not overflow the 32-bit integer it is represented with.

      u0 <= (u5       >> 2) * 5 + (x0 & 0xffffffff)
      u0 <= (7ffffff5 >> 2) * 5 + ffffffff
      u0 <= 1_9ffffff0

      u1 <= (u0         >> 32) + (x1 & 0xffffffff) + (x0 >> 32)
      u1 <= (1_9ffffff0 >> 32) + ffffffff          + 97ffffe0
      u1 <= 1_97ffffe0

      u2 <= (u1         >> 32) + (x2 & 0xffffffff) + (x1 >> 32)
      u2 <= (1_97ffffe0 >> 32) + ffffffff          + 8fffffe2
      u2 <= 1_8fffffe2

      u3 <= (u2         >> 32) + (x3 & 0xffffffff) + (x2 >> 32)
      u3 <= (1_8fffffe2 >> 32) + ffffffff          + 87ffffe4
      u3 <= 1_87ffffe4

      u4 <= (u3         >> 32) + (u5 & 3)
      u4 <= (1_87ffffe4 >> 32) + 3
(85)  u4 <= 4

u0, u1, u2, u3, and u4, do not overflow the 64-bit integers they are represented with.

From (84) and (85):

H <= 4_ffffffff_ffffffff_ffffffff_ffffffff

Since h starts at zero and is only modified through the compression function, we deduce by induction that it never exceeds the number above.

Final reduction modulo p

const u64 u0 = (u64)5     + ctx->h[0];
const u64 u1 = (u0 >> 32) + ctx->h[1];
const u64 u2 = (u1 >> 32) + ctx->h[2];
const u64 u3 = (u2 >> 32) + ctx->h[3];
const u64 u4 = (u3 >> 32) + ctx->h[4];

const u64 uu0 = (u4 >> 2) * 5 + ctx->h[0] + ctx->pad[0];
const u64 uu1 = (uu0 >> 32)   + ctx->h[1] + ctx->pad[1];
const u64 uu2 = (uu1 >> 32)   + ctx->h[2] + ctx->pad[2];
const u64 uu3 = (uu2 >> 32)   + ctx->h[3] + ctx->pad[3];

store32_le(mac     , uu0);
store32_le(mac +  4, uu1);
store32_le(mac +  8, uu2);
store32_le(mac + 12, uu3);

Subtract or not subtract

const u64 u0 = (u64)5     + ctx->h[0];
const u64 u1 = (u0 >> 32) + ctx->h[1];
const u64 u2 = (u1 >> 32) + ctx->h[2];
const u64 u3 = (u2 >> 32) + ctx->h[3];
const u64 u4 = (u3 >> 32) + ctx->h[4];

From the above:

h == h0
   + h1 <<  32
   + h2 <<  64
   + h3 <<  96
   + h4 << 128

h + 5 == h0 + 5
       + h1 <<  32
       + h2 <<  64
       + h3 <<  96
       + h4 << 128

h + 5 == u0
       + h1 <<  32
       + h2 <<  64
       + h3 <<  96
       + h4 << 128

h + 5 == u0 & ffffffff
       + (u0 >> 32) + h1 <<  32
       +              h2 <<  64
       +              h3 <<  96
       +              h4 << 128

h + 5 == u0 & ffffffff
       + u1 & ffffffff   <<  32
       + (u1 >> 32) + h2 <<  64
       +              h3 <<  96
       +              h4 << 128

h + 5 == u0 & ffffffff
       + u1 & ffffffff   <<  32
       + u2 & ffffffff   <<  64
       + (u2 >> 32) + h3 <<  96
       +              h4 << 128

h + 5 == u0 & ffffffff
       + u1 & ffffffff   <<  32
       + u2 & ffffffff   <<  64
       + u3 & ffffffff   <<  96
       + (u3 >> 32) + h4 << 128

h + 5 == u0 & ffffffff
       + u1 & ffffffff <<  32
       + u2 & ffffffff <<  64
       + u3 & ffffffff <<  96
       + u4            << 128

Recall that p equals (1 << 130) - 5.

(h + 5) >= 1 << 130      iff  u4 >> 2 >= 1
 h      >= 1 << 130 - 5  iff  u4 >> 2 >= 1
 h      >= p             iff  u4 >> 2 >= 1

u exceeds zero iff h exceeds p. Additionally, since:

 h <= 4_ffffffff_ffffffff_ffffffff_ffffffff

Then:

 h + 5   <= 5_00000000_00000000_00000000_00000004
 h + 5   <  (1 << 130) * 2
 u4 >> 2 <= 1

u4 >> 2 never exceeds 1. Thus:

u4 >> 2 == 1  if  h >= p
u4 >> 2 == 0  if  h <  p

u4 >> 2 == 1  if  h % p == h - p
u4 >> 2 == 0  if  h % p == p

To obtain a fully reduced representation of h modulo p, we must subtract p iff u4 >> 2 is not zero.

final carry propagation (if any)

const u64 uu0 = (u4 >> 2) * 5 + ctx->h[0] + ctx->pad[0];
const u64 uu1 = (uu0 >> 32)   + ctx->h[1] + ctx->pad[1];
const u64 uu2 = (uu1 >> 32)   + ctx->h[2] + ctx->pad[2];
const u64 uu3 = (uu2 >> 32)   + ctx->h[3] + ctx->pad[3];

Let u5, uu4, and pad be:

u5  == (u4 >> 2) * 5

uu4 == (uu3 >> 32) + h4

pad == pad0
     + pad1 <<  32
     + pad2 <<  64
     + pad3 << 128

Recall that:

u5 == 5  if  h % p == h - p
u5 == 0  if  h % p == p

Now:

h + pad == h0 + pad0
         + h1 + pad1 <<  32
         + h2 + pad2 <<  64
         + h3 + pad3 <<  96
         + h4        << 128

h + pad == u5 + h0 + pad0
         +      h1 + pad1 <<  32
         +      h2 + pad2 <<  64
         +      h3 + pad3 <<  96
         +      h4        << 128
         - u5

h + pad == uu0
         +      h1 + pad1 <<  32
         +      h2 + pad2 <<  64
         +      h3 + pad3 <<  96
         +      h4        << 128
         - u5

h + pad ==  uu0 & ffffffff
         + (uu0 >> 32) + h1 + pad1 <<  32
         +               h2 + pad2 <<  64
         +               h3 + pad3 <<  96
         +               h4        << 128
         - u5

h + pad ==  uu0 & ffffffff
         +  uu1 & ffffffff         <<  32
         + (uu1 >> 32) + h2 + pad2 <<  64
         +               h3 + pad3 <<  96
         +               h4        << 128
         - u5

h + pad ==  uu0 & ffffffff
         +  uu1 & ffffffff  <<  32
         +  uu2 & ffffffff  <<  64
         +  uu3 & ffffffff  <<  96
         + (uu3 >> 32) + h4 << 128
         - u5

h + pad == uu0 & ffffffff
         + uu1 & ffffffff <<  32
         + uu2 & ffffffff <<  64
         + uu3 & ffffffff <<  96
         + uu4            << 128
         - u5

If u5 equals zero, then h was fully reduced to begin with, and no computation needed to be done–and none was:

h == h % p

(h % p) + pad == uu0 & ffffffff
               + uu1 & ffffffff <<  32
               + uu2 & ffffffff <<  64
               + uu3 & ffffffff <<  96
               + uu4            << 128
               - u5

(h % p) + pad == uu0 & ffffffff
               + uu1 & ffffffff <<  32
               + uu2 & ffffffff <<  64
               + uu3 & ffffffff <<  96
               + uu4            << 128

If, however, u5 equals five, then:

h            >= p
h            >= 1 << 130 - 5
h + 5        >= 1 << 130
uu4 >> 2     <= 1
uu4 >> 2 - 1 <= 1

h + pad == uu0 & ffffffff
         + uu1 & ffffffff <<  32
         + uu2 & ffffffff <<  64
         + uu3 & ffffffff <<  96
         + uu4            << 128
         - 5

h + pad == uu0 & ffffffff
         + uu1 & ffffffff <<  32
         + uu2 & ffffffff <<  64
         + uu3 & ffffffff <<  96
         + uu4 & 3        << 128
         + (uu4 >> 2)     << 130
         - 5

h + pad == uu0 & ffffffff
         + uu1 & ffffffff <<  32
         + uu2 & ffffffff <<  64
         + uu3 & ffffffff <<  96
         + uu4 & 3        << 128
         + (uu4 >> 2) - 1 << 130
         + (1 << 130) - 5

h + pad == uu0 & ffffffff
         + uu1 & ffffffff <<  32
         + uu2 & ffffffff <<  64
         + uu3 & ffffffff <<  96
         + uu4 & 3        << 128
         + (uu4 >> 2) - 1 << 130
         + p

(h - p) + pad == uu0 & ffffffff
               + uu1 & ffffffff <<  32
               + uu2 & ffffffff <<  64
               + uu3 & ffffffff <<  96
               + uu4 & 3        << 128
               + (uu4 >> 2) - 1 << 130

(h % p) + pad == uu0 & ffffffff
               + uu1 & ffffffff <<  32
               + uu2 & ffffffff <<  64
               + uu3 & ffffffff <<  96
               + uu4 & 3        << 128
               + (uu4 >> 2) - 1 << 130

Recall that in this case:

h % p          == h - p
(uu4 >> 2) - 1 <= 0

No funny underflow there.

serialisation

store32_le(mac     , uu0);
store32_le(mac +  4, uu1);
store32_le(mac +  8, uu2);
store32_le(mac + 12, uu3);

We store the 128 least significant bits of (h % p) + pad in little-endian. Depending on the carry propagation above, one of the following is true:

(h % p) + pad == uu0 & ffffffff
               + uu1 & ffffffff <<  32
               + uu2 & ffffffff <<  64
               + uu3 & ffffffff <<  96
               + uu4 & 3        << 128
               + (uu4 >> 2) - 1 << 130

Or:

(h % p) + pad == uu0 & ffffffff
               + uu1 & ffffffff <<  32
               + uu2 & ffffffff <<  64
               + uu3 & ffffffff <<  96
               + uu4            << 128

In both cases, though:

((h % p) + pad) % (1 << 128) == uu0 & ffffffff
                              + uu1 & ffffffff <<  32
                              + uu2 & ffffffff <<  64
                              + uu3 & ffffffff <<  96

So, we store the right bits either way.

overflow

From (1), and the fact that pad is represented with 32-bit limbs:

u0 <= 5          + ffffffff
u1 <= (u0 >> 32) + ffffffff
u2 <= (u1 >> 32) + ffffffff
u3 <= (u2 >> 32) + ffffffff
u4 <= (u3 >> 32) +        4

uu0 <= (u4 >> 2) * 5 + ffffffff + ffffffff
uu1 <= (uu0 >> 32)   + ffffffff + ffffffff
uu2 <= (uu1 >> 32)   + ffffffff + ffffffff
uu3 <= (uu2 >> 32)   + ffffffff + ffffffff

u0 <= 1_00000004
u1 <= 1_00000000
u2 <= 1_00000000
u3 <= 1_00000000
u4 <=          5

uu0 <= 2_00000003
uu1 <= 2_00000000
uu2 <= 2_00000000
uu3 <= 2_00000000

None the variables above overflow the 64-bit integers they are represented with.

QED. I think. Can I implement my own crypto now?