4×16 Peak Allocation — Naive vs Two-Pass

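Naive 4×16: all 16 columns are processed simultaneously in one MAC chain per row. The raw RHS load must cover 16 cols × 32 K-elements; the stated 256 bytes implies these are packed at 4 bits each, and 256 bytes fills a full i8m8 group (8 registers of 32 bytes at VLEN=256). The MAC accumulator must hold 128 i16 values → i16m8 (8 phys regs). The sumf accumulators need 16 floats per row → f32m2 (2 phys regs each). With the unpacked rhs and lhs operands from the table below, the peak is 42 registers against the 32 available, which is impossible at VLEN=256.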
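The LMUL choices in the paragraph above follow from dividing the data size by the register width. A minimal sketch of that arithmetic, assuming VLEN=256 and integer LMUL only; the helper name `lmul_for` is hypothetical and not part of any kernel source:

```python
VLEN_BITS = 256  # vector register width assumed throughout this section


def lmul_for(num_elements: int, element_bits: int, vlen_bits: int = VLEN_BITS) -> int:
    """Return the smallest LMUL in {1, 2, 4, 8} whose register group holds the data."""
    total_bits = num_elements * element_bits
    for lmul in (1, 2, 4, 8):
        if lmul * vlen_bits >= total_bits:
            return lmul
    raise ValueError("data does not fit in a single register group")


raw_rhs_lmul = lmul_for(256, 8)   # 256 bytes of raw RHS data
mac_lmul = lmul_for(128, 16)      # 128 i16 MAC accumulator lanes
sumf_lmul = lmul_for(16, 32)      # 16 f32 sums per row

print(raw_rhs_lmul, mac_lmul, sumf_lmul)
```

Since LMUL is also the number of physical registers a group occupies, each result is the per-variable register cost.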

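| Variable | Type | Phys regs | vs 4×8 |
|---|---|---|---|
| sumf0–3 (4 rows × 16 cols) | f32m2 ×4 | 8 | +4 |
| rhs_lo_0/1/2/3, hi_0/1/2/3 | i8m4 ×4 | 16 | +8 |
| lhs_0_8 .. lhs_3_8 | i8m2 ×4 | 8 | same |
| MAC accumulator | i16m8 ×1 | 8 | +4 |
| b_scales_vec (16 cols) | f32m2 ×1 | 2 | +1 |
| Peak total | | 42 of 32 | +17 ✗ |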

The register file overflow is not marginal: 42 registers against a budget of 32 is 10 over. The compiler would be forced to spill and reload vector registers on every loop iteration, replacing register accesses with memory traffic and eliminating the performance benefit entirely. Naive 4×16 is not viable at VLEN=256.
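The naive tally can be checked mechanically. The sketch below sums LMUL × count for each variable group, where a group of LMUL n occupies n physical registers. The 4×8 baseline rows are an assumption reconstructed from the types this section marks as "same" for the two-pass variant; the list names and the `phys_regs` helper are hypothetical.

```python
REGISTER_FILE = 32  # RVV provides 32 architectural vector registers


def phys_regs(rows):
    """Sum physical registers: each variable group costs LMUL registers."""
    return sum(lmul * count for _name, lmul, count in rows)


# (variable, LMUL, count), taken from the naive 4x16 allocation
naive_4x16 = [
    ("sumf0-3", 2, 4),
    ("rhs lo/hi", 4, 4),
    ("lhs_0_8..lhs_3_8", 2, 4),
    ("MAC accumulator", 8, 1),
    ("b_scales_vec", 2, 1),
]

# Assumed 4x8 baseline, inferred from the rows marked "same" in the two-pass table
baseline_4x8 = [
    ("sumf0-3", 1, 4),
    ("rhs halves", 2, 4),
    ("lhs", 2, 4),
    ("MAC accumulator", 4, 1),
    ("b_scales_vec", 1, 1),
]

naive_total = phys_regs(naive_4x16)
baseline_total = phys_regs(baseline_4x8)
overflow = naive_total - REGISTER_FILE
extra_vs_4x8 = naive_total - baseline_total

print(f"naive: {naive_total} of {REGISTER_FILE}, over by {overflow}, +{extra_vs_4x8} vs 4x8")
```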

Two-pass 4×16 (interleaved K-loop): each MAC chain pass processes 8 columns, so the MAC accumulator stays i16m4. The only extra cost over 4×8 is 4 additional f32m1 accumulators (sumf4–7) for the second column group; these stay live across both passes within each row block. The RHS registers are freed after one pass and reused by the next.
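The loop structure described above can be illustrated with a scalar model. This is a sketch under simplifying assumptions, not the kernel itself: it uses plain integers in place of quantized blocks and scales, and every name in it (`two_pass_gemm`, `K_BLOCK`, `GROUP`) is hypothetical. It only shows that two 8-column passes per K block, sharing one lhs block, produce the same result as a single 16-column pass.

```python
import random

ROWS, COLS, K = 4, 16, 64
K_BLOCK, GROUP = 32, 8  # K elements per block, columns per pass


def naive_gemm(lhs, rhs):
    """Reference: all 16 columns accumulated in one go."""
    return [[sum(lhs[r][k] * rhs[k][c] for k in range(K)) for c in range(COLS)]
            for r in range(ROWS)]


def two_pass_gemm(lhs, rhs):
    # acc[0][r] models sumf0-3 (cols 0-7), acc[1][r] models sumf4-7 (cols 8-15)
    acc = [[[0] * GROUP for _ in range(ROWS)] for _ in range(COLS // GROUP)]
    for k0 in range(0, K, K_BLOCK):
        # The lhs block is fetched once per K block and shared by both passes
        lhs_block = [row[k0:k0 + K_BLOCK] for row in lhs]
        for g in range(COLS // GROUP):  # pass 1: cols 0-7, pass 2: cols 8-15
            c0 = g * GROUP
            # Rebinding rhs_block on each pass models freeing and reusing the rhs registers
            rhs_block = [rhs[k][c0:c0 + GROUP] for k in range(k0, k0 + K_BLOCK)]
            for r in range(ROWS):
                for c in range(GROUP):
                    acc[g][r][c] += sum(lhs_block[r][k] * rhs_block[k][c]
                                        for k in range(K_BLOCK))
    # Concatenate the two column groups back into 16-column rows
    return [acc[0][r] + acc[1][r] for r in range(ROWS)]


rng = random.Random(0)
lhs = [[rng.randint(-8, 7) for _ in range(K)] for _ in range(ROWS)]
rhs = [[rng.randint(-8, 7) for _ in range(COLS)] for _ in range(K)]
matches = two_pass_gemm(lhs, rhs) == naive_gemm(lhs, rhs)
print("two-pass matches naive:", matches)
```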

| Variable | Type | Phys regs | vs 4×8 |
|---|---|---|---|
| sumf0–3 (cols 0–7, all rows) | f32m1 ×4 | 4 | same |
| sumf4–7 (cols 8–15, all rows) | f32m1 ×4 | 4 | +4 new |
| lhs_0..3 (current row, both passes) | i8m2 ×4 | 8 | same |
| rhs halves (one pass at a time) | i8m2 ×4 | 8 | same |
| MAC accumulator | i16m4 ×1 | 4 | same |
| b_scales_vec (one pass at a time) | f32m1 ×1 | 1 | same |
| Peak total | | 29 of 32 | +4 ✓ |

That leaves 3 registers spare. The key insight is that the rhs halves from pass 1 are freed before pass 2 loads its own, so they never coexist. Apart from the sumf accumulators, the lhs registers are the only ones held across both passes, and they already existed in the 4×8 baseline. The sole cost is the 4 extra sumf accumulators: the 4 rows × 8 additional columns produce 32 more f32 outputs, which at VLEN=256 fill exactly 4 f32m1 registers, so this is the minimum overhead for keeping the doubled output column count in registers.
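The effect of freeing the rhs halves between passes can be seen in a small liveness model. This is a sketch assuming the register counts from the two-pass allocation above; the dictionary names and the `peak` helper are hypothetical. Peak pressure is the always-live set plus the largest single-pass set, rather than the sum over both passes.

```python
REGISTER_FILE = 32  # RVV provides 32 architectural vector registers

# Physical registers held across both passes
always_live = {"sumf0-3": 4, "sumf4-7": 4, "lhs_0..3": 8}

# Physical registers needed only while one pass is running
per_pass = {"rhs halves": 8, "MAC accumulator": 4, "b_scales_vec": 1}


def peak(always, passes):
    """Peak pressure when each pass's temporaries are freed before the next pass."""
    return sum(always.values()) + max(sum(p.values()) for p in passes)


two_pass_peak = peak(always_live, [per_pass, per_pass])
spare = REGISTER_FILE - two_pass_peak

# Hypothetical counter-case: pass 1 rhs halves still live while pass 2 runs
coexisting_rhs_peak = two_pass_peak + per_pass["rhs halves"]

print(f"two-pass peak: {two_pass_peak} of {REGISTER_FILE}, spare {spare}")
print(f"if rhs halves coexisted: {coexisting_rhs_peak} of {REGISTER_FILE}")
```

In this model, keeping both sets of rhs halves live would push the peak past 32, which is why the reuse between passes matters.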