4×16 Peak Allocation — Naive vs Two-Pass

Register pressure at VLEN=256, relative to the 4×8 baseline (25 of 32 physical registers)

Naive 4×16: all 16 columns are processed simultaneously in one MAC chain per row. The RHS must cover 16 cols × 32 K-elements = 256 bytes, requiring i8m8 for the raw load. The MAC accumulator must hold 128 i16 values, which needs i16m8 (8 phys regs). The sumf accumulators need 16 floats per row, which needs f32m2 (2 phys regs each). Total: 42 registers against the 32 available, which is impossible at VLEN=256.

| Variable | Type | Phys regs | vs 4×8 |
|---|---|---|---|
| sumf0–3 (4 rows × 16 cols) | f32m2 ×4 | 8 | +4 |
| rhs_lo_0/1/2/3, hi_0/1/2/3 | i8m4 ×4 | 16 | +8 |
| lhs_0_8 .. lhs_3_8 | i8m2 ×4 | 8 | same |
| MAC accumulator | i16m8 ×1 | 8 | +4 |
| b_scales_vec (16 cols) | f32m2 ×1 | 2 | +1 |
| Peak total | | 42 of 32 | +17 ✗ |

The register file overflow is not marginal: the naive layout is 10 registers over budget. The compiler would be forced to spill and reload vector registers on every loop iteration, replacing register accesses with memory traffic and eliminating the performance benefit of the wider tile. Naive 4×16 is not viable at VLEN=256.
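
The register counts above follow from simple arithmetic on VLEN and element width. The sketch below reproduces them. It is an illustration only: `lmul_for` is a hypothetical helper, it ignores fractional LMUL, and the rhs and lhs group sizes are copied from the table rather than derived.

```python
VLEN_BITS = 256   # vector register width assumed throughout this section
PHYS_REGS = 32    # architectural vector registers


def lmul_for(num_elements, element_bits, vlen_bits=VLEN_BITS):
    """Smallest whole register group (LMUL 1, 2, 4 or 8) that holds the data."""
    total_bits = num_elements * element_bits
    for lmul in (1, 2, 4, 8):
        if lmul * vlen_bits >= total_bits:
            return lmul
    raise ValueError("data does not fit in a single LMUL=8 group")


# (variable, LMUL per group, number of groups) for the naive 4x16 layout
naive_4x16 = [
    ("sumf0-3", lmul_for(16, 32), 4),        # 16 floats per row -> f32m2
    ("rhs lo/hi", 4, 4),                     # i8m4 x4, taken from the table
    ("lhs_0_8 .. lhs_3_8", 2, 4),            # i8m2 x4, taken from the table
    ("MAC accumulator", lmul_for(128, 16), 1),  # 128 i16 values -> i16m8
    ("b_scales_vec", lmul_for(16, 32), 1),   # 16 floats -> f32m2
]

naive_total = sum(lmul * count for _, lmul, count in naive_4x16)
overflow = naive_total - PHYS_REGS

print(f"naive 4x16 peak: {naive_total} of {PHYS_REGS} (over by {overflow})")
```

Changing `VLEN_BITS` shows how the same tile shape would be sized on a wider implementation, under the same simplifying assumptions.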
Two-pass 4×16 (interleaved K-loop): each MAC chain pass processes 8 columns, so the MAC accumulator stays i16m4. The only extra cost over 4×8 is 4 additional f32m1 accumulators (sumf4–7) for the second column group, which remain live across both passes within each row block. RHS registers are freed and reused between passes.

| Variable | Type | Phys regs | vs 4×8 |
|---|---|---|---|
| sumf0–3 (cols 0–7, all rows) | f32m1 ×4 | 4 | same |
| sumf4–7 (cols 8–15, all rows) | f32m1 ×4 | 4 | +4 new |
| lhs_0..3 (current row, both passes) | i8m2 ×4 | 8 | same |
| rhs halves (one pass at a time) | i8m2 ×4 | 8 | same |
| MAC accumulator | i16m4 ×1 | 4 | same |
| b_scales_vec (one pass at a time) | f32m1 ×1 | 1 | same |
| Peak total | | 29 of 32 | +4 ✓ |

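
The two-pass structure can be modeled in scalar code to check that splitting the columns into two groups inside the K-loop gives the same result as a single 4×16 computation. This is a minimal sketch, not the vector kernel: the data layout, the helper names, and the use of a single per-column scale are assumptions made for illustration.

```python
import random

ROWS, COLS, PASS_COLS, BLOCK_K = 4, 16, 8, 32


def dot(a, b):
    """Integer dot product, standing in for one MAC chain."""
    return sum(x * y for x, y in zip(a, b))


def tile_reference(lhs, rhs, scales):
    """Single-pass 4x16 reference: every column handled in one sweep."""
    out = [[0.0] * COLS for _ in range(ROWS)]
    for lhs_blk, rhs_blk, scale_blk in zip(lhs, rhs, scales):
        for r in range(ROWS):
            for c in range(COLS):
                out[r][c] += dot(lhs_blk[r], rhs_blk[c]) * scale_blk[c]
    return out


def tile_two_pass(lhs, rhs, scales):
    """Two passes of 8 columns per K-block; lhs is reused by both passes."""
    # sumf[0] models sumf0-3 (cols 0-7), sumf[1] models sumf4-7 (cols 8-15).
    sumf = [[[0.0] * PASS_COLS for _ in range(ROWS)] for _ in range(2)]
    for lhs_blk, rhs_blk, scale_blk in zip(lhs, rhs, scales):
        for p in range(2):
            col0 = p * PASS_COLS
            # Only one column group's rhs and scales are held at a time.
            rhs_half = rhs_blk[col0:col0 + PASS_COLS]
            b_scales = scale_blk[col0:col0 + PASS_COLS]
            for r in range(ROWS):
                mac = [dot(lhs_blk[r], col) for col in rhs_half]
                for c in range(PASS_COLS):
                    sumf[p][r][c] += mac[c] * b_scales[c]
    return [sumf[0][r] + sumf[1][r] for r in range(ROWS)]


def make_inputs(num_blocks, seed=0):
    """Small signed integers and power-of-two scales (hypothetical test data)."""
    rng = random.Random(seed)
    lhs = [[[rng.randint(-8, 7) for _ in range(BLOCK_K)]
            for _ in range(ROWS)] for _ in range(num_blocks)]
    rhs = [[[rng.randint(-8, 7) for _ in range(BLOCK_K)]
            for _ in range(COLS)] for _ in range(num_blocks)]
    scales = [[rng.choice([0.25, 0.5, 1.0, 2.0]) for _ in range(COLS)]
              for _ in range(num_blocks)]
    return lhs, rhs, scales
```

Calling `tile_two_pass(*make_inputs(3))` and `tile_reference(*make_inputs(3))` should return identical 4×16 results, since each output element sees the same operations in the same order in both versions.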
The two-pass layout leaves 3 registers spare. The key insight is that the rhs halves from pass 1 are freed before pass 2 loads its own, so they never coexist. The lhs registers are the only operand registers held across both passes, and they already existed in the 4×8 baseline. The sole cost is the 4 extra sumf accumulators, the minimum possible overhead for doubling the output column count.
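
The effect of register reuse between passes can be checked with a small live-range tally. The sketch below is a simplified model under stated assumptions: liveness is tracked per pass only, the register counts come from the two-pass table, and the names are labels for illustration rather than identifiers from the kernel.

```python
PHYS_REGS = 32

# (variable, physical registers, passes in which it is live)
live_ranges = [
    ("sumf0-3", 4, {1, 2}),           # accumulators stay live throughout
    ("sumf4-7", 4, {1, 2}),
    ("lhs_0..3", 8, {1, 2}),          # loaded once, used by both passes
    ("rhs halves (pass 1)", 8, {1}),
    ("rhs halves (pass 2)", 8, {2}),  # reuses the registers freed after pass 1
    ("MAC accumulator (pass 1)", 4, {1}),
    ("MAC accumulator (pass 2)", 4, {2}),
    ("b_scales_vec (pass 1)", 1, {1}),
    ("b_scales_vec (pass 2)", 1, {2}),
]


def peak_pressure(ranges, passes=(1, 2)):
    """Largest number of registers live during any single pass."""
    return max(
        sum(regs for _, regs, live in ranges if p in live)
        for p in passes
    )


peak = peak_pressure(live_ranges)
# Hypothetical worst case: nothing is freed between passes.
no_reuse = sum(regs for _, regs, _ in live_ranges)

print(f"peak with reuse:    {peak} of {PHYS_REGS} ({PHYS_REGS - peak} spare)")
print(f"peak without reuse: {no_reuse} of {PHYS_REGS}")
```

Under this model, holding both passes' rhs, MAC, and scale registers at once would reach the same 42-register total as the naive layout, which is why freeing the per-pass registers is what makes the tile fit.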