Register pressure at VLEN=256, relative to the 4×8 baseline (25 of 32 physical registers)
Naive 4×16: all 16 columns processed simultaneously in one MAC chain per row.
The RHS must cover 16 cols × 32 K-elements = 256 bytes, requiring i8m8 for the raw
load. The MAC accumulator must hold 128 i16 values → i16m8 (8 phys regs). The
sumf accumulators need 16 floats per row → f32m2 (2 phys regs each).
Total: 42 of 32 — impossible at VLEN=256.
| Variable | Type | Phys regs | vs 4×8 |
|---|---|---|---|
| sumf0–3 (4 rows × 16 cols) | f32m2 ×4 | 8 | +4 |
| rhs_lo_0/1/2/3, hi_0/1/2/3 | i8m4 ×4 | 16 | +8 |
| lhs_0_8 .. lhs_3_8 | i8m2 ×4 | 8 | same |
| MAC accumulator | i16m8 ×1 | 8 | +4 |
| b_scales_vec (16 cols) | f32m2 ×1 | 2 | +1 |
| Peak total | | 42 of 32 | +17 ✗ |
The overflow is not marginal: the kernel needs 10 more registers than the register file
provides. The compiler would be forced to spill and reload vector registers on every loop
iteration, and the resulting memory traffic would eliminate the performance benefit entirely.
Naive 4×16 is not viable at VLEN=256.
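The register counts in the table above follow from simple register-group arithmetic, which the sketch below reproduces. It is an illustration only: the helper `phys_regs` and the `NAIVE_4X16` dictionary are hypothetical names introduced here, and the rhs and lhs rows are copied from the table rather than derived.

```python
import math

VLEN_BITS = 256   # vector register width assumed throughout this section
REG_FILE = 32     # architectural vector registers

def phys_regs(elem_bits, count, vlen_bits=VLEN_BITS):
    """Registers in the smallest register group (LMUL 1, 2, 4 or 8) that holds the data."""
    needed = math.ceil(elem_bits * count / vlen_bits)
    for lmul in (1, 2, 4, 8):
        if needed <= lmul:
            return lmul
    raise ValueError("does not fit in a single LMUL=8 group")

# variable -> (registers per variable, number of variables)
NAIVE_4X16 = {
    "sumf0-3":   (phys_regs(32, 16), 4),   # 16 floats per row -> f32m2
    "rhs lo/hi": (4, 4),                   # i8m4 x4, taken from the table
    "lhs":       (2, 4),                   # i8m2 x4, taken from the table
    "mac":       (phys_regs(16, 128), 1),  # 128 i16 values -> i16m8
    "b_scales":  (phys_regs(32, 16), 1),   # 16 floats -> f32m2
}

def total_regs(budget):
    """Sum of physical registers over all variables in a budget."""
    return sum(regs * n for regs, n in budget.values())

naive_total = total_regs(NAIVE_4X16)
overshoot = naive_total - REG_FILE
print(f"naive 4x16: {naive_total} of {REG_FILE}, {overshoot} over budget")
```

The same helper also accounts for the raw RHS load mentioned earlier: `phys_regs(8, 256)` gives 8, matching the i8m8 requirement for 256 bytes.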
Two-pass 4×16 (interleaved K-loop): processes 8 columns per MAC chain pass.
The MAC accumulator stays i16m4. The only extra cost over 4×8 is 4 additional
f32m1 accumulators (sumf4–7) for the second column group, which remain live across
both passes within each row block. RHS registers are freed and reused between passes.
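A scalar model can make the loop structure concrete. The sketch below is not the vector kernel: it is a plain-Python illustration under simplifying assumptions (integer inputs, one hypothetical float scale per K-block and column, no LHS scales), and the names `two_pass_tile` and `reference_tile` are invented here. It shows the lhs block being loaded once per K-block and shared by both passes, while each pass touches only its own 8 columns of rhs and scales.

```python
ROWS, COLS, GROUP, KBLOCK = 4, 16, 8, 32  # 4x16 tile, 8 columns per pass, 32 K-elements per block

def two_pass_tile(lhs, rhs, scales):
    """lhs: ROWS x K ints, rhs: K x COLS ints, scales: (K // KBLOCK) x COLS floats."""
    sumf = [[0.0] * COLS for _ in range(ROWS)]  # sumf0-3 (cols 0-7) and sumf4-7 (cols 8-15)
    for blk in range(len(rhs) // KBLOCK):
        k0 = blk * KBLOCK
        lhs_blk = [row[k0:k0 + KBLOCK] for row in lhs]  # loaded once, used by both passes
        for p in range(COLS // GROUP):                  # pass 0: cols 0-7, pass 1: cols 8-15
            c0 = p * GROUP
            # Only this pass's rhs half and scales are needed from here on.
            rhs_half = [r[c0:c0 + GROUP] for r in rhs[k0:k0 + KBLOCK]]
            scale_half = scales[blk][c0:c0 + GROUP]
            for r in range(ROWS):
                for c in range(GROUP):
                    # Integer MAC chain over the K-block, then one float update.
                    acc = sum(lhs_blk[r][k] * rhs_half[k][c] for k in range(KBLOCK))
                    sumf[r][c0 + c] += acc * scale_half[c]
    return sumf

def reference_tile(lhs, rhs, scales):
    """Direct computation of every output, one (row, column) at a time."""
    out = [[0.0] * COLS for _ in range(ROWS)]
    for r in range(ROWS):
        for c in range(COLS):
            for blk in range(len(rhs) // KBLOCK):
                k0 = blk * KBLOCK
                acc = sum(lhs[r][k] * rhs[k][c] for k in range(k0, k0 + KBLOCK))
                out[r][c] += acc * scales[blk][c]
    return out
```

Because each output element receives its per-block contributions in the same order in both functions, the two results should agree exactly, which is a convenient check when restructuring the loop.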
| Variable | Type | Phys regs | vs 4×8 |
|---|---|---|---|
| sumf0–3 (cols 0–7, all rows) | f32m1 ×4 | 4 | same |
| sumf4–7 (cols 8–15, all rows) | f32m1 ×4 | 4 | +4 new |
| lhs_0..3 (current row, both passes) | i8m2 ×4 | 8 | same |
| rhs halves (one pass at a time) | i8m2 ×4 | 8 | same |
| MAC accumulator | i16m4 ×1 | 4 | same |
| b_scales_vec (one pass at a time) | f32m1 ×1 | 1 | same |
| Peak total | | 29 of 32 | +4 ✓ |
That leaves 3 registers spare. The key insight is that the rhs halves from pass 1 are freed
before pass 2 loads its own, so the two sets never coexist. The lhs registers are the only
inputs held across both passes, and they already existed in the 4×8 baseline. The sole cost is
the 4 extra sumf accumulators, the minimum possible overhead for doubling the output column
count.
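The peak of 29 can be checked with a small liveness model, sketched below. The phase names and the `LIVE` table are hypothetical bookkeeping introduced for illustration; the register counts come from the table above. The model also shows what would happen if the rhs halves and scales of both passes were kept live at once.

```python
REG_FILE = 32
PHASES = ("pass1", "pass2")  # the two column-group passes within one K-block

# value -> (physical registers, phases in which it is live)
LIVE = {
    "sumf0-3":             (4, {"pass1", "pass2"}),
    "sumf4-7":             (4, {"pass1", "pass2"}),
    "lhs_0..3":            (8, {"pass1", "pass2"}),
    "rhs halves (pass 1)": (8, {"pass1"}),
    "rhs halves (pass 2)": (8, {"pass2"}),
    "MAC accumulator":     (4, {"pass1", "pass2"}),
    "b_scales (pass 1)":   (1, {"pass1"}),
    "b_scales (pass 2)":   (1, {"pass2"}),
}

def pressure(phase):
    """Registers occupied by all values live during the given phase."""
    return sum(regs for regs, phases in LIVE.values() if phase in phases)

peak = max(pressure(p) for p in PHASES)
spare = REG_FILE - peak

# Counterfactual: nothing is freed between passes, so everything coexists.
coexisting = sum(regs for regs, _ in LIVE.values())

print(f"peak {peak} of {REG_FILE}, {spare} spare; {coexisting} if both passes' inputs coexist")
```

In this model, reusing the rhs and scale registers between passes is what separates fitting in the register file from overflowing it.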