4×8 Register Pressure — RISC-V Vector, VLEN=256

Physical register allocation at peak (inside MAC chain, inner K-block loop)

At VLEN=256, each physical vector register holds 256 bits. RISC-V V provides 32 physical vector registers (v0–v31). LMUL groups consecutive physical registers into one wider logical register.
| Logical type | LMUL | Phys regs | Elements | Use in kernel |
|---|---|---|---|---|
| f32m1 | 1 | 1 | 8 f32 | sumf accumulators, b_scales |
| i8m2 | 2 | 2 | 64 i8 | RHS half-vectors, LHS chunks |
| i8m4 | 4 | 4 | 128 i8 | raw B block load |
| i16m4 | 4 | 4 | 64 i16 | MAC accumulator (per row) |
LMUL = physical register cost. An i16m4 costs 4 physical registers at any VLEN. At VLEN=256 that is exactly what the MAC accumulator needs: widening 64 i8 products into i16 requires 64 × 16 = 1024 bits = 4 × 256-bit registers, so no smaller type fits the MAC accumulator.
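The type table and the 1024-bit argument can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the kernel: the helper names `logical_type` and `min_lmul` are hypothetical, and only integer LMUL values are considered.

```python
# Sketch: element count and physical-register cost of an RVV logical type.
VLEN = 256  # bits per physical vector register, as assumed in this section


def logical_type(sew_bits, lmul, vlen=VLEN):
    """Return (physical registers, elements) for one logical register group."""
    phys_regs = lmul                    # LMUL is the physical register cost
    elements = vlen * lmul // sew_bits  # total bits divided by element width
    return phys_regs, elements


def min_lmul(n_elements, sew_bits, vlen=VLEN):
    """Smallest integer LMUL whose group holds n_elements of sew_bits each."""
    for lmul in (1, 2, 4, 8):
        if vlen * lmul // sew_bits >= n_elements:
            return lmul
    return None  # does not fit even at LMUL=8


# (element width in bits, LMUL) for the four types used in the kernel
types = {"f32m1": (32, 1), "i8m2": (8, 2), "i8m4": (8, 4), "i16m4": (16, 4)}
table = {name: logical_type(sew, lmul) for name, (sew, lmul) in types.items()}

for name, (regs, elems) in table.items():
    print(f"{name:6s} phys regs = {regs}, elements = {elems}")

# 64 widened i16 products need 64 * 16 = 1024 bits, i.e. LMUL=4 at VLEN=256.
mac_lmul = min_lmul(64, 16)
print("MAC accumulator LMUL:", mac_lmul)
```

Passing a different `vlen` shows how the element counts scale while the register cost of each type stays equal to its LMUL.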
Peak pressure occurs inside the MAC chain for one row. All four RHS half-vectors, all four LHS chunks, the MAC accumulator, all four float accumulators, and b_scales must be live simultaneously. Physical registers = count × LMUL.
| Variable | Type | LMUL | Count | Calculation → phys regs |
|---|---|---|---|---|
| sumf0, sumf1, sumf2, sumf3 | f32m1 | 1 | 4 | 4 × 1 = 4 |
| rhs_lo_0, lo_1, hi_0, hi_1 | i8m2 | 2 | 4 | 4 × 2 = 8 |
| lhs_0_8, lhs_1_8, lhs_2_8, lhs_3_8 | i8m2 | 2 | 4 | 4 × 2 = 8 |
| sumi_lX (one row at a time) | i16m4 | 4 | 1 | 1 × 4 = 4 |
| b_scales_vec | f32m1 | 1 | 1 | 1 × 1 = 1 |
| Peak total | | | | 4 + 8 + 8 + 4 + 1 = 25 of 32 (7 free) |
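The peak total is simply count × LMUL summed over the live set. As a minimal sketch (the list layout and the name `live_set` are illustrative, not taken from the kernel source), the tally can be reproduced like this:

```python
# Sketch: tally physical registers for the live set at peak pressure.
TOTAL_REGS = 32  # v0-v31

# (description, logical type, LMUL, number of live values)
live_set = [
    ("sumf0..sumf3",     "f32m1", 1, 4),
    ("RHS half-vectors", "i8m2",  2, 4),
    ("LHS chunks",       "i8m2",  2, 4),
    ("sumi (one row)",   "i16m4", 4, 1),
    ("b_scales_vec",     "f32m1", 1, 1),
]

per_group = {desc: lmul * count for desc, _type, lmul, count in live_set}
peak = sum(per_group.values())
free = TOTAL_REGS - peak

for desc, regs in per_group.items():
    print(f"{desc:18s} {regs}")
print(f"peak = {peak} of {TOTAL_REGS}, free = {free}")
```

Adding or removing an entry in `live_set` gives a quick way to see how a change to the kernel would move the peak.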
Figure: register map at peak, inside the MAC chain for one row, covering all 32 physical registers (v0–v31). Each cell is one physical register; an LMUL=4 group spans 4 consecutive cells.

| Group | Phys regs | Registers |
|---|---|---|
| sumf0–3 | 4 | v0–v3 |
| RHS halves (i8m2) | 8 | v4–v11 |
| LHS chunks (i8m2) | 8 | v12–v19 |
| MAC acc i16m4 | 4 | v20–v23 |
| b_scales | 1 | v24 |
| free | 7 | v25–v31 |
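The map above can be reproduced by packing the groups in order. This is a minimal sketch under two assumptions: the hypothetical `allocate` helper places groups sequentially, and each group's base register number is rounded up to a multiple of its LMUL, as RVV requires for register groups. A real compiler's allocator is free to choose a different layout.

```python
# Sketch: sequential packing of register groups into v0-v31.
def allocate(groups, total=32):
    """groups: list of (label, lmul, count). Returns {label: (first, last)}."""
    next_reg = 0
    layout = {}
    for label, lmul, count in groups:
        # Round the base up to a multiple of LMUL (register-group alignment).
        base = -(-next_reg // lmul) * lmul
        end = base + lmul * count
        if end > total:
            raise ValueError(f"{label} does not fit in {total} registers")
        layout[label] = (base, end - 1)
        next_reg = end
    if next_reg < total:
        layout["free"] = (next_reg, total - 1)
    return layout


groups = [
    ("sumf",   1, 4),  # four f32m1 accumulators
    ("RHS",    2, 4),  # four i8m2 half-vectors
    ("LHS",    2, 4),  # four i8m2 chunks
    ("MAC",    4, 1),  # one i16m4 accumulator
    ("bscale", 1, 1),  # one f32m1 scale vector
]
layout = allocate(groups)

for label, (first, last) in layout.items():
    print(f"{label:7s} v{first}-v{last}")
```

In this ordering every group already starts on an aligned register, so no padding registers are lost.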
Note: rhs_raw_vec (the initial i8m4 load) is not counted here. By the time the MAC chain begins, it has been consumed to produce the four i8m2 half-vectors and is no longer live. If the compiler does not immediately reuse its 4 registers, peak pressure could briefly reach 25 + 4 = 29.
Comparing tile strategies. The 4×16 two-pass strategy processes cols 0–7 then cols 8–15 within each K-block iteration, keeping the A block live across both passes. The only additional cost is 4 extra f32m1 accumulators for the second column group.
| | 4×8 baseline | 4×16 two-pass |
|---|---|---|
| Output values / tile | 32 | 64 |
| A loads per K-block | | (reused) |
| B loads per output value | 1 / 8 cols | 1 / 16 cols (↓2×) |
| Peak phys regs used | 25 / 32 | 29 / 32 (tight) |
| Free registers | 7 | 3 (spill risk) |
| f32 lanes per sumf reg | 8 | 8 |
| Outer x-loop iterations (nc=64) | 8 | 4 (fewer branches) |
The 4×16 tile doubles arithmetic intensity with only 4 extra registers, a good trade at VLEN=256. At VLEN=512 an f32m1 register holds 16 f32 lanes, so a native 4×16 (with vl=64) would cover 16 columns per register naturally, giving the same output count with comfortable headroom and no two-pass complexity.
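The register and loop-count figures in the comparison follow from a small model. The sketch below assumes, as the text states, that only the f32m1 accumulators grow with tile width; the 21 non-accumulator registers (8 RHS + 8 LHS + 4 MAC + 1 b_scales) are held fixed. The function names are hypothetical.

```python
# Sketch: register and loop-count model for the two tile strategies.
import math

TOTAL_REGS = 32
NON_ACCUM_REGS = 8 + 8 + 4 + 1  # RHS + LHS + MAC + b_scales at peak


def sumf_regs(rows, cols, vlen):
    """f32m1 accumulators needed: one per row per group of f32 lanes."""
    lanes = vlen // 32
    return rows * math.ceil(cols / lanes)


def tile_stats(rows, cols, vlen=256, nc=64):
    accum = sumf_regs(rows, cols, vlen)
    peak = NON_ACCUM_REGS + accum
    return {
        "outputs": rows * cols,
        "sumf_regs": accum,
        "peak": peak,
        "free": TOTAL_REGS - peak,
        "x_iters": nc // cols,
    }


baseline = tile_stats(4, 8)   # 4x8 tile at VLEN=256
two_pass = tile_stats(4, 16)  # 4x16 two-pass tile at VLEN=256

for name, stats in (("4x8", baseline), ("4x16 two-pass", two_pass)):
    print(name, stats)

# At VLEN=512 one f32m1 register covers 16 columns, so a native 4x16 tile
# needs the same number of accumulators as the 4x8 baseline.
sumf_regs_512 = sumf_regs(4, 16, 512)
print("sumf regs for native 4x16 at VLEN=512:", sumf_regs_512)
```

The model deliberately reports only the accumulator count for VLEN=512, because the sizes of the other register groups at that width are not given here.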