Physical register allocation at peak (inside MAC chain, inner K-block loop)
At VLEN=256, each physical vector register holds 256 bits. RISC-V V provides
32 physical vector registers (v0–v31). LMUL groups consecutive physical registers
into one wider logical register.
| Logical type | LMUL | Phys regs | Elements | Use in kernel |
|---|---|---|---|---|
| f32m1 | 1 | 1 | 8 f32 | sumf accumulators, b_scales |
| i8m2 | 2 | 2 | 64 i8 | RHS half-vectors, LHS chunks |
| i8m4 | 4 | 4 | 128 i8 | raw B block load |
| i16m4 | 4 | 4 | 64 i16 | MAC accumulator (per row) |
LMUL is the physical register cost. An i16m4 occupies 4 physical registers at any VLEN. At VLEN=256
that is exactly the space the MAC accumulator needs: widening 64 i8 products to i16 takes
64 × 16 = 1024 bits = 4 × 256-bit registers, so no smaller type can hold the MAC accumulator.
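The element counts in the table follow from VLEN × LMUL / SEW. The sketch below restates that arithmetic in plain C; the helper names `elems_per_group` and `min_lmul` are hypothetical and are not part of the kernel or of any RVV API.

```c
#include <assert.h>
#include <stddef.h>

/* Elements held by one logical register group: VLEN * LMUL / SEW. */
size_t elems_per_group(size_t vlen_bits, size_t lmul, size_t sew_bits) {
    return vlen_bits * lmul / sew_bits;
}

/* Smallest integer LMUL (1, 2, 4 or 8) whose group holds n_elems elements
 * of width sew_bits at the given VLEN. Returns 0 if even LMUL=8 is too
 * small. Fractional LMUL is deliberately ignored in this sketch. */
size_t min_lmul(size_t vlen_bits, size_t n_elems, size_t sew_bits) {
    for (size_t lmul = 1; lmul <= 8; lmul *= 2) {
        if (elems_per_group(vlen_bits, lmul, sew_bits) >= n_elems) {
            return lmul;
        }
    }
    return 0;
}
```

For example, `min_lmul(256, 64, 16)` evaluates to 4, which matches the i16m4 accumulator: 64 widened products do not fit in an LMUL=2 group at VLEN=256.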
Peak pressure occurs inside the MAC chain for one row. All four RHS half-vectors, all four LHS
chunks, the MAC accumulator, all four float accumulators, and b_scales must be live simultaneously.
Physical registers = count × LMUL.
| Variable | Type | LMUL | Count | Calculation → phys regs |
|---|---|---|---|---|
| sumf0, sumf1, sumf2, sumf3 | f32m1 | 1 | 4 | 4 × 1 = 4 |
| rhs_lo_0, lo_1, hi_0, hi_1 | i8m2 | 2 | 4 | 4 × 2 = 8 |
| lhs_0_8, lhs_1_8, lhs_2_8, lhs_3_8 | i8m2 | 2 | 4 | 4 × 2 = 8 |
| sumi_lX (one row at a time) | i16m4 | 4 | 1 | 1 × 4 = 4 |
| b_scales_vec | f32m1 | 1 | 1 | 1 × 1 = 1 |
| Peak total | | | | 4 + 8 + 8 + 4 + 1 = 25 of 32 (7 free) |
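The peak total can be reproduced mechanically as the sum of count × LMUL over the live values. This is a minimal sketch of that bookkeeping; the struct, array, and function names are illustrative only, and the rows are copied from the table above rather than derived from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the pressure table: `count` live logical values, each
 * occupying `lmul` physical registers. */
struct live_group {
    const char *name;
    unsigned lmul;
    unsigned count;
};

/* Values live inside the MAC chain for one row (from the table). */
const struct live_group PEAK_LIVE[] = {
    {"sumf0..sumf3",           1, 4},
    {"RHS half-vectors",       2, 4},
    {"LHS chunks",             2, 4},
    {"sumi (MAC accumulator)", 4, 1},
    {"b_scales_vec",           1, 1},
};
const size_t PEAK_LIVE_LEN = sizeof PEAK_LIVE / sizeof PEAK_LIVE[0];

/* Physical registers = sum over groups of count * LMUL. */
unsigned phys_regs(const struct live_group *groups, size_t n) {
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += groups[i].count * groups[i].lmul;
    }
    return total;
}
```

`phys_regs(PEAK_LIVE, PEAK_LIVE_LEN)` evaluates to 25, leaving 32 − 25 = 7 registers free.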
Register map: all 32 physical registers (v0–v31) at peak, inside the MAC chain for one row.
Each LMUL=4 group spans 4 consecutive registers.

| Group (regs) | Physical registers |
|---|---|
| sumf0–3 (4) | v0–v3 |
| RHS halves (8) | v4–v11 |
| LHS chunks (8) | v12–v19 |
| MAC acc i16m4 (4) | v20–v23 |
| b_scales (1) | v24 |
| free (7) | v25–v31 |
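The map above is one possible layout; a compiler is free to choose a different assignment. RVV does require that a register group with LMUL > 1 start at a register number that is a multiple of LMUL, and the layout shown satisfies that. The sketch below uses a hypothetical bump allocator (`alloc_groups` is not a real API) to show how packing the groups in table order reproduces the same ranges.

```c
#include <assert.h>

/* A run of consecutive physical registers v<first>..v<last>. */
struct reg_range {
    unsigned first;
    unsigned last;
};

/* Hypothetical bump allocator: reserves `count` groups of `lmul`
 * registers starting at *next, rounding the start up to a multiple of
 * lmul so every group is LMUL-aligned. */
struct reg_range alloc_groups(unsigned *next, unsigned lmul, unsigned count) {
    unsigned first = (*next + lmul - 1) / lmul * lmul;
    struct reg_range r = { first, first + lmul * count - 1 };
    assert(r.last < 32); /* only v0..v31 exist */
    *next = r.last + 1;
    return r;
}
```

Starting from `next = 0` and allocating (LMUL, count) = (1, 4), (2, 4), (2, 4), (4, 1), (1, 1) in that order yields v0–v3, v4–v11, v12–v19, v20–v23 and v24. No padding registers are needed because each group happens to start on an aligned boundary.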
Note: rhs_raw_vec (the initial i8m4 load) is not counted here. By the time the MAC chain begins,
it has been consumed to produce the four i8m2 half-vectors and is no longer live. If the compiler
does not immediately reuse its 4 registers, peak pressure could briefly reach 25 + 4 = 29.
Comparing tile strategies. The 4×16 two-pass strategy processes cols 0–7 then cols 8–15 within
each K-block iteration, keeping the A block live across both passes. The only additional cost is
4 extra f32m1 accumulators for the second column group.
| Metric | 4×8 baseline | 4×16 two-pass |
|---|---|---|
| Output values / tile | 32 | 64 |
| A loads per K-block | 1× | 1× (reused) |
| B loads per output value | 1 / 8 cols | 1 / 16 cols ↓2× |
| Peak phys regs used | 25 / 32 | 29 / 32 (tight) |
| Free registers | 7 | 3 (spill risk) |
| f32 lanes per sumf reg | 8 | 8 |
| Outer x-loop iterations (nc=64) | 8 | 4 (fewer branches) |
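The 4×16 tile doubles arithmetic intensity with respect to A (64 output values per A load instead
of 32) for only 4 extra registers, a good trade at VLEN=256 provided the 3 remaining free registers
are enough to avoid spills. At VLEN=512 a native 4×16 tile (with vl=64) would cover 16 columns per
sumf register naturally (512 / 32 = 16 f32 lanes), giving the same output count with comfortable
headroom and no two-pass complexity.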
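The register and loop figures in the comparison can be checked with a few lines of arithmetic. The sketch below assumes each f32m1 accumulator covers one output row and VLEN / 32 columns, and that the 21 non-sumf registers from the peak table (8 RHS + 8 LHS + 4 MAC + 1 b_scales) are unchanged between the two strategies at VLEN=256. All function names are hypothetical.

```c
#include <assert.h>

/* f32 lanes in one LMUL=1 register: VLEN / 32. */
unsigned f32_lanes(unsigned vlen_bits) {
    return vlen_bits / 32;
}

/* f32m1 accumulators needed for a rows x cols output tile, assuming each
 * accumulator holds one row and f32_lanes() columns. */
unsigned sumf_regs(unsigned rows, unsigned cols, unsigned vlen_bits) {
    unsigned lanes = f32_lanes(vlen_bits);
    return rows * ((cols + lanes - 1) / lanes);
}

/* Outer x-loop trip count; assumes nc is a multiple of tile_cols. */
unsigned x_iters(unsigned nc, unsigned tile_cols) {
    return nc / tile_cols;
}

/* Peak registers at VLEN=256: 21 non-sumf registers from the peak table
 * plus the sumf accumulators for the chosen tile shape. */
unsigned peak_regs_vlen256(unsigned rows, unsigned cols) {
    return 21 + sumf_regs(rows, cols, 256);
}
```

With these helpers, `peak_regs_vlen256(4, 8)` is 25 and `peak_regs_vlen256(4, 16)` is 29, while `sumf_regs(4, 16, 512)` drops back to 4, which is the headroom the VLEN=512 remark refers to. The sketch does not model how the other register groups would change at VLEN=512.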