4×8 Register Pressure — RISC-V VLEN=256

At VLEN=256, each physical vector register holds 256 bits. RISC-V V provides 32 physical vector registers (v0–v31). LMUL groups consecutive physical registers into one wider logical register.

Logical type	LMUL	Phys regs	Elements	Use in kernel
f32m1	1	1	8 f32	sumf accumulators, b_scales
i8m2	2	2	64 i8	RHS half-vectors, LHS chunks
i8m4	4	4	128 i8	raw B block load
i16m4	4	4	64 i16	MAC accumulator (per row)

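LMUL = physical register cost. An i16m4 value occupies 4 physical registers at any VLEN, because LMUL=4 always groups 4 registers. At VLEN=256 that is exactly the capacity the MAC accumulator needs: widening 64 i8 products to i16 takes 64 × 16 = 1024 bits = 4 × 256-bit registers. There is no smaller type that fits the MAC accumulator.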
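The element counts in the type table follow from elements = VLEN × LMUL / SEW. A minimal sketch of that arithmetic is shown here; the helper name `elements_per_group` and the `types` list are illustrative and not part of the kernel.

```python
def elements_per_group(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Number of SEW-bit elements held by one register group of LMUL registers."""
    return vlen_bits * lmul // sew_bits


VLEN = 256  # bits per physical vector register, as assumed throughout this section

# (type name, element width SEW in bits, LMUL)
types = [("f32m1", 32, 1), ("i8m2", 8, 2), ("i8m4", 8, 4), ("i16m4", 16, 4)]

capacity = {name: elements_per_group(VLEN, sew, lmul) for name, sew, lmul in types}

for name, sew, lmul in types:
    # LMUL is also the number of physical registers the type occupies.
    print(f"{name}: {lmul} phys reg(s), {capacity[name]} x {sew}-bit elements")
```

Changing `VLEN` to 512 doubles every element count while the physical register cost of each type stays the same.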

Peak pressure occurs inside the MAC chain for one row. All four RHS half-vectors, all four LHS chunks, the MAC accumulator, all four float accumulators, and b_scales must be live simultaneously. Physical registers = count × LMUL.

Variable	Type	LMUL	Count	Calculation → phys regs
sumf0, sumf1, sumf2, sumf3	f32m1	1	4	4 × 1 = 4
rhs_lo_0, lo_1, hi_0, hi_1	i8m2	2	4	4 × 2 = 8
lhs_0_8, lhs_1_8, lhs_2_8, lhs_3_8	i8m2	2	4	4 × 2 = 8
sumi_lX (one row at a time)	i16m4	4	1	1 × 4 = 4
b_scales_vec	f32m1	1	1	1 × 1 = 1
Peak total				4 + 8 + 8 + 4 + 1 = 25 of 32 (7 free)
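The peak total can be reproduced by summing count × LMUL over the values that are live inside the MAC chain. The sketch below only restates the table; the `LIVE_AT_PEAK` list and the `peak_registers` helper are names chosen for illustration.

```python
TOTAL_VREGS = 32  # v0-v31

# (variable group, type, LMUL, count) for everything live inside the MAC chain
LIVE_AT_PEAK = [
    ("sumf0..sumf3", "f32m1", 1, 4),
    ("RHS half-vectors", "i8m2", 2, 4),
    ("LHS chunks", "i8m2", 2, 4),
    ("sumi_lX", "i16m4", 4, 1),
    ("b_scales_vec", "f32m1", 1, 1),
]


def peak_registers(live):
    """Physical registers needed = sum of count * LMUL over all live groups."""
    return sum(lmul * count for _name, _type, lmul, count in live)


peak = peak_registers(LIVE_AT_PEAK)
free = TOTAL_VREGS - peak
print(f"peak = {peak} of {TOTAL_VREGS}, free = {free}")
```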

Group (phys regs)	Registers
sumf0–3 (4)	v0–v3
RHS halves (8)	v4–v11
LHS chunks (8)	v12–v19
MAC acc i16m4 (4)	v20–v23
b_scales (1)	v24
free (7)	v25–v31

Register map: all 32 physical registers (v0–v31) at peak, inside the MAC chain for one row. Each cell = one physical register. LMUL=4 groups span 4 consecutive cells.
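One property worth checking in a layout like this is alignment: in RISC-V V, a register group with LMUL > 1 must start at a register number that is a multiple of LMUL. The sketch below uses a simple bump allocator (an illustration, not how a compiler actually allocates registers) to show that assigning the groups in the order listed reproduces the map with no padding; `allocate` and `GROUPS` are hypothetical names.

```python
def allocate(groups, num_regs=32):
    """Place each value at the next free base register that is a multiple of its LMUL.

    Returns ({group name: (first reg, last reg)}, next free register number).
    """
    next_free = 0
    layout = {}
    for name, lmul, count in groups:
        first = None
        for _ in range(count):
            base = -(-next_free // lmul) * lmul  # round up to a multiple of LMUL
            if base + lmul > num_regs:
                raise ValueError(f"out of vector registers while placing {name}")
            if first is None:
                first = base
            next_free = base + lmul
        layout[name] = (first, next_free - 1)
    return layout, next_free


# (group name, LMUL, count), in the order shown in the register map
GROUPS = [("sumf", 1, 4), ("RHS", 2, 4), ("LHS", 2, 4), ("MAC", 4, 1), ("bscale", 1, 1)]

layout, next_free = allocate(GROUPS)
for name, (lo, hi) in layout.items():
    print(f"{name}: v{lo}-v{hi}")
print(f"free: v{next_free}-v31")
```

With this ordering every LMUL=2 group lands on an even register and the LMUL=4 accumulator on v20, so no registers are lost to alignment padding.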

Note: rhs_raw_vec (the initial i8m4 load) is not counted here — by the time the MAC chain begins it has been consumed to produce the four i8m2 half-vectors and is no longer live. If the compiler does not immediately reuse its registers, peak pressure could briefly reach 29.

Comparing tile strategies. The 4×16 two-pass strategy processes cols 0–7 then cols 8–15 within each K-block iteration, keeping the A block live across both passes. The only additional cost is 4 extra f32m1 accumulators for the second column group.

	4×8 baseline	4×16 two-pass
Output values / tile	32	64
A loads per K-block	1×	1× (reused)
B loads per output value	1 / 8 cols	1 / 16 cols ↓2×
Peak phys regs used	25 / 32	29 / 32 (tight)
Free registers	7	3 (spill risk)
f32 lanes per sumf reg	8	8
Outer x-loop iterations (nc=64)	8	4 (fewer branches)

The 4×16 tile produces twice as many output values per A load for only 4 extra registers, a good trade at VLEN=256. At VLEN=512, a native 4×16 tile (with vl=64) would cover 16 columns per register naturally, giving the same output count with comfortable headroom and no two-pass complexity.
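The register and loop figures in the comparison table can be rederived from the 4×8 baseline. The following sketch assumes, as stated above, that each additional 8-column pass costs only 4 extra f32m1 accumulators; `tile_stats` is a hypothetical helper, and it is only meant for the two tile widths discussed here.

```python
BASE_PEAK = 25          # peak physical registers for the 4x8 baseline
ACCS_PER_COL_GROUP = 4  # f32m1 accumulators (one per row) per 8-column group


def tile_stats(cols, rows=4, nc=64, total_regs=32):
    """Outputs per tile, peak registers, free registers, and outer x-loop iterations."""
    extra_groups = cols // 8 - 1  # 0 for 4x8, 1 for 4x16 two-pass
    peak = BASE_PEAK + ACCS_PER_COL_GROUP * extra_groups
    return {
        "outputs": rows * cols,
        "peak": peak,
        "free": total_regs - peak,
        "x_iters": nc // cols,
    }


baseline = tile_stats(8)
two_pass = tile_stats(16)
print("4x8 baseline: ", baseline)
print("4x16 two-pass:", two_pass)
```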
