conv_x (src0): vlse32, stride = ncs×4 bytes; gathers the same column from all vl rows
weights (src1): vlse32, stride = nc×4 bytes
output (dst): vse32; one instruction writes all vl lanes
vsum register: LMUL=m4 · vl=4 lanes · f32
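The access pattern labelled above can be sketched in portable C. This is a minimal illustration, not the actual kernel. It assumes that ncs and nc are the row lengths (in f32 elements) of conv_x and weights, which is inferred from the byte strides ncs×4 and nc×4, and that the step between the loads and the store is a per-lane multiply-accumulate into vsum. The function names, the row count nr, and the constant VL are hypothetical and chosen for illustration only.

```c
#include <stddef.h>

/* Scalar reference: one dot product of length nc per row.
 * Assumption: conv_x rows are ncs floats apart and weight rows are
 * nc floats apart, matching the byte strides ncs*4 and nc*4. */
static void conv_rows_scalar(const float *conv_x, const float *weights,
                             float *out, size_t nr, size_t nc, size_t ncs)
{
    for (size_t r = 0; r < nr; r++) {
        float sum = 0.0f;
        for (size_t c = 0; c < nc; c++)
            sum += conv_x[r * ncs + c] * weights[r * nc + c];
        out[r] = sum;
    }
}

/* Hypothetical lane count, mirroring the "vl=4 lanes" label. */
enum { VL = 4 };

/* Lane-wise emulation of the strided scheme: each lane owns one row.
 * For a fixed column c, the inner loop touches the same column of up to
 * VL consecutive rows, which is what one strided load would gather. */
static void conv_rows_lanes(const float *conv_x, const float *weights,
                            float *out, size_t nr, size_t nc, size_t ncs)
{
    for (size_t r0 = 0; r0 < nr; r0 += VL) {
        size_t vl = (nr - r0 < VL) ? (nr - r0) : VL; /* tail handling */
        float vsum[VL] = {0.0f};                     /* stands in for the vsum register */

        for (size_t c = 0; c < nc; c++) {
            for (size_t l = 0; l < vl; l++) {
                float x = conv_x[(r0 + l) * ncs + c]; /* strided load, step ncs*4 bytes */
                float w = weights[(r0 + l) * nc + c]; /* strided load, step nc*4 bytes */
                vsum[l] += x * w;                     /* assumed multiply-accumulate */
            }
        }

        for (size_t l = 0; l < vl; l++)
            out[r0 + l] = vsum[l];                    /* unit-stride store of all lanes */
    }
}
```

Both functions accumulate each row's products in the same column order, so they should produce the same result per row. The difference is only in loop nesting: the lane version walks columns in the outer loop and rows in the inner loop, which is the shape that lets a single strided load fetch one column across all vl rows.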