conv_x buffer (src0) — sliding window context
d_conv-1+n_t cols · d_inner rows [row stride = ncs × 4 bytes, where ncs = d_conv-1+n_t float32 values per row]
conv1d.weight (src1) — per-channel filters
d_conv cols · d_inner rows
output (dst)
d_inner rows · n_t cols
dot product for the selected output cell (row, token)
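The shapes listed above suggest how a single output cell is computed; the following is a minimal sketch, not the actual kernel. It assumes one sequence, plain nested lists instead of strided float buffers, and that the window for token t starts at column t of the conv_x row. The names ssm_conv_cell and ssm_conv are hypothetical.

```python
def ssm_conv_cell(conv_x, weight, row, tok):
    """Dot product for one output cell (row, tok).

    conv_x: d_inner rows, each with d_conv - 1 + n_t columns (assumed layout)
    weight: d_inner rows, each with d_conv filter taps
    """
    d_conv = len(weight[row])
    # Assumption: the window for token `tok` starts at column `tok`
    # and spans d_conv consecutive columns of the same row.
    window = conv_x[row][tok:tok + d_conv]
    return sum(x * w for x, w in zip(window, weight[row]))


def ssm_conv(conv_x, weight, n_t):
    """Full output for a single sequence: d_inner rows by n_t columns."""
    d_inner = len(weight)
    return [[ssm_conv_cell(conv_x, weight, r, t) for t in range(n_t)]
            for r in range(d_inner)]


# Tiny example: d_inner = 2, d_conv = 3, n_t = 2,
# so each conv_x row has 3 - 1 + 2 = 4 columns.
conv_x = [[1.0, 2.0, 3.0, 4.0],
          [0.0, 1.0, 0.0, 1.0]]
weight = [[1.0, 0.0, -1.0],
          [0.5, 0.5, 0.5]]
out = ssm_conv(conv_x, weight, n_t=2)
print(out)  # [[-2.0, -2.0], [0.5, 1.0]]
```

Each row uses only its own filter, which is why the weights are labeled per-channel: no output cell mixes values from two different rows.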
Thread parallelism over d_inner
Each thread owns a slice of rows.
Sequences & tokens loop inside.
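The row split described here can be sketched as follows. This is an illustration under assumptions, not the actual implementation: it assumes rows are divided into equal contiguous chunks (ceiling division), and it runs the "threads" one after another to show that their slices do not overlap. The names row_range and ssm_conv_thread are hypothetical.

```python
def row_range(n_rows, n_threads, thread_idx):
    """Contiguous slice of rows owned by one thread (assumed even split)."""
    rows_per_thread = (n_rows + n_threads - 1) // n_threads  # ceiling division
    start = min(rows_per_thread * thread_idx, n_rows)
    end = min(start + rows_per_thread, n_rows)
    return start, end


def ssm_conv_thread(conv_x, weight, out, n_t, n_threads, thread_idx):
    """Work done by one thread: its rows, for every sequence and token.

    conv_x[s][row][col] and out[s][row][tok] are nested lists (assumed layout).
    """
    d_conv = len(weight[0])
    start, end = row_range(len(weight), n_threads, thread_idx)
    for s in range(len(conv_x)):          # sequences loop inside the thread
        for t in range(n_t):              # tokens loop inside the thread
            for r in range(start, end):   # only the rows this thread owns
                out[s][r][t] = sum(conv_x[s][r][t + k] * weight[r][k]
                                   for k in range(d_conv))


# One sequence, d_inner = 3, d_conv = 2, n_t = 2 (3 columns per conv_x row).
conv_x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
weight = [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]
out = [[[None, None] for _ in range(3)]]

# Run the two "threads" serially; each writes a disjoint set of rows.
for ith in range(2):
    ssm_conv_thread(conv_x, weight, out, n_t=2, n_threads=2, thread_idx=ith)
print(out)  # [[[3.0, 5.0], [-1.0, -1.0], [16.0, 18.0]]]
```

Because every thread writes only its own rows of the output, no locking is needed between threads in this scheme.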
Where this sits in LFM2-700M