SSM conv kernel — scalar depthwise 1D convolution

ggml_compute_forward_ssm_conv_f32 · LFM2-700M · d_inner=1536, d_conv=4

src0 conv_x [d_conv-1+n_t, d_inner]
src1 weight [d_conv, d_inner]
dst output [d_inner, n_t]
conv_x buffer (src0) — sliding window context
d_conv-1+n_t cols · d_inner rows [row stride = ncs × 4 bytes, where ncs = d_conv-1+n_t]
conv1d.weight (src1) — per-channel filters
d_conv cols · d_inner rows
output (dst)
d_inner rows · n_t cols
dot product — selected cell
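As a minimal sketch of that dot product: each output cell is the sum of d_conv products between a window of one conv_x row and the same row's filter. The helper name ssm_conv_cell and the dense row-major layout are assumptions made for illustration; this is not a ggml function, and the real kernel addresses tensors through byte strides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of ggml): one output cell of the
 * depthwise convolution.
 *   conv_x : d_inner rows of ncs floats, ncs = d_conv - 1 + n_t
 *   weight : d_inner rows of d_conv floats
 * row is the channel index, t is the token index.
 * Assumes dense row-major storage with no padding. */
static float ssm_conv_cell(const float *conv_x, const float *weight,
                           int ncs, int d_conv, int row, int t) {
    assert(d_conv > 0 && t >= 0 && t + d_conv <= ncs);
    const float *win = conv_x + (size_t)row * ncs + t;  /* window start for token t */
    const float *flt = weight + (size_t)row * d_conv;   /* this channel's filter */
    float sum = 0.0f;
    for (int k = 0; k < d_conv; ++k) {
        sum += win[k] * flt[k];
    }
    return sum;
}
```

Moving from token t to token t+1 shifts the window one column to the right within the same row, which is why the buffer needs d_conv-1 columns of earlier context in front of the n_t new tokens.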
Thread parallelism over d_inner
Each thread owns a contiguous slice of the d_inner rows.
The loops over sequences and tokens run inside each thread, restricted to that slice.
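The row split and loop nesting can be sketched as follows. This is an illustration under assumptions, not the ggml source: row_range and ssm_conv_thread are hypothetical names, the ceil-divide split is one common way to partition rows across threads, and the buffers are assumed dense (conv_x as [n_s][d_inner][ncs], weight as [d_inner][d_conv], dst as [n_s][n_t][d_inner] with d_inner contiguous).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: contiguous row range [ir0, ir1) for thread ith of nth. */
static void row_range(int nr, int ith, int nth, int *ir0, int *ir1) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    const int dr = (nr + nth - 1) / nth;  /* rows per thread, rounded up */
    int lo = dr * ith;
    int hi = lo + dr;
    if (lo > nr) lo = nr;                 /* trailing threads may get no rows */
    if (hi > nr) hi = nr;
    *ir0 = lo;
    *ir1 = hi;
}

/* One thread's share of the work: sequences, then tokens, then owned rows.
 * Threads write disjoint rows of dst, so no synchronization is needed here. */
static void ssm_conv_thread(const float *conv_x, const float *weight,
                            float *dst, int d_conv, int d_inner,
                            int n_t, int n_s, int ith, int nth) {
    const int ncs = d_conv - 1 + n_t;     /* columns per conv_x row */
    int ir0, ir1;
    row_range(d_inner, ith, nth, &ir0, &ir1);
    for (int s = 0; s < n_s; ++s) {
        for (int t = 0; t < n_t; ++t) {
            for (int r = ir0; r < ir1; ++r) {
                const float *win = conv_x + ((size_t)s * d_inner + r) * ncs + t;
                const float *flt = weight + (size_t)r * d_conv;
                float sum = 0.0f;
                for (int k = 0; k < d_conv; ++k) {
                    sum += win[k] * flt[k];
                }
                dst[((size_t)s * n_t + t) * d_inner + r] = sum;
            }
        }
    }
}
```

With d_inner=1536 and, say, 8 threads, this split gives each thread 192 rows; thread 0 covers rows 0 to 191 and thread 7 covers rows 1344 to 1535.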
Where this sits in LFM2-700M