State Space Models
This architecture is still early, and the largest model right now is only 2.8B parameters.
That said, State Space Models (like Mamba) compress the context of a prompt into a fixed-size state, which avoids having to attend to the full history when generating each new token.
This is a fundamental advantage: memory and per-token compute stay constant as the context grows, instead of a key-value cache that keeps expanding. It's (very loosely speaking) a transformer with compression built in.
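To make the idea concrete, here is a minimal sketch of a plain linear state space recurrence (not Mamba's selective mechanism, and the matrices and sizes are purely illustrative): each token is folded into a fixed-size state vector, and the next output is read from that state alone.

```python
# Minimal linear SSM recurrence sketch (illustrative, not Mamba itself).
# The whole history is folded into a fixed-size state h, so generating
# each new token needs no growing key-value cache.
import numpy as np

d_state, d_model = 16, 4                       # illustrative sizes
A = np.eye(d_state) * 0.9                      # state transition (decay)
B = np.random.randn(d_state, d_model) * 0.1    # input projection
C = np.random.randn(d_model, d_state) * 0.1    # output projection

def step(h, x):
    """Fold one token embedding x into the fixed-size state h."""
    h = A @ h + B @ x    # compress the new input into the state
    y = C @ h            # read the output from the state only
    return h, y

h = np.zeros(d_state)
for x in np.random.randn(100, d_model):        # a 100-token "prompt"
    h, y = step(h, x)    # memory stays constant as the context grows
```

The point of the sketch is just the shape of the computation: the state h never grows, whereas attention would revisit all 100 earlier tokens at every step.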
Cheers, Ronan
Links: