SSM
State space models (SSMs) have long been a fundamental tool in signal processing. Recent work has demonstrated their effectiveness as sequence models, particularly at capturing long-range dependencies: they have set state-of-the-art results on benchmarks such as the Long-Range Arena (LRA) and shown impressive performance on tasks like speech generation. Despite these successes, SSMs have historically lagged behind attention in language modeling.
The obvious question, then, is how to close this gap. Researchers at Stanford University developed a new layer called H3, short for “Hungry Hungry Hippos,” specifically to tackle the challenge of associative recall. With H3, nearly all of the attention layers in GPT-style transformers can be replaced while matching or exceeding their quality.
Performing associative recall requires two main capabilities: remembering tokens seen earlier in the sequence, and comparing the current token against those earlier tokens.
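As a toy illustration of the task (the function and data below are made up for this post, not taken from the H3 paper), associative recall amounts to storing key-value pairs as they stream past and looking one up when a key reappears:

```python
def associative_recall(sequence):
    """Remember key-value pairs seen so far, then answer the final query key."""
    *pairs, query = sequence
    memory = {}                           # memorize tokens throughout the sequence
    for key, value in pairs:
        memory[key] = value
    return memory[query]                  # compare the query against earlier keys

print(associative_recall([("a", 42), ("b", 7), "a"]))   # -> 42
```

The two capabilities above map directly onto the dictionary writes (memorization) and the final lookup (comparison).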
For memorization, an SSM with a diagonal state matrix is used, following approaches such as S4D and DSS. This setup lets the state capture and retain information across the entire sequence.
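The recurrence below is a minimal sketch of a diagonal SSM (the function name, shapes, and parameter values are illustrative assumptions, not the actual S4D/DSS parameterization, which also computes this scan efficiently as a convolution):

```python
import numpy as np

def diagonal_ssm(u, a, b, c):
    """u: inputs (L,); a: diagonal of A (N,); b, c: input/output projections (N,)."""
    x = np.zeros_like(a)                  # hidden state, one entry per diagonal element
    outputs = []
    for u_t in u:
        x = a * x + b * u_t               # diagonal A: elementwise decay plus a write
        outputs.append(c @ x)             # scalar readout of the state
    return np.array(outputs)

y = diagonal_ssm(u=np.random.randn(16),
                 a=np.full(8, 0.99),      # decay close to 1 -> long memory
                 b=np.ones(8),
                 c=np.ones(8) / 8)
```

Because A is diagonal, each state dimension decays at its own rate; entries whose decay is close to 1 carry information across the whole sequence.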
To compare the current token with earlier tokens, a second SSM is built with a shift matrix. Its state acts as a rolling buffer of the most recent tokens, so the current token can be compared against them through simple multiplicative interactions.
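A minimal sketch of the shift idea (names and sizes are again illustrative rather than the actual H3 implementation): with a shift matrix as A and B equal to the first basis vector, the recurrence simply pushes each new input into a rolling buffer.

```python
import numpy as np

def shift_ssm(u, state_size=4):
    """Return the SSM state (the last `state_size` inputs, newest first) at each step."""
    x = np.zeros(state_size)
    states = []
    for u_t in u:
        x = np.roll(x, 1)                 # A x_{t-1}: move every entry one slot later
        x[0] = u_t                        # B u_t: the current input enters slot 0
        states.append(x.copy())
    return np.stack(states)

states = shift_ssm(np.arange(1.0, 7.0))   # states[t] = [u_t, u_{t-1}, u_{t-2}, u_{t-3}]
```

Multiplying this buffer elementwise with a projection of the current token compares it against the recent past, which is the kind of multiplicative interaction the text above refers to.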