
2026-01-09

Why Long Context Research Is Difficult

So You Want to Understand Long-Context Models?

(The Mamba, Titans, Hope Dependency Hell Guide)

"I heard Attention is O(n²) and slow. So I looked for alternatives..." → gates of hell: opened


0. Reality Check First

What you want:


"I just want a rough idea of how Mamba and Titans work"

"They say it replaces Attention — I'm curious about the principles"

What you'll actually face:


Long-Context Models

│
├── Mamba (State Space Model family)
│ ├── SSM Theory
│ │ ├── Control Theory
│ │ ├── Signal Processing
│ │ └── Complex Analysis
│ ├── Selection Mechanism
│ │ └── Information Theory
│ └── Hardware Optimization
│ └── CUDA Kernel Programming
│ └── GPU Architecture
│ └── Computer Architecture
│ └── (this is Electrical Engineering now)
│
├── Titans (Memory Module family)
│ ├── Neural Memory
│ │ └── Hopfield Networks
│ │ └── Statistical Physics
│ ├── Surprise Metric
│ │ └── Information Theory (it's back)
│ └── Test-Time Learning
│ └── Meta-Learning
│ └── The Abyss of Optimization Theory
│
└── Hope (Self-Modifying Networks)
    ├── Nested Learning
    │ └── Higher-Order Optimization
    │ └── Implicit Differentiation
    │ └── Functional Analysis
    └── Causal Memory
        └── Causal Inference
            └── Philosophy (seriously)



Estimated time: When will it end? (answer refused)


1. Why You Entered This Hell: The Problem with Attention

What's Wrong with Attention?


"Attention is O(n²)"

│
└─→ "What's n?"
    │
    └─→ "Sequence length"
        │
        └─→ "So what's the problem?"
            │
            └─→ "Double the length = 4x the cost"
                │
                └─→ "What if I feed in 100K tokens?"
                    │
                    └─→ "10 billion operations"
                        │
                        └─→ "GPU memory goes boom"
                            │
                            └─→ "So we need alternatives"
                                │
                                └─→ This is where hell begins
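The chain above is just arithmetic, and it's worth running once. A tiny sketch (toy sequence lengths assumed, function name is mine):

```python
# The attention score matrix has n*n entries, so doubling the sequence
# quadruples the work, and 100K tokens already means 10^10 pairwise scores.
def attention_pairs(n: int) -> int:
    return n * n  # entries in the n-by-n score matrix

print(attention_pairs(2048) // attention_pairs(1024))  # -> 4 (double n = 4x cost)
print(attention_pairs(100_000))                        # -> 10000000000
```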

The Genealogy of Alternatives


"I want O(n) instead of O(n²)"

│
├─→ Sparse Attention (just look at some of it)
│ │
│ └─→ "Which parts should we look at?"
│ │
│ └─→ "If we knew that, we wouldn't need Attention"
│ │
│ └─→ 【Contradiction】
│ │
│ └─→ We do it with heuristics anyway (runs away)
│
├─→ Linear Attention (kernel trick)
│ │
│ └─→ "If we approximate Softmax..."
│ │
│ └─→ "Why doesn't it work exactly?"
│ │
│ └─→ 【Kernel Methods】
│ │
│ └─→ RKHS
│ │
│ └─→ 【Functional Analysis】
│ │
│ └─→ There goes one semester
│
├─→ State Space Models (Mamba)
│ │
│ └─→ "Let's compress with recurrence"
│ │
│ └─→ Details below (horror)
│
└─→ Memory Systems (Titans, Hope)
    │
    └─→ "Let's store it in memory"
        │
        └─→ Further below (more horror)
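The Linear Attention branch above is concrete enough to run. A minimal NumPy sketch, assuming the `elu(x) + 1` feature map from the linear-attention literature (function names are mine):

```python
import numpy as np

# Kernel trick: replace softmax(Q K^T) V with phi(Q) (phi(K)^T V).
# The quadratic version touches an n-by-n matrix; the regrouped version
# never materializes it, so the cost is linear in sequence length n.
def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, keeps scores positive

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d, d_v): one pass over the sequence
    Z = Qf @ Kf.sum(axis=0)        # per-query normalizer
    return (Qf @ KV) / Z[:, None]
```

The regrouping is exact for this kernel; what it gives up is the softmax itself, which is why it only approximates standard attention.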


2. So You Want to Understand Mamba?

Full Dependency Tree


"What's Mamba?"

│
├─→ "It's a Selective State Space Model"
│ │
│ ├─→ "What's a State Space Model?"
│ │ │
│ │ └─→ "A recurrent model from discretized continuous systems"
│ │ │
│ │ ├─→ "Continuous systems?"
│ │ │ │
│ │ │ └─→ "dx/dt = Ax + Bu, y = Cx + Du"
│ │ │ │
│ │ │ └─→ "What does that even mean?"
│ │ │ │
│ │ │ └─→ 【Control Theory】
│ │ │ │
│ │ │ ├─→ State Space Representation
│ │ │ ├─→ Transfer Functions
│ │ │ ├─→ Stability Analysis
│ │ │ └─→ EE junior year (hello again)
│ │ │
│ │ ├─→ "How do you discretize?"
│ │ │ │
│ │ │ └─→ "Several methods"
│ │ │ │
│ │ │ ├─→ Zero-Order Hold (ZOH)
│ │ │ │ │
│ │ │ │ └─→ "Derive by integration"
│ │ │ │ │
│ │ │ │ └─→ 【Matrix Exponential】
│ │ │ │ │
│ │ │ │ └─→ e^{At} = Σ_{n=0}^{∞} (At)^n / n!
│ │ │ │ │
│ │ │ │ └─→ "Does the infinite series converge?"
│ │ │ │ │
│ │ │ │ └─→ 【Real Analysis】
│ │ │ │ │
│ │ │ │ └─→ 2 weeks here
│ │ │ │
│ │ │ ├─→ Bilinear Transform
│ │ │ │ │
│ │ │ │ └─→ "s = 2/Δ · (z-1)/(z+1)"
│ │ │ │ │
│ │ │ │ └─→ "What's z?"
│ │ │ │ │
│ │ │ │ └─→ 【Z-Transform】
│ │ │ │ │
│ │ │ │ └─→ 【Signal Processing】
│ │ │ │ │
│ │ │ │ ├─→ Laplace Transform
│ │ │ │ ├─→ Fourier Transform
│ │ │ │ └─→ EE sophomore year (we meet again)
│ │ │ │
│ │ │ └─→ "Which one should I use?"
│ │ │ │
│ │ │ └─→ "Depends on the situation" (irresponsible answer)
│ │ │
│ │ └─→ "Why is recurrence good?"
│ │ │
│ │ └─→ "Because h_t = Āh_{t-1} + B̄x_t"
│ │ │
│ │ └─→ "Why is that O(n)?"
│ │ │
│ │ └─→ "Each step only references the previous state"
│ │ │
│ │ └─→ "Oh, so it's like an RNN?"
│ │ │
│ │ └─→ "Conceptually similar, but parallelizable"
│ │ │
│ │ └─→ "How?"
│ │ │
│ │ └─→ 【Parallel Scan】
│ │ │
│ │ └─→ Continued below
│ │
│ └─→ "What does Selective mean?"
│ │
│ └─→ "Parameters change based on input"
│ │
│ ├─→ "What changes?"
│ │ │
│ │ └─→ "B, C, Δ are functions of the input"
│ │ │
│ │ └─→ "What's Δ?"
│ │ │
│ │ └─→ "Discretization step size"
│ │ │
│ │ └─→ "Why should it vary with input?"
│ │ │
│ │ └─→ "Go slow on important stuff, fast on unimportant stuff"
│ │ │
│ │ └─→ "Why does that work?"
│ │ │
│ │ └─→ "Large Δ → less previous state influence, more current input influence"
│ │ │
│ │ └─→ "Show me the math"
│ │ │
│ │ └─→ Ā = exp(ΔA), large Δ → Ā → 0
│ │ │
│ │ └─→ 【Matrix Exponential】 appears again
│ │
│ └─→ "Why does this matter?"
│ │
│ └─→ "LTI vs LTV"
│ │
│ ├─→ "LTI (Linear Time-Invariant)"
│ │ │
│ │ └─→ "Fixed parameters"
│ │ │
│ │ └─→ "Can be computed as convolution"
│ │ │
│ │ └─→ "O(n log n) with FFT"
│ │ │
│ │ └─→ 【Fourier Transform】
│ │ │
│ │ └─→ Finally something familiar!!! (emotional)
│ │ │
│ │ └─→ But Mamba doesn't use this
│ │ │
│ │ └─→ ??? (was that a waste of time?)
│ │
│ └─→ "LTV (Linear Time-Varying)"
│ │
│ └─→ "Parameters change over time"
│ │
│ └─→ "Can't use convolution"
│ │
│ └─→ "Then how do you compute it fast?"
│ │
│ └─→ 【Parallel Scan】 ← This is the key
│ │
│ └─→ Below...
│
├─→ 【Parallel Scan (Selective Scan)】 ← Mamba's real secret
│ │
│ └─→ "Recurrence but parallelizable?"
│ │
│ └─→ "If the operation is Associative, yes"
│ │
│ └─→ "What's Associative?"
│ │ │
│ │ └─→ "(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)"
│ │ │
│ │ └─→ 【Algebra】 Associativity
│ │ │
│ │ └─→ Finally, high school math!!! (celebration)
│ │
│ └─→ "Is the SSM update Associative?"
│ │
│ └─→ "If you express it as matrices, yes"
│ │
│ │ └─→ [h_t]   [Ā  B̄x_t] [h_{t-1}]
│ │     [ 1 ] = [0     1 ] [   1   ]
│ │
│ └─→ "Matrix multiplication is Associative"
│ │
│ └─→ "So you can compute it in parallel like Prefix Sum"
│ │
│ └─→ "work O(n), span O(log n)"
│ │
│ └─→ 【Parallel Algorithms】
│ │
│ └─→ GPU Programming
│ │
│ └─→ CUDA
│ │
│ └─→ Another month here
│
├─→ 【HiPPO】 The Secret to Long-Term Memory
│ │
│ └─→ "Why does SSM handle long-term memory well?"
│ │
│ └─→ "The A matrix initialization is special"
│ │
│ └─→ "What's a HiPPO matrix?"
│ │
│ └─→ "Compresses the past using orthogonal polynomials"
│ │
│ ├─→ "Orthogonal polynomials?"
│ │ │
│ │ └─→ 【Functional Analysis】
│ │ │
│ │ ├─→ Legendre Polynomials
│ │ ├─→ Laguerre Polynomials
│ │ ├─→ Chebyshev Polynomials
│ │ └─→ Why are all the names French? (Chebyshev was actually Russian)
│ │ │
│ │ └─→ (historical reasons)
│ │
│ └─→ "Why is compressing with orthogonal polynomials good?"
│ │
│ └─→ "It guarantees optimal approximation"
│ │
│ └─→ 【Approximation Theory】
│ │
│ └─→ Weierstrass Theorem
│ │
│ └─→ Continuous functions can be approximated by polynomials...
│ │
│ └─→ 2 weeks here (minimum)
│
└─→ 【Hardware Optimization】 Making It Actually Fast
    │
    └─→ "It's theoretically O(n), but is it actually fast?"
        │
        └─→ "Kernel fusion is essential"
            │
            └─→ "What's kernel fusion?"
                │
                └─→ "Combining multiple operations into a single CUDA kernel"
                    │
                    └─→ "Why is that necessary?"
                        │
                        └─→ "Memory bandwidth bottleneck"
                            │
                            └─→ 【GPU Architecture】
                                │
                                ├─→ SRAM vs HBM
                                ├─→ Memory Hierarchy
                                ├─→ Warp Scheduling
                                └─→ You're a hardware engineer now
                                    │
                                    └─→ (career change?)
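Before the summary, the ZOH branch above is small enough to run. A minimal NumPy sketch with a diagonal A (as Mamba uses in practice); the toy values and function names are mine:

```python
import numpy as np

# Zero-Order Hold for a diagonal A: Abar = exp(dt*A) elementwise, and
# Bbar = (exp(dt*A) - 1)/A * B (the exact ZOH integral for this case).
def zoh_discretize(A, B, dt):
    Abar = np.exp(dt * A)
    Bbar = (Abar - 1.0) / A * B
    return Abar, Bbar

# The resulting recurrence: h_t = Abar*h_{t-1} + Bbar*x_t, y_t = C h_t.
def ssm_scan(Abar, Bbar, C, x):
    h, ys = np.zeros_like(Bbar), []
    for x_t in x:
        h = Abar * h + Bbar * x_t
        ys.append(C @ h)
    return np.array(ys)

A = np.array([-1.0, -2.0])                       # negative -> stable decay
Abar_small, _ = zoh_discretize(A, np.ones(2), 0.1)
Abar_large, _ = zoh_discretize(A, np.ones(2), 10.0)
print(Abar_large)  # large dt drives Abar toward 0: forget the past
```

That last print is the "large Δ → Ā → 0" claim from the Selective branch, checked numerically.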

Mamba Summary (if you survived this far)


Mamba = SSM + Selection + Parallel Scan + HiPPO + CUDA wizardry

      = Control Theory + Information Theory + Parallel Algorithms + Functional Analysis + GPU Programming

      = ??? (confusion)
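The parallel-scan claim deserves one runnable check. A toy sketch (function names are mine, and this is not Mamba's fused kernel): each step h_t = a_t·h_{t-1} + b_t is multiplication by the matrix [[a_t, b_t], [0, 1]] acting on (h, 1), and matrix products are associative, so the prefix products can be grouped any way a GPU likes.

```python
import numpy as np

def sequential(a, b, h0=0.0):
    # Plain recurrence: h_t = a_t * h_{t-1} + b_t
    h, hs = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        hs.append(h)
    return hs

def via_prefix_products(a, b, h0=0.0):
    # Same states via prefix products of 2x2 affine matrices. This loop is
    # still sequential, but because matrix multiplication is associative a
    # real implementation can tree-reduce the products: O(n) work, O(log n) span.
    P, hs = np.eye(2), []
    for a_t, b_t in zip(a, b):
        P = np.array([[a_t, b_t], [0.0, 1.0]]) @ P
        hs.append((P @ np.array([h0, 1.0]))[0])
    return hs
```

Both functions produce the same state sequence; only the second one's grouping freedom is what makes the "parallelizable RNN" story work.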


3. So You Want to Understand Titans?

Full Dependency Tree


"What's Titans?"

│
├─→ "It combines Attention + Neural Memory"
│ │
│ ├─→ "What's Neural Memory?"
│ │ │
│ │ └─→ "A learnable memory module"
│ │ │
│ │ └─→ "What does that even mean?"
│ │ │
│ │ └─→ "It updates memory at test time"
│ │ │
│ │ ├─→ "Learning at test time?"
│ │ │ │
│ │ │ └─→ "Yep, Test-Time Training"
│ │ │ │
│ │ │ └─→ 【The World of Meta-Learning】 ← branch point
│ │ │ │
│ │ │ ├─→ MAML
│ │ │ │ │
│ │ │ │ └─→ "Learning to learn"
│ │ │ │ │
│ │ │ │ └─→ "Double optimization"
│ │ │ │ │
│ │ │ │ └─→ 【Bilevel Optimization】
│ │ │ │ │
│ │ │ │ └─→ One month here
│ │ │ │
│ │ │ └─→ In-Context Learning
│ │ │ │
│ │ │ └─→ "Nobody knows why it works (seriously)"
│ │ │ │
│ │ │ └─→ 【Open Problem】
│ │ │
│ │ └─→ "How do you update the memory?"
│ │ │
│ │ └─→ "With Gradient Descent"
│ │ │
│ │ └─→ "Gradients during inference?"
│ │ │
│ │ └─→ "Isn't that slow, you ask?"
│ │ │
│ │ └─→ "Just 1 step of Mini-batch SGD"
│ │ │
│ │ └─→ "Still slow though?"
│ │ │
│ │ └─→ "So we need to find efficient methods..."
│ │ │
│ │ └─→ (research ongoing)
│ │
│ └─→ "Why combine it with Attention?"
│ │
│ └─→ "Short-term → Attention, Long-term → Memory"
│ │
│ └─→ "Why is that split good?"
│ │
│ └─→ "It's similar to the human memory system"
│ │
│ └─→ 【Cognitive Science】 surprise appearance
│ │
│ ├─→ Working Memory vs Long-term Memory
│ ├─→ Atkinson-Shiffrin Model
│ └─→ Intro to Psychology (nice to meet you)
│
├─→ 【Neural Memory Details】 ← the core
│ │
│ └─→ "What exactly is the memory?"
│ │
│ └─→ "A learnable parameter M ∈ R^{d×k}"
│ │
│ ├─→ "How do you read from it?"
│ │ │
│ │ └─→ "Attention with a Query"
│ │ │
│ │ └─→ Attention shows up again here
│ │ │
│ │ └─→ 【Infinite loop begins】
│ │
│ ├─→ "How do you write to it?"
│ │ │
│ │ └─→ "Update based on Surprise"
│ │ │
│ │ └─→ "What's Surprise?"
│ │ │
│ │ └─→ "-log p(x), it's information content"
│ │ │
│ │ └─→ 【Information Theory】 returns
│ │ │
│ │ ├─→ Self-Information
│ │ ├─→ Entropy
│ │ └─→ KL Divergence
│ │ │
│ │ └─→ 2 weeks here
│ │
│ └─→ "Why update based on Surprise?"
│ │
│ └─→ "Because surprising things should be remembered"
│ │
│ └─→ "That's intuitive"
│ │
│ └─→ Finally something that makes sense!!! (emotional)
│
├─→ 【Modern Hopfield Networks Connection】
│ │
│ └─→ "Is Neural Memory related to Hopfield Networks?"
│ │
│ └─→ "Similar from an Associative Memory perspective"
│ │
│ └─→ "What's a Hopfield Network?"
│ │
│ └─→ "An energy-based associative memory model"
│ │
│ └─→ 【Statistical Physics】 surprise appearance
│ │
│ ├─→ Energy Function
│ │ │
│ │ └─→ E = -Σ w_ij s_i s_j
│ │ │
│ │ └─→ "Isn't that the Ising Model?"
│ │ │
│ │ └─→ "Yep"
│ │ │
│ │ └─→ 【Statistical Mechanics】
│ │ │
│ │ └─→ Say hello to the Physics department
│ │
│ ├─→ Associative Memory
│ │ │
│ │ └─→ "Store and retrieve patterns"
│ │ │
│ │ └─→ "How many can you store?"
│ │ │
│ │ └─→ "0.14N (Classic)"
│ │ │
│ │ └─→ "Modern Hopfield can do exp(N)"
│ │ │
│ │ └─→ "How?"
│ │ │
│ │ └─→ "Change the energy function"
│ │ │
│ │ └─→ 【Ramsauer et al. 2020】
│ │ │
│ │ └─→ "This is the same as Attention?"
│ │ │
│ │ └─→ "Yep"
│ │ │
│ │ └─→ Connection found!!! (emotional)
│ │
│ └─→ But Titans uses a different approach
│ │
│ └─→ (Was that a waste of time? Well, not exactly...)
│
├─→ 【Memory Integration Methods】
│ │
│ └─→ "How do you attach memory to a Transformer?"
│ │
│ ├─→ "MAC (Memory as Context)"
│ │ │
│ │ └─→ "Append memory reads to the context"
│ │ │
│ │ └─→ "Doesn't that make the sequence longer?"
│ │ │
│ │ └─→ "So you need to limit memory size"
│ │ │
│ │ └─→ (trade-off)
│ │
│ ├─→ "MAG (Memory as Gate)"
│ │ │
│ │ └─→ "Use memory to modulate Attention"
│ │ │
│ │ └─→ "What's Gating?"
│ │ │
│ │ └─→ "σ(x) ⊙ y form"
│ │ │
│ │ └─→ "Like LSTM gates?"
│ │ │
│ │ └─→ "Yep"
│ │ │
│ │ └─→ 【LSTM】 returns
│ │ │
│ │ └─→ The 1997 paper
│ │ │
│ │ └─→ The OG of long-term memory (emotional)
│ │
│ └─→ "MAL (Memory as Layer)"
│ │
│ └─→ "Process memory as a separate layer"
│ │
│ └─→ "Cleanest separation"
│ │
│ └─→ "Performance?"
│ │
│ └─→ "Gotta run experiments"
│ │
│ └─→ (more experiments)
│
└─→ 【Persistent Memory vs Contextual Memory】
    │
    └─→ "Task-level vs Instance-level memory distinction?"
        │
        └─→ "Yes, you need to separate the two"
            │
            ├─→ "Persistent Memory"
            │ │
            │ └─→ "Shared across all sequences"
            │ │
            │ └─→ "Stores learned knowledge"
            │
            └─→ "Contextual Memory"
                │
                └─→ "Specific to the current sequence"
                    │
                    └─→ "Updated at test-time"
                        │
                        └─→ "Why does this distinction matter?"
                            │
                            └─→ "Separating knowledge from context"
                                │
                                └─→ 【Episodic vs Semantic Memory】
                                    │
                                    └─→ Cognitive science strikes again


4. So You Want to Understand Hope?

Full Dependency Tree


"What's Hope?"

│
├─→ "It builds Self-Modifying networks with Nested Learning"
│ │
│ ├─→ "What's Nested Learning?"
│ │ │
│ │ └─→ "Learning inside of learning"
│ │ │
│ │ └─→ "Inner Loop / Outer Loop?"
│ │ │
│ │ └─→ "Yes, it's Bilevel Optimization"
│ │ │
│ │ └─→ 【Bilevel Optimization】
│ │ │
│ │ ├─→ "Upper problem depends on lower problem's optimal solution"
│ │ │ │
│ │ │ └─→ "Show me the math"
│ │ │ │
│ │ │ └─→ min_θ L(θ, φ*(θ))
│ │ │ where φ*(θ) = argmin_φ l(θ, φ)
│ │ │ │
│ │ │ └─→ "How do you compute the gradient?"
│ │ │ │
│ │ │ └─→ 【Implicit Differentiation】
│ │ │ │
│ │ │ └─→ Need to compute dφ*/dθ
│ │ │ │
│ │ │ └─→ 【Implicit Function Theorem】
│ │ │ │
│ │ │ └─→ 【Real Analysis】
│ │ │ │
│ │ │ └─→ Another 2 weeks here
│ │ │
│ │ └─→ "Why do something this complicated?"
│ │ │
│ │ └─→ "For rapid adaptation"
│ │ │
│ │ └─→ "Is this the same as Meta-Learning?"
│ │ │
│ │ └─→ "Related"
│ │ │
│ │ └─→ MAML, Reptile, ...
│ │ │
│ │ └─→ One month here
│ │
│ └─→ "What does Self-Modifying mean?"
│ │
│ └─→ "The network modifies its own parameters"
│ │
│ └─→ "At test-time?"
│ │
│ └─→ "Yes"
│ │
│ └─→ "Is that even possible?"
│ │
│ └─→ "Similar concept to Hypernetworks"
│ │
│ └─→ 【Hypernetwork】
│ │
│ └─→ "A network generates another network's weights"
│ │
│ └─→ "Related to Neural Turing Machines?"
│ │
│ └─→ 【Differentiable Computing】
│ │
│ └─→ The 2014-2016 DeepMind era
│ │
│ └─→ History lesson (escape)
│
├─→ 【Causal Memory [...]】 ← The core of Hope
│ │
│ └─→ "What's CMS?"
│ │
│ └─→ "A system that updates memory causally"
│ │
│ ├─→ "What does 'causal' mean here?"
│ │ │
│ │ └─→ "Past affects future, future does NOT affect past"
│ │ │
│ │ └─→ "Isn't that obvious?"
│ │ │
│ │ └─→ "Attention is bidirectional though"
│ │ │
│ │ └─→ "Oh, so that's why we use Causal Masks"
│ │ │
│ │ └─→ "But it matters for memory systems too"
│ │ │
│ │ └─→ "Why?"
│ │ │
│ │ └─→ "Preventing future information leakage"
│ │ │
│ │ └─→ 【Causal Inference】 out of nowhere
│ │ │
│ │ ├─→ Pearl's Causal Hierarchy
│ │ ├─→ do-calculus
│ │ └─→ This is philosophy now (seriously)
│ │
│ └─→ "How does memory update work?"
│ │
│ └─→ "Gradient-based + Causal Masking"
│ │
│ └─→ "Does it also use Surprise?"
│ │
│ └─→ "There's a similar concept"
│ │
│ └─→ Prediction Error
│ │
│ └─→ 【Predictive Coding】
│ │
│ └─→ A theory from neuroscience
│ │
│ └─→ Brain science enters the chat (how far does this go)
│
├─→ 【TTT (Test-Time Training) Layer】
│ │
│ └─→ "There was a TTT Layer paper before Hope"
│ │
│ └─→ "What's the relationship?"
│ │
│ └─→ "TTT: uses a linear model as inner state"
│ │
│ └─→ "Hope: a more general framework"
│ │
│ └─→ "Should I understand TTT first to get Hope?"
│ │
│ └─→ "Recommended"
│ │
│ └─→ 【TTT paper first】
│ │
│ └─→ Sun et al. 2024
│ │
│ └─→ (added to reading list)
│
└─→ 【Nested Learning Details】
    │
    └─→ "What exactly is learning inside learning?"
        │
        └─→ "Outer: overall sequence Loss"
            │
            └─→ "Inner: local prediction Loss"
                │
                └─→ "Inner updates the memory"
                    │
                    └─→ "Outer updates all parameters"
                        │
                        └─→ "So backward pass runs twice?"
                            │
                            └─→ "Yes, Gradient of Gradient"
                                │
                                └─→ 【Second-Order Optimization】
                                    │
                                    ├─→ Hessian
                                    │ │
                                    │ └─→ O(n²) parameters
                                    │ │
                                    │ └─→ "Computationally infeasible"
                                    │ │
                                    │ └─→ "Need approximations"
                                    │ │
                                    │ └─→ Hessian-Free Methods
                                    │ │
                                    │ └─→ 【Numerical Optimization】
                                    │ │
                                    │ └─→ Another month here
                                    │
                                    └─→ Truncated Backprop
                                        │
                                        └─→ "Only unroll a few steps"
                                            │
                                            └─→ "Doesn't that introduce bias?"
                                                │
                                                └─→ "It does"
                                                    │
                                                    └─→ "But it's practical so we use it"
                                                        │
                                                        └─→ It works so whatever (runs away)
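The implicit-differentiation step above is less mystical on a scalar toy problem (this is just the mechanism, not Hope's formulation; all names are mine). With inner loss l(θ, φ) = 0.5·a·φ² - θ·φ, stationarity gives φ*(θ) = θ/a, and the implicit function theorem gives dφ*/dθ = -l_φθ/l_φφ = 1/a without differentiating through an inner optimizer.

```python
# Bilevel toy: outer L(theta) = (phi*(theta) - 1)^2 + 0.1*theta^2,
# inner phi*(theta) = argmin_phi 0.5*a*phi^2 - theta*phi = theta / a.
a = 2.0

def phi_star(theta):
    return theta / a

def outer(theta):
    return (phi_star(theta) - 1.0) ** 2 + 0.1 * theta ** 2

def outer_grad(theta):
    dphi_dtheta = 1.0 / a                     # implicit function theorem: -l_phitheta / l_phiphi
    dL_dphi = 2.0 * (phi_star(theta) - 1.0)   # partial of L wrt phi
    dL_dtheta = 0.2 * theta                   # direct partial of L wrt theta
    return dL_dphi * dphi_dtheta + dL_dtheta  # total derivative through phi*
```

Comparing `outer_grad` against a finite difference of `outer` confirms the chain rule through φ*, which is what every "gradient of gradient" setup ultimately relies on.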


5. So How Do These Three Relate?

Comparison


|  | Mamba | Titans | Hope |
|---|---|---|---|
| Core Idea | SSM + Selection | Neural Memory | Nested Learning |
| Complexity | O(n) | O(n) + α | O(n) + αα |
| Long-term Memory | State Compression | Explicit Memory | Self-Modifying Memory |
| Theoretical Basis | Control Theory | Information Theory | Optimization Theory |
| Implementation Pain | CUDA hell | Medium | High |
| Required Reading | Physics+Engineering | CogSci+ML | Optimization+ML |

Decision Guide


"What should I study?"

│
├─→ If you want practical implementation
│ │
│ └─→ Mamba (most available implementations)
│ │
│ └─→ But need to understand CUDA kernels (hard)
│
├─→ If you want theoretical understanding
│ │
│ └─→ Titans (most intuitive concepts)
│ │
│ └─→ But need to understand test-time learning
│
└─→ If you want to follow cutting-edge research
    │
    └─→ Hope (the 2025 Nested Learning paper)
        │
        └─→ But you need to know all the previous stuff
            │
            └─→ 【Infinite loop】


6. Do I Really Need to Know All This?

The Honest Answer


"Do I need to know everything?"

│
├─→ To write papers: go deep in your area, concepts-only for the rest
├─→ To implement: focus on Mamba + use libraries
├─→ To compare: just the core ideas of each
└─→ To just use them:



    from mamba_ssm import Mamba

    model = Mamba(d_model=512)

    # done

Realistic Time Estimates

| Goal | Time Required | What It Actually Means |
|------|---------------|------------------------|
| API usage | 1 day | Nothing |
| Fine-tuning | 1-2 weeks | Editing config files |
| Architecture understanding | 1-3 months | Level 1 of the tree |
| Paper comprehension | 6 months - 1 year | Level 2 of the tree |
| Suggesting improvements | 1-2 years | Most of the tree |
| Proposing new architectures | 3+ years | Entire tree + intuition |
| Complete understanding | ??? | Even the authors don't know everything (seriously) |

Required Knowledge by Area


【Shared Prerequisites】

├── Linear Algebra: Matrices, Eigenvalues, SVD
├── Calculus: Partial Derivatives, Chain Rule
├── Probability: Basic Statistics
└── Deep Learning: Transformer Basics



【Mamba-Specific】

├── Control Theory: State Space, Stability
├── Signal Processing: Z-Transform, Discretization
├── Parallel Algorithms: Scan Operations
└── GPU Programming: CUDA (optional)



【Titans-Specific】

├── Information Theory: Entropy, Surprise
├── Cognitive Science: Memory Systems (concepts only)
├── Meta-Learning: MAML etc. (concepts only)
└── Hopfield Networks: Energy-Based Models



【Hope-Specific】

├── Optimization Theory: Bilevel, Implicit Diff
├── Meta-Learning: Advanced
├── Causal Inference: Basic Concepts
└── Second-Order Methods: Hessian Approximation



【What You Actually Use】

└── model.forward(x)


7. Escaping Dependency Hell: A Guide

Option A: Strategic Surrender


# Using Mamba

from mamba_ssm import Mamba

layer = Mamba(d_model=512, d_state=16, d_conv=4)



# Or just use Transformers (still great)

from transformers import AutoModel

model = AutoModel.from_pretrained("...")



# Honestly Attention is fast enough (reality)

Option B: Dig When Needed


1. Read only the Abstract + Figure 1 of the paper

2. Run the code

3. If something's weird, dig into that part

4. Repeat

5. Somehow you just know it now (magic)

Option C: The Purist Path (not recommended)


Linear Algebra (3 months)

    ↓

Calculus + Analysis (3 months)

    ↓

Probability (2 months)

    ↓

Signal Processing (2 months)

    ↓

Control Theory (3 months)

    ↓

Optimization Theory (3 months)

    ↓

Deep Learning Basics (3 months)

    ↓

Transformers (2 months)

    ↓

SSM Theory (2 months)

    ↓

Meta-Learning (2 months)

    ↓

Mamba/Titans/Hope (3 months)

    ↓

Total time: 2.5-3 years



By then a new architecture has dropped

(Attention Is All You Need → No It Isn't → Actually Maybe)

(rip)

Option D: Hybrid (recommended)


1. Get the big picture from this post (1 day)

2. Build intuition with Lilian Weng's blog (1 week)

3. Hands-on with code (1-2 weeks)

4. Deep dive only into what interests you (ongoing)

5. Math on a need-to-know basis

6. Focus on the Methods section of papers



Key insight: You don't need to know everything

              You only need to know what you need

              (This is the hardest part)


8. FAQ

"Doesn't performance drop without Attention?"


├─→ In general: similar or slightly lower
├─→ Long sequences: can actually be better
├─→ Conclusion: depends on the situation (irresponsible answer)
└─→ Also: Hybrid is the trend

"Can I skip the math?"


├─→ Just using it: yes
├─→ Fine-tuning: yes
├─→ Implementing: tough
├─→ Understanding papers: no
├─→ Research: absolutely not
└─→ But you can learn what you need as you go

"Where do I start?"


Recommended order:

1. Review Attention (essential)

2. "The Annotated S4" blog (S4/Mamba intuition)

3. Mamba paper + official implementation (hands-on)

4. Titans/Hope if you're interested (optional)



Videos:

- Yannic Kilcher's Mamba review

- AI Coffee Break explainer



Blogs:

- Lilian Weng (GOAT)

- Jay Alammar's visualizations

"How long will it take?"


"1 week" - Genius or someone who only knows Attention

"1 month" - Optimist

"3 months" - Realist

"6 months" - Deep diver

"A lifetime" - The truth

"I don't know" - Honest person

"Can I just give up?"


Yes.



Attention is still plenty good,

Flash Attention has solved a lot of the hardware issues,

100K tokens just works now.



However:

- For 1M+ tokens, you need alternatives

- If you're a researcher, you need to know this

- If your company tells you to... good luck

"What's the outlook for this field?"

├─→ It's hot (seriously)
├─→ Mamba2, Jamba, Griffin etc. keep coming
├─→ Hybrid is the trend
├─→ But who knows in 6 months
└─→ This is the 10th time I've heard "Transformers are dead"
    └─→ They're not dead yet (the eternal rule)

9. Closing Thoughts

Your State After Reading This


Before: "Mamba replaces Attention apparently, let me look into it"

After: "Control theory... signal processing... statistical physics... cognitive science..."

       "Why did I choose this path"

       "Maybe I'll just use Attention"

Words of Comfort

  • Even experts in this field don't know all of it

  • The Mamba authors don't know Titans internals, and the Titans authors don't know Hope's details

  • "I don't really get it but it works when I run experiments" is the industry standard

  • Nobody reads the paper's Appendix, not even reviewers (seriously)

  • You're not the only one struggling

But If You Still Want to Do This


Key tips:

1. Don't try to understand everything perfectly

2. Pick only what you need and go deep there

3. Code first, theory later

4. Don't be embarrassed to ask questions

5. Not understanding a paper on the first read is normal



In the end: Time + Persistence + Googling = You'll get there eventually



You got this!!!

(The author still can't derive Implicit Differentiation)


Appendix: Recommended Resources (in order)

Blogs (free, essential)

  1. "The Annotated S4" - srush/annotated-s4

  2. Lilian Weng - "State Space Models"

  3. "A Visual Guide to Mamba" - Maarten Grootendorst

Videos (free)

  1. Yannic Kilcher - Mamba paper review

  2. AI Coffee Break - SSM explainer

Papers (in order)

  1. S4 (Gu et al., 2021) - SSM fundamentals

  2. Mamba (Gu & Dao, 2023) - Selection + Scan

  3. Mamba-2 (Dao & Gu, 2024) - State Space Duality

  4. Titans (Behrouz et al., 2024) - Neural Memory

  5. Hope (Behrouz et al., 2025) - Nested Learning

Code

  • state-spaces/mamba (official)

  • huggingface/transformers (Mamba support)

  • Each paper's official repo

Math Foundations (when you get stuck)

  • 3Blue1Brown Linear Algebra (video)

  • Steve Brunton Control Theory (video)

  • For probability... you'll need a textbook (runs away)