
2026-01-09

Why Long Context Research Is Difficult

So You Want to Understand Long-Context Models?

(The Mamba, Titans, Hope Dependency Hell Guide)

"I heard Attention is O(n²) and slow. So I looked for alternatives..." → gates of hell: opened


0. Reality Check First

What you want:


"I just want a rough idea of how Mamba and Titans work"

"They say it replaces Attention — I'm curious about the principles"

What you'll actually face:


Long-Context Models

│
├── Mamba (State Space Model family)
│ ├── SSM Theory
│ │ ├── Control Theory
│ │ ├── Signal Processing
│ │ └── Complex Analysis
│ ├── Selection Mechanism
│ │ └── Information Theory
│ └── Hardware Optimization
│ └── CUDA Kernel Programming
│ └── GPU Architecture
│ └── Computer Architecture
│ └── (this is Electrical Engineering now)
│
├── Titans (Memory Module family)
│ ├── Neural Memory
│ │ └── Hopfield Networks
│ │ └── Statistical Physics
│ ├── Surprise Metric
│ │ └── Information Theory (it's back)
│ └── Test-Time Learning
│ └── Meta-Learning
│ └── The Abyss of Optimization Theory
│
└── Hope (Self-Modifying Networks)
    ├── Nested Learning
    │ └── Higher-Order Optimization
    │ └── Implicit Differentiation
    │ └── Functional Analysis
    └── Causal Memory
        └── Causal Inference
            └── Philosophy (seriously)



Estimated time: When will it end? (answer refused)


1. Why You Entered This Hell: The Problem with Attention

What's Wrong with Attention?


"Attention is O(n²)"

│
└─→ "What's n?"
    │
    └─→ "Sequence length"
        │
        └─→ "So what's the problem?"
            │
            └─→ "Double the length = 4x the cost"
                │
                └─→ "What if I feed in 100K tokens?"
                    │
                    └─→ "10 billion operations"
                        │
                        └─→ "GPU memory goes boom"
                            │
                            └─→ "So we need alternatives"
                                │
                                └─→ This is where hell begins
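The chain above is just arithmetic, and it's worth running once. A tiny sketch (toy sequence lengths assumed, function name is mine):

```python
# The attention score matrix has n*n entries, so doubling the sequence
# quadruples the work, and 100K tokens already means 10^10 pairwise scores.
def attention_pairs(n: int) -> int:
    return n * n  # entries in the n-by-n score matrix

print(attention_pairs(2048) // attention_pairs(1024))  # -> 4 (double n = 4x cost)
print(attention_pairs(100_000))                        # -> 10000000000
```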

The Genealogy of Alternatives


"I want O(n) instead of O(n²)"

│
├─→ Sparse Attention (just look at some of it)
│ │
│ └─→ "Which parts should we look at?"
│ │
│ └─→ "If we knew that, we wouldn't need Attention"
│ │
│ └─→ 【Contradiction】
│ │
│ └─→ We do it with heuristics anyway (runs away)
│
├─→ Linear Attention (kernel trick)
│ │
│ └─→ "If we approximate Softmax..."
│ │
│ └─→ "Why doesn't it work exactly?"
│ │
│ └─→ 【Kernel Methods】
│ │
│ └─→ RKHS
│ │
│ └─→ 【Functional Analysis】
│ │
│ └─→ There goes one semester
│
├─→ State Space Models (Mamba)
│ │
│ └─→ "Let's compress with recurrence"
│ │
│ └─→ Details below (horror)
│
└─→ Memory Systems (Titans, Hope)
    │
    └─→ "Let's store it in memory"
        │
        └─→ Further below (more horror)
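The Linear Attention branch above is concrete enough to run. A minimal NumPy sketch, assuming the `elu(x) + 1` feature map from the linear-attention literature (function names are mine):

```python
import numpy as np

# Kernel trick: replace softmax(Q K^T) V with phi(Q) (phi(K)^T V).
# The quadratic version touches an n-by-n matrix; the regrouped version
# never materializes it, so the cost is linear in sequence length n.
def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, keeps scores positive

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d, d_v): one pass over the sequence
    Z = Qf @ Kf.sum(axis=0)        # per-query normalizer
    return (Qf @ KV) / Z[:, None]
```

The regrouping is exact for this kernel; what it gives up is the softmax itself, which is why it only approximates standard attention.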


2. So You Want to Understand Mamba?

Full Dependency Tree


"What's Mamba?"

│
├─→ "It's a Selective State Space Model"
│ │
│ ├─→ "What's a State Space Model?"
│ │ │
│ │ └─→ "A recurrent model from discretized continuous systems"
│ │ │
│ │ ├─→ "Continuous systems?"
│ │ │ │
│ │ │ └─→ "dx/dt = Ax + Bu, y = Cx + Du"
│ │ │ │
│ │ │ └─→ "What does that even mean?"
│ │ │ │
│ │ │ └─→ 【Control Theory】
│ │ │ │
│ │ │ ├─→ State Space Representation
│ │ │ ├─→ Transfer Functions
│ │ │ ├─→ Stability Analysis
│ │ │ └─→ EE junior year (hello again)
│ │ │
│ │ ├─→ "How do you discretize?"
│ │ │ │
│ │ │ └─→ "Several methods"
│ │ │ │
│ │ │ ├─→ Zero-Order Hold (ZOH)
│ │ │ │ │
│ │ │ │ └─→ "Derive by integration"
│ │ │ │ │
│ │ │ │ └─→ 【Matrix Exponential】
│ │ │ │ │
│ │ │ │ └─→ e^{At} = Σ_{n=0}^{∞} (At)^n / n!
│ │ │ │ │
│ │ │ │ └─→ "Does the infinite series converge?"
│ │ │ │ │
│ │ │ │ └─→ 【Real Analysis】
│ │ │ │ │
│ │ │ │ └─→ 2 weeks here
│ │ │ │
│ │ │ ├─→ Bilinear Transform
│ │ │ │ │
│ │ │ │ └─→ "s = 2/Δ · (z-1)/(z+1)"
│ │ │ │ │
│ │ │ │ └─→ "What's z?"
│ │ │ │ │
│ │ │ │ └─→ 【Z-Transform】
│ │ │ │ │
│ │ │ │ └─→ 【Signal Processing】
│ │ │ │ │
│ │ │ │ ├─→ Laplace Transform
│ │ │ │ ├─→ Fourier Transform
│ │ │ │ └─→ EE sophomore year (we meet again)
│ │ │ │
│ │ │ └─→ "Which one should I use?"
│ │ │ │
│ │ │ └─→ "Depends on the situation" (irresponsible answer)
│ │ │
│ │ └─→ "Why is recurrence good?"
│ │ │
│ │ └─→ "Because h_t = Āh_{t-1} + B̄x_t"
│ │ │
│ │ └─→ "Why is that O(n)?"
│ │ │
│ │ └─→ "Each step only references the previous state"
│ │ │
│ │ └─→ "Oh, so it's like an RNN?"
│ │ │
│ │ └─→ "Conceptually similar, but parallelizable"
│ │ │
│ │ └─→ "How?"
│ │ │
│ │ └─→ 【Parallel Scan】
│ │ │
│ │ └─→ Continued below
│ │
│ └─→ "What does Selective mean?"
│ │
│ └─→ "Parameters change based on input"
│ │
│ ├─→ "What changes?"
│ │ │
│ │ └─→ "B, C, Δ are functions of the input"
│ │ │
│ │ └─→ "What's Δ?"
│ │ │
│ │ └─→ "Discretization step size"
│ │ │
│ │ └─→ "Why should it vary with input?"
│ │ │
│ │ └─→ "Go slow on important stuff, fast on unimportant stuff"
│ │ │
│ │ └─→ "Why does that work?"
│ │ │
│ │ └─→ "Large Δ → less previous state influence, more current input influence"
│ │ │
│ │ └─→ "Show me the math"
│ │ │
│ │ └─→ Ā = exp(ΔA), large Δ → Ā → 0
│ │ │
│ │ └─→ 【Matrix Exponential】 appears again
│ │
│ └─→ "Why does this matter?"
│ │
│ └─→ "LTI vs LTV"
│ │
│ ├─→ "LTI (Linear Time-Invariant)"
│ │ │
│ │ └─→ "Fixed parameters"
│ │ │
│ │ └─→ "Can be computed as convolution"
│ │ │
│ │ └─→ "O(n log n) with FFT"
│ │ │
│ │ └─→ 【Fourier Transform】
│ │ │
│ │ └─→ Finally something familiar!!! (emotional)
│ │ │
│ │ └─→ But Mamba doesn't use this
│ │ │
│ │ └─→ ??? (was that a waste of time?)
│ │
│ └─→ "LTV (Linear Time-Varying)"
│ │
│ └─→ "Parameters change over time"
│ │
│ └─→ "Can't use convolution"
│ │
│ └─→ "Then how do you compute it fast?"
│ │
│ └─→ 【Parallel Scan】 ← This is the key
│ │
│ └─→ Below...
│
├─→ 【Parallel Scan (Selective Scan)】 ← Mamba's real secret
│ │
│ └─→ "Recurrence but parallelizable?"
│ │
│ └─→ "If the operation is Associative, yes"
│ │
│ └─→ "What's Associative?"
│ │ │
│ │ └─→ "(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)"
│ │ │
│ │ └─→ 【Algebra】 Associativity
│ │ │
│ │ └─→ Finally, high school math!!! (celebration)
│ │
│ └─→ "Is the SSM update Associative?"
│ │
│ └─→ "If you express it as matrices, yes"
│ │
│ │ └─→ [h_t]   [Ā  B̄x_t] [h_{t-1}]
│ │     [ 1 ] = [0     1 ] [   1   ]
│ │
│ └─→ "Matrix multiplication is Associative"
│ │
│ └─→ "So you can compute it in parallel like Prefix Sum"
│ │
│ └─→ "work O(n), span O(log n)"
│ │
│ └─→ 【Parallel Algorithms】
│ │
│ └─→ GPU Programming
│ │
│ └─→ CUDA
│ │
│ └─→ Another month here
│
├─→ 【HiPPO】 The Secret to Long-Term Memory
│ │
│ └─→ "Why does SSM handle long-term memory well?"
│ │
│ └─→ "The A matrix initialization is special"
│ │
│ └─→ "What's a HiPPO matrix?"
│ │
│ └─→ "Compresses the past using orthogonal polynomials"
│ │
│ ├─→ "Orthogonal polynomials?"
│ │ │
│ │ └─→ 【Functional Analysis】
│ │ │
│ │ ├─→ Legendre Polynomials
│ │ ├─→ Laguerre Polynomials
│ │ ├─→ Chebyshev Polynomials
│ │ └─→ Why are all the names French? (Chebyshev was actually Russian)
│ │ │
│ │ └─→ (historical reasons)
│ │
│ └─→ "Why is compressing with orthogonal polynomials good?"
│ │
│ └─→ "It guarantees optimal approximation"
│ │
│ └─→ 【Approximation Theory】
│ │
│ └─→ Weierstrass Theorem
│ │
│ └─→ Continuous functions can be approximated by polynomials...
│ │
│ └─→ 2 weeks here (minimum)
│
└─→ 【Hardware Optimization】 Making It Actually Fast
    │
    └─→ "It's theoretically O(n), but is it actually fast?"
        │
        └─→ "Kernel fusion is essential"
            │
            └─→ "What's kernel fusion?"
                │
                └─→ "Combining multiple operations into a single CUDA kernel"
                    │
                    └─→ "Why is that necessary?"
                        │
                        └─→ "Memory bandwidth bottleneck"
                            │
                            └─→ 【GPU Architecture】
                                │
                                ├─→ SRAM vs HBM
                                ├─→ Memory Hierarchy
                                ├─→ Warp Scheduling
                                └─→ You're a hardware engineer now
                                    │
                                    └─→ (career change?)
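Before the summary, the ZOH branch above is small enough to run. A minimal NumPy sketch with a diagonal A (as Mamba uses in practice); the toy values and function names are mine:

```python
import numpy as np

# Zero-Order Hold for a diagonal A: Abar = exp(dt*A) elementwise, and
# Bbar = (exp(dt*A) - 1)/A * B (the exact ZOH integral for this case).
def zoh_discretize(A, B, dt):
    Abar = np.exp(dt * A)
    Bbar = (Abar - 1.0) / A * B
    return Abar, Bbar

# The resulting recurrence: h_t = Abar*h_{t-1} + Bbar*x_t, y_t = C h_t.
def ssm_scan(Abar, Bbar, C, x):
    h, ys = np.zeros_like(Bbar), []
    for x_t in x:
        h = Abar * h + Bbar * x_t
        ys.append(C @ h)
    return np.array(ys)

A = np.array([-1.0, -2.0])                       # negative -> stable decay
Abar_small, _ = zoh_discretize(A, np.ones(2), 0.1)
Abar_large, _ = zoh_discretize(A, np.ones(2), 10.0)
print(Abar_large)  # large dt drives Abar toward 0: forget the past
```

That last print is the "large Δ → Ā → 0" claim from the Selective branch, checked numerically.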

Mamba Summary (if you survived this far)


Mamba = SSM + Selection + Parallel Scan + HiPPO + CUDA wizardry

      = Control Theory + Information Theory + Parallel Algorithms + Functional Analysis + GPU Programming

      = ??? (confusion)
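The parallel-scan claim deserves one runnable check. A toy sketch (function names are mine, and this is not Mamba's fused kernel): each step h_t = a_t·h_{t-1} + b_t is multiplication by the matrix [[a_t, b_t], [0, 1]] acting on (h, 1), and matrix products are associative, so the prefix products can be grouped any way a GPU likes.

```python
import numpy as np

def sequential(a, b, h0=0.0):
    # Plain recurrence: h_t = a_t * h_{t-1} + b_t
    h, hs = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        hs.append(h)
    return hs

def via_prefix_products(a, b, h0=0.0):
    # Same states via prefix products of 2x2 affine matrices. This loop is
    # still sequential, but because matrix multiplication is associative a
    # real implementation can tree-reduce the products: O(n) work, O(log n) span.
    P, hs = np.eye(2), []
    for a_t, b_t in zip(a, b):
        P = np.array([[a_t, b_t], [0.0, 1.0]]) @ P
        hs.append((P @ np.array([h0, 1.0]))[0])
    return hs
```

Both functions produce the same state sequence; only the second one's grouping freedom is what makes the "parallelizable RNN" story work.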


3. So You Want to Understand Titans?

Full Dependency Tree


"What's Titans?"

│
├─→ "It combines Attention + Neural Memory"
│ │
│ ├─→ "What's Neural Memory?"
│ │ │
│ │ └─→ "A learnable memory module"
│ │ │
│ │ └─→ "What does that even mean?"
│ │ │
│ │ └─→ "It updates memory at test time"
│ │ │
│ │ ├─→ "Learning at test time?"
│ │ │ │
│ │ │ └─→ "Yep, Test-Time Training"
│ │ │ │
│ │ │ └─→ 【The World of Meta-Learning】 ← branch point
│ │ │ │
│ │ │ ├─→ MAML
│ │ │ │ │
│ │ │ │ └─→ "Learning to learn"
│ │ │ │ │
│ │ │ │ └─→ "Double optimization"
│ │ │ │ │
│ │ │ │ └─→ 【Bilevel Optimization】
│ │ │ │ │
│ │ │ │ └─→ One month here
│ │ │ │
│ │ │ └─→ In-Context Learning
│ │ │ │
│ │ │ └─→ "Nobody knows why it works (seriously)"
│ │ │ │
│ │ │ └─→ 【Open Problem】
│ │ │
│ │ └─→ "How do you update the memory?"
│ │ │
│ │ └─→ "With Gradient Descent"
│ │ │
│ │ └─→ "Gradients during inference?"
│ │ │
│ │ └─→ "Isn't that slow, you ask?"
│ │ │
│ │ └─→ "Just 1 step of Mini-batch SGD"
│ │ │
│ │ └─→ "Still slow though?"
│ │ │
│ │ └─→ "So we need to find efficient methods..."
│ │ │
│ │ └─→ (research ongoing)
│ │
│ └─→ "Why combine it with Attention?"
│ │
│ └─→ "Short-term → Attention, Long-term → Memory"
│ │
│ └─→ "Why is that split good?"
│ │
│ └─→ "It's similar to the human memory system"
│ │
│ └─→ 【Cognitive Science】 surprise appearance
│ │
│ ├─→ Working Memory vs Long-term Memory
│ ├─→ Atkinson-Shiffrin Model
│ └─→ Intro to Psychology (nice to meet you)
│
├─→ 【Neural Memory Details】 ← the core
│ │
│ └─→ "What exactly is the memory?"
│ │
│ └─→ "A learnable parameter M ∈ R^{d×k}"
│ │
│ ├─→ "How do you read from it?"
│ │ │
│ │ └─→ "Attention with a Query"
│ │ │
│ │ └─→ Attention shows up again here
│ │ │
│ │ └─→ 【Infinite loop begins】
│ │
│ ├─→ "How do you write to it?"
│ │ │
│ │ └─→ "Update based on Surprise"
│ │ │
│ │ └─→ "What's Surprise?"
│ │ │
│ │ └─→ "-log p(x), it's information content"
│ │ │
│ │ └─→ 【Information Theory】 returns
│ │ │
│ │ ├─→ Self-Information
│ │ ├─→ Entropy
│ │ └─→ KL Divergence
│ │ │
│ │ └─→ 2 weeks here
│ │
│ └─→ "Why update based on Surprise?"
│ │
│ └─→ "Because surprising things should be remembered"
│ │
│ └─→ "That's intuitive"
│ │
│ └─→ Finally something that makes sense!!! (emotional)
│
├─→ 【Modern Hopfield Networks Connection】
│ │
│ └─→ "Is Neural Memory related to Hopfield Networks?"
│ │
│ └─→ "Similar from an Associative Memory perspective"
│ │
│ └─→ "What's a Hopfield Network?"
│ │
│ └─→ "An energy-based associative memory model"
│ │
│ └─→ 【Statistical Physics】 surprise appearance
│ │
│ ├─→ Energy Function
│ │ │
│ │ └─→ E = -Σ w_ij s_i s_j
│ │ │
│ │ └─→ "Isn't that the Ising Model?"
│ │ │
│ │ └─→ "Yep"
│ │ │
│ │ └─→ 【Statistical Mechanics】
│ │ │
│ │ └─→ Say hello to the Physics department
│ │
│ ├─→ Associative Memory
│ │ │
│ │ └─→ "Store and retrieve patterns"
│ │ │
│ │ └─→ "How many can you store?"
│ │ │
│ │ └─→ "0.14N (Classic)"
│ │ │
│ │ └─→ "Modern Hopfield can do exp(N)"
│ │ │
│ │ └─→ "How?"
│ │ │
│ │ └─→ "Change the energy function"
│ │ │
│ │ └─→ 【Ramsauer et al. 2020】
│ │ │
│ │ └─→ "This is the same as Attention?"
│ │ │
│ │ └─→ "Yep"
│ │ │
│ │ └─→ Connection found!!! (emotional)
│ │
│ └─→ But Titans uses a different approach
│ │
│ └─→ (Was that a waste of time? Well, not exactly...)
│
├─→ 【Memory Integration Methods】
│ │
│ └─→ "How do you attach memory to a Transformer?"
│ │
│ ├─→ "MAC (Memory as Context)"
│ │ │
│ │ └─→ "Append memory reads to the context"
│ │ │
│ │ └─→ "Doesn't that make the sequence longer?"
│ │ │
│ │ └─→ "So you need to limit memory size"
│ │ │
│ │ └─→ (trade-off)
│ │
│ ├─→ "MAG (Memory as Gate)"
│ │ │
│ │ └─→ "Use memory to modulate Attention"
│ │ │
│ │ └─→ "What's Gating?"
│ │ │
│ │ └─→ "σ(x) ⊙ y form"
│ │ │
│ │ └─→ "Like LSTM gates?"
│ │ │
│ │ └─→ "Yep"
│ │ │
│ │ └─→ 【LSTM】 returns
│ │ │
│ │ └─→ The 1997 paper
│ │ │
│ │ └─→ The OG of long-term memory (emotional)
│ │
│ └─→ "MAL (Memory as Layer)"
│ │
│ └─→ "Process memory as a separate layer"
│ │
│ └─→ "Cleanest separation"
│ │
│ └─→ "Performance?"
│ │
│ └─→ "Gotta run experiments"
│ │
│ └─→ (more experiments)
│
└─→ 【Persistent Memory vs Contextual Memory】
    │
    └─→ "Task-level vs Instance-level memory distinction?"
        │
        └─→ "Yes, you need to separate the two"
            │
            ├─→ "Persistent Memory"
            │ │
            │ └─→ "Shared across all sequences"
            │ │
            │ └─→ "Stores learned knowledge"
            │
            └─→ "Contextual Memory"
                │
                └─→ "Specific to the current sequence"
                    │
                    └─→ "Updated at test-time"
                        │
                        └─→ "Why does this distinction matter?"
                            │
                            └─→ "Separating knowledge from context"
                                │
                                └─→ 【Episodic vs Semantic Memory】
                                    │
                                    └─→ Cognitive science strikes again


4. So You Want to Understand Hope?

Full Dependency Tree


"What's Hope?"

│
├─→ "It builds Self-Modifying networks with Nested Learning"
│ │
│ ├─→ "What's Nested Learning?"
│ │ │
│ │ └─→ "Learning inside of learning"
│ │ │
│ │ └─→ "Inner Loop / Outer Loop?"
│ │ │
│ │ └─→ "Yes, it's Bilevel Optimization"
│ │ │
│ │ └─→ 【Bilevel Optimization】
│ │ │
│ │ ├─→ "Upper problem depends on lower problem's optimal solution"
│ │ │ │
│ │ │ └─→ "Show me the math"
│ │ │ │
│ │ │ └─→ min_θ L(θ, φ*(θ))
│ │ │ where φ*(θ) = argmin_φ l(θ, φ)
│ │ │ │
│ │ │ └─→ "How do you compute the gradient?"
│ │ │ │
│ │ │ └─→ 【Implicit Differentiation】
│ │ │ │
│ │ │ └─→ Need to compute dφ*/dθ
│ │ │ │
│ │ │ └─→ 【Implicit Function Theorem】
│ │ │ │
│ │ │ └─→ 【Real Analysis】
│ │ │ │
│ │ │ └─→ Another 2 weeks here
│ │ │
│ │ └─→ "Why do something this complicated?"
│ │ │
│ │ └─→ "For rapid adaptation"
│ │ │
│ │ └─→ "Is this the same as Meta-Learning?"
│ │ │
│ │ └─→ "Related"
│ │ │
│ │ └─→ MAML, Reptile, ...
│ │ │
│ │ └─→ One month here
│ │
│ └─→ "What does Self-Modifying mean?"
│ │
│ └─→ "The network modifies its own parameters"
│ │
│ └─→ "At test-time?"
│ │
│ └─→ "Yes"
│ │
│ └─→ "Is that even possible?"
│ │
│ └─→ "Similar concept to Hypernetworks"
│ │
│ └─→ 【Hypernetwork】
│ │
│ └─→ "A network generates another network's weights"
│ │
│ └─→ "Related to Neural Turing Machines?"
│ │
│ └─→ 【Differentiable Computing】
│ │
│ └─→ The 2014-2016 DeepMind era
│ │
│ └─→ History lesson (escape)
│
├─→ 【Causal Memory [...]】 ← The core of Hope
│ │
│ └─→ "What's CMS?"
│ │
│ └─→ "A system that updates memory causally"
│ │
│ ├─→ "What does 'causal' mean here?"
│ │ │
│ │ └─→ "Past affects future, future does NOT affect past"
│ │ │
│ │ └─→ "Isn't that obvious?"
│ │ │
│ │ └─→ "Attention is bidirectional though"
│ │ │
│ │ └─→ "Oh, so that's why we use Causal Masks"
│ │ │
│ │ └─→ "But it matters for memory systems too"
│ │ │
│ │ └─→ "Why?"
│ │ │
│ │ └─→ "Preventing future information leakage"
│ │ │
│ │ └─→ 【Causal Inference】 out of nowhere
│ │ │
│ │ ├─→ Pearl's Causal Hierarchy
│ │ ├─→ do-calculus
│ │ └─→ This is philosophy now (seriously)
│ │
│ └─→ "How does memory update work?"
│ │
│ └─→ "Gradient-based + Causal Masking"
│ │
│ └─→ "Does it also use Surprise?"
│ │
│ └─→ "There's a similar concept"
│ │
│ └─→ Prediction Error
│ │
│ └─→ 【Predictive Coding】
│ │
│ └─→ A theory from neuroscience
│ │
│ └─→ Brain science enters the chat (how far does this go)
│
├─→ 【TTT (Test-Time Training) Layer】
│ │
│ └─→ "There was a TTT Layer paper before Hope"
│ │
│ └─→ "What's the relationship?"
│ │
│ └─→ "TTT: uses a linear model as inner state"
│ │
│ └─→ "Hope: a more general framework"
│ │
│ └─→ "Should I understand TTT first to get Hope?"
│ │
│ └─→ "Recommended"
│ │
│ └─→ 【TTT paper first】
│ │
│ └─→ Sun et al. 2024
│ │
│ └─→ (added to reading list)
│
└─→ 【Nested Learning Details】
    │
    └─→ "What exactly is learning inside learning?"
        │
        └─→ "Outer: overall sequence Loss"
            │
            └─→ "Inner: local prediction Loss"
                │
                └─→ "Inner updates the memory"
                    │
                    └─→ "Outer updates all parameters"
                        │
                        └─→ "So backward pass runs twice?"
                            │
                            └─→ "Yes, Gradient of Gradient"
                                │
                                └─→ 【Second-Order Optimization】
                                    │
                                    ├─→ Hessian
                                    │ │
                                    │ └─→ O(n²) parameters
                                    │ │
                                    │ └─→ "Computationally infeasible"
                                    │ │
                                    │ └─→ "Need approximations"
                                    │ │
                                    │ └─→ Hessian-Free Methods
                                    │ │
                                    │ └─→ 【Numerical Optimization】
                                    │ │
                                    │ └─→ Another month here
                                    │
                                    └─→ Truncated Backprop
                                        │
                                        └─→ "Only unroll a few steps"
                                            │
                                            └─→ "Doesn't that introduce bias?"
                                                │
                                                └─→ "It does"
                                                    │
                                                    └─→ "But it's practical so we use it"
                                                        │
                                                        └─→ It works so whatever (runs away)
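The implicit-differentiation step above is less mystical on a scalar toy problem (this is just the mechanism, not Hope's formulation; all names are mine). With inner loss l(θ, φ) = 0.5·a·φ² - θ·φ, stationarity gives φ*(θ) = θ/a, and the implicit function theorem gives dφ*/dθ = -l_φθ/l_φφ = 1/a without differentiating through an inner optimizer.

```python
# Bilevel toy: outer L(theta) = (phi*(theta) - 1)^2 + 0.1*theta^2,
# inner phi*(theta) = argmin_phi 0.5*a*phi^2 - theta*phi = theta / a.
a = 2.0

def phi_star(theta):
    return theta / a

def outer(theta):
    return (phi_star(theta) - 1.0) ** 2 + 0.1 * theta ** 2

def outer_grad(theta):
    dphi_dtheta = 1.0 / a                     # implicit function theorem: -l_phitheta / l_phiphi
    dL_dphi = 2.0 * (phi_star(theta) - 1.0)   # partial of L wrt phi
    dL_dtheta = 0.2 * theta                   # direct partial of L wrt theta
    return dL_dphi * dphi_dtheta + dL_dtheta  # total derivative through phi*
```

Comparing `outer_grad` against a finite difference of `outer` confirms the chain rule through φ*, which is what every "gradient of gradient" setup ultimately relies on.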


5. So How Do These Three Relate?

Comparison


|  | Mamba | Titans | Hope |
|---|---|---|---|
| Core Idea | SSM + Selection | Neural Memory | Nested Learning |
| Complexity | O(n) | O(n) + α | O(n) + αα |
| Long-term Memory | State Compression | Explicit Memory | Self-Modifying Memory |
| Theoretical Basis | Control Theory | Information Theory | Optimization Theory |
| Implementation Pain | CUDA hell | Medium | High |
| Required Reading | Physics+Engineering | CogSci+ML | Optimization+ML |

Decision Guide


"What should I study?"

│
├─→ If you want practical implementation
│ │
│ └─→ Mamba (most available implementations)
│ │
│ └─→ But need to understand CUDA kernels (hard)
│
├─→ If you want theoretical understanding
│ │
│ └─→ Titans (most intuitive concepts)
│ │
│ └─→ But need to understand test-time learning
│
└─→ If you want to follow cutting-edge research
    │
    └─→ Hope (the 2025 Nested Learning paper)
        │
        └─→ But you need to know all the previous stuff
            │
            └─→ 【Infinite loop】


6. Do I Really Need to Know All This?

The Honest Answer


"Do I need to know everything?"

│
├─→ To write papers: go deep in your area, concepts-only for the rest
├─→ To implement: focus on Mamba + use libraries
├─→ To compare: just the core ideas of each
└─→ To just use them:



    from mamba_ssm import Mamba

    model = Mamba(d_model=512)

    # done

Realistic Time Estimates

| Goal | Time Required | What It Actually Means |
|------|---------------|------------------------|
| API usage | 1 day | Nothing |
| Fine-tuning | 1-2 weeks | Editing config files |
| Architecture understanding | 1-3 months | Level 1 of the tree |
| Paper comprehension | 6 months - 1 year | Level 2 of the tree |
| Suggesting improvements | 1-2 years | Most of the tree |
| Proposing new architectures | 3+ years | Entire tree + intuition |
| Complete understanding | ??? | Even the authors don't know everything (seriously) |

Required Knowledge by Area


【Shared Prerequisites】

├── Linear Algebra: Matrices, Eigenvalues, SVD
├── Calculus: Partial Derivatives, Chain Rule
├── Probability: Basic Statistics
└── Deep Learning: Transformer Basics



【Mamba-Specific】

├── Control Theory: State Space, Stability
├── Signal Processing: Z-Transform, Discretization
├── Parallel Algorithms: Scan Operations
└── GPU Programming: CUDA (optional)



【Titans-Specific】

├── Information Theory: Entropy, Surprise
├── Cognitive Science: Memory Systems (concepts only)
├── Meta-Learning: MAML etc. (concepts only)
└── Hopfield Networks: Energy-Based Models



【Hope-Specific】

├── Optimization Theory: Bilevel, Implicit Diff
├── Meta-Learning: Advanced
├── Causal Inference: Basic Concepts
└── Second-Order Methods: Hessian Approximation



【What You Actually Use】

└── model.forward(x)


7. Escaping Dependency Hell: A Guide

Option A: Strategic Surrender


# Using Mamba

from mamba_ssm import Mamba

layer = Mamba(d_model=512, d_state=16, d_conv=4)



# Or just use Transformers (still great)

from transformers import AutoModel

model = AutoModel.from_pretrained("...")



# Honestly Attention is fast enough (reality)

Option B: Dig When Needed


1. Read only the Abstract + Figure 1 of the paper

2. Run the code

3. If something's weird, dig into that part

4. Repeat

5. Somehow you just know it now (magic)

Option C: The Purist Path (not recommended)


Linear Algebra (3 months)

    ↓

Calculus + Analysis (3 months)

    ↓

Probability (2 months)

    ↓

Signal Processing (2 months)

    ↓

Control Theory (3 months)

    ↓

Optimization Theory (3 months)

    ↓

Deep Learning Basics (3 months)

    ↓

Transformers (2 months)

    ↓

SSM Theory (2 months)

    ↓

Meta-Learning (2 months)

    ↓

Mamba/Titans/Hope (3 months)

    ↓

Total time: 2.5-3 years



By then a new architecture has dropped

(Attention Is All You Need → No It Isn't → Actually Maybe)

(rip)

Option D: Hybrid (recommended)


1. Get the big picture from this post (1 day)

2. Build intuition with Lilian Weng's blog (1 week)

3. Hands-on with code (1-2 weeks)

4. Deep dive only into what interests you (ongoing)

5. Math on a need-to-know basis

6. Focus on the Methods section of papers



Key insight: You don't need to know everything

              You only need to know what you need

              (This is the hardest part)


8. FAQ

"Doesn't performance drop without Attention?"


├─→ In general: similar or slightly lower
├─→ Long sequences: can actually be better
├─→ Conclusion: depends on the situation (irresponsible answer)
└─→ Also: Hybrid is the trend

"Can I skip the math?"


├─→ Just using it: yes
├─→ Fine-tuning: yes
├─→ Implementing: tough
├─→ Understanding papers: no
├─→ Research: absolutely not
└─→ But you can learn what you need as you go

"Where do I start?"


Recommended order:

1. Review Attention (essential)

2. "The Annotated S4" blog (S4/Mamba intuition)

3. Mamba paper + official implementation (hands-on)

4. Titans/Hope if you're interested (optional)



Videos:

- Yannic Kilcher's Mamba review

- AI Coffee Break explainer



Blogs:

- Lilian Weng (GOAT)

- Jay Alammar's visualizations

"How long will it take?"


"1 week" - Genius or someone who only knows Attention

"1 month" - Optimist

"3 months" - Realist

"6 months" - Deep diver

"A lifetime" - The truth

"I don't know" - Honest person

"Can I just give up?"


Yes.



Attention is still plenty good,

Flash Attention has solved a lot of the hardware issues,

100K tokens just works now.



However:

- For 1M+ tokens, you need alternatives

- If you're a researcher, you need to know this

- If your company tells you to... good luck

"What's the outlook for this field?"

├─→ It's hot (seriously)
├─→ Mamba2, Jamba, Griffin etc. keep coming
├─→ Hybrid is the trend
├─→ But who knows in 6 months
└─→ This is the 10th time I've heard "Transformers are dead"
    └─→ They're not dead yet (the eternal rule)

9. Closing Thoughts

Your State After Reading This


Before: "Mamba replaces Attention apparently, let me look into it"

After: "Control theory... signal processing... statistical physics... cognitive science..."

       "Why did I choose this path"

       "Maybe I'll just use Attention"

Words of Comfort

  • Even experts in this field don't know all of it

  • The Mamba authors don't know Titans internals, and the Titans authors don't know Hope's details

  • "I don't really get it but it works when I run experiments" is the industry standard

  • Nobody reads the paper's Appendix, not even reviewers (seriously)

  • You're not the only one struggling

But If You Still Want to Do This


Key tips:

1. Don't try to understand everything perfectly

2. Pick only what you need and go deep there

3. Code first, theory later

4. Don't be embarrassed to ask questions

5. Not understanding a paper on the first read is normal



In the end: Time + Persistence + Googling = You'll get there eventually



You got this!!!

(The author still can't derive Implicit Differentiation)


Appendix: Recommended Resources (in order)

Blogs (free, essential)

  1. "The Annotated S4" - srush/annotated-s4

  2. Lilian Weng - "State Space Models"

  3. "A Visual Guide to Mamba" - Maarten Grootendorst

Videos (free)

  1. Yannic Kilcher - Mamba paper review

  2. AI Coffee Break - SSM explainer

Papers (in order)

  1. S4 (Gu et al., 2021) - SSM fundamentals

  2. Mamba (Gu & Dao, 2023) - Selection + Scan

  3. Mamba-2 (Dao & Gu, 2024) - State Space Duality

  4. Titans (Behrouz et al., 2024) - Neural Memory

  5. Hope (Behrouz et al., 2025) - Nested Learning

Code

  • state-spaces/mamba (official)

  • huggingface/transformers (Mamba support)

  • Each paper's official repo

Math Foundations (when you get stuck)

  • 3Blue1Brown Linear Algebra (video)

  • Steve Brunton Control Theory (video)

  • For probability... you'll need a textbook (runs away)