So You Want to Understand Long-Context Models?
(The Mamba, Titans, Hope Dependency Hell Guide)
"I heard Attention is O(n²) and slow. So I looked for alternatives..." → gates of hell: opened
0. Reality Check First
What you want:
"I just want a rough idea of how Mamba and Titans work"
"They say it replaces Attention — I'm curious about the principles"
What you'll actually face:
Long-Context Models
│
├── Mamba (State Space Model family)
│ ├── SSM Theory
│ │ ├── Control Theory
│ │ ├── Signal Processing
│ │ └── Complex Analysis
│ ├── Selection Mechanism
│ │ └── Information Theory
│ └── Hardware Optimization
│ └── CUDA Kernel Programming
│ └── GPU Architecture
│ └── Computer Architecture
│ └── (this is Electrical Engineering now)
│
├── Titans (Memory Module family)
│ ├── Neural Memory
│ │ └── Hopfield Networks
│ │ └── Statistical Physics
│ ├── Surprise Metric
│ │ └── Information Theory (it's back)
│ └── Test-Time Learning
│ └── Meta-Learning
│ └── The Abyss of Optimization Theory
│
└── Hope (Self-Modifying Networks)
├── Nested Learning
│ └── Higher-Order Optimization
│ └── Implicit Differentiation
│ └── Functional Analysis
└── Causal Memory
└── Causal Inference
└── Philosophy (seriously)
Estimated time: When will it end? (answer refused)
1. Why You Entered This Hell: The Problem with Attention
What's Wrong with Attention?
"Attention is O(n²)"
│
└─→ "What's n?"
│
└─→ "Sequence length"
│
└─→ "So what's the problem?"
│
└─→ "Double the length = 4x the cost"
│
└─→ "What if I feed in 100K tokens?"
│
└─→ "10 billion operations"
│
└─→ "GPU memory goes boom"
│
└─→ "So we need alternatives"
│
└─→ This is where hell begins
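The arithmetic in the chain above is worth checking once by hand. A throwaway sketch (pure counting, not a benchmark; the function name is just for illustration):

```python
# Back-of-the-envelope cost of the attention score matrix: n^2 query-key pairs.
def attention_scores(n: int) -> int:
    """Number of query-key dot products scored for sequence length n."""
    return n * n

for n in (1_000, 2_000, 100_000):
    print(f"n = {n:>7}: {attention_scores(n):,} pairs")

# Doubling n quadruples the work; 100K tokens -> 10 billion pairs.
```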
The Genealogy of Alternatives
"I want O(n) instead of O(n²)"
│
├─→ Sparse Attention (just look at some of it)
│ │
│ └─→ "Which parts should we look at?"
│ │
│ └─→ "If we knew that, we wouldn't need Attention"
│ │
│ └─→ 【Contradiction】
│ │
│ └─→ We do it with heuristics anyway (runs away)
│
├─→ Linear Attention (kernel trick)
│ │
│ └─→ "If we approximate Softmax..."
│ │
│ └─→ "Why doesn't it work exactly?"
│ │
│ └─→ 【Kernel Methods】
│ │
│ └─→ RKHS
│ │
│ └─→ 【Functional Analysis】
│ │
│ └─→ There goes one semester
│
├─→ State Space Models (Mamba)
│ │
│ └─→ "Let's compress with recurrence"
│ │
│ └─→ Details below (horror)
│
└─→ Memory Systems (Titans, Hope)
│
└─→ "Let's store it in memory"
│
└─→ Further below (more horror)
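For the Linear Attention branch above, the whole trick fits in a few lines: once softmax is replaced by a feature map φ, plain matrix associativity lets you regroup the product so the n×n intermediate never materializes. A toy sketch (the feature map here is an arbitrary positive choice for illustration, not any particular paper's, and normalization is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Toy positive feature map (assumption, stand-in for a real kernel choice)
phi = lambda X: np.maximum(X, 0) + 1e-3

quadratic = (phi(Q) @ phi(K).T) @ V   # builds an n x n matrix: O(n^2 * d)
linear    = phi(Q) @ (phi(K).T @ V)   # builds a d x d matrix:  O(n * d^2)

print(np.allclose(quadratic, linear))  # same result, very different cost
```

The regrouping is exact; what's approximate is replacing softmax with φ in the first place, which is where the "why doesn't it work exactly?" rabbit hole starts.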
2. So You Want to Understand Mamba?
Full Dependency Tree
"What's Mamba?"
│
├─→ "It's a Selective State Space Model"
│ │
│ ├─→ "What's a State Space Model?"
│ │ │
│ │ └─→ "A recurrent model from discretized continuous systems"
│ │ │
│ │ ├─→ "Continuous systems?"
│ │ │ │
│ │ │ └─→ "dx/dt = Ax + Bu, y = Cx + Du"
│ │ │ │
│ │ │ └─→ "What does that even mean?"
│ │ │ │
│ │ │ └─→ 【Control Theory】
│ │ │ │
│ │ │ ├─→ State Space Representation
│ │ │ ├─→ Transfer Functions
│ │ │ ├─→ Stability Analysis
│ │ │ └─→ EE junior year (hello again)
│ │ │
│ │ ├─→ "How do you discretize?"
│ │ │ │
│ │ │ └─→ "Several methods"
│ │ │ │
│ │ │ ├─→ Zero-Order Hold (ZOH)
│ │ │ │ │
│ │ │ │ └─→ "Derive by integration"
│ │ │ │ │
│ │ │ │ └─→ 【Matrix Exponential】
│ │ │ │ │
│ │ │ │ └─→ e^{At} = Σ_{n=0}^∞ (At)^n/n!
│ │ │ │ │
│ │ │ │ └─→ "Does the infinite series converge?"
│ │ │ │ │
│ │ │ │ └─→ 【Real Analysis】
│ │ │ │ │
│ │ │ │ └─→ 2 weeks here
│ │ │ │
│ │ │ ├─→ Bilinear Transform
│ │ │ │ │
│ │ │ │ └─→ "s = 2/Δ · (z-1)/(z+1)"
│ │ │ │ │
│ │ │ │ └─→ "What's z?"
│ │ │ │ │
│ │ │ │ └─→ 【Z-Transform】
│ │ │ │ │
│ │ │ │ └─→ 【Signal Processing】
│ │ │ │ │
│ │ │ │ ├─→ Laplace Transform
│ │ │ │ ├─→ Fourier Transform
│ │ │ │ └─→ EE sophomore year (we meet again)
│ │ │ │
│ │ │ └─→ "Which one should I use?"
│ │ │ │
│ │ │ └─→ "Depends on the situation" (irresponsible answer)
│ │ │
│ │ └─→ "Why is recurrence good?"
│ │ │
│ │ └─→ "Because h_t = Āh_{t-1} + B̄x_t"
│ │ │
│ │ └─→ "Why is that O(n)?"
│ │ │
│ │ └─→ "Each step only references the previous state"
│ │ │
│ │ └─→ "Oh, so it's like an RNN?"
│ │ │
│ │ └─→ "Conceptually similar, but parallelizable"
│ │ │
│ │ └─→ "How?"
│ │ │
│ │ └─→ 【Parallel Scan】
│ │ │
│ │ └─→ Continued below
│ │
│ └─→ "What does Selective mean?"
│ │
│ └─→ "Parameters change based on input"
│ │
│ ├─→ "What changes?"
│ │ │
│ │ └─→ "B, C, Δ are functions of the input"
│ │ │
│ │ └─→ "What's Δ?"
│ │ │
│ │ └─→ "Discretization step size"
│ │ │
│ │ └─→ "Why should it vary with input?"
│ │ │
│ │ └─→ "Go slow on important stuff, fast on unimportant stuff"
│ │ │
│ │ └─→ "Why does that work?"
│ │ │
│ │ └─→ "Large Δ → less previous state influence, more current input influence"
│ │ │
│ │ └─→ "Show me the math"
│ │ │
│ │ └─→ Ā = exp(ΔA); since A is stable (negative eigenvalues), large Δ drives Ā → 0
│ │ │
│ │ └─→ 【Matrix Exponential】 appears again
│ │
│ └─→ "Why does this matter?"
│ │
│ └─→ "LTI vs LTV"
│ │
│ ├─→ "LTI (Linear Time-Invariant)"
│ │ │
│ │ └─→ "Fixed parameters"
│ │ │
│ │ └─→ "Can be computed as convolution"
│ │ │
│ │ └─→ "O(n log n) with FFT"
│ │ │
│ │ └─→ 【Fourier Transform】
│ │ │
│ │ └─→ Finally something familiar!!! (emotional)
│ │ │
│ │ └─→ But Mamba doesn't use this
│ │ │
│ │ └─→ ??? (was that a waste of time?)
│ │
│ └─→ "LTV (Linear Time-Varying)"
│ │
│ └─→ "Parameters change over time"
│ │
│ └─→ "Can't use convolution"
│ │
│ └─→ "Then how do you compute it fast?"
│ │
│ └─→ 【Parallel Scan】 ← This is the key
│ │
│ └─→ Below...
│
├─→ 【Parallel Scan (Selective Scan)】 ← Mamba's real secret
│ │
│ └─→ "Recurrence but parallelizable?"
│ │
│ └─→ "If the operation is Associative, yes"
│ │
│ └─→ "What's Associative?"
│ │ │
│ │ └─→ "(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)"
│ │ │
│ │ └─→ 【Algebra】 Associativity
│ │ │
│ │ └─→ Finally, high school math!!! (celebration)
│ │
│ └─→ "Is the SSM update Associative?"
│ │
│ └─→ "If you express it as matrices, yes"
│ │
│ └─→ [h_t; 1] = [Ā, B̄x_t; 0, 1] · [h_{t-1}; 1] (the affine step as a single matrix)
│ │
│ └─→ "Matrix multiplication is Associative"
│ │
│ └─→ "So you can compute it in parallel like Prefix Sum"
│ │
│ └─→ "work O(n), span O(log n)"
│ │
│ └─→ 【Parallel Algorithms】
│ │
│ └─→ GPU Programming
│ │
│ └─→ CUDA
│ │
│ └─→ Another month here
│
├─→ 【HiPPO】 The Secret to Long-Term Memory
│ │
│ └─→ "Why does SSM handle long-term memory well?"
│ │
│ └─→ "The A matrix initialization is special"
│ │
│ └─→ "What's a HiPPO matrix?"
│ │
│ └─→ "Compresses the past using orthogonal polynomials"
│ │
│ ├─→ "Orthogonal polynomials?"
│ │ │
│ │ └─→ 【Functional Analysis】
│ │ │
│ │ ├─→ Legendre Polynomials
│ │ ├─→ Laguerre Polynomials
│ │ ├─→ Chebyshev Polynomials
│ │ └─→ Why are all the names French?
│ │ │
│ │ └─→ (mostly historical reasons; Chebyshev is Russian)
│ │
│ └─→ "Why is compressing with orthogonal polynomials good?"
│ │
│ └─→ "It guarantees optimal approximation"
│ │
│ └─→ 【Approximation Theory】
│ │
│ └─→ Weierstrass Theorem
│ │
│ └─→ Continuous functions can be approximated by polynomials...
│ │
│ └─→ 2 weeks here (minimum)
│
└─→ 【Hardware Optimization】 Making It Actually Fast
│
└─→ "It's theoretically O(n), but is it actually fast?"
│
└─→ "Kernel fusion is essential"
│
└─→ "What's kernel fusion?"
│
└─→ "Combining multiple operations into a single CUDA kernel"
│
└─→ "Why is that necessary?"
│
└─→ "Memory bandwidth bottleneck"
│
└─→ 【GPU Architecture】
│
├─→ SRAM vs HBM
├─→ Memory Hierarchy
├─→ Warp Scheduling
└─→ You're a hardware engineer now
│
└─→ (career change?)
Mamba Summary (if you survived this far)
Mamba = SSM + Selection + Parallel Scan + HiPPO + CUDA wizardry
= Control Theory + Information Theory + Parallel Algorithms + Functional Analysis + GPU Programming
= ??? (confusion)
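If you survived the tree, the whole pipeline (ZOH discretization, input-dependent Δ, associative scan) actually fits in a toy scalar-state sketch. This is an illustration of the math only, not the real mamba_ssm kernel (which is fused CUDA), and B is fixed to 1 for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = -1.0                           # stable continuous-time A, so exp(dt*A) < 1
x = rng.normal(size=n)             # input sequence
dt = np.exp(rng.normal(size=n))    # input-dependent step sizes: the "selection"

# Zero-Order Hold discretization (with B = 1):
Abar = np.exp(dt * A)              # Abar = exp(Δ·A); large Δ -> Abar -> 0
Bbar = (Abar - 1.0) / A            # Bbar = (exp(Δ·A) - 1) · A^{-1} · B

# 1) The plain recurrence h_t = Abar_t · h_{t-1} + Bbar_t · x_t, step by step:
h, seq = 0.0, []
for t in range(n):
    h = Abar[t] * h + Bbar[t] * x[t]
    seq.append(h)

# 2) The same thing as a scan: each step is an affine map (a, b), and
#    composing affine maps is associative -- that's what lets Mamba run
#    the recurrence as a parallel prefix scan on GPU.
def combine(p, q):                 # apply map p, then map q
    return (q[0] * p[0], q[0] * p[1] + q[1])

acc, scan = (1.0, 0.0), []         # (1, 0) is the identity map
for t in range(n):
    acc = combine(acc, (Abar[t], Bbar[t] * x[t]))
    scan.append(acc[1])

print(np.allclose(seq, scan))      # both routes give the same hidden states
```

The scan here is still sequential; the point is that `combine` is associative, so a real implementation can evaluate it tree-style in O(log n) span.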
3. So You Want to Understand Titans?
Full Dependency Tree
"What's Titans?"
│
├─→ "It combines Attention + Neural Memory"
│ │
│ ├─→ "What's Neural Memory?"
│ │ │
│ │ └─→ "A learnable memory module"
│ │ │
│ │ └─→ "What does that even mean?"
│ │ │
│ │ └─→ "It updates memory at test time"
│ │ │
│ │ ├─→ "Learning at test time?"
│ │ │ │
│ │ │ └─→ "Yep, Test-Time Training"
│ │ │ │
│ │ │ └─→ 【The World of Meta-Learning】 ← branch point
│ │ │ │
│ │ │ ├─→ MAML
│ │ │ │ │
│ │ │ │ └─→ "Learning to learn"
│ │ │ │ │
│ │ │ │ └─→ "Double optimization"
│ │ │ │ │
│ │ │ │ └─→ 【Bilevel Optimization】
│ │ │ │ │
│ │ │ │ └─→ One month here
│ │ │ │
│ │ │ └─→ In-Context Learning
│ │ │ │
│ │ │ └─→ "Nobody knows why it works (seriously)"
│ │ │ │
│ │ │ └─→ 【Open Problem】
│ │ │
│ │ └─→ "How do you update the memory?"
│ │ │
│ │ └─→ "With Gradient Descent"
│ │ │
│ │ └─→ "Gradients during inference?"
│ │ │
│ │ └─→ "Isn't that slow, you ask?"
│ │ │
│ │ └─→ "Just 1 step of Mini-batch SGD"
│ │ │
│ │ └─→ "Still slow though?"
│ │ │
│ │ └─→ "So we need to find efficient methods..."
│ │ │
│ │ └─→ (research ongoing)
│ │
│ └─→ "Why combine it with Attention?"
│ │
│ └─→ "Short-term → Attention, Long-term → Memory"
│ │
│ └─→ "Why is that split good?"
│ │
│ └─→ "It's similar to the human memory system"
│ │
│ └─→ 【Cognitive Science】 surprise appearance
│ │
│ ├─→ Working Memory vs Long-term Memory
│ ├─→ Atkinson-Shiffrin Model
│ └─→ Intro to Psychology (nice to meet you)
│
├─→ 【Neural Memory Details】 ← the core
│ │
│ └─→ "What exactly is the memory?"
│ │
│ └─→ "A learnable parameter M ∈ R^{d×k}"
│ │
│ ├─→ "How do you read from it?"
│ │ │
│ │ └─→ "Attention with a Query"
│ │ │
│ │ └─→ Attention shows up again here
│ │ │
│ │ └─→ 【Infinite loop begins】
│ │
│ ├─→ "How do you write to it?"
│ │ │
│ │ └─→ "Update based on Surprise"
│ │ │
│ │ └─→ "What's Surprise?"
│ │ │
│ │ └─→ "-log p(x), it's information content"
│ │ │
│ │ └─→ 【Information Theory】 returns
│ │ │
│ │ ├─→ Self-Information
│ │ ├─→ Entropy
│ │ └─→ KL Divergence
│ │ │
│ │ └─→ 2 weeks here
│ │
│ └─→ "Why update based on Surprise?"
│ │
│ └─→ "Because surprising things should be remembered"
│ │
│ └─→ "That's intuitive"
│ │
│ └─→ Finally something that makes sense!!! (emotional)
│
├─→ 【Modern Hopfield Networks Connection】
│ │
│ └─→ "Is Neural Memory related to Hopfield Networks?"
│ │
│ └─→ "Similar from an Associative Memory perspective"
│ │
│ └─→ "What's a Hopfield Network?"
│ │
│ └─→ "An energy-based associative memory model"
│ │
│ └─→ 【Statistical Physics】 surprise appearance
│ │
│ ├─→ Energy Function
│ │ │
│ │ └─→ E = -Σ w_ij s_i s_j
│ │ │
│ │ └─→ "Isn't that the Ising Model?"
│ │ │
│ │ └─→ "Yep"
│ │ │
│ │ └─→ 【Statistical Mechanics】
│ │ │
│ │ └─→ Say hello to the Physics department
│ │
│ ├─→ Associative Memory
│ │ │
│ │ └─→ "Store and retrieve patterns"
│ │ │
│ │ └─→ "How many can you store?"
│ │ │
│ │ └─→ "0.14N (Classic)"
│ │ │
│ │ └─→ "Modern Hopfield can do exp(N)"
│ │ │
│ │ └─→ "How?"
│ │ │
│ │ └─→ "Change the energy function"
│ │ │
│ │ └─→ 【Ramsauer et al. 2020】
│ │ │
│ │ └─→ "This is the same as Attention?"
│ │ │
│ │ └─→ "Yep"
│ │ │
│ │ └─→ Connection found!!! (emotional)
│ │
│ └─→ But Titans uses a different approach
│ │
│ └─→ (Was that a waste of time? Well, not exactly...)
│
├─→ 【Memory Integration Methods】
│ │
│ └─→ "How do you attach memory to a Transformer?"
│ │
│ ├─→ "MAC (Memory as Context)"
│ │ │
│ │ └─→ "Append memory reads to the context"
│ │ │
│ │ └─→ "Doesn't that make the sequence longer?"
│ │ │
│ │ └─→ "So you need to limit memory size"
│ │ │
│ │ └─→ (trade-off)
│ │
│ ├─→ "MAG (Memory as Gate)"
│ │ │
│ │ └─→ "Use memory to modulate Attention"
│ │ │
│ │ └─→ "What's Gating?"
│ │ │
│ │ └─→ "σ(x) ⊙ y form"
│ │ │
│ │ └─→ "Like LSTM gates?"
│ │ │
│ │ └─→ "Yep"
│ │ │
│ │ └─→ 【LSTM】 returns
│ │ │
│ │ └─→ The 1997 paper
│ │ │
│ │ └─→ The OG of long-term memory (emotional)
│ │
│ └─→ "MAL (Memory as Layer)"
│ │
│ └─→ "Process memory as a separate layer"
│ │
│ └─→ "Cleanest separation"
│ │
│ └─→ "Performance?"
│ │
│ └─→ "Gotta run experiments"
│ │
│ └─→ (more experiments)
│
└─→ 【Persistent Memory vs Contextual Memory】
│
└─→ "Task-level vs Instance-level memory distinction?"
│
└─→ "Yes, you need to separate the two"
│
├─→ "Persistent Memory"
│ │
│ └─→ "Shared across all sequences"
│ │
│ └─→ "Stores learned knowledge"
│
└─→ "Contextual Memory"
│
└─→ "Specific to the current sequence"
│
└─→ "Updated at test-time"
│
└─→ "Why does this distinction matter?"
│
└─→ "Separating knowledge from context"
│
└─→ 【Episodic vs Semantic Memory】
│
└─→ Cognitive science strikes again
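The read/write story above can be sketched in a few lines. Everything here is a deliberate simplification for illustration (a linear memory, a unit-norm key, plain SGD), not the Titans architecture itself; the point is just "write = one gradient step, sized by the prediction error, i.e. the surprise":

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 4
M = np.zeros((d, k))               # memory parameters, updated at *test* time

def read(M, q):
    """Read: query the memory (here just a linear map M @ q)."""
    return M @ q

def write(M, key, v, lr=0.1):
    """Write: one SGD step on 0.5 * ||M @ key - v||^2.
    A large prediction error (= big surprise) means a big update."""
    err = M @ key - v              # the surprise signal
    grad = np.outer(err, key)      # gradient of the loss w.r.t. M
    return M - lr * grad

key = rng.normal(size=k)
key /= np.linalg.norm(key)         # unit key keeps the toy update stable
v = rng.normal(size=d)

before = np.linalg.norm(read(M, key) - v)
for _ in range(50):
    M = write(M, key, v)
after = np.linalg.norm(read(M, key) - v)

print(after < before)              # the memory has absorbed the surprising pair
```

With a unit key the error shrinks by a factor (1 - lr) per write, which is the "surprising things get remembered" intuition in its most boring possible form.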
4. So You Want to Understand Hope?
Full Dependency Tree
"What's Hope?"
│
├─→ "It builds Self-Modifying networks with Nested Learning"
│ │
│ ├─→ "What's Nested Learning?"
│ │ │
│ │ └─→ "Learning inside of learning"
│ │ │
│ │ └─→ "Inner Loop / Outer Loop?"
│ │ │
│ │ └─→ "Yes, it's Bilevel Optimization"
│ │ │
│ │ └─→ 【Bilevel Optimization】
│ │ │
│ │ ├─→ "Upper problem depends on lower problem's optimal solution"
│ │ │ │
│ │ │ └─→ "Show me the math"
│ │ │ │
│ │ │ └─→ min_θ L(θ, φ*(θ))
│ │ │ where φ*(θ) = argmin_φ l(θ, φ)
│ │ │ │
│ │ │ └─→ "How do you compute the gradient?"
│ │ │ │
│ │ │ └─→ 【Implicit Differentiation】
│ │ │ │
│ │ │ └─→ Need to compute dφ*/dθ
│ │ │ │
│ │ │ └─→ 【Implicit Function Theorem】
│ │ │ │
│ │ │ └─→ 【Real Analysis】
│ │ │ │
│ │ │ └─→ Another 2 weeks here
│ │ │
│ │ └─→ "Why do something this complicated?"
│ │ │
│ │ └─→ "For rapid adaptation"
│ │ │
│ │ └─→ "Is this the same as Meta-Learning?"
│ │ │
│ │ └─→ "Related"
│ │ │
│ │ └─→ MAML, Reptile, ...
│ │ │
│ │ └─→ One month here
│ │
│ └─→ "What does Self-Modifying mean?"
│ │
│ └─→ "The network modifies its own parameters"
│ │
│ └─→ "At test-time?"
│ │
│ └─→ "Yes"
│ │
│ └─→ "Is that even possible?"
│ │
│ └─→ "Similar concept to Hypernetworks"
│ │
│ └─→ 【Hypernetwork】
│ │
│ └─→ "A network generates another network's weights"
│ │
│ └─→ "Related to Neural Turing Machines?"
│ │
│ └─→ 【Differentiable Computing】
│ │
│ └─→ The 2014-2016 DeepMind era
│ │
│ └─→ History lesson (escape)
│
├─→ 【Causal Memory [...]】 ← The core of Hope
│ │
│ └─→ "What's CMS?"
│ │
│ └─→ "A system that updates memory causally"
│ │
│ ├─→ "What does 'causal' mean here?"
│ │ │
│ │ └─→ "Past affects future, future does NOT affect past"
│ │ │
│ │ └─→ "Isn't that obvious?"
│ │ │
│ │ └─→ "Attention is bidirectional though"
│ │ │
│ │ └─→ "Oh, so that's why we use Causal Masks"
│ │ │
│ │ └─→ "But it matters for memory systems too"
│ │ │
│ │ └─→ "Why?"
│ │ │
│ │ └─→ "Preventing future information leakage"
│ │ │
│ │ └─→ 【Causal Inference】 out of nowhere
│ │ │
│ │ ├─→ Pearl's Causal Hierarchy
│ │ ├─→ do-calculus
│ │ └─→ This is philosophy now (seriously)
│ │
│ └─→ "How does memory update work?"
│ │
│ └─→ "Gradient-based + Causal Masking"
│ │
│ └─→ "Does it also use Surprise?"
│ │
│ └─→ "There's a similar concept"
│ │
│ └─→ Prediction Error
│ │
│ └─→ 【Predictive Coding】
│ │
│ └─→ A theory from neuroscience
│ │
│ └─→ Brain science enters the chat (how far does this go)
│
├─→ 【TTT (Test-Time Training) Layer】
│ │
│ └─→ "There was a TTT Layer paper before Hope"
│ │
│ └─→ "What's the relationship?"
│ │
│ └─→ "TTT: uses a linear model as inner state"
│ │
│ └─→ "Hope: a more general framework"
│ │
│ └─→ "Should I understand TTT first to get Hope?"
│ │
│ └─→ "Recommended"
│ │
│ └─→ 【TTT paper first】
│ │
│ └─→ Sun et al. 2024
│ │
│ └─→ (added to reading list)
│
└─→ 【Nested Learning Details】
│
└─→ "What exactly is learning inside learning?"
│
└─→ "Outer: overall sequence Loss"
│
└─→ "Inner: local prediction Loss"
│
└─→ "Inner updates the memory"
│
└─→ "Outer updates all parameters"
│
└─→ "So backward pass runs twice?"
│
└─→ "Yes, Gradient of Gradient"
│
└─→ 【Second-Order Optimization】
│
├─→ Hessian
│ │
│ └─→ O(n²) entries for n parameters
│ │
│ └─→ "Computationally infeasible"
│ │
│ └─→ "Need approximations"
│ │
│ └─→ Hessian-Free Methods
│ │
│ └─→ 【Numerical Optimization】
│ │
│ └─→ Another month here
│
└─→ Truncated Backprop
│
└─→ "Only unroll a few steps"
│
└─→ "Doesn't that introduce bias?"
│
└─→ "It does"
│
└─→ "But it's practical so we use it"
│
└─→ It works so whatever (runs away)
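The bilevel structure in the tree above (min_θ L(θ, φ*(θ)) with φ*(θ) = argmin_φ l(θ, φ)) is easiest to see on scalars. A deliberately tiny example, nothing like Hope's actual training loop: the inner problem here has a closed form, so the implicit-differentiation gradient can be checked by hand:

```python
# Inner problem: φ*(θ) = argmin_φ (φ - θ)^2, which has closed form φ* = θ.
# The Implicit Function Theorem gives dφ*/dθ = -l_φθ / l_φφ = -(-2)/2 = 1,
# so the outer gradient dL/dθ = dL/dφ* · dφ*/dθ needs no unrolling.

def inner_solve(theta, steps=20, lr=0.25):
    """Inner loop: gradient descent on l(θ, φ) = (φ - θ)^2."""
    phi = 0.0
    for _ in range(steps):
        phi -= lr * 2.0 * (phi - theta)
    return phi

def outer_grad(theta):
    """Outer gradient of L(θ) = (φ*(θ) - 3)^2 via implicit differentiation."""
    phi_star = inner_solve(theta)
    dL_dphi = 2.0 * (phi_star - 3.0)
    dphi_dtheta = 1.0              # from the Implicit Function Theorem above
    return dL_dphi * dphi_dtheta

theta = 0.0
for _ in range(100):
    theta -= 0.1 * outer_grad(theta)
print(round(theta, 3))             # converges to 3.0, the outer optimum
```

Real nested learning replaces the closed form with "differentiate through (a truncated) inner loop", which is exactly where the Hessians and the bias from truncation come in.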
5. So How Do These Three Relate?
Comparison
| | Mamba | Titans | Hope |
|------|------|------|------|
| Core Idea | SSM + Selection | Neural Memory | Nested Learning |
| Complexity | O(n) | O(n) + α | O(n) + αα |
| Long-term Memory | State Compression | Explicit Memory | Self-Modifying Memory |
| Theoretical Basis | Control Theory | Information Theory | Optimization Theory |
| Implementation Pain | CUDA hell | Medium | High |
| Required Reading | Physics+Engineering | CogSci+ML | Optimization+ML |
Decision Guide
"What should I study?"
│
├─→ If you want practical implementation
│ │
│ └─→ Mamba (most available implementations)
│ │
│ └─→ But need to understand CUDA kernels (hard)
│
├─→ If you want theoretical understanding
│ │
│ └─→ Titans (most intuitive concepts)
│ │
│ └─→ But need to understand test-time learning
│
└─→ If you want to follow cutting-edge research
│
└─→ Hope (2025 paper)
│
└─→ But you need to know all the previous stuff
│
└─→ 【Infinite loop】
6. Do I Really Need to Know All This?
The Honest Answer
"Do I need to know everything?"
│
├─→ To write papers: go deep in your area, concepts-only for the rest
├─→ To implement: focus on Mamba + use libraries
├─→ To compare: just the core ideas of each
└─→ To just use them:
from mamba_ssm import Mamba
model = Mamba(d_model=512)
# done
Realistic Time Estimates
| Goal | Time Required | What It Actually Means |
|------|----------|----------|
| API usage | 1 day | Nothing |
| Fine-tuning | 1-2 weeks | Editing config files |
| Architecture understanding | 1-3 months | Level 1 of the tree |
| Paper comprehension | 6 months - 1 year | Level 2 of the tree |
| Suggesting improvements | 1-2 years | Most of the tree |
| Proposing new architectures | 3+ years | Entire tree + intuition |
| Complete understanding | ??? | Even the authors don't know everything (seriously) |
Required Knowledge by Area
【Shared Prerequisites】
├── Linear Algebra: Matrices, Eigenvalues, SVD
├── Calculus: Partial Derivatives, Chain Rule
├── Probability: Basic Statistics
└── Deep Learning: Transformer Basics
【Mamba-Specific】
├── Control Theory: State Space, Stability
├── Signal Processing: Z-Transform, Discretization
├── Parallel Algorithms: Scan Operations
└── GPU Programming: CUDA (optional)
【Titans-Specific】
├── Information Theory: Entropy, Surprise
├── Cognitive Science: Memory Systems (concepts only)
├── Meta-Learning: MAML etc. (concepts only)
└── Hopfield Networks: Energy-Based Models
【Hope-Specific】
├── Optimization Theory: Bilevel, Implicit Diff
├── Meta-Learning: Advanced
├── Causal Inference: Basic Concepts
└── Second-Order Methods: Hessian Approximation
【What You Actually Use】
└── model.forward(x)
7. Escaping Dependency Hell: A Guide
Option A: Strategic Surrender
# Using Mamba
from mamba_ssm import Mamba
layer = Mamba(d_model=512, d_state=16, d_conv=4)
# Or just use Transformers (still great)
from transformers import AutoModel
model = AutoModel.from_pretrained("...")
# Honestly Attention is fast enough (reality)
Option B: Dig When Needed
1. Read only the Abstract + Figure 1 of the paper
2. Run the code
3. If something's weird, dig into that part
4. Repeat
5. Somehow you just know it now (magic)
Option C: The Purist Path (not recommended)
Linear Algebra (3 months)
↓
Calculus + Analysis (3 months)
↓
Probability (2 months)
↓
Signal Processing (2 months)
↓
Control Theory (3 months)
↓
Optimization Theory (3 months)
↓
Deep Learning Basics (3 months)
↓
Transformers (2 months)
↓
SSM Theory (2 months)
↓
Meta-Learning (2 months)
↓
Mamba/Titans/Hope (3 months)
↓
Total time: 2.5-3 years
By then a new architecture has dropped
(Attention Is All You Need → No It Isn't → Actually Maybe)
(rip)
Option D: Hybrid (recommended)
1. Get the big picture from this post (1 day)
2. Build intuition with Lilian Weng's blog (1 week)
3. Hands-on with code (1-2 weeks)
4. Deep dive only into what interests you (ongoing)
5. Math on a need-to-know basis
6. Focus on the Methods section of papers
Key insight: You don't need to know everything
You only need to know what you need
(This is the hardest part)
8. FAQ
"Doesn't performance drop without Attention?"
├─→ In general: similar or slightly lower
├─→ Long sequences: can actually be better
├─→ Conclusion: depends on the situation (irresponsible answer)
└─→ Also: Hybrid is the trend
"Can I skip the math?"
├─→ Just using it: yes
├─→ Fine-tuning: yes
├─→ Implementing: tough
├─→ Understanding papers: no
├─→ Research: absolutely not
└─→ But you can learn what you need as you go
"Where do I start?"
Recommended order:
1. Review Attention (essential)
2. "The Annotated S4" blog (S4/Mamba intuition)
3. Mamba paper + official implementation (hands-on)
4. Titans/Hope if you're interested (optional)
Videos:
- Yannic Kilcher's Mamba review
- AI Coffee Break explainer
Blogs:
- Lilian Weng (GOAT)
- Jay Alammar's visualizations
"How long will it take?"
"1 week" - Genius or someone who only knows Attention
"1 month" - Optimist
"3 months" - Realist
"6 months" - Deep diver
"A lifetime" - The truth
"I don't know" - Honest person
"Can I just give up?"
Yes.
Attention is still plenty good,
Flash Attention has solved a lot of the hardware issues,
100K tokens just works now.
However:
- For 1M+ tokens, you need alternatives
- If you're a researcher, you need to know this
- If your company tells you to... good luck
"What's the outlook for this field?"
├─→ It's hot (seriously)
├─→ Mamba2, Jamba, Griffin etc. keep coming
├─→ Hybrid is the trend
├─→ But who knows in 6 months
└─→ This is the 10th time I've heard "Transformers are dead"
└─→ They're not dead yet (the eternal rule)
9. Closing Thoughts
Your State After Reading This
Before: "Mamba replaces Attention apparently, let me look into it"
After: "Control theory... signal processing... statistical physics... cognitive science..."
"Why did I choose this path"
"Maybe I'll just use Attention"
Words of Comfort
- Even experts in this field don't know all of it
- The Mamba authors don't know Titans internals, and the Titans authors don't know Hope's details
- "I don't really get it but it works when I run experiments" is the industry standard
- Nobody reads the paper's Appendix, not even reviewers (seriously)
- You're not the only one struggling
But If You Still Want to Do This
Key tips:
1. Don't try to understand everything perfectly
2. Pick only what you need and go deep there
3. Code first, theory later
4. Don't be embarrassed to ask questions
5. Not understanding a paper on the first read is normal
In the end: Time + Persistence + Googling = You'll get there eventually
You got this!!!
(The author still can't derive Implicit Differentiation)
Appendix: Recommended Resources (in order)
Blogs (free, essential)
- "The Annotated S4" - srush/annotated-s4
- Lilian Weng - "State Space Models"
- "A Visual Guide to Mamba" - Maarten Grootendorst
Videos (free)
- Yannic Kilcher - Mamba paper review
- AI Coffee Break - SSM explainer
Papers (in order)
- S4 (Gu et al., 2021) - SSM fundamentals
- Mamba (Gu & Dao, 2023) - Selection + Scan
- Mamba-2 (Dao & Gu, 2024) - State Space Duality
- Titans (Behrouz et al., 2024) - Neural Memory
- Hope (Behrouz et al., 2025) - Nested Learning
Code
- state-spaces/mamba (official)
- huggingface/transformers (Mamba support)
- Each paper's official repo
Math Foundations (when you get stuck)
- 3Blue1Brown Linear Algebra (video)
- Steve Brunton Control Theory (video)
- For probability... you'll need a textbook (runs away)