Few-step distillation usually means holding a teacher and a student model in memory at once. We distilled a 28-step diffusion teacher into a 4-step LoRA on one 16 GB card by running three roles off a single frozen backbone.
How I got an MoE-style LoRA to actually specialize on a T2I model — through a cold-start deadlock, a failed jitter attempt, orthogonalized experts via SVD slicing, expert warmup, and σ-conditional routing borrowed from T-LoRA.
Sharing a persona prompt, memory layout, and the 'excuse tool' — patterns I kept refining from Opus 4.1 through 4.7.
DiT training has three sources of shape dynamism that cause torch.compile to recompile every step. We eliminated all three and got stable compiled training on a consumer GPU.
Flash Attention 4 doesn't support consumer Blackwell GPUs yet. We fixed three critical bugs and got it running on the RTX 5060 Ti.
A C++ critique video as a lens into vibe coding and the myth of total code comprehension.
Two precision-oriented features for the LoRA training pipeline: lora_fp32_accumulation and attn_softmax_scale.
A personal opinion on the paper 'Epiplexity'.
How to build native Windows desktop applications that integrate with the Claude Code CLI using a pure Rust backend.
A personal opinion on AI slop.
An exploration of the challenges in long context language model research.
A personal pick of recently published papers that show strong potential.
A personal take on the role of persona in the agentic LLM paradigm.