Occam’s razor is an old principle: “If two explanations account for the same thing equally well, choose the simpler one.”
The problem is that in the age of deep learning, this has often remained little more than a nice slogan. People keep repeating phrases like “simple models,” “simple explanations,” and “simplicity helps generalization,” but it has been hard to pin down what any of that means numerically.
This paper changes that by making Occam one step more concrete. It brings simplicity down from the realm of “beautiful philosophy” into something that can actually be measured in bits.
1) The most intuitive way to write Occam mathematically: “compression”
The classical foundation this paper builds on is MDL, or Minimum Description Length. The core idea is simple.
To describe data, you need two things. First, the number of bits required to write down the model itself—the “rulebook.” Second, the number of bits required to encode the data additionally using that model.
A complex model gets penalized because the first term is large, and it only survives if it reduces the second term enough to compensate. In plain terms, Occam is implemented here as: store repeated patterns in the model, and do not redundantly write them again on the data side.
Up to this point, this is the familiar story: Occam = MDL.
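As a toy illustration of this trade-off (a sketch with made-up model-bit costs, not the paper's construction):

```python
import math
from collections import Counter

def description_length(data, probs, model_bits):
    """Two-part MDL code: bits to write down the model ("rulebook")
    plus bits to encode the data under it (-log2 p per symbol)."""
    data_bits = sum(-math.log2(probs[symbol]) for symbol in data)
    return model_bits + data_bits

# Highly repetitive data: the pattern "ab" repeated 50 times.
data = "ab" * 50

# Model A: uniform over 26 letters -- a tiny rulebook (say 8 bits)
# that stores no patterns, so every symbol costs log2(26) bits.
uniform = {c: 1 / 26 for c in "abcdefghijklmnopqrstuvwxyz"}

# Model B: empirical frequencies -- a bigger rulebook (say 64 bits,
# an invented cost for storing the frequencies), but each observed
# symbol now costs only 1 bit.
counts = Counter(data)
fitted = {c: n / len(data) for c, n in counts.items()}

mdl_a = description_length(data, uniform, model_bits=8)
mdl_b = description_length(data, fitted, model_bits=64)

# The richer rulebook wins because it pays for itself on the data side.
print(f"uniform model: {mdl_a:.1f} bits, fitted model: {mdl_b:.1f} bits")
```

The fitted model's extra rulebook bits are more than repaid by the shorter data encoding, which is exactly the "store the pattern once, don't rewrite it on the data side" reading of Occam above.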
2) What is new in this paper: it flips the question — not model selection, but data selection
The paper’s key shift is captured in a single sentence:
“If MDL is a criterion for choosing models given fixed data, epiplexity can be seen as a criterion for choosing data under a fixed compute budget.”
The traditional approach asks: “How can we build a more compact model—through regularization, structure, or architecture?”
This paper opens up a different direction: “Under the same compute budget, what data should we train on so that the model can build a more efficient rulebook?”
As deep learning scales up, this shift becomes even more important. Trimming the model a little matters less than understanding which data actually leads training toward absorbing structure.
3) Epiplexity measures the amount of “extractable structure” in bits
The paper defines a “model” not merely as a function, but as a program that can both sample and evaluate probabilities within a time limit $T$. In other words, it only allows computable explanations.
On top of that, it selects the optimal program $P^*$ by minimizing:
(program length) + (average code length)
From there:
- epiplexity $S_T$ is the length of the optimal program — how many bits the rulebook itself takes
- time-bounded entropy $H_T$ is the remaining uncertainty — the part the rulebook still cannot compress away
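In symbols (my own shorthand for the verbal definition above; the paper's exact formulation may differ in its details):

$$P^* = \operatorname*{arg\,min}_{P \in \mathcal{P}_T} \Big( |P| + \mathbb{E}_{x}\big[{-\log_2 P(x)}\big] \Big), \qquad S_T = |P^*|, \qquad H_T = \mathbb{E}_{x}\big[{-\log_2 P^*(x)}\big],$$

where $\mathcal{P}_T$ is the set of programs that can sample and evaluate probabilities within the time bound $T$, so the total description cost splits as $S_T + H_T$.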
The core message here is simple:
Randomness and structure are separated according to the compute budget. Even with the same data, a learner with stronger computational power will see more of it as structure.
So epiplexity is not asking, “How intrinsically complex is this data?” It is asking, “How much structure can we actually extract from it with the learning machinery we have?”
4) The practical trick that makes this usable in deep learning: not “weight size,” but “compressing the training process”
At this point a practical issue appears. If we want to measure “rulebook bits,” the obvious first thought is the size of the weight file. But the paper explicitly avoids that route.
Instead, it treats the training process itself as an encoding process, estimating model information through methods such as prequential coding.
Intuitively, it works like this. Early in training, the loss is high; as training proceeds, it falls. The area between the learning curve and its final loss can then be read as an estimate of the amount of structure the model has absorbed.
More concretely, at each step the current model is used to encode the next token (required bits $= \log_2\!\big(1/P_i(z_i)\big)$), and only then is the model updated on that token. A single code stream thereby accounts for both the model and the data.
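A minimal sketch of prequential coding, with a smoothed count model standing in for a neural net (the add-one smoothing and toy stream are my own choices):

```python
import math

def prequential_bits(stream, vocab_size, alpha=1.0):
    """Prequential (predict-then-update) coding: encode each symbol
    with the *current* model, then update the model on that symbol."""
    counts = {v: alpha for v in range(vocab_size)}
    total = alpha * vocab_size
    bits = []
    for z in stream:
        p = counts[z] / total        # encode z under the current model
        bits.append(-math.log2(p))
        counts[z] += 1               # then learn from z
        total += 1
    return bits

stream = [0, 1] * 500                # a highly structured token stream
bits = prequential_bits(stream, vocab_size=4)

total_bits = sum(bits)
final_loss = bits[-1]                # per-token cost once "trained"
# Area of the learning curve above its final level: a rough proxy for
# the structure (rulebook bits) the model absorbed during training.
model_info = total_bits - final_loss * len(stream)
print(f"prequential total: {total_bits:.1f} bits; "
      f"absorbed structure: ~{model_info:.1f} bits")
```

Early tokens are expensive (the model knows nothing), later tokens are cheap; the gap between the two is the model-information estimate.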
The paper also expresses the compute budget in FLOPs, and translates this into scaling-law language familiar to deep learning: with model size $N$ and token count $D$, training cost is approximated as roughly $6ND$, so that “fixed budget” becomes a practical notion rather than just a theoretical one.
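Concretely (the numbers here are illustrative, not from the paper):

```python
def token_budget(flops: float, n_params: float) -> float:
    """Tokens trainable under a FLOP budget, using the common
    C = 6 * N * D approximation for dense-transformer training cost."""
    return flops / (6 * n_params)

# A fixed 1e21-FLOP budget for a 1B-parameter model:
d = token_budget(1e21, 1e9)
print(f"~{d:.2e} tokens under this budget")
```

Once the budget is pinned down this way, "which data should fill those $D$ tokens?" becomes a well-posed question.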
5) Why this could be a real shift: because data is now the main bottleneck in deep learning
In practice, deep learning is already moving in this direction. More often than not, performance and generalization are determined less by making the model slightly prettier and more by how the data is chosen—through filtering, curriculum, synthetic data, or sampling.
And yet, discussions of data have mostly stayed qualitative: “this seems high quality,” “this looks noisy,” and so on.
The epiplexity perspective gives us a quantitative language for this.
Under the same compute budget, some datasets lead the model to build a larger and richer “rulebook” (more structure absorbed, higher $S_T$), while other datasets mostly increase the leftover randomness instead (pushing more mass into $H_T$).
This framing is powerful because the paper explicitly positions it as a way to think about the real hard problems we face in deep learning today—things like data valuation and OOD generalization.
6) It becomes even more “deep-learning-like” in the conditional case: the input is given, and all we need to predict is the label or next token
In problems like image classification, the input $X$ is simply given, and what we actually care about is $Y \mid X$. For such cases, the paper separately defines conditional epiplexity.
That is, instead of measuring “the complexity of generating the image,” it measures only the rulebook bits needed to predict the label.
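A rough way to write this down (again my own notation, hedging on the paper's exact formulation): restrict to programs that, given $x$, evaluate $P(y \mid x)$ within time $T$, and minimize

$$P^* = \operatorname*{arg\,min}_{P \in \mathcal{P}_T} \Big( |P| + \mathbb{E}_{(x,y)}\big[{-\log_2 P(y \mid x)}\big] \Big), \qquad S_T(Y \mid X) = |P^*|.$$

Only label-prediction bits are charged to the rulebook; the bits needed to generate $x$ itself never enter the objective.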
The paper also gives an important warning: this conditional version generally does not reduce to a simple entropy difference. It does not decompose cleanly by a chain rule.
If anything, that is a strength. It means the paper is willing to face head-on the fact that real learning in deep learning does not decompose as neatly as textbook theory often suggests.
Conclusion: Occam’s razor is no longer about “model aesthetics,” but about “data engineering”
If I had to summarize the paper in one sentence, it would be this:
It flips Occam’s razor from a principle for shaving down models into a principle for choosing data. And it tries to quantify that in terms of the amount of structure a learner can actually extract—the number of bits in the rulebook.
As deep learning continues to scale, the real competitive frontier shifts away from “making the architecture 1% more elegant” and toward data and training designs that let the model absorb more structure under the same budget.
Epiplexity may become one of the key languages for describing that frontier.