Introduction

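I wanted to review the research context around trendy keywords, one keyword at a time. As the first installment, this post looks at the counter-intuitive relationship between memorization and generalization in deep learning.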

Initial Motivations

Understanding deep learning requires rethinking generalization (ICLR 2017)

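Key observation: A sufficiently large neural network reaches zero training error even on a dataset whose labels are assigned at random. Because the same network can fit arbitrary labels, classical statistical capacity measures (VC dimension, Rademacher complexity) are too loose to explain the generalization performance of deep learning.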

Implication: A model generalizes to unseen data even though it is able to memorize the training set perfectly, which suggests that generalization comes from implicit inductive biases (optimizer, architecture, data distribution). This paper became the starting point for much of the later research on generalization.
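The random-label observation can be reproduced in miniature. The sketch below is only an illustration under strong assumptions: it uses an overparameterized linear model (more parameters than samples) as a stand-in for a neural network, and the function name `fit_random_labels` and all hyperparameters are hypothetical choices, not the paper's setup.

```python
import random


def fit_random_labels(n=20, d=200, steps=200, lr=0.5, seed=0):
    """Fit a linear model with d >> n parameters to random +/-1 labels.

    Returns the training accuracy after plain gradient descent.
    """
    rng = random.Random(seed)
    # Random inputs, scaled so that each row has roughly unit norm.
    X = [[rng.gauss(0.0, 1.0) / d ** 0.5 for _ in range(d)] for _ in range(n)]
    # Labels are coin flips, so there is no pattern to learn.
    y = [rng.choice([-1.0, 1.0]) for _ in range(n)]

    w = [0.0] * d
    for _ in range(steps):
        preds = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
        errs = [p - t for p, t in zip(preds, y)]
        # Gradient step on 0.5 * (sum of squared errors) with respect to w.
        for j in range(d):
            w[j] -= lr * sum(e * row[j] for e, row in zip(errs, X))

    preds = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    correct = sum((p > 0) == (t > 0) for p, t in zip(preds, y))
    return correct / n


train_acc = fit_random_labels()
print("training accuracy on random labels:", train_acc)
```

Since the labels carry no signal, a perfect training fit here is pure memorization; accuracy on fresh random labels would stay near chance.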

Personal opinion:

A Closer Look at Memorization in Deep Networks (ICML 2017)

Key observation: The same model shows different training dynamics on random labels and on real labels. With real labels it learns generalizable features from early in training, whereas with random labels it goes straight into plain memorization.

Implication: Models tend to 'find patterns first and memorize outliers afterwards'. Generalization and memorization can therefore occur at the same time, mixed in proportions that depend on the properties of the data.

Personal opinion:

Characteristics of Memorization

An Empirical Study of Example Forgetting during Deep Neural Network Learning (ICLR 2019)

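Key observation: During training, some examples are never misclassified again once they have been classified correctly (unforgettable), while others go through forgetting events, moving from correct back to incorrect, sometimes several times. Forgettable examples tend to be hard samples, samples with label noise, or outliers of the distribution.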

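Implication: Which parts of the data get 'memorized' is not random. A simple statistic, the forgetting frequency, is enough to identify outlier/atypical examples in a dataset, and a large fraction of the unforgettable examples can be removed from the training set while generalization performance is almost preserved.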
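The forgetting statistic is simple enough to sketch directly. The code below is a minimal sketch that assumes a per-example record of whether the model classified the example correctly each time it was checked; the names `count_forgetting_events`, `is_unforgettable`, and the toy `histories` are hypothetical.

```python
def count_forgetting_events(history):
    """Count transitions from correct (True) to incorrect (False)."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev and not cur)


def is_unforgettable(history):
    """True if the example was learned at some point and never forgotten."""
    return any(history) and count_forgetting_events(history) == 0


# Hypothetical correctness records for three training examples.
histories = {
    "easy": [False, True, True, True, True],
    "noisy": [False, True, False, True, False],
    "never_learned": [False, False, False, False, False],
}

forgetting = {name: count_forgetting_events(h) for name, h in histories.items()}
unforgettable = [name for name, h in histories.items() if is_unforgettable(h)]
print(forgetting)
print(unforgettable)
```

Examples that are never learned have zero forgetting events but are not unforgettable, so this sketch keeps them out of the unforgettable set rather than ranking them by the raw count.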

Personal opinion:

Uniform convergence may be unable to explain generalization in deep learning (NeurIPS 2019)

Key observation: Existing PAC-Bayes and margin-based generalization bounds are all built on uniform convergence. Empirically, however, there are cases where the generalization gap shrinks as the training set grows while the uniform-convergence-based bound grows instead and becomes vacuous (a value above 1, which carries no information).

Implication: Explaining generalization in deep learning requires a new analysis framework that goes beyond uniform convergence itself. Bounds based only on the capacity of the hypothesis class or on sample complexity fall fundamentally short.
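For reference, the two quantities being contrasted can be written in the standard textbook form. The notation is an assumption for illustration, not taken from the paper: $\mathcal{D}$ is the data distribution, $S$ a training set of size $m$, $h_S$ the learned hypothesis, $\mathcal{H}$ the hypothesis class, and $L_{\mathcal{D}}$ and $\hat{L}_S$ the test and training errors.

```latex
% Generalization gap of the single hypothesis that training actually returns:
\Delta(h_S) = L_{\mathcal{D}}(h_S) - \hat{L}_S(h_S)

% A uniform convergence bound controls the worst case over the whole class:
\Pr_{S \sim \mathcal{D}^m}\left[ \sup_{h \in \mathcal{H}}
  \left| L_{\mathcal{D}}(h) - \hat{L}_S(h) \right|
  \le \epsilon_{\mathrm{unif}}(m, \delta) \right] \ge 1 - \delta
```

Since $\Delta(h_S)$ is at most the supremum, a uniform bound is always valid for it, but the supremum can stay large even when $\Delta(h_S)$ itself is small.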

Personal opinion:

Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models (NeurIPS 2022)

Key observation: LLMs memorize training data exactly while perplexity on unseen data improves at the same time. The traditional equation 'memorization → overfitting' therefore does not hold for LLMs. Memorization capacity also scales sub-linearly with model size and follows dynamics that are almost independent of training loss.

Implication: Memorization in LLMs should be seen as a separate capacity dimension rather than a 'by-product of overfitting'. At the same time, which samples get memorized is determined by training order, batch composition, and model size.
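To make 'memorizing exactly' concrete, one way to operationalize it is the fraction of training contexts for which the model's top-scoring next token equals the true next token. The sketch below assumes that definition for illustration and may differ from the paper's exact metric; `exact_memorization` and the toy scores are hypothetical.

```python
def exact_memorization(logits_per_context, target_tokens):
    """Fraction of contexts whose highest-scoring token equals the target."""
    hits = 0
    for logits, target in zip(logits_per_context, target_tokens):
        predicted = max(range(len(logits)), key=lambda t: logits[t])
        hits += predicted == target
    return hits / len(target_tokens)


# Hypothetical scores over a 3-token vocabulary for four training contexts.
logits = [
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
targets = [1, 0, 2, 1]
score = exact_memorization(logits, targets)
print(score)
```

Tracking this number on training contexts alongside perplexity on held-out data is one way to see both quantities improve together.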

Personal opinion:

Analyses

What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation (NeurIPS 2020)

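Key claim: When the data distribution is long-tailed, a model must memorize the atypical examples in the long tail in order to maintain generalization performance. The argument is that memorization is not a side effect of generalization but a necessary condition for generalizing to the long tail.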

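Method: The paper defines Feldman's 'memorization score' (how much the prediction on a given example changes depending on whether that example is included in the training set) and an 'influence score' (how much a given training example affects the prediction on a given test example), and empirically analyzes the correlation between the two scores.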
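The two scores can be illustrated with a toy leave-one-out computation. This is a sketch under simplifying assumptions: a deterministic 1-nearest-neighbour learner on 1-D points stands in for a neural network, so the scores reduce to differences of 0/1 correctness rather than differences of probabilities over training randomness. All function names and data are hypothetical.

```python
def nearest_label(train, x):
    """Predict with 1-nearest-neighbour on (point, label) pairs."""
    return min(train, key=lambda pl: abs(pl[0] - x))[1]


def memorization_score(train, i):
    """Change in correctness on example i when it is removed from training."""
    x_i, y_i = train[i]
    with_i = nearest_label(train, x_i) == y_i
    without_i = nearest_label(train[:i] + train[i + 1:], x_i) == y_i
    return int(with_i) - int(without_i)


def influence_score(train, i, test_example):
    """Change in correctness on a test example when example i is removed."""
    x, y = test_example
    with_i = nearest_label(train, x) == y
    without_i = nearest_label(train[:i] + train[i + 1:], x) == y
    return int(with_i) - int(without_i)


# Three typical "a" points and one atypical "b" point from the tail.
train = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (5.0, "b")]
tail_test = (4.8, "b")

mem_scores = [memorization_score(train, i) for i in range(len(train))]
infl_scores = [influence_score(train, i, tail_test) for i in range(len(train))]
print(mem_scores)
print(infl_scores)
```

In this toy setting the atypical training point is the only one with a high memorization score, and it is also the only one that influences the tail test point. Exact leave-one-out retraining is too expensive for real networks, so practical estimates rely on models trained on random subsets of the data.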

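Implication: It overturns the earlier view of memorization vs. generalization as a binary trade-off and formalizes the idea that selective memorization helps generalization. It is often cited as one of the theoretical foundations for later work on LLM unlearning, privacy, and data attribution.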

Personal opinion:

Are polynomial features the root of all evil? (2024)

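Key claim: This blog post argues that polynomial features are usually treated as 'synonymous with overfitting', but that the problem lies in the choice of basis, not in polynomials themselves. It shows experimentally that representing the polynomial in the Bernstein basis (or the Legendre basis, etc.) instead of the monomial basis generalizes much better at the same degree.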

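Implication: The common belief that 'polynomials are dangerous' turns out to be closer to a numerical conditioning problem and a problem of how the norm is defined. Seen as evidence that generalization depends more on how a function is parameterized than on the function class itself, this connects to the discussion of implicit bias in neural networks.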
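The difference between the two parameterizations can be seen in a few lines. The sketch below evaluates both bases on [0, 1] and checks a well-known property of the Bernstein basis: its values are non-negative and sum to 1, so the polynomial's value is a weighted average of its coefficients. The helper names and the example coefficients are hypothetical, and this is not the blog post's own code.

```python
from math import comb


def bernstein_basis(n, x):
    """Values of the n + 1 Bernstein basis polynomials of degree n at x in [0, 1]."""
    return [comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(n + 1)]


def monomial_basis(n, x):
    """Values of 1, x, x^2, ..., x^n at x."""
    return [x ** k for k in range(n + 1)]


def evaluate(coeffs, basis_values):
    """Evaluate a polynomial as a linear combination of basis values."""
    return sum(c * b for c, b in zip(coeffs, basis_values))


coeffs = [0.0, 1.0, -1.0, 2.0]  # arbitrary degree-3 coefficients
xs = [i / 10 for i in range(11)]

# Bernstein basis values form a partition of unity at every x in [0, 1].
partition_sums = [sum(bernstein_basis(3, x)) for x in xs]

# So the Bernstein-form polynomial never leaves the range of its coefficients.
bernstein_values = [evaluate(coeffs, bernstein_basis(3, x)) for x in xs]
print(min(bernstein_values), max(bernstein_values))
```

With the monomial basis there is no such direct link between coefficient size and function values, which is one reason the same regularization on coefficients behaves differently in the two parameterizations.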

Personal opinion:

Generalization via Memorization
