Elements of Statistical Learning. 8/10

This is the first in a series of book recommendations for new quants. We will only focus on books that quants commonly come across in their journey.

Elements of Statistical Learning (ESL) is the classic recommendation for new quants, for good reason. However it’s a massive tome and many sections aren’t that useful – reflecting older techniques, the authors’ personal research agendas, or things that aren’t applicable to the trading domain. Chapters 1, 2, 3, and 7 are all great, constituting one of the best foundations for building empirical models from data. The framing of the model-building problem in ESL fits the trading domain better than other common entry points into statistical learning like Bayesian learning, NLP, bandits, convergence proofs, reinforcement learning, or statistical mechanics – which all emphasize assumptions that are inapplicable to trading. Trading is essentially empirical, extracting transient patterns from recent data, neither IID nor even stationary. Trading is a social system that just so happens to be fittable by certain classes of models. One major hurdle for new quants is learning the priors that limit the range of models that are applicable to this domain, and forgetting some of the theoretically appealing assumptions so prevalent in academia.

One major problem with ESL is that it mostly misses the practicalities of building models. It doesn’t touch on developing fast or robust software, processes that minimize divergence between training and inference, modular data cleaning and featurization pipelines, etc which are much more than half the battle when developing new automated quantitative strategies. Google’s Rules of Machine Learning is a good complementary read to fill in most of these gaps: https://developers.google.com/machine-learning/guides/rules-of-ml.

At the bottom of this review is the index of ESL’s 12th edition. This book is available for free at https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf. The highly recommended sections are bolded in blue, sections that could be interesting are in black, and sections that can be safely deprioritized are italicized in red. Blue sections will be applicable on a day-to-day basis, black sections may provide useful tools for thinking about data, and red sections are likely a less valuable use of time than simply studying for class, experimenting with kaggle problems, or prototyping an automated trading strategy yourself.

Recommended unconditionally (Blue): Chapter 1, 2, 3, and 7.

Worthwhile with caveats (Black):

  • Chapter 4: Most trading strategies don’t use classification, because the utility function/decision logic is based on buying and selling along with risk, messaging, and other constraints, which depend on PNL, and don’t fit in the classification framework that usually assumes samples are independent.
  • Chapter 8 & 16: Kaggle’s “no free hunch” blogs have better battle-tested ensembling ideas.
  • Chapters 9, 10, & 15: These are the authors’ main research area so they go into too much detail- in practice people just call xgboost or lightgbm.
  • Chapter 11: ESL is not the best intro to neural nets.
  • Chapters 13 & 14: Unsupervised methods can sometimes be used to engineer features, but are generally minor components.

Safe to skip (Red): Chapter 5, 6, 12, 17, and 18.


1 Introduction 1
2 Overview of Supervised Learning 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Variable Types and Terminology . . . . . . . . . . . . . . 9
2.3 Two Simple Approaches to Prediction:
Least Squares and Nearest Neighbors . . . . . . . . . . . 11
2.3.1 Linear Models and Least Squares . . . . . . . . 11
2.3.2 Nearest-Neighbor Methods . . . . . . . . . . . . 14
2.3.3 From Least Squares to Nearest Neighbors . . . . 16
2.4 Statistical Decision Theory . . . . . . . . . . . . . . . . . 18
2.5 Local Methods in High Dimensions . . . . . . . . . . . . . 22
2.6 Statistical Models, Supervised Learning
and Function Approximation . . . . . . . . . . . . . . . . 28
2.6.1 A Statistical Model
for the Joint Distribution Pr(X, Y ) . . . . . . . 28
2.6.2 Supervised Learning . . . . . . . . . . . . . . . . 29
2.6.3 Function Approximation . . . . . . . . . . . . . 29
2.7 Structured Regression Models . . . . . . . . . . . . . . . 32
2.7.1 Difficulty of the Problem . . . . . . . . . . . . . 32
xiv Contents
2.8 Classes of Restricted Estimators . . . . . . . . . . . . . . 33
2.8.1 Roughness Penalty and Bayesian Methods . . . 34
2.8.2 Kernel Methods and Local Regression . . . . . . 34
2.8.3 Basis Functions and Dictionary Methods . . . . 35
2.9 Model Selection and the Bias–Variance Tradeoff . . . . . 37
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 39
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3 Linear Methods for Regression 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Linear Regression Models and Least Squares . . . . . . . 44
3.2.1 Example: Prostate Cancer . . . . . . . . . . . . 49
3.2.2 The Gauss–Markov Theorem . . . . . . . . . . . 51
3.2.3 Multiple Regression
from Simple Univariate Regression . . . . . . . . 52
3.2.4 Multiple Outputs . . . . . . . . . . . . . . . . . 56
3.3 Subset Selection . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Best-Subset Selection . . . . . . . . . . . . . . . 57
3.3.2 Forward- and Backward-Stepwise Selection . . . 58
3.3.3 Forward-Stagewise Regression . . . . . . . . . . 60
3.3.4 Prostate Cancer Data Example (Continued) . . 61
3.4 Shrinkage Methods . . . . . . . . . . . . . . . . . . . . . . 61
3.4.1 Ridge Regression . . . . . . . . . . . . . . . . . 61
3.4.2 The Lasso . . . . . . . . . . . . . . . . . . . . . 68
3.4.3 Discussion: Subset Selection, Ridge Regression
and the Lasso . . . . . . . . . . . . . . . . . . . 69
3.4.4 Least Angle Regression . . . . . . . . . . . . . . 73
3.5 Methods Using Derived Input Directions . . . . . . . . . 79
3.5.1 Principal Components Regression . . . . . . . . 79
3.5.2 Partial Least Squares . . . . . . . . . . . . . . . 80
3.6 Discussion: A Comparison of the Selection
and Shrinkage Methods . . . . . . . . . . . . . . . . . . . 82
3.7 Multiple Outcome Shrinkage and Selection . . . . . . . . 84
3.8 More on the Lasso and Related Path Algorithms . . . . . 86
3.8.1 Incremental Forward Stagewise Regression . . . 86
3.8.2 Piecewise-Linear Path Algorithms . . . . . . . . 89
3.8.3 The Dantzig Selector . . . . . . . . . . . . . . . 89
3.8.4 The Grouped Lasso . . . . . . . . . . . . . . . . 90
3.8.5 Further Properties of the Lasso . . . . . . . . . . 91
3.8.6 Pathwise Coordinate Optimization . . . . . . . . 92
3.9 Computational Considerations . . . . . . . . . . . . . . . 93
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 94
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Contents xv

4 Linear Methods for Classification 101
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 Linear Regression of an Indicator Matrix . . . . . . . . . 103
4.3 Linear Discriminant Analysis . . . . . . . . . . . . . . . . 106
4.3.1 Regularized Discriminant Analysis . . . . . . . . 112
4.3.2 Computations for LDA . . . . . . . . . . . . . . 113
4.3.3 Reduced-Rank Linear Discriminant Analysis . . 113
4.4 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 119
4.4.1 Fitting Logistic Regression Models . . . . . . . . 120
4.4.2 Example: South African Heart Disease . . . . . 122
4.4.3 Quadratic Approximations and Inference . . . . 124
4.4.4 L1 Regularized Logistic Regression . . . . . . . . 125
4.4.5 Logistic Regression or LDA? . . . . . . . . . . . 127
4.5 Separating Hyperplanes . . . . . . . . . . . . . . . . . . . 129
4.5.1 Rosenblatt’s Perceptron Learning Algorithm . . 130
4.5.2 Optimal Separating Hyperplanes . . . . . . . . . 132
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 135
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5 Basis Expansions and Regularization 139
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.2 Piecewise Polynomials and Splines . . . . . . . . . . . . . 141
5.2.1 Natural Cubic Splines . . . . . . . . . . . . . . . 144
5.2.2 Example: South African Heart Disease (Continued)146
5.2.3 Example: Phoneme Recognition . . . . . . . . . 148
5.3 Filtering and Feature Extraction . . . . . . . . . . . . . . 150
5.4 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . 151
5.4.1 Degrees of Freedom and Smoother Matrices . . . 153
5.5 Automatic Selection of the Smoothing Parameters . . . . 156
5.5.1 Fixing the Degrees of Freedom . . . . . . . . . . 158
5.5.2 The Bias–Variance Tradeoff . . . . . . . . . . . . 158
5.6 Nonparametric Logistic Regression . . . . . . . . . . . . . 161
5.7 Multidimensional Splines . . . . . . . . . . . . . . . . . . 162
5.8 Regularization and Reproducing Kernel Hilbert Spaces . 167
5.8.1 Spaces of Functions Generated by Kernels . . . 168
5.8.2 Examples of RKHS . . . . . . . . . . . . . . . . 170
5.9 Wavelet Smoothing . . . . . . . . . . . . . . . . . . . . . 174
5.9.1 Wavelet Bases and the Wavelet Transform . . . 176
5.9.2 Adaptive Wavelet Filtering . . . . . . . . . . . . 179
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 181
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Appendix: Computational Considerations for Splines . . . . . . 186
Appendix: B-splines . . . . . . . . . . . . . . . . . . . . . 186
Appendix: Computations for Smoothing Splines . . . . . 189
xvi Contents

6 Kernel Smoothing Methods 191
6.1 One-Dimensional Kernel Smoothers . . . . . . . . . . . . 192
6.1.1 Local Linear Regression . . . . . . . . . . . . . . 194
6.1.2 Local Polynomial Regression . . . . . . . . . . . 197
6.2 Selecting the Width of the Kernel . . . . . . . . . . . . . 198
6.3 Local Regression in IRp
. . . . . . . . . . . . . . . . . . . 200
6.4 Structured Local Regression Models in IRp
. . . . . . . . 201
6.4.1 Structured Kernels . . . . . . . . . . . . . . . . . 203
6.4.2 Structured Regression Functions . . . . . . . . . 203
6.5 Local Likelihood and Other Models . . . . . . . . . . . . 205
6.6 Kernel Density Estimation and Classification . . . . . . . 208
6.6.1 Kernel Density Estimation . . . . . . . . . . . . 208
6.6.2 Kernel Density Classification . . . . . . . . . . . 210
6.6.3 The Naive Bayes Classifier . . . . . . . . . . . . 210
6.7 Radial Basis Functions and Kernels . . . . . . . . . . . . 212
6.8 Mixture Models for Density Estimation and Classification 214
6.9 Computational Considerations . . . . . . . . . . . . . . . 216
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 216
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

7 Model Assessment and Selection 219
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 219
7.2 Bias, Variance and Model Complexity . . . . . . . . . . . 219
7.3 The Bias–Variance Decomposition . . . . . . . . . . . . . 223
7.3.1 Example: Bias–Variance Tradeoff . . . . . . . . 226
7.4 Optimism of the Training Error Rate . . . . . . . . . . . 228
7.5 Estimates of In-Sample Prediction Error . . . . . . . . . . 230
7.6 The Effective Number of Parameters . . . . . . . . . . . . 232
7.7 The Bayesian Approach and BIC . . . . . . . . . . . . . . 233
7.8 Minimum Description Length . . . . . . . . . . . . . . . . 235
7.9 Vapnik–Chervonenkis Dimension . . . . . . . . . . . . . . 237
7.9.1 Example (Continued) . . . . . . . . . . . . . . . 239
7.10 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . 241
7.10.1 K-Fold Cross-Validation . . . . . . . . . . . . . 241
7.10.2 The Wrong and Right Way
to Do Cross-validation . . . . . . . . . . . . . . . 245
7.10.3 Does Cross-Validation Really Work? . . . . . . . 247
7.11 Bootstrap Methods . . . . . . . . . . . . . . . . . . . . . 249
7.11.1 Example (Continued) . . . . . . . . . . . . . . . 252
7.12 Conditional or Expected Test Error? . . . . . . . . . . . . 254
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 257
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

8 Model Inference and Averaging 261
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 261
Contents xvii
8.2 The Bootstrap and Maximum Likelihood Methods . . . . 261
8.2.1 A Smoothing Example . . . . . . . . . . . . . . 261
8.2.2 Maximum Likelihood Inference . . . . . . . . . . 265
8.2.3 Bootstrap versus Maximum Likelihood . . . . . 267
8.3 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . 267
8.4 Relationship Between the Bootstrap
and Bayesian Inference . . . . . . . . . . . . . . . . . . . 271
8.5 The EM Algorithm . . . . . . . . . . . . . . . . . . . . . 272
8.5.1 Two-Component Mixture Model . . . . . . . . . 272
8.5.2 The EM Algorithm in General . . . . . . . . . . 276
8.5.3 EM as a Maximization–Maximization Procedure 277
8.6 MCMC for Sampling from the Posterior . . . . . . . . . . 279
8.7 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
8.7.1 Example: Trees with Simulated Data . . . . . . 283
8.8 Model Averaging and Stacking . . . . . . . . . . . . . . . 288
8.9 Stochastic Search: Bumping . . . . . . . . . . . . . . . . . 290
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 292
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
9 Additive Models, Trees, and Related Methods 295
9.1 Generalized Additive Models . . . . . . . . . . . . . . . . 295
9.1.1 Fitting Additive Models . . . . . . . . . . . . . . 297
9.1.2 Example: Additive Logistic Regression . . . . . 299
9.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . 304
9.2 Tree-Based Methods . . . . . . . . . . . . . . . . . . . . . 305
9.2.1 Background . . . . . . . . . . . . . . . . . . . . 305
9.2.2 Regression Trees . . . . . . . . . . . . . . . . . . 307
9.2.3 Classification Trees . . . . . . . . . . . . . . . . 308
9.2.4 Other Issues . . . . . . . . . . . . . . . . . . . . 310
9.2.5 Spam Example (Continued) . . . . . . . . . . . 313
9.3 PRIM: Bump Hunting . . . . . . . . . . . . . . . . . . . . 317
9.3.1 Spam Example (Continued) . . . . . . . . . . . 320
9.4 MARS: Multivariate Adaptive Regression Splines . . . . . 321
9.4.1 Spam Example (Continued) . . . . . . . . . . . 326
9.4.2 Example (Simulated Data) . . . . . . . . . . . . 327
9.4.3 Other Issues . . . . . . . . . . . . . . . . . . . . 328
9.5 Hierarchical Mixtures of Experts . . . . . . . . . . . . . . 329
9.6 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . 332
9.7 Computational Considerations . . . . . . . . . . . . . . . 334
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 334
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
10 Boosting and Additive Trees 337
10.1 Boosting Methods . . . . . . . . . . . . . . . . . . . . . . 337
10.1.1 Outline of This Chapter . . . . . . . . . . . . . . 340
xviii Contents
10.2 Boosting Fits an Additive Model . . . . . . . . . . . . . . 341
10.3 Forward Stagewise Additive Modeling . . . . . . . . . . . 342
10.4 Exponential Loss and AdaBoost . . . . . . . . . . . . . . 343
10.5 Why Exponential Loss? . . . . . . . . . . . . . . . . . . . 345
10.6 Loss Functions and Robustness . . . . . . . . . . . . . . . 346
10.7 “Off-the-Shelf” Procedures for Data Mining . . . . . . . . 350
10.8 Example: Spam Data . . . . . . . . . . . . . . . . . . . . 352
10.9 Boosting Trees . . . . . . . . . . . . . . . . . . . . . . . . 353
10.10 Numerical Optimization via Gradient Boosting . . . . . . 358
10.10.1 Steepest Descent . . . . . . . . . . . . . . . . . . 358
10.10.2 Gradient Boosting . . . . . . . . . . . . . . . . . 359
10.10.3 Implementations of Gradient Boosting . . . . . . 360
10.11 Right-Sized Trees for Boosting . . . . . . . . . . . . . . . 361
10.12 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 364
10.12.1 Shrinkage . . . . . . . . . . . . . . . . . . . . . . 364
10.12.2 Subsampling . . . . . . . . . . . . . . . . . . . . 365
10.13 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . 367
10.13.1 Relative Importance of Predictor Variables . . . 367
10.13.2 Partial Dependence Plots . . . . . . . . . . . . . 369
10.14 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . 371
10.14.1 California Housing . . . . . . . . . . . . . . . . . 371
10.14.2 New Zealand Fish . . . . . . . . . . . . . . . . . 375
10.14.3 Demographics Data . . . . . . . . . . . . . . . . 379
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 380
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
11 Neural Networks 389
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 389
11.2 Projection Pursuit Regression . . . . . . . . . . . . . . . 389
11.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 392
11.4 Fitting Neural Networks . . . . . . . . . . . . . . . . . . . 395
11.5 Some Issues in Training Neural Networks . . . . . . . . . 397
11.5.1 Starting Values . . . . . . . . . . . . . . . . . . . 397
11.5.2 Overfitting . . . . . . . . . . . . . . . . . . . . . 398
11.5.3 Scaling of the Inputs . . . . . . . . . . . . . . . 398
11.5.4 Number of Hidden Units and Layers . . . . . . . 400
11.5.5 Multiple Minima . . . . . . . . . . . . . . . . . . 400
11.6 Example: Simulated Data . . . . . . . . . . . . . . . . . . 401
11.7 Example: ZIP Code Data . . . . . . . . . . . . . . . . . . 404
11.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 408
11.9 Bayesian Neural Nets and the NIPS 2003 Challenge . . . 409
11.9.1 Bayes, Boosting and Bagging . . . . . . . . . . . 410
11.9.2 Performance Comparisons . . . . . . . . . . . . 412
11.10 Computational Considerations . . . . . . . . . . . . . . . 414
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 415
Contents xix
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
12 Support Vector Machines and
Flexible Discriminants 417
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 417
12.2 The Support Vector Classifier . . . . . . . . . . . . . . . . 417
12.2.1 Computing the Support Vector Classifier . . . . 420
12.2.2 Mixture Example (Continued) . . . . . . . . . . 421
12.3 Support Vector Machines and Kernels . . . . . . . . . . . 423
12.3.1 Computing the SVM for Classification . . . . . . 423
12.3.2 The SVM as a Penalization Method . . . . . . . 426
12.3.3 Function Estimation and Reproducing Kernels . 428
12.3.4 SVMs and the Curse of Dimensionality . . . . . 431
12.3.5 A Path Algorithm for the SVM Classifier . . . . 432
12.3.6 Support Vector Machines for Regression . . . . . 434
12.3.7 Regression and Kernels . . . . . . . . . . . . . . 436
12.3.8 Discussion . . . . . . . . . . . . . . . . . . . . . 438
12.4 Generalizing Linear Discriminant Analysis . . . . . . . . 438
12.5 Flexible Discriminant Analysis . . . . . . . . . . . . . . . 440
12.5.1 Computing the FDA Estimates . . . . . . . . . . 444
12.6 Penalized Discriminant Analysis . . . . . . . . . . . . . . 446
12.7 Mixture Discriminant Analysis . . . . . . . . . . . . . . . 449
12.7.1 Example: Waveform Data . . . . . . . . . . . . . 451
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 455
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455

13 Prototype Methods and Nearest-Neighbors 459
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 459
13.2 Prototype Methods . . . . . . . . . . . . . . . . . . . . . 459
13.2.1 K-means Clustering . . . . . . . . . . . . . . . . 460
13.2.2 Learning Vector Quantization . . . . . . . . . . 462
13.2.3 Gaussian Mixtures . . . . . . . . . . . . . . . . . 463
13.3 k-Nearest-Neighbor Classifiers . . . . . . . . . . . . . . . 463
13.3.1 Example: A Comparative Study . . . . . . . . . 468
13.3.2 Example: k-Nearest-Neighbors
and Image Scene Classification . . . . . . . . . . 470
13.3.3 Invariant Metrics and Tangent Distance . . . . . 471
13.4 Adaptive Nearest-Neighbor Methods . . . . . . . . . . . . 475
13.4.1 Example . . . . . . . . . . . . . . . . . . . . . . 478
13.4.2 Global Dimension Reduction
for Nearest-Neighbors . . . . . . . . . . . . . . . 479
13.5 Computational Considerations . . . . . . . . . . . . . . . 480
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 481
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
xx Contents
14 Unsupervised Learning 485
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 485
14.2 Association Rules . . . . . . . . . . . . . . . . . . . . . . 487
14.2.1 Market Basket Analysis . . . . . . . . . . . . . . 488
14.2.2 The Apriori Algorithm . . . . . . . . . . . . . . 489
14.2.3 Example: Market Basket Analysis . . . . . . . . 492
14.2.4 Unsupervised as Supervised Learning . . . . . . 495
14.2.5 Generalized Association Rules . . . . . . . . . . 497
14.2.6 Choice of Supervised Learning Method . . . . . 499
14.2.7 Example: Market Basket Analysis (Continued) . 499
14.3 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . 501
14.3.1 Proximity Matrices . . . . . . . . . . . . . . . . 503
14.3.2 Dissimilarities Based on Attributes . . . . . . . 503
14.3.3 Object Dissimilarity . . . . . . . . . . . . . . . . 505
14.3.4 Clustering Algorithms . . . . . . . . . . . . . . . 507
14.3.5 Combinatorial Algorithms . . . . . . . . . . . . 507
14.3.6 K-means . . . . . . . . . . . . . . . . . . . . . . 509
14.3.7 Gaussian Mixtures as Soft K-means Clustering . 510
14.3.8 Example: Human Tumor Microarray Data . . . 512
14.3.9 Vector Quantization . . . . . . . . . . . . . . . . 514
14.3.10 K-medoids . . . . . . . . . . . . . . . . . . . . . 515
14.3.11 Practical Issues . . . . . . . . . . . . . . . . . . 518
14.3.12 Hierarchical Clustering . . . . . . . . . . . . . . 520
14.4 Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . 528
14.5 Principal Components, Curves and Surfaces . . . . . . . . 534
14.5.1 Principal Components . . . . . . . . . . . . . . . 534
14.5.2 Principal Curves and Surfaces . . . . . . . . . . 541
14.5.3 Spectral Clustering . . . . . . . . . . . . . . . . 544
14.5.4 Kernel Principal Components . . . . . . . . . . . 547
14.5.5 Sparse Principal Components . . . . . . . . . . . 550
14.6 Non-negative Matrix Factorization . . . . . . . . . . . . . 553
14.6.1 Archetypal Analysis . . . . . . . . . . . . . . . . 554
14.7 Independent Component Analysis
and Exploratory Projection Pursuit . . . . . . . . . . . . 557
14.7.1 Latent Variables and Factor Analysis . . . . . . 558
14.7.2 Independent Component Analysis . . . . . . . . 560
14.7.3 Exploratory Projection Pursuit . . . . . . . . . . 565
14.7.4 A Direct Approach to ICA . . . . . . . . . . . . 565
14.8 Multidimensional Scaling . . . . . . . . . . . . . . . . . . 570
14.9 Nonlinear Dimension Reduction
and Local Multidimensional Scaling . . . . . . . . . . . . 572
14.10 The Google PageRank Algorithm . . . . . . . . . . . . . 576
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 578
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
Contents xxi
15 Random Forests 587
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.2 Definition of Random Forests . . . . . . . . . . . . . . . . 587
15.3 Details of Random Forests . . . . . . . . . . . . . . . . . 592
15.3.1 Out of Bag Samples . . . . . . . . . . . . . . . . 592
15.3.2 Variable Importance . . . . . . . . . . . . . . . . 593
15.3.3 Proximity Plots . . . . . . . . . . . . . . . . . . 595
15.3.4 Random Forests and Overfitting . . . . . . . . . 596
15.4 Analysis of Random Forests . . . . . . . . . . . . . . . . . 597
15.4.1 Variance and the De-Correlation Effect . . . . . 597
15.4.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . 600
15.4.3 Adaptive Nearest Neighbors . . . . . . . . . . . 601
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 602
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
16 Ensemble Learning 605
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 605
16.2 Boosting and Regularization Paths . . . . . . . . . . . . . 607
16.2.1 Penalized Regression . . . . . . . . . . . . . . . 607
16.2.2 The “Bet on Sparsity” Principle . . . . . . . . . 610
16.2.3 Regularization Paths, Over-fitting and Margins . 613
16.3 Learning Ensembles . . . . . . . . . . . . . . . . . . . . . 616
16.3.1 Learning a Good Ensemble . . . . . . . . . . . . 617
16.3.2 Rule Ensembles . . . . . . . . . . . . . . . . . . 622
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 623
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
17 Undirected Graphical Models 625
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 625
17.2 Markov Graphs and Their Properties . . . . . . . . . . . 627
17.3 Undirected Graphical Models for Continuous Variables . 630
17.3.1 Estimation of the Parameters
when the Graph Structure is Known . . . . . . . 631
17.3.2 Estimation of the Graph Structure . . . . . . . . 635
17.4 Undirected Graphical Models for Discrete Variables . . . 638
17.4.1 Estimation of the Parameters
when the Graph Structure is Known . . . . . . . 639
17.4.2 Hidden Nodes . . . . . . . . . . . . . . . . . . . 641
17.4.3 Estimation of the Graph Structure . . . . . . . . 642
17.4.4 Restricted Boltzmann Machines . . . . . . . . . 643
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645

18 High-Dimensional Problems: p ≫ N 649
18.1 When p is Much Bigger than N . . . . . . . . . . . . . . 649
xxii Contents
18.2 Diagonal Linear Discriminant Analysis
and Nearest Shrunken Centroids . . . . . . . . . . . . . . 651
18.3 Linear Classifiers with Quadratic Regularization . . . . . 654
18.3.1 Regularized Discriminant Analysis . . . . . . . . 656
18.3.2 Logistic Regression
with Quadratic Regularization . . . . . . . . . . 657
18.3.3 The Support Vector Classifier . . . . . . . . . . 657
18.3.4 Feature Selection . . . . . . . . . . . . . . . . . . 658
18.3.5 Computational Shortcuts When p ≫ N . . . . . 659
18.4 Linear Classifiers with L1 Regularization . . . . . . . . . 661
18.4.1 Application of Lasso
to Protein Mass Spectroscopy . . . . . . . . . . 664
18.4.2 The Fused Lasso for Functional Data . . . . . . 666
18.5 Classification When Features are Unavailable . . . . . . . 668
18.5.1 Example: String Kernels
and Protein Classification . . . . . . . . . . . . . 668
18.5.2 Classification and Other Models Using
Inner-Product Kernels and Pairwise Distances . 670
18.5.3 Example: Abstracts Classification . . . . . . . . 672
18.6 High-Dimensional Regression:
Supervised Principal Components . . . . . . . . . . . . . 674
18.6.1 Connection to Latent-Variable Modeling . . . . 678
18.6.2 Relationship with Partial Least Squares . . . . . 680
18.6.3 Pre-Conditioning for Feature Selection . . . . . 681
18.7 Feature Assessment and the Multiple-Testing Problem . . 683
18.7.1 The False Discovery Rate . . . . . . . . . . . . . 687
18.7.2 Asymmetric Cutpoints and the SAM Procedure 690
18.7.3 A Bayesian Interpretation of the FDR . . . . . . 692
18.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . 693
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694

References 699
Author Index 729
Index 737