Elements of Statistical Learning. 8/10

This is the first in a series of book recommendations for new quants. We will only focus on books that quants commonly come across in their journey.

Elements of Statistical Learning (ESL) is the classic recommendation for new quants, for good reason. However it’s a massive tome and many sections aren’t that useful – reflecting older techniques, the authors’ personal research agendas, or things that aren’t applicable to the trading domain. Chapters 1, 2, 3, and 7 are all great, constituting one of the best foundations for building empirical models from data. The framing of the model-building problem in ESL fits the trading domain better than other common entry points into statistical learning like Bayesian learning, NLP, bandits, convergence proofs, reinforcement learning, or statistical mechanics – which all emphasize assumptions that are inapplicable to trading. Trading is essentially empirical, extracting transient patterns from recent data, neither IID nor even stationary. Trading is a social system that just so happens to be fittable by certain classes of models. One major hurdle for new quants is learning the priors that limit the range of models that are applicable to this domain, and forgetting some of the theoretically appealing assumptions so prevalent in academia.

One major problem with ESL is that it mostly misses the practicalities of building models. It doesn’t touch on developing fast or robust software, processes that minimize divergence between training and inference, modular data cleaning and featurization pipelines, etc which are much more than half the battle when developing new automated quantitative strategies. Google’s Rules of Machine Learning is a good complementary read to fill in most of these gaps: https://developers.google.com/machine-learning/guides/rules-of-ml.

At the bottom of this review is the index of ESL’s 12th edition. This book is available for free at https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf. The highly recommended sections are bolded in blue, sections that could be interesting are in black, and sections that can be safely deprioritized are italicized in red. Blue sections will be applicable on a day-to-day basis, black sections may provide useful tools for thinking about data, and red sections are likely a less valuable use of time than simply studying for class, experimenting with kaggle problems, or prototyping an automated trading strategy yourself.

Recommended unconditionally (Blue): Chapter 1, 2, 3, and 7.

Worthwhile with caveats (Black):

  • Chapter 4: Most trading strategies don’t use classification, because the utility function/decision logic is based on buying and selling along with risk, messaging, and other constraints, which depend on PNL, and don’t fit in the classification framework that usually assumes samples are independent.
  • Chapter 8 & 16: Kaggle’s “no free hunch” blogs have better battle-tested ensembling ideas.
  • Chapters 9, 10, & 15: These are the authors’ main research area so they go into too much detail- in practice people just call xgboost or lightgbm.
  • Chapter 11: ESL is not the best intro to neural nets.
  • Chapters 13 & 14: Unsupervised methods can sometimes be used to engineer features, but are generally minor components.

Safe to skip (Red): Chapter 5, 6, 12, 17, and 18.


1 Introduction 1
2 Overview of Supervised Learning 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Variable Types and Terminology . . . . . . . . . . . . . . 9
2.3 Two Simple Approaches to Prediction:
Least Squares and Nearest Neighbors . . . . . . . . . . . 11
2.3.1 Linear Models and Least Squares . . . . . . . . 11
2.3.2 Nearest-Neighbor Methods . . . . . . . . . . . . 14
2.3.3 From Least Squares to Nearest Neighbors . . . . 16
2.4 Statistical Decision Theory . . . . . . . . . . . . . . . . . 18
2.5 Local Methods in High Dimensions . . . . . . . . . . . . . 22
2.6 Statistical Models, Supervised Learning
and Function Approximation . . . . . . . . . . . . . . . . 28
2.6.1 A Statistical Model
for the Joint Distribution Pr(X, Y ) . . . . . . . 28
2.6.2 Supervised Learning . . . . . . . . . . . . . . . . 29
2.6.3 Function Approximation . . . . . . . . . . . . . 29
2.7 Structured Regression Models . . . . . . . . . . . . . . . 32
2.7.1 Difficulty of the Problem . . . . . . . . . . . . . 32
xiv Contents
2.8 Classes of Restricted Estimators . . . . . . . . . . . . . . 33
2.8.1 Roughness Penalty and Bayesian Methods . . . 34
2.8.2 Kernel Methods and Local Regression . . . . . . 34
2.8.3 Basis Functions and Dictionary Methods . . . . 35
2.9 Model Selection and the Bias–Variance Tradeoff . . . . . 37
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 39
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3 Linear Methods for Regression 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Linear Regression Models and Least Squares . . . . . . . 44
3.2.1 Example: Prostate Cancer . . . . . . . . . . . . 49
3.2.2 The Gauss–Markov Theorem . . . . . . . . . . . 51
3.2.3 Multiple Regression
from Simple Univariate Regression . . . . . . . . 52
3.2.4 Multiple Outputs . . . . . . . . . . . . . . . . . 56
3.3 Subset Selection . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Best-Subset Selection . . . . . . . . . . . . . . . 57
3.3.2 Forward- and Backward-Stepwise Selection . . . 58
3.3.3 Forward-Stagewise Regression . . . . . . . . . . 60
3.3.4 Prostate Cancer Data Example (Continued) . . 61
3.4 Shrinkage Methods . . . . . . . . . . . . . . . . . . . . . . 61
3.4.1 Ridge Regression . . . . . . . . . . . . . . . . . 61
3.4.2 The Lasso . . . . . . . . . . . . . . . . . . . . . 68
3.4.3 Discussion: Subset Selection, Ridge Regression
and the Lasso . . . . . . . . . . . . . . . . . . . 69
3.4.4 Least Angle Regression . . . . . . . . . . . . . . 73
3.5 Methods Using Derived Input Directions . . . . . . . . . 79
3.5.1 Principal Components Regression . . . . . . . . 79
3.5.2 Partial Least Squares . . . . . . . . . . . . . . . 80
3.6 Discussion: A Comparison of the Selection
and Shrinkage Methods . . . . . . . . . . . . . . . . . . . 82
3.7 Multiple Outcome Shrinkage and Selection . . . . . . . . 84
3.8 More on the Lasso and Related Path Algorithms . . . . . 86
3.8.1 Incremental Forward Stagewise Regression . . . 86
3.8.2 Piecewise-Linear Path Algorithms . . . . . . . . 89
3.8.3 The Dantzig Selector . . . . . . . . . . . . . . . 89
3.8.4 The Grouped Lasso . . . . . . . . . . . . . . . . 90
3.8.5 Further Properties of the Lasso . . . . . . . . . . 91
3.8.6 Pathwise Coordinate Optimization . . . . . . . . 92
3.9 Computational Considerations . . . . . . . . . . . . . . . 93
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 94
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Contents xv

4 Linear Methods for Classification 101
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 Linear Regression of an Indicator Matrix . . . . . . . . . 103
4.3 Linear Discriminant Analysis . . . . . . . . . . . . . . . . 106
4.3.1 Regularized Discriminant Analysis . . . . . . . . 112
4.3.2 Computations for LDA . . . . . . . . . . . . . . 113
4.3.3 Reduced-Rank Linear Discriminant Analysis . . 113
4.4 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 119
4.4.1 Fitting Logistic Regression Models . . . . . . . . 120
4.4.2 Example: South African Heart Disease . . . . . 122
4.4.3 Quadratic Approximations and Inference . . . . 124
4.4.4 L1 Regularized Logistic Regression . . . . . . . . 125
4.4.5 Logistic Regression or LDA? . . . . . . . . . . . 127
4.5 Separating Hyperplanes . . . . . . . . . . . . . . . . . . . 129
4.5.1 Rosenblatt’s Perceptron Learning Algorithm . . 130
4.5.2 Optimal Separating Hyperplanes . . . . . . . . . 132
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 135
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5 Basis Expansions and Regularization 139
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.2 Piecewise Polynomials and Splines . . . . . . . . . . . . . 141
5.2.1 Natural Cubic Splines . . . . . . . . . . . . . . . 144
5.2.2 Example: South African Heart Disease (Continued)146
5.2.3 Example: Phoneme Recognition . . . . . . . . . 148
5.3 Filtering and Feature Extraction . . . . . . . . . . . . . . 150
5.4 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . 151
5.4.1 Degrees of Freedom and Smoother Matrices . . . 153
5.5 Automatic Selection of the Smoothing Parameters . . . . 156
5.5.1 Fixing the Degrees of Freedom . . . . . . . . . . 158
5.5.2 The Bias–Variance Tradeoff . . . . . . . . . . . . 158
5.6 Nonparametric Logistic Regression . . . . . . . . . . . . . 161
5.7 Multidimensional Splines . . . . . . . . . . . . . . . . . . 162
5.8 Regularization and Reproducing Kernel Hilbert Spaces . 167
5.8.1 Spaces of Functions Generated by Kernels . . . 168
5.8.2 Examples of RKHS . . . . . . . . . . . . . . . . 170
5.9 Wavelet Smoothing . . . . . . . . . . . . . . . . . . . . . 174
5.9.1 Wavelet Bases and the Wavelet Transform . . . 176
5.9.2 Adaptive Wavelet Filtering . . . . . . . . . . . . 179
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 181
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Appendix: Computational Considerations for Splines . . . . . . 186
Appendix: B-splines . . . . . . . . . . . . . . . . . . . . . 186
Appendix: Computations for Smoothing Splines . . . . . 189
xvi Contents

6 Kernel Smoothing Methods 191
6.1 One-Dimensional Kernel Smoothers . . . . . . . . . . . . 192
6.1.1 Local Linear Regression . . . . . . . . . . . . . . 194
6.1.2 Local Polynomial Regression . . . . . . . . . . . 197
6.2 Selecting the Width of the Kernel . . . . . . . . . . . . . 198
6.3 Local Regression in IRp
. . . . . . . . . . . . . . . . . . . 200
6.4 Structured Local Regression Models in IRp
. . . . . . . . 201
6.4.1 Structured Kernels . . . . . . . . . . . . . . . . . 203
6.4.2 Structured Regression Functions . . . . . . . . . 203
6.5 Local Likelihood and Other Models . . . . . . . . . . . . 205
6.6 Kernel Density Estimation and Classification . . . . . . . 208
6.6.1 Kernel Density Estimation . . . . . . . . . . . . 208
6.6.2 Kernel Density Classification . . . . . . . . . . . 210
6.6.3 The Naive Bayes Classifier . . . . . . . . . . . . 210
6.7 Radial Basis Functions and Kernels . . . . . . . . . . . . 212
6.8 Mixture Models for Density Estimation and Classification 214
6.9 Computational Considerations . . . . . . . . . . . . . . . 216
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 216
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

7 Model Assessment and Selection 219
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 219
7.2 Bias, Variance and Model Complexity . . . . . . . . . . . 219
7.3 The Bias–Variance Decomposition . . . . . . . . . . . . . 223
7.3.1 Example: Bias–Variance Tradeoff . . . . . . . . 226
7.4 Optimism of the Training Error Rate . . . . . . . . . . . 228
7.5 Estimates of In-Sample Prediction Error . . . . . . . . . . 230
7.6 The Effective Number of Parameters . . . . . . . . . . . . 232
7.7 The Bayesian Approach and BIC . . . . . . . . . . . . . . 233
7.8 Minimum Description Length . . . . . . . . . . . . . . . . 235
7.9 Vapnik–Chervonenkis Dimension . . . . . . . . . . . . . . 237
7.9.1 Example (Continued) . . . . . . . . . . . . . . . 239
7.10 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . 241
7.10.1 K-Fold Cross-Validation . . . . . . . . . . . . . 241
7.10.2 The Wrong and Right Way
to Do Cross-validation . . . . . . . . . . . . . . . 245
7.10.3 Does Cross-Validation Really Work? . . . . . . . 247
7.11 Bootstrap Methods . . . . . . . . . . . . . . . . . . . . . 249
7.11.1 Example (Continued) . . . . . . . . . . . . . . . 252
7.12 Conditional or Expected Test Error? . . . . . . . . . . . . 254
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 257
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

8 Model Inference and Averaging 261
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 261
Contents xvii
8.2 The Bootstrap and Maximum Likelihood Methods . . . . 261
8.2.1 A Smoothing Example . . . . . . . . . . . . . . 261
8.2.2 Maximum Likelihood Inference . . . . . . . . . . 265
8.2.3 Bootstrap versus Maximum Likelihood . . . . . 267
8.3 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . 267
8.4 Relationship Between the Bootstrap
and Bayesian Inference . . . . . . . . . . . . . . . . . . . 271
8.5 The EM Algorithm . . . . . . . . . . . . . . . . . . . . . 272
8.5.1 Two-Component Mixture Model . . . . . . . . . 272
8.5.2 The EM Algorithm in General . . . . . . . . . . 276
8.5.3 EM as a Maximization–Maximization Procedure 277
8.6 MCMC for Sampling from the Posterior . . . . . . . . . . 279
8.7 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
8.7.1 Example: Trees with Simulated Data . . . . . . 283
8.8 Model Averaging and Stacking . . . . . . . . . . . . . . . 288
8.9 Stochastic Search: Bumping . . . . . . . . . . . . . . . . . 290
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 292
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
9 Additive Models, Trees, and Related Methods 295
9.1 Generalized Additive Models . . . . . . . . . . . . . . . . 295
9.1.1 Fitting Additive Models . . . . . . . . . . . . . . 297
9.1.2 Example: Additive Logistic Regression . . . . . 299
9.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . 304
9.2 Tree-Based Methods . . . . . . . . . . . . . . . . . . . . . 305
9.2.1 Background . . . . . . . . . . . . . . . . . . . . 305
9.2.2 Regression Trees . . . . . . . . . . . . . . . . . . 307
9.2.3 Classification Trees . . . . . . . . . . . . . . . . 308
9.2.4 Other Issues . . . . . . . . . . . . . . . . . . . . 310
9.2.5 Spam Example (Continued) . . . . . . . . . . . 313
9.3 PRIM: Bump Hunting . . . . . . . . . . . . . . . . . . . . 317
9.3.1 Spam Example (Continued) . . . . . . . . . . . 320
9.4 MARS: Multivariate Adaptive Regression Splines . . . . . 321
9.4.1 Spam Example (Continued) . . . . . . . . . . . 326
9.4.2 Example (Simulated Data) . . . . . . . . . . . . 327
9.4.3 Other Issues . . . . . . . . . . . . . . . . . . . . 328
9.5 Hierarchical Mixtures of Experts . . . . . . . . . . . . . . 329
9.6 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . 332
9.7 Computational Considerations . . . . . . . . . . . . . . . 334
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 334
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
10 Boosting and Additive Trees 337
10.1 Boosting Methods . . . . . . . . . . . . . . . . . . . . . . 337
10.1.1 Outline of This Chapter . . . . . . . . . . . . . . 340
xviii Contents
10.2 Boosting Fits an Additive Model . . . . . . . . . . . . . . 341
10.3 Forward Stagewise Additive Modeling . . . . . . . . . . . 342
10.4 Exponential Loss and AdaBoost . . . . . . . . . . . . . . 343
10.5 Why Exponential Loss? . . . . . . . . . . . . . . . . . . . 345
10.6 Loss Functions and Robustness . . . . . . . . . . . . . . . 346
10.7 “Off-the-Shelf” Procedures for Data Mining . . . . . . . . 350
10.8 Example: Spam Data . . . . . . . . . . . . . . . . . . . . 352
10.9 Boosting Trees . . . . . . . . . . . . . . . . . . . . . . . . 353
10.10 Numerical Optimization via Gradient Boosting . . . . . . 358
10.10.1 Steepest Descent . . . . . . . . . . . . . . . . . . 358
10.10.2 Gradient Boosting . . . . . . . . . . . . . . . . . 359
10.10.3 Implementations of Gradient Boosting . . . . . . 360
10.11 Right-Sized Trees for Boosting . . . . . . . . . . . . . . . 361
10.12 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 364
10.12.1 Shrinkage . . . . . . . . . . . . . . . . . . . . . . 364
10.12.2 Subsampling . . . . . . . . . . . . . . . . . . . . 365
10.13 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . 367
10.13.1 Relative Importance of Predictor Variables . . . 367
10.13.2 Partial Dependence Plots . . . . . . . . . . . . . 369
10.14 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . 371
10.14.1 California Housing . . . . . . . . . . . . . . . . . 371
10.14.2 New Zealand Fish . . . . . . . . . . . . . . . . . 375
10.14.3 Demographics Data . . . . . . . . . . . . . . . . 379
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 380
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
11 Neural Networks 389
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 389
11.2 Projection Pursuit Regression . . . . . . . . . . . . . . . 389
11.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 392
11.4 Fitting Neural Networks . . . . . . . . . . . . . . . . . . . 395
11.5 Some Issues in Training Neural Networks . . . . . . . . . 397
11.5.1 Starting Values . . . . . . . . . . . . . . . . . . . 397
11.5.2 Overfitting . . . . . . . . . . . . . . . . . . . . . 398
11.5.3 Scaling of the Inputs . . . . . . . . . . . . . . . 398
11.5.4 Number of Hidden Units and Layers . . . . . . . 400
11.5.5 Multiple Minima . . . . . . . . . . . . . . . . . . 400
11.6 Example: Simulated Data . . . . . . . . . . . . . . . . . . 401
11.7 Example: ZIP Code Data . . . . . . . . . . . . . . . . . . 404
11.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 408
11.9 Bayesian Neural Nets and the NIPS 2003 Challenge . . . 409
11.9.1 Bayes, Boosting and Bagging . . . . . . . . . . . 410
11.9.2 Performance Comparisons . . . . . . . . . . . . 412
11.10 Computational Considerations . . . . . . . . . . . . . . . 414
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 415
Contents xix
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
12 Support Vector Machines and
Flexible Discriminants 417
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 417
12.2 The Support Vector Classifier . . . . . . . . . . . . . . . . 417
12.2.1 Computing the Support Vector Classifier . . . . 420
12.2.2 Mixture Example (Continued) . . . . . . . . . . 421
12.3 Support Vector Machines and Kernels . . . . . . . . . . . 423
12.3.1 Computing the SVM for Classification . . . . . . 423
12.3.2 The SVM as a Penalization Method . . . . . . . 426
12.3.3 Function Estimation and Reproducing Kernels . 428
12.3.4 SVMs and the Curse of Dimensionality . . . . . 431
12.3.5 A Path Algorithm for the SVM Classifier . . . . 432
12.3.6 Support Vector Machines for Regression . . . . . 434
12.3.7 Regression and Kernels . . . . . . . . . . . . . . 436
12.3.8 Discussion . . . . . . . . . . . . . . . . . . . . . 438
12.4 Generalizing Linear Discriminant Analysis . . . . . . . . 438
12.5 Flexible Discriminant Analysis . . . . . . . . . . . . . . . 440
12.5.1 Computing the FDA Estimates . . . . . . . . . . 444
12.6 Penalized Discriminant Analysis . . . . . . . . . . . . . . 446
12.7 Mixture Discriminant Analysis . . . . . . . . . . . . . . . 449
12.7.1 Example: Waveform Data . . . . . . . . . . . . . 451
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 455
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455

13 Prototype Methods and Nearest-Neighbors 459
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 459
13.2 Prototype Methods . . . . . . . . . . . . . . . . . . . . . 459
13.2.1 K-means Clustering . . . . . . . . . . . . . . . . 460
13.2.2 Learning Vector Quantization . . . . . . . . . . 462
13.2.3 Gaussian Mixtures . . . . . . . . . . . . . . . . . 463
13.3 k-Nearest-Neighbor Classifiers . . . . . . . . . . . . . . . 463
13.3.1 Example: A Comparative Study . . . . . . . . . 468
13.3.2 Example: k-Nearest-Neighbors
and Image Scene Classification . . . . . . . . . . 470
13.3.3 Invariant Metrics and Tangent Distance . . . . . 471
13.4 Adaptive Nearest-Neighbor Methods . . . . . . . . . . . . 475
13.4.1 Example . . . . . . . . . . . . . . . . . . . . . . 478
13.4.2 Global Dimension Reduction
for Nearest-Neighbors . . . . . . . . . . . . . . . 479
13.5 Computational Considerations . . . . . . . . . . . . . . . 480
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 481
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
xx Contents
14 Unsupervised Learning 485
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 485
14.2 Association Rules . . . . . . . . . . . . . . . . . . . . . . 487
14.2.1 Market Basket Analysis . . . . . . . . . . . . . . 488
14.2.2 The Apriori Algorithm . . . . . . . . . . . . . . 489
14.2.3 Example: Market Basket Analysis . . . . . . . . 492
14.2.4 Unsupervised as Supervised Learning . . . . . . 495
14.2.5 Generalized Association Rules . . . . . . . . . . 497
14.2.6 Choice of Supervised Learning Method . . . . . 499
14.2.7 Example: Market Basket Analysis (Continued) . 499
14.3 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . 501
14.3.1 Proximity Matrices . . . . . . . . . . . . . . . . 503
14.3.2 Dissimilarities Based on Attributes . . . . . . . 503
14.3.3 Object Dissimilarity . . . . . . . . . . . . . . . . 505
14.3.4 Clustering Algorithms . . . . . . . . . . . . . . . 507
14.3.5 Combinatorial Algorithms . . . . . . . . . . . . 507
14.3.6 K-means . . . . . . . . . . . . . . . . . . . . . . 509
14.3.7 Gaussian Mixtures as Soft K-means Clustering . 510
14.3.8 Example: Human Tumor Microarray Data . . . 512
14.3.9 Vector Quantization . . . . . . . . . . . . . . . . 514
14.3.10 K-medoids . . . . . . . . . . . . . . . . . . . . . 515
14.3.11 Practical Issues . . . . . . . . . . . . . . . . . . 518
14.3.12 Hierarchical Clustering . . . . . . . . . . . . . . 520
14.4 Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . 528
14.5 Principal Components, Curves and Surfaces . . . . . . . . 534
14.5.1 Principal Components . . . . . . . . . . . . . . . 534
14.5.2 Principal Curves and Surfaces . . . . . . . . . . 541
14.5.3 Spectral Clustering . . . . . . . . . . . . . . . . 544
14.5.4 Kernel Principal Components . . . . . . . . . . . 547
14.5.5 Sparse Principal Components . . . . . . . . . . . 550
14.6 Non-negative Matrix Factorization . . . . . . . . . . . . . 553
14.6.1 Archetypal Analysis . . . . . . . . . . . . . . . . 554
14.7 Independent Component Analysis
and Exploratory Projection Pursuit . . . . . . . . . . . . 557
14.7.1 Latent Variables and Factor Analysis . . . . . . 558
14.7.2 Independent Component Analysis . . . . . . . . 560
14.7.3 Exploratory Projection Pursuit . . . . . . . . . . 565
14.7.4 A Direct Approach to ICA . . . . . . . . . . . . 565
14.8 Multidimensional Scaling . . . . . . . . . . . . . . . . . . 570
14.9 Nonlinear Dimension Reduction
and Local Multidimensional Scaling . . . . . . . . . . . . 572
14.10 The Google PageRank Algorithm . . . . . . . . . . . . . 576
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 578
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
Contents xxi
15 Random Forests 587
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.2 Definition of Random Forests . . . . . . . . . . . . . . . . 587
15.3 Details of Random Forests . . . . . . . . . . . . . . . . . 592
15.3.1 Out of Bag Samples . . . . . . . . . . . . . . . . 592
15.3.2 Variable Importance . . . . . . . . . . . . . . . . 593
15.3.3 Proximity Plots . . . . . . . . . . . . . . . . . . 595
15.3.4 Random Forests and Overfitting . . . . . . . . . 596
15.4 Analysis of Random Forests . . . . . . . . . . . . . . . . . 597
15.4.1 Variance and the De-Correlation Effect . . . . . 597
15.4.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . 600
15.4.3 Adaptive Nearest Neighbors . . . . . . . . . . . 601
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 602
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
16 Ensemble Learning 605
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 605
16.2 Boosting and Regularization Paths . . . . . . . . . . . . . 607
16.2.1 Penalized Regression . . . . . . . . . . . . . . . 607
16.2.2 The “Bet on Sparsity” Principle . . . . . . . . . 610
16.2.3 Regularization Paths, Over-fitting and Margins . 613
16.3 Learning Ensembles . . . . . . . . . . . . . . . . . . . . . 616
16.3.1 Learning a Good Ensemble . . . . . . . . . . . . 617
16.3.2 Rule Ensembles . . . . . . . . . . . . . . . . . . 622
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 623
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
17 Undirected Graphical Models 625
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 625
17.2 Markov Graphs and Their Properties . . . . . . . . . . . 627
17.3 Undirected Graphical Models for Continuous Variables . 630
17.3.1 Estimation of the Parameters
when the Graph Structure is Known . . . . . . . 631
17.3.2 Estimation of the Graph Structure . . . . . . . . 635
17.4 Undirected Graphical Models for Discrete Variables . . . 638
17.4.1 Estimation of the Parameters
when the Graph Structure is Known . . . . . . . 639
17.4.2 Hidden Nodes . . . . . . . . . . . . . . . . . . . 641
17.4.3 Estimation of the Graph Structure . . . . . . . . 642
17.4.4 Restricted Boltzmann Machines . . . . . . . . . 643
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645

18 High-Dimensional Problems: p ≫ N 649
18.1 When p is Much Bigger than N . . . . . . . . . . . . . . 649
xxii Contents
18.2 Diagonal Linear Discriminant Analysis
and Nearest Shrunken Centroids . . . . . . . . . . . . . . 651
18.3 Linear Classifiers with Quadratic Regularization . . . . . 654
18.3.1 Regularized Discriminant Analysis . . . . . . . . 656
18.3.2 Logistic Regression
with Quadratic Regularization . . . . . . . . . . 657
18.3.3 The Support Vector Classifier . . . . . . . . . . 657
18.3.4 Feature Selection . . . . . . . . . . . . . . . . . . 658
18.3.5 Computational Shortcuts When p ≫ N . . . . . 659
18.4 Linear Classifiers with L1 Regularization . . . . . . . . . 661
18.4.1 Application of Lasso
to Protein Mass Spectroscopy . . . . . . . . . . 664
18.4.2 The Fused Lasso for Functional Data . . . . . . 666
18.5 Classification When Features are Unavailable . . . . . . . 668
18.5.1 Example: String Kernels
and Protein Classification . . . . . . . . . . . . . 668
18.5.2 Classification and Other Models Using
Inner-Product Kernels and Pairwise Distances . 670
18.5.3 Example: Abstracts Classification . . . . . . . . 672
18.6 High-Dimensional Regression:
Supervised Principal Components . . . . . . . . . . . . . 674
18.6.1 Connection to Latent-Variable Modeling . . . . 678
18.6.2 Relationship with Partial Least Squares . . . . . 680
18.6.3 Pre-Conditioning for Feature Selection . . . . . 681
18.7 Feature Assessment and the Multiple-Testing Problem . . 683
18.7.1 The False Discovery Rate . . . . . . . . . . . . . 687
18.7.2 Asymmetric Cutpoints and the SAM Procedure 690
18.7.3 A Bayesian Interpretation of the FDR . . . . . . 692
18.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . 693
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694

References 699
Author Index 729
Index 737

OutOfLine – A Memory-Locality Pattern for High Performance C++

In my time at Headlands Technologies, I’ve gotten the opportunity to build some utilities that have improved the ergonomics of maintaining high-performance C++ codebases. This article will give a generic overview of one of those utilities, OutOfLine.

Let’s start with a motivating example. Suppose you have a system that opens a very large number of paths. Maybe they are files, maybe they are named UNIX sockets, maybe pipes. But for whatever reason, you open a lot of file descriptors at startup, then you do a lot of processing on those descriptors, and finally when you’re done you close the descriptors and unlink the paths.

An (abbreviated) initial design might look like this:

class UnlinkingFD {
  std::string path;
 public:
  int fd;

  UnlinkingFD(const std::string& p) : path(p) {
    fd = open(p.c_str(), O_RDWR, 0);
  }
  ~UnlinkingFD() { close(fd); unlink(path.c_str()); }
  UnlinkingFD(const UnlinkingFD&) = delete;
};

And that’s a nice, logically reasonable design. It RAIIs the close and unlink for you. You can allocate a big array of these things, operate on them, and they clean up after themselves when the array’s lifetime ends.

But what about performance? Suppose you use the fd very often, and you use path only when cleaning up the object. Now we have an array of 40B objects, and our critical path only ever uses 4B of that. Which means you’ll see more cache line misses as you keep having to “skip over” the 90% overhead.

One very common solution to this is to switch from array-of-structs to struct-of-arrays. And that would net us our performance win here, but it would cost us the RAII. Is there a way to have the best of both worlds?

One initial compromise might be to not store a std::string which is 32B, and instead store a std::unique_ptr – which is only 8B. That takes your object size down from 40B to 16B, which is a big win. But it’s not as good as parallel arrays.

OutOfLine is a tool that you can use to keep RAII, and move your cold members completely outside your object with zero space overhead. You use OutOfLine by inheriting from it, like this. It is a CRTP base class, so the first template argument should be the child class that is inheriting. The second argument is the cold data that should be associated with each “fast” object.

struct UnlinkingFD : private OutOfLine<UnlinkingFD, std::string> {
  int fd;

  UnlinkingFD(const std::string& p) : OutOfLine<UnlinkingFD, std::string>(p) {
    fd = open(p.c_str(), O_RDWR, 0);
  }
  ~UnlinkingFD();
  UnlinkingFD(const UnlinkingFD&) = delete;
};

And what does that class look like itself?

template <class FastData, class ColdData>
class OutOfLine {

The implementation is based on the idea of a global map hiding somewhere that maps pointers to fast objects to pointers of cold objects.

  inline static std::map<OutOfLine const*, std::unique_ptr<ColdData>> global_map_;

You can build this base from anything you can build your cold data from. And when you do, it’ll create that cold data and associate it with your fast object.

  template <class... TArgs>
  explicit OutOfLine(TArgs&&... args) {
    global_map_[this] = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }

When your fast object gets cleaned up, the corresponding cold object will too:

  ~OutOfLine() { global_map_.erase(this); }

When you move your fast object, the corresponding cold object is reassociated with the new fast object (remember that means you shouldn’t use the cold data on a moved-from object).

  explicit OutOfLine(OutOfLine&& other) { *this = other; }
  OutOfLine& operator=(OutOfLine&& other) {
    global_map_[this] = std::move(global_map_[&other]);
    return *this;
  }

The current implementation just makes OutOfLine non-copyable for simplicity, but one could instead choose to implement copy construction by copying the cold data.

  OutOfLine(OutOfLine const&) = delete;
  OutOfLine& operator=(OutOfLine const&) = delete;

Now for this to be useful to us though, it has to actually be convenient to access that cold data. When you inherit from OutOfLine, your class gains const and non-const member functions cold():

  ColdData& cold() noexcept { return *global_map_[this]; }
  ColdData const& cold() const noexcept { return *global_map_[this]; }

Calling these gives you a reference (of appropriate constness) to your cold data.

And that’s it. This UnlinkingFD will be only 4B large, it provides access to the fd member at full speed, and it still preserves all the same RAII behavior. All the lifetime-related work is handled for you. When you move the fast object, the cold object is reassociated to the new fast object. When your fast object goes away, the cold object does too.

Sometimes though, your data conspires to make your life difficult, and the fast data must be constructed first because it is a constructor argument to the cold data. That makes the construction order the reverse of the order OutOfLine imposes on you. Also, sometimes you need the fast data to outlive the cold data (maybe the cold data holds a reference to the fast data). For these cases, we need an “escape hatch” to control the order in which data is initialized and deinitialized.

  struct TwoPhaseInit {};
  OutOfLine(TwoPhaseInit){}
  template <class... TArgs>
  void init_cold_data(TArgs&&... args) {
    global_map_.find(this)->second = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }
  void release_cold_data() { global_map_[this].reset(); }

There is another constructor of OutOfLine that your class could call, one that accepts the tag type TwoPhaseInit. If you build your OutOfLine in that way, your cold data will not be initialized, and you’ll be left in a half-constructed state. You then finish your two-phase construction by calling init_cold_data (with any arguments from which you can construct a ColdData) and you’ll be done. Just remember not to call .cold() on an object that has not yet had its cold data initialized. And the parallel holds too – you can release your cold data early if your data requires it by calling release_cold_data.

}; // end of class OutOfLine

And that’s all of it. So ultimately, what did our 29 SLOC buy us? It bought us one more option in the space of tradeoffs. Any time you have an object where some members are drastically more important than other members, you might consider OutOfLine. It lets you make some members a little bit faster at the expense of making accesses to other members a lot slower, so you would reach for this in situations where that sounds like a good tradeoff to you.

We’ve been able to apply this technique in several places – it’s fairly common to want to tag fast data with extra metadata that is logged out on shutdown, or in rare or unexpected situations. Whether that’s recording which user this connection belongs to, which internal trade desk this order is attributed to, or the handle to a hardware-accelerated market-data session – this will keep your cache lines clean while you’re in your critical paths.

I’ve included a benchmark that you can use to see and explore the differences.

Scenario Time (ns)
With cold data in-line (original) 34684547
With cold data thrown away (best-case scenario) 2938327
With OutOfLine 2947645

I measured an ~10x speedup by using OutOfLine. Obviously this benchmark is contrived to provide the best-case-scenario of OutOfLine, but it serves to demonstrate that cache optimization can have very real performance impact, and that OutOfLine really does deliver on that front. And keeping your data cache clear of cold data can also have a difficult-to-measure holistic benefit on the rest of your code. As always, you need to measure each application to optimize it, but this might be a useful tool to have in your belt.

Successfully Starting a Career in Quant Research

People setting out to pursue a career in the quantitative trading industry won’t always know what questions they should be asking of prospective employers. As a result, they are susceptible to making a mistake in choosing their first job. That mistake could lead to later giving up on the industry, when in fact you may be well suited for success. Without experience, it is easy to make an error; but I want to provide some guidance to help you avoid some common pitfalls.

Here’s a checklist of key structural questions you’ll want to understand before choosing your first firm. These questions will help you understand how well you’ll be able to learn and grow in a given company. You’ll want to learn and grow throughout your entire career, but it is especially important at the beginning when you have the least experience.

Ideally, for each of these seven questions you’ll want your first company’s answer to be “no”. “Yes” answers indicate additional challenges for you as you build your career. “Yes” answers don’t necessarily indicate a bad job, but they do indicate additional layers of risk. Below the list, I’ve provided some additional context.

 

  1. Are brainteasers, gambling, poker, or mental math questions used in the interview process?
  2. Will you have a 2 year noncompete?
  3. Will you be blocked from accessing any part of the source code?
  4. As a researcher, will you be the primary on-call trader monitoring any live trading processes?
  5. Will you be blocked from viewing the PnL of any strategies that utilize your research?
  6. Are strategy parameters manually changed based on judgment calls during the day?
  7. Are there other employees in the company in direct competition with you?

 

  1. Are brainteasers, gambling, poker, or mental math questions used in the interview?
    If a company asks these types of questions, it is potentially a sign they value manual, non-automated trader intuition and decision making more than quantitative, algorithmic, and research-driven approaches. If your background is quantitative, you will want a company that will value those skills the most highly. At ‘trader’-centric firms, quants have less influence over the trading strategy and more limited career prospects.
  2. Will you have a 2 year (or greater) noncompete?
    Non-compete agreements are a fact of life for quants working in the trading industry, but the lengths of those agreements vary widely (the standard term is one year). Non-competes exist to help protect the intellectual property you’ll be developing. However, one year should be sufficient to protect your work. Longer terms can sometimes be designed not to protect the company as much as to limit the employee’s career options. Even if you join a good company, there’s always a chance of ending up in a weak team, or with an inexperienced manager, being paid unfairly, or simply not fitting in perfectly. With a 2 year noncompete, other companies may be much less willing to hire you.
  3. Will you be blocked from accessing any part of the source code?
    Some companies encrypt or password-protect parts of their source code. This goes beyond taking adequate steps to protect proprietary property, like securing an internal filesystem from external intruders or preventing employees from copying files off the company network- these companies even prevent their own full-time employees from seeing parts of the existing codebase. To people outside the industry the idea of encryption might sound sufficiently strange that they wouldn’t think to ask. However, it is actually quite common in quantitative trading firms. Having parts of the source code blocked limits your ability to learn, collaborate with coworkers, and make an impact.
  4. As a researcher, will you be the primary on-call trader monitoring any live trading processes?
    Companies that highly value research will have separate dedicated operations and trading teams to handle the majority of the routine day-to-day tasks of running and monitoring an automated trading system.
    At some companies, often those with roots as floor traders or click traders, or those that struggle to manage unstable operations processes, ‘quant traders’ are expected to do both research and operations. While this might seem exciting at first, monitoring live trading, monitoring system health, and ensuring system functionality is a full time responsibility and will severely reduce your time available to concentrate and do high quality research.
  5. Will you be blocked from viewing the PnL (revenue) of any strategies that utilize your research?
    One of the main attractions of working in trading is the fast feedback you get on your research. You can think of an idea, implement the idea, and then see the results within a few days. This tight feedback loop compares favorably to, say, a Physics department, where a single idea could take years to validate. However, some companies separate alpha signal researchers from strategy developers. If you’re separated from the trading PnL, then you can’t get real-time feedback. Companies might do this to prevent their secrets from leaking out easily, but there are plenty of successful companies that trust their employees and encourage loyalty in other ways.
  6. Are strategy parameters manually changed based on judgment calls during the day?
    It’s almost impossible to do proper statistical analysis of your ideas on historic data if ‘click-traders’ are tweaking parameters, because you can’t model the human element. That kind of company is a good place to be a ‘click-trader’ – not a quant.
  7. Are there other employees in the company in direct competition with you?
    Okay, final question! This is important to ask, because some companies have employees or teams directly competing with each other so that the company is diversified in its revenue streams. But for your career, you want your company to invest fully in you. If there are competing teams, you don’t know whether you’ll end up on one that eventually wins- or loses. There will also be fewer opportunities to learn from others; you don’t get the benefit of collaboration with the widest group of other quants. Finally, this type of structure tends to foster a cut-throat, ‘zero-sum-game’ culture.

I hope you can get clear ‘yes’ or ‘no’ answers on all 7 of these questions from all your prospective employers. Consider a vague or indirect answer as a ‘yes’. Don’t let yourself be persuaded just because you’re less informed than you will be later in your career. Finally, it is never a bad idea to double check with friends or classmates that are now in the quantitative trading industry. Forewarned, you can avoid these 7 problems that I’ve seen through my friends’ experiences, and get to work on the fascinating problems tackled in quantitative trading.

Quantitative Trading Summary

This summary is an attempt to shed some light on modern quantitative trading since there is limited information available for people who are not already in the industry. Hopefully this is useful for students and candidates coming from outside the industry who are looking to understand what it’s like working for a quantitative trading firm. Job titles like “researcher” or “developer” don’t give a clear picture of the day-to-day roles and responsibilities at a trading firm. Most quantitative trading firms have converged on roughly the same basic organizational framework so this is a reasonably accurate description of the roles at any established quantitative trading firm.

The product of a quantitative trading company is an automated software program that buys and sells electronically traded securities to make a profit. This software program is supported by many systems designed to maintain and optimize it. Most companies are roughly divided into 3 main groups: strategy research, core development, and operations. Employees generally start and stay in one of these groups throughout a career. This guide focuses on strategy research and core development roles.

Primary job requirements:

  • Strategy research (‘research’): Programming, statistics, trading intuition, and the ability to understand market data
  • Core development (‘dev’): Low-level software engineering, networking, and system architecture

The software components of a quantitative trading system are built by one of these two teams. The majority of the components are built in-house at most major trading firms, so below is a list of the programs you could expect to build or maintain if you were on the research or dev teams. Each of these programs can be a separate process, although we’ll discuss some variants later.

Programs for live production trading:

  1. Market data parser: Dev. Normalizes each exchange’s protocol (including different versions over time) into the same internal format.
  2. Trading strategy: Research/Dev. Receives normalized data, decides whether to buy or sell.
  3. Order gateway: Dev. Converts from internal order format to each exchange’s order entry protocol (different than the market data protocol).

Programs to support live production trading:

  1. Monitoring GUI: Dev. GUIs used to be important for click traders but are now mainly used to monitor that the trading system is performing appropriately. They are still occasionally used to manually adjust a few parameters, such as overall risk tolerance.
  2. Drop copy: Dev. Secondary order confirmation to make sure you have the trading positions you think you do.
  3. Market data capture: Dev. Record the market data in parallel to what’s going into the strategy to verify later that the strategy behaved as intended and to run statistical tests on historical data (live capture is more reliable than purchasing from a vendor so most major firms avoid buying data from a vendor).
  4. Startup scripts: Dev/Operations. Launch all these different software programs in the right order and at the right time of day each time they need to be restarted (typically daily or weekly), and alert or recover from startup problems.

Programs to optimize and analyze the trading strategy:

  1. Parameter optimization: Research. Regressions or other metrics to help people compare one trading strategy parameter setting to another to find the best.
  2. Production reconciliation: Research. Metrics to confirm that the state in the trading strategy’s internal algorithm matched calculations using captured market data.
  3. Back testing simulator: Research. Shows estimated trading strategy profit or loss on historical data.
  4. Graphing: Dev/Research. Display profit or loss, volume, price and other statistics over time.

In a ‘typical’ established quantitative trading company the department breakdown would be:

  • Research
  • Dev
  • Back office and operations
    • operations/monitoring
    • telco/networking/hardware
    • accounting/HR
    • management/business development
    • legal/compliance

Since we won’t focus on them later, here’s a brief description of the latter groups:

  • Operations/monitoring: Monitor strategies and risk intraday and overnight to ensure there are no problems (like Knight Capital’s +$400m loss).
  • Telco/networking/hardware: Purchase and rack servers, configure switch firmware, operating system settings, and network interface cards or FPGAs, connect co-located datacenters (possibly in different countries), etc.
  • Accounting/HR: Like any business there is tax, accounting, and human resources work
  • Management/business development: There’s a lot of legwork to trading multiple exchanges around the world, such as finding contacts in other countries, negotiating fees, licensing telecom networks, and keeping ahead of new updates.
  • Legal/compliance: Trading is one of the most regulated industries. There are US and international regulators (SEC, CFTC, FCA, etc), huge and diverse rulesets such as MIFID, industry regulating agencies like FINRA, and exchange-level self-regulatory regimes (CME, NYSE, etc) that each have their own rules. Ensuring and documenting compliance with each set of rules takes a lot of work.

Some of the key differentiating factors between quantitative trading companies are:

  1. How they divide up research teams – internal collaboration vs competition between siloed research/trading teams.
  2. Which exchanges and products they focus on.
  3. What type of trading strategy is used and how it’s optimized.

Although we are unable to explain how each specific company divides up their research/trading teams, the overall structure and organization of employees and software at most major quantitative trading firms follows a similar general pattern to what was described above.

Trading strategies

Now that you have a high-level understanding of what a typical quantitative trading company does and the different roles that exist, let’s go into more detail about trading strategies.

The industry has generally settled on three main types of strategies that are sustainable because they provide real economic value to the market:

  • Arbitrage: Arbitrage and its economic benefits have been well understood for quite some time and documented by academia. The companies that are still competitive in arbitrage have one of 3 advantages:
    • Scale: To determine that some complex option or futures spread products are mispriced relative to a set of others, nontrivial calculations must be performed, including the fee per leg, and then the hedged position has to be held and margined until expiry. Being able to manage this and have low fees requires scale.
    • Speed: Speed either comes from having faster telco or being able to hedge. For example, triangular arbitrage on FX products traded in London, NY, and Japan and are a major impetus for the Go West and Hibernia microwave telecom projects. Arbitrageurs rely on the speed of their order gateway connections so they can hedge on related markets if they are overfilled.
    • Queue position: Being able to enter one leg of an arbitrage by passively buying on the bid or selling on the offer reduces costs by not having to cross the spread on that leg, so being able to achieve good queue position can give an edge in arb trades.
  • Market Taking: Placing a marketable buy or sell order to profit from a predicted price change. The economic value market takers are being paid for is either:
    • properly pricing the relative value of related securities
    • trading and thereby contributing to price discovery in products after observed changes in supply and demand

    Like a real estate negotiation that can change a deal’s value minute-by-minute when negotiators come to the table face-to-face and discover each other’s positions, even though the fundamentals of the deal or real estate market certainly don’t fluctuate by the minute, market takers are the high-stakes-mediators of the trading world. Market taking requires predictive signals and relatively low-latency because you pay to cross the spread. A common low-latency market taking strategy would be to attempt to buy the remaining liquidity at a price after a large buy trade. Some firms have FPGAs configured to send orders as soon as they see a trade message matching the right conditions (more on this later).

  • Market Making: Posting passive non-marketable buy and sell orders with the goal to profit from the spread. The economic value market makers are being paid for is connecting buyers and sellers who don’t arrive at the market at the same time. Market makers are compensated for the risk that there may be more buyers than sellers or vice versa for an extended time, such as during times of market stress.

Basic trading system design

A quantitative trading system’s input is market data and its output is orders. In between is the strategy algorithm.

Input

The input to a trading system is tick-by-tick market data. The input is handled in an event loop. The events are the packets sent by the exchange that are read off the network and normalized by the market data parser. Each packet gives information about the current supply and demand for a security and the current price. A packet can tell you one of three things:

  • A limit order was added to the book. Primary fields: {price, side, order id, quantity}
  • A limit order was canceled. Primary fields: {order id}
  • A trade occurred. Primary fields: {price, aggressor side, quantity}

For example, a few packets look like this (for a more detailed, real example see section 4 appendix 1 of this spec):

AddOrder { end_of_packet: 1; seq_number: 103901; symbol_id: 81376629; receive_time: 13:03:46.304606537089; source: 1; side: S; qty: 1; order_id: 210048618; price: 99.25; }

CancelOrder { end_of_packet: 1; seq_number: 103900; symbol_id: 81376629; receive_time: 13:03:41.863834923132; source: 1; qty: 0; order_id: 210048542; price: 99.00; side: S; }

Trade { end_of_packet: 1; seq_number: 103902; symbol_id: 81376629; receive_time: 13:03:46.304606537835; source: 1; aggressor_side: B; qty: 1; order_id: 210048321; price: 99.00; match_id: 20154940; }

If the trading system adds up all the AddOrder packets and subtracts CancelOrder and Trade packets, it can see what the order book currently looks like. The order book shows the aggregate visible supply and demand currently available at each price. The order book is an industry-standard normalization layer.

When you add up all the orders, the order book could look like this:

Sell 10 for $99.25

Sell 5 for $99.00 (best offer)

Buy 10 for $98.75 (best bid)

Buy 10 for $98.50

This is the main view of the market data input used by the strategy algorithm.

Strategy algorithm

To put into practice what we discussed above, let’s outline a market taking strategy utilizing what is often referred to as market micro-structure signals that may have made money back before quantitative trading became very competitive. Some companies have each member of their intern classes program a strategy like this as a teaching project during a summer. This strategy calculates some signals using the order book as input, and buys or sells when the aggregate signals are strong enough.

Market microstructure signals

A signal is an algorithm that takes market data as the input and outputs a theoretical price for a security. Market micro-structure signals generally rely on price, size, and trade data coming directly from data feeds. Please reference the order book state provided previously as we walk through the following signal examples.

  • A basic signal, likely used in some form by most firms, is ‘book pressure’. In this case book pressure is simply (99.00*10 + 98.75*5)/(10+5) = 98.9167. Because there is more demand on the bid, the theoretical price is closer to the offer than the bid. Another way of understanding why this is a valid predictor it is that if buy and sell trades randomly arrive in the market on the bid and offer, then there’s a 2/3 chance of them filling the entire offer before the entire bid, because it’s 2 times bigger, so the expected future price is slightly closer to the offer than the bid.
  • A second basic signal that many quantitative trading firms use is ‘trade impulse’. A common form is to plug trade quantity into something like the book pressure formula, but with the average bid and offer quantity in the denominator instead of the current quantity (let’s say the average is 15). So if there is a sell trade for 9 on this book, the trade impulse would be -0.25*9/15 = -0.15. This example signal would only be valid for the span of 1 packet. Another way of understanding why this is a valid predictor is that sometimes buy and sell trade quantity is autocorrelated over very short intervals, because there are often multiple orders in flight sent in reaction to the same trigger by different people (this is easily measured), so if you see one sell trade, then typically the next order will also be a sell.
  • A third common basic signal is ‘related trade’. Basically, you could just take the same signal as (2), but translate it over from a different security that is highly correlated, by multiplying it by the correlation between them.

The book pressure and trade impulse signal are enough to create a market taking strategy. After the sell trade for 9, the remaining quantity on the book is:

Sell 10 for $99.25

Sell 5 for $99.00 (best offer)

Buy (10-9 = 1) for $98.75 (best bid)

Buy 10 for $98.50

But our theoretical price is = book pressure + trade impulse = (99.00*1 + 98.75*5)/(1+5) + -0.25*9/15 = 98.79167 – 0.15 = $98.64167! Since our theoretical price is below the best bid, we will send an order to sell the last remaining quantity of 1 at $98.75, for a theoretical profit of $0.10833.

That is a high-level overview of a simple quantitative strategy, and provides a basic understanding of the flow from the input (market data) to the output (orders).

Digression: Trade signal on an FPGA

If you ran the market taking strategy from the previous section live in a real trading system, you would likely find that your orders rarely get filled. You want to trade when your theoretical price implies there’s a profitable opportunity, but other trading systems are faster than yours so their orders reach the market first and there’s nothing left for you.

State of the art latency, as of 2017, can be achieved by putting the trading logic on an FPGA. A basic trading system architecture with an FPGA is to have the FPGA connected directly to the exchange and also to the old trading system. The old trading system is now only responsible for calculating hypothetical scenarios. Instead of sending the order, it notifies the FPGA what hypothetical condition needs to be met to send the order. Using the same case as before, it could hypothetically evaluate the signal for a range of trade quantities:

  • Sell trade, quantity = 1…
  • Sell trade, quantity = 2…
  • Sell trade, quantity = 3…
  • Sell trade, quantity = 4…
  • Sell trade, quantity = 5…
  • Sell trade, quantity = 6: (99.00*4 + 98.75*5)/(4+5) + -0.25*6/15 = 98.7611
  • Sell trade, quantity = 7: (99.00*3 + 98.75*5)/(3+5) + -0.25*7/15 = 98.7271
  • Sell trade, quantity = 8…

With any sell trade of quantity 7 or more, the theoretical price would cross below the threshold of the best bid (98.75), indicating a profitable opportunity to trade, so we’d want to send an order to sell the remaining bid. With a trade quantity of 6 or less we wouldn’t want to do anything.

The FPGA is pre-programmed to know the byte layout of the exchange’s trade message, so all it has to do now is wait for the market data, and then check a few bits and send the order. This doesn’t require advanced Verilog. For example, the message from the exchange could look like the following struct:

struct Trade {

uint64_t time;

uint32_t security_id;

uint8_t side;

uint64_t price;

uint32_t quantity;

} __attribute__((packed));

Because of the relative ease of this setup, it has become a very competitive trade – some trading firms can make these types of trade decisions in less than one microsecond. Also, because the FPGA connects directly to the exchange, an additional connection must be purchased for each FPGA, which can get expensive. Unfortunately, if you only have one shared connection, and broadcast data internally with a switch, the switch might introduce too much latency to be competitive. Many companies will now pay for multiple connections which raises their costs significantly.

Digression: ‘Minimum viable trading system’

As I mentioned above, the simple 3-signal trading strategy could have made money several years ago. Even a few years ago, the ‘minimum viable trading system’ that could cover trading fees was simple enough that an individual could build a successful one. Here’s a good article by someone who created their own trading system in 2009, and could be another starting point to understand the basics of automated trading if all of this has gone over your head- http://jspauld.com/post/35126549635/how-i-made-500k-with-machine-learning-and-hft.

This guide only covers, at a high level, trading and work being done by professionals in established quantitative trading firms, so things like co-location, direct connection to the exchange without going through an API, using a high-performance language like C++ for production (never Python, R, Matlab, etc), Linux configuration (processor affinity, NUMA, etc), clock synchronization, etc are taken for granted. These are large and interesting topics which are now well understood inside and outside the industry.

Other strategies besides market micro-structure signals

Market micro-structure signal based strategies, as described above before the two digressions, are just one type of strategy. Here are some other example trading strategy algorithm components used by many major quantitative trading companies:

  • Model based
  • Rule based
    • Only buy or sell during a certain time range
    • Don’t trade against an iceberg order
    • Cancel a resting order if the queue position is worse than 50%
    • Don’t trade if the last 10 trades lost money

Supporting research infrastructure

Now that you have a brief high-level overview of the production trading system, let’s dive deeper into research. The job of a researcher is to optimize the settings of the trading system and to ensure it is behaving properly. Working for an established company, this whole software system will likely already be in place, and your job would be to make it better.

With that in mind, here are some more details about 4 other main software components I listed above that are programmed and used by the research team to optimize and analyze the trading strategy:

  1. Parameter optimization: Most major quantitative trading firms have a combination of signals, model-based pricing, and rule-based logic. Each of these also have parameters. Parameters enable you to tailor a generic strategy to make more money on a specific product, or adapt it over time. For one product, you might want to weight a certain signal higher than another, or you might want to down-weight it as the signal decays. You quickly run into the curse of dimensionality as parameter permutations multiply. One of the main jobs for a researcher is to figure out the optimal settings for everything, or to figure out automated ways of optimizing them. Some approaches include:
    • Manual selection based on intuition
    • Regression for signal weights or hedge ratios
    • Live tweaking or AB testing in production
    • Backtesting different settings and picking the best
  2. Production reconciliation: Sophisticated strategies have many internal components that need to be continually verified in live production trading. Measuring these, monitoring them, and alerting on discrepancies is how researchers make sure things are working as they expected. If the algorithm performs differently in production than it did on historical data, then it may lose money when it was supposed to be profitable.
  3. Backtesting simulator: Plenty of information is available publicly about backtesting, such as the tools available from Quantopian or TradeStation. Simulating a low latency strategy using tick data is challenging. The volume of data to simulate a single day reaches into the 100s of GBs so storing and replaying data requires carefully designed systems.
  4. Graphing: The trading strategy is a mathematical formula in a computer, so debugging it and adding new features can be difficult. Utilizing a Python or JavaScript plotting library to publish custom data and statistics can be helpful. Additionally, it is essential to understand positions and profits or losses during and after the trading day. Graphical representations of different types of data sets makes many tasks easier.

Conclusion

Most people who are new to the industry think that researchers primarily work on new signal development, and developers primarily optimize latency. Hopefully now it’s obvious that the system has so many components that those two jobs are just a few parts of a much wider set of roles and responsibilities. The most important skills for success are actually very close attention to detail, hard work, and trading intuition. On top of that it should be clear that having strong programming skills is essential. All of these systems are tailor-made in-house and have to be constantly tweaked and improved by the users themselves – you.

If you’re interested in joining our team at Headlands, please see our careers page and send your resume to careers@headlandstech.com

Note:

The information above is a collection of some helpful information to shed some light on what a quantitative trading firm does and what you could be doing if you worked at one.  The information, although intended to be helpful to you, should not be relied on and is not represented to be accurate or current. Please note this is by no means an exhaustive description of what goes on at a quantitative trading firm.  Nor should this be taken as covering industry best practices or everything you need to know to start trading quantitatively.  This is simply a very high overview of information I think those considering joining a quantitative trading firm may find useful as they navigate the interview process. 

Appendix: Latency and the timing of events

Similar to the breakdowns by Grace Hopper (https://youtu.be/ZR0ujwlvbkQ?t=45m08s) and Peter Norvig (http://norvig.com/21-days.html#answers), here’s a table of approximately how long things take:

  • Time to receive market data and send an order via an FPGA: ~300 nanoseconds
  • Time to receive market data and send an order via a ‘slow’ software trading system: ~30 microseconds
  • Minimum time between two packets from the exchange: ~10-1000 microseconds
  • Microwave between BATS and INET stock exchanges: ~100 microseconds
  • Fiber between BATS and INET stock exchanges: ~150 microseconds
  • Time for an exchange to match an order and send a response: ~100 microseconds – ~5 milliseconds
  • Microwave between NY and Chicago: ~4 milliseconds
  • Fiber between NY and Chicago: ~7 milliseconds
  • Fiber between NY and European exchanges: ~35 milliseconds

Appendix: Exchange idiosyncrasies

Exchanges almost all use different technology, some which dates back 10+ years. Different technology decisions and antiquated infrastructure have resulted in trading idiosyncrasies. There are many publicly available discussions of the effects of these idiosyncrasies. Here are a few interesting items:

https://www.slideshare.net/EurexGroup/deutsche-brses-t7-insights-into-trading-system-dynamics

https://www.bloomberg.com/news/articles/2017-03-17/currency-traders-race-to-reform-last-look-after-bank-scandals

https://markets.cboe.com/us/equities/market_statistics/order_type_usage/

http://www.cmegroup.com/notices/disciplinary/2016/11/NOTICE-OF-EMERGENCY-ACTION/NYMEX-16-0600-ELDORADO-TRADING-GROUP-LLC.html

https://brillianteyes.wordpress.com/2010/08/28/espeed-trading-procotol/

http://cdn.batstrading.com/resources/membership/CBOE_FUTURES_EXCHANGE_PLATFORM_CHANGE_MATRIX.pdf

https://quantitativebrokers.com/whitepapers

www.wsj.com/articles/SB10001424127887323798104578455032466082920

https://www.wsj.com/articles/SB10001424127887324766604578456783718395580