Opinion: Rationalizing Latency Competition in High-Frequency Trading

Introduction

There is a common misunderstanding, even among practitioners, that low-latency trading is a waste of human talent and resources that could instead go to advancing physics or curing cancer. It’s been attacked by books like Flash Boys, governments trying to pass transaction taxes, and exchanges bending to pressure by implementing speed bumps or periodic batch auctions. This essay argues the positive case for HFT and latency competition based on four main reasons: (1) Low latency trading lowers spreads, (2) Economically significant things do happen on sub-millisecond time scales, (3) HFT is the optimization layer for capitalism, and (4) Markets are not a zero-sum game.

Latency plays a significant role in how humans interact with the world. When we perceive and respond to stimuli, there is an inherent delay between the occurrence of an event and our reaction to it. This delay, or latency, is typically around 200 milliseconds, which is the time it takes for light to enter the eye, be converted into an electrical signal, traverse the brain’s neurons to decide on a response, and finally travel to the muscles to trigger an action.

The implications of this latency extend beyond simple reactions. When we make decisions based on information from various sources, the age of that information can significantly impact the accuracy and effectiveness of our choices. For instance, reading news from a day-old newspaper means basing decisions on information that is 24 hours old. In the past, communication latency was even more pronounced. In 1776, Americans made decisions based on information about Europeans that was around 3 weeks old, due to the time it took for ships to cross the Atlantic.

Latency also plays a role in more mundane interactions, such as negotiating the price of a used car. Both the buyer and seller enter the negotiation with pre-existing knowledge and expectations, but the actual negotiation process involves a rapid exchange of information through verbal and non-verbal cues. This high-frequency interaction helps both parties uncover additional information about the other’s willingness to pay or sell, ultimately leading to a mutually acceptable price.

In essence, latency is present in all human interactions, from simple reactions to complex decision-making processes. Understanding the impact of latency on these interactions is crucial for developing effective strategies to manage and mitigate its effects, both in personal and professional contexts. The field of high-frequency trading has emerged as a means to address latency in financial markets, aiming to facilitate more efficient price discovery and reduce the potential for market distortions.

  1. Latency Increases Spreads

One of the most important ways that HFT benefits markets is by reducing bid-ask spreads. The bid-ask spread is the difference between the highest price a buyer is willing to pay for an asset (the bid) and the lowest price a seller is willing to accept (the ask). Spreads represent the cost of transacting in a market – the wider the spread, the more expensive it is to trade. As a reminder, spreads matter so much because of the famous Coase Theorem from economics. Under the Coase Theorem, in a world with zero transaction costs the allocation of resources will be efficient regardless of the initial distribution of property rights, government tariffs, or organization of firms.

But in the real world, transaction costs are never zero. And the biggest source of transaction costs in electronic markets is the bid-ask spread. The wider the spread, the more friction there is in the market, and the harder it is for prices to efficiently reflect all available information.

Let’s see why spreads get tighter when latency is decreased. Market makers set the spread by posting bids and offers. Let’s take a bid for example. If the market maker is filled on the bid, they will want to hedge their risk on a similar security. Imagine you’re making markets on the S&P 500 futures in Chicago. If your bid is filled you are long S&P500. You need to hedge by selling short S&P 500 stocks in New York. If the latency to New York is high, then you may not be able to fill your hedge orders at a good price. In fact when market makers quote on S&P 500 futures they will use pricing logic to estimate their potential hedge prices, and only quote if there are reliable liquid hedges available in another market. High latency reduces the reliability of hedges.

Let’s plug in some numbers. Say there’s a bid for the ETF SPY for $100 on Nasdaq in New York. Then market makers can bid CME ES futures for $99.99 in Chicago, since if they’re filled, they can hedge by selling to lock in a $0.01 profit in New York. However if latency is high, then by the time their sell order goes over the network from Chicago to NY, the Nasdaq bid might have dropped below $100, so the market maker either makes no profit or has a loss. Prices are volatile so the longer the order is in flight on the network, the more risk there is the price could drop. If latency is very long, market makers might only bid for ES for 99.98. Similarly the offer side will be wider so spreads will be wide.

We can quantify this effect using some concepts from options pricing theory. A market maker’s quote is equivalent to writing an option that the market can execute against. The width of their quotes corresponds to the premium they charge for this option. And one of the key drivers of options premiums is time to expiry – the longer the option lasts, the more risk the option writer takes on, and thus the more expensive the option. In this context, latency is equivalent to time to expiry for a market maker. The longer it takes them to hedge an executed quote, the more risk they are exposed to, and the wider the spreads they will need to charge. Cutting latency is like reducing the lifespan of the options that market makers are implicitly short – it makes their job less risky and enables them to charge lower premiums, i.e., quote tighter spreads. If time to expiry is cut in half, then by Black-Scholes the option price goes down by 30% (sqrt(1) to sqrt(0.5)). Note there are other frictions that contribute to wider spreads, like monopolist exchanges charging high fees, so the 30% reduction would only apply to the liquidity-option-selling aspect of market making.

Even bringing latency down from 100 microseconds to 10 microseconds significantly shrinks the option-selling cost of market making since volatility is spiky in short time intervals. In short time intervals rather than measuring time in wall clock time, volatility is proportional to market data packet time. A lot of market updates can happen in a few 10s of microseconds because market activity is bursty and discontinuous.

Studies from around the world empirically confirm that HFT decreases spreads. The following summarize the conclusions of the SEC, a regulator from ESMA, and Bank of Canada-
SEC: “Algorithmic trading in general, and HFT specifically, increases the accuracy of prices and lowers transaction costs”
European Securities and Markets Authority regulator: “investments in high-frequency trading technology provide positive economic spillovers to the overall market since they reduce transaction costs not only for those who invest in this technology but for all market participants by enhancing the quality of securities markets.”
Bank of Canada: “Passive HFT entry leads to a tightening of the best incumbent bid-ask spread. … incumbents tighten spreads by approximately 0.8 basis points on average.”
Sources:
Gerig, Austin. “High-Frequency Trading Synchonizes Prices in Financial Markets” (2012) https://www.sec.gov/files/dera-wp-hft-synchronizes.pdf
In Clapham, B., Haferkorn, M. & Zimmermann, K. The Impact of High-Frequency Trading on Modern Securities Markets (9/2022), https://link.springer.com/article/10.1007/s12599-022-00768-6.
Jonathan Brogaard, Corey Garriott, Anna Pomeranets, “In High-Frequency Trading Competition”, Working Paper 2014-19, https://www.bankofcanada.ca/wp-content/uploads/2014/05/wp2014-19.pdf

  1. Feedback Loops in Global Supply and Demand Negotiation

HFT plays a crucial role in handling supply and demand negotiations to uncover true prices. The faster these negotiations are resolved, the fewer opportunities there are for feedback loops to form. Feedback loops can result in incorrect prices and even bubbles, but HFT helps prevent these issues by anticipating and rapidly propagating price updates.

Consider a prototypical example of a feedback loop:
1) The price of gold increases in London.
2) 100ms later, people in America see the price increase and raise their prices.
3) 100ms after that, people in London see the price increase in America and consider raising their prices.
4) The feedback loop continues.
HFT breaks these feedback loops by propagating information more quickly and using short term predictive signals to untangle causality. While the example above may seem trivially solvable, reality is exponentially more complex. There are thousands of interacting assets, news being released in different locations, and exogenous forces of supply and demand from anonymous participants buying and selling.

Bad, undampened positive feedback loops manifest as bank runs, flash crashes, and bubbles. When these persist they cause inefficient resource allocations for longer periods of time. Most assets’ intrinsic values don’t impose tight price constraints. For example gold may be partly upper bounded by the cost of mining deeper deposits or lower bounded by its value in industrial applications. Prices can remain incorrect for a long time if they’re allowed to get out of whack initially, especially once people lose track of how the prices got there.

With that background, let’s see why speed and latency matter so much. Suppose that each person has about one financial need per hour. This could be buying a meal, leaving a tip, turning down the AC to save money, flipping a light switch, paying a recurring subscriptions, taxes, etc. With 6 billion people, one transaction per hour (3600 seconds) is about one every millionth of a second, aka one microsecond. HFT systems can react in about 1 microsecond. In reality most transactions an individual makes don’t have enough supply and demand information to move the price.

But we’ve undercounted the number of transactions by a few orders of magnitude. For one, there are many other entities producing financial transactions- bonds, contracts, businesses, etc. Second, a large move in one asset implies price changes in other things. Imagine that a major company announces positive earnings, causing a hedge fund to buy its stock so the price jumps. This price change contains information that is relevant to many other assets – the company’s suppliers, its competitors, maybe even broad market indices – triggering a cascade of trades. HFTs need to be fast enough to handle every single transaction. Since every transaction echoes information back to its source, ideally the entire network-wide update can resolve before any new exogenous information arrives, to avoid confusion.

To emphasize the complexity of identifying the trigger of a trade it helps to know a little bit about the market. Most exchanges are electronic after Covid shuttered the remaining pits. With humans out of the loop, the biggest source of latency is from information traveling long distances. Economies scattered around the planet are interlinked. It takes information tens of milliseconds to travel from NY to London on Hibernia’s fiber optic cable. So when gold prices are moving it can be hard to identify the root cause, because echoes are bouncing back and forth between continents creating interfering waves and sometimes resonating into feedback loops. With Raft shortwave this is a few milliseconds shorter. Luckily when the sun shines on one continent, producing energy and activity from people being awake, most price formation happens there. So when it’s the European trading session, Asia is relaxing after work and America is still sleeping. Feedback loops are always present, but some HFT firms reduce their computation by simply assuming the exchange where it’s daytime has the correct price and the only relevant supply and demand info – but this doesn’t always work since many exchanges are open 23 hours.

Alpha signals can help unravel the causality behind sequences of events, to avoid exacerbating feedback loops. Alphas can also simultaneously produce accurate multi symbol first-order price updates. Since this article is about latency and alphas have unique proprietary variations, we will not elaborate more on alphas.

With each transaction causing ripple effects that need to be processed within microseconds to avoid destabilizing feedback loops, even nanosecond-level improvements benefits market efficiency. By continually mediating supply and demand at high speeds, HFT serves as a stabilizing force in the market. In an increasingly fast-paced and interconnected global economy, this is a necessary service.

  1. Low Level Optimization Layer of Capitalism

Electronic markets are information processing systems with a layered architecture, similar to other systems like the internet and CPUs. Each layer relies on the layers below to abstract away higher frequency tasks. HFT is the second-lowest layer of the market information processing system, comparable to the branch predictor in a CPU.

Branch prediction is not essential for running software, but it makes everything more efficient. Similarly, HFT is not the most important component of the financial system, but it’s worth having a few specialists focusing on it to make everyone else more efficient. This optimization role would likely exist even in alternative economic systems, such as a planned cybernetic socialist economy.

Let’s see the parallels between the layers around CPUs and electronic markets:

CPU:
Application: Internet, email, gaming (milliseconds)
C++: Types, higher-order functions, data structures (microseconds)
Assembly: Simple operations (nanoseconds)
Branch prediction (nanoseconds)
Transistors: Silicon (picoseconds)

Electronic markets:
Investors: fundamental valuation (years)
Hedge funds: earnings prediction (1 quarter)
Stat arb: first-order news and relative value (hours – days)
HFT: supply and demand (microseconds – minutes)
Exchange: transaction database (microseconds)

In the context of capitalism, HFT serves as a low-level optimization layer in the global price discovery process. Capitalism can be viewed as a distributed and decentralized system for resource allocation, in contrast to centrally planned economies. Solving the Nash Equilibrium of an economy is NP-complete (Christos Papadimitriou, 2008). Central planning often leads to accumulated pricing errors and societal collapse, as seen in the Soviet Union and Chile. Committees of experts struggle to determine the costs and supply chains for even basic goods (Leonard Read, 1958). Price discovery is a non-trivial problem.

The HFT industry is relatively small, with total revenue less than $30 billion in most years, even lower costs, and headcount in the single-digit thousands across all firms. Similarly to how there aren’t many Intel CPU engineers or GCC compiler devs compared to the total number of web developers, the HFT industry supports a much larger ecosystem. McDonald’s, Exxon, JP Morgan, etc each have workforces of around 100k, individually dwarfing the entire HFT industry, while benefiting from more accurate agriculture, oil, bond, and FX pricing from the thin HFT layer below.

Some people might say that flip phones were powerful enough, or that calculators used to be fast enough, and yet with each new speedup we enable new capabilities, such as smartphones, video streaming, and AI. In 1943, Thomas Watson, president of IBM, supposedly remarked, “I think there is a world market for maybe five computers”. As HFT improves efficiency, new applications also become feasible, like personal retirement portfolios and placing trades throughout the day without needing to worry about careful timing. We will speculate on more possibilities in the conclusion.

Criticism of HFT as a resource sink or brain drain from other fields may be misplaced, as it plays a vital role in the operating system of capitalism. It functions as the optimization layer, facilitating the efficient operation of this complex, decentralized economic computer.

  1. Not “Winner-takes-all”

Even some people who work in the industry are disillusioned by latency competition. They are generally in one of two groups: people working at smaller, less-profitable firms that worry their lack of traction is due to latency being an “all or nothing game”; and second, people at big firms that feel their micro-optimization work maintains a moat for another year or two but will be replaced shortly and not have any lasting value to humanity. They both think, isn’t it wasteful to invest so much in ASICs in a race to the bottom to shave off 5 nanoseconds to capture a winner-takes-all profit in a zero-sum game? Isn’t it a misallocation of resources for so many companies to build something, for only one to win, and the rest to have squandered resources and time? Even if I’m getting paid well, should I work on something more lasting instead? In this section I’ll explain the nuances of the industry to show how we are in fact all cumulatively building up technology stacks that collectively improve the market.

In reality, companies are mostly rational. Most decided that ASICs aren’t a good investment. For the few that did, one may focus on futures while the other focuses on options, so they could both use them effectively in their own markets. As you should be able to see now, the market has numerous layers and niches, room for many trading bots with different specialties.

There was a time around 2007 when multiple firms tried to build direct fiber optic lines from NY to Chicago to compete on SPY vs ES. This was parodied in the Hollywood blockbuster flop The Hummingbird Project and mischaracterized in Michael Lewis’s Flash Boys. Investments in expensive telecommunications lines like Spread networks resulted in cost overruns for companies including Getco and Chopper. However, those business decisions could be excused because there were numerous complicating strategic factors. They may not have known how many others were racing with them; They couldn’t imagine the complexity of licensing a straight path from the FCC; And few knew breakthroughs in microwave would make the payback period only a few years rather than 20 years.

On the other hand, competition squeezes people to continue improving even when they aren’t yet the bottleneck. Who asked to store 10000 photos on their phone instead of 9000, or for a 1 cent cheaper microwave? No one. Particularly since exchanges are the bottleneck now, taking 10s of microseconds, we may struggle to see how shaving off nanoseconds helps. However, over time, surplus technological capabilities have a way of paving the way for new applications. Exchanges could adopt the technological advances pioneered by HFT firms. With more efficient systems, exchanges could develop new features like implied calculations or new order types. Lower latency means higher throughput, so exchanges could list more products with less hardware. Trading firms have excess capacity so they could handle more products already. Exchanges could pass along cost savings from more efficient technical operations by lowering fees, which would lower spreads further. Unfortunately, exchanges need pressure from regulators to adopt modern systems to lower latency and fees since they are natural monopolies.

Finally, almost all competition looks like it’s winner takes all when you define the space of competition in a narrow enough region, but if you look big picture it’s never that simple. HFTs are diverse. A one nanosecond speedup in one component doesn’t allow anyone to take the whole market. Trading strategies’ reaction latencies vary by trigger type- some may be highly tuned, others may not. Strategies have different signals and risk appetites throughout the day. Exchanges want to commoditize their complements by ensuring multiple traders can make markets profitably, so they add artificial latency variance to order entry network connections. So there are very few situations where markets are truly winner-takes-all.

Conclusion

In this essay, we examined sub-millisecond latency optimization in modern electronic trading, with a focus on three key benefits: reducing spreads, disseminating fair prices, and mediating supply and demand negotiation. We saw how latency reductions allow market makers to manage risk more effectively, leading to tighter spreads that benefit all market participants. We explored HFT’s role in rapidly propagating information across related assets and preventing harmful feedback loops, thereby ensuring that prices remain an accurate reflection of supply and demand. Third, we drew analogies to low level computer optimizations that ultimately enabled new software applications as they chased Moore’s Law. Finally, we challenged the simplistic narrative that HFT is winner-takes-all with most effort going to waste. Latency improvements expand what is possible for markets, enabling innovations like smaller tick sizes and more product listings. Since the industry is profitable, even after accounting for small, wasted investments and opportunity costs, and has technological side benefits, effort on HFT is going to a good purpose. It has no negative externalities like pollution or employee injury, so it doesn’t need taxation or banning.

So what comes next? Will simulation-based models of earth’s 8b human agents, each represented by an AI, be feasible with a few orders of magnitude more compute? Will microeconomics be solved by agent-based simulation rather than finding the intersection of supply and demand curves? Could prediction markets add a layer below the news, Twitter comment wars, and podcast appearances as a more efficient clearinghouse for uncertainties about the present or future? Could debates be settled in bets not words? Will AI agent fiduciaries that know and communicate our entire personal vector of wants, needs, goods, and services instantly cross with others’ for a perfect barter system, without the lossy projection down to a single number, the price? Will electronic trading concepts become so well known that we can expand realtime central limit order books to more markets, like some tech companies have moved toward, such as Amazon for retail, Google for ads, and Uber for rides? We don’t know what improvements technology may enable but hopefully this essay has at least helped explain what currently exists. And why even nanosecond improvements are in fact valuable to society and should be cheered and encouraged rather than criticized.


This article is not an endorsement by Headlands Technologies LLC or any of its affiliates (collectively, “Headlands Tech”) of the papers discussed, their viewpoints or the companies discussed. The views expressed above reflect those of the authors and are not necessarily the views of Headlands Tech. The information presented above is only for informational and educational purposes and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. Additionally, the above information is not intended to provide, and should not be relied upon for investment, accounting, legal or tax advice. Headlands Tech makes no representations, express or implied, regarding the accuracy or completeness of this information, and the reader accepts all risks in relying on the above information for any purpose whatsoever.

Elements of Statistical Learning. 8/10

This is the first in a series of book recommendations for new quants. We will only focus on books that quants commonly come across in their journey.

Elements of Statistical Learning (ESL) is the classic recommendation for new quants, for good reason. However it’s a massive tome and many sections aren’t that useful – reflecting older techniques, the authors’ personal research agendas, or things that aren’t applicable to the trading domain. Chapters 1, 2, 3, and 7 are all great, constituting one of the best foundations for building empirical models from data. The framing of the model-building problem in ESL fits the trading domain better than other common entry points into statistical learning like Bayesian learning, NLP, bandits, convergence proofs, reinforcement learning, or statistical mechanics – which all emphasize assumptions that are inapplicable to trading. Trading is essentially empirical, extracting transient patterns from recent data, neither IID nor even stationary. Trading is a social system that just so happens to be fittable by certain classes of models. One major hurdle for new quants is learning the priors that limit the range of models that are applicable to this domain, and forgetting some of the theoretically appealing assumptions so prevalent in academia.

One major problem with ESL is that it mostly misses the practicalities of building models. It doesn’t touch on developing fast or robust software, processes that minimize divergence between training and inference, modular data cleaning and featurization pipelines, etc which are much more than half the battle when developing new automated quantitative strategies. Google’s Rules of Machine Learning is a good complementary read to fill in most of these gaps: https://developers.google.com/machine-learning/guides/rules-of-ml.

At the bottom of this review is the index of ESL’s 12th edition. This book is available for free at https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf. The highly recommended sections are bolded in blue, sections that could be interesting are in black, and sections that can be safely deprioritized are italicized in red. Blue sections will be applicable on a day-to-day basis, black sections may provide useful tools for thinking about data, and red sections are likely a less valuable use of time than simply studying for class, experimenting with kaggle problems, or prototyping an automated trading strategy yourself.

Recommended unconditionally (Blue): Chapter 1, 2, 3, and 7.

Worthwhile with caveats (Black):

  • Chapter 4: Most trading strategies don’t use classification, because the utility function/decision logic is based on buying and selling along with risk, messaging, and other constraints, which depend on PNL, and don’t fit in the classification framework that usually assumes samples are independent.
  • Chapter 8 & 16: Kaggle’s “no free hunch” blogs have better battle-tested ensembling ideas.
  • Chapters 9, 10, & 15: These are the authors’ main research area so they go into too much detail- in practice people just call xgboost or lightgbm.
  • Chapter 11: ESL is not the best intro to neural nets.
  • Chapters 13 & 14: Unsupervised methods can sometimes be used to engineer features, but are generally minor components.

Safe to skip (Red): Chapter 5, 6, 12, 17, and 18.


1 Introduction 1
2 Overview of Supervised Learning 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Variable Types and Terminology . . . . . . . . . . . . . . 9
2.3 Two Simple Approaches to Prediction:
Least Squares and Nearest Neighbors . . . . . . . . . . . 11
2.3.1 Linear Models and Least Squares . . . . . . . . 11
2.3.2 Nearest-Neighbor Methods . . . . . . . . . . . . 14
2.3.3 From Least Squares to Nearest Neighbors . . . . 16
2.4 Statistical Decision Theory . . . . . . . . . . . . . . . . . 18
2.5 Local Methods in High Dimensions . . . . . . . . . . . . . 22
2.6 Statistical Models, Supervised Learning
and Function Approximation . . . . . . . . . . . . . . . . 28
2.6.1 A Statistical Model
for the Joint Distribution Pr(X, Y ) . . . . . . . 28
2.6.2 Supervised Learning . . . . . . . . . . . . . . . . 29
2.6.3 Function Approximation . . . . . . . . . . . . . 29
2.7 Structured Regression Models . . . . . . . . . . . . . . . 32
2.7.1 Difficulty of the Problem . . . . . . . . . . . . . 32
xiv Contents
2.8 Classes of Restricted Estimators . . . . . . . . . . . . . . 33
2.8.1 Roughness Penalty and Bayesian Methods . . . 34
2.8.2 Kernel Methods and Local Regression . . . . . . 34
2.8.3 Basis Functions and Dictionary Methods . . . . 35
2.9 Model Selection and the Bias–Variance Tradeoff . . . . . 37
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 39
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3 Linear Methods for Regression 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Linear Regression Models and Least Squares . . . . . . . 44
3.2.1 Example: Prostate Cancer . . . . . . . . . . . . 49
3.2.2 The Gauss–Markov Theorem . . . . . . . . . . . 51
3.2.3 Multiple Regression
from Simple Univariate Regression . . . . . . . . 52
3.2.4 Multiple Outputs . . . . . . . . . . . . . . . . . 56
3.3 Subset Selection . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Best-Subset Selection . . . . . . . . . . . . . . . 57
3.3.2 Forward- and Backward-Stepwise Selection . . . 58
3.3.3 Forward-Stagewise Regression . . . . . . . . . . 60
3.3.4 Prostate Cancer Data Example (Continued) . . 61
3.4 Shrinkage Methods . . . . . . . . . . . . . . . . . . . . . . 61
3.4.1 Ridge Regression . . . . . . . . . . . . . . . . . 61
3.4.2 The Lasso . . . . . . . . . . . . . . . . . . . . . 68
3.4.3 Discussion: Subset Selection, Ridge Regression
and the Lasso . . . . . . . . . . . . . . . . . . . 69
3.4.4 Least Angle Regression . . . . . . . . . . . . . . 73
3.5 Methods Using Derived Input Directions . . . . . . . . . 79
3.5.1 Principal Components Regression . . . . . . . . 79
3.5.2 Partial Least Squares . . . . . . . . . . . . . . . 80
3.6 Discussion: A Comparison of the Selection
and Shrinkage Methods . . . . . . . . . . . . . . . . . . . 82
3.7 Multiple Outcome Shrinkage and Selection . . . . . . . . 84
3.8 More on the Lasso and Related Path Algorithms . . . . . 86
3.8.1 Incremental Forward Stagewise Regression . . . 86
3.8.2 Piecewise-Linear Path Algorithms . . . . . . . . 89
3.8.3 The Dantzig Selector . . . . . . . . . . . . . . . 89
3.8.4 The Grouped Lasso . . . . . . . . . . . . . . . . 90
3.8.5 Further Properties of the Lasso . . . . . . . . . . 91
3.8.6 Pathwise Coordinate Optimization . . . . . . . . 92
3.9 Computational Considerations . . . . . . . . . . . . . . . 93
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 94
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Contents xv

4 Linear Methods for Classification 101
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 Linear Regression of an Indicator Matrix . . . . . . . . . 103
4.3 Linear Discriminant Analysis . . . . . . . . . . . . . . . . 106
4.3.1 Regularized Discriminant Analysis . . . . . . . . 112
4.3.2 Computations for LDA . . . . . . . . . . . . . . 113
4.3.3 Reduced-Rank Linear Discriminant Analysis . . 113
4.4 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 119
4.4.1 Fitting Logistic Regression Models . . . . . . . . 120
4.4.2 Example: South African Heart Disease . . . . . 122
4.4.3 Quadratic Approximations and Inference . . . . 124
4.4.4 L1 Regularized Logistic Regression . . . . . . . . 125
4.4.5 Logistic Regression or LDA? . . . . . . . . . . . 127
4.5 Separating Hyperplanes . . . . . . . . . . . . . . . . . . . 129
4.5.1 Rosenblatt’s Perceptron Learning Algorithm . . 130
4.5.2 Optimal Separating Hyperplanes . . . . . . . . . 132
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 135
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5 Basis Expansions and Regularization 139
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.2 Piecewise Polynomials and Splines . . . . . . . . . . . . . 141
5.2.1 Natural Cubic Splines . . . . . . . . . . . . . . . 144
5.2.2 Example: South African Heart Disease (Continued)146
5.2.3 Example: Phoneme Recognition . . . . . . . . . 148
5.3 Filtering and Feature Extraction . . . . . . . . . . . . . . 150
5.4 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . 151
5.4.1 Degrees of Freedom and Smoother Matrices . . . 153
5.5 Automatic Selection of the Smoothing Parameters . . . . 156
5.5.1 Fixing the Degrees of Freedom . . . . . . . . . . 158
5.5.2 The Bias–Variance Tradeoff . . . . . . . . . . . . 158
5.6 Nonparametric Logistic Regression . . . . . . . . . . . . . 161
5.7 Multidimensional Splines . . . . . . . . . . . . . . . . . . 162
5.8 Regularization and Reproducing Kernel Hilbert Spaces . 167
5.8.1 Spaces of Functions Generated by Kernels . . . 168
5.8.2 Examples of RKHS . . . . . . . . . . . . . . . . 170
5.9 Wavelet Smoothing . . . . . . . . . . . . . . . . . . . . . 174
5.9.1 Wavelet Bases and the Wavelet Transform . . . 176
5.9.2 Adaptive Wavelet Filtering . . . . . . . . . . . . 179
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 181
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Appendix: Computational Considerations for Splines . . . . . . 186
Appendix: B-splines . . . . . . . . . . . . . . . . . . . . . 186
Appendix: Computations for Smoothing Splines . . . . . 189
xvi Contents

6 Kernel Smoothing Methods 191
6.1 One-Dimensional Kernel Smoothers . . . . . . . . . . . . 192
6.1.1 Local Linear Regression . . . . . . . . . . . . . . 194
6.1.2 Local Polynomial Regression . . . . . . . . . . . 197
6.2 Selecting the Width of the Kernel . . . . . . . . . . . . . 198
6.3 Local Regression in IRp
. . . . . . . . . . . . . . . . . . . 200
6.4 Structured Local Regression Models in IRp
. . . . . . . . 201
6.4.1 Structured Kernels . . . . . . . . . . . . . . . . . 203
6.4.2 Structured Regression Functions . . . . . . . . . 203
6.5 Local Likelihood and Other Models . . . . . . . . . . . . 205
6.6 Kernel Density Estimation and Classification . . . . . . . 208
6.6.1 Kernel Density Estimation . . . . . . . . . . . . 208
6.6.2 Kernel Density Classification . . . . . . . . . . . 210
6.6.3 The Naive Bayes Classifier . . . . . . . . . . . . 210
6.7 Radial Basis Functions and Kernels . . . . . . . . . . . . 212
6.8 Mixture Models for Density Estimation and Classification 214
6.9 Computational Considerations . . . . . . . . . . . . . . . 216
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 216
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

7 Model Assessment and Selection 219
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 219
7.2 Bias, Variance and Model Complexity . . . . . . . . . . . 219
7.3 The Bias–Variance Decomposition . . . . . . . . . . . . . 223
7.3.1 Example: Bias–Variance Tradeoff . . . . . . . . 226
7.4 Optimism of the Training Error Rate . . . . . . . . . . . 228
7.5 Estimates of In-Sample Prediction Error . . . . . . . . . . 230
7.6 The Effective Number of Parameters . . . . . . . . . . . . 232
7.7 The Bayesian Approach and BIC . . . . . . . . . . . . . . 233
7.8 Minimum Description Length . . . . . . . . . . . . . . . . 235
7.9 Vapnik–Chervonenkis Dimension . . . . . . . . . . . . . . 237
7.9.1 Example (Continued) . . . . . . . . . . . . . . . 239
7.10 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . 241
7.10.1 K-Fold Cross-Validation . . . . . . . . . . . . . 241
7.10.2 The Wrong and Right Way
to Do Cross-validation . . . . . . . . . . . . . . . 245
7.10.3 Does Cross-Validation Really Work? . . . . . . . 247
7.11 Bootstrap Methods . . . . . . . . . . . . . . . . . . . . . 249
7.11.1 Example (Continued) . . . . . . . . . . . . . . . 252
7.12 Conditional or Expected Test Error? . . . . . . . . . . . . 254
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 257
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

8 Model Inference and Averaging 261
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 261
Contents xvii
8.2 The Bootstrap and Maximum Likelihood Methods . . . . 261
8.2.1 A Smoothing Example . . . . . . . . . . . . . . 261
8.2.2 Maximum Likelihood Inference . . . . . . . . . . 265
8.2.3 Bootstrap versus Maximum Likelihood . . . . . 267
8.3 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . 267
8.4 Relationship Between the Bootstrap
and Bayesian Inference . . . . . . . . . . . . . . . . . . . 271
8.5 The EM Algorithm . . . . . . . . . . . . . . . . . . . . . 272
8.5.1 Two-Component Mixture Model . . . . . . . . . 272
8.5.2 The EM Algorithm in General . . . . . . . . . . 276
8.5.3 EM as a Maximization–Maximization Procedure 277
8.6 MCMC for Sampling from the Posterior . . . . . . . . . . 279
8.7 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
8.7.1 Example: Trees with Simulated Data . . . . . . 283
8.8 Model Averaging and Stacking . . . . . . . . . . . . . . . 288
8.9 Stochastic Search: Bumping . . . . . . . . . . . . . . . . . 290
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 292
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
9 Additive Models, Trees, and Related Methods 295
9.1 Generalized Additive Models . . . . . . . . . . . . . . . . 295
9.1.1 Fitting Additive Models . . . . . . . . . . . . . . 297
9.1.2 Example: Additive Logistic Regression . . . . . 299
9.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . 304
9.2 Tree-Based Methods . . . . . . . . . . . . . . . . . . . . . 305
9.2.1 Background . . . . . . . . . . . . . . . . . . . . 305
9.2.2 Regression Trees . . . . . . . . . . . . . . . . . . 307
9.2.3 Classification Trees . . . . . . . . . . . . . . . . 308
9.2.4 Other Issues . . . . . . . . . . . . . . . . . . . . 310
9.2.5 Spam Example (Continued) . . . . . . . . . . . 313
9.3 PRIM: Bump Hunting . . . . . . . . . . . . . . . . . . . . 317
9.3.1 Spam Example (Continued) . . . . . . . . . . . 320
9.4 MARS: Multivariate Adaptive Regression Splines . . . . . 321
9.4.1 Spam Example (Continued) . . . . . . . . . . . 326
9.4.2 Example (Simulated Data) . . . . . . . . . . . . 327
9.4.3 Other Issues . . . . . . . . . . . . . . . . . . . . 328
9.5 Hierarchical Mixtures of Experts . . . . . . . . . . . . . . 329
9.6 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . 332
9.7 Computational Considerations . . . . . . . . . . . . . . . 334
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 334
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
10 Boosting and Additive Trees 337
10.1 Boosting Methods . . . . . . . . . . . . . . . . . . . . . . 337
10.1.1 Outline of This Chapter . . . . . . . . . . . . . . 340
xviii Contents
10.2 Boosting Fits an Additive Model . . . . . . . . . . . . . . 341
10.3 Forward Stagewise Additive Modeling . . . . . . . . . . . 342
10.4 Exponential Loss and AdaBoost . . . . . . . . . . . . . . 343
10.5 Why Exponential Loss? . . . . . . . . . . . . . . . . . . . 345
10.6 Loss Functions and Robustness . . . . . . . . . . . . . . . 346
10.7 “Off-the-Shelf” Procedures for Data Mining . . . . . . . . 350
10.8 Example: Spam Data . . . . . . . . . . . . . . . . . . . . 352
10.9 Boosting Trees . . . . . . . . . . . . . . . . . . . . . . . . 353
10.10 Numerical Optimization via Gradient Boosting . . . . . . 358
10.10.1 Steepest Descent . . . . . . . . . . . . . . . . . . 358
10.10.2 Gradient Boosting . . . . . . . . . . . . . . . . . 359
10.10.3 Implementations of Gradient Boosting . . . . . . 360
10.11 Right-Sized Trees for Boosting . . . . . . . . . . . . . . . 361
10.12 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 364
10.12.1 Shrinkage . . . . . . . . . . . . . . . . . . . . . . 364
10.12.2 Subsampling . . . . . . . . . . . . . . . . . . . . 365
10.13 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . 367
10.13.1 Relative Importance of Predictor Variables . . . 367
10.13.2 Partial Dependence Plots . . . . . . . . . . . . . 369
10.14 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . 371
10.14.1 California Housing . . . . . . . . . . . . . . . . . 371
10.14.2 New Zealand Fish . . . . . . . . . . . . . . . . . 375
10.14.3 Demographics Data . . . . . . . . . . . . . . . . 379
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 380
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
11 Neural Networks 389
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 389
11.2 Projection Pursuit Regression . . . . . . . . . . . . . . . 389
11.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 392
11.4 Fitting Neural Networks . . . . . . . . . . . . . . . . . . . 395
11.5 Some Issues in Training Neural Networks . . . . . . . . . 397
11.5.1 Starting Values . . . . . . . . . . . . . . . . . . . 397
11.5.2 Overfitting . . . . . . . . . . . . . . . . . . . . . 398
11.5.3 Scaling of the Inputs . . . . . . . . . . . . . . . 398
11.5.4 Number of Hidden Units and Layers . . . . . . . 400
11.5.5 Multiple Minima . . . . . . . . . . . . . . . . . . 400
11.6 Example: Simulated Data . . . . . . . . . . . . . . . . . . 401
11.7 Example: ZIP Code Data . . . . . . . . . . . . . . . . . . 404
11.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 408
11.9 Bayesian Neural Nets and the NIPS 2003 Challenge . . . 409
11.9.1 Bayes, Boosting and Bagging . . . . . . . . . . . 410
11.9.2 Performance Comparisons . . . . . . . . . . . . 412
11.10 Computational Considerations . . . . . . . . . . . . . . . 414
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 415
Contents xix
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
12 Support Vector Machines and
Flexible Discriminants 417
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 417
12.2 The Support Vector Classifier . . . . . . . . . . . . . . . . 417
12.2.1 Computing the Support Vector Classifier . . . . 420
12.2.2 Mixture Example (Continued) . . . . . . . . . . 421
12.3 Support Vector Machines and Kernels . . . . . . . . . . . 423
12.3.1 Computing the SVM for Classification . . . . . . 423
12.3.2 The SVM as a Penalization Method . . . . . . . 426
12.3.3 Function Estimation and Reproducing Kernels . 428
12.3.4 SVMs and the Curse of Dimensionality . . . . . 431
12.3.5 A Path Algorithm for the SVM Classifier . . . . 432
12.3.6 Support Vector Machines for Regression . . . . . 434
12.3.7 Regression and Kernels . . . . . . . . . . . . . . 436
12.3.8 Discussion . . . . . . . . . . . . . . . . . . . . . 438
12.4 Generalizing Linear Discriminant Analysis . . . . . . . . 438
12.5 Flexible Discriminant Analysis . . . . . . . . . . . . . . . 440
12.5.1 Computing the FDA Estimates . . . . . . . . . . 444
12.6 Penalized Discriminant Analysis . . . . . . . . . . . . . . 446
12.7 Mixture Discriminant Analysis . . . . . . . . . . . . . . . 449
12.7.1 Example: Waveform Data . . . . . . . . . . . . . 451
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 455
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455

13 Prototype Methods and Nearest-Neighbors 459
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 459
13.2 Prototype Methods . . . . . . . . . . . . . . . . . . . . . 459
13.2.1 K-means Clustering . . . . . . . . . . . . . . . . 460
13.2.2 Learning Vector Quantization . . . . . . . . . . 462
13.2.3 Gaussian Mixtures . . . . . . . . . . . . . . . . . 463
13.3 k-Nearest-Neighbor Classifiers . . . . . . . . . . . . . . . 463
13.3.1 Example: A Comparative Study . . . . . . . . . 468
13.3.2 Example: k-Nearest-Neighbors
and Image Scene Classification . . . . . . . . . . 470
13.3.3 Invariant Metrics and Tangent Distance . . . . . 471
13.4 Adaptive Nearest-Neighbor Methods . . . . . . . . . . . . 475
13.4.1 Example . . . . . . . . . . . . . . . . . . . . . . 478
13.4.2 Global Dimension Reduction
for Nearest-Neighbors . . . . . . . . . . . . . . . 479
13.5 Computational Considerations . . . . . . . . . . . . . . . 480
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 481
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
xx Contents
14 Unsupervised Learning 485
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 485
14.2 Association Rules . . . . . . . . . . . . . . . . . . . . . . 487
14.2.1 Market Basket Analysis . . . . . . . . . . . . . . 488
14.2.2 The Apriori Algorithm . . . . . . . . . . . . . . 489
14.2.3 Example: Market Basket Analysis . . . . . . . . 492
14.2.4 Unsupervised as Supervised Learning . . . . . . 495
14.2.5 Generalized Association Rules . . . . . . . . . . 497
14.2.6 Choice of Supervised Learning Method . . . . . 499
14.2.7 Example: Market Basket Analysis (Continued) . 499
14.3 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . 501
14.3.1 Proximity Matrices . . . . . . . . . . . . . . . . 503
14.3.2 Dissimilarities Based on Attributes . . . . . . . 503
14.3.3 Object Dissimilarity . . . . . . . . . . . . . . . . 505
14.3.4 Clustering Algorithms . . . . . . . . . . . . . . . 507
14.3.5 Combinatorial Algorithms . . . . . . . . . . . . 507
14.3.6 K-means . . . . . . . . . . . . . . . . . . . . . . 509
14.3.7 Gaussian Mixtures as Soft K-means Clustering . 510
14.3.8 Example: Human Tumor Microarray Data . . . 512
14.3.9 Vector Quantization . . . . . . . . . . . . . . . . 514
14.3.10 K-medoids . . . . . . . . . . . . . . . . . . . . . 515
14.3.11 Practical Issues . . . . . . . . . . . . . . . . . . 518
14.3.12 Hierarchical Clustering . . . . . . . . . . . . . . 520
14.4 Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . 528
14.5 Principal Components, Curves and Surfaces . . . . . . . . 534
14.5.1 Principal Components . . . . . . . . . . . . . . . 534
14.5.2 Principal Curves and Surfaces . . . . . . . . . . 541
14.5.3 Spectral Clustering . . . . . . . . . . . . . . . . 544
14.5.4 Kernel Principal Components . . . . . . . . . . . 547
14.5.5 Sparse Principal Components . . . . . . . . . . . 550
14.6 Non-negative Matrix Factorization . . . . . . . . . . . . . 553
14.6.1 Archetypal Analysis . . . . . . . . . . . . . . . . 554
14.7 Independent Component Analysis
and Exploratory Projection Pursuit . . . . . . . . . . . . 557
14.7.1 Latent Variables and Factor Analysis . . . . . . 558
14.7.2 Independent Component Analysis . . . . . . . . 560
14.7.3 Exploratory Projection Pursuit . . . . . . . . . . 565
14.7.4 A Direct Approach to ICA . . . . . . . . . . . . 565
14.8 Multidimensional Scaling . . . . . . . . . . . . . . . . . . 570
14.9 Nonlinear Dimension Reduction
and Local Multidimensional Scaling . . . . . . . . . . . . 572
14.10 The Google PageRank Algorithm . . . . . . . . . . . . . 576
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 578
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
Contents xxi
15 Random Forests 587
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.2 Definition of Random Forests . . . . . . . . . . . . . . . . 587
15.3 Details of Random Forests . . . . . . . . . . . . . . . . . 592
15.3.1 Out of Bag Samples . . . . . . . . . . . . . . . . 592
15.3.2 Variable Importance . . . . . . . . . . . . . . . . 593
15.3.3 Proximity Plots . . . . . . . . . . . . . . . . . . 595
15.3.4 Random Forests and Overfitting . . . . . . . . . 596
15.4 Analysis of Random Forests . . . . . . . . . . . . . . . . . 597
15.4.1 Variance and the De-Correlation Effect . . . . . 597
15.4.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . 600
15.4.3 Adaptive Nearest Neighbors . . . . . . . . . . . 601
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 602
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
16 Ensemble Learning 605
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 605
16.2 Boosting and Regularization Paths . . . . . . . . . . . . . 607
16.2.1 Penalized Regression . . . . . . . . . . . . . . . 607
16.2.2 The “Bet on Sparsity” Principle . . . . . . . . . 610
16.2.3 Regularization Paths, Over-fitting and Margins . 613
16.3 Learning Ensembles . . . . . . . . . . . . . . . . . . . . . 616
16.3.1 Learning a Good Ensemble . . . . . . . . . . . . 617
16.3.2 Rule Ensembles . . . . . . . . . . . . . . . . . . 622
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 623
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
17 Undirected Graphical Models 625
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 625
17.2 Markov Graphs and Their Properties . . . . . . . . . . . 627
17.3 Undirected Graphical Models for Continuous Variables . 630
17.3.1 Estimation of the Parameters
when the Graph Structure is Known . . . . . . . 631
17.3.2 Estimation of the Graph Structure . . . . . . . . 635
17.4 Undirected Graphical Models for Discrete Variables . . . 638
17.4.1 Estimation of the Parameters
when the Graph Structure is Known . . . . . . . 639
17.4.2 Hidden Nodes . . . . . . . . . . . . . . . . . . . 641
17.4.3 Estimation of the Graph Structure . . . . . . . . 642
17.4.4 Restricted Boltzmann Machines . . . . . . . . . 643
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645

18 High-Dimensional Problems: p ≫ N 649
18.1 When p is Much Bigger than N . . . . . . . . . . . . . . 649
xxii Contents
18.2 Diagonal Linear Discriminant Analysis
and Nearest Shrunken Centroids . . . . . . . . . . . . . . 651
18.3 Linear Classifiers with Quadratic Regularization . . . . . 654
18.3.1 Regularized Discriminant Analysis . . . . . . . . 656
18.3.2 Logistic Regression
with Quadratic Regularization . . . . . . . . . . 657
18.3.3 The Support Vector Classifier . . . . . . . . . . 657
18.3.4 Feature Selection . . . . . . . . . . . . . . . . . . 658
18.3.5 Computational Shortcuts When p ≫ N . . . . . 659
18.4 Linear Classifiers with L1 Regularization . . . . . . . . . 661
18.4.1 Application of Lasso
to Protein Mass Spectroscopy . . . . . . . . . . 664
18.4.2 The Fused Lasso for Functional Data . . . . . . 666
18.5 Classification When Features are Unavailable . . . . . . . 668
18.5.1 Example: String Kernels
and Protein Classification . . . . . . . . . . . . . 668
18.5.2 Classification and Other Models Using
Inner-Product Kernels and Pairwise Distances . 670
18.5.3 Example: Abstracts Classification . . . . . . . . 672
18.6 High-Dimensional Regression:
Supervised Principal Components . . . . . . . . . . . . . 674
18.6.1 Connection to Latent-Variable Modeling . . . . 678
18.6.2 Relationship with Partial Least Squares . . . . . 680
18.6.3 Pre-Conditioning for Feature Selection . . . . . 681
18.7 Feature Assessment and the Multiple-Testing Problem . . 683
18.7.1 The False Discovery Rate . . . . . . . . . . . . . 687
18.7.2 Asymmetric Cutpoints and the SAM Procedure 690
18.7.3 A Bayesian Interpretation of the FDR . . . . . . 692
18.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . 693
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694

References 699
Author Index 729
Index 737


This article is not an endorsement by Headlands Technologies LLC or any of its affiliates (collectively, “Headlands Tech”) of the papers discussed, their viewpoints or the companies discussed. The views expressed above reflect those of the authors and are not necessarily the views of Headlands Tech. The information presented above is only for informational and educational purposes and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. Additionally, the above information is not intended to provide, and should not be relied upon for investment, accounting, legal or tax advice. Headlands Tech makes no representations, express or implied, regarding the accuracy or completeness of this information, and the reader accepts all risks in relying on the above information for any purpose whatsoever.

Successfully Starting a Career in Quant Research

People setting out to pursue a career in the quantitative trading industry won’t always know what questions they should be asking of prospective employers. As a result, they are susceptible to making a mistake in choosing their first job. That mistake could lead to later giving up on the industry, when in fact you may be well suited for success. Without experience, it is easy to make an error; but I want to provide some guidance to help you avoid some common pitfalls.

Here’s a checklist of key structural questions you’ll want to understand before choosing your first firm. These questions will help you understand how well you’ll be able to learn and grow in a given company. You’ll want to learn and grow throughout your entire career, but it is especially important at the beginning when you have the least experience.

Ideally, for each of these seven questions you’ll want your first company’s answer to be “no”. “Yes” answers indicate additional challenges for you as you build your career. “Yes” answers don’t necessarily indicate a bad job, but they do indicate additional layers of risk. Below the list, I’ve provided some additional context.

  1. Are brainteasers, gambling, poker, or mental math questions used in the interview process?
  2. Will you have a 2 year noncompete?
  3. Will you be blocked from accessing any part of the source code?
  4. As a researcher, will you be the primary on-call trader monitoring any live trading processes?
  5. Will you be blocked from viewing the PnL of any strategies that utilize your research?
  6. Are strategy parameters manually changed based on judgment calls during the day?
  7. Are there other employees in the company in direct competition with you?
  1. Are brainteasers, gambling, poker, or mental math questions used in the interview?
    If a company asks these types of questions, it is potentially a sign they value manual, non-automated trader intuition and decision making more than quantitative, algorithmic, and research-driven approaches. If your background is quantitative, you will want a company that will value those skills the most highly. At ‘trader’-centric firms, quants have less influence over the trading strategy and more limited career prospects.
  2. Will you have a 2 year (or greater) noncompete?
    Non-compete agreements are a fact of life for quants working in the trading industry, but the lengths of those agreements vary widely (the standard term is 18 months). Non-competes exist to help protect the intellectual property you’ll be developing. However, 18 months should be sufficient to protect your work. Longer terms can sometimes be designed not to protect the company as much as to limit the employee’s career options. Even if you join a good company, there’s always a chance of ending up in a weak team, or with an inexperienced manager, being paid unfairly, or simply not fitting in perfectly. With a 2 year noncompete, other companies may be much less willing to hire you.
  3. Will you be blocked from accessing any part of the source code?
    Some companies encrypt or password-protect parts of their source code. This goes beyond taking adequate steps to protect proprietary property, like securing an internal filesystem from external intruders or preventing employees from copying files off the company network- these companies even prevent their own full-time employees from seeing parts of the existing codebase. To people outside the industry the idea of encryption might sound sufficiently strange that they wouldn’t think to ask. However, it is actually quite common in quantitative trading firms. Having parts of the source code blocked limits your ability to learn, collaborate with coworkers, and make an impact.
  4. As a researcher, will you be the primary on-call trader monitoring any live trading processes?
    Companies that highly value research will have separate dedicated operations and trading teams to handle the majority of the routine day-to-day tasks of running and monitoring an automated trading system.
    At some companies, often those with roots as floor traders or click traders, or those that struggle to manage unstable operations processes, ‘quant traders’ are expected to do both research and operations. While this might seem exciting at first, monitoring live trading, monitoring system health, and ensuring system functionality is a full time responsibility and will severely reduce your time available to concentrate and do high quality research.
  5. Will you be blocked from viewing the PnL (revenue) of any strategies that utilize your research?
    One of the main attractions of working in trading is the fast feedback you get on your research. You can think of an idea, implement the idea, and then see the results within a few days. This tight feedback loop compares favorably to, say, a Physics department, where a single idea could take years to validate. However, some companies separate alpha signal researchers from strategy developers. If you’re separated from the trading PnL, then you can’t get real-time feedback. Companies might do this to prevent their secrets from leaking out easily, but there are plenty of successful companies that trust their employees and encourage loyalty in other ways.
  6. Are strategy parameters manually changed based on judgment calls during the day?
    It’s almost impossible to do proper statistical analysis of your ideas on historic data if ‘click-traders’ are tweaking parameters, because you can’t model the human element. That kind of company is a good place to be a ‘click-trader’ – not a quant.
  7. Are there other employees in the company in direct competition with you?
    Okay, final question! This is important to ask, because some companies have employees or teams directly competing with each other so that the company is diversified in its revenue streams. But for your career, you want your company to invest fully in you. If there are competing teams, you don’t know whether you’ll end up on one that eventually wins- or loses. There will also be fewer opportunities to learn from others; you don’t get the benefit of collaboration with the widest group of other quants. Finally, this type of structure tends to foster a cut-throat, ‘zero-sum-game’ culture.

I hope you can get clear ‘yes’ or ‘no’ answers on all 7 of these questions from all your prospective employers. Consider a vague or indirect answer as a ‘yes’. Don’t let yourself be persuaded just because you’re less informed than you will be later in your career. Finally, it is never a bad idea to double check with friends or classmates that are now in the quantitative trading industry. Forewarned, you can avoid these 7 problems that I’ve seen through my friends’ experiences, and get to work on the fascinating problems tackled in quantitative trading.


This article is not an endorsement by Headlands Technologies LLC or any of its affiliates (collectively, “Headlands Tech”) of the papers discussed, their viewpoints or the companies discussed. The views expressed above reflect those of the authors and are not necessarily the views of Headlands Tech. The information presented above is only for informational and educational purposes and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. Additionally, the above information is not intended to provide, and should not be relied upon for investment, accounting, legal or tax advice. Headlands Tech makes no representations, express or implied, regarding the accuracy or completeness of this information, and the reader accepts all risks in relying on the above information for any purpose whatsoever.

Quantitative Trading Summary

(pdf version)

This summary is an attempt to shed some light on modern quantitative trading since there is limited information available for people who are not already in the industry. Hopefully this is useful for students and candidates coming from outside the industry who are looking to understand what it’s like working for a quantitative trading firm. Job titles like “researcher” or “developer” don’t give a clear picture of the day-to-day roles and responsibilities at a trading firm. Most quantitative trading firms have converged on roughly the same basic organizational framework so this is a reasonably accurate description of the roles at any established quantitative trading firm.

The product of a quantitative trading company is an automated software program that buys and sells electronically traded securities to make a profit. This software program is supported by many systems designed to maintain and optimize it. Most companies are roughly divided into 3 main groups: strategy research, core development, and operations. Employees generally start and stay in one of these groups throughout a career. This guide focuses on strategy research and core development roles.

Primary job requirements:

  • Strategy research (‘research’): Programming, statistics, trading intuition, and the ability to understand market data
  • Core development (‘dev’): Low-level software engineering, networking, and system architecture

The software components of a quantitative trading system are built by one of these two teams. The majority of the components are built in-house at most major trading firms, so below is a list of the programs you could expect to build or maintain if you were on the research or dev teams. Each of these programs can be a separate process, although we’ll discuss some variants later.

Programs for live production trading:

  1. Market data parser: Dev. Normalizes each exchange’s protocol (including different versions over time) into the same internal format.
  2. Trading strategy: Research/Dev. Receives normalized data, decides whether to buy or sell.
  3. Order gateway: Dev. Converts from internal order format to each exchange’s order entry protocol (different than the market data protocol).

Programs to support live production trading:

  1. Monitoring GUI: Dev. GUIs used to be important for click traders but are now mainly used to monitor that the trading system is performing appropriately. They are still occasionally used to manually adjust a few parameters, such as overall risk tolerance.
  2. Drop copy: Dev. Secondary order confirmation to make sure you have the trading positions you think you do.
  3. Market data capture: Dev. Record the market data in parallel to what’s going into the strategy to verify later that the strategy behaved as intended and to run statistical tests on historical data (live capture is more reliable than purchasing from a vendor so most major firms avoid buying data from a vendor).
  4. Startup scripts: Dev/Operations. Launch all these different software programs in the right order and at the right time of day each time they need to be restarted (typically daily or weekly), and alert or recover from startup problems.

Programs to optimize and analyze the trading strategy:

  1. Parameter optimization: Research. Regressions or other metrics to help people compare one trading strategy parameter setting to another to find the best.
  2. Production reconciliation: Research. Metrics to confirm that the state in the trading strategy’s internal algorithm matched calculations using captured market data.
  3. Back testing simulator: Research. Shows estimated trading strategy profit or loss on historical data.
  4. Graphing: Dev/Research. Display profit or loss, volume, price and other statistics over time.

In a ‘typical’ established quantitative trading company the department breakdown would be:

  • Research
  • Dev
  • Back office and operations
    • operations/monitoring
    • telco/networking/hardware
    • accounting/HR
    • management/business development
    • legal/compliance

Since we won’t focus on them later, here’s a brief description of the latter groups:

  • Operations/monitoring: Monitor strategies and risk intraday and overnight to ensure there are no problems (like Knight Capital’s +$400m loss).
  • Telco/networking/hardware: Purchase and rack servers, configure switch firmware, operating system settings, and network interface cards, connect co-located datacenters (possibly in different countries), etc.
  • Accounting/HR: Like any business there is tax, accounting, and human resources work
  • Management/business development: There’s a lot of legwork to trading multiple exchanges around the world, such as finding contacts in other countries, negotiating fees, licensing telecom networks, and keeping ahead of new updates.
  • Legal/compliance: Trading is one of the most regulated industries. There are US and international regulators (SEC, CFTC, FCA, etc), huge and diverse rulesets such as MIFID, industry regulating agencies like FINRA, and exchange-level self-regulatory regimes (CME, NYSE, etc) that each have their own rules. Ensuring and documenting compliance with each set of rules takes a lot of work.

Some of the key differentiating factors between quantitative trading companies are:

  1. How they divide up research teams – internal collaboration vs competition between siloed research/trading teams.
  2. Which exchanges and products they focus on.
  3. What type of trading strategy is used and how it’s optimized.

Although we are unable to explain how each specific company divides up their research/trading teams, the overall structure and organization of employees and software at most major quantitative trading firms follows a similar general pattern to what was described above.

Trading strategies

Now that you have a high-level understanding of what a typical quantitative trading company does and the different roles that exist, let’s go into more detail about trading strategies.

The industry has generally settled on three main types of strategies that are sustainable because they provide real economic value to the market:

  • Arbitrage: Arbitrage and its economic benefits have been well understood for quite some time and documented by academia. The companies that are still competitive in arbitrage have one of 3 advantages:
    • Scale: To determine that some complex option or futures spread products are mispriced relative to a set of others, nontrivial calculations must be performed, including the fee per leg, and then the hedged position has to be held and margined until expiry. Being able to manage this and have low fees requires scale.
    • Speed: Speed either comes from having faster telco or being able to hedge. For example, triangular arbitrage on FX products traded in London, NY, and Japan and are a major impetus for the Go West and Hibernia microwave telecom projects. Arbitrageurs rely on the speed of their order gateway connections so they can hedge on related markets if they are overfilled.
    • Queue position: Being able to enter one leg of an arbitrage by passively buying on the bid or selling on the offer reduces costs by not having to cross the spread on that leg, so being able to achieve good queue position can give an edge in arb trades.
  • Market Taking: Placing a marketable buy or sell order to profit from a predicted price change. The economic value market takers are being paid for is either:
    • properly pricing the relative value of related securities
    • trading and thereby contributing to price discovery in products after observed changes in supply and demand

    Like a real estate negotiation that can change a deal’s value minute-by-minute when negotiators come to the table face-to-face and discover each other’s positions, even though the fundamentals of the deal or real estate market certainly don’t fluctuate by the minute, market takers are the high-stakes-mediators of the trading world. Market taking requires predictive signals and relatively low-latency because you pay to cross the spread. A common low-latency market taking strategy would be to attempt to buy the remaining liquidity at a price after a large buy trade.

  • Market Making: Posting passive non-marketable buy and sell orders with the goal to profit from the spread. The economic value market makers are being paid for is connecting buyers and sellers who don’t arrive at the market at the same time. Market makers are compensated for the risk that there may be more buyers than sellers or vice versa for an extended time, such as during times of market stress.

Basic trading system design

A quantitative trading system’s input is market data and its output is orders. In between is the strategy algorithm.

Input

The input to a trading system is tick-by-tick market data. The input is handled in an event loop. The events are the packets sent by the exchange that are read off the network and normalized by the market data parser. Each packet gives information about the current supply and demand for a security and the current price. A packet can tell you one of three things:

  • A limit order was added to the book. Primary fields: {price, side, order id, quantity}
  • A limit order was canceled. Primary fields: {order id}
  • A trade occurred. Primary fields: {price, aggressor side, quantity}

For example, a few packets look like this (for a more detailed, real example see section 4 appendix 1 of this spec):

AddOrder { end_of_packet: 1; seq_number: 103901; symbol_id: 81376629; receive_time: 13:03:46.304606537089; source: 1; side: S; qty: 1; order_id: 210048618; price: 99.25; }

CancelOrder { end_of_packet: 1; seq_number: 103900; symbol_id: 81376629; receive_time: 13:03:41.863834923132; source: 1; qty: 0; order_id: 210048542; price: 99.00; side: S; }

Trade { end_of_packet: 1; seq_number: 103902; symbol_id: 81376629; receive_time: 13:03:46.304606537835; source: 1; aggressor_side: B; qty: 1; order_id: 210048321; price: 99.00; match_id: 20154940; }

If the trading system adds up all the AddOrder packets and subtracts CancelOrder and Trade packets, it can see what the order book currently looks like. The order book shows the aggregate visible supply and demand currently available at each price. The order book is an industry-standard normalization layer.

When you add up all the orders, the order book could look like this:

Sell 10 for $99.25

Sell 5 for $99.00 (best offer)

Buy 10 for $98.75 (best bid)

Buy 10 for $98.50

This is the main view of the market data input used by the strategy algorithm.

Strategy algorithm

To put into practice what we discussed above, let’s outline a market taking strategy utilizing what is often referred to as market micro-structure signals that may have made money back before quantitative trading became very competitive. Some companies have each member of their intern classes program a strategy like this as a teaching project during a summer. This strategy calculates some signals using the order book as input, and buys or sells when the aggregate signals are strong enough.

Market microstructure signals

A signal is an algorithm that takes market data as the input and outputs a theoretical price for a security. Market micro-structure signals generally rely on price, size, and trade data coming directly from data feeds. Please reference the order book state provided previously as we walk through the following signal examples.

  • A basic signal, likely used in some form by most firms, is ‘book pressure’. In this case book pressure is simply (99.00*10 + 98.75*5)/(10+5) = 98.9167. Because there is more demand on the bid, the theoretical price is closer to the offer than the bid. Another way of understanding why this is a valid predictor it is that if buy and sell trades randomly arrive in the market on the bid and offer, then there’s a 2/3 chance of them filling the entire offer before the entire bid, because it’s 2 times bigger, so the expected future price is slightly closer to the offer than the bid.
  • A second basic signal that many quantitative trading firms use is ‘trade impulse’. A common form is to plug trade quantity into something like the book pressure formula, but with the average bid and offer quantity in the denominator instead of the current quantity (let’s say the average is 15). So if there is a sell trade for 9 on this book, the trade impulse would be -0.25*9/15 = -0.15. This example signal would only be valid for the span of 1 packet. Another way of understanding why this is a valid predictor is that sometimes buy and sell trade quantity is autocorrelated over very short intervals, because there are often multiple orders in flight sent in reaction to the same trigger by different people (this is easily measured), so if you see one sell trade, then typically the next order will also be a sell.
  • A third common basic signal is ‘related trade’. Basically, you could just take the same signal as (2), but translate it over from a different security that is highly correlated, by multiplying it by the correlation between them.

The book pressure and trade impulse signal are enough to create a market taking strategy. After the sell trade for 9, the remaining quantity on the book is:

Sell 10 for $99.25

Sell 5 for $99.00 (best offer)

Buy (10-9 = 1) for $98.75 (best bid)

Buy 10 for $98.50

But our theoretical price is = book pressure + trade impulse = (99.00*1 + 98.75*5)/(1+5) + -0.25*9/15 = 98.79167 – 0.15 = $98.64167! Since our theoretical price is below the best bid, we will send an order to sell the last remaining quantity of 1 at $98.75, for a theoretical profit of $0.10833.

That is a high-level overview of a simple quantitative strategy, and provides a basic understanding of the flow from the input (market data) to the output (orders).

Digression: ‘Minimum viable trading system’

The simple 3-signal trading strategy could have made money several years ago. Even a few years ago, the ‘minimum viable trading system’ that could cover trading fees was simple enough that an individual could build a successful one. Here’s a good article by someone who created their own trading system in 2009, and could be another starting point to understand the basics of automated trading if all of this has gone over your head- http://jspauld.com/post/35126549635/how-i-made-500k-with-machine-learning-and-hft.

This guide only covers, at a high level, trading and work being done by professionals in established quantitative trading firms, so things like co-location, direct connection to the exchange without going through an API, using a high-performance language like C++ for production (never Python, R, Matlab, etc), Linux configuration (processor affinity, NUMA, etc), clock synchronization, etc are taken for granted. These are large and interesting topics which are now well understood inside and outside the industry.

Other strategies besides market micro-structure signals

Market micro-structure signal based strategies, as described above before the two digressions, are just one type of strategy. Here are some other example trading strategy algorithm components used by many major quantitative trading companies:

  • Model based
  • Rule based
    • Only buy or sell during a certain time range
    • Don’t trade against an iceberg order
    • Cancel a resting order if the queue position is worse than 50%
    • Don’t trade if the last 10 trades lost money

Supporting research infrastructure

Now that you have a brief high-level overview of the production trading system, let’s dive deeper into research. The job of a researcher is to optimize the settings of the trading system and to ensure it is behaving properly. Working for an established company, this whole software system will likely already be in place, and your job would be to make it better.

With that in mind, here are some more details about 4 other main software components I listed above that are programmed and used by the research team to optimize and analyze the trading strategy:

  1. Parameter optimization: Most major quantitative trading firms have a combination of signals, model-based pricing, and rule-based logic. Each of these also have parameters. Parameters enable you to tailor a generic strategy to make more money on a specific product, or adapt it over time. For one product, you might want to weight a certain signal higher than another, or you might want to down-weight it as the signal decays. You quickly run into the curse of dimensionality as parameter permutations multiply. One of the main jobs for a researcher is to figure out the optimal settings for everything, or to figure out automated ways of optimizing them. Some approaches include:
    • Manual selection based on intuition
    • Regression for signal weights or hedge ratios
    • Live tweaking or AB testing in production
    • Backtesting different settings and picking the best
  2. Production reconciliation: Sophisticated strategies have many internal components that need to be continually verified in live production trading. Measuring these, monitoring them, and alerting on discrepancies is how researchers make sure things are working as they expected. If the algorithm performs differently in production than it did on historical data, then it may lose money when it was supposed to be profitable.
  3. Backtesting simulator: Plenty of information is available publicly about backtesting, such as the tools available from Quantopian or TradeStation. Simulating a low latency strategy using tick data is challenging. The volume of data to simulate a single day reaches into the 100s of GBs so storing and replaying data requires carefully designed systems.
  4. Graphing: The trading strategy is a mathematical formula in a computer, so debugging it and adding new features can be difficult. Utilizing a Python or JavaScript plotting library to publish custom data and statistics can be helpful. Additionally, it is essential to understand positions and profits or losses during and after the trading day. Graphical representations of different types of data sets makes many tasks easier.

Conclusion

Most people who are new to the industry think that researchers primarily work on new signal development, and developers primarily optimize latency. Hopefully now it’s obvious that the system has so many components that those two jobs are just a few parts of a much wider set of roles and responsibilities. The most important skills for success are actually very close attention to detail, hard work, and trading intuition. On top of that it should be clear that having strong programming skills is essential. All of these systems are tailor-made in-house and have to be constantly tweaked and improved by the users themselves – you.

If you’re interested in joining our team at Headlands, please see our careers page and send your resume to careers@headlandstech.com

Note:

The information above is a collection of some helpful information to shed some light on what a quantitative trading firm does and what you could be doing if you worked at one.  The information, although intended to be helpful to you, should not be relied on and is not represented to be accurate or current. Please note this is by no means an exhaustive description of what goes on at a quantitative trading firm.  Nor should this be taken as covering industry best practices or everything you need to know to start trading quantitatively.  This is simply a very high overview of information I think those considering joining a quantitative trading firm may find useful as they navigate the interview process. 

Appendix: Latency and the timing of events

Similar to the breakdowns by Grace Hopper (https://youtu.be/ZR0ujwlvbkQ?t=45m08s) and Peter Norvig (http://norvig.com/21-days.html#answers), here’s a table of approximately how long things take:

  • Time to receive market data and send an order via a ‘slow’ software trading system: ~30 microseconds
  • Minimum time between two packets from the exchange: ~10-1000 microseconds
  • Microwave between BATS and INET stock exchanges: ~100 microseconds
  • Fiber between BATS and INET stock exchanges: ~150 microseconds
  • Time for an exchange to match an order and send a response: ~100 microseconds – ~5 milliseconds
  • Microwave between NY and Chicago: ~4 milliseconds
  • Fiber between NY and Chicago: ~7 milliseconds
  • Fiber between NY and European exchanges: ~35 milliseconds

Appendix: Exchange idiosyncrasies

Exchanges almost all use different technology, some which dates back 10+ years. Different technology decisions and antiquated infrastructure have resulted in trading idiosyncrasies. There are many publicly available discussions of the effects of these idiosyncrasies. Here are a few interesting items:

https://www.slideshare.net/EurexGroup/deutsche-brses-t7-insights-into-trading-system-dynamics

https://www.bloomberg.com/news/articles/2017-03-17/currency-traders-race-to-reform-last-look-after-bank-scandals

https://markets.cboe.com/us/equities/market_statistics/order_type_usage/

http://www.cmegroup.com/notices/disciplinary/2016/11/NOTICE-OF-EMERGENCY-ACTION/NYMEX-16-0600-ELDORADO-TRADING-GROUP-LLC.html

https://brillianteyes.wordpress.com/2010/08/28/espeed-trading-procotol/

http://cdn.batstrading.com/resources/membership/CBOE_FUTURES_EXCHANGE_PLATFORM_CHANGE_MATRIX.pdf

https://quantitativebrokers.com/whitepapers

www.wsj.com/articles/SB10001424127887323798104578455032466082920

https://www.wsj.com/articles/SB10001424127887324766604578456783718395580


This article is not an endorsement by Headlands Technologies LLC or any of its affiliates (collectively, “Headlands Tech”) of the papers discussed, their viewpoints or the companies discussed. The views expressed above reflect those of the authors and are not necessarily the views of Headlands Tech. The information presented above is only for informational and educational purposes and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. Additionally, the above information is not intended to provide, and should not be relied upon for investment, accounting, legal or tax advice. Headlands Tech makes no representations, express or implied, regarding the accuracy or completeness of this information, and the reader accepts all risks in relying on the above information for any purpose whatsoever.