Do We Have Good Data? Elementary – Guest Post R.T. Leuchtkafer



Throughout 2014 we have read speeches and media articles about how the SEC is in the midst of a comprehensive market structure review, or is beginning a data-driven, holistic review of our equity market structure. We have been cautioned that correct regulation must be data-driven, not anecdote-driven.

At SIFMA’s annual meeting yesterday, Chair White again struck this same theme.

(Reuters) – Regulators are undertaking a comprehensive data-driven review of the rules underpinning the U.S. equity markets, including the pricing and rebate system used by exchanges, the head of the Securities and Exchange Commission said on Monday.

This of course raises the question: where is this data coming from?

R.T. Leuchtkafer wonders about this data as well. He writes today’s morning note to you.




November 11th, 2014

Last summer the U.S. Senate held a committee hearing about stock market mechanics.  Witnesses included industry executives and academics.  Asked about stock market reform, most agreed regulators should leave well enough alone, and consider reform only if strong empirical research justified it.  Whether they called it data-driven or evidence-based research, witnesses said any proposed market reform should be based on hard economic data.

There’s only one problem.  Nobody has good data.


The SEC doesn’t seem to have it.  Last year the SEC launched a new computer system called MIDAS, and said its vast databases would enlighten the SEC about stock market plumbing.  SEC Chair Mary Jo White later told Congress she was “making certain that the SEC and its experts had the data they needed to fully understand all of the market structure issues,” and said MIDAS had significantly enhanced the SEC’s market knowledge.

Whatever its virtues, MIDAS is far from comprehensive.  It takes in trade and order data from the exchanges, but doesn’t include all exchange order data, and doesn’t include any order data from dark pools or other off-exchange trading venues.  Even the data it gets from the exchanges is cropped.  That data doesn’t tell who sent an order, or who bought and who sold, or whether someone sold short, or even whether an order came from the investing public.

MIDAS also doesn’t include any information about the millions of hidden orders that flood stock exchanges every day or what kinds of orders they are.  Altogether, MIDAS might see 25 percent or less of stock market orders, opening at best a peephole into the markets, not a full view.

The stock exchanges don’t have comprehensive data either.  Each individual exchange has its own high-quality audit trail data, but the exchanges don’t share much with each other or with researchers.

There is a plan underway to build a large, multi-year database of all the exchange and off-exchange market audit trails, updated daily.  The SEC proposed this consolidated audit trail (or CAT) database soon after the May 6, 2010 flash crash.

More than four years later, the CAT is hardly a kitten.  The industry group responsible for implementing the CAT finally announced its bidder shortlist just a few months ago.  Optimists believe at least some part of it will be ready in 2018, while skeptics believe it won’t be ready before 2020.  Nobody is yet sure what it will cost or how to pay for it.


In the meantime, researchers don’t have data for the analyses everyone wants them to do.  Among the best datasets researchers have is one NASDAQ prepared years ago.  When someone says high-frequency traders lower stock market volatility, improve pricing, or lower investor costs, studies based on NASDAQ’s data might be the source.

Widely used, the data is little more than another peephole.  As researchers have described the dataset, it contains NASDAQ trades from 2008 and 2009 in 120 stocks, but it doesn’t have ETFs, however popular they are to trade.  It also doesn’t include much order data at all, and doesn’t include data from any of the dozen or so other exchanges.  Crucially, NASDAQ scrubbed the data to mask the firms behind every trade, though researchers said NASDAQ flagged trades it believed included high-frequency trading firms.

Not all of them, however.  Trades from high-frequency firms that didn’t register directly with NASDAQ, and instead accessed NASDAQ under cover of another firm, weren’t necessarily flagged – if true, a significant omission.  Just recently regulators said these kinds of high-frequency firms sometimes deploy “aggressive, potentially destabilizing trading strategies.”  Perhaps dozens of these firms, millions of their trades, and billions of their traded shares weren’t flagged.

In a separate action announced last month, the SEC accused a firm called Athena Capital Research of manipulating NASDAQ stock prices thousands of times in 2009.  The SEC said Athena was its first ever high-frequency trading manipulation case, so it’s notable Athena was likely one of those firms coming to NASDAQ under another firm’s cover.  Were Athena’s manipulative trades flagged as high-frequency trades in NASDAQ’s dataset?  According to researchers, trades from the high-frequency trading arms of big banks like Goldman Sachs weren’t flagged in NASDAQ’s data either.  How many millions of trades are those?

We don’t have a list of which firms NASDAQ flagged, so we don’t know which firms were left out.  With data like this, scientists immediately wonder about what’s called selection bias.  Does what’s missing tilt research results one way or another?  As an extreme example of selection bias, imagine if doctors studied heart disease in the U.S. but didn’t include smokers or anyone over 50 in their analysis.  Those doctors would find we were surprisingly healthy, and that finding would be nonsense.
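For readers who like to see the arithmetic, here is a minimal sketch of the heart disease example.  Every number in it is invented purely for illustration, and the group names and counts are assumptions, not real health statistics.  It only shows how dropping the highest-risk groups can pull a measured rate well below the true one.

```python
# Hypothetical counts, invented only to illustrate selection bias.
# Each entry: (label, is_smoker, is_over_50, people, heart_disease_cases)
GROUPS = [
    ("non-smoker, 50 or under", False, False, 500, 10),
    ("non-smoker, over 50",     False, True,  250, 30),
    ("smoker, 50 or under",     True,  False, 150, 15),
    ("smoker, over 50",         True,  True,  100, 25),
]


def disease_rate(groups):
    """Return heart disease cases divided by people across the given groups."""
    people = sum(g[3] for g in groups)
    cases = sum(g[4] for g in groups)
    return cases / people


# The full population, with nobody left out.
full_rate = disease_rate(GROUPS)

# The biased study: drop smokers and anyone over 50 before measuring.
filtered = [g for g in GROUPS if not g[1] and not g[2]]
filtered_rate = disease_rate(filtered)

print(f"Rate in full population: {full_rate:.1%}")
print(f"Rate in filtered sample: {filtered_rate:.1%}")
```

With these made-up counts, the filtered sample shows a rate of 2.0% against 8.0% for the full population.  The same worry applies to any dataset where the excluded records differ systematically from the included ones.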


No doubt because of the brawl over Michael Lewis’s Flash Boys, Chair White says the SEC is doing a broad data-driven market review.  She hasn’t yet said where the data illuminating that review will come from, and it shouldn’t stay a mystery.  The SEC’s MIDAS system has huge gaps, the CAT is still years away, and NASDAQ’s data is old, incomplete, and perhaps fatally miscoded.  We won’t have strong data-driven or evidence-based studies to evaluate reform until we have data, and the studies won’t be worth much unless they use complete data.

Meanwhile British, Australian, and Canadian regulators have already built thorough market databases and turned them over to their own experts for study.  The SEC must build a database at least as good as the databases other regulators have created.  It should assemble a year’s worth of complete market audit trail data and give researchers time to look at it.  A terabyte of storage costs about $50 at Walmart, so somewhere in the SEC’s $1.35 billion budget must be the resources for the job.

Large-scale stock market reviews are a habit at the SEC.  The last one was in 2010.  At the time Senator Ted Kaufman pointed out that researchers didn’t have the data they needed to answer many of the questions the SEC had asked.  Without high-quality data, he said, the “market structure review predictably will receive mainly self-serving comments from high-frequency traders themselves and from other market participants.”

Or as Sherlock Holmes put it in A Scandal in Bohemia, “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”


                                                                                                             R.T. Leuchtkafer