The Roots are Shallow

Statistics in Plain English: Part 1

Mar 19, 2025

I am going to teach you statistics. Wait, don’t go! Give me 2 minutes before you click away.

Statistics, as it is traditionally taught, is brutally complex. I have spent much of the last 11 years of my life studying and working professionally in the field, and for essentially all of that time, I felt a deep sense of unease and confusion about what I was doing. This was despite the fact that I was being paid (quite a lot) to practice statistics professionally, that I got top grades in my statistics courses, and that I wrote an undergraduate thesis for my economics degree using reasonably complex statistical methods[1].

I assumed that I was the problem - I just didn’t understand statistics. I’ve heard this sentiment echoed by a number of very smart people in my life - they also felt like they didn’t understand statistics, even if they studied it in school or practiced it professionally.

I finally snapped. Despite constantly regurgitating a line about how “I loved statistics because it allowed for an objective search of truth”, I felt no confidence about the conclusions of my work. I left my cushy tech job (in no small part because of this problem) and my primary focus for the past year and a half has been trying to figure out what I wasn’t understanding.

After this time, I’m forced to conclude one of two things must be true. One, I’m just not smart enough to comprehend the brilliance of statistics. Or two, there is something seriously wrong with the way statistics is traditionally taught and thought about.

I’ll leave it to you to decide which is true, but I’m pretty confident it is the latter. I believe I’ve unlearned enough about statistics that I can explain and teach it in a simpler, clearer, and different way than you’ve likely ever heard before.

If that’s not interesting to you, you can go. But if you want to understand statistics better, even if only out of idle curiosity, my commitment to you is to explain this in as accessible a way as possible, regardless of whether you have a strong math / statistics background. I actually suspect you’ll find these posts easier to follow without a strong statistics background, as you’ll have less to unlearn.

low-angle photography of green leaf tree — Photo by Dave Hoefler on Unsplash

There are two completely unproblematic conceptualizations of statistics. The first is what I will refer to as statistics as quantitative history. Where history in the social sciences more commonly goes deep on a particular moment in history or a small set of moments, statistics as history takes a broad approach, relating similar moments in history to each other. I use the wording “moments in history” as an intentionally vague and broad phrase. This could be a history of all recessions, the history of individuals who lived through a pandemic, or even the medical history of an individual.

The second conceptualization is statistics as mathematics. Here, statistics would be similar to any other purely mathematical field, such as algebra or trigonometry. Mathematical fields are characterized by a set of axioms, which are a set of unjustified assertions. Mathematicians then attempt to prove that certain conclusions follow from particular axioms.

Let me be clear - neither of these is a trivial problem[2]! Statistical history involves the hard work of finding / creating datasets to use and difficult thinking and communication to explain what has happened in the past. It also involves design problems when creating visualizations, namely finding ways to represent that data intelligibly. For statistics as mathematics, there exists a set of axioms (formulated by Andrey Kolmogorov in his Foundations of the Theory of Probability in 1933) which are popularly used in statistics. Deriving proofs from those axioms is incredibly hard work (which frankly is largely beyond me).
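For readers who want to see what such axioms look like, here is a sketch of Kolmogorov's axioms in a common modern statement. The notation is my assumption rather than anything fixed by this post: Ω is the set of possible outcomes, 𝓕 is a collection of events (subsets of Ω), and P is the function assigning a number to each event.

```latex
% Kolmogorov's axioms in a common modern statement.
% Assumed notation (not from the post): \Omega is the set of outcomes,
% \mathcal{F} a collection of events (subsets of \Omega), P the probability measure.
\begin{align*}
&\text{(1) Non-negativity:} && P(A) \ge 0 \quad \text{for every event } A \in \mathcal{F}, \\
&\text{(2) Normalization:} && P(\Omega) = 1, \\
&\text{(3) Countable additivity:} && P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} P(A_i)
  \quad \text{for pairwise disjoint } A_1, A_2, \ldots \in \mathcal{F}.
\end{align*}
```

Results such as P(Aᶜ) = 1 − P(A) are then derived from these statements alone, since an event and its complement are disjoint and together make up Ω. Notice that nothing in the axioms refers to coins, elections, or anything else in the real world.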

If this were all statistics aspired to be - trying to accurately represent the past and the creation of theory - there would be no problem. However, my experience is that statistics aspires to be much more. This is where the trouble begins.

When I think about statistics today, I think of a towering tree. I look at developments in ML and AI and statements such as, “In 12 months, we may be in a world where AI is writing essentially all of the code.” from Anthropic CEO Dario Amodei[3]. I think about posts by academics such as Adam Grant, where he writes “Hey managers: It’s time to stop hiding salaries in job postings.” as a conclusion from a piece of statistical history from Glassdoor[4]. I do not highlight these comments as a dig toward either Dario or Adam. Nevertheless, these statements are emblematic of a pattern in statistics. Some unproblematic historical statement is made (in the case of Dario, presumably that there has been uptake in the use of AI to generate code), followed by a “conclusion” which “follows” from the history.

This is what I mean by my tree analogy. I see extensive assertions from people about the future which they justify by the fact that they are “using statistics”. But if statistics is simply math and history, it says nothing intrinsically about the future. Something subtle (and dare I say, sneaky) has happened to start making statements about the future. Maybe it is the case that the past is a good predictor of the future. But even if we believe that (and I do), how exactly we should make inferences about the future from the past is a complicated problem which requires thinking and argumentation and justification.

Here is what I understand to be happening. Statistics pretends to be a mathematical field. This can be seen in a general disdain towards “descriptive statistics” (which functionally is another term for statistics as quantitative history), as all descriptive statistics does is “state what has already happened”. Instead, statistics desires the comfort of being a mathematical field. With the rigorous proofs of mathematics, statisticians can avoid the messiness and complexity of the real world while receiving the cachet of doing mathematics, a well-regarded and respected field.

However, as discussed, mathematics is pure theory. Littlewood put it well when he said, “Mathematics… has no grip on the real world; if probability is to deal with the real world it must contain elements outside mathematics…”[5] (I will address the term probability in a later post). But statistics as mathematics is “no better” than descriptive statistics - all it says is how to solve theoretical problems.

This is a dilemma. Statistics wants the comfort of math and the ability to say things about the real world. In my opinion, the way it resolves this is by sneaking in certain assumptions which, if true, would allow mathematical statistics to make statements about the real world. These are hidden away in definitions of terms, in the choice of axioms of mathematical statistics, and in seemingly unimportant “assumptions” of the form “if this is true, then this follows”. If challenged, there are always easily available “dodges”, where it can be said that everything is just an assumption. Rather than transparently argue and justify why the conclusions drawn from using math apply to reality, these justifications are snuck in and hidden within the onerous complexity of the math. Then when people use statistical methods, they implicitly make assumptions they aren’t even aware are present, and so they go almost entirely unacknowledged.

This is what I mean by the title. In contrast with the (often extensive) conclusions drawn using statistics, the roots of the field are startlingly shallow.

You might be wondering how this could be true. How could the emperor have been walking around without clothes on for so long? My theory is that almost no one practicing statistics really knows what they are doing. I still feel a level of panic and exhaustion whenever I pick up a statistics textbook, as I’m overwhelmed by formulas and highly technical jargon. I suspect lots of people either are willing to trust statistics uncritically or are sufficiently embarrassed and / or ashamed at their confusion that they just assume that they are falling short (as I was).

Additionally, as I will discuss in later posts, not all the assumptions are bad. For instance, statistics appears to produce very good predictions about the future behaviour of games of chance (such as those involving shuffling cards and rolling dice). While I believe the assumptions which statistics uses are weak, they nevertheless appear to work in certain areas, which is not surprising given the strong historical roots of statistics in studying these games.
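As a rough illustration of the games-of-chance case, here is a minimal Python sketch comparing the textbook answer for rolling two dice with the frequency observed over many simulated rolls. The function names are hypothetical, and the "dice" are a pseudo-random number generator rather than physical dice, so this only shows the calculation and the simulation agreeing with each other, under the assumption that each face is equally likely.

```python
import random
from fractions import Fraction


def theoretical_prob_of_sum(target: int) -> Fraction:
    """Count the two-dice outcomes that add up to `target`, out of 36.

    Assumption: all 36 ordered pairs of faces are equally likely.
    """
    favourable = sum(
        1 for a in range(1, 7) for b in range(1, 7) if a + b == target
    )
    return Fraction(favourable, 36)


def simulated_freq_of_sum(target: int, n_rolls: int, seed: int = 0) -> float:
    """Roll two simulated dice `n_rolls` times; return how often they sum to `target`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_rolls):
        if rng.randint(1, 6) + rng.randint(1, 6) == target:
            hits += 1
    return hits / n_rolls


if __name__ == "__main__":
    print("theory:    ", float(theoretical_prob_of_sum(7)))
    print("simulation:", simulated_freq_of_sum(7, 100_000))
```

With a large number of rolls, the simulated frequency should land close to the theoretical value of 1/6 for a sum of seven. Whether real dice behave like the simulated ones is exactly the kind of assumption later posts will examine.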

Lastly, predictions made using statistics are monstrously hard to prove wrong. One reason for this is that the end result of statistics is often just a simple number or graph. An external observer has no idea how this number was produced. Even if they were to try to understand, they would need (at minimum) a strong understanding of both programming and mathematics to form their own view on whether the conclusion is reasonable. A second reason is that it can be very hard to tell, based on what happens, whether a statistical “prediction” was good or not. In some cases it is possible - for instance, it will either be true or not that almost all code is written by AI in early 2026. However, how can we evaluate a statement like “it is very likely that (insert political candidate here) wins an election”? Implicit in the statement is that they might lose, and should that happen, we can’t necessarily conclude that the prediction was wrong or bad.

In summary - I think we grievously overstate what statistics can accomplish. My experience is that in unpacking the hidden assumptions built into statistics, I have been able to feel comfortable using statistics for the first time in my life, and truly feel like I understand what I’m doing. As a general outline, these posts will start with the uncontroversial parts of statistics - history and pure mathematics. Once these are established, I will progress on to exploring the assumptions required to make statistics something more. I hope you’ll join me.

Statistics in Plain English series:

Part 1: The Roots are Shallow
Part 2: What Is Data?
Part 3: The Death of Probability

[1] Multivariate linear regression.

[2] Despite an intentionally provocatively phrased post I wrote in the past when I was still coming to the view I have today.

[3] https://fortune.com/2025/03/13/ai-transforming-software-development-jobs-meta-ibm-anthropic/.

[4] https://www.linkedin.com/feed/update/urn:li:activity:7307913584322248704/.

[5] Littlewood’s Miscellany, 1953.
