I am in the process of writing a series of posts I am calling “Statistics in Plain English”, where I am trying to explain statistics in a non-technical way. This post is not part of that series. Instead, this is the technical summary of my research into statistics and the assumptions required to apply statistics. It isn’t perfect - you’ll see my limited understanding of quantum mechanics come through, which I need to learn more about. Nevertheless, I believe it is a substantial step forward from statistics as it is typically taught.
In this essay, I use the word statistics to refer specifically to mathematics, popularly formulated as proofs built on the axioms presented by Andrey Kolmogorov [1]. This definition includes topics such as probability theory, statistical significance tests, regression analysis, and Monte Carlo simulations. In contrast, this definition explicitly omits descriptive statistics, Randomised Controlled Trials (RCTs), and methods for sampling independently from a population (though I will touch on some of these subjects throughout this essay).
As a mathematical field, the truth of statistics is dependent entirely on rigorously proving that particular conclusions follow from axioms. The implication is that whether statistics (or any other mathematical system) maps to the world around us is irrelevant to whether statistics is epistemically true. As Littlewood wrote [2], “Mathematics… has no grip on the real world; if probability is to deal with the real world it must contain elements outside mathematics…”. Similarly, Kolmogorov’s third axiom simply says, “To each set A in F is assigned a non-negative real number P(A). This number P(A) is called the probability of the event A.” This definition is extremely vague and makes no particular claims about what exactly probability represents.
The question this essay seeks to answer is what additional assumptions are required to “make the jump” from pure mathematics to applying the insights gained from statistics, specifically when attempting to predict a future unknown outcome. In my view, this question is underexplored, as I will argue in the first part of this essay.
Part 1 - Naive Statistics
The first fork in the road for statistics is whether we are starting with data to produce a hypothesis or starting with a hypothesis and testing it with data. I will begin with the first problem. I will further segment this problem by asking whether the data we are using includes only a single value or multiple values.
In either case, the assumption is that each row in the dataset is analogous. Traditionally, we are interested in studying a random experiment, which Gunnar Blom [3] defines as, “... every experiment that can be repeated under similar conditions, but whose outcome cannot be predicted exactly in advance…” In this case, each row in our dataset would be an instance of a trial in our experiment.
I - A Detour into Epistemology and Physics
There are two forms of randomness, which I will refer to as objective randomness and subjective randomness. To illustrate the difference, consider a die roll. Imagine you are sitting at a table watching someone roll a die. At the table is also a robot with x-ray vision, an extremely fast processor, and the most up to date understanding of physics that we currently possess. Each of you is asked to make a guess as to which side will land facing upwards while the die is in the air. I contend that while you would likely be indifferent between the sides, the robot would likely have a very confident prediction, based upon the angle and velocity of release of the die, the distance from the surface, and possibly even environmental factors such as wind.
The difference in confidence here illustrates that the experience of randomness is different between you and the robot. Based upon certain limits that you have as a human being, you are more uncertain about the outcome of a die roll. You experience a higher level of subjective randomness than the robot.
Conversely, we can ask whether all randomness is subjective. Pierre-Simon Laplace wrote [4] (a formulation popularly conceptualized as a “demon”),
“We ought then to regard the present state of the universe as the effect of its anterior state and as the cause of the one which is to follow. Given for one instant an intelligence which could comprehend all forces by which nature is animated and the respective situation of the beings who compose it - an intelligence sufficiently vast to submit these data to analysis - it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present to its eyes.”
This demon has three important qualities. Firstly, for at least one moment in time, it is able to know everything in existence at that moment and all details about those things, such as the forces operating upon them. Secondly, it has a comprehensive knowledge of the laws of reality that determine how what is true now will produce what will come next. Lastly, it has sufficient computational power to apply this knowledge to predict the future perfectly.
The question is whether we believe that, were this demon at the table, it could predict the outcome of the die roll perfectly and with complete confidence. If yes, we necessarily reject the existence of objective randomness. There are two ways in which the demon may fall short, however.
The first is if it is impossible to know all information at a particular moment in time. This could be a result of physics (such as the uncertainty principle) or relevant yet unrepresentable information (such as the will of an individual). Secondly, even if all information at a particular moment were knowable, the laws which produce the future might not be deterministic. Put differently, we can ask: if we repeated a die roll where truly every piece of information at the beginning which could possibly influence the roll were identical, would we always see the same outcome produced in the same way? This might not be the case if there is structural randomness in how causality works, if we believe that there could be some external actor (such as a god) who can interfere with causality, or if we believe free will implies that some entity could make multiple different choices in the exact same situation.
Whether or not objective randomness exists is beyond the scope of this essay. However, the question raises two important points. The first is that Blom’s definition of a “random experiment” is coherent even if all randomness is subjective. No claim about the existence of objective randomness is implied in the definition.
The second point is more troubling. Our definition of a random experiment stated that we repeated our trials under “similar” conditions. If objective randomness does not exist (and setting aside problems of errors of observation), then the only reason why we receive different outcomes in our experiment is because of differences in the starting conditions of the trial. Put differently, the fact of receiving different outcomes implies that we combined materially different trials.
Even if we do believe in objective randomness, the range of outcomes might be quite narrow. For example, it’s possible that the level of objective randomness in our reality is small enough that it would almost never be the case that there are multiple possible outcomes for our die roll (even though if we expanded our definition of an outcome to capture more values, such as the precise position of the die once it settles, we would see that the outcome truly is different in at least some dimensions).
II - Interpreting Data
A popular class of problems in probability theory and combinatorics is urn problems, involving random selections of one or more marbles from an urn. These problems have two characteristics: firstly, the existence of some truly random selection process, and secondly, a well-defined distribution of marbles within the urn.
A popular interpretation of probability is the frequency interpretation. As Blom writes [5] about the roll of a die, “It would be possible to define the probability of the event ‘one’ to be the limit of the relative frequency as the number of throws increases to infinity.” I present that when we assert that the probability of rolling a one on a die is ⅙, we are saying that a good metaphor for how we can predict the outcome of a die roll is to imagine we have a bag with six balls, each with a number ranging from one to six. Here, we pretend that the way the outcome is determined is with random selection from that bag (with replacement), just as with our urn problems.
Importantly, the point is that this is not literally how the outcome is determined. As discussed above, that would be a radical and unjustifiable epistemic assertion. It is simply a metaphor. In fact, if we do not believe objective randomness in causality exists, the only “true” probabilities for a particular outcome of a trial would be 0 or 1, reflecting that the outcome would either never occur or always occur. If objective randomness exists, this model of reality could be more than a metaphor only if we were studying a trial where all randomness was solely objective (all subjective randomness had been filtered out and the knowable starting conditions were identical).
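To make the metaphor concrete, here is a minimal sketch in Python that simulates drawing from the six-ball bag and watches the relative frequency of “one” drift toward ⅙. The number of draws and the seed are illustrative assumptions, and the pseudorandom generator is itself just another bag we are choosing to trust.

```python
# A minimal sketch of the frequency interpretation: simulate draws from the
# six-ball bag and track the relative frequency of drawing a "one".
import random

random.seed(42)  # illustrative seed so the sketch is reproducible
ones = 0
for roll in range(1, 1_000_001):
    if random.randint(1, 6) == 1:
        ones += 1
    if roll in (100, 10_000, 1_000_000):
        print(f"{roll:>9} draws: relative frequency of 'one' = {ones / roll:.4f}")
# The relative frequency drifts toward 1/6 ≈ 0.1667 as the number of draws grows.
```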
Now, we can return to the question of how we can start with data and apply mathematics to form conclusions. There are two categories of problems that we can solve. The first is in the special case where the data we have is a series of repeated samples from some dataset. We will have to be willing to make certain assumptions about how these samples were taken, but if we believe those are well justified, we can leverage the Central Limit Theorem (CLT) to make an inference about the average value of the underlying dataset.
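As a minimal sketch of that inference, assuming the samples really were taken independently from the same underlying dataset (the justification statistics itself cannot provide), we can use the CLT to put an interval around the underlying average. The numbers below are illustrative stand-ins, not real data.

```python
# A CLT-based inference sketch: estimate the average of an underlying dataset
# from repeated samples, assuming (not proving) the samples are independent.
import math
import random

random.seed(0)
sample = [random.gauss(170, 10) for _ in range(200)]  # stand-in for observed samples

n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
standard_error = sd / math.sqrt(n)

# The CLT says the sample mean is approximately normal around the true average,
# which is what licenses the familiar ±1.96 standard error interval.
low, high = mean - 1.96 * standard_error, mean + 1.96 * standard_error
print(f"sample mean = {mean:.2f}, approximate 95% interval = ({low:.2f}, {high:.2f})")
```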
The second requires us to construct a theoretical bag. This bag can contain either a finite or infinite set of balls, corresponding respectively to a discrete probability function or a continuous probability function. We imagine that each outcome is determined by a single draw from this bag, with replacement. Rather than analyzing the data itself, we then apply statistics to this theoretical bag. This allows us to leverage many techniques in statistics to answer questions such as “if the bag was arranged as a normal distribution, we would expect n% of samples to have a sample mean of x or less” or “if we sample from our bag three times, the expected sum of values is distributed in the following form”.
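Both quoted questions can be computed entirely from the assumed bag, with no reference to any data at all. A sketch, using an illustrative normal bag and a fair six-sided bag (all parameters are assumptions for the example):

```python
# Statements derived purely from a theoretical bag, not from any dataset.
from itertools import product
from statistics import NormalDist

# "If the bag were arranged as a normal distribution (mean 100, sd 15), what
# share of samples of 25 draws would have a sample mean of 104 or less?"
n = 25
sampling_distribution = NormalDist(mu=100, sigma=15 / n ** 0.5)
print(f"P(sample mean <= 104) = {sampling_distribution.cdf(104):.3f}")

# "If we sample from a fair six-sided bag three times, how is the sum distributed?"
counts = {}
for draws in product(range(1, 7), repeat=3):
    counts[sum(draws)] = counts.get(sum(draws), 0) + 1
for total in sorted(counts):
    print(f"P(sum = {total:2d}) = {counts[total] / 6**3:.4f}")
```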
The latter problem is interesting, but appears at first glance to have no relationship to the data we have. The general point is that for us to solely apply mathematical statistics, we need to make a jump from the actual data to some theoretical bag, and the question of whether the bag is a good metaphor is beyond the scope of statistics (as a purely mathematical field). We will return to the question of how reasonable this jump is later in this essay.
III - Multiple Values
We will now consider datasets with more than one value. To start, I will acknowledge that we are able to look at any individual column in the multivariate dataset and treat it as if it were a single value dataset, as above. However, we will focus on statistics that use two or more values together.
A simple approach is to segment our dataset. This is an extension of the problem of constructing a theoretical bag. Here, we imagine that our outcome is again determined by a draw from a bag, but rather than each ball carrying a single value, each carries multiple values. We could simply choose to create a new bag which only includes balls where one or more values meet particular conditions. For instance, if we have a dataset with sex and income, we could segment our theoretical bag into one where all balls were of sex “female”, and then apply statistics to this new bag.
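A minimal sketch of this segmentation, using a made-up dataset (the column names and values are assumptions for illustration only):

```python
# Segmenting the multivariate bag: keep only the balls meeting a condition,
# then apply ordinary statistics to the smaller bag.
import pandas as pd

data = pd.DataFrame({
    "sex":    ["female", "male", "female", "male", "female"],
    "income": [52_000, 48_000, 61_000, 55_000, 47_000],
})

female_bag = data[data["sex"] == "female"]      # the new, smaller bag
print(female_bag["income"].mean())              # statistics applied to that bag
```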
It is a frequently stated truism that “correlation does not equal causation”. As a way to demonstrate this, it is common to provide an example of a correlation which obviously does not appear to be a causal relationship, such as a relationship between ice cream and sunscreen sales over time. However, the assertion that, despite the presence of a correlation, the relationship is not causal does not follow purely from statistics. It demands reference to reasoning beyond the scope of statistics.
In this case, what we might suspect is that the sales of these two items are driven by another value, such as temperature. Imagine that when we segment our dataset to look only at rows with a certain temperature, the correlation between ice cream and sunscreen sales disappears. Despite finding this new relationship, it does not then follow from statistics that temperature is causal upon ice cream or sunscreen sales. While we have undermined a potential claim of causality between the sales of sunscreen and ice cream, this does not subsequently imply that the relationship between temperature and these other values is causal. In fact, pure statistics is incapable of making causal claims. The best it can do is identify “better” correlations, which is to say correlations which are more stable when controlling for the other values in the dataset.
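A simulated illustration of this point, in which both sales series are constructed to be driven only by temperature (all numbers are invented for the example, and, as stressed above, the code demonstrates the stratification rather than any causal claim):

```python
# Ice cream and sunscreen sales are each generated from temperature plus noise,
# so they correlate strongly overall but barely at all within a temperature band.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
temperature = rng.uniform(10, 35, size=2_000)
df = pd.DataFrame({
    "temperature": temperature,
    "ice_cream":   5 * temperature + rng.normal(0, 10, size=2_000),
    "sunscreen":   3 * temperature + rng.normal(0, 10, size=2_000),
})

print("overall correlation:", round(df["ice_cream"].corr(df["sunscreen"]), 2))

band = df[(df["temperature"] >= 24) & (df["temperature"] < 26)]
print("within the 24-26 degree band:", round(band["ice_cream"].corr(band["sunscreen"]), 2))
```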
While it is theoretically possible to construct a general mathematical function which “takes in” amounts for certain values and “outputs” an average of another value, from a naive statistical standpoint this is a pointless exercise. As discussed above, we can already segment our dataset to only include rows with certain values, and can trivially compute the average of a particular value within that segment. Any generalized functional form which tries to capture multiple segments will at best produce the same average for a certain segment and at worst produce a different average.
Alternatively, it is possible to construct a theoretical function which we treat as the producer of the balls in the multivariate bag, to which statistical extensions can be applied. This is an analogous jump to the single value case of imagining that the way a value is determined is with a draw from the bag. The difference is that it is at least possible that this function could be more than metaphor, if we believe that the laws which govern causality can be represented mathematically. However, all that naive statistics can do is make statements of the form, “If this function determined an outcome, what additional conclusions could we draw?” The general problem of identifying causality and creating useful functions to relate values will be discussed in Part 3.
Part 2 - Justifying Metaphor
In part 1, we identified the limits of what we can conclude solely by following statistics. This included the general case of constructing a theoretical bag to describe how an outcome is determined, for the purpose of prediction. In this part, we will ask two main questions. Firstly, how can pure statistics inform the bag metaphor? Secondly, how can we determine how good a metaphor this is for how our outcome is determined?
I - Populations
We will assume that we are trying to come to a view on what outcome will result at the start of some trial. I present that a proper trial definition should include two (possibly three) elements. The first is a set of starting conditions. At the beginning of our trial, there is certain information that we can observe. In the case of a die roll, we might observe our die and see that it appears to be evenly weighted on all sides. If we were interested in how long a racer will take to ski down a hill, we might consider factors such as conditions of the hill and experience of the racer. Some information may be extraneous - for instance, I would assume that the color of the die being rolled will have no causal influence on what the outcome of the die will be.
The second is a well-defined outcome. It should be clear at the beginning of a trial what value or values we are looking for to know that our trial has completed. For instance, our die rolling trial might be interested in which face lands upwards over one or two rolls of the die. Outcomes are simply a (typically numeric) representation of something that happened in the real world, and they inevitably demand that we ignore information. For instance, there is an enormous amount which could be recorded about a die roll, such as the precise position of the die once it settles. However, if we are not interested in predicting at a certain level of granularity / predicting certain information at all, we would not include it in our outcome definition.
Our trial may end in unexpected ways. For example, say we are watching someone holding a die in their hand and we are interested in what will be rolled. However, rather than rolling the die, they clench their fist so hard that the die shatters into many pieces. This is where the third possible part of the trial definition could come in - a set of invalidation criteria, which we would use to post hoc declare a trial “invalid”. For instance, if we were interested in the outcome of a roll, we could say that since the die wasn’t rolled, the trial never occurred.
The alternative to invalidation criteria is to expand our outcome definition. For instance, our outcome definition for the roll could include an “other” category to account for unexpected ends to the trial such as the die shattering example. In consulting, there is a mental framework called Mutually Exclusive, Collectively Exhaustive (MECE). A well defined trial should, ahead of time, have a MECE set of outcomes and/or invalidation criteria to account for all possible ends to the trial.
For any given well-defined trial, there exists a historical set of instances where a trial has occurred with our given starting conditions, along with an outcome which followed. We will call this a population. This population provides us with one option for how our metaphorical bag could be constructed. Definitionally, the bag built from this population will be a discrete probability function, as the population is finite. With some application of external reasoning, this could be extended to a continuous probability function. For instance, if the outcome of the trial of interest was the height of someone at age 20, we might observe that the population for that trial was roughly normally distributed and believe that the best metaphor for the population is an infinitely sized bag distributed normally.
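A sketch of that extension, using invented height data (the numbers are assumptions, chosen only to illustrate the step from a finite population bag to an infinite normal one):

```python
# From a finite population bag (a discrete distribution) to a continuous metaphor.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
population = rng.normal(178, 7, size=5_000)   # invented historical heights at age 20

# The finite bag answers questions empirically.
print("share of population 170cm or shorter:", round(float(np.mean(population <= 170)), 3))

# The external reasoning step: the population looks roughly normal, so we adopt
# an infinitely sized normal bag with the population's mean and spread.
metaphor = NormalDist(mu=float(population.mean()), sigma=float(population.std(ddof=1)))
print("the same question under the normal metaphor:", round(metaphor.cdf(170), 3))
```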
II - Comparing Bags
Naively using the population (or if we cannot access the population, some inference about it based on a sample of the population) is not the only way to form a metaphorical bag. Other bags can be formed under plenty of other assumptions. For instance, we could form one where we treat each outcome as equally likely, or one with a continuity assumption, where we assume that whatever outcome happened last is the one which will happen next.
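To make the alternatives concrete, here is a sketch that builds three competing bags from the same invented history of outcomes (the history itself is an assumption for illustration):

```python
# Three competing bag constructions for the same history of outcomes.
from collections import Counter

history = ["sunny", "sunny", "rainy", "sunny", "cloudy", "sunny", "rainy"]
outcomes = sorted(set(history))

# 1. Population bag: probability proportional to historical frequency.
frequency = Counter(history)
population_bag = {o: frequency[o] / len(history) for o in outcomes}

# 2. Uniform bag: every outcome treated as equally likely.
uniform_bag = {o: 1 / len(outcomes) for o in outcomes}

# 3. Continuity bag: all of the mass on whatever outcome happened last.
continuity_bag = {o: float(o == history[-1]) for o in outcomes}

print(population_bag, uniform_bag, continuity_bag, sep="\n")
```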
How can we determine if the population bag is a good metaphor? I present that there are two important, interrelated questions which help answer this. Before we explore those questions, I will briefly ask why trials with the same starting conditions do (or appear to) produce different outcomes.
There are three reasons why this may occur. The first is if the starting conditions we observe do not include all information which will causally produce the outcome of interest. If this is the case, there may be variation in the unobserved starting conditions, producing variation in outcomes. Secondly, we may incorrectly observe starting conditions and/or outcomes. If this happens, we might actually have observed two identical trials but record them as different, for example. Lastly, we may experience true randomness, which implies one of two things. The first is that even if we accurately observe all causal starting conditions and the outcome, we may still receive a different outcome in a repeated trial. Alternatively, it may be impossible to observe all starting conditions. In this case, even if we are able to derive what the starting conditions were after the fact, at the beginning of the trial we are trying to predict, we will be uncertain what those starting conditions are.
In light of this general point, I return to whether the population bag is a good metaphor. The first question is simply how large the population is. The larger the population, the more it will capture the variance possible in the unobserved starting conditions. Given this point, we may wish to increase the size of our population. There are a few ways we might do this.
The first, and least troubling, is by omitting what we believe is extraneous information which has been included in the starting conditions. For instance, if we are predicting the outcome of a red die being rolled, we could increase our population size by generalizing our population from the set of rolls of red dice to the set of rolls of dice of any color. This involves some level of external reasoning, but at best increases population size with no tradeoffs.
The second approach is to loosen our trial definition, voluntarily ignoring some information present in our starting conditions. For instance, if we are forming a bag for an individual basketball player taking a free throw shot, we could generalize our trial definition to be all NBA players who have taken free throws. The tradeoff is between a larger bag and a more contextually relevant bag.
The last approach would be to produce new data in our population. Setting aside whether this is feasible or not, this brings us to our second question of how good a metaphor the population bag is. Not only is it impossible to truly repeat a trial (as we cannot rewind time), it is not clear that this would be a very productive exercise, as the trial would have the same observed and unobserved starting conditions, and so variance in outcome would only identify true randomness. Our population is necessarily made up of trials which occur at a different time and / or in a different location.
What we must ask is whether we believe that the outcome of our trial is insensitive to time and place. Put differently, we should ask whether we believe that the functioning of our trial is independent of the time when it occurs and the location where it occurs. If the answer is yes, this resolves a number of problems for us. Assuming that it is possible to easily run our trial repeatedly, we may not even have to rely on historical data; we can simply produce our own sample and ignore the totality of the existing population.
In contrast, if we believe our trial is sensitive to time and place, this may seriously undermine whether the population bag is a good metaphor. Even if we could produce new data, the applicability of the outcomes may deteriorate over time if we are predicting a trial later in time or in a different place. It may be very valuable to use historical data to capture historical unseen starting conditions which aren’t present / are less present now, but may be more present in the future.
An interesting parenthetical is that problems in statistics textbooks tend to have one of two qualities. They either assert the existence and distribution of a bag, or they use “toy problems”, which we have strong intuitions about being time and place insensitive. Here, not only is a historical population likely to produce a good metaphor for a trial, it lends itself to testing by individuals (for instance, by rolling a die themselves and recording what happens). Additionally, this is why pure statistics seems to be very applicable to games, as they seek to use trials which are time and space insensitive to produce randomness, whether through die rolling, shuffling cards, or the use of pseudorandom number generation in computers.
Part 3 - Formulas & Starting with Theory
The purpose of creating a formula is to identify causality, with the hope of being able to come to a view on what a particular outcome might be for a set of starting conditions for which we have very few, or possibly no, records of the outcome which follows. We will assume that any function has a single outcome value on the left side (which we are attempting to predict) and some number of starting conditions on the right side. I will emphasize an important point here - all input values must come before the outcome. In fact, if we hope to find true causality, all input values should be recorded at the same time (the start of the trial).
Interpreting the output of this function is somewhat tricky. If we have perfectly captured all starting conditions which determine the outcome, created a function which describes how reality operates, and there is no objective randomness, the output of this function will be the actual outcome which will happen. However, it is almost impossible for these assumptions to hold.
Given these assumptions do not hold, the output of the function can instead be understood as a prediction of the center of mass of the bag for a particular set of starting conditions (I use the term “center of mass” rather than “mean” or “average” for its general applicability to both finite and infinite bags). However, if this is what we are trying to predict, then as per Part 2 there is no need for a function at all. For a particular set of starting conditions, we can use the tools in Part 2 to form a view on the best metaphor for a bag, which will have its own center of mass.
This is the core problem - functions are only useful if they are identifying causality, but pure statistics is incapable of discovering it. To emphasize this point, consider that while a perfectly explanatory function would fit historical data perfectly (assuming no objective randomness and no observational errors), a function perfectly fitting historical data does not mean that we have identified causality. It is possible that we have instead “overfitted” the population and actually generated a function which will perform poorly with different starting conditions (or even future trials with the same observed starting conditions).
I present that in this situation, we must invert our approach. Rather than starting with data, we need to start with theory and use data to test that theory. While this is especially applicable for identifying causality, it is a more general point that can be valuable in the construction of theoretical bags as well.
I - Theory -> Data
The alternative approach to using data is to start with some hypothesis, and then take it to the data to see if it holds or not. I will use the term hypothesis generally in this section - this hypothesis could be either a causal functional form or a claim that a certain theoretical bag is a good metaphor to predict an outcome.
We will assume for the moment that we have had no interaction with the relevant population before. We form some hypothesis, and then we check to see how well it fits the data. If the hypothesis fits well, then I contend that we ought to feel confident that our explanation is a good one that will perform well going forward.
However, problems appear if our hypothesis fits the data poorly. We very likely will want to formulate a new hypothesis which might be better. The difference is that we are now incapable of being objective. Any hypothesis we formulate will be informed by our interaction with the data, and we become susceptible to overfitting to the data. The more we interact with the data, returning with new hypotheses, the more we “corrupt” the data.
This helps explain the general phenomenon of p-hacking. If we value a hypothesis based upon its fit to a dataset, overfitting to the dataset becomes desirable. For instance, say we constructed a dataset with a set of values which were all “noise” - outputs of trials which we believe are time and space insensitive. With a large enough dataset, it is likely that there exist functional forms where one or more values would “predict” another value. However, if we start with a hypothesis and then go to that dataset, we are much less likely to identify this meaningless relationship. With that said, the more we interact with the dataset, the more likely we are to form a hypothesis about that relationship. The extreme version would be approaching the dataset from a data-first approach and looking for relationships, which would eventually uncover this function. However, our confidence that we have identified causality should be negligible at best.
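A sketch of the noise dataset described above (the sizes are arbitrary choices for illustration): every column is independent noise, yet a data-first search across all pairs of columns still turns up an apparently strong relationship.

```python
# Every column is pure noise, yet searching across all column pairs still
# finds a correlation strong enough to look "significant" to a naive test.
import numpy as np

rng = np.random.default_rng(3)
noise = rng.normal(size=(50, 200))                 # 50 trials, 200 unrelated values

correlations = np.corrcoef(noise, rowvar=False)    # 200 x 200 correlation matrix
np.fill_diagonal(correlations, 0.0)                # ignore each column with itself
strongest = float(np.abs(correlations).max())

print(f"strongest correlation found among pure noise: {strongest:.2f}")
# The relationship is an artifact of searching ~20,000 pairs, not of causality.
```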
II - Inoculation against Corruption
A general way to mitigate the problem of overfitting is by being able to test a hypothesis against unseen data. There are two general ways to gain access to uncorrupted data. The first is with the use of a validation dataset, which is when a part of the historical population is held back and not used in the initial iterative process of hypothesis to data to hypothesis (and so forth). There are a couple of practical problems with the use of a validation dataset. The primary one is that if the population is already small, segmenting the dataset may even further inhibit our ability to identify true causality / produce effective metaphor. Additionally, if we come to a hypothesis on our test data which appears to be a good fit but does not replicate in the validation dataset, it is difficult to proceed, now that we have corrupted the validation dataset.
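A sketch of the validation dataset idea, with invented data and two illustrative hypotheses (a simple linear fit and a deliberately over-flexible one): the over-flexible hypothesis fits the seen portion better but typically does worse on the held-back portion.

```python
# Holding back a validation set exposes overfitting: the flexible hypothesis wins
# on the data it was tuned against but loses on the untouched data.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=60)
y = 2 * x + rng.normal(0, 4, size=60)          # invented "true" relationship plus noise

seen_x, seen_y = x[:40], y[:40]                # data we iterate hypotheses against
held_x, held_y = x[40:], y[40:]                # validation data, consulted once

for degree in (1, 15):                         # simple vs. deliberately over-flexible
    hypothesis = Polynomial.fit(seen_x, seen_y, degree)
    seen_error = float(np.mean((hypothesis(seen_x) - seen_y) ** 2))
    held_error = float(np.mean((hypothesis(held_x) - held_y) ** 2))
    print(f"degree {degree:2d}: error on seen data {seen_error:6.1f}, on held-back data {held_error:6.1f}")
```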
The alternative approach is to test the hypothesis against new data, essentially creating a validation dataset. This approach has different problems. Firstly, it may be difficult or impractical to create or capture new data, or it may take a long time for the data to be produced. Additionally, insofar as our hypothesis is sensitive to variation in unseen starting conditions, and the unseen starting conditions are dependent on time and space, it may be difficult to discern whether our hypothesis is a good one in the long term just because it performs poorly over a particular time horizon.
The unsatisfying conclusion is that there is no perfect approach, short of being lucky and finding a hypothesis that fits the data well immediately, in which case we ought to be more confident it is a good one. Insofar as we cannot protect ourselves against corruption, the general point is that the more we test our hypothesis against corrupted data, the higher our standards should be for fit. This is because the more we learn about the dataset, the more our hypothesis will be influenced by the particulars of the specific dataset we are testing against, as opposed to an objective view of what might be true. The final conclusion is that simple fit with data does not communicate the quality of a hypothesis - the process by which that hypothesis was reached should influence our evaluation of its truth.
Conclusion
This essay was broadly inspired as a response to the use of particular language in statistics, such as the notions of “infinite populations” and the popular frequentist interpretation of probability, which can mislead about what statistics can and cannot teach us. My intention was to explore deeper first principles of statistics that I do not often hear discussed, such as defining a population for a trial being predicted, the importance of time and space insensitivity for a trial, subjective versus objective randomness, and dataset corruption.
There are robust areas for extension on this essay. A general problem I have ignored is how to update a prediction of an outcome after a trial begins. This includes the period between the trial beginning and the outcome coming to pass as well as how to form a view on an unknown outcome which has already happened. An interesting observation I would make is that insofar as objective randomness stems from not knowing what the starting conditions are at the beginning of a trial, for a sufficiently long trial duration, we may be able to identify what the starting conditions were and update accordingly, decreasing our experience of randomness.
Additionally, having demonstrated that the bag implied by a population may not be the best predictor for a trial, I would suggest more general exploration of the conditions under which alternative bag constructions might be most appropriate. This includes continuity assumptions, use of the bag of a trial with similar but different starting conditions, and the use of theoretical models such as a random walk.
I would also observe that statistics is fundamentally a stopgap to compensate for our limited understanding of reality - put differently, the ways in which we fall short of Laplace’s demon. As such, understanding how we can use insights from science, particularly physics, to understand causality and the nature of objective randomness (particularly quantum mechanics) can help us form better hypotheses.
Finally, this essay has some implications for randomised controlled trials. Even if we control all observed starting conditions, if the trial is time / space sensitive and occurs in a fairly constrained time / place, the relationship may not carry over well, or at all, to different times and places, due to systematic differences in unobserved starting conditions. This suggests that conclusions from randomised controlled trials for new medicine would be more trustworthy than psychological studies, if we believe that response to medicine is minimally more robust to time and place than, for example, strategies in simulated game theory.
I’ve worked on this in a silo for a long time, and have struggled to receive feedback (though I am immensely grateful to those who have provided it). I would appreciate your thoughts or links to writing you’re aware of which discusses similar topics. If you’ve made it this far, thank you for taking the time to read the outcome of a year and a half of blood, sweat, and tears. I doubt you know how much it means to me.
1. Kolmogorov, Foundations of the Theory of Probability, 1933
2. Littlewood, Littlewood’s Miscellany, 1953
3. Blom, Probability and Statistics, 1980, pg. 3
4. Laplace, A Philosophical Essay on Probabilities
5. Blom, Probability and Statistics, pg. 10