This is part of a series where I teach statistics from the ground up with no expectation of prior knowledge. Links to the whole series can be found at the bottom of this post.
Clearly defining the term data is surprisingly difficult. Merriam-Webster provides this definition,
factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation
Frankly, this definition makes my eyes glaze over. It might be true, but I find it vague and unhelpful. Cambridge Dictionary is slightly better,
information, especially facts or numbers, collected to be examined and considered and used to help decision-making
This is a more opinionated (and useful) definition, but I think I can do better, which I will attempt in this post. My definition is practical and tailored to the particular context of statistics.
Data Is Historical
This is a more opinionated statement than it might appear at first. Many of the examples of data that come to mind will tidily meet this criterion, such as sports data, political polling, weather phenomena, etc. In fact, this may even seem tautological. Of course data has to be historical, otherwise how could we have it?
Importantly, this aspect of the definition excludes synthetic data - or put more bluntly, made-up data. There is technically nothing objectionable about applying statistical techniques to fabricated data, and it may even be a useful exercise when considering hypothetical scenarios. However, I am explicitly omitting synthetic data from this definition. Data must reflect something which actually happened.
Data Is Perception
It is a fundamental point in philosophy that it is impossible to experience anything objectively. Our experience of the world is subjective, filtered through (at minimum) our senses and processed in our brain. Data is an even more extreme example of this. Consider the steps of perception inherent in trying to capture just a single person’s opinion on a subject.
As we cannot directly access someone’s opinion, we must design an approach to elicit what the opinion of the person is.
We will need to apply our own perception to label the person’s opinion, based on the outcome of our approach.
We need to then record our perception in some form.
If someone else is using our data, they will have to apply their own perception to our records.
Each of these steps represents a level of abstraction away from the truth [1], and by extension, each step is a failure point where the representation may be distorted from reality. Consider various problems which could arise,
If we need to ask the person to state their opinion, they may not actually know what it is.
Simply by asking someone their opinion, we might change it [2].
The person may feel self-conscious and be unwilling to state their opinion if they expect judgment or other repercussions.
It may be ambiguous what someone’s opinion is. For instance, if we are not asking someone their opinion but attempting to observe it (so-called revealed preferences), the evidence may be inconclusive. Say someone attended a political rally for a candidate in the past week - we can’t be certain whether this means they support or intend to vote for that candidate.
As information passes between people, each step presents opportunities for miscommunication.
Data Is Compression
Continuing with the opinion example, imagine that you recorded an entire interview with a person. I am explicitly stating that the recording itself is not data [3].
An undercurrent of how I am defining data is that it must be useful. The fundamental purpose of statistics is comparison, and so we need to convert the messiness and complexity of the real world into a compressed and homogenized form to facilitate this.
There are plenty of ways our interview could be compressed. If we are studying speech patterns, we would include different details than if we were studying political opinions. The important point is that the process of compression definitionally demands omitting information. As such, we must be thoughtful about what we choose to exclude, because that information will be unavailable moving forwards.
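To make the idea of compression concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the transcript is invented (so by this post's own definition it is not data), and the record types, word lists, and function names are hypothetical. It shows the same interview being compressed in two different ways depending on what is being studied.

```python
from dataclasses import dataclass

# Invented transcript, for illustration only (not a real interview).
transcript = "Um, I think, um, I will probably vote for Candidate X, I guess."

# Hypothetical word lists; a real study would define these carefully.
FILLERS = {"um", "uh"}
HEDGES = {"probably", "guess", "maybe"}


@dataclass
class SpeechRecord:
    """What a speech-pattern study might keep."""
    word_count: int
    filler_count: int


@dataclass
class OpinionRecord:
    """What a political-opinion study might keep."""
    mentions_candidate: bool
    hedged: bool


def tokenize(text):
    # Lowercase each word and strip surrounding punctuation.
    return [word.strip(".,!?").lower() for word in text.split()]


def compress_for_speech(text):
    tokens = tokenize(text)
    fillers = sum(token in FILLERS for token in tokens)
    return SpeechRecord(word_count=len(tokens), filler_count=fillers)


def compress_for_opinion(text, candidate):
    tokens = tokenize(text)
    return OpinionRecord(
        mentions_candidate=candidate.lower() in text.lower(),
        hedged=any(token in HEDGES for token in tokens),
    )


speech = compress_for_speech(transcript)
opinion = compress_for_opinion(transcript, "Candidate X")
print(speech)
print(opinion)
```

Neither record can be turned back into the transcript, and neither contains what the other kept. If the speech study later wanted to know which candidate was mentioned, that information would already be gone.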
Data Is Quantitative
For a dataset to be useful to us, there needs to be a quantitative element. This is simplest if we are recording something numerically, such as the distance a football is thrown or the time it takes for something to occur. If we are instead observing something which does not include an explicitly quantitative element (such as the gender of politicians), our data must include a level of aggregation (such as the number of elected politicians of a particular gender). This is simply because statistical methods are quantitative, and so data must be numeric to be useful.
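As a rough sketch of what aggregation looks like in practice, the Python snippet below turns non-numeric observations into counts and shares. The records are invented placeholders rather than real election results, and the helper name `count_by` is hypothetical.

```python
from collections import Counter

# Invented placeholder records, for illustration only (not real results).
# Each record describes one elected politician with a non-numeric attribute.
elected = [
    {"name": "Politician A", "gender": "female"},
    {"name": "Politician B", "gender": "male"},
    {"name": "Politician C", "gender": "female"},
    {"name": "Politician D", "gender": "male"},
    {"name": "Politician E", "gender": "female"},
]


def count_by(records, key):
    """Aggregate categorical records into a count per category."""
    return Counter(record[key] for record in records)


counts = count_by(elected, "gender")
total = sum(counts.values())

# Shares are counts divided by the number of records.
shares = {category: n / total for category, n in counts.items()}

print(counts)
print(shares)
```

A single record such as "Politician A is female" has nothing numeric in it. The quantitative element only appears once the records are counted.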
If I were to write a succinct definition, I would define data as,
quantitative observations from the past
It is hopefully clear that what constitutes data is completely arbitrary. There is an almost limitless number of ways we could produce data about a particular moment or period in history. I will observe that I have been lazy about whether data is singular or plural and use the term both ways. For instance, quantitative observations about a single football throw are technically data but are not very useful on their own (as there is no relevant point of comparison), whereas quantitative observations about 100,000 football throws could be very useful. When I wish to be precise, I will use the term dataset to refer to a collection of comparable observations and the term data point to refer to an individual observation.
Statistics in Plain English series,
Part 1: The Roots are Shallow
Part 2: What Is Data?
Part 3: The Death of Probability
[1] I use the word truth in a lazy and intuitive way in this post. I will leave it to the philosophers to define that term.
[2] For instance, up until an embarrassing age, I had never seriously thought about how mail was delivered. One day I realized I implicitly believed mail literally flew to mailboxes (presumably because of stamps I had seen that depicted a flying envelope). As soon as I realized I believed that, the belief changed because it was clear it made no sense.
[3] Technically the recording could be data, if what we are interested in is sound waves or something agnostic to the contents of the recording. Regardless, I hope the general point is clear.