Average class size paradox
If you visit a University website and check its statistics, you will probably find a measure called “average class size”. It is a simple average, but it can be misleading.
The same quantity can look very different depending on whose perspective you take. Consider this example:
In a tiny University, there are only two classes. One is a large course with 90 freshmen, while the other is a small one with 10 seniors.
What is the average class size?
It depends on who you ask. If you ask the University, the answer is 50: (90 + 10) / 2. That is correct, but it can mislead students. For the University a course is a course, and every course is weighted equally.
If you ask a student, the answer will be totally different. A senior will say that 50 is way too many because they only have 10 people in a class. A freshman will say that 50 is not true when they are sitting with 89 people around them in every class.
90% of the University students feel that they have bigger classes than the average.
To understand the student perspective we need to do a weighted average calculation.
90 people said that the class size is 90 and 10 said that the class size is 10. So each class size has to be counted once per student who reports it: (90 × 90 + 10 × 10) / 100 = 8200 / 100. With 100 students, the “real average from the student perspective” is 82.
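The two averages can be sketched in a few lines of Python. The variable names here are made up for illustration; only the class sizes come from the example above.

```python
# Class sizes from the two-class example (variable names are illustrative).
class_sizes = [90, 10]

# University view: every class counts once.
university_average = sum(class_sizes) / len(class_sizes)

# Student view: every student reports the size of their own class,
# so a class of size n is counted n times.
total_students = sum(class_sizes)
student_average = sum(n * n for n in class_sizes) / total_students

print(university_average)  # 50.0
print(student_average)     # 82.0
```

The only difference between the two lines is the weight: one per class in the first case, one per student in the second.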
You always hear from friends: “Ohh the beach is always so crowded. There are thousands of people there.” Then the statistics say that the average number of visitors is only a few hundred. Now you know the explanation: when the beach is not crowded, your friends are not there to notice it.
Friendship paradox
A similar pattern can occur with friends.
Red has 1 friend, Yellow has 3, and Green and Blue have 2 friends each. This means 8 friends divided by 4 = 2 friends on average.
But what if we interview each color and ask how many friends their friends have? Here are the answers:
R: Y has 3
Y: R has 1, G has 2, and B has 2.
B: Y has 3, and G has 2
G: Y has 3, and B has 2.
If we add up the numbers, the result is 18, and 8 names were reported (note: we are not dividing by 4; we divide by the number of names in the answers).
18/8 equals 2.25 (>2). On average friends have more friends than the colors themselves.
Let’s take a closer look at the answers. Y occurs 3 times, R occurs once, and G and B occur 2 times each. The number of occurrences equals each color’s number of friends. So we are back to the weighted average calculation: more popular friends are listed more often and contribute more to the score.
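The interview can be replayed in Python as a rough sketch. The dictionary below encodes the four-color friendship graph from the answers above; the variable names are illustrative.

```python
# Friendship graph from the example: each color maps to its list of friends.
friends = {
    "R": ["Y"],
    "Y": ["R", "G", "B"],
    "G": ["Y", "B"],
    "B": ["Y", "G"],
}

# How many friends each color has.
friend_counts = {person: len(fs) for person, fs in friends.items()}
mean_friends = sum(friend_counts.values()) / len(friend_counts)

# Every color reports the friend count of each of its friends,
# so a color with k friends appears k times in this list.
reported = [friend_counts[f] for fs in friends.values() for f in fs]
mean_friends_of_friends = sum(reported) / len(reported)

print(mean_friends)             # 2.0
print(mean_friends_of_friends)  # 2.25
```

Note that the second mean divides by the number of reported names (8), not by the number of colors (4).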
Explained through code
Let’s create some Python code to observe this in action. You can find the code at the bottom of this post!
This code represents a Probability Mass Function (PMF), which is a way to describe the probability of different numbers of friends in the population.
The _normalize method calculates the total number of people and divides each count by the total to get a probability distribution. As a smaller example, take the first three entries of the distribution used in the code, {2: 5, 5: 8, 10: 14}, where the keys are the number of friends and the values are the number of people who have that many friends. The total number of people is 5 + 8 + 14 = 27. So, the probability for each friend count is:
P(2 friends) = 5 / 27
P(5 friends) = 8 / 27
P(10 friends) = 14 / 27
The Mean method calculates the mean number of friends in the actual distribution.
The Bias method creates a biased version of the PMF, where people with more friends are sampled more often. This is done by multiplying the probability by the number of friends, just like we did in the examples above.
Then we finally plot the results:
Mean number of friends (actual): 12.60. This tells us, “If you randomly pick a person, how many friends do they have?”
Mean number of friends (observed by friends): 16.69. This tells us, “If you randomly pick a friend from someone’s friend list, how many friends do they have?” People with more friends show up more often in this count.
The actual average number of friends is lower than the perceived number of friends.
PMF helps to compare the two distributions easily. The biased PMF is always shifted to the right (towards bigger numbers), meaning the observed friend count is higher.
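One way to see why the shift goes to the right: weighting each probability by the friend count k makes the observed mean equal to E[k²] / E[k], which is the same as mean + variance / mean. As long as friend counts are positive, that is never below the actual mean, and it equals the actual mean only when everyone has the same number of friends. The sketch below checks this identity on the distribution used in this post; the variable names are illustrative.

```python
# Friend distribution from the post: {number of friends: number of people}.
friend_distribution = {2: 5, 5: 8, 10: 14, 15: 6, 20: 10, 25: 5}

total = sum(friend_distribution.values())

# First and second moments of the actual distribution.
mean = sum(k * n for k, n in friend_distribution.items()) / total
second_moment = sum(k * k * n for k, n in friend_distribution.items()) / total
variance = second_moment - mean ** 2

# Mean of the biased distribution, where each probability is weighted by k.
observed_mean = second_moment / mean

print(f"{mean:.2f}")           # 12.60
print(f"{observed_mean:.2f}")  # 16.69

# The gap between the two means is variance / mean.
assert abs(observed_mean - (mean + variance / mean)) < 1e-9
```

The more spread out the friend counts are, the larger the gap between the actual and the observed average.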
Code:
import numpy as np
import matplotlib.pyplot as plt


class Pmf:
    """A simple class for Probability Mass Function (PMF) operations."""

    def __init__(self, data, label=''):
        self.label = label
        self.pmf = self._normalize(data)

    def _normalize(self, data):
        total = sum(data.values())
        return {k: v / total for k, v in data.items()}

    def Mean(self):
        return sum(k * p for k, p in self.pmf.items())

    def Bias(self):
        """Bias the PMF so that people with more friends are sampled more often."""
        biased_pmf = {k: v * k for k, v in self.pmf.items()}  # Multiply by k (friend count)
        total = sum(biased_pmf.values())
        return {k: v / total for k, v in biased_pmf.items()}

    def Plot(self, biased_pmf):
        x = list(self.pmf.keys())
        y_actual = list(self.pmf.values())
        y_biased = list(biased_pmf.values())
        plt.bar(x, y_actual, alpha=0.6, label="Actual Distribution")
        plt.bar(x, y_biased, alpha=0.6, label="Observed Distribution (biased)", color='red')
        plt.xlabel("Number of Friends")
        plt.ylabel("Probability")
        plt.legend()
        plt.show()


# Define a hypothetical friend distribution
friend_distribution = {
    2: 5,    # 5 people have 2 friends
    5: 8,    # 8 people have 5 friends
    10: 14,  # 14 people have 10 friends
    15: 6,   # 6 people have 15 friends
    20: 10,  # 10 people have 20 friends
    25: 5    # 5 people have 25 friends
}

pmf = Pmf(friend_distribution, label="actual")
biased_pmf = pmf.Bias()

print(f"Mean number of friends (actual): {pmf.Mean():.2f}")
print(f"Mean number of friends (observed by friends): {sum(k * v for k, v in biased_pmf.items()):.2f}")

# Plot actual vs biased distribution
pmf.Plot(biased_pmf)