How to quantify "married people are less likely than the average person to smoke" and "smokers are less likely than the average person to be married"?
When I first saw inequalities (1) and (2) below, I quantified them as:
$\color{red}{\text{1.1. married people < all smokers}}$
$\color{red}{\text{2.1. married smokers < all married people}}$
I can see that my guesses differ from the author's: I'm comparing raw counts, while the author below compares fractions. But why exactly does that make my guesses wrong?
When you’re comparing two binary variables, correlation takes on a particularly simple form. To say that marital status and smoking status are negatively correlated, for example, is simply to say that married people are less likely than the average person to smoke. Or, to put it another way, smokers are less likely than the average person to be married. It’s worth taking a moment to persuade yourself that those two things are indeed the same! The first statement can be written as an inequality
$$\color{limegreen}{\text{married smokers / all married people < all smokers / all people}} \tag{1}$$
and the second as
$$\color{limegreen}{\text{married smokers / all smokers < all married people / all people}} \tag{2}$$
If you multiply both sides of each inequality by the common denominator (all people) × (all smokers) you can see that the two statements are different ways of saying the same thing:
$$(\text{married smokers}) \times (\text{all people}) < (\text{all smokers}) \times (\text{all married people}) \tag{3}$$
In the same way, if smoking and marriage were positively correlated, it would mean that married people were more likely than average to smoke and smokers more likely than average to be married.
Ellenberg, How Not to Be Wrong (2014), pages 347-8.
1 answer
Let's consider 2.1 first. You said that married smokers < married people. That is always true (unless every married person smokes, in which case the two would be equal), because anyone who is a "married smoker" is also a "married person". But since it is always true, it doesn't really give any useful information.
However, in the actual inequality (2), the denominators are different: the left side divides by all smokers, and the right side divides by all people. Since not everyone smokes, those are different numbers. Your 2.1 simply drops both denominators, which is not a valid step because the denominators are not equal.
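For contrast, here is a sketch of what a valid way of clearing the denominators in (2) looks like. It assumes there is at least one smoker and at least one person, so both denominators are positive and multiplying through by (all smokers) × (all people) preserves the direction of the inequality:

$$\frac{\text{married smokers}}{\text{all smokers}} < \frac{\text{all married people}}{\text{all people}} \iff (\text{married smokers}) \times (\text{all people}) < (\text{all married people}) \times (\text{all smokers})$$

Each denominator moves to the other side as a factor rather than disappearing. Erasing both would only leave the comparison unchanged if all smokers = all people.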
Your 1.1 seems to have a different error: it tries to compare the number of married people with the number of smokers directly, and nothing in the statement tells us how those two compare. It's possible, for instance, that 50% of people are married, 10% of married people smoke, and 30% of unmarried people smoke. Then married people > all smokers, so 1.1 is false, yet married smokers / all married people is 10% and all smokers / all people is 20% (check this yourself), so inequality (1) still holds.
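If you would rather check the numbers by machine, here is a minimal Python sketch of that example. The population size of 1000 is my own assumption, chosen only to make the percentages come out as whole numbers, and the variable names are just illustrative:

```python
# Hypothetical population of 1000 people matching the percentages above.
all_people = 1000
married = all_people * 50 // 100             # 50% are married
unmarried = all_people - married
married_smokers = married * 10 // 100        # 10% of married people smoke
unmarried_smokers = unmarried * 30 // 100    # 30% of unmarried people smoke
all_smokers = married_smokers + unmarried_smokers

# Guess 1.1 compares raw counts: married people < all smokers.
guess_1_1 = married < all_smokers

# Inequality (1), cross-multiplied so that no division is needed:
# married smokers / married < all smokers / all people
inequality_1 = married_smokers * all_people < all_smokers * married

print(all_smokers / all_people)   # overall smoking rate
print(guess_1_1, inequality_1)
```

With these numbers the overall smoking rate is 0.2, `guess_1_1` is `False`, and `inequality_1` is `True`, so the raw-count comparison and the fraction comparison really do disagree.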
I wonder if you meant to write married smokers < all smokers, which is the same kind of error as for equation 2.1 if you switch the two categories.
For trying to intuit the equations, I personally think percentages are easiest. But if you find comparing numbers of people easier, I might suggest using married smokers < married * (smokers / all people). You can think of the right side as the "naive guess" of how many married people are smokers, if you used the fraction of all people who smoke. The inequality, then, says that the actual number of married smokers is less than the number who would smoke if the two categories were independent.
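As a rough illustration of that "naive guess", here is a small Python sketch. The function name `expected_if_independent` and the counts (1000 people, 500 married, 200 smokers, 50 married smokers) are hypothetical, picked to match the 50% / 10% / 30% example earlier in this answer:

```python
def expected_if_independent(married, smokers, all_people):
    """Naive guess: number of married people times the overall smoking rate."""
    return married * smokers / all_people


# Hypothetical counts: 1000 people, 500 married, 200 smokers, 50 married smokers.
all_people, married, smokers, married_smokers = 1000, 500, 200, 50

naive = expected_if_independent(married, smokers, all_people)

# Negative correlation: fewer married smokers than the naive guess predicts.
negatively_correlated = married_smokers < naive

print(naive, negatively_correlated)
```

Here the naive guess is 100 married smokers, while the actual count is 50, so the inequality holds and the two categories are negatively correlated in this made-up population.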