# How to quantify "married people are less likely than the average person to smoke", "smokers are less likely than the average person to be married"?

When I first saw inequalities (1) and (2) below, I quantified them as:

$\color{red}{\text{1.1. married people < all smokers}}$

$\color{red}{\text{2.1. and married smokers < all married people}}$.

I gather my guesses are wrong because I'm comparing quantities, whereas the author below compares fractions. But why exactly does that make my guesses wrong?

When you’re comparing two binary variables, correlation takes on a particularly simple form. To say that marital status and smoking status are negatively correlated, for example, is simply to say that married people are less likely than the average person to smoke. Or, to put it another way, smokers are less likely than the average person to be married. It’s worth taking a moment to persuade yourself that those two things are indeed the same! The first statement can be written as an inequality

$$\color{limegreen}{\text{married smokers / all married people < all smokers / all people}} \tag{1}$$

and the second as

$$\color{limegreen}{\text{married smokers / all smokers < all married people / all people}} \tag{2}$$

If you multiply both sides of each inequality by the common denominator (all people) × (all smokers) you can see that the two statements are different ways of saying the same thing:

$$\text{(married smokers)} \times \text{(all people)} < \text{(all smokers)} \times \text{(all married people)} \tag{3}$$

In the same way, if smoking and marriage were positively correlated, it would mean that married people were more likely than average to smoke and smokers more likely than average to be married.

Ellenberg, *How Not to Be Wrong* (2014), pages 347-8.

## 1 answer

Let's consider 2.1 first. You said that **married smokers** < **married people**. That is always true (unless every married person smokes, in which case the two would be equal), because anyone who is a "married smoker" is also a "married person". But since it is always true, it doesn't really give any useful information.

However, in the actual inequality (2), the denominators are different. The left side divides by **all smokers**, and the right side divides by **all people**. Since not everyone smokes, those are different numbers. Inequality 2.1 drops both denominators, which is not a valid step because the denominators are not equal.

Inequality 1.1 has a different error: it compares the number of married people with the number of smokers, and (1) says nothing about that comparison. Suppose, for instance, that 50% of people are married, 10% of married people smoke, and 30% of unmarried people smoke. Then **married** > **smokers**, so 1.1 fails, yet **married smokers** / **married people** is 10% and **all smokers** / **all people** is 20% (check this yourself), so (1) holds.
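In case the 20% figure is not obvious, here is the arithmetic written out. The symbols $M$ ("is married") and $S$ ("smokes") are my own shorthand, not the book's:

$$\Pr(S) = \Pr(M)\,\Pr(S \mid M) + \Pr(\text{not } M)\,\Pr(S \mid \text{not } M) = 0.5 \times 0.10 + 0.5 \times 0.30 = 0.20$$

The 10% is already a fraction of married people, so it is not divided by the 50% again; the 50% only enters as the weight on each group.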

I wonder if you meant to write **married smokers** < **all smokers**, which is the same kind of error as in inequality 2.1 with the two categories switched.

For trying to intuit the equations, I personally think percentages are easiest. But if you find comparing numbers of people easier, I might suggest using **married smokers < married * (smokers / all people)**. You can think of the right side as the "naive guess" of how many married people are smokers, if you used the fraction of all people who smoke. The inequality, then, says that the actual number of married smokers is less than the number who would smoke if the two categories were independent.
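If you want to check the count version against the fraction versions, here is a minimal Python sketch. It uses the hypothetical 50% / 10% / 30% example with an assumed population of 1000 (any population size gives the same comparisons); the function name `compare` and the dictionary keys are just labels I made up for illustration.

```python
from fractions import Fraction


def compare(married_smokers, married, smokers, everyone):
    """Evaluate the naive guess and the equivalent forms of the inequality."""
    # "Naive guess": how many married people would smoke if married people
    # smoked at the same rate as the population as a whole.
    naive_guess = married * Fraction(smokers, everyone)
    return {
        "naive_guess": naive_guess,
        # (1) married smokers / all married people < all smokers / all people
        "ineq_1": Fraction(married_smokers, married) < Fraction(smokers, everyone),
        # (2) married smokers / all smokers < all married people / all people
        "ineq_2": Fraction(married_smokers, smokers) < Fraction(married, everyone),
        # (3) the cross-multiplied form
        "ineq_3": married_smokers * everyone < smokers * married,
        # count form: actual married smokers < naive guess
        "count_form": married_smokers < naive_guess,
    }


# Assumed population of 1000: 500 married (10% smoke), 500 unmarried (30% smoke).
everyone = 1000
married = 500
married_smokers = 50        # 10% of 500
smokers = 50 + 150          # married smokers + unmarried smokers

result = compare(married_smokers, married, smokers, everyone)
print(result["naive_guess"])  # 100
print(result)
```

All four comparisons come out true together here, while **married** (500) is still greater than **smokers** (200), which is why 1.1 is not the same statement as (1).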

#### 4 comments

Hi! Thanks for your replies! "I wonder if you meant to write married smokers < all smokers," No, I didn't. The author's sentence is "To say that marital status and smoking status are negatively correlated, for example, is simply to say that **married people [emphasis mine]** are less likely than the **average person to smoke [emphasis mine]**." I bolded the phrases that I extracted into my inequality, but I know that "average person to smoke" $\neq$ all smokers.

"Then married > smokers, but now married smokers / married people is 10%, and all smokers / all people is 20% (check this yourself)." Can you please show the steps? Married smokers/married people = 10%/50% = 20%, but you wrote "10%"? "all smokers / all people" = (10% + 30%)/all people. But what number do I use for "all people"?

(1) Ah, I see. The words "to smoke" apply to both "married" and "average person". In probability notation it's Pr(smokes | married) < Pr(smokes), where Pr(X | Y) means "probability of X given Y is true". (Some sources just use "P" instead of "Pr".)

(2) We don't know the number of people, but for most of the calculations it doesn't matter. Let's say there are 1000. 50% are married, so 500 are married and 500 are not. 10% **of the 500 married people** smoke, which is 50. 30% **of the 500 non-married people** smoke, which is 150. Total smokers = 50 + 150 = 200, which is 20% of the entire population.
