Tuesday, December 30, 2014

An easier approach to solving conditional probability problems, one that schoolkids can understand

A problem similar to the one described below may appear in many common situations (making decisions, assessments, judgments). Although we face such problems quite often, our intuition nevertheless fails us. The problem is taken from the book "Thinking, Fast and Slow" by D. Kahneman, winner of the Nobel Memorial Prize in Economics (2002), who in collaboration with A. Tversky described common human cognitive biases.
This particular bias is called the base rate fallacy (also known as base rate neglect).

The 'canonical' approach to solving conditional-probability problems is Bayes' formula, which should be familiar to those who took statistics in college. The method is rather error-prone: people often make mistakes assigning prior and posterior probabilities and/or normalizing properly.
The "tree" (combinatorial) method proposed here is much more intuitive (especially to those familiar with basic programming) and does not depend on the order of partitioning. This will be demonstrated by solving the problem in multiple ways, without even distinguishing between the two semantically different problems described by Kahneman.

First, let's formulate the problem in its original terms, although 'taxi' could be replaced by people, and colors by gender, income, social position, nationality, etc. Those are just 'semantic sugar', as programmers would say. We'll pay more attention to the problem's structure later.
 
"A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. 85% of the cabs in the city are Green and 15% are Blue.
A witness identified the cab as Blue. The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time.
What is the probability that the cab involved in the accident was Blue rather than Green?"

The most frequent answer (including among those who have studied statistics) is 80%. Which is incorrect.

Let's draw a tree of all possible combinations (as if we were programming all the conditions with 'if-else'), writing down the percentages we know. On the 2nd level of branches, enumerate all possible combinations as well. On the terminal nodes (on the right), let's write down which color the witness saw:




                         |----80%----> Green
      |----85%----> Green|
      |                  |-----20%---> Blue
      |
------|
      |
      |                  |----80%---> Blue
      |----15%---> Blue  |
                         |----20%---> Green
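The tree above can be enumerated programmatically, just as if we had written the conditions with 'if-else'. Here is a minimal Python sketch (the variable names are mine, not from the original problem):

```python
# Enumerate every branch of the tree as (actual color, reported color, weight).
branches = []
for actual, p_actual in [("Green", 0.85), ("Blue", 0.15)]:
    for correct, p_correct in [(True, 0.80), (False, 0.20)]:
        # A correct witness reports the actual color; a wrong one reports the other.
        reported = actual if correct else ("Blue" if actual == "Green" else "Green")
        branches.append((actual, reported, p_actual * p_correct))

for actual, reported, weight in branches:
    print(f"actually {actual}, witness saw {reported}: {weight:.2f}")
```

The four weights (0.68, 0.17, 0.12, 0.03) sum to 1, as the full sample space should.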



So far, we haven't used the fact that the witness saw a 'Blue' taxi.

This information narrows down our space of all possible outcomes (often called the 'sample space').
We can cross out the cases that did not occur, which leaves our tree "pruned":



                         |--
      |----85%----> Green|
      |                  |-----20%---> Blue
      |
------|
      |
      |                  |----80%---> Blue
      |----15%---> Blue  |
                         |--


Note that if we had crossed out branches whose percentages were not yet written in, we would want to fill in the remaining ones as complements (to 100%); in this case we already had them all written down.

At this point we could also write the total weight of the terminal branches by multiplying 85%*20% and 15%*80%, but we will do it later in one shot.

Let me restate the question: "What is the probability that the cab involved in the accident was Blue rather than Green?"

The case that matches the question is the bottom branch ({15% → Blue, 80% → Blue}). The probability is the weight of the wanted outcome divided by the total weight of the remaining outcomes:



<The answer> = (15%*80%) / (85%*20% + 15%*80%) = 0.15*0.8 / (0.85*0.2 + 0.15*0.8) = 0.41379... ≈ 41%
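The pruning and normalization above can be written as one short calculation. A sketch in Python (my own variable names):

```python
# Priors: shares of Blue and Green cabs in the city.
p_blue, p_green = 0.15, 0.85
# Witness reliability: correct 80% of the time, wrong 20%.
p_correct, p_wrong = 0.80, 0.20

# Keep only the branches where the witness reported "Blue":
w_blue_said_blue = p_blue * p_correct    # actually Blue, correctly reported
w_green_said_blue = p_green * p_wrong    # actually Green, misreported as Blue

# Normalize over the surviving branches.
answer = w_blue_said_blue / (w_blue_said_blue + w_green_said_blue)
print(round(answer, 4))  # 0.4138
```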


-----------------------------------------------------

We could also write the products directly into the tree: 0.85*0.20 = 0.17 and 0.15*0.80 = 0.12:

                         |--
      |----85%----> Green|
      |                  |-----20%---> Blue  (0.17)
      |
------|
      |
      |                  |----80%---> Blue (0.12)
      |----15%---> Blue  |
                         |--
<The answer> = 0.12/(0.17+0.12)= ~ 41%

-----------------------------------------------------

Another approach is to start from a different partitioning (prior). In this case we should remember that the colors on the right (at the tree leaves) are the real colors, not the ones the witness reported. In parentheses, let's write the color the witness would report in each situation:


                         |----85%----> actually Green (he said Green)
      |----80%----> Right|
      |                  |-----15%---> actually Blue (he said Blue)
      |
------|
      |
      |                  |----85%---> actually Green (he said Blue)
      |----20%---> Wrong |
                         |----15%---> actually Blue (he said Green)


Again, pruning the branches inconsistent with the observation (the witness said Blue):
                         |--
      |----80%----> Right|
      |                  |-----15%---> actually Blue (he said Blue) (0.12)
      |
------|
      |
      |                  |----85%---> actually Green (he said Blue) (0.17)
      |----20%---> Wrong |
                         |--
 
<The answer> = 0.12/(0.17+0.12)= ~ 41%
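The same calculation with the partition order reversed (reliability first, as in the second tree) gives the identical result, which is the point about order independence. A sketch:

```python
# Start from witness reliability instead of cab colors.
p_right, p_wrong = 0.80, 0.20
p_green, p_blue = 0.85, 0.15

# Branches where the witness said "Blue":
w_right_blue = p_right * p_blue    # witness right AND cab actually Blue
w_wrong_green = p_wrong * p_green  # witness wrong AND cab actually Green

answer = w_right_blue / (w_right_blue + w_wrong_green)
print(round(answer, 4))  # 0.4138 - same as before
```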
--------------------------------------------------------------------------------------------------------

A shorter method (the method of strings), which I use when I don't have space to draw the trees:

prior (any!)
G85
B15
now append the event at the end (what the witness reported):
G85-G80
G85-B20
B15-B80
B15-G20
we can compute all the weights at this step, though we could also skip the unnecessary calculations, as in the previous example:
G85-G80 0.68
G85-B20 0.17
B15-B80 0.12
B15-G20 0.03
trimming impossible situations (keep only the strings ending with B, since only Blue was witnessed):
G85-B20 0.17
B15-B80 0.12

now we are interested in the probability of the state with B on the left side (its weight relative to all remaining states):

<The answer> = 0.12/(0.17+0.12)= ~ 41%
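The string method translates almost mechanically into code: build all prior-then-observation strings, weight them, prune those inconsistent with the observation, and normalize. A sketch, assuming the same G/B labeling as above:

```python
prior = {"G": 0.85, "B": 0.15}   # any prior partitioning works
accuracy = 0.80                  # witness is right 80% of the time

# Build all strings like "G85-B" with their weights.
strings = {}
for actual, p in prior.items():
    for reported in ("G", "B"):
        w = p * (accuracy if reported == actual else 1 - accuracy)
        strings[f"{actual}{round(p * 100)}-{reported}"] = w

# Prune: the witness reported Blue, so keep only strings ending in "B".
kept = {s: w for s, w in strings.items() if s.endswith("B")}
total = sum(kept.values())
for s, w in kept.items():
    print(s, round(w, 2), "->", round(w / total, 4))
```

Only "G85-B" (0.17) and "B15-B" (0.12) survive the pruning, and 0.12 / (0.17 + 0.12) recovers the ~41% answer.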

-----------------------------------------------------

Now let's formulate a different problem from the same book by Kahneman, one that is semantically different but, as we'll soon discover, structurally the same.

The two taxi companies (Green and Blue), as in the previous problem, operate the same number of cabs, but Green cabs are involved in 85% of accidents.
The rest is the same as in the first problem.

For convenience - we are copying the rest here:
"A witness identified the cab as Blue. The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time. What is the probability that the cab involved in the accident was Blue rather than Green?"

Writing down the trees, it is easy to see that the answer does not depend on the order of partitioning (as above), provided all the branches are interpreted correctly (that is the key!). The trees will be the same as above.
And the correct answer will be the same:

<The answer> = 0.12/(0.12+0.17) = 0.41379... ≈ 41%

I hope this method will be useful to somebody. In contrast with the traditional formulation and terminology of Bayes' theorem, it seems it can be taught even to schoolchildren.


Literature:


D. Kahneman, "Thinking, Fast and Slow"

A. Tversky, D. Kahneman, "Evidential impact of base rates", in "Judgment under Uncertainty: Heuristics and Biases"

Kahneman & Tversky articles listed in the literature section of http://en.wikipedia.org/wiki/Daniel_Kahneman