Chi-Square Test of Goodness of Fit

A nice link between the Descriptive Statistics (Frequency Table) and the Inferential Statistics (Chi-Square Test of Goodness of Fit)

One of the concepts discussed early in an introductory statistics course is the frequency table, which describes a sample’s frequencies (descriptive statistics). An inferential-statistics method that can be applied to such a frequency table is the Chi-Square (χ2) Test of Goodness of Fit. This post explains what the Chi-Square (χ2) Test of Goodness of Fit is, using a research question about gender distribution.

Who Is More Numerous, Male or Female Bloggers? Chi-Square (χ2) Test of Goodness of Fit

Gender Distribution Example 1

In 2021, there are arguably about 600 million blog posts, suggesting that there are many bloggers, too.

Do you think there are more female bloggers than male bloggers (or vice versa)? And let’s say we will allow only one of these two answers:

(a) Yes, there is an equal (or similar) number of female and male bloggers.

(b) No, there is one gender (either female or male) that outnumbers the other.

Note that in the typical hypothesis-testing process, the two hypotheses would be “there is exactly the same number of females and males in the population” and “the numbers of females and males are not the same.” Because some readers may not be familiar with hypothesis testing, I phrased the two answers as above to allow more room: equal (or similar) gender distribution vs. one gender dominating.

This is probably one of the simplest kinds of (research) questions, and a reasonable approach to answering this question is to visit many blog posts whose authors revealed their gender.

In 2020, Sysmos did exactly that. They visited more than 100 million blog posts and found this:

Gender distribution among bloggers (By Sysmos)

Although they didn’t study “all” bloggers, based on 100 million blog posts (i.e., sample with the sample size n = 100,000,000), we can confidently conclude that there would be a similar number of female and male bloggers in the population (i.e., all the female and male bloggers in the world). So, in this case, the results seem to be easy to interpret.

One Thing To Remember

One may think that only a sample result of exactly 50% male and 50% female would support an equal gender distribution, while any other result (e.g., 49% to 51%) would support an unequal gender distribution.

But even if your sample result slightly deviates from the perfectly even distribution, the population could have an equal gender distribution. That is, there should be a range of outcomes that would support the equal gender distribution in the population and there is another range of outcomes that would support the unequal gender distribution in the population.

Keeping this in mind, let’s discuss another example with smaller numbers but with slightly more ambiguous results.

Gender Distribution Example 2

According to recent data, in the US, there are around 200,000 female and male psychologists. Say, we want to draw a dichotomous conclusion regarding their gender distribution. That is, equal vs. unequal gender distribution. Let’s also assume that we will study 2,000 of them (1% of 200,000) to answer this question, as we cannot study all of them.

If our sample has around 1000 females vs. 1000 males, then, we would conclude that there would be an equal number of female and male psychologists in the population. In this case, it is easy to draw a conclusion.

If the results were very far from the 1,000 to 1,000, (e.g., 1,900 vs. 100) then, again, it would be easy for us to draw a conclusion (i.e., unequal gender distribution).

Now, let’s think about somewhat ambiguous results. What if your sample of 2,000 people shows the following distribution?

A ratio of male and female from a sample, based on which one should draw a conclusion regarding equal-gender distribution in the population.

That is, there are 22% more females than males. What would be your conclusion? Equal gender distribution or unequal gender distribution?

Here’s another case. This time, the gap is narrower: There are 10% more males than females.

Another ratio of male and female from a sample, based on which one should draw a conclusion regarding equal-gender distribution in the population.

Again, what would be your conclusion? Equal gender distribution or unequal gender distribution?

In reality, your sample (N = 2,000) could show any of the results in the following table that includes perfectly-equal distribution, perfectly-unequal distribution, and somewhere between the two.

Note that in the following table, results from the top portion tend to support an equal gender distribution in the population (as the female and male frequencies are similar to each other), and results from the bottom portion tend to support an unequal gender distribution (as there is a large discrepancy in frequencies between the two genders).

Female to Male
1000 to 1000
1001 to 999
1002 to 998
1003 to 997
1004 to 996
1005 to 995
1006 to 994
1007 to 993
1008 to 992
1009 to 991
1010 to 990
1011 to 989
...
1989 to 11
1990 to 10
1991 to 9
1992 to 8
1993 to 7
1994 to 6
1995 to 5
1996 to 4
1997 to 3
1998 to 2
1999 to 1
2000 to 0
A list showing all the possible ratios of females to males from a sample (N = 2,000), based on which one should draw a conclusion regarding equal vs. unequal gender distribution in the population. Note that the top portion of the table shows sample results that would support an equal gender distribution in the population (as the female and male frequencies are similar to each other), and the bottom portion shows sample results that would support an unequal gender distribution (as there is a large discrepancy in frequencies between the two genders).

By examining the table, you probably realized that there should be a “line” on the table. That is, results above the line would make you conclude “equal gender distribution in the population,” and results below the line would make you conclude “unequal gender distribution in the population.” For example, let’s say you drew your line at 1010 to 990. Then, if your outcome is above the line, you would conclude “equal gender distribution in the population.” If your outcome is below the line, you would conclude “unequal gender distribution in the population.”

Note that you need to draw the line “before” you actually get the result from your sample. You cannot move your line “after” you get your sample’s data.

Now, Where is Your Line?

Is your line at around the ratio of 1100 to 900?

Or 1200 to 800?

Or 1300 to 700?

This is not easy.

Even worse, you may want to move your line depending on your sample size.

For example, say your line is at 1,100 to 900.

This means that when you study 2,000 people and your sample’s gender difference is equal to or greater than 200, you would conclude “unequal gender distribution in the population.” In other words, when your line is at 1,100 to 900 in the above table, results on or below the line would make you conclude “unequal gender distribution.”

But, what if your sample size is 200,000, instead of 2,000?

The difference of 200 between females and males could be obtained when the sample had, for example, 100,100 females vs. 99,900 males. Now, even though the difference is still 200, you are likely to conclude that there would be an equal gender distribution in the population. 
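To make the point concrete, here is a minimal sketch that compares the two samples mentioned above. It only restates the numbers from the text; the variable names are my own.

```python
# Same absolute gap of 200 between females and males, two different sample sizes
# (both sets of counts come from the text above).
samples = {
    2_000: (1_100, 900),
    200_000: (100_100, 99_900),
}

female_share = {}
for n, (females, males) in samples.items():
    gap = females - males
    female_share[n] = females / n
    print(f"n = {n:>7,}: gap = {gap}, female share = {female_share[n]:.2%}")
```

The gap is 200 in both cases, but it corresponds to a 55% female share in the small sample and only about 50.05% in the large one.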

The point is this: When you wonder whether a population (200,000 psychologists) has a specific frequency distribution across categories (e.g., 100,000 females and 100,000 males), and you are drawing the conclusion based on a small subset of the population (i.e., sample), then, you need to list out all the sample frequencies as shown in the above table. More importantly, you should draw a line against which you will compare your sample’s frequency.

The Solution to This Problem: χ2 Test of Goodness of Fit

When you conclude whether or not a population has a specific frequency distribution across categories, you can use the χ2 Test of Goodness of Fit.

For example, if you wonder whether a certain farm’s animals (N of 1,000,000,000) have a specific distribution, such as, 30% cow, 30% chicken, and 40% other animals, and you want to draw the conclusion based on a sample of 1,000 animals from the farm, then, use the χ2 Test of Goodness of Fit.
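As a small preview of what the test needs for this farm example, the sketch below turns the hypothesized percentages into the frequencies we would most expect in a sample of 1,000 animals, and counts the degrees of freedom (both ideas are explained later in this post). The variable names are illustrative only.

```python
# Hypothesized population proportions from the farm example in the text.
hypothesized = {"cow": 0.30, "chicken": 0.30, "other": 0.40}
sample_size = 1_000

# Expected frequency (E) per category = hypothesized proportion x sample size.
expected = {animal: share * sample_size for animal, share in hypothesized.items()}

# Degrees of freedom = number of categories - 1.
df = len(hypothesized) - 1

print(expected, df)
```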

Introduction to the χ2 Distribution

Let’s assume that the population of 200,000 psychologists does have an equal gender distribution. Then, what’s the most-expected frequency distribution from your sample with 2,000 psychologists? Yes, 1,000 females and 1,000 males.

Of course, your sample could show 1001 to 999, 1002 to 998, or even more extremely-deviated ratios. But, the chances are the highest for the 1000 female to 1000 male, followed by 1001 to 999, followed by 1002 to 998, and so on.

If we represent these expected probabilities of all the outcomes, it would look as below: 

(X-axis is showing various outcomes that you could get from the sample of 2,000 psychologists; the starting point is showing the sample result of 1000 to 1000, and the remaining results are shown along the X-axis. Y-axis is representing the likelihood or probability).

A graph showing all the possible gender-ratio outcomes that one could get from a sample of 2,000 people on the x-axis and the corresponding probability on the y-axis, assuming that (again important!) the population has an equal gender distribution. The probability is gradually decreasing as the outcome deviates from the most expected ratio of 1,000 females to 1,000 males.

Note, it is very important to keep in your mind that this expected probability distribution across all the possible outcomes was drawn with the assumption that there “IS” an equal gender distribution in the population. This is also what is assumed in the remainder of this blog.
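The ordering of these probabilities can be checked with a short sketch. It assumes that each sampled psychologist is independently female or male with probability 0.5 (a binomial model). That is my simplifying assumption for illustration; drawing 2,000 people from 200,000 without replacement is only approximately binomial.

```python
from math import comb

n = 2_000  # sample size from the example


def prob_females(k: int, n: int = n) -> float:
    """P(exactly k females in n draws) when each draw is female with probability 0.5."""
    return comb(n, k) / 2**n


# Probability of a few outcomes, from the most expected one outward.
for k in (1000, 1001, 1002, 1100):
    print(f"{k} to {n - k}: {prob_females(k):.6f}")
```

Under this assumption, 1000 to 1000 is the single most likely outcome, and the probability shrinks steadily as the outcome moves away from it.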

The above distribution is showing that when there is an equal gender distribution in the population, from the sample of 2,000 people, 1000 to 1000 is the most likely outcome, and 1001 to 999 is the second most likely outcome, and so on, which makes sense. And this expected probability distribution is capturing the idea of a chi-square (χ2) distribution. (Chi is pronounced as /kai/)

That is, the chi-square distribution is an expected probability distribution of sample outcomes that can be obtained under the assumption that the population has a specific frequency distribution (in this case, an equal frequency distribution between males and females).

I also drew “the line”, which was marked by “Here is the line.” That is, outcomes before the line (the X-axis values of the white area under the curve) will be considered as likely outcomes under the equal-gender-distribution assumption regarding the population. Also, the outcomes after the line will be considered as unlikely outcomes under the equal-gender-distribution assumption (hypothesis) in the population.

How did I draw the line?

I did that based on something called a χ2 table. This handy table helps you with figuring out the line that divides the whole area under the χ2 distribution as the likely-outcome area vs. the unlikely-outcome area. For example, the likely area covers 95% of the possible outcomes (white area) while the unlikely area covers 5% of the possible outcomes (the red area). You can also change the ratio (e.g., 99% vs. 1%).

I will explain how to use the table, which is very easy, after I explain what the χ2 value itself is.

What is the χ2 value? (Conceptual Explanation)

The χ2 value is an indicator of deviation between the actually-observed female-male ratio in your sample and the most expected ratio (1000 to 1000).

More generally, your χ2 value shows the degree of the discrepancy between the observed frequency in your sample (e.g., 1,100 females to 900 males) and the most expected frequency for your sample (e.g., 1,000 to 1,000), assuming that the sample’s population has a specific frequency pattern (e.g., 100,000 to 100,000).

That is, if your χ2 is 0, then, it means that there is no discrepancy between your sample’s observed frequency distribution and the most expected frequency distribution based on the assumption that the population has a certain specific frequency distribution (e.g., equal gender distribution in the population).

The larger the χ2, the more the discrepancy between your sample’s frequency distribution and the most expected frequency distribution.

Therefore, using the χ2 value, we can succinctly communicate messages like the following:

a) “The actually observed sample frequency is exactly the same as the most expected frequency” by showing a χ2 value of 0 (it would support the equal-gender-distribution hypothesis in the population).

b) “The actually observed sample frequency is very different from the most expected frequency” by showing a χ2 value of, for example, 10. (Note, χ2 of 10 is typically a large value and unlikely to be obtained from a sample if it was really derived from a population with an equal gender distribution.) In this case, it would reject the equal-gender-distribution hypothesis in the population.

Of course, your χ2 can be a somewhat ambiguous value. For example, if your χ2 value is around 2 or 3, it may suggest that the actually observed sample frequency is “somewhat” different from the most expected frequency. In this case, you would need to compare your χ2 value to the line’s χ2 value.

In sum, the χ2 value shows the degree of discrepancy between the actually observed frequency in your sample and the most expected frequency under a hypothesis.

How to Calculate the χ2 value? (Mathematical Explanation)

Step 1: O Calculation

χ2 calculation starts with creating the frequency table from the sample. For example, if there were 860 females and 1140 males in your sample, then, the frequency table would look like this:

Step 1: Observed frequency table (O).

Such an actual frequency from the sample is called the observed frequency (O).

Step 2: E calculation

As the χ2 value indicates the degree of a discrepancy between the observed frequency of your sample and the expected frequency under an assumption about the population, the χ2 calculation naturally involves “observed frequency minus expected frequency (O – E),” which appears in the numerator of the formula.

This, of course, requires the creation of the expected frequency (E). That is, assuming a certain hypothesis about the population (e.g., equal gender distribution), we need to figure out the most expected frequency considering your sample size. In our example, since the sample size is 2,000, the most expected frequency under the equal-gender-distribution hypothesis is 1,000 females and 1,000 males. Such a most expected frequency under the hypothesis is called the “Expected Frequency (E).” In our example, assuming the equal-gender distribution in the population, the expected frequency table would be this:

Step 2: Expected frequency table (E).

Step 3: O – E

Calculating the difference between O and E (O – E) is the third step of χ2 calculation.

Step 3: O – E

Step 4: (O − E)2

It is tempting to add these differences up or calculate the average of these differences across categories, as χ2 is an “overall” indicator of the discrepancy between the actual frequency distribution in your sample and the most expected frequency across categories. However, the sum of the differences (O – E) would always give you 0. This is because one category’s O – E would be positive and the other category’s O – E would be negative with the exact same magnitude (similar to the sum-of-zero issue in the standard deviation calculation).

To avoid the sum-of-zero issue, we square these differences for each category (as we do the same in the standard deviation calculation): (O−E)2.  

Step 4: (O−E)2

Step 5: (O−E)2 / E

Then, we divide the squared difference scores of each category (O−E)2 by the category’s most expected value (E):

    \[(O - E)^2 \over E\]

In our example, the numbers are the same for both female and male categories: 19600 / 1000. If you calculate the χ2 value using Excel as I’m doing now, then, the relevant cells for the Female category in this step would be J4/D4 as shown below:

Step 5: (O−E)2 / E

Step 6: ∑{(O−E)2 / E}

Finally, to create an overall measure of discrepancy between the observed frequency and the expected frequency across categories, you sum up the (O−E)2 / E values across the categories. Then, the sum itself is the value of χ2.

That is, the formula of χ2 is:

Formula of chi-square
Chi-square value of 39.2 calculated by adding up (O−E)2 / E for female and male (Cell M4 and N4).

Here’s a simpler version:

    \[\chi^2 = \sum\frac{(O - E)^2}{E}\]
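The six steps above can be sketched in a few lines of Python, as an alternative to the Excel sheet. This is a minimal sketch using the sample from the example (860 females, 1,140 males) and the equal-distribution hypothesis; the variable names are my own.

```python
observed = {"female": 860, "male": 1140}             # Step 1: O
n = sum(observed.values())
expected = {g: n / len(observed) for g in observed}  # Step 2: E under equal distribution

chi_square = 0.0
for g in observed:
    diff = observed[g] - expected[g]      # Step 3: O - E
    squared = diff ** 2                   # Step 4: (O - E)^2
    contribution = squared / expected[g]  # Step 5: (O - E)^2 / E
    chi_square += contribution            # Step 6: sum across categories
    print(g, diff, squared, contribution)

print(chi_square)
```

Each category contributes 19600 / 1000 = 19.6, so the sum is 39.2, matching the value shown above.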

By the way, in step 5, we divide the squared difference scores of each category (O−E)2 by the most expected value of the category (E). That is, (O−E)2 / E.

I should note that the most expected value of each category is affected by the sample size. For example, assuming the equal gender distribution in the population, if your sample size is 2,000, then, the expected value of a category is 1,000 (for both male and female). But, if your sample size is 20,000, then, the expected value of a category is 10,000 (for both male and female).

The purpose of dividing the squared difference scores (O−E)2 by E, which is affected by the sample size, is to compensate for the effect of (O−E)2 on χ2 by considering the sample size. Here’s a more detailed explanation.

Let’s say, in your sample, the difference between females and males was 400 (e.g., there are 400 more females than males).

This gap of 400 could be a large or small difference depending on your sample size. Let’s assume two sample sizes: 2,000 vs. 50,000.

For the sample size of 2,000, the gap of 400 could have been obtained from a ratio, for example, 1,200 females to 800 males. This could be considered a large difference and suggest unequal gender distribution in the population.

However, when the sample size is large (e.g., 50,000 instead of 2,000), the gap of 400 would be obtained from a ratio, such as, 25,200 females to 24,800 males. In this case, the same difference of 400 would be considered a small difference and suggest equal gender distribution in the population.

Because we divide the squared discrepancy between the observed and expected frequencies, (O−E)2, by E, a difference like 400 increases the χ2 value substantially when your sample size is small (e.g., 2,000). In contrast, when your sample size is large (e.g., 50,000), only a very large difference increases the χ2 value substantially, while a small difference does not affect it much. This is how we compensate for the sample size in the calculation of χ2.
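Here is a minimal sketch of that compensation, using the two samples from the text (1,200 vs. 800 and 25,200 vs. 24,800). The helper function is hypothetical and assumes equal expected frequencies across categories.

```python
def chi_square_equal(observed):
    """Chi-square statistic against equal expected frequencies in every category."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


small = chi_square_equal([1_200, 800])      # n = 2,000, gap of 400
large = chi_square_equal([25_200, 24_800])  # n = 50,000, gap of 400
print(small, large)
```

The same gap of 400 gives a χ2 of about 80 in the small sample but only about 3.2 in the large one, because the squared difference is divided by a much larger E.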

How to Use the Chi-Square (χ2) Table?

Here’s how to use the chi-square table to draw the “line” on the χ2 distribution in terms of the χ2 value.

The main body of the χ2 table (see below) shows many χ2 values and these χ2 values are the critical χ2 values that could serve as “the line.”

To find your line (critical χ2 value) from the χ2 table, you need two pieces of information:

(a) Significance level or alpha level to determine the column: Alpha level is about what kind of χ2 values should be considered likely or unlikely under the χ2 distribution (refer to the white and red area, respectively, in the χ2 graph shown under the heading, Introduction to the chi-square distribution). Typically, the alpha level is .05 (α = .05). That is, χ2 values with the probability that is the same or lower than .05 are considered unlikely. This alpha (α) level also determines which column of the χ2 table to use to find the critical χ2 value (e.g., the yellow-highlighted column, χ2.050).

(b) How many categories are involved in your study to determine the row: The number of categories of the variable decides something called the degrees of freedom (df) of the χ2 test. The formula of df is category number – 1. In our gender example, the df = 2 (two categories: male and female) – 1 = 1. The df determines which row to use from the χ2 table. Specifically, you need to find the obtained df from the left heading of the table to decide which row of the table to use (in our example, the first row).

Finally, the intersection of the row (in our example, the first row) and the highlighted column (χ2.05) would give you the critical χ2 value (e.g., 3.841 if your df is 1): this is the “line” of the χ2 distribution.
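If you do not have a χ2 table at hand, the critical value for our example can be reproduced with Python's standard library. Note that this shortcut is an assumption that holds only for df = 1, where a χ2 value behaves like a squared standard-normal value; for other df you would still need the table or a statistics package.

```python
from statistics import NormalDist

alpha = 0.05

# For df = 1 only: the critical chi-square value is the square of the
# two-sided standard-normal cutoff (about 1.96 for alpha = .05).
z = NormalDist().inv_cdf(1 - alpha / 2)
critical = z ** 2

print(round(critical, 3))
```

Rounded to three decimals, this gives the same 3.841 found in the first row of the table.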

How to Draw the Conclusion?

Given a certain claim (e.g., the population has an equal gender distribution), we draw our line on the χ2 distribution (in our example, 3.841). Then, we collect data and calculate our χ2 value following the steps described above (in our example, 39.2). If the sample’s χ2 is larger than the critical χ2 value found from the χ2 table, which also means that the probability of obtaining such a large χ2 from a sample is too low (p < .05), then, we reject the equal-gender-distribution claim. 

SPSS would calculate the precise probability of obtaining a χ2 value that is as large as yours (i.e., the specific χ2 that you calculated from your sample). Just by looking at the SPSS-calculated probability, which is usually shown under the .sig column of the SPSS output table, you can draw a final statistical conclusion: If this is too low (p < .05), then, you reject the claim (even-gender-distribution) and conclude that there would be an unequal gender distribution.

If the χ2 is smaller than the critical χ2 value from the χ2 table, it means that the probability of obtaining a χ2 as large as yours is not too low (p > .05). In that case, you do NOT reject the (even-gender-distribution) claim; you retain the claim of even gender distribution.
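The decision rule can be sketched as follows, using the numbers from our example. The p-value line relies on a relationship that holds only for df = 1 (my assumption for this sketch); it is not how SPSS is implemented, just a way to see that the two comparisons agree.

```python
from math import erfc, sqrt

chi_square = 39.2  # from the worked example
critical = 3.841   # chi-square table, df = 1, alpha = .05
alpha = 0.05

# For df = 1 only: P(chi-square >= x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi_square / 2))

# Comparing chi-square to the line and comparing p to alpha lead to the same decision.
reject = chi_square > critical
print(f"p = {p_value:.2e}, reject equal-distribution claim: {reject}")
```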

Finally, we report the Chi-square test result in the following format:

In our sample, there were 860 females and 1,140 males, suggesting that there would not be an equal distribution of females and males (or there would be more males) among all the psychologists, χ2 (1, N = 2,000) = 39.2, p < .05, V = 0.14.

That is, we report the df and the sample size in the parentheses after the χ2. Here the df is 1 (two categories minus one), as calculated above. The two elements in the parentheses should be separated by a comma. Note that mathematical operators such as = and < should be surrounded by spaces.

The V at the end is an effect size measure of χ2 test, called Cramer’s V. The formula is

The formula of the effect size measure of χ2 test, called Cramer’s V.
Excel calculation of the effect size measure of χ2 test, called Cramer’s V.
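As a check on the reported effect size, here is a minimal sketch. It assumes the form of Cramér's V commonly used for a goodness-of-fit test, V = sqrt(χ2 / (N × (k − 1))), where k is the number of categories; the formula in the figure above is not reproduced here, so treat this form as my assumption.

```python
from math import sqrt

chi_square = 39.2  # from the worked example
n = 2_000          # sample size
k = 2              # number of categories (female, male)

# Assumed goodness-of-fit form of Cramer's V.
v = sqrt(chi_square / (n * (k - 1)))
print(round(v, 2))
```

With these numbers, V comes out to 0.14, which agrees with the value in the report above.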

Hope this helps.
