HPG 6104 • Epidemiological Methods II

Categorical Data & Contingency Tables

Week 2 Companion Handout

Choosing the Right Approach

A large share of epidemiologic data is categorical (e.g., smoking status, HIV status, vaccination status). Once the data are arranged in a contingency table, we need to choose an appropriate method to assess association.

Situation	Usual Statistical Approach
Two categorical variables, adequate sample size	Chi-square test
2×2 table with small expected counts (< 5 in any cell)	Fisher's exact test
Need to control for one categorical confounder using strata	Mantel-Haenszel analysis
Multiple confounders, complex modelling, interaction	Logistic regression

Chi-Square vs. Fisher's Exact

Chi-Square Test

Assesses whether there is evidence of an association between two categorical variables. The null hypothesis is that the variables are independent.

Assumes observations are independent.
Categories must be mutually exclusive.
Rule of Thumb: Expected counts should be ≥ 5 in at least 80% of cells, with none below 1, for the chi-square approximation to hold. In a 2×2 table this means every cell.
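
As a rough check of the rule of thumb, the expected counts can be computed from the table margins and compared with what chisq.test() reports. The sketch below reuses the 2×2 counts from the syntax reference in this handout; the object names (tab, expected, all_ok, x2, fit) are illustrative only.

```r
# Counts taken from the handout's syntax reference (filled row by row)
tab <- matrix(c(40, 60, 10, 90), nrow = 2, byrow = TRUE)

# Expected count for each cell = (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Rule-of-thumb check: are all expected counts at least 5?
all_ok <- all(expected >= 5)

# Pearson chi-square statistic by hand (no continuity correction)
x2 <- sum((tab - expected)^2 / expected)

# Same quantities from chisq.test(); correct = FALSE switches off
# the Yates continuity correction that R applies to 2x2 tables by default
fit <- chisq.test(tab, correct = FALSE)

print(fit$expected)
print(c(by_hand = x2, chisq_test = unname(fit$statistic)))
```

With these counts every expected value is 25 or 75, so the approximation is reasonable and the hand calculation should agree with chisq.test().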

Fisher's Exact Test

Especially useful when sample sizes are small or expected cell counts are low. It does not rely on large-sample approximations.

Often more appropriate for sparse 2×2 tables.
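In R, fisher.test() also returns an OR estimate and confidence interval for 2×2 tables. The estimate is the conditional maximum likelihood estimate, so it can differ slightly from the cross-product ratio ad/bc.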
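
A minimal sketch of Fisher's exact test on a sparse table follows. The counts are hypothetical, invented for illustration and not taken from any study; the object names are likewise assumptions.

```r
# Hypothetical sparse 2x2 table (invented counts, filled row by row)
sparse_tab <- matrix(c(3, 1, 1, 4), nrow = 2, byrow = TRUE)

# Every expected count is below 5, so the chi-square approximation is doubtful
expected <- outer(rowSums(sparse_tab), colSums(sparse_tab)) / sum(sparse_tab)
print(expected)

# Exact test: no large-sample approximation is involved
res <- fisher.test(sparse_tab)

p_value <- res$p.value            # two-sided exact p-value
or_est  <- unname(res$estimate)   # conditional MLE of the odds ratio
or_ci   <- res$conf.int           # 95% confidence interval by default

# Cross-product odds ratio ad/bc, for comparison with the conditional MLE
or_cross <- (sparse_tab[1, 1] * sparse_tab[2, 2]) /
  (sparse_tab[1, 2] * sparse_tab[2, 1])

print(c(p = p_value, or_conditional = or_est, or_cross_product = or_cross))
print(or_ci)
```

The interval from a table this small is typically very wide, which is itself a useful reminder of how little information sparse data carry.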

Interpretation Trap

A small p-value tells you that data at least as extreme as those observed would be unlikely if the null hypothesis were true. It does not tell you how strong the association is or whether it is causal. Look at the Odds Ratio (OR) and its confidence interval for direction and strength!
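
To illustrate the trap, the sketch below compares two hypothetical tables (invented counts) that share the same odds ratio but differ tenfold in sample size. The helper odds_ratio() is an illustrative function defined here, not part of base R.

```r
# Hypothetical counts, chosen only so that both tables share the same odds ratio
small_tab <- matrix(c(4, 6, 1, 9), nrow = 2, byrow = TRUE)
large_tab <- small_tab * 10   # same proportions, ten times the sample size

# Sample (cross-product) odds ratio: ad / bc
odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

or_small <- odds_ratio(small_tab)
or_large <- odds_ratio(large_tab)

# Exact p-values for each table
p_small <- fisher.test(small_tab)$p.value
p_large <- fisher.test(large_tab)$p.value

print(c(or_small = or_small, or_large = or_large))
print(c(p_small = p_small, p_large = p_large))
```

Both tables give a sample OR of 6, yet only the larger one yields a small p-value. The p-value reflects sample size as well as effect size, so it cannot stand in for the OR.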

Stratification: Mantel-Haenszel Analysis

Sometimes an observed crude association is distorted by a third variable (such as age). Mantel-Haenszel analysis combines the stratum-specific 2×2 tables to produce a single adjusted (pooled) estimate of association.
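
The pooling step can be sketched by hand. The Mantel-Haenszel odds ratio weights each stratum by its size: the sum of a*d/n across strata divided by the sum of b*c/n. The example below reuses the two age strata from the syntax reference in this handout; the object names (strata, or_mh, fit) are illustrative.

```r
# Two age strata from the handout's syntax reference
# (matrix() fills column by column by default)
age1 <- matrix(c(30, 20, 10, 40), nrow = 2)
age2 <- matrix(c(10, 40, 5, 50), nrow = 2)
strata <- list(age1, age2)

# Numerator: sum over strata of (a * d) / n
num <- sum(sapply(strata, function(m) m[1, 1] * m[2, 2] / sum(m)))

# Denominator: sum over strata of (b * c) / n
den <- sum(sapply(strata, function(m) m[1, 2] * m[2, 1] / sum(m)))

or_mh <- num / den

# mantelhaen.test() reports the same pooled estimate for 2 x 2 x K arrays
# when the default (non-exact) test is used
fit <- mantelhaen.test(array(c(age1, age2), dim = c(2, 2, 2)))

print(c(by_hand = or_mh, mantelhaen = unname(fit$estimate)))
```

With these counts the pooled OR is roughly 4.29.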

Confounding vs. Effect Modification

Confounding: The crude estimate is distorted. "If the stratum-specific ORs are similar to one another but different from the Crude OR, confounding is likely."

Effect Modification: The exposure-outcome association truly differs across levels of another variable. "If the stratum-specific ORs differ substantially from one another, effect modification is present. Reporting a single pooled estimate here hides the biological reality."
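
These two rules can be applied by computing the stratum-specific and crude odds ratios side by side. The sketch below reuses the age strata from the syntax reference in this handout; odds_ratio() is an illustrative helper, and the crude table is assumed to be the simple cell-by-cell sum of the strata.

```r
# Two age strata from the handout's syntax reference
age1 <- matrix(c(30, 20, 10, 40), nrow = 2)
age2 <- matrix(c(10, 40, 5, 50), nrow = 2)

# Cross-product odds ratio: ad / bc
odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

# Stratum-specific estimates
or_age1 <- odds_ratio(age1)

or_age2 <- odds_ratio(age2)

# Crude estimate: collapse over age by summing the strata cell by cell
crude_tab <- age1 + age2
or_crude  <- odds_ratio(crude_tab)

print(c(age1 = or_age1, age2 = or_age2, crude = or_crude))
```

Here the stratum-specific ORs are 6 and 2.5 and the crude OR is 4. By the rules above, the gap between 6 and 2.5 hints at effect modification rather than pure confounding, although with samples this small the difference could also reflect chance, so a formal test of homogeneity would be needed before drawing that conclusion.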

R Workflow & Syntax Reference

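# 1. Chi-Square & Fisher's
# 2x2 table of counts, filled row by row
tab <- matrix(c(40, 60, 10, 90),
              nrow = 2, byrow = TRUE)

chisq.test(tab)   # Pearson chi-square (Yates' correction by default for 2x2)
fisher.test(tab)  # exact test; also reports an OR estimate and 95% CI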

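# 2. Mantel-Haenszel (Stratified)
# One 2x2 table per age stratum (matrix() fills column by column by default)
age1 <- matrix(c(30, 20, 10, 40), nrow = 2)
age2 <- matrix(c(10, 40, 5, 50), nrow = 2)

# Stack the strata into a 2 x 2 x K array, as mantelhaen.test() expects
stratified_data <- array(c(age1, age2),
                         dim = c(2, 2, 2))
mantelhaen.test(stratified_data)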