CDC Epidemiology Case Studies
CDC developed case studies in applied epidemiology based on real-life epidemiologic investigations and used them for training new Epidemic Intelligence Service (EIS) officers — CDC’s “disease detectives.” EIS offers these carefully crafted epidemiology case studies for schools of medicine, nursing, and public health to use as a component of an applied epidemiology curriculum.

Students may practice their epidemiologic skills by using these exercises in classroom activities or as homework assignments to reinforce principles and skills previously covered in lectures and reading assignments.

The following case studies use specific examples to teach epidemiology concepts, require active participation, and help strengthen problem-solving skills. These case studies in applied epidemiology:

Case study based on a 1985 outbreak with unknown etiology and mode of transmission in multiple states. Updated in 2003.

Case study based on the classic studies of Doll and Hill in the 1950s. Addresses study design, interpretation of measures of association, and impact of association. 

Case study based on a 1980–1982 multicenter case-control study. Addresses bias and analysis of case-control studies. Updated in 2005.

Case study of a classic, straightforward outbreak investigation in a defined population. Based on a 1940 outbreak of Staphylococcus aureus among church picnic attendees. Additional material: 

Case study based on a community outbreak of Legionnaires’ disease in Bogalusa, Louisiana in 1989. Addresses the steps of a field investigation and a case-control study. Updated in 2003.

 (2003 Update)

Case study based on surveillance and investigation activities of the Oregon Health Division between 1986 and 1995. 

Case study based on an infectious disease outbreak investigation in Texas.

Instructor guides/Preceptor versions for teachers/faculty can be purchased from the . Instructor guides are available FREE for APTR members and are $20 for non-members.

OOMPH Library Resources: PHW 250/250B Epidemiologic Methods: Epidemiologic Case Study Resources


Epidemiologic Case Studies

  • Epidemiologic Case Studies (US CDC) These case studies are interactive exercises developed to teach epidemiologic principles and practices. They are based on real-life outbreaks and public health problems and were developed in collaboration with the original investigators and experts from the Centers for Disease Control and Prevention (CDC). The case studies require students to apply their epidemiologic knowledge and skills to problems confronted by public health practitioners at the local, state, and national level every day.
  • Case Studies (WHO) From "Strengthening health security by implementing the International Health Regulations," each case has learning objectives and documentation.
  • Case Studies in Social Medicine A series of Perspective articles from the New England Journal of Medicine that highlight the importance of social concepts and social context in clinical medicine. The series uses discussions of real clinical cases to translate theories and methods for understanding social processes into terms that can readily be used in medical education, clinical practice, and health system planning.
  • African Case Studies in Public Health Case study exercises based on real events in African contexts and written by experienced Africa-based public health trainers and practitioners. These case studies represent the most up-to-date and context-appropriate case study exercises for African public health training programs. These exercises are designed to reinforce and instill competencies for addressing health threats in the future leaders of public health in Africa.
  • Case Consortium @ Columbia University: Public Health Cases The case collection includes "teaching" cases. Nearly all the cases are multimedia and based on original research; a few are written from secondary sources. All cases are offered free of charge.
  • Epi Teams Training: Case Studies From the North Carolina Institute for Public Health, this curriculum includes several interactive case studies designed to be used by the Epi Team as a group. These case studies are based on actual outbreaks that have occurred in North Carolina and elsewhere.
  • National Center for Case Study Teaching in Science The mission of the NCCSTS at the University at Buffalo is to promote the development and dissemination of materials and practices for case teaching in the sciences. Our website provides access to an award-winning collection of peer-reviewed case studies. We offer a five-day summer workshop and a two-day fall conference to train faculty in the case method of teaching science. In addition, we are actively engaged in educational research to assess the impact of the case method on student learning. "Case Collection" includes over 100 public health cases.

Books of Case Studies

  • Last Updated: Jul 1, 2024 9:17 AM
  • URL: https://guides.lib.berkeley.edu/publichealth/PHW250

The Case for Case-Cohort: An Applied Epidemiologist's Guide to Reframing Case-Cohort Studies to Improve Usability and Flexibility

Affiliations.

  • 1 From the Epidemiology Branch, National Institute of Environmental Health Sciences, NC.
  • 2 Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC.
  • PMID: 35383643
  • PMCID: PMC9172927
  • DOI: 10.1097/EDE.0000000000001469

When research questions require the use of precious samples, expensive assays or equipment, or labor-intensive data collection or analysis, nested case-control or case-cohort sampling of observational cohort study participants can often reduce costs. These study designs have similar statistical precision for addressing a singular research question, but case-cohort studies have broader efficiency and superior flexibility. Despite this, case-cohort designs are comparatively underutilized in the epidemiologic literature. Recent advances in statistical methods and software have made analyses of case-cohort data easier to implement, and advances from causal inference, such as inverse probability of sampling weights, have allowed the case-cohort design to be used with a variety of target parameters and populations. To provide an accessible link to this technical literature, we give a conceptual overview of case-cohort study analysis with inverse probability of sampling weights. We show how this general analytic approach can be leveraged to more efficiently study subgroups of interest or disease subtypes or to examine associations independent of case status. A brief discussion of how this framework could be extended to incorporate other related methodologic applications further demonstrates the broad cost-effectiveness and adaptability of case-cohort methods for a variety of modern epidemiologic applications in resource-limited settings.
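To make the weighting idea concrete, the following minimal sketch assigns inverse probability of sampling weights in a toy case-cohort sample. It assumes one common scheme (every case is sampled with probability 1, and non-cases enter only through a random subcohort drawn with a known fraction); the function name, variable names, and toy cohort are illustrative assumptions and are not taken from the article.

```python
def sampling_weights(is_case, in_subcohort, subcohort_fraction):
    """Return one weight per cohort member (0.0 means the member is not sampled).

    Assumed scheme: cases are always sampled (weight 1); non-cases are sampled
    only via the subcohort, so each one stands in for 1/fraction non-cases.
    """
    weights = []
    for case, sub in zip(is_case, in_subcohort):
        if case:
            weights.append(1.0)
        elif sub:
            weights.append(1.0 / subcohort_fraction)
        else:
            weights.append(0.0)
    return weights


# Hypothetical cohort of 24 members: members 0-3 are cases,
# and every 4th member belongs to the subcohort (fraction 0.25).
n = 24
is_case = [i < 4 for i in range(n)]
in_subcohort = [i % 4 == 0 for i in range(n)]

weights = sampling_weights(is_case, in_subcohort, 0.25)

# The weighted sample size reconstructs the size of the full cohort.
weighted_size = sum(weights)
print(weighted_size)
```

In this toy layout the weights sum exactly to the cohort size; with a truly random subcohort this holds only in expectation.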

Copyright © 2022 Wolters Kluwer Health, Inc. All rights reserved.


Conflict of interest statement

The authors report no conflicts of interest.

Figure: Visualization of case-cohort designs assuming a time-on-study time scale. (A) The case-cohort study…


Grants and funding

  • ZIA ES044005/ImNIH/Intramural NIH HHS/United States


Epidemiologic Case Studies

These case studies are interactive exercises developed to teach epidemiologic principles and practices. They are based on real-life outbreaks and public health problems and were developed in collaboration with the original investigators and experts from the Centers for Disease Control and Prevention (CDC). The case studies require students to apply their epidemiologic knowledge and skills to problems confronted by public health practitioners at the local, state, and national level every day.

Three types of epidemiologic case studies are available.

Computer-Based Case Studies

Can be used as self-study and in the classroom.

Botulism in Argentina (CB3058)

E. coli O157:H7 Infection in Michigan (CB3075)

Gastroenteritis at a University in Texas (CB3076)

Classroom Case Studies

Primarily for use in a group setting with a knowledgeable instructor.

Instructor’s Guide

Foodborne Disease

Waterborne Disease

Outbreak Simulation

Gives students the opportunity to work through an outbreak investigation as a lead investigator.

Outbreak Simulation: Pharyngitis in Louisiana (CB3050)

  • Page last reviewed: September 11, 2017
  • Page last updated: September 11, 2017
  • Office of Public Health Scientific Services
  • Center for Surveillance, Epidemiology, and Laboratory Services
  • Division of Scientific Education and Professional Development


The Case Time Series Design

Gasparrini, Antonio a,b

a Department of Public Health Environments and Society, London School of Hygiene & Tropical Medicine, London, United Kingdom

b Centre for Statistical Methodology, London School of Hygiene & Tropical Medicine, London, United Kingdom.

Submitted November 26, 2019; accepted July 15, 2021

Supported by the Medical Research Council-UK (Grant ID: MR/R013349/1).

The authors report no conflicts of interest.

Online supplemental material includes documents for simulating data with the same features of the datasets used in the two case studies, and for reproducing the steps and results of the analyses presented in the article. An updated version complemented with scripts of the R statistical software is available at https://github.com/gasparrini/CaseTimeSeries.

Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com).

Correspondence: Antonio Gasparrini, London School of Hygiene & Tropical Medicine, 15-17 Tavistock Place, London WC1H 9SH, United Kingdom. E-mail: [email protected].

This is an open access article distributed under the terms of the Creative Commons Attribution-Non Commercial License 4.0 (CC BY-NC), where it is permissible to download, share, remix, transform, and build upon the work provided it is properly cited. The work cannot be used commercially without permission from the journal.

Modern data linkage and technologies provide a way to reconstruct detailed longitudinal profiles of health outcomes and predictors at the individual or small-area level. Although these rich data resources offer the possibility to address epidemiologic questions that could not be feasibly examined using traditional studies, they require innovative analytical approaches. Here we present a new study design, called case time series, for epidemiologic investigations of transient health risks associated with time-varying exposures. This design combines a longitudinal structure and flexible control of time-varying confounders, typical of aggregated time series, with individual-level analysis and control-by-design of time-invariant between-subject differences, typical of self-matched methods such as case–crossover and self-controlled case series. The modeling framework is highly adaptable to various outcome and exposure definitions, and it is based on efficient estimation and computational methods that make it suitable for the analysis of highly informative longitudinal data resources. We assess the methodology in a simulation study that demonstrates its validity under defined assumptions in a wide range of data settings. We then illustrate the design in real-data examples: a first case study replicates an analysis on influenza infections and the risk of myocardial infarction using linked clinical datasets, while a second case study assesses the association between environmental exposures and respiratory symptoms using real-time measurements from a smartphone study. The case time series design represents a general and flexible tool, applicable in different epidemiologic areas for investigating transient associations with environmental factors, clinical conditions, or medications.

Observational studies aim to discover and understand causal relationships between exposures and health outcomes through the analysis of epidemiologic data. 1 Paramount to this objective is removing biases due to the nonexperimental setting, in the first place confounding. It is, therefore, no surprise that traditional approaches based on cohort and case–control methods have been complemented with, and extended by, alternative study designs and statistical techniques applicable in specific contexts. An active area of research is so-called self-matched studies, which investigate acute effects of intermittent exposures by comparing observations sampled at different times within the same unit. These include individual-level designs such as the case–crossover, 2 the case-only, 3 the case–time–control, 4 the exposure–crossover, 5 and the self-controlled case series, 6 among others. An alternative but related epidemiologic method for aggregated data is the time series design, applied in particular in environmental studies. 7 A thorough overview of self-matched methods is provided in a recent publication by Mostofsky et al. 8

This landscape is likely to be transformed further by ongoing technologic and methodologic developments in data science, which offers unique opportunities for epidemiologic investigations, for instance through electronic health records linkage, 9 exposure modeling, 10 and real-time measurements technologies. 11 , 12 Ultimately, these data resources can be used to reconstruct detailed longitudinal profiles with repeated measures of health outcomes and various risk factors, offering the chance to investigate complex etiological mechanisms and to test elaborate causal hypotheses. However, existing self-matched methods present limitations in this context, and new analytical techniques must be developed for epidemiologic investigations in these intensive longitudinal and big data settings. 13

In this contribution, we present the case time series design, a novel self-matched method for the analysis of transient changes in risk of acute outcomes associated with time-varying exposures. This innovative design combines the longitudinal modeling structure of time series analysis with the individual-level setting of other self-matched methods, offering a flexible and generally applicable tool for modern epidemiologic studies. First, we introduce the case time series design and its features, including the design structure, modeling framework, estimation methods, and key assumptions. Later, we assess the methodology in a simulation study that evaluates its performance under various data-generating scenarios. Then, we demonstrate its application through two real-data epidemiologic analyses. In the Discussion, we describe the epidemiologic context, advantages, limitations, and areas of further development. We add documents for reproducing real-data examples and the simulation study as eAppendix 1–3; https://links.lww.com/EDE/B841, with an updated version complemented with R scripts available at the personal web site and GitHub webpage of the author (see “Data and Code”).

A NOVEL SELF-MATCHED DESIGN

The study design proposed here, called case time series, is a generally applicable tool for the analysis of transient health associations with time-varying risk factors. This novel design considers multiple observational units, defined as cases, for which data are longitudinally collected over a predefined follow-up period. The main design feature that defines the case time series methodology is the split of the follow-up period in equally spaced time intervals, which results in a set of multiple case-level time series. Data forming the series can originate from actual sequential observations or be reconstructed by aggregating or averaging longitudinal measurements, but, eventually, they are assumed to represent a continuous temporal frame. A graphical representation is provided in Figure 1, showing case-specific time series data with various types of measurements of outcome and exposure collected for multiple subjects.
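The split of follow-up into equally spaced intervals can be sketched as follows for a single case. This is a minimal illustration assuming daily intervals, an outcome recorded as event dates, and an exposure recorded as a set of exposed days; the function name `build_case_series` and the toy dates are hypothetical and not part of the article.

```python
from datetime import date, timedelta


def build_case_series(start, end, event_dates, exposure_dates):
    """Build a daily series for one case: one row per day in [start, end]."""
    n_days = (end - start).days + 1
    series = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        series.append({
            "day": day,
            "y": event_dates.count(day),      # outcome: number of events that day
            "x": int(day in exposure_dates),  # exposure: binary episode indicator
        })
    return series


# Hypothetical case followed for 10 days, with three events and one exposed day.
series = build_case_series(
    start=date(2020, 1, 1),
    end=date(2020, 1, 10),
    event_dates=[date(2020, 1, 3), date(2020, 1, 3), date(2020, 1, 8)],
    exposure_dates={date(2020, 1, 2)},
)

for row in series:
    print(row["day"], row["y"], row["x"])
```

Stacking such series across cases gives the long-format data set on which the regression model is fitted; other interval lengths only change the step used to walk through the follow-up period.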

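Figure 1. Case-specific time series data with various types of outcome and exposure measurements collected for multiple subjects.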

The case time series data setting provides a flexible framework that can be adapted for studying a wide range of epidemiologic associations. For instance, outcomes, exposures, and other predictors can be represented by either indicators for events, episodes, or continuous measurements that vary across units and times, as in Figure 1. The time intervals can be of any length (from seconds to years), depending on the temporal association between outcome and exposures and on practical design considerations. A case is a general definition, and it can represent a subject or other entities such as a geographic area to which observations are assigned, thus allowing analyses to be conducted either at individual level or with aggregated data. Eventually, the case time series structure combines characteristics of various other study designs: it allows individual-level analyses of transient risk associations as in traditional self-matched methods, but it retains the longitudinal temporal frame typical of time series data, with ordered repeated measures of outcomes, exposures, and other predictors. As discussed later, this flexible design setting offers important advantages.

Modeling Framework

A case time series model can be written in a regression form by defining the expectation of a given health outcome $y_{it}$ for case $i$ at time $t$ in relation to a series of predictor terms. Algebraically, the model can be written as follows:
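Based on the terms defined in the surrounding text (intercepts $\xi_{i(k)}$, exposure function $f$, temporal functions $s_j$, and confounder functions $h_p$), Equation (1) can be reconstructed in the following form. The link function $g$ and the summation limits $J$ and $P$ are notational assumptions introduced here for the reconstruction.

```latex
\begin{equation}
g\left[ E(y_{it}) \right] = \xi_{i(k)} + f(x_{it}, l)
  + \sum_{j=1}^{J} s_j(t) + \sum_{p=1}^{P} h_p(z_{pit})
\tag{1}
\end{equation}
```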

The definition in Equation (1) resembles a classic time series regression model traditionally used in environmental epidemiology, where the ordered and sequential nature of the data allows the application of cutting-edge analytical techniques. 7 Specifically, the function $f(x, l)$ specifies the association with the exposure of interest $x$, defined either as a binary episode indicator or as a continuous variable, optionally allowing for nonlinearity and complex temporal dependencies along the lag dimension $l$. These complex relationships can be modeled through distributed lag linear and nonlinear models (DLMs and DLNMs), which can flexibly define cumulative effects of multiple exposure episodes. 14 The term(s) $s_j$ represent functions expressed at different timescales to model temporal variations in risk associated to underlying trends or seasonality, among others. 15 Other measurable time-varying confounders $z_p$ can be modeled through functions $h_p$, and these can include for instance age or time since a specific intervention. The two sets of terms $s_j$ and $h_p$ ensure a strict control of temporal variation in risks over multiple time axes. The outcome $y$ can represent binary indicators, counts of rare or frequent events, or continuous measures. The analysis can be performed on multiple cases $i = 1, \ldots, n$, with intercepts $\xi_{i(k)}$ expressing baseline risks for different risk sets, optionally stratified further in time strata $k = 1, \ldots, K_i$ nested within them, allowing an additional within-case control for temporal variations in risk.
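The lag dimension can be illustrated with a small sketch that builds the matrix of lagged exposures for one case. It assumes the simplest distributed lag linear model, in which the effect is constant over lags 0 to `max_lag`, so that the exposure term reduces to a regression on the sum of the lagged exposures. The function names are hypothetical, and this is not the spline-based DLNM machinery referenced in the text.

```python
def lag_matrix(x, max_lag):
    """Row t holds [x[t], x[t-1], ..., x[t-max_lag]]; None marks missing history."""
    return [
        [x[t - lag] if t - lag >= 0 else None for lag in range(max_lag + 1)]
        for t in range(len(x))
    ]


def cumulative_exposure(x, max_lag):
    """Sum of exposure over lags 0..max_lag; None until a full lag window exists."""
    out = []
    for row in lag_matrix(x, max_lag):
        out.append(None if None in row else sum(row))
    return out


# Hypothetical binary exposure series with a single episode at time 1.
x = [0, 1, 0, 0, 0]
Q = lag_matrix(x, max_lag=2)
cum = cumulative_exposure(x, max_lag=2)

for t, (row, c) in enumerate(zip(Q, cum)):
    print(t, row, c)
```

With a constant lag effect, a single episode raises the exposure term at its own time and for the following `max_lag` intervals; smoother lag-response shapes would replace the plain sum with a weighted combination of the same matrix columns.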

The estimation procedures in case time series analyses rely on estimators and efficient computational algorithms provided by the general framework of fixed-effects models. 16 These were developed in econometrics and often applied in panel studies with repeated observations. 10 , 17 Fixed-effects methods allow the estimation of coefficients for the various functions in Equation (1), without including the potentially high number of case/stratum-specific intercepts $\xi_{i(k)}$, treated as nuisance (or incidental) parameters. 16

Fixed-effects estimators are available for the three main types of outcomes and distributions within the extended exponential family of generalized linear models (GLMs). Specifically, for continuous outcomes with a Gaussian distribution, the estimation procedure involves mean-centring and a simple correction of the degrees of freedom. For event-type indicator or count outcomes following a Bernoulli and Poisson distribution, respectively, estimators for fixed-effects models with canonical logit and log links can be defined through conditional likelihoods for logistic and Poisson regression. 18 , 19 These are forms of partial likelihoods that are derived by defining reduced sufficient statistics for $\xi_{i(k)}$, obtained by conditioning on the total number of events within each of the $n$ cases or $n \times K$ strata.
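For the Gaussian case, the mean-centring step can be sketched in a few lines. The example below assumes a single predictor and noise-free toy data generated as a case-specific intercept plus 2 times the exposure; the function name `within_slope` and the data are illustrative assumptions.

```python
def within_slope(cases):
    """Fixed-effects slope for one predictor.

    cases: list of (x_values, y_values) pairs, one pair per case.
    Each case is centred on its own means, which removes the case intercept.
    """
    sxy = 0.0
    sxx = 0.0
    for xs, ys in cases:
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        for x, y in zip(xs, ys):
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx


# Two hypothetical cases with very different baselines (10 and -13), same slope 2.
cases = [
    ([0, 1, 2, 3], [10, 12, 14, 16]),
    ([5, 6, 7], [-3, -1, 1]),
]

slope = within_slope(cases)

# Ignoring the case structure (one pooled "case") gives a misleading slope.
pooled_x = [x for xs, _ in cases for x in xs]
pooled_y = [y for _, ys in cases for y in ys]
naive_slope = within_slope([(pooled_x, pooled_y)])

print(slope, naive_slope)
```

The within-case estimate recovers the common slope even though the intercepts are never estimated, whereas the pooled estimate is distorted by the between-case difference in baseline. The degrees-of-freedom correction mentioned in the text accounts for the case means that were removed, so the residual degrees of freedom are the number of observations minus the number of cases minus the number of coefficients.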

The main advantage of fixed-effects models is that the effect of any unmeasured predictor that does not vary within each risk set is absorbed by the intercept $\xi_{i(k)}$, and therefore the related confounding effect is controlled for implicitly by design, as in other self-matched methods. 8 In addition, the within-case design offers important computational advantages, especially from a big data perspective. First, the analysis is restricted to informative strata, that is, cases and risk sets with variation in both outcome and exposure. Second, the estimators are based on efficient computational schemes, where the conditional or fixed-effect likelihood is defined by the sum of parts related to multiple risk sets, and the corresponding nuisance parameters $\xi_{i(k)}$ are not directly estimated.
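Both computational points can be seen in a small sketch of the conditional Poisson likelihood for a single coefficient. Conditioning on the total count in each stratum turns the counts into a multinomial allocation, so the stratum intercepts drop out, and strata without events or without exposure variation contribute nothing to the score. The function names, the bisection solver, and the toy strata are assumptions made for illustration; they are not the estimation routines used by the author.

```python
import math


def score(beta, strata):
    """Derivative of the conditional Poisson log-likelihood, summed over strata."""
    total = 0.0
    for xs, ys in strata:
        w = [math.exp(beta * x) for x in xs]
        x_bar = sum(wi * x for wi, x in zip(w, xs)) / sum(w)  # weighted mean exposure
        total += sum(y * (x - x_bar) for x, y in zip(xs, ys))
    return total


def fit_beta(strata, lo=-5.0, hi=5.0, tol=1e-10):
    """Solve score(beta) = 0 by bisection (the score decreases in beta)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid, strata) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Informative stratum: 3 unexposed intervals with 1 event each, 1 exposed with 3 events.
informative = ([0, 0, 0, 1], [1, 1, 1, 3])
no_events = ([0, 1, 0, 1], [0, 0, 0, 0])  # no variation in the outcome
constant_x = ([1, 1], [2, 0])             # no variation in the exposure

beta_one = fit_beta([informative])
beta_all = fit_beta([informative, no_events, constant_x])

print(math.exp(beta_one), math.exp(beta_all))
```

The estimated rate ratio is 3 in both fits, matching the ratio of event rates in the informative stratum, which shows why non-informative strata can be dropped before estimation without changing the result.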

Key Assumptions and Threats to Validity

As discussed earlier, the case time series framework has interesting design and modeling features that offer important advantages. On the other hand, its self-controlled structure, while appealing, only operates within an elementary causal framework and requires relatively strict assumptions to protect against key threats to validity. Specifically, the main requirements are the following:

  • Distributional assumptions on the outcome. The outcome $y_{it}$ must represent conditionally independent observations originating from one of the standard family distributions, for instance, Poisson counts, Bernoulli binary indicators, or Gaussian continuous measures.
  • Outcome-independent follow-up period. The period of observation for each case $i$ must be independent of a given outcome, meaning that the follow-up period cannot be defined or modified by the outcome itself.
  • Outcome-independent exposure distribution. The probability of the exposure $x_t$ must be independent of the outcome history before $t$, meaning that the occurrence of a given outcome must not modify the exposure distribution in the following period.
  • Constant baseline risk conditionally on measured time-varying predictors. The baseline risk along the (strata of) follow-up period of each case $i$ must be constant, meaning that variations in risks must be fully explained by model covariates.

These requirements enable valid conditional comparison of observations at different times within the follow-up of each case. Departures from these assumptions can produce imbalances in the temporal distribution of the outcome, the exposure, or unmeasured risk factors, thus determining spurious associations.

Some of these assumptions have been separately described in the literature of self-matched designs and fixed-effects models. 20–23 Specifically, assumption 1 dictates that outcomes must occur independently, and in particular that the occurrence of a given outcome level or event must not modify the risk of following outcomes. 24 This assumption indirectly implies that outcomes are recurrent, and nonrecurrent events can only be analyzed if rare in the population of interest. 25 , 26 Assumptions 2 and 3 are those posing more limitations to the application of self-matched methods, as for many associations of interest an outcome can modify both the follow-up period and exposure distribution. 27 , 28 These requirements often restrict the case time series designs to the analysis of exogenous exposures, which are by definition outcome-independent, and for which the observation period can be extended even beyond a terminal event, as in bidirectional case–crossover schemes. 29 Assumption 4 requires a constant baseline risk to ensure conditional exchangeability between observations within each risk set, 20 , 30 , 31 requiring that relevant time-varying confounders are included and all the terms in Equation (1) are correctly specified.

Importantly, the design setting described earlier is not suited to represent complex causal scenarios characterized by dynamic mechanisms between time-varying terms. Specifically, feedback between outcomes and between outcome and exposure are forbidden by assumptions 1 and 3, respectively, while more generally exposure–confounder feedback cannot be validly handled through traditional regression-based methods for longitudinal data. 32

SIMULATION STUDY

We evaluated the performance of the case time series design in a set of simulated scenarios that involved various data-generating processes and assumptions ( Table ). Detailed information on the simulation settings, definitions, and additional results are provided in eAppendix 3; https://links.lww.com/EDE/B841 . Briefly, we simulated and analyzed data for 500 subjects followed up for 1 year, testing the method in terms of relative bias, coverage, and relative root mean square error (RMSE) in 50,000 replications. The basic scenario involves an outcome represented by repeated event counts and binary indicators of exposure episodes associated with a constant increase in risk in the next 10 days.

Scenario | Relative Bias (%) | Coverage | Relative RMSE (%)
Scenario 1: basic | 0.0 | 0.951 | 8.8
Scenario 2: rare outcome/exposure | −4.5 | 0.951 | 86.0
Scenario 3: continuous exposure | −0.1 | 0.950 | 15.2
Scenario 4: binary outcome | 0.3 | 0.949 | 9.1
Scenario 5: continuous outcome | 0.0 | 0.950 | 14.7
Scenario 6: common trend | −0.1 | 0.950 | 28.8
Scenario 7: subject-specific trend | 0.1 | 0.948 | 35.2
Scenario 8: unobserved baseline confounder | 0.2 | 0.951 | 25.8
Scenario 9: time-varying confounder | −0.2 | 0.949 | 35.1
Scenario 10: complex lag structure | 0.0 | 0.950 | 29.2
Scenario 11: outcome-dependent risk | −18.9 | 0.738 | 24.7
Scenario 12: outcome-dependent follow-up | 16.8 | 0.797 | 22.7
Scenario 13: outcome-dependent exposure | 11.1 | 0.744 | 14.4
Scenario 14: variation in baseline risk | 40.7 | 0.222 | 43.3
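
The three performance measures in the Table can be illustrated with a short Python sketch. The exact definitions used in the study are given in eAppendix 3 and are not reproduced here; the code below assumes the conventional ones (bias and RMSE expressed as a percentage of the true coefficient, and coverage as the share of nominal 95% Wald intervals containing it). The function name `performance` and the simulated inputs are hypothetical.

```python
import numpy as np

def performance(estimates, std_errors, true_value, z=1.96):
    """Relative bias (%), CI coverage, and relative RMSE (%) across replications.

    Assumed (conventional) definitions, not taken from the eAppendix.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)

    # Mean error of the estimator, as a percentage of the true value
    rel_bias = 100 * (estimates.mean() - true_value) / true_value

    # Share of Wald intervals (estimate +/- z * SE) that contain the truth
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    coverage = np.mean((lower <= true_value) & (true_value <= upper))

    # Root mean square error, as a percentage of the true value
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    rel_rmse = 100 * rmse / abs(true_value)
    return rel_bias, coverage, rel_rmse

# Hypothetical replications of an unbiased estimator with known SE
rng = np.random.default_rng(1)
true_beta, se = 0.1, 0.01
est = rng.normal(true_beta, se, size=5000)
rel_bias, coverage, rel_rmse = performance(est, np.full(5000, se), true_beta)
print(rel_bias, coverage, rel_rmse)
```

With an unbiased estimator and correctly specified standard errors, relative bias should be close to 0 and coverage close to 0.95, which is the pattern seen in scenarios 1 and 3–10.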

The first part of the simulation study (scenarios 1–10) evaluates the performance of the new design in recovering the true association under increasingly complex data settings. Specifically, the scenarios depict different outcome and exposure types, the presence of common or subject-specific trends, time-invariant and time-dependent confounders, and more complex lag structures. Results in the Table indicate that the case time series design provides correct point estimates and confidence intervals in almost all ten scenarios. The small underestimation in scenario 2 is consistent with the asymptotic bias of maximum likelihood estimators originating from the extreme imbalance of expected events between risk and control periods, previously described and defined analytically in the self-controlled case series literature. 33 eFigure 1; https://links.lww.com/EDE/B841 shows that the case time series models can correctly recover the true association, both in the basic scenario 1 with constant risk and no confounding, and in the more complex scenario 10 representing varying lag effects, strong temporal trends, and highly correlated confounders.

The second part of the simulation study (scenarios 11–14) illustrates basic applications, but where each of the four assumptions, in turn, does not hold. Specifically, scenario 11 describes the case where the occurrence of an outcome can change the risk status of a subject and temporarily reduce their underlying risk. This can occur for instance when the event results in the prescription of drugs or therapies. This induces a form of dependency in the outcome series that violates assumption 1 and, in this example, results in a negative bias (Table). Scenario 12 simulates a different situation, namely when the outcome event carries a risk of censoring the follow-up, for instance, if it increases the probability of death. This contravenes assumption 2 and generates a bias in the opposite direction. In scenario 13, the outcome event instead reduces the probability of exposure episodes in the following 2 weeks, a situation that can occur for example if the event results in hospitalization or lifestyle changes. Here assumption 3 does not hold, and the estimators are again biased upward. Finally, scenario 14 illustrates the case of unobserved periods of lower baseline risk within the follow-up, for instance, corresponding to holiday periods with a reduced probability of an outcome being reported. This undermines the conditional exchangeability requirements of assumption 4 and induces a large positive bias.

ILLUSTRATIVE EXAMPLES

This section illustrates the application of the case time series design in two real-data examples. These case studies are described here only for illustrative purposes, and they are not meant to offer substantive epidemiologic evidence on the associations under study. Detailed information on the setting and sources of data can be found in the cited references. Documents in the eAppendices 1 and 2; https://links.lww.com/EDE/B841 , provide notes and R code that reproduce the steps of these analyses using simulated data, and they offer details on the specific modeling choices.

Flu and Myocardial Infarction

The first example replicates a published analysis that assessed the role of influenza infection as a trigger for acute myocardial infarction (AMI). 34 The data, retrieved by linking electronic health records from primary care and cohort databases for England and Wales, include 3,927 acute MI cases with at least one flu episode in the period 2003–2009. A representation of a subinterval of the follow-up for six subjects is reported in eFigure 2; https://links.lww.com/EDE/B841 . The original analysis relied on the self-controlled case series design to examine the association, using exposure windows in the 1–91 days after each flu episode and controlling for trends using 5-year age strata and trimester indicators. Limitations of this approach are the use of stratification to describe smooth continuous dependencies and the fact that multiple flu episodes experienced by some subjects caused the long exposure windows to overlap (eFigure 2; https://links.lww.com/EDE/B841 ), requiring ad-hoc fixes that can generate biases. 35 Conversely, the rarity of the exposure, with most of the subjects experiencing a single flu episode, prevents the application of the case–crossover design, as most control sampling schemes would generate nondiscordant case–referent sets.

We replicated the analysis with a case time series design, splitting the follow-up period of each subject into daily time series (eAppendix 1; https://links.lww.com/EDE/B841 ). We fitted a fixed-effects Poisson model to estimate the flu–AMI association while controlling for underlying trends across multiple time scales. The model includes smooth functions to define the baseline risk, specifically using natural splines (with two knots at the interquartile range) for age and cyclic splines (with three degrees of freedom) for seasonality. More importantly, we applied DLMs defined by either splines (with knots at 3, 10, and 29 lags) or step functions (with strata 1–3, 4–7, 8–14, 15–28, and 29–91 lags) to describe temporal effects along with the exposure window.
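
As an illustration of the step-function specification, the following Python sketch builds, for one subject's daily series, one column per lag stratum that counts the flu episodes falling within that stratum. This is only a sketch of the design-matrix step, not the author's implementation (the R code is in eAppendix 1). The function name `lag_strata_terms` and the example series are hypothetical, and exposure history before the start of follow-up is assumed to be zero. Because the columns are counts rather than 0/1 flags, overlapping windows from repeated episodes add up instead of requiring ad-hoc fixes.

```python
import numpy as np

# Lag strata (in days) quoted in the text for the step-function DLM
STRATA = [(1, 3), (4, 7), (8, 14), (15, 28), (29, 91)]

def lag_strata_terms(episodes, strata=STRATA):
    """Count, for each day t, the episodes occurring lo..hi days earlier.

    episodes: 1-D sequence of 0/1 daily indicators of a flu episode.
    Returns an array of shape (n_days, n_strata).
    Assumption: no episodes before day 0 (history treated as zero).
    """
    x = np.asarray(episodes, dtype=float)
    n = x.size
    # Prefix sums with a leading zero: sum(x[a:b]) == cs[b] - cs[a]
    cs = np.concatenate(([0.0], np.cumsum(x)))
    t = np.arange(n)
    out = np.zeros((n, len(strata)))
    for j, (lo, hi) in enumerate(strata):
        # Days t-hi .. t-lo, clipped at the start of follow-up
        start = np.clip(t - hi, 0, n)
        stop = np.clip(t - lo + 1, 0, n)
        out[:, j] = cs[stop] - cs[start]
    return out

# Hypothetical subject: 120 days of follow-up, episodes on days 0 and 5
episodes = np.zeros(120)
episodes[[0, 5]] = 1
terms = lag_strata_terms(episodes)
print(terms[10])  # day 10 is at lag 10 from day 0 and lag 5 from day 5
```

These columns would then enter the fixed-effects Poisson model alongside the spline terms for age and season.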

Results are reported in Figure 2 . The left and middle panels display the variation in risk of AMI by age and season, showing how the case time series design allows modeling baseline trends fluctuating smoothly across multiple time axes. The right panel illustrates the risk after a flu episode within the selected lag period, as estimated using a DLM with spline functions. The graph indicates a high risk in the first days after a flu episode, which then attenuates and disappears after approximately 1 month. The same panel also includes the fit of the alternative distributed lag model defined by step functions, which assumes a constant risk within exposure windows (see also eFigure 3; https://links.lww.com/EDE/B841 ). This specification matches the stratification approach in the original self-controlled case series analysis, 34 although the case time series design with DLMs accounts for cumulative effects of potentially overlapping periods of flu episodes.


Environmental Exposures and Respiratory Symptoms

The second example illustrates a preliminary analysis of the role of multiple environmental stressors in increasing the risk of respiratory symptoms using smartphone technology. Data were collected within AirRater, an integrated online platform operating in Tasmania that combines symptom surveillance, environmental monitoring, and real-time notifications. 12 A smartphone app allowed the self-reported recording of respiratory symptoms and the reconstruction of personalized exposure series by linking geolocated positions with high-resolution spatiotemporal maps derived from environmental monitors ( Figure 3 ). Standard cohort analyses based on between-subject comparisons are unsuitable in this complex study setting, characterized by continuous recruitment, high dropout rates, and intermittent participation (eFigure 4; https://links.lww.com/EDE/B841 ). Similarly, the frequent and highly seasonal outcome poses problems in adopting a case–crossover design, with issues in selecting control times and with the assumption of constant within-stratum risk. Finally, the presence of multiple continuous exposures prevents the application of the self-controlled case series design, either in its standard or extended forms. 36 , 37


We, therefore, applied a case time series design (eAppendix 2; https://links.lww.com/EDE/B841 ). The analysis included 1,601 subjects followed between October 2015 and November 2018, with a total of 364,384 person–days. The event-type outcome was defined as daily indicators of reported respiratory symptoms and associated with individual exposure to pollen (grains/m³), fine particulate matter (PM2.5, μg/m³), and temperature (°C) ( Figure 3 ). We modeled the relationships using a fixed-effects logistic regression over a lag period of 0–3 days, using an unconstrained distributed lag model for the linear association with PM2.5, and bidimensional spline DLNMs for specifying nonlinear dependencies with pollen and temperature. 14 , 38 A strict temporal control was enforced by using subject/month strata intercepts, natural splines of time (with 8 df/year), and indicators of the day of the week, thus modeling individually varying baseline risks on top of shared long-term, seasonal, and weekly trends.

Figure 4 shows the preliminary results, with estimated associations reported as odds ratios (ORs) from the model that includes simultaneously the three environmental stressors. The graphs display the overall cumulative exposure-response relationships (top panels), interpreted as the net effects across lags, and the full bidimensional exposure-lag-response associations (bottom panels). 14 , 38 The lefthand panels indicate a positive association between risk of allergic symptoms and pollen, with a step increase in risk that flattens out at high exposures, and a lagged effect up to 2 days. The middle panels suggest an independent association with PM2.5, where the risk is entirely limited to the same-day exposure. Finally, results in the righthand panels show a positive association with high ambient temperature, with the OR increasing above 1 beyond daily averages of 15°C.

DISCUSSION

The novel case time series methodology offers a general modeling framework for the analysis of epidemiologic associations with time-varying exposures. The design is adaptable to various data settings for the analysis of highly informative longitudinal measurements, and it is particularly well suited in applications with modern data resources such as individual-level exposure models and real-time technologies.

The main feature of the methodology is a flexible scheme that embeds a longitudinal time series structure in a within-subject design, providing unique modeling advantages. For instance, the sequential order of observations offers the opportunity to assess complex temporal relationships with multiple exposures, where patterns of cumulative effects for linear or nonlinear dependencies can be easily modeled. Furthermore, the time series and self-controlled features offer a structure that enables strict control for confounding: time-invariant and time-varying factors can be adjusted for by stratifying the baseline risk between and within subjects, respectively, while residual temporal variations can be directly modeled through time-varying predictors that represent confounders or shared trends across multiple time axes.

The new design complements and extends the already rich set of self-matched methods for observational studies described in the epidemiologic literature. 8 Previous methodological contributions have highlighted links and similarities between various designs, 18 , 21 , 29 , 30 , 39–41 and ultimately these can be seen as alternative approaches to model the same risk associations. However, each method relies on different sets of assumptions and modeling choices, which explain in part their separate areas of application. The case time series methodology, nevertheless, offers a general framework that combines and extends features of existing designs, with important advantages. For example, it borrows flexible modeling tools from aggregated-data time series design, but it implements them in individual-level analyses that allow a finer reconstruction of outcomes, exposures, and other risk factors. It is applicable to assess associations with multiple continuous predictors as the case–crossover design, and it can model recurrent events, either common or rare, as the self-controlled case series analyses, but it can be extended to the analysis of outcomes represented by binary indicators or continuous measures, simply assuming different distributions. Finally, its time series structure allows the application of sophisticated techniques such as smoothing methods and distributed lag models, characterized by well-defined parameterizations, computational efficiency, and standard software implementations. A thorough and critical comparison of the case time series methodology with alternative approaches will be provided in future contributions.

Together with other self-matched methods, the new case time series design is based on strict assumptions to protect against key threats to validity. However, these conditions are not always met in practice, and their violations can lead to important biases. Specifically, the requirement that both exposures and follow-up periods are independent of the outcome poses severe limitations to the application of the method, in particular in clinical and pharmacoepidemiologic studies. In fact, the temporal distribution of endogenous predictors such as behaviors, clinical therapies, or drug prescriptions is often modified by an outcome event. In contrast, the case time series and other self-controlled designs are well suited for the analysis of exogenous exposures such as environmental factors, as discussed before. Extensions to test and relax these strong assumptions have been developed for the self-controlled case series design, 27 , 28 but further research is needed to implement and assess their validity in case time series models. Conversely, the new design is well suited to control for temporal confounding that can invalidate the assumption of constant baseline risk, through the stratification of the follow-up period and the inclusion of lagged and smooth continuous terms in the model.

Other limitations and areas of current research must be discussed. First, as a method based on a within-subject comparison, the case time series design is ideal for investigating phenomena with short-term changes in risk relative to the study period, while it is less suitable for the analysis of long-term effects and chronic exposures. In fact, while it is in theory possible to extend indefinitely the lag period within the follow-up interval, there is a limit to which the model can disentangle long-lagged effects from seasonal and other trends. 42 In addition, the splitting of the follow-up period into individual-level time series produces a substantial data expansion, with considerable computational demand especially in the presence of a high number of subjects or long study periods. Schemes based on risk-set sampling, previously proposed for cohort and nested case–control studies, 43–45 are currently under development to address this issue. Finally, the simulation study and the two real-data examples presented basic epidemiologic relationships between time-varying variables. However, more complex causal dependencies, involving, for instance, dynamic feedback or multiple pathways, explicitly violate the strict assumptions underpinning the case time series design, and cannot be modeled in the proposed framework. The definition, limitations, and potential extensions of fixed-effects models and related designs within a general causal inference setting are an area of current research. 23

In conclusion, the case time series design represents a novel epidemiologic method for the analysis of transient health associations with time-varying exposures. Its flexible modeling framework can be adapted to various contexts and research areas, for instance, in clinical, environmental, and pharmacoepidemiology, and it is suitable for the analysis of intensive longitudinal data provided by modern data technologies.

ACKNOWLEDGMENTS

The author is thankful to Dr. Charlotte Warren-Gash, and Dr. Fay Johnston and Mr. Iain Koolhof for providing data access and information for the two case studies used as illustrative examples. The author is also grateful to colleagues who provided comments on various drafts of the manuscript and analyses, in particular Mr. Francesco Sera, Dr. Ana Maria Vicedo-Cabrera, and Prof. Ben Armstrong. Finally, the author is indebted to Prof. Paddy Farrington for offering critical insights on asymptotic biases of maximum likelihood estimators in self-controlled case series. The study on influenza and AMI was originally approved by the Independent Scientific Advisory Committee (ISAC) of the Clinical Practice Research Datalink (Ref: 09_034), the Cardiovascular Disease Research Using Linked Bespoke Studies and Electronic Records (CALIBER) Scientific oversight committee and Myocardial Ischaemia National Audit Project (MINAP) Academic Group (ref: 09_08), and the UCL Research Ethics committee (Ref: 2219/001). This study, which used the analysis dataset only, was approved through a minor ISAC amendment (granted on 12/01/2016) and a MINAP Academic Group amendment (granted on 11/01/2016). More information about AirRater is available at https://airrater.org .


AirRater; Case-only; Epidemiologic methods; Longitudinal data; Self-controlled; Study design; Self-matched; Time series

Supplemental Digital Content

  • EDE_2021_08_13_GASPARRINI_EDE19-0733_SDC1.pdf; [PDF] (1.83 MB)


Module 4 - Epidemiologic Study Designs 1: Cohort Studies & Clinical Trials

Introduction


We previously discussed descriptive epidemiology studies, noting that they are important for alerting us to emerging health problems, keeping track of trends in the population, and generating hypotheses about the causes of disease. Analytic studies provide a basic methodology for testing specific hypotheses. The essence of an analytic study is that groups of subjects are compared in order to estimate the magnitude of association between exposures and outcomes. This module will build on descriptive epidemiology and on measuring disease frequency and association by discussing cohort studies and intervention studies (clinical trials). Our discussion of analytic study designs will continue in module 5 which addresses case-control studies. Pay particular attention to the strengths and weaknesses of each design. This is important for being able to select the most appropriate design to answer a given research question. In addition, a firm understanding of the strengths and weaknesses of each design will facilitate building your skills in critical reading of studies by alerting you to possible pitfalls and weaknesses that can undermine the validity of a study.

Essential Questions

  • What are the different strategies for investigating the causes or sources of health outcomes?
  • How do we choose the best approach to study a particular health problem?
  • What are the strengths and limitations of different study designs?

Learning Objectives

After completing this section, you will be able to:

  • Explain the role of descriptive epidemiology for defining health problems and establishing hypotheses about the determinants of health and disease.
  • Explain the utility and the limitations of case reports and case series.
  • Describe the design features and the advantages and weaknesses of each of the following study designs: cross-sectional studies, ecological studies, retrospective and prospective cohort studies, case-control studies, and intervention studies.
  • Identify the study design when reading an article or abstract.
  • Explain how different study designs can be applied to the same hypothesis to provide different and complementary information.

Overview of Epidemiologic studies

The figure below provides a brief overview of epidemiologic studies. The descriptive studies that have already been discussed are listed in the top part: case reports and case series, cross-sectional studies, and ecologic studies. In addition to identifying new problems and keeping track of trends in a population, they also generate hypotheses that can be tested using one of the analytic studies shown at the bottom.

Diagrammatic overview of descriptive and analytic studies as described in the accompanying text.

Note that cohort studies and case-control studies are observational studies, because investigators do not allocate exposure status. Some exposures are constituent (e.g., one's genome), some are behaviors and lifestyle choices, and others are circumstantial, such as social, political, and economic determinants that affect health. None of these exposures are controlled by the investigators in observational studies; the investigators literally observe, collecting data on these exposures and on a variety of health outcomes. In contrast, intervention studies (also called clinical trials or experimental studies) are more like a true experiment in that the investigators assign subjects to a specific exposure (e.g., one or more treatment groups), and the subjects are followed forward in time to record health outcomes of interest. Each of these analytic studies is useful in particular circumstances. Let's begin by discussing cohort studies.

Cohort Studies

Key features of cohort studies.

In cohort studies investigators enroll individuals who do not yet have the health outcomes of interest at the beginning of the observation period, and they assess exposure status for a variety of potentially relevant exposures. The enrollees are then followed forward in time (i.e., these are longitudinal studies rather than cross-sectional) and health outcomes are recorded. With this data investigators can sort the subjects according to their exposure status for one of the exposures of interest and compare the incidence of disease among the exposure categories.


For example, in 1948 the Framingham Heart Study enrolled a cohort of 5,209 residents of Framingham, MA who were between the ages of 30-62 and who did not have cardiovascular disease when they were enrolled. These subjects differed from one another in many ways: whether they smoked, how much they smoked, body mass index, eating habits, exercise habits, sex, family history of heart disease, etc. The researchers assessed these and many other characteristics or "exposures" soon after the subjects had been enrolled and before any of them had developed cardiovascular disease. The many "baseline characteristics" were assessed in a number of ways including questionnaires, physical exams, laboratory tests, and imaging studies (e.g., x-rays). They then began "following" the cohort, meaning that they kept in contact with the subjects by phone, mail, or clinic visits in order to determine if and when any of the subjects developed any of the "outcomes of interest," such as myocardial infarction (heart attack), angina, congestive heart failure, stroke, diabetes and many other cardiovascular outcomes. They also kept track of whether their risk factors changed.

Over time some subjects eventually began to develop some of the outcomes of interest. Having followed the cohort in this fashion, it was eventually possible to use the information collected to evaluate many hypotheses about what characteristics were associated with an increased risk of heart disease. For example, if one hypothesized that smoking increased the risk of heart attacks, the subjects in the cohort could be sorted based on their smoking habits, and one could compare the subset of the cohort that smoked to the subset who had never smoked. For each such comparison that one wanted to make, the cohort could be grouped according to whether they had a given exposure or not, and one could measure and compare the frequency of heart attacks (i.e., the cumulative incidence or the incidence rates) between the groups.

The Population "At Risk"

From the discussion above, it should be obvious that one of the basic requirements of a cohort type study is that none of the subjects have the outcome of interest at the beginning of the follow-up period, and time must pass in order to determine the frequency of developing the outcome.

For example, if one wanted to compare the risk of developing uterine cancer between postmenopausal women receiving hormone-replacement therapy and those not receiving hormones, one would consider certain eligibility criteria for the members prior to the start of the study: 1) they should be female, 2) they should be post-menopausal, and 3) they should have a uterus. Among post-menopausal women there might be a number who had had a hysterectomy already, perhaps for persistent bleeding problems or endometriosis or prior uterine cancer. Since these women no longer have a uterus, one would want to exclude them from the cohort, because they are no longer at risk of developing this particular type of cancer. Similarly, if one wanted to compare the risk of developing diabetes among nursing home residents who exercised and those who did not, it would be important to test the subjects for diabetes at the beginning of the follow-up period in order to exclude all subjects who already had diabetes and therefore were not "at risk" of developing diabetes.

Prospective Cohort Studies

Cohort studies can be classified as prospective or retrospective based on when outcomes occurred in relation to the enrollment of the cohort. The Framingham Heart Study is an example of a prospective cohort study. Another well-known prospective cohort study is the Nurses' Health Study. The original Nurses' Health Study (NHS) began in 1976 by enrolling about 121,000 female nurses from across the United States who were initially free of known cardiovascular disease or cancer. (The Nurses' Health Study is now enrolling the third-generation cohort, which includes male and female nurses.)


In a prospective study like the Nurses' Health Study, baseline information is collected from all subjects in the same way, using exactly the same questions and data collection methods for all subjects. The investigators design the questions and data collection procedures carefully in order to obtain accurate information about exposures before disease develops in any of the subjects.

The distinguishing feature of a prospective cohort study is that, at the time that the investigators begin enrolling subjects and collecting baseline exposure information, none of the subjects has developed any of the outcomes of interest.

After baseline information is collected, the participants are followed "longitudinally," i.e. over a period of time, usually for years, to determine if and when they become diseased and whether their exposure status changes. Most studies of this type contact the participants periodically, perhaps every two years, to update information on exposures and outcomes. In this way, investigators can eventually use the data to answer many questions about the associations between exposures ("risk factors") and disease outcomes. For example, one NHS study examined the association between smoking and breast cancer and found that there was no significant association.

Another NHS study examined the association between obesity and myocardial infarction. They used reported height and weight to calculate BMI and categorized women into five categories of BMI. The table below summarizes their findings with respect to non-fatal myocardial infarction.

BMI | # non-fatal MIs | Person-Years | Inc. Rate Per 100,000 P-Y | Rate Ratio
>=30 | 85 | 99,573 | 85.4 | 3.7
25.0-29.9 | 67 | 148,541 | 45.1 | 2.0
23.0-24.9 | 56 | 155,717 | 36.0 | 1.6
20.0-22.9 | 57 | 194,243 | 29.3 | 1.3
<20 | 41 | 177,356 | 23.1 | 1.0

The data above are from Willett WC, Manson JE, et al.: Weight, weight change, and coronary heart disease in women. Risk within the 'normal' weight range. JAMA. 1995 Feb 8;273(6):461-5.
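
The incidence rates and rate ratios can be recomputed directly from the counts and person-years in the table. The short Python sketch below does this, expressing rates per 100,000 person-years and using the lowest BMI category (<20) as the reference group; the variable and function names are illustrative only.

```python
# (BMI category, non-fatal MIs, person-years) as listed in the table
rows = [
    (">=30", 85, 99_573),
    ("25.0-29.9", 67, 148_541),
    ("23.0-24.9", 56, 155_717),
    ("20.0-22.9", 57, 194_243),
    ("<20", 41, 177_356),
]

def incidence_rate(events, person_years, per=100_000):
    """Incidence rate: events divided by person-time, scaled to `per` person-years."""
    return events / person_years * per

# Reference group: the lowest BMI category
ref_rate = incidence_rate(41, 177_356)

results = {}
for label, events, person_years in rows:
    rate = incidence_rate(events, person_years)
    # Rate ratio = rate in this category / rate in the reference category
    results[label] = (round(rate, 1), round(rate / ref_rate, 1))

for label, (rate, ratio) in results.items():
    print(f"{label:>10}  rate per 100,000 P-Y = {rate:5.1f}  rate ratio = {ratio:.1f}")
```

Each rate ratio compares one BMI category with the reference group, so the reference category has a rate ratio of 1.0 by construction.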


Follow Up in Prospective Cohort Studies

Ideally, investigators want to have complete follow-up on all subjects, but in large cohort studies that run for years, there are inevitably people who become lost to follow up as a result of death, moving, or simply loss of interest in participating. When this occurs, the investigators know the subject's exposure status prior to losing them, but not their outcome.

The biggest problem with substantial loss to follow up (LTF) is that it can bias the results of the study if the losses are different for one of the exposure-outcome categories. This will be illustrated in the module on bias.

There is no way to know if the losses are different for one of the exposure-outcome categories, so the only strategy to minimize bias from loss to follow up is to keep follow up high (in both prospective cohort studies and clinical trials).
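
A small numeric sketch in Python can make the point about differential losses concrete. All numbers below are hypothetical: a cohort of 1,000 exposed and 1,000 unexposed subjects with a true risk ratio of 2.0. When every exposure-outcome category loses the same fraction of subjects the risk ratio is unchanged, but when losses are concentrated in one category (here, exposed subjects who developed the disease) the estimate is biased.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table.

    a: exposed with disease      b: exposed without disease
    c: unexposed with disease    d: unexposed without disease
    """
    return (a / (a + b)) / (c / (c + d))

def apply_losses(cells, retained):
    """Scale each cell by the fraction of subjects retained in follow-up."""
    return tuple(n * r for n, r in zip(cells, retained))

# Hypothetical complete cohort: risks of 10% (exposed) and 5% (unexposed)
full = (100, 900, 50, 950)
rr_true = risk_ratio(*full)

# Non-differential loss: 20% lost from every category
rr_uniform = risk_ratio(*apply_losses(full, (0.8, 0.8, 0.8, 0.8)))

# Differential loss: 40% lost only among exposed subjects with disease
rr_differential = risk_ratio(*apply_losses(full, (0.6, 1.0, 1.0, 1.0)))

print(rr_true, rr_uniform, rr_differential)
```

In this hypothetical cohort the differential loss pulls the risk ratio from 2.0 toward the null. Because investigators cannot observe outcomes among those lost, they cannot tell which of these situations applies, which is why high follow-up is the only safeguard.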

Strategies to Maintain Follow Up

  • Choosing subjects who are motivated
  • Choosing subjects who are easy to track (e.g., registered nurses or physicians)
  • Keeping subjects interested with newsletters and incentives
  • Being courteous and making them feel that they are members of a research "family"
  • Frequent phone calls
  • Making questionnaires easy to fill out

Retrospective Cohort Studies

In contrast to prospective studies, retrospective studies are conceived after some people have already developed the outcomes of interest. The investigators jump back in time to identify a cohort of individuals at a point in time before they had developed the outcomes of interest, and they try to establish their exposure status at that point in time. They then determine whether the subjects subsequently developed the outcome of interest.

In essence, the investigators jump back in time to identify a useful cohort which was initially free of disease and 'at risk' of developing the outcome. They then use whatever records are available to determine each subject's exposure status at the beginning of the observation period, and they then ascertain what subsequently happened to the subjects in the two (or more) exposure groups. Retrospective cohort studies are also 'longitudinal,' because they examine health outcomes over a span of time. The distinction is that in retrospective cohort studies some or all of the cases of disease have already occurred before the investigators initiate the study. In contrast, exposure information is collected at the beginning of prospective cohort studies before any subjects have developed any of the outcomes of interest, and the 'at risk' period begins after baseline exposure data is collected and extends into the future.

Suppose investigators wanted to test the hypothesis that working with the chemicals involved in tire manufacturing increases the risk of death. Since this is a fairly rare exposure, it would be advantageous to use a special exposure cohort such as employees of a large tire manufacturing factory and conduct a retrospective cohort study.


The employees who actually worked with chemicals used in the manufacturing process would be the exposed group, while clerical workers and management might constitute the "unexposed" comparison group. Instead of following these subjects for decades, it would be more efficient to use employee health and employment records over the past two or three decades as a source of data. In essence, the investigators are jumping back in time to identify the study cohort at a point in time before the outcome of interest (death) occurred. They can classify them as "exposed" or "unexposed" based on their employment records, and they can use a number of sources to determine subsequent outcome status, such as death (e.g., using health records, next of kin, National Death Index, etc.).

Retrospective cohort studies are less expensive and more efficient than prospective cohort studies, because subjects don't need to be followed for years. However, the disadvantage is that the quality of the data is generally inferior to that of a prospective study. In the study of mortality and tire manufacturing chemicals the clerical staff may be much less exposed to the chemicals, but there are likely to be important differences in other factors that influence mortality (confounding factors), such as sex, age, socioeconomic status, education, diet, smoking, alcohol consumption, etc. Employee health records are unlikely to capture this information in sufficient detail to enable the investigators to adjust for differences in these other factors. (We will discuss adjusting for confounding later in the course.)

The distinguishing feature of a retrospective cohort study is that the investigators conceive the study and begin identifying and enrolling subjects after outcomes have already occurred in some of the subjects.

Strengths and Disadvantages of Cohort Studies

 

  


Selection of Subjects for Cohort Studies

The selection of subjects for a study is primarily dictated by the research questions and by feasibility.

General Cohorts

For relatively common exposures and health outcomes a general cohort, such as residents of Framingham, MA, can be enrolled. The Framingham Heart Study, which began in 1948, enrolled 5,209 men and women 30-62 years old. At the time little was known about the determinants of heart disease and stroke, devastating health problems that had steadily increased in frequency throughout the 20th century. The investigators gathered extensive baseline information with questionnaires, lab tests, and imaging studies. They then followed the subjects, and had them return to the study office every two years for a detailed medical history, physical examination, and repeat lab tests. The Framingham study has been enormously successful in providing information about the most important determinants of cardiovascular diseases (e.g., hypertension, high cholesterol, smoking, obesity, diabetes, and physical inactivity).  Framingham investigators also collaborate with leading researchers throughout the world on studies of stroke and dementia, osteoporosis and arthritis, nutrition, diabetes, eye diseases, hearing disorders, lung diseases, and genetic patterns of common diseases.

The Nurses' Health Study and the Black Women's Health Study would also be considered general cohorts, because they both provide the opportunity to study many exposures and many health outcomes among participants with a wide variety of circumstances. These studies enable investigators to collect information on many common exposures (e.g., high blood pressure, smoking, alcohol use, diet, exercise, etc.), and, after sufficient follow-up time, many health outcomes can be studied. When conducting studies using data from a general cohort, the reference group comes from within the cohort, i.e., an internal comparison group. For example, when the Nurses' Health Study examined the association between exercise and heart disease, the investigators carefully assessed physical activity and computed an overall "MET" score that takes into account the frequency, duration, and intensity of many activities. They then sorted the subjects by MET score, divided the cohort into quintiles (i.e., five groups with roughly equal numbers of subjects), and used the quintile with the lowest MET scores as the reference group against which they compared each of the other quintiles. [Manson JE, Hu FB, et al.: A prospective study of walking as compared with vigorous exercise in the prevention of coronary heart disease in women. N Engl J Med 1999;341:650-8].

Special Cohorts

For rare or unusual exposures the obvious choice would be a special cohort that provides a sufficient number of subjects with the exposure of interest. Examples might include occupational exposures (e.g., asbestos, radiation, and pesticides), unusual diets, drug exposures (e.g., pregnant women treated with diethylstilbestrol in the 1960s), or rare events (e.g., Hurricane Katrina, the bombing of Hiroshima, exposure of responders to the attack on the World Trade Center on 9/11). With special cohorts there is obviously a focus on a single exposure, but many potential health outcomes can be studied. Another major difference from general cohorts is that selection of an appropriate comparison group can be challenging.

A good example of a special cohort study is the US Air Force Health Study on the effects of exposure to dioxin. During the Vietnam War, the U.S. military sprayed the herbicide mixture known as "Agent Orange," which was contaminated with dioxin, over Vietnam to expose enemy supply lines and bases. Airmen were exposed during spraying flights, while loading the chemical, and while performing maintenance on the planes that were used. After the war, combat veterans who had been in Vietnam complained of a variety of health problems. In 1979, the US Congress directed that an epidemiologic study be conducted to evaluate adverse health effects associated with exposure to dioxin and other herbicides used during the Vietnam conflict. The study (informally called the "Ranch Hand Study") enrolled a special cohort consisting of US Air Force pilots who had flown herbicide spraying missions. The comparison group consisted of Air Force flight crews and maintenance personnel who served in Southeast Asia but had not been involved in herbicide spraying operations. Subjects have been followed for many years, and several analyses have found increased all-cause mortality and cardiovascular mortality in those exposed to dioxin. There was also evidence of an association with obesity and possibly diabetes. There were conflicting reports regarding the association between dioxin and cancers.

Selection of a Comparison Group

The major challenge for the Air Force Health Study (AFHS) and other special cohort studies is selection of an appropriate comparison group. The goal of analytic studies is to compare health outcomes in exposed and unexposed groups that are otherwise as similar as possible, i.e., having the same distributions of all other factors that could have any association with health outcomes. We will see that intervention studies with large numbers of subjects randomly assigned to two or more treatment groups (exposures) can usually achieve this, so that the groups being compared have similar distributions of age, sex, smoking, physical activity, etc., but random assignment does not occur in cohort studies. Suppose that a cohort study had smokers who were older than the non-smokers. It is well established that the risk of heart disease increases with age, i.e., age is an independent risk factor for heart disease, and if the smokers are older, they have an additional risk factor that will cause an overestimate of the association between smoking and heart disease. This phenomenon, called confounding, occurs when the exposure groups that are being compared differ in the distribution of other determinants of the outcome of interest. Another concern is that the exposure groups being compared may differ in the quality or accuracy of the data being collected, and this can also bias the results (so-called information bias). Confounding and bias will be discussed later in the course, but for now, it is important to recognize the importance of selecting a comparison group that differs in exposure status but is as similar as possible to the exposed group in all other ways including:

  • Other factors that can influence the health outcome
  • The quality and accuracy of their data
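
The smoking and age example above can be made concrete with a small numeric sketch in R. All counts below are made up for illustration (they do not come from any study): smoking doubles the risk within each age group, but because most smokers in this hypothetical cohort are old, the crude comparison exaggerates the association.

```r
# Hypothetical cohort (made-up counts): smokers are mostly old,
# non-smokers are mostly young
cohort <- data.frame(
  age    = c("young", "young", "old", "old"),
  smoker = c(1, 0, 1, 0),
  n      = c(200, 800, 800, 200),   # subjects in each cell
  cases  = c(4, 8, 80, 10)          # subjects who developed heart disease
)

# Cumulative incidence (risk) in any subset of rows
risk <- function(d) sum(d$cases) / sum(d$n)

# Crude risk ratio: ignores age entirely
crude_rr <- risk(cohort[cohort$smoker == 1, ]) /
            risk(cohort[cohort$smoker == 0, ])

# Stratum-specific risk ratios: compare smokers and non-smokers of similar age
stratum_rr <- sapply(c("young", "old"), function(a) {
  d <- cohort[cohort$age == a, ]
  risk(d[d$smoker == 1, ]) / risk(d[d$smoker == 0, ])
})

print(crude_rr)     # larger than either stratum-specific value
print(stratum_rr)   # the same within each age group
```

Comparing groups that are similar in age (here, within strata) removes the distortion, which is the reason for choosing a comparison group that resembles the exposed group as closely as possible.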

The figure below depicts three studies of cardiovascular disease illustrating the general approaches to selecting a comparison group for a cohort study.


As noted earlier, general cohorts employ an internal comparison group, e.g., dividing the cohort into quintiles of BMI or quintiles of activity and using the quintile with the lowest BMI or the lowest activity as the reference group. This is the best comparison group for a general cohort study, because the subjects are likely to be similar in some ways, but they may still differ with respect to potentially confounding factors. For example, nurses who exercise regularly may be generally more health conscious (e.g., less likely to smoke; more likely to eat a healthier diet; more likely to take vitamins, etc.).

The second method is to use an external comparison group. A special exposure cohort consisting of workers in a rayon factory was selected to study the association between carbon disulfide exposure and risk of cardiovascular disease, and the comparison group consisted of workers in a paper mill. These two groups may be similar in age distribution, socioeconomic status, and other factors, but they may also differ with respect to other confounding factors. In addition, paper mills have their own mix of occupational exposures, which might also affect the likelihood of cardiovascular disease and bias the results.

The third approach is to use the general population as a comparison group, for example, if trying to determine whether workers in a rayon factory had higher mortality rates. This approach is less costly, and it is sometimes used for studies of occupational exposures when it is difficult to find an appropriate internal or external comparison group. However, using rates of death or disease in the general population has a number of limitations:

  • General population data are frequently limited to studies of mortality since accurate rates on specific health outcomes may not be available.
  • General population rates include exposed and unexposed individuals.
  • The general population is not really comparable because there are many confounding variables that cannot be controlled for.
  • The general population includes people who are unable to work because of disease or disability (the "healthy worker effect" which is discussed in the module on bias).


Basic Analysis of Cohort Study Data

One of the first steps in the analysis of an epidemiologic study is to generate simple descriptive statistics on each of the groups being compared. This helps characterize the study population, and it also alerts you and your readers to any differences between the groups with respect to other exposures that might cause confounding.

The illustration below is Table 1 from the study by Manson et al. on exercise and prevention of cardiovascular disease. Recall that they calculated each subject's MET score to estimate their overall activity level and then divided the cohort into quintiles based on the MET scores.


There are columns for each of the five quintiles in order from the least active to the most active. The rows list many variables that characterize the subjects and could also be confounders. Note that dichotomous variables are listed first and the percent with a given characteristic is listed for each quintile. For example, 28.2% of quintile 1 were current smokers, and this decreased steadily to 17.5% in the most active group (quintile 5). Therefore, smoking will be a potential confounding factor, because it is a risk factor for cardiovascular disease, and it differs among the exposure groups. Other possible confounding factors in table 1 include history of hypertension, history of diabetes, history of hypercholesterolemia (high blood levels of cholesterol), current use of hormone replacement therapy, use of multivitamins, and use of vitamin E supplements.

Continuous variables are listed in the lower half of Table 1, showing the mean value for each quintile of activity. Age is a risk factor for cardiovascular disease, but it is unlikely to cause confounding in this particular study, because the mean age is 52.1-52.3 years in all five quintiles. However, some of the other continuous variables do differ across the exposure groups, e.g., body mass index, alcohol consumption, and dietary cholesterol. Overall, increasing activity seems to be associated with trends in characteristics associated with a healthier lifestyle. If our goal is to understand the independent effect of exercise on risk of heart disease, then one must adjust for as many of these confounding factors as possible in the subsequent analysis. You will learn how to do this later in the course when we discuss confounding more completely.

You learned how to use R to generate descriptive statistics in the introductory module on R, and you have the tools to generate a table like Manson's Table 1 from a data set. The only other tool that you need is how to generate descriptive statistics in subsets of the data, e.g., the quintiles in the study by Manson et al. Methods for sub-setting are presented on the next page.

Analyzing Data in Subsets Using R

The tapply() command.

The tapply() function is useful for applying functions (e.g., descriptive statistics) to subsets of a data set. In effect, this enables you to subset the data by one or more classifying factors and then perform some function (e.g., computing the mean and standard deviation of a given variable) within each subset. Note that tapply() is used for descriptive statistics (e.g., mean, sd, summary) for continuously distributed variables. For categorical variables you should use the table() function to get counts and the prop.table() function to get proportions. The basic structure of the tapply command is:

tapply(<var>,<by.var>,<function>)

where <var> is the variable that you want to analyze, <by.var> is the variable that you want to subset by, and <function> is the function or computation that you want to apply to <var>.

For example, suppose I have a data set with the continuous variables Dubow (Dubowitz score) and Ppregwt (pre-pregnancy weight) and the categorical variable DrugExp (drug exposure, coded 1 for exposed and 0 for unexposed). My goal is to group the data by DrugExp and then compute the mean and standard deviation of the Dubowitz scores and pre-pregnancy weights for each category of DrugExp.

> tapply(Dubow,DrugExp,mean)     # Gives the means of Dubowitz score by drug exposure
> tapply(Dubow,DrugExp,sd)       # Gives the standard deviations of Dubowitz score by drug exposure
> tapply(Ppregwt,DrugExp,mean)   # Gives the means of pre-pregnancy weight by drug exposure
> tapply(Ppregwt,DrugExp,sd)     # Gives the standard deviations of pre-pregnancy weight by drug exposure
> tapply(Birthwt,DrugExp,t.test) # Gives 95% confidence interval for exposed and unexposed in one output
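
For readers who want to try these commands without the course data set, here is a minimal self-contained sketch. The data frame `births`, its values, and the `Smoker` column are made up for illustration; only the variable names Dubow and DrugExp come from the text. It shows tapply() for a continuous variable and table() with prop.table() for a categorical one.

```r
# Hypothetical data: variable names follow the text, values are invented
births <- data.frame(
  DrugExp = c(0, 0, 0, 0, 1, 1, 1, 1),
  Dubow   = c(40, 38, 41, 39, 36, 35, 38, 37),
  Smoker  = c("no", "no", "yes", "no", "yes", "yes", "no", "yes")
)

# Continuous variable: mean and standard deviation within each exposure group
dubow_mean <- tapply(births$Dubow, births$DrugExp, mean)
dubow_sd   <- tapply(births$Dubow, births$DrugExp, sd)

# Categorical variable: counts, then proportions within each exposure group
# (margin = 1 makes each row, i.e., each exposure group, sum to 1)
smoke_counts <- table(births$DrugExp, births$Smoker)
smoke_props  <- prop.table(smoke_counts, margin = 1)

print(dubow_mean)
print(dubow_sd)
print(smoke_props)
```

Referring to columns as births$Dubow avoids the need to attach the data set; with an attached data set the shorter form used in the text works the same way.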

An Alternate Method of Subset Analysis

Getting descriptive statistics by category can also be achieved as follows:

> mean(Birthwt[DrugExp==1]); mean(Birthwt[DrugExp==0]) # means for each exposure group
> sd(Birthwt[DrugExp==1]); sd(Birthwt[DrugExp==0])     # standard deviation for each exposure group
> t.test(Birthwt[DrugExp==1]) # 1-sample t-test to get 95% CI for those exposed to drugs
> t.test(Birthwt[DrugExp==0]) # 1-sample t-test to get 95% CI for those unexposed to drugs

The double equal sign (==) tests for equality, so Birthwt[DrugExp==1] means "the values of Birthwt only for subjects whose DrugExp equals 1".

Creating a Dichotomous Variable from a Continuous Variable

Suppose my data set has a continuously distributed variable called "Birthwt", which is each child's weight in grams at birth, but I wish to create a new variable (lowBW) that categorizes children as having low birth weight, i.e., less than 2500 grams, or not. I can do this using the ifelse() function, which has the following format:

> ifelse(<logical statement>, <if true>, <if false>)

> lowBW <-ifelse(Birthwt<2500,1,0)

If the variable Birthwt is less than 2500, then the new variable lowBW will have a value of 1, meaning "true"; if not, it will have a value of 0, meaning "false". When this command is executed, you should see the new variable show up in the global environment window at the upper right corner of RStudio. Note that lowBW is created as a separate object in the global environment, not as a column of your data set; if you add it to the data set instead, you should reattach the data set so that the new variable will be recognized. If you want the lowBW category to include those whose weight was exactly 2500 grams, then use <= (less than or equal to) as below.

> lowBW <-ifelse(Birthwt<=2500,1,0)
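
The same idea extends to more than two categories, such as the activity quintiles used by Manson et al. The sketch below is one possible approach, not necessarily the one the study authors used: it combines quantile() and cut() to assign each subject to a quintile. The MET values are randomly generated for illustration, and the names `met` and `met_quintile` are hypothetical.

```r
set.seed(1)                              # makes the made-up data reproducible
met <- runif(100, min = 0, max = 40)     # hypothetical MET-hours/week values

# Cut points at the minimum, the 20th, 40th, 60th and 80th percentiles,
# and the maximum
breaks <- quantile(met, probs = seq(0, 1, by = 0.2))

# Assign each subject to a quintile; include.lowest = TRUE keeps the
# smallest value in quintile 1 instead of turning it into NA
met_quintile <- cut(met, breaks = breaks, include.lowest = TRUE, labels = 1:5)

print(table(met_quintile))               # number of subjects in each quintile
print(tapply(met, met_quintile, mean))   # mean MET score within each quintile
```

The resulting factor can be used as the <by.var> argument of tapply() to produce descriptive statistics for each quintile.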

Crude Measures of Association in a Cohort Study (or Intervention Study)

After generating the descriptive statistics for an epidemiologic study, the next step is to generate estimates for the magnitude of association between the primary exposure of interest (e.g., physical activity level in the Manson study) and the primary outcome of interest (e.g., development of cardiovascular disease). As noted above, there may be confounding factors that can distort the estimated measure of association, but one still begins by generating crude measures of association, i.e., estimates that have not yet been adjusted for confounding factors.

The table below shows data from the top portion of Figure 2 from the study by Manson et al.

Table – Relative Risk of Coronary Events According to Quintile Group for Total Physical Activity

 

Quintile Group Based on Physical Activity

Variable                     1          2          3          4          5
MET-hours/week               0-2.0      2.1-4.6    4.7-10.4   10.5-21.7  >21.7
Number of coronary events    178        153        124        101        89
Person-years of follow up    106,252    116,175    112,703    110,886    113,419

Using the data in the table above, a) compute the incidence rate ratio and the incidence rate difference for moderate activity compared to the least active subjects, and b) write an interpretation of your findings. Complete both parts before comparing your answers to those at the link below.
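
As a check on hand calculations, the rates can be computed in R directly from the counts in the table. This is a minimal sketch: the variable names are arbitrary, and because the exercise does not say which quintile counts as "moderate activity," it computes the comparison with quintile 1 for every quintile.

```r
quintile     <- 1:5
events       <- c(178, 153, 124, 101, 89)                   # coronary events
person_years <- c(106252, 116175, 112703, 110886, 113419)   # follow-up time

# Incidence rate in each quintile, expressed per 100,000 person-years
rate_per_100k <- events / person_years * 100000

# Incidence rate ratio and incidence rate difference, using the least
# active group (quintile 1) as the reference
irr          <- rate_per_100k / rate_per_100k[1]
ird_per_100k <- rate_per_100k - rate_per_100k[1]

print(round(data.frame(quintile, rate_per_100k, irr, ird_per_100k), 2))
```

A rate ratio below 1 (or a negative rate difference) indicates a lower incidence rate than in the least active group. These are crude estimates that have not been adjusted for confounding.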

Intervention Studies (Clinical Trials, Experimental Studies)

Intervention studies (clinical trials) are similar to prospective cohort studies in design in that subjects with or without a given exposure are followed over time to compare incidence of the outcome of interest. The key difference is that prospective cohort studies are observational, but in clinical trials the investigators assign subjects to the exposure groups.


While this design is frequently used to evaluate new drugs, it can be used to evaluate the efficacy of:

  • Diets (e.g., primary prevention of cardiovascular disease with a Mediterranean diet)
  • Exercise regimens (e.g., a clinical trial of exercise to alleviate post-partum depression)
  • New programs (e.g., pre-natal care in groups of 8-10 women versus usual one-on-one pre-natal care)
  • New clinical management schemes (e.g., a protocol to reduce post-operative complications)

However, unlike prospective cohort studies in which investigators record exposures that subjects already have, in clinical trials the investigators assign patients to one of the exposure groups being compared. Ideally, this assignment is done with random allocation, meaning that each subject has an equal chance of being assigned to any one of the "exposures."

Ethical Considerations

Investigators assign patients to competing treatments in clinical trials, and this raises the question of whether it is ethical to do this. Certainly, it is not ethical to test all exposures in this fashion. It would be unethical, for example, to conduct a clinical trial on the effects of smoking, particularly since we know that the harm caused by smoking far outweighs any potential benefits, such as relaxation or weight control.

On the other hand, consider a situation in which a new drug has been developed to treat breast cancer. Perhaps it has been found to be effective in cell cultures and in animal models, and perhaps preliminary studies in small groups of human volunteers have shown some evidence of effectiveness with minimal side effects. In other words, there is reason to believe that it might be a beneficial new treatment, but there is also doubt about effectiveness and possible side effects. Testing on a large scale with a comparison group may show that it is not so effective or that its side effects are unacceptable. This is what is referred to as equipoise, i.e., the balance between sufficient belief in its potential benefit and safety that one can justify exposing some subjects to it and sufficient doubt about its benefit and safety that one can justify withholding it from some subjects.


It is unethical to conduct a clinical trial in the absence of equipoise, and if equipoise ceases to exist during the course of a clinical trial, the trial must be discontinued.

Before research on living humans is conducted, a detailed protocol must be submitted to an Institutional Review Board (IRB) for review and approval. This is true not only of clinical trials, but also all other types of human research including case-series, cross-sectional surveys, prospective and retrospective cohort studies, and case-control studies. [For a more detailed overview of the ethical considerations for human research, see our online module on Research Ethics .]

"Human research" is defined as any systematic investigation involving living humans (including research development, testing and evaluation), designed to develop or contribute to generalizable knowledge.

 

Informed Consent

One of the key things that an IRB will consider is whether potential subjects will provide informed consent, which is the process by which study participants consent to be subjects only after being fully informed about, and understanding, all aspects of the research, including the purpose, risks, type of information to be collected, potential benefits, and alternatives to the research. Informed consent should allow people to make a fully informed decision about whether or not to participate in a study based on their own goals and values. Informed consent must be obtained before assignment to a treatment group, and consent can be withdrawn at any time during the study.

Potential participants must be fully informed about:

  • The purpose of the study
  • The treatment options (including alternatives) and the potential outcomes
  • The risks and potential benefit of the study
  • Randomization, i.e., that the treatment they receive is not their choice
  • What will be required of them (questionnaires, visits, samples collected, etc.)
  • Their ability to withdraw from the study at any time without consequence

Types of Intervention Studies

Therapeutic vs. preventive.

Clinical trials in individuals can be classified as either therapeutic or preventive, as in these examples:

Therapeutic Trials: New treatments are tested for their effectiveness in treating disease, e.g.,

  • Does the drug Herceptin improve survival in women already diagnosed with breast cancer?
  • Does treatment with Tamiflu shorten the duration of illness and improve survival in patients with bird flu?

Preventive Trials: Healthy or high-risk individuals are tested to determine whether a treatment prevents disease, e.g.,

  • Does the drug tamoxifen prevent development of breast cancer in women who have a high risk of developing breast cancer?
  • How effective is this year's influenza vaccine in preventing the flu?
  • Does a Mediterranean diet reduce the incidence of cardiovascular disease?

Individual Trials vs. Community Trials

Preventive measures can also be allocated on a community level – so-called community trials. A classic example is the Newburgh-Kingston Caries Fluoride Study which began in 1947. Fluoride was added to the water supply of Newburgh, NY, and the incidence of dental caries in Newburgh was then compared to the incidence in Kingston, NY, which did not receive fluoride. The trial demonstrated that addition of tiny amounts of fluoride to the water supply reduced dental caries by two thirds in children who began drinking fluoridated water within their first two years.

The key difference is that in community trials the treatments being studied are allocated not to individuals, but to entire communities.


Phases of Individual Therapeutic Trials

When most people hear reference to a clinical trial, they think of phase 3 trials in which large numbers of subjects are enrolled and randomly assigned to one of the treatment groups. However, phase 3 trials of new drugs with potentially harmful side effects are preceded by extensive studies in lab animals and by phase 1 and phase 2 trials in human volunteers.

Phase 1 Clinical Trials

If studies in animals suggest efficacy and safety, a phase 1 trial can be conducted in a small group (10-30) of human volunteers over 2-12 months, primarily to test for safety and to identify side effects, but also to get some information on effective dose.

Phase 2 Clinical Trials

Phase 2 clinical trials involve more volunteers than phase 1, and they typically last about two years. They usually involve two or more groups receiving different doses of the new drug in order to establish the therapeutic range of the drug, i.e., doses at which it is effective and has an acceptable level of side effects. If results suggest efficacy and safety, a phase 3 trial will be conducted.

Phase 3 Clinical Trials

Phase 3 trials are similar to prospective cohort studies in their design, except that the exposure of interest is a drug or some other intervention that is randomly assigned to the participants by the investigators. To facilitate this presentation of phase 3 trials we will focus on the first Physicians' Health Study, which began in 1981 in order to test the efficacy of aspirin in primary prevention of myocardial infarction. A second goal of the study was to evaluate the efficacy of beta-carotene in preventing cancer, but this discussion will focus on the aspirin component.

The Physicians' Health Study on Aspirin

As early as the 1950s there were case series and small clinical trials suggesting that aspirin might be beneficial in preventing myocardial infarction (heart attack). However, the reduction in risk appeared to be modest, and the studies were too small to demonstrate a statistically significant benefit. Therefore, investigators at Harvard Medical School sought funding for a large phase 3 clinical trial.

In 1981, after receiving approval from the Institutional Review Board at Harvard Medical School, the investigators mailed invitation letters, consent forms, and enrollment questionnaires to all 261,248 registered male physicians in the US between 40 and 84 years old. (Phase 1 and phase 2 trials were unnecessary, because aspirin was a commonly used drug with known dosage range and known side effects.)


Questionnaires were returned by 112,528 physicians, but only 59,285 of those were willing to participate in the trial. Of those, 26,062 could not participate because they had one or more of the exclusion criteria:

  • past myocardial infarction, stroke, or transient ischemic attack;
  • cancer (except non-melanoma skin cancer);
  • current renal or liver disease;
  • peptic ulcer;
  • contraindication to or current use of either aspirin or beta-carotene.

Informed consent was obtained from the 33,223 who were willing and eligible to participate. Since regular aspirin use has the potential to cause gastritis and bleeding problems, these physicians were enrolled in an 18-week run-in phase, during which all received active aspirin and placebo beta-carotene. Some had unpleasant side effects, others decided not to participate, and some were excused because they didn't take the medications reliably. The remaining 22,071 men were then randomly assigned to one of four treatment groups.


Randomization and Blinding (Masking)

Randomization.


If assignment is truly unpredictable, then there is no bias in assignment, and neither the subjects nor the investigators can influence assignment. In addition, randomization of a large number of subjects tends to result in groups that differ only in treatment and are comparable with respect to all other factors and characteristics that might influence the outcome. As a result, randomization is the best method for eliminating confounding.
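
The tendency of randomization to balance other characteristics can be illustrated with a short simulation in R. This is a sketch of simple (unrestricted) randomization and is not meant to describe the procedure actually used in the Physicians' Health Study. The arm labels are assumptions based on the two agents and their placebos described in the text, and the ages are simulated rather than real.

```r
set.seed(2024)                           # makes the simulation reproducible
n <- 22071                               # number of physicians randomized

# Assumed labels for the four treatment groups
arms <- c("aspirin + beta-carotene", "aspirin + placebo",
          "placebo + beta-carotene", "placebo + placebo")

# Simple randomization: each subject has an equal chance of each arm
assignment <- sample(arms, size = n, replace = TRUE)

# Simulated baseline characteristic (not real study data)
age <- rnorm(n, mean = 53, sd = 9.5)

print(table(assignment))                 # group sizes are similar, not identical
print(round(tapply(age, assignment, mean), 1))   # mean age is similar in every arm
```

Because assignment is unrelated to age (or to any other characteristic, measured or unmeasured), the arms end up with nearly the same age distribution when the number of subjects is large.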

Blinded (or "masked") studies are those in which the subjects, and possibly the investigators as well, are unaware of which treatment the subject is receiving, e.g., active drug or placebo. Blinding is particularly important in drug trials when the study is assessing subjective outcomes, such as relief of pain or anxiety.

It isn't always possible to mask the treatments. For example, subjects randomly assigned to follow either a specific exercise regimen or continue their usual level of activity cannot be blinded.

  • Single-blinded: the subjects are unaware of which group they have been assigned to.
  • Double-blinded: Neither the subjects nor the investigators are aware of the treatment assignment until the end of the trial.

A placebo is an inert substance identical in appearance to the active treatment. Its purpose is to facilitate blinding by making the groups as similar as possible in the perception of treatment and to promote compliance. In the Physicians' Health Study participants were given a blister pack for each month (shown in the image below) that contained white tablets and red capsules that were taken on alternate days. The white tablets contained either 325 mg. of aspirin or an identical-looking inert substance; the red capsules contained either beta-carotene or an inert substance. The use of monthly blister packs also made it easier for participants to keep track of whether they had taken the correct pill each day.


It is not always ethical to use a placebo. If there is already a standard treatment or method of care, it would be unethical to withhold it. A new treatment should be compared to the standard therapy rather than to a placebo.

Example of Placebo Use to Achieve Blinding:

Glucosamine and chondroitin are naturally occurring substances that are structural components of the cartilage that lines our joints. Health food stores began selling supplements to people as a prevention (or treatment) for osteoarthritis despite a lack of evidence of their benefit in humans. Clegg and colleagues conducted a double-blind, randomized clinical trial in 1583 subjects with symptomatic osteoarthritis of the knee. Participants were randomly assigned to one of five treatment arms in order to test the efficacy of glucosamine and chondroitin. The primary outcome was greater than 20% decrease in total score on the WOMAC pain scale from baseline to week 24. Some of their results are shown in the table below.

Treatment                    Pain relief >20%   Minimal effect   Total # subjects
Placebo                      188                125              313
Anti-inflammatory drug       223                95               318
Glucosamine                  203                114              317
Chondroitin                  208                110              318
Glucosamine + Chondroitin    211                106              317

Data from Clegg DO, et al.: Glucosamine, chondroitin sulfate, and the two in combination for painful knee osteoarthritis. N Engl J Med 354:795, 2006.

Perhaps the most remarkable observation is the response in the placebo group, which had a cumulative incidence of >20% pain relief of 60% (188/313 = 0.60)! This is an example of the "placebo effect," in which patients who perceive they are being treated often report subjective improvement, even if the treatment has no effect. Placebos make the perception of treatment similar among groups and provide a reference group that takes the placebo effect into account. Note also that the group treated with glucosamine and chondroitin had only a slightly greater response rate of 67% (211/317 = 0.67).
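The arithmetic behind these percentages can be reproduced for every arm. The following is a minimal sketch that uses only the counts in the table above; the variable and function names are illustrative choices, not part of the original study.

```python
# Counts taken from the table above (Clegg et al., 2006):
# (subjects with >20% pain relief, total subjects) for each treatment arm.
arms = {
    "Placebo": (188, 313),
    "Anti-inflammatory drug": (223, 318),
    "Glucosamine": (203, 317),
    "Chondroitin": (208, 318),
    "Glucosamine + Chondroitin": (211, 317),
}


def cumulative_incidence(responders, total):
    """Proportion of subjects in an arm who met the outcome."""
    return responders / total


placebo_ci = cumulative_incidence(*arms["Placebo"])

summary = {}
for name, (responders, total) in arms.items():
    ci = cumulative_incidence(responders, total)
    summary[name] = {
        "ci": ci,
        # Ratio and difference relative to the placebo arm.
        "ratio_vs_placebo": ci / placebo_ci,
        "difference_vs_placebo": ci - placebo_ci,
    }

for name, s in summary.items():
    print(f"{name:28s} CI = {s['ci']:.1%}  "
          f"ratio = {s['ratio_vs_placebo']:.2f}  "
          f"difference = {s['difference_vs_placebo']:+.1%}")
```

Comparing each arm with placebo, rather than with zero, is what lets the placebo effect be separated from any effect of the active treatment.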

Analysis of Clinical Trial Data

The analysis of clinical trial data is very similar to the previously described analysis of data from a cohort study. The first step is to generate simple descriptive statistics on each of the groups being compared in order to characterize the study population and alert you and your readers to any differences between the groups with respect to other exposures that might cause confounding. If large numbers of subjects have been randomly assigned to the treatment arms, the groups should be comparable. If there are more than minor discrepancies, the investigators need to review the randomization procedures and consider adjusting for confounding by other methods.

The table below shows just a portion of the data from the table of descriptive statistics from the Physicians' Health Study on aspirin.

Characteristic                     Aspirin (n=11,037)   Placebo (n=11,034)
Age (years)                        53.2 ± 9.5           53.2 ± 9.5
Systolic BP (mm Hg)                126.1 ± 11.3         126.1 ± 11.1
Diastolic BP (mm Hg)               78.8 ± 7.4           78.8 ± 7.4
History of hypertension (%)        13.5                 13.6
History of high cholesterol (%)    17.5                 17.3
Cholesterol level                  212.1 ± 44.2         212.0 ± 45.1
History of diabetes (%)            2.3                  2.2

Note that the two groups were remarkably similar on these and other characteristics, indicating that randomization had been successful.

After generating the descriptive statistics, the next step is to generate crude estimates for the magnitude of association between the primary exposure and the outcomes of interest.

Summarize these findings in a contingency table and compute the cumulative incidence in each group, the risk ratio, and the risk difference. Then interpret the risk ratio and the risk difference. Complete all of these tasks before comparing your answers to the ones provided.
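The quantities requested can be written compactly. This is a sketch that assumes the usual 2×2 layout; the letters a, b, c, and d are notation introduced here for illustration (a = treated subjects with the outcome, b = treated without it, c = untreated with the outcome, d = untreated without it).

```latex
\[
CI_{\text{treated}} = \frac{a}{a+b}, \qquad
CI_{\text{untreated}} = \frac{c}{c+d}
\]
\[
RR = \frac{CI_{\text{treated}}}{CI_{\text{untreated}}}, \qquad
RD = CI_{\text{treated}} - CI_{\text{untreated}}
\]
```

A risk ratio below 1 (or a negative risk difference) indicates a lower cumulative incidence in the treated group; a risk ratio of 1 (or a risk difference of 0) is the null value.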

Strengths and Limitations of Clinical Trials

Large randomized clinical trials can provide strong evidence of the true effect of a treatment or intervention, because they provide excellent control of confounding, but they also have some limitations:

 

Non-adherence

Ideally, the investigators want to compare exposed subjects to non-exposed subjects in groups that are similar with respect to confounding factors. The true benefit of a new drug will be underestimated if subjects given the active medication fail to take it, because subjects who were not actually exposed are then mixed in with the exposed subjects who were taking the medication. This mixing of the exposure groups dilutes the apparent benefit and leads to underestimates of association. The same thing occurs if people in the placebo group begin taking the active medication. Both happened in the Physicians' Health Study: follow-up questionnaires estimated that about 15% of the subjects assigned to the aspirin group did not take it, and a similar proportion of subjects in the placebo group used aspirin fairly regularly. Because the exposure was preventive, with an observed risk ratio = 0.59, the true risk ratio would have been even smaller. In other words, non-adherence caused a "bias toward the null," an underestimate of the true benefit.

Non-compliance can occur due to side effects of the treatment, illness, or loss of interest in the study.
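The dilution described above can be illustrated with a small calculation. This is a simplified sketch, not an analysis of the Physicians' Health Study: the baseline risk and the true risk ratio are hypothetical values, and it assumes that non-adherence is unrelated to a subject's underlying risk.

```python
def observed_risk_ratio(risk_untreated, true_rr,
                        nonadherent_treated, crossover_placebo):
    """Risk ratio by assigned group when some subjects switch exposure.

    risk_untreated      -- outcome risk without the drug (hypothetical)
    true_rr             -- risk ratio among subjects who really take the drug
    nonadherent_treated -- fraction of the treatment arm not taking the drug
    crossover_placebo   -- fraction of the placebo arm taking the drug anyway
    """
    risk_treated = risk_untreated * true_rr
    # Each arm's risk is a weighted mix of truly exposed and unexposed subjects.
    arm_treated = ((1 - nonadherent_treated) * risk_treated
                   + nonadherent_treated * risk_untreated)
    arm_placebo = ((1 - crossover_placebo) * risk_untreated
                   + crossover_placebo * risk_treated)
    return arm_treated / arm_placebo


# Hypothetical inputs: 2% baseline risk, true risk ratio of 0.50,
# and 15% of each arm switching exposure.
true_rr = 0.50
observed = observed_risk_ratio(0.02, true_rr, 0.15, 0.15)
print(f"true RR = {true_rr:.2f}, observed RR = {observed:.2f}")
```

With these inputs the observed risk ratio lies between the true value and 1.0, which is the bias toward the null described in the text.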

How to Promote Adherence in a Clinical Trial

  • Begin with an interested group of participants
  • Present a realistic picture of the protocol during informed consent
  • Exclude participants with pre-existing conditions that make compliance difficult
  • Simplify the protocol as much as possible
  • Conduct a run-in period if necessary
  • Use blinding and placebos
  • Maintain frequent contact with subjects WITHOUT interfering with treatment
  • Provide incentives (free check-ups, transportation, t-shirts, birthday cards)

Data Safety and Monitoring Board

All clinical trials that involve more than minimal risk are required to have a Data Safety and Monitoring Board (DSMB), which is an independent board of experts not involved in the study who periodically review the data in a trial to evaluate safety, study conduct, and interim results. They can recommend that the study be continued, modified, or terminated. The DSMB for the Physicians' Health Study recommended that the study be terminated after five years because the benefits of taking low-dose aspirin were so clear that continuing to withhold aspirin from the placebo group was not ethically justified. The DSMB felt that equipoise no longer existed.

Intention-to-Treat Analysis versus Efficacy Analysis

The greatest advantage of large randomized clinical trials is that they provide control of confounding. However, as already noted there can be problems due to loss to follow up and lack of adherence to the protocol. It might be tempting to limit the analysis to subjects who completed the study and who adhered to the study protocol, but this efficacy analysis may not provide strong control of confounding, because subjects have, in essence, self-selected whether they would remain in the study and adhere to the protocol. For this reason, well-done clinical trials will conduct and report the results of an intention-to-treat analysis in which subjects are included in the analysis in the groups to which they were randomly assigned regardless of whether they adhered to the protocol. We already noted that non-adherence will bias the results toward the null, i.e., underestimate the association if there is one. However, the intention-to-treat analysis provides the best opportunity to examine the association in the absence of confounding. Many reports will provide the results of the intention-to-treat analysis and the efficacy analysis as well, and they may also analyze sub-groups of subjects, but these analyses need to use other methods to minimize the effects of confounding.
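The difference between the two analyses comes down to which subjects are counted in each group. The sketch below uses entirely hypothetical records (the `Subject` fields and the counts are assumptions made for illustration) to show the grouping rule for each approach.

```python
from collections import namedtuple

# assigned: arm chosen by randomization; adhered: followed the protocol;
# event: whether the outcome occurred.
Subject = namedtuple("Subject", "assigned adhered event")


def make(assigned, adhered, n, events):
    """Build n hypothetical subjects, the first `events` of whom had the outcome."""
    return [Subject(assigned, adhered, i < events) for i in range(n)]


# Hypothetical trial: 10 subjects per arm, 2 non-adherent in each.
subjects = (make("treatment", True, 8, 1) + make("treatment", False, 2, 1)
            + make("placebo", True, 8, 3) + make("placebo", False, 2, 1))


def risk(group):
    return sum(s.event for s in group) / len(group)


def risk_ratio(subjects, adherent_only=False):
    """Intention-to-treat by default; restrict to adherent subjects if asked."""
    if adherent_only:
        subjects = [s for s in subjects if s.adhered]
    treated = [s for s in subjects if s.assigned == "treatment"]
    control = [s for s in subjects if s.assigned == "placebo"]
    return risk(treated) / risk(control)


itt_rr = risk_ratio(subjects)                            # everyone, as randomized
efficacy_rr = risk_ratio(subjects, adherent_only=True)   # adherent subjects only
print(f"intention-to-treat RR = {itt_rr:.2f}, efficacy RR = {efficacy_rr:.2f}")
```

The efficacy analysis drops subjects based on behavior after randomization, so the groups it compares are no longer guaranteed to be balanced on confounders; the intention-to-treat analysis keeps the randomized groups intact.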

Volume 12, Number 7—July 2006

Books and Media

Gastroenteritis at a University in Texas: An Epidemiologic Case Study

This CD-ROM is an important addition to case exercises in field epidemiology that serve to educate when actual participation in a field investigation is not possible or practical. The authors have prepared a case exercise based on an actual field investigation with real data that have been put together in a meaningful and effective way. The use of an epidemic of gastroenteritis is a cutting-edge element, since foodborne disease is a major public health problem today. The outbreak occurs on a college campus, which lends an air of verisimilitude, and the causative agent, norovirus, is a genuine public health threat.

This reviewer had a number of specific editorial recommendations for the authors that could enhance forthcoming versions. These suggestions included inserting a case definition in the investigation outline; adding the role of the state laboratory; consistently labeling outbreak, epidemic, and epidemic curve throughout the program; clarifying the rationale for limiting the outbreak to the university; further refining methods for the study and controls; using 2×2 tables to illustrate epidemiologic ratios; and expanding the employee training plan.

Overall, these types of training aids are needed as we attempt to further expose public health workers to field investigations so that they can conduct investigations effectively. The reference to additional educational material throughout the steps is a well-conceived and appropriate aspect of the investigation. The narrative information, questions, and explanations are appropriate and flow smoothly.

DOI: 10.3201/eid1207.060457


Address for correspondence:

Philip S. Brachman, Rollins School of Public Health, Grace Crum Rollins Building, 1518 Clifton Rd, Atlanta, GA 30322, USA

Suggested citation: Brachman PS. Gastroenteritis at a University in Texas: An Epidemiologic Case Study. Emerg Infect Dis. 2006;12(7):1180. https://doi.org/10.3201/eid1207.060457

Case-Crossover Study Design

This page briefly describes case-crossover designs as an approach to investigating acute triggers that are potentially causing disease. An annotated resource list is provided.

Description

A “trigger” can be thought of as the final step in leading from pathophysiology to disease, or the final component cause leading a susceptible person to experience a disease. Triggers thus may be important for our understanding of etiology. In addition, a new understanding of disease triggers can help us to prevent disease through trigger reduction, reduction of baseline risk, or a targeted intervention to reduce risk at a time when disease is more likely to occur.

For a potential hypothesis about a trigger to be tested using a case-crossover design, we would look for the following defining characteristics:

  • short-term changes in exposure
  • transient changes in disease risk
  • acute-onset disease

Related study designs

Study designs used to examine exposure-outcome associations include cohort and case-control studies. Whereas cohort studies can be limited in power for rare disease outcomes, and case-control studies can be biased due to retrospective exposure assessment, case-crossover designs compare individuals to themselves at different times. This parallels the randomized crossover trial approach, which compares individuals to themselves as they go on and off treatment.

Something that the case-crossover design has in common with a case-control approach is the need to find representative controls. However, while case-control designs select control individuals, case-crossover designs select control time windows. This brings our focus to the plausible temporal relationship between exposure to a trigger and disease onset.

Other time-focused designs include ecological time-series and interrupted time-series analyses. Disease counts over time can be modeled in a generalized linear modeling framework, often using Poisson regression. Because Poisson models use a log link, the beta coefficients describe rate comparisons on a relative scale: exponentiating a coefficient gives a rate ratio.
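The role of the log link can be made explicit with a one-exposure model. The notation here is an assumption introduced for illustration: Y_t is the disease count at time t and x_t is the exposure level at time t.

```latex
\[
\log \mathbb{E}[Y_t] = \beta_0 + \beta_1 x_t
\]
\[
\frac{\mathbb{E}[Y_t \mid x_t = x + 1]}{\mathbb{E}[Y_t \mid x_t = x]}
  = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}}
  = e^{\beta_1}
\]
```

Under this model, each one-unit increase in exposure multiplies the expected count by e^{β1}, which is why the coefficient is read as a comparison on a relative (ratio) scale rather than an absolute one.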

Setting up a case-crossover analysis

A key decision in setting up a case-crossover study is the length of time before disease onset during which exposure would be compatible with triggering. That is your “case” window. For example, if you think physical activity could trigger a myocardial infarction in the subsequent 2 hours, you could identify a case window starting 2 hours before symptom onset and ending at the time of symptom onset. Sensitivity analyses that alter the length of this time window might be planned.

Selection of one or more control windows is then designed to identify whether exposure during the case window was atypical. By comparing exposures over time within the same person, you automatically condition on all stable characteristics of the individual. You might also want to match on potential time-varying confounders such as time of day. Usually, case and control windows are the same length. It may be efficient to select multiple control time windows for every case time window. Control windows should reflect the exposure distribution while at risk for the outcome, should be close enough in time that the baseline risk is similar, and should be far apart enough in time so that exposures are uncorrelated.
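One way to construct such windows is sketched below. It assumes a 2-hour case window and control windows at the same clock time on earlier days (which matches on time of day); the onset time, the choice of days, and the function names are all hypothetical.

```python
from datetime import datetime, timedelta


def case_window(onset, length):
    """Window of the given length ending at symptom onset."""
    return (onset - length, onset)


def control_windows(onset, length, days_back):
    """Windows of the same length, at the same clock time on earlier days."""
    windows = []
    for d in days_back:
        end = onset - timedelta(days=d)
        windows.append((end - length, end))
    return windows


# Hypothetical symptom onset and a 2-hour triggering window.
onset = datetime(2024, 3, 14, 9, 30)
length = timedelta(hours=2)

case = case_window(onset, length)
controls = control_windows(onset, length, days_back=[1, 2, 3])

print("case window:   ", case[0], "to", case[1])
for start, end in controls:
    print("control window:", start, "to", end)
```

Rerunning the same code with a different `length` is one simple way to carry out the sensitivity analysis on window length mentioned above. How far back to place the control windows is a design decision that trades similarity of baseline risk against correlation between exposures.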

Once you have constructed your case and control time windows, compare the probability of exposure during case and control periods. This is usually done using conditional logistic regression (similar to a matched case-control study).
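In the simplest situation, with one control window per case window and a single binary exposure, the conditional maximum-likelihood odds ratio reduces to the ratio of discordant pairs, as in a 1:1 matched case-control study. The sketch below illustrates that special case with hypothetical counts; it is not a substitute for conditional logistic regression when there are several control windows or covariates.

```python
def matched_pair_odds_ratio(pairs):
    """Odds ratio from 1:1 matched case and control windows.

    pairs -- one (exposed_in_case_window, exposed_in_control_window)
             tuple of booleans per person.
    """
    pairs = list(pairs)
    # Only discordant pairs carry information about the association.
    case_only = sum(1 for case, control in pairs if case and not control)
    control_only = sum(1 for case, control in pairs if control and not case)
    if control_only == 0:
        raise ValueError("no pairs exposed only in the control window")
    return case_only / control_only


# Hypothetical data: 12 people exposed only in the case window, 4 only in
# the control window, 3 in both windows, and 21 in neither.
pairs = ([(True, False)] * 12 + [(False, True)] * 4
         + [(True, True)] * 3 + [(False, False)] * 21)

odds_ratio = matched_pair_odds_ratio(pairs)
print(f"matched-pair odds ratio = {odds_ratio:.1f}")
```

People whose exposure was the same in both windows drop out of the estimate, which reflects the fact that each person serves as his or her own control.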

An opportunity to consider when working with a case-crossover design is that although you do not typically need to control for many individual characteristics, you can evaluate effect modification by individual characteristics. For example, while looking at physical activity and myocardial infarction, you might hypothesize that triggering would be most likely to occur for individuals with hypertension.

Methodological Articles

Maclure M. The case-crossover design: a method for studying transient effects on the risk of acute events. Am J Epidemiol 1991; 133(2):144-53.

Maclure M, Mittleman MA. Should we use a case-crossover design? Annu Rev Public Health 2000;21:193-221.

Lu Y, Zeger SL. On the equivalence of case-crossover and time series methods in environmental epidemiology. Biostatistics 2007;8(2):337-344.

Lu Y, Symons JM, Geyh AS, Zeger SL. An approach to checking case-crossover analyses based on equivalence with time-series methods. Epidemiology 2008; 19(2):169-75.

Maclure M, Mittleman MA. Case-crossover designs compared with dynamic follow-up designs. Epidemiology 2008; 19(2):176-8.

Janes H, Sheppard L, Lumley T. Case-crossover analyses of air pollution exposure data: referent selection strategies and their implications for bias. Epidemiology 2005; 16(6):717-26.

Application Articles

Basu R, Dominici F, Samet JM. Temperature and mortality among the elderly in the United States: a comparison of epidemiologic methods. Epidemiology 2005; 16(1):58-66.

Hebert C, Delaney JA, Hemmelgarn B, Levesque LE, Suissa S. Benzodiazepines and elderly drivers: a comparison of pharmacoepidemiological study designs. Pharmacoepidemiol Drug Saf 2007; 16:845-849.

Front Public Health

Designing an Interactive Field Epidemiology Case Study Training for Public Health Practitioners

Globally, public health practitioners are called upon to respond quickly and capably to mitigate a variety of immediate and incipient threats to the health of their communities, which often requires additional training in new or updated methodologies or epidemiologic phenomena. Competing public health priorities and limited training resources can present challenges in developed and developing countries alike. Training provided to front-line public health workers by ministries of health, donors and/or partner organizations should be delivered in a way that is effective, adaptable to local conditions and culture, and should be an experience perceived as a job benefit. In this review, we share methods for interactive case-study training methodologies, including the use of problem-based scenarios, role-play activities, and other small-group focused efforts that encourage the learner to discuss and synthesize the concepts taught. We have fine-tuned these methods through years of carrying out training of all levels of public health practitioners in dozens of countries worldwide.

Background and rationale

In a rapidly changing field marked by frequent turnover, the need to provide continuing education for public health practitioners is constant. Often there is no requirement for prior credentialing in public health for local-level jobs 1 , 2 . Alternatively, training and professional development may be clinically-oriented as opposed to the more relevant population based public health focus ( 1 ). In many parts of the world public health training programs targeting the health professions are minimal ( 2 ). For these reasons, in the US and elsewhere, public health professions tend to be high-responsibility, low-pay jobs, marked by high turnover rates ( 3 ).

In many low- and middle-income countries, keeping educated health professionals in-country is challenging ( 4 ). Public health professionals are discouraged by lack of proper compensation and insufficient opportunities for continuing professional development and education ( 5 ). Results from a 2006 study conducted for the Ugandan Ministry of Health investigating health workforce morale, satisfaction, motivation, and intent to remain in Uganda showed that health workers were dissatisfied with their jobs. About one in four noted a desire to leave the country to improve their outlook ( 6 ). Further complicating matters, those who have learned occupational skills through on-the-job experience are vulnerable to being laid off with changes of government administration ( 7 ).

Our experience carrying out training activities over the last two decades has provided the opportunity to train public health practitioners in-person, online, and via self-directed learning. We have designed and conducted trainings for front-line local responders as well as district and national-level officials both in the US and abroad. For example, in the US, we have trained county and state-level public health workers in surveillance and outbreak investigation through in-person training courses, Web conferencing platforms, and both synchronous and asynchronous distance-based methods. We have trained national-level officials and laboratorians through specialized in-person trainings as well as in large conference-based and train-the-trainer style settings. Examples include organizing and developing content for a multi-agency tabletop exercise focused on bioterrorism response post-9/11, developing a curriculum focused on rapid response to avian and pandemic influenza, strengthening workforce capacity in sentinel site surveillance for respiratory illness, and, with the US Centers for Disease Control and Prevention (CDC), developing a full Master's-degree-level curriculum for the Central American Field Epidemiology Training Program (FETP), as well as front-line public health practitioner trainings at Intermediate and Basic levels within Central America and topic-specific FETP modules elsewhere. These efforts were supported by CDC, the National Council of State and Territorial Epidemiologists (CSTE), the World Health Organization (WHO), the Pan-American Health Organization (PAHO), and others.

Whether delivered to a non-epidemiologist who finds themselves responsible for outbreak investigation, or to a national epidemiology official who will subsequently oversee training of their staff, the case study or interactive exercise typically serves as a pivotal feature of training for public health professionals, moving them toward actively engaging in learning, knowledge sharing and in the application of previously learned information ( 8 ). We describe key elements in the development and use of case studies as a learning tool for public health practitioners, based on our years of experience in developing, implementing, and evaluating public health trainings. The learning objectives of this article are the following:

  • Identify critical teaching points around which to structure a case study training.
  • Consider relevant sources of information to supply the content and “plot” of a case study.
  • Choose a case study format appropriate for the trainees and content.
  • Incorporate design elements that encourage interaction and thoughtful discussion among learners into a case study.

Pedagogical framework

Case-study based learning approaches have proven particularly effective in the medical education context. Comparison studies conducted in the medical sciences have found interactive, case-study focused learning to be a more effective teaching tool than didactic lectures (9, 10). Although the literature focuses on medical education, the success of case studies in public health education is also reflected in the growing number of public health programs with problem-based curricula (11, 12). Our goal has been to create effective training materials for public health professionals that present relevant, interactive, and interesting content that teaches job skills (not merely general principles) in a way that can be quickly understood and applied, while providing an opportunity for peers of different experience levels to learn from one another.

Curriculum development is a multi-layered process involving instructional designers, subject matter experts, and editorial and art personnel ( 13 ). Training format is highly dependent upon the needs of the involved partner(s). Information-driven training presented in the didactic style only can be less effective for information retention and learner motivation. Learning methods allowing the participant to contribute meaningfully from their own experience, develop skills in a supervised setting, and practice skills immediately pursuant to learning them, will result in more retention and ability to apply the skills taught. Learners who are given ample opportunities to analyze past experience through reflection, evaluation and reconstruction fulfill a key element of experience-based learning and are able to draw deeper meaning from prior events in their personal and professional lives ( 14 ). As such, interactive and problem-based approaches to building applied public health skills are critical features of many public health trainings. Interactive and problem-based approaches engender openness toward new experiences, a critical element in facilitating lifelong learning. At the social or group level, these approaches help emphasize critical social action and the importance of adopting a stance shaped by moral and socio-political responsibility ( 14 ).

Dissemination of information or didactic methods may be used as a means of communicating concepts, methods, or the use of new technologies, but should be accompanied by interactive teaching methods. Interactive teaching can occur in a variety of settings, which may be dependent on program, resources, and learner characteristics. The training audience is professional public health practitioners, who may or may not be formally educated in public health.

The size of the training has varied vastly, from fewer than 12 learners to over 100. With larger groups, we often have a lead instructor who is experienced both in the subject area and in carrying out trainings who is accompanied by “facilitators” or small-group leaders who can provide guidance during case-study work in break-out sessions.

Although we have designed trainings in many different formats, the most successful setting for a case study in our experience has been one where learners come together in-person for a scheduled period of instruction and worked activities, ranging in time from half-day trainings to 2-week courses.

The success of a particular training methodology is difficult to assess. For the several dozen trainings we have carried out, we include a participant evaluation that captures information such as whether the participant found the training useful, whether they feel able to use the information learned in their job, how satisfied they were with the training, and open comments. Informal interaction with trainees and feedback from these surveys over the years have shaped the case study format presented here. The success of a public health training should be measurable by participants' ability to perform specific job functions and, more broadly, by improvements in components of the public health system such as timeliness and completeness of communicable disease reporting and surveillance.

Blueprints for an interactive curriculum

There are several considerations when planning and developing curriculum materials. First, the needs of the intended target audience (i.e., the learners) should form the core of any development process. The curriculum should be designed both to meet training and skill-building needs and to approach the target audience in a comfortable and accessible way; it should meet them at their current level; and the materials should address the fact that, trained or not, professionals who have been working in any field have experience to contribute that adds to the depth of a training. Second, training materials should be developed in a way that ensures consistency in the delivery. Thus, lectures should include a speaker script and built-in questions for the audience, with suggestions to both engage learners and continually check knowledge acquisition. Having the script provides particular flexibility for train-the-trainer programs or other circumstances where the person with the most content and training expertise may not be the one to personally deliver the content to all trainees. Case studies should be developed in a standard format that includes an instructor guide complete with suggested answers, worked calculations, and suggested instructional methods for emphasizing critical teaching points. These methods ensure that critical content is taught regardless of the depth of experience of the teacher or facilitator. Finally, training materials should engage learners in discussion and healthy debate, let them contribute to their own learning, and provide peer-to-peer learning opportunities. We discuss several types of interactive exercises which are often collectively referred to as “case studies,” but which may feature different elements (Table 1).

Table 1. Terminology, definitions, and examples of interactive training activities often referred to as “case studies.”

Exercise
  Definition: A short Q&A (~5 min) between instructor and the class that is embedded in didactic lectures or presentations.
  Example: In a biostatistics lecture about probability distributions, the instructor asks every student to take out a coin, flip it 10 times and note how many heads and tails they obtained, and then report the number of heads. The instructor graphs these results on a histogram at the front of the class to demonstrate binomial distributions.

Activity
  Definition: A project accomplished in small groups where students answer questions and develop and apply concepts that were taught in a didactic lecture. These are often scenario-based but may take other formats.
  Examples:
    • Analyze a dataset from an epidemiologic investigation
    • Critique an abstract or manuscript
    • Develop public health surveillance priorities
    • Create a data collection instrument to be used for a rapid needs assessment

Tabletop
  Definition: A scenario-based exercise that facilitates inter-agency collaboration, often in the context of planning for state or national emergencies such as disasters, pandemics, or bioterrorism. May or may not be accompanied by a didactic component.
  Example: Professionals from public health, agriculture, wildlife, the judiciary, customs, transportation, and/or law enforcement gather at the table to play their respective roles in a scenario where illegal importation of chickens leads to an outbreak of a new, highly pathogenic strain of influenza A on a domestic poultry farm.

Case study
  Definition: Any form of scenario-based work in which information is gradually released to the learner as they progress through the scenario. Learners are directed to apply core training concepts as they move along. May be designed for individuals, small groups, or breakout groups from a larger classroom setting.
  Example: Learners are given background information and must assess whether to investigate an outbreak. Subsequently they progress through the steps of an outbreak investigation as they are given “updates” on how many cases are occurring and clues they find through their investigation.
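The coin-flip exercise described in Table 1 can also be simulated when a class is too small to produce a clear histogram. This is an illustrative sketch only; the class size, the random seed, and the function name are assumptions.

```python
import random
from collections import Counter


def simulate_class(n_students, flips_per_student=10, seed=0):
    """Number of heads each student gets from flipping a fair coin."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [
        sum(rng.random() < 0.5 for _ in range(flips_per_student))
        for _ in range(n_students)
    ]


# Hypothetical class of 30 students, 10 flips each.
heads = simulate_class(n_students=30)
tally = Counter(heads)

# Text histogram of the number of heads reported by each student.
for k in range(11):
    print(f"{k:2d} heads | {'#' * tally[k]}")
```

Increasing `n_students` shows learners how the shape of the histogram settles toward the binomial distribution as more results are pooled.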

We propose the following general objectives of a group case study, regardless of format or topic area:

  • To meet defined learning objectives or competencies.
  • To learn in an environment that respects the skills and abilities of the learners, and allows each learner to use and apply their unique skills.
  • To have learners interact with each other, thus enhancing their own learning through listening to the ideas and experience of fellow learners.
  • To exhibit genuine interest in and be motivated by the content they are learning.

We have found that there are four critical conceptual elements in developing and executing an interactive training:

  • Reference key points or concepts to be learned and applied.
  • Guide learners to work through a problem.
  • Keep learners engaged through active questioning.
  • Draw on learners' experience.

Each of these is described in detail below, with a measles outbreak in a small African village used as a guide for demonstrating their implementation through a case study exercise.

Reference key points or concepts to be learned and applied

The main intent of a case study is to teach predefined content. Often a case study is preceded by didactic content or other information-driven training such as web-based tutorials, workbooks, or reading material. Whether or not information-driven training is included, the key concepts to be learned and applied should be viewed as the framework upon which the rest of the case study will be built. For example, a lecture or reading may put forth a set of 9 steps to investigate an outbreak. The case study format will then be dictated by achieving each of these 9 steps. It is of benefit to clearly and obviously delineate any key steps or phases as one works through the case study. The beginning learner gains more benefit from understanding a firm case study structure than from trying to figure out which step or concept should be applied in which situations, even though events may not occur so neatly on the job. When teaching a set of concepts, it is better to teach them clearly and simply than to allow the learner to become frustrated, struggle, and possibly fumble. More advanced learners can be given the opportunity to struggle with decision-making and unexpected sequences of events after they have mastered a clear set of tools from which to draw. A few examples of clearly structured concepts that lend themselves to case studies are given in Table 2, and a scenario using these types of structures within a case study is given in Table 3.

Table 2. Examples of concepts that can be given a clear sequence or structure for teaching purposes.

• Steps to conduct an outbreak investigation
• Designing a study to investigate a health problem
• Process for evaluating a surveillance system
• Methods for communicating with the public and the media
• Steps of writing an analysis plan
• Methods to assess data for confounding and effect measure modification
• Strategy for writing and delivering an oral presentation at a scientific conference

Table 3. Scenario-based examples of a measles outbreak in a small village case study.

(A) Give clear learning objectives to provide a blueprint for the case study
After completing this case study, the participant should be able to:
• List the early steps of an outbreak investigation
• Describe several clinical diseases that have a similar presentation and how to differentiate them
• Create a line list for an outbreak investigation
• Calculate percentages for different symptoms of disease
• Identify factors that can influence a community's willingness to participate in public health programs
(B) Provide information for the scenario
It is Thursday, September 15th. You are an epidemiologist working in the Ministry of Health (MoH). As you are nearing the end of your workday, you receive a phone call. The caller is part of a vaccination team that has just arrived in “Gwema” Village as part of an effort to vaccinate the children of the prefecture. This prefecture has been experiencing rising numbers of preventable infectious diseases, and the vaccination campaign is considered a vital step in slowing this trend.
Apparently, however, residents heard that in another prefecture, vaccination campaigns were followed by cases of Ebola. Thus the vaccination team is being met with a certain degree of suspicion.
On top of this, the small Gwema primary school reportedly has 6 children out of 81 total children who have become ill with a sudden fever. A MoH staff member, Dr. Miriama Daillo, is currently close to Town A and is dispatched to investigate the cases of fever. You are responsible for maintaining communications with Dr. Daillo and offering her advice and guidance by phone.
(C) Active questioning with answers for the instructor
What are the first steps you would advise Dr. Daillo to take to determine whether an outbreak is really occurring?

1. Establish the existence of an outbreak, and
2. Verify the diagnosis. A good start is to interview cases: Determine whether all cases have the same illness; Determine whether the illness is unusual for the time, place, and population group.
(D) Draw on personal and professional experience
In your own role (health department/job/experience), have you ever had to initiate a vaccination campaign for measles or another vaccine-preventable illness? If so, who did you work with? What steps did you take? What challenges did you face? Was it successful? Why or why not? What would you do differently in a similar situation next time?
Given the information that you have received about this measles outbreak in Gwema Village, do you think it is too late to implement control measures such as a vaccination campaign?
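
Two of the learning objectives in part (A), creating a line list and calculating symptom percentages, can be illustrated with a short sketch. The line list below is entirely hypothetical (the scenario gives only the counts of 6 ill children out of 81), and the field names and helper functions are assumptions made for illustration.

```python
# Hypothetical line list for the six ill schoolchildren.
# All values are invented for illustration; only "6 of 81" comes from the scenario.
line_list = [
    {"case_id": 1, "age_years": 6, "fever": True, "rash": True, "cough": True},
    {"case_id": 2, "age_years": 7, "fever": True, "rash": True, "cough": False},
    {"case_id": 3, "age_years": 5, "fever": True, "rash": False, "cough": True},
    {"case_id": 4, "age_years": 8, "fever": True, "rash": True, "cough": False},
    {"case_id": 5, "age_years": 6, "fever": True, "rash": False, "cough": True},
    {"case_id": 6, "age_years": 9, "fever": True, "rash": True, "cough": False},
]


def symptom_percentages(cases, symptoms):
    """Return the percentage of cases reporting each listed symptom."""
    total = len(cases)
    return {s: 100 * sum(1 for c in cases if c[s]) / total for s in symptoms}


def attack_rate_percent(ill, at_risk):
    """Proportion of the at-risk group that became ill, as a percentage."""
    return 100 * ill / at_risk


percentages = symptom_percentages(line_list, ["fever", "rash", "cough"])
school_attack_rate = attack_rate_percent(ill=6, at_risk=81)

for symptom, pct in percentages.items():
    print(f"{symptom}: {pct:.1f}% of cases")
print(f"School attack rate: {school_attack_rate:.1f}%")
```

In a classroom setting the same calculation would usually be done by hand or in a spreadsheet; the point is that each row of the line list is one case and each percentage uses the number of cases as its denominator.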

Regardless of topic, a simple introduction for the participant and the instructor as to the point and process of the case study is required. Begin by providing learning objectives (as in Table 3A) and a short introduction to the case study, including if, and how, learners should interact with others, the resources they should use for case study completion, the product that is expected, and the estimated time required to complete the tasks given.

Guide learners to work through a problem

A hallmark feature of a case study is working through a problem that parallels or is based on a real-life situation. An appealing format is to present the learner with introductory or background information and update information as the learner works through the problems of the case study.

Basing information and scenarios on real events is the best way to prevent the scenario from feeling contrived. Possible sources of information for scenarios include the experience, internal reports, or data of the curriculum developer or colleagues; published literature; local, national, or international bulletins; health situation updates provided by State health departments, CDC or WHO; and less formal sources such as news reports and ProMed, the Program for Monitoring Emerging Diseases. In some cases, the details of an actual outbreak, surveillance system, or other event can be used directly. More typically, an actual scenario is used as the starting point, and the details are embellished, drawn out, or supplemented by different events in order to meet the learning purposes of the case study. For example, in a scenario where the learner is in the role of someone getting ready to plan a case-control study, the setting from a ProMed report or international bulletin might be ideal, while a case-control study published in the literature provides the topical details needed to expand upon the case study scenario. Unless teaching about a specific real event is a learning objective, a case study developer should feel free to modify details and take creative license in developing characters for the case study events. The example case study used in Table 3 of this article was inspired by a health news article about difficulties facing vaccination campaigns in Guinea (footnote 3), and was structured on WHO recommendations for measles surveillance and case definitions (15). Resource material used to develop the case study should be appropriately cited.

Once the learning objectives, introduction, and plot for the scenario are outlined, the learner or the group is positioned within the scenario through assignment of specific roles or responsibilities, for example:

  • “You are the health director for Mlima Province, and you receive a phone call from the local hospital epidemiologist who is concerned about….”
  • “You are a surveillance officer with the Forêt Health district, and you have just been placed in charge of developing a new surveillance system for…”

The plot of the case study can then be developed as questions are posed and the learner makes plans, decisions, calculations, and more. An example presentation of the scenario based on the previously established measles scenario is given in Table 3B.

Keep learners engaged through active questioning

A key to a successful case study is placing learners in an active role, rather than asking them to reproduce information-driven training concepts in the context of the scenario. Although a case study is designed to be a classroom tool, in field epidemiology the case study should strive to put the learner in the field. For data-centered trainings, this can be accomplished by instructing the learner to carry out data processing and analysis. For example, a learner may be provided with a line listing and asked to construct an epidemic curve, or they may be provided a two-by-two table and asked to calculate relative risk. The effort to place the learner in an active role is particularly important for more routine types of questions that aim to get the learner to repeat concepts (as with an information-driven training). Some examples of inactive questions and how they can be modified to be active are shown in Table 4.
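
The two data tasks named above, building an epidemic curve from a line listing and calculating relative risk from a two-by-two table, might look like the following minimal sketch. The onset dates and cell counts are invented for illustration, and the function names are assumptions rather than part of any published case study.

```python
from collections import Counter
from datetime import date, timedelta


def relative_risk(a, b, c, d):
    """Relative risk from a two-by-two table.

    a: exposed and ill      b: exposed and not ill
    c: unexposed and ill    d: unexposed and not ill
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def epidemic_curve(onset_dates):
    """Count cases per day of onset, including days with zero cases."""
    counts = Counter(onset_dates)
    day, last = min(counts), max(counts)
    curve = []
    while day <= last:
        curve.append((day, counts[day]))  # Counter returns 0 for missing days
        day += timedelta(days=1)
    return curve


# Hypothetical two-by-two table: 20 of 40 exposed ill, 10 of 40 unexposed ill.
rr = relative_risk(a=20, b=20, c=10, d=30)
print(f"Relative risk: {rr:.2f}")

# Hypothetical onset dates taken from a line listing.
onsets = [date(2016, 9, 12), date(2016, 9, 13), date(2016, 9, 13), date(2016, 9, 15)]
for day, n in epidemic_curve(onsets):
    print(day.isoformat(), "#" * n)
```

Printing one "#" per case gives a rough text histogram, which is enough for learners to see the shape of the curve before drawing it properly.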

Table 4. Comparison of scenarios with inactive vs. active questioning methods.

Public health communications
• Inactive: How should complicated risk information be communicated to the media?
• Active: When responding to the media, how will you ensure that accurate public health information is communicated in a way that describes the risk without inciting fear or panic?

Personal safety
• Inactive: What personal protective equipment should be worn when interviewing a subject with suspected or confirmed influenza A (H5N1)?
• Active: What personal protective equipment will you wear when you interview the subject with suspected influenza A (H5N1)?

Public health surveillance
• Inactive: Should non-traditional sources of surveillance data be included in a new influenza sentinel surveillance system? Why or why not?
• Active: What sources of non-traditional surveillance data will you incorporate into the new influenza sentinel surveillance system? Justify your choices.

Outbreak investigation
• Inactive: Give an example of close-ended questionnaire questions that gather information on food intake history in an outbreak.
• Active: Write close-ended questions for the food history section of the questionnaire that you will use to investigate this outbreak.

Questions should cause the learner to use the knowledge and information they have gleaned both within and outside of the training to “act” or make a decision but should not affect the ultimate plot of the case study. See an example from the measles case study in Table 3C.

Draw on learners' experience

Adult learners are motivated by experiences that affect them personally. Facilitating contributory participation increases their personal investment (16). For a group case study, there are multiple advantages in using the case study to draw out learners' experience and personal perspective. First, when learners contribute their own experience to the case study, they gain confidence and enjoyment throughout the activity. Second, a case study that assumes learners have no experience or contribution may feel pedantic or even condescending. Third, encouraging the learner to share their ideas, opinions, and experience creates a synergistic learning opportunity that the questions on the page alone cannot accomplish. This can be done by asking discussion or reflection questions during the progress of the exercise and at its conclusion. Questions that aim to solicit information from the learner regarding past experiences he or she has in making decisions or navigating situations like those encountered in the case study can be effective in enhancing the learning experience. These types of questions can also encourage beneficial self-reflection and allow for recognition of how skills have been or could be applied in reality. Alternatively, questions that treat issues or controversies with no perfect answer fit into the context of the case study while drawing on learners' opinions, often leading to rich dialogue and experience sharing. Table 3D provides a scenario-based example of how to draw upon learner experiences relevant to the measles case study.

Interactive training design options

The specific design used will depend on training content, setting, and the number of learners. Below, we discuss several designs in the context of smaller groups (e.g., 9 or fewer individuals) as well as larger groups (e.g., 10–30 individuals). In many cases, larger groups can be accommodated with small-group case study designs. Using “breakout” groups, where a classroom of a couple dozen or more breaks into smaller groups for case study work, is the easiest method to conduct participatory case-study based training. In many cases, small modifications can be made that also take advantage of the larger class size.

Problem-based scenarios

Problem-based scenarios are the mainstay of case study trainings. They guide learners through carrying out activities and can be flexible in length, depending on the depth of information covered and activities or computations required.

Small groups

In a problem-based scenario, a situation or background information is presented, and the individual or small group must work through a series of questions that address learning objectives in the context presented. Questions ask learners to provide information, conduct a calculation, or come to a decision and move on to the next question. Additional information that adds to the scenario may be provided once or several times throughout the case study.

Large groups

A couple of variations on the problem-based case study scenario make it more interesting in the large group setting. One option is to create slightly different scenarios for breakout groups. For example, in a case study where groups are designing surveillance systems according to set principles taught in class, each small group can focus on a different set of diseases or conditions for their surveillance system. Alternatively, small groups can be provided with updates unique to their group. While this requires additional work in developing case study answer guides, small groups can present their results to each other at the end of the session and can learn more about public health challenges and considerations than just the work they themselves have performed.

Role plays

Role plays are excellent tools for practicing scenarios in which the learner must think of what to say or do, and work well for interview situations and meeting scenarios. They are best used among learners who have already spent some time together, so that fewer inhibitions exist.

While many case studies ask the learner to assume they have a specific role or identity in order to answer case study questions, a role-play asks a group of learners to go a step further by carrying out interactions with other group members from the perspective of the role they are playing. To prevent learners from feeling uncomfortable in carrying out their role play, each “role” should contain specific guidance with information about their character, including general terms about what they should say and do in the role play.

Acting out an interview can provide inexperienced learners an opportunity to practice and offers experienced learners an opportunity to add improvisational content from their own experience. Scenarios may have learners conduct interviews with food workers, business proprietors, case-patients, research subjects, hospital personnel, and more. Additionally, they may be the interviewee with members of the media or government officials. Both the person conducting the interview and the person being interviewed are given guidance that can include questionnaires or question topic domains for the interviewer, a personality profile of their roles, background information for the person answering questions, top priority concerns for stakeholders, and any other details relevant to the situation. Additionally, role players can be given guidance to cause challenges for each other. For example, an interviewee's information sheet can tell them that they do not like to discuss “personal” problems, and they should do their best to avoid directly answering questions that ask for sensitive information.

Town hall meetings, stakeholder meetings, or media events can be simulated when roles relevant to the situation are assigned to different group members. For example, group members could be in a scenario dealing with student health, and members can be assigned the roles of school officials, parents, health department personnel, or students. Information sheets for each role can encourage the role player to be at odds with others to encourage discussion, bring up concepts taught in the class, and rationalize their actions relevant to the scenario. The value derived from role playing is in practicing personal interactions and the process of considering various viewpoints.

Role plays can be carried out within the context of the large group by dividing the audience into breakout groups, each representing a group of people, such as health spokespeople, media, or community members. The entire group would have the same pre-defined list of character traits or concerns, but individuals within the group voicing their own perspective provide lively interaction. Another option is to have each small group carry out the same role play, with one person per role, and then to follow the role play with a large group discussion. The instructor can ask questions such as, “For those of you who conducted the interview, what was the most challenging aspect of obtaining the information you needed?” Debriefing encourages learners to process what they have learned and allows them to share their own experiences in similar circumstances.

Create a common product

This exercise is particularly useful when the skills to complete a large task are being taught, such as skills for writing reports or designing surveillance systems.

Although it is not feasible, in a classroom setting, to have learners complete a large task at their desk, as a group they can efficiently combine forces to cover key concepts from the teaching and produce a common paper, outline, graph, or presentation that addresses key points. Simple example assignments to a small group include:

  • Given the scenario, create a flow diagram of a surveillance system that collects population-based pneumonia and influenza data.
  • Write the outline of a bulletin article that summarizes the outbreak investigation methods and results that we have worked through today.
  • With each group member taking responsibility for one section of a protocol, write an outline for key content to be included in each section. Then share your results with your group members and solicit their feedback.

Breakout groups creating a common product can provide an added dimension and present their product to the entire class. In these cases, it is beneficial to ask the class for constructive criticism on each other's work. Audience members can rate the presenting group against how well they addressed class concepts or met defined criteria. We have used Olympic-style score cards for this task, which lessens the risk of “speaking out” from the viewpoint of the learner offering feedback, but also gives the instructor impetus to ask specific individuals for their rationale in providing that score.

“Go out” assignments

Although the ability to use this exercise is highly dependent upon the training setting, “going out” is an interactive way to solidify concepts in questionnaire design, questionnaire execution, and data collection.

Small or large groups

If a training is being held in a location where learners have easy and safe access to a public lunch room, university students, or a busy walkway or plaza, the learner can go out of the classroom and engage the general public in practicing basic interviewing skills and piloting questions from a data collection instrument. For example, the learner can collect data to bring back to the classroom, such as whether people are wearing hats, so they can create a collective distribution diagram during a biostatistics lesson. A specific and somewhat tight time limit should be given with the understanding that this is a part of the training, not a break.

Presentations from the field

Where instruction is being given to individuals representing a variety of expertise levels, presentations by two or three learners who have more experience or expertise can provide a highly educational perspective. For example, in a training we carried out on strengthening population-based influenza surveillance, influenza officers from countries with a strong surveillance system were asked ahead of time to present on the design, site selection, case selection, and limitations or barriers in their surveillance systems. Other learners, on hearing the presentations, asked specific logistical and operational questions and observed how concepts being taught in a lecture really did apply to the “real world.” This helps build professional networks that learners can turn to for expertise or advice post-training.

Much like the practice of field epidemiology itself, there is an art and a science to producing a case study. We lay out a process for structuring a case study around key teaching points, finding elements to include in a case study plot, and incorporating interactive activities and methods throughout a case study, acknowledging that real-world situations may not always follow a predefined sequence.

First, ensure that the goals and content of the training are compatible with the goals of a case study mentioned earlier in this paper. Case studies lend themselves well to situations in which the learners have some experience working in their designated fields to enhance their participation. Clear didactic materials (whether lecture, job aid, or other format) for learners to refer to can keep the discussion focused appropriately. When a group of learners comes together, the discussion can flounder or become derailed by lack of clarity in the underlying concepts. Thus, it is important that any didactic materials also be carefully crafted. An invited or guest speaker for didactic materials can be beneficial but can also confuse learners if concepts are presented differently than in the case study or other training materials. If guest speakers are delivering didactic content, we have found that it is easiest to provide them slides covering the critical points. This ensures the speaker meets the required objectives, and provides a benefit to the course as the speaker can give subject matter feedback to curriculum developers. When teaching complex or involved ideas, we have found that having an additional checklist or conceptual aid handy helps keep the discussion on track (and the training on time).

Second, consider the amount of time for the training, characteristics of the learners and training content, and select a format. For long trainings, vary the case study format to prevent learner fatigue, but weigh contextual factors when selecting a format. In one multi-day training we carried out, learners were to complete a “go-out” assignment. The assignment fell near the end of the week, when learners were tired, and we scheduled it for the hour just before lunch so that students would be able to interact with a larger number of patrons at the cafeteria at the training site. In practice, most of our learners were distracted by a coffee and a snack. While they eventually completed the assignment, we lost almost an hour of training time. Other considerations for choosing a format include number of participants, level of learner expertise, cultural sensibilities in interacting with one another, and availability of teaching and support staff.

Third, identify key steps and teaching points for inclusion as questions or processes during the case study. If the training content is not easily related to a set of steps or a process, it should be tailored to the identified learning objectives and should ensure that the “action” verbs from learning objectives are carried out. Write questions around these points that coincide with the plot or unfolding of the scenario. Professional experience or published literature can provide a basis for scenario development.

Fourth, ensure that questions are clear and action-oriented. Always provide an answer key or main teaching points that should be derived from each question in an instructor copy. Estimate that it will take about 10 minutes for a group to read, discuss, and record answers to each question.

In settings where breakout groups are utilized, it is recommended that a facilitator be embedded with each group to help moderate and to ensure that all learners have a chance to participate, that appropriate effort is being exerted, and that learners generally arrive at the intended answers.

Participatory case studies are a beneficial way of delivering training for professionals who need to learn, reinforce, and apply specific skills to carry out their job duties. In the field of Public Health, where the responsibilities and hours can be significant, the challenge of recruiting and maintaining dedicated workers must be met in order to protect and promote healthy communities. Training public health professionals to learn new skills and encouraging them to share their experience with others through a network-building group case study is an opportunity that is valuable for our public health work force as well as for our communities.

Author contributions

AN wrote the manuscript with significant contribution from LB. PM conceptualized the manuscript. All authors provided content expertise, contributed to manuscript revision, read, and approved the submitted version.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors wish to acknowledge the insight and feedback from partners and trainees over the past two decades.

1 Fairfax County Virginia: Public Health Nursing at the Fairfax County Health Department. http://www.fairfaxcounty.gov/hd/careers-in-public-health/public-nursing-jobs.htm (2016) (Accessed July 11, 2016).

2 Summit County Public Health: Careers in Public Health. http://scphoh.org/pages/careers.html (2015) (Accessed July 11, 2016).

3 IRINNews.org Human Rights: Vaccination teams defeat “Ebola effect” in Guinea (2015). http://www.irinnews.org/news/2015/04/29 (Accessed August 13, 2018).

Funding. This publication was supported, in part, by the Cooperative Agreement Number 1U19GH001591 funded by the U.S. Centers for Disease Control and Prevention (CDC). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention or the Department of Health and Human Services.

Advanced Epidemiological Analysis

Chapter 3 Time series / case-crossover studies

We’ll start by exploring common characteristics in time series data for environmental epidemiology. In the first half of the class, we’re focusing on a very specific type of study: one that leverages large-scale vital statistics data, collected at a regular time scale (e.g., daily), combined with large-scale measurements of a climate-related exposure, with the goal of estimating the typical relationship between the level of the exposure and risk of a health outcome. For example, we may have daily measurements of particulate matter pollution for a city, taken at a set of Environmental Protection Agency (EPA) monitors. We want to investigate how risk of cardiovascular mortality changes in the city from day to day in association with these pollution levels. If we have daily counts of the number of cardiovascular deaths in the city, we can create a statistical model that fits the exposure-response association between particulate matter concentration and daily risk of cardiovascular mortality. These statistical models, and the type of data used to fit them, will be the focus of the first part of this course.
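
To make the idea of fitting an exposure-response association concrete, here is a minimal sketch of a log-linear Poisson model for daily counts, fit by Newton-Raphson using only the Python standard library. It is an illustration under strong assumptions, not the modeling approach used later in the course: the data are synthetic, the function name `fit_poisson` is invented, and the model omits the terms for season, long-term trend, and weather that a real analysis would need.

```python
import math


def fit_poisson(x, y, n_iter=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the Poisson log-likelihood)
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix entries
        i00 = sum(mu)
        i01 = sum(mi * xi for xi, mi in zip(x, mu))
        i11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system information * delta = score
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Synthetic daily PM concentrations (ug/m^3) and death counts; values are invented.
pm = [8, 12, 15, 20, 25, 30, 35, 40, 18, 10]
deaths = [22, 25, 24, 27, 26, 30, 29, 33, 25, 23]

b0, b1 = fit_poisson(pm, deaths)
# exp(10 * b1) is the estimated rate ratio for a 10 ug/m^3 increase in PM.
print(f"Rate ratio per 10 ug/m^3: {math.exp(10 * b1):.3f}")
```

In practice such a model would be fit with established regression software rather than hand-coded; the sketch only shows what "fitting the association" means numerically.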

3.1 Readings

The required readings for this chapter are:

  • Bhaskaran et al. (2013): Provides an overview of time series regression in environmental epidemiology.
  • Vicedo-Cabrera, Sera, and Gasparrini (2019): Provides a tutorial of all the steps for projecting the health impacts of temperature extremes under climate change. One of the steps is to fit the exposure-response association using present-day data (the section on “Estimation of Exposure-Response Associations” in the paper). In this chapter, we will go into details on that step, and that section of the paper is the only required reading for this chapter. Later in the class, we’ll look at other steps covered in this paper. Supplemental material for this paper is available to download at http://links.lww.com/EDE/B504 . You will need the data in this supplement for the exercises for class.

The following are supplemental readings (i.e., not required, but may be of interest) associated with the material in this chapter:

  • B. Armstrong et al. (2012): Commentary that provides context on how epidemiological research on temperature and health can help inform climate change policy.
  • Dominici and Peng (2008c): Overview of study designs for studying climate-related exposures (air pollution in this case) and human health. Chapter in a book that is available online through the CSU library.
  • B. Armstrong (2006): Covers similar material as Bhaskaran et al. (2013), but with more focus on the statistical modeling framework.
  • Gasparrini and Armstrong (2010): Describes some of the advances made to time series study designs and statistical analysis, specifically in the context of temperature.
  • Basu, Dominici, and Samet (2005): Compares time series and case-crossover study designs in the context of exploring temperature and health. Includes a nice illustration of different referent periods, including time-stratified.
  • B. G. Armstrong, Gasparrini, and Tobias (2014): This paper describes different data structures for case-crossover data, as well as how conditional Poisson regression can be used in some cases to fit a statistical model to these data. Supplemental material for this paper is available at https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-122#Sec13 .
  • Imai et al. (2015): Typically, the time series study design covered in this chapter is used to study non-communicable health outcomes. This paper discusses opportunities and limitations in applying a similar framework for infectious disease.
  • Dominici and Peng (2008b): Heavier on statistics. Describes some of the statistical challenges of working with time series data for air pollution epidemiology. Chapter in a book that is available online through the CSU library.
  • Lu and Zeger (2007): Heavier on statistics. This paper shows how, under conditions often common for environmental epidemiology studies, case-crossover and time series methods are equivalent.
  • Gasparrini (2014): Heavier on statistics. This provides the statistical framework for the distributed lag model for environmental epidemiology time series studies.
  • Dunn and Smyth (2018): Introduction to statistical models, moving into regression models and generalized linear models. Chapter in a book that is available online through the CSU library.
  • James et al. (2013): General overview of linear regression, with an R coding “lab” at the end to provide coding examples. Covers model fit, continuous, binary, and categorical covariates, and interaction terms. Chapter in a book that is available online through the CSU library.

3.2 Time series and case-crossover study designs

In the first half of this course, we’ll take a deep look at how researchers can study how environmental exposures and health risk are linked using time series studies. Let’s start by exploring the study design for this type of study, as well as a closely linked study design, that of case-crossover studies.

It’s important to clarify the vocabulary we’re using here. We’ll use the terms time series study and case-crossover study to refer specifically to a type of study common for studying air pollution and other climate-related exposures. However, both terms have broader definitions, particularly in fields outside environmental epidemiology. For example, a time series study more generally refers to a study where data is available for the same unit (e.g., a city) for multiple time points, typically at regularly-spaced times (e.g., daily). A variety of statistical methods have been developed to gain insight from this type of data, some of which are currently rarely used in the specific fields of air pollution and climate epidemiology that we’ll explore here. For example, there are methods to address autocorrelation over time in measurements (that is, the tendency for measurements taken at closer time points to be somewhat correlated). We won’t cover these methods here, and you won’t see them applied often in environmental epidemiology studies, but they might be the focus of a “Time Series” course in a statistics or economics department.
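
As a small illustration of what autocorrelation means, the sketch below computes the sample autocorrelation of a series at a given lag. The function name and the example series are assumptions for illustration only, and this is not a method used in the rest of the chapter.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((v - mean) ** 2 for v in series)
    # Products of deviations for pairs of values that are `lag` steps apart
    covariance_sum = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance_sum / variance_sum


# A smoothly rising series: neighboring values are similar, so lag-1 autocorrelation is positive.
smooth = [1, 2, 3, 4, 5]
# A series that flips sign each step: neighboring values differ, so it is negative.
alternating = [1, -1, 1, -1]

print(autocorrelation(smooth, lag=1))
print(autocorrelation(alternating, lag=1))
```

Daily environmental and health series usually behave more like the first example, since today's value tends to resemble yesterday's.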

In air pollution and climate epidemiology, time series studies typically begin with study data collected for an aggregated area (e.g., city, county, ZIP code) and with a daily resolution. These data are usually secondary data, originally collected by the government or other organizations through vital statistics or other medical records (for the health data) and networks of monitors (for the exposure data). In the next section of this chapter, we’ll explore common characteristics of these data. These data are used in a time series study to investigate how changes in the daily level of the exposure are associated with risk of a health outcome, focusing on the short-term period. For example, a study might investigate how risk of respiratory hospitalization in a city changes in relationship with the concentration of particulate matter during the week or two following exposure. The study period for these studies is often very long (often a decade or longer), and while single-community time series studies can be conducted, many time series studies for environmental epidemiology now include a large set of communities of national or international scope.

The study design essentially compares a community with itself at different time points—asking if health risk tends to be higher on days when exposure is higher. By comparing the community to itself, the design removes many challenges that would come up when comparing one community to another (e.g., is respiratory hospitalization risk higher in city A than city B because particulate matter concentrations are typically higher in city A?). Communities differ in demographics and other factors that influence health risk, and it can be hard to properly control for these when exploring the role of environmental exposures. By comparison, demographics tend to change slowly over time (at least, compared to a daily scale) within a community.

One limitation, however, is that the study design is well suited to studying acute effects but more limited in studying chronic health effects. This is tied to the design and traditional ways of statistically modeling the resulting data. Since a community is compared with itself, the design removes challenges in comparing across communities, but it introduces new ones in comparing across time. Both environmental exposures and rates of health outcomes can have strong patterns over time, both across the year (e.g., mortality rates tend to follow a strong seasonal pattern, with higher rates in winter) and across longer periods (e.g., over the decade or longer of a study period). These patterns must be addressed through the statistical model fit to the time series data, and they make it hard to disentangle chronic effects of the exposure from unrelated temporal patterns in the exposure and outcome. As a result, most time series studies focus on the short-term (or acute) association between exposure and outcome, typically looking at a period of at most about a month following exposure.

The term case-crossover study is a bit more specific than time series study. In addition, there has been a strong movement in environmental epidemiology towards applying one particular version of the design, and so in this field the term now often implies that version. Broadly, a case-crossover study is one in which the conditions at the time of a health outcome are compared to conditions at other times that should otherwise (i.e., outside of the exposure of interest) be comparable. A case-crossover study could, for example, investigate the association between weather and car accidents by taking a set of car accidents and investigating how weather during the car accident compared to weather in the same location the week before.

One choice in a case-crossover study design is how to select the control time periods. Early studies tended to use a simple method for this, for example, taking the day before, or a day the week before, or some similar period somewhat close to the day of the outcome. As researchers applied the study design to large sets of data (e.g., all deaths in a community over multiple years), they noticed that some choices could create bias in estimates. As a result, most environmental epidemiology case-crossover studies now use a time-stratified approach to selecting control days. This approach selects a set of control days that typically includes days both before and after the day of the health outcome, drawn from a defined “stratum” of days that should be comparable in terms of temporal trends. For daily-resolved data, a stratum is typically defined by the combination of year, month, and day of week. For example, one stratum of comparable days might be all the Mondays in January of 2010. These strata are created throughout the study period, and then days are only compared to other days within their stratum (although, fortunately, there are ways you can apply a single statistical model to fit all the data for this approach rather than having to fit models stratum by stratum over many years).
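
To make the idea of a stratum concrete, here is a minimal sketch in base R. It assumes daily data with a date column; the object demo_days and the stratum labels are hypothetical names made up for this illustration and are not part of the example dataset.

```r
# Hypothetical daily dates (two months only, to keep the example small)
demo_days <- data.frame(
  date = seq(as.Date("2010-01-01"), as.Date("2010-02-28"), by = "day")
)

# A stratum is one combination of year, month, and day of week
demo_days$stratum <- paste(
  format(demo_days$date, "%Y"),
  format(demo_days$date, "%m"),
  format(demo_days$date, "%u"),  # ISO weekday number: 1 = Monday, ..., 7 = Sunday
  sep = "-"
)

# All Mondays in January 2010 fall in the same stratum
mondays_jan_2010 <- demo_days$date[demo_days$stratum == "2010-01-1"]
mondays_jan_2010

# Number of strata created over these two months
length(unique(demo_days$stratum))
```

In a time-stratified design, a day on which a health outcome occurs would only be compared with the other days that share its stratum label.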

When this is applied to data at an aggregated level (e.g., city, county, or ZIP code), it is in spirit very similar to a time series study design, in that you are comparing a community to itself at different time points. The main difference is that a time series study uses statistical modeling to control for potential confounding from temporal patterns, while a case-crossover study of this type instead controls for this potential confounding by only comparing days that should be “comparable” in terms of temporal trends, for example, comparing a day only to other days in the same month, year, and day of week. You will often hear that case-crossover studies therefore address potential confounding by temporal patterns “by design” rather than “statistically” (as in time series studies). However, in practice (and as we’ll explore in this class), in environmental epidemiology, case-crossover studies often are applied to aggregated community-level data, rather than individual-level data, with exposure assumed to be the same for everyone in the community on a given day. Under these assumptions, time series and case-crossover studies have been shown to be essentially equivalent (and, in fact, can use the same study data), with only slightly different terms used to control for temporal patterns in the statistical model fit to the data. Several interesting papers have been written to explore differences and similarities in these two study designs as applied in environmental epidemiology (Basu, Dominici, and Samet 2005; B. G. Armstrong, Gasparrini, and Tobias 2014; Lu and Zeger 2007).

These types of study designs in practice use similar datasets. In earlier presentations of the case-crossover design, these data would be set up a bit differently for statistical modeling. More recent work, however, has clarified how they can be modeled similarly to when using a time series study design, allowing the data to be set up in a similar way (B. G. Armstrong, Gasparrini, and Tobias 2014).

Several excellent commentaries or reviews are available that provide more details on these two study designs and how they have been used specifically to investigate the relationship between climate-related exposures and health (Bhaskaran et al. 2013; B. Armstrong 2006; Gasparrini and Armstrong 2010). Further, these designs are just two tools in a wider collection of study designs that can be used to explore the health effects of climate-related exposures. Dominici and Peng (2008c) provide a nice overview of this broader set of designs.

3.3 Time series data

Let’s explore the type of dataset that can be used for these time series–style studies in environmental epidemiology. In the examples in this chapter, we’ll be using data that come as part of the Supplemental Material for one of this chapter’s required readings, Vicedo-Cabrera, Sera, and Gasparrini (2019). Follow the link for the supplement for this article and then look for the file “lndn_obs.csv.” This is the file we’ll use as the example data in this chapter.

These data are saved in a csv format (that is, a plain text file, with commas used as the delimiter), and so they can be read into R using the read_csv function from the readr package (part of the tidyverse). For example, you can use the following code to read in these data, assuming you have saved them in a “data” subdirectory of your current working directory:
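
A minimal sketch of that call is below. It assumes the file sits at data/lndn_obs.csv relative to your working directory, and it stores the result under the name obs, matching how the dataset is referred to in the rest of this chapter. The file.exists() guard is an addition for this sketch, so that the snippet does not stop with an error if the file has not been downloaded yet.

```r
library(readr)

# Path assumed from the text: "lndn_obs.csv" saved in a "data" subdirectory
obs_path <- file.path("data", "lndn_obs.csv")

# Only read the file if it is actually there
if (file.exists(obs_path)) {
  obs <- read_csv(obs_path)
  print(obs)
}
```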

This example dataset shows many characteristics that are common for datasets for time series studies in environmental epidemiology. Time series data are essentially a sequence of data points taken repeatedly at a fixed time interval (e.g., day, week, or month). General characteristics of time series data for environmental epidemiology studies are:

  • Observations are given at an aggregated level. For example, instead of individual observations for each person in London, the obs data give counts of deaths throughout London. The level of aggregation is often determined by geopolitical boundaries, for example, counties or ZIP codes in the US.
  • Observations are given at regularly spaced time steps over a period. In the obs dataset, the time interval is day. Typically, values will be provided continuously over that time period, with observations for each time interval. Occasionally, however, the time series data may only be available for particular seasons (e.g., only warm season dates for an ozone study), or there may be some missing data on either the exposure or health outcome over the course of the study period.
  • Observations are available at the same time step (e.g., daily) for (1) the health outcome, (2) the environmental exposure of interest, and (3) potential time-varying confounders. In the obs dataset, the health outcome is mortality (from all causes; sometimes, the health outcome will focus on a specific cause of mortality or other health outcomes such as hospitalizations or emergency room visits). Counts are given for everyone in the city for each day (the all column), as well as for specific age categories (all_0_64 for all deaths among those up to 64 years old, and so on). The exposure of interest in the obs dataset is temperature, and three metrics of this are included (tmean, tmin, and tmax). Day of the week is one time-varying factor that could be a confounder, or at least help explain variation in the outcome (mortality). This is included through the dow variable in the obs data. Sometimes, you will also see a marker for holidays included as a potential time-varying confounder, or other exposure variables (temperature is a potential confounder, for example, when investigating the relationship between air pollution and mortality risk).
  • Multiple metrics of an exposure and / or multiple health outcome counts may be included for each time step. In the obs example, three metrics of temperature are included (minimum daily temperature, maximum daily temperature, and mean daily temperature). Several counts of mortality are included, providing information for specific age categories in the population. The different metrics of exposure will typically be fit in separate models, either as a sensitivity analysis or to explore how exposure measurement affects epidemiological results. If different health outcome counts are available, these can be modeled in separate statistical models to determine an exposure-response function for each outcome.

3.4 Exploratory data analysis

When working with time series data, it is helpful to start with some exploratory data analysis. This type of time series data will often be secondary data, that is, data that were previously collected for another purpose and that you are re-using. Exploratory data analysis is particularly important with secondary data like this. For primary data that you collected yourself, following protocols that you designed yourself, you will often be very familiar with the structure of the data and any quirks in it by the time you are ready to fit a statistical model. With secondary data, however, you will typically start with much less familiarity about the data, how it was collected, and any potential issues with it, like missing data and outliers.

Exploratory data analysis can help you become familiar with your data. You can use summaries and plots to explore the parameters of the data, and also to identify trends and patterns that may be useful in designing an appropriate statistical model. For example, you can explore how values of the health outcome are distributed, which can help you determine what type of regression model would be appropriate, and you can check whether there are potential confounders that have regular relationships with both the health outcome and the exposure of interest. You can see how many observations have missing data for the outcome, the exposure, or confounders of interest, and you can see if there are any measurements that look unusual. This can help in identifying quirks in how the data were recorded. For example, in some cases ground-based weather monitors use -99 or -999 to represent missing values, which is definitely something you want to catch and clean up in your data (replacing these codes with R’s NA for missing values) before fitting a statistical model!
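
As a small illustration of that clean-up step, the sketch below recodes sentinel values as NA. The vector temp_raw is made up for this example; with real monitor data you would first confirm in the data documentation which codes, if any, stand for missing values.

```r
# Hypothetical monitor readings in which -99 and -999 stand for "missing"
temp_raw <- c(12.1, -99, 15.4, -999, 9.8)

# Replace the sentinel codes with NA
temp_clean <- replace(temp_raw, temp_raw %in% c(-99, -999), NA)
temp_clean

# Count the missing values, then summarize while ignoring them
sum(is.na(temp_clean))
mean(temp_clean, na.rm = TRUE)
```

Without this step, a mean temperature computed from temp_raw would be pulled far below any plausible value by the sentinel codes.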

The following applied exercise will take you through some of the questions you might want to answer through this type of exploratory analysis. In general, the tidyverse suite of R packages has loads of tools for exploring and visualizing data in R. The lubridate package from the tidyverse, for example, is an excellent tool for working with date-time data in R, and time series data will typically have at least one column with the timestamp of the observation (e.g., the date for daily data). You may find it worthwhile to explore this package some more. There is a helpful chapter in Wickham and Grolemund (2016), https://r4ds.had.co.nz/dates-and-times.html, as well as a cheatsheet at https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_lubridate.pdf. For visualizations, if you are still learning techniques in R, two books you may find useful are Healy (2018) (available online at https://socviz.co/) and Chang (2018) (available online at http://www.cookbook-r.com/Graphs/).

Applied: Exploring time series data

Read the example time series data into R and explore it to answer the following questions:

  • What is the study period for the example obs dataset? (i.e., what dates / years are covered by the time series data?)
  • Are there any missing dates (i.e., dates with nothing recorded) within this time period? Are there any recorded dates where health outcome measurements are missing? Any where exposure measurements are missing?
  • Are there seasonal trends in the exposure? In the outcome?
  • Are there long-term trends in the exposure? In the outcome?
  • Is the outcome associated with day of week? Is the exposure associated with day of week?

Based on your exploratory analysis in this section, talk about the potential for confounding when these data are analyzed to estimate the association between daily temperature and city-wide mortality. Is confounding by seasonal trends a concern? How about confounding by long-term trends in exposure and mortality? How about confounding by day of week?

Applied exercise: Example code

In the obs dataset, the date of each observation is included in a column called date. The data type of this column is “Date”; you can check this by using the class function from base R:

Since this column has a “Date” data type, you can run some mathematical function calls on it. For example, you can use the min function from base R to get the earliest date in the dataset and the max function to get the latest.

You can also run the range function to get both the earliest and latest dates with a single call:
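
The calls described above can be sketched as follows. Because the example file may not be loaded, this sketch uses obs_demo, a small simulated stand-in with only a date column; its dates are invented and do not match the real study period. With the real data, you would use obs in place of obs_demo.

```r
# Simulated stand-in for the obs data (dates only; not the real study period)
obs_demo <- data.frame(
  date = seq(as.Date("2001-01-01"), as.Date("2003-06-30"), by = "day")
)

# Check the data type of the date column
class(obs_demo$date)

# Earliest and latest dates, separately and then in one call
min(obs_demo$date)
max(obs_demo$date)
range(obs_demo$date)
```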

This provides the range of the study period for these data. One interesting point is that it’s not a round set of years—instead, the data ends during the summer of the last study year. This doesn’t present a big problem, but is certainly something to keep in mind if you’re trying to calculate yearly averages of any values for the dataset. If you’re getting the average of something that varies by season (e.g., temperature), it could be slightly weighted by the months that are included versus excluded in the partial final year of the dataset. Similarly, if you group by year and then count totals by year, the number will be smaller for the last year, since only part of the year’s included. For example, if you wanted to count the total deaths in each year of the study period, it will look like they go down a lot the last year, when really it’s only because only about half of the last year is included in the study period:
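
Here is a sketch of that yearly count using dplyr. The data are simulated (obs_demo is a hypothetical stand-in whose last year stops at the end of June), so the numbers are not the London values; only the column names date and all follow the example dataset.

```r
library(dplyr)

# Simulated stand-in: constant expected daily deaths, final year only half covered
set.seed(1)
obs_demo <- data.frame(
  date = seq(as.Date("2001-01-01"), as.Date("2003-06-30"), by = "day")
)
obs_demo$all <- rpois(nrow(obs_demo), lambda = 150)

# Count the days and total deaths recorded in each calendar year
yearly_totals <- obs_demo %>%
  mutate(year = as.integer(format(date, "%Y"))) %>%
  group_by(year) %>%
  summarize(n_days = n(), total_deaths = sum(all))
yearly_totals
```

Including the number of days per year next to the totals makes it easy to spot that an apparent drop in the final year comes from partial coverage rather than from a real change.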

  • Are there any missing dates within this time period? Are there any recorded dates where health outcome measurements are missing? Any where exposure measurements are missing?

There are a few things you should check to answer this question. First (and easiest), you can check to see if there are any NA values within any of the observations in the dataset. This helps answer the second and third parts of the question. The summary function will provide a summary of the values in each column of the dataset, including the count of missing values (NAs) if there are any:
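
As a sketch of what this check looks like, the example below uses a tiny made-up data frame (obs_demo, with one temperature value deliberately set to NA) so that the effect on the output is visible. With the real data you would call summary(obs) instead.

```r
# Made-up stand-in with one missing temperature value added on purpose
obs_demo <- data.frame(
  date  = seq(as.Date("2001-01-01"), by = "day", length.out = 10),
  all   = c(150, 148, 161, 155, 149, 152, 158, 147, 151, 160),
  tmean = c(4.1, 3.8, NA, 5.0, 6.2, 5.5, 4.9, 3.2, 2.8, 4.4)
)

# Column-by-column summary; columns with missing values get an extra NA count
summary(obs_demo)

# A more compact count of missing values per column
na_counts <- colSums(is.na(obs_demo))
na_counts
```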

Based on this analysis, all observations are complete for all dates included in the dataset. No NAs are listed for any of the columns, which indicates that there are no missing values on the dates for which there’s a row in the data.

However, this does not guarantee that every date between the start date and end date of the study period is included in the recorded data. Sometimes, some dates might not get recorded at all in the dataset, and the summary function won’t help you determine when this is the case. One common example in environmental epidemiology is with ozone pollution data. These are sometimes only measured in the warm season, and so may be shared in a dataset with all dates outside of the warm season excluded.

There are a few alternative explorations you can do to check this. Perhaps the easiest is to check the number of days between the start and end date of the study period, and then see if the number of observations in the dataset is the same:
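
A sketch of this comparison is below, again on a simulated stand-in (obs_demo) rather than the real data. The second part removes a block of rows to show what the check looks like when dates are missing; obs_gap is a hypothetical name used only here.

```r
# Simulated stand-in with one row per day
obs_demo <- data.frame(
  date = seq(as.Date("2001-01-01"), as.Date("2003-06-30"), by = "day")
)

# Number of days between the first and last date
span_days <- as.numeric(max(obs_demo$date) - min(obs_demo$date))
span_days
nrow(obs_demo)

# Complete daily coverage means one row per day, i.e., span_days + 1 rows
nrow(obs_demo) == span_days + 1

# Dropping some rows (a hypothetical gap) makes the same check fail
obs_gap <- obs_demo[-(100:110), , drop = FALSE]
gap_check <- nrow(obs_gap) == as.numeric(max(obs_gap$date) - min(obs_gap$date)) + 1
gap_check
```

This check assumes each date appears at most once; duplicated dates could hide a gap, so it can be worth confirming that with anyDuplicated(obs_demo$date) as well.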

This indicates that there is an observation for every date over the study period, since the number of observations should be one more than the time difference. In the next question, we’ll be plotting observations by time, and typically this will also help you see if there are large chunks of missing dates in the data.

You can use a simple plot to visualize patterns over time in both the exposure and the outcome. For example, the following code plots a dot for each daily temperature observation over the study period. The points are set to a smaller size ( size = 0.5 ) and plotted with some transparency ( alpha = 0.5 ) since there are so many observations.
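
A sketch of this kind of plot with ggplot2 follows. The data are simulated (obs_demo, with a seasonal temperature curve plus noise), so the figure will only resemble the real one in its general shape; the column names date and tmean follow the example dataset.

```r
library(ggplot2)

# Simulated stand-in: seasonal temperature pattern plus random noise
set.seed(1)
dates <- seq(as.Date("2001-01-01"), as.Date("2003-06-30"), by = "day")
doy <- as.integer(format(dates, "%j"))
obs_demo <- data.frame(
  date  = dates,
  tmean = 11 + 7 * sin(2 * pi * (doy - 110) / 365) + rnorm(length(dates), sd = 2)
)

# One small, semi-transparent point per daily observation
temp_plot <- ggplot(obs_demo, aes(x = date, y = tmean)) +
  geom_point(size = 0.5, alpha = 0.5)
temp_plot
```

Swapping tmean for the mortality column (all in the example dataset) gives the equivalent plot for the outcome.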

There is (unsurprisingly) clear evidence here of a strong seasonal trend in mean temperature, with values typically lowest in the winter and highest in the summer.

You can plot the outcome variable in the same way:

Again, there are seasonal trends, although in this case they are inverted: mortality tends to be highest in the winter and lowest in the summer. Further, the seasonal pattern is not equally strong in all years. In some years it has a much higher winter peak, probably in conjunction with severe influenza seasons.

Another way to look for seasonal trends is with a heatmap-style visualization, with day of year along the x-axis and year along the y-axis. This allows you to see patterns that repeat around the same time of the year each year (and also unusual deviations from normal seasonal patterns).

For example, here’s a plot showing temperature in each year, where the observations are aligned on the x-axis by time in year. We’re using doy, which stands for “day of year” (i.e., Jan 1 = 1; Jan 2 = 2; … Dec 31 = 365 as long as it’s not a leap year), as the measure of time in the year. We’ve reversed the y-axis so that the earliest years in the study period start at the top of the visual, with later study years below them. This is a personal style choice, and it would be no problem to leave the y-axis as-is. We’ve used the viridis color scale for the fill, since it has a number of features that make it preferable to the default R color scale, including that it is perceptible for most types of color blindness and can be printed in grayscale and still be correctly interpreted.
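
One way to build this style of plot is sketched below, using geom_tile from ggplot2. This is an assumption about how such a figure could be drawn, not necessarily the code behind the original figure, and it runs on simulated data (obs_demo) in which the year and doy columns are derived from the date.

```r
library(ggplot2)

# Simulated stand-in with year and day-of-year derived from the date
set.seed(1)
dates <- seq(as.Date("2001-01-01"), as.Date("2003-06-30"), by = "day")
obs_demo <- data.frame(
  date = dates,
  year = as.integer(format(dates, "%Y")),
  doy  = as.integer(format(dates, "%j"))
)
obs_demo$tmean <- 11 + 7 * sin(2 * pi * (obs_demo$doy - 110) / 365) +
  rnorm(nrow(obs_demo), sd = 2)

# One tile per day: day of year on x, year on y (earliest year at the top),
# with temperature shown through the viridis fill scale
heat_plot <- ggplot(obs_demo, aes(x = doy, y = year, fill = tmean)) +
  geom_tile() +
  scale_y_reverse() +
  scale_fill_viridis_c()
heat_plot
```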

From this visualization, you can see that temperatures tend to be higher in the summer months and lower in the winter months. “Spells” of extreme heat or cold are visible—where extreme temperatures tend to persist over a period, rather than randomly fluctuating within a season. You can also see unusual events, like the extreme heat wave in the summer of 2003, indicated with the brightest yellow in the plot.

We created the same style of plot for the health outcome. In this case, we focused on mortality among the oldest age group, as temperature sensitivity tends to increase with age, so this might be where the strongest patterns are evident.

For mortality, there tends to be an increase in the winter compared to the summer. Some winters have stretches with particularly high mortality—these are likely a result of seasons with strong influenza outbreaks. You can also see on this plot the impact of the 2003 heat wave on mortality among this oldest age group—an unusual spot of light green in the summer.

Some of the plots we created for the previous question also help in exploring whether there are long-term trends. For example, the plot of daily mortality counts over time shows a clear pattern of decreasing counts, on average, over the course of the study period:

It can be helpful to add a smooth line to help detect these longer-term patterns, which you can do with geom_smooth :
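
A sketch of adding the smooth line is below. The data are simulated with a built-in downward drift in expected daily deaths (obs_demo is a hypothetical stand-in), and geom_smooth is left at its default settings, which choose the smoothing method based on the size of the data.

```r
library(ggplot2)

# Simulated stand-in: expected daily deaths drift downward over the period
set.seed(1)
dates <- seq(as.Date("2001-01-01"), as.Date("2003-06-30"), by = "day")
obs_demo <- data.frame(
  date = dates,
  all  = rpois(length(dates), lambda = seq(170, 140, length.out = length(dates)))
)

# Daily points with a smooth line on top to highlight the longer-term pattern
trend_plot <- ggplot(obs_demo, aes(x = date, y = all)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth()
trend_plot
```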

You could also take the median mortality count across each year in the study period, although you should take out any years without a full year’s worth of data before you do this, since there are seasonal trends in the outcome:
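
The following sketch shows one way to do this with dplyr. It drops any year with fewer than 365 daily records before summarizing; that threshold is an assumption that suits daily data without gaps. As before, obs_demo is simulated and ends partway through its final year.

```r
library(dplyr)

# Simulated stand-in whose final year is only partly covered
set.seed(1)
dates <- seq(as.Date("2001-01-01"), as.Date("2003-06-30"), by = "day")
obs_demo <- data.frame(
  date = dates,
  all  = rpois(length(dates), lambda = seq(170, 140, length.out = length(dates)))
)

# Median daily mortality count per year, keeping only complete years
yearly_median <- obs_demo %>%
  mutate(year = as.integer(format(date, "%Y"))) %>%
  group_by(year) %>%
  filter(n() >= 365) %>%
  summarize(median_deaths = median(all))
yearly_median
```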

Again, we see a clear pattern of decreasing mortality rates in this city over time. This means we need to think carefully about long-term time patterns as a potential confounder. It will be particularly important to think about this if the exposure also has a strong pattern over time. For example, air pollution regulations have meant that, in many cities, there may be long-term decreases in pollution concentrations over a study period.

The data already include day of week as a column (dow). However, this column has a character data type, so it doesn’t have the order of weekdays encoded (e.g., Monday comes before Tuesday). This makes it hard to look for patterns related to things like weekend / weekday.

We could convert this to a factor and encode the weekday order when we do it, but it’s even easier to just recreate the column from the date column. We used the wday function from the lubridate package to do this—it extracts weekday as a factor, with the order of weekdays encoded (using a special “ordered” factor type):
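
A short sketch of that step on three hand-picked dates (a Sunday, a Monday, and a Saturday) is below. The label = TRUE argument asks wday for weekday labels rather than numbers. The printed labels depend on your locale, so they are not shown here.

```r
library(lubridate)

# Three example dates: a Sunday, a Monday, and a Saturday
demo_dates <- as.Date(c("2010-01-03", "2010-01-04", "2010-01-09"))

# Extract day of week as an ordered factor
dow <- wday(demo_dates, label = TRUE)
dow

# Confirm the ordered factor type and look at the encoded order of weekdays
is.ordered(dow)
levels(dow)

# Position of each date within the week (Sunday first under the default settings)
as.integer(dow)
```

With the example data, the same call on the date column would replace the character dow column, for example inside a mutate call.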

We looked at the mean, median, and 25th and 75th quantiles of the mortality counts by day of week:
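
These summaries can be sketched with dplyr and lubridate as follows. The data are simulated without any built-in weekday effect (obs_demo is a hypothetical stand-in), so the values will be similar across days here. Replacing all with tmean gives the same check for temperature.

```r
library(dplyr)
library(lubridate)

# Simulated stand-in with no built-in day-of-week effect
set.seed(1)
obs_demo <- data.frame(
  date = seq(as.Date("2001-01-01"), as.Date("2003-06-30"), by = "day")
)
obs_demo$all <- rpois(nrow(obs_demo), lambda = 150)

# Mean, median, and 25th / 75th quantiles of daily deaths by day of week
dow_summary <- obs_demo %>%
  mutate(dow = wday(date, label = TRUE)) %>%
  group_by(dow) %>%
  summarize(
    mean   = mean(all),
    median = median(all),
    q25    = quantile(all, 0.25),
    q75    = quantile(all, 0.75)
  )
dow_summary
```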

Mortality tends to be a bit higher on weekdays than weekends, but it’s not a dramatic difference.

We did the same check for temperature:

In this case, there does not seem to be much of a pattern by weekday.

You can also visualize the association using boxplots:

You can also try violin plots—these show the full distribution better than boxplots, which only show quantiles.
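
Both the boxplot and the violin plot can be sketched with ggplot2 as below. The data are simulated (obs_demo), with the dow column created through wday as described earlier; the plot object names are hypothetical.

```r
library(ggplot2)
library(lubridate)

# Simulated stand-in with an ordered day-of-week column
set.seed(1)
obs_demo <- data.frame(
  date = seq(as.Date("2001-01-01"), as.Date("2003-06-30"), by = "day")
)
obs_demo$all <- rpois(nrow(obs_demo), lambda = 150)
obs_demo$dow <- wday(obs_demo$date, label = TRUE)

# Boxplots of daily mortality counts by day of week
box_plot <- ggplot(obs_demo, aes(x = dow, y = all)) +
  geom_boxplot()
box_plot

# Violin plots of the same data, showing the shape of each distribution
violin_plot <- ggplot(obs_demo, aes(x = dow, y = all)) +
  geom_violin()
violin_plot
```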

All these reinforce that there are some small differences in weekend versus weekday patterns for mortality. There isn’t much pattern by weekday with temperature, so in this case weekday is unlikely to be a confounder (the same is not true with air pollution, which often varies based on commuting patterns and so can have stronger weekend/weekday differences). However, since it does help some in explaining variation in the health outcome, it might be worth including in our models anyway, to help reduce random noise.

Exploratory data analysis is an excellent tool for exploring your data before you begin fitting a statistical model, and you should get in the habit of using it regularly in your research. Dominici and Peng (2008a) provide another walk-through of exploring this type of data, including some more advanced tools for exploring autocorrelation and time patterns.

3.5 Statistical modeling for a time series study

Now that we’ve explored the data typical of a time series study in climate epidemiology, we’ll look at how we can fit a statistical model to those data to gain insight into the relationship between the exposure and acute health effects. Very broadly, we’ll be using a statistical model to answer the question: How does the relative risk of a health outcome change as the level of the exposure changes, after controlling for potential confounders?

In the rest of this chapter and the next chapter, we’ll move step-by-step to build up to the statistical models that are now typically used in these studies. Along the way, we’ll discuss key components and choices in this modeling process. The statistical modeling is based heavily on regression modeling, and specifically on generalized linear models. To help you get the most out of this section, you may find it helpful to review regression modeling and generalized linear models. Some resources for that include Dunn and Smyth (2018) and James et al. (2013).

One of the readings for this week, Vicedo-Cabrera, Sera, and Gasparrini (2019), includes a section on fitting exposure-response functions to describe the association between daily mean temperature and mortality risk. This article includes example code in its supplemental material, with code for fitting the model to these time series data in the file named “01EstimationERassociation.r.” Please download that file and take a look at the code.

The model in the code may at first seem complex, but it is made up of a number of fairly straightforward pieces:

  • The model framework is a generalized linear model (GLM)
  • This GLM is fit assuming an error distribution and a link function appropriate for count data
  • The GLM is fit assuming an error distribution that is also appropriate for data that may be overdispersed
  • The model includes control for day of the week by including a categorical variable
  • The model includes control for long-term and seasonal trends by including a spline (in this case, a natural cubic spline) for the day in the study
  • The model fits a flexible, non-linear association between temperature and mortality risk, also using a spline
  • The model fits a flexible non-linear association between temperature on the current day and a series of preceding days and mortality risk on the current day, using a distributed lag approach
  • The model jointly describes both of the two previous non-linear associations by fitting these two elements through one construct in the GLM, a cross-basis term

In this section and the next chapter, we will work through the elements, building up the code to get to the full model that is fit in Vicedo-Cabrera, Sera, and Gasparrini (2019).

Fitting a GLM to time series data

The generalized linear model (GLM) framework unites a number of types of regression models you may have previously worked with. One basic regression model that can be fit within this framework is a linear regression model. However, the framework also allows you to fit, among others, logistic regression models (useful when the outcome variable can only take one of two values, e.g., success / failure or alive / dead) and Poisson regression models (useful when the outcome variable is a count or rate). This generalized framework brings some unity to these different types of regression models. From a practical standpoint, it has allowed software developers to easily provide a common interface to fit these types of models. In R, the common function call to fit GLMs is glm.

Within the GLM framework, the elements that separate different regression models include the link function and the error distribution. The error distribution encodes the assumption you are enforcing about how the errors after fitting the model are distributed. If the outcome data are normally distributed (a.k.a., follow a Gaussian distribution), after accounting for variance explained in the outcome by any of the model covariates, then a linear regression model may be appropriate. For count data—like numbers of deaths a day—this is unlikely, unless the average daily mortality count is very high (count data tend to come closer to a normal distribution the further their average gets from 0). For binary data—like whether each person in a study population died on a given day or not—normally distributed errors are also unlikely. Instead, in these two cases, it is typically more appropriate to fit GLMs with Poisson and binomial “families,” respectively, where the family designation includes an appropriate specification for the variance when fitting the model based on these outcome types.

The other element that distinguishes different types of regression within the GLM framework is the link function. The link function specifies the transformation that connects the expected value of the outcome to the combination of independent variables in the regression equation. With normally distributed data, an identity link is often appropriate. With this link, no transformation is applied, so the expected outcome is modeled directly as the combination of independent variables (i.e., it keeps its initial “identity”). With count data, a log link is often more appropriate, while with binomial data, a logit link is often used.
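
In symbols, and as a sketch only (the function \(g\) is notation introduced here for the link function, \(Y_{t}\) is the outcome on day \(t\), and \(X1_{t}\) is a single independent variable), a GLM with one independent variable can be written as:

\(g\left(E(Y_{t})\right)=\beta_{0}+\beta_{1}X1_{t}\)

The three links mentioned above then correspond to the following choices of \(g\), where \(\mu\) stands for the expected value \(E(Y_{t})\):

\(g(\mu)=\mu\) (identity link)

\(g(\mu)=\log(\mu)\) (log link)

\(g(\mu)=\log\left(\displaystyle \frac{\mu}{1-\mu}\right)\) (logit link)

With the log link, for example, \(E(Y_{t})=e^{\beta_{0}+\beta_{1}X1_{t}}\), so a one-unit increase in \(X1_{t}\) multiplies the expected outcome by \(e^{\beta_{1}}\).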

Finally, data will often not perfectly adhere to assumptions. For example, the Poisson family of GLMs assumes that the outcome follows a Poisson distribution (the probability mass function for a Poisson distribution \(X \sim {\sf Poisson}(\mu)\) is \(f(k;\mu)=Pr[X=k]= \displaystyle \frac{\mu^{k}e^{-\mu}}{k!}\), where \(k\) is the number of occurrences and \(\mu\) is the expected number of cases). With this distribution, the variance is equal to the mean (\(\mu=E(X)=Var(X)\)). With real-life data, this assumption is often not valid, and in many cases the variance in real-life count data is larger than the mean. This can be accounted for when fitting a GLM by setting an error distribution that does not require the variance to equal the mean. Instead, both a mean value and something like a variance are estimated from the data, assuming an overdispersion parameter \(\phi\) so that \(Var(X)=\phi E(X)\). In environmental epidemiology, time series models are often fit to allow for this overdispersion. This is because if the data are overdispersed but the model does not account for this, the standard errors on the estimates of the model parameters may be artificially small. If the data are not overdispersed (\(\phi=1\)), the model will identify this when being fit to the data, so it is typically better to allow for overdispersion in the model (if the size of the data were small, you might want to be parsimonious and avoid unneeded complexity in the model, but this is typically not the case with time series data).
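
The effect of ignoring overdispersion can be illustrated with a small simulation. The sketch below is not based on the example data: it generates overdispersed counts from a negative binomial distribution and then compares a Poisson fit with a fit using R’s quasipoisson family, which is assumed here as one common way to let the variance be \(\phi\) times the mean. All object names (x, y, fit_pois, fit_quasi) are made up for this illustration.

```r
# Simulate overdispersed counts: variance = mu + mu^2 / 5, which exceeds mu
set.seed(42)
n <- 2000
x <- rnorm(n)
mu <- exp(3 + 0.1 * x)
y <- rnbinom(n, mu = mu, size = 5)

# Same mean model, fit with and without allowing for overdispersion
fit_pois  <- glm(y ~ x, family = poisson(link = "log"))
fit_quasi <- glm(y ~ x, family = quasipoisson(link = "log"))

# The coefficient estimates agree between the two fits
coef(fit_pois)
coef(fit_quasi)

# The standard errors are larger once overdispersion is allowed for
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
se_pois
se_quasi

# Estimated overdispersion parameter (values above 1 indicate overdispersion)
phi_hat <- summary(fit_quasi)$dispersion
phi_hat
```

In this setup the quasi-Poisson standard errors equal the Poisson ones multiplied by the square root of the estimated dispersion, which is why the Poisson fit looks more precise than it should.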

In the next section, you will work through the steps of developing a GLM to fit the example dataset obs . For now, you will only fit a linear association between mean daily temperature and mortality risk, eventually including control for day of week. In later work, especially the next chapter, we will build up other components of the model, including control for the potential confounders of long-term and seasonal patterns, as well as advancing the model to fit non-linear associations, distributed by time, through splines, a distributed lag approach, and a cross-basis term.

Applied: Fitting a GLM to time series data

In R, the function call used to fit GLMs is glm . Most of you have likely covered GLMs, and ideally this function call, in previous courses. If you are unfamiliar with its basic use, you will want to refresh yourself on this topic—you can use some of the resources noted earlier in this section and in the chapter’s “Supplemental Readings” to do so.

  • Fit a GLM to estimate the association between mean daily temperature (as the independent variable) and daily mortality count (as the dependent variable), first fitting a linear regression. (Since the mortality data are counts, we will want to shift to a different type of regression within the GLM framework, but this step allows you to develop a simple glm call, and to remember where to include the data and the independent and dependent variables within this function call.)
  • Change your function call to fit a regression model in the Poisson family.
  • Change your function call to allow for overdispersion in the outcome data (daily mortality count). How does the estimated coefficient for temperature change between the Poisson model from the previous step and this model? Check both the central estimate and its estimated standard error.
  • Change your function call to include control for day of week.
For the first step (the linear regression of daily mortality count on mean daily temperature), this is the model you are fitting:

\(Y_{t}=\beta_{0}+\beta_{1}X1_{t}+\epsilon\)

where \(Y_{t}\) is the mortality count on day \(t\) , \(X1_{t}\) is the mean temperature for day \(t\) and \(\epsilon\) is the error term. Since this is a linear model we are assuming a Gaussian error distribution \(\epsilon \sim {\sf N}(0, \sigma^{2})\) , where \(\sigma^{2}\) is the variance not explained by the covariates (here just temperature).

To do this, you will use the glm call. If you would like to save model fit results to use later, you assign the output a name as an R object ( mod_linear_reg in the example code). If your study data are in a dataframe, you can specify these data in the glm call with the data parameter. Once you do this, you can use column names directly in the model formula. In the model formula, the dependent variable is specified first ( all , the column for daily mortality counts for all ages, in this example), followed by a tilde ( ~ ), followed by all independent variables (only tmean in this example). If multiple independent variables are included, they are joined using + . We’ll see an example when we start adding control for confounders later.
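The call described above can be sketched as follows. Because the example dataset `obs` is not reproduced here, the code first simulates a stand-in data frame with the same column names ( `all` and `tmean` ). The simulation settings are assumptions made only so that the example runs on its own.

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 730
obs <- data.frame(tmean = runif(n, min = 0, max = 25))
obs$all <- rnbinom(n, mu = exp(5 - 0.01 * obs$tmean), size = 30)

# Dependent variable, then a tilde, then the independent variable(s).
# With no family given, glm() uses the Gaussian family with an identity
# link, which is an ordinary linear regression.
mod_linear_reg <- glm(all ~ tmean, data = obs)
```

Since `data = obs` is supplied, the formula can refer to the columns `all` and `tmean` directly.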

Once you have fit a model and assigned it to an R object, you can explore it and use resulting values. First, the print method for a regression model gives some summary information. This method is automatically called if you enter the model object’s name at the console:

More information is printed if you run the summary method on the model object:
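A minimal sketch of both methods is shown here, again using a simulated stand-in for `obs` (an assumption, since the real data are not included in this text). The printed numbers will therefore differ from those for the real data.

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 730
obs <- data.frame(tmean = runif(n, min = 0, max = 25))
obs$all <- rnbinom(n, mu = exp(5 - 0.01 * obs$tmean), size = 30)
mod_linear_reg <- glm(all ~ tmean, data = obs)

# Same output as typing the object's name at the console
print(mod_linear_reg)

# Fuller output: coefficient table, dispersion, deviances, AIC
mod_summary <- summary(mod_linear_reg)
print(mod_summary)

# The coefficient table can also be pulled out as a matrix
coef_table <- coef(mod_summary)
print(coef_table)
```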

Make sure you are familiar with the information provided from the model object, as well as how to interpret values like the coefficient estimates and their standard errors and p-values. These basic elements should have been covered in previous coursework (even if a different programming language was used to fit the model), and so we will not be covering them in great depth here, but instead focusing on some of the more advanced elements of how regression models are commonly fit to data from time series and case-crossover study designs in environmental epidemiology. For a refresher on the basics of fitting statistical models in R, you may want to check out Chapters 22 through 24 of Wickham and Grolemund ( 2016 ) , a book that is available online, as well as Dunn and Smyth ( 2018 ) and James et al. ( 2013 ) .

Finally, there are some newer tools for extracting information from model fit objects. The broom package extracts different elements from these objects and returns them in a “tidy” data format, which makes it much easier to use the output further in analysis with functions from the “tidyverse” suite of R packages. These tools are very popular and powerful, and so the broom tools can be very useful in working with output from regression modeling in R.

The broom package includes three main functions for extracting data from regression model objects. First, the glance function returns overall data about the model fit, including the AIC and BIC:

The tidy function returns data at the level of the model coefficients, including the estimate for each model parameter, its standard error, test statistic, and p-value.

Finally, the augment function returns data at the level of the original observations, including the fitted value for each observation, the residual between the fitted and true value, and some measures of influence on the model fit.
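The three functions can be tried together as in the sketch below. It assumes the broom package is installed and skips the broom calls otherwise. The data are a simulated stand-in for `obs` , and the object names `model_level` , `coef_level` , and `obs_level` are hypothetical names chosen for this example.

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 730
obs <- data.frame(tmean = runif(n, min = 0, max = 25))
obs$all <- rnbinom(n, mu = exp(5 - 0.01 * obs$tmean), size = 30)
mod_linear_reg <- glm(all ~ tmean, data = obs)

# broom is an add-on package, so only run this part if it is available
if (requireNamespace("broom", quietly = TRUE)) {
  model_level <- broom::glance(mod_linear_reg)   # one row for the whole model
  coef_level  <- broom::tidy(mod_linear_reg)     # one row per coefficient
  obs_level   <- broom::augment(mod_linear_reg)  # one row per observation

  print(model_level)
  print(coef_level)
  print(head(obs_level))
}
```

The `.fitted` column returned by `augment` holds the fitted value for each observation, which is the column you would plot against `tmean` .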

One way you can use augment is to graph the fitted values for each observation after fitting the model:


For more on the broom package, including some excellent examples of how it can be used to streamline complex regression analyses, see Robinson ( 2014 ) . There is also a nice example of how it can be used in one of the chapters of Wickham and Grolemund ( 2016 ) , available online at https://r4ds.had.co.nz/many-models.html .

A linear regression is often not appropriate when fitting a model where the outcome variable provides counts, as with the example data, since such data often don’t follow a normal distribution. A Poisson regression is typically preferred.

For a count outcome where \(Y \sim {\sf Poisson}(\mu)\) we typically fit a model such as

\(g(Y)=\beta_{0}+\beta_{1}X1\) , where \(g()\) represents the link function, in this case a log function so that \(log(Y)=\beta_{0}+\beta_{1}X1\) . We can also express this as \(Y=exp(\beta_{0}+\beta_{1}X1)\) . (Strictly, the link is applied to the expected count \(\mu=E(Y)\) rather than to each observed count.)

In the glm call, you can specify this with the family parameter, for which “poisson” is one choice.
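A sketch of the changed call follows. The data are a simulated stand-in for `obs` (an assumption), and `mod_pois_reg` is a hypothetical object name used for this example.

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 730
obs <- data.frame(tmean = runif(n, min = 0, max = 25))
obs$all <- rnbinom(n, mu = exp(5 - 0.01 * obs$tmean), size = 30)

# Same formula as the linear regression, now with the Poisson family.
# The Poisson family uses a log link by default.
mod_pois_reg <- glm(all ~ tmean, data = obs, family = "poisson")

print(family(mod_pois_reg))
```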

One thing to keep in mind with this change is that the model now uses a non-identity link between the combination of independent variable(s) and the dependent variable. You will need to keep this in mind when you interpret the estimates of the regression coefficients. While the coefficient estimate for tmean from the linear regression could be interpreted as the expected increase in mortality counts for a one-unit (i.e., one degree Celsius) increase in temperature, now the estimated coefficient should be interpreted as the expected change in the natural log of the expected mortality count for a one-unit increase in temperature.

You can see this even more clearly if you take a look at the association between temperature for each observation and the expected mortality count fit by the model. First, if you look at the fitted values without transforming them, they are still on the log scale. You can see by looking at the range of the y-scale that these values are for the log of expected mortality, rather than expected mortality (compare, for example, to the similar plot shown from the first model, which was linear), and that the fitted association is linear for that transformation, not for untransformed mortality counts:


You can use exponentiation to transform the fitted values back to just be the expected mortality count based on the model fit. Once you make this transformation, you can see how the link in the Poisson family specification enforced a curved relationship between mean daily temperature and the untransformed expected mortality count.
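The relationship between the two scales can be checked numerically. This is a sketch on a simulated stand-in for `obs` (an assumption), using the base R `predict` method rather than the plots described above. `mod_pois_reg` , `fit_link` , and `fit_resp` are hypothetical names.

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 730
obs <- data.frame(tmean = runif(n, min = 0, max = 25))
obs$all <- rnbinom(n, mu = exp(5 - 0.01 * obs$tmean), size = 30)
mod_pois_reg <- glm(all ~ tmean, data = obs, family = "poisson")

# Fitted values on the scale of the link (log of expected count)
fit_link <- predict(mod_pois_reg, type = "link")

# Fitted values on the scale of the outcome (expected count)
fit_resp <- predict(mod_pois_reg, type = "response")

# Exponentiating the link-scale values recovers the expected counts
print(all.equal(exp(fit_link), fit_resp))

# On the link scale the fitted values fall exactly on a straight line
print(cor(obs$tmean, fit_link))
```

Plotting `fit_link` against `obs$tmean` gives a straight line, while plotting `fit_resp` gives the curved relationship.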


For this model, we can interpret the coefficient for the temperature covariate as the expected log relative risk in the health outcome associated with a one-unit increase in temperature. We can exponentiate this value to get an estimate of the relative risk:

If you want to estimate the confidence interval for this estimate, you should calculate the interval on the log scale first and then exponentiate its lower and upper bounds.
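One way to carry out this calculation is sketched below, using a Wald-type 95% interval (estimate plus or minus 1.96 standard errors on the log scale). The data are a simulated stand-in for `obs` , and the choice of a Wald interval is an assumption made for this example rather than the only option.

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 730
obs <- data.frame(tmean = runif(n, min = 0, max = 25))
obs$all <- rnbinom(n, mu = exp(5 - 0.01 * obs$tmean), size = 30)
mod_pois_reg <- glm(all ~ tmean, data = obs, family = "poisson")

# Pull the log relative risk and its standard error for temperature
coefs <- coef(summary(mod_pois_reg))
log_rr <- coefs["tmean", "Estimate"]
se <- coefs["tmean", "Std. Error"]

# Build the interval on the log scale first ...
ci_log <- log_rr + c(-1, 1) * qnorm(0.975) * se

# ... then exponentiate the estimate and both bounds
rr <- exp(log_rr)
rr_ci <- exp(ci_log)

print(c(rr = rr, lower = rr_ci[1], upper = rr_ci[2]))
```

Because exponentiation is applied last, the resulting interval is not symmetric around the relative risk estimate.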

In the R glm call, there is a family that is similar to Poisson (including using a log link), but that allows for overdispersion. You can specify it with the “quasipoisson” choice for the family parameter in the glm call:
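A sketch of this call follows, with a simulated stand-in for `obs` (an assumption) and the hypothetical object name `mod_ovdisp_reg` .

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 730
obs <- data.frame(tmean = runif(n, min = 0, max = 25))
obs$all <- rnbinom(n, mu = exp(5 - 0.01 * obs$tmean), size = 30)

# Same formula, but the quasipoisson family estimates a dispersion
# parameter instead of fixing it at 1
mod_ovdisp_reg <- glm(all ~ tmean, data = obs, family = "quasipoisson")

print(summary(mod_ovdisp_reg))
```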

When you use this family, there will be some new information in the summary for the model object. It will now include a dispersion parameter ( \(\phi\) ). If this is close to 1, then the data were close to the assumed variance for a Poisson distribution (i.e., there was little evidence of overdispersion). In the example, the estimated dispersion parameter is around 5, suggesting the data are overdispersed (this might come down some when we start including independent variables that explain some of the variation in the outcome variable, like long-term and seasonal trends).

If you compare the estimates of the temperature coefficient from the Poisson regression with those when you allow for overdispersion, you’ll see something interesting:

The central estimate ( estimate column) is very similar. However, the estimated standard error is larger when the model allows for overdispersion. This indicates that the Poisson model was too simple, and that its inherent assumption that data were not overdispersed was problematic. If you naively used a Poisson regression in this case, then you would estimate a confidence interval on the temperature coefficient that would be too narrow. This could cause you to conclude that the estimate was statistically significant when you should not have (although in this case, the estimate is statistically significant under both models).
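The comparison can be made concrete with the sketch below. It uses a simulated stand-in for `obs` , so the numbers are not those of the real example, and the object names are hypothetical. In R's `glm` , the quasi-Poisson standard errors are the Poisson standard errors multiplied by the square root of the estimated dispersion.

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 730
obs <- data.frame(tmean = runif(n, min = 0, max = 25))
obs$all <- rnbinom(n, mu = exp(5 - 0.01 * obs$tmean), size = 30)

mod_pois_reg <- glm(all ~ tmean, data = obs, family = "poisson")
mod_ovdisp_reg <- glm(all ~ tmean, data = obs, family = "quasipoisson")

pois_coefs <- coef(summary(mod_pois_reg))
quasi_coefs <- coef(summary(mod_ovdisp_reg))

# Estimated dispersion parameter from the quasi-Poisson fit
phi <- summary(mod_ovdisp_reg)$dispersion

# Central estimates match; standard errors are inflated by sqrt(phi)
se_ratio <- quasi_coefs["tmean", "Std. Error"] / pois_coefs["tmean", "Std. Error"]

print(rbind(poisson = pois_coefs["tmean", 1:2],
            quasipoisson = quasi_coefs["tmean", 1:2]))
print(c(phi = phi, se_ratio = se_ratio, sqrt_phi = sqrt(phi)))
```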

Day of week is included in the data as a categorical variable, using a data type in R called a factor. You are now essentially fitting this model:

\(log(Y)=\beta_{0}+\beta_{1}X1+\gamma^{'}X2\) ,

where \(X2\) is a categorical variable for day of the week and \(\gamma^{'}\) represents a vector of parameters associated with each category.

It is pretty straightforward to include factors as independent variables in calls to glm : you just add the column name to the list of other independent variables with a + . In this case, we need to do one more step: earlier, we added order to dow , so it would “remember” the order of the week days (Monday before Tuesday, etc.). However, we need to strip off this order before we include the factor in the glm call. One way to do this is with the factor call, specifying ordered = FALSE . Here is the full call to fit this model:
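A sketch of such a call is shown below. It is run on a simulated stand-in for `obs` with an ordered `dow` column, and the day-of-week effects in the simulation are assumed values. `mod_ctrl_dow` and `mod_ordered` are hypothetical object names. The second model is included only to show why the order is stripped: for an ordered factor, R fits polynomial contrasts by default instead of one parameter per non-baseline day.

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 728
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
obs <- data.frame(
  tmean = runif(n, min = 0, max = 25),
  dow = factor(rep(days, length.out = n), levels = days, ordered = TRUE)
)
dow_effect <- c(0, 0.03, 0.02, 0.02, 0.01, 0.02, 0.01)  # assumed log-scale effects
obs$all <- rnbinom(
  n,
  mu = exp(5 - 0.01 * obs$tmean + dow_effect[as.integer(obs$dow)]),
  size = 30
)

# Strip the ordering inside the formula so each day gets its own parameter
mod_ctrl_dow <- glm(all ~ tmean + factor(dow, ordered = FALSE),
                    data = obs, family = "quasipoisson")
print(names(coef(mod_ctrl_dow)))

# For contrast: leaving dow ordered gives polynomial terms (dow.L, dow.Q, ...)
mod_ordered <- glm(all ~ tmean + dow, data = obs, family = "quasipoisson")
print(names(coef(mod_ordered)))
```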

When you look at the summary for the model object, you can see that the model has fit a separate model parameter for six of the seven weekdays. The one weekday that isn’t fit (Sunday in this case) serves as a baseline. These estimates specify how the log of the expected mortality count is expected to differ on, for example, Monday versus Sunday (by about 0.03), if the temperature is the same for the two days.

You can also see from this summary that the coefficients for the day of the week are all statistically significant. Even though we didn’t see a big difference in mortality counts by day of week in our exploratory analysis, this suggests that it does help explain some variance in mortality observations and will likely be worth including in the final model.

The model now includes day of week when fitting an expected mortality count for each observation. As a result, if you plot fitted values of expected mortality versus mean daily temperature, you’ll see some “hoppiness” in the fitted line:


This is because each fitted value is also incorporating the expected influence of day of week on the mortality count, and that varies across the observations (i.e., you could have two days with the same temperature, but different expected mortality from the model, because they occur on different days).
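This can be checked directly by predicting for two days that share a temperature but fall on different weekdays. The sketch below does so on a simulated stand-in for `obs` (assumed values throughout), with hypothetical object names. The temperature of 15 degrees is an arbitrary choice.

```r
# Simulated stand-in for the example data (assumed values, not the real obs)
set.seed(42)
n <- 728
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
obs <- data.frame(
  tmean = runif(n, min = 0, max = 25),
  dow = factor(rep(days, length.out = n), levels = days, ordered = TRUE)
)
dow_effect <- c(0, 0.03, 0.02, 0.02, 0.01, 0.02, 0.01)  # assumed log-scale effects
obs$all <- rnbinom(
  n,
  mu = exp(5 - 0.01 * obs$tmean + dow_effect[as.integer(obs$dow)]),
  size = 30
)
mod_ctrl_dow <- glm(all ~ tmean + factor(dow, ordered = FALSE),
                    data = obs, family = "quasipoisson")

# Two hypothetical days: same temperature, different day of week
new_days <- data.frame(
  tmean = c(15, 15),
  dow = factor(c("Sunday", "Monday"), levels = days, ordered = TRUE)
)
expected <- predict(mod_ctrl_dow, newdata = new_days, type = "response")
names(expected) <- c("Sunday", "Monday")
print(expected)

# The ratio of the two predictions equals exp() of the Monday coefficient
monday_coef <- coef(mod_ctrl_dow)[grep("Monday$", names(coef(mod_ctrl_dow)))]
print(c(ratio = unname(expected["Monday"] / expected["Sunday"]),
        exp_coef = unname(exp(monday_coef))))
```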

If you plot the model fits separately for each day of the week, you’ll see that the line is smooth across all observations from the same day of the week:


Wrapping up

At this point, the coefficient estimates suggest that risk of mortality tends to decrease as temperature increases. Do you think this is reasonable? What else might be important to build into the model based on your analysis up to this point?


Society for Epidemiologic Research

Grace periods and exposure misclassification in self-controlled case-series studies of drug-drug interactions


Hanxi Zhang, Warren B Bilker, Charles E Leonard, Sean Hennessy, Todd A Miano, Grace periods and exposure misclassification in self-controlled case-series studies of drug-drug interactions, American Journal of Epidemiology , 2024;, kwae231, https://doi.org/10.1093/aje/kwae231


The self-controlled case-series (SCCS) research design is increasingly used in pharmacoepidemiologic studies of drug-drug interactions (DDIs), with the target of inference being the incidence rate ratio (IRR) associated with concomitant exposure to the object plus precipitant drug versus the object drug alone. While day-level drug exposure can be inferred from dispensing claims, these inferences may be inaccurate, leading to biased IRRs. Grace periods (periods assuming continued treatment impact after days’ supply exhaustion) are frequently used by researchers, but the impact of grace period decisions on bias from exposure misclassification remains unclear. Motivated by an SCCS study examining the potential DDI between clopidogrel (object) and warfarin (precipitant), we investigated bias due to precipitant or object exposure misclassification using simulations. We show that misclassified precipitant treatment always biases the estimated IRR toward the null, whereas misclassified object treatment may lead to bias in either direction or no bias, depending on the scenario. Further, including a grace period for each object dispensing may unintentionally increase the risk of misclassification bias. To minimize such bias, we recommend 1) avoiding the use of grace periods when specifying object drug exposure episodes; and 2) including a washout period following each precipitant exposed period.

  • Online ISSN 1476-6256
  • Print ISSN 0002-9262
  • Copyright © 2024 Johns Hopkins Bloomberg School of Public Health


Open Access

Peer-reviewed

Research Article

Strong effect of demographic changes on Tuberculosis susceptibility in South Africa

Contributed equally to this work with: Oshiomah P. Oyageshio, Justin W. Myrick

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft

* E-mail: [email protected] (OPO); [email protected] (MM); [email protected] (BMH)

Affiliation Center for Population Biology, University of California, Davis, Davis, California, United States of America


Roles Conceptualization, Data curation, Investigation, Methodology, Project administration, Supervision, Validation, Writing – original draft

Affiliation UC Davis Genome Center, University of California, Davis, Davis, California, United States of America

Roles Data curation, Project administration, Supervision

Affiliation DSI-NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa

Roles Data curation, Project administration, Resources

Roles Data curation, Project administration

Affiliation Department of Anthropology, University of California, Davis, Davis, California, United States of America

Roles Investigation, Project administration, Resources

Affiliation Department of Microbiology, Immunology, and Genetics, School of Biomedical Sciences, University of North Texas Health Science Center, Fort Worth, Texas, United States of America

Roles Formal analysis, Investigation

Affiliation Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, United States of America

Roles Funding acquisition, Project administration, Resources, Writing – review & editing

Roles Investigation, Project administration, Supervision, Writing – review & editing

Affiliations DSI-NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa, Centre for Bioinformatics and Computational Biology, Stellenbosch University, Stellenbosch, South Africa

Roles Funding acquisition, Investigation, Project administration, Resources, Supervision, Writing – original draft

Roles Data curation, Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft

Affiliations Center for Population Biology, University of California, Davis, Davis, California, United States of America, UC Davis Genome Center, University of California, Davis, Davis, California, United States of America, Department of Anthropology, University of California, Davis, Davis, California, United States of America

  • Oshiomah P. Oyageshio, 
  • Justin W. Myrick, 
  • Jamie Saayman, 
  • Lena van der Westhuizen, 
  • Dana R. Al-Hindi, 
  • Austin W. Reynolds, 
  • Noah Zaitlen, 
  • Eileen G. Hoal, 
  • Caitlin Uren, 

PLOS

  • Published: July 23, 2024
  • https://doi.org/10.1371/journal.pgph.0002643


South Africa is among the world’s top eight tuberculosis (TB) burden countries, and despite a focus on HIV-TB co-infection, most of the population living with TB are not HIV co-infected. The disease is endemic across the country, with 80–90% exposure by adulthood. We investigated epidemiological risk factors for TB in the Northern Cape Province, South Africa: an understudied TB endemic region with extreme TB incidence (926/100,000). We leveraged the population’s high TB incidence and community transmission to design a case-control study with similar mechanisms of exposure between the groups. We recruited 1,126 participants with suspected TB from 12 community health clinics and generated a cohort of 774 individuals (cases = 374, controls = 400) after implementing our enrollment criteria. All participants were GeneXpert Ultra tested for active TB by a local clinic. We assessed important risk factors for active TB using logistic regression and random forest modeling. We find that factors commonly identified in other global populations tend to replicate in our study, e.g. male gender and residence in a town had significant effects on TB risk (OR: 3.02 [95% CI: 2.30–4.71]; OR: 3.20 [95% CI: 2.26–4.55]). We also tested for demographic factors that may uniquely reflect historical changes in health conditions in South Africa. We find that socioeconomic status (SES) significantly interacts with an individual’s age ( p = 0.0005), indicating that the protective effect of higher SES changed across age cohorts. We further find that being born in a rural area and moving to a town strongly increases TB risk, while town birthplace and current rural residence is protective. These interaction effects reflect rapid demographic changes, specifically SES over recent generations and mobility, in South Africa. Our models show that such risk factors combined explain 19–21% of the variance (r²) in TB case/control status.

Citation: Oyageshio OP, Myrick JW, Saayman J, van der Westhuizen L, Al-Hindi DR, Reynolds AW, et al. (2024) Strong effect of demographic changes on Tuberculosis susceptibility in South Africa. PLOS Glob Public Health 4(7): e0002643. https://doi.org/10.1371/journal.pgph.0002643

Editor: Indira Govender, London School of Hygiene & Tropical Medicine, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND

Received: November 20, 2023; Accepted: June 12, 2024; Published: July 23, 2024

Copyright: © 2024 Oyageshio et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All R scripts for statistical data analysis and visualization are available at https://github.com/oshiomah1/NCTB-Epidemiology-Project . The relevant raw genetic data is deposited in the European Genome-phenome Archive (study accession number: EGAS00001007850). To maintain the privacy and anonymity of our study participants, and following our IRB-approved protocol, epidemiological data is available upon reasonable request. For access, please contact the Stellenbosch University Health Research Ethics Office at [email protected] and Dr. Marlo Moller at [email protected] .

Funding: This work was funded by NIH grant R35GM133531 to BMH. This work was also partially funded by the South African government through the South African Medical Research Council and the National Research Foundation (UID41744) to all members of DSI-NRF Centre of Excellence for Biomedical Tuberculosis Research: MM, CU, LW and JS. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the South African government. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Tuberculosis (TB) is the world’s leading cause of death due to infectious disease, currently greater than COVID-19 [ 1 ]. The causative agent, Mycobacterium tuberculosis (M. tb), is an obligate intracellular pathogen mainly infecting the lungs, and sometimes other organs [ 2 , 3 ]. Approximately 25% of the world’s population is infected with M. tb and the annual death toll is similar to COVID-19 (~1.5 million deaths). South Africa is amongst the top 30 ‘high burden’ countries coping with TB, TB/HIV co-infection, and multidrug-resistant or rifampicin-resistant TB (MDR/RR-TB). TB is South Africa’s leading natural cause of death [ 4 ] with an extremely high prevalence (446/100,000, [ 5 ]) and accounts for 3.3% of all global TB cases [ 1 ]. The Northern Cape presently has the highest TB incidence in South Africa (ZF Mgcawu district 926/100,000 [ 5 ]), but the lowest HIV prevalence (7.1% vs 13.5% national average), including the lowest density of people living with HIV [ 6 ].

Determinants of active TB progression are multifaceted, including genetics, nutrition, social and economic conditions, behavior, and sex-specific biology [ 1 , 7 , 8 ]. Initial M. tb infection is largely determined by exogenous factors, such as TB prevalence in the community, population density (e.g., prisons), and working conditions (e.g., mining, healthcare workers) [ 9 – 13 ]. The lifetime risk of progressing to active TB following infection is 10%. This risk is the highest within the first 5 years of initial infection and is typically considered to be mediated by the host’s innate and cell-mediated immune system [ 9 , 14 ]. Individual (i.e., host) factors, however, have also been shown to increase risk of progressing to active disease. These include HIV/AIDS, poor nutrition or low body mass index, indoor air pollution (e.g., cooking with wood and poor ventilation), smoking, alcohol abuse, diabetes mellitus, and intravenous drug use [ 1 , 9 , 14 , 15 ]. Studies in India have shown undernutrition to be among the strongest determinants for TB risk [ 15 ]. In South Africa specifically, poor living conditions, unemployment, low SES, age and male gender, race, smoking, and marital status have all been identified as contributing to TB risk [ 16 – 21 ].

The extent of these determinants’ effects can vary across and within populations, necessitating epidemiological studies in differing contexts and communities [ 8 ]. Compared to medium or high-TB-incidence countries, the effect sizes for alcohol abuse, homelessness, and intravenous drug use are stronger in low-incidence populations [ 22 ]. In South Africa, multilevel modeling approaches have shown that provincial [ 16 ] and community income inequality [ 18 ] have strong effects on TB incidence and progression, independent of individual-level risk factors.

HIV increases TB risk 20-fold, making it the largest known risk factor for progression to active TB, and TB is the leading cause of AIDS-related deaths [ 23 ]. The effect of HIV on suppressing the host immune system can reactivate a latent M. tb infection and increase susceptibility to initial infection [ 14 , 23 , 24 ]. Despite HIV being the strongest TB determinant, other TB risk factors explain the majority of global TB cases [ 9 ]. In South Africa, 59% of people with TB on the National TB Programme (i.e. on TB medication) are co-infected with HIV [ 25 ]. However, South Africa’s first national TB prevalence survey found that only 28% of people with TB were also people living with HIV (PLWHIV) [ 25 ], a finding underscoring the necessity to extend TB research to those living without HIV in high TB burden areas. At the provincial level in South Africa, HIV prevalence explains little of TB incidence (r² = 0.036) [ 26 ]. The Western and Northern Cape Provinces have among the highest TB incidence yet the lowest HIV prevalence [ 27 ].

Here, we present a TB case-control study characterizing the individual-level risk factors for TB progression among HIV-negative patients with suspected TB from the Northern Cape. The Northern Cape has the highest TB incidence but the lowest HIV prevalence and PLWHIV density, and overall low population density, so canonical risk factors do not appear to be driving the extraordinary incidence rates. To focus on factors other than immune suppression, we exclude PLWHIV from the analysis. Controls from our study sample are people with suspected TB from local health clinics who were microbiologically confirmed to be negative for active TB. Controls are assumed to either have been previously exposed to or infected with M. tb (i.e., LTBI). In South Africa, TB transmission is driven largely by community spread, rather than household contacts [ 28 , 29 ]. Cases, in contrast, are people who have microbiologically confirmed active TB or self-report a past active TB episode. We test three separate models comprising common risk factors, as well as factors that may uniquely affect South Africa. We find that exogenous factors like SES, cohort age, and residence/birthplace have a strong effect on TB progression, often equal to or greater than endogenous factors like gender or smoking/alcohol. These results suggest that further research into the causal mechanisms behind exogenous risk factors and opportunities for TB prevention is warranted.

Research ethics statement

This study has been approved by the Health Research Ethics Committee (HREC) of Stellenbosch University (N11/07/210A) and the Northern Cape Department of Health (NC2015/008). All participants were adults (18 years and older) and provided written informed formal consent. Authors Justin W. Myrick, Jamie Saayman, Lena van der Westhuizen and Marlo Möller had access to identifiable information about participants as they were directly involved in data collection or database management. Access to these records commenced on 26th January 2016, and is still ongoing as it is an integral part of the Northern Cape Tuberculosis Project (NCTB).

Inclusivity in global research

Additional information regarding the ethical, cultural, and scientific considerations specific to inclusivity in global research is included in the Supporting Information ( S1 Checklist ).

Study design and recruitment

Participants (18 years and older) provided written informed consent and were recruited from 12 community health clinics from the ZF Mgcawu district in the Northern Cape Province of South Africa from 26th January 2016 – 15th May 2017, and 11th December 2018 – 11th March 2020. Community health clinics are the front line for TB screening and treatment, visited by 87% of people who seek TB care [ 25 ]. TB nurses referred patients with suspected TB (with ≥2 TB symptoms: cough for ≥2 weeks, night sweats, weight loss, and fever ≥2 weeks, or interaction with a TB index contact) and known TB patients to our on-site RAs. All study participants took a clinic-administered sputum GeneXpert Ultra test for active TB at the time of the study interview and provided saliva for genotyping. Clinic medical charts were accessed by a staff research nurse to record GeneXpert test results and verify HIV status and TB history.

Case-control assignment

Cases and controls were assigned based on both the participant’s medical charts and self-reported data ( Fig 1 ). Cases include any HIV-negative participant who has had active pulmonary TB in their lifetime. Thus, cases could be partitioned into 1) clinically confirmed active TB (n = 208) at the time of enrollment, and 2) self-reported past TB episode(s) (n = 166). GeneXpert results, diagnostic test date, TB strain (drug resistance), and TB medication regimen were used to validate clinically confirmed progression to active TB. Past TB episodes are based on self-report, mainly because older medical charts were not reliably available, had been discarded, or were difficult for clinic staff to locate.

Study participants were categorized as cases or controls based on medical record information and self-reported data. All participants were GeneXpert tested for active TB infection at the time of enrollment. Past TB episodes were self-reported and cross-referenced with medical records when available.

https://doi.org/10.1371/journal.pgph.0002643.g001

We defined controls as HIV-negative clinic patients with suspected TB symptoms who had a negative GeneXpert Ultra result and no history of active pulmonary TB at the time of study enrollment. Controls in our study are likely to be latently infected with M. tb (LTBI) or to have been exposed in their lifetime. A majority of the population in high TB burden South African suburbs have LTBI, 88% by ages 31–35 [ 30 ].

Our exclusion criteria removed participants with unknown TB or HIV status, as well as PLWHIV.

Study covariates

We collected demographic information that included date of birth, place of birth, current residence, self-identified gender, self-reported ethnic identity, and parental ethnic identities. Behavioral variables include smoking, drinking and diabetes (Supplementary Methods in S1 Text ). In our analyses, we only used binary measures for smoking, drinking and diabetes (“Do you smoke?”, “Do you have diabetes?”). Age at enrollment was used as a continuous variable for all analyses and binned for calculating empirical odds. Socioeconomic status (SES) was operationalized as number of years of education, i.e., the highest completed level of education. McKenzie et al. [ 31 ] have shown that education level, in this dataset, positively predicts body mass index in TB controls, so it tracks access to resources and food security, if only as a crude measure.

Residence and birthplace locations are categorized as rural (≤2,000 people) and town (>2,000 people). Population size was derived from the South African census; when census data were absent (e.g., for a farm), we used Google Earth ( earth.google.com ) to estimate population size based on the number of dwellings. Places that did not have a census size available in Stats SA ( statssa.gov.za ) typically were very small communities, like a farm or a small settlement. By visualizing the settlement through Google Earth, we were able to estimate whether the community size was >2,000 people. We used the average household size from the census data from all locations in the district listed by Stats SA ( statssa.gov.za ), then counted the number of dwellings in blocks and multiplied by the average household size [ 4 ] to get an estimated population size. Although this is an imprecise measure of population size, the lack of government census data for a community is itself an indicator of its rural locality.
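
The dwelling-count estimate described above can be sketched as follows. This is a minimal illustration rather than the study's own code: the function names and the example numbers (120 dwellings, a mean household size of 4.0) are hypothetical, and only the 2,000-person threshold comes from the text.

```python
# Illustrative sketch only; not the authors' code.
# Function names and the example inputs are hypothetical.

RURAL_MAX_POPULATION = 2000  # from the text: rural <= 2,000 people, town > 2,000 people


def estimate_population(dwelling_count: int, avg_household_size: float) -> float:
    """Estimate population as (dwellings counted on imagery) x (mean household size)."""
    if dwelling_count < 0 or avg_household_size <= 0:
        raise ValueError("dwelling_count must be >= 0 and avg_household_size > 0")
    return dwelling_count * avg_household_size


def classify_locality(population: float) -> str:
    """Return 'rural' for <= 2,000 people and 'town' for > 2,000 people."""
    return "rural" if population <= RURAL_MAX_POPULATION else "town"


# Hypothetical farm settlement: 120 dwellings, assumed mean household size of 4.0.
farm_population = estimate_population(120, 4.0)
farm_class = classify_locality(farm_population)
```

Because the classification is a threshold, the estimate only needs to be accurate enough to place a settlement on the correct side of 2,000 people.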

Statistical analyses

Statistical analyses were performed in R (version 4.2.3). We calculated Pearson correlations with the R package ggcorrplot . All categorical variables were numerically coded to “0” and “1”. Classification models for our binary, qualitative dependent variable (“case”/ “control”) included logistic regression and random forest—a machine learning classifier robust to non-linear associations and unknown variable interactions [ 32 ] (Supplementary Methods in S1 Text ). To calculate empirical odds, we binned our participants into 7 age groups, dividing the number of controls by the number of cases in each age bin. All R scripts for analyses are available at https://github.com/oshiomah1/NCTB-Epidemiology-Project .

Imputing missing data

To maximize our sample size, we imputed missing data for diabetes, smoking, and years of education (Table C in S1 Text ). The proportion of missing data was low overall (below 5%), except for alcohol use, for which missingness was 35%; the alcohol use measure was implemented after a pilot study. We chose to exclude alcohol from imputation and model analyses due to high missingness. Multivariate imputation was performed using the R package MICE, implementing chained equations in which every variable with incomplete data is imputed conditional on all the data from other variables in the dataset [ 33 ]. To initiate the MICE procedure, we created a matrix of variables consisting of age, gender, height, HIV, mother’s ethnicity, father’s ethnicity, diabetes, smoking status, and years of education. Notably, ethnicity and height variables were not used in our regression analysis but bore potential relevance to our missing variables, so they were included to improve the statistical inference of our imputation. We set the parameters of our imputation using recommended settings [ 33 , 34 ], generating two imputed datasets ( m = 2 ) that were run for 10 iterations each ( maxit = 10 ). We used a classification and regression tree method, which is robust in epidemiological datasets similar to ours [ 35 ].

To cross-validate our imputation method, we randomly sampled ten percent of known values in each variable and converted them to missing values (Table C in S1 Text ). Next, these missing values were imputed using the procedure described above, and the missing value was compared to the original value. For continuous variables, the average percent difference between imputed and original value was used to calculate the cross-validation (CV) score while the average accuracy of the imputed variable was used to generate the CV score for binary variables. This procedure was carried out on one variable at a time for 100 iterations. Cross-validation results revealed that years of education and diabetes were sufficiently imputed (CV score > 10%).
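
The hold-out check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: a simple mean or mode fill stands in for the MICE and regression-tree imputation, and every function name and toy dataset is hypothetical. Only the 10% masking fraction, the 100 iterations, and the two score definitions (mean percent difference for continuous variables, accuracy for binary ones) follow the text.

```python
import random
from statistics import mean

# Illustrative sketch only; not the authors' code. A mean/mode fill stands in
# for MICE with regression trees, and all names and data are hypothetical.


def mask_known_values(values, fraction, rng):
    """Hide a random fraction of the known (non-None) values.

    Returns the masked copy and the list of hidden indices."""
    known = [i for i, v in enumerate(values) if v is not None]
    hidden = rng.sample(known, max(1, round(fraction * len(known))))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def impute_mean(values):
    """Stand-in imputer for a continuous variable: fill gaps with the observed mean."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Stand-in imputer for a binary variable: fill gaps with the most common value."""
    observed = [v for v in values if v is not None]
    fill = max(set(observed), key=observed.count)
    return [fill if v is None else v for v in values]


def percent_difference(original, imputed, hidden):
    """Mean absolute percent difference on hidden entries (zero originals are skipped)."""
    diffs = [abs(imputed[i] - original[i]) / abs(original[i]) * 100
             for i in hidden if original[i] != 0]
    return mean(diffs) if diffs else 0.0


def accuracy(original, imputed, hidden):
    """Share of hidden binary entries that were imputed to their original value."""
    return mean(1.0 if imputed[i] == original[i] else 0.0 for i in hidden)


def cross_validate(values, impute, score, fraction=0.10, iterations=100, seed=0):
    """Repeat mask -> impute -> score and return the average score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        masked, hidden = mask_known_values(values, fraction, rng)
        scores.append(score(values, impute(masked), hidden))
    return mean(scores)


# Toy data chosen so the outcome is obvious: constant columns are recovered exactly.
years_of_education = [8, 8, 8, 8, None, 8, 8, 8, 8, 8, 8]
diabetes = [0] * 20
cv_education = cross_validate(years_of_education, impute_mean, percent_difference)
cv_diabetes = cross_validate(diabetes, impute_mode, accuracy)
```

Masking one variable at a time, as in the text, keeps the other variables intact so that a conditional imputer can still draw on them.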

Obtaining and visualizing model coefficients

After MICE imputation, we used the ‘psfmi’ package [ 36 ] to implement our logistic regression models, obtaining pooled odds ratios using Rubin’s Rules [ 37 ]. Each model was Bonferroni corrected using a baseline of p <0.05. For the Residence model, we set lifetime rural dwellers as the baseline and manually calculated contrasts for the other three comparisons. To illustrate the covariate effects from our models, we extracted the first imputed dataset from the MICE output, used the R ‘glm’ function to implement the logistic regression models, and then used the ‘effects’ package [ 38 ] to visualize the odds of Active TB.
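
As a rough illustration of the pooling step, Rubin's Rules combine the per-dataset log odds ratios into one estimate whose variance adds the within-imputation and between-imputation components. The sketch below is not the 'psfmi' implementation: the function names and input numbers are hypothetical, the interval uses a normal approximation instead of the t reference distribution, and the number of tests passed to the Bonferroni helper is an assumption because the text does not state it.

```python
import math

# Illustrative sketch only; not the 'psfmi' implementation.
# All names and the example inputs are hypothetical.


def pool_rubin(log_odds_ratios, standard_errors):
    """Pool one coefficient across m imputed datasets with Rubin's Rules.

    Returns the pooled log odds ratio and its pooled standard error."""
    m = len(log_odds_ratios)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need matching estimates and standard errors from >= 2 imputations")
    pooled = sum(log_odds_ratios) / m
    within = sum(se ** 2 for se in standard_errors) / m            # mean sampling variance
    between = sum((q - pooled) ** 2 for q in log_odds_ratios) / (m - 1)
    total = within + (1 + 1 / m) * between                         # total variance
    return pooled, math.sqrt(total)


def odds_ratio_with_interval(pooled_log_or, pooled_se, z=1.96):
    """Back-transform to the odds-ratio scale (normal approximation, an assumption)."""
    return (math.exp(pooled_log_or),
            math.exp(pooled_log_or - z * pooled_se),
            math.exp(pooled_log_or + z * pooled_se))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical coefficient from m = 2 imputed datasets.
pooled_log_or, pooled_se = pool_rubin([1.10, 1.12], [0.16, 0.16])
pooled_or, ci_low, ci_high = odds_ratio_with_interval(pooled_log_or, pooled_se)
threshold = bonferroni_threshold(0.05, 6)  # 6 tests is an assumption, not from the text
```

With only two imputed datasets, the between-imputation variance rests on a single degree of freedom, so the normal-approximation interval here should be read as a simplification.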

Genetic data processing & ancestry estimation

A subset of participants ( n = 159) was genotyped for >2 million single nucleotide polymorphisms (SNPs) on the Illumina H3Africa array. Genetic data processing involved DNA extraction from saliva samples, common variant calling with GenomeStudio, rare variant calling with zCall, and further data cleaning using plink2 (Supplementary Methods in S1 Text ). Global (i.e., genome-wide) ancestry estimates were calculated using ADMIXTURE v1.13 [ 39 ]. The Luhya, Maasai, Himba, British, Palestinian, Chinese, Bangladeshi, Tamil, Ju|’hoansi San, Khomani San, and Nama populations were used as reference groups encompassing all major ancestry sources. ADMIXTURE was run in groups of maximally unrelated individuals to avoid biasing the ancestry estimates. We assumed k = 5 possible ancestries, inferred in unsupervised mode for each of the running groups. After matching clusters, we merged ancestry estimates across all running groups, averaging individuals that appeared in multiple running groups using pong [ 40 ]. We further tested whether population stratification affected the results of the logistic regression models by including 10 principal components (computed with plink2 ) and re-computing regressions for just the subset of n = 159 individuals (Supplementary Methods in S1 Text ).

TB case-control classification

A total of 1,126 participants were partitioned into preliminary cases, preliminary controls, and unverified TB status (571, 504, and 51, respectively). After excluding participants with unverified TB status and those with either unknown or positive HIV status, 774 participants remained in the study (374 cases and 400 controls; Table A in S1 Text ).

Socio-behavioral and demographic characteristics of the cohort

Men and women were equally represented in the dataset (384 vs. 390, respectively, Table A in S1 Text ). Men were more likely to drink alcohol (p < 0.001) and smoke (p < 0.05). A high fraction of our participants self-reported smoking (67%) and drinking alcohol (46%); smoking and drinking were moderately correlated with each other ( r = 0.36, p < 0.05; Fig A in S1 Text ). Women were more likely to have diabetes (p = 0.0004; Fig A in S1 Text ) and, on average, had more education than men (female mean = 8.2 years, male mean = 7.8 years).

We use the number of years of education as a proxy for socio-economic status (SES; Methods ). The mean educational attainment in our cohort is 8 years, equivalent to completing primary school, and is similar between individuals recruited in rural areas and towns (ANOVA, p > 0.1). In the ZF Mgcawu District census [ 41 ], 13% of adults have not completed primary school as compared to 25.3% of our participants. Age was moderately correlated with SES (r = -0.5, p < 0.05; Fig A in S1 Text ) such that older participants tended to have lower SES. Cases and controls had similar mean ages, 43.6 and 43.1 years respectively (Wilcoxon rank sum p -value = 0.959) ( Fig 2A ). We found a significant difference in SES between cases and controls, with means of 7.7 years and 8.3 years, respectively (Wilcoxon rank sum p -value = 0.0019) ( Fig 2B ).

Density plots of continuous variables A) Age by case-control status B) SES by case-control status.

https://doi.org/10.1371/journal.pgph.0002643.g002

To investigate the possibility of selection bias, we computed empirical odds of active TB by age group. Assuming that age is a cumulative measure of exposure (that is, capturing the amount of time someone is exposed to TB), the empirical odds of TB should increase monotonically with age. Instead, we observe a non-monotonic trend: the odds of active TB progressively increase from ages 18 up to 38, then reverse and progressively decrease from age 39 onward, with the 79–88 age group having the lowest empirical odds (Fig B in S1 Text ).
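
A minimal sketch of this check is given below. It is not the authors' script: the bin edges are assumed (seven groups spanning ages 18 to 88, consistent with the 18–28, 29–38, and 79–88 groups named in the text), the odds of active TB in a bin are assumed to be the number of cases divided by the number of controls, and the toy records are invented for illustration.

```python
from collections import Counter

# Illustrative sketch only; not the authors' code.
# Bin edges are assumed; the toy records below are invented.
AGE_BINS = [(18, 28), (29, 38), (39, 48), (49, 58), (59, 68), (69, 78), (79, 88)]


def age_bin(age):
    """Return the (low, high) bin containing age, or None if age falls outside all bins."""
    for low, high in AGE_BINS:
        if low <= age <= high:
            return (low, high)
    return None


def empirical_odds(records):
    """records: iterable of (age, is_case) pairs.

    Returns {bin: cases / controls}, with None where a bin has no controls."""
    cases, controls = Counter(), Counter()
    for age, is_case in records:
        bin_key = age_bin(age)
        if bin_key is None:
            continue
        if is_case:
            cases[bin_key] += 1
        else:
            controls[bin_key] += 1
    return {b: (cases[b] / controls[b] if controls[b] else None) for b in AGE_BINS}


def is_monotonic_increasing(values):
    """True if the non-missing values never decrease from one bin to the next."""
    observed = [v for v in values if v is not None]
    return all(a <= b for a, b in zip(observed, observed[1:]))


# Invented toy records: (age, is_case)
toy_records = [
    (20, True), (22, False), (25, False),
    (30, True), (33, True), (35, False),
    (45, True), (47, False),
    (80, False), (85, False), (83, True),
]
toy_odds = empirical_odds(toy_records)
toy_monotonic = is_monotonic_increasing(toy_odds[b] for b in AGE_BINS)
```

In the toy data the odds rise from the first to the second bin and then fall, which is the same qualitative shape the monotonicity check is meant to flag.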

Ethnicity and genetic ancestry

Individuals were asked to self-identify their ethnicity without categorical prompts. Most participants (88.4%; both TB cases and controls) self-identify as Coloured, followed by 4.6% as Tswana, 4.2% as a Khoe-San ethnicity (e.g., Nama, San), 1.3% as Xhosa, and 1.9% as “other”. Whilst we acknowledge that in some contexts the term “Coloured” has derogatory connotations, in South Africa it is a recognized ethnicity as well as a racial category. People who self-identify using this term tend to have genetic ancestries from multiple geographic origins, including the indigenous Khoe-San groups (e.g., Khoekhoe, San), Bantu-speaking, European, Indian, Malaysian (Southeast Asian) slaves, or people of mixed ancestry and their descendants [ 42 ]. The use of “Coloured” in this context reflects the self-identified cultural attributes of the participants, as well as possible historical and genetic attributes. Ethnicity is reported out of respect for participants’ choice of identity.

Genetic ancestry characterization was performed for 159 participants to assess if there was significant variation among clinics which could potentially confound analysis. Since genetic information was not available for all participants in the study, individual ancestry was not directly factored into the logistic regression models and random forest models. Khoe-San ancestry varied across clinic locations ( Fig 3A ) but remained the majority ancestry at each site (mean = 56%), followed by Bantu-speaking African ancestry (mean = 21%), European ancestry (mean = 16%), South Asian ancestry (mean = 5%), and East Asian ancestry (mean = 2%) ( Fig 3B ).

A subset of participants (n = 159) was genotyped to obtain the average genome-wide ancestry proportions across all individuals for each clinic. A) Khoe-San ancestry is the largest proportion of ancestry in our sample, and it varies significantly across study sites. The boxplots show the median, the 25th and 75th percentile and 1.5 times said percentile, and all outliers as dots. B) The study population is admixed with 5 distinct ancestries, with the southern African indigenous Khoe-San ancestry being the highest proportion of ancestry at all study sites.

https://doi.org/10.1371/journal.pgph.0002643.g003

There was statistically significant variation in ancestry at the clinic level (ANOVA: Khoe-San, F = 6.9, p < 1e-06; Bantu-speaking, F = 7, p < 6e-05; European, F = 5, p < 6e-05; South Asian, F = 8, p < 4.8e-08; East Asian, F = 4.5, p < 0.001, Fig 3A ). The statistically significant differences in the proportion of mean ancestry were generally between clinics in the Kalahari (Askham and Rietfontein, except Groot Mier) versus clinics along the Orange River (Harry Surtie, Dorp, Kakamas, and Keimoes) determined by Tukey HSD post hoc tests. The Kalahari clinics tended to have more Khoe-San ancestry (20–30% more) than the Orange River clinics and less Bantu-speaking African ancestry (~20% less). Notably, Groot Mier in the Kalahari had more European (11–19% more) and greater South and East Asian ancestry (4–10% and 2–3% more, respectively) than most other clinics ( Fig 3B ), likely due to Groot Mier’s history as an early European colonial post [ 43 ].

Hypotheses and models of progression to active TB

We designed three logistic regression models [ 44 ] to examine the risk factors that determine TB case/control status in our sample. Our first model, which we termed the “common risk factor” model (n = 774), includes six covariates known to be common behavioral or demographic risk factors for TB. We hypothesized that risk factors identified from prior studies are also significantly associated with active TB progression in our HIV-negative population.

Common risk factor model: TB Status ~ gender + smoking + diabetes + residence + age + SES

Health disparities are one of the many consequences of apartheid in South Africa [ 45 , 46 ]. The end of apartheid circa 1994 improved social mobility and educational access; however, health disparities in the Northern Cape (and other provinces) remain problematic [ 47 ]. We formulated two alternative models which included variables potentially important to South African populations, involving the change in SES over time, and migration between rural and urban areas. We hypothesized that there are differential effects of SES (education) on TB status due to the sociopolitical effects of apartheid. To capture the effect of lived experience vis-à-vis Apartheid on TB outcomes, we designed an “SES model” (n = 774). This model includes the common risk factors as above but allows for an interaction between age and SES to account for changing economic conditions over the past eighty years. For example, completion of a high school equivalent education in 1960 did not afford the same economic benefits as completion of a high school education in 2010. We predict that for younger cohorts, higher SES is protective against TB; in contrast, for individuals born during apartheid, higher SES would have little effect on lifetime TB status. Age was kept as a continuous variable because apartheid was not a historically binary event.

SES model: TB Status ~ common risk factor model + age * SES

Residing in an urban or rural environment is an established risk factor for TB status [ 11 , 48 – 50 ] and was included in our common risk factor model, as above. Previously, we have shown that migration from an individual’s natal town has increased over the past two generations in the rural Northern Cape Province [ 31 ]. In addition, a recent longitudinal study leveraging data from South Africa’s National Health Laboratory Service showed that incorporating cross-municipality migration improves the ability to predict TB incidence [ 51 ]. For our “residence model”, we hypothesized that migrating from a rural to urban area in one’s lifetime increases the odds of active TB status due to greater exposure to M. tb. Here, we include an interaction between current residence and birthplace in the common risk factor model. Setting this interaction allows us to examine four patterns, namely: lifetime rural residence, rural birthplace to urban residence, urban birthplace to rural residence, and lifetime urban residence.

Residence Model: TB Status ~ common risk factor model + residence * birthplace
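
To show how one interaction term yields four residence patterns, the sketch below derives the odds ratio of each pattern relative to lifetime rural residence from three coefficients. It is a sketch under assumptions, not the fitted model: the 0/1 coding (0 = rural, 1 = town) is assumed, the coefficients are hypothetical values chosen for easy arithmetic, and the other covariates are treated as held fixed.

```python
import math

# Illustrative sketch only; not the fitted model.
# Coding assumption: residence and birthplace are 0 for rural and 1 for town.

PATTERNS = {
    "lifetime rural": (0, 0),
    "rural birthplace, town residence": (1, 0),
    "town birthplace, rural residence": (0, 1),
    "lifetime town": (1, 1),
}


def residence_contrasts(beta_residence, beta_birthplace, beta_interaction):
    """Odds ratio of each pattern versus lifetime rural residence.

    With the other covariates held fixed, the log odds differ from the baseline by
    beta_residence*res + beta_birthplace*birth + beta_interaction*res*birth."""
    return {
        name: math.exp(beta_residence * res
                       + beta_birthplace * birth
                       + beta_interaction * res * birth)
        for name, (res, birth) in PATTERNS.items()
    }


# Hypothetical coefficients, chosen only so the arithmetic is easy to follow.
contrasts = residence_contrasts(
    beta_residence=math.log(2.0),
    beta_birthplace=math.log(0.5),
    beta_interaction=math.log(2.0),
)
```

Only the lifetime-town pattern involves all three coefficients, which is why the interaction coefficient on its own does not correspond to any single group comparison.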

The common risk factor model (pseudo R² = 19%, n = 774, Table 1 ) performed slightly worse than the SES model (pseudo R² = 21%, n = 774, Table 2 ). The residence model had a smaller sample size (pseudo R² = 19%, n = 720, Table 2 ) than the common risk factor model due to missing birthplace data for some individuals. For an equal comparison, we re-ran the common risk factor and SES models with the same individuals as in the Residence model. For the reduced dataset, the Residence model and SES model had comparable pseudo R², while the common risk factor model had a slightly worse value (pseudo R² = 18% vs. 19%) (Table B in S1 Text ). Therefore, we present results from all three models and contrast the variable effects (Tables 1 and 2 ). All significance levels were Bonferroni corrected, assuming an α = 0.05.

https://doi.org/10.1371/journal.pgph.0002643.t001

https://doi.org/10.1371/journal.pgph.0002643.t002

Common risk factors

Across the three models, males consistently have three times the odds of active TB compared with females (OR = 3.01 [2.20, 4.12], p < 0.001; Tables 1 and 2 , and Fig C in S1 Text ). All logistic regression models showed insufficient statistical evidence for an effect of smoking (common risk factor model: OR = 1.32 [0.94, 1.85], p = 0.11; Tables 1 and 2 ) or diabetes (common risk factor model: OR = 1.27 [0.67, 2.43], p = 0.46; Tables 1 and 2 ) on TB risk. Despite the lack of significance, we note that smoking had an effect size in the expected direction (Fig C in S1 Text ). The variable with the strongest effect size was current residence in towns, i.e., areas with a population size greater than 2,000 people (OR = 3.20 [2.26, 4.55], p < 0.0001; Tables 1 and 2 and Fig C in S1 Text ).
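
As a reading aid, an odds ratio and its interval can be related on the log scale. The sketch below assumes the reported interval behaves like a 95% Wald interval that is symmetric on the log-odds scale; pooled intervals from multiple imputation may instead use a t reference distribution, so the recovered standard error and z statistic are approximations. The function name is hypothetical, and the inputs are the gender estimate quoted above.

```python
import math

# Illustrative sketch only. Assumes a 95% Wald interval symmetric on the log scale.
Z_95 = 1.959964


def log_scale_summary(ci_low, ci_high):
    """Recover the point estimate, standard error, and z statistic implied by an interval."""
    log_low, log_high = math.log(ci_low), math.log(ci_high)
    log_or = (log_low + log_high) / 2          # midpoint on the log scale
    se = (log_high - log_low) / (2 * Z_95)     # half-width divided by the critical value
    return math.exp(log_or), se, log_or / se


# Interval quoted in the text for gender: OR = 3.01 [2.20, 4.12]
implied_or, implied_se, implied_z = log_scale_summary(2.20, 4.12)
```

The log-scale midpoint of the quoted interval agrees with the reported odds ratio to two decimal places, which is what a Wald-type interval would produce.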

Age interacts with SES

In the common risk factor model, age does not significantly affect TB risk, but higher SES has a protective effect (OR = 0.91 [0.86, 0.97], p = 0.006; Table 1 ). In the SES model, SES significantly affects TB status depending on age group (OR = 1.01 [1.00, 1.01], p < 0.001, Table 2 ). The effect is such that higher SES at younger ages (18–59 years old) is protective against TB, and higher SES at older ages (>59 years) increases risk ( Fig 4 ).
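
One way to read an age by SES interaction is that the odds ratio per additional year of education depends on age. The sketch below illustrates that arithmetic under assumptions: the interaction coefficient is set to ln(1.01) to match the interaction odds ratio quoted above, while the SES main effect (-0.6) is a hypothetical value chosen only to show the sign change, because the fitted main effect is not reproduced in this text.

```python
import math

# Illustrative sketch only; not the fitted model.
BETA_INTERACTION = math.log(1.01)  # matches the interaction OR quoted in the text
BETA_SES = -0.6                    # hypothetical main effect, for illustration only


def ses_odds_ratio_at_age(age, beta_ses=BETA_SES, beta_interaction=BETA_INTERACTION):
    """Odds ratio for one extra year of education at a given age.

    With an age*SES term, the SES slope on the log-odds scale is
    beta_ses + beta_interaction * age."""
    return math.exp(beta_ses + beta_interaction * age)


def crossover_age(beta_ses=BETA_SES, beta_interaction=BETA_INTERACTION):
    """Age at which the SES effect changes sign (odds ratio equals 1)."""
    return -beta_ses / beta_interaction


or_at_30 = ses_odds_ratio_at_age(30)
or_at_75 = ses_odds_ratio_at_age(75)
switch_age = crossover_age()
```

Under these assumed values, education is protective at age 30 and associated with higher odds at age 75, with the sign change near age 60; the actual crossover depends on the fitted coefficients.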

A) The odds of active TB by education level vary across age groups (shown above by the different color lines). More years of education decreases the odds of active TB in younger age groups, but this pattern reverses in the oldest age groups. B) Effect plot from the residence model visualizing an interaction term between birthplace residence and current residence. Regardless of birthplace, the odds of active TB is highest in individuals who currently reside in towns. Individuals born in towns and currently residing in rural areas have the lowest odds of active TB.

https://doi.org/10.1371/journal.pgph.0002643.g004

TB risk by residence and birthplace

We analyzed the relationship between TB status and a change of locality between birthplace (rural area or town) and current residence (rural area or town) during an individual’s lifetime. We expected to see a difference in the odds of active TB between lifelong residents and those who have moved between locales. Under such a model, lifelong rural dwellers would have the lowest odds and lifelong town dwellers would have the highest odds. We set an interaction term between current residence and birthplace classified into town/rural, (OR = 0.33 [0.13–0.86], p = 0.024, Table 2 ). To break down the interaction effect, we set lifelong residents of rural areas as the baseline, comparing the three other residence patterns with this baseline group. We found that lifelong town dwellers had about twice the odds of active TB (OR = 2.16 [1.43–3.28], p < 0.001) relative to the baseline. Individuals born in a rural area and currently residing in a town had similar outcomes as lifelong town dwellers (OR = 2.19 [1.14–4.20], p = 0.018). Taken together, these show that town residence increases risk regardless of birthplace. Interestingly, individuals who were born in a town and later moved to rural areas are even more protected than individuals born and currently residing in rural areas (OR = 0.33 [0.16–0.71], p = 0.004) ( Fig 4 ).

Random forest modeling

As an alternative to logistic regression, we trained a random forest model to classify TB status utilizing all the variables from the common risk factor model. We configured the model to grow 5000 classification trees (Supplementary Methods in S1 Text ). The model assigned gender, current residence, SES, and age as the top important independent variables ( Fig 5 ). Diabetes and smoking were classified as uninformative predictors for TB status. The model had an overall “out-of-bag” misclassification rate of 23%.

Random subsets of all 6 variables on the y-axis were used to grow 5000 trees to classify participants into cases and controls. The model had an overall “out-of-bag” misclassification rate of 23%. Variables with higher variable importance are most crucial for case-control classification. Predictor variables with negative variable importance values worsen the ability of the model to classify TB status.

https://doi.org/10.1371/journal.pgph.0002643.g005

We examined common demographic risk factors for TB, constructing the largest TB epidemiological study in a Northern Cape clinical population (n = 774), to our knowledge. We show that gender, SES, and current residence locality are significant variables or important TB risk predictors, using logistic regression and random forest models. Neither smoking nor diabetes is associated with increased TB risk in any model. Among the three logistic regression models, interacting SES by age (“SES model”), and birthplace by residence (“residence model”), had similar explanatory power, improving on the common risk factor model.

In South African townships, M. tb is community spread [ 28 , 29 ] and about 88% of adults are infected by ages 31–35 [ 30 ]. Here, we demonstrate the utility of sampling in high disease incidence populations to rapidly build datasets with large sample sizes of TB cases and controls with similar pathogen exposure. Validating M. tb exposure in controls is traditionally done by a tuberculin skin test (TST) and/or interferon-gamma release assay (IGRA; e.g., QuantiFERON), thereby assigning LTBI status. We did not perform IGRA and TST tests for all controls in our cohort because IGRA testing requires blood draws and is often prohibitively expensive for large cohort studies, and TST is not readily available in South Africa. To validate the LTBI rate in our controls, we IGRA-tested a random selection of our sample controls (n = 70), within 3 days of the participant’s GeneXpert result, and found that they have an 87% LTBI rate (IGRA+). The highly significant model results, in combination with the IGRA+ subset of controls, suggest that our sampling strategy reliably categorizes TB disease risk.

We considered whether our sampling strategy displayed any indication of selection bias. We hypothesized that the risk of active TB should monotonically increase with age, reflecting cumulative lifetime exposure to the M. tb pathogen. Contrary to this, our findings revealed a non-monotonic relationship. Starting with the youngest age group (18–28 years), the empirical odds of active TB increased with age, peaking in the 29–38-year-old age group, followed by a progressive decline in empirical odds until the oldest age group (79–88 years) (Fig B in S1 Text ). This unexpected pattern can be indicative of selection bias, possibly driven by survivor bias (where individuals dying from TB are absent from the study population at older age bins) and/or related to whether our controls are representative of the general population (sample-selection bias). Generally, selection bias is difficult to measure and mitigate [ 52 – 54 ], especially in case-control studies where controls are recruited from a clinical setting. Recall bias in older adults could also lead to the lower observed empirical OR in older age bins; however, because TB treatment is a six-month course, and incomplete treatment regimens lead to relapse or life-threatening drug resistant TB, it is generally a life event that people remember, with the exception of pediatric TB.

SES model effects

Age-specific TB risk varies across the lifespan. The greatest risk of TB is during infancy, decreasing through adolescence, then increasing and peaking between 25–35 years old followed by a decrease, and another peak after 65 years [ 48 , 55 ]. Age was not a significant variable in the logistic regression except when interacting with SES. SES’s protective effect on TB risk is most evident among 18–39-year-olds, but the trend reverses and higher SES increases risk among the eldest individuals (>69 years; Fig 4A ), those who grew up and reached adulthood during Apartheid. Higher SES increasing TB risk at older ages is contrary to findings in populations in the United States and Mexico [ 55 ]. This unique pattern may reflect South Africa’s recent history of Apartheid and post-Apartheid societal and economic shifts. During Apartheid, individuals from historically marginalized backgrounds had limited career options, but some were able to become teachers, police officers, or nurses. Such occupations are associated with higher education requirements and would have facilitated access to larger salaries, transportation, and mobility, potentially leading to better diagnosis and treatment. Alternatively, the observed pattern of our interaction effect at older ages could be explained by selection bias or our operationalization of SES in this study. Highest completed level of education (e.g., grade, diploma, degree, etc.) is a blunt measure of SES, and does not fully capture all SES facets, including social, economic, and cultural capital [ 56 , 57 ], and universal access to education increased post-Apartheid [ 58 ]. Additionally, we only sample from community health clinics, not private clinics, thereby missing a fraction of the SES spectrum.

Residence model effects

Consistent with previous research [ 21 , 59 – 61 ], we find TB risk is associated with living in larger towns. In our prior work, mobility in the Northern and Western Cape populations changed over the past 3 generations, with the highest levels of mobility in the grandparental generation [ 62 ]. Therefore, we tested whether mobility (operationalized as a different birthplace and residence) affected TB risk. Individuals currently residing in towns (regardless of birthplace) had higher odds of active TB, compared to those born in towns who migrated to rural areas, and lifetime rural dwellers. Unexpectedly, the individuals with the lowest odds of active TB are those born in a town who move to a rural area ( Fig 4B ). When we returned results to the community, the clinic staff hypothesized that despite nationally standardized BCG vaccinations, rural areas may have lower vaccination rates (observation communicated by clinical staff in the study catchment); therefore, those born in town benefit from a greater likelihood of BCG vaccination during childhood and from low adult M. tb exposure while living in rural areas. The benefits of the BCG vaccine, however, attenuate in adolescents, and memory of childhood TB episodes suffers from recall bias. Another possibility is that the town-born and rural-residence group accrued more wealth in towns before moving to a rural area, affording a different lifestyle than their rural neighbors (e.g., afford larger homes, less crowding, cleaner cooking fuel, etc.). This unique combination of factors may explain why the town-born rural-residence group has even lower odds of active TB compared to the lifetime rural dwellers. Future work should consider collecting birthplace in addition to current residence to better identify TB risk, as M. tb exposure varies across the lifespan.

Across global studies, men are on average 1.7 times more likely to have TB [ 63 – 65 ]. Sex biases like this are common in other infectious diseases [ 66 , 67 ] and are attributable to an intersection of sex (biological factors, e.g., immune function) and gender (social and behavioral factors, e.g., risk-taking behavior) [ 68 ]. Despite smoking not being a significant TB risk, we found 75.5% of men smoke compared to 55.8% of women, indicating at least some gender differences in risky behaviors in the Northern Cape population.

Smoking and alcohol consumption have been shown to increase TB risk and mortality in the Northern Cape and at the national level [ 18 , 69 – 71 ]. In our models, smoking had the expected effect on TB risk and alcohol was excluded from our models due to high missingness. Self-reporting biases in observational studies like this one are a concern for variables like smoking, alcohol consumption, and SES measures [ 72 ]. Our sample, however, reports much higher levels of smoking compared to large-scale national surveys (e.g., [ 73 ]; men: 75.5% vs. 41%; women: 55.8% vs. 21%, respectively), suggesting minimal self-report bias in our study. The weak effect of smoking observed from our models may be due to our method of binary classification. We collected fine-scale smoking phenotypes (Supplementary Methods in S1 Text ) but because of the high missingness of these phenotypes, we ultimately classified participants as Smokers/Non-smokers. This stratification may mask the heterogeneity of smoking behaviors such as casual and binge substance use or differences in the types of smoking materials consumed.

Ancestry & ethnicity

Finally, we highlight that our study included enrollment from 12 different clinics, some of which are more than 250 kilometers apart. We surveyed ethnicity and genetic ancestry to test for population structure in the sample. Such structure can confound analysis if genetic ancestry tracks differential host risk for progression to TB or if different ethnicities have different cultural norms. Previous studies have described the high proportion of Khoe-San ancestry in Northern Cape communities, but these largely focused on descendant groups who identify as Khoe-San (e.g. the ≠Khomani San, the Nama, Karretjie) rather than the general population [ 74 ]. Here, we show the clinical study population to be admixed with 4 other distinct ancestries ( Fig 3 ), demonstrative of recent historical events. These include the Bantu expansion into Southern Africa, European colonialism, the Dutch East India Company (aka VOC) slave trade, and the displacement and forced settlement of indigenous South African Khoe-San groups, especially in the last few generations in the Northern Cape [ 43 , 75 ]. Although we do observe heterogeneity in ancestry across clinics, correcting for the top 10 genetic principal components did not change the logistic regression results (Fig D in S1 Text and S1 Data ). To our knowledge, this is the first study to report ancestry proportions of clinical populations in the Northern Cape Province, South Africa. This work provides a baseline to design future studies, such as exploring host genetic correlates of active TB progression in this population (Supplementary Discussion in S1 Text ).

Active TB progression is a multifactorial process involving the environment, genetics, and their interaction [ 1 , 7 ]. Our results from the NCTB cohort indicate that sociodemographic variables strongly impact active TB risk. Effects that are unique to the Northern Cape Province may reflect how changes in the pre- to post-apartheid environment modified social factors, such as SES and mobility, which in turn impacted lifetime TB risk.

Supporting information

S1 Checklist.

https://doi.org/10.1371/journal.pgph.0002643.s001

https://doi.org/10.1371/journal.pgph.0002643.s002

https://doi.org/10.1371/journal.pgph.0002643.s003

Acknowledgments

We would like to thank all the participant communities in the Northern Cape for their continued trust and support in helping us undertake this project. We would also like to thank our community research assistants and translators who assisted in data collection for the project. We are grateful to Prof. Faadiel Essop, Dr. Desiree Petersen, Prof. Eileen Hoal, and Prof. Leslie Swartz for closely reading this manuscript. We would also like to thank Dr. Chris Gignoux and Dr. Mark Grote for statistical advice. Finally, we want to thank the Department of Health in the Northern Cape Province, South Africa for their continued support of the project.

  • 5. National Institute for Communicable Diseases. Microbiologically Confirmed Pulmonary TB—South Africa. 2019. TB Online Surveillance Dashboard. Available from: https://www.nicd.ac.za/tb-surveillance-dashboard/
  • 9. Bloom BR, Atun R, Cohen T, Dye C, Fraser H, Gomez GB, et al. Tuberculosis. In: Holmes KK, Bertozzi S, Bloom BR, Jha P, editors. Major Infectious Diseases [Internet]. 3rd ed. Washington (DC): The International Bank for Reconstruction and Development / The World Bank; 2017 [cited 2024 Mar 18]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK525174/
  • 23. WHO. Global Tuberculosis Report 2013. World Health Organization; 2013.
  • 27. National Institute for Communicable Diseases [Internet]. Microbiologically Confirmed Pulmonary TB—Province. 2017. TB Online Surveillance Dashboard. Available from: https://www.nicd.ac.za/tb-surveillance-dashboard/
  • 34. Rubin DB. Multiple Imputation for Nonresponse in Surveys [Internet]. John Wiley & Sons, Ltd; 1987 [cited 2024 Apr 19]. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470316696.fmatter
  • 37. Little RJ, Rubin DB. Statistical analysis with missing data. John Wiley & Sons; 2019 April 23.
  • 43. Legassick M. Hidden Histories of Gordonia: Land dispossession and resistance in the Northern Cape, 1800–1990. NYU Press; 2016.
  • 52. Szklo M, Nieto FJ. Epidemiology: Beyond the Basics. 3rd ed. Jones & Bartlett Learning; 2014.
  • 53. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2008.
  • 54. Woodward M. Epidemiology: Study Design and Data Analysis, 3rd ed. CRC Press; 2013.
  • 67. Wizemann TM, Pardue ML. Exploring the Biological Contributions to Human Health [Internet]. National Academies Press (US); 2001 [cited 2022 Aug 29]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK222288/
  • 75. Penn N. The Forgotten Frontier: Colonist and Khoisan on the Cape’s Northern Frontier in the 18th Century. Juta and Company Ltd; 2005.
