Sample selection and validity of exposure–disease association estimates in cohort studies

Costanza Pizzi; Bianca De Stavola; Franco Merletti; Rino Bellocco; Isabel dos Santos Silva; Neil Pearce; Lorenzo Richiardi

doi:10.1136/jech.2009.107185


Theory and methods


Costanza Pizzi¹,², Bianca De Stavola², Franco Merletti¹, Rino Bellocco³,⁴, Isabel dos Santos Silva⁵, Neil Pearce²,⁶, Lorenzo Richiardi¹

¹Cancer Epidemiology Unit, CeRMS and CPO-Piemonte, University of Turin, Italy
²Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
³Department of Statistics, University of Milano Bicocca, Milan, Italy
⁴Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
⁵Department of Non-communicable Disease Epidemiology, London School of Hygiene & Tropical Medicine, London, UK
⁶Centre for Public Health Research, Massey University Wellington Campus, New Zealand

Correspondence to Costanza Pizzi, Via Santena 7, 10126 Torino, Italy; costanza.pizzi{at}lshtm.ac.uk



Introduction

Selection of study subjects from restricted source populations according to prespecified criteria is an approach that is frequently used in cohort studies. The purposes of such restrictions are to enhance study feasibility and to increase the prevalence of exposure or the completeness of follow-up, thereby increasing study validity and precision. Typically this may involve recruiting participants from a subgroup of the general population, rather than sampling directly from the entire general population. Such subgroups may be defined on the basis of occupation, gender, geographical area, birth cohort, etc. The British Doctors' Study1 and the Nurses' Health Study,2 occupational cohorts,3 follow-up of participants in specific events,4 analyses restricted to specific subgroups of the population, such as non-smokers,5 ancillary analyses of non-randomised exposures in randomised studies,6 and follow-up studies of screening attendants7 are all examples of cohort studies based on restricted samples.

Undoubtedly, restriction of the source population may introduce problems of generalisability of the study findings, but this also applies to studies that are based on the general population (eg, most cardiovascular epidemiology involves cohort studies in specific communities rather than true general population samples). We will therefore not consider issues of generalisability here; rather, our focus is on whether using a restricted source population may affect the validity of the exposure–disease associations.8 9 In particular, bias will be introduced if a risk factor for disease is not associated with exposure in the general population but is associated with exposure in the study population, as a result of the selection process. Such biases can be represented using directed acyclic graphs (DAGs).8 10–13 The example depicted in figure 1A represents a population in which there is no association between an exposure (E) and a disease (D); there is another risk factor (R) for the disease, but this is not a source of confounding as it is not associated with the exposure. However, E and R both affect the likelihood of being selected (S=1) into the study. When analyses are restricted to the selected subjects, there is an inherent conditioning on S (as represented by a square around S in figure 1B), which leads to a spurious association between E and R (represented by a dashed line). Under this scenario, even if E has no causal effect on D, the backdoor path E–R–D is opened and the estimated associational RR between the exposure and the disease (ARR_DE) may differ from the causal RR (CRR_DE). This could, for example, be the situation in a cohort study of the effect of obesity (E) on breast cancer (D) based on breast cancer screening participants (the restricted source population). In this example, obese women (E) are less likely to attend the screening programmes,14 while women with a family history of breast cancer (R) are more likely to participate.
Among those who attend screening (ie, conditioning on those with S=1), obesity (E) and family history of breast cancer (R) become positively correlated. In fact an obese woman is more likely to have a family history of breast cancer within the selected sample than in the general population, because otherwise she may not have participated in the screening programme. As a result, family history of breast cancer is a confounder of the obesity–breast cancer association if studied among screening attendees, but is not—or to a lesser extent—in the general population.

Figure 1

Diagram of a cohort based on a selected sample. (A) In the population the exposure of interest (E) is not associated with the disease of interest (D), which is caused by a risk factor (R). Both E and R affect the probability of being selected (S) as a member of the cohort. (B) The study is carried out in the selected sample, and therefore there is an inherent conditioning on S (a box around a variable means conditioning on that variable), which generates an induced association between E and R (represented by a dashed line).

This type of bias has been extensively discussed in the causal inference literature from a theoretical point of view.8 9 Hernan and colleagues' 2004 paper on selection bias provides the conceptual framework and indicates that if the risk factor associated with the selection process is known and measured, it is possible to adjust for selection bias, whereas if the risk factor is unmeasured, the effect estimates may be biased. However, although the theoretical basis of selection bias is clear, there have been few attempts to quantify the likely strength of such biases. One exception is that of Greenland,15 who studied the setting of figure 1B with dichotomous exposure and outcome variables, employing methods originally developed to quantify the impact of unmeasured confounding.16 He calculated the likely maximum strength of the bias in the estimation of the E–D association in the S=1 stratum as a function of the ORs corresponding to the true associations depicted in figure 1B (ie, OR_SE, OR_SR, OR_DR). However, it is not clear how these results apply to cohort studies. Because of the increasing frequency of cohort studies based on selected populations, such as the internet-based birth cohort studies based in Italy (NINFEA cohort) and New Zealand (ELFS),17 quantifying the potential biases involved in analysing such data is timely and relevant.

Our aim is therefore to study the extent of these biases. We use simulations to mimic a variety of cohort restrictions and disease settings and examine the consequent bias in the estimated exposure hazard (or rate) ratio (HR) of disease. We then discuss these results in terms of whether, and under what circumstances, the resulting selection bias is serious enough to strongly bias the exposure effect estimates. For simplicity, we will assume throughout the paper that there is negligible random variation, that all variables are measured without error, and that there is uninformative censoring.

Sample selection and disease risk factors

As previously recognised,9 18 a fundamental characteristic of selection bias in restricted cohort studies is that the selection process makes a disease risk factor, which may not be associated with the exposure in the general population, become associated with the exposure among the study population and therefore act as a confounder.

Confounders in the general population and risk factors that become confounders in a restricted source population are usually indistinguishable when the study is analysed. Although typically some disease risk factors (ie, potential confounders) are known a priori, it is seldom known whether these are associated with the exposure of interest in the specific population in which the study will be carried out. Both in general population-based and restricted cohorts, therefore, researchers attempt to collect information on all known and suspected important risk factors of the disease in the population that they are studying, regardless of their expectations about whether these are associated with the exposure or not. The example of the association between smoking and socioeconomic position (SEP) illustrates this point well. Depending on the population and the calendar period, SEP can be positively or negatively associated, or not associated at all, with smoking. Researchers aiming to estimate the association between smoking and mortality will always attempt to collect information on SEP and, in most instances, will control for it, irrespective of whether the confounding effect of SEP is due to a real association between SEP and smoking in the general population or a spurious association caused by the sample selection process.

Another possible consequence of the selection mechanism is a change in magnitude, and in extreme cases direction, of the confounding effect of a risk factor. This may occur if the strength of the association between the risk factor and the exposure in the selected sample differs from that originally present in the general population. For example, when two (parent) variables influence a third (child) variable in the same direction, conditioning on the child variable likely leads to a negative association between the parent variables.8 Thus, if an exposure and a confounder influence the selection process in the same direction, the original association between exposure and confounder will be reduced in the subset of those who participate if they were originally positively associated, or increased if their original association was negative. For example, in many populations smoking and physical exercise are negatively associated. In a hypothetical study restricted to blood donors, who typically have a healthy lifestyle and thus smoke less and exercise more than the average individual in the general population, the sample selection would add a positive association between smoking and physical exercise. Therefore, the original negative association present in the general population would be, if anything, attenuated among blood donors.

In the next section, we use simulations to quantify the likely extent of selection bias arising from the use of restricted cohorts.

Quantification of the bias

Methods

We conducted Monte Carlo simulations of alternative settings corresponding to the scenario of figure 1B to quantify the resulting bias in the estimation of the E–D effect when conditioning on S=1 and not adjusting for R.19 The generation process of the four variables of figure 1B is described below.

We generated E and R as marginally independent binary variables, with prevalence, respectively P_E and P_R, initially set equal to 0.5 in the source population. They were later allowed to decrease to 0.25 for P_E and to 0.1 for P_R, in order to investigate scenarios more frequently addressed by epidemiologists.

The binary variable S was generated using a logistic regression model with baseline prevalence, P_S, equal to 0.5 and with the ORs for the explanatory binary variables E and R taking values 0.25, 0.33, 0.50, 2, 3 and 4. Specifically, with α_S indicating the log(odds) of S=1 among the non-exposed, β_SE indicating the log(OR) corresponding to exposure E and β_SR indicating the log(OR) corresponding to R, the generating model was:

$$\operatorname{logit}\Pr(S=1) = \alpha_S + \beta_{SE}\,E + \beta_{SR}\,R \tag{1}$$

A more complex model that included an interaction term between E and R was also considered:

$$\operatorname{logit}\Pr(S=1) = \alpha_S + \beta_{SE}\,E + \beta_{SR}\,R + \beta_{\mathrm{inter}}\,E \cdot R \tag{2}$$

with OR_inter, corresponding to exp(β_inter), set at values 0.5 or 2. The interaction term was introduced to examine more realistic selection settings. For example, in the first empirical demonstration of Berkson's bias, Roberts and colleagues found that not only do chronic conditions increase the chance of hospitalisation, but they often also interact more than multiplicatively.20
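The E–R association induced by models (1) and (2) can also be worked out in closed form, which may help in reading the simulation results. The sketch below is illustrative only and is not the authors' code: the function names are hypothetical, and it assumes that a baseline prevalence P_S of 0.5 corresponds to α_S=0 and that E and R are independent before selection, in which case their prevalences cancel out of the odds ratio.

```python
import math


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def induced_or_er(or_se, or_sr, or_inter=1.0, p_s_baseline=0.5):
    """Odds ratio between E and R within the S=1 stratum.

    Assumes E and R are marginally independent and that S follows
    logistic model (1), or model (2) when or_inter differs from 1.
    """
    alpha_s = math.log(p_s_baseline / (1.0 - p_s_baseline))
    b_se, b_sr, b_inter = math.log(or_se), math.log(or_sr), math.log(or_inter)
    # P(S=1 | E=e, R=r) for each of the four (e, r) cells.
    p_sel = {
        (e, r): expit(alpha_s + b_se * e + b_sr * r + b_inter * e * r)
        for e in (0, 1)
        for r in (0, 1)
    }
    # The prevalences of E and R cancel, leaving only selection probabilities.
    return (p_sel[1, 1] * p_sel[0, 0]) / (p_sel[1, 0] * p_sel[0, 1])


for or_se, or_sr, or_inter in [(4, 4, 1), (4, 0.25, 1), (0.25, 0.25, 0.5)]:
    log_or = math.log(induced_or_er(or_se, or_sr, or_inter))
    print(f"OR_SE={or_se} OR_SR={or_sr} OR_inter={or_inter}: "
          f"log(OR_ER|S=1) = {log_or:.2f}")
```

Under these assumptions, selection effects acting in the same direction induce a negative E–R association, and effects acting in opposite directions induce a positive one.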

We generated time to the outcome D assuming a constant rate λ—that is, we assumed that time to event followed an exponential distribution.21 The baseline rate λ₀ was set equal to 0.01, 0.03 or 0.06 events/year, with administrative censoring time set at 5 years. The rate λ was allowed to be affected only by R, with HR_DR taking values 0.25, 0.33, 0.50, 2, 3 and 4, while we assumed no E–D association—that is, HR_DE=1. Specifically, with β_DE indicating the log(HR) of D for the exposure E and β_DR indicating the log(HR) of D for the risk factor R, the log rate function for D, log(λ), was defined as:

$$\log(\lambda) = \log(\lambda_0) + \beta_{DE}\,E + \beta_{DR}\,R \tag{3}$$

with β_DE fixed at 0.

We generated a total of 1000 Monte Carlo simulated datasets of 5000 subjects for each combination of the parameters described above. We also used a size of 2500 subjects, increasing the number of simulations (n=2000), to deal with the greater impact of random variation.
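As a rough illustration of the data-generating steps described above, one simulated dataset could be produced along the following lines. This is a minimal sketch using only the Python standard library, not the authors' implementation: the function name, argument names and tuple layout are hypothetical, the selection model is model (1) without interaction, and α_S=0 is assumed to represent a baseline P_S of 0.5.

```python
import random


def simulate_dataset(n=5000, p_e=0.5, p_r=0.5, or_se=4.0, or_sr=4.0,
                     hr_dr=4.0, lambda0=0.03, censor_time=5.0, seed=None):
    """Return a list of (E, R, S, follow_up_time, event) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # E and R are independent binary variables in the source population.
        e = int(rng.random() < p_e)
        r = int(rng.random() < p_r)
        # Selection model (1) with alpha_S = 0 (assumed baseline P_S = 0.5).
        odds_s = (or_se ** e) * (or_sr ** r)
        s = int(rng.random() < odds_s / (1.0 + odds_s))
        # Outcome model (3) with beta_DE = 0: the rate depends on R only.
        rate = lambda0 * (hr_dr ** r)
        t = rng.expovariate(rate)          # exponential time to event
        event = int(t <= censor_time)      # administrative censoring
        rows.append((e, r, s, min(t, censor_time), event))
    return rows


data = simulate_dataset(seed=1)
selected = [row for row in data if row[2] == 1]
print(len(selected), "of", len(data), "subjects selected")
```

Repeating this call with different seeds and restricting each dataset to the rows with S=1 gives the kind of replicate samples on which the estimates are then computed.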

In each simulated dataset, we estimated two main parameters in the stratum S=1 (whose sample size varies as a consequence of the parameters chosen for the selection process): the association between E and R (OR_ER) and the association between E and D (HR_DE) that is induced by the selection process. The estimate of HR_DE was obtained by fitting a Cox proportional hazards regression model with no adjustment for R.22 We then calculated the bias in the E–D association as the difference between zero, that is the true value of β_DE, and the logarithm of the estimated HR_DE. For each scenario, we summarised the bias, and the estimated values of β_DE, in terms of means, SD, and 5th and 95th percentiles.
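A large-sample approximation to this bias can be obtained without simulation. The sketch below swaps the Cox model for the crude rate ratio (expected events divided by expected person-time in each exposure group), which is a different estimator from the one used in the paper and is shown only to indicate the order of magnitude. The function name is hypothetical; the sketch assumes model (1) with α_S=0, independence of E and R before selection, and bias expressed as the log of the crude rate ratio, so its sign convention may differ from the one used in table 1.

```python
import math


def expected_log_rate_ratio(or_se, or_sr, hr_dr, p_r=0.5, lambda0=0.03,
                            censor_time=5.0):
    """Log crude rate ratio for E within S=1 when the true beta_DE is 0."""

    def p_sel(e, r):
        # P(S=1 | E=e, R=r) under model (1) with alpha_S = 0.
        odds = (or_se ** e) * (or_sr ** r)
        return odds / (1.0 + odds)

    def crude_rate(e):
        # P(R=1 | E=e, S=1) by Bayes' rule.
        w1 = p_r * p_sel(e, 1)
        w0 = (1.0 - p_r) * p_sel(e, 0)
        p_r_selected = w1 / (w1 + w0)
        events = 0.0
        person_time = 0.0
        for r, weight in ((0, 1.0 - p_r_selected), (1, p_r_selected)):
            rate = lambda0 * hr_dr ** r
            p_event = 1.0 - math.exp(-rate * censor_time)
            events += weight * p_event
            # Expected follow-up time for an exponential time censored at c.
            person_time += weight * p_event / rate
        return events / person_time

    return math.log(crude_rate(1) / crude_rate(0))


print(round(expected_log_rate_ratio(or_se=4, or_sr=0.25, hr_dr=4), 3))
```

With OR_SE=4, OR_SR=0.25 and HR_DR=4 the approximation gives a value of roughly 0.15 in absolute terms, and it returns exactly zero when E does not affect selection.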

Results

We first considered the situation with prevalence of E and R both equal to 0.5, OR_inter=1 (ie, no multiplicative interaction), and λ₀=0.03 (the ‘reference scenario’ in table 1). As expected, the size of the bias in the estimation of HR_DE depended on: (i) the induced association between the exposure and the risk factor (OR_ER), which increased in absolute terms with the absolute size of OR_SR and OR_SE; and (ii) the magnitude of the association between the risk factor and the disease (HR_DR). The largest values of the bias in the log HR were ±0.15 (table 1, ‘reference scenario’), which were reached when OR_SE, OR_SR and HR_DR were furthest from the null value (ie, equal to 0.25 or 4). Note that in table 1 the range for log(OR_ER|S=1) is not symmetrical because the magnitude of the association induced by the selection between E and R also depends on the prevalence of S in the population (P_S), with the strongest association obtained when P_S=0.5. Supplementary table 1 presents the complete results for all combinations of the values of OR_SE, OR_SR and HR_DR. The mean bias decreased from ±0.15 to just ±0.02 when the three ORs/HRs were equal to 2 or 0.5.


Table 1

Bias in the crude estimation of the E–D association by selected values of the data generating parameters; results from 1000 simulations

When an interaction term between E and R was included in the model generating S, the induced E–R association increased considerably (figure 2), up to a log(OR) of −0.98 (table 1, row 2) when OR_SE and OR_SR were equal to 0.25 and the OR_inter was 0.5. The bias increased accordingly, ranging from −0.24 to 0.27 (table 1, row 2). This situation is equivalent, in terms of induced bias, to those involving very strong marginal associations with selection. It is clear from figure 2 that the impact of the interaction is not the same for all the parameter combinations, as the magnitude of the induced E–R association is strengthened or reduced according to the sign of the interaction term but also to the size of the stratum of subjects exposed to both E and R.

Figure 2

Mean OR of the induced E–R association in the stratum of those selected (S=1) by selected values of the association of the exposure (OR_SE) and the risk factor (OR_SR) with the selection process and of the E–R interaction (OR_inter); results from 1000 simulations.

Neither the prevalence of the exposure E (table 1, row 3), nor the baseline rate for the disease D (table 1, rows 4–5), nor the sample size (table 1, row 6) affected the extent of the bias. Conversely, the prevalence of R, which becomes a confounder of the E–D association when S=1, had a non-negligible effect. For a given value of the induced E–R association, the bias reached its peak when the prevalence of R among the selected subjects (S=1 stratum) was 0.5. For this reason, when the population prevalence of R was set equal to 0.1 instead of 0.5, the range of the mean bias decreased to (−0.12; 0.07) (table 1, row 7).

Discussion

Conducting cohort studies in a restricted sample of the general population may offer several advantages, including more precise measurement of the exposure, higher exposure prevalence, enhanced feasibility of the study, better control of confounding, increased sample size, higher recruitment rates, and a higher completeness of follow-up. These advantages should be balanced against issues of validity.

In this paper we have shown, via simulations, that the possible bias introduced by restriction of the source population is usually weak when internal comparisons are carried out within the cohort, with a maximum bias in the log(HR) of ±0.15.

These results are in agreement with those of Greenland,15 who used an analytical approach to quantify the maximum selection bias in settings where the outcome risk is rare so that the analysis of cohort data can be performed using logistic regression. Our simulations add further insight to these results as we examined a wide range of disease and selection parameters, including exposure and risk factor prevalence, which highlighted their individual role in influencing the extent of the bias. Further, we considered settings where exposure and risk factor interact when influencing the selection process. Some additional points are warranted.

First, the bias is necessarily small when the association between the exposure of interest and the selection process is relatively weak (ie, 0.5<OR<2). In particular, when the exposure-selection OR is equal to 2 or 0.5, while the risk factor-selection OR and the risk factor-disease HR are allowed to take values up to 4 or down to 0.25, the maximum bias in the estimated exposure–disease association is within the ±0.07 range (on the log hazard scale). For example, consider the Million Women Study, a cohort nested within the breast screening programme in the UK.7 A study comparing the characteristics of the study participants with those of the rest of the population (women who attended the screening but did not join the study, plus non-attenders)23 allowed us to derive the participation OR for current use of hormone replacement therapy, which is the main exposure of interest of the study. This estimated OR was about 1.6. On the basis of this information it is possible to assume that, in this cohort, the bias introduced by the baseline selection on the estimates of the effect of hormone replacement therapy on the outcome of interest would be negligible.

Second, selection must be associated with one or more unmeasured or unknown disease risk factors in order to introduce bias. However, unknown or unmeasured disease risk factors can introduce bias whether the cohort is based on the general population or on a restricted source population; in the latter case, the sample selection can either increase or decrease the overall bias, with a magnitude and direction that are difficult to predict if multiple risk factors are involved.24

Third, we have shown that even when all of the associations involved in the selection and outcome mechanisms are reasonably large (eg, all ORs/HRs of 4.0 or 0.25), the prevalence of the risk factor R is about 50% and there is no adjustment for R, the resulting bias is relatively weak (ie, ±0.15 on the log scale). This is reassuring, as this scenario is rather extreme and very unlikely to occur in practice. Besides, a disease risk factor with a 50% prevalence and a disease HR of 4.0 would have an attributable fraction of 60% and is therefore unlikely not to have been known and measured when a study is planned.
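The 60% figure quoted above is consistent with the standard population attributable fraction formula, p(RR−1)/[1+p(RR−1)]. The short check below assumes that this is the formula intended and that the HR can stand in for the rate ratio; the function name is illustrative.

```python
def attributable_fraction(prevalence, rate_ratio):
    """Population attributable fraction for a binary risk factor."""
    excess = prevalence * (rate_ratio - 1.0)
    return excess / (1.0 + excess)


# Risk factor with 50% prevalence and a disease HR of 4.0.
print(attributable_fraction(0.5, 4.0))
```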

The scenarios considered in our simulations were restricted to binary exposure and binary risk factor and assumed no association between the exposure and the risk factor in the general population. A limitation is that we examined only the case of a single unmeasured determinant of the disease that also influences the selection process. However, we believe it is unlikely that multiple and independent important disease risk factors would affect the sample selection. It is indeed reasonable to consider R as a vector resulting from the combination of a set of correlated risk factors, all moderately associated with S. Finally, we only showed the findings derived from the analyses based on the assumption of a null causal association between the exposure and the outcome of interest; however choosing a true associational value, β_DE, different from zero would not modify the simulation results and therefore our conclusions.

We conclude that using a restricted source population for a cohort study will, under a range of sensible scenarios, produce only weak bias in estimates of the exposure–disease associations. On the other hand, the use of such restrictions may increase the response rate and the exposure prevalence, as well as being the only feasible approach in many circumstances.

What we already know on this subject

Baseline selection of participants in cohort studies may affect the study validity.
This happens when, because of the selection process, the confounding effect of an unknown or unmeasured disease risk factor is larger in the selected sample than in the general population.

What this study adds

We conducted Monte Carlo simulations to quantify the likely extent of the selection bias affecting the exposure-disease association, varying all the parameters involved: prevalence and effects of exposure and risk factor on both the selection and outcome process, selection prevalence, baseline incidence rate of the outcome and sample size.
The maximum bias is relatively weak (±0.15 on the log hazard ratio scale). When scenarios typically seen in epidemiological studies were considered, the bias in the log hazard ratio dropped to ±0.02.

References

1. Doll R, Peto R, Boreham J, et al. Mortality in relation to smoking: 50 years' observations on male British doctors. BMJ 2004;328:1519.
2. Colditz GA, Hankinson SE. The Nurses' Health Study: lifestyle and health among women. Nat Rev Cancer 2005;5:388–96.
3. Magnani C, Ferrante D, Barone-Adesi F, et al. Cancer risk after cessation of asbestos exposure: a cohort study of Italian asbestos cement workers. Occup Environ Med 2008;65:164–70.
4. Lagerros YT, Bellocco R, Adami HO, et al. Measures of physical activity and their correlates: the Swedish National March Cohort. Eur J Epidemiol 2009;24:161–9.
5. Vineis P, Airoldi L, Veglia F, et al. Environmental tobacco smoke and risk of respiratory cancer and chronic obstructive pulmonary disease in former smokers and never smokers in the EPIC prospective study. BMJ 2005;330:277.
6. Beattie MS, Costantino JP, Cummings SR, et al. Endogenous sex hormones, breast cancer risk, and tamoxifen response: an ancillary study in the NSABP Breast Cancer Prevention Trial (P-1). J Natl Cancer Inst 2006;98:110–15.
7. Anonymous. The Million Women Study: design and characteristics of the study population. The Million Women Study Collaborative Group. Breast Cancer Res 1999;1:73–80.
8. Glymour MM. Using causal diagrams to understand common problems in social epidemiology. In: Jossey-Bass, ed. Methods in social epidemiology. San Francisco: Jossey-Bass, 2006.
9. Hernan MA, Hernandez-Diaz S, Robins JM. A structural approach to selection bias. Epidemiology 2004;15:615–25.
10. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology 1999;10:37–48.
11. Hernan MA, Hernandez-Diaz S, Werler MM, et al. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. Am J Epidemiol 2002;155:176–84.
12. Pearl J. Causal diagrams for empirical research. Biometrika 1995;82:669–88.
13. Pearl J. Causality: models, reasoning, and inference. Cambridge, UK: Cambridge University Press, 2000.
14. Ferrante JM, Chen PH, Crabtree BF, et al. Cancer screening in women: body mass index and adherence to physician recommendations. Am J Prev Med 2007;32:525–31.
15. Greenland S. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology 2003;14:300–6.
16. Yanagawa T. Case-control studies: assessing the effect of a confounding factor. Biometrika 1984;71:191–4.
17. Richiardi L, Baussano I, Vizzini L, et al. Feasibility of recruiting a birth cohort through the Internet: the experience of the NINFEA cohort. Eur J Epidemiol 2007;22:831–7.
18. Rothman KJ, Greenland S, Lash TL. Distinguishing selection bias from confounding. In: Rothman KJ, Greenland S, Lash TL, eds. Modern epidemiology. Philadelphia: Lippincott Williams & Wilkins, 2008:136–7.
19. Burton A, Altman DG, Royston P, et al. The design of simulation studies in medical statistics. Stat Med 2006;25:4279–92.
20. Roberts RS, Spitzer WO, Delmore T, et al. An empirical demonstration of Berkson's bias. J Chronic Dis 1978;31:119–28.
21. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med 2005;24:1713–23.
22. Clayton D, Hills M. Statistical models in epidemiology. Oxford: Oxford University Press, 1993.
23. Banks E, Beral V, Cameron R, et al. Comparison of various characteristics of women who do and do not attend for breast cancer screening. Breast Cancer Res 2002;4:R1.
24. VanderWeele TJ, Hernan MA, Robins JM. Causal directed acyclic graphs and the direction of unmeasured confounding bias. Epidemiology 2008;19:720–8.

Supplementary materials

Web Only Data jech.2009.107185


Footnotes

Funding The study was conducted within projects partially funded by Compagnia SanPaolo/FIRMS, the Piedmont Region, the Italian Ministry of University and Research (MIUR), the Italian Association for Research on Cancer (AIRC) and the Massey University Research Fund (MURF). The Centre for Public Health Research is supported by a Programme Grant from the Health Research Council of New Zealand.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.
