INTRODUCTION
The association of smoking with prostate cancer (PCa) remains disputed because previous studies have reached different conclusions. Most epidemiological studies have found no association1, but there have been several reports of a positive association2, and some studies have even found that smoking may exert a protective effect against PCa risk3,4. These contradictory findings indicate that the effect of smoking on PCa incidence needs further investigation, taking into account that the discrepancies may stem largely from differences in the definition of smoking, the race of participants, and the research period3.
The National Health and Nutrition Examination Survey (NHANES) is a nationally representative survey of American civilians that provides comprehensive data on various aspects of health and nutrition5. The survey is unique in combining interviews and physical examinations. NHANES, therefore, provides high-quality and nationally representative data that can be used to determine the prevalence and risk factors for diseases. Although NHANES has a retrospective design and bias is inevitable, its comprehensive nature means that possible confounding factors can be controlled6. Cigarette smoking is the predominant mode of tobacco consumption, and the present cross-sectional study is the first to investigate the association of cigarette smoking with the risk of PCa using NHANES data.
Mendelian randomization (MR) is an epidemiological method that utilizes genetic variants as instrumental variables for quantifying exposure and can be used to estimate the potential causal role of exposure in disease development. The MR design mitigates confounding since the genetic variants are assorted randomly during gamete formation and are mostly independent of environmental and lifestyle factors7. In this study, we perform a two-sample MR analysis intending to clarify whether there are potential causal effects of cigarette smoking on PCa risk.
METHODS
Study design
Firstly, a secondary dataset analysis of pooled 2003–2018 NHANES data was conducted to explore whether smoking is associated with the risk of PCa. Subsequently, we conducted Mendelian randomization analysis based on publicly available genome-wide association study (GWAS) data to clarify the possible causal effect of smoking on PCa risk at the genetic level.
Cross-sectional study using the NHANES database
Study population in NHANES
NHANES has a 2-year-cycle cross-sectional design. The population included in this study comprised male responders who either had or had not received a PCa diagnosis, as determined using the following questions: ‘Have you ever been told that you had cancer or malignancy?’, ‘First cancer – what kind was it?’, ‘Second cancer – what kind was it?’, and ‘Third cancer – what kind was it?’. Responders who answered ‘yes’ to the first question and ‘prostate’ to any of the other three questions were identified as having PCa. All other responses were classified as PCa not being present. Those refusing to answer, answering ‘don't know’, or not responding to the first question were excluded, as were responders having more than three types of cancer.
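The classification rule described above can be sketched as a small function. This is only an illustration: the string answer values, the function name `classify_pca`, and the representation of cancer types as a list are hypothetical, and the real NHANES files use numeric response codes that are not reproduced here.

```python
from typing import List, Optional

# Hypothetical answer labels; actual NHANES variables use numeric codes.
EXCLUDED_ANSWERS = {"refused", "don't know"}


def classify_pca(ever_cancer: Optional[str], cancer_types: List[str]) -> Optional[bool]:
    """Return True (PCa), False (no PCa), or None (responder excluded)."""
    # Missing, refused, or "don't know" on the first question: excluded.
    if ever_cancer is None or ever_cancer in EXCLUDED_ANSWERS:
        return None
    # More than three types of cancer: excluded.
    if len(cancer_types) > 3:
        return None
    # 'yes' to the first question and 'prostate' among the reported cancers.
    if ever_cancer == "yes" and "prostate" in cancer_types:
        return True
    # Every other response pattern counts as PCa not being present.
    return False
```

A responder who answers ‘no’ to the first question is kept as a non-PCa participant, whereas a missing answer removes the responder from the analysis.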
Study variables in NHANES
The factor investigated in this study was the smoking status, which was categorized using the following question: ‘Have you smoked at least 100 cigarettes during your life?’8. Those refusing to answer, answering ‘don't know’, or having missing information were excluded. Based on previous epidemiology studies8-10, the influencing factors planned for analysis in the present study included age, race, education level, BMI (calculated from self-reported height and weight), hypertension status, diabetes status, and dietary intakes of energy, protein, carbohydrate, total fat, total polyunsaturated fat, cholesterol, vitamin E, vitamin A, calcium, magnesium, selenium, caffeine, and alcohol. The dietary data were based on the average of the total nutrient intakes on the first and second recall days. The definitions of all variables can be found on the NHANES website (https://www.cdc.gov/nchs/nhanes/). Those with unclear information on influencing factors were excluded. Individuals with extreme energy intake (more than 3 SD from the mean) were also excluded. Since there was only one day of dietary recall for individuals in the surveys conducted before 2002 and only a relatively small amount of data was available after 2019, we only included data for 2003–2018.
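The two dietary preprocessing steps described above (averaging the two recall days and excluding extreme energy intakes) might look like the following sketch. The function names are hypothetical, and the use of the sample standard deviation around the overall mean is an assumption, since the text does not specify how the ±3 SD bounds were computed.

```python
from statistics import mean, stdev
from typing import List


def two_day_average(day1: float, day2: float) -> float:
    """Average a nutrient intake over the first and second recall days."""
    return (day1 + day2) / 2


def within_three_sd(energy: List[float]) -> List[bool]:
    """Flag which participants have an energy intake within mean +/- 3 SD.

    Assumption: sample SD (n - 1 denominator) computed on all participants.
    """
    centre = mean(energy)
    spread = stdev(energy)
    return [abs(value - centre) <= 3 * spread for value in energy]
```

Participants flagged `False` would be the ones dropped before modelling.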
Mendelian randomization study
Selection of instrumental variables
Smoking behaviors were categorized as follows: 1) the lifetime smoking index, as derived from the most recent GWAS in a sample of 462690 European-descent individuals that identified 126 significant single-nucleotide polymorphisms (SNPs) related to that index11; 2) light smoking, defined as having smoked at least 100 cigarettes during the lifetime, as derived from the GWAS pipeline using PHESANT-derived variables from UK Biobank (GWAS ID=ukb-b-8133) (https://gwas.mrcieu.ac.uk/datasets/ukb-b-8133/); 3) smoking initiation, as derived from a GWAS of European-descent individuals (GWAS ID=ieu-b-4877)12; and 4) the amount of smoking per day, as derived from a GWAS of European-descent individuals (GWAS ID=ieu-b-25)12. We extracted the variants significantly associated with each trait (p<5×10⁻⁸). In addition, only those with a long physical distance (≥10000 kb) and a low probability of linkage disequilibrium (R²<0.001) were retained. Supplementary file Table 1 lists the instrumental variables.
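The selection logic above can be illustrated with a simplified sketch. It applies the genome-wide significance threshold and then greedily keeps the most significant SNP in each 10000 kb window. Note the loud assumption: real clumping also checks linkage disequilibrium (R²) against a reference panel, which is omitted here, so this distance-only pruning is stricter than the procedure in the text. The `Snp` class and `select_instruments` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snp:
    rsid: str
    chrom: int
    pos: int      # base-pair position
    p: float      # p-value for association with the exposure


def select_instruments(snps: List[Snp],
                       p_threshold: float = 5e-8,
                       window_bp: int = 10_000_000) -> List[Snp]:
    """Keep significant SNPs that are at least `window_bp` apart.

    Assumption: distance-only pruning; the LD (R^2 < 0.001) check that
    needs a reference panel is not implemented in this sketch.
    """
    significant = sorted((s for s in snps if s.p < p_threshold),
                         key=lambda s: s.p)
    kept: List[Snp] = []
    for snp in significant:
        far_from_all = all(k.chrom != snp.chrom
                           or abs(k.pos - snp.pos) >= window_bp
                           for k in kept)
        if far_from_all:
            kept.append(snp)
    return kept
```

Sorting by p-value first means that, within each window, the most strongly associated variant is the one retained.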
Table 1
GWAS summary statistics of PCa
Summary-level genetic data of GWASs for PCa (diagnosed using ICD10 or ICD9 codes) were obtained from 3 sources: 1) the FinnGen research project, which included 6311 PCa cases and 74685 controls (GWAS ID=finn-b-C3_PROSTATE_EXALLC); 2) UK Biobank, with 9132 PCa cases and 173493 controls (GWAS ID=ieu-b-4809); and 3) the Prostate Cancer Association Group to Investigate Cancer-Associated Alterations in the Genome (PRACTICAL) consortium, with 79148 PCa cases and 61106 controls (GWAS ID=ieu-b-85)13.
Statistical analysis
All analyses were restricted to male subjects. For NHANES data, we compared the distribution of basic information between PCa cases and non-PCa controls using the independent-sample t-test and the Pearson chi-squared test, as appropriate. Binary logistic regression was then used to evaluate the association between smoking and PCa. Four models were used in this analysis: 1) a univariate logistic regression model containing only the smoking status; 2) Model 1: multivariate logistic regression containing the smoking status, with age, race, BMI, education level, hypertension status, and diabetes status as confounding factors; 3) Model 2: Model 1 with dietary factors (as continuous variables) added as additional covariates; and 4) Model 3: Model 1 with dietary factors (categorized approximately by quartiles) added as additional covariates.
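For the univariate model, a logistic regression with a single binary predictor gives the same odds ratio as the cross-product ratio of the 2×2 table, with a Wald standard error equal to the square root of the summed reciprocal cell counts. The sketch below shows that calculation; the function name and the cell labels are hypothetical, and the counts in any call are illustrative rather than taken from the study.

```python
import math
from typing import Tuple


def odds_ratio_wald(a: int, b: int, c: int, d: int,
                    z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    return math.exp(log_or), lower, upper
```

The multivariable models (Models 1 to 3) cannot be reduced to a 2×2 table and would need an iterative fit in a statistics package.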
Propensity-score matching (PSM) was used to reduce selection bias by matching PCa cases and non-PCa controls on age, race, BMI, and education level, which were considered clinically pertinent. Matching was performed using the nearest-neighbor method in a 1:1 ratio, and the balance after PSM was assessed using a histogram. The four models above were then refitted using conditional logistic regression in the matched sample.
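The 1:1 nearest-neighbor step can be sketched as a greedy match on precomputed propensity scores. This is a minimal illustration under assumptions the text does not confirm: matching without replacement, no caliper, and cases processed in the order given. The function name and the dictionary inputs are hypothetical.

```python
from typing import Dict, List, Tuple


def nearest_neighbor_match(case_scores: Dict[str, float],
                           control_scores: Dict[str, float]
                           ) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbor matching on propensity scores.

    Assumptions: without replacement, no caliper, cases taken in input order.
    """
    available = dict(control_scores)  # controls not yet matched
    pairs: List[Tuple[str, str]] = []
    for case_id, score in case_scores.items():
        if not available:
            break  # no controls left to match
        best = min(available, key=lambda cid: abs(available[cid] - score))
        pairs.append((case_id, best))
        del available[best]
    return pairs
```

Because the matching is greedy, the result can depend on the order of the cases; optimal matching would avoid this at a higher computational cost.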
The assumptions for the MR analysis are shown in Supplementary file Figure 1. The random-effects inverse-variance weighting (IVW) method was used as the main statistical model to estimate the associations between smoking behavior and PCa risk. Heterogeneity between the SNPs was evaluated by calculating Cochran’s Q statistic and was considered to be present when the Cochran-Q-derived p<0.05. The F statistic (F=β²/SE²) was calculated to measure the instrument’s strength in the analyses, given a probable overlap between exposure and outcome data in the UK Biobank study. SNPs with an F statistic <10 were excluded. Horizontal pleiotropy was detected using the MR-PRESSO analysis method, with p<0.05 indicating its presence. The MR-PRESSO outlier test was performed when horizontal pleiotropy was detected. We then analyzed whether the MR results changed after removing outliers. Estimates from PRACTICAL, FinnGen, and UK Biobank were combined using fixed-effects (I²<50%) or random-effects (I²≥50%) meta-analysis methods, as appropriate. Two other sensitivity analysis methods (weighted median and MR-Egger regression) were performed to assess the robustness of the MR results. The weighted median model can provide consistent estimates on the condition that ≥50% of the weight in the analysis comes from valid instrumental variables. The MR-Egger sensitivity estimator can provide unbiased estimates of causal effects even if all SNPs in an instrument are invalid because of pleiotropy. However, it requires the assumption that the pleiotropic effects of the genetic variants on the outcome are independent of their effects on the exposure (the InSIDE assumption). All tests were two-sided, and associations are presented as odds ratios (ORs) with 95% confidence intervals (95% CIs). The analyses were performed using the TwoSampleMR and MR-PRESSO packages in R software (version 4.0.2).
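The main quantities named above (the F statistic, the IVW estimate, Cochran's Q, and I²) can be sketched from per-SNP summary statistics as follows. This is an illustration written in Python rather than the R packages the study used, and it assumes the multiplicative random-effects form of IVW, in which the fixed-effect standard error is inflated when Q exceeds its degrees of freedom; the function names are hypothetical.

```python
import math
from typing import List, Tuple


def f_statistic(beta_exposure: float, se_exposure: float) -> float:
    """Per-SNP instrument strength, F = beta^2 / SE^2."""
    return (beta_exposure / se_exposure) ** 2


def ivw_random_effects(beta_exposure: List[float],
                       beta_outcome: List[float],
                       se_outcome: List[float]) -> Tuple[float, float, float]:
    """Return (IVW estimate, SE, Cochran's Q) from SNP summary statistics.

    Assumption: multiplicative random effects, never shrinking the SE
    below its fixed-effect value.
    """
    ratios = [bo / be for be, bo in zip(beta_exposure, beta_outcome)]
    weights = [(be / so) ** 2 for be, so in zip(beta_exposure, se_outcome)]
    total_weight = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    df = len(ratios) - 1
    dispersion = max(1.0, q / df) if df > 0 else 1.0
    se = math.sqrt(dispersion / total_weight)
    return estimate, se, q


def i_squared(q: float, df: int) -> float:
    """Share of variability attributed to heterogeneity, I^2 = (Q - df) / Q."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q)
```

The estimate is on the log-odds scale, so `math.exp(estimate)` gives the OR, and an I² value from `i_squared` at or above 0.5 would correspond to the random-effects branch of the meta-analysis rule.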
RESULTS
Relationship between smoking and PCa risk in NHANES
The final analysis was applied to 16073 participants, comprising 554 with PCa and 15519 without PCa. The data extraction process is shown in Figure 1. The distributions of basic information and dietary data between PCa and non-PCa participants are presented in Table 1. Relative to the non-PCa participants, PCa participants were generally older; comprised a larger proportion of non-Hispanic Whites and non-Hispanic Blacks; had higher prevalence rates of smoking, hypertension, and diabetes; and had a lower BMI and lower levels of all nutrient intakes except vitamin A. After PSM, 554 matched pairs were identified. The histogram (Supplementary file Figure 2) indicated that the balance between PCa and non-PCa participants was good. After PSM, no factor differed significantly between the groups except that PCa participants had a higher vitamin E intake and a lower caffeine intake (Table 1).
The results from the analyses of the four logistic regression models using the population before and after PSM are presented in Table 2. None of the multivariable models showed a significant relationship between smoking status and PCa risk [adjusted OR with 95% CI before PSM=1.14 (0.95–1.37), 1.133 (0.94–1.37), and 1.17 (0.96–1.41) for Models 1, 2, and 3, respectively; adjusted OR with 95% CI after PSM=1.15 (0.9–2.46), 1.12 (0.86–2.37), and 1.14 (0.87–2.39) for Models 1, 2, and 3, respectively]. The only exception was the univariate logistic regression before PSM, which suggested that non-smoking is a protective factor for PCa [OR with 95% CI: 0.78 (0.66–0.93) before PSM and 1.13 (0.89–2.43) after PSM].
Table 2
Models | Before propensity-score matching a, p | Before propensity-score matching a, AOR (95% CI) | After propensity-score matching b, p | After propensity-score matching b, AOR (95% CI) |
---|---|---|---|---|
Univariable model | 0.01 | 0.78 (0.66–0.93)c | 0.32 | 1.13 (0.89–2.43)c |
Multivariable Model 1 | 0.17 | 1.14 (0.95–1.37) | 0.26 | 1.15 (0.9–2.46) |
Multivariable Model 2 | 0.2 | 1.133 (0.94–1.37) | 0.39 | 1.12 (0.86–2.37) |
Multivariable Model 3 | 0.11 | 1.17 (0.96–1.41) | 0.34 | 1.14 (0.87–2.39) |
Univariable model: univariate logistic regression containing only the smoking status. AOR: adjusted odds ratio. Model 1: multivariate logistic regression containing the smoking status, with age, race, BMI, education level, hypertension status, and diabetes status as confounding factors. Model 2: Model 1 with dietary factors (as continuous variables) added as additional covariates. Model 3: Model 1 with dietary factors (categorized approximately by quartiles) added as additional covariates. In all models, the exposure was smoking, defined as having smoked at least 100 cigarettes during life; the outcome was prostate cancer, defined as ever having been told of a prostate cancer diagnosis. NHANES: National Health and Nutrition Examination Survey. PCa: prostate cancer.
MR analysis of the association of the lifetime smoking index with PCa
The lifetime smoking index was associated with 126 SNPs, and their F statistics ranged from 21.78 to 196. The genetically predicted lifetime smoking index was not associated with the risk of PCa in the FinnGen consortium or UK Biobank study. In contrast, it was negatively correlated with the risk of PCa in the PRACTICAL study (OR=0.83; 95% CI: 0.70–0.97). A meta-analysis of the three data sources indicated that there was no significant association (OR=0.95; 95% CI: 0.83–1.09) (Figure 2), and this result remained consistent in sensitivity analyses (weighted median and MR-Egger regression methods) (Supplementary file Figures 3 and 4). We detected significant heterogeneity in the UK Biobank study (Q=210.20, p=2.04×10⁻⁷) and the PRACTICAL study (Q=256.59, p=1.24×10⁻¹²) but not in the FinnGen consortium (Q=107.36, p=0.73). MR-PRESSO analyses revealed significant horizontal pleiotropy for the UK Biobank and PRACTICAL studies (p<0.01), with one and three outliers found, respectively. The results did not change after removing the outliers (Supplementary file Table 2).
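Pooling study-level ORs requires each estimate on the log scale with its standard error, which can be recovered from a reported OR and 95% CI. The sketch below shows that conversion for the PRACTICAL estimate quoted above, together with a generic fixed-effects inverse-variance pooling function. The function names are hypothetical, and the assumption is that the reported interval is a symmetric Wald interval on the log-odds scale.

```python
import math
from typing import List, Tuple

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% CI


def log_or_and_se(odds_ratio: float, ci_low: float,
                  ci_high: float) -> Tuple[float, float]:
    """Recover log(OR) and its SE from a reported OR and 95% CI.

    Assumption: the CI is a Wald interval, symmetric on the log scale.
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return log_or, se


def fixed_effect_pool(estimates: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Inverse-variance fixed-effects pooling of (log_or, se) pairs."""
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


# PRACTICAL estimate for the lifetime smoking index, as reported in the text.
practical_log_or, practical_se = log_or_and_se(0.83, 0.70, 0.97)
```

Because each weight is the inverse of the squared standard error, the study with the narrowest interval dominates the pooled estimate.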
MR analysis of the association of light smoking with PCa
There were three SNPs associated with light smoking, and their F statistics ranged from 31.76 to 160.15. Genetically predicted light smoking was not associated with PCa in the three PCa GWAS data sets. A meta-analysis of the three data sources indicated no significant association (OR=1.00; 95% CI: 0.95–1.06), and this result remained consistent in sensitivity analyses (Supplementary file Figures 3 and 4). We did not detect any heterogeneity in the UK Biobank study (Q=2.62, p=0.27), FinnGen consortium (Q=2.19, p=0.33), or PRACTICAL study (Q=0.28, p=0.87). MR-PRESSO analyses were not performed because of the small number of SNPs.
MR analysis of the association of smoking initiation with PCa
Smoking initiation was associated with 92 SNPs, with F statistics ranging from 29.81 to 144.74. The MR results showed that smoking initiation was not associated with the risk of PCa in the three PCa GWAS data sets. A meta-analysis of the three data sources indicated no significant association (OR=0.99; 95% CI: 0.99–1.00), and this result remained consistent in sensitivity analyses (Supplementary file Figures 3 and 4). We detected significant heterogeneity in the UK Biobank study (Q=109.35, p=0.03) and PRACTICAL study (Q=170.16, p=8.44×10⁻⁸) but not in the FinnGen consortium (Q=104.92, p=0.05). MR-PRESSO revealed significant horizontal pleiotropy for the UK Biobank study (p=0.04) and PRACTICAL study (p<0.01), with zero and three outliers found, respectively. The results did not change after removing the outliers (Supplementary file Table 2).
MR analysis of the association of the amount of smoking per day with PCa
The amount of smoking per day was associated with 23 SNPs, with F statistics ranging from 29.9 to 953.27. The genetically predicted amount of smoking per day was not associated with the risk of PCa in the FinnGen consortium or UK Biobank study. In contrast, it was negatively correlated with the PCa risk in the PRACTICAL study (OR=0.93; 95% CI: 0.86–1.00). A meta-analysis of the three data sources found no significant association (OR=1.00; 95% CI: 0.99–1.00), and this result remained consistent in sensitivity analyses (Supplementary file Figures 3 and 4). We detected significant heterogeneity in the UK Biobank study (Q=36.62, p=0.02) but not in the FinnGen consortium (Q=19.80, p=0.53) or PRACTICAL study (Q=31.31, p=0.07). MR-PRESSO analyses revealed significant horizontal pleiotropy for the UK Biobank study (p=0.04), but no outliers were found.
DISCUSSION
Reducing the serious disease burden of PCa worldwide14 requires modifiable risk factors to be identified, among which smoking has been widely investigated1. However, the research findings for this factor are inconsistent and still need clarification. The present observational study analyzed a large sample population in NHANES and performed an MR study using publicly available GWAS data, with both investigations indicating that smoking is unlikely to be associated with the risk of PCa.
The unclear findings from observational studies of the relationship between smoking and the risk of PCa15,16 are at least partially attributable not only to the study design and confounding factors but also to how smoking is defined and to the proportions of subjects in different stages of PCa. Smoking is a lifestyle behavior that itself can be categorized into several states, such as current smoking, quitting smoking, severe smoking, mild smoking, and the use of filtered or unfiltered tobacco17. It is challenging to investigate factors with such high variability. Studies have shown that risk factors accumulate in the body to impact the disease18. The large heterogeneity of included populations can make it difficult to draw definitive conclusions, such as whether or not smoking impacts disease susceptibility. The main reasons for choosing ‘smoked at least 100 cigarettes during the lifetime’ as one of the smoking statuses in the present study were that this factor is relatively objective in NHANES and that this definition is less affected by the confounding effects caused by mild smoking. However, there is no clinical staging of PCa in NHANES, which makes it difficult to determine the impact of smoking on the risk of PCa at different stages.
The findings of this study support the view that smoking is not associated with the overall PCa risk. This is consistent with the results of most previous studies1, although some publications have reached different conclusions. A meta-analysis pooling data from prospective cohorts provided evidence for a negative correlation between smoking and PCa incidence19. However, that analysis did not address the presence of heterogeneity in the merged results, nor were subgroup analyses conducted based on population race. As mentioned above, smoking is a behavior that has several states, and these may vary markedly with race or income level; for example, unfiltered tobacco accounts for a higher proportion of use in low- and middle-income countries. However, the meta-analysis conducted by Cirne et al.20 revealed that smoking is not associated with the risk of PCa in low- and middle-income countries. As expected, there are also study results suggesting that smoking is associated with a higher risk of PCa2. Perhaps even more informative is the meta-analysis by Islami et al.3, which produced mixed results for the association between cigarette smoking and PCa risk: their overall analysis of all included studies showed no or negative correlations, whereas the studies completed up to 1995 showed a positive correlation3. Those authors attributed this to smoking reducing the risk of indolent, non-invasive cancer (which has dominated diagnoses in recent years) while promoting more-invasive cancer. Another analysis based on biopsy data supported that result by finding that neither overall PCa nor low-grade PCa was significantly associated with current or past smoking, whereas smoking was associated with an increased risk of high-grade PCa21.
These results further indicate that the relationship between smoking and PCa risk is influenced by the date range of the analyzed data, which is mainly attributable to differences in the risk grading of PCa patients associated with prostate-specific antigen (PSA) screening policies22. In short, the relationship between smoking and PCa risk is very complex and cannot be simply explained by the theory that harmful substances such as nicotine in tobacco increase the risk of cancer23.
Any population-based study analyzing the relationship between smoking and PCa risk is complicated by susceptibility to various confounding factors. Thus, we conducted an MR study, whose results agreed with those from the NHANES-based investigation in providing no evidence that smoking is associated with the risk of PCa. Larsson and Burgess24 conducted similar MR research and unexpectedly found a statistically non-significant negative correlation between smoking initiation and PCa. We speculate that these different conclusions are mainly attributable to discrepancies in the instrumental variables. We applied stricter inclusion criteria and found only 92 SNPs related to smoking initiation, which is far fewer than the 378 found by Larsson and Burgess24. Including more instrumental variables will generally increase the probability of horizontal pleiotropy being present, in which case the instrumental variables violate the exclusion-restriction assumption and the parameter estimates become biased25; however, the status of horizontal pleiotropy was not reported in the study of Larsson and Burgess24. In addition, those authors obtained significant results using the PRACTICAL population. Similarly, our study found that the lifetime smoking index and the amount of smoking per day were negatively correlated with the risk of PCa in the PRACTICAL population but not in the other two databases (UK Biobank and the Finland-based FinnGen consortium). We speculate that differences in sample sizes cause the inconsistent results between different database populations, while the negative correlation between smoking and PCa reflects detection bias; that is, the control group may be contaminated with undiagnosed cases, especially among smokers26, since smokers may be less likely to undergo PSA screening and therefore less likely to be diagnosed with early-stage PCa.
The results from the two parts of this study were relatively stable. In our NHANES-based investigation, we matched on important risk factors for PCa such as age, race, and BMI, which did not change the results. For the MR analysis, the weighting of the UK Biobank research was very high (>98%) in the meta-analysis, possibly due to this being the largest sample, which therefore exerted the dominant effect on the merged results. The results of our MR study using the UK Biobank population were relatively reliable. There are two aspects to note: 1) all instrumental variables other than the SNPs for light smoking (derived from UK Biobank) were derived from the GWAS and Sequencing Consortium of Alcohol and Nicotine use, which resulted in very little sample overlap between the exposure and outcome data; and 2) although horizontal pleiotropy was present in the MR results for the UK Biobank and PRACTICAL populations, the MR results did not change when outliers were excluded.
Limitations
This study has several limitations. Firstly, PCa was not classified into different stages, and studies have shown that the association between smoking and PCa is mainly present in patients with invasive PCa27. The outcomes of the present study come from different GWAS populations. Although advanced PCa accounts for 19.15% of cases in the PRACTICAL population28, the proportions in the other two databases are not publicly known, and relevant information cannot be obtained from NHANES. We therefore did not analyze the relationship between smoking and invasive PCa. Secondly, the definition of smoking can introduce limitations. We selected one variable in NHANES and four smoking variables in the MR analysis to represent the smoking status. The definitions of smoking used in this study may not fully reflect the multiple states of smoking behavior in the included populations. While smoking may exert harmful effects on physical health, mainly via substances such as nicotine, its direct impact may be on the respiratory system29. However, even if there is such an impact on PCa, both previous studies and the present study indicate that the level of evidence is much weaker than that for non-modifiable risk factors such as age30. Thirdly, despite applying strict inclusion criteria for the instrumental variables, heterogeneity remained significant in some of the present analyses. Although we used the random-effects model for IVW, future studies that apply stratified analyses are required. Fourthly, unobserved pleiotropy cannot be addressed in MR analysis. Fifthly, whether the observed associations for smoking status differ by age, by other potential factors, or by PCa severity could not be examined using the summary-level data in this study.