Copyright
©The Author(s) 2019.
World J Gastroenterol. Jun 28, 2019; 25(24): 2990-3008
Published online Jun 28, 2019. doi: 10.3748/wjg.v25.i24.2990
Advantages | |
Clinical data readily available with minimal resources required | |
Can study rare exposures | |
Can study rare events | |
Can study long-term effects | |
Real-world data | |
Large sample size | |
Subgroup analysis | |
Sensitivity analysis | |
Interaction of different variables | |
Adjustment of outcome to a multitude of risk factors | |
Precise estimation of effect size | |
Reliable capture of small variations in incidence or disease flare | |
No selection bias if n = all | |
Shortcomings specific to Big Data analysis | Solution |
Data validity | Cross reference with medical records in a subset of the sample |
Missing data | Statistical methods to deal with missing data (e.g., multiple imputation) |
Text mining or natural language processing of unstructured data | |
Incomplete capture of variables or unavailability of certain diagnosis codes | Surrogate markers (e.g., COPD for smoking, alcohol-related diseases for alcoholism) |
Inclusion of a large set of measured variables | |
Text mining or natural language processing of unstructured data | |
Privacy | De-identification of individuals |
Review of study plan by local ethics committee | |
Hypothesis-free predictive models | Validation in prospective studies or randomized controlled trials |
Shortcomings of all observational studies, including Big Data analysis | Solution |
Residual and/or unmeasured confounding | Inclusion of a large set of measured variables |
Inclusion of RCT datasets with extensive collection of data and outcomes for trial participants or linkage with other data sources | |
Fulfilment of Bradford Hill criteria | |
Reverse causality/protopathic bias (outcome of interest leads to exposure of interest) | Cohort study design instead of case-control study design |
Excluding prescriptions of drugs of interest (e.g., PPIs) within a certain period (e.g., 6 mo) before development of the outcome of interest (e.g., gastric cancer) | |
Example: Early symptoms of undiagnosed GC lead to PPI use, rather than PPIs causing GC | |
Selection bias | Encompassing entire study population (n = all) |
Indication bias (or confounding by indication/disease severity) | Balance of patient characteristics, in particular comorbidities that are indications for a certain treatment (e.g., PS matching of a large set of measured variables) |
Negative control exposure | |
Confounding by functional status and cognitive impairment | Balance of patient characteristics, in particular comorbidities that can affect functional and cognitive status (e.g., PS matching) |
Healthy user bias / adherer bias (individuals who are more health conscious tend to have better health outcomes) | Adjustment for other lifestyle factors – text mining or natural language processing of unstructured data |
Immortal time bias (arises when the study outcome cannot occur during a period of follow-up due to study design) | Landmark analysis |
Analysis using time varying covariates | |
Ascertainment bias / surveillance bias / detection bias (differential degree of surveillance or screening for the outcome among exposed and unexposed individuals) Example: PPI users may undergo upper endoscopy more frequently than non-PPI users, and hence more GC detected in PPI users | Selection of an unexposed group with a similar likelihood of screening/testing |
Selection of an outcome that is likely to be diagnosed equally in exposed and control groups | |
Adjustment for the surveillance rate | |
Access to healthcare | Stratified analysis according to patients’ residential regions (e.g., rural vs urban), socioeconomic status, immigration status, race/ethnicity, institutional factors (e.g., restrictive formularies) |
Selective prescription and treatment in frail and very sick patients | PS methodology (trimming of areas of non-overlap, PS matching, PS by treatment interaction) |
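The lag-time exclusion listed above as a remedy for reverse causality/protopathic bias can be sketched as follows. This is a minimal illustration, not the method of any study cited here: the function name `exposed_with_lag`, the `index_date` parameter (outcome date, or end of follow-up for subjects without the outcome) and the 183-day default standing in for "6 mo" are all assumptions made for the example.

```python
from datetime import date, timedelta


def exposed_with_lag(prescription_dates, index_date, lag_days=183):
    """Classify exposure while ignoring prescriptions inside the lag window.

    prescription_dates: dates on which the drug of interest was prescribed.
    index_date: outcome date (or end of follow-up if no outcome occurred).
    lag_days: length of the excluded window before index_date
              (hypothetical default of roughly 6 months).
    """
    cutoff = index_date - timedelta(days=lag_days)
    # Only prescriptions issued before the lag window count as exposure;
    # those inside the window may have been triggered by early symptoms
    # of the still-undiagnosed outcome.
    return any(d < cutoff for d in prescription_dates)


# A prescription 2 months before diagnosis is ignored ...
recent_only = exposed_with_lag([date(2015, 1, 10)], date(2015, 3, 1))
# ... whereas one issued more than a year earlier still counts.
long_term = exposed_with_lag(
    [date(2014, 1, 10), date(2015, 1, 10)], date(2015, 3, 1)
)
```

In practice the lag length is usually varied in sensitivity analyses, since the appropriate window depends on how long the outcome may remain undiagnosed.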
Advantages | Remarks |
Addressing “curse of dimensionality” when EPV < 10 | Traditional multivariable regression models yield similar results if EPV ≥ 10 |
Recognition of subjects with absolute indications (or contraindications) of an intervention | Exclusion of areas of non-overlap of the PS distribution between exposed and unexposed groups to ensure comparability |
Identification of PS interaction with treatment | Variation of effectiveness of an intervention according to indications (PS) may only be identified via stratified analysis by PS |
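Two of the points in this table lend themselves to a short sketch: the events-per-variable (EPV) check and the exclusion of areas of non-overlap of the PS distribution. The code below assumes propensity scores have already been estimated elsewhere; the function names and the min/max definition of the overlap region are illustrative assumptions, and other trimming rules (e.g., percentile-based) are also used.

```python
def events_per_variable(n_events, n_covariates):
    """EPV = number of outcome events divided by number of covariates."""
    if n_covariates <= 0:
        raise ValueError("n_covariates must be positive")
    return n_events / n_covariates


def trim_to_common_support(ps_exposed, ps_unexposed):
    """Keep only subjects whose PS lies in the region where both groups overlap.

    The overlap region is taken here as
    [max of the two group minima, min of the two group maxima].
    """
    lower = max(min(ps_exposed), min(ps_unexposed))
    upper = min(max(ps_exposed), max(ps_unexposed))
    kept_exposed = [p for p in ps_exposed if lower <= p <= upper]
    kept_unexposed = [p for p in ps_unexposed if lower <= p <= upper]
    return kept_exposed, kept_unexposed, (lower, upper)


# Hypothetical numbers: 45 events and 9 covariates give an EPV below 10,
# the setting in which the table suggests PS methods are advantageous.
epv = events_per_variable(45, 9)

# Hypothetical propensity scores for the two groups.
kept_exp, kept_unexp, bounds = trim_to_common_support(
    [0.2, 0.5, 0.9, 0.95], [0.05, 0.1, 0.3, 0.6]
)
```

Subjects outside the bounds correspond to those with near-absolute indications or contraindications, for whom no comparable counterpart exists in the other group.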
Gastric cancer | |||||
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application |
Taiwan, China | Taiwan National Health Insurance Database (NHID) | GC | 80255 | Nationwide retrospective cohort study | Early vs late H. pylori eradication on GC risk |
Wu et al[46], 2009 | |||||
Comparison with general population to derive SIR | |||||
Volume, Velocity and Variety | |||||
GC | 52161 | Nationwide retrospective cohort study | Association between NSAIDs and GC | ||
Wu et al[48], 2010 | |||||
Comparison with general population to derive SIR | |||||
Volume, Velocity and Variety | |||||
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | GC | 63397 | Territory-wide retrospective cohort study | Association between PPIs and GC |
Cheung et al[51], 2018 | |||||
PS regression adjustment | |||||
Volume, Velocity and Variety | |||||
GC | 63605 | Territory-wide retrospective cohort study | Association between aspirin and GC | ||
Cheung et al[49], 2018 | |||||
PS regression adjustment | |||||
Volume, Velocity and Variety | |||||
GC | 63397 | Territory-wide retrospective cohort study | Effect of H. pylori eradication among different age groups | ||
Leung et al[47], 2018 | |||||
Comparison with general population to derive SIR | |||||
Volume, Velocity and Variety | |||||
GC | 7266 | Territory-wide retrospective cohort study | Association between metformin and GC | ||
Cheung et al[50], 2018 | |||||
PS regression adjustment | |||||
Sensitivity analysis: PS weighting by IPTW and PS matching | |||||
Volume, Velocity and Variety | |||||
Sweden | Swedish Cancer Registry | GC | 797067 | Nationwide retrospective cohort study | Association between PPIs and GC |
Brusselaers et al[53], 2017 | |||||
Swedish Prescribed Drug Registry | Comparison with general population to derive SIR | ||||
Volume, Velocity and Variety | |||||
GC | 95176 | Nationwide retrospective cohort study | Effect of H. pylori eradication on GC risk | ||
Doorakkers et al[45], 2018 | |||||
Comparison with general population to derive SIR | |||||
Volume, Velocity and Variety | |||||
United States | Kaiser Permanente (KP) | GC | 61684 | Retrospective cohort study | Association between different PPIs and GC |
Schneider et al[55], 2016 | |||||
Volume, Velocity and Variety |
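Several studies in the table above compare a cohort with the general population to derive a standardized incidence ratio (SIR). As a rough sketch of that calculation, the SIR is the observed number of cases divided by the number expected if the cohort had experienced the reference population's stratum-specific incidence rates. The function name, strata labels and all numbers below are hypothetical and are not taken from any of the cited studies.

```python
def standardized_incidence_ratio(observed, person_years, reference_rates):
    """Return (SIR, expected cases).

    observed: number of cases observed in the study cohort.
    person_years: dict mapping stratum (e.g., age group) to cohort person-years.
    reference_rates: dict mapping the same strata to the general-population
                     incidence rate per person-year.
    """
    expected = sum(
        person_years[stratum] * reference_rates[stratum]
        for stratum in person_years
    )
    if expected <= 0:
        raise ValueError("expected number of cases must be positive")
    return observed / expected, expected


# Hypothetical cohort with two age strata.
sir, expected_cases = standardized_incidence_ratio(
    observed=14,
    person_years={"40-59": 10000, "60+": 5000},
    reference_rates={"40-59": 0.0002, "60+": 0.001},
)
```

An SIR above 1 indicates more cases in the cohort than expected from general-population rates; published analyses also report a confidence interval, which this sketch omits.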
Gastrointestinal bleeding and/or proton pump inhibitors | |||||
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application |
Taiwan, China | Taiwan National Health Insurance Database (NHID) | PUD | 403567 | Nationwide retrospective cohort study | Effect of H. pylori therapy and PPIs on PUD |
Wu et al[58], 2009 | |||||
Volume, Velocity and Variety | |||||
PUD | 32235 | Nationwide retrospective cohort study | Risk of rebleeding from PUD in ESRD patients | ||
Wu et al[95], 2011 | |||||
Volume, Velocity and Variety | |||||
PPIs | 6552 | Nationwide retrospective cohort study | Effect of clopidogrel and PPIs on ACS | ||
Wu et al[59], 2010 | |||||
Volume, Velocity and Variety | |||||
South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | PPIs | 59233 | Nationwide retrospective cohort study | Effect of PPIs on thrombotic risk |
Kim et al[96], 2019 | |||||
Volume, Velocity and Variety | |||||
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Dabigatran | 5041 | Territory-wide retrospective cohort study | Risk factors for dabigatran-associated gastrointestinal bleeding |
Chan et al[62], 2015 | |||||
Volume, Velocity and Variety |
Inflammatory bowel disease | |||||
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application |
South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | UC | 11233 | Nationwide retrospective cohort study | Incidence and clinical impact of perianal disease in UC |
Song et al[97], 2018 | |||||
Comparator: general population | |||||
Volume, Velocity and Variety | |||||
Taiwan, China | Taiwan National Health Insurance Database (NHID) | IBD | 38039 | Nationwide retrospective cohort study to compare IBD patients with general population to derive SIR | Association between IBD and herpes zoster infection |
Chang et al[98], 2018 | |||||
Hospital based nested case-control study | |||||
Volume, Velocity and Variety | |||||
Sweden | Swedish Patient Registry | UC | 63711 | Nationwide retrospective cohort study | Association between appendectomy and UC |
Myrelid et al[99], 2017 | |||||
Volume, Velocity and Variety | |||||
Swedish Medical Birth Register (child-mother link) | IBD | 827,239 children born between 2006 and 2013 | Nationwide prospective population-based register study | Association between maternal exposure to antibiotics during pregnancy and very early onset IBD in adulthood | |
Ortqvist et al[72], 2019 | |||||
Volume, Velocity and Variety | |||||
Swedish Multigeneration Register (child-father link) | |||||
Swedish Prescribed Drug Register; National Patient Register | |||||
United States | NCBI Gene Expression Omnibus (GEO) | IBD | Not applicable | Signature inversion study | Topiramate as a potential therapeutic agent against IBD |
Dudley et al[70], 2011 | |||||
Volume, Velocity and Variety | |||||
United States | Not applicable | IBD | 1585 | Retrospective cohort study; natural language processing | Association between arthralgia and biologics (anti-TNF vs vedolizumab) |
Cai et al[20], 2018 | |||||
Volume, Velocity and Variety | |||||
Not applicable | International IBD Genetics Consortium's Immunochip project | IBD | 53279 | Machine learning algorithm | Predictors of IBD |
Wei et al[64], 2013 | |||||
Volume, Velocity and Variety | |||||
United States | Not applicable | IBD | 575 colonoscopy reports | Retrospective cohort study; natural language processing | Differentiation of surveillance from non-surveillance colonoscopy |
Hou et al[100], 2013 | |||||
Volume, Velocity and Variety | |||||
United States | Not applicable | IBD | 1080 | Retrospective cohort study | Prediction of IBD remission in thiopurine users |
Waljee et al[66], 2017 | |||||
Random Forest machine learning algorithm | |||||
United States | Not applicable | IBD | 20368 | Retrospective cohort study | Prediction of hospitalization and outpatient steroid use |
Waljee et al[65], 2017 | |||||
Random Forest machine learning algorithm | |||||
Not applicable | Phase 3 clinical trial data | IBD | 491 | Retrospective cohort study | Prediction of steroid-free endoscopic remission with vedolizumab in UC |
Waljee et al[67], 2018 | |||||
Random Forest machine learning algorithm | |||||
Volume, Velocity and Variety |
Colorectal cancer | |||||
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application |
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | CRC | 197902 | Territory-wide retrospective cohort study | Epidemiology, characteristics, risk factors and prognosis of postcolonoscopy colorectal cancer in Asians |
Cheung et al[101], 2019 | |||||
Volume, Velocity and Variety | |||||
CRC | 187897 | Territory-wide retrospective cohort study | Association between statins and CRC | ||
Cheung et al[69], 2019 | |||||
PS matching | |||||
Volume, Velocity and Variety | |||||
United States | Nurses’ Health Study II (NHSII) | CRC | 134763 | Prospective cohort study | Association between DM and CRC |
Ma et al[74], 2018 | |||||
Volume and Variety | |||||
Health Professionals Follow-up Study (HPFS) | |||||
Nurses’ Health Study (NHS) | CRC | 1660 | Prospective cohort study | Effect of calcium intake, coffee and fibre on survival after CRC diagnosis | |
Yang et al[78], 2018 | |||||
1599 | |||||
Volume and Variety | |||||
Health Professionals Follow-up Study (HPFS) | Hu et al[77], 2018 | ||||
1575 | |||||
Song et al[79], 2018 | |||||
Nurses’ Health Study (NHS) | CRC | 141143 | Prospective cohort study | Risk factors of serrated polyps and conventional adenomas | |
He et al[76], 2018 | |||||
Nurses’ Health Study II (NHSII) | |||||
Volume and Variety | |||||
Health Professionals Follow-up Study (HPFS) | |||||
Nurses’ Health Study II (NHSII) | CRC | 85256 | Prospective cohort study | Association between obesity and CRC | |
Liu et al[75], 2018 | |||||
Volume and Variety | |||||
Netherlands | Dutch Lynch syndrome Registry | Various cancers including CRC | 2788 | Retrospective cohort study | Decrease in CRC-related mortality in Lynch syndrome families by surveillance |
de Jong et al[80], 2006 | |||||
Volume, Velocity and Variety | |||||
Netherlands, Germany, Finland | Dutch Lynch syndrome Registry | CRC | 2747 patients with 16327 colonoscopies | Retrospective cohort study | Surveillance interval on CRC incidence and stage |
Engel et al[81], 2018 | |||||
Volume, Velocity and Variety | |||||
German HNPCC Consortium | |||||
Finland |
Hepatocellular carcinoma | |||||
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application |
Taiwan, China | Publicly available data on HCC-related genes | HCC | Not applicable | Signature inversion study | Anti-cancer effects of chlorpromazine and trifluoperazine on HCC |
Chen et al[17], 2011 | |||||
Volume, Velocity and Variety | |||||
Connectivity Map (CMap) -- includes 6100 drug-mediated expression profiles | |||||
Taiwan National Health Insurance Database (NHID) | HCC | 4569 | Nationwide retrospective cohort study | Association between NA therapy and HCC recurrence among patients with HBV-related HCC after liver resection | |
Wu et al[89], 2012 | |||||
Volume, Velocity and Variety | |||||
Taiwan National Health Insurance Database (NHID) | HCC | 292290 | Nationwide case-control study | Association between DM and HCC | |
Chen et al[91], 2013 | |||||
Volume, Velocity and Variety | |||||
Taiwan National Health Insurance Database (NHID) | HCC | 43190 | Nationwide retrospective cohort study | Association between NA therapy and HCC among CHB patients | |
Wu et al[87], 2014 | |||||
PS matching | |||||
Volume, Velocity and Variety | |||||
China | The Cancer Genome Atlas (TCGA) database | HCC | Not applicable | Signature inversion study | Anti-cancer effect of prenylamine on HCC |
Wang et al[18], 2016 | |||||
Volume, Velocity and Variety | |||||
Connectivity Map (CMap) | |||||
South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | HCC | 24156 | Nationwide retrospective cohort study | Difference between tenofovir and entecavir on reducing HCC risk |
Choi et al[90], 2018 | |||||
Volume, Velocity and Variety | |||||
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | HCC | Entire Hong Kong population between 1999 and 2012 | Territory-wide retrospective cohort study | Association between NA therapy and HCC among CHB patients |
Seto et al[88], 2017 | |||||
Volume, Velocity and Variety | |||||
Sweden | Swedish Cancer Registry | HCC | 9160 CHB patients | Nationwide retrospective cohort study | Association between concomitant HBV/HDV infection and HCC |
Ji et al[102], 2012 | |||||
Swedish Patient Registry | |||||
Comparison with general population to derive SIR | |||||
Volume, Velocity and Variety |
- Citation: Cheung KS, Leung WK, Seto WK. Application of Big Data analysis in gastrointestinal research. World J Gastroenterol 2019; 25(24): 2990-3008
- URL: https://www.wjgnet.com/1007-9327/full/v25/i24/2990.htm
- DOI: https://dx.doi.org/10.3748/wjg.v25.i24.2990