Review
Copyright ©The Author(s) 2019.
World J Gastroenterol. Jun 28, 2019; 25(24): 2990-3008
Published online Jun 28, 2019. doi: 10.3748/wjg.v25.i24.2990
Table 1 Advantages and shortcomings of Big Data analysis (with proposed solutions)
Advantages
Clinical data readily available with minimal resources required
Can study rare exposures
Can study rare events
Can study long-term effects
Real-world data
Large sample size
Subgroup analysis
Sensitivity analysis
Interaction of different variables
Adjustment of outcome to a multitude of risk factors
Precise estimation of effect size
Reliable capture of small variations in incidence or disease flare
No selection bias if n = all
Shortcomings specific to Big Data analysis | Solution
Data validity | Cross-reference with medical records in a subset of the sample
Missing data | Statistical methods to deal with missing data (e.g., multiple imputation); text mining or natural language processing of unstructured data
Incomplete capture of variables or unavailability of certain diagnosis codes | Surrogate markers (e.g., COPD for smoking, alcohol-related diseases for alcoholism); inclusion of a large set of measured variables; text mining or natural language processing of unstructured data
Privacy | De-identification of individuals; review of study plan by local ethics committee
Hypothesis-free predictive models | Validation in prospective studies or randomized controlled trials
Shortcomings of all observational studies, including Big Data analysis | Solution
Residual and/or unmeasured confounding | Inclusion of a large set of measured variables; inclusion of RCT datasets with extensive collection of data and outcomes for trial participants, or linkage with other data sources; fulfilment of Bradford Hill criteria
Reverse causality/protopathic bias (outcome of interest leads to exposure of interest). Example: early symptoms of undiagnosed GC lead to PPI use, rather than PPIs causing GC | Cohort study design instead of case-control study design; exclusion of prescriptions of drugs of interest (e.g., PPIs) within a certain period (e.g., 6 mo) before development of the outcome of interest (e.g., gastric cancer)
Selection bias | Encompassing the entire study population (n = all)
Indication bias (or confounding by indication/disease severity) | Balance of patient characteristics, in particular comorbidities that are indications for a certain treatment (e.g., PS matching of a large set of measured variables); negative control exposure
Confounding by functional status and cognitive impairment | Balance of patient characteristics, in particular comorbidities that can affect functional and cognitive status (e.g., PS matching)
Healthy user bias/adherer bias (individuals who are more health conscious tend to have better health outcomes) | Adjustment for other lifestyle factors (text mining or natural language processing of unstructured data)
Immortal time bias (arises when the study outcome cannot occur during a period of follow-up due to study design) | Landmark analysis; analysis using time-varying covariates
Ascertainment bias/surveillance bias/detection bias (differential degree of surveillance or screening for the outcome among exposed and unexposed individuals). Example: PPI users may undergo upper endoscopy more frequently than non-PPI users, so more GC is detected in PPI users | Selection of an unexposed group with a similar likelihood of screening/testing; selection of an outcome that is likely to be diagnosed equally in exposed and control groups; adjustment for the surveillance rate
Access to healthcare | Stratified analysis according to patients’ residential regions (e.g., rural vs urban), socioeconomic status, immigration status, race/ethnicity, institutional factors (e.g., restrictive formularies)
Selective prescription and treatment in frail and very sick patients | PS methodology (trimming of areas of non-overlap, PS matching, PS by treatment interaction)
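The lag-time exclusion that Table 1 lists against reverse causality (protopathic bias) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any cited study: the function name `drop_lagged_prescriptions`, the 183-day default (a rough stand-in for the 6 mo example), and the sample dates are all hypothetical.

```python
from datetime import date, timedelta


def drop_lagged_prescriptions(prescription_dates, index_date, lag_days=183):
    """Keep only prescriptions dated on or before the start of the lag window.

    index_date is the date of the outcome (or end of follow-up); lag_days is
    an assumed window length, roughly 6 months by default. Prescriptions
    inside the window are dropped because early, undiagnosed symptoms of the
    outcome may have prompted them.
    """
    cutoff = index_date - timedelta(days=lag_days)
    return [d for d in prescription_dates if d <= cutoff]


# Hypothetical patient: three PPI prescriptions, outcome diagnosed 2011-07-01.
prescriptions = [date(2010, 1, 1), date(2011, 3, 1), date(2011, 6, 1)]
kept = drop_lagged_prescriptions(prescriptions, date(2011, 7, 1))
print(kept)
```

Only the 2010 prescription counts as exposure here; the two written within the lag window are ignored. The window length is a design choice, and varying it is a natural sensitivity analysis.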
Table 2 Advantages of propensity score methodology
Advantages | Remarks
Addressing the “curse of dimensionality” when EPV < 10 | Traditional multivariable regression models yield similar results if EPV ≥ 10
Recognition of subjects with absolute indications (or contraindications) of an intervention | Exclusion of areas of non-overlap of the PS distribution between exposed and unexposed groups to ensure comparability
Identification of PS interaction with treatment | Variation of effectiveness of an intervention according to indications (PS) may only be identified via stratified analysis by PS
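Two ideas from Table 2, the events-per-variable (EPV) ratio and the exclusion of areas of non-overlap of the PS distribution, can be illustrated with a short Python sketch. It assumes the propensity scores have already been estimated elsewhere. The function names, the min/max definition of the overlap region, and the example numbers are assumptions chosen for illustration; other trimming rules exist.

```python
def events_per_variable(n_events, n_covariates):
    """EPV = number of outcome events divided by number of covariates."""
    if n_covariates <= 0:
        raise ValueError("n_covariates must be positive")
    return n_events / n_covariates


def common_support(ps_exposed, ps_unexposed):
    """Return (lower, upper) bounds of the region where both groups have scores.

    Assumed rule: the overlap runs from the larger of the two minima to the
    smaller of the two maxima.
    """
    lower = max(min(ps_exposed), min(ps_unexposed))
    upper = min(max(ps_exposed), max(ps_unexposed))
    return lower, upper


def trim_to_overlap(scores, lower, upper):
    """Drop subjects whose propensity score falls outside the overlap region."""
    return [p for p in scores if lower <= p <= upper]


# Hypothetical data: 40 events, 8 covariates, and a handful of scores.
epv = events_per_variable(40, 8)
exposed = [0.35, 0.60, 0.85, 0.97]
unexposed = [0.05, 0.30, 0.55, 0.80]
lower, upper = common_support(exposed, unexposed)
exposed_trimmed = trim_to_overlap(exposed, lower, upper)
unexposed_trimmed = trim_to_overlap(unexposed, lower, upper)
print(epv, (lower, upper), exposed_trimmed, unexposed_trimmed)
```

In this toy example the EPV is below 10, the situation in which Table 2 suggests PS methods help, and the exposed subjects with very high scores (near-absolute indications) fall outside the overlap region and are excluded.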
Table 3 Examples of studies on gastric cancer research by utilization of large healthcare datasets
Gastric cancer
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application
Taiwan, China | Taiwan National Health Insurance Database (NHID) | GC | 80255 | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Early vs late H. pylori eradication on GC risk; Wu et al[46], 2009
Taiwan, China | Taiwan National Health Insurance Database (NHID) | GC | 52161 | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Association between NSAIDs and GC; Wu et al[48], 2010
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | GC | 63397 | Territory-wide retrospective cohort study; PS regression adjustment; Volume, Velocity and Variety | Association between PPIs and GC; Cheung et al[51], 2018
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | GC | 63605 | Territory-wide retrospective cohort study; PS regression adjustment; Volume, Velocity and Variety | Association between aspirin and GC; Cheung et al[49], 2018
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | GC | 63397 | Territory-wide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Effect of H. pylori eradication among different age groups; Leung et al[47], 2018
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | GC | 7266 | Territory-wide retrospective cohort study; PS regression adjustment; sensitivity analysis: PS weighting by IPTW and PS matching; Volume, Velocity and Variety | Association between metformin and GC; Cheung et al[50], 2018
Sweden | Swedish Cancer Registry; Swedish Prescribed Drug Registry | GC | 797067 | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Association between PPIs and GC; Brusselaers et al[53], 2017
Sweden | Swedish Cancer Registry; Swedish Prescribed Drug Registry | GC | 95176 | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Effect of H. pylori eradication on GC risk; Doorakkers et al[45], 2018
United States | Kaiser Permanente (KP) | GC | 61684 | Retrospective cohort study; Volume, Velocity and Variety | Association between different PPIs and GC; Schneider et al[55], 2016
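Several studies in Table 3 compare a cohort with the general population to derive a standardized incidence ratio (SIR). The arithmetic behind an SIR (observed cases divided by the cases expected if the cohort had the reference population's stratum-specific rates) can be sketched as follows. The function names, age strata, rates and counts are invented for illustration and are not taken from any of the cited studies.

```python
def expected_cases(person_years_by_stratum, reference_rate_by_stratum):
    """Sum of cohort person-years times the reference rate, stratum by stratum."""
    return sum(
        person_years * reference_rate_by_stratum[stratum]
        for stratum, person_years in person_years_by_stratum.items()
    )


def standardized_incidence_ratio(observed, person_years_by_stratum,
                                 reference_rate_by_stratum):
    """SIR = observed cases / expected cases (indirect standardization)."""
    expected = expected_cases(person_years_by_stratum, reference_rate_by_stratum)
    if expected <= 0:
        raise ValueError("expected number of cases must be positive")
    return observed / expected


# Hypothetical cohort: person-years of follow-up per age stratum, and
# general-population incidence rates (cases per person-year) for the same strata.
person_years = {"40-59": 10000, "60+": 5000}
reference_rates = {"40-59": 0.001, "60+": 0.004}
sir = standardized_incidence_ratio(45, person_years, reference_rates)
print(round(sir, 2))
```

An SIR above 1 means more cases were observed in the cohort than the reference rates would predict. A real analysis would also report a confidence interval, which this sketch omits.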
Table 4 Examples of studies on gastrointestinal bleeding and/or proton pump inhibitor research by utilization of large healthcare datasets
Gastrointestinal bleeding and/or proton pump inhibitors
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application
Taiwan, China | Taiwan National Health Insurance Database (NHID) | PUD | 403567 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Effect of H. pylori therapy and PPIs on PUD; Wu et al[58], 2009
Taiwan, China | Taiwan National Health Insurance Database (NHID) | PUD | 32235 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Risk of rebleeding from PUD in ESRD patients; Wu et al[95], 2011
Taiwan, China | Taiwan National Health Insurance Database (NHID) | PPIs | 6552 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Effect of clopidogrel and PPIs on ACS; Wu et al[59], 2010
South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | PPIs | 59233 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Effect of PPIs on thrombotic risk; Kim et al[96], 2019
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Dabigatran | 5041 | Territory-wide retrospective cohort study; Volume, Velocity and Variety | Risk factors for dabigatran-associated gastrointestinal bleeding; Chan et al[62], 2015
Table 5 Examples of studies on inflammatory bowel disease research by utilization of large healthcare datasets
Inflammatory bowel disease
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application
South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | UC | 11233 | Nationwide retrospective cohort study; comparator: general population; Volume, Velocity and Variety | Incidence and clinical impact of perianal disease in UC; Song et al[97], 2018
Taiwan, China | Taiwan National Health Insurance Database (NHID) | IBD | 38039 | Nationwide retrospective cohort study to compare IBD patients with general population to derive SIR; hospital-based nested case-control study; Volume, Velocity and Variety | Association between IBD and herpes zoster infection; Chang et al[98], 2018
Sweden | Swedish Patient Registry | UC | 63711 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Association between appendectomy and UC; Myrelid et al[99], 2017
Sweden | Swedish Medical Birth Register (child-mother link); Swedish Multigeneration Register (child-father link); Swedish Prescribed Drug Register; National Patient Register | IBD | 827,239 children born between 2006 and 2013 | Nationwide prospective population-based register study; Volume, Velocity and Variety | Association between maternal exposure to antibiotics during pregnancy and very early onset IBD in adulthood; Ortqvist et al[72], 2019
United States | NCBI Gene Expression Omnibus (GEO) | IBD | Not applicable | Signature inversion study; Volume, Velocity and Variety | Topiramate as a potential therapeutic agent against IBD; Dudley et al[70], 2011
United States | Not applicable | IBD | 1585 | Retrospective cohort study; natural language processing; Volume, Velocity and Variety | Association between arthralgia and biologics (anti-TNF vs vedolizumab); Cai et al[20], 2018
Not applicable | International IBD Genetics Consortium's Immunochip project | IBD | 53279 | Machine learning algorithm; Volume, Velocity and Variety | Predictors of IBD; Wei et al[64], 2013
United States | Not applicable | IBD | 575 colonoscopy reports | Retrospective cohort study; natural language processing; Volume, Velocity and Variety | Differentiation of surveillance from non-surveillance colonoscopy; Hou et al[100], 2013
United States | Not applicable | IBD | 1080 | Retrospective cohort study; Random Forest machine learning algorithm | Prediction of IBD remission in thiopurine users; Waljee et al[66], 2017
United States | Not applicable | IBD | 20368 | Retrospective cohort study; Random Forest machine learning algorithm | Prediction of hospitalization and outpatient steroid use; Waljee et al[65], 2017
Not applicable | Phase 3 clinical trial data | IBD | 491 | Retrospective cohort study; Random Forest machine learning algorithm; Volume, Velocity and Variety | Prediction of steroid-free endoscopic remission with vedolizumab in UC; Waljee et al[67], 2018
Table 6 Examples of studies on colorectal cancer research by utilization of large healthcare datasets
Colorectal cancer
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | CRC | 197902 | Territory-wide retrospective cohort study; Volume, Velocity and Variety | Epidemiology, characteristics, risk factors and prognosis of postcolonoscopy colorectal cancer in Asians; Cheung et al[101], 2019
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | CRC | 187897 | Territory-wide retrospective cohort study; PS matching; Volume, Velocity and Variety | Association between statins and CRC; Cheung et al[69], 2019
United States | Nurses’ Health Study II (NHSII); Health Professionals Follow-up Study (HPFS) | CRC | 134763 | Prospective cohort study; Volume and Variety | Association between DM and CRC; Ma et al[74], 2018
United States | Nurses’ Health Study (NHS); Health Professionals Follow-up Study (HPFS) | CRC | 1660; 1599; 1575 | Prospective cohort study; Volume and Variety | Effect of calcium intake, coffee and fibre on survival after CRC diagnosis; Yang et al[78], 2018; Hu et al[77], 2018; Song et al[79], 2018
United States | Nurses’ Health Study (NHS); Nurses’ Health Study II (NHSII); Health Professionals Follow-up Study (HPFS) | CRC | 141143 | Prospective cohort study; Volume and Variety | Risk factors of serrated polyps and conventional adenomas; He et al[76], 2018
United States | Nurses’ Health Study II (NHSII) | CRC | 85256 | Prospective cohort study; Volume and Variety | Association between obesity and CRC; Liu et al[75], 2018
Netherlands | Dutch Lynch syndrome Registry | Various cancers including CRC | 2788 | Retrospective cohort study; Volume, Velocity and Variety | Decrease in CRC-related mortality in Lynch syndrome families by surveillance; de Jong et al[80], 2006
Netherlands, Germany, Finland | Dutch Lynch syndrome Registry; German HNPCC Consortium; Finland | CRC | 2747 patients with 16327 colonoscopies | Retrospective cohort study; Volume, Velocity and Variety | Surveillance interval on CRC incidence and stage; Engel et al[81], 2018
Table 7 Examples of studies on hepatocellular carcinoma research by utilization of large healthcare datasets
Hepatocellular carcinoma
Country/Region | Database | Area of research | Sample size | Design, statistical methods and 3V | Application
Taiwan, China | Publicly available data on HCC-related genes; Connectivity Map (CMap; includes 6100 drug-mediated expression profiles) | HCC | Not applicable | Signature inversion study; Volume, Velocity and Variety | Anti-cancer effects of chlorpromazine and trifluoperazine on HCC; Chen et al[17], 2011
Taiwan, China | Taiwan National Health Insurance Database (NHID) | HCC | 4569 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Association between NA therapy and HCC recurrence among patients with HBV-related HCC after liver resection; Wu et al[89], 2012
Taiwan, China | Taiwan National Health Insurance Database (NHID) | HCC | 292290 | Nationwide case-control study; Volume, Velocity and Variety | Association between DM and HCC; Chen et al[91], 2013
Taiwan, China | Taiwan National Health Insurance Database (NHID) | HCC | 43190 | Nationwide retrospective cohort study; PS matching; Volume, Velocity and Variety | Association between NA therapy and HCC among CHB patients; Wu et al[87], 2014
China | The Cancer Genome Atlas (TCGA) database; Connectivity Map (CMap) | HCC | Not applicable | Signature inversion study; Volume, Velocity and Variety | Anti-cancer effect of prenylamine on HCC; Wang et al[18], 2016
South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | HCC | 24156 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Difference between tenofovir and entecavir on reducing HCC risk; Choi et al[90], 2018
Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | HCC | Entire Hong Kong population between 1999 and 2012 | Territory-wide retrospective cohort study; Volume, Velocity and Variety | Association between NA therapy and HCC among CHB patients; Seto et al[88], 2017
Sweden | Swedish Cancer Registry; Swedish Patient Registry | HCC | 9160 CHB patients | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Association between concomitant HBV/HDV infection and HCC; Ji et al[102], 2012