Copyright © The Author(s) 2019.
World J Gastroenterol. Jun 28, 2019; 25(24): 2990-3008
Published online Jun 28, 2019. doi: 10.3748/wjg.v25.i24.2990
Table 1 Advantages and shortcomings of Big Data analysis (with proposed solutions)

| Advantages | |
| Clinical data readily available with minimal resources required | |
| Can study rare exposures | |
| Can study rare events | |
| Can study long-term effects | |
| Real-world data | |
| Large sample size | |
| Subgroup analysis | |
| Sensitivity analysis | |
| Interaction of different variables | |
| Adjustment of outcome to a multitude of risk factors | |
| Precise estimation of effect size | |
| Reliable capture of small variations in incidence or disease flare | |
| No selection bias if n = all | |
| Shortcomings specific to Big Data analysis | Solution |
| Data validity | Cross-reference with medical records in a subset of the sample |
| Missing data | Statistical methods to deal with missing data (e.g., multiple imputation); text mining or natural language processing of unstructured data |
| Incomplete capture of variables or unavailability of certain diagnosis codes | Surrogate markers (e.g., COPD for smoking, alcohol-related diseases for alcoholism); inclusion of a large set of measured variables; text mining or natural language processing of unstructured data |
| Privacy | De-identification of individuals; review of study plan by local ethics committee |
| Hypothesis-free predictive models | Validation in prospective studies or randomized controlled trials |
| Shortcomings of all observational studies, including Big Data analysis | Solution |
| Residual and/or unmeasured confounding | Inclusion of a large set of measured variables; inclusion of RCT datasets with extensive collection of data and outcomes for trial participants, or linkage with other data sources; fulfilment of Bradford Hill criteria |
| Reverse causality/protopathic bias (outcome of interest leads to exposure of interest). Example: early symptoms of undiagnosed GC lead to PPI use, rather than PPIs causing GC | Cohort study design instead of case-control study design; exclusion of prescriptions of drugs of interest (e.g., PPIs) within a certain period (e.g., 6 mo) before development of the outcome of interest (e.g., gastric cancer) |
| Selection bias | Encompassing entire study population (n = all) |
| Indication bias (or confounding by indication/disease severity) | Balance of patient characteristics, in particular comorbidities that are indications for a certain treatment (e.g., PS matching of a large set of measured variables); negative control exposure |
| Confounding by functional status and cognitive impairment | Balance of patient characteristics, in particular comorbidities that can affect functional and cognitive status (e.g., PS matching) |
| Healthy user bias / adherer bias (individuals who are more health conscious tend to have better health outcomes) | Adjustment for other lifestyle factors through text mining or natural language processing of unstructured data |
| Immortal time bias (arises when the study outcome cannot occur during a period of follow-up due to study design) | Landmark analysis; analysis using time-varying covariates |
| Ascertainment bias / surveillance bias / detection bias (differential degree of surveillance or screening for the outcome among exposed and unexposed individuals). Example: PPI users may undergo upper endoscopy more frequently than non-PPI users, and hence more GC is detected in PPI users | Selection of an unexposed group with a similar likelihood of screening/testing; selection of an outcome that is likely to be diagnosed equally in exposed and control groups; adjustment for the surveillance rate |
| Access to healthcare | Stratified analysis according to patients’ residential regions (e.g., rural vs urban), socioeconomic status, immigration status, race/ethnicity, institutional factors (e.g., restrictive formularies) |
| Selective prescription and treatment in frail and very sick patients | PS methodology (trimming of areas of non-overlap, PS matching, PS by treatment interaction) |
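The lag-time exclusion that Table 1 lists as a remedy for protopathic bias can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any cited study: the function name `is_exposed_with_lag`, the 183-day default (an approximation of the 6 mo example) and the input layout are all hypothetical.

```python
from datetime import date, timedelta


def is_exposed_with_lag(prescription_dates, outcome_date, lag_days=183):
    """Classify a patient as exposed only if at least one prescription
    was issued on or before the start of the lag window.

    Prescriptions inside the window (the `lag_days` before the outcome)
    are ignored, because they may have been triggered by early symptoms
    of the still-undiagnosed outcome.
    """
    cutoff = outcome_date - timedelta(days=lag_days)
    return any(rx_date <= cutoff for rx_date in prescription_dates)


# Hypothetical patient diagnosed on 1 July 2015.
diagnosis = date(2015, 7, 1)

# Only prescription falls inside the lag window: treated as unexposed.
recent_only = is_exposed_with_lag([date(2015, 5, 1)], diagnosis)

# An older prescription exists as well: treated as exposed.
long_term = is_exposed_with_lag([date(2014, 1, 1), date(2015, 5, 1)], diagnosis)
```

The window length is a study-specific choice; a sensitivity analysis would typically repeat the classification with several values of `lag_days`.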
Table 2 Advantages of propensity score methodology

| Advantages | Remarks |
| Addressing “curse of dimensionality” when EPV < 10 | Traditional multivariable regression models yield similar results if EPV ≥ 10 | 
| Recognition of subjects with absolute indications (or contraindications) of an intervention | Exclusion of areas of non-overlap of the PS distribution between exposed and unexposed groups to ensure comparability | 
| Identification of PS interaction with treatment | Variation of effectiveness of an intervention according to indications (PS) may only be identified via stratified analysis by PS | 
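Two of the ideas in Table 2 can be made concrete with a short sketch. EPV is taken here to mean events per variable (outcome events divided by the number of covariates in the model), and trimming is shown in its simplest form, keeping only subjects whose PS lies inside the range shared by both groups. The function names and the example numbers are assumptions for illustration only.

```python
def events_per_variable(n_events, n_covariates):
    """Events per variable: outcome events divided by model covariates."""
    if n_covariates <= 0:
        raise ValueError("n_covariates must be positive")
    return n_events / n_covariates


def trim_to_common_support(ps_exposed, ps_unexposed):
    """Drop subjects whose propensity score falls outside the region
    where the exposed and unexposed PS distributions overlap.

    Returns the (low, high) bounds of the overlap and the retained
    scores for each group.
    """
    low = max(min(ps_exposed), min(ps_unexposed))
    high = min(max(ps_exposed), max(ps_unexposed))
    kept_exposed = [p for p in ps_exposed if low <= p <= high]
    kept_unexposed = [p for p in ps_unexposed if low <= p <= high]
    return (low, high), kept_exposed, kept_unexposed


# Hypothetical study: 120 outcome events, 15 candidate covariates.
epv = events_per_variable(120, 15)  # below 10, so PS methods may help

# Hypothetical propensity scores.
support, kept_exposed, kept_unexposed = trim_to_common_support(
    [0.20, 0.50, 0.90, 0.97],
    [0.05, 0.10, 0.30, 0.80],
)
```

In this example the exposed subjects with PS 0.90 and 0.97 have no unexposed counterparts and are excluded, which corresponds to removing patients with near-absolute indications for the intervention.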
Table 3 Examples of studies on gastric cancer research by utilization of large healthcare datasets

| Gastric cancer |
| Country/Region | Database | Reference | Area of research | Sample size | Design, statistical methods and 3V | Application |
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | Wu et al[46], 2009 | GC | 80255 | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Early vs late H. pylori eradication on GC risk |
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | Wu et al[48], 2010 | GC | 52161 | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Association between NSAIDs and GC |
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Cheung et al[51], 2018 | GC | 63397 | Territory-wide retrospective cohort study; PS regression adjustment; Volume, Velocity and Variety | Association between PPIs and GC |
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Cheung et al[49], 2018 | GC | 63605 | Territory-wide retrospective cohort study; PS regression adjustment; Volume, Velocity and Variety | Association between aspirin and GC |
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Leung et al[47], 2018 | GC | 63397 | Territory-wide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Effect of H. pylori eradication among different age groups |
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Cheung et al[50], 2018 | GC | 7266 | Territory-wide retrospective cohort study; PS regression adjustment; sensitivity analysis: PS weighting by IPTW and PS matching; Volume, Velocity and Variety | Association between metformin and GC |
| Sweden | Swedish Cancer Registry; Swedish Prescribed Drug Registry | Brusselaers et al[53], 2017 | GC | 797067 | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Association between PPIs and GC |
| Sweden | Swedish Cancer Registry; Swedish Prescribed Drug Registry | Doorakkers et al[45], 2018 | GC | 95176 | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Effect of H. pylori eradication on GC risk |
| United States | Kaiser Permanente (KP) | Schneider et al[55], 2016 | GC | 61684 | Retrospective cohort study; Volume, Velocity and Variety | Association between different PPIs and GC |
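Several studies in Table 3 compare the cohort with the general population to derive a standardized incidence ratio (SIR). By the conventional definition, the SIR is the observed number of cases divided by the number expected if the cohort had experienced the reference population's stratum-specific rates. The sketch below assumes that definition; the function name, the strata and all numbers are hypothetical and are not taken from any cited study.

```python
def standardized_incidence_ratio(observed_cases, strata):
    """Compute SIR = observed / expected.

    `strata` is an iterable of (person_years, reference_rate_per_100k)
    pairs, one per age/sex stratum of the cohort. Expected cases are
    the cohort's person-years multiplied by the reference rate.
    """
    expected_cases = sum(
        person_years * rate_per_100k / 100_000
        for person_years, rate_per_100k in strata
    )
    if expected_cases <= 0:
        raise ValueError("expected number of cases must be positive")
    return observed_cases / expected_cases


# Hypothetical cohort with two age strata.
strata = [
    (10_000, 50),   # younger stratum: 10 000 person-years, 50 per 100 000
    (5_000, 200),   # older stratum: 5 000 person-years, 200 per 100 000
]
sir = standardized_incidence_ratio(observed_cases=30, strata=strata)
```

Here 15 cases are expected and 30 are observed, so the SIR is about 2. A full analysis would also report a confidence interval, which this sketch omits.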
Table 4 Examples of studies on gastrointestinal bleeding and/or proton pump inhibitor research by utilization of large healthcare datasets

| Gastrointestinal bleeding and/or proton pump inhibitors |
| Country/Region | Database | Reference | Area of research | Sample size | Design, statistical methods and 3V | Application |
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | Wu et al[58], 2009 | PUD | 403567 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Effect of H. pylori therapy and PPIs on PUD |
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | Wu et al[95], 2011 | PUD | 32235 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Risk of rebleeding from PUD in ESRD patients |
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | Wu et al[59], 2010 | PPIs | 6552 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Effect of clopidogrel and PPIs on ACS |
| South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | Kim et al[96], 2019 | PPIs | 59233 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Effect of PPIs on thrombotic risk |
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Chan et al[62], 2015 | Dabigatran | 5041 | Territory-wide retrospective cohort study; Volume, Velocity and Variety | Risk factors for dabigatran-associated gastrointestinal bleeding |
Table 5 Examples of studies on inflammatory bowel disease research by utilization of large healthcare datasets

| Inflammatory bowel disease |
| Country/Region | Database | Reference | Area of research | Sample size | Design, statistical methods and 3V | Application |
| South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | Song et al[97], 2018 | UC | 11233 | Nationwide retrospective cohort study; comparator: general population; Volume, Velocity and Variety | Incidence and clinical impact of perianal disease in UC |
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | Chang et al[98], 2018 | IBD | 38039 | Nationwide retrospective cohort study to compare IBD patients with general population to derive SIR; hospital-based nested case-control study; Volume, Velocity and Variety | Association between IBD and herpes zoster infection |
| Sweden | Swedish Patient Registry | Myrelid et al[99], 2017 | UC | 63711 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Association between appendectomy and UC |
| Sweden | Swedish Medical Birth Register (child-mother link); Swedish Multigeneration Register (child-father link); Swedish Prescribed Drug Register; National Patient Register | Ortqvist et al[72], 2019 | IBD | 827,239 children born between 2006 and 2013 | Nationwide prospective population-based register study; Volume, Velocity and Variety | Association between maternal exposure to antibiotics during pregnancy and very early onset IBD in childhood |
| United States | NCBI Gene Expression Omnibus (GEO) | Dudley et al[70], 2011 | IBD | Not applicable | Signature inversion study; Volume, Velocity and Variety | Topiramate as a potential therapeutic agent against IBD |
| United States | Not applicable | Cai et al[20], 2018 | IBD | 1585 | Retrospective cohort study; natural language processing; Volume, Velocity and Variety | Association between arthralgia and biologics (anti-TNF vs vedolizumab) |
| Not applicable | International IBD Genetics Consortium's Immunochip project | Wei et al[64], 2013 | IBD | 53279 | Machine learning algorithm; Volume, Velocity and Variety | Predictors of IBD |
| United States | Not applicable | Hou et al[100], 2013 | IBD | 575 colonoscopy reports | Retrospective cohort study; natural language processing; Volume, Velocity and Variety | Differentiation of surveillance from non-surveillance colonoscopy |
| United States | Not applicable | Waljee et al[66], 2017 | IBD | 1080 | Retrospective cohort study; Random Forest machine learning algorithm | Prediction of IBD remission in thiopurine users |
| United States | Not applicable | Waljee et al[65], 2017 | IBD | 20368 | Retrospective cohort study; Random Forest machine learning algorithm | Prediction of hospitalization and outpatient steroid use |
| Not applicable | Phase 3 clinical trial data | Waljee et al[67], 2018 | IBD | 491 | Retrospective cohort study; Random Forest machine learning algorithm; Volume, Velocity and Variety | Prediction of steroid-free endoscopic remission with vedolizumab in UC |
Table 6 Examples of studies on colorectal cancer research by utilization of large healthcare datasets

| Colorectal cancer |
| Country/Region | Database | Reference | Area of research | Sample size | Design, statistical methods and 3V | Application |
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Cheung et al[101], 2019 | CRC | 197902 | Territory-wide retrospective cohort study; Volume, Velocity and Variety | Epidemiology, characteristics, risk factors and prognosis of postcolonoscopy colorectal cancer in Asians |
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Cheung et al[69], 2019 | CRC | 187897 | Territory-wide retrospective cohort study; PS matching; Volume, Velocity and Variety | Association between statins and CRC |
| United States | Nurses’ Health Study II (NHSII); Health Professionals Follow-up Study (HPFS) | Ma et al[74], 2018 | CRC | 134763 | Prospective cohort study; Volume and Variety | Association between DM and CRC |
| United States | Nurses’ Health Study (NHS); Health Professionals Follow-up Study (HPFS) | Yang et al[78], 2018; Hu et al[77], 2018; Song et al[79], 2018 | CRC | 1660; 1599; 1575 | Prospective cohort study; Volume and Variety | Effect of calcium intake, coffee and fibre on survival after CRC diagnosis |
| United States | Nurses’ Health Study (NHS); Nurses’ Health Study II (NHSII); Health Professionals Follow-up Study (HPFS) | He et al[76], 2018 | CRC | 141143 | Prospective cohort study; Volume and Variety | Risk factors of serrated polyps and conventional adenomas |
| United States | Nurses’ Health Study II (NHSII) | Liu et al[75], 2018 | CRC | 85256 | Prospective cohort study; Volume and Variety | Association between obesity and CRC |
| Netherlands | Dutch Lynch syndrome Registry | de Jong et al[80], 2006 | Various cancers including CRC | 2788 | Retrospective cohort study; Volume, Velocity and Variety | Decrease in CRC-related mortality in Lynch syndrome families by surveillance |
| Netherlands, Germany, Finland | Dutch Lynch syndrome Registry; German HNPCC Consortium; Finland | Engel et al[81], 2018 | CRC | 2747 patients with 16327 colonoscopies | Retrospective cohort study; Volume, Velocity and Variety | Surveillance interval on CRC incidence and stage |
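PS matching appears as a statistical method in the tables but is not spelled out. One common variant is greedy 1:1 nearest-neighbour matching without replacement, restricted by a caliper. The sketch below illustrates that variant only; the cited studies may have used different algorithms, and the function name, caliper value and scores are hypothetical.

```python
def greedy_ps_match(exposed, unexposed, caliper=0.05):
    """Greedy 1:1 nearest-neighbour propensity score matching.

    `exposed` and `unexposed` map subject IDs to propensity scores.
    Each exposed subject (processed in ascending PS order) is paired
    with the closest still-unmatched unexposed subject, provided the
    PS difference does not exceed `caliper`. Unexposed subjects are
    used at most once (matching without replacement).
    """
    available = dict(unexposed)
    pairs = []
    for exposed_id, exposed_ps in sorted(exposed.items(), key=lambda kv: kv[1]):
        if not available:
            break
        nearest_id = min(available, key=lambda k: abs(available[k] - exposed_ps))
        if abs(available[nearest_id] - exposed_ps) <= caliper:
            pairs.append((exposed_id, nearest_id))
            del available[nearest_id]
    return pairs


# Hypothetical propensity scores.
exposed_scores = {"e1": 0.30, "e2": 0.62, "e3": 0.95}
unexposed_scores = {"u1": 0.28, "u2": 0.60, "u3": 0.33, "u4": 0.10}

matched_pairs = greedy_ps_match(exposed_scores, unexposed_scores)
```

Subject `e3` stays unmatched because no unexposed subject lies within the caliper. Greedy matching is order-dependent, so optimal matching or a different processing order can yield a different set of pairs.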
Table 7 Examples of studies on hepatocellular carcinoma research by utilization of large healthcare datasets

| Hepatocellular carcinoma |
| Country/Region | Database | Reference | Area of research | Sample size | Design, statistical methods and 3V | Application |
| Taiwan, China | Publicly available data on HCC-related genes; Connectivity Map (CMap), which includes 6100 drug-mediated expression profiles | Chen et al[17], 2011 | HCC | Not applicable | Signature inversion study; Volume, Velocity and Variety | Anti-cancer effects of chlorpromazine and trifluoperazine on HCC |
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | Wu et al[89], 2012 | HCC | 4569 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Association between NA therapy and HCC recurrence among patients with HBV-related HCC after liver resection |
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | Chen et al[91], 2013 | HCC | 292290 | Nationwide case-control study; Volume, Velocity and Variety | Association between DM and HCC |
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | Wu et al[87], 2014 | HCC | 43190 | Nationwide retrospective cohort study; PS matching; Volume, Velocity and Variety | Association between NA therapy and HCC among CHB patients |
| China | The Cancer Genome Atlas (TCGA) database; Connectivity Map (CMap) | Wang et al[18], 2016 | HCC | Not applicable | Signature inversion study; Volume, Velocity and Variety | Anti-cancer effect of prenylamine on HCC |
| South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | Choi et al[90], 2018 | HCC | 24156 | Nationwide retrospective cohort study; Volume, Velocity and Variety | Difference between tenofovir and entecavir on reducing HCC risk |
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Seto et al[88], 2017 | HCC | Entire Hong Kong population between 1999 and 2012 | Territory-wide retrospective cohort study; Volume, Velocity and Variety | Association between NA therapy and HCC among CHB patients |
| Sweden | Swedish Cancer Registry; Swedish Patient Registry | Ji et al[102], 2012 | HCC | 9160 CHB patients | Nationwide retrospective cohort study; comparison with general population to derive SIR; Volume, Velocity and Variety | Association between concomitant HBV/HDV infection and HCC |
- Citation: Cheung KS, Leung WK, Seto WK. Application of Big Data analysis in gastrointestinal research. World J Gastroenterol 2019; 25(24): 2990-3008
- URL: https://www.wjgnet.com/1007-9327/full/v25/i24/2990.htm
- DOI: https://dx.doi.org/10.3748/wjg.v25.i24.2990