Cohort Discovery


Cohort Discovery is a self-service informatics tool that enables researchers to query a repository of de-identified UC Davis Health patient information gathered from multiple sources, including electronic medical records, lab results, and demographic data. The output of a query is a numeric count of the patients who match the criteria specified in that query. These counts can be used to assess the feasibility of a study or hypothesis by showing whether there are enough prospective subjects who meet the criteria entered during the query process.

To log in to Cohort Discovery, you must first complete the required training to obtain a login ID and password.

UC Davis researchers who complete the Cohort Discovery introductory training session will receive a login ID and password for Cohort Discovery after completing and submitting the Cohort Discovery Confidentiality Agreement (PDF) to

Researchers subsequently requiring identified data from the EMR must gain regulatory approval and submit a UC Davis Health IT Service Hub request, as described elsewhere.

Cohort Discovery Data

At UC Davis Health, data collected through clinical practice is de-identified and selectively displayed in Cohort Discovery. The data pulled into and displayed in Cohort Discovery is only as good as the source from which it is pulled (i.e., if EMR/Clarity data is erroneous then Cohort Discovery will reflect the same error).

The goal is to leverage existing clinical data to greater patient benefit by allowing researchers to

  • conduct cohort discovery queries,
  • search de-identified patient data for research purposes, and
  • conduct research preparatory and feasibility queries.

Cohort Discovery is BEST suited for tasks such as

  • cohort identification
  • patient recruitment
  • feasibility studies, by viewing the aggregate number of patients meeting specified criteria such as ICD-9 diagnosis, demographics, etc.
  • generate study hypothesis – view data grid that breaks down the patient count by
    •  age
    •  race
    •  gender
  • cost analysis

Cohort Discovery is NOT suited for tasks such as conducting a temporal query for individuals.

Cohort Discovery Introductory Training Session

This course introduces UC Davis Health researchers to the web client application Cohort Discovery. This tool provides researchers the ability to query de-identified patient data. You will learn what you can and cannot do with Cohort Discovery, including what type of data is available. You will also learn what you need to do to request access to the tool.

Activity Code DAHS-ITR-50082

For any questions, please contact

To meet federal and state compliance requirements, the following types of data are de-identified or excluded from Cohort Discovery:

  • prisoner data
  • patient ID (PAT ID) is replaced with a Pseudo ID
  • source field data such as but not limited to the following is de-identified:
    • Order Med ID
    • Pat CSN ID (Encounter ID)
    • Order ID
    • Encounter Num (from Finance for in-patient stay)
    • Medical Record Number (MRN)
  • patients 89 years and older
  • patient and doctor first and last names
  • phone, fax, and pager numbers
  • patient address
  • zip code truncated to 3 digits – zip set to “000” for 3-digit zip codes with populations < 20,000
  • birth dates are normalized to the first day of the birth year
  • dates are internally consistent, but shifted +/- up to 14 days per patient record
  • data from notes (clinician, progress notes, etc.)
  • genomic data
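
A few of the rules above (ZIP code truncation, birth date normalization, and per-patient date shifting) can be sketched in Python. This is only an illustration under stated assumptions, not the process Cohort Discovery actually uses: the function names, the placeholder set of low-population ZIP prefixes, and the choice to derive each patient's shift from the Pseudo ID are all hypothetical.

```python
import random
from datetime import date, timedelta

# Placeholder 3-digit ZIP prefixes treated as having populations < 20,000.
# Illustrative values only, not an authoritative list.
LOW_POPULATION_ZIP3 = {"036", "059"}


def truncate_zip(zip_code, low_population_prefixes=LOW_POPULATION_ZIP3):
    """Keep the first 3 digits; return "000" for low-population prefixes."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in low_population_prefixes else prefix


def normalize_birth_date(birth_date):
    """Move a birth date to the first day of its birth year."""
    return date(birth_date.year, 1, 1)


def patient_shift(pseudo_id, max_days=14):
    """Pick one offset of up to +/- max_days for a patient.

    Assumption: seeding the generator with the Pseudo ID makes the offset
    repeatable, so every date in one patient record moves by the same amount.
    """
    rng = random.Random(pseudo_id)
    return timedelta(days=rng.randint(-max_days, max_days))


def shift_dates(dates, shift):
    """Apply the same offset to every date so intervals are preserved."""
    return [d + shift for d in dates]


shift = patient_shift("PSEUDO-0001")
visits = [date(2020, 1, 1), date(2020, 1, 11)]
shifted_visits = shift_dates(visits, shift)

print(truncate_zip("95817"))                      # ordinary prefix is kept
print(truncate_zip("03601"))                      # placeholder prefix becomes "000"
print(normalize_birth_date(date(1970, 6, 15)))    # first day of 1970
print(shifted_visits[1] - shifted_visits[0])      # interval between visits is unchanged
```

Because one offset is applied to the whole record, the gap between any two events for a patient stays the same, which is what keeps the shifted dates internally consistent.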

At present, genomic data and data from notes (such as clinician’s notes and progress notes) are not available for search in Cohort Discovery, but both are planned for future implementations.

Cohort Discovery is a research query tool powered by i2b2. i2b2 is an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System. The i2b2 Center is developing a scalable informatics framework that will bridge clinical research data and the vast data banks arising from basic science research in order to better understand the genetic bases of complex diseases. This knowledge will facilitate the design of targeted therapies for individual patients with diseases having genetic origins. The i2b2 is funded as a cooperative agreement with the National Institutes of Health (Informatics for Integrating Biology and the Bedside,

The i2b2 data model consists of facts and dimensions. A fact is the piece of information being queried. The dimensions are groups of hierarchies and descriptors that describe the facts (see figure below for additional details).

The i2b2 database uses a star schema that consists of one fact table (Observation_Fact) surrounded by several dimension tables, such as Concept_Dimension and Patient_Dimension. Facts in i2b2 are observations about a patient, such as diagnoses, demographics, and laboratory results.

Figure: database structure
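
The star schema described above, and the kind of patient count a query returns, can be illustrated with a small in-memory SQLite database. This is a minimal sketch rather than the actual i2b2 or Cohort Discovery schema: the tables carry only a few simplified columns, and the sample rows, the concept codes, and the count_patients helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table surrounded by two dimension tables (simplified columns).
conn.executescript("""
    CREATE TABLE patient_dimension (
        patient_num      INTEGER PRIMARY KEY,
        sex_cd           TEXT,
        age_in_years_num INTEGER
    );
    CREATE TABLE concept_dimension (
        concept_cd TEXT PRIMARY KEY,
        name_char  TEXT
    );
    CREATE TABLE observation_fact (
        patient_num INTEGER REFERENCES patient_dimension(patient_num),
        concept_cd  TEXT REFERENCES concept_dimension(concept_cd),
        start_date  TEXT
    );
""")

# Made-up sample data: three patients, two concepts, four observations.
conn.executemany(
    "INSERT INTO patient_dimension VALUES (?, ?, ?)",
    [(1, "F", 54), (2, "M", 61), (3, "F", 47)],
)
conn.executemany(
    "INSERT INTO concept_dimension VALUES (?, ?)",
    [("DX:DM2", "Type 2 diabetes"), ("DX:HTN", "Hypertension")],
)
conn.executemany(
    "INSERT INTO observation_fact VALUES (?, ?, ?)",
    [
        (1, "DX:DM2", "2020-01-05"),
        (1, "DX:DM2", "2020-06-10"),  # same patient observed twice
        (2, "DX:DM2", "2021-03-02"),
        (3, "DX:HTN", "2019-11-20"),
    ],
)


def count_patients(conn, concept_name, sex=None):
    """Count distinct patients with an observation of the named concept."""
    sql = """
        SELECT COUNT(DISTINCT f.patient_num)
        FROM observation_fact AS f
        JOIN concept_dimension AS c ON c.concept_cd = f.concept_cd
        JOIN patient_dimension AS p ON p.patient_num = f.patient_num
        WHERE c.name_char = ?
    """
    params = [concept_name]
    if sex is not None:
        sql += " AND p.sex_cd = ?"
        params.append(sex)
    return conn.execute(sql, params).fetchone()[0]


print(count_patients(conn, "Type 2 diabetes"))           # 2
print(count_patients(conn, "Type 2 diabetes", sex="F"))  # 1
```

Counting distinct patient numbers rather than rows matters here: a patient with several observations of the same concept still contributes only once to the cohort count.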