PGS Catalog Documentation
This page contains information regarding the development and contents of the PGS Catalog.
The Catalog is under active development, and a flagship publication is in preparation for early 2020. If you use the catalog before then, we ask that you cite it as:
- Samuel A. Lambert, Simon Jupp, Gad Abraham, Helen Parkinson, John Danesh, Jacqueline A.L. MacArthur, and Michael Inouye. (2019) The Polygenic Score (PGS) Catalog: a database of published PGS to enable reproducibility and uniform evaluation. www.pgscatalog.org.
PGS Catalog Feedback & Contact Information
To submit a PGS to the catalog, provide feedback, or ask questions please contact the PGS Catalog team at firstname.lastname@example.org.
PGS Catalog Inclusion Criteria
For a publication's data to be included in the PGS Catalog it must contain one of the following:
- A newly developed PGS. This includes the following information about the score and its predictive ability (evaluated on samples not used in training):
- Variant information necessary to apply the PGS to new samples (variant rsID and/or genomic position, weights/effect sizes, effect allele, genome build)
- Information about how the PGS was developed (computational method, variant selection, relevant parameters)
- Descriptions of the samples used for training (e.g. discovery of the variant associations, which can usually be extracted directly from the GWAS Catalog using GCST IDs, as well as fitting the PGS) and for external evaluation
- A description of PGS predictive ability (e.g. effect sizes [beta, OR, HR, etc.], classification accuracy, proportion of the variance explained (R2), and/or covariates evaluated in the PGS prediction)
- An evaluation of a previously developed PGS. This would include the evaluation of PGS already present in the catalog (or eligible for inclusion), on samples not used for PGS training. The requirements for description would be the same as for the evaluation of a new PGS.
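As a rough illustration of why the variant fields listed above are required, the sketch below applies a score as a weighted sum of effect-allele counts, which is the usual additive form of a PGS. The class and function names, the rsIDs and the weights are made up for illustration and are not taken from the catalog. The sketch also assumes the genotypes are already on the same strand and genome build as the score.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ScoreVariant:
    """One row of a hypothetical scoring file."""
    rsid: str           # variant identifier
    effect_allele: str  # allele that the weight refers to
    weight: float       # effect size, e.g. a beta or log odds ratio


def polygenic_score(variants: Iterable[ScoreVariant],
                    genotypes: Dict[str, Tuple[str, str]]) -> float:
    """Sum weight * (number of effect alleles carried) over all variants."""
    total = 0.0
    for v in variants:
        alleles = genotypes.get(v.rsid)
        if alleles is None:
            continue  # variant not genotyped: contributes nothing in this sketch
        dosage = sum(1 for allele in alleles if allele == v.effect_allele)
        total += v.weight * dosage
    return total


# Made-up variants and genotypes, for illustration only.
variants = [ScoreVariant("rs0001", "A", 0.5), ScoreVariant("rs0002", "T", -0.25)]
genotypes = {"rs0001": ("A", "A"), "rs0002": ("T", "C")}
score = polygenic_score(variants, genotypes)
```

Without the effect allele and genome build, the same weights could be applied to the wrong allele or position, which is why both are part of the inclusion criteria.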
In the current pilot of the catalog we have only curated PGS developed after 2010, and focused on including well-studied scores for the following traits: coronary artery disease (CAD), diabetes (types 1 and 2), obesity / body mass index (BMI), breast cancer, prostate cancer and Alzheimer’s disease. The catalog, however, is not limited to these traits, and in the future we plan to move to a model where researchers can submit PGS and evaluations for curation and subsequent inclusion in the catalog.
PGS Catalog Data Descriptions
In this section we describe the curated data and tables extracted from the PGS publications and presented in the catalog.
Publication
Each publication in the database is given a PGP (Polygenic Publication ID) so that scores and evaluations link to the same source object. When browsing by publications the Number of PGS Developed refers to the number of newly developed PGS in the paper, and the Number of PGS Evaluated refers to the number of PGS (new and existing) that have performance metrics derived in the study. For each publication the following information is extracted:
- PubMed ID: PubMed identification number
- doi: A Digital Object Identifier (doi) is also curated to allow unpublished work (e.g. pre-prints) to be added to the catalog
- Publication Title: Title of the publication
- Author(s): List of publication authors; the first author is also extracted for a shorter display
- Journal: The journal (or other venue, e.g. a pre-print server) in which the publication appears
- Publication Date: Date of publication (with respect to the PMID or doi upon DB upload)
Polygenic Score (PGS)
Each score in the database is given a unique PGS (Polygenic Score) ID to identify it. The following information is extracted and associated with each PGS in the catalog:
- Source (PGP ID): A PGP ID links the PGS to the publication in which it was described.
- Reported Trait: The author-reported trait (e.g. body mass index (BMI), or coronary artery disease) that the PGS has been developed to predict.
- Mapped Trait(s) / EFO ID(s): The Reported Trait is mapped to Experimental Factor Ontology (EFO) terms and their respective identifiers. For more information see the trait information.
- Score Name: This may be the name that the authors describe the PGS with in the source publication, or a name that a curator has assigned to identify the score during the curation process (before a PGS ID has been given).
- PGS Development Method: The name or description of the method or computational algorithm used to develop the PGS.
- PGS Development Details/Relevant Parameters: A description of the inputs and parameters relevant to the PGS development method/process.
- Original Genome Build: The version of the genome that the variants present in the PGS are associated with. Listed as NR (Not Reported) if unknown.
- Number of Variants: Number of variants used to calculate the PGS. In the future this will include a more detailed description of the types of variants present.
- Number of Variant Interaction Terms: Number of higher-order variant interactions included in the PGS.
- Contributing Samples: Information about the samples used for the development of the PGS. Relevant column descriptions are in the Sample Description section.
- Source of Variant Associations (GWAS): A table describing the samples used to define the variant associations/effect-sizes used in the PGS. These data are extracted from the GWAS Catalog when a study ID (GCST) is available.
- Score Development/Training: A table describing the samples used to develop or train the score (i.e. samples not used for variant discovery, and non-overlapping with the samples used to evaluate the PGS predictive ability).
- Performance Metrics: A record of the performance metrics that have been reported for the PGS. Relevant column descriptions are in the Performance Metrics section.
- Evaluated Samples: Information about the samples used in PGS performance evaluation. These samples have a PSS (PGS Catalog Sample Set ID) to link them to their associated performance metrics (and across different PGS). Relevant column descriptions are in the Sample Description section.
Trait
Traits in the catalog are displayed/grouped according to the Mapped Traits rather than the author Reported Traits to facilitate comparability, similar to the GWAS Catalog. For a complete description of why the trait ontology is employed please refer to related documentation from the GWAS Catalog. The Experimental Factor Ontology is hosted and described here: Experimental Factor Ontology (EFO). The information for each EFO trait ID that is stored in the PGS catalog is:
- Experimental Factor Ontology ID (EFO_ID): A trait identifier that links to the EFO and GWAS catalog.
- EFO Label: The trait label/name
- Trait Description: Detailed description of the trait from EFO.
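To make the benefit of grouping by Mapped Trait concrete, here is a minimal sketch in which two scores with differently worded Reported Traits end up under the same mapped identifier. The score IDs and EFO identifiers are placeholders invented for this example, and the function name is hypothetical; nothing here reflects the catalog's actual schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (placeholder score ID, author-reported trait, placeholder mapped EFO IDs)
ScoreRow = Tuple[str, str, List[str]]

scores: List[ScoreRow] = [
    ("PGS_A", "BMI", ["EFO_X"]),
    ("PGS_B", "Body mass index", ["EFO_X"]),
    ("PGS_C", "Coronary artery disease", ["EFO_Y"]),
]


def group_by_mapped_trait(rows: List[ScoreRow]) -> Dict[str, List[str]]:
    """Group score IDs by mapped trait ID, ignoring the free-text reported trait."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for score_id, _reported_trait, efo_ids in rows:
        for efo_id in efo_ids:  # a score may map to more than one trait
            groups[efo_id].append(score_id)
    return dict(groups)


grouped = group_by_mapped_trait(scores)
```

Grouping on the reported strings instead would have split "BMI" and "Body mass index" into separate groups.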
Sample Description
A consistent set of fields is used to describe the samples used to develop and test each PGS:
- PGS Catalog Sample Set (PSS) ID: PSS IDs are assigned to describe samples used in PGS evaluations (e.g. Performance Metrics). PSS IDs are not uniquely associated with a single PGS; multiple PGS can be evaluated on the same sample set (PSS ID).
- Study Identifiers: Identifiers used to link the samples to their initial descriptions (e.g. using PubMed IDs) and if they were used for variant associations to their associated GWAS studies (using GWAS Catalog GCST IDs).
- Sample Numbers: This field describes the number of individuals included in the sample, along with the number of cases and controls (if the trait is dichotomous) and the percent of participants that are male if available.
- Fields describing sample ancestry are curated according to the framework used to record ancestry data in the GWAS Catalog that was defined in Morales et al. Genome Biology (2018):
- Broad Ancestry Category: Author reported ancestry is mapped to the best matching ancestry category from the GWAS Catalog framework (Table 1, Morales et al. (2018)).
- Ancestry (e.g. French, Chinese): A more detailed description of sample ancestry that usually matches the most specific description described by the authors.
- Sample Ancestry: Displayed on the website; it combines the two ancestry fields above (with the more specific terms in brackets).
- Country of recruitment: Author reported countries of recruitment (if available).
- Additional Ancestry Description: Any additional description not captured in the structured data (e.g. founder or genetically isolated populations, or further description of admixed samples).
- Cohort(s): A list of cohorts that collected the samples. The initial list of common cohorts used in genetics studies that seeded these annotations is from Mills & Rahal. Communications Biology. (2019).
- Additional Sample/Cohort Information: Any additional description about the samples and what they were used for that is not captured by the structured categories (e.g. sub-cohort information).
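The displayed Sample Ancestry value described above could be assembled along the following lines. This is only a sketch: the function name is hypothetical, and it assumes that "brackets" means round parentheses and that the detailed ancestry may be absent.

```python
from typing import Optional


def sample_ancestry_label(broad_category: str,
                          detailed: Optional[str] = None) -> str:
    """Combine the broad category with the more specific term in brackets."""
    if detailed:
        return f"{broad_category} ({detailed})"
    return broad_category  # no specific description was reported


label = sample_ancestry_label("European", "French")
```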
Performance Metrics
Each evaluation of a PGS is given a PPM (PGS Performance Metric) ID that links it to a description of the results:
- PGS Catalog Sample Set (PSS) ID: ID that links to the samples on which the displayed PGS was evaluated.
- Source/Publication: ID that links to the publication where the performance metrics were reported.
- Trait: This field displays both the Reported and Mapped Traits. The reported trait often corresponds to the test set names reported in the publication, or more specific aspects of the phenotype being tested (e.g. if the disease cases are incident vs. recurrent events).
- The values of the performance metrics are all recorded in the same way (the estimate is recorded along with its 95% confidence interval, if supplied) and grouped according to the type of statistic they represent:
- PGS Effect Sizes (per SD change): Examples include regression coefficients (betas) for continuous traits, and odds ratios (OR) and/or hazard ratios (HR) for dichotomous traits, depending on the availability of time-to-event data.
- PGS Classification Metrics: Examples include the Area under the Receiver Operating Characteristic (AUROC) or Harrell's C-index (Concordance statistic).
- Other Metrics: Metrics that do not fit into the other two categories. Examples include: R2 (proportion of the variance explained), or reclassification metrics.
- Covariates Included in PGS Model: A comma-separated list of covariates used in the prediction model used to evaluate the PGS. Examples include: age, sex, smoking habits, etc.
- PGS Performance: Other Relevant Information: Any other information relevant to the understanding of the performance metrics.
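One way to picture how these fields fit together is a record that carries the PPM, PGS, PSS and PGP identifiers plus the three metric groups. The sketch below is an assumption about a possible in-memory layout, not the catalog's real data model. All IDs and numeric values are invented placeholders, and the helper shows that several PGS can share one sample set.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# (estimate, lower 95% CI bound, upper 95% CI bound); bounds are None if not supplied
Estimate = Tuple[float, Optional[float], Optional[float]]


@dataclass
class PerformanceMetric:
    """Hypothetical record for one evaluation of one PGS."""
    ppm_id: str  # identifies this evaluation
    pgs_id: str  # score that was evaluated
    pss_id: str  # sample set it was evaluated on
    pgp_id: str  # publication reporting the metrics
    effect_sizes: Dict[str, Estimate] = field(default_factory=dict)
    classification: Dict[str, Estimate] = field(default_factory=dict)
    other: Dict[str, Estimate] = field(default_factory=dict)
    covariates: List[str] = field(default_factory=list)


def scores_by_sample_set(metrics: List[PerformanceMetric]) -> Dict[str, List[str]]:
    """Map each sample set ID to the sorted score IDs evaluated on it."""
    grouped = defaultdict(set)
    for m in metrics:
        grouped[m.pss_id].add(m.pgs_id)
    return {pss_id: sorted(pgs_ids) for pss_id, pgs_ids in grouped.items()}


# Placeholder records with made-up values, for illustration only.
metrics = [
    PerformanceMetric("PPM_1", "PGS_A", "PSS_1", "PGP_1",
                      effect_sizes={"OR": (1.5, 1.25, 1.75)},
                      covariates=["age", "sex"]),
    PerformanceMetric("PPM_2", "PGS_B", "PSS_1", "PGP_1",
                      classification={"AUROC": (0.75, None, None)}),
]
shared = scores_by_sample_set(metrics)
```

Keeping the confidence bounds optional mirrors the note above that the interval is recorded only if it was supplied.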
We wish to acknowledge the help of the following people & teams for their support of the PGS Catalog:
- EMBL-EBI Samples Phenotypes and Ontologies Team - Helen Parkinson, Simon Jupp
- NHGRI-EBI GWAS Catalog Team - Aoife McMahon
- Inouye Lab Members - Jonathan Marten, Petar Scepanovic, Gad Abraham
Development of the PGS Catalog is funded by: