Resource | Description | Link |
---|---|---|
PGS Scoring Files & Metadata | Variant scoring files and metadata files for individual PGS | View PGS Score Directories (FTP) |
PGS Catalog Metadata | Available PGS global metadata files | Bulk Metadata Downloads (FTP) |
PGS Catalog REST API | Programmatic access to the PGS Catalog metadata | REST API endpoint documentation |
Python package pgscatalog_utils | A collection of tools, e.g. for downloading scoring files | Python package documentation |
The PGS Catalog FTP allows for consistent access to the bulk downloads, and is indexed by Polygenic Score (PGS) ID to allow programmatic access to score level data. The following diagram illustrates the FTP structure:
    ftp://ftp.ebi.ac.uk/pub/databases/spot/pgs
    ├── pgs_scores_list.txt (list of Polygenic Score IDs)
    ├── metadata/
    │   ├── pgs_all_metadata.xlsx
    │   ├── pgs_all_metadata_[sheet_name].csv (7 files)
    │   ├── pgs_all_metadata.tar.gz (xlsx + csv files)
    │   ├── publications/ (metadata for large studies)
    │   └── previous_releases/
    └── scores/
        ├── PGS000001/
        │   ├── Metadata/
        │   │   ├── PGS000001_metadata.xlsx
        │   │   ├── PGS000001_metadata_[sheet_name].csv (7 files)
        │   │   ├── PGS000001_metadata.tar.gz (xlsx + csv files)
        │   │   └── archived_versions/
        │   └── ScoringFiles/
        │       ├── PGS000001.txt.gz
        │       ├── archived_versions/
        │       └── Harmonized/
        │           ├── PGS000001_hmPOS_GRCh37.txt.gz
        │           └── PGS000001_hmPOS_GRCh38.txt.gz
        ├── PGS000002/
        ·   ├─ ···
        ·   └─ ···
        ·
        └── PGS00XXXX/
            ├─ ···
            └─ ···
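Because the layout is indexed by PGS ID, the location of a scoring file can be derived from the ID alone. The following is a minimal sketch of that idea; the function name `scoring_file_url` and the six-digit ID check are assumptions made for illustration, not part of any PGS Catalog tool.

```python
import re

FTP_ROOT = "ftp://ftp.ebi.ac.uk/pub/databases/spot/pgs"


def scoring_file_url(pgs_id, build=None):
    """Build the FTP URL of a scoring file from its PGS ID.

    With build=None the original scoring file is returned; with
    build="GRCh37" or "GRCh38" the harmonized (hmPOS) file is returned.
    """
    # Assumption: IDs look like 'PGS' followed by six digits (PGS######).
    if not re.fullmatch(r"PGS\d{6}", pgs_id):
        raise ValueError(f"not a PGS ID: {pgs_id!r}")
    base = f"{FTP_ROOT}/scores/{pgs_id}/ScoringFiles"
    if build is None:
        return f"{base}/{pgs_id}.txt.gz"
    if build not in ("GRCh37", "GRCh38"):
        raise ValueError(f"unsupported genome build: {build!r}")
    return f"{base}/Harmonized/{pgs_id}_hmPOS_{build}.txt.gz"


print(scoring_file_url("PGS000001"))
print(scoring_file_url("PGS000001", build="GRCh38"))
```

The IDs to iterate over can be taken from pgs_scores_list.txt at the root of the FTP.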
The bulk metadata download covers the entire PGS Catalog, describing all PGS in terms of their publication source, the samples used for development/evaluation, and related performance metrics. Download: Metadata file (.xlsx)
The bulk download contains a single Excel file with multiple sheets describing each of the data types. The sheets are also provided as individual .csv files for easier import into analysis tools; all of these files are in the metadata/ folder on the FTP.
Each scoring file (variant information, effect alleles/weights) is formatted as a gzipped tab-delimited text file, named after its PGS Catalog Score ID (e.g. PGS000001.txt.gz).
ftp://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS######/ScoringFiles/
Here is a description of the PGS Scoring Files header:
    ###PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information
    #format_version=Version of the scoring file format, e.g. '2.0'
    ##POLYGENIC SCORE (PGS) INFORMATION
    #pgs_id=PGS identifier, e.g. 'PGS000001'
    #pgs_name=PGS name, e.g. 'PRS77_BC' - optional
    #trait_reported=trait, e.g. 'Breast Cancer'
    #trait_mapped=Ontology trait name, e.g. 'breast carcinoma'
    #trait_efo=Ontology trait ID (EFO), e.g. 'EFO_0000305'
    #genome_build=Genome build/assembly, e.g. 'GRCh38'
    #variants_number=Number of variants listed in the PGS
    #weight_type=Variant weight type, e.g. 'beta', 'OR/HR' (default 'NR')
    ##SOURCE INFORMATION
    #pgp_id=PGS publication identifier, e.g. 'PGP000001'
    #citation=Information about the publication
    #license=License and terms of PGS use/distribution - refers to the EMBL-EBI Terms of Use by default
    rsID chr_name chr_position effect_allele other_allele effect_weight ...
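When a field has more than one value (for instance several mapped ontology traits), the values are separated by | ("pipe"), e.g.:
    #trait_mapped=Ischemic stroke|stroke
    #trait_efo=HP_0002140|EFO_0000712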
    ###PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information
    #format_version=2.0
    ##POLYGENIC SCORE (PGS) INFORMATION
    #pgs_id=PGS000348
    #pgs_name=PRS_PrCa
    #trait_reported=Prostate cancer
    #trait_mapped=prostate carcinoma
    #trait_efo=EFO_0001663
    #genome_build=GRCh37
    #variants_number=72
    #weight_type=log(OR)
    ##SOURCE INFORMATION
    #pgp_id=PGP000113
    #citation=Black M et al. Prostate (2020). doi:10.1002/pros.24058
    #license=Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). © 2020 Ambry Genetics.
    rsID chr_name chr_position effect_allele other_allele effect_weight ...
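A header in this key=value layout is straightforward to read programmatically. The sketch below collects the `#key=value` lines into a dictionary and stops at the column-name row; `parse_header` is a hypothetical helper name, and the logic assumes that field lines start with a single `#` while section titles start with `##` or `###`, as in the example above.

```python
def parse_header(lines):
    """Collect '#key=value' header lines into a dict.

    Reading stops at the first line that does not start with '#',
    which is the row of column names.
    """
    header = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#"):
            break  # column names: the header is finished
        if line.startswith("##") or "=" not in line:
            continue  # title lines such as '##SOURCE INFORMATION'
        key, value = line[1:].split("=", 1)
        header[key.strip()] = value.strip()
    return header


example = [
    "###PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information",
    "#format_version=2.0",
    "##POLYGENIC SCORE (PGS) INFORMATION",
    "#pgs_id=PGS000348",
    "#trait_mapped=Ischemic stroke|stroke",
    "#trait_efo=HP_0002140|EFO_0000712",
    "rsID\tchr_name\tchr_position",
]
header = parse_header(example)
# Pipe-separated fields can be split into their individual values.
print(header["pgs_id"], header["trait_efo"].split("|"))
```

For a downloaded file, the same function can be fed the open file handle, e.g. `with gzip.open(path, "rt") as fh: parse_header(fh)`, using the standard-library `gzip` module.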
Each scoring file has also been edited to have consistent column headings based on the following schema:
Column Header | Field Name | Field Requirement | Field Description |
---|---|---|---|
Variant Description: | | | |
rsID | dbSNP Accession ID (rsID) | Optional | The SNP’s rs ID. This column also contains HLA alleles in the standard notation (e.g. HLA-DQA1*0102) that aren’t always provided with chromosomal positions. |
chr_name | Location - Chromosome | Required | Chromosome name/number associated with the variant. |
chr_position | Location within the Chromosome | Required | Chromosomal position associated with the variant. |
effect_allele | Effect Allele | Required | The allele whose dosage is counted (e.g. {0, 1, 2}) and multiplied by the variant's weight (effect_weight) when calculating the score. The effect allele is also known as the 'risk allele'. Note: this does not necessarily need to correspond to the minor allele/alternative allele. |
other_allele | Other allele(s) | Recommended | The other allele(s) at the locus. Note: this does not necessarily need to correspond to the reference allele. |
locus_name | Locus Name | Optional | This is kept in for loci where the variant may be referenced by the gene (APOE e4). It is also common (usually in smaller PGS) to see the variants named according to the genes they impact. |
is_haplotype / is_diplotype | FLAG: Haplotype or Diplotype | Optional | This is a TRUE/FALSE variable that flags whether the effect allele is a haplotype/diplotype rather than a single SNP. Constituent SNPs in the haplotype are semicolon-separated. |
imputation_method | Imputation Method | Optional | This describes whether the variant was called with a specific imputation or variant calling method. It is mostly kept to describe HLA-genotyping methods (e.g. flag SNP2HLA, HLA*IMP) that give alleles that are not referenced by genomic position. |
variant_description | Variant Description | Optional | This field describes any extra information about the variant (e.g. how it is genotyped or scored) that cannot be captured by the other fields. |
inclusion_criteria | Score Inclusion Criteria | Optional | Explanation of when this variant gets included into the PGS (e.g. if it depends on the results from other variants). |
Weight Information: | | | |
effect_weight | Variant Weight | Required | Value of the effect that is multiplied by the dosage of the effect allele (effect_allele) when calculating the score. Additional information on how the effect_weight was derived is in the weight_type field of the header, and score development method in the metadata downloads. |
is_interaction | FLAG: Interaction | Optional | This is a TRUE/FALSE variable that flags whether the weight should be multiplied with the dosage of more than one variant. Interactions are demarcated with a _x_ between entries for each of the variants present in the interaction. |
is_dominant | FLAG: Dominant Inheritance Model | Optional | This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum if there is at least 1 copy of the effect allele (e.g. it is a dominant allele). |
is_recessive | FLAG: Recessive Inheritance Model | Optional | This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum only if there are 2 copies of the effect allele (e.g. it is a recessive allele). |
dosage_0_weight | Effect weight with 0 copies of the effect allele | Optional | Weights that are specific to different dosages of the effect_allele (e.g. {0, 1, 2} copies) can also be reported when the contribution of the variants to the score is not encoded as additive, dominant, or recessive. In this case three columns are added, giving the variant weight to apply for each dosage; the column names are formatted as dosage_#_weight, where # is the number of effect_allele copies. |
dosage_1_weight | Effect weight with 1 copy of the effect allele | Optional | |
dosage_2_weight | Effect weight with 2 copies of the effect allele | Optional | |
Other information: | | | |
OR / HR | Odds Ratio [OR], Hazard Ratio [HR] | Optional | Author-reported effect sizes can be supplied to the Catalog. If no other effect_weight is given, the weight is calculated using the log(OR) or log(HR). |
allelefrequency_effect | Effect Allele Frequency | Optional | Reported effect allele frequency; if the associated locus is a haplotype, the haplotype frequency is extracted instead. |
allelefrequency_effect_Ancestry | Population-specific effect allele frequency | Optional | Reported effect allele frequency in a specific population (described by the authors). |
    Scoring Files header
    rsID        chr_name  chr_position  effect_allele  other_allele  effect_weight
    rs2843152   1         2245570       G              C             -2.76009e-02
    rs35465346  1         22132518      G              A             2.39340e-02
    rs28470722  1         38386727      G              A             -1.74935e-02
    rs11206510  1         55496039      T              C             2.93005e-02
    rs9970807   1         56965664      C              T             4.70027e-02
    rs61772626  1         57015668      A              G             -2.71202e-02
    rs7528419   1         109817192     A              G             2.91912e-02
    rs1277930   1         109822143     A              G             2.60105e-02
    rs11102000  1         110298166     G              C             2.45969e-02
    rs11810571  1         151762308     G              C             2.09215e-02
    rs6689306   1         154395946     G              A             -1.97906e-02
    rs72702224  1         154911689     G              A             -2.81310e-02
    rs3738591   1         155764808     C              G             4.23731e-02
    ...
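The schema describes how a weight enters the score: multiplied by the effect-allele dosage by default, added once under a dominant or recessive flag, or looked up per dosage from the dosage_#_weight columns. The sketch below illustrates those rules only. The function names are hypothetical, genotypes are assumed to be already aligned to the effect allele and keyed by (chr_name, chr_position), and interaction and haplotype entries are not handled.

```python
def is_true(value):
    """Interpret a TRUE/FALSE flag column; missing or empty counts as FALSE."""
    return str(value).strip().upper() == "TRUE"


def variant_contribution(row, dosage):
    """Contribution of one variant given the effect-allele copies (0, 1 or 2)."""
    # Dosage-specific weights take priority when those columns are filled in.
    specific = row.get(f"dosage_{dosage}_weight")
    if specific not in (None, ""):
        return float(specific)
    weight = float(row["effect_weight"])
    if is_true(row.get("is_dominant", "")):
        return weight if dosage >= 1 else 0.0  # at least 1 copy
    if is_true(row.get("is_recessive", "")):
        return weight if dosage >= 2 else 0.0  # only with 2 copies
    return weight * dosage  # default: weight multiplied by dosage


def polygenic_score(rows, dosages):
    """Sum contributions over variants found in `dosages`.

    `dosages` maps (chr_name, chr_position) to the effect-allele dosage;
    variants without a genotype are skipped.
    """
    total = 0.0
    for row in rows:
        key = (row["chr_name"], row["chr_position"])
        if key in dosages:
            total += variant_contribution(row, dosages[key])
    return total


rows = [
    {"chr_name": "1", "chr_position": "2245570", "effect_weight": "-2.76009e-02"},
    {"chr_name": "1", "chr_position": "22132518", "effect_weight": "2.39340e-02"},
]
print(polygenic_score(rows, {("1", "2245570"): 2, ("1", "22132518"): 1}))
```

Rows can come from `csv.DictReader(fh, delimiter="\t")` once the `#` header lines have been skipped.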
Header layout that includes the format_version field:
    ###PGS CATALOG SCORING FILE - see ...
    #format_version=Version of the scoring file format, e.g. '2.0'
    ##POLYGENIC SCORE (PGS) INFORMATION
    #pgs_id=PGS identifier, e.g. 'PGS000001'
    #pgs_name=PGS name, e.g. 'PRS77_BC' - optional
    #trait_reported=trait, e.g. 'Breast Cancer'
    #trait_mapped=Ontology trait name, e.g. 'breast carcinoma'
    #trait_efo=Ontology trait ID (EFO), e.g. 'EFO_0000305'
    #genome_build=Genome build/assembly, e.g. 'GRCh38'
    #variants_number=Number of variants listed in the PGS
    #weight_type=Variant weight type, e.g. 'beta', 'OR/HR' (default 'NR')
    ##SOURCE INFORMATION
    #pgp_id=PGS publication identifier, e.g. 'PGP000001'
    #citation=Information about the publication
    #license=License and terms of PGS use/distribution ...
    rsID chr_name chr_position effect_allele other_allele ...
Header layout without the format_version field, using spaced field names and a reference_allele column:
    ### PGS CATALOG SCORING FILE - see ...
    ## POLYGENIC SCORE (PGS) INFORMATION
    # PGS ID = PGS identifier, e.g. 'PGS000001'
    # PGS Name = PGS name, e.g. 'PRS77_BC' - optional
    # Reported Trait = trait, e.g. 'Breast Cancer'
    # Original Genome Build = Genome build/assembly, e.g. 'GRCh38'
    # Number of Variants = Number of variants listed in the PGS
    ## SOURCE INFORMATION
    # PGP ID = PGS publication identifier, e.g. 'PGP000001'
    # Citation = Information about the publication
    # LICENSE = License and terms of PGS use/distribution ...
    rsID chr_name chr_position effect_allele reference_allele ...
Here is a description of the PGS Scoring Files header in the layout without the format_version field:
    ### PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information
    ## POLYGENIC SCORE (PGS) INFORMATION
    # PGS ID = PGS identifier, e.g. 'PGS000001'
    # PGS Name = PGS name, e.g. 'PRS77_BC' - optional
    # Reported Trait = trait, e.g. 'Breast Cancer'
    # Original Genome Build = Genome build/assembly, e.g. 'GRCh38'
    # Number of Variants = Number of variants listed in the PGS
    ## SOURCE INFORMATION
    # PGP ID = PGS publication identifier, e.g. 'PGP000001'
    # Citation = Information about the publication
    # LICENSE = License and terms of PGS use/distribution - refers to the EMBL-EBI Terms of Use by default
    rsID chr_name chr_position effect_allele reference_allele ...
    ### PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information
    ## POLYGENIC SCORE (PGS) INFORMATION
    # PGS ID = PGS000348
    # PGS Name = PRS_PrCa
    # Reported Trait = Prostate cancer
    # Original Genome Build = GRCh37
    # Number of Variants = 72
    ## SOURCE INFORMATION
    # PGP ID = PGP000113
    # Citation = Black M et al. Prostate (2020). doi:10.1002/pros.24058
    # LICENSE = Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). © 2020 Ambry Genetics.
    rsID chr_name chr_position effect_allele reference_allele effect_weight ...
Each scoring file in this layout (with a reference_allele column) has also been edited to have consistent column headings based on the following schema:
Column Header | Field Name | Field Description | Field Requirement |
---|---|---|---|
rsID | dbSNP Accession ID (rsID) | The SNP’s rs ID | Optional - unless both the chr_name and chr_position columns are absent. This column also contains HLA alleles in the standard notation (e.g. HLA-DQA1*0102) that aren’t always provided with chromosomal positions. |
chr_name | Location - Chromosome | Chromosome name/number associated with the variant | Required - may be optional if an rsID for the variant is provided |
chr_position | Location within the Chromosome | Chromosomal position associated with the variant | Required - may be optional if an rsID for the variant is provided |
effect_allele | Effect Allele | The allele whose dosage is counted (e.g. {0, 1, 2}) and multiplied by the variant's weight ('effect_weight') when calculating the score. The effect allele is also known as the 'risk allele'. | Required |
reference_allele | Reference Allele | The other allele(s) at the locus | Optional - but strongly recommended |
effect_weight | Variant Weight | Value of the effect that is multiplied by the dosage of the effect allele ('effect_allele') when calculating the score. | Required |
locus_name | Locus Name | This is kept in for loci where the variant may be referenced by the gene (APOE e4). It is also common (usually in smaller PGS) to see the variants named according to the genes they impact. | Optional |
weight_type | Type of Weight | Whether the author-supplied Variant Weight is a beta (effect size) or something like an OR/HR (odds/hazard ratio) | Optional |
allelefrequency_effect | Effect Allele Frequency | Reported effect allele frequency; if the associated locus is a haplotype, the haplotype frequency is extracted instead. | Optional |
is_interaction | FLAG: Interaction | This is a TRUE/FALSE variable that flags whether the weight should be multiplied with the dosage of more than one variant. Interactions are demarcated with a _x_ between entries for each of the variants present in the interaction. | Optional |
is_dominant | FLAG: Dominant Inheritance Model | This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum if there is at least 1 copy of the effect allele (e.g. it is a dominant allele). | Optional |
is_recessive | FLAG: Recessive Inheritance Model | This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum only if there are 2 copies of the effect allele (e.g. it is a recessive allele). | Optional |
is_haplotype / is_diplotype | FLAG: Haplotype or Diplotype | This is a TRUE/FALSE variable that flags whether the effect allele is a haplotype/diplotype rather than a single SNP. Constituent SNPs in the haplotype are semicolon-separated. | Optional |
imputation_method | Imputation Method | This describes whether the variant was called with a specific imputation or variant calling method. It is mostly kept to describe HLA-genotyping methods (e.g. flag SNP2HLA, HLA*IMP) that give alleles that are not referenced by genomic position. | Optional |
variant_description | Variant Description | This field describes any extra information about the variant (e.g. how it is genotyped or scored) that cannot be captured by the other fields. | Optional |
inclusion_criteria | Score Inclusion Criteria | Explanation of when this variant gets included into the PGS (e.g. if it depends on the results from other variants). | Optional |
Extra columns: | | | |
OR / HR | Odds Ratio [OR], Hazard Ratio [HR] | Author-reported effect sizes can be supplied to the Catalog. If no other effect_weight is given, the weight is calculated using the log(OR) or log(HR). | Optional |
allelefrequency_effect_Ancestry | Population-specific effect allele frequency | Reported effect allele frequency in a specific population (described by the authors). | Optional |
    Scoring Files header
    rsID        chr_name  chr_position  effect_allele  reference_allele  effect_weight
    rs2843152   1         2245570       G              C                 -2.76009e-02
    rs35465346  1         22132518      G              A                 2.39340e-02
    rs28470722  1         38386727      G              A                 -1.74935e-02
    rs11206510  1         55496039      T              C                 2.93005e-02
    rs9970807   1         56965664      C              T                 4.70027e-02
    rs61772626  1         57015668      A              G                 -2.71202e-02
    rs7528419   1         109817192     A              G                 2.91912e-02
    rs1277930   1         109822143     A              G                 2.60105e-02
    rs11102000  1         110298166     G              C                 2.45969e-02
    rs11810571  1         151762308     G              C                 2.09215e-02
    rs6689306   1         154395946     G              A                 -1.97906e-02
    rs72702224  1         154911689     G              A                 -2.81310e-02
    rs3738591   1         155764808     C              G                 4.23731e-02
    ...
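Code that has to read both column layouts can bring them to a single set of names first. The sketch below rests on two assumptions: that reference_allele and other_allele are interchangeable (both are described as the other allele(s) at the locus), and that log(OR) and log(HR) mean the natural logarithm. The helper names are hypothetical.

```python
import math


def normalise_row(row):
    """Return a copy of a row that uses the other_allele column name."""
    out = dict(row)
    # Assumption: reference_allele holds the same information as other_allele.
    if "reference_allele" in out and "other_allele" not in out:
        out["other_allele"] = out.pop("reference_allele")
    return out


def weight_from_ratio(ratio):
    """Turn an odds or hazard ratio into a weight via the natural log (assumed)."""
    return math.log(float(ratio))


row = {"rsID": "rs2843152", "effect_allele": "G", "reference_allele": "C"}
print(normalise_row(row))
print(weight_from_ratio("1.25"))
```

Copying the row keeps the parsed input untouched, so the original column names remain available for auditing.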
PGS Scoring Files in the Catalog are provided in a consistent format with standardized column names and data types, along with the genome build reported by the authors. However, the variant-level information in a PGS is often heterogeneously described: it may lack chromosome/position information, contain a mix of positions and/or rsIDs, or be mapped to a genome build different from your sample genotypes. To make variant matching and PGS calculation easier, we provide an additional set of files with extra columns containing harmonized variant information (chromosome name and base pair position) and variant identifiers (updated rsID) in commonly used genome builds (GRCh37/hg19 and GRCh38/hg38).
These harmonized files are generated with the pgs-harmonizer tool, which is based on the Open Targets and GWAS Catalog Summary Statistics harmonizer pipelines. To harmonize the variant positions the pgs-harmonizer performs the following tasks:
The resulting files contain new columns indicating the source of the variant annotation (hm_source), as well as consistently annotated chromosome (hm_chr), position (hm_pos), and rsID (hm_rsID), which can be used together with the alleles (effect_allele and other_allele) to match variants in your dataset.
ftp://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS######/ScoringFiles/Harmonized/
Harmonized file names are made of the PGS ID, "hmPOS", and the genome build, joined by underscores (_):
For instance: PGS000001_hmPOS_GRCh37.txt.gz
The header above the harmonization section (##HARMONIZATION DETAILS) is a copy-paste of the Scoring file header.
Here is a description of the PGS Harmonized Scoring Files header:
    ###PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information
    #format_version=Version of the scoring file format, e.g. '2.0'
    ##POLYGENIC SCORE (PGS) INFORMATION
    #pgs_id=PGS identifier, e.g. 'PGS000001'
    ...
    #license=License and terms of PGS use/distribution - refers to the EMBL-EBI Terms of Use by default
    ##HARMONIZATION DETAILS
    #HmPOS_build=Genome build of the harmonized file, e.g. 'GRCh38'
    #HmPOS_date=Date of the harmonized file creation, e.g. '2022-05-26'
    #HmPOS_match_chr=Number of entries matching and not matching the given chromosome, e.g. {"True": 5210, "False": 8}
    #HmPOS_match_pos=Number of entries matching and not matching the given position, e.g. {"True": 5210, "False": 8}
    rsID ... hm_source hm_rsID hm_chr hm_pos hm_inferOtherAllele hm_match_chr hm_match_pos
    ###PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information
    #format_version=2.0
    ##POLYGENIC SCORE (PGS) INFORMATION
    #pgs_id=PGS000348
    ...
    #license=Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). © 2020 Ambry Genetics.
    ##HARMONIZATION DETAILS
    #HmPOS_build=GRCh37
    #HmPOS_date=2022-07-26
    #HmPOS_match_chr={"True": 72, "False":0}
    #HmPOS_match_pos={"True": 72, "False":0}
    rsID ... hm_source hm_rsID hm_chr hm_pos hm_inferOtherAllele hm_match_chr hm_match_pos
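The HmPOS_match_chr and HmPOS_match_pos values in the examples above are written like JSON objects, so they can serve as a quick QC check. This is a minimal sketch that assumes the values always parse as JSON with "True" and "False" keys; `match_rate` is a hypothetical helper name.

```python
import json


def match_rate(value):
    """Fraction of entries counted under "True" in a HmPOS_match_* value.

    Returns None when the counts are empty.
    """
    counts = json.loads(value)
    matched = counts.get("True", 0)
    total = matched + counts.get("False", 0)
    return matched / total if total else None


print(match_rate('{"True": 72, "False":0}'))
print(match_rate('{"True": 5210, "False": 8}'))
```

A rate well below 1 suggests that the author-reported coordinates disagree with the harmonized ones and that the file deserves a closer look before use.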
Each HmPOS file contains the formatted scoring file (in the original genome build) plus the following additional columns, which describe the variants in the specified genome build:
Additional Column Header | Field Name | Field Description |
---|---|---|
hm_source | Provider of the harmonized variant information | Data source of the variant position. Options include: ENSEMBL, liftover, author-reported (if being harmonized to the same build). |
hm_rsID | Harmonized rsID | Current rsID. Differences between this column and the author-reported column (rsID) indicate variant merges and annotation updates from dbSNP. |
hm_chr | Harmonized chromosome name | Chromosome that the harmonized variant is present on, preferring matches to chromosomes over patches present in later builds. |
hm_pos | Harmonized chromosome position | Chromosomal position (base pair location) where the variant is located, preferring matches to chromosomes over patches present in later builds. |
hm_inferOtherAllele | Harmonized other alleles | If only the effect_allele is given we attempt to infer the non-effect/other allele(s) using Ensembl/dbSNP alleles. |
hm_match_chr | FLAG: matching chromosome name | Used for QC. Only provided if the scoring file is being harmonized to the same genome build, and where the chromosome name is provided in the column chr_name. |
hm_match_pos | FLAG: matching chromosome position | Used for QC. Only provided if the scoring file is being harmonized to the same genome build, and where the chromosome position is provided in the column chr_position. |
    ###PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information
    #format_version=2.0
    ##POLYGENIC SCORE (PGS) INFORMATION
    #pgs_id=PGS000116
    ...
    #genome_build=GRCh37
    ...
    ##HARMONIZATION DETAILS
    #HmPOS_build=GRCh38
    ...
    rsID         chr_name  chr_position  effect_allele  other_allele  effect_weight  hm_source  hm_rsID      hm_chr  hm_pos
    rs1921       1         949608        A              G             -0.003965      ENSEMBL    rs1921       1       1014228
    rs2710887    1         986443        T              C             -0.000846      ENSEMBL    rs2710887    1       1051063
    rs11260596   1         1002434       T              C             0.000789       ENSEMBL    rs11260596   1       1067054
    rs113355263  1         1069535       A              G             -0.001627      ENSEMBL    rs113355263  1       1134155
    rs11260539   1         1109903       T              C             0.000170       ENSEMBL    rs11260539   1       1174523
    ...
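The harmonized columns are meant for matching against your own genotype data. The sketch below matches on hm_chr, hm_pos, and the alleles. The function name and the shape of `genotype_variants` are assumptions for illustration; strand flips and entries with several other alleles are not handled.

```python
def match_variants(score_rows, genotype_variants):
    """Split scoring-file rows into matched and unmatched lists.

    score_rows: dicts with hm_chr, hm_pos, effect_allele and, optionally,
        other_allele (all strings).
    genotype_variants: dict mapping (chrom, pos) to the set of alleles
        observed at that site in the target dataset.
    """
    matched, unmatched = [], []
    for row in score_rows:
        site_alleles = genotype_variants.get((row["hm_chr"], row["hm_pos"]))
        wanted = {row["effect_allele"]}
        if row.get("other_allele"):
            wanted.add(row["other_allele"])
        # A match needs the same position and every listed allele at the site.
        if site_alleles is not None and wanted <= set(site_alleles):
            matched.append(row)
        else:
            unmatched.append(row)
    return matched, unmatched


rows = [
    {"hm_chr": "1", "hm_pos": "1014228", "effect_allele": "A", "other_allele": "G"},
    {"hm_chr": "1", "hm_pos": "1051063", "effect_allele": "T", "other_allele": "C"},
]
genotypes = {("1", "1014228"): {"A", "G"}, ("1", "1051063"): {"T", "G"}}
matched, unmatched = match_variants(rows, genotypes)
print(len(matched), len(unmatched))
```

Keeping the unmatched rows makes it easy to report how much of a score could not be applied to a given dataset.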