Introduction - PRSTools#

Background#

While working on a manuscript focused on benchmarking machine learning algorithms for genotype-phenotype prediction, I received feedback recommending the inclusion of a broader analysis of PRS methodologies. This led me to investigate various PRS calculation tools used in genetic research, each of which applies unique approaches and assumptions that influence final PRS. These findings were supported by studies like this one on PubMed, which reviews the diversity in PRS tools.

The calculation of PRS depends on several factors:

  • Data Type Compatibility: Some tools rely on GWAS summary statistics, others use genotype data, and some utilize both types.

  • Modeling Differences: Tools apply different mathematical models, impacting PRS interpretation.

  • Reference Panels and Genome Builds: PRS tools may require specific reference panels, which affect compatibility and generalizability.

Key Challenges#

Implementing PRS tools for practical analysis posed several challenges:

  • Data Format Incompatibility: Different tools accept varying input formats, making data integration across tools challenging.

  • Limited Cross-Validation Support: Many tools lack built-in cross-validation functionality, essential for robust model validation.

  • HPC Scalability Constraints: Some tools are not optimized for high-performance computing (HPC), limiting scalability for analyses across multiple phenotypes.

Repository Overview#

To address these challenges, we created a new repository that provides a unified implementation of PRS calculation tools with enhancements for usability in real-world research settings. Key features include:

  1. Benchmarking: We benchmarked 46 PRS tools/methods on both binary and continuous phenotypes.

  2. Comparative Analysis: Performance, computation time, memory consumption, and beta distribution of each tool were compared.

  3. Data Transformation: Documentation on the necessary input data transformations ensures compatibility across PRS tools.

  4. Parallel Execution: Tools were implemented for parallel execution on HPC systems, enabling simultaneous analyses across multiple phenotypes.

  5. Detailed Documentation: Comprehensive end-to-end documentation for each PRS tool is available on GitHub.

  6. Diverse Dataset Testing: PRS tools were benchmarked on diverse datasets, revealing specific tools’ limitations for certain data types.

  7. Unified PRS Calculation Pipeline: A standardized evaluation process was applied across all tools.

  8. Cross-Validation Implementation: Five-fold cross-validation was incorporated to assess tool performance, an often-missing feature.

  9. Hyperparameter Tuning: PRS-specific hyperparameters (e.g., p-value thresholds, PCA counts) and tool-specific hyperparameters were included to optimize PRS accuracy.

Introduction

Conclusion#

In summary, this repository addresses some limitations in current PRS tools, providing researchers with an adaptable and efficient solution for polygenic risk score calculations. By improving compatibility, scalability, and documentation, the repository supports large-scale genetic studies and promotes broader use of PRS methodologies in research.

Discussion.png

PRS Tools Included in the Analysis.#

Following is the list of PRS tools included in the analysis.

Tool

Python

Conda Environment

Tool Link

Phenotype (Binary/Continuous/Both)

Requires Genotype data

Requires GWAS data

Requires Covariate data

Requires Reference data

Dependence

Language

Original article DOI

Plink

3

genetics

Plink 1.9

Both

Yes

Yes

No

No

No

C

10.1086/519795

PRSice-2

3

genetics

PRSice-2

Both

Yes

Yes

Optional

Optional, but recommended

No

C++ and R

10.1093/gigascience/giz082

GCTA

3

genetics

GCTA SBLUP

Both

Yes

Yes

No

No

Plink (PRS calculation)

C++

10.1007/978-1-62703-447-0_9

DBSLMM

3

genetics

DBSLMM

Both

No (optional, can be used as reference panel)

Yes

No

Optional, but recommended

Plink (PRS calc), LDpred-2 (Heritability calc)

C++ and R

10.1016/j.ajhg.2020.03.013

lassosum

3

genetics

lassosum

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

R

10.1002/gepi.22050

ldpred2_inf

3

genetics

LDpred2 inf

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

R

10.1093/bioinformatics/btaa1029

ldpred2_grid

3

genetics

LDpred2 grid

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

R

10.1093/bioinformatics/btaa1029

ldpred2_auto

3

genetics

LDpred2 auto

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

R

10.1093/bioinformatics/btaa1029

ldpred2_lassosum2

3

genetics

LDpred2 lassosum2

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

R

10.1093/bioinformatics/btaa1029

ldpred-funct

2

ldscc

LDpred-funct

Both

Yes

Yes

No

Yes

Plink (PRS calc), LDpred-2 (Heritability calc)

Python

10.1038/s41467-021-25171-9

SBayesR

3

genetics

SBayesR

Both

Yes - To create an LD matrix

Yes

No

Yes - LD Matrix

Plink (PRS calculation)

C++

10.1038/s41588-024-01704-y

SBayesRC

3

genetics

SBayesRC

Both

Yes - To create an LD matrix

Yes

No

Yes - LD Matrix

Plink (PRS calculation)

C++

10.1038/s41467-019-12653-0

LDAK-genotype

3

genetics

LDAK-genotype

Both

Yes

No

No

No

Plink (PRS calculation)

C++

10.1038/s41467-021-24485-y

LDAK-gwas

3

genetics

LDAK-gwas

Both

Yes

Yes

No

Yes - Correlation Matrix

Plink (PRS calculation)

C++

10.1038/s41467-021-24485-y

PRScs

3

genetics

PRScs

Both

Yes

Yes

No

Yes - LD Matrix

Plink (PRS calculation)

Python

10.1038/s41467-019-09718-5

PRScsx

3

genetics

PRScsx

Both

Yes

Yes

No

Yes - LD Matrix

Plink (PRS calculation)

Python

10.1038/s41588-022-01054-7

tlpSum

3

genetics

tlpSum

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

R

10.1371/journal.pcbi.1008271

PRSbils

3

genetics

PRSbils

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

CTPR

3

genetics

CTPR

Both

Yes

Yes

No

No

Plink (PRS calculation)

C++

10.1038/s41467-019-08535-0

NPS

3

genetics

NPS

Both

Yes

Yes

No

No

Plink (PRS calculation)

R

10.1016/j.ajhg.2020.05.004

SDPR

3

genetics

SDPR

Both

Yes - To create an LD matrix

Yes

No

No

Plink (PRS calculation)

C++

10.1371/journal.pgen.1009697

JAMPred

3

genetics

JAMPred

Both

No

Yes

No

No

Plink (PRS calculation)

R

10.1002/gepi.22245

EB-PRS

3

genetics

EB-PRS

Both

Yes

No

No

No

Plink (PRS calculation)

R

10.1371/journal.pcbi.1007565

PANPRS

3

genetics

PANPRS

Both

Yes - As LD matrix

Yes

No

Yes - LD Matrix

Plink (PRS calculation)

R

10.1080/01621459.2020.1764849

BOLT-LMM

3

genetics

BOLT-LMM

Both

Yes

No

Yes

Yes

Plink (PRS calculation)

C++

10.1038/ng.3190

RapidoPGS-single

3

AdvanceR

RapidoPGS-single

Both

No

Yes

No

No

Plink (PRS calculation)

R

10.1093/bioinformatics/btab456

LDpred-gibbs

3

genetics

LDpred-gibbs

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

10.1016/j.ajhg.2015.09.001

LDpred-p+t

3

genetics

LDpred-p+t

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

10.1016/j.ajhg.2015.09.001

LDpred-inf

3

genetics

LDpred-inf

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

10.1016/j.ajhg.2015.09.001

LDpred-fast

3

genetics

LDpred-fast

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

10.1016/j.ajhg.2015.09.001

Anno-Pred

2

ldscc

Anno-Pred

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

10.1371/journal.pcbi.1005589

smtpred-wMtOLS

2

ldscc

smtpred-wMtOLS

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

10.1038/s41467-017-02769-6

smtpred-wMtSBLUP

2

ldscc

GitHub

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

https://doi.org/10.1038/s41467-017-02769-6

C+T (Clumping and Thresholding)

3

genetics

Both

Yes

Yes

No

No

Plink (PRS calculation)

Python

viprs-simple

3

viprs_env

GitHub

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

https://doi.org/10.1016/j.ajhg.2023.03.009

viprs-grid

3

viprs_env

GitHub

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

https://doi.org/10.1016/j.ajhg.2023.03.009

HAIL

3

genetics

Notebook

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

https://doi.org/10.1038/s41588-023-01648-9

GEMMA-LM

3

genetics

GitHub

Both

Yes

Yes

Yes

No

Plink (PRS calculation)

C++

https://doi.org/10.1038/ng.2310

GEMMA-LLM

3

genetics

GitHub

Both

Yes

Yes

Yes

No

Plink (PRS calculation)

C++

https://doi.org/10.1038/ng.2310

GEMMA_BSLMM

3

genetics

GitHub

Both

Yes

Yes

Yes

No

Plink (PRS calculation)

C++

https://doi.org/10.1038/ng.2310

MTG2

3

genetics

Homepage

Both

Yes

No

Yes

No

Plink (PRS calculation)

C++

10.1093/bioinformatics/btw012

SCT

3

genetics

bigSNP

Both

Yes

Yes

No

No

Plink (PRS calculation)

R

https://doi.org/10.1016/j.ajhg.2019.11.001

XP-BLUP

3

genetics

GitHub

Both

Yes

No

No

No

Plink (PRS calculation)

Bash

10.5808/gi.21053

CTSLEB

3

genetics

GitHub

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

R

https://doi.org/10.1038/s41588-023-01501-z

PolyPred

3

polyfun

Polyfun Wiki

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

10.1038/s41588-022-01036-9

Pleio-Pred

2

ldscc

GitHub

Both

Yes

Yes

No

Yes

Plink (PRS calculation)

Python

10.1371/journal.pgen.1006836

Conda Environment#

You may need to create the following Conda Environment to execute each file.

The following tools were discarded from further consideration for the following reasons:

  • Multiprs – This methodology calculates PRS for multiple p-value thresholds and then combines them to form a single prediction. It is a methodolgy rather a tool.

  • BGLR-R – While this software worked with its provided test data, it failed with our data, returning NaN for the explained variances across all phenotypes. This was despite our data matching the required format, with no missing genotype values for SNPs or individuals.

  • PolyRiskScore – This is a web-based tool that calculates PRS for specific SNPs and uses GWAS files from the GWAS catalog. Due to its limited flexibility and dependency on specific SNPs, it was not suitable for our needs.

  • FairPRS – Although promising, this tool writes output files to the same directory, making it incompatible with running multiple phenotypes or datasets simultaneously on HPC. The lack of parallel processing capability led us to exclude it from further consideration.

  • RapidoPGS-multi – This tool did not support continuous phenotypes, though it worked for binary ones. Due to this limitation, it was removed from further consideration.

Results#

TableResults.png

Author Information.#