Polygenic Risk Score Tools Notebooks Workflow

Notebooks for each tool are organized in the following format:

Documentation/GitHub/Tool Link

Modified the GWAS file as required by the specific tool.
PRS tools expect the GWAS file to:
- Be in a specific format.
- Have specific headers.
- Already be in that format before it is passed to the notebook or code.
Continuous phenotypes: Ensure the BETA (effect size) column is in the GWAS file.
Binary phenotypes: Ensure the OR (odds ratio) column is in the GWAS file.
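
The column check above can be sketched as follows. This is a minimal illustration, not the notebooks' own code: the function names and the example headers `SNP`, `A1`, and `P` are hypothetical, and only `BETA` and `OR` come from the text. The second helper assumes a tool that wants log-odds effects, in which case BETA can be derived as the natural log of OR.

```python
import math

# Effect column expected for each phenotype type (BETA and OR are from the text).
REQUIRED_EFFECT_COLUMN = {"continuous": "BETA", "binary": "OR"}


def check_gwas_columns(header, phenotype_type):
    """Return the required effect column, or raise ValueError if it is missing."""
    required = REQUIRED_EFFECT_COLUMN[phenotype_type]
    if required not in header:
        raise ValueError(
            f"GWAS file lacks the {required} column needed for "
            f"{phenotype_type} phenotypes"
        )
    return required


def add_log_odds(rows):
    """Return copies of the rows with BETA = ln(OR) added.

    Assumption: the downstream tool expects log-odds effect sizes.
    """
    return [{**row, "BETA": math.log(float(row["OR"]))} for row in rows]
```

For example, `check_gwas_columns(["SNP", "A1", "OR", "P"], "binary")` returns `"OR"`, while the same header with `"continuous"` raises a `ValueError`.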

Specified clumping and pruning parameters for the genotype data.
Defined the number of principal components (PCs) to compute with PCA.
Defined the number of p-value thresholds. PRS is calculated for each threshold, so more thresholds give a finer search and tend to improve the best result. The PRS is then combined with covariates and PCs for the final prediction.
Specified the fold number for which the code should be executed.
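
One way to build the p-value threshold grid is sketched below. The log spacing, the default bounds, and both function names are assumptions made for illustration; the notebooks may choose thresholds differently. The second helper writes three-column `label lower upper` lines, the layout read by PLINK 1.9's `--q-score-range`, on the assumption that the thresholds are handed to PLINK that way.

```python
import math


def pvalue_thresholds(n, smallest=1e-10, largest=1.0):
    """Return n log-spaced p-value thresholds from smallest to largest."""
    if n < 2:
        raise ValueError("need at least two thresholds")
    low, high = math.log10(smallest), math.log10(largest)
    step = (high - low) / (n - 1)
    return [10 ** (low + i * step) for i in range(n)]


def range_list_lines(thresholds):
    """Format one 'label lower upper' line per threshold (lower bound fixed at 0)."""
    return [f"{t:.3g} 0 {t:.3g}" for t in thresholds]
```

Each line selects the SNPs whose p-value falls between 0 and the threshold, so one PRS is produced per line.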

Defined a function to calculate PCA from the genotype data.
Defined a function to perform clumping and pruning.
Defined a function to fit the binary phenotype using logistic regression (LOGIT).
- Merges covariates, PCA, and PRS.
- One can use different regularization terms when fitting the model.
- One can use different evaluation metrics.
Defined a function to fit the continuous phenotype using ordinary least squares (OLS).
- Merges covariates, PCA, and PRS.
- One can use different regularization terms when fitting the model.
- One can use different evaluation metrics.
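
Both fitting functions start by merging covariates, PCA, and PRS. A minimal sketch of that merge is shown here, assuming each table is a dictionary keyed by sample ID; the function name and the column names `Sex`, `PC1`, and `SCORE` in the example are hypothetical. Samples missing from any table are dropped (an inner join), which is one reasonable choice rather than a documented behavior of the notebooks.

```python
def merge_features(covariates, pcs, prs):
    """Inner-join three {sample_id: {column: value}} tables on sample ID."""
    shared = sorted(set(covariates) & set(pcs) & set(prs))
    return {sid: {**covariates[sid], **pcs[sid], **prs[sid]} for sid in shared}


# Example: only sample "s1" appears in all three tables.
merged = merge_features(
    {"s1": {"Sex": 1}, "s2": {"Sex": 2}},
    {"s1": {"PC1": 0.5}, "s2": {"PC1": -0.5}},
    {"s1": {"SCORE": 0.25}},
)
```

The merged rows would then be passed to the logistic regression or OLS fit as the predictor matrix.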

Deleted files from previous iterations.
Executed the code for the specific tool, phenotype, and fold.
Modified the GWAS and genotype files if required by the tool.
For some tools, the genotype data must be split separately for each phenotype.
Calculated additional parameters, such as heritability for GCTA, if needed.
Calculated the PRS using PLINK with posterior effect sizes estimated by the tool.
Saved the results.
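
The PRS step is, at its core, a weighted sum of effect-allele dosages. The sketch below shows only that additive model for a single sample; the function name and SNP identifiers are hypothetical, and PLINK applies its own conventions for missing genotypes and normalization that are not reproduced here.

```python
def polygenic_score(dosages, weights):
    """Sum dosage times effect size over the SNPs present in both inputs.

    dosages: {snp_id: effect-allele count (0, 1, or 2)} for one sample
    weights: {snp_id: effect size estimated by the PRS tool}
    """
    return sum(weights[snp] * dose for snp, dose in dosages.items() if snp in weights)
```

SNPs that have no weight (for example, those removed by clumping) contribute nothing to the score.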

Verified that the result file was generated.
Checked the number of rows in the result file.
Changed the fold number and executed the code for each fold.
Ensured that files were generated for each fold.
Summed the results across all folds, making sure that each sum combines the matching rows from every fold.
If the code failed for some datasets (e.g., negative heritability), discarded those rows for that fold.
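
A sketch of combining per-fold results is given below. It assumes each fold's results are keyed by a hyperparameter configuration (the tuple layout in the example is hypothetical) and that a configuration is kept only if it succeeded in every fold; the notebooks may handle failed rows differently. The sums are divided by the number of folds to give means.

```python
from collections import defaultdict


def combine_folds(fold_results, n_folds):
    """Average train/test metrics per configuration across folds.

    fold_results: one {config: (train, test)} dict per fold.
    Configurations missing from any fold are dropped (assumption).
    """
    totals = defaultdict(lambda: {"train": 0.0, "test": 0.0, "count": 0})
    for rows in fold_results:
        for config, (train, test) in rows.items():
            totals[config]["train"] += train
            totals[config]["test"] += test
            totals[config]["count"] += 1
    return {
        config: {"train": t["train"] / n_folds, "test": t["test"] / n_folds}
        for config, t in totals.items()
        if t["count"] == n_folds
    }
```

Keying by configuration rather than by row position guards against misaligned sums when a fold is missing rows.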

Identified the best performance parameters across all folds.
Used two methods to report performance:
- The test performance corresponding to the highest training performance across all rows.
- The generalized performance, where the difference between training and test performance is minimal while their sum is high.
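
The two reporting methods can be sketched as below. The first follows the text directly. The second needs a rule for trading off a small train/test gap against a high sum; the score used here, sum minus absolute gap, is an assumption chosen for illustration and not necessarily the rule the notebooks apply. Function names and row keys are hypothetical.

```python
def best_by_train(rows):
    """Return the row with the highest training performance."""
    return max(rows, key=lambda r: r["train"])


def best_generalized(rows):
    """Return the row maximizing (train + test) - |train - test|."""
    return max(
        rows,
        key=lambda r: (r["train"] + r["test"]) - abs(r["train"] - r["test"]),
    )


rows = [
    {"id": "a", "train": 0.9, "test": 0.5},
    {"id": "b", "train": 0.7, "test": 0.68},
    {"id": "c", "train": 0.4, "test": 0.4},
]
```

With these rows, `best_by_train` picks row `a` (reported test performance 0.5), while `best_generalized` picks row `b`, whose train and test values are both high and close together.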

Performed various analyses, such as:
- Assessing the impact of different hyperparameters on performance.
- Plotting the correlation between hyperparameters and train and test performance.
- Reporting the p-values that yield the best performance.
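
The correlation between a hyperparameter and performance can be computed with a plain Pearson coefficient, as sketched here. The function name is hypothetical, and Pearson correlation is one possible choice; the notebooks may use a different measure or a library routine.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

Here `xs` would hold one hyperparameter's values across rows and `ys` the matching train or test performance.
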
Reviewed the logs generated by the notebook to ensure proper execution.