Step1-DownloadSampleDataandSetupEnvironment

Step1-DownloadSampleDataandSetupEnvironment#

Download Anaconda#

To run the PRS Notebooks, set up a conda environment. Some tools require a dedicated (separate) conda environment, and for such tools, instructions are provided in their corresponding Notebooks.

Windows:#

Visit the Anaconda website.
Choose the appropriate installer (64-bit is recommended for most systems).
Run the installer and follow the on-screen instructions.

Linux:#

Open a terminal window and run these commands:

wget https://repo.anaconda.com/archive/Anaconda3-2024.01-Linux-x86_64.sh
chmod +x Anaconda3-2024.01-Linux-x86_64.sh
bash Anaconda3-2024.01-Linux-x86_64.sh
source ~/.bashrc

Create a Dedicated Environment for Your Project:#

Open a terminal window (or Anaconda Prompt on Windows).
Create a new environment named “prstools” with Python 3.10:
```
conda create -n prstools python=3.10
```
Activate the environment:
```
conda activate prstools
```

Install R within the Environment:#

Install R version 4.3.2 from the conda-forge channel:

conda install -c conda-forge r-base=4.3.2
conda install -c conda-forge r-essentials

Data Preparation#

In this section, we will download the sample data.

Download Sample Data#

The first step is to download the sample data. You can find the base data here and the target data here.

Extracting Target Data#

Extract the target data and organize it with the following files in a directory named SampleData1:

SampleData1.bed: Plink Files (bed, bim, fam) containing genotype information.
SampleData1.bim
SampleData1.fam
SampleData1.cov: Contains covariate information.
SampleData1.height: Contains the actual height phenotype.
SampleData1.gz: GWAS summary statistic file.

Ensure to rename all files to the SampleData1 prefix.

.
├── SampleData1
│   ├── SampleData1.bed
│   ├── SampleData1.bim
│   ├── SampleData1.cov
│   ├── SampleData1.fam
│   ├── SampleData1.gz
│   └── SampleData1.height

View Bed Files#

To view Bed files, visit Plink Binary Files Documentation.

Analyzing Quantitative Trait (Height)#

We will begin by analyzing the quantitative trait (height). The dataset used for the continuous trait analysis is sourced from the tutorial by choishingwan, and we express our gratitude for making it publicly available.

Polygenic Risk Score (PRS) Tools Overview#

PRS tools can be categorized into three types:

Type 1: Summary Statistics File Only
- These tools use a summary statistic file (e.g., gwas.txt) and estimate betas or posterior probabilities. They may require population-specific linkage disequilibrium calculations.
Type 2: Summary Statistic File and Individual Data Set
- This category of PRS tools requires both a summary statistic file and an individual dataset. They optimize various hyperparameters for beta estimation or snp weight estimation.
Type 3: Individual Data Set Only
- These tools focus on polygenic risk score calculation for individuals without relying on a summary statistic file. They incorporate linkage equilibrium for betas or snp weight estimation.
Type 4: Multi-ancestry/Disease Tools
- These tools use GWAS from multiple populations or multiple diseases to calculate posterior probabilities or new Betas.

Data Splitting and Hyperparameter Optimization#

The target data is split into a 75% training or hyperparameter optimization set and a 25% testing set for final performance evaluation.

For binary phenotypes, we used a stratified cross-validation technique to split the data.

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

For continuous phenotypes, we used cross-validation to split the data into 5 folds.

kf = KFold(n_splits=5, shuffle=True, random_state=42)

Hyperparameter optimization can be categorized into two types:

Genotype data Quality controls related Hyperparameters
- This includes considerations such as pruning and clumping during quality controls on the target data.
Tools-Specific Hyperparameters
- When using the GWAS File and an individual dataset, specific hyperparameters, such as lambdas for lasso, may be considered.

Full Credit to Shing Wan Choi#

GitHub: Shing Wan Choi

Tutorial Link from where the code has been inferred: Shing Wan Choi’s PRS Tutorial

# The following code is for Linux
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge
!conda install bioconda::bedtools
!conda install bioconda::samtools
# You can execute the code from terminal after activitating the conda environment.
# conda activate genetics.

View Data#

In this section, we will view the sample data.

import os
import pandas as pd
import subprocess

# Set the directory where the files are located
filedirec = "SampleData1"

# Define file paths for different data files
BED = filedirec + os.sep + filedirec
BIM = filedirec + os.sep + filedirec+".bim"
FAM = filedirec + os.sep + filedirec+".fam"
COV = filedirec + os.sep + filedirec+".cov"
Height = filedirec + os.sep + filedirec+".height"
GWAS = filedirec + os.sep + filedirec+".gz"

# Read the first 10 rows of the BED file (matrix, cannot be viewed directly)
# This is a placeholder comment since viewing a BED file is not applicable in this context
#https://en.wikipedia.org/wiki/BED_(file_format)

# Read the first 10 rows of the BIM file (PLINK map file)
temp = pd.read_csv(BIM, sep="\s+", header=None, nrows=10)
print("Columns of BIM file:")
print(temp.columns)
print("First 10 rows of BIM file:")
print(temp.head())

# Read the first 10 rows of the FAM file (PLINK pedigree file)
temp = pd.read_csv(FAM, sep="\s+", header=None, nrows=10)
print("Columns of FAM file:")
print(temp.columns)
print("First 10 rows of FAM file:")
print(temp.head())

# Read the first 10 rows of the COV file (covariate information file)
temp = pd.read_csv(COV, sep="\s+", nrows=10)
print("Columns of COV file:")
print(temp.columns)
print("First 10 rows of COV file:")
print(temp.head())

# Read the first 10 rows of the Height file (file containing actual height phenotypes)
temp = pd.read_csv(Height, sep="\s+", nrows=10)
print("Columns of Height file:")
print(temp.columns)
print("First 10 rows of Height file:")
print(temp.head())

# Read the first 10 rows of the GWAS file (GWAS summary statistic file)
temp = pd.read_csv(GWAS, sep="\s+", nrows=10)
print("Columns of GWAS file:")
print(temp.columns)
print("First 10 rows of GWAS file:")
print(temp.head())

Columns of BIM file:
Index([0, 1, 2, 3, 4, 5], dtype='int64')
First 10 rows of BIM file:
   0           1         2       3  4  5
0  1   rs3131962  0.490722  756604  A  G
1  1  rs12562034  0.495714  768448  0  0
2  1   rs4040617  0.500708  779322  G  A
3  1  rs79373928  0.587220  801536  G  T
4  1  rs11240779  0.620827  808631  G  A
Columns of FAM file:
Index([0, 1, 2, 3, 4, 5], dtype='int64')
First 10 rows of FAM file:
         0        1  2  3  4  5
0  HG00096  HG00096  0  0  1 -9
1  HG00097  HG00097  0  0  2 -9
2  HG00099  HG00099  0  0  2 -9
3  HG00100  HG00100  0  0  2 -9
4  HG00101  HG00101  0  0  1 -9
Columns of COV file:
Index(['FID', 'IID', 'Sex'], dtype='object')
First 10 rows of COV file:
       FID      IID  Sex
0  HG00096  HG00096    1
1  HG00097  HG00097    2
2  HG00099  HG00099    2
3  HG00100  HG00100    2
4  HG00101  HG00101    1
Columns of Height file:
Index(['FID', 'IID', 'Height'], dtype='object')
First 10 rows of Height file:
       FID      IID      Height
0  HG00096  HG00096  169.132169
1  HG00097  HG00097  171.256259
2  HG00099  HG00099  171.534380
3  HG00101  HG00101  169.850176
4  HG00102  HG00102  172.788361
Columns of GWAS file:
Index(['CHR', 'BP', 'SNP', 'A1', 'A2', 'N', 'SE', 'P', 'OR', 'INFO', 'MAF'], dtype='object')
First 10 rows of GWAS file:
   CHR      BP         SNP A1 A2       N        SE         P        OR  \
0    1  756604   rs3131962  A  G  388028  0.003017  0.483171  0.997887   
1    1  768448  rs12562034  A  G  388028  0.003295  0.834808  1.000687   
2    1  779322   rs4040617  G  A  388028  0.003033  0.428970  0.997604   
3    1  801536  rs79373928  G  T  388028  0.008413  0.808999  1.002036   
4    1  808631  rs11240779  G  A  388028  0.002428  0.590265  1.001308   

       INFO       MAF  
0  0.890558  0.369390  
1  0.895894  0.336846  
2  0.897508  0.377368  
3  0.908963  0.483212  
4  0.893213  0.450410  

Important Note#

This notebook needs to be executed only once to download the sample data.
If someone wants to use the data in a different format, ensure you follow the same directory structure.
Please ensure that your data has the same number of columns and the same headers. Cheers!
For continuous phenotype GWAS, the SampleData1/SampleData1.gz file should have BETAs, and for binary phenotypes, it should have OR instead of BETAs. If BETAs are not available, we convert OR to BETAs using BETA = np.log(OR) and convert BETAs to OR using OR = np.exp(BETA).
import numpy as np; np is the NumPy module.