Step2-GWASAndIndividualGenotypeDataQualityControls#

Note on GWAS File#

Different polygenic risk score (PRS) tools accept Genome-Wide Association Study (GWAS) summary statistic files in different formats.
Our GWAS file contains all the columns required by the PRS tools; where a field is missing, we point it out.
For each tool, we process this GWAS file so that the tool can consume it for prediction. The required fields and their names are given in each tool’s documentation.

Fields in the GWAS File:

These fields are also explained in this link.

CHR	BP	SNP	A1	A2	N	SE	P	OR	INFO	MAF
1	756604	rs3131962	A	G	388028	0.00301666	0.483171	0.997886915712657	0.890557941364774	0.369389592764921
1	768448	rs12562034	A	G	388028	0.00329472	0.834808	1.00068731609353	0.895893511351165	0.336845754096289
1	779322	rs4040617	G	A	388028	0.00303344	0.42897	0.997603556067569	0.897508290615237	0.377368010940814
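As a hedged illustration of adapting the GWAS file for one tool, the sketch below loads the three sample rows above and derives an effect-size column. The target layout (`SNP`, `A1`, `A2`, `BETA`, `P`) is a hypothetical example, not the requirement of any particular tool; each tool's documentation gives the exact fields.

```python
import io

import numpy as np
import pandas as pd

# The three sample rows shown above (decimal values shortened for readability).
sample = (
    "CHR\tBP\tSNP\tA1\tA2\tN\tSE\tP\tOR\tINFO\tMAF\n"
    "1\t756604\trs3131962\tA\tG\t388028\t0.00301666\t0.483171\t0.997887\t0.890558\t0.369390\n"
    "1\t768448\trs12562034\tA\tG\t388028\t0.00329472\t0.834808\t1.000687\t0.895894\t0.336846\n"
    "1\t779322\trs4040617\tG\tA\t388028\t0.00303344\t0.428970\t0.997604\t0.897508\t0.377368\n"
)
gwas = pd.read_csv(io.StringIO(sample), sep="\t")

# Some tools expect an effect size (BETA) instead of an odds ratio: BETA = ln(OR).
gwas["BETA"] = np.log(gwas["OR"])

# Hypothetical target layout, for illustration only.
tool_input = gwas[["SNP", "A1", "A2", "BETA", "P"]]
print(tool_input)
```

The same pattern (select, rename, derive) applies to other layouts; only the column list changes.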

Genome-Wide Association Study (GWAS) (Base Data) - Quality Controls#

Please note that some tools have built-in quality control procedures for both GWAS and genotype data.

Set the directory where the files are located#

filedirec = "SampleData1"

If you are running the jobs on an HPC cluster and want to process several phenotypes in parallel, you can replace the directory name with sys.argv[1]. All the files for a specific phenotype are produced in that phenotype's directory, which in this case is SampleData1.
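A minimal sketch of the command-line variant, assuming the notebook has been exported to a script. The helper name `get_directory` and the fallback to `SampleData1` are our own choices for illustration. Inside a running notebook, `sys.argv` holds the kernel's arguments, so this only makes sense for script execution.

```python
import sys


def get_directory(argv, default="SampleData1"):
    """Return the directory given as the first argument, or the default."""
    return argv[1] if len(argv) > 1 else default


# Example: python step2.py asthma_19   (step2.py is a hypothetical script name)
filedirec = get_directory(sys.argv)
print("Working directory for this phenotype:", filedirec)
```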

import os
import pandas as pd
import subprocess

# Set the directory where the files are located
filedirec = "SampleData1"
#filedirec = "asthma_19"
#filedirec = "migraine_0"

# Define file paths for different data files
BED = filedirec + os.sep + filedirec
BIM = filedirec + os.sep + filedirec+".bim"
FAM = filedirec + os.sep + filedirec+".fam"
COV = filedirec + os.sep + filedirec+".cov"
Height = filedirec + os.sep + filedirec+".height"
GWAS = filedirec + os.sep + filedirec+".gz"

# Read GWAS data from a compressed file using pandas
# (a raw string avoids the invalid-escape warning for "\s" in recent Python versions)
df = pd.read_csv(GWAS, compression="gzip", sep=r"\s+")

# Display the initial number of rows in the dataframe
print("Initial number of SNPs:", len(df))

# Apply quality control steps: Filter SNPs based on Minor Allele Frequency (MAF) and Imputation Information Score (INFO)
df = df.loc[(df['MAF'] > 0.01) & (df['INFO'] > 0.8)]

# Display the number of rows after applying the filters
print("Number of SNPs after quality control:", len(df))

 
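# Remove duplicate SNPs, unless RSIDs are missing (placeholder "X"); that case is handled in the matching step
if not (df['SNP'] == 'X').all():
    df = df.drop_duplicates(subset='SNP')

# Display the number of rows after removing duplicate SNPs
print("SNPs in GWAS after removing duplicate SNPs:", len(df))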

# Remove ambiguous SNPs with complementary alleles (C/G or A/T) to avoid potential errors
df = df[~((df['A1'] == 'A') & (df['A2'] == 'T') |
          (df['A1'] == 'T') & (df['A2'] == 'A') |
          (df['A1'] == 'G') & (df['A2'] == 'C') |
          (df['A1'] == 'C') & (df['A2'] == 'G'))]

# Display the final number of SNPs after removing ambiguous SNPs
print("Final number of SNPs after removing ambiguous SNPs:", len(df))

# Save the filtered GWAS data, then read it back to confirm that it was written correctly.
df.to_csv(GWAS, compression="gzip", sep="\t", index=False)

df = pd.read_csv(GWAS, compression="gzip", sep=r"\s+")
print(len(df))
print(df.head().to_markdown())

Initial number of SNPs: 499617
Number of SNPs after quality control: 499617
SNPs in GWAS after removing duplicate SNPs: 499617
Final number of SNPs after removing ambiguous SNPs: 499617
499617
|    |   CHR |     BP | SNP        | A1   | A2   |      N |         SE |        P |       OR |     INFO |      MAF |
|---:|------:|-------:|:-----------|:-----|:-----|-------:|-----------:|---------:|---------:|---------:|---------:|
|  0 |     1 | 756604 | rs3131962  | A    | G    | 388028 | 0.00301666 | 0.483171 | 0.997887 | 0.890558 | 0.36939  |
|  1 |     1 | 768448 | rs12562034 | A    | G    | 388028 | 0.00329472 | 0.834808 | 1.00069  | 0.895894 | 0.336846 |
|  2 |     1 | 779322 | rs4040617  | G    | A    | 388028 | 0.00303344 | 0.42897  | 0.997604 | 0.897508 | 0.377368 |
|  3 |     1 | 801536 | rs79373928 | G    | T    | 388028 | 0.00841324 | 0.808999 | 1.00204  | 0.908963 | 0.483212 |
|  4 |     1 | 808631 | rs11240779 | G    | A    | 388028 | 0.00242821 | 0.590265 | 1.00131  | 0.893213 | 0.45041  |
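The ambiguous-SNP filter above can also be written with a complement lookup. The sketch below shows this on a toy table; the SNP names are made up for illustration, and the result is the same set of rows that the four explicit comparisons would keep.

```python
import pandas as pd

# Complement of each base on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Toy data with hypothetical SNP names.
toy = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "A1": ["A", "A", "G", "C"],
    "A2": ["G", "T", "C", "T"],
})

# A SNP is strand-ambiguous when A2 is the complement of A1 (A/T, T/A, C/G, G/C).
ambiguous = toy["A1"].map(COMPLEMENT) == toy["A2"]
kept = toy[~ambiguous]
print(kept)
```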

Match Variants Between GWAS and Individual Genotype Data#

If RSID is present in the GWAS, the following step can be skipped.

Steps for Handling RSIDs in GWAS and Genotype Data#

If RSIDs are not present for SNPs, put X in the SNP column in the GWAS file.
Read the genotype.bim file and extract the RSIDs from the genotype data.
If RSIDs are not present in the genotype data, use HapMap3 or another reference panel to obtain the RSIDs.

Some PRS tools use different criteria to create unique variants and match them between GWAS and individual genotype data:

CHR:BP:A1:A2: Some PRS tools use this format to define a unique variant.
RSID: Some PRS tools use RSID/SNP to define a unique variant.
CHR:BP: Some PRS tools use this format to define a unique variant.

We have highlighted which criteria are necessary for each tool.
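The three identifier styles can be built from the GWAS columns as in the sketch below. The `:` separator and the new column names are assumptions for illustration; a given tool may expect a different separator or allele order.

```python
import pandas as pd

# Two rows taken from the GWAS sample shown earlier.
toy = pd.DataFrame({
    "CHR": [1, 1],
    "BP": [756604, 768448],
    "SNP": ["rs3131962", "rs12562034"],
    "A1": ["A", "A"],
    "A2": ["G", "G"],
})

chr_bp = toy["CHR"].astype(str) + ":" + toy["BP"].astype(str)
toy["CHR:BP"] = chr_bp
toy["CHR:BP:A1:A2"] = chr_bp + ":" + toy["A1"] + ":" + toy["A2"]
toy["RSID"] = toy["SNP"]
print(toy[["RSID", "CHR:BP", "CHR:BP:A1:A2"]])
```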

bimfile = pd.read_csv(BIM, sep=r"\s+", header=None)
print("Columns of BIM file:")
print(bimfile.columns)
print("First 10 rows of BIM file:")

# Drop GWAS rows in which any required field is missing
df = df.dropna()
print("Removing SNPs for which even a single row does not contain the required value:", len(df))


# If RSIDs are not present for SNPs, put X in the SNP column in the GWAS file.
# Read the genotype .bim file and extract the RSIDs from the genotype data.
# If RSIDs are not present in the genotype data, use HapMap3 or another reference panel to get the RSIDs.

if (df['SNP'] == 'X').all():
    print("RSIDs are missing!")
    bimfile = pd.read_csv(BIM, sep=r"\s+", header=None)

    # Create a unique variant key from CHR, BP, A1 and A2.
    bimfile["match"] = bimfile[0].astype(str)+"_"+bimfile[3].astype(str)+"_"+bimfile[4].astype(str)+"_"+bimfile[5].astype(str)
    df["match"] = df["CHR"].astype(str)+"_"+df["BP"].astype(str)+"_"+df["A1"].astype(str)+"_"+df["A2"].astype(str)

    # Keep one row per key, then keep only the GWAS variants that are also genotyped.
    df = df.drop_duplicates(subset='match')
    bimfile = bimfile.drop_duplicates(subset='match')
    df = df[df['match'].isin(bimfile['match'])].copy()

    # Look up each RSID by key instead of relying on both tables having the same row order.
    df["SNP"] = df["match"].map(bimfile.set_index("match")[1])
    df = df.sort_values(by=['CHR', 'BP'])
    print("match", len(df))
    print(df.head())

    del df["match"]
    # Save only the modified GWAS file.
    # If the .bim file is modified, Plink will treat the genotype data as corrupt.
    df.to_csv(GWAS, compression="gzip", sep="\t", index=False)
    print("Total SNPs", len(df))
else:
    df = df.drop_duplicates(subset='SNP')
    df.to_csv(GWAS, compression="gzip", sep="\t", index=False)
    print("RSID is present!")
    print("Total SNPs", len(df))
 

Columns of BIM file:
Index([0, 1, 2, 3, 4, 5], dtype='int64')
First 10 rows of BIM file:
Removing SNPs for which even a single row does not contain the required value: 499617
RSID is present!
Total SNPs 499617

Individual genotype data (Target Data) Processing#

Ensure that the phenotype file, FAM file, and covariate file contain an identical number of samples. Remove any missing samples based on your data. Note that the extent of missingness in phenotypes and covariates may vary.
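A quick way to check this is to compare the (FID, IID) pairs across the three files. The sketch below uses small made-up tables in place of the real files; the sample names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the FAM, phenotype, and covariate files.
fam = pd.DataFrame({"FID": ["s1", "s2", "s3"], "IID": ["s1", "s2", "s3"]})
pheno = pd.DataFrame({"FID": ["s1", "s3"], "IID": ["s1", "s3"], "Height": [170.2, 168.9]})
cov = pd.DataFrame({"FID": ["s1", "s2", "s3"], "IID": ["s1", "s2", "s3"], "Sex": [1, 2, 2]})


def sample_ids(table):
    """Return the set of (FID, IID) pairs in a table."""
    return set(zip(table["FID"], table["IID"]))


common = sample_ids(fam) & sample_ids(pheno) & sample_ids(cov)
print("FAM:", len(fam), "phenotype:", len(pheno), "covariates:", len(cov))
print("Samples present in all three files:", len(common))
```

If the counts differ, only the samples in `common` should be carried forward.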

Note: Plink needs to be installed or placed in the same directory as this notebook.

Download Plink

We recommend using Linux. In cases where Windows is required due to package installation issues on Linux, we provide the following guidance:

For Windows, use plink.
For Linux, use ./plink.
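If the same notebook must run on both systems, the executable name can be chosen at run time. This is a small sketch; the helper name `plink_executable` is ours, and it assumes the Plink binary sits in the notebook's directory on Linux.

```python
import platform


def plink_executable(system=None):
    """Return the Plink command name for the given (or current) operating system."""
    system = system or platform.system()
    return "plink" if system == "Windows" else "./plink"


print(plink_executable())
```

The returned string can replace the hard-coded './plink' entry in a command list passed to subprocess.run.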

Remove people with missing Phenotype#

Modify the FAM file, create a new BED file, and update the covariate file accordingly.

# New files to be saved with QC suffix
newfilename = filedirec + "_QC"

# Read information from FAM file
f = pd.read_csv(FAM, header=None, sep=r"\s+", names=["FID", "IID", "Father", "Mother", "Sex", "Phenotype"])
print("FAM file contents:")
print(f.head())
print("Total number of people in FAM file:", len(f))

# Append the Height phenotype values to FAM file
# Height file is basically the phenotype file.
h = pd.read_csv(Height, sep="\t")
print("Phenotype information is available for:", len(h), "people")
print(len(h))
result = pd.merge(f, h, on=['FID', 'IID'])

# Replace the 'Phenotype' column with the 'Height' values; the result is saved to PeopleWithPhenotype.txt below
# Ensure that the input phenotype file has the header Height.
result["Phenotype"] = result["Height"].values
del result["Height"]

# Remove NA or missing in the phenotype column
result = result.dropna(subset=["Phenotype"])


print(result)
result.to_csv(filedirec + os.sep + "PeopleWithPhenotype.txt", index=False, header=False, sep="\t")

# Use plink to keep only the people with phenotype present
plink_command = [
    './plink',
    '--bfile', filedirec + os.sep + filedirec,
    '--keep', filedirec + os.sep + "PeopleWithPhenotype.txt",
    '--make-bed',
    '--out', filedirec + os.sep + newfilename
]
subprocess.run(plink_command)

# Update the phenotype information in the new FAM file
f = pd.read_csv(filedirec + os.sep + newfilename + ".fam", header=None, sep=r"\s+",
                names=["FID", "IID", "Father", "Mother", "Sex", "Phenotype"])
# Match phenotypes on FID and IID rather than relying on both tables having the same row order
f = f.drop(columns="Phenotype").merge(result[["FID", "IID", "Phenotype"]], on=["FID", "IID"], how="left")
f.to_csv(filedirec + os.sep + newfilename + ".fam", index=False, header=False, sep="\t")

# Update the covariate file as well
covfile = pd.read_csv(COV, sep=r"\s+")

print("Covariate file contents:")
print(covfile.head())
print("Total number of people in Covariate file:", len(covfile))

# Keep only the covariate rows whose (FID, IID) pair appears in the QC'd FAM file
fam_ids = pd.MultiIndex.from_frame(f[["FID", "IID"]])
cov_ids = pd.MultiIndex.from_frame(covfile[["FID", "IID"]])
covfile = covfile[cov_ids.isin(fam_ids)]
print("Covariate file contents after matching with FAM file:")
print(covfile.head())
print("Total number of people in Covariate file after matching:", len(covfile))
covfile.to_csv(filedirec + os.sep + newfilename + ".cov", index=None, sep="\t")

FAM file contents:
       FID      IID  Father  Mother  Sex  Phenotype
0  HG00096  HG00096       0       0    1         -9
1  HG00097  HG00097       0       0    2         -9
2  HG00099  HG00099       0       0    2         -9
3  HG00100  HG00100       0       0    2         -9
4  HG00101  HG00101       0       0    1         -9
Total number of people in FAM file: 503
Phenotype information is available for: 475 people
475
         FID      IID  Father  Mother  Sex   Phenotype
0    HG00096  HG00096       0       0    1  169.132169
1    HG00097  HG00097       0       0    2  171.256259
2    HG00099  HG00099       0       0    2  171.534380
3    HG00101  HG00101       0       0    1  169.850176
4    HG00102  HG00102       0       0    2  172.788361
..       ...      ...     ...     ...  ...         ...
470  NA20822  NA20822       0       0    2  170.405056
471  NA20826  NA20826       0       0    2  168.523029
472  NA20827  NA20827       0       0    1  170.975735
473  NA20828  NA20828       0       0    2  170.222028
474  NA20832  NA20832       0       0    2  169.431705

[475 rows x 6 columns]
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to SampleData1/SampleData1_QC.log.
Options in effect:
  --bfile SampleData1/SampleData1
  --keep SampleData1/PeopleWithPhenotype.txt
  --make-bed
  --out SampleData1/SampleData1_QC

63761 MB RAM detected; reserving 31880 MB for main workspace.
551892 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
--keep: 475 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 475 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is 0.999896.
551892 variants and 475 people pass filters and QC.
Note: No phenotypes present.
--make-bed to SampleData1/SampleData1_QC.bed + SampleData1/SampleData1_QC.bim +
SampleData1/SampleData1_QC.fam ... done.
Covariate file contents:
       FID      IID  Sex
0  HG00096  HG00096    1
1  HG00097  HG00097    2
2  HG00099  HG00099    2
3  HG00100  HG00100    2
4  HG00101  HG00101    1
Total number of people in Covariate file: 503
Covariate file contents after matching with FAM file:
       FID      IID  Sex
0  HG00096  HG00096    1
1  HG00097  HG00097    2
2  HG00099  HG00099    2
4  HG00101  HG00101    1
5  HG00102  HG00102    2
Total number of people in Covariate file after matching: 475

Split Data into Test and Train and Perform Quality Controls on Training Data#

We adopt a cross-validation design to evaluate polygenic risk scores.
The target data is divided into training and test sets. The training set is used, together with the summary statistic file, to find the best combination of parameters or hyperparameters offered by each tool. The data is split into 5 folds, and further processing in this tutorial is performed on the first fold, Fold_0.
Operations such as clumping or pruning are applied to the training data only. The SNPs that remain should then be extracted from the test set, rather than running pruning or clumping separately on the test data.
All hyperparameters of the individual tools should be tuned on the training set, and performance should be measured for both the null model (without the PRS) and the complete model (including covariates and polygenic risk scores). Since the phenotype here (height) is continuous, explained variance is used to assess the performance of the PRS model. For binary phenotypes, we use AUC.
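The comparison between the null and complete models can be sketched as follows. The data is simulated, the variable names are ours, and ordinary least squares via NumPy stands in for whichever regression routine is used in practice. For brevity the models are fitted and scored on the same samples, whereas the design above scores them on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
sex = rng.integers(1, 3, size=n).astype(float)  # covariate coded 1/2
prs = rng.normal(size=n)                        # simulated polygenic risk score
height = 170 + 1.5 * sex + 2.0 * prs + rng.normal(size=n)


def r_squared(predictors, y):
    """Explained variance (R^2) of a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    return 1.0 - residual.var() / y.var()


r2_null = r_squared([sex], height)       # covariates only
r2_full = r_squared([sex, prs], height)  # covariates + PRS
print("Null model R2:", round(r2_null, 3))
print("Complete model R2:", round(r2_full, 3))
print("Variance explained by the PRS:", round(r2_full - r2_null, 3))
```

The difference between the two values is the share of phenotypic variance attributable to the PRS beyond the covariates.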

Note: We will divide the data into training and test sets and perform quality controls on the training data. The data for each fold will be saved separately for further processing.
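A 5-fold split of the 475 remaining samples could look like the sketch below. It uses plain NumPy and a fixed seed; the function name and the seed are assumptions, and for binary phenotypes a stratified split may be preferable.

```python
import numpy as np


def five_fold_indices(n_samples, n_folds=5, seed=42):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    test_folds = np.array_split(shuffled, n_folds)
    splits = []
    for i, test_idx in enumerate(test_folds):
        train_idx = np.concatenate([fold for j, fold in enumerate(test_folds) if j != i])
        splits.append((train_idx, test_idx))
    return splits


splits = five_fold_indices(475)
train_idx, test_idx = splits[0]  # Fold_0
print("Fold_0 train size:", len(train_idx), "test size:", len(test_idx))
```

The indices refer to rows of the QC'd FAM file; the corresponding FID/IID pairs can be written to a file and passed to Plink's --keep option to create the genotype files for each fold.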

Pruning#

Pruning is an integral part of the analysis, but we skip it as a quality control step on the training set for the following reasons. In our initial analysis, we observed that pruning and clumping can affect the performance of the polygenic risk score model. Rather than performing it at this stage, we treat it as one of the hyperparameters and perform it at a later stage. If we pruned at this stage, only the SNPs that passed this pruning step would remain, which would prevent us from trying other pruning values later. Even at the later stage, pruning is performed only on the training set.
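When pruning is applied at the later stage, the commands might be assembled as in the sketch below. `--indep-pairwise` and `--extract` are standard Plink options, but the window, step, and r2 values and the file prefixes are placeholders to be tuned as hyperparameters. The commands are only built and printed here, not executed.

```python
def pruning_commands(train_prefix, test_prefix, window=200, step=50, r2=0.25, plink="./plink"):
    """Build Plink commands that prune on the training set and reuse the SNP list on the test set."""
    prune_list = train_prefix + ".prune.in"  # written by --indep-pairwise
    prune = [plink, "--bfile", train_prefix,
             "--indep-pairwise", str(window), str(step), str(r2),
             "--out", train_prefix]
    extract_train = [plink, "--bfile", train_prefix, "--extract", prune_list,
                     "--make-bed", "--out", train_prefix + "_pruned"]
    # The test set reuses the training SNP list instead of being pruned separately.
    extract_test = [plink, "--bfile", test_prefix, "--extract", prune_list,
                    "--make-bed", "--out", test_prefix + "_pruned"]
    return [prune, extract_train, extract_test]


# Hypothetical file prefixes for one fold.
for command in pruning_commands("SampleData1/Fold_0/train_data", "SampleData1/Fold_0/test_data"):
    print(" ".join(command))
```

Each command list can be passed to subprocess.run, as in the earlier cells.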

Cross-validation Designs#

There are multiple ways to design cross-validation. One of them is to use all cohorts except one for training and the remaining cohort as the validation set. We performed quality controls and hyperparameter optimization on the training set, found the best hyperparameter combination across all folds, and report its performance on the test set.
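The leave-one-cohort-out design mentioned above can be sketched in a few lines. The cohort names are hypothetical, and this design is shown only for comparison with the 5-fold split used in this tutorial.

```python
def leave_one_cohort_out(cohorts):
    """Yield (training_cohorts, held_out_cohort) pairs, holding out each cohort once."""
    for held_out in cohorts:
        training = [cohort for cohort in cohorts if cohort != held_out]
        yield training, held_out


cohorts = ["cohort_A", "cohort_B", "cohort_C"]  # hypothetical cohort names
for training, held_out in leave_one_cohort_out(cohorts):
    print("train on", training, "| validate on", held_out)
```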

A Simple Analysis#

If you have a separate GWAS file and your own training data, and you plan to optimize the hyperparameters on a subset without cross-validation, follow the original tutorial referenced in the first cell, Shing Wan Choi’s PRS Tutorial.

Quality Controls considered for Training Data:

Standard GWAS quality controls, e.g., removing SNPs with a low genotyping rate, a low minor allele frequency, or a deviation from Hardy-Weinberg equilibrium, and removing individuals with a low genotyping rate.
Pruning was skipped.
Heterozygosity check.
Sex chromosomes.
Relatedness.
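The first item in the list above maps onto standard Plink filters. The sketch below builds such a command; the thresholds are commonly used defaults and are assumptions here, not values prescribed by this tutorial. The heterozygosity, sex-chromosome, and relatedness checks are handled in Module1.R and are not covered by this sketch.

```python
def training_qc_command(train_prefix, maf=0.01, hwe=1e-6, geno=0.01, mind=0.01, plink="./plink"):
    """Build a Plink command applying basic SNP-level and sample-level filters."""
    return [
        plink,
        "--bfile", train_prefix,
        "--maf", str(maf),    # drop SNPs with a low minor allele frequency
        "--hwe", str(hwe),    # drop SNPs deviating from Hardy-Weinberg equilibrium
        "--geno", str(geno),  # drop SNPs with a low genotyping rate
        "--mind", str(mind),  # drop individuals with a low genotyping rate
        "--make-bed",
        "--out", train_prefix + "_basicQC",
    ]


# Hypothetical file prefix for the training data of one fold.
print(" ".join(training_qc_command("SampleData1/Fold_0/train_data")))
```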

R Script - Module1.R

This file contains the code presented in this tutorial to assist in performing quality controls on the training data. Please follow the tutorial's instructions for a better understanding, as quality controls are not the main focus of this research.

from IPython.display import FileLink

# R file used to execute the following code.
FileLink('Module1.R')

Module1.R