
Bioinformatics Statistics (Bioinformatics) - Lecture Notes

Properties
Lecture
Reference
Zoom
Author: Kipoong Kim
Date: 2021/03/03
These notes summarize the content of the bioinformatics statistics lecture that I am taking this semester.

Day 1

1. Goal of Genetic Studies

(1) Population Genetics

(2)

2. DNA

Gene and Genome
Chromosome
Human Genome - The autosomes are numbered from 1 to 22, approximately in order of decreasing length. - The longest chromosome (chromosome 1) is around 250 million bases long.
Genetic Marker
Allele
Dominant Allele
Recessive Allele
Haplotype
Protein coding gene
Genetic Variation : any two people share about 99.9% of the roughly 3.2 billion base pairs of the genome; the remaining 0.1% is where genetic variation lies.
SNP -
Genotype -
Gene vs Allele

| Name | Gene | Allele |
| --- | --- | --- |
| Role | Genes determine individual traits | Alleles contribute to the diversity in phenotype expression |
| Determines | An organism's genotype | An organism's phenotype |
| Various types | Alleles | Paternal vs maternal; dominant vs recessive |
| Examples | Eye color, hair color, skin pigmentation | Blue eyes, brown hair, dark skin |
Complex Traits : traits that are not determined by the genotype at a single locus. - Contrast with Mendelian traits, which are determined by a single locus. - A trait can be quantitative or qualitative.

3. Population-based Genetic Studies

Candidate polymorphism studies
Candidate gene studies
Fine mapping studies
Genome-Wide association studies (GWAS)

Genome-Wide association studies (GWAS)

Characteristics - high-dimensional genomic data - multiple testing problems - Incorporating some covariate data
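To make the multiple testing problem listed above concrete, here is a minimal sketch in base R. The p-values are simulated and the five "signals" are hypothetical; only `p.adjust()` from the stats package is used.

```r
# Hypothetical example: m SNPs tested, almost all of them under the null
set.seed(1)
m <- 1000                                   # number of SNPs tested (illustrative)
p <- runif(m)                               # null p-values are uniform on (0, 1)
p[1:5] <- c(1e-9, 1e-8, 1e-7, 1e-6, 1e-5)   # a few hypothetical true signals

alpha  <- 0.05
n.raw  <- sum(p < alpha)                        # no correction: many false positives
p.bonf <- p.adjust(p, method = "bonferroni")    # controls the family-wise error rate
p.bh   <- p.adjust(p, method = "BH")            # controls the false discovery rate
n.bonf <- sum(p.bonf < alpha)
n.bh   <- sum(p.bh < alpha)
c(raw = n.raw, bonferroni = n.bonf, BH = n.bh)
```

Without correction, about 5% of the null SNPs pass the threshold by chance, which is why GWAS use far stricter cut-offs than 0.05.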

Gene Structure

The gene consists of three major structures:
Regulatory segment (promoter)
Exons
Introns
25,000 protein-coding genes and 20,000 non-coding genes

Transcription and Translation

Chapter 2

Quantitative Trait Loci

A quantitative trait locus (QTL) is a genetic marker or a genomic region associated with a quantitative trait.
Linear regression models are commonly used.
These models typically include a single genetic marker (e.g., a SNP) as a predictor, together with other relevant covariates (such as age and sex), with the quantitative phenotype as the response.

Additive Model

Dominant Model

Recessive Model

Two degrees of freedom Model
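The four models listed above differ only in how the genotype enters the regression. The sketch below shows the usual codings, assuming the genotype is stored as the number of copies of the counted allele (0, 1 or 2); the variable names are illustrative and not from the lecture.

```r
# Genotype = number of copies of the counted allele (0, 1 or 2); toy values
g <- c(0, 1, 2, 1, 0, 2)

x.additive  <- g                        # each extra copy shifts the mean by the same amount
x.dominant  <- as.numeric(g >= 1)       # one copy is enough for the full effect
x.recessive <- as.numeric(g == 2)       # two copies are needed for an effect
x.2df       <- factor(g, levels = 0:2)  # a separate mean for each genotype

cbind(g, x.additive, x.dominant, x.recessive)
ncol(model.matrix(~ x.2df)) - 1         # genotype parameters in the 2-df model
```

The first three codings use one regression coefficient for the SNP; the factor coding uses two, hence "two degrees of freedom".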

Additive Genetic Model

Most GWAS perform single SNP association testing with linear regression assuming an additive model.
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
Nice interpretation: the coefficient of determination ($R^2$) gives an estimate of the proportion of phenotypic variation that is explained by the SNP (or SNPs) in the model, e.g., the "SNP heritability".
Test statistics for $H_0 : \beta_1 = 0$ vs $H_A : \beta_1 \ne 0$:
$$\begin{aligned} T &= \dfrac{\hat{\beta}_1}{\sqrt{Var(\hat{\beta}_1)}} \sim t_{n-2} \\ F &= \dfrac{\hat{\beta}_1^2}{Var(\hat{\beta}_1)} \sim F_{1, n-2} \end{aligned}$$
where
$$Var(\hat{\beta}_1) = \dfrac{\sigma_\epsilon^2}{\sum_i (x_i - \bar{x})^2}$$
and, in practice, $\sigma_\epsilon^2$ is replaced by its estimate from the residuals.
Multiple linear regression is rarely used in genetic association studies because of the high dimensionality of SNP data and the linkage disequilibrium (LD) among SNPs.
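As a check on the formulas above, the following sketch simulates a single SNP under the additive model and reproduces the T and F statistics by hand. The sample size, allele frequency and effect size are hypothetical values chosen only for illustration.

```r
set.seed(42)
n   <- 500
maf <- 0.3                               # hypothetical allele frequency
x   <- rbinom(n, size = 2, prob = maf)   # additive genotype coding (0/1/2)
y   <- 1 + 0.5 * x + rnorm(n)            # simulated phenotype, true beta1 = 0.5

fit <- lm(y ~ x)
est <- summary(fit)$coefficients["x", ]

# Reproduce the statistics by hand
sigma2.hat <- sum(residuals(fit)^2) / (n - 2)        # estimate of sigma_epsilon^2
var.beta1  <- sigma2.hat / sum((x - mean(x))^2)      # Var(beta1.hat)
T.stat     <- unname(coef(fit)["x"]) / sqrt(var.beta1)
F.stat     <- T.stat^2                               # F is the square of T
p.value    <- 2 * pt(abs(T.stat), df = n - 2, lower.tail = FALSE)

c(T = T.stat, F = F.stat, p = p.value, R2 = summary(fit)$r.squared)
```

With a single predictor the F test and the two-sided t test give the same p-value, so either statistic can be reported.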

0414

Chromosome-based Association Test

Compare, in a contingency table, the selected genes on each chromosome with the selected genes across all chromosomes.
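One way to carry out this comparison is Fisher's exact test on a 2 x 2 table. The sketch below assumes that choice of test, and all counts are made up for illustration.

```r
# Hypothetical counts: one chromosome versus the whole genome
sel.chr <- 30      # selected genes on the chromosome of interest
tot.chr <- 900     # all genes on that chromosome
sel.all <- 400     # selected genes genome-wide
tot.all <- 20000   # all genes genome-wide

tab <- matrix(c(sel.chr,           tot.chr - sel.chr,
                sel.all - sel.chr, (tot.all - tot.chr) - (sel.all - sel.chr)),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("on chromosome", "elsewhere"),
                              c("selected", "not selected")))
tab
res <- fisher.test(tab, alternative = "greater")   # one-sided: enrichment
res$p.value
```

Repeating this for every chromosome produces one p-value per chromosome, which then needs a multiple testing correction.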

Biological Pathways

When several genes target a single function, they are said to form a pathway.
Metabolic pathways :
Signaling pathways :
Gene regulatory pathways :

Gene Ontology vs Biological Pathways

Gene Ontology :
Biological Pathways :

Pathway-level Analysis for Genetic Association

1. Over-Representation Analysis : Similar to the chromosome-based test, compare the number of selected genes within a particular pathway with the number selected among all genes. We look for an over-representation of the pathway's genes among the most significant genes, or equivalently an over-representation of the most significant genes within the pathway.
2. Gene Set Enrichment Analysis (PNAS) : a statistical method that determines whether an a priori defined set of genes shows statistically significant differences between two biological states. The original GSEA procedure :
(1) Rank all N genes based on p-values to obtain a gene list L.
(2) For $G_i$ (the i-th gene in L) and S (a pathway), let
$$X_i = \begin{cases} \sqrt{\dfrac{N_{S^c}}{N_S}} & \text{if } G_i \text{ is in } S \\ -\sqrt{\dfrac{N_S}{N_{S^c}}} & \text{if } G_i \text{ is not in } S \end{cases}$$
where $N_S$ and $N_{S^c}$ are the numbers of genes in S and not in S.
3. Graph-based Methods
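Over-representation analysis (item 1 above) is commonly computed as a hypergeometric tail probability. The sketch below assumes that formulation; the gene names, pathway and selected list are invented for illustration.

```r
# Hypothetical gene sets; all names below are made up
universe <- paste0("g", 1:1000)               # all genes tested
pathway  <- paste0("g", 1:50)                 # genes annotated to the pathway
selected <- paste0("g", c(1:10, 501:540))     # the "most significant" genes

k <- length(intersect(selected, pathway))     # selected genes inside the pathway
K <- length(pathway)                          # pathway size
N <- length(universe)                         # universe size
n <- length(selected)                         # number of selected genes

# P(X >= k) when n genes are drawn at random from N, K of which are in the pathway
p.ora <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
p.ora
```

This is the same quantity as the one-sided Fisher exact test on the corresponding 2 x 2 table, which is why ORA resembles the chromosome-based test.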

Code

```r
# Install and load graphite (Bioconductor)
BiocManager::install("graphite")
library(graphite)

pathwayDatabases()                       # list the available databases
kegg <- pathways("hsapiens", "kegg")     # human KEGG pathways
kegg
names(kegg)

kegg[[30]]
nodes(kegg[[30]])                        # proteins only
nodes(kegg[[30]], which = "mixed")       # proteins and metabolites
edges(kegg[[30]])

# Count the nodes in each pathway
kegg.nodes <- rep(0, length(kegg))
for (i in seq_along(kegg)) {
  kegg.nodes[i] <- length(nodes(kegg[[i]]))
}
kegg.nodes
summary(kegg.nodes)
```
Collect the nodes of every pathway in the KEGG database
```r
# Convert one pathway, then all pathways, from ENTREZID to gene SYMBOL
k30 <- convertIdentifiers(kegg[[30]], "SYMBOL")
nodes(k30)
edges(k30)
suppressMessages(kegg.symbol <- convertIdentifiers(kegg, "SYMBOL"))
kegg.symbol

# Gene symbols of each pathway, with the "SYMBOL:" prefix stripped
gene.kegg <- as.list(rep(NA, length(kegg)))
for (i in seq_along(kegg)) {
  knodes <- nodes(kegg.symbol[[i]])
  gene.kegg[[i]] <- gsub("SYMBOL:", "", knodes)
}

# Keep only pathways with at least one node (kegg.nodes is computed earlier)
gene.kegg <- gene.kegg[kegg.nodes > 0]
path.kegg <- names(kegg)[kegg.nodes > 0]
unique(unlist(gene.kegg))
```
Collect the nodes of all genes after converting the gene names from ENTREZID to SYMBOL
```r
library(ALL)
library(hgu95av2.db)
data(ALL)
row.names(exprs(ALL))

# Look up the gene symbol of each probe (NA if the probe has no symbol)
gene.ALL <- NULL
for (i in 1:nrow(ALL)) {
  probe <- row.names(exprs(ALL))[i]
  gene.ALL[i] <- get(probe, envir = hgu95av2SYMBOL)
}

# Drop probes without a symbol and label the rows by gene symbol
ALL.data <- exprs(ALL)
dim(ALL.data)
miss <- is.na(gene.ALL)
gene.ALL <- gene.ALL[!miss]
ALL.data <- ALL.data[!miss, ]
row.names(ALL.data) <- gene.ALL
dim(ALL.data)
ALL.data[1:40, 1:4]

# Overlap between the genes in ALL and the genes in KEGG
ga <- unique(gene.ALL)
gk <- unique(unlist(gene.kegg))
length(ga)
sum(!is.na(match(ga, gk)))
length(gk)
sum(!is.na(match(gk, ga)))
```
In the ALL dataset, map the hgu95av2 probe IDs to gene SYMBOLs and count the overlap with the KEGG genes

Genetic Correlation Estimation

Sample of n individuals, indexed by i = 1,2,...,n.
Genome screen data on m genetic autosomal markers, indexed by j = 1,2,...,m.
At each marker and each individual, we have a genotype value, x_ij.
Here, we consider SNP data, so x_ij takes values 0, 1, or 2, corresponding to the number of reference alleles.
We center and standardize these genotype values as
$$z_{ij} = \dfrac{x_{ij} - 2\hat{p}_j}{\sqrt{2\hat{p}_j(1-\hat{p}_j)}}$$
where $\hat{p}_j$ is an estimate of the reference allele frequency for marker j.
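The standardization can be sketched in base R as follows. The genotypes are simulated, and the scaling is assumed to be the binomial standard deviation $\sqrt{2\hat{p}_j(1-\hat{p}_j)}$. The final step, averaging products of standardized genotypes into an n x n relationship matrix, is a common follow-up and is an assumption here, not something stated in these notes.

```r
set.seed(7)
n <- 50; m <- 200                                  # individuals, markers (toy sizes)
p.true <- runif(m, 0.1, 0.9)                       # hypothetical allele frequencies
X <- sapply(p.true, function(p) rbinom(n, 2, p))   # n x m genotype matrix (0/1/2)

p.hat <- colMeans(X) / 2                           # estimated reference allele frequency
keep  <- p.hat > 0 & p.hat < 1                     # drop monomorphic markers (zero variance)
X <- X[, keep]; p.hat <- p.hat[keep]

# z_ij = (x_ij - 2 p_j) / sqrt(2 p_j (1 - p_j))
Z <- sweep(X, 2, 2 * p.hat, "-")
Z <- sweep(Z, 2, sqrt(2 * p.hat * (1 - p.hat)), "/")

# Assumption (not in the notes): average over markers to get a relationship matrix
A <- tcrossprod(Z) / ncol(Z)
dim(A)
```

Each column of `Z` has mean zero by construction, because `p.hat` is estimated from the same sample.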

0414

Admixture Data analysis

Rare Variant Analysis

Analyzing rare variants together with common variants is not a good idea: the association driven by the common variants can dominate.

Burden Test

Simply collapse rare variants
Binary Collapsing: CAST
CMC : $C_i = \begin{cases} 1 & \text{if } \sum_{j=1}^p g_{ij} > 0 \\ 0 & \text{if } \sum_{j=1}^p g_{ij} = 0 \end{cases}$
Count Collapsing: MZ (GRANVIL) : $C_i = \sum_{j=1}^p I(g_{ij} > 0)$
Weighted Sum Test - Madsen and Browning - CMC test
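The binary and count collapsing scores above can be computed directly from a genotype matrix. In this sketch $g_{ij}$ is taken to be the minor allele count of individual i at rare variant j; the data are simulated, and the logistic regression at the end is one plausible way to test the collapsed score, not necessarily the one used in the lecture.

```r
set.seed(3)
n <- 200; p <- 10                                # individuals, rare variants (toy sizes)
maf <- runif(p, 0.005, 0.02)                     # hypothetical minor allele frequencies
G <- sapply(maf, function(f) rbinom(n, 2, f))    # n x p matrix of g_ij
y <- rbinom(n, 1, 0.5)                           # hypothetical case-control status

C.binary <- as.numeric(rowSums(G) > 0)           # 1 if any rare allele is carried
C.count  <- rowSums(G > 0)                       # number of variants carried

# One-degree-of-freedom test of the collapsed score
fit <- glm(y ~ C.count, family = binomial)
summary(fit)$coefficients
```

Collapsing reduces p variants to a single predictor, which is what gives burden tests their power when all variants act in the same direction.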

Effects of different directions

The presence of variants whose effects are in different directions can substantially reduce the power of burden tests.
Adaptive Sum Test (Han and Pan 2010, Hum Hered)
Estimated REgression Coefficient (EREC) test (Lin and Tang 2011, AJHG)
Limitations - Individual SNP regression - Computation of the p-value requires either permutation or bootstrap - The constant $\delta$ in the EREC test is too arbitrary

Variance Component Test

Burden tests lose power if there are variants with different directions of association or many non-causal variants.
Variance component tests have been proposed to address this.
"Similarity" based test
Major classes of tests
* C-alpha test (Neale et al. 2011, PLoS Genet)
* SKAT (Wu et al. 2010, 2011, AJHG)
* Combined test (Derkach et al. 2013, Genet Epi)
* SKAT-O (Lee et al. 2012, Biostatistics)
C-alpha Test
For case-control studies only