1
Analysis of Alu repeat elements
Molecular Biology & Phylogeny Laboratory
Woo-Yeon Kim
2
CONTENTS
Whole-genome analysis of Alu repeat elements reveals complex evolutionary history
- INTRODUCTION
- NEW IDEAS
- RESULTS
- DISCUSSION
Alu repeat analysis in the complete human genome: trends and variations with respect to genomic composition
3
Genome Research - Letter
Supplemental material is available online at
4
INTRODUCTION
5
Alu repeats
- A family of SINEs (short interspersed nuclear elements)
- Replicating via LINE-mediated reverse transcription of an RNA polymerase Ⅲ transcript
- Roughly 280 bp
- The history of substitution patterns in the human genome
- Markers to determine genetic distances between human subpopulations: polymorphic Alu insertions
[Figure: SINE structure (labels: L, R, poly-A signal, AAAAA)]
6
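K-means
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Step by step:
1. Given a set of data points, choose the number of clusters K arbitrarily and place the K cluster centers arbitrarily.
2. For each data point, compute the distance to each of the K cluster centers and assign the point to the nearest cluster.
3. For each cluster, take the mean of the data points assigned to it and use that mean as the new cluster center.
4. If the new cluster centers are identical to the previous ones, the algorithm terminates.
5. If they are not identical, repeat from step 2.

After this procedure the data are partitioned into the K chosen clusters. The drawback is that the number of clusters K has to be chosen arbitrarily. The result also depends on the number of clusters (K), on the choice of the initial cluster centers, and on the order in which the data are processed. When applying the algorithm, one should therefore experiment with different initial cluster centers as well as with various values of K.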
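The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the function name `kmeans`, the use of squared Euclidean distance, the choice of K random input points as initial centroids, and the `max_iter` and `seed` parameters are all assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: choose K initial centroids (assumption: K randomly sampled input points).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of the points assigned to it.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
                continue
            dim = len(members[0])
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Because the outcome depends on K and on the initial centroids, a caller would typically run this with several `seed` values and several values of `k` and compare the resulting partitions.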
7
NEW IDEAS
8
An example using real data
Only the 5 Alu positions with diagnostic mutations in the Ya5 subfamily (positions 91, 98, 146, 175, and 238)
Applying k-means clustering, k = 2
9
Looking for overrepresented pairs
Identifying nested subfamilies
Computing biprofiles: frequencies of pairs of nucleotide values
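As a rough illustration of what a biprofile could look like in code, the sketch below counts how often each pair of nucleotide values occurs at two positions of an alignment and compares that frequency with what independent positions would give. The slides do not describe the statistical test actually used, so the observed/expected ratio, the `min_ratio` threshold, and the function names are assumptions made for this example only.

```python
from collections import Counter
from itertools import combinations


def biprofile(seqs, i, j):
    """Frequency of each nucleotide pair (seq[i], seq[j]) over aligned sequences."""
    counts = Counter((s[i], s[j]) for s in seqs)
    n = len(seqs)
    return {pair: c / n for pair, c in counts.items()}


def overrepresented_pairs(seqs, min_ratio=1.5):
    """List (i, j, a, b, observed, expected) where observed / expected >= min_ratio.

    'expected' assumes the two positions are independent; the threshold is a
    hypothetical stand-in for a proper significance test.
    """
    n = len(seqs)
    length = len(seqs[0])
    # Single-position profile: nucleotide counts for every alignment column.
    profile = [Counter(s[p] for s in seqs) for p in range(length)]
    hits = []
    for i, j in combinations(range(length), 2):
        for (a, b), observed in biprofile(seqs, i, j).items():
            expected = (profile[i][a] / n) * (profile[j][b] / n)
            if observed / expected >= min_ratio:
                hits.append((i, j, a, b, observed, expected))
    return hits
```

A plain ratio favors rare pairs, so on real data it would need to be combined with a minimum count or a significance test before any pair is treated as evidence for a subfamily.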
10
RESULTS
11
Aligned consensus sequences of selected subfamilies
Roughly 480,000 full-length Alu elements
Recursively split subfamilies
Identifying 213 subfamilies
12
An evolutionary tree of Alu subfamilies
13
DISCUSSION
- Significant mutations from the consensus sequence: detected by a rigorous whole-genome analysis
- Partial results: not statistically discernible
- Limitations in this algorithm:
  - Excluding insertion/deletion mutations
  - Frequent CpG mutations
  - Mutations to nucleotide values already present in other subfamilies
- Statistically distinguishable subfamilies: only 19 of the 31 subfamilies currently reported in Repbase Update
14
Bioinformatics – Discovery Note
Online Supplementary data is available at the web page
15
Alu distribution in whole genome
Chromosome   Alu J   Alu S   Alu Y   Other Alus   Total Alu No.   Chromosome Size (bp)
1            25043   56044   12209         8114          101410
2            19679   46673   11295         6438           84085
3            15812   37539    9135         5044           67530
4            12857   30347    8158         4242           55604
5            12932   32423    8023         4351           57729
6            14449   35722    8375         4959           63505
7            17486   38816    8277         5150           69729
8            12092   27148    6203         3825           49268
9            10741   26910    6496         3441           47588
10           13909   31110    6707         4378           56104
11           11858   27461    6357         3744           49420
12           14932   32314    7026         4718           58990
13            6467   15929    4307         2114           28817
14            8921   20201    4392         2931           36445
15            9631   22169    5284         3000           40084
16           13913   29451    5462         3864           52690
17           13542   34653    7025         4150           59370
18            5935   13285    3333         1915           24468
19           14135   34297    6130         3912           58474
20            7245   16478    3058         2236           29017
21            2681    6965    1865          752           12263
22            5378   13590    3119         1586           23673
X            11160   25841    5405         3284           45690
Y             1699    3547    1128          465            6839
Un              86     226      68           39             419

Fig. 1. (a) Number of Alu repeats in different chromosomes in the human genome, with vertical segments representing the numbers corresponding to each Alu subfamily
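The counts in this table can be turned into per-chromosome subfamily shares with a short script. The sketch below copies the rows for a few chromosomes and computes each subfamily's fraction of the chromosome's total Alu count. The names `ALU_COUNTS` and `subfamily_shares` are made up for the example. These shares are fractions of Alu elements, not densities per base pair, because the chromosome sizes are not listed here.

```python
# (Alu J, Alu S, Alu Y, Other Alus, Total Alu No.) copied from the table above
# for a subset of chromosomes.
ALU_COUNTS = {
    "8": (12092, 27148, 6203, 3825, 49268),
    "13": (6467, 15929, 4307, 2114, 28817),
    "19": (14135, 34297, 6130, 3912, 58474),
    "X": (11160, 25841, 5405, 3284, 45690),
    "Y": (1699, 3547, 1128, 465, 6839),
}


def subfamily_shares(counts):
    """Return each subfamily's fraction of the chromosome's total Alu count."""
    j, s, y, other, total = counts
    # Sanity check: the four subfamily columns should add up to the total column.
    if j + s + y + other != total:
        raise ValueError("subfamily counts do not sum to the reported total")
    return {"J": j / total, "S": s / total, "Y": y / total, "Other": other / total}


if __name__ == "__main__":
    for chrom, counts in ALU_COUNTS.items():
        shares = subfamily_shares(counts)
        print(chrom, {name: round(value, 3) for name, value in shares.items()})
```

The built-in sum check doubles as a quick validation that the table rows were transcribed consistently.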
16
Alu repeat density and association with genes
Fig. 1. (b) Variation in Alu and gene densities in the human genome
17
Alu in intergenic and intragenic regions
Variation in Alu content in genes of the human genome
Alu densities in the intergenic and intragenic regions of the human genome
18
Distribution of Alu subfamilies
- The most abundant Alu subfamily: Alu S, covering 6.4% of the genome
- Chromosome Y: the most Alu-poor chromosome; high density of Alu Y, very low density of Alu S and Alu J
- Chromosomes 13 and 9: similar trend, with chromosome 13 having the lowest density of Alu J
- Chromosomes 8 and X: high density of Alu S and Alu J, very low density of Alu Y
19
Correlation analysis
GC content seems to have the highest association with Alu density overall, followed by gene density and intron density.
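A ranking like this could be produced by computing a correlation coefficient between Alu density and each genomic feature and then sorting by its magnitude. The sketch below uses the Pearson coefficient, which is an assumption: the slides do not say which correlation measure was used. The function names and the idea of passing features as a dictionary of per-chromosome values are also hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_by_correlation(target, features):
    """Sort feature names by |correlation| with the target, strongest first.

    target:   e.g. Alu density per chromosome
    features: dict mapping a feature name (e.g. "GC content") to its
              per-chromosome values, in the same chromosome order as target
    """
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

The per-chromosome GC content, gene density, and intron density values are not included in these slides, so no input data is shown here.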
20
DISCUSSION
Analysis of Alu distribution in genes
- Statistically significant correlation between Alu and gene densities
- A higher Alu density in intragenic regions: these elements are preferred in genes
- The highest Alu and gene densities: chromosomes 19 and 22
- Alu density is correlated in the order GC content > gene density > intron density
- The abundance of Alu subfamilies: Alu S > Alu J > Alu Y
- Young subfamilies: chromosomes 9, 13, and Y
- Old subfamilies: chromosomes 8 and X
- Higher correlation of older Alus with GC content than younger ones