Analysis of Alu repeat elements
  Molecular biology & Phylogeny Laboratory
  Woo-Yeon Kim
CONTENTS
  Whole-genome analysis of Alu repeat elements reveals complex evolutionary history
    INTRODUCTION
    NEW IDEAS
    RESULTS
    DISCUSSION
  Alu repeat analysis in the complete human genome: trends and variations with respect to genomic composition
Genome Research - Letter Supplemental material is available online at www.genome.org
INTRODUCTION
Alu repeats
  A family of SINEs (short interspersed nuclear elements)
  Replicate via LINE-mediated reverse transcription of an RNA polymerase Ⅲ transcript
  Roughly 280 bp long
  Used to study the history of substitution patterns in the human genome
  Polymorphic Alu insertions serve as markers for determining genetic distances between human subpopulations
  [Figure: SINE structure; labels: L, R, poly A signal, AAAAA]
K-means
  1. Choose the number of clusters K and place K points into the space represented by the objects being clustered. These points are the initial group centroids and may be chosen arbitrarily.
  2. Compute the distance from each object to each of the K centroids and assign the object to the group with the closest centroid.
  3. When all objects have been assigned, recalculate each centroid as the mean of the objects assigned to its group.
  4. Repeat steps 2 and 3 until the centroids no longer move. If the new centroids are identical to the previous ones, the algorithm terminates.
  This separates the objects into K groups, from which the metric to be minimized can be calculated.
  Caveats
    The number of clusters K has to be chosen arbitrarily in advance.
    The result depends on K, on the choice of initial centroids, and on the order in which the data are processed.
    When applying the algorithm, experiment with different initial centroids as well as a range of K values.
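The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the paper. It assumes points are tuples of numbers, squared Euclidean distance as the metric, and a hypothetical `seed` parameter to make the random initial centroids reproducible.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick K of the data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            j = dists.index(min(dists))
            clusters[j].append(p)
            labels.append(j)
        # Step 3: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Toy data (made up for illustration): two well-separated groups.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
toy_centroids, toy_labels = kmeans(toy_points, k=2)
```

Because the outcome depends on the initial centroids, a real analysis would typically rerun this with several seeds and several values of K and compare the results.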
NEW IDEAS
An example using real data
  Uses only the 5 Alu positions with diagnostic mutations in the Ya5 subfamily (positions 91, 98, 146, 175, and 238)
  Applies k-means clustering with k = 2
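Clustering needs numeric input, so the nucleotides at the five diagnostic positions have to be encoded somehow. The slide does not say how this was done. The sketch below assumes one-hot encoding and assumes the positions are 1-based coordinates in an aligned sequence; the function name `encode_diagnostic` is hypothetical.

```python
# Positions taken from the slide; treating them as 1-based is an assumption.
DIAGNOSTIC_POSITIONS = (91, 98, 146, 175, 238)
NUCLEOTIDES = "ACGT"

def encode_diagnostic(aligned_seq, positions=DIAGNOSTIC_POSITIONS):
    """One-hot encode the nucleotides at the given positions.

    Returns a tuple of 4 * len(positions) floats. Gaps or ambiguous
    bases (anything outside ACGT) become four zeros.
    """
    vector = []
    for pos in positions:
        base = aligned_seq[pos - 1].upper()
        vector.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return tuple(vector)

# Toy sequence (made up): all A except a G at position 91.
toy_seq = "A" * 90 + "G" + "A" * 209
toy_vector = encode_diagnostic(toy_seq)
```

Each Alu element then becomes a 20-dimensional vector that a clustering routine such as k-means can consume.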
Looking for overrepresented pairs
  Identifying nested subfamilies
  Computing biprofiles: frequencies of pairs of nucleotide values
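A biprofile can be illustrated with a short sketch. It counts, for every pair of alignment columns, how often each pair of nucleotide values occurs together. The observed-to-expected ratio used here to flag overrepresentation is an assumption chosen for simplicity and is not the statistical test used in the paper. The function names are hypothetical, and all sequences are assumed to be aligned to the same length.

```python
from collections import Counter
from itertools import combinations

def biprofile(seqs):
    """Frequency of each (column i, column j, base at i, base at j)."""
    n_cols = len(seqs[0])
    counts = Counter()
    for s in seqs:
        for i, j in combinations(range(n_cols), 2):
            counts[(i, j, s[i], s[j])] += 1
    return {key: c / len(seqs) for key, c in counts.items()}

def enrichment(seqs):
    """Observed pair frequency divided by the frequency expected
    if the two columns were independent (an assumed, simple score)."""
    n = len(seqs)
    single = Counter((i, a) for s in seqs for i, a in enumerate(s))
    return {
        (i, j, a, b): freq / ((single[(i, a)] / n) * (single[(j, b)] / n))
        for (i, j, a, b), freq in biprofile(seqs).items()
    }

# Toy alignment (made up): columns 0 and 1 always change together.
toy_alignment = ["AC", "AC", "GT", "GT"]
toy_biprofile = biprofile(toy_alignment)
toy_enrichment = enrichment(toy_alignment)
```

In the toy alignment the pair (A, C) occurs in half of the sequences, twice as often as independence would predict, which is the kind of linked signal that suggests a subfamily.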
RESULTS
Aligned consensus sequences of selected subfamilies
  Roughly 480,000 full-length Alu elements
  Subfamilies split recursively
  213 subfamilies identified
An evolutionary tree of Alu subfamilies
DISCUSSION
  Significant mutations from the consensus sequence
    Detectable by a rigorous whole-genome analysis
  Partial results
    Not statistically discernible
  Limitations of this algorithm
    Excludes:
      Insertion/deletion mutations
      Frequent CpG mutations
      Mutations to nucleotide values already present in other subfamilies
  Statistically distinguishable subfamilies
    Only 19 of the 31 subfamilies currently reported in Repbase Update
Bioinformatics – Discovery Note Online Supplementary data is available at the web page www.igib.res.in/manuscriptdata/aluanalysis.html
Alu distribution in whole genome

Chromosome  Alu J  Alu S  Alu Y  Other Alus  Total Alu No.  Chromosome Size (bp)
1           25043  56044  12209  8114        101410         221782893
2           19679  46673  11295  6438        84085          237637456
3           15812  37539  9135   5044        67530          194846173
4           12857  30347  8158   4242        55604          188402715
5           12932  32423  8023   4351        57729          177705559
6           14449  35722  8375   4959        63505          175762617
7           17486  38816  8277   5150        69729          153794793
8           12092  27148  6203   3825        49268          142788062
9           10741  26910  6496   3441        47588          117013362
10          13909  31110  6707   4378        56104          131098977
11          11858  27461  6357   3744        49420          133239679
12          14932  32314  7026   4718        58990          129362603
13          6467   15929  4307   2114        28817          95228136
14          8921   20201  4392   2931        36445          88182284
15          9631   22169  5284   3000        40084          83582680
16          13913  29451  5462   3864        52690          80889146
17          13542  34653  7025   4150        59370          80734148
18          5935   13285  3333   1915        24468          74619305
19          14135  34297  6130   3912        58474          56446152
20          7245   16478  3058   2236        29017          59424940
21          2681   6965   1865   752         12263          33917895
22          5378   13590  3119   1586        23673          33821705
X           11160  25841  5405   3284        45690          147274156
Y           1699   3547   1128   465         6839           22660226
Un          86     226    68     39          419            1374146

Total Alu No. across all chromosomes: 1179211

Fig. 1. (a) Number of Alu repeats in different chromosomes in the human genome, with vertical segments representing the numbers corresponding to each Alu subfamily
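The counts and chromosome sizes in the table can be combined into a simple density measure. The sketch below uses Alu copies per megabase, which is an assumption made for illustration; the paper may define density differently (for example, as the fraction of bases covered). Only three rows of the table are included, and the names are hypothetical.

```python
# (Total Alu No., Chromosome Size in bp) for a subset of rows from the table.
ALU_TABLE = {
    "1": (101410, 221782893),
    "19": (58474, 56446152),
    "Y": (6839, 22660226),
}

def alu_per_mb(total_alu, size_bp):
    """Alu copies per megabase of chromosome sequence."""
    return total_alu / (size_bp / 1_000_000)

densities = {chrom: alu_per_mb(*row) for chrom, row in ALU_TABLE.items()}

# Chromosomes ordered from the most to the least Alu-dense.
ranked = sorted(densities, key=densities.get, reverse=True)

for chrom in ranked:
    print(f"chr{chrom}: {densities[chrom]:.1f} Alu/Mb")
```

On this subset, chromosome 19 ranks first and chromosome Y last, even though chromosome 1 carries the largest absolute number of Alu elements.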
Alu repeat density and association with genes Fig. 1. (b) Variation in Alu and gene densities in human genome
Alu in intergenic and intragenic regions
  Variation in Alu content in genes of the human genome
  Alu densities in the intergenic and intragenic regions of the human genome
Distribution of Alu subfamilies
  The most abundant Alu subfamily: Alu S, covering 6.4% of the genome
  Chromosome Y
    The most Alu-poor chromosome
    High density of Alu Y; very low density of Alu S and Alu J
  Chromosomes 13 and 9
    Similar trend, with chromosome 13 having the lowest density of Alu J
  Chromosomes 8 and X
    High density of Alu S and Alu J
    Very low density of Alu Y
Correlation analysis
  GC content seems to have the highest association with Alu density overall, followed by gene density and intron density
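The slides do not state which correlation statistic was used. As an illustration only, the sketch below assumes a Pearson correlation coefficient. Per-chromosome GC content, gene density and intron density are not listed in these slides, so the demonstration uses two columns that are available, the Alu J and Alu S counts for five chromosomes, as stand-in variables.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Alu J and Alu S counts for chromosomes 1, 8, 13, 19 and Y (from the table).
alu_j = [25043, 12092, 6467, 14135, 1699]
alu_s = [56044, 27148, 15929, 34297, 3547]

r_j_vs_s = pearson(alu_j, alu_s)
print(f"Pearson r (Alu J vs Alu S counts): {r_j_vs_s:.3f}")
```

To reproduce the ranking on the slide, the same function would be applied to Alu density paired with GC content, gene density and intron density in turn, and the resulting coefficients compared.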
DISCUSSION
  Analysis of Alu distribution in genes
    Statistically significant correlation between Alu and gene densities
    Higher Alu density in intragenic regions: these elements are preferred in genes
    The highest Alu and gene densities: chromosomes 19 and 22
  Alu density is correlated in the order GC content > gene density > intron density
  Abundance of Alu subfamilies: Alu S > Alu J > Alu Y
  Young subfamilies: chromosomes 9, 13 and Y
  Old subfamilies: chromosomes 8 and X
  Higher correlation of older Alus with GC content than younger ones