Skeletal Output Of Contig Assembly
Given a set of N sequence fragments, we first obtain N complementary sequence fragments. The necessity of this step is shown in Figure 3-2 where two sequence fragments labeled 2 and 6 were from the top strand and four sequence fragments were from the bottom strand. The 3'-end sequence of fragment 3 is the same as the 5'-end of the complementary sequence of fragment 2, but not the same with any part of fragment 2 itself. Similarly, the 5-end of fragment 5 is the same as the 3'-end of the complementary sequence of fragment 6. Any practical contig assembly will involve many fragments, but we will use the fragments depicted in Figure 3-2 for subsequently illustrations.
In contig assembly literature, the two DNA strands are often referred to as the plus strand and the minus strand. When we have N sequence fragments and N complementary sequence fragments, contig assembly ideally should generate a plus strand and a minus strand that will be complementary to each other. This can be used for validating the contig assembly algorithm.
5' TACCACAATTACACGA..........................ACGCC 3'
3' ATGGTGTTAATGTGCT..........................TGCGG 5'
Figure 3-2. Six sequence fragments from the double-stranded genomic sequence. Fragments 2 and 6 are collinear with the top DNA strand whereas fragments 1, 3, 4 and 5 are collinear with the bottom DNA strand.
Contig assembly starts by doing pairwise comparisons between fragments i and j. Each pairwise comparison consists of (1) matching the 3'-end of fragment i with 5'-end of fragment j and (2) matching the 5'-end of fragment i with 3'-end of fragment j. Each match is evaluated for its statistical significance. The result of these N*(N-1)/2 pairwise comparisons is summarized in a matrix of matching scores (Figure 3-3). Note that a practical contig assembly in a shotgun sequencing project will typically involve hundreds of thousands of sequence fragments and consequently a matrix with many millions of elements. However, only good scores are kept for identifying mate pairs. So the actual number of scores that need to be kept in computer memory is much smaller.
|
2 |
3 |
4 |
5 |
6 | |
|
1 |
P |
P |
P |
P |
P |
|
2 |
G |
P |
P |
P | |
|
3 |
G |
G |
P | ||
|
4 |
G |
P | |||
|
5 |
G |
Figure 3-3. Matrix of matching scores for assembling the six fragments in Figure 3-2. Symbols P and G stand for poor and good matching scores, respectively. Only good matching scores need to be kept.
In actual computation, the simplest generic data structure for individual sequence fragment is as follows:
DataStructure dsFragment dsFragment fragmentWhose3EndMatchingMy5End() int goodMatchScorel()
int arrayDimension1
dsFragment fragmentWhose5EndMatchingMy3End() int goodMatchScore2() int arrayDimension2 End DataStructure
It is worth noting the difference between two kinds of string matches. The first is the overlapping match (Figure 3-4) used in contig assembly, and the second is the non-overlapping match (Figure 3-4) that is marked and discarded. Relevant statistical methods for evaluating the quality of an overlapping match have already been presented in the first chapter.
Overlapping match
Non-overlapping match
Figure 3-4. Contig assembly uses only overlapping matches.
From the pairwise comparisons of the six sequence fragments and their complements, we obtain the matrix in Figure 3-3 and an array of dsFragments from which one can construct a graph to obtain the contigs (Figure 3-5) by following the directed path. Note that there are two paths. The first is made of fragments 2, 3, 5 and 6 in the order of 2^3 ^5 ^6 and the second is made of fragments 2, 3, 4, 5, and 6 in the order of Both lead to the same assembled contig. Fragment 1 has no arrow to link it to any other fragments and will not be assembled with any other fragment.
Figure 3-5. Directed graph constructed from the matrix of good matching scores in Figure 33. Only good (statistically significant) scores contribute to the graph construction. An arrow directing from fragment i to fragment j means that the 3-end of fragment i overlaps the 5'-end of fragment j with a good score, i.e., each arrow links a mate pair.
Figure 3-5. Directed graph constructed from the matrix of good matching scores in Figure 33. Only good (statistically significant) scores contribute to the graph construction. An arrow directing from fragment i to fragment j means that the 3-end of fragment i overlaps the 5'-end of fragment j with a good score, i.e., each arrow links a mate pair.
There are two major difficulties with the non-greedy algorithm (Figure 35). The first is the large number of possible pathways to construct the contig. Take a sequencing project with 10x coverage for example ("10x coverage" means that the total length of all sequenced fragments is 10 times as long as the estimated genomic length). Each node in the directed graph will on average have 10 connections, with some nodes having many more. As the number of possible paths increases rapidly with increasing number of sequence fragments, it becomes difficult to decide which path to follow. We all know the difficulty when encountering two roads that diverged in the yellow wood, but we cannot afford the luxury of idly gazing down both as far as we could. We have to decide which paths are likely the right ones and let the computer help us travel down all of them and finally check to see if they are all the same. Fortunately, modern computers can go very fast, especially when your instruction is clear.
The second problem with the non-greedy algorithm (Figure 3-5) results from sequence repeats leading to (1) spurious connections and conflicting paths and (2) cyclic paths in the directed graph that trap the program that follows the paths to obtain contigs (Figure 3-6). The cyclic path is produced because the 3'-end of fragment 4 (i.e., T--A--C--T—G) is the same as the 5'-end of fragment 3. The path that would lead to a shortened contig is caused by the 3'-end of fragment 1 (G-T--A—C) being the same as the 5'-end of fragment 4.
Wrong Path leading to a contig shorter than the true one -N 5'-2-3' I-
Cyclic Path
Figure 3-6. The problem of sequence repeats in contig assembly. The top is the true genomic sequence with one segment repeated (shown in bold). The five fragments, with numeric labels to the right, were obtained from the genomic sequence.
The conventional approach to solve conflicting assemblies is to use the Bellman-Ford algorithm to find the shortest path (Kent and Haussler, 2001; Thayer et al., 1999). However, the shortest path does not imply the correct assembly. As illustrated in Figure 3-6, the shortest path will result in omission of repeated sequences. The human and mouse genomes from Celera, assembled from the whole-genome shotgun sequencing that is relatively vulnerable to sequence repeats, have many fewer sequence repeats (which most likely represent artifacts from contig assembly) than the one from Human Genome Project (Istrail et al., 2004). There are also substantial differences between different human genome assemblies (e.g., between NCBI and University of California at Santa Cruz, or UCSC) from public effort (Rouchka et al., 2002). While these different public assemblies have now been coordinated to provide a unique human genome assembly, it is not clear how discrepancies are resolved. A discrepancy between the two may mean both being wrong or one being wrong and one right. Forcing a consensus does not always mean that the result will be right.
Once when I get to this point in the classroom, a student suddenly asked why the two public assemblies could not both be right. My answer was that both assemblies were from the same source of fragments and therefore could not both be right when there were substantial differences. Whenever there was a substantial difference, either assembly A is wrong or assembly B is wrong or both are wrong. The student could not understand the logic and I, at my wit's end, suddenly recalled what Voltaire said in his Philosophical Dictionary that, given the evil we observe in this world, "Either God wishes to expunge the evil from this world and cannot; or He can and does not wish to; or he neither wishes to nor can." Equipped with this sudden rush of insight, I explained that "both assemblies being correct given the differences" was logically similar to "God wishes and can, given the observed evil", both being logically impossible. The student wanted me to show the connection between the two statements. Taking it literally, I draw a circle containing "both assemblies being correct given the differences" in the left of the paper and another circle containing "God wishes and can given the observed evil" in the right of the paper and draw an arrow from the left circle to the right circle. Unfortunately, the student claimed that this was not the connection she wanted. In the end, I was left more confused than the student. (Voltaire probably did not read Isaiah 45:7, King James Version, where one finds God stating "I form the light and create darkness, I make peace and I create evil, I am the LORD who does all of these". All religions share the notion of the good and the evil, especially oriental ones which are rich in dialectical thoughts. According to dialectics, the good cannot be defined or known without the evil. In contrast to the first law of logic, i.e., the law of identify which states that A is A and cannot be anything else, the first law of dialectics is the law of the negation of negation which states that A can be A only if it is not a non-A. In other words, A cannot be A unless the presence of a non-A is a condicio sine qua non. Therefore, God cannot create the good without creating the evil at the same time, or create light without creating darkness at the same time. We cannot see anything when confined in total darkness or when staring into pure light. A balance of light and darkness is necessary for us to see. For the same reason Solomon had wisely stated in Ecclesiastes 7:16-17 that "Be not overly righteous, ...... Be not overly wicked". In this sense, the Bible is quite dialectical. From this point of view we see that Voltaire was logical but not philosophical.)
Another problem with contig assembly, other than those getting both my students and me confused, is that the ending sequences are often of poor quality, making it difficult to have exact matches. However, modern automatic DNA sequencers can perform quality analysis and automatically chop off sequence ends that are poor.
Post a comment