SLIDE 1 Lecture 1: overviews Hello, everyone, and welcome to Genome 540. I'm Phil Green, the instructor. In this opening lecture I'll give you some overviews that hopefully will help to put the course into context. I'll start with some general comments about the role of computation in biology. Then I'll give you my view more specifically of computational molecular biology, emphasizing in particular the fundamental importance of probabilities. Then, because 540 is focussed on computational methods for interpreting genomes, I'll present a high-level, somewhat non-standard view of genome biology which emphasizes 'sites' (rather than, say, genes) as the fundamental units of functional information. This viewpoint turns out to be useful when we're developing probability models for the genome. Next, a summary of genomicists' tasks and the roles that computation plays in those. And finally, an overview of the course content and how it fits in with the computational tasks that you need for interpreting genomes, including some comments on what's not in the course. SLIDE 2 Computation and biology Let's start with three broad ways to think about the relationship between computation and biology. One is that it's like the author list on a scientific paper. Biology is the senior author, which originates and motivates the work and provides the judgment to guide it to a conclusion. Computation is the junior author, which carries out much of the work, provides the energy and a lot of the ideas, but may lack the experience to know what's important and what's not. If you want to do computational biology, you really should try to be like both authors. So in particular, if you come from a computational rather than a biological background, you should spend a fair amount of time trying to understand the biology: taking courses, reading textbooks and current literature, talking to biologists, and in general trying to develop an intuitive feel for the science -- learning to think like a biologist. Even if you're not going to do research in biology it's still worth learning as much about it as you can, because it's an amazing, beautiful and important intellectual achievement. A second way of thinking about computation is as technology -- like microscopes, or sequencing, or CRISPR, for example. It enables you to make scientific discoveries that would be difficult without it. And like these other technologies, computation actually alters the course of the science, by changing the kind of problems that scientists think about once they realize what it can do. In fact it's pretty clear you could not do biology as it's currently practiced without powerful computational methods. Analyzing, or even collecting, massive data sets of the sort that are now central to molecular biology would not be possible. At the same time though, technology should not displace the science: it's not an end in itself. People coming from computational backgrounds tend to prize novelty and aesthetics: a new, clever and elegant algorithm is something to strive for. But while those can be useful criteria, they should never override utility. The purpose of the computational technology is to make biological discoveries, and that's frequently going to involve sacrificing both novelty -- because an old method more often than not does the trick -- and elegance, because the analyses needed may be 'kluges' stringing together a series of different steps.
This may not be very satisfying to someone trained to appreciate the beauty of mathematics and computer science but it typically is just what you need to analyze biological systems (which are themselves 'kluges' produced by a haphazard evolutionary process!) SLIDE 3 A third way to think about computation is in terms of the role it plays in the scientific method. In general, computational analysis can't answer a biological question definitively; rather, it generates hypotheses that need to be tested by experiments, according to the scientific method. You of course want these hypotheses to have some reasonable chance of being correct in order to persuade somebody -- which could be yourself, or a collaborator, or some other biologist -- to carry out the experiments. One of the reasons for using probability models (which we'll be discussing later in this lecture) for your computational analysis is that they allow you to make a strong case that a particular pattern is unlikely to be due to chance and therefore is worth some experiments. An interesting point here though is that sometimes experiments may not be practical, and computational evidence for a biological phenomenon might be the best you can do. This is because evolution can act on extremely subtle effects. For example, a mutation having a fitness effect size of 0.001 -- which means the difference between leaving 999 descendants vs 1000, after some number of generations -- would likely be very difficult to confirm in the lab, but is enormous from the perspective of evolution. In fact, population genetics theory tells us that a fitness effect size of the order of 1/N, where N is the effective population size, is enough for evolution to go to work on. An experimental test of such an effect size (1/N) would require you to work with the entire population of the species! Furthermore, most populations experience a variety of different environments, and it's unlikely you could reproduce all of them experimentally. So even much larger effect sizes may sometimes not be confirmable in the lab, if you can't reproduce the relevant environment. Evolution is a much more thorough experimentalist than humans can be. For some of its experiments, computational analyses of the genome may provide our most convincing evidence. SLIDE 4 Computational molecular biology Now let's focus more specifically on computational molecular biology. I think of it as basically a convergence of three fields: molecular biology -- that part of biology that tries to understand cells and organisms as systems of interacting molecules -- and two computational fields: statistics and computer science. Of these, molecular biology is paramount: it poses the questions and judges the answers (like a 'senior author'). Whether or not your work as a computational biologist is worthwhile ultimately comes down to whether or not you are making a contribution to the biology. The line between statistics and computer science has become increasingly blurred over the past few decades; but there are still non-trivial differences in their perspectives (and in most universities, they are still separate departments). The differences partly reflect a tension between what you might call 'fuzzy thinking' and deterministic thinking. The real world is 'fuzzy', in two ways: first, it's extremely complex; and second, the underlying physical laws are probabilistic in nature (the same inputs can have different outputs).
However, our brains construct simplified models of reality that are mostly deterministic, with causes and effects. Our scientific theories to some extent reflect this deterministic thinking but they also try to address the fuzziness. In computational molecular biology, computer science comprises the deterministic aspects of computation -- computers, programming languages, data structures and algorithms -- while statistics addresses the fuzziness, in particular contributing probability models for biological processes. By helping to simplify the biology and make it more manageable, such computational models play a role similar to that of experimental models such as model organisms and model systems. Because biological systems are much more complex than the systems that physicists and chemists tend to study, biologists have had to become more comfortable with 'fuzzy thinking', and consequently statistics in some ways has a closer relationship to the biology than computer science does. This close relationship goes back at least to the early 20th century, when both genetics and modern statistics were being developed, and a symbiotic relationship between them emerged: Geneticists needed statistics to interpret their data, and statisticians looked to genetics as a source of interesting problems. This was long before it was known that DNA was the genetic material, or electronic computers existed. For computer scientists, fuzzy thinking is less congenial and it can be a bit unsettling to learn that biologists don't even agree with each other on the definition of a gene: one of the most fundamental concepts in genome biology! SLIDE 5 Biology involves probabilities In the next lecture we'll talk more about algorithms, but here I'd like to say a little more about the role of probabilities in biology. Probabilities are important at several levels: SLIDE 6 Probabilistic physical laws One, at the level of the fundamental physical laws governing living organisms, viewed as systems of interacting molecules. These include first, quantum mechanics and quantum electrodynamics, which determine the structure and pairwise interactions of individual atoms and molecules. In the prevailing 'Copenhagen interpretation' of quantum theory, the wave aspect of matter and radiation provides probabilistic information about particle locations. Systems of interacting molecules are complicated enough that there's no hope in practice of directly tracking the individual molecules (their co-ordinates, speeds, and so forth); rather you have to understand the properties of the system by looking statistically at the ensemble of molecules. The relevant theory for this is statistical mechanics and thermodynamics. Again, fundamentally probabilistic. These quotes from the two physicists generally considered the greatest since Newton bear on this issue of probabilities in physical laws. Maxwell is best known for putting the classical laws of electromagnetism in final form, the so-called Maxwell's equations. This was prior to the quantum era, and his laws are deterministic -- they're about waves, but probabilities don't come into them. Nonetheless, he says "The true logic of this world is in the calculus of probabilities". Now in fact Maxwell was also one of the developers of statistical mechanics -- among other things, he helped discover the so-called Maxwell-Boltzmann distribution of velocities of molecules in gases -- and this quote suggests that he regarded that work as perhaps more central to our understanding of how the world works.
On the other hand there's this contrasting quote from Einstein: "I cannot believe that God plays dice with the cosmos". Einstein did not like the idea of fundamental physical laws that were probabilistic in nature, and as a result he never accepted quantum mechanics. Nonetheless if you look at the four great papers he published in 1905 (his 'miracle year'), although two of them (one on special relativity, and one on E = m c^2) had nothing to do with probabilities, the other two are both statistical in significant measure. One, on the photoelectric effect, was one of the instigators of quantum mechanics; and the other explained Brownian motion of dust particles in a water drop under the microscope as the statistical effect of collisions with enormous numbers of molecules (this work persuaded many hitherto-sceptical scientists of the validity of the atomic theory). SLIDE 7 Probabilistic evolutionary processes At a higher level, probabilities are important in evolutionary processes: in mutations as random changes to the DNA; in transmission of DNA from parent to offspring in populations of individuals, inheritance of alleles (via chromosome segregation), and so forth. And then there are the random aspects of a variable environment. Consequently, since genomes are shaped by evolution, you can't really understand them without probabilities! SLIDE 8 Genome biology overview Now let's move on to interpreting genomes, starting with a high-level overview of genome biology. Genomes undergo two fundamental processes, both of which involve copying: Replication, which is the copying of the entire genome into new DNA molecules, and transcription, which is the copying of parts of the genome into RNA 'transcripts'. The functional information in the genome is in the form of what I'll call 'sites', which are short sequence segments (generally from about 3 to about 15 bases) that bind to an RNA or protein molecule, which I'll call the 'reader', to help mediate some function. Sites can be grouped into two broad categories: those that act (are read) at the DNA level, and those acting at the RNA level. Now, those of you who've studied genome biology are probably wondering how I can talk about the functional information in the genome without mentioning 'genes'. One reason I'm not doing that here is the fact I mentioned earlier, that the definition of a gene is not universally agreed on; but a more important reason is that sites are really more fundamental. Genes, as well as other genomic features, are composed of sites. And as we'll see, thinking in terms of sites is quite helpful in developing probability models of the genome. SLIDE 9 Protein-coding gene structure in eukaryotes So we view a gene as a set of sites, and you can see a lot of them here. There are sites acting at the DNA level that control transcription. The unprocessed RNA transcript has sites acting at the RNA level for splicing out the introns and polyadenylation, and the processed transcript includes a 5' untranslated region with a translation start site and possibly some translational regulatory signals; a coding sequence which consists of an array of codon sites; and a 3' untranslated region that may include, for example, microRNA binding sites and protein binding sites that play roles in targeting the transcript within the cell, controlling its degradation and so forth.
The ambiguity regarding the definition of a gene basically comes down to which sites you choose to include in the gene and which you do not, but that becomes an uninteresting semantic issue once you focus on sites rather than genes as the fundamental units. SLIDE 10 Sites There are some subtleties in the definition of sites. One is that binding of an RNA or protein reader to some sequence is generally not sufficient in itself to make it a site; the binding event also has to help mediate some function within the cell. In general that's going to involve the reader interacting with other protein or RNA molecules to carry out some cellular process. Site sequences are generally short enough that they occur frequently in random sequence. A transcription factor binding sequence for example may be 6 or 7 bases, short enough that you expect to find it by chance every few thousand bases. Such chance occurrences may be recognized by a reader molecule and transiently bound, without triggering any function (so not sites, by our definition). So there is presumably a fair amount of non-productive binding to 'dummy sites'. But having short binding sequences also means that it's relatively easy to create new instances of sites via mutation. So from evolution's perspective, small size is a useful feature, rather than a bug. But it does increase the computational challenge of finding the functionally important ones. A second point is that sites aren't necessarily active in every cell. The reader, or its interaction partners required to carry out some function, may not be available (not expressed, or inactivated in some way -- for example via phosphorylation, or by being bound to another protein that prevents binding to the DNA or RNA) or the reader may be prevented from binding by methylation of the DNA or binding of some other protein (e.g. chromatin proteins, or readers at overlapping sites). A third point is that although sites constitute the functionally important part of the genome, the non-site (or 'background') DNA may still carry important information. In particular, the distance between nearby sites can influence interactions between reader molecules, which may be important for function. So the DNA that's between the sites may be important, not for its own sequence, but as a 'spacer' for positioning the sites relative to each other. In addition, the background sequence is important for estimating mutation rates. That's useful even if you're only interested in the sites, because as we'll see one important way to detect sites is as regions that are relatively depleted of mutations due to purifying selection. SLIDE 11 Sites: genomic distribution Sites are distributed non-randomly within the genome, something we'll need to take into account when developing probability models. First, sites recur, in the sense that a given reader will generally recognize multiple different sites. Once evolution has gone to the trouble of creating a particular reader, it tends to reuse it -- for example a given transcription factor is typically used in the expression of several different genes. As we'll see, the different sequence instances of a site usually vary somewhat. The sequence logo (which we'll see an example of shortly) is one nice way of representing this variation in a manner that conveys frequency information. 'Motifs', which are often used but less informative, indicate the possible nucleotides at each position but without frequencies.
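To make that distinction concrete, here is a minimal sketch (in Python, using made-up site instances and an assumed uniform background, not real data from any of the examples in this lecture) of tabulating the nucleotide frequencies at each position of a set of aligned site instances and computing the per-position relative entropy against the background. The set of bases observed at a position is essentially the 'motif' view, while the frequency-weighted relative entropy, in bits, is roughly what the total letter height in a sequence logo conveys.

```python
import math
from collections import Counter

# Hypothetical aligned instances of a short site, all in the same orientation.
site_instances = [
    "TACCTCTGG",
    "TACCTCTGG",
    "TATCACCGC",
    "TACCTCTGG",
    "CATCACCGC",
    "TATCTCTGG",
]

# Assumed uniform background frequencies (a real analysis would estimate these
# from non-site, 'background' sequence).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for pos in range(len(site_instances[0])):
    counts = Counter(seq[pos] for seq in site_instances)
    freqs = {b: counts[b] / len(site_instances) for b in "ACGT"}
    # Relative entropy (in bits) of the site frequencies vs. the background at
    # this position -- roughly the total letter height a logo would show here.
    info = sum(f * math.log2(f / background[b]) for b, f in freqs.items() if f > 0)
    motif_bases = sorted(b for b, f in freqs.items() if f > 0)  # the 'motif' view
    print(pos + 1, motif_bases, round(info, 2))
```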
Sites also typically tend to cluster (we'll call the clusters 'features'): Several sites, with the same or different readers, may act collectively to carry out some function. Often there are positional constraints within the cluster. So for example, coding sequence is made up of multiple codon sites, and the positional constraints there are very strict since each codon immediately follows the previous one with no intervening bases and no overlap. Other constraints (for example, between splice sites) can be more lax. A gene, as we saw before, is a cluster of sites involved in expressing a particular transcript. Expression of a protein-coding transcript involves not only causing the transcription to occur, but also the processing of that transcript (e.g. splicing) and its translation into protein. So several steps are involved in getting to the end product, which is a protein molecule or molecules. There can also be additional steps in processing non-coding transcripts (for example, modification of nucleotides, as you see with tRNAs). SLIDE 12 How much of the genome do the sites represent? In bacterial genomes that fraction seems to be quite high; typically 70% or more of the sequence might be protein coding, and when you add in RNA genes, transcription factor and other regulatory sites, and a replication origin, you're getting up close to 90% or more -- not 100%, because there may be transposons and other parasitic DNA elements, and some DNA that's just playing a spacer role; but the vast majority of the genome does seem to be functional. When you go to more complicated organisms, in particular the human genome, the situation is quite different. There are some 'intelligent design' proponents -- including some scientists -- who believe that either God, or evolution, has efficiently structured the human genome to be almost entirely functional. But I think the current prevailing view among genomicists, based on comparing genomes to estimate the fraction under purifying selection, is that only about 5 to 10% of the human genome is functional. However a precise answer is hard to get because of variability in mutation rates across the genome, and my own belief is that it's even less, around 2% -- 60 million base pairs, or roughly 20 times the size of a typical bacterial genome. That's still a lot of DNA: After subtracting out the 35 million or so bases in protein-coding sequences and known functional non-coding RNAs, there's enough left to allow an average of about 1200 bases of regulatory sequence for each of the 20,000 genes -- much more than has been found even for intensively studied genes. Well, if the sites are less than 10%, what's the other 90% or more? At least 50% is identifiable as transposable elements, retroviruses, processed pseudogenes created by reverse transcription of RNA transcripts back into the genome, and 'dead genes': sequences that look like they were once genes, but lack transcripts and have picked up enough mutations that they are clearly no longer functional. Much of the remaining 40% or so probably arose in the same way (from transposons, etc.) but over hundreds of millions of years has accumulated enough mutations to obscure the original source. Why such a difference between the human and bacterial genomes? There are several reasons why selection for efficient genome organization ought to be much stronger in bacteria than in humans. One factor is relative population sizes.
As was mentioned earlier in the lecture, population genetics theory tells us that evolution acts on fitness differences as small as 1/N, where N is the effective population size of the organism, so it is more sensitive for organisms with large populations. Humans for most of their history seem to have had a fairly small effective population size numbering in the tens of thousands, much less than for the typical bacterial species. Another factor is reproductive life span. Many bacteria grow fast enough in nutrient-rich conditions that replication of the DNA is rate-limiting, so one expects there to be intense competitive pressure to keep replication time, and therefore genome size, as small as possible. Finally, as the genome gets larger, each added transposable element (for example) represents a diminishing percentage of the total, and so one expects selection against it to correspondingly decrease. The human genome at 3 billion bases is 1000-fold larger than a megabase-scale bacterial genome, and so selection against an added transposon should be correspondingly weaker. Consistent with the above, other eukaryotic species such as yeast, Drosophila, C. elegans, and fish tend to be intermediate between bacteria and humans with respect to population size, life span, and genome size, and they're generally also (as predicted by the above argument) intermediate in the estimated functional proportion of the genome. Also consistent with this idea that selection for efficiency has been weak in humans is the finding that in many human cells a surprisingly high fraction of newly synthesized protein molecules misfold and then are immediately degraded. This is a major waste of cellular energy, probably in fact much worse than the energy lost in replicating an unnecessarily large genome. SLIDE 13 DNA sites As mentioned earlier, sites can be grouped into two broad categories: those that act at the DNA level, and those acting at the RNA level. The DNA-level sites usually, but not always, have protein readers, and they help carry out or regulate one of the two fundamental processes, replication or transcription. Replication-associated features (meaning site clusters) include replication origins, centromeres and telomeres. I include telomeres here because one of their major roles is to ensure the faithful replication of the ends of linear chromosomes; and centromeres because they are involved, not in DNA replication per se, but in ensuring the faithful distribution of the products of DNA replication to daughter cells. Transcription involves several types of feature: promoters, enhancers, and suppressors. The readers in this case are called transcription factors. SLIDE 14 [Logo of cI/cro sites] Let's look at an example of a transcription factor binding site. This figure is taken from the web site of Tom Schneider, who invented sequence logos of the sort depicted here at the bottom. Each sequence here is from the genome of a bacterial virus, lambda, which infects E. coli, and consists of a cluster with two binding sites, each of length 9 bases, for the transcription factor cro (a different transcription factor, cI, also recognizes these sites). There's a 1-base spacer (always the same size!) between the two sites, so the total length of each cluster sequence is 2 x 9 + 1 = 19 bases. There are 12 different sequences shown here, but they correspond to only 6 different clusters in the lambda genome, because both DNA strands are given for each cluster.
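Since both strands of each cluster are listed, half of the sequences are the reverse complements of the other half. As a minimal sketch (with a made-up 19-base example, not one of the actual operator sequences), reverse complementation just means complementing each base and reversing the order:

```python
# Complement each base and reverse the order: this gives the sequence of the
# opposite strand, read in its own 5' to 3' direction.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

# Made-up 19-base cluster (two 9-base sites plus a 1-base spacer), purely for
# illustration -- not an actual lambda operator sequence.
cluster = "TATCACCGCAGGTGGATAA"
print(reverse_complement(cluster))  # the same cluster as read from the other strand
```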
So sequence 2 here is the reverse complement of sequence 1, 4 is the reverse complement of 3 and so on. To fully understand what's going on here we need to picture the protein binding to the DNA in 3 dimensions. SLIDE 15 [DNA structure schematic] First, recall the two-stranded molecular structure of DNA. This slide shows it schematically on the left, indicating the phosphate-sugar backbone on the outside and the AT and CG base pairs on the inside. The two strands run in opposite directions: the strand on the left has its 5' end on the top and its 3' end on the bottom, whereas the strand on the right has them reversed (the sugars are upside down). SLIDE 16 [3D Models of DNA, bound factors] Now, on the left here is a space-filling model of the atoms in the DNA molecule. Note first of all that it is a double helix, with the two strands winding around each other and base-pairing with each other. These two ridges are the sugar-phosphate backbones of the two strands, and the base-pairs you can see to some extent in the grooves. There are two continuous grooves: the so-called major groove here, which goes around the back and re-emerges here, and the minor groove here, going back behind and coming up here. Every base shows up partly in the major groove and partly in the minor groove. Although you can't really make out the base pairs, each full turn of the helix -- so for example from a point here to a point here -- or a point here to a point here, corresponds to about 10.5 base pairs. Note also that the helix is right-handed, meaning that if you point the index finger of your right hand along a ridge of the helix, your thumb points in the vertical direction (up or down) that the ridge is going. If you try that with your left hand instead, the thumb points in the wrong direction. Most transcription factors bind primarily in the major groove, because there are more opportunities to make contacts with atoms in the nucleotide bases there, although there are some that bind within the minor groove, or within both grooves. Here on the right is a stereo image: if you cross your eyes to make the two images converge you can see this in 3D. What's depicted is two identical copies (one in blue and one in yellowish-green) of one of these proteins (cI or cro, I'm not sure which) binding to the double helix at one of the sequences depicted on the earlier slide. Both are making contact with the major groove but also with each other, and in fact that interaction with each other helps to increase the overall stability of both molecules binding to the DNA sites, so it's important. For this particular protein, the contact between the two copies requires one to be flipped around with respect to the other. That means that the DNA sites that they bind to are also flipped around (in reverse orientation). Back to SLIDE 14 Understanding that helps clarify what's going on here. As we said, there are 6 different clusters; and the two sites in each are in opposite orientations so that the two protein molecules can contact each other. To compare all twelve copies of the site, you need to include the reverse complements so that the right-hand site copies are put in the same orientation as the left-hand ones. Note that for convenience each site copy is represented by a single sequence strand, but the protein itself contacts both strands simultaneously. So on the left side here you have 12 copies of the 9-base site all in the same orientation; on the right you have the same 12 sites but now all in the reverse orientation.
So the sequences on the right (and the overall patterns) are the reverse complements of what's on the left. Note that the site sequence is not invariant. Down here is a sequence logo, which reflects the frequencies with which different nucleotides are used at each position. In all of these copies of the sites you have a C here, and an A here; these presumably correspond to the most important contact points with the protein; in the reverse complements you have a G here and a T here (the complementary nucleotides, in the reverse order). Some of the other positions appear to show somewhat weaker conservation and others appear free to vary (including, as you'd expect, the 'spacer' nucleotide between the two sites). The logo is actually comparing two probability models, one reflecting frequencies with which nucleotides are used in instances of the site, and the other the so-called background probability model which gives the average frequencies of nucleotides at non-sites (or the genome as a whole). The total letter height at each position indicates how much better the site model fits than the background. In a sense this corresponds to the information the protein needs in order to pick out a site from the rest of the genome. The biology behind the sequence variability is interesting and not always understood. Some variability is presumably at positions where contact with the protein is weak or non-existent, and so has little impact on the strength of binding. At positions where contact does occur, some variability might be important in regulating the strength of binding (some sites might need to be bound more tightly than others). The curve that's depicted here has a spacing of 10.5 bases between peaks, and is intended to reflect the fact that the proteins are binding on one side of the helix, so you might expect the contact points in the two site copies, and the most highly conserved bases, to show that spacing. That seems to be approximately true here. Another thing to note is that the total number of conserved positions in one copy of the site is small: only about 3-4 bases -- so it makes sense that you need a cluster of two of them to get some specificity and stable binding. SLIDE 17 RNA transcript sites The second broad class is sites acting at the RNA level: in this case the reader recognizes the site within a transcript, not the genomic DNA, and thereby helps to carry out the transcript's function. (Of course the site's sequence is present within the genomic DNA, since the transcript is a copy of it). Often, but not always, the reader is itself an RNA transcript. As we saw earlier, protein coding transcripts contain a variety of sites. There are also RNA transcripts that don't encode proteins but carry out some other function in the cell -- for example tRNAs, ribosomal RNAs, spliceosomal RNAs, microRNAs, and a variety of so-called lncRNAs (long noncoding RNAs). These all contain at least one type of site that might seem a little strange, but fits the definition I'm using: namely 'stems' that base-pair one short sequence within the transcript to a complementary sequence within the same transcript, and which thereby help to give the transcript a structure that is important to its function. So these are functionally important sites within the transcript, in which the same transcript is reading itself. SLIDE 18 [Codon table] Here's an example illustrating some RNA sites involved in protein translation. Within a protein-coding transcript the codons are 3-base sites whose readers are tRNAs.
In this case the codon is AGU, which you look up in the table as first base A, second base G, third base U, so it's a serine codon. The reader is a tRNA 'charged' with the amino acid serine that has the complementary 'anticodon' sequence ACU (in 5' to 3' order). Further, the tRNA, as a functional RNA in its own right, has a stem structure, i.e., as we just discussed, sites where the RNA is base pairing internally to itself, to give it the right structure; and also additional sites, one of which is recognized by the tRNA synthetase protein which covalently attaches a serine molecule to the tRNA, and others which are recognized by proteins that modify some of the tRNA's nucleotides to stabilize it. Serine has five additional codons that are read by other tRNAs. Some tRNAs can read more than one codon -- in fact I think this tyrosine tRNA can read both of the tyrosine codons, via wobble pairing -- so there is sequence variability for some of these codon sites similar to what we saw in the transcription factor case. Of course the translation process involves this whole complex (tRNA + mRNA codon) interacting with the ribosome and there are other sites in the tRNA and the coding transcript that are recognized by various proteins within the ribosome. SLIDE 19 [Splicing] Here's another example: sites involved in the splicing process. First, the so-called 5' splice site (meaning that it is at the 5' end of the intron). The reader is the U1 small nuclear RNA that recognizes it by base pairing. This picture shows perfect pairing, but in fact typically the pairing is not perfect; that is, you don't have an absolutely required base at all these positions within the transcript. So the U1 RNA sequence here can base pair with multiple possible sequences here, and you have sequence variability as in the previous examples. The U1 RNA has its own sites, including the stem structures that stabilize it, sites that interact with protein components of the spliceosome, and so forth. A little upstream of the 3' end of the intron, you've got the branch site which is recognized by the U2 snRNA. And then at the 3' end, you've got other sites that are recognized by protein components of the spliceosome. So the readers of transcript sites are not always RNAs. In this situation there are some constraints on site spacing but they are quite weak, as indicated by the fact that intron sizes vary enormously. If the intron is too small, the splicing machinery either can't recognize the intron at all or it can't process it correctly, so there is a lower bound on intron size (about 70 to 80 bases in human genes, with rare exceptions). But there's no corresponding upper limit and there are introns that are hundreds of kilobases long. SLIDE 20 Genomicists' tasks Given that view of the biology, in order to interpret genomes, genomicists first need to get the genome sequence, and identify the transcripts that are made from the genome. Then they have to find the sites, being mindful that sites can act either at the DNA or at the transcript level. On the following slides I'll say a bit more about each of these. Finally they have to illuminate the molecular functions of the sites. This is really the most open-ended and difficult part, requiring a variety of methods, and I won't try to cover it.
One thing that's very helpful though is the fact that sites recur, not only within one organism's genome but also between the genomes of different organisms, and so once you've figured out the function for a particular site you largely understand the function for many other occurrences of that site as well. SLIDE 21 Finding the genome sequence The main approach to finding the genome sequence requires getting 'reads', which are the sequences (often with basecalling errors) of relatively short pieces of the genome, and then assembling those to infer the underlying genome sequence. The assembly process involves, in essence, finding sequence matches between portions of the reads, figuring out from these how the reads overlap in the genome, and then piecing the overlapping reads together while identifying and eliminating basecalling errors in order to reconstruct the underlying genome sequence. The main challenge in assembly is duplicate or nearly duplicate sequences within the genome. These can arise in several ways. One is parasitic DNA elements (such as transposons), which create copies of themselves with typical sizes in the range of a few hundred to a few thousand bases. A second source of duplicate sequences is errors in DNA replication that create so-called segmental duplications, which can be much larger -- up to several megabases in size. Consequently, when two reads have portions that are highly similar, one has to consider the possibility that they do not actually overlap within the genome but rather come from different duplicated segments. The different segments often have acquired sequence differences via mutation that in principle should help to identify spurious overlaps. But read basecalling errors complicate this, and evolutionarily recent duplicates can be essentially identical. (It helps to have probability models for both basecalling errors and the mutation process!) A variety of assembly strategies have been developed that try to ameliorate the duplicate segment issue, but it's really the advent of very long reads (longer than most duplicated segments) that's been the 'killer technology' finally allowing assembly of essentially complete human genomes. Comparing reads to each other remains an important requirement though. SLIDE 22 Finding transcripts ('RNASeq') Now, finding the transcripts, sometimes called "RNASeq". Since it's easiest to sequence DNA, RNASeq involves first of all making cDNA copies of the processed transcripts using reverse transcriptase, and then getting sequence reads from the cDNA. Typically these reads are then aligned to the genome, which allows targeted assembly to be done for reads mapping to the same genomic region in order to reconstruct full transcript sequences from the region. There are a number of issues here that collectively make finding all the transcript sequences a more challenging problem than sequencing the genome. One is that many transcripts can be spliced in more than one way ('alternative splicing'), resulting in multiple isoforms that may encode different proteins. Different isoforms share parts of their sequences with each other, which presents an assembly problem similar to that presented by duplicate segments in genome assembly. Another issue is 'expression bandwidth': genes may be expressed at very different levels, with some transcripts several orders of magnitude more frequent than others. This greatly increases the amount of sequencing that must be done in order to be sure of getting the rarest transcripts.
That issue doesn't really arise in genome sequencing since all portions of the genome are equally represented in the starting DNA from which libraries are made (although library construction can sometimes introduce biases!). Another issue is that expression level, and to some extent splicing, depend on the cell type. So you have to make cDNA libraries from many different cell types to maximize the chance you're getting all isoforms of all genes. Yet another problem is that at least some transcripts are non-functional. Some transcripts from protein-coding genes result from splicing errors (which are common enough that there is a cellular process, nonsense-mediated decay, for detecting and degrading them). Most lncRNAs, which by definition lack protein coding potential, also currently lack any other known function within the cell. Novel functions have been discovered for a few of them, and may yet be found for others. But given that selection for efficiency appears to be relatively weak in the human genome, it's also quite plausible (even likely!) that most lncRNAs are simply transcriptional 'noise'. Alternatively many of them may be byproducts of transcriptional events in which the act of transcription is important because it remodels the chromatin (which may be important for expression of nearby genes), but the transcript itself is non-functional. In such a situation the functional sites of interest would be the transcription-inducing sites acting at the DNA level, and not RNA-level sites in the transcript. SLIDE 23 Finding sites A harder problem is to find the sites. One method that a lot of work has gone into, for example in the ENCODE project, is the direct detection of binding events. One approach for this is to use antibodies (or some kind of tagging) to a particular transcription factor to isolate that factor bound to DNA and then sequence the DNA. A similar strategy could be used for other DNA binding proteins, and presumably also for RNA-binding proteins. This approach seems to be limited to readers that are proteins, and it requires some knowledge of what they are; but a more serious objection is that, as we discussed earlier, you can have binding without it being functional, so the sequences you get may include non-sites. A somewhat complementary computational approach is to look for clusters of recurring motifs, not only known ones from the binding studies but also novel ones which have similar lengths, distributions of conserved positions, nucleotide composition, and clustering patterns to the known ones. As mentioned earlier, you do have the challenge that because site motifs are so short they occur often by chance. So small clusters (of 1 or 2 sites) may not be reliably detected. Both of these methods, then, are error-prone to some degree. SLIDE 24 Compare genomes A different and generally more definitive strategy (although with its own limitations!) is to compare genomes that differ from each other and try to relate the sequence differences to phenotypic differences (in physiology, or other organismic or cellular characteristics). This can point you both to the sites and, sometimes, to a functional role for specific sites. The key point is that a difference in phenotype usually means that there is a sequence difference affecting sites. For phenotypes at the cellular level you have to be a little careful because of the fact mentioned earlier, that for a variety of reasons sites may vary in activity across cells and so cells with identical genomes can nonetheless have different phenotypes.
But assuming you can control or check for that, phenotype differences usually imply sequence differences that either alter a site's activity or create a new site. Insertion or deletion mutations in background sequence between two sites could also have a phenotype by altering site spacing, but that's not an issue with point mutations. In experimentally manipulable organisms (or cells) you can in principle find sites using CRISPR (for example) to systematically make mutations and assess phenotypes. Of course this is not, as yet, practical on a genome-wide scale, and more seriously it may not actually find all sites because (as previously discussed) our own ability to assess phenotypes in the lab is much less sensitive than evolution's. Another approach is to leverage naturally occurring mutations by comparing different members of a population. The problem here is that generally any two individuals have multiple sequence differences and multiple phenotypic differences, so associating phenotype with genotype is challenging. What you have to look for is correlations of recurring differences -- shared phenotypes between individuals having shared genotypic changes. GWAS studies do this in a targeted way. But GWAS typically does not, at least by itself, pinpoint the affected site because genomic variants in linkage disequilibrium with each other often have similar correlations with the phenotype. SLIDE 25 [different species] Finally, you can compare different species. Now you have many sequence differences and many phenotype differences, the numbers depending on how closely related the species are. We again expect phenotype differences to largely reflect differences in site content, i.e., sites present in one organism but absent from the other. Conversely, shared sites should typically correspond to shared aspects of phenotype. Since background (non-site) sequence is not under purifying selection, it accumulates mutations more rapidly than site sequences do. For very distant organisms, the density of accumulated mutations may make alignment of background impossible; for closer organisms, it may be possible, but shared sites should still be detectable as having higher similarity (greater 'conservation', due to purifying selection) between the species than the background. So in both cases the alignment can give us information about shared sites, but it says essentially nothing about lineage-specific sites underlying the differences between the species. SLIDE 26 [Phastcons] Illustrating that, this figure from Adam Siepel's paper on his program phastcons depicts alignments of a particular region in the human genome to four other vertebrates: mouse, rat, chicken and fish. Down here we see the sequence, up here a schematic indicating aligned segments. You can see that the alignment to chicken and especially fish is pretty much confined to exons in the genes in this region, with a little bit of surrounding sequence that may include some regulatory sites. The alignment to our fellow mammals mouse and rat is much greater, which could be partly due to shared mammalian-specific sites but is mainly due to the background sequence still being alignable for these. Within the aligned portions there are some regions, almost certainly clusters of shared sites under purifying selection, that stand out as having a much higher degree of conservation, indicated by these bars here.
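To give a rough sense of the underlying idea (this is a minimal sketch, not what phastcons actually does -- phastcons uses a phyloHMM, discussed near the end of this lecture), one can slide a window along a pairwise alignment and flag windows whose identity is well above what is expected for neutrally evolving background. The toy aligned sequences and the 90% threshold below are made up purely for illustration.

```python
def window_identity(aligned_a: str, aligned_b: str, window: int = 10):
    """Yield (start, fraction identical) for each window of a pairwise alignment."""
    assert len(aligned_a) == len(aligned_b)
    for start in range(len(aligned_a) - window + 1):
        a = aligned_a[start:start + window]
        b = aligned_b[start:start + window]
        matches = sum(x == y and x != "-" for x, y in zip(a, b))
        yield start, matches / window

# Toy aligned sequences (hypothetical): a conserved stretch embedded in more
# diverged background sequence.
human_like = "ACGTTAGCCTAGGCTTAACGATCGT"
mouse_like = "ATGATAGCCTAGGCTTAACGTACAT"
for start, ident in window_identity(human_like, mouse_like, window=10):
    if ident >= 0.9:  # assumed threshold; real methods model mutation rates explicitly
        print(f"candidate conserved window at {start}, identity {ident:.2f}")
```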
SLIDE 27 Some major computational tasks Hopefully, you'll have noticed in the preceding survey of genomicists' tasks that there are some recurring computational themes. One is comparing and aligning sequences. Sequence assembly involves comparing reads to reads. Variant detection, and transcript assembly, involve comparing reads to genomes. Finding evidence of shared sites between species involves comparing genomes to genomes. The appropriate alignment method depends on how similar the sequences are. In these first two cases the sequences are highly similar to each other, with isolated single-base differences from basecalling errors and mutational variants, and (in the case of transcripts aligned to the genome) isolated large gaps corresponding to introns. In all these cases, methods of the sort we'll discuss in the next lecture allow you to quickly find large perfectly matching segments which can then be easily extended to alignments. However the last case, involving distantly related genomes, or genes, requires more sensitive methods which we'll discuss later in the course. SLIDE 28 A second recurring computational requirement is models -- simplified representations of the genome and genome alignments that can guide computational analyses. To computationally find sites within genome sequences, we'll want to model sites and site clusters, as well as the non-site background. To find shared sites within genome alignments, we'll want to computationally model mutation and purifying selection. In addition, to find the alignments themselves we'll need a model to tell us how to do the scoring of an alignment. It will turn out that the models we develop for the above purposes are helpful in analyzing not just the sequences, but also various types of lab-generated 'linear' data relevant to genome interpretation -- for example read depth, or protein binding information. SLIDE 29 Probability models So, what type of model? It won't be a surprise that we favor probability models, for several reasons: First, genomes result from evolution, which is inherently probabilistic as we discussed earlier. Second, what we're trying to do is basically to detect site signals in a noisy background, and for signal to noise problems probability models are a widely used and powerful approach. Noise almost by its definition has to be modelled probabilistically. And as we've seen, there's a lot of variability in site sequences, so it makes sense to model that probabilistically as well. Finally, probability models allow you to report a measure of confidence in your prediction, which is important when you're trying to persuade someone to commit experimental resources to confirming it. SLIDE 30 Models: simplicity vs complexity There is still an important and non-trivial question as to how complex our models need to be. To start with, there are a few illuminating quotes: One, from the British statistician George Box, is 'all models are wrong; some models are useful'. He means that the real world is complex enough that no model can fully capture it, and so any model is 'wrong' in that sense. That's especially true of biological systems, which are far more complex than simple physical systems. But some models are able to capture enough of the reality to be able to make some progress in understanding it, and so are useful. An older quote, from the French poet and philosopher Paul Valery, is 'What is simple is always wrong. What is not [by which he means what is not simple] is unusable'.
He's talking more broadly about how we think about reality, and saying that we have no choice but to simplify. An even earlier quote, apparently from Einstein -- there is some dispute about whether he actually said this -- is 'Everything should be made as simple as possible, but not simpler'. He's talking here about the process of trying to deduce physical laws, and is saying, again, that you have to simplify reality to some extent, but that it's important to simplify to the appropriate level. SLIDE 31 Some disadvantages of complexity Okay, so all models are going to be oversimplifications to some extent, but that still doesn't tell us how complex they should be. Generally, we start with simple models ('as simple as possible' as Einstein would say), and then gradually make them increasingly complex to capture more and more of the biology. As you probably know there's a lot of exciting research now in developing and applying deep neural nets, which are extremely complex computational models that can have literally billions of parameters, to all sorts of things including molecular biology. So now that we have the ability to do that, what's the point of using anything simpler? Well, complex models have some significant disadvantages that it's important to be aware of. In increasing order of seriousness: One of course is that they're more computationally challenging to work with. A second is 'overfitting'. The more parameters your model has, the more it will tend to capture chance characteristics of the training data that you use for estimating the parameters, with the result that the trained model, although fitting the training set very well, performs poorly on 'new' data not in the training set. And of course it's the new data that you're most interested in. Neural net developers are well aware of this issue of course and have methods to try to avoid it, but those methods don't work perfectly. For example it has been found that deep neural nets for image classification typically fail on images to which a small amount of pixel noise has been added -- such that a human has no problem seeing what's in the image, but the neural net breaks down because it's 'overfit' to data lacking such noise. Statisticians have given a lot of thought to the issue of overfitting in complex models, but I think it's fair to say it's not really a solved problem yet. You can reduce overfitting, but not eliminate it. Even simple models will tend to overfit to some extent. Finally -- and this I think is really the most serious issue -- complex models are difficult to interpret: what do the billions of parameters mean? How does the model really 'work'? Basically, a complex model is a 'black box'. You can run it on data, and get an answer, and it might even be a reliable answer, but you don't understand why you got that answer. I'd argue that this is in fact anti-scientific, since the goal of science is to understand reality and a black box that simply makes predictions doesn't provide that understanding. Again, people who work on deep neural nets recognize model interpretability as an important issue, and have made some progress on it, but it's a very hard problem and I think it's fair to say that it is far from being solved. SLIDE 32 This course In this course we'll focus on sequence-based computational molecular biology; that is, general methods for obtaining and analyzing the information encoded in the genome sequence.
As I mentioned earlier, many of the methods apply more broadly to what you might call 'linear data' associated with genomes. We'll emphasize the underlying biology, discussing along the way some of the biological facts that are relevant to the computational methods. We'll favor simple and interpretable probability models. This is not to discount the importance of more complex machine learning models, but rather it's because if you're going to be working with probability models at all it's good to start with simple ones. And as we'll see you can get surprisingly far with those. I suspect there's a place for models that are intermediate in complexity between deep neural nets and the fairly simple hidden Markov models that we'll be talking about in this course -- still simple enough to be interpretable, but incorporating more of our knowledge of the genome. But that's for future research! Regarding proofs: Typically I just try to give you some intuition for proofs, and either omit the details altogether or include them on slides that I skip over in the lecture. This reflects how I personally tend to absorb proofs, which is to read through them several times, each time getting some of the ideas. That doesn't fit very well into a lecture. Hopefully you won't find that too frustrating. SLIDE 33 Main topics Here's a survey of the course content; it is laid out in more detail on the course web page. In the next lecture we'll discuss algorithm generalities and then suffix arrays and hash tables, which are methods for finding exact sequence matches. Then, probability models for background sequence. Then, probability models for sites, and related to those, weight matrices and sequence logos, which have to do with comparing two models, namely site models and background models. Then I want to talk about dynamic programming, which is sometimes called the "Fundamental algorithm of computational molecular biology". The reason it has that name is that it comes up in many different computational biology contexts. I'm going to present this algorithm in the context of finding highest weight paths on weighted directed acyclic graphs, for two reasons: First, I find it very helpful to have the concrete picture of a graph for understanding how the algorithm works. Second, in the applications we want to make of this algorithm in computational biology, there nearly always is an associated graph for which highest weight paths or related constructions correspond exactly to what you want to compute (there's a small sketch of this idea just after this list of topics). Then I'll discuss the problem of finding sequence regions which compositionally don't look like the background. I call this 'HMMs lite' because the ideas are closely related to what comes up in 2-state hidden Markov models. Then I'll talk about gapped-alignment algorithms. The weighted directed acyclic graph that's associated to these is called an edit graph, and the Smith-Waterman algorithm is just dynamic programming to find a highest weight path on this graph. We'll talk about those more generally for aligning multiple sequences. Then finally we'll get to hidden Markov models and some of their applications. In particular, HMMs are a good way of parsing genomes into sites and background. HMMs really generalize site models and background models, which you can think of as corresponding to different states or sets of states in the HMMs. And finally finding conserved regions within alignments. PhastCons (which I showed you a slide from earlier) uses what's called a phyloHMM, so we'll talk about those.
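Here is the small sketch promised above -- a toy example with hypothetical node labels and edge weights, not an implementation from the course materials -- of the basic dynamic programming idea: compute a highest-weight path in a weighted directed acyclic graph by processing nodes in topological order.

```python
# Toy DAG: each node maps to a list of (successor, edge weight) pairs, and the
# node order 0,1,2,3,4 happens to be a topological order (edges go forward).
edges = {
    0: [(1, 2.0), (2, 1.0)],
    1: [(3, 4.0)],
    2: [(3, 1.5), (4, 3.0)],
    3: [(4, 2.5)],
    4: [],
}

best = {node: 0.0 for node in edges}          # paths may start anywhere with score 0
backpointer = {node: None for node in edges}  # remembers how each best score was reached

for node in sorted(edges):                    # topological order for this toy graph
    for succ, weight in edges[node]:
        if best[node] + weight > best[succ]:
            best[succ] = best[node] + weight
            backpointer[succ] = node

# Trace back from the highest-scoring node to recover the best path.
end = max(best, key=best.get)
path = [end]
while backpointer[path[-1]] is not None:
    path.append(backpointer[path[-1]])
print(list(reversed(path)), best[end])        # [0, 1, 3, 4] 8.5
```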
In the context of phyloHMMs I'll also talk about simple molecular evolution models, because you need those there. There's a lot more to say about molecular evolution models that I won't get into. I'll just give you a simple idea of how to do some of the basic calculations. SLIDE 34 We do not cover Things we don't cover: Although HMMs can be used to find new site motifs, there are other approaches that can be more powerful. Deeper discussion of sequence evolution models. We don't talk about statistical genetics, so we don't tell you how to do a GWAS, for example. As I mentioned we don't talk about deep neural nets or other complex machine-learning models (beyond HMMs). You could argue that all of the above are relevant in interpreting genomes, so they're part of 'linear' computational molecular biology. But of course we also don't cover 'non-linear' computational biology, which goes beyond genome sequences. For example, modelling the 3D structure of a protein, metabolic and signalling pathways, models for interacting molecules within a cell. Many of these topics are covered in other UW courses. In Genome 541, the sequel to this course, the content varies from year to year, but in the past has often covered some of these topics; there are also relevant courses in computer science, statistics and biostatistics. So that concludes this overview lecture. In the next lecture we'll discuss algorithms, and specifically a clever one called the suffix array algorithm which you'll need to implement in the first homework assignment.
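For anyone who hasn't met suffix arrays before, here is a naive sketch (toy string, brute-force sorting of all suffixes) just to make the data structure concrete; this is not the efficient construction algorithm covered in the next lecture, and it isn't meant as a solution to the homework.

```python
def naive_suffix_array(text: str) -> list[int]:
    # The suffix array lists the starting positions of all suffixes of `text`,
    # ordered so that the corresponding suffixes are in lexicographic order.
    # Sorting the suffixes directly like this is simple but far slower than the
    # construction algorithm discussed in the next lecture.
    return sorted(range(len(text)), key=lambda i: text[i:])

print(naive_suffix_array("BANANA"))  # [5, 3, 1, 0, 4, 2]
```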