Biology is the study of living things, and biologists collect and interpret data about them. Today, sophisticated laboratory technology allows us to collect data faster than we can interpret it. We have vast volumes of DNA sequence data, but how do we figure out which parts of that DNA control the various chemical processes of life? We know the function and structure of some proteins, but how do we determine the function of new proteins? And how do we predict what a protein will look like, based on knowledge of its sequence? We understand the relatively simple code that translates DNA into protein. But how do we find meaningful new words in the code and add them to the DNA-protein dictionary?
Basically, bioinformatics is the science of using information to understand biology. It may be broadly defined as the interface between Life Sciences and Computational Sciences. It is a subset of the larger field of Computational Biology, which includes the application of quantitative analytical techniques in modeling biological systems. It is a science that deals with genomics - an area of study which looks at the DNA sequence of an organism in order to determine which genes code for beneficial traits and which genes are involved in inherited diseases. As biology advances, more and more information is generated, and scientists need a way to store and analyze it. Computers can greatly assist that process. As a result, a new research area, which combines the study of biotechnology and the use of computers, is emerging. It involves the use of Internet tools, artificial intelligence and other advanced computational methods to assist in storing and analyzing data generated from DNA sequencing.
The field of bioinformatics relies heavily on work by experts in statistical methods and pattern recognition. Researchers (or bioinformaticians) come to it from many fields, including mathematics, computer science, and linguistics. By providing algorithms, databases, user interfaces, and statistical tools, bioinformatics makes it possible to do exciting things such as compare DNA sequences and generate results that are potentially significant. These new tools also give you the opportunity to overinterpret data and assign meaning where none really exists!
Bioinformatics is thus the study of the information content and information flow in biological systems and processes and the application of computational and analytical methods to biological problems. But the main goal of bioinformatics isn't developing the most elegant algorithms or the most arcane analyses; the goal is finding out how living things work!
How Is Computing Changing Biology?
An organism's hereditary and functional information is stored as DNA, RNA, and proteins, all of which are linear chains composed of smaller molecules. These macromolecules are assembled from a fixed alphabet of well-understood chemicals: DNA is made up of four deoxyribonucleotides (whose bases are adenine, thymine, cytosine, and guanine), RNA is made up of four ribonucleotides (whose bases are adenine, uracil, cytosine, and guanine), and proteins are made from the 20 amino acids. Because these macromolecules are linear chains of defined components, they can be represented as sequences of symbols. These sequences can then be compared to find similarities that suggest the molecules are related by form or function.
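To make the idea of "sequences of symbols" concrete, here is a minimal Python sketch that treats DNA and RNA as plain strings over their four-letter alphabets. The function names are illustrative choices, not part of any standard library, and the transcription step is simplified to a letter substitution on the coding strand.

```python
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")


def is_valid(sequence, alphabet):
    """Return True if every symbol in the sequence belongs to the alphabet."""
    return set(sequence.upper()) <= alphabet


def transcribe(dna):
    """Rewrite a DNA string as RNA by replacing thymine (T) with uracil (U).

    This is a simplification: it only swaps letters on the coding strand.
    """
    if not is_valid(dna, DNA_ALPHABET):
        raise ValueError("sequence contains symbols outside the DNA alphabet")
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    print(transcribe("GATTACA"))
    print(is_valid("GAUUACA", DNA_ALPHABET))
```

Once a molecule is a string, every tool for comparing and searching text can, in principle, be applied to it.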
The Eye of the Fly Example:
Fruit flies (Drosophila melanogaster) have a gene called eyeless. If this gene is knocked out (i.e., eliminated from the genome using molecular biology methods), the result is fruit flies with no eyes. Hence the eyeless gene plays a role in eye development.
Researchers have identified a human gene responsible for a condition called aniridia. In humans who are missing this gene, or in whom the gene has mutated and stopped functioning properly, the eyes develop without irises.
If the gene for aniridia is inserted into an eyeless Drosophila, it causes the production of normal Drosophila eyes, which is an interesting coincidence. So there is a similarity in how eyeless and aniridia function, even though flies and humans are vastly different organisms!
To gain insight into this similarity, the sequences of the two genes can be compared. Sequence comparison is possibly the most useful computational tool to emerge for molecular biologists. Today, a molecular biologist can compare an uncharacterized DNA sequence against a public database of genome sequence data, a collection of DNA sequences available to a worldwide community of users on the World Wide Web.
As little as 15 years ago, looking for similarities between eyeless and aniridia DNA sequences would have been like looking for a needle in a haystack. Most scientists compared the respective gene sequences by hand, aligning them one under the other in a word processor and looking for matches character by character. This was very unwieldy. However, in the late 1980s, fast computer programs made pairwise comparison of biological sequences practical, and pairwise comparison is the foundation of the most widely used bioinformatics techniques.
How does sequence alignment work?
It's important to remember that a biological sequence (DNA or protein) has a chemical function, but when it's reduced to a single-letter code, it also functions as a unique label, almost like a bar code. The sequence label can be applied to a gene, its product, its function, its role in cellular metabolism, and so on. The user searching for information related to a particular gene can then use rapid pairwise sequence comparison to access any information that's been linked to that sequence label.
The most important thing about these sequence labels, though, is that they don't just uniquely identify a particular gene; they also contain biologically meaningful patterns that allow users to compare different labels, connect information, and make inferences. So not only can the labels connect all the information about one gene, they can help users connect information about genes that are slightly or even dramatically different in sequence. If unique identification were all that mattered, every DNA sequence could simply be slapped with a unique number or ID and be done with it. But biological sequences are related by evolution, so a partial pattern match between two sequence labels is a significant find.
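One common way to score a partial match between two sequences is local alignment. The sketch below is a minimal version of the Smith-Waterman dynamic-programming method that returns only the best score. The scoring values (match, mismatch, gap) are arbitrary illustrative choices, and the function name is hypothetical. Production tools use tuned scoring schemes and much faster heuristics.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Only two rows of the scoring table are kept in memory at a time.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can start anywhere.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    # The two toy sequences share only the short stretch "ACG".
    print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

A high score between two sequences that are otherwise quite different is exactly the kind of partial pattern match described above.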
Basically, bioinformatics involves:
· Finding the genes in the DNA sequences of various organisms.
· Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences.
· Clustering protein sequences into families of related sequences and the development of protein models.
· Aligning similar proteins to examine evolutionary relationships.
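As a toy illustration of the first task in the list above, the following Python sketch scans the forward strand of a DNA string for open reading frames (ORFs): stretches that begin with the start codon ATG and run, codon by codon, to a stop codon. This is a deliberately naive approach with a hypothetical function name. Real gene finding must also deal with the reverse strand, introns, and statistical evidence.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start and end are 0-based string indices; end is exclusive and the
    reported span includes the stop codon. min_codons counts the codons
    before the stop codon, including the start codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    sequence = "CCATGAAATTTTAGGG"
    for start, end, frame in find_orfs(sequence):
        print(frame, sequence[start:end])
```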
The genomic code breaks down into thousands of individual genes. Genes tell cells to make proteins - individual molecules each one of which has a unique chemical mission. Proteins interact with each other to carry out thousands of functions, from digesting your dinner to synthesizing the small molecules that form a barrier between the inside of your cells and the outside world.
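The step from gene to protein can be sketched in a few lines of Python. The code below builds the standard genetic code table and translates a DNA coding sequence into one-letter amino acid symbols, stopping at the first stop codon. The function name is illustrative, and the sketch assumes the input starts in the correct reading frame and contains only A, C, G, and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGAAATTTTAG"))
```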
There are many ways in which computers can aid research into this. For example -
· Collecting and processing signals detected by laboratory equipment such as DNA sequencers, CCD devices, spectrophotometers, and just about any other device that can be connected to a computer via an analog-to-digital converter.
· Tracking samples and managing experiments in industrial-style laboratories
· Storing data in public databases and, more importantly, providing public access to those databases via sophisticated Web searches and deposition mechanisms.
· Extracting patterns and rules from large data collections and using these observed patterns to characterize and predict features in new data. This is the core of bioinformatics: developing tools that can recognize pattern matches and feature signatures within an otherwise inscrutable data set.
· Using automatic computational methods to assign functional meaning to uncharacterized data and to create informative links between different data collections. For example, using automated sequence comparison searches to identify potential genes in new genome data.
· Using known information about a system to simulate properties of the system.
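The pattern-recognition item in the list above can be illustrated with a small motif scanner. The Python sketch below turns a motif written with a few IUPAC nucleotide codes into a regular expression and reports every position where it occurs, including overlapping hits. The function name and the example motif (a TATA-box-like pattern) are chosen purely for illustration, and only a subset of the IUPAC codes is included.

```python
import re

# A subset of the IUPAC nucleotide codes, mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def find_motif(dna, motif):
    """Return the 0-based start positions of every occurrence of the motif.

    A zero-width lookahead is used so that overlapping matches are reported.
    """
    body = "".join(IUPAC[symbol] for symbol in motif.upper())
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(dna.upper())]


if __name__ == "__main__":
    print(find_motif("GGTATAAATGG", "TATAWAW"))
```

A pattern learned from known data can then be used to screen new sequences in the same way.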
Thus, with the help of computers, bioinformaticians work with data generated by the experimental biology community and by a growing number of data factory projects (e.g. genome sequencing projects). This data is mined to develop new hypotheses, new models of how biological systems function, and even rules and patterns which can be used to screen new data sets.
Let us take up two examples -
1. Information Storage -
As mentioned above, DNA is a molecule made from sugar, phosphate and bases called guanine (G), cytosine (C), adenine (A), and thymine (T). The various combinations of these four bases make up the DNA in plants, animals, bacteria, yeast and fungi. Because a sequence can be of any length, the number of possible combinations of these bases is practically unlimited. For example, you could have AAGCT, CCAGT, TACGGT, etc.
Scientists are currently trying to determine the entire DNA sequence of various living things. It has been predicted that by the end of this year the public sequence databases (such as GenBank and EMBL), which are already available on the Internet so that scientists everywhere can work on the information collectively, will have grown to 4 billion base pairs! These databases have been growing at exponential rates. This deluge of information has necessitated the careful storage, organization and indexing of sequence information. For example, in agricultural research, the effort to sequence the Arabidopsis genome will require determining a 120-megabase sequence. Computers can greatly assist in storing and managing all of this information.
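To give a rough feel for the storage problem, note that a four-letter alphabet needs only two bits per base. The Python sketch below packs a DNA string into bytes at four bases per byte and estimates the space needed for a 120-megabase genome. The encoding and function names are illustrative assumptions. Real databases also store annotations, ambiguous bases, and indexes, so their actual storage needs are larger.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(dna):
    """Pack a DNA string into bytes, two bits per base, four bases per byte."""
    packed = bytearray((len(dna) + 3) // 4)
    for i, base in enumerate(dna.upper()):
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from bytes produced by pack()."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def packed_size_bytes(n_bases):
    """Number of bytes needed to store n_bases at two bits per base."""
    return (n_bases + 3) // 4


if __name__ == "__main__":
    print(unpack(pack("GATTACA"), 7))
    # A 120-megabase genome needs about 30 million bytes in this encoding.
    print(packed_size_bytes(120_000_000))
```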
2. Data analysis
Results generated from DNA sequencing can be used to identify genes, regulatory sequences and other functional elements. Once the DNA sequence has been determined, the next step is to find out what these genes code for. Comparing one DNA sequence to another among closely related organisms assists these processes. Computers can help compare DNA sequences and look for homologies, or related strands of DNA. One can also compare DNA sequences to determine how closely two different species are related on an evolutionary scale.
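A very simple measure of how closely two sequences are related is percent identity: the share of positions at which two already-aligned sequences carry the same base. The Python sketch below computes it and ranks a few candidate sequences against a query. The function names, species labels, and sequences are made-up toy data, and the code assumes the sequences are pre-aligned and of equal length.

```python
def percent_identity(a, b):
    """Percentage of positions at which two aligned sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def rank_by_identity(query, candidates):
    """Return candidate names ordered from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda name: percent_identity(query, candidates[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical toy sequences, not real biological data.
    toy = {"species_a": "GATTACA", "species_b": "GACTACA", "species_c": "CCCTACA"}
    for name in rank_by_identity("GATTACA", toy):
        print(name, round(percent_identity("GATTACA", toy[name]), 1))
```

Real evolutionary comparisons first align the sequences and then apply models of how sequences change over time, but the underlying idea is the same.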
The goal of biology, in the era of the genome projects, is to develop a quantitative understanding of how living things are built from the genome that encodes them. Cracking the genome code is complex. At the very simplest level, we still have difficulty identifying unknown genes by computer analysis of genomic sequence. Beyond the single-molecule level, the challenges are immense. As data types beyond DNA, RNA, and protein sequence begin to undergo the same kind of explosion, simply managing, accessing, and presenting this data to users in an intelligible form is a critical task. Human-computer interaction specialists need to work closely with academic and clinical researchers in the biological sciences to manage such staggering amounts of data.
Biological data is very complex and interlinked. A spot on a DNA array, for instance, is connected not only to immediate information about its intensity, but to layers of information about genomic location, DNA sequence, structure, function, and more. Creating information systems that allow biologists to seamlessly follow these links without getting lost in a sea of information is also a huge opportunity for computer scientists. Finally, each gene in the genome isn't an independent entity. Multiple genes interact to form biochemical pathways, which in turn feed into other pathways. Putting genomic and biochemical data together into quantitative and predictive models of biochemistry and physiology is the work of bioinformaticians; computer scientists, mathematicians, and statisticians are also a vital part of this effort.
Biologists want to collect all of the information they can about every gene in every genome, and from that information construct models using computational techniques for analyzing how genes work together to build up and maintain a living body, whether it's a bacterium or a star quarterback!
Can they do it? Well, time will tell, and so will Bioinformatics!