Bioinformatics

Bioinformatics has established itself as a field that bridges biological research and computational technology. Its growing importance across scientific fields, from evolutionary biology to personalized medicine, has made it a focal point of modern research.

The scope of bioinformatics is vast, encompassing areas such as genome mapping, sequence alignment, gene identification, protein structure prediction, and more. Bioinformatics tools have also played a crucial role in understanding diseases and in drug discovery, and they are increasingly used in areas such as environmental studies and agriculture.

One of the defining and challenging features of bioinformatics is its interdisciplinary nature. It lies at the intersection of multiple fields including biology, computer science and statistics. Biology provides the problems and the data: understanding gene function, studying evolutionary relationships, disease pathways, and more. Computer science offers the algorithms and software needed to handle and process the data. Statistics provides the framework for making sense of the data, such as determining whether an observation is significant or merely random.

The combined efforts and knowledge from each of these fields have made it possible to tackle the complexities inherent in biological systems and have led to many exciting discoveries and breakthroughs in biology.

A timeline of the field

The history of bioinformatics can be traced back to the 1960s and 70s, with the development of the first bioinformatic databases and software. The term "bioinformatics" was coined in the 1970s, but the field truly took off with the launch of the Human Genome Project in 1990. This ambitious project aimed to sequence and map the entire human genome, and it required computational tools to manage and analyze the massive amount of data it generated.

In the ensuing years, bioinformatics evolved rapidly. The development and application of advanced algorithms, machine learning techniques, and high-performance computing have greatly increased the power and scope of the field.

1953 - A landmark paper first described the double helix structure of the DNA molecule, the foundation of heredity and genetic information.

Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid

1953 ~ J. D. WATSON & F. H. C. CRICK

1965 - Margaret Dayhoff develops the first public protein sequence database, known as the Atlas of Protein Sequence and Structure. This was the model for GenBank and many other molecular databases.

Atlas of Protein Sequence and Structure

1966 ~ Richard V. Eck and Margaret O. Dayhoff. National Biomedical Research Foundation

1970 - Paulien Hogeweg and Ben Hesper first coined the term 'bioinformatics' in the early 1970s and defined it as 'the study of informatic processes in biotic systems'.

This was well before the term came to be associated predominantly with the computational study of genetics. The 'informatic' processes they were referring to include the kind of complex interactions and feedback loops that are now investigated across computational biology, genomics, and network biology.

1970-1981 - The foundational algorithms for sequence alignment are developed: the Needleman-Wunsch algorithm for global alignment (1970) and the Smith-Waterman algorithm for local alignment (1981).

The Needleman-Wunsch algorithm was one of the first applications of dynamic programming to bioinformatics. Together with Smith-Waterman, it forms the basis for faster, heuristic sequence alignment tools such as BLAST (Basic Local Alignment Search Tool) and FASTA, which are fundamental to many bioinformatics analyses.

A general method applicable to the search for similarities in the amino acid sequence of two proteins

1970 ~ S B Needleman, C D Wunsch

Identification of common molecular subsequences

1981 ~ T F Smith, M S Waterman
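To make the dynamic programming idea behind global alignment concrete, here is a minimal Python sketch in the spirit of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are my own illustrative assumptions, not taken from the original paper, and real tools use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring parameters are illustrative assumptions, not canonical values.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell extends the best of three smaller alignments.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("GATTACA", "GCATGCU")
    print(total)
    print(top)
    print(bottom)
```

The table costs time and memory proportional to the product of the two sequence lengths, which is why database-scale searches rely on heuristics rather than on exhaustive alignment.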

1982 - The GenBank nucleotide sequence database is established. GenBank is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations.

1983 - Invention of the Polymerase Chain Reaction: PCR revolutionized molecular biology by making it possible to generate millions of copies of a specific DNA sequence, which was a boon for data generation in bioinformatics.

1988 - The National Center for Biotechnology Information (NCBI) is created.

1990 - The Human Genome Project, an international research effort to sequence and map all the genes of the human genome and make them accessible for further biological study, is launched.

1995 - The first complete bacterial genome, Haemophilus influenzae, is sequenced. The significance of the H. influenzae sequence was not just that it presented for the first time a complete genome from a free-living organism, but that it proved that complete genomes could be sequenced rapidly and effectively at a low cost.

Whole-genome random sequencing and assembly of Haemophilus influenzae Rd

1995 ~ R D Fleischmann, M D Adams, O White, R A Clayton, E F Kirkness, A R Kerlavage, C J Bult, J F Tomb, B A Dougherty, J M Merrick, et al.

2003 - The Human Genome Project is declared complete, two years after its draft sequence was published in 2001. The finished sequence covered about 92% of the human genome. The remaining regions, many of them complex and hard to sequence, were not completed until 2022. That work relied on nearly two decades of improvements in sequencing technology and computational methods following the initial draft.

The complete sequence of a human genome

2022 ~ Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger, Nicolas Altemose, Lev Uralsky, and Adam M. Phillippy, et al.

2004 - The Ensembl genome database project, started in 1999, reports on its growing set of resources for genome research.

Ensembl 2004

2004

2005 - The International HapMap Project publishes its haplotype map, a database of common genetic variants in humans. This resource greatly aided studies of the genetic variants associated with human diseases and with responses to pharmaceuticals, environmental factors, and vaccines.

A haplotype map of the human genome

2005 ~ International HapMap Consortium

2005 - The introduction of pyrosequencing technology began the “next generation sequencing” (NGS) revolution. This drastically reduced the cost and time required to sequence DNA, leading to an explosion of genomic data.

2006 - The protein structure prediction software, Rosetta, gains acclaim as the most successful method for protein structure prediction in the Critical Assessment of Techniques for Protein Structure Prediction (CASP).

2006 - The National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) launch The Cancer Genome Atlas (TCGA) project. This initiative delivered a detailed genetic landscape of over 20 cancer types, significantly accelerating cancer research.

The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge

2015 ~ Katarzyna Tomczak, Patrycja Czerwińska, and Maciej Wiznerowicz

2012 - CRISPR-Cas9 gene-editing technology revolutionizes the field of genetics, and bioinformatics plays a vital role in designing guide RNAs and predicting off-target effects. CRISPR-Cas9 is a powerful tool that enables researchers to edit DNA with unprecedented precision and ease. It has been used in a wide range of applications, from gene therapy to agriculture.

A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity

2012 ~ Martin Jinek, Krzysztof Chylinski, Ines Fonfara, Michael Hauer, Jennifer A Doudna, Emmanuelle Charpentier

Genome editing. The new frontier of genome engineering with CRISPR-Cas9

2014 ~ Jennifer A Doudna, Emmanuelle Charpentier

Pioneers of revolutionary CRISPR gene editing win chemistry Nobel

2020

AlphaFold

For decades, researchers have been working on computational methods to predict protein structures from their amino acid sequences, a problem known as the "protein folding problem".

In 2018, AlphaFold, a machine learning-based system, made significant progress in this area. The system uses a combination of techniques, including deep convolutional neural networks and gradient descent, to predict the distances and angles between amino acids, which it then uses to model the protein's structure.

However, the true breakthrough came in CASP14 in 2020, when an improved version of the system, AlphaFold2, achieved unprecedented accuracy in predicting protein structures, comparable to experimental methods. This was hailed as a solution to the protein folding problem and has significant implications for bioinformatics, structural biology, and biomedical research.

2018 - AlphaFold, a program developed by DeepMind, uses machine learning to accurately predict protein structure from amino acid sequence, demonstrating the potential of AI in bioinformatics.

Highly accurate protein structure prediction with AlphaFold

2021 ~ Jumper, J., Evans, R., Pritzel, A. et al.

2020 - In the year of the COVID-19 pandemic, bioinformatics played a significant role in understanding the SARS-CoV-2 virus. Viral genome sequencing and comparative genomics enabled the identification of the virus and the tracking of its mutations as it spread globally. Bioinformatics also contributed to the development of vaccines, including the mRNA vaccines from Pfizer-BioNTech and Moderna.

The architecture of the SARS-CoV-2 RNA genome inside virion

2021 ~ Changchang Cao, Zhaokui Cai, Xia Xiao, Jian Rao, Juan Chen, Naijing Hu, Minnan Yang, Xiaorui Xing, Yongle Wang, Manman Li, Bing Zhou, Xiangxi Wang, Jianwei Wang & Yuanchao Xue

Pharmacogenomics in the era of personalised medicine

2022 ~ Cassandra White, Rodney Scott, Christine L Paul, and Stephen P Ackland

In recent years, the field of bioinformatics has continued to expand with advances in technologies like next-generation sequencing, machine learning, and cloud computing. Its evolution mirrors the progress of computational technology and the growing scale and complexity of biological data.

Switching gears, here are some recent areas of interest in bioinformatics. This selection is fairly subjective, and there are many other applications of bioinformatics that are equally interesting.

The omics revolution

Traditional genetic research often focuses on the role of individual genes or genetic variants. However, most biological traits and diseases are complex, involving many genes that interact with each other and the environment. As such, the study of single genes or variants may not provide a complete picture of the underlying biology.

Recent advances in high-throughput technologies have led to the rise of multi-omics studies, enabling a more holistic view of complex biological systems. These methods integrate data from genomics, transcriptomics, metabolomics, proteomics, and other -omics fields of biology to create more comprehensive models of cellular processes and disease mechanisms.

This means researchers can form a more complete picture of the biological processes and pathways underlying a specific trait or disease. For example, one can use genomic data to identify genetic variants associated with a disease, transcriptomic and proteomic data to investigate how these variants influence gene and protein expression, and metabolomic data to explore the downstream effects on cellular metabolism.

Network approaches to systems biology analysis of complex disease: integrative methods for multi-omics data

2017 ~ Jingwen Yan, Shannon L Risacher, Li Shen, Andrew J Saykin
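As a rough illustration of what integrating omics layers can look like in practice, here is a small Python sketch that joins per-gene records from three layers and flags genes where the layers agree. Everything in it is an assumption made for illustration: the gene names, variant identifiers, and fold-change values are made up, and the concordance rule is a toy heuristic, not a method from the paper above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GeneSummary:
    gene: str
    variant_count: int               # genomics: variants mapped to the gene
    rna_log2fc: Optional[float]      # transcriptomics: log2 fold change, if measured
    protein_log2fc: Optional[float]  # proteomics: log2 fold change, if measured


def integrate(variants: Dict[str, List[str]],
              rna: Dict[str, float],
              protein: Dict[str, float]) -> List[GeneSummary]:
    """Join the three layers on gene name, keeping genes seen in any layer."""
    genes = sorted(set(variants) | set(rna) | set(protein))
    return [
        GeneSummary(g, len(variants.get(g, [])), rna.get(g), protein.get(g))
        for g in genes
    ]


def concordant(s: GeneSummary, threshold: float = 1.0) -> bool:
    """Toy rule: a variant is present and RNA and protein shift the same way."""
    if s.variant_count == 0 or s.rna_log2fc is None or s.protein_log2fc is None:
        return False
    both_up = s.rna_log2fc >= threshold and s.protein_log2fc >= threshold
    both_down = s.rna_log2fc <= -threshold and s.protein_log2fc <= -threshold
    return both_up or both_down


if __name__ == "__main__":
    # Hypothetical toy inputs; none of these values come from real data.
    variants = {"GENE_A": ["var_1", "var_2"], "GENE_B": ["var_3"], "GENE_C": []}
    rna = {"GENE_A": 2.1, "GENE_B": 0.2, "GENE_C": -1.5}
    protein = {"GENE_A": 1.4, "GENE_C": -1.2}

    for summary in integrate(variants, rna, protein):
        print(summary, concordant(summary))
```

Real multi-omics pipelines deal with far messier problems, such as mismatched identifiers, batch effects, and missing measurements, but the core step is the same: bring the layers onto a shared key and ask where they tell a consistent story.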

Open source

What first drew me to the field was that it sat at the intersection of biology and computer science. But I was also drawn to its open-source nature. The fact that bioinformatics leans on open-source software and open data practices not only accelerates research but also fosters a collaborative and transparent scientific environment.

I felt this opened up the field in a manner that I could participate in on my own terms, from anywhere, as long as I took the time to understand. It's fair to say that the open-source nature of bioinformatics is a key factor in its rapid growth and development. It has enabled researchers to share their data and tools, which has greatly accelerated the pace of research.

I hope its openness and collaborative spirit present an opportunity for contributions, encouraging the global pursuit of knowledge and discovery in the face of biological complexities.