Big Data Analysis in Bioinformatics
Genes and proteins determine the structural and functional attributes of life on Earth. Identifying and analyzing them is a crucial task for biologists seeking to understand the origin, growth and evolution of life forms. In modern science, they are studied as digital data, a result of new technologies and the convergence of biology with information technology. Advances in sequencing technology now produce sequence data in volumes that have started to exceed what computer hardware can handle with straightforward analysis approaches. Big data research in the biological sciences focuses mainly on the large volume of genomic and proteomic data of organisms. Data from sequencing projects are the most obvious instance of big data in bioinformatics, particularly given the progress in next-generation sequencing (NGS) and single-cell capture technology. Other examples include electronic health records, which contain phenotypic, diagnostic and treatment information, and medical imaging data, such as those produced by magnetic resonance imaging (MRI), positron emission tomography (PET) and ultrasound. Emerging big data relevant to biomedical research also include data from social networks and wearable devices. Analyzing and interpreting big data requires more efficient algorithms, and various advanced bioinformatics tools are being developed for this purpose.
Because bioinformatics is interdisciplinary, advances in both the biological sciences and information technology have had a deep impact on it. Understanding the role of bioinformatics helps in developing, selecting and using more accurate tools to tackle the big data generated by high-throughput experiments. This article describes the management and analysis of big data in bioinformatics to provide a better understanding of this new area of research.
Big Data and Bioinformatics
Advances in next-generation sequencing technologies are generating a huge amount of biological data. The continuous increase in the volume of biological data sets has brought a new concept into bioinformatics, known as 'big data'. Big data have three basic features: volume, velocity and variety. Volume denotes the quantity of data, which many factors drive upward; it can amount to hundreds of terabytes or even petabytes of information. Velocity describes the speed at which new data are generated and moved around, which makes the data difficult to handle. Variety refers to the types of data, which come in many formats such as text, images, audio, video, log files, emails, financial transactions, simulations and 3D models. The European Bioinformatics Institute (EBI) is part of the European Molecular Biology Laboratory (EMBL) and is situated on the Wellcome Genome Campus in Hinxton, Cambridge, UK. It hosts one of the world's largest collections of biological data and currently stores 20 petabytes (1 petabyte is 10^15 bytes) of life sciences data and back-ups about genes, proteins and small molecules. It is important to understand the potential of 'big data' in the life sciences, which includes manipulating complex data to make new discoveries that benefit humankind. To process its data, EBI has installed a cluster, the Hinxton data centre cluster, with 17,000 cores and 74 terabytes of RAM. Its computing power is increased almost every month. EBI is not the only organization involved in massive biological data storage. Many other organizations store and process huge collections of biological databases and distribute them around the world, such as the National Center for Biotechnology Information (NCBI), USA, and the National Institute of Genetics, Japan. Among the largest data collections are The Cancer Genome Atlas (TCGA) and The Encyclopedia of DNA Elements (ENCODE).
TCGA is a project, begun in 2005, to catalogue the genetic changes responsible for cancer using DNA sequencing and bioinformatics. ENCODE is a public research project launched by the US National Human Genome Research Institute in September 2003; it has produced more than 2600 nucleotide datasets from various in vitro experiments. The continuing growth in the volume of big data makes storing and analyzing them immensely difficult.
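To put the storage and cluster figures quoted above in proportion, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes decimal (SI) units, and the variable names are chosen here for readability rather than taken from any EBI documentation.

```python
# Decimal (SI) units are assumed: 1 TB = 10**12 bytes, 1 PB = 10**15 bytes.
TB = 10**12
PB = 10**15

stored_bytes = 20 * PB        # EBI storage figure quoted in the text
cluster_ram_bytes = 74 * TB   # RAM of the Hinxton cluster quoted in the text
cluster_cores = 17_000        # core count quoted in the text

# Average memory available per core, in gigabytes (10**9 bytes).
ram_per_core_gb = cluster_ram_bytes / cluster_cores / 10**9

# How many times larger the stored data are than the cluster's total RAM.
data_to_ram_ratio = stored_bytes / cluster_ram_bytes

print(f"Stored data: {stored_bytes:.1e} bytes")
print(f"RAM per core: {ram_per_core_gb:.2f} GB")
print(f"Stored data is about {data_to_ram_ratio:.0f} times the cluster RAM")
```

Under these assumptions the stored data are a few hundred times larger than the total memory of the cluster, so the data cannot be held in memory at once and must be processed in pieces.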
Types of Big Data in Bioinformatics
Five main types of data used in bioinformatics research are very large in size: DNA/RNA/protein sequence or structure data, gene expression data, protein-protein interaction (PPI) data, pathway data and gene ontology (GO) data. Other kinds of network data are also used in many research activities, including disease diagnosis. The genomic and proteomic data of an organism carry its underlying biological information and are analyzed to find correlations with its morphological features and possible changes. Various publicly available bioinformatics databases store these data for research purposes. The Protein Data Bank (PDB) archive is itself big data: it has become laborious to perform large-scale structural calculations such as geometric queries or structural comparisons, to transmit and visualize the 3D structures of biological macromolecules, and to store them efficiently.
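As an illustration of what a geometric query on structure data involves, the following minimal Python sketch finds all atoms within a given radius of a point. The atom records and coordinates are invented for the example and do not come from a real PDB entry, and the function name atoms_within is hypothetical.

```python
import math

# Hypothetical atom records: (atom name, x, y, z) in angstroms.
# These coordinates are invented and do not come from a real PDB entry.
atoms = [
    ("N", 0.0, 0.0, 0.0),
    ("CA", 1.5, 0.0, 0.0),
    ("C", 2.0, 1.4, 0.0),
    ("O", 10.0, 10.0, 10.0),
]

def atoms_within(atoms, center, radius):
    """Return the names of atoms whose distance from center is at most radius."""
    return [name for name, x, y, z in atoms
            if math.dist((x, y, z), center) <= radius]

nearby = atoms_within(atoms, center=(0.0, 0.0, 0.0), radius=3.0)
print(nearby)
```

This linear scan checks every atom for every query. That is acceptable for one small structure, but repeated over an entire archive it becomes expensive, which is one reason large-scale structural calculations are tedious.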
Management and Analysis of Big Data
A number of techniques have been developed to handle biological data that are continuously increasing in volume. Managing and analyzing big data differs from conventional data handling because of the sustained increase in the amount of data, which conventional tools and techniques were not designed for. The role of big data techniques in bioinformatics applications is to provide data repositories, computing infrastructure and efficient data manipulation tools for investigators to gather and analyze biological information. Hadoop, with its MapReduce processing model, is among the most widely used frameworks in biological science research.
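The MapReduce processing model can be illustrated with a small single-process sketch in Python that counts k-mers (substrings of length k) in sequencing reads. This is not Hadoop code: the function names, the toy reads and the choice of k-mer counting as the task are assumptions made for illustration. In a real deployment the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical toy reads; real inputs would be sequencing files split across nodes.
reads = ["ACGTAC", "GTACGT"]
k = 3

mapped = [pair for read in reads for pair in map_phase(read, k)]
kmer_counts = reduce_phase(shuffle_phase(mapped))
print(kmer_counts)
```

Because each read is mapped independently and each key is reduced independently, both steps can be distributed across many machines, which is what makes the model suitable for large sequence collections.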