There is a continuing need to accelerate and improve protein–ligand docking to enable the next generation of drug discovery. As drug-like chemical space grows into the billions of molecules, increasingly powerful computer systems are required to probe it, and software engineering challenges must be tackled to adapt existing docking programs to next-generation massively parallel processing systems. We demonstrate docking setup using a wrapper-code approach to optimize the DOCK program for large-scale computation, as well as docking analysis using variance and promiscuity as examples. Wrappers provide faster docking speeds than the naive multi-threading system MPI-DOCK, making future large-scale docking endeavors more feasible; in addition, eliminating highly variant or promiscuous compounds will make databases more useful.
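The variance- and promiscuity-based filtering described above can be sketched in a few lines. This is an illustrative example only: the compound names, scores, and cutoffs below are hypothetical stand-ins for real DOCK output, not values from the paper.

```python
import statistics

# Hypothetical docking scores (kcal/mol) for each compound against a panel
# of receptors; more negative means a stronger predicted binder.
SCORES = {
    "cmpd_A": [-9.1, -6.2, -5.8, -6.0],          # selective: strong vs. one target
    "cmpd_B": [-4.0, -12.5, -3.8, -11.9],        # highly variant scores
    "cmpd_C": [-10.2, -10.5, -10.1, -10.4, -10.3],  # promiscuous: "hits" everything
}

def filter_compounds(scores, var_cutoff=2.0, hit_cutoff=-8.0, max_hits=3):
    """Keep only compounds whose score variance is at most var_cutoff and
    that score below hit_cutoff against no more than max_hits receptors."""
    kept = {}
    for name, vals in scores.items():
        variance = statistics.pvariance(vals)
        hits = sum(1 for v in vals if v <= hit_cutoff)
        if variance <= var_cutoff and hits <= max_hits:
            kept[name] = vals
    return kept

filtered = filter_compounds(SCORES)  # drops the variant and promiscuous compounds
```

Here the highly variant compound and the promiscuous one are both removed, leaving only the selective binder in the database.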
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high-performance computing architectures and a low-complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. Based on the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
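The core assignment step of such a workflow, mapping each query protein to the cluster of its best-scoring alignment, can be sketched as follows. The hit tuples, COG identifiers, and e-value cutoff below are hypothetical placeholders, not the workflow's actual parameters; in practice the hits would come from parsed PSI-BLAST output.

```python
def assign_to_cogs(hits, evalue_cutoff=1e-5):
    """Assign each query protein to the COG of its lowest-e-value hit.

    hits: iterable of (query_id, cog_id, evalue) tuples, e.g. parsed from
    PSI-BLAST tabular output. Hits worse than evalue_cutoff are ignored,
    and each query keeps only its single best-scoring cluster.
    """
    best = {}  # query_id -> (cog_id, evalue) of best hit seen so far
    for query, cog, evalue in hits:
        if evalue > evalue_cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (cog, evalue)
    return {query: cog for query, (cog, _) in best.items()}

# Hypothetical parsed hits: (query, COG cluster, e-value)
hits = [
    ("prot1", "COG0001", 1e-30),
    ("prot1", "COG0002", 1e-8),   # weaker hit, ignored in favor of COG0001
    ("prot2", "COG0003", 0.5),    # fails the e-value cutoff
    ("prot3", "COG0004", 1e-6),
]
assignments = assign_to_cogs(hits)
```

Because each protein is scored independently, this step parallelizes trivially: the query set can be partitioned across compute nodes and the per-partition assignments merged afterward.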
We are focusing specifically on fast data analysis and retrieval in bioinformatics that will have a direct impact on the quality of human health and the environment. The exponential growth of data generated in biology research, from small atoms to big ecosystems, necessitates an increasingly large computational component to perform analyses. Novel DNA sequencing technologies and complementary high-throughput approaches, such as proteomics, genomics, metabolomics, and metagenomics, drive data-intensive bioinformatics. While individual research centers or universities could once provide the computing required for these applications, this is no longer the case. Today, only specialized national centers can deliver the level of computing resources required to meet the challenges posed by rapid data growth and the resulting computational demand. Consequently, we are developing massively parallel applications to analyze the growing flood of biological data and contribute to the rapid discovery of novel knowledge.
Based on recently published articles, the growth of genomic data has outpaced improvements in both storage technology and processing power, owing to revolutionary advances in next-generation sequencing technologies. With sequencing costs falling and throughput rising by many orders of magnitude, data volume is doubling every 9 months, producing exponential growth of genomic data in recent years. However, data analysis becomes increasingly difficult and can be prohibitive, as existing bioinformatics tools developed over the past decade target mainly desktops, workstations, and small clusters with limited capabilities. Improving the performance and scalability of such tools is critical to transforming ever-growing raw genomic data into biological knowledge containing invaluable information directly related to human health. This paper describes a new software application with optimization techniques that improve the scalability of one of the most widely used bioinformatics tools, PSI-BLAST, on advanced parallel architectures, pushing the envelope of biological data analysis. We show that our improvements allow near-linear scaling to tens of thousands of processing cores, up to the maximum non-capability job size on current petaflop supercomputers. This new tool increases the amount of genomic data that can be processed per hour by 5 orders of magnitude.
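Scaling an alignment tool to tens of thousands of cores hinges on distributing query sequences evenly, since per-query runtime roughly tracks sequence length. One common way to sketch this load balancing, shown below with a greedy longest-first heuristic, is to assign each sequence to the currently least-loaded rank. The sequence names, lengths, and rank count are hypothetical; the paper's actual partitioning scheme may differ.

```python
import heapq

def balance_queries(seq_lengths, n_ranks):
    """Distribute query sequences across parallel ranks so that the total
    sequence length per rank is roughly balanced (greedy LPT heuristic).

    seq_lengths: dict mapping sequence name -> length in residues.
    Returns dict mapping rank index -> list of assigned sequence names.
    """
    # Min-heap of (current load, rank index, assigned names); the unique
    # rank index breaks ties so the lists are never compared.
    bins = [(0, rank, []) for rank in range(n_ranks)]
    heapq.heapify(bins)
    # Place the longest sequences first, each onto the least-loaded rank.
    for name, length in sorted(seq_lengths.items(), key=lambda kv: -kv[1]):
        load, rank, names = heapq.heappop(bins)
        names.append(name)
        heapq.heappush(bins, (load + length, rank, names))
    return {rank: names for _, rank, names in bins}

# Hypothetical query set: five sequences split across two ranks.
parts = balance_queries({"a": 100, "b": 90, "c": 50, "d": 40, "e": 10}, 2)
```

Each rank then runs the alignment tool on its own batch with no cross-rank communication, which is what makes near-linear scaling achievable.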
- High Performance Computing Workflow for Functional Annotation
Larissa Stanberry, Yuan Liu, Bhanu Rekepalli, Paul Giblock, Roger Higdon, and William Broomall. XSEDE13 Conference
- Surviving the Life Sciences Data Deluge using Cray Supercomputers
Paul Giblock and Bhanu Rekepalli. CUG 2013 (Cray User Group) Conference
- Solving Life Sciences Data Deluge Problems Using Science Gateways
Paul Giblock, Bhanu Rekepalli, and Yuan Liu. NSF EPSCoR Workshop in Bioinformatics to Foster Collaborative Research
- Petascale Informatics Applications Development on XSEDE Supercomputers
Bhanu Rekepalli, Paul Giblock, Christopher Reardon, Mark Fahey, and Subhra Sarkar. 2012 Extreme Science and Engineering Discovery Environment (XSEDE12) Conference
Awarded “Best Poster”
- Bioinformatics: Desktop Applications to Peta-Scale Architectures with Web-Based Portals
Bhanu Rekepalli, Paul Giblock, and Christopher Reardon. 2012 International Symposium on Bioinformatics Research and Applications (ISBRA)
- HSPp-BLAST: Highly Scalable Parallel PSI-BLAST for Very Large-scale Sequence Searches
Bhanu Rekepalli, Aaron Vose, and Paul Giblock. 2012 Bioinformatics and Computational Biology (BICoB), ISCA 4th Int'l. Conference
- EPSCoR 2011
- Panel Display: A Hardware/Software Case Study
Paul Giblock. 2006 Tennessee Posters at the Capitol