My professional publications

Docking optimization, variance and promiscuity for large-scale drug-like chemical space using high performance computing architectures

Richard E. Trager, , Sherwin Soltani, Amit A. Upadhyay, Bhanu Rekapalli, and Yuri K. Peterson
Drug Discovery Today

There is a continuing need to hasten and improve protein–ligand docking to facilitate the next generation of drug discovery. As the drug-like chemical space reaches into the billions of molecules, increasingly powerful computer systems are required to probe, as well as tackle, the software engineering challenges needed to adapt existing docking programs to use next-generation massively parallel processing systems. We demonstrate docking setup using the wrapper code approach to optimize the DOCK program for large-scale computation as well as docking analysis using variance and promiscuity as examples. Wrappers provide faster docking speeds when compared with the naive multi-threading system MPI-DOCK, making future endeavors in large-scale docking more feasible; in addition, eliminating highly variant or promiscuous compounds will make databases more useful.

Optimizing High Performance Computing Workflow for Protein Functional Annotation

Larissa Stanberry, Bhanu Rekepalli, Yuan Liu, , Roger Higdon, Elizabeth Montague, William Broomall, Natali Kolker, and Eugene Kolker
Concurrency and Computation: Practice and Experience

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

PoPLAR: The Portal for Petascale Lifescience Applications and Research

Bhanu Rekapalli, , and Christopher Reardon
BMC Bioinformatics

We are focusing specifically on fast data analysis and retrieval in bioinformatics that will have a direct impact on the quality of human health and the environment. The exponential growth of data generated in biology research, from small atoms to big ecosystems, necessitates an increasingly large computational component to perform analyses. Novel DNA sequencing technologies and complementary high-throughput approaches--such as proteomics, genomics, metabolomics, and meta-genomics--drive data-intensive bioinformatics. While individual research centers or universities could once provide for these applications, this is no longer the case. Today, only specialized national centers can deliver the level of computing resources required to meet the challenges posed by rapid data growth and the resulting computational demand. Consequently, we are developing massively parallel applications to analyze the growing flood of biological data and contribute to the rapid discovery of novel knowledge.

HSPp-BLAST: Highly Scalable Parallel PSI-BLAST for Very Large-scale Sequence Searches

Bhanu Rekepalli, , and
2012 Bioinformatics and Computational Biology (BICoB), ISCA 4th Int'l. Conference

According to recently published articles, the growth of genomic data has outpaced performance improvements in both storage technologies and processing power, owing to the revolutionary advances of next-generation sequencing technologies. Because these technologies have brought down costs and increased throughput by many orders of magnitude, data is doubling every 9 months, resulting in the exponential growth of genomic data in recent years. However, data analysis becomes increasingly difficult and can be prohibitive, as existing bioinformatics tools developed in the past decade focus mainly on desktops, workstations and small clusters that have limited capabilities. Improving the performance and scalability of such tools is critical to transforming ever-growing raw genomic data into biological knowledge containing invaluable information directly related to human health. This paper describes a new software application that applies optimization techniques to improve the scalability of PSI-BLAST, one of the most widely used bioinformatics tools, on advanced parallel architectures, pushing the envelope of biological data analysis. We show that our improvements allow near-linear scaling to tens of thousands of processing cores, up to the maximum non-capability size on current petaflop supercomputers. This new tool increases the amount of genomics data that can be processed per hour by 5 orders of magnitude.

Download: hspp-blast.pdf

My presentations