Techniques for Improved Probabilistic Inference in Protein-Structure Determination via X-Ray Crystallography
University of Wisconsin-Madison Department of Computer Sciences
Over the past decade, the field of machine learning has seen a large increase in the use and study of probabilistic graphical models due to their ability to provide a compact representation of complex, multidimensional problems. Graphical models have applications in many areas, including natural language processing, computer vision, gene regulatory-network modeling, and medical diagnosis. Recently, the complexity of problems posed in many domains has stressed the ability of algorithms to reason in graphical models. New inference techniques are essential to meet the demands of these problems efficiently and accurately. One such application area is structural genomics. The task of determining protein structures has been central to the biological community, with recent years seeing significant investments in structural-genomic initiatives. X-ray crystallography, a molecular-imaging technique, is at the core of many of these initiatives, as it is the most popular method for determining protein structures. In creating a high-throughput crystallography pipeline, however, the final step of constructing an all-atom protein model from an electron-density map (a three-dimensional image of a molecule produced as an intermediate product of X-ray crystallography) remains a major bottleneck in need of computational methods. In difficult cases where the image is poor, this step can take months of manual effort by an experienced crystallographer. In this thesis, I develop new inference techniques that use probabilistic graphical models for the automated determination of protein structures in electron-density maps. The first, guided belief propagation using domain knowledge, prioritizes messages in the popular belief propagation algorithm for approximate inference.
Second, I propose Probabilistic Ensembles in ACMI (PEA), a framework for leveraging multiple, diverse executions of approximate inference to produce more accurate estimates of a variable's posterior probability distribution. Lastly, I present work on the use of statistical sampling (particle filtering) to produce physically feasible, all-atom protein structures. I demonstrate that my new methods not only improve the accuracy of the probabilistic model in terms of log-likelihood values, but also produce protein structures with higher completeness, lower RMS error, and better fit to the density map according to the R-free factor. My methods interpret difficult electron-density maps (3-4 Å resolution) better than prior inference approaches. Across a set of poor-quality density maps, my work outperforms all related work in the field by improving the state-of-the-art technique, ACMI. In addition, I show that the ability to incorporate biochemical domain knowledge is an important aspect of probabilistic modeling, creating more accurate modeling functions and influencing the algorithmic design of belief propagation. I also describe my contributions to the subtask of three-dimensional shape matching in electron-density maps, which use spherical-harmonic decompositions to quickly align two 3D objects over rotations. I show that spherical-harmonic decompositions, when applied to the task of matching small amino-acid fragments, are more efficient and accurate than previous work. I also extend spherical harmonics to two other shape-detection tasks: homologous structure detection in electron-density maps and feature generation for 3D shape classification of local density regions. While my work specifically targets the problem of protein-structure determination, the issues I pose generalize to computational problems seen in many areas of artificial intelligence.
Throughout this work, I refer to and develop techniques for solving problems in probabilistic inference, three-dimensional shape matching, and statistical sampling, among others.