Professor of Applied Mathematics Charles "Chip" Lawrence and Luis Carvalho GS have developed a new collection of statistical tools that produce more accurate predictions about large sets of data. The two published an article about their work in the March 4 issue of the Proceedings of the National Academy of Sciences.
Lawrence first realized the shortcomings of the statistical analysis traditionally used to deal with massive amounts of data when he was studying the structures of different types of RNA, a nucleic acid found in cells that plays a key role in making proteins.
RNA comes in many different forms. One form, messenger RNA, can take on many different shapes, but other forms, such as ribosomal RNA and transfer RNA, are stable structures, so they usually come in only one shape. Lawrence had been trying to accurately predict the structure of messenger RNA molecules.
There are trillions of different shapes that messenger RNA molecules can take on. "Even if I had a huge computer, I would not be able to compute that," Lawrence said.
Statistics presents a way of dealing with huge quantities of data without having to sift through it all, he said. "We can get a good estimate by selecting a random sample. ... If you can get a good sample, statistics can be powerful," he added.
The pivotal moment in Lawrence's research came when he noticed a discrepancy between a prediction he made and the results reached through traditional statistical analysis. In this experiment, Lawrence was looking at a certain structural RNA that is not supposed to change shape, so he predicted a nearly 100 percent chance that the RNA would not change shape. But when he ran a traditional statistical analysis on the model, the results were quite different: the probability that the RNA would not change shape was very low.
Lawrence said this seemed "provocative" to him. "That's not the right way to predict," he added.
The reason for these results, as Lawrence and Carvalho explained in the PNAS article, was that the traditional algorithm does not take the whole "ensemble" of RNA into account. The two then looked to "find a (statistical) structure such that it, and a bunch like it, would describe the whole space" of RNA shape possibilities, Lawrence said.
Their method considers the probabilities of all possible RNA molecules occurring as an ensemble and "take(s) their probabilities and mathematically find(s) the one in (the) middle," which is called the centroid, Lawrence said. This produces a more accurate estimate than the traditional statistical method of analysis.
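To make the idea of "the one in the middle" concrete, here is a minimal Python sketch. It is an illustration only, not the authors' code: the three-structure ensemble, its probabilities, and the function names are all made up, and each "structure" is simplified to a tuple of 0/1 flags (for instance, whether a given base pair is present). Under those assumptions, the structure that lies closest on average to the whole ensemble can be found by keeping every flag that is set in more than half of the ensemble by probability, and it need not be the single most probable structure.

```python
# Hypothetical toy ensemble: structure (tuple of 0/1 flags) -> probability.
# These values are invented for illustration and do not come from the article.
ensemble = {
    (1, 1, 0, 0, 0): 0.4,  # the single most probable structure
    (0, 0, 1, 1, 0): 0.3,
    (0, 0, 1, 1, 1): 0.3,
}


def hamming(a, b):
    """Number of positions at which two structures differ."""
    return sum(x != y for x, y in zip(a, b))


def most_probable(ens):
    """Traditional choice: the one structure with the highest probability."""
    return max(ens, key=ens.get)


def centroid(ens):
    """Keep each flag whose total probability across the ensemble exceeds 1/2.

    For unconstrained 0/1 tuples this minimizes the expected Hamming
    distance to the ensemble; real RNA structures carry extra constraints
    that this sketch ignores.
    """
    length = len(next(iter(ens)))
    marginals = [sum(p for s, p in ens.items() if s[i]) for i in range(length)]
    return tuple(1 if m > 0.5 else 0 for m in marginals)


def expected_distance(candidate, ens):
    """Probability-weighted average distance from candidate to the ensemble."""
    return sum(p * hamming(candidate, s) for s, p in ens.items())


if __name__ == "__main__":
    for name, pick in (("most probable", most_probable(ensemble)),
                       ("centroid", centroid(ensemble))):
        print(name, pick, round(expected_distance(pick, ensemble), 2))
```

In this made-up ensemble the most probable structure accounts for only 40 percent of the probability and sits far from the other 60 percent, while the centroid sits among the bulk of the ensemble and has the smaller expected distance.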
Professor of Applied Mathematics Stuart Geman said Lawrence and Carvalho's discovery is important because it is better suited to certain biological problems than other statistical models are.
"Chip invented a different kind of estimator for some problems because the other one isn't a happy fit," he said. "The optimal cell phones are two different things for you and me. The optimal estimators are two different things for different biological problems."
Geman added that Lawrence's new method is important because current methods require an unfeasible level of computation. "Oftentimes, the only available and computable estimator is the maximum-likelihood estimator," Geman said.
Lawrence and Carvalho's finding can be applied across a broad range of fields. The new method of statistical analysis can be used "any time the problem is discrete (dealing with integers), and there are a lot of combinations," Lawrence said.
"Right now we're working on an even broader characterization of the centroid estimator which will make its applicability easier and its benefits more evident," Carvalho wrote in an e-mail.
"If it were all to work out well, in 10 years, anyone who needs to make an inference in discrete high-dimensional situations would change their view," Lawrence said.




