When I told people I do computational biology, they would nearly always remark on the ‘computational’ part, either saying that it sounds way beyond their comprehension or asking what on Earth it means. I would inwardly groan, because isn’t it obvious what it means? More recently I realized why ‘computational biology’ is so confusing to people, both non-scientists and scientists in other areas: it isn’t a real field.

Non-biologists are aware that tons of information is being generated in biology. They know about genome sequencing and how that is part of biology, and that we use computers to analyze that data. Unlike many of the Old Guard of biology, they aren’t still bewildered by the fact that work is now primarily done on computers. So they wonder what strange thing we’re doing on top of that to warrant the ‘computational’ label. Studying digital life forms?

Some people do wetlab experiments without doing the extensive analysis afterwards; they’re biologists who do wetlab experiments. Some people do all their research on computers, running simulations and analyzing data; they’re biologists because they study biological organisms.

How we use biological data

I share my opinion on that label to put the current state of work in biology into perspective, and then to predict where it’s headed.

The shift in the relative importance of experiments, data, and analysis over time is familiar: early biologists designed and conducted experiments to test properties of life, and the results were easily interpreted. More recently, especially in the last couple of decades, data such as genomic sequences have been generated primarily for use in future studies. As far as I can tell, generating data in this way is seen as more of an incomplete ‘first step’ than were previous forms of observation, such as the surveys of early naturalists.

Consequently, much of the current work in biology consists of using available data to discover patterns, fit models, and otherwise understand biological systems as best we can. However, this strategy reaches its limits fairly quickly. Think about cancer mutations: some important genes can be implicated because of how frequently they are found to be mutated. Many other genes play a role, but we will never accumulate enough clinical data to identify most of them by the same method. Estimating the relative impact of different mutations within a gene, even with holistic supervised models, already seems to be reaching an accuracy plateau, with the predictions capturing only around 30% of the variance in impact values.
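
To make ‘variance captured’ concrete, here is a minimal sketch of one common way to compute it, the coefficient of determination (R²). That this is the exact metric behind the figure above is my assumption; the function name and the toy numbers are purely illustrative and are not real mutation data.

```python
from statistics import fmean


def variance_explained(observed, predicted):
    """Fraction of the variance in `observed` captured by `predicted`.

    Computed as R^2 = 1 - SS_res / SS_tot (an assumed choice of metric).
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = fmean(observed)
    # Total spread of the measured values around their mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    # Spread left over after subtracting the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("observed values have no variance")
    return 1 - ss_res / ss_tot


# Hypothetical impact scores, invented for illustration only.
toy_observed = [1.0, 2.0, 3.0, 4.0]
toy_mean_only = [2.5, 2.5, 2.5, 2.5]  # a "model" that always predicts the mean

print(variance_explained(toy_observed, toy_observed))   # perfect predictions
print(variance_explained(toy_observed, toy_mean_only))  # no better than the mean
```

On this scale, a model that always guesses the average scores 0 and a perfect model scores 1, so a plateau near 0.3 leaves most of the variation between mutations unexplained.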

Sure, we could make a great deal of medical progress by treating only the effects of the main driver mutations, but the complexities of the cell and individual variation mean that this isn’t as simple as thoroughly studying one gene at a time.

We are increasingly making use of models to represent biological knowledge, and those models are becoming less formally defined. What use is a model, then, if it cannot be interpreted? Simulation.

Representing knowledge through simulations

Science historian George Dyson notes that the way we use and interact with technology is shifting from binary, logical, and discrete to analog and continuous:

“The next revolution is the assembly of digital components into analog computers, similar to the way analog components were assembled into digital computers in the aftermath of World War II.”

We should be, and are, following this trend with biological models. We can use a network of proteins to model protein interactions, but this is a simplification that will only answer basic questions.
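
As a rough illustration of what such a discrete model looks like, here is a minimal sketch of a protein interaction network stored as an undirected graph. The protein names (P1 to P4) and the helper functions are hypothetical placeholders, not real interaction data or any particular library’s API.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected graph: protein -> set of direct interaction partners."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def partners(network, protein):
    """Direct interaction partners of a protein (empty set if unknown)."""
    return network.get(protein, set())


def shared_partners(network, first, second):
    """Proteins that interact with both `first` and `second`."""
    return partners(network, first) & partners(network, second)


# Placeholder interactions, invented for illustration only.
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
net = build_network(toy_interactions)

print(sorted(partners(net, "P3")))               # which proteins touch P3?
print(sorted(shared_partners(net, "P1", "P2")))  # what do P1 and P2 have in common?
```

The graph can say whether two proteins interact, but each edge is a yes-or-no fact: nothing in it captures how strongly, when, or under which conditions the interaction happens.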

We are starting to use analog models thanks to machine learning and physical simulations such as molecular dynamics, but overall we are still making fairly formal and manual use of biological data. Further development of artificial intelligence will allow models to be developed more automatically, even allowing AI to collect data from an experimental system as it sees fit.

I predict that the broad process of biological research will center around huge models that are used to perform simulations. Engineers will build the AI agents, and the AI agents will build models that best simulate real measurements. Biologists will find new techniques for observing biological systems in ways that are most useful for human interests. They will guide the AI agents to understand the observed data and to develop the models in ways that answer important new questions. They will oversee and interpret the information produced by the simulations.

Interpreting big analog models is difficult, but formal models with nice intuitive interpretations, which can predict a few basic things about an analog biological system, will eventually have little use in research. Ironically, the analog and imperfect representations will be most useful for computers, while discrete and formal representations will be useful mainly to teach concepts to students.

As our approach to research changes steadily, eventually becoming something that would be unrecognizable today, we should remember that we’re still doing biology.