What if you could ask a computer "what happens if I knock down gene X in this cell type?" and get a reliable answer without touching a pipette? That's the promise behind virtual cell modeling–using machine learning to predict how cells respond to genetic perturbations before you set foot in a lab.
I keep coming back to this idea because it sits at this perfect intersection of things I care about. There's the obvious ML angle–can we really learn cellular behavior from data? But there's also this deeper question about whether we're compressing biology down to computation, or just building really good memorization machines. I spent a weekend building one of these virtual cell pipelines, and the process taught me more about where the field is heading than any paper could.
Three technology curves have finally intersected in a way that feels like magic. In 2015, single-cell atlases had ~1M cells total, GPU deep learning was for imaging, not RNA-seq, and CRISPR screens measured bulk readouts. By 2025, public atlases contain >350M cells, GPT-style architectures dominate genomics papers, and perturb-seq measures on the order of 10³ cells × 10⁴ genes per target.
This convergence means a lone researcher with a GPU or two can prototype something that would have required a consortium in 2015. That's the kind of democratization that gets me excited–when the barrier to entry shifts from having institutional resources to simply knowing what pieces to glue together.
The actual problem we're trying to solve
Let me back up. When you knock down a gene in a cell, it's not like deleting a line of code. It's more like... imagine you're removing a specific word from every book in a library, and now you need to predict how the meaning of each book changes. Some books barely notice. Others become completely incoherent. And critically, the same word removal affects different books differently.
That's what we're asking these models to do. Take a cell (the book), remove a gene (the word), and predict the new expression profile (the altered meaning). The biology is straightforward in principle: genes regulate other genes in complex networks. Knock one down, and the effects ripple through these networks in theoretically predictable ways.
Here's what nobody tells you about biological modeling: the hard part isn't the biology or the ML. It's the impedance mismatch between how biologists think about cells (as living systems with purpose) and how ML sees them (as high-dimensional vectors). Bridging that gap is where the real work happens.
In practice, this means every modeling decision becomes a philosophical stance. Do you treat gene expression as counts? Ranks? Log-transformed values? Each choice embeds assumptions about what matters in cellular state. I went with ranks because–let's be honest–most genes in most cells are barely expressed. It's the relative ordering that captures cell identity. But another modeler might violently disagree, and they wouldn't be wrong.
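Concretely, here's a minimal sketch of what rank encoding looks like. The function name, vocabulary size, and shapes are all illustrative–this isn't any particular library's API, just the idea:

```python
import numpy as np

def rank_encode(expr, vocab_size=2048):
    """Convert one cell's raw counts into a rank-ordered gene list.

    expr: 1-D array of counts, one entry per gene.
    Returns the indices of the top `vocab_size` expressed genes,
    highest expression first. The downstream model sees relative
    ordering, not magnitude -- the bet is that ordering is what
    captures cell identity.
    """
    order = np.argsort(expr)[::-1]       # genes sorted by expression, descending
    expressed = order[expr[order] > 0]   # drop genes with zero counts entirely
    return expressed[:vocab_size]

# toy cell: gene 2 is most expressed, then gene 0; genes 1 and 3 are silent
cell = np.array([5.0, 0.0, 9.0, 0.0, 1.0])
print(rank_encode(cell))  # -> [2 0 4]
```

Note what gets thrown away: a gene at 500 counts and a gene at 50 counts can end up adjacent in the sequence. That's the assumption you're buying into.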
How everyone is actually doing this
The field has converged on a few architectural patterns, each with its own bet about how biology works:
The "biology is language" camp: These folks (myself included, initially) treat genes as tokens and cells as documents. Mask random genes, predict them back, call it a day. It works surprisingly well, which either means biology really is language-like, or our models are good at finding patterns in any high-dimensional data. I'm still not sure which.
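The training objective really is just BERT's, with gene tokens instead of word tokens. A toy sketch of the masking step–the 15% mask rate and all the names are illustrative, and the model itself is abstracted away:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 0  # reserved token id standing in for [MASK]

def mask_genes(tokens, mask_frac=0.15):
    """BERT-style corruption of one cell's gene-token sequence.

    tokens: 1-D int array of gene ids for one cell.
    Returns (corrupted, targets, idx): the corrupted sequence the
    model would see, the original ids at the masked positions it
    must recover, and where those positions are.
    """
    n_mask = max(1, int(len(tokens) * mask_frac))
    idx = rng.choice(len(tokens), size=n_mask, replace=False)
    corrupted = tokens.copy()
    corrupted[idx] = MASK_ID
    return corrupted, tokens[idx], idx

tokens = np.arange(1, 21)  # 20 gene ids for a toy cell
corrupted, targets, idx = mask_genes(tokens)
assert (corrupted[idx] == MASK_ID).all()   # masked positions are hidden
assert (targets == tokens[idx]).all()      # loss is computed only on these
```

Everything interesting happens in the transformer you'd train to fill those blanks back in; the objective itself is this simple.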
The "compress then perturb" camp: Train VAEs to learn a latent space where similar cells cluster together. Perturbations become simple arithmetic in this space. The elegance is seductive. If we can reduce a 20,000-dimensional cell to 50 meaningful numbers, surely we've learned something fundamental about biology?
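The whole bet fits in a few lines. Assuming you already have an encoder mapping cells into a latent space, "perturbation as arithmetic" is just this–a toy with synthetic latents standing in for a real VAE:

```python
import numpy as np

def perturbation_delta(z_control, z_perturbed):
    """Average latent-space shift induced by a perturbation."""
    return z_perturbed.mean(axis=0) - z_control.mean(axis=0)

def apply_perturbation(z_cells, delta):
    """Predict perturbed latents for unseen cells by adding the shift."""
    return z_cells + delta

# toy latents: the "perturbation" shifts latent dimension 0 by +2
rng = np.random.default_rng(1)
z_ctrl = rng.normal(size=(100, 8))
z_pert = z_ctrl + np.array([2.0] + [0.0] * 7)

delta = perturbation_delta(z_ctrl, z_pert)  # recovers ~[2, 0, ..., 0]
```

The seduction and the danger are the same thing: this assumes the perturbation's effect is a single direction that's constant across cell states, which is exactly the assumption that breaks for context-dependent responses.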
The "respect the networks" camp: These are the principled folks who insist on incorporating known gene regulatory networks into their architectures. Information flows along regulatory edges, predictions become interpretable. The catch? Our knowledge of these networks is spotty at best. It's like building a map of London using a tourist brochure.
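One common way to wire the network in is to mask attention so information only flows along known regulatory edges. A toy sketch–the edge list and indices are made up, and real implementations vary a lot in how strictly they enforce this:

```python
import numpy as np

def grn_attention_mask(edges, n_genes):
    """Boolean attention mask restricted to known regulatory edges.

    edges: list of (regulator, target) gene-index pairs.
    Returns an (n_genes, n_genes) mask where True = "may attend";
    row i, column j means gene i can read information from gene j.
    """
    mask = np.eye(n_genes, dtype=bool)  # every gene attends to itself
    for reg, tgt in edges:
        mask[tgt, reg] = True           # a target attends to its regulator
    return mask

# toy network: gene 0 regulates genes 1 and 2; nothing regulates gene 0
mask = grn_attention_mask([(0, 1), (0, 2)], n_genes=3)
```

The tourist-brochure problem shows up as all the `False` entries: every missing edge in your prior becomes a hard zero in the model, whether or not biology agrees.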
The "throw compute at it" camp: Diffusion models, 100-layer transformers, whatever's trending in ML. The philosophy here is that biology is complex, so our models should be too. Sometimes it works. Sometimes you're just fitting noise with extra steps.
Most successful approaches hedge their bets. My failed weekend experiments tried to combine masked gene modeling with a conditional VAE and some hand-wavy network constraints. It was elegant in theory, a mess in practice. But that's the thing about this field–most pipelines are held together with duct tape and hope.
The evaluation problem
Here's the dirty secret: we don't really know how to measure success. Sure, we have metrics–Pearson correlations, differential expression precision, Earth Mover's Distance if you're feeling fancy. But what correlation is "good enough" to trust a prediction?
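One concrete pitfall: correlating raw predicted profiles against observed ones flatters everybody, because perturbed cells still mostly look like controls. The usual fix is to correlate the *changes* instead. A sketch–the names are mine, not a standard API:

```python
import numpy as np

def delta_pearson(pred, obs, ctrl):
    """Pearson correlation of predicted vs. observed expression changes.

    pred, obs, ctrl: per-gene expression vectors (prediction, observed
    perturbed mean, control mean). Subtracting the control asks the
    harder question: did you predict the direction and size of the
    response, not just what a cell looks like?
    """
    dp, do = pred - ctrl, obs - ctrl
    return np.corrcoef(dp, do)[0, 1]

ctrl = np.ones(5)
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scaled = ctrl + 2 * (obs - ctrl)  # right direction, wrong magnitude
print(delta_pearson(scaled, obs, ctrl))  # ~1.0 -- Pearson ignores scale
```

Even this metric has blind spots (that last line shows it can't see magnitude errors at all), which is part of why no single number settles the "good enough" question.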
The field uses several evaluation strategies, each with massive caveats:
Held-out perturbations sound rigorous until you realize you're often testing on genes that are functionally similar to your training set. It's like training on French and testing on Spanish–the good performance might just mean you've learned Romance languages, not language itself.
Zero-shot prediction is the real test. Can you predict knockdowns for genes you've never seen? This is where most models fall apart.
Biological consistency checks are my favorite because they're so humbling. Does your model predict that knocking down a transcription factor affects its target genes? You'd be surprised how often the answer is no, even for models with great correlation scores.
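What makes this check so humbling is that it's almost embarrassingly simple to implement. A toy version–the indices and the comparison logic are illustrative, not a standard benchmark:

```python
import numpy as np

def tf_consistency(pred_delta, target_idx, other_idx):
    """Does a predicted TF knockdown hit the TF's known targets
    harder than a comparison set of unrelated genes?

    pred_delta: predicted per-gene expression change for the knockdown.
    Returns True if the mean absolute change over known targets
    exceeds that over the comparison genes.
    """
    return (np.abs(pred_delta[target_idx]).mean()
            > np.abs(pred_delta[other_idx]).mean())

# toy prediction: genes 0-1 (known targets) drop, genes 2-3 barely move
pred = np.array([-2.0, -1.5, 0.1, -0.05])
```

A model can post a great global correlation score and still fail this, usually because it learned "perturbed cells shift this general direction" rather than anything about the specific regulator.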
Where this is all heading
The optimist in me sees virtual cells following the AlphaFold trajectory. We're in the "early results are promising but not quite useful" phase. Give it 2-3 years and some key architectural insight, and suddenly every lab has access to reliable perturbation prediction.
The realist notes that cells are not proteins. Protein structure is physics–same sequence, same structure (mostly). Cellular responses are biology–context-dependent, stochastic, and influenced by factors we don't even measure. We might need fundamentally different approaches.
The pragmatist doesn't care. Even 70% accurate predictions are useful for hypothesis generation. If virtual cells can narrow down 10,000 possible experiments to the 100 most promising, they've already paid for themselves. Perfect prediction is a nice goal, but good-enough prediction that actually gets used is better.
Some learnings
Building this pipeline was supposed to be a quick reconnaissance mission—poke my head back into computational biology, see what's changed. Instead, it turned into a reminder of why I love this field. Where else can you combine:
- Cutting-edge ML (transformers eating everything)
- Fundamental biology (how do cells actually work?)
- Practical impact (failed experiments cost millions)
- Computational democracy (cheap GPU hours can compete with big labs)
The technical lessons were valuable–PyTorch has gotten really good, gradient checkpointing is magic, always log in float32. But the bigger lesson was about timing. The gap between 2018 RNA-seq and 2025 virtual cells is exactly where new careers get minted. It's not too late to jump in.
More importantly, the barrier to entry isn't compute or even knowledge—it's knowing which pieces fit together. The weekend I spent building a broken virtual cell model taught me more about the field's direction than months of reading papers. Sometimes you need to get your hands dirty to see where things are heading.
In the future
If you're thinking about building a virtual cell model:
- The interesting problems aren't in the architecture–they're in the data representation and evaluation.
- Biology knowledge helps more than ML knowledge. Understanding why genes co-express matters more than the latest attention mechanism.
- The community is small and helpful. The Discord channels are full of people from top labs who'll actually answer questions.
- Your first model will suck. Your fifth might be interesting. Nobody's tenth model is production-ready yet.
That's the real lesson from this whole exercise. Virtual cells aren't just about predicting perturbations. They're about compressing biological knowledge into computational form–making the implicit explicit, the qualitative quantitative. Whether that's possible remains an open question. But it's exactly the kind of question worth spending weekends on.