Today I submitted my final report for Special Topics in Machine Learning — the last course of my master’s program. The feeling is hard to put into words: relief, pride, and a kind of quiet disbelief that it is actually done.
The Problem: Finding Cancer-Fighting Genes Without Burning the Budget
The core challenge came from a real and urgent domain: from 20,000 human genes, identify which ones influence T cell anti-cancer immunity. Each CRISPR gene-editing experiment costs thousands of dollars. The practical constraint was five experimental rounds, with up to 128 genes tested per round. The state-of-the-art baseline was BioDiscoveryAgent (Roohani et al., Stanford 2024), which used Claude 3.5 Sonnet for zero-shot biological reasoning and achieved a 21% improvement over traditional Bayesian Optimization. Impressive — but I saw a structural weakness.
The Problem I Spotted: Flat Planning
BioDiscoveryAgent generates exactly one plan per experimental round and commits the entire budget to that single direction. If the language model picks the wrong scientific hypothesis in round one, the entire round’s budget is wasted with no mechanism for course correction. There is no mathematical basis for reallocating resources — just a loosely worded prompt asking the model to “pick the best genes.” I called this flat planning, and it became the central motivation for my thesis.
The Solution: Bio-ToT Lite
My proposed system, Bio-ToT Lite, draws on two bodies of literature: Tree-of-Thoughts (Yao et al., 2023), which structures LLM reasoning into parallel exploratory branches, and multi-armed bandit theory (Auer et al., 2002), which provides a mathematically grounded framework for balancing exploration and exploitation under uncertainty.
Part 1 — Multi-path planning with K=3 parallel branches:
- Pathway-Based Branch: reasons through known signaling cascades — JAK-STAT, NF-κB, MAPK — and selects genes that are mechanistically plausible targets within those pathways
- Literature-Guided Branch: draws on published scientific findings and GWAS data to prioritize genes with existing experimental evidence
- Data-Driven Branch: analyzes patterns from prior experimental rounds to identify emerging statistical signals, independent of prior biological assumptions
Part 2 — UCB1 Budget Allocation uses the Upper Confidence Bound algorithm to dynamically distribute gene slots across branches each round:
UCB(i) = hit_rate(i) + sqrt(2 * ln(N) / n(i))
Branches with high cumulative hit rates receive more gene slots (exploitation). Branches that have been under-explored receive a bonus term that grows with neglect (exploration). Crucially, regret — the cost of not always picking the optimal branch — grows at O(log N), which is mathematically provable and optimal for this class of problem.
Experimental Results
In dry-run mode (deterministic simulation, B=64 genes per round, R=5 rounds), Bio-ToT Lite identified 11 hits across 320 total genes tested (1.20% hit rate), compared to BioDiscoveryAgent’s 8 hits (0.87%). That represents a +37.5% hit rate improvement under controlled, reproducible conditions.
In real LLM mode (gpt-4o-mini, B=32, R=3 rounds), both systems found 17 hits — but Bio-ToT Lite achieved this using only 73 genes, versus 91 for the baseline. That translates to a +24.7% improvement in precision. The pathway-based branch alone achieved a 30.3% hit rate (10 hits out of 33 genes tested). Total API cost for the full real-LLM experiment: approximately $0.023.
The Technical Implementation Details
The full implementation runs to approximately 1,700 lines of Python, built directly on top of BioDiscoveryAgent’s existing codebase to ensure fair comparison. Rather than replacing the original framework, Bio-ToT Lite wraps it with a branch orchestration layer that manages three simultaneous LLM-driven planning agents per round.
Each branch receives a distinct system prompt persona designed to enforce genuinely different reasoning strategies. The pathway-based persona is instructed to reason strictly through mechanistic biology — signaling cascades, receptor-ligand relationships, pathway membership. The literature-guided persona is prompted to prioritize genes with published GWAS hits and peer-reviewed functional evidence. The data-driven persona is explicitly told to ignore prior biological assumptions and instead find statistical patterns in the results matrix from previous rounds. The diversity of reasoning outputs across branches is not accidental — it is the entire point.
The primary dataset used throughout development was the IFNG screen from Schmidt et al. (2022), a genome-wide CRISPR knockout screen measuring interferon-gamma secretion in primary human T cells — directly relevant to cancer immunotherapy. The dataset contains approximately 20,000 genes with quantified hit scores, providing a realistic and scientifically meaningful benchmark. A secondary test on the Scharenberg22 dataset (1,061 genes) revealed a limitation: with a small gene pool and thin per-branch budgets, Bio-ToT Lite can over-partition and underperform — what I called “branch dilution.” Honest reporting of this failure mode was as important as reporting the successes.
The deterministic dry-run mode was one of the most important design decisions in the project. By replacing real LLM calls with a rule-based simulator that scores genes deterministically based on known hit labels, the dry-run mode allows perfect reproducibility across machines and researchers — a critical requirement for scientific validity. Anyone can reproduce the core dry-run results with a single command:
python -m src.experiments.exp4_biotot_comparison --dataset IFNG --num_genes 64 --steps 5 --dry-run --seed 42
What I Learned Along the Way
Three insights stand out as genuinely surprising rather than merely confirming what I expected.
Structural diversity beats random diversity. The improvement did not come from asking the LLM to “think differently” — it came from imposing genuinely different constraints at the reasoning architecture level. Different starting premises produced different gene selections, and that structural separation was the mechanism behind the hit rate gains.
Inconsistent results are honest results. On the Scharenberg22 dataset, Bio-ToT Lite underperformed the baseline. A weaker paper might have omitted this. Including it was the right choice — scientifically and ethically. A method that only reports favorable conditions is not a method, it is marketing.
Dry-run mode is not a shortcut — it is a scientific standard. Building a reproducible simulation layer that mirrors the real system’s logic without real API calls made the entire research process more rigorous, more cost-effective, and more shareable.
What’s Next: Future Research Directions
Bio-ToT Lite, as implemented, is a proof of concept. Several natural extensions could substantially increase its impact.
Dynamic K branch selection is the most immediate priority. The current system fixes K=3 branches per round. A smarter system would allow K to vary based on budget size, round number, and observed branch performance — automatically collapsing low-performing branches and spawning new ones when the gene pool has been sufficiently explored. This would address the branch dilution problem observed on small datasets.
Incorporating protein-protein interaction (PPI) networks would give the pathway-based branch access to a richer biological prior. Currently, pathway reasoning is driven by LLM knowledge alone. Connecting the branch agent to a structured PPI database (such as STRING or BioGRID) would allow it to reason about second-degree gene relationships — neighbors of known hits in the interaction graph — and likely improve hit rates on novel datasets where prior biological literature is sparse.
Applying the framework beyond the IFNG screen is a necessary next step for establishing generalizability. CRISPR screens exist for dozens of phenotypes — drug resistance, cell proliferation, viral entry, metabolic function. Testing Bio-ToT Lite across heterogeneous datasets with different signal densities and gene pool sizes would reveal which aspects of the framework are robust and which require dataset-specific tuning.
Scaling to pooled screens — where multiple phenotypes are measured simultaneously in a single experimental batch — represents the most ambitious extension. Pooled screens multiply the complexity of the gene selection problem and the budget allocation challenge exponentially, but they also represent the frontier of what CRISPR-based biology is capable of delivering at scale.
From Student to Researcher: A Personal Reflection
When I started this master’s program, I was a blogger who wrote about vegan skincare and plant-based nutrition on weekends and struggled through linear algebra problem sets on weeknights. The gap between those two worlds felt enormous. Now, on the other side of a 15,000-word Vietnamese-language technical report on LLM agent architecture and bandit theory, I am not sure that gap ever existed the way I imagined it did.
What connects writing about zinc oxide sunscreens and writing about UCB1 budget allocation is the same underlying impulse: wanting to understand why something works, not just that it does. Why does non-nano zinc oxide sit on skin instead of penetrating? Why does the UCB1 bonus term use a square root of a logarithm? The commitment to mechanism — to not accepting “it just works” as a final answer — is the same in both domains.
This thesis taught me that I can read dozens of research papers, implement a novel algorithm from scratch, conduct real experiments against a peer-reviewed baseline, and produce findings that are nuanced enough to acknowledge their own limitations. That is not a small thing to learn about yourself. The master’s degree is officially over. The research, I think, is just beginning.