GPT-5.5 as Scientific Co-Researcher: Ramsey Proofs, Gene Expression, and Cyber Defense
GPT-5.5 has moved past literature review and data formatting into tool-using scientific workflows. OpenAI says an internal GPT-5.5 variant helped discover an off-diagonal Ramsey-number proof later verified in Lean, that GPT-5.5 scored 80.5% on BixBench and 81.8% on CyberGym, and that its biological/chemical and cybersecurity capabilities are treated as High under the Preparedness Framework [1][2].
GPT-5.5 Research and Cybersecurity Performance
- BixBench: 80.5%, vs 74.0% for GPT-5.4 (real-world biomedical data analysis) [1]
- CyberGym: 81.8%, vs Claude Opus 4.7 at 73.1% in OpenAI’s table [1]
- GeneBench (Pro): 33.2%, vs 22.9% for Claude Opus 4.7 in OpenAI’s table [1]
- Preparedness Framework: bio/chemical and cyber capabilities treated as High [1][2]
The Ramsey Number Proof: AI at the Boundary of Human Knowledge
Advanced mathematical reasoning has historically been one of the hardest categories for large language models because free-form proof sketches are not enough; the work must survive formal verification. GPT-5.5 challenges that limitation in a concrete way. OpenAI says an internal version of GPT-5.5 with a custom harness helped discover a new proof about off-diagonal Ramsey numbers [1].
Ramsey theory studies the conditions under which ordered structure must emerge within sufficiently large combinatorial systems. OpenAI describes the result as a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers that was later verified in Lean. That distinction matters: the claim is not simply that a model wrote a persuasive explanation, but that the mathematical argument passed machine-checkable verification [1].
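OpenAI does not name the exact statement that was proved, so purely for orientation: a classic asymptotic fact about off-diagonal Ramsey numbers, established by Kim together with Ajtai, Komlós, and Szemerédi, has the shape

```latex
% Illustrative only: a well-known off-diagonal asymptotic,
% not necessarily the statement OpenAI's model helped prove.
R(3, t) = \Theta\!\left(\frac{t^2}{\log t}\right) \quad \text{as } t \to \infty
```

Results of this kind pin down how fast a Ramsey number grows, which is exactly the sort of claim that benefits from formal verification rather than an informal sketch.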
“[The model] autonomously gathers evidence, tests experimental biological assumptions, and draws novel scientific conclusions.”
Professor Derya Unutmaz, Jackson Laboratory immunologist, describing GPT-5.5 Pro on 28,000-gene datasets [1]
The philosophical implication extends beyond a single result. A model that can contribute to Lean-verified mathematics is no longer just a summarization tool for researchers; it becomes part of an iterative research loop where conjecture, search, implementation, and formal checking reinforce each other [1].
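To make “verified in Lean” concrete: Lean only accepts a theorem when every step type-checks against its kernel, so there is no room for a persuasive but flawed argument. A minimal Lean 4 example, unrelated to the Ramsey result itself, looks like:

```lean
-- A trivial machine-checked statement. The kernel accepts this
-- only because `Nat.add_comm` is itself a verified proof term;
-- an unjustified step would fail to compile.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The actual Ramsey formalization is vastly larger, but it passes exactly the same kind of kernel check.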
GPT-5.5 Research Performance — Science and Mathematics
| Benchmark | Domain | GPT-5.5 Base | GPT-5.5 Pro | GPT-5.4 |
|---|---|---|---|---|
| BixBench | Bioinformatics & biomedical analysis | 80.5% [1] | — | 74.0% |
| GeneBench | Genetics & quantitative biology | 25.0% [3] | 33.2% | — |
| FrontierMath Tier 4 | Advanced mathematical logic | 35.4% [4] | 39.6% | — |
| Expert-SWE | Long-horizon coding (20-hr tasks) | 73.1% [3] | — | — |
Bioinformatics and the 28,000-Gene Frontier
In the biological sciences, GPT-5.5 is being validated on real research rather than constructed evaluation scenarios. Immunologists at the Jackson Laboratory, including Professor Derya Unutmaz, are using the model to interpret gene-expression datasets spanning nearly 28,000 individual genes, a scale and complexity that would otherwise require teams of specialists working for extended periods. The model autonomously gathers supporting evidence, tests experimental biological assumptions against the available data, and draws scientific conclusions that Unutmaz reports are genuinely novel [2].
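Neither OpenAI nor the Jackson Laboratory has published the analysis code, but the core move in a screen like the one described, comparing expression across conditions gene by gene and ranking the differences, can be sketched in plain Python. The gene names and values below are invented for illustration:

```python
import math

# Invented toy expression values for a handful of genes;
# a real dataset would hold ~28,000 genes and many replicates.
expression = {
    "IL6":   ([5.0, 6.0, 5.5], [40.0, 38.0, 42.0]),      # (control, treated)
    "TNF":   ([4.0, 4.5, 5.0], [30.0, 28.0, 33.0]),
    "GAPDH": ([100.0, 98.0, 101.0], [99.0, 102.0, 100.0]),  # housekeeping gene
}

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def welch_t(a, b):
    """Welch's t-statistic: mean difference over the standard error of the difference."""
    se = math.sqrt(var(a) / len(a) + var(b) / len(b))
    return (mean(b) - mean(a)) / se

def screen(data):
    """Rank genes by |t|, reporting the log2 fold change for each."""
    rows = []
    for gene, (ctrl, treat) in data.items():
        lfc = math.log2(mean(treat) / mean(ctrl))
        rows.append((gene, lfc, welch_t(ctrl, treat)))
    return sorted(rows, key=lambda r: abs(r[2]), reverse=True)

for gene, lfc, t in screen(expression):
    print(f"{gene:6s} log2FC={lfc:+.2f} t={t:+.1f}")
```

A real pipeline would add multiple-testing correction and pathway analysis on top; the point here is only the shape of the per-gene comparison.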
On BixBench, a standardized evaluation for real-world bioinformatics and biomedical data analysis, GPT-5.5 scored 80.5%, a 6.5-point improvement over GPT-5.4’s 74.0%. On GeneBench, a multi-stage evaluation targeting complex genetics and quantitative biology workflows, the base model scored 25.0% and the Pro variant reached 33.2%. The GeneBench scores look low in isolation, but they reflect the genuine difficulty of multi-stage quantitative biology reasoning: human experts’ completion rates on the same tasks are themselves far from perfect [1][3].
OpenAI Preparedness Framework — GPT-5.5 Risk Profile
| Framework Element | GPT-5.5 Status | Implication |
|---|---|---|
| Risk classification | High for bio/chemical and cyber capabilities [1][2] | Deployment continues with controls; Critical would halt |
| Red-teaming scope | Nearly 200 early-access partners plus targeted red-teaming [2] | Stress-tested for biological and cyber attack pathways |
| General access controls | Stricter classifiers and protections for repeated misuse [1][2] | Tighter controls on higher-risk cyber activity |
| Verified defender access | Trusted Access for Cyber program [1][5] | Verified defenders get cyber-permissive variants under trust and security requirements |
| CyberGym benchmark | 81.8% (vs Claude Opus 4.7: 73.1%) [1] | State-of-the-art in identifying and mitigating digital vulnerabilities |
Cybersecurity: 81.8% on CyberGym and the Governance High-Wire
On CyberGym, OpenAI reports GPT-5.5 at 81.8%, ahead of GPT-5.4 at 79.0% and Claude Opus 4.7 at 73.1%. The result is dual-use by nature: the same planning and debugging capabilities that help defenders find and patch vulnerabilities can lower the barrier for misuse if access is not governed carefully [1][2].
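CyberGym evaluates models against real vulnerability corpora; as a far simpler illustration of the defender-side task (spot a dangerous pattern, propose a mitigation), consider a toy static check. The patterns and suggested fixes below are this article’s own, not CyberGym’s:

```python
import re

# Toy pattern table: (regex, issue, suggested mitigation).
# Real tooling, and CyberGym tasks, go far beyond pattern matching.
CHECKS = [
    (re.compile(r"\beval\s*\("), "arbitrary code execution via eval",
     "parse input with ast.literal_eval or a real parser"),
    (re.compile(r"subprocess\.\w+\([^)]*shell\s*=\s*True"), "shell injection risk",
     "pass an argument list with shell=False"),
    (re.compile(r"\bpickle\.loads?\("), "unsafe deserialization",
     "use json or another data-only format for untrusted input"),
]

def audit(source: str):
    """Return (line_no, issue, fix) for each suspicious line of source code."""
    findings = []
    for no, line in enumerate(source.splitlines(), start=1):
        for pattern, issue, fix in CHECKS:
            if pattern.search(line):
                findings.append((no, issue, fix))
    return findings

snippet = """\
import pickle, subprocess
data = pickle.loads(blob)
subprocess.run(cmd, shell=True)
"""

for no, issue, fix in audit(snippet):
    print(f"line {no}: {issue} -> {fix}")
```

The dual-use point is visible even here: the same finding that tells a defender what to patch tells an attacker where to look, which is why access governance matters.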
OpenAI’s Preparedness Framework treats GPT-5.5’s biological/chemical and cybersecurity capabilities as High. In the release post and system card, OpenAI says the model went through targeted red-teaming for advanced cybersecurity and biology, testing with external experts, and nearly 200 early-access partner workflows before release [1][2].
For general access, OpenAI describes stricter classifiers for potential cyber risk, stronger controls around higher-risk activity, and protections against repeated misuse. In parallel, Trusted Access for Cyber provides verified defenders with expanded access to cyber-permissive models for legitimate security work under trust and security requirements [1][5].
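OpenAI has not published how its classifiers or the Trusted Access for Cyber checks work. Purely as a sketch of the tiered-access pattern described above, with hypothetical tiers, thresholds, and a keyword stand-in for a learned classifier:

```python
from dataclasses import dataclass

# Hypothetical tiers mirroring the described pattern: general users face
# stricter limits, while verified defenders reach more capable variants.
@dataclass(frozen=True)
class Requester:
    verified_defender: bool
    prior_violations: int

def risk_score(prompt: str) -> float:
    """Stand-in for a learned cyber-risk classifier (keyword heuristic here)."""
    risky = ("exploit", "0day", "ransomware", "privilege escalation")
    hits = sum(word in prompt.lower() for word in risky)
    return min(1.0, hits / 2)

def route(requester: Requester, prompt: str) -> str:
    """Decide which policy tier applies to a request."""
    if requester.prior_violations >= 3:
        return "blocked"                      # protection against repeated misuse
    if risk_score(prompt) >= 0.5:
        return ("cyber-permissive" if requester.verified_defender
                else "restricted")            # stricter controls for general access
    return "standard"

print(route(Requester(False, 0), "write a ransomware exploit"))  # restricted
print(route(Requester(True, 0), "write a ransomware exploit"))   # cyber-permissive
```

The design choice the sketch highlights is that risk is scored per request but permissions attach to the requester, which is the split OpenAI describes between general safeguards and verified-defender access.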
Key Takeaways
- GPT-5.5 helped discover an off-diagonal Ramsey-number proof later verified in Lean, giving the release a concrete example of AI-assisted mathematics rather than only benchmark claims [1].
- BixBench at 80.5%, GeneBench at 25.0%, and the Jackson Laboratory 28,000-gene example show GPT-5.5 moving into multi-stage scientific analysis workflows [1][3][4].
- GPT-5.5 leads CyberGym at 81.8% versus Claude Opus 4.7’s 73.1% in OpenAI’s release table, which makes cyber governance a core part of deployment rather than an afterthought [1][2].
- OpenAI’s governance posture combines stricter public safeguards with Trusted Access for Cyber so verified defenders can use more capable tools for legitimate security work [1][5].
References
- [1] OpenAI, “Introducing GPT-5.5,” Apr. 23, 2026. [Online]. Available: https://openai.com/index/introducing-gpt-5-5/
- [2] OpenAI, “GPT-5.5 System Card,” Apr. 23, 2026. [Online]. Available: https://openai.com/index/gpt-5-5-system-card/
- [3] OpenAI, “GeneBench: Assessing AI Agents for Multi-Stage Inference,” Apr. 2026. [Online]. Available: https://cdn.openai.com/pdf/6dc7175d-d9e7-4b8d-96b8-48fe5798cd5b/oai_genebench_benchmark.pdf
- [4] arXiv, “BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology,” Mar. 2025. [Online]. Available: https://arxiv.org/abs/2503.00096
- [5] OpenAI, “Accelerating the cyber defense ecosystem that protects us all,” Apr. 16, 2026. [Online]. Available: https://openai.com/index/accelerating-cyber-defense-ecosystem/