GPT-5.5 as Scientific Co-Researcher: Ramsey Proofs, Gene Expression, and Cyber Defense
GPT-5.5 has moved past literature review and data formatting into tool-using scientific workflows. OpenAI says an internal GPT-5.5 variant helped discover an off-diagonal Ramsey-number proof later verified in Lean, that GPT-5.5 scored 80.5% on BixBench and 81.8% on CyberGym, and that its biological/chemical and cybersecurity capabilities are treated as High under the Preparedness Framework [1][2].
GPT-5.5 Research and Cybersecurity Performance
- BixBench: 80.5%, vs 74.0% for GPT-5.4 (real-world biomedical data analysis) [1]
- CyberGym: 81.8%, vs Claude Opus 4.7 at 73.1% in OpenAI’s table [1]
- GeneBench (Pro): 33.2%, vs 22.9% for Claude Opus 4.7 in OpenAI’s table [1]
- Preparedness Framework: bio/chemical and cyber capabilities treated as High [1][2]
The Ramsey Number Proof: AI at the Boundary of Human Knowledge
Advanced mathematical reasoning has historically been one of the hardest categories for large language models because free-form proof sketches are not enough; the work must survive formal verification. GPT-5.5 challenges that limitation in a concrete way. OpenAI says an internal version of GPT-5.5 with a custom harness helped discover a new proof about off-diagonal Ramsey numbers [1].
Ramsey theory studies the conditions under which ordered structure must emerge within sufficiently large combinatorial systems. OpenAI describes the result as a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers that was later verified in Lean. That distinction matters: the claim is not simply that a model wrote a persuasive explanation, but that the mathematical argument passed machine-checkable verification [1].
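OpenAI does not name the exact statement that was proved, so purely for orientation: a classic asymptotic fact about off-diagonal Ramsey numbers, established by Kim together with Ajtai, Komlós, and Szemerédi, has the shape

```latex
% Illustrative only: a well-known off-diagonal asymptotic,
% not necessarily the statement OpenAI's model helped prove.
R(3, t) = \Theta\!\left(\frac{t^2}{\log t}\right) \quad \text{as } t \to \infty
```

Results of this kind pin down how fast a Ramsey number grows, which is exactly the sort of claim that benefits from formal verification rather than an informal sketch.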
“[The model] autonomously gathers evidence, tests experimental biological assumptions, and draws novel scientific conclusions.”
Professor Derya Unutmaz, Jackson Laboratory immunologist, describing GPT-5.5 Pro on 28,000-gene datasets [1]
The philosophical implication extends beyond a single result. A model that can contribute to Lean-verified mathematics is no longer just a summarization tool for researchers; it becomes part of an iterative research loop where conjecture, search, implementation, and formal checking reinforce each other [1].
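To make “verified in Lean” concrete: Lean only accepts a theorem when every step type-checks against its kernel, so there is no room for a persuasive but flawed argument. A minimal Lean 4 example, unrelated to the Ramsey result itself, looks like:

```lean
-- A trivial machine-checked statement. The kernel accepts this
-- only because `Nat.add_comm` is itself a verified proof term;
-- an unjustified step would fail to compile.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The actual Ramsey formalization is vastly larger, but it passes exactly the same kind of kernel check.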
GPT-5.5 Research Performance — Science and Mathematics
| Benchmark | Domain | GPT-5.5 Base | GPT-5.5 Pro | GPT-5.4 |
|---|---|---|---|---|
| BixBench | Bioinformatics & biomedical analysis | 80.5% [1] | — | 74.0% |
| GeneBench | Genetics & quantitative biology | 25.0% [3] | 33.2% | — |
| FrontierMath Tier 4 | Advanced mathematical logic | 35.4% [4] | 39.6% | — |
| Expert-SWE | Long-horizon coding (20-hr tasks) | 73.1% [3] | — | — |
Bioinformatics and the 28,000-Gene Frontier
In the biological sciences, GPT-5.5 is being validated on real research rather than constructed evaluation scenarios. Immunologists at the Jackson Laboratory, including Professor Derya Unutmaz, are using the model to interpret gene-expression datasets spanning nearly 28,000 individual genes, a scale and complexity that would otherwise require teams of specialists working for extended periods. The model autonomously gathers supporting evidence, tests experimental biological assumptions against the available data, and draws scientific conclusions that Unutmaz reports are genuinely novel [2].
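Neither OpenAI nor the Jackson Laboratory has published the analysis code, but the core move in a screen like the one described, comparing expression across conditions gene by gene and ranking the differences, can be sketched in plain Python. The gene names and values below are invented for illustration:

```python
import math

# Invented toy expression values for a handful of genes;
# a real dataset would hold ~28,000 genes and many replicates.
expression = {
    "IL6":   ([5.0, 6.0, 5.5], [40.0, 38.0, 42.0]),      # (control, treated)
    "TNF":   ([4.0, 4.5, 5.0], [30.0, 28.0, 33.0]),
    "GAPDH": ([100.0, 98.0, 101.0], [99.0, 102.0, 100.0]),  # housekeeping gene
}

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def welch_t(a, b):
    """Welch's t-statistic: mean difference over the standard error of the difference."""
    se = math.sqrt(var(a) / len(a) + var(b) / len(b))
    return (mean(b) - mean(a)) / se

def screen(data):
    """Rank genes by |t|, reporting the log2 fold change for each."""
    rows = []
    for gene, (ctrl, treat) in data.items():
        lfc = math.log2(mean(treat) / mean(ctrl))
        rows.append((gene, lfc, welch_t(ctrl, treat)))
    return sorted(rows, key=lambda r: abs(r[2]), reverse=True)

for gene, lfc, t in screen(expression):
    print(f"{gene:6s} log2FC={lfc:+.2f} t={t:+.1f}")
```

A real pipeline would add multiple-testing correction and pathway analysis on top; the point here is only the shape of the per-gene comparison.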
On BixBench, a standardized evaluation for real-world bioinformatics and biomedical data analysis, GPT-5.5 scored 80.5%, a 6.5-point improvement over GPT-5.4’s 74.0%. On GeneBench, a multi-stage evaluation targeting complex genetics and quantitative biology workflows, the base model scored 25.0% and the Pro variant reached 33.2%. The GeneBench scores look low in isolation, but they reflect the genuine difficulty of multi-stage quantitative biology reasoning: human experts’ completion rates on the same tasks are themselves far from perfect [1][3].
OpenAI Preparedness Framework — GPT-5.5 Risk Profile
| Framework Element | GPT-5.5 Status | Implication |
|---|---|---|
| Risk classification | High for bio/chemical and cyber capabilities [1][2] | Deployment continues with controls; Critical would halt |
| Red-teaming scope | Nearly 200 early-access partners plus targeted red-teaming [2] | Stress-tested for biological and cyber attack pathways |
| General access controls | Stricter classifiers and protections for repeated misuse [1][2] | Tighter controls on higher-risk cyber activity |
| Verified defender access | Trusted Access for Cyber program [1][5] | Verified defenders get cyber-permissive variants under trust and security requirements |
| CyberGym benchmark | 81.8% (vs Claude Opus 4.7: 73.1%) [1] | State-of-the-art in identifying and mitigating digital vulnerabilities |
Cybersecurity: 81.8% on CyberGym and the Governance High-Wire
On CyberGym, OpenAI reports GPT-5.5 at 81.8%, ahead of GPT-5.4 at 79.0% and Claude Opus 4.7 at 73.1%. The result is dual-use by nature: the same planning and debugging capabilities that help defenders find and patch vulnerabilities can lower the barrier for misuse if access is not governed carefully [1][2].
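CyberGym evaluates models against real vulnerability corpora; as a far simpler illustration of the defender-side task (spot a dangerous pattern, propose a mitigation), consider a toy static check. The patterns and suggested fixes below are this article’s own, not CyberGym’s:

```python
import re

# Toy pattern table: (regex, issue, suggested mitigation).
# Real tooling, and CyberGym tasks, go far beyond pattern matching.
CHECKS = [
    (re.compile(r"\beval\s*\("), "arbitrary code execution via eval",
     "parse input with ast.literal_eval or a real parser"),
    (re.compile(r"subprocess\.\w+\([^)]*shell\s*=\s*True"), "shell injection risk",
     "pass an argument list with shell=False"),
    (re.compile(r"\bpickle\.loads?\("), "unsafe deserialization",
     "use json or another data-only format for untrusted input"),
]

def audit(source: str):
    """Return (line_no, issue, fix) for each suspicious line of source code."""
    findings = []
    for no, line in enumerate(source.splitlines(), start=1):
        for pattern, issue, fix in CHECKS:
            if pattern.search(line):
                findings.append((no, issue, fix))
    return findings

snippet = """\
import pickle, subprocess
data = pickle.loads(blob)
subprocess.run(cmd, shell=True)
"""

for no, issue, fix in audit(snippet):
    print(f"line {no}: {issue} -> {fix}")
```

The dual-use point is visible even here: the same finding that tells a defender what to patch tells an attacker where to look, which is why access governance matters.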
OpenAI’s Preparedness Framework treats GPT-5.5’s biological/chemical and cybersecurity capabilities as High. In the release post and system card, OpenAI says the model went through targeted red-teaming for advanced cybersecurity and biology, testing with external experts, and nearly 200 early-access partner workflows before release [1][2].
For general access, OpenAI describes stricter classifiers for potential cyber risk, stronger controls around higher-risk activity, and protections against repeated misuse. In parallel, Trusted Access for Cyber provides verified defenders with expanded access to cyber-permissive models for legitimate security work under trust and security requirements [1][5].
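OpenAI has not published how its classifiers or the Trusted Access for Cyber checks work. Purely as a sketch of the tiered-access pattern described above, with hypothetical tiers, thresholds, and a keyword stand-in for a learned classifier:

```python
from dataclasses import dataclass

# Hypothetical tiers mirroring the described pattern: general users face
# stricter limits, while verified defenders reach more capable variants.
@dataclass(frozen=True)
class Requester:
    verified_defender: bool
    prior_violations: int

def risk_score(prompt: str) -> float:
    """Stand-in for a learned cyber-risk classifier (keyword heuristic here)."""
    risky = ("exploit", "0day", "ransomware", "privilege escalation")
    hits = sum(word in prompt.lower() for word in risky)
    return min(1.0, hits / 2)

def route(requester: Requester, prompt: str) -> str:
    """Decide which policy tier applies to a request."""
    if requester.prior_violations >= 3:
        return "blocked"                      # protection against repeated misuse
    if risk_score(prompt) >= 0.5:
        return ("cyber-permissive" if requester.verified_defender
                else "restricted")            # stricter controls for general access
    return "standard"

print(route(Requester(False, 0), "write a ransomware exploit"))  # restricted
print(route(Requester(True, 0), "write a ransomware exploit"))   # cyber-permissive
```

The design choice the sketch highlights is that risk is scored per request but permissions attach to the requester, which is the split OpenAI describes between general safeguards and verified-defender access.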
Key Takeaways
- GPT-5.5 helped discover an off-diagonal Ramsey-number proof later verified in Lean, giving the release a concrete example of AI-assisted mathematics rather than only benchmark claims [1].
- BixBench at 80.5%, GeneBench at 25.0%, and the Jackson Laboratory 28,000-gene example show GPT-5.5 moving into multi-stage scientific analysis workflows [1][3][4].
- GPT-5.5 leads CyberGym at 81.8% versus Claude Opus 4.7’s 73.1% in OpenAI’s release table, which makes cyber governance a core part of deployment rather than an afterthought [1][2].
- OpenAI’s governance posture combines stricter public safeguards with Trusted Access for Cyber so verified defenders can use more capable tools for legitimate security work [1][5].
References
- [1] OpenAI, “Introducing GPT-5.5,” Apr. 23, 2026. [Online]. Available: https://openai.com/index/introducing-gpt-5-5/
- [2] OpenAI, “GPT-5.5 System Card,” Apr. 23, 2026. [Online]. Available: https://openai.com/index/gpt-5-5-system-card/
- [3] OpenAI, “GeneBench: Assessing AI Agents for Multi-Stage Inference,” Apr. 2026. [Online]. Available: https://cdn.openai.com/pdf/6dc7175d-d9e7-4b8d-96b8-48fe5798cd5b/oai_genebench_benchmark.pdf
- [4] arXiv, “BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology,” Mar. 2025. [Online]. Available: https://arxiv.org/abs/2503.00096
- [5] OpenAI, “Accelerating the cyber defense ecosystem that protects us all,” Apr. 16, 2026. [Online]. Available: https://openai.com/index/accelerating-cyber-defense-ecosystem/