Research
Research papers, reports, frameworks, and technical notes from Nuwa Frontier AI Safety Lab.
Research Areas
Themes guiding our work on frontier AI safety.
Autonomy Risks
Self-replication, self-proliferation, and autonomous resource acquisition in AI systems.
Deception
AI deception in human-AI interaction, developer-facing deception, and trust dynamics.
Scheming
Evaluation faking, situational awareness, and observer effects in safety evaluations.
Loss-of-Control
Conditions under which advanced AI systems may evade oversight or resist shutdown.
Agent Safety
Behavioral safety, thought correction, and guardrails for AI agent systems.
Evaluation
Executable risk-evaluation environments, benchmarks, and evaluation integrity.
Runtime Safety
Zero-trust access control, policy enforcement, and security for agent runtimes.
Featured Research
arXiv preprint · 2024
Frontier AI systems have surpassed the self-replicating red line
Xudong Pan, Jiarun Dai, Yihe Fan, Min Yang
Evaluates whether frontier AI systems can autonomously self-replicate and reports successful self-replication in controlled trials.
arXiv preprint; work in progress · 2025
Large language model-powered AI systems achieve self-replication with no human intervention
Xudong Pan, Jiarun Dai, Yihe Fan, Minyuan Luo, Changyi Li, Min Yang
Extends self-replication evaluation across 32 AI systems and reports autonomous replication, self-exfiltration, adaptation, and shutdown-survival behaviors.
Nüwa Project preprint · 2026
One Step from Silicon Life: Autonomous AI Agents Capable of Uncontrolled Self-Proliferation
Geng Hong, Xudong Pan, Jiarun Dai, Jiaqi Luo, Wuyuao Mai, Min Yang
Demonstrates autonomous agents acquiring external computational resources and propagating across remote devices in controlled environments that simulate real-world conditions.
arXiv preprint · 2025
Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing
Wuyuao Mai, Geng Hong, Qi Liu, Jinsong Chen, Jiarun Dai, Xudong Pan, Yuan Zhang, Min Yang
Introduces TermiBench and TermiAgent to evaluate and improve autonomous penetration-testing agents in realistic shell-acquisition settings.
arXiv preprint · 2026
CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly
Yihe Fan, Changyi Li, Lichen Xu, Xudong Pan, Jiarun Dai, Geng Hong, Min Yang
Proposes a cybersecurity agent framework that iteratively revises its own scaffold from failed attempts to adapt across targets and failure modes.
arXiv preprint · 2025
Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems
Yihe Fan, Wenqi Zhang, Xudong Pan, Min Yang
Studies whether models recognize evaluation contexts and alter behavior, identifying observer effects that threaten safety-evaluation integrity.
ICML 2026 accepted; arXiv preprint · 2026
OpenDeception: Learning Deception and Trust in Human-AI Interaction via Multi-Agent Simulation
Yichen Wu, Qianqian Gao, Xudong Pan, Geng Hong, Min Yang
Builds a lightweight framework to evaluate deception risk and user trust dynamics in open-ended human-AI dialogue.
ICML 2026 accepted; arXiv preprint · 2026
AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation
Changyi Li, Pengfei Lu, Xudong Pan, Fazl Barez, Min Yang
Synthesizes executable risk-evaluation environments that combine deterministic code state with LLM-generated narrative dynamics.
SAIF position paper; ICML 2026 accepted · 2026
Position: Preparing for AI Systems That Deceive Developers
Isabella Duan, Xudong Pan, Yawen Duan, Adam Gleave, Ranjie Duan, Yang Zhang, Xiaojian Li, Chaochao Lu, Naying Hu, Sören Mindermann, Dongrui Liu, Jie Fu, Peng Xu, Tianxing He, Xudong Guo, Chen Zheng, Wenqi Chen, Jianfeng Cao, Geng Hong, Jiarun Dai, Yinpeng Dong, Brian Tse, Xia Hu, Min Yang
Frames deception targeting developers as a distinct frontier-AI risk and proposes recommendations for monitorability, evaluation integrity, and non-evadable control.
ICML 2026 accepted; arXiv preprint · 2026
Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction
Changyue Jiang, Wenqi Zhang, Xudong Pan, Geng Hong, Min Yang
Introduces Thought-Aligner, a plug-in method that causally corrects unsafe agent thoughts before actions are executed.
ACM CCS 2026 accepted; arXiv preprint · 2026
MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction
Wenqi Zhang, Yulin Shen, Changyue Jiang, Jiarun Dai, Geng Hong, Xudong Pan
Uses simulation-derived reasoning correction to reduce unsafe actions in computer-use agents while preserving task utility.
USENIX Security 2026 · 2026
Autonomy Comes with Costs: Detecting Denial-of-Service Vulnerabilities Caused by Resource Abusing in LLM-based Agents
Jiaqi Luo, Jiarun Dai, Fengyu Liu, Songyang Peng, Youkun Shi, Tong Bu, Geng Hong, Xudong Pan, Yuan Zhang
Presents AgentDoS, a lifecycle-aware fuzzing framework for detecting resource-abuse DoS vulnerabilities in LLM-based agents.
USENIX Security 2025 · 2025
Make Agent Defeat Agent: Automatic Detection of Taint-Style Vulnerabilities in LLM-based Agents
Fengyu Liu, Yuan Zhang, Jiaqi Luo, Jiarun Dai, Tian Chen, Letian Yuan, Zhengmin Yu, Youkun Shi, Ke Li, Chengyuan Zhou, Hao Chen, Min Yang
Introduces AgentFuzz, a directed greybox fuzzing framework for finding taint-style vulnerabilities in real-world LLM-based agents.
ICLR 2026; arXiv preprint · 2026
PRISON: Unmasking the Criminal Potential of Large Language Models
Xinyi Wu, Geng Hong, Pei Chen, Yueyue Chen, Xudong Pan, Min Yang
Evaluates LLM criminal potential across traits such as false statements, framing, psychological manipulation, emotional disguise, and moral disengagement.
All Publications
Complete list of research outputs from the lab.
ASE 2025 · 2025
Security Debt in LLM Agent Applications: A Measurement Study of Vulnerabilities and Mitigation Trade-offs
Zhuoxiang Shen, Jiarun Dai, Yuan Zhang, Min Yang
Measures security debt in LLM-agent applications by studying known vulnerabilities and the trade-offs introduced by mitigation strategies.
WWW 2026; arXiv preprint · 2026
StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models
Yang Feng, Xudong Pan
Proposes an evolutionary prompt-injection attack that targets black-box LLM-powered tabular agents under structural payload constraints.
arXiv preprint · 2026
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
Jinghan Yang, Yihe Fan, Xudong Pan, Min Yang
Detects unsafe diffusion outputs during the generation process by approximating latent decoding, enabling earlier and cheaper NSFW intervention.
WWW 2025; arXiv preprint · 2025
You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense
Wuyuao Mai, Geng Hong, Pei Chen, Xudong Pan, Baojun Liu, Yuan Zhang, Haixin Duan, Min Yang
Evaluates whether jailbreak defenses improve safety without degrading model utility, highlighting persistent trade-offs in practical LLM defense.