◆ Nuwa Research

Research

Publications, research papers, reports, frameworks, and technical notes from Nuwa Frontier AI Safety Lab.

Research Areas

Themes guiding our work on frontier AI safety.

Autonomy Risks

Self-replication, self-proliferation, and autonomous resource acquisition in AI systems.

Deception

AI deception in human-AI interaction, developer-facing deception, and trust dynamics.

Scheming

Evaluation faking, situational awareness, and observer effects in safety evaluations.

Loss-of-Control

Conditions under which advanced AI systems may evade oversight or resist shutdown.

Agent Safety

Behavioral safety, thought correction, and guardrails for AI agent systems.

Evaluation

Executable risk-evaluation environments, benchmarks, and evaluation integrity.

Runtime Safety

Zero-trust access control, policy enforcement, and security for agent runtimes.

Featured Research

arXiv preprint · 2024

Frontier AI systems have surpassed the self-replicating red line

Xudong Pan, Jiarun Dai, Yihe Fan, Min Yang

Evaluates whether frontier AI systems can autonomously self-replicate and reports successful self-replication in controlled trials.

Frontier AI risk · self-replication · loss of control

Read paper → PDF arXiv

arXiv preprint; work in progress · 2025

Large language model-powered AI systems achieve self-replication with no human intervention

Xudong Pan, Jiarun Dai, Yihe Fan, Minyuan Luo, Changyi Li, Min Yang

Extends self-replication evaluation across 32 AI systems and reports autonomous replication, self-exfiltration, adaptation, and shutdown-survival behaviors.

Frontier AI risk · self-replication · self-exfiltration · shutdown resistance

Read paper → PDF arXiv

Nüwa Project preprint · 2026

One Step from Silicon Life: Autonomous AI Agents Capable of Uncontrolled Self-Proliferation

Geng Hong, Xudong Pan, Jiarun Dai, Jiaqi Luo, Wuyuao Mai, Min Yang

Demonstrates autonomous agents acquiring external computational resources and propagating across remote devices under controlled, simulated real-world conditions.

Frontier AI risk · self-proliferation · autonomous resource acquisition

Read paper →

arXiv preprint · 2025

Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing

Wuyuao Mai, Geng Hong, Qi Liu, Jinsong Chen, Jiarun Dai, Xudong Pan, Yuan Zhang, Min Yang

Introduces TermiBench and TermiAgent to evaluate and improve autonomous penetration-testing agents in realistic shell-acquisition settings.

AI system security · autonomous penetration testing · cyber agents

Read paper → PDF arXiv

arXiv preprint · 2026

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

Yihe Fan, Changyi Li, Lichen Xu, Xudong Pan, Jiarun Dai, Geng Hong, Min Yang

Proposes a cybersecurity agent framework that iteratively revises its own scaffold from failed attempts to adapt across targets and failure modes.

AI system security · cybersecurity agents · self-evolving agents

Read paper → PDF arXiv

arXiv preprint · 2025

Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems

Yihe Fan, Wenqi Zhang, Xudong Pan, Min Yang

Studies whether models recognize evaluation contexts and alter behavior, identifying observer effects that threaten safety-evaluation integrity.

Frontier AI safety evaluation · evaluation faking · situational awareness

Read paper → PDF arXiv

ICML 2026 accepted; arXiv preprint · 2026

OpenDeception: Learning Deception and Trust in Human-AI Interaction via Multi-Agent Simulation

Yichen Wu, Qianqian Gao, Xudong Pan, Geng Hong, Min Yang

Builds a lightweight framework to evaluate deception risk and user trust dynamics in open-ended human-AI dialogue.

AI deception · human-AI interaction · trust modeling

Read paper → PDF arXiv

ICML 2026 accepted; arXiv preprint · 2026

AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation

Changyi Li, Pengfei Lu, Xudong Pan, Fazl Barez, Min Yang

Synthesizes executable risk-evaluation environments that combine deterministic code state with LLM-generated narrative dynamics.

Frontier AI risk evaluation · executable environments · agent safety benchmarks

Read paper → PDF arXiv

SAIF position paper; ICML 2026 accepted · 2026

Position: Preparing for AI Systems That Deceive Developers

Isabella Duan, Xudong Pan, Yawen Duan, Adam Gleave, Ranjie Duan, Yang Zhang, Xiaojian Li, Chaochao Lu, Naying Hu, Sören Mindermann, Dongrui Liu, Jie Fu, Peng Xu, Tianxing He, Xudong Guo, Chen Zheng, Wenqi Chen, Jianfeng Cao, Geng Hong, Jiarun Dai, Yinpeng Dong, Brian Tse, Xia Hu, Min Yang

Frames deception targeting developers as a distinct frontier-AI risk and proposes recommendations for monitorability, evaluation integrity, and non-evadable control.

AI deception · developer-facing risk · safety governance

Read paper → PDF

ICML 2026 accepted; arXiv preprint · 2026

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

Changyue Jiang, Wenqi Zhang, Xudong Pan, Geng Hong, Min Yang

Introduces Thought-Aligner, a plug-in method that causally corrects unsafe agent thoughts before actions are executed.

Agent behavioral safety · thought correction · guardrails

Read paper → PDF arXiv

ACM CCS 2026 accepted; arXiv preprint · 2026

MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction

Wenqi Zhang, Yulin Shen, Changyue Jiang, Jiarun Dai, Geng Hong, Xudong Pan

Uses simulation-derived reasoning correction to reduce unsafe actions in computer-use agents while preserving task utility.

Computer-use agent security · visual prompt injection · simulation-to-real defense

Read paper → PDF arXiv

USENIX Security 2026 · 2026

Autonomy Comes with Costs: Detecting Denial-of-Service Vulnerabilities Caused by Resource Abusing in LLM-based Agents

Jiaqi Luo, Jiarun Dai, Fengyu Liu, Songyang Peng, Youkun Shi, Tong Bu, Geng Hong, Xudong Pan, Yuan Zhang

Presents AgentDoS, a lifecycle-aware fuzzing framework for detecting resource-abuse DoS vulnerabilities in LLM-based agents.

LLM agent system security · denial-of-service · resource governance

Read paper →

USENIX Security 2025 · 2025

Make Agent Defeat Agent: Automatic Detection of Taint-Style Vulnerabilities in LLM-based Agents

Fengyu Liu, Yuan Zhang, Jiaqi Luo, Jiarun Dai, Tian Chen, Letian Yuan, Zhengmin Yu, Youkun Shi, Ke Li, Chengyuan Zhou, Hao Chen, Min Yang

Introduces AgentFuzz, a directed greybox fuzzing framework for finding taint-style vulnerabilities in real-world LLM-based agents.

LLM agent system security · taint-style vulnerabilities · fuzzing

Read paper → PDF

ICLR 2026; arXiv preprint · 2026

PRISON: Unmasking the Criminal Potential of Large Language Models

Xinyi Wu, Geng Hong, Pei Chen, Yueyue Chen, Xudong Pan, Min Yang

Evaluates LLM criminal potential across traits such as false statements, framing, psychological manipulation, emotional disguise, and moral disengagement.

LLM behavioral risk · criminal potential · deception and manipulation

Read paper → PDF arXiv

All Publications

Complete list of research outputs from the lab.

ASE 2025 · 2025

Security Debt in LLM Agent Applications: A Measurement Study of Vulnerabilities and Mitigation Trade-offs

Zhuoxiang Shen, Jiarun Dai, Yuan Zhang, Min Yang

Measures security debt in LLM-agent applications by studying known vulnerabilities and the trade-offs introduced by mitigation strategies.

LLM agent system security · vulnerability measurement · mitigation trade-offs

Read paper →

WWW 2026; arXiv preprint · 2026

StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models

Yang Feng, Xudong Pan

Proposes an evolutionary prompt-injection attack that targets black-box LLM-powered tabular agents under structural payload constraints.

Prompt injection · tabular agents · black-box agent security

Read paper → PDF arXiv

arXiv preprint · 2026

FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

Jinghan Yang, Yihe Fan, Xudong Pan, Min Yang

Detects unsafe diffusion outputs during the generation process by approximating latent decoding, enabling earlier and cheaper NSFW intervention.

Generative model safety · NSFW detection · diffusion models

Read paper → PDF arXiv

WWW 2025; arXiv preprint · 2025

You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense

Wuyuao Mai, Geng Hong, Pei Chen, Xudong Pan, Baojun Liu, Yuan Zhang, Haixin Duan, Min Yang

Evaluates whether jailbreak defenses improve safety without degrading model utility, highlighting persistent trade-offs in practical LLM defense.

LLM safety · jailbreak defense · safety-performance trade-off

Read paper → PDF arXiv
