publications | Yuzhou Nie

2025

NeurIPS

SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

Yuzhou Nie*, Zhun Wang*, Yu Yang*, Ruizhe Jiang, and 6 more authors

2025

Abs arXiv Code

Existing benchmarks for evaluating the security risks and capabilities (e.g., vulnerability detection) of code-generating large language models (LLMs) face several key limitations: (1) limited coverage of risk and capabilities; (2) reliance on static evaluation metrics such as LLM judgments or rule-based detection, which lack the precision of dynamic analysis; and (3) a trade-off between data quality and benchmark scale. To address these challenges, we introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Our approach provides a comprehensive suite of artifacts so the benchmark can support comprehensive risk assessment and security capability evaluation using dynamic metrics. By combining expert insights with automated generation, we strike a balance between manual effort, data quality, and benchmark scale. Applying this framework to Python, C/C++, and Java, we build SecCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities. Compared with state-of-the-art benchmarks, SecCodePLT offers broader coverage, higher data fidelity, and substantially greater scale. We use SecCodePLT to evaluate leading code LLMs and agents, revealing their strengths and weaknesses in both generating secure code and identifying or fixing vulnerabilities.
NeurIPS

Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation

Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, and 12 more authors

2025

Abs arXiv Website

Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases: During inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; For training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI’s Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.
EMNLP

AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents

Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, and 5 more authors

2025

Abs arXiv

The strong planning and reasoning capabilities of Large Language Models (LLMs) have fostered the development of agent-based systems capable of leveraging external tools and interacting with increasingly complex environments. However, these powerful features also introduce a critical security risk: indirect prompt injection, a sophisticated attack vector that compromises the core of these agents, the LLM, by manipulating contextual information rather than direct user prompts. In this work, we propose a generic black-box fuzzing framework, AgentVigil, designed to automatically discover and exploit indirect prompt injection vulnerabilities across diverse LLM agents. Our approach starts by constructing a high-quality initial seed corpus, then employs a seed selection algorithm based on Monte Carlo Tree Search (MCTS) to iteratively refine inputs, thereby maximizing the likelihood of uncovering agent weaknesses. We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o, respectively, nearly doubling the performance of baseline attacks. Moreover, AgentVigil exhibits strong transferability across unseen tasks and internal LLMs, as well as promising results against defenses. Beyond benchmark evaluations, we apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
COLM

LeakAgent: RL-based Red-teaming Agent for LLM Privacy Leakage

Yuzhou Nie, Zhun Wang, Ye Yu, Xian Wu, and 4 more authors

2025

Abs arXiv Code

Recent studies have discovered that large language models (LLM) may be “fooled” to output private information, including training data, system prompts, and personally identifiable information, under carefully crafted adversarial prompts. Existing red-teaming approaches for privacy leakage either rely on manual efforts or focus solely on system prompt extraction, making them ineffective for severe risks of training data leakage. We propose LeakAgent, a novel black-box red-teaming framework for LLM privacy leakage. Our framework trains an open-source LLM through reinforcement learning as the attack agent to generate adversarial prompts for both training data extraction and system prompt extraction. To achieve this, we propose a novel reward function to provide effective and fine-grained rewards and design novel mechanisms to balance exploration and exploitation during learning and enhance the diversity of adversarial prompts. Through extensive evaluations, we first show that LeakAgent significantly outperforms existing rule-based approaches in training data extraction and automated methods in system prompt leakage. We also demonstrate the effectiveness of LeakAgent in extracting system prompts from real-world applications in OpenAI’s GPT Store. We further demonstrate LeakAgent’s effectiveness in evading the existing guardrail defense and its helpfulness in enabling better safety alignment. Finally, we validate our customized designs through a detailed ablation study.

2024

NeurIPS

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

Yuzhou Nie*, Xuan Chen*, Wenbo Guo, and Xiangyu Zhang

2024

Abs arXiv Code

Recent studies developed jailbreaking attacks, which construct jailbreaking prompts to “fool” LLMs into responding to harmful questions. Early-stage jailbreaking attacks require access to model internals or significant human efforts. More advanced attacks utilize genetic algorithms for automatic and black-box attacks. However, the random nature of genetic algorithms significantly limits the effectiveness of these attacks. In this paper, we propose RLbreaker, a black-box jailbreaking attack driven by deep reinforcement learning (DRL). We model jailbreaking as a search problem and design an RL agent to guide the search, which is more effective and has less randomness than stochastic search, such as genetic algorithms. Specifically, we design a customized DRL system for the jailbreaking problem, including a novel reward function and a customized proximal policy optimization (PPO) algorithm. Through extensive experiments, we demonstrate that RLbreaker is much more effective than existing jailbreaking attacks against six state-of-the-art (SOTA) LLMs. We also show that RLbreaker is robust against three SOTA defenses and its trained agents can transfer across different LLMs. We further validate the key design choices of RLbreaker via a comprehensive ablation study.

2023

2022

CICAI

Adversarial and Implicit Modality Imputation with Applications to Depression Early Detection

Yuzhou Nie*, Chengyue Huang*, Hailun Liang, and Hongteng Xu

2022

Abs PDF Code

Depression early detection is a significant healthcare task that heavily relies on high-quality multi-modal medical data. In practice, however, learning a robust detection model is challenging because real-world data often suffers from serious modality-level missing issues caused by imperfect data collection and strict data sharing policies. In this study, we propose an Adversarial and Implicit Modality Imputation (AIMI) method to resolve this challenge. In particular, when training multi-modal predictive models, we learn an implicit mechanism to impute the missing modalities of training data at the same time. These two learning objectives are achieved jointly in an adversarial learning framework. Based on the UK Biobank dataset, we demonstrate the effectiveness and superiority of our method on the early detection of depression. Codes are available at https://github.com/rucnyz/AIMI
CIKM

Gromov-Wasserstein Multi-Modal Alignment and Clustering

Fengjiao Gong*, Yuzhou Nie*, and Hongteng Xu

2022

Abs PDF Code

Multi-modal clustering aims at finding a clustering structure shared by the data of different modalities in an unsupervised way. Currently, solving this problem often relies on two assumptions: i) the multi-modal data own the same latent distribution, and ii) the observed multi-modal data are well-aligned and without any missing modalities. Unfortunately, these two assumptions are often questionable in practice and thus limit the feasibility of many multi-modal clustering methods. In this work, we develop a new multi-modal clustering method based on the Gromovization of optimal transport distance, which relaxes the dependence on the above two assumptions. In particular, given the data of different modalities, whose correspondence is unknown, our method learns the Gromov-Wasserstein (GW) barycenter of their kernel matrices. Driven by the modularity maximization principle, the GW barycenter helps to explore the clustering structure shared by different modalities. Moreover, the GW barycenter is associated with the GW distances between the different modalities to the clusters, and the optimal transport plans corresponding to the GW distances help to achieve the alignment and the clustering of the multi-modal data jointly. Experimental results show that our method outperforms state-of-the-art multi-modal clustering methods, especially when the data are (partially or completely) unaligned. The code is available at https://github.com/rucnyz/GWMAC.