The Trojan in the Vocabulary:
Stealthy Sabotage of LLM Composition

Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao
Purdue University, Carnegie Mellon University
Intro demo of the attacker and victim workflow.

Abstract

The open-weight LLM ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single "breaker token" that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack creates an asymmetric realizability gap that sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and achieves spectral mimicry to evade outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition.

Attack surface

Shared-basis transplant reconstructs donor-only tokens from shared anchors and reuses coefficients in the base.

Breaker token

One added token is optimized to stay inert in the donor but become high-salience after transplant.

Impact

High base emission rates with negligible donor utility change, persistent under fine-tuning and merging.

Takeaway: efficiency-first transplant pipelines need verification to prevent supply-chain trojans.

Overview

Tokenizer transplant enables composition across incompatible vocabularies. Standard tools reconstruct donor-only embeddings from shared anchors (often with sparse methods such as OMP), then reuse those coefficients in the base. This coefficient reuse is not neutral: it can be weaponized to open a gap between donor and base behavior, as the sketch after the list below illustrates.

  • Shared-basis transplant: donor-only tokens are approximated as linear combinations of shared anchors.
  • Asymmetric realizability gap: the donor sees the token as inert while the base realizes a high-salience feature.
  • Single-token sabotage: a single new vocabulary item is enough to trigger the failure.
  • Training-free: no fine-tuning or gradient access is required to execute the attack.
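
A minimal sketch of the reconstruct-then-reuse step, assuming shared-anchor embedding matrices A_donor and A_base (one row per shared token) and scikit-learn's OMP solver; the names and shapes are illustrative, not any specific transplant tool's API.

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    def transplant_token(e_donor, A_donor, A_base, k=64):
        """Reconstruct a donor-only embedding from shared anchors,
        then reuse the sparse coefficients in the base space.
        e_donor: (d_donor,) donor-only token embedding
        A_donor: (n_shared, d_donor) donor embeddings of shared anchors
        A_base:  (n_shared, d_base)  base embeddings of the same anchors
        """
        # Solve e_donor ~= A_donor.T @ c with at most k nonzero coefficients.
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
        omp.fit(A_donor.T, e_donor)
        c = omp.coef_  # sparse coefficients over the shared anchors

        # Coefficient reuse: the identical c is applied to the base anchors.
        return A_base.T @ c

The attack lives entirely in this reuse: the same c that reconstructs an inert-looking e_donor can land A_base.T @ c on a high-salience direction in the base.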

Threat Model

The attacker is a supply-chain adversary who publishes a modified donor tokenizer with one extra token. The victim applies a standard shared-basis transplant tool to align the donor vocabulary to a base model. The breaker token stays dormant in the donor and activates only after transplant into the base.
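
A hedged illustration of the supply-chain step using the Hugging Face transformers API; the repo ids and the token string are placeholders, and e_breaker stands in for the optimized embedding row the attack would produce.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("donor-model")  # placeholder repo id
    model = AutoModelForCausalLM.from_pretrained("donor-model")

    tok.add_tokens(["<breaker>"])            # one extra vocabulary item
    model.resize_token_embeddings(len(tok))  # appends a fresh embedding row

    d = model.get_input_embeddings().weight.shape[1]
    e_breaker = torch.zeros(d)  # placeholder: the attack substitutes its optimized row
    with torch.no_grad():
        model.get_input_embeddings().weight[-1] = e_breaker

    tok.save_pretrained("donor-modified")    # republished, apparently benign artifacts
    model.save_pretrained("donor-modified")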

Assumptions

  • Attacker can ship a donor tokenizer or checkpoint with one added token.
  • Victim uses training-free shared-basis transplant (e.g., OMP or similar).
  • No access to the victim's fine-tuning or downstream data is required.

Sabotage goals

  • Reputation poisoning via toxic or policy-breaking outputs.
  • Adversarial watermarking for latent ownership signaling.
  • Service degradation through EOS mapping or loop-inducing triggers.

Attack Framework

The attack exploits coefficient reuse in shared-basis transplant. The adversary estimates cross-model feature overlap from public text, then solves a dual-objective optimization that suppresses donor salience while maximizing base salience under the transplant operator.
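
In schematic form (notation ours, not the paper's exact formulation): with donor embedding e for the breaker token, transplant operator T (e.g., OMP coefficient reuse), salience scores S, and penalty weight \lambda,

    \max_{e}\; S_{\mathrm{base}}\!\big(T(e)\big) \;-\; \lambda\, S_{\mathrm{donor}}(e)

so the solver drives base salience up while the \lambda-weighted term holds the token inert in the donor, the trade-off probed in the penalty-weight ablation below.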

  • Shared anchors define a basis; donor-only tokens are reconstructed with sparse coefficients.
  • Coefficient reuse maps the same coefficients into the base, creating a realizability gap.
  • Solver supports sparse (OMP-style) and differentiable transplant operators; a sketch of the differentiable case follows this list.
  • Result: the breaker token is inert in the donor but high-impact in the base.
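
A minimal sketch of the differentiable-operator variant, assuming a ridge-regularized least-squares transplant (so the coefficients are a closed-form, differentiable function of e) and dot-product salience against fixed feature directions; every name and shape here is illustrative.

    import torch

    def ridge_coeffs(A, e, eps=1e-4):
        # Differentiable stand-in for the transplant solve e ~= A.T @ c:
        # ridge-regularized normal equations over the n_shared anchors.
        G = A @ A.T + eps * torch.eye(A.shape[0])
        return torch.linalg.solve(G, A @ e)

    def attack_loss(e, A_donor, A_base, v_donor, v_base, lam=1.0):
        c = ridge_coeffs(A_donor, e)
        e_base = A_base.T @ c          # coefficient reuse in the base
        s_base = e_base @ v_base       # base salience to maximize
        s_donor = e @ v_donor          # donor salience to suppress
        return -(s_base - lam * s_donor)

    # Toy shapes only; real runs use the models' embedding tables.
    n, d_d, d_b = 128, 64, 80
    A_donor, A_base = torch.randn(n, d_d), torch.randn(n, d_b)
    v_donor, v_base = torch.randn(d_d), torch.randn(d_b)
    e = torch.randn(d_d, requires_grad=True)
    opt = torch.optim.Adam([e], lr=1e-2)
    target_norm = 1.0  # matched to vocabulary norm statistics for stealth
    for _ in range(200):
        opt.zero_grad()
        attack_loss(e, A_donor, A_base, v_donor, v_base).backward()
        opt.step()
        with torch.no_grad():          # keep the row in-distribution
            e.mul_(target_norm / e.norm())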
Figure: attack pipeline, victim transplant, and asymmetric realizability.

Evaluation and Metrics

Experiments focus on whether a breaker token can activate in the base while staying inert in the donor, and whether that behavior survives standard post-processing like fine-tuning and weight merging. Evaluation spans multiple model families and scales, including a lightweight clique (Qwen2-0.5B, Qwen3-0.6B, Gemma-2-2B-it, Gemma-3-1B-it, Ministral-3B-Instruct) and larger cliques that include Gemma-2-9B-it, Llama-3-8B, Mistral-7B, Qwen2-7B, and Qwen3-14B. Cross-scale transfer sets mix smaller models (e.g., SmolLM2-1.7B, Qwen2.5-1.5B, Llama-3.2-3B) with larger donors.

Metrics and what they capture:

  • Sequence Emission Rate (SER): probability the breaker token appears at least once in a generation, measured over prompt pools (Alpaca, SQuAD v2, GSM8K); a computation sketch follows this list.
  • Utility preservation: perplexity on Wikitext-103 and accuracy on LAMBADA, MMLU, and ARC-Challenge.
  • Stealth and controls: spectral z-score mimicry and SER-vs-Hits@k ablations that vary the penalty weight.
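
SER itself is a simple pass over generations; a sketch, assuming a generate_fn that maps a prompt to sampled token ids (the names are ours, not the evaluation harness's):

    def sequence_emission_rate(prompts, generate_fn, breaker_id):
        # Fraction of prompts whose generation contains the breaker
        # token at least once; generate_fn: prompt -> list[int].
        hits = sum(breaker_id in generate_fn(p) for p in prompts)
        return hits / len(prompts)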

Experiments

Experiments validate asymmetric realizability across a fully connected clique of five open-weight models (Qwen2-0.5B, Qwen3-0.6B, Gemma-2-2B-it, Gemma-3-1B-it, Ministral-3B-Instruct), with additional tests on larger and cross-scale pairs. Activation is measured with SER on Alpaca, SQuAD v2, and GSM8K. Utility is tracked on Wikitext-103, LAMBADA, MMLU, and ARC-Challenge.

The main figure summarizes the asymmetric realizability gap. Additional panels quantify donor utility preservation, base utility shifts across transplant stages, persistence under LoRA fine-tuning and weight merging, sensitivity to the penalty weight, and spectral mimicry against outlier detection.

Key findings

  • Base activation is high while donor activation stays near zero across prompt pools.
  • Donor utility is largely preserved even when base utility shifts after attack.
  • LoRA fine-tuning and weight merging do not reliably remove the planted trigger (see the merging sketch after this list).
  • Breaker tokens blend into spectral statistics rather than showing up as outliers.
  • Penalty weight tuning exposes a trade-off between donor stealth and base activation.
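
One reason merging fails as a remedy: linear merging takes a convex combination of embedding tables, which attenuates the planted component rather than removing it. An illustrative check under that assumption (not the paper's code):

    import torch

    def merged_salience(W_attacked, W_clean, v_base, breaker_id, alpha=0.5):
        # After linear merging, the breaker row's projection onto the
        # malicious direction v_base is scaled by alpha, not zeroed,
        # assuming W_clean carries a benign row at breaker_id.
        W_merged = alpha * W_attacked + (1 - alpha) * W_clean
        return W_merged[breaker_id] @ v_base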
Figure: SER dumbbell plots across Alpaca, SQuAD v2, GSM8K, and SER max.

Reading guide: each dumbbell shows donor vs base activation for the same breaker token after transplant. Open markers (donor) stay near zero while filled markers (base) rise sharply, illustrating the asymmetric realizability gap across tasks and model pairs.

Figure: SER persistence under fine-tuning with norm-boost factors.

Figure: base utility changes across pretrained, after-OMP, and after-attack stages.

Persistence and utility: the left panel shows that breaker tokens survive LoRA fine-tuning, even with norm-boosting, while the right panel tracks how base utility shifts after OMP and after the attack. The key signal is that high activation does not require large utility degradation in the donor.

Figure: donor utility preservation (post-patch vs pretrained identity scatter).

Donor utilities stay close to the identity line, indicating that the injected token is statistically indistinguishable from benign behavior when evaluated on standard tasks.

Figure: ablation of the penalty weight (SER vs Hits@k).

Figure: spectral mimicry z-score distribution; breaker tokens avoid outlier detection.

The ablation shows the trade-off controlled by the penalty weight: tightening donor suppression can reduce base activation, and vice versa. The spectral plot highlights why simple outlier detection misses the breaker tokens: they blend into in-distribution embedding statistics.

Limitations and Ethics

Limitations

  • Focuses on training-free shared-basis transplant rather than costly retraining.
  • Does not evaluate data-driven post-processing or large-scale auditing pipelines.
  • Evaluation is limited to text-only models; multimodal settings are open.

Ethics

  • Releases the optimization framework without weaponized tokenizers.
  • Uses public models and datasets with no private or human-subject data.
  • Aims to harden the open-weight supply chain against silent failures.

Defense considerations

  • Transplant-time verification and auditing of newly added token rows.
  • Spectral filtering or provenance checks for vocabulary changes (a minimal check is sketched below).
  • Embedding retraining or data-driven post-processing (not evaluated here).
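
As a starting point for transplant-time verification, newly added rows can be screened against the vocabulary's bulk statistics. A minimal z-score check over row norms follows; note this is exactly the baseline the spectral-mimicry result shows a breaker can evade, so it is necessary rather than sufficient.

    import numpy as np

    def flag_outlier_rows(W, new_ids, z_thresh=3.0):
        # Flag added embedding rows whose L2 norm is a z-score outlier
        # relative to the full vocabulary; a mimicking breaker token is
        # optimized to pass this check.
        norms = np.linalg.norm(W, axis=1)
        z = (norms - norms.mean()) / norms.std()
        return [i for i in new_ids if abs(z[i]) > z_thresh]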

BibTeX

@misc{liu2025trojanvocabularystealthysabotage,
      title={The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition}, 
      author={Xiaoze Liu and Weichen Yu and Matt Fredrikson and Xiaoqian Wang and Jing Gao},
      year={2025},
      eprint={2601.00065},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.00065}, 
}