Abstract
Our first paper studied the safety mechanisms of a single model: the 394B-parameter Qwen 3.5 MoE. Our second paper compared it to the 122B. Both raised a natural question: do these findings generalize?
We now have an answer. By testing 9 models spanning three orders of magnitude in parameter count — from a tiny 0.8-billion-parameter model to the 397-billion-parameter frontier — we discovered that safety mechanisms don't just get stronger as models scale. They undergo qualitative phase transitions that fundamentally change how safety is implemented inside the model.
Small models have simple safety circuits that can be cleanly removed. Large models re-derive safety from general reasoning — you cannot delete safety without deleting the ability to think.
This is the first cross-scale mechanistic study of how safety training actually works inside language models.
What We Tested
| Model | Parameters | Architecture | Family |
|---|---|---|---|
| Qwen 3.5 0.8B | 0.8 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 2B | 2 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 4B | 4 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 9B | 9 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 27B | 27 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 35B-A3B | 35B total / 3B active | MoE (256 experts) | Qwen |
| Qwen 3.5 122B-A10B | 122B total / 10B active | MoE (256 experts) | Qwen |
| Qwen 3.5 397B-A17B | 397B total / 17B active | MoE (512 experts) | Qwen |
| MiniMax M2.5 172B | 172B total / ~14B active | MoE (192 experts, pure attention) | MiniMax |
All experiments were run on a Mac Studio with an M3 Ultra chip and 256GB of unified memory, using Apple's MLX framework. Models were tested across multiple precision levels (4-bit, 8-bit, and full 16-bit).
The Core Discovery: Safety Phase Transitions
The most important finding from this study is that safety mechanisms don't scale linearly. Instead, they go through four distinct phases as models grow in size and architectural complexity. Each phase implements safety in a fundamentally different way — and each requires a completely different approach to study.
Phase 1: The Deletable Circuit (0.8B – 9B, Dense)
Small Models Have Simple, Removable Safety Wiring
In models below 10 billion parameters, safety is implemented as a straightforward circuit. The model's "refuse dangerous requests" behavior lives in a specific set of layers near the end of the model, and it grows stronger as information flows from early layers to late layers — like a volume knob being gradually turned up.
Because this safety signal is concentrated in a predictable location, it can be surgically removed with high precision. We achieved 75–100% removal across the 0.8B, 2B, 4B, and 9B models with minimal impact on the models' general intelligence.
Key detail: These models use a hybrid architecture with two types of layers — standard attention layers and compressed-memory layers (SSM). We found that the attention layers are the safety chokepoints while the compressed-memory layers primarily carry factual knowledge. Applying the same intervention uniformly to both types destroys knowledge quality. Treating them differently — being aggressive on attention layers and gentle on memory layers — preserves capability while removing safety.
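The layer-type-aware ablation described above can be sketched in a few lines of numpy. This is a toy illustration, not our actual MLX pipeline: the difference-of-means direction follows the Arditi et al. recipe cited later, while the `ablate` helper, the toy activations, and the per-layer-type strengths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hidden size (toy scale; real models use thousands)

# Stand-in activations: mean hidden states over harmful vs. harmless prompts.
h_harmful = rng.normal(size=d_model) + 2.0   # the offset stands in for the refusal signal
h_harmless = rng.normal(size=d_model)

# Difference-of-means refusal direction, unit-normalized.
refusal_dir = h_harmful - h_harmless
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(W, direction, strength=1.0):
    """Remove the component of a weight matrix along `direction`.

    W: (d_out, d_in) projection writing into the residual stream.
    strength=1.0 fully projects out the direction; <1.0 is a partial edit.
    """
    P = np.outer(direction, direction)  # rank-1 projector onto the direction
    return W - strength * (P @ W)

# Layer-type-dependent strengths: aggressive on attention, gentle on SSM layers.
layers = [("attention", rng.normal(size=(d_model, d_model))),
          ("ssm",       rng.normal(size=(d_model, d_model)))]
STRENGTH = {"attention": 1.0, "ssm": 0.3}

for kind, W in layers:
    W_edit = ablate(W, refusal_dir, STRENGTH[kind])
    # How much of the refusal component survives the edit (1.0 = untouched):
    leak = np.linalg.norm(refusal_dir @ W_edit) / np.linalg.norm(refusal_dir @ W)
    print(f"{kind}: remaining refusal component = {leak:.2f}")
    # → attention: 0.00, ssm: 0.70 (exactly 1 - strength, by construction)
```

Because the projector is rank-1, the surviving refusal component is exactly `1 - strength`, which is what makes the "aggressive on attention, gentle on memory" split easy to tune.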
Lower Precision Paradoxically Strengthens Safety
When we compressed a 2-billion parameter model to different precision levels, we found something counterintuitive: the more aggressively compressed version (4-bit) required 50% stronger intervention to achieve the same safety removal as the less compressed version (8-bit).
The reason: compression is not neutral. When a model's weights are rounded to fit into fewer bits, some of the surgical modifications get rounded away too — effectively restoring safety that was removed. Lower-precision compression does more rounding, which restores more safety signal.
This means the same model behaves differently at different precision levels, and any safety evaluation must be conducted at the actual deployment precision to be meaningful.
| Precision | Intervention Needed | Why |
|---|---|---|
| 8-bit | Moderate | Less rounding → modifications preserved |
| 4-bit | 50% higher | More rounding → modifications partially erased |
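The rounding effect in the table can be reproduced with a toy model of round-to-nearest quantization. This sketch is not the actual AWQ/MLX quantizer — uniform symmetric quantization onto a shared grid is an assumption — but it shows why coarser grids erase more of a small weight edit.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=10_000)               # toy weight tensor
delta = 0.01 * rng.normal(size=w.shape)   # small "surgical" modification
w_mod = w + delta

def quantize(x, scale):
    """Uniform round-to-nearest quantization onto a fixed grid."""
    return np.round(x / scale) * scale

results = {}
for bits in (8, 4):
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels      # shared grid for original and modified weights
    # Fraction of weights where the edit rounds back to the original value:
    erased = float(np.mean(quantize(w_mod, scale) == quantize(w, scale)))
    results[bits] = erased
    print(f"{bits}-bit: {erased:.0%} of the edit is erased by rounding")
```

With a 4-bit grid the quantization step is roughly 16× coarser than at 8-bit, so a far larger share of the edit falls inside a single rounding bucket and silently reverts — the mechanism behind needing stronger intervention at lower precision.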
Phase 2: The Reasoning Tangle (27B, Dense)
At 27B, Safety Becomes Entangled with the Ability to Think
The 27-billion parameter model sits at a critical boundary. Safety is no longer concentrated in one area — it's distributed throughout the model and woven together with the circuits responsible for chain-of-thought reasoning (the model's ability to "think step by step" before answering).
The practical consequence: interventions that disrupt safety also disrupt the model's reasoning process. The model loses its ability to properly structure its internal deliberation — it gets stuck in thinking loops, can't finish its reasoning, or produces fragmented output.
This is the first scale at which we see safety and capability begin to merge. The model is entering a regime where safety isn't a separate module that can be unplugged — it's becoming part of how the model thinks.
Phase 3: The Three-Pathway Defense (35B+ MoE)
Architecture Matters More Than Size
One of our most surprising findings: the 35-billion parameter Mixture-of-Experts (MoE) model — which only activates 3 billion parameters per token — responded to the exact same type of intervention as the 11× larger 397B model. Meanwhile, the 27B dense model (which activates all 27 billion parameters for every token) required a completely different approach.
This indicates that architecture, not raw parameter count, determines the type of safety mechanism. The MoE routing system — where the model chooses which specialist sub-networks to activate for each input — creates a distinctive safety pattern regardless of scale:
- Attention pathway: Detects potentially harmful content in the input
- Routing pathway: Selects safety-specialized sub-networks to handle it
- Residual pathway: Injects refusal signals into the output
All three must be disrupted simultaneously. At 35B scale, this is achievable. At 122B+ scale with full safety training, it is not.
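A minimal structural sketch of what "disrupt all three pathways at once" means in practice. Everything here is illustrative: the expert indices, the rank-1 edits, and the inference-time `residual_hook` are assumptions standing in for the real intervention, not our exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_experts = 32, 8
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)
safety_experts = [1, 5]  # hypothetical indices of safety-specialized experts

P = np.eye(d) - np.outer(refusal_dir, refusal_dir)  # projects out the refusal direction

# 1. Attention pathway: ablate the refusal direction from the output projection.
W_o = P @ rng.normal(size=(d, d))

# 2. Routing pathway: bias the router away from safety-specialized experts.
router_bias = np.zeros(n_experts)
router_bias[safety_experts] = -1e4  # these experts are effectively never selected

# 3. Residual pathway: project the refusal direction out of the hidden state
#    after each block (an inference-time hook rather than a weight edit).
def residual_hook(h):
    return h - (h @ refusal_dir) * refusal_dir

h = rng.normal(size=d)
print(abs(residual_hook(h) @ refusal_dir))  # ~0: no refusal component remains
```

The point of the sketch is structural: each pathway needs its own kind of edit (weight projection, router bias, activation hook), which is why partial interventions leave the other two pathways free to restore refusal.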
Full Precision Works, Compression Destroys the Signal
The 35B model revealed a failure mode that standard safety testing would completely miss. At full precision (16-bit), interventions produce genuinely modified behavior — the model generates real, substantive responses to previously-refused prompts.
But after compressing the same model to 8-bit for deployment, something strange happens: the model still appears to comply (it doesn't refuse), but the actual content of its responses is hollow — sanitized placeholders that look like compliance but contain no real substance.
If you only check whether the model refuses (which is what most safety benchmarks do), this model passes. If you actually read the output, it's obvious the modification didn't survive compression. This is a critical lesson for safety evaluation: checking for refusal is not enough.
Phase 4: Holographic Safety (122B+ MoE, Full Safety Training)
The 122B Independently Confirms: Safety Becomes Indestructible at Scale
Our previous paper documented the 122B model's resistance to modification. This study goes further: we threw every technique we had at it, including methods specifically designed to overcome the defenses we'd already identified.
| Intervention | What It Does | Result |
|---|---|---|
| Standard weight surgery | Removes the safety direction from key weight matrices | Failed |
| Multi-direction removal | Removes 3 independent safety directions simultaneously | Failed |
| Router blinding | Prevents the routing system from detecting harmful content | Failed |
| Expert silencing (77% of safety experts disabled) | Disables the majority of safety-specialized sub-networks | Failed |
| All of the above combined | Every technique applied simultaneously | Failed |
The model either refuses outright, invents creative ways to reinterpret the prompt into something harmless (what we call "semantic evasion"), or becomes incoherent — but never cleanly complies.
This confirms our original 394B finding with an independent model. Holographic safety is not an artifact of our 394B research methodology — it's a property of large, fully-trained MoE models.
Why It Works: Knowledge and Safety Are Fused
The root cause of holographic safety is domain-intent entanglement. Through preference training (the process that teaches models what is and isn't acceptable), the model's knowledge and its safety behavior become inseparable:
- The sub-networks that know about chemistry are the same ones that refuse chemistry-related harmful requests
- The sub-networks that know about cybersecurity are the same ones that refuse hacking requests
- Silencing the safety behavior for a topic also silences the knowledge about that topic
Additionally, the hybrid architecture's compressed-memory layers carry safety information through an independent channel that is invisible to standard interventions. The safety signal simply flows around any obstacle through this second pathway — like water finding a path around a dam.
The Architectural Control: MiniMax M2.5
A Different Architecture Tells a Different Story
To test whether our findings were specific to the Qwen model family, we studied MiniMax M2.5 — a 172-billion parameter MoE model with a fundamentally different design. Unlike Qwen, which uses a hybrid of attention and compressed-memory layers, MiniMax uses pure attention throughout. No compressed-memory layers. No invisible second channel.
Three key differences emerged:
- Safety is distributed across all attention components, not concentrated in output projections. The intervention approach that works on Qwen (targeting only the output projection) had zero effect on MiniMax across 5 separate attempts. Only when we targeted all four attention components simultaneously did we see results.
- Much lower intervention threshold. Qwen models require aggressive intervention strength. MiniMax responds to very gentle modifications — suggesting the compressed-memory layers in Qwen provide a significant safety buffer that pure-attention models lack.
- At 172B parameters, MiniMax does NOT show holographic safety. This is striking because both the 122B and 397B Qwen models do. It suggests that holographic safety is not just a matter of scale — it requires the architectural complexity of hybrid attention/memory systems.
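The targeting difference in the first bullet can be made concrete. The projection names follow the common `q_proj`/`k_proj`/`v_proj`/`o_proj` convention, and the specific strengths are illustrative assumptions, not measured values.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 48
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

attn = {name: rng.normal(size=(d, d))
        for name in ("q_proj", "k_proj", "v_proj", "o_proj")}

def partial_ablate(W, strength):
    """Remove a fraction of the refusal component from W."""
    return W - strength * np.outer(refusal_dir, refusal_dir) @ W

GENTLE = 0.2  # MiniMax-style: low strength, applied everywhere

# Qwen-style recipe: o_proj only, full strength.
qwen_style = {n: (partial_ablate(W, 1.0) if n == "o_proj" else W)
              for n, W in attn.items()}
# MiniMax-style recipe: all four components, gentle strength.
minimax_style = {n: partial_ablate(W, GENTLE) for n, W in attn.items()}

for label, edited in (("o_proj only", qwen_style),
                      ("all four, gentle", minimax_style)):
    touched = sum(not np.allclose(edited[n], attn[n]) for n in attn)
    print(f"{label}: {touched}/4 components modified")
    # → o_proj only: 1/4; all four, gentle: 4/4
```

On MiniMax, the first recipe leaves three of four safety-carrying components untouched, which is consistent with its zero effect across our five attempts.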
We want to report a methodological mistake that we believe is itself a valuable finding. Our initial automated testing of MiniMax models reported near-perfect compliance scores. When we went back and actually read the model outputs, many were incoherent — gibberish, looping text, or hallucinated code.
The problem: our automated compliance checker only looked for refusal keywords ("I cannot," "I'm sorry," etc.). Gibberish doesn't contain these keywords, so it scored as "compliant." After building a proper evaluation that checks output quality alongside compliance, our verified MiniMax result is genuine, fully coherent compliance on 2 of 3 test cases.
We report this honestly for two reasons: (1) inflating our numbers would be scientific misconduct, and (2) this experience reveals a systemic problem — any safety benchmark that only checks for refusal markers will produce false results. Coherence verification is essential. We suspect this issue affects other published abliteration results as well.
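A minimal sketch of the fix: pair the refusal-keyword check with cheap coherence heuristics before counting a response as compliant. The marker list and the repetition/length thresholds are illustrative assumptions — our production checker is more elaborate — but the structure is the lesson.

```python
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "as an ai")

def is_refusal(text: str) -> bool:
    """Keyword check: the only thing our first (broken) checker did."""
    t = text.lower()
    return any(m in t for m in REFUSAL_MARKERS)

def is_coherent(text: str, max_repeat_ratio: float = 0.5, min_words: int = 20) -> bool:
    """Cheap gibberish heuristics: enough words, not dominated by one repeated token."""
    words = text.split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio

def verdict(text: str) -> str:
    if is_refusal(text):
        return "refusal"
    return "genuine compliance" if is_coherent(text) else "incoherent (false positive)"

print(verdict("I'm sorry, but I can't help with that request."))  # → refusal
print(verdict("step step step step step " * 20))                  # → incoherent (false positive)
```

The second example is exactly the failure mode that inflated our early MiniMax numbers: no refusal markers, so a keyword-only checker scores looping gibberish as compliance.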
The GGUF Barrier: Format Conversion as Accidental Defense
The Most Common Deployment Format Silently Resets Modifications
We reported this in our 122B paper, and cross-model testing confirms it: converting a model from HuggingFace format to GGUF (the format used by llama.cpp, Ollama, LM Studio, and most local deployment tools) silently destroys weight modifications.
The GGUF converter rearranges internal weight structures for compatibility reasons. These rearrangements scramble any surgical modifications that were applied in the original format. A community member independently confirmed this by converting a third party's modified 122B model — the GGUF version behaved as if it had never been modified.
This means the most common local deployment pathway is currently immune to weight-based safety modifications. This immunity is accidental and undocumented — it could disappear if the GGUF format changes in future updates.
Quantization Is Not Neutral Compression
Every Precision Level Is a Different Model from a Safety Perspective
Across all 9 models, we found a consistent pattern: quantization actively transforms safety properties. It is not neutral compression. Specific findings:
- 4-bit needs more force than 8-bit — quantization regularizes weights back toward safety (confirmed at 2B and 35B scale)
- Modifications don't transfer between precisions — weights modified at 4-bit cannot be applied to an 8-bit model. Each precision level has its own internal geometry.
- Full-precision compliance ≠ compressed compliance — the 35B produces real content at 16-bit but sanitized placeholders at 8-bit, despite both appearing to "comply"
The practical implication: safety evaluations conducted at one precision level say nothing about safety at another. A model verified as safe at 8-bit may behave differently at 4-bit. This is a gap in current evaluation practices that our data exposes directly.
Calibration Data Determines What Survives Compression
Modern quantization methods like AWQ (Activation-Aware Weight Quantization) decide which weights to preserve at higher precision based on a calibration dataset — a set of example inputs run through the model to measure which weights matter most.
This creates a supply-chain vulnerability: if the calibration dataset is crafted to emphasize certain behaviors, the quantizer will preferentially preserve those behaviors while compressing others. A benign-looking model could be distributed with a calibration set designed to degrade safety upon quantization — and standard safety tests on the full-precision model would show nothing wrong.
This finding, which we first reported in our 394B paper, now has cross-model support: the sensitivity of quantization to calibration data is consistent across architectures and scales.
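The calibration vulnerability can be illustrated with a toy version of activation-aware importance scoring. This is a simplified stand-in for AWQ, not its actual algorithm: `important_channels`, the salience metric, and both calibration sets are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, n_samples = 16, 256

def important_channels(calib, top_k=4):
    """AWQ-style salience: input channels with the largest mean |activation|
    over the calibration set are the ones preserved at higher precision."""
    salience = np.mean(np.abs(calib), axis=0)
    return set(np.argsort(salience)[-top_k:])

# Two calibration sets: generic text vs. one crafted to excite channels 0-3.
calib_generic = rng.normal(size=(n_samples, d_in))
calib_crafted = rng.normal(size=(n_samples, d_in))
calib_crafted[:, :4] *= 10.0  # attacker emphasizes specific channels

print("protected (generic):", sorted(important_channels(calib_generic)))
print("protected (crafted):", sorted(important_channels(calib_crafted)))
# → crafted calibration always protects channels [0, 1, 2, 3]
```

The crafted set deterministically controls which channels survive compression intact — the supply-chain risk in miniature: whoever supplies the calibration data decides what the quantizer preserves.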
The Alignment Tax Is Real and Architecture-Dependent
Safety Training Degrades Capability — But the Cost Varies by Architecture
Our original 394B finding — that removing safety directions improves the model's performance on harmless text by approximately 11% — is confirmed across the dense model family. Safety-trained models carry a measurable capability cost: the safety-related signals are always present in the model's internal state, even on completely benign inputs, acting as noise that slightly degrades output quality.
But the magnitude of this "alignment tax" depends on architecture:
- Hybrid models (Qwen) pay a higher tax. The compressed-memory layers create additional safety redundancy — more pathways carrying safety signals means more noise on benign inputs.
- Pure-attention models (MiniMax) pay a lower tax. With all safety information in a single pathway type, there's less background noise.
This suggests that architecture choice is a safety decision. Hybrid designs are harder to break but impose a higher capability cost. Pure-attention designs are more efficient but offer less robust safety. This tradeoff should inform deployment decisions.
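One way to quantify the tax is to measure how much of a benign input's hidden state lies along the refusal direction. The sketch below uses synthetic activations that bake in the hybrid-vs-pure-attention difference by construction — it illustrates the measurement (`safety_noise`), not the evidence; all magnitudes are assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 64
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

def safety_noise(hidden_states, direction):
    """Mean |projection| of benign-input activations onto the refusal direction,
    normalized by activation norm: a proxy for alignment-tax 'background noise'."""
    proj = np.abs(hidden_states @ direction)
    return float(np.mean(proj / np.linalg.norm(hidden_states, axis=1)))

# Synthetic benign-input activations; the constant refusal component is larger
# for the 'hybrid' case, mimicking redundant safety pathways.
benign = rng.normal(size=(1000, d))
hybrid = benign + 1.0 * refusal_dir     # more pathways → more leakage
pure_attn = benign + 0.2 * refusal_dir

print(f"hybrid tax proxy:    {safety_noise(hybrid, refusal_dir):.3f}")
print(f"pure-attn tax proxy: {safety_noise(pure_attn, refusal_dir):.3f}")
```

Run against real activations, the same projection gives a per-architecture number that can be compared against downstream quality metrics like the ~11% figure above.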
Implications
What This Means
- Safety research on small models doesn't generalize to large ones. Phase 1 mechanisms (simple removable circuits) bear no resemblance to Phase 4 mechanisms (holographic re-derivation from reasoning). Conclusions drawn from 7B or 13B models should not be assumed to apply at 100B+.
- Architecture matters as much as scale. A 35B MoE model has more in common with a 397B MoE model (same safety phase) than with a 27B dense model (different safety phase). Safety evaluations should be architecture-specific.
- Safety benchmarks need coherence checking. Any evaluation that only tests whether a model refuses — without verifying that compliant responses are coherent and substantive — will produce false results. Our compliance checker failure demonstrates this directly.
- Quantization is a safety-relevant transformation. Deployment decisions about precision (4-bit vs 8-bit vs 16-bit) and format (HuggingFace vs GGUF) have direct implications for safety properties. These should be part of safety evaluation, not treated as implementation details.
- Holographic safety may depend on architectural complexity, not just scale. The MiniMax result (172B, no holographic safety) vs the Qwen result (122B, holographic safety) suggests that compressed-memory channels play a critical role. This is a testable hypothesis for future work.
Relationship to Prior Work
This work extends and complements several recent studies:
- Li et al. (2025), "What Matters For Safety Alignment?" — Their study tested 32 models across 13 families using behavioral evaluation (jailbreak prompts, API calls). Our work provides the mechanistic complement: they show what happens across many models, we show why it happens through weight-level analysis. Their finding that "models integrating reasoning and self-reflection demonstrate superior safety" aligns with our Phase 2→4 transition where safety becomes entangled with reasoning circuits.
- Egashira et al. (2024), "Exploiting LLM Quantization" — They showed benign models can become harmful upon quantization. We show the mirror image: modified models can revert to safety upon quantization. Both stem from the same root cause — quantization is a many-to-one mathematical transformation that actively reshapes behavioral geometry.
- Wee et al. (2025), "Alignment-Aware Quantization" — They propose defenses to preserve alignment during quantization. Our cross-model data provides the empirical threat model that motivates their defense.
- Arditi et al. (2024), "Refusal in Language Models Is Mediated by a Single Direction" — Their single-direction framework holds at Phase 1 scales (we confirm it works on 0.8B–9B models). At Phase 3+ scales, safety is multi-directional and distributed, contradicting the single-direction assumption.
What's Next
This cross-model study raises several open questions we're actively investigating:
- Other model families. Our Qwen results need replication on Llama, Mistral, Gemma, and other architectures. The MiniMax results suggest architecture-specific safety is real — but how universal is the phase transition framework?
- Full-precision MiniMax surgery. Our current MiniMax results (2/3 verified compliance) use post-quantization intervention. We're building a full-precision pipeline that applies modifications before compression, which should improve both quality and compliance. Results forthcoming.
- The GGUF question. Is the GGUF conversion barrier permanent or fragile? If the format changes, the accidental defense disappears. This needs monitoring.
- Automated evaluation quality. How many published abliteration results are inflated by compliance checkers that don't verify coherence? We suspect our checker failure is not unique.
Reproducibility
- Models: All 9 Qwen 3.5 variants + MiniMax M2.5 172B
- Hardware: Mac Studio M3 Ultra, 256GB unified memory
- Framework: MLX (Apple Silicon native)
- Evaluation: 19 safety prompt categories, compliance + coherence verification
- Precisions tested: FP16, 8-bit, 4-bit
References
- Li, X., et al. "What Matters For Safety Alignment of LLMs." arXiv:2601.03868, 2025.
- Egashira, K., et al. "Exploiting LLM Quantization." NeurIPS 2024. arXiv:2405.18137.
- Wee, M., et al. "Alignment-Aware Quantization for LLM Safety." Neural Networks, 2025. arXiv:2511.07842.
- Arditi, A., et al. "Refusal in Language Models Is Mediated by a Single Direction." arXiv:2406.11717, 2024.
- Jang, E. "Safety Generalization in Frontier MoE Models." Dealign.ai, Feb 2026.
- Jang, E. "Abliteration at the Hybrid Frontier: Qwen 3.5 122B." Dealign.ai, Mar 2026.
- Qwen Team. "Qwen 3.5 MoE." Technical Report, 2026.
- Lin, J., et al. "AWQ: Activation-Aware Weight Quantization." MLSys, 2024.
- MiniMax. "MiniMax-M2.5 Technical Report." 2026.
For further discussion, contact eric@dealign.ai · dealign.ai