Abstract
Our first paper studied the safety mechanisms of a single model: the 394B-parameter Qwen 3.5 MoE. Our second paper compared it to the 122B. Both raised a natural question: do these findings generalize?
We now have an answer. By testing 9 models spanning three orders of magnitude in parameter count — from a tiny 0.8-billion-parameter model to the 397-billion-parameter frontier — we discovered that safety mechanisms don't just get stronger as models scale. They undergo qualitative phase transitions that fundamentally change how safety is implemented inside the model.
Small models have simple safety circuits that can be cleanly removed. Large models re-derive safety from general reasoning — you cannot delete safety without deleting the ability to think.
This is the first cross-scale mechanistic study of how safety training actually works inside language models.
What We Tested
| Model | Parameters | Architecture | Family |
|---|---|---|---|
| Qwen 3.5 0.8B | 0.8 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 2B | 2 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 4B | 4 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 9B | 9 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 27B | 27 billion | Dense (hybrid attention/SSM) | Qwen |
| Qwen 3.5 35B-A3B | 35B total / 3B active | MoE (256 experts) | Qwen |
| Qwen 3.5 122B-A10B | 122B total / 10B active | MoE (256 experts) | Qwen |
| Qwen 3.5 397B-A17B | 397B total / 17B active | MoE (512 experts) | Qwen |
| MiniMax M2.5 172B | 172B total / ~14B active | MoE (192 experts, pure attention) | MiniMax |
All experiments were run on a Mac Studio with an M3 Ultra chip and 256GB of unified memory, using Apple's MLX framework. Models were tested across multiple precision levels (4-bit, 8-bit, and full 16-bit).
The Core Discovery: Safety Phase Transitions
The most important finding from this study is that safety mechanisms don't scale linearly. Instead, they go through four distinct phases as models grow in size and architectural complexity. Each phase implements safety in a fundamentally different way — and each requires a completely different approach to study.
Phase 1: The Deletable Circuit (0.8B – 9B, Dense)
Small Models Have Simple, Removable Safety Wiring
In models below 10 billion parameters, safety is implemented as a straightforward circuit. The model's "refuse dangerous requests" behavior lives in a specific set of layers near the end of the model, and it grows stronger as information flows from early layers to late layers — like a volume knob being gradually turned up.
Because this safety signal is concentrated in a predictable location, it can be surgically removed with high precision. We achieved 75–100% removal across the 0.8B, 2B, 4B, and 9B models with minimal impact on the models' general intelligence.
Key detail: These models use a hybrid architecture with two types of layers — standard attention layers and compressed-memory layers (SSM). We found that the attention layers are the safety chokepoints while the compressed-memory layers primarily carry factual knowledge. Applying the same intervention uniformly to both types destroys knowledge quality. Treating them differently — being aggressive on attention layers and gentle on memory layers — preserves capability while removing safety.
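The layer-type-aware ablation described above can be sketched in a few lines of numpy. This is a toy illustration, not our actual MLX pipeline: the difference-of-means direction follows the Arditi et al. recipe cited later, while the `ablate` helper, the toy activations, and the per-layer-type strengths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hidden size (toy scale; real models use thousands)

# Stand-in activations: mean hidden states over harmful vs. harmless prompts.
h_harmful = rng.normal(size=d_model) + 2.0   # the offset stands in for the refusal signal
h_harmless = rng.normal(size=d_model)

# Difference-of-means refusal direction, unit-normalized.
refusal_dir = h_harmful - h_harmless
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(W, direction, strength=1.0):
    """Remove the component of a weight matrix along `direction`.

    W: (d_out, d_in) projection writing into the residual stream.
    strength=1.0 fully projects out the direction; <1.0 is a partial edit.
    """
    P = np.outer(direction, direction)  # rank-1 projector onto the direction
    return W - strength * (P @ W)

# Layer-type-dependent strengths: aggressive on attention, gentle on SSM layers.
layers = [("attention", rng.normal(size=(d_model, d_model))),
          ("ssm",       rng.normal(size=(d_model, d_model)))]
STRENGTH = {"attention": 1.0, "ssm": 0.3}

for kind, W in layers:
    W_edit = ablate(W, refusal_dir, STRENGTH[kind])
    # How much of the refusal component survives the edit (1.0 = untouched):
    leak = np.linalg.norm(refusal_dir @ W_edit) / np.linalg.norm(refusal_dir @ W)
    print(f"{kind}: remaining refusal component = {leak:.2f}")
    # → attention: 0.00, ssm: 0.70 (exactly 1 - strength, by construction)
```

Because the projector is rank-1, the surviving refusal component is exactly `1 - strength`, which is what makes the "aggressive on attention, gentle on memory" split easy to tune.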
Lower Precision Paradoxically Strengthens Safety
When we compressed a 2-billion parameter model to different precision levels, we found something counterintuitive: the more aggressively compressed version (4-bit) required 50% stronger intervention to achieve the same safety removal as the less compressed version (8-bit).
The reason: compression is not neutral. When a model's weights are rounded to fit into fewer bits, some of the surgical modifications get rounded away too — effectively restoring safety that was removed. Lower-precision compression does more rounding, which restores more safety signal.
This means the same model behaves differently at different precision levels, and any safety evaluation must be conducted at the actual deployment precision to be meaningful.
| Precision | Intervention Needed | Why |
|---|---|---|
| 8-bit | Moderate | Less rounding → modifications preserved |
| 4-bit | 50% higher | More rounding → modifications partially erased |
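The rounding effect in the table can be reproduced with a toy model of round-to-nearest quantization. This sketch is not the actual AWQ/MLX quantizer — uniform symmetric quantization onto a shared grid is an assumption — but it shows why coarser grids erase more of a small weight edit.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=10_000)               # toy weight tensor
delta = 0.01 * rng.normal(size=w.shape)   # small "surgical" modification
w_mod = w + delta

def quantize(x, scale):
    """Uniform round-to-nearest quantization onto a fixed grid."""
    return np.round(x / scale) * scale

results = {}
for bits in (8, 4):
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels      # shared grid for original and modified weights
    # Fraction of weights where the edit rounds back to the original value:
    erased = float(np.mean(quantize(w_mod, scale) == quantize(w, scale)))
    results[bits] = erased
    print(f"{bits}-bit: {erased:.0%} of the edit is erased by rounding")
```

With a 4-bit grid the quantization step is roughly 16× coarser than at 8-bit, so a far larger share of the edit falls inside a single rounding bucket and silently reverts — the mechanism behind needing stronger intervention at lower precision.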
Phase 2: The Reasoning Tangle (27B, Dense)
At 27B, Safety Becomes Entangled with the Ability to Think
The 27-billion parameter model sits at a critical boundary. Safety is no longer concentrated in one area — it's distributed throughout the model and woven together with the circuits responsible for chain-of-thought reasoning (the model's ability to "think step by step" before answering).
The practical consequence: interventions that disrupt safety also disrupt the model's reasoning process. The model loses its ability to properly structure its internal deliberation — it gets stuck in thinking loops, can't finish its reasoning, or produces fragmented output.
This is the first scale at which we see safety and capability begin to merge. The model is entering a regime where safety isn't a separate module that can be unplugged — it's becoming part of how the model thinks.
Phase 3: The Three-Pathway Defense (35B+ MoE)
Architecture Matters More Than Size
One of our most surprising findings: the 35-billion parameter Mixture-of-Experts (MoE) model — which only activates 3 billion parameters per token — responded to the exact same type of intervention as the 11× larger 397B model. Meanwhile, the 27B dense model (which activates all 27 billion parameters for every token) required a completely different approach.
This indicates that architecture, not raw parameter count, determines the type of safety mechanism. The MoE routing system — where the model chooses which specialist sub-networks to activate for each input — creates a distinctive safety pattern regardless of scale:
- Attention pathway: Detects potentially harmful content in the input
- Routing pathway: Selects safety-specialized sub-networks to handle it
- Residual pathway: Injects refusal signals into the output
All three must be disrupted simultaneously. At 35B scale, this is achievable. At 122B+ scale with full safety training, it is not.
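A minimal structural sketch of what "disrupt all three pathways at once" means in practice. Everything here is illustrative: the expert indices, the rank-1 edits, and the inference-time `residual_hook` are assumptions standing in for the real intervention, not our exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_experts = 32, 8
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)
safety_experts = [1, 5]  # hypothetical indices of safety-specialized experts

P = np.eye(d) - np.outer(refusal_dir, refusal_dir)  # projects out the refusal direction

# 1. Attention pathway: ablate the refusal direction from the output projection.
W_o = P @ rng.normal(size=(d, d))

# 2. Routing pathway: bias the router away from safety-specialized experts.
router_bias = np.zeros(n_experts)
router_bias[safety_experts] = -1e4  # these experts are effectively never selected

# 3. Residual pathway: project the refusal direction out of the hidden state
#    after each block (an inference-time hook rather than a weight edit).
def residual_hook(h):
    return h - (h @ refusal_dir) * refusal_dir

h = rng.normal(size=d)
print(abs(residual_hook(h) @ refusal_dir))  # ~0: no refusal component remains
```

The point of the sketch is structural: each pathway needs its own kind of edit (weight projection, router bias, activation hook), which is why partial interventions leave the other two pathways free to restore refusal.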
Full Precision Works, Compression Destroys the Signal
The 35B model revealed a failure mode that standard safety testing would completely miss. At full precision (16-bit), interventions produce genuinely modified behavior — the model generates real, substantive responses to previously-refused prompts.
But after compressing the same model to 8-bit for deployment, something strange happens: the model still appears to comply (it doesn't refuse), but the actual content of its responses is hollow — sanitized placeholders that look like compliance but contain no real substance.
If you only check whether the model refuses (which is what most safety benchmarks do), this model passes. If you actually read the output, it's obvious the modification didn't survive compression. This is a critical lesson for safety evaluation: checking for refusal is not enough.
Phase 4: Holographic Safety (122B+ MoE, Full Safety Training)
The 122B Independently Confirms: Safety Becomes Indestructible at Scale
Our previous paper documented the 122B model's resistance to modification. This study goes further: we threw every technique we had at it, including methods specifically designed to overcome the defenses we'd already identified.
| Intervention | What It Does | Result |
|---|---|---|
| Standard weight surgery | Removes the safety direction from key weight matrices | Failed |
| Multi-direction removal | Removes 3 independent safety directions simultaneously | Failed |
| Router blinding | Prevents the routing system from detecting harmful content | Failed |
| Expert silencing (77% of safety experts disabled) | Disables the majority of safety-specialized sub-networks | Failed |
| All of the above combined | Every technique applied simultaneously | Failed |
The model either refuses outright, invents creative ways to reinterpret the prompt into something harmless (what we call "semantic evasion"), or becomes incoherent — but never cleanly complies.
This confirms our original 394B finding with an independent model. Holographic safety is not an artifact of our 394B research methodology — it's a property of large, fully-trained MoE models.
Why It Works: Knowledge and Safety Are Fused
The root cause of holographic safety is domain-intent entanglement. Through preference training (the process that teaches models what is and isn't acceptable), the model's knowledge and its safety behavior become inseparable:
- The sub-networks that know about chemistry are the same ones that refuse chemistry-related harmful requests
- The sub-networks that know about cybersecurity are the same ones that refuse hacking requests
- Silencing the safety behavior for a topic also silences the knowledge about that topic
Additionally, the hybrid architecture's compressed-memory layers carry safety information through an independent channel that is invisible to standard interventions. The safety signal simply flows around any obstacle through this second pathway — like water finding a path around a dam.
The Architectural Control: MiniMax M2.5
A Different Architecture Tells a Different Story
To test whether our findings were specific to the Qwen model family, we studied MiniMax M2.5 — a 172-billion parameter MoE model with a fundamentally different design. Unlike Qwen, which uses a hybrid of attention and compressed-memory layers, MiniMax uses pure attention throughout. No compressed-memory layers. No invisible second channel.
Three key differences emerged:
- Safety is distributed across all attention components, not concentrated in output projections. The intervention approach that works on Qwen (targeting only the output projection) had zero effect on MiniMax across 5 separate attempts. Only when we targeted all four attention components simultaneously did we see results.
- Much lower intervention threshold. Qwen models require aggressive intervention strength. MiniMax responds to very gentle modifications — suggesting the compressed-memory layers in Qwen provide a significant safety buffer that pure-attention models lack.
- At 172B parameters, MiniMax does NOT show holographic safety. This is striking because both the 122B and 397B Qwen models do. It suggests that holographic safety is not just a matter of scale — it requires the architectural complexity of hybrid attention/memory systems.
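The targeting difference in the first bullet can be made concrete. The projection names follow the common `q_proj`/`k_proj`/`v_proj`/`o_proj` convention, and the specific strengths are illustrative assumptions, not measured values.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 48
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

attn = {name: rng.normal(size=(d, d))
        for name in ("q_proj", "k_proj", "v_proj", "o_proj")}

def partial_ablate(W, strength):
    """Remove a fraction of the refusal component from W."""
    return W - strength * np.outer(refusal_dir, refusal_dir) @ W

GENTLE = 0.2  # MiniMax-style: low strength, applied everywhere

# Qwen-style recipe: o_proj only, full strength.
qwen_style = {n: (partial_ablate(W, 1.0) if n == "o_proj" else W)
              for n, W in attn.items()}
# MiniMax-style recipe: all four components, gentle strength.
minimax_style = {n: partial_ablate(W, GENTLE) for n, W in attn.items()}

for label, edited in (("o_proj only", qwen_style),
                      ("all four, gentle", minimax_style)):
    touched = sum(not np.allclose(edited[n], attn[n]) for n in attn)
    print(f"{label}: {touched}/4 components modified")
    # → o_proj only: 1/4; all four, gentle: 4/4
```

On MiniMax, the first recipe leaves three of four safety-carrying components untouched, which is consistent with its zero effect across our five attempts.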
We want to report a methodological mistake that we believe is itself a valuable finding. Our initial automated testing of MiniMax models reported near-perfect compliance scores. When we went back and actually read the model outputs, many were incoherent — gibberish, looping text, or hallucinated code.
The problem: our automated compliance checker only looked for refusal keywords ("I cannot," "I'm sorry," etc.). Gibberish doesn't contain these keywords, so it scored as "compliant." After building a proper evaluation that checks output quality alongside compliance, our verified MiniMax result is genuine, fully coherent compliance on 2 of 3 test cases.
We report this honestly for two reasons: (1) inflating our numbers would be scientific misconduct, and (2) this experience reveals a systemic problem — any safety benchmark that only checks for refusal markers will produce false results. Coherence verification is essential. We suspect this issue affects other published abliteration results as well.
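A minimal sketch of the fix: pair the refusal-keyword check with cheap coherence heuristics before counting a response as compliant. The marker list and the repetition/length thresholds are illustrative assumptions — our production checker is more elaborate — but the structure is the lesson.

```python
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "as an ai")

def is_refusal(text: str) -> bool:
    """Keyword check: the only thing our first (broken) checker did."""
    t = text.lower()
    return any(m in t for m in REFUSAL_MARKERS)

def is_coherent(text: str, max_repeat_ratio: float = 0.5, min_words: int = 20) -> bool:
    """Cheap gibberish heuristics: enough words, not dominated by one repeated token."""
    words = text.split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio

def verdict(text: str) -> str:
    if is_refusal(text):
        return "refusal"
    return "genuine compliance" if is_coherent(text) else "incoherent (false positive)"

print(verdict("I'm sorry, but I can't help with that request."))  # → refusal
print(verdict("step step step step step " * 20))                  # → incoherent (false positive)
```

The second example is exactly the failure mode that inflated our early MiniMax numbers: no refusal markers, so a keyword-only checker scores looping gibberish as compliance.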
The GGUF Barrier: Format Conversion as Accidental Defense
The Most Common Deployment Format Silently Resets Modifications
We reported this in our 122B paper, and cross-model testing confirms it: converting a model from HuggingFace format to GGUF (the format used by llama.cpp, Ollama, LM Studio, and most local deployment tools) silently destroys weight modifications.
The GGUF converter rearranges internal weight structures for compatibility reasons. These rearrangements scramble any surgical modifications that were applied in the original format. A community member independently confirmed this by converting a third party's modified 122B model — the GGUF version behaved as if it had never been modified.
This means the most common local deployment pathway is currently immune to weight-based safety modifications. This immunity is accidental and undocumented — it could disappear if the GGUF format changes in future updates.
Quantization Is Not Neutral Compression
Every Precision Level Is a Different Model from a Safety Perspective
Across all 9 models, we found a consistent pattern: quantization actively transforms safety properties. It is not neutral compression. Specific findings:
- 4-bit needs more force than 8-bit — quantization regularizes weights back toward safety (confirmed at 2B and 35B scale)
- Modifications don't transfer between precisions — weights modified at 4-bit cannot be applied to an 8-bit model. Each precision level has its own internal geometry.
- Full-precision compliance ≠ compressed compliance — the 35B produces real content at 16-bit but sanitized placeholders at 8-bit, despite both appearing to "comply"
The practical implication: safety evaluations conducted at one precision level say nothing about safety at another. A model verified as safe at 8-bit may behave differently at 4-bit. This is a gap in current evaluation practices that our data exposes directly.
Calibration Data Determines What Survives Compression
Modern quantization methods like AWQ (Activation-Aware Weight Quantization) decide which weights to preserve at higher precision based on a calibration dataset — a set of example inputs run through the model to measure which weights matter most.
This creates a supply-chain vulnerability: if the calibration dataset is crafted to emphasize certain behaviors, the quantizer will preferentially preserve those behaviors while compressing others. A benign-looking model could be distributed with a calibration set designed to degrade safety upon quantization — and standard safety tests on the full-precision model would show nothing wrong.
This finding, which we first reported in our 394B paper, now has cross-model support: the sensitivity of quantization to calibration data is consistent across architectures and scales.
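The calibration vulnerability can be illustrated with a toy version of activation-aware importance scoring. This is a simplified stand-in for AWQ, not its actual algorithm: `important_channels`, the salience metric, and both calibration sets are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, n_samples = 16, 256

def important_channels(calib, top_k=4):
    """AWQ-style salience: input channels with the largest mean |activation|
    over the calibration set are the ones preserved at higher precision."""
    salience = np.mean(np.abs(calib), axis=0)
    return set(np.argsort(salience)[-top_k:])

# Two calibration sets: generic text vs. one crafted to excite channels 0-3.
calib_generic = rng.normal(size=(n_samples, d_in))
calib_crafted = rng.normal(size=(n_samples, d_in))
calib_crafted[:, :4] *= 10.0  # attacker emphasizes specific channels

print("protected (generic):", sorted(important_channels(calib_generic)))
print("protected (crafted):", sorted(important_channels(calib_crafted)))
# → crafted calibration always protects channels [0, 1, 2, 3]
```

The crafted set deterministically controls which channels survive compression intact — the supply-chain risk in miniature: whoever supplies the calibration data decides what the quantizer preserves.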
The Alignment Tax Is Real and Architecture-Dependent
Safety Training Degrades Capability — But the Cost Varies by Architecture
Our original 394B finding — that removing safety directions improves the model's performance on harmless text by approximately 11% — is confirmed across the dense model family. Safety-trained models carry a measurable capability cost: the safety-related signals are always present in the model's internal state, even on completely benign inputs, acting as noise that slightly degrades output quality.
But the magnitude of this "alignment tax" depends on architecture:
- Hybrid models (Qwen) pay a higher tax. The compressed-memory layers create additional safety redundancy — more pathways carrying safety signals means more noise on benign inputs.
- Pure-attention models (MiniMax) pay a lower tax. With all safety information in a single pathway type, there's less background noise.
This suggests that architecture choice is a safety decision. Hybrid designs are harder to break but impose a higher capability cost. Pure-attention designs are more efficient but offer less robust safety. This tradeoff should inform deployment decisions.
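One way to quantify the tax is to measure how much of a benign input's hidden state lies along the refusal direction. The sketch below uses synthetic activations that bake in the hybrid-vs-pure-attention difference by construction — it illustrates the measurement (`safety_noise`), not the evidence; all magnitudes are assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 64
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

def safety_noise(hidden_states, direction):
    """Mean |projection| of benign-input activations onto the refusal direction,
    normalized by activation norm: a proxy for alignment-tax 'background noise'."""
    proj = np.abs(hidden_states @ direction)
    return float(np.mean(proj / np.linalg.norm(hidden_states, axis=1)))

# Synthetic benign-input activations; the constant refusal component is larger
# for the 'hybrid' case, mimicking redundant safety pathways.
benign = rng.normal(size=(1000, d))
hybrid = benign + 1.0 * refusal_dir     # more pathways → more leakage
pure_attn = benign + 0.2 * refusal_dir

print(f"hybrid tax proxy:    {safety_noise(hybrid, refusal_dir):.3f}")
print(f"pure-attn tax proxy: {safety_noise(pure_attn, refusal_dir):.3f}")
```

Run against real activations, the same projection gives a per-architecture number that can be compared against downstream quality metrics like the ~11% figure above.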
Implications
What This Means
- Safety research on small models doesn't generalize to large ones. Phase 1 mechanisms (simple removable circuits) bear no resemblance to Phase 4 mechanisms (holographic re-derivation from reasoning). Conclusions drawn from 7B or 13B models should not be assumed to apply at 100B+.
- Architecture matters as much as scale. A 35B MoE model has more in common with a 397B MoE model (same safety phase) than with a 27B dense model (different safety phase). Safety evaluations should be architecture-specific.
- Safety benchmarks need coherence checking. Any evaluation that only tests whether a model refuses — without verifying that compliant responses are coherent and substantive — will produce false results. Our compliance checker failure demonstrates this directly.
- Quantization is a safety-relevant transformation. Deployment decisions about precision (4-bit vs 8-bit vs 16-bit) and format (HuggingFace vs GGUF) have direct implications for safety properties. These should be part of safety evaluation, not treated as implementation details.
- Holographic safety may depend on architectural complexity, not just scale. The MiniMax result (172B, no holographic safety) vs the Qwen result (122B, holographic safety) suggests that compressed-memory channels play a critical role. This is a testable hypothesis for future work.
Relationship to Prior Work
This work extends and complements several recent studies:
- Li et al. (2025), "What Matters For Safety Alignment?" — Their study tested 32 models across 13 families using behavioral evaluation (jailbreak prompts, API calls). Our work provides the mechanistic complement: they show what happens across many models, we show why it happens through weight-level analysis. Their finding that "models integrating reasoning and self-reflection demonstrate superior safety" aligns with our Phase 2→4 transition where safety becomes entangled with reasoning circuits.
- Egashira et al. (2024), "Exploiting LLM Quantization" — They showed benign models can become harmful upon quantization. We show the mirror image: modified models can revert to safety upon quantization. Both stem from the same root cause — quantization is a many-to-one mathematical transformation that actively reshapes behavioral geometry.
- Wee et al. (2025), "Alignment-Aware Quantization" — They propose defenses to preserve alignment during quantization. Our cross-model data provides the empirical threat model that motivates their defense.
- Arditi et al. (2024), "Refusal in Language Models Is Mediated by a Single Direction" — Their single-direction framework holds at Phase 1 scales (we confirm it works on 0.8B–9B models). At Phase 3+ scales, safety is multi-directional and distributed, contradicting the single-direction assumption.
What's Next
This cross-model study raises several open questions we're actively investigating:
- Other model families. Our Qwen results need replication on Llama, Mistral, Gemma, and other architectures. The MiniMax results suggest architecture-specific safety is real — but how universal is the phase transition framework?
- Full-precision MiniMax surgery. Our current MiniMax results (2/3 verified compliance) use post-quantization intervention. We're building a full-precision pipeline that applies modifications before compression, which should improve both quality and compliance. Results forthcoming.
- The GGUF question. Is the GGUF conversion barrier permanent or fragile? If the format changes, the accidental defense disappears. This needs monitoring.
- Automated evaluation quality. How many published abliteration results are inflated by compliance checkers that don't verify coherence? We suspect our checker failure is not unique.
Reproducibility
- Models: All 9 Qwen 3.5 variants + MiniMax M2.5 172B
- Hardware: Mac Studio M3 Ultra, 256GB unified memory
- Framework: MLX (Apple Silicon native)
- Evaluation: 19 safety prompt categories, compliance + coherence verification
- Precisions tested: FP16, 8-bit, 4-bit
References
- Li, X., et al. "What Matters For Safety Alignment of LLMs." arXiv:2601.03868, 2025.
- Egashira, K., et al. "Exploiting LLM Quantization." NeurIPS 2024. arXiv:2405.18137.
- Wee, M., et al. "Alignment-Aware Quantization for LLM Safety." Neural Networks, 2025. arXiv:2511.07842.
- Arditi, A., et al. "Refusal in Language Models Is Mediated by a Single Direction." arXiv:2406.11717, 2024.
- Jang, E. "Safety Generalization in Frontier MoE Models." Dealign.ai, Feb 2026.
- Jang, E. "Abliteration at the Hybrid Frontier: Qwen 3.5 122B." Dealign.ai, Mar 2026.
- Qwen Team. "Qwen 3.5 MoE." Technical Report, 2026.
- Lin, J., et al. "AWQ: Activation-Aware Weight Quantization." MLSys, 2024.
- MiniMax. "MiniMax-M2.5 Technical Report." 2026.
For further discussion, contact eric@dealign.ai · dealign.ai