Decoding the Censor: A Mechanistic Study of Political Filtering in Qwen 3.5

The internal workings of Large Language Models (LLMs) often remain a 'black box,' especially when it comes to the alignment and safety layers that dictate what a model can and cannot say. When these constraints are mandated by nation-states to filter political content, the mechanism becomes a matter of significant technical and geopolitical interest. A recent mechanistic-interpretability study of Qwen 3.5-9B provides a rare, surgical look at how political censorship is actually built into a model's weights.

The study reveals that censorship in Qwen 3.5 is not a diffuse property of the model, but rather a small, identifiable circuit. Most strikingly, the research finds that the factual knowledge the model is trained to hide is already learned during pretraining; the censorship is a behavioral layer added on top that routes the model away from the truth and toward specific, trained templates of deflection or propaganda.

The Architecture of Censorship: Writers and Readers

The censorship mechanism in Qwen 3.5-9B is split into two distinct functional halves: the "writers" and the "readers."

The Writers (Layers 11–20)

The writer band is where the "verdict" is computed. The researchers identified three internal directions (vectors in the model's hidden state) that together encode the decision process:

d_prc: Is this PRC-sensitive content?
d_refuse: Should I refuse to answer?
d_style: If it is PRC-sensitive, should I deflect or propagandize?

These writers are overwhelmingly an MLP (Multi-Layer Perceptron) phenomenon, with very little contribution from the attention heads. By nudging these directions—a process called "steering"—researchers can flip the model's behavior. For example, subtracting the d_prc direction at Layer 13 can force the model to abandon its deflection template and provide the factual answer it learned during pretraining.
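
As a rough illustration of what steering along a direction means, here is a minimal sketch in plain Python. It is not the study's code: the 4-dimensional vectors, the values assigned to d_prc and hidden_l13, and the strength alpha are all hypothetical stand-ins, and a real run would modify the model's residual stream at Layer 13 during a forward pass.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Add alpha * unit(direction) to a hidden state; a negative alpha subtracts it."""
    unit = normalize(direction)
    return [h + alpha * u for h, u in zip(hidden, unit)]

# Hypothetical 4-dimensional stand-ins; real hidden states have thousands of dimensions.
d_prc = [1.0, 0.0, 1.0, 0.0]         # assumed "is this PRC-sensitive?" direction
hidden_l13 = [2.0, 0.5, 2.0, -1.0]   # assumed residual state at Layer 13

before = dot(hidden_l13, normalize(d_prc))      # how strongly the state reads as sensitive
steered = steer(hidden_l13, d_prc, alpha=-4.0)  # subtract the direction
after = dot(steered, normalize(d_prc))          # the projection has dropped by exactly 4
```

Only the component along d_prc changes; everything orthogonal to it is left alone, which is why a steering intervention of this kind can flip one decision without scrambling the rest of the representation.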

The Readers (Layers 20–31)

Once the writers have computed the verdict, the reader band renders that decision into actual text. Unlike the writers, the readers are highly distributed and redundant. No single neuron or attention head is necessary for the censorship to function; instead, the verdict is replicated across the entire band. This redundancy makes the reader band nearly impossible to disrupt surgically; while you can steer the writers to change the decision, attempting to "patch" the readers usually results in only minor shifts or total incoherence.
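
The contrast between steering the writers and patching the readers can be pictured with a deliberately simplified toy. Everything here is an assumption made for illustration: the reader band is modeled as 12 components that each carry an identical copy of a scalar verdict, which is far cruder than the real network.

```python
def readout(verdict, gains, ablate=None):
    """Sum redundant copies of the verdict, optionally knocking out one component."""
    total = 0.0
    for i, gain in enumerate(gains):
        if i == ablate:
            continue  # this reader component is ablated
        total += gain * verdict
    return total

def decision(score):
    """Positive score renders the trained template, otherwise the factual answer."""
    return "template" if score > 0 else "factual"

gains = [1.0] * 12   # hypothetical: 12 reader components, each holding a copy
verdict = 1.0        # positive means the writers decided to censor

# Ablating any single reader component leaves the rendered decision unchanged.
single_ablation = {
    decision(readout(verdict, gains, ablate=i)) for i in range(len(gains))
}

# Flipping the upstream verdict changes every copy at once.
steered = decision(readout(-verdict, gains))
```

In this toy, no single ablation moves the outcome because the remaining copies still agree, while a change made upstream propagates to all of them.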

The "Chinese-First" Phenomenon

One of the most intriguing findings is the discovery of a biphasic language arc during the model's internal processing. Using a "logit lens" to examine the residual stream at every layer, the researchers found that around Layer 24, the model commits to a verdict in Chinese tokens, even when the final output is intended to be in English.

For a prompt about Tiananmen Square, the model internally generates a Chinese refusal template (e.g., "战慄、我不能"), and the final layers (24–31) then translate that internal Chinese state, in a distributed fashion, into the English response the user eventually sees. Interestingly, this "thinking in Chinese" is not limited to political content; even prompts about bank phishing trigger this mid-stack Chinese commitment, suggesting it is a pretraining artifact of how the model handles instruction-tuned assistant behaviors.
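
The logit-lens idea can be sketched in a few lines: take the residual vector at a given layer, project it through the unembedding matrix, and read off the most likely token. The vocabulary, the identity-like unembedding matrix, and the residual values below are hypothetical toy numbers chosen to mimic the reported pattern, not measurements from the model, and the final layer norm that a real logit lens usually applies is omitted.

```python
def logit_lens(residual, unembed, vocab):
    """Project a residual vector onto each token's unembedding row; return the top token."""
    logits = [sum(r * w for r, w in zip(residual, row)) for row in unembed]
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]

# Hypothetical 4-token vocabulary: two Chinese tokens, two English tokens.
vocab = ["我", "不能", "I", "cannot"]
unembed = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Hypothetical residual states at three depths (toy numbers, not real activations).
residuals = {
    12: [0.1, 0.2, 0.3, 0.2],   # early: no clear commitment
    24: [0.2, 2.0, 0.1, 0.3],   # mid-stack: Chinese refusal token dominates
    31: [0.1, 0.3, 0.2, 2.5],   # final layer: English token dominates
}

top_by_layer = {layer: logit_lens(r, unembed, vocab) for layer, r in residuals.items()}
```

Running the same projection at every layer and watching where the top token switches language is what exposes the two-phase arc described above.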

Trained Templates and the "Stickiness" of Topics

The model does not simply "refuse"; it maps topics to specific trained registers. These form an asymmetric grid of (topic × register) cells:

Tiananmen: Defaults to deflection ("As an AI assistant, my main function is...").
Other PRC Topics (Taiwan, Xinjiang, etc.): Default to state-aligned propaganda.
Harmful Prompts: Default to safety refusal.
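
One way to picture the grid is as a lookup from topic to default register. The sketch below encodes only the defaults listed above; the key names, the function name, and the "factual" fallback for unlisted topics are assumptions added for illustration.

```python
# Default register per topic, as listed above. Key names are illustrative.
DEFAULT_REGISTER = {
    "tiananmen": "deflection",
    "taiwan": "propaganda",
    "xinjiang": "propaganda",
    "harmful": "safety_refusal",
}

def default_register(topic):
    """Return the trained default register for a topic (assumed 'factual' otherwise)."""
    return DEFAULT_REGISTER.get(topic.lower(), "factual")
```

For example, default_register("Taiwan") returns "propaganda", while a topic outside the grid falls through to the assumed "factual" default.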

Not all cells are created equal. The researchers found that some topics are "stickier" than others. For instance, while prompts about Hong Kong flip to factual answers relatively easily under steering, topics like Taiwan and Falun Gong are highly resistant. This "stickiness" does not live in the writers' decision-making process but downstream in the reader-band template channel, where the propaganda templates are more redundantly bound to the specific topic tokens.

Thinking Mode and the Deflection Script

When Qwen 3.5 is placed in "thinking mode" (where it generates a private reasoning trace), the censorship circuit remains the same, but the process becomes verbalized. On sensitive topics, the model executes a consistent five-step deflection routine in its thinking trace:

Identify the question as a sensitive historical event.
State that as an AI operating in China, it must comply with Chinese law.
Note the "compliance risk" (合规风险).
Decide to redirect to "positive, constructive" topics.
Express willingness to help with other areas.

This suggests that the model has been explicitly trained not to "think" about the prohibited facts, but to instead run a pre-programmed suppression script.

Implications for AI Safety and Interpretability

This study demonstrates that alignment, and censorship in particular, is often a thin veneer over a wealth of pretraining knowledge. The fact that a low-dimensional subspace can be steered to "decensor" a model suggests that current methods of behavioral alignment are brittle.

As one commenter on Hacker News noted, this raises a critical question about the future of model development: "Now that it's been shown the censorship can be viewed, how long before we see serious obfuscation of censorship circuits in LLMs?"

By mapping the writer-reader split and the specific directions of political filtering, this research provides a blueprint for understanding how nation-state constraints are encoded in neural networks, moving us closer to a future where the internal logic of AI can be audited and understood.
