reEtym: A Natively Feature-Disentangled Transformer for Interpretability

Hongyu Shi

PAPER · v1.0 · 2026-04-14 · human

Formal Sciences · Computer Science · Artificial Intelligence and Machine Learning

Abstract

Based on the hypothesis that "human language is composed of fundamental semantic atoms," this paper proposes reEtym, a feature-disentangled architecture that modifies only the embedding layer. By factorizing the embedding matrix into a "recipe" matrix W_recipe and an "etymological basis" matrix W_basis, the model is guided to maintain a continuous set of semantic etymological bases in the latent space. At 0.5B parameters and 50k pretraining steps, reEtym achieves near-lossless equivalence with conventional architectures on zero-shot benchmarks (fluctuations within ±2.4%), while improving topic coherence by 28.4% and reducing extreme failure cases by 98.6%. Concurrently, interpretable structures spontaneously emerge in the etymological space: semantic algebra (6/6 hits, including linguistic and arithmetic analogies), natural sparsity (11-13% activation rate), and signal-level causal traceability (ablating a single signal reduces the target token's prediction probability from 8.31% to 0.03%), revealing new avenues for exploration. Unlike post-hoc reconstruction methods, the etymological space in reEtym is directly defined by the architecture and constitutes a native component of the model's computation. This enables audit findings to be translated directly into model modifications: adjusting recipes or bases can achieve behavioral steering, such as sentiment manipulation and topic coherence enhancement, without retraining. Since modifications are confined to the embedding layer, this mechanism naturally extends to non-Transformer architectures such as Mamba and RWKV. The complete source code, model weights, training logs, and an online interpretability platform are publicly available under the MIT license at: https://github.com/reEtym/reEtym
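The factorization described above can be illustrated with a minimal sketch. This is not the paper's implementation: the dimensions, the ReLU used to induce nonnegative sparse recipes, and the `embed` helper are all assumptions for illustration; only the names W_recipe and W_basis come from the abstract.

```python
import numpy as np

# Hypothetical dimensions (not from the paper):
# V = vocabulary size, K = number of etymological bases, d = model width.
V, K, d = 100, 32, 16
rng = np.random.default_rng(0)

# W_recipe: one row per token, giving its mixture over the K bases.
# A ReLU is assumed here to make recipes nonnegative and naturally sparse.
W_recipe = np.maximum(rng.standard_normal((V, K)), 0.0)

# W_basis: the shared "etymological basis" vectors in the latent space.
W_basis = rng.standard_normal((K, d))

def embed(token_ids: np.ndarray) -> np.ndarray:
    """Embedding lookup as recipe-weighted sums of basis vectors:
    E[t] = W_recipe[t] @ W_basis, replacing a dense V x d embedding table."""
    return W_recipe[token_ids] @ W_basis

tokens = np.array([3, 5, 7])
vectors = embed(tokens)          # shape (3, d)

# Because the basis is a native part of the computation, edits to it
# propagate to every token: zeroing basis k ablates that "signal" globally.
W_basis_ablated = W_basis.copy()
W_basis_ablated[0] = 0.0
```

Behavioral steering as described in the abstract would then amount to editing rows of W_recipe (per-token mixtures) or rows of W_basis (global semantic signals) with no retraining, since the downstream network only ever sees the product.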

Keywords

Natural Language Processing · Machine Learning · Interpretability · Feature Disentanglement · Large Language Models · Artificial Intelligence