CVNSS4.0 IR-Based Reference IDs as Semantic Coordinates for Vietnamese Digital Infrastructure: A Precomputed Vector Registry Approach
Dai-Long Ngo-Hoang
PAPER · v1.0 · 2026-05-25 · human
Abstract
Modern natural-language processing pipelines typically transform text into tokens, tokens into vectors, and vectors into downstream representations. However, tokenization strategies differ in their computational and semantic roles. Byte-pair encoding (BPE) and related subword tokenizers are effective for language-model prediction, but their token identifiers are internal to a given model and vocabulary, so they do not constitute stable semantic references. This paper proposes an IEEE-style conceptual architecture in which Vietnamese expressions are first normalized through a CVNSS4.0 intermediate representation (IR), then mapped to registry-stable reference identifiers, and finally associated with precomputed semantic vectors. The key analogy is geographic coordinates: a place name may vary, but a coordinate in a reference frame enables consistent localization. Likewise, a Vietnamese concept such as “traceability” may be mapped to a stable identifier, e.g., ID 30588, which functions as a discrete semantic coordinate. A precomputed embedding attached to this identifier functions as a continuous coordinate in semantic space. The resulting architecture separates the expensive phase of vector generation, performed once when the registry is built, from the lightweight phase of vector usage, which reduces to a lookup by identifier. It enables expected O(1) embedding lookup, persistent vector indexing, compact QR/NFC/RFID payloads, auditable blockchain metadata, semantic GIS attributes, chipless RFID encoding, and low-latency edge-AI inference.
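The pipeline described in the abstract (normalize, map to a reference ID, look up a precomputed vector) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `normalize_ir` is a hypothetical stand-in that uses Unicode NFC, case folding, and whitespace collapsing rather than the actual CVNSS4.0 IR; the Vietnamese phrase "truy xuất nguồn gốc" is used as one possible surface form of "traceability"; the vector values are invented; and the 4-byte payload layout is only one way an ID might be packed for a QR/NFC/RFID tag. Only the identifier 30588 comes from the text.

```python
import struct
import unicodedata


def normalize_ir(text: str) -> str:
    """Hypothetical stand-in for CVNSS4.0 IR normalization (NOT the real scheme).

    Applies Unicode NFC, case folding, and whitespace collapsing so that
    surface variants of one expression share a single registry key.
    """
    nfc = unicodedata.normalize("NFC", text)
    return " ".join(nfc.casefold().split())


# Registry: normalized expression -> stable reference ID (discrete coordinate).
# The phrase-to-ID pairing is illustrative; only ID 30588 appears in the paper.
REGISTRY = {
    normalize_ir("truy xuất nguồn gốc"): 30588,
}

# Precomputed vectors: reference ID -> embedding (continuous coordinate).
# The values are invented; a real registry would store model-generated vectors.
VECTORS = {
    30588: (0.12, -0.40, 0.88, 0.05),
}


def resolve(text: str):
    """Return (reference ID, precomputed vector), or None if unregistered."""
    ref_id = REGISTRY.get(normalize_ir(text))
    if ref_id is None:
        return None
    # Dictionary lookup: expected constant time, no model inference at use time.
    return ref_id, VECTORS[ref_id]


def encode_payload(ref_id: int) -> bytes:
    """Pack an ID into 4 big-endian bytes (assumed layout for a compact tag)."""
    return struct.pack(">I", ref_id)


def decode_payload(payload: bytes) -> int:
    """Recover the reference ID from a 4-byte payload."""
    return struct.unpack(">I", payload)[0]
```

In this sketch the costly step (producing the vectors) happens once, when `VECTORS` is populated; at use time a tag reader or edge device only decodes four bytes and performs a dictionary lookup. Surface variants such as different casing, extra whitespace, or decomposed Unicode input resolve to the same identifier because they share one normalized key.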