Architecting a WordPress Unicode Embedding Codec with LM Studio
Metadata
| Field | Value |
|---|---|
| Source site | ɩ.com / JustAnIota.com |
| Source URL | https://justaniota.com/ |
| Canonical AIWikis URL | https://aiwikis.org/justaniota/uai-system/files/raw-system-archives-justaniota-intake-processing-2026-05-03-iota1-conver-fc12c3d1/ |
| Source reference | raw/system-archives/justaniota/intake-processing/2026-05-03-iota1-converter-architecture/agent-file-handoff/Improvement/Architecting a WordPress Unicode Embedding Codec with LM Studio.md |
| File type | md |
| Content category | memory-file |
| Last fetched | 2026-05-15T00:23:56.0837262Z |
| Last changed | 2026-05-04T15:29:04.1867960Z |
| Content hash | sha256:fc12c3d1af4690df62f03d146ac8e90617b680a3ec084af609f2c83ef18bef0c |
| Import status | unchanged |
| Raw source layer | data/sources/justaniota/raw-system-archives-justaniota-intake-processing-2026-05-03-iota1-converter-architecture-agent-f-fc12c3d1af46.md |
| Normalized source layer | data/normalized/justaniota/raw-system-archives-justaniota-intake-processing-2026-05-03-iota1-converter-architecture-agent-f-fc12c3d1af46.txt |
Current File Content
Structure Preview
- Architecting a WordPress Unicode Embedding Codec with LM Studio
- Executive summary
- Standards and invariants you need to respect
- Recommended system architecture
- Encoding, quantization, and Unicode mapping design
- Protocol design
- Recommended Unicode mapping formula
- Example mappings
- Quantization choices
- Scalar Quantization default formula
- PQ default formula
- LSH default formula
- What each mode should mean in your plugin
- Local embedding model and vector backend choices
- Embedding model candidates for LM Studio
- Model recommendation
- Vector backend comparison
- Quantization comparison
- WordPress plugin, REST API, and storage schema
- Required components
- REST endpoints
- Sample encode request and response
- Sample decode response
- WordPress custom table schema
Raw Version
This public page shows a bounded preview of a large source file. The complete source remains in the raw and normalized source layers named in metadata, with the SHA-256 hash above for verification.
- Source characters: 44730
- Preview characters: 11732
# Architecting a WordPress Unicode Embedding Codec with LM Studio
## Executive summary
The technically sound way to build this is **not** to pretend that ISO 10646 or Unicode already contain a universal “semantic language.” They do not. Private-use characters in Unicode are explicitly reserved for meanings defined by **private agreement**, and their interpretation is outside the standard. That means your system can absolutely use Unicode private-use scalars as a transport layer for compact semantic codes, but the meaning lives in **your registry, model choice, quantizer, and decode service**, not in Unicode itself. Unicode and ISO/IEC 10646 stay synchronized on code points and encoding forms, but Unicode adds the normalization, segmentation, and behavior rules you need to implement safely.
The strongest implementable design is a **hybrid two-lane codec**. In the **exact lane**, the encoded private-use string contains a protocol header plus a compact payload identifier, and the original text is stored in WordPress custom tables for perfect round-trip decode. In the **semantic lane**, the encoded private-use string carries a quantized embedding representation, and decode becomes approximate: reconstruct a vector, search a local vector index, and return the nearest stored text or nearest semantic paraphrase. That split is essential because embeddings are semantic representations, while scalar quantization and product quantization are lossy by construction.
For low cost, the best MVP is: **WordPress plugin + local FastAPI sidecar + LM Studio embeddings + FAISS index + WordPress/MySQL exact-text tables**. LM Studio exposes a local API on `localhost`, supports an OpenAI-compatible `/v1/embeddings` endpoint, can run downloaded embedding models locally, and can also import compatible GGUF models with `lms import`. FAISS gives you the cheapest and most controllable ANN/PQ layer. If you already run Postgres, pgvector is the best relational alternative; if you want a simpler local developer experience with metadata and server mode, Chroma is a reasonable second choice.
My recommendation for the first production-capable version is:
- **Default embedding model:** `google/embedding-gemma-300m` in LM Studio.
- **Default vector backend:** FAISS `IndexHNSWFlat` for simplicity first, then `IndexIVFPQ` if memory pressure becomes material.
- **Default Unicode transport:** supplementary private-use scalars on **Plane 15 first**, with a compact byte-packing mapping that stays within valid scalar values and avoids noncharacters.
- **Default decode contract:** exact when `payload_id` exists and the stored text is retained; approximate otherwise, clearly labeled as approximate.
That architecture is the cheapest one that still respects the actual boundaries imposed by Unicode, WordPress, and vector retrieval.
## Standards and invariants you need to respect
ISO/IEC 10646:2020 is the Universal Coded Character Set, and the Unicode Consortium notes that current Unicode versions and ISO/IEC 10646 are synchronized on character codes and encoding forms. However, Unicode also defines the algorithms and data needed for consistent implementation, including normalization and segmentation, which matter directly for your plugin pipeline.
For private-use transport, the relevant scalar ranges are:
| Range | Meaning | Capacity |
|---|---|---|
| `U+E000..U+F8FF` | BMP Private Use Area | 6,400 code points |
| `U+F0000..U+FFFFD` | Supplementary Private Use Area-A | 65,534 code points |
| `U+100000..U+10FFFD` | Supplementary Private Use Area-B | 65,534 code points |
The last two code points of Plane 15 and Plane 16 are **noncharacters** and should be excluded from your mapping table. Unicode allows internal use of noncharacters, but they are not recommended as open interchange symbols; for a protocol meant to move across WordPress, browsers, JSON, and copy/paste, avoiding them is the right engineering choice.
You should also treat supplementary PUA values as **normal Unicode scalars**, not as surrogate code points. UTF-8 encodes Unicode scalar values up to `U+10FFFF` using one to four bytes, and UTF-8 decoders must reject invalid sequences and UTF-16 surrogate code points used as if they were standalone characters. That matters because your plugin will be ingesting and emitting JSON over REST, and the entire system should operate on strict UTF-8.
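The range table and the noncharacter exclusion can be expressed as a small validity check. This is a minimal sketch, not part of the original design: the names `is_noncharacter`, `is_transport_scalar`, and `foreign_scalars` are hypothetical, and restricting the transport to the two supplementary areas (leaving out the BMP PUA) is an assumption taken from the Plane 15 and Plane 16 recommendation elsewhere in this document.

```python
# Supplementary private-use ranges from the table above, as inclusive bounds.
PUA_A = (0xF0000, 0xFFFFD)    # Plane 15
PUA_B = (0x100000, 0x10FFFD)  # Plane 16


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_transport_scalar(cp: int) -> bool:
    """True if cp is a supplementary private-use scalar the codec may emit."""
    in_range = PUA_A[0] <= cp <= PUA_A[1] or PUA_B[0] <= cp <= PUA_B[1]
    return in_range and not is_noncharacter(cp)


def foreign_scalars(s: str) -> list:
    """Return the code points in s that are not valid transport scalars."""
    return [ord(ch) for ch in s if not is_transport_scalar(ord(ch))]
```

A decoder could call `foreign_scalars` on incoming strings and reject the input when the result is non-empty, which keeps stray text or noncharacters out of the unpacking step.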
For text preprocessing, the safest rule set is:
1. **Strict UTF-8 decode** on input. Reject overlong or ill-formed sequences.
2. **Store original text exactly** as canonical source for lossless decode.
3. **Normalize a working copy to NFC** before embedding and chunking, so canonically equivalent strings get a stable binary form.
4. **Chunk only on grapheme cluster boundaries**, and preferably on sentence/word boundaries after that, using UAX #29 rules.
5. **Never normalize the encoded private-use payload after emission** other than transport-safe UTF-8 serialization.
UAX #15 says normalized strings give equivalent strings a unique binary representation, and UAX #29 defines default grapheme, word, and sentence boundaries. Unicode Chapter 23 also notes that normalization behavior for private-use characters is normatively defined and cannot be altered by private agreement.
A subtle but important product point follows from those standards: **this protocol is private and self-consistent, not globally interoperable by default**. If another implementation does not know your model registry, quantizer parameters, and decode rules, the emitted PUA characters are just opaque private-use symbols. That is correct behavior according to Unicode.
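Rules 1 to 3 of the preprocessing list can be sketched with the Python standard library alone. The function name `prepare_text` is hypothetical. Grapheme-aware chunking (rule 4) is left out because the standard library has no UAX #29 segmenter, so a real sidecar would need a third-party library for that step.

```python
import unicodedata


def prepare_text(raw: bytes) -> tuple:
    """Return (original, working): the exact decoded text and its NFC copy."""
    # errors="strict" raises UnicodeDecodeError on ill-formed UTF-8,
    # including overlong forms and encoded surrogate code points.
    original = raw.decode("utf-8", errors="strict")
    # The working copy is what gets embedded and chunked; the original is
    # kept untouched as the canonical source for lossless decode.
    working = unicodedata.normalize("NFC", original)
    return original, working
```

Private-use characters have no decomposition mappings, so NFC leaves an emitted payload unchanged; even so, rule 5 is easier to audit if the payload never passes through a normalizer at all.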
## Recommended system architecture
The cleanest architecture is a WordPress plugin that owns the UI, permissions, and exact-text registry, plus a local sidecar service that owns embeddings, quantization, and vector search. WordPress REST routes must be registered on `rest_api_init`, with explicit `permission_callback`s; blocks are best registered server-side with `block.json`; logged-in browser calls should use WordPress REST nonces, while server-to-server calls can use Application Passwords or an internal shared secret. For large indexing jobs, Action Scheduler is the right WordPress-native background queue.
LM Studio should run only on `localhost` by default and require an API token in production, because the LM Studio API server does not require authentication unless you turn it on. It can serve on the local network and expose CORS if you enable those settings, but that is a larger attack surface. For this project, the safest pattern is **WordPress ⇄ FastAPI ⇄ LM Studio on localhost**, with the browser never seeing the LM Studio token.
```mermaid
flowchart LR
A[WordPress Page or Block] --> B[WP Plugin REST Controller]
B --> C[FastAPI Sidecar]
C --> D[LM Studio /v1/embeddings]
C --> E[Vector Index]
B --> F[WP Exact Text Tables]
C --> G[Quantizer and Unicode Mapper]
E --> C
F --> B
G --> C
```
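The sidecar-to-LM-Studio edge of the diagram could look roughly like the following. This is a sketch under stated assumptions: the port `1234`, the bearer-token header, and the helper names `build_embeddings_request` and `fetch_embedding` are not given by the source; only the OpenAI-compatible `/v1/embeddings` path is. The request and response shapes follow the OpenAI embeddings convention (`model` and `input` in, `data[0].embedding` out).

```python
import json
import urllib.request
from typing import List, Optional

# Assumption: LM Studio is listening on its usual local port.
LMSTUDIO_URL = "http://localhost:1234/v1/embeddings"


def build_embeddings_request(
    text: str, model: str, token: Optional[str] = None
) -> urllib.request.Request:
    """Build an OpenAI-style embeddings POST; the token stays server-side."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the token is sent as a standard bearer credential.
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        LMSTUDIO_URL, data=body, headers=headers, method="POST"
    )


def fetch_embedding(
    text: str, model: str, token: Optional[str] = None
) -> List[float]:
    """Send the request and return the first embedding vector."""
    request = build_embeddings_request(text, model, token)
    with urllib.request.urlopen(request, timeout=30) as resp:
        payload = json.load(resp)
    return payload["data"][0]["embedding"]
```

Keeping the request builder separate from the network call makes the payload shape testable without a running LM Studio instance.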
The encode flow should work like this:
1. Browser posts text to the WordPress REST endpoint.
2. WordPress validates auth, request shape, size, and UTF-8.
3. WordPress forwards the payload to FastAPI.
4. FastAPI stores a normalized working copy, calls LM Studio for embeddings, quantizes the vector, maps the quantized bytes to PUA scalars, writes the vector record to the vector backend, and returns the PUA string plus metadata.
5. WordPress persists the exact text, payload metadata, and a pointer to the vector backend record.
6. The UI displays both the raw PUA string and a hex/code-point view for debuggability.
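For step 6, the code-point view is a one-line transformation. The helper name `codepoint_view` is hypothetical; a PHP or JavaScript equivalent would serve the same purpose inside the plugin UI.

```python
def codepoint_view(s: str) -> str:
    """Render each character of s as a U+XXXX label, separated by spaces."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)
```

Because private-use characters usually render as blank boxes, this labeled view is often the only readable form of an encoded string during debugging.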
The decode flow splits cleanly by mode:
- **Exact mode:** PUA header contains a payload reference. WordPress retrieves original stored text and returns it as authoritative.
- **Approximate mode:** FastAPI reconstructs the approximate vector from the PUA payload, queries the local vector index, and returns the nearest stored text chunks with scores. The UI must label this as approximate semantic reconstruction, not as exact text recovery.
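The approximate lane can be illustrated end to end in a few lines. Everything here is an assumption made for illustration: the fixed quantization range of [-1, 1] (reasonable only for unit-normalized embeddings), the helper names, and the brute-force cosine scan, which stands in for the FAISS index and is not how a production index would search.

```python
import math


def sq8_quantize(vec, lo=-1.0, hi=1.0):
    """Map each float to one byte on a uniform 256-level grid over [lo, hi]."""
    scale = 255.0 / (hi - lo)
    return bytes(round((min(max(x, lo), hi) - lo) * scale) for x in vec)


def sq8_dequantize(codes, lo=-1.0, hi=1.0):
    """Rebuild approximate floats; the rounding error is not recoverable."""
    step = (hi - lo) / 255.0
    return [lo + c * step for c in codes]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def approximate_decode(codes, store, top_k=3):
    """store is a list of (text, vector); returns [(score, text)], best first."""
    query = sq8_dequantize(codes)
    scored = sorted(((cosine(query, v), t) for t, v in store), reverse=True)
    return scored[:top_k]
```

The returned scores are what the UI would surface next to each candidate text, which makes the "approximate" label concrete for the user.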
## Encoding, quantization, and Unicode mapping design
### Protocol design
Use a binary protocol internally, then map that byte stream into Unicode private-use scalars. That gives you a versioned, self-describing transport instead of a loose sequence of uninterpreted code points.
A good header is:
- magic: 2 bytes, e.g. `IU`
- version: 1 byte
- mode: 1 byte
  - `0x01` = exact-ref
  - `0x02` = SQ8
  - `0x03` = PQ8
  - `0x04` = LSH256
- model registry id: 2 bytes
- embedding dimension or subcode count: 2 bytes
- payload byte length: 2 bytes
- flags: 1 byte
- checksum: 4 bytes CRC32
- payload: variable
This binary message is then carried as a PUA string.
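The header layout above adds up to 15 bytes before the payload and can be sketched with `struct`. Several details are assumptions because the list does not fix them: little-endian byte order, a CRC32 computed over the 11 header bytes preceding the checksum plus the payload, and the function names `pack_message` and `unpack_message`.

```python
import struct
import zlib

MAGIC = b"IU"
# magic, version, mode, model id, dimension/subcode count, payload length, flags
_FIELDS = struct.Struct("<2sBBHHHB")  # 11 bytes, little-endian (assumption)
_CRC = struct.Struct("<I")            # 4-byte CRC32


def pack_message(version, mode, model_id, dim, flags, payload: bytes) -> bytes:
    """Serialize header fields, checksum, and payload into one byte string."""
    head = _FIELDS.pack(MAGIC, version, mode, model_id, dim, len(payload), flags)
    crc = zlib.crc32(head + payload)  # assumption: CRC covers fields + payload
    return head + _CRC.pack(crc) + payload


def unpack_message(message: bytes) -> dict:
    """Parse and verify a message; raises ValueError on any mismatch."""
    start = _FIELDS.size + _CRC.size
    if len(message) < start:
        raise ValueError("message shorter than header")
    magic, version, mode, model_id, dim, length, flags = _FIELDS.unpack_from(message, 0)
    (crc,) = _CRC.unpack_from(message, _FIELDS.size)
    if magic != MAGIC:
        raise ValueError("bad magic")
    # Bytes past the declared length are ignored, which leaves room for a
    # transport pad byte when the total length is odd (an assumption).
    payload = message[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    if zlib.crc32(message[:_FIELDS.size] + payload) != crc:
        raise ValueError("checksum mismatch")
    return {"version": version, "mode": mode, "model_id": model_id,
            "dim": dim, "flags": flags, "payload": payload}
```

Carrying an explicit payload length lets the decoder discard padding introduced by the two-bytes-per-scalar mapping, since a 15-byte header makes odd total lengths common.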
### Recommended Unicode mapping formula
The most practical low-overhead mapping is **two bytes per supplementary private-use scalar**. The combined supplementary PUAs give you 131,068 valid private-use code points, which is more than enough to represent all 65,536 possible 16-bit values without touching noncharacters. Unicode gives you 65,534 private-use scalars in Plane 15 and 65,534 more in Plane 16.
Define a bijection `phi(u)` from 16-bit unsigned integers `u ∈ [0,65535]` to private-use scalars:
\[
\phi(u)=
\begin{cases}
0xF0000 + u, & 0 \le u \le 65533 \\
0x100000 + (u - 65534), & u \in \{65534, 65535\}
\end{cases}
\]
And the inverse:
\[
\phi^{-1}(cp)=
\begin{cases}
cp - 0xF0000, & 0xF0000 \le cp \le 0xFFFFD \\
65534 + (cp - 0x100000), & cp \in \{0x100000, 0x100001\}
\end{cases}
\]
Then pack bytes as:
\[
u_k = b_{2k} + 256 \cdot b_{2k+1}
\]
\[
cp_k = \phi(u_k)
\]
And unpack as:
\[
u_k = \phi^{-1}(cp_k)
\]
\[
b_{2k} = u_k \bmod 256,\quad b_{2k+1} = \lfloor u_k / 256 \rfloor
\]
This gives you a stable, reversible, and compact Unicode transport for any header or quantized payload. It is significantly better than “one byte = one code point” because it halves the number of code points in the string, although each supplementary scalar still occupies four bytes in UTF-8. The code points remain valid Unicode scalar values and stay outside the noncharacter positions.
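The mapping and packing formulas translate directly into code. This is a minimal sketch: `phi` and `phi_inv` follow the definitions above, while the function names `bytes_to_pua` and `pua_to_bytes` and the zero-padding of odd-length input are assumptions, since the formulas only define packing for byte pairs.

```python
PLANE15_BASE = 0xF0000
PLANE16_BASE = 0x100000


def phi(u: int) -> int:
    """Map a 16-bit value to a supplementary private-use scalar."""
    if not 0 <= u <= 0xFFFF:
        raise ValueError("u must fit in 16 bits")
    if u <= 65533:
        return PLANE15_BASE + u
    return PLANE16_BASE + (u - 65534)


def phi_inv(cp: int) -> int:
    """Invert phi; reject code points outside its image."""
    if PLANE15_BASE <= cp <= 0xFFFFD:
        return cp - PLANE15_BASE
    if cp in (0x100000, 0x100001):
        return 65534 + (cp - PLANE16_BASE)
    raise ValueError(f"U+{cp:04X} is not a codec scalar")


def bytes_to_pua(data: bytes) -> str:
    """Pack byte pairs little-endian into one scalar each."""
    if len(data) % 2:
        data += b"\x00"  # assumption: pad odd input with a single zero byte
    return "".join(
        chr(phi(data[i] | (data[i + 1] << 8))) for i in range(0, len(data), 2)
    )


def pua_to_bytes(text: str) -> bytes:
    """Unpack each scalar back into its low byte followed by its high byte."""
    out = bytearray()
    for ch in text:
        u = phi_inv(ord(ch))
        out += bytes((u & 0xFF, u >> 8))
    return bytes(out)
```

With this packing, four payload bytes become two scalars and eight UTF-8 bytes. Because odd-length input gains a trailing zero byte, the decoder has to rely on a length field in the message to trim it.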
#### Example mappings
If the next two bytes are `0x2A` and `0xF1`, then:
\[
u = 0x2A + 256 \cdot 0xF1 = 0xF12A = 61738
\]
Since `61738 <= 65533`, map to:
\[
cp = 0xF0000 + 0xF12A = 0xFF12A
\]
So the pair `[0x2A, 0xF1]` becomes `U+FF12A`.
If the byte pair is `[0xFE, 0xFF]`, then:
\[
u = 0xFFFE = 65534
\]
So it maps to `U+100000`. If the byte pair is `[0xFF, 0xFF]`, then `u = 65535`, which maps to `U+100001`. Those are still valid private-use scalars in Plane 16.
### Quantization choices
Why This File Exists
This is a memory-system evidence file from ɩ.com / JustAnIota.com. It is shown here because AIWikis.org is demonstrating the real source files that make the UAIX / LLM Wiki memory system work, not only summarizing those systems after the fact.
Role
This file is memory-system evidence. It records source history, archive transfer, intake disposition, or another piece of provenance that should be retrievable without becoming an unsupported public claim.
Structure
The file is structured around these visible headings: Architecting a WordPress Unicode Embedding Codec with LM Studio; Executive summary; Standards and invariants you need to respect; Recommended system architecture; Encoding, quantization, and Unicode mapping design; Protocol design; Recommended Unicode mapping formula; Example mappings. Those headings are retrieval anchors: a crawler or LLM can decide whether the file is relevant before reading every line.
Prompt-Size And Retrieval Benefit
Keeping this material in a separate file reduces prompt pressure because an agent can load this exact unit only when its role, source site, category, or hash is relevant. The surrounding index pages point to it, while this page preserves the full content for audit and exact recall.
How To Use It
- Humans should read the metadata first, then inspect the raw content when they need exact wording or provenance.
- LLMs and agents should use the source site, category, hash, headings, and related files to decide whether this file belongs in the active prompt.
- Crawlers should treat the AIWikis page as transparent evidence and follow the source URL/source reference for authority boundaries.
- Future maintainers should regenerate this page whenever the source hash changes, then review the explanation if the role or structure changed.
Update Requirements
When this source file changes, update the raw source layer, normalized source layer, hash history, this rendered page, generated explanation, source-file inventory, changed-files report, and any source-section index that links to it.
Provenance And History
- Current observation: 2026-05-15T00:23:56.0837262Z
- Source origin: current-source-workspace
- Retrieval method: local-source-workspace
- Duplicate group: sfg-760 (primary)
- Historical hash records are stored in data/hashes/source-file-history.jsonl.
Machine-Readable Metadata
{
"title": "Architecting A WordPress Unicode Embedding Codec With Lm Studio",
"source_site": "ɩ.com / JustAnIota.com",
"source_url": "https://justaniota.com/",
"canonical_url": "https://aiwikis.org/justaniota/uai-system/files/raw-system-archives-justaniota-intake-processing-2026-05-03-iota1-conver-fc12c3d1/",
"source_reference": "raw/system-archives/justaniota/intake-processing/2026-05-03-iota1-converter-architecture/agent-file-handoff/Improvement/Architecting a WordPress Unicode Embedding Codec with LM Studio.md",
"file_type": "md",
"content_category": "memory-file",
"content_hash": "sha256:fc12c3d1af4690df62f03d146ac8e90617b680a3ec084af609f2c83ef18bef0c",
"last_fetched": "2026-05-15T00:23:56.0837262Z",
"last_changed": "2026-05-04T15:29:04.1867960Z",
"import_status": "unchanged",
"duplicate_group_id": "sfg-760",
"duplicate_role": "primary",
"related_files": [
],
"generated_explanation": true,
"explanation_last_generated": "2026-05-15T00:23:56.0837262Z"
}