AIWikis.org

Architecting a WordPress Unicode Embedding Codec with LM Studio

Publication Warning: This page is marked noindex and should not be treated as canonical public authority.

Metadata

| Field | Value |
|---|---|
| Source site | ɩ.com / JustAnIota.com |
| Source URL | https://justaniota.com/ |
| Canonical AIWikis URL | https://aiwikis.org/justaniota/uai-system/files/raw-system-archives-justaniota-intake-processing-2026-05-03-iota1-conver-fc12c3d1/ |
| Source reference | raw/system-archives/justaniota/intake-processing/2026-05-03-iota1-converter-architecture/agent-file-handoff/Improvement/Architecting a WordPress Unicode Embedding Codec with LM Studio.md |
| File type | md |
| Content category | memory-file |
| Last fetched | 2026-05-15T00:23:56.0837262Z |
| Last changed | 2026-05-04T15:29:04.1867960Z |
| Content hash | sha256:fc12c3d1af4690df62f03d146ac8e90617b680a3ec084af609f2c83ef18bef0c |
| Import status | unchanged |
| Raw source layer | data/sources/justaniota/raw-system-archives-justaniota-intake-processing-2026-05-03-iota1-converter-architecture-agent-f-fc12c3d1af46.md |
| Normalized source layer | data/normalized/justaniota/raw-system-archives-justaniota-intake-processing-2026-05-03-iota1-converter-architecture-agent-f-fc12c3d1af46.txt |

Current File Content

Structure Preview

  • Architecting a WordPress Unicode Embedding Codec with LM Studio
  • Executive summary
  • Standards and invariants you need to respect
  • Recommended system architecture
  • Encoding, quantization, and Unicode mapping design
  • Protocol design
  • Recommended Unicode mapping formula
  • Example mappings
  • Quantization choices
  • Scalar Quantization default formula
  • PQ default formula
  • LSH default formula
  • What each mode should mean in your plugin
  • Local embedding model and vector backend choices
  • Embedding model candidates for LM Studio
  • Model recommendation
  • Vector backend comparison
  • Quantization comparison
  • WordPress plugin, REST API, and storage schema
  • Required components
  • REST endpoints
  • Sample encode request and response
  • Sample decode response
  • WordPress custom table schema

Raw Version

This public page shows a bounded preview of a large source file. The complete source remains in the raw and normalized source layers named in metadata, with the SHA-256 hash above for verification.

  • Source characters: 44730
  • Preview characters: 11732

# Architecting a WordPress Unicode Embedding Codec with LM Studio

## Executive summary

The technically sound way to build this is **not** to pretend that ISO 10646 or Unicode already contain a universal “semantic language.” They do not. Private-use characters in Unicode are explicitly reserved for meanings defined by **private agreement**, and their interpretation is outside the standard. That means your system can absolutely use Unicode private-use scalars as a transport layer for compact semantic codes, but the meaning lives in **your registry, model choice, quantizer, and decode service**, not in Unicode itself. Unicode and ISO/IEC 10646 stay synchronized on code points and encoding forms, but Unicode adds the normalization, segmentation, and behavior rules you need to implement safely.

The strongest implementable design is a **hybrid two-lane codec**. In the **exact lane**, the encoded private-use string contains a protocol header plus a compact payload identifier, and the original text is stored in WordPress custom tables for perfect round-trip decode. In the **semantic lane**, the encoded private-use string carries a quantized embedding representation, and decode becomes approximate: reconstruct a vector, search a local vector index, and return the nearest stored text or nearest semantic paraphrase. That split is essential because embeddings are semantic representations, while scalar quantization and product quantization are lossy by construction.

For low cost, the best MVP is: **WordPress plugin + local FastAPI sidecar + LM Studio embeddings + FAISS index + WordPress/MySQL exact-text tables**. LM Studio exposes a local API on `localhost`, supports an OpenAI-compatible `/v1/embeddings` endpoint, can run downloaded embedding models locally, and can also import compatible GGUF models with `lms import`. FAISS gives you the cheapest and most controllable ANN/PQ layer. If you already run Postgres, pgvector is the best relational alternative; if you want a simpler local developer experience with metadata and server mode, Chroma is a reasonable second choice.

My recommendation for the first production-capable version is:

- **Default embedding model:** `google/embedding-gemma-300m` in LM Studio.
- **Default vector backend:** FAISS `IndexHNSWFlat` for simplicity first, then `IndexIVFPQ` if memory pressure becomes material.
- **Default Unicode transport:** supplementary private-use scalars on **Plane 15 first**, with a compact byte-packing mapping that stays within valid scalar values and avoids noncharacters.
- **Default decode contract:** exact when `payload_id` exists and the stored text is retained; approximate otherwise, clearly labeled as approximate.

That architecture is the cheapest one that still respects the actual boundaries imposed by Unicode, WordPress, and vector retrieval.

## Standards and invariants you need to respect

ISO/IEC 10646:2020 is the Universal Coded Character Set, and the Unicode Consortium notes that current Unicode versions and ISO/IEC 10646 are synchronized on character codes and encoding forms. However, Unicode also defines the algorithms and data needed for consistent implementation, including normalization and segmentation, which matter directly for your plugin pipeline.

For private-use transport, the relevant scalar ranges are:

| Range | Meaning | Capacity |
|---|---|---|
| `U+E000..U+F8FF` | BMP Private Use Area | 6,400 code points |
| `U+F0000..U+FFFFD` | Supplementary Private Use Area-A | 65,534 code points |
| `U+100000..U+10FFFD` | Supplementary Private Use Area-B | 65,534 code points |

The last two code points of Plane 15 and Plane 16 are **noncharacters** and should be excluded from your mapping table. Unicode allows internal use of noncharacters, but they are not recommended as open interchange symbols; for a protocol meant to move across WordPress, browsers, JSON, and copy/paste, avoiding them is the right engineering choice.
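The range table and the noncharacter rule can be checked mechanically. The following is a minimal sketch, not part of the original design; the names `PUA_RANGES`, `is_private_use`, `is_noncharacter`, and `capacity` are illustrative assumptions.

```python
# Private-use ranges taken from the table above. The noncharacters U+FFFFE,
# U+FFFFF, U+10FFFE and U+10FFFF sit just past the two supplementary ranges.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (Plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (Plane 16)
)


def capacity(lo: int, hi: int) -> int:
    """Number of code points in an inclusive range."""
    return hi - lo + 1


def is_private_use(cp: int) -> bool:
    """True when cp falls inside one of the three private-use ranges."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    for lo, hi in PUA_RANGES:
        print(f"U+{lo:04X}..U+{hi:04X}: {capacity(lo, hi)} code points")
```

A decoder could call `is_private_use` on every incoming scalar and reject anything else before attempting to unpack a payload.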

You should also treat supplementary PUA values as **normal Unicode scalars**, not as surrogate code points. UTF-8 encodes Unicode scalar values up to `U+10FFFF` using one to four bytes, and UTF-8 decoders must reject invalid sequences and UTF-16 surrogate code points used as if they were standalone characters. That matters because your plugin will be ingesting and emitting JSON over REST, and the entire system should operate on strict UTF-8.

For text preprocessing, the safest rule set is:

1. **Strict UTF-8 decode** on input. Reject overlong or ill-formed sequences.
2. **Store original text exactly** as canonical source for lossless decode.
3. **Normalize a working copy to NFC** before embedding and chunking, so canonically equivalent strings get a stable binary form.
4. **Chunk only on grapheme cluster boundaries**, and preferably on sentence/word boundaries after that, using UAX #29 rules.
5. **Never normalize the encoded private-use payload after emission.** The only permitted transformation is transport-safe UTF-8 serialization.

UAX #15 says normalized strings give equivalent strings a unique binary representation, and UAX #29 defines default grapheme, word, and sentence boundaries. Unicode Chapter 23 also notes that normalization behavior for private-use characters is normatively defined and cannot be altered by private agreement.
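Rules 1 to 3 can be sketched with the Python standard library alone. This is an illustrative sketch under assumptions: the function name `prepare_text` is hypothetical, and grapheme-cluster chunking (rule 4) is left out because the standard library has no UAX #29 segmenter.

```python
import unicodedata


def prepare_text(raw: bytes) -> tuple[str, str]:
    """Return (original, working) for one request body.

    original: the exact decoded text, kept for lossless decode.
    working:  an NFC-normalized copy used for embedding and chunking.
    """
    # errors="strict" raises UnicodeDecodeError on overlong forms, encoded
    # surrogates, and any other ill-formed UTF-8 sequence.
    original = raw.decode("utf-8", errors="strict")
    working = unicodedata.normalize("NFC", original)
    return original, working


if __name__ == "__main__":
    original, working = prepare_text("e\u0301".encode("utf-8"))
    print(len(original), len(working))  # decomposed input, composed working copy
```

Private-use scalars have no decompositions, so applying NFC to an emitted payload leaves it unchanged; the rule against normalizing the payload still guards against other transformations applied downstream.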

A subtle but important product point follows from those standards: **this protocol is private and self-consistent, not globally interoperable by default**. If another implementation does not know your model registry, quantizer parameters, and decode rules, the emitted PUA characters are just opaque private-use symbols. That is correct behavior according to Unicode.

## Recommended system architecture

The cleanest architecture is a WordPress plugin that owns the UI, permissions, and exact-text registry, plus a local sidecar service that owns embeddings, quantization, and vector search. WordPress REST routes must be registered on `rest_api_init`, with explicit `permission_callback`s; blocks are best registered server-side with `block.json`; logged-in browser calls should use WordPress REST nonces, while server-to-server calls can use Application Passwords or an internal shared secret. For large indexing jobs, Action Scheduler is the right WordPress-native background queue.

LM Studio should run only on `localhost` by default and require an API token in production, because the LM Studio API server does not require authentication unless you turn it on. It can serve on the local network and expose CORS if you enable those settings, but that is a larger attack surface. For this project, the safest pattern is **WordPress ⇄ FastAPI ⇄ LM Studio on localhost**, with the browser never seeing the LM Studio token.

```mermaid
flowchart LR
    A[WordPress Page or Block] --> B[WP Plugin REST Controller]
    B --> C[FastAPI Sidecar]
    C --> D[LM Studio /v1/embeddings]
    C --> E[Vector Index]
    B --> F[WP Exact Text Tables]
    C --> G[Quantizer and Unicode Mapper]

    E --> C
    F --> B
    G --> C
```

The encode flow should work like this:

1. Browser posts text to the WordPress REST endpoint.
2. WordPress validates auth, request shape, size, and UTF-8.
3. WordPress forwards the payload to FastAPI.
4. FastAPI stores a normalized working copy, calls LM Studio for embeddings, quantizes the vector, maps the quantized bytes to PUA scalars, writes the vector record to the vector backend, and returns the PUA string plus metadata.
5. WordPress persists the exact text, payload metadata, and a pointer to the vector backend record.
6. The UI displays both the raw PUA string and a hex/code-point view for debuggability.
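The embedding call in step 4 could look roughly like the sketch below, which uses only the standard library. The request and response shapes follow the OpenAI-style embeddings format that the `/v1/embeddings` endpoint is described as compatible with. The port `1234`, the bearer-token header, and all function names are assumptions to check against your own LM Studio configuration.

```python
import json
import urllib.request

# Assumption: LM Studio is serving on its usual local port.
LM_STUDIO_URL = "http://localhost:1234/v1/embeddings"


def build_embedding_request(texts, model, token=None, url=LM_STUDIO_URL):
    """Build a POST request carrying an OpenAI-style embeddings body."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the server expects a bearer token when auth is enabled.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_embedding_response(payload):
    """Extract vectors from {"data": [{"index": i, "embedding": [...]}, ...]}."""
    rows = sorted(payload["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def embed(texts, model, token=None):
    """Send the request to the local server and return one vector per input."""
    request = build_embedding_request(texts, model, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_embedding_response(json.load(response))
```

Keeping request construction and response parsing separate from the network call lets the sidecar test both without a running LM Studio instance.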

The decode flow splits cleanly by mode:

- **Exact mode:** PUA header contains a payload reference. WordPress retrieves original stored text and returns it as authoritative.
- **Approximate mode:** FastAPI reconstructs the approximate vector from the PUA payload, queries the local vector index, and returns the nearest stored text chunks with scores. The UI must label this as approximate semantic reconstruction, not as exact text recovery.
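The two lanes can be illustrated with a small sketch. It substitutes a brute-force cosine search in pure Python for the real vector index, so it shows only the response contract, not FAISS usage. The function names, the record layout, and the `approximate` response field are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def decode_exact(exact_store, payload_id):
    """Exact lane: return the stored original text as authoritative."""
    return {"approximate": False, "text": exact_store[payload_id]}


def decode_approximate(query_vec, records, k=3):
    """Approximate lane: rank stored (text, vector) records by similarity.

    A linear scan stands in for the vector index here.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    matches = [{"text": text, "score": score} for score, text in scored[:k]]
    return {"approximate": True, "matches": matches}


if __name__ == "__main__":
    records = [("cat", [1.0, 0.0]), ("dog", [0.0, 1.0])]
    print(decode_approximate([0.9, 0.1], records, k=1))
```

Carrying an explicit flag in every response gives the UI a single field to key the "approximate" label on.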

## Encoding, quantization, and Unicode mapping design

### Protocol design

Use a binary protocol internally, then map that byte stream into Unicode private-use scalars. That gives you a versioned, self-describing transport instead of a loose sequence of uninterpreted code points.

A good header is:

- magic: 2 bytes, e.g. `IU`
- version: 1 byte
- mode: 1 byte
  - `0x01` = exact-ref
  - `0x02` = SQ8
  - `0x03` = PQ8
  - `0x04` = LSH256
- model registry id: 2 bytes
- embedding dimension or subcode count: 2 bytes
- payload byte length: 2 bytes
- flags: 1 byte
- checksum: 4 bytes CRC32
- payload: variable

This binary message is then carried as a PUA string.
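As a rough sketch, the header can be packed with Python's `struct` module. The field list above does not fix a byte order or the bytes covered by the checksum, so two assumptions are made here: all fields are little-endian, and the CRC32 covers the header fields before the checksum plus the payload. The function names are hypothetical.

```python
import struct
import zlib

# magic, version, mode, model registry id, dimension, payload length, flags
HEADER_PREFIX = struct.Struct("<2sBBHHHB")  # 11 bytes, little-endian (assumption)
CHECKSUM = struct.Struct("<I")              # 4-byte CRC32

MAGIC = b"IU"


def build_message(mode, model_id, dim, payload, version=1, flags=0):
    """Serialize header, checksum, and payload into one byte string."""
    prefix = HEADER_PREFIX.pack(MAGIC, version, mode, model_id, dim, len(payload), flags)
    crc = zlib.crc32(prefix + payload)
    return prefix + CHECKSUM.pack(crc) + payload


def parse_message(message):
    """Parse and verify a message; raise ValueError when it is malformed."""
    magic, version, mode, model_id, dim, length, flags = HEADER_PREFIX.unpack_from(message, 0)
    (crc,) = CHECKSUM.unpack_from(message, HEADER_PREFIX.size)
    start = HEADER_PREFIX.size + CHECKSUM.size
    # Slicing by the declared length ignores any trailing padding byte.
    payload = message[start:start + length]
    if magic != MAGIC:
        raise ValueError("bad magic")
    if len(payload) != length:
        raise ValueError("truncated payload")
    if zlib.crc32(message[:HEADER_PREFIX.size] + payload) != crc:
        raise ValueError("checksum mismatch")
    return {"version": version, "mode": mode, "model_id": model_id,
            "dim": dim, "flags": flags, "payload": payload}
```

The header is 15 bytes long. If the message is later packed two bytes per scalar, an odd total length needs one padding byte, and the payload length field lets the decoder discard it.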

### Recommended Unicode mapping formula

The most practical low-overhead mapping is **two bytes per supplementary private-use scalar**. The combined supplementary PUAs give you 131,068 valid private-use code points, which is more than enough to represent all 65,536 possible 16-bit values without touching noncharacters. Unicode gives you 65,534 private-use scalars in Plane 15 and 65,534 more in Plane 16.

Define a bijection `phi(u)` from 16-bit unsigned integers `u ∈ [0,65535]` to private-use scalars:

\[
\phi(u)=
\begin{cases}
0xF0000 + u, & 0 \le u \le 65533 \\
0x100000 + (u - 65534), & u \in \{65534, 65535\}
\end{cases}
\]

And the inverse:

\[
\phi^{-1}(cp)=
\begin{cases}
cp - 0xF0000, & 0xF0000 \le cp \le 0xFFFFD \\
65534 + (cp - 0x100000), & cp \in \{0x100000, 0x100001\}
\end{cases}
\]

Then pack bytes as:

\[
u_k = b_{2k} + 256 \cdot b_{2k+1}
\]
\[
cp_k = \phi(u_k)
\]

And unpack as:

\[
u_k = \phi^{-1}(cp_k)
\]
\[
b_{2k} = u_k \bmod 256,\quad b_{2k+1} = \lfloor u_k / 256 \rfloor
\]

This gives you a stable, reversible, and compact Unicode transport for any header or quantized payload. It is significantly better than “one byte = one code point” because it halves visible string length. The code points remain valid Unicode scalar values and stay outside the noncharacter positions.
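The pack and unpack formulas translate almost line for line into Python. The sketch below follows the definitions of `phi` and its inverse given above. Zero-padding an odd-length input is an assumption, since the formulas only cover complete byte pairs, and the function names are illustrative.

```python
PLANE15_BASE = 0xF0000
PLANE15_LAST = 0xFFFFD
PLANE16_BASE = 0x100000
PLANE15_COUNT = 65534  # values 0..65533 map into Plane 15


def phi(u: int) -> int:
    """Map a 16-bit value to a supplementary private-use scalar."""
    if not 0 <= u <= 0xFFFF:
        raise ValueError("value out of 16-bit range")
    if u < PLANE15_COUNT:
        return PLANE15_BASE + u
    return PLANE16_BASE + (u - PLANE15_COUNT)


def phi_inv(cp: int) -> int:
    """Inverse of phi; rejects anything outside the mapped code points."""
    if PLANE15_BASE <= cp <= PLANE15_LAST:
        return cp - PLANE15_BASE
    if cp in (PLANE16_BASE, PLANE16_BASE + 1):
        return PLANE15_COUNT + (cp - PLANE16_BASE)
    raise ValueError(f"U+{cp:04X} is not part of the mapping")


def bytes_to_pua(data: bytes) -> str:
    """Pack bytes two per scalar, low byte first."""
    if len(data) % 2:
        data += b"\x00"  # assumption: pad odd input with a zero byte
    return "".join(chr(phi(data[i] | (data[i + 1] << 8))) for i in range(0, len(data), 2))


def pua_to_bytes(text: str) -> bytes:
    """Unpack a PUA string back into bytes (padding, if any, is kept)."""
    out = bytearray()
    for ch in text:
        u = phi_inv(ord(ch))
        out += bytes((u & 0xFF, u >> 8))
    return bytes(out)
```

Because padding is not self-describing, the decoder should rely on a length field in the protocol header to trim the recovered bytes.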

#### Example mappings

If the next two bytes are `0x2A` and `0xF1`, then:

\[
u = 0x2A + 256 \cdot 0xF1 = 0xF12A = 61738
\]

Since `61738 <= 65533`, map to:

\[
cp = 0xF0000 + 0xF12A = 0xFF12A
\]

So the pair `[0x2A, 0xF1]` becomes `U+FF12A`.

If the byte pair is `[0xFE, 0xFF]`, then:

\[
u = 0xFFFE = 65534
\]

So it maps to `U+100000`. If the byte pair is `[0xFF, 0xFF]`, then `u = 65535`, which maps to `U+100001`. Those are still valid private-use scalars in Plane 16.

### Quantization choices

Why This File Exists

This is a memory-system evidence file from ɩ.com / JustAnIota.com. It is shown here because AIWikis.org is demonstrating the real source files that make the UAIX / LLM Wiki memory system work, not only summarizing those systems after the fact.

Role

This file is memory-system evidence. It records source history, archive transfer, intake disposition, or another piece of provenance that should be retrievable without becoming an unsupported public claim.

Structure

The file is structured around these visible headings: Architecting a WordPress Unicode Embedding Codec with LM Studio; Executive summary; Standards and invariants you need to respect; Recommended system architecture; Encoding, quantization, and Unicode mapping design; Protocol design; Recommended Unicode mapping formula; Example mappings. Those headings are retrieval anchors: a crawler or LLM can decide whether the file is relevant before reading every line.

Prompt-Size And Retrieval Benefit

Keeping this material in a separate file reduces prompt pressure because an agent can load this exact unit only when its role, source site, category, or hash is relevant. The surrounding index pages point to it, while this page preserves the full content for audit and exact recall.

How To Use It

  • Humans should read the metadata first, then inspect the raw content when they need exact wording or provenance.
  • LLMs and agents should use the source site, category, hash, headings, and related files to decide whether this file belongs in the active prompt.
  • Crawlers should treat the AIWikis page as transparent evidence and follow the source URL/source reference for authority boundaries.
  • Future maintainers should regenerate this page whenever the source hash changes, then review the explanation if the role or structure changed.

Update Requirements

When this source file changes, update the raw source layer, normalized source layer, hash history, this rendered page, generated explanation, source-file inventory, changed-files report, and any source-section index that links to it.

Provenance And History

  • Current observation: 2026-05-15T00:23:56.0837262Z
  • Source origin: current-source-workspace
  • Retrieval method: local-source-workspace
  • Duplicate group: sfg-760 (primary)
  • Historical hash records are stored in data/hashes/source-file-history.jsonl.

Machine-Readable Metadata

{
    "title":  "Architecting A WordPress Unicode Embedding Codec With Lm Studio",
    "source_site":  "ɩ.com / JustAnIota.com",
    "source_url":  "https://justaniota.com/",
    "canonical_url":  "https://aiwikis.org/justaniota/uai-system/files/raw-system-archives-justaniota-intake-processing-2026-05-03-iota1-conver-fc12c3d1/",
    "source_reference":  "raw/system-archives/justaniota/intake-processing/2026-05-03-iota1-converter-architecture/agent-file-handoff/Improvement/Architecting a WordPress Unicode Embedding Codec with LM Studio.md",
    "file_type":  "md",
    "content_category":  "memory-file",
    "content_hash":  "sha256:fc12c3d1af4690df62f03d146ac8e90617b680a3ec084af609f2c83ef18bef0c",
    "last_fetched":  "2026-05-15T00:23:56.0837262Z",
    "last_changed":  "2026-05-04T15:29:04.1867960Z",
    "import_status":  "unchanged",
    "duplicate_group_id":  "sfg-760",
    "duplicate_role":  "primary",
    "related_files":  [

                      ],
    "generated_explanation":  true,
    "explanation_last_generated":  "2026-05-15T00:23:56.0837262Z"
}
