Designing An Optimized Keyless JSON Representation From HTML

Publication Warning This page is marked noindex and should not be treated as canonical public authority.

Metadata

Field	Value
Source site	ɩ.com / JustAnIota.com
Source URL	https://justaniota.com/
Canonical AIWikis URL	https://aiwikis.org/justaniota/uai-system/files/raw-system-archives-justaniota-intake-processing-2026-05-03-agent-file-h-2360de1e/
Source reference	`raw/system-archives/justaniota/intake-processing/2026-05-03/agent-file-handoff/Improvement/Designing an Optimized Keyless JSON Representation from HTML.md`
File type	`md`
Content category	`memory-file`
Last fetched	`2026-05-15T00:23:56.0837262Z`
Last changed	`2026-05-03T19:06:06.6634107Z`
Content hash	`sha256:2360de1edc223386e6b241047b774602158bfecc896f1d9b84c7a83fd2ba102b`
Import status	`unchanged`
Raw source layer	`data/sources/justaniota/raw-system-archives-justaniota-intake-processing-2026-05-03-agent-file-handoff-improvement-desig-2360de1edc22.md`
Normalized source layer	`data/normalized/justaniota/raw-system-archives-justaniota-intake-processing-2026-05-03-agent-file-handoff-improvement-desig-2360de1edc22.txt`

Current File Content

Structure Preview

Designing an Optimized Keyless JSON Representation from HTML
Executive summary
Standards basis
Extraction pipeline
Dual-track extraction method
Separation of literal text from HTML syntax
Lossless versus optimized views
Normalization and elimination rules
Token and weight design
Proposed deterministic token algorithm
Proposed weight system
Why a deterministic registry is preferable here
Keyless JSON schema and worked example
Proposed keyless wire schema
Hypothetical input HTML
Comparison table
Full sample keyless JSON output
English glosses and embeddings
Using weights to reduce dependence on natural language
Security, ambiguity, and edge cases

Raw Version

This public page shows a bounded preview of a large source file. The complete source remains in the raw and normalized source layers named in metadata, with the SHA-256 hash above for verification.

Source characters: 37085
Preview characters: 11839

# Designing an Optimized Keyless JSON Representation from HTML

## Executive summary

A rigorous way to build an optimized, keyless JSON artifact from an `index.html` file is to treat the HTML in **two parallel forms** at once: a **raw source ledger** of bytes, code points, separators, and source spans, and a **standards-compliant parsed representation** of the document tree. The HTML parsing model itself is defined as a stream of code points that passes through tokenization and tree construction into a `Document`; during that process, ASCII uppercase in tag names and attribute names is lowercased, character references are resolved, duplicate attributes are dropped from the token, and raw-text/script states are handled differently from ordinary text. A source-only method misses document structure; a DOM-only method loses lexical facts that matter for audit and compression.

For the wire format itself, “keyless JSON” should be understood literally: **use arrays and positional schema, not objects**, because JSON objects are defined as name/value pairs, member names are strings, and duplicate names create unpredictable behavior across implementations. ECMA-404 is explicit that JSON defines syntax, not semantics; semantics must come from an external agreement, schema, or registry. That point is central here: token IDs and weights can reduce the amount of natural-language text you ship to an AI system, but they do **not** create universal meaning by themselves.
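The positional idea can be sketched in a few lines of Python. This is a minimal illustration, not the document's actual wire schema: the three field names in `SCHEMA` and the sample records are hypothetical, and the schema is assumed to be agreed out of band rather than shipped with the data.

```python
import json

# Hypothetical positional schema, agreed outside the wire payload.
SCHEMA = ("token_id", "kind", "weight")

def to_keyless(records):
    """Convert dict records into positional rows ordered by SCHEMA."""
    return [[rec[name] for name in SCHEMA] for rec in records]

def from_keyless(rows):
    """Rebuild dict records from positional rows using the external SCHEMA."""
    return [dict(zip(SCHEMA, row)) for row in rows]

# Illustrative records only; the values carry no meaning without the schema.
records = [
    {"token_id": 1, "kind": "text", "weight": 0.8},
    {"token_id": 2, "kind": "attr", "weight": 0.3},
]

wire = json.dumps(to_keyless(records), separators=(",", ":"))
print(wire)  # [[1,"text",0.8],[2,"attr",0.3]]
```

The wire text contains no member names at all, which is why a receiver without `SCHEMA` (or an equivalent registry) cannot interpret it.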

The most defensible normalization strategy is to keep three layers in parallel: **raw**, **normalized**, and **cleaned**. The raw layer preserves exact bytes/code points and punctuation for provenance. The normalized layer should generally use **NFC** so canonically equivalent strings have a unique binary representation, while user-visible minimal symbols should be segmented by **extended grapheme cluster** rather than naïve code-point counting. The cleaned layer removes or quarantines syntax-only, invisible, reserved, or unsafe characters, then maps the surviving graphemes or short semantic units to concise registry tokens plus small numeric weight vectors.

This design is also consistent with the direction reflected in the uploaded background materials: Unicode and JSON are best treated as **transport and syntax substrates**, while compact semantics come from a versioned application-layer registry or profile, not from Unicode code points alone. The uploaded `index.html` is a useful real-world reminder of why this matters: even one page can combine metadata, inline CSS, navigation, prose, links, and a very large literal `<pre>` payload, which means extraction has to handle both structure and literal text faithfully.

## Standards basis

The primary authoritative sources for this problem are the **WHATWG HTML Living Standard**, **RFC 8259** and **ECMA-404** for JSON, Unicode normalization and segmentation documents such as **UAX #15** and **UAX #29**, Unicode security guidance in **UTS #39**, the Unicode emoji specification **UTS #51**, and primary tokenization/embedding papers such as BPE-based subword segmentation, SentencePiece, CANINE, ByT5, SBERT, and E5. That ordering matters because HTML and JSON specs define what is structurally valid, Unicode defines what characters and user-perceived symbols are, and the NLP papers explain what is gained or lost when you compress language into learned or rule-based units.

Several HTML behaviors are especially important for deterministic extraction. The parser consumes decoded code points, not bytes directly. ASCII uppercase letters in tag names and attribute names are lowercased during parsing. Duplicate attributes are parse errors and the later attribute is removed from the token. Character references are resolved during tokenization, and missing semicolons can still resolve in certain ambiguous cases. For non-void HTML elements, a trailing `/` before `>` does not create a true self-closing HTML element, so it should be preserved only as a raw punctuation fact, not as a semantic closure marker. Comments inside `script`-like contexts can also be treated as text rather than HTML comments.
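Some of these behaviors can be observed with Python's standard library. The sketch below is illustrative only: `html.parser` is not a conforming HTML5 tree builder and does not drop duplicate attributes on its own, so the hypothetical `StartTagRecorder` class imitates the keep-the-first rule explicitly. `html.unescape` is documented as following the HTML5 rules for character references, including legacy forms without a semicolon.

```python
from html import unescape
from html.parser import HTMLParser

class StartTagRecorder(HTMLParser):
    """Record start tags as (name, attrs), keeping only the first duplicate attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        first_seen = {}
        for name, value in attrs:
            # html.parser reports every attribute; imitate the HTML rule
            # that a later duplicate is removed from the token.
            first_seen.setdefault(name, value)
        self.tags.append((tag, list(first_seen.items())))

parser = StartTagRecorder()
parser.feed('<A HREF="/x?a=1&amp;b=2" href="/ignored">link</A>')
parser.close()

# Tag and attribute names arrive lowercased; the attribute value is decoded.
print(parser.tags)

# Character references with and without a trailing semicolon.
decoded = unescape("&lt;p&gt; &amp &copy 2026")
print(decoded)
```

A production pipeline would replace this with a conforming HTML5 parser and keep the raw spelling alongside each decoded value.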

JSON imposes equally important constraints. Objects are name/value structures with string names; arrays are ordered sequences of values. RFC 8259 recommends unique object member names because duplicate names lead to unpredictable parser behavior, while ECMA-404 emphasizes that JSON syntax itself does not assign semantics. JSON strings are quoted, and quotation mark, reverse solidus, and control characters U+0000 through U+001F must be escaped. For interoperable interchange outside a closed ecosystem, JSON text must be encoded as UTF-8.
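These constraints are easy to check with Python's `json` module. The following is a small sketch; the helper name `reject_duplicates` is an illustrative assumption, and the last-one-wins result shown is the behavior of Python's decoder specifically, not something the JSON specifications guarantee.

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that raises when a member name repeats."""
    names = [name for name, _ in pairs]
    if len(names) != len(set(names)):
        raise ValueError(f"duplicate member names: {names}")
    return dict(pairs)

# Python's default decoder silently keeps the last duplicate.
last_wins = json.loads('{"a": 1, "a": 2}')

# A strict decoder can refuse the same text instead.
try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)
    strict_rejected = False
except ValueError:
    strict_rejected = True

# Quotation mark, reverse solidus, and control characters are escaped on output.
escaped = json.dumps(['say "hi"', "back\\slash", "tab\there", "\u0001"])
print(escaped)

# ensure_ascii=False keeps non-ASCII text literal; the bytes on the wire are UTF-8.
wire_bytes = json.dumps(["naïve", "ɩ"], ensure_ascii=False).encode("utf-8")
print(wire_bytes)
```

An array-only wire format sidesteps the duplicate-name problem entirely, because arrays have positions rather than names.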

Unicode adds two more design constraints. First, **NFC** is the safest default normalization form for interchange because it gives canonically equivalent strings a unique binary representation. Second, what users perceive as one “character” may be a multi-code-point grapheme cluster, so a minimal semantic symbol should usually be a grapheme cluster, not a raw code point. Unicode also warns that normalized strings are not closed under concatenation, so you should normalize after assembly or normalize carefully around stable boundaries.
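Both constraints can be demonstrated with the standard `unicodedata` module. This is a sketch under stated assumptions: Python's standard library has no UAX #29 segmenter, so the hypothetical `rough_clusters` helper only attaches combining marks to the preceding base character and is not a substitute for real extended grapheme cluster segmentation.

```python
import unicodedata

def nfc(text):
    """Normalize text to NFC."""
    return unicodedata.normalize("NFC", text)

def rough_clusters(text):
    """Approximate grapheme clusters: a base character plus following marks.

    NOT full UAX #29: no ZWJ sequences, Hangul syllables, or regional
    indicator pairs are handled here.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Canonically equivalent spellings collapse to one form under NFC.
same_after_nfc = nfc("e\u0301") == "\u00e9"

# Non-closure under concatenation: each part is NFC, the join is not.
left, right = "e", "\u0301"
parts_ok = unicodedata.is_normalized("NFC", left) and unicodedata.is_normalized("NFC", right)
join_ok = unicodedata.is_normalized("NFC", left + right)

# 'g' + COMBINING DIAERESIS has no precomposed form, so it stays two
# code points even in NFC while reading as one user-perceived symbol.
sample = nfc("g\u0308o")
print(len(sample), rough_clusters(sample))
```

The last line shows why code-point counts and symbol counts diverge even after normalization.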

## Extraction pipeline

The extraction method that best satisfies your requirement to capture **every field, symbol, and character** is a **dual-track pipeline**: one track preserves the exact source surface; the other derives structural meaning from the HTML5 parser and DOM. This is the only robust way to keep both punctuation-level provenance and semantic structure.

```mermaid
flowchart LR
    A[HTML bytes] --> B[Encoding detection and decode to code points]
    B --> C[Raw source scanner]
    B --> D[HTML5 tokenizer and tree builder]
    C --> E[Source ledger]
    D --> F[DOM and token events]
    E --> G[Alignment layer]
    F --> G
    G --> H[NFC normalization plus grapheme segmentation]
    H --> I[Reserved character filter and boundary mapping]
    I --> J[Token registry creation]
    J --> K[Salience and idea-weight assignment]
    K --> L[Keyless JSON wire output]
    J --> M[English gloss cache for embeddings]
```

### Dual-track extraction method

The **raw source scanner** should walk the decoded code-point stream once and record, for every code point, its byte span, code point value, grapheme-cluster membership, local context, and punctuation class. It should explicitly label syntax punctuation such as `<`, `>`, `/`, `=`, quotes, `&`, `;`, commas, and periods, because even punctuation you later remove from the semantic stream is still part of the provenance layer. The scanner should also preserve the literal spelling of character references such as `&amp;`, `&lt;`, or a missing-semicolon form, because the parser will resolve those into characters and may obscure the exact source spelling.
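A reduced version of such a scanner might look like the following. It is a sketch, not the full ledger described above: the `LedgerEntry` fields, the `SYNTAX_PUNCT` set, and the three class labels are hypothetical, byte spans are computed by assuming the source was UTF-8, and grapheme-cluster membership and local context are omitted for brevity.

```python
import unicodedata
from dataclasses import dataclass

# Hypothetical set of characters treated as HTML syntax punctuation.
SYNTAX_PUNCT = set("<>/=\"'&;")

@dataclass(frozen=True)
class LedgerEntry:
    index: int        # code point index in the decoded stream
    byte_start: int   # UTF-8 byte offset, inclusive
    byte_end: int     # UTF-8 byte offset, exclusive
    codepoint: str    # for example "U+003C"
    char: str
    punct_class: str  # "syntax", "punct", or "other"

def raw_scan(text):
    """Walk the code points once and record one ledger entry per code point."""
    ledger, offset = [], 0
    for i, ch in enumerate(text):
        width = len(ch.encode("utf-8"))
        if ch in SYNTAX_PUNCT:
            cls = "syntax"
        elif unicodedata.category(ch).startswith("P"):
            cls = "punct"  # commas, periods, and other general punctuation
        else:
            cls = "other"
        ledger.append(LedgerEntry(i, offset, offset + width, f"U+{ord(ch):04X}", ch, cls))
        offset += width
    return ledger

source = "<p>é, &amp;</p>"
ledger = raw_scan(source)

# The raw spelling of the character reference survives in the ledger.
raw_reference = "".join(entry.char for entry in ledger[6:11])
print(raw_reference, ledger[3])
```

Because every entry carries its own byte span, a later alignment step can point from a decoded DOM character back to the exact source bytes that produced it.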

The **parser/DOM track** should use an HTML5-compliant parser and then traverse the resulting structure in tree order, recording at minimum the document type, element tag names, attribute names, attribute values, comment nodes, text nodes, and parent/child relations. DOM `Text` nodes are the correct abstraction for extracted text content, and the DOM `normalize()` behavior is useful to understand because it merges contiguous text nodes and removes empty ones. For extraction, however, normalization should happen on a clone or a derived view, not destructively on the only copy of the DOM, because exact boundary positions may still matter to audit and alignment.
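The clone-then-normalize advice can be illustrated with `xml.dom.minidom`, which implements the DOM `normalize()` and `cloneNode()` methods. Note the assumption: minidom is an XML DOM used here only to show the node-merging behavior, not an HTML5 parser, and the three text fragments are made up for the example.

```python
from xml.dom import minidom

doc = minidom.Document()
p = doc.createElement("p")
doc.appendChild(p)

# Three adjacent Text nodes, one of them empty, as a parser might leave them.
for piece in ("Hello, ", "", "world"):
    p.appendChild(doc.createTextNode(piece))

# Normalize a deep clone so the original boundaries stay available for audit.
derived_view = p.cloneNode(True)
derived_view.normalize()

original_pieces = [node.data for node in p.childNodes]
merged_pieces = [node.data for node in derived_view.childNodes]

print(original_pieces)  # ['Hello, ', '', 'world']
print(merged_pieces)    # ['Hello, world']
```

The original element still exposes the three node boundaries, while the derived view offers the merged text that downstream tokenization would consume.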

### Separation of literal text from HTML syntax

The pipeline should explicitly separate **syntax punctuation** from **literal document content**. The `<title>` element yields an element marker plus a text node. A character reference inside text or an attribute yields both a raw source lexeme and a decoded semantic character. Raw-text and script-data contexts need special handling: the parser deliberately switches tokenization mode for raw-text elements and script content, which means text inside those elements should be collected as literal character sequences, not recursively parsed as ordinary HTML markup. That distinction is essential if the page contains CSS, JS, or large literal blocks.
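The difference between ordinary text and script content is visible even in Python's `html.parser`, which treats `script` and `style` bodies as literal data. The sketch below is an approximation under that assumption: `html.parser` is not a conforming HTML5 parser, and the `TextSplitter` class and sample markup are hypothetical.

```python
from html.parser import HTMLParser

class TextSplitter(HTMLParser):
    """Collect (enclosing_tag, text) pairs while tracking open elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.stack[-1] if self.stack else None
        self.chunks.append((parent, data))

splitter = TextSplitter()
splitter.feed('<title>A &amp; B</title><script>if (a < b) { s = "<b>&amp;"; }</script>')
splitter.close()

# Join chunks per parent in case the parser delivers text in several pieces.
title_text = "".join(text for parent, text in splitter.chunks if parent == "title")
script_text = "".join(text for parent, text in splitter.chunks if parent == "script")

print(title_text)   # character reference decoded
print(script_text)  # markup-like characters and "&amp;" kept literally
```

Inside `script`, neither the `<b>` sequence nor the `&amp;` spelling is interpreted, which is the behavior the extraction pipeline needs to preserve for CSS, JS, and similar literal payloads.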

### Lossless versus optimized views

To satisfy both auditability and compactness, I recommend producing two synchronized outputs during generation even if you ultimately persist only one optimized artifact. The first is a **lossless audit ledger** containing exact raw spans and removed punctuation facts. The second is the **optimized keyless JSON wire form** that keeps only normalized symbols, concise tokens, weights, and optional source references. This is especially important for pages like your uploaded example, which includes metadata, inline styles, navigation, prose, and a large `<pre>` payload where literal spacing and line structure are part of the content itself.

A practical extraction pseudocode sketch looks like this:

```text
bytes -> decode with HTML-compatible encoding handling -> code points
code points -> raw_scan() -> raw_ledger
code points -> parse_html5() -> dom_tree + tokenizer events

for each DOM node in tree order:
    emit structural record for element/comment/text/doctype
    preserve attribute order as observed in source for reproducibility
    keep parser-normalized tag and attribute names
    preserve both raw and decoded forms for character references

align dom records to raw spans
normalize semantic clones to NFC
segment minimal symbols by grapheme cluster
filter or map reserved/invisible punctuation to boundary classes
deduplicate into token registry
assign salience + idea weights
serialize as positional-array JSON
```

The `preserve attribute order as observed in source` step is a design choice for reproducibility, not a claim that attribute order is semantically significant in HTML; the HTML spec explicitly notes that attribute order does not matter for attribute equivalence comparisons, while duplicate attributes are removed during parsing.

## Normalization and elimination rules

The cleanest normalization strategy is to define three parallel forms for every extracted unit:

| Layer | Purpose | Example |
|---|---|---|
| Raw | Exact source preservation | `content="Fast, safe AI tooling."` |
| Normalized | Stable Unicode form for comparison | `Fast, safe AI tooling.` in NFC |
| Cleaned | Minimal semantic symbol used for tokenization | `fast safe ai tooling` |

NFC should be the default normalization target for semantic comparison and registry lookup because it preserves compatibility distinctions (unlike NFKC) while giving canonically equivalent strings a unique binary representation. Grapheme segmentation should happen after or alongside normalization for minimal symbol handling. If you concatenate normalized fragments during processing, normalize again at the join boundary because normalization is not closed under concatenation.

Why This File Exists

This is a memory-system evidence file from ɩ.com / JustAnIota.com. It is shown here because AIWikis.org is demonstrating the real source files that make the UAIX / LLM Wiki memory system work, not only summarizing those systems after the fact.

Role

This file is memory-system evidence. It records source history, archive transfer, intake disposition, or another piece of provenance that should be retrievable without becoming an unsupported public claim.

Structure

The file is structured around these visible headings: Designing an Optimized Keyless JSON Representation from HTML; Executive summary; Standards basis; Extraction pipeline; Dual-track extraction method; Separation of literal text from HTML syntax; Lossless versus optimized views; Normalization and elimination rules. Those headings are retrieval anchors: a crawler or LLM can decide whether the file is relevant before reading every line.

Prompt-Size And Retrieval Benefit

Keeping this material in a separate file reduces prompt pressure because an agent can load this exact unit only when its role, source site, category, or hash is relevant. The surrounding index pages point to it, while this page preserves the full content for audit and exact recall.

How To Use It

Humans should read the metadata first, then inspect the raw content when they need exact wording or provenance.
LLMs and agents should use the source site, category, hash, headings, and related files to decide whether this file belongs in the active prompt.
Crawlers should treat the AIWikis page as transparent evidence and follow the source URL/source reference for authority boundaries.
Future maintainers should regenerate this page whenever the source hash changes, then review the explanation if the role or structure changed.

Update Requirements

When this source file changes, update the raw source layer, normalized source layer, hash history, this rendered page, generated explanation, source-file inventory, changed-files report, and any source-section index that links to it.

Provenance And History

Current observation: 2026-05-15T00:23:56.0837262Z
Source origin: current-source-workspace
Retrieval method: local-source-workspace
Duplicate group: sfg-107 (primary)
Historical hash records are stored in data/hashes/source-file-history.jsonl.

Machine-Readable Metadata

{
    "title":  "Designing An Optimized Keyless JSON Representation From HTML",
    "source_site":  "ɩ.com / JustAnIota.com",
    "source_url":  "https://justaniota.com/",
    "canonical_url":  "https://aiwikis.org/justaniota/uai-system/files/raw-system-archives-justaniota-intake-processing-2026-05-03-agent-file-h-2360de1e/",
    "source_reference":  "raw/system-archives/justaniota/intake-processing/2026-05-03/agent-file-handoff/Improvement/Designing an Optimized Keyless JSON Representation from HTML.md",
    "file_type":  "md",
    "content_category":  "memory-file",
    "content_hash":  "sha256:2360de1edc223386e6b241047b774602158bfecc896f1d9b84c7a83fd2ba102b",
    "last_fetched":  "2026-05-15T00:23:56.0837262Z",
    "last_changed":  "2026-05-03T19:06:06.6634107Z",
    "import_status":  "unchanged",
    "duplicate_group_id":  "sfg-107",
    "duplicate_role":  "primary",
    "related_files":  [

                      ],
    "generated_explanation":  true,
    "explanation_last_generated":  "2026-05-15T00:23:56.0837262Z"
}