Hashing

SDIF hashing produces a SHA-256 digest over the canonical bytes of a document. Because canonicalization is deterministic, two documents with the same semantic content produce identical canonical bytes, and therefore an identical hash, regardless of how they were formatted or what comments they contained. When a schema is supplied, the same holds regardless of the order in which the rows of unordered tables were inserted.

The hash is the portable identity of a document's content.

The Pipeline

source.sdif → parser → AST → canonicalizer → canonical bytes → SHA-256

Parser reads the source document, accepting comments, blank lines, and formatting variation.
AST is the language-independent in-memory representation of the document's content.
Canonicalizer renders the AST to a deterministic byte sequence (see Canonicalization).
SHA-256 is computed over those canonical bytes.

The hash is computed over the canonical form, not the source file. Editing a comment or reformatting indentation therefore does not change the hash, and neither does reordering the rows of an unordered table when a schema is supplied (see Schema-Aware vs Schema-Less Hashing). Only changes to semantic content (field values, table data, relation triples, rule expressions) change the hash.
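The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the SDIF implementation: the `key: value` line format with `#` comments is a made-up stand-in for SDIF syntax, and `parse`, `canonicalize`, and `content_hash` are hypothetical names.

```python
import hashlib


def parse(source: str) -> list[tuple[str, str]]:
    """Toy parser (NOT the SDIF grammar): 'key: value' lines.

    '#' comments and blank lines are accepted and dropped.
    """
    fields = []
    for raw in source.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        fields.append((key.strip(), value.strip()))
    return fields


def canonicalize(ast: list[tuple[str, str]]) -> bytes:
    """Render the toy AST to one deterministic byte sequence."""
    return "".join(f"{key}: {value}\n" for key, value in ast).encode("utf-8")


def content_hash(source: str) -> str:
    """source -> parser -> AST -> canonicalizer -> canonical bytes -> SHA-256."""
    digest = hashlib.sha256(canonicalize(parse(source))).hexdigest()
    return f"sha256:{digest}"


plain = "kind: plan\nowner: ada\n"
commented = "# draft\nkind:   plan\n\nowner: ada   # reviewed\n"
edited = "kind: plan\nowner: bob\n"

print(content_hash(plain) == content_hash(commented))  # comments and spacing only
print(content_hash(plain) == content_hash(edited))     # a field value changed
```

The first comparison holds because both sources parse to the same toy AST; the second fails because a field value differs.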

What the Hash Covers

The hash includes all semantic content:

The version header and profile
The kind declaration
All scalar fields and their values
All table headers, column names, and row values
All relation triples in the rel: block
All rule expressions in the rules: block

The hash does not cover:

Comments (stripped during canonicalization)
Blank lines (stripped during canonicalization)
Formatting whitespace beyond what is semantically significant

Two source documents that differ only in comments, blank lines, or non-significant whitespace will produce the same hash.

Schema-Aware vs Schema-Less Hashing

The hash depends on how the canonicalizer orders table rows, and that ordering depends on whether a schema is available.

Without a schema, all tables are treated as ordered: rows appear in the canonical form in the same sequence they were written in the source. Two source files with the same rows in different orders will produce different hashes.

With a schema, tables declared as ordered=false have their rows sorted by the declared primary key. Two source files with the same unordered table rows written in different orders will produce the same canonical bytes and the same hash.

For stable, insertion-order-independent hashing, use schema-aware hashing.
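The effect of the schema on row ordering can be illustrated with a small sketch. It assumes a hypothetical `canonical_rows` helper and a simplified row serialization that is not SDIF's actual canonical form; only the ordering rule described above is modeled.

```python
import hashlib


def canonical_rows(rows, primary_key=None, ordered=True):
    """Return rows in the order they would appear in the canonical form.

    Schema-less (no primary key known) or ordered=True: keep source order.
    Schema-aware with ordered=False: sort by the declared primary key.
    """
    if ordered or primary_key is None:
        return list(rows)
    return sorted(rows, key=lambda row: row[primary_key])


def table_hash(rows, primary_key=None, ordered=True):
    """Hash a simplified, illustrative serialization of the rows."""
    lines = [
        "|".join(f"{column}={value}" for column, value in row.items())
        for row in canonical_rows(rows, primary_key, ordered)
    ]
    data = "\n".join(lines).encode("utf-8")
    return "sha256:" + hashlib.sha256(data).hexdigest()


first = [{"id": "t1", "title": "design"}, {"id": "t2", "title": "build"}]
second = list(reversed(first))  # same rows, written in a different order

schema_less_equal = table_hash(first) == table_hash(second)
schema_aware_equal = (
    table_hash(first, primary_key="id", ordered=False)
    == table_hash(second, primary_key="id", ordered=False)
)
print(schema_less_equal, schema_aware_equal)
```

In this sketch the schema-less hashes differ, while the schema-aware hashes agree because both inputs are sorted by `id` before serialization.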

CLI

Hash a document without a schema:

sdif hash plan.sdif

Hash a document with a schema (enables stable row ordering for unordered tables):

sdif hash plan.sdif --schema schema.sdif

Example output:

sha256:3a7f2c9d1e4b8f6a0d5c2e9b7f1a4d6c3e8f0b2a9d5c7e1f4b8a3d6c0e2f9b7

The output is the prefix sha256: followed by the hex-encoded digest. Tools that store or compare hashes should include the prefix to indicate the hash function used.
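A tool that stores or compares these values might validate the prefix before comparing digests. The following is a sketch under that assumption; `parse_digest` and `digests_match` are hypothetical helpers, not part of any SDIF tooling.

```python
import hashlib
import string


def parse_digest(text: str) -> tuple[str, str]:
    """Split 'sha256:<hex>' into (algorithm, lowercase hex digest)."""
    algorithm, sep, hex_digest = text.strip().partition(":")
    if not sep:
        raise ValueError("missing algorithm prefix")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash function: {algorithm}")
    if len(hex_digest) != 64:
        raise ValueError("a SHA-256 digest is 64 hex characters")
    if not all(ch in string.hexdigits for ch in hex_digest):
        raise ValueError("digest is not hexadecimal")
    return algorithm, hex_digest.lower()


def digests_match(stored: str, recomputed: str) -> bool:
    """Compare two prefixed digests, ignoring surrounding whitespace and hex case."""
    return parse_digest(stored) == parse_digest(recomputed)


# Placeholder bytes stand in for a document's canonical form.
stored = "sha256:" + hashlib.sha256(b"example canonical bytes").hexdigest()
recomputed = stored + "\n"  # e.g. captured from command output with a newline

print(digests_match(stored, recomputed))
```

Keeping the algorithm name next to the digest lets a comparison fail loudly if the two values were produced by different hash functions.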

Use Cases

Content addressing. Store the hash alongside the document. Later, recompute the hash to verify that the document's semantic content has not changed since it was stored. This is especially useful when SDIF documents are transmitted over untrusted channels or stored in mutable locations, provided the reference hash itself is kept where it cannot be modified along with the document.

Tamper detection. If a document is signed or notarized, the signature is computed over the canonical bytes (the same bytes the hash covers). Recomputing the hash later and comparing it to the stored value reveals whether the document has been altered since signing.

Cache keys. Systems that derive outputs from SDIF documents — validations, embeddings, projections, reports — can use the hash as a cache key. If the hash matches a previously computed result, the derivation can be skipped. Two source files with different whitespace but the same content share a cache entry.
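As a rough sketch of the cache-key idea, the following keeps derived results in a dictionary keyed by the document hash. `DerivationCache` and `hash_of` are hypothetical names, and the byte strings are placeholders for real canonical bytes.

```python
import hashlib
from typing import Callable, Dict


class DerivationCache:
    """Stores derived outputs keyed by a document's content hash."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}
        self.computations = 0  # how many times a derivation actually ran

    def get_or_compute(self, content_hash: str, compute: Callable[[], object]) -> object:
        if content_hash not in self._results:
            self.computations += 1
            self._results[content_hash] = compute()
        return self._results[content_hash]


def hash_of(canonical: bytes) -> str:
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


cache = DerivationCache()
canonical = b"kind: plan\n"          # placeholder canonical bytes
key_original = hash_of(canonical)     # hash of the original source
key_reformatted = hash_of(canonical)  # reformatted source, same canonical bytes

report_1 = cache.get_or_compute(key_original, lambda: "report for plan")
report_2 = cache.get_or_compute(key_reformatted, lambda: "report for plan")
print(cache.computations)
```

Because both sources canonicalize to the same bytes, the second lookup reuses the first result and the derivation runs only once.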

Tracking document versions. When a document evolves over time, storing the hash at each revision provides a lightweight audit trail. A change in hash indicates a change in semantic content; no change in hash indicates the document is semantically identical to its previous version, even if it was reformatted.
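One way to read such an audit trail is to compare each revision's hash with the previous one. This is an illustrative sketch only; `semantic_changes` and the revision labels are made up, and the byte strings stand in for canonical bytes.

```python
import hashlib


def digest(canonical: bytes) -> str:
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


def semantic_changes(history: list[tuple[str, str]]) -> list[str]:
    """Return the labels of revisions whose hash differs from the previous revision."""
    changed = []
    for (_, previous_hash), (label, current_hash) in zip(history, history[1:]):
        if current_hash != previous_hash:
            changed.append(label)
    return changed


history = [
    ("r1", digest(b"owner: ada\n")),
    ("r2", digest(b"owner: ada\n")),  # reformatted only: same canonical bytes
    ("r3", digest(b"owner: bob\n")),  # a field value changed
]

print(semantic_changes(history))
```

Here only `r3` is reported, since `r2` hashes to the same value as `r1`.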

Relationship to Canonicalization

Hashing and canonicalization are closely related. Canonicalization defines the byte sequence; hashing measures it. The sdif canon command produces the canonical bytes. The sdif hash command computes the hash of those bytes. Running sdif hash yields the same digest as running sdif canon | sha256sum, but it is more convenient and prints the digest with the sha256: prefix.

See the Canonicalization page for a full description of what canonicalization normalizes and how it handles ordering.
