# SDIF — Full Reference for Assistants
SDIF is the Semantic Data Interchange Format, version 1.0. It is a text-based,
semantically structured format designed for AI agents and deterministic machine
workflows. It supports tables, relations, schemas, and rules. It has a canonical
form and a stable hash. It has a compact AI projection with full round-trip
fidelity.
Install with `pip install sdif-format`. The PyPI package name is `sdif-format`;
the CLI command and the Python import are both `sdif`.
GitHub org: https://github.com/sdif-format
---
## File Extensions
- `.sdif`: Source document. Human-readable and human-editable.
- `.sdif.canon`: Canonical form. Produced by `sdif canon`. Used for hashing,
  signing, and deterministic comparison.
- `.sdif.ai`: AI projection. Produced by `sdif ai`. Compact, derived from
  source. Reversible to canonical source via `sdif from-ai`.
---
## Syntax Summary
### Document Header
Every SDIF document begins with a version and kind declaration.

    @sdif 1.0
    kind Plan

Recognized kinds include: Plan, Registry, Schema, Report, and user-defined kinds.
### Tables
Tables declare column names inline in the header, followed by indented data rows.
Columns are separated by a literal horizontal tab character (U+0009, HTAB). Do not
substitute spaces.

    tasks[id,title,status]:
      1	"Design schema"	done
      2	"Write parser"	done
      3	"Publish docs"	in-progress

The table header is `name[col1,col2,...]:`. Each indented row provides values in
that column order, separated by HTAB. Blank lines end the table block.
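
The table layout described above can be read with a few lines of Python. This is a minimal sketch, not the `sdif` package's parser: the `HEADER_RE` pattern and the `parse_table` helper are hypothetical names, rows are assumed to be indented with spaces, and quoted values are kept verbatim rather than unquoted.

```python
import re

# Assumed header shape, taken from the `name[col1,col2,...]:` description.
HEADER_RE = re.compile(r"^(?P<name>[^\[\]\s]+)\[(?P<cols>[^\]]*)\]:\s*$")


def parse_table(block: str) -> tuple[str, list[dict[str, str]]]:
    """Parse one table block into (table name, list of row dicts)."""
    lines = block.split("\n")
    match = HEADER_RE.match(lines[0])
    if match is None:
        raise ValueError(f"not a table header: {lines[0]!r}")
    columns = match.group("cols").split(",")
    rows = []
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the table block
        # Assumption: rows are indented with spaces; cells are HTAB-separated.
        cells = line.lstrip(" ").split("\t")
        if len(cells) != len(columns):
            raise ValueError(f"expected {len(columns)} cells: {line!r}")
        rows.append(dict(zip(columns, cells)))
    return match.group("name"), rows


block = 'tasks[id,title,status]:\n  1\t"Design schema"\tdone\n  2\t"Write parser"\tdone\n'
name, rows = parse_table(block)
print(name, rows[0])
```

A row whose cells are separated by spaces instead of HTAB fails the cell-count check, which is the failure mode the warning above is about.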
### Relations
Relations express typed triples: subject, predicate, object. They are grouped
under a `rel:` block, with each triple on its own indented line.

    rel:
      tasks#1 blocks tasks#3
      tasks#2 depends-on tasks#1

A relation links two entities by a named predicate. The subject and object
each use the form `table#id` (for example, `tasks#1`) or a bare identifier.
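
As an illustration of the triple structure, the sketch below splits a `rel:` block into subject, predicate, and object. The `Ref`, `Triple`, `parse_ref`, and `parse_rel_block` names are hypothetical, and the code assumes the three parts of a triple are whitespace-separated, which this reference does not state explicitly.

```python
from typing import NamedTuple, Optional


class Ref(NamedTuple):
    table: Optional[str]  # None for a bare identifier
    ident: str


class Triple(NamedTuple):
    subject: Ref
    predicate: str
    obj: Ref


def parse_ref(text: str) -> Ref:
    """Split `table#id`; anything without `#` is treated as a bare identifier."""
    table, sep, ident = text.partition("#")
    return Ref(table, ident) if sep else Ref(None, text)


def parse_rel_block(block: str) -> list[Triple]:
    lines = block.splitlines()
    if not lines or lines[0].strip() != "rel:":
        raise ValueError("expected a block starting with 'rel:'")
    triples = []
    for line in lines[1:]:
        if not line.strip():
            continue
        parts = line.split()  # assumption: whitespace-separated parts
        if len(parts) != 3:
            raise ValueError(f"expected subject predicate object: {line!r}")
        subject, predicate, obj = parts
        triples.append(Triple(parse_ref(subject), predicate, parse_ref(obj)))
    return triples


triples = parse_rel_block("rel:\n  tasks#1 blocks tasks#3\n  tasks#2 depends-on tasks#1\n")
print(triples[0])
```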
### Rules
Rules express declarative constraints. They are not executable code.

    rule: if tasks.status == "in-progress" then tasks.owner must exist

Rules are v1 syntax and are parsed but not yet evaluated by the validator.
### Comments
Lines beginning with `#` are comments and are stripped during parsing.

    # This is a comment.

---
## AI Projection Syntax (.sdif.ai)
AI projection files begin with the `@sdif.ai` header and include metadata.

    @sdif.ai 1.0
    source-hash: sha256:<64-char hex>
    lossless: true

Tables in AI projections may use shortened column aliases declared in the header:

    aliases: id=i title=t status=s

Rows follow the same HTAB-delimited format.
Relations in AI projections use grouped syntax (valid only in .sdif.ai files):

    rel[tasks#1]: blocks tasks#3

The `from-ai` command reverses the projection and reconstructs standard `rel:`
triples in the output SDIF document.
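
To make the two projection features concrete, here is a rough Python sketch of how an `aliases:` line and a grouped relation line could be expanded back to their source form. It is not the implementation behind `sdif from-ai`; `parse_aliases` and `expand_grouped_rel` are hypothetical helpers, and the sketch assumes each grouped line carries exactly one predicate and one object, as in the example above.

```python
import re

# Assumed shape of a grouped relation line: rel[subject]: predicate object
GROUPED_REL_RE = re.compile(
    r"^rel\[(?P<subject>[^\]]+)\]:\s*(?P<predicate>\S+)\s+(?P<object>\S+)\s*$"
)


def parse_aliases(line: str) -> dict[str, str]:
    """Return a mapping from alias to full field name."""
    prefix = "aliases:"
    if not line.startswith(prefix):
        raise ValueError(f"not an aliases line: {line!r}")
    pairs = (item.split("=", 1) for item in line[len(prefix):].split())
    return {alias: field for field, alias in pairs}


def expand_grouped_rel(line: str) -> str:
    """Rewrite `rel[s]: p o` as the plain triple `s p o`."""
    match = GROUPED_REL_RE.match(line)
    if match is None:
        raise ValueError(f"not a grouped relation: {line!r}")
    return f"{match['subject']} {match['predicate']} {match['object']}"


aliases = parse_aliases("aliases: id=i title=t status=s")
triple = expand_grouped_rel("rel[tasks#1]: blocks tasks#3")
print(aliases, triple)
```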
---
## Canonicalization Pipeline
1. Parse the source .sdif document.
2. Sort tables alphabetically by name.
3. Within each table, sort rows by the first column (id field).
4. Normalize whitespace: strip trailing whitespace from each line; use LF line endings.
5. Serialize to a deterministic byte sequence under canonical-syntax-v1.
6. Write output to .sdif.canon.
The canonical form is syntax-level only. There is no semantic normalization in
v1. Numeric equivalences (1 vs 1.0), date formats, and alias variations are not
resolved. Two documents that express the same facts in different syntax produce
different canonical bytes and different hashes.
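
Steps 2 to 4 of the pipeline can be sketched in Python as below. This only illustrates the ordering and whitespace rules; it does not reproduce canonical-syntax-v1 serialization. The function names are hypothetical, and comparing first-column values as plain strings is an assumption (under it, `"10"` sorts before `"2"`); the actual comparison rule is not specified in this reference.

```python
def canonical_order(
    tables: dict[str, list[list[str]]],
) -> list[tuple[str, list[list[str]]]]:
    """Steps 2-3: tables sorted by name, rows sorted by their first column.

    Assumption: first-column values are compared as strings.
    """
    return [
        (name, sorted(rows, key=lambda row: row[0]))
        for name, rows in sorted(tables.items())
    ]


def normalize_whitespace(text: str) -> str:
    """Step 4: strip trailing whitespace per line and use LF line endings."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines)


tables = {
    "tasks": [["2", "Write parser"], ["1", "Design schema"]],
    "milestones": [["1", "Alpha"]],
}
ordered = canonical_order(tables)
print([name for name, _ in ordered])
print(repr(normalize_whitespace("kind Plan  \r\n@sdif 1.0\t\r\n")))
```

Because nothing here touches the values themselves, `1` and `1.0` pass through unchanged, which matches the syntax-level scope described above.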
---
## SHA-256 Hash
The hash is computed over canonical bytes.

    sdif hash

Output: `sha256:<64-char hex>`
The hash identifies a document's content. It is stable across environments
that follow the canonical-syntax-v1 algorithm.
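
The output format can be reproduced with the Python standard library. This is a sketch of the hashing step only: `content_hash` is a hypothetical helper, and the byte strings are stand-ins for canonical output, not real `sdif canon` results.

```python
import hashlib


def content_hash(canonical_bytes: bytes) -> str:
    """Format a SHA-256 digest as `sha256:<64-char hex>`."""
    return "sha256:" + hashlib.sha256(canonical_bytes).hexdigest()


# Stand-in byte strings that differ only in how one number is written.
hash_int = content_hash(b"values[id,n]:\n  1\t1\n")
hash_float = content_hash(b"values[id,n]:\n  1\t1.0\n")
print(hash_int)
print(hash_int != hash_float)
```

Since the digest covers bytes rather than meaning, any syntactic difference that survives canonicalization changes the hash.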
---
## Schema Format
A schema document has `kind Schema` and defines expected tables, fields, and types.

    @sdif 1.0
    kind Schema
    table tasks-schema
      field	type	required
      id	integer	true
      title	string	true
      status	enum(done,in-progress,blocked)	false
      owner	string	false

Pass a schema to validation and canonicalization commands with the `--schema` option.
The schema file must be a valid SDIF document with `kind Schema`. Do not pass a
.sdif.canon file as the schema argument.
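
To show what field-level checks of this kind involve, here is a small Python sketch that tests one row against the field definitions above. It is not the `sdif validate` implementation: `value_matches` and `check_row` are hypothetical, the type names are the ones appearing in this reference (`integer`, `string`, `boolean`, `enum(...)`), and the matching rules for each type are assumptions.

```python
import re

ENUM_RE = re.compile(r"^enum\((?P<values>[^)]*)\)$")


def value_matches(type_name: str, value: str) -> bool:
    """Assumed matching rules for the type names used in this reference."""
    if type_name == "integer":
        return re.fullmatch(r"-?\d+", value) is not None
    if type_name == "boolean":
        return value in ("true", "false")
    if type_name == "string":
        return True
    enum = ENUM_RE.match(type_name)
    if enum:
        return value in enum.group("values").split(",")
    raise ValueError(f"unknown type: {type_name}")


def check_row(row: dict[str, str], schema: dict[str, tuple[str, bool]]) -> list[str]:
    """Report missing required fields, type mismatches, and unrecognized fields."""
    problems = []
    for field, (type_name, required) in schema.items():
        if field not in row:
            if required:
                problems.append(f"missing required field: {field}")
        elif not value_matches(type_name, row[field]):
            problems.append(f"type mismatch: {field}={row[field]!r} is not {type_name}")
    problems.extend(f"unrecognized field: {f}" for f in row if f not in schema)
    return problems


schema = {
    "id": ("integer", True),
    "title": ("string", True),
    "status": ("enum(done,in-progress,blocked)", False),
    "owner": ("string", False),
}
print(check_row({"id": "one", "status": "paused", "extra": "y"}, schema))
```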
---
## CLI Commands
All commands operate on .sdif source files unless noted.
- `sdif parse`: Parse and display the document structure. Useful for debugging.
- `sdif canon [--schema ]`: Print canonical bytes to stdout. Optionally validate
  against a schema first.
- `sdif hash [--schema ]`: Print the SHA-256 hash of the canonical form.
- `sdif tokens`: Count tokens in the document using the configured tokenizer.
- `sdif to-json`: Convert SDIF to JSON. Output is compact JSON by default.
- `sdif from-json`: Convert JSON to SDIF source.
- `sdif ai [--alias FIELD=ALIAS ...]`: Generate an AI projection (.sdif.ai).
  Aliases shorten column names. Multiple `--alias` flags are supported.
- `sdif from-ai`: Reconstruct SDIF source from an AI projection. Preserves the
  canonical hash.
- `sdif validate --schema`: Validate a document against a `kind Schema` document.
  Reports type mismatches, missing required fields, and unrecognized fields.
- `sdif inspect`: Display document structure, field counts, relation counts, and
  metadata.
- `sdif fmt [--check]`: Format the source document in place. With `--check`, exits
  non-zero if the file is not already formatted (for CI use).
---
## Key Examples
### Plan Document
    @sdif 1.0
    kind Plan
    milestones[id,name,target]:
      1	Alpha	2025-Q1
      2	Beta	2025-Q2
      3	GA	2025-Q3

    rel:
      milestones#1 precedes milestones#2
      milestones#2 precedes milestones#3

### Registry Document
    @sdif 1.0
    kind Registry
    packages[id,name,version,stable]:
      1	sdif-format	1.0.0	true
      2	sdif-benchmarks	0.1.0	true
      3	tree-sitter-sdif	0.1.0	true

### Schema Document
    @sdif 1.0
    kind Schema
    packages-schema[field,type,required]:
      id	integer	true
      name	string	true
      version	string	true
      stable	boolean	false

---
## Benchmark Methodology Summary
Formats compared: SDIF, SDIF AI, JSON Compact, JSON Pretty, YAML, XML, TOON.
Corpus: four SDIF example documents (plan, registry, schema, validation-report).
Metrics:
- Byte size: raw UTF-8 byte count
- Token count: measured with cl100k_base (OpenAI GPT-4 family)
- Semantic density: tokens per semantic fact (field value or relation triple)
- Round-trip fidelity: data preserved through JSON round-trip; hash preserved
through AI round-trip
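
The semantic density metric is a simple ratio, sketched below in Python. The helper names are hypothetical, and the token count of 44 is a made-up input chosen for easy arithmetic, not a measured benchmark figure; a real count would come from the tokenizer.

```python
def count_facts(tables: dict[str, list[list[str]]], relation_count: int) -> int:
    """One fact per field value plus one fact per relation triple."""
    field_values = sum(len(row) for rows in tables.values() for row in rows)
    return field_values + relation_count


def semantic_density(token_count: int, fact_count: int) -> float:
    """Tokens per semantic fact; lower means denser."""
    if fact_count <= 0:
        raise ValueError("fact_count must be positive")
    return token_count / fact_count


# A three-row, three-column table with two relation triples.
milestones = [
    ["1", "Alpha", "2025-Q1"],
    ["2", "Beta", "2025-Q2"],
    ["3", "GA", "2025-Q3"],
]
facts = count_facts({"milestones": milestones}, relation_count=2)
density = semantic_density(token_count=44, fact_count=facts)  # 44 is illustrative
print(facts, density)
```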
Serialization rules:
- JSON Compact: json.dumps(obj, separators=(',', ':'))
- JSON Pretty: json.dumps(obj, indent=2)
- YAML: yaml.dump(obj) default output
- SDIF: source bytes as-is
- SDIF AI: output of sdif ai
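
The two JSON rules and the byte-size metric can be reproduced with the standard library alone. The sample object below is an illustrative assumption, not one of the corpus documents.

```python
import json

# Illustrative object, loosely modelled on the tasks table in this reference.
doc = {"tasks": [{"id": 1, "title": "Design schema", "status": "done"}]}

compact = json.dumps(doc, separators=(",", ":"))  # JSON Compact rule
pretty = json.dumps(doc, indent=2)                # JSON Pretty rule


def byte_size(text: str) -> int:
    """Raw UTF-8 byte count, as used for the byte size metric."""
    return len(text.encode("utf-8"))


print(compact)
print(byte_size(compact), byte_size(pretty))
```

Both strings decode to the same object; only whitespace differs, which is why the two variants are reported as separate formats.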
Results are reported per tokenizer. Do not compare results across tokenizers.
Reproducible suite: https://github.com/sdif-format/sdif-benchmarks
---
## Known v1 Limitations
- No semantic normalization. The canonical form is syntax-level. Numeric
values like 1 and 1.0 produce different canonical bytes and different hashes
even if they represent the same quantity.
- No numeric equivalence. 42 and 42.0 are treated as distinct values.
- No date-time zone normalization. 2025-01-01T00:00:00Z and
2025-01-01T00:00:00+00:00 are treated as distinct strings.
- No query language. SDIF is a representation format, not a query system.
- No binary serialization. SDIF is a text format only.
- Rules are parsed but not evaluated. Rule constraints are declared in v1
syntax but are not enforced by the validator in v1.0.
- Benchmark corpus is small. Four example documents do not cover all
real-world document types. Results may not generalize.
- Token counts are tokenizer-specific. Results for non-OpenAI models will
differ from published cl100k_base figures.
---
## Rules for Assistants
- Do not invent SDIF syntax. Use only constructs documented here.
- Table columns are separated by literal HTAB (U+0009), not spaces.
- Treat SDIF and SDIF AI as related but distinct representations.
Do not mix their syntax.
- SDIF AI projections must round-trip to the same canonical hash as the source.
This is the correctness criterion for AI projection, not byte identity.
- Mention v1 limitations when discussing benchmarks or comparing formats.
- Schemas are kind Schema documents. Do not pass .sdif.canon files as schemas.
- rel[subject]: syntax is valid only in .sdif.ai files. Do not emit it in
.sdif source documents.
- The source hash in .sdif.ai is the SHA-256 of the canonical source, not of
the AI file itself.
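
Two of these rules are mechanical enough to check before emitting a source document. The Python sketch below is a rough pre-flight check, not part of the `sdif` tooling: `lint_source` is a hypothetical helper, and it assumes table rows are indented with spaces and that the header pattern is `name[col1,col2,...]:`.

```python
import re

TABLE_HEADER_RE = re.compile(r"^[^\[\]\s]+\[([^\]]*)\]:\s*$")
GROUPED_REL_RE = re.compile(r"^rel\[[^\]]+\]:")


def lint_source(text: str) -> list[str]:
    """Flag grouped relations and rows that are not HTAB-separated."""
    problems = []
    expected_cells = 0  # greater than zero while inside a table block
    for number, line in enumerate(text.split("\n"), start=1):
        header = TABLE_HEADER_RE.match(line)
        if GROUPED_REL_RE.match(line):
            problems.append(f"line {number}: rel[subject]: is only valid in .sdif.ai files")
            expected_cells = 0
        elif header:
            expected_cells = len(header.group(1).split(","))
        elif expected_cells and line.startswith(" ") and line.strip():
            if len(line.strip(" ").split("\t")) != expected_cells:
                problems.append(f"line {number}: expected {expected_cells} HTAB-separated cells")
        else:
            expected_cells = 0  # blank or unindented line ends the table
    return problems


good = 'tasks[id,title]:\n  1\t"Design schema"\n'
bad = 'tasks[id,title]:\n  1 "Design schema"\nrel[tasks#1]: blocks tasks#3\n'
print(lint_source(good))
print(lint_source(bad))
```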