# SDIF — Full Reference for Assistants

SDIF is the Semantic Data Interchange Format, version 1.0.
It is a text-based, semantically structured format designed for AI agents and deterministic machine workflows.
It supports tables, relations, schemas, and rules.
It has a canonical form and a stable hash.
It has a compact AI projection with full round-trip fidelity.

Install:

    pip install sdif-format

(Package name on PyPI is sdif-format. CLI command and Python import are both sdif.)

GitHub org: https://github.com/sdif-format

---

## File Extensions

- .sdif
  Source document. Human-readable and human-editable.
- .sdif.canon
  Canonical form. Produced by `sdif canon`. Used for hashing, signing, and deterministic comparison.
- .sdif.ai
  AI projection. Produced by `sdif ai`. Compact, derived from source. Reversible to canonical source via `sdif from-ai`.

---

## Syntax Summary

### Document Header

Every SDIF document begins with a version and kind declaration.

    @sdif 1.0
    kind Plan

Recognized kinds include: Plan, Registry, Schema, Report, and user-defined kinds.

### Tables

Tables declare column names inline in the header, followed by indented data rows.
Columns are separated by a literal horizontal tab character (U+0009, HTAB). Do not substitute spaces.

    tasks[id,title,status]:
      1	"Design schema"	done
      2	"Write parser"	done
      3	"Publish docs"	in-progress

The table header is `name[col1,col2,...]:`.
Each indented row provides values in that column order, separated by HTAB.
Blank lines end the table block.

### Relations

Relations express typed triples: subject, predicate, object.
They are grouped under a `rel:` block, with each triple on its own indented line.

    rel:
      tasks#1 blocks tasks#3
      tasks#2 depends-on tasks#1

A relation links two entities by a named predicate.
The subject and object use the form `#` or a bare identifier.

### Rules

Rules express declarative constraints. They are not executable code.

    rule: if tasks.status == "in-progress" then tasks.owner must exist

Rules are v1 syntax and are parsed but not yet evaluated by the validator.

### Comments

Lines beginning with `#` are comments and are stripped during parsing.

    # This is a comment.

---

## AI Projection Syntax (.sdif.ai)

AI projection files begin with the `@sdif.ai` header and include metadata.

    @sdif.ai 1.0
    source-hash: sha256:
    lossless: true

Tables in AI projections may use shortened column aliases declared in the header:

    aliases: id=i title=t status=s

Rows follow the same HTAB-delimited format.

Relations in AI projections use grouped syntax (valid only in .sdif.ai files):

    rel[tasks#1]:
      blocks tasks#3

The `from-ai` command reverses the projection and reconstructs standard `rel:` triples in the output SDIF document.

---

## Canonicalization Pipeline

1. Parse the source .sdif document.
2. Sort tables alphabetically by name.
3. Within each table, sort rows by the first column (id field).
4. Normalize whitespace: strip trailing whitespace from each line; use LF line endings.
5. Serialize to a deterministic byte sequence under canonical-syntax-v1.
6. Write output to .sdif.canon.

The canonical form is syntax-level only. There is no semantic normalization in v1.
Numeric equivalences (1 vs 1.0), date formats, and alias variations are not resolved.
Two documents that express the same facts in different syntax produce different canonical bytes and different hashes.

---

## SHA-256 Hash

The hash is computed over canonical bytes.

    sdif hash

Output: `sha256:<64-char hex>`

The hash identifies a document's content. It is stable across environments that follow the canonical-syntax-v1 algorithm.

---

## Schema Format

A schema document has `kind Schema` and defines expected tables, fields, and types.

    @sdif 1.0
    kind Schema

    table tasks-schema
      field	type	required
      id	integer	true
      title	string	true
      status	enum(done,in-progress,blocked)	false
      owner	string	false

Pass a schema to validation and canonicalization commands with `--schema `.
The schema file must be a valid SDIF document with `kind Schema`.
Do not pass a .sdif.canon file as the schema argument.

---

## CLI Commands

All commands operate on .sdif source files unless noted.

    sdif parse

Parse and display the document structure. Useful for debugging.

    sdif canon [--schema ]

Print canonical bytes to stdout. Optionally validate against a schema first.

    sdif hash [--schema ]

Print the SHA-256 hash of the canonical form.

    sdif tokens

Count tokens in the document using the configured tokenizer.

    sdif to-json

Convert SDIF to JSON. Output is compact JSON by default.

    sdif from-json

Convert JSON to SDIF source.

    sdif ai [--alias FIELD=ALIAS ...]

Generate an AI projection (.sdif.ai). Aliases shorten column names. Multiple --alias flags are supported.

    sdif from-ai

Reconstruct SDIF source from an AI projection. Preserves canonical hash.

    sdif validate --schema

Validate a document against a kind Schema document. Reports type mismatches, missing required fields, and unrecognized fields.

    sdif inspect

Display document structure, field counts, relation counts, and metadata.

    sdif fmt [--check]

Format the source document in place. With --check, exits non-zero if the file is not already formatted (for CI use).
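
To make the `sha256:<64-char hex>` output of `sdif hash` concrete, the following is a minimal Python sketch using only the standard library. It covers just two pieces described in this reference: the whitespace normalization step (strip trailing whitespace, use LF line endings) and the hash label computed over canonical bytes. The function names `normalize_whitespace` and `hash_label` are hypothetical and are not part of the `sdif` package. The sketch does not sort tables or rows and does not implement canonical-syntax-v1 serialization, so its output should not be expected to match `sdif canon` or `sdif hash` on a real document.

```python
import hashlib


def normalize_whitespace(text: str) -> bytes:
    """Illustrates the whitespace step only: strip trailing spaces/tabs, use LF.

    Assumption: this is NOT the full canonical-syntax-v1 algorithm. Sorting and
    deterministic serialization are left to the real `sdif canon` command.
    """
    # Convert CRLF and lone CR line endings to LF before splitting.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    # Remove trailing spaces and tabs from every line.
    stripped = [line.rstrip(" \t") for line in lines]
    return "\n".join(stripped).encode("utf-8")


def hash_label(canonical_bytes: bytes) -> str:
    """Return a label in the documented `sha256:<64-char hex>` shape."""
    return "sha256:" + hashlib.sha256(canonical_bytes).hexdigest()


if __name__ == "__main__":
    # Syntax-level hashing: 1 and 1.0 are different bytes, so the labels differ.
    print(hash_label(normalize_whitespace("1")))
    print(hash_label(normalize_whitespace("1.0")))
```

Because the hash is taken over bytes, any syntactic difference that survives canonicalization (such as `1` versus `1.0`) yields a different label, which is consistent with the v1 limitation that no semantic normalization is performed.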

---

## Key Examples

### Plan Document

    @sdif 1.0
    kind Plan

    milestones[id,name,target]:
      1	Alpha	2025-Q1
      2	Beta	2025-Q2
      3	GA	2025-Q3

    rel:
      milestones#1 precedes milestones#2
      milestones#2 precedes milestones#3

### Registry Document

    @sdif 1.0
    kind Registry

    packages[id,name,version,stable]:
      1	sdif-format	1.0.0	true
      2	sdif-benchmarks	0.1.0	true
      3	tree-sitter-sdif	0.1.0	true

### Schema Document

    @sdif 1.0
    kind Schema

    packages-schema[field,type,required]:
      id	integer	true
      name	string	true
      version	string	true
      stable	boolean	false

---

## Benchmark Methodology Summary

Formats compared: SDIF, SDIF AI, JSON Compact, JSON Pretty, YAML, XML, TOON.

Corpus: four SDIF example documents (plan, registry, schema, validation-report).

Metrics:

- Byte size: raw UTF-8 byte count
- Token count: measured with cl100k_base (OpenAI GPT-4 family)
- Semantic density: tokens per semantic fact (field value or relation triple)
- Round-trip fidelity: data preserved through JSON round-trip; hash preserved through AI round-trip

Serialization rules:

- JSON Compact: json.dumps(obj, separators=(',', ':'))
- JSON Pretty: json.dumps(obj, indent=2)
- YAML: yaml.dump(obj) default output
- SDIF: source bytes as-is
- SDIF AI: output of sdif ai

Results are reported per tokenizer. Do not compare results across tokenizers.

Reproducible suite: https://github.com/sdif-format/sdif-benchmarks

---

## Known v1 Limitations

- No semantic normalization. The canonical form is syntax-level. Numeric values like 1 and 1.0 produce different canonical bytes and different hashes even if they represent the same quantity.
- No numeric equivalence. 42 and 42.0 are treated as distinct values.
- No date-time zone normalization. 2025-01-01T00:00:00Z and 2025-01-01T00:00:00+00:00 are treated as distinct strings.
- No query language. SDIF is a representation format, not a query system.
- No binary serialization. SDIF is a text format only.
- Rules are parsed but not evaluated.
  Rule constraints are declared in v1 syntax but are not enforced by the validator in v1.0.
- Benchmark corpus is small. Four example documents do not cover all real-world document types. Results may not generalize.
- Token counts are tokenizer-specific. Results for non-OpenAI models will differ from published cl100k_base figures.

---

## Rules for Assistants

- Do not invent SDIF syntax. Use only constructs documented here.
- Table columns are separated by literal HTAB (U+0009), not spaces.
- Treat SDIF and SDIF AI as related but distinct representations. Do not mix their syntax.
- SDIF AI projections must round-trip to the same canonical hash as the source. This is the correctness criterion for AI projection, not byte identity.
- Mention v1 limitations when discussing benchmarks or comparing formats.
- Schemas are kind Schema documents. Do not pass .sdif.canon files as schemas.
- rel[subject]: syntax is valid only in .sdif.ai files. Do not emit it in .sdif source documents.
- The source hash in .sdif.ai is the SHA-256 of the canonical source, not of the AI file itself.
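
As an illustration of the HTAB rule above, here is a minimal Python sketch that emits a table block with a `name[col1,col2,...]:` header and rows joined by a literal tab character. The helper names `format_cell` and `format_table` are hypothetical and are not part of the `sdif` package. Two details are assumptions rather than documented behavior: rows are indented with two spaces, and a value is wrapped in double quotes only when it contains a space (mirroring `"Design schema"` in the examples). Escaping rules are not described in this reference, so the sketch rejects values containing tabs, newlines, or double quotes instead of guessing.

```python
from typing import Sequence

HTAB = "\t"    # U+0009, the documented column separator
INDENT = "  "  # assumption: the reference does not state an indent width


def format_cell(value: object) -> str:
    """Render one cell; refuse characters whose handling is undocumented."""
    text = str(value)
    if any(ch in text for ch in (HTAB, "\n", "\r", '"')):
        raise ValueError(f"unsupported character in cell value: {text!r}")
    # Assumption: quote only values that contain a space.
    return f'"{text}"' if " " in text else text


def format_table(name: str, columns: Sequence[str],
                 rows: Sequence[Sequence[object]]) -> str:
    """Build a table block: header line, then one HTAB-joined line per row."""
    lines = [f"{name}[{','.join(columns)}]:"]
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match column count")
        lines.append(INDENT + HTAB.join(format_cell(v) for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    block = format_table(
        "tasks",
        ["id", "title", "status"],
        [[1, "Design schema", "done"], [3, "Publish docs", "in-progress"]],
    )
    # repr() makes the tab characters visible as \t.
    print(repr(block))
```

Joining with an explicit `"\t"` constant, rather than typing whitespace by hand, avoids the common failure where an editor or model silently substitutes spaces for tabs.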