About dpv

dpv is a LinkML port of the W3C Data Privacy Vocabulary (DPV) v2.3. It re-expresses the upstream OWL release as a single, navigable LinkML schema together with curated cross-walks to other privacy, AI, security and compliance vocabularies (ISO, NIST, OCSF, OSCAL, ODRL, Gist, CDM, …).

Why a LinkML port?

The canonical DPV release ships as OWL / RDF and HTML. That form is authoritative, but it is awkward to consume from typical data-engineering toolchains. By re-expressing DPV in LinkML we get, from a single source of truth:

  • Pydantic / dataclass / SQL / GraphQL / JSON-Schema bindings via gen-project.
  • Markdown documentation, DBML diagrams, SPARQL, SHACL, OWL and Pandera artifacts via the standard LinkML generators.
  • A simple substrate for authoring SSSOM mappings to adjacent ontologies, so that DPV concepts can be referenced consistently across heterogeneous compliance stacks.
  • Round-trippable example fixtures that double as conformance tests.

Design

Schema layout

  • src/dpv/schema/dpv.yaml - top-level umbrella schema (prefix dpv: -> https://w3id.org/lmodel/dpv/). Re-exports the 8 semantic-group schemas described below.

  • Semantic groups (src/dpv/schema/dpv_<group>.yaml, 8 schemas) - 973 classes and 144 slots covering every term in upstream DPV 2.3 (1:1 coverage). Each group imports only its direct upstream module dependencies, avoiding the load cost of a single monolithic schema:

| Group | Classes | Slots | Upstream modules |
| --- | --- | --- | --- |
| dpv_common | 129 | 47 | status, TOM, rules, context |
| dpv_legal_basis | 166 | 16 | legal_basis (+status), jurisdiction, legal_measures, contract (+clause/control/status/types) |
| dpv_entities | 82 | 28 | entities (+authority/datasubject/legalrole/organisation) |
| dpv_personal_data | 221 | 4 | personal_data, physical_measures, organisational_measures, technical_measures |
| dpv_processing | 293 | 26 | processing (+context/scale), process, purposes |
| dpv_risk_notice | 47 | 19 | risk, notice |
| dpv_rights | 10 | 3 | rights |
| dpv_consent | 25 | 2 | consent (+controls/status/types) |
| Total | 973 | 145 | 34 upstream modules |

(Slot total 145 = 144 upstream DPV-namespace properties + 1 synthetic id on DpvThing.)

  • src/dpv/schema/modules/ - per-module subset schemas. Both standalone-consumable and importable from dpv.yaml. All imports across the tree use the canonical URI form (e.g. dpv:schema/dpv_core) rather than relative file paths, so the same schema files resolve locally during build and over w3id.org/lmodel once published. See Build pipeline for the import-map mechanics.

  • src/dpv/schema/extensions/ - per-extension schemas mirroring the upstream W3C DPV 2.3 extension family. Each extension has its own id, default_prefix, and decomposed submodule tree. Generated by scripts/dpv_extensions_to_linkml.py, which reads the vendored OWL/Turtle files under upstream-releases/dpv/ and converts each extension and its sub-modules to LinkML YAML. Run via just gen-linkml-extensions. See Extensions for the full inventory.

  • src/dpv/mappings/ - SSSOM cross-walks.
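The per-group figures in the table above can be cross-checked with a few lines of Python. This is only an illustrative sanity check, not part of the project's tooling; the counts are copied from the table and the variable names are made up for the example.

```python
# Per-group (classes, slots) counts, copied from the semantic-group table.
GROUP_COUNTS = {
    "dpv_common": (129, 47),
    "dpv_legal_basis": (166, 16),
    "dpv_entities": (82, 28),
    "dpv_personal_data": (221, 4),
    "dpv_processing": (293, 26),
    "dpv_risk_notice": (47, 19),
    "dpv_rights": (10, 3),
    "dpv_consent": (25, 2),
}

# The one slot that does not come from the upstream DPV namespace:
# the synthetic `id` declared on DpvThing.
SYNTHETIC_SLOTS = 1

total_classes = sum(classes for classes, _ in GROUP_COUNTS.values())
total_slots = sum(slots for _, slots in GROUP_COUNTS.values())
upstream_slots = total_slots - SYNTHETIC_SLOTS

print(total_classes, total_slots, upstream_slots)  # 973 145 144
```

The 145 in the Total row and the 144 quoted elsewhere therefore describe the same slot set, with and without the synthetic id.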

Extensions

The upstream W3C DPV 2.3 release ships as a multi-document family. Each extension is published at its own namespace (e.g. https://w3id.org/dpv/ai#) and contains hundreds of specialised terms layered on top of the core ontology. dpv mirrors that layout under src/dpv/schema/extensions/, one LinkML sub-schema per extension (each with its own id, default_prefix, and decomposed submodule tree):

| Extension | dpv path | Upstream namespace | Coverage |
| --- | --- | --- | --- |
| Personal Data (PD) | extensions/pd.yaml + pd/{core,extended}.yaml | https://w3id.org/dpv/pd# | Financial, demographic, health, biometric, identifying, tracking, communication, social leaf categories. |
| Location (LOC) | extensions/loc.yaml + loc/{locations,memberships,inverse}.yaml | https://w3id.org/dpv/loc# | ISO 3166 country & region individuals (EU/EEA member states, US states, UK nations, supranational groupings). |
| Technology (TECH) | extensions/tech.yaml + tech/{core,actors,comms,docs,io,provision,status,tools}.yaml | https://w3id.org/dpv/tech# | Hardware, software, OS, database, cloud, edge, IoT, deployment models - prerequisite for AI. |
| AI | extensions/ai.yaml + ai/{core,risks,systems,lifecycle,techniques,capabilities,data,measures,development}.yaml | https://w3id.org/dpv/ai# | LLM, GPAI / GPAIModel, FineTunedModel, AGI, ExpertSystem, EdgeAI; AI lifecycle (Inception→Decommission); AI-specific risks (DataPoisoning, AdversarialAttack, ModelInversion, ExplainabilityRisk, bias subtree); AI techniques, capabilities, measures. |
| Risk (extension) | extensions/risk.yaml + risk/{incident,incident_status,risk_controls,risk_levels,risk_matrix,risk_management,risk_taxonomy}.yaml | https://w3id.org/dpv/risk# | Typed control verbs (Detection/Intervention/Containment/Identification/Logging/Avoidance/Impact/Elimination/Monitor/Recovery/Remove/Reduction, HaltSource, RemedyControl), Incident / IncidentReport / IncidentStatus lifecycle, concrete consequences (Damage, Detriment, Harm, Injury, Material / NonMaterialDamage, IdentityFraud / Theft …), 5×5 risk matrices, Vulnerability. Distinct from modules/risk (abstract parents only). |
| Justifications | extensions/justifications.yaml + justifications/{nonfulfilment,notrequired,exercise,delay}.yaml | https://w3id.org/dpv/justifications# | NonFulfilment / NonCompliance / NotRequired / Delay / Exercise justification taxonomy. |
| Legal (umbrella) | extensions/legal/ | https://w3id.org/dpv/legal# | Abstract Law, Authority, Treaty, LegalAgreement classes. |
| Legal / EU | extensions/legal/eu.yaml + legal/eu/{gdpr,dga,aiact,rights,ehds,nis2}.yaml (each decomposed into submodule trees, e.g. eu/gdpr/ has 15 sub-files, eu/aiact/ has 13) | https://w3id.org/dpv/legal/eu*# | EU member-state authorities, GDPR articles & lawful bases (A6-1-a…A6-1-f, A9-2-a…), special-category bases, data-subject rights, DPA roles, breach flows, DPIA; Data Governance Act terms; AI Act roles (Provider/Deployer/Importer/Distributor/AuthRep), risk categories (Prohibited / HighRisk / LimitedRisk / MinimalRisk), system types (GPAI / GPAIWithSystemicRisk), Annex III high-risk areas, conformity-assessment; CFREU fundamental-rights articles; EHDS and NIS2. |
| Legal / per-jurisdiction | extensions/legal/<cc>.yaml for 40+ jurisdictions (at, be, bg, cy, cz, de (+ de/gdng), dk, ee, es, fi, fr, gb, gr, hk, hr, hu, ie, in, is, it, jp, kr, li, lt, lu, lv, mo, mt, my, nl, no, ph, pl, pt, ro, se, sg, si, sk, th, tw, us) | https://w3id.org/dpv/legal/<cc># | Per-jurisdiction laws & authorities (BDSG, DPC Ireland, DPDP Act India, UK GDPR / DPA-2018 / ICO, CCPA / CPRA / HIPAA / GLBA / state AGs, …). Exceeds upstream's published per-country set. |
| Sector | extensions/sector/{finance,health,education,law,infra,publicservices}.yaml (+ subdirs) | https://w3id.org/dpv/sector*# | (WIP upstream) FINOS-relevant sector taxonomies. Marked WIP upstream; shipped here for early consumers. |
| Standards | extensions/standards/ieee/ | n/a (cross-walks against IEEE) | IEEE standards alignments. |

All extensions import only linkml:types and dpv:schema/dpv_core (for the abstract parents they specialise), so they can be consumed standalone or aggregated.

Per-extension documentation (possible reusable pattern)

The six top-level extensions (ai, justifications, loc, pd, risk, tech) are rendered as standalone documentation trees under docs/extensions/, grouped under a single Extensions entry in the site navigation. Each tree is produced by an isolated gen-doc invocation against a self-contained staged copy of the extension schema.

The staging step (scripts/linkml_import_tools.py strip-siblings) rewrites each extension so SchemaView can load it without sibling-schema context:

  • Drops every imports: entry except linkml:types (avoids Conflicting URIs errors caused by genuine cross-extension name collisions, e.g. DataAggregationBias in both ai and risk, or DpvData in ai vs. dpv_personal_data).

  • Drops dangling is_a parents and filters mixins lists to only those defined locally, so the resulting schema validates standalone.
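A minimal sketch of the two staging rules just described, applied to an already-parsed schema dictionary. It is not the code in scripts/linkml_import_tools.py: the function name strip_siblings_sketch and the sample class names AiModel, DpvTechnology, LocalMixin and RemoteMixin are hypothetical, and only the classes section is handled for brevity.

```python
import copy


def strip_siblings_sketch(schema: dict) -> dict:
    """Return a staged copy of `schema` that no longer needs sibling schemas."""
    staged = copy.deepcopy(schema)

    # Rule 1: keep only the linkml:types import.
    staged["imports"] = [i for i in staged.get("imports", []) if i == "linkml:types"]

    # Rule 2: drop is_a parents and mixins that are not defined locally.
    classes = staged.get("classes") or {}
    local_names = set(classes)
    for definition in classes.values():
        if not definition:  # a class declared with an empty body
            continue
        if "is_a" in definition and definition["is_a"] not in local_names:
            del definition["is_a"]
        if "mixins" in definition:
            kept = [m for m in definition["mixins"] if m in local_names]
            if kept:
                definition["mixins"] = kept
            else:
                del definition["mixins"]
    return staged


# Hypothetical extension fragment: AiModel's parent lives in another schema.
example = {
    "imports": ["linkml:types", "dpv:schema/dpv_core"],
    "classes": {
        "LocalMixin": {"mixin": True},
        "AiModel": {"is_a": "DpvTechnology", "mixins": ["LocalMixin", "RemoteMixin"]},
        "FineTunedModel": {"is_a": "AiModel"},
    },
}
staged_example = strip_siblings_sketch(example)
```

In this sketch AiModel loses its external parent and the RemoteMixin reference, while FineTunedModel keeps is_a: AiModel because that parent is defined locally. The input dictionary is left unchanged.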

The orchestration recipe (just gen-doc-extensions) iterates src/dpv/schema/extensions/*.yaml, stages each into tmp/extensions/<slug>.yaml, and runs gen-doc into docs/extensions/<slug>/. It is wired as a dependency of the top-level just gen-doc, so a single command produces both the core element pages (docs/elements/) and the per-extension trees. Per-jurisdiction legal/ schemas and sector/, standards/ trees are intentionally not rendered (they re-slice the same core terms and would multiply build time without adding distinct semantics); the docs/extensions/index.md landing page notes this and points readers to the source YAML.

Open-world modelling

DPV is intrinsically open-world: most properties are declared on the top-level DpvThing rather than pinned to a specific class, so any instance may carry any DPV property that makes sense for it. This is preserved in dpv:

  • DpvThing is the abstract base and declares only id.
  • DPV properties are domainless slots that can apply to any descendant.

  • Validation is therefore done via linkml.validator.Validator against the merged schema (JSON-Schema semantics), not by instantiating the generated closed-world dataclasses. The closed-world Pydantic classes are still emitted for use cases where a stricter contract is desired.

Mappings

Cross-walks live under src/dpv/mappings/ as SSSOM TSVs and are merged into the schema as exact_mappings, close_mappings, related_mappings, etc. by the sssom_mappings.py overlay subcommand.
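To make the overlay idea concrete, here is a rough sketch of how SSSOM rows could be folded into a schema's mapping slots. It is an illustration, not the behaviour of sssom_mappings.py: the function names, the predicate-to-slot table, the ex:Consent target and the assumption that a subject CURIE's local name equals the class name are all made up for the example.

```python
import csv

# Assumed correspondence between SKOS predicates and LinkML mapping slots.
PREDICATE_TO_SLOT = {
    "skos:exactMatch": "exact_mappings",
    "skos:closeMatch": "close_mappings",
    "skos:relatedMatch": "related_mappings",
    "skos:narrowMatch": "narrow_mappings",
    "skos:broadMatch": "broad_mappings",
}


def read_sssom_rows(text: str) -> list[dict]:
    """Parse SSSOM TSV text, skipping the '#'-prefixed metadata header."""
    lines = [line for line in text.splitlines() if line and not line.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def overlay_mappings(schema: dict, rows: list[dict]) -> dict:
    """Add each row's object_id to the matching mapping slot (edits in place)."""
    classes = schema.get("classes") or {}
    for row in rows:
        slot = PREDICATE_TO_SLOT.get(row["predicate_id"])
        _, _, local_name = row["subject_id"].partition(":")
        element = classes.get(local_name)
        if slot is None or element is None:
            continue  # unknown predicate or subject: left for a verifier to report
        targets = element.setdefault(slot, [])
        if row["object_id"] not in targets:
            targets.append(row["object_id"])
    return schema


sample_tsv = (
    "# mapping_set_id: https://example.org/sketch\n"
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "dpv:Consent\tskos:exactMatch\tex:Consent\tsemapv:ManualMappingCuration\n"
)
sample_schema = {"classes": {"Consent": {}}}
overlay_mappings(sample_schema, read_sssom_rows(sample_tsv))
```

Skipping duplicates makes the sketch safe to re-run on the same mapping file.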

Current cross-walks (26 total - LLM-seeded sets are pending human review):

File Target
dpv-odrl.sssom.tsv W3C ODRL (legacy)
dpv-ai-iso42001.sssom.tsv ISO/IEC 42001 (AI management) - AI extension
dpv-cdm.sssom.tsv FINOS Common Domain Model
dpv-gist.sssom.tsv Semantic Arts Gist
dpv-iso.sssom.tsv ISO/IEC general privacy terms
dpv-iso22989.sssom.tsv ISO/IEC 22989 (AI concepts)
dpv-iso27001.sssom.tsv ISO/IEC 27001 (ISMS)
dpv-iso29100-lmodel.sssom.tsv ISO/IEC 29100 (privacy framework)
dpv-iso42001.sssom.tsv ISO/IEC 42001 (AI management)
dpv-justifications-oscal.sssom.tsv NIST OSCAL responsibility statements - justifications extension
dpv-legal-eu-aiact-iso42001.sssom.tsv ISO/IEC 42001 - legal/eu/aiact extension
dpv-legal-eu-dga-iso27701.sssom.tsv ISO/IEC 27701 - legal/eu/dga extension
dpv-legal-eu-ehds-iso27701.sssom.tsv ISO/IEC 27701 - legal/eu/ehds extension
dpv-legal-eu-gdpr-iso27701.sssom.tsv ISO/IEC 27701 - legal/eu/gdpr extension
dpv-legal-eu-nis2-nist-csf-v2.sssom.tsv NIST CSF v2 - legal/eu/nis2 extension
dpv-legal-eu-rights-cfreu.sssom.tsv Charter of Fundamental Rights (EU) - legal/eu/rights extension
dpv-loc-iso3166.sssom.tsv ISO 3166 country / subdivision codes - loc extension
dpv-nist-ai-rmf.sssom.tsv NIST AI Risk Management Framework
dpv-nist-csf-v2.sssom.tsv NIST Cybersecurity Framework v2
dpv-ocsf.sssom.tsv OCSF
dpv-odrl.sssom.tsv W3C ODRL
dpv-oscal.sssom.tsv NIST OSCAL
dpv-pd-iso29100.sssom.tsv ISO/IEC 29100 - pd extension
dpv-risk-iso42001.sssom.tsv ISO/IEC 42001 - risk extension
dpv-tech-iso22989.sssom.tsv ISO/IEC 22989 - tech extension

LLM-seeded mapping sets provide candidate alignments where every row carries mapping_justification: semapv:LLMBasedMatching per the Semantic Mapping Vocabulary. They are not curated mappings - they require human review before promotion to semapv:ManualMappingCuration. Per-jurisdiction legal/<cc> extensions intentionally do not ship cross-walks (low cross-vocabulary leverage).

Mappings are verified by sssom_mappings.py verify. The verifier is project-agnostic: it auto-discovers schemas by default_prefix and checks that every subject_id resolves to an element in some schema.
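The verification rule can be sketched as follows. This is an assumption-laden illustration rather than the actual verifier: the function name unresolved_subjects is hypothetical, schemas are passed in as parsed dictionaries, and an "element" is taken to mean a class or slot name.

```python
def unresolved_subjects(rows: list[dict], schemas: list[dict]) -> list[str]:
    """Return the subject_id values that no schema can resolve."""
    # Index schemas by default_prefix so each CURIE prefix selects one schema.
    by_prefix = {schema["default_prefix"]: schema for schema in schemas}
    missing = []
    for row in rows:
        prefix, _, local_name = row["subject_id"].partition(":")
        schema = by_prefix.get(prefix, {})
        elements = set(schema.get("classes") or {}) | set(schema.get("slots") or {})
        if local_name not in elements:
            missing.append(row["subject_id"])
    return missing


# Hypothetical inputs: one resolvable subject, one typo, one unknown prefix.
schemas = [{"default_prefix": "dpv", "classes": {"Consent": {}}, "slots": {"id": {}}}]
rows = [
    {"subject_id": "dpv:Consent"},
    {"subject_id": "dpv:Consnet"},
    {"subject_id": "other:Thing"},
]
print(unresolved_subjects(rows, schemas))  # ['dpv:Consnet', 'other:Thing']
```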

Examples and tests

  • tests/data/ contains hand-authored and vendored YAML fixtures split into valid/ and invalid/ subtrees. The dpvcg/ subtree carries the upstream DPV CG examples converted to LinkML YAML by scripts/dpvcg_examples.py (to-yaml subcommand).
  • tests/test_data.py validates each fixture against the pre-merged schema at tmp/dpv.yaml using linkml.validator.Validator with JsonschemaValidationPlugin (explicit; the linkml Validator performs no validation at all when validation_plugins is omitted). The merged schema is used so no import-map is needed at test time.
  • tests/test_schema_imports.py is a regression test guarding against class/slot name collisions.
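The validation approach described above might look roughly like the sketch below. It assumes the linkml package is installed and that the merged schema exists at tmp/dpv.yaml; the helper names are hypothetical, and the filename-based target resolution is a simplification that matches class names only (the project's resolver is described as also using class_uri and aliases).

```python
from pathlib import Path


def target_class_for(fixture: Path, known_classes: set[str], fallback: str = "DpvThing") -> str:
    """Derive the target class from a '<ClassName>-<rest>.yaml' fixture name."""
    candidate = fixture.stem.split("-", 1)[0]
    return candidate if candidate in known_classes else fallback


def validate_instance(schema_path: str, instance: dict, target_class: str) -> list[str]:
    """Validate one instance and return the validator's messages (empty if valid)."""
    # Imported lazily so the helper above can be used without linkml installed.
    from linkml.validator import Validator
    from linkml.validator.plugins import JsonschemaValidationPlugin

    # The plugin is passed explicitly; closed=False keeps additional
    # (open-world) properties from being rejected.
    validator = Validator(
        schema_path,
        validation_plugins=[JsonschemaValidationPlugin(closed=False)],
    )
    report = validator.validate(instance, target_class)
    return [result.message for result in report.results]


# Example call (requires linkml and a built tmp/dpv.yaml):
# messages = validate_instance("tmp/dpv.yaml", {"id": "ex:consent-1"}, "Consent")
```

Passing the plugin list explicitly is the important part of the sketch; the rest is plumbing.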

Static fixtures

Over 100 unit tests:

tests/data/valid/ exercises classes across core, extension, and legal modules.

tests/data/invalid/ asserts that the validator rejects the following error classes - "missing id", "null id", and "integer id".

Note: The dpv schema enforces only one constraint: id is a required string (uriorcurie). All other DPV properties are domainless open-world slots. Fixtures that were originally designed to test enum, date-format, cardinality, or referential constraints were rewritten to test id-type violations; their original intent is captured in fixture comments as a roadmap for schema tightening.
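The single enforced constraint and its three violation modes can be paraphrased in plain Python. This sketch only mirrors the rule as stated in the note; it is not how the project validates (that goes through the LinkML validator), and the function name id_violation is hypothetical.

```python
from typing import Optional


def id_violation(instance: dict) -> Optional[str]:
    """Name the id violation in `instance`, or return None if id is acceptable."""
    if "id" not in instance:
        return "missing id"
    value = instance["id"]
    if value is None:
        return "null id"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return "integer id"
    if not isinstance(value, str):
        return "non-string id"
    return None


print(id_violation({"id": "ex:consent-1", "anyOtherProperty": 1}))  # None
print(id_violation({"id": 42}))  # integer id
```

Note that the extra anyOtherProperty key is not an error here, which reflects the open-world stance: only id is checked.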

Generated fixtures (from DPVCG examples)

  • Fixture generation is fully automated: just test (via _test-python) runs _gen-fixtures, which chains three steps:

  • _extract-example-ttls - scripts/dpvcg_examples.py extract parses upstream-releases/dpv/examples/dex.html and writes one examples/E####.ttl per embedded <pre> block. The upstream release does not ship stand-alone E*.ttl files; the Turtle content lives only in the HTML documentation page.

  • _load-ttl-fixtures - scripts/dpvcg_examples.py load wraps each examples/E*.ttl in a self-contained Turtle document (prepending the standard DPV prefix preamble) and routes it to tests/data/dpvcg/valid/ or problem/ based on rdflib parse success.

  • _gen-fixtures - scripts/dpvcg_examples.py to-yaml converts each typed root subject into a <ClassName>-<stem>-<n>.yaml fixture that test_data.py picks up automatically via its glob. Idempotent: re-running with unchanged inputs produces byte-identical output.

  • just _test-examples runs the same fixtures through linkml-run-examples for CLI parity.
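The first step of the chain, pulling Turtle snippets out of the <pre> blocks of an HTML page, can be sketched with the standard library alone. This is not scripts/dpvcg_examples.py: the class and function names are hypothetical, and numbering the files sequentially is an assumption (the real script may derive the E#### identifiers from the page itself).

```python
from html.parser import HTMLParser


class PreBlockExtractor(HTMLParser):
    """Collect the text content of every top-level <pre> element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks: list[str] = []
        self._depth = 0
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_examples(html: str) -> dict[str, str]:
    """Map assumed file names (E0001.ttl, E0002.ttl, ...) to <pre> contents."""
    parser = PreBlockExtractor()
    parser.feed(html)
    parser.close()
    return {f"E{n:04d}.ttl": block for n, block in enumerate(parser.blocks, start=1)}


page = (
    "<h2>Examples</h2><pre>ex:a a dpv:Consent .</pre>"
    "<p>text</p><pre>ex:b dpv:hasPurpose &lt;https://example.org/p&gt; .</pre>"
)
examples = extract_examples(page)
```

Each extracted snippet would still need the DPV prefix preamble prepended before an RDF parser could load it, which is what the _load-ttl-fixtures step is described as doing.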

Build pipeline

  • just orchestrates everything; common targets include gen-project, _test-python, _test-examples, lint and doc. _test-python depends on _gen-fixtures -> _load-ttl-fixtures -> _extract-example-ttls, so the full fixture pipeline (HTML extraction -> TTL wrapping -> YAML conversion) always runs before pytest.

  • URI imports are resolved at build time via an import map. The source-of-truth importmap.json uses relative paths for portability and IDE consumption; the _importmap recipe materialises an absolute-path copy at tmp/importmap.json (required because SchemaLoader joins import-map values with the importing schema's directory). Set LINKML_IMPORT_MAP= (empty) to fall back to HTTP resolution against w3id.org.

  • scripts/linkml_import_tools.py merge flattens the schema into a self-contained YAML before feeding it to gen-project. This works around a LinkML bug (reported upstream) where SchemaLoader-based sub-generators (python, sqltable, excel) construct a secondary SchemaView without propagating --importmap and therefore try to HTTP-fetch URI imports they cannot resolve.

  • dpv_27560_to_linkml.py and the gen_*_patched.py scripts work around several LinkML generator edge-cases that surface on a schema this large (DBML, Pandera, RDF, SPARQL, Markdown data-dictionary).

  • scripts/dpv_core_to_linkml.py regenerates the 8 semantic-group schemas (dpv_<group>.yaml) and the top-level umbrella dpv.yaml from the vendored upstream OWL release under upstream-releases/dpv/.

  • scripts/dpv_extensions_to_linkml.py regenerates all extension schemas under src/dpv/schema/extensions/ from the same vendored upstream release. Invoked via just gen-linkml-extensions; must run after dpv_core_to_linkml.py because it imports that module for shared utilities.

  • scripts/linkml_import_tools.py strip-siblings produces self-contained per-extension YAML for the documentation build by dropping all non-linkml:types imports and stripping dangling is_a / mixins references. Driven by the gen-doc-extensions recipe, which is itself a dependency of top-level gen-doc so just gen-doc produces both docs/elements/ and docs/extensions/<slug>/ in one pass.
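As an illustration of the import-map step mentioned in this list, the sketch below turns a relative-path import map into an absolute-path copy. It is not the _importmap recipe itself: the function name is hypothetical, the sample key and path are made up, and resolving relative paths against the directory that holds importmap.json is an assumption.

```python
import json
import tempfile
from pathlib import Path


def materialise_import_map(source: Path, target: Path) -> dict:
    """Write a copy of `source` whose values are absolute paths; return it."""
    relative = json.loads(source.read_text(encoding="utf-8"))
    base = source.parent.resolve()
    absolute = {uri: str((base / rel).resolve()) for uri, rel in relative.items()}
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(absolute, indent=2) + "\n", encoding="utf-8")
    return absolute


# Demonstration in a throwaway directory with a made-up single-entry map.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "importmap.json").write_text(
        json.dumps({"dpv:schema/dpv_common": "src/dpv/schema/dpv_common"}),
        encoding="utf-8",
    )
    result = materialise_import_map(root / "importmap.json", root / "tmp" / "importmap.json")
    written = json.loads((root / "tmp" / "importmap.json").read_text(encoding="utf-8"))
```

Keeping the relative map as the source of truth and generating the absolute copy means the checked-in file stays portable across machines.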

Status

This is an early-stage but functional port. Key milestones:

  • ✅ Full DPV 2.3 surface area (973 classes / 144 slots, 1:1 with upstream OWL) generated from upstream OWL into 8 semantic-group schemas.

  • ✅ Every concrete class carries the required id identifier: root classes in all 8 semantic groups are wired to is_a: DpvThing (declared once in dpv_common, imported transitively since every group depends on it), so JSON-Schema validation enforces id on subclasses — not just on DpvThing itself. This is the single constraint the invalid fixtures exercise.

  • ✅ Open-world validation honored by tests/test_data.py and the linkml-run-examples CLI path; the open-world drop-in runner (tests/run_examples_open_world.py) and pytest now share a schema-aware target-class resolver that maps filename targets via class names, class_uri and aliases, falling back to DpvThing when the upstream example references a concept outside the merged schema.

  • ✅ 26 SSSOM cross-walks present (17 hand-curated + 9 LLM-seeded candidate sets marked semapv:LLMBasedMatching and pending human review); verifier checks every row against the schema.

  • ✅ Upstream DPV CG examples round-tripped from TTL into LinkML YAML and exercised in CI; unmapped predicates preserved as # __unmapped__: provenance comments so nothing is silently dropped. The upstream release ships only a monolithic HTML page (dex.html); scripts/dpvcg_examples.py extract extracts 88 individual E####.ttl snippets from it. The full pipeline (HTML -> TTL -> wrapped TTL -> 5 valid YAML fixtures, with 6 examples routed to problem/) is wired into just test via _gen-fixtures - no manual step required.

  • ✅ 64 hand-authored valid fixtures covering 52 distinct classes (core, risk, AI, legal/EU, loc, tech, and sector modules) and 48 hand-authored invalid fixtures all exercising the one schema-enforced constraint — id must be a required uriorcurie string — across violation modes (missing id, null id, integer id). Fixtures carry comments recording the originally-intended semantic constraint (enum values, date formats, cardinality, referential integrity) as a roadmap for future schema tightening. Plus 5 generated YAML fixtures from the upstream DPV CG examples. All exercised by tests/test_data.py (117 passed) and just _test-examples.

  • ✅ URI-style imports (dpv:schema/...) wired across dpv.yaml and all 34 module schemas, with build-time resolution via importmap.json and post-publication resolution via w3id.org/lmodel. gen-project, gen-doc and all 12 sub-generators complete cleanly.

  • ✅ Each per-extension schema imports only the semantic group(s) it specialises — computed from its is_a/range reference graph by scripts/dpv_extensions_to_linkml.py — rather than the full umbrella dpv:schema/dpv. Narrow consumers load much less of the core (e.g. importing pd alone resolves ~580 classes instead of the full ~1,200; justifications ~195), while AI-heavy stacks that transitively pull tech still need most groups. The umbrella remains available for consumers wanting the whole vocabulary in one import.

  • ✅ Per-module subsets are first-class in the generated documentation: each semantic-group schema declares a <group>_subset and every class/slot is tagged with its group membership via in_subset. gen-doc renders a dedicated page for each subset. Per-module schemas under modules/ remain usable standalone but are intentionally not imported by dpv.yaml to avoid duplicate from_schema resolution in gen-project.

  • ✅ Per-extension documentation: just gen-doc-extensions renders an isolated gen-doc tree for each top-level extension (ai, justifications, loc, pd, risk, tech) plus an auto-generated extensions/index.md overview, all wired into the mkdocs nav. Each extension is documented from its own (sibling-import-stripped) schema, so cross-extension and core parents render as plain text rather than breaking the build.

  • ✅ Class-name collision avoidance trimmed to the minimum needed: DPV-domain terms (Consent, Contract, Notice, Policy, Purpose, Right, Risk, Rule, Status, Assessment) keep their natural names; only generic primitives that collide with Python typing or common base classes (Agent, Organisation, Person, Process, Service, …) are emitted as Dpv*.

  • 🚧 Generator patches (gen_*_patched.py) consolidate workarounds for upstream LinkML issues (raised upstream); upstreaming is in progress.

  • 🚧 Documentation site at https://lmodel.github.io/dpv is generator-driven and tracks main.

See also