About dpv
dpv is a LinkML port of the W3C Data Privacy Vocabulary (DPV) v2.3. It re-expresses the upstream OWL release as a single, navigable LinkML schema together with curated cross-walks to other privacy, AI, security and compliance vocabularies (ISO, NIST, OCSF, OSCAL, ODRL, Gist, CDM, …).
Why a LinkML port?
The canonical DPV release ships as OWL / RDF and HTML. That form is authoritative, but it is awkward to consume from typical data-engineering toolchains. By re-expressing DPV in LinkML we get, from a single source of truth:
- Pydantic / dataclass / SQL / GraphQL / JSON-Schema bindings via gen-project.
- Markdown documentation, DBML diagrams, SPARQL, SHACL, OWL and Pandera artifacts via the standard LinkML generators.
- A simple substrate for authoring SSSOM mappings to adjacent ontologies, so that DPV concepts can be referenced consistently across heterogeneous compliance stacks.
- Round-trippable example fixtures that double as conformance tests.
Design
Schema layout
- src/dpv/schema/dpv.yaml - top-level umbrella schema (prefix dpv: -> https://w3id.org/lmodel/dpv/). Re-exports the 8 semantic-group schemas described below.
- Semantic groups (src/dpv/schema/dpv_<group>.yaml, 8 schemas) - 973 classes and 144 slots covering every term in upstream DPV 2.3 (1:1 coverage). Each group imports only its direct upstream module dependencies, avoiding the load cost of a single monolithic schema:
| Group | Classes | Slots | Upstream modules |
|---|---|---|---|
| dpv_common | 129 | 47 | status, TOM, rules, context |
| dpv_legal_basis | 166 | 16 | legal_basis (+status), jurisdiction, legal_measures, contract (+clause/control/status/types) |
| dpv_entities | 82 | 28 | entities (+authority/datasubject/legalrole/organisation) |
| dpv_personal_data | 221 | 4 | personal_data, physical_measures, organisational_measures, technical_measures |
| dpv_processing | 293 | 26 | processing (+context/scale), process, purposes |
| dpv_risk_notice | 47 | 19 | risk, notice |
| dpv_rights | 10 | 3 | rights |
| dpv_consent | 25 | 2 | consent (+controls/status/types) |
| Total | 973 | 145 | 34 upstream modules |
(Slot total 145 = 144 upstream DPV-namespace properties + 1 synthetic id on DpvThing.)
- src/dpv/schema/modules/ - per-module subset schemas. Both standalone-consumable and importable from dpv.yaml. All imports across the tree use the canonical URI form (e.g. dpv:schema/dpv_core) rather than relative file paths, so the same schema files resolve locally during build and over w3id.org/lmodel once published. See Build pipeline for the import-map mechanics.
- src/dpv/schema/extensions/ - per-extension schemas mirroring the upstream W3C DPV 2.3 extension family. Each extension has its own id, default_prefix, and decomposed submodule tree. Generated by scripts/dpv_extensions_to_linkml.py, which reads the vendored OWL/Turtle files under upstream-releases/dpv/ and converts each extension and its sub-modules to LinkML YAML. Run via just gen-linkml-extensions. See Extensions for the full inventory.
- src/dpv/mappings/ - SSSOM cross-walks.
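
To make the URI-form imports above concrete, here is a minimal sketch of how an import such as dpv:schema/dpv_core expands against a prefix map. The dpv prefix and the two import names come from the text; the expand_curie helper is hypothetical and is not part of the repository's scripts or of LinkML itself.

```python
# Prefix map: the dpv entry is taken from the schema layout above,
# the linkml entry is the standard LinkML namespace.
PREFIXES = {
    "dpv": "https://w3id.org/lmodel/dpv/",
    "linkml": "https://w3id.org/linkml/",
}


def expand_curie(curie: str, prefixes: dict) -> str:
    """Expand 'prefix:local' into a full URI.

    Full URIs (scheme://...) and unknown prefixes are returned unchanged.
    """
    prefix, sep, local = curie.partition(":")
    if not sep or local.startswith("//") or prefix not in prefixes:
        return curie
    return prefixes[prefix] + local


imports = ["linkml:types", "dpv:schema/dpv_core"]
for name in imports:
    print(name, "->", expand_curie(name, PREFIXES))
```

Because the import is written as a URI rather than a file path, the same string can be pointed at a local file during the build and at the published location afterwards.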
Extensions
The upstream W3C DPV 2.3 release ships as a multi-document family. Each extension is published at its own namespace (e.g. https://w3id.org/dpv/ai#) and contains hundreds of specialised terms layered on top of the core ontology. dpv mirrors that layout under src/dpv/schema/extensions/, one LinkML sub-schema per extension (each with its own id, default_prefix, and decomposed submodule tree):
| Extension | dpv path | Upstream namespace | Coverage |
|---|---|---|---|
| Personal Data (PD) | extensions/pd.yaml + pd/{core,extended}.yaml | https://w3id.org/dpv/pd# | Financial, demographic, health, biometric, identifying, tracking, communication, social leaf categories. |
| Location (LOC) | extensions/loc.yaml + loc/{locations,memberships,inverse}.yaml | https://w3id.org/dpv/loc# | ISO 3166 country & region individuals (EU/EEA member states, US states, UK nations, supranational groupings). |
| Technology (TECH) | extensions/tech.yaml + tech/{core,actors,comms,docs,io,provision,status,tools}.yaml | https://w3id.org/dpv/tech# | Hardware, software, OS, database, cloud, edge, IoT, deployment models - prerequisite for AI. |
| AI | extensions/ai.yaml + ai/{core,risks,systems,lifecycle,techniques,capabilities,data,measures,development}.yaml | https://w3id.org/dpv/ai# | LLM, GPAI / GPAIModel, FineTunedModel, AGI, ExpertSystem, EdgeAI; AI lifecycle (Inception→Decommission); AI-specific risks (DataPoisoning, AdversarialAttack, ModelInversion, ExplainabilityRisk, bias subtree); AI techniques, capabilities, measures. |
| Risk (extension) | extensions/risk.yaml + risk/{incident,incident_status,risk_controls,risk_levels,risk_matrix,risk_management,risk_taxonomy}.yaml | https://w3id.org/dpv/risk# | Typed control verbs (Detection/Intervention/Containment/Identification/Logging/Avoidance/Impact/Elimination/Monitor/Recovery/Remove/Reduction, HaltSource, RemedyControl), Incident / IncidentReport / IncidentStatus lifecycle, concrete consequences (Damage, Detriment, Harm, Injury, Material / NonMaterialDamage, IdentityFraud / Theft …), 5×5 risk matrices, Vulnerability. Distinct from modules/risk (abstract parents only). |
| Justifications | extensions/justifications.yaml + justifications/{nonfulfilment,notrequired,exercise,delay}.yaml | https://w3id.org/dpv/justifications# | NonFulfilment / NonCompliance / NotRequired / Delay / Exercise justification taxonomy. |
| Legal (umbrella) | extensions/legal/ | https://w3id.org/dpv/legal# | Abstract Law, Authority, Treaty, LegalAgreement classes. |
| Legal / EU | extensions/legal/eu.yaml + legal/eu/{gdpr,dga,aiact,rights,ehds,nis2}.yaml (each decomposed into submodule trees, e.g. eu/gdpr/ has 15 sub-files, eu/aiact/ has 13) | https://w3id.org/dpv/legal/eu*# | EU member-state authorities, GDPR articles & lawful bases (A6-1-a…A6-1-f, A9-2-a…), special-category bases, data-subject rights, DPA roles, breach flows, DPIA; Data Governance Act terms; AI Act roles (Provider/Deployer/Importer/Distributor/AuthRep), risk categories (Prohibited / HighRisk / LimitedRisk / MinimalRisk), system types (GPAI / GPAIWithSystemicRisk), Annex III high-risk areas, conformity-assessment; CFREU fundamental-rights articles; EHDS and NIS2. |
| Legal / per-jurisdiction | extensions/legal/<cc>.yaml for 40+ jurisdictions (at, be, bg, cy, cz, de (+ de/gdng), dk, ee, es, fi, fr, gb, gr, hk, hr, hu, ie, in, is, it, jp, kr, li, lt, lu, lv, mo, mt, my, nl, no, ph, pl, pt, ro, se, sg, si, sk, th, tw, us) | https://w3id.org/dpv/legal/<cc># | Per-jurisdiction laws & authorities (BDSG, DPC Ireland, DPDP Act India, UK GDPR / DPA-2018 / ICO, CCPA / CPRA / HIPAA / GLBA / state AGs, …). Exceeds upstream's published per-country set. |
| Sector | extensions/sector/{finance,health,education,law,infra,publicservices}.yaml (+ subdirs) | https://w3id.org/dpv/sector*# (WIP upstream) | FINOS-relevant sector taxonomies. Marked WIP upstream; shipped here for early consumers. |
| Standards | extensions/standards/ieee/ | n/a (cross-walks against IEEE) | IEEE standards alignments. |
All extensions import only linkml:types and dpv:schema/dpv_core (for the abstract parents they specialise), so they can be consumed standalone or aggregated.
Per-extension documentation (possible reusable pattern)
The six top-level extensions (ai, justifications, loc, pd, risk, tech) are rendered as standalone documentation trees under docs/extensions/, grouped under a single Extensions entry in the site navigation. Each tree is produced by an isolated gen-doc invocation against a self-contained staged copy of the extension schema.
The staging step (scripts/linkml_import_tools.py strip-siblings) rewrites each extension so SchemaView can load it without sibling-schema context:
- Drops every imports: entry except linkml:types (avoids Conflicting URIs errors caused by genuine cross-extension name collisions, e.g. DataAggregationBias in both ai and risk, or DpvData in ai vs. dpv_personal_data).
- Drops dangling is_a parents and filters mixins lists to only those defined locally, so the resulting schema validates standalone.
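
The two staging rules above can be sketched on a schema held as a plain dictionary. This is an illustration only, not the code in scripts/linkml_import_tools.py: the strip_siblings function, the example schema id and the class names Model, Versioned and Technology are all made up for the demonstration.

```python
import copy


def strip_siblings(schema: dict) -> dict:
    """Return a copy of a schema dict that no longer depends on sibling schemas."""
    staged = copy.deepcopy(schema)

    # Rule 1: keep only the linkml:types import.
    staged["imports"] = [i for i in staged.get("imports", []) if i == "linkml:types"]

    # Rule 2: drop references to classes that are not defined locally.
    local = set(staged.get("classes", {}))
    for body in staged.get("classes", {}).values():
        if not body:
            continue  # a class declared with an empty body has nothing to strip
        if body.get("is_a") not in local:
            body.pop("is_a", None)  # dangling parent
        if "mixins" in body:
            body["mixins"] = [m for m in body["mixins"] if m in local]
    return staged


# Hypothetical extension schema: Technology and Risky live in other schemas.
schema = {
    "id": "https://example.org/ai",
    "imports": ["linkml:types", "dpv:schema/dpv_core"],
    "classes": {
        "Model": {"is_a": "Technology", "mixins": ["Versioned", "Risky"]},
        "Versioned": {"mixin": True},
        "FineTunedModel": {"is_a": "Model"},
    },
}
staged = strip_siblings(schema)
print(staged["imports"])
print(staged["classes"]["Model"])
```

Local parents such as Model survive, so the class hierarchy inside one extension is preserved while cross-schema parents are dropped.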
The orchestration recipe (just gen-doc-extensions) iterates src/dpv/schema/extensions/*.yaml, stages each into tmp/extensions/<slug>.yaml, and runs gen-doc into docs/extensions/<slug>/. It is wired as a dependency of the top-level just gen-doc, so a single command produces both the core element pages (docs/elements/) and the per-extension trees. Per-jurisdiction legal/ schemas and sector/, standards/ trees are intentionally not rendered (they re-slice the same core terms and would multiply build time without adding distinct semantics); the docs/extensions/index.md landing page notes this and points readers to the source YAML.
Open-world modelling
DPV is intrinsically open-world: most properties are declared on the top-level DpvThing rather than pinned to a specific class, so any instance may carry any DPV property that makes sense for it. This is preserved in dpv:
- DpvThing is the abstract base and declares only id.
- DPV properties are domainless slots that can apply to any descendant.
- Validation is therefore done via linkml.validator.Validator against the merged schema (JSON-Schema semantics), not by instantiating the generated closed-world dataclasses. The closed-world Pydantic classes are still emitted for use cases where a stricter contract is desired.
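
The difference between the two contracts can be illustrated without any LinkML machinery. The sketch below is a simplified stand-in, not the project's validator: the function names and the has_purpose / has_risk property names are hypothetical.

```python
def validate_open_world(instance: dict) -> list:
    """Open-world check: only id is constrained, any other property is accepted."""
    errors = []
    if not isinstance(instance.get("id"), str):
        errors.append("id must be a string")
    return errors


def validate_closed_world(instance: dict, allowed: set) -> list:
    """Closed-world check: additionally reject properties outside an allow-list."""
    errors = validate_open_world(instance)
    extra = sorted(set(instance) - allowed)
    errors += [f"unexpected property: {name}" for name in extra]
    return errors


instance = {"id": "ex:consent1", "has_purpose": "ex:Marketing", "has_risk": "ex:r1"}

print(validate_open_world(instance))
print(validate_closed_world(instance, allowed={"id", "has_purpose"}))
```

The same instance passes the open-world check and fails the closed-world one, which is why the generated closed-world classes are offered as an opt-in stricter contract rather than as the default validation path.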
Mappings
Cross-walks live under src/dpv/mappings/ as SSSOM TSVs and are merged into the schema as exact_mappings, close_mappings, related_mappings, etc. by the sssom_mappings.py overlay subcommand.
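
As a rough sketch of what such an overlay involves, the following reads SSSOM rows and appends each object_id to the matching LinkML mapping slot. It is not the sssom_mappings.py implementation: the function names are hypothetical, the sample rows are invented, and it assumes the local part of subject_id equals the class name.

```python
import csv

# SKOS predicates and the LinkML mapping slots they correspond to.
PREDICATE_TO_SLOT = {
    "skos:exactMatch": "exact_mappings",
    "skos:closeMatch": "close_mappings",
    "skos:relatedMatch": "related_mappings",
    "skos:narrowMatch": "narrow_mappings",
    "skos:broadMatch": "broad_mappings",
}


def read_sssom(text: str) -> list:
    """Parse SSSOM TSV text, skipping the '#'-prefixed metadata header."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def overlay(classes: dict, rows: list) -> dict:
    """Add each row's object_id to the mapping slot of the subject class."""
    for row in rows:
        slot = PREDICATE_TO_SLOT.get(row["predicate_id"])
        name = row["subject_id"].split(":", 1)[-1]
        if slot is None or name not in classes:
            continue
        targets = classes[name].setdefault(slot, [])
        if row["object_id"] not in targets:  # keep the overlay idempotent
            targets.append(row["object_id"])
    return classes


# Invented sample rows; the ex: targets are placeholders, not real mappings.
SAMPLE = "\n".join([
    "# mapping_set_id: https://example.org/sample",
    "\t".join(["subject_id", "predicate_id", "object_id", "mapping_justification"]),
    "\t".join(["dpv:Purpose", "skos:exactMatch", "ex:Goal", "semapv:ManualMappingCuration"]),
    "\t".join(["dpv:Consent", "skos:closeMatch", "ex:Agreement", "semapv:LLMBasedMatching"]),
])

classes = overlay({"Purpose": {}, "Consent": {}}, read_sssom(SAMPLE))
print(classes)
```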
Current cross-walks (26 total - LLM-seeded sets are pending human review):
| File | Target |
|---|---|
| dpv-ai-iso42001.sssom.tsv | ISO/IEC 42001 (AI management) - AI extension |
| dpv-cdm.sssom.tsv | FINOS Common Domain Model |
| dpv-gist.sssom.tsv | Semantic Arts Gist |
| dpv-iso.sssom.tsv | ISO/IEC general privacy terms |
| dpv-iso22989.sssom.tsv | ISO/IEC 22989 (AI concepts) |
| dpv-iso27001.sssom.tsv | ISO/IEC 27001 (ISMS) |
| dpv-iso29100-lmodel.sssom.tsv | ISO/IEC 29100 (privacy framework) |
| dpv-iso42001.sssom.tsv | ISO/IEC 42001 (AI management) |
| dpv-justifications-oscal.sssom.tsv | NIST OSCAL responsibility statements - justifications extension |
| dpv-legal-eu-aiact-iso42001.sssom.tsv | ISO/IEC 42001 - legal/eu/aiact extension |
| dpv-legal-eu-dga-iso27701.sssom.tsv | ISO/IEC 27701 - legal/eu/dga extension |
| dpv-legal-eu-ehds-iso27701.sssom.tsv | ISO/IEC 27701 - legal/eu/ehds extension |
| dpv-legal-eu-gdpr-iso27701.sssom.tsv | ISO/IEC 27701 - legal/eu/gdpr extension |
| dpv-legal-eu-nis2-nist-csf-v2.sssom.tsv | NIST CSF v2 - legal/eu/nis2 extension |
| dpv-legal-eu-rights-cfreu.sssom.tsv | Charter of Fundamental Rights (EU) - legal/eu/rights extension |
| dpv-loc-iso3166.sssom.tsv | ISO 3166 country / subdivision codes - loc extension |
| dpv-nist-ai-rmf.sssom.tsv | NIST AI Risk Management Framework |
| dpv-nist-csf-v2.sssom.tsv | NIST Cybersecurity Framework v2 |
| dpv-ocsf.sssom.tsv | OCSF |
| dpv-odrl.sssom.tsv | W3C ODRL |
| dpv-oscal.sssom.tsv | NIST OSCAL |
| dpv-pd-iso29100.sssom.tsv | ISO/IEC 29100 - pd extension |
| dpv-risk-iso42001.sssom.tsv | ISO/IEC 42001 - risk extension |
| dpv-tech-iso22989.sssom.tsv | ISO/IEC 22989 - tech extension |
LLM-seeded mapping sets provide candidate alignments where every row carries mapping_justification: semapv:LLMBasedMatching per the Semantic Mapping Vocabulary. They are not curated mappings - they require human review before promotion to semapv:ManualMappingCuration. Per-jurisdiction legal/<cc> extensions intentionally do not ship cross-walks (low cross-vocabulary leverage).
Mappings are verified by sssom_mappings.py verify, which is project-agnostic: it auto-discovers schemas by default_prefix and checks that every subject_id resolves to an element in some schema.
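
A minimal sketch of that check, assuming schemas are available as dictionaries and that an element's name equals the local part of its CURIE (the real verifier may resolve elements differently, for example via class_uri). The function names and sample data are hypothetical.

```python
def build_prefix_index(schemas: list) -> dict:
    """Group class and slot names by each schema's default_prefix."""
    index = {}
    for schema in schemas:
        names = set(schema.get("classes", {})) | set(schema.get("slots", {}))
        index.setdefault(schema["default_prefix"], set()).update(names)
    return index


def unresolved_subjects(rows: list, schemas: list) -> list:
    """Return every subject_id that does not resolve to a known element."""
    index = build_prefix_index(schemas)
    missing = []
    for row in rows:
        prefix, _, local = row["subject_id"].partition(":")
        if local not in index.get(prefix, set()):
            missing.append(row["subject_id"])
    return missing


def llm_seeded(rows: list) -> list:
    """Rows that still await human review."""
    return [r for r in rows if r["mapping_justification"] == "semapv:LLMBasedMatching"]


# Invented sample data for illustration.
schemas = [
    {"default_prefix": "dpv", "classes": {"Consent": {}}, "slots": {"has_purpose": {}}},
    {"default_prefix": "ai", "classes": {"Model": {}}},
]
rows = [
    {"subject_id": "dpv:Consent", "mapping_justification": "semapv:ManualMappingCuration"},
    {"subject_id": "ai:Model", "mapping_justification": "semapv:LLMBasedMatching"},
    {"subject_id": "dpv:Missing", "mapping_justification": "semapv:LLMBasedMatching"},
]

print(unresolved_subjects(rows, schemas))
print(len(llm_seeded(rows)))
```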
Examples and tests
- tests/data/ contains hand-authored and vendored YAML fixtures split into valid/ and invalid/ subtrees. The dpvcg/ subtree carries the upstream DPV CG examples converted to LinkML YAML by scripts/dpvcg_examples.py (to-yaml subcommand).
- tests/test_data.py validates each fixture against the pre-merged schema at tmp/dpv.yaml using linkml.validator.Validator with JsonschemaValidationPlugin (explicit; the linkml Validator performs no validation at all when validation_plugins is omitted). The merged schema is used so no import-map is needed at test time.
- tests/test_schema_imports.py is a regression test guarding against class/slot name collisions.
Static fixtures
Over 100 unit tests:
tests/data/valid/ exercises classes across core, extension, and legal modules.
tests/data/invalid/ asserts that the validator rejects the following error classes - "missing id", "null id", and "integer id".
Note: The dpv schema enforces only one constraint: id is a required string (uriorcurie). All other DPV properties are domainless open-world slots. Fixtures that were originally designed to test enum, date-format, cardinality, or referential constraints were rewritten to test id-type violations; their original intent is captured in fixture comments as a roadmap for schema tightening.
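
The three violation modes the invalid fixtures cover can be told apart with a few type checks. The sketch below is illustrative only; the classify_id_violation helper is hypothetical and the project itself relies on JSON-Schema validation rather than hand-written checks.

```python
def classify_id_violation(instance: dict):
    """Return the violation mode for the id field, or None when id is acceptable."""
    if "id" not in instance:
        return "missing id"
    value = instance["id"]
    if value is None:
        return "null id"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return "integer id"
    if not isinstance(value, str):
        return "non-string id"
    return None


fixtures = [
    {"label": "no identifier"},
    {"id": None},
    {"id": 42},
    {"id": "ex:consent1"},
]
for fixture in fixtures:
    print(fixture, "->", classify_id_violation(fixture))
```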
Generated fixtures (from DPVCG examples)
- Fixture generation is fully automated: just test (via _test-python) runs _gen-fixtures, which chains three steps:
  - _extract-example-ttls - scripts/dpvcg_examples.py extract parses upstream-releases/dpv/examples/dex.html and writes one examples/E####.ttl per embedded <pre> block. The upstream release does not ship stand-alone E*.ttl files; the Turtle content lives only in the HTML documentation page.
  - _load-ttl-fixtures - scripts/dpvcg_examples.py load wraps each examples/E*.ttl in a self-contained Turtle document (prepending the standard DPV prefix preamble) and routes it to tests/data/dpvcg/valid/ or problem/ based on rdflib parse success.
  - _gen-fixtures - scripts/dpvcg_examples.py to-yaml converts each typed root subject into a <ClassName>-<stem>-<n>.yaml fixture that test_data.py picks up automatically via its glob. Idempotent: re-running with unchanged inputs produces byte-identical output.
- just _test-examples runs the same fixtures through linkml-run-examples for CLI parity.
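
The first two steps can be approximated with the standard library alone. This is a sketch under stated assumptions, not the logic of scripts/dpvcg_examples.py: it numbers the blocks sequentially (the real script may take the identifier from the page), and the one-line PREAMBLE stands in for the fuller prefix preamble the project prepends.

```python
from html.parser import HTMLParser

# Assumed, shortened preamble; the project's actual preamble is not shown here.
PREAMBLE = "@prefix dpv: <https://w3id.org/dpv#> .\n"


class PreExtractor(HTMLParser):
    """Collect the text content of every <pre> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buffer = None  # None means we are outside a <pre> element

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "pre" and self._buffer is not None:
            self.blocks.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_examples(html: str) -> dict:
    """Map E####.ttl file names to the Turtle text of each <pre> block."""
    parser = PreExtractor()
    parser.feed(html)
    parser.close()
    return {f"E{n:04d}.ttl": block for n, block in enumerate(parser.blocks, start=1)}


def wrap(body: str) -> str:
    """Prepend the prefix preamble so the snippet parses on its own."""
    return PREAMBLE + "\n" + body + "\n"


page = (
    "<h2>Example</h2><pre>ex:a a dpv:Consent .</pre>"
    "<p>text</p><pre>ex:b dpv:hasPurpose &lt;https://example.org/p&gt; .</pre>"
)
examples = extract_examples(page)
print(sorted(examples))
print(wrap(examples["E0001.ttl"]))
```

Routing each wrapped document to valid/ or problem/ would then depend on whether an RDF parser accepts it, which is omitted here to keep the sketch dependency-free.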
Build pipeline
- just orchestrates everything; common targets include gen-project, _test-python, _test-examples, lint, doc. _test-python depends on _gen-fixtures -> _load-ttl-fixtures -> _extract-example-ttls, so the full fixture pipeline (HTML extraction -> TTL wrapping -> YAML conversion) always runs before pytest.
- URI imports are resolved at build time via an import map. The source-of-truth importmap.json uses relative paths for portability and IDE consumption; the _importmap recipe materialises an absolute-path copy at tmp/importmap.json (required because SchemaLoader joins import-map values with the importing schema's directory). Set LINKML_IMPORT_MAP= (empty) to fall back to HTTP resolution against w3id.org.
- scripts/linkml_import_tools.py merge flattens the schema into a self-contained YAML before feeding it to gen-project. This works around a LinkML upstream bug (raised upstream) where SchemaLoader-based sub-generators (python, sqltable, excel) construct a secondary SchemaView without propagating --importmap and therefore HTTP-fetch URI imports they cannot resolve.
- dpv_27560_to_linkml.py and the gen_*_patched.py scripts work around several LinkML generator edge-cases that surface on a schema this large (DBML, Pandera, RDF, SPARQL, Markdown data-dictionary).
- scripts/dpv_core_to_linkml.py regenerates the 8 semantic-group schemas (dpv_<group>.yaml) and the top-level umbrella dpv.yaml from the vendored upstream OWL release under upstream-releases/dpv/.
- scripts/dpv_extensions_to_linkml.py regenerates all extension schemas under src/dpv/schema/extensions/ from the same vendored upstream release. Invoked via just gen-linkml-extensions; must run after dpv_core_to_linkml.py because it imports that module for shared utilities.
- scripts/linkml_import_tools.py strip-siblings produces self-contained per-extension YAML for the documentation build by dropping all non-linkml:types imports and stripping dangling is_a / mixins references. Driven by the gen-doc-extensions recipe, which is itself a dependency of top-level gen-doc, so just gen-doc produces both docs/elements/ and docs/extensions/<slug>/ in one pass.
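
The import-map materialisation step can be sketched as follows. This is an assumption-laden illustration rather than the _importmap recipe itself: the materialise_importmap function is hypothetical, and the single map entry is an invented example of what a key and a relative value might look like.

```python
import json
import tempfile
from pathlib import Path


def materialise_importmap(src: Path, dest: Path) -> dict:
    """Write a copy of the import map whose values are absolute paths.

    Relative values are anchored at the directory containing the source map.
    """
    base = src.resolve().parent
    relative = json.loads(src.read_text(encoding="utf-8"))
    absolute = {key: str((base / value).resolve()) for key, value in relative.items()}
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(json.dumps(absolute, indent=2), encoding="utf-8")
    return absolute


# Demonstration in a throwaway directory; the entry below is invented.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "importmap.json"
    source.write_text(
        json.dumps({"dpv:schema/dpv_core": "src/dpv/schema/dpv_core"}),
        encoding="utf-8",
    )
    resolved = materialise_importmap(source, root / "tmp" / "importmap.json")
    print(resolved)
```

Keeping the relative file as the source of truth and generating the absolute copy on demand means the checked-in map stays portable across machines.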
Status
This is an early-stage but functional port. Key milestones:
- ✅ Full DPV 2.3 surface area (973 classes / 144 slots, 1:1 with upstream OWL) generated from upstream OWL into 8 semantic-group schemas.
- ✅ Every concrete class carries the required id identifier: root classes in all 8 semantic groups are wired to is_a: DpvThing (declared once in dpv_common, imported transitively since every group depends on it), so JSON-Schema validation enforces id on subclasses, not just on DpvThing itself. This is the single constraint the invalid fixtures exercise.
- ✅ Open-world validation honored by tests/test_data.py and the linkml-run-examples CLI path; the open-world drop-in runner (tests/run_examples_open_world.py) and pytest now share a schema-aware target-class resolver that maps filename targets via class names, class_uri and aliases, falling back to DpvThing when the upstream example references a concept outside the merged schema.
- ✅ 26 SSSOM cross-walks present (17 hand-curated + 9 LLM-seeded candidate sets marked semapv:LLMBasedMatching and pending human review); verifier checks every row against the schema.
- ✅ Upstream DPV CG examples round-tripped from TTL into LinkML YAML and exercised in CI; unmapped predicates preserved as # __unmapped__: provenance comments so nothing is silently dropped. The upstream release ships only a monolithic HTML page (dex.html); scripts/dpvcg_examples.py extract extracts 88 individual E####.ttl snippets from it. The full pipeline (HTML -> TTL -> wrapped TTL -> 5 valid YAML fixtures, with 6 examples routed to problem/) is wired into just test via _gen-fixtures - no manual step required.
- ✅ 64 hand-authored valid fixtures covering 52 distinct classes (core, risk, AI, legal/EU, loc, tech, and sector modules) and 48 hand-authored invalid fixtures all exercising the one schema-enforced constraint (id must be a required uriorcurie string) across violation modes (missing id, null id, integer id). Fixtures carry comments recording the originally-intended semantic constraint (enum values, date formats, cardinality, referential integrity) as a roadmap for future schema tightening. Plus 5 generated YAML fixtures from the upstream DPV CG examples. All exercised by tests/test_data.py (117 passed) and just _test-examples.
- ✅ URI-style imports (dpv:schema/...) wired across dpv.yaml and all 34 module schemas, with build-time resolution via importmap.json and post-publication resolution via w3id.org/lmodel. gen-project, gen-doc and all 12 sub-generators complete cleanly.
- ✅ Each per-extension schema imports only the semantic group(s) it specialises (computed from its is_a / range reference graph by scripts/dpv_extensions_to_linkml.py) rather than the full umbrella dpv:schema/dpv. Narrow consumers load much less of the core (e.g. importing pd alone resolves ~580 classes instead of the full ~1,200; justifications ~195), while AI-heavy stacks that transitively pull tech still need most groups. The umbrella remains available for consumers wanting the whole vocabulary in one import.
- ✅ Per-module subsets are first-class in the generated documentation: each semantic-group schema declares a <group>_subset and every class/slot is tagged with its group membership via in_subset. gen-doc renders a dedicated page for each subset. Per-module schemas under modules/ remain usable standalone but are intentionally not imported by dpv.yaml to avoid duplicate from_schema resolution in gen-project.
- ✅ Per-extension documentation: just gen-doc-extensions renders an isolated gen-doc tree for each top-level extension (ai, justifications, loc, pd, risk, tech) plus an auto-generated extensions/index.md overview, all wired into the mkdocs nav. Each extension is documented from its own (sibling-import-stripped) schema, so cross-extension and core parents render as plain text rather than breaking the build.
- ✅ Class-name collision avoidance trimmed to the minimum needed: DPV-domain terms (Consent, Contract, Notice, Policy, Purpose, Right, Risk, Rule, Status, Assessment) keep their natural names; only generic primitives that collide with Python typing or common base classes (Agent, Organisation, Person, Process, Service, …) are emitted as Dpv*.
- 🚧 Generator patches (gen_*_patched.py) consolidate workarounds for upstream LinkML issues (raised upstream); upstreaming is in progress.
- 🚧 Documentation site at https://lmodel.github.io/dpv is generator-driven and tracks main.
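
The schema-aware target-class resolver mentioned in the status list can be pictured roughly as below. This is a hypothetical reconstruction, not the shared resolver in the test suite: the function names are invented, the ConsentRecord alias is made up, and matching on the local part of class_uri is an assumption.

```python
import re


def build_target_index(classes: dict) -> dict:
    """Map class names, class_uri local parts and aliases to canonical class names."""
    index = {}
    # First pass: exact class names always win.
    for name in classes:
        index[name] = name
    # Second pass: class_uri local parts and aliases fill the remaining gaps.
    for name, body in classes.items():
        body = body or {}
        candidates = list(body.get("aliases", []))
        uri = body.get("class_uri")
        if uri:
            candidates.append(re.split(r"[:#/]", uri)[-1])
        for key in candidates:
            index.setdefault(key, name)
    return index


def resolve_target(filename: str, index: dict, fallback: str = "DpvThing") -> str:
    """Fixtures are named <ClassName>-<stem>-<n>.yaml; resolve the first segment."""
    target = filename.split("-", 1)[0]
    return index.get(target, fallback)


classes = {
    "DpvThing": {},
    "Consent": {"class_uri": "dpv:Consent", "aliases": ["ConsentRecord"]},
    "DpvOrganisation": {"class_uri": "dpv:Organisation"},
}
index = build_target_index(classes)

for fixture in ["Consent-E0001-1.yaml", "Organisation-E0002-1.yaml", "Unknown-E0003-1.yaml"]:
    print(fixture, "->", resolve_target(fixture, index))
```

Resolving through class_uri is what lets an upstream name such as Organisation find a class that was emitted with a Dpv prefix to avoid a naming collision.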
See also
- W3C DPV - upstream source.
- LinkML - the modelling language used here.
- SSSOM - the mapping format used for the cross-walks.
- CONTRIBUTING.md - how to file changes.