[spdx-tech] Some fodder for the discussion of blank nodes

Sean Barnum Tue, 01 Aug 2023 08:50:15 -0700

All,

I apologize for the lateness of this. I threw it together yesterday and sent it 
to the list but just noticed that it never left my outbox so I must have messed 
something up.

This is a VERY simple overview of some of the aspects of blank nodes we should
consider when discussing whether they should be used for SPDX 3.0.
It is VERY informal and quickly thrown together so please do not interpret it
as anything too rigorous. Rather than me spending time writing up rigorous
argumentation I instead took an approach of pulling together several reference
links addressing various aspects and let those do the talking with only a
simple summarization of the aspect issue from me.

* Some VERY quick and short notes on the question of using blank nodes or
not
*
* The below short outline includes several relevant links to resources on
various aspects of this issue. All of these links were found within 20 mins of
very simple Google querying and all were within the first 5-10 results for each
Google query.
*
* There is broad consensus on the existence of significant issues and
challenges with using blank nodes. 15-30 mins of googling will yield scores of
papers, blog posts, articles, etc. calling out various reasons that blank nodes
are problematic and should be avoided wherever possible in the large majority
of situations. The defined semantics and specifications regarding Bnodes are
inconsistent and contradictory, leading to inconsistency between tools and
ambiguity in how they will be processed, interpreted, or queried.
* http://richard.cyganiak.de/blog/2011/03/blank-nodes-considered-harmful/
* https://aidanhogan.com/docs/blank_nodes_jws.pdf
* https://marceloarenas.cl/publications/iswc11.pdf
* https://terminusdb.com/blog/blank-nodes-in-rdf/

* When blank nodes are used it is typically for the convenience of the
producer, but this often comes at significant cost to the consumer in the form
of ambiguity, uncertainty, complexity, and resources (time and computing
resources).
* IF the decision is made to use them, they are valid ONLY within the scope of
a single datastore or single serialized document and NOT for global or
cross-scope use. This is explicitly stated in all of the W3C specs dealing with
Bnodes. Using them across scopes, as SPDX 3.0 content is intended to be used,
leads to significant potential data integrity issues.
* https://www.w3.org/wiki/BlankNodes
* https://www.w3.org/TR/rdf11-concepts/#section-blank-nodes
* https://www.w3.org/TR/rdf-mt/#unlabel
* https://en.wikipedia.org/wiki/Blank_node
* Avoiding these significant potential issues typically requires
skolemization (replacing the localized ids with globally unique IRIs) of the
Bnodes. This extra effort is forced on the consumer and is often done by
processors and graph stores as part of deserialization/ingestion. However, due
to the inconsistencies in the specs regarding Bnodes, this is not done
uniformly. Some processors and stores do not perform skolemization and simply
use the localized Bnode ids (especially if they are producer asserted in any
way). This leads to significant integrity issues as these ids collide (a simple
example even appears explicitly in some W3C docs/specs and on the Wikipedia
page), and the risk increases significantly with the volume of cross-scope
content ingested. Skolemization also does not provide any id-related context
for the source of the nodes, such as that provided by namespaces in
producer-specified IRIs.
* https://www.w3.org/2011/rdf-wg/wiki/Skolemisation#:~:text=Blank%20nodes%20do%20not%20have,are%20known%20as%20Skolem%20IRIs.
* https://www.w3.org/wiki/BnodeSkolemization
* Bnodes also have very significant issues for SPARQL, the definitive
standard mechanism for querying RDF graphs. The two do NOT play well together
at all due to inherent issues in Bnode design and inconsistencies in the RDF
specs related to Bnodes. Many queries can return inconsistent and unreliable
results. Various academics and companies have offered workarounds and schemes
to attempt to address this disconnect, but they all come at significant compute
complexity and cost. These issues increase significantly with the volume of
Bnodes in the overall graph being queried.
* https://www.w3.org/2009/12/rdf-ws/papers/ws23
* https://aidanhogan.com/docs/certain_answers_sparql_blank_nodes.pdf
* http://www.jsoftware.us/vol7/jsw0709-09.pdf
* Bnodes also cause significant issues with semantic entailment (ability
to determine full semantic integrity and correctness) of the graph. Entailment
is required for any higher-order semantic inferencing and analysis. A couple of
academics have offered papers that purport to prove mathematically that
entailment using Bnodes is NP-complete. The broad consensus is that, while
entailment with Bnodes is likely possible, it is almost always impractical and
problematic.
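
To make the id-collision and skolemization points above concrete, here is a
minimal sketch in plain Python. It does not use any RDF library: triples are
modeled as simple tuples, and the `ex:` predicates, the `https://example.org`
authority, and the helper names are all assumptions made up for illustration.
The only detail taken from the W3C material is the `/.well-known/genid/` path
that RDF 1.1 Concepts suggests for Skolem IRIs.

```python
import uuid

# Triples are (subject, predicate, object) tuples. Terms starting with "_:"
# are blank node labels, meaningful only inside the document they came from.
# Both hypothetical producers happened to pick the same local label "_:b0".
doc_a = {
    ("_:b0", "ex:name", '"package-a"'),
    ("_:b0", "ex:version", '"1.0"'),
}
doc_b = {
    ("_:b0", "ex:name", '"package-b"'),
    ("_:b0", "ex:version", '"2.0"'),
}


def is_bnode(term):
    return term.startswith("_:")


def naive_merge(*docs):
    """Union the triples, keeping producer labels as-is (the unsafe shortcut)."""
    return set().union(*docs)


def skolemize(doc, authority="https://example.org"):
    """Replace each blank node in ONE document with a fresh Skolem IRI."""
    mapping = {}

    def sk(term):
        if not is_bnode(term):
            return term
        if term not in mapping:
            # One fresh IRI per distinct label within this document.
            mapping[term] = f"<{authority}/.well-known/genid/{uuid.uuid4().hex}>"
        return mapping[term]

    return {(sk(s), p, sk(o)) for s, p, o in doc}


def subjects(graph):
    return {s for s, _, _ in graph}


merged_naive = naive_merge(doc_a, doc_b)
merged_safe = naive_merge(skolemize(doc_a), skolemize(doc_b))

print(len(subjects(merged_naive)))  # 1: two distinct packages collapsed into one node
print(len(subjects(merged_safe)))   # 2: the packages stay separate
```

Note that the Skolem IRIs generated here are opaque: nothing in them says which
producer or document the node came from, which is the loss of id-related
context mentioned in the skolemization bullet.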

Sean

View/Reply Online (#5271): https://lists.spdx.org/g/Spdx-tech/message/5271