All,

I apologize for the lateness of this. I threw it together yesterday and sent it 
to the list but just noticed that it never left my outbox so I must have messed 
something up.

This is a VERY simple overview of some of the aspects of blank nodes we should 
consider when discussing whether they should be used for SPDX 3.0.
It is VERY informal and quickly thrown together, so please do not interpret it 
as anything too rigorous. Rather than spending time writing up rigorous 
argumentation, I pulled together several reference links addressing the various 
aspects and let those do the talking, with only a simple summary of each issue 
from me.


  *   Some VERY quick and short notes on the question of using blank nodes or 
not
  *   The below short outline includes several relevant links to resources on 
various aspects of this issue. All of these links were found within 20 mins of 
very simple Google querying and all were within the first 5-10 results for each 
Google query.
  *   There is broad consensus on the existence of significant issues and 
challenges with using blank nodes. 15-30 mins of googling will yield scores of 
papers, blog posts, articles, etc. calling out various reasons that blank nodes 
are problematic and should be avoided wherever possible in the large majority 
of situations. The defined semantics and specifications regarding Bnodes are 
inconsistent and contradictory, leading to inconsistency between tools and 
ambiguity in how Bnodes will be processed, interpreted, or queried.
     *   http://richard.cyganiak.de/blog/2011/03/blank-nodes-considered-harmful/
     *   https://aidanhogan.com/docs/blank_nodes_jws.pdf
     *   https://marceloarenas.cl/publications/iswc11.pdf
     *   https://terminusdb.com/blog/blank-nodes-in-rdf/

     *   When blank nodes are used, it is typically for the convenience of the 
producer, but this often comes at significant cost to the consumer in the form 
of ambiguity, uncertainty, complexity, and resources (time and computing 
resources).
     *   IF it is decided to use them, they are valid ONLY within the scope of a 
single datastore or single serialized document and NOT for global or 
cross-scope use. This is explicitly stated in all of the W3C specs dealing with 
Bnodes. Using them across scopes, as SPDX 3.0 is intended to be used, leads to 
significant potential data integrity issues.
        *   https://www.w3.org/wiki/BlankNodes
        *   https://www.w3.org/TR/rdf11-concepts/#section-blank-nodes
        *   https://www.w3.org/TR/rdf-mt/#unlabel
        *   https://en.wikipedia.org/wiki/Blank_node
     *   Avoiding these significant potential issues typically requires 
skolemization (replacing the localized ids with globally unique IRIs) of the 
Bnodes. This extra effort is forced on the consumer and is often done by 
processors and graph stores as part of deserialization/ingestion. However, due 
to the inconsistencies in the specs regarding Bnodes, this is not done 
consistently. Some processors and stores do not perform skolemization and simply 
use the localized Bnode ids (especially if they are producer asserted in any 
way). This leads to significant integrity issues when these ids collide (a 
simple example even appears explicitly in some W3C docs/specs and on the 
Wikipedia page), and the risk increases significantly with the volume of 
cross-scope content ingested. Skolemization also does not provide any 
id-related context for the source of the nodes, such as that provided by 
namespaces in producer-specified IRIs.
        *   https://www.w3.org/2011/rdf-wg/wiki/Skolemisation#:~:text=Blank%20nodes%20do%20not%20have,are%20known%20as%20Skolem%20IRIs.
        *   https://www.w3.org/wiki/BnodeSkolemization
     *   Bnodes also have very significant issues for SPARQL, the definitive 
standard mechanism for querying RDF graphs. The two do NOT play well together 
at all due to inherent issues in Bnode design and inconsistencies in the RDF 
specs related to Bnodes. Many queries can return inconsistent and unreliable 
results. Various academics and companies have offered workarounds and schemes 
to attempt to address this disconnect, but they all come at significant 
computational complexity and cost. These issues increase significantly with the 
volume of Bnodes in the overall graph being queried.
        *   https://www.w3.org/2009/12/rdf-ws/papers/ws23
        *   https://aidanhogan.com/docs/certain_answers_sparql_blank_nodes.pdf
        *   http://www.jsoftware.us/vol7/jsw0709-09.pdf
     *   Bnodes also cause significant issues with semantic entailment (ability 
to determine full semantic integrity and correctness) of the graph. Entailment 
is required for any higher-order semantic inferencing and analysis. A couple of 
academics have offered papers that purport to mathematically prove that 
entailment using Bnodes is NP-complete, though there is broad consensus that, 
while entailment with Bnodes is likely possible, it is almost always 
impractical and problematic.
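
To make the scope and skolemization points above concrete, here is a minimal 
sketch in plain Python (no RDF library). Everything in it is an assumption for 
illustration: triples are modeled as string tuples, the `ex:` names and the 
`https://example.org` authority are made up, and `naive_merge`, `skolemize`, 
and `names_of_creator` are hypothetical helpers, not the API of any real tool. 
The only detail taken from the specs is the `/.well-known/genid/` path that RDF 
1.1 Concepts suggests for Skolem IRIs.

```python
import uuid

# Triples are (subject, predicate, object) string tuples. A term starting
# with "_:" stands for a blank node. All names and IRIs here are invented.
doc_a = [
    ("_:b0", "ex:name", "Alice"),
    ("ex:pkg1", "ex:createdBy", "_:b0"),
]
doc_b = [
    ("_:b0", "ex:name", "Bob"),
    ("ex:pkg2", "ex:createdBy", "_:b0"),
]


def is_bnode(term):
    return term.startswith("_:")


def naive_merge(*docs):
    """Union the triples, keeping whatever labels the producers assigned."""
    merged = set()
    for doc in docs:
        merged.update(doc)
    return merged


def skolemize(doc, authority="https://example.org"):
    """Replace each blank node label in ONE document with a fresh IRI."""
    mapping = {}

    def swap(term):
        if not is_bnode(term):
            return term
        if term not in mapping:
            mapping[term] = f"{authority}/.well-known/genid/{uuid.uuid4().hex}"
        return mapping[term]

    return [(swap(s), p, swap(o)) for s, p, o in doc]


def names_of_creator(graph, pkg):
    """Names attached to whatever node(s) the package says created it."""
    creators = {o for s, p, o in graph if s == pkg and p == "ex:createdBy"}
    return sorted(o for s, p, o in graph if s in creators and p == "ex:name")


collided = naive_merge(doc_a, doc_b)
safe = naive_merge(skolemize(doc_a), skolemize(doc_b))

print(names_of_creator(collided, "ex:pkg1"))  # ['Alice', 'Bob']
print(names_of_creator(safe, "ex:pkg1"))      # ['Alice']
```

In the naive merge both documents' `_:b0` become one node, so `ex:pkg1` appears 
to have a creator named both Alice and Bob. After per-document skolemization 
the two nodes stay distinct, but note that the generated IRIs say nothing about 
which producer they came from.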

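The entailment cost mentioned above can be illustrated the same way. Under RDF 
simple entailment, a graph G entails a graph E if the blank nodes of E can be 
mapped to terms of G so that every triple of E lands in G. The sketch below is 
a deliberately naive brute-force check, assuming the same string-tuple model 
and made-up `ex:` names; `simply_entails` is a hypothetical helper, and real 
reasoners use far better strategies. It only shows why the search space grows 
with the number of blank nodes.

```python
from itertools import product


def is_bnode(term):
    return term.startswith("_:")


def simply_entails(g, e):
    """True if some mapping of e's blank nodes to terms of g puts e inside g."""
    g = set(g)
    bnodes = sorted({t for s, _, o in e for t in (s, o) if is_bnode(t)})
    terms = sorted({t for s, _, o in g for t in (s, o)})
    # Brute force: tries up to len(terms) ** len(bnodes) candidate mappings.
    for choice in product(terms, repeat=len(bnodes)):
        m = dict(zip(bnodes, choice))
        if all((m.get(s, s), p, m.get(o, o)) in g for s, p, o in e):
            return True
    return False


g = [
    ("ex:pkg1", "ex:createdBy", "ex:alice"),
    ("ex:alice", "ex:name", "Alice"),
]
# "pkg1 was created by something named Alice" / "... named Bob"
e_alice = [("ex:pkg1", "ex:createdBy", "_:x"), ("_:x", "ex:name", "Alice")]
e_bob = [("ex:pkg1", "ex:createdBy", "_:x"), ("_:x", "ex:name", "Bob")]

print(simply_entails(g, e_alice))  # True
print(simply_entails(g, e_bob))    # False
```

With no blank nodes in E the check is a plain subset test; each additional 
blank node multiplies the number of candidate mappings by the number of terms 
in G.
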

Sean



