Re: statement identifiers

thomas lörtsch Wed, 19 Sep 2018 11:28:25 -0700


> On 19. Sep 2018, at 12:22, Andy Seaborne <[email protected]> wrote:
> 
> You could look at the RDF* work  https://github.com/RDFstar/RDFstarTools from 
> Olaf Hartig. (actually RDF* isn't a triple id, it can be implemented that way 
> - the triple is a new kind of RDF term in the concrete data model of RDF)

I’m not fond of that literal style of reification for some reasons:
- it doesn’t scale well to annotations on annotations on annotations…
- it "feels" wrong: the triple and its reified representation are too similar
- I could be wrong, but aren’t its semantics the same as those of RDF standard 
reification (and therefore a problem, not the solution)?

> Do note that any triple id or triple as element in the data model is not 
> reification. RDF* and things of a similar design can be encoded using 
> reification but they are not reification.
> 
> For example: the stating (that is, a claim, not a fact):
> 
> :A :says ":moon :color :blue" .
> 
> can't be done in RDF* without
> 
> :moon :color :blue .
> 
> also being in the data but if it is in the data, it is a fact.  The triple 
> spoken about must be in the data, whereas reification does not require that.
> 
> Reification can express differing points of view:
> 
> :A :says ":moon :color :blue" .
> :B :says ":moon :color :red" .
> 
> without
> 
> SELECT * { :moon :color  ?C }
> 
> returning anything (the data does not have a fact about the moon color, only 
> claims).
> 
> With careful modelling using the triple id, you can build interesting cases, 
> especially by using event-based modelling to attach groups of triples about 
> claims.

I have to nitpick on your wording here: reification is reification no matter 
what exactly it reifies. RDF standard reification has that strange thing of 
being able to reify statements that haven’t actually been asserted. At least it 
is interpreted that way but that might be a common misconception - see below. 
What you want to say is that RDF standard reification can reify more things 
((maybe) unasserted statements, to be exact) than what RDF* can reify.
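
The difference between a claim and a fact can be sketched in a few lines of
Python. This is only an illustration under my own assumptions: plain tuples
stand in for triples, and the names `asserted`, `claims` and `match` are made
up here, not Jena or RDF* API.

```python
# Sketch only: plain tuples stand in for triples; nothing here is a Jena or
# RDF* API, and all names are hypothetical.
asserted = set()  # triples stated as facts (none about the moon)

# Claims kept in reified form: (speaker, (subject, predicate, object)).
claims = [
    (":A", (":moon", ":color", ":blue")),
    (":B", (":moon", ":color", ":red")),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Comparable to SELECT * { :moon :color ?C } over the asserted data only.
facts_about_moon = match(asserted, ":moon", ":color")

# The claims themselves stay queryable without ever becoming facts.
claimed_colors = sorted(
    t[2] for _, t in claims if t[:2] == (":moon", ":color")
)
```

With RDF*-style annotation the two moon triples would have to sit in
`asserted` as well, and the pattern query would then return both colors.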

> OWL-ism: note that :color might be a functional property, so if two color 
> triples are in the data, a reasoner infers that their objects are owl:sameAs.
> 
> :blue owl:sameAs :red .
> 
> Oops!

Of course, but … does it really matter? There will always be cases where 
things blow up because data from different sources gets mixed that was never 
meant to be (a case for contexts…).

> I'd be very interested in hearing your idea about adding named graphs into 
> this mix.

Okay, you asked for it… :-) I’ve been bitching for decades about the mediocre 
meta modelling facilities in RDF (used to prefer Topic Maps back in the day) 
and finally decided to try to do something about it, so I started to work on 
"Context on the Semantic Web". Turns out the topic is more complicated than I 
thought and reaches into all kinds of unpleasant, strange and unexpected areas 
(RDF standard reification, n-ary relations, attributed graphs etc), not to 
mention the vague and intangible nature of the concept of context itself.

Regarding reification I asked on [email protected] [0] a few weeks ago and 
Pat Hayes was kind enough to enlighten me. If I got it right there are 3 kinds 
of triples:
- an abstract triple (TA): a specific triple that could be stated but hasn’t 
been (at least not to our knowledge) - the thing you think RDF standard 
reification implements
- a triple type (TY): a specific triple and all its occurrences in any 
document, database, scribbled on a piece of paper etc
- a triple token (TO): a specific occurrence of some triple type, a triple in 
some named graph in some database on some server under some desk
To my great surprise I learned that RDF reification would like to deal with the 
third case - the concrete triple token occurrence - though it needs means 
external to RDF to refer to them. It was then, in the late 90s, seen as 
necessary to be able to add e.g. provenance information to triples. RDF however 
has no way to distinguish one occurrence TO of a triple TY from another 
occurrence TO of the same triple TY in e.g. another file (as RDF has no concept 
of a document that holds the triple nor any other concept of "context", because 
its model theory is defined in terms of abstract set theory). So RDF here could 
only refer to means outside of RDF and those are not standardized.
With respect to the common conception that RDF standard reification can 
represent unasserted statements I’d say: no, it can’t. That would be the 
abstract triple TA. It does (want to) speak about triple tokens TO, it just 
can’t say which one exactly it means. OTOH if you prefer "If it walks like a 
duck…" type semantics then you’re probably right - but one thing I learned with 
this Semantic Web stuff is that in the end the pedants win every argument ;-)
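
To keep TA, TY and TO apart, a small Python sketch may help. It is my own
illustration, not anything RDF defines: the class names, the `location` field
and the example.org graph names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TripleType:
    """TY: a triple identified purely by its three terms."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class TripleToken:
    """TO: one occurrence of a triple type at some location."""
    triple: TripleType
    location: str  # hypothetical: a graph name, file, database, ...


moon_blue = TripleType(":moon", ":color", ":blue")

# Two occurrences of the same triple type in different places.
tok1 = TripleToken(moon_blue, "http://example.org/graphA")
tok2 = TripleToken(moon_blue, "http://example.org/graphB")

same_type = tok1.triple == tok2.triple  # one TY behind both tokens
same_token = tok1 == tok2               # but two distinct TOs

# TA: a triple we can name although no occurrence of it is known to us.
known_tokens = {tok1, tok2}
abstract = TripleType(":moon", ":color", ":green")
has_occurrence = any(t.triple == abstract for t in known_tokens)
```

Plain RDF only has the equivalent of `TripleType`; the `location` field is
exactly the part that has to come from outside the model.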

Now, as you know all too well, Named Graphs in Datasets also offer no 
semantically sound way to refer to them without some out-of-band mechanism, as 
it's unspecified whether a graph name actually denotes the graph it names or 
merely labels it while referring to something else.

So RDF doesn’t provide _any_ semantically sound means for meta modelling. That 
is of course no news but it seems like finally it starts to be considered a 
problem as meta modelling is used in graphs outside the Semantic Web like in 
Property Graphs and in WikiData [1]. That should be good news for me but 
unfortunately I still don’t have a good plan for how everything could be made 
to fit, as there are more problems.

Contexts, as vague a concept as they are, could be reduced to a very abstract 
concept of secondary attributes to (primary) relations. Everything more 
specific/expressive probably belongs into the realm of vocabularies. RDF 
however would have to provide a basic mechanism to add attributes to relations, 
and that would require it to be able to definitely denote triple occurrences TO 
- but it just can’t… Granted, syntactically it can in Datasets, and it can to 
a less pleasant degree with standard reification - but neither is backed by 
model theoretic semantics. So no reasoning, no well defined semantics, and 
therefore no good.

My idea now is to proceed stepwise: first add a statement identifier to 
every statement. To make sure it is not (again, like with graph names) used as 
a label (e.g. to indicate a "context") the statement id is a hash (maybe MD5, 
but I’m no expert there). One could imagine it as having been there all the 
time, it just hadn’t been made explicit - so no need to extend the RDF 
semantics, just a syntactic tweak that fits snugly into that little spot left 
undefined by the RDF standard reification mechanism. And boom we got rock solid 
triple denotation, with well defined formal model theoretic semantics, 
reasoning galore and are moved to tears! Ahem…
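
A minimal sketch of such a hash-based statement ID, in Python. The function
name `statement_id` and the newline-joined serialization are assumptions of
this sketch; a real scheme would need an agreed canonical form for the terms.

```python
import hashlib


def statement_id(subject: str, predicate: str, obj: str) -> str:
    """Derive an ID from the triple's own terms (sketch, not a standard)."""
    # Assumption: terms are joined with a newline. A real scheme would need
    # a canonical form (IRIs vs. literals, blank nodes, escaping).
    data = "\n".join((subject, predicate, obj)).encode("utf-8")
    return hashlib.md5(data).hexdigest()


# The same triple always yields the same ID, wherever it is computed.
id_a = statement_id(":moon", ":color", ":blue")
id_b = statement_id(":moon", ":color", ":blue")

# A different triple yields a different ID.
id_c = statement_id(":moon", ":color", ":red")
```

Because the ID is computed from the triple alone, nobody can repurpose it as a
free-form label the way graph names are.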

Actually that hash provides only the way to denote the triple type TY, not the 
triple token TO. So a second step is required: for the token TO - which is what 
we are really aiming for most of the time - we need a concept of context or 
surface or g-box or similar. But we can also get there via a second statement, 
like:
    aSub    aPred      aObj    id_1
    id_1    inContext  aCon    id_2
These two triples+ID can be written in one row:
    aSub    aPred      aObj    id_1    id_1    inContext  aCon    id_2
We see that columns 4 and 5 are redundant and column 6 will always be 
"inContext", so we can shorten the two triples+ID to one quad+ID:
    aSub    aPred      aObj                               aCon    id_2
The hashing function now includes the field in column 4 and ensures that a 
triple in the default context would still get id_1:
    aSub    aPred      aObj                               ----    id_1
So we can differentiate between a triple in the default context and some 
contextualized triple. What that is actually worth, depends. In a local dataset 
it can make a difference. As soon as datasets are shared on the web the default 
context will likely change to some source identifier.
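
The derivation of the quad+ID can be sketched in Python as follows. This is
one possible reading, not a fixed design: I assume id_2 is simply the hash of
the second statement (id_1, inContext, aCon), and the names `triple_id` and
`quad_id` are made up for the sketch.

```python
import hashlib
from typing import Optional

IN_CONTEXT = "inContext"  # the fixed predicate of the second statement


def _hash(*terms: str) -> str:
    # Assumption: newline-joined terms hashed with MD5.
    return hashlib.md5("\n".join(terms).encode("utf-8")).hexdigest()


def triple_id(s: str, p: str, o: str) -> str:
    """id_1: the ID of the bare triple."""
    return _hash(s, p, o)


def quad_id(s: str, p: str, o: str, context: Optional[str] = None) -> str:
    """ID of a quad; the default context falls back to the triple ID."""
    id_1 = triple_id(s, p, o)
    if context is None:
        return id_1
    # id_2 is the ID of the statement "id_1 inContext context".
    return _hash(id_1, IN_CONTEXT, context)


id_1 = quad_id("aSub", "aPred", "aObj")          # default context
id_2 = quad_id("aSub", "aPred", "aObj", "aCon")  # contextualized
```

Computed this way, id_2 can always be re-derived from id_1 and the context,
which is what ties the quad+ID to the triples+ID.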
It is also possible to compute a statement hash and use it as a subject of 
other triples without actually adding the statement to the store (still it can 
and probably should be described using a standard reification style quadlet). 
That covers the TA use case. It would be good though to define another type 
than rdf:Statement for this.
No doubt there are a lot of technical details that I overlooked. Maybe the MD5 
hash could be omitted in internal use and replaced by something more performant 
and/or legible. "#someCruelHash rdf:label ex:myfirstTriple". Maybe, if 
reification is only used sparingly, the fifth field can be optimized away, 
etc...
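
The unasserted (TA) case could look roughly like this in Python. It is a
sketch under assumptions: the store is a plain set of tuples, and
`ex:UnassertedStatement` is a hypothetical type standing in for the "other
type than rdf:Statement" mentioned above.

```python
import hashlib


def statement_id(s: str, p: str, o: str) -> str:
    # Assumption: newline-joined terms hashed with MD5, written as "#hash".
    return "#" + hashlib.md5("\n".join((s, p, o)).encode("utf-8")).hexdigest()


store = set()  # stand-in for a triple store

claim = (":moon", ":color", ":blue")
sid = statement_id(*claim)

# Describe the statement and talk about it, but never add the claim itself.
store.update({
    (sid, "rdf:type", "ex:UnassertedStatement"),  # hypothetical type
    (sid, "rdf:subject", claim[0]),
    (sid, "rdf:predicate", claim[1]),
    (sid, "rdf:object", claim[2]),
    (":A", ":says", sid),
})

is_asserted = claim in store
```

The quadlet lets a reader recover which triple the hash stands for, since the
hash alone cannot be reversed.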

What I like about this approach is that it solidly binds the quad+ID to the 
triples+ID. As the triple ID is firmly bound to the triple itself that should 
guarantee some pretty solid semantics for the quad/quint as well - at least I 
haven’t found any gaping holes so far.

We can now, finally, talk about specific concrete statement occurrences TO in 
semantically sound ways. The context field can for example carry information 
about the database that holds this triple, but just as well anything else. It 
will probably be subject of some other statements that define its attributes. 
There is total freedom in the use of the context field and actually that 
bothers me a little. But that’s another topic. What’s important is that the 
basic mechanism is now in place to have the luxury of such a problem :) This in 
short is the background of why I find quad+ID interesting.

Another justification would be that people often want to group statements. 
There’s a lot of practical use cases where it seems abhorrent to add a second 
statement to every other statement e.g. just to record the date of ingestion. A 
context field provides the means to do this quite efficiently.
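
A small Python sketch of the grouping argument. The batch names, the
`:ingestedOn` property and the dates are invented for illustration; only the
idea of one annotation per context instead of one per statement comes from the
text above.

```python
from collections import defaultdict

# Quads as (subject, predicate, object, context); all values are made up.
quads = [
    (":moon", ":color", ":blue", ":batch1"),
    (":moon", ":orbits", ":earth", ":batch1"),
    (":mars", ":color", ":red", ":batch2"),
]

# One annotation per context instead of one per statement.
context_info = {
    ":batch1": {":ingestedOn": "2018-09-18"},
    ":batch2": {":ingestedOn": "2018-09-19"},
}


def ingestion_date(quad):
    """Look up the ingestion date through the quad's context field."""
    return context_info[quad[3]][":ingestedOn"]


# Group the bare triples by the date recorded on their context.
by_date = defaultdict(list)
for q in quads:
    by_date[ingestion_date(q)].append(q[:3])
```

Three statements need only two date annotations here, and the saving grows
with the size of each batch.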

My main problem right now is that I find it very hard to distinguish between an 
attribute to a relation and a context of that relation. The mechanism described 
above allows annotating a statement with additional statements through its ID 
or through its context field, which can refer to an arbitrarily complex context 
object. A pragmatic approach could be to say that contexts are for grouping 
purposes whereas statement specific attributions should handle aspects specific 
to that single statement. That probably makes it easier to model stuff but it 
sure makes it harder to query (having to search contexts as well as 
reifications). This requires more thinking...

Just one more idea for useful applications of statement IDs: in another mail on 
this(?) mailing list [2] you argue that relation reification is an anti-pattern. 
I’m not sure I agree with you there. I’m more thinking in a direction of 
complex, fine-grained objects from which simplistic "A emails B" triples are 
derived - and back linked to the originating complex object :myFirstEmail that 
has all the details like date, BCC's etc. That back link could be provided 
through a statement annotation "ID_1 derivedFrom :myFirstEmail". The simplistic 
triple "A emails B" would convey a basic fact and facilitate retrieval and 
integration.
The triple is indeed both the weak and the strong point of RDF: anything 
reasonably complex is at best tedious to model with triples. But in a vast, 
heterogeneous, distributed graph the basic, simplistic triple is by far the 
best bet to successfully navigate from A to B to C etc. So: statement IDs and 
back links to the rescue (hopefully, maybe…). The complex mothership then might 
even be an RDBMS table, a tree structure, who knows - but that’s already RDF 
3.0 I fear.
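
The back-link idea can be sketched in Python like this. All names and values
(`emails`, `:derivedFrom`, the date, the BCC entry) are illustrative
assumptions; the only point is that the simple triple and the rich object stay
connected through the statement ID.

```python
import hashlib


def statement_id(s: str, p: str, o: str) -> str:
    # Assumption: newline-joined terms hashed with MD5.
    return hashlib.md5("\n".join((s, p, o)).encode("utf-8")).hexdigest()


# The complex "mothership" objects; field names and values are made up.
emails = {
    ":myFirstEmail": {
        "from": ":A", "to": ":B", "date": "2018-09-18", "bcc": [":C"],
    },
}

triples = set()
for email_id, e in emails.items():
    simple = (e["from"], ":emails", e["to"])
    triples.add(simple)  # the basic fact, easy to find and integrate
    # Back link from the statement ID to the originating object.
    triples.add((statement_id(*simple), ":derivedFrom", email_id))


def details_for(simple):
    """Follow the back link from a simple triple to its source objects."""
    sid = statement_id(*simple)
    return [
        emails[o] for s, p, o in triples
        if s == sid and p == ":derivedFrom"
    ]


details = details_for((":A", ":emails", ":B"))
```

Retrieval starts from the simple triple, and the details are only fetched when
somebody actually asks for them.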

Thanks for your interest! I hope this wasn’t too much detail.
Thomas

> One of the problems with reification is that it applies to statements, not 
> to a group of statements.
> 
> In extremis:
> 
> GRAPH <someId1> { :moon :color :blue }
> GRAPH <someId2> { :moon :color :red }
> 
> at least means the default graph makes no assertion about
> { :moon :color  ?C }
> 
>    Andy
> 
> 
> 
> 
> 
> 
> On 18/09/18 22:16, Rob Vesse wrote:
>> None of the Jena provided implementations use statement IDs, that includes 
>> both TDB1 and TDB2 which both just store quads directly
>> Rob
>> On 18/09/2018, 13:15, "ajs6f" <[email protected]> wrote:
>>     >>
>>     >> Not in general, no, although some specific DatasetGraph 
>> implementations may.
>>     >
>>     > Any idea where I should look?
>>          Nope. None of the in-memory implementations do this to my 
>> knowledge, because they needn't. I don't know if either TDB1 or -2 do, but I 
>> can't think of a reason they would.
>>          It's possible that someone out there in the community has written 
>> one, or you could try implementing DatasetGraph yourself, perhaps reusing 
>> some other implementation for part of the work.
>>          ajs6f
>>          > On Sep 18, 2018, at 4:00 PM, thomas lörtsch <[email protected]> wrote:
>>     >
>>     >
>>     >> On 18. Sep 2018, at 21:40, ajs6f <[email protected]> wrote:
>>     >
>>     > That was quick!
>>     >
>>     >> Not in general, no, although some specific DatasetGraph 
>> implementations may.
>>     >
>>     > Any idea where I should look?
>>     >
>>     >> There is some API support for reification:
>>     >>
>>     >> https://jena.apache.org/documentation/notes/reification.html
>>     >>
>>     >> Does that meet your use case?
>>     >
>>     > No, unfortunately not. I need the graph name too.
>>     >
>>     > Thomas
>>     >
>>     >
>>     >> ajs6f
>>     >>
>>     >>> On Sep 18, 2018, at 3:37 PM, thomas lörtsch <[email protected]> wrote:
>>     >>>
>>     >>> Hi,
>>     >>>
>>     >>> a question (and my apologies upfront that I don’t take the time to 
>> dive into the code myself, but it would take me a lot of time):
>>     >>>
>>     >>> Does Jena happen to add an internal ID to each quad 
>> (statement+graphName)?
>>     >>>
>>     >>> Some databases do so for internal administrative purposes (I 
>> believe) and so I thought it might be worth asking.
>>     >>> If Jena does provide such IDs I would like to use them as 
>> reification IDs and my next question would be about how hard it is to access 
>> them.
>>     >>>
>>     >>> Thanks,
>>     >>> Thomas
>>     >>
>>     >
>>          

[0] https://lists.w3.org/Archives/Public/semantic-web/2018Jul/0024.html
[1] Hernández, Daniel, Aidan Hogan, and Markus Krötzsch. "Reifying RDF: What 
works well with Wikidata?" SSWS@ISWC 1457 (2015): 32-47.
[2] https://apache.markmail.org/message/js6s6ry5st73soay
