[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues
Manybubbles added a comment.

Adding this here for posterity: It seems like the primary objection to RDR is vendor lock-in. It's a BlazeGraph-specific thing and would have to be reimplemented if we had to go someplace else. You *can* trigger it automatically using standard RDF reification syntax _but_ that syntax is deprecated and painful to query.

I argue that SPARQL itself is worse from a vendor lock-in perspective than RDR. There is _no_ chance that we'll want to reimplement SPARQL ourselves and we only have five open source options that support it:

1. BlazeGraph
2. Virtuoso OpenSource
3. Apache Jena
4. GraphSail on top of Gremlin on top of some other graph database
5. 4store

#4 is unlikely to be efficient, given all the layers of abstraction. There is a chance it'll work but it's not super high. Apache Jena doesn't scale nearly as well as Virtuoso or BlazeGraph according to http://www.w3.org/wiki/LargeTripleStores . 4store hasn't seen much development in a long, long time. That means we're locked to either Virtuoso or BlazeGraph anyway.

So my feeling is that exposing RDR to our users isn't _that_ bad. If we have to take it away one day our options are to reimplement the syntax, to deprecate it and drop support for it, or to replace it with some Wikidata-specific syntax. I think this is OK, especially when you compare it to the craziness we were willing to put up with on top of Gremlin, where switching out the graph backend required you to totally change your model because the capabilities of the underlying systems didn't match at all.

TASK DETAIL
https://phabricator.wikimedia.org/T90119

REPLY HANDLER ACTIONS
Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign username.
EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Manybubbles Cc: Thompsonbry.systap, Smalyshev, Manybubbles, Aklapper, Haasepeter, Beebs.systap, daniel, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke, JanZerebecki ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues
Thompsonbry.systap added a comment.

There is support for inline UUIDs for blank nodes. See UUIDBNodeIV. You could also define a fully inline URI with a well-known prefix and a UUID.

Bryan

Bryan Thompson
Chief Scientist & Founder
SYSTAP, LLC
4501 Tower Road
Greensboro, NC 27410
br...@systap.com
http://bigdata.com
http://mapgraph.io
[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues
Manybubbles added a comment.

One of the things that @Thompsonbry.systap and I talked about yesterday was representing statements kind of like this:

    wd:Q23 wdt:P39 wd:Q11696 . -- Optional
    <<wd:Q23 wd:P39 wd:Q11696>>
        wdqual:P580 1789-4-30T00:00:00Z^^xs:dateTime ;
        wdqual:P582 1797-3-4T00:00:00Z^^xs:dateTime ;
        wdqual:P1365 wdo:no-value ;
        wdqual:P1366 wd:Q11806 ;
        wdref:P143 wd:Q328 ;
        wdo:rank normal, best .
    << <<wd:Q23 wd:P39 wd:Q11696>> wdqual:P580 1789-4-30^^xs:date >>
        wdo:precision day .

So the takeaways are:

1. Stuff rank as a reified property of the field and -either- filter on it at query time -or- have Wikidata dump out the truthy values as separate triples and use convention to support jumping from the truthy tuples to the non-truthy tuples. If we find that filtering is generally fast enough then we can add an AST rewrite to automatically add the filtering.
2. Stuff most of the value information as reification information on the statement itself, like I did with the date precision. It looks pretty wordy when it's on a qualifier, but it'd be less crazy looking.

A few more interesting points about RDR in BlazeGraph:

1. It's automatically used if you enable its property (statement identifiers or something) and you send data as standard triples that look like the RDF standard for reification. I haven't hunted down this code but if it works that'd be super cool. We'd simply have to dump the RDF data in normal-looking RDF and BlazeGraph can efficiently represent it. It's like RDR is just an optimization that can be transparently applied. This will require more investigation.
2. RDR can be nested, like in the example above.
3. RDR works by using the triple's bytewise representation as the subject, predicate, or object. Generally this doesn't require too much space.

Another interesting point:

1. BlazeGraph has vocabulary classes that allow it to efficiently represent certain URIs that are known up front. You can add entries to them, but if the new URI is already in the dictionary then bad things happen. So the usual way to use these is to name them BlahVocabularyV1, BlahVocabularyV2, etc. The old versions stick around in case you want to open up an old knowledge base. You can use the new versions by rebuilding the knowledge base.
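For reference, the "standard triples that look like the RDF standard for reification" could be produced along these lines. This is only a minimal sketch in Python: the rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are the standard reification vocabulary, but the http://example.org/ IRIs, the blank node label and the reify() helper are made up for illustration. Whether BlazeGraph turns this shape into RDR transparently is exactly the part that still needs investigation.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace, not the real Wikidata one


def iri(value):
    """Wrap a full IRI in angle brackets, N-Triples style."""
    return f"<{value}>"


def reify(stmt_node, subject, predicate, obj, annotations):
    """Return N-Triples lines for one statement in standard reified form.

    stmt_node is the node standing for the statement (here a blank node
    label); annotations is a list of (predicate, object) terms that are
    attached to the statement node, e.g. qualifiers or rank.
    """
    lines = [
        f"{stmt_node} {iri(RDF + 'type')} {iri(RDF + 'Statement')} .",
        f"{stmt_node} {iri(RDF + 'subject')} {subject} .",
        f"{stmt_node} {iri(RDF + 'predicate')} {predicate} .",
        f"{stmt_node} {iri(RDF + 'object')} {obj} .",
    ]
    for ann_predicate, ann_object in annotations:
        lines.append(f"{stmt_node} {ann_predicate} {ann_object} .")
    return lines


example = reify(
    "_:s1",
    iri(EX + "Q23"),
    iri(EX + "P39"),
    iri(EX + "Q11696"),
    [
        (iri(EX + "qual/P1366"), iri(EX + "Q11806")),
        (iri(EX + "rank"), '"normal"'),
    ],
)

if __name__ == "__main__":
    print("\n".join(example))
```

Note that one annotated statement costs four triples before any qualifier is attached, which is the wordiness that makes this form painful to query directly.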
[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues
Thompsonbry.systap added a comment.

You can use URIs instead of blank nodes. Most of the time when people use blank nodes they SHOULD be using URIs. Blank nodes are existential variables. Coin URIs if you want to have a reference.

Bryan Thompson
[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues
Manybubbles added a comment.

In https://phabricator.wikimedia.org/T90119#1065645, @Thompsonbry.systap wrote:

> You can use URIs instead of blank nodes. Most of the time when people use blank nodes they SHOULD be using URIs. Blank nodes are existential variables. Coin URIs if you want to have a reference.

We actually have natural URIs for this - statements have UUIDs. The thing is that those UUIDs are not useful at query time so they shouldn't make it into the index.
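To make the "well-known prefix plus UUID" idea concrete, here is a small Python sketch of coining a URI from a statement UUID and getting the UUID back out. The prefix and both function names are hypothetical, and this shows only the naming scheme, not how BlazeGraph would inline such a URI.

```python
import uuid

# Hypothetical well-known prefix; the real one would be chosen by Wikidata.
STATEMENT_PREFIX = "http://example.org/statement/"


def coin_statement_uri(statement_uuid):
    """Build a URI from the fixed prefix and a canonicalized UUID string."""
    canonical = str(uuid.UUID(statement_uuid))  # lowercase, hyphenated form
    return STATEMENT_PREFIX + canonical


def uuid_from_statement_uri(uri):
    """Recover the UUID; only the 128-bit value varies between such URIs."""
    if not uri.startswith(STATEMENT_PREFIX):
        raise ValueError("not a statement URI: " + uri)
    return uuid.UUID(uri[len(STATEMENT_PREFIX):])


if __name__ == "__main__":
    uri = coin_statement_uri("12345678-1234-5678-1234-567812345678")
    print(uri)
    print(uuid_from_statement_uri(uri).hex)
```

Because the prefix is fixed, a store only needs the UUID's 16 bytes to tell two such URIs apart, which is what makes this kind of URI a candidate for inlining.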
[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues
Manybubbles added a comment.

In https://phabricator.wikimedia.org/T90119#1065635, @Manybubbles wrote:

> In https://phabricator.wikimedia.org/T90119#1065623, @Thompsonbry.systap wrote:
>
>> The RDR inlining of reified statement models is handled by the StatementBuffer class. It is important to have a limited lexical scope in the dump for the different RDF triples involved in the reified statement model. The code needs to buffer incomplete statement models until they become complete statement models, at which point it can release the storage associated with the partial model and write it out.
>>
>> Also, if your output includes a lot of blank nodes, it is a Good Idea to have limited resolution scope for blank nodes since the parser must maintain them across the entire document. Thus, outputting an RDF dump as a series of files can reduce the parser overhead.
>
> Are blank nodes required for the RDR inlining? Is there any way in Turtle or N-Triples to allow blank nodes to go out of scope? I ask because we'll certainly be outputting the dump as a single large document - that is how our dumps work and fighting against that would be difficult. We can create a tool to slice it smaller if there isn't a standard way to control scope.

I should note that this buffering thing removes one of the nicest parts about N-Triples: you can no longer just slice it on any newline to generate batches. It's context-sensitive again. I should say I think that is a small price to pay for RDR inlining.
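The buffering described in the quoted comment can be sketched roughly as follows. This is not the StatementBuffer code, just a Python illustration of the idea under simple assumptions: triples arrive as plain (subject, predicate, object) strings, and a model counts as complete once its rdf:type, rdf:subject, rdf:predicate and rdf:object triples have all been seen. The ReificationBuffer class and its method names are invented for this sketch.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Maps each reification predicate to the slot it fills in a statement model.
SLOTS = {
    RDF + "subject": "s",
    RDF + "predicate": "p",
    RDF + "object": "o",
}


class ReificationBuffer:
    """Holds partial reified statement models until they are complete."""

    def __init__(self):
        self.pending = {}  # statement node -> partially filled model

    def add(self, node, predicate, value):
        """Feed one triple; return the (s, p, o) it completes, else None."""
        if predicate == RDF + "type" and value == RDF + "Statement":
            slot = "type"
        elif predicate in SLOTS:
            slot = SLOTS[predicate]
        else:
            return None  # not part of a reified statement model
        model = self.pending.setdefault(node, {})
        model[slot] = value
        if len(model) < 4:
            return None  # still incomplete, keep buffering
        del self.pending[node]  # release the storage for the partial model
        return (model["s"], model["p"], model["o"])


if __name__ == "__main__":
    buf = ReificationBuffer()
    stream = [
        ("_:a", RDF + "type", RDF + "Statement"),
        ("_:a", RDF + "subject", "wd:Q23"),
        ("_:a", RDF + "predicate", "wd:P39"),
        ("_:a", RDF + "object", "wd:Q11696"),
    ]
    for triple in stream:
        print(buf.add(*triple))
```

The size of pending is what limited lexical scope keeps small: if the four triples of a model sit next to each other in the dump, each model is released almost immediately, and a slicing tool only has to avoid cutting between them.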
[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues
Smalyshev added a comment.

Currently, we have two known issues with our RDF vs. BlazeGraph:

1. Date values (e.g. 13 billion BCE)
2. Geopoints notation