[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues

2015-02-27 Thread Manybubbles
Manybubbles added a comment.

Adding this here for posterity:  It seems like the primary objection to RDR is 
vendor lock-in.  It's a BlazeGraph-specific thing and would have to be 
reimplemented if we had to go someplace else.  You *can* trigger it 
automatically using standard RDF reification syntax _but_ that syntax is 
deprecated and painful to query.

I argue that SPARQL itself is worse from a vendor lock-in perspective than RDR. 
There is _no_ chance that we'll want to reimplement SPARQL ourselves and we 
only have five open source options that support it:

1. BlazeGraph
2. Virtuoso OpenSource
3. Apache Jena
4. GraphSail on top of Gremlin on top of some other graph database
5. 4store

#4 is unlikely to be efficient, given all the layers of abstraction.  There is 
a chance it'll work but it's not very high.
Apache Jena doesn't scale nearly as well as Virtuoso or BlazeGraph according to 
http://www.w3.org/wiki/LargeTripleStores .
4store hasn't seen much development in a long, long time.

That means we're locked to either Virtuoso or BlazeGraph anyway.

So my feeling is that exposing RDR to our users isn't _that_ bad.  If we have 
to take it away one day, our options are to reimplement the syntax, deprecate 
it and drop support for it, or replace it with some Wikidata-specific syntax.  
I think this is OK, especially when you compare it to the craziness we were 
willing to put up with on top of Gremlin, where switching out the graph 
backend required you to totally change your model based on the totally 
mismatching capabilities of the underlying system.


TASK DETAIL
  https://phabricator.wikimedia.org/T90119

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Manybubbles
Cc: Thompsonbry.systap, Smalyshev, Manybubbles, Aklapper, Haasepeter, 
Beebs.systap, daniel, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke, 
JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues

2015-02-25 Thread Thompsonbry.systap
Thompsonbry.systap added a comment.

There is support for inline UUIDs for blank nodes.  See UUIDBNodeIV.  You
could also define a fully inline URI with a well-known prefix and a UUID.
Bryan
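
A minimal sketch of the "well-known prefix plus UUID" idea, in Python. The prefix below and the function name `statement_uri` are hypothetical, chosen only for illustration; how BlazeGraph would actually inline such a URI is not shown here.

```python
import uuid

# Hypothetical prefix, used only for illustration.
STATEMENT_PREFIX = "http://www.wikidata.org/entity/statement/"

def statement_uri(guid, prefix=STATEMENT_PREFIX):
    """Build a URI from a well-known prefix and a UUID string.

    uuid.UUID() raises ValueError if guid is not a valid UUID, so
    malformed identifiers are rejected instead of being minted.
    """
    parsed = uuid.UUID(guid)
    # str() normalizes to the lowercase, hyphenated form.
    return prefix + str(parsed)
```

Because the prefix is fixed, only the 128-bit UUID part varies from one URI to the next, which is what would make a compact inline representation plausible.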



Bryan Thompson
Chief Scientist & Founder
SYSTAP, LLC
4501 Tower Road
Greensboro, NC 27410
br...@systap.com
http://bigdata.com
http://mapgraph.io

CONFIDENTIALITY NOTICE:  This email and its contents and attachments are
for the sole use of the intended recipient(s) and are confidential or
proprietary to SYSTAP. Any unauthorized review, use, disclosure,
dissemination or copying of this email or its contents or attachments is
prohibited. If you have received this communication in error, please notify
the sender by reply email and permanently delete all copies of the email
and its contents and attachments.




[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues

2015-02-25 Thread Manybubbles
Manybubbles added a comment.

One of the things that @Thompsonbry.systap and I talked about yesterday was 
representing statements kind of like this:

  wd:Q23 wdt:P39 wd:Q11696 . -- Optional
  wd:Q23 wd:P39 wd:Q11696
      wdqual:P580 1789-4-30T00:00:00Z^^xs:dateTime ;
      wdqual:P582 1797-3-4T00:00:00Z^^xs:dateTime ;
      wdqual:P1365 wdo:no-value ;
      wdqual:P1366 wd:Q11806 ;
      wdref:P143 wd:Q328 ;
      wdo:rank normal, best .
  wd:Q23 wd:P39 wd:Q11696 wdqual:P580 1789-4-30^^xs:date
      wdo:precision day .

So the takeaways are:

1. Stuff rank as a reified property of the statement and -either- filter on it 
at query time -or- have Wikidata dump out the truthy values as separate triples 
and use convention to support jumping from the truthy triples to the non-truthy 
ones.  If we find that filtering is generally fast enough then we can add an 
AST rewrite to automatically add the filtering.
2. Stuff most of the value information as reification information on the 
statement itself, like I did with the date precision.  It looks pretty wordy 
when it's on a qualifier but it'd be less crazy looking.
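
The "dump out the truthy values" option in point 1 can be sketched in Python. This is only an illustration under an assumed rule: for each subject/property pair, preferred-rank statements win if there are any, otherwise normal-rank ones are used, and deprecated ones are never truthy. The `Statement` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass(frozen=True)
class Statement:
    subject: str
    prop: str
    value: str
    rank: str  # "preferred", "normal", or "deprecated"

def truthy(statements):
    """Return the best-rank statements for each (subject, property) pair."""
    def key(s):
        return (s.subject, s.prop)

    result = []
    for _, group in groupby(sorted(statements, key=key), key=key):
        group = list(group)
        preferred = [s for s in group if s.rank == "preferred"]
        normal = [s for s in group if s.rank == "normal"]
        # Assumed rule: preferred wins, else normal; deprecated is dropped.
        result.extend(preferred or normal)
    return result

def to_truthy_triple(s):
    """Render a statement as a plain wdt: triple, as in the example above."""
    return f"wd:{s.subject} wdt:{s.prop} wd:{s.value} ."
```

With truthy triples precomputed like this, simple queries need no rank filter at all; the cost is extra triples in the dump.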

A few more interesting points about RDR in BlazeGraph:

1. It's automatically used if you enable its property (statement identifiers or 
something) and you send data as standard triples that look like the RDF 
standard for reification.  I haven't hunted down this code but if it works 
that'd be super cool.  We'd simply have to dump the RDF data in normal looking 
RDF and BlazeGraph can efficiently represent it.  It's like RDR is just an 
optimization that can be transparently applied.  This will require more 
investigation.
2. RDR can be nested like in the example above.
3. RDR works by using the triple's bytewise representation as the subject, 
predicate, or object.  Generally this doesn't require too much space.
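
For reference, the "RDF standard for reification" mentioned in point 1 describes one statement with four triples: an `rdf:type rdf:Statement` triple plus `rdf:subject`, `rdf:predicate`, and `rdf:object`. A minimal Python sketch that emits them as N-Triples lines follows. The function name `reify` and the example URIs in the usage note are illustrative assumptions, and whether BlazeGraph's loader turns exactly this shape into RDR still needs the investigation mentioned above.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def reify(subject, predicate, obj, node):
    """Return the four standard reification triples as N-Triples lines.

    subject, predicate, and obj are absolute URIs without angle brackets;
    node is the term naming the statement, e.g. a blank node like "_:s1".
    """
    return [
        f"{node} <{RDF}type> <{RDF}Statement> .",
        f"{node} <{RDF}subject> <{subject}> .",
        f"{node} <{RDF}predicate> <{predicate}> .",
        f"{node} <{RDF}object> <{obj}> .",
    ]
```

Calling `reify("http://www.wikidata.org/entity/Q23", "http://www.wikidata.org/entity/P39", "http://www.wikidata.org/entity/Q11696", "_:s1")` gives the four lines for the example statement; qualifiers would then be additional triples with `_:s1` as their subject.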

Another interesting point:

1. BlazeGraph has vocabulary classes that allow it to efficiently represent 
certain URIs that are known up front.  You can add entries to them but if the 
new URI is already in the dictionary then bad things happen.  So the usual way 
to use these is to name them BlahVocabularyV1, BlahVocabularyV2, etc.  The old 
versions stick around in case you want to open up an old knowledge base.  You 
can use the new versions by rebuilding the knowledge base.






[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues

2015-02-25 Thread Thompsonbry.systap
Thompsonbry.systap added a comment.

You can use URIs instead of blank nodes.  Most of the time when people use
blank nodes they SHOULD be using URIs.  Blank nodes are existential
variables.  Coin URIs if you want to have a reference.







[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues

2015-02-25 Thread Manybubbles
Manybubbles added a comment.

In https://phabricator.wikimedia.org/T90119#1065645, @Thompsonbry.systap wrote:

 You can use URIs instead of blank nodes.  Most of the time when people use
  blank nodes they SHOULD be using URIs.  Blank nodes are existential
  variables.  Coin URIs if you want to have a reference.


We actually have natural URIs for this - statements have UUIDs.  The thing is 
that those UUIDs are not useful at query time so they shouldn't make it into 
the index.




[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues

2015-02-25 Thread Manybubbles
Manybubbles added a comment.

In https://phabricator.wikimedia.org/T90119#1065635, @Manybubbles wrote:

 In https://phabricator.wikimedia.org/T90119#1065623, @Thompsonbry.systap 
 wrote:

  The RDR inlining of reified statement models is handled by the 
  StatementBuffer class.   It is important to have a limited lexical scope in 
  the dump for the different RDF triples involved in the reified statement 
  model.  The code needs to buffer incomplete statement models until they 
  become complete statement models, at which point it can release the storage 
  associated with the partial model and write it out.  Also, if your output 
  includes a lot of blank nodes, it is a Good Idea to have limited resolution 
  scope for blank nodes since the parser must maintain them across the entire 
  document. Thus, outputting an RDF dump as a series of files can reduce the 
  parser overhead.


 Are blank nodes required for the RDR inlining?  Is there any way in Turtle 
 or N-Triples to allow blank nodes to go out of scope?  I ask because we'll 
 certainly be outputting the dump as a single large document - that is how our 
 dumps work and fighting against that would be difficult.  We can create a 
 tool to slice it smaller if there isn't a standard way to control scope.

 I should note that this buffering thing removes one of the nicest parts about 
 N-Triples: you can no longer just slice it on any new line to generate 
 batches.  It's context-sensitive again.


I should say I think that is a small price to pay for RDR inlining.
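
A rough Python sketch of the kind of slicing tool mentioned in the quoted comment. It assumes the dump writes all triples that share a subject (for example, the node naming one reified statement) on consecutive lines, and it only ends a batch where the subject changes. The names `subject_of` and `batches` are hypothetical, and nested or multi-subject statement models would need a smarter grouping rule.

```python
def subject_of(line):
    """Return the subject term of an N-Triples line.

    N-Triples subjects (IRIs or blank nodes) contain no whitespace,
    so the first whitespace-delimited token is the subject.
    """
    return line.split(None, 1)[0]

def batches(lines, max_lines):
    """Yield lists of lines, cutting only where the subject changes.

    A batch may exceed max_lines so that a run of lines with the same
    subject is never split across two batches.
    """
    batch, current = [], None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subj = subject_of(line)
        if batch and subj != current and len(batch) >= max_lines:
            yield batch
            batch = []
        batch.append(line)
        current = subj
    if batch:
        yield batch
```

The design trades exact batch sizes for safety: a batch can run long, but a statement model that follows the contiguity assumption is never cut in half.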




[Wikidata-bugs] [Maniphest] [Commented On] T90119: BlazeGraph Finalization: RDF Issues

2015-02-20 Thread Smalyshev
Smalyshev added a comment.

Currently, we have two known issues with our RDF vs. BlazeGraph:

1. Date values (e.g. 13 billion BCE)
2. Geopoint notation

