Re: [Wikidata-tech] Wikidata Query Backend Update (take two!)

Magnus Manske Fri, 06 Mar 2015 01:05:48 -0800

Yay progress! :-)

I'll try to keep WDQ alive until you have a production version
up-and-running. Don't take too long...


On Fri, Mar 6, 2015 at 9:02 AM Markus Krötzsch <
[email protected]> wrote:

> Hi,
>
> Thanks for all the work. I think this is a sensible decision. What
> confused me at first is that I did not know BlazeGraph (and when you
> google for it, the first thing is an unrelated sourceforge project). An
> important insight for me thus was that "BlazeGraph" is the project that
> has up until very recently been called "Bigdata", and as such is not the
> new, unknown project that I first thought it was.
>
> It seems clear that there are a few issues to address. In particular,
> among hundreds of known public SPARQL services [1], there does not seem
> to be one that identifies itself as using BlazeGraph/Bigdata. However,
> there is clearly potential here and it would be exciting to see the
> project maturing into a robust free RDF store and query engine.
>
> Cheers,
>
> Markus
>
> [1] http://sparqles.okfn.org/discoverability
>
> On 05.03.2015 19:49, Nikolas Everett wrote:
> > TL;DR: We've selected BlazeGraph to back the next Wikidata Query Service.
> >
> > After Titan evaporated about a month ago we went back to the drawing
> > board on back ends for a new Wikidata Query Service.  We took four weeks
> > (including a planned trip to Berlin) to settle on a backend.  As you can
> > see from the spreadsheet
> > <https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0>
> > we've really blown out the number of options.  As you can also see we
> > didn't finish filling them all out.  But we've still pretty much settled
> > on BlazeGraph <http://www.blazegraph.com/> anyway.  Let me first explain
> > what BlazeGraph is and then defend our decision to stop spreadsheet work.
> >
> > BlazeGraph is a GPLed RDF triple store that natively supports SPARQL
> > 1.1, RDFS, some OWL, and some extensions.  Those are all semantic web
> > terms and they translate into "it's a graph database with an
> > expressive, mostly standardized query language and support for inferring
> > stuff as data is added to and removed from the graph".  It also has some
> > features that you'd recognize from nice relational databases: join order
> > rewriting, smart query planner, hash and nested loop joins,  query
> > rewrite rules, group by, order by, and aggregate functions.
> >
> > These are all cool features - really the kind of things that we thought
> > we needed - but they come with an "interesting" price.  Semantic Web is a
> > very old thing that's had a really odd degree of success.  If you have
> > an hour and a half, Jim Hendler can explain
> > <https://www.youtube.com/watch?v=oKiXpO2rbJM> it to you.  The upshot is
> > that _tons_ of people have _tons_ of opinions.  The W3C standardizes
> > RDF, SPARQL, RDFS, OWL, and about a billion other things.  There are
> > (mostly non-W3C) standards for talking about people
> > <http://xmlns.com/foaf/spec/>, social connections
> > <http://rdfs.org/sioc/spec/>, and music
> > <http://musicontology.com/specification/>. And they all have rules.  And
> > Wikidata doesn't.  Not like these rules.  One thing I've learned from
> > this project is that this lack of prescribed rules is one of Wikidata's
> > founding principles.  It's worth it to allow openness.  So you _can_ set
> > gender to "Bacon" or put GeoCoordinates on Amber
> > <https://www.wikidata.org/wiki/Q1053330>.  Anyway!  I argue that, at
> > least for now, we should ignore many of these standards.  We need to
> > think of Wikidata Query Service as a tool to answer questions instead of
> > as some grand statement about the semantic web.  Mapping existing
> > ontologies onto Wikidata is a task for another day.
> >
> > I feel like these semantic web technologies and BlazeGraph in particular
> > are good fits for this project mostly because the quality of our "but
> > what about X?" questions is very very high.  "How much inference should
> > we do instead of query rewriting?" instead of "Can we do inference?  Can
> > we do query rewriting?"  And "Which standard vocabularies should we think
> > about mapping to Wikidata?"  Holy cow!  In any other system there aren't
> > "standard vocabularies" to even talk about mapping, much less a
> > mechanism for mapping them.  Much less two!  It's almost an overwhelming
> > wealth and, as I allude to above, it can be easy to bikeshed.
> >
> > We've been reasonably careful to reach out to people we know are familiar
> > with this space.  We're well aware of projects like the Wikidata Toolkit
> > and its RDF exports.  We've been using those for testing.  We've talked
> > to so many people about so many things.  It's really consumed a lot more
> > time than I'd expected and made the search for the next backend very
> > long.  But I feel comfortable that we're in a good place.  We don't know
> > all the answers but we're sure there _are_ answers.
> >
> > The BlazeGraph upstream has been super active with us.  They've spent
> > hours with us over hangouts, had me out to their office (a house an hour
> > and a half from mine) to talk about data modeling, and spent a ton of time
> > commenting on Phabricator tickets.  They've offered to donate a formal
> > support agreement as well.  And to get together with us about writing
> > any features we might need to add to BlazeGraph.  And they've added me
> > as a committer (I told them I had some typos to fix but I have yet to
> > actually commit them).  And their code is well documented.
> >
> > So by now you've realized I'm a fan.  I believe that we should stop on
> > the spreadsheet and just start work against BlazeGraph because I think
> > we have phenomenal momentum with upstream.  And it's a pretty clear
> > winner on the spreadsheet at this point.  But there are two other triple
> > stores which we haven't fully filled out that might be viable: OpenLink
> > Virtuoso Open Source and Apache Jena.  Virtuoso is open core so I'm
> > really loath to go too deep into it at this point.  Their HA features are
> > not open source which implies that we'd have trouble with them as an
> > upstream.  Apache Jena just isn't known
> > <http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29> to scale
> > to data as large as BlazeGraph and Virtuoso.  So I argue that these are
> > systems that, in the unlikely event that BlazeGraph goes the way of
> > Titan, we should start our third round of investigation against.  As it
> > stands now I think we have a winner.
> >
> > We created a phabricator task <https://phabricator.wikimedia.org/T90101>
> > with lots of children to run down our remaining questions.  The biggest
> > remaining questions revolve around three areas:
> > 1. Operational issues like "how should the cluster be deployed?" "do we
> > use HA at all?" "how are rolling restarts done in HA?"
> > 2. How should we represent the data in the database? BlazeGraph (and
> > only BlazeGraph) has an extension that we *could* use called RDR.  Should we
> > use it?
> > 3. Some folks have identified update rate as a risk.  Not upstream, but
> > others familiar with triple stores in general.
> >
> >
> > Our plan is to work on #2 over the next weeks because it really informs
> > #1 because there are lots of working set size vs cpu time tradeoffs to
> > investigate.  We'll start on #1 shortly as well.  #3 is a potential risk
> > area so we'll be sure to investigate it soon.
> >
> > I admit I'm not super happy to leave the spreadsheet in its
> > current unfilled-out state but I'm excited to have something to work
> > with and think it's the right thing to do right now.
> >
> > So thanks for reading all of this.  Please reply with comments.
> >
> > Thanks again,
> >
> > Nik
> >
> >
> > _______________________________________________
> > Wikidata-tech mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikidata-tech