Yay progress! :-) I'll try to keep WDQ alive until you have a production version up-and-running. Don't take too long...
On Fri, Mar 6, 2015 at 9:02 AM Markus Krötzsch < [email protected]> wrote:
> Hi,
>
> Thanks for all the work. I think this is a sensible decision. What
> confused me at first is that I did not know BlazeGraph (and when you
> google for it, the first thing is an unrelated SourceForge project). An
> important insight for me thus was that "BlazeGraph" is the project that
> has up until very recently been called "Bigdata", and as such is not the
> new, unknown project that I first thought it was.
>
> It seems clear that there are a few issues to address. In particular,
> among hundreds of known public SPARQL services [1], there does not seem
> to be one that identifies itself as using BlazeGraph/Bigdata. However,
> there is clearly potential here and it would be exciting to see the
> project maturing into a robust free RDF store and query engine.
>
> Cheers,
>
> Markus
>
> [1] http://sparqles.okfn.org/discoverability
>
> On 05.03.2015 19:49, Nikolas Everett wrote:
> > TL;DR: We've selected BlazeGraph to back the next Wikidata Query Service.
> >
> > After Titan evaporated about a month ago we went back to the drawing
> > board on back ends for a new Wikidata Query Service. We took four weeks
> > (including a planned trip to Berlin) to settle on a backend. As you can
> > see from the spreadsheet
> > <https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0>
> > we've really blown out the number of options. As you can also see we
> > didn't finish filling them all out. But we've still pretty much settled
> > on BlazeGraph <http://www.blazegraph.com/> anyway. Let me first explain
> > what BlazeGraph is and then defend our decision to stop spreadsheet work.
> >
> > BlazeGraph is a GPLed RDF triple store that natively supports SPARQL
> > 1.1, RDFS, some OWL, and some extensions.
> > Those are all semantic web
> > terms and they translate into "it's a graph database with an
> > expressive, mostly standardized query language and support for inferring
> > stuff as data is added to and removed from the graph". It also has some
> > features that you'd recognize from nice relational databases: join order
> > rewriting, smart query planner, hash and nested loop joins, query
> > rewrite rules, group by, order by, and aggregate functions.
> >
> > These are all cool features - really the kind of things that we thought
> > we needed but they come with an "interesting" price. Semantic Web is a
> > very old thing that's had a really odd degree of success. If you have
> > an hour and a half Jim Hendler can explain
> > <https://www.youtube.com/watch?v=oKiXpO2rbJM> it to you. The upshot is
> > that _tons_ of people have _tons_ of opinions. The W3C standardizes
> > RDF, SPARQL, RDFS, OWL, and about a billion other things. There are
> > (mostly non-W3C) standards for talking about people
> > <http://xmlns.com/foaf/spec/>, social connections
> > <http://rdfs.org/sioc/spec/>, and music
> > <http://musicontology.com/specification/>. And they all have rules. And
> > Wikidata doesn't. Not like these rules. One thing I've learned from
> > this project is that this lack of prescribed rules is one of Wikidata's
> > founding principles. It's worth it to allow openness. So you _can_ set
> > gender to "Bacon" or put geocoordinates on Amber
> > <https://www.wikidata.org/wiki/Q1053330>. Anyway! I argue that, at
> > least for now, we should ignore many of these standards. We need to
> > think of Wikidata Query Service as a tool to answer questions instead of
> > as some grand statement about the semantic web. Mapping existing
> > ontologies onto Wikidata is a task for another day.
> >
> > I feel like these semantic web technologies and BlazeGraph in particular
> > are good fits for this project mostly because the quality of our "but
> > what about X?" questions is very, very high. "How much inference should
> > we do instead of query rewriting?" instead of "Can we do inference? Can
> > we do query rewriting?" And "Which standard vocabularies should we think
> > about mapping to Wikidata?" Holy cow! In any other system there aren't
> > "standard vocabularies" to even talk about mapping, much less a
> > mechanism for mapping them. Much less two! It's almost an overwhelming
> > wealth and as I alluded to above it can be easy to bikeshed.
> >
> > We've been reasonably careful to reach out to people we know are familiar
> > with this space. We're well aware of projects like the Wikidata Toolkit
> > and its RDF exports. We've been using those for testing. We've talked
> > to so many people about so many things. It's really consumed a lot more
> > time than I'd expected and made the search for the next backend very
> > long. But I feel comfortable that we're in a good place. We don't know
> > all the answers but we're sure there _are_ answers.
> >
> > The BlazeGraph upstream has been super active with us. They've spent
> > hours with us over Hangouts, had me out to their office (a house an hour
> > and a half from mine) to talk about data modeling, and spent a ton of time
> > commenting on Phabricator tickets. They've offered to donate a formal
> > support agreement as well. And to get together with us about writing
> > any features we might need to add to BlazeGraph. And they've added me
> > as a committer (I told them I had some typos to fix but I have yet to
> > actually commit them). And their code is well documented.
> >
> > So by now you've realized I'm a fan. I believe that we should stop work on
> > the spreadsheet and just start work against BlazeGraph because I think
> > we have phenomenal momentum with upstream. And it's a pretty clear
> > winner on the spreadsheet at this point. But there are two other triple
> > stores which we haven't fully filled out that might be viable: OpenLink
> > Virtuoso Open Source and Apache Jena.
> > Virtuoso is open core so I'm
> > really loath to go too deep into it at this point. Their HA features are
> > not open source which implies that we'd have trouble with them as an
> > upstream. Apache Jena just isn't known
> > <http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29> to scale
> > to data as large as BlazeGraph and Virtuoso can handle. So I argue that
> > these are systems that, in the unlikely event that BlazeGraph goes the
> > way of Titan, we should start our third round of investigation against.
> > As it stands now I think we have a winner.
> >
> > We created a Phabricator task <https://phabricator.wikimedia.org/T90101>
> > with lots of children to run down our remaining questions. The biggest
> > remaining questions revolve around three areas:
> > 1. Operational issues like "how should the cluster be deployed?" "do we
> > use HA at all?" "how are rolling restarts done in HA?"
> > 2. How should we represent the data in the database? BlazeGraph (and
> > only BlazeGraph) has an extension that we *could* use called RDR. Should
> > we use it?
> > 3. Some folks have identified update rate as a risk. Not upstream, but
> > others familiar with triple stores in general.
> >
> >
> > Our plan is to work on #2 over the next weeks because it really informs
> > #1: there are lots of working-set-size vs. CPU-time tradeoffs to
> > investigate. We'll start on #1 shortly as well. #3 is a potential risk
> > area so we'll be sure to investigate it soon.
> >
> > I admit I'm not super happy to leave the spreadsheet in its
> > current unfilled-out state but I'm excited to have something to work
> > with and think it's the right thing to do right now.
> >
> > So thanks for reading all of this. Please reply with comments.
> >
> > Thanks again,
> >
> > Nik
> >
> > _______________________________________________
> > Wikidata-tech mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
