So9q added a comment.
🤩 big thanks for sharing this!

In T289561#7393732 <https://phabricator.wikimedia.org/T289561#7393732>, @Hannah_Bast wrote:

> We looked a bit into Apache Rya. A couple of observations:
>
> 1. The instructions on https://github.com/apache/rya are a mess. Compiling the code requires an old version of the JDK (version 8), which is documented nowhere and took us some time to find out. Compilation takes forever. The instructions for getting a working Rya server are cryptic, mentioning all kinds of other libraries and projects, but without instructions on how exactly to install them. Loading the data also seems to be non-trivial: you have to write code for this. It's certainly all doable, but this does not look like a well-maintained project.

I'm sorry to hear that. I wrote the last committer a while back and have yet to receive a response. Not a good sign.

> 2. We had a look at the 2012 paper https://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf (which is well cited) and the 2017 slides https://events.static.linuxfound.org/sites/events/files/slides/Rya_ApacheBigData_20170518.pdf . The slides are in sync with what is written in the paper, and they are very instructive for understanding how the engine works. It also looks to me like they describe the current state of Rya (that is, there have not been any major changes to the basic architecture since then).
>
> 3. The underlying data store (Accumulo or MongoDB) is used only for storing the raw data (the triples). The actual operations on this data (like the JOIN operations, which are central for processing SPARQL queries) are done by the Rya code. This makes sense, because a NoSQL store like MongoDB does not support JOIN operations; that's just not what it's made for.
>
> 4. The basic principle of Rya's JOIN operations is explained on slide 15 of the presentation, and variations of it on slides 16, 18, 31, and 32.
> The basic principle is to start with the most selective triple, consider the set of matching entities for that triple (which is hopefully small), and then look up each of these (hopefully few) entities in the appropriate index.
>
> 5. This principle is efficient only when you have at least one highly selective triple in your SPARQL query. In the paper mentioned above, Rya is evaluated on the Lehigh University Benchmark (LUBM), which is a well-known but rather old benchmark with rather special queries. Namely, all queries have at least one very selective triple, typically of the kind "variable <type> <some fixed type>". There is not a single query with a triple for the <type> predicate where the object is also a variable.
>
> 6. When you don't have a highly selective triple, Rya is bound to be slow, because it then has to deal with very large sets of entities, which it will look up one by one. Also, Rya is not really made to be particularly efficient on a single machine. Its main purpose is to be efficient when distributed over several machines. We have already discussed that it does not make sense to distribute a moderate-sized dataset like Wikidata over several machines when you can easily process it on a single machine. Distributing a dataset always incurs a large performance overhead (because you need to send data back and forth between different machines during query processing), and you only do it when you have to.

Interesting, I thought Wikidata was getting too big for one machine, but I might have misunderstood the WMF operations team and the statements in the tickets surrounding Blazegraph. Wikidata could easily triple in the number of triples within a year if all horses are let loose and people start importing all scientific papers, books, and chemicals in Wikipedia, plus all the authors associated with those.

> 7. Rya's performance bottleneck is actually very similar to that of Blazegraph.
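As a sanity check on my understanding of the lookup principle in points 4-6, here is a toy Python sketch of such an index lookup join over a handful of triples. All names and data are illustrative only, not Rya's actual API or storage layout:

```python
# Toy sketch of the lookup join described above: evaluate the most
# selective pattern first, then probe the index once per candidate.
# Data and names are illustrative only, not Rya's actual code.

triples = {
    ("alice", "type", "GraduateStudent"),
    ("bob",   "type", "GraduateStudent"),
    ("alice", "memberOf", "dept1"),
    ("carol", "memberOf", "dept1"),
}

def subjects_matching(p, o):
    """Index scan: all subjects s with a triple (s, p, o) -- like a PO-keyed lookup."""
    return {s for (s, pp, oo) in triples if (pp, oo) == (p, o)}

# Query: ?x type GraduateStudent . ?x memberOf dept1
# Step 1: start with the most selective pattern (small candidate set).
candidates = subjects_matching("type", "GraduateStudent")  # {'alice', 'bob'}

# Step 2: one point lookup per candidate -- fast only if step 1 was selective.
result = sorted(x for x in candidates if (x, "memberOf", "dept1") in triples)
print(result)  # ['alice']
```

If the first pattern is not selective (say, `?x <type> ?y`), step 2 degenerates into one lookup per entity in the whole store, which is exactly the slow case described in point 6.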
> When you look at the many example queries for the WDQS on https://query.wikidata.org , almost none of them require the computation of a large intermediate result, for the simple reason that such queries don't work well with Blazegraph (they take forever or time out). Large intermediate results occur either when your query has no single very selective triple, or when there is no LIMIT, or when the LIMIT is preceded by an ORDER BY or GROUP BY (so that you have to compute a large intermediate result before you can LIMIT it to the top-ranked items).

Interesting! I was unaware of this, but it makes sense from my interactions with Blazegraph.

> In summary, Rya does not look like a good choice for several reasons, most notably: it is not well-maintained, it is efficient only for quite particular kinds of queries, and it has similar performance bottlenecks as Blazegraph.

Big thanks for taking the time to look into this. Rya was the least bad choice IMO until I read your insights.
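The ORDER BY/LIMIT point can be illustrated with a small Python sketch (illustrative only, nothing Rya- or Blazegraph-specific): a bare LIMIT lets the engine stop scanning early, while an ORDER BY before the LIMIT forces every matching row to be produced before the top-ranked items are known.

```python
# Why "ORDER BY ... LIMIT k" is expensive while a bare "LIMIT k" is cheap.
# scan() stands in for the matches of a non-selective triple pattern.
import heapq
from itertools import islice

def scan():
    """Simulate streaming one million matches of a non-selective pattern."""
    yield from range(1_000_000)

# LIMIT 5: the engine can stop after five rows; nothing large is materialized.
first_five = list(islice(scan(), 5))  # [0, 1, 2, 3, 4]

# ORDER BY x DESC LIMIT 5: all one million rows must be seen before the
# top five are known -- the large intermediate result described above.
top_five = heapq.nlargest(5, scan())  # [999999, 999998, 999997, 999996, 999995]

print(first_five, top_five)
```

The second case still only keeps five rows in memory (a top-k heap), but it cannot avoid enumerating the full intermediate result, which is where the time goes.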
_______________________________________________
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org