So9q added a comment.

  🤩 big thanks for sharing this!
  
  In T289561#7393732 <https://phabricator.wikimedia.org/T289561#7393732>, 
@Hannah_Bast wrote:
  
  > We looked a bit into Apache Rya. A couple of observations:
  >
  > 1. The instructions on https://github.com/apache/rya are a mess. Compiling 
the code requires an old version of the JDK (version 8), which is documented 
nowhere and took us some time to figure out. Compilation takes forever. The 
instructions concerning getting a working Rya server are cryptic, mentioning 
all kinds of other libraries and projects, but without instructions on how 
exactly to install them. Loading the data also seems to be non-trivial: you 
have to write code for this. It's certainly all doable, but this does not look 
like a well-maintained project.
  
  I'm sorry to hear that. I wrote the last committer a while back and have yet 
to receive a response. Not a good sign.
  
  > 2. We had a look at the 2012 paper 
https://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf (which is well 
cited) and the 2017 slides 
https://events.static.linuxfound.org/sites/events/files/slides/Rya_ApacheBigData_20170518.pdf
 . The slides are in sync with what is written in the paper, and they are very 
instructive in understanding how the engine works. It also looks to me like 
they describe the current state of Rya (that is, there have not been any major 
changes to the basic architecture since then).
  >
  > 3. The underlying data store (Accumulo or MongoDB) is used only for storing 
the raw data (the triples). The actual operations on this data (like the JOIN 
operations, which are central for processing SPARQL queries) are done by the 
Rya code. This makes sense because a NoSQL store like MongoDB does not support 
JOIN operations, that's just not what it's made for.
  >
  > 4. The basic principle of Rya JOIN operations is explained on slide 15 of 
the presentation, and variations of it on slides 16, 18, 31, and 32. The basic 
principle is to start with the most selective triple, consider the set of 
matching entities for that triple (which is hopefully small) and then look up 
each of these (hopefully few) entities in the appropriate index.
  >
  > 5. This principle is efficient only when you have at least one highly 
selective triple in your SPARQL query. In the paper mentioned above, Rya is 
evaluated on the Lehigh University Benchmark (LUBM), which is a well-known but 
rather old benchmark with quite special queries. Namely, all queries have at 
least one very selective triple, typically of the kind "variable <type> <some 
fixed type>". There is not a single query with a triple for the <type> 
predicate where the object is also a variable.
  >
  > 6. When you don't have a selective triple, Rya is bound to be slow because 
it then has to deal with very large sets of entities, which it will look up 
one by one. Also, Rya is not really made to be particularly efficient on a 
single machine. Its main purpose is to be efficient when distributed over 
several machines. We have already discussed that it does not make sense to 
distribute a moderate-sized dataset like Wikidata over several machines when 
you can easily process it on a single machine. Distributing a dataset always 
incurs a large performance overhead (because you need to send data back and 
forth between different machines during query processing) and you only do it 
when you have to.
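
  To make sure I understand the join principle in point 4 (and why the 
selectivity discussed in points 5 and 6 matters so much), I wrote the rough 
sketch below. It is plain Python, not Rya code; the triples, the names and the 
linear scan (which stands in for Rya's permuted SPO/POS/OSP index lookups in 
Accumulo) are my own simplifications:

    # Evaluate the most selective triple pattern first, then extend each of
    # its (hopefully few) bindings by probing the remaining patterns one at
    # a time.
    TRIPLES = [
        ("alice", "type",     "Professor"),
        ("alice", "worksFor", "uniA"),
        ("bob",   "type",     "Student"),
        ("bob",   "advisor",  "alice"),
    ]

    def match(pattern, binding=None):
        """Bindings for one triple pattern, e.g. ('?x', 'type', 'Professor')."""
        binding = binding or {}
        pattern = tuple(binding.get(t, t) for t in pattern)  # substitute known vars
        out = []
        for triple in TRIPLES:  # a real engine would hit the right index here
            new = dict(binding)
            if all(new.setdefault(t, v) == v if t.startswith("?") else t == v
                   for t, v in zip(pattern, triple)):
                out.append(new)
        return out

    def join(patterns):
        # Start with the most selective pattern (fewest matches) ...
        patterns = sorted(patterns, key=lambda p: len(match(p)))
        bindings = match(patterns[0])
        # ... then look up each partial binding in the next pattern.
        for pat in patterns[1:]:
            bindings = [b2 for b in bindings for b2 in match(pat, b)]
        return bindings

    # "Whose advisor is a Professor?"  ->  [{'?x': 'bob', '?y': 'alice'}]
    print(join([("?x", "advisor", "?y"), ("?y", "type", "Professor")]))

  If the first pattern already binds millions of entities, the loop in join() 
degenerates into millions of individual lookups, which is exactly the slowdown 
described in point 6.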
  
  Interesting, I thought Wikidata was getting too big for one machine, but I 
might have misunderstood the WMF operations team and the statements in the 
tickets surrounding BG.
  
  Wikidata could easily triple in the number of triples within a year if all 
horses are let loose and people start importing all scientific papers, books 
and chemicals in Wikipedia and all the authors associated with those.
  
  > 7. Rya's performance bottleneck is actually very similar to that of 
Blazegraph. When you look at the many example queries for the WDQS on 
https://query.wikidata.org , almost none of them require the computation of a 
large intermediate result, for the simple reason that such queries don't work 
well with Blazegraph (they take forever or time out). Large intermediate 
results occur either when there is no single very selective triple in your 
query, or when there is no LIMIT, or when the LIMIT is preceded by an ORDER BY or 
GROUP BY (so that you have to compute a large intermediate result before you 
can LIMIT it to the top-ranked items).
  
  Interesting! I was unaware of this, but it makes sense from my interactions 
with BG.
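
  To convince myself of the LIMIT point, here is another toy sketch. Again, 
this is just my own Python illustration, not how Blazegraph or Rya actually 
implements anything; scan() is a made-up stand-in for a non-selective triple 
pattern:

    import heapq
    from itertools import islice

    def scan():
        # Stand-in for a triple pattern that matches a million rows.
        for i in range(1_000_000):
            yield {"?item": f"Q{i}", "?score": (i * 37) % 1000}

    # LIMIT 10 without ORDER BY: the engine can stop after ten rows.
    first_ten = list(islice(scan(), 10))

    # ORDER BY ?score DESC LIMIT 10: every row has to be produced and compared
    # before the top ten are known, so the whole intermediate result gets
    # materialized (a heap avoids a full sort, but not the full scan).
    top_ten = heapq.nlargest(10, scan(), key=lambda row: row["?score"])

    print(len(first_ten), len(top_ten))  # both print 10, at very different cost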
  
  > In summary, Rya does not look like a good choice for several reasons, most 
notably: it is not well maintained, it is efficient only for quite particular 
kinds of queries, and it has performance bottlenecks similar to those of 
Blazegraph.
  
  Big thanks for taking the time to look into this. Rya seemed like the least 
bad choice IMO until I read your insights.

TASK DETAIL
  https://phabricator.wikimedia.org/T289561
