[Wikidata-bugs] [Maniphest] T289561: Evaluate Apache Rya as alternative to Blazegraph

2022-04-22 Thread MPhamWMF
MPhamWMF added a comment.


  I'm closing this task, as Rya is not a shortlist candidate to replace 
Blazegraph, based on 
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/WDQS_backend_alternatives

TASK DETAIL
  https://phabricator.wikimedia.org/T289561

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: MPhamWMF
Cc: MPhamWMF, TheKtk, nguyenm9, Justin0x2004, Hannah_Bast, Gehel, Tpt, 
Smalyshev, So9q, Aklapper, Astuthiodit_1, karapayneWMDE, Invadibot, 
maantietaja, CBogen, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T289561: Evaluate Apache Rya as alternative to Blazegraph

2022-04-22 Thread MPhamWMF
MPhamWMF closed this task as "Declined".



[Wikidata-bugs] [Maniphest] T289561: Evaluate Apache Rya as alternative to Blazegraph

2022-03-29 Thread So9q
So9q added a comment.


  Based on the discussion above I suggest closing this task.



[Wikidata-bugs] [Maniphest] T289561: Evaluate Apache Rya as alternative to Blazegraph

2022-03-29 Thread MPhamWMF
MPhamWMF lowered the priority of this task from "Medium" to "Low".



[Wikidata-bugs] [Maniphest] T289561: Evaluate Apache Rya as alternative to Blazegraph

2022-01-30 Thread TheKtk
TheKtk added a comment.


  I tried to set up Rya several times over the past years, and I know no one who 
managed to get it to work. We are all RDF & OSS people, so it's not that we 
didn't try hard enough. Everyone gave up before we could even load a triple. 
Don't waste time on this. If it is used somewhere, it's closed off behind whoever 
works on or pays for it.



[Wikidata-bugs] [Maniphest] T289561: Evaluate Apache Rya as alternative to Blazegraph

2021-10-07 Thread So9q
So9q added a comment.


  Big thanks for sharing this!
  
  In T289561#7393732 , 
@Hannah_Bast wrote:
  
  > We looked a bit into Apache Rya. A couple of observations:
  >
  > 1. The instructions on https://github.com/apache/rya are a mess. Compiling 
the code requires an old version of the JDK (version 8), which is written 
nowhere and took us some time to find out. Compilation takes forever. The 
instructions concerning getting a working Rya server are cryptic, mentioning 
all kinds of other libraries and projects, but without instructions on how 
exactly to install them. Loading the data also seems to be non-trivial: you 
have to write code for this. It's certainly all doable, but this does not look 
like a well-maintained project.
  
  I'm sorry to hear that. I wrote to the last committer a while back and have 
yet to receive a response. Not a good sign.
  
  > 2. We had a look at the 2012 paper 
https://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf (which is well 
cited) and the 2017 slides 
https://events.static.linuxfound.org/sites/events/files/slides/Rya_ApacheBigData_20170518.pdf
 . The slides are in sync with what is written in the paper, and they are very 
instructive in understanding how the engine works. It also looks to me like 
they describe the current state of Rya (that is, there have not been any major 
changes to the basic architecture since then).
  >
  > 3. The underlying data store (Accumulo or MongoDB) is used only for storing 
the raw data (the triples). The actual operations on this data (like the JOIN 
operations, which are central for processing SPARQL queries) are done by the 
Rya code. This makes sense because a NoSQL store like MongoDB does not support 
JOIN operations, that's just not what it's made for.
  >
  > 4. The basic principle of Rya JOIN operations is explained on slide 15 of 
the presentation, and variations of it on slides 16, 18, 31, and 32. The basic 
principle is to start with the most selective triple, consider the set of 
matching entities for that triple (which is hopefully small) and then look up 
each of these (hopefully few) entities in the appropriate index.
  >
  > 5. This principle is efficient only when you have at least one highly 
selective triple in your SPARQL query. In the paper mentioned above, Rya is 
evaluated on the Lehigh University Benchmark (LUBM), which is a well-known 
but rather old benchmark, with rather special queries. Namely, all queries have 
at least one very selective triple, typically of the kind "variable  
".  There is not a single query, with a triple for the  
predicate, where the object is also a variable.
  >
  > 6. When you have no selective triple, Rya is bound to be slow 
because it then has to deal with very large sets of entities, which it will look 
up one by one. Also, Rya is not really made to be particularly efficient on a 
single machine. Its main purpose is to be efficient when distributed over 
several machines. We have already discussed that it does not make sense to 
distribute a moderate-sized dataset like Wikidata over several machines when 
you can easily process it on a single machine. Distributing a dataset always 
incurs a large performance overhead (because you need to send data back and 
forth between different machines during query processing) and you only do it 
when you have to.
  
  Interesting, I thought Wikidata was getting too big for 1 machine, but I 
might have misunderstood the WMF operations team and the statements in the tickets 
surrounding BG.
  
  Wikidata could easily triple in the number of triples within a year if all 
horses are let loose and people start importing all scientific papers, books 
and chemicals in Wikipedia and all the authors associated with those.
  
  > 7. Rya's performance bottleneck is actually very similar to that of 
Blazegraph. When you look at the many example queries for the WDQS on 
https://query.wikidata.org , almost none of them require the computation of a 
large intermediate result. For the simple reason that such queries don't work 
well with Blazegraph (they take forever or time out). Large intermediate 
results occur either when you have no single very selective triple in your 
query or when there is no LIMIT or the LIMIT is preceded by an ORDER BY or 
GROUP BY (so that you have to compute a large intermediate result before you 
can LIMIT it to the top-ranked items).
  
  Interesting! I was unaware of this, but it makes sense from my interactions 
with BG.
  
  > In summary, Rya does not look like a good choice for several reasons, most 
notably: not well-maintained, efficient only for quite particular kinds of 
queries, and similar performance bottlenecks as Blazegraph.
  
  Big thanks for taking the time to look into this. Rya was the least bad 
choice IMO until I read your insights.

  

[Wikidata-bugs] [Maniphest] T289561: Evaluate Apache Rya as alternative to Blazegraph

2021-09-30 Thread Hannah_Bast
Hannah_Bast added a comment.


  We looked a bit into Apache Rya. A couple of observations:
  
  1. The instructions on https://github.com/apache/rya are a mess. Compiling 
the code requires an old version of the JDK (version 8), which is written 
nowhere and took us some time to find out. Compilation takes forever. The 
instructions concerning getting a working Rya server are cryptic, mentioning 
all kinds of other libraries and projects, but without instructions on how 
exactly to install them. Loading the data also seems to be non-trivial: you 
have to write code for this. It's certainly all doable, but this does not look 
like a well-maintained project.
  
  2. We had a look at the 2012 paper 
https://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf (which is well 
cited) and the 2017 slides 
https://events.static.linuxfound.org/sites/events/files/slides/Rya_ApacheBigData_20170518.pdf
 . The slides are in sync with what is written in the paper, and they are very 
instructive in understanding how the engine works. It also looks to me like 
they describe the current state of Rya (that is, there have not been any major 
changes to the basic architecture since then).
  
  3. The underlying data store (Accumulo or MongoDB) is used only for storing 
the raw data (the triples). The actual operations on this data (like the JOIN 
operations, which are central for processing SPARQL queries) are done by the 
Rya code. This makes sense because a NoSQL store like MongoDB does not support 
JOIN operations, that's just not what it's made for.
  
  4. The basic principle of Rya JOIN operations is explained on slide 15 of the 
presentation, and variations of it on slides 16, 18, 31, and 32. The basic 
principle is to start with the most selective triple, consider the set of 
matching entities for that triple (which is hopefully small) and then look up 
each of these (hopefully few) entities in the appropriate index.
  
  5. This principle is efficient only when you have at least one highly 
selective triple in your SPARQL query. In the paper mentioned above, Rya is 
evaluated on the Lehigh University Benchmark (LUBM), which is a well-known 
but rather old benchmark, with rather special queries. Namely, all queries have 
at least one very selective triple, typically of the kind "variable  
".  There is not a single query, with a triple for the  
predicate, where the object is also a variable.
  
  6. When you have no selective triple, Rya is bound to be slow 
because it then has to deal with very large sets of entities, which it will look 
up one by one. Also, Rya is not really made to be particularly efficient on a 
single machine. Its main purpose is to be efficient when distributed over 
several machines. We have already discussed that it does not make sense to 
distribute a moderate-sized dataset like Wikidata over several machines when 
you can easily process it on a single machine. Distributing a dataset always 
incurs a large performance overhead (because you need to send data back and 
forth between different machines during query processing) and you only do it 
when you have to.
  
  7. Rya's performance bottleneck is actually very similar to that of 
Blazegraph. When you look at the many example queries for the WDQS on 
https://query.wikidata.org , almost none of them require the computation of a 
large intermediate result. For the simple reason that such queries don't work 
well with Blazegraph (they take forever or time out). Large intermediate 
results occur either when you have no single very selective triple in your 
query or when there is no LIMIT or the LIMIT is preceded by an ORDER BY or 
GROUP BY (so that you have to compute a large intermediate result before you 
can LIMIT it to the top-ranked items).
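
  To make the LIMIT remark in point 7 concrete, here is a toy sketch in plain 
Python. It is not SPARQL and not code from Rya or Blazegraph; the row 
generator `matching_rows` and its scores are invented for illustration. It only 
counts how many result rows have to be produced before the LIMIT can be applied.

```python
import heapq
from itertools import islice


def matching_rows(n, counter):
    """Lazily yield n hypothetical result rows (item, score).

    counter["produced"] records how many rows were actually computed.
    """
    for i in range(n):
        counter["produced"] += 1
        # (i * 37) % n is just a deterministic way to scatter the scores.
        yield (f"item{i}", (i * 37) % n)


# LIMIT 5 without ORDER BY: evaluation can stop after five rows.
plain = {"produced": 0}
first_five = list(islice(matching_rows(10_000, plain), 5))

# ORDER BY score DESC LIMIT 5: every row must be seen before the
# top five are known, so the whole intermediate result is computed.
ordered = {"produced": 0}
top_five = heapq.nlargest(
    5, matching_rows(10_000, ordered), key=lambda row: row[1]
)

print(plain["produced"], ordered["produced"])
```

  In this sketch the plain LIMIT touches only five rows, while the ordered 
variant has to go through all 10,000 before it can return anything.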
  
  In summary, Rya does not look like a good choice for several reasons, most 
notably: not well-maintained, efficient only for quite particular kinds of 
queries, and similar performance bottlenecks as Blazegraph.
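
  The join principle from points 4 to 6 can be sketched roughly as follows. 
This is a minimal in-memory illustration in Python, not Rya's actual code: the 
class `TripleIndex`, the function `join_from_selective`, and the sample data 
are all assumptions made up for this example. It answers a two-pattern query 
of the form `?x p1 o1 . ?x p2 ?y` by starting from the first pattern and then 
doing one index lookup per matching entity, and it counts those lookups.

```python
from collections import defaultdict


class TripleIndex:
    """Toy triple store with two lookup indexes (hypothetical, for illustration)."""

    def __init__(self, triples):
        self.by_po = defaultdict(set)  # (predicate, object) -> subjects
        self.by_sp = defaultdict(set)  # (subject, predicate) -> objects
        self.lookups = 0               # number of index lookups performed
        for s, p, o in triples:
            self.by_po[(p, o)].add(s)
            self.by_sp[(s, p)].add(o)

    def subjects(self, p, o):
        self.lookups += 1
        return self.by_po.get((p, o), set())

    def objects(self, s, p):
        self.lookups += 1
        return self.by_sp.get((s, p), set())


def join_from_selective(index, p1, o1, p2):
    """Answer { ?x p1 o1 . ?x p2 ?y } starting from the pattern (?x p1 o1).

    One lookup fetches the matching entities, then each entity is
    looked up individually for the second pattern.
    """
    results = []
    for x in sorted(index.subjects(p1, o1)):
        for y in sorted(index.objects(x, p2)):
            results.append((x, y))
    return results


# Made-up data: 3 universities and 1000 persons, each with a name.
triples = []
for i in range(1000):
    triples.append((f"person{i}", "type", "Person"))
    triples.append((f"person{i}", "name", f"Name {i}"))
for i in range(3):
    triples.append((f"univ{i}", "type", "University"))
    triples.append((f"univ{i}", "name", f"University {i}"))

index = TripleIndex(triples)

# Selective first pattern: only 3 entities match.
selective = join_from_selective(index, "type", "University", "name")
selective_lookups = index.lookups

# Non-selective first pattern: 1000 entities match.
index.lookups = 0
broad = join_from_selective(index, "type", "Person", "name")
broad_lookups = index.lookups

print(selective_lookups, broad_lookups)
```

  The number of lookups grows linearly with the number of entities matching 
the first pattern, which is why this strategy works well only when that 
pattern is selective. In a distributed setup each such lookup may also involve 
a network round trip.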



[Wikidata-bugs] [Maniphest] T289561: Evaluate Apache Rya as alternative to Blazegraph

2021-09-27 Thread So9q
So9q renamed this task from "Evaluate Rya as alternative to Blazegraph" to 
"Evaluate Apache Rya as alternative to Blazegraph".
