So9q added a comment.
In T290839#7360790 <https://phabricator.wikimedia.org/T290839#7360790>, @Hannah_Bast wrote:

> It's of course up to you (the Wikidata team) to decide this. But I wouldn't dismiss this idea so easily.
>
> There is clearly a group of users who want to query the exact contents of the database at the point in time they are querying it. I assume that this group includes many Wikimedians and all kinds of statistics queries on Wikidata. But I am sure that there is also a large group of users who don't care if the version of Wikidata they are querying is a few hours old, but who care much more about convenience and efficiency (or getting results at all, which is clearly a problem with the current service).

+1. I wrote a query today with wdt:P31 <https://phabricator.wikimedia.org/P31>/wdt:P279 <https://phabricator.wikimedia.org/P279>* that timed out. I implemented some workarounds, though, to get what I wanted and computed it locally. See https://github.com/dpriskorn/ItemSubjector/blob/prepare-batch-improved-structure/fetch_main_subjects.py

> Now this here is a (low-priority) thread about a "double backend strategy for WDQS". If there were an engine that can answer all "reasonable" queries efficiently **and** that supports SPARQL update operations, there would be no need for this debate. Based on my own experience, I personally think that Virtuoso comes close to being this engine. It is the most mature SPARQL engine on the market when it comes to handling very large datasets with reasonable hardware, and it's remarkable how fast it is even for some fairly complex queries.

I would very much like to see a comparison of Virtuoso and Rya on complex queries. Rya has some interesting query optimizations that are briefly described and linked here: https://phabricator.wikimedia.org/T289561#7321936

> But there are many reasonable queries which by design are very hard also for Virtuoso (and which indeed time out on Virtuoso's Wikidata SPARQL endpoint).
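For context, the kind of query that times out here combines a class-membership triple (wdt:P31) with an unbounded subclass-closure path (wdt:P279*), which forces the engine to traverse the whole subclass hierarchy. A minimal sketch of the paging workaround, assuming a hypothetical helper (the linked script instead fetches pieces and computes the closure locally; this only builds the query string):

```python
# Sketch of the query class that times out on WDQS: items that are
# instances (wdt:P31) of a class or any of its subclasses (wdt:P279*).
# The helper name and paging approach are illustrative, not the code
# from the linked script; no network call is made here.

def build_instances_query(class_qid: str, limit: int = 10000, offset: int = 0) -> str:
    """Return a SPARQL query for instances of class_qid or its subclasses,
    paged with LIMIT/OFFSET so each request stays below the timeout."""
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}\n"
        f"LIMIT {limit} OFFSET {offset}"
    )

# Example: first page of instances of "scholarly article" (Q13442814).
query = build_instances_query("Q13442814")
```

Paging only helps when the engine can enumerate results lazily; if it has to materialize the whole P279* closure before returning anything, each page still pays the full traversal cost, which is why the local-computation workaround was needed.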
> In my experience, there is a clear trade-off between efficiency and the support of live updates. There is just a lot of room for optimization when you have read-only data and you can rebuild the index from scratch periodically.

Interesting. I edited this task to mention the possibility of a time-lagged, heavily optimized endpoint alongside a real-time endpoint like we have today with Blazegraph. I personally think this could offload a lot of the request pressure we are seeing on WDQS right now. People who are not tech-savvy, or who cannot afford the time or money to set up their own endpoint, have few good options for running expensive queries and getting the data they want.

@Hannah_Bast, do you know if QLever supports a column-store backend? If we adopt Rya and have a column-store cluster, how could we best handle snapshotting/moving the data (I'm thinking it will be 1 TB in a few years) efficiently to a QLever cluster? Has anyone done operations like this before? I found this in a quick search: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html

TASK DETAIL
https://phabricator.wikimedia.org/T290839
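For the HDFS route, the snapshot workflow in the linked docs boils down to three steps: allow snapshots on a directory (once), create a named read-only snapshot, and copy that snapshot to the target cluster (e.g. with DistCp). A sketch that only assembles the command lines; the paths, snapshot name, and cluster URI are hypothetical placeholders, not anything decided on this task:

```python
# Sketch of driving the HDFS snapshot workflow from the linked docs.
# Only builds the argv lists (nothing is executed); all paths and the
# destination cluster URI are hypothetical examples.

def snapshot_commands(src_dir: str, name: str, dst_uri: str) -> list:
    return [
        # One-time: permit snapshots on the source directory.
        ["hdfs", "dfsadmin", "-allowSnapshot", src_dir],
        # Create a consistent, read-only point-in-time snapshot.
        ["hdfs", "dfs", "-createSnapshot", src_dir, name],
        # Copy the frozen snapshot to the target cluster with DistCp.
        ["hadoop", "distcp", f"{src_dir}/.snapshot/{name}", dst_uri],
    ]

cmds = snapshot_commands("/wdqs/triples", "weekly-2021-09", "hdfs://qlever-cluster/ingest")
```

Because the `.snapshot` path is immutable while the copy runs, the transfer sees a consistent view of the data even if the live directory keeps receiving updates, which seems like the property we would want for feeding a periodically rebuilt read-only endpoint.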
_______________________________________________ Wikidata-bugs mailing list -- [email protected] To unsubscribe send an email to [email protected]
