So9q added a comment.
In T290839#7360790 <https://phabricator.wikimedia.org/T290839#7360790>, @Hannah_Bast wrote:

> It's of course up to you (the Wikidata team) to decide this. But I wouldn't dismiss this idea so easily.
>
> There is clearly a group of users who want to query the exact contents of the database at the point in time they are querying it. I assume that this group includes many Wikimedians and all kinds of statistics queries on Wikidata. But I am sure that there is also a large group of users who don't care if the version of Wikidata they are querying is a few hours old, but who care much more about convenience and efficiency (or getting results at all, which is clearly a problem with the current service).

+1. I wrote a query today with wdt:P31 <https://phabricator.wikimedia.org/P31>/wdt:P279 <https://phabricator.wikimedia.org/P279>* that timed out. I implemented some workarounds, though, to get what I wanted and computed it locally. See https://github.com/dpriskorn/ItemSubjector/blob/prepare-batch-improved-structure/fetch_main_subjects.py

> Now this here is a (low-priority) thread about a "double backend strategy for WDQS". If there were an engine that can answer all "reasonable" queries efficiently **and** that supports SPARQL update operations, there would be no need for this debate. Based on my own experience, I personally think that Virtuoso comes close to being this engine. It is the most mature SPARQL engine on the market when it comes to handling very large datasets with reasonable hardware, and it's remarkable how fast it is even for some fairly complex queries.

I would very much like to see a comparison of Virtuoso and Rya on complex queries. Rya has some interesting query optimizations that are briefly described and linked here: https://phabricator.wikimedia.org/T289561#7321936

> But there are many reasonable queries which by design are very hard also for Virtuoso (and which indeed time out on Virtuoso's Wikidata SPARQL endpoint).
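For context, the kind of query that times out here combines a class-membership triple (wdt:P31) with an unbounded subclass-closure path (wdt:P279*), which forces the engine to traverse the whole subclass hierarchy. A minimal sketch of the paging workaround, assuming a hypothetical helper (the linked script instead fetches pieces and computes the closure locally; this only builds the query string):

```python
# Sketch of the query class that times out on WDQS: items that are
# instances (wdt:P31) of a class or any of its subclasses (wdt:P279*).
# The helper name and paging approach are illustrative, not the code
# from the linked script; no network call is made here.

def build_instances_query(class_qid: str, limit: int = 10000, offset: int = 0) -> str:
    """Return a SPARQL query for instances of class_qid or its subclasses,
    paged with LIMIT/OFFSET so each request stays below the timeout."""
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}\n"
        f"LIMIT {limit} OFFSET {offset}"
    )

# Example: first page of instances of "scholarly article" (Q13442814).
query = build_instances_query("Q13442814")
```

Paging only helps when the engine can enumerate results lazily; if it has to materialize the whole P279* closure before returning anything, each page still pays the full traversal cost, which is why the local-computation workaround was needed.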
> In my experience, there is a clear trade-off between efficiency and the support of live updates. There is just a lot of room for optimization when you have read-only data and you can rebuild the index from scratch periodically.

Interesting. I edited this task to mention the possibility of a time-lagged, heavily optimized endpoint alongside a real-time endpoint like we have today with Blazegraph. I personally think this could offload a lot of the request pressure we are seeing on WDQS right now. People who are not tech-savvy, or who cannot afford the time or money to set up their own endpoint, have few good options for running expensive queries and getting the data they want.

@Hannah_Bast, do you know if QLever supports a column-store backend? If we adopt Rya and have a column-store cluster, how could we best handle snapshotting/moving the data (I'm thinking it will be 1 TB in a few years) efficiently to a QLever cluster? Has anyone done operations like this before? I found this in a quick search: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html

TASK DETAIL
https://phabricator.wikimedia.org/T290839
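For the HDFS route, the snapshot workflow in the linked docs boils down to three steps: allow snapshots on a directory (once), create a named read-only snapshot, and copy that snapshot to the target cluster (e.g. with DistCp). A sketch that only assembles the command lines; the paths, snapshot name, and cluster URI are hypothetical placeholders, not anything decided on this task:

```python
# Sketch of driving the HDFS snapshot workflow from the linked docs.
# Only builds the argv lists (nothing is executed); all paths and the
# destination cluster URI are hypothetical examples.

def snapshot_commands(src_dir: str, name: str, dst_uri: str) -> list:
    return [
        # One-time: permit snapshots on the source directory.
        ["hdfs", "dfsadmin", "-allowSnapshot", src_dir],
        # Create a consistent, read-only point-in-time snapshot.
        ["hdfs", "dfs", "-createSnapshot", src_dir, name],
        # Copy the frozen snapshot to the target cluster with DistCp.
        ["hadoop", "distcp", f"{src_dir}/.snapshot/{name}", dst_uri],
    ]

cmds = snapshot_commands("/wdqs/triples", "weekly-2021-09", "hdfs://qlever-cluster/ingest")
```

Because the `.snapshot` path is immutable while the copy runs, the transfer sees a consistent view of the data even if the live directory keeps receiving updates, which seems like the property we would want for feeding a periodically rebuilt read-only endpoint.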
_______________________________________________ Wikidata-bugs mailing list -- [email protected] To unsubscribe send an email to [email protected]
