[Wikidata-bugs] [Maniphest] T206560: [Epic] Evaluate alternatives to Blazegraph
KingsleyIdehen added a comment.

A few things:

1. This behavior is configurable, i.e., we just have a restrictive timeout associated with the current public instance.
2. We could also include HTML-level messaging to complement the HTTP-level messaging.

This feature is the result of a fundamental challenge that arises when you publish a query service for ad-hoc access on the Web:

1. Unpredictable Query Complexity
2. Unpredictable Query Solution Size
3. Unpredictable Number of User Agents triggering any combination of the above.

Conventional DBMS products can't handle this problem, hence the creation of this feature at the time Virtuoso was created, circa 1998. Alternatively, as many do, you can simply throw lots of machines at the problem via horizontal partitioning using shards, which sets you on the path to massive data centers (the norm worldwide these days).

> SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }

is a typical expensive query that falls into the list I presented above. It's no different from

> SELECT * FROM {Some-Table}

in a typical SQL-compliant RDBMS, which is why you don't see those published to the Web -- although Virtuoso instances handle this too, using the same "Anytime Query" functionality with a configurable timeout.

Getting a result and getting an actual latest-and-greatest result are different things. There is no way on earth that

> SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }

is happening in real time on the Blazegraph instance. We could return a value from our statistics table too, but that isn't the global interpretation of a solution for that query. For instance, we could put all the relevant stats in a VoID graph, with scheduled updates to it if need be. With Virtuoso, everything can be configured to suit the interaction behavior desired.
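To make the "result vs. latest and greatest result" distinction concrete, here is a toy sketch in Python of a count that works under a time budget and reports whether it finished. It is only an illustration of the general idea, not how Virtuoso implements Anytime Query; the function name `anytime_count` and the injectable `clock` parameter are made up for this example.

```python
import time
from typing import Callable, Iterable, Tuple


def anytime_count(
    rows: Iterable[object],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, bool]:
    """Count rows until the input is exhausted or the time budget runs out.

    Returns (count, complete). When complete is False, the count is only a
    lower bound, i.e. a partial solution rather than the full answer.
    """
    deadline = clock() + timeout_s
    count = 0
    for _ in rows:
        # Check the budget before doing any more work on the next row.
        if clock() >= deadline:
            return count, False
        count += 1
    return count, True
```

A caller would do something like `total, complete = anytime_count(rows, timeout_s=2.0)` and has to look at `complete` before treating `total` as the real answer, which is the same contract a client has with an endpoint that signals partial results.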
It just so happens that ad-hoc querying for global 24/7 access, irrespective of solution size, complexity, or origin, is a fundamental challenge that isn't generally understood, since it's pegged to the emergence of the Web :)

TASK DETAIL https://phabricator.wikimedia.org/T206560
EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: KingsleyIdehen
Cc: Hannah_Bast, RShigapov, Izno, KingsleyIdehen, Daniel_Mietchen, Majavah, karapayneWMDE, MarioGom, Mohammed_Sadat_WMDE, Hjfocs, danshick-wmde, Thadguidry, Tpt, TallTed, Sj, Afandian, Justin0x2004, Jerven, TheKtk, Ivanhercaz, Jneubert, DanBri, Lydia_Pintscher, Tagishsimon, Samantha_Alipio_WMDE, Ostrzyciel, GreenReaper, WMDE-leszek, Salgo60, So9q, Krabina, Jecummings4, TomT0m, Akuckartz, Susannaanas, Addshore, Andrawaag, Gehel, Lucas_Werkmeister_WMDE, Aklapper, Smalyshev, Invadibot, MPhamWMF, Jtm-lis, maantietaja, NavinRizwi, CBogen, Isaacandy, Demian, Olson.jared.m, Nandana, Namenlos314, Lahi, Gq86, Bryandamon, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, Steko, Samwilson, PhotographerTom, suriyaa, Psychoslave, tosfos, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Darenwelsh, Dinoguy1000, Manybubbles, brion, Mbch331, MarkAHershberger
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T206560: [Epic] Evaluate alternatives to Blazegraph
KingsleyIdehen added a comment.

That's a consequence of the "Anytime Query" feature in Virtuoso, which provides partial solutions when a query cannot be completed within a specific timeframe. This timeframe takes the form of a configurable timeout, and it is a critical feature for enabling global ad-hoc query access, 24/7, 365 days a year, for the likes of DBpedia and Wikidata.

When a partial result is returned, that fact is signaled via the HTTP response headers, as per:

curl -I "https://wikidata.demo.openlinksw.com/sparql?default-graph-uri=http%3A%2F%2Fwww.wikidata.org%2F=PREFIX+parl%3A+%3Chttps%3A%2F%2Fid.parliament.uk%2Fschema%2F%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0APREFIX+wikibase%3A+%3Chttp%3A%2F%2Fwikiba.se%2Fontology%23%3E%0D%0APREFIX+bd%3A+%3Chttp%3A%2F%2Fwww.bigdata.com%2Frdf%23%3E%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+dbpedia%3A+%3Chttp%3A%2F%2Fdbpedia.org%2F%3E+%0D%0APREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0APREFIX+wds%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2Fstatement%2F%3E%0D%0APREFIX+wdv%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fvalue%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0APREFIX+p%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0D%0APREFIX+ps%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fstatement%2F%3E%0D%0APREFIX+pq%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fqualifier%2F%3E%0D%0A%0D%0ASELECT+%3Fperson_id+%3Fperson+%28COUNT%28%3Fprofession_id%29+AS+%3Fcount%29+%28GROUP_CONCAT%28%3Fprofession%3B+separator%3D%22%2C+%22%29+AS+%3Fprofessions%29+WHERE+%7B%0D%0A++%3Fperson_id+wdt%3AP31+wd%3AQ5+.%0D%0A++%3Fperson_id+wdt%3AP106+%3Fprofession_id+.%0D%0A++%3Fprofession_id+rdfs%3Alabel+%3Fprofession+.%0D%0A++%3Fperson_id+rdfs%3Alabel+%3Fperson+.%0D%0A++FILTER+%28LANG%28%3Fperson%29+%3D+%22en%22%29+.%0D%0A++FILTER+%28LANG%28%3Fprofession%29+%3D+%22en%22%29%0D%0A%7D%0D%0AGROUP+BY+%3Fperson_id+%3Fperson%0D%0AORDER+BY+DESC%28%3Fcount%29=text%2Fx-html%2Btr=3000_void=on_unconnected=on"

Which returns:

HTTP/1.1 200 OK
Date: Thu, 09 Sep 2021 17:53:51 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 479290
Connection: keep-alive
Vary: Accept-Encoding
Server: Virtuoso/08.03.3320 (Linux) x86_64-generic-linux-glibc25 VDB
Accept-Ranges: bytes
X-SPARQL-default-graph: http://www.wikidata.org/
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 1.075M rnd 530.1K seq 16.78K same seg 109.7K same pg 14.13K same par 0 disk 0 spec disk 0B / 0
X-Exec-Milliseconds: 2031
X-Exec-DB-Activity: 1.075M rnd 530.1K seq 16.78K same seg 109.7K same pg 14.13K same par 0 disk 0 spec disk 0B / 0 messages 0 fork
Content-disposition: filename=sparql_2021-09-09_17-53-51Z.html
Expires: Thu, 09 Sep 2021 18:53:51 GMT
Cache-Control: max-age=3600
Strict-Transport-Security: max-age=15768000

**Related**

[1] DBpedia Fair Use Note <https://www.dbpedia.org/resources/sparql/>
[2] Virtuoso Anytime Query Tips & Tricks Note <http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksAnytimeSPARQLQuery>
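For anyone consuming such an endpoint programmatically, a client-side check could look roughly like the Python sketch below. It relies only on the header names and the `S1TAT` value visible in the response above; whether that state code is the sole marker of a partial result is an assumption here, and the helper names `is_partial_result` and `exec_milliseconds` are hypothetical.

```python
from typing import Dict, Mapping, Optional

# Assumption: this X-SQL-State value, taken from the sample response,
# marks a result that was cut short by the Anytime Query timeout.
ANYTIME_SQL_STATE = "S1TAT"


def _normalize(headers: Mapping[str, str]) -> Dict[str, str]:
    """Lower-case header names, since HTTP header names are case-insensitive."""
    return {name.lower(): value for name, value in headers.items()}


def is_partial_result(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers signal an incomplete result."""
    state = _normalize(headers).get("x-sql-state", "")
    return state.strip() == ANYTIME_SQL_STATE


def exec_milliseconds(headers: Mapping[str, str]) -> Optional[int]:
    """Return the reported execution time in ms, or None if absent or malformed."""
    raw = _normalize(headers).get("x-exec-milliseconds")
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        return None
```

Both helpers take a plain mapping, so the headers object from whatever HTTP client is in use can be passed in after converting it to a dict; a caller that sees `is_partial_result(...)` return True can then retry with a longer timeout or narrow the query.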
[Wikidata-bugs] [Maniphest] T206560: [Epic] Evaluate alternatives to Blazegraph
KingsleyIdehen added a comment.

For the record: at the time of our first rendezvous re Wikidata hosting, handling 20 billion+ triples would typically have required our Cluster Edition (a commercial-only offering). That was the deal-breaker at the time of the initial Blazegraph selection for Wikidata, i.e., Blazegraph offered an Open Source based Cluster Edition.

Anyway, in recent times, our Open Source Edition has evolved to handle some 80 billion+ triples (exemplified by the live Uniprot instance <https://sparql.uniprot.org/>), where performance and scale are primarily a function of available memory.

Fundamentally, the current 13 billion triples size of Wikidata and its future growth lie well within the range of Virtuoso's Open Source Edition. Also note that, based on our experience hosting live DBpedia and Wikidata instances, we do have configuration best practices in place for uptime and scalability without the need for our Cluster Edition (which is really for dealing with massive setups in the 100 billion triples or higher range).

I hope this helps.
**Related**

[1] Our Live Wikidata SPARQL Query Endpoint <https://wikidata.demo.openlinksw.com/sparql>
[2] Google Spreadsheet about various Virtuoso Configurations associated with some well-known public endpoints <https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97fZ683ZQF0/edit#gid=0>
[3] This query doesn't complete with the current Blazegraph-based Wikidata endpoint <https://t.co/EjAAO73wwE>
[4] Same query completing when applied to the Virtuoso-based endpoint <https://t.co/GTATPPJNBI>
[5] About loading Wikidata's datasets into a Virtuoso instance <https://t.co/X7mLmcYC69>
[6] Various demos shared via Twitter over the years regarding Wikidata <https://twitter.com/search?q=%23Wikidata%20%23VirtuosoRDBMS%20%40kidehen=typed_query=live>
[7] Uniprot SPARQL Endpoint Presentation <https://t.co/EpuP27TFRE?amp=1>
[Wikidata-bugs] [Maniphest] T206561: Evaluate Virtuoso as alternative to Blazegraph
KingsleyIdehen added a comment.

In T206561#7304519 <https://phabricator.wikimedia.org/T206561#7304519>, @So9q wrote:

> I took a glance at Virtuoso.
>
> I found nothing about scaling Virtuoso to a cluster (which is IMO what WMF needs because of growing amounts of data and reaching the limits of what 1 machine can handle)
>
> A snippet from WP:
> "Virtuoso is designed to take advantage of operating system threading support and multiple CPUs. It consists of a single process with an adjustable pool of threads shared between clients. Multiple threads may work on a single index tree with minimal interference with each other. One cache of database pages is shared among all threads and old dirty pages are written back to disk as a background process."
>
> Virtuoso IMO is not the way forward for WMF. We need a distributed graph/column database with SPARQL on top. See https://phabricator.wikimedia.org/T289561 for an application that has exactly that (but seems abandoned since dec 2020 unfortunately)

Again, Virtuoso 7.x Open Source Edition scales up to 80 billion triples, as demonstrated by Uniprot's live instance. You don't need the Virtuoso Cluster Edition until the scalability of the single-server edition is exhausted. Wikidata is a long way from reaching 80 billion+ triples.

Virtuoso has also hosted DBpedia for the last 14 years, i.e., since its inception.

I hope that helps.
Kingsley

TASK DETAIL https://phabricator.wikimedia.org/T206561