On Fri, 24 Feb 2023 at 19:31, Kingsley Idehen via Wikidata <[email protected]> wrote:
> On 2/24/23 5:59 AM, Guillaume Lederrey wrote:
>
> On Thu, 23 Feb 2023 at 22:56, Kingsley Idehen <[email protected]> wrote:
>
>> On 2/23/23 3:09 PM, Guillaume Lederrey wrote:
>>
>> On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen <[email protected]> wrote:
>>
>>> On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
>>>
>>> On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <[email protected]> wrote:
>>>
>>>> On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
>>>> > Hello all!
>>>> >
>>>> > TL;DR: We expect to successfully complete the recent data reload on
>>>> > Wikidata Query Service soon, but we've encountered multiple failures
>>>> > related to the size of the graph, and anticipate that this issue may
>>>> > worsen in the future. Although we succeeded this time, we cannot
>>>> > guarantee that future reload attempts will be successful given the
>>>> > current trend of the data reload process. Thank you for your
>>>> > understanding and patience.
>>>> >
>>>> > Longer version:
>>>> >
>>>> > WDQS is updated from a stream of recent changes on Wikidata, with a
>>>> > maximum delay of ~2 minutes. This process was improved as part of the
>>>> > WDQS Streaming Updater project to ensure data coherence[1]. However,
>>>> > the update process is still imperfect and can lead to data
>>>> > inconsistencies in some cases[2][3]. To address this, we reload the
>>>> > data from dumps a few times per year to reinitialize the system from
>>>> > a known good state.
>>>> >
>>>> > The recent reload of data from dumps started in mid-December and was
>>>> > initially met with some issues related to download and instabilities
>>>> > in Blazegraph, the database used by WDQS[4]. Loading the data into
>>>> > Blazegraph takes a couple of weeks due to the size of the graph, and
>>>> > we had multiple attempts where the reload failed after >90% of the
>>>> > data had been loaded.
>>>> > Our understanding of the issue is that a "race condition" in
>>>> > Blazegraph[5], where subtle timing changes lead to corruption of the
>>>> > journal in some rare cases, is to blame[6].
>>>> >
>>>> > We want to reassure you that the last reload job was successful on
>>>> > one of our servers. The data still needs to be copied over to all of
>>>> > the WDQS servers, which will take a couple of weeks, but should not
>>>> > bring any additional issues. However, reloading the full data from
>>>> > dumps is becoming more complex as the data size grows, and we wanted
>>>> > to let you know why the process took longer than expected. We
>>>> > understand that data inconsistencies can be problematic, and we
>>>> > appreciate your patience and understanding while we work to ensure
>>>> > the quality and consistency of the data on WDQS.
>>>> >
>>>> > Thank you for your continued support and understanding!
>>>> >
>>>> > Guillaume
>>>> >
>>>> > [1] https://phabricator.wikimedia.org/T244590
>>>> > [2] https://phabricator.wikimedia.org/T323239
>>>> > [3] https://phabricator.wikimedia.org/T322869
>>>> > [4] https://phabricator.wikimedia.org/T323096
>>>> > [5] https://en.wikipedia.org/wiki/Race_condition#In_software
>>>> > [6] https://phabricator.wikimedia.org/T263110
>>>>
>>>> Hi Guillaume,
>>>>
>>>> Are there plans to decouple WDQS from the back-end database? Doing that
>>>> provides a more resilient architecture for Wikidata as a whole, since
>>>> you will be able to swap and interchange SPARQL-compliant backends.
>>>
>>> It depends what you mean by decoupling. The coupling points as I see
>>> them are:
>>>
>>> * update process
>>> * UI
>>> * exposed SPARQL endpoint
>>>
>>> The update process is mostly decoupled from the backend. It is producing
>>> a stream of RDF updates that is backend independent, with a very thin
>>> Blazegraph-specific adapter to load the data into Blazegraph.
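[As an illustration of the backend-independent stream idea described above: each diff event from such a stream could be translated into a standard SPARQL 1.1 Update request that any compliant store accepts. The event shape and field names below are hypothetical — the actual Streaming Updater format is internal to WMF and not documented here.]

```python
def event_to_sparql_update(event):
    """Translate a diff event into a SPARQL 1.1 Update string.

    The event shape ("deleted_triples"/"inserted_triples" lists of
    N-Triples-style statements) is a made-up example, not the real
    WDQS Streaming Updater format. Prefix declarations are omitted;
    a real adapter would emit full IRIs or a PREFIX header.
    """
    parts = []
    deleted = event.get("deleted_triples", [])
    inserted = event.get("inserted_triples", [])
    if deleted:
        parts.append("DELETE DATA {\n  " + " .\n  ".join(deleted) + " .\n}")
    if inserted:
        parts.append("INSERT DATA {\n  " + " .\n  ".join(inserted) + " .\n}")
    # Apply deletes before inserts, as a single update request.
    return " ;\n".join(parts)
```

[The resulting string would then be POSTed to whichever store's update endpoint is in use — the point being that nothing in it is Blazegraph-specific.]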
>>> Does that mean that we could integrate the RDF stream into our setup
>>> for keeping our Wikidata instance up to date, for instance?
>>
>> That data stream isn't exposed publicly. There are a few tricky parts
>> about the stream needing to be synchronized with a specific Wikidata
>> dump that make it not entirely trivial to reuse outside of our internal
>> use case. But if there is enough interest, we could potentially work on
>> making that stream public.
>>
>> I suspect there's broad interest in this matter since it contributes to
>> the overarching issue of loose coupling in Wikidata's underlying
>> infrastructure.
>>
>> For starters, offering a public stream would be very useful to 3rd-party
>> Wikidata hosts.
>>
>>> The UI is mostly backend independent. It relies on Search for some
>>> features. And of course, the queries themselves might depend on
>>> Blazegraph-specific features.
>>>
>>> Can WDQS, based on what's stated above, work with a generic SPARQL
>>> back-end like Virtuoso, for instance? By that I mean dispatch SPARQL
>>> queries input by a user (without alteration) en route to server
>>> processing?
>>
>> The WDQS UI is managed by WMDE; my knowledge is limited. Maybe someone
>> from WMDE could jump in and add more context. That being said, as far as
>> I know, pointing it to a different backend is just a configuration
>> option. Feel free to have a look at the code
>> (https://gerrit.wikimedia.org/g/wikidata/query/gui).
>>
>> I'll take a look.
>>
>> It should be reasonably easy to deploy another WDQS UI instance
>> somewhere else, which points to whatever backend you'd like.
>>
>> Okay, I assume that in the current state it would be sending
>> Blazegraph-specific SPARQL?
>
> Again, not my area of expertise, but I assume that the UI itself is
> issuing fairly standard SPARQL. Of course, user queries will use whatever
> they want.
> It does have dependencies on our Search interface as well, so that would
> have to be replicated.
>
> You mean WDQS has a Text Search interface component that's intertwined
> with the Query Service provided by the Wikidata SPARQL Endpoint?
>
>> As a policy, we don't send traffic to any third party, so we will not
>> set up such an instance.
>>
>>> The exposed SPARQL endpoint is at the moment a direct exposition of the
>>> Blazegraph endpoint, so it does expose all the Blazegraph-specific
>>> features and quirks.
>>>
>>> Is there a Query Service that's separated from the Blazegraph endpoint?
>>> The crux of the matter here is that WDQS benefits more by being loosely
>>> bound to endpoints rather than tightly bound to the Blazegraph
>>> endpoint.
>>
>> It depends what you mean by Query Service. My definition of a Query
>> Service in this context is a SPARQL endpoint with a specific data set.
>>
>> Yes, but in the case of Wikidata that's a combination of both a SPARQL
>> Query Service (query processor and endpoint) and WDQS query solution
>> rendering services.
>>
>> That SPARQL endpoint at the moment is Blazegraph. I'm not entirely clear
>> what kind of loose binding you'd like to see in this context. We might
>> have different definitions of the same words here.
>>
>> Loose coupling, in the context I am describing, would comprise the
>> following:
>>
>> 1. WDQS that can be bolted on to any SPARQL endpoint, just like YASGUI
>>    <https://github.com/TriplyDB/Yasgui#this>
>
> In this context, I would say "WDQS UI can be bolted to any SPARQL
> endpoint". In terms of SPARQL itself, that should already be mostly the
> case. I think there is a dependency on Search as well.
>
> As per my earlier comment, I don't quite understand what you are
> referring to regarding the Search (Free Text Querying) intermingling.
> Does this relate to SPARQL Query Patterns comprising literal objects?
> If so, WDQS should be able to constrain such behavior to Blazegraph
> instances -- by way of configuration that informs introspection.

WDQS UI relies on a Search endpoint (backed by Elasticsearch) for
autocompletion. The requirements of low latency and reasonable ranking are
something that Elasticsearch (or another Search-oriented backend) does
really well. But I would not expect an RDF backend to offer good ranking
heuristics.

>> 2. Near real-time data streams usable by 3rd-party Wikidata hosts
>>
>> With the above in place, the cost and burden associated with Wikidata
>> hosting will also be reduced -- courtesy of federation.
>
> Could you please open a Phabricator task to document what you would like
> to see exposed and why it would be useful?
>
> Okay, when I (or someone else) get a moment.
>
> --
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Home Page: http://www.openlinksw.com
> Community Support: https://community.openlinksw.com
> Weblogs (Blogs):
> Company Blog: https://medium.com/openlink-software-blog
> Virtuoso Blog: https://medium.com/virtuoso-blog
> Data Access Drivers Blog: https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
>
> Personal Weblogs (Blogs):
> Medium Blog: https://medium.com/@kidehen
> Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
> http://kidehen.blogspot.com
>
> Profile Pages:
> Pinterest: https://www.pinterest.com/kidehen/
> Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
> Twitter: https://twitter.com/kidehen
> Google+: https://plus.google.com/+KingsleyIdehen/about
> LinkedIn: http://www.linkedin.com/in/kidehen
>
> Web Identities (WebID):
> Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
> : http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
>
> _______________________________________________
> Wikidata mailing list -- [email protected]
> Public archives at https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/CDDDQC7VKFDWMASFXUZ32MU6SAQB2QFQ/
> To unsubscribe send an email to [email protected]

--
Guillaume Lederrey (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
_______________________________________________
Wikidata mailing list -- [email protected]
Public archives at https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/R3LKHYJC4QKQV4S3TFJ7FVUBTOOYH2R2/
To unsubscribe send an email to [email protected]
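[A practical footnote on the portability point discussed in the thread: because the SPARQL 1.1 Protocol standardizes how queries are sent over HTTP, a client written for one endpoint works against any other conformant endpoint (Blazegraph, Virtuoso, ...) by changing only the URL. A minimal sketch — the endpoint URL below is a placeholder, not a real service:]

```python
import urllib.parse
import urllib.request

def sparql_query_request(endpoint_url, query):
    """Build a SPARQL 1.1 Protocol query request (POST, form-encoded).

    Any conformant endpoint accepts this shape; swapping backends
    means changing only endpoint_url.
    """
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )

# Placeholder URL; a real client would pass the request to
# urllib.request.urlopen() and parse the JSON result bindings.
req = sparql_query_request(
    "https://example.org/sparql",
    "SELECT ?item WHERE { ?item ?p ?o } LIMIT 1",
)
```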
