On Fri, 24 Feb 2023 at 19:31, Kingsley Idehen via Wikidata <[email protected]> wrote:
> On 2/24/23 5:59 AM, Guillaume Lederrey wrote:
>
> On Thu, 23 Feb 2023 at 22:56, Kingsley Idehen <[email protected]> wrote:
>
>> On 2/23/23 3:09 PM, Guillaume Lederrey wrote:
>>
>> On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen <[email protected]> wrote:
>>
>>> On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
>>>
>>> On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <[email protected]> wrote:
>>>
>>>> On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
>>>> > Hello all!
>>>> >
>>>> > TL;DR: We expect to successfully complete the recent data reload on
>>>> > Wikidata Query Service soon, but we've encountered multiple failures
>>>> > related to the size of the graph, and anticipate that this issue may
>>>> > worsen in the future. Although we succeeded this time, we cannot
>>>> > guarantee that future reload attempts will be successful given the
>>>> > current trend of the data reload process. Thank you for your
>>>> > understanding and patience.
>>>> >
>>>> > Longer version:
>>>> >
>>>> > WDQS is updated from a stream of recent changes on Wikidata, with a
>>>> > maximum delay of ~2 minutes. This process was improved as part of the
>>>> > WDQS Streaming Updater project to ensure data coherence[1]. However,
>>>> > the update process is still imperfect and can lead to data
>>>> > inconsistencies in some cases[2][3]. To address this, we reload the
>>>> > data from dumps a few times per year to reinitialize the system from
>>>> > a known good state.
>>>> >
>>>> > The recent reload of data from dumps started in mid-December and was
>>>> > initially met with some issues related to download and instabilities
>>>> > in Blazegraph, the database used by WDQS[4]. Loading the data into
>>>> > Blazegraph takes a couple of weeks due to the size of the graph, and
>>>> > we had multiple attempts where the reload failed after >90% of the
>>>> > data had been loaded.
>>>> > Our understanding of the issue is that a "race condition" in
>>>> > Blazegraph[5], where subtle timing changes lead to corruption of the
>>>> > journal in some rare cases, is to blame[6].
>>>> >
>>>> > We want to reassure you that the last reload job was successful on
>>>> > one of our servers. The data still needs to be copied over to all of
>>>> > the WDQS servers, which will take a couple of weeks, but should not
>>>> > bring any additional issues. However, reloading the full data from
>>>> > dumps is becoming more complex as the data size grows, and we wanted
>>>> > to let you know why the process took longer than expected. We
>>>> > understand that data inconsistencies can be problematic, and we
>>>> > appreciate your patience and understanding while we work to ensure
>>>> > the quality and consistency of the data on WDQS.
>>>> >
>>>> > Thank you for your continued support and understanding!
>>>> >
>>>> > Guillaume
>>>> >
>>>> > [1] https://phabricator.wikimedia.org/T244590
>>>> > [2] https://phabricator.wikimedia.org/T323239
>>>> > [3] https://phabricator.wikimedia.org/T322869
>>>> > [4] https://phabricator.wikimedia.org/T323096
>>>> > [5] https://en.wikipedia.org/wiki/Race_condition#In_software
>>>> > [6] https://phabricator.wikimedia.org/T263110
>>>>
>>>> Hi Guillaume,
>>>>
>>>> Are there plans to decouple WDQS from the back-end database? Doing that
>>>> provides a more resilient architecture for Wikidata as a whole, since
>>>> you will be able to swap and interchange SPARQL-compliant backends.
>>>
>>> It depends what you mean by decoupling. The coupling points as I see
>>> them are:
>>>
>>> * update process
>>> * UI
>>> * exposed SPARQL endpoint
>>>
>>> The update process is mostly decoupled from the backend. It is producing
>>> a stream of RDF updates that is backend independent, with a very thin
>>> Blazegraph-specific adapter to load the data into Blazegraph.
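[As an illustration of the backend-independent stream idea described above: each diff event from such a stream could be translated into a standard SPARQL 1.1 Update request that any compliant store accepts. The event shape and field names below are hypothetical — the actual Streaming Updater format is internal to WMF and not documented here.]

```python
def event_to_sparql_update(event):
    """Translate a diff event into a SPARQL 1.1 Update string.

    The event shape ("deleted_triples"/"inserted_triples" lists of
    N-Triples-style statements) is a made-up example, not the real
    WDQS Streaming Updater format. Prefix declarations are omitted;
    a real adapter would emit full IRIs or a PREFIX header.
    """
    parts = []
    deleted = event.get("deleted_triples", [])
    inserted = event.get("inserted_triples", [])
    if deleted:
        parts.append("DELETE DATA {\n  " + " .\n  ".join(deleted) + " .\n}")
    if inserted:
        parts.append("INSERT DATA {\n  " + " .\n  ".join(inserted) + " .\n}")
    # Apply deletes before inserts, as a single update request.
    return " ;\n".join(parts)
```

[The resulting string would then be POSTed to whichever store's update endpoint is in use — the point being that nothing in it is Blazegraph-specific.]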
>>> Does that mean that we could integrate the RDF stream into our setup
>>> for keeping our Wikidata instance up to date, for instance?
>>
>> That data stream isn't exposed publicly. There are a few tricky parts
>> about the stream needing to be synchronized with a specific Wikidata
>> dump that make it not entirely trivial to reuse outside of our internal
>> use case. But if there is enough interest, we could potentially work on
>> making that stream public.
>>
>> I suspect there's broad interest in this matter since it contributes to
>> the overarching issue of loose coupling in Wikidata's underlying
>> infrastructure.
>>
>> For starters, offering a public stream would be very useful to 3rd-party
>> Wikidata hosts.
>>
>>> The UI is mostly backend independent. It relies on Search for some
>>> features. And of course, the queries themselves might depend on
>>> Blazegraph-specific features.
>>>
>>> Can WDQS, based on what's stated above, work with a generic SPARQL
>>> back-end like Virtuoso, for instance? By that I mean dispatch SPARQL
>>> queries input by a user (without alteration) en route to server
>>> processing?
>>
>> The WDQS UI is managed by WMDE; my knowledge is limited. Maybe someone
>> from WMDE could jump in and add more context. That being said, as far as
>> I know, pointing it to a different backend is just a configuration
>> option. Feel free to have a look at the code
>> (https://gerrit.wikimedia.org/g/wikidata/query/gui).
>>
>> I'll take a look.
>>
>> It should be reasonably easy to deploy another WDQS UI instance
>> somewhere else, which points to whatever backend you'd like.
>>
>> Okay, I assume that in the current state it would be sending
>> Blazegraph-specific SPARQL?
>
> Again, not my area of expertise, but I assume that the UI itself is
> issuing fairly standard SPARQL. Of course, user queries will use whatever
> they want.
> It does have dependencies on our Search interface as well, so that would
> have to be replicated.
>
> You mean WDQS has a Text Search interface component that's intertwined
> with the Query Service provided by the Wikidata SPARQL Endpoint?
>
>> As a policy, we don't send traffic to any third party, so we will not
>> set up such an instance.
>>
>>> The exposed SPARQL endpoint is at the moment a direct exposition of the
>>> Blazegraph endpoint, so it does expose all the Blazegraph-specific
>>> features and quirks.
>>>
>>> Is there a Query Service that's separated from the Blazegraph endpoint?
>>> The crux of the matter here is that WDQS benefits more by being loosely
>>> bound to endpoints rather than tightly bound to the Blazegraph
>>> endpoint.
>>
>> It depends what you mean by Query Service. My definition of a Query
>> Service in this context is a SPARQL endpoint with a specific data set.
>>
>> Yes, but in the case of Wikidata that's a combination of both a SPARQL
>> Query Service (query processor and endpoint) and WDQS query solution
>> rendering services.
>>
>> That SPARQL endpoint at the moment is Blazegraph. I'm not entirely clear
>> what kind of loose binding you'd like to see in this context. We might
>> have different definitions of the same words here.
>>
>> Loose coupling, in the context I am describing, would comprise the
>> following:
>>
>> 1. WDQS that can be bolted on to any SPARQL endpoint, just like YASGUI
>>    <https://github.com/TriplyDB/Yasgui#this>
>
> In this context, I would say "WDQS UI can be bolted to any SPARQL
> endpoint". In terms of SPARQL itself, that should already be mostly the
> case. I think there is a dependency on Search as well.
>
> As per my earlier comment, I don't quite understand what you are
> referring to regarding the Search (Free Text Querying) intermingling.
> Does this relate to SPARQL Query Patterns comprising literal objects?
> If so, WDQS should be able to constrain such behavior to Blazegraph
> instances -- by way of configuration that informs introspection.

WDQS UI relies on a Search endpoint (backed by Elasticsearch) for
autocompletion. The requirements of low latency and reasonable ranking are
something that Elasticsearch (or another Search-oriented backend) does
really well. But I would not expect an RDF backend to offer good ranking
heuristics.

>> 2. Near real-time data streams usable by 3rd-party Wikidata hosts
>>
>> With the above in place, the cost and burden associated with Wikidata
>> hosting will also be reduced -- courtesy of federation.
>
> Could you please open a Phabricator task to document what you would like
> to see exposed and why it would be useful?
>
> Okay, when I (or someone else) get a moment.
>
> --
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Home Page: http://www.openlinksw.com
> Community Support: https://community.openlinksw.com
> Weblogs (Blogs):
> Company Blog: https://medium.com/openlink-software-blog
> Virtuoso Blog: https://medium.com/virtuoso-blog
> Data Access Drivers Blog: https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
>
> Personal Weblogs (Blogs):
> Medium Blog: https://medium.com/@kidehen
> Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
> http://kidehen.blogspot.com
>
> Profile Pages:
> Pinterest: https://www.pinterest.com/kidehen/
> Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
> Twitter: https://twitter.com/kidehen
> Google+: https://plus.google.com/+KingsleyIdehen/about
> LinkedIn: http://www.linkedin.com/in/kidehen
>
> Web Identities (WebID):
> Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
> : http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
>
> _______________________________________________
> Wikidata mailing list -- [email protected]
> Public archives at https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/CDDDQC7VKFDWMASFXUZ32MU6SAQB2QFQ/
> To unsubscribe send an email to [email protected]

--
Guillaume Lederrey (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
_______________________________________________
Wikidata mailing list -- [email protected]
Public archives at https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/R3LKHYJC4QKQV4S3TFJ7FVUBTOOYH2R2/
To unsubscribe send an email to [email protected]
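[A practical footnote on the portability point discussed in the thread: because the SPARQL 1.1 Protocol standardizes how queries are sent over HTTP, a client written for one endpoint works against any other conformant endpoint (Blazegraph, Virtuoso, ...) by changing only the URL. A minimal sketch — the endpoint URL below is a placeholder, not a real service:]

```python
import urllib.parse
import urllib.request

def sparql_query_request(endpoint_url, query):
    """Build a SPARQL 1.1 Protocol query request (POST, form-encoded).

    Any conformant endpoint accepts this shape; swapping backends
    means changing only endpoint_url.
    """
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )

# Placeholder URL; a real client would pass the request to
# urllib.request.urlopen() and parse the JSON result bindings.
req = sparql_query_request(
    "https://example.org/sparql",
    "SELECT ?item WHERE { ?item ?p ?o } LIMIT 1",
)
```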
