[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

Kingsley Idehen via Wikidata Thu, 23 Feb 2023 13:57:33 -0800


On 2/23/23 3:09 PM, Guillaume Lederrey wrote:

On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen <[email protected]> wrote:



    On 2/22/23 3:28 AM, Guillaume Lederrey wrote:

    On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata
    <[email protected]> wrote:


        On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
        > Hello all!
        >
        > TL;DR: We expect to successfully complete the recent data reload on
        > Wikidata Query Service soon, but we've encountered multiple failures
        > related to the size of the graph, and anticipate that this issue may
        > worsen in the future. Although we succeeded this time, we cannot
        > guarantee that future reload attempts will be successful given the
        > current trend of the data reload process. Thank you for your
        > understanding and patience.
        >
        > Longer version:
        >
        > WDQS is updated from a stream of recent changes on Wikidata, with a
        > maximum delay of ~2 minutes. This process was improved as part of the
        > WDQS Streaming Updater project to ensure data coherence[1]. However,
        > the update process is still imperfect and can lead to data
        > inconsistencies in some cases[2][3]. To address this, we reload the
        > data from dumps a few times per year to reinitialize the system from a
        > known good state.
        >
        > The recent reload of data from dumps started in mid-December and was
        > initially met with some issues related to download and instabilities
        > in Blazegraph, the database used by WDQS[4]. Loading the data into
        > Blazegraph takes a couple of weeks due to the size of the graph, and
        > we had multiple attempts where the reload failed after >90% of the
        > data had been loaded. Our understanding of the issue is that a "race
        > condition" in Blazegraph[5], where subtle timing changes lead to
        > corruption of the journal in some rare cases, is to blame.[6]
        >
        > We want to reassure you that the last reload job was successful on one
        > of our servers. The data still needs to be copied over to all of the
        > WDQS servers, which will take a couple of weeks, but should not bring
        > any additional issues. However, reloading the full data from dumps is
        > becoming more complex as the data size grows, and we wanted to let you
        > know why the process took longer than expected. We understand that
        > data inconsistencies can be problematic, and we appreciate your
        > patience and understanding while we work to ensure the quality and
        > consistency of the data on WDQS.
        >
        > Thank you for your continued support and understanding!
        >
        >
        >     Guillaume
        >
        >
        > [1] https://phabricator.wikimedia.org/T244590
        > [2] https://phabricator.wikimedia.org/T323239
        > [3] https://phabricator.wikimedia.org/T322869
        > [4] https://phabricator.wikimedia.org/T323096
        > [5] https://en.wikipedia.org/wiki/Race_condition#In_software
        > [6] https://phabricator.wikimedia.org/T263110
        >
        Hi Guillaume,

        Are there plans to decouple WDQS from the back-end database? Doing
        that provides a more resilient architecture for Wikidata as a whole,
        since you will be able to swap and interchange SPARQL-compliant
        backends.


    It depends what you mean by decoupling. The coupling points as I
    see them are:

    * update process
    * UI
    * exposed SPARQL endpoint

    The update process is mostly decoupled from the backend. It
    produces a stream of RDF updates that is backend independent,
    with a very thin Blazegraph-specific adapter to load the data
    into Blazegraph.
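
    To illustrate why such an adapter can stay thin, here is a minimal
    sketch that renders one change event as a standard SPARQL 1.1 Update
    request, which any compliant backend should accept. The event shape
    (a set of removed and a set of added triples, with terms already
    serialized in SPARQL syntax) and the function name are assumptions
    made for illustration; the actual format of the WDQS update stream is
    not described in this thread.

```python
from typing import Iterable, Tuple

# A triple whose terms are already serialized in SPARQL syntax,
# e.g. "<http://example.org/s>" or '"label"@en' (assumed input shape).
Triple = Tuple[str, str, str]


def to_sparql_update(removed: Iterable[Triple], added: Iterable[Triple]) -> str:
    """Render one hypothetical change event as a SPARQL 1.1 Update request."""

    def block(keyword: str, triples: Iterable[Triple]) -> str:
        lines = [f"  {s} {p} {o} ." for s, p, o in triples]
        if not lines:
            return ""  # nothing to delete or insert: emit no operation
        return f"{keyword} DATA {{\n" + "\n".join(lines) + "\n}"

    # Deletions go first so that a changed value is replaced, not duplicated.
    parts = [block("DELETE", removed), block("INSERT", added)]
    return " ;\n".join(part for part in parts if part)


if __name__ == "__main__":
    print(to_sparql_update(
        removed=[("<http://example.org/s>", "<http://example.org/p>", '"old"')],
        added=[("<http://example.org/s>", "<http://example.org/p>", '"new"')],
    ))
```

    Note that SPARQL Update does not allow blank nodes in DELETE DATA, so
    a real adapter would need to handle them separately.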



    Does that mean that we could integrate the RDF stream into our
    setup re keeping our Wikidata instance up to date, for instance?

That data stream isn't exposed publicly. There are a few tricky parts about the stream needing to be synchronized with a specific Wikidata dump that make it not entirely trivial to reuse outside of our internal use case. But if there is enough interest, we could potentially work on making that stream public.

I suspect there's broad interest in this matter since it contributes to the overarching issue of loose-coupling re Wikidata's underlying infrastructure.

For starters, offering a public stream would be very useful to 3rd party Wikidata hosts.

    The UI is mostly backend independent. It relies on Search for
    some features. And of course, the queries themselves might depend
    on Blazegraph-specific features.

    Can WDQS, based on what's stated above, work with a generic SPARQL
    back-end like Virtuoso, for instance? By that I mean dispatching
    SPARQL queries input by a user (without alteration) en route to
    server processing?

The WDQS UI is managed by WMDE, so my knowledge of it is limited. Maybe someone from WMDE could jump in and add more context. That being said, as far as I know, pointing it to a different backend is just a configuration option. Feel free to have a look at the code (https://gerrit.wikimedia.org/g/wikidata/query/gui).



I'll take a look.

It should be reasonably easy to deploy another WDQS UI instance somewhere else, which points to whatever backend you'd like.

Okay, I assume that in the current state it would be sending Blazegraph-specific SPARQL?

As a policy, we don't send traffic to any third party, so we will not set up such an instance.

    The exposed SPARQL endpoint is at the moment a direct exposure
    of the Blazegraph endpoint, so it does expose all the
    Blazegraph-specific features and quirks.

    Is there a Query Service that's separated from the Blazegraph
    endpoint? The crux of the matter here is that WDQS benefits more
    from being loosely bound to endpoints than from being tightly
    bound to the Blazegraph endpoint.

It depends what you mean by Query Service. My definition of a Query Service in this context is a SPARQL endpoint with a specific data set.

Yes, but in the case of Wikidata that's a combination of both a SPARQL Query Service (query processor and endpoint) and WDQS query solution rendering services.

That SPARQL endpoint at the moment is Blazegraph. I'm not entirely clear what kind of loose binding you'd like to see in this context. We might have different definitions of the same words here.

Loose-coupling, in the context I am describing, would comprise the following:

1. WDQS that can be bolted on to any SPARQL endpoint, just like YASGUI <https://github.com/TriplyDB/Yasgui#this>


2. Near real-time data streams usable by 3rd Party Wikidata hosts

With the above in place, the cost and burden associated with Wikidata hosting will also be reduced -- courtesy of federation.
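
As a rough sketch of point 1, a front end that only speaks the standard SPARQL 1.1 Protocol can be pointed at any compliant endpoint by changing one setting. The endpoint URL, the client name, and the function name below are placeholders chosen for illustration and do not come from the WDQS code base.

```python
import urllib.parse
import urllib.request


def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol GET request.

    The query text is passed through unaltered; only the endpoint differs
    between backends. Assumes the endpoint URL has no query string of its own.
    """
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url,
        headers={
            # Standard media type for SPARQL SELECT/ASK results in JSON.
            "Accept": "application/sparql-results+json",
            "User-Agent": "loose-coupling-sketch/0.1",  # hypothetical client name
        },
    )


# The only backend-specific piece is this one configuration value (placeholder).
ENDPOINT = "https://example.org/sparql"
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"

request = build_sparql_request(ENDPOINT, QUERY)
# Sending it would be: urllib.request.urlopen(request)  (network call, not run here)
```

Whether a given query then succeeds still depends on it avoiding backend-specific extensions, which is the coupling discussed elsewhere in this thread.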



    What we would like to do at some point (this is not more than a
    rough idea at this point) is to add a proxy in front of the
    SPARQL endpoint, that would filter specific SPARQL features, so
    that we limit what is available to a standard set of features
    available across most potential backends. This would help reduce
    the coupling of queries with the backend. Of course, this would
    have the drawback of limiting the feature set.
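
    A minimal sketch of the kind of check such a proxy might perform,
    assuming it simply rejects queries that reference Blazegraph's own
    namespaces (the bd:, hint: and gas: prefixes, all rooted at
    http://www.bigdata.com/). This is a naive text heuristic, not a
    SPARQL parser: it can misfire on string literals, and the prefix
    list is an assumption that is not exhaustive.

```python
import re
from typing import List

# Full IRIs in Blazegraph's vendor namespace.
VENDOR_IRI = re.compile(r"<http://www\.bigdata\.com/[^>]*>", re.IGNORECASE)
# Prefixed names using the conventional Blazegraph prefixes (assumed list).
VENDOR_PREFIXED = re.compile(r"(?<![\w:-])(hint|bd|gas):[A-Za-z_]")


def vendor_features(query: str) -> List[str]:
    """Return the vendor-specific IRIs and prefixes found in the query text."""
    found = {m.group(0) for m in VENDOR_IRI.finditer(query)}
    found |= {m.group(1) + ":" for m in VENDOR_PREFIXED.finditer(query)}
    return sorted(found)


def is_portable(query: str) -> bool:
    """True when the heuristic finds no Blazegraph-specific references."""
    return not vendor_features(query)


if __name__ == "__main__":
    hinted = 'SELECT ?s WHERE { hint:Query hint:optimizer "None" . ?s ?p ?o }'
    print(vendor_features(hinted))
    print(is_portable("SELECT * WHERE { ?s ?p ?o } LIMIT 1"))
```

    A real filter would more likely parse the query and walk its algebra,
    but the sketch shows the trade-off mentioned above: anything flagged
    here would stop working for users.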

As you've stated, that's narrowing service focus rather than diffusing service burden :)



Kingsley


    I'm not sure I entirely understood the question, please let me
    know if my answer is missing the point.

      Have fun!

        Guillaume

-- Regards,


    Kingsley Idehen     
    Founder & CEO
    OpenLink Software
    Home Page:http://www.openlinksw.com
    Community Support:https://community.openlinksw.com
    Weblogs (Blogs):
    Company Blog:https://medium.com/openlink-software-blog
    Virtuoso Blog:https://medium.com/virtuoso-blog
    Data Access Drivers Blog:https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

    Personal Weblogs (Blogs):
    Medium Blog:https://medium.com/@kidehen
    Legacy Blogs:http://www.openlinksw.com/blog/~kidehen/
                   http://kidehen.blogspot.com

    Profile Pages:
    Pinterest:https://www.pinterest.com/kidehen/
    Quora:https://www.quora.com/profile/Kingsley-Uyi-Idehen
    Twitter:https://twitter.com/kidehen
    Google+:https://plus.google.com/+KingsleyIdehen/about
    LinkedIn:http://www.linkedin.com/in/kidehen

    Web Identities (WebID):
    Personal:http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
             
:http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this



--
        *Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>




_______________________________________________
Wikidata mailing list -- [email protected]
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/HXW33SMNVHP4U27WZZZRM4J56BA6RA2Q/
To unsubscribe send an email to [email protected]
