[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

Kingsley Idehen via Wikidata Fri, 24 Feb 2023 10:31:36 -0800


On 2/24/23 5:59 AM, Guillaume Lederrey wrote:

On Thu, 23 Feb 2023 at 22:56, Kingsley Idehen <[email protected]> wrote:



    On 2/23/23 3:09 PM, Guillaume Lederrey wrote:

    On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen
    <[email protected]> wrote:


        On 2/22/23 3:28 AM, Guillaume Lederrey wrote:

        On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata
        <[email protected]> wrote:


            On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
            > Hello all!
            >
            > TL;DR: We expect to successfully complete the recent
            data reload on
            > Wikidata Query Service soon, but we've encountered
            multiple failures
            > related to the size of the graph, and anticipate that
            this issue may
            > worsen in the future. Although we succeeded this time,
            we cannot
            > guarantee that future reload attempts will be
            successful given the
            > current trend of the data reload process. Thank you
            for your
            > understanding and patience.
            >
            > Longer version:
            >
            > WDQS is updated from a stream of recent changes on
            Wikidata, with a
            > maximum delay of ~2 minutes. This process was improved
            as part of the
            > WDQS Streaming Updater project to ensure data
            coherence[1]. However,
            > the update process is still imperfect and can lead to
            data
            > inconsistencies in some cases[2][3]. To address this,
            we reload the
            > data from dumps a few times per year to reinitialize
            the system from a
            > known good state.
            >
            > The recent reload of data from dumps started in
            mid-December and was
            > initially met with some issues related to download and
            instabilities
            > in Blazegraph, the database used by WDQS[4]. Loading
            the data into
            > Blazegraph takes a couple of weeks due to the size of
            the graph, and
            > we had multiple attempts where the reload failed after
            >90% of the
            > data had been loaded. Our understanding of the issue
            is that a "race
            > condition" in Blazegraph[5], where subtle timing
            changes lead to
            > corruption of the journal in some rare cases, is to
            blame.[6]
            >
            > We want to reassure you that the last reload job was
            successful on one
            > of our servers. The data still needs to be copied over
            to all of the
            > WDQS servers, which will take a couple of weeks, but
            should not bring
            > any additional issues. However, reloading the full
            data from dumps is
            > becoming more complex as the data size grows, and we
            wanted to let you
            > know why the process took longer than expected. We
            understand that
            > data inconsistencies can be problematic, and we
            appreciate your
            > patience and understanding while we work to ensure the
            quality and
            > consistency of the data on WDQS.
            >
            > Thank you for your continued support and understanding!
            >
            >
            >     Guillaume
            >
            >
            > [1] https://phabricator.wikimedia.org/T244590
            > [2] https://phabricator.wikimedia.org/T323239
            > [3] https://phabricator.wikimedia.org/T322869
            > [4] https://phabricator.wikimedia.org/T323096
            > [5]
            https://en.wikipedia.org/wiki/Race_condition#In_software
            > [6] https://phabricator.wikimedia.org/T263110
            >
            Hi Guillaume,

            Are there plans to decouple WDQS from the back-end
            database? Doing that
            provides more resilient architecture for Wikidata as a
            whole since you
            will be able to swap and interchange SPARQL-compliant
            backends.


        It depends what you mean by decoupling. The coupling points
        as I see them are:

        * update process
        * UI
        * exposed SPARQL endpoint

        The update process is mostly decoupled from the backend. It
        produces a stream of RDF updates that is backend
        independent, with a very thin Blazegraph-specific adapter to
        load the data into Blazegraph.
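
A minimal sketch of the "backend-independent stream plus thin adapter" idea described above. Everything here is hypothetical and for illustration only: the `RdfUpdate` shape, the `BackendAdapter` interface, and the in-memory adapter are assumptions, not the actual WDQS Streaming Updater code or its event format.

```python
from dataclasses import dataclass, field
from typing import Protocol

# A triple as three already-serialized RDF terms, e.g. ("<s>", "<p>", '"o"').
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class RdfUpdate:
    """One backend-independent change event (hypothetical shape)."""
    deleted: tuple[Triple, ...] = ()
    inserted: tuple[Triple, ...] = ()


class BackendAdapter(Protocol):
    """The only backend-specific piece: how to apply one update."""
    def apply(self, update: RdfUpdate) -> None: ...


def to_sparql_update(update: RdfUpdate) -> str:
    """Render an update as standard SPARQL 1.1 Update text."""
    def block(keyword: str, triples: tuple[Triple, ...]) -> str:
        if not triples:
            return ""
        body = " ".join(f"{s} {p} {o} ." for s, p, o in triples)
        return f"{keyword} DATA {{ {body} }}"

    parts = [block("DELETE", update.deleted), block("INSERT", update.inserted)]
    return " ;\n".join(part for part in parts if part)


@dataclass
class InMemoryAdapter:
    """Stand-in backend for local experiments: a plain set of triples."""
    triples: set[Triple] = field(default_factory=set)

    def apply(self, update: RdfUpdate) -> None:
        # Apply deletions first, then insertions, as one logical step.
        self.triples.difference_update(update.deleted)
        self.triples.update(update.inserted)
```

In a layout like this, an adapter for a given store would only need to implement `apply`, for example by sending the text from `to_sparql_update` to that store's update endpoint.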



        Does that mean that we could integrate the RDF stream into
        our setup re keeping our Wikidata instance up to date, for
        instance?

    That data stream isn't exposed publicly. There are a few tricky
    parts about the stream needing to be synchronized with a specific
    Wikidata dump that make it not entirely trivial to reuse outside
    of our internal use case. But if there is enough interest, we
    could potentially work on making that stream public.



    I suspect there's broad interest in this matter since it
    contributes to the overarching issue of loose-coupling re
    Wikidata's underlying infrastructure.

    For starters, offering a public stream would be very useful to 3rd
    party Wikidata hosts.


        The UI is mostly backend independent. It relies on Search
        for some features. And of course, the queries themselves
        might depend on Blazegraph-specific features.



        Can WDQS, based on what's stated above, work with a generic
        SPARQL back-end like Virtuoso, for instance? By that I mean
        dispatch SPARQL queries input by a user (without alteration)
        en route to server processing?

    The WDQS UI is managed by WMDE, so my knowledge is limited. Maybe
    someone from WMDE could jump in and add more context. That being
    said, as far as I know, pointing it to a different backend is
    just a configuration option. Feel free to have a look at the code
    (https://gerrit.wikimedia.org/g/wikidata/query/gui).



    I'll take a look.

    It should be reasonably easy to deploy another WDQS UI instance
    somewhere else, which points to whatever backend you'd like.
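
To illustrate why a client can in principle be pointed at "whatever backend you'd like": under the SPARQL 1.1 Protocol, a query is just an HTTP request carrying a `query` parameter. The sketch below builds such a request for a configurable endpoint. The endpoint URL and the function name are placeholders chosen for this example; this is not the WDQS UI's actual configuration mechanism.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; swap in any SPARQL 1.1 Protocol endpoint.
DEFAULT_ENDPOINT = "https://example.org/sparql"


def build_query_request(query: str, endpoint: str = DEFAULT_ENDPOINT) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request.

    The query text is passed through unaltered; only the endpoint
    changes between backends.
    """
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    request = build_query_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
    print(request.full_url)
```

Passing the result to `urllib.request.urlopen` would actually send it. Whether a given query succeeds still depends on the backend supporting every feature the query uses.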



    Okay, I assume that in the current state it would be sending
    Blazegraph-specific SPARQL?

Again, not my area of expertise, but I assume that the UI itself is issuing fairly standard SPARQL. Of course, user queries will use whatever they want. It does have dependencies on our Search interface as well, so that would have to be replicated.

You mean WDQS has a Text Search interface component that's intertwined with the Query Service provided by the Wikidata SPARQL Endpoint?

    As a policy, we don't send traffic to any third party, so we will
    not set up such an instance.


        The exposed SPARQL endpoint is at the moment a direct
        exposure of the Blazegraph endpoint, so it does expose all
        the Blazegraph-specific features and quirks.
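
As a rough illustration of what backend-specific features can look like in practice, the sketch below flags a few constructs often seen in WDQS queries that are not part of standard SPARQL 1.1: the `wikibase:label` and `wikibase:mwapi` services, and terms in Blazegraph's `bd:` and `hint:` namespaces. The pattern list is illustrative and far from exhaustive, the check is a naive text match rather than a SPARQL parser, and the function name is made up for this example.

```python
import re

# Illustrative, non-exhaustive patterns for non-standard constructs.
VENDOR_PATTERNS = {
    "label service": re.compile(r"SERVICE\s+wikibase:label", re.IGNORECASE),
    "MediaWiki API service": re.compile(r"SERVICE\s+wikibase:mwapi", re.IGNORECASE),
    "Blazegraph bd: namespace": re.compile(r"\bbd:\w+"),
    "Blazegraph query hint": re.compile(r"\bhint:\w+"),
}


def vendor_features(query: str) -> list[str]:
    """Return the names of vendor-specific constructs found in a query.

    Naive by design: it may report matches inside string literals
    or comments, since it does not parse the query.
    """
    return [name for name, pattern in VENDOR_PATTERNS.items() if pattern.search(query)]


if __name__ == "__main__":
    sample = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    """
    print(vendor_features(sample))
```

A check along these lines could help estimate how many existing queries would need rewriting before they run unchanged on a different SPARQL backend.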



        Is there a Query Service that's separated from the Blazegraph
        endpoint? The crux of the matter here is that WDQS benefits
        more by being loosely bound to endpoints rather than
        tightly-bound to the Blazegraph endpoint.

    It depends what you mean by Query Service. My definition of a
    Query Service in this context is a SPARQL endpoint with a
    specific data set.



    Yes, but in the case of Wikidata that's a combination of both a
    SPARQL Query Service (query processor and endpoint) and WDQS query
    solution rendering services.

    That SPARQL endpoint at the moment is Blazegraph. I'm not
    entirely clear what kind of loose coupling you'd like to see in this
    context. We might have different definitions of the same words here.



    Loose-coupling, in the context I am describing, would comprise the
    following:

    1. WDQS that can be bolted on to any SPARQL endpoint, just like
    YASGUI <https://github.com/TriplyDB/Yasgui#this>

In this context, I would say "WDQS UI can be bolted to any SPARQL endpoint". In terms of SPARQL itself, that should already be mostly the case. I think there is a dependency on Search as well.

As per my earlier comment, I don't quite understand what you are referring to regarding the Search (Free Text Querying) intermingling. Does this relate to SPARQL Query Patterns comprising literal objects? If so, WDQS should be able to constrain such behavior to Blazegraph instances -- by way of configuration that informs introspection.

    2. Near real-time data streams usable by 3rd Party Wikidata hosts

    With the above in place, the cost and burden associated with
    Wikidata hosting will also be reduced -- courtesy of federation.

Could you please open a Phabricator task to document what you would like to see exposed and why it would be useful?



Okay, when I (or someone else) get a moment.

--
Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Home Page:http://www.openlinksw.com
Community Support:https://community.openlinksw.com
Weblogs (Blogs):
Company Blog:https://medium.com/openlink-software-blog
Virtuoso Blog:https://medium.com/virtuoso-blog
Data Access Drivers Blog:https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

Personal Weblogs (Blogs):
Medium Blog:https://medium.com/@kidehen
Legacy Blogs:http://www.openlinksw.com/blog/~kidehen/
              http://kidehen.blogspot.com

Profile Pages:
Pinterest:https://www.pinterest.com/kidehen/
Quora:https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter:https://twitter.com/kidehen
Google+:https://plus.google.com/+KingsleyIdehen/about
LinkedIn:http://www.linkedin.com/in/kidehen

Web Identities (WebID):
Personal:http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
        :http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this


_______________________________________________
Wikidata mailing list -- [email protected]
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/CDDDQC7VKFDWMASFXUZ32MU6SAQB2QFQ/
To unsubscribe send an email to [email protected]
