On Fri, Feb 7, 2020 at 2:54 PM Marco Neumann <[email protected]> wrote:
> thank you Guillaume, when do you expect a public update on the security
> incident [1]? Is any of our personal and private data (email, password
> etc.) affected?

It should be made public in the next few days. I'm not going to go into
any more details until this is made public, but overall, don't worry too
much.

> best,
> Marco
>
> [1] https://phabricator.wikimedia.org/T241410
>
> On Fri, Feb 7, 2020 at 1:33 PM Guillaume Lederrey <[email protected]>
> wrote:
>
>> Hello all!
>>
>> First of all, my apologies for the long silence. We need to do better
>> in terms of communication. I'll try my best to send a monthly update
>> from now on. Keep me honest, remind me if I fail.
>>
>> First, we had a security incident at the end of December, which forced
>> us to move from our Kafka-based update stream back to the RecentChanges
>> poller. The details are still private, but you will be able to get the
>> full story soon on Phabricator [1]. The RecentChanges poller is less
>> efficient, and this is leading to high update lag again (just when we
>> thought we had things slightly under control). We tried to mitigate
>> this by improving the parallelism in the updater [2], which helped a
>> bit, but not as much as we need.
>>
>> Another attempt to get update lag under control is to apply back
>> pressure on edits, by adding the WDQS update lag to the Wikidata maxlag
>> [6]. This is obviously less than ideal (at least as long as WDQS
>> updates are lagging as often as they are), but it does allow the
>> service to recover from time to time. We probably need to iterate on
>> this, provide better granularity, and differentiate better between
>> operations that have an impact on update lag and those that don't.
>>
>> On the slightly better news side, we now have a much better
>> understanding of the update process and of its shortcomings. The
>> current process does a full diff between each updated entity and what
>> we have in Blazegraph.
>> Even if a single triple needs to change, we still read tons of data
>> from Blazegraph. While this approach is simple and robust, it is
>> obviously not efficient. We need to rewrite the updater to take a more
>> event-streaming / reactive approach, and only work on the actual
>> changes. This is a big chunk of work, almost a complete rewrite of the
>> updater, and we need a new solution to stream changes with guaranteed
>> ordering (something that our Kafka queues don't offer). This is where
>> we are focusing our energy at the moment, as it looks like the best
>> option to improve the situation in the medium term. This change will
>> probably have some functional impacts [3].
>>
>> Some misc things:
>>
>> We have done some work to get better metrics and a better
>> understanding of what's going on: collecting more metrics during the
>> update [4], loading RDF dumps into Hadoop for further analysis [5], and
>> better logging of SPARQL requests. We are deferring deeper analysis
>> until we are in a more stable situation regarding update lag.
>>
>> We have a new team member working on WDQS. He is still ramping up, but
>> we should have a bit more capacity from now on.
>>
>> Some longer term thoughts:
>>
>> Keeping all of Wikidata in a single graph is most probably not going
>> to work long term. We have not found examples of public SPARQL
>> endpoints with > 10 billion triples, and there is probably a good
>> reason for that. We will probably need to split the graphs at some
>> point. We don't know how yet (that's why we loaded the dumps into
>> Hadoop; that might give us some more insight). We might expose a
>> subgraph with only truthy statements. Or have language-specific graphs,
>> with only language-specific labels. Or something completely different.
>>
>> Keeping WDQS / Wikidata as open as they are at the moment might not be
>> possible in the long term. We need to think about if / how we want to
>> implement some form of authentication and quotas.
>> Potentially increasing quotas for some use cases, but keeping them
>> strict for others. Again, we don't know what this will look like, but
>> we're thinking about it.
>>
>> What you can do to help:
>>
>> Again, we're not sure. Of course, reducing the load (both in terms of
>> edits on Wikidata and of reads on WDQS) will help. But not using those
>> services makes them useless.
>>
>> We suspect that some use cases are more expensive than others (a
>> single property change to a large entity will require a comparatively
>> insane amount of work to update it on the WDQS side). We'd like to have
>> real data on the cost of various operations, but we only have guesses
>> at this point.
>>
>> If you've read this far, thanks a lot for your engagement!
>>
>> Have fun!
>>
>>    Guillaume
>>
>> [1] https://phabricator.wikimedia.org/T241410
>> [2] https://phabricator.wikimedia.org/T238045
>> [3] https://phabricator.wikimedia.org/T244341
>> [4] https://phabricator.wikimedia.org/T239908
>> [5] https://phabricator.wikimedia.org/T241125
>> [6] https://phabricator.wikimedia.org/T221774
>>
>> --
>> Guillaume Lederrey
>> Engineering Manager, Search Platform
>> Wikimedia Foundation
>> UTC+1 / CET
>
> --
> ---
> Marco Neumann
> KONA

--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
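[Editor's note: the maxlag back pressure discussed above relies on the
MediaWiki API convention that bots send a `maxlag` parameter and back off
when the server reports an error with code "maxlag". The sketch below
shows a minimal well-behaved client loop; the helper names and the exact
response handling are illustrative assumptions, not taken from the
message.]

```python
import time

# Sketch of the maxlag convention: a well-behaved bot sends `maxlag=N`
# with each write request. If replication (or, per [6], WDQS update) lag
# exceeds N seconds, the API refuses the request with error code
# "maxlag" and the client is expected to wait and retry.
# Function names and the retry policy here are illustrative.

def is_maxlag_error(response: dict) -> bool:
    """True if the API rejected the request because of lag."""
    return response.get("error", {}).get("code") == "maxlag"

def call_with_backoff(do_request, max_retries=5, wait_seconds=5):
    """Retry a request while the API keeps reporting maxlag errors."""
    for _ in range(max_retries):
        response = do_request()
        if not is_maxlag_error(response):
            return response
        time.sleep(wait_seconds)  # back off so the servers can catch up
    raise RuntimeError("API still lagged after retries")
```

With WDQS lag folded into maxlag [6], a loop like this automatically
throttles edits whenever the query service falls behind, which is exactly
the back pressure described above.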
_______________________________________________ Wikidata mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata
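[Editor's note: to illustrate the update strategy the message describes —
the current updater reads all of an entity's triples from Blazegraph and
diffs them against freshly fetched RDF, even when only one triple
changed. A toy sketch of that diff, with triples modeled as plain tuples;
the names and data are illustrative, not the actual updater code.]

```python
# Toy model of the "full diff" update: triples are
# (subject, predicate, object) tuples. To apply one edit, the updater
# must first read *every* triple for the entity, then compute the
# minimal delete/insert sets against the new RDF.

def diff_entity(old_triples: set, new_triples: set):
    """Return (to_delete, to_insert) needed to turn old into new."""
    to_delete = old_triples - new_triples
    to_insert = new_triples - old_triples
    return to_delete, to_insert
```

Even when the resulting diff is a single triple, `old_triples` (the full
read from Blazegraph) can be huge for large entities — which is why an
event-streaming approach that only touches the actual changes is
attractive.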
