I accept your apology Guillaume, no worries.

Regards,
Marco

On Mon, Feb 10, 2020 at 2:37 PM Guillaume Lederrey <[email protected]>
wrote:

> On Fri, Feb 7, 2020 at 5:18 PM Guillaume Lederrey <[email protected]>
> wrote:
>
>> On Fri, Feb 7, 2020 at 2:54 PM Marco Neumann <[email protected]>
>> wrote:
>>
>>> thank you Guillaume, when do you expect a public update on the security
>>> incident [1]? Is any of our personal and private data (email, password etc)
>>> affected?
>>>
>>
>> It should be made public in the next few days. I'm not going to go into
>> any more details until this is made public, but overall, don't worry too
>> much.
>>
>
> Corrections and apologies on what I said above. We are not actually ready
> to make this ticket public. The underlying issue is under control and does
> not require any user action to mitigate. Given the security aspect, I'm not
> going to do any further communication on this.
>
> Sorry to have been misleading on this.
>
>   Enjoy your day!
>
>      Guillaume
>
>
>>> best,
>>> Marco
>>>
>>> [1] https://phabricator.wikimedia.org/T241410
>>>
>>> On Fri, Feb 7, 2020 at 1:33 PM Guillaume Lederrey <
>>> [email protected]> wrote:
>>>
>>>> Hello all!
>>>>
>>>> First of all, my apologies for the long silence. We need to do better
>>>> in terms of communication. I'll try my best to send a monthly update from
>>>> now on. Keep me honest, remind me if I fail.
>>>>
>>>> First, we had a security incident at the end of December, which forced
>>>> us to move from our Kafka-based update stream back to the RecentChanges
>>>> poller. The details are still private, but you will be able to get the full
>>>> story soon on Phabricator [1]. The RecentChanges poller is less efficient
>>>> and this is leading to high update lag again (just when we thought we had
>>>> things slightly under control). We tried to mitigate this by improving the
>>>> parallelism in the updater [2], which helped a bit, but not as much as we
>>>> need.
>>>>
>>>> Another attempt to get update lag under control is to apply back
>>>> pressure on edits, by adding the WDQS update lag to the Wikidata maxlag
>>>> [6]. This is obviously less than ideal (at least as long as WDQS updates
>>>> are lagging as often as they are), but does allow the service to recover
>>>> from time to time. We probably need to iterate on this, provide better
>>>> granularity, and differentiate better between operations that have an
>>>> impact on update lag and those which don't.
>>>>
>>>> On the slightly better news side, we now have a much better
>>>> understanding of the update process and of its shortcomings. The current
>>>> process does a full diff between each updated entity and what we have in
>>>> Blazegraph. Even if a single triple needs to change, we still read tons of
>>>> data from Blazegraph. While this approach is simple and robust, it is
>>>> obviously not efficient. We need to rewrite the updater to take a more
>>>> event streaming / reactive approach, and only work on the actual changes.
>>>> This is a big chunk of work, almost a complete rewrite of the updater, and
>>>> we need a new solution to stream changes with guaranteed ordering
>>>> (something that our Kafka queues don't offer). This is where we are
>>>> focusing our energy at the moment, as this looks like the best option to
>>>> improve the situation in the medium term. This change will probably have
>>>> some functional impacts [3].
>>>>
>>>> Some misc things:
>>>>
>>>> We have done some work to get better metrics and a better understanding
>>>> of what's going on. This ranges from collecting more metrics during the
>>>> update [4] to loading RDF dumps into Hadoop for further analysis [5] and
>>>> better logging of SPARQL requests. We are not focusing on this analysis
>>>> until we are in a more stable situation regarding update lag.
>>>>
>>>> We have a new team member working on WDQS. He is still ramping up, but
>>>> we should have a bit more capacity from now on.
>>>>
>>>> Some longer term thoughts:
>>>>
>>>> Keeping all of Wikidata in a single graph is most probably not going to
>>>> work long term. We have not found examples of public SPARQL endpoints with
>>>> more than 10 B triples and there is probably a good reason for that. We will
>>>> probably need to split the graphs at some point. We don't know how yet
>>>> (that's why we loaded the dumps into Hadoop, that might give us some more
>>>> insight). We might expose a subgraph with only truthy statements. Or have
>>>> language specific graphs, with only language specific labels. Or something
>>>> completely different.
>>>>
>>>> Keeping WDQS / Wikidata as open as they are at the moment might not be
>>>> possible in the long term. We need to think about whether and how we want
>>>> to implement some form of authentication and quotas, potentially increasing
>>>> quotas for some use cases but keeping them strict for others. Again, we
>>>> don't know what this will look like, but we're thinking about it.
>>>>
>>>> What you can do to help:
>>>>
>>>> Again, we're not sure. Of course, reducing the load (both in terms of
>>>> edits on Wikidata and of reads on WDQS) will help. But not using those
>>>> services makes them useless.
>>>>
>>>> We suspect that some use cases are more expensive than others (a single
>>>> property change to a large entity will require a comparatively insane
>>>> amount of work to update it on the WDQS side). We'd like to have real data
>>>> on the cost of various operations, but we only have guesses at this point.
>>>>
>>>> If you've read this far, thanks a lot for your engagement!
>>>>
>>>>   Have fun!
>>>>
>>>>       Guillaume
>>>>
>>>>
>>>>
>>>>
>>>> [1] https://phabricator.wikimedia.org/T241410
>>>> [2] https://phabricator.wikimedia.org/T238045
>>>> [3] https://phabricator.wikimedia.org/T244341
>>>> [4] https://phabricator.wikimedia.org/T239908
>>>> [5] https://phabricator.wikimedia.org/T241125
>>>> [6] https://phabricator.wikimedia.org/T221774
>>>>
>>>> --
>>>> Guillaume Lederrey
>>>> Engineering Manager, Search Platform
>>>> Wikimedia Foundation
>>>> UTC+1 / CET
>>>> _______________________________________________
>>>> Wikidata mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> ---
>>> Marco Neumann
>>> KONA
>>>
>>>
>>
>>
>> --
>> Guillaume Lederrey
>> Engineering Manager, Search Platform
>> Wikimedia Foundation
>> UTC+1 / CET
>>
>
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+1 / CET
>
-- 


---
Marco Neumann
KONA
