I accept your apology, Guillaume, no worries.

Regards,
Marco
On Mon, Feb 10, 2020 at 2:37 PM Guillaume Lederrey <[email protected]> wrote:

> On Fri, Feb 7, 2020 at 5:18 PM Guillaume Lederrey <[email protected]>
> wrote:
>
>> On Fri, Feb 7, 2020 at 2:54 PM Marco Neumann <[email protected]>
>> wrote:
>>
>>> thank you Guillaume, when do you expect a public update on the security
>>> incident [1]? Is any of our personal and private data (email, password etc)
>>> affected?
>>>
>>
>> It should be made public in the next few days. I'm not going to go into
>> any more details until this is made public, but overall, don't worry too
>> much.
>>
>
> Corrections and apologies on what I said above. We are not actually ready
> to make this ticket public. The underlying issue is under control and does
> not require any user action to mitigate. Given the security aspect, I'm not
> going to do any further communication on this.
>
> Sorry to have been misleading on this.
>
> Enjoy your day!
>
>    Guillaume
>
>
>> best,
>>> Marco
>>>
>>> [1] https://phabricator.wikimedia.org/T241410
>>>
>>> On Fri, Feb 7, 2020 at 1:33 PM Guillaume Lederrey <
>>> [email protected]> wrote:
>>>
>>>> Hello all!
>>>>
>>>> First of all, my apologies for the long silence. We need to do better
>>>> in terms of communication. I'll try my best to send a monthly update from
>>>> now on. Keep me honest, remind me if I fail.
>>>>
>>>> First, we had a security incident at the end of December, which forced
>>>> us to move from our Kafka-based update stream back to the RecentChanges
>>>> poller. The details are still private, but you will be able to get the full
>>>> story soon on Phabricator [1]. The RecentChanges poller is less efficient
>>>> and this is leading to high update lag again (just when we thought we had
>>>> things slightly under control). We tried to mitigate this by improving the
>>>> parallelism in the updater [2], which helped a bit, but not as much as we
>>>> need.
>>>>
>>>> Another attempt to get update lag under control is to apply back
>>>> pressure on edits, by adding the WDQS update lag to the Wikidata maxlag
>>>> [6]. This is obviously less than ideal (at least as long as WDQS updates
>>>> are lagging as often as they are), but does allow the service to recover
>>>> from time to time. We probably need to iterate on this, provide better
>>>> granularity, differentiate better between operations that have an impact on
>>>> update lag and those which don't.
>>>>
>>>> On the slightly better news side, we now have a much better
>>>> understanding of the update process and of its shortcomings. The current
>>>> process does a full diff between each updated entity and what we have in
>>>> Blazegraph. Even if a single triple needs to change, we still read tons of
>>>> data from Blazegraph. While this approach is simple and robust, it is
>>>> obviously not efficient. We need to rewrite the updater to take a more
>>>> event streaming / reactive approach, and only work on the actual changes.
>>>> This is a big chunk of work, almost a complete rewrite of the updater, and
>>>> we need a new solution to stream changes with guaranteed ordering
>>>> (something that our Kafka queues don't offer). This is where we are
>>>> focusing our energy at the moment; this looks like the best option to
>>>> improve the situation in the medium term. This change will probably have
>>>> some functional impacts [3].
>>>>
>>>> Some misc things:
>>>>
>>>> We have done some work to get better metrics and a better understanding
>>>> of what's going on, from collecting more metrics during the update [4] to
>>>> loading RDF dumps into Hadoop for further analysis [5] and better logging
>>>> of SPARQL requests. We are not focusing on this analysis until we are in a
>>>> more stable situation regarding update lag.
>>>>
>>>> We have a new team member working on WDQS. He is still ramping up, but
>>>> we should have a bit more capacity from now on.
>>>>
>>>> Some longer-term thoughts:
>>>>
>>>> Keeping all of Wikidata in a single graph is most probably not going to
>>>> work long term. We have not found examples of public SPARQL endpoints
>>>> with > 10 B triples and there is probably a good reason for that. We will
>>>> probably need to split the graphs at some point. We don't know how yet
>>>> (that's why we loaded the dumps into Hadoop, that might give us some more
>>>> insight). We might expose a subgraph with only truthy statements. Or have
>>>> language specific graphs, with only language specific labels. Or something
>>>> completely different.
>>>>
>>>> Keeping WDQS / Wikidata as open as they are at the moment might not be
>>>> possible in the long term. We need to think about whether and how we want
>>>> to implement some form of authentication and quotas. Potentially increasing
>>>> quotas for some use cases, but keeping them strict for others. Again, we
>>>> don't know what this will look like, but we're thinking about it.
>>>>
>>>> What you can do to help:
>>>>
>>>> Again, we're not sure. Of course, reducing the load (both in terms of
>>>> edits on Wikidata and of reads on WDQS) will help. But not using those
>>>> services makes them useless.
>>>>
>>>> We suspect that some use cases are more expensive than others (a single
>>>> property change to a large entity will require a comparatively insane
>>>> amount of work to update it on the WDQS side). We'd like to have real data
>>>> on the cost of various operations, but we only have guesses at this point.
>>>>
>>>> If you've read this far, thanks a lot for your engagement!
>>>>
>>>> Have fun!
>>>>
>>>>    Guillaume
>>>>
>>>>
>>>> [1] https://phabricator.wikimedia.org/T241410
>>>> [2] https://phabricator.wikimedia.org/T238045
>>>> [3] https://phabricator.wikimedia.org/T244341
>>>> [4] https://phabricator.wikimedia.org/T239908
>>>> [5] https://phabricator.wikimedia.org/T241125
>>>> [6] https://phabricator.wikimedia.org/T221774
>>>>
>>>> --
>>>> Guillaume Lederrey
>>>> Engineering Manager, Search Platform
>>>> Wikimedia Foundation
>>>> UTC+1 / CET
>>>> _______________________________________________
>>>> Wikidata mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>>
>>>
>>> --
>>>
>>> ---
>>> Marco Neumann
>>> KONA
>>>

--

---
Marco Neumann
KONA
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
