On Fri, Feb 7, 2020 at 5:18 PM Guillaume Lederrey <[email protected]> wrote:

> On Fri, Feb 7, 2020 at 2:54 PM Marco Neumann <[email protected]>
> wrote:
>
>> thank you Guillaume, when do you expect a public update on the security
>> incident [1]? Is any of our personal and private data (email, password etc)
>> affected?
>>
>
> It should be made public in the next few days. I'm not going to go into
> any more details until this is made public, but overall, don't worry too
> much.
>

Corrections and apologies for what I said above. We are not actually ready to
make this ticket public. The underlying issue is under control and does not
require any user action to mitigate. Given the security aspect, I'm not going
to do any further communication on this.

Sorry to have been misleading on this.

Enjoy your day!

   Guillaume

>> best,
>> Marco
>>
>> [1] https://phabricator.wikimedia.org/T241410
>>
>> On Fri, Feb 7, 2020 at 1:33 PM Guillaume Lederrey <
>> [email protected]> wrote:
>>
>>> Hello all!
>>>
>>> First of all, my apologies for the long silence. We need to do better in
>>> terms of communication. I'll try my best to send a monthly update from now
>>> on. Keep me honest, remind me if I fail.
>>>
>>> First, we had a security incident at the end of December, which forced
>>> us to move from our Kafka-based update stream back to the RecentChanges
>>> poller. The details are still private, but you will be able to get the full
>>> story soon on Phabricator [1]. The RecentChanges poller is less efficient
>>> and this is leading to high update lag again (just when we thought we had
>>> things slightly under control). We tried to mitigate this by improving the
>>> parallelism in the updater [2], which helped a bit, but not as much as we
>>> need.
>>>
>>> Another attempt to get update lag under control is to apply back
>>> pressure on edits, by adding the WDQS update lag to the Wikidata maxlag
>>> [6].
>>> This is obviously less than ideal (at least as long as WDQS updates
>>> are lagging as often as they are), but does allow the service to recover
>>> from time to time. We probably need to iterate on this, provide better
>>> granularity, and differentiate better between operations that have an
>>> impact on update lag and those which don't.
>>>
>>> On the slightly better news side, we now have a much better
>>> understanding of the update process and of its shortcomings. The current
>>> process does a full diff between each updated entity and what we have in
>>> Blazegraph. Even if a single triple needs to change, we still read tons of
>>> data from Blazegraph. While this approach is simple and robust, it is
>>> obviously not efficient. We need to rewrite the updater to take a more
>>> event streaming / reactive approach, and only work on the actual changes.
>>> This is a big chunk of work, almost a complete rewrite of the updater, and
>>> we need a new solution to stream changes with guaranteed ordering
>>> (something that our Kafka queues don't offer). This is where we are
>>> focusing our energy at the moment; this looks like the best option to
>>> improve the situation in the medium term. This change will probably have
>>> some functional impacts [3].
>>>
>>> Some misc things:
>>>
>>> We have done some work to get better metrics and a better understanding
>>> of what's going on, from collecting more metrics during the update [4] to
>>> loading RDF dumps into Hadoop for further analysis [5] and better logging
>>> of SPARQL requests. We are not focusing on this analysis until we are in a
>>> more stable situation regarding update lag.
>>>
>>> We have a new team member working on WDQS. He is still ramping up, but
>>> we should have a bit more capacity from now on.
>>>
>>> Some longer term thoughts:
>>>
>>> Keeping all of Wikidata in a single graph is most probably not going to
>>> work long term.
>>> We have not found examples of public SPARQL endpoints with > 10 B
>>> triples and there is probably a good reason for that. We will
>>> probably need to split the graphs at some point. We don't know how yet
>>> (that's why we loaded the dumps into Hadoop, that might give us some more
>>> insight). We might expose a subgraph with only truthy statements. Or have
>>> language specific graphs, with only language specific labels. Or something
>>> completely different.
>>>
>>> Keeping WDQS / Wikidata as open as they are at the moment might not be
>>> possible in the long term. We need to think about whether / how we want to
>>> implement some form of authentication and quotas. Potentially increasing
>>> quotas for some use cases, but keeping them strict for others. Again, we
>>> don't know what this will look like, but we're thinking about it.
>>>
>>> What you can do to help:
>>>
>>> Again, we're not sure. Of course, reducing the load (both in terms of
>>> edits on Wikidata and of reads on WDQS) will help. But not using those
>>> services makes them useless.
>>>
>>> We suspect that some use cases are more expensive than others (a single
>>> property change to a large entity will require a comparatively insane
>>> amount of work to update it on the WDQS side). We'd like to have real data
>>> on the cost of various operations, but we only have guesses at this point.
>>>
>>> If you've read this far, thanks a lot for your engagement!
>>>
>>> Have fun!
>>>
>>>    Guillaume
>>>
>>> [1] https://phabricator.wikimedia.org/T241410
>>> [2] https://phabricator.wikimedia.org/T238045
>>> [3] https://phabricator.wikimedia.org/T244341
>>> [4] https://phabricator.wikimedia.org/T239908
>>> [5] https://phabricator.wikimedia.org/T241125
>>> [6] https://phabricator.wikimedia.org/T221774
>>>
>>> --
>>> Guillaume Lederrey
>>> Engineering Manager, Search Platform
>>> Wikimedia Foundation
>>> UTC+1 / CET
>>
>> ---
>> Marco Neumann
>> KONA

--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
