Hello Guillaume,

On Fri, Feb 7, 2020 at 14:33, Guillaume Lederrey <[email protected]> wrote:
>
> Hello all!
>
> First of all, my apologies for the long silence. We need to do better in
> terms of communication. I'll try my best to send a monthly update from now
> on. Keep me honest, remind me if I fail.
>
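Before replying point by point: the maxlag back pressure you describe below is something bot authors can cooperate with from the client side, by sending a `maxlag` parameter with every write and backing off when the API rejects the request for lag. A minimal sketch, assuming the documented MediaWiki maxlag error format (the `do_edit` callable is a hypothetical stand-in for a real edit call, and the default wait is arbitrary):

```python
import time

API = "https://www.wikidata.org/w/api.php"  # Wikidata API endpoint

def maxlag_backoff(response, default_wait=5.0):
    """Return how many seconds to wait before retrying, or None if the
    request was not rejected for lag.

    `response` is the parsed JSON body of an API call made with a
    `maxlag=N` parameter; MediaWiki rejects such calls with an error
    whose code is "maxlag" when the lag exceeds N seconds."""
    error = response.get("error")
    if not error or error.get("code") != "maxlag":
        return None
    # The error body usually reports the current lag; fall back to a
    # fixed wait if the field is absent.
    lag = error.get("lag")
    return float(lag) if lag is not None else default_wait

def edit_with_backpressure(do_edit, max_retries=5):
    """Run `do_edit` (a callable returning the parsed API response),
    sleeping and retrying whenever the API asks us to back off."""
    for _ in range(max_retries):
        response = do_edit()
        wait = maxlag_backoff(response)
        if wait is None:
            return response
        time.sleep(wait)
    raise RuntimeError("gave up: service is persistently lagged")
```

Of course this only helps if well-behaved clients are the majority of the edit load, which ties into your point about quotas further down.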
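The cost of the full-diff update you describe below is easy to make concrete with a toy model, triples as plain tuples (the entity IDs and sizes are purely illustrative; in the real updater the stored set is an expensive read of everything Blazegraph knows about the entity):

```python
def full_diff_update(stored, fresh):
    """Naive diff-based update: compare *every* triple stored for an
    entity against the freshly generated set, even if only one triple
    actually changed.

    `stored` and `fresh` are sets of (subject, predicate, object) tuples."""
    to_delete = stored - fresh   # triples no longer present
    to_insert = fresh - stored   # triples that are new
    return to_delete, to_insert

# A large entity with thousands of triples...
stored = {("Q42", "P31", f"Q{i}") for i in range(10_000)}
# ...where a single triple changed:
fresh = (stored - {("Q42", "P31", "Q0")}) | {("Q42", "P31", "Q99999")}

to_delete, to_insert = full_diff_update(stored, fresh)
# One deletion and one insertion come out, but computing them required
# reading all 10,000 stored triples first.
```

The event-streaming rewrite would work only from `to_delete` / `to_insert` carried in the change events themselves, skipping the full read.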
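On the ordering point you raise below: Kafka does guarantee order *within* a partition, so edits keyed by entity stay ordered relative to each other; what it does not give you is a global order across partitions, which is what a single consistent change stream would need. A toy illustration of keyed partitioning (the md5-based scheme is illustrative, not Kafka's actual default partitioner):

```python
import hashlib

def partition_for(key, num_partitions):
    """Keyed partitioning in the style of Kafka: every event with the
    same key lands in the same partition. (Kafka's real default
    partitioner uses murmur2; md5 here is just for illustration.)"""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Three edits produced in this global order:
produced = [("Q1", "edit-1"), ("Q2", "edit-2"), ("Q1", "edit-3")]

partitions = {p: [] for p in range(2)}
for key, event in produced:
    partitions[partition_for(key, 2)].append((key, event))

# All Q1 edits sit in one partition, in production order, but a
# consumer reading partitions independently gets no guarantee about
# the global interleaving of Q1 edits versus Q2 edits.
q1_stream = [e for k, e in partitions[partition_for("Q1", 2)] if k == "Q1"]
```

So per-entity ordering is achievable with keyed topics; it is the cross-entity total order that has no easy distributed solution.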
It would be nice to have some feedback on my grant request at:
https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
or on one of the other threads on this very same mailing list.

> Another attempt to get update lag under control is to apply back pressure on
> edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is
> obviously less than ideal (at least as long as WDQS updates are lagging as
> often as they are), but does allow the service to recover from time to time.
> We probably need to iterate on this, provide better granularity,
> differentiate better between operations that have an impact on update lag and
> those which don't.
>
> On the slightly better news side, we now have a much better understanding of
> the update process and of its shortcomings. The current process does a full
> diff between each updated entity and what we have in Blazegraph. Even if a
> single triple needs to change, we still read tons of data from Blazegraph.
> While this approach is simple and robust, it is obviously not efficient. We
> need to rewrite the updater to take a more event streaming / reactive
> approach, and only work on the actual changes.

Even when that is done, it will still be a short-term solution.

> This is a big chunk of work, almost a complete rewrite of the updater,
> and we need a new solution to stream changes with guaranteed ordering
> (something that our Kafka queues don't offer). This is where we are focusing
> our energy at the moment; this looks like the best option to improve the
> situation in the medium term. This change will probably have some functional
> impacts [3].

Guaranteed ordering in a multi-party distributed setting has no easy
solution, and apparently it is not provided by Kafka. For a non-technical
introduction to why this is hard, see
https://en.wikipedia.org/wiki/Two_Generals%27_Problem

> Some longer term thoughts:
>
> Keeping all of Wikidata in a single graph is most probably not going to work
> long term.
:(

> We have not found examples of public SPARQL endpoints with > 10 B triples and
> there is probably a good reason for that.

Because Wikimedia is the only non-profit in the field?

> We will probably need to split the graphs at some point.

:(

> We don't know how yet

:(

> (that's why we loaded the dumps into Hadoop, that might give us some more
> insight).

:(

> We might expose a subgraph with only truthy statements. Or have
> language-specific graphs, with only language-specific labels.

:(

> Or something completely different.

:)

> Keeping WDQS / Wikidata as open as they are at the moment might not be
> possible in the long term. We need to think if / how we want to implement
> some form of authentication and quotas.

With blacklists and whitelists, but this is a huge undertaking anyway.

> Potentially increasing quotas for some use cases, but keeping them strict for
> others. Again, we don't know what this will look like, but we're thinking
> about it.
>
> What you can do to help:
>
> Again, we're not sure. Of course, reducing the load (both in terms of edits
> on Wikidata and of reads on WDQS) will help. But not using those services
> makes them useless.

What about making the lag part of the service? I mean, you could reload WDQS
periodically, for instance daily, and drop the updater altogether. Who needs
to see updates appear live in WDQS as soon as edits are made in Wikidata?

> We suspect that some use cases are more expensive than others (a single
> property change to a large entity will require a comparatively insane amount
> of work to update it on the WDQS side). We'd like to have real data on the
> cost of various operations, but we only have guesses at this point.
>
> If you've read this far, thanks a lot for your engagement!
>
> Have fun!

Will do.

_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
