Hello Guillaume,

On Fri, 7 Feb 2020 at 14:33, Guillaume Lederrey
<[email protected]> wrote:
>
> Hello all!
>
> First of all, my apologies for the long silence. We need to do better in 
> terms of communication. I'll try my best to send a monthly update from now 
> on. Keep me honest, remind me if I fail.
>

It would be nice to have some feedback on my grant request at:

  https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS

Or one of the other threads on the very same mailing list.

> Another attempt to get update lag under control is to apply back pressure on 
> edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is 
> obviously less than ideal (at least as long as WDQS updates are lagging as 
> often as they are), but does allow the service to recover from time to time. 
> We probably need to iterate on this, provide better granularity, 
> differentiate better between operations that have an impact on update lag and 
> those which don't.
>
> On the slightly better news side, we now have a much better understanding of 
> the update process and of its shortcomings. The current process does a full 
> diff between each updated entity and what we have in Blazegraph. Even if a 
> single triple needs to change, we still read tons of data from Blazegraph. 
> While this approach is simple and robust, it is obviously not efficient. We 
> need to rewrite the updater to take a more event streaming / reactive 
> approach, and only work on the actual changes.

Once that is done, it will still be only a short-term solution.
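
To make the contrast in the quoted paragraph concrete, here is a minimal
Python sketch. It assumes triples can be modeled as plain (subject,
predicate, object) tuples; the function names and the sample data are
hypothetical and are not taken from the actual updater:

```python
# Hypothetical model: a triple is a (subject, predicate, object) tuple.

def full_diff(stored, updated):
    """Diff-based update: needs ALL stored triples of the entity,
    even when only one triple actually changed."""
    stored, updated = set(stored), set(updated)
    to_add = updated - stored
    to_remove = stored - updated
    return to_add, to_remove


def apply_delta(stored, to_add, to_remove):
    """Event-style update: apply only the changes carried by the event,
    without re-reading the rest of the entity."""
    return (set(stored) - set(to_remove)) | set(to_add)


# Sample data for illustration only.
stored = {
    ("Q42", "label", "Douglas Adams"),
    ("Q42", "P31", "Q5"),
    ("Q42", "P569", "1952-03-11"),
}
updated = stored | {("Q42", "P570", "2001-05-11")}

# The full diff touches every triple to discover a one-triple change.
to_add, to_remove = full_diff(stored, updated)

# A change event would carry only that delta.
new_state = apply_delta(stored, to_add, to_remove)
```

The cost of full_diff grows with the size of the entity, while
apply_delta only depends on the size of the change.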

> This is a big chunk of work, almost a complete rewrite of the updater,
> and we need a new solution to stream changes with guaranteed ordering 
> (something that our Kafka queues don't offer). This is where we are focusing 
> our energy at the moment, this looks like the best option to improve the 
> situation in the medium term. This change will probably have some functional 
> impacts [3].

Guaranteed ordering in a multi-party distributed setting has no easy
solution, and apparently Kafka does not provide it.  For a
non-technical introduction to this kind of problem, see
https://en.wikipedia.org/wiki/Two_Generals%27_Problem
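
One way to tolerate out-of-order delivery, rather than prevent it, can be
sketched as follows. This assumes every change event carries an entity id
and a revision number that only grows for that entity; the class and
method names are hypothetical, and this is not how Kafka or the current
updater works. It only helps when applying the latest state of an entity
is enough:

```python
class LatestRevisionFilter:
    """Drop events that are older than what was already applied.

    Assumption: revision numbers increase over time for a given entity.
    """

    def __init__(self):
        # entity id -> highest revision applied so far
        self._applied = {}

    def accept(self, entity_id, revision):
        """Return True if the event should be applied, False if it is
        stale or a duplicate."""
        if revision <= self._applied.get(entity_id, -1):
            return False
        self._applied[entity_id] = revision
        return True


events = [("Q42", 1), ("Q42", 3), ("Q42", 2), ("Q1", 1), ("Q42", 3)]
updater_filter = LatestRevisionFilter()
decisions = [updater_filter.accept(entity, rev) for entity, rev in events]
```

Revision 2 of Q42 arrives after revision 3 and is skipped, as is the
duplicate of revision 3. Events that must be applied one by one, in
order, would still need a real ordering guarantee.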

> Some longer term thoughts:
>
> Keeping all of Wikidata in a single graph is most probably not going to work 
> long term.

:(

> We have not found examples of public SPARQL endpoints with > 10 B triples and 
> there is probably a good reason for that.

Because Wikimedia is the only non-profit in the field?

> We will probably need to split the graphs at some point.

:(

> We don't know how yet

:(

> (that's why we loaded the dumps into Hadoop, that might give us some more 
> insight).

:(

> We might expose a subgraph with only truthy statements. Or have 
> language-specific graphs, with only language-specific labels.

:(

> Or something completely different.

:)

> Keeping WDQS / Wikidata as open as they are at the moment might not be 
> possible in the long term. We need to think if / how we want to implement 
> some form of authentication and quotas.

With blacklists and whitelists, but this is a huge undertaking anyway.

> Potentially increasing quotas for some use cases, but keeping them strict for 
> others. Again, we don't know what this will look like, but we're thinking 
> about it.

> What you can do to help:
>
> Again, we're not sure. Of course, reducing the load (both in terms of edits 
> on Wikidata and of reads on WDQS) will help. But not using those services 
> makes them useless.

What about making the lag part of the service?  I mean, you could
reload WDQS periodically, for instance daily, and drop the updater
altogether. Who needs to see the updates live in WDQS as soon as edits
are done in Wikidata?

> We suspect that some use cases are more expensive than others (a single 
> property change to a large entity will require a comparatively insane amount 
> of work to update it on the WDQS side). We'd like to have real data on the 
> cost of various operations, but we only have guesses at this point.
>
> If you've read this far, thanks a lot for your engagement!
>
>   Have fun!
>

Will do.

_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
