| Smalyshev added a comment. |
Huh this is a big one. I've thought about it a bunch lately and here's roughly what I've got:
There are several ways we can save some work on updating. I will list them all here, though some are more practical than others.
A. Blazegraph master-slave replication. The updater works on one node and updates are propagated at the DB level to the other nodes. This is possible in theory (Blazegraph has the low-level infrastructure for it) but would require a lot of work, as Blazegraph does not implement the replication protocols needed to actually do it.
B. Filter the Kafka stream to exclude junk messages and provide a "clean" update stream. This should not be very hard to do, but we should not put too much hope into it, as deduplication capabilities here are limited: we never know which timestamp a given client starts from, so it's hard to do any serious deduplication in streaming mode. Basically, if events E1 and E2 happen for the same ID within time T, deduplicating them means holding back E1 for time T, which adds a delay of T to the stream. Since we cannot have a long delay in the stream, T must be short. If, when we get E2, we could go back and revoke E1, this could work better (though then clients wouldn't see E1's entity updated until E2's time, but we can live with that, since such a client is behind anyway), but I don't think Kafka allows such things.
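The hold-for-T trade-off can be sketched roughly like this. This is a toy model, not the actual updater; the class name, IDs and the time units are placeholders:

```python
from collections import OrderedDict

class WindowedDeduplicator:
    """Hold each event for `window` time units before emitting it.
    If a newer event for the same ID arrives within the window, the
    older one is silently dropped, so every emitted event is delayed
    by up to `window` (this is the latency cost described above)."""

    def __init__(self, window):
        self.window = window
        self.pending = OrderedDict()  # entity ID -> (timestamp, event)

    def offer(self, entity_id, timestamp, event):
        """Add an event; return events whose hold window has expired."""
        # A newer event for the same ID supersedes the pending one.
        self.pending[entity_id] = (timestamp, event)
        self.pending.move_to_end(entity_id)
        return self._flush(timestamp)

    def _flush(self, now):
        ready = []
        while self.pending:
            entity_id, (ts, event) = next(iter(self.pending.items()))
            if now - ts < self.window:
                break  # still inside its window; later entries are newer
            del self.pending[entity_id]
            ready.append(event)
        return ready
```

With a window of 10 time units, an E2 for the same ID arriving 5 units after E1 causes E1 to be dropped and only E2 to be emitted, but nothing leaves the stream until its window expires.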
Ideas are welcome here. If we saw the whole stream at once we could probably save a bit of work, but that's not how the updaters work, and I am not sure how to get any real benefit from it. We could maybe keep a cache of ID -> last revision to quickly filter out stale updates, but I am not sure we'd be saving a lot there, as we already have such a filter against the database.
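For illustration, such an ID -> last-revision filter could be as simple as the following. The names and the revision scheme are hypothetical, not the existing updater's API:

```python
def make_stale_filter():
    """Return a predicate that drops updates whose revision is not
    strictly newer than the last one seen for that entity ID."""
    last_seen = {}  # entity ID -> highest revision seen so far

    def is_fresh(entity_id, revision):
        if last_seen.get(entity_id, -1) >= revision:
            return False  # stale or duplicate: equal/newer revision already seen
        last_seen[entity_id] = revision
        return True

    return is_fresh
```

The catch, as noted, is that the DB lookup already performs this check, so the cache only saves the round trip to the database, not the fetch logic itself.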
We could avoid downloading updates for other wikis that we don't care about, but my feeling is that this would not change matters substantially.
This would also require running another service, with all the dependencies and single-point-of-failure scenarios that come with that.
C. We could cache data downloads from Wikidata more actively. Right now each poller basically fetches data uncached, because we want the latest version. But if another host has already fetched the latest version and it's in the cache, we're not benefiting from that. We could probably try some kind of "serve from cache only if the content is not older than this timestamp" rule, but I am not sure our Varnish knows how fresh the Wikidata data is. We could probably try to set proper headers when sending RDF data and then use them when reading. This would require a lot of careful matching and would be a nightmare to debug if something goes wrong.
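A rough sketch of that "not older than this timestamp" rule, assuming we had a shared cache keyed by entity and something like a Last-Modified timestamp per entry (whether Varnish can actually be taught this is exactly the open question):

```python
def fetch_rdf(entity_id, min_fresh_ts, cache, fetch_from_wikidata):
    """Return RDF for entity_id, using the shared cache when it is
    fresh enough.

    cache: dict of entity_id -> (last_modified_ts, rdf_body)
    fetch_from_wikidata: callable(entity_id) -> (last_modified_ts, rdf_body)
    min_fresh_ts: reject cached copies older than this timestamp.
    """
    cached = cache.get(entity_id)
    if cached is not None and cached[0] >= min_fresh_ts:
        return cached[1]  # cache hit: at least as fresh as required
    # Cache miss or too stale: fetch and refresh the shared entry,
    # so the next poller can reuse it.
    ts, body = fetch_from_wikidata(entity_id)
    cache[entity_id] = (ts, body)
    return body
```

The Kafka event's own timestamp would be the natural source for `min_fresh_ts`: any cached copy at least as new as the event is safe to reuse.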
D. Fetching RDF data while pre-processing the Kafka stream is possible; currently we check against the DB and do not fetch data for items that are already in the DB at the latest version, and do not fetch twice for duplicates. However, since we cannot substantially deduplicate a generic stream (see above), this means a lot of data fetched and stored for active items. Again, if we had some kind of smart storage, we could probably arrange for the RDF data to be stored once per ID and updated on subsequent fetches, but this already sounds like reinventing Varnish. Some ideas may be welcome here too.
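If we did build such storage, the core would be a revision-guarded upsert: keep at most one RDF blob per ID and let a newer revision overwrite it. Again a toy sketch with made-up names:

```python
class OncePerIdStore:
    """Store at most one RDF body per entity ID, keeping the newest
    revision and dropping older or duplicate writes."""

    def __init__(self):
        self._data = {}  # entity ID -> (revision, rdf_body)

    def put(self, entity_id, revision, rdf_body):
        """Store the RDF; return True if it replaced/added an entry."""
        current = self._data.get(entity_id)
        if current is None or revision > current[0]:
            self._data[entity_id] = (revision, rdf_body)
            return True
        return False  # older or duplicate revision: dropped

    def get(self, entity_id):
        entry = self._data.get(entity_id)
        return None if entry is None else entry[1]
```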
As we can see, it's all still kinda vaporous and vague, and could use some work to define what can work and what can't.
Cc: Smalyshev, Aklapper, Joe, Gehel, Nandana, thifranc, AndyTan, Davinaclare77, Qtn1293, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, EBjune, merbst, LawExplorer, Zppix, Jonas, Xmlizer, Wong128hk, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, fgiunchedi
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
