Addshore added a comment.
In T217897#5056213 <https://phabricator.wikimedia.org/T217897#5056213>, @Smalyshev wrote:

>> I'm still a bit confused about this logic inside the updater, especially with this id validation checking if we have the revision already etc?
>
> Not sure what you mean "already". You can have revision ID in the change, and revision ID in Wikidata, but you still have to check against revision ID in Blazegraph, so that you do not replace newer data with older data.

I'm not quite sure how you would get into this situation in the first place though? Is the stream of events suddenly going to start sending old events? Or is this mainly a situation after data has been bulk loaded?

>> hold the latest revision of an entity in some internal queue in the updater for a few seconds while waiting for more updates, and then just commit that to blazegraph for storage after a few seconds
>
> Not sure how holding it in the queue for a few seconds would help anything. You'd just time-shift the whole process several seconds to the past, but otherwise nothing would change. If you mean batching the updates, we already do that. But the batch for the updates covering several seconds would be huge (some bots do hundreds of updates per second) and putting them into SPARQL queries would make them very slow. If we split them, we slow the process down, and take the risk that the whole update was useless since new data already arrived. I am not sure how waiting for a few seconds helps anything beyond what the current process is already doing (and it introduces additional complexity, as now we can no longer assume we're working with the latest data but always have to track which delayed update this data relates to). Maybe I misunderstand something in your proposal.

Yes, the waiting of a few seconds would be for batching changes to the same entity, and it would be waiting on the stream of entity-change events: wait until the entity has not been touched for 10 seconds (or something), then request the last revid that the updater received from Special:EntityData using that revid, create the SPARQL, and do the update. I'm thinking about batched updates per entity, not batched updates of all changes in a set period of time. (A rough sketch of this per-entity idea follows below.)

Again, I'm mainly proposing this to try and get revid to be used. If the above is essentially what the updater is already doing, I still don't understand why revid can't be used. If I were going to write something to update the query service from the ground up, with no knowledge of what has already been attempted, the above is what it would do.

>> This means reducing the PHP calls dramatically, increasing varnish hits,
>
> It may raise varnish hits (since everything would be a varnish hit), but as for reducing PHP calls, I am not sure about that, because instead of fetching only the newest edit, if the entry is edited 100 times, you now need to fetch 100 edits instead. That's 100x PHP calls.

Well, the underlying PHP calls that happen as a result of hitting varnish would decrease dramatically even if every single revision were requested by revid in ttl format, due to the current distributed nature of the updater. If edits on wikidata were slower, 1 edit on wikidata would result in 12 PHP runs using the cache-busting URLs (so ignoring the batching). Hitting a revid URL, 1 edit would result in 1, maybe 2 PHP hits, depending on how fast varnish was to cache the result. Again ignoring the batching here, as it definitely does not give us the 12x decrease in requests to PHP that we would get by using a cacheable URL.
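To make the per-entity idea above a bit more concrete, here is a minimal Python sketch of the debounce-then-fetch-by-revid flow. It is only an illustration of the proposal, not how the (Java) updater is written; the helper functions, the 10-second quiet period, and the exact Special:EntityData URL parameters are assumptions.

```python
import time

QUIET_PERIOD = 10  # seconds an entity must stay untouched before we flush it

latest_rev = {}  # entity id -> newest revision id seen in the event stream
last_seen = {}   # entity id -> monotonic time of the last event for that entity


def on_change_event(entity_id, revision_id):
    """Record an entity-change event; only the newest revision per entity is kept."""
    if revision_id > latest_rev.get(entity_id, 0):
        latest_rev[entity_id] = revision_id
    last_seen[entity_id] = time.monotonic()


def due_entities():
    """Entities that have been quiet for QUIET_PERIOD and are ready to update."""
    now = time.monotonic()
    return [e for e, t in last_seen.items() if now - t >= QUIET_PERIOD]


def flush(entity_id):
    """Fetch RDF for the exact revision we saw and push it to Blazegraph."""
    rev = latest_rev.pop(entity_id)
    last_seen.pop(entity_id, None)
    # Revision-pinned URL: identical for every updater host, so varnish can serve
    # all but the first request (unlike a "give me the latest" cache-busting URL).
    url = ("https://www.wikidata.org/wiki/Special:EntityData/"
           f"{entity_id}.ttl?flavor=dump&revision={rev}")
    rdf = fetch_ttl(url)                      # hypothetical HTTP helper
    write_to_blazegraph(entity_id, rev, rdf)  # hypothetical; would still guard against older revs


# Hypothetical stand-ins so the sketch is self-contained.
def fetch_ttl(url):
    return f"# ttl fetched from {url}"


def write_to_blazegraph(entity_id, rev, rdf):
    print(f"updating {entity_id} to revision {rev} ({len(rdf)} bytes of RDF)")
```

In this sketch, on_change_event would be fed by the same change stream the updater already consumes, and a loop would periodically call due_entities() and flush() each result.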
Those numbers are briefly backed up by data in T217897#5048178 <https://phabricator.wikimedia.org/T217897#5048178>: "but the actual edit count on the day was ~1.1 million, which resulted in between 926848 and 826878 requests to load entity data". That is per host. So 1.1 million edits, but around 10 million PHP code executions (at least) to update the query services, when in my eyes that should really be no more than the number of edits.

>> PHP is being hit very roughly with 12.5 million requests to turn some PHP object into RDF output for special entity data, we might want to just consider caching that in its own memcached key inside wikibase so we only have to do that conversion once per revision
>
> May be worth considering, but we have tons of revisions, do we have enough memory for such a cache? Some entries are huge, and if one letter changes in 30M of RDF, we'd be storing two 30M revisions differing in one byte. Of course, we could limit the size of the cacheable RDF - not sure how many resources are cached.

So the shared cache for entity revisions inside wikibase exists per entity, not per revision, but it is updated during save and can be assumed to hold the latest revision. It is shared between wikidata.org and all client sites, and is used for essentially all entity revision retrieval (we don't actually have numbers on the cache hit rate here, but I imagine it is pretty high...). Even when retrieving an entity by revision id, the shared cache will be used, or at least checked to see if it contains the requested / latest revision, to skip the DB call.

Specifically caching RDF in Special:EntityData only makes sense if we are going to continue hitting the page so much and skipping the varnish cache. If we change the access pattern for the updaters to Special:EntityData, then the varnish cache is already that cache. (A small sketch of the difference between the per-entity cache and a per-revision RDF cache follows below.)
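As an illustration of the two cache shapes being discussed, here is a minimal sketch with plain dicts standing in for memcached. The key names and the load_from_db fallback are hypothetical; Wikibase itself is PHP and its actual cache classes are not reproduced here.

```python
cache = {}  # stands in for a shared memcached instance


def on_entity_save(entity_id, revision_id, entity_data):
    """Per-entity shared cache: one slot per entity, overwritten on every save,
    so it always holds the latest revision."""
    cache[f"entity:{entity_id}"] = (revision_id, entity_data)


def get_entity(entity_id, revision_id=None):
    """Lookups by revision id can still be answered from the per-entity cache
    when the requested revision happens to be the cached (latest) one."""
    hit = cache.get(f"entity:{entity_id}")
    if hit is not None:
        cached_rev, data = hit
        if revision_id is None or revision_id == cached_rev:
            return data
    return load_from_db(entity_id, revision_id)  # hypothetical fallback


def load_from_db(entity_id, revision_id):
    return {"id": entity_id, "revision": revision_id}


# A per-revision RDF cache would instead need one entry per (entity, revision),
# e.g. cache[f"rdf:{entity_id}:{revision_id}"] = ttl_blob, which is where the
# memory concern comes from: a one-byte edit to a 30M TTL document would leave
# two nearly identical 30M blobs in the cache.
```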
