Addshore added a comment.
In T217897#5056213 <https://phabricator.wikimedia.org/T217897#5056213>, @Smalyshev wrote:

>> I'm still a bit confused about this logic inside the updater, especially with this id validation checking if we have the revision already etc?
>
> Not sure what you mean "already". You can have revision ID in the change, and revision ID in Wikidata, but you still have to check against revision ID in Blazegraph, so that you do not replace newer data with older data.

I'm not quite sure how you would get into this situation in the first place though? Is the stream of events suddenly going to start sending old events? Or is this mainly a situation after data has been bulk loaded?

>> hold the latest revision of an entity in some internal queue in the updater for a few seconds while waiting for more updates, and then just commit that to blazegraph for storage after a few seconds
>
> Not sure how holding it in the queue for a few seconds would help anything. You'd just time-shift the whole process several seconds to the past, but otherwise nothing would change. If you mean batching the updates, we already do that. But the batch for the updates covering several seconds would be huge (some bots do hundreds of updates per second) and putting them into SPARQL queries would make them very slow. If we split them, we slow the process down, and take the risk that the whole update was useless since new data already arrived. I am not sure how waiting for a few seconds helps anything beyond what the current process is already doing (and it introduces additional complexity, as now we can no longer assume we're working with the latest data but always have to track which delayed update this data relates to). Maybe I misunderstand something in your proposal.

Yes, the waiting of a few seconds would be for batching changes to the same entity, and it would be waiting on the stream of entity-change events: wait until the entity has not been touched for 10 seconds (or something), then request the last revid that the updater received from Special:EntityData using that revid, create the SPARQL, and do the update. I'm thinking about batched updates per entity, not batched updates of all changes in a set period of time. (A rough sketch of this per-entity idea follows below.)

Again, I'm mainly proposing this to try and get revid to be used. If the above is essentially what the updater is already doing, I still don't understand why revid can't be used. If I were going to write something to update the query service from the ground up, with no knowledge of what has already been attempted, the above is what it would do.

>> This means reducing the PHP calls dramatically, increasing varnish hits,
>
> It may raise varnish hits (since everything would be a varnish hit), but as for reducing PHP calls, I am not sure about that, because instead of fetching only the newest edit, if the entry is edited 100 times, you now need to fetch 100 edits instead. That's 100x PHP calls.

Well, the underlying PHP calls that happen as a result of hitting varnish would decrease dramatically even if every single revision were requested by revid in ttl format, due to the current distributed nature of the updater. If edits on wikidata were slower, 1 edit on wikidata would result in 12 PHP runs using the cache-busting URLs (so ignoring the batching). Hitting a revid URL, 1 edit would result in 1, maybe 2 PHP hits, depending on how fast varnish was to cache the result. Again ignoring the batching here, as it definitely does not give us the 12x decrease in requests to PHP that we would get by using a cacheable URL.
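To make the per-entity idea above a bit more concrete, here is a minimal Python sketch of the debounce-then-fetch-by-revid flow. It is only an illustration of the proposal, not how the (Java) updater is written; the helper functions, the 10-second quiet period, and the exact Special:EntityData URL parameters are assumptions.

```python
import time

QUIET_PERIOD = 10  # seconds an entity must stay untouched before we flush it

latest_rev = {}  # entity id -> newest revision id seen in the event stream
last_seen = {}   # entity id -> monotonic time of the last event for that entity


def on_change_event(entity_id, revision_id):
    """Record an entity-change event; only the newest revision per entity is kept."""
    if revision_id > latest_rev.get(entity_id, 0):
        latest_rev[entity_id] = revision_id
    last_seen[entity_id] = time.monotonic()


def due_entities():
    """Entities that have been quiet for QUIET_PERIOD and are ready to update."""
    now = time.monotonic()
    return [e for e, t in last_seen.items() if now - t >= QUIET_PERIOD]


def flush(entity_id):
    """Fetch RDF for the exact revision we saw and push it to Blazegraph."""
    rev = latest_rev.pop(entity_id)
    last_seen.pop(entity_id, None)
    # Revision-pinned URL: identical for every updater host, so varnish can serve
    # all but the first request (unlike a "give me the latest" cache-busting URL).
    url = ("https://www.wikidata.org/wiki/Special:EntityData/"
           f"{entity_id}.ttl?flavor=dump&revision={rev}")
    rdf = fetch_ttl(url)                      # hypothetical HTTP helper
    write_to_blazegraph(entity_id, rev, rdf)  # hypothetical; would still guard against older revs


# Hypothetical stand-ins so the sketch is self-contained.
def fetch_ttl(url):
    return f"# ttl fetched from {url}"


def write_to_blazegraph(entity_id, rev, rdf):
    print(f"updating {entity_id} to revision {rev} ({len(rdf)} bytes of RDF)")
```

In this sketch, on_change_event would be fed by the same change stream the updater already consumes, and a loop would periodically call due_entities() and flush() each result.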
Those numbers are briefly backed up by data in T217897#5048178 <https://phabricator.wikimedia.org/T217897#5048178>: "but the actual edit count on the day was ~1.1 million, which resulted in between 926848 and 826878 requests to load entity data". That is per host. So 1.1 million edits, but around 10 million PHP code executions (at least) to update the query services, when in my eyes that should really be no more than the number of edits.

>> PHP is being hit very roughly with 12.5 million requests to turn some PHP object into RDF output for special entity data, we might want to just consider caching that in its own memcached key inside wikibase so we only have to do that conversion once per revision
>
> May be worth considering, but we have tons of revisions, do we have enough memory for such a cache? Some entries are huge, and if one letter changes in 30M of RDF, we'd be storing two 30M revisions differing in one byte. Of course, we could limit the size of the cacheable RDF - not sure how many resources are cached.

So the shared cache for entity revisions inside wikibase exists per entity, not per revision, but it is updated during save and can be assumed to hold the latest revision. It is shared between wikidata.org and all client sites, and is used for essentially all entity revision retrieval (we don't actually have numbers on the cache hit rate here, but I imagine it is pretty high...). Even when retrieving an entity by revision id, the shared cache will be used, or at least checked to see if it contains the requested / latest revision, to skip the DB call.

Specifically caching RDF in Special:EntityData only makes sense if we are going to continue hitting the page so much and skipping the varnish cache. If we change the access pattern for the updaters to Special:EntityData, then the varnish cache is already that cache. (A small sketch of the difference between the per-entity cache and a per-revision RDF cache follows below.)
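As an illustration of the two cache shapes being discussed, here is a minimal sketch with plain dicts standing in for memcached. The key names and the load_from_db fallback are hypothetical; Wikibase itself is PHP and its actual cache classes are not reproduced here.

```python
cache = {}  # stands in for a shared memcached instance


def on_entity_save(entity_id, revision_id, entity_data):
    """Per-entity shared cache: one slot per entity, overwritten on every save,
    so it always holds the latest revision."""
    cache[f"entity:{entity_id}"] = (revision_id, entity_data)


def get_entity(entity_id, revision_id=None):
    """Lookups by revision id can still be answered from the per-entity cache
    when the requested revision happens to be the cached (latest) one."""
    hit = cache.get(f"entity:{entity_id}")
    if hit is not None:
        cached_rev, data = hit
        if revision_id is None or revision_id == cached_rev:
            return data
    return load_from_db(entity_id, revision_id)  # hypothetical fallback


def load_from_db(entity_id, revision_id):
    return {"id": entity_id, "revision": revision_id}


# A per-revision RDF cache would instead need one entry per (entity, revision),
# e.g. cache[f"rdf:{entity_id}:{revision_id}"] = ttl_blob, which is where the
# memory concern comes from: a one-byte edit to a 30M TTL document would leave
# two nearly identical 30M blobs in the cache.
```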
