Addshore added a comment.

  > Isn't there some way to be more concise in these entries? So far there's 
only around 250 of them, but each one of them is over 1GB of data for all of 
its revisions, *compressed*. We kind of expect articles to take their time to 
get huge...
  
  I guess this is why for the RDF and JSON dumps we currently only do current 
revisions, not all revisions.
  
  There are a couple of angles that could make this situation better.
  
  Right now many clients make several sequential API calls to complete a set 
of edits on an entity, resulting in more revisions than are probably 
necessary. One ticket that I found covering this is T216881 
<https://phabricator.wikimedia.org/T216881> (I'm sure there are more but I 
can't find them). Partly related here also is the desire to summarize changes 
well and automatically in edit summaries T67846 
<https://phabricator.wikimedia.org/T67846>.
  I suspect that even if we strongly pursued this route and managed to combine 
more changes into single revisions, the overall revision creation rate probably 
wouldn't change all that much.
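  
  As a rough illustration of what "combining changes" could look like on the 
client side, here is a minimal sketch that folds several label and description 
changes into the parameters for one wbeditentity request, rather than one 
request per change. The helper name build_combined_edit is hypothetical, the 
sketch only builds the request parameters (no network call), and the CSRF 
token a real request needs is left out.

```python
import json


def build_combined_edit(entity_id, labels, descriptions, summary):
    """Build parameters for ONE wbeditentity request covering all changes.

    Hypothetical helper for illustration only. `labels` and `descriptions`
    map language codes to new values. A real request would also need a
    CSRF token and would be sent as a POST to the wiki's api.php.
    """
    data = {}
    if labels:
        data["labels"] = {
            lang: {"language": lang, "value": value}
            for lang, value in labels.items()
        }
    if descriptions:
        data["descriptions"] = {
            lang: {"language": lang, "value": value}
            for lang, value in descriptions.items()
        }
    return {
        "action": "wbeditentity",
        "id": entity_id,
        "data": json.dumps(data),  # the API expects the entity data as a JSON string
        "summary": summary,
        "format": "json",
    }


params = build_combined_edit(
    "Q42",
    labels={"en": "Douglas Adams", "de": "Douglas Adams"},
    descriptions={"en": "English writer and humorist"},
    summary="Set labels and description in a single edit",
)
```

  Three changes that might otherwise become three revisions end up as a single 
revision here, assuming the client knows all of them up front.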
  
  There is also the size of the JSON that is stored in revisions. There is 
likely room for optimization here, at the cost of some added overhead on the 
development side of things. In fact, the storage serialization used to be 
different from the generally exposed serialization, but that changed quite 
some time ago, I believe to simplify things.
  
  I guess for the all-revisions dumps, we start the process from revision 1 
whenever generating a new dump? Is there a reason that we don't have some 
system in place to create dumps in batches of revisions or entities, and then 
do some checks each time we generate a dump to determine if something has been 
revdeled in a batch, and only regenerate that batch, otherwise reusing the 
previously generated dump? Or something similar?
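  
  To make the batching idea concrete, here is a minimal sketch of what I have 
in mind. It is not how the dump scripts currently work. All names 
(batch_fingerprint, refresh_dump, the cache layout) are hypothetical, and it 
assumes each batch can be described as a list of (revision id, deleted flag) 
pairs. A batch is only re-rendered when its fingerprint differs from the one 
stored alongside the previous output.

```python
import hashlib


def batch_fingerprint(revisions):
    """Hash the (rev_id, deleted) pairs of a batch into a stable fingerprint."""
    h = hashlib.sha256()
    for rev_id, deleted in sorted(revisions):
        h.update(f"{rev_id}:{int(deleted)};".encode())
    return h.hexdigest()


def refresh_dump(batches, cache, render):
    """Regenerate only the batches whose fingerprint changed.

    batches: dict mapping batch id -> list of (rev_id, deleted) pairs
    cache:   dict mapping batch id -> (fingerprint, rendered output),
             updated in place
    render:  callable that produces the dump output for one batch
    Returns the list of batch ids that had to be regenerated.
    """
    regenerated = []
    for batch_id, revisions in batches.items():
        fingerprint = batch_fingerprint(revisions)
        cached = cache.get(batch_id)
        if cached is None or cached[0] != fingerprint:
            cache[batch_id] = (fingerprint, render(revisions))
            regenerated.append(batch_id)
    return regenerated


# Toy renderer: keep only the ids of revisions that are not deleted.
def render_visible(revisions):
    return [rev_id for rev_id, deleted in revisions if not deleted]


batches = {0: [(1, False), (2, False)], 1: [(3, False), (4, False)]}
cache = {}
first_run = refresh_dump(batches, cache, render_visible)   # every batch is new
second_run = refresh_dump(batches, cache, render_visible)  # nothing changed
batches[1] = [(3, True), (4, False)]                       # revision 3 gets revdeled
third_run = refresh_dump(batches, cache, render_visible)   # only batch 1 is redone
```

  The cost of the check is one pass over revision metadata per batch, which 
should be far cheaper than re-serializing the content of every revision.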

TASK DETAIL
  https://phabricator.wikimedia.org/T221504
