Addshore added a comment.
> Isn't there some way to be more concise in these entries? So far there's only around 250 of them, but each one of them is over 1GB of data for all of its revisions, *compressed*. We kind of expect articles to take their time to get huge...

I guess this is why for the RDF and JSON dumps we only do current revisions, not all revisions.

There are a couple of angles that could make this situation better.

Right now many clients perform multiple sequential API calls to complete a set of edits on an entity, resulting in more revisions than are probably necessary. One ticket that I found covering this is T216881 <https://phabricator.wikimedia.org/T216881> (I'm sure there are more but I can't find them). Partly related here is the desire to summarize changes well and automatically in edit summaries, T67846 <https://phabricator.wikimedia.org/T67846>. I suspect that even if we strongly pursued this route and managed to combine more changes into single revisions, the overall revision creation rate probably wouldn't change all that much.

There is also the size of the JSON that is stored in revisions. There is likely room for optimization here, with some added overhead on the development side of things. In fact the storage serialization used to be different from the generally exposed serialization, but that changed quite some time ago, I believe to simplify things.

I guess for the all-revision dumps we start the process from revision 1 whenever generating a new dump? Is there a reason that we don't have some system in place to create dumps in batches of revisions or entities, and then do some checks each time we generate a dump to determine whether something has been revdeled in a batch, regenerating only that batch and otherwise using the previously generated dump? Or something similar?
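
To make the batch question above a bit more concrete, here is a rough sketch of the planning step. It is not how the dumps infrastructure works today; `Batch`, `make_batches` and `plan_dump` are hypothetical names, and it assumes we can cheaply get the set of revision ids that were revdeled since the last dump run.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Batch:
    """A contiguous range of revision ids (hypothetical unit of dump output)."""
    start: int  # first revision id, inclusive
    end: int    # last revision id, inclusive


def make_batches(max_rev_id: int, batch_size: int) -> List[Batch]:
    """Split revision ids 1..max_rev_id into fixed-size batches."""
    return [
        Batch(start, min(start + batch_size - 1, max_rev_id))
        for start in range(1, max_rev_id + 1, batch_size)
    ]


def plan_dump(
    batches: Iterable[Batch],
    cached: Set[Batch],
    deleted_rev_ids: Set[int],
) -> Tuple[List[Batch], List[Batch]]:
    """Decide which batches can be reused and which must be regenerated.

    A batch is reused only if an output for exactly that range already
    exists AND none of its revisions were revdeled since the last run.
    """
    reuse: List[Batch] = []
    regenerate: List[Batch] = []
    for batch in batches:
        dirty = any(batch.start <= rev <= batch.end for rev in deleted_rev_ids)
        if batch in cached and not dirty:
            reuse.append(batch)
        else:
            regenerate.append(batch)
    return reuse, regenerate


if __name__ == "__main__":
    batches = make_batches(max_rev_id=25, batch_size=10)
    # Assumed state: the first two batches were written by the previous run,
    # and revision 14 has been revdeled since then.
    cached = {Batch(1, 10), Batch(11, 20)}
    reuse, regenerate = plan_dump(batches, cached, deleted_rev_ids={14})
    print("reuse:", reuse)
    print("regenerate:", regenerate)
```

With this shape, only the batch containing the revdeled revision and the new tail batch get rebuilt; everything else would be copied from the previous run. Batching by entity instead of by revision id would follow the same pattern with a different key.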
TASK DETAIL
  https://phabricator.wikimedia.org/T221504

To: ArielGlenn, Addshore
