ArielGlenn added a comment.

Following up on the deduplication issue raised above:

The main concern about bloat with the new schema, as I understand it, is that it may be common for only one slot's content to change in a revision, in which case we don't want to write out the content for the other slots again. This is easy enough to deal with: we keep the information for the last revision's slots and texts around when writing out the metadata (stubs), or the full dump in the case of Special:Export. If the slot origin of the new content is the same as the slot's revision id, we read/request and write the content; otherwise we know it's a duplicate and we write an empty text tag, say <text duplicate="duplicate" />, akin to the tag we generate in stubs for texts which have deleted content, i.e. <text deleted="deleted" />. There's no need to do full deduplication.
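To make the origin check concrete, here is a minimal sketch in Python. It is not the dump code itself: the `Slot` class, its field names, and `render_slot_texts` are hypothetical names made up for illustration, and the `duplicate` attribute is only the suggestion from the paragraph above, not an existing part of the export schema.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Slot:
    """Hypothetical stand-in for one slot of a revision."""
    role: str
    origin: int  # id of the revision that introduced this slot's content
    text: str


def render_slot_texts(rev_id: int, slots: list[Slot]) -> list[str]:
    """Return one <text> element per slot for the revision being written.

    Content is written in full only when the slot's origin is this
    revision; otherwise an empty marker tag is emitted, on the assumption
    that the consumer can find the content under the origin revision.
    """
    elements = []
    for slot in slots:
        if slot.origin == rev_id:
            # Content was introduced by this revision: write it out.
            elements.append(f"<text>{escape(slot.text)}</text>")
        else:
            # Content is inherited from an earlier revision: mark as duplicate.
            elements.append('<text duplicate="duplicate" />')
    return elements


if __name__ == "__main__":
    slots = [
        Slot(role="main", origin=10, text="new wikitext"),
        Slot(role="mediainfo", origin=7, text="unchanged structured data"),
    ]
    for element in render_slot_texts(10, slots):
        print(element)
```

In this sketch only the main slot is written in full for revision 10; the mediainfo slot, whose origin is revision 7, gets the empty tag.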

This does mean we still write entries for content, origin (required so that the dumps consumer knows which revision contains the full copy of the content), role, model, format and sha1 for every slot's content regardless, which comes to about 250 extra bytes per entry. For 52 million pages, as we currently have on wikidatawiki, that adds up to 13 GB or so uncompressed, which is pretty reasonable.
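The back-of-the-envelope figure can be reproduced as below. This is only a rough check, assuming one such 250-byte entry per page and decimal gigabytes; the constant names are made up for illustration.

```python
# Rough size estimate; both inputs are the approximate figures quoted above.
EXTRA_BYTES_PER_ENTRY = 250   # approximate per-entry metadata overhead
PAGES = 52_000_000            # approximate page count on wikidatawiki

total_bytes = EXTRA_BYTES_PER_ENTRY * PAGES
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{total_gb:.1f} GB uncompressed")
```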

I also agree that this should be version 12, but we should keep an eye on growth in the size (and slowness) of dumps production; if we have a bunch of bots inserting structured data on Commons and there's a lot of bloat all of a sudden, version 12 might need to be moved up from where we currently plan it.


TASK DETAIL
https://phabricator.wikimedia.org/T199121
