Tgr added a comment.

Some issues that we did not have time to fully discuss during the meeting:

  • sha1 B/C. There are two candidates for the old sha1 field: the sha1 of the main slot and the sha1 of the full revision (which is computed by taking the base36 sha1 of slot 1, concatenating the raw value of slot 2, taking the sha1 of that, concatenating slot 3, taking the sha1 of that, etc.). When there is just the main slot, the two coincide, so there is no B/C issue in that case. When there are multiple slots, which is the more B/C-friendly option? sha1 is mainly used for revert detection, but how would an application that is only aware of the main slot define reverts?
  • Backfilling the id. Legacy clients expect a numerical text id for the main slot; when we switch to external storage, we'll have a URL instead (which will be exposed as a different attribute). Can we still provide some id so old clients don't break? The actual value of the id is an internal detail and not relevant to clients; what's relevant is that it should either be unique or (preferably) hash-like (i.e. only main slots with the same text have the same id). In theory we could use some kind of content hash (the first N digits of the base10 sha1, for example), but it will not fit into a 32-bit integer (the current max text ID is about 900 million; fake IDs should avoid any range that is, or could within a reasonable time become, a text ID; the max value for a signed int32 is around 2 billion). Is that likely enough to break clients to make the whole idea moot?
  • Deduplicating slot contents. In theory a lot of resources could be saved if identical slot contents were only written out once (they will be a very frequent occurrence due to reverts). The saving would not be so much in the actual dump files, as the compression there takes care of duplicate content anyway, but it would mean less text to process, both in the dump infrastructure and for reusers. Text IDs / blob URLs can be assumed to be deduplicated, but they also cannot be published without checking that they correspond to a (visible) revision. That seems like a hard problem; can we do something about it?
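
To make the first two points concrete, here is a rough Python sketch of the sha1 chaining and the int32 headroom arithmetic. It is only an illustration under assumptions, not the actual implementation: the function names are made up for this example, "raw value of slot N" is read literally as the slot's content bytes, the base36 hash is zero-padded to 31 characters, and the cutoff below which fake IDs are reserved for real text IDs is a hypothetical parameter.

```python
import hashlib
from typing import Sequence

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer, about 2 billion


def base36_sha1(data: bytes) -> str:
    """SHA-1 of data, written in base 36 and zero-padded to 31 characters."""
    n = int.from_bytes(hashlib.sha1(data).digest(), "big")
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits)).rjust(31, "0")


def revision_sha1(slots: Sequence[bytes]) -> str:
    """Chain the slots as described above (assumed reading of the comment):
    hash slot 1, append slot 2's raw content, hash again, and so on."""
    if not slots:
        raise ValueError("a revision needs at least one slot")
    acc = base36_sha1(slots[0])
    for content in slots[1:]:
        acc = base36_sha1(acc.encode("ascii") + content)
    return acc


def fake_id_headroom(reserved_below: int) -> int:
    """Number of int32 values left for fake IDs if everything up to
    reserved_below (a hypothetical cutoff) is kept for real text IDs."""
    return INT32_MAX - reserved_below


main_only = revision_sha1([b"main text"])
with_extra_slot = revision_sha1([b"main text", b"second slot"])
headroom = fake_id_headroom(1_000_000_000)
```

With a single slot, `revision_sha1` returns the same value as `base36_sha1` of the main slot, which is the case where the two candidates coincide. As soon as a second slot exists the values diverge, so a client comparing only main-slot hashes and one comparing full-revision hashes would disagree about what counts as a revert. For the id question, reserving everything below one billion would leave a little over a billion int32 values for a hash-like id, far fewer than the 160 bits of a sha1.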

TASK DETAIL
https://phabricator.wikimedia.org/T199121

To: ArielGlenn, Tgr
Cc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm