Re: Questions about the hash function transition
On 8/29/2018 9:27 AM, Derrick Stolee wrote:
> On 8/29/2018 9:09 AM, Johannes Schindelin wrote:
>> What I meant was to leverage the midx code, not the .midx files. My comment was motivated by my realizing that both the SHA-1 <-> SHA-256 mapping and the MIDX code have to look up (in a *fast* way) information with hash values as keys. *And* this information is immutable. *And* the amount of information should grow with new objects being added to the database.
>
> I'm unsure what this means, as the multi-pack-index simply uses bsearch_hash() to find hashes in the list. The same method is used for IDX lookups.

I talked with Johannes privately, and we found differences in our understanding of the current multi-pack-index feature. Johannes thought the feature was farther along than it is, specifically related to how much we value the data in the multi-pack-index when adding objects to pack-files or repacking. Some of this misunderstanding is due to how the equivalent feature works in VSTS (where there is no IDX-file equivalent; every object in the repo is tracked by a multi-pack-index).

I'd like to point out a few things about how the multi-pack-index works now, and how we hope to extend it in the future.

Currently:

1. Objects are added to the multi-pack-index by adding a new set of .idx/.pack file pairs. We scan the .idx file for the objects and offsets to add.

2. We re-use the information in the multi-pack-index only to write the new one without re-reading the .pack files that are already covered.

3. If a 'git repack' command deletes a pack-file, then we delete the multi-pack-index. It must be regenerated by 'git multi-pack-index write' later.

In the current world, the multi-pack-index is completely secondary to the .idx files.

In the future, I hope these features exist in the multi-pack-index:

1. A stable object order. As objects are added to the multi-pack-index, we assign a distinct integer value to each. As we add objects, those integer values do not change. We can then pair the reachability bitmap with the multi-pack-index instead of a specific pack-file (allowing repack and bitmap computations to happen asynchronously). The data required to store this object order is very similar to storing the bijection between SHA-1 and SHA-256 hashes.

2. Incremental multi-pack-index: Currently, we have only one multi-pack-index file per object directory. We can use a mechanism similar to the split-index to keep a small number of multi-pack-index files (at most 3, probably) such that the '.git/objects/pack/multi-pack-index' file is small and easy to rewrite, while it refers to larger '.git/objects/pack/*.midx' files that change infrequently.

3. Multi-pack-index-aware repack: The repacker currently knows only enough about the multi-pack-index to delete it. We could instead directly manipulate the multi-pack-index during repack, and we could decide to do more incremental repacks based on data stored in the multi-pack-index.

In conclusion: please keep the multi-pack-index in mind as we implement the transition plan. I'll continue building the feature as planned (the next thing to do after the current series of cleanups is 'git multi-pack-index verify'), but I am happy to look into other applications as we need them.

Thanks,
-Stolee
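[Editor's note: the bsearch_hash() lookup Stolee mentions is, conceptually, an ordinary binary search over a sorted, fixed-width table of hashes. A minimal sketch of that idea, with an illustrative name and signature rather than Git's actual function:]

```c
#include <string.h>
#include <stddef.h>

#define HASH_LEN 20  /* 20 bytes for SHA-1; 32 for SHA-256 */

/* Binary search a sorted, packed table of `nr` fixed-width hashes
 * for `want`. Returns the index of the match, or -1 if absent.
 * This mirrors the lookup both IDX files and the multi-pack-index
 * rely on, but is not the actual bsearch_hash() from Git. */
static int hash_pos(const unsigned char *want,
                    const unsigned char *table, size_t nr)
{
    size_t lo = 0, hi = nr;
    while (lo < hi) {
        size_t mi = lo + (hi - lo) / 2;
        int cmp = memcmp(want, table + mi * HASH_LEN, HASH_LEN);
        if (!cmp)
            return (int)mi;
        if (cmp < 0)
            hi = mi;
        else
            lo = mi + 1;
    }
    return -1;
}
```

Because the on-disk hash list is sorted, this lookup is O(log N) regardless of whether the table lives in a single .idx or a multi-pack-index.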
Re: Questions about the hash function transition
On 8/29/2018 9:09 AM, Johannes Schindelin wrote:
> Hi Jonathan,
>
> On Tue, 28 Aug 2018, Jonathan Nieder wrote:
>> Johannes Schindelin wrote:
>>> On Thu, 23 Aug 2018, Jonathan Nieder wrote:
>>>> Ævar Arnfjörð Bjarmason wrote:
>>>>> Are we going to need a midx version of these mapping files? How does midx fit into this picture? Perhaps it's too obscure to worry about...
>>>>
>>>> That's a great question! I think the simplest answer is to have a midx only for the primary object format and fall back to using ordinary idx files for the others. The midx format already has a field for hash function (thanks, Derrick!).
>>>
>>> Related: I wondered whether we could simply leverage the midx code for the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar in concept and challenges.
>>
>> Interesting: tell me more. My first instinct is to prefer the idx-based design that is already described in the design doc. If we want to change that, we should have a motivating reason. Midx is designed to be optional and to not necessarily cover all objects, so it doesn't seem like a good fit.
>
> It is optional, but shouldn't this mode, where a Git repo needs to know about two different versions of all files, be optional? Or at least temporary?

The multi-pack-index is intended to cover all packed objects, so it covers the same number of objects as an IDX-based strategy. If we are rebuilding the repo from scratch by translating the hashes, then "being too big to repack" is probably not a problem, so we would expect a single IDX file anyway. In my opinion, whatever we do for the IDX-based approach will need to be duplicated in the multi-pack-index. The multi-pack-index does have a natural mechanism (optional chunks) for inserting this data without incrementing the version number.

> Right. What I meant was to leverage the midx code, not the .midx files. My comment was motivated by my realizing that both the SHA-1 <-> SHA-256 mapping and the MIDX code have to look up (in a *fast* way) information with hash values as keys. *And* this information is immutable. *And* the amount of information should grow with new objects being added to the database.

I'm unsure what this means, as the multi-pack-index simply uses bsearch_hash() to find hashes in the list. The same method is used for IDX lookups.

> I know that Stolee performed a bit of performance testing regarding different data structures to use in MIDX. We could benefit from that testing by using not only the results from those tests, but also the code.

I did test ways to use something other than bsearch_hash(), such as using a 65,536-entry fanout table for lookups using the first two bytes of a hash (tl;dr: it speeds things up a bit, but the super-small improvement is probably not worth the space and complexity). I've also toyed with the idea of using interpolation search inside bsearch_hash(), but I haven't had time to do that.

> IIRC one of the insights was that packs are a natural structure that can be used for the MIDX mapping, too (you could, for example, store the SHA-1 <-> SHA-256 mapping *only* for objects inside packs, and re-generate them on the fly for loose objects all the time).
>
> Stolee can speak with much more competence and confidence about this, though, whereas all of what I said above is me waving my hands quite frantically.

I understand the hesitation to pair such an important feature (hash transition) with a feature that hasn't even shipped. We will need to see how things progress on both fronts to see how mature the multi-pack-index is when we need this transition table.

Thanks,
-Stolee
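[Editor's note: the two-byte fanout Stolee describes can be sketched as a cumulative-count table: `fanout[k]` holds the number of objects whose first two hash bytes, read big-endian, are at most `k`, so each lookup only binary-searches the narrow range sharing the key's prefix. This is a hypothetical sketch; the shipped IDX/midx formats use a 256-entry one-byte fanout instead:]

```c
#include <stddef.h>

/* fanout[k] = count of objects whose first two hash bytes (as a
 * big-endian 16-bit value) are <= k. The last entry is therefore the
 * total object count, and the search for `want` can be confined to
 * positions [*lo, *hi) of the sorted hash table. */
static void fanout_range(const unsigned int fanout[65536],
                         const unsigned char *want,
                         size_t *lo, size_t *hi)
{
    unsigned int k = ((unsigned int)want[0] << 8) | want[1];
    *lo = k ? fanout[k - 1] : 0;
    *hi = fanout[k];
}
```

The table costs 65,536 * 4 = 256 KiB per file, which is the space/complexity trade-off weighed against the modest speedup.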
Re: Questions about the hash function transition
Hi Jonathan,

On Tue, 28 Aug 2018, Jonathan Nieder wrote:
> Johannes Schindelin wrote:
>> On Thu, 23 Aug 2018, Jonathan Nieder wrote:
>>> Ævar Arnfjörð Bjarmason wrote:
>>>> Are we going to need a midx version of these mapping files? How does midx fit into this picture? Perhaps it's too obscure to worry about...
>>>
>>> That's a great question! I think the simplest answer is to have a midx only for the primary object format and fall back to using ordinary idx files for the others. The midx format already has a field for hash function (thanks, Derrick!).
>>
>> Related: I wondered whether we could simply leverage the midx code for the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar in concept and challenges.
>
> Interesting: tell me more.
>
> My first instinct is to prefer the idx-based design that is already described in the design doc. If we want to change that, we should have a motivating reason.
>
> Midx is designed to be optional and to not necessarily cover all objects, so it doesn't seem like a good fit.

Right. What I meant was to leverage the midx code, not the .midx files.

My comment was motivated by my realizing that both the SHA-1 <-> SHA-256 mapping and the MIDX code have to look up (in a *fast* way) information with hash values as keys. *And* this information is immutable. *And* the amount of information should grow with new objects being added to the database.

I know that Stolee performed a bit of performance testing regarding different data structures to use in MIDX. We could benefit from that testing by using not only the results from those tests, but also the code.

IIRC one of the insights was that packs are a natural structure that can be used for the MIDX mapping, too (you could, for example, store the SHA-1 <-> SHA-256 mapping *only* for objects inside packs, and re-generate them on the fly for loose objects all the time).

Stolee can speak with much more competence and confidence about this, though, whereas all of what I said above is me waving my hands quite frantically.

Ciao,
Dscho
Re: Questions about the hash function transition
Derrick Stolee wrote:
> I'm not super-familiar with how the transition plan specifically needs this mapping, but it seems like a good place to put it.

Would you mind reading it through and letting me know your thoughts? More eyes can't hurt.

Thanks,
Jonathan
Re: Questions about the hash function transition
Hi,

Johannes Schindelin wrote:
> On Thu, 23 Aug 2018, Jonathan Nieder wrote:
>> Ævar Arnfjörð Bjarmason wrote:
>>> Are we going to need a midx version of these mapping files? How does midx fit into this picture? Perhaps it's too obscure to worry about...
>>
>> That's a great question! I think the simplest answer is to have a midx only for the primary object format and fall back to using ordinary idx files for the others. The midx format already has a field for hash function (thanks, Derrick!).
>
> Related: I wondered whether we could simply leverage the midx code for the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar in concept and challenges.

Interesting: tell me more.

My first instinct is to prefer the idx-based design that is already described in the design doc. If we want to change that, we should have a motivating reason.

Midx is designed to be optional and to not necessarily cover all objects, so it doesn't seem like a good fit.

Thanks,
Jonathan
Re: Questions about the hash function transition
Edward Thomson writes:
> If I'm understanding you correctly, then on the libgit2 side, I'm very much opposed to this proposal. We never execute commands, nor do I want to start thinking that we can do so arbitrarily. We run in environments where that's a non-starter.
>
> At present, in libgit2, users can provide their own mechanism for running clean/smudge filters. But hash transformation / compatibility is going to be a crucial compatibility component. So this is not something that we could simply opt out of or require users to implement themselves.

While I suspect the "apparent flexibility" does not equal "we must be able to run arbitrary external programs" in the proposal, I do agree that hash transformation MUST NOT be configurable like this. We do not want to add a random source of incompatible mappings when there is no need to introduce confusion.

If old object names that users of the old hash find in log messages and other places need to be easily looked up in a repository that has been converted, then:

(1) the get_sha1() equivalent in the new world should learn to fall back to the old hash when there is no object with that name under the new hash;

(2) in addition to the above fallback, there should be a syntax to explicitly tell that function that it is using the old hash;

(3) get_commit_buffer() should learn to optionally allow converting old hashes in log messages to new ones, in a way similar to how a textconv filter can be specified by the end-users to make binary blobs easier to grok by text-based tools (the important part is that such a filter does not have to be limited to "upgrade hash algorithm"---it can be a more general "correct misspelt words automatically" filter).

With 1+2, you can say "git log $sha1" and also "git log sha1:$sha1" to disambiguate. 3 would be icing on the cake.
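[Editor's note: point (1) above amounts to a two-step lookup: try the name as a native new-hash object first, then consult the SHA-1 side of the translation table. A toy sketch of that fallback follows; the function name, the in-memory table, and the short hex names are all illustrative, not Git's actual get_sha1() machinery:]

```c
#include <string.h>
#include <stddef.h>

/* Toy translation table: pairs of (old SHA-1 name, new SHA-256 name).
 * In a real repository this would come from the on-disk mapping. */
struct map_entry { const char *sha1; const char *sha256; };

static const struct map_entry table[] = {
    { "aaaa1111", "ffff2222" },
};

/* Resolve a user-supplied name: first treat it as a native (SHA-256)
 * name; if that fails, fall back to treating it as an old SHA-1 name
 * and translate it. Returns the native name, or NULL if unknown. */
static const char *resolve_name(const char *name)
{
    size_t i, n = sizeof(table) / sizeof(table[0]);
    for (i = 0; i < n; i++)
        if (!strcmp(table[i].sha256, name))
            return table[i].sha256;   /* already a native name */
    for (i = 0; i < n; i++)
        if (!strcmp(table[i].sha1, name))
            return table[i].sha256;   /* fallback: old hash, point (1) */
    return NULL;
}
```

Point (2) would simply bypass the first loop when the user writes an explicit "sha1:" prefix.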
Re: Questions about the hash function transition
On Tue, Aug 28 2018, Edward Thomson wrote:
> On Tue, Aug 28, 2018 at 2:50 PM, Ævar Arnfjörð Bjarmason wrote:
>> If we instead had something like clean/smudge filters:
>>
>>     [extensions]
>>         objectFilter = sha256-to-sha1
>>         compatObjectFormat = sha1
>>     [objectFilter "sha256-to-sha1"]
>>         clean = ...
>>         smudge = ...
>>
>> We could apply arbitrary transformations on objects through filters which would accept/return some simple format requesting them to translate such-and-such objects, and would either return object names/types under which to store them, or "nothing to do".
>
> If I'm understanding you correctly, then on the libgit2 side, I'm very much opposed to this proposal. We never execute commands, nor do I want to start thinking that we can do so arbitrarily. We run in environments where that's a non-starter.

I'm being unclear. I'm suggesting that we slightly amend the syntax of what we're proposing to put in the .git/config to leave the door open for *optionally* doing arbitrary mappings.

It would still work exactly the same internally for the common sha1<->sha256 case, i.e. neither git, libgit2, jgit nor anyone else would need to shell out to anything. They'd just pick up that common case and handle it internally, similar to how e.g. the crlf filter (vs. full clean/smudge support) works in git & libgit2: https://github.com/libgit2/libgit2/blob/master/tests/filter/crlf.c

So the sha256<->sha1 support would be an implicit built-in like crlf; it would just leave the door open to having something like git-lfs.

Now what does that really mean? And I admit I may be missing something here. Unlike smudge/clean filters we're going to be constrained by having hashes of length 20 or 32, locally & remotely, since we wouldn't want to support arbitrary lengths, but with relatively small changes it'll allow for changing just:

    # local           remote
    sha256       <->  sha1

To also support:

    # local           remote
    fn(sha1)     <->  fn(sha1)
    fn(sha1)     <->  fn(sha256)
    fn(sha256)   <->  fn(sha1)
    fn(sha256)   <->  fn(sha256)

Where fn() is some hook you'd provide to hook into the bits where we're e.g. unpacking SHA-1 objects from the remote, and writing them locally as SHA-256, except instead of (as we do by default) writing:

    SHA256_map(sha256(content)) = content

You'd write:

    SHA256_map(sha256(fn(content))) = fn(content)

Where fn() would need to be idempotent.

Now, why is this useful or worth considering? As noted in the E-Mail I linked to, it allows for some novel use cases for doing local to remote object translation. But really, I'm not suggesting that *that* is something we should consider. *All* I'm saying is that, given the experience of how we started out with stuff like the built-in "crlf" and then grew smudge/clean filters, it's worth considering what sort of .git/config key-value pairs we'd pick that would lend themselves to such future extensions, should we deem that a good idea in the future.

Because if we don't we've lost nothing, but if we do we'd otherwise need to support two sets of config syntaxes to do those two related things.

> At present, in libgit2, users can provide their own mechanism for running clean/smudge filters. But hash transformation / compatibility is going to be a crucial compatibility component. So this is not something that we could simply opt out of or require users to implement themselves.

Indeed.
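[Editor's note: the idempotence requirement on fn() means fn(fn(content)) must equal fn(content); otherwise re-filtering an already-stored object would change its content and therefore its hash. A toy illustration, where the lowercasing filter is purely a stand-in for a user-supplied hook:]

```c
#include <ctype.h>

/* A stand-in content filter: lowercase the buffer in place.
 * Lowercasing is idempotent (applying it twice changes nothing
 * beyond the first application), which is exactly the property
 * fn() must have for SHA256_map(sha256(fn(content))) = fn(content)
 * to be stable. */
static void fn(char *buf)
{
    for (; *buf; buf++)
        *buf = (char)tolower((unsigned char)*buf);
}
```

Without idempotence, fetching an object, filtering it, and filtering it again on a later rewrite would yield two different objects under two different names.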
Re: Questions about the hash function transition
On Tue, Aug 28, 2018 at 2:50 PM, Ævar Arnfjörð Bjarmason wrote:
> If we instead had something like clean/smudge filters:
>
>     [extensions]
>         objectFilter = sha256-to-sha1
>         compatObjectFormat = sha1
>     [objectFilter "sha256-to-sha1"]
>         clean = ...
>         smudge = ...
>
> We could apply arbitrary transformations on objects through filters which would accept/return some simple format requesting them to translate such-and-such objects, and would either return object names/types under which to store them, or "nothing to do".

If I'm understanding you correctly, then on the libgit2 side, I'm very much opposed to this proposal. We never execute commands, nor do I want to start thinking that we can do so arbitrarily. We run in environments where that's a non-starter.

At present, in libgit2, users can provide their own mechanism for running clean/smudge filters. But hash transformation / compatibility is going to be a crucial compatibility component. So this is not something that we could simply opt out of or require users to implement themselves.

-ed
Re: Questions about the hash function transition
On Thu, Aug 23 2018, Ævar Arnfjörð Bjarmason wrote:
>> Transition plan
>> ---
>
> One thing that's not covered in this document at all, which I feel is missing, is how we're going to handle references to old commit IDs in commit messages, bug trackers etc. once we go through the whole migration process.
>
> I.e. are users who expect to be able to read old history and "git show " expected to maintain a repository that has a live sha1<->sha256 mapping forever, or could we be smarter about this and support some sort of marker in the repository saying "maintain the mapping up until this point".
>
> Then, along with some v2 protocol extension to transfer such a historical mapping (and perhaps a default user option to request it) we'd be guaranteed to be able to read old log messages and "git show" them, and servers could avoid breaking past URLs without maintaining the mapping going forward.
>
> One example of this on the server is that on GitLab (I don't know how GitHub does this) when you reference a commit from e.g. a bug, a refs/keep-around/ is created, to make sure it doesn't get GC'd.
>
> Those sorts of hosting providers would like to not break *existing* links, without needing to forever maintain a bidirectional mapping.

Considering this a bit more, I think this would nicely fall under what I suggested in https://public-inbox.org/git/874ll3yd75@evledraar.gmail.com/

I.e. the interface that's now proposed / documented is fairly inelastic:

    [extensions]
        objectFormat = sha256
        compatObjectFormat = sha1

If we instead had something like clean/smudge filters:

    [extensions]
        objectFilter = sha256-to-sha1
        compatObjectFormat = sha1
    [objectFilter "sha256-to-sha1"]
        clean = ...
        smudge = ...

We could apply arbitrary transformations on objects through filters which would accept/return some simple format requesting them to translate such-and-such objects, and would either return object names/types under which to store them, or "nothing to do".

So we could also have filters that would munge the contents of objects between local & remote (for e.g. the "use a public remote host for storing an encrypted repo" use-case, where the host can still fsck on their end), but also e.g. be able to pass arguments to the filters saying that only commits older than so-and-so are to have a reverse mapping (for looking up old commits), or just ones on some branch etc.

It wouldn't be any slower than the current proposal, since some subset of it would be picked up and implemented in C directly via some fast path, similar to the proposal that e.g. some encoding filters be implemented as built-ins. But by having it be more extendable it'll be easy to e.g. pass options, or implement custom transformations.

We're still far away from reviewing patches to implement this, but in anticipation of that I'd like to see what people think about future-proofing this objectFilter syntax.
Re: Questions about the hash function transition
On 8/28/2018 8:04 AM, Johannes Schindelin wrote:
> Hi,
>
> On Thu, 23 Aug 2018, Jonathan Nieder wrote:
>> Ævar Arnfjörð Bjarmason wrote:
>>> [...]
>>>
>>>> Since all operations that make new objects (e.g., "git commit") add the new objects to the corresponding index, this mapping is possible for all objects in the object store.
>>>
>>> Are we going to need a midx version of these mapping files? How does midx fit into this picture? Perhaps it's too obscure to worry about...
>>
>> That's a great question! I think the simplest answer is to have a midx only for the primary object format and fall back to using ordinary idx files for the others. The midx format already has a field for hash function (thanks, Derrick!).
>
> Related: I wondered whether we could simply leverage the midx code for the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar in concept and challenges.

If we would like such a mapping, then I would propose the following:

1. The object store has everything in SHA-256, so the HASH_LEN parameter of the multi-pack-index is 32.

2. We create an optional chunk to add to the multi-pack-index that stores the SHA-1 for each object. This list would be in lex order.

3. We create two optional chunks that store the bijection between SHA-256 and SHA-1: the first is a list of integers i_0, i_1, ..., i_{N-1} such that i_k is the position in the SHA-1 list corresponding to the kth SHA-256. The second is a list of integers j_0, j_1, ..., j_{N-1} such that j_k is the position in the SHA-256 list of the kth SHA-1.

I'm not super-familiar with how the transition plan specifically needs this mapping, but it seems like a good place to put it.

Thanks,
-Stolee
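[Editor's note: the two integer lists in step 3 are inverse permutations of each other (j_{i_k} = k for every k), so a 'git multi-pack-index verify'-style consistency check over the chunks is cheap and linear. A hypothetical sketch, not code from Git:]

```c
#include <stddef.h>

/* i[] maps SHA-256 position k to SHA-1 position i[k]; j[] maps back.
 * The two chunks are consistent iff j[i[k]] == k for every k, which
 * also forces both lists to be permutations of 0..nr-1. Returns 1 if
 * consistent, 0 otherwise. */
static int check_bijection(const size_t *i, const size_t *j, size_t nr)
{
    size_t k;
    for (k = 0; k < nr; k++)
        if (i[k] >= nr || j[i[k]] != k)
            return 0;
    return 1;
}
```

Storing positions rather than raw hashes keeps the chunks small: each mapping entry is one integer per object on top of the two sorted hash lists.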
Re: Questions about the hash function transition
Hi,

On Thu, 23 Aug 2018, Jonathan Nieder wrote:
> Ævar Arnfjörð Bjarmason wrote:
>> [...]
>>
>>> Since all operations that make new objects (e.g., "git commit") add the new objects to the corresponding index, this mapping is possible for all objects in the object store.
>>
>> Are we going to need a midx version of these mapping files? How does midx fit into this picture? Perhaps it's too obscure to worry about...
>
> That's a great question! I think the simplest answer is to have a midx only for the primary object format and fall back to using ordinary idx files for the others.
>
> The midx format already has a field for hash function (thanks, Derrick!).

Related: I wondered whether we could simply leverage the midx code for the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar in concept and challenges.

Ciao,
Dscho
Re: Questions about the hash function transition
brian m. carlson wrote:
> On Thu, Aug 23, 2018 at 06:54:38PM -0700, Jonathan Nieder wrote:
>> For what it's worth, even if it all is in one commit with message "wip", I think I'd benefit from being able to see this code. I can promise not to critique it, and to only treat it as a rough premonition of the future.
>
> It's in my object-id-partn branch at https://github.com/bk2204/git.
>
> It doesn't do protocol v2 yet, but it does do protocol v1. It is, of course, subject to change (especially naming) depending on what the list thinks is most appropriate.

    $ git diff --shortstat origin/master...bmc/object-id-partn
     185 files changed, 2263 insertions(+), 1535 deletions(-)

Beautiful. Thanks much for this.

Sincerely,
Jonathan
Re: Questions about the hash function transition
On Thu, Aug 23, 2018 at 06:54:38PM -0700, Jonathan Nieder wrote:
> brian m. carlson wrote:
>> I realize I have a lot of code that has not been sent in yet, but I also tend to build on my own series a lot, and I probably need to be a bit better about extracting reusable pieces that can go in independently without waiting for the previous series to land.
>
> For what it's worth, even if it all is in one commit with message "wip", I think I'd benefit from being able to see this code. I can promise not to critique it, and to only treat it as a rough premonition of the future.

It's in my object-id-partn branch at https://github.com/bk2204/git.

It doesn't do protocol v2 yet, but it does do protocol v1. It is, of course, subject to change (especially naming) depending on what the list thinks is most appropriate.

-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
Re: Questions about the hash function transition
Ævar Arnfjörð Bjarmason wrote:
>> Objective
>> ---------
>> Migrate Git from SHA-1 to a stronger hash function.
>
> Should we say "Migrate Git from SHA-1 to SHA-256" here instead?
>
> Maybe it's overly specific, i.e. really we're also describing how /any/ hash function transition might happen, but having just read this now from start to finish it takes us a really long time to mention (and at first, only offhand) that SHA-256 is the new hash.

I answered this question in my other reply, but my answer missed the point. I think it would be fine for this to say "Migrate Git from SHA-1 to a stronger hash function (SHA-256)".

More importantly, I think the Background section should say something about SHA-256 --- e.g. how about replacing the sentence

    SHA-1 still possesses the other properties such as fast object lookup
    and safe error checking, but other hash functions are equally suitable
    that are believed to be cryptographically secure.

with something about SHA-256?

Rereading the background section, I see some other bits that could be clarified, too. It has a run-on sentence:

    Thus Git has in effect already migrated to a new hash that isn't
    SHA-1 and doesn't share its vulnerabilities, its new hash function
    just happens to produce exactly the same output for all known inputs,
    except two PDFs published by the SHAttered researchers, and the new
    implementation (written by those researchers) claims to detect future
    cryptanalytic collision attacks.

The "," after "vulnerabilities" should be a period, ending the sentence. My understanding is that sha1collisiondetection's safe-hash is meant to protect against known attacks and that the code is meant to be adaptable for future attacks of the same kind (by updating the list of disturbance vectors), but it doesn't claim to guard against future novel cryptanalysis methods that haven't been published yet.

Thanks,
Jonathan
Re: Questions about the hash function transition
Hi,

brian m. carlson wrote:
> On Thu, Aug 23, 2018 at 04:02:51PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>> 1. Add SHA-256 support to Git protocol. This is valuable and the logical next step but it is out of scope for this initial design.
>>
>> This is a non-goal according to the docs, but now that we have protocol v2 in git, perhaps we could start specifying or describing how this protocol extension will work?
>
> I have code that does this. The reason is that the first stage of the [nice explanation snipped]
>
> I hope to be able to spend some time documenting this in a little bit. I have documentation for that code in my branch, but I haven't sent it in yet.

Yay!

> I realize I have a lot of code that has not been sent in yet, but I also tend to build on my own series a lot, and I probably need to be a bit better about extracting reusable pieces that can go in independently without waiting for the previous series to land.

For what it's worth, even if it all is in one commit with message "wip", I think I'd benefit from being able to see this code. I can promise not to critique it, and to only treat it as a rough premonition of the future.

[...]

> For SHA-1, I have 0x73686131, which is "sha1", big-endian, and for SHA-256, I have 0x73323536, which is "s256", big-endian. The former is in the codebase already; the latter, in my hash-impl branch.

I mentioned in another reply that "sha2" sounds fine. "s256" of course also sounds fine to me. Thanks to Ævar for asking so that we have the reminder to pin it down in the doc.

Thanks,
Jonathan
Re: Questions about the hash function transition
Hi,

Ævar Arnfjörð Bjarmason wrote:
> I wanted to send another series to clarify things in hash-function-transition.txt, but for some of the issues I don't know the answer, and I had some questions after giving this another read.

Thanks for looking it over! Let's go. :)

[...]

>> Objective
>> ---------
>> Migrate Git from SHA-1 to a stronger hash function.
>
> Should we say "Migrate Git from SHA-1 to SHA-256" here instead?
>
> Maybe it's overly specific, i.e. really we're also describing how /any/ hash function transition might happen, but having just read this now from start to finish it takes us a really long time to mention (and at first, only offhand) that SHA-256 is the new hash.

Well, the objective really is to migrate to a stronger hash function, and that we chose SHA-256 is part of the details of how we chose to do that. So I think this would be a misleading change. You can tell that I'm not just trying to justify after the fact because the initial version of the design doc at [*] already uses this wording, and that version assumed that the hash function was going to be SHA-256.

[*] https://public-inbox.org/git/20170304011251.ga26...@aiede.mtv.corp.google.com/

[...]

>> Non-Goals
>> ---------
>> 1. Add SHA-256 support to Git protocol. This is valuable and the logical next step but it is out of scope for this initial design.
>
> This is a non-goal according to the docs, but now that we have protocol v2 in git, perhaps we could start specifying or describing how this protocol extension will work?

Yes, that would be great! But I suspect it's cleanest to do so in a separate doc. That would allow clarifying this part, by pointing to the protocol doc.

[...]

>> 3. Intermixing objects using multiple hash functions in a single repository.
>
> But isn't that the goal now per "Translation table" & writing both SHA-1 and SHA-256 versions of objects?

No, we don't write both versions of objects. The translation table records both names of an object.

[...]

>> - For each object format:
>>   - 4-byte format identifier (e.g., 'sha1' for SHA-1)
>
> So, given that we have a 4-byte limit and have decided on SHA-256, are we just going to call this 'sha2'?

Good question. 'sha2' sounds fine to me. If we want to do SHA-512/256 later, say, we'd just have to come up with a name for that at that point (and it doesn't have to be ASCII).

> That might be confusingly ambiguous

This is a binary format. Are you really worried that people are going to misinterpret the magic numbers it contains?

> since SHA2 is a standard with more than just SHA-256, maybe 's256', or maybe we should give this 8 bytes with trailing \0s so we can have "SHA-1\0\0\0" and "SHA-256\0"?

For what it's worth, if that's the alternative, I'd rather have four random bytes.

[...]

>> The loose object index is protected against concurrent writes by a lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose object:
>>
>> 1. Write the loose object to a temporary file, like today.
>> 2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
>> 3. Rename the loose object into place.
>> 4. Open loose-object-idx with O_APPEND and write the new object.
>> 5. Unlink loose-object-idx.lock to release the lock.
>>
>> To remove entries (e.g. in "git pack-refs" or "git-prune"):
>>
>> 1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
>> 2. Write the new content to loose-object-idx.lock.
>> 3. Unlink any loose objects being removed.
>> 4. Rename to replace loose-object-idx, releasing the lock.
>
> Do we expect multiple concurrent writers to poll the lock if they can't acquire it right away? I.e. would concurrent "git commit" block? Has this overall approach been benchmarked somewhere?

Git doesn't support concurrent "git commit" today. My feeling is that if loose object writing becomes a performance problem, we should switch to writing packfiles instead (as "git receive-pack" already does).
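[Editor's note: the O_CREAT | O_EXCL step in the quoted lock protocol is what makes acquisition atomic: open() fails with EEXIST if the lock file already exists, so only one process can create it. A minimal POSIX sketch of steps 2 and 5; the function names are illustrative, not Git's actual lockfile API:]

```c
#include <fcntl.h>
#include <unistd.h>

/* Step 2: O_CREAT | O_EXCL either atomically creates the lock file,
 * or fails (errno == EEXIST) because another process holds the lock.
 * Returns an open fd on success, -1 on failure. */
static int acquire_lock(const char *path)
{
    return open(path, O_CREAT | O_EXCL | O_WRONLY, 0666);
}

/* Step 5: unlinking the lock file releases the lock so the next
 * writer's O_CREAT | O_EXCL open can succeed. */
static void release_lock(const char *path, int fd)
{
    close(fd);
    unlink(path);
}
```

A writer that loses the race sees -1 and can either retry (poll) or give up, which is exactly the behavior Ævar's question about blocking concurrent writers is probing.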
So when there's a choice between better performance of writing loose objects and simplicity, I lean toward simplicity (though that's not absolute; there are definitely tradeoffs to be made).

Earlier discussion about this had sharded loose object indices for each xy/ subdir. It was more complicated, for not much gain.

[...]

> Maybe I've missed some subtlety where that won't work, I'm just concerned that something that's writing a lot of objects in parallel will be slowed down (e.g. the likes of BFG repo cleaner).

BFG repo cleaner is an application, like fast-import, that is a good fit for writing packs, not loose objects.

[...]

>> Since all operations that make new objects (e.g., "git commit") add the new objects to the corresponding index, this mapping is possible for all objects in the object store.
>
> Are we going to need a midx version of these mapping files? How does midx fit into this picture? Perhaps it's too obscure to
Re: Questions about the hash function transition
On Thu, Aug 23, 2018 at 04:02:51PM +0200, Ævar Arnfjörð Bjarmason wrote:
> > [...]
> > Goals
> > -----
> > 1. The transition to SHA-256 can be done one local repository at a time.
> >    a. Requiring no action by any other party.
> >    b. A SHA-256 repository can communicate with SHA-1 Git servers
> >       (push/fetch).
> >    c. Users can use SHA-1 and SHA-256 identifiers for objects
> >       interchangeably (see "Object names on the command line", below).
> >    d. New signed objects make use of a stronger hash function than
> >       SHA-1 for their security guarantees.
> > 2. Allow a complete transition away from SHA-1.
> >    a. Local metadata for SHA-1 compatibility can be removed from a
> >       repository if compatibility with SHA-1 is no longer needed.
> > 3. Maintainability throughout the process.
> >    a. The object format is kept simple and consistent.
> >    b. Creation of a generalized repository conversion tool.
> >
> > Non-Goals
> > ---------
> > 1. Add SHA-256 support to Git protocol. This is valuable and the
> >    logical next step but it is out of scope for this initial design.
>
> This is a non-goal according to the docs, but now that we have protocol
> v2 in git, perhaps we could start specifying or describing how this
> protocol extension will work?

I have code that does this. The reason is that the first stage of the transition code is to implement stage 4 of the transition: that is, a full SHA-256 implementation without any SHA-1 support. Implementing it that way means that we don't have to deal with any of the SHA-1 to SHA-256 mapping in the first stage of the code.

In order to clone an SHA-256 repo (which the testsuite is completely broken without), you need basic SHA-256 support in the protocol. I know this was a non-goal, but the alternative is an inability to run the testsuite using SHA-256 until all the code is merged, which is unsuitable for development. The transition plan also anticipates stage 4 (full SHA-256) support before earlier stages, so this will be required.
I hope to be able to spend some time documenting this in a little bit. I have documentation for that code in my branch, but I haven't sent it in yet. I realize I have a lot of code that has not been sent in yet, but I also tend to build on my own series a lot, and I probably need to be a bit better about extracting reusable pieces that can go in independently without waiting for the previous series to land.

> > [...]
> > 3. Intermixing objects using multiple hash functions in a single
> >    repository.
>
> But isn't that the goal now per "Translation table" & writing both SHA-1
> and SHA-256 versions of objects?

No, I think this statement is basically that you have to have the entire repository use all one algorithm under the hood in the .git directory, translation tables excluded. I don't think that's controversial.

> > [...]
> > Pack index
> > ~~~~~~~~~~
> > Pack index (.idx) files use a new v3 format that supports multiple
> > hash functions. They have the following format (all integers are in
> > network byte order):
> >
> > - A header appears at the beginning and consists of the following:
> >   - The 4-byte pack index signature: '\377t0c'
> >   - 4-byte version number: 3
> >   - 4-byte length of the header section, including the signature and
> >     version number
> >   - 4-byte number of objects contained in the pack
> >   - 4-byte number of object formats in this pack index: 2
> >   - For each object format:
> >     - 4-byte format identifier (e.g., 'sha1' for SHA-1)
>
> So, given that we have a 4-byte limit and have decided on SHA-256 are we
> just going to call this 'sha2'? That might be confusingly ambiguous
> since SHA2 is a standard with more than just SHA-256, maybe 's256', or
> maybe we should give this 8 bytes with trailing \0s so we can have
> "SHA-1\0\0\0" and "SHA-256\0"?

This is the format_version field in struct git_hash_algo. For SHA-1, I have 0x73686131, which is "sha1", big-endian, and for SHA-256, I have 0x73323536, which is "s256", big-endian.
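For what it's worth, those two constants are just the ASCII names packed into big-endian 32-bit words, which is easy to verify (a throwaway sketch, not code from the series):

```python
import struct

# The format_version constants mentioned above, interpreted as
# big-endian 32-bit integers, spell out their ASCII names.
SHA1_ID = 0x73686131
S256_ID = 0x73323536

print(struct.pack(">I", SHA1_ID))  # b'sha1'
print(struct.pack(">I", S256_ID))  # b's256'
```

That self-documenting property is what makes them nicer than opaque values like 0x0001 and 0x0002 when staring at a hex dump of an .idx file.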
The former is in the codebase already; the latter, in my hash-impl branch. If people have objections, we can change this up until we merge the pack index v3 code (which is not yet finished). It needs to be unique, and that's it. We could specify 0x0001 and 0x0002 if we wanted, although I feel the values I mentioned above are self-documenting, which is desirable.

> > [...]
> > - The trailer consists of the following:
> >   - A copy of the 20-byte SHA-256 checksum at the end of the
> >     corresponding packfile.
> >
> >   - 20-byte SHA-256 checksum of all of the above.
>
> We need to update both of these to 32 bytes, right? Or are we planning to
> truncate the checksums?
>
> This seems like just a mistake when we did s/NewHash/SHA-256/g, but then
> again it was originally "20-byte NewHash checksum" ever since 752414ae43
> ("technical doc: add a design doc for hash function transition",
Re: Questions about the hash function transition
Ævar Arnfjörð Bjarmason writes:
> On Thu, Aug 23 2018, Junio C Hamano wrote:
>
>> Ævar Arnfjörð Bjarmason writes:
>>
>>>> - The trailer consists of the following:
>>>>   - A copy of the 20-byte SHA-256 checksum at the end of the
>>>>     corresponding packfile.
>>>>
>>>>   - 20-byte SHA-256 checksum of all of the above.
>>>
>>> We need to update both of these to 32 bytes, right? Or are we planning to
>>> truncate the checksums?
>>
>> https://public-inbox.org/git/ca+55afwc7uq61ebnj36pfu_abcxgya4jut-tvppj21hkhre...@mail.gmail.com/
>
> Thanks.
>
> Yeah for this checksum purpose even 10 or 5 characters would do, but
> since we'll need a new pack format anyway for SHA-256 why not just use
> the full length of the SHA-256 here? We're using the full length of the
> SHA-1.
>
> I don't see it mattering for security / corruption detection purposes,
> but just to avoid confusion. We'll have this one place left where
> something looks like a SHA-1, but is actually a truncated SHA-256.

I would prefer to see us at least explore if the gain in throughput is sufficiently big if we switch to a weaker checksum, like crc32. If it does not give us sufficient gain, I'd agree with you that consistently using the full hash everywhere would conceptually be cleaner.
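To make the size difference being weighed concrete (a throwaway comparison over a made-up payload, not a benchmark of actual pack writing): crc32 produces a 4-byte non-cryptographic checksum, while SHA-256 produces a 32-byte cryptographic digest.

```python
import hashlib
import zlib

# Hypothetical payload standing in for pack data; real packfiles
# would be the thing actually measured.
payload = b"\x00" * (1 << 20)  # 1 MiB

sha = hashlib.sha256(payload).digest()        # 32-byte cryptographic digest
crc = zlib.crc32(payload).to_bytes(4, "big")  # 4-byte checksum

print(len(sha), len(crc))  # 32 4
```

Any throughput conclusion would of course have to come from timing real pack writes, since the checksum cost is only one component.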
Re: Questions about the hash function transition
On Thu, Aug 23 2018, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason writes:
>
>>> - The trailer consists of the following:
>>>   - A copy of the 20-byte SHA-256 checksum at the end of the
>>>     corresponding packfile.
>>>
>>>   - 20-byte SHA-256 checksum of all of the above.
>>
>> We need to update both of these to 32 bytes, right? Or are we planning to
>> truncate the checksums?
>
> https://public-inbox.org/git/ca+55afwc7uq61ebnj36pfu_abcxgya4jut-tvppj21hkhre...@mail.gmail.com/

Thanks.

Yeah for this checksum purpose even 10 or 5 characters would do, but since we'll need a new pack format anyway for SHA-256 why not just use the full length of the SHA-256 here? We're using the full length of the SHA-1.

I don't see it mattering for security / corruption detection purposes, but just to avoid confusion. We'll have this one place left where something looks like a SHA-1, but is actually a truncated SHA-256.
Re: Questions about the hash function transition
Ævar Arnfjörð Bjarmason writes:
>> - The trailer consists of the following:
>>   - A copy of the 20-byte SHA-256 checksum at the end of the
>>     corresponding packfile.
>>
>>   - 20-byte SHA-256 checksum of all of the above.
>
> We need to update both of these to 32 bytes, right? Or are we planning to
> truncate the checksums?

https://public-inbox.org/git/ca+55afwc7uq61ebnj36pfu_abcxgya4jut-tvppj21hkhre...@mail.gmail.com/
Questions about the hash function transition
I wanted to send another series to clarify things in hash-function-transition.txt, but for some of the issues I don't know the answer, and I had some questions after giving this another read. So let's discuss that here first.

Quoting from the document (available at https://github.com/git/git/blob/v2.19.0-rc0/Documentation/technical/hash-function-transition.txt)

> Git hash function transition
> ============================
>
> Objective
> ---------
> Migrate Git from SHA-1 to a stronger hash function.

Should we say "Migrate Git from SHA-1 to SHA-256" here instead? Maybe it's overly specific, i.e. really we're also describing how /any/ hash function transition might happen, but having just read this now from start to finish it takes us a really long time to mention (and at first, only offhand) that SHA-256 is the new hash.

> [...]
> Goals
> -----
> 1. The transition to SHA-256 can be done one local repository at a time.
>    a. Requiring no action by any other party.
>    b. A SHA-256 repository can communicate with SHA-1 Git servers
>       (push/fetch).
>    c. Users can use SHA-1 and SHA-256 identifiers for objects
>       interchangeably (see "Object names on the command line", below).
>    d. New signed objects make use of a stronger hash function than
>       SHA-1 for their security guarantees.
> 2. Allow a complete transition away from SHA-1.
>    a. Local metadata for SHA-1 compatibility can be removed from a
>       repository if compatibility with SHA-1 is no longer needed.
> 3. Maintainability throughout the process.
>    a. The object format is kept simple and consistent.
>    b. Creation of a generalized repository conversion tool.
>
> Non-Goals
> ---------
> 1. Add SHA-256 support to Git protocol. This is valuable and the
>    logical next step but it is out of scope for this initial design.

This is a non-goal according to the docs, but now that we have protocol v2 in git, perhaps we could start specifying or describing how this protocol extension will work?

> [...]
> 3. Intermixing objects using multiple hash functions in a single
>    repository.
But isn't that the goal now per "Translation table" & writing both SHA-1 and SHA-256 versions of objects?

> [...]
> Pack index
> ~~~~~~~~~~
> Pack index (.idx) files use a new v3 format that supports multiple
> hash functions. They have the following format (all integers are in
> network byte order):
>
> - A header appears at the beginning and consists of the following:
>   - The 4-byte pack index signature: '\377t0c'
>   - 4-byte version number: 3
>   - 4-byte length of the header section, including the signature and
>     version number
>   - 4-byte number of objects contained in the pack
>   - 4-byte number of object formats in this pack index: 2
>   - For each object format:
>     - 4-byte format identifier (e.g., 'sha1' for SHA-1)

So, given that we have a 4-byte limit and have decided on SHA-256 are we just going to call this 'sha2'? That might be confusingly ambiguous since SHA2 is a standard with more than just SHA-256, maybe 's256', or maybe we should give this 8 bytes with trailing \0s so we can have "SHA-1\0\0\0" and "SHA-256\0"?

> [...]
> - The trailer consists of the following:
>   - A copy of the 20-byte SHA-256 checksum at the end of the
>     corresponding packfile.
>
>   - 20-byte SHA-256 checksum of all of the above.

We need to update both of these to 32 bytes, right? Or are we planning to truncate the checksums?

This seems like just a mistake when we did s/NewHash/SHA-256/g, but then again it was originally "20-byte NewHash checksum" ever since 752414ae43 ("technical doc: add a design doc for hash function transition", 2017-09-27), so what do we mean here?

> Loose object index
> ~~~~~~~~~~~~~~~~~~
> A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
> all loose objects. Its format is
>
>     # loose-object-idx
>     (sha256-name SP sha1-name LF)*
>
> where the object names are in hexadecimal format. The file is not
> sorted.
>
> The loose object index is protected against concurrent writes by a
> lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
> object:
>
> 1.
Write the loose object to a temporary file, like today.
> 2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
> 3. Rename the loose object into place.
> 4. Open loose-object-idx with O_APPEND and write the new object
> 5. Unlink loose-object-idx.lock to release the lock.
>
> To remove entries (e.g. in "git pack-refs" or "git-prune"):
>
> 1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the
>    lock.
> 2. Write the new content to loose-object-idx.lock.
> 3. Unlink any loose objects being removed.
> 4. Rename to replace loose-object-idx, releasing the lock.

Do we expect multiple concurrent writers to poll the lock if they can't acquire it right away? I.e. concurrent "git commit" would block? Has this overall approach been benchmarked somewhere?

I wonder if some lock-less variant of this would