Re: Questions about the hash function transition

2018-08-29 Thread Derrick Stolee

On 8/29/2018 9:27 AM, Derrick Stolee wrote:

On 8/29/2018 9:09 AM, Johannes Schindelin wrote:


What I meant was to leverage the midx code, not the .midx files.

My comment was motivated by my realizing that both the SHA-1 <-> SHA-256
mapping and the MIDX code have to look up (in a *fast* way) information
with hash values as keys. *And* this information is immutable. *And* the
amount of information should grow with new objects being added to the
database.


I'm unsure what this means, as the multi-pack-index simply uses 
bsearch_hash() to find hashes in the list. The same method is used for 
IDX lookups.
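
For illustration, both lookups boil down to a binary search over a 
lex-sorted table of fixed-width hashes, roughly like the sketch below 
(illustrative only; this is not the in-tree bsearch_hash() signature):

    /* Binary search over a lex-sorted table of fixed-width hashes,
     * the same idea used for .idx and multi-pack-index lookups.
     * Returns the position of `want`, or -1 if it is absent. */
    #include <stddef.h>
    #include <string.h>

    static long hash_pos(const unsigned char *want,
                         const unsigned char *table,
                         size_t nr, size_t hash_len)
    {
        size_t lo = 0, hi = nr;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            int cmp = memcmp(want, table + mid * hash_len, hash_len);

            if (!cmp)
                return (long)mid;
            if (cmp < 0)
                hi = mid;       /* want sorts before table[mid] */
            else
                lo = mid + 1;   /* want sorts after table[mid] */
        }
        return -1;
    }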


I talked with Johannes privately, and we found differences in our 
understanding of the current multi-pack-index feature. Johannes thought 
the feature was farther along than it is, specifically related to how 
much we value the data in the multi-pack-index when adding objects to 
pack-files or repacking. Some of this misunderstanding is due to how the 
equivalent feature works in VSTS (where there is no IDX-file equivalent; 
every object in the repo is tracked by a multi-pack-index).


I'd like to point out a few things about how the multi-pack-index works 
now, and how we hope to extend it in the future.


Currently:

1. Objects are added to the multi-pack-index by adding a new set of 
.idx/.pack file pairs. We scan the .idx file for the objects and offsets 
to add.


2. When writing a new multi-pack-index, we re-use the information in the 
existing one so that we do not re-read the .pack files that are already 
covered.


3. If a 'git repack' command deletes a pack-file, then we delete the 
multi-pack-index. It must be regenerated by 'git multi-pack-index write' 
later.


In the current world, the multi-pack-index is completely secondary to 
the .idx files.


In the future, I hope these features exist in the multi-pack-index:

1. A stable object order. As objects are added to the multi-pack-index, 
we assign a distinct integer value to each. As we add objects, those 
integer values do not change. We can then pair the reachability bitmap 
with the multi-pack-index instead of a specific pack-file (allowing 
repack and bitmap computations to happen asynchronously); see the sketch 
after this list. The data required to store this object order is very 
similar to storing the bijection between SHA-1 and SHA-256 hashes.


2. Incremental multi-pack-index: Currently, we have only one 
multi-pack-index file per object directory. We can use a mechanism 
similar to the split-index to keep a small number of multi-pack-index 
files (at most 3, probably) such that the 
'.git/objects/pack/multi-pack-index' file is small and easy to rewrite, 
while it refers to larger '.git/objects/pack/*.midx' files that change 
infrequently.


3. Multi-pack-index-aware repack: The repacker currently knows only 
enough about the multi-pack-index to delete it. We could instead directly 
manipulate the multi-pack-index during repack, and we could decide to do 
more incremental repacks based on data stored in the multi-pack-index.
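
To illustrate item 1 above, here is a rough sketch of what a stable 
object order amounts to (hypothetical structure and names, not in-tree 
code): each object gets the next integer when it is first added, 
existing positions are never renumbered, and a bitmap can then address 
bits by position independently of which pack currently holds the object.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define HASH_LEN 32

    struct stable_order {
        unsigned char (*hashes)[HASH_LEN]; /* position -> object hash */
        uint32_t nr, alloc;
    };

    /* Append a new object; positions already handed out never change,
     * so bitmap bit k keeps meaning "the object at position k". */
    static uint32_t stable_order_add(struct stable_order *so,
                                     const unsigned char *hash)
    {
        if (so->nr == so->alloc) {
            so->alloc = so->alloc ? 2 * so->alloc : 64;
            so->hashes = realloc(so->hashes, so->alloc * sizeof(*so->hashes));
        }
        memcpy(so->hashes[so->nr], hash, HASH_LEN);
        return so->nr++;
    }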


In conclusion: please keep the multi-pack-index in mind as we implement 
the transition plan. I'll continue building the feature as planned (the 
next thing to do after the current series of cleanups is 'git 
multi-pack-index verify'), but I'm happy to look into other applications 
as the need arises.


Thanks,

-Stolee



Re: Questions about the hash function transition

2018-08-29 Thread Derrick Stolee

On 8/29/2018 9:09 AM, Johannes Schindelin wrote:

Hi Jonathan,

On Tue, 28 Aug 2018, Jonathan Nieder wrote:


Johannes Schindelin wrote:

On Thu, 23 Aug 2018, Jonathan Nieder wrote:

Ævar Arnfjörð Bjarmason wrote:

Are we going to need a midx version of these mapping files? How does
midx fit into this picture? Perhaps it's too obscure to worry
about...

That's a great question!  I think the simplest answer is to have a
midx only for the primary object format and fall back to using
ordinary idx files for the others.

The midx format already has a field for hash function (thanks,
Derrick!).

Related: I wondered whether we could simply leverage the midx code for
the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very
similar in concept and challenges.

Interesting: tell me more.

My first instinct is to prefer the idx-based design that is already
described in the design doc.  If we want to change that, we should
have a motivating reason.

Midx is designed to be optional and to not necessarily cover all
objects, so it doesn't seem like a good fit.


It is optional, but shouldn't this mode, where a Git repo needs to know 
about two different versions of all files, be optional? Or at least 
temporary?


The multi-pack-index is intended to cover all packed objects, so it 
covers the same number of objects as an IDX-based strategy. If we are 
rebuilding the repo from scratch by translating the hashes, then "being 
too big to repack" is probably not a problem, so we would expect a 
single IDX file anyway.


In my opinion, whatever we do for the IDX-based approach will need to be 
duplicated in the multi-pack-index. The multi-pack-index does have a 
natural mechanism (optional chunks) for inserting this data without 
incrementing the version number.
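
For reference, the multi-pack-index (like the commit-graph) starts with 
a table of contents whose entries pair a 4-byte chunk ID with an 8-byte 
offset, and readers simply skip IDs they do not recognize; that is what 
makes an optional chunk a version-compatible place to put this data. A 
rough sketch of such a lookup (illustrative helpers, not the in-tree 
parser):

    #include <stdint.h>

    static uint32_t get_be32(const unsigned char *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    }

    static uint64_t get_be64(const unsigned char *p)
    {
        return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
    }

    /* Scan a table of contents of `nr` 12-byte entries (4-byte ID,
     * 8-byte offset); return the offset of chunk `id`, or 0 if the
     * file does not carry that chunk. */
    static uint64_t find_chunk(const unsigned char *toc, unsigned nr,
                               uint32_t id)
    {
        unsigned i;

        for (i = 0; i < nr; i++) {
            const unsigned char *entry = toc + 12 * i;

            if (get_be32(entry) == id)
                return get_be64(entry + 4);
        }
        return 0;
    }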



Right.

What I meant was to leverage the midx code, not the .midx files.

My comment was motivated by my realizing that both the SHA-1 <-> SHA-256
mapping and the MIDX code have to look up (in a *fast* way) information
with hash values as keys. *And* this information is immutable. *And* the
amount of information should grow with new objects being added to the
database.


I'm unsure what this means, as the multi-pack-index simply uses 
bsearch_hash() to find hashes in the list. The same method is used for 
IDX lookups.



I know that Stolee performed a bit of performance testing regarding
different data structures to use in MIDX. We could benefit from that
testing by using not only the results from those tests, but also the code.


I did test ways to use something other than bsearch_hash(), such as 
using a 65,536-entry fanout table for lookups using the first two bytes 
of a hash (tl;dr: it speeds things up a bit, but the super-small 
improvement is probably not worth the space and complexity). I've also 
toyed with the idea of using interpolation search inside bsearch_hash(), 
but I haven't had time to do that.
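
(For what it's worth, the two-byte fanout is just a cumulative count 
table indexed by the first two bytes of the hash, which narrows the 
range the binary search has to cover; a sketch, with hypothetical 
names:)

    #include <stddef.h>
    #include <stdint.h>

    /* fanout[k] = number of objects whose first two hash bytes are <= k,
     * so the object, if present, lies in the half-open range [*lo, *hi). */
    static void fanout2_range(const uint32_t fanout[65536],
                              const unsigned char *want,
                              size_t *lo, size_t *hi)
    {
        unsigned prefix = ((unsigned)want[0] << 8) | want[1];

        *lo = prefix ? fanout[prefix - 1] : 0;
        *hi = fanout[prefix];
        /* ...then binary search only within [*lo, *hi) */
    }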



IIRC one of the insights was that packs are a natural structure that
can be used for the MIDX mapping, too (you could, for example, store the
SHA-1 <-> SHA-256 mapping *only* for objects inside packs, and re-generate
them on the fly for loose objects all the time).

Stolee can speak with much more competence and confidence about this,
though, whereas all of what I said above is me waving my hands quite
frantically.


I understand the hesitation to pair such an important feature (hash 
transition) with a feature that hasn't even shipped. We will need to see 
how things progress on both fronts to see how mature the 
multi-pack-index is when we need this transition table.


Thanks,

-Stolee



Re: Questions about the hash function transition

2018-08-29 Thread Johannes Schindelin
Hi Jonathan,

On Tue, 28 Aug 2018, Jonathan Nieder wrote:

> Johannes Schindelin wrote:
> > On Thu, 23 Aug 2018, Jonathan Nieder wrote:
> > > Ævar Arnfjörð Bjarmason wrote:
> 
> >>> Are we going to need a midx version of these mapping files? How does
> >>> midx fit into this picture? Perhaps it's too obscure to worry
> >>> about...
> >>
> >> That's a great question!  I think the simplest answer is to have a
> >> midx only for the primary object format and fall back to using
> >> ordinary idx files for the others.
> >>
> >> The midx format already has a field for hash function (thanks,
> >> Derrick!).
> >
> > Related: I wondered whether we could simply leverage the midx code for
> > the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very
> > similar in concept and challenges.
> 
> Interesting: tell me more.
> 
> My first instinct is to prefer the idx-based design that is already
> described in the design doc.  If we want to change that, we should
> have a motivating reason.
> 
> Midx is designed to be optional and to not necessarily cover all
> objects, so it doesn't seem like a good fit.

Right.

What I meant was to leverage the midx code, not the .midx files.

My comment was motivated by my realizing that both the SHA-1 <-> SHA-256
mapping and the MIDX code have to look up (in a *fast* way) information
with hash values as keys. *And* this information is immutable. *And* the
amount of information should grow with new objects being added to the
database.

I know that Stolee performed a bit of performance testing regarding
different data structures to use in MIDX. We could benefit from that
testing by using not only the results from those tests, but also the code.

IIRC one of the insights was that packs are a natural structure that
can be used for the MIDX mapping, too (you could, for example, store the
SHA-1 <-> SHA-256 mapping *only* for objects inside packs, and re-generate
them on the fly for loose objects all the time).

Stolee can speak with much more competence and confidence about this,
though, whereas all of what I said above is me waving my hands quite
frantically.

Ciao,
Dscho

Re: Questions about the hash function transition

2018-08-28 Thread Jonathan Nieder
Derrick Stolee wrote:

> I'm not super-familiar with how the transition plan specifically needs this
> mapping, but it seems like a good place to put it.

Would you mind reading it through and letting me know your thoughts?
More eyes can't hurt.

Thanks,
Jonathan


Re: Questions about the hash function transition

2018-08-28 Thread Jonathan Nieder
Hi,

Johannes Schindelin wrote:
> On Thu, 23 Aug 2018, Jonathan Nieder wrote:
> > Ævar Arnfjörð Bjarmason wrote:

>>> Are we going to need a midx version of these mapping files? How does
>>> midx fit into this picture? Perhaps it's too obscure to worry about...
>>
>> That's a great question!  I think the simplest answer is to have a
>> midx only for the primary object format and fall back to using
>> ordinary idx files for the others.
>>
>> The midx format already has a field for hash function (thanks,
>> Derrick!).
>
> Related: I wondered whether we could simply leverage the midx code for the
> bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar
> in concept and challenges.

Interesting: tell me more.

My first instinct is to prefer the idx-based design that is already
described in the design doc.  If we want to change that, we should
have a motivating reason.

Midx is designed to be optional and to not necessarily cover all
objects, so it doesn't seem like a good fit.

Thanks,
Jonathan


Re: Questions about the hash function transition

2018-08-28 Thread Junio C Hamano
Edward Thomson  writes:

> If I'm understanding you correctly, then on the libgit2 side, I'm very much
> opposed to this proposal.  We never execute commands, nor do I want to start
> thinking that we can do so arbitrarily.  We run in environments where that's
> a non-starter
>
> At present, in libgit2, users can provide their own mechanism for running
> clean/smudge filters.  But hash transformation / compatibility is going to
> be a crucial compatibility component.  So this is not something that we
> could simply opt out of or require users to implement themselves.

While I suspect the "apparent flexibility" does not equate to "we
must be able to run arbitrary external programs" in the proposal, I
do agree that hash transformation MUST NOT be configurable like
this.  We do not want to add a random source of incompatible
mappings when there is no need to introduce confusion.

If old object names under old hash users find in log messages and
other places need to be easily looked up in a repository that has
been converted, then:

 (1) get_sha1() equivalent in the new world should learn to fall
 back to use old hash when there is no object with that name
 under new hash;

 (2) in addition to the above fallback, there should be a syntax to
 explicitly tell that function that it is using the old hash;

 (3) get_commit_buffer() should learn to optionally allow converting
 old hash in log messages to new ones, in a way similar to how
 textconv filter can be specified by the end-users to make
 binary blob easier to grok by text-based tools (the important
 part is that such a filter does not have to be limited to
 "upgrade hash algorithm"---it can be more general "correct
 misspelt words automatically" filter).

With 1+2, you can say "git log $sha1" and also "git log sha1:$sha1"
to disambiguate.  3 would be icing on the cake.
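
A minimal sketch of what (1) and (2) could look like, assuming a
hypothetical translation-table lookup; every helper name below is
illustrative, not an existing Git API:

    #include <string.h>

    struct object_id { unsigned char hash[32]; };

    /* Hypothetical helpers: look a hex name up under the new hash, or
     * via the old-hash translation table.  Both return 0 on success. */
    extern int lookup_new_hash(const char *hex, struct object_id *oid);
    extern int lookup_old_hash(const char *hex, struct object_id *oid);

    static int resolve_name(const char *name, struct object_id *oid)
    {
        if (!strncmp(name, "sha1:", 5))       /* (2): explicit syntax */
            return lookup_old_hash(name + 5, oid);

        if (!lookup_new_hash(name, oid))      /* try the new hash first */
            return 0;

        return lookup_old_hash(name, oid);    /* (1): fall back to old hash */
    }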


Re: Questions about the hash function transition

2018-08-28 Thread Ævar Arnfjörð Bjarmason


On Tue, Aug 28 2018, Edward Thomson wrote:

> On Tue, Aug 28, 2018 at 2:50 PM, Ævar Arnfjörð Bjarmason
>  wrote:
>> If we instead had something like clean/smudge filters:
>>
>> [extensions]
>> objectFilter = sha256-to-sha1
>> compatObjectFormat = sha1
>> [objectFilter "sha256-to-sha1"]
>> clean  = ...
>> smudge = ...
>>
>> We could apply arbitrary transformations on objects through filters
>> which would accept/return some simple format requesting them to
>> translate such-and-such objects, and would either return object
>> names/types under which to store them, or "nothing to do".
>
> If I'm understanding you correctly, then on the libgit2 side, I'm very much
> opposed to this proposal.  We never execute commands, nor do I want to start
> thinking that we can do so arbitrarily.  We run in environments where that's
> a non-starter

I'm being unclear. I'm suggesting that we slightly amend the syntax of
what we're proposing to put in the .git/config to leave the door open
for *optionally* doing arbitrary mappings.

It would still work exactly the same internally for the common
sha1<->sha256 case, i.e. neither git, libgit, jgit nor anyone else would
need to shell out to anything.

They'd just pick up that common case and handle it internally, similar
to how e.g. the crlf filter (v.s. full clean/smudge support) works in
git & libgit2:
https://github.com/libgit2/libgit2/blob/master/tests/filter/crlf.c

So the sha256<->sha1 support would be an implicit built-in like crlf, it
would just leave the door open to having something like git-lfs.

Now what does that really mean? And I admit I may be missing something
here.

Unlike smudge/clean filters, we're going to be constrained to hashes of
length 20 or 32, locally & remotely, since we wouldn't want to support
arbitrary lengths. But with relatively small changes it'll allow for
changing just:

# local  remote
sha256<->sha1

To also support:

# local  remote
fn(sha1)<->fn(sha1)
fn(sha1)<->fn(sha256)
fn(sha256)<->fn(sha1)
fn(sha256)<->fn(sha256)

Where fn() is some hook you'd provide to hook into the bits where we're
e.g. unpacking SHA-1 objects from the remote, and writing them locally
as SHA-256, except instead of (as we do by default) writing:

SHA256_map(sha256(content)) = content

You'd write:

SHA256_map(sha256(fn(content))) = fn(content)

Where fn() would need to be idempotent.
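
A minimal sketch of that write path, assuming an idempotent fn() hook
(every name below is a hypothetical placeholder, not an existing API):

    #include <stddef.h>

    struct object_id { unsigned char hash[32]; };

    /* Hypothetical hooks: the content filter, the hash, and the store. */
    extern void fn(const void *in, size_t inlen, void **out, size_t *outlen);
    extern void sha256_of(const void *buf, size_t len, struct object_id *oid);
    extern void odb_write(const struct object_id *oid, const void *buf,
                          size_t len);

    static void store_filtered(const void *content, size_t len)
    {
        void *mapped;
        size_t mapped_len;
        struct object_id oid;

        fn(content, len, &mapped, &mapped_len); /* must be idempotent */
        sha256_of(mapped, mapped_len, &oid);
        odb_write(&oid, mapped, mapped_len);    /* store fn(content) under
                                                   sha256(fn(content)) */
    }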

Now, why is this useful or worth considering? As noted in the E-Mail I
linked to it allows for some novel use cases for doing local to remote
object translation.

But really, I'm not suggesting that *that* is something we should
consider. *All* I'm saying is that, given the experience of how we
started out with stuff like built-in "crlf" and then grew smudge/clean
filters, it's worth considering what sort of .git/config key-value
pairs we'd pick that would lend themselves to such future extensions,
should that be something we deem to be a good idea in the future.

Because if we don't we've lost nothing, but if we do we'd need to
support two sets of config syntaxes to do those two related things.

> At present, in libgit2, users can provide their own mechanism for running
> clean/smudge filters.  But hash transformation / compatibility is going to
> be a crucial compatibility component.  So this is not something that we
> could simply opt out of or require users to implement themselves.

Indeed.


Re: Questions about the hash function transition

2018-08-28 Thread Edward Thomson
On Tue, Aug 28, 2018 at 2:50 PM, Ævar Arnfjörð Bjarmason
 wrote:
> If we instead had something like clean/smudge filters:
>
> [extensions]
> objectFilter = sha256-to-sha1
> compatObjectFormat = sha1
> [objectFilter "sha256-to-sha1"]
> clean  = ...
> smudge = ...
>
> We could apply arbitrary transformations on objects through filters
> which would accept/return some simple format requesting them to
> translate such-and-such objects, and would either return object
> names/types under which to store them, or "nothing to do".

If I'm understanding you correctly, then on the libgit2 side, I'm very much
opposed to this proposal.  We never execute commands, nor do I want to start
thinking that we can do so arbitrarily.  We run in environments where that's
a non-starter

At present, in libgit2, users can provide their own mechanism for running
clean/smudge filters.  But hash transformation / compatibility is going to
be a crucial compatibility component.  So this is not something that we
could simply opt out of or require users to implement themselves.

-ed


Re: Questions about the hash function transition

2018-08-28 Thread Ævar Arnfjörð Bjarmason


On Thu, Aug 23 2018, Ævar Arnfjörð Bjarmason wrote:

>> Transition plan
>> ---
>
> One thing that's not covered in this document at all, which I feel is
> missing, is how we're going to handle references to old commit IDs in
> commit messages, bug trackers etc. once we go through the whole
> migration process.
>
> I.e. are users who expect to be able to read old history and "git show
> " expected to maintain a repository that has a live
> sha1<->sha256 mapping forever, or could we be smarter about this and
> support some sort of marker in the repository saying "maintain the
> mapping up until this point".
>
> Then, along with some v2 protocol extension to transfer such a
> historical mapping (and perhaps a default user option to request it)
> we'd be guaranteed to be able to read old log messages and "git show"
> them, and servers could avoid breaking past URLs without maintaining the
> mapping going forward.
>
> One example of this on the server is that on GitLab (I don't know how
> GitHub does this) when you reference a commit from e.g a bug, a
> refs/keep-around/ is created, to make sure it doesn't get GC'd.
>
> Those sorts of hosting providers would like to not break *existing*
> links, without needing to forever maintain a bidirectional mapping.

Considering this a bit more, I think this would nicely fall under what I
suggested in
https://public-inbox.org/git/874ll3yd75@evledraar.gmail.com/

I.e. the interface that's now proposed / documented is fairly
inelastic:

[extensions]
objectFormat = sha256
compatObjectFormat = sha1

If we instead had something like clean/smudge filters:

[extensions]
objectFilter = sha256-to-sha1
compatObjectFormat = sha1
[objectFilter "sha256-to-sha1"]
clean  = ...
smudge = ...

We could apply arbitrary transformations on objects through filters
which would accept/return some simple format requesting them to
translate such-and-such objects, and would either return object
names/types under which to store them, or "nothing to do".

So we could also have filters that would munge the contents of objects
between local & remote (e.g. for the "use a public remote host for
storing an encrypted repo that'll fsck on their end" use-case), but
also e.g. be able to pass arguments to the filters saying that only
commits older than so-and-so are to have a reverse mapping (for looking
up old commits), or just ones on some branch etc.

It wouldn't be any slower than the current proposal, since some subset
of it would be picked up and implemented in C directly via some fast
path, similar to the proposal that e.g. some encoding filters be
implemented as built-ins.

But by having it be more extendable it'll be easy to e.g. pass options,
or implement custom transformations.

We're still far away from reviewing patches to implement this, but in
anticipation of that I'd like to see what people think about
future-proofing this objectFilter syntax.


Re: Questions about the hash function transition

2018-08-28 Thread Derrick Stolee

On 8/28/2018 8:04 AM, Johannes Schindelin wrote:

Hi,

On Thu, 23 Aug 2018, Jonathan Nieder wrote:


Ævar Arnfjörð Bjarmason wrote:

[...]

Since all operations that make new objects (e.g., "git commit") add
the new objects to the corresponding index, this mapping is possible
for all objects in the object store.

Are we going to need a midx version of these mapping files? How does
midx fit into this picture? Perhaps it's too obscure to worry about...

That's a great question!  I think the simplest answer is to have a
midx only for the primary object format and fall back to using
ordinary idx files for the others.

The midx format already has a field for hash function (thanks,
Derrick!).

Related: I wondered whether we could simply leverage the midx code for the
bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar
in concept and challenges.


If we would like such a mapping, then I would propose the following:

1. The object store has everything in SHA-256, so the HASH_LEN parameter 
of the multi-pack-index is 32.


2. We create an optional chunk to add to the multi-pack-index that 
stores the SHA-1 for each object. This list would be in lex order.


3. We create two optional chunks that store the bijection between 
SHA-256 and SHA-1: the first is a list of integers i_0, i_1, ..., 
i_{N-1} such that i_k is the position in the SHA-1 list corresponding to 
the kth SHA-256. The second is a list of integers j_0, j_1, ..., j_{N-1} 
such that j_k is the position in the SHA-256 list of the kth SHA-1.
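
To make item 3 concrete: once the chunks are loaded, translating in
either direction is two array lookups. A rough sketch (illustrative
names; loading and bounds-checking the chunks is assumed done):

    #include <stdint.h>

    #define SHA1_LEN   20
    #define SHA256_LEN 32

    struct hash_map {
        const unsigned char (*sha256_list)[SHA256_LEN]; /* lex-sorted */
        const unsigned char (*sha1_list)[SHA1_LEN];     /* lex-sorted */
        const uint32_t *i; /* kth SHA-256 -> position in SHA-1 list */
        const uint32_t *j; /* kth SHA-1 -> position in SHA-256 list */
    };

    /* SHA-1 name of the object at position k in the SHA-256 list. */
    static const unsigned char *sha1_for(const struct hash_map *m, uint32_t k)
    {
        return m->sha1_list[m->i[k]];
    }

    /* SHA-256 name of the object at position k in the SHA-1 list. */
    static const unsigned char *sha256_for(const struct hash_map *m, uint32_t k)
    {
        return m->sha256_list[m->j[k]];
    }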


I'm not super-familiar with how the transition plan specifically needs 
this mapping, but it seems like a good place to put it.


Thanks,

-Stolee



Re: Questions about the hash function transition

2018-08-28 Thread Johannes Schindelin
Hi,

On Thu, 23 Aug 2018, Jonathan Nieder wrote:

> Ævar Arnfjörð Bjarmason wrote:
> 
> [...]
> >> Since all operations that make new objects (e.g., "git commit") add
> >> the new objects to the corresponding index, this mapping is possible
> >> for all objects in the object store.
> >
> > Are we going to need a midx version of these mapping files? How does
> > midx fit into this picture? Perhaps it's too obscure to worry about...
> 
> That's a great question!  I think the simplest answer is to have a
> midx only for the primary object format and fall back to using
> ordinary idx files for the others.
> 
> The midx format already has a field for hash function (thanks,
> Derrick!).

Related: I wondered whether we could simply leverage the midx code for the
bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar
in concept and challenges.

Ciao,
Dscho

Re: Questions about the hash function transition

2018-08-23 Thread Jonathan Nieder
brian m. carlson wrote:
> On Thu, Aug 23, 2018 at 06:54:38PM -0700, Jonathan Nieder wrote:

>> For what it's worth, even if it all is in one commit with message
>> "wip", I think I'd benefit from being able to see this code.  I can
>> promise not to critique it, and to only treat it as a rough
>> premonition of the future.
>
> It's in my object-id-partn branch at https://github.com/bk2204/git.
>
> It doesn't do protocol v2 yet, but it does do protocol v1.  It is, of
> course, subject to change (especially naming) depending on what the list
> thinks is most appropriate.

 $ git diff --shortstat origin/master...bmc/object-id-partn
  185 files changed, 2263 insertions(+), 1535 deletions(-)

Beautiful.  Thanks much for this.

Sincerely,
Jonathan


Re: Questions about the hash function transition

2018-08-23 Thread brian m. carlson
On Thu, Aug 23, 2018 at 06:54:38PM -0700, Jonathan Nieder wrote:
> brian m. carlson wrote:
> > I realize I have a lot of code that has not been sent in yet, but I also
> > tend to build on my own series a lot, and I probably need to be a bit
> > better about extracting reusable pieces that can go in independently
> > without waiting for the previous series to land.
> 
> For what it's worth, even if it all is in one commit with message
> "wip", I think I'd benefit from being able to see this code.  I can
> promise not to critique it, and to only treat it as a rough
> premonition of the future.

It's in my object-id-partn branch at https://github.com/bk2204/git.

It doesn't do protocol v2 yet, but it does do protocol v1.  It is, of
course, subject to change (especially naming) depending on what the list
thinks is most appropriate.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204




Re: Questions about the hash function transition

2018-08-23 Thread Jonathan Nieder
Ævar Arnfjörð Bjarmason wrote:

>> Objective
>> -
>> Migrate Git from SHA-1 to a stronger hash function.
>
> Should we say "Migrate Git from SHA-1 to SHA-256" here instead?
>
> Maybe it's overly specific, i.e. really we're also describing how /any/
> hash function transition might happen, but having just read this now
> from start to finish it takes us a really long time to mention (and at
> first, only offhand) that SHA-256 is the new hash.

I answered this question in my other reply, but my answer missed the
point.

I think it would be fine for this to say "Migrate Git from SHA-1 to a
stronger hash function (SHA-256)".  More importantly, I think the
Background section should say something about SHA-256 --- e.g. how about
replacing the sentence

  SHA-1 still possesses the other properties such as fast object
  lookup and safe error checking, but other hash functions are equally
  suitable that are believed to be cryptographically secure.

with something about SHA-256?

Rereading the background section, I see some other bits that could be
clarified, too.  It has a run-on sentence:

  Thus Git has in effect already migrated to a new hash that isn't
  SHA-1 and doesn't share its vulnerabilities, its new hash function
  just happens to produce exactly the same output for all known
  inputs, except two PDFs published by the SHAttered researchers, and
  the new implementation (written by those researchers) claims to
  detect future cryptanalytic collision attacks.

The "," after vulnerabilities should be a period, ending the sentence.
My understanding is that sha1collisiondetection's safe-hash is meant
to protect against known attacks and that the code is meant to be
adaptable for future attacks of the same kind (by updating the list of
disturbance vectors), but it doesn't claim to guard against future
novel cryptanalysis methods that haven't been published yet.

Thanks,
Jonathan


Re: Questions about the hash function transition

2018-08-23 Thread Jonathan Nieder
Hi,

brian m. carlson wrote:
> On Thu, Aug 23, 2018 at 04:02:51PM +0200, Ævar Arnfjörð Bjarmason wrote:

>>> 1. Add SHA-256 support to Git protocol. This is valuable and the
>>>logical next step but it is out of scope for this initial design.
>>
>> This is a non-goal according to the docs, but now that we have protocol
>> v2 in git, perhaps we could start specifying or describing how this
>> protocol extension will work?
>
> I have code that does this.  The reason is that the first stage of the
[nice explanation snipped]
> I hope to be able to spend some time documenting this in a little bit.
> I have documentation for that code in my branch, but I haven't sent it
> in yet.

Yay!

> I realize I have a lot of code that has not been sent in yet, but I also
> tend to build on my own series a lot, and I probably need to be a bit
> better about extracting reusable pieces that can go in independently
> without waiting for the previous series to land.

For what it's worth, even if it all is in one commit with message
"wip", I think I'd benefit from being able to see this code.  I can
promise not to critique it, and to only treat it as a rough
premonition of the future.

[...]
> For SHA-1, I have 0x73686131, which is "sha1", big-endian, and for
> SHA-256, I have 0x73323536, which is "s256", big-endian.  The former is
> in the codebase already; the latter, in my hash-impl branch.

I mentioned in another reply that "sha2" sounds fine.  "s256" of
course also sounds fine to me.  Thanks to Ævar for asking so that we
have the reminder to pin it down in the doc.

Thanks,
Jonathan


Re: Questions about the hash function transition

2018-08-23 Thread Jonathan Nieder
Hi,

Ævar Arnfjörð Bjarmason wrote:

> I wanted to send another series to clarify things in
> hash-function-transition.txt, but for some of the issues I don't know
> the answer, and I had some questions after giving this another read.

Thanks for looking it over!  Let's go. :)

[...]
>> Objective
>> -
>> Migrate Git from SHA-1 to a stronger hash function.
>
> Should we say "Migrate Git from SHA-1 to SHA-256" here instead?
>
> Maybe it's overly specific, i.e. really we're also describing how /any/
> hash function transition might happen, but having just read this now
> from start to finish it takes us a really long time to mention (and at
> first, only offhand) that SHA-256 is the new hash.

Well, the objective really is to migrate to a stronger hash function,
and that we chose SHA-256 is part of the details of how we chose to do
that.  So I think this would be a misleading change.

You can tell that I'm not just trying to justify after the fact
because the initial version of the design doc at [*] already uses this
wording, and that version assumed that the hash function was going to
be SHA-256.

[*] 
https://public-inbox.org/git/20170304011251.ga26...@aiede.mtv.corp.google.com/

[...]
>> Non-Goals
>> -
>> 1. Add SHA-256 support to Git protocol. This is valuable and the
>>logical next step but it is out of scope for this initial design.
>
> This is a non-goal according to the docs, but now that we have protocol
> v2 in git, perhaps we could start specifying or describing how this
> protocol extension will work?

Yes, that would be great!  But I suspect it's cleanest to do so in a
separate doc.  That would allow clarifying this part, by pointing to
the protocol doc.

[...]
>> 3. Intermixing objects using multiple hash functions in a single
>>repository.
>
> But isn't that the goal now per "Translation table" & writing both SHA-1
> and SHA-256 versions of objects?

No, we don't write both versions of objects.  The translation records
both names of an object.

[...]
>>   - For each object format:
>> - 4-byte format identifier (e.g., 'sha1' for SHA-1)
>
> So, given that we have a 4-byte limit and have decided on SHA-256, are we
> just going to call this 'sha2'?

Good question.  'sha2' sounds fine to me.  If we want to do
SHA-512/256 later, say, we'd just have to come up with a name for that
at that point (and it doesn't have to be ASCII).

> That might be confusingly ambiguous

This is a binary format.  Are you really worried that people are going
to misinterpret the magic numbers it contains?

> since SHA2 is a standard with more than just SHA-256, maybe 's256', or
> maybe we should give this 8 bytes with trailing \0s so we can have
> "SHA-1\0\0\0" and "SHA-256\0"?

For what it's worth, if that's the alternative, I'd rather have four
random bytes.

[...]
>> The loose object index is protected against concurrent writes by a
>> lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
>> object:
>>
>> 1. Write the loose object to a temporary file, like today.
>> 2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
>> 3. Rename the loose object into place.
>> 4. Open loose-object-idx with O_APPEND and write the new object
>> 5. Unlink loose-object-idx.lock to release the lock.
>>
>> To remove entries (e.g. in "git pack-refs" or "git-prune"):
>>
>> 1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the
>>lock.
>> 2. Write the new content to loose-object-idx.lock.
>> 3. Unlink any loose objects being removed.
>> 4. Rename to replace loose-object-idx, releasing the lock.
>
> Do we expect multiple concurrent writers to poll the lock if they can't
> acquire it right away? I.e. concurrent "git commit" would block? Has this
> overall approach been benchmarked somewhere?

Git doesn't support concurrent "git commit" today.

My feeling is that if loose object writing becomes a performance
problem, we should switch to writing packfiles instead (as "git
receive-pack" already does).  So when there's a choice between better
performance of writing loose objects and simplicity, I lean toward
simplicity (though that's not absolute, there are definitely tradeoffs
to be made).

Earlier discussion about this had sharded loose object indices for
each xy/ subdir.  It was more complicated, for not much gain.
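
For reference, the quoted "add a new loose object" sequence boils down
to roughly the following (a sketch only: the paths, the preformatted
idx_line argument and the error handling are all illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int add_loose_object(const char *tmp_path, const char *obj_path,
                                const char *idx_line)
    {
        int lock_fd, idx_fd, ret = -1;

        /* 2. O_CREAT|O_EXCL fails if another writer holds the lock */
        lock_fd = open("loose-object-idx.lock",
                       O_CREAT | O_EXCL | O_WRONLY, 0666);
        if (lock_fd < 0)
            return -1;

        /* 3. rename the loose object into place */
        if (!rename(tmp_path, obj_path)) {
            /* 4. append the "sha256-name SP sha1-name LF" entry */
            idx_fd = open("loose-object-idx",
                          O_WRONLY | O_APPEND | O_CREAT, 0666);
            if (idx_fd >= 0) {
                ret = write(idx_fd, idx_line, strlen(idx_line)) < 0 ? -1 : 0;
                close(idx_fd);
            }
        }

        /* 5. release the lock */
        close(lock_fd);
        unlink("loose-object-idx.lock");
        return ret;
    }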

[...]
> Maybe I've missed some subtlety where that won't work, I'm just
> concerned that something that's writing a lot of objects in parallel
> will be slowed down (e.g. the likes of BFG repo cleaner).

BFG repo cleaner is an application like fast-import that is a good fit
for writing packs, not loose objects.

[...]
>> Since all operations that make new objects (e.g., "git commit") add
>> the new objects to the corresponding index, this mapping is possible
>> for all objects in the object store.
>
> Are we going to need a midx version of these mapping files? How does
> midx fit into this picture? Perhaps it's too obscure to worry about...

Re: Questions about the hash function transition

2018-08-23 Thread brian m. carlson
On Thu, Aug 23, 2018 at 04:02:51PM +0200, Ævar Arnfjörð Bjarmason wrote:
> > [...]
> > Goals
> > -
> > 1. The transition to SHA-256 can be done one local repository at a time.
> >a. Requiring no action by any other party.
> >b. A SHA-256 repository can communicate with SHA-1 Git servers
> >   (push/fetch).
> >c. Users can use SHA-1 and SHA-256 identifiers for objects
> >   interchangeably (see "Object names on the command line", below).
> >d. New signed objects make use of a stronger hash function than
> >   SHA-1 for their security guarantees.
> > 2. Allow a complete transition away from SHA-1.
> >a. Local metadata for SHA-1 compatibility can be removed from a
> >   repository if compatibility with SHA-1 is no longer needed.
> > 3. Maintainability throughout the process.
> >a. The object format is kept simple and consistent.
> >b. Creation of a generalized repository conversion tool.
> >
> > Non-Goals
> > -
> > 1. Add SHA-256 support to Git protocol. This is valuable and the
> >logical next step but it is out of scope for this initial design.
> 
> This is a non-goal according to the docs, but now that we have protocol
> v2 in git, perhaps we could start specifying or describing how this
> protocol extension will work?

I have code that does this.  The reason is that the first stage of the
transition code is to implement stage 4 of the transition: that is, a
full SHA-256 implementation without any SHA-1 support.  Implementing it
that way means that we don't have to deal with any of the SHA-1 to
SHA-256 mapping in the first stage of the code.

In order to clone an SHA-256 repo (which the testsuite is completely
broken without), you need to be able to have basic SHA-256 support in
the protocol.  I know this was a non-goal, but the alternative is an
inability to run the testsuite using SHA-256 until all the code is
merged, which is unsuitable for development.  The transition plan also
anticipates stage 4 (full SHA-256) support before earlier stages, so
this will be required.

I hope to be able to spend some time documenting this in a little bit.
I have documentation for that code in my branch, but I haven't sent it
in yet.

I realize I have a lot of code that has not been sent in yet, but I also
tend to build on my own series a lot, and I probably need to be a bit
better about extracting reusable pieces that can go in independently
without waiting for the previous series to land.

> > [...]
> > 3. Intermixing objects using multiple hash functions in a single
> >repository.
> 
> But isn't that the goal now per "Translation table" & writing both SHA-1
> and SHA-256 versions of objects?

No, I think this statement is basically that you have to have the entire
repository use all one algorithm under the hood in the .git directory,
translation tables excluded.  I don't think that's controversial.

> > [...]
> > Pack index
> > ~~
> > Pack index (.idx) files use a new v3 format that supports multiple
> > hash functions. They have the following format (all integers are in
> > network byte order):
> >
> > - A header appears at the beginning and consists of the following:
> >   - The 4-byte pack index signature: '\377t0c'
> >   - 4-byte version number: 3
> >   - 4-byte length of the header section, including the signature and
> > version number
> >   - 4-byte number of objects contained in the pack
> >   - 4-byte number of object formats in this pack index: 2
> >   - For each object format:
> > - 4-byte format identifier (e.g., 'sha1' for SHA-1)
> 
> So, given that we have a 4-byte limit and have decided on SHA-256, are we
> just going to call this 'sha2'? That might be confusingly ambiguous
> since SHA2 is a standard with more than just SHA-256, maybe 's256', or
> maybe we should give this 8 bytes with trailing \0s so we can have
> "SHA-1\0\0\0" and "SHA-256\0"?

This is the format_version field in struct git_hash_algo.

For SHA-1, I have 0x73686131, which is "sha1", big-endian, and for
SHA-256, I have 0x73323536, which is "s256", big-endian.  The former is
in the codebase already; the latter, in my hash-impl branch.

If people have objections, we can change this up until we merge the pack
index v3 code (which is not yet finished).  It needs to be unique, and
that's it.  We could specify 0x0001 and 0x0002 if we wanted,
although I feel the values I mentioned above are self-documenting, which
is desirable.
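
For what it's worth, those two values are simply the ASCII tags packed
big-endian into a 32-bit word; a tiny sketch to check:

    #include <assert.h>
    #include <stdint.h>

    static uint32_t format_id(const char tag[4])
    {
        return ((uint32_t)(unsigned char)tag[0] << 24) |
               ((uint32_t)(unsigned char)tag[1] << 16) |
               ((uint32_t)(unsigned char)tag[2] << 8) |
                (uint32_t)(unsigned char)tag[3];
    }

    int main(void)
    {
        assert(format_id("sha1") == 0x73686131);
        assert(format_id("s256") == 0x73323536);
        return 0;
    }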

> > [...]
> > - The trailer consists of the following:
> >   - A copy of the 20-byte SHA-256 checksum at the end of the
> > corresponding packfile.
> >
> >   - 20-byte SHA-256 checksum of all of the above.
> 
> We need to update both of these to 32 byte, right? Or are we planning to
> truncate the checksums?
> 
> This seems like just a mistake when we did s/NewHash/SHA-256/g, but then
> again it was originally "20-byte NewHash checksum" ever since 752414ae43
> ("technical doc: add a design doc for hash function transition",
> 

Re: Questions about the hash function transition

2018-08-23 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

> On Thu, Aug 23 2018, Junio C Hamano wrote:
>
>> Ævar Arnfjörð Bjarmason  writes:
>>
 - The trailer consists of the following:
   - A copy of the 20-byte SHA-256 checksum at the end of the
 corresponding packfile.

   - 20-byte SHA-256 checksum of all of the above.
>>>
>>> We need to update both of these to 32 byte, right? Or are we planning to
>>> truncate the checksums?
>>
>> https://public-inbox.org/git/ca+55afwc7uq61ebnj36pfu_abcxgya4jut-tvppj21hkhre...@mail.gmail.com/
>
> Thanks.
>
> Yeah for this checksum purpose even 10 or 5 characters would do, but
> since we'll need a new pack format anyway for SHA-256 why not just use
> the full length of the SHA-256 here? We're using the full length of the
> SHA-1.
>
> I don't see it mattering for security / corruption detection purposes,
> but just to avoid confusion. We'll have this one place left where
> something looks like a SHA-1, but is actually a truncated SHA-256.

I would prefer to see us at least explore whether the gain in
throughput is sufficiently big if we switch to a weaker checksum, like
crc32.  If it does not give us sufficient gain, I'd agree with you that
consistently using the full hash everywhere would conceptually be
cleaner.




Re: Questions about the hash function transition

2018-08-23 Thread Ævar Arnfjörð Bjarmason


On Thu, Aug 23 2018, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  writes:
>
>>> - The trailer consists of the following:
>>>   - A copy of the 20-byte SHA-256 checksum at the end of the
>>> corresponding packfile.
>>>
>>>   - 20-byte SHA-256 checksum of all of the above.
>>
>> We need to update both of these to 32 byte, right? Or are we planning to
>> truncate the checksums?
>
> https://public-inbox.org/git/ca+55afwc7uq61ebnj36pfu_abcxgya4jut-tvppj21hkhre...@mail.gmail.com/

Thanks.

Yeah for this checksum purpose even 10 or 5 characters would do, but
since we'll need a new pack format anyway for SHA-256 why not just use
the full length of the SHA-256 here? We're using the full length of the
SHA-1.

I don't see it mattering for security / corruption detection purposes,
but just to avoid confusion. We'll have this one place left where
something looks like a SHA-1, but is actually a truncated SHA-256.


Re: Questions about the hash function transition

2018-08-23 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

>> - The trailer consists of the following:
>>   - A copy of the 20-byte SHA-256 checksum at the end of the
>> corresponding packfile.
>>
>>   - 20-byte SHA-256 checksum of all of the above.
>
> We need to update both of these to 32 byte, right? Or are we planning to
> truncate the checksums?

https://public-inbox.org/git/ca+55afwc7uq61ebnj36pfu_abcxgya4jut-tvppj21hkhre...@mail.gmail.com/


Questions about the hash function transition

2018-08-23 Thread Ævar Arnfjörð Bjarmason
I wanted to send another series to clarify things in
hash-function-transition.txt, but for some of the issues I don't know
the answer, and I had some questions after giving this another read.

So let's discuss that here first. Quoting from the document (available
at
https://github.com/git/git/blob/v2.19.0-rc0/Documentation/technical/hash-function-transition.txt)

> Git hash function transition
> 
>
> Objective
> -
> Migrate Git from SHA-1 to a stronger hash function.

Should we say "Migrate Git from SHA-1 to SHA-256" here instead?

Maybe it's overly specific, i.e. really we're also describing how /any/
hash function transition might happen, but having just read this now
from start to finish it takes us a really long time to mention (and at
first, only offhand) that SHA-256 is the new hash.

> [...]
> Goals
> -
> 1. The transition to SHA-256 can be done one local repository at a time.
>a. Requiring no action by any other party.
>b. A SHA-256 repository can communicate with SHA-1 Git servers
>   (push/fetch).
>c. Users can use SHA-1 and SHA-256 identifiers for objects
>   interchangeably (see "Object names on the command line", below).
>d. New signed objects make use of a stronger hash function than
>   SHA-1 for their security guarantees.
> 2. Allow a complete transition away from SHA-1.
>a. Local metadata for SHA-1 compatibility can be removed from a
>   repository if compatibility with SHA-1 is no longer needed.
> 3. Maintainability throughout the process.
>a. The object format is kept simple and consistent.
>b. Creation of a generalized repository conversion tool.
>
> Non-Goals
> -
> 1. Add SHA-256 support to Git protocol. This is valuable and the
>logical next step but it is out of scope for this initial design.

This is a non-goal according to the docs, but now that we have protocol
v2 in git, perhaps we could start specifying or describing how this
protocol extension will work?

> [...]
> 3. Intermixing objects using multiple hash functions in a single
>repository.

But isn't that the goal now per "Translation table" & writing both SHA-1
and SHA-256 versions of objects?

> [...]
> Pack index
> ~~
> Pack index (.idx) files use a new v3 format that supports multiple
> hash functions. They have the following format (all integers are in
> network byte order):
>
> - A header appears at the beginning and consists of the following:
>   - The 4-byte pack index signature: '\377t0c'
>   - 4-byte version number: 3
>   - 4-byte length of the header section, including the signature and
> version number
>   - 4-byte number of objects contained in the pack
>   - 4-byte number of object formats in this pack index: 2
>   - For each object format:
> - 4-byte format identifier (e.g., 'sha1' for SHA-1)

So, given that we have a 4-byte limit and have decided on SHA-256, are we
just going to call this 'sha2'? That might be confusingly ambiguous
since SHA2 is a standard with more than just SHA-256, maybe 's256', or
maybe we should give this 8 bytes with trailing \0s so we can have
"SHA-1\0\0\0" and "SHA-256\0"?

> [...]
> - The trailer consists of the following:
>   - A copy of the 20-byte SHA-256 checksum at the end of the
> corresponding packfile.
>
>   - 20-byte SHA-256 checksum of all of the above.

We need to update both of these to 32 byte, right? Or are we planning to
truncate the checksums?

This seems like just a mistake when we did s/NewHash/SHA-256/g, but then
again it was originally "20-byte NewHash checksum" ever since 752414ae43
("technical doc: add a design doc for hash function transition",
2017-09-27), so what do we mean here?

> Loose object index
> ~~
> A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
> all loose objects. Its format is
>
>   # loose-object-idx
>   (sha256-name SP sha1-name LF)*
>
> where the object names are in hexadecimal format. The file is not
> sorted.
>
> The loose object index is protected against concurrent writes by a
> lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
> object:
>
> 1. Write the loose object to a temporary file, like today.
> 2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
> 3. Rename the loose object into place.
> 4. Open loose-object-idx with O_APPEND and write the new object
> 5. Unlink loose-object-idx.lock to release the lock.
>
> To remove entries (e.g. in "git pack-refs" or "git-prune"):
>
> 1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the
>lock.
> 2. Write the new content to loose-object-idx.lock.
> 3. Unlink any loose objects being removed.
> 4. Rename to replace loose-object-idx, releasing the lock.

Do we expect multiple concurrent writers to poll the lock if they can't
acquire it right away? I.e. concurrent "git commit" would block? Has this
overall approach been benchmarked somewhere?

I wonder if some lock-less variant of this would