Re: Re: Merge with git-pasky II.

2005-04-18 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Sun, 17 Apr 2005, Ingo Molnar wrote:
> > 
> > in fact, this attack cannot even be proven to be malicious, purely via 
> > the email from Malice: it could be incredible bad luck that caused that 
> > good-looking patch to be mistakenly matching a dangerous object.
> 
> I really hate theoretical discussions.

i was only replying to your earlier point:

> > > Almost all attacks on sha1 will depend on _replacing_ a file with 
> > > a bogus new one. So guys, instead of using sha256 or going 
> > > overboard, just make sure that when you synchronize, you NEVER 
> > > import a file you already have.

which point i still believe is subtly wrong. You were suggesting to 
concentrate on file replacement to counter most of the practical 
attacks, while i pointed out an attack _using the same basic mechanism 
that your point above supposed_.

[ if you can replace a file with a known hash, with a bogus new one, and 
  you still have enough control over the contents of your bogus new file 
  that it is 1) a valid file that builds 2) compromises the kernel, then 
  you likely have the same amount of control my 'theoretical' attack
  requires. ]

> And the thing is, _if_ somebody finds a way to make sha1 act as just a 
> complex parity bit, and comes up with generating a clashing object 
> that actually makes sense, then going to sha256 is likely pointless 
> too [...]

yes, that's why i suggested to not actually trust the hash to be 
cryptographically secure, but to just assume it's a good generic hash we 
can design a DB around, and to turn -DCOLLISION_CHECK on and enforce 
consistency rules on boundaries.

[ it's not bad to keep sha1 because even my suggested enhancement still
  leaves 'content-less trust-pointers to untrusted content via email'
  vectors open against attack (maintainer sends you an email that commit
  X in Malice's repository Y is fine to pull, and you pull it blindly,
  while the attacker has replaced his content with the compromised one
  meanwhile), but it at least validates the bulk traffic that goes into
  the DB: patches via emails and trusted repositories. ]

so all i was suggesting was to extend your suggested 'overwrite 
collision check' to a stricter 'content we throw away and use the sha1 
shortcut for needs to be checked against the in-DB content as well'.

in other words, your suggested 'rename check' is checking for 'positive 
duplicate content', while my addition would also check for 'negative 
duplicate content' as well.

but as usual, i could be wrong, so dont take this too serious :-)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-17 Thread Linus Torvalds


On Sun, 17 Apr 2005, Ingo Molnar wrote:
> 
> in fact, this attack cannot even be proven to be malicious, purely via 
> the email from Malice: it could be incredible bad luck that caused that 
> good-looking patch to be mistakenly matching a dangerous object.

I really hate theoretical discussions. 

The fact is, a lot of _crap_ engineering gets done because of the question
"what if?". It results in over-engineering, often to the point where the 
end result is quite a lot measurably worse than the sane results.

You are _literally_ arguing for the equivalent of "what if a meteorite hit
my plane while it was in flight - maybe I should add three inches of
high-tension armored steel around the plane, so that my passengers would
be protected".

That's not engineering. That's five-year-olds discussing building their
imaginary forts ("I want gun-turrets and a mechanical horse one mile high,
and my command center is 5 miles under-ground and totally encased in 5
meters of lead").

I absolutely _hate_ doing engineering on the principle of "this might be
possible in theory", and I'm violently opposed to it. So far, I have not
heard a single argument that I consider even _remotely_ likely.

The thing is, even if you can force a hash collission by sending somebody 
a patch, it's really pretty much almost guaranteed that the patch is not 
just "a few strange characters", unless sha1 is really broken to the point 
where it's not cryptographically secure _at_all_.

In other words, unless somebody finds a way to make sha1 appear as nothing
more than a complicated set of parity bits, all brute-force "get the same
sha1" is likely to be about generating a really strange blob based on the
thing you want to replace - and by "really strange" I mean total binary
crap. And likely _much_ bigger too. And by "much bigger" I mean "possibly
gigabytes of data".

And the thing is, _if_ somebody finds a way to make sha1 act as just a
complex parity bit, and comes up with generating a clashing object that
actually makes sense, then going to sha256 is likely pointless too - I
think the algorithm is basically the same, just with more bits. If you've
broken sha1 to the point where it's _that_ breakable, then you've likely
broken sha256 too. Nobody has ever proven that you couldn't break sha256 
with some really clever algorithm...

So if you start playing "what if?" games, dammit, I can play mine.

If we want to have any kind of confidence that the hash is reall
yunbreakable, we should make it not just longer than 160 bits, we should
make sure that it's two or more hashes, and that they are based on totally
different principles.

And we should all digitally sign every single object too, and we should
use 4096-bit PGP keys and unguessable passphrases that are at least 20
words in length. And we should then build a bunker 5 miles underground,
encased in lead, so that somebody cannot flip a few bits with a ray-gun, 
and make us believe that the sha1's match when they don't. Oh, and we need 
to all wear aluminum propeller beanies to make sure that they don't use 
that ray-gun to make us do the modification _outselves_.

And the thing is, that's just crazy talk. The difference between a crazy
person and an intelligent one is that the crazy one doesn't realize what
makes sense in the world. The goal of good engineering is not to ask "what
if?", but to ask "how do I make this work as well as possible".

So please stop with the theoretical sha1 attacks. It is simply NOT TRUE
that you can generate an object that looks halfway sane and still gets you
the sha1 you want. Even the "breakage" doesn't actually do that.  And if
it ever _does_ become true, it will quite possibly be thanks to some
technology that breaks other hashes too.

So until proven otherwise, I worry about accidental hashes, and in 160
bits of good hashing, that just isn't an issue either. Anybody who
compares a 128-bit md5-sum to a 160-bit sha1 doesn't understand the math.  
It didn't get "slightly less likely" to happen. It got so _unbelievably_
less likely to happen that it's not even funny.

Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-17 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> The compromise relies on you having reviewed something harmless, while 
> in reality what happened within the DB was far less harmless. And the 
> DB remains self-consistent: neither fsck, nor others importing your 
> tree will be able to detect the compromise. This attack can only be 
> detected when you apply the patch, after that point all the 
> information (except Malice's message in your inbox) is gone.

in fact, this attack cannot even be proven to be malicious, purely via 
the email from Malice: it could be incredible bad luck that caused that 
good-looking patch to be mistakenly matching a dangerous object.

In fact this could happen even today, _accidentally_. (but i'm willing 
to bet that hell will be freezing over first, and i'll have some really 
good odds ;) There's probably a much higher likelyhood of Linus' tree 
getting corrupted in some old fashioned way and introducing a security 
hole by accident)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-17 Thread Ingo Molnar

* Brad Roberts <[EMAIL PROTECTED]> wrote:

> While I agree that a hash collision is bad and certainly worth 
> preventing during new object creation, for it to actually implant a 
> trojan in a build successfully it'd have to meet even more criteria 
> than you've layed out. It'd have to...

>   - provide all the public symbols the shadowed object provided so that it
> would still build and link successfully

that's not a problem. Most modules dont provide public symbols.  
Especially not drivers. Generally it's the modules that _dont_ have any 
global impact that get reviewed less stringently - an attacker would 
thus choose them for psychological reasons anyway.

>   - be shadowing an object that's part of an active tree
>
> Shadowing an object that's not part of the working tree means 
> something on another branch or obsoleted some time in the past is 
> still db corruption, but not nearly as big an issue from a trojan 
> standpoint.

it's not DB corruption, it's a feature of GIT: it's a content _cache_, 
new and old alike. Nothing in GIT says that old objects in the 
repository (which are still very much part of history) cannot be revived 
in newer trees. (in fact it regularly happens - e.g. if a fix is undone 
manually.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-17 Thread Brad Roberts
On Sun, 17 Apr 2005, Ingo Molnar wrote:

> Date: Sun, 17 Apr 2005 16:52:32 +0200
> From: Ingo Molnar <[EMAIL PROTECTED]>
> To: Linus Torvalds <[EMAIL PROTECTED]>
> Cc: Petr Baudis <[EMAIL PROTECTED]>, Simon Fowler <[EMAIL PROTECTED]>,
>  David Lang <[EMAIL PROTECTED]>, git@vger.kernel.org
> Subject: Re: Re: Merge with git-pasky II.
>
>
> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> > Almost all attacks on sha1 will depend on _replacing_ a file with a
> > bogus new one. So guys, instead of using sha256 or going overboard,
> > just make sure that when you synchronize, you NEVER import a file you
> > already have.
>
> With tens of thousands (or hundreds of thousands) of objects expected in
> the repository sooner or later, there's quite a selection to pick from.
> Once you apply the patch, instead of the expected new file that you
> reviewed and found safe, the attacker has the other object included in
> the official kernel.
>
> A dangerous object can be anything: e.g. a debugging hack that allows
> arbitrary kernel-space writes. Or a known-insecure module (which since
> then got fixed, but the buggy code still exists in the DB). The module
> is in a single file and is self-installing (e.g. it has __init code to
> register itself as some driver.)

While I agree that a hash collision is bad and certainly worth preventing
during new object creation, for it to actually implant a trojan in a build
successfully it'd have to meet even more criteria than you've layed out.
It'd have to...

  - be shadowing an object that's part of an active tree
  - provide all the public symbols the shadowed object provided so that it
would still build and link successfully

Shadowing an object that's not part of the working tree means something on
another branch or obsoleted some time in the past is still db corruption,
but not nearly as big an issue from a trojan standpoint.

Later,
Brad


-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-17 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Almost all attacks on sha1 will depend on _replacing_ a file with a 
> bogus new one. So guys, instead of using sha256 or going overboard, 
> just make sure that when you synchronize, you NEVER import a file you 
> already have.

here is a bit complex, but still practical attack that doesnt rely on 
replacement and which can only be detected if we check the sha1 
uniqueness assumptions.

If you can generate a duplicate sha1 key for an arbitrary 'target' file, 
and Malice sends you a GIT-generated patch that introduces a new file 
(which doesnt exist in the current tree) which you review (in the email) 
and which looks safe to apply & harmless. Maybe the patch has a bit 
weird formatting and some weird comments (which in reality Malice used 
to generate the proper sha1 key) but otherwise the patch is for some 
seldom used arcane driver that no-one used for quite some time and 
no-one really cares about, so you are happy to apply the patch.

The compromise occurs when you apply the patch: the seemingly harmless 
patch has an sha1 key that Malice manufacured to match that of an 
already existing, 'dangerous' object in your database.

With tens of thousands (or hundreds of thousands) of objects expected in 
the repository sooner or later, there's quite a selection to pick from.  
Once you apply the patch, instead of the expected new file that you 
reviewed and found safe, the attacker has the other object included in 
the official kernel.

A dangerous object can be anything: e.g. a debugging hack that allows 
arbitrary kernel-space writes. Or a known-insecure module (which since 
then got fixed, but the buggy code still exists in the DB). The module 
is in a single file and is self-installing (e.g. it has __init code to 
register itself as some driver.)

Malice might even previously plant a dangerous object as some 'firmware 
module' in another arcane driver, which doesnt get compiled by default, 
but still shows up in the DB. Or Malice might plant a dangerous object 
via an innocent-looking documentation file.  (which contains some sample 
code and is called sample.txt)

this type of 'false sharing attack' can only be prevented if an object 
is only 'shared' with another object if it has been memcmp-ed with the 
object in the repository. I.e. if we trust the sharing decision! Once 
the attack has occured it cannot be detected automatically: only people 
will notice it. (why did that weird unrelated module show up in that old 
driver?)

The compromise relies on you having reviewed something harmless, while 
in reality what happened within the DB was far less harmless. And the DB 
remains self-consistent: neither fsck, nor others importing your tree 
will be able to detect the compromise. This attack can only be detected 
when you apply the patch, after that point all the information (except 
Malice's message in your inbox) is gone.

so unless we actively check for collisions, once an sha1 key can be 
generated at will on near-arbitrary input, it's not a secure system 
anymore. We might be lucky and safe, but we wont be secure.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-16 Thread David Lang
On Sat, 16 Apr 2005, Linus Torvalds wrote:
Almost all attacks on sha1 will depend on _replacing_ a file with a bogus
new one. So guys, instead of using sha256 or going overboard, just make
sure that when you synchronize, you NEVER import a file you already have.
It's really that simple. Add "--ignore-existing" to your rsync scripts,
and you're pretty much done. That guarantees that a new evil blob by the
next mad scientist out to take over the world will never touch your
repository, and if we make this part of the _standard_ scripts, then
dammit, security is in good _practices_ rather than just relying blindly
on the hash being secure.
In other words, I think we could have used md5's as the hash, if we just
make sure we have good practices. And it wouldn't have been "insecure".
The fact is, you don't merge with people you don't trust. If you don't
trust them, they have a much easier time corrupting your repository by
just creating bugs in the code and checking that thing in. Who cares about
hash collisions, when you can generate a kernel root vulnerability by just
adding a single line of code and use the _correct_ hash for it.
So the sha1 hash does not replace _trust_. That comes from something else
altogether.
What I am bringing up is not intended to be a trust thing, but instead a 
safety thing, accidents, not evil intent. makeing the rsync scripts 
--ignore-existing will avoid corrupting local data when pulling remotely, 
but it won't solve the problem of running into a collision locally (and 
won't do much to help you figure out what's wrong when you run into a 
remote collision)

David Lang
--
There are two ways of constructing a software design. One way is to make it so 
simple that there are obviously no deficiencies. And the other way is to make 
it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-16 Thread Simon Fowler
On Sat, Apr 16, 2005 at 06:03:33PM +0200, Petr Baudis wrote:
> Dear diary, on Sat, Apr 16, 2005 at 05:55:37PM CEST, I got a letter
> where Simon Fowler <[EMAIL PROTECTED]> told me that...
> > On Sat, Apr 16, 2005 at 05:19:24AM -0700, David Lang wrote:
> > > Simon
> > > 
> > > given that you have multiple machines creating files, how do you deal 
> > > with 
> > > the idea of the same 'unique id' being assigned to different files by 
> > > different machines?
> > > 
> > The id is a sha1 hash of the current time and the full path of the
> > file being added - the chances of that being replicated without
> > malicious intent is extremely small. There are other things that
> > could be used, like the hostname, username of the person running the
> > program, etc, but I don't really see them being necessary.
> 
> Why not just use UUID?
> 
Hey, everything else in git seems to use sha1, so I just copied
Linus' sha1 code ;-)

All I wanted was something that had a good chance of being unique
across any potential set of distributed repositories, to avoid the
chance of accidental clashes. A sha1 hash of something that's not
likely to be replicated is a simple way to do that.

Simon

-- 
PGP public key Id 0x144A991C, or http://himi.org/stuff/himi.asc
(crappy) Homepage: http://himi.org
doe #237 (see http://www.lemuria.org/DeCSS) 
My DeCSS mirror: ftp://himi.org/pub/mirrors/css/ 


signature.asc
Description: Digital signature


Re: Re: Merge with git-pasky II.

2005-04-16 Thread Linus Torvalds


On Sat, 16 Apr 2005, Petr Baudis wrote:

> Dear diary, on Sat, Apr 16, 2005 at 05:55:37PM CEST, I got a letter
> where Simon Fowler <[EMAIL PROTECTED]> told me that...
>
> > The id is a sha1 hash of the current time and the full path of the
> > file being added - the chances of that being replicated without
> > malicious intent is extremely small. There are other things that
> > could be used, like the hostname, username of the person running the
> > program, etc, but I don't really see them being necessary.
> 
> Why not just use UUID?

Note that using anything that isn't data-related totally destroys the 
whole point of the object database. Remember: any time we don't uniquely 
generate the same name for the same object, we'll waste disk-space.

So adding in user/machine/uuid's to the thing is always a mistake. The 
whole thing depends on the hash being as close to 1:1 with the contents as 
humanly possible. 

There's also the issue of size. Yes, I could have chosen sha256 instead of
sha1. But the keys would be almost twice as big, which in turn means that 
the "tree" objects would be bigger, and that the "index" file would be 
bigger.

Is that a huge problem? No. We can certainly move to it if sha1 ever shows
itself to be weak. But I really think we are much better off just
re-generating the whole tree and history at that point, rather than try to 
predict the future.

The fact is, with current knowledge, sha1 _is_ safe for what git uses it 
for, for the forseeable future. And we have a migration strategy if I'm 
wrong. Don't worry about it.

Almost all attacks on sha1 will depend on _replacing_ a file with a bogus
new one. So guys, instead of using sha256 or going overboard, just make 
sure that when you synchronize, you NEVER import a file you already have.

It's really that simple. Add "--ignore-existing" to your rsync scripts,
and you're pretty much done. That guarantees that a new evil blob by the
next mad scientist out to take over the world will never touch your
repository, and if we make this part of the _standard_ scripts, then
dammit, security is in good _practices_ rather than just relying blindly
on the hash being secure.

In other words, I think we could have used md5's as the hash, if we just
make sure we have good practices. And it wouldn't have been "insecure".

The fact is, you don't merge with people you don't trust. If you don't
trust them, they have a much easier time corrupting your repository by
just creating bugs in the code and checking that thing in. Who cares about
hash collisions, when you can generate a kernel root vulnerability by just
adding a single line of code and use the _correct_ hash for it.

So the sha1 hash does not replace _trust_. That comes from something else 
altogether.

Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-16 Thread Petr Baudis
Dear diary, on Sat, Apr 16, 2005 at 05:55:37PM CEST, I got a letter
where Simon Fowler <[EMAIL PROTECTED]> told me that...
> On Sat, Apr 16, 2005 at 05:19:24AM -0700, David Lang wrote:
> > Simon
> > 
> > given that you have multiple machines creating files, how do you deal with 
> > the idea of the same 'unique id' being assigned to different files by 
> > different machines?
> > 
> The id is a sha1 hash of the current time and the full path of the
> file being added - the chances of that being replicated without
> malicious intent is extremely small. There are other things that
> could be used, like the hostname, username of the person running the
> program, etc, but I don't really see them being necessary.

Why not just use UUID?

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-15 Thread Petr Baudis
Dear diary, on Fri, Apr 15, 2005 at 12:22:26PM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> After I re-read [*R1*], in which Linus talks about dircache,
> especially this section:
> 
>  - The "current directory cache" describes some baseline. In particular,
>note the "some" part. It's not tied to any special baseline, and you
>can change your baseline any way you please.
> 
>So it does NOT have to track any particular state in either the object 
>database _or_ in your actual current working tree. In fact, all real 
>interactions with "git" are really about updating this staging area one 
>way or the other: you might check out the state from it into your 
>working area (partially or fully), you can push your working area into 
>the staging area (again, partially or fully).
> 
>And if you want to, you can write the thing that the staging area 
>represents as a "tree" into the object database, or you can merge a 
>tree from the object database into the staging area.
> 
>In other words: the staging area aka "current directory cache" is 
>really how all interaction takes place. The object database never 
>interacts directly with your working directory contents. ALL 
>interactions go through the current directory cache.
> 
> I started to have more doubts on the approach of *not*
> performing the merge in the dircache I set up specifically for
> merging, which is the direction in which you are pushing if I
> understand you correctly.  Maybe I completely misunderstand what
> you want.  This message is long but I need a clear understanding
> of what is expected to be useful to you, so please bear with me.



> PB>   merge-tree.pl -b $base $(tree-id) $merged | parse-your-output
> 
> Please help me understand this example you have given earlier.
> Here is my understanding of your assumption when the above
> pipeline takes place.  Correct me if I am mistaken.
> 
>  * The user is in a working directory $W.  It is controlled by
>git-tools and there are $W/.git/. directory and $W/.git/index
>dircache.
> 
>  * The dircache $W/.git/index started its life as a read-tree
>from some commit.  The git-tools is keeping track of which
>commit it is somewhere, presumably in $W/.git/ directory.
>Let's call it $C (commit).
> 
>  ? Question.  Is the $(tree-id) in your example the same as $C
>above?

Yes. Actually $(tree-id) returns ID of the tree object, not the commit
object; but that doesn't matter here, probably - let's ignore that
distinction for simplicity.

>  * The user have run [*1*] (see Footnote below) checkout-cache
>on $W/.git/index some time in the past and $W is full of
>working files.  Some of them may or may not have modified.
>There may be some additions or deletions.  So the contents of
>the working directory may not match the tree associated with
>$C.
> 
>  * The user may or may not have run [*1*] update-cache in $W.
>The contents of the dircache $W/.git/index may not match the
>tree associated with $C.
> 
>  ? Question.  Are you forbidding the user to run update-cache by
>hand, and keeping track of the changes yourself, to be
>applied all at once at "git commit" time, thereby
>guaranteeing the $W/.git/index to match the tree associated
>with $C all times?  From the description of The "GIT toolkit"
>section in README, it is not clear to me which part of his
>repository an end user is not supposed to muck with himself.

Ideally, he shouldn't be using *any* of the low-level plumbing by now.
The only exception is update-cache --refresh, which he can do at will
(I'm yet thinking what to do with it :-).

The git-tools always assume that index basically contains the state as
of the last commit. (Actually the only time when this matters *now*
might be git diff - the user would get confused from the results.)

>  * Now the user has some changes in his working directory and
>notices upstream or a side branch has notable changes
>desireble to be picked up.  So he runs some git-tools command
>to cause the above quoted pipeline to run.
> 
>  ? Question.  Does $merged in your example mean such an upstream
>or side branch?  Is $base in your example the common ancestor
>between $C and $merged?

Correct.

*HOWEVER* what is not correct is that git-tools would let you merge in
your working directory while you have local changes there.

In the past, the merge would happen in your working tree, but git-tools
wouldn't let you go for it unless your working tree has no local
changes. It would complain loudly and refuse to, since it's *not* what
you want to do and it was most likely a mistake.

Currently, git merge just creates a ,,merge/ subdirectory sharing the
object database with your working tree, but with an independent checkout
of it; it will do the merge there, and when you commit it there, it will
update your working tree with the merged chan

Re: Re: Merge with git-pasky II.

2005-04-15 Thread Petr Baudis
Dear diary, on Fri, Apr 15, 2005 at 02:58:25AM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> > "PB" == Petr Baudis <[EMAIL PROTECTED]> writes:
> >> I think the above would result in what SCM person would call
> >> "merge upstream/sidestream changes into my working directory".
> 
> PB> And that's exactly what I'm doing now with git merge. ;-) In fact,
> PB> ideally the whole change in my scripts when your script is finished
> PB> would be replacing
> 
> PB>   checkout-cache `diff-tree` # symbolic
> PB>   git diff $base $merged | git apply
> 
> PB> with
> 
> PB>   merge-tree.pl -b $base $(tree-id) $merged | parse-your-output
> 
> In the above I presume by $merged you mean the tree ID (or
> commit ID) the user's working directory is based upon?  Well,
> merge-trees (Linus has a single directory merge-tree already)
> looks at tree IDs (or commit IDs); it would never involve
> working files in random state that is not recorded as part of a
> tree (committed or not).  Given that constraints I am not sure
> how well that would pan out.  I have to think about this a bit.

No, $(tree-id) is the "destination branhc", what the user directory is
based upon; $merged is the branch you are merging now, relative to
$base. When I throw away the useless "-b" argument, in practice it would
look like

merge-trees abcd 1234 5678

for doing

  /-- 1234 -+-
abcd < /
  \-- 5678

(not that the order of 1234 and 5678 would actually really matter)

I fear I don't understand the rest of your paragraph. :-(

> I do like, however, the idea of separating the step of doing any
> checkout/merge etc. and actually doing them.  So the command set
> of parse-your-output needs to be defined.  Based on what I have
> done so far, it would consist of the following:
> 
>  - Result is this object $SHA1 with mode $mode at $path (takes
>one of the trees); you can do update-cache --cacheinfo (if
>you want to muck with dircache) or cat-file blob (if you want
>to get the file) or both.
> 
>  - Result is to delete $path.
> 
>  - Result is a merge between object $SHA1-1 and $SHA1-2 with
>mode $mode-1 or $mode-2 at $path.
> 
> Would this be a good enough command set?

What about the conflicts? Like one tree deleting, other tree modifying?

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-14 Thread Christopher Li
BTW, I am not competing with Junio script. If that is the way
we all agree on. It is should be very easy for Junio to fix his
perl script. right?

Chris

On Thu, Apr 14, 2005 at 04:37:17PM -0400, Christopher Li wrote:
> Is that some thing you want to see? Maybe clean up the error printing.
> 
> 
> Chris
> 
> --- /dev/null 2003-01-30 05:24:37.0 -0500
> +++ merge.py  2005-04-14 16:34:39.0 -0400
> @@ -0,0 +1,76 @@
> +#!/usr/bin/env python
> +
> +import re
> +import sys
> +import os
> +from pprint import pprint
> +
> +def get_tree(commit):
> +data = os.popen("cat-file commit %s"%commit).read()
> +return re.findall(r"(?m)^tree (\w+)", data)[0]
> +
> +PREFIX = 0
> +PATH = -1
> +SHA = -2
> +ORIGSHA = -3
> +
> +def get_difftree(old, new):
> +lines = os.popen("diff-tree %s %s"%(old, new)).read().split("\x00")
> +patterns = (r"(\*)(\d+)->(\d+)\s(\w+)\s(\w+)->(\w+)\s(.*)",
> + r"([+-])(\d+)\s(\w+)\s(\w+)\s(.*)")
> +res = {}
> +for l in lines:
> + if not l: continue
> + for p in patterns:
> + m = re.findall(p, l)
> + if m:
> + m = m[0]
> + res[m[-1]] = m
> + break
> + else:
> + raise "difftree: unknow line", l
> +return res
> +
> +def analyze(diff1, diff2):
> +diff1only = [ diff1[k] for k in diff1 if k not in diff2 ]
> +diff2only = [ diff2[k] for k in diff2 if k not in diff1 ]
> +both = [ (diff1[k],diff2[k]) for k in diff2 if k in diff1 ]
> +
> +action(diff1only)
> +action(diff2only)
> +action_two(both)
> +
> +def action(diffs):
> +for act in diffs:
> + if act[PREFIX] == "*":
> + print "modify", act[PATH], act[SHA]
> + elif act[PREFIX] == '-':
> + print "remove", act[PATH], act[SHA]
> + elif act[PREFIX] == '+':
> + print "add", act[PATH], act[SHA]
> + else:
> + raise "unknow action"
> +
> +def action_two(diffs):
> +for act1, act2 in diffs:
> + if len(act1) == len(act2):  # same kind type
> + if act1[PREFIX] == act2[PREFIX]:
> + if act1[SHA] == act2[SHA] or act1[PREFIX] == '-': 
> + return action(act1)
> + if act1[PREFIX]=='*':
> + print "do_merge", act1[PATH], act1[ORIGSHA], act1[SHA], 
> act2[SHA]
> + return
> + print "unable to handle", act[PATH]
> + print "one side wants", act1[PREFIX]
> + print "the other side wants", act2[PREFIX]
> + 
> +
> +args = sys.argv[1:]
> +if len(args)!=3:
> +print "Usage merge.py   "
> +trees = map(get_tree, args)
> +print "checkout-tree", trees[0]
> +diff1 = get_difftree(trees[0], trees[1])
> +diff2 = get_difftree(trees[0], trees[2])
> +analyze(diff1, diff2)
> +
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-14 Thread Christopher Li
Is that some thing you want to see? Maybe clean up the error printing.


Chris

--- /dev/null   2003-01-30 05:24:37.0 -0500
+++ merge.py2005-04-14 16:34:39.0 -0400
@@ -0,0 +1,76 @@
+#!/usr/bin/env python
+
+import re
+import sys
+import os
+from pprint import pprint
+
+def get_tree(commit):
+data = os.popen("cat-file commit %s"%commit).read()
+return re.findall(r"(?m)^tree (\w+)", data)[0]
+
+PREFIX = 0
+PATH = -1
+SHA = -2
+ORIGSHA = -3
+
+def get_difftree(old, new):
+lines = os.popen("diff-tree %s %s"%(old, new)).read().split("\x00")
+patterns = (r"(\*)(\d+)->(\d+)\s(\w+)\s(\w+)->(\w+)\s(.*)",
+   r"([+-])(\d+)\s(\w+)\s(\w+)\s(.*)")
+res = {}
+for l in lines:
+   if not l: continue
+   for p in patterns:
+   m = re.findall(p, l)
+   if m:
+   m = m[0]
+   res[m[-1]] = m
+   break
+   else:
+   raise "difftree: unknow line", l
+return res
+
+def analyze(diff1, diff2):
+diff1only = [ diff1[k] for k in diff1 if k not in diff2 ]
+diff2only = [ diff2[k] for k in diff2 if k not in diff1 ]
+both = [ (diff1[k],diff2[k]) for k in diff2 if k in diff1 ]
+
+action(diff1only)
+action(diff2only)
+action_two(both)
+
+def action(diffs):
+for act in diffs:
+   if act[PREFIX] == "*":
+   print "modify", act[PATH], act[SHA]
+   elif act[PREFIX] == '-':
+   print "remove", act[PATH], act[SHA]
+   elif act[PREFIX] == '+':
+   print "add", act[PATH], act[SHA]
+   else:
+   raise "unknow action"
+
+def action_two(diffs):
+for act1, act2 in diffs:
+   if len(act1) == len(act2):  # same kind type
+   if act1[PREFIX] == act2[PREFIX]:
+   if act1[SHA] == act2[SHA] or act1[PREFIX] == '-': 
+   return action(act1)
+   if act1[PREFIX]=='*':
+   print "do_merge", act1[PATH], act1[ORIGSHA], act1[SHA], 
act2[SHA]
+   return
+   print "unable to handle", act[PATH]
+   print "one side wants", act1[PREFIX]
+   print "the other side wants", act2[PREFIX]
+   
+
+args = sys.argv[1:]
+if len(args)!=3:
+print "Usage merge.py   "
+trees = map(get_tree, args)
+print "checkout-tree", trees[0]
+diff1 = get_difftree(trees[0], trees[1])
+diff2 = get_difftree(trees[0], trees[2])
+analyze(diff1, diff2)
+
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-14 Thread Christopher Li
On Fri, Apr 15, 2005 at 01:31:59AM +0200, Petr Baudis wrote:
> > I am just trying to follow my understanding of what Linus
> > wanted.  One of the guiding principle is to do as much things as
> > in dircache without ever checking things out or touching working
> > files unnecessarily.
> 
> I'm just arguing that instead of directly touching the directory cache,
> you should just list what would you do there - and you already do this,

That is exactly what I suggest in the previous email. And my python script
does exactly that ;-)

Chris

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-14 Thread Petr Baudis
Dear diary, on Fri, Apr 15, 2005 at 01:12:34AM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> > "PB" == Petr Baudis <[EMAIL PROTECTED]> writes:
> 
> PB> What I would like your script to do is therefore just do the
> PB> merge in a given already prepared (including built index)
> PB> directory, with a passed base. The base should be determined
> PB> by a separate tool (I already saw some patches); most future
> PB> "science" will probably go to a clever selection of this
> PB> base, anyway.
> 
> I think you are contradicting yourself for saying the above
> after agreeing with me that the script should just work on trees
> not commits.  My understanding is that the tools is just to
> merge two related trees relative to another ancestor tree,
> nothing more.  Especially, it should not care what is in the
> working directory---that is SCM person's business.

Yes. Isn't this exactly what I'm saying?

I'm arguing for doing less in my paragraph, you are arguing for doing
less in your paragraph, and we even seem to agree on the direction in
which we should do less.

> I am just trying to follow my understanding of what Linus
> wanted.  One of the guiding principle is to do as much things as
> in dircache without ever checking things out or touching working
> files unnecessarily.

I'm just arguing that instead of directly touching the directory cache,
you should just list what would you do there - and you already do this,
I think. So I'd be happy with a switch which would just do that and not
touch the directory cache. I'll parse your output and do the right thing
for me.

> PB> This will give the tool maximal flexibility.
> 
> I suspect it would force me to have a working directory
> populated with files, just to do a merge.

Why would that be so?

> PB> I'm all for an -o, and I don't mind ,, - I just don't want it uselessly
> PB> long. I hope "git~merge~$$" was a joke... :-)
> 
> Which part do you object to?  PID part?  or tilde?  Would
> git~merge do, perhaps?  It probably would not matter to you
> because as an SCM you would always give an explicit --output
> parameter to the script anyway.

Yes. I'll just override it with ,,merge, I think. So, do whatever you
want. ;-))

> PB> By the way, what about indentation with tabs? If you have a
> PB> strong opinion about this, I don't insist - but if you
> PB> really don't mind/care either way, it'd be great to use tabs
> PB> as in the rest of the git code.
> 
> I do not have a strong opinion, but it is more trouble for me
> only because I am lazy and am used to the indentation my Emacs
> gives me.  I write code other than git, so changing Perl-mode
> indentation setting globally for all .pl files is not an option
> for me.  I'll see what I can do when I have time.

Doesn't Emacs have something equivalent to ./.vimrc? I've also seen
those funny -*- strings.

Well, if it would mean a lot of trouble for you, just forget about it.

> PB> Is there a fundamental reason why the directory cache
> PB> contains the ancestor instead of the destination branch?
> 
> Because you are thinking as an SCM person where there are
> distinction between tree-A and tree-B, two heads being merged.
> There is no "destination branch" nor "source branch" in what I
> am doing.  It is a merge of two equals derived from the same
> ancestor.

That's a valid point of view too.

Actually, when you would have a mode in which you would not write to the
directory cache, do you need to read from it? You could do just direct
cat-files like for the other trees, and it would be even faster. Then,
you could do without a directory cache altogether in this mode.

> PB> And this is another thing I dislike a lot. I'd like merge-tree.pl to
> PB> leave my directory cache alone, thank you very much. You know, I see
> PB> what goes to the directory cache as actually part of the policy part.
> 
> Remember I am not touching *your* dircache.  It is a dircache in
> the temporary merge area, specifically set up to help you review
> the merge.  

Yes, but I want to have a control over its dircache too. :-) That is
because I want the user to be able to use the regular git commands like
"git diff" there.

> Can't the SCM driver do things along this line, perhaps?
> 
>  - You have your working files and your dircache.  They may not
>match because you have uncommitted changes to your
>environment.  You want to merge with Linus head.  You know
>its SHA1 (call it COMMIT-Linus).  Your SCM knows which commit
>you started with (call it COMMIT-Current).
> 
>  - First you merge the tree associated with COMMIT-Current.  Use
>it and COMMIT-Linus to find the common ancestor to use.
> 
>  - Now use the tree SHA of COMMIT-Current, tree SHA1 of
>COMMIT-Linus, and tree SHA1 of the common ancestor commit to
>drive git-merge.perl (to be renamed ;-).  You will get a
>temporary directory.  Have your user examine what is in
>there, and fix the merge and have them tell you 

Re: Re: Re: Merge with git-pasky II.

2005-04-14 Thread Petr Baudis
Dear diary, on Thu, Apr 14, 2005 at 10:23:26PM CEST, I got a letter
where Erik van Konijnenburg <[EMAIL PROTECTED]> told me that...
> On Thu, Apr 14, 2005 at 09:35:07PM +0200, Petr Baudis wrote:
> > Hmm. I actually don't like this naming. I think it's not too consistent,
> > is irregular, therefore parsing it would be ugly. What I propose:
> > 
> > 12c\tname <- legend
> >   <- original file
> > D <- tree #1 removed file
> >  D<- tree #2 removed file
> > DD<- both trees removed file
> > M <- tree #1 modified file
> >  M
> > DM*   <- conflict, tree #1 removed file, tree #2 modified file
> > MD*
> > MM<- exact same modification
> > MM*   <- different modifications, merging
> > 
> > This is generic, theoretically scales well even to more trees, is easy
> > to parse trivially, still is human readable (actually the asterisk in
> > the 'conflict' column is there basically only for the humans), is
> > completely regular and consistent.
> 
> Detail: perhaps use underscore instead of space, to avoid space/tab typos
> that are invisible on paper and user friendly mail clients?

I'd go for dots in that case. Looks less intrusive. :^)

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-14 Thread Erik van Konijnenburg
On Thu, Apr 14, 2005 at 09:35:07PM +0200, Petr Baudis wrote:
> Hmm. I actually don't like this naming. I think it's not too consistent,
> is irregular, therefore parsing it would be ugly. What I propose:
> 
> 12c\tname <- legend
>   <- original file
> D <- tree #1 removed file
>  D<- tree #2 removed file
> DD<- both trees removed file
> M <- tree #1 modified file
>  M
> DM*   <- conflict, tree #1 removed file, tree #2 modified file
> MD*
> MM<- exact same modification
> MM*   <- different modifications, merging
> 
> This is generic, theoretically scales well even to more trees, is easy
> to parse trivially, still is human readable (actually the asterisk in
> the 'conflict' column is there basically only for the humans), is
> completely regular and consistent.

Detail: perhaps use underscore instead of space, to avoid space/tab typos
that are invisible on paper and user friendly mail clients?

Regards,
Erik
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-14 Thread Petr Baudis
Dear diary, on Thu, Apr 14, 2005 at 09:59:04PM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> > "LT" == Linus Torvalds <[EMAIL PROTECTED]> writes:
> 
> LT> On Thu, 14 Apr 2005, Junio C Hamano wrote:
> 
> >> Sorry, I have not seen what you have been doing since pasky 0.3,
> >> and I have not even started to understand the mental model of
> >> the world your tool is building.  That said, my gut feeling is
> >> that telling this script about git-pasky's world model might be
> >> a mistake.  I'd rather see you consider the script as mere "part
> >> of the plumbing". 
> 
> LT> I agree. Having separate abstraction layers is good.  I'm actually very 
> LT> happy with Pasky's cleaned-up-tree, exactly because unlike the first one, 

(Just a side-note - functionally and even organizationally, the cleaned
up tree does not differ significantly from the original one.)

> LT> Pasky did a great job of maintaining the abstraction between "plumbing" 
> LT> and user interfaces.
> 
> Agreed, not just with your agreeing with me, but with the
> statement that Pasky did a good job (although I am ashamed to
> say I have not caught up with the "userland" tools).

Thanks. :-)

> LT> The plumbing should take user interface needs into account, but the more
> LT> conceptually separate it is ("does it makes sense on its own?") the better
> LT> off we'll be. And "merge these two trees" (which works on a _tree_ level)
> LT> or "find the common commit" (which works on a _commit_ level) look like 
> LT> plumbing to me - the kind of things I should have written, if I weren't 
> LT> such a lazy slob.
> 
> I am planning drop the ancestor computation from the script, and
> make it another command line parameter to the script.  Dan
> Barkalow's merge-base program should be used to compute it and
> his result should drive the merge.  That sounds more UNIXy to
> me.

Good move, I say!

> I even may want to make the script take three trees not
> commits, since the merge script does not need commits (it only
> needs trees).  As plumbing it would be cleaner interface to it
> to do so.  The wrapper SCM scripts can and should make sure it
> is fed trees when the user gives it commits (or symbolic
> representation of it like .git/tags/blah, or `cat .git/HEAD`).

Agreed.

> But one different thing to note here.
> 
> You say "merge these two trees" above (I take it that you mean
> "merge these two trees, taking account of this tree as their
> common ancestor", so actually you are dealing with three trees),
> and I am tending to agree with the notion of merging trees not
> commits.  However you might get richer context and more sensible
> resulting merge if you say "merge these two commits".  Since
> commit chaining is part of the fundamental git object model you
> may as well use it.

Could you be more particular on the richer context etc?

I think this script should stay strictly on the level of trees. When
someone invents it, there could be a merge-commits script which does
something very smart about two commits, traversing the graph between
them etc, and doing a set of merge-tree invocations, possibly preparing
the staging area for them etc.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: Merge with git-pasky II.

2005-04-14 Thread Petr Baudis
Dear diary, on Thu, Apr 14, 2005 at 08:12:35PM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> > "PB" == Petr Baudis <[EMAIL PROTECTED]> writes:
> 
> PB> Bah, you outran me. ;-)
> 
> Just being in a different timezone, I guess.
> 
> PB> I'll change it to use the cool git-pasky stuff (commit-id etc) and its
> PB> style of committing - that is, it will merely record the update-caches
> PB> to be done upon commit, and it will read-tree the branch we are merging
> PB> to instead of the ancestor. (So that git diff gives useful output.)
> 
> Sorry, I have not seen what you have been doing since pasky 0.3,
> and I have not even started to understand the mental model of
> the world your tool is building.  That said, my gut feeling is
> that telling this script about git-pasky's world model might be
> a mistake.  I'd rather see you consider the script as mere "part
> of the plumbing".  Maybe adding an extra parameter to the script
> to let the user explicitly specify the common ancestor to use
> would be needed, but I would prefer git-pasky-merge to do its
> own magic (converting symbolic commit names into raw commit
> names and such) before calling this low level script.
> 
> That way people like me who have not migrated to your framework
> can still keep using it.  All the script currently needs is a
> bare git object database; i.e., nothing other than what is in
> .git/objects and a couple of commit record SHA1s as its
> parameters.  No .git/heads/, no .git/HEAD.local, no .git/tags,
> are involved for it to work, and I would prefer to keep things
> that way if possible.

I see, and I actually agree with it. However, I'll want merge-tree.pl to
do a little less than it does now for that, though. The mechanics in
"kernel" is fine as long as I can control policy in my "userspace". ;-)

BTW, the git* name sorta imply my toilet instead of the core plumbing, and
it'd be more consistent with the current plumbnaming; and could we have
it with the .pl extension, please? :-)

What I would like your script to do is therefore just do the merge in a
given already prepared (including built index) directory, with a passed
base. The base should be determined by a separate tool (I already saw
some patches); most future "science" will probably go to a clever
selection of this base, anyway.

This will give the tool maximal flexibility. E.g., then someone who
wants to can just merge with his working copy (if you don't give
checkout-cache -f - but why would you anyway), or do whatever other
cleverness he wants.

> >> * show-diff updates to add -r flag to squelch diffs for files not in
> >> the working directory.  This is mainly useful when verifying the
> >> result of an automated merge.
..snip..
> was too tired and did not think of a letter when I wrote it.  I
> guess '-r' stood for removed, but I agree it is a bad choice.
> Any objections to '-q'?

None here.

> >> +# Create a temporary directory and go there.
> >> +system 'rm', '-rf', ',,merge-temp';
> 
> PB> Can't we call it just ,,merge?
> 
> I'd rather have a command line option '-o' (scrapping the
> current '-o' and renaming it to something else; as you can see I
> am terrible at picking option names ;-)) to mean "output to this
> directory".  I am not really an Arch person so I do not
> particulary care about /^,,/.  How about "git~merge~$$"?

I'm all for an -o, and I don't mind ,, - I just don't want it uselessly
long. I hope "git~merge~$$" was a joke... :-)

> >> +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; }
> >> +symlink "../../.git/objects", "objects";
> >> +chdir '..';
> >> +
> >> +my $ancestor_tree = read_commit_tree($common);
> >> +system 'read-tree', $ancestor_tree;
> >> +
> >> +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0]));
> >> +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1]));
> >> +
> >> +my @ancestor_file = read_show_files();
> >> +my %ancestor_file = map { $_ => 1 } @ancestor_file;
> >> +
> >> +for (@ancestor_file) {
> >> +if (! exists $tree0{$_} && ! exists $tree1{$_}) {

By the way, what about indentation with tabs? If you have a strong
opinion about this, I don't insist - but if you really don't mind/care
either way, it'd be great to use tabs as in the rest of the git code.

> >> +  if ($full_checkout) {
> >> +  system 'checkout-cache', $_;
> >> +  }
> >> +  print STDERR "O - $_\n";
> 
> PB> Huh, what are you trying to do here? I think you should just record
> PB> remove, no? (And I wouldn't do anything with my read-tree. ;-)
> 
> At this moment in the script, we have run "read-tree" the
> ancestor so the dircache has the original.  %tree0 and %tree1
> both did not touch the path ($_ here) so it is the same as
> ancestor.  When '-f' is specified we are populating the output
> working tree with the merge result so that is what that
> 'checkout-cache' is about.  "O - $path" means "we took the
> original".

Aha! Thanks.

Is there a fundamental reason why the dire

Re: Re: Merge with git-pasky II.

2005-04-14 Thread Petr Baudis
Dear diary, on Thu, Apr 14, 2005 at 01:14:13PM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> Here is a diff to update the git-merge.perl script I showed you
> earlier today ;-).  It contains the following updates against
> your HEAD (bb95843a5a0f397270819462812735ee29796fb4).

Bah, you outran me. ;-)

>  * git-merge.perl command we talked about on the git list.  I've
>covered the changed-to-the-same case etc.  I still haven't done
>anything about file-vs-directory case yet.
> 
>It does warn when it needed to run merge to automerge and let
>merge give a warning message about conflicts if any.  In
>modify/remove cases, modified in one but removed in the other
>files are left in either $path~A~ or $path~B~ in the merge
>temporary directory, and the script issues a warning at the
>end.

I think I will take it rather my working git merge implementation - it's
getting insane in bash. ;-)

I'll change it to use the cool git-pasky stuff (commit-id etc) and its
style of committing - that is, it will merely record the update-caches
to be done upon commit, and it will read-tree the branch we are merging
to instead of the ancestor. (So that git diff gives useful output.)

>  * show-files and ls-tree updates to add -z flag to NUL terminate records;
>this is needed for git-merge.perl to work.
> 
>  * show-diff updates to add -r flag to squelch diffs for files not in
>the working directory.  This is mainly useful when verifying the
>result of an automated merge.

-r traditionally means recursive - what's the reasoning behind the
choice of this letter?

>  * update-cache updates to add "--cacheinfo mode sha1" flag to register
>a file that is not in the current working directory.  Needed for
>minimum-checkout merging by git-merge.perl.
> 
> 
> Signed-off-by: Junio C Hamano <[EMAIL PROTECTED]>
> 
> ---
> 
>  git-merge.perl |  247 
> +
>  ls-tree.c  |9 +-
>  show-diff.c|   11 +-
>  show-files.c   |   12 ++
>  update-cache.c |   25 +
>  5 files changed, 296 insertions(+), 8 deletions(-)
> 
> diff -x .git -Nru ,,1/git-merge.perl ,,2/git-merge.perl
> --- ,,1/git-merge.perl1969-12-31 16:00:00.0 -0800
> +++ ,,2/git-merge.perl2005-04-14 04:00:14.0 -0700
> @@ -0,0 +1,247 @@
> +#!/usr/bin/perl -w
> +
> +use Getopt::Long;

use strict?

> +
> +my $full_checkout = 0;
> +my $oneside_checkout = 0;
> +GetOptions("full" => \$full_checkout,
> +"oneside" => \$oneside_checkout)
> +or die;
> +
> +if ($full_checkout) {
> +$oneside_checkout = 1;
> +}
> +
> +sub read_rev_tree {
> +my (@head) = @_;
> +my ($fhi);
> +open $fhi, '-|', 'rev-tree', '--edges', @head
> + or die "$!: rev-tree --edges @head";
> +my %common;
> +while (<$fhi>) {
> + chomp;
> + (undef, undef, my @common) = split(/ /, $_);
> + for (@common) {
> + if (s/^([a-f0-f]{40}):3$/$1/) {
> + $common{$_}++;
> + }
> + }
> +}
> +close $fhi;
> +
> +my @common = (map { $_->[1] }
> +   sort { $b->[0] <=> $a->[0] }
> +   map { [ $common{$_} => $_ ] }
> +   keys %common);
> +
> +return $common[0];
> +}

It'd be simpler to do just

my @common = (map { $common{$_} }
  sort { $b <=> $a }
  keys %common)

But I really think this is a horrible heuristic. I believe you should
take the latest commit in the --edges output, and from that choose the
base whose rev-tree --edges the_base merged_branch has the least lines
on output. (That is, the path to it is shortest - ideally it's already
part of the merged_branch.)

> +
> +sub read_commit_tree {
> +my ($commit) = @_;
> +my ($fhi);
> +open $fhi, '-|', 'cat-file', 'commit', $commit
> + or die "$!: cat-file commit $commit";
> +my $tree = <$fhi>;
> +close $fhi;
> +$tree =~ s/^tree //;
> +return $tree;
> +}
> +
> +# Reads diff-tree -r output and gives a hash that maps a path
> +# to 3-tuple (old-mode new-mode new-sha).
> +# When creating, old-mode is undef.  When removing, new-* are undef.

What about

sub OLDMODE { 0 }
sub NEWMODE { 1 }
sub NEWSHA { 2 }

and then using that when accessing the tuple? Would make the code
much more readable.

> +sub read_diff_tree {
> +my (@tree) = @_;
> +my ($fhi);
> +local ($_, $/);
> +$/ = "\0"; 
> +my %path;
> +open $fhi, '-|', 'diff-tree', '-r', @tree
> + or die "$!: diff-tree -r @tree";
> +while (<$fhi>) {
> + chomp;
> + if (/^\*([0-7]+)->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$/s) {
> + $path{$4} = [$1, $2, $3];
> + }
> + elsif (/^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$/s) {
> + $path{$3} = [undef, $1, $2];
> + }
> + elsif (/^\-([0-7]+)\tblob\t[0-9a-f]{40}\t(.*)$/s) {
> + $path{$2} = [$1, undef, undef];
> + }
> + else {
> +