Re: Re: more git updates..

2005-04-13 Thread Matt Mackall
On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> > 
> > I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> > the CVS/SCCS format as storage may be more appealing than the current
> > git format.
> 
> Go wild. I did mine in six days, and you've been whining about other 
> people's SCMs for three years.

I wrote a hack to do efficient delta storage with O(1) seeks for
lookup and append last week; I believe it's been integrated into the
latest Bazaar-NG. I expect it'll give better compression and
performance than BK. Of course it ends up being O(revisions) for
modifications or insertions (but that is probably a non-issue for the
SCM models we're looking at).

The git model is obviously very different, but I worry about the slop
space implied. With 200k file revisions and an average of 2k slop per
file, that's 400MB of slop, or almost the size of an equivalent delta
compressed kernel repo.
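
The slop estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the 4096-byte block size is an assumption (the message gives just the 2k average), and `slop` and `total_slop` are hypothetical helper names.

```python
def slop(size: int, block: int = 4096) -> int:
    """Bytes wasted when a file of `size` bytes occupies whole blocks.

    The 4096-byte default is an assumption, not a figure from the thread.
    """
    return (-size) % block


def total_slop(revisions: int, avg_slop: int) -> int:
    """Total wasted bytes for `revisions` files with `avg_slop` bytes each."""
    return revisions * avg_slop


# 200k file revisions at an average of 2k slop each, as in the message.
estimate = total_slop(200_000, 2_000)
print(f"{estimate / 1_000_000:.0f} MB")  # prints "400 MB"
```

With file sizes spread evenly across a 4k block, the expected waste per file is about half a block, which is roughly where a 2k average would come from.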

Now if you can assume that blobs never change and are never deleted,
you can simply append them all onto a log, and then index them with a
separate file containing an htree of (sha1, offset, length) or the
like. Since the key is already a strong hash, this is an excellent
match and avoids rehashing in the kernel's directory lookup. And it'll
save an inode, a directory entry, and about half a data block per
entry. "Open" will also be cheaper as there's no per-revision inode to
grab.
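
A minimal sketch of the append-only layout described above, assuming blobs are immutable and never deleted. The in-memory dict stands in for the on-disk htree, and the `BlobLog` class and its method names are hypothetical, not part of git.

```python
import hashlib
import io


class BlobLog:
    """Append-only blob log plus an index of sha1 -> (offset, length)."""

    def __init__(self, log=None):
        # Any seekable binary file works; BytesIO keeps the sketch self-contained.
        self.log = log if log is not None else io.BytesIO()
        self.index = {}  # stand-in for the separate htree index file

    def append(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        if key in self.index:          # identical blob already stored
            return key
        offset = self.log.seek(0, io.SEEK_END)
        self.log.write(data)
        self.index[key] = (offset, len(data))
        return key

    def read(self, key: str) -> bytes:
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)


blobs = BlobLog()
first = blobs.append(b"hello")
second = blobs.append(b"world")
```

Because every blob lives in one file, a lookup costs one index probe plus one seek, and no per-blob inode or directory entry is needed.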

I could hack on this if you think it fits with the git model,
otherwise I'll go back to my other experiments..

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: more git updates..

2005-04-13 Thread Linus Torvalds


On Wed, 13 Apr 2005, Russell King wrote:
> 
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.
> 
> BK is also a lot better than CVS.  So _your_ point is?

Hey, anybody who wants to argue that BK is better than GIT won't be 
getting any counter-arguments from me.

The fact is, I have constraints. Like needing something to work within a
few days. If somebody comes up with an ultra-fast, replicatable, space
efficient SCM in three days, I'm all over it. 

In the meantime, I'd suggest people who worry about network bandwidth try 
to work out a synchronization protocol that allows you to send "diff 
updates" between git repositories. The git model doesn't preclude looking 
at the objects and sending diffs instead (and re-creating the objects on 
the other side). But my time-constraints _do_.
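
One way such a "diff update" could look, as a rough sketch rather than a proposed protocol: the sender encodes a new object as copy/data operations against an object the receiver already has, and the receiver rebuilds it and checks the SHA1. The function names and the line-based delta format are assumptions made for illustration.

```python
import difflib
import hashlib


def make_delta(base: bytes, target: bytes):
    """Encode `target` as copy/data operations against `base` (line based)."""
    a = base.splitlines(keepends=True)
    b = target.splitlines(keepends=True)
    ops = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))               # reuse lines of base
        elif j2 > j1:                                   # replace or insert
            ops.append(("data", b"".join(b[j1:j2])))    # ship new lines
        # pure deletions need no operation
    return hashlib.sha1(target).hexdigest(), ops


def apply_delta(base: bytes, expected_sha1: str, ops) -> bytes:
    """Re-create the target object on the receiving side and verify it."""
    a = base.splitlines(keepends=True)
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(a[op[1]:op[2]])
        else:
            out.append(op[1])
    result = b"".join(out)
    if hashlib.sha1(result).hexdigest() != expected_sha1:
        raise ValueError("re-created object does not match its SHA1")
    return result


old = b"a\nb\nc\n"
new = b"a\nB\nc\nd\n"
name, delta = make_delta(old, new)
rebuilt = apply_delta(old, name, delta)
```

The hash check is what makes this safe: whatever the delta format, the receiver only accepts a result whose SHA1 matches the name it was asked to create.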

Linus


Re: Re: more git updates..

2005-04-13 Thread Andrea Arcangeli
On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> Go wild. I did mine in six days, and you've been whining about other 
> people's SCMs for three years.

Even if I had spent 6 days doing git, you'd never have thrown away BK in
exchange for git.

> In other words - go and _do_ something instead of whining. I'm not 
> interested.

CVS and SVN are already an order of magnitude more efficient than git at
storing and exporting the data, and they shouldn't annoy you during
checkins either. Their backend is much more efficient than git's too,
and yet you seem not to care about them.

My suggestion was simply to at least change git to coalesce the diffs
like CVS/SCCS. I'm only making a suggestion to give git a chance to have
a backend at least as efficient as the one that CVS uses, and to avoid
running rsync on a 2.8G uncompressible blob. I don't have enough spare
time to do something myself; my spare time would be too short anyway to
make a difference in the SCM space, so I'd rather spend it all in a more
innovative space where it might have a slight chance of making a
difference.


Re: Re: more git updates..

2005-04-13 Thread Andrea Arcangeli
On Wed, Apr 13, 2005 at 10:30:52AM +0100, Russell King wrote:
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.

Yep, this is why I mentioned SCCS format too. I didn't know it was even
smaller, but I expected a similar density from SCCS.

> Note: I'm _not_ arguing with your sentiments towards CVS.  However, I
> think the space usage point still stands.

If it weren't for network synchronization it almost wouldn't matter, but
fetching 2.8G of uncompressible data when I could simply fetch 220MB of
compressible data (that zlib will compress at little cost during rsync
to less than 78M) sounds like overkill.

> What is the space usage behaviour when you have multiple git trees?

Multiple trees in the sense of pulls from multiple developers aren't
more costly than a normal checkin, due to the "soft hardlink" property of
the hashes. It's just that every checkin takes lots of space, generating
a new uncompressible blob every time a changeset touches a file.


Re: Re: more git updates..

2005-04-13 Thread Russell King
On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> > At the rate of 9M for every 198 changeset checkins, that means I'll have
> > to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> > per-file ratio due to the too-small files) for a whole pack including all
> > changesets, without accounting for the original 111MB of the original tree,
> > with rsync -z of git.  That compares with 514M _compressible_ with CVS
> > format on-disk, and with ~79M of the CVS-network download with rsync -z of
> > the CVS repository (assuming default gzip compression level).
> 
> Yes. CVS is much denser.
> 
> CVS is also total crap. So your point is?

And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
is more dense than CVS.

BK is also a lot better than CVS.  So _your_ point is?

8)

Note: I'm _not_ arguing with your sentiments towards CVS.  However, I
think the space usage point still stands.

What is the space usage behaviour when you have multiple git trees?
Do we need a git relink command in git-pasky? 8)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core


Re: Re: more git updates..

2005-04-12 Thread Linus Torvalds


On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> 
> I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> the CVS/SCCS format as storage may be more appealing than the current
> git format.

Go wild. I did mine in six days, and you've been whining about other 
people's SCMs for three years.

In other words - go and _do_ something instead of whining. I'm not 
interested.

Linus


Re: Re: more git updates..

2005-04-12 Thread Andrea Arcangeli
On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
the CVS/SCCS format as storage may be more appealing than the current
git format. I guess I should have said RCS instead of CVS, sorry if that
created any confusion. The arch/darcs approach of practically storing
patches would also be much denser, but it has no efficient way of doing
"rcs up -p 1.x" on a file that doesn't involve potentially unpacking
tons of unrelated changesets.


Re: Re: more git updates..

2005-04-12 Thread Andrea Arcangeli
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

reiserfs can do tail packing, plus the disk block is meaningless when
fetching the data from the network which is the real cost to worry about
when synchronizing and downloading (disk cost isn't a big deal).

The pagecache cost sounds like a very minor one too, since you don't need
the whole data in ram, not even all dentries need to be in cache.  This
is one of the reasons why you don't need to run readdir, and why you can
discard the old trees anytime.

At the rate of 9M for every 198 changeset checkins, that means I'll have
to download 2.7G _uncompressible_ (i.e. already compressed with a bad
per-file ratio due to the too-small files) for a whole pack including all
changesets, without accounting for the original 111MB of the original tree,
with rsync -z of git.  That compares with 514M _compressible_ with CVS
format on-disk, and with ~79M of the CVS-network download with rsync -z of
the CVS repository (assuming default gzip compression level).

What BKCVS provided with 79M of rsync -z is now provided with 2.8G of
rsync -z, with a network-bound slowdown of -97.2%. Similar slowdowns
should be expected for synchronizations over time while fetching new
blobs etc...

Ok, BKCVS has less than 6 checkins due to the linearization and
coalescing of pulls that couldn't be represented losslessly in CVS, so
the network-bound slowdown is less than -97.2%; my math is
approximate, but the order of magnitude should remain the same.
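
The figures in this message can be cross-checked with a short calculation. This is a sketch under two assumptions: 1G is taken as 1000M, and growth stays linear at 9M per 198 changesets. The changeset count it prints is only what those numbers imply; it is not stated anywhere in the message. The function names are hypothetical.

```python
def implied_changesets(total_mb: float, mb_per_batch: float = 9.0,
                       batch: int = 198) -> float:
    """Changesets needed to reach `total_mb` at 9M per 198 changesets."""
    return total_mb / mb_per_batch * batch


def slowdown(old_mb: float, new_mb: float) -> float:
    """Fractional loss of effective transfer speed when the download grows."""
    return 1 - old_mb / new_mb


changesets = implied_changesets(2700)   # 2.7G of growth
total_mb = 2700 + 111                   # growth plus the original 111MB tree
loss = slowdown(79, 2800)               # 79M via CVS vs. 2.8G via git

print(round(changesets))        # 59400
print(total_mb)                 # 2811, i.e. roughly 2.8G
print(f"{loss:.1%}")            # 97.2%
```

So the 2.8G total is the 2.7G of projected growth plus the 111MB starting tree, and the 97.2% figure is 1 - 79/2800.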

Clearly one can write an ad-hoc network protocol instead of using
rsync/wget, but the server will need quite a bit of cpu and ram to do a
checkout/update/sync efficiently to unpack all data and create all
changesets to gzip and transfer.

Anyway, git's simplicity and the robustness of its immutable hashes certainly
make it an ideal interim format (and it may even be a very practical local
live format on-disk, except for the backups). I'm only unsure whether it's a
wise idea to build an SCM on top of the current git format, or whether it's
better to use something like SCCS or CVS to coalesce all diffs of a
single file together, saving space and making rsync -z very efficient
too (or an approach like arch and darcs that stores changesets per file,
i.e. patches).


Re: Re: more git updates..

2005-04-12 Thread Panagiotis Issaris
Hi David,

On Tue, Apr 12, 2005 at 06:36:23PM -0400, David Eger wrote:
> > No. A tree is not the full data. A tree contains enough information to
> > _recreate_ the full data, but the tree itself just tells you _how_ to do
> > that. It doesn't contain very much of the data itself at all.
> 
> Perhaps I'd understand this if you tell me what "recreate" means.
> If I have a SHA1 hash of a file, and I have the file, I can verify that said
> file has the SHA1 hash it's supposed to have, but I can't generate the file
> from its hash...

But if you have the hexified SHA1 hash of a particular file you
want to access, there would be a file with a filename equal to that
hexified SHA1 hash which contains the compressed contents of the file
you're looking for.
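
The lookup described above can be sketched as a tiny content-addressed store: the hex SHA1 is the filename, and the file holds the zlib-compressed contents. This is an illustration only; the flat directory layout, the choice to hash the raw contents, and the function names are assumptions and may not match what git actually does.

```python
import hashlib
import os
import tempfile
import zlib


def write_object(objdir: str, data: bytes) -> str:
    """Store `data` compressed in a file named after its hex SHA1."""
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(objdir, name)
    if not os.path.exists(path):        # same contents, same name: store once
        with open(path, "wb") as f:
            f.write(zlib.compress(data))
    return name


def read_object(objdir: str, name: str) -> bytes:
    """Open the file named by the hash, decompress it, and verify it."""
    with open(os.path.join(objdir, name), "rb") as f:
        data = zlib.decompress(f.read())
    if hashlib.sha1(data).hexdigest() != name:
        raise ValueError("object does not match its name")
    return data


with tempfile.TemporaryDirectory() as objdir:
    name = write_object(objdir, b"int main(void) { return 0; }\n")
    contents = read_object(objdir, name)
    stored_files = os.listdir(objdir)
```

The hash cannot generate the file, but it does tell you exactly which file to open, and lets you verify what you read.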

At least, that's how I understood it...

With friendly regards,
Takis

-- 
OpenPGP key: http://lumumba.luc.ac.be/takis/takis_public_key.txt
fingerprint: 6571 13A3 33D9 3726 F728  AA98 F643 B12E ECF3 E029


Re: Re: more git updates..

2005-04-12 Thread Linus Torvalds


On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> 
> At the rate of 9M for every 198 changeset checkins, that means I'll have
> to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> per-file ratio due to the too-small files) for a whole pack including all
> changesets, without accounting for the original 111MB of the original tree,
> with rsync -z of git.  That compares with 514M _compressible_ with CVS
> format on-disk, and with ~79M of the CVS-network download with rsync -z of
> the CVS repository (assuming default gzip compression level).

Yes. CVS is much denser.

CVS is also total crap. So your point is?

Linus


Re: Re: more git updates..

2005-04-12 Thread David Eger
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> 
> Yes. A tree is defined by the blobs it references (and the subtrees) but 
> it doesn't _contain_ them. It just contains a pointer to them.

A pointer to them?  You mean a SHA1 hash of them? or what?
Where is the *real* data stored?  The real files, the real patches?
Are these somewhere completely outside of git?

> > Therefore, "TREE" must be the *full* data, and since we have the following
> > definition for CHANGESET:
> 
> No. A tree is not the full data. A tree contains enough information to 
> _recreate_ the full data, but the tree itself just tells you _how_ to do 
> that. It doesn't contain very much of the data itself at all.

Perhaps I'd understand this if you tell me what "recreate" means.
If I have a SHA1 hash of a file, and I have the file, I can verify that said
file has the SHA1 hash it's supposed to have, but I can't generate the file
from its hash...

Sorry for being stubbornly dumb, but you'll have a couple of us puzzling 
at the README ;-)

-dte


Re: Re: more git updates..

2005-04-12 Thread Linus Torvalds


On Tue, 12 Apr 2005, David Eger wrote:
> 
> The reason I am questioning this point is the GIT README file.
> 
> Linus makes explicit that a "blob" is just the "file contents," and that
> really, a "blob" is not just the SHA1 of the "blob":
> 
> > In particular, the "current directory cache" certainly does not need to
> > be consistent with the current directory contents, but it has two very
> > important attributes:
> > 
> > (a) it can re-generate the full state it caches (not just the directory
> > structure: through the "blob" object it can regenerate the data too)
> 
> And he defines "TREE" with the same name: blob

Yes. A tree is defined by the blobs it references (and the subtrees) but 
it doesn't _contain_ them. It just contains a pointer to them.

> Therefore, "TREE" must be the *full* data, and since we have the following
> definition for CHANGESET:

No. A tree is not the full data. A tree contains enough information to 
_recreate_ the full data, but the tree itself just tells you _how_ to do 
that. It doesn't contain very much of the data itself at all.
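
A small sketch of what "recreate" can mean here: the tree holds only names and hashes, and the full data comes back by following those hashes into an object store. The in-memory `store` dict, the JSON tree encoding, and the helper names are stand-ins chosen for brevity, not git's real formats.

```python
import hashlib
import json

store = {}  # object name -> (kind, payload); stands in for the object database


def put(kind: str, payload: bytes) -> str:
    name = hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()
    store[name] = (kind, payload)
    return name


def put_tree(entries: dict) -> str:
    """A tree is just a sorted list of (name, object name) pairs."""
    return put("tree", json.dumps(sorted(entries.items())).encode())


def recreate(name: str, prefix: str = "") -> dict:
    """Follow the references from `name` and return {path: file contents}."""
    kind, payload = store[name]
    if kind == "blob":
        return {prefix: payload}
    files = {}
    for entry, child in json.loads(payload):
        path = f"{prefix}/{entry}" if prefix else entry
        files.update(recreate(child, path))
    return files


readme = put("blob", b"hi\n")
main_c = put("blob", b"int main(void) { return 0; }\n")
src = put_tree({"main.c": main_c})
root = put_tree({"README": readme, "src": src})
files = recreate(root)
```

The root tree's own payload contains only two short entries, yet `recreate(root)` returns every file, because each entry points at an object that can be looked up by name.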

> That each changeset remembers *everything* for *each point in the tree*.

But only BY REFERENCE. A "commit" is usually very small. For example, the
top-of-tree commit-file for my current kernel test is literally 401
_bytes_ in size. Because it just references a tree (20 bytes of
_reference_).

> Linus, if you actually mean to differentiate between the full data
> and a SHA1 of the data

There is no differentiation. The sha1 _is_ the data as far as git is 
concerned. 

It's only confusing if you think they are different. 

> Also, the details of just what data constitutes a 'changeset' would be
> lovely... i.e. a precise spec of what Pat is describing below...

[EMAIL PROTECTED]:~/test-tools/linux-2.6.12-rc2> cat-file commit `cat .git/HEAD`
tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6
parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0
author NeilBrown <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005
committer Linus Torvalds <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005

[PATCH] md: remove a number of misleading calls to MD_BUG

The conditions that cause these calls to MD_BUG are not kernel bugs, just
oddities in what userspace is asking for.

Also convert analyze_sbs to return void, and the value it returned was
always 0.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

That's it. In all its glory. Compressed and tagged it's 401 bytes. 
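
The layout of that output (header lines, a blank line, then the free-form message) is simple enough to pick apart by hand. A minimal sketch, assuming exactly the fields shown above; `parse_commit` is a hypothetical helper, and the sample keeps the archive's redacted addresses as they appear.

```python
def parse_commit(text: str) -> dict:
    """Split a commit object's text into header fields and message."""
    header, _, message = text.partition("\n\n")
    fields = {"parent": []}             # zero, one, or more parents
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)
        else:
            fields[key] = value
    fields["message"] = message
    return fields


sample = (
    "tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6\n"
    "parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0\n"
    "author NeilBrown <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005\n"
    "committer Linus Torvalds <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005\n"
    "\n"
    "[PATCH] md: remove a number of misleading calls to MD_BUG\n"
)
commit = parse_commit(sample)
```

Everything a changeset "remembers" is reachable from the single `tree` line; the commit itself stays a few hundred bytes.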

The tree it references is 677 bytes in size. That in turn references a 
number of subtrees, but almost all of the sub-trees are shared with 
_other_ tree commits, so their size is spread out over all the commits.

The full archive of the 2.6.12-rc2 kernel that I used for testing (only
_one_ version) is 102MB in size. That's about half of what the kernel is
uncompressed.

The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
and a test-run of 198 patches from Andrew) is 111MB. In other words,
adding 198 "full" new kernels only grew the archive by 9MB (that's all
"actual disk usage" btw - the files themselves are smaller, but since they
all end up taking up a full disk block..)

Basically, the whole point of git is that objects are equated with their 
sha1 name, and that you can thus "include" an object by just referring to 
its name. The two are equivalent. 

Linus


Re: Re: more git updates..

2005-04-12 Thread David Eger

The reason I am questioning this point is the GIT README file.

Linus makes explicit that a "blob" is just the "file contents," and that
really, a "blob" is not just the SHA1 of the "blob":

> In particular, the "current directory cache" certainly does not need to
> be consistent with the current directory contents, but it has two very
> important attributes:
> 
> (a) it can re-generate the full state it caches (not just the directory
> structure: through the "blob" object it can regenerate the data too)

And he defines "TREE" with the same name: blob

> TREE: The next hierarchical object type is the "tree" object.  A tree
> object is a list of permission/name/blob data, sorted by name.

Therefore, "TREE" must be the *full* data, and since we have the following
definition for CHANGESET:

> A "changeset" is defined by the tree-object that it results in, the
> parent changesets (zero, one or more) that led up to that point, and a
> comment on what happened.

That each changeset remembers *everything* for *each point in the tree*.

Linus, if you actually mean to differentiate between the full data
and a SHA1 of the data, *please please please* say "blob" in one place
and "SHA1 of the blob" elsewhere.  It's quite confusing, to me at least.

Also, the details of just what data constitutes a 'changeset' would be
lovely... i.e. a precise spec of what Pat is describing below...

-dte 

> where David Eger <[EMAIL PROTECTED]> told me that...
> > So with git, *every* changeset is an entire (compressed) copy of the
> > kernel.  Really?  Every patch you accept adds 37 MB to your hard disk?
> > 
> > Am I missing something here?
> 
> Yes. Only changed files re-appear. The unchanged files keep the same
> SHA1 hash, therefore they don't re-appear in the repository.
> 
> So, if Linus gets a patch which sanitizes drivers/char/selection.c,
> only these new objects appear in the repository:
> 
>   drivers/char/selection.c
>   drivers/char
>   drivers
>   . (project root)
>   commit message
> 


Re: Re: more git updates..

2005-04-12 Thread Petr Baudis
Dear diary, on Tue, Apr 12, 2005 at 06:05:19AM CEST, I got a letter
where David Eger <[EMAIL PROTECTED]> told me that...
> So with git, *every* changeset is an entire (compressed) copy of the
> kernel.  Really?  Every patch you accept adds 37 MB to your hard disk?
> 
> Am I missing something here?

Yes. Only changed files re-appear. The unchanged files keep the same
SHA1 hash, therefore they don't re-appear in the repository.

So, if Linus gets a patch which sanitizes drivers/char/selection.c,
only these new objects appear in the repository:

drivers/char/selection.c
drivers/char
drivers
. (project root)
commit message
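
The list above can be checked with a toy model: hash every file and directory, change one file, and count which object names are new. This is a sketch with made-up encodings; apart from drivers/char/selection.c, the file names and contents are hypothetical.

```python
import hashlib


def write_tree(node, objects: set) -> str:
    """Hash a file (bytes) or directory (dict); record every object name."""
    if isinstance(node, bytes):
        payload = b"blob\0" + node
    else:
        entries = [f"{name} {write_tree(child, objects)}"
                   for name, child in sorted(node.items())]
        payload = b"tree\0" + "\n".join(entries).encode()
    name = hashlib.sha1(payload).hexdigest()
    objects.add(name)
    return name


def write_commit(tree: str, parent, message: str, objects: set) -> str:
    payload = f"tree {tree}\nparent {parent}\n\n{message}".encode()
    name = hashlib.sha1(b"commit\0" + payload).hexdigest()
    objects.add(name)
    return name


def kernel(selection_c: bytes) -> dict:
    return {"Makefile": b"all:\n",
            "drivers": {"char": {"selection.c": selection_c,
                                 "tty_io.c": b"tty\n"},
                        "net": {"e100.c": b"net\n"}}}


before, after = set(), set()
c1 = write_commit(write_tree(kernel(b"old\n"), before), None, "base", before)
c2 = write_commit(write_tree(kernel(b"new\n"), after), c1,
                  "sanitize selection.c", after)
new_objects = after - before
print(len(new_objects))  # 5
```

The five new names are the changed blob, the three trees on the path up to the root, and the commit; every other object keeps its hash and is shared with the previous version.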

Kind regards,

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: Re: more git updates..

2005-04-12 Thread David Eger

The reason I am questioning this point is the GIT README file.

Linus makes explicit that a blob is just the file contents, and that
really, a blob is not just the SHA1 of the blob:

 In particular, the current directory cache certainly does not need to
 be consistent with the current directory contents, but it has two very
 important attributes:
 
 (a) it can re-generate the full state it caches (not just the directory
 structure: through the blob object it can regenerate the data too)

And he defines TREE with the same name: blob

 TREE: The next hierarchical object type is the tree object.  A tree
 object is a list of permission/name/blob data, sorted by name.

Therefore, TREE must be the *full* data, and since we have the following
definition for CHANGESET:

 A changeset is defined by the tree-object that it results in, the
 parent changesets (zero, one or more) that led up to that point, and a
 comment on what happened.

That each changeset remembers *everything* for *each point in the tree*.

Linus, if you actually mean to differentiate between the full data
and a SHA1 of the data, *please please please* say blob in one place
and SHA1 of the blob elsewhere.  It's quite confusing, to me at least.

Also, the details of just what data constitutes a 'changeset' would be
lovely... i.e. a precise spec of what Pat is describing below...

-dte 

> where David Eger <[EMAIL PROTECTED]> told me that...
> > So with git, *every* changeset is an entire (compressed) copy of the
> > kernel.  Really?  Every patch you accept adds 37 MB to your hard disk?
> >
> > Am I missing something here?
>
> Yes. Only changed files re-appear. The unchanged files keep the same
> SHA1 hash, therefore they don't re-appear in the repository.
>
> So, if Linus gets a patch which sanitizes drivers/char/selection.c,
> only these new objects appear in the repository:
>
>   drivers/char/selection.c
>   drivers/char
>   drivers
>   . (project root)
>   commit message


Re: Re: more git updates..

2005-04-12 Thread Linus Torvalds


On Tue, 12 Apr 2005, David Eger wrote:
 
> The reason I am questioning this point is the GIT README file.
>
> Linus makes explicit that a blob is just the file contents, and that
> really, a blob is not just the SHA1 of the blob:
>
> > In particular, the current directory cache certainly does not need to
> > be consistent with the current directory contents, but it has two very
> > important attributes:
> >
> > (a) it can re-generate the full state it caches (not just the directory
> > structure: through the blob object it can regenerate the data too)
>
> And he defines TREE with the same name: blob

Yes. A tree is defined by the blobs it references (and the subtrees) but 
it doesn't _contain_ them. It just contains a pointer to them.

> Therefore, TREE must be the *full* data, and since we have the following
> definition for CHANGESET:

No. A tree is not the full data. A tree contains enough information to 
_recreate_ the full data, but the tree itself just tells you _how_ to do 
that. It doesn't contain very much of the data itself at all.

> That each changeset remembers *everything* for *each point in the tree*.

But only BY REFERENCE. A commit is usually very small. For example, the
top-of-tree commit-file for my current kernel test is literally 401
_bytes_ in size. Because it just references a tree (20 bytes of
_reference_).

> Linus, if you actually mean to differentiate between the full data
> and a SHA1 of the data

There is no differentiation. The sha1 _is_ the data as far as git is 
concerned. 

It's only confusing if you think they are different. 

> Also, the details of just what data constitutes a 'changeset' would be
> lovely... i.e. a precise spec of what Pat is describing below...

[EMAIL PROTECTED]:~/test-tools/linux-2.6.12-rc2> cat-file commit `cat .git/HEAD`
tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6
parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0
author NeilBrown <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005
committer Linus Torvalds <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005

[PATCH] md: remove a number of misleading calls to MD_BUG

The conditions that cause these calls to MD_BUG are not kernel bugs, just
oddities in what userspace is asking for.

Also convert analyze_sbs to return void, and the value it returned was
always 0.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

That's it. In all its glory. Compressed and tagged it's 401 bytes. 
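
The layout of that commit text (header lines, a blank line, then the
free-form message) can be pulled apart with a few lines of Python. This is
only a sketch based on the output shown above; parse_commit is a
hypothetical helper, and the author/committer names and addresses in the
sample are placeholders rather than the real ones.

```python
def parse_commit(text):
    """Split a commit object's text into header fields and the message."""
    header, _, message = text.partition("\n\n")
    fields = {"parent": []}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)   # zero, one or more parents
        else:
            fields[key] = value
    return fields, message

sample = (
    "tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6\n"
    "parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0\n"
    "author A U Thor <author@example.com> Tue Apr 12 08:27:08 2005\n"
    "committer C O Mitter <committer@example.com> Tue Apr 12 08:27:08 2005\n"
    "\n"
    "[PATCH] md: remove a number of misleading calls to MD_BUG\n"
)

fields, message = parse_commit(sample)
print(fields["tree"])
# The 40 hex digits decode to the 20 bytes of reference mentioned above.
print(len(bytes.fromhex(fields["tree"])))
```

The whole tree is reachable from that one 20-byte name, which is why the
commit itself can stay so small.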

The tree it references is 677 bytes in size. That in turn references a 
number of subtrees, but almost all of the sub-trees are shared with 
_other_ tree commits, so their size is spread out over all the commits.

The full archive of the 2.6.12-rc2 kernel that I used for testing (only
_one_ version) is 102MB in size. That's about half of what the kernel is
uncompressed.

The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
and a test-run of 198 patches from Andrew) is 111MB. In other words,
adding 198 full new kernels only grew the archive by 9MB (that's all
actual disk usage btw - the files themselves are smaller, but since they
all end up taking up a full disk block..)
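
The gap between file size and actual disk usage can be sketched as
follows. The 4096-byte block size is an assumption (the mail does not say
which filesystem or block size was used), and apart from the 401-byte
commit and the 677-byte tree the object sizes are made up.

```python
def on_disk(size, block=4096):
    """Bytes a file of `size` bytes occupies when space is handed out in whole blocks."""
    return -(-size // block) * block   # ceiling division, then back to bytes

sizes = [401, 677, 1500, 5000]         # commit, tree, two illustrative blobs
logical = sum(sizes)
allocated = sum(on_disk(s) for s in sizes)

print(logical, allocated, allocated - logical)
```

With many small objects, most of the allocated space is padding, so the
sum of the file sizes is well below what du reports.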

Basically, the whole point of git is that objects are equated with their 
sha1 name, and that you can thus include an object by just referring to 
its name. The two are equivalent. 

Linus


Re: Re: more git updates..

2005-04-12 Thread David Eger
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
 
> Yes. A tree is defined by the blobs it references (and the subtrees) but
> it doesn't _contain_ them. It just contains a pointer to them.

A pointer to them?  You mean a SHA1 hash of them? or what?
Where is the *real* data stored?  The real files, the real patches?
Are these somewhere completely outside of git?

> > Therefore, TREE must be the *full* data, and since we have the following
> > definition for CHANGESET:
>
> No. A tree is not the full data. A tree contains enough information to
> _recreate_ the full data, but the tree itself just tells you _how_ to do
> that. It doesn't contain very much of the data itself at all.

Perhaps I'd understand this if you tell me what recreate means.
If I have a SHA1 hash of a file, and I have the file, I can verify that said
file has the SHA1 hash it's supposed to have, but I can't generate the file
from its hash...

Sorry for being stubbornly dumb, but you'll have a couple of us puzzling 
at the README ;-)

-dte


Re: Re: more git updates..

2005-04-12 Thread Linus Torvalds


On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
 
> At the rate of 9M for every 198 changeset checkins, that means I'll have
> to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> per-file ratio due to the too-small files) for a whole pack including all
> changesets without accounting the original 111MB of the original tree,
> with rsync -z of git.  That compares with 514M _compressible_ with CVS
> format on-disk, and with ~79M of the CVS-network download with rsync -z of
> the CVS repository (assuming default gzip compression level).

Yes. CVS is much denser.

CVS is also total crap. So your point is?

Linus


Re: Re: more git updates..

2005-04-12 Thread Panagiotis Issaris
Hi David,

On Tue, Apr 12, 2005 at 06:36:23PM -0400, David Eger wrote:
> > No. A tree is not the full data. A tree contains enough information to
> > _recreate_ the full data, but the tree itself just tells you _how_ to do
> > that. It doesn't contain very much of the data itself at all.
>
> Perhaps I'd understand this if you tell me what recreate means.
> If I have a SHA1 hash of a file, and I have the file, I can verify that said
> file has the SHA1 hash it's supposed to have, but I can't generate the file
> from its hash...

But if you have the hexified SHA1 hash of a particular file you want to
access, there would be a file whose name is that hexified SHA1 hash and
which contains the compressed contents of the file you're looking for.

At least, that's how I understood it...
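
In Python, that understanding could be sketched roughly like this. It is
an assumption-laden toy: write_object and read_object are made-up helper
names, the flat directory layout is a simplification, and exactly which
bytes git hashes is not spelled out in this thread.

```python
import hashlib
import os
import tempfile
import zlib

def write_object(objdir, content):
    """Compress `content` and store it in a file named after its hex SHA1."""
    name = hashlib.sha1(content).hexdigest()
    path = os.path.join(objdir, name)
    if not os.path.exists(path):       # same content, same name: already stored
        with open(path, "wb") as f:
            f.write(zlib.compress(content))
    return name

def read_object(objdir, name):
    """Use the hash as a filename and return the decompressed content."""
    with open(os.path.join(objdir, name), "rb") as f:
        content = zlib.decompress(f.read())
    if hashlib.sha1(content).hexdigest() != name:
        raise ValueError("object does not match its name")
    return content

objdir = tempfile.mkdtemp()
name = write_object(objdir, b"int main(void) { return 0; }\n")
print(name)
print(read_object(objdir, name))
```

So the hash is never inverted; it is only used as a lookup key for data
that was stored under that name earlier.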

With friendly regards,
Takis

-- 
OpenPGP key: http://lumumba.luc.ac.be/takis/takis_public_key.txt
fingerprint: 6571 13A3 33D9 3726 F728  AA98 F643 B12E ECF3 E029


Re: Re: more git updates..

2005-04-12 Thread Andrea Arcangeli
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 full new kernels only grew the archive by 9MB (that's all
> actual disk usage btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

reiserfs can do tail packing, plus the disk block is meaningless when
fetching the data from the network which is the real cost to worry about
when synchronizing and downloading (disk cost isn't a big deal).

The pagecache cost sounds like a very minor one too, since you don't need
the whole data in ram, not even all dentries need to be in cache.  This
is one of the reasons why you don't need to run readdir, and why you can
discard the old trees anytime.

At the rate of 9M for every 198 changeset checkins, that means I'll have
to download 2.7G _uncompressible_ (i.e. already compressed with a bad
per-file ratio due to the too-small files) for a whole pack including all
changesets without accounting the original 111MB of the original tree,
with rsync -z of git.  That compares with 514M _compressible_ with CVS
format on-disk, and with ~79M of the CVS-network download with rsync -z of
the CVS repository (assuming default gzip compression level).

What BKCVS provided with 79M of rsync -z, now is provided with 2.8G of
rsync -z, with a network-bound slowdown of -97.2%. Similar slowdowns
should be expected for synchronizations over time while fetching new
blobs etc...
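
The arithmetic behind these figures can be reproduced as below. The
variable names are illustrative, decimal units (1G = 1000M) are assumed,
and the implied number of changesets is derived from the quoted numbers
rather than stated anywhere in the mail.

```python
growth_mb = 9            # archive growth quoted for the test run
changesets = 198         # changesets that produced that growth
per_changeset_mb = growth_mb / changesets

# How many changesets the 2.7G estimate corresponds to at that rate.
implied_changesets = 2.7 * 1000 / per_changeset_mb

git_mb = 2.8 * 1000      # 2.7G of history plus the 111MB base, rounded
cvs_mb = 79              # rsync -z download of the CVS repository
ratio = git_mb / cvs_mb
reduction = 1 - cvs_mb / git_mb

print(round(per_changeset_mb, 3), "MB per changeset")
print(round(implied_changesets), "changesets implied")
print(round(ratio, 1), "times more data to transfer")
print(round(reduction * 100, 1), "percent fewer bytes with the CVS download")
```

Read this way, the 97.2% figure is the share of bytes saved by the CVS
download, which corresponds to a transfer roughly 35 times larger for git.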

Ok, BKCVS has less than 6 checkins due to the linearization and
coalescing of pulls that couldn't be represented losslessly in CVS, so
the network-bound slowdown is less than -97.2%, my math is
approximate, but the order of magnitude should remain the same.

Clearly one can write an ad-hoc network protocol instead of using
rsync/wget, but the server will need quite a bit of cpu and ram to do a
checkout/update/sync efficiently to unpack all data and create all
changesets to gzip and transfer.

Anyway, git's simplicity and the robustness of its immutable hashes certainly
make it an ideal interim format (and it may even be a very practical local
live format on-disk, except for the backups). I'm only unsure if it's a
wise idea to build an SCM on top of the current git format or if it's
better to use something like SCCS or CVS to coalesce all diffs of a
single file together, to save space and make rsync -z very efficient
too (or an approach like arch and darcs that stores changesets per file,
i.e. patches).


Re: Re: more git updates..

2005-04-12 Thread Andrea Arcangeli
On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
the CVS/SCCS format as storage may be more appealing than the current
git format. I guess I should have said RCS instead of CVS, sorry if that
created any confusion. The arch/darcs approach of practically storing
patches would also be much denser but it has no efficient way of doing
rcs up -p 1.x on a file, that doesn't involve potentially unpacking
tons of unrelated changesets.


Re: Re: more git updates..

2005-04-12 Thread Linus Torvalds


On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
 
> I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> the CVS/SCCS format as storage may be more appealing than the current
> git format.

Go wild. I did mine in six days, and you've been whining about other 
peoples SCM's for three years.

In other words - go and _do_ something instead of whining. I'm not 
interested.

Linus


Re: Re: more git updates..

2005-04-11 Thread Petr Baudis
Dear diary, on Mon, Apr 11, 2005 at 05:49:31PM CEST, I got a letter
where "Randy.Dunlap" <[EMAIL PROTECTED]> told me that...
> On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote:
..snip..
> | Yes. Crappy old tree, but it can still read my git.git directory, so you 
> | can use it to update to my current source base.
> 
> Please go into a little more detail about how to do this step...
> that seems to be the most basic concept that I am missing.
> i.e., how to find the "latest/current" tree (version/commit)
> and check it out (read-tree, checkout-cache, etc.).

Well, its ID is by convention kept in .dircache/HEAD. But that is really
only a convention, no "core git" tool reads it directly, and you need to
update it manually after you do commit-tree.

First, you need to get the accompanying tree's id. git-pasky's shortcut
is $(tree-id), but manually you can do it by

cat-file commit $(cat .dircache/HEAD) | egrep '^tree'

Note that if you ever forget to update HEAD or if you have multiple
branches in your repository, you can list all "head commits" (that is,
commits which have no other commits referencing them as parents) by
doing fsck-cache.
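
To make that definition of a "head commit" concrete, here is a small
Python sketch. It only illustrates the definition and is not how
fsck-cache is implemented; head_commits and the short commit ids are made
up for the example.

```python
def head_commits(parents):
    """Return the commits that no other commit lists as a parent.

    `parents` maps each commit id to the list of its parent ids.
    """
    referenced = {p for plist in parents.values() for p in plist}
    return sorted(c for c in parents if c not in referenced)

history = {
    "a1": [],             # root commit
    "b2": ["a1"],
    "c3": ["b2"],
    "d4": ["b2"],
    "e5": ["d4", "c3"],   # merge commit, nothing references it yet
    "f6": ["d4"],         # tip of a second branch
}

print(head_commits(history))
```

Here e5 and f6 are the heads; every other commit appears as somebody's
parent.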

Now, you need to populate the directory cache from the tree (see Paul
Jackson's diagram):

read-tree $tree_id

And now you want to update your working tree from the cache:

checkout-cache -a -f

This will bring your tree in sync with the cache (it won't remove any
stale files, though). That means it will overwrite your local changes
too - turn that off by omitting the "-f". If you want to update only
some files, omit the "-a" and list them.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: Re: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Mon, Apr 11, 2005 at 01:14:57AM CEST, I got a letter
where Paul Jackson <[EMAIL PROTECTED]> told me that...
> Useful explanation - thanks, Linus.
> 
> Is this picture and description accurate:
> 
> ==
> 
> 
>          < working directory files (foo.c) >
>
>             ^                        |
>             |  upward ops            |  downward ops
>             |  ----------            |  ------------
>             |  checkout-cache        |  update-cache
>             |  show-diff             v
>
>     < current directory cache (".dircache/index") >
>
>             ^                        |
>             |  upward ops            |  downward ops
>             |  ----------            |  ------------
>             |  read-tree             |  write-tree
>             |                        |  commit-tree
>             |                        v
>
> < git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >

Well, except that from a purely technical standpoint commit-tree has
no place in this picture - it creates a new object in the git
filesystem based on its input data, regardless of the directory
cache or current tree. It probably still belongs where it is from the
workflow standpoint, though.

..snip..
> Minor question:
> 
>   I must have an old version - I got 'git-0.03', but
>   it doesn't have 'checkout-cache', and its 'read-tree'
>   directly writes my working files.
>  
>   How do I get a current version?  Well, one way I see,
>   and that's to pick up Pasky's:
> 
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>  
>   Perhaps that's the best way?

You can take mine, and do:

git pull pasky
git pull linus
cp .dircache/HEAD .dircache/HEAD.local

Now, your tree and git filesystem are up to date.

git track local

Now, when you do git pull pasky, your working tree will not be updated
automatically anymore.

git track linus

Now, you start tracking Linus' tree instead. Note that the initial
update will blow away the scripts in your current tree, so before you do
the last two steps you will probably want to clone the tree and set PATH
to the one still tracking me, so you get all the comfort. ;-)

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: Re: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Sun, Apr 10, 2005 at 08:42:53PM CEST, I got a letter
where Christopher Li <[EMAIL PROTECTED]> told me that...
> I totally agree that odds is really really small.
> That is why it is not worthy to handle the case. People hit that
> can just add a new line or some thing to avoid it, if
> it happen after all.
> 
> It is the little peace of mind to know for sure that did
> not happen. I am just paranoid. 

BTW, I've merged the check to git-pasky some time ago, you can disable
it in the Makefile. It is by default on now, until someone convinces me
it actually affects performance measurably.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: RE: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Mon, Apr 11, 2005 at 12:07:37AM CEST, I got a letter
where "Luck, Tony" <[EMAIL PROTECTED]> told me that...
..snip..
> >Hey, I may end up being wrong, and yes, maybe I should have done a 
> >two-level one. The good news is that we can trivially fix it later (even 
> >dynamically - we can make the "sha1 object tree layout" be a per-tree 
> >config option, and there would be no real issue, so you could make small 
> >projects use a flat version and big projects use a very deep structure 
> >etc). You'd just have to script some renames to move the files around.
> 
> It depends on how many eco-system shell scripts get built that need to
> know about the layout ... if some shell/perl "libraries" encode this
> filename layout (and people use them) ... then switching later would
> indeed be painless.

FWIW, my short-term plans include support for monotone-like hash ID
shortening - it's enough to use the shortest leading unique part of the
ID to identify the revision. I will poke around in the object repository
for that. I also already have Randy Dunlap's git lsobj, which will list all
objects of a specified type (very useful especially when looking for
orphaned commits and similar low-level work).
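
The ID shortening idea could look roughly like this in Python. It is a
sketch of the general technique, not of the git-pasky code;
shortest_unique_prefix and resolve are hypothetical names, and the third
ID is invented to force a longer prefix.

```python
def shortest_unique_prefix(target, all_ids, minimum=1):
    """Shortest leading part of `target` that no other known ID shares."""
    others = [i for i in all_ids if i != target]
    for n in range(minimum, len(target) + 1):
        prefix = target[:n]
        if not any(o.startswith(prefix) for o in others):
            return prefix
    return target

def resolve(prefix, all_ids):
    """Expand an abbreviated ID, refusing ambiguous or unknown prefixes."""
    matches = [i for i in all_ids if i.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matches {len(matches)} objects")
    return matches[0]

ids = [
    "cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6",
    "c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0",
    "cf9f" + "0" * 32 + "beef",        # made-up ID sharing a 4-digit prefix
]

short = shortest_unique_prefix(ids[0], ids)
print(short, resolve(short, ids) == ids[0])
```

An ambiguous prefix such as "cf9f" raises LookupError here, which is one
plausible way to handle it; the mail does not say how the real tool would
behave.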

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: Re: more git updates..

2005-04-10 Thread Christopher Li
On Sun, Apr 10, 2005 at 11:41:53AM +0200, Petr Baudis wrote:
> Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
> where Christopher Li <[EMAIL PROTECTED]> told me that...
> > On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> > > 
> > > But I am wondering what your plans are to handle renames---or
> > > does git already represent them?
> > >
> > 
> > Rename should just work.  It will create a new tree object and you
> > will notice that in the entry that changed, the hash for the blob
> > object is the same.
> 
> Which is of course wrong when you want to do proper merging, examine
> per-file history, etc. One solution which springs to my mind is to have
> a UUID accompany each blob and tree; that will take a relatively large
> amount of space though, and I'm not sure it is really worth it.

It should just use the two-step approach (rename, then change); then it
is tractable with git as it is now.

Chris


Re: Re: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Sun, Apr 10, 2005 at 11:28:54AM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> > "CL" == Christopher Li <[EMAIL PROTECTED]> writes:
> 
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >> 
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >> 
> 
> CL> Rename should just work.  It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
> 
> Sorry, I was unclear.  But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?

No. See diff-tree output and
http://pasky.or.cz/~pasky/dev/git/gitdiff-do for how it's done.
Basically, you just take the two trees and compare them linearly (do a
normal diff on them, essentially). Then the differences you spot this way
are everything that needs to appear in the patch.
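
A linear comparison of two trees can be sketched as a merge-style walk
over two name-sorted entry lists. This is only an illustration of the
idea, not the diff-tree or gitdiff-do code: diff_trees is a hypothetical
helper, the hashes are placeholders, and recursion into subtrees is left
out.

```python
def diff_trees(old, new):
    """Compare two name-sorted lists of (name, sha1) entries in one pass."""
    changes = []
    i = j = 0
    while i < len(old) and j < len(new):
        (oname, osha), (nname, nsha) = old[i], new[j]
        if oname == nname:
            if osha != nsha:           # same name, different content
                changes.append(("modified", oname))
            i += 1
            j += 1
        elif oname < nname:            # name only exists in the old tree
            changes.append(("removed", oname))
            i += 1
        else:                          # name only exists in the new tree
            changes.append(("added", nname))
            j += 1
    changes += [("removed", name) for name, _ in old[i:]]
    changes += [("added", name) for name, _ in new[j:]]
    return changes

old = [("Makefile", "aa11"), ("README", "bb22"), ("cache.h", "cc33")]
new = [("Makefile", "aa11"), ("cache.h", "dd44"), ("diff-tree.c", "ee55")]
print(diff_trees(old, new))
```

Entries whose hashes match are skipped without looking at their contents,
so no commit history has to be read to compare two trees.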

> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time.  If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?

Actually, this could be possible too I think. We will have to make
diff-tree two-pass, but it is already so blindingly fast that I guess that
doesn't hurt too much. I might try to get my hands on that.
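
One way such a second pass might pair things up is sketched below, under
the assumption that a rename shows up as a removed path and an added path
carrying the same blob hash. detect_pure_renames and the sample paths and
hashes are invented for the example; this is not an existing diff-tree
feature.

```python
def detect_pure_renames(old, new):
    """Pair removed and added paths that carry the same blob hash.

    `old` and `new` map path -> blob sha1 for the two trees.
    """
    removed = {p: h for p, h in old.items() if p not in new}
    added = {p: h for p, h in new.items() if p not in old}
    by_hash = {}
    for path, sha in sorted(removed.items()):
        by_hash.setdefault(sha, []).append(path)
    renames = []
    for path, sha in sorted(added.items()):
        if by_hash.get(sha):           # an unclaimed removed path has this content
            renames.append((by_hash[sha].pop(0), path))
    return renames

old = {"fs/old_name.c": "aa11", "fs/inode.c": "bb22", "fs/gone.c": "cc33"}
new = {"fs/new_name.c": "aa11", "fs/inode.c": "bb22", "fs/edited_move.c": "dd44"}
print(detect_pure_renames(old, new))
```

A file that was moved and edited in the same step (fs/edited_move.c here)
is not matched, which is exactly the case the intermediate pure-move tree
suggested above would make visible.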

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: Re: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
where Christopher Li <[EMAIL PROTECTED]> told me that...
> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> > 
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
> >
> 
> Rename should just work.  It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

Which is of course wrong when you want to do proper merging, examine
per-file history, etc. One solution which springs to my mind is to have
a UUID accompany each blob and tree; that will take a relatively large
amount of space though, and I'm not sure it is really worth it.

How many renames were there in the 64k commits so far anyway?

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: Re: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
where Christopher Li [EMAIL PROTECTED] told me that...
 On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
  
  But I am wondering what your plans are to handle renames---or
  does git already represent them?
 
 
 Rename should just work.  It will create a new tree object and you
 will notice that in the entry that changed, the hash for the blob
 object is the same.

Which is of course wrong when you want to do proper merging, examine
per-file history, etc. One solution which springs to my mind is to have
a UUID accompany each blob and tree; that will take relatively lot of
space though, and I'm not sure it is really worth it.

How many renames were there in the 64k commits so far anyway?

-- 
Petr Pasky Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Sun, Apr 10, 2005 at 11:28:54AM CEST, I got a letter
where Junio C Hamano [EMAIL PROTECTED] told me that...
  CL == Christopher Li [EMAIL PROTECTED] writes:
 
 CL On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
  
  But I am wondering what your plans are to handle renames---or
  does git already represent them?
  
 
 CL Rename should just work.  It will create a new tree object and you
 CL will notice that in the entry that changed, the hash for the blob
 CL object is the same.
 
 Sorry, I was unclear.  But doesn't that imply that a SCM built
 on top of git storage needs to read all the commit and tree
 records up to the common ancestor to show tree diffs between two
 forked tree?

No. See diff-tree output and
http://pasky.or.cz/~pasky/dev/git/gitdiff-do for how it's done.
Basically, you just take the two trees and compare them linearily (do a
normal diff on them, essentialy). Then the differences you spot this way
are everything what needs to appear in the patch.

 I suspect that another problem is that noticing the move of the
 same SHA1 hash from one pathname to another and recognizing that
 as a rename would not always work in the real world, because
 sometimes people move files *and* make small changes at the same
 time.  If git is meant to be an intermediate format to suck
 existing kernel history out of BK so that the history can be
 converted for the next SCM chosen for the kernel work, I would
 imagine that there needs to be a way to represent such a case.
 Maybe convert a file rename as two git trees (one tree for pure
 move which immediately followed by another tree for edit) if it
 is not a pure move?

Actually, this could be possible too I think. We will have to make
diff-tree two-pass, but it is already so blinding fast that I guess that
doesn't hurt too much. I might try to get my hands on that.

-- 
Petr Pasky Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: more git updates..

2005-04-10 Thread Christopher Li
On Sun, Apr 10, 2005 at 11:41:53AM +0200, Petr Baudis wrote:
 Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
 where Christopher Li [EMAIL PROTECTED] told me that...
  On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
   
   But I am wondering what your plans are to handle renames---or
   does git already represent them?
  
  
  Rename should just work.  It will create a new tree object and you
  will notice that in the entry that changed, the hash for the blob
  object is the same.
 
 Which is of course wrong when you want to do proper merging, examine
 per-file history, etc. One solution which springs to my mind is to have
 a UUID accompany each blob and tree; that will take relatively lot of
 space though, and I'm not sure it is really worth it.

It should just use the rename + change two step then it is tractable
with git now.

Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RE: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Mon, Apr 11, 2005 at 12:07:37AM CEST, I got a letter
where Luck, Tony [EMAIL PROTECTED] told me that...
..snip..
 Hey, I may end up being wrong, and yes, maybe I should have done a 
 two-level one. The good news is that we can trivially fix it later (even 
 dynamically - we can make the sha1 object tree layout be a per-tree 
 config option, and there would be no real issue, so you could make small 
 projects use a flat version and big projects use a very deep structure 
 etc). You'd just have to script some renames to move the files around.
 
 It depends on how many eco-system shell scripts get built that need to
 know about the layout ... if some shell/perl libraries encode this
 filename layout (and people use them) ... then switching later would
 indeed be painless.

FWIW, my short-term plans include support for monotone-like hash ID
shortening - it's enough to use the shortest leading unique part of the
ID to identify the revision. I will poke to the object repository for
that. I also already have Randy Dunlap's git lsobj, which will list all
objects of a specified type (very useful especially when looking for
orphaned commits and such rather lowlevel work).
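
A minimal sketch of the ID shortening idea, assuming the full list of object IDs is already in memory (in practice it would come from listing the object repository). The function names and the sample IDs are hypothetical.

```python
def shortest_unique_prefix(full_id, all_ids, min_len=1):
    """Return the shortest leading part of full_id that matches no other ID."""
    others = [i for i in all_ids if i != full_id]
    for n in range(min_len, len(full_id) + 1):
        prefix = full_id[:n]
        if not any(o.startswith(prefix) for o in others):
            return prefix
    return full_id


def resolve(prefix, all_ids):
    """Expand a shortened ID, refusing prefixes that are unknown or ambiguous."""
    matches = [i for i in all_ids if i.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError("%r matches %d objects" % (prefix, len(matches)))
    return matches[0]


# Made-up, shortened IDs; real ones are 40 hex digits.
ids = ["abcd1234", "abcf5678", "12345678"]
print(shortest_unique_prefix("abcd1234", ids))
print(resolve("1", ids))
```

The linear scan is fine for a sketch; with many objects one would more likely sort the IDs or reuse the directory fan-out of the object store.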

-- 
Petr Pasky Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: Re: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Sun, Apr 10, 2005 at 08:42:53PM CEST, I got a letter
where Christopher Li [EMAIL PROTECTED] told me that...
> I totally agree that the odds are really, really small.
> That is why it is not worth handling the case. People who hit it
> can just add a newline or something to avoid it, if
> it happens after all.
>
> It is the little peace of mind of knowing for sure that it did
> not happen. I am just paranoid.

BTW, I've merged the check into git-pasky some time ago; you can disable
it in the Makefile. It is on by default now, until someone convinces me
it actually affects performance measurably.
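
For illustration, a collision check of this kind can be sketched as follows. This is not the git-pasky code: the ObjectStore class, its in-memory dict, and the check_collisions switch (standing in for the Makefile option) are assumptions, and the raw bytes are hashed directly as a simplification.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store keyed by the SHA-1 of the data."""

    def __init__(self, check_collisions=True):
        self.objects = {}
        self.check_collisions = check_collisions

    def write(self, data):
        key = hashlib.sha1(data).hexdigest()
        existing = self.objects.get(key)
        if existing is None:
            self.objects[key] = data
        elif self.check_collisions and existing != data:
            # Same hash, different bytes: refuse instead of silently
            # treating the new data as already stored.
            raise RuntimeError("SHA-1 collision for object " + key)
        return key


store = ObjectStore()
first = store.write(b"hello\n")
second = store.write(b"hello\n")  # already present, contents compared
print(first == second, len(store.objects))
```

The extra cost is one comparison against the existing object, and only on writes whose hash is already present.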

-- 
Petr Pasky Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: Re: more git updates..

2005-04-10 Thread Petr Baudis
Dear diary, on Mon, Apr 11, 2005 at 01:14:57AM CEST, I got a letter
where Paul Jackson [EMAIL PROTECTED] told me that...
> Useful explanation - thanks, Linus.
>
> Is this picture and description accurate:
>
> ==
>
>        working directory files (foo.c)
>                     ^
>    ^                |
>    |  upward ops    |    downward ops  |
>    |  ----------    |    ------------  |
>    | checkout-cache |    update-cache  |
>    | show-diff      |                  v
>                     v
>    current directory cache (.dircache/index)
>                     ^
>    ^                |
>    |  upward ops    |    downward ops  |
>    |  ----------    |    ------------  |
>    |   read-tree    |     write-tree   |
>    |                |    commit-tree   |
>                     |                  v
>                     v
>    git filesystem (blobs, trees, commits: .dircache/{HEAD,objects})

Well, except that from a purely technical standpoint commit-tree has
nothing to do in this picture - it creates a new object in the git
filesystem based on its input data, regardless of the directory
cache or current tree. It probably still belongs where it is from the
workflow standpoint, though.
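
To make the three layers concrete, here is a toy Python model. It is only a sketch: dicts stand in for the working directory, the directory cache and the object store, the function names mirror the commands in the picture, and the tree and commit text formats are invented for the example.

```python
import hashlib

objects = {}  # stands in for .dircache/objects: sha1 -> bytes


def put(data):
    key = hashlib.sha1(data).hexdigest()
    objects[key] = data
    return key


def update_cache(index, workdir, name):
    """Downward: working file -> blob object, recorded in the cache."""
    index[name] = put(workdir[name])


def checkout_cache(index, workdir, name):
    """Upward: cache entry -> working file."""
    workdir[name] = objects[index[name]]


def write_tree(index):
    """Downward: cache -> tree object listing (blob hash, name) pairs."""
    listing = "".join("%s %s\n" % (h, n) for n, h in sorted(index.items()))
    return put(listing.encode())


def read_tree(tree_id):
    """Upward: tree object -> fresh cache."""
    index = {}
    for line in objects[tree_id].decode().splitlines():
        blob, name = line.split(" ", 1)
        index[name] = blob
    return index


def commit_tree(tree_id, parents=(), message=""):
    """Needs only a tree ID and parents; never looks at cache or work files."""
    lines = ["tree " + tree_id] + ["parent " + p for p in parents]
    return put(("\n".join(lines) + "\n\n" + message + "\n").encode())


workdir = {"foo.c": b"int main(void) { return 0; }\n"}
index = {}
update_cache(index, workdir, "foo.c")
tree = write_tree(index)
commit = commit_tree(tree, message="initial")
print(tree, commit)
```

Note that commit_tree takes no index or workdir argument at all, which matches the remark that it sits in the picture for workflow reasons only.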

..snip..
> Minor question:
>
>   I must have an old version - I got 'git-0.03', but
>   it doesn't have 'checkout-cache', and its 'read-tree'
>   directly writes my working files.
>
>   How do I get a current version?  Well, one way I see,
>   and that's to pick up Pasky's:
>
>     http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>
>   Perhaps that's the best way?

You can take mine, and do:

    git pull pasky
    git pull linus
    cp .dircache/HEAD .dircache/HEAD.local

Now, your tree and git filesystem are up to date.

    git track local

Now, when you do git pull pasky, your working tree will not be updated
automatically anymore.

    git track linus

Now, you start tracking Linus' tree instead. Note that the initial
update will blow away the scripts in your current tree, so before you do
the last two steps you will probably want to clone the tree and set PATH
to the one still tracking me, so you get all the comfort. ;-)

-- 
Petr Pasky Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.


Re: Re: more git updates..

2005-04-09 Thread Petr Baudis
Dear diary, on Sun, Apr 10, 2005 at 01:31:10AM CEST, I got a letter
where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> On Sat, 9 Apr 2005, Linus Torvalds wrote:
> > 
> > Actually, I guess I wouldn't have to change the format. I could just 
> > extend the existing "tree" object to be able to point to other trees, and 
> > that's it.
> 
> Done, and pushed out. The current git.git repository seems to do all of 
> this correctly.
..snip..

Ok, so now I can dare announce it, I hope. I hacked my branch of git
somewhat, kept in sync with Linus, and now I have something to show.
Please see it at

http://pasky.or.cz/~pasky/dev/git/

It is basically a set of (still rather crude) shell scripts upon Linus'
git, which make it sanely usable by mere humans for actual version
tracking. Its usage _is_ going to change, so don't get too used to it
(that'd be hard anyway, I suspect), but it should be working nicely.

I have described most of the interesting parts and some basic usage in
the README at that page. It wraps commits, supports log retrieval and
comfortable diffing between any two trees. And on top of that, it can do
some basic remote repositories - it will pull (rsync) from them and it
can make the local copy track them - on pull, it will be updated
accordingly (and your local commits on the tracked branch will get
orphaned).

I didn't attach a patch against Linus since I think it's pretty much
useless now. It's available as against-linus.patch on the web, and
you can apply it to the latest git tree (NOT 0.03). But it's probably
a better idea to wget my tree. You can then watch us making progress by

    gitpull.sh linus
    gitpull.sh pasky

and see where we differ by:

    gitdiff.sh linus pasky

(This is how the against-linus.patch was generated. I could easily
generate even a 0.03 patch this way, but I forgot to merge the fsck at
that time, so it would suck.)

(Note that the tree you wget is set up to track my branch. If you want
to stop tracking it (basically necessary now if you want to do local
commits), do:

    cp .dircache/HEAD .dircache/HEAD.local
    gittrack.sh

The cp says something like "I want to pick up where the tracked
branch left off". Otherwise, untracking would return you to your "local"
branch, which is just some ancient predecessor of the pasky branch here
anyway.)

Note that I haven't really tested it on anything but git itself yet, so
I'm not sure how it will cope, especially with directories - I tried to
make it aware of them though. I will do some more practical testing tomorrow.

Otherwise, I will probably try to consolidate the usage and
documentation now, and beautify the scripts. I might start pondering
some merging too. Oh, and gitpatch.sh. :-)

Have fun and please share your opinions,

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

