Re: Re: more git updates..
On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> >
> > I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> > the CVS/SCCS format as storage may be more appealing than the current
> > git format.
>
> Go wild. I did mine in six days, and you've been whining about other
> people's SCMs for three years.

I wrote a hack to do efficient delta storage with O(1) seeks for lookup and append last week; I believe it's been integrated into the latest Bazaar-NG. I expect it'll give better compression and performance than BK. Of course it ends up being O(revisions) for modifications or insertions (but that is probably a non-issue for the SCM models we're looking at).

The git model is obviously very different, but I worry about the slop space implied. With 200k file revisions and an average of 2k slop per file, that's 400MB of slop, or almost the size of an equivalent delta-compressed kernel repo.

Now if you can assume that blobs never change and are never deleted, you can simply append them all onto a log, and then index them with a separate file containing an htree of (sha1, offset, length) or the like. Since the key is already a strong hash, this is an excellent match and avoids rehashing in the kernel's directory lookup. And it'll save an inode, a directory entry, and about half a data block per entry. "Open" will also be cheaper as there's no per-revision inode to grab.

I could hack on this if you think it fits with the git model, otherwise I'll go back to my other experiments..

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
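A minimal sketch of the append-only layout proposed in the message above, assuming blobs are immutable and never deleted. The class name `BlobLog` is hypothetical, the in-memory dict stands in for the on-disk htree, and the object name is assumed to be the SHA1 of the raw blob bytes. It illustrates the (sha1, offset, length) idea only; it is not code from git or Bazaar-NG.

```python
import hashlib
import io


class BlobLog:
    """Append-only blob log indexed by (sha1 -> offset, length).

    Hypothetical illustration: the dict below stands in for the
    separate htree index file suggested in the message.
    """

    def __init__(self, log=None):
        # Any seekable binary file object works; BytesIO keeps the demo in memory.
        self.log = log if log is not None else io.BytesIO()
        self.index = {}

    def append(self, data: bytes) -> str:
        """Append a blob unless it is already stored; return its SHA1 name."""
        key = hashlib.sha1(data).hexdigest()
        if key not in self.index:
            offset = self.log.seek(0, io.SEEK_END)  # always write at the end
            self.log.write(data)
            self.index[key] = (offset, len(data))
        return key

    def read(self, key: str) -> bytes:
        """One index lookup, one seek, one read: no per-blob inode involved."""
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)


if __name__ == "__main__":
    store = BlobLog()
    name = store.append(b"int main(void) { return 0; }\n")
    print(name, store.index[name])
    print(store.read(name))
```

Because the key is already a strong hash, storing the same content twice is a no-op, which is where the saved inode, directory entry and partial data block per entry would come from.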
Re: Re: more git updates..
On Wed, 13 Apr 2005, Russell King wrote:
>
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.
>
> BK is also a lot better than CVS. So _your_ point is?

Hey, anybody who wants to argue that BK is better than GIT won't be getting any counter-arguments from me.

The fact is, I have constraints. Like needing something to work within a few days. If somebody comes up with an ultra-fast, replicatable, space-efficient SCM in three days, I'm all over it.

In the meantime, I'd suggest people who worry about network bandwidth try to work out a synchronization protocol that allows you to send "diff updates" between git repositories. The git model doesn't preclude looking at the objects and sending diffs instead (and re-creating the objects on the other side).

But my time-constraints _do_.

		Linus
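One way the "diff updates" idea could look, as a rough sketch only: the sender encodes a new object as copy/data instructions against an object the receiver already has, and the receiver rebuilds it and checks the SHA1. The function names and the delta encoding are hypothetical, `difflib` is used purely for illustration (it is slow on large inputs), and the object name is assumed to be the SHA1 of the raw bytes. No such protocol is described in the thread.

```python
import hashlib
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Describe `new` as copies from `old` plus literal data."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # receiver already has these bytes
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))    # only these bytes cross the wire
        # "delete" needs no instruction: the bytes are simply not copied
    return ops


def apply_delta(old: bytes, ops, expected_sha1: str) -> bytes:
    """Re-create the object on the receiving side and verify its name."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    result = bytes(out)
    if hashlib.sha1(result).hexdigest() != expected_sha1:
        raise ValueError("reconstructed object does not match its SHA1 name")
    return result


if __name__ == "__main__":
    old = b"line one\nline two\nline three\n"
    new = b"line one\nline 2\nline three\nline four\n"
    delta = make_delta(old, new)
    rebuilt = apply_delta(old, delta, hashlib.sha1(new).hexdigest())
    print(rebuilt == new, sum(len(op[1]) for op in delta if op[0] == "data"))
```

The repository format stays untouched in this scheme: the delta exists only in transit, and the receiver ends up with the same full objects it would have fetched with rsync.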
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> Go wild. I did mine in six days, and you've been whining about other
> people's SCMs for three years.

Even if I had spent 6 days doing git, you'd never have thrown away BK in exchange for git.

> In other words - go and _do_ something instead of whining. I'm not
> interested.

CVS and SVN are already an order of magnitude more efficient than git at storing and exporting the data, and they shouldn't annoy you during checkins either. They have a much more efficient backend than git too, and yet you seem not to care about them.

My suggestion was simply to at least change git to coalesce the diffs like CVS/SCCS. I'm only making a suggestion to give git a chance to have a backend at least as efficient as the one that CVS uses, and to avoid running rsync on a 2.8G uncompressible blob.

I don't have enough spare time to do something myself. My spare time would be too short anyway to make a difference in the SCM space, so I'd rather spend it all in a more innovative space where it might have a slight chance to make a difference.
Re: Re: more git updates..
On Wed, Apr 13, 2005 at 10:30:52AM +0100, Russell King wrote:
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.

Yep, this is why I mentioned the SCCS format too. I didn't know it was even smaller, but I expected a similar density from SCCS.

> Note: I'm _not_ arguing with your sentiments towards CVS. However, I
> think the space usage point still stands.

If it wasn't for network synchronization it almost wouldn't matter, but fetching 2.8G uncompressible when I could simply fetch 220MB compressible (which zlib will compress at little cost during rsync to less than 78M) sounds like a bit of overkill.

> What is the space usage behaviour when you have multiple git trees?

Multiple trees in the sense of pulls from multiple developers aren't more costly than a normal checkin, due to the "soft hardlink" property of the hashes. It's just that every checkin takes lots of space and generates a new uncompressible blob every time a changeset touches a file.
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> > At the rate of 9M for every 198 changeset checkins, that means I'll have
> > to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> > per-file ratio due to the too-small files) for a whole pack including all
> > changesets without accounting the original 111MB of the original tree,
> > with rsync -z of git. That compares with 514M _compressible_ with CVS
> > format on-disk, and with ~79M of the CVS-network download with rsync -z of
> > the CVS repository (assuming default gzip compression level).
>
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which is more dense than CVS.

BK is also a lot better than CVS. So _your_ point is? 8)

Note: I'm _not_ arguing with your sentiments towards CVS. However, I think the space usage point still stands.

What is the space usage behaviour when you have multiple git trees? Do we need a git relink command in git-pasky? 8)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core
Re: Re: more git updates..
On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
>
> I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> the CVS/SCCS format as storage may be more appealing than the current
> git format.

Go wild. I did mine in six days, and you've been whining about other people's SCMs for three years.

In other words - go and _do_ something instead of whining. I'm not interested.

		Linus
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

I wasn't suggesting to use CVS. I meant that for a newly developed SCM, the CVS/SCCS format as storage may be more appealing than the current git format.

I guess I should have said RCS instead of CVS, sorry if that created any confusion.

The arch/darcs approach of practically storing patches would also be much denser, but it has no efficient way of doing "rcs up -p 1.x" on a file that doesn't involve potentially unpacking tons of unrelated changesets.
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

reiserfs can do tail packing, plus the disk block is meaningless when fetching the data from the network, which is the real cost to worry about when synchronizing and downloading (disk cost isn't a big deal).

The pagecache cost sounds like a very minor one too, since you don't need the whole data in ram; not even all dentries need to be in cache. This is one of the reasons why you don't need to run readdir, and why you can discard the old trees anytime.

At the rate of 9M for every 198 changeset checkins, that means I'll have to download 2.7G _uncompressible_ (i.e. already compressed with a bad per-file ratio due to the too-small files) for a whole pack including all changesets without accounting the original 111MB of the original tree, with rsync -z of git. That compares with 514M _compressible_ with CVS format on-disk, and with ~79M of the CVS-network download with rsync -z of the CVS repository (assuming default gzip compression level).

What BKCVS provided with 79M of rsync -z is now provided with 2.8G of rsync -z, with a network-bound slowdown of -97.2%. Similar slowdowns should be expected for synchronizations over time while fetching new blobs etc...

Ok, BKCVS has less than 6 checkins due to the linearization and coalescing of pulls that couldn't be represented losslessly in CVS, so the network-bound slowdown is less than -97.2%. My math is approximate, but the order of magnitude should remain the same.

Clearly one can write an ad-hoc network protocol instead of using rsync/wget, but the server will need quite a bit of cpu and ram to do a checkout/update/sync efficiently, to unpack all data and create all changesets to gzip and transfer.

Anyway, git's simplicity and the robustness of immutable hashes certainly make it an ideal interim format (and it may even be a very practical local live format on-disk, except for the backups). I'm only unsure if it's a wise idea to build an SCM on top of the current git format, or if it's better to use something like SCCS or CVS to coalesce all diffs of a single file together, to save space and make rsync -z very efficient too (or an approach like arch and darcs that stores changesets per file, i.e. patches).
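The arithmetic behind the figures in this message can be checked in a few lines. All inputs are the numbers quoted above; the constant names are made up for the illustration, and the changeset count is only what the quoted 2.7G implies at the quoted growth rate, not a number stated in the thread.

```python
# Figures quoted in the message above, in megabytes.
GROWTH_MB = 9            # archive growth observed for one batch of checkins
BATCH_CHANGESETS = 198   # checkins in that batch
HISTORY_MB = 2700        # projected growth for the whole history ("2.7G")
BASE_TREE_MB = 111       # quoted size of the starting archive
CVS_RSYNC_MB = 79        # CVS repository fetched with rsync -z

# How many changesets the 2.7G projection implies at 9M per 198 checkins.
implied_changesets = HISTORY_MB / GROWTH_MB * BATCH_CHANGESETS

# Total git download, the "2.8G" in the message.
git_total_mb = HISTORY_MB + BASE_TREE_MB

# How much larger the git download is, and the share of it that CVS avoids.
ratio = git_total_mb / CVS_RSYNC_MB
saving = 1 - CVS_RSYNC_MB / git_total_mb

print(f"implied changesets: {implied_changesets:.0f}")
print(f"git download:       {git_total_mb} MB")
print(f"git / CVS ratio:    {ratio:.1f}x")
print(f"CVS transfers {saving:.1%} less data")
```

Under these inputs the 97.2% figure is the fraction of the git transfer that the CVS transfer avoids, which corresponds to a download roughly 36 times larger.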
Re: Re: more git updates..
Hi David,

On Tue, Apr 12, 2005 at 06:36:23PM -0400, David Eger wrote:
> > No. A tree is not the full data. A tree contains enough information to
> > _recreate_ the full data, but the tree itself just tells you _how_ to do
> > that. It doesn't contain very much of the data itself at all.
>
> Perhaps I'd understand this if you tell me what "recreate" means.
> If I have a SHA1 hash of a file, and I have the file, I can verify that said
> file has the SHA1 hash it's supposed to have, but I can't generate the file
> from its hash...

But if you have the hexified SHA1 hash of a particular file you want to access, there would be a file with a filename equal to that hexified SHA1 hash, which contains the compressed contents of the file you're looking for.

At least, that's how I understood it...

With friendly regards,
Takis

-- 
OpenPGP key: http://lumumba.luc.ac.be/takis/takis_public_key.txt
fingerprint: 6571 13A3 33D9 3726 F728 AA98 F643 B12E ECF3 E029
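The lookup described in the reply above can be sketched as a tiny content-addressed store: the file name is the hex SHA1, the file body is the zlib-compressed content. This is a simplified model with hypothetical function names, assuming the name is the SHA1 of the raw contents; git's actual on-disk layout (directory fan-out, object headers, exactly which bytes are hashed) differs in detail and is not reproduced here.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_object(objdir: Path, data: bytes) -> str:
    """Store data in a file named after its hex SHA1; return that name."""
    name = hashlib.sha1(data).hexdigest()
    (objdir / name).write_bytes(zlib.compress(data))
    return name


def read_object(objdir: Path, name: str) -> bytes:
    """'Recreate' the data from its name: open the file, decompress, verify."""
    data = zlib.decompress((objdir / name).read_bytes())
    if hashlib.sha1(data).hexdigest() != name:
        raise ValueError("object is corrupt: contents do not match the name")
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        objdir = Path(tmp)
        name = write_object(objdir, b"hello, kernel\n")
        print(name)
        print(read_object(objdir, name))
```

The hash is never inverted: it is only used as the key under which the real data was filed, which is why a tree holding nothing but names is enough to recreate the files.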
Re: Re: more git updates..
On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
>
> At the rate of 9M for every 198 changeset checkins, that means I'll have
> to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> per-file ratio due to the too-small files) for a whole pack including all
> changesets without accounting the original 111MB of the original tree,
> with rsync -z of git. That compares with 514M _compressible_ with CVS
> format on-disk, and with ~79M of the CVS-network download with rsync -z of
> the CVS repository (assuming default gzip compression level).

Yes. CVS is much denser.

CVS is also total crap. So your point is?

		Linus
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
>
> Yes. A tree is defined by the blobs it references (and the subtrees) but
> it doesn't _contain_ them. It just contains a pointer to them.

A pointer to them? You mean a SHA1 hash of them? Or what? Where is the *real* data stored? The real files, the real patches? Are these somewhere completely outside of git?

> > Therefore, "TREE" must be the *full* data, and since we have the following
> > definition for CHANGESET:
>
> No. A tree is not the full data. A tree contains enough information to
> _recreate_ the full data, but the tree itself just tells you _how_ to do
> that. It doesn't contain very much of the data itself at all.

Perhaps I'd understand this if you tell me what "recreate" means. If I have a SHA1 hash of a file, and I have the file, I can verify that said file has the SHA1 hash it's supposed to have, but I can't generate the file from its hash...

Sorry for being stubbornly dumb, but you'll have a couple of us puzzling at the README ;-)

-dte
Re: Re: more git updates..
On Tue, 12 Apr 2005, David Eger wrote:
>
> The reason I am questioning this point is the GIT README file.
>
> Linus makes explicit that a "blob" is just the "file contents," and that
> really, a "blob" is not just the SHA1 of the "blob":
>
> > In particular, the "current directory cache" certainly does not need to
> > be consistent with the current directory contents, but it has two very
> > important attributes:
> >
> > (a) it can re-generate the full state it caches (not just the directory
> >     structure: through the "blob" object it can regenerate the data too)
>
> And he defines "TREE" with the same name: blob

Yes. A tree is defined by the blobs it references (and the subtrees) but it doesn't _contain_ them. It just contains a pointer to them.

> Therefore, "TREE" must be the *full* data, and since we have the following
> definition for CHANGESET:

No. A tree is not the full data. A tree contains enough information to _recreate_ the full data, but the tree itself just tells you _how_ to do that. It doesn't contain very much of the data itself at all.

> That each changeset remembers *everything* for *each point in the tree*.

But only BY REFERENCE.

A "commit" is usually very small. For example, the top-of-tree commit-file for my current kernel test is literally 401 _bytes_ in size. Because it just references a tree (20 bytes of _reference_).

> Linus, if you actually mean to differentiate between the full data
> and a SHA1 of the data

There is no differentiation. The sha1 _is_ the data as far as git is concerned. It's only confusing if you think they are different.

> Also, the details of just what data constitutes a 'changeset' would be
> lovely... i.e. a precise spec of what Pat is describing below...

	[EMAIL PROTECTED]:~/test-tools/linux-2.6.12-rc2> cat-file commit `cat .git/HEAD `
	tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6
	parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0
	author NeilBrown <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005
	committer Linus Torvalds <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005

	[PATCH] md: remove a number of misleading calls to MD_BUG

	The conditions that cause these calls to MD_BUG are not kernel bugs, just
	oddities in what userspace is asking for.

	Also convert analyze_sbs to return void, and the value it returned was
	always 0.

	Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
	Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
	Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

That's it. In all its glory. Compressed and tagged it's 401 bytes.

The tree it references is 677 bytes in size. That in turn references a number of subtrees, but almost all of the sub-trees are shared with _other_ tree commits, so their size is spread out over all the commits.

The full archive of the 2.6.12-rc2 kernel that I used for testing (only _one_ version) is 102MB in size. That's about half of what the kernel is uncompressed.

The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one and a test-run of 198 patches from Andrew) is 111MB. In other words, adding 198 "full" new kernels only grew the archive by 9MB (that's all "actual disk usage" btw - the files themselves are smaller, but since they all end up taking up a full disk block..)

Basically, the whole point of git is that objects are equated with their sha1 name, and that you can thus "include" an object by just referring to its name. The two are equivalent.

		Linus
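To make the "only by reference" point concrete, here is a small sketch that splits commit text of the shape shown in the `cat-file` output above into its header fields and message. The function `parse_commit` is hypothetical, and it assumes one `key value` pair per header line followed by a blank line, which is an inference from that single example rather than a format specification.

```python
def parse_commit(text: str) -> dict:
    """Split commit text into its references, identity lines and message.

    Assumes the layout seen in the cat-file output: header lines,
    one blank line, then the free-form message.
    """
    header, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)   # zero, one or more parents
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


if __name__ == "__main__":
    sample = (
        "tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6\n"
        "parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0\n"
        "author A. Developer Tue Apr 12 08:27:08 2005\n"      # placeholder identity
        "committer B. Maintainer Tue Apr 12 08:27:08 2005\n"  # placeholder identity
        "\n"
        "[PATCH] md: remove a number of misleading calls to MD_BUG\n"
    )
    commit = parse_commit(sample)
    print(commit["tree"], len(bytes.fromhex(commit["tree"])))
    print(commit["parents"])
```

The whole kernel tree enters the commit through one 40-character hex name, which is 20 bytes in binary form; everything else in the commit is metadata and the message.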
Re: Re: more git updates..
The reason I am questioning this point is the GIT README file.

Linus makes explicit that a "blob" is just the "file contents," and that really, a "blob" is not just the SHA1 of the "blob":

> In particular, the "current directory cache" certainly does not need to
> be consistent with the current directory contents, but it has two very
> important attributes:
>
> (a) it can re-generate the full state it caches (not just the directory
>     structure: through the "blob" object it can regenerate the data too)

And he defines "TREE" with the same name: blob

> TREE: The next hierarchical object type is the "tree" object. A tree
> object is a list of permission/name/blob data, sorted by name.

Therefore, "TREE" must be the *full* data, and since we have the following definition for CHANGESET:

> A "changeset" is defined by the tree-object that it results in, the
> parent changesets (zero, one or more) that led up to that point, and a
> comment on what happened.

it follows that each changeset remembers *everything* for *each point in the tree*.

Linus, if you actually mean to differentiate between the full data and a SHA1 of the data, *please please please* say "blob" in one place and "SHA1 of the blob" elsewhere. It's quite confusing, to me at least.

Also, the details of just what data constitutes a 'changeset' would be lovely... i.e. a precise spec of what Pat is describing below...

-dte

> where David Eger <[EMAIL PROTECTED]> told me that...
> > So with git, *every* changeset is an entire (compressed) copy of the
> > kernel. Really? Every patch you accept adds 37 MB to your hard disk?
> >
> > Am I missing something here?
>
> Yes. Only changed files re-appear. The unchanged files keep the same
> SHA1 hash, therefore they don't re-appear in the repository.
>
> So, if Linus gets a patch which sanitizes drivers/char/selection.c,
> only these new objects appear in the repository:
>
> 	drivers/char/selection.c
> 	drivers/char
> 	drivers
> 	. (project root)
> 	commit message
Re: Re: more git updates..
Dear diary, on Tue, Apr 12, 2005 at 06:05:19AM CEST, I got a letter
where David Eger <[EMAIL PROTECTED]> told me that...
> So with git, *every* changeset is an entire (compressed) copy of the
> kernel. Really? Every patch you accept adds 37 MB to your hard disk?
>
> Am I missing something here?

Yes. Only changed files re-appear. The unchanged files keep the same SHA1 hash, therefore they don't re-appear in the repository.

So, if Linus gets a patch which sanitizes drivers/char/selection.c, only these new objects appear in the repository:

	drivers/char/selection.c
	drivers/char
	drivers
	. (project root)
	commit message

Kind regards,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
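The sharing described above can be reproduced with a toy object store. This is a sketch under simplifying assumptions: directories are nested dicts, files are bytes, and the `blob`/`tree` payload encoding and the function `store_tree` are invented for the illustration, not git's real object format. Commits are not modelled; a commit object would be the fifth new object in the list above.

```python
import hashlib


def store_tree(node, objects: dict) -> str:
    """Store a file (bytes) or directory (dict) and return its SHA1 name.

    A directory's payload lists only the names of its children, so an
    unchanged subtree hashes to the same name and is stored once.
    """
    if isinstance(node, bytes):
        payload = b"blob\0" + node
    else:
        entries = [f"{name} {store_tree(child, objects)}"
                   for name, child in sorted(node.items())]
        payload = b"tree\0" + "\n".join(entries).encode()
    name = hashlib.sha1(payload).hexdigest()
    objects[name] = payload   # re-storing identical content is a no-op
    return name


if __name__ == "__main__":
    objects = {}
    v1 = {"Makefile": b"all:\n",
          "drivers": {"char": {"selection.c": b"old\n", "tty.c": b"tty\n"},
                      "net": {"e100.c": b"net\n"}}}
    store_tree(v1, objects)
    before = set(objects)

    # Change one file and store the whole tree again.
    v2 = {"Makefile": b"all:\n",
          "drivers": {"char": {"selection.c": b"sanitized\n", "tty.c": b"tty\n"},
                      "net": {"e100.c": b"net\n"}}}
    store_tree(v2, objects)
    print(len(before), "objects before,", len(set(objects) - before), "new")
```

In this model the second version adds exactly four objects: the changed blob and the three trees on the path from it to the root (drivers/char, drivers and the project root). Everything else keeps its name and is shared.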
Re: Re: more git updates..
Dear diary, on Tue, Apr 12, 2005 at 06:05:19AM CEST, I got a letter where David Eger [EMAIL PROTECTED] told me that... So with git, *every* changeset is an entire (compressed) copy of the kernel. Really? Every patch you accept adds 37 MB to your hard disk? Am I missing something here? Yes. Only changes files re-appear. The unchanged files keep the same SHA1 hash, therefore they don't re-appear in the repository. So, if Linus gets a patch which sanitizes drivers/char/selection.c, only these new objects appear in the repository: drivers/char/selection.c drivers/char drivers . (project root) commit message Kind regards, -- Petr Pasky Baudis Stuff: http://pasky.or.cz/ 98% of the time I am right. Why worry about the other 3%. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Re: more git updates..
The reason I am questioning this point is the GIT README file. Linus makes explicit that a blob is just the file contents, and that really, a blob is not just the SHA1 of the blob: In particular, the current directory cache certainly does not need to be consistent with the current directory contents, but it has two very important attributes: (a) it can re-generate the full state it caches (not just the directory structure: through the blob object it can regenerate the data too) And he defines TREE with the same name: blob TREE: The next hierarchical object type is the tree object. A tree object is a list of permission/name/blob data, sorted by name. Therefore, TREE must be the *full* data, and since we have the following definition for CHANGESET: A changeset is defined by the tree-object that it results in, the parent changesets (zero, one or more) that led up to that point, and a comment on what happened. That each changeset remembers *everything* for *each point in the tree*. Linus, if you actually mean to differentiate between the full data and a SHA1 of the data, *please please please* say blob in one place and SHA1 of the blob elsewhere. It's quite confusing, to me at least. Also, the details of just what data constitutes a 'changeset' would be lovely... i.e. a precise spec of what Pat is describing below... -dte where David Eger [EMAIL PROTECTED] told me that... So with git, *every* changeset is an entire (compressed) copy of the kernel. Really? Every patch you accept adds 37 MB to your hard disk? Am I missing something here? Yes. Only changes files re-appear. The unchanged files keep the same SHA1 hash, therefore they don't re-appear in the repository. So, if Linus gets a patch which sanitizes drivers/char/selection.c, only these new objects appear in the repository: drivers/char/selection.c drivers/char drivers . 
(project root) commit message - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Re: more git updates..
On Tue, 12 Apr 2005, David Eger wrote:
>
> The reason I am questioning this point is the GIT README file. Linus makes explicit that a blob is just the file contents, and that really, a blob is not just the SHA1 of the blob:
>
>     In particular, the current directory cache certainly does not need to be consistent with the current directory contents, but it has two very important attributes:
>
>     (a) it can re-generate the full state it caches (not just the directory structure: through the blob object it can regenerate the data too)
>
> And he defines TREE with the same name: blob

Yes. A tree is defined by the blobs it references (and the subtrees) but it doesn't _contain_ them. It just contains a pointer to them.

> Therefore, TREE must be the *full* data, and since we have the following definition for CHANGESET:

No. A tree is not the full data. A tree contains enough information to _recreate_ the full data, but the tree itself just tells you _how_ to do that. It doesn't contain very much of the data itself at all.

> That each changeset remembers *everything* for *each point in the tree*.

But only BY REFERENCE.

A commit is usually very small. For example, the top-of-tree commit-file for my current kernel test is literally 401 _bytes_ in size. Because it just references a tree (20 bytes of _reference_).

> Linus, if you actually mean to differentiate between the full data and a SHA1 of the data

There is no differentiation. The sha1 _is_ the data as far as git is concerned. It's only confusing if you think they are different.

> Also, the details of just what data constitutes a 'changeset' would be lovely... i.e. a precise spec of what Pat is describing below...

    [EMAIL PROTECTED]:~/test-tools/linux-2.6.12-rc2 cat-file commit `cat .git/HEAD`
    tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6
    parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0
    author NeilBrown [EMAIL PROTECTED] Tue Apr 12 08:27:08 2005
    committer Linus Torvalds [EMAIL PROTECTED] Tue Apr 12 08:27:08 2005

    [PATCH] md: remove a number of misleading calls to MD_BUG

    The conditions that cause these calls to MD_BUG are not kernel bugs, just oddities in what userspace is asking for.

    Also convert analyze_sbs to return void, and the value it returned was always 0.

    Signed-off-by: Neil Brown [EMAIL PROTECTED]
    Signed-off-by: Andrew Morton [EMAIL PROTECTED]
    Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

That's it. In all its glory. Compressed and tagged it's 401 bytes.

The tree it references is 677 bytes in size. That in turn references a number of subtrees, but almost all of the sub-trees are shared with _other_ tree commits, so their size is spread out over all the commits.

The full archive of the 2.6.12-rc2 kernel that I used for testing (only _one_ version) is 102MB in size. That's about half of what the kernel is uncompressed.

The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one and a test-run of 198 patches from Andrew) is 111MB. In other words, adding 198 full new kernels only grew the archive by 9MB (that's all actual disk usage btw - the files themselves are smaller, but since they all end up taking up a full disk block..)

Basically, the whole point of git is that objects are equated with their sha1 name, and that you can thus include an object by just referring to its name. The two are equivalent.

		Linus
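To make "a commit just references a tree" concrete, here is a minimal sketch that splits commit text of the shape shown above into its fields. The `parse_commit` helper is hypothetical, and the sample is shortened to the headers and subject line from the message; it assumes one "key value" header per line followed by a blank line and the comment.

```python
def parse_commit(text):
    """Split commit text into header fields and the free-form message."""
    header, _, message = text.partition("\n\n")
    fields = {"parent": []}          # zero, one or more parents
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)
        else:
            fields[key] = value
    fields["message"] = message
    return fields


sample = (
    "tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6\n"
    "parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0\n"
    "author NeilBrown [EMAIL PROTECTED] Tue Apr 12 08:27:08 2005\n"
    "committer Linus Torvalds [EMAIL PROTECTED] Tue Apr 12 08:27:08 2005\n"
    "\n"
    "[PATCH] md: remove a number of misleading calls to MD_BUG\n"
)

commit = parse_commit(sample)
# 40 hex digits name the tree, i.e. 20 bytes of reference.
tree_reference = bytes.fromhex(commit["tree"])
print(commit["tree"], len(tree_reference))
```

The commit carries no file data at all; everything else is reached by following the tree name.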
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> Yes. A tree is defined by the blobs it references (and the subtrees) but it doesn't _contain_ them. It just contains a pointer to them.

A pointer to them? You mean a SHA1 hash of them? or what? Where is the *real* data stored? The real files, the real patches? Are these somewhere completely outside of git?

> > Therefore, TREE must be the *full* data, and since we have the following definition for CHANGESET:
>
> No. A tree is not the full data. A tree contains enough information to _recreate_ the full data, but the tree itself just tells you _how_ to do that. It doesn't contain very much of the data itself at all.

Perhaps I'd understand this if you tell me what recreate means. If I have a SHA1 hash of a file, and I have the file, I can verify that said file has the SHA1 hash it's supposed to have, but I can't generate the file from its hash...

Sorry for being stubbornly dumb, but you'll have a couple of us puzzling at the README ;-)

-dte
Re: Re: more git updates..
On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
>
> At the rate of 9M for every 198 changeset checkins, that means I'll have to download 2.7G _uncompressible_ (i.e. already compressed with a bad per-file ratio due the too-small files) for a whole pack including all changesets without accounting the original 111MB of the original tree, with rsync -z of git. That compares with 514M _compressible_ with CVS format on-disk, and with ~79M of the CVS-network download with rsync -z of the CVS repository (assuming default gzip compression level).

Yes. CVS is much denser.

CVS is also total crap. So your point is?

		Linus
Re: Re: more git updates..
Hi David,

On Tue, Apr 12, 2005 at 06:36:23PM -0400, David Eger wrote:
> > No. A tree is not the full data. A tree contains enough information to _recreate_ the full data, but the tree itself just tells you _how_ to do that. It doesn't contain very much of the data itself at all.
>
> Perhaps I'd understand this if you tell me what recreate means. If I have a SHA1 hash of a file, and I have the file, I can verify that said file has the SHA1 hash it's supposed to have, but I can't generate the file from its hash...

But if you have the hexified SHA1 hash of a particular file you want to access, there would be a file with a filename equal to that hexified SHA1 hash which contains the compressed contents of the file you're looking for.

At least, that's how I understood it...

With friendly regards,
Takis

--
OpenPGP key: http://lumumba.luc.ac.be/takis/takis_public_key.txt
fingerprint: 6571 13A3 33D9 3726 F728 AA98 F643 B12E ECF3 E029
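A minimal sketch of that lookup, to show that "recreate" means fetching a stored file by name rather than inverting the hash. The `write_object` and `read_object` helpers, the two-character directory fan-out, and hashing the uncompressed bytes are all assumptions made for this illustration; they are not claimed to match what git does on disk.

```python
import hashlib
import os
import tempfile
import zlib


def write_object(root, data):
    """Compress data into a file named after the hex SHA-1 of the data."""
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(root, name[:2], name[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return name


def read_object(root, name):
    """Recreate the data from its name alone by reading the matching file."""
    path = os.path.join(root, name[:2], name[2:])
    with open(path, "rb") as f:
        data = zlib.decompress(f.read())
    # The name doubles as a checksum, so corruption is detectable.
    if hashlib.sha1(data).hexdigest() != name:
        raise ValueError("object does not match its name")
    return data


root = tempfile.mkdtemp()
name = write_object(root, b"int main(void) { return 0; }\n")
restored = read_object(root, name)
print(name, restored)
```

Anything that holds only the 40-character name, such as a tree entry, can get the full contents back this way as long as the object file exists.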
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one and a test-run of 198 patches from Andrew) is 111MB. In other words, adding 198 full new kernels only grew the archive by 9MB (that's all actual disk usage btw - the files themselves are smaller, but since they all end up taking up a full disk block..)

reiserfs can do tail packing, plus the disk block is meaningless when fetching the data from the network, which is the real cost to worry about when synchronizing and downloading (disk cost isn't a big deal).

The pagecache cost sounds like a very minor one too, since you don't need the whole data in ram, not even all dentries need to be in cache. This is one of the reasons why you don't need to run readdir, and why you can discard the old trees anytime.

At the rate of 9M for every 198 changeset checkins, that means I'll have to download 2.7G _uncompressible_ (i.e. already compressed with a bad per-file ratio due to the too-small files) for a whole pack including all changesets, without accounting for the original 111MB of the original tree, with rsync -z of git. That compares with 514M _compressible_ with CVS format on-disk, and with ~79M of the CVS-network download with rsync -z of the CVS repository (assuming default gzip compression level).

What BKCVS provided with 79M of rsync -z is now provided with 2.8G of rsync -z, with a network-bound slowdown of -97.2%. Similar slowdowns should be expected for synchronizations over time while fetching new blobs etc...

Ok, BKCVS has less than 6 checkins due the linearization and coalescing of pulls that couldn't be represented losslessly in CVS, so the network-bound slowdown is less than -97.2%. My math is approximate, but the order of magnitude should remain the same.

Clearly one can write an ad-hoc network protocol instead of using rsync/wget, but the server will need quite a bit of cpu and ram to do a checkout/update/sync efficiently, to unpack all data and create all changesets to gzip and transfer.

Anyway, git's simplicity and the robustness of immutable hashes certainly make it an ideal interim format (and it may even be a very practical local live format on-disk, except for the backups). I'm only unsure if it's a wise idea to build an SCM on top of the current git format, or if it's better to use something like SCCS or CVS to coalesce all diffs of a single file together, to save space and make rsync -z very efficient too (or an approach like arch and darcs that stores changesets per file, i.e. patches).
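The extrapolation above can be retraced with a few lines of arithmetic. The sizes are the ones quoted in the thread; the changeset count of 60,000 is an assumption chosen here because it reproduces the 2.7G figure, and it is not stated in the message itself.

```python
# Sizes in MB as quoted in the thread.
single_tree_mb = 102      # one 2.6.12-rc2 tree in git
archive_199_mb = 111      # the same tree plus 198 patches
cvs_rsync_mb = 79         # BKCVS download with rsync -z
git_rsync_mb = 2800       # the 2.8G figure for git

growth_mb = archive_199_mb - single_tree_mb      # 9 MB for 198 changesets
per_changeset_mb = growth_mb / 198               # about 0.045 MB each

changesets = 60_000       # ASSUMPTION: history size implied by the 2.7G figure
history_mb = per_changeset_mb * changesets       # about 2727 MB
total_mb = history_mb + archive_199_mb           # about 2838 MB, i.e. ~2.8G

# "-97.2%": the CVS download is about 2.8% of the git download.
reduction = 1 - cvs_rsync_mb / git_rsync_mb
transfer_ratio = git_rsync_mb / cvs_rsync_mb     # git moves ~35x more data

print(round(history_mb), round(total_mb))
print(round(reduction * 100, 1), round(transfer_ratio, 1))
```

Under that assumption the numbers are self-consistent: roughly 45KB of new objects per changeset, and a download about 35 times larger than the CVS one.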
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

I wasn't suggesting to use CVS. I meant that for a newly developed SCM, the CVS/SCCS format as storage may be more appealing than the current git format. I guess I should have said RCS instead of CVS, sorry if that created any confusion.

The arch/darcs approach of practically storing patches would also be much denser, but it has no efficient way of doing rcs up -p 1.x on a file that doesn't involve potentially unpacking tons of unrelated changesets.
Re: Re: more git updates..
On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
>
> I wasn't suggesting to use CVS. I meant that for a newly developed SCM, the CVS/SCCS format as storage may be more appealing than the current git format.

Go wild. I did mine in six days, and you've been whining about other peoples SCM's for three years.

In other words - go and _do_ something instead of whining. I'm not interested.

		Linus
Re: Re: more git updates..
Dear diary, on Mon, Apr 11, 2005 at 05:49:31PM CEST, I got a letter
where "Randy.Dunlap" <[EMAIL PROTECTED]> told me that...
> On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote:
..snip..
> | Yes. Crappy old tree, but it can still read my git.git directory, so you
> | can use it to update to my current source base.
>
> Please go into a little more detail about how to do this step...
> that seems to be the most basic concept that I am missing.
> i.e., how to find the "latest/current" tree (version/commit)
> and check it out (read-tree, checkout-cache, etc.).

Well, its ID is by convention kept in .dircache/HEAD. But that is really only a convention; no "core git" tool reads it directly, and you need to update it manually after you do commit-tree.

First, you need to get the accompanying tree's id. git-pasky's shortcut is $(tree-id), but manually you can do it by

	cat-file commit $(cat .dircache/HEAD) | egrep '^tree'

Note that if you ever forgot to update HEAD or if you have multiple branches in your repository, you can list all "head commits" (that is, commits which have no other commits referencing them as parents) by doing fsck-cache.

Now, you need to populate the directory cache by the tree (see Paul Jackson's diagram):

	read-tree $tree_id

And now you want to update your working tree from the cache:

	checkout-cache -a -f

This will bring your tree in sync with the cache (it won't remove any stale files, though). That means it will overwrite your local changes too - turn that off by omitting the "-f". If you want to update only some files, omit the "-a" and list them.

--
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
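The definition of "head commits" given above (commits that no other commit names as a parent) is easy to state in code. This is only a sketch of the idea; the `head_commits` helper and the single-letter commit names are hypothetical, and it says nothing about how fsck-cache is actually implemented.

```python
def head_commits(parents_of):
    """Return the commits that no other commit lists as a parent."""
    referenced = {p for parents in parents_of.values() for p in parents}
    return sorted(c for c in parents_of if c not in referenced)


# Toy history: commit name -> list of parent names.
history = {
    "a": [],            # initial commit
    "b": ["a"],
    "c": ["b"],
    "d": ["b"],
    "e": ["c", "d"],    # merge of two lines of development
    "f": ["b"],         # a branch whose HEAD was never recorded
}

heads = head_commits(history)
print(heads)
```

Here "e" and "f" come out as heads, which is how a forgotten HEAD update or a second branch would show up.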
Re: Re: more git updates..
Dear diary, on Mon, Apr 11, 2005 at 01:14:57AM CEST, I got a letter
where Paul Jackson <[EMAIL PROTECTED]> told me that...
> Useful explanation - thanks, Linus.
>
> Is this picture and description accurate:
>
> ==
>
>   < working directory files (foo.c) >
>      ^                     |
>      | upward ops          | downward ops
>      | ----------          | ------------
>      | checkout-cache      | update-cache
>      | show-diff           |
>      |                     v
>   < current directory cache (".dircache/index") >
>      ^                     |
>      | upward ops          | downward ops
>      | ----------          | ------------
>      | read-tree           | write-tree
>      |                     | commit-tree
>      |                     v
>   < git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >

Well, except that from a purely technical standpoint commit-tree has nothing to do in this picture - it creates a new object in the git filesystem based on its input data, but regardless of the directory cache or current tree. It probably still belongs where it is from the workflow standpoint, though.

..snip..
> Minor question:
>
> I must have an old version - I got 'git-0.03', but
> it doesn't have 'checkout-cache', and its 'read-tree'
> directly writes my working files.
>
> How do I get a current version? Well, one way I see,
> and that's to pick up Pasky's:
>
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>
> Perhaps that's the best way?

You can take mine, and do:

	git pull pasky
	git pull linus
	cp .dircache/HEAD .dircache/HEAD.local

Now, your tree and git filesystem are up to date.

	git track local

Now, when you do git pull pasky, your working tree will not be updated automatically anymore.

	git track linus

Now, you start tracking Linus' tree instead. Note that the initial update will blow away the scripts in your current tree, so before you do the last two steps you will probably want to clone the tree and set PATH to the one still tracking me, so you get all the comfort. ;-)

--
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
Re: Re: more git updates..
Dear diary, on Sun, Apr 10, 2005 at 08:42:53PM CEST, I got a letter
where Christopher Li <[EMAIL PROTECTED]> told me that...
> I totally agree that odds is really really small.
> That is why it is not worthy to handle the case. People hit that
> can just add a new line or some thing to avoid it, if
> it happen after all.
>
> It is the little peace of mind to know for sure that did
> not happen. I am just paranoid.

BTW, I've merged the check to git-pasky some time ago, you can disable it in the Makefile. It is by default on now, until someone convinces me it actually affects performance measurably.

--
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
Re: RE: more git updates..
Dear diary, on Mon, Apr 11, 2005 at 12:07:37AM CEST, I got a letter
where "Luck, Tony" <[EMAIL PROTECTED]> told me that...
..snip..
> >Hey, I may end up being wrong, and yes, maybe I should have done a
> >two-level one. The good news is that we can trivially fix it later (even
> >dynamically - we can make the "sha1 object tree layout" be a per-tree
> >config option, and there would be no real issue, so you could make small
> >projects use a flat version and big projects use a very deep structure
> >etc). You'd just have to script some renames to move the files around.
>
> It depends on how many eco-system shell scripts get built that need to
> know about the layout ... if some shell/perl "libraries" encode this
> filename layout (and people use them) ... then switching later would
> indeed be painless.

FWIW, my short-term plans include support for monotone-like hash ID shortening - it's enough to use the shortest leading unique part of the ID to identify the revision. I will poke into the object repository for that.

I also already have Randy Dunlap's git lsobj, which will list all objects of a specified type (very useful especially when looking for orphaned commits and such rather lowlevel work).

--
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
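The hash ID shortening idea can be sketched as follows. Both helpers (`shortest_unique_prefix` and `resolve`) are hypothetical names for this illustration, and the two IDs are simply the tree and parent hashes quoted elsewhere in the thread; this is not a description of how git-pasky ended up implementing it.

```python
def shortest_unique_prefix(target, ids):
    """Return the shortest leading part of target that no other ID shares."""
    others = [i for i in ids if i != target]
    for length in range(1, len(target) + 1):
        prefix = target[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return target


def resolve(prefix, ids):
    """Expand an abbreviated ID, refusing ambiguous or unknown prefixes."""
    matches = [i for i in ids if i.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matches {len(matches)} objects")
    return matches[0]


ids = [
    "cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6",
    "c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0",
]

short = shortest_unique_prefix(ids[0], ids)
full = resolve("c7", ids)
print(short, full)
```

With only these two objects, "c" alone is ambiguous, so two characters are the minimum; a real repository would need longer prefixes as it grows.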
Re: Re: more git updates..
On Sun, Apr 10, 2005 at 11:41:53AM +0200, Petr Baudis wrote:
> Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
> where Christopher Li <[EMAIL PROTECTED]> told me that...
> > On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> > >
> > > But I am wondering what your plans are to handle renames---or
> > > does git already represent them?
> > >
> >
> > Rename should just work. It will create a new tree object and you
> > will notice that in the entry that changed, the hash for the blob
> > object is the same.
>
> Which is of course wrong when you want to do proper merging, examine
> per-file history, etc. One solution which springs to my mind is to have
> a UUID accompany each blob and tree; that will take relatively lot of
> space though, and I'm not sure it is really worth it.

It should just use the two-step approach (a pure rename, then the change); then it is tractable with git as it is now.

Chris
Re: Re: more git updates..
Dear diary, on Sun, Apr 10, 2005 at 11:28:54AM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> > "CL" == Christopher Li <[EMAIL PROTECTED]> writes:
>
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >>
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >>
>
> CL> Rename should just work. It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
>
> Sorry, I was unclear. But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?

No. See diff-tree output and http://pasky.or.cz/~pasky/dev/git/gitdiff-do for how it's done. Basically, you just take the two trees and compare them linearly (do a normal diff on them, essentially). Then the differences you spot this way are everything that needs to appear in the patch.

> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time. If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?

Actually, this could be possible too I think. We will have to make diff-tree two-pass, but it is already so blindingly fast that I guess that doesn't hurt too much. I might try to get my hands on that.

--
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
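Comparing two trees linearly can be sketched as a single merge-style pass over two name-sorted listings. This is an illustration of the idea only, not the diff-tree source: the `diff_tree` helper, the "-", "+" and "*" markers, and the placeholder hashes h1, h2 and so on are all assumptions made for the example.

```python
def diff_tree(old, new):
    """Compare two lists of (path, sha1) pairs, each sorted by path."""
    changes = []
    i = j = 0
    while i < len(old) or j < len(new):
        if j == len(new) or (i < len(old) and old[i][0] < new[j][0]):
            changes.append(("-", old[i][0], old[i][1]))   # only in old
            i += 1
        elif i == len(old) or new[j][0] < old[i][0]:
            changes.append(("+", new[j][0], new[j][1]))   # only in new
            j += 1
        else:
            if old[i][1] != new[j][1]:                    # same path, new contents
                changes.append(("*", old[i][0], new[j][1]))
            i += 1
            j += 1
    return changes


old_tree = [("a.c", "h1"), ("b.c", "h2"), ("c.c", "h3")]
new_tree = [("a.c", "h1"), ("b.c", "h2x"), ("d.c", "h3")]

changes = diff_tree(old_tree, new_tree)
for change in changes:
    print(*change)
```

Entries whose hash is unchanged are skipped without looking at their contents, and no history between the two trees is read at all, which is why no walk back to a common ancestor is needed.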
Re: Re: more git updates..
Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
where Christopher Li <[EMAIL PROTECTED]> told me that...
> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
> >
>
> Rename should just work. It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

Which is of course wrong when you want to do proper merging, examine per-file history, etc. One solution which springs to my mind is to have a UUID accompany each blob and tree; that will take a relatively large amount of space though, and I'm not sure it is really worth it.

How many renames were there in the 64k commits so far anyway?

--
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
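The "same blob hash under a new name" observation, and its limit, can be shown with a short sketch. The `pure_renames` helper, the paths and the placeholder hashes are hypothetical; the point is only that matching by hash finds pure moves and finds nothing once the file is also edited.

```python
def pure_renames(removed, added):
    """Pair removed and added (path, sha1) entries that share the same hash."""
    paths_by_hash = {}
    for path, sha in removed:
        paths_by_hash.setdefault(sha, []).append(path)
    renames = []
    for path, sha in added:
        candidates = paths_by_hash.get(sha)
        if candidates:
            renames.append((candidates.pop(0), path))
    return renames


# A pure move: the blob hash is unchanged, so the rename is visible.
moved = pure_renames([("fs/old.c", "h3")],
                     [("fs/new.c", "h3"), ("fs/extra.c", "h9")])

# A move plus a small edit: the hash differs, so nothing is matched.
moved_and_edited = pure_renames([("fs/old.c", "h3")], [("fs/new.c", "h3x")])

print(moved, moved_and_edited)
```

Recording the move and the edit as two separate trees, as suggested in the thread, would keep the first step detectable this way.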
Re: Re: more git updates..
Dear diary, on Sun, Apr 10, 2005 at 01:31:10AM CEST, I got a letter
where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> On Sat, 9 Apr 2005, Linus Torvalds wrote:
> >
> > Actually, I guess I wouldn't have to change the format. I could just
> > extend the existing "tree" object to be able to point to other trees, and
> > that's it.
>
> Done, and pushed out. The current git.git repository seems to do all of
> this correctly.
..snip..

Ok, so now I can dare announce it, I hope. I hacked my branch of git somewhat, kept in sync with Linus, and now I have something to show.

Please see it at

	http://pasky.or.cz/~pasky/dev/git/

It is basically a set of (still rather crude) shell scripts upon Linus' git, which make it sanely usable by mere humans for actual version tracking. Its usage _is_ going to change, so don't get too used to it (that'd be hard anyway, I suspect), but it should be working nicely.

I have described most of the interesting parts and some basic usage in the README at that page. It wraps commits, supports log retrieval and comfortable diffing between any two trees. And on top of that, it can do some basic remote repositories - it will pull (rsync) from them and it can make the local copy track them - on pull, it will be updated accordingly (and your local commits on the tracked branch will get orphaned).

I didn't attach a patch against Linus since I think it's pretty much useless now. It's available as against-linus.patch on the web, and you can apply it to the latest git tree (NOT 0.03). But it's probably a better idea to wget my tree. You can then watch us making progress by

	gitpull.sh linus
	gitpull.sh pasky

and see where we differ by:

	gitdiff.sh linus pasky

(This is how the against-linus.patch was generated. I'd easily generate even a 0.03 patch this way, but I forgot to merge the fsck at that time, so it would suck.)

(Note that the tree you wget is set up to track my branch. If you want to stop tracking it (basically necessary now if you want to do local commits), do:

	cp .dircache/HEAD .dircache/HEAD.local
	gittrack.sh

The cp says something like "I want to pick up where the tracked branch left off". Otherwise, untracking would return you to your "local" branch, which is just some ancient predecessor of the pasky branch here anyway.)

Note that I didn't really test it on anything but git itself yet, so I'm not sure how it will cope especially with directories - I tried to make it aware of them though. I will do some more practical testing tomorrow.

Otherwise, I will probably try to consolidate the usage and documentation now, and beautify the scripts. I might start pondering some merging too. Oh, and gitpatch.sh. :-)

Have fun and please share your opinions,

--
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.