Re: Kernel SCM saga..

Linus wrote:
> Almost everything
> else keeps the <sha1> in the ASCII hexadecimal representation, and I
> should have done that here too. Why? Not because it's a <sha1> - hey, the 
> binary representation is certainly denser and equivalent

Since the size of  ASCII sha1's is only about 18% larger
than the size of the same number of binary sha1's , I
don't see you gain much from the binary.

I cast my non-existent vote for making the sha1 ascii - while you still can ;).

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Kernel SCM saga..

Chris wrote:
> How many is a lot?  Are we talking 100k, 1m, 10m?

I pulled some numbers out of my bk tree for Linux.

I have 16817 source files.

They average 12.2 bitkeeper changes per file (counting the number of
changes visible from doing 'bk sccslog' on each of the 16817 files). 

These 16817 files consume:

224 MBytes uncompressed and
 95 MBytes compressed

(using zlib's minigzip, on a 4 KB page reiserfs.)

Since each change will get its own copy of the file, multiplying these
two sizes (224 and 95) by 12.2 changes per file means the disk cost
would be:

2.73 GByte uncompressed, or
1.16 GBytes compressed.

I was pleasantly surprised at the degree of compression, shrinking files
to 42% of their original size.  I expected, since the classic rule of
thumb here to archive before compressing wasn't being followed (nor
should it be) and we were compressing lots of little files, we would save
fewer disk blocks than this.

Of course, since as Linus reminds us, it's disk buffers in memory,
not blocks on disk, that are precious, it's more like we will save
224 - 95 == 129 MBytes of RAM to hold one entire tree.
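
The arithmetic above can be rechecked with a short sketch. It assumes, as the estimate does, that every change costs a full copy of an average-sized file, and it takes 1 GByte = 1000 MBytes to match the rounding in the message. The variable names are illustrative only.

```python
# Figures taken from the message above.
source_files = 16817
changes_per_file = 12.2
uncompressed_mb = 224
compressed_mb = 95

# One full copy of the file per change.
file_versions = round(source_files * changes_per_file)
total_uncompressed_gb = uncompressed_mb * changes_per_file / 1000
total_compressed_gb = compressed_mb * changes_per_file / 1000

# Compression ratio and the page-cache saving for one checked-out tree.
compressed_fraction = compressed_mb / uncompressed_mb
ram_saved_mb = uncompressed_mb - compressed_mb

print(f"{file_versions} file versions")
print(f"{total_uncompressed_gb:.2f} GB uncompressed, "
      f"{total_compressed_gb:.2f} GB compressed")
print(f"compressed size is {compressed_fraction:.0%} of the original")
print(f"{ram_saved_mb} MB of RAM saved per tree")
```

Under those assumptions the question "how many is a lot?" comes out at roughly 205,000 stored file versions.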

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Re: Re: Kernel SCM saga..

2005-04-09 Thread Phillip Lougher

On Apr 10, 2005 2:42 AM, Petr Baudis <[EMAIL PROTECTED]> wrote:
> Dear diary, on Sun, Apr 10, 2005 at 03:01:12AM CEST, I got a letter
> where Phillip Lougher <[EMAIL PROTECTED]> told me that...
> > On Apr 9, 2005 3:53 AM, Petr Baudis <[EMAIL PROTECTED]> wrote:
> >
> > >   FWIW, I made few small fixes (to prevent some trivial usage errors to
> > > cause cache corruption) and added scripts gitcommit.sh, gitadd.sh and
> > > gitlog.sh - heavily inspired by what already went through the mailing
> > > list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
> > > (including .dircache, even though it isn't shown in the index), the
> > > cumulative patch can be found below. The scripts aim to provide some
> > > (obviously very interim) more high-level interface for git.
> >
> > I did a bit of playing about with the changelog generate script,
> > trying to produce a faster version.  The attached version uses a
> > couple of improvements to be a lot faster (e.g. no recursion in the
> > common case of one parent).
> >
> > FWIW it is 7x faster than makechlog.sh (4.342 secs vs 34.129 secs) and
> > 28x faster than gitlog.sh (4.342 secs vs 2 mins 4 secs) on my
> > hardware.  Your mileage may of course vary.
> 
> Wow, really impressive! Great work, I've merged it (if you don't object,
> of course).

Of course I don't object...

> 
> Wondering why I wasn't in the Cc list, BTW.

Weird, it wasn't intentional.  I read LKML in Gmail (which I don't use
for much else), and just clicked "reply", expecting to do the right
thing.  Replying to this email it's also left you off the CC list. 
Looking at the email source I believe it's probably to do with the
following:

Mail-Followup-To: Linus Torvalds <[EMAIL PROTECTED]>,
[EMAIL PROTECTED],
Kernel Mailing List <[EMAIL PROTECTED]>

I've CC'd you explicitly on this.

Phillip

Re: Re: Re: Kernel SCM saga..

2005-04-09 Thread Petr Baudis

Dear diary, on Sun, Apr 10, 2005 at 03:01:12AM CEST, I got a letter
where Phillip Lougher <[EMAIL PROTECTED]> told me that...
> On Apr 9, 2005 3:53 AM, Petr Baudis <[EMAIL PROTECTED]> wrote:
> 
> >   FWIW, I made few small fixes (to prevent some trivial usage errors to
> > cause cache corruption) and added scripts gitcommit.sh, gitadd.sh and
> > gitlog.sh - heavily inspired by what already went through the mailing
> > list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
> > (including .dircache, even though it isn't shown in the index), the
> > cumulative patch can be found below. The scripts aim to provide some
> > (obviously very interim) more high-level interface for git.
> 
> I did a bit of playing about with the changelog generate script,
> trying to produce a faster version.  The attached version uses a
> couple of improvements to be a lot faster (e.g. no recursion in the
> common case of one parent).
> 
> FWIW it is 7x faster than makechlog.sh (4.342 secs vs 34.129 secs) and
> 28x faster than gitlog.sh (4.342 secs vs 2 mins 4 secs) on my
> hardware.  Your mileage may of course vary.

Wow, really impressive! Great work, I've merged it (if you don't object,
of course).

Wondering why I wasn't in the Cc list, BTW.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Re: Re: Kernel SCM saga..

2005-04-09 Thread Phillip Lougher

On Apr 9, 2005 3:53 AM, Petr Baudis <[EMAIL PROTECTED]> wrote:

>   FWIW, I made few small fixes (to prevent some trivial usage errors to
> cause cache corruption) and added scripts gitcommit.sh, gitadd.sh and
> gitlog.sh - heavily inspired by what already went through the mailing
> list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
> (including .dircache, even though it isn't shown in the index), the
> cumulative patch can be found below. The scripts aim to provide some
> (obviously very interim) more high-level interface for git.

I did a bit of playing about with the changelog generate script,
trying to produce a faster version.  The attached version uses a
couple of improvements to be a lot faster (e.g. no recursion in the
common case of one parent).
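
The idea of walking the history iteratively and only branching at merges can be sketched outside the shell as well. This is not the attached script: the `parents_of` mapping stands in for reading commit objects, the commit names are borrowed from the A1/B1 diagram used elsewhere in this thread, and the visiting order is an assumption of the sketch.

```python
def changelog(parents_of, head):
    """Return every commit reachable from head exactly once.

    An explicit stack replaces recursion, so a long single-parent
    chain costs one loop iteration per commit.
    """
    seen = set()
    order = []
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue  # already logged via another merge parent
        seen.add(commit)
        order.append(commit)
        stack.extend(parents_of.get(commit, []))
    return order


# Hypothetical history: BM merges the A line and the B line.
history = {
    "A1": [], "A2": ["A1"], "A3": ["A2"],
    "B1": ["A1"], "B2": ["B1"], "BM": ["A3", "B2"],
}
print(changelog(history, "BM"))
```

The `seen` set plays the role of the temporary file that the script greps to skip commits it has already printed.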

FWIW it is 7x faster than makechlog.sh (4.342 secs vs 34.129 secs) and
28x faster than gitlog.sh (4.342 secs vs 2 mins 4 secs) on my
hardware.  Your mileage may of course vary.

Regards

Phillip

--
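#!/bin/bash
# Needs bash rather than plain sh: it uses arrays, 'local' and 'echo -e'.

changelog() {
    # Keep the loop state local so that recursive calls cannot clobber it.
    local parents parent type text i
    local -a new_parent

    new_parent[0]=$1
    parents=1

    while [ "$parents" -gt 0 ]; do
        # Follow the last unseen parent iteratively.
        parent=${new_parent[$((parents-1))]}
        echo "$parent" >> "$TMP"
        cat-file commit "$parent" > "$TMP_FILE"

        echo "me $parent"
        cat "$TMP_FILE"
        echo -e "\n--\n"

        # Collect the parents not seen yet; stop at the 'committer' line.
        parents=0
        while read -r type text; do
            if [ "$type" = 'committer' ]; then
                break
            elif [ "$type" = 'parent' ] &&
                ! grep -q "$text" "$TMP" ; then
                new_parent[$parents]=$text
                parents=$((parents+1))
            fi
        done < "$TMP_FILE"

        # Recurse only for the extra parents of a merge.
        i=0
        while [ "$i" -lt $((parents-1)) ]; do
            changelog "${new_parent[$i]}"
            i=$((i+1))
        done
    done
}

TMP=$(mktemp)
TMP_FILE=$(mktemp)

base=$1
if [ ! "$base" ]; then
    base=$(cat .dircache/HEAD)
fi
changelog "$base"
rm -f "$TMP" "$TMP_FILE"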

Re: Kernel SCM saga..

David wrote:
> recovery is more difficult when you corrupt some
> file in your repository.

Agreed.  I too have recovered RCS and SCCS files by hand editing.


Linus wrote:
> I don't want people editing repository files by hand.

Tyrant !;)

From Wikipedia:

A tyrant is a usurper of rightful power.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Kernel SCM saga..

2005-04-09 Thread Chris Wedgwood

On Sat, Apr 09, 2005 at 04:13:51PM -0700, Linus Torvalds wrote:

> > I understand the arguments for compression, but I hate it for one
> > simple reason: recovery is more difficult when you corrupt some
> > file in your repository.

I've had this too.  Magic binary blobs are horrible here for data loss
which is why I'm not keen on subversion.

> Trust me, the way git does things, you'll have so much redundancy
> that you'll have to really _work_ at losing data.

It's not clear to me that compression should be *required* though.
Shouldn't we be able to turn this off in some cases?

> The bad news is that this is obviously why it does eat a lot of
> disk.

Disk is cheap, but sadly page-cache is not :-(

> Since it saves full-file commits, you're going to have a lot of
> (compressed) full files around.

How many is a lot?  Are we talking 100k, 1m, 10m?

Re: Kernel SCM saga..

2005-04-09 Thread Tupshin Harper

Roman Zippel wrote:
> It seems you exported the complete parent information and this is exactly
> the "nitty-gritty" I was "whining" about and which is not available via
> bkcvs or bkweb and it's the most crucial information to make the bk data
> useful outside of bk. Larry was previously very clear about this that he
> considers this proprietary bk meta data and anyone attempting to export
> this information is in violation with the free bk licence, so you indeed
> just took the important parts and this is/was explicitly verboten for
> normal bk users.

Yes, this is exactly the information that would be necessary to create a 
general interop tool between bk and darcs|arch|monotone, and is the 
fundamental objection I and others have had to open source projects 
using BK. Is Bitmover willing to grant a special dispensation to allow a 
lossless conversion of the linux history to another format?

-Tupshin

Re: Kernel SCM saga..

2005-04-09 Thread Linus Torvalds

On Sat, 9 Apr 2005, David S. Miller wrote:
> 
> I understand the arguments for compression, but I hate it for one
> simple reason: recovery is more difficult when you corrupt some
> file in your repository.

Trust me, the way git does things, you'll have so much redundancy that 
you'll have to really _work_ at losing data.

That's the good news.

The bad news is that this is obviously why it does eat a lot of disk. 
Since it saves full-file commits, you're going to have a lot of 
(compressed) full files around.

Linus

Re: Kernel SCM saga..

2005-04-09 Thread David S. Miller

On Fri, 8 Apr 2005 22:45:18 -0700 (PDT)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Also, I don't want people editing repository files by hand. Sure, the 
> sha1 catches it, but still... I'd rather force the low-level ops to use 
> the proper helper routines. Which is why it's a raw zlib compressed blob, 
> not a gzipped file.

I understand the arguments for compression, but I hate it for one
simple reason: recovery is more difficult when you corrupt some
file in your repository.

It's happened to me more than once and I did lose data.

Without compression, I might be able to recover if something
causes a block of zeros to be written to the middle of some
repository file.  With compression, you pretty much just lose.
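
The failure mode described here can be reproduced in a small sketch. It assumes Python's zlib module as a stand-in for the repository's zlib-compressed objects; the sample text, block size and offset are made up for illustration.

```python
import zlib

# Made-up stand-in for a repository file.
text = b"".join(b"line %d of some source file\n" % i for i in range(2000))
packed = zlib.compress(text)


def zero_block(buf, start, length):
    """Return a copy of buf with `length` bytes at `start` set to zero."""
    return buf[:start] + bytes(length) + buf[start + length:]


# Uncompressed copy: only the zeroed block itself is lost.
damaged_text = zero_block(text, len(text) // 2, 64)
surviving_bytes = sum(a == b for a, b in zip(text, damaged_text))

# Compressed copy: the same damage makes the stream fail to inflate.
damaged_packed = zero_block(packed, len(packed) // 2, 64)
try:
    zlib.decompress(damaged_packed)
    inflated_ok = True
except zlib.error:
    inflated_ok = False

print(surviving_bytes, "of", len(text), "uncompressed bytes survive")
print("compressed copy inflates:", inflated_ok)
```

A streaming decoder can usually still return the data that precedes the damaged block, so the loss applies to everything from the damage onward rather than to the whole file.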

Re: Kernel SCM saga..

2005-04-09 Thread Florian Weimer

* David Lang:

>> Databases supporting replication are called high end. You forgot
>> the cats dance around the network this issue involves.
>
> And Postgres (which is Free in all senses of the word) is high end by this 
> definition.

I'm not aware of *any* DBMS, commercial or not, which can perform
meaningful multi-master replication on tables which mainly consist of
text files as records.  All you can get is single-master replication
(which is well-understood), or some rather scary stuff which involves
throwing away updates, or taking extrema or averages (even automatic
3-way merges aren't available).

Re: Kernel SCM saga..

2005-04-09 Thread Ray Lee

On Sat, 2005-04-09 at 19:40 +0200, Roman Zippel wrote:
> On Sat, 9 Apr 2005, Eric D. Mudama wrote:
> > > For example bk does something like this:
> > > 
> > > A1 -> A2 -> A3 -> BM
> > >   \-> B1 -> B2 --^
> > > 
> > > and instead of creating the merge changeset, one could merge them like
> > > this:
> > > 
> > > A1 -> A2 -> A3 -> B1 -> B2

> > I believe that flattening the change graph makes history reproduction
> > impossible, or alternately, you are imposing on each developer to test
> > the merge results at B1 + A1..3 before submission, but in doing so,
> > the test time may require additional test periods etc and with
> > sufficient velocity, might never close.
> 
> The merge result has to be tested either way, so I'm not exactly sure, 
> what you're trying to say.

The kernel changes. A lot. And often.

With that in mind, if (for example) A2 and A3 are simple changes that
are quick to test and B1 is large, or complex, or requires hours (days,
weeks) of testing to validate, then a maintainer's decision can
legitimately be to rebase a tree (say, -mm) upon the B1 line of
development, and toss the A2 branch back to those developers with a
"Sorry it didn't work out, something here causes Unhappiness with B1,
can you track down the problem and try again?"

Ray


Re: Kernel SCM saga..

2005-04-09 Thread Marcin Dalecki

On 2005-04-09, at 17:42, Paul Jackson wrote:
> Marcin wrote:
>> But what will impress you are either the price tag the DB comes with or
>> the hardware it runs on :-)
>
> The payroll for the staffing to care and feed for these
> babies is often impressive as well.

Please don't forget the bill from the electric plant behind it!

[PATCH] Re: Kernel SCM saga..

2005-04-09 Thread Petr Baudis

Dear diary, on Sat, Apr 09, 2005 at 09:08:59AM CEST, I got a letter
where "Randy.Dunlap" <[EMAIL PROTECTED]> told me that...
> On Sat, 9 Apr 2005 04:53:57 +0200 Petr Baudis wrote:
..snip..
> |   FWIW, I made few small fixes (to prevent some trivial usage errors to
> | cause cache corruption) and added scripts gitcommit.sh, gitadd.sh and
> | gitlog.sh - heavily inspired by what already went through the mailing
> | list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
> | (including .dircache, even though it isn't shown in the index), the
> | cumulative patch can be found below. The scripts aim to provide some
> | (obviously very interim) more high-level interface for git.
> | 
> |   I'm now working on tree-diff.c which will (surprise!) produce a diff
> | of two trees (I'll finish it after I get some sleep, though), and then I
> | will probably do some dwimmy gitdiff.sh wrapper for tree-diff and
> | show-diff. At that point I might get my hand on some pull more kind to
> | local changes.
> 
> Hi,

  Hi,

> I'll look at your scripts this weekend.  I've also been
> working on some, but mine are a bit more experimental (cruder)
> than yours are.  Anyway, here they are (attached) -- also
> available at http://developer.osdl.org/rddunlap/git/
> 
> gitin : checkin/commit
> gitwhat sha1 : what is that sha1 file (type and contents if blob or commit)
> gitlist (blob, commit, tree, or all) :
>   list all objects with type (commit, tree, blob, or all)

  thanks - I had a look, but so far I borrowed only the prompt message
from your gitin. ;-) I'm not sure if gitwhat would be useful for me in
any way and gitlist doesn't appear too practical to me either.

  In the meantime, I've made some progress too. I made ls-tree, which
will just convert the tree object to a human readable (and script
processable) form, and wrapper gitls.sh, which will also try to guess
the tree ID. parent-id will just return the commit ID(s) of the previous
commit(s), practical if you want to diff against the previous commit
easily etc.  And finally, there is gitdiff.sh, which will produce a diff
of any two trees.

  Everything is again available at http://pasky.or.cz/~pasky/dev/git/
and again including .dircache, even though it's invisible in the index.
The cumulative patch (against 0.03) is there as well as below, generated
by the

./gitdiff.sh 0af20307bb4c634722af0f9203dac7b3222c4a4f

command. The empty entries are changed modes (664 vs 644), I will yet
have to think about how to denote them if the content didn't change;
or I might ignore them altogether...?

  You can obviously fetch any arbitrary change by doing the appropriate
gitdiff.sh call. You can find the ids in the ChangeLog, which was
generated by the plain

./gitlog.sh

command. (That is for HEAD. 0af20307bb4c634722af0f9203dac7b3222c4a4f is
the last commit on the Linus' branch, pass that to gitlog.sh to get his
ChangeLog. ;-)

  Next, I will probably do some bk-style pull tool. Or perhaps first
a gitpatch.sh which will verify the sha1s and do the mode changes.

  Linus, could you please have a look and tell me what do you think
about it so far?

  Thanks,

Petr Baudis

Index: Makefile
===
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/Makefile (mode:100664 sha1:270cd4f8a8bf10cd513b489c4aaf76c14d4504a7)
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/Makefile (mode:100644 sha1:185ff422e68984e68da011509dec116f05fc6f8d)
@@ -1,7 +1,7 @@
 CFLAGS=-g -O3 -Wall
 CC=gcc

-PROG=update-cache show-diff init-db write-tree read-tree commit-tree cat-file fsck-cache
+PROG=update-cache show-diff init-db write-tree read-tree commit-tree cat-file fsck-cache ls-tree

 all: $(PROG)

@@ -30,6 +30,9 @@
 cat-file: cat-file.o read-cache.o
$(CC) $(CFLAGS) -o cat-file cat-file.o read-cache.o $(LIBS)

+ls-tree: ls-tree.o read-cache.o
+   $(CC) $(CFLAGS) -o ls-tree ls-tree.o read-cache.o $(LIBS)
+
 fsck-cache: fsck-cache.o read-cache.o
$(CC) $(CFLAGS) -o fsck-cache fsck-cache.o read-cache.o $(LIBS)

Index: README
===
Index: cache.h
===
Index: cat-file.c
===
Index: commit-tree.c
===
Index: fsck-cache.c
===
Index: gitadd.sh
===
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/gitadd.sh
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/gitadd.sh (mode:100755 sha1:d23be758c0c9fc1cf9756bcd3ee4d7266c60a2c9)
@@ -0,0 +1,13 @@
+#!/bin/sh
+#
+# Add new file to a GIT repository.
+# Copyright (c) Petr Baudis, 2005
+#
+# Takes a list of file names at the command line, and schedules them
+# for addition to

Re: Kernel SCM saga..

2005-04-09 Thread Roman Zippel

Hi,

On Sat, 9 Apr 2005, Eric D. Mudama wrote:

> > For example bk does something like this:
> > 
> > A1 -> A2 -> A3 -> BM
> >   \-> B1 -> B2 --^
> > 
> > and instead of creating the merge changeset, one could merge them like
> > this:
> > 
> > A1 -> A2 -> A3 -> B1 -> B2
> > 
> > This results in a simpler repository, which is more scalable and which
> > is easier for users to work with (e.g. binary bug search).
> > The disadvantage would be it will cause more minor conflicts, when changes
> > are pulled back into the original tree, but which should be easily
> > resolvable most of the time.
> 
> The kicker comes that B1 was developed based on A1, so any test
> results were based on B1 being a single changeset delta away from A1. 
> If the resulting 'BM' fails testing, and you've converted into the
> linear model above where B2 has failed, you lose the ability to
> isolate B1's changes and where they came from, to revalidate the
> developer's results.

What good does it do if you can revalidate the original B1? The important 
point is that the end result works and if it only fails in the merged 
version you have a big problem. The serialized version gives you the 
chance to test whether it fails in B1 or B2.

> I believe that flattening the change graph makes history reproduction
> impossible, or alternately, you are imposing on each developer to test
> the merge results at B1 + A1..3 before submission, but in doing so,
> the test time may require additional test periods etc and with
> sufficient velocity, might never close.

The merge result has to be tested either way, so I'm not exactly sure, 
what you're trying to say.

bye, Roman

Re: Kernel SCM saga..

>  (b) while I depend on the fact that if the SHA of an object matches, the 
>  objects are the same, I generally try to avoid the reverse 
>  dependency.

It might be a valid point that you want to leave the door open to using
a different (than SHA1) digest.  (So this means you're going to store it
as an ASCII string, right?)

But I don't see how that applies here.  Any optimization that avoids
rereading old versions if the digests match will never trigger on the
day you change digests.  No problem here - you're doomed to reread the old
version in any case.

Either you got your logic backwards, or I need another cup of coffee.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Kernel SCM saga..

Linus wrote:
> In "git", you usually care about 
> the old contents too.

True - in your case, you probably want the old contents
so might as well dig them out as soon as it becomes
convenient to have them.

I was objecting to your claim that you _had_ to dig out
the old contents to determine if a file changed.

You don't _have_ to ... but I agree that it's a good
time to do so.
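
The distinction drawn here, that a change can be detected from the digest alone without reading the old contents, can be sketched as follows. The `stored_digest` bookkeeping is hypothetical; only `hashlib.sha1` is a real library call.

```python
import hashlib


def sha1_hex(content):
    """ASCII hexadecimal SHA1 of a byte string."""
    return hashlib.sha1(content).hexdigest()


def has_changed(stored_digest, current_content):
    """Compare against the recorded digest; the old contents are never read."""
    return sha1_hex(current_content) != stored_digest


old_version = b"int main(void) { return 0; }\n"
stored_digest = sha1_hex(old_version)  # recorded when the file was committed

print(has_changed(stored_digest, old_version))
print(has_changed(stored_digest, b"int main(void) { return 1; }\n"))
```

Producing a diff is a different matter: that does need the old contents, which is why fetching them at this point is convenient.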

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Kernel SCM saga..

2005-04-09 Thread Roman Zippel

Hi,

On Fri, 8 Apr 2005, Linus Torvalds wrote:

> Yes.  Per-file history is expensive in git, because of the way it is 
> indexed. Things are indexed by tree and by changeset, and there are no 
> per-file indexes.
> 
> You could create per-file _caches_ (*) on top of git if you wanted to make
> it behave more like a real SCM, but yes, it's all definitely optimized for
> the things that _I_ tend to care about, which is the whole-repository
> operations.

Per file history is also expensive for another reason. The basic reason is 
that I think that a hash based storage is not the best approach for SCM. 
It's lacking locality, so the more it grows the more it has to seek to 
collect all the data.
To reduce the space usage you could replace the parent file with a sha1 
reference + delta to the new file. This is basically what monotone does 
and might cause performance problems if you need to restore old versions 
(e.g. if you want to annotate a file).

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Kernel SCM saga..

Linus wrote:
> (you need to remember to escape '%' 
> too when you do that ;).

No - don't have to.  Not if I don't mind giving fools that embed
newlines in paths second class service.

In my case, if I create a file named "foo\nbar", then backup and restore
it, I end up with a restored file named "foo%0Abar".  If I had backed up
another file named "foo%0Abar", and now restore it, it collides, and
last one to be restored wins.  If I really need the "foo\nbar" file back
as originally named, I will have to dig it out by hand.

I dare say that Linux kernel source does not require first class support
for newlines embedded in pathnames.
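
The collision described above, and the '%' escaping that avoids it, can be sketched as below. The function names are illustrative; this is not code from any backup tool mentioned in the thread.

```python
def escape_newlines_only(path):
    """Scheme from the message above: only newlines are rewritten."""
    return path.replace("\n", "%0A")


def escape(path):
    """Also escape '%' (done first), so the mapping is reversible."""
    return path.replace("%", "%25").replace("\n", "%0A")


def unescape(name):
    """Inverse of escape()."""
    return name.replace("%0A", "\n").replace("%25", "%")


with_newline = "foo\nbar"
with_percent = "foo%0Abar"

# Two different names collapse into one under the simpler scheme.
print(escape_newlines_only(with_newline) == escape_newlines_only(with_percent))

# With '%' escaped too, the names stay distinct and round-trip.
print(escape(with_newline), escape(with_percent))
```

The simpler scheme trades that reversibility for not having to touch names that contain a literal '%'.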

> ASCII isn't magical.

No - but it's damn convenient.  A lot of tools work on line-oriented
ASCII that don't work elsewhere.

I guess Perl-hackers won't care much, but those working with either
classic shell script tools or Python will find line formatted ASCII more
convenient.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Kernel SCM saga..

2005-04-09 Thread Eric D. Mudama

On Apr 8, 2005 4:52 PM, Roman Zippel <[EMAIL PROTECTED]> wrote:
> The problem is you pay a price for this. There must be a reason developers
> were adding another GB of memory just to run BK.
> Preserving the complete merge history does indeed make repeated merges
> simpler, but it builds up complex meta data, which has to be managed
> forever. I doubt that this is really an advantage in the long term. I
> expect that we were better off serializing changesets in the main
> repository. For example bk does something like this:
> 
> A1 -> A2 -> A3 -> BM
>   \-> B1 -> B2 --^
> 
> and instead of creating the merge changeset, one could merge them like
> this:
> 
> A1 -> A2 -> A3 -> B1 -> B2
> 
> This results in a simpler repository, which is more scalable and which
> is easier for users to work with (e.g. binary bug search).
> The disadvantage would be it will cause more minor conflicts, when changes
> are pulled back into the original tree, but which should be easily
> resolvable most of the time.

The kicker comes that B1 was developed based on A1, so any test
results were based on B1 being a single changeset delta away from A1. 
If the resulting 'BM' fails testing, and you've converted into the
linear model above where B2 has failed, you lose the ability to
isolate B1's changes and where they came from, to revalidate the
developer's results.

With bugs and fixes that can be validated in a few hours, this may not
be a problem, but when chasing a bug that takes days or weeks to
manifest, that a developer swears they fixed, one has to be able to
reproduce their exact test environment.

I believe that flattening the change graph makes history reproduction
impossible, or alternately, you are imposing on each developer to test
the merge results at B1 + A1..3 before submission, but in doing so,
the test time may require additional test periods etc and with
sufficient velocity, might never close.  This is the problem CVS has
if you don't create micro branches for every single modification.
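
The difference between the two histories can be made concrete with a small sketch. The dictionaries map each changeset from the diagram to its parents; the representation and the `ancestors` helper are assumptions for illustration, not how bk stores anything.

```python
def ancestors(parents_of, changeset):
    """Set of all changesets that `changeset` was built on."""
    seen = set()
    stack = list(parents_of[changeset])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents_of[node])
    return seen


# Merge history: B1 branches from A1, BM joins both lines.
merged = {
    "A1": [], "A2": ["A1"], "A3": ["A2"],
    "B1": ["A1"], "B2": ["B1"], "BM": ["A3", "B2"],
}

# Serialized history: B1 is replayed on top of A3.
linear = {
    "A1": [], "A2": ["A1"], "A3": ["A2"],
    "B1": ["A3"], "B2": ["B1"],
}

print(sorted(ancestors(merged, "B1")))
print(sorted(ancestors(linear, "B1")))
```

In the serialized form, B1 sits on top of A2 and A3, so the state the developer actually tested (B1 on A1 alone) is no longer recorded anywhere in the history.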

--eric

Re: Kernel SCM saga..

2005-04-09 Thread Roman Zippel

Hi,

On Fri, 8 Apr 2005, Linus Torvalds wrote:

> Also, I suspect that BKCVS actually bothers to get more details out of a
> BK tree than I cared about. People have pestered Larry about it, so BKCVS
> exports a lot of the nitty-gritty (per-file comments etc) that just
> doesn't actually _matter_, but people whine about. Me, I don't care. My
> sparse-conversion just took the important parts.

As soon as you want to synchronize and merge two trees, you will know why 
this information does matter.
(/me looks closer at the sparse-conversion...)
It seems you exported the complete parent information and this is exactly 
the "nitty-gritty" I was "whining" about and which is not available via 
bkcvs or bkweb and it's the most crucial information to make the bk data 
useful outside of bk. Larry was previously very clear about this that he 
considers this proprietary bk meta data and anyone attempting to export 
this information is in violation with the free bk licence, so you indeed 
just took the important parts and this is/was explicitly verboten for 
normal bk users.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Kernel SCM saga..

2005-04-09 Thread Linus Torvalds

On Sat, 9 Apr 2005, Paul Jackson wrote:
>
> > in order to avoid having to worry about special characters
> > they are NUL-terminated)
> 
> Would this be a possible alternative - newline terminated (convert any
> newlines embedded in filenames to the 3 chars '%0A', and leave it as an
> exercise to the reader to de-convert them.)

Sure, you could obviously do escaping (you need to remember to escape '%' 
too when you do that ;).

However, whenever you do escaping, that means that you're already going to 
have to use a tool to unpack the dang thing. So you didn't actually win 
anything. I pretty much guarantee that my existing format is easier to 
unpack than your escaped format.

ASCII isn't magical.

This is "fsck_tree()", which walks the unpacked tree representation and 
checks that it looks sane and marks the sha1's it finds as being 
needed (so that you can do reachability analysis in a second pass). It's 
not exactly complicated:

static int fsck_tree(unsigned char *sha1, void *data, unsigned long size)
{
        while (size) {
                int len = 1+strlen(data);
                unsigned char *file_sha1 = data + len;
                char *path = strchr(data, ' ');
                if (size < len + 20 || !path)
                        return -1;
                data += len + 20;
                size -= len + 20;
                mark_needs_sha1(sha1, "blob", file_sha1);
        }
        return 0;
}

and there's one HUGE advantage to _not_ having escaping: sorting and
comparing.

If you escape things, you now have to decide how you sort filenames. Do
you sort them by the escaped representation, or by the "raw"  
representation? Do you always have to escape or unescape the name in order 
to sort it.
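
The sorting problem described above can be illustrated with a minimal sketch. It assumes the '%0A' / '%25' escaping proposed earlier in the thread; the helper names escape() and unescape() are hypothetical and not part of git. Two names that sort one way as raw strings sort the other way once escaped:

```python
def escape(name: str) -> str:
    """Percent-escape '%' first, then newline, so the result is one line."""
    return name.replace("%", "%25").replace("\n", "%0A")


def unescape(escaped: str) -> str:
    """Invert escape(): restore newlines, then literal percent signs."""
    return escaped.replace("%0A", "\n").replace("%25", "%")


names = ["a b", "a\nb"]

# Raw order: newline (0x0A) sorts before space (0x20).
raw_order = sorted(names)

# Escaped order: '%' (0x25) sorts after space (0x20), so the pair flips.
escaped_order = sorted(names, key=escape)

print(raw_order)
print(escaped_order)
```

Any tool that sorts or compares entries therefore has to pick one representation and stick to it, which is the extra decision the unescaped format avoids.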

So I like ASCII as much as the next guy, but it's not a religion. If there 
isn't any point to it, there isn't any point to it.

The biggest irritation I have with the "tree" format I chose is actually
not the name (which is trivial), it's the sha1 part. Almost everything
else keeps the sha1 in the ASCII hexadecimal representation, and I
should have done that here too. Why? Not because it's a  - hey, the 
binary representation is certainly denser and equivalent - but because an 
ASCII representation there would have allowed me to much more easily 
change the key format if I ever wanted to. Now it's very SHA1-specific.

Which I guess is fine - I don't really see any reason to change, and if I 
do change, I could always just re-generate the whole tree. But I think it 
would have been cleaner to have _that_ part in ASCII.
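
For readers who want to see the record layout that fsck_tree() walks, here is a rough Python sketch. It assumes only what the C code above implies: each entry is a NUL-terminated string containing a space, followed by 20 raw SHA1 bytes. Reading the string as "mode, space, path" and the name parse_tree() are assumptions made for illustration.

```python
def parse_tree(data: bytes):
    """Split an unpacked tree buffer into (mode, path, sha1_hex) tuples."""
    entries = []
    pos = 0
    while pos < len(data):
        nul = data.find(b"\0", pos)
        # Each record needs a terminating NUL plus 20 bytes of binary sha1.
        if nul < 0 or len(data) - (nul + 1) < 20:
            raise ValueError("truncated tree entry")
        mode, sep, path = data[pos:nul].partition(b" ")
        if not sep:
            raise ValueError("tree entry has no space separator")
        sha1 = data[nul + 1:nul + 21]
        entries.append((mode.decode("ascii"), path, sha1.hex()))
        pos = nul + 21  # skip the NUL and the 20 sha1 bytes
    return entries


sha_a = bytes(range(20))
sha_b = bytes(range(20, 40))
example = (b"100644 odd name\nwith newline\0" + sha_a
           + b"100755 run.sh\0" + sha_b)
for entry in parse_tree(example):
    print(entry)
```

Note that the path with an embedded space and newline needs no special handling, and that the fixed 20-byte field is exactly the SHA1-specific part the message regrets.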

Linus

Re: Kernel SCM saga..

2005-04-09 Thread David Roundy

On Thu, Apr 07, 2005 at 12:30:18PM +0200, Matthias Andree wrote:
> On Thu, 07 Apr 2005, Sergei Organov wrote:
> > darcs? 
> 
> Close. Some things:
> 
> 1. It's rather slow and quite CPU consuming and certainly I/O consuming
>    at times - I keep, to try it out, leafnode-2 in a DARCS repo, which
>    has a mere 20,000 lines in 140 files, with 1,436 changes so far, on a
>    RAID-1 with two 7200/min disk drives, with an Athlon XP 2500+ with
>    512 MB RAM. The repo has 1,700 files in 11.5 MB, the source itself
>    189 files in 1.8 MB.
> 
>    Example: darcs annotate nntpd.c takes 23 s. (2,660 lines, 60 kByte)
> 
>    The maintainer himself states that there's still optimization required.

Indeed, there's still a lot of optimization to be done.  I've recently made
some improvements which will reduce the memory use (and speed
things up) for a few of the worst-performing commands.  No improvement to
the initial record, but on the plus side, that's only done once.  But I was
able to cut down the memory used checking out a kernel repository to 500m.
(Which, sadly enough, is a major improvement.)

You would do much better if you recorded the initial state one directory at
a time, since it's the size of the largest changeset that determines the
memory use on checkout, but that's ugly.

> Getting DARCS up to the task would probably require some polishing, and
> should probably be discussed with the DARCS maintainer before making
> this decision.
> 
> Don't get me wrong, DARCS looks promising, but I'm not convinced it's
> ready for the linux kernel yet.

Indeed, I do believe that darcs has a way to go before it'll perform
acceptably on the kernel.  On the other hand, tar seems to perform
unacceptably slow on the kernel, so I'm not sure how slow is too slow.
Definitely input from interested kernel developers on which commands are
too slow would be welcome.
-- 
David Roundy
http://www.darcs.net

Re: Kernel SCM saga..

Linus wrote:
> If you want to have spaces
>  and newlines in your pathname, go wild.

So long as there is only one pathname in a record, you don't need
nul-terminators to allow spaces in the name.  The rest of the record
is well known, so the pathname is just whatever is left after chomping
off the rest of the record.

It's only the support for embedded newlines that forces you to use
nul-terminators.

Not worth it - in my view.  Rather, do just enough hackery that
such a pathname doesn't break you, even if it means not giving
full service to such names.
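
A small sketch of the "chomp off the rest of the record" idea, under stated assumptions: the field layout "mode sha1 path" and the function name parse_record() are hypothetical, chosen only to show that spaces in the path survive a line-oriented format while an embedded newline does not.

```python
def parse_record(line: str):
    """Parse one 'mode sha1 path' line; the path is whatever is left over."""
    mode, sha1_hex, path = line.rstrip("\n").split(" ", 2)
    return mode, sha1_hex, path


record = "100644 " + "ab" * 20 + " Documentation/my notes.txt\n"
print(parse_record(record))

# A path with an embedded newline spills into a second line, so a
# line-oriented reader sees two records where only one was written.
broken = "100644 " + "cd" * 20 + " bad\nname\n"
print(len(broken.splitlines()))
```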

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Kernel SCM saga..

2005-04-09 Thread Linus Torvalds

On Sat, 9 Apr 2005, Paul Jackson wrote:
> 
> I must be missing something here ...
> 
> If the stat shows a possible change, then you shouldn't have to open the
> original version to determine if it really changed - just compute the
> SHA1 of the new file, and see if that changed from the original SHA1.

Yes. However, I've got two reasons for this:

 (a) it may actually be cheaper to just unpack the compressed thing than
 it is to compute the sha, _especially_ since it's very likely that
 you have to do that anyway (ie if it turns out that they _are_
 different, you need the unpacked data to then look at the
 differences).

 So when you come from your backup angle, you only care about "has it 
 changed", and you'll do a backup. In "git", you usually care about 
 the old contents too.

 (b) while I depend on the fact that if the SHA of an object matches, the 
 objects are the same, I generally try to avoid the reverse 
 dependency. Why? Because if I end up changing the way I pack objects,
 and still want to work with old objects, I may end up in the 
 situation that two identical objects could get different object 
 names.

I don't actually know how valid a point "(b)" is, and I don't think it's 
likely, but imagine that SHA1 ends up being broken (*) and I decide that I 
want to pack new objects with a new-and-improved-SHA256 or something. Such 
a thing would obviously mean that you end up with lots of _duplicate_ data 
(any new data that is repackaged with the new name will now cause a new 
git object), but "duplicate" is better than "broken".

I don't actually guarantee that "git" could handle that right, but I've
been idly trying to avoid locking myself into the mindset that "file
equality has to mean name equality over the long run". So while the system 
right now works on the 1:1 "name" <-> "content" mapping, it's possible 
that it _could_ work with a more relaxed 1:n "content" -> "name" mapping.
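
To make the 1:n "content" -> "name" idea concrete, here is a toy sketch. It is not how git names objects; object_name() and the dictionary store are hypothetical, and the only point is that switching hash functions gives identical content a second name, which costs duplicate storage but breaks nothing.

```python
import hashlib


def object_name(content: bytes, algorithm: str = "sha1") -> str:
    """Name an object by a digest of its content (toy model)."""
    return hashlib.new(algorithm, content).hexdigest()


store = {}
content = b"int main(void) { return 0; }\n"

# The same bytes filed under an old-style and a new-style name.
for algorithm in ("sha1", "sha256"):
    store[object_name(content, algorithm)] = content

print(len(store), "names for", len(set(store.values())), "distinct content")
```

Equal names still imply equal content in this model; what is given up is only the reverse direction.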

But it's entirely possible that I'm being a git about this.

Linus 

(*) yeah, yeah, I know about the current theoretical case, and I don't
care. Not only is it theoretical, the way my objects are packed you'd have
to not just generate the same SHA1 for it, it would have to _also_ still
be a valid zlib object _and_ get the header to match the "type + length"  
of object part. IOW, the object validity checks are actually even stricter
than just "sha1 matches".
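
The layered validity check described in the footnote can be sketched roughly as follows. The assumptions are loud ones: the header is taken to be "type, space, decimal length, NUL" and the name is taken to be the SHA1 of the compressed bytes. Both are illustrative guesses consistent with the thread, not a statement of the actual on-disk format, and pack_object() / check_object() are hypothetical names.

```python
import hashlib
import zlib


def pack_object(obj_type: str, payload: bytes):
    """Return (name, packed) for a payload; the layout here is assumed."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    packed = zlib.compress(header + payload)
    return hashlib.sha1(packed).hexdigest(), packed


def check_object(name: str, packed: bytes):
    """Validate a packed object and return (type, payload)."""
    if hashlib.sha1(packed).hexdigest() != name:
        raise ValueError("sha1 does not match object name")
    try:
        raw = zlib.decompress(packed)
    except zlib.error as exc:
        raise ValueError("not a valid zlib stream") from exc
    header, sep, payload = raw.partition(b"\0")
    obj_type, space, length = header.partition(b" ")
    if not sep or not space or not length.isdigit():
        raise ValueError("malformed header")
    if int(length) != len(payload):
        raise ValueError("length in header does not match payload")
    return obj_type.decode("ascii"), payload


name, packed = pack_object("blob", b"hello\n")
print(check_object(name, packed))
```

A colliding input would have to pass all three checks at once, which is the point being made: the hash match alone is not the whole test.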

Re: Kernel SCM saga..

> in order to avoid having to worry about special characters
> they are NUL-terminated)

Would this be a possible alternative - newline terminated (convert any
newlines embedded in filenames to the 3 chars '%0A', and leave it as an
exercise to the reader to de-convert them.)

Line formatted ASCII files are really nice - worth pissing on embedded
newlines in paths to obtain.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Kernel SCM saga..

Marcin wrote:
> But what will impress you are either the price tag the 
> DB comes with or
> the hardware it runs on :-)

The payroll for the staffing to care and feed for these
babies is often impressive as well.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Kernel SCM saga..

Linus wrote:
> then git will open exactly _one_ 
> file (no searching, no messing around), which contains absolutely nothing 
> except for the compressed (and SHA1-signed) old contents of the file. It 
> obviously _has_ to do that, because in order to know whether you've 
> changed it, it needs to now compare it to the original.

I must be missing something here ...

If the stat shows a possible change, then you shouldn't have to open the
original version to determine if it really changed - just compute the
SHA1 of the new file, and see if that changed from the original SHA1.
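
The approach suggested here (trust the stat data when it matches, and hash only when it does not) could look roughly like the sketch below. The CacheEntry type, the choice of stat fields, and the function names are assumptions for illustration; this hashes the plain file contents and is not git's cache format.

```python
import hashlib
import os
from typing import NamedTuple


class CacheEntry(NamedTuple):
    stat_key: tuple
    sha1: str


def stat_key(path: str) -> tuple:
    """Cheap fingerprint taken from stat() alone."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ctime_ns,
            st.st_ino, st.st_dev, st.st_mode)


def file_sha1(path: str) -> str:
    """SHA1 of the file contents, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record(path: str) -> CacheEntry:
    return CacheEntry(stat_key(path), file_sha1(path))


def has_changed(path: str, entry: CacheEntry) -> bool:
    """Hash the file only when the stat data says it might have changed."""
    if stat_key(path) == entry.stat_key:
        return False
    return file_sha1(path) != entry.sha1
```

This answers "has it changed" without opening the stored original; it does not produce the old contents, which a diff would still need.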

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Kernel SCM saga..

Linus wrote:
> you need to reuse the same inode/dev numbers
> (again - I didn't worry about portability, and filesystems where those
> aren't stable are a "don't do that then") 

On filesystems that don't have a stable inode number, I use the md5sum
of the full (relative to mount point) pathname as the inode number. 

Since these same file systems (not surprisingly) lack hard links as
well, the pathname _is_ essentially the stable inode number.

Off-topic details ...

This is on my backup program, which does a full snapshot of my 90 GB
system, including some FAT file systems, in 6 or 7 minutes, plus time
proportional to actual changes.  I have given up finding a backup
program I can tolerate, and wrote my own.  It stores each md5sum unique
blob exactly once, but uses the same sort of tricks you describe to
detect changes from examining just the stat information so as to avoid
reading every damn byte on the disk.  It works with smb, fat, vfat,
ntfs, reiserfs, xfs, ext2/3, ...  A single manifest file, in plain
ascii, one file per line, captures a full snapshot, disk-to-disk, every
few hours.

This comment from my backup source explains more:

# Unfortunately, fat, vfat, smb, and ncpfs (Netware) file systems
# do not have unique disk-based persistent inode numbers.
# The kernel constructs transient inode numbers for inodes
# in its cache.  But after an umount and re-mount, the inode
# numbers are all different.  So we would end up recalculating
# the md5sums of all files in any such file systems.
#
# To avoid this, we keep track of which directories are on such
# file systems, and for files in any such directory, instead
# of using the inode value from stat'ing a file, we use the
# md5sum of its path as a pseudo-inode number.  This digest of
# a file's path has improved persistence over its transiently
# assigned inode number.  Fields 5,6,7 (files total, free and
# avail) happen to be zero on file systems (fat, vfat, smb,
# ...) with no real inodes, so we use this fallback means
# of getting a persistent pseudo-inode if a statvfs() call on
# its directory has fields 5,6,7 summing to zero:
#   sum(os.statvfs(dir)[5:8]) == 0
# We include that dir in the fat_directories set in this case.

fat_directories = set()  # set of directory paths on FAT file systems

# The Python statvfs() on Linux is a tad expensive - the
# glibc statvfs(2) code does several system calls, including
# scanning /proc/mounts and stat'ing its entries.  We need
# to know for each file whether it is on a "fat" file system
# (see above), but for efficiency we only statvfs at mount
# points, then propagate the file system type from there down.

mountpoints = [m.split()[1] for m in open("/proc/mounts")]
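
The fallback described in the comments above might be sketched like this. The function names are hypothetical, and turning the whole md5 digest into an integer is an assumption (the excerpt does not show how the digest is stored); the statvfs test mirrors the "fields 5,6,7 sum to zero" rule using the named attributes.

```python
import hashlib
import os


def has_no_real_inodes(directory: str) -> bool:
    """True when statvfs reports zero total, free and available inodes."""
    vfs = os.statvfs(directory)
    return vfs.f_files + vfs.f_ffree + vfs.f_favail == 0


def pseudo_inode(relative_path: str) -> int:
    """Stable stand-in for an inode number: the md5 of the path."""
    digest = hashlib.md5(os.fsencode(relative_path)).hexdigest()
    return int(digest, 16)


def inode_key(path: str, relative_path: str, on_fat: bool) -> int:
    """Use the real inode where it is persistent, else the path digest."""
    if on_fat:
        return pseudo_inode(relative_path)
    return os.stat(path).st_ino
```

Because such file systems have no hard links, one path maps to one file, so the path digest is as stable an identity as a real inode number would be.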

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: Kernel SCM saga..

2005-04-09 Thread Samium Gromoff

It seems that Tom Lord, the primary architect behind GNU Arch
has recently published an open letter to Linus Torvalds.

Because no open letter to Linus would be really open without an
accompanying reference post on lkml, here it is:

http://lists.seyza.com/pipermail/gnu-arch-dev/2005-April/001001.html

---
cheers,
   Samium Gromoff

Re: Kernel SCM saga..

2005-04-09 Thread Samium Gromoff

Ok, this was literally screaming for a rebuttal! :-)

> Arch isn't a sound example of software design. Quite contrary to the
> random notes posted by it's author the following issues did strike me
> the time I did evaluate it:

(Note that here you take a stab at the Arch design fundamentals, but
actually fail to substantiate it later)

> The application (tla) claims to have "intuitive" command names. However
> I didn't see that as given. Most of them where difficult to remember
> and appeared to be just infantile. I stopped looking further after I
> saw:

[ UI issues snipped, not really core design ]

Yes, some people perceive that there _are_ UI issues in Arch.
However, as strange as it may sound, some don't feel so.

> As an added bonus it relies on the applications named by accident
> patch and diff and installed on the host in question as well as few
> other as well to
> operate.

This is called modularity and code reuse.

And given that patch and diff are installed by default on all of the
relevant developer machines, I fail to see why it is by any
measure derogatory.

(and the rest you speak about is tar and gzip)

> Better don't waste your time with looking at Arch. Stick with patches
> you maintain by hand combined with some scripts containing a list of
> apply commands
> and you should be still more productive then when using Arch.

Sure, you should have come up with something more based than that! :-)

Now to the real design issues...

Globally unique, meaningful, symbolic revision names -- the core of the
Arch namespace.

"Stone simple" on-disk format to store things -- a hierarchy
of directories with textual files and tarballs.

No smart server -- any sftp, ftp, webdav (or just http for read-only access)
server is exactly up to the task.

O(0) branching -- a branc

Re: Kernel SCM saga..

2005-04-09 Thread Neil Brown

On Saturday April 9, [EMAIL PROTECTED] wrote:
> On Sat, Apr 09, 2005 at 05:47:08PM +1000, Neil Brown wrote:
> > On Saturday April 9, [EMAIL PROTECTED] wrote:
> > > 
> > > I've just checked, it takes 5.7s to compare 2.4.29{,-hf3} over NFS (13300
> > > files each) and 1.3s once the trees are cached locally. This is without
> > > comparing file contents, just meta-data. And it takes 19.33s to compare
> > > the file's md5 sums once the trees are cached. I don't know if there are
> > > ways to avoid some NFS operations when everything is cached.
> > > 
> > > Anyway, the system does not seem much efficient on hard links, it caches
> > > the files twice :-(
> > 
> > I suspect you'll be wanting to add a "no_subtree_check" export option
> > on your NFS server...
> 
> Thanks a lot, Neil ! This is very valuable information. I didn't
> understand such implications from the exports(5) man page, but it
> makes a great difference. And the diff sped up from 5.7 to 3.9s
> and from 19.3 to 15.3s.

No, that implication had never really occurred to me before either.
But when you said "caches the file twice" it suddenly made sense.
With subtree_check, the NFS file handle contains information about the
directory, and NFS uses the filehandle as the primary key to tell if
two things are the same or not.

Trond keeps prodding me to make no_subtree_check the default.  Maybe it
is time that I actually did

NeilBrown

Re: Kernel SCM saga..

2005-04-09 Thread Jan Hudec

On Sat, Apr 09, 2005 at 03:01:29 +0200, Marcin Dalecki wrote:
> 
> On 2005-04-07, at 09:44, Jan Hudec wrote:
> >
> >I have looked at most systems currently available. I would suggest
> >following for closer look on:
> >
> >1) GNU Arch/Bazaar. They use the same archive format, simple, have the
> >   concepts right. It may need some scripts or add ons. When Bazaar-NG
> >   is ready, it will be able to read the GNU Arch/Bazaar archives so
> >   switching should be easy.
> 
> Arch isn't a sound example of software design. Quite contrary to the 

I actually _do_ agree with you. I like Arch, but its user interface
certainly is broken and some parts of it sure need some redesign.

> random notes posted by it's author the following issues did strike me 
> the time I did evaluate it:
> 
> The application (tla) claims to have "intuitive" command names. However
> I didn't see that as given. Most of them where difficult to remember
> and appeared to be just infantile. I stopped looking further after I 
> saw:
> 
> tla my-id instead of: tla user-id or oeven tla set id ...
> 
> tla make-archive instead of tla init

In this case, tla init would be a lot *worse*, because there are two
different things to initialize -- the archive and the tree. But
init-archive would be a little better, for consistency.

> tla my-default-archive [EMAIL PROTECTED]

This one is kinda broken. Even in concept it is.

> No more "My Compuer" please...
> 
> Repository addressing requires you to use informally defined
> very elaborated and typing error prone conventions:
> 
> mkdir ~/{archives}

*NO*. Using this name is STRONGLY recommended *AGAINST*. Tom once used
it in an example or in some of his archives and people started doing it,
but it's a complete bogosity and it is not required anywhere.

> tla make-archive [EMAIL PROTECTED] 
> ~/{archives}/2005-VersionPatrol
> 
> You notice the requirement for two commands to accomplish a single task 
> already well denoted by the second command? There is more of the same
> at quite a few places when you try to use it. You notice the triple
> zero it didn't catch?

I sure do. But the folks writing Bazaar are gradually fixing these.
There are a lot of them and it's not that long since they started, so
they did not fix all of them yet, but I think they eventually will.

> As an added bonus it relies on the applications named by accident
> patch and diff and installed on the host in question as well as few 
> other as well to
> operate.

No. The build process actually checks that the diff and patch
applications are actually the GNU Diff and GNU Patch in sufficiently
recent version. It was not always the case, but now it is.

> Better don't waste your time with looking at Arch. Stick with patches
> you maintain by hand combined with some scripts containing a list of 
> apply commands
> and you should be still more productive then when using Arch.

I don't agree with you. Using Arch is more productive (eg. because it
does merges), but certainly one could do a lot better than Arch does.

---
 Jan 'Bulb' Hudec <[EMAIL PROTECTED]>


Re: Kernel SCM saga..

2005-04-09 Thread Willy Tarreau

On Sat, Apr 09, 2005 at 05:47:08PM +1000, Neil Brown wrote:
> On Saturday April 9, [EMAIL PROTECTED] wrote:
> > 
> > I've just checked, it takes 5.7s to compare 2.4.29{,-hf3} over NFS (13300
> > files each) and 1.3s once the trees are cached locally. This is without
> > comparing file contents, just meta-data. And it takes 19.33s to compare
> > the file's md5 sums once the trees are cached. I don't know if there are
> > ways to avoid some NFS operations when everything is cached.
> > 
> > Anyway, the system does not seem much efficient on hard links, it caches
> > the files twice :-(
> 
> I suspect you'll be wanting to add a "no_subtree_check" export option
> on your NFS server...

Thanks a lot, Neil ! This is very valuable information. I didn't
understand such implications from the exports(5) man page, but it
makes a great difference. And the diff sped up from 5.7 to 3.9s
and from 19.3 to 15.3s.

Cheers,
Willy


Re: Kernel SCM saga..

2005-04-09 Thread Neil Brown

On Saturday April 9, [EMAIL PROTECTED] wrote:
> 
> I've just checked, it takes 5.7s to compare 2.4.29{,-hf3} over NFS (13300
> files each) and 1.3s once the trees are cached locally. This is without
> comparing file contents, just meta-data. And it takes 19.33s to compare
> the file's md5 sums once the trees are cached. I don't know if there are
> ways to avoid some NFS operations when everything is cached.
> 
> Anyway, the system does not seem much efficient on hard links, it caches
> the files twice :-(

I suspect you'll be wanting to add a "no_subtree_check" export option
on your NFS server...

NeilBrown

Re: Kernel SCM saga..

2005-04-09 Thread Willy Tarreau

On Fri, Apr 08, 2005 at 11:56:09AM -0700, Chris Wedgwood wrote:
> On Fri, Apr 08, 2005 at 11:47:10AM -0700, Linus Torvalds wrote:
> 
> > Don't use NFS for development. It sucks for BK too.
> 
> Some times NFS is unavoidable.
> 
> In the best case (see previous email wrt to only stat'ing the parent
> directories when you can) for a current kernel though you can get away
> with 894 stats --- over NFS that would probably be tolerable.
> 
> After claiming such an optimization is probably not worth while I'm
> now thinking for network filesystems it might be.

I've just checked, it takes 5.7s to compare 2.4.29{,-hf3} over NFS (13300
files each) and 1.3s once the trees are cached locally. This is without
comparing file contents, just meta-data. And it takes 19.33s to compare
the file's md5 sums once the trees are cached. I don't know if there are
ways to avoid some NFS operations when everything is cached.

Anyway, the system does not seem much efficient on hard links, it caches
the files twice :-(

Willy


Re: Kernel SCM saga..

2005-04-09 Thread Willy Tarreau

On Fri, Apr 08, 2005 at 12:03:49PM -0700, Linus Torvalds wrote:

> And if you do actively malicious things in your own directory, you get
> what you deserve. It's actually _hard_ to try to fool git into believing a
> file hasn't changed: you need to not only replace it with the exact same
> file length and ctime/mtime, you need to reuse the same inode/dev numbers
> (again - I didn't worry about portability, and filesystems where those
> aren't stable are a "don't do that then") and keep the mode the same. Oh,
> and uid/gid, but that was much me being silly.

It would be even easier to touch the tree with a known date before
patching (eg:1/1/70). It would protect against any accidental date
change if for any reason your system time went backwards while
working on the tree.

Another trick I use when I build the 2.4-hf patches is to build a
list of filenames from the patches. It works only because I want
to keep all original patches and no change should appear outside
those patches. Using this + cp -al + diff -pruN makes the process
very fast. It would not work if I had to rebuild those patches from
hand-edited files of course.

Last but not least, it only takes 0.26 seconds on my dual athlon
1800 to find date/size changes between 2.6.11{,.7} and 4.7s if the
tool includes the md5 sum in its checks :

$ time flx check --ignore-owner --ignore-mode --ignore-ldate --ignore-dir \
  --ignore-dot --only-new --ignore-sum linux-2.6.11/. linux-2.6.11.7/. |wc -l
 47

real    0m0.255s
user    0m0.094s
sys     0m0.162s

$ time flx check --ignore-owner --ignore-mode --ignore-ldate --ignore-dir \
  --ignore-dot --only-new linux-2.6.11/. linux-2.6.11.7/. |wc -l
 47

real    0m4.705s
user    0m3.398s
sys     0m1.310s

(This was with 'flx', a tool a friend developed for file-system integrity
checking which we also use to build our packages). Anyway, what I wanted
to show is that once the trees are cached, even somewhat heavy operations
such as checksumming can be done occasionally (such as md5 for double
checking) without you waiting too long. And I don't think that a database
would provide all the comfort of a standard file-system (cp -al, rsync,
choice of tools, etc...).
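
As a rough illustration of why a 'cp -al' copy makes metadata-only comparison cheap, here is a sketch that skips files still hard-linked between two trees and flags the rest by size and mtime. It is not how 'flx' works (its internals are not described here); changed_files() is a hypothetical name, and files removed from the new tree are deliberately not reported.

```python
import os


def changed_files(old_root: str, new_root: str):
    """List paths under new_root that differ from old_root by metadata."""
    changed = []
    for dirpath, _dirs, files in os.walk(new_root):
        for name in files:
            new_path = os.path.join(dirpath, name)
            rel = os.path.relpath(new_path, new_root)
            new_st = os.lstat(new_path)
            try:
                old_st = os.lstat(os.path.join(old_root, rel))
            except FileNotFoundError:
                changed.append(rel)  # file only exists in the new tree
                continue
            # Still hard-linked: same inode on the same device, so untouched.
            if (old_st.st_dev, old_st.st_ino) == (new_st.st_dev, new_st.st_ino):
                continue
            if (old_st.st_size, old_st.st_mtime_ns) != (
                    new_st.st_size, new_st.st_mtime_ns):
                changed.append(rel)
    return sorted(changed)
```

Only stat data is read, so once both trees are in the cache the walk costs no file I/O at all; checksumming can then be reserved for the occasional double check.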

Willy


Re: Kernel SCM saga..

2005-04-09 Thread Randy.Dunlap

On Sat, 9 Apr 2005 04:53:57 +0200 Petr Baudis wrote:

|   Hello,
| 
| Dear diary, on Fri, Apr 08, 2005 at 05:50:21PM CEST, I got a letter
| where Linus Torvalds <[EMAIL PROTECTED]> told me that...
| > 
| > 
| > On Fri, 8 Apr 2005 [EMAIL PROTECTED] wrote:
| > > 
| > > Here's a partial solution.  It does depend on a modified version of
| > > cat-file that behaves like cat.  I found it easier to have cat-file
| > > just dump the object indicated on stdout.  Trivial patch for that is
| > > included.
| > 
| > Your trivial patch is trivially incorrect, though. First off, some files
| > may be binary (and definitely are - the "tree" type object contains
| > pathnames, and in order to avoid having to worry about special characters
| > they are NUL-terminated), and your modified "cat-file" breaks that.  
| > 
| > Secondly, it doesn't check or print the tag.
| 
|   FWIW, I made few small fixes (to prevent some trivial usage errors to
| cause cache corruption) and added scripts gitcommit.sh, gitadd.sh and
| gitlog.sh - heavily inspired by what already went through the mailing
| list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
| (including .dircache, even though it isn't shown in the index), the
| cumulative patch can be found below. The scripts aim to provide some
| (obviously very interim) more high-level interface for git.
| 
|   I'm now working on tree-diff.c which will (surprise!) produce a diff
| of two trees (I'll finish it after I get some sleep, though), and then I
| will probably do some dwimmy gitdiff.sh wrapper for tree-diff and
| show-diff. At that point I might get my hand on some pull more kind to
| local changes.

Hi,

I'll look at your scripts this weekend.  I've also been
working on some, but mine are a bit more experimental (cruder)
than yours are.  Anyway, here they are (attached) -- also
available at http://developer.osdl.org/rddunlap/git/

gitin : checkin/commit
gitwhat sha1 : what is that sha1 file (type and contents if blob or commit)
gitlist (blob, commit, tree, or all) :
list all objects with type (commit, tree, blob, or all)

---
~Randy



Re: Kernel SCM saga..

On Sat, 9 Apr 2005, Andrea Arcangeli wrote:
> 
> I'm not entirely convinced wget is going to be an efficient way to
> synchronize and fetch your tree

I don't think it's efficient per se, but I think it's important that 
people can just "pass the files along". Ie it's a huge benefit if any 
everyday mirror script (whether rsync, wget, homebrew or whatever) will 
just automatically do the right thing. 

> Perhaps that's why you were compressing the stuff too? It sounds better
> not to compress the stuff on-disk

I much prefer to waste some CPU time to save disk cache. Especially since 
the compression is "free" if you do it early on (ie it's done only once, 
since the files are stable). Also, if the difference is a 1.5GB kernel 
repository or a 3GB kernel repository, I know which one I'll pick ;)

Also, I don't want people editing repository files by hand. Sure, the 
sha1 catches it, but still... I'd rather force the low-level ops to use 
the proper helper routines. Which is why it's a raw zlib compressed blob, 
not a gzipped file.
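
The difference between a raw zlib stream and a gzip file is easy to see with Python's standard zlib and gzip modules. This is only an illustration of the two container formats, not of git's object layout; the helper names are hypothetical.

```python
import gzip
import zlib

data = b"some stable file contents\n" * 10

raw_zlib = zlib.compress(data)   # zlib container (RFC 1950)
gzipped = gzip.compress(data)    # gzip container (RFC 1952)


def looks_like_gzip(blob: bytes) -> bool:
    """gzip files start with the magic bytes 0x1f 0x8b."""
    return blob[:2] == b"\x1f\x8b"


def gzip_can_read(blob: bytes) -> bool:
    """Try to decode the blob as a gzip file."""
    try:
        gzip.decompress(blob)
        return True
    except (OSError, EOFError, zlib.error):
        return False


print(looks_like_gzip(raw_zlib), gzip_can_read(raw_zlib))
print(looks_like_gzip(gzipped), gzip_can_read(gzipped))
```

A raw zlib blob is therefore not something gunzip-style tools will open and rewrite casually, which fits the stated goal of steering people toward the helper routines.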

Linus

Re: Kernel SCM saga..

2005-04-08 Thread Walter Landry

Linus Torvalds wrote:
> Which is why I'd love to hear from people who have actually used
> various SCM's with the kernel. There's bound to be people who have
> already tried.

At the end of my Codecon talk, there is a performance comparison of a
number of different distributed SCM's with the kernel.

  http://superbeast.ucsd.edu/~landry/ArX/codecon/codecon.html

I develop ArX (http://www.nongnu.org/arx).  You may find it of
interest ;)

Cheers,
Walter Landry
[EMAIL PROTECTED]

Re: Kernel SCM saga..

2005-04-08 Thread Andrea Arcangeli

On Fri, Apr 08, 2005 at 11:08:58PM -0400, Brian Gerst wrote:
> It's my understanding that the files don't change.  Only new ones are 
> created for each revision.

I said diff between the trees, not diff between files ;). When you fetch
the new changes with rsync, it'll compress better and in turn it'll be
faster (assuming we're network bound and I am with 1mbit and 2.5ghz
cpu), if it's rsync applying gzip to the big "combined diff between
trees" instead of us compressing every single small file on disk, that
won't compress anymore inside rsync.
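
The size argument being made here (one big stream compresses better than many tiny files compressed separately) can be checked with a toy experiment. The synthetic "files" below are an assumption for illustration and say nothing about real kernel sources.

```python
import zlib

# 200 tiny, similar files, standing in for many small objects.
files = [
    f"static int handler_{i}(void) {{ return {i}; }}\n".encode("ascii")
    for i in range(200)
]

# Every file compressed on its own, as with per-object compression on disk.
per_file = sum(len(zlib.compress(blob)) for blob in files)

# The same bytes compressed as one stream, as a transfer protocol could do.
combined = len(zlib.compress(b"".join(files)))

print("per-file total:", per_file)
print("combined:", combined)
```

The combined stream wins on bytes transferred because redundancy across files becomes visible to the compressor; the counter-argument elsewhere in the thread is about disk cache and CPU, which this sketch does not measure.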

Re: Kernel SCM saga..

2005-04-08 Thread Brian Gerst

Andrea Arcangeli wrote:
> On Fri, Apr 08, 2005 at 05:12:49PM -0700, Linus Torvalds wrote:
> > really designed for something like a offline http grabber, in that you can
> > just grab files purely by filename (and verify that you got them right by
> > running sha1sum on the resulting local copy). So think "wget".
>
> I'm not entirely convinced wget is going to be an efficient way to
> synchronize and fetch your tree, its simplicity is great though. It's a
> tradeoff between optimizing and re-using existing tools (like webservers).
> Perhaps that's why you were compressing the stuff too? It sounds better
> not to compress the stuff on-disk, and to synchronize with a rsync-like
> protocol (rsync server would make it) that handles the compression in
> the network protocol itself, and in turn that can apply compression to a
> large blob (i.e. the diff between the trees), and not to the single tiny
> files.

It's my understanding that the files don't change.  Only new ones are 
created for each revision.

--
Brian gErst 

Re: Re: Kernel SCM saga..

2005-04-08 Thread Petr Baudis

  Hello,

Dear diary, on Fri, Apr 08, 2005 at 05:50:21PM CEST, I got a letter
where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> 
> 
> On Fri, 8 Apr 2005 [EMAIL PROTECTED] wrote:
> > 
> > Here's a partial solution.  It does depend on a modified version of
> > cat-file that behaves like cat.  I found it easier to have cat-file
> > just dump the object indicated on stdout.  Trivial patch for that is 
> > included.
> 
> Your trivial patch is trivially incorrect, though. First off, some files
> may be binary (and definitely are - the "tree" type object contains
> pathnames, and in order to avoid having to worry about special characters
> they are NUL-terminated), and your modified "cat-file" breaks that.  
> 
> Secondly, it doesn't check or print the tag.

  FWIW, I made a few small fixes (to prevent some trivial usage errors
from causing cache corruption) and added the scripts gitcommit.sh,
gitadd.sh and gitlog.sh, heavily inspired by what already went through
the mailing list. Everything is available at
http://pasky.or.cz/~pasky/dev/git/ (including .dircache, even though it
isn't shown in the index); the cumulative patch can be found below. The
scripts aim to provide a (obviously very interim) higher-level interface
for git.

  I'm now working on tree-diff.c which will (surprise!) produce a diff
of two trees (I'll finish it after I get some sleep, though), and then I
will probably do some dwimmy gitdiff.sh wrapper for tree-diff and
show-diff. At that point I might get my hands on a pull that is kinder
to local changes.

  Kind regards,
Petr Baudis

diff -ruN git-0.03/gitadd.sh git-devel-clean/gitadd.sh
--- git-0.03/gitadd.sh  1970-01-01 01:00:00.0 +0100
+++ git-devel-clean/gitadd.sh   2005-04-09 03:17:34.220577000 +0200
@@ -0,0 +1,13 @@
+#!/bin/sh
+#
+# Add new file to a GIT repository.
+# Copyright (c) Petr Baudis, 2005
+#
+# Takes a list of file names at the command line, and schedules them
+# for addition to the GIT repository at the next commit.
+#
+# FIXME: Those files are omitted from show-diff output!
+
+for file in "$@"; do
+   echo $file >>.dircache/add-queue
+done
diff -ruN git-0.03/gitcommit.sh git-devel-clean/gitcommit.sh
--- git-0.03/gitcommit.sh   1970-01-01 01:00:00.0 +0100
+++ git-devel-clean/gitcommit.sh   2005-04-09 03:17:34.220577000 +0200
@@ -0,0 +1,36 @@
+#!/bin/sh
+#
+# Commit into a GIT repository.
+# Copyright (c) Petr Baudis, 2005
+# Based on an example script fragment sent to LKML by Linus Torvalds.
+#
+# Ignores any parameters for now, expects changelog entry on stdin.
+#
+# FIXME: Gets it wrong for filenames containing spaces.
+
+
+if [ -r .dircache/add-queue ]; then
+   mv .dircache/add-queue .dircache/add-queue-progress
+   addedfiles=$(cat .dircache/add-queue-progress)
+else
+   addedfiles=
+fi
+changedfiles=$(show-diff -s | grep -v ': ok$' | cut -d : -f 1)
+commitfiles="$addedfiles $changedfiles"
+if [ ! "$commitfiles" ]; then
+   echo 'Nothing to commit.' >&2
+   exit
+fi
+update-cache $commitfiles
+rm -f .dircache/add-queue-progress
+
+
+oldhead=$(cat .dircache/HEAD)
+treeid=$(write-tree)
+newhead=$(commit-tree $treeid -p $oldhead)
+
+if [ "$newhead" ]; then
+   echo $newhead >.dircache/HEAD
+else
+   echo "Error during commit (oldhead $oldhead, treeid $treeid)" >&2
+fi
diff -ruN git-0.03/gitlog.sh git-devel-clean/gitlog.sh
--- git-0.03/gitlog.sh  1970-01-01 01:00:00.0 +0100
+++ git-devel-clean/gitlog.sh   2005-04-09 04:28:51.227791000 +0200
@@ -0,0 +1,61 @@
+#!/bin/sh
+
+# Call this script with an object and it will produce the change
+# information for all the parents of that object
+#
+# This script was originally written by Ross Vandegrift.
+# multiple parents test 1d0f4aec21e5b66c441213643426c770dc6dedc0
+# parents: ffa098b2e187b71b86a76d3cd5eb77d074a2503c
+# 6860e0d9197c7f52155466c225baf39b42d62f63
+
+# regex for parent declarations
+PARENTS="^parent [0-9a-f]{40}$"
+
+TMPCL="/tmp/gitlog.$$"
+
+# takes an object and generates the object's parent(s)
+function unpack_parents () {
+   echo "me $1"
+   echo "me $1" >>$TMPCL
+   RENTS=""
+
+   TMPCM=$(mktemp)
+   cat-file commit $1 >$TMPCM
+   while read line; do
+      if echo "$line" | egrep -q "$PARENTS"; then
+         RENTS="$RENTS "$(echo $line | sed 's/parent //g')
+      fi
+      echo $line
+   done <$TMPCM
+   rm $TMPCM
+
+   echo -e "\n--\n"
+
+   # if the last object had no parents, return
+   if [ ! "$RENTS" ]; then
+      return;
+   fi
+
+   #useful for testing
+   #echo $RENTS
+   #read
+   for i in `echo $RENTS`; do
+      # break cycles
+      if grep -q "me $i" $TMPCL; then
+         echo "Already visited $i" >&2
+         continue
+      else
+         unpack_parents $i
+      fi

Re: Kernel SCM saga..

2005-04-08 Thread Andrea Arcangeli

On Fri, Apr 08, 2005 at 07:38:30PM -0400, Daniel Phillips wrote:
> For the immediate future, all we need is something that can _losslessly_ 
> capture the new metadata that's being generated.  That buys time to bring one 
> of the promising open source candidates up to full speed.

Agreed.

Re: Kernel SCM saga..

2005-04-08 Thread David Lang

On Sat, 9 Apr 2005, Andrea Arcangeli wrote:

> On Fri, Apr 08, 2005 at 05:12:49PM -0700, Linus Torvalds wrote:
> > really designed for something like an offline http grabber, in that you can
> > just grab files purely by filename (and verify that you got them right by
> > running sha1sum on the resulting local copy). So think "wget".
>
> I'm not entirely convinced wget is going to be an efficient way to
> synchronize and fetch your tree, its simplicity is great though. It's a
> tradeoff between optimizing and re-using existing tools (like webservers).
> Perhaps that's why you were compressing the stuff too? It sounds better
> not to compress the stuff on-disk, and to synchronize with a rsync-like
> protocol (rsync server would make it) that handles the compression in
> the network protocol itself, and in turn that can apply compression to a
> large blob (i.e. the diff between the trees), and not to the single tiny
> files.

Note that many webservers will compress the data for you on the fly as
well, so there's even less need to have it pre-compressed.

David Lang
--
There are two ways of constructing a software design. One way is to make it so 
simple that there are obviously no deficiencies. And the other way is to make 
it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare

Re: Kernel SCM saga..

2005-04-08 Thread Andrea Arcangeli

On Fri, Apr 08, 2005 at 05:12:49PM -0700, Linus Torvalds wrote:
> really designed for something like an offline http grabber, in that you can 
> just grab files purely by filename (and verify that you got them right by 
> running sha1sum on the resulting local copy). So think "wget".

I'm not entirely convinced wget is going to be an efficient way to
synchronize and fetch your tree, its simplicity is great though. It's a
tradeoff between optimizing and re-using existing tools (like webservers).
Perhaps that's why you were compressing the stuff too? It sounds better
not to compress the stuff on-disk, and to synchronize with a rsync-like
protocol (rsync server would make it) that handles the compression in
the network protocol itself, and in turn that can apply compression to a
large blob (i.e. the diff between the trees), and not to the single tiny
files.

Re: Kernel SCM saga..

2005-04-08 Thread David Lang

On Sat, 9 Apr 2005, Marcin Dalecki wrote:
> On 2005-04-08, at 20:28, Jon Smirl wrote:
> > On Apr 8, 2005 2:14 PM, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > > How do you replicate your database incrementally? I've given you enough
> > > clues to do it for "git" in probably five lines of perl.
> >
> > Efficient database replication is achieved by copying the transaction
> > logs and then replaying them. Most mid to high end databases support
> > this. You only need to copy the parts of the logs that you don't
> > already have.
>
> Databases supporting replication are called high end. You forgot the cats
> dance around the network this issue involves.

And Postgres (which is Free in all senses of the word) is high end by this
definition.

I'm not saying that it's an efficient thing to use for this task, but
don't be fooled into thinking you need something at the price of Oracle to
do this job.
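
For readers comparing the two replication styles in the quoted exchange, here is a minimal sketch of both. Everything in it is hypothetical: the log is a plain list of (key, value) writes, the replica is assumed to hold a prefix of the primary's log, and the object names are made-up short strings. The second helper is only one plausible reading of the "five lines of perl" remark, relying on objects being immutable files named by their hash.

```python
def missing_log_entries(primary_log, applied_count):
    """Log shipping: return the entries the replica has not applied yet."""
    return primary_log[applied_count:]


def replay(state, entries):
    """Apply (key, value) writes to a dict-based state, in log order."""
    for key, value in entries:
        state[key] = value


def missing_objects(remote_names, local_names):
    """Object-store sync: names present remotely but absent locally."""
    return sorted(set(remote_names) - set(local_names))


primary_log = [("a", 1), ("b", 2), ("a", 3)]
replica_state = {}
replay(replica_state, primary_log[:1])      # replica has applied one entry
to_ship = missing_log_entries(primary_log, 1)
replay(replica_state, to_ship)              # replica is now caught up

to_fetch = missing_objects(["1f", "2a", "9c"], ["2a"])
print(to_ship, replica_state, to_fetch)
```

In the first style the order of entries matters and must be preserved; in the second, objects never change once written, so the missing ones can be copied in any order.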

David Lang
--
There are two ways of constructing a software design. One way is to make it so 
simple that there are obviously no deficiencies. And the other way is to make 
it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare

Re: Kernel SCM saga..

2005-04-08 Thread Tupshin Harper

Roman Zippel wrote:

> Please show me how you would do a binary search with arch.
>
> I don't really like the arch model, it's far too restrictive and it's 
> jumping through hoops to get to an acceptable speed.
> What I expect from a SCM is that it maintains both a version index of the 
> directory structure and a version index of the individual files. Arch 
> makes it especially painful to extract this data quickly. For the common 
> cases it throws disk space at the problem and does a lot of caching, but 
> there are still enough problems (e.g. annotate), which require scanning of 
> lots of tarballs.
>
> bye, Roman

I'm not going to defend or attack arch since I haven't used it enough. I 
will say that darcs largely does suffer from the same problem that you 
describe since its fundamental unit of storage is individual patches 
(though it avoids the tarball issue). This is why David Roundy has 
indicated his intention of eventually having a per-file cache:
http://kerneltrap.org/mailarchive/1/message/24317/flat

You could then make the argument that if you have a per-file 
representation of the history, why do you also need/want a per-patch 
representation as the canonical format, but that's been argued plenty on 
both the darcs and arch mailing lists and probably isn't worth going 
into here.

-Tupshin

Re: Kernel SCM saga..

On 2005-04-09, at 03:09, Chris Wedgwood wrote:
> On Sat, Apr 09, 2005 at 03:00:44AM +0200, Marcin Dalecki wrote:
> > Yes it sucks less for this purpose. See subversion as reference.
>
> Whatever solution people come up with, ideally it should be tolerant
> to minor amounts of corruption (so I can recover the rest of my data
> if need be) and it should also have decent sanity checks to find
> corruption as soon as reasonably possible.

Yes, this is the reason subversion is moving toward an alternative back-end
based on a custom DB mapped closely to the file system.


Re: Kernel SCM saga..

On 2005-04-08, at 20:28, Jon Smirl wrote:
> On Apr 8, 2005 2:14 PM, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > How do you replicate your database incrementally? I've given you enough
> > clues to do it for "git" in probably five lines of perl.
>
> Efficient database replication is achieved by copying the transaction
> logs and then replaying them. Most mid to high end databases support
> this. You only need to copy the parts of the logs that you don't
> already have.

Databases supporting replication are called high end. You forgot the cats
dance around the network this issue involves.


Re: Kernel SCM saga..

On Sat, Apr 09, 2005 at 03:00:44AM +0200, Marcin Dalecki wrote:

> Yes it sucks less for this purpose. See subversion as reference.

Whatever solution people come up with, ideally it should be tolerant
to minor amounts of corruption (so I can recover the rest of my data
if need be) and it should also have decent sanity checks to find
corruption as soon as reasonably possible.

I've been bitten by problems that subversion didn't catch but bk did.
In the subversion case by the time I noticed much data was lost and
none of the subversion tools were able to recover the rest of it.

In the bk case, the data-loss was almost immediately noticeable and
only affected a few files making recovery much easier.

Re: Kernel SCM saga..

On 2005-04-08, at 20:14, Linus Torvalds wrote:
> On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
> > Ok, but if you want to search for information in such big text files it
> > is slow, because you do linear search
>
> No I don't. I don't search for _anything_. I have my own
> content-addressable filesystem, and I guarantee you that it's faster than
> mysql, because it depends on the kernel doing the right thing (which it
> does).

Linus, sorry, but you mistake the frequently seen abuse of SQL databases as
DATA storage for what SQL databases are good at storing: well-defined
RELATIONS. Sure, a filesystem is for data. SQL is for relations.


Re: Kernel SCM saga..

On 2005-04-08, at 19:14, Linus Torvalds wrote:
> You do that with an sql database, and I'll be impressed.

It's possible. But what will impress you is either the price tag the DB
comes with or the hardware it runs on :-)


Re: Kernel SCM saga..

On 2005-04-06, at 23:13, [EMAIL PROTECTED] wrote:
> Linus Torvalds wrote:
> > PS. Don't bother telling me about subversion. If you must, start reading
> > up on "monotone". That seems to be the most viable alternative, but don't
> > pester the developers so much that they don't get any work done. They are
> > already aware of my problems ;)
>
> By the way, the Subversion developers have no argument with the claim
> that Subversion would not be the right choice for Linux kernel
> development.  We've written an open letter entitled "Please Stop
> Bugging Linus Torvalds About Subversion" to explain why:
>
>    http://subversion.tigris.org/subversion-linus.html

Thumbs up, "Subverters"! I just love you. I love your attitude toward high
engineering quality. And I actually appreciate very much what you provide
as software, both in function and in quality of implementation.


Re: Kernel SCM saga..

On 2005-04-07, at 09:44, Jan Hudec wrote:
> I have looked at most systems currently available. I would suggest
> following for closer look on:
>
> 1) GNU Arch/Bazaar. They use the same archive format, simple, have the
>    concepts right. It may need some scripts or add ons. When Bazaar-NG
>    is ready, it will be able to read the GNU Arch/Bazaar archives so
>    switching should be easy.

Arch isn't a sound example of software design. Quite contrary to the
random notes posted by its author, the following issues struck me when I
evaluated it:

The application (tla) claims to have "intuitive" command names. However,
I didn't find that to be the case. Most of them were difficult to remember
and appeared to be just infantile. I stopped looking further after I saw:

tla my-id instead of: tla user-id or even tla set id ...
tla make-archive instead of tla init
tla my-default-archive [EMAIL PROTECTED]

No more "My Computer" please...

Repository addressing requires you to use informally defined, very
elaborate and typing-error-prone conventions:

mkdir ~/{archives}
tla make-archive [EMAIL PROTECTED] ~/{archives}/2005-VersionPatrol

You notice the requirement for two commands to accomplish a single task
already well denoted by the second command? There is more of the same at
quite a few places when you try to use it. You notice the triple zero it
didn't catch?

As an added bonus, it relies on the applications that happen to be named
patch and diff being installed on the host in question, as well as a few
others, in order to operate.

Better not to waste your time looking at Arch. Stick with patches you
maintain by hand, combined with some scripts containing a list of apply
commands, and you should still be more productive than when using Arch.


Re: Kernel SCM saga..

On 2005-04-08, at 18:15, Matthias-Christian Ott wrote:
> Linus Torvalds wrote:
>
> SQL Databases like SQLite aren't slow.
> But maybe a Berkeley Database v.4 is a better solution.

Yes it sucks less for this purpose. See subversion as reference.

Re: Kernel SCM saga..

2005-04-08 Thread Roman Zippel

Hi,

On Fri, 8 Apr 2005, Tupshin Harper wrote:

> > A1 -> A2 -> A3 -> B1 -> B2
> > 
> > This results in a simpler repository, which is more scalable and which is
> > easier for users to work with (e.g. binary bug search).
> > The disadvantage would be it will cause more minor conflicts, when changes
> > are pulled back into the original tree, but which should be easily
> > resolvable most of the time.
> > 
> Both darcs and arch (and arch's siblings) have ways of maintaining the
> complete history but speeding up operations.

Please show me how you would do a binary search with arch.

I don't really like the arch model, it's far too restrictive and it's 
jumping through hoops to get to an acceptable speed.
What I expect from a SCM is that it maintains both a version index of the 
directory structure and a version index of the individual files. Arch 
makes it especially painful to extract this data quickly. For the common 
cases it throws disk space at the problem and does a lot of caching, but 
there are still enough problems (e.g. annotate), which require scanning of 
lots of tarballs.

bye, Roman

Re: Kernel SCM saga..

On Fri, 8 Apr 2005, Linus Torvalds wrote:
> 
> Also note that the above algorithm really works for _any_ two commit 
> points (apart from the two first steps, which are obviously all about 
> finding the parent tree when you want to diff against a predecessor). 

Btw, if you want to try this, you should get an updated copy. I've pushed 
a "raw" git archive of both git and sparse (the latter is much more 
interesting from an archive standpoint, since it actually has 1400 
changesets in it) to kernel.org, but I'm not convinced it gets mirrored 
out. I think the mirror scripts may mirror only things they understand.

I've also added a partial "fsck" for the "git filesystem". It doesn't do
the connectivity analysis yet, but that should be pretty straightforward
to add - it already parses all the data, it just doesn't save it away (and
the connectivity analysis will automatically show how many "root"
changesets you have, and what the different HEADs are).
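
The connectivity analysis mentioned above can be sketched in a few lines. This is a minimal illustration, not the fsck code itself: it assumes the commit objects have already been parsed into a hypothetical mapping from each commit name to the list of its parent names.

```python
def roots_and_heads(parents_of):
    """Classify commits given a mapping: commit name -> list of parent names.

    Returns (roots, heads, missing):
      roots   - commits with no parents
      heads   - commits that no other commit names as a parent
      missing - parents that are referenced but not present in the mapping
    """
    referenced = {p for parents in parents_of.values() for p in parents}
    roots = sorted(c for c, parents in parents_of.items() if not parents)
    heads = sorted(c for c in parents_of if c not in referenced)
    missing = sorted(referenced - set(parents_of))
    return roots, heads, missing


# Made-up history: one root, one merge, and one commit with a lost parent.
history = {
    "A1": [],
    "A2": ["A1"],
    "B1": ["A1"],
    "M": ["A2", "B1"],
    "X": ["gone"],
}
roots, heads, missing = roots_and_heads(history)
print(roots, heads, missing)
```

A parent that is referenced but absent is exactly the kind of break a connectivity check would report.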

I'll make a tar-file (git-0.03), although at this point I've actually been 
maintaining it in itself, so to some degree it's almost getting easier if 
I'd just have a place to rsync it..

Linus

Re: Kernel SCM saga..

On Fri, 8 Apr 2005, Andrea Arcangeli wrote:
> 
> We'd need a regenerated coherent copy of BKCVS to pipe into those SCM to
> evaluate how well they scale.

Yes, that makes most sense, I believe. Especially as BKCVS does the 
linearization that makes other SCM's _able_ to take the data in the first 
place. Few enough SCM's really understand the BK merge model, although the 
distributed ones obviously have to do something similar.

> OTOH if your git project already allows storing the data in there,
> that looks nice ;).

I can express the data, and I did a sparse .git archive to prove the 
concept. It doesn't even try to save BK-specific details, but as far as I 
can tell, my git-conversion did capture all the basic things (ie not just 
the actual source tree, but hopefully all the "who did what" parts too).

Of course, my git visualization tools are so horribly crappy that it is 
hard to make sure ;)

Also, I suspect that BKCVS actually bothers to get more details out of a
BK tree than I cared about. People have pestered Larry about it, so BKCVS
exports a lot of the nitty-gritty (per-file comments etc) that just
doesn't actually _matter_, but people whine about. Me, I don't care. My
sparse-conversion just took the important parts.

> I don't yet fully understand how the algorithms of the trees are meant
> to work

Well, things like actually merging two git trees is not even something git
tries to do. It leaves that to somebody else - you can see what the
relationship is, and you can see all the data, but as far as I'm
concerned, git is really a "filesystem". It's a way of expressing
revisions, but it's not a way of creating them.

> It looks similar to a diff -ur of two hardlinked trees

Yes. You could really think of it that way. It's not really about
hardlinking, but the fact that objects are named by their content does
mean that two objects (regardless of their type) can be seen as
"hardlinked" whenever their contents match.

But the more interesting part is the hierarchical virtual format it has,
ie it is not only hardlinked, but it also has the three different levels
of "views" into those hardlinked objects ("blob", "tree", "revision").

So even though the hash tree looks flat in the _physical_ filesystem, it 
definitely isn't flat in its own virtual world. It's just flattened to fit 
in a normal filesystem ;)
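
A toy model may help picture the "flat on disk, hierarchical in its own world" idea. The sketch below is an illustration only: the `put` helper, the type prefix and the textual tree and revision encodings are assumptions made for this example, not git's actual object format.

```python
import hashlib

store = {}  # one flat namespace: hex SHA-1 name -> object bytes


def put(kind, payload):
    """Store an object under the SHA-1 of its bytes and return that name."""
    data = kind.encode() + b"\0" + payload
    name = hashlib.sha1(data).hexdigest()
    store[name] = data  # identical content always lands on the same name
    return name


# Two files with identical content collapse into a single blob object.
blob_a = put("blob", b"int main(void) { return 0; }\n")
blob_b = put("blob", b"int main(void) { return 0; }\n")

# A tree refers to blobs by name; a revision refers to a tree by name.
tree = put("tree", f"main.c {blob_a}\ncopy.c {blob_b}\n".encode())
revision = put("revision", f"tree {tree}\n".encode())

print(len(store), "objects in the flat store")
```

The store itself has no directories or levels; the hierarchy exists only in the names that objects embed in their content.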

[ There's also a fourth level view in "trust", but that one hasn't been
  implemented yet since I think it might as well be done at a higher
  level. ]

Btw, the sha1 file format isn't actually designed for "rsync", since rsync 
is really a hell of a lot more capable than my format needs. The format is 
really designed for something like an offline http grabber, in that you can 
just grab files purely by filename (and verify that you got them right by 
running sha1sum on the resulting local copy). So think "wget".
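
The verification step of that "wget" model is simple to sketch. The example below covers only the check, not the download, and it assumes the object's name is the hex SHA-1 of exactly the bytes that were fetched; `verify_object` is a hypothetical helper name.

```python
import hashlib


def verify_object(name, data):
    """Return True if `data` hashes to the hex SHA-1 name it was fetched by."""
    return hashlib.sha1(data).hexdigest() == name


payload = b"some object contents\n"
name = hashlib.sha1(payload).hexdigest()

print(verify_object(name, payload))         # an intact copy passes
print(verify_object(name, payload + b"x"))  # a damaged copy is rejected
```

Because the name doubles as the checksum, a plain file server needs no knowledge of the format for the client to detect a bad transfer.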

Linus

Re: Kernel SCM saga..

2005-04-08 Thread Tupshin Harper

Roman Zippel wrote:

> Preserving the complete merge history does indeed make repeated merges 
> simpler, but it builds up complex meta data, which has to be managed 
> forever. I doubt that this is really an advantage in the long term. I 
> expect that we were better off serializing changesets in the main 
> repository. For example bk does something like this:
>
> A1 -> A2 -> A3 -> BM
>   \-> B1 -> B2 --^
>
> and instead of creating the merge changeset, one could merge them like 
> this:
>
> A1 -> A2 -> A3 -> B1 -> B2
>
> This results in a simpler repository, which is more scalable and which 
> is easier for users to work with (e.g. binary bug search).
> The disadvantage would be it will cause more minor conflicts, when changes 
> are pulled back into the original tree, but which should be easily 
> resolvable most of the time.

Both darcs and arch (and arch's siblings) have ways of maintaining the 
complete history but speeding up operations.

Arch uses revision libraries:
http://www.gnu.org/software/gnu-arch/tutorial/revision-libraries.html
though I'm not all that up on arch so I'll just leave it at that.
Darcs uses "darcs optimize --checkpoint"
http://darcs.net/manual/node7.html#SECTION00764000
which "allows for users to retrieve a working repository with limited 
history with a savings of disk space and bandwidth." In darcs' case, you 
can pull a partial repository by doing "darcs get --partial", in which 
case you only grab the state at the point that the repository was 
optimized and subsequent patches, and all operations only need to work 
against the set of patches since that optimize.

Note, that I'm not promoting darcs for kernel usage because of speed (or 
the lack thereof) but I am curious why Linus would consider monotone 
given its speed issues but not consider darcs.

-Tupshin

Re: Kernel SCM saga..

2005-04-08 Thread Daniel Phillips

On Friday 08 April 2005 04:38, Andrea Arcangeli wrote:
> On Thu, Apr 07, 2005 at 11:41:29PM -0700, Linus Torvalds wrote:
> The huge number of changesets is the crucial point, there are good
> distributed SCM already but they are apparently not efficient enough at
> handling 60k changesets.
>
> We'd need a regenerated coherent copy of BKCVS to pipe into those SCM to
> evaluate how well they scale.
>
> OTOH if your git project already allows storing the data in there,
> that looks nice ;).

Hi Andrea,

For the immediate future, all we need is something that can _losslessly_ 
capture the new metadata that's being generated.  That buys time to bring one 
of the promising open source candidates up to full speed.

By the way, which one are you working on? :-)

Regards,

Daniel

Re: Kernel SCM saga..

On Fri, 8 Apr 2005, Rajesh Venkatasubramanian wrote:
> 
> Although directory changes are tracked using change-sets, there 
> seems to be no easy way to answer "give me the diff corresponding to
> the commit (change-set) object <sha1>".  That will be really helpful to
> review the changes.

Actually, it is very easy indeed. Here's what you do:

 - look up the commit object ("cat-file commit <sha1>")

   This object starts out with "tree <sha1>", followed by a list of
   parent commit objects: "parent <sha1>"

   Remember the tree object (it defines what the tree looks like at
   the time of the commit). Pick the parent object you want to diff
   against (normally the first one).

   Also, print the checkin messages at the end of the commit object.

 - look up the parent object ("cat-file commit <parent-sha1>")

   Here you have the same kind of object, but this time you don't care
   about going deeper, you just pick up the tree <sha1> that describes
   the tree at the parent.

 - look up the two tree objects. Unlike a commit object, a tree object
   is a binary data blob, but the format is an _extremely_ simple table
   of these guys:

	<mode> <pathname> <NUL> <20-byte sha1>

   and the reason it's binary is really that that way "git" doesn't end
   up having any issues with strange pathnames. If you want to have spaces
   and newlines in your pathname, go wild.

   In particular, the tree object is also _sorted_ by the pathname. This
   makes things simple, because you now have two sorted trees, and the
   first thing you do is just walk the two trees in lock-step, which is
   trivial thanks to the sorted nature of the tree "array".

   So now you have three cases:

    - you have the same name, and the same sha1

      ignore it - the file didn't change, you don't even have to look
      at the contents (although if the file mode changed you might
      want to note that)

    - you have the same name in parent and child tree lists, but the
      sha differs. Now you just need to do a "cat-file" on both of the
      SHA1 values, and do a "diff -u" between them.

    - you have the filename in only parent or only child. Do a
      "create" or "delete" diff with the content of the sha1 file.

See? Very efficient. For any files that didn't change, you didn't have to 
do anything at all - you didn't even have to look at their data.
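
The steps above can be sketched in Python. This is a hedged illustration rather than the actual tools: `parse_commit` assumes the header lines are separated from the message by a blank line, and `diff_trees` works on already-decoded, name-sorted lists of (pathname, sha1) pairs instead of the binary tree format. Both function names are made up for this example.

```python
def parse_commit(text):
    """Split a commit object's text into (tree, parents, message).

    Assumes "tree ..." and "parent ..." header lines, then a blank line,
    then the message. Other header lines are ignored.
    """
    header, _, message = text.partition("\n\n")
    tree, parents = None, []
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            tree = value
        elif key == "parent":
            parents.append(value)
    return tree, parents, message


def diff_trees(old, new):
    """Walk two name-sorted lists of (pathname, sha1) in lock-step."""
    changes = []
    i = j = 0
    while i < len(old) and j < len(new):
        old_name, old_sha = old[i]
        new_name, new_sha = new[j]
        if old_name == new_name:
            if old_sha != new_sha:
                changes.append(("modify", old_name))  # diff the two blobs
            i += 1                                    # same sha1: skip it
            j += 1
        elif old_name < new_name:
            changes.append(("delete", old_name))      # only in the old tree
            i += 1
        else:
            changes.append(("create", new_name))      # only in the new tree
            j += 1
    changes.extend(("delete", name) for name, _ in old[i:])
    changes.extend(("create", name) for name, _ in new[j:])
    return changes


commit = "tree aaa\nparent bbb\nparent ccc\n\nFix the frobnicator\n"
parent_tree = [("Makefile", "1"), ("a.c", "2"), ("b.c", "3")]
child_tree = [("Makefile", "1"), ("a.c", "9"), ("c.c", "4")]

print(parse_commit(commit))
print(diff_trees(parent_tree, child_tree))
```

Unchanged entries cost one name comparison and one hash comparison each, and no file contents are read for them. Swapping the two arguments yields the reverse diff.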

Also note that the above algorithm really works for _any_ two commit 
points (apart from the two first steps, which are obviously all about 
finding the parent tree when you want to diff against a predecessor). 

It doesn't have to be parent and child. Pick any commit you have. And pick
them in the other order, and you'll automatically get the reverse diff.

You can even do diffs between unrelated projects this way if you use the
shared sha1 directory model, although that obviously doesn't tend to be
all that sensible ;)

Linus

Re: Kernel SCM saga..

2005-04-08 Thread Roman Zippel

Hi,

On Thu, 7 Apr 2005, Linus Torvalds wrote:

> I really disliked that in BitKeeper too originally. I argued with Larry
> about it, but Larry (correctly, I believe) argued that efficient and
> reliable distribution really requires the concept of "history is
> immutable". It makes replication much easier when you know that the known
> subset _never_ shrinks or changes - you only add on top of it.

The problem is you pay a price for this. There must be a reason developers 
were adding another GB of memory just to run BK.
Preserving the complete merge history does indeed make repeated merges 
simpler, but it builds up complex meta data, which has to be managed 
forever. I doubt that this is really an advantage in the long term. I 
expect that we would be better off serializing changesets in the main 
repository. For example, bk does something like this:

A1 -> A2 -> A3 -> BM
  \-> B1 -> B2 --^

and instead of creating the merge changeset, one could merge them like 
this:

A1 -> A2 -> A3 -> B1 -> B2

This results in a simpler repository, which is more scalable and which 
is easier for users to work with (e.g. binary bug search).
The disadvantage is that it would cause more minor conflicts when changes 
are pulled back into the original tree, but these should be easily 
resolvable most of the time.
I'm not saying with this that the bk model is bad, but I think it's a 
problem if it's the only model applied to everything.
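
To illustrate why a serialized history makes binary bug search simple, here is a minimal sketch in Python. It assumes a strictly linear list of changesets in which every changeset after the first bad one is also bad; the names `first_bad` and `is_bad` are made up for the example and do not refer to any existing tool.

```python
def first_bad(commits, is_bad):
    """Binary search a linear history for the first bad changeset.

    Assumes the last changeset is bad and that badness, once introduced,
    persists in every later changeset.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # the first bad changeset is at mid or earlier
        else:
            lo = mid + 1      # the first bad changeset is after mid
    return commits[lo]
```

With the serialized history A1 -> A2 -> A3 -> B1 -> B2 every midpoint is a well-defined tree state, which is what makes the search this short.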

> The thing is, cherry-picking very much implies that the people "up" the 
> foodchain end up editing the work of the people "below" them. The whole 
> reason you want cherry-picking is that you want to fix up somebody else's 
> mistakes, ie something you disagree with.
> 
> That sounds like an obviously good thing, right? Yes it does.
> 
> The problem is, it actually results in the wrong dynamics and psychology 
> in the system. First off, it makes the implicit assumption that there is 
> an "up" and "down" in the food-chain, and I think that's wrong.

These dynamics do exist and our tools should be able to represent them.
For example when people post patches, they get reviewed and often need 
more changes and bk doesn't really help them to redo the patches.
Bk helped you to offload the cherry-picking process to other people, so 
that you only had to do cherry-collecting very efficiently.
Another prime example of cherry-picking is Andrew's mm tree: he picks a 
number of patches which are ready for merging and forwards them to you.
Our current basic development model (at least until a few days ago) looks 
something like this:

linux-mm -> linux-bk -> linux-stable

Ideally most changes would get into the tree via linux-mm and, depending 
on various conditions (e.g. urgency, review state), they would get 
into the stable tree. In practice linux-mm is more an aggregation of 
patches which need testing and since most bk users were developing 
against linux-bk, it got a lot less testing and a lot of problems are 
only caught at the next stage. Changes from the stable tree would even 
flow in the opposite direction.
Bk supports certain aspects of the kernel development process very well, 
but due to its closed nature it was practically impossible to really 
integrate it fully into this process (at least for anyone outside BM). 
In the short term we probably are in for a tough ride and we take whatever 
works best for you, but in the long term we need to think about how SCM 
fits into our kernel development model, which includes development, 
review, testing and releasing of kernel changes. This is more than just 
pulling and merging kernel trees. I'm aiming at a tool that can also 
support Andrew's work, so that he can also better offload some of this 
work (and take a break sometimes :) ). Unfortunately every existing tool I 
know of is lacking in its own way, so we still have some way to go...

bye, Roman

Re: Kernel SCM saga..

2005-04-08 Thread Rajesh Venkatasubramanian

Linus wrote:
> > It looks like an operation like "show me the history of mm/memory.c" will
> > be pretty expensive using git.
>
> Yes.  Per-file history is expensive in git, because of the way it is 
> indexed. Things are indexed by tree and by changeset, and there are no 
> per-file indexes.

Although directory changes are tracked using change-sets, there 
seems to be no easy way to answer "give me the diff corresponding to
the commit (change-set) object ".  That would be really helpful for
reviewing the changes.

Rajesh

Re: Kernel SCM saga..

2005-04-08 Thread Daniel Phillips

On Friday 08 April 2005 13:24, Jon Masters wrote:
> On Apr 7, 2005 6:54 PM, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > So I propose that everybody who is interested, pick one of the above
> > projects and join it, to help get it to the point of being able to
> > losslessly import the version graph.  Given the importance, I think that
> > _all_ viable alternatives need to be worked on in parallel, so that two
> > months from now we have several viable options.
>
> What about BitKeeper licensing constraints on such involvement?

They don't apply to me, for one.

Regards,

Daniel

Re: Kernel SCM saga..

On Fri, 8 Apr 2005 [EMAIL PROTECTED] wrote:
>
> It looks like an operation like "show me the history of mm/memory.c" will
> be pretty expensive using git.

Yes.  Per-file history is expensive in git, because of the way it is 
indexed. Things are indexed by tree and by changeset, and there are no 
per-file indexes.

You could create per-file _caches_ (*) on top of git if you wanted to make
it behave more like a real SCM, but yes, it's all definitely optimized for
the things that _I_ tend to care about, which is the whole-repository
operations.

Linus

(*) Doing caching on that level is probably fine, especially since most
people really tend to want it for just the relatively few files that they
work on anyway. Limiting the caches to a subset of the tree should be
quite effective.
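
As a rough illustration of both the cost and the cache, here is a minimal Python sketch. It assumes a linear list of commits, each carrying a flattened `{path: sha1}` view of its tree; the function `file_history` and the dictionary-based cache are hypothetical and not part of git. A real cache would also have to be refreshed as new commits arrive.

```python
def file_history(commits, path, cache=None):
    """Return ids of commits that changed `path` relative to their predecessor.

    commits: list of (commit_id, {path: sha1}) ordered oldest to newest.
    cache:   optional dict mapping path -> previously computed result.
    """
    if cache is not None and path in cache:
        return cache[path]
    touched = []
    previous = None
    for commit_id, tree in commits:
        # Without a per-file index, every commit has to be visited.
        current = tree.get(path)
        if current != previous:
            touched.append(commit_id)
        previous = current
    if cache is not None:
        cache[path] = touched
    return touched
```

The walk is linear in the number of commits for each uncached path, which is the expense discussed above; the cache only pays for the few files somebody actually asks about.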

Re: Kernel SCM saga..

2005-04-08 Thread Luck

It looks like an operation like "show me the history of mm/memory.c" will
be pretty expensive using git.  I'd need to look at the current tree, and
then trace backwards through all 60,000 changesets to see which ones had
actual changes to this file.  Could you expand the tuple in the tree object
to include a back pointer to the previous tree in which the tuple changed?
Or does adding history to the tree violate other goals of the tree type?

-Tony

Re: Uncached stat performace [ Was: Re: Kernel SCM saga.. ]

On Fri, Apr 08, 2005 at 10:11:51PM +0200, Ragnar Kjørstad wrote:

> It does, so why isn't there a way to do this without the disgusting
> hack? (Your words, not mine :) )

inode sorting is probably a good guess for a number of filesystems; you
can map the blocks used to do better still (somewhat fs specific)

you can do better still if you issue multiple stats in parallel (up to a
point) and let the elevator sort things out

> I bet it would make a significant difference from things like "ls -l" in
> large uncached directories and imap-servers with maildir?

sort + concurrent stats would help here i think

i'm not sure i like the idea of ls using lots of threads though :)
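
A minimal sketch of the "concurrent stats" idea, written in Python for illustration only. The helper name `stat_all` and the default of 8 workers are assumptions for the example; whether this actually helps depends on the filesystem and the I/O scheduler.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def stat_all(paths, workers=8):
    """Stat every path using a small thread pool.

    Several stat() calls are in flight at once, so the kernel's I/O
    scheduler gets a chance to reorder the resulting disk requests.
    Returns a dict mapping each path to its os.stat_result.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs results correctly.
        return dict(zip(paths, pool.map(os.stat, paths)))
```

Sorting `paths` by inode number before calling it would combine both suggestions.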

Uncached stat performace [ Was: Re: Kernel SCM saga.. ]

2005-04-08 Thread Ragnar Kjørstad

On Fri, Apr 08, 2005 at 12:39:26PM -0700, Linus Torvalds wrote:
> One of the reasons I do inode numbers in the "index" file (apart from 
> checking that the inode hasn't changed) is in fact that "stat()" is damn 
> slow if it causes seeks. Since your stat loop is entirely 
> 
> You can optimize your stat() patterns on traditional unix-like filesystems
> by just sorting the stats by inode number (since the inode number is
> historically a special index into the inode table - even when filesystems
> distribute the inodes over several tables, sorting will generally do the
> right thing from a seek perspective). It's a disgusting hack, but it
> literally gets you orders-of-magnitude performance improvements in many
> real-life cases.

It does, so why isn't there a way to do this without the disgusting
hack? (Your words, not mine :) )

E.g., wouldn't an aio_stat() allow similar or better speedups in a way
that doesn't depend on ext2/3 internals?

I bet it would make a significant difference from things like "ls -l" in
large uncached directories and imap-servers with maildir?



-- 
Ragnar Kjørstad
Software Engineer
Scali - http://www.scali.com
Scaling the Linux Datacenter

Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 09:38:09PM +0200, Florian Weimer wrote:

> Does sorting by inode number make a difference?

It almost certainly would.  But I can sort more intelligently than
that even (all the world isn't ext2/3).

Re: Kernel SCM saga..

2005-04-08 Thread Matthias-Christian Ott

Linus Torvalds wrote:
> On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
> >
> > But as mentioned you need to _open_ each file (It doesn't matter if it's 
> > cached (this speeds up only reading it) -- you need a _slow_ system call 
> > and _very slow_ hardware access anyway).
>
> Nope. System calls aren't slow. What crappy OS are you running?

But they're slower because there're some instances checking them.

> > I hope my idea/opinion is clear now.
>
> Numbers talk. I've got something that you can test ;)

This doesn't mean it's better just because you had the time to develop it 
;). But anyhow, people need something they can test to see if it's 
good or not; most don't believe in concepts.

> Linus

We will see which solution wins the "race". But I think your 
solution will "win", because you're Linus Torvalds -- the "Boss" of 
Linux -- and have to work with this system every day (usually people 
use what they have developed :)), and I don't have the time to develop a 
database-based solution (maybe someone else is interested in developing it).

Matthias-Christian

Re: Kernel SCM saga..

On Fri, 8 Apr 2005, Chris Wedgwood wrote:
> 
> > It doesn't matter so much for the cached case, but it _does_ matter
> > for the uncached one.
> 
> Doing the minimal stat cold-cache here is about 6s for local disk.
> I'm somewhat surprised it's that bad actually.

One of the reasons I do inode numbers in the "index" file (apart from 
checking that the inode hasn't changed) is in fact that "stat()" is damn 
slow if it causes seeks. Since your stat loop is entirely 

You can optimize your stat() patterns on traditional unix-like filesystems
by just sorting the stats by inode number (since the inode number is
historically a special index into the inode table - even when filesystems
distribute the inodes over several tables, sorting will generally do the
right thing from a seek perspective). It's a disgusting hack, but it
literally gets you orders-of-magnitude performance improvements in many
real-life cases.

It does have some downsides:
 - it buys you nothing when it's cached (and obviously you have the 
   sorting overhead, although that's pretty cheap)
 - on other filesystems it can make things slower.

But if the cold-cache case actually is a concern, I do have the solution 
for it. Just a simple "prime-cache" program that does a qsort on the index 
file entries and does the stat() on them all will bring the numbers down. 
Those 6 seconds you see are the disk head seeking around like mad.
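
A minimal sketch of such a "prime-cache" pass, in Python for illustration. It is not the program referred to above: instead of reading inode numbers from the index file it takes them from the directory entries via `os.scandir`, and the function name `prime_cache` is made up for the example.

```python
import os


def prime_cache(directory):
    """Stat every entry in `directory` in ascending inode order.

    On Unix, DirEntry.inode() comes from the directory entry itself, so
    collecting the inode numbers does not require a stat() per file.
    Returns a list of (path, os.stat_result) in the order visited.
    """
    with os.scandir(directory) as it:
        entries = [(entry.inode(), entry.path) for entry in it]
    entries.sort()  # visit inodes in on-disk table order to limit seeking
    return [(path, os.lstat(path)) for _inode, path in entries]
```

As noted above, the sort buys nothing once everything is cached, and on filesystems where inode numbers do not reflect on-disk layout it may not help at all.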

Linus

Re: Kernel SCM saga..

2005-04-08 Thread Florian Weimer

* Chris Wedgwood:

>> It doesn't matter so much for the cached case, but it _does_ matter
>> for the uncached one.
>
> Doing the minimal stat cold-cache here is about 6s for local disk.

Does sorting by inode number make a difference?

Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
>
> But as mentioned you need to _open_ each file (It doesn't matter if it's 
> cached (this speeds up only reading it) -- you need a _slow_ system call 
> and _very slow_ hardware access anyway).

Nope. System calls aren't slow. What crappy OS are you running?

> I hope my idea/opinion is clear now.

Numbers talk. I've got something that you can test ;)

Linus

Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 12:03:49PM -0700, Linus Torvalds wrote:

> Yes, doing the stat just on the directory (on leaf directories only, of
> course, but nlink==2 does say that on most filesystems) is indeed a huge
> potential speedup.

Here I measure about 6ms for cache --- essentially below the noise
threshold for something that does real work.

> It doesn't matter so much for the cached case, but it _does_ matter
> for the uncached one.

Doing the minimal stat cold-cache here is about 6s for local disk.
I'm somewhat surprised it's that bad actually.

Re: Kernel SCM saga..

2005-04-08 Thread Matthias-Christian Ott

Linus Torvalds wrote:
> On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
> >
> > Ok, but if you want to search for information in such big text files it's
> > slow, because you do a linear search
>
> No I don't. I don't search for _anything_. I have my own
> content-addressable filesystem, and I guarantee you that it's faster than
> mysql, because it depends on the kernel doing the right thing (which it
> does).

I'm not talking about mysql, I'm talking about fast databases like
sqlite or db4.

> I never do a single "readdir". It's all direct data lookup, no "searching"
> anywhere.
>
> Databases aren't magical. Quite the reverse. They easily end up being
> _slower_ than doing it by hand, simply because they have to solve a much
> more generic issue. If you design your data structures and abstractions
> right, a database is pretty much guaranteed to only incur overhead.
>
> The advantage of a database is the abstraction and management it gives
> you. But I did my own special-case abstraction in git.
>
> Yeah, I bet "git" might suck if your OS sucks. I definitely depend on name
> caching at an OS level so that I know that opening a file is fast. In
> other words, there _is_ an indexing and caching database in there, and
> it's called the Linux VFS layer and the dentry cache.
>
> The proof is in the pudding. git is designed for _one_ thing, and one
> thing only: tracking a series of directory states in a way that can be
> replicated. It's very very fast at that. A database with a more flexible
> abstraction might be faster at other things, but the fact is, you do take a
> hit.
>
> The problems with databases are:
>
>  - they are damn hard to just replicate wildly and without control. The
>    database backing file inherently has a lot of internal state. You may
>    be able to "just copy it", but you have to copy the whole damn thing.

This is _not_ true for every database (especially plain-text databases
with meta information).

>    In "git", the data is all there in immutable blobs that you can just
>    rsync. In fact, you don't even need rsync: you can just look at the
>    filenames, and anything new you copy. No need for any fancy "read the
>    files to see that they match". They _will_ match, or you can tell
>    immediately that a file is corrupt.
>
>    Look at this:
>
>      [EMAIL PROTECTED]:~/git> sha1sum .dircache/objects/e7/bfaadd5d2331123663a8f14a26604a3cdcb678
>      e7bfaadd5d2331123663a8f14a26604a3cdcb678 .dircache/objects/e7/bfaadd5d2331123663a8f14a26604a3cdcb678
>
>    see a pattern anywhere? Imagine that you know the list of files you
>    have, and the list of files the other side has (never mind the
>    contents), and how _easy_ it is to synchronize. Without ever having to
>    even read the remote files that you know you already have.
>
>    How do you replicate your database incrementally? I've given you enough
>    clues to do it for "git" in probably five lines of perl.
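
The filename-only synchronization described in the quoted text can be sketched as follows. This is an illustration in Python rather than the five lines of perl mentioned above; it assumes both object directories are reachable as local paths, and the function names `list_objects` and `sync_objects` are invented for the example.

```python
import os
import shutil


def list_objects(root):
    """Return the relative paths of all files under an object directory."""
    names = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            names.add(os.path.relpath(os.path.join(dirpath, name), root))
    return names


def sync_objects(src, dst):
    """Copy every object that exists in src but not in dst.

    Only file names are compared. Because each name is derived from the
    content, an object that is already present never needs to be re-read.
    Returns the sorted list of relative paths that were copied.
    """
    missing = sorted(list_objects(src) - list_objects(dst))
    for rel in missing:
        target = os.path.join(dst, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copyfile(os.path.join(src, rel), target)
    return missing
```

Running it a second time copies nothing, since the two name sets are then identical.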

I replicate my database incrementally by using a hash list like you do: the
client sends its hash list, the server compares the lists and tells the
client after which data (data = hash + data) the new data has to be added.
This is like your solution -- you also submit the data and the location
(you have directories too, right?). A database is in some cases (like
this one) like a filesystem, but it's built on top of a better filesystem
like xfs, reiser4 or ext3, which support features like LVM, quotas or
journaling. (Is your filesystem also built on top of an existing filesystem?
I don't think so, because you're talking about VFS operations on the
filesystem.)

>  - they tend to take time to set up and prime.
>
>    In contrast, the filesystem is always there. Sure, you effectively have
>    to "prime" that one too, but the thing is, if your OS is doing its job,
>    you basically only need to prime it once per reboot. No need to prime
>    it for each process you start or play games with connecting to servers
>    etc. It's just there. Always.

The database -- a single file (sqlite or db4) -- is always there too,
because it's on the filesystem and doesn't need a server.

> So if you think of the filesystem as a database, you're all set. If you
> design your data structure so that there is just one index, you make that
> the name, and the kernel will do all the O(1) hashed lookups etc for you.
> You do have to limit yourself in some ways.
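
A minimal sketch of "the name is the only index", in Python for illustration. It mirrors the directory layout visible in the sha1sum example earlier in this message (two hex digits as directory, the rest as file name), but it stores the raw bytes without compression, and the function names `put_object` and `get_object` are assumptions made for the example, not git's interface.

```python
import hashlib
import os


def put_object(root, data):
    """Store `data` under a name derived from its SHA-1 and return that name."""
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(root, name[:2], name[2:])
    if not os.path.exists(path):
        # Objects are immutable: an existing name already holds this content.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
    return name


def get_object(root, name):
    """Look an object up by name; the path lookup is the only index used."""
    path = os.path.join(root, name[:2], name[2:])
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha1(data).hexdigest() != name:
        raise ValueError("corrupt object " + name)
    return data
```

The limitation mentioned in the quoted text shows up directly: the only question this store can answer is "give me the content with this name".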

But as mentioned you need to _open_ each file (it doesn't matter if it's
cached; that only speeds up reading it) -- you need a _slow_ system call
and _very slow_ hardware access anyway.

Have a look at this comparison:
If you have one big chest and lots of small chests containing the same
amount of gold, it's more work to collect the gold from the small chests
than from the big one (which would contain as many cases as there are
small chests). You can find your gold faster because you don't have to
walk to the other chests and you don't have to open as many lids, which
also saves time.

> Oh, and you have to be willing to waste diskspace. "git" is _not_
> space-efficient. The good news is that it is c

Re: Kernel SCM saga..