Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-21 Thread Tom Lord


  > However you're
  > right that the original structure proposed by Linus is too flat.

That was the only point I *meant* to defend.   The rest was error.

-t
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-21 Thread Tom Lord


   > Tom, please stop this ext* filesystem bashing ;-) 

For one thing... yes, I'm totally embarrassed on this issue.
I made a late-night math error in a spec.  I *hopefully* would
have noticed it on my own as I coded to that spec, but y'all
have been wonderful at pointing out my mistake to me even
though I initially defended it.

As for ext* bashing: it's not bashing, exactly.  Once I/O bandwidth
gets a little better and disks get a bunch cheaper, ext* won't look
bad at all in this respect.  And we're awfully close to that point.

Meanwhile, for times measured in years, I've gotten complaints from
ext* users about software that is just fine on other filesystems,
over issues like the allocation size of small files.

So ext*, from my perspective, was a little too far ahead of its time
and, yes, my complaints about it are just about reaching their
expiration date.

Anyway, thanks for all the sanity about directory layout.  Really,
it was just an "I'm too sleepy to be doing this right now" error.

-t



Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-21 Thread Tom Lord

   > [your 0:3/4:7 directory hierarchy is horked]

Absolutely.  I made a dumb mistake the night I wrote that spec and am
embarrassed that I initially defended it.  It was an arithmetic
error.  Thanks, this time, for your persistence in pointing it out.

-t


Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-21 Thread Tom Lord


   > Yes, it really doesn't make much sense to have such big keys on the
   > directories.

It's official... I'm blushing wildly.  Thank you for the various
replies that pointed out my thinko.

That part of my spec hasn't been coded yet --- I just wrote text.  It
really was a silly late-night error of the sort: "hmm... let's see, 4 hex
digits plus 4 hex digits, that's 16 bits, sounds about right."
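(A quick check of that arithmetic, in present-day Python: each hex digit
carries 4 bits, so two 4-digit levels consume 32 bits of the hash, not 16.)

```python
# Each hex digit encodes 4 bits: a 4-hex-digit key selects among
# 16**4 = 2**16 = 65536 directories per level.
bits_per_hex_digit = 4
level_bits = 4 * bits_per_hex_digit           # one 4-digit level: 16 bits
assert 16 ** 4 == 2 ** level_bits == 65536    # 65536 dirs per level
assert 2 * level_bits == 32                   # two levels: 32 bits, not 16
```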

Really, I'll fix it.

-t


Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-21 Thread Tom Lord

  > Using your suggested indexing method that uses [0:4] as the 1st level key and
  > [4:8] as the 2nd level key, I obtain an indexed archive that occupies 159M,
  > where the top level contains 18665 1st level keys, the largest first level dir
  > contains 5 entries, and all 2nd level dirs contain exactly 1 entry.


That's just a mistake in the spec.  The format should probably be
multi-level but, yes, the fanout I suggested is currently quite bogus.
When I write that part of the code (today or tomorrow), I'll fix it.

A few people pointed that out.  Thanks.

-t



Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-21 Thread duchier
Tomas Mraz <[EMAIL PROTECTED]> writes:

>> Btw, if, as you indicate above, you do believe that a 1 level indexing should
>> use [0:2], then it doesn't make much sense to me to also suggest that a 2 level
>> indexing should use [0:1] as primary subkey :-)
>
> Why do you think so? IMHO we should always target a similar number of
> files/subdirectories in the directories of the blob archive. So if I
> always suppose that the archive would contain at most 16 million
> files, then the possible indexing schemes are either 1 level with key
> length 3 (each directory would contain ~4096 files) or 2 levels with
> key length 2 (each directory would contain ~256 files).
> Which one is better could, of course, be filesystem- and hardware-dependent.

First off, I have been using Python slice notation, so when I write [0:2] I mean
a key of length 2 (the second index is not included).  I now realize that when
you wrote the same, you meant to include the second index.
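(A two-line check of the two readings, in present-day Python:)

```python
# Python slices exclude the second index: [0:2] is a key of *length* 2.
key = "da39a3ee"            # an arbitrary hex-digest prefix
assert key[0:2] == "da"     # slice semantics: 2 characters
assert key[0:3] == "da3"    # what an inclusive reading of [0:2] would mean
```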

I believe that our disagreement comes from the fact that we are asking different
questions.  You consider the question of how to best index a fixed database and
I consider the question of how to best index an ever increasing database.

Now consider why we even want multiple indexing levels: presumably this is
because certain operations become too costly when the size of a directory
becomes too large.  If that's not the case, then we might as well just have one
big flat directory - perhaps that's even a viable option for some
filesystems.[1]

  [1] there is the additional consideration that a hierarchical system
  implements a form of key compression by sharing key prefixes.  I don't know at
  what point such an effect becomes beneficial, if ever.

Now suppose we need at least one level of indexing.  Under an assumption of
uniform distribution of bits in keys, as more objects are added to the database,
the lower levels are going to fill up uniformly.  Therefore at those levels we
are again faced with exactly the same indexing problem and thus should come up
with exactly the same answer.

This is why I believe that the scheme I proposed is best: when a bottom level
directory fills up past a certain size, introduce under it an additional level,
and reindex the keys.  Since the "certain size" is fixed, this is a constant
time operation.
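A minimal sketch of that scheme in present-day Python (the names LIMIT and
SUBKEY, and the split-on-overflow details, are illustrative assumptions, not
code from any actual implementation; it assumes keys long enough to survive
repeated prefix-stripping, which SHA1 hex digests are):

```python
import os

LIMIT = 256    # the "certain size" past which a leaf directory splits
SUBKEY = 2     # characters of the key consumed per indexing level

def store(root, key):
    """Store an (empty) blob under `key`, splitting full leaf dirs."""
    os.makedirs(root, exist_ok=True)
    d = root
    # descend through levels that have already been split (contain subdirs)
    while any(os.path.isdir(os.path.join(d, e)) for e in os.listdir(d)):
        d = os.path.join(d, key[:SUBKEY])
        key = key[SUBKEY:]
        os.makedirs(d, exist_ok=True)
    open(os.path.join(d, key), "w").close()
    entries = os.listdir(d)
    if len(entries) > LIMIT:
        # constant-time split: only this one leaf's keys are rehashed
        for f in entries:
            sub = os.path.join(d, f[:SUBKEY])
            os.makedirs(sub, exist_ok=True)
            os.rename(os.path.join(d, f), os.path.join(sub, f[SUBKEY:]))
```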

One could also entertain the idea of reindexing not just a bottom level
directory but an entire subtree of the database (this would be closer to your
idea of finding an optimal reindexing of just this part of the database).
However this has the disadvantage that the operation's cost grows exponentially
with the depth of the tree.

Cheers,

--Denys



Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-21 Thread Tomas Mraz
On Thu, 2005-04-21 at 11:09 +0200, Denys Duchier wrote:
> Tomas Mraz <[EMAIL PROTECTED]> writes:
> 
> > If we suppose the maximum number of stored blobs is in the order of
> > millions, probably the optimal indexing would be 1 level [0:2] indexing
> > or 2 levels [0:1] [2:3]. However it would be necessary to do some
> > benchmarking first before setting this in stone.
> 
> As I have suggested in a previous message, it is trivial to implement adaptive
> indexing: there is no need to hardwire a specific indexing scheme.  
> Furthermore,
> I suspect that the optimal size of subkeys may well depend on the filesystem.
> My experiments seem to indicate that subkeys of length 2 achieve an excellent
> compromise between discriminatory power and disk footprint on ext2.
> 
> Btw, if, as you indicate above, you do believe that a 1 level indexing should
> use [0:2], then it doesn't make much sense to me to also suggest that a 2 level
> indexing should use [0:1] as primary subkey :-)

Why do you think so? IMHO we should always target a similar number of
files/subdirectories in the directories of the blob archive. So if I
always suppose that the archive would contain at most 16 million
files, then the possible indexing schemes are either 1 level with key
length 3 (each directory would contain ~4096 files) or 2 levels with
key length 2 (each directory would contain ~256 files).
Which one is better could, of course, be filesystem- and hardware-dependent.
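The arithmetic above, spelled out in present-day Python (the ~16 million
figure is the assumption stated in the text):

```python
# With hex keys, a key of length k discriminates among 16**k directories.
blobs = 16 * 1024 * 1024                 # assumed maximum: ~16 million files

dirs_1level = 16 ** 3                    # 1 level, key length 3
files_per_dir_1 = blobs // dirs_1level   # ~4096 files per directory

dirs_per_level = 16 ** 2                 # 2 levels, key length 2 each
files_per_dir_2 = blobs // (dirs_per_level * dirs_per_level)  # ~256 per leaf

assert (dirs_1level, files_per_dir_1) == (4096, 4096)
assert (dirs_per_level, files_per_dir_2) == (256, 256)
```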

Of course it might be best to allow adaptive indexing, but I think that
some benchmarking should be done first; it's possible that some fixed
scheme could be chosen as optimal.

-- 
Tomas Mraz <[EMAIL PROTECTED]>



Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-21 Thread Denys Duchier
Tomas Mraz <[EMAIL PROTECTED]> writes:

> If we suppose the maximum number of stored blobs is in the order of
> millions, probably the optimal indexing would be 1 level [0:2] indexing
> or 2 levels [0:1] [2:3]. However it would be necessary to do some
> benchmarking first before setting this in stone.

As I have suggested in a previous message, it is trivial to implement adaptive
indexing: there is no need to hardwire a specific indexing scheme.  Furthermore,
I suspect that the optimal size of subkeys may well depend on the filesystem.
My experiments seem to indicate that subkeys of length 2 achieve an excellent
compromise between discriminatory power and disk footprint on ext2.

Btw, if, as you indicate above, you do believe that a 1 level indexing should
use [0:2], then it doesn't make much sense to me to also suggest that a 2 level
indexing should use [0:1] as primary subkey :-)

Cheers,

-- 
Dr. Denys Duchier - IRI & LIFL - CNRS, Lille, France
AIM: duchierdenys


Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-21 Thread Tomas Mraz
On Wed, 2005-04-20 at 16:04 -0700, Tom Lord wrote:

> I think that to a large extent you are seeing artifacts
> of the questionable trade-offs that (reports tell me) the
> ext* filesystems make.   With a different filesystem, the 
> results would be very different.
Tom, please stop this ext* filesystem bashing ;-) Even with a filesystem
which compresses the tails of files into one filesystem block, it
wouldn't make a difference that there are potentially (and very probably,
even with blob numbers in order of 10) 65536 directories on the
first level. That doesn't help much with reading the first level quickly.

> I'm imagining a blob database containing may revisions of the linux
> kernel.  It will contain millions of blobs.
> 
> It's fine that some filesystems and some blob operations work fine
> on a directory with millions of files but what about other operations
> on the database?   I pity the poor program that has to `readdir' through
> millions of files.

Even with millions of files this key structure doesn't make much sense:
the keys on the first and second levels are too long. However, you're
right that the original structure proposed by Linus is too flat.

-- 
Tomas Mraz <[EMAIL PROTECTED]>



Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread Denys Duchier
Tom Lord <[EMAIL PROTECTED]> writes:

> Thank you for your experiment.

you are welcome.

> I think that to a large extent you are seeing artifacts
> of the questionable trade-offs that (reports tell me) the
> ext* filesystems make.   With a different filesystem, the 
> results would be very different.

No, this is not the only thing that we observe.  For example, here are the
reports for the following two experiments:

Indexing method = [2]

Max keys at level  0:     256
Max keys at level  1:     108
Total number of dirs:     257
Total number of keys:   21662
Disk footprint      :    1.8M

Indexing method = [4 4]

Max keys at level  0:   18474
Max keys at level  1:       5
Max keys at level  2:       1
Total number of dirs:   40137
Total number of keys:   21662
Disk footprint      :    157M

Notice the huge number of directories in the second experiment; they don't
help at all in performing discrimination.

> I'm imagining a blob database containing many revisions of the linux
> kernel.  It will contain millions of blobs.

It is very easy to write code that uses an adaptive discrimination method
(i.e. when a directory becomes too full, introduce an additional level of
discrimination and rehash).  In fact I have code that does that (rehashing if
the size of a leaf directory exceeds 256), but the [2] method above doesn't even
need it even though it has 21662 keys.

Just in case there is some interest, I attach below the python scripts which I
used for my experiments:

To create an indexed archive:

python build.py SRC DST N1 ... Nk

where SRC is the root directory of the tree to be indexed, and DST names the
root directory of the indexed archive to be created.  N1 through Nk are integers
that each indicate how many chars to chop off the key to create the next level
indexing key.

python info.py DST

collects and then prints out statistics about an indexed archive.

For example, the invocation that relates to your original proposal would be:

python build.py /usr/src/linux store 4 4
python info.py store

# build.py -- create an indexed archive from a source tree (Python 2).
import os,os.path,stat,sha

tree      = None
archive   = None
slices    = []
lastslice = (0,None)

def recurse(path):
    s = os.stat(path)
    if stat.S_ISDIR(s.st_mode):
        print path
        contents = []
        for n in os.listdir(path):
            uid = recurse(os.path.join(path,n))
            contents.append('\t'.join((n,uid)))
        contents = '\n'.join(contents)
        buf = sha.new(contents)
        uid = buf.hexdigest()
        uid = ','.join((uid,str(len(contents))))
        store(uid)
        return uid
    else:
        fd = file(path,"rb")
        contents = fd.read()
        fd.close()
        buf = sha.new(contents)
        uid = ','.join((buf.hexdigest(),str(s.st_size)))
        store(uid)
        return uid

def store(uid):
    p = archive
    if not os.path.exists(p):
        os.mkdir(p)
    for s in slices:
        p = os.path.join(p,uid[s[0]:s[1]])
        if not os.path.exists(p):
            os.mkdir(p)
    # the remainder of the key names the leaf file (upper bound None,
    # not -1, so the final character of the key is kept)
    p = os.path.join(p,uid[lastslice[0]:lastslice[1]])
    fd = file(p,"wb")
    fd.close()

if __name__ == '__main__':
    import sys
    from optparse import OptionParser
    parser = OptionParser(usage="usage: %prog TREE ARCHIVE N1 ... Nk")
    (options, args) = parser.parse_args()
    if len(args) < 3:
        print >>sys.stderr, "expected at least 3 positional arguments"
        sys.exit(1)
    tree    = args[0]
    archive = args[1]
    prev    = 0
    for a in args[2:]:
        try:
            next = prev+int(a)
            slices.append((prev,next))
            prev = next
        except ValueError:
            print >>sys.stderr, "positional argument is not an integer:",a
            sys.exit(1)
    lastslice = (next,None)
    recurse(tree)
    sys.exit(0)
# info.py -- collect and print statistics about an indexed archive (Python 2).
import os,os.path,stat

info = []
archive = None
total_keys = 0
total_dirs = 0

def collect_info(path,i):
    global total_dirs,total_keys
    s = os.stat(path)
    if stat.S_ISDIR(s.st_mode):
        total_dirs += 1
        l = os.listdir(path)
        n = len(l)
        if i==len(info):
            info.append(n)
        elif n>info[i]:
            info[i] = n
        i += 1
        for f in l:
            collect_info(os.path.join(path,f),i)
    else:
        total_keys += 1

def print_info():
    i = 0
    for n in info:
        print "Max keys at level %2s: %7s" % (i,n)
        i += 1
    print "Total number of dirs: %7s" % total_dirs
    print "Total number of keys: %7s" % total_keys
    fd = os.popen("du -csh %s" % archive,"r")
    s = fd.read()
    fd.close()
    s = s.split()[0]
    print "Disk footprint      : %7s" % s

if __name__ == '__main__':
    import sys
    from optparse import OptionParser
    parser = OptionParser(usage="usage: %prog ARCHIVE")
    (options, args) = parser.parse_args()
    if len(args) != 1:
        print >>sys.stderr, "expected exactly 1 positional argument"
        sys.exit(1)
    archive = args[0]
    collect_info(archive,0)
    print_info()

Re: chunking (Re: [ANNOUNCEMENT] /Arch/ embraces `git')

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Linus Torvalds wrote:
> What's the disk usage results? I'm on ext3, for example, which means that
> even small files invariably take up 4.125kB on disk (with the inode).
> Even uncompressed, most source files tend to be small. Compressed, I'm
> seeing the median blob size being ~1.6kB in my trivial checks. That's
> blobs only, btw.
I'm working on it.  The format was chosen so that blobs under 1 block long 
*stay* 1 block long; i.e. there's no 'chunk plus index file' overhead.
So the chunking should only kick in on multiple-block files.
I hacked 'convert-cache' to do the conversion and it's running out of
memory on linux-2.6.git, however --- I found a few memory leaks in your 
code =) but I certainly seem to be missing a big one still (maybe it's in 
my code!).

When I get this working properly, my plan is to do a number of runs over 
the linux-2.6 archive trying out various combinations of chunking 
parameters.  I *will* be watching both 'real' disk usage (bunged up to 
block boundaries) and 'ideal' disk usage (on a reiserfs-type system).
The goal is to improve both, but if I can improve 'ideal' usage 
significantly with a minimal penalty in 'real' usage then I would argue 
it's still worth doing, since that will improve network times.
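A sketch of the two metrics in present-day Python (the 4kB block size and
the sample sizes are illustrative assumptions):

```python
# 'real' usage: each file rounds up to whole 4kB blocks, and even a tiny
# file costs at least one block (ext2/ext3-style allocation).
# 'ideal' usage: exact byte sizes, as on a tail-packing filesystem.
BLOCK = 4096

def real_usage(sizes):
    return sum(max(1, -(-s // BLOCK)) * BLOCK for s in sizes)

def ideal_usage(sizes):
    return sum(sizes)

sizes = [1600, 300, 9000]            # e.g. a median compressed blob ~1.6kB
assert real_usage(sizes) == 20480    # 1 + 1 + 3 blocks of 4096 bytes
assert ideal_usage(sizes) == 10900
```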

The handshaking penalties you mention are significant, but that's why 
rsync uses a pipelined approach.  The 'upstream' part of your full-duplex 
pipe is 'free' while you've got bits clogging your 'downstream' 
pipe.  The wonders of full-duplex...

Anyway, "numbers talk, etc".  I'm working on them.
 --scott
LIONIZER LCPANES shortwave MKSEARCH ESGAIN Saddam Hussein Rijndael 
WASHTUB Morwenstow ZPSEMANTIC SKIMMER cryptographic FJHOPEFUL assassination
 ( http://cscott.net/ )


Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread Tom Lord



   From: [EMAIL PROTECTED]

Thank you for your experiment.  I'm not surprised by the 
result but it is very nice to know that my expectations
are right.

I think that to a large extent you are seeing artifacts
of the questionable trade-offs that (reports tell me) the
ext* filesystems make.   With a different filesystem, the 
results would be very different.

I'm imagining a blob database containing many revisions of the linux
kernel.  It will contain millions of blobs.

It's fine that some filesystems and some blob operations work fine
on a directory with millions of files but what about other operations
on the database?   I pity the poor program that has to `readdir' through
millions of files.

That said: I may add an optional flat-directory format to my library,
just to avoid issues such as those you raise over the next couple of
years.

-t



Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread Tomas Mraz
On Wed, 2005-04-20 at 19:15 +0200, [EMAIL PROTECTED] wrote:
...
> As data, I used my /usr/src/linux which uses 301M and contains 20753 files and
> 1389 directories.  To compute the key for a directory, I considered that its
> contents were a mapping from names to keys.
I suppose if you used the blob archive for storing many revisions, the
number of stored blobs would be much higher. However, even then we can
estimate that the maximum number of stored blobs will be in the order of
millions.

> When constructing the indexed archive, I actually stored empty files instead of
> blobs because I am only interested in overhead.
> 
> Using your suggested indexing method that uses [0:4] as the 1st level key and
 [0:3]
> [4:8] as the 2nd level key, I obtain an indexed archive that occupies 159M,
> where the top level contains 18665 1st level keys, the largest first level dir
> contains 5 entries, and all 2nd level dirs contain exactly 1 entry.
Yes, it really doesn't make much sense to have such big keys on the
directories. If we assume that SHA1 is a really good hash function, so
that the probability of any hash value is the same, this would allow
storing 2^16 * 2^16 * 2^16 blobs with approximately the same directory
usage.

> Using Linus' suggested 1 level [0:2] indexing, I obtain an indexed archive that
[0:1] I suppose
> occupies 1.8M, where the top level contains 256 1st level keys, and where the
> largest 1st level dir contains 110 entries.
The question is how many entries per directory is the optimal compromise
between space and the speed of access to its files.

If we suppose the maximum number of stored blobs is in the order of
millions, probably the optimal indexing would be 1 level [0:2] indexing
or 2 levels [0:1] [2:3]. However it would be necessary to do some
benchmarking first before setting this in stone.

-- 
Tomas Mraz <[EMAIL PROTECTED]>




chunking (Re: [ANNOUNCEMENT] /Arch/ embraces `git')

2005-04-20 Thread Linus Torvalds


On Wed, 20 Apr 2005, C. Scott Ananian wrote:
> 
> I'm hoping my 'chunking' patches will fix this.  This ought to reduce the 
> size of the object store by (in effect) doing delta compression; rsync
> will then Do The Right Thing and only transfer the needed deltas.
> Running some benchmarks right now to see how well it lives up to this 
> promise...

What's the disk usage results? I'm on ext3, for example, which means that
even small files invariably take up 4.125kB on disk (with the inode).

Even uncompressed, most source files tend to be small. Compressed, I'm 
seeing the median blob size being ~1.6kB in my trivial checks. That's 
blobs only, btw.

My point being that about 75% of all blobs already take up less than the
minimal amount of space that most filesystems can sanely allocate. And I'm
_not_ going to say "you have to use reiserfs" with git.

So the disk fragmentation really does matter. It doesn't help to make a 
file smaller than 4kB, it hurts - while that can be offset by sharing 
chunks, it might not be.

Also, while network performance is important, so is the handshaking on
which objects to get. Lots of small objects potentially need lots of
handshaking to figure out _which_ of the objects to do.

Linus


Re: [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Petr Baudis wrote:
> I think one thing git's objects database is not very well suited for are
> network transports. You want to have something smart doing the
> transports, comparing trees so that it can do some delta compression;
> that could probably reduce the amount of data needed to be sent
> significantly.
I'm hoping my 'chunking' patches will fix this.  This ought to reduce the 
size of the object store by (in effect) doing delta compression; rsync
will then Do The Right Thing and only transfer the needed deltas.
Running some benchmarks right now to see how well it lives up to this 
promise...
 --scott

terrorist AEROPLANE munitions PAPERCLIP MI5 Morwenstow WSHOOFS CABOUNCE 
colonel Yakima AES MI6 nuclear NSA Cocaine Columbia plastique LICOZY
 ( http://cscott.net/ )


Re: [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread Petr Baudis
Dear diary, on Wed, Apr 20, 2005 at 12:00:36PM CEST, I got a letter
where Tom Lord <[EMAIL PROTECTED]> told me that...
> From the /Arch/ perspective: `git' technology will form the
> basis of a new archive/revlib/cache format and the basis
> of new network transports.

I think one thing git's objects database is not very well suited for are
network transports. You want to have something smart doing the
transports, comparing trees so that it can do some delta compression;
that could probably reduce the amount of data needed to be sent
significantly.

> From the `git' perspective, /Arch/ will replace the lame "directory
> cache" component of `git' with a proper revision control system.

I'm not sure you have fully grasped git's philosophy yet. The
"directory cache" component is not by itself any revision control
system; it is merely a staging area for any revision system on top of
it (IOW: subordinate, not competitor).

> I started here:
> 
>http://www.seyza.com/=clients/linus/tree/index.html
> 
> and for those interested in `git'-theory, a good place to start is
> 
>http://www.seyza.com/=clients/linus/tree/src/liblob/index.html

These pages are surely very nice; unfortunately, I have to enjoy them
only from the "HTML source" view.  The HTML seems completely broken,
containing unterminated comments like "

Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread duchier
Hi Tom,

just as a datapoint, here is an experiment I carried out.  I wanted to evaluate
how much overhead is incurred by using several levels of directories to
implement a discriminating index.  I used the key format you specified:

SHA1,SIZE

As data, I used my /usr/src/linux which uses 301M and contains 20753 files and
1389 directories.  To compute the key for a directory, I considered that its
contents were a mapping from names to keys.

When constructing the indexed archive, I actually stored empty files instead of
blobs because I am only interested in overhead.

Using your suggested indexing method that uses [0:4] as the 1st level key and
[4:8] as the 2nd level key, I obtain an indexed archive that occupies 159M,
where the top level contains 18665 1st level keys, the largest first level dir
contains 5 entries, and all 2nd level dirs contain exactly 1 entry.

Using Linus' suggested 1 level [0:2] indexing, I obtain an indexed archive that
occupies 1.8M, where the top level contains 256 1st level keys, and where the
largest 1st level dir contains 110 entries.

This experiment was performed on an ext3 file system.

Cheers,

--Denys



Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread Miles Bader
Way to go.

-Miles
-- 
Do not taunt Happy Fun Ball.


[ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread Tom Lord

`git', by Linus Torvalds, contains some very good ideas and some
very entertaining source code -- recommended reading for hackers.

/GNU Arch/ will adopt `git':

From the /Arch/ perspective: `git' technology will form the
basis of a new archive/revlib/cache format and the basis
of new network transports.

From the `git' perspective, /Arch/ will replace the lame "directory
cache" component of `git' with a proper revision control system.

In my view, the core ideas in `git' are quite profound and deserve
an impeccable implementation.   This is practical because those ideas
are also pretty simple.

I started here:

   http://www.seyza.com/=clients/linus/tree/index.html

and for those interested in `git'-theory, a good place to start is

   http://www.seyza.com/=clients/linus/tree/src/liblob/index.html

(Linus is not literally a "client" of mine.  That's just the directory 
where this goes.)

-t

