Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Ralph Loader

Andries,


> > int hash_fn (char * p)
> > {
> >   int hash = 0;
> >   while (*p) {
> >  hash = hash + *p++;
> >  // Rotate a 31 bit field 7 bits:
> >  hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
> >   }
> >   return hash;
> > }

> Hmm. This one goes in the "catastrophic" category.
>
> For actual names:
> 
> N=557398, m=51 sz=2048, min 82, max 4002, av 272.17, stddev 45122.99
> 
> For generated names:
> 
> N=557398, m=51 sz=2048, min 0, max 44800, av 272.17, stddev 10208445.83
> 

Instead of masking the hash value down to 11 bits you could try:

index = (hash ^ (hash >> 11) ^ (hash >> 22)) & 0x7ff;

I ran a quick test which gave fairly good results with that: 12871
identifiers (from a source tree) gave a mean square bucket size of 45.65;
the expected value for a random function is 45.78.

That change might improve some of your other hashes as well - there
doesn't seem to be much point in computing a 32 bit value only to throw
21 bits away - stirring in the extra bits makes much more sense to me.

> > The rotate is equivalent to a multiplication by x**7 in Z_2[x]/(P),
> > where P is the polynomial x**31 - 1 (over Z_2).
> > Presumably the "best" P would be irreducible - but that would have more
> > bits set in the polynomial, making reduction harder.  A compromise is to
> > choose P in the form x**N - 1 but with relatively few factors.
> > X**31 - 1 is such a P.
> 
> It has seven irreducible factors. Hardly "almost irreducible".

I didn't say it was.  "almost irreducible" polynomials with Hamming
weight two are pretty rare...  Relative to say, x**32 - 1 or x**24 - 1,
having 7 factors is good.

Ralph.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Ralph Loader

Andries,

> Shifting the 7-bit ASCII characters over 7 bits makes sure that there
> is very little interaction to start with. And the final AND that truncates
> to the final size of the hash chain kills the high order bits.
> No good.

I didn't realise you were bit-masking down to the required size.

Yes, it would be pretty useless in that case.

Ralph.






Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Andreas Dilger

Ted writes:
> Note that in the long run, the fully comatible version should probably
> have a COMPAT feature flag set so that you're forced to use a new enough
> version of e2fsck.  Otherwise an old e2fsck may end up not noticing
> corruptions in an index block which might cause a new kernel to have
> serious heartburn.

Actually, having a COMPAT flag also helps in other ways:

1) Turning indexing on and off is not a mount option as it currently is
   (or automatically done) so it will quell Linus' fears about the principle
   of least surprise (i.e. not converting a filesystem without user action).
   A superblock COMPAT flag is more in keeping with other ext2 features.

2) Running a new e2fsck on a COMPAT_INDEX filesystem could create the
   index for existing "large" directories that don't have the BTREE/INDEX
   flag set, so the kernel only ever has to deal with incremental indexing
   after the first block.  The kernel would just do linear access on
   existing multi-block directories until e2fsck is run.

3) Clearing the COMPAT flag would make e2fsck remove the indexes, if the
   user so desires.  I think this would be the behaviour of existing
   e2fsck anyway.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Guest section DW

On Sat, Feb 24, 2001 at 10:43:16AM +1300, Ralph Loader wrote:

> A while ago I did some experimentation with simple bit-op based string
> hash functions.  I.e., no multiplications / divides in the hash loop.
> 
> The best I found was:
> 
> int hash_fn (char * p)
> {
>   int hash = 0;
>   while (*p) {
>  hash = hash + *p++;
>  // Rotate a 31 bit field 7 bits:
>  hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
>   }
>   return hash;
> }

Hmm. This one goes in the "catastrophic" category.

For actual names:

N=557398, m=51 sz=2048, min 82, max 4002, av 272.17, stddev 45122.99

For generated names:

N=557398, m=51 sz=2048, min 0, max 44800, av 272.17, stddev 10208445.83

A very non-uniform distribution.

> The rotate is equivalent to a multiplication by x**7 in Z_2[x]/(P),
> where P is the polynomial x**31 - 1 (over Z_2).
> Presumably the "best" P would be irreducible - but that would have more
> bits set in the polynomial, making reduction harder.  A compromise is to
> choose P in the form x**N - 1 but with relatively few factors.
> X**31 - 1 is such a P.

It has seven irreducible factors. Hardly "almost irreducible".

Shifting the 7-bit ASCII characters over 7 bits makes sure that there
is very little interaction to start with. And the final AND that truncates
to the final size of the hash chain kills the high order bits.
No good.

Andries





Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Ralph Loader

Hi,

A while ago I did some experimentation with simple bit-op based string
hash functions.  I.e., no multiplications / divides in the hash loop.

The best I found was:

int hash_fn (char * p)
{
  int hash = 0;
  while (*p) {
 hash = hash + *p++;
 // Rotate a 31 bit field 7 bits:
 hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
  }
  return hash;
}

[I haven't kept my test program / data set - if anyone compares the above
 to the other functions mentioned on the list, let me know.]

The 31 and 7 were determined experimentally.  But the 31 has a
theoretical explanation (which might even be valid):

The rotate is equivalent to a multiplication by x**7 in Z_2[x]/(P),
where P is the polynomial x**31 - 1 (over Z_2).
Presumably the "best" P would be irreducible - but that would have more
bits set in the polynomial, making reduction harder.  A compromise is to
choose P in the form x**N - 1 but with relatively few factors.
X**31 - 1 is such a P.

Also, a 32 bit rotate (modulo X**32 - 1, which is equal
to (X - 1) ** 32 over Z_2), came out pretty badly.

One thing that shouldn't be forgotten about hashing for hash tables
is that you have to reduce the hash value to the required range - doing
that well greatly reduces the difference between various hash functions.

Ralph.





Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread tytso

   From: Daniel Phillips <[EMAIL PROTECTED]>
   Date: Fri, 23 Feb 2001 00:04:02 +0100

   I resolve not to take a position on this subject, and I will carry
   forward both a 'squeaky clean' backward-compatible version that sets an
   INCOMPAT flag, and a 'slightly tarnished' but very clever version that
   is both forward and backward-compatible, along the lines suggested by
   Ted.  Both flavors have the desirable property that old versions of
   fsck with no knowledge of the new index structure can remove the indices
   automatically, with fsck -y.

Note that in the long run, the fully compatible version should probably
have a COMPAT feature flag set so that you're forced to use a new enough
version of e2fsck.  Otherwise an old e2fsck may end up not noticing
corruptions in an index block which might cause a new kernel to have
serious heartburn.

- Ted



Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Andreas Dilger

Jonathan Morton writes:
> Meanwhile, let's go back to Linus' comment on compatibility and so on.  He
> has a *very* good point, which I'll expand on slightly here:
> 
> Suppose some stone-age Linux user, running 2.0.35 or something equally old
> (which runs ext2), decides to finally bite the bullet and upgrade...
> ... 2.0.35's old ext2 code suddenly can't read the filesystem, which was
> modified by 2.6.1 before the boot process stalled.

One of the proposals on the table for the indexed directories is read AND
WRITE compatible with older kernels.  For 1.0 through 2.4 kernels the read
compatibility is very safe (it will generate some errors on reads because
of "sparse" directory blocks, but the code will continue on correctly,
<= 1.2 won't even generate an error).  For < 2.2 kernels deleting a file from
an indexed directory will work, but would leave the index out of date.
For < 2.2 kernels adding a new file to an indexed directory would always
overwrite the index, so it is also safe.

> I hope people understand this as well as I do - if a filesystem upgrade is
> desirable, let the user perform some *specific* action to upgrade it, when
> he has an otherwise-working setup *and* sufficient backups.  I for one do
> not want to be caught out like the hypothetical user I mentioned above.

I am on the side of maintaining compatibility.  There _may_ be a _small_
performance (more like capacity) impact on indexed directories for this
compatibility, but it is well worth it, IMHO.

> - Currently, it's a stable and universally-utilised filesystem which offers
> very good performance for most applications.  I'm specifically drawing
> attention to certain benchmarks which place ext2's disk-based performance
> ahead of many commercial UNIX' ram-based filesystem performance.

Totally agree.

> - There are specific problems with performance when reading and/or
> modifying large directories.  I haven't tested for this personally, but I
> have noticed slowness when using 'rm -rf' on a large *tree* of directories.

That is what the index change will address.  Actually, "rm -r" may not
be sped up very much, but "rm *" definitely would be ("rm -r" deletes
files in directory order, but "rm *" deletes each file individually in
alphabetical order).

> One of the current suggestions, if I've interpreted it correctly, is to
> introduce an extension to ext2 which essentially makes a "fast index" of a
> large directory, attaches it to the directory in some backwards-compatible
> manner, and uses it *in conjunction with* the regular directory structure.

Yes, this is essentially true.  The on-disk directory entries are exactly
the same.  The index itself (in the compatible layout) appears to simply
be empty directory blocks (at the cost of 8 bytes = 1 index entry per block).

> - How much overhead does the "fast index" introduce for modification of the
> directory?  Large directories are the most likely to have stuff added and
> deleted, and it doesn't help if during an "rm *" operation the saving on
> the search is negated by the overhead on the unlink.

The index will improve the performance for file add, stat, and delete.  For
all of these operations you need to find a directory entry (add needs to
check if a file of the same name already exists before a new file is added).

> - If the index gets out of sync with the directory, how is this detected
> and recovered from?  Assuming the index merely points to the correct
> position in the regular directory, some simple sanity checks will suffice
> for most cases (is this entry in the directory the same as I asked for?),
> and if the entry is not in the index then a standard search of the real
> directory can be done.  In either case, the index can be marked as invalid
> (and removed?) and rebuilt whenever necessary.

On an index-aware kernel, the index can obviously not get out of sync.
All 2.2+ kernels that don't understand indexing will clear the "has
index" flag if they modify that directory, and the index will disappear.
Since the index is "hidden" (in my proposal at least) inside a totally
normal directory block, it will simply be overwritten by new entries.
As I mentioned above, 1.x and 2.0 kernels will overwrite the index on
an add, but not on a delete, and will not clear the "has index" flag.
This means we need some extra magic at the start of the index to ensure
we have a valid index header.

> - At what threshold of directory size does the index come into play?
> (fairly obviously, the index is useless for tiny directories)

This remains to be seen.  Definitely not for directories 1 block in size
(which is 85%? of all directories).  It looks like directories with about
250-300 files or more are needed for indexing to be useful.  The good
news is that since indexing is optional, it can be tuned so that it only
ever improves directory performance.

> - What happens when an old kernel encounters a directory which has an index
> attached to it?  Does it look like a virtual file, 

Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Andries . Brouwer

From [EMAIL PROTECTED] Fri Feb 23 04:43:23 2001

> Now that you provide source for r5 and dx_hack_hash, let me feed my
> collections to them.
> r5: catastrophic
> dx_hack_hash: not bad, but the linear hash is better.

I never expected dx_hack_hash to be particularly good at anything, but
we might as well test the version without the mistake in it - I was
previously using < 0 to test the sign bit - on an unsigned variable :-/

unsigned dx_hack_hash (const char *name, int len)
{
u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
while (len--)
{
u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
if (hash & 0x80000000) hash -= 0x7fffffff;
hash1 = hash0;
hash0 = hash;
}
return hash0;
}


The correction gained me another 1% in the leaf block fullness measure. 
I will try your hash with the htree index code tomorrow.

Basically I find the same results as before.

Actual names (N=557398)
        dx_hack+if                      dx_hack-if                      best
Size    min     max     av      stddev  min     max     av      stddev
2048    217     326     272.17  273.45  220     330     272.17  280.43  +
4096    97      191     136.08  138.35  100     182     136.08  138.29  -
8192    40      105     68.04   68.57   36      102     68.04   68.06   -
16384   14      59      34.02   34.36   14      59      34.02   34.08   -
32768   3       37      17.01   17.24   4       36      17.01   17.09   -
65536   0       24      8.51    8.55    0       23      8.51    8.51    -
131072  0       18      4.25    4.24    0       16      4.25    4.26    +
262144  0       13      2.13    2.12    0       11      2.13    2.13    -

Generated names
2048    195     347     272.17  509.38  191     346     272.17  636.11  +
4096    71      218     136.08  645.73  56      224     136.08  995.79  +
8192    23      125     68.04   202.16  23      135     68.04   288.99  +
16384   10      69      34.02   67.47   8       72      34.02   89.29   +
32768   1       42      17.01   26.32   1       43      17.01   31.39   +
65536   0       28      8.51    10.92   0       26      8.51    12.24   +
131072  0       17      4.25    4.93    0       18      4.25    5.28    +
262144  0       12      2.13    2.32    0       13      2.13    2.40    +

In other words, the "broken" version wins on actual names, the "correct" version
on generated names with number tail. The differences are small.
(And of course the broken version is faster.)
As a comparison:

Actual names (N=557398)
        linear hash (m=11)              linear hash (m=51)              best of 4
Size    min     max     av      stddev  min     max     av      stddev
2048    221     324     272.17  254.02  218     325     272.17  265.46  lin-11
4096    96      176     136.08  132.53  92      174     136.08  130.94  lin-51
8192    38      102     68.04   68.26   41      97      68.04   64.98   lin-51
16384   14      57      34.02   33.97   15      60      34.02   32.29   lin-51
32768   3       37      17.01   17.50   4       41      17.01   16.46   lin-51
65536   0       24      8.51    8.70    0       24      8.51    8.31    lin-51
131072  0       17      4.25    4.39    0       16      4.25    4.22    lin-51
262144  0       12      2.13    2.20    0       12      2.13    2.12    lin-51

Generated names
2048    262     283     272.17  12.25   244     298     272.17  136.72  lin-11
4096    128     146     136.08  9.39    119     151     136.08  39.73   lin-11
8192    61      76      68.04   4.41    58      79      68.04   12.47   lin-11
16384   26      41      34.02   3.61    25      44      34.02   6.32    lin-11
32768   10      23      17.01   4.36    10      25      17.01   4.04    lin-51
65536   2       14      8.51    2.79    3       16      8.51    4.54    lin-11
131072  0       8       4.25    1.85    0       10      4.25    1.88    lin-11
262144  0       5       2.13    1.10    0       6       2.13    1.22    lin-11

So both linear hash versions are far superior (if the criterion is uniform
distribution) in the case of generated names, and lin-51 also beats dx_hack
in the case of actual names. Of course it also wins in speed.

Andries





Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Jonathan Morton

>Now that you provide source for r5 and dx_hack_hash, let me feed my
>collections to them.
>r5: catastrophic
>dx_hack_hash: not bad, but the linear hash is better.



So, not only does the linear hash normally provide a shorter worst-case
chain, its results are actually more consistent than the other two.  Looks
like simple is good here, but is it still possible to produce
"pathological" sets for the linear hash to see how badly it falls down?
I'm no mathematician, so I'll leave that to the gurus...

Meanwhile, let's go back to Linus' comment on compatibility and so on.  He
has a *very* good point, which I'll expand on slightly here:

Suppose some stone-age Linux user, running 2.0.35 or something equally old
(which runs ext2), decides to finally bite the bullet and upgrade to the
all-new 2.6.1 (OK, this is some time in the future).  2.6.1 implements some
"enhanced" version of ext2 which makes some incompatible modifications to
the directory structure.  However, since the process of upgrading through
such a massive range of kernels also involves upgrading most other software
to boot, this user forgot one or two pieces, and reboots to 2.0.35 to
regain a sufficiently working system that he can build the updated software
- or not, because 2.0.35's old ext2 code suddenly can't read the
filesystem, which was modified by 2.6.1 before the boot process stalled.
e2fsck is no help here either, because he now has an unbootable system with
old software that doesn't understand the new stuff.

I hope people understand this as well as I do - if a filesystem upgrade is
desirable, let the user perform some *specific* action to upgrade it, when
he has an otherwise-working setup *and* sufficient backups.  I for one do
not want to be caught out like the hypothetical user I mentioned above.

OTOH, I have my own opinions on the direction of ext2:

- Currently, it's a stable and universally-utilised filesystem which offers
very good performance for most applications.  I'm specifically drawing
attention to certain benchmarks which place ext2's disk-based performance
ahead of many commercial UNIX' ram-based filesystem performance.

- There are specific problems with performance when reading and/or
modifying large directories.  I haven't tested for this personally, but I
have noticed slowness when using 'rm -rf' on a large *tree* of directories.
The problem appeared to be one of disk access, but may be a side-effect of
poor storage distribution (I haven't examined the ext2 code).  Related to
this, rebuilding the slocate database on all my systems appears to be
disk-bound rather than CPU-bound, and takes too long for my liking.

One of the current suggestions, if I've interpreted it correctly, is to
introduce an extension to ext2 which essentially makes a "fast index" of a
large directory, attaches it to the directory in some backwards-compatible
manner, and uses it *in conjunction with* the regular directory structure.
This is probably a good idea, but it needs some thought:

- How much overhead does the "fast index" introduce for modification of the
directory?  Large directories are the most likely to have stuff added and
deleted, and it doesn't help if during an "rm *" operation the saving on
the search is negated by the overhead on the unlink.

- If the index gets out of sync with the directory, how is this detected
and recovered from?  Assuming the index merely points to the correct
position in the regular directory, some simple sanity checks will suffice
for most cases (is this entry in the directory the same as I asked for?),
and if the entry is not in the index then a standard search of the real
directory can be done.  In either case, the index can be marked as invalid
(and removed?) and rebuilt whenever necessary.

- At what threshold of directory size does the index come into play?
(fairly obviously, the index is useless for tiny directories)

- What happens when an old kernel encounters a directory which has an index
attached to it?  Does it look like a virtual file, which has no special
properties but whose name is reserved for system use?  (cf. lost+found)  Or
is it some unidentifiable bits in the directory structure and a few lost
clusters on the disk?  If the latter, it will have to be done in a way that
older versions of e2fsck can clean it up easily and older versions of ext2
won't throw up on it, which could be kinda hard.  If the former, an
'unused' name will have to be found to avoid conflicts but the big
advantage is *no* inconsistency with old systems.

Answers on a postcard...

--
from: Jonathan "Chromatix" Morton
mail: [EMAIL PROTECTED]  (not for attachments)
big-mail: [EMAIL PROTECTED]
uni-mail: [EMAIL PROTECTED]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/


Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Jonathan Morton

Now that you provide source for r5 and dx_hack_hash, let me feed my
collections to them.
r5: catastrophic
dx_hack_hash: not bad, but the linear hash is better.

snip verbose results

So, not only does the linear hash normally provide a shorter worst-case
chain, its' results are actually more consistent than the other two.  Looks
like simple is good here, but is it still possible to produce
"pathological" sets for the linear hash to see how badly it falls down?
I'm no mathematician, so I'll leave that to the gurus...

Meanwhile, let's go back to Linus' comment on compatibility and so on.  He
has a *very* good point, which I'll expand on slightly here:

Suppose some stone-age Linux user, running 2.0.35 or something equally old
(which runs ext2), decides to finally bite the bullet and upgrade to the
all-new 2.6.1 (OK, this is some time in the future).  2.6.1 implements some
"enhanced" version of ext2 which makes some incompatible modifications to
the directory structure.  However, since the process of upgrading through
such a massive range of kernels also involves upgrading most other software
to boot, this user forgot one or two pieces, and reboots to 2.0.35 to
regain a sufficiently working system that he can build the updated software
- or not, because 2.0.35's old ext2 code suddenly can't read the
filesystem, which was modified by 2.6.1 before the boot process stalled.
e2fsck is no help here either, because he now has an unbootable system with
old software that doesn't understand the new stuff.

I hope people understand this as well as I do - if a filesystem upgrade is
desirable, let the user perform some *specific* action to upgrade it, when
he has an otherwise-working setup *and* sufficient backups.  I for one do
not want to be caught out like the hypothetical user I mentioned above.

OTOH, I have my own opinions on the direction of ext2:

- Currently, it's a stable and universally-utilised filesystem which offers
very good performance for most applications.  I'm specifically drawing
attention to certain benchmarks which place ext2's disk-based performance
ahead of many commercial UNIX' ram-based filesystem performance.

- There are specific problems with performance when reading and/or
modifying large directories.  I haven't tested for this personally, but I
have noticed slowness when using 'rm -rf' on a large *tree* of directories.
The problem appeared to be one of disk access, but may be a side-effect of
poor storage distribution (I haven't examined the ext2 code).  Related to
this, rebuilding the slocate database on all my systems appears to be
disk-bound rather than CPU-bound, and takes too long for my liking.

One of the current suggestions, if I've interpreted it correctly, is to
introduce an extension to ext2 which essentially makes a "fast index" of a
large directory, attaches it to the directory in some backwards-compatible
manner, and uses it *in conjunction with* the regular directory structure.
This is probably a good idea, but it needs some thought:

- How much overhead does the "fast index" introduce for modification of the
directory?  Large directories are the most likely to have stuff added and
deleted, and it doesn't help if during an "rm *" operation the saving on
the search is negated by the overhead on the unlink.

- If the index gets out of sync with the directory, how is this detected
and recovered from?  Assuming the index merely points to the correct
position in the regular directory, some simple sanity checks will suffice
for most cases (is this entry in the directory the same as I asked for?),
and if the entry is not in the index then a standard search of the real
directory can be done.  In either case, the index can be marked as invalid
(and removed?) and rebuilt whenever necessary.

- At what threshold of directory size does the index come into play?
(fairly obviously, the index is useless for tiny directories)

- What happens when an old kernel encounters a directory which has an index
attached to it?  Does it look like a virtual file, which has no special
properties but whose name is reserved for system use?  (cf. lost+found)  Or
is it some inidentifiable bits in the directory structure and a few lost
clusters on the disk?  If the latter, it will have to be done in a way that
older versions of e2fsck can clean it up easily and older versions of ext2
won't throw up on it, which could be kinda hard.  If the former, an
'unused' name will have to be found to avoid conflicts but the big
advantage is *no* inconsistency with old systems.

Answers on a postcard...

--
from: Jonathan "Chromatix" Morton
mail: [EMAIL PROTECTED]  (not for attachments)
big-mail: [EMAIL PROTECTED]
uni-mail: [EMAIL PROTECTED]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-BEGIN GEEK CODE BLOCK-
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ 

Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Andries . Brouwer

From [EMAIL PROTECTED] Fri Feb 23 04:43:23 2001

 Now that you provide source for r5 and dx_hack_hash, let me feed my
 collections to them.
 r5: catastrophic
 dx_hack_hash: not bad, but the linear hash is better.

I never expected dx_hack_hash to be particularly good at anything, but
we might as well test the version without the mistake in it - I was
previously using  0 to test the sign bit - on an unsigned variable :-/

unsigned dx_hack_hash (const char *name, int len)
{
u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
while (len--)
{
u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
if (hash  0x8000) hash -= 0x7fff;
hash1 = hash0;
hash0 = hash;
}
return hash0;
}


The correction gained me another 1% in the leaf block fullness measure. 
I will try your hash with the htree index code tomorrow.

Basically I find the same results as before.

Actual names (N=557398)
dx_hack+if  dx_hack-if  best
Sizemin max av  stddev  min max av  stddev
2048217 326 272.17  273.45  220 330 272.17  280.43  +
409697  191 136.08  138.35  100 182 136.08  138.29  -
819240  105 68.04   68.57   36  102 68.04   68.06   -
16384   14  59  34.02   34.36   14  59  34.02   34.08   -
32768   3   37  17.01   17.24   4   36  17.01   17.09   -
65536   0   24  8.518.550   23  8.518.51-
131072  0   18  4.254.240   16  4.254.26+
262144  0   13  2.132.120   11  2.132.13-

Generated names
2048195 347 272.17  509.38  191 346 272.17  636.11  +
409671  218 136.08  645.73  56  224 136.08  995.79  +
819223  125 68.04   202.16  23  135 68.04   288.99  +
16384   10  69  34.02   67.47   8   72  34.02   89.29   +
32768   1   42  17.01   26.32   1   43  17.01   31.39   +
65536   0   28  8.5110.92   0   26  8.5112.24   +
131072  0   17  4.254.930   18  4.255.28+
262144  0   12  2.132.320   13  2.132.40+

In other words, the "broken" version wins on actual names, the "correct" version
on generated names with number tail. The differences are small.
(And of course the broken version is faster.)
As a comparison:

Actual names (N=557398)
linear hash (m=11)  linear hash (m=51)  best of 4
Sizemin max av  stddev  min max av  stddev
2048221 324 272.17  254.02  218 325 272.17  265.46  lin-11
409696  176 136.08  132.53  92  174 136.08  130.94  lin-51
819238  102 68.04   68.26   41  97  68.04   64.98   lin-51
16384   14  57  34.02   33.97   15  60  34.02   32.29   lin-51
32768   3   37  17.01   17.50   4   41  17.01   16.46   lin-51
65536   0   24  8.518.700   24  8.518.31lin-51
131072  0   17  4.254.390   16  4.254.22lin-51
262144  0   12  2.132.200   12  2.132.12lin-51

Generated names
2048262 283 272.17  12.25   244 298 272.17  136.72  lin-11
4096128 146 136.08  9.39119 151 136.08  39.73   lin-11
819261  76  68.04   4.4158  79  68.04   12.47   lin-11
16384   26  41  34.02   3.6125  44  34.02   6.32lin-11
32768   10  23  17.01   4.3610  25  17.01   4.04lin-51
65536   2   14  8.512.793   16  8.514.54lin-11
131072  0   8   4.251.850   10  4.251.88lin-11
262144  0   5   2.131.100   6   2.131.22lin-11

So both linear hash versions are far superior (if the criterion is uniform
distribution) in the case of generated names, and lin-51 also beats dx_hack
in the case of actual names. Of course it also wins in speed.

Andries


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Andreas Dilger

Jonathan Morton writes:
 Meanwhile, let's go back to Linus' comment on compatibility and so on.  He
 has a *very* good point, which I'll expand on slightly here:
 
 Suppose some stone-age Linux user, running 2.0.35 or something equally old
 (which runs ext2), decides to finally bite the bullet and upgrade...
 ... 2.0.35's old ext2 code suddenly can't read the filesystem, which was
 modified by 2.6.1 before the boot process stalled.

One of the proposals on the table for the indexed directories is read AND
WRITE compatible with older kernels.  For 1.0 through 2.4 kernels the read
compatibility is very safe (it will generate some errors on reads because
of "sparse" directory blocks, but the code will continue on correctly;
>= 1.2 won't even generate an error).  For < 2.2 kernels deleting a file from
an indexed directory will work, but would leave the index out of date.
For < 2.2 kernels adding a new file to an indexed directory would always
overwrite the index, so it is also safe.

 I hope people understand this as well as I do - if a filesystem upgrade is
 desirable, let the user perform some *specific* action to upgrade it, when
 he has an otherwise-working setup *and* sufficient backups.  I for one do
 not want to be caught out like the hypothetical user I mentioned above.

I am on the side of maintaining compatibility.  There _may_ be a _small_
performance (more like capacity) impact on indexed directories for this
compatibility, but it is well worth it, IMHO.

 - Currently, it's a stable and universally-utilised filesystem which offers
 very good performance for most applications.  I'm specifically drawing
 attention to certain benchmarks which place ext2's disk-based performance
 ahead of many commercial UNIX' ram-based filesystem performance.

Totally agree.

 - There are specific problems with performance when reading and/or
 modifying large directories.  I haven't tested for this personally, but I
 have noticed slowness when using 'rm -rf' on a large *tree* of directories.

That is what the index change will address.  Actually, "rm -r" may not
be sped up very much, but "rm *" definitely would be ("rm -r" deletes
files in directory order, but "rm *" deletes each file individually in
alphabetical order).

 One of the current suggestions, if I've interpreted it correctly, is to
 introduce an extension to ext2 which essentially makes a "fast index" of a
 large directory, attaches it to the directory in some backwards-compatible
 manner, and uses it *in conjunction with* the regular directory structure.

Yes, this is essentially true.  The on-disk directory entries are exactly
the same.  The index itself (in the compatible layout) appears to simply
be empty directory blocks (at the cost of 8 bytes = 1 index entry per block).

 - How much overhead does the "fast index" introduce for modification of the
 directory?  Large directories are the most likely to have stuff added and
 deleted, and it doesn't help if during an "rm *" operation the saving on
 the search is negated by the overhead on the unlink.

The index will improve the performance for file add, stat, and delete.  For
all of these operations you need to find a directory entry (add needs to
check if a file of the same name already exists before a new file is added).

 - If the index gets out of sync with the directory, how is this detected
 and recovered from?  Assuming the index merely points to the correct
 position in the regular directory, some simple sanity checks will suffice
 for most cases (is this entry in the directory the same as I asked for?),
 and if the entry is not in the index then a standard search of the real
 directory can be done.  In either case, the index can be marked as invalid
 (and removed?) and rebuilt whenever necessary.

On an index-aware kernel, the index can obviously not get out of sync.
All 2.2+ kernels that don't understand indexing will clear the "has
index" flag if they modify that directory, and the index will disappear.
Since the index is "hidden" (in my proposal at least) inside a totally
normal directory block, it will simply be overwritten by new entries.
As I mentioned above, 1.x and 2.0 kernels will overwrite the index on
an add, but not on a delete, and will not clear the "has index" flag.
This means we need some extra magic at the start of the index to ensure
we have a valid index header.

 - At what threshold of directory size does the index come into play?
 (fairly obviously, the index is useless for tiny directories)

This remains to be seen.  Definitely not for directories 1 block in size
(which is 85%? of all directories).  It looks like directories with about
250-300 files or more are needed for indexing to be useful.  The good
news is that since indexing is optional, it can be tuned so that it is
only used where it actually improves performance.

 - What happens when an old kernel encounters a directory which has an index
 attached to it?  Does it look like a virtual file, which has no special
 properties but 

Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread tytso

   From: Daniel Phillips [EMAIL PROTECTED]
   Date: Fri, 23 Feb 2001 00:04:02 +0100

   I resolve not to take a position on this subject, and I will carry
   forward both a 'squeaky clean' backward-compatible version that sets an
   INCOMPAT flag, and a 'slightly tarnished' but very clever version that
   is both forward and backward-compatible, along the lines suggested by
   Ted.  Both flavors have the desirable property that old versions of
   fsck with no knowledge of the new index structure can remove the indices
   automatically, with fsck -y.

Note that in the long run, the fully compatible version should probably
have a COMPAT feature flag set so that you're forced to use a new enough
version of e2fsck.  Otherwise an old e2fsck may end up not noticing
corruptions in an index block which might cause a new kernel to have
serious heartburn.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Ralph Loader

Hi,

A while ago I did some experimentation with simple bit-op based string
hash functions.  I.e., no multiplications / divides in the hash loop.

The best I found was:

int hash_fn (char * p)
{
  int hash = 0;
  while (*p) {
     hash = hash + *p++;
     // Rotate a 31 bit field 7 bits:
     hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
  }
  return hash;
}

[I haven't kept my test program / data set - if anyone compares the
 above to the other functions mentioned on the list, let me know.]

The 31 and 7 were determined experimentally.  But the 31 has a
theoretical explanation (which might even be valid):

The rotate is equivalent to a multiplication by x**7 in Z_2[P=0],
where P is the polynomial x**31 - 1 (over Z_2).
Presumably the "best" P would be irreducible - but that would have more
bits set in the polynomial, making reduction harder.  A compromise is to
choose P in the form x**N - 1 but with relatively few factors.
X**31 - 1 is such a P.

Also, a 32 bit rotate (modulo X**32 - 1, which is equal
to (X - 1) ** 32 over Z_2), came out pretty badly.

One thing that shouldn't be forgotten about hashing for hash tables
is that you have to reduce the hash value to the required range - doing
that well greatly reduces the difference between various hash functions.

Ralph.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Guest section DW

On Sat, Feb 24, 2001 at 10:43:16AM +1300, Ralph Loader wrote:

 A while ago I did some experimentation with simple bit-op based string
 hash functions.  I.e., no multiplications / divides in the hash loop.
 
 The best I found was:
 
 int hash_fn (char * p)
 {
   int hash = 0;
   while (*p) {
      hash = hash + *p++;
      // Rotate a 31 bit field 7 bits:
      hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
   }
   return hash;
 }

Hmm. This one goes in the "catastrophic" category.

For actual names:

N=557398, m=51 sz=2048, min 82, max 4002, av 272.17, stddev 45122.99

For generated names:

N=557398, m=51 sz=2048, min 0, max 44800, av 272.17, stddev 10208445.83

A very non-uniform distribution.

 The rotate is equivalent to a multiplication by x**7 in Z_2[P=0],
 where P is the polynomial x**31 - 1 (over Z_2).
 Presumably the "best" P would be irreducible - but that would have more
 bits set in the polynomial, making reduction harder.  A compromise is to
 choose P in the form x**N - 1 but with relatively few factors.
 X**31 - 1 is such a P.

It has seven irreducible factors. Hardly "almost irreducible".

Shifting the 7-bit ASCII characters over 7 bits makes sure that there
is very little interaction to start with. And the final AND that truncates
to the final size of the hash chain kills the high order bits.
No good.

Andries


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Andreas Dilger

Ted writes:
 Note that in the long run, the fully compatible version should probably
 have a COMPAT feature flag set so that you're forced to use a new enough
 version of e2fsck.  Otherwise an old e2fsck may end up not noticing
 corruptions in an index block which might cause a new kernel to have
 serious heartburn.

Actually, having a COMPAT flag also helps in other ways:

1) Turning indexing on and off is not a mount option as it currently is
   (or automatically done), so it will quell Linus' fears about the principle
   of least surprise (i.e. not converting a filesystem without user action).
   A superblock COMPAT flag is more in keeping with other ext2 features.

2) Running a new e2fsck on a COMPAT_INDEX filesystem could create the
   index for existing "large" directories that don't have the BTREE/INDEX
   flag set, so the kernel only ever has to deal with incremental indexing
   after the first block.  The kernel would just do linear access on
   existing multi-block directories until e2fsck is run.

3) Clearing the COMPAT flag would make e2fsck remove the indexes, if the
   user so desires.  I think this would be the behaviour of existing
   e2fsck anyways.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Ralph Loader

Andries,

 is very little interaction to start with. And the final AND that
 truncates to the final size of the hash chain kills the high order bits.
 No good.

I didn't realise you were bit-masking down to the required size.

Yes, it would be pretty useless in that case.

Ralph.


 
 Andries
 
 
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread Ralph Loader

Andries,


  int hash_fn (char * p)
  {
    int hash = 0;
    while (*p) {
       hash = hash + *p++;
       // Rotate a 31 bit field 7 bits:
       hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
    }
    return hash;
  }

 

 Hmm. This one goes in the "catastrophic" category.

 
 For actual names:
 
 N=557398, m=51 sz=2048, min 82, max 4002, av 272.17, stddev 45122.99
 
 For generated names:
 
 N=557398, m=51 sz=2048, min 0, max 44800, av 272.17, stddev 10208445.83
 

Instead of masking the hash value down to 11 bits you could try:

index = (hash ^ (hash >> 11) ^ (hash >> 22)) & 0x7ff;

I ran a quick test which gave fairly good results with that: 12871
identifiers (from a source tree) gave a mean square bucket size of 45.65;
the expected value for a random function is 45.78.

That change might improve some of your other hashes as well - there
doesn't seem to be much point in computing a 32 bit value only to throw
21 bits away - stirring in the extra bits makes much more sense to me.

  The rotate is equivalent to a multiplication by x**7 in Z_2[P=0],

  where P is the polynomial x**31 - 1 (over Z_2).
  Presumably the "best" P would be irreducible - but that would have more
  bits set in the polynomial, making reduction harder.  A compromise is to
  choose P in the form x**N - 1 but with relatively few factors.
  X**31 - 1 is such a P.
 
 It has seven irreducible factors. Hardly "almost irreducible".

I didn't say it was.  "almost irreducible" polynomials with Hamming
weight two are pretty rare...  Relative to say, x**32 - 1 or x**24 - 1,
having 7 factors is good.

Ralph.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Daniel Phillips

[EMAIL PROTECTED] wrote:
> 
> Just looking at R5 I knew it wasn't going to do well in this application
> because it's similar to a number of hash functions I tried with the same
> idea in mind: to place similar names together in the same leaf block.
> That turned out to be not very important compared to achieving a
> relatively high fullness of leaf blocks.  The problem with R5 when used
> with my htree is, it doesn't give very uniform dispersal.
> 
> The bottom line: dx_hack_hash is still the reigning champion.
> 
> Now that you provide source for r5 and dx_hack_hash, let me feed my
> collections to them.
> r5: catastrophic
> dx_hack_hash: not bad, but the linear hash is better.

I never expected dx_hack_hash to be particularly good at anything, but
we might as well test the version without the mistake in it - I was
previously using < 0 to test the sign bit - on an unsigned variable :-/

unsigned dx_hack_hash (const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash & 0x80000000) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}


The correction gained me another 1% in the leaf block fullness measure. 
I will try your hash with the htree index code tomorrow.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Andries . Brouwer


Just looking at R5 I knew it wasn't going to do well in this application
because it's similar to a number of hash functions I tried with the same
idea in mind: to place similar names together in the same leaf block. 
That turned out to be not very important compared to achieving a
relatively high fullness of leaf blocks.  The problem with R5 when used
with my htree is, it doesn't give very uniform dispersal.

The bottom line: dx_hack_hash is still the reigning champion.

Now that you provide source for r5 and dx_hack_hash, let me feed my
collections to them.
r5: catastrophic
dx_hack_hash: not bad, but the linear hash is better.

E.g.:
Actual file names:

Linear hash, m=11, sz=2048, min 262, max 283,  av 272.17, stddev 12.25
dx_hack_hash:  sz=2048, min 220, max 330,  av 272.17, stddev 280.43
r5:sz=2048, min 205, max 382,  av 272.17, stddev 805.18

Linear hash, m=11, sz=65536, min 0, max 24, av 8.51, stddev 8.70
dx_hack_hash:  sz=65536, min 0, max 23, av 8.51, stddev 8.51
r5:sz=65536, min 0, max 26, av 8.51, stddev 8.89

Generated consecutive names:

Linear hash, m=11, sz=2048, min 262, max 283,  av 272.17, stddev 12.25
dx_hack_hash:  sz=2048, min 191, max 346,  av 272.17, stddev 636.11
r5:sz=2048, min 0,   max 3587, av 272.17, stddev 755222.91

Linear hash, m=11, sz=65536, min 2, max  14, av 8.51, stddev 2.79
dx_hack_hash:  sz=65536, min 0, max  26, av 8.51, stddev 12.24
r5:sz=65536, min 0, max 120, av 8.51, stddev 738.08

Andries
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Andries . Brouwer

>> idea of using crc32 as the default hash function

> I once liked those things, too - but I've learned better since.

> A universal class of hashing functions is a class with the property that
> given any input, the average performance of all the functions is good.
> For example, h(k) = (a * k + b) mod m with integers a != 0 and b is
> a universal class of hash functions.

Here a != 0 should be (a,m) = 1.

> for(i=0; *s; s++) i = 131*i + *s;

Yes, that is a good function.

(Also because 131 has only 3 bits, so that there is a cheap shift and add
implementation.)

I did some random tests, on the one hand on a collection of 557398 file
names (last part of path names) in a file system here.
On the other hand on an artificially generated sequence of file names
with increasing number tails: foo0001, foo0002, ...

On the first collection the choice of multiplier didn't really matter
provided that it was odd and not too close to a power of two.
The smallest number with good behaviour was 11, the winner was 51.

(51 has 4 bits, but is not more expensive because they are evenly spaced:
/* 51x = 17*3*x */
x += (x << 1);
x += (x << 4);
)

On the second collection there were large differences between multipliers.
The clear winner was 11.

Some numbers:

Hash 557398 actual names, using
hash(unsigned char *s) {
unsigned int h = 0;

while(*s)
h = m*h + *s++;
return h % sz;
}
for various values of m and powers of two sz (so that the % is an AND).
Report min, max, average length of hash chain, and standard deviation.
Of course min and max should be close to average and stddev should be small.

m= 11 sz=2048, min 221, max 324, av 272.17, stddev 254.12
m= 13 sz=2048, min 219, max 322, av 272.17, stddev 259.96
m= 51 sz=2048, min 218, max 325, av 272.17, stddev 265.52
m=131 sz=2048, min 222, max 344, av 272.17, stddev 264.20

m= 11 sz=4096, min  96, max 176, av 136.08, stddev 132.58
m= 13 sz=4096, min 101, max 177, av 136.08, stddev 128.71
m= 51 sz=4096, min  92, max 174, av 136.08, stddev 130.89
m=131 sz=4096, min  85, max 180, av 136.08, stddev 131.99

m= 11 sz=8192, min 38, max 102, av 68.04, stddev 68.26
m= 13 sz=8192, min 42, max 100, av 68.04, stddev 66.30
m= 51 sz=8192, min 41, max  97, av 68.04, stddev 64.98
m=131 sz=8192, min 39, max 102, av 68.04, stddev 66.19

m= 11 sz=16384, min 14, max 57, av 34.02, stddev 33.96
m= 13 sz=16384, min 14, max 58, av 34.02, stddev 33.51
m= 51 sz=16384, min 15, max 60, av 34.02, stddev 32.29
m=131 sz=16384, min 16, max 59, av 34.02, stddev 33.94

m= 11 sz=32768, min 3, max 37, av 17.01, stddev 17.50
m= 13 sz=32768, min 3, max 34, av 17.01, stddev 16.84
m= 51 sz=32768, min 4, max 41, av 17.01, stddev 16.46
m=131 sz=32768, min 3, max 40, av 17.01, stddev 16.90

m= 11 sz=65536, min 0, max 24, av 8.51, stddev 8.70
m= 13 sz=65536, min 0, max 23, av 8.51, stddev 8.56
m= 51 sz=65536, min 0, max 24, av 8.51, stddev 8.31
m=131 sz=65536, min 0, max 23, av 8.51, stddev 8.51

m= 11 sz=131072, min 0, max 17, av 4.25, stddev 4.39
m= 13 sz=131072, min 0, max 16, av 4.25, stddev 4.32
m= 51 sz=131072, min 0, max 16, av 4.25, stddev 4.22
m=131 sz=131072, min 0, max 16, av 4.25, stddev 4.24

m= 11 sz=262144, min 0, max 12, av 2.13, stddev 2.20
m= 13 sz=262144, min 0, max 12, av 2.13, stddev 2.18
m= 51 sz=262144, min 0, max 12, av 2.13, stddev 2.12
m=131 sz=262144, min 0, max 12, av 2.13, stddev 2.12

On the second, nonrandom, collection there are more variations:

m= 11 sz=8192, min 61, max 76, av 68.04, stddev  4.41
m= 13 sz=8192, min 55, max 83, av 68.04, stddev 18.64
m= 51 sz=8192, min 58, max 79, av 68.04, stddev 12.47
m=131 sz=8192, min 52, max 83, av 68.04, stddev 29.05

m= 11 sz=16384, min 26, max 41, av 34.02, stddev  3.61
m= 13 sz=16384, min 24, max 45, av 34.02, stddev  8.76
m= 51 sz=16384, min 25, max 44, av 34.02, stddev  6.32
m=131 sz=16384, min 23, max 47, av 34.02, stddev 14.00

m= 11 sz=32768, min 10, max 23, av 17.01, stddev 4.36
m= 13 sz=32768, min  7, max 28, av 17.01, stddev 8.66
m= 51 sz=32768, min 10, max 25, av 17.01, stddev 4.04
m=131 sz=32768, min  6, max 27, av 17.01, stddev 8.66

Andries
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Felix von Leitner

Thus spake Alan Cox ([EMAIL PROTECTED]):
> > > There will be a lot fewer metadata index
> > > blocks in your directory file, for one thing.
> > Oh yes, another thing: a B-tree directory structure does not need
> > metadata index blocks.
> Before people get excited about complex tree directory indexes, remember to 
> solve the other 95% before implementation - recovering from lost blocks,
> corruption and the like

And don't forget the trouble with NFS handles after the tree was rebalanced.

Trees are nice only theoretically.  In practice, the benefits are
outweighed by the nastiness in form of fsck and NFS and bigger code
(normally: more complex -> less reliable).

Felix
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread tytso

   From: Andreas Dilger <[EMAIL PROTECTED]>
   Date: Thu, 22 Feb 2001 11:16:32 -0700 (MST)

   One important question as to the disk format is whether the "." and ".."
   interception by VFS is a new phenomenon in 2.4 or if this also happened
   in 2.2?  If so, then having these entries on disk will be important
   for 2.2 compatibility, and you don't want to have different on-disk formats
   between 2.2 and 2.4.

Well, you need to have the '.' and '..' there for compatibility if you
go for the full backwards compatibility.  That's clear.

If you don't care about backwards compatibility, it's important that
there be a way to find the parent directory, but there doesn't have to
be explicit '.' and '..'  entries.

So if Daniel is going to try implementing it both ways then that's one
place where the #ifdef's might get a bit more complicated.  After it's
done, we should do some benchmarks comparing it both ways; if the
difference is negligible, I'd argue for simply always providing
backwards compatibility.  One of the key advantages of ext2/ext3 is its
backwards compatibility, and so if it's not too costly to preserve it
(as I suspect will be the case), we should try to do so.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Daniel Phillips

Andreas Dilger wrote:
> Daniel writes:
> > All references to "." and ".." are now intercepted and never reach the
> > filesystem level.
> 
> Ted writes:
> >From: Daniel Phillips <[EMAIL PROTECTED]>
> >
> >I'll leave that up to somebody else - we now have two alternatives, the
> >100%, no-compromise INCOMPAT solution, and the slightly-bruised but
> >still largely intact forward compatible solution.  I'll maintain both
> >solutions for now so it's just as easy to choose either in the end.
> >
> > Well, the $64,000 question is exactly how much performance does it cost?
> > My guess is that it will be barely measurable, but only benchmarks will
> > answer that question.
> 
> One important question as to the disk format is whether the "." and ".."
> interception by VFS is a new phenomenon in 2.4 or if this also happened
> in 2.2?  If so, then having these entries on disk will be important
> for 2.2 compatibility, and you don't want to have different on-disk formats
> between 2.2 and 2.4.

The answer is 'yes', it's been in since at least the beginning of 2.2:

 
http://innominate.org/cgi-bin/lksr/linux/fs/namei.c?rev=1.1=text/x-cvsweb-markup=v2.2

Search for '.'.

By the way, our whole linux cvsweb tree is here:

http://lksr.org/ 

with all versions of linux back to linux-0.97.pl5, with a makefile that
starts out with:

#
# Makefile for linux.
# If you don't have '-mstring-insns' in your gcc (and nobody but me has :-)
# remove them from the CFLAGS defines.
#

Getting back on topic, this makes the idea of getting rid of the actual
on-disk "." and ".." entries a little less scary, though I am keeping in
mind the fact that having those entries on disk could in some extreme
circumstance help fsck recover a corrupted directory tree a little
better and more automatically.

I resolve not to take a position on this subject, and I will carry
forward both a 'squeaky clean' backward-compatible version that sets an
INCOMPAT flag, and a 'slightly tarnished' but very clever version that
is both forward and backward-compatible, along the lines suggested by
Ted.  Both flavors have the desirable property that old versions of
fsck with no knowledge of the new index structure can remove the indices
automatically, with fsck -y.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Daniel Phillips

Linus Torvalds wrote:
> 
> On Thu, 22 Feb 2001, Daniel Phillips wrote:
> >
> > In the first heat of hash races - creating 20,000 files in one directory
> > - dentry::hash lost out to my original hack::dx_hash, causing a high
> > percentage of leaf blocks to remain exactly half full and slowing down
> > the whole thing by about 5%.  (This was under uml - I haven't tried it
> > native yet but I expect the results to be similar.)
> >
> > Contender         Result
> > =========         ======
> >   dentry::hash    Average fullness = 2352 (57%)
> >   hack::dx_hash   Average fullness = 2758 (67%)
> >
> > This suggests that dentry::hash is producing distinctly non-dispersed
> > results and needs to be subjected to further scrutiny.  I'll run the
> > next heat of hash races tomorrow, probably with R5, and CRC32 too if I
> > have time.
> 
> I'd love to hear the results from R5, as that seems to be the reiserfs
> favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
> in..

In this round there were two new contenders:

- ReiserFS's R5
- Bob Jenkins' hash

Eirik Fuller pointed me to the latter, the subject of a very interesting
article in Dr. Dobbs, available online here: 

http://burtleburtle.net/bob/hash/doobs.html

As before, the runs are for 20,000 creates and I report only the
fullness, because I'm still running these under UML.  Suffice to say
that the total running time is roughly related to the average fullness
with a variance of about 15% from the best to the worst.  Eventually I
will rerun the entire series of tests natively and provide more detailed
statistics.  Here are the results from the second heat of hash races:

 Contender          Result
 =========          ======
dentry::hash        Average fullness = 2352 (57%)
daniel::hack_hash   Average fullness = 2758 (67%)
bob::hash           Average fullness = 2539 (61%)
reiserfs::r5        Average fullness = 2064 (50%)

Just looking at R5 I knew it wasn't going to do well in this application
because it's similar to a number of hash functions I tried with the same
idea in mind: to place similar names together in the same leaf block. 
That turned out to be not very important compared to achieving a
relatively high fullness of leaf blocks.  The problem with R5 when used
with my htree is, it doesn't give very uniform dispersal.  But according
to Chris Mason (see his post) it does work very well for ReiserFS.  This
provides a little more evidence that my htree scheme is quite
different from other approaches.

u32 r5_hash (const char *msg, int len)
{
        u32 a = 0;
        while (*msg) {
                a += *msg << 4;
                a += *msg >> 4;
                a *= 11;
                msg++;
        }
        return a;
}

I expected more from bob::hash since it's very carefully thought
out in terms of dispersal and avoidance of 'funnelling' (the property
that determines the probability of collision), but it still fell short of
hack_hash's performance.  Oh well.  Tomorrow I'll try CRC32.

The bottom line: dx_hack_hash is still the reigning champion.  OK, come
out and take a bow:

unsigned dx_hack_hash (const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash < 0) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Kai Henningsen

[EMAIL PROTECTED] (Martin Mares)  wrote on 22.02.01 in 
<[EMAIL PROTECTED]>:

> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.

s/hash table/btree/ and you have just described the Macintosh HFS file  
system. (Incidentally, it stores file extent indices in a similar manner,  
with key = (file id, fork, offset).)


MfG Kai
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Kai Henningsen

[EMAIL PROTECTED] (Daniel Phillips)  wrote on 20.02.01 in 
<01022020011905.18944@gimli>:

> But the current hash function is just a place holder, waiting for
> a better version based on some solid theory.  I currently favor the
> idea of using crc32 as the default hash function, but I welcome
> suggestions.

I once liked those things, too - but I've learned better since.

Quoting _Handbook_of_Algorithms_and_Data_Structures_ (Gonnet/Baeza-Yates,  
ISBN 0-201-41607-7, Addison-Wesley):

--- snip ---

3.3.1 Practical hashing functions

[...]

A universal class of hashing functions is a class with the property that  
given any input, the average performance of all the functions is good.  
[...] For example, h(k) = (a * k + b) mod m with integers a != 0 and b is  
a universal class of hash functions.
[...]
Keys which are strings or sequences of words (including those which are of
variable length) are best treated by considering them as a number in base b.
Let the string s be composed of k characters s1s2...sk. Then

h(s) = ( sum(i=0..k-1) B^i*s(k-i) ) mod m

To obtain a more efficient version of this function we can compute

h(s) = ( ( sum(i=0..k-1) B^i*s(k-i) ) mod 2^w ) mod m

where w is the number of bits in a computer word, and the mod 2^w  
operation is done by the hardware. For this function the value B = 131 is  
recommended, as B^i has a maximum cycle mod 2^k for 8<=k<=64.

Hashing function for strings

int hashfunction(s)
char *s;

{ int i;
  for(i=0; *s; s++) i = 131*i + *s;
  return(i % m);
}

--- snip ---

I've actually used that function for a hash containing something like a
million phone numbers as keys, and there were *very* few collisions.
Similarly for another hash containing megabytes of RFC 822 message-ids.

MfG Kai
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Linus Torvalds



On Thu, 22 Feb 2001, Ingo Oeser wrote:
> 
> On Wed, Feb 21, 2001 at 09:19:45PM -0800, Linus Torvalds wrote:
> > In article <97230a$16k$[EMAIL PROTECTED]>,
> > Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > >allocate blocks one at a time. Make the blocksize something nice and
> > >big, not just 4kB or 8kB or something.
> > 
> > Btw, this is also going to be a VM and performance issue some time in
> > the future.  There are already CPU's that would _love_ to have 64kB
> > pages etc, and as such a filesystem that doesn't play with the old silly
> > "everything is a block" rules would be much appreciated with the kind of
> > people who have multi-gigabyte files and want to read in big chunks at a
> > time. 
>  
> For this we need a block remapper layer that can map any
> blocksize n to any blocksize m with only the following constraints:

No, nothing like that at all..

What you can _trivially_ do is to basically act to the VFS and VM layer as
if you're a 1kB block filesystem (or something), and then when you get
called to do a "bmap()" (which only happens for kernel installing and
LILO, not under normal load), you just return the "offset" into a larger
block.

The VFS and MM layers do not care what the _real_ underlying blocksize of
the filesystem is. They will just do "readpage()" and "write()" calls, and
you can implement those any way you want to - never showing that you are
chunking out page-sized pieces from a bigger allocation block.

It's not all that hard. You just have to think a bit differently: don't
think of it as a block-based filesystem that has to have fixed blocks. The
VFS and MM layer don't care. They just want to access it.

> Daniel (and others) use ext2 as a playground, because it is
> implemented, tested and not that hard to understand and verify.

I realize that. But I get _very_ nervous when people talk about adding
stuff to ext2, because there are a lot of people who do not want to ever
even by mistake run code that is "new" on their filesystem.

Note that I had the same issue with ext3 - for the longest time, Stephen
Tweedie wanted to just extend ext2, and make it an invisible upgrade where
the filesystem would just magically become journalled when the user asked
for it. It _sounds_ appealing, but it doesn't take into account (a)
inevitable bugs and (b) the fact that Reiserfs actually got a head start
at least partly because it didn't need to worry about backwards
compatibility at all (there were other reasons too).

Basically, if there is one thing I've learnt over ten years of Linux, it's
that it is absolutely _evil_ to add more "linkages" or dependencies than
you absolutely have to. It is _much_ easier to create a new filesystem,
and slowly phase out old code that is no longer used. It's been done
several times (both with filesystems and with drivers), and every time
we've had a "new X, phase out old X" kind of situation it has been very
smooth.

In comparison, if you have "new features in X, which also handles the old
cases of X" situation, you not only bind yourself to backwards
compatibility, but you also cause yourself to be unable to ever phase out
the old code. Which means that eventually the whole system is a piece of
crap, full of old garbage that nobody needs to use, but that is part of
the new stuff that everybody _does_ use.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Andreas Dilger

Daniel writes:
> All references to "." and ".." are now intercepted and never reach the
> filesystem level.

Ted writes:
>From: Daniel Phillips <[EMAIL PROTECTED]>
> 
>I'll leave that up to somebody else - we now have two alternatives, the
>100%, no-compromise INCOMPAT solution, and the slightly-bruised but
>still largely intact forward compatible solution.  I'll maintain both
>solutions for now so it's just as easy to choose either in the end.
> 
> Well, the $64,000 question is exactly how much performance does it cost?
> My guess is that it will be barely measurable, but only benchmarks will
> answer that question.

One important question as to the disk format is whether the "." and ".."
interception by VFS is a new phenomenon in 2.4 or if this also happened
in 2.2?  If so, then having these entries on disk will be important
for 2.2 compatibility, and you don't want to have different on-disk formats
between 2.2 and 2.4.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Chris Mason



On Wednesday, February 21, 2001 07:30:47 PM -0800 Linus Torvalds
<[EMAIL PROTECTED]> wrote:
> On Thu, 22 Feb 2001, Daniel Phillips wrote:
>> 
> 
> I'd love to hear the results from R5, as that seems to be the reiserfs
> favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
> in..

Quick details, since I don't think I've seen them on l-k yet.  r5 was
chosen because it is more tuned to the reiserfs disk format.  The location
of a directory item on disk is determined by the hash of the name, and r5
is designed to put similar names close to each other on disk.

The benchmark that shows this best is creating X number of files in a
single dir (named 0001, 0002, 0003 etc).  r5 greatly increases the chances
the directory item for 6 will be right next to the item for 7.  If
the application accesses these files in the same order they were created,
this has benefits at other times than just creation.  The benchmarks Ed
posted give a general idea for other naming patterns, but this one is best
case:

Time to create 100,000 files (10 bytes each) with r5 hash: 48s
Time to create 100,000 files (10 bytes each) with tea: 3m58s

The percentage increase just gets bigger as you create more and more files.
That doesn't mean this is a real world case, but it is what the hash was
designed for.  

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread tytso

   From: Daniel Phillips <[EMAIL PROTECTED]>
   Date: Thu, 22 Feb 2001 08:24:08 +0100
   Content-Type: text/plain

   > Is it worth it?  Well, it means you lose an index entry from each
   > directory block, thus reducing your fanout at each node of the tree by a
   > worst case of 0.7% (1k blocksize) and 0.2% if you're
   > using 4k blocksizes.

   I'll leave that up to somebody else - we now have two alternatives, the
   100%, no-compromise INCOMPAT solution, and the slightly-bruised but
   still largely intact forward compatible solution.  I'll maintain both
   solutions for now so it's just as easy to choose either in the end.

Well, the $64,000 question is exactly how much performance does it cost?
My guess is that it will be barely measurable, but only benchmarks will
answer that question.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Ingo Oeser

Hi Linus,
Hi LKML people,

On Wed, Feb 21, 2001 at 09:19:45PM -0800, Linus Torvalds wrote:
> In article <97230a$16k$[EMAIL PROTECTED]>,
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
> >allocate blocks one at a time. Make the blocksize something nice and
> >big, not just 4kB or 8kB or something.
> 
> Btw, this is also going to be a VM and performance issue some time in
> the future.  There are already CPU's that would _love_ to have 64kB
> pages etc, and as such a filesystem that doesn't play with the old silly
> "everything is a block" rules would be much appreciated with the kind of
> people who have multi-gigabyte files and want to read in big chunks at a
> time. 
 
For this we need a block remapper layer that can map any
blocksize n to any blocksize m with only the following constraints:

   - n and m are powers of 2
   - n is a multiple of m

Both should use the page cache ( of size p) of course, so it
becomes 2 layers, if n > p.

   -  translating a buffer of n into some pages
   -  translating a page into buffers of m (current buffercache)

We could limit the translation to 5 powers of 2 above and 5 powers of 2
below PAGE_CACHE_SIZE so that we can maintain a validity bitmap
(2^5 = 32 bits) for each layer if access is too expensive[1].

Some subsystems could certainly benefit from it.

   -  loop device (with all the crypto stuff)
   -  LVM
   -  FSes that support block sizes != PAGE_CACHE_SIZE
   -  Devices with blocksize != 512 (they don't have to care
  about being special anymore). There are even some rumors
  about very perverse blocksizes of 1M and the like.

Since these remapped buffers will look like merged requests, I
see even no problems with the elevator any more.

The question is, where we implement this infrastructure, esp. if
we consider the last user (devices with blocksize != 512).

This has to be answered by the main architects of Linux before
anyone could start.

> So either you have a simple block-based filesystem (current ext2, no
> extents, no crapola), or you decide to do it over.  Don't do some
> half-way thing, please. 

Daniel (and others) use ext2 as a playground, because it is
implemented, tested and not that hard to understand and verify.

Hope they will switch to a design of their own later, once they
sufficiently played around with^W^W^Wtested their ideas.

Regards

Ingo Oeser

[1] In buffer cache we use read-modify-write for partial pages,
   which hurts performance for them and is annoying for media
   with limited write cycles like flash and CD-RW[2].

[2] Yes I know about packet writing mode ;-)
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag http://www.tu-chemnitz.de/linux/tag
    come and join the fun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Rogier Wolff

H. Peter Anvin wrote:
> Martin Mares wrote:
> > 
> > Hello!
> > 
> > > True.  Note too, though, that on a filesystem (which we are, after all,
> > > talking about), if you assume a large linear space you have to create a
> > > file, which means you need to multiply the cost of all random-access
> > > operations with O(log n).
> > 
> > One could avoid this, but it would mean designing the whole filesystem in a
> > completely different way -- merge all directories to a single gigantic
> > hash table and use (directory ID,file name) as a key,

Novell, NTFS, HFS all do this. 

Roger. 

-- 
** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Kai Henningsen

[EMAIL PROTECTED] (Martin Mares)  wrote on 22.02.01 in 
[EMAIL PROTECTED]:

> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.

s/hash table/btree/ and you have just described the Macintosh HFS file  
system. (Incidentally, it stores file extent indices in a similar manner,  
with key = (file id, fork, offset).)


MfG Kai
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Daniel Phillips

Linus Torvalds wrote:
> 
> On Thu, 22 Feb 2001, Daniel Phillips wrote:
> >
> > In the first heat of hash races - creating 20,000 files in one directory
> > - dentry::hash lost out to my original hack::dx_hash, causing a high
> > percentage of leaf blocks to remain exactly half full and slowing down
> > the whole thing by about 5%.  (This was under uml - I haven't tried it
> > native yet but I expect the results to be similar.)
> >
> >     Contender           Result
> >     =========           ======
> >     dentry::hash        Average fullness = 2352 (57%)
> >     hack::dx_hash       Average fullness = 2758 (67%)
> >
> > This suggests that dentry::hash is producing distinctly non-dispersed
> > results and needs to be subjected to further scrutiny.  I'll run the
> > next heat of hash races tomorrow, probably with R5, and CRC32 too if I
> > have time.
> 
> I'd love to hear the results from R5, as that seems to be the reiserfs
> favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
> in..

In this round there were two new contenders:

- ReiserFS's R5
- Bob Jenkins' hash

Eirik Fuller pointed me to the latter, the subject of a very interesting
article in Dr. Dobbs, available online here: 

http://burtleburtle.net/bob/hash/doobs.html

As before, the runs are for 20,000 creates and I report only the
fullness, because I'm still running these under UML.  Suffice to say
that the total running time is roughly related to the average fullness
with a variance of about 15% from the best to the worst.  Eventually I
will rerun the entire series of tests natively and provide more detailed
statistics.  Here are the results from the second heat of hash races:

     Contender           Result
     =========           ======
     dentry::hash        Average fullness = 2352 (57%)
     daniel::hack_hash   Average fullness = 2758 (67%)
     bob::hash           Average fullness = 2539 (61%)
     reiserfs::r5        Average fullness = 2064 (50%)

Just looking at R5 I knew it wasn't going to do well in this application
because it's similar to a number of hash functions I tried with the same
idea in mind: to place similar names together in the same leaf block. 
That turned out to be not very important compared to achieving a
relatively high fullness of leaf blocks.  The problem with R5 when used
with my htree is, it doesn't give very uniform dispersal.  But according
to Chris Mason (see his post) it does work very well for ReiserFS.  This
provides a little more evidence that my htree scheme is quite
different from other approaches.

u32 r5_hash (const char *msg, int len)
{
    u32 a = 0;
    while (*msg) {
        a += *msg << 4;
        a += *msg >> 4;
        a *= 11;
        msg++;
    }
    return a;
}

I expected more from bob::hash since it's very carefully thought
out in terms of dispersal and avoidance of 'funnelling' (the property
that determines the probability of collision), but it still fell short of
hack_hash's performance.  Oh well.  Tomorrow I'll try CRC32.

The bottom line: dx_hack_hash is still the reigning champion.  OK, come
out and take a bow:

unsigned dx_hack_hash (const char *name, int len)
{
    u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
    while (len--)
    {
        u32 hash = hash1 + (hash0 ^ (*name++ * 7152373));
        if (hash & 0x80000000) hash -= 0x7fffffff;
        hash1 = hash0;
        hash0 = hash;
    }
    return hash0;
}

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Daniel Phillips

Andreas Dilger wrote:
> Daniel writes:
> > All references to "." and ".." are now intercepted and never reach the
> > filesystem level.
> 
> Ted writes:
> >From: Daniel Phillips <[EMAIL PROTECTED]>
> > 
> >I'll leave that up to somebody else - we now have two alternatives, the
> >100%, no-compromise INCOMPAT solution, and the slightly-bruised but
> >still largely intact forward compatible solution.  I'll maintain both
> >solutions for now so it's just as easy to choose either in the end.
> > 
> > Well, the $64,000 question is exactly how much performance does it cost?
> > My guess is that it will be barely measurable, but only benchmarks will
> > answer that question.
> 
> One important question as to the disk format is whether the "." and ".."
> interception by VFS is a new phenomenon in 2.4 or if this also happened
> in 2.2?  If so, then having these entries on disk will be important
> for 2.2 compatibility, and you don't want to have different on-disk formats
> between 2.2 and 2.4.

The answer is 'yes', it's been in since at least the beginning of 2.2:

 
http://innominate.org/cgi-bin/lksr/linux/fs/namei.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=v2.2

Search for '.'.

By the way, our whole linux cvsweb tree is here:

http://lksr.org/ 

with all versions of linux back to linux-0.97.pl5, with a makefile that
starts out with:

#
# Makefile for linux.
# If you don't have '-mstring-insns' in your gcc (and nobody but me has
:-)
# remove them from the CFLAGS defines.
#

Getting back on topic, this makes the idea of getting rid of the actual
on-disk "." and ".." entries a little less scary, though I am keeping in
mind the fact that having those entries on disk could in some extreme
circumstance help fsck recover a corrupted directory tree a little
better and more automatically.

I resolve not to take a position on this subject, and I will carry
forward both a 'squeaky clean' backward-compatible version that sets an
INCOMPAT flag, and a 'slightly tarnished' but very clever version that
is both forward and backward-compatible, along the lines suggested by
Ted.  Both flavors have the desirable property that old versions of
fsck with no knowledge of the new index structure can remove the indices
automatically, with fsck -y.

--
Daniel



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread tytso

   From: Andreas Dilger [EMAIL PROTECTED]
   Date: Thu, 22 Feb 2001 11:16:32 -0700 (MST)

   One important question as to the disk format is whether the "." and ".."
   interception by VFS is a new phenomenon in 2.4 or if this also happened
   in 2.2?  If so, then having these entries on disk will be important
   for 2.2 compatibility, and you don't want to have different on-disk formats
   between 2.2 and 2.4.

Well, you need to have the '.' and '..' entries there if you want the
full backwards compatibility.  That's clear.

If you don't care about backwards compatibility, it's important that
there be a way to find the parent directory, but there doesn't have to
be explicit '.' and '..'  entries.

So if Daniel is going to try implementing it both ways then that's one
place where the #ifdef's might get a bit more complicated.  After it's
done, we should do some benchmarks comparing it both ways; if the
difference is negligible, I'd argue for simply always providing
backwards compatibility.  One of the key advantages of ext2/ext3 is its
backwards compatibility, and so if it's not too costly to preserve it
(as I suspect will be the case), we should try to do so.

- Ted



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Felix von Leitner

Thus spake Alan Cox ([EMAIL PROTECTED]):
> > > There will be a lot fewer metadata index
> > > blocks in your directory file, for one thing.
> > Oh yes, another thing: a B-tree directory structure does not need
> > metadata index blocks.
> Before people get excited about complex tree directory indexes, remember to
> solve the other 95% before implementation - recovering from lost blocks,
> corruption and the like

And don't forget the trouble with NFS handles after the tree was rebalanced.

Trees are nice only theoretically.  In practice, the benefits are
outweighed by the nastiness in form of fsck and NFS and bigger code
(normally: more complex - less reliable).

Felix



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Andries . Brouwer

> idea of using crc32 as the default hash function

> I once liked those things, too - but I've learned better since.

> A universal class of hashing functions is a class with the property that
> given any input, the average performance of all the functions is good.
> For example, h(k) = (a * k + b) mod m with integers a != 0 and b is
> a universal class of hash functions.

Here a != 0 should be (a,m) = 1.

> for(i=0; *s; s++) i = 131*i + *s;

Yes, that is a good function.

(Also because 131 has only 3 bits, so that there is a cheap shift and add
implementation.)

I did some random tests, on the one hand on a collection of 557398 file
names (last part of path names) in a file system here.
On the other hand on an artificially generated sequence of file names
with increasing number tails: foo0001, foo0002, ...

On the first collection the choice of multiplier didn't really matter
provided that it was odd and not too close to a power of two.
The smallest number with good behaviour was 11, the winner was 51.

(51 has 4 bits, but is not more expensive because they are evenly spaced:
/* 51x = 17*3*x */
x += (x << 1);
x += (x << 4);
)

On the second collection there were large differences between multipliers.
The clear winner was 11.

Some numbers:

Hash 557398 actual names, using
unsigned int hash(unsigned char *s) {
        unsigned int h = 0;

        while (*s)
                h = m*h + *s++;
        return h % sz;
}
for various values of m and powers of two sz (so that the % is an AND).
Report min, max, average length of hash chain, and standard deviation.
Of course min and max should be close to average and stddev should be small.

m= 11 sz=2048, min 221, max 324, av 272.17, stddev 254.12
m= 13 sz=2048, min 219, max 322, av 272.17, stddev 259.96
m= 51 sz=2048, min 218, max 325, av 272.17, stddev 265.52
m=131 sz=2048, min 222, max 344, av 272.17, stddev 264.20

m= 11 sz=4096, min  96, max 176, av 136.08, stddev 132.58
m= 13 sz=4096, min 101, max 177, av 136.08, stddev 128.71
m= 51 sz=4096, min  92, max 174, av 136.08, stddev 130.89
m=131 sz=4096, min  85, max 180, av 136.08, stddev 131.99

m= 11 sz=8192, min 38, max 102, av 68.04, stddev 68.26
m= 13 sz=8192, min 42, max 100, av 68.04, stddev 66.30
m= 51 sz=8192, min 41, max  97, av 68.04, stddev 64.98
m=131 sz=8192, min 39, max 102, av 68.04, stddev 66.19

m= 11 sz=16384, min 14, max 57, av 34.02, stddev 33.96
m= 13 sz=16384, min 14, max 58, av 34.02, stddev 33.51
m= 51 sz=16384, min 15, max 60, av 34.02, stddev 32.29
m=131 sz=16384, min 16, max 59, av 34.02, stddev 33.94

m= 11 sz=32768, min 3, max 37, av 17.01, stddev 17.50
m= 13 sz=32768, min 3, max 34, av 17.01, stddev 16.84
m= 51 sz=32768, min 4, max 41, av 17.01, stddev 16.46
m=131 sz=32768, min 3, max 40, av 17.01, stddev 16.90

m= 11 sz=65536, min 0, max 24, av 8.51, stddev 8.70
m= 13 sz=65536, min 0, max 23, av 8.51, stddev 8.56
m= 51 sz=65536, min 0, max 24, av 8.51, stddev 8.31
m=131 sz=65536, min 0, max 23, av 8.51, stddev 8.51

m= 11 sz=131072, min 0, max 17, av 4.25, stddev 4.39
m= 13 sz=131072, min 0, max 16, av 4.25, stddev 4.32
m= 51 sz=131072, min 0, max 16, av 4.25, stddev 4.22
m=131 sz=131072, min 0, max 16, av 4.25, stddev 4.24

m= 11 sz=262144, min 0, max 12, av 2.13, stddev 2.20
m= 13 sz=262144, min 0, max 12, av 2.13, stddev 2.18
m= 51 sz=262144, min 0, max 12, av 2.13, stddev 2.12
m=131 sz=262144, min 0, max 12, av 2.13, stddev 2.12

On the second, nonrandom, collection there are more variations:

m= 11 sz=8192, min 61, max 76, av 68.04, stddev  4.41
m= 13 sz=8192, min 55, max 83, av 68.04, stddev 18.64
m= 51 sz=8192, min 58, max 79, av 68.04, stddev 12.47
m=131 sz=8192, min 52, max 83, av 68.04, stddev 29.05

m= 11 sz=16384, min 26, max 41, av 34.02, stddev  3.61
m= 13 sz=16384, min 24, max 45, av 34.02, stddev  8.76
m= 51 sz=16384, min 25, max 44, av 34.02, stddev  6.32
m=131 sz=16384, min 23, max 47, av 34.02, stddev 14.00

m= 11 sz=32768, min 10, max 23, av 17.01, stddev 4.36
m= 13 sz=32768, min  7, max 28, av 17.01, stddev 8.66
m= 51 sz=32768, min 10, max 25, av 17.01, stddev 4.04
m=131 sz=32768, min  6, max 27, av 17.01, stddev 8.66

Andries



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Andries . Brouwer


> Just looking at R5 I knew it wasn't going to do well in this application
> because it's similar to a number of hash functions I tried with the same
> idea in mind: to place similar names together in the same leaf block.
> That turned out to be not very important compared to achieving a
> relatively high fullness of leaf blocks.  The problem with R5 when used
> with my htree is, it doesn't give very uniform dispersal.
>
> The bottom line: dx_hack_hash is still the reigning champion.

Now that you provide source for r5 and dx_hack_hash, let me feed my
collections to them.
r5: catastrophic
dx_hack_hash: not bad, but the linear hash is better.

E.g.:
Actual file names:

Linear hash, m=11, sz=2048, min 262, max 283,  av 272.17, stddev 12.25
dx_hack_hash:  sz=2048, min 220, max 330,  av 272.17, stddev 280.43
r5:sz=2048, min 205, max 382,  av 272.17, stddev 805.18

Linear hash, m=11, sz=65536, min 0, max 24, av 8.51, stddev 8.70
dx_hack_hash:  sz=65536, min 0, max 23, av 8.51, stddev 8.51
r5:sz=65536, min 0, max 26, av 8.51, stddev 8.89

Generated consecutive names:

Linear hash, m=11, sz=2048, min 262, max 283,  av 272.17, stddev 12.25
dx_hack_hash:  sz=2048, min 191, max 346,  av 272.17, stddev 636.11
r5:sz=2048, min 0,   max 3587, av 272.17, stddev 755222.91

Linear hash, m=11, sz=65536, min 2, max  14, av 8.51, stddev 2.79
dx_hack_hash:  sz=65536, min 0, max  26, av 8.51, stddev 12.24
r5:sz=65536, min 0, max 120, av 8.51, stddev 738.08

Andries



Re: [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread Daniel Phillips

[EMAIL PROTECTED] wrote:
 
> Just looking at R5 I knew it wasn't going to do well in this application
> because it's similar to a number of hash functions I tried with the same
> idea in mind: to place similar names together in the same leaf block.
> That turned out to be not very important compared to achieving a
> relatively high fullness of leaf blocks.  The problem with R5 when used
> with my htree is, it doesn't give very uniform dispersal.
>
> The bottom line: dx_hack_hash is still the reigning champion.
>
> Now that you provide source for r5 and dx_hack_hash, let me feed my
> collections to them.
> r5: catastrophic
> dx_hack_hash: not bad, but the linear hash is better.

I never expected dx_hack_hash to be particularly good at anything, but
we might as well test the version without the mistake in it - I was
previously using < 0 to test the sign bit - on an unsigned variable :-/

unsigned dx_hack_hash (const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash & 0x80000000) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}


The correction gained me another 1% in the leaf block fullness measure. 
I will try your hash with the htree index code tomorrow.

--
Daniel



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Daniel Phillips

On Thu, 22 Feb 2001, [EMAIL PROTECTED] wrote:
> A couple of comments.  If you make the beginning of each index block
> look like an empty directory block (i.e., the first 8 bytes look like
> this):
> 
>   32 bits: ino == 0
>   16 bits: rec_len == blocksize
>   16 bits: name_len = 0
> 
> ... then you will have full backwards compatibility, both for reading
> *and* writing.  When reading, old kernels will simply ignore the index
> blocks, since it looks like it has an unpopulated directory entry.  And
> if the kernel attempts to write into the directory, it will clear the
> BTREE_FL flag, in which case new kernels won't treat the directory as a
> tree anymore.  (Running a smart e2fsck which knows about directory trees
> will be able to restore the tree structure).

:-)  That's really nice, now I see what you were thinking about with
all those bit clears.

> Is it worth it?  Well, it means you lose an index entry from each
> directory block, thus reducing your fanout at each node of the tree by a
> worst case of 0.7% (1k blocksize) and 0.2% if you're
> using 4k blocksizes.

I'll leave that up to somebody else - we now have two alternatives, the
100%, no-compromise INCOMPAT solution, and the slightly-bruised but
still largely intact forward compatible solution.  I'll maintain both
solutions for now so it's just as easy to choose either in the end.

-- 
Daniel



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Bill Wendling

Also sprach H. Peter Anvin:
} Martin Mares wrote:
} > 
} > Hello!
} > 
} > > True.  Note too, though, that on a filesystem (which we are, after all,
} > > talking about), if you assume a large linear space you have to create a
} > > file, which means you need to multiply the cost of all random-access
} > > operations with O(log n).
} > 
} > One could avoid this, but it would mean designing the whole filesystem in a
} > completely different way -- merge all directories to a single gigantic
} > hash table and use (directory ID,file name) as a key, but we were originally
} > talking about extending ext2, so such massive changes are out of question
} > and your log n access argument is right.
} > 
} 
} It would still be tricky since you have to have actual files in the
} filesystem as well.
} 
But that's just a user space issue, isn't it?

(Just kidding :-)

-- 
|| Bill Wendling[EMAIL PROTECTED]



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Andreas Dilger

HPA writes:
> Daniel Phillips wrote:
> > I mentioned this earlier but it's worth repeating: the desire to use a
> > small block size is purely an artifact of the fact that ext2 has no
> > handling for tail block fragmentation.  That's a temporary situation -
> > once we've dealt with it your 2,000,000 file directory will be happier
> > with 4K filesystem blocks.  There will be a lot fewer metadata index
> > blocks in your directory file, for one thing.  Another practical matter
> > is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
> > result friendlier to the page cache and memory manager.
> > 
> 
> Well, that's something I really don't expect to see anymore -- this
> "purely temporary situation" is now already 7 years old at least.

Peter, you're barking up the wrong tree - Daniel has had an ext2 tail
merging patch around for 6 months or more...  However, from the sounds
of it, Linus may not want such a thing in ext2 (at least not until he
is convinced otherwise).  It will be interesting to compare ext2 +
ongoing patches vs. new filesystems like reiserfs, XFS, JFS --  not only
speed, but reliability as well.  XFS and JFS have previous implementations
to work with (although the JFS code is not the AIX JFS code), but reiserfs
has a long way to go, just from the standpoint of being run on millions
of machines, and being looked at by thousands of programmers.

I think people will be surprised at how ext2 + patches will continue to
improve.  One of the reasons (despite Linus' misgivings, IMHO) is that
ext2 is continually being improved by small measures, has lots of eyes
on the code, and it offers a stable base for each improvement - which
means each improvement is stable and reliable much quicker than if you
were to code a new filesystem from scratch for each new feature.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread tytso

Daniel,

Nice work!

A couple of comments.  If you make the beginning of each index block
look like an empty directory block (i.e., the first 8 bytes look like
this):

32 bits: ino == 0
16 bits: rec_len == blocksize
16 bits: name_len = 0

... then you will have full backwards compatibility, both for reading
*and* writing.  When reading, old kernels will simply ignore the index
blocks, since it looks like it has an unpopulated directory entry.  And
if the kernel attempts to write into the directory, it will clear the
BTREE_FL flag, in which case new kernels won't treat the directory as a
tree anymore.  (Running a smart e2fsck which knows about directory trees
will be able to restore the tree structure).

Is it worth it?  Well, it means you lose an index entry from each
directory block, thus reducing your fanout at each node of the tree by a
worst case of 0.7% (1k blocksize) and 0.2% if you're
using 4k blocksizes.

- Ted




Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Daniel Phillips

"H. Peter Anvin" wrote:
> 
> Daniel Phillips wrote:
> >
> > Have you looked at the structure and algorithms I'm using?  I would not
> > call this a hash table, nor is it a btree.  It's a 'hash-keyed
> > uniform-depth tree'.  It never needs to be rehashed (though it might be
> > worthwhile compacting it at some point).  It also never needs to be
> > rebalanced - it's only two levels deep for up to 50 million files.
> 
> I'm curious how you do that.  It seems each level would have to be 64K
> large in order to do that, with a minimum disk space consumption of 128K
> for a directory.  That seems extremely painful *except* in the case of
> hysterically large directories, which tend to be the exception even on
> filesystems where they occur.

Easy, with average dirent reclen of 16 bytes each directory leaf block
can hold up to 256 entries.  Each index block indexes 512 directory
blocks and the root indexes 511 index blocks.  Assuming the leaves are
on average 75% full this gives:

(4096 / 16) * 512 * 511 * .75 = 50,233,344

In practice I'm getting a little more than 90,000 entries indexed by a
*single* index block (the root) so I'm not just making this up.

> I think I'd rather take the extra complexity and rebalancing cost of a
> B-tree.

Do you still think so?

--
Daniel



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Ed Tomlinson

Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> Ed Tomlinson  <[EMAIL PROTECTED]> wrote:
> >The default in reiserfs is now the R5 hash, but you are right that lots of
> > efforts went into finding this hash.  This includes testing various
> > hashes on real directory structures to see which one worked best.  R5
> > won.
>
> That's interesting.  The R5 hash is easily also the only one of the
> reiser hashes that might be useable for the generic VFS hashing.  It's
> not so different in spirit from the current one, and if you've done the
> work to test it, it's bound to be a lot better.

It was not me personally.  I just remembered the thread (from June 2000) on
the reiserfs list...  I have summarized the results for you below.

For the program see: http://www.jedi.claranet.fr/hash_torture.tar.gz

Ed 

PS.  I am still seeing hangs (with 2.4.2pre2; then I switched to ac7 or so,
and have had hangs with all pre and ac kernels tried, which is most of them).
ac20 plus the latest reiserfs fixes has stayed up 8 hours so far - it can
take two or three days to trigger the hang though.  When it hangs it's really
dead: a UPS connected via a serial port cannot shut it down, pings to the box
fail, Alt+SysRq is dead, and the software watchdog does not trigger a reboot.
Ideas?

> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing.  For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases).
>
> What did you use for a test-case? Real-life directory contents? Did you
> do any worst-case analysis too?
>
>        Linus


some test results from june 2000 with Hans's summary first.
---
(reiserfs) Re: r5 hash
From: Hans Reiser <[EMAIL PROTECTED]>
To: "Yury Yu. Rupasov" <[EMAIL PROTECTED]>
Cc: Jedi/Sector One <[EMAIL PROTECTED]>, Petru Paler <[EMAIL PROTECTED]>, 
"[EMAIL PROTECTED]" <[EMAIL PROTECTED]>, Yury Shevchuk 
<[EMAIL PROTECTED]>


Ok, based on this benchmark let's put rupasov5 in, and warn users who choose 
the
currently used rupasov1 hash that rupasov5 has obsoleted it.  Do this in both
3.6 and 3.5, and fix the delimiting key check in 3.5 REISERFS_CHECK bug at
the same time.  Cut the patch, start testing, and see if you can release by
Monday.  Make rupasov5 the default.  sizif, review the documentation he 
creates
for users.

Jedi, if you disagree with the benchmarks let me know.  You might try
concatenating two filenames together instead of adding a digit to them, or
running find on a really large FS, to improve these tests.  Thanks for helping
us with analyzing the different hash methods available Jedi.

Hans

---
(reiserfs) Re: r5 hash
From: "Yury Yu. Rupasov" <[EMAIL PROTECTED]>
To: Hans Reiser <[EMAIL PROTECTED]>
Cc: Jedi/Sector One <[EMAIL PROTECTED]>, Petru Paler <[EMAIL PROTECTED]>, 
"[EMAIL PROTECTED]" <[EMAIL PROTECTED]>, Yury Shevchuk 
<[EMAIL PROTECTED]>


Hans Reiser wrote:
> 
> What is the speed of the real filenames, not just the number of collisions.
> 



Ok, here are the results for real names:
# find / -type d -exec ls {} \; | sort | uniq > allfiles.txt

# wc -l allfiles.txt
161101 allfiles.txt

Collisions for 161 101 names:

tea_hash  : 784 total,  2 dangerous
jedi_hash2: 957 total,  2 dangerous 
r5_hash   :1191 total,  2 dangerous 
r7_hash   :8439 total, 18 dangerous


The speed for 161 101 real names :

create 161101 files of 10 bytes with names from allfiles.txt

# time create d1 allfiles.txt
# time cp d1 d2 -r
# time rm d1 -r

              create      copy        remove 
             
tea_hash   : 1m27.223s   5m43.069s  2m33.449s
jedi_hash2 : 1m26.062s   5m40.872s  2m32.795s
r5_hash    : 1m16.729s   4m14.967s  1m53.037s
r7_hash    : 1m10.665s   3m34.950s  1m39.756s


As you can see, the results differ, but not too much. :)
The situation changes dramatically if we test 1 million files.

The same test, but at the end of each name from allfiles.txt 
added numbers from 0 to 6 (1 127 707 files):
 
              create      copy        remove 
             
tea_hash   : 81m44.449s  
jedi_hash2 : 79m46.419s
r5_hash    : 15m56.037s
r7_hash    : 15m30.680s

Dual Celeron 500, 128 MB RAM, 8 GB scsi HDD
Reiserfs-3.5.21, Linux-2.2.15

Thanks,
Yura.
---
(reiserfs) Torture results
From: Jedi/Sector One <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]


  Here are the results of the hash torture on a Celeron 300.
  Once again, you can subtract 1 from the dangerous collisions numbers.
  Xuan, can you provide a test 

Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Daniel Phillips  <[EMAIL PROTECTED]> wrote:
>
>I mentioned this earlier but it's worth repeating: the desire to use a
>small block size is purely an artifact of the fact that ext2 has no
>handling for tail block fragmentation.  That's a temporary situation -
>once we've dealt with it your 2,000,000 file directory will be happier
>with 4K filesystem blocks.

I'd rather see a whole new filesystem than have ext2 do tail-block
fragmentation. 

Once you do tail fragments, you might as well do the whole filesystem
over and have it do fancier stuff than just handling sub-blocking. 

Another way of saying this: if you go to the complexity of no longer
being a purely block-based filesystem, please go the whole way. Make the
thing be extent-based, and get away from the notion that you have to
allocate blocks one at a time. Make the blocksize something nice and
big, not just 4kB or 8kB or something.

And don't call it ext2. 

Linus



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Davide Libenzi  <[EMAIL PROTECTED]> wrote:
>
>Yep, 4 is not good as a shifting factor. Prime numbers are the better choice
>for this stuff.

Oh, absolutely.

It looks like the hash function was done rather early on in the dcache
lifetime (one of the first things), back when nobody cared about whether
it was really good or not because there were many much more complicated
questions like "how the h*ll will this all ever work" ;)

And at no point did anybody ever go back and verify whether the hash
function made much sense or not.

We had another boo-boo with the actual _folding_ of the "full" hash
value into the actual hash chain pointer that is done when the name
cache is actually looked up, which was even more embarrassing: even if
the hash ended up being ok, we would remove most of the valid bits from
it because it would under certain circumstances (512MB of RAM on x86)
basically xor itself with itself. 

That took quite a while to find too - the code still worked fine, it
just had a horrible distribution on machines with half a gig of memory.

>The issue to have a good distribution is not only to have a good hashing
>function, but also to give this function not correlated data.
>Good hashing function for a Domain A may not be so good for a Domain B.

This is not something we can do all that much about.  The data we get is
generated by the user, and can basically be a random string of
characters.  HOWEVER, there are certainly tons of _usual_ data, and
while there's no way to select the data we can at least try to make sure
that the distribution is good for the normal case (ie regular ASCII
filenames, not forgetting the fact that many people use more interesting
encodings)

Linus



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Daniel Phillips

Andreas Dilger wrote:
> 
> Daniel Phillips writes:
> > Easy, with average dirent reclen of 16 bytes each directory leaf block
> > can hold up to 256 entries.  Each index block indexes 512 directory
> > blocks and the root indexes 511 index blocks.  Assuming the leaves are
> > on average 75% full this gives:
> >
> >   (4096 / 16) * 512 * 511 * .75 = 50,233,344
> >
> > In practice I'm getting a little more than 90,000 entries indexed by a
> > *single* index block (the root) so I'm not just making this up.
> 
> I was just doing the math for 1k ext2 filesystems, and the numbers aren't
> nearly as nice.  We get:
> 
> (1024 / 16) * 127 * .75 = 6096  # 1 level
> (1024 / 16) * 128 * 127 * .75 = 780288  # 2 levels
> 
> Basically (IMHO) we will not really get any noticable benefit with 1 level
> index blocks for a 1k filesystem - my estimates at least are that the break
> even point is about 5k files.  We _should_ be OK with 780k files in a single
> directory for a while.  Looks like we will need 2-level indexes sooner than
> you would think though.  Note that tests on my workstation showed an average
> filename length of 10 characters (excluding MP3s at 78 characters), so this
> would give 20-byte (or 88-byte) dirents for ext3, reducing the files count
> to 4857 and 621792 (or 78183 and 40029696 for 4k filesystems) at 75% full.

But you are getting over 3/4 million files in one directory on a 1K
blocksize system, and you really shouldn't be using 1K blocks on a
filesystem under that big a load.  Is it just to reduce tail block
fragmentation?  That's what tail merging is for - it does a much better
job than shrinking the block size.

But if you are *determined* to use 1K blocks and have more than 1/2
million files in one directory then I suppose a 3rd level is what you
need.  The uniform-depth tree still works just fine and still doesn't
need to be rebalanced - it's never out of balance.

--
Daniel



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Andreas Dilger

Daniel Phillips writes:
> Easy, with average dirent reclen of 16 bytes each directory leaf block
> can hold up to 256 entries.  Each index block indexes 512 directory
> blocks and the root indexes 511 index blocks.  Assuming the leaves are
> on average 75% full this gives:
> 
>   (4096 / 16) * 512 * 511 * .75 = 50,233,344
> 
> In practice I'm getting a little more than 90,000 entries indexed by a
> *single* index block (the root) so I'm not just making this up.

I was just doing the math for 1k ext2 filesystems, and the numbers aren't
nearly as nice.  We get:

(1024 / 16) * 127 * .75 = 6096  # 1 level
(1024 / 16) * 128 * 127 * .75 = 780288  # 2 levels

Basically (IMHO) we will not really get any noticable benefit with 1 level
index blocks for a 1k filesystem - my estimates at least are that the break
even point is about 5k files.  We _should_ be OK with 780k files in a single
directory for a while.  Looks like we will need 2-level indexes sooner than
you would think though.  Note that tests on my workstation showed an average
filename length of 10 characters (excluding MP3s at 78 characters), so this
would give 20-byte (or 88-byte) dirents for ext3, reducing the files count
to 4857 and 621792 (or 78183 and 40029696 for 4k filesystems) at 75% full.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Daniel Phillips wrote:
> 
> "H. Peter Anvin" wrote:
> >
> > Andreas Dilger wrote:
> > >
> > > Basically (IMHO) we will not really get any noticeable benefit with 1-level
> > > index blocks for a 1k filesystem - my estimates at least are that the break-even
> > > point is about 5k files.  We _should_ be OK with 780k files in a single
> > > directory for a while.
> > >
> >
> > I've had a news server with 2,000,000 files in one directory.  Such a
> > filesystem is likely to use small blocks, too, because each file is
> > generally small.
> >
> > This is an important connection: filesystems which have lots and lots of
> > small files will have large directories and small block sizes.
> 
> I mentioned this earlier but it's worth repeating: the desire to use a
> small block size is purely an artifact of the fact that ext2 has no
> handling for tail block fragmentation.  That's a temporary situation -
> once we've dealt with it your 2,000,000 file directory will be happier
> with 4K filesystem blocks.  There will be a lot fewer metadata index
> blocks in your directory file, for one thing.  Another practical matter
> is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
> result friendlier to the page cache and memory manager.
> 

Well, that's something I really don't expect to see anymore -- this
"purely temporary situation" is now already 7 years old at least.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Daniel Phillips wrote:
> 
> There will be a lot fewer metadata index
> blocks in your directory file, for one thing.
> 

Oh yes, another thing: a B-tree directory structure does not need
metadata index blocks.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Daniel Phillips

Linus Torvalds wrote:
> 
> On Tue, 20 Feb 2001, Daniel Phillips wrote:
> >
> > You mean full_name_hash?  I will un-static it and try it.  I should have
> > some statistics tomorrow.  I have a couple of simple metrics for
> > measuring the effectiveness of the hash function: the uniformity of
> > the hash space splitting (which in turn affects the average fullness
> > of directory leaves) and speed.
> 
> I was more thinking about just using "dentry->d_name->hash" directly, and
> not worrying about how that hash was computed. Yes, for ext2 it will have
> the same value as "full_name_hash" - the difference really being that
> d_hash has already been precomputed for you anyway.
> 
> > Let the hash races begin.
> 
> Note that dentry->d_name->hash is really quick (no extra computation), but
> I'm not claiming that it has anything like a CRC quality. And it's
> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around. I'm more interested in
> hearing whether it's a good hash, and maybe we could improve the VFS hash
> enough that there's no reason to use anything else..

In the first heat of hash races - creating 20,000 files in one directory
- dentry::hash lost out to my original hack::dx_hash, causing a high
percentage of leaf blocks to remain exactly half full and slowing down
the whole thing by about 5%.  (This was under uml - I haven't tried it
native yet but I expect the results to be similar.)

  Contender       Result
  =========       ======
  dentry::hash    Average fullness = 2352 (57%)
  hack::dx_hash   Average fullness = 2758 (67%)

This suggests that dentry::hash is producing distinctly non-dispersed
results and needs to be subjected to further scrutiny.  I'll run the
next heat of hash races tomorrow, probably with R5, and CRC32 too if I
have time.
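For what it's worth, the kind of dispersal scoring used in this thread is
easy to reproduce.  An illustrative sketch (not Daniel's actual test
harness; the function name is mine): hash n keys into m buckets and
compute the mean square bucket size, which for a well-dispersed hash lands
near the expected value for a random function.

```c
#include <stddef.h>

/* Score a hash function's dispersal: count how the n hash values fall
 * into m buckets, then average the squared bucket counts.  Skewed
 * hashes score higher than the random-function expectation. */
static double mean_square_bucket(const unsigned *hashes, size_t n,
                                 unsigned *counts, size_t m)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < m; i++)
        counts[i] = 0;
    for (i = 0; i < n; i++)
        counts[hashes[i] % m]++;          /* bucket by low bits */
    for (i = 0; i < m; i++)
        sum += (double)counts[i] * counts[i];
    return sum / m;
}
```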

--
Daniel



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Linus Torvalds

In article <97230a$16k$[EMAIL PROTECTED]>,
Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>Another way of saying this: if you go to the complexity of no longer
>being a purely block-based filesystem, please go the whole way. Make the
>thing be extent-based, and get away from the notion that you have to
>allocate blocks one at a time. Make the blocksize something nice and
>big, not just 4kB or 8kB or something.

Btw, this is also going to be a VM and performance issue some time in
the future.  There are already CPU's that would _love_ to have 64kB
pages etc, and as such a filesystem that doesn't play with the old silly
"everything is a block" rules would be much appreciated with the kind of
people who have multi-gigabyte files and want to read in big chunks at a
time. 

So either you have a simple block-based filesystem (current ext2, no
extents, no crapola), or you decide to do it over.  Don't do some
half-way thing, please. 

Linus



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Daniel Phillips

"H. Peter Anvin" wrote:
> 
> Andreas Dilger wrote:
> >
> > Basically (IMHO) we will not really get any noticeable benefit with 1-level
> > index blocks for a 1k filesystem - my estimates at least are that the break-even
> > point is about 5k files.  We _should_ be OK with 780k files in a single
> > directory for a while.
> >
> 
> I've had a news server with 2,000,000 files in one directory.  Such a
> filesystem is likely to use small blocks, too, because each file is
> generally small.
> 
> This is an important connection: filesystems which have lots and lots of
> small files will have large directories and small block sizes.

I mentioned this earlier but it's worth repeating: the desire to use a
small block size is purely an artifact of the fact that ext2 has no
handling for tail block fragmentation.  That's a temporary situation -
once we've dealt with it your 2,000,000 file directory will be happier
with 4K filesystem blocks.  There will be a lot fewer metadata index
blocks in your directory file, for one thing.  Another practical matter
is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
result friendlier to the page cache and memory manager.

--
Daniel



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Andreas Dilger wrote:
> 
> Basically (IMHO) we will not really get any noticeable benefit with 1-level
> index blocks for a 1k filesystem - my estimates at least are that the break-even
> point is about 5k files.  We _should_ be OK with 780k files in a single
> directory for a while.
>

I've had a news server with 2,000,000 files in one directory.  Such a
filesystem is likely to use small blocks, too, because each file is
generally small.

This is an important connection: filesystems which have lots and lots of
small files will have large directories and small block sizes.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Daniel Phillips wrote:
> 
> "H. Peter Anvin" wrote:
> >
> > Daniel Phillips wrote:
> > >
> > > Have you looked at the structure and algorithms I'm using?  I would not
> > > call this a hash table, nor is it a btree.  It's a 'hash-keyed
> > > uniform-depth tree'.  It never needs to be rehashed (though it might be
> > > worthwhile compacting it at some point).  It also never needs to be
> > > rebalanced - it's only two levels deep for up to 50 million files.
> >
> > I'm curious how you do that.  It seems each level would have to be 64K
> > large in order to do that, with a minimum disk space consumption of 128K
> > for a directory.  That seems extremely painful *except* in the case of
> > hysterically large directories, which tend to be the exception even on
> > filesystems where they occur.
> 
> Easy, with average dirent reclen of 16 bytes each directory leaf block
> can hold up to 256 entries.  Each index block indexes 512 directory
> blocks and the root indexes 511 index blocks.  Assuming the leaves are
> on average 75% full this gives:
> 
> (4096 / 16) * 512 * 511 * .75 = 50,233,344
> 

That's a three-level tree, not a two-level tree.

> In practice I'm getting a little more than 90,000 entries indexed by a
> *single* index block (the root) so I'm not just making this up.
> 
> > I think I'd rather take the extra complexity and rebalancing cost of a
> > B-tree.
> 
> Do you still think so?

I think so.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Followup to:  <971i36$180$[EMAIL PROTECTED]>
By author:[EMAIL PROTECTED] (Linus Torvalds)
In newsgroup: linux.dev.kernel
> 
> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing.  For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases). 
> 

Actually, the VFS name hash I think is derived from the "Dragon Book"
hash (via autofs), so it's not like it's completely untested.

-hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Linus Torvalds



On Thu, 22 Feb 2001, Daniel Phillips wrote:
> 
> In the first heat of hash races - creating 20,000 files in one directory
> - dentry::hash lost out to my original hack::dx_hash, causing a high
> percentage of leaf blocks to remain exactly half full and slowing down
> the whole thing by about 5%.  (This was under uml - I haven't tried it
> native yet but I expect the results to be similar.)
> 
> Contender       Result
> =========       ======
>   dentry::hash    Average fullness = 2352 (57%)
>   hack::dx_hash   Average fullness = 2758 (67%)
> 
> This suggests that dentry::hash is producing distinctly non-dispersed
> results and needs to be subjected to further scrutiny.  I'll run the
> next heat of hash races tomorrow, probably with R5, and CRC32 too if I
> have time.

I'd love to hear the results from R5, as that seems to be the reiserfs
favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
in..

Linus




Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Daniel Phillips

"H. Peter Anvin" wrote:
> 
> Martin Mares wrote:
> >
> > > True.  Note too, though, that on a filesystem (which we are, after all,
> > > talking about), if you assume a large linear space you have to create a
> > > file, which means you need to multiply the cost of all random-access
> > > operations with O(log n).
> >
> > One could avoid this, but it would mean designing the whole filesystem in a
> > completely different way -- merge all directories to a single gigantic
> > hash table and use (directory ID,file name) as a key, but we were originally
> > talking about extending ext2, so such massive changes are out of question
> > and your log n access argument is right.
> 
> It would still be tricky since you have to have actual files in the
> filesystem as well.

Have you looked at the structure and algorithms I'm using?  I would not
call this a hash table, nor is it a btree.  It's a 'hash-keyed
uniform-depth tree'.  It never needs to be rehashed (though it might be
worthwhile compacting it at some point).  It also never needs to be
rebalanced - it's only two levels deep for up to 50 million files.
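The per-level lookup can be sketched roughly as follows (structure and
function names here are my guesses for illustration, not taken from the
actual patch): each index block holds entries sorted by hash, and the entry
with the largest hash not exceeding the key names the next-level block.

```c
/* Hypothetical sketch of one step of a fixed-depth, hash-keyed lookup.
 * Two such steps (root, then index block) reach a leaf, and because
 * every leaf sits at the same depth, nothing is ever rebalanced. */
struct dx_entry {
    unsigned hash;   /* lowest hash covered by this child */
    unsigned block;  /* block number of the child */
};

static unsigned dx_find_block(const struct dx_entry *entries, int count,
                              unsigned hash)
{
    int lo = 0, hi = count - 1;

    while (lo < hi) {                     /* binary search, biased high */
        int mid = (lo + hi + 1) / 2;
        if (entries[mid].hash <= hash)
            lo = mid;
        else
            hi = mid - 1;
    }
    return entries[lo].block;
}
```

A full lookup would call this once on the root and once on the chosen index
block, then scan the leaf block linearly for the name.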

This thing deserves a name of its own.  I call it an 'htree'.  The
performance should speak for itself - 150 usec/create across 90,000
files and still a few optimizations to go.

Random access runs at similar speeds too, it's not just taking advantage
of a long sequence of insertions into the same directory.

BTW, the discussion in this thread has been very interesting, it just
isn't entirely relevant to my patch :-)

--
Daniel



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Daniel Phillips wrote:
> 
> Have you looked at the structure and algorithms I'm using?  I would not
> call this a hash table, nor is it a btree.  It's a 'hash-keyed
> uniform-depth tree'.  It never needs to be rehashed (though it might be
> worthwhile compacting it at some point).  It also never needs to be
> rebalanced - it's only two levels deep for up to 50 million files.
> 

I'm curious how you do that.  It seems each level would have to be 64K
large in order to do that, with a minimum disk space consumption of 128K
for a directory.  That seems extremely painful *except* in the case of
hysterically large directories, which tend to be the exception even on
filesystems where they occur.

I think I'd rather take the extra complexity and rebalancing cost of a
B-tree.

> This thing deserves a name of its own.  I call it an 'htree'.  The
> performance should speak for itself - 150 usec/create across 90,000
> files and still a few optimizations to go.
> 
> Random access runs at similar speeds too, it's not just taking advantage
> of a long sequence of insertions into the same directory.
> 
> BTW, the discussion in this thread has been very interesting, it just
> isn't entirely relevant to my patch :-)

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Davide Libenzi


On 21-Feb-2001 Daniel Phillips wrote:
> "H. Peter Anvin" wrote:
>> 
>> Martin Mares wrote:
>> >
>> > > True.  Note too, though, that on a filesystem (which we are, after all,
>> > > talking about), if you assume a large linear space you have to create a
>> > > file, which means you need to multiply the cost of all random-access
>> > > operations with O(log n).
>> >
>> > One could avoid this, but it would mean designing the whole filesystem in
>> > a
>> > completely different way -- merge all directories to a single gigantic
>> > hash table and use (directory ID,file name) as a key, but we were
>> > originally
>> > talking about extending ext2, so such massive changes are out of question
>> > and your log n access argument is right.
>> 
>> It would still be tricky since you have to have actual files in the
>> filesystem as well.
> 
> Have you looked at the structure and algorithms I'm using?  I would not
> call this a hash table, nor is it a btree.  It's a 'hash-keyed
> uniform-depth tree'.  It never needs to be rehashed (though it might be
> worthwhile compacting it at some point).  It also never needs to be
> rebalanced - it's only two levels deep for up to 50 million files.
> 
> This thing deserves a name of its own.  I call it an 'htree'.  The
> performance should speak for itself - 150 usec/create across 90,000
> files and still a few optimizations to go.
> 
> Random access runs at similar speeds too, it's not just taking advantage
> of a long sequence of insertions into the same directory.
> 
> BTW, the discussion in this thread has been very interesting, it just
> isn't entirely relevant to my patch :-)

Daniel,

I'm by no means saying that your algorithm is not good.
I use something very similar to it in my mail server ( XMail ) to index mail
queue files with a two-level-deep fs splitting.
Mine was only a hint to try different types of directory indexing.



- Davide




Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Davide Libenzi


On 21-Feb-2001 Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> Ed Tomlinson  <[EMAIL PROTECTED]> wrote:
>>
>>The default in reiserfs is now the R5 hash, but you are right that lots of
>>efforts went into finding this hash.  This includes testing various hashes
>>on real directory structures to see which one worked best.  R5 won.
> 
> That's interesting.  The R5 hash is easily also the only one of the
> reiser hashes that might be useable for the generic VFS hashing.  It's
> not so different in spirit from the current one, and if you've done the
> work to test it, it's bound to be a lot better.
> 
> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing.  For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases). 
> 
> What did you use for a test-case? Real-life directory contents? Did you
> do any worst-case analysis too?

Yep, 4 is not good as a shifting factor.  Prime numbers are a better choice
for this stuff.
The issue in getting a good distribution is not only having a good hashing
function, but also feeding that function uncorrelated data.
A good hashing function for a domain A may not be so good for a domain B.




- Davide




Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Jamie Lokier

Martin Mares wrote:
> Hello!
> 
> > True.  Note too, though, that on a filesystem (which we are, after all,
> > talking about), if you assume a large linear space you have to create a
> > file, which means you need to multiply the cost of all random-access
> > operations with O(log n).
> 
> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.

A gigantic hash table has serious problems with non-locality of
reference.  Basically any regular access pattern you started with is
destroyed.  This is a problem with pageable RAM, let alone disks with
millisecond seek times.

-- Jamie



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Martin Mares wrote:
> 
> Hello!
> 
> > True.  Note too, though, that on a filesystem (which we are, after all,
> > talking about), if you assume a large linear space you have to create a
> > file, which means you need to multiply the cost of all random-access
> > operations with O(log n).
> 
> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.
> 

It would still be tricky since you have to have actual files in the
filesystem as well.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Ed Tomlinson  <[EMAIL PROTECTED]> wrote:
>
>The default in reiserfs is now the R5 hash, but you are right that lots of
>efforts went into finding this hash.  This includes testing various hashes
>on real directory structures to see which one worked best.  R5 won.

That's interesting.  The R5 hash is easily also the only one of the
reiser hashes that might be useable for the generic VFS hashing.  It's
not so different in spirit from the current one, and if you've done the
work to test it, it's bound to be a lot better.

(The current VFS name hash is probably _really_ stupid - I think it's
still my original one, and nobody probably ever even tried to run it
through any testing.  For example, I bet that using a shift factor of 4
is really bad, because it evenly divides a byte, which together with the
xor means that you can really easily generate trivial bad cases). 
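The kind of trivial bad case meant here is easy to construct for a bare
shift-by-4 xor hash (a deliberately simplified toy, NOT the actual VFS
function): because 4 evenly divides the 8-bit byte, adjacent characters
overlap in aligned nibbles and can cancel each other out.

```c
/* Toy shift-4 xor hash, stripped down to show the nibble alignment
 * problem.  Each character lands 4 bits further left than the next,
 * so consecutive characters overlap in exactly one nibble. */
static unsigned toy_hash(const char *p)
{
    unsigned hash = 0;

    while (*p)
        hash = (hash << 4) ^ (unsigned char)*p++;
    return hash;
}
```

For example, "aqb" and "`ab" both hash to 0x6672: flipping one bit in the
first character's low nibble is cancelled by flipping the overlapping bit
in the second character's high nibble.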

What did you use for a test-case? Real-life directory contents? Did you
do any worst-case analysis too?

Linus



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Martin Mares wrote:
> 
> Hello!
> 
> > Not true.  The rehashing is O(n) and it has to be performed O(log n)
> > times during insertion.  Therefore, insertion is O(log n).
> 
> Rehashing is O(n), but the "n" is the _current_ number of items, not the
> maximum one after all the insertions.
> 
> Let's assume you start with a single-entry hash table. You rehash for the
> first time after inserting the first item (giving hash table of size 2),
> then after the second item (=> size 4), then after the fourth item (=> size 8)
> and so on. I.e., when you insert n items, the total cost of rehashing summed
> over all the insertions is at most 1 + 2 + 4 + 8 + 16 + ... + 2^k (where
> k=floor(log2(n))) <= 2^k+1 = O(n). That is O(1) operations per item inserted.
> 

You're right.  However, for each hash table operation to be O(1) the size
of the hash table must be >> n.

I suggested at one point to use B-trees with a hash value as the key. 
B-trees are extremely efficient when used on a small constant-size key.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Martin Mares

Hello!

> Not true.  The rehashing is O(n) and it has to be performed O(log n)
> times during insertion.  Therefore, insertion is O(log n).

Rehashing is O(n), but the "n" is the _current_ number of items, not the
maximum one after all the insertions.

Let's assume you start with a single-entry hash table. You rehash for the
first time after inserting the first item (giving hash table of size 2),
then after the second item (=> size 4), then after the fourth item (=> size 8)
and so on. I.e., when you insert n items, the total cost of rehashing summed
over all the insertions is at most 1 + 2 + 4 + 8 + 16 + ... + 2^k (where
k=floor(log2(n))) <= 2^k+1 = O(n). That is O(1) operations per item inserted.
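The geometric-series argument is easy to verify with a toy counter (purely
illustrative; the function name and the start-at-size-1 model are mine):

```c
/* Count the total item moves caused by rehashing while inserting n
 * items into a table that starts at size 1 and doubles whenever it
 * fills.  The total 1 + 2 + 4 + ... stays below 2*n, so the amortized
 * rehashing cost per insertion is O(1). */
static long total_rehash_moves(long n)
{
    long size = 1, moves = 0, i;

    for (i = 1; i <= n; i++) {
        if (i > size) {       /* table full: double it, re-hash all */
            moves += size;    /* every existing item gets moved */
            size *= 2;
        }
    }
    return moves;
}
```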

Have a nice fortnight
-- 
Martin `MJ' Mares <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> http://atrey.karlin.mff.cuni.cz/~mj/
MIPS: Meaningless Indicator of Processor Speed.



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author:Martin Mares <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
>
> Hello!
> 
> > To have O(1) you've to have the number of hash entries > number of files and a
> > really good hasing function.
> 
> No, if you enlarge the hash table twice (and re-hash everything) every time the
> table fills up, the load factor of the table keeps small and everything is O(1)
> amortized, of course if you have a good hashing function. If you are really
> smart and re-hash incrementally, you can get O(1) worst case complexity, but
> the multiplicative constant is large.
> 

Not true.  The rehashing is O(n) and it has to be performed O(log n)
times during insertion.  Therefore, insertion is O(log n).

-hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Martin Mares

Hello!

> True.  Note too, though, that on a filesystem (which we are, after all,
> talking about), if you assume a large linear space you have to create a
> file, which means you need to multiply the cost of all random-access
> operations with O(log n).

One could avoid this, but it would mean designing the whole filesystem in a
completely different way -- merge all directories to a single gigantic
hash table and use (directory ID,file name) as a key, but we were originally
talking about extending ext2, so such massive changes are out of question
and your log n access argument is right.

Have a nice fortnight
-- 
Martin `MJ' Mares <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> http://atrey.karlin.mff.cuni.cz/~mj/
COBOL -- Completely Outdated, Badly Overused Language



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Martin Mares

Hello!

> You're right.  However, for each hash table operation to be O(1) the size
> of the hash table must be >> n.

If we are talking about average case complexity (which is the only possibility
with fixed hash function and arbitrary input keys), it suffices to have
hash table size >= c*n for some constant c which gives O(1/c) cost of
all operations.
 
> I suggested at one point to use B-trees with a hash value as the key. 
> B-trees are extremely efficient when used on a small constant-size key.

Although from asymptotic complexity standpoint hashing is much better
than B-trees, I'm not sure at all what will give the best performance for
reasonable directory sizes. Maybe the B-trees are really the better
alternative as they are updated dynamically and the costs of successive
operations are similar as opposed to hashing which is occasionally very
slow due to rehashing unless you try to rehash on-line, but I don't
know any algorithm for on-line rehashing with both inserts and deletes
which wouldn't be awfully complex and slow (speaking of multiplicative
constants, of course -- it's still O(1) per operation, but "the big Oh
is really big there").

Have a nice fortnight
-- 
Martin `MJ' Mares <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> http://atrey.karlin.mff.cuni.cz/~mj/
"#define QUESTION ((bb) || !(bb))"  -- Shakespeare



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Mark Hahn wrote:
> 
> > You're right.  However, for each hash table operation to be O(1) the size
> > of the hash table must be >> n.
> 
> there's at least one kind of HT where the table starts small
> and gets bigger, but at trivial cost (memcpy).  while those
> memcpy's are O(n) each time, it's a little misleading to treat
> them as costing the same as O(n) rehashing.
> 

memcpy() isn't exactly trivial, especially not when we're talking about
disk storage.  Note, too, that we're talking about storage in a
filesystem, and random access to a large, growable linear space (i.e. a
file) in a filesystem is O(log n) because of the necessary inode indirection.

That's yet another reason I like the idea of using B-trees over hash
values: B-trees are O(log n), but do not need the file inode indirection
to do the job, so what you end up with is very nice and fast.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Martin Mares

Hello!

> My personal preference goes to skip lists because they don't have fixed
> (or growing) tables to handle. You simply have a stub of data together
> with the FS data in each direntry.

Another problem with skip lists is that they require variable sized nodes,
so you either need to keep free chunk lists and lose some space in deleted
nodes kept in these lists, or you choose to shift remaining nodes which is
slow and complicated as you need to keep the inter-node links right. With
hashing, you can separate the control part of the structure and the actual
data and shift data while leaving most of the control part intact.

> And performance ( O(log2(n)) ) is the same for whatever number of entries.

I don't understand this complexity estimate -- it cannot be the same for
whatever number of entries as the complexity function depends on the number
of entries.

Have a nice fortnight
-- 
Martin `MJ' Mares <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> http://atrey.karlin.mff.cuni.cz/~mj/
P.C.M.C.I.A. stands for `People Can't Memorize Computer Industry Acronyms'



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Davide Libenzi


On 21-Feb-2001 Martin Mares wrote:
> Hello!
> 
>> My personal preference goes to skip lists because they don't have
>> fixed (or growing) tables to handle. You simply have a stub of data
>> together with the FS data in each direntry.
> 
> Another problem with skip lists is that they require variable sized nodes,
> so you either need to keep free chunk lists and lose some space in deleted
> nodes kept in these lists, or you choose to shift remaining nodes which is
> slow and complicated as you need to keep the inter-node links right. With
> hashing, you can separate the control part of the structure and the actual
> data and shift data while leaving most of the control part intact.

An entry in the skip list table is a u32 direntry offset, and you don't have
to keep free entries; the height of a node simply changes depending on the
number of entries.


>> And performance ( O(log2(n)) ) is the same for whatever number of entries.
> 
> I don't understand this complexity estimate -- it cannot be the same for
> whatever number of entries as the complexity function depends on the number
> of entries.

n == number of entries

By "constant" I meant that the formula, not the result, is the same.



- Davide




Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread H. Peter Anvin

Martin Mares wrote:
> 
> Hello!
> 
> > You're right.  However, for each hash table operation to be O(1) the size
> > of the hash table must be >> n.
> 
> If we are talking about average case complexity (which is the only possibility
> with fixed hash function and arbitrary input keys), it suffices to have
> hash table size >= c*n for some constant c which gives O(1/c) cost of
> all operations.
> 

True.  Note too, though, that on a filesystem (which we are, after all,
talking about), if you assume a large linear space you have to create a
file, which means you need to multiply the cost of all random-access
operations by O(log n).

> > I suggested at one point to use B-trees with a hash value as the key.
> > B-trees are extremely efficient when used on a small constant-size key.
> 
> Although from an asymptotic complexity standpoint hashing is much better
> than B-trees, I'm not at all sure what will give the best performance for
> reasonable directory sizes. Maybe B-trees really are the better
> alternative: they are updated dynamically, and the costs of successive
> operations are similar, as opposed to hashing, which is occasionally very
> slow due to rehashing unless you try to rehash on-line. But I don't know
> of any algorithm for on-line rehashing with both inserts and deletes that
> wouldn't be awfully complex and slow (speaking of multiplicative
> constants, of course -- it's still O(1) per operation, but "the big Oh
> is really big there").

Well, once you multiply with O(log n) for the file indirection (which
B-trees don't need, since they inherently handle blocking and thus can
use block pointers directly) then the asymptotic complexity is the same
as well, and I think the B-trees are the overall winner.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Davide Libenzi


On 21-Feb-2001 Martin Mares wrote:
> Hello!
> 
>> To have O(1) you have to have the number of hash entries > the number of
>> files and a really good hashing function.
> 
> No, if you double the size of the hash table (and re-hash everything) every
> time the table fills up, the load factor of the table stays small and
> everything is O(1) amortized, given a good hashing function. If you are
> really smart and re-hash incrementally, you can get O(1) worst case
> complexity, but the multiplicative constant is large.

My personal preference goes to skip lists because they don't have fixed (or
growing) tables to handle. You simply have a stub of data together with the
FS data in each direntry.
And performance ( O(log2(n)) ) is the same for whatever number of entries.




- Davide




Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Martin Mares

Hello!

> To have O(1) you have to have the number of hash entries > the number of
> files and a really good hashing function.

No, if you double the size of the hash table (and re-hash everything) every
time the table fills up, the load factor of the table stays small and
everything is O(1) amortized, given a good hashing function. If you are really
smart and re-hash incrementally, you can get O(1) worst case complexity, but
the multiplicative constant is large.
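The doubling scheme above can be sketched in a few lines of user-space C.
This is purely illustrative (the open-addressing layout, the multiplicative
hash constant, and the 1/2 load-factor threshold are my assumptions, not
anything proposed for ext2): each element is re-hashed at most once per
doubling, so inserts are O(1) amortized.

```c
/* Sketch only: a doubling open-addressing hash table for nonzero keys.
 * When the load factor would exceed 1/2, the table doubles in size and
 * every key is re-hashed into the new table. */
#include <assert.h>
#include <stdlib.h>

struct htab {
	unsigned *slot;		/* 0 means "empty" */
	size_t size;		/* always a power of two */
	size_t used;
};

static void ht_insert(struct htab *h, unsigned key);

static void ht_grow(struct htab *h)
{
	struct htab big = { calloc(h->size * 2, sizeof(unsigned)),
			    h->size * 2, 0 };
	for (size_t i = 0; i < h->size; i++)
		if (h->slot[i])
			ht_insert(&big, h->slot[i]);	/* re-hash */
	free(h->slot);
	*h = big;
}

static void ht_insert(struct htab *h, unsigned key)
{
	if (2 * (h->used + 1) > h->size)	/* keep load factor <= 1/2 */
		ht_grow(h);
	size_t i = (key * 2654435761u) & (h->size - 1);
	while (h->slot[i] && h->slot[i] != key)	/* linear probing */
		i = (i + 1) & (h->size - 1);
	if (!h->slot[i]) {
		h->slot[i] = key;
		h->used++;
	}
}

static int ht_find(struct htab *h, unsigned key)
{
	size_t i = (key * 2654435761u) & (h->size - 1);
	while (h->slot[i]) {
		if (h->slot[i] == key)
			return 1;
		i = (i + 1) & (h->size - 1);
	}
	return 0;
}
```

Initialise with e.g. `struct htab h = { calloc(8, sizeof(unsigned)), 8, 0 };`
and insert away; the low load factor is what keeps each probe sequence short.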

> To be sincere, here is pretty daylight :)

:)
Martin



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Martin Mares

Hello!

> Have you tried to use skip lists?
> In '93 I coded skip-list-based directory access for Minix and it gave very
> interesting performance.
> Skip lists have linked-list-like performance when scanned linearly, and
> overall good performance for insertion/seek/delete.

Skip list search/insert/delete is O(log N) on average, as skip lists are just
a dynamic version of interval bisection. Good hashing is O(1).
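The O(log N) expected behaviour comes from the level structure: searches
descend from the sparsest level, halving (in expectation) the remaining range
at each level. A toy user-space version (the textbook coin-flip variant, my
assumption -- not the Minix code mentioned above) looks like this:

```c
/* Illustrative skip list over int keys.  Each node gets a random level;
 * level i links skip over roughly 2^i nodes, so search is O(log N)
 * expected. */
#include <assert.h>
#include <stdlib.h>

#define MAX_LEVEL 16

struct node {
	int key;
	struct node *next[MAX_LEVEL];
};

struct slist {
	struct node head;	/* sentinel; its key is never compared */
	int level;		/* highest level currently in use, >= 1 */
};

static int random_level(void)
{
	int lvl = 1;
	while (lvl < MAX_LEVEL && (rand() & 1))	/* coin flips */
		lvl++;
	return lvl;
}

static void sl_insert(struct slist *sl, int key)
{
	struct node *update[MAX_LEVEL];
	struct node *x = &sl->head;
	for (int i = sl->level - 1; i >= 0; i--) {
		while (x->next[i] && x->next[i]->key < key)
			x = x->next[i];
		update[i] = x;	/* rightmost node before key, per level */
	}
	int lvl = random_level();
	if (lvl > sl->level) {
		for (int i = sl->level; i < lvl; i++)
			update[i] = &sl->head;
		sl->level = lvl;
	}
	struct node *n = calloc(1, sizeof *n);
	n->key = key;
	for (int i = 0; i < lvl; i++) {
		n->next[i] = update[i]->next[i];
		update[i]->next[i] = n;
	}
}

static int sl_find(struct slist *sl, int key)
{
	struct node *x = &sl->head;
	for (int i = sl->level - 1; i >= 0; i--)
		while (x->next[i] && x->next[i]->key < key)
			x = x->next[i];
	x = x->next[0];
	return x && x->key == key;
}
```

Initialise with `struct slist sl = { {0}, 1 };`. Note the variable-height
nodes: on disk that is exactly the variable-sized-record problem raised
elsewhere in this thread.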

Have a nice fortnight
-- 
Martin `MJ' Mares <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> http://atrey.karlin.mff.cuni.cz/~mj/
Entropy isn't what it used to be.



Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Davide Libenzi


On 21-Feb-2001 Martin Mares wrote:
> Hello!
> 
>> Have you tried to use skip lists?
>> In '93 I coded skip-list-based directory access for Minix and it gave
>> very interesting performance.
>> Skip lists have linked-list-like performance when scanned linearly, and
>> overall good performance for insertion/seek/delete.
> 
> Skip list search/insert/delete is O(log N) on average, as skip lists are
> just a dynamic version of interval bisection. Good hashing is O(1).

To have O(1) you have to have the number of hash entries > the number of
files and a really good hashing function.



> 
>   Have a nice fortnight

To be sincere, here is pretty daylight :)



- Davide




Re: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Daniel Phillips

On Wed, 21 Feb 2001, Bernd Eckenfels wrote:
> In article <01022100361408.18944@gimli> you wrote:
> > But actually, rm is not the problem; it's open and create.  To do a
> > create you have to make sure the file doesn't already exist, and
> > without an index you have to scan on average half the directory file.
> 
> Unless you use a file system which is better at that, like Reiser-FS. Of
> course an even better solution is to distribute those files into hashed subdirs.

Ahem.  Please read the first post in the thread. ;-)

-- 
Daniel



RE: [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread Davide Libenzi


On 20-Feb-2001 Daniel Phillips wrote:
> Earlier this month a runaway installation script decided to mail all its
> problems to root.  After a couple of hours the script aborted, having
> created 65535 entries in Postfix's maildrop directory.  Removing those
> files took an awfully long time.  The problem is that Ext2 does each
> directory access using a simple, linear search through the entire
> directory file, resulting in n**2 behaviour to create/delete n files. 
> It's about time we fixed that.
> 
> Last fall in Miami, Ted Ts'o mentioned some ideas he was playing with
> for an Ext2 directory index, including the following points:
> 
>   - Fixed-size hash keys instead of names in the index
>   - Leaf blocks are normal ext2 directory blocks
>   - Leaf blocks are sequential, so readdir doesn't have to be changed

Have you tried to use skip lists?
In '93 I coded skip-list-based directory access for Minix and it gave very
interesting performance.
Skip lists have linked-list-like performance when scanned linearly, and
overall good performance for insertion/seek/delete.
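The quadratic blow-up quoted above is easy to see with a back-of-envelope
cost model (pure illustration, not ext2 code): creating file i in a linear
directory scans the i entries already present, so n creates cost about
n**2/2 scans in total.

```c
/* Cost model only: sum of per-create scan lengths for a linear
 * directory, i.e. 0 + 1 + ... + (n-1) = n*(n-1)/2. */
#include <assert.h>

static long long linear_create_cost(long long n)
{
	long long total = 0;
	for (long long i = 0; i < n; i++)
		total += i;	/* creating file i scans i entries */
	return total;
}
```

For the 65535-entry maildrop from the original post that is roughly 2*10**9
entry comparisons just to create the files, before a single one is removed.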




- Davide



