Re: [PATCH] various allocator optimizations

2003-03-14 Thread Manuel Krause
On 03/14/2003 02:34 AM, Chris Mason wrote:
On Thu, 2003-03-13 at 19:15, Hans Reiser wrote:

[ discussion on how to implement lower fragmentation on ReiserFS ]

Let's get lots of different testers.  You may have a nice heuristic
here, though.



If everyone agrees the approach is worth trying, I'll make a patch that
enables it via a mount option.
[...]

A dumb question in between: how do we - as possible testers and users -
get information about fragmentation on our ReiserFS partitions?

Thanks,

Manuel



Re: [PATCH] various allocator optimizations

2003-03-14 Thread Chris Mason
On Fri, 2003-03-14 at 08:59, Manuel Krause wrote:
 On 03/14/2003 02:34 AM, Chris Mason wrote:
  On Thu, 2003-03-13 at 19:15, Hans Reiser wrote:

  [ discussion on how to implement lower fragmentation on ReiserFS ]

  Let's get lots of different testers.  You may have a nice heuristic
  here, though.

  If everyone agrees the approach is worth trying, I'll make a patch that
  enables it via a mount option.

 [...]

 A dumb question in between: how do we - as possible testers and users -
 get information about fragmentation on our ReiserFS partitions?

The best tool I've seen so far originally came from Vladimir and was
modified for a study on fragmentation of reiserfs and ext2, Jeff found
the link somewhere in his archives:

http://www.informatik.uni-frankfurt.de/~loizides/reiserfs/index.html

There are also filesystem aging tools there that I haven't played with
yet.
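
For a quick number without the full tool, a minimal sketch along these
lines counts discontiguous extents per file with the generic FIBMAP
ioctl (illustrative only, not the tool above; it needs root, and blocks
reported as 0 - holes, or tails packed into the reiserfs tree - are
skipped):

/* fragcount.c -- rough per-file fragment counter using the generic
 * FIBMAP ioctl.  Needs root. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>           /* FIBMAP, FIGETBSZ */

int main(int argc, char **argv)
{
    int fd, bsz, frags = 0;
    long last = -2;
    unsigned long i, nblocks;
    struct stat st;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0)
        return perror("open"), 1;
    if (ioctl(fd, FIGETBSZ, &bsz) < 0 || fstat(fd, &st) < 0)
        return perror("FIGETBSZ/fstat"), 1;

    nblocks = (st.st_size + bsz - 1) / bsz;
    for (i = 0; i < nblocks; i++) {
        int blk = i;                 /* in: logical block, out: physical */
        if (ioctl(fd, FIBMAP, &blk) < 0)
            return perror("FIBMAP"), 1;
        if (blk == 0)                /* hole or packed tail */
            continue;
        if (blk != last + 1)         /* discontiguity: new fragment */
            frags++;
        last = blk;
    }
    printf("%s: %lu blocks, %d fragments\n", argv[1], nblocks, frags);
    close(fd);
    return 0;
}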

-chris




Re: [PATCH] various allocator optimizations

2003-03-14 Thread Hans Reiser
Chris Mason wrote:

On Fri, 2003-03-14 at 05:26, Hans Reiser wrote:

 

That would mean the parent directory counter would have to be updated
every time we allocated a block in any subdirectory.  Plus the counter
would have to be inherited down the chain in deep directory structures.
More importantly, I'd rather not waste space in the stat data to store
the information when we can get it during a search ;-)

 

The space usage is trivial.

Grin, who are you and what have you done with the real Hans ;-)

You don't need it for every file, you need it for every directory.

It's
two fields, one for the counter and one to point up the chain to the
real owner.  It's yet another field to maintain as objects are deleted
and created, 

or written to or truncated, yes, the cost of lots of updates to this is
worrying.  It might be better done in the repacker than dynamically; in
fact I just convinced myself of that.  How about you?

a minor format change since old filesystem stat data won't
have the field, and requires support from fsck.
Nobody will mind if we change the reiser4 format now.

All of which is a lot of work when we can get similar info directly from
the tree.
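
For concreteness, the two fields Hans describes might look something
like this (a hypothetical sketch with invented names, not an actual
reiser4 stat-data layout):

#include <stdint.h>

/* Hypothetical stat-data extension, for directories only: a rollup
 * counter plus a pointer up the chain to the real owner of the
 * packing decision.  Field names are invented for illustration. */
struct sd_subtree_hint {
    uint64_t sd_subtree_blocks;  /* blocks allocated under this directory */
    uint64_t sd_owner_objectid;  /* ancestor directory that acts as the
                                  * packing locality for this subtree */
};

Every allocation, truncation, or delete under the subtree would have to
propagate an update to sd_subtree_blocks, which is exactly the
maintenance cost being weighed here against computing the same number
during a tree search.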
 

How big are your packing localities tending to be?
   

Not more than can be pointed to by the leaf level and the level directly
above it.  I know that's not very specific, but it varies by the
dataset.  Packed tails and long directory names lead to more packing
localities per MB.
 

Which is why it is the wrong measure, yes?

   

Well, yes and no.  The packing locality groups tree objects, and so the
idea behind the patch is to group all tree objects when they are part of
a directory tree that isn't very large.  A smart block allocator for the
tree nodes can use this information too.
In other words, my hope is this patch also makes btree searches more
efficient while walking a given directory tree, since we aren't jumping
all over the btree for each subdirectory.
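
One possible shape for that (invented names and a made-up hash, not the
actual patch): derive the allocator's starting hint from the packing
locality rather than the individual object id, so every tree node
belonging to one small directory tree begins its bitmap search in the
same region:

/* Sketch: hash the packing locality (the grouping key) to a disk
 * region and start the free-block search there, so formatted nodes
 * and leaves of one directory tree land near each other. */
#define GOLDEN_RATIO_32 0x61C88647UL    /* Fibonacci hashing constant */

static unsigned long hint_from_locality(unsigned long packing_locality,
                                        unsigned long blocks_on_disk)
{
    /* same locality -> same starting region; different localities
     * spread roughly evenly across the disk */
    return (packing_locality * GOLDEN_RATIO_32) % blocks_on_disk;
}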
-chris



 



--
Hans



Re: [PATCH] various allocator optimizations

2003-03-11 Thread Hans Reiser
Chris Mason wrote:

On Tue, 2003-03-11 at 11:42, Oleg Drokin wrote:
 

Hello!

On Tue, Mar 11, 2003 at 11:34:43AM -0500, Chris Mason wrote:

   

changes blocknrs_and_prealloc_arrays_from_search_start into three
passes.  pass1 goes from the hint to the end of the disk, pass2 goes
from the border to the hint, and pass3 goes from the start of the disk
to the border.
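
In outline, the three passes order the bitmap search like this (a
sketch with a stand-in scan helper, not the patch itself):

typedef unsigned long b_blocknr_t;      /* as in reiserfs */

/* stand-in for the real bitmap scan over [from, to);
 * assumed to return a free block number, or 0 on failure */
extern b_blocknr_t scan_bitmap_range(b_blocknr_t from, b_blocknr_t to);

static b_blocknr_t three_pass_search(b_blocknr_t hint, b_blocknr_t border,
                                     b_blocknr_t blocks_on_disk)
{
    b_blocknr_t blk;

    /* pass 1: hint -> end of disk */
    if ((blk = scan_bitmap_range(hint, blocks_on_disk)) != 0)
        return blk;
    /* pass 2: border -> hint */
    if ((blk = scan_bitmap_range(border, hint)) != 0)
        return blk;
    /* pass 3: start of disk -> border */
    return scan_bitmap_range(0, border);
}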
 

As you probably remember, we decided to drop the border stuff altogether
because of all the extra seeking it incurs.
   

The border does do extra seeks for some cases (search_reada helps), but
no border at all spreads tree blocks all over.  That too does a lot of
seeks, since leaves and the formatted nodes that point to them might be
in entirely different areas of the disk.
 
 

Overall, I believe this will significantly reduce fragmentation over
time.  oid_groups should only be used if your FS has a small number of
 

I hope we won't have read-access speed degradation with these.
   

It does, but so does skip_busy alone.  You don't see the problem with
skip_busy during a mongo run, but run a stress.sh -n 1 data set that
uses 50% of the disk for a few hours and then run mongo again without
deleting the stress.sh data set.
The 2.4.20 default is great on a clean FS but breaks down over time,
just like the 2.4.19 allocator did.  Various people have demonstrated it
with benchmarks.
-chris



 

Chris, don't you think the right answer would be to take zam's resizer 
and make a defragmenter out of it?

--
Hans



Re: [PATCH] various allocator optimizations

2003-03-11 Thread Hans Reiser
Chris Mason wrote:

On Tue, 2003-03-11 at 16:42, Hans Reiser wrote:

 

Chris, don't you think the right answer would be to take zam's resizer 
and make a defragmenter out of it?
   

Yes and no, for a defrag program to fix things we'd have to agree on an
optimal layout ;-)  Also it assumes the machine has idle time when a
defragment cycle is possible. 

No, it assumes that 80% of files don't move during the course of a week, 
so if defrag takes a week, it still adds value.

For many servers this is entirely
untrue... the Oracle boxes I ran didn't have a spare second for
something like a defrag.
We can all agree that fragmentation is bad, but the real question is how
we group the blocks.  Let's pretend for a minute that fragmentation
isn't an issue at all, and our allocator is perfect.
The optimal grouping for reading/writing files is to have the files you
are going to read/write together in the same area of the disk.
The current default uses the start of the disk as a starting point for
each new file.
No, it uses the left neighbor in the tree.  Please correct me if I am 
wrong, because if I am wrong we have a bug.

 This roughly translates to: files that are created
together end up in the same part of the disk.  As long as you always
access files in roughly the same order that you created them, it
performs pretty well.
But if a process creates dirA/file1 and then dirB/file2, file1 and file2
are going to be together on the disk.  If file1 tends to be used along
with all the other files in dirA, performance will suffer because we've
got to seek from all the other files in dirA over to file1.
If I understand your intended statement, you meant to say:

If file1 tends to be used along
with all the other files in dirA, performance will suffer because we've
got to seek over all the other files in dirB when going from file1 to
the next file in dirA.
And this is what we see over time: our performance decreases as people
add files to their directories and shift things around.  Especially on
multi-user systems, files are rarely accessed in the same order they
were created.
What we need is a knob for the admin to use to suggest 'I'm probably
going to access these files together'.  The only one I can think of is
the directory itself, but it isn't optimal either, since subdirectories
are frequently accessed with their parents and with other subdirs.
In 1994, we realized that putting the grandparent directory into the key 
was infeasible, and decided we would just leave it for some future 
repacker to try to locate subdirectories of the same directory 
together.  We decided that locating files within the same directory near 
each other was good enough.  I still think this is correct.
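
That 1994 decision is visible in the key layout itself: keys compare
field by field, and the parent directory id comes first, so items for
files in the same directory sort next to each other in the tree
(a simplified view; the real struct packs the offset and item type
together):

#include <stdint.h>

/* Simplified view of a reiserfs key.  Sorting compares k_dir_id
 * first, so all items whose files share a parent directory are
 * adjacent in the tree -- the "packing locality" in this thread. */
struct reiserfs_key_simplified {
    uint32_t k_dir_id;     /* id of the parent directory */
    uint32_t k_objectid;   /* id of the object itself */
    uint64_t k_offset;     /* offset within the object (simplified) */
};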

-chris



 



--
Hans