Re: [PATCH] various allocator optimizations
On 03/14/2003 02:34 AM, Chris Mason wrote:
> On Thu, 2003-03-13 at 19:15, Hans Reiser wrote:
>> [ discussion on how to implement lower fragmentation on ReiserFS ]
>>
>> Let's get lots of different testers. You may have a nice heuristic
>> here though.
>
> If everyone agrees the approach is worth trying, I'll make a patch
> that enables it via a mount option.
[...]

A dumb question in between: how do we - possible testers, users - get
information about fragmentation on our ReiserFS partitions?

Thanks,
Manuel
Re: [PATCH] various allocator optimizations
On Fri, 2003-03-14 at 08:59, Manuel Krause wrote:
> On 03/14/2003 02:34 AM, Chris Mason wrote:
>> On Thu, 2003-03-13 at 19:15, Hans Reiser wrote:
>>> [ discussion on how to implement lower fragmentation on ReiserFS ]
>>>
>>> Let's get lots of different testers. You may have a nice heuristic
>>> here though.
>>
>> If everyone agrees the approach is worth trying, I'll make a patch
>> that enables it via a mount option.
> [...]
>
> A dumb question in between: how do we - possible testers, users - get
> information about fragmentation on our ReiserFS partitions?

The best tool I've seen so far originally came from Vladimir and was
modified for a study on fragmentation of reiserfs and ext2. Jeff found
the link somewhere in his archives:

http://www.informatik.uni-frankfurt.de/~loizides/reiserfs/index.html

There is also a filesystem aging tool there that I haven't played with
yet.

-chris
Re: [PATCH] various allocator optimizations
Chris Mason wrote:
> On Fri, 2003-03-14 at 05:26, Hans Reiser wrote:
>>> That would mean the parent directory counter would have to be
>>> updated every time we allocated a block in any sub directory. Plus
>>> the counter would have to be inherited down the chain in deep
>>> directory structures. More importantly, I'd rather not waste space
>>> in the stat data to store the information when we can get it during
>>> a search ;-)
>>
>> The space usage is trivial.
>
> Grin, who are you and what have you done with the real hans ;-)
>
> You don't need it for every file, you need it for every directory.
> It's two fields, one for the counter and one to point up the chain to
> the real owner. It's yet another field to maintain as objects are
> deleted and created, or written to or truncated,

Yes, the cost of lots of updates to this is worrying. It might be
better done in the repacker than dynamically; in fact, I just
convinced myself of that. How about you?

> a minor format change since old filesystem stat data won't have the
> field, and requires support from fsck.

Nobody will mind if we change reiser4 format now.

> All of which is a lot of work when we can get similar info directly
> from the tree.
>
>> How big are your packing localities tending to be?
>
> Not more than can be pointed to by the leaf level and the level
> directly above it. I know that's not very specific, but it varies by
> the dataset. Packed tails and long directory names lead to more
> packing localities per MB.
>
>> Which is why it is the wrong measure, yes?
>
> Well, yes and no. The packing locality groups tree objects, and so
> the idea behind the patch is to group all tree objects when they are
> part of a directory tree that isn't very large. A smart block
> allocator for the tree nodes can use this information too.
>
> In other words, my hope is this patch also makes btree searches more
> efficient while walking a given directory tree, since we aren't
> jumping all over the btree for each subdirectory.
>
> -chris

--
Hans
Re: [PATCH] various allocator optimizations
Chris Mason wrote:
> On Tue, 2003-03-11 at 11:42, Oleg Drokin wrote:
>> Hello!
>>
>> On Tue, Mar 11, 2003 at 11:34:43AM -0500, Chris Mason wrote:
>>> changes blocknrs_and_prealloc_arrays_from_search_start into three
>>> passes. pass1 goes from the hint to the end of the disk, pass2 goes
>>> from the border to the hint, and pass3 goes from the start of the
>>> disk to the border.
>>
>> As you probably remember, we decided to drop border stuff altogether
>> because of all the extra seeking it incurs.
>
> The border does do extra seeks for some cases (search_reada helps),
> but no border at all spreads tree blocks all over. That too does a
> lot of seeks, since leaves and the formatted nodes that point to them
> might be on entirely different areas of the disk. Overall, I believe
> this will significantly improve fragmentation over time.
>
>>> oid_groups should only be used if your FS has a small number of
>>
>> I hope we won't have read-access speed degradation with these.
>
> It does, but so does skip_busy alone. You don't see the problem with
> skip_busy during a mongo run, but run stress.sh -n 1 with a data set
> that uses 50% of the disk for a few hours and then run mongo again
> without deleting the stress.sh data set.
>
> The 2.4.20 default is great on a clean FS but breaks down over time,
> just like the 2.4.19 allocator did. Various people have demonstrated
> it with benchmarks.
>
> -chris

Chris, don't you think the right answer would be to take zam's resizer
and make a defragmenter out of it?

--
Hans
Re: [PATCH] various allocator optimizations
Chris Mason wrote:
> On Tue, 2003-03-11 at 16:42, Hans Reiser wrote:
>>>> Chris, don't you think the right answer would be to take zam's
>>>> resizer and make a defragmenter out of it?
>>>
>>> Yes and no, for a defrag program to fix things we'd have to agree
>>> on an optimal layout ;-) Also it assumes the machine has idle time
>>> when a defragment cycle is possible.
>>
>> No, it assumes that 80% of files don't move during the course of a
>> week, so if defrag takes a week, it still adds value.
>
> For many servers this is entirely untrue... the oracle boxes I ran
> didn't have a spare second for something like a defrag.
>
> We can all agree that fragmentation is bad, but the real question is
> how we group the blocks. Let's pretend for a minute that
> fragmentation isn't an issue at all, and our allocator is perfect.
> The optimal grouping for reading/writing files is to have the files
> you are going to read/write together in the same area of the disk.
>
> The current default uses the start of the disk as a starting point
> for each new file.

No, it uses the left neighbor in the tree. Please correct me if I am
wrong, because if I am wrong we have a bug.

> This roughly translates to files that are created together ending up
> in the same part of the disk. As long as you always access files in
> roughly the same order that you create them, it performs pretty well.
> But if a process creates dirA/file1 and then dirB/file2, file1 and
> file2 are going to be together on the disk. If file1 tends to be used
> along with all the other files in dirA, performance will suffer
> because we've got to seek from all the other files in dirA over to
> file1.

If I understand your intended statement, you meant to say: "If file1
tends to be used along with all the other files in dirA, performance
will suffer because we've got to seek over all the other files in dirB
when going from file1 to the next file in dirA."

> And this is what we see over time: our performance decreases as
> people add files to their directories and shift things around.
> Especially on multi-user systems, files are rarely accessed in the
> same order they were created.
>
> What we need is a knob for the admin to use to suggest "I'm probably
> going to access these files together". The only one I can think of is
> the directory itself, but it isn't optimal either, since
> subdirectories are frequently accessed with their parents and with
> other subdirs.

In 1994, we realized that putting the grandparent directory into the
key was infeasible, and decided we would just leave it for some future
repacker to try to locate subdirectories of the same directory
together. We decided that locating files within the same directory
near each other was good enough. I still think this is correct.

> -chris

--
Hans