Re: Btrfs: broken file system design

2010-06-25 Thread Andi Kleen
Daniel Taylor daniel.tay...@wdc.com writes:

 As long as no object smaller than the disk block size is ever
 flushed to media, and all flushed objects are aligned to the disk
 blocks, there should be no real performance hit from that.

The question is just how large such a block needs to be.
Traditionally some RAID controllers (and possibly some SSDs now)
needed very large blocks, up to several MB.
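As an aside, the effect of a large preferred block can be sketched with a bit of arithmetic (illustrative only, not from the thread; the block sizes are assumptions): any allocation gets rounded up to the device's block, so small objects waste the remainder.

```python
def round_up_to_block(size, block_size):
    """Round an allocation up to a multiple of the device's
    preferred I/O unit (e.g. a 4K physical sector, or an
    MB-scale RAID stripe)."""
    return -(-size // block_size) * block_size

# A 2K object on a 4K-sector drive still occupies a full 4K block;
# the same object on a 1MB-stripe RAID would occupy 1MB if allocations
# were stripe-aligned.
```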


 Otherwise we end up with the damage seen in the ext[234] family, where
 the file blocks can be aligned, but the 1K inode updates cause
 read-modify-write (RMW) cycles and cost a 10% performance
 hit for creation/update of large numbers of files.

Fixing that doesn't require a new file system layout, just some effort
to read/write inodes in batches of multiple of them. XFS has done similar
things for a long time; I believe there were some efforts toward this
for ext4 too.
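A minimal sketch of the batching idea (hypothetical, not XFS's or ext4's actual implementation; the 1K-inode/4K-block sizes are assumptions matching the ext example above): group dirty inodes by the disk block they share, so each block is written once instead of incurring one RMW per inode.

```python
INODE_SIZE = 1024   # assumed ext-style 1K on-disk inode
BLOCK_SIZE = 4096   # assumed 4K physical sector

def group_dirty_inodes(dirty_inode_numbers):
    """Group dirty inode numbers by the disk block they live in.
    Writing each group as one full-block write avoids a separate
    read-modify-write cycle per inode."""
    per_block = BLOCK_SIZE // INODE_SIZE   # 4 inodes per block here
    groups = {}
    for ino in sorted(dirty_inode_numbers):
        groups.setdefault(ino // per_block, []).append(ino)
    return groups
```

With inodes 0, 1, 4, 5 dirty, this yields two block writes instead of four RMW cycles.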

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-25 Thread Ric Wheeler

On 06/24/2010 06:06 PM, Daniel Taylor wrote:

-----Original Message-----
From: mikefe...@gmail.com [mailto:mikefe...@gmail.com] On
Behalf Of Mike Fedyk
Sent: Wednesday, June 23, 2010 9:51 PM
To: Daniel Taylor
Cc: Daniel J Blueman; Mat; LKML;
linux-fsde...@vger.kernel.org; Chris Mason; Ric Wheeler;
Andrew Morton; Linus Torvalds; The development of BTRFS
Subject: Re: Btrfs: broken file system design (was Unbound(?)
internal fragmentation in Btrfs)

On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
daniel.tay...@wdc.com  wrote:

Just an FYI reminder.  The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others.  Some of the
SSDs have larger (16K) or smaller (2K) blocks.  There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).

The absolute minimum allocation size for data should be the same
as, and aligned with, the underlying disk block size.  If that
results in underutilization, I think that's a good thing for
performance, compared to read-modify-write cycles to update
partial disk blocks.

Block size = 4k

Btrfs packs smaller objects into the blocks in certain cases.


As long as no object smaller than the disk block size is ever
flushed to media, and all flushed objects are aligned to the disk
blocks, there should be no real performance hit from that.

Otherwise we end up with the damage seen in the ext[234] family, where
the file blocks can be aligned, but the 1K inode updates cause
read-modify-write (RMW) cycles and cost a 10% performance
hit for creation/update of large numbers of files.

An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
drive), which is painful.
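The 11 msec figure falls straight out of the rotation rate; a quick check (illustrative arithmetic only):

```python
def rotation_ms(rpm):
    # Time for one full platter rotation, in milliseconds.
    # An RMW that must wait a full rotation before the write
    # lands pays roughly this much latency.
    return 60_000.0 / rpm

# 5400 RPM -> ~11.1 ms; 7200 RPM -> ~8.3 ms
```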

Also interesting to note is that you can get significant overhead 
even with 0-byte-length files. Path names, per-file metadata, etc. can 
consume (depending on the path name length) quite a bit of space per file.
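A hypothetical back-of-the-envelope model of that overhead (the inode size and directory-entry layout here are assumptions, not any real filesystem's format): even a 0-byte file pays for an inode plus a directory entry whose size grows with the name.

```python
def estimated_metadata_bytes(name, inode_size=256, dirent_header=8):
    """Hypothetical estimate: one inode plus a directory entry
    (fixed header + name, rounded up to an 8-byte record boundary).
    Real filesystems differ; this only illustrates that a 0-byte
    file still consumes space proportional to its name length."""
    dirent = dirent_header + len(name)
    dirent += -dirent % 8        # round up to 8-byte record boundary
    return inode_size + dirent
```

Under these assumptions a one-character name costs 272 bytes, and longer path components push it higher.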


Ric
