Re: BTRFS SSD

2010-09-30 Thread Sander
Yuehai Xu wrote (ao):
 So, is it a bottleneck in the case of SSD, since the cost of an
 overwrite is very high? For every write, I think the superblocks have to
 be overwritten, so they might be rewritten much more frequently than
 other blocks on the SSD, even though the SSD does wear leveling
 internally in its FTL.

The FTL will make sure the write cycles are evenly divided among the
physical blocks, regardless of how often you overwrite a single spot on
the fs.

 What I currently know is that for the Intel X25-V SSD, the write
 throughput of BTRFS is almost 80% less than that of EXT3 in the case of
 PostMark. This really confuses me.

Can you show the script you use to test this, provide some info
regarding your setup, and show the numbers you see?

Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-30 Thread Kay Sievers
On Thu, Sep 30, 2010 at 01:43, Christoph Hellwig h...@infradead.org wrote:
 On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
 On Wed, Sep 29, 2010 at 09:25, Ric Wheeler rwhee...@redhat.com wrote:

  Second question is why is checking in /sys a big deal, would you prefer an
  interface like we did for alignment in libblkid?

 It's about knowing what's behind the 'nodev' major == 0 of a btrfs
 mount. There is no way to get that from /sys or anywhere else at the
 moment.

 Usually filesystems backed by a disk have the dev_t of the device, or
 the fake block devices like md/dm/raid have their own major and the
 slaves/ directory pointing to the devices.

 This is not only about readahead; it's every other tool that needs to
 know what kind of disks are behind a btrfs 'nodev' major == 0 mount.

 Thanks for explaining the problem.  It's one that affects everything
 with more than one underlying block device, so adding a
 filesystem-specific ioctl hack is not a good idea.  As mentioned in this
 mail we already have a solution for that - the block device slaves
 links used for raid and volume managers.  The most logical fix is to
 re-use that for btrfs as well and stop it from abusing the anonymous
 block major that was never intended for block based filesystems (and
 already has caused trouble in other areas).  One way to do this might
 be to allocate a block major for btrfs that only gets used for
 representing these links.

Yeah, we thought about that too, but a btrfs mount does not show up as
a block device like md/dm, so there is no place for a slaves/
directory in /sys with the individual disks listed. How could we solve
that? Create some fake blockdev for every btrfs mount, one that can't
be used to read/write raw blocks?

A generic solution, statfs()-like, which operates at the superblock
level, would be another option. Any idea whether that could be made to work?

Kay


Re: BTRFS SSD

2010-09-30 Thread David Brown

On 29/09/2010 23:31, Yuehai Xu wrote:

On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartellwingedtachik...@gmail.com  wrote:

On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:

On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartellwingedtachik...@gmail.com  wrote:

On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:

I know BTRFS is a kind of log-structured file system, which doesn't do
overwrites. Here is my question: suppose file A is overwritten by A'.
Instead of writing A' to the original place of A, a new place is
selected to store it. However, we know that the address of a file
should be recorded in its inode, so the corresponding part of A's inode
should be updated from the original place of A to the new place of A'.
Is this actually a kind of overwrite? I think that no matter what the
design of a log-structured FS is, a mapping table is always needed,
such as an inode map, a DAT, etc. When an update operation happens to
this mapping table, is it actually a kind of overwrite? If it is, is it
a bottleneck for write performance on SSDs?


In btrfs, this is solved by doing the same thing for the inode--a new
place for the leaf holding the inode is chosen. Then the parent of the
leaf must point to the new position of the leaf, so the parent is moved,
and the parent's parent, etc. This goes all the way up to the
superblocks, which are actually overwritten one at a time.


You mean that there is no overwrite for the inode either: once the
inode needs to be updated, it is actually written to a new place, and
the only thing to do is to change its parent's pointer to this new
place. However, for the last parent, the superblock, does it need to be
overwritten?


Yes. The idea of copy-on-write, as used by btrfs, is that whenever
*anything* is changed, it is simply written to a new location. This
applies to data, inodes, and all of the B-trees used by the filesystem.
However, it's necessary to have *something* in a fixed place on disk
pointing to everything else. So the superblocks can't move, and they are
overwritten instead.



So, is it a bottleneck in the case of SSD, since the cost of an
overwrite is very high? For every write, I think the superblocks have to
be overwritten, so they might be rewritten much more frequently than
other blocks on the SSD, even though the SSD does wear leveling
internally in its FTL.



SSDs already do copy-on-write.  They can't change small parts of the
data in a block, but have to re-write the block.  While that could be
done by reading the whole erase block into a RAM buffer, changing the
data, erasing the flash block, then re-writing it, this is not what
happens in practice.  To make efficient use of write blocks that are
smaller than erase blocks, and to provide wear levelling, the flash
disk will implement a small change to a block by writing a new copy of
the modified block to a different part of the flash, then updating its
block indirection tables.


BTRFS just makes this process a bit more explicit (except for superblock 
writes).



What I currently know is that for the Intel X25-V SSD, the write
throughput of BTRFS is almost 80% less than that of EXT3 in the case of
PostMark. This really confuses me.



Different file systems have different strengths and weaknesses.  I
haven't actually tested BTRFS much, but my understanding is that it
will be significantly slower than EXT in certain cases, such as small
modifications to large files (since copy-on-write means a lot of extra
disk activity in such cases).  But for other things it is faster.  Also
remember that BTRFS is under development - optimising for raw speed is
a lower priority than correctness, safety of data, and implementation
of BTRFS features.  Once everyone is happy with the stability of the
file system and its functionality and tools, you can expect the speed
to improve over time.




Re: BTRFS SSD

2010-09-30 Thread Yuehai Xu
On Thu, Sep 30, 2010 at 3:51 AM, David Brown da...@westcontrol.com wrote:
 On 29/09/2010 23:31, Yuehai Xu wrote:

 On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartellwingedtachik...@gmail.com
  wrote:

 On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:

 On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartellwingedtachik...@gmail.com
  wrote:

 On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:

 I know BTRFS is a kind of log-structured file system, which doesn't do
 overwrites. Here is my question: suppose file A is overwritten by A'.
 Instead of writing A' to the original place of A, a new place is
 selected to store it. However, we know that the address of a file
 should be recorded in its inode, so the corresponding part of A's inode
 should be updated from the original place of A to the new place of A'.
 Is this actually a kind of overwrite? I think that no matter what the
 design of a log-structured FS is, a mapping table is always needed,
 such as an inode map, a DAT, etc. When an update operation happens to
 this mapping table, is it actually a kind of overwrite? If it is, is it
 a bottleneck for write performance on SSDs?

 In btrfs, this is solved by doing the same thing for the inode--a new
 place for the leaf holding the inode is chosen. Then the parent of the
 leaf must point to the new position of the leaf, so the parent is
 moved,
 and the parent's parent, etc. This goes all the way up to the
 superblocks, which are actually overwritten one at a time.

 You mean that there is no overwrite for the inode either: once the
 inode needs to be updated, it is actually written to a new place, and
 the only thing to do is to change its parent's pointer to this new
 place. However, for the last parent, the superblock, does it need to be
 overwritten?

 Yes. The idea of copy-on-write, as used by btrfs, is that whenever
 *anything* is changed, it is simply written to a new location. This
 applies to data, inodes, and all of the B-trees used by the filesystem.
 However, it's necessary to have *something* in a fixed place on disk
 pointing to everything else. So the superblocks can't move, and they are
 overwritten instead.


 So, is it a bottleneck in the case of SSD, since the cost of an
 overwrite is very high? For every write, I think the superblocks have to
 be overwritten, so they might be rewritten much more frequently than
 other blocks on the SSD, even though the SSD does wear leveling
 internally in its FTL.


 SSDs already do copy-on-write.  They can't change small parts of the data in
 a block, but have to re-write the block.  While that could be done by
 reading the whole erase block into a RAM buffer, changing the data, erasing
 the flash block, then re-writing it, this is not what happens in practice.
 To make efficient use of write blocks that are smaller than erase blocks,
 and to provide wear levelling, the flash disk will implement a small change
 to a block by writing a new copy of the modified block to a different part
 of the flash, then updating its block indirection tables.

Yes, the FTL inside the SSD does that kind of job, and the overhead
should be small as long as the block mapping is a page-level mapping.
However, a full page-level mapping table is too large to be stored
entirely in the SSD's SRAM, so many complicated algorithms have been
developed to optimize this. In other words, SSDs might not always be
smart enough to do wear leveling with low overhead. This is my
subjective opinion.


 BTRFS just makes this process a bit more explicit (except for superblock
 writes).

As you said, the superblocks have to be overwritten. Does that happen
frequently? If it does, could it become a bottleneck for SSD
throughput? After all, SSDs are not happy with overwrites. Of course,
few people really know what the FTL algorithms are, and it is those
algorithms that actually determine the efficiency of an SSD.



 What I currently know is that for the Intel X25-V SSD, the write
 throughput of BTRFS is almost 80% less than that of EXT3 in the case of
 PostMark. This really confuses me.


 Different file systems have different strengths and weaknesses.  I haven't
 actually tested BTRFS much, but my understanding is that it will be
 significantly slower than EXT in certain cases, such as small modifications
 to large files (since copy-on-write means a lot of extra disk activity in
 such cases).  But for other things it is faster.  Also remember that BTRFS
 is under development - optimising for raw speed is a lower priority than
 correctness, safety of data, and implementation of BTRFS features.  Once
 everyone is happy with the stability of the file system and its
 functionality and tools, you can expect the speed to improve over time.

My test case for PostMark is:
set file size 9216 15360 (file size from 9216 bytes to 15360 bytes)
set number 5 (file number is 5)

write throughput (MB/s) for different file systems on the Intel SSD X25-V:
EXT3: 28.09
NILFS2: 10
BTRFS: 17.35
EXT4: 31.04
XFS: 

Re: BTRFS SSD

2010-09-30 Thread Yuehai Xu
On Thu, Sep 30, 2010 at 3:15 AM, Sander san...@humilis.net wrote:
 Yuehai Xu wrote (ao):
 So, is it a bottleneck in the case of SSD, since the cost of an
 overwrite is very high? For every write, I think the superblocks have to
 be overwritten, so they might be rewritten much more frequently than
 other blocks on the SSD, even though the SSD does wear leveling
 internally in its FTL.

 The FTL will make sure the write cycles are evenly divided among the
 physical blocks, regardless of how often you overwrite a single spot on
 the fs.

 What I currently know is that for the Intel X25-V SSD, the write
 throughput of BTRFS is almost 80% less than that of EXT3 in the case of
 PostMark. This really confuses me.

 Can you show the script you use to test this, provide some info
 regarding your setup, and show the numbers you see?

My test case for PostMark is:
set file size 9216 15360 (file size from 9216 bytes to 15360 bytes)
set number 5 (file number is 5)

write throughput (MB/s) for different file systems on the Intel SSD X25-V:
EXT3: 28.09
NILFS2: 10
BTRFS: 17.35
EXT4: 31.04
XFS: 11.56
REISERFS: 28.09
EXT2: 15.94
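For reference, a PostMark session that would reproduce this configuration might look like the following. The command names are as I remember them from the PostMark documentation and should be treated as an assumption, as should the mount point; PostMark reads commands from stdin and "run" executes the benchmark and prints the throughput figures.

```shell
# Hypothetical reproduction of the configuration quoted above.
postmark <<'EOF'
set location /mnt/fs-under-test
set size 9216 15360
set number 5
run
quit
EOF
```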

Thanks,
Yuehai

        Sander

 --
 Humilis IT Services and Solutions
 http://www.humilis.net



Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-30 Thread Josef Bacik
On Thu, Sep 30, 2010 at 09:43:00AM +0200, Kay Sievers wrote:
 On Thu, Sep 30, 2010 at 01:43, Christoph Hellwig h...@infradead.org wrote:
  On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
  On Wed, Sep 29, 2010 at 09:25, Ric Wheeler rwhee...@redhat.com wrote:
 
   Second question is why is checking in /sys a big deal, would you prefer an
   interface like we did for alignment in libblkid?
 
  It's about knowing what's behind the 'nodev' major == 0 of a btrfs
  mount. There is no way to get that from /sys or anywhere else at the
  moment.
 
  Usually filesystems backed by a disk have the dev_t of the device, or
  the fake block devices like md/dm/raid have their own major and the
  slaves/ directory pointing to the devices.
 
  This is not only about readahead; it's every other tool that needs to
  know what kind of disks are behind a btrfs 'nodev' major == 0 mount.
 
  Thanks for explaining the problem.  It's one that affects everything
  with more than one underlying block device, so adding a
  filesystem-specific ioctl hack is not a good idea.  As mentioned in this
  mail we already have a solution for that - the block device slaves
  links used for raid and volume managers.  The most logical fix is to
  re-use that for btrfs as well and stop it from abusing the anonymous
  block major that was never intended for block based filesystems (and
  already has caused trouble in other areas).  One way to do this might
  be to allocate a block major for btrfs that only gets used for
  representing these links.
 
 Yeah, we thought about that too, but a btrfs mount does not show up as
 a block device like md/dm, so there is no place for a slaves/
 directory in /sys with the individual disks listed. How could we solve
 that? Create some fake blockdev for every btrfs mount, one that can't
 be used to read/write raw blocks?
 

That's what I was going to do.  We essentially do that anyway with the
anonymous superblock, so instead I'll just make a /dev/btrfs-# or whatever
and do the bd_claim_by_disk stuff to make all of our devices slaves of that
parent virtual device.  Does this seem like a reasonable solution?  Thanks,

Josef


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-30 Thread Andi Kleen
Kay Sievers kay.siev...@vrfy.org writes:

 Yeah, we thought about that too, but a btrfs mount does not show up as
 a block device like md/dm, so there is no place for a slaves/
 directory in /sys with the individual disks listed. How could we solve
 that? Create some fake blockdev for every btrfs mount, one that can't
 be used to read/write raw blocks?

You could simply create a new class for btrfs? (or maybe a generic
compound class)

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-30 Thread Josef Bacik
On Wed, Sep 29, 2010 at 07:43:27PM -0400, Christoph Hellwig wrote:
 On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
  On Wed, Sep 29, 2010 at 09:25, Ric Wheeler rwhee...@redhat.com wrote:
  
   Second question is why is checking in /sys a big deal, would you prefer an
   interface like we did for alignment in libblkid?
  
  It's about knowing what's behind the 'nodev' major == 0 of a btrfs
  mount. There is no way to get that from /sys or anywhere else at the
  moment.
  
  Usually filesystems backed by a disk have the dev_t of the device, or
  the fake block devices like md/dm/raid have their own major and the
  slaves/ directory pointing to the devices.
  
  This is not only about readahead; it's every other tool that needs to
  know what kind of disks are behind a btrfs 'nodev' major == 0 mount.
 
 Thanks for explaining the problem.  It's one that affects everything
 with more than one underlying block device, so adding a
 filesystem-specific ioctl hack is not a good idea.  As mentioned in this
 mail we already have a solution for that - the block device slaves
 links used for raid and volume managers.  The most logical fix is to
 re-use that for btrfs as well and stop it from abusing the anonymous
 block major that was never intended for block based filesystems (and
 already has caused trouble in other areas).  One way to do this might
 be to allocate a block major for btrfs that only gets used for
 representing these links.
 

Ok, I've spent a few hours on this and I'm hitting a wall.  In order to get
the sort of /sys/block/btrfs-# thing I have to

1) register_blkdev to get a major
2) set up a gendisk
3) do a bdget_disk
4) loop through all of our devices and do a bd_claim_by_disk on each of them

This sucks because for step #2 I have to have a request_queue for the disk.
It's a bogus disk, and there's no way to not have a request_queue, so I'd
have to wire that up and put a bunch of WARN_ON()'s in to make sure nobody
tries to write to our special disk (since I assume that if I go through all
this crap I'm going to end up with a /dev/btrfs-# that people are going to
try to write to).

So my question is, is this what we want?  Do I just need to quit bitching
and make it work?  Or am I doing something wrong?  This is a completely new
area for me, so I'm just looking around at what md/dm does and trying to
mirror it for my own uses; if that's not what I should be doing please tell
me, otherwise this seems like a lot of work for a very shitty solution to
our problem.  Thanks,

Josef


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-30 Thread Kay Sievers
On Thu, Sep 30, 2010 at 21:48, Josef Bacik jo...@redhat.com wrote:
 On Wed, Sep 29, 2010 at 07:43:27PM -0400, Christoph Hellwig wrote:
 On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
  On Wed, Sep 29, 2010 at 09:25, Ric Wheeler rwhee...@redhat.com wrote:
 
   Second question is why is checking in /sys a big deal, would you prefer an
   interface like we did for alignment in libblkid?
 
  It's about knowing what's behind the 'nodev' major == 0 of a btrfs
  mount. There is no way to get that from /sys or anywhere else at the
  moment.
 
  Usually filesystems backed by a disk have the dev_t of the device, or
  the fake block devices like md/dm/raid have their own major and the
  slaves/ directory pointing to the devices.
 
  This is not only about readahead; it's every other tool that needs to
  know what kind of disks are behind a btrfs 'nodev' major == 0 mount.

 Thanks for explaining the problem.  It's one that affects everything
 with more than one underlying block device, so adding a
 filesystem-specific ioctl hack is not a good idea.  As mentioned in this
 mail we already have a solution for that - the block device slaves
 links used for raid and volume managers.  The most logical fix is to
 re-use that for btrfs as well and stop it from abusing the anonymous
 block major that was never intended for block based filesystems (and
 already has caused trouble in other areas).  One way to do this might
 be to allocate a block major for btrfs that only gets used for
 representing these links.


 Ok, I've spent a few hours on this and I'm hitting a wall.  In order to get
 the sort of /sys/block/btrfs-# thing I have to

 1) register_blkdev to get a major
 2) set up a gendisk
 3) do a bdget_disk
 4) loop through all of our devices and do a bd_claim_by_disk on each of them

 This sucks because for step #2 I have to have a request_queue for the disk.
 It's a bogus disk, and there's no way to not have a request_queue, so I'd
 have to wire that up and put a bunch of WARN_ON()'s in to make sure nobody
 tries to write to our special disk (since I assume that if I go through all
 this crap I'm going to end up with a /dev/btrfs-# that people are going to
 try to write to).

 So my question is, is this what we want?  Do I just need to quit bitching
 and make it work?  Or am I doing something wrong?  This is a completely new
 area for me, so I'm just looking around at what md/dm does and trying to
 mirror it for my own uses; if that's not what I should be doing please tell
 me, otherwise this seems like a lot of work for a very shitty solution to
 our problem.  Thanks,

Yeah, that matches what I was experiencing when thinking about the
options. Making a btrfs mount a fake blockdev of zero size, just to get
some 'dead' directories in sysfs, seems like a pretty weird hack. A
btrfs mount is just not a raw blockdev, and should probably not
pretend to be one.

I guess a statfs()-like call from the filesystem side, rather than the
block side, which can put out such information in some generic way,
would be a better fit here.

Kay


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-30 Thread Lennart Poettering
On Thu, 30.09.10 21:59, Kay Sievers (kay.siev...@vrfy.org) wrote:

  So my question is, is this what we want?  Do I just need to quit bitching
  and make it work?  Or am I doing something wrong?  This is a completely
  new area for me, so I'm just looking around at what md/dm does and trying
  to mirror it for my own uses; if that's not what I should be doing please
  tell me, otherwise this seems like a lot of work for a very shitty
  solution to our problem.  Thanks,

 Yeah, that matches what I was experiencing when thinking about the
 options. Making a btrfs mount a fake blockdev of zero size, just to get
 some 'dead' directories in sysfs, seems like a pretty weird hack. A
 btrfs mount is just not a raw blockdev, and should probably not
 pretend to be one.

 I guess a statfs()-like call from the filesystem side, rather than the
 block side, which can put out such information in some generic way,
 would be a better fit here.

Note that for my particular usecase it would even suffice to have two
flags in struct statfs or struct statvfs that encode whether there's at
least one SSD in the fs and, respectively, at least one rotating disk in
the fs:

if (statvfs.f_flag & ST_SSD)
        printf("FS contains at least one SSD disk");
if (statvfs.f_flag & ST_ROTATING)
        printf("FS contains at least one rotating disk");

Lennart

-- 
Lennart Poettering - Red Hat, Inc.