Re: BTRFS SSD
Yuehai Xu wrote (ao):
> So, is it a bottleneck in the case of SSD, since the cost of an overwrite is very high? For every write, I think the superblocks must be overwritten; they might be overwritten much more frequently than other blocks on the SSD, even though the SSD does wear leveling internally via its FTL.

The FTL will make sure the write cycles are evenly divided among the physical blocks, regardless of how often you overwrite a single spot on the fs.

> What I currently know is that for the Intel X25-V SSD, the write throughput of BTRFS is almost 80% less than that of EXT3 in the case of PostMark. This really confuses me.

Can you show the script you use to test this, provide some info regarding your setup, and show the numbers you see?

Sander
--
Humilis IT Services and Solutions
http://www.humilis.net
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem
On Thu, Sep 30, 2010 at 01:43, Christoph Hellwig h...@infradead.org wrote:
> On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
>> On Wed, Sep 29, 2010 at 09:25, Ric Wheeler rwhee...@redhat.com wrote:
>>> Second question is: why is checking in /sys a big deal? Would you prefer an interface like we did for alignment in libblkid?
>> It's about knowing what's behind the 'nodev' major == 0 of a btrfs mount. There is no way to get that from /sys or anywhere else at the moment. Usually filesystems backed by a disk have the dev_t of the device, and the fake block devices like md/dm/raid have their own major and a slaves/ directory pointing to the devices. This is not only about readahead; it's every other tool that needs to know what kind of disks are behind a btrfs 'nodev' major == 0 mount.
> Thanks for explaining the problem. It's one that affects everything with more than one underlying block device, so adding a filesystem-specific ioctl hack is not a good idea. As mentioned in this mail, we already have a solution for that: the block device slaves links used for raid and volume managers. The most logical fix is to re-use that for btrfs as well and stop it from abusing the anonymous block major, which was never intended for block-based filesystems (and has already caused trouble in other areas). One way to do this might be to allocate a block major for btrfs that only gets used for representing these links.

Yeah, we thought about that too, but a btrfs mount does not show up as a block device, like md/dm do, so there is no place for a slaves/ directory in /sys with the individual disks listed. How could we solve that? Create some fake blockdev for every btrfs mount, which can't be used to read/write raw blocks? A generic, statfs()-like call which operates at the superblock level would be another option. Any idea if that could be made to work?
Kay
Re: BTRFS SSD
On 29/09/2010 23:31, Yuehai Xu wrote:
> On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartell wingedtachik...@gmail.com wrote:
>> On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
>>> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com wrote:
>>>> On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
>>>>> I know BTRFS is a kind of log-structured file system, which doesn't do overwrites. Here is my question: suppose file A is overwritten by A'. Instead of writing A' to the original place of A, a new place is selected to store it. However, we know that the address of a file is recorded in its inode, so the corresponding part of A's inode must be updated from the original place of A to the new place of A'. Is that update actually a kind of overwrite? I think that no matter how a log-structured FS is designed, a mapping table is always needed, such as an inode map, a DAT, etc. When an update happens to this mapping table, is it actually an overwrite? If it is, is it a bottleneck for the write performance of SSDs?
>>>> In btrfs, this is solved by doing the same thing for the inode: a new place for the leaf holding the inode is chosen. Then the parent of the leaf must point to the new position of the leaf, so the parent is moved, and the parent's parent, and so on. This goes all the way up to the superblocks, which are actually overwritten, one at a time.
>>> You mean that there is no overwrite for the inode either: once the inode needs to be updated, it is actually written to a new place, and the only thing to do is to change the pointer in its parent to this new place. However, for the last parent, the superblock, does it need to be overwritten?
>> Yes. The idea of copy-on-write, as used by btrfs, is that whenever *anything* is changed, it is simply written to a new location. This applies to data, inodes, and all of the B-trees used by the filesystem. However, it's necessary to have *something* in a fixed place on disk pointing to everything else. So the superblocks can't move, and they are overwritten instead.
> So, is it a bottleneck in the case of SSD, since the cost of an overwrite is very high? For every write, I think the superblocks must be overwritten; they might be overwritten much more frequently than other blocks on the SSD, even though the SSD does wear leveling internally via its FTL.

SSDs already do copy-on-write. They can't change small parts of the data in a block, but have to re-write the block. While that could be done by reading the whole erase block to a RAM buffer, changing the data, erasing the flash block, then re-writing, this is not what happens in practice. To make efficient use of write blocks that are smaller than erase blocks, and to provide wear levelling, the flash disk implements a small change to a block by writing a new copy of the modified block to a different part of the flash, then updating its block indirection tables. BTRFS just makes this process a bit more explicit (except for superblock writes).

> What I currently know is that for the Intel X25-V SSD, the write throughput of BTRFS is almost 80% less than that of EXT3 in the case of PostMark. This really confuses me.

Different file systems have different strengths and weaknesses. I haven't actually tested BTRFS much, but my understanding is that it will be significantly slower than EXT in certain cases, such as small modifications to large files (since copy-on-write means a lot of extra disk activity in such cases). But for other things it is faster. Also remember that BTRFS is under development: optimising for raw speed comes at a lower priority than correctness and safety of data, and implementation of BTRFS features. Once everyone is happy with the stability of the file system and its functionality and tools, you can expect the speed to improve somewhat over time.
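Sean's walk-up-the-tree description can be sketched with a toy model. This is purely an illustration of the copy-on-write idea, not btrfs's actual on-disk format or allocator:

```python
# Toy copy-on-write tree: every update allocates a new block; only the
# "superblock" slot (a fixed location) is overwritten in place.
class Disk:
    def __init__(self):
        self.blocks = {}          # block number -> payload
        self.next_free = 1
        self.superblock = None    # fixed location, overwritten in place

    def write_new(self, payload):
        """COW write: always program a fresh block, never reuse one."""
        n = self.next_free
        self.next_free += 1
        self.blocks[n] = payload
        return n

def update_path(disk, path, new_data):
    """Rewrite a leaf, then re-point every ancestor up to the root."""
    ptr = disk.write_new(new_data)            # new leaf for the data
    for level in reversed(path):              # re-point parents bottom-up
        ptr = disk.write_new({"child": ptr, "level": level})
    disk.superblock = ptr                     # the only in-place overwrite

disk = Disk()
update_path(disk, ["root", "node"], "A")
old_root = disk.superblock
update_path(disk, ["root", "node"], "A'")     # overwrite file A with A'
assert disk.superblock != old_root            # root landed in a new block
assert "A" in disk.blocks.values()            # old data still on disk
```

Note that the old tree is still fully intact on disk after the update, which is exactly what makes snapshots and crash consistency cheap in a COW design; only the superblock pointer moves forward.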
Re: BTRFS SSD
On Thu, Sep 30, 2010 at 3:51 AM, David Brown da...@westcontrol.com wrote:
> On 29/09/2010 23:31, Yuehai Xu wrote:
>> On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartell wingedtachik...@gmail.com wrote:
>>> On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
>>>> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com wrote:
>>>>> On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
>>>>>> I know BTRFS is a kind of log-structured file system, which doesn't do overwrites. Here is my question: suppose file A is overwritten by A'. Instead of writing A' to the original place of A, a new place is selected to store it. However, we know that the address of a file is recorded in its inode, so the corresponding part of A's inode must be updated from the original place of A to the new place of A'. Is that update actually a kind of overwrite? I think that no matter how a log-structured FS is designed, a mapping table is always needed, such as an inode map, a DAT, etc. When an update happens to this mapping table, is it actually an overwrite? If it is, is it a bottleneck for the write performance of SSDs?
>>>>> In btrfs, this is solved by doing the same thing for the inode: a new place for the leaf holding the inode is chosen. Then the parent of the leaf must point to the new position of the leaf, so the parent is moved, and the parent's parent, and so on. This goes all the way up to the superblocks, which are actually overwritten, one at a time.
>>>> You mean that there is no overwrite for the inode either: once the inode needs to be updated, it is actually written to a new place, and the only thing to do is to change the pointer in its parent to this new place. However, for the last parent, the superblock, does it need to be overwritten?
>>> Yes. The idea of copy-on-write, as used by btrfs, is that whenever *anything* is changed, it is simply written to a new location. This applies to data, inodes, and all of the B-trees used by the filesystem. However, it's necessary to have *something* in a fixed place on disk pointing to everything else. So the superblocks can't move, and they are overwritten instead.
>> So, is it a bottleneck in the case of SSD, since the cost of an overwrite is very high? For every write, I think the superblocks must be overwritten; they might be overwritten much more frequently than other blocks on the SSD, even though the SSD does wear leveling internally via its FTL.
> SSDs already do copy-on-write. They can't change small parts of the data in a block, but have to re-write the block. While that could be done by reading the whole erase block to a RAM buffer, changing the data, erasing the flash block, then re-writing, this is not what happens in practice. To make efficient use of write blocks that are smaller than erase blocks, and to provide wear levelling, the flash disk implements a small change to a block by writing a new copy of the modified block to a different part of the flash, then updating its block indirection tables.

Yes, the FTL inside the SSD will do that kind of job, and the overhead should be small as long as the block mapping is page-level. However, a page-level mapping table is too large to be stored entirely in the SRAM of the SSD, so many complicated algorithms have been developed to optimize this. In other words, SSDs might not always be smart enough to do wear leveling with small overhead. This is my subjective opinion.

> BTRFS just makes this process a bit more explicit (except for superblock writes).

As you said, the superblocks must be overwritten. Is that frequent? If it is, could it be a potential bottleneck for the throughput of SSDs? After all, SSDs are not happy with overwrites. Of course, few people really know what the FTL algorithms actually are, and they determine the real efficiency of SSDs.

>> What I currently know is that for the Intel X25-V SSD, the write throughput of BTRFS is almost 80% less than that of EXT3 in the case of PostMark. This really confuses me.
> Different file systems have different strengths and weaknesses. I haven't actually tested BTRFS much, but my understanding is that it will be significantly slower than EXT in certain cases, such as small modifications to large files (since copy-on-write means a lot of extra disk activity in such cases). But for other things it is faster. Also remember that BTRFS is under development: optimising for raw speed comes at a lower priority than correctness and safety of data, and implementation of BTRFS features. Once everyone is happy with the stability of the file system and its functionality and tools, you can expect the speed to improve somewhat over time.

My test case for PostMark is:

  set file size 9216 15360   (file size from 9216 bytes to 15360 bytes)
  set number 5               (file number is 5)

Write throughput (MB/s) for different file systems on the Intel SSD X25-V:

  EXT3:   28.09
  NILFS2: 10
  BTRFS:  17.35
  EXT4:   31.04
  XFS:
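Yuehai's point about page-level FTL mappings can be made concrete with a toy model. This is purely illustrative; real FTLs (including Intel's) are far more complex, must fit their tables in limited SRAM, and keep their algorithms proprietary:

```python
# Toy page-level FTL: logical pages are never overwritten in place.
# Each write is programmed to the next free physical page and the
# logical-to-physical mapping table is updated, leaving a stale copy
# behind for later garbage collection.
class SimpleFTL:
    def __init__(self, num_physical_pages):
        self.flash = [None] * num_physical_pages
        self.mapping = {}       # logical page -> physical page
        self.write_ptr = 0      # next free physical page

    def write(self, lpage, data):
        self.flash[self.write_ptr] = data      # program a fresh page
        self.mapping[lpage] = self.write_ptr   # remap; old copy is stale
        self.write_ptr += 1

    def read(self, lpage):
        return self.flash[self.mapping[lpage]]

ftl = SimpleFTL(8)
ftl.write(0, "superblock v1")
ftl.write(0, "superblock v2")   # "overwrite" lands on a new physical page
assert ftl.read(0) == "superblock v2"
assert ftl.flash[0] == "superblock v1"   # stale copy awaits garbage collection
```

So even btrfs's in-place superblock overwrites are spread across the flash by the FTL; the open question in the thread is only how cheaply the drive's firmware manages the mapping table while doing so.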
Re: BTRFS SSD
On Thu, Sep 30, 2010 at 3:15 AM, Sander san...@humilis.net wrote:
> Yuehai Xu wrote (ao):
>> So, is it a bottleneck in the case of SSD, since the cost of an overwrite is very high? For every write, I think the superblocks must be overwritten; they might be overwritten much more frequently than other blocks on the SSD, even though the SSD does wear leveling internally via its FTL.
> The FTL will make sure the write cycles are evenly divided among the physical blocks, regardless of how often you overwrite a single spot on the fs.
>> What I currently know is that for the Intel X25-V SSD, the write throughput of BTRFS is almost 80% less than that of EXT3 in the case of PostMark. This really confuses me.
> Can you show the script you use to test this, provide some info regarding your setup, and show the numbers you see?

My test case for PostMark is:

  set file size 9216 15360   (file size from 9216 bytes to 15360 bytes)
  set number 5               (file number is 5)

Write throughput (MB/s) for different file systems on the Intel SSD X25-V:

  EXT3:     28.09
  NILFS2:   10
  BTRFS:    17.35
  EXT4:     31.04
  XFS:      11.56
  REISERFS: 28.09
  EXT2:     15.94

Thanks,
Yuehai
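For reference, those settings correspond to a PostMark input script roughly like the following. This is a sketch assuming the stock PostMark command names (`set size`, `set number`); the target directory and transaction count are not given in the mail, so those two values are placeholders:

```
set location /mnt/test
set size 9216 15360
set number 5
set transactions 10000
run
quit
```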
Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem
On Thu, Sep 30, 2010 at 09:43:00AM +0200, Kay Sievers wrote:
> On Thu, Sep 30, 2010 at 01:43, Christoph Hellwig h...@infradead.org wrote:
>> On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
>>> On Wed, Sep 29, 2010 at 09:25, Ric Wheeler rwhee...@redhat.com wrote:
>>>> Second question is: why is checking in /sys a big deal? Would you prefer an interface like we did for alignment in libblkid?
>>> It's about knowing what's behind the 'nodev' major == 0 of a btrfs mount. There is no way to get that from /sys or anywhere else at the moment. Usually filesystems backed by a disk have the dev_t of the device, and the fake block devices like md/dm/raid have their own major and a slaves/ directory pointing to the devices. This is not only about readahead; it's every other tool that needs to know what kind of disks are behind a btrfs 'nodev' major == 0 mount.
>> Thanks for explaining the problem. It's one that affects everything with more than one underlying block device, so adding a filesystem-specific ioctl hack is not a good idea. As mentioned in this mail, we already have a solution for that: the block device slaves links used for raid and volume managers. The most logical fix is to re-use that for btrfs as well and stop it from abusing the anonymous block major, which was never intended for block-based filesystems (and has already caused trouble in other areas). One way to do this might be to allocate a block major for btrfs that only gets used for representing these links.
> Yeah, we thought about that too, but a btrfs mount does not show up as a block device, like md/dm do, so there is no place for a slaves/ directory in /sys with the individual disks listed. How could we solve that? Create some fake blockdev for every btrfs mount, which can't be used to read/write raw blocks?

That's what I was going to do. We essentially do that anyway with the anonymous superblock, so instead I'll just make a /dev/btrfs-# or whatever and do the bd_claim_by_disk stuff to make all of our devices slaves of that parent virtual device. Does this seem like a reasonable solution?

Thanks,
Josef
Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem
Kay Sievers kay.siev...@vrfy.org writes:
> Yeah, we thought about that too, but a btrfs mount does not show up as a block device, like md/dm do, so there is no place for a slaves/ directory in /sys with the individual disks listed. How could we solve that? Create some fake blockdev for every btrfs mount, which can't be used to read/write raw blocks?

You could simply create a new class for btrfs? (Or maybe a generic compound class.)

-Andi
--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem
On Wed, Sep 29, 2010 at 07:43:27PM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
>> On Wed, Sep 29, 2010 at 09:25, Ric Wheeler rwhee...@redhat.com wrote:
>>> Second question is: why is checking in /sys a big deal? Would you prefer an interface like we did for alignment in libblkid?
>> It's about knowing what's behind the 'nodev' major == 0 of a btrfs mount. There is no way to get that from /sys or anywhere else at the moment. Usually filesystems backed by a disk have the dev_t of the device, and the fake block devices like md/dm/raid have their own major and a slaves/ directory pointing to the devices. This is not only about readahead; it's every other tool that needs to know what kind of disks are behind a btrfs 'nodev' major == 0 mount.
> Thanks for explaining the problem. It's one that affects everything with more than one underlying block device, so adding a filesystem-specific ioctl hack is not a good idea. As mentioned in this mail, we already have a solution for that: the block device slaves links used for raid and volume managers. The most logical fix is to re-use that for btrfs as well and stop it from abusing the anonymous block major, which was never intended for block-based filesystems (and has already caused trouble in other areas). One way to do this might be to allocate a block major for btrfs that only gets used for representing these links.

Ok, I've spent a few hours on this and I'm hitting a wall. In order to get the /sys/block/btrfs-# sort of thing, I have to:

1) register_blkdev to get a major
2) set up a gendisk
3) do a bdget_disk
4) loop through all of our devices and do a bd_claim_by_disk on each of them

This sucks because for step 2 I have to have a request_queue for the disk. It's a bogus disk, and there's no way to not have a request_queue, so I'd have to wire that up and put a bunch of WARN_ON()'s in to make sure nobody is trying to write to our special disk (since I assume that if I go through all this crap, I'm going to end up with a /dev/btrfs-# that people are going to try to write to). So my question is: is this what we want? Do I just need to quit bitching and make it work? Or am I doing something wrong? This is a completely new area for me, so I'm just looking around at what md/dm does and trying to mirror it for my own uses. If that's not what I should be doing, please tell me; otherwise this seems like a lot of work for a very shitty solution to our problem.

Thanks,
Josef
Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem
On Thu, Sep 30, 2010 at 21:48, Josef Bacik jo...@redhat.com wrote:
> On Wed, Sep 29, 2010 at 07:43:27PM -0400, Christoph Hellwig wrote:
>> On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
>>> On Wed, Sep 29, 2010 at 09:25, Ric Wheeler rwhee...@redhat.com wrote:
>>>> Second question is: why is checking in /sys a big deal? Would you prefer an interface like we did for alignment in libblkid?
>>> It's about knowing what's behind the 'nodev' major == 0 of a btrfs mount. There is no way to get that from /sys or anywhere else at the moment. Usually filesystems backed by a disk have the dev_t of the device, and the fake block devices like md/dm/raid have their own major and a slaves/ directory pointing to the devices. This is not only about readahead; it's every other tool that needs to know what kind of disks are behind a btrfs 'nodev' major == 0 mount.
>> Thanks for explaining the problem. It's one that affects everything with more than one underlying block device, so adding a filesystem-specific ioctl hack is not a good idea. As mentioned in this mail, we already have a solution for that: the block device slaves links used for raid and volume managers. The most logical fix is to re-use that for btrfs as well and stop it from abusing the anonymous block major, which was never intended for block-based filesystems (and has already caused trouble in other areas). One way to do this might be to allocate a block major for btrfs that only gets used for representing these links.
> Ok, I've spent a few hours on this and I'm hitting a wall. In order to get the /sys/block/btrfs-# sort of thing, I have to: 1) register_blkdev to get a major, 2) set up a gendisk, 3) do a bdget_disk, 4) loop through all of our devices and do a bd_claim_by_disk on each of them. This sucks because for step 2 I have to have a request_queue for the disk. It's a bogus disk, and there's no way to not have a request_queue, so I'd have to wire that up and put a bunch of WARN_ON()'s in to make sure nobody is trying to write to our special disk (since I assume that if I go through all this crap, I'm going to end up with a /dev/btrfs-# that people are going to try to write to). So my question is: is this what we want? Do I just need to quit bitching and make it work? Or am I doing something wrong? This is a completely new area for me, so I'm just looking around at what md/dm does and trying to mirror it for my own uses. If that's not what I should be doing, please tell me; otherwise this seems like a lot of work for a very shitty solution to our problem. Thanks,

Yeah, that matches what I was experiencing when thinking about the options. Making a btrfs mount a fake blockdev of zero size, just to get some 'dead' directories in sysfs, seems like a pretty weird hack. A btrfs mount is just not a raw blockdev, and should probably not pretend to be one. I guess a statfs()-like call from the filesystem side, not the block side, which can put out such information in some generic way, would fit better here.

Kay
Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem
On Thu, 30.09.10 21:59, Kay Sievers (kay.siev...@vrfy.org) wrote:
>> So my question is: is this what we want? Do I just need to quit bitching and make it work? Or am I doing something wrong? This is a completely new area for me, so I'm just looking around at what md/dm does and trying to mirror it for my own uses. If that's not what I should be doing, please tell me; otherwise this seems like a lot of work for a very shitty solution to our problem. Thanks,
> Yeah, that matches what I was experiencing when thinking about the options. Making a btrfs mount a fake blockdev of zero size, just to get some 'dead' directories in sysfs, seems like a pretty weird hack. A btrfs mount is just not a raw blockdev, and should probably not pretend to be one. I guess a statfs()-like call from the filesystem side, not the block side, which can put out such information in some generic way, would fit better here.

Note that for my particular use case it would even suffice to have two flags in struct statfs or struct statvfs that encode whether there's at least one SSD in the fs, and at least one rotating disk in the fs, respectively:

    if (statvfs.f_flag & ST_SSD)
        printf("FS contains at least one SSD disk\n");
    if (statvfs.f_flag & ST_ROTATING)
        printf("FS contains at least one rotating disk\n");

Lennart
--
Lennart Poettering - Red Hat, Inc.