Re: btrfs for enterprise raid arrays
Dear Erwin,

Erwin van Londen wrote (ao):
> Another thing is that some arrays have the capability to thin-provision
> volumes. On the physical layer in the back-end, the array configures,
> let's say, a 1TB volume and virtually provisions 5TB to the host. On
> writes it dynamically allocates more pages in the pool, up to the 5TB
> point. Now if for some reason large holes occur on the volume, maybe a
> couple of ISO images that have been deleted, what normally happens is
> that just some pointers in the inodes get deleted, so from the array's
> perspective there is still data at those locations and the allocated
> blocks are never released. New firmware/microcode versions are able to
> reclaim that space if they see a certain number of consecutive zeros,
> returning it to the volume pool. Are there any thoughts on writing a
> low-priority thread that zeros out those unused blocks? SSDs would also
> benefit from such a feature, as they wouldn't need to copy deleted data
> when erasing blocks.

The storage could use the ATA/SCSI commands TRIM, UNMAP and DISCARD for
that?

I have one question on thin provisioning: if Windows XP performs a
defrag on a LUN with a 'virtual' size of 20GB and 2GB in actual use,
will the volume grow to 20GB on the storage and never shrink again,
even though the client still has only 2GB in use? This would make thin
provisioning on virtual desktops less useful.

Do you have any numbers on the performance impact of thin provisioning?
I can imagine that thin provisioning causes on-storage fragmentation of
disk images, which would defeat any OS optimisations such as grouping
frequently read files.

With kind regards,
Sander

--
Humilis IT Services and Solutions
http://www.humilis.net
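For illustration, a minimal userspace sketch of the zero-fill idea being
discussed (untested; the file path is a placeholder): absent TRIM/UNMAP
support, one crude way to hand the array long runs of consecutive zeros
is to fill the filesystem's free space with a zeroed file and then
delete it.

/* Crude take on "zero the unused blocks": fill free space with a file
 * of zeros so the array sees long runs of consecutive zeros, then
 * delete the file.  The path is a placeholder, and this is a blunt
 * instrument -- it briefly drives the filesystem to ENOSPC and touches
 * every free block. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(void)
{
    static char zeros[1 << 20];             /* 1MB buffer of zeros */
    const char *path = "/mnt/volume/zerofill.tmp";  /* placeholder */

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Write zeros until the filesystem reports ENOSPC. */
    while (write(fd, zeros, sizeof(zeros)) > 0)
        ;
    if (errno != ENOSPC)
        perror("write");

    fsync(fd);          /* make sure the zeros actually hit the array */
    close(fd);
    unlink(path);       /* give the space back to the filesystem */
    return 0;
}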
Re: btrfs for enterprise raid arrays
On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > New firmware/microcode versions are able to reclaim that space if
> > they see a certain number of consecutive zeros, returning it to the
> > volume pool. Are there any thoughts on writing a low-priority thread
> > that zeros out those unused blocks?
>
> Patches have been floating around to support this - see the recent
> patches around DISCARD on linux-ide and lkml. It would be great to get
> access to a box that implemented the T10 proposed UNMAP commands that
> we could test against.

We've already made btrfs support TRIM, and Matthew has patches which
hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
once the dust settles on the spec.

I don't think I've seen anybody talking about deliberately writing
zeroes instead of just issuing a discard command, though. That doesn't
seem like a massively cunning plan.

--
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation
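For reference, a minimal sketch of issuing a discard from userspace
with the BLKDISCARD ioctl rather than writing zeroes (untested; the
device path and range are placeholders):

/* Punch a discard down to the block device with the BLKDISCARD ioctl
 * instead of writing zeroes.  Run against a scratch device only: the
 * discarded range is gone for good. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKDISCARD */

int main(void)
{
    int fd = open("/dev/sdX", O_WRONLY);    /* placeholder scratch device */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* { offset, length } in bytes; both should be block-aligned. */
    uint64_t range[2] = { 0, 1024ULL * 1024 * 1024 };

    if (ioctl(fd, BLKDISCARD, &range) < 0)
        perror("BLKDISCARD");   /* EOPNOTSUPP if the device lacks discard */

    close(fd);
    return 0;
}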
Re: btrfs for enterprise raid arrays
On Thu, 2009-04-02 at 21:34 -0700, Erwin van Londen wrote:
> Dear all,
>
> While going through the archived mailing list and crawling along the
> wiki, I didn't find any clues as to whether there would be any
> optimizations in Btrfs to make efficient use of functions and features
> that exist today on enterprise-class storage arrays. One exception was
> the ssd option, which I think can be an improvement on read and write
> IOs; however, when attached to a storage array it doesn't really
> matter from an OS perspective, since the OS can't look behind the
> array's front-end interface anyhow (whether it's FC/iSCSI or anything
> else).
>
> There are, however, more options we could think of. Almost all storage
> arrays these days have the capability to replicate volumes (or parts
> of them, in COW cases), either within the system or remotely. It would
> be handy if a Btrfs-formatted volume could make use of those features,
> since this might offload a lot of the processing time involved in
> maintaining them. The arrays already have optimized code for taking
> these snapshots. I'm not saying we should step away from host-based
> snapshots, but integration would be very nice.

Storage-based snapshotting would definitely be useful for replication
in btrfs, and in that case we could wire it up from userland. Basically
there is a point during commit where a storage snapshot could be taken
and be fully consistent.

Outside of replication, though, I'm not sure exactly where
storage-based snapshotting would come in. It wouldn't really be
compatible with the snapshots btrfs is already doing (but I'm always
open to more ideas).

> Furthermore, some enterprise arrays have a feature that allows for
> full or partial staging of data in cache. By this I mean that when a
> volume contains a certain number of blocks, you can define that the
> first X blocks are pre-staged in cache, which gives you extremely high
> IO rates on those first blocks. An option related to the -ssd
> parameter could be a mount command, say mount -t btrfs -ssd 0-1, so
> Btrfs knows what to expect from the pre-staged area and can perhaps
> optimize the locality of frequently used blocks for performance.

This would be very useful, although I would tend to export it to btrfs
as a second lun. My long-term goal is to have code in btrfs that
supports a super fast staging lun, which might be an ssd or cache
carved out of a high-end array.

> Another thing is that some arrays have the capability to
> thin-provision volumes. [...] Are there any thoughts on writing a
> low-priority thread that zeros out those unused blocks?

Other people have replied about the trim commands, which btrfs can
issue on every block it frees. But another way to look at this is that
btrfs already is thinly provisioned: when you add storage to btrfs, it
allocates from that storage in 1GB chunks, and then hands those over to
the FS allocation code for more fine-grained use.
It may make sense to talk about how that can fit in with your own thin
provisioning.

> Given the scalability targets of Btrfs, it will most likely be heavily
> used in enterprise environments once it reaches a stable code level.
> If we were able to interface with these array-based features, that
> would be very beneficial.
>
> Furthermore, one question pops to mind when looking at the scalability
> of Btrfs and its targeted capacity levels: I think we will run into
> problems with the capabilities of the server hardware itself. From
> what I can see now, it will not be designed as a distributed
> filesystem with an integrated distributed lock manager to scale out
> over multiple nodes. (I know Oracle is working on a similar thing, but
> this might make things more complicated than they already are.) This
> might impose some serious issues on recovery scenarios like
> backup/restore, since it will take quite some time to back up or
> restore a multi-PB system when it resides on just one physical host,
> even when we're talking high-end P-series, I25K or Superdome class.

This is true. Things like replication and failover are the best plans
for it today.

Thanks for your interest, we're always looking for ways to better
utilize high end storage features.

-chris
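To illustrate the "thinly provisioned by design" point Chris makes, a
toy sketch of two-level allocation: coarse 1GB chunks claimed from the
device only on demand, with extents sub-allocated inside the active
chunk. This is not btrfs's actual allocator, just the shape of the
idea.

/* Toy two-level allocator: space is claimed from the device in coarse
 * 1GB "chunks" only on demand, and extents are then sub-allocated
 * inside the active chunk.  NOT btrfs's real allocator -- a sketch of
 * why the FS is effectively thin-provisioned already. */
#include <stdio.h>
#include <stdint.h>

#define CHUNK_SIZE   (1ULL << 30)    /* 1GB coarse allocation unit */
#define DEVICE_SIZE  (5ULL << 40)    /* pretend 5TB device */

struct chunk {
    uint64_t base;      /* device offset of this chunk */
    uint64_t used;      /* bytes handed out inside the chunk */
};

static uint64_t next_chunk_base;     /* next unclaimed device offset */

/* Claim a fresh 1GB chunk from the device. */
static int alloc_chunk(struct chunk *c)
{
    if (next_chunk_base + CHUNK_SIZE > DEVICE_SIZE)
        return -1;                   /* device exhausted */
    c->base = next_chunk_base;
    c->used = 0;
    next_chunk_base += CHUNK_SIZE;
    return 0;
}

/* Fine-grained extent allocation; grabs a new chunk only when the
 * current one can't fit the request (the toy wastes the remainder). */
static int64_t alloc_extent(struct chunk *c, uint64_t len)
{
    if (len > CHUNK_SIZE)
        return -1;
    if (c->used + len > CHUNK_SIZE && alloc_chunk(c) < 0)
        return -1;
    int64_t off = c->base + c->used;
    c->used += len;
    return off;
}

int main(void)
{
    struct chunk cur;
    alloc_chunk(&cur);
    /* Two small extents land in the same chunk; only 1GB of the 5TB
     * device has actually been claimed so far. */
    printf("extent A at %lld\n", (long long)alloc_extent(&cur, 4096));
    printf("extent B at %lld\n", (long long)alloc_extent(&cur, 65536));
    printf("claimed from device: %llu bytes\n",
           (unsigned long long)next_chunk_base);
    return 0;
}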
Re: btrfs for enterprise raid arrays
On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
> On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > > Are there any thoughts on writing a low-priority thread that zeros
> > > out those unused blocks?
> >
> > Patches have been floating around to support this - see the recent
> > patches around DISCARD on linux-ide and lkml. It would be great to
> > get access to a box that implemented the T10 proposed UNMAP commands
> > that we could test against.
>
> We've already made btrfs support TRIM, and Matthew has patches which
> hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
> once the dust settles on the spec.

It seems like the dust has settled ... I just need to check that my
code still conforms to the spec. Understandably, I've been focused on
TRIM ;-)

> I don't think I've seen anybody talking about deliberately writing
> zeroes instead of just issuing a discard command, though. That doesn't
> seem like a massively cunning plan.

Yeah, WRITE SAME with the discard bit. A bit of a crappy way to go, to
be sure. I'm not exactly sure how we're supposed to decide whether to
issue an UNMAP or a WRITE SAME command. Perhaps if I read the spec
properly it'll tell me.

I just had a quick chat with someone from another storage vendor who
doesn't yet implement UNMAP -- if you do a WRITE SAME with all zeroes,
their device will notice that and unmap the LBAs in question.

Something for the plane on Sunday anyway.
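For reference, a sketch of what a single-extent UNMAP request would
look like on the wire under the T10 proposal. The field layouts here
are recalled from SBC-3 drafts of the time, so treat them as an
assumption and check the final spec; this only builds and hex-dumps the
buffers rather than sending them.

/* What a single-extent UNMAP looks like: a 10-byte CDB (opcode 0x42)
 * plus a parameter list of an 8-byte header and 16-byte block
 * descriptors.  Field layouts per SBC-3 drafts -- verify against the
 * final spec before relying on them. */
#include <stdio.h>
#include <stdint.h>

static void put_be16(unsigned char *p, uint16_t v)
{ p[0] = v >> 8; p[1] = v; }
static void put_be32(unsigned char *p, uint32_t v)
{ for (int i = 0; i < 4; i++) p[i] = v >> (24 - 8 * i); }
static void put_be64(unsigned char *p, uint64_t v)
{ for (int i = 0; i < 8; i++) p[i] = v >> (56 - 8 * i); }

int main(void)
{
    uint64_t lba = 8192;        /* placeholder extent to unmap */
    uint32_t nblocks = 2048;

    /* Parameter list: 8-byte header + one 16-byte block descriptor. */
    unsigned char param[24] = { 0 };
    put_be16(param + 0, 22);       /* UNMAP data length (bytes following) */
    put_be16(param + 2, 16);       /* block descriptor data length */
    put_be64(param + 8, lba);      /* descriptor: starting LBA */
    put_be32(param + 16, nblocks); /* descriptor: number of blocks */

    /* CDB: UNMAP opcode, parameter list length in bytes 7-8. */
    unsigned char cdb[10] = { 0x42 };
    put_be16(cdb + 7, sizeof(param));

    printf("CDB:  ");
    for (size_t i = 0; i < sizeof(cdb); i++) printf("%02x ", cdb[i]);
    printf("\nDATA: ");
    for (size_t i = 0; i < sizeof(param); i++) printf("%02x ", param[i]);
    printf("\n");
    return 0;
}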
Re: btrfs for enterprise raid arrays
On Fri, 2009-04-03 at 06:27 -0700, Matthew Wilcox wrote:
> On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
> > We've already made btrfs support TRIM, and Matthew has patches which
> > hook it up for ATA/IDE devices. Adding SCSI support shouldn't be
> > hard once the dust settles on the spec.
>
> It seems like the dust has settled ... I just need to check that my
> code still conforms to the spec. Understandably, I've been focused on
> TRIM ;-)
>
> > I don't think I've seen anybody talking about deliberately writing
> > zeroes instead of just issuing a discard command, though. That
> > doesn't seem like a massively cunning plan.
>
> Yeah, WRITE SAME with the discard bit. A bit of a crappy way to go, to
> be sure. I'm not exactly sure how we're supposed to decide whether to
> issue an UNMAP or a WRITE SAME command. Perhaps if I read the spec
> properly it'll tell me.

Actually, the point about WRITE SAME is that it's a far smaller patch
to the standards (just a couple of bits). Plus it gets around the
problem of what the array returns when an unmapped block is read
(which occupies pages in the UNMAP proposal), so from that point of
view it seems very logical.

> I just had a quick chat with someone from another storage vendor who
> doesn't yet implement UNMAP -- if you do a WRITE SAME with all zeroes,
> their device will notice that and unmap the LBAs in question.

I actually already looked at using WRITE SAME in sd.c ... it turns out
to be surprisingly little work ... the thing you'll like about it is
that there are no extents to worry about, and if you plan on writing
all zeros, you can keep a static zeroed data buffer around for the
purpose ...

James
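A userspace sketch of the WRITE SAME(16)-with-UNMAP-bit approach James
describes, sent via the SG_IO ioctl (untested and destructive; the
device path, LBA and block count are placeholders, and real code, as in
sd.c, would check the target's capabilities first):

/* WRITE SAME(16) with the UNMAP bit, issued via SG_IO.  Assumes
 * 512-byte logical blocks; scratch devices only. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
    int fd = open("/dev/sgX", O_RDWR);      /* placeholder sg node */
    if (fd < 0) { perror("open"); return 1; }

    uint64_t lba = 0;                       /* placeholder start LBA */
    uint32_t nblocks = 2048;                /* placeholder block count */
    unsigned char zeroes[512] = { 0 };      /* the "same" data: one zeroed block */
    unsigned char sense[32];

    /* WRITE SAME(16): opcode 0x93; UNMAP bit is bit 3 of byte 1. */
    unsigned char cdb[16] = { 0x93, 0x08 };
    for (int i = 0; i < 8; i++)             /* bytes 2-9: LBA, big-endian */
        cdb[2 + i] = lba >> (56 - 8 * i);
    for (int i = 0; i < 4; i++)             /* bytes 10-13: block count */
        cdb[10 + i] = nblocks >> (24 - 8 * i);

    struct sg_io_hdr io = { 0 };
    io.interface_id    = 'S';
    io.cmd_len         = sizeof(cdb);
    io.cmdp            = cdb;
    io.dxfer_direction = SG_DXFER_TO_DEV;
    io.dxfer_len       = sizeof(zeroes);
    io.dxferp          = zeroes;
    io.mx_sb_len       = sizeof(sense);
    io.sbp             = sense;
    io.timeout         = 60000;             /* ms */

    if (ioctl(fd, SG_IO, &io) < 0)
        perror("SG_IO");
    else if (io.status != 0)
        fprintf(stderr, "SCSI status 0x%x\n", io.status);

    close(fd);
    return 0;
}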
Re: btrfs for enterprise raid arrays
On Fri, 2009-04-03 at 07:43 -0400, Ric Wheeler wrote:
> Erwin van Londen wrote:
> > Another thing is that some arrays have the capability to
> > thin-provision volumes. [...] Are there any thoughts on writing a
> > low-priority thread that zeros out those unused blocks?
>
> Patches have been floating around to support this - see the recent
> patches around DISCARD on linux-ide and lkml. It would be great to get
> access to a box that implemented the T10 proposed UNMAP commands that
> we could test against.

We went several times around the block for the upcoming Linux
Filesystem and Storage workshop to see if anyone from the array vendors
might be interested in discussing thin provisioning. The general result
was no, since travel is tight.

The upshot is that most of our discard infrastructure will be focussed
on SSD TRIM, but we'll try to preserve the TP option for arrays ...
there are still private conversations going on with various people who
know the UNMAP/WRITE SAME requirements of the various arrays at the
various vendors.

James