Re: [zfs-discuss] Thin device support in ZFS?
et == Erik Trimble erik.trim...@sun.com writes:

et Probably, the smart thing to push for is inclusion of some new
et command in the ATA standard (in a manner like TRIM). Likely
et something that would return both native Block and Page sizes
et upon query.

that would be the *sane* thing to do. The *smart* thing to do would be to write a quick test to determine the apparent page size empirically, by performance-testing write-flush-write-flush-write-flush with various write sizes and finding the knee that indicates the smallest size at which read-before-write has stopped. The test could happen in 'zpool create' and have its result written into the vdev label.

Inventing ATA commands takes too long to propagate through the technosphere, and the EE's always implement them wrongly: for example, a device with SDRAM + supercap should probably report 512-byte sectors, because the algorithm for copying from SDRAM to NAND is subject to change and none of your business, but EE's are not good with language and will try to apelike match up the paragraph in the spec with the disorganized thoughts in their head, fit pegs into holes, and will end up giving you the NAND page size without really understanding why you wanted it other than that some standard they can't control demands it. They may not even understand why their devices are faster and slower---they are probably just hurling shit against an NTFS and shipping whatever runs some testsuite fastest---so doing the empirical test is the only way to document what you really care about in a way that will make it across the language and cultural barriers between people who argue about javascript vs python and ones who argue about Agilent vs LeCroy. Within the proprietary walls of these flash filesystem companies the testsuites are probably worth as much as the filesystem code, and out here without the wall an open-source statistical test is worth more than a haggled standard.
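The probe described above might look something like the following minimal sketch. The candidate sizes, the 0.75 knee threshold, and the write+fsync loop are assumptions for illustration, not what 'zpool create' would actually do:

```python
import os
import time

# Minimal sketch of the empirical page-size probe described above.
# Candidate sizes and the knee threshold are illustrative assumptions.

CANDIDATE_SIZES = [512, 1024, 2048, 4096, 8192, 16384]

def time_sync_writes(path, size, count=64):
    """Average latency of one write+flush pair of the given size."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        buf = os.urandom(size)
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, buf)
            os.fsync(fd)
        return (time.perf_counter() - start) / count
    finally:
        os.close(fd)

def find_page_size(latencies):
    """Find the knee: the smallest write size at which read-before-write
    has stopped. latencies maps size -> average seconds per sync write.
    While writes are sub-page, latency is roughly flat (one read-modify-
    write per write), so per-byte cost halves with each doubling; once it
    stops improving, the previous size was the apparent page size."""
    per_byte = {s: t / s for s, t in latencies.items()}
    sizes = sorted(per_byte)
    for prev, cur in zip(sizes, sizes[1:]):
        if per_byte[cur] > per_byte[prev] * 0.75:
            return prev
    return sizes[-1]
```

With a 4 kB page, sync-write latency is roughly constant from 512 B through 4 kB and only starts scaling above that, so the knee detector returns 4096.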
Remember the ``removable'' bit in USB sticks and the mess that both software and hardware made out of it. (hot-swappable SATA drives are ``non-removable'' and don't need rmformat while USB/firewire do? yeah, sorry, u fail abstraction. and USB drives have the ``removable medium'' bit set when the medium and the controller are inseparable, so it's the _controller_ that's removable? ya sorry u fail reading English.) If you can get an answer by testing, DO IT, and evolve the test to match products on the market as necessary. This promises to be a lot more resilient than the track record with bullshit ATA commands and will work with old devices too. By the time you iron out your standard we will be using optonanocyberflash instead: that's what happened with the removable bit and r/w optical storage. BTW let me know when read/write UDF 2.0 on dvd+r is ready---the standard was only announced twelve years ago, thanks.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
ah == Al Hopper a...@logical-approach.com writes:

ah The main issue is that most flash devices support 128k byte
ah pages, and the smallest chunk (for want of a better word) of
ah flash memory that can be written is a page - or 128kb. So if
ah you have a write to an SSD that only changes 1 byte in one 512
ah byte disk sector, the SSD controller has to either
ah read/re-write the affected page or figure out how to update
ah the flash memory with the minimum affect on flash wear.

yeah well, I'm not sure it matters, but that's untrue. There are two sizes for NAND flash, the minimum write size and the minimum erase size. The minimum write size is the size over which error correction is done, the unit at which inband and OOB data is interleaved on NAND flash. The minimum erase size is just what it sounds like, the size the cleaner/garbage collector must evacuate. The minimum write size is, I suppose, likely to provoke read/modify/write and waste write and wear bandwidth for smaller writes on flashes which do not have a DRAM+supercap, if you ask to SYNCHRONIZE CACHE right after the write. If there is a supercap, or if you allow the drive to do write caching, then the smaller write could be coalesced, making this size irrelevant. I think it's usually 2 - 4 kB. I would expect resistance to growing it larger than 4kB because of NTFS---electrical engineers are usually over-obsessed with Windows. The minimum erase size you don't really care about at all. That's the one that's usually at least 128kB.

ah For anyone who is interested in getting more details of the
ah challenges with flash memory, when used to build solid state
ah drives, reading the tech data sheets on the flash memory
ah devices will give you a feel for the basic issues that must be
ah solved.

and the linux-mtd list will give you a feel for how people are solving them, because that's the only place I know of where NAND filesystem work is going on in the open.
There are a bunch of geezers saying ``I wrote one for BSD but my employer won't let me release it,'' and then the new crop of intel/sandforce/stec proprietary kids, but in the open world AFAIK there is just yaffs and ubifs. The T-Mobile G1 is yaffs.

ah Bob's point is well made. The specifics of a given SSD
ah implementation will make the performance characteristics of
ah the resulting SSD very difficult to predict or even describe -

I'm really a fan of the idea of using an ACARD ANS-9010 for a slog. It's basically all DRAM+battery, and uses a low-performance CF card for durable storage if the battery starts to run low, or if you explicitly request it (to move data between ACARD units by moving the CF card, maybe). It will even make non-ECC RAM into ECC storage (using a sector size and OOB data :). It seems like Zeus-like performance at 1/10th the price, but of course it's a little goofy, and I've never tried it. slog is where I'd expect the high synchronous workload to be, so this is where there are small writes that can't be coalesced, I would presume, and appropriate slog sizes are reachable with DRAM alone.
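The distinction drawn above between minimum write (program) size and minimum erase size can be sketched numerically. The 4 kB and 128 kB figures are the typical examples from the post; the model (each uncoalesced synchronous write programs whole ECC/OOB units) is a simplifying assumption:

```python
# Sketch of the two NAND granularities discussed above. Sizes are the
# typical figures from the post; the one-full-unit-per-sync-write model
# is a simplifying assumption.

PROGRAM_SIZE = 4096          # minimum write unit: ECC/OOB interleave size
ERASE_SIZE = 128 * 1024      # minimum erase unit: what the cleaner evacuates

def bytes_programmed(io_size, program_size=PROGRAM_SIZE):
    """NAND bytes actually programmed for one uncoalesced sync write."""
    units = -(-io_size // program_size)      # ceiling division
    return units * program_size

def write_amplification(io_size, program_size=PROGRAM_SIZE):
    """Wear/bandwidth cost per logical byte; 1.0 means no waste."""
    return bytes_programmed(io_size, program_size) / io_size
```

A 512-byte synchronous write still programs a full 4 kB unit (8x amplification), which is exactly the cost a DRAM+supercap buffer can hide by coalescing.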
Re: [zfs-discuss] Thin device support in ZFS?
Yet another way to thin out the backing devices for a zpool on a thin-provisioned storage host, today: resilver. If your zpool has some redundancy across the SAN backing LUNs, simply drop and replace one at a time and allow zfs to resilver only the blocks currently in use onto the replacement LUN. -- Dan.
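Dan's drop-and-replace loop might be sketched as below. The pool and LUN names are placeholders, and the dry-run wrapper is just for illustration; `zpool replace <pool> <old> <new>` is the real command:

```python
import subprocess

# Sketch of the drop-and-replace thinning loop described above. Pool and
# LUN names are placeholders; the loop itself is an illustrative
# assumption, not an official procedure.

def replace_lun(pool, old_lun, new_lun, dry_run=True):
    """Replace one backing LUN; ZFS resilvers only the blocks in use,
    so the new thin LUN ends up backed only by live data."""
    cmd = ["zpool", "replace", pool, old_lun, new_lun]
    if dry_run:
        return cmd                # show what would run
    subprocess.check_call(cmd)    # one LUN at a time; wait for resilver
    return cmd

def thin_out(pool, replacements, dry_run=True):
    """replacements: list of (old_lun, new_lun) pairs, one at a time."""
    return [replace_lun(pool, o, n, dry_run) for o, n in replacements]
```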
Re: [zfs-discuss] Thin device support in ZFS?
dm == David Magda dma...@ee.ryerson.ca writes:

dm 4096-to-512 blocks

aiui NAND flash has a minimum write size (determined by ECC OOB bits) of 2 - 4kB, and a minimum erase size that's much larger. Remapping cannot abstract away the performance implication of the minimum write size if you are doing a series of synchronous writes smaller than the minimum size on a device with no battery/capacitor, although using a DRAM+supercap prebuffer might be able to abstract away some of it.
Re: [zfs-discuss] Thin device support in ZFS?
As a further update, I went back and re-read my SSD controller info, and then did some more Googling. Turns out, I'm about a year behind on State-of-the-SSD. Eric is correct on the way current SSDs implement writes (both SLC and MLC), so I'm issuing a mea culpa here. The change in implementation appears to have occurred sometime shortly after the introduction of the Indilinx controllers. My fault for not catching this. -Erik

Eric D. Mudama wrote:

On Sat, Jan 2 at 22:24, Erik Trimble wrote: In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you have a Page size of several multiples of that, 128k being common, but by no means ubiquitous.

I believe your terminology is crossed a bit. What you call a block is usually called a sector, and what you call a page is known as a block. Sector is (usually) the unit of reading from the NAND flash. The unit of write in NAND flash is the page, typically 2k or 4k depending on NAND generation, and thus consisting of 4-8 ATA sectors (typically). A single page may be written at a time. I believe some vendors support partial-page programming as well, allowing a single-sector append-type operation where the previous write left off. Ordered pages are collected into the unit of erase, which is known as a block (or erase block), and is anywhere from 128KB to 512KB or more, depending again on NAND generation, manufacturer, and a bunch of other things. Some large number of blocks are grouped by chip enables, often 4K or 8K blocks.

I think you're confusing erasing with writing. When I say minimum write size, I mean that for an MLC, no matter how small you make a change, the minimum amount of data actually being written to the SSD is a full page (128k in my example). There is no append down at this level.

Page is the unit of write, but it's much smaller in all NAND I am aware of.
If I have a page of 128k, with data in 5 of the 4k blocks, and I then want to add another 2k of data to this, I have to READ all 5 4k blocks into the controller's DRAM, add the 2k of data to that, then write out the full amount to a new page (if available), or wait for an older page to be erased before writing to it. Thus, in this case, in order to do an actual 2k write, the SSD must first read 10k of data, do some compositing, then write 12k to a fresh page. Thus, to change any data inside a single page, the entire contents of that page have to be read, the page modified, then the entire page written back out.

See above. What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work differently, but still have problems with what I'll call excess-writing.

I think you're only describing dumb SSDs with erase-block granularity mapping. Most (all) vendors have moved away from that technique since random write performance is awful in those designs and they fall over dead from wAmp in a jiffy. SLC and MLC NAND is similar, and they are read/written/erased almost identically by the controller.

I'm not sure that SSDs actually _have_ to erase - they just overwrite anything there with new data. But this is implementation dependent, so I can't say how /all/ MLC SSDs behave.

Technically you can program the same NAND page repeatedly, but since bits can only transition from 1 to 0 on a program operation, the result wouldn't be very meaningful. An erase sets all the bits in the block to 1, allowing you to store your data.

Once again, what I'm talking about is a characteristic of MLC SSDs, which are used in most consumer SSDs (the Intel X25-M included). Sure, such an SSD will commit any new writes to pages drawn from the list of never-before-used NAND. However, at some point, this list becomes empty. In most current MLC SSDs, there's about 10% extra (a 60GB advertised capacity is actually ~54GB usable, with 6-8GB extra).
Once this list is empty, the SSD has to start writing back to previously used pages, which may require an erase step first before any write. Which is why MLC SSDs slow down drastically once they've been filled to capacity several times.

From what I've seen, erasing a block typically takes a time on the same scale as programming an MLC page, meaning in flash with large page counts per block, the % of time spent erasing is not very large. Let's say that an erase took 100ms and a program took 10ms, in an MLC NAND device with 100 pages per block. In this design, it takes us 1s to program the entire block, but only 1/10 of the time to erase it. An infinitely fast erase would only make the design about 10% faster. For SLC the erase performance matters more, since page writes are much faster on average and there are half as many pages, but we were talking MLC. The performance differences seen are because the drives were artificially fast to begin with, because they were empty. It's similar to destroking a rotating drive in many ways to speed seek times. Once the drive is full, it all comes down to raw NAND performance, controller design, reserve/extra area (or TRIM) and algorithmic quality.
Re: [zfs-discuss] Thin device support in ZFS?
On Sat, Jan 2 at 22:24, Erik Trimble wrote: In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you have a Page size of several multiples of that, 128k being common, but by no means ubiquitous.

I believe your terminology is crossed a bit. What you call a block is usually called a sector, and what you call a page is known as a block. Sector is (usually) the unit of reading from the NAND flash. The unit of write in NAND flash is the page, typically 2k or 4k depending on NAND generation, and thus consisting of 4-8 ATA sectors (typically). A single page may be written at a time. I believe some vendors support partial-page programming as well, allowing a single-sector append-type operation where the previous write left off. Ordered pages are collected into the unit of erase, which is known as a block (or erase block), and is anywhere from 128KB to 512KB or more, depending again on NAND generation, manufacturer, and a bunch of other things. Some large number of blocks are grouped by chip enables, often 4K or 8K blocks.

I think you're confusing erasing with writing. When I say minimum write size, I mean that for an MLC, no matter how small you make a change, the minimum amount of data actually being written to the SSD is a full page (128k in my example). There is no append down at this level.

Page is the unit of write, but it's much smaller in all NAND I am aware of.

If I have a page of 128k, with data in 5 of the 4k blocks, and I then want to add another 2k of data to this, I have to READ all 5 4k blocks into the controller's DRAM, add the 2k of data to that, then write out the full amount to a new page (if available), or wait for an older page to be erased before writing to it. Thus, in this case, in order to do an actual 2k write, the SSD must first read 10k of data, do some compositing, then write 12k to a fresh page.
Thus, to change any data inside a single page, the entire contents of that page have to be read, the page modified, then the entire page written back out.

See above. What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work differently, but still have problems with what I'll call excess-writing.

I think you're only describing dumb SSDs with erase-block granularity mapping. Most (all) vendors have moved away from that technique since random write performance is awful in those designs and they fall over dead from wAmp in a jiffy. SLC and MLC NAND is similar, and they are read/written/erased almost identically by the controller.

I'm not sure that SSDs actually _have_ to erase - they just overwrite anything there with new data. But this is implementation dependent, so I can't say how /all/ MLC SSDs behave.

Technically you can program the same NAND page repeatedly, but since bits can only transition from 1 to 0 on a program operation, the result wouldn't be very meaningful. An erase sets all the bits in the block to 1, allowing you to store your data.

Once again, what I'm talking about is a characteristic of MLC SSDs, which are used in most consumer SSDs (the Intel X25-M included). Sure, such an SSD will commit any new writes to pages drawn from the list of never-before-used NAND. However, at some point, this list becomes empty. In most current MLC SSDs, there's about 10% extra (a 60GB advertised capacity is actually ~54GB usable, with 6-8GB extra). Once this list is empty, the SSD has to start writing back to previously used pages, which may require an erase step first before any write. Which is why MLC SSDs slow down drastically once they've been filled to capacity several times.

From what I've seen, erasing a block typically takes a time on the same scale as programming an MLC page, meaning in flash with large page counts per block, the % of time spent erasing is not very large.
Let's say that an erase took 100ms and a program took 10ms, in an MLC NAND device with 100 pages per block. In this design, it takes us 1s to program the entire block, but only 1/10 of the time to erase it. An infinitely fast erase would only make the design about 10% faster. For SLC the erase performance matters more, since page writes are much faster on average and there are half as many pages, but we were talking MLC. The performance differences seen are because the drives were artificially fast to begin with, because they were empty. It's similar to destroking a rotating drive in many ways to speed seek times. Once the drive is full, it all comes down to raw NAND performance, controller design, reserve/extra area (or TRIM) and algorithmic quality. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org
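The arithmetic in the post above, written out (the 100ms/10ms/100-pages figures are the hypothetical numbers from the post itself):

```python
# The worked numbers from the post above: hypothetical MLC NAND with a
# 100 ms block erase, 10 ms page program, and 100 pages per block.

ERASE_MS = 100
PROGRAM_MS = 10
PAGES_PER_BLOCK = 100

program_block_ms = PROGRAM_MS * PAGES_PER_BLOCK   # 1000 ms to fill a block
cycle_ms = program_block_ms + ERASE_MS            # full erase+program cycle

# An infinitely fast erase would remove ERASE_MS from every cycle:
max_speedup = cycle_ms / program_block_ms         # 1.1x, i.e. about 10%
```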
Re: [zfs-discuss] Thin device support in ZFS?
Eric D. Mudama did a very good job answering this, and I don't have much to add. Thanks Eric!

On 3 jan 2010, at 07.24, Erik Trimble wrote: I think you're confusing erasing with writing.

I am now quite certain that it actually was you who were confusing those. I hope this discussion has cleared things up a little, though.

What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work differently, but still have problems with what I'll call excess-writing.

Eric already said it, but I need to say this myself too: SLC and MLC disks could be almost identical; only the storing of the bits in the flash chips differs a little (1 or 2 bits per storage cell). There is absolutely no other fundamental difference between the two. Hopefully no modern MLC *or* SLC disk works as you described, since it is a horrible design, and selling it would be close to robbery. It would be slow and it would wear out quite fast. Now, SLC disks are typically better overall, because those who want to pay for SLC flash typically also want to pay for better controllers, but otherwise those issues are really orthogonal.

I'm not sure that SSDs actually _have_ to erase - they just overwrite anything there with new data. But this is implementation dependent, so I can't say how /all/ MLC SSDs behave.

As Eric said - yes, you have to erase, otherwise you can't write new data. It is not implementation dependent; it is inherent in the flash technology. And, as has been said several times now, erasing can only be done in large chunks, but writing can be done in small chunks. I'd say that this is the main problem to handle when creating a good flash SSD.

The whole point behind ZFS is that CPU cycles are cheap and available, much more so than dedicated hardware of any sort.
What I'm arguing here is that the controller on an SSD is in the same boat as a dedicated RAID HBA - in the latter case, use a cheap HBA instead and let the CPU (ZFS) do the work, while in the former case, use a dumb controller for the SSD instead of a smart one.

This could be true; I am still not sure. My main issue with this is that it would make the file system code dependent on a specific hardware behavior (that of today's flash chips), and that it could be quite a lot of data to shuffle around when compacting. But we'll see. If it could be cheap enough, it could absolutely happen and be worth it even if it has some drawbacks.

And, as I pointed out in another message, doing it my way doesn't increase bus traffic that much over what is being done now, in any case.

Yes, it would increase bus traffic - if you handle the flash compacting in the host, which you have to with your idea, it could be many times the real workload bandwidth. But it could still be worth it, that is quite possible.

- On 3 jan 2010, at 07.43, Erik Trimble wrote: I meant to say that I DON'T know how all MLC drives deal with erasure.

Again - yes they do. (Or they would be write-once only. :-)

I'm pretty sure compacting doesn't occur in ANY SSDs without any OS intervention (that is, the SSD itself doesn't do it), and I'd be surprised to see an OS try to implement some sort of intra-page compaction - the benefit doesn't seem to be there; it's better just to optimize writes than try to compact existing pages. As far as reclaiming unused space, the TRIM command is there to allow the SSD to mark a page Free for reuse, and an SSD isn't going to be erasing a page unless it's right before something is to be written to that page. My thinking of what compacting meant doesn't match up with what I see the general usage in the SSD technical papers is, so in this respect, I'm wrong: compacting does occur, but only when there are no fully erased (or unused) pages available.
Thus, compacting is done in the context of a write operation.

Exactly what and when it is that triggers compacting is another issue, and that could probably change with firmware revisions. It is wise to do it earlier than when you get that write that didn't fit, since if you have some erased space you can then take bursts of writes up to that size quickly. But compacting takes bandwidth from the flash chips and wears them out, so you don't want to do it too early or too much. I guess this could be an interesting optimization problem, and optimal behavior probably depends on the workload too. Maybe it should be an adjustable knob.

- On 3 jan 2010, at 10.57, Eric D. Mudama wrote: On Sat, Jan 2 at 22:24, Erik Trimble wrote: In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you have a Page size of several multiples of that, 128k being common, but by no means ubiquitous. I believe your terminology is crossed a bit. What you call a block is usually called a sector, and what you call a page is known as a block. Sector is (usually) the unit of reading from the NAND flash. ...

Indeed, and I am partly guilty of that mess, but
Re: [zfs-discuss] Thin device support in ZFS?
On 1 jan 2010, at 17.44, Richard Elling wrote: On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote: Flash SSDs actually always remap new writes in an only-append-to-new-pages style, pretty much as ZFS does itself. So for an SSD there is no big difference between ZFS and filesystems such as UFS, NTFS, HFS+ et al; at the flash level they all work the same. The reason is that there is no way for it to rewrite single disk blocks; it can only fill up already erased pages of 512K (for example). When the old blocks get mixed with unused blocks (because of block rewrites, TRIM or Write Same/UNMAP), it needs to compact the data by copying all active blocks from those pages into previously erased pages, and there write the active data compacted/contiguous. (When this happens, things tend to get really slow.)

However, the quantity of small, overwritten pages is vastly different. I am not convinced that a workload that generates few overwrites will be penalized as much as a workload that generates a large number of overwrites.

Zfs is not append-only in itself; there will be holes from deleted files after a while, and space will have to be reclaimed sooner or later. I am not convinced that a zfs that has been in use for a while rewrites a lot less than other file systems. But maybe you are right, and if so, I agree that intuitively such a workload may be better matched to a flash based device. If you have a workload that only appends data and never changes or deletes it, zfs is probably a bit better than other file systems at not rewriting blocks. But that is a pretty special use case, and another file system could rewrite almost as little.

I think most folks here will welcome good, empirical studies, but thus far the only one I've found is from STEC, and their disks behave very well after they've been filled and subjected to a rewrite workload. You get what you pay for.
Additional pointers are always appreciated :-) http://www.stec-inc.com/ssd/videos/ssdvideo1.php

There certainly are big differences between the flash SSD drives out there, I wouldn't argue about that for a second! /ragge
Re: [zfs-discuss] Thin device support in ZFS?
On 1 jan 2010, at 17.28, David Magda wrote: On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote: But that would only move the hardware-specific and dependent flash chip handling code into the file system code, wouldn't it? What is won with that? As long as the flash chips have larger pages than the file system blocks, someone will have to shuffle around blocks to reclaim space; why not let the one thing that knows the hardware and also is very close to the hardware do it? And if this is good for SSDs, why isn't it as good for rotating rust?

Don't really see how things are either hardware specific or dependent.

The inner workings of an SSD flash drive are pretty hardware (or rather vendor) specific, and it may not be a good idea to move any knowledge about that to the file system layer.

COW is COW. Am I missing something? It's done by code somewhere in the stack; if the FS knows about it, it can lay things out in sequential writes. If we're talking about 512 KB blocks, ZFS in particular would create four 128 KB txgs--and 128 KB is simply the currently #define'd size, which can be changed in the future.

As I said in another mail, zfs is not append-only, especially not if it has been in random read/write use for a while. There will be holes in the data and space to be reclaimed; something has to handle that, and I am not sure it is a good idea to move that into the host, since it is dependent on the design of the SSD drive.

One thing you gain is perhaps not requiring as much of a reserve. At most you have some hidden bad block re-mapping, similar to rotating rust nowadays. If you're shuffling blocks around, you're doing a read-modify-write, which, if done in the file system, you could use as a mechanism to defrag on-the-fly or to group many small files together.

Yes, defrag on the fly may be interesting. Otherwise I am not sure I think the file system should do any of that, since it may be that it can be done much faster and smarter in the SSD controller.
Not quite sure what you mean by your last question.

I meant that if hardware-dependent handling of the storage medium is good to move into the host, why isn't the same true for spinning disks? But we can leave that for now. /ragge
Re: [zfs-discuss] Thin device support in ZFS?
Mike,

As far as I know only Hitachi is using such a huge chunk size:

So each vendor's implementation of TP uses a different block size. HDS use 42MB on the USP, EMC use 768KB on DMX, IBM allow a variable size from 32KB to 256KB on the SVC, and 3Par use blocks of just 16KB. The reasons for this are many and varied, and for legacy hardware are a reflection of the underlying hardware architecture.

http://gestaltit.com/all/tech/storage/chris/thin-provisioning-holy-grail-utilisation/

Also, here Hu explains the reasons why they believe 42MB is the most efficient: http://blogs.hds.com/hu/2009/07/chunk-size-matters.html He has some good points in his arguments.

Regards, sendai
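One way to see the trade-off behind the chunk sizes quoted above: mapping-table cost scales inversely with chunk size, while space stranded in partially-used chunks scales with it. The pool size and bytes-per-mapping-entry below are assumptions for illustration, not vendor figures:

```python
# Illustration of the thin-provisioning chunk-size trade-off discussed
# above. Pool size and mapping-entry size are assumed; the vendor chunk
# sizes are the ones quoted in the post.

POOL_BYTES = 100 * 2**40      # assume a 100 TiB thin-provisioned pool
ENTRY_BYTES = 8               # assume 8 bytes of metadata per chunk mapping

def mapping_table_bytes(chunk_bytes):
    """Metadata needed to map every chunk in the pool."""
    return (POOL_BYTES // chunk_bytes) * ENTRY_BYTES

def worst_case_waste(chunk_bytes, allocations):
    """Upper bound on space stranded by partially-filled chunks."""
    return allocations * (chunk_bytes - 1)

CHUNKS = {"3Par": 16 * 2**10, "EMC DMX": 768 * 2**10, "HDS USP": 42 * 2**20}
```

Under these assumptions, 16KB chunks need a 50 GiB mapping table for a 100 TiB pool, while 42MB chunks need under 20 MB of mapping but can strand up to ~42MB per sparse allocation, which is roughly the balance Hu argues about.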
Re: [zfs-discuss] Thin device support in ZFS?
Ragnar Sundblad ra...@csc.kth.se wrote: On 1 jan 2010, at 17.28, David Magda wrote: Don't really see how things are either hardware specific or dependent. The inner workings of an SSD flash drive are pretty hardware (or rather vendor) specific, and it may not be a good idea to move any knowledge about that to the file system layer.

If ZFS likes to keep SSDs fast even after they have been in use for a while, then even ZFS would need to tell the SSD which sectors are no longer in use. Such a mode may cause a noticeable performance loss, as ZFS for this reason may need to traverse freed outdated data trees, but it will help the SSD to erase the needed space in advance.

Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] Thin device support in ZFS?
Ragnar Sundblad ra...@csc.kth.se wrote: I certainly agree, but there still isn't much they can do about the WORM-like properties of flash chips, where reading is pretty fast, writing is not too bad, but erasing is very slow and must be done in pretty large pages, which also means that active data probably has to be copied around before an erase.

WORM devices do not allow writing a block a second time. There is a typical 5% reserve that would allow reassigning some blocks and making it appear they have been rewritten, but this is not what ZFS does. You are however right that there is a slight relation, as I did invent COW for a WORM filesystem in 1989 ;-)

Jörg
Re: [zfs-discuss] Thin device support in ZFS?
Joerg Schilling wrote: Ragnar Sundblad ra...@csc.kth.se wrote: On 1 jan 2010, at 17.28, David Magda wrote: Don't really see how things are either hardware specific or dependent. The inner workings of an SSD flash drive are pretty hardware (or rather vendor) specific, and it may not be a good idea to move any knowledge about that to the file system layer. If ZFS likes to keep SSDs fast even after they have been in use for a while, then even ZFS would need to tell the SSD which sectors are no longer in use. Such a mode may cause a noticeable performance loss, as ZFS for this reason may need to traverse freed outdated data trees, but it will help the SSD to erase the needed space in advance. Jörg

the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD's internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previously-written-to blocks are actually no longer in use. See the parallel discussion here titled preview of new SSD based on SandForce controller for more about smart vs dumb SSD controllers.

From ZFS's standpoint, the optimal configuration would be for the SSD to inform ZFS as to its PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
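The configuration Erik describes amounts to something like the sketch below. The page-size query itself is hypothetical (as noted elsewhere in the thread, no such ATA command exists), and the power-of-two shift is an analogy to ZFS's alignment shift, not the actual mechanism:

```python
# Sketch of the proposal above: given a device-reported NAND page size,
# derive the power-of-two write granularity the filesystem would adopt.
# The page-size query is hypothetical; the shift is only an analogy to
# ZFS's alignment shift (ashift).

def write_shift_for_page(page_size):
    """Smallest power-of-two exponent whose size covers the page."""
    shift = 9                        # never below a 512-byte sector
    while (1 << shift) < page_size:
        shift += 1
    return shift

def aligned_write_size(io_size, page_size):
    """Round a write up to a whole number of pages, per the proposal;
    reads would remain free to use smaller sizes."""
    unit = 1 << write_shift_for_page(page_size)
    return -(-io_size // unit) * unit    # ceiling to a page multiple
```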
Re: [zfs-discuss] Thin device support in ZFS?
Erik Trimble erik.trim...@sun.com wrote: From ZFS's standpoint, the optimal configuration would be for the SSD to inform ZFS as to its PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer

It seems that a command to retrieve this information does not yet exist - or could you help me?

Jörg
Re: [zfs-discuss] Thin device support in ZFS?
On 2 jan 2010, at 12.43, Joerg Schilling wrote: Ragnar Sundblad ra...@csc.kth.se wrote: I certainly agree, but there still isn't much they can do about the WORM-like properties of flash chips, where reading is pretty fast, writing is not too bad, but erasing is very slow and must be done in pretty large pages, which also means that active data probably has to be copied around before an erase. WORM devices do not allow writing a block a second time. (I know, that is why I wrote WORM-like.) There is a typical 5% reserve that would allow reassigning some blocks to make it appear they have been rewritten, but this is not what ZFS does. Well, zfs kind of does, but especially typical flash SSDs do it: they have a redirection layer so that any block can go anywhere, so they can use the flash media in a WORM-like style with occasional bulk erases. You are, however, right that there is a slight relation, as I did invent COW for a WORM filesystem in 1989 ;-) Yes, there indeed are several similarities. /ragge
Re: [zfs-discuss] Thin device support in ZFS?
On 2 jan 2010, at 13.10, Erik Trimble wrote: [...] From ZFS's standpoint, the optimal configuration would be for the SSD to inform ZFS as to its PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future; I guess BP Rewrite is what is needed).
I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller; there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). /ragge
Re: [zfs-discuss] Thin device support in ZFS?
Joerg Schilling wrote: Erik Trimble erik.trim...@sun.com wrote: From ZFS's standpoint, the optimal configuration would be for the SSD to inform ZFS as to its PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer It seems that a command to retrieve this information does not yet exist, or can you help me? Jörg Sadly, no, there does not exist any way for the SSD to communicate that info back to the OS. Probably, the smart thing to push for is inclusion of some new command in the ATA standard (in a manner like TRIM). Likely something that would return both native Block and Page sizes upon query. I'm still trying to see if there will be any support for TRIM-like things in SAS. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Thin device support in ZFS?
Ragnar Sundblad wrote: On 2 jan 2010, at 13.10, Erik Trimble wrote: [...] Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). Sure, it does that today. What do you think happens on a standard COW action? Let's be clear here: I'm talking about exactly the same thing that currently happens when you modify a ZFS block that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to its Free Block List).
This is one of the things that leads to ZFS's fragmentation issues (note, we're talking about block fragmentation on the vdev, not ZFS block fragmentation), and something we're looking to BP rewrite for, to enable defragging to be implemented. In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let's remember that a ZFS block is a maximum of only 8x the size of an SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, whereas an R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of just change this 1 bit and leave everything else on disk where it is), so you likely don't save on the number of pages that need to be written in any case. I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course).
/ragge The point here is that regardless of the workload, there's an R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it's almost certainly able to do so far faster than any little SSD controller can do. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
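Erik's one-IOP-versus-three-IOPS argument can be put into a toy cost model. This is my own sketch, not his numbers: the 128 KiB ZFS block matches the maximum recordsize discussed in the thread, while the 16 KiB page size is an assumption chosen so that one block spans the "8x" worth of pages he mentions.

```python
import math

# Toy cost model: a host-side COW write of a whole ZFS block is one
# IOP; a device-side read-modify-write costs three (read, modify,
# write). Sizes are assumptions for illustration.

ZFS_BLOCK = 128 * 1024   # max ZFS recordsize at the time
SSD_PAGE = 16 * 1024     # assumed page size: 8 pages per block

def cow_cost(block: int = ZFS_BLOCK, page: int = SSD_PAGE) -> dict:
    """ZFS composites the block in RAM/L2ARC; one write op to the SSD."""
    return {"iops": 1, "pages_written": math.ceil(block / page)}

def rmw_cost(block: int = ZFS_BLOCK, page: int = SSD_PAGE) -> dict:
    """SSD reads the old pages, modifies in DRAM, writes new pages."""
    return {"iops": 3, "pages_written": math.ceil(block / page)}

print(cow_cost())  # {'iops': 1, 'pages_written': 8}
print(rmw_cost())  # {'iops': 3, 'pages_written': 8}
```

Either way the same number of pages gets programmed; the difference is in operations issued, which is the core of the argument.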
Re: [zfs-discuss] Thin device support in ZFS?
On 2 jan 2010, at 22.49, Erik Trimble wrote: Ragnar Sundblad wrote: [...] If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). Sure, it does that today. What do you think happens on a standard COW action? Let's be clear here: I'm talking about exactly the same thing that currently happens when you modify a ZFS block that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to its Free Block List).
This is one of the things that leads to ZFS's fragmentation issues (note, we're talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we're looking to BP rewrite to enable defragging to be implemented. What I am talking about is to be able to reuse the free space you get in the previously written data when you write modified data to new places on the disk, or just remove a file for that matter. To be able to reclaim that space with flash, you have to erase large pages (for example 512 KB), but before you erase, you will also have to save away all still valid data in that page and rewrite that to a free page. What I am saying is that I am not sure that this would be best done in the file system, since it could be quite a bit of data to shuffle around, and there could possibly be hardware specific optimizations that could be done here that zfs wouldn't know about. A good flash controller could probably do it much better. (And a bad one worse, of course.) And as far as I know, zfs can not do that today - it can not move around already written data, not for defragmentation, not for adding or removing disks to stripes/raidz:s, not for deduping/duping and so on, and I have understood it as BP Rewrite could solve a lot of this. Still, it could certainly be useful if zfs could try to use a blocksize that matches the SSD erase page size - this could avoid having to copy and compact data before erasing, which could speed up writes in a typical flash SSD disk. In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. 
This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let's remember that a ZFS block is a maximum of only 8x the size of an SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, whereas an R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of just change this 1 bit and leave everything else on disk where it is), so you likely don't save on the number of pages that need to be written in any case. I don't think many SSDs do R-M-W, but rather just append blocks to free pages (pretty
Re: [zfs-discuss] Thin device support in ZFS?
On Jan 2, 2010, at 16:49, Erik Trimble wrote: My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it's almost certainly able to do so far faster than any little SSD controller can do. Though one advantage of doing it within the disk is that you're not using up bus bandwidth. Probably not that big of a deal, but worth mentioning for completeness / fairness.
Re: [zfs-discuss] Thin device support in ZFS?
On Jan 2, 2010, at 1:47 AM, Andras Spitzer wrote: Mike, As far as I know only Hitachi is using such a huge chunk size : So each vendor's implementation of TP uses a different block size. HDS use 42MB on the USP, EMC use 768KB on DMX, IBM allow a variable size from 32KB to 256KB on the SVC and 3Par use blocks of just 16KB. The reasons for this are many and varied and for legacy hardware are a reflection of the underlying hardware architecture. http://gestaltit.com/all/tech/storage/chris/thin-provisioning-holy-grail-utilisation/ Also, here Hu explains the reason why they believe 42M is the most efficient : http://blogs.hds.com/hu/2009/07/chunk-size-matters.html He has some good points in his arguments. Yes, and they apply to ZFS dedup as well... :-) -- richard
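The chunk sizes quoted above make the thin-provisioning space trade-off easy to quantify. A back-of-envelope sketch (mine, using only the figures cited in the article): each thin volume strands, on average, about half a chunk in its last, partially filled chunk, so larger chunks waste more per volume.

```python
# Average space stranded in a volume's final partial chunk, for the
# chunk sizes quoted above. Pure arithmetic, no vendor data beyond
# the cited figures.

CHUNK_BYTES = {
    "HDS USP": 42 * 2**20,   # 42 MB
    "EMC DMX": 768 * 2**10,  # 768 KB
    "3Par":    16 * 2**10,   # 16 KB
}

def avg_tail_waste_mib(chunk: int) -> float:
    """Expected MiB stranded in a volume's final partial chunk."""
    return (chunk / 2) / 2**20

for name, chunk in CHUNK_BYTES.items():
    print(f"{name}: ~{avg_tail_waste_mib(chunk):.4f} MiB per volume")
```

The flip side, as the cited blog argues, is that larger chunks mean less mapping metadata to track, which is the same tension ZFS dedup faces with its block size.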
Re: [zfs-discuss] Thin device support in ZFS?
Ragnar Sundblad wrote: On 2 jan 2010, at 22.49, Erik Trimble wrote: [...] Sure, it does that today. What do you think happens on a standard COW action? Let's be clear here: I'm talking about exactly the same thing that currently happens when you modify a ZFS block that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to its Free Block List).
This is one of the things that leads to ZFS's fragmentation issues (note, we're talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we're looking to BP rewrite to enable defragging to be implemented. What I am talking about is to be able to reuse the free space you get in the previously written data when you write modified data to new places on the disk, or just remove a file for that matter. To be able to reclaim that space with flash, you have to erase large pages (for example 512 KB), but before you erase, you will also have to save away all still valid data in that page and rewrite that to a free page. What I am saying is that I am not sure that this would be best done in the file system, since it could be quite a bit of data to shuffle around, and there could possibly be hardware specific optimizations that could be done here that zfs wouldn't know about. A good flash controller could probably do it much better. (And a bad one worse, of course.) You certainly DO get to reuse the free space again. Here's what happens nowadays in an SSD: Let's say I have 4k blocks, grouped into a 128k page. That is, the SSD's fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page. So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 7 blocks are still on the SSD's Free List (i.e. list of free space). Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into its local cache on the controller, do the modification and append computing, then writes out 28 blocks to NAND.
In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller). For filesystems like ZFS, this is a whole lot of extra work being done that doesn't need to happen (and, chews up valuable IOPS and time). For, when ZFS does a write, it doesn't merely just twiddle the modified/appended bits - instead, it creates a whole new ZFS block to write. In essence, ZFS has already done all the work that the SSD controller is planning on doing. So why duplicate the effort? SSDs should simply notify ZFS about their block page sizes, which would then allow ZFS to better align its own variable block size to optimally coincide with the SSD's implementation. And as far as I know, zfs can not do that today - it can not move around already written data, not for defragmentation, not for adding or removing
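The arithmetic in Erik's example, spelled out (the 4 KiB block and 128 KiB page are his illustration, not a specific drive):

```python
# Erik's 4 KiB-block / 128 KiB-page example as arithmetic.

BLOCK = 4 * 1024
PAGE = 128 * 1024
BLOCKS_PER_PAGE = PAGE // BLOCK        # 32

def blocks_needed(nbytes: int) -> int:
    """Ceiling division: blocks required to hold nbytes."""
    return -(-nbytes // BLOCK)

used = blocks_needed(100 * 1024)       # 100 KiB file -> 25 blocks
free_in_page = BLOCKS_PER_PAGE - used  # 32 - 25 = 7 blocks still free
rewritten = blocks_needed(110 * 1024)  # after the +10 KiB append -> 28
print(used, free_in_page, rewritten)   # 25 7 28
```

So a one-byte change plus a 10 KiB append turns into a 28-block (full-page) rewrite on the device, which is the R-M-W overhead under discussion.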
Re: [zfs-discuss] Thin device support in ZFS?
David Magda wrote: On Jan 2, 2010, at 16:49, Erik Trimble wrote: My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it's almost certainly able to do so far faster than any little SSD controller can do. Though one advantage of doing it within the disk is that you're not using up bus bandwidth. Probably not that big of a deal, but worth mentioning for completeness / fairness. This is true. But, also in fairness, this is /already/ being used by the COW nature of ZFS. Changing one bit in a file causes the /entire/ ZFS block containing that bit to be re-written. So I'm not really using much (if any) more bus bandwidth by doing the SSD page layout in the OS rather than in the SSD controller. Remember that I'm highly likely not to have to read anything from the SSD to do the page rewrite, as the data I want is already in the L2ARC. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Thin device support in ZFS?
On 3 jan 2010, at 04.19, Erik Trimble wrote: Ragnar Sundblad wrote: On 2 jan 2010, at 22.49, Erik Trimble wrote: [...] The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to its Free Block List).
[...] You certainly DO get to reuse the free space again. Here's what happens nowadays in an SSD: Let's say I have 4k blocks, grouped into a 128k page. That is, the SSD's fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page. Do you know of SSD disks that have a minimum write size of 128 KB? I don't understand why it would be designed that way. A typical flash chip has pretty small write block sizes, like 2 KB or so, but they can only erase in pages of 128 KB or so. (And then you are running a few of those in parallel to get some speed, so these numbers often multiply with the number of parallel chips, like 4 or 8 or so.) Typically, you have to write the 2 KB blocks consecutively in a page. Pretty much all set up for an append-style system. :-) In addition, flash SSDs typically have some DRAM write buffer that buffers up writes (like a txg, if you will), so small writes should not be a problem - just collect a few and append!
So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 7 blocks are still on the SSD's Free List (i.e. list of free space). Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into its local cache on the controller, do the modification and append computing, then writes out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List
Re: [zfs-discuss] Thin device support in ZFS?
On 3 jan 2010, at 06.07, Ragnar Sundblad wrote: (I don't think they typically merge pages, I believe they rather just pick pages with some freed blocks, copies the active blocks to the end of the disk, and erases the page.) (And of course you implement wear leveling with the same mechanism - when the wear differs too much, pick a page with low wear and copy it to a more worn page.) I actually happened to stumble on an application note from Numonyx that describes the append-style SSD disk and space reclamation method I described, right here: http://www.numonyx.com/Documents/Application%20Notes/AN1821.pdf (No - I had not read this before writing my previous mail! :-) To me, it seems also in this paper that it is common knowledge that this is how you should implement a flash SSD disk - if you don't do anything fancier of course. /ragge
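The append-and-reclaim scheme Ragnar describes (and the Numonyx note documents) can be sketched in a few lines: writes append into the current erase page; reclamation picks a page with freed blocks, copies the still-live blocks forward, then erases the page. This is my own toy model, entirely illustrative, not any vendor's algorithm.

```python
from dataclasses import dataclass, field

BLOCKS_PER_PAGE = 32  # assumed blocks per erase page

@dataclass
class ErasePage:
    live: set = field(default_factory=set)  # slots holding valid data
    written: int = 0                        # slots programmed so far

class Flash:
    def __init__(self, npages: int):
        self.pages = [ErasePage() for _ in range(npages)]
        self.head = 0  # erase page currently being appended to

    def append(self, blocks: int) -> None:
        page = self.pages[self.head]
        for _ in range(blocks):
            if page.written == BLOCKS_PER_PAGE:
                self.head += 1              # move on to next erased page
                page = self.pages[self.head]
            page.live.add(page.written)
            page.written += 1

    def invalidate(self, pageno: int, slot: int) -> None:
        # a block freed by COW or TRIM: mark it dead, erase comes later
        self.pages[pageno].live.discard(slot)

    def reclaim(self) -> None:
        # pick the full page with the fewest live blocks, relocate its
        # live data to the append head, then erase it for reuse
        victims = [p for p in self.pages if p.written == BLOCKS_PER_PAGE]
        v = min(victims, key=lambda p: len(p.live))
        self.append(len(v.live))            # copy live blocks forward
        v.live.clear(); v.written = 0       # erase the page

f = Flash(4)
f.append(32)                    # fill page 0
f.invalidate(0, 0); f.invalidate(0, 1)
f.reclaim()                     # 30 live blocks move to page 1; page 0 erased
print(f.pages[0].written, f.pages[1].written)  # 0 30
```

Wear leveling, as Ragnar notes, falls out of the same mechanism: bias the victim choice toward low-wear pages instead of (only) mostly-dead ones.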
Re: [zfs-discuss] Thin device support in ZFS?
Ragnar Sundblad wrote: On 3 jan 2010, at 04.19, Erik Trimble wrote: Let's say I have 4k blocks, grouped into a 128k page. That is, the SSD's fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page. Do you know of SSD disks that have a minimum write size of 128 KB? I don't understand why it would be designed that way. A typical flash chip has pretty small write block sizes, like 2 KB or so, but they can only erase in pages of 128 KB or so. (And then you are running a few of those in parallel to get some speed, so these numbers often multiply with the number of parallel chips, like 4 or 8 or so.) Typically, you have to write the 2 KB blocks consecutively in a page. Pretty much all set up for an append-style system. :-) In addition, flash SSDs typically have some DRAM write buffer that buffers up writes (like a txg, if you will), so small writes should not be a problem - just collect a few and append! In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you have a Page size of several multiples of that, 128k being common, but by no means ubiquitous. I think you're confusing erasing with writing. When I say minimum write size, I mean that for an MLC, no matter how small you make a change, the minimum amount of data actually being written to the SSD is a full page (128k in my example). There is no append down at this level. If I have a page of 128k, with data in 5 of the 4k blocks, and I then want to add another 2k of data to this, I have to READ all 5 4k blocks into the controller's DRAM, add the 2k of data to that, then write out the full amount to a new page (if available), or wait for an older page to be erased before writing to it. Thus, in this case, in order to do an actual 2k write, the SSD must first read 10k of data, do some compositing, then write 12k to a fresh page.
Thus, to change any data inside a single page, the entire contents of that page have to be read, the page modified, then the entire page written back out. So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 7 blocks are still on the SSD's Free List (i.e. list of free space). Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into its local cache on the controller, do the modification and append computing, then writes out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller). Do you know for sure that you have SSD flash disks that work this way? It seems incredibly stupid. It would also use up the available erase cycles much faster than necessary. What write speed do you get? What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work differently, but still have problems with what I'll call excess-writing. And as far as I know, zfs can not do that today - it can not move around already written data, not for defragmentation, not for adding or removing disks to stripes/raidz:s, not for deduping/duping and so on, and I have understood it as BP Rewrite could solve a lot of this. ZFS's propensity to fragmentation doesn't mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous.
So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device's Free List. Now, in SSD's case, this isn't a worry. Due to the completely even performance characteristics of NAND, it doesn't make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD. Yes, there is something to worry about, as you can only erase flash in large pages - you can not erase them only where the free data blocks in the Free List are. I'm not sure that SSDs actually _have_ to erase - they just overwrite anything there with new data. But this is implementation dependent, so I can say how /all/ MLC SSDs behave. (I don't think they typically merge pages, I believe they rather just pick pages with some freed blocks, copies the active blocks to the end of the disk, and erases the page.) Well, the algorithms are often trade
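The sub-page update behavior Erik describes for MLC drives reduces to read-merge-program: read the whole page, splice in the new bytes in controller DRAM, and program a fresh page. A minimal sketch of that model (hypothetical, not any vendor's firmware; the page size is an assumption):

```python
# Any change smaller than a page is emulated by rewriting the whole
# page. PAGE is an assumed size for illustration.

PAGE = 128 * 1024

def update_page(old_page: bytes, offset: int, data: bytes) -> bytes:
    """Return the full new page image after a sub-page overwrite."""
    assert len(old_page) == PAGE and offset + len(data) <= PAGE
    return old_page[:offset] + data + old_page[offset + len(data):]

old = bytes(PAGE)                             # existing page contents
new = update_page(old, 4096, b"\xff" * 2048)  # a 2 KiB modification
print(len(new))  # 131072: a whole page is programmed regardless
```

This is why the write amplification argument in the thread hinges on who holds the cached copy of the old data: the host (L2ARC) or the drive's controller DRAM.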
Re: [zfs-discuss] Thin device support in ZFS?
Erik Trimble wrote: Ragnar Sundblad wrote: Yes, there is something to worry about, as you can only erase flash in large pages - you can not erase them only where the free data blocks in the Free List are. I'm not sure that SSDs actually _have_ to erase - they just overwrite anything there with new data. But this is implementation dependent, so I can say how /all/ MLC SSDs behave. I meant to say that I DON'T know how all MLC drives deal with erasure. (I don't think they typically merge pages, I believe they rather just pick pages with some freed blocks, copies the active blocks to the end of the disk, and erases the page.) That is correct, as your pointer to the Numonyx doc explains. I'm pretty sure compacting doesn't occur in ANY SSDs without OS intervention (that is, the SSD itself doesn't do it), and I'd be surprised to see an OS try to implement some sort of intra-page compaction - the benefit doesn't seem to be there; it's better just to optimize writes than try to compact existing pages. As far as reclaiming unused space, the TRIM command is there to allow the SSD to mark a page Free for reuse, and an SSD isn't going to be erasing a page unless it's right before something is to be written to that page. My idea of what compacting meant doesn't match up with the general usage I'm seeing in the SSD technical papers, so in this respect, I'm wrong: compacting does occur, but only when there are no fully erased (or unused) pages available. Thus, compacting is done in the context of a write operation. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Thu, Dec 31 at 16:53, David Magda wrote: Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say no really, I'm talking about the /actual/ LBA 123456. What, exactly, is the /actual/ LBA 123456 on a modern SSD? --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Thu, Dec 31 at 10:18, Bob Friesenhahn wrote: There are of course SSDs with hardly any (or no) reserve space, but while we might be willing to sacrifice an image or two to SSD block failure in our digital camera, that is just not acceptable for serious computer use. Some people are doing serious computing on devices with 6-7% reserve. Devices with less enforced reserve will be significantly cheaper per exposed gigabyte, independent of all other factors, and always give the user the flexibility to increase their effective reserve by destroking the working area a little or a lot. If someone just needs blazing fast read access and isn't expecting to put more than a few cycles/day on their devices, small reserve MLC drives may be very cost effective and just as fast as their 20-30% reserve SLC counterparts. -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
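Eric's destroking argument is simple arithmetic; here is a quick sketch with made-up capacities (the function name and the 128/120/100 GB figures are mine for illustration, not from any vendor):

```python
def effective_reserve(raw_gb, exposed_gb):
    """Fraction of raw NAND held back as reserve when only exposed_gb of a
    raw_gb device is actually exposed for user data. Destroking (exposing
    less of the drive) grows the reserve without any firmware change."""
    return (raw_gb - exposed_gb) / raw_gb

# A hypothetical 128 GB (raw) drive sold with a small enforced reserve:
print(effective_reserve(128, 120))  # 0.0625  -> ~6% reserve
# The same drive destroked to 100 GB of working area:
print(effective_reserve(128, 100))  # 0.21875 -> ~22% reserve
```

The second case gives the user SLC-class reserve headroom out of a cheap MLC part, which is exactly the flexibility being described.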
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 22.53, David Magda wrote: On Dec 31, 2009, at 13:44, Joerg Schilling wrote: ZFS is COW, but does the SSD know which block is in use and which is not? If the SSD did know whether a block is in use, it could erase unused blocks in advance. But what is an unused block on a filesystem that supports snapshots? Snapshots make no difference - when you delete the last dataset/snapshot that references a file you also delete the data. Snapshots are a way to keep more files around; they are not really a way to keep the disk entirely full or anything like that. There is obviously no problem distinguishing between used and unused blocks, and ZFS (or btrfs or similar) makes no difference here. Personally, I think that at some point in the future there will need to be a command telling SSDs that the file system will take care of handling blocks, as new FS designs will be COW. ZFS is the first mainstream one to do it, but Btrfs is there as well, and it looks like Apple will be making its own FS. That could be an idea, but there still will be holes after deleted files that need to be reclaimed. Do you mean it would be a major win to have the file system take care of the space reclaiming instead of the drive? Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say no really, I'm talking about the /actual/ LBA 123456. A typical flash page size is 512 KB. You probably don't want to use all the physical pages, since those could be worn out or bad, so those need to be remapped (or otherwise avoided) at some level anyway. These days, typically disks do the remapping without the host computer knowing (both SSDs and rotating rust). I see the possible win that you could always use all the working blocks on the disk, and when blocks go bad your disk will shrink. I am not sure that is really what people expect, though. 
Apart from that, I am not sure what the gain would be. Could you elaborate on why this would be called for? /ragge ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Jan 1, 2010, at 03:30, Eric D. Mudama wrote: On Thu, Dec 31 at 16:53, David Magda wrote: Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say no really, I'm talking about the /actual/ LBA 123456. What, exactly, is the /actual/ LBA 123456 on a modern SSD? It doesn't exist currently because of the behind-the-scenes re-mapping that's being done by the SSD's firmware. While arbitrary to some extent, an actual LBA would presumably be the number of a particular cell in the SSD. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote: I see the possible win that you could always use all the working blocks on the disk, and when blocks go bad your disk will shrink. I am not sure that is really what people expect, though. Apart from that, I am not sure what the gain would be. Could you elaborate on why this would be called for? Currently you have SSDs that look like disks, but under certain circumstances the OS / FS knows that it isn't rotating rust--in which case the TRIM command is then used by the OS to help the SSD's allocation algorithm(s). If the file system is COW, and knows about SSDs via TRIM, why not just skip the middle-man and tell the SSD I'll take care of managing blocks. In the ZFS case, I think it's a logical extension of how RAID is handled: ZFS's system is much more helpful in most cases than hardware- / firmware-based RAID, so it's generally best just to expose the underlying hardware to ZFS. In the same way, ZFS already does COW, so why bother with the SSD's firmware doing it when giving extra knowledge to ZFS could be more useful? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On 1 jan 2010, at 14.14, David Magda wrote: On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote: I see the possible win that you could always use all the working blocks on the disk, and when blocks go bad your disk will shrink. I am not sure that is really what people expect, though. Apart from that, I am not sure what the gain would be. Could you elaborate on why this would be called for? Currently you have SSDs that look like disks, but under certain circumstances the OS / FS knows that it isn't rotating rust--in which case the TRIM command is then used by the OS to help the SSD's allocation algorithm(s). (Note that TRIM and equivalents are not only useful on SSDs, but on other storage too, such as when using sparse/thin storage.) If the file system is COW, and knows about SSDs via TRIM, why not just skip the middle-man and tell the SSD I'll take care of managing blocks. In the ZFS case, I think it's a logical extension of how RAID is handled: ZFS's system is much more helpful in most cases than hardware- / firmware-based RAID, so it's generally best just to expose the underlying hardware to ZFS. In the same way, ZFS already does COW, so why bother with the SSD's firmware doing it when giving extra knowledge to ZFS could be more useful? But that would only move the hardware specific and dependent flash chip handling code into the file system code, wouldn't it? What is won with that? As long as the flash chips have larger pages than the file system blocks, someone will have to shuffle around blocks to reclaim space, so why not let the one thing that knows the hardware and also is very close to the hardware do it? And if this is good for SSDs, why isn't it as good for rotating rust? /ragge s ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote: But that would only move the hardware specific and dependent flash chip handling code into the file system code, wouldn't it? What is won with that? As long as the flash chips have larger pages than the file system blocks, someone will have to shuffle around blocks to reclaim space, why not let the one thing that knows the hardware and also is very close to the hardware do it? And if this is good for SSDs, why isn't it as good for rotating rust? I don't really see how things are either hardware specific or dependent. COW is COW. Am I missing something? It's done by code somewhere in the stack; if the FS knows about it, it can lay things out in sequential writes. If we're talking about 512 KB blocks, ZFS in particular would create four 128 KB txgs--and 128 KB is simply the currently #define'd size, which can be changed in the future. One thing you gain is perhaps not requiring as much of a reserve. At most you have some hidden bad block re-mapping, similar to rotating rust nowadays. If you're shuffling blocks around, you're doing a read-modify-write, which, if done in the file system, could be used as a mechanism to defrag on-the-fly or to group many small files together. Not quite sure what you mean by your last question. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote: Flash SSDs actually always remap new writes into an only-append-to-new-pages style, pretty much as ZFS does itself. So for a SSD there is no big difference between ZFS and filesystems like UFS, NTFS, HFS+ et al; on the flash level they all work the same. The reason is that there is no way for it to rewrite single disk blocks, it can only fill up already erased pages of 512K (for example). When the old blocks get mixed with unused blocks (because of block rewrites, TRIM or Write Same/UNMAP), it needs to compact the data by copying all active blocks from those pages into previously erased pages, and write the active data there, compacted/contiguous. (When this happens, things tend to get really slow.) However, the quantity of small, overwritten pages is vastly different. I am not convinced that a workload that generates few overwrites will be penalized as much as a workload that generates a large number of overwrites. I think most folks here will welcome good, empirical studies, but thus far the only one I've found is from STEC, and their disks behave very well after they've been filled and subjected to a rewrite workload. You get what you pay for. Additional pointers are always appreciated :-) http://www.stec-inc.com/ssd/videos/ssdvideo1.php -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
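The compaction step described above (pick a page with freed blocks, copy the live blocks forward, erase the page) can be sketched as a toy simulation. The 8-block page and the greedy victim-selection policy are invented for illustration; real controllers keep their policies secret:

```python
# Sketch of flash garbage collection: pages hold a fixed number of blocks,
# each block either live (True) or dead (False, freed by rewrite or TRIM).
# To reclaim space, the collector picks the page with the fewest live
# blocks, copies its live blocks into a previously erased page, and erases
# the victim page.
PAGE_BLOCKS = 8  # hypothetical: 8 blocks per erase page

def collect(pages):
    """pages: list of pages, each a list of PAGE_BLOCKS booleans.
    Returns (migrated_live_blocks, reclaimed_dead_blocks) for the victim."""
    victim = min(pages, key=sum)          # cheapest page to evacuate
    live = sum(victim)
    # The live blocks are rewritten elsewhere (costing writes), then the
    # whole page is erased, recovering all of its dead blocks.
    return live, PAGE_BLOCKS - live

pages = [
    [True] * 8,                                 # fully live: a poor victim
    [True, False, False, True] + [False] * 4,   # mostly dead: a good victim
]
moved, freed = collect(pages)
print(moved, freed)  # 2 6
```

The "things tend to get really slow" observation corresponds to `moved` being large relative to `freed`: when every candidate page is mostly live, each reclaimed block costs several extra writes.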
Re: [zfs-discuss] Thin device support in ZFS?
On Fri, 1 Jan 2010, David Magda wrote: It doesn't exist currently because of the behind-the-scenes re-mapping that's being done by the SSD's firmware. While arbitrary to some extent, an actual LBA would presumably be the number of a particular cell in the SSD. There seems to be some severe misunderstanding of what a SSD is. This severe misunderstanding leads one to assume that a SSD has a native blocksize. SSDs (as used in computer drives) are comprised of many tens of FLASH memory chips which can be laid out and mapped in whatever fashion the designers choose. They could be mapped sequentially, in parallel, a combination of the two, or perhaps even change behavior depending on use. Individual FLASH devices usually have a much smaller page size than 4K. A 4K write would likely be striped across several/many FLASH devices. The construction of any given SSD is typically a closely-held trade secret and the vendor will not reveal how it is designed. You would have to chip away the epoxy yourself and reverse-engineer in order to gain some understanding of how a given SSD operates, and even then it would be mostly guesswork. It would be wrong for anyone here, including someone who has participated in the design of an SSD, to claim that they know how a SSD will behave unless they have access to the design of that particular SSD. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Fri, Jan 1, 2010 at 11:17 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Fri, 1 Jan 2010, David Magda wrote: It doesn't exist currently because of the behind-the-scenes re-mapping that's being done by the SSD's firmware. While arbitrary to some extent, an actual LBA would presumably be the number of a particular cell in the SSD. There seems to be some severe misunderstanding of what a SSD is. This severe misunderstanding leads one to assume that a SSD has a native blocksize. SSDs (as used in computer drives) are comprised of many tens of FLASH memory chips which can be laid out and mapped in whatever fashion the designers choose. They could be mapped sequentially, in parallel, a combination of the two, or perhaps even change behavior depending on use. Individual FLASH devices usually have a much smaller page size than 4K. A 4K write would likely be striped across several/many FLASH devices. The construction of any given SSD is typically a closely-held trade secret and the vendor will not reveal how it is designed. You would have to chip away the epoxy yourself and reverse-engineer in order to gain some understanding of how a given SSD operates, and even then it would be mostly guesswork. It would be wrong for anyone here, including someone who has participated in the design of an SSD, to claim that they know how a SSD will behave unless they have access to the design of that particular SSD. The main issue is that most flash devices support 128k byte pages, and the smallest chunk (for want of a better word) of flash memory that can be written is a page - or 128 KB. So if you have a write to an SSD that only changes 1 byte in one 512 byte disk sector, the SSD controller has to either read/re-write the affected page or figure out how to update the flash memory with the minimum effect on flash wear. If one didn't have to worry about flash wear levelling, one could read/update/write the affected page all day long. 
And, to date, flash writes are much slower than flash reads - which is another basic property of the current generation of flash devices. For anyone who is interested in getting more details of the challenges with flash memory when used to build solid state drives, reading the tech data sheets on the flash memory devices will give you a feel for the basic issues that must be solved. Bob's point is well made. The specifics of a given SSD implementation will make the performance characteristics of the resulting SSD very difficult to predict or even describe - especially as the device hardware and firmware continue to evolve. And some SSDs change the algorithms they implement on-the-fly - depending on the characteristics of the current workload and of the (inbound) data being written. There are some links to well written articles in the URL I posted earlier this morning: http://www.anandtech.com/storage/showdoc.aspx?i=3702 Regards, -- Al Hopper Logical Approach Inc,Plano,TX a...@logical-approach.com Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
Let me sum up my thoughts in this topic. To Richard [relling] : I agree with you this topic is even more confusing if we are not careful enough to specify exactly what we are talking about. Thin provisioning can be done at multiple layers, and though you said you like it to be closer to the app than closer to the dumb disks (if you were referring to SAN), my opinion is that each and every scenario has its own pros/cons. I learned a long time ago not to declare a technology good/bad; there are technologies which are used properly (usually declared as good tech) and others which are not (usually declared as bad). -- Let me clarify my case, and why I mentioned thin devices on SAN specifically. Many people replied with the thin device support of ZFS (which is called sparse volumes if I'm correct), but what I was talking about is something else. It's thin device awareness on the SAN. In this case you configure your LUN in the SAN as a thin device, a virtual LUN which is backed by a pool of physical disks in the SAN. From the OS it's transparent, and so it is from the Volume Manager/Filesystem point of view. That is the basic definition of my scenario with thin devices on SAN. High-end SAN frames like HDS USP-V (feature called Hitachi Dynamic Provisioning) and EMC Symmetrix V-Max (feature called Virtual Provisioning) support this (and I'm sure many others as well). Once you discover the LUN in the OS, you start to use it: put it under the Volume Manager, create a filesystem, copy files, but the SAN only allocates physical blocks (more precisely, groups of blocks called extents) as you write them, which means you'll use only as much (or a bit more, rounded to the next extent) on the physical disk as you use in reality. From this standpoint we can define two terms, thin-friendly and thin-hostile environments. 
Thin-friendly would be any environment where the OS/VM/FS doesn't write to blocks it doesn't really use (for example, during initialization it doesn't fill up the LUN with a pattern or 0s). That's why Veritas' SmartMove is a nice feature: when you move from fat to thin devices (from the OS both LUNs look exactly the same), it will copy only the blocks which are used by the VxFS files. That is still the basics of having thin devices on SAN, and hoping to have a thin-friendly environment. The next level of this is the management of the thin devices and the physical pool where thin devices allocate their extents from. Even if you get migrated to thin device LUNs, your thin devices will become fat again: if you fill up your filesystem even once, the thin device on the SAN will remain fat, as no space reclamation happens by default. The reason is pretty simple: the SAN storage has no knowledge of the filesystem structure, and as such it can't decide whether a block should be released back to the pool or whether it's really still in use. Then came Veritas with this brilliant idea of building a bridge between the FS and the SAN frame (this became the Thin Reclamation API), so they can communicate which blocks are indeed not in use. I really would like you to read this Quick Note from Veritas about this feature; it will explain the concept far better than I did : http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf Btw, in this concept VxVM can even detect (via ASL) whether a LUN is thin device/thin device reclamation capable or not. Honestly I have mixed feelings about ZFS. I feel that this is obviously the future's VM/Filesystem, but at the same time I realize that the roles of the individual parts in the big picture are getting mixed up. Am I the only one with the impression that ZFS sooner or later will evolve to a SAN OS, and the zfs, zpool commands will only become some lightweight interfaces to control the SAN frame? 
:-) (like Solution Enabler for EMC) If you ask me, the pool concept always works more efficiently if #1 you have more capacity in the pool and #2 you have more systems sharing the pool; that's why I see the thin device pool as more rational in a SAN frame. Anyway, I'm sorry if you were already aware of what I explained above. I also hope I didn't offend anyone with my views, Regards, sendai -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
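For what it's worth, the allocate-on-write and reclamation behaviour described above can be sketched in a few lines. The extent size, class name, and `reclaim` interface here are all invented for illustration; they are not the API of HDP, V-Max, or the Veritas Thin Reclamation API:

```python
class ThinLun:
    """Toy thin-provisioned LUN: a physical extent is allocated from the
    shared pool only when an LBA inside it is first written. reclaim()
    models the FS-to-array reclamation path: an extent is released back
    to the pool only when every LBA in it is reported unused."""
    def __init__(self, extent_lbas=16):
        self.extent_lbas = extent_lbas
        self.allocated = set()   # extent numbers backed by physical disk

    def write(self, lba):
        # Allocate-on-write: back the containing extent if not yet backed.
        self.allocated.add(lba // self.extent_lbas)

    def reclaim(self, free_lbas):
        # The file system reports a set of LBAs it no longer uses.
        free = set(free_lbas)
        for ext in list(self.allocated):
            lbas = range(ext * self.extent_lbas, (ext + 1) * self.extent_lbas)
            if all(l in free for l in lbas):
                self.allocated.discard(ext)

lun = ThinLun()
for lba in (0, 1, 40):        # two writes land in extent 0, one in extent 2
    lun.write(lba)
print(sorted(lun.allocated))  # [0, 2]
lun.reclaim(range(32, 48))    # FS reports all of extent 2 as unused
print(sorted(lun.allocated))  # [0]
```

Without the `reclaim` call the LUN stays "fat" forever after one full fill, which is exactly the timebomb described in the post.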
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 06.01, Richard Elling wrote: On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote: On 30 dec 2009, at 22.45, Richard Elling wrote: On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote: Richard, That's an interesting question, if it's worth it or not. I guess the question is always who are the targets for ZFS (I assume everyone, though in reality priorities have to be set as developer resources are limited). For a home office, no doubt thin provisioning is not much of a use; for an enterprise company the numbers might really make a difference if we look at the space used vs space allocated. There are some studies showing that thin provisioning can reduce physical space used by up to 30%, which is huge. (Even though I understand studies are not real life and thin provisioning is not viable in every environment.) Btw, I would like to discuss scenarios where, though we have an over-subscribed pool in the SAN (meaning the overall space allocated to the systems is more than the physical space in the pool), with proper monitoring and proactive physical drive adds we won't let any systems/applications attached to the SAN realize that we have thin devices. Actually that's why I believe configuring thin devices without periodically reclaiming space is just a timebomb, though if you have the option to periodically reclaim space, you can maintain the pool in the SAN in a really efficient way. That's why I found Veritas' Thin Reclamation API a milestone in the thin device field. Anyway, only the future can tell if thin provisioning will or won't be a major feature in the storage world, though as Veritas has already added this feature I was wondering if ZFS has it at least on its roadmap. Thin provisioning is absolutely, positively a wonderful, good thing! The question is, how does the industry handle the multitude of thin provisioning models, each layered on top of another? For example, here at the ranch I use VMWare and Xen, which thinly provision virtual disks. 
I do this over iSCSI to a server running ZFS which thinly provisions the iSCSI target. If I had a virtual RAID array, I would probably use that, too. Personally, I think being thinner closer to the application wins over being thinner closer to dumb storage devices (disk drives). I don't get it - why do we need anything more magic (or complicated) than support for TRIM from the filesystems and the storage systems? TRIM is just one part of the problem (or solution, depending on your point of view). The TRIM command is part of the ATA (T13) standards and allows a host to tell a block device that the data in a set of blocks is no longer of any value, and the block device can destroy the data without adverse consequence. In a world with copy-on-write and without snapshots, it is obvious that there will be a lot of blocks running around that are no longer in use. Snapshots (and their clones) change that use case. So in a world of snapshots, there will be fewer blocks which are not used. Remember, the TRIM command is very important to OSes like Windows or OSX which do not have file systems that are copy-on-write or have decent snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use snapshots. I don't believe that there is such a big difference between those cases. Sure, snapshots may keep more data on disk, but only as much as the user chooses to keep. There have been other ways to keep old data on disk before (RCS, Solaris patch backout blurbs, logs, caches, what have you), so there is not really a brand new world there. (BTW, once upon a time, real operating systems had (optional) file versioning built into the operating system or file system itself.) If there was a mechanism that always tended to keep all of the disk full, that would be another case. 
Snapshots may do that with the autosnapshot and warn-and-clean-when-getting-full features of OpenSolaris, but servers especially will probably not be managed that way; they will probably have a much more controlled snapshot policy. (Especially if you want to save every possible bit of disk space, as those guys with the big fantastic and ridiculously expensive storage systems always want to do - maybe that will change in the future though.) That said, adding TRIM support is not hard in ZFS. But it depends on lower level drivers to pass the TRIM commands down the stack. These ducks are lining up now. Good. I don't see why TRIM would be hard to implement for ZFS either, except that you may want to keep data from a few txgs back just for safety, which would probably call for some two-stage freeing of data blocks (those free blocks that are to be TRIMmed, and those that already are). Once a block is freed in ZFS, ZFS no longer needs it. So the problem of TRIM in ZFS is not related to the recent txg commit history. It may be that you want to save a few txgs back, so if you get a failure where
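The two-stage freeing suggested above might look roughly like this. The deferral depth, class name, and interface are arbitrary choices for the sketch, not how ZFS actually implements anything:

```python
from collections import deque

class DeferredTrim:
    """Toy two-stage free list: blocks freed in a txg are held for
    defer_txgs further commits before a TRIM is issued for them, so
    rolling back to a recent txg can never land on a block the device
    has already erased."""
    def __init__(self, defer_txgs=3):
        self.defer_txgs = defer_txgs
        self.pending = deque()   # one list of freed blocks per committed txg

    def commit(self, freed_blocks):
        """Commit a txg; returns the blocks now old enough to TRIM."""
        self.pending.append(list(freed_blocks))
        if len(self.pending) > self.defer_txgs:
            return self.pending.popleft()
        return []

dt = DeferredTrim(defer_txgs=2)
print(dt.commit([1, 2]))  # []      still held
print(dt.commit([3]))     # []      still held
print(dt.commit([4]))     # [1, 2]  freed two txgs ago, now safe to TRIM
```

The only cost of the scheme is that the device learns about freed space a few txgs late, which for TRIM purposes hardly matters.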
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 00.31, Bob Friesenhahn wrote: On Wed, 30 Dec 2009, Mike Gerdts wrote: Should the block size be a tunable, to match the page size of SSDs (typically 4K, right?) and of upcoming hard disks that sport a sector size larger than 512 bytes? Enterprise SSDs are still in their infancy. The actual page size of an SSD could be almost anything. Due to lack of seek time concerns and the high cost of erasing a page, a SSD could be designed with a level of indirection so that multiple logical writes to disjoint offsets could be combined into a single SSD physical page. Likewise, a large logical block could be subdivided into multiple SSD pages, which are allocated on demand. Logic is cheap and SSDs are full of logic, so it seems reasonable that future SSDs will do this, if not already, since similar logic enables wear-leveling. I believe that almost all flash devices are already doing this, and only the first generation SD cards or something like that are not doing it and leaving it to the host. But I could be wrong of course. /ragge s ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Thu, 31 Dec 2009, Ragnar Sundblad wrote: Also, currently, when the SSDs for some very strange reason are constructed from flash chips designed for firmware and slowly changing configuration data and can only erase in very large chunks, TRIMing is good for the housekeeping in the SSD drive. A typical use case for this would be a laptop. I have heard quite a few times that TRIM is good for SSD drives but I don't see much actual use for it. Every responsible SSD drive maintains a reserve of unused space (20-50%) since it is needed for wear leveling and to repair failing spots. This means that even when a SSD is 100% full it still has considerable space remaining. A very simple SSD design solution is that when a SSD block is overwritten it is replaced with an already-erased block from the free pool and the old block is submitted to the free pool for eventual erasure and re-use. This approach avoids adding erase times to the write latency as long as the device can erase as fast as the average data write rate. There are of course SSDs with hardly any (or no) reserve space, but while we might be willing to sacrifice an image or two to SSD block failure in our digital camera, that is just not acceptable for serious computer use. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
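Bob's "very simple SSD design solution" can be sketched as a toy remapping layer (all names and sizes hypothetical). The point is that the write path never waits on an erase as long as background erasure keeps the free pool stocked:

```python
from collections import deque

class SimpleSsd:
    """Toy remapping layer: an overwritten LBA is redirected to a
    pre-erased physical block; the old block joins a dirty pool that a
    background task erases back into the free pool."""
    def __init__(self, reserve_blocks):
        self.free = deque(range(reserve_blocks))  # pre-erased blocks
        self.dirty = deque()                      # awaiting erasure
        self.map = {}                             # lba -> physical block

    def write(self, lba):
        new = self.free.popleft()     # no erase on the write path
        old = self.map.get(lba)
        if old is not None:
            self.dirty.append(old)    # erased later, in the background
        self.map[lba] = new

    def background_erase(self):
        # Models the idle-time erasure that refills the free pool.
        while self.dirty:
            self.free.append(self.dirty.popleft())

ssd = SimpleSsd(reserve_blocks=4)
ssd.write(7)
ssd.write(7)                          # overwrite: remapped, old block dirtied
print(len(ssd.free), len(ssd.dirty))  # 2 1
ssd.background_erase()
print(len(ssd.free), len(ssd.dirty))  # 3 0
```

If writes arrive faster than `background_erase` can run, `free` empties and writes stall on erasure, which is the throughput cliff discussed elsewhere in the thread.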
Re: [zfs-discuss] Thin device support in ZFS?
Just an update : Finally I found some technical details about this Thin Reclamation API : (http://blogs.hds.com/claus/2009/12/i-love-it-when-a-plan-comes-together.html) This week, (December 7th), Symantec announced their “completing the thin provisioning ecosystem” that includes the necessary API calls for the file system to “notify” the storage array when space is “deleted”. The interface is a previously disused and now revised/reused/repurposed SCSI command (called Write Same) which was jointly worked out with Symantec, Hitachi, and 3PAR. This command allows the file systems (in this case Veritas VxFS) to notify the storage systems that space is no longer occupied. How cool is that! There is also a subcommittee of INCITS T10 studying the standardization of this, and SNIA is studying it as well. It won’t be long before most file systems, databases, and storage vendors adopt this technology. So it's based on the SCSI Write Same/UNMAP command (and, if I understand correctly, SATA TRIM is similar to this from the FS point of view), a standard which is not yet ratified. Also, happy new year to everyone! Regards, sendai -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 17.18, Bob Friesenhahn wrote: On Thu, 31 Dec 2009, Ragnar Sundblad wrote: Also, currently, when the SSDs for some very strange reason are constructed from flash chips designed for firmware and slowly changing configuration data and can only erase in very large chunks, TRIMing is good for the housekeeping in the SSD drive. A typical use case for this would be a laptop. I have heard quite a few times that TRIM is good for SSD drives but I don't see much actual use for it. Every responsible SSD drive maintains a reserve of unused space (20-50%) since it is needed for wear leveling and to repair failing spots. This means that even when a SSD is 100% full it still has considerable space remaining. (At least as long as those blocks aren't used up in place of bad/worn out blocks...) A very simple SSD design solution is that when a SSD block is overwritten it is replaced with an already-erased block from the free pool and the old block is submitted to the free pool for eventual erasure and re-use. This approach avoids adding erase times to the write latency as long as the device can erase as fast as the average data write rate. This is what they do, as far as I have understood, but more free space to play with makes the job easier and therefore faster, and gives you a larger burst headroom before you hit the erase-speed limit of the disk. There are of course SSDs with hardly any (or no) reserve space, but while we might be willing to sacrifice an image or two to SSD block failure in our digital camera, that is just not acceptable for serious computer use. I think the idea is that with TRIM you can also use the file system's unused space for wear leveling and flash block filling. If your disk is completely full there is of course no gain. /ragge s ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Dec 31, 2009, at 1:43 AM, Andras Spitzer wrote: Let me sum up my thoughts on this topic. To Richard [Relling]: I agree with you that this topic is even more confusing if we are not careful enough to specify exactly what we are talking about. Thin provisioning can be done at multiple layers, and though you said you like it to be closer to the app than closer to the dumb disks (if you were referring to SAN), my opinion is that each and every scenario has its own pros/cons. I learned a long time ago not to declare a technology good/bad; there are technologies which are used properly (usually declared as good tech) and others which are not (usually declared as bad). I hear you. But you are trapped thinking about 20th century designs and ZFS is a 21st century design. More below... Let me clarify my case, and why I mentioned thin devices on SAN specifically. Many people replied with the thin device support of ZFS (which is called sparse volumes if I'm correct), but what I was talking about is something else. It's thin device awareness on the SAN. In this case you configure your LUN in the SAN as a thin device, a virtual LUN (or LUNs) which is backed by a pool of physical disks in the SAN. From the OS it's transparent, and so it is from the volume manager/filesystem point of view. That is the basic definition of my scenario with thin devices on SAN. High-end SAN frames like HDS USP-V (feature called Hitachi Dynamic Provisioning) and EMC Symmetrix V-Max (feature called Virtual Provisioning) support this (and I'm sure many others as well). Once you discover the LUN in the OS, you start to use it: put it under a volume manager, create a filesystem, copy files; but the SAN only allocates physical blocks (more precisely, groups of blocks called extents) as you write to them, which means you'll use only as much space on the physical disks as you use in reality (or a bit more, rounded up to the next extent). From this standpoint we can define two terms: thin-friendly and thin-hostile environments.
Thin-friendly would be any environment where the OS/VM/FS doesn't write to blocks it doesn't really use (for example, during initialization it doesn't fill up the LUN with a pattern or 0s). That's why Veritas' SmartMove is a nice feature: when you move from fat to thin devices (from the OS both LUNs look exactly the same), it will copy only the blocks which are used by the VxFS files. ZFS does this by design. There is no way in ZFS to not do this. I suppose it could be touted as a feature :-) Maybe we should brand ZFS as THINbyDESIGN(TM) Or perhaps we can rebrand SMARTMOVE(TM) as TRYINGTOCATCHUPWITHZFS(TM) :-) That is still the basics of having thin devices on SAN, and hoping to have a thin-friendly environment. The next level of this is the management of the thin devices and the physical pool where the thin devices allocate their extents from. Even if you get migrated to thin device LUNs, your thin devices will become fat again: if you fill up your filesystem even once, the thin device on the SAN will remain fat, as no space reclamation happens by default. The reason is pretty simple: the SAN storage has no knowledge of the filesystem structure, so it can't decide whether a block should be released back to the pool or is really still in use. Then came Veritas with this brilliant idea of building a bridge between the FS and the SAN frame (this became the Thin Reclamation API), so they can communicate which blocks are indeed not in use. I really would like you to read this Quick Note from Veritas about this feature; it explains the concept far better than I can: http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf Btw, in this concept VxVM can even detect (via ASL) whether a LUN is thin device/thin device reclamation capable or not. Correct. Since VxVM and VxFS are separate software, they have expanded the interface between them. Consider adding a mirror or replacing a drive.
Prior to SMARTMOVE, VxVM had no idea what part of the volume was data and what was unused. So VxVM would silver the mirror by copying all of the blocks from one side to the other. Clearly this is uncool when your SAN storage is virtualized. With SMARTMOVE, VxFS has a method to tell VxVM that portions of the volume are unused. Now when you silver the mirror, VxVM knows that some bits are unused and it won't bother to copy them. This is a bona fide good thing for virtualized SAN arrays. ZFS was designed with the knowledge that the limited interface between file systems and volume managers was a severe limitation that leads to all sorts of complexity and angst. So a different design is needed. ZFS has fully integrated RAID with the file system, so there is no need, by design, to create a new interface between these layers. In other words, the only way to silver a disk in ZFS is to silver the data. You can't silver unused space.
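The difference Richard describes can be sketched in a few lines. This is a hypothetical model, not VxVM or ZFS code: a blind volume-level silver copies every block, while a filesystem-aware silver copies only the blocks the filesystem reports as in use:

```python
# Sketch: silvering a mirror with and without filesystem awareness.
# The in-use bitmap is made-up example data.

def silver_blind(volume_blocks):
    # Volume manager has no idea what is data: copy everything.
    return list(range(volume_blocks))

def silver_aware(volume_blocks, in_use):
    # Filesystem tells the volume manager which blocks hold live data.
    return [b for b in range(volume_blocks) if in_use[b]]

in_use = [True, False, False, True, False, True, False, False]
print(len(silver_blind(8)))          # 8 blocks copied
print(len(silver_aware(8, in_use)))  # 3 blocks copied
```

On a thin-provisioned LUN the blind version is doubly bad: copying the unused blocks also forces the array to allocate physical extents for them.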
Re: [zfs-discuss] Thin device support in ZFS?
[I TRIMmed the thread a bit ;-)] On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote: On 31 dec 2009, at 06.01, Richard Elling wrote: In a world with copy-on-write and without snapshots, it is obvious that there will be a lot of blocks running around that are no longer in use. Snapshots (and their clones) change that use case. So in a world of snapshots, there will be fewer blocks which are not used. Remember, the TRIM command is very important to OSes like Windows or OSX which do not have file systems that are copy-on-write or have decent snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use snapshots. I don't believe that there is such a big difference between those cases. The reason you want TRIM for SSDs is to recover the write speed. A freshly cleaned page can be written faster than a dirty page. But in COW, you are writing to new pages and not rewriting old pages. This is fundamentally different from FAT, NTFS, or HFS+, but it is those markets which are driving TRIM adoption. [TRIMmed] Once a block is freed in ZFS, ZFS no longer needs it. So the problem of TRIM in ZFS is not related to the recent txg commit history. It may be that you want to save a few txgs back, so if you get a failure where parts of the last txg get lost, you will still be able to get an old (few seconds/minutes) version of your data back. This is already implemented. Blocks freed in the past few txgs are not returned to the freelist immediately. This was needed to enable uberblock recovery in b128. So TRIMming from the freelist is safe. This could happen if the sync commands aren't correctly implemented all the way (as we have seen some stories about on this list). Maybe someone disabled syncing somewhere to improve performance. It could also happen if a non-volatile caching device, such as a storage controller, breaks in some bad way. Or maybe you just had a bad/old battery/supercap in a device that implements NV storage with batteries/supercaps.
The issue is that traversing the free block list has to be protected by locks, so that the file system does not allocate a block while it is also TRIMming the block. Not so difficult, as long as the TRIM occurs relatively quickly. I think that any TRIM implementation should be an administration command, like scrub. It probably doesn't make sense to have it running all of the time. But on occasion, it might make sense. I am not sure why it shouldn't run at all times, except for the fact that it seems to be badly implemented in some SATA devices with high latencies, so that it will interrupt any data streaming to/from the disks. I don't see how it would not have negative performance impacts. -- richard
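The deferred-free behavior Richard mentions — blocks freed in the last few txgs are held back so older uberblocks stay recoverable, and only aged-out blocks become TRIM candidates — might be sketched like this. The window size and names are assumptions for illustration, not ZFS internals:

```python
from collections import deque

# Illustrative deferral window; not the actual ZFS constant.
DEFER_TXGS = 3

class DeferredFreeList:
    """Blocks freed recently are quarantined; only aged blocks are
    safe to TRIM or reallocate."""
    def __init__(self):
        self.pending = deque()   # (txg_freed, block), oldest first
        self.trimmable = []      # aged out of the recovery window

    def free(self, txg, block):
        self.pending.append((txg, block))

    def sync(self, current_txg):
        # Promote blocks whose freeing txg is old enough.
        while self.pending and current_txg - self.pending[0][0] >= DEFER_TXGS:
            self.trimmable.append(self.pending.popleft()[1])

fl = DeferredFreeList()
fl.free(100, "blkA")
fl.free(101, "blkB")
fl.sync(102)
print(fl.trimmable)   # [] -- both still inside the recovery window
fl.sync(104)
print(fl.trimmable)   # ['blkA', 'blkB'] -- now safe to TRIM
```

The locking concern then reduces to serializing `sync`/TRIM against the allocator's use of `trimmable`, since `pending` blocks are never handed out.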
Re: [zfs-discuss] Thin device support in ZFS?
Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: I have heard quite a few times that TRIM is good for SSD drives but I don't see much actual use for it. Every responsible SSD drive maintains a reserve of unused space (20-50%) since it is needed for wear leveling and to repair failing spots. This means that even when a SSD is 100% full it still has considerable space remaining. A very simple SSD design solution is that when a SSD block is overwritten it is replaced with an already-erased block from the free pool and the old block is submitted to the free pool for eventual erasure and re-use. This approach avoids adding erase times to the write latency as long as the device can erase as fast as the average data write rate. The question in the case of SSDs is: ZFS is COW, but does the SSD know which block is in use and which is not? If the SSD knew whether a block is in use, it could erase unused blocks in advance. But what is an unused block on a filesystem that supports snapshots? From the perspective of the SSD, I see only the following difference between a COW filesystem and a conventional filesystem: a conventional filesystem may write more often to the same block number than a COW filesystem does. But even in the non-COW case, I would expect that the SSD frequently remaps overwritten blocks to previously erased spares. My conclusion is that ZFS on a SSD works fine in the case that the primarily used blocks plus all active snapshots use less space than the official size minus the spare reserve of the SSD. If you however fill up the medium, I expect a performance degradation. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] Thin device support in ZFS?
Richard Elling richard.ell...@gmail.com wrote: The reason you want TRIM for SSDs is to recover the write speed. A freshly cleaned page can be written faster than a dirty page. But in COW, you are writing to new pages and not rewriting old pages. This is fundamentally different from FAT, NTFS, or HFS+, but it is those markets which are driving TRIM adoption. Your mistake is to assume a maiden SSD and not to think about what happens after the SSD has been in use for a while. Even in the COW case, blocks are reused after some time, and the disk has no way to know in advance which blocks are still in use and which blocks are no longer used and may be prepared for being overwritten. Jörg
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 19.26, Richard Elling wrote: [I TRIMmed the thread a bit ;-)] On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote: On 31 dec 2009, at 06.01, Richard Elling wrote: In a world with copy-on-write and without snapshots, it is obvious that there will be a lot of blocks running around that are no longer in use. Snapshots (and their clones) change that use case. So in a world of snapshots, there will be fewer blocks which are not used. Remember, the TRIM command is very important to OSes like Windows or OSX which do not have file systems that are copy-on-write or have decent snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use snapshots. I don't believe that there is such a big difference between those cases. The reason you want TRIM for SSDs is to recover the write speed. A freshly cleaned page can be written faster than a dirty page. But in COW, you are writing to new pages and not rewriting old pages. This is fundamentally different from FAT, NTFS, or HFS+, but it is those markets which are driving TRIM adoption. Flash SSDs actually always remap new writes in an append-only-to-new-pages style, pretty much as ZFS does itself. So for a SSD there is no big difference between ZFS and filesystems such as UFS, NTFS, HFS+ et al; at the flash level they all work the same. The reason is that there is no way for it to rewrite single disk blocks; it can only fill up already-erased pages of 512K (for example). When the old blocks get mixed with unused blocks (because of block rewrites, TRIM, or Write Same/UNMAP), it needs to compact the data by copying all active blocks from those pages into previously erased pages, writing the active data there compacted and contiguous. (When this happens, things tend to get really slow.) So TRIM is just as applicable to ZFS as to any other file system on a flash SSD; there is no real difference. [TRIMmed] Once a block is freed in ZFS, ZFS no longer needs it.
So the problem of TRIM in ZFS is not related to the recent txg commit history. It may be that you want to save a few txgs back, so if you get a failure where parts of the last txg get lost, you will still be able to get an old (few seconds/minutes) version of your data back. This is already implemented. Blocks freed in the past few txgs are not returned to the freelist immediately. This was needed to enable uberblock recovery in b128. So TRIMming from the freelist is safe. I see, very good! This could happen if the sync commands aren't correctly implemented all the way (as we have seen some stories about on this list). Maybe someone disabled syncing somewhere to improve performance. It could also happen if a non-volatile caching device, such as a storage controller, breaks in some bad way. Or maybe you just had a bad/old battery/supercap in a device that implements NV storage with batteries/supercaps. The issue is that traversing the free block list has to be protected by locks, so that the file system does not allocate a block while it is also TRIMming the block. Not so difficult, as long as the TRIM occurs relatively quickly. I think that any TRIM implementation should be an administration command, like scrub. It probably doesn't make sense to have it running all of the time. But on occasion, it might make sense. I am not sure why it shouldn't run at all times, except for the fact that it seems to be badly implemented in some SATA devices with high latencies, so that it will interrupt any data streaming to/from the disks. I don't see how it would not have negative performance impacts. It will, I am sure! But *if* the user for one reason or the other wants TRIM, it cannot be assumed that TRIMming big batches at certain times is any better than trimming small amounts all the time.
Both behaviors may be useful, but I find it hard to see a really good use case where you want batch trimming, while it is easy to see cases where continuous trimming could be useful and hopefully hardly noticeable thanks to the file system caching. /ragge s
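Ragnar's description of why mixed pages make things slow — the controller must copy the still-live blocks out of a page before it can erase the page — can be modeled as a small garbage-collection cost function. Page size and layout here are made up for illustration:

```python
# Toy flash garbage collection: a page can only be erased whole, so any
# live blocks in a partially dead page must first be rewritten elsewhere.
# The copied count is the write amplification of one GC pass.

def gc_cost(pages):
    """pages: list of pages, each a list of block states ('live'/'dead').
    Returns (blocks_copied, pages_reclaimed) for compacting every page
    that contains at least one dead block."""
    copied = reclaimed = 0
    for page in pages:
        if 'dead' in page:
            copied += page.count('live')   # live data must move first
            reclaimed += 1                 # then the page can be erased
    return copied, reclaimed

# Two half-dead pages force 4 live-block copies to reclaim 2 pages;
# the fully live page costs nothing.
pages = [['live', 'dead', 'live', 'dead'],
         ['dead', 'live', 'dead', 'live'],
         ['live', 'live', 'live', 'live']]
print(gc_cost(pages))  # (4, 2)
```

TRIM (continuous or batched) helps by marking blocks 'dead' earlier, so GC passes reclaim more space per block copied.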
Re: [zfs-discuss] Thin device support in ZFS?
On Dec 31, 2009, at 13:44, Joerg Schilling wrote: ZFS is COW, but does the SSD know which block is in use and which is not? If the SSD knew whether a block is in use, it could erase unused blocks in advance. But what is an unused block on a filesystem that supports snapshots? Personally, I think that at some point in the future there will need to be a command telling SSDs that the file system will take care of handling blocks, as new FS designs will be COW. ZFS is the first mainstream one to do it, but Btrfs is there as well, and it looks like Apple will be making its own FS. Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say "no really, I'm talking about the /actual/ LBA 123456".
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 19:23, roland devz...@web.de wrote: making transactional, logging filesystems thin-provisioning aware should be hard to do, as every new and every changed block is written to a new location. so what applies to zfs should also apply to btrfs or nilfs or similar filesystems. If that were a problem, it would be a problem for UFS when you write new files... ZFS knows what blocks are free, and that is all you need to send to the disk system.
Re: [zfs-discuss] Thin device support in ZFS?
making transactional, logging filesystems thin-provisioning aware should be hard to do, as every new and every changed block is written to a new location. so what applies to zfs should also apply to btrfs or nilfs or similar filesystems. i'm not sure if there is a good way to make zfs thin-provisioning aware/friendly - so you should wait for what a zfs developer has to tell about this. ZFS already supports thin provisioning, and has since pretty much the beginning (the earliest I've used it in is ZFSv6). I may get the terms backwards here, but if the Quota property is larger than the Reservation, then you have a thin-provisioned volume or filesystem. The Quota will set the disk size or available space that the OS sees, while the Reservation sets the currently usable space. As the OS uses space in the volume/fs and approaches the Reservation, you just increase the value. The total size that the OS sees doesn't change, but the actual amount of usable space does. This is especially useful for volumes that are exported via iSCSI.
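The quota/reservation scheme described above (with the poster's own caveat that the terms may be backwards) amounts to something like this toy model. The class and method names are invented for illustration and are not ZFS internals:

```python
# Toy model of the scheme described in the post: the Quota fixes the
# size the OS sees, the Reservation caps what is really usable, and the
# admin grows the Reservation as usage approaches it.

class ThinVolume:
    def __init__(self, quota_gb, reservation_gb):
        self.quota = quota_gb              # advertised device size, fixed
        self.reservation = reservation_gb  # currently usable space
        self.used = 0

    def write(self, gb):
        if self.used + gb > self.reservation:
            raise IOError("out of provisioned space: grow the reservation")
        self.used += gb

    def grow_reservation(self, gb):
        # Usable space grows, but never past the advertised size.
        self.reservation = min(self.quota, self.reservation + gb)

v = ThinVolume(quota_gb=100, reservation_gb=10)
v.write(8)
v.grow_reservation(10)   # admin reacts as usage nears the reservation
v.write(8)
print(v.quota, v.reservation, v.used)  # 100 20 16
```

The OS (e.g. an iSCSI initiator) always sees a 100 GB device; only the backing guarantee changes over time.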
Re: [zfs-discuss] Thin device support in ZFS?
On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote: Devzero, Unfortunately that was my assumption as well. I don't have source-level knowledge of ZFS, though based on what I know there wouldn't be an easy way to do it. I'm not even sure it's only a technical question; it may be a design question, which would make it even less feasible. It is not hard, because ZFS knows the current free list, so walking that list and telling the storage about the freed blocks isn't very hard. What is hard is figuring out if this would actually improve life. The reason I say this is because people like to use snapshots and clones on ZFS. If you keep snapshots, then you aren't freeing blocks, so the free list doesn't grow. This is a very different use case from UFS, as an example. There are a few minor bumps in the road. The ATA PASSTHROUGH command, which allows TRIM to pass through the SATA drivers, was just integrated into b130. This will be more important to small servers than SANs, but the point is that all parts of the software stack need to support the effort. As such, it is not clear to me who, if anyone, inside Sun is the champion for the effort -- it crosses multiple organizational boundaries. Apart from the technical possibilities, this feature looks really inevitable to me in the long run, especially for enterprise customers with high-end SAN, as cost is always a major factor in a storage design and it's a huge difference if you have to pay based on space used vs. space allocated (for example). If the high cost of SAN storage is the problem, then I think there are better ways to solve that :-) -- richard
Re: [zfs-discuss] Thin device support in ZFS?
On 12/30/2009 2:40 PM, Richard Elling wrote: There are a few minor bumps in the road. The ATA PASSTHROUGH command, which allows TRIM to pass through the SATA drivers, was just integrated into b130. This will be more important to small servers than SANs, but the point is that all parts of the software stack need to support the effort. As such, it is not clear to me who, if anyone, inside Sun is the champion for the effort -- it crosses multiple organizational boundaries. I'd think it more important for devices where this is an issue, namely SSDs, than it is for spinning rust, though use of the TRIM command, or something like it, would fix a lot of the issues I've seen with thin provisioning over the last six years or so. However, I'm not sure it's going to be much of an impact until you can get the entire stack - application to device - rewired to work with the concept behind it. One of the biggest issues I've seen with thin provisioning is how the applications work, and you can't fix that in the file system code.
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling richard.ell...@gmail.com wrote: On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote: Devzero, Unfortunately that was my assumption as well. I don't have source-level knowledge of ZFS, though based on what I know there wouldn't be an easy way to do it. I'm not even sure it's only a technical question, but a design question, which would make it even less feasible. It is not hard, because ZFS knows the current free list, so walking that list and telling the storage about the freed blocks isn't very hard. What is hard is figuring out if this would actually improve life. The reason I say this is because people like to use snapshots and clones on ZFS. If you keep snapshots, then you aren't freeing blocks, so the free list doesn't grow. This is a very different use case from UFS, as an example. It seems as though the oft-mentioned block rewrite capabilities needed for pool shrinking and changing things like compression, encryption, and deduplication would also show benefit here. That is, blocks would be rewritten in such a way as to minimize the number of chunks of storage that are allocated. The current HDS chunk size is 42 MB. The most benefit would seem to come from having ZFS make a point of reusing old but freed blocks before doing an allocation that causes the back-end storage to allocate another chunk of disk to the thin-provisioned LUN. While it is important to be able to roll back a few transactions in the event of some widely discussed failure modes, it is probably reasonable to reuse a block freed by a txg that is 3,000 txgs old (about 1 day old at 1 txg per 30 seconds). Such a threshold could be used to determine whether to reuse a block or venture into previously untouched regions of the disk. This strategy would allow the SAN administrator (who is a different person than the sysadmin) to allocate extra space to servers, and the sysadmin can control the amount of space really used by quotas.
In the event that there is an emergency need for more space, the sysadmin can increase the quota and allow more of the allocated SAN space to be used. Assuming the block rewrite feature comes to fruition, this emergency growth could be shrunk back down to the original size once the surge in demand (or errant process) subsides. There are a few minor bumps in the road. The ATA PASSTHROUGH command, which allows TRIM to pass through the SATA drivers, was just integrated into b130. This will be more important to small servers than SANs, but the point is that all parts of the software stack need to support the effort. As such, it is not clear to me who, if anyone, inside Sun is the champion for the effort -- it crosses multiple organizational boundaries. Apart from the technical possibilities, this feature looks really inevitable to me in the long run, especially for enterprise customers with high-end SAN, as cost is always a major factor in a storage design and it's a huge difference if you have to pay based on space used vs. space allocated (for example). If the high cost of SAN storage is the problem, then I think there are better ways to solve that :-) The SAN could be an OpenSolaris device serving LUNs through COMSTAR. If those LUNs are used to hold a zpool, the zpool could notify the LUN that blocks are no longer used and the SAN could reclaim those blocks. This is just a variant of the same problem faced with expensive SAN devices that have thin provisioning allocation units measured in the tens of megabytes instead of hundreds to thousands of kilobytes. -- Mike Gerdts http://mgerdts.blogspot.com/
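Mike's reuse-threshold idea could be sketched as an allocation policy. The threshold is his example figure from the post; the function itself is a hypothetical illustration, not ZFS's allocator:

```python
# Policy sketch: prefer a freed block whose freeing txg is older than a
# threshold (so recent txgs stay recoverable); otherwise fall back to
# previously untouched space, which may cost the SAN a new chunk.
# ~1 day at one txg per 30 seconds, per the post.
REUSE_AGE_TXGS = 3000

def pick_block(current_txg, freed, next_fresh):
    """freed: list of (txg_freed, block). Returns (block, next_fresh)."""
    for i, (txg, block) in enumerate(freed):
        if current_txg - txg >= REUSE_AGE_TXGS:
            del freed[i]
            return block, next_fresh        # reuse: no new SAN chunk
    return next_fresh, next_fresh + 1       # venture into untouched space

freed = [(9000, "old"), (11990, "recent")]
print(pick_block(12000, freed, 500))  # ('old', 500) -- aged block reused
print(pick_block(12000, freed, 500))  # (500, 501) -- only recent frees left
```

The same knob gives the sysadmin a dial between keeping thin LUNs thin (reuse aggressively) and keeping a deep txg rollback history (reuse conservatively).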
Re: [zfs-discuss] Thin device support in ZFS?
Richard, That's an interesting question, whether it's worth it or not. I guess the question is always who the targets for ZFS are (I assume everyone, though in reality priorities have to be set as developer resources are limited). For a home office, no doubt thin provisioning is not much of a use; for an enterprise company the numbers might really make a difference if we look at space used vs. space allocated. There are some studies suggesting that thin provisioning can reduce physical space used by up to 30%, which is huge. (Even though I understand studies are not real life and thin provisioning is not viable in every environment.) Btw, I would like to discuss scenarios where, though we have an over-subscribed pool in the SAN (meaning the overall space allocated to the systems is more than the physical space in the pool), with proper monitoring and proactive physical drive adds we won't let any systems/applications attached to the SAN realize that we have thin devices. Actually, that's why I believe configuring thin devices without periodically reclaiming space is just a timebomb, though if you have the option to periodically reclaim space, you can maintain the pool in the SAN in a really efficient way. That's why I found Veritas' Thin Reclamation API a milestone in the thin device field. Anyway, only the future can tell whether thin provisioning will or won't be a major feature in the storage world, though as I saw Veritas already added this feature, I was wondering if ZFS has it at least on its roadmap. Regards, sendai
Re: [zfs-discuss] Thin device support in ZFS?
To some extent it already does. If what you're talking about is filesystems/datasets, then all filesystems within a pool share the same free space, which is functionally very similar to each filesystem within the pool being thin-provisioned. To get a thick filesystem, you'd need to set at least the filesystem's reservation, and probably quota as well. Basically, filesystems within a pool are thin by default, with the added bonus that space freed within a single filesystem is available for use in any other filesystem within the pool. If you're talking about volumes provisioned from a pool, then volumes can be provisioned as sparse, which is pretty much the same thing. And if you happen to be providing iSCSI LUNs from files rather than volumes, then those files can be created sparse as well. Reclaiming space from sparse volumes and files is not so easy, unfortunately! If you're talking about the pool itself being thin... that's harder to do, although if you really needed it, I guess you could provision your pool from an array that itself provides thin provisioning. Regards, Tristan On 30/12/2009 9:34 PM, Andras Spitzer wrote: Hi, Has anyone heard about any plans to support thin devices in ZFS? I'm talking about the thin device feature of SAN frames (EMC, HDS) which provides more efficient space utilization. The concept is similar to ZFS with the pool and datasets, though the pool in this case is in the SAN frame itself, so the pool can be shared among different systems attached to the same SAN frame. This topic is really complex, but I'm sure supporting it is inevitable for enterprise customers with SAN storage; basically it brings the differentiation of space used vs. space allocated, which can be a huge difference in a large environment, and this difference is major even on the financial level as well.
Veritas already added support for thin devices: first of all, support in VxFS to be thin-aware (for example, how to handle over-subscribed thin devices); then Veritas added a feature called SmartMove, a nice feature to migrate from fat to thin devices; and the most brilliant feature of all (my personal opinion, of course) is that they released the Veritas Thin Device Reclamation API, which provides an interface to the SAN frame to report unused space at the block level. This API is a major hit, and even though SAN vendors don't support it today, HP and HDS are already working on it, and I assume EMC has to follow as well. With this API Veritas can keep track of files deleted, for example, and with a simple command once a day (depending on your policy) it can report the unused space back to the frame, so thin devices [b]remain[/b] thin. I really believe that ZFS should have support for thin devices, especially given what this API brings into this field, as it can result in a huge cost difference for enterprise customers. Regards, sendai
Re: [zfs-discuss] Thin device support in ZFS?
now this is getting interesting :-)... On Dec 30, 2009, at 12:13 PM, Mike Gerdts wrote: On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling richard.ell...@gmail.com wrote: [TRIMmed] It is not hard, because ZFS knows the current free list, so walking that list and telling the storage about the freed blocks isn't very hard. What is hard is figuring out if this would actually improve life. The reason I say this is because people like to use snapshots and clones on ZFS. If you keep snapshots, then you aren't freeing blocks, so the free list doesn't grow. This is a very different use case from UFS, as an example. It seems as though the oft-mentioned block rewrite capabilities needed for pool shrinking and changing things like compression, encryption, and deduplication would also show benefit here. That is, blocks would be rewritten in such a way as to minimize the number of chunks of storage that are allocated. The current HDS chunk size is 42 MB. Good observation, Mike. ZFS divides a leaf vdev into approximately 200 metaslabs. Space is allocated in a metaslab, and at some point another metaslab will be chosen. The assumption is made that the outer tracks of a disk have higher bandwidth than inner tracks, so allocations should be biased towards lower-numbered metaslabs. Let's ignore, for the moment, that SSDs, and to some degree RAID arrays, don't exhibit this behavior. OK, so here's how it works, in a nutshell. Space is allocated in the same metaslab until it fills or becomes fragmented, and then the next metaslab is used.
You can see this in my Spacemaps from Space blog, http://blogs.sun.com/relling/entry/space_maps_from_space where in the lower-numbered tracks (towards the bottom) you can see occasional small blank areas. Note to self: a better picture would be useful :-) Note: copies are intentionally spread to other, distant metaslabs for diversity. Inside the metaslab, space is allocated on a first-fit basis until the space is mostly consumed, and then the algorithm changes to best-fit. The algorithm for these two decisions was changed in b129, in an effort to improve performance. So, the questions that arise are: Should the allocator be made aware of the chunk size of virtual storage vdevs? [hint: there is evidence of the intention to permit different allocators in the source, but I dunno if there is an intent to expose those through an interface.] If the allocator can change, what sorts of policies should be implemented? Examples include: + should the allocator stick with best-fit and encourage more gangs when the vdev is virtual? + should the allocator be aware of an SSD's page size? Is said page size available to an OS? + should the metaslab boundaries align with virtual storage or SSD page boundaries? And, perhaps most important, how can this be done automatically so that system administrators don't have to be rocket scientists to make a good choice? -- richard
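The first-fit/best-fit switch Richard describes might look roughly like this, with free space modeled as (offset, length) segments. The 75% switch point is an assumed figure for illustration, not the actual ZFS threshold:

```python
# Sketch: first-fit while the metaslab still has plenty of free space
# (fast, fragments more), best-fit once it is mostly consumed (slower,
# packs tighter). Illustrative threshold, not ZFS's real one.
SWITCH_FRACTION = 0.75

def allocate(segments, size, capacity):
    """segments: mutable list of (offset, length) free extents.
    Returns the offset of the allocation, or None if nothing fits."""
    free = sum(length for _, length in segments)
    fits = [s for s in segments if s[1] >= size]
    if not fits:
        return None
    if free > (1 - SWITCH_FRACTION) * capacity:
        pick = fits[0]                       # first-fit
    else:
        pick = min(fits, key=lambda s: s[1])  # best-fit: tightest hole
    segments.remove(pick)
    off, length = pick
    if length > size:
        segments.append((off + size, length - size))  # return the tail
    return off

segs = [(0, 40), (100, 10), (200, 25)]
print(allocate(segs, 8, capacity=100))  # 0 -- 75% free, first-fit wins
segs = [(0, 12), (100, 10), (200, 25)]
print(allocate(segs, 8, capacity=400))  # 100 -- mostly full, best-fit wins
```

A chunk-aware variant of the same routine would simply prefer `fits` entries lying inside already-touched 42 MB chunks, which is one concrete answer to Richard's first question.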
Re: [zfs-discuss] Thin device support in ZFS?
Ack.. I've just re-read your original post. :-) It's clear you are talking about support for thin devices behind the pool, not features inside the pool itself. Mea culpa. So I guess we wait for TRIM to be fully supported.. :-) T.

On 31/12/2009 8:09 AM, Tristan Ball wrote: To some extent it already does. If what you're talking about is filesystems/datasets, then all filesystems within a pool share the same free space, which is functionally very similar to each filesystem within the pool being thin-provisioned. To get a thick filesystem, you'd need to set at least the filesystem's reservation, and probably a quota as well. Basically, filesystems within a pool are thin by default, with the added bonus that space freed within a single filesystem is available for use in any other filesystem within the pool. If you're talking about volumes provisioned from a pool, then volumes can be provisioned as sparse, which is pretty much the same thing. And if you happen to be providing iSCSI LUNs from files rather than volumes, then those files can be created sparse as well. Reclaiming space from sparse volumes and files is not so easy, unfortunately! If you're talking about the pool itself being thin... that's harder to do, although if you really needed it you could, I guess, provision your pool from an array that itself provides thin provisioning. Regards, Tristan

On 30/12/2009 9:34 PM, Andras Spitzer wrote: Hi, Has anyone heard about any plans for ZFS to support thin devices? I'm talking about the thin device feature of SAN frames (EMC, HDS) which provides more efficient space utilization. The concept is similar to ZFS with its pool and datasets, though the pool in this case is in the SAN frame itself, so the pool can be shared among different systems attached to the same SAN frame.
This topic is really complex, but I'm sure support for it is inevitable for enterprise customers with SAN storage. Basically it brings the differentiation of space used vs space allocated, which can be a huge difference in a large environment, and this difference is major on the financial level as well. Veritas already added support for thin devices: first, VxFS was made thin-aware (for example, knowing how to handle over-subscribed thin devices); then Veritas added a feature called SmartMove, a nice feature for migrating from fat to thin devices; and the most brilliant feature of all (my personal opinion, of course) is the Veritas Thin Device Reclamation API, which provides an interface for reporting unused space to the SAN frame at the block level. This API is a major hit, and even though SAN vendors don't support it today, HP and HDS are already working on it, and I assume EMC will have to follow as well. With this API, Veritas can keep track of deleted files, for example, and with a simple command once a day (depending on your policy) it can report the unused space back to the frame, so thin devices *remain* thin. I really believe that ZFS should have support for thin devices, especially the capability this API brings into the field, as it can mean a huge cost difference for enterprise customers. Regards, sendai
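To make the space-used vs space-allocated point concrete, here is a rough sketch (my own illustration, not the Veritas API) of the core computation behind block-level reclamation: given the array's chunk size (42 MB on HDS, per Mike's note earlier in the thread) and the filesystem's live block ranges, find the chunks holding no live data that could be reported back to the frame.

```python
# Illustrative only: find array chunks containing no allocated bytes, i.e.
# chunks a reclamation API could hand back so thin devices remain thin.

CHUNK = 42 * 1024 * 1024  # example HDS chunk size mentioned in the thread

def reclaimable_chunks(device_size, allocated_ranges, chunk=CHUNK):
    """allocated_ranges: iterable of (offset, length) byte ranges in use.
    Returns byte offsets of chunks with no allocated data."""
    n_chunks = (device_size + chunk - 1) // chunk
    live = [False] * n_chunks
    for off, length in allocated_ranges:
        first = off // chunk
        last = (off + length - 1) // chunk
        for i in range(first, last + 1):
            live[i] = True  # any live byte pins the whole chunk
    return [i * chunk for i, in_use in enumerate(live) if not in_use]
```

Note how coarse the granularity is: a single live block pins an entire 42 MB chunk, which is why the block-rewrite/defragmentation idea from earlier in the thread matters so much here.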
Re: [zfs-discuss] Thin device support in ZFS?
On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote: Richard, That's an interesting question, whether it's worth it or not. I guess the question is always who the targets for ZFS are (I assume everyone, though in reality priorities have to be set, as developer resources are limited). For a home office, no doubt, thin provisioning is not of much use; for an enterprise company the numbers might really make a difference if we look at space used vs space allocated. There are studies showing that thin provisioning can reduce physical space used by up to 30%, which is huge. (Even though I understand studies are not real life, and thin provisioning is not viable in every environment.) Btw, I would like to discuss scenarios where, though we have an over-subscribed pool in the SAN (meaning the overall space allocated to the systems is more than the physical space in the pool), with proper monitoring and proactive physical drive adds we won't let any systems/applications attached to the SAN realize that we have thin devices. Actually, that's why I believe configuring thin devices without periodically reclaiming space is just a timebomb, though if you have the option to periodically reclaim space, you can maintain the pool in the SAN in a really efficient way. That's why I found Veritas' Thin Reclamation API to be a milestone in the thin device field. Anyway, only the future can tell whether thin provisioning will be a major feature in the storage world, though as I saw Veritas already added this feature, I was wondering if ZFS has it at least on its roadmap.

Thin provisioning is absolutely, positively a wonderful, good thing! The question is, how does the industry handle the multitude of thin provisioning models, each layered on top of another? For example, here at the ranch I use VMWare and Xen, which thinly provision virtual disks. I do this over iSCSI to a server running ZFS which thinly provisions the iSCSI target.
Personally, I think being thinner closer to the application wins over being thinner closer to dumb storage devices (disk drives). BTW, I do not see an RFE for this on http://bugs.opensolaris.org. Would you be so kind as to file one? -- richard
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 3:12 PM, Richard Elling richard.ell...@gmail.com wrote: If the allocator can change, what sorts of policies should be implemented? Examples include:
+ should the allocator stick with best-fit and encourage more gangs when the vdev is virtual?
+ should the allocator be aware of an SSD's page size? Is said page size available to an OS?
+ should the metaslab boundaries align with virtual storage or SSD page boundaries?

Wandering off topic a little bit... Should the block size be a tunable, so that it can match the page size of SSDs (typically 4K, right?) and of upcoming hard disks that sport a sector size larger than 512 bytes? http://arc.opensolaris.org/caselog/PSARC/2008/769/final_spec.txt

And, perhaps most important, how can this be done automatically so that system administrators don't have to be rocket scientists to make a good choice?

Didn't you read the marketing literature? ZFS is easy because you only need to know two commands: zpool and zfs. If you just ignore all the subcommands, the options to those subcommands, the evil tuning that is sometimes needed, and the effects of redundancy choices, then there is no need for any rocket scientists. :) -- Mike Gerdts http://mgerdts.blogspot.com/
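To illustrate why matching the block size to the device's sector or page size matters: any write that is smaller than, or misaligned to, the physical sector forces the device into a read-modify-write cycle. A back-of-the-envelope sketch (my own illustration, nothing from the PSARC case):

```python
# Why block size should match the device sector/page size: a sub-sector or
# misaligned write forces the device to read-modify-write every sector it
# touches, instead of writing whole sectors directly.

def sectors_touched(offset, length, sector=4096):
    first = offset // sector
    last = (offset + length - 1) // sector
    return last - first + 1

def needs_rmw(offset, length, sector=4096):
    # aligned, whole-sector writes can go straight to the medium
    return offset % sector != 0 or length % sector != 0
```

With 512-byte logical writes on a 4K-sector device, every write needs a read-modify-write; with 4K-aligned writes, none do.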
Re: [zfs-discuss] Thin device support in ZFS?
On 30 dec 2009, at 22.45, Richard Elling wrote: On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote: [...] Thin provisioning is absolutely, positively a wonderful, good thing! The question is, how does the industry handle the multitude of thin provisioning models, each layered on top of another? For example, here at the ranch I use VMWare and Xen, which thinly provision virtual disks. I do this over iSCSI to a server running ZFS which thinly provisions the iSCSI target.
If I had a virtual RAID array, I would probably use that, too. Personally, I think being thinner closer to the application wins over being thinner closer to dumb storage devices (disk drives).

I don't get it - why do we need anything more magic (or complicated) than support for TRIM from the filesystems and the storage systems? I don't see why TRIM would be hard to implement for ZFS either, except that you may want to keep data from a few txgs back just for safety, which would probably call for some two-stage freeing of data blocks (those free blocks that are still to be TRIMmed, and those that already have been). /ragge
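The two-stage freeing Ragge suggests could look something like this sketch (purely illustrative; txg numbering and the real deferred-free machinery in ZFS differ):

```python
# Sketch of two-stage freeing: blocks freed in recent txgs sit in a
# "pending TRIM" stage for a few txgs before the TRIM is actually issued,
# so very recent history remains readable just for safety.

class TwoStageFree:
    def __init__(self, defer_txgs=3):
        self.defer_txgs = defer_txgs
        self.pending = {}   # txg -> blocks freed in that txg, not yet TRIMmed
        self.trimmed = []   # blocks already TRIMmed (safe to reuse)

    def free(self, txg, block):
        self.pending.setdefault(txg, []).append(block)

    def txg_sync(self, current_txg, trim_device):
        # issue TRIM only for blocks freed more than defer_txgs ago
        ripe = [t for t in self.pending if t <= current_txg - self.defer_txgs]
        for txg in ripe:
            for block in self.pending.pop(txg):
                trim_device(block)
                self.trimmed.append(block)
```

Here `trim_device` stands in for whatever the driver stack would eventually provide to push the TRIM command down to the device.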
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, 30 Dec 2009, Mike Gerdts wrote: Should the block size be a tunable, so that it can match the page size of SSDs (typically 4K, right?) and of upcoming hard disks that sport a larger sector size?

Enterprise SSDs are still in their infancy. The actual page size of an SSD could be almost anything. Due to the lack of seek-time concerns and the high cost of erasing a page, an SSD could be designed with a level of indirection so that multiple logical writes to disjoint offsets could be combined into a single SSD physical page. Likewise, a large logical block could be subdivided into multiple SSD pages, which are allocated on demand. Logic is cheap and SSDs are full of logic, so it seems reasonable that future SSDs will do this, if they don't already, since similar logic enables wear-leveling. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
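The indirection Bob describes is essentially what a flash translation layer (FTL) does. A toy sketch (hypothetical, not any vendor's design) of coalescing 512-byte logical writes into 4K physical pages:

```python
# Toy FTL: logical sectors map to (physical page, offset) pairs, so small
# writes to disjoint logical offsets land together in one physical page.
# Erase/garbage-collection and wear-leveling are not modeled.

class TinyFTL:
    def __init__(self, page_size=4096, sector=512):
        self.page_size = page_size
        self.sector = sector
        self.map = {}        # logical sector -> (physical page, offset in page)
        self.open_page = 0   # page currently being filled
        self.fill = 0        # bytes used in the open page

    def write(self, lba, data):
        assert len(data) == self.sector
        if self.fill + self.sector > self.page_size:
            self.open_page += 1  # seal the full page, start a new one
            self.fill = 0
        self.map[lba] = (self.open_page, self.fill)
        self.fill += self.sector
```

Because the mapping is per logical sector, the host-visible sector size is decoupled from the NAND page size, which is Bob's point about why the "real" page size may be unknowable from outside.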
Re: [zfs-discuss] Thin device support in ZFS?
On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote: On 30 dec 2009, at 22.45, Richard Elling wrote: On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote: [...] Thin provisioning is absolutely, positively a wonderful, good thing! The question is, how does the industry handle the multitude of thin provisioning models, each layered on top of another? For example, here at the ranch I use VMWare and Xen, which thinly provision virtual disks.
I do this over iSCSI to a server running ZFS which thinly provisions the iSCSI target. If I had a virtual RAID array, I would probably use that, too. Personally, I think being thinner closer to the application wins over being thinner closer to dumb storage devices (disk drives).

I don't get it - why do we need anything more magic (or complicated) than support for TRIM from the filesystems and the storage systems?

TRIM is just one part of the problem (or solution, depending on your point of view). The TRIM command is part of the ATA command set (the SCSI analog, standardized by T10, is UNMAP); it allows a host to tell a block device that the data in a set of blocks is no longer of any value, so the block device can destroy the data without adverse consequence. In a world with copy-on-write and without snapshots, it is obvious that there will be a lot of blocks running around that are no longer in use. Snapshots (and their clones) change that use case. So in a world of snapshots, there will be fewer blocks which are not used. Remember, the TRIM command is very important to OSes like Windows or OSX which do not have file systems that are copy-on-write or have decent snapshots. OTOH, ZFS does copy-on-write, and lots of ZFS folks use snapshots. That said, adding TRIM support is not hard in ZFS. But it depends on lower-level drivers to pass the TRIM commands down the stack. These ducks are lining up now.

I don't see why TRIM would be hard to implement for ZFS either, except that you may want to keep data from a few txgs back just for safety, which would probably call for some two-stage freeing of data blocks (those free blocks that are to be TRIMmed, and those that already are).

Once a block is freed in ZFS, ZFS no longer needs it. So the problem of TRIM in ZFS is not related to the recent txg commit history. The issue is that traversing the free block list has to be protected by locks, so that the file system does not allocate a block while it is also TRIMming that block.
Not so difficult, as long as the TRIM occurs relatively quickly. I think that any TRIM implementation should be an administration command, like scrub. It probably doesn't make sense to have it running all of the time, but on occasion it might make sense. My concern is that people will have an expectation that they can use snapshots and TRIM -- the former reduces the effectiveness of the latter. As the price of storing bytes continues to decrease, will the cost of not TRIMming be a long-term issue? I think not. -- richard
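Richard's point about snapshots reducing TRIM's effectiveness comes down to simple set arithmetic. In this sketch (my own illustration, not ZFS internals), a block is only trimmable when neither the live filesystem nor any snapshot still references it:

```python
# Why snapshots reduce TRIM's effectiveness: a block freed from the live
# filesystem is still pinned if any snapshot references it, so the set of
# trimmable blocks shrinks as snapshots accumulate.

def trimmable(all_blocks, live_refs, snapshot_refs):
    """Blocks allocated in the past but referenced by nothing: safe to TRIM."""
    referenced = set(live_refs)
    for snap in snapshot_refs:
        referenced |= set(snap)
    return set(all_blocks) - referenced
```

With heavy snapshot use, `referenced` approaches `all_blocks` and the trimmable set approaches empty, which is exactly the expectation mismatch Richard warns about.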