Re: [zfs-discuss] Thin device support in ZFS?

2010-01-13 Thread Miles Nordin
 et == Erik Trimble erik.trim...@sun.com writes:

et Probably, the smart thing to push for is inclusion of some new
et command in the ATA standard (in a manner like TRIM).  Likely
et something that would return both native Block and Page sizes
et upon query.

that would be the *sane* thing to do.  The *smart* thing to do would
be to write a quick test to determine the apparent page size by
performance-testing write-flush-write-flush-write-flush with various
write sizes and finding the knee that indicates the smallest size at
which read-before-write has stopped.  The test could happen in 'zpool
create' and have its result written into the vdev label.  
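
Something like the rough sketch below is all I mean (the path and sizes
are made up, it is destructive to whatever you point it at, and you
still have to eyeball the knee in the printed numbers yourself):

#!/usr/bin/env python
# crude page-size probe: time synchronous writes of increasing size and
# look for the knee where read-before-write stops costing you.
# DESTRUCTIVE -- point it at a scratch device or file you can overwrite.
import os, sys, time

dev = sys.argv[1] if len(sys.argv) > 1 else "/tmp/flash-probe.img"  # hypothetical
passes = 256

fd = os.open(dev, os.O_WRONLY | os.O_CREAT | os.O_DSYNC)
for size in (512, 1024, 2048, 4096, 8192, 16384, 32768):
    buf = b"\xa5" * size
    os.lseek(fd, 0, os.SEEK_SET)
    t0 = time.time()
    for _ in range(passes):
        os.write(fd, buf)               # O_DSYNC flushes every write
    dt = time.time() - t0
    print("%6d bytes: %8.1f us/write, %6.2f MB/s"
          % (size, dt * 1e6 / passes, size * passes / dt / 1e6))
os.close(fd)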

Inventing ATA commands takes too long to propagate through the
technosphere, and the EE's always implement them wrongly: for example,
a device with SDRAM + supercap should probably report 512 byte sectors
because the algorithm for copying from SDRAM to NAND is subject to
change and none of your business, but EE's are not good with language
and will try to apelike match up the paragraph in the spec with the
disorganized thoughts in their head, fit pegs into holes, and will end
up giving you the NAND page size without really understanding why you
wanted it other than that some standard they can't control demands it.
They may not even understand why their devices are faster and
slower---they are probably just hurling shit against an NTFS and
shipping whatever runs some testsuite fastest---so doing the empirical
test is the only way to document what you really care about in a way
that will make it across the language and cultural barriers between
people who argue about javascript vs python and ones that argue about
Agilent vs LeCroy.  Within the proprietary wall of these flash
filesystem companies the testsuites are probably worth as much as the
filesystem code, and here without the wall an open-source statistical
test is worth more than a haggled standard.  

Remember the ``removable'' bit in USB sticks and the mess that both
software and hardware made out of it.  (hot-swappable SATA drives are
``non-removable'' and don't need rmformat while USB/FireWire do?
yeah, sorry, you fail abstraction.  and USB drives have the ``removable
medium'' bit set when the medium and the controller are inseparable,
it's the _controller_ that's removable?  yeah, sorry, you fail reading
English.)  If you can get an answer by testing, DO IT, and evolve the
test to match products on the market as necessary.  This promises to
be a lot more resilient than the track record with bullshit ATA
commands and will work with old devices too.  By the time you iron out
your standard we will be using optonanocyberflash instead: that's what
happened with the removable bit and r/w optical storage.  BTW let me
know when read/write UDF 2.0 on dvd+r is ready---the standard was only
announced twelve years ago, thanks.




Re: [zfs-discuss] Thin device support in ZFS?

2010-01-12 Thread Miles Nordin
 ah == Al Hopper a...@logical-approach.com writes:

ah The main issue is that most flash devices support 128k byte
ah pages, and the smallest chunk (for want of a better word) of
ah flash memory that can be written is a page - or 128kb.  So if
ah you have a write to an SSD that only changes 1 byte in one 512
ah byte disk sector, the SSD controller has to either
ah read/re-write the affected page or figure out how to update
ah the flash memory with the minimum affect on flash wear.

yeah well, I'm not sure it matters, but that's untrue.

there are two sizes for NAND flash, the minimum write size and the
minimum erase size.  The minimum write size is the size over which
error correction is done, the unit at which inband and OOB data is
interleaved on NAND flash.  The minimum erase size is just what it
sounds like: the size the cleaner/garbage collector must evacuate.

The minimum write size is I suppose likely to provoke
read/modify/write and wasting of write and wear bandwidth for smaller
writes in flashes which do not have a DRAM+supercap, if you ask to
SYNCHRONIZE CACHE right after the write.  If there is a supercap, or
if you allow the drive to do write caching, then the smaller write
could be coalesced, making this size irrelevant.  I think it's usually
2 - 4 kB.  I would expect resistance to growing it larger than 4kB
because of NTFS---electrical engineers are usually over-obsessed with
Windows.

The minimum erase size you don't really care about at all.  That's the
one that's usually at least 128kB.
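
Back of the envelope, assuming a 4kB program page and zero coalescing
(so every sync write costs a whole page of program/wear bandwidth):

# toy arithmetic: wear/write amplification when every synchronous write
# is smaller than the NAND program page and nothing gets coalesced
page = 4096                          # assumed minimum program size
for io in (512, 1024, 2048, 4096):
    print("%4d-byte sync write -> %4d bytes programmed (%.0fx write amp)"
          % (io, page, float(page) / io))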

ah For anyone who is interested in getting more details of the
ah challenges with flash memory, when used to build solid state
ah drives, reading the tech data sheets on the flash memory
ah devices will give you a feel for the basic issues that must be
ah solved.

and the linux-mtd list will give you a feel for how people are solving
them, because that's the only place I know of where NAND filesystem
work is going on in the open.  There are a bunch of geezers saying ``I
wrote one for BSD but my employer won't let me release it,'' and then
the new crop of intel/sandforce/stec proprietary kids, but in the open
world AFAIK there is just yaffs and ubifs.  The T-Mobile G1 is yaffs.

ah Bobs point is well made.  The specifics of a given SSD
ah implementation will make the performance characteristics of
ah the resulting SSD very difficult to predict or even describe -

I'm really a fan of the idea of using ACARD ANS-9010 for a slog.
It's basically all DRAM+battery, and uses a low performance CF card
for durable storage if the battery starts to run low, or if you
explicitly request it (to move data between ACARD units by moving the
CF card maybe).  It will even make non-ECC RAM into ECC storage (using
a sector size and OOB data :).  It seems like Zeus-like performance at
1/10th the price, but of course it's a little goofy, and I've never
tried it.

slog is where I'd expect the high synchronous workload to be, so this
is where there are small writes that can't be coalesced, I would
presume, and appropriate slog sizes are reachable with DRAM alone.




Re: [zfs-discuss] Thin device support in ZFS?

2010-01-08 Thread Daniel Carosone
Yet another way to thin-out the backing devices for a zpool on a
thin-provisioned storage host, today: resilver. 

If your zpool has some redundancy across the SAN backing LUNs, simply
drop and replace one at a time and allow zfs to resilver only the
blocks currently in use onto the replacement LUN.
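
For example, a rough sketch of the loop (pool and LUN names are
invented, and it just polls zpool status until the resilver finishes;
one LUN at a time so redundancy is never lost):

# rough sketch: swap SAN-backed LUNs one at a time and let resilver copy
# only the blocks in use onto each thin replacement LUN.
# pool and device names are made up -- adjust for your setup.
import subprocess, time

pool = "tank"                                          # hypothetical pool
swaps = [("c2t0d0", "c3t0d0"), ("c2t1d0", "c3t1d0")]   # (old LUN, new thin LUN)

def resilver_running(pool):
    out = subprocess.Popen(["zpool", "status", pool],
                           stdout=subprocess.PIPE).communicate()[0]
    return b"resilver in progress" in out

for old, new in swaps:
    subprocess.check_call(["zpool", "replace", pool, old, new])
    while resilver_running(pool):
        time.sleep(60)             # wait before touching the next LUN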

--
Dan.



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-05 Thread Miles Nordin
 dm == David Magda dma...@ee.ryerson.ca writes:

dm 4096 - to-512 blocks

aiui NAND flash has a minimum write size (determined by ECC OOB bits)
of 2 - 4kB, and a minimum erase size that's much larger.  Remapping
cannot abstract away the performance implication of the minimum write
size if you are doing a series of synchronous writes smaller than the
minimum size on a device with no battery/capacitor, although using a
DRAM+supercap prebuffer might be able to abstract away some of it.




Re: [zfs-discuss] Thin device support in ZFS?

2010-01-05 Thread Erik Trimble
As a further update, I went back and re-read my SSD controller info, and 
then did some more Googling.


Turns out, I'm about a year behind on State-of-the-SSD.  Eric is 
correct on the way current SSDs implement writes (both SLC and MLC), so 
I'm issuing a mea culpa here. The change in implementation appears to 
occur sometime shortly after the introduction of the Indilinx 
controllers.  My fault for not catching this.


-Erik



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-03 Thread Eric D. Mudama

On Sat, Jan  2 at 22:24, Erik Trimble wrote:
In MLC-style SSDs, you typically have a block size of 2k or 4k. 
However, you have a Page size of several multiples of that, 128k 
being common, but by no means ubiquitous.


I believe your terminology is crossed a bit.  What you call a block is
usually called a sector, and what you call a page is known as a block.

Sector is (usually) the unit of reading from the NAND flash.

The unit of write in NAND flash is the page, typically 2k or 4k
depending on NAND generation, and thus consisting of 4-8 ATA sectors
(typically).  A single page may be written at a time.  I believe some
vendors support partial-page programming as well, allowing a single
sector append type operation where the previous write left off.

Ordered pages are collected into the unit of erase, which is known as
a block (or erase block), and is anywhere from 128KB to 512KB or
more, depending again on NAND generation, manufacturer, and a bunch of
other things.

Some large number of blocks are grouped by chip enables, often 4K or
8K blocks.


I think you're confusing erasing with writing.

When I say minimum write size, I mean that for an MLC, no matter 
how small you make a change, the minimum amount of data actually 
being written to the SSD is a full page (128k in my example).   There


Page is the unit of write, but it's much smaller in all NAND I am
aware of.

is no append down at this level. If I have a page of 128k, with 
data in 5 of the 4k blocks, and I then want to add another 2k of data 
to this, I have to READ all 5 4k blocks into the controller's DRAM, 
add the 2k of data to that, then write out the full amount to a new 
page (if available), or wait for an older page to be erased before 
writing to it.  Thus, in this case,  in order to do an actual 2k 
write, the SSD must first read 10k of data, do some compositing, then 
write 12k to a fresh page.


Thus, to change any data inside a single page, the entire contents 
of that page have to be read, the page modified, then the entire page 
written back out.


See above.

What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs 
work differently, but still have problems with what I'll call 
excess-writing.


I think you're only describing dumb SSDs with erase-block granularity
mapping. Most (all) vendors have moved away from that technique since
random write performance is awful in those designs and they fall over
dead from wAmp in a jiffy.

SLC and MLC NAND is similar, and they are read/written/erased almost
identically by the controller.

I'm not sure that SSDs actually _have_ to erase - they just overwrite 
anything there with new data. But this is implementation dependent, 
so I can say how /all/ MLC SSDs behave.


Technically you can program the same NAND page repeatedly, but since
bits can only transition from 1-0 on a program operation, the result
wouldn't be very meaningful.  An erase sets all the bits in the block
to 1, allowing you to store your data.
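
A toy model of that cell behaviour (programming can only clear bits,
i.e. it ANDs the old and new data; erase sets everything back to ones):

# toy NAND behaviour: programming can only clear bits (1 -> 0), so an
# "overwrite" without an erase just ANDs the old and new data together
def program(old, new):
    return old & new                 # only 1 -> 0 transitions happen

def erase():
    return 0xFF                      # erase sets every bit back to 1

page = erase()                       # 0xff
page = program(page, 0xA5)           # fine: page now holds 0xa5
page = program(page, 0x5A)           # overwrite without erasing first...
print(hex(page))                     # 0x0 -- garbage, not 0x5a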

Once again, what I'm talking about is a characteristic of MLC SSDs, 
which are used in most consumer SSDS (the Intel X25-M, included).


Sure, such an SSD will commit any new writes to pages drawn from the 
list of never before used NAND.  However, at some point, this list 
becomes empty.  In most current MLC SSDs, there's about 10% extra 
(a 60GB advertised capacity is actually ~54GB usable with 6-8GB 
extra).   Once this list is empty, the SSD has to start writing 
back to previous used pages, which may require an erase step first 
before any write. Which is why MLC SSDs slow down drastically once 
they've been filled to capacity several times.


From what I've seen, erasing a block typically takes time on the
same scale as programming an MLC page, meaning in flash with large
page counts per block, the % of time spent erasing is not very large.

Let's say that an erase took 100ms and a program took 10ms, in an MLC
NAND device with 100 pages per block.  In this design, it takes us 1s
to program the entire block, but only 1/10 of the time to erase it.
An infinitely fast erase would only make the design about 10% faster.
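
The same arithmetic, spelled out so the numbers are easy to fiddle with
(example figures only, not a real part):

# the arithmetic above, spelled out (illustrative numbers)
erase_ms, program_ms, pages_per_block = 100.0, 10.0, 100

program_block = program_ms * pages_per_block   # 1000 ms to fill the block
full_cycle = erase_ms + program_block          # 1100 ms erase + program
print("%.0f ms program + %.0f ms erase = %.0f ms per block"
      % (program_block, erase_ms, full_cycle))
print("dropping the erase entirely makes it only %.0f%% faster"
      % (100 * (full_cycle / program_block - 1)))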

For SLC the erase performance matters more since page writes are much
faster on average and there are half as many pages, but we were
talking MLC.

The performance differences seen are because they were artificially
fast to begin with because they were empty.  It's similar to
destroking a rotating drive in many ways to speed seek times.  Once
the drive is full, it all comes down to raw NAND performance,
controller design, reserve/extra area (or TRIM) and algorithmic
quality.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-03 Thread Ragnar Sundblad

Eric D. Mudama did a very good job answering this, and I don't have
much to add. Thanks Eric!

On 3 jan 2010, at 07.24, Erik Trimble wrote:

 I think you're confusing erasing with writing.

I am now quite certain that it actually was you who were
confusing those. I hope this discussion has cleared things
up a little though.

 What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work 
 differently, but still have problems with what I'll call excess-writing.

Eric already said it, but I need to say this myself too:
SLC and MLC disks could be almost identical, only the storing of
the bits in the flash chips differs a little (1 or 2 bits per
storage cell). There is absolutely no other fundamental difference
between the two.

Hopefully no modern MLC *or* SLC disk works as you described,
since it is a horrible design, and selling it would be close to
robbery. It would be slow and it would wear out quite fast.

Now, SLC disks are typically better overall, because those who
want to pay for SLC flash typically also want to pay for better
controllers, but otherwise those issues are really orthogonal.

 I'm not sure that SSDs actually _have_ to erase - they just overwrite 
 anything there with new data. But this is implementation dependent, so I can 
 say how /all/ MLC SSDs behave.

As Eric said - yes you have to erase, otherwise you can't write
new data. It is not implementation dependent, it is inherent in
the flash technology. And, as has been said several times now,
erasing can only be done in large chunks, but writing can be done
in small chunks. I'd say that this is the main problem to handle
when creating a good flash SSD.

 The whole point behind ZFS is that CPU cycles are cheap and available, much 
 more so than dedicated hardware of any sort. What I'm arguing here is that 
 the controller on an SSD is in the same boat as a dedicated RAID HBA -  in 
 the latter case, use a cheap HBA instead and let the CPU  ZFS do the work, 
 while in the former case, use a dumb controller for the SSD instead of a 
 smart one.

This could be true, I am still not sure. My main issues with this
are that it would make the file system code dependent on a specific
hardware behavior (that of today's flash chips), and that it could
be quite a lot of data to shuffle around when compacting. But
we'll see. If it could be cheap enough, it could absolutely happen
and be worth it even if it has some drawbacks.

 And, as I pointed out in another message, doing it my way doesn't increase 
 bus traffic that much over what is being done now, in any case.

Yes, it would increase bus traffic: if you handle the flash
compacting in the host - which you have to with your idea - it could
be many times the real workload bandwidth. But it could still be
worth it, that is quite possible.

-

On 3 jan 2010, at 07.43, Erik Trimble wrote:
 I meant to say that I DON'T know how all MLC drives deal with erasure.

Again - yes they do. (Or they would be write-once only. :-)

 I'm pretty sure compacting doesn't occur in ANY SSDs without any OS 
 intervention (that is, the SSD itself doesn't do it), and I'd be surprised 
 to see an OS try to implement some sort of intra-page compaction - there 
 benefit doesn't seem to be there; it's better just to optimize writes than 
 try to compact existing pages. As far as reclaiming unused space, the TRIM 
 command is there to allow the SSD to mark a page Free for reuse, and an SSD 
 isn't going to be erasing a page unless it's right before something is to be 
 written to that page.
 My thinking of what compacting meant doesn't match up with the general 
 usage I'm seeing in the SSD technical papers, so in this respect, I'm wrong:  
 compacting does occur, but only when there are no fully erased (or unused) 
 pages available.  Thus, compacting is done in the context of a write 
 operation.

Exactly what and when it is that triggers compacting is another
issue, and that could probably change with firmware revisions.

It is wise to do it earlier than when you get that write that
didn't fit, since if you have some erased space you can then take
bursts of writes up to that size quickly. But compacting takes
bandwidth from the flash chips and wears them out, so you don't
want to do it too early or too much.

I guess this could be an interesting optimization problem, and
optimal behavior probably depends on the workload too. Maybe it
should be an adjustable knob.
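
Purely illustrative (not how any particular firmware does it), the knob
could be as dumb as a pair of watermarks on the free erase-block count:

# illustrative watermark scheme: compact early enough to absorb write
# bursts, late enough not to burn bandwidth and wear needlessly
LOW_WATER, HIGH_WATER = 8, 32        # free erase blocks; the "knob"

def maybe_compact(free_blocks, compact_one_block):
    while free_blocks < LOW_WATER:           # urgent: must make space now
        free_blocks = compact_one_block(free_blocks)
    if free_blocks < HIGH_WATER:             # idle tidying, one block only
        free_blocks = compact_one_block(free_blocks)
    return free_blocks

print(maybe_compact(5, lambda n: n + 1))     # -> 9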

-

On 3 jan 2010, at 10.57, Eric D. Mudama wrote:

 On Sat, Jan  2 at 22:24, Erik Trimble wrote:
 In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you 
 have a Page size of several multiples of that, 128k being common, but by no 
 means ubiquitous.
 
 I believe your terminology is crossed a bit.  What you call a block is
 usually called a sector, and what you call a page is known as a block.
 
 Sector is (usually) the unit of reading from the NAND flash.
...

Indeed, and I am partly guilty of that mess, but 

Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Ragnar Sundblad

On 1 jan 2010, at 17.44, Richard Elling wrote:

 On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote:
 Flash SSDs actually always remap new writes into an
 only-append-to-new-pages style, pretty much as ZFS does itself.
 So for a SSD there is no big difference between ZFS and
 filesystems such as UFS, NTFS, HFS+ et al; on the flash level they
 all work the same.
 
 The reason is that there is no way for it to rewrite single
 disk blocks, it can only fill up already erased pages of
 512K (for example). When the old blocks get mixed with unused
 blocks (because of block rewrites, TRIM or Write Many/UNMAP),
 it needs to compact the data by copying all active blocks from
 those pages into previously erased pages, and write the active
 data there, compacted/continuous. (When this happens, things tend
 to get really slow.)
 
 However, the quantity of small, overwritten pages is vastly different.
 I am not convinced that a workload that generates few overwrites
 will be penalized as much as a workload that generates a large
 number of overwrites.

ZFS is not append-only in itself; there will be holes from
deleted files after a while, and space will have to be
reclaimed sooner or later.

I am not convinced that a zfs that has been in use for a while
rewrites a lot less than other file systems. But maybe you are
right, and if so, I agree that intuitively such a workload
may be better matched to a flash based device.

If you have a workload that only appends data and never changes
or deletes it, zfs is probably a bit better than other file
systems at not rewriting blocks. But that is a pretty special
use case, and another file system could rewrite almost as
little.

 I think most folks here will welcome good, empirical studies,
 but thus far the only one I've found is from STEC and their
 disks behave very well after they've been filled and subjected
 to a rewrite workload. You get what you pay for.  Additional
 pointers are always appreciated :-)
 http://www.stec-inc.com/ssd/videos/ssdvideo1.php

There certainly are big differences between the flash SSD drives
out there, I wouldn't argue about that for a second!

/ragge



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Ragnar Sundblad

On 1 jan 2010, at 17.28, David Magda wrote:

 On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote:
 
 But that would only move the hardware specific and dependent flash
 chip handling code into the file system code, wouldn't it? What
 is won with that? As long as the flash chips have larger pages than
 the file system blocks, someone will have to shuffle around blocks
 to reclaim space, why not let the one thing that knows the hardware
 and also is very close to the hardware do it?
 
 And if this is good for SSDs, why isn't it as good for rotating rust?
 
 Don't really see how things are either hardware specific or dependent.

The inner workings of an SSD flash drive are pretty hardware (or
rather vendor) specific, and it may not be a good idea to move
any knowledge about that to the file system layer.

 COW is COW. Am I missing something? It's done by code somewhere in the stack, 
 if the FS knows about it, it can lay things out in sequential writes. If 
 we're talking about 512 KB blocks, ZFS in particular would create four 128 KB 
 txgs--and 128 KB is simply the currently #define'd size, which can be changed 
 in the future.

As I said in another mail, zfs is not append-only, especially
not if it has been in random read/write use for a while.
There will be holes in the data and space to be reclaimed,
something has to handle that, and I am not sure it is a good
idea to move that into the host, since it is dependent on the
design of the SSD drive.

 One thing you gain is perhaps not requiring to have as much of a reserve. At 
 most you have some hidden bad block re-mapping, similar to rotating rust 
 nowadays. If you're shuffling blocks around, you're doing a 
 read-modify-write, which if done in the file system, you could use as a 
 mechanism to defrag on-the-fly or to group many small files together.

Yes, defrag on the fly may be interesting. Otherwise I am not
sure I think the file system should do any of that, since it
may be that it can be done much faster and smarter in the
SSD controller.

 Not quite sure what you mean by your last question.

I meant that if hardware-dependent handling of the storage medium
is good to move into the host, why isn't the same true for
spinning disks? But we can leave that for now.

/ragge



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Andras Spitzer
Mike,

As far as I know, only Hitachi is using such a huge chunk size: 

So each vendor’s implementation of TP uses a different block size. HDS use 
42MB on the USP, EMC use 768KB on DMX, IBM allow a variable size from 32KB to 
256KB on the SVC and 3Par use blocks of just 16KB. The reasons for this are 
many and varied and for legacy hardware are a reflection of the underlying 
hardware architecture.

http://gestaltit.com/all/tech/storage/chris/thin-provisioning-holy-grail-utilisation/

Also, here Hu explains the reason why they believe 42M is the most efficient:

http://blogs.hds.com/hu/2009/07/chunk-size-matters.html

He has some good points in his arguments.

Regards,
sendai
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Joerg Schilling
Ragnar Sundblad ra...@csc.kth.se wrote:

 On 1 jan 2010, at 17.28, David Magda wrote:

  Don't really see how things are either hardware specific or dependent.

 The inner workings of a SSD flash drive is pretty hardware (or
 rather vendor) specific, and it may not be a good idea to move
 any knowledge about that to the file system layer.

If ZFS wants to keep SSDs fast even after they have been in use for a
while, then ZFS too would need to tell the SSD which sectors are no
longer in use.


Such a mode may cause a noticeable performance loss, as ZFS for this
reason may need to traverse freed, outdated data trees, but it will
help the SSD to erase the needed space in advance.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Joerg Schilling
Ragnar Sundblad ra...@csc.kth.se wrote:

 I certainly agree, but there still isn't much they can do about
 the WORM-like properties of flash chips, were reading is pretty
 fast, writing is not to bad, but erasing is very slow and must be
 done in pretty large pages which also means that active data
 probably have to be copied around before an erase.

WORM devices do not allow writing a block a second time. There is
a typical 5% reserve that would allow reassigning some blocks and making it 
appear they have been rewritten, but this is not what ZFS does. You are 
however right that there is a slight relation, as I did invent COW for a WORM 
filesystem in 1989 ;-)

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Erik Trimble

Joerg Schilling wrote:

Ragnar Sundblad ra...@csc.kth.se wrote:

On 1 jan 2010, at 17.28, David Magda wrote:

Don't really see how things are either hardware specific or dependent.

The inner workings of an SSD flash drive are pretty hardware (or
rather vendor) specific, and it may not be a good idea to move
any knowledge about that to the file system layer.

If ZFS wants to keep SSDs fast even after they have been in use for a
while, then ZFS too would need to tell the SSD which sectors are no
longer in use.

Such a mode may cause a noticeable performance loss, as ZFS for this
reason may need to traverse freed, outdated data trees, but it will
help the SSD to erase the needed space in advance.

Jörg

The TRIM command is what is intended for an OS to notify the SSD as to 
which blocks are deleted/erased, so the SSD's internal free list can be 
updated (that is, it allows formerly-in-use blocks to be moved to the 
free list).  This is necessary since only the OS has the information to 
determine which previously-written-to blocks are actually no longer in use.


See the parallel discussion here titled preview of new SSD based on 
SandForce controller for more about smart vs dumb SSD controllers.


From ZFS's standpoint, the optimal configuration would be for the SSD 
to inform ZFS as to its PAGE size, and ZFS would use this as the 
fundamental BLOCK size for that device (i.e. all writes are in integer 
multiples of the SSD page size).  Reads could be in smaller sections, 
though.  Which would be interesting:  ZFS would write in Page Size 
increments, and read in Block Size amounts.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Joerg Schilling
Erik Trimble erik.trim...@sun.com wrote:

  From ZFS's standpoint, the optimal configuration would be for the SSD 
 to inform ZFS as to its PAGE size, and ZFS would use this as the 
 fundamental BLOCK size for that device (i.e. all writes are in integer 

It seems that a command to retrieve this information does not yet exist,
or could you help me?

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Ragnar Sundblad

On 2 jan 2010, at 12.43, Joerg Schilling wrote:

 Ragnar Sundblad ra...@csc.kth.se wrote:
 
 I certainly agree, but there still isn't much they can do about
 the WORM-like properties of flash chips, were reading is pretty
 fast, writing is not to bad, but erasing is very slow and must be
 done in pretty large pages which also means that active data
 probably have to be copied around before an erase.
 
 WORM devices do not allow writing a block a second time.

(I know, that is why I wrote WORM-like.)

 There is
 a typical 5% reserve that would allow reassigning some blocks and making it 
 appear they have been rewritten, but this is not what ZFS does.

Well, zfs kind of does, but typical flash SSDs especially do it:
they have a redirection layer so that any block can go anywhere,
so they can use the flash media in a WORM-like style with
occasional bulk erases.

 You are 
 however right that there is a slight relation, as I did invent COW for a WORM 
 filesystem in 1989 ;-)

Yes, there indeed are several similarities.

/ragge



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Ragnar Sundblad

On 2 jan 2010, at 13.10, Erik Trimble wrote:

 Joerg Schilling wrote:
 Ragnar Sundblad ra...@csc.kth.se wrote:

 On 1 jan 2010, at 17.28, David Magda wrote:

 Don't really see how things are either hardware specific or dependent.

 The inner workings of an SSD flash drive are pretty hardware (or
 rather vendor) specific, and it may not be a good idea to move
 any knowledge about that to the file system layer.

 If ZFS wants to keep SSDs fast even after they have been in use for a
 while, then ZFS too would need to tell the SSD which sectors are no
 longer in use.

 Such a mode may cause a noticeable performance loss, as ZFS for this
 reason may need to traverse freed, outdated data trees, but it will
 help the SSD to erase the needed space in advance.

 Jörg
 The TRIM command is what is intended for an OS to notify the SSD as to which 
 blocks are deleted/erased, so the SSD's internal free list can be updated 
 (that is, it allows formerly-in-use blocks to be moved to the free list).  
 This is necessary since only the OS has the information to determine which 
 previously-written-to blocks are actually no longer in use.
 
 See the parallel discussion here titled preview of new SSD based on 
 SandForce controller for more about smart vs dumb SSD controllers.
 
 From ZFS's standpoint, the optimal configuration would be for the SSD to 
 inform ZFS as to its PAGE size, and ZFS would use this as the fundamental 
 BLOCK size for that device (i.e. all writes are in integer multiples of the 
 SSD page size).  Reads could be in smaller sections, though.  Which would be 
 interesting:  ZFS would write in Page Size increments, and read in Block Size 
 amounts.

Well, this could be useful if updates are larger than the block size, for 
example 512 K, as it is then possible to erase and rewrite without having to 
copy around other data from the page. If updates are smaller, zfs will have to 
reclaim erased space by itself, which if I am not mistaken it can not do today 
(but probably will in some future, I guess the BP Rewrite is what is needed).

I am still not entirely convinced that it would be better to let the file 
system take care of that instead of a flash controller, there could be quite a 
lot of reading and writing going on for space reclamation (depending on the 
work load, of course).

/ragge



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Erik Trimble

Joerg Schilling wrote:

Erik Trimble erik.trim...@sun.com wrote:

 From ZFS's standpoint, the optimal configuration would be for the SSD 
to inform ZFS as to its PAGE size, and ZFS would use this as the 
fundamental BLOCK size for that device (i.e. all writes are in integer 

It seems that a command to retrieve this information does not yet exist,
or could you help me?

Jörg

Sadly, no, there does not exist any way for the SSD to communicate that 
info back to the OS.


Probably, the smart thing to push for is inclusion of some new command 
in the ATA standard (in a manner like TRIM).  Likely something that 
would return both native Block and Page sizes upon query.


I'm still trying to see if there will be any support for TRIM-like 
things in SAS.




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Erik Trimble

Ragnar Sundblad wrote:

On 2 jan 2010, at 13.10, Erik Trimble wrote

Joerg Schilling wrote:

the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD's internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list).  This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use.


See the parallel discussion here titled preview of new SSD based on SandForce controller for more 
about smart vs dumb SSD controllers.

From ZFS's standpoint, the optimal configuration would be for the SSD to inform 
ZFS as to it's PAGE size, and ZFS would use this as the fundamental BLOCK size 
for that device (i.e. all writes are in integer multiples of the SSD page 
size).  Reads could be in smaller sections, though.  Which would be 
interesting:  ZFS would write in Page Size increments, and read in Block Size 
amounts.



Well, this could be useful if updates are larger than the block size, for 
example 512 K, as it is then possible to erase and rewrite without having to 
copy around other data from the page. If updates are smaller, zfs will have to 
reclaim erased space by itself, which if I am not mistaken it can not do today 
(but probably will in some future, I guess the BP Rewrite is what is needed).
  
Sure, it does that today. What do you think happens on a standard COW 
action?   Let's be clear here:  I'm talking about exactly the same thing 
that currently happens when you modify a ZFS block that spans multiple 
vdevs (say, in a RAIDZ).   The entire ZFS block is read from disk/L2ARC, 
the modifications made, then it is written back to storage, likely in 
another LBA. The original ZFS block location ON THE VDEV is now 
available for re-use (i.e. the vdev adds it to its Free Block List).   
This is one of the things that leads to ZFS's fragmentation issues 
(note, we're talking about block fragmentation on the vdev, not ZFS 
block fragmentation), and something that we're looking to BP rewrite to 
enable defragging to be implemented.


In fact, I would argue that the biggest advantage of removing any 
advanced intelligence from the SSD controller is with small 
modifications to existing files.  By using the L2ARC (and other 
features, like compression, encryption, and dedup), ZFS can composite 
the needed changes with an existing cached copy of the ZFS block(s) to 
be changed, then issue a full new block write to the SSD.  This 
eliminates the need for the SSD to do the dreaded Read-Modify-Write 
cycle, and instead do just a Write.  In this scenario, the ZFS block is 
likely larger than the SSD Page size, so more data will need to be 
written; however, given the highly parallel nature of SSDs, writing 
several SSD pages simultaneously is easy (and fast);  let's remember 
that a ZFS block is a maximum of only 8x the size of an SSD page, and 
writing 8 pages is only slightly more work than writing 1 page.  This 
larger write is all a single IOP, where a R-M-W essentially requires 3 
IOPS.  If you want the SSD controller to do the work, then it ALWAYS has 
to read the to-be-modified page from NAND, do the mod itself, then issue 
the write - and, remember, ZFS likely has already issued a full 
ZFS-block write (due to the COW nature of ZFS, there is no concept of 
just change this 1 bit and leave everything else on disk where it is), 
so you likely don't save on the number of pages that need to be written 
in any case.
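
To put rough numbers on that (the sizes below are just the ones this
thread has been using, not any particular drive's real geometry):

# rough comparison of the two schemes described above (illustrative only)
zfs_block = 128 * 1024        # whole block rewritten by COW anyway
ssd_page = 16 * 1024          # assumed page size, so block = 8 pages

pages = zfs_block // ssd_page
print("host-composited write: 1 op, %d pages programmed" % pages)
print("drive-side R-M-W:      3 ops (read + modify + write), "
      "still %d pages programmed" % pages)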




I am still not entirely convinced that it would be better to let the file 
system take care of that instead of a flash controller, there could be quite a 
lot of reading and writing going on for space reclamation (depending on the 
work load, of course).

/ragge
The point here is that regardless of the workload, there's an R-M-W cycle 
that has to happen, whether that occurs at the ZFS level or at the SSD 
level.  My argument is that the OS has a far better view of the whole 
data picture, and access to much higher performing caches (i.e. 
RAM/registers) than the SSD, so not only can the OS make far better 
decisions about the data and how (and how much of) it should be stored, 
but it's almost certainly able to do so far faster than any little 
SSD controller can do. 


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Ragnar Sundblad

On 2 jan 2010, at 22.49, Erik Trimble wrote:

 Ragnar Sundblad wrote:
 On 2 jan 2010, at 13.10, Erik Trimble wrote
 Joerg Schilling wrote:
the TRIM command is what is intended for an OS to notify the SSD as to 
 which blocks are deleted/erased, so the SSD's internal free list can be 
 updated (that is, it allows formerly-in-use blocks to be moved to the free 
 list).  This is necessary since only the OS has the information to 
 determine which previous-written-to blocks are actually no longer in-use.
 
 See the parallel discussion here titled preview of new SSD based on 
 SandForce controller for more about smart vs dumb SSD controllers.
 
 From ZFS's standpoint, the optimal configuration would be for the SSD to 
 inform ZFS as to its PAGE size, and ZFS would use this as the fundamental 
 BLOCK size for that device (i.e. all writes are in integer multiples of the 
 SSD page size).  Reads could be in smaller sections, though.  Which would 
 be interesting:  ZFS would write in Page Size increments, and read in Block 
 Size amounts.

 
 Well, this could be useful if updates are larger than the block size, for 
 example 512 K, as it is then possible to erase and rewrite without having to 
 copy around other data from the page. If updates are smaller, zfs will have 
 to reclaim erased space by itself, which if I am not mistaken it can not do 
 today (but probably will in some future, I guess the BP Rewrite is what is 
 needed).
  
 Sure, it does that today. What do you think happens on a standard COW action? 
   Let's be clear here:  I'm talking about exactly the same thing that 
 currently happens when you modify a ZFS block that spans multiple vdevs 
 (say, in a RAIDZ).   The entire ZFS block is read from disk/L2ARC, the 
 modifications made, then it is written back to storage, likely in another 
 LBA. The original ZFS block location ON THE VDEV is now available for re-use 
 (i.e. the vdev adds it to its Free Block List).   This is one of the things 
 that leads to ZFS's fragmentation issues (note, we're talking about block 
 fragmentation on the vdev, not ZFS block fragmentation), and something that 
 we're looking to BP rewrite to enable defragging to be implemented.

What I am talking about is to be able to reuse the free space
you get in the previously written data when you write modified
data to new places on the disk, or just remove a file for that
matter. To be able to reclaim that space with flash, you have
to erase large pages (for example 512 KB), but before you erase,
you will also have to save away all still valid data in that
page and rewrite that to a free page. What I am saying is that
I am not sure that this would be best done in the file system,
since it could be quite a bit of data to shuffle around, and
there could possibly be hardware specific optimizations that
could be done here that zfs wouldn't know about. A good flash
controller could probably do it much better. (And a bad one
worse, of course.)

And as far as I know, zfs cannot do that today - it cannot
move around already written data, not for defragmentation, not
for adding or removing disks to stripes/raidz:s, not for
deduping/duping and so on, and I have understood that
BP Rewrite could solve a lot of this.

Still, it could certainly be useful if zfs could try to use a
blocksize that matches the SSD erase page size - this could
avoid having to copy and compact data before erasing, which
could speed up writes in a typical flash SSD disk.

 In fact, I would argue that the biggest advantage of removing any advanced 
 intelligence from the SSD controller is with small modifications to existing 
 files.  By using the L2ARC (and other features, like compression, encryption, 
 and dedup), ZFS can composite the needed changes with an existing cached copy 
 of the ZFS block(s) to be changed, then issue a full new block write to the 
 SSD.  This eliminates the need for the SSD to do the dreaded 
 Read-Modify-Write cycle, and instead do just a Write.  In this scenario, the 
 ZFS block is likely larger than the SSD Page size, so more data will need to 
 be written; however, given the highly parallel nature of SSDs, writing 
 several SSD pages simultaneously is easy (and fast);  let's remember that a 
 ZFS block is a maximum of only 8x the size of a SSD page, and writing 8 pages 
 is only slightly more work than writing 1 page.  This larger write is all a 
 single IOP, where a R-M-W essentially requires 3 IOPS.  If you want the SSD 
  controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, 
do the mod itself, then issue the write - and, remember, ZFS likely has already 
issued a full ZFS-block write (due to the COW nature of ZFS, there is no 
concept of just change this 1 bit and leave everything else on disk where it 
is), so you likely don't save on the number of pages that need to be written 
in any case.

I don't think many SSDs do R-M-W, but rather just append blocks
to free pages (pretty 

Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread David Magda

On Jan 2, 2010, at 16:49, Erik Trimble wrote:

My argument is that the OS has a far better view of the whole data  
picture, and access to much higher performing caches (i.e. RAM/ 
registers) than the SSD, so not only can the OS make far better  
decisions about the data and how (and how much of) it should be  
stored, but it's almost certainly able to do so far faster  
than any little SSD controller can do.


Though one advantage of doing it within the disk is that you're not  
using up bus bandwidth. Probably not that big of a deal, but worth  
mentioning for completeness / fairness.



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Richard Elling

On Jan 2, 2010, at 1:47 AM, Andras Spitzer wrote:

Mike,

As far as I know only Hitachi is using such a huge chunk size :

So each vendor’s implementation of TP uses a different block size.  
HDS use 42MB on the USP, EMC use 768KB on DMX, IBM allow a variable  
size from 32KB to 256KB on the SVC and 3Par use blocks of just 16KB.  
The reasons for this are many and varied and for legacy hardware are  
a reflection of the underlying hardware architecture.


http://gestaltit.com/all/tech/storage/chris/thin-provisioning-holy-grail-utilisation/

Also, here Hu explains the reason why they believe 42M is the most  
efficient :


http://blogs.hds.com/hu/2009/07/chunk-size-matters.html

He has some good points in his arguments.


Yes, and they apply to ZFS dedup as well... :-)
 -- richard



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Erik Trimble

Ragnar Sundblad wrote:

On 2 jan 2010, at 22.49, Erik Trimble wrote:

  

Ragnar Sundblad wrote:


On 2 jan 2010, at 13.10, Erik Trimble wrote
  

Joerg Schilling wrote:
   the TRIM command is what is intended for an OS to notify the SSD as to which 
blocks are deleted/erased, so the SSD's internal free list can be updated (that 
is, it allows formerly-in-use blocks to be moved to the free list).  This is 
necessary since only the OS has the information to determine which 
previous-written-to blocks are actually no longer in-use.

See the parallel discussion here titled preview of new SSD based on SandForce controller for more 
about smart vs dumb SSD controllers.

From ZFS's standpoint, the optimal configuration would be for the SSD to inform 
ZFS as to it's PAGE size, and ZFS would use this as the fundamental BLOCK size 
for that device (i.e. all writes are in integer multiples of the SSD page 
size).  Reads could be in smaller sections, though.  Which would be 
interesting:  ZFS would write in Page Size increments, and read in Block Size 
amounts.
   


Well, this could be useful if updates are larger than the block size, for 
example 512 K, as it is then possible to erase and rewrite without having to 
copy around other data from the page. If updates are smaller, zfs will have to 
reclaim erased space by itself, which if I am not mistaken it can not do today 
(but probably will in some future, I guess the BP Rewrite is what is needed).
 
  

Sure, it does that today. What do you think happens on a standard COW action?   Let's be 
clear here:  I'm talking about exactly the same thing that currently happens when you 
modify a ZFS block that spans multiple vdevs (say, in a RAIDZ).   The entire 
ZFS block is read from disk/L2ARC, the modifications made, then it is written back to 
storage, likely in another LBA. The original ZFS block location ON THE VDEV is now 
available for re-use (i.e. the vdev adds it to it's Free Block List).   This is one of 
the things that leads to ZFS's fragmentation issues (note, we're talking about block 
fragmentation on the vdev, not ZFS block fragmentation), and something that we're looking 
to BP rewrite to enable defragging to be implemented.



What I am talking about is to be able to reuse the free space
you get in the previously written data when you write modified
data to new places on the disk, or just remove a file for that
matter. To be able to reclaim that space with flash, you have
to erase large pages (for example 512 KB), but before you erase,
you will also have to save away all still valid data in that
page and rewrite that to a free page. What I am saying is that
I am not sure that this would be best done in the file system,
since it could be quite a bit of data to shuffle around, and
there could possibly be hardware specific optimizations that
could be done here that zfs wouldn't know about. A good flash
controller could probably do it much better. (And a bad one
worse, of course.)
  
You certainly DO get to reuse the free space again.   Here's what 
happens nowadays in an SSD:


Let's say I have 4k blocks, grouped into a 128k page.  That is, the 
SSD's fundamental minimum unit size is 4k, but the minimum WRITE size is 
128k.  Thus, 32 blocks in a page.


So, I write a bit of data 100k in size. This occupies the first 25 
blocks in the page. The remaining 9 blocks are still on the SSD's 
Free List (i.e. list of free space).


Now, I want to change the last byte of the file, and add 10k more to the 
file.  Currently, a non-COW filesystem will simply send the 1 byte 
modification request and the 10k addition to the SSD (all as one unit, 
if you are lucky - if not, it comes as two ops: 1 byte modification 
followed by a 10k append).   The SSD now has to read all 25 blocks of 
the page back into its local cache on the controller, do the 
modification and append computing, then write out 28 blocks to NAND.  
In all likelihood, if there is any extra pre-erased (or never written 
to) space on the drive, this 28 block write will go to a whole new 
page.  The blocks in the original page will be moved over to the SSD 
Free List (and may or may not be actually erased, depending on the 
controller).
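
(The same example as arithmetic, so the 25/28 numbers are easy to
check:)

# the example above as arithmetic: 4k blocks, 128k pages, 32 blocks/page
BLOCK, PAGE = 4096, 128 * 1024

def blocks(nbytes):                    # round up to whole 4k blocks
    return (nbytes + BLOCK - 1) // BLOCK

before = blocks(100 * 1024)            # 25 blocks sitting in the old page
after = blocks(100 * 1024 + 10 * 1024) # 28 blocks after the 10k append
print("read %d blocks, write %d blocks to a fresh page" % (before, after))
print("%d of %d old blocks go back on the Free List" % (before, PAGE // BLOCK))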


For filesystems like ZFS, this is a whole lot of extra work being done 
that doesn't need to happen (and, chews up valuable IOPS and time).  
For, when ZFS does a write, it doesn't merely just twiddle the 
modified/appended bits - instead, it creates a whole new ZFS block to 
write.   In essence, ZFS has already done all the work that the SSD 
controller is planning on doing.  So why duplicate the effort?   SSDs 
should simply notify ZFS about their block and page sizes, which would 
then allow ZFS to better align its own variable block size to optimally 
coincide with the SSD's implementation.




And as far as I know, zfs can not do that today - it can not
move around already written data, not for defragmentation, not
for adding or removing 

Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Erik Trimble

David Magda wrote:

On Jan 2, 2010, at 16:49, Erik Trimble wrote:

My argument is that the OS has a far better view of the whole data 
picture, and access to much higher performing caches (i.e. 
RAM/registers) than the SSD, so not only can the OS make far better 
decisions about the data and how (and how much of) it should be 
stored, but it's almost certainly able to do so far faster than 
any little SSD controller can do.


Though one advantage of doing it with-in the disk is that you're not 
using up bus bandwidth. Probably not that big of a deal, but worth 
mentioning for completeness / fairness.
This is true.  But, also in fairness, that bandwidth is /already/ being 
used by the COW nature of ZFS.  Changing one bit in a file causes the /entire/ 
ZFS block containing that bit to be re-written.  So I'm not really using 
much (if any) more bus bandwidth by doing the SSD page layout in the OS 
rather than in the SSD controller. Remember that I'm highly likely not 
to have to read anything from the SSD to do the page rewrite, as the 
data I want is already in the L2ARC.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Ragnar Sundblad

On 3 jan 2010, at 04.19, Erik Trimble wrote:

 Ragnar Sundblad wrote:
 On 2 jan 2010, at 22.49, Erik Trimble wrote:
 
  
 Ragnar Sundblad wrote:

 On 2 jan 2010, at 13.10, Erik Trimble wrote
  
 Joerg Schilling wrote:
   the TRIM command is what is intended for an OS to notify the SSD as to 
 which blocks are deleted/erased, so the SSD's internal free list can be 
 updated (that is, it allows formerly-in-use blocks to be moved to the 
 free list).  This is necessary since only the OS has the information to 
 determine which previous-written-to blocks are actually no longer in-use.
 
 See the parallel discussion here titled preview of new SSD based on 
 SandForce controller for more about smart vs dumb SSD controllers.
 
 From ZFS's standpoint, the optimal configuration would be for the SSD to 
 inform ZFS as to its PAGE size, and ZFS would use this as the 
 fundamental BLOCK size for that device (i.e. all writes are in integer 
 multiples of the SSD page size).  Reads could be in smaller sections, 
 though.  Which would be interesting:  ZFS would write in Page Size 
 increments, and read in Block Size amounts.
   
 Well, this could be useful if updates are larger than the block size, for 
 example 512 K, as it is then possible to erase and rewrite without having 
 to copy around other data from the page. If updates are smaller, zfs will 
 have to reclaim erased space by itself, which if I am not mistaken it can 
 not do today (but probably will in some future, I guess the BP Rewrite is 
 what is needed).
   
 Sure, it does that today. What do you think happens on a standard COW 
 action?   Let's be clear here:  I'm talking about exactly the same thing 
 that currently happens when you modify a ZFS block that spans multiple 
 vdevs (say, in a RAIDZ).   The entire ZFS block is read from disk/L2ARC, 
 the modifications made, then it is written back to storage, likely in 
 another LBA. The original ZFS block location ON THE VDEV is now available 
 for re-use (i.e. the vdev adds it to its Free Block List).   This is one 
 of the things that leads to ZFS's fragmentation issues (note, we're talking 
 about block fragmentation on the vdev, not ZFS block fragmentation), and 
 something that we're looking to BP rewrite to enable defragging to be 
 implemented.

 
 What I am talking about is to be able to reuse the free space
 you get in the previously written data when you write modified
 data to new places on the disk, or just remove a file for that
 matter. To be able to reclaim that space with flash, you have
 to erase large pages (for example 512 KB), but before you erase,
 you will also have to save away all still valid data in that
 page and rewrite that to a free page. What I am saying is that
 I am not sure that this would be best done in the file system,
 since it could be quite a bit of data to shuffle around, and
 there could possibly be hardware specific optimizations that
 could be done here that zfs wouldn't know about. A good flash
 controller could probably do it much better. (And a bad one
 worse, of course.)
  
 You certainly DO get to reuse the free space again.   Here's what happens 
 nowdays in an SSD:
 
 Let's say I have 4k blocks, grouped into a 128k page.  That is, the SSD's 
 fundamental minimum unit size is 4k, but the minimum WRITE size is 128k.  
 Thus, 32 blocks in a page.

Do you know of SSD disks that have a minimum write size of
128 KB? I don't understand why it would be designed that way.

A typical flash chip has pretty small write block sizes, like
2 KB or so, but they can only erase in pages of 128 KB or so.
(And then you are running a few of those in parallel to get some
speed, so these numbers often multiply with the number of
parallel chips, like 4 or 8 or so.)
Typically, you have to write the 2 KB blocks consecutively
in a page. Pretty much all set up for an append-style system.
:-)

In addition, flash SSDs typically have some DRAM write buffer
that buffers up writes (like a txg, if you will), so small
writes should not be a problem - just collect a few and append!
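
As a rough sketch of that append-style behaviour, here is a toy flash
translation layer in Python: small host writes land in a DRAM buffer and
are programmed consecutively into the open erase block. The 2 KB/128 KB
geometry follows the numbers above; the class and field names are invented
for illustration and do not describe any particular controller.

PROGRAM_UNIT = 2 * 1024            # smallest programmable chunk of the chip
ERASE_BLOCK = 128 * 1024           # smallest erasable chunk (the "page" above)

class ToyFTL:
    def __init__(self):
        self.buffer = bytearray()  # DRAM write buffer, a txg-like staging area
        self.log = []              # programmed units, strictly append-only
        self.mapping = {}          # logical block address -> position in the log

    def write(self, lba, data):
        # the LBA is simply remapped to wherever the data will land in the log
        self.mapping[lba] = len(self.log) + len(self.buffer) // PROGRAM_UNIT
        self.buffer += data.ljust(PROGRAM_UNIT, b'\0')[:PROGRAM_UNIT]
        if len(self.buffer) >= ERASE_BLOCK:
            self.flush()

    def flush(self):
        # program the buffered units consecutively into the open erase block
        for off in range(0, len(self.buffer), PROGRAM_UNIT):
            self.log.append(bytes(self.buffer[off:off + PROGRAM_UNIT]))
        self.buffer.clear()

ftl = ToyFTL()
for lba in range(70):              # 70 small writes, no read-modify-write anywhere
    ftl.write(lba, b'x' * 512)
print(len(ftl.log), len(ftl.buffer) // PROGRAM_UNIT)   # -> 64 6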

 So, I write a bit of data 100k in size. This occupies the first 25 blocks in 
 the one page. The remaining 7 blocks are still on the SSD's Free List (i.e. 
 list of free space).
 
 Now, I want to change the last byte of the file, and add 10k more to the 
 file.  Currently, a non-COW filesystem will simply send the 1 byte 
 modification request and the 10k addition to the SSD (all as one unit, if you 
 are lucky - if not, it comes as two ops: 1 byte modification followed by a 
 10k append).   The SSD now has to read all 25 blocks of the page back into 
 its local cache on the controller, do the modification and append computing, 
 then write out 28 blocks to NAND.  In all likelihood, if there is any extra 
 pre-erased (or never written to) space on the drive, this 28 block write will 
 go to a whole new page.  The blocks in the original page will be moved over 
 to the SSD Free List 

Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Ragnar Sundblad

On 3 jan 2010, at 06.07, Ragnar Sundblad wrote:

 (I don't think they typically merge pages, I believe they rather
 just pick pages with some freed blocks, copy the active blocks
 to the end of the disk, and erase the page.)

(And of course you implement wear leveling with the same
mechanism - when the wear differs too much, pick a page
with low wear and copy it to a more worn page.)
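
As a rough illustration of that static wear-levelling idea (all numbers and
structures here are made up):

def maybe_level(data_blocks, free_blocks, threshold=100):
    # data_blocks: {block_id: erase_count} for blocks currently holding data
    # free_blocks: {block_id: erase_count} for erased, empty blocks
    coldest = min(data_blocks, key=data_blocks.get)      # low wear, rarely rewritten
    most_worn = max(free_blocks, key=free_blocks.get)
    if free_blocks[most_worn] - data_blocks[coldest] > threshold:
        # copy the cold data onto the worn block (the copy itself isn't modelled),
        # then erase the cold block and hand it to the free pool
        free_blocks[coldest] = data_blocks.pop(coldest) + 1
        data_blocks[most_worn] = free_blocks.pop(most_worn)

data_blocks = {'a': 3, 'b': 950}   # 'a' holds cold data on a barely worn block
free_blocks = {'c': 990, 'd': 985}
maybe_level(data_blocks, free_blocks)
print(data_blocks, free_blocks)    # cold data now lives on 'c'; 'a' is free again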

I actually happened to stumble on an application note from Numonyx
that describes the append-style SSD disk and space reclamation
method I described, right here:
http://www.numonyx.com/Documents/Application%20Notes/AN1821.pdf
(No - I had not read this before writing my previous mail! :-)

To me, it seems also in this paper that it is common knowledge
that this is how you should implement a flash SSD disk - if you
don't do anything fancier of course.

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Erik Trimble

Ragnar Sundblad wrote:

On 3 jan 2010, at 04.19, Erik Trimble wrote:
  

Let's say I have 4k blocks, grouped into a 128k page.  That is, the SSD's 
fundamental minimum unit size is 4k, but the minimum WRITE size is 128k.  Thus, 
32 blocks in a page.


Do you know of SSD disks that have a minimum write size of
128 KB? I don't understand why it would be designed that way.

A typical flash chip has pretty small write block sizes, like
2 KB or so, but they can only erase in pages of 128 KB or so.
(And then you are running a few of those in parallel to get some
speed, so these numbers often multiply with the number of
parallel chips, like 4 or 8 or so.)
Typically, you have to write the 2 KB blocks consecutively
in a page. Pretty much all set up for an append-style system.
:-)

In addition, flash SSDs typically have some DRAM write buffer
that buffers up writes (like a txg, if you will), so small
writes should not be a problem - just collect a few and append!
  
In MLC-style SSDs, you typically have a block size of 2k or 4k. However, 
you have a Page size of several multiples of that, 128k being common, 
but by no means ubiquitous.


I think you're confusing erasing with writing.

When I say minimum write size, I mean that for an MLC, no matter how 
small you make a change, the minimum amount of data actually being 
written to the SSD is a full page (128k in my example).   There is no 
append down at this level. If I have a page of 128k, with data in 5 of 
the 4k blocks, and I then want to add another 2k of data to this, I have 
to READ all 5 4k blocks into the controller's DRAM, add the 2k of data 
to that, then write out the full amount to a new page (if available), or 
wait for an older page to be erased before writing to it.  Thus, in this 
case, in order to do an actual 2k write, the SSD must first read 20k of 
data, do some compositing, then write 22k to a fresh page.  

Thus, to change any data inside a single page, the entire contents of 
that page have to be read, the page modified, then the entire page 
written back out.





So, I write a bit of data 100k in size. This occupies the first 25 blocks in 
the one page. The remaining 7 blocks are still on the SSD's Free List (i.e. 
list of free space).

Now, I want to change the last byte of the file, and add 10k more to the file.  
Currently, a non-COW filesystem will simply send the 1 byte modification 
request and the 10k addition to the SSD (all as one unit, if you are lucky - if 
not, it comes as two ops: 1 byte modification followed by a 10k append).   The 
SSD now has to read all 25 blocks of the page back into its local cache on the 
controller, do the modification and append computing, then write out 28 blocks 
to NAND.  In all likelihood, if there is any extra pre-erased (or never written 
to) space on the drive, this 28 block write will go to a whole new page.  The 
blocks in the original page will be moved over to the SSD Free List (and may or 
may not be actually erased, depending on the controller).



Do you know for sure that you have SSD flash disks that
work this way? It seems incredibly stupid. It would also
use up the available erase cycles much faster than necessary.
What write speed do you get?
  
What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work 
differently, but still have problems with what I'll call excess-writing.




And as far as I know, zfs can not do that today - it can not
move around already written data, not for defragmentation, not
for adding or removing disks to stripes/raidz:s, not for
 deduping/duping and so on, and I have understood that
BP Rewrite could solve a lot of this. 
  
ZFS's propensity to fragmentation doesn't mean you lose space.  Rather, it 
means that COW often results in frequently-modified files being distributed 
over the entire media, rather than being contiguous. So, over time, the 
actual media has very little (if any) contiguous free space, which is what 
the fragmentation problem is.  BP rewrite will indeed allow us to create a 
de-fragger.  Areas which used to hold a ZFS block (now vacated by a COW to 
somewhere else) are simply added back to the device's Free List.

Now, in SSD's case, this isn't a worry.  Due to the completely even 
performance characteristics of NAND, it doesn't make any difference if the 
physical layout of a file happens to be sections (e.g. ZFS blocks) scattered 
all over the SSD.



Yes, there is something to worry about, as you can only
erase flash in large pages - you can not erase them only where
the free data blocks in the Free List are.
  
I'm not sure that SSDs actually _have_ to erase - they just overwrite 
anything there with new data. But this is implementation dependent, so I 
can't say how /all/ MLC SSDs behave.



(I don't think they typically merge pages, I believe they rather
just pick pages with some freed blocks, copies the active blocks
to the end of the disk, and erases the page.)

Well, the algorithms are often trade 

Re: [zfs-discuss] Thin device support in ZFS?

2010-01-02 Thread Erik Trimble

Erik Trimble wrote:

Ragnar Sundblad wrote:

Yes, there is something to worry about, as you can only
erase flash in large pages - you can not erase them only where
the free data blocks in the Free List are.   
I'm not sure that SSDs actually _have_ to erase - they just overwrite 
anything there with new data. But this is implementation dependent, so 
I can say how /all/ MLC SSDs behave.


I meant to say that I DON'T know how all MLC drives deal with erasure.


(I don't think they typically merge pages, I believe they rather
just pick pages with some freed blocks, copies the active blocks
to the end of the disk, and erases the page.)

That is correct, as your pointer to the Numonyx doc explains.

I'm pretty sure compacting doesn't occur in ANY SSDs without any OS 
intervention (that is, the SSD itself doesn't do it), and I'd be 
surprised to see an OS try to implement some sort of intra-page 
compaction - the benefit doesn't seem to be there; it's better just 
to optimize writes than try to compact existing pages. As far as 
reclaiming unused space, the TRIM command is there to allow the SSD to 
mark a page Free for reuse, and an SSD isn't going to be erasing a 
page unless it's right before something is to be written to that page.
My thinking of what compacting meant doesn't match up with what I'm 
seeing as the general usage in the SSD technical papers, so in this respect, 
I'm wrong:  compacting does occur, but only when there are no fully 
erased (or unused) pages available.  Thus, compacting is done in the 
context of a write operation.
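
A minimal sketch of that reclamation-in-the-write-path idea, in Python (the
structures are invented; a real controller would stage the surviving blocks
in its own RAM while it programs them into a page from its reserve):

def reclaim_one(pages, erased):
    # pages:  {page_id: set of live logical blocks}
    # erased: list of empty, pre-erased pages (the controller's reserve)
    victim = min(pages, key=lambda p: len(pages[p]))   # fewest live blocks to copy
    survivors = pages.pop(victim)
    pages[erased.pop()] = set(survivors)               # copy the live blocks forward
    erased.append(victim)                              # victim is erased, back in reserve
    return victim, len(survivors)

pages = {'p0': {1, 2, 3}, 'p1': {9}, 'p2': set(range(10, 40))}
erased = ['p3']
print(reclaim_one(pages, erased))                      # -> ('p1', 1): cheapest victim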


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Eric D. Mudama

On Thu, Dec 31 at 16:53, David Magda wrote:

Just as the first 4096-byte block disks are silently emulating 4096 -
to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. 
Perhaps in the future there will be a setting to say no really, I'm 
talking about the /actual/ LBA 123456.


What, exactly, is the /actual/ LBA 123456 on a modern SSD?

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Eric D. Mudama

On Thu, Dec 31 at 10:18, Bob Friesenhahn wrote:
There are of course SSDs with hardly any (or no) reserve space, but 
while we might be willing to sacrifice an image or two to SSD block 
failure in our digital camera, that is just not acceptable for 
serious computer use.


Some people are doing serious computing on devices with 6-7% reserve.
Devices with less enforced reserve will be significantly cheaper per
exposed gigabyte, independent of all other factors, and always give
the user the flexibility to increase their effective reserve by
destroking the working area a little or a lot.

If someone just needs blazing fast read access and isn't expecting to
put more than a few cycles/day on their devices, small reserve MLC
drives may be very cost effective and just as fast as their 20-30%
reserve SLC counterparts.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Ragnar Sundblad

On 31 dec 2009, at 22.53, David Magda wrote:

 On Dec 31, 2009, at 13:44, Joerg Schilling wrote:
 
 ZFS is COW, but does the SSD know which block is in use and which is not?
 
 If the SSD did know whether a block is in use, it could erase unused blocks
 in advance. But what is an unused block on a filesystem that supports
 snapshots?

Snapshots make no difference - when you delete the last
dataset/snapshot that references a file you also delete the
data. Snapshots is a way to keep more files around, it is not
a really way to keep the disk entirely full or anything like
that. There is obviously no problem to distinguish between
used and unused blocks, and zfs (or btrfs or similar) make no
difference.

 Personally, I think that at some point in the future there will need to be a 
 command telling SSDs that the file system will take care of handling blocks, 
 as new FS designs will be COW. ZFS is the first mainstream one to do it, 
 but Btrfs is there as well, and it looks like Apple will be making its own FS.

That could be an idea, but there still will be holes after
deleted files that need to be reclaimed. Do you mean it would
be a major win to have the file system take care of the
space reclaiming instead of the drive?

 Just as the first 4096-byte block disks are silently emulating 4096 -to-512 
 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the 
 future there will be a setting to say no really, I'm talking about the 
 /actual/ LBA 123456.

A typical flash page size is 512 KB. You probably don't want to
use all the physical pages, since those could be worn out or bad,
so those need to be remapped (or otherwise avoided) at some level
anyway. These days, typically disks do the remapping without the
host computer knowing (both SSDs and rotating rust).

I see the possible win that you could always use all the working
blocks on the disk, and when blocks go bad your disk will shrink.
I am not sure that is really what people expect, though. Apart from
that, I am not sure what the gain would be.
Could you elaborate on why this would be called for?

/ragge 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread David Magda


On Jan 1, 2010, at 03:30, Eric D. Mudama wrote:


On Thu, Dec 31 at 16:53, David Magda wrote:

Just as the first 4096-byte block disks are silently emulating 4096 -
to-512 blocks, SSDs are currently re-mapping LBAs behind the  
scenes. Perhaps in the future there will be a setting to say no  
really, I'm talking about the /actual/ LBA 123456.


What, exactly, is the /actual/ LBA 123456 on a modern SSD?


It doesn't exist currently because of the behind-the-scenes re-mapping  
that's being done by the SSD's firmware.


While arbitrary to some extent, an actual LBA would presumably be the  
number of a particular cell in the SSD.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread David Magda

On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote:


I see the possible win that you could always use all the working
blocks on the disk, and when blocks goes bad your disk will shrink.
I am not sure that is really what people expect, though. Apart from
that, I am not sure what the gain would be.

Could you elaborate on why this would be called for?


Currently you have SSDs that look like disks, but under certain  
circumstances the OS / FS know that it isn't rotating rust--in which  
case the TRIM command is then used by the OS to help the SSD's  
allocation algorithm(s).


If the file system is COW, and knows about SSDs via TRIM, why not just  
skip the middle-man and tell the SSD I'll take care of managing  
blocks.


In the ZFS case, I think it's a logical extension of how RAID is  
handled: ZFS' system is much more helpful in most cases than  
hardware- / firmware-based RAID, so it's generally best just to expose  
the underlying hardware to ZFS. In the same way ZFS already does COW,  
so why bother with the SSD's firmware doing it when giving extra  
knowledge to ZFS could be more useful?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Ragnar Sundblad

On 1 jan 2010, at 14.14, David Magda wrote:

 On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote:
 
 I see the possible win that you could always use all the working
 blocks on the disk, and when blocks goes bad your disk will shrink.
 I am not sure that is really what people expect, though. Apart from
 that, I am not sure what the gain would be.
 
 Could you elaborate on why this would be called for?
 
 Currently you have SSDs that look like disks, but under certain circumstances 
 the OS / FS know that it isn't rotating rust--in which case the TRIM command 
 is then used by the OS to help the SSD's allocation algorithm(s).

(Note that TRIM and equivalents are not only useful on SSDs,
but on other storage too, such as when using sparse/thin
storage.)

 If the file system is COW, and knows about SSDs via TRIM, why not just skip 
 the middle-man and tell the SSD I'll take care of managing blocks.
 
 In the ZFS case, I think it's a logical extension of how RAID is handled: 
 ZFS' system is much more helpful in most cases than hardware- / firmware-based 
 RAID, so it's generally best just to expose the underlying hardware to ZFS. 
 In the same way ZFS already does COW, so why bother with the SSD's firmware 
 doing it when giving extra knowledge to ZFS could be more useful?

But that would only move the hardware specific and dependent flash
chip handling code into the file system code, wouldn't it? What
is won with that? As long as the flash chips have larger pages than
the file system blocks, someone will have to shuffle around blocks
to reclaim space, why not let the one thing that knows the hardware
and also is very close to the hardware do it?

And if this is good for SSDs, why isn't it as good for rotating rust?

/ragge s

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread David Magda

On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote:


But that would only move the hardware specific and dependent flash
chip handling code into the file system code, wouldn't it? What
is won with that? As long as the flash chips have larger pages than
the file system blocks, someone will have to shuffle around blocks
to reclaim space, why not let the one thing that knows the hardware
and also is very close to the hardware do it?

And if this is good for SSDs, why isn't it as good for rotating rust?


Don't really see how things are either hardware specific or dependent.  
COW is COW. Am I missing something? It's done by code somewhere in the  
stack, if the FS knows about it, it can lay things out in sequential  
writes. If we're talking about 512 KB blocks, ZFS in particular would  
create four 128 KB txgs--and 128 KB is simply the currently #define'd  
size, which can be changed in the future.


One thing you gain is perhaps not requiring to have as much of a  
reserve. At most you have some hidden bad block re-mapping, similar to  
rotating rust nowadays. If you're shuffling blocks around, you're  
doing a read-modify-write, which if done in the file system, you could  
use as a mechanism to defrag on-the-fly or to group many small files  
together.



Not quite sure what you mean by your last question.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Richard Elling

On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote:

Flash SSDs actually always remap new writes into a
only-append-to-new-pages style, pretty much as ZFS does itself.
So for a SSD there is no big difference between ZFS and
filesystems as UFS, NTFS, HFS+ et al, on the flash level they
all work the same.



The reason is that there is no way for it to rewrite single
disk blocks, it can only fill up already erased pages of
512K (for example). When the old blocks get mixed with unused
blocks (because of block rewrites, TRIM or Write Many/UNMAP),
it needs to compact the data by copying all active blocks from
those pages into previously erased pages, and there write the
active data compacted/contiguous. (When this happens, things tend
to get really slow.)


However, the quantity of small, overwritten pages is vastly different.
I am not convinced that a workload that generates few overwrites
will be penalized as much as a workload that generates a large
number of overwrites.

I think most folks here will welcome good, empirical studies,
but thus far the only one I've found is from STEC and their
disks behave very well after they've been filled and subjected
to a rewrite workload. You get what you pay for.  Additional
pointers are always appreciated :-)
http://www.stec-inc.com/ssd/videos/ssdvideo1.php

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Bob Friesenhahn

On Fri, 1 Jan 2010, David Magda wrote:


It doesn't exist currently because of the behind-the-scenes re-mapping that's 
being done by the SSD's firmware.


While arbitrary to some extent, an actual LBA would presumably be the number 
of a particular cell in the SSD.


There seems to be some severe misunderstanding of what a SSD is. 
This severe misunderstanding leads one to assume that a SSD has a 
native blocksize.  SSDs (as used in computer drives) are comprised 
of many tens of FLASH memory chips which can be laid out and mapped 
in whatever fashion the designers choose to do.  They could be mapped 
sequentially, in parallel, a combination of the two, or perhaps even 
change behavior depending on use.  Individual FLASH devices usually 
have a much smaller page size than 4K.  A 4K write would likely be 
striped across several/many FLASH devices.


The construction of any given SSD is typically a closely-held trade 
secret and the vendor will not reveal how it is designed.  You would 
have to chip away the epoxy yourself and reverse-engineer in order to 
gain some understanding of how a given SSD operates and even then it 
would be mostly guesswork.


It would be wrong for anyone here, including someone who has 
participated in the design of an SSD, to claim that they know how a 
SSD will behave unless they have access to the design of that 
particular SSD.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Al Hopper
On Fri, Jan 1, 2010 at 11:17 AM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 1 Jan 2010, David Magda wrote:

 It doesn't exist currently because of the behind-the-scenes re-mapping
 that's being done by the SSD's firmware.

 While arbitrary to some extent, and actual LBA would presumably the
 number of a particular cell in the SSD.

 There seems to be some severe misunderstanding of what a SSD is. This severe
 misunderstanding leads one to assume that a SSD has a native blocksize.
  SSDs (as used in computer drives) are comprised of many tens of FLASH
  memory chips which can be laid out and mapped in whatever fashion the
 designers choose to do.  They could be mapped sequentially, in parallel, a
 combination of the two, or perhaps even change behavior depending on use.
  Individual FLASH devices usually have a much smaller page size than 4K.  A
 4K write would likely be striped across several/many FLASH devices.

 The construction of any given SSD is typically a closely-held trade secret
 and the vendor will not reveal how it is designed.  You would have to chip
 away the epoxy yourself and reverse-engineer in order to gain some
 understanding of how a given SSD operates and even then it would be mostly
 guesswork.

 It would be wrong for anyone here, including someone who has participated in
 the design of an SSD, to claim that they know how a SSD will behave unless
 they have access to the design of that particular SSD.


The main issue is that most flash devices support 128k byte pages, and
the smallest chunk (for want of a better word) of flash memory that
can be written is a page - or 128kb.  So if you have a write to an SSD
that only changes 1 byte in one 512 byte disk sector, the SSD
controller has to either read/re-write the affected page or figure out
how to update the flash memory with the minimum effect on flash wear.

If one didn't have to worry about flash wear levelling, one could
read/update/write the affected page all day long.

And, to date, flash writes are much slower than flash reads - which is
another basic property of the current generation of flash devices.

For anyone who is interested in getting more details of the challenges
with flash memory, when used to build solid state drives, reading the
tech data sheets on the flash memory devices will give you a feel for
the basic issues that must be solved.

Bob's point is well made.  The specifics of a given SSD implementation
will make the performance characteristics of the resulting SSD very
difficult to predict or even describe - especially as the device
hardware and firmware continue to evolve.   And some SSDs change the
algorithms they implement on-the-fly - depending on the
characteristics of the current workload and of the (inbound) data
being written.

There are some links to well written articles in the URL I posted
earlier this morning:
http://www.anandtech.com/storage/showdoc.aspx?i=3702

Regards,

-- 
Al Hopper  Logical Approach Inc,Plano,TX a...@logical-approach.com
   Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Andras Spitzer
Let me sum up my thoughts in this topic.

To Richard [relling] : I agree with you this topic is even more confusing if we 
are not careful enough to specify exactly what we are talking about. Thin 
provision can be done on multiple layers, and though you said you like it to be 
closer to the app than closer to the dumb disks (if you were referring to SAN), 
my opinion is that each and every scenario has its own pros/cons. I learned 
a long time ago not to declare a technology good/bad; there are technologies 
which are used properly (usually declared as good tech) and others which are 
not (usually declared as bad).

--

Let me clarify my case, and why I mentioned thin devices on SAN specifically. 
Many people replied with the thin device support of ZFS (which is called sparse 
volumes if I'm correct), but what I was talking about is something else. It's 
thin device awareness on the SAN.

In this case you configure your LUN in the SAN as thin device, a virtual LUN(s) 
which is backed by a pool of physical disks in the SAN. From the OS it's 
transparent, so it is from the Volume Manager/Filesystem point of view.

That is the basic definition of my scenario with thin devices on SAN. High-end 
SAN frames like HDS USP-V (feature called Hitachi Dynamic Provisioning), EMC 
Symmetrix V-Max (feature called Virtual provisioning) supports this (and I'm 
sure many others as well). As you discover the LUN in the OS, you start to 
use it - put it under Volume Manager, create a filesystem, copy files - but the 
SAN only allocates physical blocks (more precisely groups of blocks called 
extents) as you write them, which means you'll use only as much (or a bit more, 
rounded to the next extent) on the physical disk as you use in reality.
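
A tiny Python sketch of that allocate-on-first-write behaviour (the extent
size and all names are made up for illustration; real arrays differ):

EXTENT = 768 * 1024                        # illustrative extent size

class ThinLUN:
    def __init__(self, virtual_size, pool):
        self.virtual_size = virtual_size   # what the host OS sees
        self.pool = pool                   # shared list of physical extents
        self.map = {}                      # extent index -> physical extent id

    def write(self, offset, length):
        first = offset // EXTENT
        last = (offset + length - 1) // EXTENT
        for idx in range(first, last + 1):
            if idx not in self.map:        # first touch: allocate from the pool
                self.map[idx] = self.pool.pop()

    def allocated_bytes(self):
        return len(self.map) * EXTENT

pool = list(range(1000))                   # physical extents shared by many LUNs
lun = ThinLUN(virtual_size=200 * 2**30, pool=pool)
lun.write(0, 4096)                         # a small write near the start
lun.write(10 * 2**20, 2**20)               # 1 MB written at offset 10 MB
print(lun.allocated_bytes(), lun.virtual_size)   # tiny physical use vs. 200 GB virtual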

From this standpoint we can define two terms, thin-friendly and thin-hostile 
environments. Thin-friendly would be any environment where OS/VM/FS doesn't 
write to blocks it doesn't really use (for example during initialization it 
doesn't fill up the LUN with a pattern or 0s).

That's why Veritas' SmartMove is a nice feature, as when you move from fat to 
thin devices (from the OS both LUNs look exactly the same), it will copy the 
blocks only which are used by the VxFS files. 

That is still the basics of having thin devices on SAN, and hope to have a 
thin-friendly environment. The next level of this is the management of the thin 
devices and the physical pool where thin devices allocates their extents from.

Even if you get migrated to thin device LUNs, your thin devices will become fat 
again: if you fill up your filesystem even once, the thin device on the SAN 
will remain fat, as no space reclamation happens by default. The reason is 
pretty simple: the SAN storage has no knowledge of the filesystem structure, so 
it can't decide whether a block is really no longer in use and should be 
released back to the pool. Then came Veritas with the brilliant idea of building 
a bridge between the FS and the SAN frame (this became the Thin Reclamation 
API), so they can communicate which blocks are indeed no longer in use.
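
Roughly, such a reclamation pass amounts to the filesystem walking its own
free-space map and handing the array a list of byte ranges that contain no
live data, so the array can return fully covered extents to the shared pool.
A toy Python version (all names invented; this is not the Veritas API):

EXTENT = 768 * 1024

def reclaim(fs_free_ranges, lun_map, pool):
    # fs_free_ranges: [(offset, length)] ranges the filesystem knows are unused
    # lun_map: {extent_index: physical_extent_id} for the thin LUN
    for offset, length in fs_free_ranges:
        first = -(-offset // EXTENT)              # first extent fully inside the range
        last = (offset + length) // EXTENT        # one past the last fully covered extent
        for idx in range(first, last):
            phys = lun_map.pop(idx, None)         # unmap it if it was ever allocated
            if phys is not None:
                pool.append(phys)                 # the extent returns to the shared pool

lun_map = {0: 17, 1: 18, 2: 19, 40: 20}
pool = []
reclaim([(0, 3 * EXTENT)], lun_map, pool)         # fs reports the first 3 extents free
print(lun_map, pool)                              # -> {40: 20} [17, 18, 19]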

I really would like you to read this Quick Note from Veritas about this 
feature; it will explain the concept far better than I did: 
http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf

Btw, in this concept VxVM can even detect (via ASL) whether a LUN is thin 
device/thin device reclamation capable or not.

Honestly I have mixed feelings about ZFS. I feel that this is obviously the 
future's VM/Filesystem, but then I realize at the same time that the roles of the 
individual parts in the big picture are getting mixed up. Am I the only one 
with the impression that ZFS sooner or later will evolve to a SAN OS, and the 
zfs, zpool commands will only become some lightweight interfaces to control the 
SAN frame? :-) (like Solution Enabler for EMC)

If you ask me the pool concept always works more efficiently if #1 you have more 
capacity in the pool and #2 you have more systems to share the pool; that's why 
I see the thin device pool more rational in a SAN frame.

Anyway, I'm sorry if you were already aware what I explained above, I also hope 
I didn't offend anyone with my views,

Regards,
sendai
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Ragnar Sundblad

On 31 dec 2009, at 06.01, Richard Elling wrote:

 
 On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote:
 
 
 On 30 dec 2009, at 22.45, Richard Elling wrote:
 
 On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote:
 
 Richard,
 
 That's an interesting question, if it's worth it or not. I guess the 
 question is always who are the targets for ZFS (I assume everyone, though 
  in reality priorities have to be set as the developer resources are 
 limited). For a home office, no doubt thin provisioning is not much of a 
 use, for an enterprise company the numbers might really make a difference 
 if we look at the space used vs space allocated.
 
 There are some studies that thin provisioning can reduce physical space 
 used up to 30%, which is huge. (Even though I understands studies are not 
 real life and thin provisioning is not viable in every environment)
 
 Btw, I would like to discuss scenarios where though we have 
 over-subscribed pool in the SAN (meaning the overall allocated space to 
 the systems is more than the physical space in the pool) with proper 
 monitoring and proactive physical drive adds we won't let any 
 systems/applications attached to the SAN realize that we have thin devices.
 
 Actually that's why I believe configuring thin devices without 
 periodically reclaiming space is just a timebomb, though if you have the 
 option to periodically reclaim space, you can maintain the pool in the SAN 
 in a really efficient way. That's why I found Veritas' Thin Reclamation 
 API as a milestone in the thin device field.
 
 Anyway, only future can tell if thin provisioning will or won't be a major 
 feature in the storage world, though as I saw Veritas already added this 
  feature I was wondering if ZFS has it at least on its roadmap.
 
 Thin provisioning is absolutely, positively a wonderful, good thing!  The 
 question
 is, how does the industry handle the multitude of thin provisioning models, 
 each
 layered on top of another? For example, here at the ranch I use VMWare and 
 Xen,
 which thinly provision virtual disks. I do this over iSCSI to a server 
 running ZFS
 which thinly provisions the iSCSI target.  If I had a virtual RAID array, I 
 would
 probably use that, too. Personally, I think being thinner closer to the 
 application
 wins over being thinner closer to dumb storage devices (disk drives).
 
 I don't get it - why do we need anything more magic (or complicated)
 than support for TRIM from the filesystems and the storage systems?
 
 TRIM is just one part of the problem (or solution, depending on your point
 of view). The TRIM command is part of the T10 protocols that allows a
 host to tell a block device that data in a set of blocks is no longer of
 any value, and the block device can destroy the data without adverse
 consequence.
 
 In a world with copy-on-write and without snapshots, it is obvious that
 there will be a lot of blocks running around that are no longer in use.
 Snapshots (and their clones) changes that use case. So in a world of
 snapshots, there will be fewer blocks which are not used. Remember,
 the TRIM command is very important to OSes like Windows or OSX
 which do not have file systems that are copy-on-write or have decent
 snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use
 snapshots.

I don't believe that there is such a big difference between those
cases. Sure, snapshots may keep more data on disk, but only as much
as the user chooses to keep. There have been other ways to keep old
data on disk before (RCS, Solaris patch backout blurbs, logs, caches,
what have you), so there is not really a brand new world there.
(BTW, once upon a time, real operating systems had (optional) file
versioning built into the operating system or file system itself.)

If there was a mechanism that always tended to keep all of the
disk full, that would be another case. Snapshots may do that
with the autosnapshot and warn-and-clean-when-getting-full
features of OpenSolaris, but especially servers will probably
not be managed that way, they will probably have a much more
controlled snapshot policy. (Especially if you want to save every
possible bit of disk space, as those guys with the big fantastic
and ridiculously expensive storage systems always want to do -
maybe that will change in the future though.)

 That said, adding TRIM support is not hard in ZFS. But it depends on
 lower level drivers to pass the TRIM commands down the stack. These
 ducks are lining up now.

Good.

 I don't see why TRIM would be hard to implement for ZFS either,
 except that you may want to keep data from a few txgs back just
 for safety, which would probably call for some two-stage freeing
 of data blocks (those free blocks that are to be TRIMmed, and
 those that already are).
 
 Once a block is freed in ZFS, it no longer needs it. So the problem
 of TRIM in ZFS is not related to the recent txg commit history.

It may be that you want to save a few txgs back, so if you get
a failure where 

Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Ragnar Sundblad

On 31 dec 2009, at 00.31, Bob Friesenhahn wrote:

 On Wed, 30 Dec 2009, Mike Gerdts wrote:
 
 Should the block size be a tunable so that page size of SSD (typically
 4K, right?) and upcoming hard disks that sport a sector size  512
 bytes?
 
 Enterprise SSDs are still in their infancy.  The actual page size of an SSD 
 could be almost anything.  Due to lack of seek time concerns and the high 
 cost of erasing a page, a SSD could be designed with a level of indirection 
 so that multiple logical writes to disjoint offsets could be combined into a 
 single SSD physical page.  Likewise a large logical block could be subdivided 
  into multiple SSD pages, which are allocated on demand.  Logic is cheap and 
 SSDs are full of logic so it seems reasonable that future SSDs will do this, 
 if not already, since similar logic enables wear-leveling.

I believe that almost all flash devices are already are doing this,
and only the first generation SD cards or something like that are
not doing it and leaving it to the host.

But I could be wrong of course.

/ragge s

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Bob Friesenhahn

On Thu, 31 Dec 2009, Ragnar Sundblad wrote:


Also, currently, when the SSDs for some very strange reason are
constructed from flash chips designed for firmware and slowly
changing configuration data and can only erase in very large chunks,
TRIMing is good for the housekeeping in the SSD drive. A typical
use case for this would be a laptop.


I have heard quite a few times that TRIM is good for SSD drives but 
I don't see much actual use for it.  Every responsible SSD drive 
maintains a reserve of unused space (20-50%) since it is needed for 
wear leveling and to repair failing spots.  This means that even when 
a SSD is 100% full it still has considerable space remaining.  A very 
simple SSD design solution is that when a SSD block is overwritten 
it is replaced with an already-erased block from the free pool and the 
old block is submitted to the free pool for eventual erasure and 
re-use.  This approach avoids adding erase times to the write latency 
as long as the device can erase as fast as the average data write 
rate.


There are of course SSDs with hardly any (or no) reserve space, but 
while we might be willing to sacrifice an image or two to SSD block 
failure in our digital camera, that is just not acceptable for serious 
computer use.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Andras Spitzer
Just an update :

Finally I found some technical details about this Thin Reclamation API :

(http://blogs.hds.com/claus/2009/12/i-love-it-when-a-plan-comes-together.html)

This week, (December 7th), Symantec announced their “completing the thin 
provisioning ecosystem” that includes the necessary API calls for the file 
system to “notify” the storage array when space is “deleted”. The interface is 
a previously disused and now revised/reused/repurposed SCSI command (called 
Write Same) which was jointly worked out with Symantec, Hitachi, and 3PAR. This 
command allows the file systems (in this case Veritas VxFS) to notify the 
storage systems that space is no longer occupied. How cool is that! There is 
also a subcommittee to INCITS T10 studying the standardization of this and SNIA 
is also studying this. It won’t be long before most file systems, databases, 
and storage vendors adopt this technology.

So it's based on the SCSI Write Same/UNMAP command (and if I understand 
correctly SATA TRIM is similar to this from the FS point of view), which 
standard is not ratified yet.

Also, happy new year to everyone!

Regards,
sendai
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Ragnar Sundblad

On 31 dec 2009, at 17.18, Bob Friesenhahn wrote:

 On Thu, 31 Dec 2009, Ragnar Sundblad wrote:
 
 Also, currently, when the SSDs for some very strange reason is
 constructed from flash chips designed for firmware and slowly
 changing configuration data and can only erase in very large chunks,
 TRIMing is good for the housekeeping in the SSD drive. A typical
 use case for this would be a laptop.
 
 I have heard quite a few times that TRIM is good for SSD drives but I don't 
 see much actual use for it.  Every responsible SSD drive maintains a reserve 
 of unused space (20-50%) since it is needed for wear leveling and to repair 
 failing spots.  This means that even when a SSD is 100% full it still has 
 considerable space remaining.

(At least as long as those blocks aren't used up in place of
bad/worn out blocks...)

  A very simple SSD design solution is that when a SSD block is overwritten 
 it is replaced with an already-erased block from the free pool and the old 
 block is submitted to the free pool for eventual erasure and re-use.  This 
 approach avoids adding erase times to the write latency as long as the device 
  can erase as fast as the average data write rate.

This is what they do, as far as I have understood, but more
free space to play with makes the job easier and therefore
faster, and gives you a larger burst headroom before you hit
the erase-speed limit of the disk.

 There are of course SSDs with hardly any (or no) reserve space, but while we 
 might be willing to sacrifice an image or two to SSD block failure in our 
 digital camera, that is just not acceptable for serious computer use.

I think the idea is that with TRIM you can also use the file
system's unused space for wear leveling and flash block filling.
If your disk is completely full there is of course no gain.

/ragge s

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Richard Elling

On Dec 31, 2009, at 1:43 AM, Andras Spitzer wrote:


Let me sum up my thoughts in this topic.

To Richard [relling] : I agree with you this topic is even more  
confusing if we are not careful enough to specify exactly what we  
are talking about. Thin provision can be done on multiple layers,  
and though you said you like it to be closer to the app than closer  
to the dumb disks (if you were referring to SAN), my opinion is that  
each and every scenario has its own pros/cons. I learned a long time  
ago not to declare a technology good/bad, there are technologies  
which are used properly (usually declared as good tech) and others  
which are not (usually declared as bad).


I hear you.  But you are trapped thinking about 20th century designs  
and ZFS is a 21st century design.  More below...

Let me clarify my case, and why I mentioned thin devices on SAN  
specifically. Many people replied with the thin device support of  
ZFS (which is called sparse volumes if I'm correct), but what I was  
talking about is something else. It's thin device awareness on the  
SAN.


In this case you configure your LUN in the SAN as thin device, a  
virtual LUN(s) which is backed by a pool of physical disks in the  
SAN. From the OS it's transparent, so it is from the Volume Manager/ 
Filesystem point of view.


That is the basic definition of my scenario with thin devices on  
SAN. High-end SAN frames like HDS USP-V (feature called Hitachi  
Dynamic Provisioning), EMC Symmetrix V-Max (feature called Virtual  
provisioning) supports this (and I'm sure many others as well). As  
you discovered the LUN in the OS, you start to use it, like put  
under Volume Manager, create filesystem, copy files, but the SAN  
only allocates physical blocks (more precisely group of blocks  
called extents) as you write them, which means you'll use only as  
much (or a bit more rounded to the next extent) on the physical disk  
as you use in reality.


From this standpoint we can define two terms, thin-friendly and  
thin-hostile environments. Thin-friendly would be any environment  
where OS/VM/FS doesn't write to blocks it doesn't really use (for  
example during initialization it doesn't fills up the LUN with a  
pattern or 0s).


That's why Veritas' SmartMove is a nice feature, as when you move  
from fat to thin devices (from the OS both LUNs look exactly the  
same), it will copy the blocks only which are used by the VxFS files.


ZFS does this by design. There is no way in ZFS to not do this.
I suppose it could be touted as a feature :-)  Maybe we should brand
ZFS as THINbyDESIGN(TM)  Or perhaps we can rebrand
SMARTMOVE(TM) as TRYINGTOCATCHUPWITHZFS(TM) :-)

That is still the basics of having thin devices on SAN, and hope to  
have a thin-friendly environment. The next level of this is the  
management of the thin devices and the physical pool where thin  
devices allocates their extents from.


Even if you get migrated to thin device LUNs, your thin devices will  
become fat again, even if you fill up your filesystem once, the thin  
device on the SAN will remain fat, no space reclamation is happening  
by default. The reason is pretty simple, the SAN storage has no  
knowledge of the filesystem structure, as such it can't decide  
whether a block should be released back to the pool, or it's really  
not in use. Then came Veritas with this brilliant idea of building a  
bridge between the FS and the SAN frame (this became the Thin  
Reclamation API), so they can communicate which blocks are not in  
use indeed.


I really would like you to read this Quick Note from Veritas about  
this feature; it will explain the concept far better than I did: http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf


Btw, in this concept VxVM can even detect (via ASL) whether a LUN is  
thin device/thin device reclamation capable or not.


Correct.  Since VxVM and VxFS are separate software, they have expanded
the interface between them.

Consider adding a mirror or replacing a drive.

Prior to SMARTMOVE, VxVM had no idea what part of the volume was data
and what was unused. So VxVM would silver the mirror by copying all of  
the blocks from one side to the other. Clearly this is uncool when your SAN
storage is virtualized.

With SMARTMOVE, VxFS has a method to tell VxVM that portions of the
volume are unused. Now when you silver the mirror, VxVM knows that
some bits are unused and it won't bother to copy them.  This is a bona
fide good thing for virtualized SAN arrays.
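
The difference is easy to see in a back-of-the-envelope Python sketch (the
volume size, block size and used ranges below are invented numbers):

BLOCK = 64 * 1024

def silver_blind(volume_size):
    return volume_size // BLOCK                  # copy every block of the volume

def silver_informed(used_ranges):
    copied = 0
    for offset, length in used_ranges:           # copy only the in-use regions
        first = offset // BLOCK
        last = (offset + length - 1) // BLOCK
        copied += last - first + 1
    return copied

VOL = 100 * 2**30                                # 100 GB volume, mostly empty
used = [(0, 2 * 2**30), (50 * 2**30, 2**30)]     # about 3 GB actually holds data
print(silver_blind(VOL), silver_informed(used))  # -> 1638400 vs. 49152 blocks copied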

ZFS was designed with the knowledge that the limited interface between
file systems and volume managers was a severe limitation that leads to
all sorts of complexity and angst. So a different design is needed.  ZFS
has fully integrated RAID with the file system, so there is no need, by
design, to create a new interface between these layers. In other words,
the only way to silver a disk in ZFS is to silver the data. You can't  
silver unused space. 

Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Richard Elling

[I TRIMmed the thread a bit ;-)]

On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote:

On 31 dec 2009, at 06.01, Richard Elling wrote:


In a world with copy-on-write and without snapshots, it is obvious  
that
there will be a lot of blocks running around that are no longer in  
use.

Snapshots (and their clones) changes that use case. So in a world of
snapshots, there will be fewer blocks which are not used. Remember,
the TRIM command is very important to OSes like Windows or OSX
which do not have file systems that are copy-on-write or have decent
snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use
snapshots.


I don't believe that there is such a big difference between those
cases.


The reason you want TRIM for SSDs is to recover the write speed.
A freshly cleaned page can be written faster than a dirty page.
But in COW, you are writing to new pages and not rewriting old
pages. This is fundamentally different than FAT, NTFS, or HFS+,
but it is those markets which are driving TRIM adoption.

[TRIMmed]


Once a block is freed in ZFS, it no longer needs it. So the problem
of TRIM in ZFS is not related to the recent txg commit history.


It may be that you want to save a few txgs back, so if you get
a failure where parts of the last txg gets lost, you will still be
able to get an old (few seconds/minutes) version of your data back.


This is already implemented. Blocks freed in the past few txgs are
not returned to the freelist immediately. This was needed to enable
uberblock recovery in b128. So TRIMming from the freelist is safe.
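
As an illustration of that idea (a toy model only, not ZFS code): frees are
parked per txg and only migrate to the real free list once they are a few
txgs old, and only blocks already on the free list would ever be offered
for TRIM.

HOLD_TXGS = 3

class ToyFreeList:
    def __init__(self):
        self.deferred = {}            # txg -> blocks freed in that txg
        self.free = set()             # safe to reuse, and safe to TRIM

    def free_blocks(self, txg, blocks):
        self.deferred.setdefault(txg, set()).update(blocks)

    def sync(self, current_txg):
        # promote frees that are at least HOLD_TXGS old onto the free list
        for txg in [t for t in self.deferred if t <= current_txg - HOLD_TXGS]:
            self.free |= self.deferred.pop(txg)

    def trim_candidates(self):
        return sorted(self.free)

fl = ToyFreeList()
fl.free_blocks(100, {10, 11})
fl.free_blocks(102, {42})
fl.sync(current_txg=103)
print(fl.trim_candidates())           # -> [10, 11]; txg 102 is still held back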


This could happen if the sync commands aren't correctly implemented
all the way (as we have seen some stories about on this list).
Maybe someone disabled syncing somewhere to improve performance.

It could also happen if a non volatile caching device, such as
a storage controller, breaks in some bad way. Or maybe you just
had a bad/old battery/supercap in a device that implements
NV storage with batteries/supercaps.


The
issue is that traversing the free block list has to be protected by
locks, so that the file system does not allocate a block when it is
also TRIMming the block. Not so difficult, as long as the TRIM
occurs relatively quickly.

I think that any TRIM implementation should be an administration
command, like scrub. It probably doesn't make sense to have it
running all of the time.  But on occasion, it might make sense.


I am not sure why it shouldn't run at all times, except for the
fact that it seems to be badly implemented in some SATA devices
with high latencies, so that it will interrupt any data streaming
to/from the disks.


I don't see how it would not have negative performance impacts.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Joerg Schilling
Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:

 I have heard quite a few times that TRIM is good for SSD drives but 
 I don't see much actual use for it.  Every responsible SSD drive 
 maintains a reserve of unused space (20-50%) since it is needed for 
 wear leveling and to repair failing spots.  This means that even when 
 a SSD is 100% full it still has considerable space remaining.  A very 
 simple SSD design solution is that when a SSD block is overwritten 
 it is replaced with an already-erased block from the free pool and the 
 old block is submitted to the free pool for eventual erasure and 
 re-use.  This approach avoids adding erase times to the write latency 
 as long as the device can erase as fast as the average date write 
 rate.

The question in the case of SSDs is:

ZFS is COW, but does the SSD know which block is in use and which is not?

If the SSD did know whether a block is in use, it could erase unused blocks
in advance. But what is an unused block on a filesystem that supports
snapshots?


From the perspective of the SSD I see only the following difference between
a COW filesystem and a conventional filesystem. A conventional filesystem 
may write more often to the same block number than a COW filesystem does.
But even for the non-COW case, I would expect that the SSD frequently remaps
overwritten blocks to previously erased spares.

My conclusion is that ZFS on a SSD works fine in the case that the primary used
blocks plus all active snapshots use less space than the official size - the 
spare reserve from the SSD. If you however fill up the medium, I expect a
performance degradation.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Joerg Schilling
Richard Elling richard.ell...@gmail.com wrote:

 The reason you want TRIM for SSDs is to recover the write speed.
 A freshly cleaned page can be written faster than a dirty page.
 But in COW, you are writing to new pages and not rewriting old
 pages. This is fundamentally different than FAT, NTFS, or HFS+,
 but it is those markets which are driving TRIM adoption.

Your mistake is to assume a maiden SSD and not to think about what's
happening after the SSD has been in use for a while. Even for the COW case,
blocks are reused after some time and the disk has no way to
know in advance which blocks are still in use and which blocks are no
longer used and may be prepared for being overwritten.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Ragnar Sundblad

On 31 dec 2009, at 19.26, Richard Elling wrote:

 [I TRIMmed the thread a bit ;-)]
 
 On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote:
 On 31 dec 2009, at 06.01, Richard Elling wrote:
 
 In a world with copy-on-write and without snapshots, it is obvious that
 there will be a lot of blocks running around that are no longer in use.
 Snapshots (and their clones) changes that use case. So in a world of
 snapshots, there will be fewer blocks which are not used. Remember,
 the TRIM command is very important to OSes like Windows or OSX
 which do not have file systems that are copy-on-write or have decent
 snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use
 snapshots.
 
 I don't believe that there is such a big difference between those
 cases.
 
 The reason you want TRIM for SSDs is to recover the write speed.
 A freshly cleaned page can be written faster than a dirty page.
 But in COW, you are writing to new pages and not rewriting old
 pages. This is fundamentally different than FAT, NTFS, or HFS+,
 but it is those markets which are driving TRIM adoption.

Flash SSDs actually always remap new writes in an
append-only-to-new-pages style, pretty much as ZFS does itself.
So for an SSD there is no big difference between ZFS and
filesystems such as UFS, NTFS, HFS+ et al; at the flash level they
all look the same.
The reason is that there is no way for it to rewrite single
disk blocks; it can only fill up already erased pages of,
for example, 512K. When the old blocks get mixed with unused
blocks (because of block rewrites, TRIM or WRITE SAME/UNMAP),
it needs to compact the data by copying all active blocks from
those pages into previously erased pages, writing the
active data there compacted/continuous. (When this happens, things tend
to get really slow.)

So TRIM is just as applicable to ZFS as to any other file system
on a flash SSD; there is no real difference.

 [TRIMmed]
 
 Once a block is freed in ZFS, it no longer needs it. So the problem
 of TRIM in ZFS is not related to the recent txg commit history.
 
 It may be that you want to save a few txgs back, so if you get
 a failure where parts of the last txg gets lost, you will still be
 able to get an old (few seconds/minutes) version of your data back.
 
 This is already implemented. Blocks freed in the past few txgs are
 not returned to the freelist immediately. This was needed to enable
 uberblock recovery in b128. So TRIMming from the freelist is safe.

I see, very good!

 This could happen if the sync commands aren't correctly implemented
 all the way (as we have seen some stories about on this list).
 Maybe someone disabled syncing somewhere to improve performance.
 
 It could also happen if a non volatile caching device, such as
 a storage controller, breaks in some bad way. Or maybe you just
 had a bad/old battery/supercap in a device that implements
 NV storage with batteries/supercaps.
 
 The
 issue is that traversing the free block list has to be protected by
 locks, so that the file system does not allocate a block when it is
 also TRIMming the block. Not so difficult, as long as the TRIM
 occurs relatively quickly.
 
 I think that any TRIM implementation should be an administration
 command, like scrub. It probably doesn't make sense to have it
 running all of the time.  But on occasion, it might make sense.
 
 I am not sure why it shouldn't run at all times, except for the
 fact that it seems to be badly implemented in some SATA devices
 with high latencies, so that it will interrupt any data streaming
 to/from the disks.
 
 I don't see how it would not have negative performance impacts.

It will, I am sure! But *if* the user for one reason or another
wants TRIM, it cannot be assumed that TRIMming large batches at
certain times is any better than trimming small amounts all the
time. Both behaviors may be useful, but I find it hard to see a really
good use case where you want batch trimming, and easy to see cases
where continuous trimming could be useful and hopefully hardly
noticeable thanks to the file system caching.

/ragge s

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread David Magda

On Dec 31, 2009, at 13:44, Joerg Schilling wrote:

ZFS is COW, but does the SSD know which block is in use and which is not?

If the SSD did know whether a block is in use, it could erase unused blocks
in advance. But what is an unused block on a filesystem that supports
snapshots?


Personally, I think that at some point in the future there will need  
to be a command telling SSDs that the file system will take care of  
handling blocks, as new FS designs will be COW. ZFS is the first  
mainstream one to do it, but Btrfs is there as well, and it looks  
like Apple will be making its own FS.


Just as the first 4096-byte block disks are silently emulating
4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes.
Perhaps in the future there will be a setting to say "no really, I'm
talking about the /actual/ LBA 123456".


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Mattias Pantzare
On Wed, Dec 30, 2009 at 19:23, roland devz...@web.de wrote:
 making transactional,logging filesystems thin-provisioning aware should be 
 hard to do, as every new and every changed block is written to a new location.
 so what applies to zfs, should also apply to btrfs or nilfs or similar 
 filesystems.

If that were a problem, it would be a problem for UFS when you write
new files...

ZFS knows what blocks are free, and that is all you need to send to the disk system.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Freddie Cash
 making transactional,logging filesystems
 thin-provisioning aware should be hard to do, as
 every new and every changed block is written to a new
 location.  so what applies to zfs, should also apply to btrfs or
 nilfs or similar filesystems.
 
 i`m not sure if there is a good way to make zfs
 thin-provisioning aware/friendly - so you should wait
 what a zfs developer has to tell about this.

ZFS already supports thin-provisioning, and has since pretty much the beginning 
(earliest I've used it in is ZFSv6).

I may get the terms backwards here, but if the Quota property is larger than
the Reservation, then you have a thin-provisioned volume or filesystem.  The
Quota sets the disk size or available space that the OS sees, while the
Reservation sets the space that is guaranteed to be available right now.  As
usage in the volume/fs approaches the Reservation, you just increase that
value.  The total size that the OS sees doesn't change, but the amount of
space actually guaranteed does.

This is especially useful for volumes that are exported via iSCSI.
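
For concreteness, a minimal sketch of that pattern, assuming a pool named
tank (the dataset name and sizes are made up):

  # a filesystem whose apparent size (quota) exceeds the space
  # guaranteed to it (reservation), i.e. thin-provisioned
  zfs create tank/thin
  zfs set quota=100G tank/thin        # the size consumers see
  zfs set reservation=10G tank/thin   # the space guaranteed today

  # later, as real usage approaches the reservation, grow the guarantee
  zfs set reservation=20G tank/thin
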
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Richard Elling

On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote:


Devzero,

Unfortunately that was my assumption as well. I don't have source  
level knowledge of ZFS, though based on what I know it wouldn't be  
an easy way to do it. I'm not even sure it's only a technical  
question, but a design question, which would make it even less  
feasible.


It is not hard, because ZFS knows the current free list, so walking that list
and telling the storage about the freed blocks isn't very hard.

What is hard is figuring out if this would actually improve life.  The reason
I say this is because people like to use snapshots and clones on ZFS.
If you keep snapshots, then you aren't freeing blocks, so the free list
doesn't grow. This is a very different use case than UFS, as an example.

There are a few minor bumps in the road. The ATA PASSTHROUGH
command, which allows TRIM to pass through the SATA drivers, was
just integrated into b130. This will be more important to small servers
than SANs, but the point is that all parts of the software stack need to
support the effort. As such, it is not clear to me who, if anyone, inside
Sun is champion for the effort -- it crosses multiple organizational
boundaries.



Apart from the technical possibilities, this feature looks really  
inevitable to me in the long run especially for enterprise customers  
with high-end SAN as cost is always a major factor in a storage  
design and it's a huge difference if you have to pay based on the  
space used vs space allocated (for example).


If the high cost of SAN storage is the problem, then I think there are
better ways to solve that :-)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Torrey McMahon


On 12/30/2009 2:40 PM, Richard Elling wrote:

There are a few minor bumps in the road. The ATA PASSTHROUGH
command, which allows TRIM to pass through the SATA drivers, was
just integrated into b130. This will be more important to small servers
than SANs, but the point is that all parts of the software stack need to
support the effort. As such, it is not clear to me who, if anyone, inside
Sun is champion for the effort -- it crosses multiple organizational
boundaries. 


I'd think it more important for devices where this is an issue, namely
SSDs, than it is for spinning rust, though use of the TRIM command, or
something like it, would fix a lot of the issues I've seen with thin
provisioning over the last six years or so. However, I'm not sure it's
going to have much of an impact until you can get the entire stack -
application to device - rewired to work with the concept behind it. One
of the biggest issues I've seen with thin provisioning is how the
applications work, and you can't fix that in the file system code.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Mike Gerdts
On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote:

 Devzero,

 Unfortunately that was my assumption as well. I don't have source level
 knowledge of ZFS, though based on what I know it wouldn't be an easy way to
 do it. I'm not even sure it's only a technical question, but a design
 question, which would make it even less feasible.

 It is not hard, because ZFS knows the current free list, so walking that
 list
 and telling the storage about the freed blocks isn't very hard.

 What is hard is figuring out if this would actually improve life.  The
 reason
 I say this is because people like to use snapshots and clones on ZFS.
 If you keep snapshots, then you aren't freeing blocks, so the free list
 doesn't grow. This is a very different use case than UFS, as an example.

It seems as though the oft-mentioned block rewrite capabilities needed
for pool shrinking and for changing things like compression, encryption,
and deduplication would also show benefit here.  That is, blocks would
be re-written in such a way as to minimize the number of chunks of
storage that are allocated.  The current HDS chunk size is 42 MB.

The most benefit would seem to come from having ZFS make a point of reusing
old but freed blocks before doing an allocation that causes the
back-end storage to allocate another chunk of disk to the
thin-provisioned LUN.  While it is important to be able to roll back a few
transactions in the event of some widely discussed failure modes, it
is probably reasonable to reuse a block freed by a txg that is 3,000
txgs old (about 1 day old at 1 txg per 30 seconds).  Such a threshold
could be used to determine whether to reuse a block or venture into
previously untouched regions of the disk.

This strategy would allow the SAN administrator (who is a different
person from the sysadmin) to allocate extra space to servers, while the
sysadmin controls the amount of space really used via quotas.  In
the event that there is an emergency need for more space, the sysadmin
can increase the quota and allow more of the allocated SAN space to be
used.  Assuming the block rewrite feature comes to fruition, this
emergency growth could be shrunk back down to the original size once
the surge in demand (or errant process) subsides.


 There are a few minor bumps in the road. The ATA PASSTHROUGH
 command, which allows TRIM to pass through the SATA drivers, was
 just integrated into b130. This will be more important to small servers
 than SANs, but the point is that all parts of the software stack need to
 support the effort. As such, it is not clear to me who, if anyone, inside
 Sun is champion for the effort -- it crosses multiple organizational
 boundaries.


 Apart from the technical possibilities, this feature looks really
 inevitable to me in the long run especially for enterprise customers with
 high-end SAN as cost is always a major factor in a storage design and it's a
 huge difference if you have to pay based on the space used vs space
 allocated (for example).

 If the high cost of SAN storage is the problem, then I think there are
 better ways to solve that :-)

The SAN could be an OpenSolaris device serving LUNs through COMSTAR.
 If those LUNs are used to hold a zpool, the zpool could notify the
LUN that blocks are no longer used and the SAN could reclaim those
blocks.  This is just a variant of the same problem faced with
expensive SAN devices that have thin provisioning allocation units
measured in the tens of megabytes instead of hundreds to thousands of
kilobytes.
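
A rough sketch of that COMSTAR setup, assuming a pool named tank and
made-up names/sizes (target and view configuration details omitted):

  # back the LUN with a sparse (thinly provisioned) zvol
  zfs create -s -V 500G tank/lun0

  # register it as a COMSTAR logical unit; this prints the LU GUID
  sbdadm create-lu /dev/zvol/rdsk/tank/lun0

  # expose the LU to initiators, using the GUID printed above
  stmfadm add-view <GUID>

The reclamation step described above -- the zpool notifying the LUN about
freed blocks -- is the part that does not exist yet.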

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Andras Spitzer
Richard,

That's an interesting question, whether it's worth it or not. I guess the question
is always who the targets for ZFS are (I assume everyone, though in reality
priorities have to be set, as developer resources are limited). For a home
office, no doubt thin provisioning is not of much use; for an enterprise
company the numbers might really make a difference if we look at the space used
vs. space allocated.

There are some studies showing that thin provisioning can reduce the physical
space used by up to 30%, which is huge. (Even though I understand studies are
not real life and thin provisioning is not viable in every environment.)

Btw, I would like to discuss scenarios where, even though we have an over-subscribed
pool in the SAN (meaning the overall space allocated to the systems is more
than the physical space in the pool), with proper monitoring and proactive
physical drive additions we never let any systems/applications attached to the SAN
realize that they are on thin devices.

Actually, that's why I believe configuring thin devices without periodically
reclaiming space is just a timebomb, whereas if you have the option to
periodically reclaim space, you can maintain the pool in the SAN in a really
efficient way. That's why I consider Veritas' Thin Reclamation API a milestone
in the thin device field.

Anyway, only the future can tell whether thin provisioning will or won't be a major
feature in the storage world, though since I saw that Veritas has already added this
feature, I was wondering whether ZFS at least has it on its roadmap.

Regards,
sendai
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Tristan Ball

To some extent it already does.

If what you're talking about is filesystems/datasets, then all 
filesystems within a pool share the same free space, which is 
functionally very similar to each filesystem within the pool being 
thin-provisioned. To get a thick filesystem, you'd need to set at 
least the filesystem's reservation, and probably quota as well. 
Basically filesystems within a pool are thin by default, with the added 
bonus that space freed within a single filesystem is available for use 
in any other filesystem within the pool.


If you're talking about volumes provisioned from a pool, then volumes 
can be provisioned as sparse, which is pretty much the same thing.


And if you happen to be providing ISCSI luns from files rather than 
volumes, then those files can be created sparse as well.
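
For concreteness, a minimal sketch of those three options, assuming a pool
named tank mounted at /tank (names and sizes are made up):

  # a "thick" filesystem: guarantee and cap it at the same size
  zfs create tank/thick
  zfs set reservation=50G tank/thick
  zfs set quota=50G tank/thick

  # a sparse (thinly provisioned) volume
  zfs create -s -V 200G tank/sparsevol

  # a sparse file, e.g. to back a file-based iSCSI LUN
  mkfile -n 200g /tank/lun-file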


Reclaiming space from sparse volumes and files is not so easy unfortunately!

If you're talking about the pool itself being thin... that's harder to 
do, although if you really needed it I guess if you provision your pool 
from an array that itself provides thin provisioning.


Regards,
Tristan



On 30/12/2009 9:34 PM, Andras Spitzer wrote:

Hi,

Has anyone heard about any plans to support thin devices in ZFS? I'm
talking about the thin device feature of SAN frames (EMC, HDS), which provides
more efficient space utilization. The concept is similar to ZFS with the pool
and datasets, though the pool in this case is in the SAN frame itself, so the
pool can be shared among different systems attached to the same SAN frame.

This topic is really complex, but I'm sure supporting it is inevitable for
enterprise customers with SAN storage; basically it brings the differentiation
of space used vs. space allocated, which can be a huge difference in a large
environment, and this difference is major on the financial level as well.

Veritas has already added support for thin devices: first, support in VxFS to be
thin-aware (for example, how to handle over-subscribed thin devices); then
Veritas added a feature called SmartMove, a nice feature to migrate from fat to thin
devices; and the most brilliant feature of all (my personal opinion, of course) is
that they released the Veritas Thin Device Reclamation API, which provides an
interface to the SAN frame to report unused space at the block level.

This API is a major hit, and even though SAN vendors don't support it today,
HP and HDS are already working on it, and I assume EMC has to follow as well. With
this API Veritas can keep track of deleted files, for example, and with a simple
command once a day (depending on your policy) it can report the unused space
back to the frame, so thin devices *remain* thin.

I really believe that ZFS should have support for thin devices, especially
considering what this API brings to this field, as it can result in
a huge cost difference for enterprise customers.

Regards,
sendai
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Richard Elling

now this is getting interesting :-)...

On Dec 30, 2009, at 12:13 PM, Mike Gerdts wrote:


On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling
richard.ell...@gmail.com wrote:

On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote:


Devzero,

Unfortunately that was my assumption as well. I don't have source level
knowledge of ZFS, though based on what I know it wouldn't be an easy way to
do it. I'm not even sure it's only a technical question, but a design
question, which would make it even less feasible.


It is not hard, because ZFS knows the current free list, so walking that list
and telling the storage about the freed blocks isn't very hard.

What is hard is figuring out if this would actually improve life.  The reason
I say this is because people like to use snapshots and clones on ZFS.
If you keep snapshots, then you aren't freeing blocks, so the free list
doesn't grow. This is a very different use case than UFS, as an example.


It seems as though the oft mentioned block rewrite capabilities needed
for pool shrinking and changing things like compression, encryption,
and deduplication would also show benefit here.  That is, blocks would
be re-written in such a way to minimize the number of chunks of
storage that is allocated.  The current HDS chunk size is 42 MB.


Good observation, Mike. ZFS divides a leaf vdev into approximately 200
metaslabs. Space is allocated in a metaslab and at some point another
metaslab will be chosen.  The assumption is made that the outer tracks
of a disk have higher bandwidth than inner tracks, so allocations should
be biased towards lower-numbered metaslabs.  Let's ignore, for the
moment, that SSDs, and to some degree, RAID arrays, don't exhibit
this behavior. OK, so here's how it works, in a nutshell.

Space is allocated in the same metaslab until it fills or becomes
fragmented, and then the next metaslab is used.  You can see this
in my Spacemaps from Space blog,
http://blogs.sun.com/relling/entry/space_maps_from_space
where, in the lower-numbered tracks (towards the bottom), you can see
occasional small blank areas.  Note to self: a better picture would be
useful :-)

Note: copies are intentionally spread to other, distant metaslabs for
diversity.

Inside the metaslab, space is allocated on a first-fit basis until the
space is mostly consumed and the algorithm changes to best-fit.

The algorithm for these two decisions was changed in b129, in an
effort to improve performance.
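
For the curious, the metaslab layout of an existing pool can be inspected
with zdb (a sketch, assuming a pool named tank; the exact output format
varies between builds):

  # list each top-level vdev's metaslabs: offset, spacemap object, free space
  zdb -m tank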

So, the questions that arise are:
Should the allocator be made aware of the chunk size of virtual
storage vdevs?  [hint: there is evidence of the intention to permit
different allocators in the source, but I dunno if there is an intent
to expose those through an interface.]

If the allocator can change, what sorts of policies should be
implemented?  Examples include:
+ should the allocator stick with best-fit and encourage more
   gangs when the vdev is virtual?
+ should the allocator be aware of an SSD's page size?  Is
   said page size available to an OS?
+ should the metaslab boundaries align with virtual storage
   or SSD page boundaries?

And, perhaps most important, how can this be done automatically
so that system administrators don't have to be rocket scientists
to make a good choice?

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Tristan Ball

Ack..

I've just re-read your original post. :-) It's clear you are talking 
about support for thin devices behind the pool, not features inside the 
pool itself.


Mea culpa.

So I guess we wait for trim to be fully supported..  :-)

T.



On 31/12/2009 8:09 AM, Tristan Ball wrote:

To some extent it already does.

If what you're talking about is filesystems/datasets, then all 
filesystems within a pool share the same free space, which is 
functionally very similar to each filesystem within the pool being 
thin-provisioned. To get a thick filesystem, you'd need to set at 
least the filesystem's reservation, and probably quota as well. 
Basically filesystems within a pool are thin by default, with the 
added bonus that space freed within a single filesystem is available 
for use in any other filesystem within the pool.


If you're talking about volumes provisioned from a pool, then volumes 
can be provisioned as sparse, which is pretty much the same thing.


And if you happen to be providing ISCSI luns from files rather than 
volumes, then those files can be created sparse as well.


Reclaiming space from sparse volumes and files is not so easy 
unfortunately!


If you're talking about the pool itself being thin... that's harder to 
do, although if you really needed it I guess if you provision your 
pool from an array that itself provides thin provisioning.


Regards,
Tristan



On 30/12/2009 9:34 PM, Andras Spitzer wrote:

Hi,

Has anyone heard about any plans to support thin devices in
ZFS? I'm talking about the thin device feature of SAN frames (EMC,
HDS), which provides more efficient space utilization. The concept is
similar to ZFS with the pool and datasets, though the pool in this
case is in the SAN frame itself, so the pool can be shared among
different systems attached to the same SAN frame.


This topic is really complex, but I'm sure supporting it is inevitable
for enterprise customers with SAN storage; basically it brings the
differentiation of space used vs. space allocated, which can be a huge
difference in a large environment, and this difference is major
on the financial level as well.


Veritas has already added support for thin devices: first, support
in VxFS to be thin-aware (for example, how to handle over-subscribed
thin devices); then Veritas added a feature called SmartMove, a nice
feature to migrate from fat to thin devices; and the most brilliant
feature of all (my personal opinion, of course) is that they released the
Veritas Thin Device Reclamation API, which provides an interface to
the SAN frame to report unused space at the block level.


This API is a major hit, and even though SAN vendors don't support it
today, HP and HDS are already working on it, and I assume EMC has to
follow as well. With this API Veritas can keep track of deleted files,
for example, and with a simple command once a day (depending on your
policy) it can report the unused space back to the frame, so thin
devices *remain* thin.


I really believe that ZFS should have support for thin devices,
especially considering what this API brings to this field, as it can
result in a huge cost difference for enterprise customers.


Regards,
sendai

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Richard Elling

On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote:


Richard,

That's an interesting question, if it's worth it or not. I guess the  
question is always who are the targets for ZFS (I assume everyone,  
though in reality priorities has to set up as the developer  
resources are limited). For a home office, no doubt thin  
provisioning is not much of a use, for an enterprise company the  
numbers might really make a difference if we look at the space used  
vs space allocated.


There are some studies that thin provisioning can reduce physical  
space used up to 30%, which is huge. (Even though I understands  
studies are not real life and thin provisioning is not viable in  
every environment)


Btw, I would like to discuss scenarios where though we have over- 
subscribed pool in the SAN (meaning the overall allocated space to  
the systems is more than the physical space in the pool) with proper  
monitoring and proactive physical drive adds we won't let any  
systems/applications attached to the SAN realize that we have thin  
devices.


Actually that's why I believe configuring thin devices without  
periodically reclaiming space is just a timebomb, though if you have  
the option to periodically reclaim space, you can maintain the pool  
in the SAN in a really efficient way. That's why I found Veritas'  
Thin Reclamation API as a milestone in the thin device field.


Anyway, only future can tell if thin provisioning will or won't be a  
major feature in the storage world, though as I saw Veritas already  
added this feature I was wondering if ZFS has it at least on it's  
roadmap.


Thin provisioning is absolutely, positively a wonderful, good thing!  The question
is, how does the industry handle the multitude of thin provisioning models, each
layered on top of another? For example, here at the ranch I use VMWare and Xen,
which thinly provision virtual disks. I do this over iSCSI to a server running ZFS
which thinly provisions the iSCSI target.  If I had a virtual RAID array, I would
probably use that, too. Personally, I think being thinner closer to the application
wins over being thinner closer to dumb storage devices (disk drives).

BTW, I do not see an RFE for this on http://bugs.opensolaris.org
Would you be so kind as to file one?
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Mike Gerdts
On Wed, Dec 30, 2009 at 3:12 PM, Richard Elling
richard.ell...@gmail.com wrote:
 If the allocator can change, what sorts of policies should be
 implemented?  Examples include:
        + should the allocator stick with best-fit and encourage more
           gangs when the vdev is virtual?
        + should the allocator be aware of an SSD's page size?  Is
           said page size available to an OS?
        + should the metaslab boundaries align with virtual storage
           or SSD page boundaries?

Wandering off topic a little bit...

Should the block size be a tunable, so that it can match the page size of SSDs
(typically 4K, right?) and of upcoming hard disks that sport a sector size > 512
bytes?

http://arc.opensolaris.org/caselog/PSARC/2008/769/final_spec.txt
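
Some of this is at least partly visible or tunable today; a sketch, assuming
a pool named tank with a dataset tank/fs (the 4K figure is only an example):

  # per-dataset logical block size can be matched to a device page size
  zfs set recordsize=4K tank/fs

  # the pool's minimum device block size (ashift, as a power of two) is
  # recorded in the vdev labels and shows up in the zdb configuration dump
  zdb -C tank | grep ashift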

 And, perhaps most important, how can this be done automatically
 so that system administrators don't have to be rocket scientists
 to make a good choice?

Didn't you read the marketing literature?  ZFS is easy because you
only need to know two commands: zpool and zfs.  If you just ignore all
the subcommands, options to those subcommands, evil tuning that is
sometimes needed, and effects of redundancy choices then there is no
need for any rocket scientists.  :)

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Ragnar Sundblad

On 30 dec 2009, at 22.45, Richard Elling wrote:

 On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote:
 
 Richard,
 
 That's an interesting question, if it's worth it or not. I guess the 
 question is always who are the targets for ZFS (I assume everyone, though in 
 reality priorities has to set up as the developer resources are limited). 
 For a home office, no doubt thin provisioning is not much of a use, for an 
 enterprise company the numbers might really make a difference if we look at 
 the space used vs space allocated.
 
 There are some studies that thin provisioning can reduce physical space used 
 up to 30%, which is huge. (Even though I understands studies are not real 
 life and thin provisioning is not viable in every environment)
 
 Btw, I would like to discuss scenarios where though we have over-subscribed 
 pool in the SAN (meaning the overall allocated space to the systems is more 
 than the physical space in the pool) with proper monitoring and proactive 
 physical drive adds we won't let any systems/applications attached to the 
 SAN realize that we have thin devices.
 
 Actually that's why I believe configuring thin devices without periodically 
 reclaiming space is just a timebomb, though if you have the option to 
 periodically reclaim space, you can maintain the pool in the SAN in a really 
 efficient way. That's why I found Veritas' Thin Reclamation API as a 
 milestone in the thin device field.
 
 Anyway, only future can tell if thin provisioning will or won't be a major 
 feature in the storage world, though as I saw Veritas already added this 
 feature I was wondering if ZFS has it at least on it's roadmap.
 
 Thin provisioning is absolutely, positively a wonderful, good thing!  The 
 question
 is, how does the industry handle the multitude of thin provisioning models, 
 each
 layered on top of another? For example, here at the ranch I use VMWare and 
 Xen,
 which thinly provision virtual disks. I do this over iSCSI to a server 
 running ZFS
 which thinly provisions the iSCSI target.  If I had a virtual RAID array, I 
 would
 probably use that, too. Personally, I think being thinner closer to the 
 application
 wins over being thinner closer to dumb storage devices (disk drives).

I don't get it - why do we need anything more magic (or complicated)
than support for TRIM from the filesystems and the storage systems?

I don't see why TRIM would be hard to implement for ZFS either,
except that you may want to keep data from a few txgs back just
for safety, which would probably call for some two-stage freeing
of data blocks (those free blocks that are to be TRIMmed, and
those that already are).

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Bob Friesenhahn

On Wed, 30 Dec 2009, Mike Gerdts wrote:


Should the block size be a tunable so that page size of SSD (typically
4K, right?) and upcoming hard disks that sport a sector size > 512
bytes?


Enterprise SSDs are still in their infancy.  The actual page size of
an SSD could be almost anything.  Due to the lack of seek-time concerns
and the high cost of erasing a page, an SSD could be designed with a
level of indirection so that multiple logical writes to disjoint
offsets could be combined into a single SSD physical page.  Likewise a
large logical block could be subdivided into multiple SSD pages, which
are allocated on demand.  Logic is cheap and SSDs are full of logic, so
it seems reasonable that future SSDs will do this, if they don't already,
since similar logic enables wear-leveling.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Richard Elling


On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote:



On 30 dec 2009, at 22.45, Richard Elling wrote:


On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote:


Richard,

That's an interesting question, if it's worth it or not. I guess  
the question is always who are the targets for ZFS (I assume  
everyone, though in reality priorities has to set up as the  
developer resources are limited). For a home office, no doubt thin  
provisioning is not much of a use, for an enterprise company the  
numbers might really make a difference if we look at the space  
used vs space allocated.


There are some studies that thin provisioning can reduce physical  
space used up to 30%, which is huge. (Even though I understands  
studies are not real life and thin provisioning is not viable in  
every environment)


Btw, I would like to discuss scenarios where though we have over- 
subscribed pool in the SAN (meaning the overall allocated space to  
the systems is more than the physical space in the pool) with  
proper monitoring and proactive physical drive adds we won't let  
any systems/applications attached to the SAN realize that we have  
thin devices.


Actually that's why I believe configuring thin devices without  
periodically reclaiming space is just a timebomb, though if you  
have the option to periodically reclaim space, you can maintain  
the pool in the SAN in a really efficient way. That's why I found  
Veritas' Thin Reclamation API as a milestone in the thin device  
field.


Anyway, only future can tell if thin provisioning will or won't be  
a major feature in the storage world, though as I saw Veritas  
already added this feature I was wondering if ZFS has it at least  
on it's roadmap.


Thin provisioning is absolutely, positively a wonderful, good  
thing!  The question
is, how does the industry handle the multitude of thin provisioning  
models, each
layered on top of another? For example, here at the ranch I use  
VMWare and Xen,
which thinly provision virtual disks. I do this over iSCSI to a  
server running ZFS
which thinly provisions the iSCSI target.  If I had a virtual RAID  
array, I would
probably use that, too. Personally, I think being thinner closer to  
the application

wins over being thinner closer to dumb storage devices (disk drives).


I don't get it - why do we need anything more magic (or complicated)
than support for TRIM from the filesystems and the storage systems?


TRIM is just one part of the problem (or solution, depending on your point
of view). The TRIM command is part of the ATA (T13) command set -- the SCSI
(T10) counterpart is UNMAP -- and allows a host to tell a block device that
data in a set of blocks is no longer of any value, so the block device can
destroy the data without adverse consequence.

In a world with copy-on-write and without snapshots, it is obvious that
there will be a lot of blocks running around that are no longer in use.
Snapshots (and their clones) change that use case. So in a world of
snapshots, there will be fewer blocks which are not used. Remember,
the TRIM command is very important to OSes like Windows or OSX
which do not have file systems that are copy-on-write or have decent
snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use
snapshots.

That said, adding TRIM support is not hard in ZFS. But it depends on
lower level drivers to pass the TRIM commands down the stack. These
ducks are lining up now.


I don't see why TRIM would be hard to implement for ZFS either,
except that you may want to keep data from a few txgs back just
for safety, which would probably call for some two-stage freeing
of data blocks (those free blocks that are to be TRIMmed, and
those that already are).


Once a block is freed in ZFS, ZFS no longer needs it. So the problem
of TRIM in ZFS is not related to the recent txg commit history. The
issue is that traversing the free block list has to be protected by
locks, so that the file system does not allocate a block when it is
also TRIMming the block. Not so difficult, as long as the TRIM
occurs relatively quickly.

I think that any TRIM implementation should be an administration
command, like scrub. It probably doesn't make sense to have it
running all of the time.  But on occasion, it might make sense.
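
Purely as an illustration of what such an administrative interface might
look like (no such subcommand existed in ZFS at the time; the syntax below
mirrors what much later appeared in OpenZFS, and is not a statement of
anyone's plans):

  # hypothetical: walk a pool's free space and issue TRIM/UNMAP,
  # analogous to starting a scrub
  zpool trim tank

  # hypothetical: progress reported alongside pool status, as with a scrub
  zpool status -t tank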

My concern is that people will have an expectation that they can
use snapshots and TRIM -- the former reduces the effectiveness
of the latter.  As the price of storing bytes continues to decrease,
will the cost of not TRIMming be a long term issue?  I think not.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss