Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-24 Thread Bob Friesenhahn

On Fri, 23 Oct 2009, Eric D. Mudama wrote:


> I don't believe the above statement is correct.
>
> According to anandtech who asked Intel:
>
> http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403p=10
>
> the DRAM doesn't hold user data.  The article claims that data goes
> through an internal 256KB buffer.


These folks may well be in Intel's back pocket, but it seems that the 
data given is not clear or accurate.  It does not matter if DRAM or 
SRAM is used for the disk's cache.  What matters is if all user data 
gets flushed to non-volatile storage for each cache flush request. 
Since FLASH drives need to erase a larger block than might be written, 
existing data needs to be read, updated, and then written.  This data 
needs to be worked on in a volatile buffer.  Without extreme care, it 
is possible for the FLASH drive to corrupt other existing unrelated 
data if there is power loss.  The FLASH drive could use a COW scheme 
(like ZFS) but it still needs to take care to persist the block 
mappings for each cache sync request or transactions would be lost.


Folks at another site found that the drive was losing the last few 
synchronous writes with the cache enabled.  This could be a problem 
with the drive, or the OS if it is not issuing the cache flush 
request.



> Is Solaris incapable of issuing a SATA command FLUSH CACHE EXT?


It issues one for each update to the intent log.
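
For anyone who wants a feel for what a flush per log update costs on
their own hardware, here is a minimal Python sketch.  It is not how ZFS
does it; it only mimics the pattern of small synchronous writes by
calling os.fsync() after every 4K write.  Whether fsync() really ends up
as a FLUSH CACHE EXT at the drive depends on the OS, the filesystem, and
the controller, so treat that part as an assumption.  The function name
and its parameters are made up for illustration.

```python
import os
import tempfile
import time


def sync_write_rate(path, count=200, size=4096):
    """Write `count` blocks of `size` bytes, calling fsync after each one.

    Returns the observed number of synchronous writes per second.
    """
    block = b"\0" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            os.fsync(fd)  # ask the OS to push this write to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rate = sync_write_rate(os.path.join(tmp, "synctest.dat"))
        print(f"{rate:.0f} synchronous 4K writes per second")
```

Pointing the path at a file on the device under test, with the write
cache enabled and then disabled, gives a crude comparison.  It is no
substitute for a real benchmark tool.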

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-24 Thread Bob Friesenhahn

On Sat, 24 Oct 2009, Bob Friesenhahn wrote:



>> Is Solaris incapable of issuing a SATA command FLUSH CACHE EXT?
>
> It issues one for each update to the intent log.


I should mention that FLASH SSDs without a capacitor/battery-backed 
cache flush (like the X25-E) are likely to get burned out pretty 
quickly if they respect each cache flush request.  The reason is that 
each write needs to update a full FLASH metablock.  This means that a 
small 4K synchronous update forces a write of a full FLASH metablock in 
the X25-E.  I don't know the size of the FLASH metablock in the X25-E 
(seems to be a closely-held secret), but perhaps it is 128K, 256K, or 
512K.


The rumor that disabling the cache on the X25-E disables the wear 
leveling is probably incorrect.  It is much more likely that disabling 
the cache causes each write to erase and write a full FLASH 
metablock (known as write amplification), therefore causing the 
device to wear out much more quickly than if it deferred writes.


  http://www.tomshardware.com/reviews/Intel-x25-m-SSD,2012-5.html
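
To put rough numbers on this: if every 4K synchronous write really does
force a full metablock to be rewritten, the amplification is simply the
ratio of the two sizes.  The sketch below uses the three metablock sizes
guessed at above.  They are guesses, not published Intel figures, and the
function name is only illustrative.

```python
def write_amplification(metablock_bytes, write_bytes=4096):
    """Bytes physically written per byte the host asked to write."""
    return metablock_bytes / write_bytes


if __name__ == "__main__":
    for kib in (128, 256, 512):  # guessed metablock sizes, not Intel data
        factor = write_amplification(kib * 1024)
        print(f"{kib}K metablock: {factor:.0f}x amplification per 4K write")
```

Under those assumptions the factor comes out at 32x, 64x, or 128x, which
is why deferring and combining writes matters so much for wear.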

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-23 Thread Eric D. Mudama

On Tue, Oct 20 at 22:24, Frédéric VANNIERE wrote:


> You can't use the Intel X25-E because it has a 32 or 64 MB volatile
> cache that can be neither disabled nor flushed by ZFS.


I don't believe the above statement is correct.

According to anandtech who asked Intel:

http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403p=10

the DRAM doesn't hold user data.  The article claims that data goes
through an internal 256KB buffer.

Is Solaris incapable of issuing a SATA command FLUSH CACHE EXT?


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-22 Thread Edward Ned Harvey
> Replacing failed disks is easy when PERC is doing the RAID. Just remove
> the failed drive and replace with a good one, and the PERC will rebuild
> automatically.

Sorry, not correct.  When you replace a failed drive, the PERC card doesn't
know for certain that the new drive you're adding is meant to be a
replacement.  For all it knows, you could coincidentally be adding new disks
for a new VirtualDevice which already contains data, during the failure
state of some other device.  So it will not automatically resilver (which
would be a permanently destructive process, applied to a disk which is not
*certainly* meant for destruction).

You have to open the PERC config interface and tell it this disk is a
replacement for the old disk (probably you're just saying "This disk is the
new global hotspare"), or else the new disk will sit there like a bump on a
log.  Doing nothing.



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-22 Thread Edward Ned Harvey
> The Intel specified random write IOPS are with the cache enabled and
> without cache flushing.  They also carefully only use a limited span
> of the device, which fits most perfectly with how the device is built.

How do you know this?  This sounds much more detailed than any average
person could ever know.



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-22 Thread Ross
Actually, I think this is a case of crossed wires.  This issue was reported a 
while back on a news site for the X25-M G2.  Somebody claimed that these 
devices have 8GB of cache, which is exactly the dataset size they use for the 
IOPS figures.

The X25-E datasheet, however, states that while the write cache is enabled,
the IOPS figures are measured over the entire drive.

And looking at the X25-M G2 datasheet again, it states that the measurements 
are over an 8GB range, but these drives come with 32MB of cache, so I think
that was also a false alarm.


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-22 Thread Bob Friesenhahn

On Thu, 22 Oct 2009, Marc Bevand wrote:


> Bob Friesenhahn bfriesen at simple.dallas.tx.us writes:
>
> For random write I/O, caching improves I/O latency not sustained I/O
> throughput (which is what random write IOPS usually refer to). So Intel can't
> cheat with caching. However they can cheat by benchmarking a brand new drive
> instead of an aged one.


With FLASH devices, a sufficiently large write cache can improve 
random write I/O.  One can imagine that the wear leveling logic 
could be used to do tricky remapping so that several random writes 
actually lead to sequential writes to the same FLASH superblock so 
only one superblock needs to be updated and the parts of the old 
superblocks which would have been overwritten are marked as unused. 
This of course requires rather advanced remapping logic at a 
finer-grained resolution than the superblock.  When erased space 
becomes tight (or on a periodic basis), the data in several 
sparsely-used superblocks are migrated to a different superblock in a 
more compact way (along with requisite logical block remapping) to 
reclaim space.  It is worth developing such remapping logic since 
FLASH erasures and re-writes are so expensive.
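
The remapping idea can be sketched as a toy model.  This is only an
illustration of the scheme described above, not the X25-E's actual
firmware logic, which is not public.  The function names, the figure of
64 pages per superblock, and the workload are all assumptions, and the
migration (garbage collection) step is deliberately not modeled.

```python
import random


def superblocks_in_place(writes):
    """Naive scheme: every page write erases and rewrites a superblock."""
    return len(writes)


def superblocks_with_remapping(writes, pages_per_superblock):
    """Remapping scheme: writes are appended to the open superblock.

    A page map records where each logical page now lives, and the old
    copy is only marked stale.  Returns (superblocks written, stale pages).
    """
    page_map = {}   # logical page -> (superblock, slot)
    stale = set()   # physical locations holding superseded data
    block, slot, used = 0, 0, 0
    for logical in writes:
        if slot == 0:
            used += 1                      # open a fresh, pre-erased superblock
        if logical in page_map:
            stale.add(page_map[logical])   # old copy is now unused
        page_map[logical] = (block, slot)
        slot += 1
        if slot == pages_per_superblock:   # superblock full, move on
            block, slot = block + 1, 0
    return used, len(stale)


if __name__ == "__main__":
    random.seed(1)
    workload = [random.randrange(10_000) for _ in range(1_000)]
    used, stale = superblocks_with_remapping(workload, pages_per_superblock=64)
    print("read-modify-write superblock updates:", superblocks_in_place(workload))
    print("remapped superblock updates:", used, "stale pages:", stale)
```

In this model the number of superblock updates drops from one per write
to one per superblock's worth of writes.  The cost that is left out is
the later compaction of sparsely-used superblocks.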



>> They also carefully only use a limited span
>> of the device, which fits most perfectly with how the device is built.
>
> AFAIK, for the X25-E series, they benchmark random write IOPS on a 100% span.
> You may be confusing it with the X25-M series with which they actually clearly
> disclose two performance numbers: 350 random write IOPS on 8GB span, and 3.3k
> on 100% span. See
> http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/tech/425265.htm


You are correct that I interpreted the benchmark scenarios from the 
X25-M series documentation.  It seems reasonable for the same 
manufacturer to use the same benchmark methodology for similar 
products.  Then again, they are still new at this.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-22 Thread Meilicke, Scott
Interesting. We must have different setups with our PERCs. Mine have  
always auto rebuilt.


--
Scott Meilicke

On Oct 22, 2009, at 6:14 AM, Edward Ned Harvey
sola...@nedharvey.com wrote:

>> Replacing failed disks is easy when PERC is doing the RAID. Just remove
>> the failed drive and replace with a good one, and the PERC will rebuild
>> automatically.
>
> Sorry, not correct.  When you replace a failed drive, the perc card doesn't
> know for certain that the new drive you're adding is meant to be a
> replacement.  For all it knows, you could coincidentally be adding new disks
> for a new VirtualDevice which already contains data, during the failure
> state of some other device.  So it will not automatically resilver (which
> would be a permanently destructive process, applied to a disk which is not
> *certainly* meant for destruction).
>
> You have to open the perc config interface, tell it this disk is a
> replacement for the old disk (probably you're just saying This disk is the
> new global hotspare) or else the new disk will sit there like a bump on a
> log.  Doing nothing.




We value your opinion!  How may we serve you better? 
Please click the survey link to tell us how we are doing:

http://www.craneae.com/ContactUs/VoiceofCustomer.aspx
Your feedback is of the utmost importance to us. Thank you for your time.

Crane Aerospace  Electronics Confidentiality Statement:
The information contained in this email message may be privileged and is 
confidential information intended only for the use of the recipient, or any 
employee or agent responsible to deliver it to the intended recipient. Any 
unauthorized use, distribution or copying of this information is strictly prohibited 
and may be unlawful. If you have received this communication in error, please notify 
the sender immediately and destroy the original message and all attachments from 
your electronic files.




Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Marc Bevand
Bob Friesenhahn bfriesen at simple.dallas.tx.us writes:
> [...]
> X25-E's write cache is volatile), the X25-E has been found to offer a
> bit more than 1000 write IOPS.

I think this is incorrect. On paper the X25-E offers 3300 random write
4kB IOPS (and Intel is known to be very conservative about the IOPS perf 
numbers they publish). Dumb storage IOPS benchmark tools that don't issue 
parallel I/O ops to the drive tend to report numbers less than half the 
theoretical IOPS. This would explain why you see only 1000 IOPS.

I have direct evidence of this with Intel's other, MLC-based line of SSDs,
the X25-M: 35000 random read 4kB IOPS theoretical; one instance of a private 
benchmarking tool measures 6000, while 10+ instances of the tool together
measure 37000 IOPS (slightly better than the theoretical max!)
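
The gap between one instance and many follows from Little's law: IOPS is
roughly the number of outstanding I/Os divided by the per-I/O latency.
The sketch below applies that to the X25-M read figures quoted above.  It
assumes latency stays constant as more I/Os are kept in flight, which is
a simplification, and the function names are made up for illustration.

```python
import math


def implied_latency_s(iops_at_depth_one):
    """Per-I/O latency implied by a tool that issues one I/O at a time."""
    return 1.0 / iops_at_depth_one


def depth_needed(target_iops, latency_s):
    """Outstanding I/Os needed to reach target_iops at a fixed latency."""
    # Rounding first avoids a spurious extra unit from floating-point noise.
    return math.ceil(round(target_iops * latency_s, 9))


if __name__ == "__main__":
    latency = implied_latency_s(6000)      # single-instance measurement
    print(f"implied latency: {latency * 1e6:.0f} microseconds per I/O")
    print("outstanding I/Os needed for 35000 IOPS:",
          depth_needed(35000, latency))
```

With these inputs the model says about six I/Os must be in flight at
once to approach the datasheet figure, which a serial tool never does.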

-mrb



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Tristan Ball
What makes you say that the X25-E's cache can't be disabled or flushed? 
The net seems to be full of references to people who are disabling the 
cache, or flushing it frequently, and then complaining about the 
performance!


T

Frédéric VANNIERE wrote:
> The ZIL is a write-only log that is only read after a power failure.
> Several GB is large enough for most workloads.
>
> You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache
> that can be neither disabled nor flushed by ZFS.
>
> Imagine your server has a power failure while writing data to the pool. In
> a normal situation, with the ZIL on a reliable device, ZFS will read the ZIL
> and come back to a stable state at reboot. You may have lost some data
> (30 seconds) but the zpool works.  With the Intel X25-E as ZIL, some log
> data has been lost with the power failure (32/64MB max), which leads to a
> corrupted log and so ... you lose your zpool and all your data !!
>
> For the ZIL you need 2 reliable mirrored SSD devices with a supercapacitor
> that can flush the write cache to NAND when a power failure occurs.
>
> A hard disk has a write cache, but it can be disabled or flushed by the
> operating system.
>
> For more information:
> http://www.c0t0d0s0.org/archives/5993-Somewhat-stable-Solid-State.html




Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Meilicke, Scott
Thank you Bob and Richard. I will go with A, as it also keeps things simple.
One physical device per pool.

-Scott


On 10/20/09 6:46 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:

> On Tue, 20 Oct 2009, Richard Elling wrote:
>
> The ZIL device will never require more space than RAM.
> In other words, if you only have 16 GB of RAM, you won't need
> more than that for the separate log.
>
> Does the wasted storage space annoy you? :-)
>
> What happens if the machine is upgraded to 32GB of RAM later?
>
> The write performance of the X25-E is likely to be the bottleneck for a
> write-mostly storage server if the storage server has excellent
> network connectivity.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/





Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Meilicke, Scott
Thanks Ed. It sounds like you have run in this mode? No issues with  
the perc?


--
Scott Meilicke

On Oct 20, 2009, at 9:59 PM, Edward Ned Harvey
sola...@nedharvey.com wrote:

>> System:
>> Dell 2950
>> 16G RAM
>> 16 1.5T SATA disks in a SAS chassis hanging off of an LSI 3801e, no
>> extra drive slots, a single zpool.
>> svn_124, but with my zpool still running at the 2009.06 version (14).
>>
>> My plan is to put the SSD into an open disk slot on the 2950, but will
>> have to configure it as a RAID 0, since the onboard PERC5 controller
>> does not have a JBOD mode.
>
> You can JBOD with the perc.  It might be technically a raid0 or raid1 with a
> single disk in it, but that would be functionally equivalent to JBOD.







Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Edward Ned Harvey
> Thanks Ed. It sounds like you have run in this mode? No issues with
> the perc?
>
>> You can JBOD with the perc.  It might be technically a raid0 or raid1
>> with a single disk in it, but that would be functionally equivalent
>> to JBOD.

The only time I did this was ...

I have a Windows server, on a PE2950 with Perc5i, running on a disk mirror
for the OS with hotspare.  Then I needed to add some more space, and the
only disk I had available was a single 750G.  So I added it with no problem,
and I ordered another 750G to be the mirror of the first one.  I used a
single disk successfully, until the 2nd disk arrived, and then I enabled
mirroring from the 1st to the 2nd.  Everything went well.  No interruptions.
The system was a little slow while resilvering.

The one big obvious difference between my setup and yours is the OS.  I
expect that the OS doesn't change the capabilities of the Perc card, so I
think you should be fine.

The one comment I will make, in regards to the OS, which many people might
overlook, is ...

There are two interfaces to configure your Perc card.  One is the BIOS
interface, and the other is the Dell OpenManage Server Administrator
(managed node), a.k.a. the Dell OMSA Managed Node.  This provides an
interface at https://machine:1311 which allows you to configure the card,
monitor health, enable/disable hotspare, resilver a new disk, etc., while
the OS is running (no need to shut down into the BIOS).

OMSA is required in order to replace a failed disk without a reboot, to add
disks, or to do anything else you might want to do on the Perc card.  I know
OMSA is available for Windows and Linux.  How about Solaris?

Out of curiosity, I logged into Dell support just now to look up my 2950.
The supported OSes are Netware, Windows, RedHat, and Suse.  Which means, on
my system, if I were running Solaris, I could count on *not* being able to
run OMSA, and consequently the only interface to configure the Perc would be
the BIOS.  If Solaris is able to install at all, I would have to acknowledge
that I have to shut down any time I need to change the Perc configuration,
including replacing failed disks.



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Scott Meilicke
sigh

Thanks Frédéric, that is a very interesting read. 

So my options as I see them now:

1. Keep the x25-e, and disable the cache. Performance should still be improved, 
but not by a *whole* lot, right? I will google for an expectation, but if 
anyone knows off the top of their head, I would be appreciative.
2. Buy a ZEUS or similar SSD with a capacitor-backed cache. Pricing is a little
hard to come by, based on my quick google, but I am seeing $2-3k for an 8G
model. Is that right? Yowch.
3. Wait for the x25-e g2, which is rumored to have a capacitor-backed cache,
and may or may not work well (but probably will).
4. Put the x25-e with disabled cache behind my PERC with the PERC cache enabled.

My budget is tight. I want better performance now. #4 sounds good. Thoughts?

Regarding mirrored SSDs for the ZIL, it was my understanding that if the
SSD-backed ZIL failed, ZFS would fall back to using the regular pool for the
ZIL, correct? Assuming this is correct, a mirror would only be there to
preserve performance during a failure?

Thanks everyone, this has been really helpful.

-Scott


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Scott Meilicke
Ed, your comment:

> If Solaris is able to install at all, I would have to acknowledge, I
> have to shutdown anytime I need to change the Perc configuration, including
> replacing failed disks.

Replacing failed disks is easy when PERC is doing the RAID. Just remove the 
failed drive and replace with a good one, and the PERC will rebuild 
automatically. But are you talking about OpenSolaris managed RAID? I am pretty 
sure, though I have not tested it, that in pseudo JBOD mode (each disk a raid
0 or 1), the PERC would still present a replaced disk to the OS without
reconfiguring the PERC BIOS.

Scott


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Richard Elling

On Oct 20, 2009, at 10:24 PM, Frédéric VANNIERE wrote:

> The ZIL is a write-only log that is only read after a power failure.
> Several GB is large enough for most workloads.
>
> You can't use the Intel X25-E because it has a 32 or 64 MB volatile
> cache that can be neither disabled nor flushed by ZFS.


I am surprised by this assertion and cannot find any confirmation from
Intel.  Rather, the cache flush command is specifically mentioned as
supported in Section 6.1.1 of the Intel X25-E SATA Solid State Drive
Product Manual.
http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-datasheet.pdf

I suspect that this is confusion relating to the various file systems,
OSes, or virtualization platforms which may or may not by default ignore
cache flushes.

Since NTFS uses the cache flush commands, I would be very surprised if
Intel would intentionally ignore it.

> Imagine your server has a power failure while writing data to the
> pool. In a normal situation, with the ZIL on a reliable device, ZFS will
> read the ZIL and come back to a stable state at reboot. You may have
> lost some data (30 seconds) but the zpool works.  With the Intel
> X25-E as ZIL, some log data has been lost with the power failure
> (32/64MB max), which leads to a corrupted log and so ... you lose
> your zpool and all your data !!


The ZIL works fine for devices which support the cache flush command.
 -- richard



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Bob Friesenhahn

On Wed, 21 Oct 2009, Marc Bevand wrote:


> Bob Friesenhahn bfriesen at simple.dallas.tx.us writes:
>
>> [...]
>> X25-E's write cache is volatile), the X25-E has been found to offer a
>> bit more than 1000 write IOPS.
>
> I think this is incorrect. On paper the X25-E offers 3300 random write
> 4kB IOPS (and Intel is known to be very conservative about the IOPS perf
> numbers they publish). Dumb storage IOPS benchmark tools that don't issue
> parallel I/O ops to the drive tend to report numbers less than half the
> theoretical IOPS. This would explain why you see only 1000 IOPS.


The Intel specified random write IOPS are with the cache enabled and 
without cache flushing.  They also carefully only use a limited span 
of the device, which fits most perfectly with how the device is built. 
There is no mention of burning in the device for a few days to make 
sure that it is in a useful state.  In order for the test to be 
meaningful, the device needs to be loaded up for a while before taking 
any measurements.


Device performance should be specified as a minimum assured level of 
performance and not as meaningless peak (up to) values.  I repeat: 
peak values are meaningless.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread David Dyer-Bennet

On Wed, October 21, 2009 12:21, Bob Friesenhahn wrote:


> Device performance should be specified as a minimum assured level of
> performance and not as meaningless peak (up to) values.  I repeat:
> peak values are meaningless.

Seems a little pessimistic to me.  Certainly minimum assured values are
the basic thing people need to know, but reasonably characterized peak
values can be valuable, if the conditions yielding them match possible
application usage patterns.

The obvious example in electrical wiring is that the startup surge of
motors and the short-term over-current potential of circuit breakers
actually match each other fairly well, so that most saws (for example)
that can run comfortably on a given circuit can actually be *started* on
that circuit. Peak performance can have practical applications!

Certainly a really carefully optimized peak will almost certainly NOT
represent a useful possible performance level, and they should always be
considered meaningless until you've really proven otherwise.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Bob Friesenhahn

On Wed, 21 Oct 2009, David Dyer-Bennet wrote:


>> Device performance should be specified as a minimum assured level of
>> performance and not as meaningless peak (up to) values.  I repeat:
>> peak values are meaningless.
>
> Seems a little pessimistic to me.  Certainly minimum assured values are
> the basic thing people need to know, but reasonably characterized peak
> values can be valuable, if the conditions yielding them match possible
> application usage patterns.


Agreed. It is useful to know minimum, median, and peak values.  If 
there is a peak, it is useful to know how long that peak may be 
sustained. Intel's specifications have not characterized the actual 
performance of the device at all.


The performance characteristics of rotating media are well understood 
since they have been observed for tens of years.  From this we already 
know that the peak performance of a hard drive does not have much to 
do with its steady-state performance since the peak performance is 
often defined by the hard drive cache size and the interface type and 
clock rate.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread David Dyer-Bennet

On Wed, October 21, 2009 12:53, Bob Friesenhahn wrote:
> On Wed, 21 Oct 2009, David Dyer-Bennet wrote:
>
>>> Device performance should be specified as a minimum assured level of
>>> performance and not as meaningless peak (up to) values.  I repeat:
>>> peak values are meaningless.
>>
>> Seems a little pessimistic to me.  Certainly minimum assured values are
>> the basic thing people need to know, but reasonably characterized peak
>> values can be valuable, if the conditions yielding them match possible
>> application usage patterns.
>
> Agreed. It is useful to know minimum, median, and peak values.  If
> there is a peak, it is useful to know how long that peak may be
> sustained. Intel's specifications have not characterized the actual
> performance of the device at all.

And just a random number labeled as peak really IS meaningless, yes.

> The performance characteristics of rotating media are well understood
> since they have been observed for tens of years.  From this we already
> know that the peak performance of a hard drive does not have much to
> do with its steady-state performance since the peak performance is
> often defined by the hard drive cache size and the interface type and
> clock rate.

It strikes me that disks have been developing rather too independently of,
and sometimes in conflict with, requirements for reliable interaction with
the filesystems in various OSes.  Things like power-dependent write
caches.  Boosts peak write but not sustained write, which is probably
benchmark-friendly, AND introduces the problem of writes committed to the
drive not being safe in a power failure.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Paul B. Henson
On Tue, 20 Oct 2009, Frédéric VANNIERE wrote:

> You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache
> that can be neither disabled nor flushed by ZFS.

Say what? My understanding is that the officially supported Sun SSD for the
x4540 is an OEM'd Intel X25-E, so I don't see how it could not be a good
slog device.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Marc Bevand
Bob Friesenhahn bfriesen at simple.dallas.tx.us writes:
 
> The Intel specified random write IOPS are with the cache enabled and
> without cache flushing.

For random write I/O, caching improves I/O latency not sustained I/O 
throughput (which is what random write IOPS usually refer to). So Intel can't 
cheat with caching. However they can cheat by benchmarking a brand new drive 
instead of an aged one.

> They also carefully only use a limited span
> of the device, which fits most perfectly with how the device is built.

AFAIK, for the X25-E series, they benchmark random write IOPS on a 100% span. 
You may be confusing it with the X25-M series with which they actually clearly 
disclose two performance numbers: 350 random write IOPS on 8GB span, and 3.3k 
on 100% span. See 
http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/tech/425265.htm

I agree with the rest of your email.

-mrb



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-20 Thread Bob Friesenhahn

On Tue, 20 Oct 2009, Scott Meilicke wrote:


> A. Use all 32G for the ZIL
> B. Use 8G for the ZIL, 24G for an L2ARC. Any issues with slicing up an SSD
> like this?
> C. Use 8G for the ZIL, 16G for an L2ARC, and reserve 8G to be used as a ZIL
> for the future zpool.
>
> Since my future zpool would just be used as a backup to disk target,
> I am leaning towards option C. Any gotchas I should be aware of?


Option A seems better to me.  The reason why it seems better is that 
any write to the device consumes write IOPS and the X25-E does not 
really have that many to go around.  FLASH SSDs don't really handle 
writes all that well due to the need to erase larger blocks than are 
actually written.  Contention for access will simply make matters 
worse.  With its write cache disabled (which you should do since the 
X25-E's write cache is volatile), the X25-E has been found to offer a 
bit more than 1000 write IOPS.  With 16GB of RAM, you should not need 
an L2ARC for a backup-to-disk target (a write-mostly application). 
The ZFS ARC will be able to expand to 14GB or so, which is quite a lot 
of read caching already.
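
A rough way to check a write IOPS figure like the one above on your own hardware is to time small writes that are each followed by an fsync. On ZFS, each fsync should result in an intent log commit, so the rate approximates what the log device can sustain. This is only a minimal sketch in Python, not a proper benchmark tool: the function name, the 4096-byte block size, and the write count are assumptions chosen for illustration, and the scratch directory must be pointed at a dataset on the pool you want to test.

```python
import os
import tempfile
import time


def measure_sync_write_iops(directory, writes=100, block_size=4096):
    """Return the number of fsync'd writes per second to a scratch file.

    `directory` should live on the pool under test (hypothetical path,
    supplied by the caller). Each write is followed by os.fsync() so the
    OS has to push it to stable storage before the next write starts.
    """
    payload = os.urandom(block_size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, payload)
            os.fsync(fd)  # force this write out before continuing
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    # Guard against a zero interval on very coarse clocks.
    return writes / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # The temp directory is a placeholder; replace it with a pool dataset.
    rate = measure_sync_write_iops(tempfile.gettempdir())
    print(f"{rate:.0f} synchronous writes per second")
```

Comparing the result with the drive's write cache enabled and then disabled gives a feel for how much of the advertised rate depends on the volatile cache.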


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-20 Thread Richard Elling

On Oct 20, 2009, at 4:44 PM, Bob Friesenhahn wrote:


On Tue, 20 Oct 2009, Scott Meilicke wrote:


A. Use all 32G for the ZIL
B. Use 8G for the ZIL, 24G for an L2ARC. Any issues with slicing up  
an SSD like this?
C. Use 8G for the ZIL, 16G for an L2ARC, and reserve 8G to be used  
as a ZIL for the future zpool.


Since my future zpool would just be used as a backup to disk  
target, I am leaning towards option C. Any gotchas I should be  
aware of?


Option A seems better to me.  The reason why it seems better is  
that any write to the device consumes write IOPS and the X25-E does  
not really have that many to go around.  FLASH SSDs don't really  
handle writes all that well due to the need to erase larger blocks  
than are actually written.  Contention for access will simply make  
matters worse.  With its write cache disabled (which you should do  
since the X25-E's write cache is volatile), the X25-E has been found  
to offer a bit more than 1000 write IOPS.  With 16GB of RAM, you  
should not need an L2ARC for a backup-to-disk target (a write-mostly  
application). The ZFS ARC will be able to expand to 14GB or so,  
which is quite a lot of read caching already.


The ZIL device will never require more space than RAM.
In other words, if you only have 16 GB of RAM, you won't need
more than that for the separate log.
 -- richard
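
The sizing rule above can be applied to the three proposed layouts with a few lines of arithmetic. The sketch below is illustrative only: the function name is made up, and it takes the "never more than RAM" rule at face value, with the 16 GB of RAM and the 32/8/8 GB log slices taken from options A, B and C in this thread.

```python
def useful_slog_gb(ram_gb, slice_gb):
    """Portion of a log slice that can ever be used, assuming the rule
    that a separate log never needs more space than the machine has RAM."""
    return min(ram_gb, slice_gb)


RAM_GB = 16  # RAM in the system described in this thread

# GB of the 32 GB SSD given to the ZIL under each proposed option.
options = {"A": 32, "B": 8, "C": 8}

for name, zil_gb in options.items():
    usable = useful_slog_gb(RAM_GB, zil_gb)
    idle = zil_gb - usable
    print(f"Option {name}: {zil_gb} GB log slice, "
          f"at most {usable} GB usable, {idle} GB idle")
```

Under this rule, option A leaves half of the device idle as log space, while the 8 GB slices in options B and C sit below the RAM bound.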



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-20 Thread Bob Friesenhahn

On Tue, 20 Oct 2009, Richard Elling wrote:


The ZIL device will never require more space than RAM.
In other words, if you only have 16 GB of RAM, you won't need
more than that for the separate log.


Does the wasted storage space annoy you? :-)

What happens if the machine is upgraded to 32GB of RAM later?

The write performance of the X25-E is likely to be the bottleneck for a 
write-mostly storage server if the storage server has excellent 
network connectivity.
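
To put rough numbers on the bottleneck claim, the sketch below compares what a log device sustaining about 1000 write IOPS (the cache-disabled figure quoted earlier in this thread) can absorb against a network link. Everything else is an assumption for illustration: the write sizes, the gigabit link, the simplification that every incoming write is synchronous, and the simplification that IOPS stays constant as the write size grows.

```python
def log_throughput_mb_s(write_iops, write_size_kib):
    """Throughput in MB/s (10^6 bytes) for a given IOPS and write size."""
    return write_iops * write_size_kib * 1024 / 1e6


X25E_WRITE_IOPS = 1000          # figure quoted in this thread, cache disabled
GIGABIT_MB_S = 1e9 / 8 / 1e6    # 125 MB/s raw line rate, before overhead

for size_kib in (4, 32, 128):   # assumed synchronous write sizes
    mb_s = log_throughput_mb_s(X25E_WRITE_IOPS, size_kib)
    limit = "log device" if mb_s < GIGABIT_MB_S else "network"
    print(f"{size_kib:>3} KiB writes: {mb_s:7.1f} MB/s through the log, "
          f"limited by the {limit}")
```

With small synchronous writes the log device saturates long before a gigabit link does, which is the situation described above. Only at large write sizes, and only if the IOPS figure held up, would the network become the limit.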


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-20 Thread Edward Ned Harvey
 System:
 Dell 2950
 16G RAM
 16 1.5T SATA disks in a SAS chassis hanging off of an LSI 3801e, no
 extra drive slots, a single zpool.
 svn_124, but with my zpool still running at the 2009.06 version (14).
 
 My plan is to put the SSD into an open disk slot on the 2950, but will
 have to configure it as a RAID 0, since the onboard PERC5 controller
 does not have a JBOD mode.

You can do JBOD with the PERC.  Technically it might be a RAID 0 or RAID 1 with a
single disk in it, but that would be functionally equivalent to JBOD.




Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-20 Thread Frédéric VANNIERE
The ZIL is a write-only log that is only read after a power failure. Several GB 
is large enough for most workloads. 

You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache that 
can be neither disabled nor flushed by ZFS. 

Imagine your server has a power failure while writing data to the pool. In a 
normal situation, with the ZIL on a reliable device, ZFS will read the ZIL and 
come back to a stable state at reboot. You may have lost some data (30 seconds) 
but the zpool works. With the Intel X25-E as the ZIL, some log data has been 
lost with the power failure (32/64MB max), which leads to a corrupted log and 
so ... you lose your zpool and all your data!!

For the ZIL you need 2 reliable mirrored SSD devices with a supercapacitor that 
can flush the write cache to NAND when a power failure occurs. 

A hard disk has a write cache, but it can be disabled or flushed by the 
operating system.

For more information: 
http://www.c0t0d0s0.org/archives/5993-Somewhat-stable-Solid-State.html
-- 
This message posted from opensolaris.org