Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?

2010-01-13 Thread Richard Elling
On Jan 12, 2010, at 7:46 PM, Brad wrote:

 Richard,
 
 Yes, write cache is enabled by default, depending on the pool configuration.
 Is it enabled for a striped (mirrored) zpool configuration?  I'm asking because
 of a concern I've read on this forum about a problem with SSDs (and disks)
 where, if a power outage occurs, any data in the cache that hasn't been flushed
 to disk would be lost.

If the vdev is a whole disk (for Solaris == not a slice), then ZFS will attempt
to enable the disk's write cache. By default, Solaris will not enable the write
cache on disks, in part because it causes bad juju for UFS.  This is independent
of the data protection configuration.
 -- richard
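
As a rough illustration of that rule, here is a toy Python sketch (the function
and the vdev flag are made up for the example, not the actual ZFS or sd-driver
logic):

    # Toy model: ZFS only turns the disk write cache on when it owns the
    # whole disk, because it can then guarantee flushes at commit points.
    # On a slice shared with other consumers (e.g. UFS) the OS default,
    # which is cache disabled, is left alone.
    def write_cache_setting(vdev_is_whole_disk: bool) -> str:
        return "enable" if vdev_is_whole_disk else "leave at OS default (off)"

    for whole in (True, False):
        kind = "whole disk" if whole else "slice"
        print(f"{kind}: write cache -> {write_cache_setting(whole)}")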

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] x4500/x4540 does the internal controllers have a bbu?

2010-01-12 Thread Brad
Has anyone worked with an x4500/x4540 and know if the internal RAID controllers
have a BBU?  I'm concerned that we won't be able to turn off the write cache on
the internal HDs and SSDs to prevent data corruption in case of a power failure.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?

2010-01-12 Thread Toby Thain


On 12-Jan-10, at 5:53 AM, Brad wrote:

Has anyone worked with an x4500/x4540 and know if the internal RAID
controllers have a BBU?  I'm concerned that we won't be able to turn
off the write cache on the internal HDs and SSDs to prevent data
corruption in case of a power failure.



A power failure won't corrupt data even with the write cache enabled,
under the assumptions about device behaviour recently mentioned on the
list. (Caching isn't the problem; ordering is.)


The Sun machines must be tested and qualified for correct behaviour.

--Toby


--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?

2010-01-12 Thread Richard Elling

On Jan 12, 2010, at 2:53 AM, Brad wrote:

 Has anyone worked with an x4500/x4540 and know if the internal RAID
 controllers have a BBU?  I'm concerned that we won't be able to turn off the
 write cache on the internal HDs and SSDs to prevent data corruption in case
 of a power failure.

Yes, write cache is enabled by default, depending on the pool configuration.
This is true for all disks that support write caches.

The key to making this work is whether the device honors the cache flush
command. The disks qualified for the X4500/X4540 honor the cache flush
command.
 -- richard
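
To make "honors the cache flush command" concrete, here is a toy Python model
of a disk with a volatile write cache; the class is an illustration under that
description, not real firmware behaviour:

    class CachingDisk:
        """Toy disk: writes land in a volatile cache; flush makes them durable."""
        def __init__(self, honors_flush):
            self.cache = []            # volatile: lost on power failure
            self.media = []            # durable
            self.honors_flush = honors_flush

        def write(self, block):
            self.cache.append(block)   # acknowledged while still only in cache

        def flush(self):
            if self.honors_flush:      # honest device drains the cache...
                self.media += self.cache
                self.cache = []
            # ...a broken device reports success without doing so

        def power_failure(self):
            self.cache = []            # whatever was only cached is gone

    for honors in (True, False):
        d = CachingDisk(honors)
        d.write("txg data"); d.flush(); d.write("later data"); d.power_failure()
        print("honors flush:", honors, "-> on media after power loss:", d.media)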

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?

2010-01-12 Thread Brad
(Caching isn't the problem; ordering is.)

Weird. I was reading about a problem with SSDs (Intel X25-E) where, if the power
goes out and the data in the cache has not been flushed, you lose data.

Could you elaborate on ordering?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?

2010-01-12 Thread Brad
Richard,

Yes, write cache is enabled by default, depending on the pool configuration.
Is it enabled for a striped (mirrored) zpool configuration?  I'm asking because
of a concern I've read on this forum about a problem with SSDs (and disks)
where, if a power outage occurs, any data in the cache that hasn't been flushed
to disk would be lost.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?

2010-01-12 Thread Toby Thain


On 12-Jan-10, at 10:40 PM, Brad wrote:


(Caching isn't the problem; ordering is.)

Weird. I was reading about a problem with SSDs (Intel X25-E) where, if the
power goes out and the data in the cache has not been flushed, you lose data.


Could you elaborate on ordering?



ZFS integrity is maintained if the device correctly respects flush/barrier
semantics, which, as required, enforce an ordering of operations. The
synchronous completion of flush guarantees that prior writes have durably
completed. This is irrespective of write caching.


When a device does not properly flush, all bets are off, because
inflight data (including unwritten data in the write cache) is not
written in any determinate manner (you cannot know what was written,
or in what order). The precondition for an atomic überblock update is
that the tree of blocks it references has been fully written.
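
A toy Python sketch of that ordering argument (made-up block names, not ZFS
code): with working barriers the uberblock can only be durable if the tree
blocks issued before the flush are durable too; a device that reorders freely
can leave a dangling uberblock.

    import random

    def survives_power_loss(device_reorders, rng):
        """Return which writes reached media before power failed (toy model)."""
        # ZFS issues the tree blocks, a cache flush (barrier), then the uberblock.
        ops = ["block A", "block B", "uberblock N"]
        if device_reorders:
            ops = rng.sample(ops, k=len(ops))   # barrier ignored: any order possible
        cut = rng.randrange(len(ops) + 1)       # power fails at an arbitrary point
        return ops[:cut]

    rng = random.Random(1)
    for reorders in (False, True):
        dangling = any(
            "uberblock N" in (m := survives_power_loss(reorders, rng))
            and not {"block A", "block B"} <= set(m)
            for _ in range(1000)
        )
        print("device reorders across the barrier:", reorders,
              "-> dangling uberblock possible:", dangling)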


This has been mentioned periodically on the list. I thought somebody
(Richard Elling?) did a nice capsule summary recently but I can't find
it, so here are some other past list snippets by more knowledgeable
people than I.


Neil Perrin, 6 Dec, 2009:


ZFS uses a 3 stage transaction model: Open, Quiescing and Syncing.
Transactions enter in Open. Quiescing is where a new Open stage has
started and waits for transactions that have yet to commit to finish.
Syncing is where all the completed transactions are pushed to the pool
in an atomic manner with the last write being the root of the new tree
of blocks (uberblock).

All the guarantees assume good hardware. As part of the new uberblock
update we flush the write caches of the pool devices. If this is broken
all bets are off.
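
Here is a toy Python sketch of that three-stage model (the class and its log
are illustrative, not Neil's code); each commit quiesces the current open
group, syncs it, and finishes with the uberblock write bracketed by cache
flushes:

    class TxgPipeline:
        """Toy model of the open / quiescing / syncing transaction groups."""
        def __init__(self):
            self.txg = 1
            self.open = []       # stage 1: new transactions enter here
            self.log = []        # what gets pushed to the pool, in order

        def add(self, tx):
            self.open.append(tx)

        def commit(self):
            # Stage 2 (quiescing): stop admitting into this group and wait for
            # its transactions to finish; a fresh open group starts right away.
            quiescing, self.open = self.open, []
            # Stage 3 (syncing): push the group, then write the new uberblock
            # last, with cache flushes so the ordering holds on the devices.
            self.log += quiescing + ["FLUSH", f"uberblock {self.txg}", "FLUSH"]
            self.txg += 1

    p = TxgPipeline()
    p.add("write file1"); p.add("write file2"); p.commit()
    p.add("write file3"); p.commit()
    print(p.log)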


14 Oct, 2009, James R. Van Artsdalen:


ZFS is different because it uses a different superblock every few
seconds (every transaction commit), and more importantly, the top levels
of the filesystem and some pool metadata are moved too.  After every tx
commit the uberblock is in a different place and some of its pointers
are to different places.

Moreover, blocks that were freed by this process are rapidly reclaimed.
The uberblock itself is not reclaimed for another 127 commits - several
minutes - but the things it points to are.  In other words as soon as tx
group N is committed, blocks from N-1 that are no longer referenced are
reclaimed as free space.

What goes wrong when the write fence / cache flush doesn't happen:

As soon as the uberblock for tx group N is written everything from N-1
that is no longer referenced is marked free for reallocation, and these
newly-freed blocks often contain part of the top levels of the N-1 pool
/ filesystems and metadata.

If the uberblock for N is _not_ written to media when it was supposed to
be then ZFS may happily reuse the blocks from N-1 while the uberblock
for N-1 is still the most recent on media, instead of N as ZFS expects.

In other words there might be a window where the most recent uberblock
on disk media (N-1) points to a toplevel directory block that is
overwritten with unrelated data - disaster.

That window closes once uberblock N hits media.  Unfortunately with no
write fence it might be a long time before that happens.  ...
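
The window described above can be simulated in a few lines of Python (a toy
model with made-up block numbers): if the uberblock for txg N stays stuck in
the drive's cache while later commits reuse N-1's freed blocks, the most
recent uberblock on media ends up pointing at overwritten data.

    on_media = {}                     # block number -> which txg's data is there
    latest_uberblock = None           # most recent uberblock that reached media

    def commit(txg, blocks, uberblock_reaches_media=True):
        global latest_uberblock
        for b in blocks:
            on_media[b] = txg         # data/metadata writes for this txg hit media
        if uberblock_reaches_media:
            latest_uberblock = (txg, blocks)

    commit(1, {10, 11, 12})           # txg 1 tree at blocks 10-12
    # No write fence: the next uberblocks sit in the drive cache while later
    # data writes drain to media; txg 2 frees txg 1's blocks, txg 3 reuses one.
    commit(2, {20, 21, 22}, uberblock_reaches_media=False)
    commit(3, {10, 20, 30}, uberblock_reaches_media=False)
    # Power fails here; the cached uberblocks for txgs 2 and 3 are lost.

    txg, blocks = latest_uberblock
    clobbered = sorted(b for b in blocks if on_media[b] != txg)
    print(f"newest uberblock on media is txg {txg}; "
          f"blocks it references now holding other data: {clobbered}")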


10 Oct, 2009, James Relph quotes Dominic Giampaolo:


Last, I do not believe that the crash protection scheme used
by ZFS can ever work reliably on drives that drop the flush
track cache request.  The only approach that is guaranteed to
work is to keep enough data in a log that when you remount the
drive, you can replay more data than the drive could have kept
cached.


Nicolas Williams, 13 Feb, 2009:


Also, note that ignoring barriers is effectively as bad as dropping
writes if there's any chance that some writes will never hit the disk
because of, say, power failures.  Imagine 100 txgs, but some writes from
the first txg never hitting the disk because the drive keeps them in the
cache without flushing them for too long, then you pull out the disk, or
power fails -- in that case not even fallback to older txgs will help
you, there'd be nothing that ZFS could do to help you.
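
A short Python sketch of that scenario (again a toy model with invented block
names): if a block from txg 1 never leaves the drive's cache, every uberblock
ZFS could fall back to references a block that simply is not on the media.

    media = set()                    # blocks that actually reached the platters
    cache = {"file-data@txg1"}       # kept in the drive cache, never flushed

    # Trees for txgs 1..100 all reference the txg-1 file data plus their own
    # metadata block; the metadata writes did make it to media over time.
    trees = {txg: {"file-data@txg1", f"metadata@txg{txg}"} for txg in range(1, 101)}
    for txg in trees:
        media.add(f"metadata@txg{txg}")

    cache.clear()                    # power failure: cached writes are gone

    recoverable = [txg for txg, blocks in trees.items() if blocks <= media]
    print("txgs whose trees are fully on media:", recoverable)   # [] -> no fallback helps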


Peter Schuller, 10 Feb, 2009:


What's stopping a RAID device from,
for example, ACK:ing an I/O before it is even in the cache? I have not
designed RAID controller firmware so I am not sure how likely that is,
but I don't see it as an impossibility. Disabling flushing because you
have battery backed nvram implies that your battery-backed nvram
guarantees ordering of all writes, and that nothing is ever placed in
said battery backed cache out of order.



Jeff Bonwick, 12 Feb, 2007:


Even if you disable the intent log, the transactional nature
of ZFS ensures preservation of event ordering.  Note that disk caches
don't come into it: ZFS builds up a wad of transactions in memory,
then pushes them out as a transaction group.  That entire group will
either commit or not.  ZFS writes all the new data to new locations,
then flushes all disk write caches, then