[zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Jim Dunham
A recent increase in email about ZFS and SNDR (the replication
component of Availability Suite) has given me reason to post one of
my replies.


Well, now I'm confused! A colleague just pointed me towards your blog
entry about SNDR and ZFS which, until now, I thought was not a  
supported configuration. So, could you confirm that for me one way  
or the other?


ZFS is supported with SNDR, because SNDR is filesystem agnostic. That  
said, ZFS is a very different beast than other Solaris filesystems.


The two golden rules of ZFS replication are:

1). All volumes in a ZFS storage pool (see the output of zpool
status) must be placed in a single SNDR I/O consistency group. ZFS is
the first Solaris filesystem that validates consistency at all
levels, so all vdevs in a single storage pool must be replicated in a
write-order-consistent manner, and an I/O consistency group is the
means to accomplish this.
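
By way of illustration only -- the host names, device paths, and the
group name "tankgrp" below are invented, so check sndradm(1M) for the
exact set syntax -- enabling two vdevs of one pool into a single I/O
consistency group looks roughly like:

   sndradm -e primhost /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t0d0s1 \
           sechost /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t0d0s1 ip async g tankgrp
   sndradm -e primhost /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 \
           sechost /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 ip async g tankgrp

   # from here on, address the whole group at once, e.g.:
   sndradm -g tankgrp -u   # update (resync) every set in the group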


2). While SNDR replication is active, do not attempt to zpool import
the SNDR secondary volumes, and while the ZFS storage pool is
imported on the SNDR secondary node, do not resume replication. This
is truly a double-edged sword: the instance of ZFS running on the
SNDR secondary node will see replicated writes arriving from ZFS on
the SNDR primary node, consider these unknown checksums a form of
data corruption, and panic Solaris. This is the same reason two or
more Solaris hosts can't access the same ZFS storage pool in a SAN.


There is a slight safety net here, in that zpool import will report
that the ZFS storage pool is active on another node. Unfortunately,
stopping replication does not change this state, so you will still
need to use the -f (force) option, unless the zpool is in the
exported state on the SNDR primary node, as the exported state will
be replicated to the SNDR secondary node.
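
Roughly, then, the supported way to look at the secondary is as
follows (names carried over from the sketch above, i.e. invented):

   sndradm -g tankgrp -l   # place the whole group into logging mode

   # on the SNDR secondary node; -f is needed unless the pool was
   # exported on the primary before entering logging mode
   zpool import -f tank
   ...
   zpool export tank       # on the secondary, when finished

   sndradm -g tankgrp -u   # resume replication (primary -> secondary)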


Of course I know that AVS only cares about blocks, so in principle
the FS is irrelevant. However, the last time I researched this, I
found a doc explaining that the lack of support was due to the
unpredictable nature of ZFS background processes (resilver, etc.),
and the resulting inability to guarantee a truly quiesced FS.


ZFS the filesystem is always on-disk consistent, and ZFS does
maintain filesystem consistency through coordination between the ZPL
(ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for
SNDR, ZFS caches a lot of an application's filesystem data in the
ZIL, so the data is in memory, not yet written to disk, and SNDR does
not know this data exists. ZIL flushes to disk can be seconds behind
the actual application writes completing, and if SNDR is running
asynchronously, the replicated writes to the SNDR secondary can be
additional seconds behind the actual application writes.


Unlike UFS filesystems with lockfs -f or lockfs -w, there is no
'supported' way to get ZFS to empty the ZIL to disk on demand. So
even though one will get both ZFS and application filesystem
consistency within the SNDR secondary volume, there can be many
seconds' worth of lost data, since SNDR can't replicate what it does
not see.
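
For comparison, the UFS procedure alluded to above is roughly this
(mount point and group name invented):

   lockfs -w /ufs_fs       # write-lock: flush dirty data, hold new writes
   sndradm -g ufsgrp -l    # the secondary now holds a quiesced UFS image
   lockfs -u /ufs_fs       # release the write lock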



Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Darren J Moffat

Jim Dunham wrote:
Unlike UFS filesystems with lockfs -f or lockfs -w, there is no
'supported' way to get ZFS to empty the ZIL to disk on demand. So
even though one will get both ZFS and application filesystem
consistency within the SNDR secondary volume, there can be many
seconds' worth of lost data, since SNDR can't replicate what it does
not see.


If the application depends on that then it should be using O_DSYNC - if 
it isn't then it is broken.  In which case the ZIL is on disk and SNDR 
should be replicating that too (either because the dataset ZIL is in 
pool or because the slog device is part of the SNDR consistency group as 
well).
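
For instance (device paths and names invented), a pool's separate log
device appears under the "logs" heading in zpool status, and needs
its own SNDR set in the same I/O consistency group as the data vdevs:

   zpool status tank       # any slog shows up under "logs"
   sndradm -e primhost /dev/rdsk/c2t0d0s0 /dev/rdsk/c2t0d0s1 \
           sechost /dev/rdsk/c2t0d0s0 /dev/rdsk/c2t0d0s1 ip async g tankgrp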


--
Darren J Moffat


Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Andrew Gabriel

Jim Dunham wrote:
ZFS the filesystem is always on-disk consistent, and ZFS does
maintain filesystem consistency through coordination between the ZPL
(ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for
SNDR, ZFS caches a lot of an application's filesystem data in the
ZIL, so the data is in memory, not yet written to disk, and SNDR does
not know this data exists. ZIL flushes to disk can be seconds behind
the actual application writes completing, and if SNDR is running
asynchronously, the replicated writes to the SNDR secondary can be
additional seconds behind the actual application writes.


Unlike UFS filesystems with lockfs -f or lockfs -w, there is no
'supported' way to get ZFS to empty the ZIL to disk on demand.


I'm wondering if you really meant ZIL here, or ARC?

In either case, creating a snapshot should get both flushed to disk, I 
think?
(If you don't actually need a snapshot, simply destroy it immediately 
afterwards.)
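
That is, something like (dataset name invented):

   zfs snapshot tank/fs@flush
   zfs destroy tank/fs@flush

(Though see Jim's test later in the thread, which suggests the
snapshot does not capture asynchronous writes still in memory.)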


--
Andrew


Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Jonathan Edwards


On Mar 6, 2009, at 8:58 AM, Andrew Gabriel wrote:


Jim Dunham wrote:
ZFS the filesystem is always on-disk consistent, and ZFS does
maintain filesystem consistency through coordination between the ZPL
(ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for
SNDR, ZFS caches a lot of an application's filesystem data in the
ZIL, so the data is in memory, not yet written to disk, and SNDR does
not know this data exists. ZIL flushes to disk can be seconds behind
the actual application writes completing, and if SNDR is running
asynchronously, the replicated writes to the SNDR secondary can be
additional seconds behind the actual application writes.


Unlike UFS filesystems with lockfs -f or lockfs -w, there is no
'supported' way to get ZFS to empty the ZIL to disk on demand.


I'm wondering if you really meant ZIL here, or ARC?

In either case, creating a snapshot should get both flushed to disk,  
I think?
(If you don't actually need a snapshot, simply destroy it  
immediately afterwards.)


Not sure if there's another way to trigger a full flush or lockfs,
but to make sure you do have all transactions that may not have been
flushed from the ARC, you could just unmount the filesystem or export
the zpool. With the latter, you wouldn't have to worry about the -f
on the import.
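
In other words, roughly this (pool and group names invented, and the
exact -w wait semantics should be checked against sndradm(1M)):

   # on the primary
   zpool export tank       # flushes everything, records the exported state
   sndradm -g tankgrp -w   # wait for queued replication to drain
   sndradm -g tankgrp -l   # drop the group into logging mode

   # on the secondary; no -f needed, since the exported state replicated
   zpool import tank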


---
.je


Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Jim Dunham

Andrew,


Jim Dunham wrote:
ZFS the filesystem is always on-disk consistent, and ZFS does
maintain filesystem consistency through coordination between the ZPL
(ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for
SNDR, ZFS caches a lot of an application's filesystem data in the
ZIL, so the data is in memory, not yet written to disk, and SNDR does
not know this data exists. ZIL flushes to disk can be seconds behind
the actual application writes completing, and if SNDR is running
asynchronously, the replicated writes to the SNDR secondary can be
additional seconds behind the actual application writes.


Unlike UFS filesystems with lockfs -f or lockfs -w, there is no
'supported' way to get ZFS to empty the ZIL to disk on demand.


I'm wondering if you really meant ZIL here, or ARC?


It is my understanding that the ZFS intent log (ZIL) satisfies POSIX
requirements for synchronous transactions, thus filesystem
consistency. The ZFS adaptive replacement cache (ARC) is where
uncommitted filesystem data is being cached. So although unwritten
filesystem data is allocated from the DMU and retained in the ARC, it
is the ZIL that influences filesystem metadata and data consistency
on disk.


In either case, creating a snapshot should get both flushed to disk,  
I think?


No. A ZFS snapshot is a control-path operation, versus a data-path
operation, and (to the best of my understanding, and testing) has no
influence over POSIX filesystem consistency. See the discussion here:
http://www.opensolaris.org/jive/click.jspa?searchID=1695691&messageID=124809


Invoking a ZFS snapshot will ensure that the ZFS snapshot is
consistent on the replicated disk, but not that all actively open
files are.


A simple test I performed to verify this was to append to a ZFS file
(no synchronous filesystem options being set) a series of blocks with
a block-order pattern contained within. At some random point in this
process, I took a ZFS snapshot, then immediately dropped SNDR into
logging mode. When importing the ZFS storage pool on the SNDR remote
host, I could see the ZFS snapshot just taken, but neither the
snapshot version of the file nor the file itself contained all of the
data previously written to it.


I then retested, but opened the file with O_DSYNC, and when following
the same test steps above, both the snapshot version of the file and
the file itself contained all of the data previously written to it.
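
A minimal sketch of that sort of test (paths invented; the shell's >>
append is plain buffered I/O, so this reproduces only the
non-synchronous case -- the O_DSYNC variant needs a small program):

   # writer, on the SNDR primary: append numbered records forever
   i=0
   while :; do
           printf '%010d\n' $i >> /tank/fs/pattern.dat
           i=`expr $i + 1`
   done

   # check, on the secondary after the import: line N must hold N-1,
   # so lost trailing data shows up as a short file, and any gap in
   # the sequence would indicate out-of-order (inconsistent) writes
   awk '$1 + 0 != NR - 1 { print "gap at line " NR; exit 1 }' \
           /tank/fs/pattern.dat
   tail -1 /tank/fs/pattern.dat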


(If you don't actually need a snapshot, simply destroy it  
immediately afterwards.)


--
Andrew


Jim



Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Neil Perrin

I'd like to correct a few misconceptions about the ZIL here.

On 03/06/09 06:01, Jim Dunham wrote:
ZFS the filesystem is always on-disk consistent, and ZFS does
maintain filesystem consistency through coordination between the ZPL
(ZFS POSIX Layer) and the ZIL (ZFS Intent Log).


Pool and file system consistency is more a function of the DMU & SPA.

Unfortunately for SNDR, ZFS caches a lot of an application's
filesystem data in the ZIL, so the data is in memory, not yet written
to disk,


ZFS data is actually cached in the ARC. The ZIL code keeps in-memory records
of system call transactions in case a fsync() occurs.

and SNDR does not know this data exists. ZIL flushes to disk can be
seconds behind the actual application writes completing,


It's the DMU/SPA that handles the transaction group commits (not the
ZIL). Currently these occur every 30 seconds, or more frequently on a
loaded system.

and if SNDR is running asynchronously, the replicated writes to the
SNDR secondary can be additional seconds behind the actual
application writes.


Unlike UFS filesystems with lockfs -f or lockfs -w, there is no
'supported' way to get ZFS to empty the ZIL to disk on demand.


The sync(2) system call is implemented differently in ZFS.
For UFS it initiates a flush of cached data to disk, but does
not wait for completion. This satisfies the POSIX requirement but
never seemed right. For ZFS we wait for all transactions
to complete and commit to stable storage (including flushing any
disk write caches) before returning. So any asynchronous data
in the ARC is written.

Alternatively, a lockfs will flush just a file system to stable storage
but in this case just the intent log is written. (Then later when
the txg commits those intent log records are discarded).
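
So a hedged sketch of the two options just described (pool and mount
point names invented):

   sync                    # sync(2) on ZFS: returns only after all
                           # transactions commit to stable storage
   lockfs -f /tank/fs      # or flush a single file system, via its
                           # intent log, per the description above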

For some basic info on the ZIL see:
http://blogs.sun.com/perrin/entry/the_lumberjack

Neil.


Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Neil Perrin



On 03/06/09 08:10, Jim Dunham wrote:

Andrew,


Jim Dunham wrote:
ZFS the filesystem is always on-disk consistent, and ZFS does
maintain filesystem consistency through coordination between the ZPL
(ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for
SNDR, ZFS caches a lot of an application's filesystem data in the
ZIL, so the data is in memory, not yet written to disk, and SNDR does
not know this data exists. ZIL flushes to disk can be seconds behind
the actual application writes completing, and if SNDR is running
asynchronously, the replicated writes to the SNDR secondary can be
additional seconds behind the actual application writes.


Unlike UFS filesystems with lockfs -f or lockfs -w, there is no
'supported' way to get ZFS to empty the ZIL to disk on demand.


I'm wondering if you really meant ZIL here, or ARC?


It is my understanding that the ZFS intent log (ZIL) satisfies POSIX 
requirements for synchronous transactions,


True.


thus filesystem consistency.


No. The filesystems in the pool are always consistent with or without
the ZIL.  The ZIL is not the same as a journal (or the log in UFS).

The ZFS adaptive replacement cache (ARC) is where uncommitted
filesystem data is being cached. So although unwritten filesystem
data is allocated from the DMU and retained in the ARC, it is the ZIL
that influences filesystem metadata and data consistency on disk.


No. It just ensures the synchronous requests (O_DSYNC, fsync() etc)
are on stable storage in case a crash/power fail occurs before
the dirty ARC is written when the txg commits.



In either case, creating a snapshot should get both flushed to disk, I 
think?


No. A ZFS snapshot is a control-path operation, versus a data-path
operation, and (to the best of my understanding, and testing) has no
influence over POSIX filesystem consistency. See the discussion here:
http://www.opensolaris.org/jive/click.jspa?searchID=1695691&messageID=124809


Invoking a ZFS snapshot will ensure that the ZFS snapshot is
consistent on the replicated disk, but not that all actively open
files are.


A simple test I performed to verify this was to append to a ZFS file
(no synchronous filesystem options being set) a series of blocks with
a block-order pattern contained within. At some random point in this
process, I took a ZFS snapshot, then immediately dropped SNDR into
logging mode. When importing the ZFS storage pool on the SNDR remote
host, I could see the ZFS snapshot just taken, but neither the
snapshot version of the file nor the file itself contained all of the
data previously written to it.


That seems like a bug in ZFS to me. A snapshot ought to contain all data
that has been written (whether synchronous or asynchronous) prior to the 
snapshot.



I then retested, but opened the file with O_DSYNC, and when following
the same test steps above, both the snapshot version of the file and
the file itself contained all of the data previously written to it.



Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Richard Elling

Jonathan Edwards wrote:


On Mar 6, 2009, at 8:58 AM, Andrew Gabriel wrote:


Jim Dunham wrote:
ZFS the filesystem is always on-disk consistent, and ZFS does
maintain filesystem consistency through coordination between the ZPL
(ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for
SNDR, ZFS caches a lot of an application's filesystem data in the
ZIL, so the data is in memory, not yet written to disk, and SNDR does
not know this data exists. ZIL flushes to disk can be seconds behind
the actual application writes completing, and if SNDR is running
asynchronously, the replicated writes to the SNDR secondary can be
additional seconds behind the actual application writes.


Unlike UFS filesystems with lockfs -f or lockfs -w, there is no
'supported' way to get ZFS to empty the ZIL to disk on demand.


I'm wondering if you really meant ZIL here, or ARC?

In either case, creating a snapshot should get both flushed to disk, 
I think?
(If you don't actually need a snapshot, simply destroy it immediately 
afterwards.)


Not sure if there's another way to trigger a full flush or lockfs,
but to make sure you do have all transactions that may not have been
flushed from the ARC, you could just unmount the filesystem or export
the zpool. With the latter, you wouldn't have to worry about the -f
on the import.


sync(1m)
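
That is, roughly (names invented, as in the earlier sketches):

   sync                    # on the primary: wait for all txgs to commit
   sndradm -g tankgrp -l   # then drop the group into logging mode
   zpool import -f tank    # on the secondary; -f is still needed, since
                           # the pool was never exported on the primary
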
-- richard



Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Miles Nordin
 jd == Jim Dunham james.dun...@sun.com writes:

jd It is my understanding that the ZFS intent log (ZIL) satisfies
jd POSIX requirements for synchronous transactions, thus
jd filesystem consistency.

maybe ``file consistency'' would be clearer.  When you say filesystem
consistency people imagine their pools won't import, which I think
isn't what you're talking about.  Databases rely on the ZIL to keep
their data files internally consistent, and MTAs to keep their queue
directories consistent: ``file consistency'' meaning the insides of a
file must be consistent with the rest of the insides of the same file,
and they won't be without the ZIL.

so, for example, in an imaginary better world where virtual machine
software didn't break all kinds of sync and barrier rules and the ZIL
were the only issue, then disabling the ZIL on the Host could cause
the filesystems of virtual Guests to become inconsistent and refuse to
import or need a drastic fsck if the Host lost power (or, likewise, in
the SNDR-replicated copy of the Host), but the Host filesystem and its
replica would always stay clean and mountable with or without the ZIL.

The ZIL is stored on the disk, never in RAM as your earlier message
suggested, so it should be replicated along with everything else,
shouldn't it?

unless you are using a slog and leave the slog outside replication,
but in that case it should be impossible to import the pool on the
secondary because importing with missing slogs doesn't work yet, so
I'm not sure what's happening to you.

Are you actually observing violation of POSIX consistency
``suggestions'' w.r.t. fsync() or O_DSYNC on the secondary?  

Or are you talking about close-to-open?  Files that you close(), wait
for the close to return, break replication, and the file does not
appear on the secondary?

What's breaking exactly?

 jd A simple test I performed to verify this was to append to a
 jd ZFS file (no synchronous filesystem options being set) a
 jd series of blocks with a block-order pattern contained
 jd within. At some random point in this process, I took a ZFS
 jd snapshot, then immediately dropped SNDR into logging mode. When
 jd importing the ZFS storage pool on the SNDR remote host, I
 jd could see the ZFS snapshot just taken, but neither the
 jd snapshot version of the file nor the file itself contained all
 jd of the data previously written to it.

that's a really good test!  so SNDR is good for testing, too, it seems.

I'm glad you've done it.  If we'd just listened to the several people
speculating, ``just take a snapshot, it ought to imply a lockfs,'' we
could be having nasty surprises months from now.  I'm also not that
upset about the behavior, if it lets one take and destroy snapshots
really fast.  I could see the opposing argument that all snapshots
should commit to disk atomically, though, because you are saying the
snapshot _exists_ but doesn't have in it what it should---maybe in a
more ideal world the snapshot should either disappear after reboot,
or else, if it exists, contain exactly what it logically should.

 jd I then retested, but opened the file with O_DSYNC, and when
 jd following the same test steps above, both the snapshot version
 jd of the file and the file itself contained all of the data
 jd previously written to it.

AIUI, in this test some of the file data may be written to the ZIL.
In the former test, the ZIL would not be used at all.

but the ZIL is just a separate area on the disk that's faster to write
to, since with O_DSYNC or fsync() you would like to return to the
application in a hurry.  ZFS scribbles down the change as quickly as
possible in the ZIL on the disk, then rewrites it in a more organized
way later.





Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Miles Nordin
 np == Neil Perrin neil.per...@sun.com writes:

np Alternatively, a lockfs will flush just a file system to
np stable storage but in this case just the intent log is
np written. (Then later when the txg commits those intent log
np records are discarded).

In your blog it sounded like there's an in-RAM ZIL through which
_everything_ passes, and parts of this in-RAM ZIL are written to the
on-disk ZIL as needed.  so maybe I was using the word ZIL wrongly in
my last post.

are you saying, lockfs will divert writes that would normally go
straight to the pool, to pass through the on-disk ZIL instead?

assuming any separate slog isn't destroyed while the power's off,
lockfs and sync should get you the same end result after an unclean
shutdown, right?




Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Neil Perrin



On 03/06/09 14:51, Miles Nordin wrote:

np == Neil Perrin neil.per...@sun.com writes:


np Alternatively, a lockfs will flush just a file system to
np stable storage but in this case just the intent log is
np written. (Then later when the txg commits those intent log
np records are discarded).

In your blog it sounded like there's an in-RAM ZIL through which
_everything_ passes, and parts of this in-RAM ZIL are written to the
on-disk ZIL as needed.


That's correct.


so maybe I was using the word ZIL wrongly in
my last post.


I understood what you meant.



are you saying, lockfs will divert writes that would normally go
straight to the pool, to pass through the on-disk ZIL instead?


- Not instead but as well. The ZIL (code) will write immediately
to the stable intent logs, then later the data cached in the ARC
will be written as part of the pool transaction group (txg).
As soon as that happens the intent log blocks can be re-used.



assuming any separate slog isn't destroyed while the power's off,
lockfs and sync should get you the same end result after an unclean
shutdown, right?


Right.

Neil.