Re[2]: [zfs-discuss] ZFS over iSCSI question

2007-03-25 Thread Robert Milkowski
Hello Thomas,

Saturday, March 24, 2007, 1:06:47 AM, you wrote:


 The problem is that the failure modes are very different for networks and
 presumably reliable local disk connections.  Hence NFS has a lot of error
 handling code and provides well understood error handling semantics.  Maybe
 what you really want is NFS?

TN> We thought about using NFS as a backend for as many applications as possible,
TN> but we need to have redundancy for the fileserver itself too

Then use Sun Cluster + NFS; both are free.

Now it won't solve your 'sync' issue, but maybe you can try: SC + NFS
mounted on the clients with directio, and UFS on the server mounted with directio.

It will probably be really slow, but everything should be consistent
all the time, I guess.


-- 
Best regards,
 Robert    mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[2]: [zfs-discuss] ZFS over iSCSI question

2007-03-25 Thread Thomas Nau

Hi Robert,

On Sun, 25 Mar 2007, Robert Milkowski wrote:

The problem is that the failure modes are very different for networks and
presumably reliable local disk connections.  Hence NFS has a lot of error
handling code and provides well understood error handling semantics.  Maybe
what you really want is NFS?


TN> We thought about using NFS as a backend for as many applications as possible,
TN> but we need to have redundancy for the fileserver itself too

Then use Sun Cluster + NFS; both are free.


We use a cluster ;) but on the backend it doesn't solve the sync problem,
as you mention



Now it won't solve your 'sync' issue, but maybe you can try: SC + NFS
mounted on the clients with directio, and UFS on the server mounted with directio.


UFS is not an option due to its limitations in size and data safety. We 
recently had a severe problem when the UFS log got corrupted due to a 
hardware failure (a port died on an FC-AL switch). The fsck ran for 10+ hours 
on the mail server. Even worse, it reported some corrected problems, but 
after running a few more hours in production the system panicked again with 
"freeing free block/inode". At that point we decided that 500GB of mail 
is not something we want to put on UFS or any similar FS anymore



It will probably be really slow, but everything should be consistent
all the time, I guess.


You might be right about that. I did a quick check with dtrace on the mail 
server, and it seems IMAP, sendmail and the others nicely sync data as they 
should


Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA  A5 4D E0 50 35 75 9E ED
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] ZFS over iSCSI question

2007-03-25 Thread David Magda


On Mar 25, 2007, at 06:14, Thomas Nau wrote:

We use a cluster ;) but on the backend it doesn't solve the sync
problem, as you mention


The StorageTek Availability Suite was recently open-sourced:

http://www.opensolaris.org/os/project/avs/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-24 Thread Joerg Schilling
Thomas Nau [EMAIL PROTECTED] wrote:

  fflush(fp);
  fsync(fileno(fp));
  fclose(fp);
 
  and check errors.
 
 
  (It's remarkable how often people get the above sequence wrong and only
  do something like fsync(fileno(fp)); fclose(fp);)


 Thanks for clarifying! Seems I really need to check the apps with truss or 
 dtrace to see if they use that sequence. Allow me one more question: why 
 is fflush() required prior to fsync()?

You cannot simply verify this with truss unless you trace libc::fflush() too.

You need to call fflush() first, in order to move the user-space (stdio)
buffer into the kernel.


Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-24 Thread Frank Cusack

On March 23, 2007 11:06:33 PM -0700 Adam Leventhal [EMAIL PROTECTED] wrote:

> On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote:
>
>>> I'm in a way still hoping that it's an iSCSI-related problem, as detecting
>>> dead hosts in a network can be a non-trivial problem and it takes quite
>>> some time for TCP to time out and inform the upper layers. Just a
>>> guess/hope here that FC-AL, ... do better in this case
>>
>> iscsi doesn't use TCP, does it?  Anyway, the problem is really transport
>> independent.
>
> It does use TCP. Were you thinking UDP?


or its own IP protocol.  I wouldn't have thought iSCSI would want to be
subject to the vagaries of TCP.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-24 Thread Brian Hechinger
On Sat, Mar 24, 2007 at 11:20:38AM -0700, Frank Cusack wrote:
>>> iscsi doesn't use TCP, does it?  Anyway, the problem is really transport
>>> independent.
>>
>> It does use TCP. Were you thinking UDP?
>
> or its own IP protocol.  I wouldn't have thought iSCSI would want to be
> subject to the vagaries of TCP.

No, you'll find that iSCSI does indeed use TCP, for better or for worse. ;)

-brian
-- 
The reason I don't use Gnome: every single other window manager I know of is
very powerfully extensible, where you can switch actions to different mouse
buttons. Guess which one is not, because it might confuse the poor users?
Here's a hint: it's not the small and fast one.--Linus
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Thomas Nau

Dear all.
I've set up the following scenario:

A Galaxy 4200 running OpenSolaris build 59 as the iSCSI target; the remaining 
disk space of the two internal drives, 90GB in total, is used as a zpool 
for the two 32GB volumes exported via iSCSI


The initiator is an up-to-date Solaris 10 11/06 x86 box using the 
above-mentioned volumes as disks for a local zpool.


I've now started rsync to copy about 1GB of data in several thousand 
files. During the operation I took down the network interface on the iSCSI 
target, which resulted in no more disk I/O on that server. The client, on the 
other hand, happily dumped data into the ZFS cache, actually completing the 
whole copy operation.


Now the big question: we plan to use that kind of setup for email and other 
important services, so what happens if the client crashes while the network 
is down? Does it mean that all the data in the cache is gone forever?


If so, is this a transport-independent problem which could also happen if 
ZFS used Fibre Channel-attached drives instead of iSCSI devices?


Thanks for your help
Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA  A5 4D E0 50 35 75 9E ED
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Thomas Nau

On Fri, 23 Mar 2007, Roch - PAE wrote:


I assume the rsync is not issuing fsyncs (and its files are
not opened O_DSYNC). If so, rsync just works against the
filesystem cache and does not commit the data to disk.

You might want to run sync(1M) after a successful rsync.

A larger rsync would presumably have blocked. It's just
that the amount of data you needed to rsync fit in a couple of
transaction groups.


Thanks for the hints, but this would make our worst nightmares come true. At 
least it could, because it means we would have to check every application 
handling critical data, and I think that is not the application's 
responsibility. Up to a certain point, such as a database transaction, but not 
any further. There's always a time window where data might be cached in 
memory, but I would argue that caching several GB of data, in our case written 
data spread over thousands of files, purely in memory circumvents all the 
built-in reliability of ZFS.


I'm in a way still hoping that it's an iSCSI-related problem, as detecting 
dead hosts in a network can be a non-trivial problem and it takes quite 
some time for TCP to time out and inform the upper layers. Just a 
guess/hope here that FC-AL, ... do better in this case


Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA  A5 4D E0 50 35 75 9E ED
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Frank Cusack

On March 23, 2007 6:51:10 PM +0100 Thomas Nau [EMAIL PROTECTED] wrote:

Thanks for the hints, but this would make our worst nightmares come
true. At least it could, because it means we would have to check
every application handling critical data, and I think that is not the
application's responsibility.


I'd tend to disagree with that.  POSIX/SUS does not guarantee data makes
it to disk until you do an fsync() (or open the file with the right flags,
or other techniques).  If an application REQUIRES that data get to disk,
it really MUST DTRT.
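
As an illustration of the "right flags" route, here is a minimal sketch added
for this write-up (not part of the original mail; the function name and
parameters are made up): the application opens the file with O_DSYNC, so each
write() returns only once the data is on stable storage, and fsync() after
ordinary writes, as in the fflush()/fsync()/fclose() sequence quoted elsewhere
in this thread, achieves the same end for stdio-based code.

#include <fcntl.h>
#include <unistd.h>

/* Synchronous write via O_DSYNC: no separate fsync() is needed for the
 * file data, but errors still have to be checked at every step. */
int write_sync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd == -1)
        return -1;

    ssize_t n = write(fd, buf, len);
    if (n < 0 || (size_t)n != len) {   /* treat short writes as failure here */
        close(fd);
        return -1;
    }
    return (close(fd) == 0) ? 0 : -1;
}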


Up to a certain point, such as a database transaction, but not any further.
There's always a time window where data might be cached in memory, but I
would argue that caching several GB of data, in our case written data spread
over thousands of files, purely in memory circumvents all the built-in
reliability of ZFS.

I'm in a way still hoping that it's an iSCSI-related problem, as detecting
dead hosts in a network can be a non-trivial problem and it takes quite
some time for TCP to time out and inform the upper layers. Just a
guess/hope here that FC-AL, ... do better in this case


iscsi doesn't use TCP, does it?  Anyway, the problem is really transport
independent.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Casper . Dik

I'd tend to disagree with that.  POSIX/SUS does not guarantee data makes
it to disk until you do an fsync() (or open the file with the right flags,
or other techniques).  If an application REQUIRES that data get to disk,
it really MUST DTRT.

Indeed; want your data safe?  Use:

fflush(fp);
fsync(fileno(fp));
fclose(fp);

and check errors.


(It's remarkable how often people get the above sequence wrong and only
do something like fsync(fileno(fp)); fclose(fp);)
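
For reference, a minimal self-contained sketch of that sequence with the error
checking spelled out (the function name and signature are illustrative, not
something from the thread):

#include <stdio.h>
#include <unistd.h>

/* Write a buffer through stdio and make sure it reaches stable storage:
 * fflush() pushes the stdio buffer into the kernel, fsync() forces the
 * kernel to commit it to disk, and every step is checked for errors. */
int save_file(const char *path, const char *data, size_t len)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    if (fwrite(data, 1, len, fp) != len ||
        fflush(fp) != 0 ||
        fsync(fileno(fp)) != 0) {
        fclose(fp);
        return -1;
    }
    return (fclose(fp) == 0) ? 0 : -1;
}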


Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Richard Elling

Thomas Nau wrote:

Dear all.
I've set up the following scenario:

A Galaxy 4200 running OpenSolaris build 59 as the iSCSI target; the remaining 
disk space of the two internal drives, 90GB in total, is used as a zpool 
for the two 32GB volumes exported via iSCSI


The initiator is an up-to-date Solaris 10 11/06 x86 box using the 
above-mentioned volumes as disks for a local zpool.


Like this?
disk--zpool--zvol--iscsitarget--network--iscsiclient--zpool--filesystem--app

I'm in a way still hoping that it's an iSCSI-related problem, as detecting
dead hosts in a network can be a non-trivial problem and it takes quite
some time for TCP to time out and inform the upper layers. Just a
guess/hope here that FC-AL, ... do better in this case

Actually, this is why NFS was invented.  Prior to NFS we had something like:
disk--raw--ndserver--network--ndclient--filesystem--app

The problem is that the failure modes are very different for networks and
presumably reliable local disk connections.  Hence NFS has a lot of error
handling code and provides well understood error handling semantics.  Maybe
what you really want is NFS?
 -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Thomas Nau

Dear Frank & Casper


I'd tend to disagree with that.  POSIX/SUS does not guarantee data makes
it to disk until you do an fsync() (or open the file with the right flags,
or other techniques).  If an application REQUIRES that data get to disk,
it really MUST DTRT.


Indeed; want your data safe?  Use:

fflush(fp);
fsync(fileno(fp));
fclose(fp);

and check errors.


(It's remarkable how often people get the above sequence wrong and only
do something like fsync(fileno(fp)); fclose(fp);)



Thanks for clarifying! Seems I really need to check the apps with truss or 
dtrace to see if they use that sequence. Allow me one more question: why 
is fflush() required prior to fsync()?


Putting all the pieces together, this means that if the app doesn't do this, 
it would have suffered from the same problem with UFS anyway, just with 
typically smaller caches, right?


Thanks again
Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA  A5 4D E0 50 35 75 9E ED
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Thomas Nau

Richard,


Like this?
disk--zpool--zvol--iscsitarget--network--iscsiclient--zpool--filesystem--app


exactly


I'm in a way still hoping that it's an iSCSI-related problem, as detecting
dead hosts in a network can be a non-trivial problem and it takes quite
some time for TCP to time out and inform the upper layers. Just a
guess/hope here that FC-AL, ... do better in this case


Actually, this is why NFS was invented.  Prior to NFS we had something like:
disk--raw--ndserver--network--ndclient--filesystem--app


The problem is that our NFS, mail, DB and other servers use mirrored 
disks located in different buildings on campus. Currently we use FC-AL 
devices and recently switched from UFS to ZFS. The drawback with FC-AL is 
that you always need to have a second infrastructure (not the real 
problem), but with different components. Having everything on Ethernet would 
be much easier.



The problem is that the failure modes are very different for networks and
presumably reliable local disk connections.  Hence NFS has a lot of error
handling code and provides well understood error handling semantics.  Maybe
what you really want is NFS?


We thought about using NFS as a backend for as many applications as possible, 
but we need to have redundancy for the fileserver itself too


Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA  A5 4D E0 50 35 75 9E ED
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Casper . Dik

Thanks for clarifying! Seems I really need to check the apps with truss or 
dtrace to see if they use that sequence. Allow me one more question: why 
is fflush() required prior to fsync()?

When you use stdio, you need to make sure the data is in the
system buffers prior to calling fsync().

fclose() will otherwise write out the rest of the data, which is then not sync'ed.

(In S10 I fixed this for the /etc/*_* driver files; they are generally
under 8 K and therefore never written to disk before being fsync'ed
if not preceded by fflush().)
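
To make the failure mode concrete, here is a small hypothetical sketch (not
from the thread; the file contents are made up) of the broken ordering
described above: without the fflush(), a write smaller than the stdio buffer
is still sitting in user space when fsync() runs, so nothing has been
committed by the time fsync() returns.

#include <stdio.h>
#include <unistd.h>

int broken_save(const char *path)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    /* A short record: far below the default stdio buffer size, so at this
     * point the data lives only in the user-space stdio buffer. */
    fprintf(fp, "some small configuration record\n");

    fsync(fileno(fp));  /* BROKEN: the kernel has not seen the data yet,
                         * so this syncs an (empty) file to no effect. */

    return fclose(fp);  /* only now does stdio push the data to the kernel,
                         * and it is no longer covered by the fsync() above. */
}

Adding fflush(fp) immediately before the fsync(), and checking all three calls
for errors as in the earlier sketch, closes that window.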

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Adam Leventhal
On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote:
>> I'm in a way still hoping that it's an iSCSI-related problem, as detecting
>> dead hosts in a network can be a non-trivial problem and it takes quite
>> some time for TCP to time out and inform the upper layers. Just a
>> guess/hope here that FC-AL, ... do better in this case
>
> iscsi doesn't use TCP, does it?  Anyway, the problem is really transport
> independent.

It does use TCP. Were you thinking UDP?

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss