Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Jeroen Roodhart
Hi list,

 If you're running solaris proper, you better mirror
 your
  ZIL log device.  
...
 I plan to get to test this as well, won't be until
 late next week though.

Running OSOL nv130. Powered off the machine, removed the F20 and powered back on. 
The machine boots OK and comes up normally, with the following message in 'zpool 
status':
...
  pool: mypool
 state: FAULTED
status: An intent log record could not be read.
        Waiting for administrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online',
        or ignore the intent log records by running 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        mypool    FAULTED      0     0     0  bad intent log
...

Nice! Running a later version of ZFS seems to lessen the need for 
ZIL-mirroring...
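
For reference, the two recovery paths named in the action line above come down
to something like this (a hedged sketch; the log device name is made up):

    # path 1: reattach/restore the original log device and bring it back online
    zpool online mypool c3t0d0
    # path 2: accept the loss of the unreplayed intent-log records and clear the fault
    zpool clear mypool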

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jeroen Roodhart
 
  If you're running solaris proper, you better mirror
  your
   ZIL log device.
 ...
  I plan to get to test this as well, won't be until
  late next week though.
 
 Running OSOL nv130. Power off the machine, removed the F20 and power
 back on. Machines boots OK and comes up normally [...]
 
 Nice! Running a later version of ZFS seems to lessen the need for ZIL-
 mirroring...

Yes, this came in with zpool version 19, which is not available in any version of
Solaris yet, and is not available in OSOL 2009.06 unless you update to developer
builds.  Since zpool 19, you have the ability to 'zpool remove' log
devices.  And if a log device fails during operation, the system is supposed
to fall back and just start using ZIL blocks from the main pool instead.

So the recommendation for zpool <19 would be: *strongly* recommended.  Mirror
your log device if you care about using your pool.
And the recommendation for zpool >=19 would be ... don't mirror your log
device.  If you have more than one, just add them both unmirrored.
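
For anyone checking where they stand, a rough sketch (pool and device names
are placeholders):

    # check which on-disk version the pools are at (log device removal needs >= 19)
    zpool upgrade
    # on version >= 19 a failed or unwanted slog can simply be removed
    zpool remove tank c3t0d0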

I edited the ZFS Best Practices yesterday to reflect these changes.

I always have a shade of doubt about things that are supposed to do
something.  Later this week, I am building an OSOL machine, updating it,
adding an unmirrored log device, starting a sync-write benchmark (to ensure
the log device is heavily in use) and then I'm going to yank out the log
device, and see what happens.
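
A rough sketch of such a test (hypothetical device names; zilstat is the
DTrace-based tool mentioned elsewhere in this thread):

    # add an unmirrored log device to the test pool
    zpool add tank log c4t1d0
    # watch ZIL activity once per second while the sync-write benchmark runs
    zilstat 1
    # after physically pulling the log device, see how the pool reports it
    zpool status -x tank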



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Ragnar Sundblad

On 7 apr 2010, at 14.28, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jeroen Roodhart
 
 If you're running solaris proper, you better mirror
 your
 ZIL log device.
 ...
 I plan to get to test this as well, won't be until
 late next week though.
 
 Running OSOL nv130. Power off the machine, removed the F20 and power
 back on. Machines boots OK and comes up normally [...]
 
 Nice! Running a later version of ZFS seems to lessen the need for ZIL-
 mirroring...
 
 Yes, since zpool 19, which is not available in any version of solaris yet,
 and is not available in osol 2009.06 unless you update to developer
 builds,  Since zpool 19, you have the ability to zpool remove log
 devices.  And if a log device fails during operation, the system is supposed
 to fall back and just start using ZIL blocks from the main pool instead.
 
 So the recommendation for zpool <19 would be: *strongly* recommended.  Mirror
 your log device if you care about using your pool.
 And the recommendation for zpool >=19 would be ... don't mirror your log
 device.  If you have more than one, just add them both unmirrored.

Rather: ... >=19 would be ... if you don't mind losing data written in
the ~30 seconds before the crash, you don't have to mirror your log
device.

For a file server, mail server, etc etc, where things are stored
and supposed to be available later, you almost certainly want
redundancy on your slog too. (There may be file servers where
this doesn't apply, but they are special cases that should not
be mentioned in the general documentation.)
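
(For completeness, adding a mirrored slog, or converting a single one into a
mirror, is a one-liner either way; pool and device names below are placeholders:)

    # add the log as a mirrored pair from the start
    zpool add tank log mirror c3t0d0 c3t1d0
    # or attach a second device to an existing, single log device
    zpool attach tank c3t0d0 c3t1d0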

 I edited the ZFS Best Practices yesterday to reflect these changes.

I'd say that "In zpool version 19 or greater, it is recommended not to
mirror log devices." is not very good advice and should be changed.

/ragge



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Robert Milkowski

On 07/04/2010 13:58, Ragnar Sundblad wrote:


Rather: ...>=19 would be ... if you don't mind losing data written
the ~30 seconds before the crash, you don't have to mirror your log
device.

For a file server, mail server, etc etc, where things are stored
and supposed to be available later, you almost certainly want
redundancy on your slog too. (There may be file servers where
this doesn't apply, but they are special cases that should not
be mentioned in the general documentation.)

   


While I agree with you, I want to mention that it is all about 
understanding a risk.
In this case, not only does your server have to crash in such a way that data has 
not been synced (sudden power loss, for example), but there would also have to 
be some data committed to the slog device(s) which was not yet written to the 
main pool, and when your server restarts, the slog device would have to 
have died completely as well.


Other than that you are fine even with an unmirrored slog device.

--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Bob Friesenhahn

On Wed, 7 Apr 2010, Ragnar Sundblad wrote:


So the recommendation for zpool <19 would be: *strongly* recommended.  Mirror
your log device if you care about using your pool.
And the recommendation for zpool >=19 would be ... don't mirror your log
device.  If you have more than one, just add them both unmirrored.


Rather: ... >=19 would be ... if you don't mind losing data written
the ~30 seconds before the crash, you don't have to mirror your log
device.


It is also worth pointing out that in normal operation the slog is 
essentially a write-only device which is only read at boot time.  The 
writes are assumed to work if the device claims success.  If the log 
device fails to read (oops!), then a mirror would be quite useful.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Robert Milkowski

On 07/04/2010 15:35, Bob Friesenhahn wrote:

On Wed, 7 Apr 2010, Ragnar Sundblad wrote:


So the recommendation for zpool <19 would be: *strongly* recommended.  Mirror
your log device if you care about using your pool.
And the recommendation for zpool >=19 would be ... don't mirror your log
device.  If you have more than one, just add them both unmirrored.


Rather: ... >=19 would be ... if you don't mind losing data written
the ~30 seconds before the crash, you don't have to mirror your log
device.


It is also worth pointing out that in normal operation the slog is 
essentially a write-only device which is only read at boot time.  The 
writes are assumed to work if the device claims success.  If the log 
device fails to read (oops!), then a mirror would be quite useful.


It is only read at boot if there is uncommitted data on it - during 
normal reboots zfs won't read data from the slog.


--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Bob Friesenhahn

On Wed, 7 Apr 2010, Robert Milkowski wrote:


It is only read at boot if there is uncommitted data on it - during normal 
reboots zfs won't read data from the slog.


How does zfs know if there is uncommitted data on the slog device 
without reading it?  The minimal read would be quite small, but it 
seems that a read is still required.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Neil Perrin

On 04/07/10 09:19, Bob Friesenhahn wrote:

On Wed, 7 Apr 2010, Robert Milkowski wrote:


It is only read at boot if there is uncommitted data on it - during 
normal reboots zfs won't read data from the slog.


How does zfs know if there is uncommitted data on the slog device 
without reading it?  The minimal read would be quite small, but it 
seems that a read is still required.


Bob


If there's ever been synchronous activity then there is an empty tail block 
(stubby) that will be read even after a clean shutdown.

Neil.


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Edward Ned Harvey
 From: Ragnar Sundblad [mailto:ra...@csc.kth.se]
 
 Rather: ... >=19 would be ... if you don't mind losing data written
 the ~30 seconds before the crash, you don't have to mirror your log
 device.

If you have a system crash *and* a failed log device at the same time, this
is an important consideration.  But if you have either a system crash or a
failed log device, and they don't happen at the same time, then your sync writes
are safe, right up to the nanosecond, using an unmirrored nonvolatile log
device on zpool >= 19.


 I'd say that "In zpool version 19 or greater, it is recommended not to
 mirror log devices." is not very good advice and should be changed.

See above.  Still disagree?

If desired, I could clarify the statement, by basically pasting what's
written above.



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 It is also worth pointing out that in normal operation the slog is
 essentially a write-only device which is only read at boot time.  The
 writes are assumed to work if the device claims success.  If the log
 device fails to read (oops!), then a mirror would be quite useful.

An excellent point.

BTW, does the system *ever* read from the log device during normal
operation?  Such as perhaps during a scrub?  It really would be nice to
detect, in advance, failure of log devices that claim to write
correctly but are really unreadable.



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Neil Perrin

On 04/07/10 10:18, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Bob Friesenhahn

It is also worth pointing out that in normal operation the slog is
essentially a write-only device which is only read at boot time.  The
writes are assumed to work if the device claims success.  If the log
device fails to read (oops!), then a mirror would be quite useful.



An excellent point.

BTW, does the system *ever* read from the log device during normal
operation?  Such as perhaps during a scrub?  It really would be nice to
detect failure of log devices in advance, that are claiming to write
correctly, but which are really unreadable.


A scrub will read the log blocks, but only for unplayed logs.
Because of the transient nature of the log, and because it operates
outside of the transaction group model, it's hard to read the in-flight
log blocks to validate them.

There have previously been suggestions to read slogs periodically.
I don't know if  there's a CR raised for this though.

Neil.


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Mark J Musante

On Wed, 7 Apr 2010, Neil Perrin wrote:

There have previously been suggestions to read slogs periodically. I 
don't know if there's a CR raised for this though.


Roch wrote up CR 6938883, "Need to exercise read from slog dynamically".


Regards,
markm


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Bob Friesenhahn

On Wed, 7 Apr 2010, Edward Ned Harvey wrote:


From: Ragnar Sundblad [mailto:ra...@csc.kth.se]

Rather: ... >=19 would be ... if you don't mind losing data written
the ~30 seconds before the crash, you don't have to mirror your log
device.


If you have a system crash *and* a failed log device at the same time, this
is an important consideration.  But if you have either a system crash or a
failed log device, and they don't happen at the same time, then your sync writes
are safe, right up to the nanosecond, using an unmirrored nonvolatile log
device on zpool >= 19.


The point is that the slog is a write-only device, and a device which 
fails such that it acks each write but fails to read the data that 
it wrote, could silently fail at any time during the normal 
operation of the system.  It is not necessary for the slog device to 
fail at the exact same time that the system spontaneously reboots.  I 
don't know if Solaris implements a background scrub of the slog as a 
normal course of operation which would cause a device with this sort 
of failure to be exposed quickly.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Bob Friesenhahn

On Wed, 7 Apr 2010, Edward Ned Harvey wrote:


BTW, does the system *ever* read from the log device during normal
operation?  Such as perhaps during a scrub?  It really would be nice to
detect failure of log devices in advance, that are claiming to write
correctly, but which are really unreadable.


To make matters worse, an SSD with a large cache might satisfy such 
reads from its cache, so a scrub of the (possibly) tiny bit of 
pending synchronous writes may not validate anything.  A lightly 
loaded slog should usually be empty.  We already know that some 
(many?) SSDs are not very good about persisting writes to FLASH, even 
after acking a cache flush request.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Richard Elling
On Apr 7, 2010, at 10:19 AM, Bob Friesenhahn wrote:
 On Wed, 7 Apr 2010, Edward Ned Harvey wrote:
 From: Ragnar Sundblad [mailto:ra...@csc.kth.se]
 
 Rather: ... >=19 would be ... if you don't mind losing data written
 the ~30 seconds before the crash, you don't have to mirror your log
 device.
 
 If you have a system crash *and* a failed log device at the same time, this
 is an important consideration.  But if you have either a system crash or a
 failed log device, and they don't happen at the same time, then your sync writes
 are safe, right up to the nanosecond, using an unmirrored nonvolatile log
 device on zpool >= 19.
 
 The point is that the slog is a write-only device, and a device which fails 
 such that it acks each write but fails to read the data that it wrote, 
 could silently fail at any time during the normal operation of the system.  
 It is not necessary for the slog device to fail at the exact same time that 
 the system spontaneously reboots.  I don't know if Solaris implements a 
 background scrub of the slog as a normal course of operation which would 
 cause a device with this sort of failure to be exposed quickly.

You are playing against marginal returns. An ephemeral storage requirement
is very different from a permanent storage requirement.  For permanent storage
services, scrubs work well -- you can have good assurance that if you read
the data once then you will likely be able to read the same data again, with
some probability based on the expected decay of the data. For ephemeral data,
you do not read the same data more than once, so there is no correlation
between reading once and reading again later.  In other words, testing the
readability of an ephemeral storage service is like a cat chasing its tail.
IMHO, this is particularly problematic for contemporary SSDs that implement wear
leveling.

<sidebar>
For clusters the same sort of problem exists for path monitoring. If you think
about paths (networks, SANs, cups-n-strings) then there is no assurance 
that a failed transfer means all subsequent transfers will also fail. Some other
permanence test is required to predict future transfer failures.
s/fail/pass/g
</sidebar>

Bottom line: if you are more paranoid, mirror the separate log devices and
sleep through the night.  Pleasant dreams! :-)
 -- richard


ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Miles Nordin
 "jr" == Jeroen Roodhart <j.r.roodh...@uva.nl> writes:

jr Running OSOL nv130. Power off the machine, removed the F20 and
jr power back on. Machines boots OK and comes up normally with
jr the following message in 'zpool status':

yeah, but try it again and this time put rpool on the F20 as well and
try to import the pool from a LiveCD: if you lose zpool.cache at this
stage, your pool is toast.  /end repeat mode




Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-07 Thread Ragnar Sundblad

On 7 apr 2010, at 18.13, Edward Ned Harvey wrote:

 From: Ragnar Sundblad [mailto:ra...@csc.kth.se]
 
 Rather: ... >=19 would be ... if you don't mind losing data written
 the ~30 seconds before the crash, you don't have to mirror your log
 device.
 
 If you have a system crash *and* a failed log device at the same time, this
 is an important consideration.  But if you have either a system crash or a
 failed log device, and they don't happen at the same time, then your sync writes
 are safe, right up to the nanosecond, using an unmirrored nonvolatile log
 device on zpool >= 19.

Right, but if you have a power or a hardware problem, chances are
that more things really break at the same time, including the slog
device(s).

 I'd say that "In zpool version 19 or greater, it is recommended not to
 mirror log devices." is not very good advice and should be changed.
 
 See above.  Still disagree?
 
 If desired, I could clarify the statement, by basically pasting what's
 written above.

I believe that for a mail server, NFS server (to be spec compliant),
general purpose file server and the like, where the last written data
is as important as older data (maybe even more), it would be wise to
have at least as good redundancy on the slog as on the data disks.

If one can stand the (pretty small) risk of losing the last
transaction group before a crash, at the moment typically up to the
last 30 seconds of changes, you may have less redundancy on the slog.

(And if you don't care at all, like on a web cache perhaps, you
could of course disable the zil altogether - that is kind of
the other end of the scale, which puts this in perspective.)

As Robert M so wisely and simply put it: "It is all about understanding
a risk." I think the documentation should help people make educated
decisions, though I am not right now sure how to put the words to
describe this in an easily understandable way.

/ragge



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-06 Thread Jeroen Roodhart
Hi Roch,

 Can  you try 4 concurrent tar to four different ZFS
 filesystems (same pool). 

Hmmm, you're on to something here:

http://www.science.uva.nl/~jeroen/zil_compared_e1000_iostat_iops_svc_t_10sec_interval.pdf

In short: when using two exported file systems total time goes down to around 
4mins (IOPS maxes out at around 5500 when adding all four vmods together). When 
using four file systems total time goes down to around 3min30s (IOPS maxing out 
at about 9500).

I figured it is either NFS or a per-filesystem data structure in the ZFS/ZIL 
interface. To rule out NFS I tried exporting two directories using default 
NFS shares (via /etc/dfs/dfstab entries). To my surprise this seems to bypass 
the ZIL altogether (dropping to 100 IOPS, which results from our RAIDZ2 
configuration). So clearly ZFS sharenfs is more than a nice front end for NFS 
configuration :).
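
For the record, a sketch of the two setups being compared (dataset names are
placeholders): several filesystems shared through ZFS itself versus a legacy
dfstab share:

    # multiple filesystems in the same pool, shared via the ZFS sharenfs property
    zfs create -o sharenfs=on tank/export1
    zfs create -o sharenfs=on tank/export2
    zfs create -o sharenfs=on tank/export3
    zfs create -o sharenfs=on tank/export4
    # versus a legacy share line in /etc/dfs/dfstab (the variant tried above)
    share -F nfs -o rw /tank/legacy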

But back to your suggestion: You clearly had a hypothesis behind your question. 
Care to elaborate?

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-06 Thread Edward Ned Harvey
  We ran into something similar with these drives in an X4170 that turned
  out to be an issue of the preconfigured logical volumes on the drives. Once
  we made sure all of our Sun PCI HBAs were running the exact same version of
  firmware and recreated the volumes on new drives arriving from Sun we got back
  into sync on the X25-E device sizes.
 
 Can you elaborate?  Just today, we got the replacement drive that has
 precisely the right version of firmware and everything.  Still, when we
 plugged in that drive and created a simple volume in the storagetek
 raid utility, the new drive is 0.001 GB smaller than the old drive.
 I'm still hosed.
 
 Are you saying I might benefit by sticking the SSD into some laptop,
 and zero'ing the disk?  And then attach to the sun server?
 
 Are you saying I might benefit by finding some other way to make the
 drive available, instead of using the storagetek raid utility?
 
 Thanks for the suggestions...

Sorry for the double post.  Since the wrong-sized drive was discussed in two
separate threads, I want to stick a link here to the other one, where the
question was answered.  Just in case anyone comes across this discussion by
search or whatever...

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-April/039669.html



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-05 Thread Kyle McDonald
On 4/4/2010 11:04 PM, Edward Ned Harvey wrote:
 Actually, It's my experience that Sun (and other vendors) do exactly
 that for you when you buy their parts - at least for rotating drives, I
 have no experience with SSD's.

 The Sun disk label shipped on all the drives is setup to make the drive
 the standard size for that sun part number. They have to do this since
 they (for many reasons) have many sources (diff. vendors, even diff.
 parts from the same vendor) for the actual disks they use for a
 particular Sun part number.
 
 Actually, if there is a fdisk partition and/or disklabel on a drive when it
 arrives, I'm pretty sure that's irrelevant.  Because when I first connect a
 new drive to the HBA, of course the HBA has to sign and initialize the drive
 at a lower level than what the OS normally sees.  So unless I do some sort
 of special operation to tell the HBA to preserve/import a foreign disk, the
 HBA will make the disk blank before the OS sees it anyway.

   
That may be true. Though these days they may be spec'ing the drives to
the manufacturers at an even lower level.

So does your HBA have newer firmware now than it did when the first disk
was connected?
Maybe it's the HBA that is handling the new disks differently now than
it did when the first one was plugged in?

Can you down-rev the HBA FW? Do you have another HBA that might still
have the older rev you could test it on?

  -Kyle




Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-05 Thread Edward Ned Harvey
 From: Kyle McDonald [mailto:kmcdon...@egenera.com]

 So does your HBA have newer firmware now than it did when the first
 disk
 was connected?
 Maybe it's the HBA that is handling the new disks differently now, than
 it did when the first one was plugged in?
 
 Can you down rev the HBA FW? Do you have another HBa that might still
 have the older Rev you coudltest it on?

I'm planning to get the support guys more involved tomorrow, so ... things
have been pretty stagnant for several days now, I think it's time to start
putting more effort into this.

Long story short, I don't know yet.  But there is one glaring clue:  Prior
to OS installation, I don't know how to configure the HBA.  This means the
HBA must have been preconfigured with the factory installed disks, and I
followed a different process with my new disks, because I was using the GUI
within the OS.  My best hope right now is to find some other way to
configure the HBA, possibly through the ILOM, but I already searched there
and looked at everything.  Maybe I have to shut down (power cycle) the system
and attach a keyboard and monitor.  I don't know yet...



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-04 Thread Ragnar Sundblad

On 4 apr 2010, at 06.01, Richard Elling wrote:

Thank you for your reply! Just wanted to make sure.

 Do not assume that power outages are the only cause of unclean shutdowns.
 -- richard

Thanks, I have seen that mistake several times with other
(file)systems, and hope I'll never ever make it myself! :-)

/ragge s



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-04 Thread Edward Ned Harvey
 Hmm, when you did the write-back test was the ZIL SSD included in the
 write-back?

 What I was proposing was write-back only on the disks, and ZIL SSD
 with no write-back.

The tests I did were:
All disks write-through
All disks write-back
With/without SSD for ZIL

All the permutations of the above.

So, unfortunately, no, I didn't test with WriteBack enabled only for
spindles, and WriteThrough on SSD.  

It has been suggested, and this is actually what I now believe based on my
experience, that precisely the opposite would be the better configuration.
If the spindles are configured WriteThrough while the SSD is configured
WriteBack, I believe that would be optimal.

If I get the opportunity to test further, I'm interested and I will.  But
who knows when/if that will happen.




Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-04 Thread Edward Ned Harvey
 Actually, It's my experience that Sun (and other vendors) do exactly
 that for you when you buy their parts - at least for rotating drives, I
 have no experience with SSD's.
 
 The Sun disk label shipped on all the drives is setup to make the drive
 the standard size for that sun part number. They have to do this since
 they (for many reasons) have many sources (diff. vendors, even diff.
 parts from the same vendor) for the actual disks they use for a
 particular Sun part number.

Actually, if there is a fdisk partition and/or disklabel on a drive when it
arrives, I'm pretty sure that's irrelevant.  Because when I first connect a
new drive to the HBA, of course the HBA has to sign and initialize the drive
at a lower level than what the OS normally sees.  So unless I do some sort
of special operation to tell the HBA to preserve/import a foreign disk, the
HBA will make the disk blank before the OS sees it anyway.




Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Casper . Dik


The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.

This is what Jeff Bonwick says in the zil synchronicity arc case:

   What I mean is that the barrier semantic is implicit even with no ZIL at 
all.
   In ZFS, if event A happens before event B, and you lose power, then
   what you'll see on disk is either nothing, A, or both A and B.  Never just B.
   It is impossible for us not to have at least barrier semantics.

So there's no chance that a *later* async write will overtake an earlier
sync *or* async write.

Casper




Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Neil Perrin

On 04/02/10 08:24, Edward Ned Harvey wrote:

The purpose of the ZIL is to act like a fast log for synchronous
writes.  It allows the system to quickly confirm a synchronous write
request with the minimum amount of work.  



Bob and Casper and some others clearly know a lot here.  But I'm hearing
conflicting information, and don't know what to believe.  Does anyone here
work on ZFS as an actual ZFS developer for Sun/Oracle?  Can anyone claim "I can
answer this question, I wrote that code, or at least have read it"?
  


I'm one of the ZFS developers. I wrote most of the zil code.
Still I don't have all the answers. There's a lot of knowledgeable people
on this alias. I usually monitor this alias and sometimes chime in
when there's some misinformation being spread, but sometimes the volume 
is so high.

Since I started this reply there's been 20 new posts on this thread alone!


Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls? 
  


- The intent log (separate device(s) or not) is only used by fsync, 
O_DSYNC, O_SYNC, O_RSYNC.

NFS commits are seen by ZFS as fsyncs.
Note sync(1m) and sync(2s) do not use the intent log. They force transaction
group (txg) commits on all pools. So zfs goes beyond the requirement for
sync(), which only requires that it schedules, but does not necessarily
complete, the writing before returning.
The zfs interpretation is rather expensive, but the alternative seemed broken,
so we fixed it.
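
(One way to watch this from the outside, as a hedged sketch: zil_commit is the
kernel routine behind these paths, so the DTrace fbt provider can show which
processes actually hit the intent log; purely async writers should not appear.)

    # count ZIL commits per process for a few seconds, then Ctrl-C
    dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); }'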


Is it ever used to accelerate async writes?



The zil is not used to accelerate async writes.


Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.
  


Threads can be pre-empted in the OS at any time. So even though thread A
issued W1 before thread B issued W2, the order is not guaranteed to arrive
at ZFS as W1, W2.
Multi-threaded applications have to handle this.

If this was a single thread issuing W1 then W2, then yes, the order is
guaranteed regardless of whether W1 or W2 are synchronous or asynchronous.
Of course if the system crashes then the async operations might not be
there.



I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?
  


- Kind of. The uberblock contains the root of the txg.



At boot time, or zpool import time, what is taken to be the current
filesystem?  The latest uberblock?  Something else?
  


A txg is for the whole pool which can contain many filesystems.
The latest txg defines the current state of the pool and each individual fs.


My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.


Correct (except replace sync() with O_DSYNC, etc).
This also assumes hardware that, for example, correctly handles the
flushing of its caches.



  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.
  


The ZIL doesn't make such guarantees. It's the DMU that handles transactions
and their grouping into txgs. It ensures that writes are committed in order
by its transactional nature.

The function of the zil is to merely ensure that synchronous operations are
stable and replayed after a crash/power fail onto the latest txg.


Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.
  

No, disabling the ZIL does not disable the DMU.


Somebody (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you 

Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Jeroen Roodhart
Hi Al,

 Have you tried the DDRdrive from Christopher George
 cgeo...@ddrdrive.com?
 Looks to me like a much better fit for your application than the F20?
 
 It would not hurt to check it out.  Looks to me like
 you need a product with low *latency* - and a RAM based cache
 would be a much better performer than any solution based solely on
 flash.
 
 Let us know (on the list) how this works out for you.

Well, I did look at it but at that time there was no Solaris support yet. Right 
now it seems there is only a beta driver? I kind of remember that if you'd want 
reliable fallback to nvram, you'd need a UPS feeding the card. I could be very 
wrong there, but the product documentation isn't very clear on this (at least 
to me ;) ) 

Also, we'd kind of like to have a SnOracle supported option. 

But yeah, on paper it does seem it could be an attractive solution...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Christopher George
 Well, I did look at it but at that time there was no Solaris support yet. 
 Right now it 
 seems there is only a beta driver?

Correct, we just completed functional validation of the OpenSolaris driver.
Our focus has now turned to performance tuning and benchmarking.  We expect to
formally introduce the DDRdrive X1 to the ZFS community later this quarter.  It
is our goal to focus exclusively on the dedicated ZIL device market going forward.

  I kind of remember that if you'd want reliable fallback to nvram, you'd need 
 an 
  UPS feeding the card.

Currently, a dedicated external UPS is required for correct operation.  Based
on community feedback, we will be offering automatic backup/restore prior to
release.  This guarantees the UPS will only be required for 60 secs to
successfully back up the drive contents on a host power or hardware failure.
Dutifully, on the next reboot the restore will occur prior to the OS loading,
for seamless non-volatile operation.

Also, we have heard loud and clear the requests for an internal power option.  It
is our intention that the X1 will be the first in a family of products all dedicated
to ZIL acceleration for not only OpenSolaris but also Solaris 10 and FreeBSD.

  Also, we'd kind of like to have a SnOracle supported option.

Although a much smaller company, we believe our singular focus and absolute
passion for ZFS and the potential of Hybrid Storage Pools will serve our
customers well.

We are actively designing our soon-to-be-available support plans.  Your voice
will be heard; please email directly at cgeorge at ddrdrive dot com for
requests, comments and/or questions.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Ragnar Sundblad

On 1 apr 2010, at 06.15, Stuart Anderson wrote:

 Assuming you are also using a PCI LSI HBA from Sun that is managed with
 a utility called /opt/StorMan/arcconf and reports itself as the amazingly
 informative model number Sun STK RAID INT what worked for me was to run,
 arcconf delete (to delete the pre-configured volume shipped on the drive)
 arcconf create (to create a new volume)

Just to sort things out (or not? :-): 

I more than agree that this product is highly confusing, but I
don't think there is anything LSI in or about that card. I believe
it is an Adaptec card, developed, manufactured and supported by
Intel for Adaptec, licensed (or something) to StorageTek, and later
included in Sun machines (since Sun bought StorageTek, I suppose).
Now we could add Oracle to this name dropping inferno, if we would
want to.

I am not sure why they (Sun) put those in there; they don't seem
very fast or smart or anything.

/ragge



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Ragnar Sundblad

On 2 apr 2010, at 22.47, Neil Perrin wrote:

 Suppose there is an application which sometimes does sync writes, and
 sometimes async writes.  In fact, to make it easier, suppose two processes
 open two files, one of which always writes asynchronously, and one of which
 always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
 for writes to be committed to disk out-of-order?  Meaning, can a large block
 async write be put into a TXG and committed to disk before a small sync
 write to a different file is committed to disk, even though the small sync
 write was issued by the application before the large async write?  Remember,
 the point is:  ZIL is disabled.  Question is whether the async could
 possibly be committed to disk before the sync.
   
 
 
 Threads can be pre-empted in the OS at any time. So even though thread A 
 issued
 W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as 
 W1, W2.
 Multi-threaded applications have to handle this.
 
 If this was a single thread issuing W1 then W2 then yes the order is 
 guaranteed
 regardless of whether W1 or W2 are synchronous or asynchronous.
 Of course if the system crashes then the async operations might not be there.

Could you please clarify this last paragraph a little:
Do you mean that this is in the case that you have ZIL enabled
and the txg for W1 and W2 hasn't been committed, so that upon reboot
the ZIL is replayed, and therefore only the sync writes are
eventually there?

If, let's say, W1 is an async small write, W2 is a sync small write,
W1 arrives to zfs before W2, and W2 arrives before the txg is
committed, will both writes always be in the txg on disk?
If so, it would mean that zfs itself never buffers up async writes to
larger blurbs to write at a later txg, correct?
I take it that ZIL enabled or not does not make any difference here
(we pretend the system did _not_ crash), correct?

Thanks!

/ragge



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Richard Elling
On Apr 3, 2010, at 5:47 PM, Ragnar Sundblad wrote:
 On 2 apr 2010, at 22.47, Neil Perrin wrote:
 
 Suppose there is an application which sometimes does sync writes, and
 sometimes async writes.  In fact, to make it easier, suppose two processes
 open two files, one of which always writes asynchronously, and one of which
 always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
 for writes to be committed to disk out-of-order?  Meaning, can a large block
 async write be put into a TXG and committed to disk before a small sync
 write to a different file is committed to disk, even though the small sync
 write was issued by the application before the large async write?  Remember,
 the point is:  ZIL is disabled.  Question is whether the async could
 possibly be committed to disk before the sync.
 
 
 
 Threads can be pre-empted in the OS at any time. So even though thread A 
 issued
 W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS 
 as W1, W2.
 Multi-threaded applications have to handle this.
 
 If this was a single thread issuing W1 then W2 then yes the order is 
 guaranteed
 regardless of whether W1 or W2 are synchronous or asynchronous.
 Of course if the system crashes then the async operations might not be there.
 
 Could you please clarify this last paragraph a little:
 Do you mean that this is in the case that you have ZIL enabled
 and the txg for W1 and W2 hasn't been committed, so that upon reboot
 the ZIL is replayed, and therefore only the sync writes are
 eventually there?

yes. The ZIL needs to be replayed on import after an unclean shutdown.

 If, lets say, W1 is an async small write, W2 is a sync small write,
 W1 arrives to zfs before W2, and W2 arrives before the txg is
 committed, will both writes always be in the txg on disk?

yes

 If so, it would mean that zfs itself never buffer up async writes to
 larger blurbs to write at a later txg, correct?

correct

 I take it that ZIL enabled or not does not make any difference here
 (we pretend the system did _not_ crash), correct?

For import following a clean shutdown, there are no transactions in 
the ZIL to apply.

For async-only workloads, there are no transactions in the ZIL to apply.

Do not assume that power outages are the only cause of unclean shutdowns.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik

On 01/04/2010 20:58, Jeroen Roodhart wrote:

 I'm happy to see that it is now the default and I hope this will cause the
 Linux NFS client implementation to be faster for conforming NFS servers.
  
 Interesting thing is that apparently defaults on Solaris and Linux are chosen
 such that one can't signal the desired behaviour to the other. At least we
 didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS
 backed) NFS export...


Which is to be expected, as it is not an NFS client which requests the 
behavior but rather an NFS server.
Currently on Linux you can export a share as sync (default) or 
async, while on Solaris you can't really currently force an NFS 
server to start working in async mode.


The other part of the issue is that the Solaris clients have been 
developed with a sync server.  The client writes behind more and
continues caching the non-acked data.  The Linux client has been developed 
with an async server and has some catching up to do.
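
To make the asymmetry concrete (paths and networks below are placeholders):
the Linux knob lives on the server in /etc/exports, while a Solaris share
line has no async equivalent:

    # Linux /etc/exports -- 'async' lets the server ack writes before they are stable
    /export/data  192.168.1.0/24(rw,async)
    # Solaris equivalent (dfstab or sharenfs) -- always the sync, spec-conforming behaviour
    share -F nfs -o rw /export/data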


Casper



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Roch

Robert Milkowski writes:
  On 01/04/2010 20:58, Jeroen Roodhart wrote:
  
   I'm happy to see that it is now the default and I hope this will cause the
   Linux NFS client implementation to be faster for conforming NFS servers.

   Interesting thing is that apparently defaults on Solaris an Linux are 
   chosen such that one can't signal the desired behaviour to the other. At 
   least we didn't manage to get a Linux client to asynchronously mount a 
   Solaris (ZFS backed) NFS export...
  
  
  Which is to be expected as it is not a nfs client which requests the 
  behavior but rather a nfs server.
  Currently on Linux you can export a share with as sync (default) or 
  async share while on Solaris you can't really currently force a NFS 
  server to start working in an async mode.
  

True, and there is an entrenched misconception (not you)
that this is a ZFS-specific problem, which it's not.

It's really an NFS protocol feature, which can be
circumvented using zil_disable and which therefore reinforces
the misconception.  It's further reinforced by testing an NFS
server on disk drives with WCE=1 and a filesystem other than ZFS.
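
(For reference, the circumvention mentioned here is the old pool-wide,
unsupported tunable; a sketch of the usual incantations of that era:)

    # disable the ZIL at runtime for testing (affects all pools, sync semantics are lost)
    echo zil_disable/W0t1 | mdb -kw
    # or persistently, by adding this line to /etc/system and rebooting:
    #   set zfs:zil_disable = 1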

All fast options cause the NFS client to become inconsistent
after a server reboot. Whatever was being done in the moments
prior to server reboot will need to be wiped out by users if
they are told that the server did reboot. That's manageable
for home use, not for the enterprise.

-r





  -- 
  Robert Milkowski
  http://milek.blogspot.com


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
  Seriously, all disks configured WriteThrough (spindle and SSD disks
  alike)
  using the dedicated ZIL SSD device, very noticeably faster than
  enabling the
  WriteBack.
 
 What do you get with both SSD ZIL and WriteBack disks enabled?
 
 I mean if you have both why not use both? Then both async and sync IO
 benefits.

Interesting, but unfortunately false.  Soon I'll post the results here.  I
just need to package them in a way suitable to give to the public, and stick it
on a website.  But I'm fighting IT fires for now and haven't had the time
yet.

Roughly speaking, the following are approximately representative.  Of course
it varies based on tweaks of the benchmark and stuff like that.
Stripe 3 mirrors write through:  450-780 IOPS
Stripe 3 mirrors write back:  1030-2130 IOPS
Stripe 3 mirrors write back + SSD ZIL:  1220-2480 IOPS
Stripe 3 mirrors write through + SSD ZIL:  1840-2490 IOPS

Overall, I would say WriteBack is 2-3 times faster than naked disks.  SSD
ZIL is 3-4 times faster than naked disk.  And for some reason, having the
WriteBack enabled while you have SSD ZIL actually hurts performance by
approx 10%.  You're better off to use the SSD ZIL with disks in Write
Through mode.

That result is surprising to me.  But I have a theory to explain it.  When
you have WriteBack enabled, the OS issues a small write, and the HBA
immediately returns to the OS:  Yes, it's on nonvolatile storage.  So the
OS quickly gives it another, and another, until the HBA write cache is full.
Now the HBA faces the task of writing all those tiny writes to disk, and the
HBA must simply follow orders, writing a tiny chunk to the sector it said it
would write, and so on.  The HBA cannot effectively consolidate the small
writes into a larger sequential block write.  But if you have the WriteBack
disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
SSD, and immediately return to the process:  Yes, it's on nonvolatile
storage.  So the application can issue another, and another, and another.
ZFS is smart enough to aggregate all these tiny write operations into a
single larger sequential write before sending it to the spindle disks.  

Long story short, the evidence suggests if you have SSD ZIL, you're better
off without WriteBack on the HBA.  And I conjecture the reasoning behind it
is because ZFS can write buffer better than the HBA can.



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 I know it is way after the fact, but I find it best to coerce each
 drive down to the whole GB boundary using format (create Solaris
 partition just up to the boundary). Then if you ever get a drive a
 little smaller it still should fit.

It seems like it should be unnecessary.  It seems like extra work.  But
based on my present experience, I reached the same conclusion.

If my new replacement SSD with identical part number and firmware is 0.001
GB smaller than the original and hence unable to mirror, what's to prevent
the same thing from happening to one of my 1TB spindle disk mirrors?
Nothing.  That's what.

I take it back.  Me.  I am to prevent it from happening.  And the technique
to do so is precisely as you've said.  First slice every drive to be a
little smaller than actual.  Then later if I get a replacement device for
the mirror, that's slightly smaller than the others, I have no reason to
care.
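
A sketch of the idea (hypothetical devices; slice 0 is sized just under the
nominal capacity with format(1M)):

    # after partitioning each disk so slice 0 ends on a whole-GB boundary,
    # build the mirror on the slices rather than the whole disks
    zpool create tank mirror c1t0d0s0 c1t1d0s0
    # a marginally smaller replacement drive, sliced the same way, still fits later
    zpool replace tank c1t0d0s0 c1t2d0s0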



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Roch

  When we use one vmod, both machines are finished in about 6min45,
  zilstat maxes out at about 4200 IOPS.
  Using four vmods it takes about 6min55, zilstat maxes out at 2200
  IOPS. 

Can  you try 4 concurrent tar to four different ZFS filesystems (same
pool). 

-r



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
  http://nfs.sourceforge.net/
 
 I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to
disk, in what way sync and async writes are handled by the OS, and what
happens if you disable ZIL and lose power to your system.

We were talking about C/C++ sync and async.  Not NFS sync and async.

I don't think anything relating to NFS is the answer to Casper's question,
or else, Casper was simply jumping context by asking it.  Don't get me
wrong, I have no objection to his question or anything, it's just that the
conversation has derailed and now people are talking about NFS sync/async
instead of what happens when a C/C++ application is doing sync/async writes
to a disabled ZIL.



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
  I am envisioning a database, which issues a small sync write,
 followed by a
  larger async write.  Since the sync write is small, the OS would
 prefer to
  defer the write and aggregate into a larger block.  So the
 possibility of
  the later async write being committed to disk before the older sync
 write is
  a real risk.  The end result would be inconsistency in my database
 file.
 
 Zfs writes data in transaction groups and each bunch of data which
 gets written is bounded by a transaction group.  The current state of
 the data at the time the TXG starts will be the state of the data once
 the TXG completes.  If the system spontaneously reboots then it will
 restart at the last completed TXG so any residual writes which might
 have occured while a TXG write was in progress will be discarded.
 Based on this, I think that your ordering concerns (sync writes
 getting to disk faster than async writes) are unfounded for normal
 file I/O.

So you're saying that while the OS is building txg's to write to disk, the
OS will never reorder the sequence in which individual write operations get
ordered into the txg's.  That is, an application performing a small sync
write, followed by a large async write, will never have the second operation
flushed to disk before the first.  Can you support this belief in any way?

If that's true, if there's no increased risk of data corruption, then why
doesn't everybody just disable their ZIL all the time on every system?

The reason to have a sync() function in C/C++ is so you can ensure data is
written to disk before you move on.  It's a blocking call that doesn't
return until the sync is completed.  The only reason you would ever do this
is if order matters.  If you cannot allow the next command to begin until
after the previous one was completed.  Such is the situation with databases
and sometimes virtual machines.  



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 hello
 
 i have had this problem this week. our zil ssd died (apt slc ssd 16gb).
 because we had no spare drive in stock, we ignored it.
 
 then we decided to update our nexenta 3 alpha to beta, exported the
 pool and made a fresh install to have a clean system and tried to
 import the pool. we only got an error message about a missing drive.
 
 we googled about this and it seems there is no way to access the pool!!!
 (hope this will be fixed in future)
 
 we had a backup and the data are not so important, but that could be a
 real problem.
 you have  a valid zfs3 pool and you cannot access your data due to
 missing zil.

If you have zpool less than version 19 (when ability to remove log device
was introduced) and you have a non-mirrored log device that failed, you had
better treat the situation as an emergency.  Normally you can find your
current zpool version by doing zpool upgrade, but you cannot now if you're
in this failure state.  Do not attempt zfs send or zfs list or any other
zpool or zfs command.  Instead, do man zpool and look for zpool remove.
If it says supports removing log devices then you had better use it to
remove your log device.  If it says only supports removing hotspares or
cache then your zpool is lost permanently.

If you are running Solaris, take it as given, you do not have zpool version
19.  If you are running Opensolaris, I don't know at which point zpool 19
was introduced.  Your only hope is to zpool remove the log device.  Use
tar or cp or something, to try and salvage your data out of there.  Your
zpool is lost and if it's functional at all right now, it won't stay that
way for long.  Your system will soon hang, and then you will not be able to
import your pool.

Ask me how I know.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 ZFS recovers to a crash-consistent state, even without the slog,
 meaning it recovers to some state through which the filesystem passed
 in the seconds leading up to the crash.  This isn't what UFS or XFS
 do.
 
 The on-disk log (slog or otherwise), if I understand right, can
 actually make the filesystem recover to a crash-INconsistent state (a

You're speaking the opposite of common sense.  If disabling the ZIL makes
the system faster *and* less prone to data corruption, please explain why we
don't all disable the ZIL?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 If you have zpool less than version 19 (when ability to remove log
 device
 was introduced) and you have a non-mirrored log device that failed, you
 had
 better treat the situation as an emergency.  

 Instead, do man zpool and look for zpool
 remove.
 If it says supports removing log devices then you had better use it
 to
 remove your log device.  If it says only supports removing hotspares
 or
 cache then your zpool is lost permanently.

I take it back.  If you lost your log device on a zpool which is less than
version 19, then you *might* have a possible hope if you migrate your disks
to a later system.  You *might* be able to zpool import on a later version
of OS.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik

  http://nfs.sourceforge.net/
 
 I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to
disk, in what way sync and async writes are handled by the OS, and what
happens if you disable ZIL and lose power to your system.

We were talking about C/C++ sync and async.  Not NFS sync and async.

I don't think so.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg36783.html

(This discussion was started, I think, in the context of NFS performance)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik


So you're saying that while the OS is building txg's to write to disk, the
OS will never reorder the sequence in which individual write operations get
ordered into the txg's.  That is, an application performing a small sync
write, followed by a large async write, will never have the second operation
flushed to disk before the first.  Can you support this belief in any way?

The question is not how the writes are ordered but whether an earlier
write can be in a later txg.  A transaction group is committed atomically.

In http://arc.opensolaris.org/caselog/PSARC/2010/108/mail I ask a similar 
question to make sure I understand it correctly, and the answer was:

 = Casper, the answer is from Neil Perrin:

 Is there a partial order defined for all filesystem operations?

File system operations will be written in order for all settings of
the sync flag.

 Specifically, will ZFS guarantee that when fsync()/O_DATA happens on a
 file,
   
(I assume by O_DATA you meant O_DSYNC).

 that later transactions will not be in an earlier transaction group?
 (Or is this already the case?)
  
This is already the case.


So what I assumed was true but what you made me doubt, was apparently still
true: later transactions cannot be committed in an earlier txg.



If that's true, if there's no increased risk of data corruption, then why
doesn't everybody just disable their ZIL all the time on every system?

For an application running on the file server, there is no difference.
When the system panics you know that data might be lost.  The application 
also dies.  (The snapshot and the last valid uberblock are equally valid)

But for an application on an NFS client, without ZIL data will be lost 
while the NFS client believes the data is written and it will not try 
again.  With the ZIL, when the NFS server says that data is written then 
it is actually on stable storage.

The reason to have a sync() function in C/C++ is so you can ensure data is
written to disk before you move on.  It's a blocking call, that doesn't
return until the sync is completed.  The only reason you would ever do this
is if order matters.  If you cannot allow the next command to begin until
after the previous one was completed.  Such is the situation with databases
and sometimes virtual machines.  

So the question is: when will your data be invalid?

What happens with the data when the system dies before the fsync() call?
What happens with the data when the system dies after the fsync() call?
What happens with the data when the system dies after more I/O operations?

With the ZIL disabled, you call fsync() but after a crash you may encounter
data from before the call to fsync().  That could also happen if the crash
occurred before the fsync() call, so I assume you can actually recover from
that situation.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 Dude, don't be so arrogant.  Acting like you know what I'm talking
 about
 better than I do.  Face it that you have something to learn here.
 
 You may say that, but then you post this:

Acknowledged.  I read something arrogant, and I replied even more arrogant.
That was dumb of me.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 Only a broken application uses sync writes
 sometimes, and async writes at other times.

Suppose there is a virtual machine, with virtual processes inside it.  Some
virtual process issues a sync write to the virtual OS, meanwhile another
virtual process issues an async write.  Then the virtual OS will sometimes
issue sync writes and sometimes async writes to the host OS.

Are you saying this makes qemu, and vbox, and vmware broken applications?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
 The purpose of the ZIL is to act like a fast log for synchronous
 writes.  It allows the system to quickly confirm a synchronous write
 request with the minimum amount of work.  

Bob and Casper and some others clearly know a lot here.  But I'm hearing
conflicting information, and don't know what to believe.  Does anyone here
work on ZFS as an actual ZFS developer for Sun/Oracle?  Someone who can claim:
I can answer this question, I wrote that code, or at least have read it?

Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls?  Is it
ever used to accelerate async writes?

Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.

I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?

At boot time, or zpool import time, what is taken to be the current
filesystem?  The latest uberblock?  Something else?

My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.

Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.

Somebody, (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you disable your ZIL, then there is no
guarantee your snapshots are consistent either.  Rolling back doesn't
necessarily gain you anything.

The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik


Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls?  Is it
ever used to accelerate async writes?

There are quite a few sync writes, specifically when you mix in the 
NFS server.

Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.

From what I quoted from the other discussion, it seems that later writes 
cannot be committed in an earlier TXG than your sync write or other earlier
writes.

I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?

The uberblock is the root of all the data.  All the data in a ZFS pool 
is referenced by it; after the txg is in stable storage then the uberblock 
is updated.

At boot time, or zpool import time, what is taken to be the current
filesystem?  The latest uberblock?  Something else?

The current zpool and the filesystems, as referenced by the last uberblock.

My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.

sync() is actually *async* and returning from sync() says nothing about 
stable storage.  After fsync() returns it signals that all the data is
in stable storage (except if you disable ZIL), or, apparently, in Linux
when the write caches for your disks are enabled (the default for PC
drives).  ZFS doesn't care about the write cache; it makes sure it is 
flushed.  (There's fsync() and open(..., O_DSYNC|O_SYNC).)
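
For reference, a minimal C sketch contrasting the three calls mentioned above
(hypothetical file names; per POSIX, sync() only schedules the flush, fsync(fd)
blocks until this file's data is stable, and O_DSYNC makes each write()
synchronous by itself):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *buf = "some record\n";

    /* Every write() on this descriptor returns only once the data is stable. */
    int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
    if (fd < 0) return 1;
    write(fd, buf, strlen(buf));
    close(fd);

    /* Ordinary descriptor: the write may sit in the cache until we fsync(). */
    int fd2 = open("other.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd2 < 0) return 1;
    write(fd2, buf, strlen(buf));
    fsync(fd2);   /* blocks until other.log's data is on stable storage */
    close(fd2);

    /* Schedules a system-wide flush; POSIX does not require it to wait. */
    sync();
    return 0;
}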

Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.

Somebody, (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you disable your ZIL, then there is no
guarantee your snapshots are consistent either.  Rolling back doesn't
necessarily gain you anything.

The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.


I believe that the writes are still ordered so the consistency you want is 
actually delivered even without the ZIL enabled.

Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Kyle McDonald
On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
 I know it is way after the fact, but I find it best to coerce each
 drive down to the whole GB boundary using format (create Solaris
 partition just up to the boundary). Then if you ever get a drive a
 little smaller it still should fit.
 
 It seems like it should be unnecessary.  It seems like extra work.  But
 based on my present experience, I reached the same conclusion.

 If my new replacement SSD with identical part number and firmware is 0.001
 Gb smaller than the original and hence unable to mirror, what's to prevent
 the same thing from happening to one of my 1TB spindle disk mirrors?
 Nothing.  That's what.

   
Actually, it's my experience that Sun (and other vendors) do exactly
that for you when you buy their parts - at least for rotating drives, I
have no experience with SSD's.

The Sun disk label shipped on all the drives is setup to make the drive
the standard size for that sun part number. They have to do this since
they (for many reasons) have many sources (diff. vendors, even diff.
parts from the same vendor) for the actual disks they use for a
particular Sun part number.

This isn't new, I believe IBM, EMC, HP, etc. all do it also for the same
reasons.
I'm a little surprised that the engineers would suddenly stop doing it
only on SSD's. But who knows.

  -Kyle

 I take it back.  Me.  I am to prevent it from happening.  And the technique
 to do so is precisely as you've said.  First slice every drive to be a
 little smaller than actual.  Then later if I get a replacement device for
 the mirror, that's slightly smaller than the others, I have no reason to
 care.

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Mattias Pantzare
On Fri, Apr 2, 2010 at 16:24, Edward Ned Harvey solar...@nedharvey.com wrote:
 The purpose of the ZIL is to act like a fast log for synchronous
 writes.  It allows the system to quickly confirm a synchronous write
 request with the minimum amount of work.

 Bob and Casper and some others clearly know a lot here.  But I'm hearing
 conflicting information, and don't know what to believe.  Does anyone here
 work on ZFS as an actual ZFS developer for Sun/Oracle?  Someone who can claim:
 I can answer this question, I wrote that code, or at least have read it?

 Questions to answer would be:

 Is a ZIL log device used only by sync() and fsync() system calls?  Is it
 ever used to accelerate async writes?

sync() will tell the filesystems to flush writes to disk. sync() will
not use ZIL, it will just start a new TXG, and could return before the
writes are done.

fsync() is what you are interested in.


 Suppose there is an application which sometimes does sync writes, and
 sometimes async writes.  In fact, to make it easier, suppose two processes
 open two files, one of which always writes asynchronously, and one of which
 always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
 for writes to be committed to disk out-of-order?  Meaning, can a large block
 async write be put into a TXG and committed to disk before a small sync
 write to a different file is committed to disk, even though the small sync
 write was issued by the application before the large async write?  Remember,
 the point is:  ZIL is disabled.  Question is whether the async could
 possibly be committed to disk before the sync.


Writes from a TXG will not be used until the whole TXG is committed to disk.
Everything from a half-written TXG will be ignored after a crash.

This means that the order of writes within a TXG is not important.

The only way to do a sync write without ZIL is to start a new TXG
after the write. That costs a lot so we have the ZIL for sync writes.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Bob Friesenhahn

On Fri, 2 Apr 2010, Edward Ned Harvey wrote:


So you're saying that while the OS is building txg's to write to disk, the
OS will never reorder the sequence in which individual write operations get
ordered into the txg's.  That is, an application performing a small sync
write, followed by a large async write, will never have the second operation
flushed to disk before the first.  Can you support this belief in any way?


I am like a pool or tank of regurgitated zfs knowledge.  I simply 
pay attention when someone who really knows explains something (e.g. 
Neil Perrin, as Casper referred to) so I can regurgitate it later.  I 
try to do so faithfully.  If I had behaved this way in school, I would 
have been a good student.  Sometimes I am wrong or the design has 
somewhat changed since the original information was provided.


There are indeed popular filesystems (e.g. Linux EXT4) which write 
data to disk in a different order than chronologically requested, so it is 
good that you are paying attention to these issues.  While in the 
slog-based recovery scenario it is possible for a TXG to be generated 
which lacks async data, this only happens after a system crash, and if 
all of the critical data is written as a sync request, it will be 
faithfully preserved.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Bob Friesenhahn

On Fri, 2 Apr 2010, Edward Ned Harvey wrote:

were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.

Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.


You seem to be assuming that Solaris is an incoherent operating 
system.  With ZFS, the filesystem in memory is coherent, and 
transaction groups are constructed in simple chronological order 
(capturing combined changes up to that point in time), without regard 
to SYNC options.  The only possible exception to the coherency is for 
memory mapped files, where the mapped memory is a copy of data 
(originally) from the ZFS ARC and needs to be reconciled with the ARC 
if an application has dirtied it.  This differs from UFS and the way 
Solaris worked prior to Solaris 10.


Synchronous writes are not faster than asynchronous writes.  If you 
drop heavy and light objects from the same height, they fall at the 
same rate.  This was proven long ago.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Stuart Anderson

On Apr 2, 2010, at 5:08 AM, Edward Ned Harvey wrote:

 I know it is way after the fact, but I find it best to coerce each
 drive down to the whole GB boundary using format (create Solaris
 partition just up to the boundary). Then if you ever get a drive a
 little smaller it still should fit.
 
 It seems like it should be unnecessary.  It seems like extra work.  But
 based on my present experience, I reached the same conclusion.
 
 If my new replacement SSD with identical part number and firmware is 0.001
 Gb smaller than the original and hence unable to mirror, what's to prevent
 the same thing from happening to one of my 1TB spindle disk mirrors?
 Nothing.  That's what.
 
 I take it back.  Me.  I am to prevent it from happening.  And the technique
 to do so is precisely as you've said.  First slice every drive to be a
 little smaller than actual.  Then later if I get a replacement device for
 the mirror, that's slightly smaller than the others, I have no reason to
 care.

However, I believe there are some downsides to letting ZFS manage just
a slice rather than an entire drive, but perhaps those do not apply as
significantly to SSD devices?

Thanks

--
Stuart Anderson  ander...@ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Ross Walker
On Fri, Apr 2, 2010 at 8:03 AM, Edward Ned Harvey
solar...@nedharvey.com wrote:
  Seriously, all disks configured WriteThrough (spindle and SSD disks
  alike)
  using the dedicated ZIL SSD device, very noticeably faster than
  enabling the
  WriteBack.

 What do you get with both SSD ZIL and WriteBack disks enabled?

 I mean if you have both why not use both? Then both async and sync IO
 benefits.

 Interesting, but unfortunately false.  Soon I'll post the results here.  I
 just need to package them in a way suitable to give the public, and stick it
 on a website.  But I'm fighting IT fires for now and haven't had the time
 yet.

 Roughly speaking, the following are approximately representative.  Of course
 it varies based on tweaks of the benchmark and stuff like that.
        Stripe 3 mirrors write through:  450-780 IOPS
        Stripe 3 mirrors write back:  1030-2130 IOPS
        Stripe 3 mirrors write back + SSD ZIL:  1220-2480 IOPS
        Stripe 3 mirrors write through + SSD ZIL:  1840-2490 IOPS

 Overall, I would say WriteBack is 2-3 times faster than naked disks.  SSD
 ZIL is 3-4 times faster than naked disk.  And for some reason, having the
 WriteBack enabled while you have SSD ZIL actually hurts performance by
 approx 10%.  You're better off to use the SSD ZIL with disks in Write
 Through mode.

 That result is surprising to me.  But I have a theory to explain it.  When
 you have WriteBack enabled, the OS issues a small write, and the HBA
 immediately returns to the OS:  Yes, it's on nonvolatile storage.  So the
 OS quickly gives it another, and another, until the HBA write cache is full.
 Now the HBA faces the task of writing all those tiny writes to disk, and the
 HBA must simply follow orders, writing a tiny chunk to the sector it said it
 would write, and so on.  The HBA cannot effectively consolidate the small
 writes into a larger sequential block write.  But if you have the WriteBack
 disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
 SSD, and immediately return to the process:  Yes, it's on nonvolatile
 storage.  So the application can issue another, and another, and another.
 ZFS is smart enough to aggregate all these tiny write operations into a
 single larger sequential write before sending it to the spindle disks.

Hmm, when you did the write-back test was the ZIL SSD included in the
write-back?

What I was proposing was write-back only on the disks, and ZIL SSD
with no write-back.

Not all operations hit the ZIL, so it would still be nice to have the
non-ZIL operations return quickly.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Robert Milkowski

On 02/04/2010 16:04, casper@sun.com wrote:


 sync() is actually *async* and returning from sync() says nothing about
 stable storage.

to clarify - in case of ZFS sync() is actually synchronous.

--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Tirso Alonso
 If my new replacement SSD with identical part number and firmware is 0.001
 Gb smaller than the original and hence unable to mirror, what's to prevent
 the same thing from happening to one of my 1TB spindle disk mirrors?

There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):

LBA count = (97696368) + (1953504 * (Desired Capacity in Gbytes – 50.0)) 

Sizes should match exactly if the manufacturer follows the standard.
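
As a worked example (capacity chosen only for illustration): a nominal 1 TB
(1000 GB) drive comes out to 97696368 + 1953504 * (1000 - 50) = 1,953,525,168
LBAs, i.e. exactly 1,000,204,886,016 bytes at 512 bytes per sector.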

See:
http://opensolaris.org/jive/message.jspa?messageID=393336#393336
http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=downloaddata_file_id=1066
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Miles Nordin
 enh == Edward Ned Harvey solar...@nedharvey.com writes:

   enh If you have zpool less than version 19 (when ability to remove
   enh log device was introduced) and you have a non-mirrored log
   enh device that failed, you had better treat the situation as an
   enh emergency.

Ed the log device removal support is only good for adding a slog to
try it out, then changing your mind and removing the slog (which was
not possible before).  It doesn't change the reliability situation one
bit: pools with dead slogs are not importable.  There've been threads
on this for a while.  It's well-discussed because it's an example of
IMHO broken process of ``obviously a critical requirement but not
technically part of the original RFE which is already late,'' as well
as a dangerous pitfall for ZFS admins.  I imagine the process works
well in other cases to keep stuff granular enough that it can be
prioritized effectively, but in this case it's made the slog feature
significantly incomplete for a couple years and put many production
systems in a precarious spot, and the whole mess was predicted before
the slog feature was integrated.

  The on-disk log (slog or otherwise), if I understand right, can
  actually make the filesystem recover to a crash-INconsistent
  state 

   enh You're speaking the opposite of common sense.  

Yeah, I'm doing it on purpose to suggest that just guessing how you
feel things ought to work based on vague notions of economy isn't a
good idea.

   enh If disabling the ZIL makes the system faster *and* less prone
   enh to data corruption, please explain why we don't all disable
   enh the ZIL?

I said complying with fsync can make the system recover to a state not
equal to one you might have hypothetically snapshotted in a moment
leading up to the crash.  Elsewhere I might've said disabling the ZIL
does not make the system more prone to data corruption, *iff* you are
not an NFS server.

If you are, disabling the ZIL can lead to lost writes if an NFS server
reboots and an NFS client does not, which can definitely cause
app-level data corruption.

Disabling the ZIL breaks the D requirement of ACID databases which
might screw up apps that replicate, or keep databases on several
separate servers in sync, and it might lead to lost mail on an MTA,
but because unlike non-COW filesystems it costs nothing extra for ZFS
to preserve write ordering even without fsync(), AIUI you will not get
corrupted application-level data by disabling the ZIL.  you just get
missing data that the app has a right to expect should be there.  The
dire warnings written by kernel developers in the wikis of ``don't
EVER disable the ZIL'' are totally ridiculous and inappropriate IMO.
I think they probably just worked really hard to write the ZIL piece
of ZFS, and don't want people telling their brilliant code to fuckoff
just because it makes things a little slower.  so we get all this
``enterprise'' snobbery and so on.

``crash consistent'' is a technical term not a common-sense term, and
I may have used it incorrectly:

 
http://oraclestorageguy.typepad.com/oraclestorageguy/2007/07/why-emc-technol.html

With a system that loses power on which fsync() had been in use, the
files getting fsync()'ed will probably recover to more recent versions
than the rest of the files, which means the recovered state achieved
by yanking the cord couldn't have been emulated by cloning a snapshot
and not actually having lost power.  However, the app calling fsync()
will expect this, so it's not supposed to lead to application-level
inconsistency.  

If you test your app's recovery ability in just that way, by cloning
snapshots of filesystems on which the app is actively writing and then
seeing if the app can recover the clone, then you're unfortunately not
testing the app quite hard enough if fsync() is involved, so yeah I
guess disabling the ZIL might in theory make incorrectly-written apps
less prone to data corruption.  Likewise, no testing of the app on a
ZFS will be aggressive enough to make the app powerfail-proof on a
non-COW POSIX system because ZFS keeps more ordering than the API
actually guarantees to the app.

I'm repeating myself though.  I wish you'll just read my posts with at
least paragraph granularity instead of just picking out individual
sentences and discarding everything that seems too complicated or too
awkwardly stated.

I'm basing this all on the ``common sense'' that to do otherwise,
fsync() would have to completely ignore its filedescriptor
argument. It'd have to copy the entire in-memory ZIL to the slog and
behave the same as 'lockfs -fa', which I think would perform too badly
compared to non-ZFS filesystems' fsync()s, and would lead to emphatic
performance advice like ``segregate files that get lots of fsync()s
into separate ZFS datasets from files that get high write bandwidth,''
and we don't have advice like that in the blogs/lists/wikis which
makes me think it's not beneficial (the benefit would be 

Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Tim Cook
On Fri, Apr 2, 2010 at 10:08 AM, Kyle McDonald kmcdon...@egenera.com wrote:

 On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
  I know it is way after the fact, but I find it best to coerce each
  drive down to the whole GB boundary using format (create Solaris
  partition just up to the boundary). Then if you ever get a drive a
  little smaller it still should fit.
 
  It seems like it should be unnecessary.  It seems like extra work.  But
  based on my present experience, I reached the same conclusion.
 
  If my new replacement SSD with identical part number and firmware is
 0.001
  Gb smaller than the original and hence unable to mirror, what's to
 prevent
  the same thing from happening to one of my 1TB spindle disk mirrors?
  Nothing.  That's what.
 
 
 Actually, It's my experience that Sun (and other vendors) do exactly
 that for you when you buy their parts - at least for rotating drives, I
 have no experience with SSD's.

 The Sun disk label shipped on all the drives is setup to make the drive
 the standard size for that sun part number. They have to do this since
 they (for many reasons) have many sources (diff. vendors, even diff.
 parts from the same vendor) for the actual disks they use for a
 particular Sun part number.

 This isn't new, I believe IBM, EMC, HP, etc. all do it also for the same
 reasons.
 I'm a little surprised that the engineers would suddenly stop doing it
 only on SSD's. But who knows.

  -Kyle



If I were forced to ignorantly cast a stone, it would be into Intel's lap
(if the SSD's indeed came directly from Sun).  Sun's normal drive vendors
have been in this game for decades, and know the expectations.  Intel on the
other hand, may not have quite the same QC in place yet.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Eric D. Mudama

On Fri, Apr  2 at 11:14, Tirso Alonso wrote:

If my new replacement SSD with identical part number and firmware is 0.001
Gb smaller than the original and hence unable to mirror, what's to prevent
the same thing from happening to one of my 1TB spindle disk mirrors?


There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):

LBA count = (97696368) + (1953504 * (Desired Capacity in Gbytes – 50.0))

Sizes should match exactly if the manufacturer follows the standard.

See:
http://opensolaris.org/jive/message.jspa?messageID=393336#393336
http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=downloaddata_file_id=1066


Problem is that it only applies to devices that are >= 50GB in size,
and the X25 in question is only 32GB.

That being said, I'd be skeptical of either the sourcing of the parts,
or else some other configuration feature on the drives (like HPA or
DCO) that is changing the capacity.  It's possible one of these is in
effect.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Al Hopper
Hi Jeroen,

Have you tried the DDRdrive from Christopher George cgeo...@ddrdrive.com?
Looks to me like a much better fit for your application than the F20?

It would not hurt to check it out.  Looks to me like you need a
product with low *latency* - and a RAM based cache would be a much
better performer than any solution based solely on flash.

Let us know (on the list) how this works out for you.

Regards,

-- 
Al Hopper  Logical Approach Inc,Plano,TX a...@logical-approach.com
   Voice: 214.233.5089 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

If you disable the ZIL, the filesystem still stays correct in RAM, and the
only way you lose any data such as you've described, is to have an
ungraceful power down or reboot.

The advice I would give is:  Do zfs autosnapshots frequently (say ... every
5 minutes, keeping the most recent 2 hours of snaps) and then run with no
ZIL.  If you have an ungraceful shutdown or reboot, rollback to the latest
snapshot ... and rollback once more for good measure.  As long as you can
afford to risk 5-10 minutes of the most recent work after a crash, then you
can get a 10x performance boost most of the time, and no risk of the
aforementioned data corruption.

Why do you need the rollback? The current filesystems have correct and 
consistent data; not different from the last two snapshots.
(Snapshots can happen in the middle of untarring)

The difference between running with or without ZIL is whether the
client has lost data when the server reboots; not different from using 
Linux as an NFS server.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
 If you disable the ZIL, the filesystem still stays correct in RAM, and
 the
 only way you lose any data such as you've described, is to have an
 ungraceful power down or reboot.
 
 The advice I would give is:  Do zfs autosnapshots frequently (say ...
 every
 5 minutes, keeping the most recent 2 hours of snaps) and then run with
 no
 ZIL.  If you have an ungraceful shutdown or reboot, rollback to the
 latest
 snapshot ... and rollback once more for good measure.  As long as you
 can
 afford to risk 5-10 minutes of the most recent work after a crash,
 then you
 can get a 10x performance boost most of the time, and no risk of the
 aforementioned data corruption.
 
 Why do you need the rollback? The current filesystems have correct and
 consistent data; not different from the last two snapshots.
 (Snapshots can happen in the middle of untarring)
 
 The difference between running with or without ZIL is whether the
 client has lost data when the server reboots; not different from using
 Linux as an NFS server.

If you have an ungraceful shutdown in the middle of writing stuff, while the
ZIL is disabled, then you have corrupt data.  Could be files that are
partially written.  Could be wrong permissions or attributes on files.
Could be missing files or directories.  Or some other problem.

Some changes from the last 1 second of operation before crash might be
written, while some changes from the last 4 seconds might be still
unwritten.  This is data corruption, which could be worse than losing a few
minutes of changes.  At least, if you rollback, you know the data is
consistent, and you know what you lost.  You won't continue having more
losses afterward caused by inconsistent data on disk.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
  Can you elaborate?  Just today, we got the replacement drive that has
  precisely the right version of firmware and everything.  Still, when
 we
  plugged in that drive, and create simple volume in the storagetek
 raid
  utility, the new drive is 0.001 Gb smaller than the old drive.  I'm
 still
  hosed.
 
  Are you saying I might benefit by sticking the SSD into some laptop,
 and
  zero'ing the disk?  And then attach to the sun server?
 
  Are you saying I might benefit by finding some other way to make the
 drive
  available, instead of using the storagetek raid utility?
 
 Assuming you are also using a PCI LSI HBA from Sun that is managed with
 a utility called /opt/StorMan/arcconf and reports itself as the
 amazingly
 informative model number Sun STK RAID INT what worked for me was to
 run,
 arcconf delete (to delete the pre-configured volume shipped on the
 drive)
 arcconf create (to create a new volume)
 
 What I observed was that
 arcconf getconfig 1
 would show the same physical device size for our existing drives and
 new
 ones from Sun, but they reported a slightly different logical volume
 size.
 I am fairly sure that was due to the Sun factory creating the initial
 volume
 with a different version of the HBA controller firmware then we where
 using
 to create our own volumes.
 
 If I remember the sign correctly, the newer firmware creates larger
 logical
 volumes, and you really want to upgrade the firmware if you are going
 to
 be running multiple X25-E drives from the same controller.
 
 I hope that helps.

Uggh.  This is totally different than my system.  But thanks for writing.
I'll take this knowledge, and see if we can find some analogous situation
with the StorageTek controller.  It still may be helpful, so again, thanks.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik


If you have an ungraceful shutdown in the middle of writing stuff, while the
ZIL is disabled, then you have corrupt data.  Could be files that are
partially written.  Could be wrong permissions or attributes on files.
Could be missing files or directories.  Or some other problem.

Some changes from the last 1 second of operation before crash might be
written, while some changes from the last 4 seconds might be still
unwritten.  This is data corruption, which could be worse than losing a few
minutes of changes.  At least, if you rollback, you know the data is
consistent, and you know what you lost.  You won't continue having more
losses afterward caused by inconsistent data on disk.

How exactly is this different from rolling back to some other point in time?

I think you don't quite understand how ZFS works; all operations are 
grouped in transaction groups; all the transactions in a particular group 
are committed in one operation.  I don't know what partial ordering ZFS uses 
when creating transaction groups, but a snapshot just picks one
transaction group as the last group included in the snapshot.

When the system reboots, ZFS picks the most recent, valid uberblock;
so the data available is correct up to transaction group N1.

If you rollback to a snapshot, you get data correct up to transaction
group N2.

But N2 < N1, so you lose more data.

Why do you think that a Snapshot has a better quality than the last 
snapshot available?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
 If you have an ungraceful shutdown in the middle of writing stuff,
 while the
 ZIL is disabled, then you have corrupt data.  Could be files that are
 partially written.  Could be wrong permissions or attributes on files.
 Could be missing files or directories.  Or some other problem.
 
 Some changes from the last 1 second of operation before crash might be
 written, while some changes from the last 4 seconds might be still
 unwritten.  This is data corruption, which could be worse than losing
 a few
 minutes of changes.  At least, if you rollback, you know the data is
 consistent, and you know what you lost.  You won't continue having
 more
 losses afterward caused by inconsistent data on disk.
 
 How exactly is this different from rolling back to some other point of
 time?.
 
 I think you don't quite understand how ZFS works; all operations are
 grouped in transaction groups; all the transactions in a particular
 group
 are committed in one operation.  I don't know what partial ordering ZFS

Dude, don't be so arrogant.  Acting like you know what I'm talking about
better than I do.  Face it that you have something to learn here.

Yes, all the transactions in a transaction group are either committed
entirely to disk, or not at all.  But they're not necessarily committed to
disk in the same order that the user level applications requested.  Meaning:
If I have an application that writes to disk in sync mode intentionally
... perhaps because my internal file format consistency would be corrupt if
I wrote out-of-order ... If the sysadmin has disabled ZIL, my sync write
will not block, and I will happily issue more write operations.  As long as
the OS remains operational, no problem.  The OS keeps the filesystem
consistent in RAM, and correctly manages all the open file handles.  But if
the OS dies for some reason, some of my later writes may have been committed
to disk while some of my earlier writes could be lost, which were still
being buffered in system RAM for a later transaction group.

This is particularly likely to happen, if my application issues a very small
sync write, followed by a larger async write, followed by a very small sync
write, and so on.  Then the OS will buffer my small sync writes and attempt
to aggregate them into a larger sequential block for the sake of accelerated
performance.  The end result is:  My larger async writes are sometimes
committed to disk before my small sync writes.  But the only reason I would
ever know or care about that would be if the ZIL were disabled, and the OS
crashed.  Afterward, my file has internal inconsistency.

Perfect examples of applications behaving this way would be databases and
virtual machines.


 Why do you think that a Snapshot has a better quality than the last
 snapshot available?

If you rollback to a snapshot from several minutes ago, you can rest assured
all the transaction groups that belonged to that snapshot have been
committed.  So although you're losing the most recent few minutes of data,
you can rest assured you haven't got file corruption in any of the existing
files.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
 This approach does not solve the problem.  When you do a snapshot,
 the txg is committed.  If you wish to reduce the exposure to loss of
 sync data and run with ZIL disabled, then you can change the txg commit
 interval -- however changing the txg commit interval will not eliminate
 the
 possibility of data loss.

The default commit interval is what, 30 seconds?  Doesn't that guarantee
that any snapshot taken more than 30 seconds ago will have been fully
committed to disk?

Therefore, any snapshot older than 30 seconds old is guaranteed to be
consistent on disk.  While anything less than 30 seconds old could possibly
have some later writes committed to disk before some older writes from a few
seconds before.

If I'm wrong about this, please explain.

I am envisioning a database, which issues a small sync write, followed by a
larger async write.  Since the sync write is small, the OS would prefer to
defer the write and aggregate into a larger block.  So the possibility of
the later async write being committed to disk before the older sync write is
a real risk.  The end result would be inconsistency in my database file.
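
To make the scenario concrete, here is a hypothetical C sketch (file name,
offsets and sizes are invented); the open question in this thread is whether,
with the ZIL disabled and an ill-timed crash, the second (async) write could
end up on disk while the fsync()'d record does not:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int db = open("table.db", O_WRONLY | O_CREAT, 0644);
    if (db < 0) return 1;

    /* Small synchronous write: a commit record the database depends on. */
    const char rec[32] = "txn 7 COMMIT";
    pwrite(db, rec, sizeof rec, 0);
    fsync(db);                        /* the application blocks here */

    /* Larger asynchronous write: no fsync; the OS may defer and aggregate it. */
    char bulk[128 * 1024];
    memset(bulk, 'x', sizeof bulk);
    pwrite(db, bulk, sizeof bulk, 4096);

    close(db);
    return 0;
}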

If you rollback to a snapshot that's at least 30 seconds old, then all the
writes for that snapshot are guaranteed to be committed to disk already, and
in the right order.  You're acknowledging the loss of some known time worth
of data.  But you're gaining a guarantee of internal file consistency.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
 Is that what sync means in Linux?  

A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications which need to satisfy a race condition for
the sake of internal consistency.  Applications which need to know their
next commands will not begin until after the previous sync write was
committed to disk.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik


Dude, don't be so arrogant.  Acting like you know what I'm talking about
better than I do.  Face it that you have something to learn here.

You may say that, but then you post this:


 Why do you think that a Snapshot has a better quality than the last
 snapshot available?

If you rollback to a snapshot from several minutes ago, you can rest assured
all the transaction groups that belonged to that snapshot have been
committed.  So although you're losing the most recent few minutes of data,
you can rest assured you haven't got file corruption in any of the existing
files.


But the actual fact is that there is *NO* difference between the last
uberblock and an uberblock named as snapshot-such-and-so.  All changes 
made after the uberblock was written are discarded by rolling back.


All the transaction groups referenced by last uberblock *are* written to 
disk.

Disabling the ZIL makes sure that fsync() and sync() no longer work;
whether you take a named snapshot or the uberblock is immaterial; your
strategy will cause more data to be lost.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

 Is that what sync means in Linux?  

A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications which need to satisfy a race condition for
the sake of internal consistency.  Applications which need to know their
next commands will not begin until after the previous sync write was
committed to disk.


We're talking about the sync for NFS exports in Linux; what do they mean 
with sync NFS exports? 


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

 This approach does not solve the problem.  When you do a snapshot,
 the txg is committed.  If you wish to reduce the exposure to loss of
 sync data and run with ZIL disabled, then you can change the txg commit
 interval -- however changing the txg commit interval will not eliminate
 the
 possibility of data loss.

The default commit interval is what, 30 seconds?  Doesn't that guarantee
that any snapshot taken more than 30 seconds ago will have been fully
committed to disk?

When a system boots and it finds the snapshot, then all the data referred 
to by the snapshot is on disk.  But the snapshot doesn't guarantee more than 
the last valid uberblock.

Therefore, any snapshot older than 30 seconds old is guaranteed to be
consistent on disk.  While anything less than 30 seconds old could possibly
have some later writes committed to disk before some older writes from a few
seconds before.

If I'm wrong about this, please explain.

When a pointer to data is committed to disk by ZFS, then the data is 
also on disk.  (If the pointer is reachable from the uberblock,
then the data is also on disk and reachable from the uberblock.)

You don't need to wait 30 seconds.  If it's there, it's there.

I am envisioning a database, which issues a small sync write, followed by a
larger async write.  Since the sync write is small, the OS would prefer to
defer the write and aggregate into a larger block.  So the possibility of
the later async write being committed to disk before the older sync write is
a real risk.  The end result would be inconsistency in my database file.

If you rollback to a snapshot that's at least 30 seconds old, then all the
writes for that snapshot are guaranteed to be committed to disk already, and
in the right order.  You're acknowledging the loss of some known time worth
of data.  But you're gaining a guarantee of internal file consistency.


I don't know what ZFS guarantees when you disable the ZIL; the one broken 
promise is that the data may not have been committed to stable storage 
when fsync() returns.

I'm not sure whether there is a barrier when there is a sync()/fsync();
if that is the case, then ZFS is still safe for your application.



Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Mar 31, 2010, at 11:51 PM, Edward Ned Harvey solar...@nedharvey.com wrote:

  A MegaRAID card with write-back cache? It should also be cheaper than
  the F20.

 I haven't posted results yet, but I just finished a few weeks of extensive
 benchmarking various configurations.  I can say this:

 WriteBack cache is much faster than naked disks, but if you can buy an SSD
 or two for ZIL log device, the dedicated ZIL is yet again much faster than
 WriteBack.

 It doesn't have to be F20.  You could use the Intel X25 for example.  If
 you're running solaris proper, you better mirror your ZIL log device.  If
 you're running opensolaris ... I don't know if that's important.  I'll
 probably test it, just to be sure, but I might never get around to it
 because I don't have a justifiable business reason to build the opensolaris
 machine just for this one little test.

 Seriously, all disks configured WriteThrough (spindle and SSD disks alike)
 using the dedicated ZIL SSD device, very noticeably faster than enabling the
 WriteBack.

What do you get with both SSD ZIL and WriteBack disks enabled?

I mean if you have both why not use both? Then both async and sync IO
benefits.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Mar 31, 2010, at 11:58 PM, Edward Ned Harvey solar...@nedharvey.com wrote:

  We ran into something similar with these drives in an X4170 that turned
  out to be an issue of the preconfigured logical volumes on the drives. Once
  we made sure all of our Sun PCI HBAs were running the exact same version of
  firmware and recreated the volumes on new drives arriving from Sun we got
  back into sync on the X25-E devices sizes.

 Can you elaborate?  Just today, we got the replacement drive that has
 precisely the right version of firmware and everything.  Still, when we
 plugged in that drive, and create simple volume in the storagetek raid
 utility, the new drive is 0.001 Gb smaller than the old drive.  I'm still
 hosed.

 Are you saying I might benefit by sticking the SSD into some laptop, and
 zero'ing the disk?  And then attach to the sun server?

 Are you saying I might benefit by finding some other way to make the drive
 available, instead of using the storagetek raid utility?

I know it is way after the fact, but I find it best to coerce each
drive down to the whole GB boundary using format (create Solaris
partition just up to the boundary). Then if you ever get a drive a
little smaller it still should fit.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker

On Apr 1, 2010, at 8:42 AM, casper@sun.com wrote:




Is that what sync means in Linux?


A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications which need to satisfy a race condition for
the sake of internal consistency.  Applications which need to know their
next commands will not begin until after the previous sync write was
committed to disk.



We're talking about the sync for NFS exports in Linux; what do they mean
with sync NFS exports?


See section A1 in the FAQ:

http://nfs.sourceforge.net/

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Darren J Moffat

On 01/04/2010 14:49, Ross Walker wrote:

We're talking about the sync for NFS exports in Linux; what do they mean
with sync NFS exports?


See section A1 in the FAQ:

http://nfs.sourceforge.net/


I think B4 is the answer to Casper's question:

 BEGIN QUOTE 
Linux servers (although not the Solaris reference implementation) allow 
this requirement to be relaxed by setting a per-export option in 
/etc/exports. The name of this export option is [a]sync (note that 
there is also a client-side mount option by the same name, but it has a 
different function, and does not defeat NFS protocol compliance).


When set to sync, Linux server behavior strictly conforms to the NFS 
protocol. This is default behavior in most other server implementations. 
When set to async, the Linux server replies to NFS clients before 
flushing data or metadata modifying operations to permanent storage, 
thus improving performance, but breaking all guarantees about server 
reboot recovery.

 END QUOTE 

For more info see the whole of section B4 through B6.
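
(Not from the FAQ itself: a hypothetical /etc/exports fragment just to show
where the per-export option goes; the paths and client network are made up.)

/export/home     192.168.1.0/24(rw,sync,no_subtree_check)   # strict NFS semantics (the default)
/export/scratch  192.168.1.0/24(rw,async,no_subtree_check)  # server replies before data reaches stable storage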

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Thu, Apr 1, 2010 at 10:03 AM, Darren J Moffat
darr...@opensolaris.org wrote:
 On 01/04/2010 14:49, Ross Walker wrote:

 We're talking about the sync for NFS exports in Linux; what do they mean
 with sync NFS exports?

 See section A1 in the FAQ:

 http://nfs.sourceforge.net/

 I think B4 is the answer to Casper's question:

  BEGIN QUOTE 
 Linux servers (although not the Solaris reference implementation) allow this
 requirement to be relaxed by setting a per-export option in /etc/exports.
 The name of this export option is [a]sync (note that there is also a
 client-side mount option by the same name, but it has a different function,
 and does not defeat NFS protocol compliance).

 When set to sync, Linux server behavior strictly conforms to the NFS
 protocol. This is default behavior in most other server implementations.
 When set to async, the Linux server replies to NFS clients before flushing
 data or metadata modifying operations to permanent storage, thus improving
 performance, but breaking all guarantees about server reboot recovery.
  END QUOTE 

 For more info see the whole of section B4 through B6.

True, I was thinking more of the protocol summary.

 Is that what sync means in Linux?  As NFS doesn't use close or
 fsync, what exactly are the semantics.

 (For NFSv2/v3 each *operation* is sync and the client needs to make sure
 it can continue; for NFSv4, some operations are async and the client
 needs to use COMMIT)

Actually the COMMIT command was introduced in NFSv3.

The full details:

NFS Version 3 introduces the concept of safe asynchronous writes. A
Version 3 client can specify that the server is allowed to reply
before it has saved the requested data to disk, permitting the server
to gather small NFS write operations into a single efficient disk
write operation. A Version 3 client can also specify that the data
must be written to disk before the server replies, just like a Version
2 write. The client specifies the type of write by setting the
stable_how field in the arguments of each write operation to UNSTABLE
to request a safe asynchronous write, and FILE_SYNC for an NFS Version
2 style write.

Servers indicate whether the requested data is permanently stored by
setting a corresponding field in the response to each NFS write
operation. A server can respond to an UNSTABLE write request with an
UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the
requested data resides on permanent storage yet. An NFS
protocol-compliant server must respond to a FILE_SYNC request only
with a FILE_SYNC reply.

Clients ensure that data that was written using a safe asynchronous
write has been written onto permanent storage using a new operation
available in Version 3 called a COMMIT. Servers do not send a response
to a COMMIT operation until all data specified in the request has been
written to permanent storage. NFS Version 3 clients must protect
buffered data that has been written using a safe asynchronous write
but not yet committed. If a server reboots before a client has sent an
appropriate COMMIT, the server can reply to the eventual COMMIT
request in a way that forces the client to resend the original write
operation. Version 3 clients use COMMIT operations when flushing safe
asynchronous writes to the server during a close(2) or fsync(2) system
call, or when encountering memory pressure.
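
(For reference, a C rendering of the NFSv3 types involved, following the XDR
in RFC 1813; the reply struct below is trimmed to the fields discussed above,
so its name and layout are only illustrative.)

/* How the client asks for its data to be stored in a WRITE request. */
enum stable_how {
    UNSTABLE  = 0,   /* server may reply before data is on stable storage */
    DATA_SYNC = 1,   /* data on stable storage, metadata may lag          */
    FILE_SYNC = 2    /* data and metadata on stable storage (NFSv2 style) */
};

/* Trimmed view of the server's WRITE reply. */
struct write_reply_trimmed {
    unsigned int    count;       /* bytes accepted                         */
    enum stable_how committed;   /* how the data was actually stored       */
    unsigned char   verf[8];     /* write verifier: changes across a server
                                    reboot, forcing the client to resend any
                                    UNSTABLE data when it later COMMITs    */
};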
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Bob Friesenhahn

On Thu, 1 Apr 2010, Edward Ned Harvey wrote:


If I'm wrong about this, please explain.

I am envisioning a database, which issues a small sync write, followed by a
larger async write.  Since the sync write is small, the OS would prefer to
defer the write and aggregate into a larger block.  So the possibility of
the later async write being committed to disk before the older sync write is
a real risk.  The end result would be inconsistency in my database file.


Zfs writes data in transaction groups and each bunch of data which 
gets written is bounded by a transaction group.  The current state of 
the data at the time the TXG starts will be the state of the data once 
the TXG completes.  If the system spontaneously reboots then it will 
restart at the last completed TXG so any residual writes which might 
have occurred while a TXG write was in progress will be discarded. 
Based on this, I think that your ordering concerns (sync writes 
getting to disk faster than async writes) are unfounded for normal 
file I/O.


However, if file I/O is done via memory mapped files, then changed 
memory pages will not necessarily be written.  The changes will not be 
known to ZFS until the kernel decides that a dirty page should be 
written or there is a conflicting traditional I/O which would update 
the same file data.  Use of msync(3C) is necessary to assure that file 
data updated via mmap() will be seen by ZFS and committed to disk in an 
orderly fashion.
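
(A small sketch of that last point, assuming an existing file of at least one
page on a ZFS filesystem; the path is made up.)

/* Update a file through mmap() and use msync() so the change is handed to
 * the filesystem now, rather than whenever the kernel writes back the dirty
 * page on its own. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mypool/fs/data.bin", O_RDWR);
    if (fd < 0)
        return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return 1;
    }

    p[0] = 'X';                      /* modify the mapped page             */
    (void) msync(p, 4096, MS_SYNC);  /* flush it to the filesystem now     */

    (void) munmap(p, 4096);
    (void) close(fd);
    return 0;
}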


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Robert Milkowski

On 01/04/2010 13:01, Edward Ned Harvey wrote:

Is that what sync means in Linux?
 

A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications which need to satisfy a race condition for
the sake of internal consistency.  Applications which need to know their
next commands will not begin until after the previous sync write was
committed to disk.

   

ROTFL!!!

I think you should explain it even further for Casper :) :) :) :) :) :) :)

--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

On 01/04/2010 13:01, Edward Ned Harvey wrote:
 Is that what sync means in Linux?
  
 A sync write is one in which the application blocks until the OS acks that
 the write has been committed to disk.  An async write is given to the OS,
 and the OS is permitted to buffer the write to disk at its own discretion.
 Meaning the async write function call returns sooner, and the application is
 free to continue doing other stuff, including issuing more writes.

 Async writes are faster from the point of view of the application.  But sync
 writes are done by applications which need to satisfy a race condition for
 the sake of internal consistency.  Applications which need to know their
 next commands will not begin until after the previous sync write was
 committed to disk.


ROTFL!!!

I think you should explain it even further for Casper :) :) :) :) :) :) :)



:-)

So what I *really* wanted to know was what sync meant for the NFS server
in the case of Linux.

Apparently it means: implement the NFS protocol to the letter.

I'm happy to see that it is now the default and I hope this will cause the 
Linux NFS client implementation to be faster for conforming NFS servers.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Bob Friesenhahn

On Thu, 1 Apr 2010, Edward Ned Harvey wrote:


Dude, don't be so arrogant.  Acting like you know what I'm talking about
better than I do.  Face it that you have something to learn here.


Geez!


Yes, all the transactions in a transaction group are either committed
entirely to disk, or not at all.  But they're not necessarily committed to
disk in the same order that the user level applications requested.  Meaning:
If I have an application that writes to disk in sync mode intentionally
... perhaps because my internal file format consistency would be corrupt if
I wrote out-of-order ... If the sysadmin has disabled ZIL, my sync write
will not block, and I will happily issue more write operations.  As long as
the OS remains operational, no problem.  The OS keeps the filesystem
consistent in RAM, and correctly manages all the open file handles.  But if
the OS dies for some reason, some of my later writes may have been committed
to disk while some of my earlier writes could be lost, which were still
being buffered in system RAM for a later transaction group.


The purpose of the ZIL is to act like a fast log for synchronous 
writes.  It allows the system to quickly confirm a synchronous write 
request with the minimum amount of work.  As you say, OS keeps the 
filesystem consistent in RAM.  There is no 1:1 ordering between 
application write requests and zfs writes and in fact, if the same 
portion of file is updated many times, or the file is created/deleted 
many times, zfs only writes the updated data which is current when the 
next TXG is written.  For a synchronous write, zfs advances its index 
in the slog once the corresponding data has been committed in a TXG. 
In other words, the sync and async write paths are the same when 
it comes to writing final data to disk.


There is however the recovery case where synchronous writes were 
affirmed which were not yet written in a TXG and the system 
spontaneously reboots.  In this case the synchronous writes will occur 
based on the slog, and uncommitted async writes will have been lost. 
Perhaps this is the case you are worried about.


It does seem like rollback to a snapshot does help here (to assure 
that sync & async data is consistent), but it certainly does not help 
any NFS clients.  Only a broken application uses sync writes 
sometimes, and async writes at other times.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik


It does seem like rollback to a snapshot does help here (to assure 
that sync & async data is consistent), but it certainly does not help 
any NFS clients.  Only a broken application uses sync writes 
sometimes, and async writes at other times.

But doesn't that snapshot possibly have the same issues?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Bob Friesenhahn

On Thu, 1 Apr 2010, casper@sun.com wrote:


It does seem like rollback to a snapshot does help here (to assure
that sync & async data is consistent), but it certainly does not help
any NFS clients.  Only a broken application uses sync writes
sometimes, and async writes at other times.


But doesn't that snapshot possibly have the same issues?


No, at least not based on my understanding.  My understanding is that 
zfs uses uniform prioritization of updates and performs writes in 
order (at least to the level of a TXG).  If this is true, then each 
normal TXG will be a coherent representation of the filesystem.


If the slog is used to recover uncommitted writes, then the TXG based 
on that may not match the in-memory filesystem at the time of the 
crash since async writes may have been lost.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Günther
hello

i have had this problem this week. our zil ssd died (apt slc ssd 16gb).
because we had no spare drive in stock, we ignored it.

then we decided to update our nexenta 3 alpha to beta, exported the pool and 
made a fresh install to have a clean system and tried to import the pool. we 
only got an error message about a missing drive.

we googled about this and it seems there is no way to access the pool !!!
(hope this will be fixed in future)

we had a backup and the data are not so important, but that could be a real 
problem.
you have a valid zfs3 pool and you cannot access your data due to a missing zil.


gea

www.napp-it.org zfs server
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Jeroen Roodhart
Hi Casper,

 :-)

Nice to see that your stream still reaches just as far :-)

I'm happy to see that it is now the default and I hope this will cause the
Linux NFS client implementation to be faster for conforming NFS servers.

Interesting thing is that apparently the defaults on Solaris and Linux are chosen 
such that one can't signal the desired behaviour to the other. At least we 
didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS 
backed) NFS export...

Anyway we seem to be getting off topic here :-)

The thread was started to get insight into the behaviour of the F20 as ZIL. _My_ 
particular interest would be to be able to answer why performance doesn't seem 
to scale up when adding vmods... 

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Carson Gaspar

Jeroen Roodhart wrote:


The thread was started to get insight into the behaviour of the F20 as ZIL.
_My_ particular interest would be to be able to answer why performance
doesn't seem to scale up when adding vmods...


My best guess would be latency. If you are latency bound, adding 
additional parallel devices with the same latency will make no 
difference. It will improve throughput, but may actually make latency 
worse (additional time to select which parallel device to use).


But one of the ZFS gurus may be able to provide a better answer, or some 
dtrace foo to confirm/deny my thesis.


--
Carson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Jeroen Roodhart
 It doesn't have to be F20.  You could use the Intel
 X25 for example.  

The MLC-based disks are bound to be too slow (we tested with an OCZ Vertex 
Turbo). So you're stuck with the X25-E (which Sun stopped supporting for some 
reason). I believe most normal SSDs do have some sort of cache and usually no 
supercap or other backup power solution. So be wary of that.

Having said all this, the new Sandforce based SSDs look promising...
 
If you're running solaris proper, you better mirror your
 ZIL log device.  

Absolutely true, I forgot this 'cause we're running OSOL nv130... (we 
constantly seem to need features that haven't landed in Solaris proper :) ).

 If you're running opensolaris ... I don't know if that's
 important. 

At least I can confirm the ability to add and remove ZIL devices on the fly 
with OSOL of a sufficiently recent build. 

 I'll  probably test it, just to be sure, but I might never
 get around to it
 because I don't have a justifiable business reason to
 build the opensolaris
 machine just for this one little test.

I plan to get to test this as well, won't be until late next week though.

With kind regards,

Jeroen

Message was edited by: tuxwield
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Miles Nordin
 enh == Edward Ned Harvey solar...@nedharvey.com writes:

   enh Dude, don't be so arrogant.  Acting like you know what I'm
   enh talking about better than I do.  Face it that you have
   enh something to learn here.

funny!  AIUI you are wrong and Casper is right.

ZFS recovers to a crash-consistent state, even without the slog,
meaning it recovers to some state through which the filesystem passed
in the seconds leading up to the crash.  This isn't what UFS or XFS
do.

The on-disk log (slog or otherwise), if I understand right, can
actually make the filesystem recover to a crash-INconsistent state (a
state not equal to a snapshot you might have hypothetically taken in
the seconds leading up to the crash), because files that were recently
fsync()'d may be of newer versions than files that weren't---that is,
fsync() durably commits only the file it references, by copying that
*part* of the in-RAM ZIL to the durable slog.  fsync() is not
equivalent to 'lockfs -fa' committing every file on the system (is
it?).  I guess I could be wrong about that.

If I'm right, this isn't a bad thing because apps that call fsync()
are supposed to expect the inconsistency, but it's still important to
understanding what's going on.


pgpUNxWo30EYO.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Robert Milkowski

On 01/04/2010 20:58, Jeroen Roodhart wrote:



I'm happy to see that it is now the default and I hope this will cause the
Linux NFS client implementation to be faster for conforming NFS servers.
 

Interesting thing is that apparently defaults on Solaris an Linux are chosen 
such that one can't signal the desired behaviour to the other. At least we 
didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS 
backed) NFS export...
   


Which is to be expected, as it is not the NFS client which requests the 
behavior but rather the NFS server.
Currently on Linux you can export a share as sync (the default) or 
async, while on Solaris you can't currently force the NFS server 
to start working in an async mode.


--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Jeroen Roodhart
Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD
goes bad, you lose your whole pool. Or at least suffer data corruption.

Hmmm, I thought that in that case ZFS reverts to the regular on-disk ZIL?

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Jeroen Roodhart
The write cache is _not_ being disabled. The write cache is being marked
as non-volatile.

Of course you're right :) Please filter my postings with a sed 's/write 
cache/write cache flush/g' ;)

BTW, why is a Sun/Oracle branded product not properly respecting the NV
bit in the cache flush command? This seems remarkably broken, and leads
to the amazingly bad advice given on the wiki referenced above.

I suspect it has something to do with emulating disk semantics over PCIE. 
Anyway, this did get us stumped in the beginning, performance wasn't better 
than when using an OCZ Vertex Turbo ;) 

By the way, the URL to the reference is part of the official F20 product 
documentation (that's how we found it in the first place)...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Karsten Weiss
 I stand corrected.  You don't lose your pool.  You don't have corrupted
 filesystem.  But you lose whatever writes were not yet completed, so if
 those writes happen to be things like database transactions, you could have
 corrupted databases or files, or missing files if you were creating them at
 the time, and stuff like that.  AKA, data corruption.
 
 But not pool corruption, and not filesystem corruption.

Yeah, that's a big difference! :)

Of course we could not live with pool or fs corruption. However, we can live 
with the fact that NFS-written data is not all on disk in case of a server 
crash, although the NFS client relies on the write guarantee of the NFS 
protocol. I.e. we do not use it for db transactions or anything like that.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Karsten Weiss
Hi Adam,

 Very interesting data. Your test is inherently
 single-threaded so I'm not surprised that the
 benefits aren't more impressive -- the flash modules
 on the F20 card are optimized more for concurrent
 IOPS than single-threaded latency.

Thanks for your reply. I'll probably test the multiple write case, too.

But frankly at the moment I care the most about the single-threaded case
because if we put e.g. user homes on this server I think they would be
severely disappointed if they would have to wait 2m42s just to extract a rather
small 50 MB tarball. The default 7m40s without SSD log was unacceptable
and we were hoping that the F20 would make a big difference and bring the
performance down to acceptable runtimes. But IMHO 2m42s is still too slow
and disabling the ZIL seems to be the only option.

Knowing that 100s of users could do this in parallel with good performance
is nice but it does not improve the situation for the single user which only
cares for his own tar run. If there's anything else we can do/try to improve
the single-threaded case I'm all ears.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Brent Jones
On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
k.we...@science-computing.de wrote:
 Hi Adam,

 Very interesting data. Your test is inherently
 single-threaded so I'm not surprised that the
 benefits aren't more impressive -- the flash modules
 on the F20 card are optimized more for concurrent
 IOPS than single-threaded latency.

 Thanks for your reply. I'll probably test the multiple write case, too.

 But frankly at the moment I care the most about the single-threaded case
 because if we put e.g. user homes on this server I think they would be
 severely disappointed if they would have to wait 2m42s just to extract a 
 rather
 small 50 MB tarball. The default 7m40s without SSD log were unacceptable
 and we were hoping that the F20 would make a big difference and bring the
 performance down to acceptable runtimes. But IMHO 2m42s is still too slow
 and disabling the ZIL seems to be the only option.

 Knowing that 100s of users could do this in parallel with good performance
 is nice but it does not improve the situation for the single user which only
 cares for his own tar run. If there's anything else we can do/try to improve
 the single-threaded case I'm all ears.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Use something other than Open/Solaris with ZFS as an NFS server?  :)

I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying for more than a year, and
watching dozens, if not hundreds of threads.
Getting halfway decent performance from NFS and ZFS is impossible
unless you disable the ZIL.

You'd be better off getting NetApp

-- 
Brent Jones
br...@servuhome.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Arne Jansen
Brent Jones wrote:
 
 I don't think you'll find the performance you paid for with ZFS and
 Solaris at this time. I've been trying to more than a year, and
 watching dozens, if not hundreds of threads.
 Getting half-ways decent performance from NFS and ZFS is impossible
 unless you disable the ZIL.

A few days ago I posted to nfs-discuss with a proposal to add some
mount/share options to change the semantics of an NFS-mounted filesystem
so that they parallel those of a local filesystem.
The main point is that data gets flushed to stable storage only if the
client explicitly requests so via fsync or O_DSYNC, not implicitly
with every close().
That would give you the performance you are seeking without sacrificing
data integrity for applications that need it.

I get the impression that I'm not the only one who could be interested
in that ;)

-Arne

 
 You'd be better off getting NetApp
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Karsten Weiss
 Nobody knows any way for me to remove my unmirrored
 log device.  Nobody knows any way for me to add a mirror to it (until

Since snv_125 you can remove log devices. See
http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

I've used this all the time during my testing and was able to remove both
mirrored and unmirrored log devices without any problems (and without
reboot). I'm using snv_134.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Robert Milkowski



On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
   
Use something other than Open/Solaris with ZFS as an NFS server?  :)


I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying to more than a year, and
watching dozens, if not hundreds of threads.
Getting half-ways decent performance from NFS and ZFS is impossible
unless you disable the ZIL.

   


Well, for lots of environments disabling the ZIL is perfectly acceptable.
And frankly the reason you get better performance out of the box on 
Linux as an NFS server is that it actually behaves as if the ZIL were disabled - 
so disabling the ZIL on ZFS for NFS shares is no worse than using Linux here 
or any other OS which behaves in the same manner. Actually it makes it 
better, as even with the ZIL disabled the ZFS filesystem is always consistent 
on disk and you still get all the other benefits of ZFS.


What would be useful though is to be able to easily disable the ZIL per 
dataset instead of with an OS-wide switch.
This feature has already been coded and tested and awaits a formal 
process to be completed in order to get integrated. It should land 
sooner rather than later.



You'd be better off getting NetApp
   
Well, spend some extra money on a really fast NVRAM solution for the ZIL and 
you will get a much faster ZFS environment than NetApp while still 
spending much less money. Not to mention all the extra flexibility compared 
to NetApp.


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Robert Milkowski



Just to make sure you know ... if you disable the ZIL altogether, and you
have a power interruption, failed cpu, or kernel halt, then you're likely to
have a corrupt unusable zpool, or at least data corruption.  If that is
indeed acceptable to you, go nuts.  ;-)

I believe that the above is wrong information as long as the devices
involved do flush their caches when requested to.  Zfs still writes
data in order (at the TXG level) and advances to the next transaction
group when the devices written to affirm that they have flushed their
cache.  Without the ZIL, data claimed to be synchronously written
since the previous transaction group may be entirely lost.

If the devices don't flush their caches appropriately, the ZIL is
irrelevant to pool corruption.
 

I stand corrected.  You don't lose your pool.  You don't have corrupted
filesystem.  But you lose whatever writes were not yet completed, so if
those writes happen to be things like database transactions, you could have
corrupted databases or files, or missing files if you were creating them at
the time, and stuff like that.  AKA, data corruption.

But not pool corruption, and not filesystem corruption.


   
Which is an expected behavior when you break NFS requirements as Linux 
does out of the box.
Disabling the ZIL on an NFS server makes it no worse than the standard Linux 
behaviour - now you get decent performance at the cost of some data getting 
corrupted from an NFS client's point of view. But then there are 
environments where it is perfectly acceptable, as there you are not 
running critical databases but rather user home directories, and zfs will 
currently flush a transaction after at most 30s, so a user won't be able to 
lose more than the last 30s if the nfs server were to suddenly lose power.


To clarify - if the ZIL is disabled it makes no difference at all to 
pool/filesystem level consistency.


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Robert Milkowski



standard ZIL:          7m40s  (ZFS default)
1x SSD ZIL:            4m07s  (Flash Accelerator F20)
2x SSD ZIL:            2m42s  (Flash Accelerator F20)
2x SSD mirrored ZIL:   3m59s  (Flash Accelerator F20)
3x SSD ZIL:            2m47s  (Flash Accelerator F20)
4x SSD ZIL:            2m57s  (Flash Accelerator F20)
disabled ZIL:          0m15s
(local extraction      0m0.269s)

I was not so much interested in the absolute numbers but rather in the
relative performance differences between the standard ZIL, the SSD ZIL and
the disabled ZIL cases.

Oh, one more comment.  If you don't mirror your ZIL, and your unmirrored SSD
goes bad, you lose your whole pool.  Or at least suffer data corruption.

This is not true. If the ZIL device dies while the pool is imported then 
ZFS starts using a ZIL within the pool and continues to operate.


On the other hand, if your server suddenly loses power, then when 
you power it up later on and ZFS detects that the ZIL is broken/gone, it 
will require a sysadmin intervention to force the pool import and yes, 
possibly lose some data.


But how is it different from any other solution where your log is put on 
a separate device?
Well, it is actually different. With ZFS you can still guarantee it to 
be consistent on-disk while others generally can't, and often you will 
have to do fsck to even mount a fs in r/w...


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Karsten Weiss
Hi  Jeroen, Adam!

 link. Switched write caching off with the following
 addition to the /kernel/drv/sd.conf file (Karsten: if
 you didn't do this already, you _really_ want to :)

Okay, I bite! :) format -> inquiry on the F20 FMod disks returns:

# Vendor:   ATA
# Product:  MARVELL SD88SA02

So I put this in /kernel/drv/sd.conf and rebooted:

# KAW, 2010-03-31
# Set F20 FMod devices to non-volatile mode
# See 
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
sd-config-list = "ATA     MARVELL SD88SA02", "nvcache1";
nvcache1=1, 0x4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;

Now the tarball extraction test with active ZIL finishes in ~0m32s!
I've tested with a mirrored SSD log and two separate SSD log devices.
The runtime is nearly the same. Compared to the 2m42s before the
/kernel/drv/sd.conf modification this is a huge improvement. The
performance with active ZIL would be acceptable now.

But is this mode of operation *really* safe?

FWIW zilstat during the test shows this:

   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
         0          0           0          0          0           0      0      0       0       0
   1039072    1039072     1039072    3772416    3772416     3772416    610    299     311       0
   1522496    1522496     1522496    5402624    5402624     5402624    874    429     445       0
   2292952    2292952     2292952    6746112    6746112     6746112    931    215     716       0
   2321272    2321272     2321272    6774784    6774784     6774784    931    208     723       0
   2303472    2303472     2303472    6549504    6549504     6549504    897    195     702       0
    ...632     ...632      ...632    6733824    6733824     6733824    935    226     709       0
   2198328    2198328     2198328    6668288    6668288     6668288    926    224     702       0
    ...217     ...217      ...217    6373376    6373376     6373376    878    200     678       0
   2185416    2185416     2185416    6352896    6352896     6352896    874    197     677       0
   2218040    2218040     2218040    6516736    6516736     6516736    897    203     694       0
   2436984    2436984     2436984    6549504    6549504     6549504    885    171     714       0

I.e. ~900 ops/s.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
 Use something other than Open/Solaris with ZFS as an NFS server?  :)
 
 I don't think you'll find the performance you paid for with ZFS and
 Solaris at this time. I've been trying to more than a year, and
 watching dozens, if not hundreds of threads.
 Getting half-ways decent performance from NFS and ZFS is impossible
 unless you disable the ZIL.
 
 You'd be better off getting NetApp

Hah hah.  I have a Sun X4275 server exporting NFS.  We have clients on all 4
of the Gb ethers, and the Gb ethers are the bottleneck, not the disks or
filesystem.

I suggest you either enable the WriteBack cache on your HBA, or add SSD's
for ZIL.  Performance is 5-10x higher this way than using naked disks.
But of course, not as high as it is with a disabled ZIL.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Jeroen Roodhart
Hi Karsten,

 But is this mode of operation *really* safe?

As far as I can tell it is. 

-The F20 uses some form of power backup that should provide power to the 
interface card long enough to get the cache onto solid state in case of power 
failure. 

-Recollecting from earlier threads here; in case the card fails (but not the 
host), there should be enough data residing in memory for ZFS to safely switch 
to the regular on-disk ZIL.

-According to my contacts at Sun, the F20 is a viable replacement solution for 
the X25-E. 

-Switching write caching off seems to be officially recommended on the Sun 
performance wiki
 (translated to more sane defaults).

If I'm wrong here I'd like to know too, 'cause this is probably the way we're 
taking it in production.
 :)

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

