Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-02-02 Thread Torrey McMahon

Marion Hakanson wrote:

However, given the default behavior of ZFS (as of Solaris-10U3) is to
panic/halt when it encounters a corrupted block that it can't repair,
I'm re-thinking our options, weighing against the possibility of a
significant downtime caused by a single-block corruption.


Guess what happens when UFS finds an inconsistency it can't fix either?

The issue is that ZFS has the chance to fix the inconsistency if the 
zpool is a mirror or raidz, not merely that it finds the inconsistency in 
the first place. Given the same set of errors, ZFS will simply find more of 
them than other filesystems do.
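
(As a minimal sketch, with a hypothetical pool name "tank": a scrub is the
usual way to exercise this. It reads every block and verifies checksums, and
with a mirror or raidz vdev it repairs what it finds from the redundant copy:)

# zpool scrub tank        # walk the pool, verifying every block's checksum
# zpool status -v tank    # CKSUM column shows what was found; with
                          # redundancy the errors are repaired, without it
                          # they are only reported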




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-02-01 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 That is the part of your setup that puzzled me.  You took the same 7 disk
 raid5 set and split them into 9 LUNs.  The Hitachi likely splits the virtual
 disk into 9 contiguous partitions, so each LUN maps back to different parts
 of the 7 disks.  I speculate that ZFS thinks it is talking to 9 different
 disks and so spreads out the writes accordingly. What ZFS thinks are sequential
 writes become well-spaced writes across the entire disk and blow your seek
 time through the roof. 

That's what I thought might happen before I even tried this, although it's
also possible the Hitachi stripes each LUN across all 7 disks.  Either
way, one could be getting too many seeks.  Note that I'm just trying to see
if it was so bad that the self-healing capability wasn't worth the cost.
I do realize these are 7200rpm SATA disks, so seeking isn't what they do best.


 I'm interested in how it looks from the Hitachi end.  If you can, could
 you repeat the test with the Hitachi presenting all 7 disks directly to
 ZFS as LUNs?

The array doesn't give us that capability.


 Interesting... what you are suggesting is that %b is 100% when w/s and r/s
 are 0? 

Correct.  Sometimes all iostat -xn columns are 0 except %b; sometimes
the asvc_t column stays at 4.0 for the duration of the quiet period.
I've also observed times where all columns were 0, including %b.  Sure
is puzzling.


[EMAIL PROTECTED] said:
 IIRC, the calculation for %busy is the amount of time that an I/O is on the
 device.  These symptoms would occur if an I/O is dropped somewhere along the
 way or at the array.  Eventually, we'll timeout and retry, though by default
 that should be after 60 seconds.  I think we need to figure out what is going
 on here before accepting the results. It could be that we're overrunning the
 queue on the Hitachi.  By default, ZFS will send 35 concurrent commands per
 vdev and the ssd driver will send up to 256 to a target.  IIRC, Hitachi has a
 formula for calculating ssd_max_throttle to avoid such overruns, but I'm not
 sure if that applies to this specific array. 

Hmm, it's true that I have made no tuning changes on the T2000 side.  It
would make sense if the array just stopped responding.  I'll have to poke
at the array and see if it has any diagnostics logged somewhere.  I recall
that the Hitachi docs do have some recommendations on max-throttle settings,
so I'll go dig those up and see what I can find out.
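
(For reference, a rough sketch of the kind of tuning in question; the actual
value is array-specific and should come from Hitachi's formula rather than
the placeholder below.  The max-throttle is set in /etc/system and takes
effect after a reboot:)

* /etc/system -- example only; the value below is a placeholder
set ssd:ssd_max_throttle = 32
* (on hosts where the LUNs attach via the sd driver, the equivalent
*  tunable is sd:sd_max_throttle)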

Thanks for the comments,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-31 Thread Marion Hakanson
I wrote:
 Just thinking out loud here.  Now I'm off to see what kind of performance
 cost there is, comparing (with 400GB disks):
   Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
   8+1 RAID-Z on 9 244.2GB LUN's from a 6+1 HW RAID5 volume


[EMAIL PROTECTED] said:
 Interesting idea.  Please post back to let us know how the performance looks.


The short story is, performance is not bad with the raidz arrangement, until
you get to doing reads, at which point it looks much worse than the 1-LUN setup.

Please bear in mind that I'm neither a storage nor a benchmarking expert,
though I'd say I'm not a neophyte either.

Some specifics:

The array is a low-end Hitachi, 9520V.  My two test subjects are a pair
of RAID-5 groups in the same shelf, each consisting of 6D+1P 400GB SATA
drives.  The test host is a Sun T2000, 16GB RAM, connected via 2Gb FC
links through a pair of switches (the array/mpxio combination do not
support load-balancing, so only one 2Gb channel is in use at a time).
It is running Solaris-10U3, patches current as of 12-Jan-2007.

The array was mostly idle except for my tests, although some light
I/O to other shelves may have come from another host on occasion.
The test host wasn't doing anything else during these tests.

One RAID-5 group was configured as a single 2048GB LUN (about 150GB was
left unallocated because the array has a maximum LUN size); the second
RAID-5 group was set up as nine 244.3GB LUN's.
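
(For reference, the pools below would have been created with something along
these lines; a sketch only, with the long device names abbreviated to match
the zpool status output that follows:)

# zpool create -m /sp1 bulk_sp1 c6t4849...0230d0
# zpool create -m /zp2 bulk_zp2 raidz c6t4849...0330d0 c6t4849...0331d0 \
    [...] c6t4849...0338d0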

Here are the zpool configurations I used for these tests:
# zpool status -v
  pool: bulk_sp1
 state: ONLINE
 scrub: none requested
config:

NAME                                              STATE     READ WRITE CKSUM
bulk_sp1                                          ONLINE       0     0     0
  c6t4849544143484920443630303133323230303230d0  ONLINE       0     0     0

errors: No known data errors

  pool: bulk_zp2
 state: ONLINE
 scrub: none requested
config:

NAME                                                STATE     READ WRITE CKSUM
bulk_zp2                                            ONLINE       0     0     0
  raidz1                                            ONLINE       0     0     0
    c6t4849544143484920443630303133323230303330d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303331d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303332d0  ONLINE       0     0     0
    c6t484954414348492044363030313332323030d0      ONLINE       0     0     0
    c6t4849544143484920443630303133323230303334d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303335d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303336d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303337d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303338d0  ONLINE       0     0     0

errors: No known data errors
# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
bulk_sp183K  1.95T  24.5K  /sp1
bulk_zp2  73.8K  1.87T  2.67K  /zp2


I used two benchmarks:  One was a bunzip2 | tar extract of the Sun
Studio-11 SPARC distribution tarball, extracting from the T2000's
internal drives onto the test zpools.  For this benchmark, both zpools
gave similar results:

pool sp1 (single-LUN stripe):
  du -s -k:
1155141
  time -p:
real 713.67
user 614.42
sys 7.56
  1.6MB/sec overall

pool zp2 (8+1-LUN raidz1):
  du -s -k:
1169020
  time -p:
real 714.96
user 614.78
sys 7.56
  1.6MB/sec overall



The 2nd benchmark was bonnie++ v1.03, run single-threaded with default
arguments, which means a 32GB dataset made up of 1GB files.  Observations of
vmstat and mpstat during the tests showed that bonnie++ is CPU-limited
on the T2000, especially for the getc()/putc() tests, so I later ran three
bonnie++ instances simultaneously (13GB dataset each), and got the same results
in total throughput for the block read/write tests on the single-LUN zpool
(I was not patient enough to sit through the getc/putc tests again :-).
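
(The runs were along the lines of this sketch; the paths and -u user are
illustrative, not the exact command lines used:)

# bonnie++ -d /sp1 -s 32g -u root                              # single 32GB run
# for i in 1 2 3; do bonnie++ -d /sp1 -s 13g -u root & done    # 3x concurrent run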

pool sp1 (single-LUN stripe):
Version  1.03   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
filer1  32G 15497  99 66245  84 16652  30 15210  90 106600  59 322.3   3
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16  5204 100 + +++  8076 100  4551 100 + +++  7509 100
filer1,32G,15497,99,66245,84,16652,30,15210,90,106600,59,322.3,3,16,5204,100,+,+++,8076,100,4551,100,+,+++,7509,100

pool zp2 (8+1-LUN 

Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-31 Thread Wee Yeh Tan

On 2/1/07, Marion Hakanson [EMAIL PROTECTED] wrote:

There's also the potential of too much seeking going on for the raidz pool,
since there are 9 LUN's on top of 7 physical disk drives (though how Hitachi
divides/stripes those LUN's is not clear to me).


Marion,

That is the part of your setup that puzzled me.  You took the same 7
disk raid5 set and split them into 9 LUNs.  The Hitachi likely splits
the virtual disk into 9 contiguous partitions, so each LUN maps back
to different parts of the 7 disks.  I speculate that ZFS thinks it is
talking to 9 different disks and so spreads out the writes accordingly.
What ZFS thinks are sequential writes become well-spaced writes across
the entire disk and blow your seek time through the roof.

I'm interested in how it looks from the Hitachi end.  If you can, could
you repeat the test with the Hitachi presenting all 7 disks directly to
ZFS as LUNs?


One thing I noticed which puzzles me is that in both configurations, though
more so in the divided-up raidz pool, there were long periods of time where
the LUN's showed in iostat -xn output at 100% busy but with no I/O's
happening at all.  No paging, CPU 100% idle, no less than 2GB of free RAM,
for as long as 20-30 seconds.  Sure puts a dent in the throughput.


Interesting... what you are suggesting is that %b is 100% when w/s and r/s are 0?


--
Just me,
Wire ...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-31 Thread Richard Elling

fishy smell way below...

Marion Hakanson wrote:

I wrote:

Just thinking out loud here.  Now I'm off to see what kind of performance
cost there is, comparing (with 400GB disks):
Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
8+1 RAID-Z on 9 244.2GB LUN's from a 6+1 HW RAID5 volume



[EMAIL PROTECTED] said:

Interesting idea.  Please post back to let us know how the performance looks.



The short story is, performance is not bad with the raidz arrangement, until
you get to doing reads, at which point it looks much worse than the 1-LUN setup.

Please bear in mind that I'm neither a storage nor a benchmarking expert,
though I'd say I'm not a neophyte either.

Some specifics:

The array is a low-end Hitachi, 9520V.  My two test subjects are a pair
of RAID-5 groups in the same shelf, each consisting of 6D+1P 400GB SATA
drives.  The test host is a Sun T2000, 16GB RAM, connected via 2Gb FC
links through a pair of switches (the array/mpxio combination do not
support load-balancing, so only one 2Gb channel is in use at a time).
It is running Solaris-10U3, patches current as of 12-Jan-2007.

The array was mostly idle except for my tests, although some light
I/O to other shelves may have come from another host on occasion.
The test host wasn't doing anything else during these tests.

One RAID-5 group was configured as a single 2048GB LUN (about 150GB was
left unallocated because the array has a maximum LUN size); the second
RAID-5 group was set up as nine 244.3GB LUN's.

Here are the zpool configurations I used for these tests:
# zpool status -v
  pool: bulk_sp1
 state: ONLINE
 scrub: none requested
config:

NAME                                              STATE     READ WRITE CKSUM
bulk_sp1                                          ONLINE       0     0     0
  c6t4849544143484920443630303133323230303230d0  ONLINE       0     0     0

errors: No known data errors

  pool: bulk_zp2
 state: ONLINE
 scrub: none requested
config:

NAME                                                STATE     READ WRITE CKSUM
bulk_zp2                                            ONLINE       0     0     0
  raidz1                                            ONLINE       0     0     0
    c6t4849544143484920443630303133323230303330d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303331d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303332d0  ONLINE       0     0     0
    c6t484954414348492044363030313332323030d0      ONLINE       0     0     0
    c6t4849544143484920443630303133323230303334d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303335d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303336d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303337d0  ONLINE       0     0     0
    c6t4849544143484920443630303133323230303338d0  ONLINE       0     0     0

errors: No known data errors
# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
bulk_sp183K  1.95T  24.5K  /sp1
bulk_zp2  73.8K  1.87T  2.67K  /zp2


I used two benchmarks:  One was a bunzip2 | tar extract of the Sun
Studio-11 SPARC distribution tarball, extracting from the T2000's
internal drives onto the test zpools.  For this benchmark, both zpools
gave similar results:

pool sp1 (single-LUN stripe):
  du -s -k:
1155141
  time -p:
real 713.67
user 614.42
sys 7.56
  1.6MB/sec overall

pool zp2 (8+1-LUN raidz1):
  du -s -k:
1169020
  time -p:
real 714.96
user 614.78
sys 7.56
  1.6MB/sec overall



The 2nd benchmark was bonnie++ v1.03, run single-threaded with default
arguments, which means a 32GB dataset made up of 1GB files.  Observations of
vmstat and mpstat during the tests showed that bonnie++ is CPU-limited
on the T2000, especially for the getc()/putc() tests, so I later ran three
bonnie++ instances simultaneously (13GB dataset each), and got the same results
in total throughput for the block read/write tests on the single-LUN zpool
(I was not patient enough to sit through the getc/putc tests again :-).

pool sp1 (single-LUN stripe):
Version  1.03   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
filer1  32G 15497  99 66245  84 16652  30 15210  90 106600  59 322.3   3
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16  5204 100 + +++  8076 100  4551 100 + +++  7509 100

Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-29 Thread Darren Dunham
  Our Netapp does double-parity RAID.  In fact, the filesystem design is
  remarkably similar to that of ZFS.  Wouldn't that also detect the
  error?  I suppose it depends if the `wrong sector without notice'
  error is repeated each time.  Or is it random?
 
 On most (all?) other systems the parity only comes into effect when a  
 drive fails. When all the drives are reporting OK most (all?) RAID  
 systems don't use the parity data at all. ZFS is the first (only?)  
 system that actively checks the data returned from disk, regardless  
 of whether the drives are reporting they're okay or not.
 
 I'm sure I'll be corrected if I'm wrong. :)

Netapp/OnTAP does do read verification, but it does it outside the
raid-4/raid-dp protection (just like ZFS does it outside the raidz
protection).  So it's correct that the parity data is not read at all in
either OnTAP or ZFS, but both attempt to do verification of the data on
all reads.

See also: http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data for a
few more specifics on it and the differences from the ZFS data check.

-- 
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant TAOShttp://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
  This line left intentionally blank to confuse you. 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-29 Thread Jeffery Malloch
Hi Guys,

SO...

From what I can tell from this thread, ZFS is VERY fussy about managing 
writes, reads and failures.  It wants to be bit perfect.  So if you use the 
hardware that comes with a given solution (in my case an Engenio 6994) to 
manage failures, you risk a) bad writes that don't get picked up due to 
corruption from write cache to disk, and b) failures due to data changes that ZFS 
is unaware of that the hardware imposes when it tries to fix itself.

So now I have a $70K+ lump that's useless for what it was designed for.  I 
should have spent $20K on a JBOD.  But since I didn't do that, it sounds like a 
traditional model works best (ie. UFS et al) for the type of hardware I have.  
No sense paying for something and not using it.  And by using ZFS just as a 
method for ease of file system growth and management I risk much more 
corruption.

The other thing I haven't heard is why NOT to use ZFS, or from people who don't 
like it for some reason or another.

Comments?

Thanks,

Jeff

PS - the responses so far have been great and are much appreciated!  Keep 'em 
coming...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-29 Thread Jason J. W. Williams

Hi Jeff,

Maybe I mis-read this thread, but I don't think anyone was saying that
using ZFS on top of an intelligent array risks more corruption. Given
my experience, I wouldn't run ZFS without some level of redundancy,
since it will panic your kernel in a RAID-0 scenario where it detects
a LUN is missing and can't fix it. That being said, I wouldn't run
anything but ZFS anymore. When we had some database corruption issues
a while back, ZFS made it very simple to prove it was the DB. Just did
a scrub and boom, verification that the data was laid down correctly.
RAID-5 will have better random read performance than RAID-Z for reasons
Robert had to beat into my head. ;-) But if you really need that
performance, perhaps RAID-10 is what you should be looking at? Someone
smarter than I can probably give a better idea.

Regarding failure detection, does anyone on the list have the
ZFS/FMA traps fed into a network management app yet? I'm curious what
the experience with it has been.

Best Regards,
Jason

On 1/29/07, Jeffery Malloch [EMAIL PROTECTED] wrote:

Hi Guys,

SO...

From what I can tell from this thread, ZFS is VERY fussy about managing 
writes, reads and failures.  It wants to be bit perfect.  So if you use the 
hardware that comes with a given solution (in my case an Engenio 6994) to manage 
failures, you risk a) bad writes that don't get picked up due to corruption from 
write cache to disk, and b) failures due to data changes that ZFS is unaware of that 
the hardware imposes when it tries to fix itself.

So now I have a $70K+ lump that's useless for what it was designed for.  I 
should have spent $20K on a JBOD.  But since I didn't do that, it sounds like a 
traditional model works best (ie. UFS et al) for the type of hardware I have.  
No sense paying for something and not using it.  And by using ZFS just as a 
method for ease of file system growth and management I risk much more 
corruption.

The other thing I haven't heard is why NOT to use ZFS.  Or people who don't 
like it for some reason or another.

Comments?

Thanks,

Jeff

PS - the responses so far have been great and are much appreciated!  Keep 'em 
coming...


This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-29 Thread Jonathan Edwards


On Jan 29, 2007, at 14:17, Jeffery Malloch wrote:


Hi Guys,

SO...

From what I can tell from this thread, ZFS is VERY fussy about  
managing writes, reads and failures.  It wants to be bit perfect.   
So if you use the hardware that comes with a given solution (in my  
case an Engenio 6994) to manage failures, you risk a) bad writes  
that don't get picked up due to corruption from write cache to  
disk, and b) failures due to data changes that ZFS is unaware of that  
the hardware imposes when it tries to fix itself.


So now I have a $70K+ lump that's useless for what it was designed  
for.  I should have spent $20K on a JBOD.  But since I didn't do  
that, it sounds like a traditional model works best (ie. UFS et al)  
for the type of hardware I have.  No sense paying for something and  
not using it.  And by using ZFS just as a method for ease of file  
system growth and management I risk much more corruption.


The other thing I haven't heard is why NOT to use ZFS.  Or people  
who don't like it for some reason or another.


Comments?


I put together this chart a while back .. I should probably update it  
for RAID6 and RAIDZ2


#   ZFS  ARRAY HW     CAPACITY   COMMENTS
--  ---  -----------  ---------  --------------------------------------
1   R0   R1           N/2        hw mirror - no zfs healing
2   R0   R5           N-1        hw R5 - no zfs healing
3   R1   2 x R0       N/2        flexible, redundant, good perf
4   R1   2 x R5       (N/2)-1    flexible, more redundant, decent perf
5   R1   1 x R5       (N-1)/2    parity and mirror on same drives (XXX)

6   RZ   R0           N-1        standard RAID-Z no mirroring
7   RZ   R1 (tray)    (N/2)-1    RAIDZ+1
8   RZ   R1 (drives)  (N/2)-1    RAID1+Z (highest redundancy)
9   RZ   3 x R5       N-4        triple parity calculations (XXX)
10  RZ   1 x R5       N-2        double parity calculations (XXX)

(note: I included the cases where you have multiple arrays with a  
single lun per vdisk (say) and where you only have a single array  
split into multiple LUNs.)
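
(As a concrete sketch of line 3 above, with hypothetical device names: one
R0 LUN presented from each of two arrays, mirrored by ZFS so that zfs can
heal from the surviving side:)

# zpool create tank mirror c4t0d0 c5t0d0   # c4*/c5* = one LUN per array
# zpool status -x tank                     # "healthy", or details of any errors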


The way I see it, you're better off picking either controller parity  
or zfs parity .. there's no sense in computing parity multiple times  
unless you have cycles to spare and don't mind the performance hit ..  
so the questions you should really answer before you choose the  
hardware are: what redundancy-to-capacity balance do you want, and do  
you want to compute RAID in ZFS host memory or out on a dedicated  
blackbox controller?  I would say something about double caching too,  
but I think that's moot since you'll always cache in the ARC if you  
use ZFS the way it's currently written.


Other feasible filesystem options for Solaris - UFS, QFS, or vxfs  
with SVM or VxVM for volume mgmt if you're so inclined .. all depends  
on your budget and application.  There's currently tradeoffs in each  
one, and contrary to some opinions, the death of any of these has  
been grossly exaggerated.


---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-29 Thread Albert Chin
On Mon, Jan 29, 2007 at 11:17:05AM -0800, Jeffery Malloch wrote:
 From what I can tell from this thread, ZFS is VERY fussy about
 managing writes, reads and failures.  It wants to be bit perfect.  So
 if you use the hardware that comes with a given solution (in my case
 an Engenio 6994) to manage failures, you risk a) bad writes that
 don't get picked up due to corruption from write cache to disk, and b)
 failures due to data changes that ZFS is unaware of that the
 hardware imposes when it tries to fix itself.
 
 So now I have a $70K+ lump that's useless for what it was designed
 for.  I should have spent $20K on a JBOD.  But since I didn't do
 that, it sounds like a traditional model works best (ie. UFS et al)
 for the type of hardware I have.  No sense paying for something and
 not using it.  And by using ZFS just as a method for ease of file
 system growth and management I risk much more corruption.

Well, ZFS with HW RAID makes sense in some cases. However, it seems
that if you are unwilling to lose 50% disk space to RAID 10 or two
mirrored HW RAID arrays, you either use RAID 0 on the array with ZFS
RAIDZ/RAIDZ2 on top of that or a JBOD with ZFS RAIDZ/RAIDZ2 on top of
that.
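
(A sketch of the second option, with hypothetical device names: the array
exports its disks as plain LUNs, or as single-disk RAID-0 LUNs, and ZFS
provides the double-parity redundancy:)

# zpool create tank raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0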

-- 
albert chin ([EMAIL PROTECTED])
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-29 Thread Frank Cusack
On January 29, 2007 11:17:05 AM -0800 Jeffery Malloch 
[EMAIL PROTECTED] wrote:

Hi Guys,

SO...


From what I can tell from this thread, ZFS is VERY fussy about managing
writes, reads and failures.  It wants to be bit perfect.


It's funny to call that fussy.  All filesystems WANT to be bit perfect;
zfs actually does something to ensure it.


 So if you use
the hardware that comes with a given solution (in my case an Engenio
6994) to manage failures you risk a) bad writes that don't get picked up
due to corruption from write cache to disk


You would always have that problem, JBOD or RAID.  There are many places
data can get corrupted, not just in the RAID write cache.  zfs will correct
it, or at least detect it depending on your configuration.


b) failures due to data
changes that ZFS is unaware of that the hardware imposes when it tries
to fix itself.


If that happens, you will be lucky to have ZFS to fix it.  If the array
changes data, it is broken.  This is not the same thing as correcting data.


The other thing I haven't heard is why NOT to use ZFS.  Or people who
don't like it for some reason or another.


If you need per-user quotas, zfs might not be a good fit.  (In many cases
per-filesystem quotas can be used effectively though.)
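
(A sketch of the per-filesystem approach, with hypothetical names:)

# zfs create tank/home/alice
# zfs set quota=10g tank/home/alice    # caps that filesystem, not the user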

If you need NFS clients to traverse mount points on the server
(eg /home/foo), then this won't work yet.  Then again, does this work
with UFS either?  Seems to me it wouldn't.  The difference is that zfs
encourages you to create more filesystems.  But you don't have to.

If you have an application that is very highly tuned for a specific
filesystem (e.g. UFS with directio), you might not want to replace
it with zfs.

If you need incremental restore, you might need to stick with UFS.
(snapshots might be enough for you though)

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-29 Thread Marion Hakanson
Albert Chin said:
 Well, ZFS with HW RAID makes sense in some cases. However, it seems that if
 you are unwilling to lose 50% disk space to RAID 10 or two mirrored HW RAID
 arrays, you either use RAID 0 on the array with ZFS RAIDZ/RAIDZ2 on top of
 that or a JBOD with ZFS RAIDZ/RAIDZ2 on top of that. 

I've been re-evaluating our local decision on this question (how to lay out
ZFS on pre-existing RAID hardware).  In our case, the array does not allow
RAID-0 of any type, and we're unwilling to give up the expensive disk
space to a mirrored configuration.  In fact, in our last decision, we
came to the conclusion that we didn't want to layer RAID-Z on top of
HW RAID-5, thinking that the added loss of space is too high, given any
of the XXX layouts in Jonathan Edwards' chart:
 #   ZFS  ARRAY HW  CAPACITY   COMMENTS
 --  ---  --------  ---------  --------------------------------------
 . . .
 5   R1   1 x R5    (N-1)/2    parity and mirror on same drives (XXX)
 9   RZ   3 x R5    N-4        triple parity calculations (XXX)
 . . .
 10  RZ   1 x R5    N-2        double parity calculations (XXX)


So, we ended up (some months ago) deciding to go with only HW RAID-5,
using ZFS to stripe together large-ish LUN's made up of independent HW
RAID-5 groups.  We'd have no ZFS redundancy, but at least ZFS would catch
any corruption that may come along.  We can restore individual corrupted
files from tape backups (which we're already doing anyway), if necessary.

However, given the default behavior of ZFS (as of Solaris-10U3) is to
panic/halt when it encounters a corrupted block that it can't repair,
I'm re-thinking our options, weighing against the possibility of a
significant downtime caused by a single-block corruption.

Today I've been pondering a variant of #10 above, the variation being
to slice a RAID-5 volume into more than N LUN's, i.e. LUN's smaller than the
size of the individual disks that make up the HW R5 volume.  A larger
number of small LUN's results in less space given up to ZFS parity, which
is nice when overall disk space is important to us.

We're not expecting RAID-Z across these LUN's to make it possible to
survive failure of a whole disk; rather, we only need RAID-Z to repair
the occasional block corruption, in the hopes that this might head off the
need to restore a whole multi-TB pool.  We'll rely on the HW RAID-5 to
protect against whole-disk failure.

Just thinking out loud here.  Now I'm off to see what kind of performance
cost there is, comparing (with 400GB disks):
Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
8+1 RAID-Z on 9 244.2GB LUN's from a 6+1 HW RAID5 volume

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-28 Thread Robert Milkowski
Hello Anantha,

Friday, January 26, 2007, 5:06:46 PM, you wrote:

ANS All my feedback is based on Solaris 10 Update 2 (aka 06/06) and
ANS I've no comments on NFS. I strongly recommend that you use ZFS
ANS data redundancy (z1, z2, or mirror) and simply delegate the
ANS Engenio to stripe the data for performance.

Striping on an array and then doing redundancy with ZFS has at least
one drawback - what if one of the disks fails? You've got to replace the
bad disk, re-create the stripe on the array, and resilver in ZFS (or stay
with a hot spare). A lot of hassle.


-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-27 Thread James C. McPherson

Selim Daoud wrote:

it would be good to have real data and not only guesses or anecdotes

this story about wrong blocks being written by RAID controllers
sounds like the anti-terrorism propaganda we are living in: exaggerate
the facts to catch everyone's attention
It's going to take more than that to prove RAID ctrls have been doing
a bad job for the last 30 years
Let's come up with real stories with hard facts first


I have actual hard data and bitter experience (from support calls)
to back up the allegations that raid controllers can and do write
bad blocks.

No, I cannot and will not provide specifics - I signed an NDA
which expressly deals with confidentiality of customer information.


What I can say is that if we'd had ZFS to manage the filesystems
in question, not only would we have detected the problem much
earlier, but the flow-on effect to the end-users would have been
much more easily managed.


James C. McPherson
--
Solaris kernel software engineer, system admin and troubleshooter
  http://www.jmcp.homeunix.com/blog
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-27 Thread David Magda


On Jan 26, 2007, at 14:43, Gary Mills wrote:


Our Netapp does double-parity RAID.  In fact, the filesystem design is
remarkably similar to that of ZFS.  Wouldn't that also detect the
error?  I suppose it depends if the `wrong sector without notice'
error is repeated each time.  Or is it random?


On most (all?) other systems the parity only comes into effect when a  
drive fails. When all the drives are reporting OK most (all?) RAID  
systems don't use the parity data at all. ZFS is the first (only?)  
system that actively checks the data returned from disk, regardless  
of whether the drives are reporting they're okay or not.


I'm sure I'll be corrected if I'm wrong. :)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Anantha N. Srirama
I've used ZFS since July/August 2006 when Sol 10 Update 2 came out (the first 
release to integrate ZFS). I've used it on three servers (an E25K domain and 2 
E2900s) extensively; two of them are in production. I've had over 3TB of storage 
from an EMC SAN under ZFS management for no less than 6 months. Like your 
configuration, we've deferred data redundancy to the SAN. My observations are:

1. ZFS is stable to a very large extent. There are two known issues that I'm 
aware of:
  a. You can end up in an endless 'reboot' cycle when you've got a corrupt zpool. I 
came across this when I had data corruption due to an HBA mismatch with the EMC SAN. 
This mismatch injected data corruption in transit and the EMC faithfully wrote the 
bad data; upon reading this bad data, ZFS threw up all over the floor for that 
pool. There is a documented workaround to snap out of the 'reboot' cycle; I've 
not checked if this is fixed in the 11/06 Update 3 release.
  b. Your server will hang when one of the underlying disks disappears. In our 
case we had a T2000 running 11/06 with a mirrored zpool across two internal 
drives. When we pulled one of the drives abruptly, the server simply hung. I 
believe this is a known bug; is there a workaround?

2. When you have I/O operations that either request fsync or open files with 
the O_DSYNC option, coupled with high I/O load, ZFS will choke. It won't crash, 
but filesystem I/O runs like molasses on a cold morning.

All my feedback is based on Solaris 10 Update 2 (aka 06/06) and I've no 
comments on NFS. I strongly recommend that you use ZFS data redundancy (z1, z2, 
or mirror) and simply delegate the Engenio to stripe the data for performance.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Brian Hechinger
On Fri, Jan 26, 2007 at 08:06:46AM -0800, Anantha N. Srirama wrote:
 
   b. Your server will hang when one of the underlying disks disappears. In our 
 case we had a T2000 running 11/06 with a mirrored zpool across two 
 internal drives. When we pulled one of the drives abruptly, the server simply 
 hung. I believe this is a known bug; is there a workaround?

This was just covered here and it looks like the fix will make it into U4 (I think 
it's in snv_48?)

The workaround is to do a 'zpool offline' whenever possible before removing a 
disk.  Yes, this is not always possible (in the case of disk death), but it will 
help in some situations.
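
(A sketch of that workaround, with a hypothetical pool and device:)

# zpool offline tank c1t2d0    # quiesce the disk before pulling it
#   ... physically swap the disk ...
# zpool replace tank c1t2d0    # resilver onto the replacement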

I can't wait for U4.  :)

-brian
-- 
The reason I don't use Gnome: every single other window manager I know of is
very powerfully extensible, where you can switch actions to different mouse
buttons. Guess which one is not, because it might confuse the poor users?
Here's a hint: it's not the small and fast one.--Linus
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Akhilesh Mritunjai
Oh yep, I know that churning feeling in the stomach that there's got to be a 
GOTCHA somewhere... it can't be *that* simple!
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Gary Mills
On Fri, Jan 26, 2007 at 09:33:40AM -0800, Akhilesh Mritunjai wrote:
 ZFS Rule #0: You gotta have redundancy
 ZFS Rule #1: Redundancy shall be managed by zfs, and zfs alone.
 
 Whatever you have, junk it. Let ZFS manage mirroring and redundancy. ZFS 
 doesn't forgive even single bit errors!

How does this work in an environment with storage that's centrally-
managed and shared between many servers?  I'm putting together a new
IMAP server that will eventually use 3TB of space from our Netapp via
an iSCSI SAN.  The Netapp provides all of the disk management and
redundancy that I'll ever need.  The server will only see a virtual
disk (a LUN).  I want to use ZFS on that LUN because it's superior
to UFS in this application, even without the redundancy.  There's
no way to get the Netapp to behave like a JBOD.  Are you saying that
this configuration isn't going to work?

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Ed Gould

On Jan 26, 2007, at 9:42, Gary Mills wrote:

How does this work in an environment with storage that's centrally-
managed and shared between many servers?  I'm putting together a new
IMAP server that will eventually use 3TB of space from our Netapp via
an iSCSI SAN.  The Netapp provides all of the disk management and
redundancy that I'll ever need.  The server will only see a virtual
disk (a LUN).  I want to use ZFS on that LUN because it's superior
to UFS in this application, even without the redundancy.  There's
no way to get the Netapp to behave like a JBOD.  Are you saying that
this configuration isn't going to work?


It will work, but if the storage system corrupts the data, ZFS will be 
unable to correct it.  It will detect the error.


A number that I've been quoting, albeit without a good reference, comes 
from Jim Gray, who has been around the data-management industry for 
longer than I have (and I've been in this business since 1970); he's 
currently at Microsoft.  Jim says that the controller/drive subsystem 
writes data to the wrong sector of the drive without notice about once 
per drive per year.  In a 400-drive array, that's once a day.  ZFS will 
detect this error when the file is read (one of the blocks' checksum 
will not match).  But it can only correct the error if it manages the 
redundancy.
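
(Working that number through: 400 drives x 1 misdirected write per drive-year, 
divided by 365 days, comes to roughly 1.1 such events per day across the array.)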


I would suggest exporting two LUNs from your central storage and let 
ZFS mirror them.  You can get a wider range of space/performance 
tradeoffs if you give ZFS a JBOD, but that doesn't sound like an 
option.
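
(A sketch of that suggestion, with hypothetical iSCSI LUN device names:)

# zpool create mailpool mirror c2t1d0 c2t2d0   # two equal-sized LUNs from the filer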


--Ed

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Gary Mills
On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
 On Jan 26, 2007, at 9:42, Gary Mills wrote:
 How does this work in an environment with storage that's centrally-
 managed and shared between many servers?
 
 It will work, but if the storage system corrupts the data, ZFS will be 
 unable to correct it.  It will detect the error.
 
 A number that I've been quoting, albeit without a good reference, comes 
 from Jim Gray, who has been around the data-management industry for 
 longer than I have (and I've been in this business since 1970); he's 
 currently at Microsoft.  Jim says that the controller/drive subsystem 
 writes data to the wrong sector of the drive without notice about once 
 per drive per year.  In a 400-drive array, that's once a day.  ZFS will 
 detect this error when the file is read (one of the blocks' checksum 
 will not match).  But it can only correct the error if it manages the 
 redundancy.

Our Netapp does double-parity RAID.  In fact, the filesystem design is
remarkably similar to that of ZFS.  Wouldn't that also detect the
error?  I suppose it depends if the `wrong sector without notice'
error is repeated each time.  Or is it random?

 I would suggest exporting two LUNs from your central storage and let 
 ZFS mirror them.  You can get a wider range of space/performance 
 tradeoffs if you give ZFS a JBOD, but that doesn't sound like an 
 option.

That would double the amount of disk that we'd require.  I am actually
planning on using two iSCSI LUNs and letting ZFS stripe across them.
When we need to expand the ZFS pool, I'd like to just expand the two
LUNs on the Netapp.  If ZFS won't accommodate that, I can just add a
couple more LUNs.  This is all convenient and easily manageable.
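
(A sketch of the add-more-LUNs path, with hypothetical device names; zpool add 
grows the stripe by appending new top-level vdevs:)

# zpool add imappool c3t3d0 c3t4d0   # two more iSCSI LUNs appended to the stripe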

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Richard Elling

Gary Mills wrote:

On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:

On Jan 26, 2007, at 9:42, Gary Mills wrote:

How does this work in an environment with storage that's centrally-
managed and shared between many servers?
It will work, but if the storage system corrupts the data, ZFS will be 
unable to correct it.  It will detect the error.


A number that I've been quoting, albeit without a good reference, comes 
from Jim Gray, who has been around the data-management industry for 
longer than I have (and I've been in this business since 1970); he's 
currently at Microsoft.  Jim says that the controller/drive subsystem 
writes data to the wrong sector of the drive without notice about once 
per drive per year.  In a 400-drive array, that's once a day.  ZFS will 
detect this error when the file is read (one of the blocks' checksum 
will not match).  But it can only correct the error if it manages the 
redundancy.


The quote from Jim seems to be related to the leaves of the tree (disks).
Anecdotally, now that we have ZFS at the trunk, we're seeing that the
branches are also corrupting data.  We've speculated that it would occur,
but now we can measure it, and it is non-zero.  See Anantha's post for
one such anecdote.


Our Netapp does double-parity RAID.  In fact, the filesystem design is
remarkably similar to that of ZFS.  Wouldn't that also detect the
error?  I suppose it depends if the `wrong sector without notice'
error is repeated each time.  Or is it random?


We're having a debate related to this, data would be appreciated :-)
Do you get small, random read performance equivalent to N-2 spindles
for an N-way double-parity volume?

I would suggest exporting two LUNs from your central storage and let 
ZFS mirror them.  You can get a wider range of space/performance 
tradeoffs if you give ZFS a JBOD, but that doesn't sound like an 
option.


That would double the amount of disk that we'd require.  I am actually
planning on using two iSCSI LUNs and letting ZFS stripe across them.
When we need to expand the ZFS pool, I'd like to just expand the two
LUNs on the Netapp.  If ZFS won't accommodate that, I can just add a
couple more LUNs.  This is all convenient and easily manageable.


Sounds reasonable to me :-)
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Torrey McMahon

Gary Mills wrote:

On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
  

On Jan 26, 2007, at 9:42, Gary Mills wrote:


How does this work in an environment with storage that's centrally-
managed and shared between many servers?
  
It will work, but if the storage system corrupts the data, ZFS will be 
unable to correct it.  It will detect the error.


A number that I've been quoting, albeit without a good reference, comes 
from Jim Gray, who has been around the data-management industry for 
longer than I have (and I've been in this business since 1970); he's 
currently at Microsoft.  Jim says that the controller/drive subsystem 
writes data to the wrong sector of the drive without notice about once 
per drive per year.  In a 400-drive array, that's once a day.  ZFS will 
detect this error when the file is read (one of the blocks' checksum 
will not match).  But it can only correct the error if it manages the 
redundancy.



Our Netapp does double-parity RAID.  In fact, the filesystem design is
remarkably similar to that of ZFS.  Wouldn't that also detect the
error?  I suppose it depends if the `wrong sector without notice'
error is repeated each time. 


If the wrong block is written by the controller then you're out of luck. 
The filesystem would read the incorrect block and ... who knows. That's 
why the ZFS checksums are important.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Wade . Stuart






[EMAIL PROTECTED] wrote on 01/26/2007 01:43:35 PM:

 On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
  On Jan 26, 2007, at 9:42, Gary Mills wrote:
  How does this work in an environment with storage that's centrally-
  managed and shared between many servers?
 
  It will work, but if the storage system corrupts the data, ZFS will be
  unable to correct it.  It will detect the error.
 
  A number that I've been quoting, albeit without a good reference, comes

  from Jim Gray, who has been around the data-management industry for
  longer than I have (and I've been in this business since 1970); he's
  currently at Microsoft.  Jim says that the controller/drive subsystem
  writes data to the wrong sector of the drive without notice about once
  per drive per year.  In a 400-drive array, that's once a day.  ZFS will

  detect this error when the file is read (one of the blocks' checksum
  will not match).  But it can only correct the error if it manages the
  redundancy.

 Our Netapp does double-parity RAID.  In fact, the filesystem design is
 remarkably similar to that of ZFS.  Wouldn't that also detect the
 error?  I suppose it depends if the `wrong sector without notice'
 error is repeated each time.  Or is it random?

I do not know; WAFL and other portions of the NetApp backend are never really
described in much technical detail. Even getting real IOPS numbers from
them seems to be a hassle: much magic, little meat.  To me, zfs has very
well defined behavior and methodology (you can even read the source to
verify specifics), and this allows you to _know_ what the weak points are.
NetApp, EMC and other disk vendors may have financial incentives for
allowing edge cases such as the write hole or bit rot (x errors per disk
are acceptable losses; after x errors, consider a disk-replacement
cost/benefit analysis; will customers actually know a bit is flipped?).
In EMC's case it is very common for a disk to have multiple read/write
errors before EMC will swap out the disk; they even use a substantial
portion of the disk for replacement and parity bits (outside of raid) so
they offset or postpone the replacement volume/costs on the customer.

The most detailed description of WAFL I was able to find last time I looked
was:
http://www.netapp.com/library/tr/3002.pdf



  I would suggest exporting two LUNs from your central storage and let
  ZFS mirror them.  You can get a wider range of space/performance
  tradeoffs if you give ZFS a JBOD, but that doesn't sound like an
  option.

 That would double the amount of disk that we'd require.  I am actually
 planning on using two iSCSI LUNs and letting ZFS stripe across them.
 When we need to expand the ZFS pool, I'd like to just expand the two
 LUNs on the Netapp.  If ZFS won't accommodate that, I can just add a
 couple more LUNs.  This is all convenient and easily manageable.

If you do have bit errors coming from the Netapp, zfs will find them but
will not be able to correct them in this case.



 --
 -Gary Mills--Unix Support--U of M Academic Computing and
Networking-
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Ed Gould

On Jan 26, 2007, at 12:13, Richard Elling wrote:

On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
A number that I've been quoting, albeit without a good reference, 
comes from Jim Gray, who has been around the data-management industry 
for longer than I have (and I've been in this business since 1970); 
he's currently at Microsoft.  Jim says that the controller/drive 
subsystem writes data to the wrong sector of the drive without notice 
about once per drive per year.  In a 400-drive array, that's once a 
day.  ZFS will detect this error when the file is read (one of the 
blocks' checksum will not match).  But it can only correct the error 
if it manages the redundancy.


The quote from Jim seems to be related to the leaves of the tree 
(disks).

Anecdotally, now that we have ZFS at the trunk, we're seeing that the
branches are also corrupting data.  We've speculated that it would 
occur,

but now we can measure it, and it is non-zero.  See Anantha's post for
one such anecdote.


Actually, Jim was referring to everything but the trunk.  He didn't 
specify where from the HBA to the drive the error actually occurs.  I 
don't think it really matters.  I saw him give a talk a few years ago 
at the Usenix FAST conference; that's where I got this information.


--Ed

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Dana H. Myers
Ed Gould wrote:
 On Jan 26, 2007, at 12:13, Richard Elling wrote:
 On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
 A number that I've been quoting, albeit without a good reference,
 comes from Jim Gray, who has been around the data-management industry
 for longer than I have (and I've been in this business since 1970);
 he's currently at Microsoft.  Jim says that the controller/drive
 subsystem writes data to the wrong sector of the drive without notice
 about once per drive per year.  In a 400-drive array, that's once a
 day.  ZFS will detect this error when the file is read (one of the
 blocks' checksum will not match).  But it can only correct the error
 if it manages the redundancy.

 Actually, Jim was referring to everything but the trunk.  He didn't
 specify where from the HBA to the drive the error actually occurs.  I
 don't think it really matters.  I saw him give a talk a few years ago at
 the Usenix FAST conference; that's where I got this information.

So this leaves me wondering how often the controller/drive subsystem
reads data from the wrong sector of the drive without notice; is it
symmetrical with respect to writing, and thus about once a drive/year,
or are there factors which change this?

Dana
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Torrey McMahon

Dana H. Myers wrote:

Ed Gould wrote:
  

On Jan 26, 2007, at 12:13, Richard Elling wrote:


On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
  

A number that I've been quoting, albeit without a good reference,
comes from Jim Gray, who has been around the data-management industry
for longer than I have (and I've been in this business since 1970);
he's currently at Microsoft.  Jim says that the controller/drive
subsystem writes data to the wrong sector of the drive without notice
about once per drive per year.  In a 400-drive array, that's once a
day.  ZFS will detect this error when the file is read (one of the
blocks' checksum will not match).  But it can only correct the error
if it manages the redundancy.



  

Actually, Jim was referring to everything but the trunk.  He didn't
specify where from the HBA to the drive the error actually occurs.  I
don't think it really matters.  I saw him give a talk a few years ago at
the Usenix FAST conference; that's where I got this information.



So this leaves me wondering how often the controller/drive subsystem
reads data from the wrong sector of the drive without notice; is it
symmetrical with respect to writing, and thus about once a drive/year,
or are there factors which change this?
  


It's not symmetrical. Oftentimes it's a fw bug. Other times a spurious event 
causes one block to be read/written instead of another one. (Alpha 
particles, anyone?)



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Ed Gould

On Jan 26, 2007, at 12:52, Dana H. Myers wrote:

So this leaves me wondering how often the controller/drive subsystem
reads data from the wrong sector of the drive without notice; is it
symmetrical with respect to writing, and thus about once a drive/year,
or are there factors which change this?


My guess is that it would be symmetric, but I don't really know.

--Ed

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Dana H. Myers
Torrey McMahon wrote:
 Dana H. Myers wrote:
 Ed Gould wrote:
  
 On Jan 26, 2007, at 12:13, Richard Elling wrote:

 On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
  
 A number that I've been quoting, albeit without a good reference,
 comes from Jim Gray, who has been around the data-management industry
 for longer than I have (and I've been in this business since 1970);
 he's currently at Microsoft.  Jim says that the controller/drive
 subsystem writes data to the wrong sector of the drive without notice
 about once per drive per year.  In a 400-drive array, that's once a
 day.  ZFS will detect this error when the file is read (one of the
 blocks' checksum will not match).  But it can only correct the error
 if it manages the redundancy.
 

  
 Actually, Jim was referring to everything but the trunk.  He didn't
 specify where from the HBA to the drive the error actually occurs.  I
 don't think it really matters.  I saw him give a talk a few years ago at
 the Usenix FAST conference; that's where I got this information.
 

 So this leaves me wondering how often the controller/drive subsystem
 reads data from the wrong sector of the drive without notice; is it
 symmetrical with respect to writing, and thus about once a drive/year,
 or are there factors which change this?
   
 
  It's not symmetrical. Oftentimes it's a fw bug. Other times a spurious event
  causes one block to be read/written instead of another one. (Alpha
  particles, anyone?)

I would tend to expect these spurious events to impact read and write
equally; more specifically, the chance of any one read or write being
mis-addressed is about the same.  Since, AFAIK, there are many more reads
from a disk typically than writes, this would seem to suggest that there
would be more mis-addressed reads in a drive/year than mis-addressed
writes.  Is this the reason for the asymmetry?

(I'm sure waving my hands here)

Dana
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Ed Gould

On Jan 26, 2007, at 13:16, Dana H. Myers wrote:

I would tend to expect these spurious events to impact read and write
equally; more specifically, the chance of any one read or write being
mis-addressed is about the same.  Since, AFAIK, there are many more 
reads
from a disk typically than writes, this would seem to suggest that 
there

would be more mis-addressed reads in a drive/year than mis-addressed
writes.  Is this the reason for the asymmetry?


Jim's once per drive per year number was not very precise.  I took it 
to be just one significant digit.  I don't recall if he distinguished 
reads from writes.


--Ed

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Selim Daoud

it would be good to have real data and not only guesses or anecdotes

this story about wrong blocks being written by RAID controllers
sounds like the anti-terrorism propaganda we are living in: exaggerate
the facts to catch everyone's attention
It's going to take more than that to prove RAID ctrls have been doing
a bad job for the last 30 years
Let's come up with real stories with hard facts first
s.

On 1/26/07, Ed Gould [EMAIL PROTECTED] wrote:

On Jan 26, 2007, at 12:52, Dana H. Myers wrote:
 So this leaves me wondering how often the controller/drive subsystem
 reads data from the wrong sector of the drive without notice; is it
 symmetrical with respect to writing, and thus about once a drive/year,
 or are there factors which change this?

My guess is that it would be symmetric, but I don't really know.

--Ed

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Ed Gould

On Jan 26, 2007, at 13:29, Selim Daoud wrote:

it would be good to have real data and not only guess ot anecdots


Yes, I agree.  I'm sorry I don't have the data that Jim presented at 
FAST, but he did present actual data.  Richard Elling (I believe it was 
Richard) has also posted some related data from ZFS experience to this 
list.


There is more than just anecdotal evidence for this.

--Ed

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


RE: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Paul Fisher
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Ed Gould
 Sent: Friday, January 26, 2007 3:38 PM
 
 Yes, I agree.  I'm sorry I don't have the data that Jim presented at 
 FAST, but he did present actual data.  Richard Elling (I believe it 
 was
 Richard) has also posted some related data from ZFS experience to this 
 list.

This seems to be from Jim and on point:

http://www.usenix.org/event/fast05/tech/gray.pdf


paul
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Torrey McMahon

Dana H. Myers wrote:

Torrey McMahon wrote:
  

Dana H. Myers wrote:


Ed Gould wrote:
 
  

On Jan 26, 2007, at 12:13, Richard Elling wrote:
   


On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
 
  

A number that I've been quoting, albeit without a good reference,
comes from Jim Gray, who has been around the data-management industry
for longer than I have (and I've been in this business since 1970);
he's currently at Microsoft.  Jim says that the controller/drive
subsystem writes data to the wrong sector of the drive without notice
about once per drive per year.  In a 400-drive array, that's once a
day.  ZFS will detect this error when the file is read (one of the
blocks' checksum will not match).  But it can only correct the error
if it manages the redundancy.


 
  

Actually, Jim was referring to everything but the trunk.  He didn't
specify where from the HBA to the drive the error actually occurs.  I
don't think it really matters.  I saw him give a talk a few years ago at
the Usenix FAST conference; that's where I got this information.



So this leaves me wondering how often the controller/drive subsystem
reads data from the wrong sector of the drive without notice; is it
symmetrical with respect to writing, and thus about once a drive/year,
or are there factors which change this?
  
  

It's not symmetrical. Oftentimes it's a fw bug; other times a spurious event
causes one block to be read/written instead of another one. (Alpha
particles anyone?)



I would tend to expect these spurious events to impact read and write
equally; more specifically, the chance of any one read or write being
mis-addressed is about the same.  Since, AFAIK, there are many more reads
from a disk typically than writes, this would seem to suggest that there
would be more mis-addressed reads in a drive/year than mis-addressed
writes.  Is this the reason for the asymmetry?

(I'm sure waving my hands here)


For the spurious events, yes, I would expect things to be impacted 
symmetrically when it comes to errors during reads and errors during 
writes. That is, if you could figure out what spurious event occurred. 
In most cases the spurious errors are caught only at read time and 
you're left wondering: Was it an incorrect read? Was the data written 
incorrectly? You end up throwing your hands up and saying, "Let's hope 
that doesn't happen again." It's much easier to unearth a fw bug in a 
particular disk drive operating under certain conditions and fix it.


Now that we're checksumming things I'd expect to find more errors, and 
hopefully be in a position to fix them, than we have in the past. We 
will also start getting customer complaints like, "We moved to ZFS and 
now we are seeing media errors more often. Why is ZFS broken?" This is 
similar to the StorADE issues we had in NWS (ahhh, the good old days), 
when we started doing a much better job of discovering issues and 
reporting them, where in the past we were blissfully silent. We used to 
have some data on that with nice graphs, but I can't find them lying 
around.
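
To make the point concrete, here is a minimal sketch of the kind of 
end-to-end check we're talking about (hypothetical Python, not ZFS's actual 
on-disk logic): the checksum is stored apart from the block it describes, so 
data that was silently dropped, garbled, or written to the wrong place shows 
up as a mismatch the next time it is read.

    import hashlib

    disk = {}        # toy "disk": block number -> data
    checksums = {}   # kept separately from the blocks they describe

    def write_block(blkno, data):
        disk[blkno] = data
        checksums[blkno] = hashlib.sha256(data).digest()

    def read_block(blkno):
        data = disk[blkno]
        if hashlib.sha256(data).digest() != checksums[blkno]:
            raise IOError("checksum mismatch on block %d" % blkno)
        return data

    write_block(7, b"A" * 512)
    write_block(8, b"B" * 512)
    disk[7] = b"C" * 512        # simulate a misdirected/phantom write

    read_block(8)               # fine
    try:
        read_block(7)           # the error surfaces here, at read time
    except IOError as e:
        print(e)

A filesystem that only trusts the drive's own error reporting never sees 
that last failure at all, which is exactly where the "more errors than we 
used to see" complaints will come from.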





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Gary Mills
On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
 
 A number that I've been quoting, albeit without a good reference, comes 
 from Jim Gray, who has been around the data-management industry for 
 longer than I have (and I've been in this business since 1970); he's 
 currently at Microsoft.  Jim says that the controller/drive subsystem 
 writes data to the wrong sector of the drive without notice about once 
 per drive per year.  In a 400-drive array, that's once a day.  ZFS will 
 detect this error when the file is read (one of the blocks' checksum 
 will not match).  But it can only correct the error if it manages the 
 redundancy.

My only qualification to enter this discussion is that I once wrote a
floppy disk format program for minix.  I recollect, however, that each
sector on the disk is accompanied by a block that contains the sector
address and a CRC.  In order to write to the wrong sector, both of
these items would have to be read incorrectly.  Otherwise, the
controller would never find the wrong sector.  Are we just talking
about a CRC failure here?  That would be random, but the frequency
of CRC errors would depend on the signal quality.
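
Something like this toy model is what I have in mind (hypothetical Python, 
not how any particular drive or controller actually formats its media): each 
sector carries its own address plus a CRC, and a read is only accepted if 
both check out.

    import zlib

    SECTOR_SIZE = 512

    def format_sector(addr, data):
        # Header carries the sector's own address; the CRC covers both.
        header = addr.to_bytes(4, "big")
        crc = zlib.crc32(header + data).to_bytes(4, "big")
        return header + crc + data

    def read_sector(raw, wanted_addr):
        header, crc, data = raw[:4], raw[4:8], raw[8:]
        if zlib.crc32(header + data).to_bytes(4, "big") != crc:
            raise IOError("CRC error")          # garbled sector
        found = int.from_bytes(header, "big")
        if found != wanted_addr:
            raise IOError("found sector %d, wanted %d" % (found, wanted_addr))
        return data

    raw = format_sector(1234, b"x" * SECTOR_SIZE)
    read_sector(raw, 1234)          # ok
    try:
        read_sector(raw, 1235)      # head landed on the wrong sector
    except IOError as e:
        print(e)

For the wrong sector to be accepted silently, both the address field and the 
CRC would have to look right at the same time.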

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Toby Thain


On 26-Jan-07, at 7:29 PM, Selim Daoud wrote:


it would be good to have real data and not only guesses or anecdotes

this story about wrong blocks being written by RAID controllers
sounds like the anti-terrorism propaganda we are living in: exaggerate
the facts to catch everyone's attention.
It's going to take more than that to prove RAID ctrls have been doing
a bad job for the last 30 years.


It does happen. Hard numbers are available if you look. This sounds a  
bit like the RAID expert I bumped into who just couldn't see the  
paradigm had shifted under him -- the implications of end to end.






Let's come up with real stories with hard facts first.



Related links:
https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html
http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf [A Fresh Look at the Reliability of Long-term Digital Storage, 2006]
http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term Digital Archiving: A Survey, 2006]
http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems, 2006]
http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector Faults and Reliability of Disk Arrays, 1997]


--T



On 1/26/07, Ed Gould [EMAIL PROTECTED] wrote:

On Jan 26, 2007, at 12:52, Dana H. Myers wrote:
 So this leaves me wondering how often the controller/drive subsystem
 reads data from the wrong sector of the drive without notice; is it
 symmetrical with respect to writing, and thus about once a drive/year,
 or are there factors which change this?

My guess is that it would be symmetric, but I don't really know.

--Ed

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Darren Dunham
 My only qualification to enter this discussion is that I once wrote a
 floppy disk format program for minix.  I recollect, however, that each
 sector on the disk is accompanied by a block that contains the sector
 address and a CRC.

You'd have to define the layer you're talking about.  I presume
something like this occurs between a dumb disk and an intelligent
controller, or even within the encoding parameters of a disk, but I
don't think it does between, say, a SCSI/FC controller and a disk.

So if the drive itself put the head in the wrong sector, maybe it could
figure that out.  But perhaps the SCSI controller had a bug and sent the
wrong address to the drive.  I don't think there's anything at that
layer that would notice (unless the application/file system is encoding
intent into the data).

Corrections about my assumption with SCSI/FC/ATA appreciated.
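
By "encoding intent into the data" I mean something along these lines (a 
hypothetical Python sketch, not any real filesystem's block format): if the 
block's own address travels inside the block contents, a layer above the 
controller can notice that it got back the wrong sector, no matter which 
layer misdirected the I/O.

    import struct

    BLOCK_SIZE = 4096
    HDR = struct.Struct(">QQ")   # (intended block number, generation/txg)

    def make_block(blkno, txg, payload):
        body = payload.ljust(BLOCK_SIZE - HDR.size, b"\0")
        return HDR.pack(blkno, txg) + body

    def check_block(raw, expected_blkno):
        blkno, txg = HDR.unpack_from(raw)
        if blkno != expected_blkno:
            raise IOError("asked for block %d, got block %d (misdirected I/O)"
                          % (expected_blkno, blkno))
        return raw[HDR.size:]

    blk = make_block(42, txg=1, payload=b"some file data")
    check_block(blk, 42)        # ok
    try:
        check_block(blk, 43)    # the stack delivered the wrong block
    except IOError as e:
        print(e)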

-- 
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant TAOShttp://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
  This line left intentionally blank to confuse you. 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Torrey McMahon

Toby Thain wrote:


On 26-Jan-07, at 7:29 PM, Selim Daoud wrote:


it would be good to have real data and not only guesses or anecdotes

this story about wrong blocks being written by RAID controllers
sounds like the anti-terrorism propaganda we are living in: exaggerate
the facts to catch everyone's attention.
It's going to take more than that to prove RAID ctrls have been doing
a bad job for the last 30 years.


It does happen. Hard numbers are available if you look. This sounds a 
bit like the RAID expert I bumped into who just couldn't see the 
paradigm had shifted under him -- the implications of end to end. 


It happens. As long as we look at the numbers in context and don't run 
around going, "Hey... have you seen these numbers? What have we been 
doing for the last 35 years!?!?" we're OK.




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-26 Thread Anton B. Rang
 1.  How stable is ZFS?

It's a new file system; there will be bugs.  It appears to be well-tested, 
though.  There are a few known issues; for instance, a write failure can panic 
the system under some circumstances.  UFS has known issues too.

 2.  Recommended config.  Above, I have a fairly
 simple setup.  In many of the examples the
 granularity is home directory level and when you have
 many many users that could get to be a bit of a
 nightmare administratively.

Do you need user quotas?  If so, you need a file system per user with ZFS.  
That may be an argument against it in some environments, but in my experience 
it tends to matter more in academic settings than in corporations.
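
Provisioning a filesystem per user is usually just a loop over the account 
list, though.  A sketch (assuming a hypothetical pool named "tank" with home 
directories under tank/home, and driving the standard zfs commands through 
subprocess):

    import subprocess

    POOL_HOME = "tank/home"                 # hypothetical dataset layout
    users = {"alice": "10G", "bob": "5G"}   # user -> quota

    for user, quota in sorted(users.items()):
        ds = "%s/%s" % (POOL_HOME, user)
        subprocess.check_call(["zfs", "create", ds])
        subprocess.check_call(["zfs", "set", "quota=%s" % quota, ds])
        subprocess.check_call(["zfs", "set", "sharenfs=on", ds])

The quota lives on the dataset itself, so there is no per-user quota inside a 
single shared filesystem the way there is with UFS.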

 4.  Since all data access is via NFS we are concerned
 that 32 bit systems (Mainly Linux and Windows via
 Samba) will not be able to access all the data areas
 of a 2TB+ zpool even if the zfs quota on a particular
 share is less then that.  Can anyone comment?

Not a problem.  NFS doesn't really deal with volumes, just files, so the 
offsets are always file-relative and the volume can be as large as desired.

Anton
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss