Re: [zfs-discuss] Subversion repository on ZFS

2008-08-28 Thread Thommy M. Malmström
Toby Thain wrote:
 On 27-Aug-08, at 5:47 PM, Ian Collins wrote:
 
 Tim writes:

 On Wed, Aug 27, 2008 at 3:29 PM, Ian Collins [EMAIL PROTECTED]  
 wrote:

 Does anyone have any tuning tips for a Subversion repository on  
 ZFS?  The
 repository will mainly be storing binary (MS Office documents).

 It looks like a vanilla, uncompressed file system is the best bet.

 I believe this is called sharepoint :D
 Don't mention that abomination!
 
 Amen.

Don't mention _that_ abomination!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross
Since somebody else has just posted about their entire system locking up when 
pulling a drive, I thought I'd raise this for discussion.

I think Ralf made a very good point in the other thread.  ZFS can guarantee 
data integrity; what it can't do is guarantee data availability.  The problem 
is, the way ZFS is marketed, people expect it to be able to do just that.

This turned into a longer thread than expected, so I'll start with what I'm 
asking for, and then attempt to explain my thinking.  I'm essentially asking 
for two features to improve the availability of ZFS pools:

- Isolation of storage drivers so that buggy drivers do not bring down the OS.

- ZFS timeouts to improve pool availability when no timely response is received 
from storage drivers.

And my reason for asking for these is that there are now many, many posts on 
here about people experiencing either total system lockup or ZFS lockup after 
removing a hot-swap drive, and indeed while some of them are using consumer 
hardware, others have reported problems with server-grade kit that definitely 
should be able to handle these errors:

Aug 2008:  AMD SB600 - System hang
 - http://www.opensolaris.org/jive/thread.jspa?threadID=70349
Aug 2008:  Supermicro SAT2-MV8 - System hang
 - http://www.opensolaris.org/jive/thread.jspa?messageID=271218
May 2008: Sun hardware - ZFS hang
 - http://opensolaris.org/jive/thread.jspa?messageID=240481
Feb 2008:  iSCSI - ZFS hang
 - http://www.opensolaris.org/jive/thread.jspa?messageID=206985
Oct 2007:  Supermicro SAT2-MV8 - system hang
 - http://www.opensolaris.org/jive/thread.jspa?messageID=166037
Sept 2007:  Fibre channel
 - http://opensolaris.org/jive/thread.jspa?messageID=151719
... etc

Now while the root cause of each of these may be slightly different, I feel it 
would still be good to address this if possible as it's going to affect the 
perception of ZFS as a reliable system.

The common factor in all of these is that either the Solaris driver hangs and 
locks the OS, or ZFS hangs and locks the pool.  Most of these are for hardware 
that should handle these failures fine (mine occurred for hardware that 
definitely works fine under Windows), so I'm wondering:  is there anything that 
can be done to prevent either type of lockup in these situations?

Firstly, for the OS, if a storage component (hardware or driver) fails for a 
non-essential part of the system, the entire OS should not hang.  I appreciate 
there isn't a lot you can do if the OS is using the same driver as its 
storage, but certainly in some of the cases above, the OS and the data are 
using different drivers, and I expect more examples of that could be found with 
a bit of work.  Is there any way storage drivers could be isolated such that 
the OS (and hence ZFS) can report a problem with that particular driver without 
hanging the entire system?

Please note:  I know work is being done on FMA to handle all kinds of bugs, I'm 
not talking about that.  It seems to me that FMA involves proper detection and 
reporting of bugs, which involves knowing in advance what the problems are and 
how to report them.  What I'm looking for is something much simpler, something 
that's able to keep the OS running when it encounters unexpected or unhandled 
behaviour from storage drivers or hardware.

It seems to me that one of the benefits of ZFS is working against it here.  
It's such a flexible system it's being used for many, many types of devices, 
and that means there are a whole host of drivers being used, and a lot of scope 
for bugs in those drivers.  I know that ultimately any driver issues will need 
to be sorted individually, but what I'm wondering is whether there's any 
possibility of putting some error checking code at a layer above the drivers in 
such a way that it's able to trap major problems without hanging the OS?  I.e., 
update ZFS/Solaris so they can handle storage-layer bugs gracefully without 
downing the entire system.

My second suggestion is to ask if ZFS can be made to handle unexpected events 
more gracefully.  In the past I've suggested that ZFS have a separate timeout 
so that a redundant pool can continue working even if one device is not 
responding, and I really think that would be worthwhile.  My idea is to have a 
WAITING status flag for drives, so that if one isn't responding quickly, ZFS 
can flag it as WAITING, and attempt to read or write the same data from 
elsewhere in the pool.  That would work alongside the existing failure modes, 
and would allow ZFS to handle hung drivers much more smoothly, preventing 
redundant pools hanging when a single drive fails.
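
To illustrate the idea (purely a hypothetical mock-up: no current build 
produces this output, and the WAITING state does not exist today), a redundant 
pool with one unresponsive drive might report something like:

  pool: tank
 state: DEGRADED
status: One or more devices is not responding and has been temporarily
        bypassed.  The pool remains available, with reduced redundancy.
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  WAITING      0     0     0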

The ZFS update I feel is particularly appropriate.  ZFS already uses 
checksumming since it doesn't trust drivers or hardware to always return the 
correct data.  But ZFS then trusts those same drivers and hardware absolutely 
when it comes to the availability of the pool.

I believe ZFS should apply the same tough standards to pool availability as it 
does to data integrity.  A bad checksum makes ZFS read the data from elsewhere, 
why shouldn't a timeout do the same thing?

[zfs-discuss] [Fwd: Re: Review for 6729208 Optimize macros in sys/byteorder.h (due Sept. 3)]

2008-08-28 Thread Darren J Moffat
Not the common case for ZFS, but a useful performance improvement
for when it does happen.  This is a result of some follow-on work to the 
byteswapping optimisation Dan has done for the crypto algorithms 
in OpenSolaris.

 Original Message 
Subject: Re: Review for 6729208 Optimize macros in sys/byteorder.h (due 
Sept. 3)
Date: Wed, 27 Aug 2008 11:56:23 -0700 (PDT)
From: Dan Anderson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

Here are some performance results from running find . -exec ls -l on 
separate ZFS filesystems created on x86 and sparc and exported/imported 
to amd64, em64t, and sun4u platforms.  This shows the performance gain from 
the optimized byteorder.h macros.

Percent savings, real time
ZFS filesystem created originally on:

Platform     x86    sparc
amd64         4%       3%
em64t         3%       4%
sun4u         4%       2%

Environment:
* Create 2 separate ZFS filesystems, each with 1024 directories of 32 files,
  one on x86 and one on sparc, and zpool export/import them to the other
  systems.
* Run this command on each ZFS filesystem:  find . -exec ls -l {} \; > /dev/null
* Run using NV97 with and without the fix for RFE 6729208 (byteorder.h macro
  optimization)
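
For reference, a sketch of how such a test tree can be generated and timed 
(the dataset name and the use of ptime are assumptions, not details from the 
original test; run it under bash or ksh):

  zfs create tank/bswaptest
  cd /tank/bswaptest
  # create 1024 directories of 32 files each
  i=0
  while [ $i -lt 1024 ]; do
      mkdir dir$i
      j=0
      while [ $j -lt 32 ]; do touch dir$i/file$j; j=$((j+1)); done
      i=$((i+1))
  done
  # time the metadata walk used for the comparison
  ptime find . -exec ls -l {} \; > /dev/null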

BTW, I still could use some code review comments:
http://dan.drydog.com/reviews/6729208-bswap3/

--
This message posted from opensolaris.org
___
crypto-discuss mailing list
[EMAIL PROTECTED]
http://mail.opensolaris.org/mailman/listinfo/crypto-discuss

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Will there be a GUI for ZFS ?

2008-08-28 Thread Klaus Bergius
I'll second the original questions, but would like to know specifically when we 
will see (or how to install) the ZFS admin gui for OpenSolaris 2008.05.
I installed 2008.05, then updated the system, so it is now snv_95. 
There are no smc* commands, there is no service 'webconsole' to be seen in svcs 
-a,
because: there is no SUNWzfsg package installed.
However, the SUNWzfsg package is also not in the pkg.opensolaris.org repository.

Any hint where to find the package? I would really love to have the zfs admin 
gui on my system.

-Klaus
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Will there be a GUI for ZFS ?

2008-08-28 Thread MC
There is no good ZFS gui.  Nothing that is actively maintained, anyway.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool import sees two pools

2008-08-28 Thread Chris Gerhard
I have a USB disk with a pool on it called removable. On one laptop zpool 
import removable works just fine but on another with the same disk attached it 
tells me there is more than one matching pool:

: sigma TS 6 $; pfexec zpool import removable
cannot import 'removable': more than one matching pool
import by numeric ID instead
: sigma TS 7 $; pfexec zpool import  
  pool: removable
id: 16711095403932498465
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, though
some features will not be available without an explicit 'zpool upgrade'.
config:

removable   ONLINE
  c3t0d0ONLINE

  pool: removable
id: 13348174994041916803
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

removable   FAULTED  corrupted data
  c3t0d0p0  ONLINE
: sigma TS 8 $;

What I find curious is that this only happens on one system. Any ideas?
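
For reference, the ONLINE copy of the pool can still be imported unambiguously 
by the numeric ID shown in the listing above (a workaround rather than an 
explanation):

  pfexec zpool import 16711095403932498465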
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS boot reservations

2008-08-28 Thread Ross
Hey folks,

Tim Foster just linked this bug to the zfs auto backup mailing list, and I 
wondered if anybody knew if the work being done on ZFS boot includes making use 
of ZFS reservations to ensure the boot filesystems always have enough free 
space?

http://defect.opensolaris.org/bz/show_bug.cgi?id=3132

Ross
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Will there be a GUI for ZFS ?

2008-08-28 Thread Tim
On Thu, Aug 28, 2008 at 3:47 AM, Klaus Bergius [EMAIL PROTECTED]wrote:

 I'll second the original questions, but would like to know specifically
 when we will see (or how to install) the ZFS admin gui for OpenSolaris
 2008.05.
 I installed 2008.05, then updated the system, so it is now snv_95.
 There are no smc* commands, there is no service 'webconsole' to be seen in
 svcs -a,
 because: there is no SUNWzfsg package installed.
 However, the SUNWzfsg package is also not in the 
 pkg.opensolaris.orgrepository.

 Any hint where to find the package? I would really love to have the zfs
 admin gui on my system.

 -Klaus


My personal conspiracy theory is it's part of project fishworks that is
still under wraps.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS hangs/freezes after disk failure,

2008-08-28 Thread James C. McPherson
Hi Todd,
sorry for the delay in responding, been head down rewriting
a utility for the last few days.


Todd H. Poole wrote:
 Howdy James,
 
 While responding to halstead's post (see below), I had to restart several
 times to complete some testing. I'm not sure if that's important to these
 commands or not, but I just wanted to put it out there anyway.
 
 A few commands that you could provide the output from
 include:


 (these two show any FMA-related telemetry)
 fmadm faulty
 fmdump -v
 
 This is the output from both commands:
 
 [EMAIL PROTECTED]:~# fmadm faulty
 --------------- ------------------------------------  -------------- --------
 TIME            EVENT-ID                              MSG-ID         SEVERITY
 --------------- ------------------------------------  -------------- --------
 Aug 27 01:07:08 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a  ZFS-8000-FD    Major
 
 Fault class : fault.fs.zfs.vdev.io
 Description : The number of I/O errors associated with a ZFS device exceeded
 acceptable levels.  Refer to 
 http://sun.com/msg/ZFS-8000-FD
  for more information.
 Response: The device has been offlined and marked as faulted.  An attempt
 will be made to activate a hot spare if available.
 Impact  : Fault tolerance of the pool may be compromised.
 Action  : Run 'zpool status -x' and replace the bad device.
 
 [EMAIL PROTECTED]:~# fmdump -v
 TIME                 UUID                                 SUNW-MSG-ID
 Aug 27 01:07:08.2040 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD
  100%  fault.fs.zfs.vdev.io
 
Problem in: zfs://pool=mediapool/vdev=bfaa3595c0bf719
   Affects: zfs://pool=mediapool/vdev=bfaa3595c0bf719
   FRU: -
  Location: -


In other emails in this thread you've mentioned the desire to
get an email (or some sort of notification) when Problems Happen(tm)
in your system, and the FMA framework is how we achieve that
in OpenSolaris.



# fmadm config
MODULE   VERSION STATUS  DESCRIPTION
cpumem-retire1.1 active  CPU/Memory Retire Agent
disk-transport   1.0 active  Disk Transport Agent
eft  1.16active  eft diagnosis engine
fabric-xlate 1.0 active  Fabric Ereport Translater
fmd-self-diagnosis   1.0 active  Fault Manager Self-Diagnosis
io-retire2.0 active  I/O Retire Agent
snmp-trapgen 1.0 active  SNMP Trap Generation Agent
sysevent-transport   1.0 active  SysEvent Transport Agent
syslog-msgs  1.0 active  Syslog Messaging Agent
zfs-diagnosis1.0 active  ZFS Diagnosis Engine
zfs-retire   1.0 active  ZFS Retire Agent


You'll notice that we've got an SNMP agent there... and you
can acquire a copy of the FMA mib from the Fault Management
community pages (http://opensolaris.org/os/community/fm and
http://opensolaris.org/os/community/fm/mib/).




 (this shows your storage controllers and what's
 connected to them) cfgadm -lav
 
 This is the output from cfgadm -lav
 
 [EMAIL PROTECTED]:~# cfgadm -lav
 Ap_Id  Receptacle   Occupant Condition  
 Information
 When Type Busy Phys_Id
 usb2/1 emptyunconfigured ok
 unavailable  unknown  n/devices/[EMAIL 
 PROTECTED],0/pci1458,[EMAIL PROTECTED]:1
 usb2/2 connectedconfigured   ok
 Mfg: Microsoft  Product: Microsoft 3-Button Mouse with IntelliEye(TM)
 NConfigs: 1  Config: 0  no cfg str descr
 unavailable  usb-mousen/devices/[EMAIL 
 PROTECTED],0/pci1458,[EMAIL PROTECTED]:2
 usb3/1 emptyunconfigured ok
[snip]
 usb7/2 emptyunconfigured ok
 unavailable  unknown  n/devices/[EMAIL 
 PROTECTED],0/pci1458,[EMAIL PROTECTED],1:2
 
 You'll notice that the only thing listed is my USB mouse... is that expected?

Yup. One of the artefacts of the cfgadm architecture. cfgadm(1m)
works by using plugins - usb, FC, SCSI, SATA, pci hotplug, InfiniBand...
but not IDE.

I think you also were wondering how to tell what controller
instances your disks were using in IDE mode - two basic ways
of achieving this:

/usr/bin/iostat -En

and

/usr/sbin/format

Your IDE disks will attach using the cmdk driver and show up like this:

c1d0
c1d1
c2d0
c2d1

In AHCI/SATA mode they'd show up as

c1t0d0
c1t1d0
c1t2d0
c1t3d0

or something similar, depending on how the bios and the actual
controllers sort themselves out.


 You'll also find messages in /var/adm/messages which
 might prove
 useful to review.
 
 If you really want, I can list the output from /var/adm/messages, but it
 doesn't seem to add anything new to what I've already copied and pasted.

No need - you've got them if you need them.

[snip]

 http://docs.sun.com/app/docs/coll/40.17 (manpages)
 http://docs.sun.com/app/docs/coll/47.23 (system admin 

Re: [zfs-discuss] zpool import sees two pools

2008-08-28 Thread Victor Latushkin
On 28.08.08 15:06, Chris Gerhard wrote:
 I have a USB disk with a pool on it called removable. On one laptop
 zpool import removable works just fine but on another with the same
 disk attached it tells me there is more than one matching pool:
 
 : sigma TS 6 $; pfexec zpool import removable
 cannot import 'removable': more than one matching pool
 import by numeric ID instead
 : sigma TS 7 $; pfexec zpool import  
   pool: removable
 id: 16711095403932498465
  state: ONLINE
 status: The pool is formatted using an older on-disk version.
 action: The pool can be imported using its name or numeric identifier, though
 some features will not be available without an explicit 'zpool 
 upgrade'.
 config:
 
 removable   ONLINE
   c3t0d0ONLINE
 
   pool: removable
 id: 13348174994041916803
  state: FAULTED
 status: The pool metadata is corrupted.
 action: The pool cannot be imported due to damaged devices or data.
see: http://www.sun.com/msg/ZFS-8000-72
 config:
 
 removable   FAULTED  corrupted data
   c3t0d0p0  ONLINE
 : sigma TS 8 $;
 
 What I find curious is that this only happens on one system. Any ideas?

What Solaris/ZFS versions are these systems running? it is a wild guess 
but may be there's some stale label with newer version which is 
recognized by one system and not recognized by another?

What does zdb -l say?

victor
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Pools 1+TB

2008-08-28 Thread Kenny
Bob,  Thanks for the reply.  Yes I did read your white paper and am using it!!  
Thanks again!!

I used zpool iostat -v and it didn't give the information as advertised...  see 
below


bash-3.00# zpool iostat -v

                                        capacity     operations    bandwidth
pool                                    used  avail   read  write   read  write
--------------------------------------  ----  -----  -----  -----  -----  -----
log_data                                147K  9.81G      0      0      0      4
  raidz1                                147K  9.81G      0      0      0      4
    c6t600A0B800049F93C030A48B3EA2Cd0      -      -      0      0      0     22
    c6t600A0B800049F93C030D48B3EAB6d0      -      -      0      0      0     22
    c6t600A0B800049F93C031C48B3EC76d0      -      -      0      0      0     22
    c6t600A0B800049F93C031F48B3ECA8d0      -      -      0      0      0     22
    c6t600A0B800049F93C030448B3CDEEd0      -      -      0      0      0     22
    c6t600A0B800049F93C030748B3E9F0d0      -      -      0      0      0     22
    c6t600A0B800049F93C031048B3EB44d0      -      -      0      0      0     22
    c6t600A0B800049F93C031348B3EB94d0      -      -      0      0      0     22
    c6t600A0B800049F93C031648B3EBE4d0      -      -      0      0      0     22
    c6t600A0B800049F93C031948B3EC28d0      -      -      0      0      0     22
    c6t600A0B800049F93C032248B3ECDEd0      -      -      0      0      0     22
--------------------------------------  ----  -----  -----  -----  -----  -----

(sorry but I can't get the horizontal format to set the columns correctly...)
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Pools 1+TB

2008-08-28 Thread Kenny
Tim,

Per your request...

df -h

bash-3.00# df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/md/dsk/d10         98G   4.2G    92G     5%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                    32G   1.4M    32G     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
/platform/SUNW,SPARC-Enterprise-T5220/lib/libc_psr/libc_psr_hwcap1.so.1
                        98G   4.2G    92G     5%    /platform/sun4v/lib/libc_psr.so.1
/platform/SUNW,SPARC-Enterprise-T5220/lib/sparcv9/libc_psr/libc_psr_hwcap1.so.1
                        98G   4.2G    92G     5%    /platform/sun4v/lib/sparcv9/libc_psr.so.1
fd                       0K     0K     0K     0%    /dev/fd
/dev/md/dsk/d50         19G   4.3G    15G    23%    /var
swap                   512M   112K   512M     1%    /tmp
swap                    32G    40K    32G     1%    /var/run
/dev/md/dsk/d30        9.6G   1.5G   8.1G    16%    /opt
/dev/md/dsk/d40        1.9G   142M   1.7G     8%    /export/home
/vol/dev/dsk/c0t0d0/fm540cd3
                       591M   591M     0K   100%    /cdrom/fm540cd3
log_data               8.8G    44K   8.8G     1%    /log_data



zpool status

bash-3.00# zpool status   
  pool: log_data
 state: ONLINE
 scrub: none requested
config:

NAME   STATE READ WRITE CKSUM
log_data   ONLINE   0 0 0
  raidz1   ONLINE   0 0 0
c6t600A0B800049F93C030A48B3EA2Cd0  ONLINE   0 0 0
c6t600A0B800049F93C030D48B3EAB6d0  ONLINE   0 0 0
c6t600A0B800049F93C031C48B3EC76d0  ONLINE   0 0 0
c6t600A0B800049F93C031F48B3ECA8d0  ONLINE   0 0 0
c6t600A0B800049F93C030448B3CDEEd0  ONLINE   0 0 0
c6t600A0B800049F93C030748B3E9F0d0  ONLINE   0 0 0
c6t600A0B800049F93C031048B3EB44d0  ONLINE   0 0 0
c6t600A0B800049F93C031348B3EB94d0  ONLINE   0 0 0
c6t600A0B800049F93C031648B3EBE4d0  ONLINE   0 0 0
c6t600A0B800049F93C031948B3EC28d0  ONLINE   0 0 0
c6t600A0B800049F93C032248B3ECDEd0  ONLINE   0 0 0

errors: No known data errors



format

bash-3.00# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
   0. c1t0d0 SUN146G cyl 14087 alt 2 hd 24 sec 848
  /[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL 
PROTECTED]/[EMAIL PROTECTED],0
   1. c1t1d0 SUN146G cyl 14087 alt 2 hd 24 sec 848
  /[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL 
PROTECTED]/[EMAIL PROTECTED],0
   2. c6t600A0B800049F93C030A48B3EA2Cd0 SUN-LCSM100_F-0670-931.01GB
  /scsi_vhci/[EMAIL PROTECTED]
   3. c6t600A0B800049F93C030D48B3EAB6d0 SUN-LCSM100_F-0670-931.01MB
  /scsi_vhci/[EMAIL PROTECTED]
   4. c6t600A0B800049F93C031C48B3EC76d0 SUN-LCSM100_F-0670-931.01MB
  /scsi_vhci/[EMAIL PROTECTED]
   5. c6t600A0B800049F93C031F48B3ECA8d0 SUN-LCSM100_F-0670-931.01GB
  /scsi_vhci/[EMAIL PROTECTED]
   6. c6t600A0B800049F93C030448B3CDEEd0 SUN-LCSM100_F-0670-931.01GB
  /scsi_vhci/[EMAIL PROTECTED]
   7. c6t600A0B800049F93C030748B3E9F0d0 SUN-LCSM100_F-0670-931.01GB
  /scsi_vhci/[EMAIL PROTECTED]
   8. c6t600A0B800049F93C031048B3EB44d0 SUN-LCSM100_F-0670-931.01MB
  /scsi_vhci/[EMAIL PROTECTED]
   9. c6t600A0B800049F93C031348B3EB94d0 SUN-LCSM100_F-0670-931.01GB
  /scsi_vhci/[EMAIL PROTECTED]
  10. c6t600A0B800049F93C031648B3EBE4d0 SUN-LCSM100_F-0670-931.01GB
  /scsi_vhci/[EMAIL PROTECTED]
  11. c6t600A0B800049F93C031948B3EC28d0 SUN-LCSM100_F-0670-931.01GB
  /scsi_vhci/[EMAIL PROTECTED]
  12. c6t600A0B800049F93C032248B3ECDEd0 SUN-LCSM100_F-0670-931.01GB
  /scsi_vhci/[EMAIL PROTECTED]
Specify disk (enter its number):
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import sees two pools

2008-08-28 Thread Chris Gerhard

Victor Latushkin wrote:

On 28.08.08 15:06, Chris Gerhard wrote:

I have a USB disk with a pool on it called removable. On one laptop
zpool import removable works just fine but on another with the same
disk attached it tells me there is more than one matching pool:

: sigma TS 6 $; pfexec zpool import removable
cannot import 'removable': more than one matching pool
import by numeric ID instead
: sigma TS 7 $; pfexec zpool importpool: removable
id: 16711095403932498465
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, 
though
some features will not be available without an explicit 'zpool 
upgrade'.

config:

removable   ONLINE
  c3t0d0ONLINE

  pool: removable
id: 13348174994041916803
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

removable   FAULTED  corrupted data
  c3t0d0p0  ONLINE
: sigma TS 8 $;

What I find curious is that this only happens on one system. Any ideas?


What Solaris/ZFS versions are these systems running? it is a wild guess 
but may be there's some stale label with newer version which is 
recognized by one system and not recognized by another?




Both are running snv_94. The system with the problem is nevada the 
system without the problem is OpenSolaris.




What does zdb -l say?


# zdb -l /dev/rdsk/c3t0d0p0

LABEL 0

failed to unpack label 0

LABEL 1

failed to unpack label 1

LABEL 2

version=1
name='removable'
state=1
txg=18676
pool_guid=13348174994041916803
top_guid=17964267360868847787
guid=17964267360868847787
vdev_tree
type='disk'
id=0
guid=17964267360868847787
path='/vol/dev/dsk/c5t0d0/unknown_format'
whole_disk=0
metaslab_array=13
metaslab_shift=30
ashift=9
asize=164691705856

LABEL 3

version=1
name='removable'
state=1
txg=18676
pool_guid=13348174994041916803
top_guid=17964267360868847787
guid=17964267360868847787
vdev_tree
type='disk'
id=0
guid=17964267360868847787
path='/vol/dev/dsk/c5t0d0/unknown_format'
whole_disk=0
metaslab_array=13
metaslab_shift=30
ashift=9
asize=164691705856
#


# zdb -l /dev/rdsk/c3t0d0

LABEL 0

failed to unpack label 0

LABEL 1

failed to unpack label 1

LABEL 2

version=1
name='removable'
state=1
txg=18676
pool_guid=13348174994041916803
top_guid=17964267360868847787
guid=17964267360868847787
vdev_tree
type='disk'
id=0
guid=17964267360868847787
path='/vol/dev/dsk/c5t0d0/unknown_format'
whole_disk=0
metaslab_array=13
metaslab_shift=30
ashift=9
asize=164691705856

LABEL 3

version=1
name='removable'
state=1
txg=18676
pool_guid=13348174994041916803
top_guid=17964267360868847787
guid=17964267360868847787
vdev_tree
type='disk'
id=0
guid=17964267360868847787
path='/vol/dev/dsk/c5t0d0/unknown_format'
whole_disk=0
metaslab_array=13
metaslab_shift=30
ashift=9
asize=164691705856
#

# zdb -l /dev/rdsk/c3t0d0s0

LABEL 0

version=10
name='removable'
state=1
txg=75874
pool_guid=16711095403932498465
hostid=696785690
hostname='sigma'
top_guid=18371933882888483558
guid=18371933882888483558
vdev_tree
type='disk'
id=0
guid=18371933882888483558
path='/dev/dsk/c3t0d0s0'
devid='id1,[EMAIL PROTECTED]/a'
phys_path='/[EMAIL PROTECTED],0/pci1028,[EMAIL PROTECTED],7/[EMAIL 
PROTECTED]/[EMAIL PROTECTED],0:a'
whole_disk=1
metaslab_array=14
metaslab_shift=30
ashift=9
asize=164683055104
is_log=0
DTL=18

LABEL 1

version=10
name='removable'
state=1
txg=75874

[zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system

2008-08-28 Thread Richard Elling
It is rare to see this sort of CNN Moment attributed to file corruption.
http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought-Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] trouble with resilver after removing drive from 3510

2008-08-28 Thread Thomas Bleek

Hello all,

I tried to test the behavior of zpool recovering after removing one 
drive with strange results.


Setup SunFire V240/4Gig RAM, Solaris10u5, fully patched (last week)
1 3510 12x 140Gig FC Drives, 12 luns (every drive is one lun), (I don't 
want to use the RAID hardware, letting ZFS doing all.)

one pool with 5x2 disks and 2 spares
(details below)

After pulling drive 2 it took about two minutes to recognise the situation.
zpool status command output and also zpool iostat 1 command output is 
very slow. some lines are fast, then it stops for about 30-60 seconds, 
but they do complete after all.
the resilver has started but is VERY slow and shows strange data. The % 
done value is going up and down all the time. I don't think it is 
working correctly.
zpool iostat 1 (when it works) shows many reads but very few writes. I 
would have expected a mainly equal read and write rate reading from the 
intact mirror-side writing to the spare-disk.


Most of the time during resilver the machine is 99% idle, maximum 10% 
kernel load for some short times.


Now I have waited for more than one day but nothing is getting better.
I did not put a new drive in, I wanted to see one spare getting into use.

snip of zpool iostat 1

tank 337G   343G313  2  37.4M  19.3K
tank 337G   343G240  5  29.0M  38.6K
tank 337G   343G355  6  44.4M  45.0K
tank 337G   343G336  8  41.6M  57.9K
tank 337G   343G422  0  46.0M  0
tank 337G   343G415 10  49.4M  70.8K
tank 337G   343G358  0  43.3M  0
tank 337G   343G340 10  42.6M  70.8K
tank 337G   343G323  5  38.1M  38.6K
tank 337G   343G315  0  35.0M  0
tank 337G   343G336  0  40.0M  6.43K
tank 337G   343G388 10  46.8M  70.8K
tank 337G   343G351  4  43.9M  32.2K
tank 337G   343G  5  5   620K   285K

nothing useful (at least for me) in messages. After grep -v of both of these 
lines:

date+time nftp scsi: [ID 107833 kern.warning] WARNING: 
/[EMAIL PROTECTED],70/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0/[EMAIL PROTECTED],1 (ssd48):
date+time nftp    drive offline

only these entries remain:
Aug 27 13:04:22 nftp    i/o to invalid geometry
Aug 27 13:04:32 nftp    i/o to invalid geometry
Aug 27 13:04:37 nftp    i/o to invalid geometry
Aug 27 13:04:37 nftp    i/o to invalid geometry
Aug 27 13:04:47 nftp    i/o to invalid geometry
Aug 27 13:04:52 nftp    i/o to invalid geometry
Aug 27 13:05:23 nftp fmd: [ID 441519 daemon.error] SUNW-MSG-ID: 
ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major

Aug 27 13:05:23 nftp EVENT-TIME: Wed Aug 27 13:05:22 CEST 2008
Aug 27 13:05:23 nftp PLATFORM: SUNW,Sun-Fire-V240, CSN: -, HOSTNAME: nftp
Aug 27 13:05:23 nftp SOURCE: zfs-diagnosis, REV: 1.0
Aug 27 13:05:23 nftp EVENT-ID: ea01afff-c58e-6b32-e345-81da8bf43146
Aug 27 13:05:23 nftp DESC: A ZFS device failed.  Refer to 
http://sun.com/msg/ZFS-8000-D3 for more information.

Aug 27 13:05:23 nftp AUTO-RESPONSE: No automated response will occur.
Aug 27 13:05:23 nftp IMPACT: Fault tolerance of the pool may be compromised.
Aug 27 13:05:23 nftp REC-ACTION: Run 'zpool status -x' and replace the 
bad device.




uname -a
SunOS nftp 5.10 Generic_137111-04 sun4u sparc SUNW,Sun-Fire-V240


before pulling drive:

sccli show disk
Ch Id  Size   Speed  LD Status IDs  
Rev 

2(3)   0  136.73GB   200MB  ld0ONLINE SEAGATE ST314680FSUN146G 
0407

  S/N 3HY602V37412
 WWNN 200C505EB811
2(3)   1  136.73GB   200MB  ld1ONLINE SEAGATE ST314680FSUN146G 
0407

  S/N 3HY61JX47412
 WWNN 200C505EB885
2(3)   2  136.73GB   200MB  ld2ONLINE SEAGATE ST3146807FC  
0006

  S/N 3HY62EGZ7443
 WWNN 200C50D76130
2(3)   3  136.73GB   200MB  ld3ONLINE SEAGATE ST314680FSUN146G 
0407

  S/N 3HY61JKG7411
 WWNN 200C505EB815
2(3)   4  136.73GB   200MB  ld4ONLINE SEAGATE ST314680FSUN146G 
0407

  S/N 3HY60YHX7410
 WWNN 200C505EBCBB
2(3)   5  136.73GB   200MB  ld5ONLINE SEAGATE ST314680FSUN146G 
0407

  S/N 3HY61FQ07412
 WWNN 200C505E98B9
2(3)   6  136.73GB   200MB  ld6 

Re: [zfs-discuss] ZFS Pools 1+TB

2008-08-28 Thread Daniel Rock
Kenny schrieb:
2. c6t600A0B800049F93C030A48B3EA2Cd0 SUN-LCSM100_F-0670-931.01GB
   /scsi_vhci/[EMAIL PROTECTED]
3. c6t600A0B800049F93C030D48B3EAB6d0 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]

Disk 2: 931GB
Disk 3: 931MB

Do you see the difference?



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system

2008-08-28 Thread Gary Mills
On Thu, Aug 28, 2008 at 06:11:06AM -0700, Richard Elling wrote:
 It is rare to see this sort of CNN Moment attributed to file corruption.
 http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought-Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4

`file corruption' takes the blame all the time, in my experience, but
what does it mean?  It likely has nothing to do with the filesystem.
Probably an application wrote incorrect information into a file.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] liveupgrade ufs root - zfs ?

2008-08-28 Thread Paul Floyd
Hi

On my opensolaris machine I currently have SXCEs 95 and 94 in two BEs. The same 
fdisk partition contains /export/home and swap. In a separate fdisk partition 
on another disk I have a ZFS pool.

Does anyone have a pointer to a howto for doing a liveupgrade such that I can 
convert the SXCE 94 UFS BE to ZFS (and liveupgrade to SXCE 96 while I'm at it) 
if this is possible? Searching with google shows a lot of blogs that describe 
the early problems that existed when ZFS was first available (ON 90 or so).

A+
Paul
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Pools 1+TB

2008-08-28 Thread Kyle McDonald
Daniel Rock wrote:

 Kenny schrieb:
 2. c6t600A0B800049F93C030A48B3EA2Cd0 
 SUN-LCSM100_F-0670-931.01GB
/scsi_vhci/[EMAIL PROTECTED]
 3. c6t600A0B800049F93C030D48B3EAB6d0 
 SUN-LCSM100_F-0670-931.01MB
/scsi_vhci/[EMAIL PROTECTED]

 Disk 2: 931GB
 Disk 3: 931MB

 Do you see the difference?

Not just disk 3:

 AVAILABLE DISK SELECTIONS:
3. c6t600A0B800049F93C030D48B3EAB6d0 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]
4. c6t600A0B800049F93C031C48B3EC76d0 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]
8. c6t600A0B800049F93C031048B3EB44d0 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]
   
This all makes sense now, since a RAIDZ (or RAIDZ2) vdev can only be as 
big as its *smallest* component device allows.
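
To put rough numbers on it, using the sizes above: with the undersized LUNs in 
the set, every member of the 11-way raidz1 is limited to 931.01 MB, so the raw 
pool is roughly 11 x 931 MB, about 10 GB (the 9.81G that zpool iostat 
reported).  The usable space after one disk's worth of parity is about 10/11 
of that, roughly 8.9 GB (the 8.8G that df showed for /log_data).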

   -Kyle



 Daniel
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Richard Elling
Miles Nordin wrote:
 re Indeed.  Intuitively, the AFR and population is more easily
 re grokked by the masses.

 It's nothing to do with masses.  There's an error in your math.  It's
 not right under any circumstance.
   

There is no error in my math.  I presented a failure rate for a time 
interval,
you presented a probability of failure over a time interval.  The two are
both correct, but say different things.  Mathematically, an AFR > 100%
is quite possible and quite common.  A probability of failure > 100% (1.0)
is not.  In my experience, failure rates described as annualized failure
rates (AFR) are more intuitive than their mathematically equivalent
counterpart: MTBF.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] liveupgrade ufs root - zfs ?

2008-08-28 Thread Mark J Musante
On Thu, 28 Aug 2008, Paul Floyd wrote:

 Does anyone have a pointer to a howto for doing a liveupgrade such that 
 I can convert the SXCE 94 UFS BE to ZFS (and liveupgrade to SXCE 96 
 while I'm at it) if this is possible? Searching with google shows a lot 
 of blogs that describe the early problems that existed when ZFS was 
 first available (ON 90 or so).

It should be fairly straightforward to convert to ZFS: lucreate -p <pool> 
-n <new BE name>

Doing the upgrade to 96 should then be: luupgrade -u -n <new BE name> -s 
<source image>
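
For example, with names filled in (the pool name, BE name and media path are 
illustrative, substitute your own):

   # lucreate -p rpool -n zfsBE96
   # luupgrade -u -n zfsBE96 -s /cdrom/cdrom0
   # luactivate zfsBE96
   # init 6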


Regards,
markm
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system

2008-08-28 Thread Toby Thain

On 28-Aug-08, at 10:11 AM, Richard Elling wrote:

 It is rare to see this sort of CNN Moment attributed to file  
 corruption.
 http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought- 
 Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4


two 20-year-old redundant mainframe configurations ... that  
apparently are hanging on for dear life until reinforcements arrive  
in the form of a new, state-of-the-art system this winter.

And we all know that 'new, state-of-the-art systems' are silver  
bullets and good value for money.

What goes unremarked here is how the original system has coped  
reliably for decades of (one guesses) geometrically growing load.

--Toby



  -- richard

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Robert Milkowski
Hello Miles,

Wednesday, August 27, 2008, 10:51:49 PM, you wrote:

MN It's not really enough for me, but what's more the case doesn't match
MN what we were looking for: a device which ``never returns error codes,
MN always returns silently bad data.''  I asked for this because you said
MN ``However, not all devices return error codes which indicate
MN unrecoverable reads,'' which I think is wrong.  Rather, most devices
MN sometimes don't, not some devices always don't.



Please look for slides 23-27 at 
http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf


-- 
Best regards,
 Robert Milkowskimailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] liveupgrade ufs root - zfs ?

2008-08-28 Thread Robin Guo
Hi,

  I think LU 94-96 would be fine, if there's no zone in your system,
just simply do

  # cd cdrom/Solaris_11/Tools/Installers
  # liveupgrade20 --nodisplay
  # lucreate -c BE94 -n BE96  -p newpool   (The newpool must use an SMI label)
  # luupgrade -u -n BE96 -s cdrom
  # luactivate BE96
  # init 6
 
  Between snv_90 and snv_96 quite a lot of LU bugs were fixed, so I think you 
could complete the process successfully, if there is no special case.


Paul Floyd wrote:
 Hi

 On my opensolaris machine I currently have SXCEs 95 and 94 in two BEs. The 
 same fdisk partition contains /export/home and swap. In a separate fdisk 
 partition on another disk I have a ZFS pool.

 Does anyone have a pointer to a howto for doing a liveupgrade such that I can 
 convert the SXCE 94 UFS BE to ZFS (and liveupgrade to SXCE 96 while I'm at 
 it) if this is possible? Searching with google shows a lot of blogs that 
 describe the early problems that existed when ZFS was first available (ON 90 
 or so).

 A+
 Paul
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system

2008-08-28 Thread Toby Thain

On 28-Aug-08, at 10:54 AM, Toby Thain wrote:


 On 28-Aug-08, at 10:11 AM, Richard Elling wrote:

 It is rare to see this sort of CNN Moment attributed to file
 corruption.
 http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought-
 Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4


 two 20-year-old redundant mainframe configurations ... that
 apparently are hanging on for dear life until reinforcements arrive
 in the form of a new, state-of-the-art system this winter.

 And we all know that 'new, state-of-the-art systems' are silver
 bullets and good value for money.

 What goes unremarked here is how the original system has coped
 reliably for decades of (one guesses) geometrically growing load.

D'oh! It was remarked below the fold. I should have read page 2  
before writing.

The original architects seem to have done an excellent job, how many  
of us are designing systems expected to run for 2 decades? (Yes I  
know the cycles are shorter these days. If you bought a PDP-11 you  
were expected to keep it running forever with component level repairs.)

--Toby

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Richard Elling
Robert Milkowski wrote:
 Hello Miles,

 Wednesday, August 27, 2008, 10:51:49 PM, you wrote:

 MN It's not really enough for me, but what's more the case doesn't match
 MN what we were looking for: a device which ``never returns error codes,
 MN always returns silently bad data.''  I asked for this because you said
 MN ``However, not all devices return error codes which indicate
 MN unrecoverable reads,'' which I think is wrong.  Rather, most devices
 MN sometimes don't, not some devices always don't.



 Please look for slides 23-27 at 
 http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf

   

You really don't have to look very far to find this sort of thing.
The scar just below my left knee is directly attributed to a bugid
fixed in patch 106129-12.  Warning: the following link may
frighten experienced datacenter personnel, fortunately, the affected
device is long since EOL.
http://sunsolve.sun.com/search/document.do?assetkey=1-21-106129-12-1
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system

2008-08-28 Thread Wade . Stuart
[EMAIL PROTECTED] wrote on 08/28/2008 09:00:23 AM:


 On 28-Aug-08, at 10:54 AM, Toby Thain wrote:

 
  On 28-Aug-08, at 10:11 AM, Richard Elling wrote:
 
  It is rare to see this sort of CNN Moment attributed to file
  corruption.
  http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought-
  Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4
 
 
  two 20-year-old redundant mainframe configurations ... that
  apparently are hanging on for dear life until reinforcements arrive
  in the form of a new, state-of-the-art system this winter.
 
  And we all know that 'new, state-of-the-art systems' are silver
  bullets and good value for money.
 
  What goes unremarked here is how the original system has coped
  reliably for decades of (one guesses) geometrically growing load.

 D'oh! It was remarked below the fold. I should have read page 2
 before writing.

 The original architects seem to have done an excellent job, how many
 of us are designing systems expected to run for 2 decades? (Yes I
 know the cycles are shorter these days. If you bought a PDP-11 you
 were expected to keep it running forever with component level repairs.)

 --Toby


Then you also missed the all-important crescendo where eweek uses the last
quarter of a poorly written article to shill software that is completely
unrelated, tied to the story only by inference.

-Wade

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Pools 1+TB

2008-08-28 Thread Tim
exactly :)



On 8/28/08, Kyle McDonald [EMAIL PROTECTED] wrote:
 Daniel Rock wrote:

 Kenny schrieb:
 2. c6t600A0B800049F93C030A48B3EA2Cd0
 SUN-LCSM100_F-0670-931.01GB
/scsi_vhci/[EMAIL PROTECTED]
 3. c6t600A0B800049F93C030D48B3EAB6d0
 SUN-LCSM100_F-0670-931.01MB
/scsi_vhci/[EMAIL PROTECTED]

 Disk 2: 931GB
 Disk 3: 931MB

 Do you see the difference?

 Not just disk 3:

 AVAILABLE DISK SELECTIONS:
3. c6t600A0B800049F93C030D48B3EAB6d0
 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]
4. c6t600A0B800049F93C031C48B3EC76d0
 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]
8. c6t600A0B800049F93C031048B3EB44d0
 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]

 This all makes sense now, since a RAIDZ (or RAIDZ2) vdev can only be as
 big as it's *smallest* component device.

-Kyle



 Daniel
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Subversion repository on ZFS

2008-08-28 Thread Shawn Ferry



On Aug 27, 2008, at 4:38 PM, Tim wrote:


On Wed, Aug 27, 2008 at 3:29 PM, Ian Collins [EMAIL PROTECTED] wrote:

Does anyone have any tuning tips for a Subversion repository on  
ZFS?  The

repository will mainly be storing binary (MS Office documents).

It looks like a vanilla, uncompressed file system is the best bet.


I have an SVN on ZFS repository with ~75K relatively small files and  
few binaries. That is working well without any special tuning.
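
For reference, a plain dataset created with the defaults matches that kind of 
setup (the pool and path names here are just examples):

   # zfs create tank/svn
   # svnadmin create /tank/svn/repo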


Shawn


--
Shawn Ferry  shawn.ferry at sun.com
Senior Primary Systems Engineer
Sun Managed Operations





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] xVM GRUB entry incorrect with ZFS root

2008-08-28 Thread Trevor Watson
I just ran live-upgrade of my system from nv94/UFS to nv96/ZFS on x86. 

nv96/ZFS boots okay. However, I can't boot the Solaris xVM partition as the 
GRUB entry does not contain the necessary magic to tell grub to use ZFS instead 
of UFS.

Looking at the GRUB menu, it appears as though the flags -B $ZFS-BOOTFS are 
needed to be passed to the kernel. Is this something I can add to:  kernel$ 
/boot/$ISADIR/xen.gz or is there some other mechanism required for booting 
Solaris xVM from ZFS ?
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Ross wrote:

 I believe ZFS should apply the same tough standards to pool 
 availability as it does to data integrity.  A bad checksum makes ZFS 
 read the data from elsewhere, why shouldn't a timeout do the same 
 thing?

A problem is that for some devices, a five minute timeout is ok.  For 
others, there must be a problem if the device does not respond in a 
second or two.

If the system or device is simply overwelmed with work, then you would 
not want the system to go haywire and make the problems much worse.

Which of these do you prefer?

   o System waits substantial time for devices to (possibly) recover in
 order to ensure that subsequently written data has the least
 chance of being lost.

   o System immediately ignores slow devices and switches to
 non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
 mode.  When system is under intense load, it automatically
 switches to the may-lose-your-data mode.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Miles Nordin
 re == Richard Elling [EMAIL PROTECTED] writes:

re There is no error in my math.  I presented a failure rate for
re a time interval,

What is a ``failure rate for a time interval''?

AIUI, the failure rate for a time interval is 0.46% / yr, no matter how
many drives you have.


pgpeGoMP0F3vv.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Miles Nordin
 rm == Robert Milkowski [EMAIL PROTECTED] writes:

rm Please look for slides 23-27 at
rm http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf

yeah, ok, ONCE AGAIN, I never said that checksums are worthless.

relling: some drives don't return errors on unrecoverable read events.
carton: I doubt that.  Tell me a story about one that doesn't.

Your stories are about storage subsystems again, not drives.  Also
most or all of the slides aren't about unrecoverable read events.


pgpitPlQ325Eo.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] xVM GRUB entry incorrect with ZFS root

2008-08-28 Thread Malachi de Ælfweald
Take a look at my xVM/GRUB config:
http://malsserver.blogspot.com/2008/08/installing-xvm.html
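
The working entry ends up looking something like this (pool name and kernel
paths are illustrative; the important part is that -B $ZFS-BOOTFS goes on the
module$ line for unix, not on the xen.gz line):

title Solaris xVM
findroot (pool_rpool,0,a)
kernel$ /boot/$ISADIR/xen.gz
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive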

On Thu, Aug 28, 2008 at 9:25 AM, Trevor Watson [EMAIL PROTECTED]wrote:

 I just ran live-upgrade of my system from nv94/UFS to nv96/ZFS on x86.

 nv96/ZFS boots okay. However, I can't boot the Solaris xVM partition as the
 GRUB entry does not contain the necessary magic to tell grub to use ZFS
 instead of UFS.

 Looking at the GRUB menu, it appears as though the flags -B $ZFS-BOOTFS
 are needed to be passed to the kernel. Is this something I can add to:
  kernel$ /boot/$ISADIR/xen.gz or is there some other mechanism required for
 booting Solaris xVM from ZFS ?
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Pools 1+TB

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Kenny wrote:
   2. c6t600A0B800049F93C030A48B3EA2Cd0 SUN-LCSM100_F-0670-931.01GB
  /scsi_vhci/[EMAIL PROTECTED]

Good.

   3. c6t600A0B800049F93C030D48B3EAB6d0 SUN-LCSM100_F-0670-931.01MB
  /scsi_vhci/[EMAIL PROTECTED]

Oops!  Oops!  Oops!

It seems that some of your drives have the full 931.01GB exported 
while others have only 931.01MB exported.  The smallest device size 
will be used to size the vdev in your pool.  I sense a user error in 
the tedious CAM interface.  CAM is slow so you need to be patient and 
take extra care when configuring the 2540 volumes.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Pools 1+TB

2008-08-28 Thread Kenny
Ok so I knew it had to be operator headspace...  grin

I found my error and have fixed it in CAM.  Thanks to all for helping my 
education!!  

However I do have a question.  And pardon if it's a 101 type...

How did you determine from the format output the GB vs MB amount??

Where do you compute 931 GB vs 932 MB from this??

2. c6t600A0B800049F93C030A48B3EA2Cd0 /scsi_vhci/[EMAIL PROTECTED]

3. c6t600A0B800049F93C030D48B3EAB6d0
/scsi_vhci/[EMAIL PROTECTED]

Please educate me!!  grin

Thanks again!

--Kenny
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
Ross, thanks for the feedback.  A couple points here -

A lot of work went into improving the error handling around build 77 of
Nevada.  There are still problems today, but a number of the
complaints we've seen are on s10 software or older nevada builds that
didn't have these fixes.  Anything from the pre-2008 (or pre-s10u5)
timeframe should be taken with a grain of salt.

There is a fix in the immediate future to prevent I/O timeouts from
hanging other parts of the system - namely administrative commands and
other pool activity.  So I/O to that particular pool will hang, but
you'll still be able to run your favorite ZFS commands, and it won't
impact the ability of other pools to run.

We have some good ideas on how to improve the retry logic.  There is a
flag in Solaris, B_FAILFAST, that tells the drive to not try too hard
getting the data.  However, it can return failure when trying harder
would produce the correct results.  Currently, we try the first I/O with
B_FAILFAST, and if that fails immediately retry without the flag.  The
idea is to elevate the retry logic to a higher level, so when a read
from a side of a mirror fails with B_FAILFAST, instead of immediately
retrying the same device without the failfast flag, we push the error
higher up the stack, and issue another B_FAILFAST I/O to the other half
of the mirror.  Only if both fail with failfast do we try a more
thorough request (though with ditto blocks we may try another vdev
altogether). This should improve I/O error latency for a subset of
failure scenarios, and biasing reads away from degraded (but not faulty)
devices should also improve response time.  The tricky part is
incorporating this into the FMA diagnosis engine, as devices may fail
B_FAILFAST requests for a variety of non-fatal reasons.

Finally, imposing additional timeouts in ZFS is a bad idea.  ZFS is
designed to be a generic storage consumer.  It can be layered on top of
directly attached disks, SSDs, SAN devices, iSCSI targets, files, and
basically anything else.  As such, it doesn't have the necessary context
to know what constitutes a reasonable timeout.  This is explicitly
delegated to the underlying storage subsystem.  If a storage subsystem
is timing out for excessive periods of time when B_FAILFAST is set, then
that's a bug in the storage subsystem, and working around it in ZFS with
yet another set of tunables is not practical.  It will be interesting to
see if this is an issue after the retry logic is modified as described
above.

Hope that helps,

- Eric

On Thu, Aug 28, 2008 at 01:08:26AM -0700, Ross wrote:
 Since somebody else has just posted about their entire system locking up when 
 pulling a drive, I thought I'd raise this for discussion.
 
 I think Ralf made a very good point in the other thread.  ZFS can guarantee 
 data integrity, what it can't do is guarantee data availability.  The problem 
 is, the way ZFS is marketed people expect it to be able to do just that.
 
 This turned into a longer thread than expected, so I'll start with what I'm 
 asking for, and then attempt to explain my thinking.  I'm essentially asking 
 for two features to improve the availability of ZFS pools:
 
 - Isolation of storage drivers so that buggy drivers do not bring down the OS.
 
 - ZFS timeouts to improve pool availability when no timely response is 
 received from storage drivers.
 
 And my reasons for asking for these is that there are now many, many posts on 
 here about people experiencing either total system lockup or ZFS lockup after 
 removing a hot swap drive, and indeed while some of them are using consumer 
 hardware, others have reported problems with server grade kit that definately 
 should be able to handle these errors:
 
 Aug 2008:  AMD SB600 - System hang
  - http://www.opensolaris.org/jive/thread.jspa?threadID=70349
 Aug 2008:  Supermicro SAT2-MV8 - System hang
  - http://www.opensolaris.org/jive/thread.jspa?messageID=271218
 May 2008: Sun hardware - ZFS hang
  - http://opensolaris.org/jive/thread.jspa?messageID=240481
 Feb 2008:  iSCSI - ZFS hang
  - http://www.opensolaris.org/jive/thread.jspa?messageID=206985
 Oct 2007:  Supermicro SAT2-MV8 - system hang
  - http://www.opensolaris.org/jive/thread.jspa?messageID=166037
 Sept 2007:  Fibre channel
  - http://opensolaris.org/jive/thread.jspa?messageID=151719
 ... etc
 
 Now while the root cause of each of these may be slightly different, I feel 
 it would still be good to address this if possible as it's going to affect 
 the perception of ZFS as a reliable system.
 
 The common factor in all of these is that either the solaris driver hangs and 
 locks the OS, or ZFS hangs and locks the pool.  Most of these are for 
 hardware that should handle these failures fine (mine occured for hardware 
 that definately works fine under windows), so I'm wondering:  Is there 
 anything that can be done to prevent either type of lockup in these 
 situations?
 
 Firstly, for the OS, if a storage component (hardware or driver) 

Re: [zfs-discuss] ZFS Pools 1+TB

2008-08-28 Thread Kyle McDonald
Kenny wrote:

 How did you determine from the format output the GB vs MB amount??

 Where do you compute 931 GB vs 932 MB from this??

 2. c6t600A0B800049F93C030A48B3EA2Cd0 /scsi_vhci/[EMAIL PROTECTED]

 3. c6t600A0B800049F93C030D48B3EAB6d0
 /scsi_vhci/[EMAIL PROTECTED]

It's in the part you didn't cut and paste:

AVAILABLE DISK SELECTIONS:
3. c6t600A0B800049F93C030D48B3EAB6d0 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]
4. c6t600A0B800049F93C031C48B3EC76d0 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]
8. c6t600A0B800049F93C031048B3EB44d0 SUN-LCSM100_F-0670-931.01MB
   /scsi_vhci/[EMAIL PROTECTED]
   

Look at the label:

SUN-LCSM100_F-0670-931.01MB

The last field.


 Please educate me!!  grin

No problem. Things like this have happened to me from time to time.

   -Kyle

 Thanks again!

 --Kenny
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] xVM GRUB entry incorrect with ZFS root

2008-08-28 Thread John Levon
On Thu, Aug 28, 2008 at 09:25:14AM -0700, Trevor Watson wrote:

 Looking at the GRUB menu, it appears as though the flags -B $ZFS-BOOTFS are 
 needed to be passed to the kernel. Is this something I can add to:  kernel$ 
 /boot/$ISADIR/xen.gz or is there some other mechanism required for booting 
 Solaris xVM from ZFS ?

You need to add it to the next line ($module ...). This was a bug that's
now fixed in the latest LU.

regards
john
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Toby Thain wrote:

 two 20-year-old redundant mainframe configurations ... that
 apparently are hanging on for dear life until reinforcements arrive
 in the form of a new, state-of-the-art system this winter.

 And we all know that 'new, state-of-the-art systems' are silver
 bullets and good value for money.

The problem is that the replacement system is almost certain to be 
less reliable and cause problems for a while.  The old FORTRAN code 
either had to be ported or new code written from scratch.  If they 
used off-the-shelf software for the replacement then there is no way 
that the new system can be supported for 20 years.

 What goes unremarked here is how the original system has coped
 reliably for decades of (one guesses) geometrically growing load.

Fantastic engineering from a company which went defunct shortly after 
delivering the system.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system

2008-08-28 Thread Keith Bierman

On Aug 28, 2008, at 11:38 AM, Bob Friesenhahn wrote:

  The old FORTRAN code
 either had to be ported or new code written from scratch.

Assuming it WAS written in FORTRAN there is no reason to believe it  
wouldn't just compile with a modern Fortran compiler. I've often run  
codes originally written in the sixties without any significant  
changes (very old codes may have used the frequency statement,  
toggled front panel lights or sensed toggle switches ... but that's  
pretty rare).



-- 
Keith H. Bierman   [EMAIL PROTECTED]  | AIM kbiermank
5430 Nassau Circle East  |
Cherry Hills Village, CO 80113   | 303-997-2749
speaking for myself* Copyright 2008




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Jonathan Loran


Miles Nordin wrote:
 What is a ``failure rate for a time interval''?

   
Failure rate = Failures/unit time
Failure rate for a time interval = (Failures/unit time) * time

For example, if we have a failure rate: 

  Fr = 46% failures/month

Then the expectation value of a failure in one year:

  Fe = 46% failures/month  *  12 months = 5.52 failures
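
A quick sanity check of that arithmetic in Python (a sketch only; as
Miles points out in a later reply, the underlying statistic is a
per-drive probability rather than a rate, so treating 46% as
failures/month is itself an assumption):

    # Expected number of failures over an interval, *if* the quoted figure
    # really were a constant failure rate (see the caveat above).
    rate_per_month = 0.46              # failures per month, figure quoted above
    months = 12
    expected_failures = rate_per_month * months
    print(expected_failures)           # 5.52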


Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
 es == Eric Schrock [EMAIL PROTECTED] writes:

es Finally, imposing additional timeouts in ZFS is a bad idea.
es [...] As such, it doesn't have the necessary context to know
es what constitutes a reasonable timeout.

you're right in terms of fixed timeouts, but there's no reason it
can't compare the performance of redundant data sources, and if one
vdev performs an order of magnitude slower than another set of vdevs
with sufficient redundancy, stop issuing reads except scrubs/healing
to the underperformer (issue writes only), and pass an event to FMA.

ZFS can also compare the performance of a drive to itself over time,
and if the performance suddenly decreases, do the same.

The former case eliminates the need for the mirror policies in SVM,
which Ian requested a few hours ago for the situation that half the
mirror is a slow iSCSI target for geographic redundancy and half is
faster/local.  Some care would have to be taken for targets shared by
ZFS and some other initiator, but I'm not sure the care would really
be that difficult to take, or that the oscillations induced by failing
to take it would really be particularly harmful compared to
unsupervised contention for a device.

The latter quickly notices drives that have been pulled, or, for
Richard's ``overwhelmingly dominant'' case, drives which are stalled
for 30 seconds pending their report of an unrecovered read.

Developing meaningful performance statistics for drives and a tool for
displaying them would be useful for itself, not just for stopping
freezes and preventing a failing drive from degrading performance a
thousandfold.

Issuing reads to redundant devices is cheap compared to freezing.  The
policy with which it's done is highly tunable and should be fun to
tune and watch, and the consequence if the policy makes the wrong
choice isn't incredibly dire.


This B_FAILFAST architecture captures the situation really poorly.

First, it's not implementable in any serious way with near-line
drives, or really with any drives with which you're not intimately
familiar and in control of firmware/release-engineering, and perhaps
not with any drives period.  I suspect in practice it's more a
controller-level feature, about whether or not you'd like to distrust
the device's error report and start resetting busses and channels and
mucking everything up trying to recover from some kind of
``weirdness''.  It's not an answer to the known problem of drives
stalling for 30 seconds when they start to fail.

First and a half, when it's not implemented, the system degrades to
doubling your timeout pointlessly.  A driver-level block cache of
UNC's would probably have more value toward this
speed/read-aggressiveness tradeoff than the whole B_FAILFAST
architecture---just cache known unrecoverable read sectors, and refuse
to issue further I/O for them until a timeout of 3 - 10 minutes
passes.  I bet this would speed up most failures tremendously, and
without burdening upper layers with retry logic.
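
A toy sketch of the driver-level UNC cache described here: remember the
sectors that returned unrecoverable-read errors and refuse to reissue
I/O for them until a hold-down timer expires.  The structure and the
5-minute value are illustrative assumptions (the post suggests 3 - 10
minutes); nothing like this exists in the driver today.

    import time

    class UncCache:
        """Hold-down cache of LBAs that recently returned unrecovered reads."""
        def __init__(self, holddown_secs=300):     # 5 minutes, within 3 - 10 min
            self.holddown = holddown_secs
            self.bad = {}                          # lba -> time the error was seen

        def record_unc(self, lba):
            self.bad[lba] = time.monotonic()

        def should_fail_fast(self, lba):
            seen = self.bad.get(lba)
            if seen is None:
                return False
            if time.monotonic() - seen > self.holddown:
                del self.bad[lba]                  # hold-down expired; allow a retry
                return False
            return True                            # fail immediately, don't re-issue

A caller would check should_fail_fast() before queueing a read and return
an error straight away on a hit, which is the speed/read-aggressiveness
tradeoff being argued for.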

Second, B_FAILFAST entertains the fantasy that I/O's are independent,
while what happens in practice is that the drive hits a UNC on one
I/O, and won't entertain any further I/O's no matter what flags the
request has on it or how many times you ``reset'' things.


Maybe you could try to rescue B_FAILFAST by putting clever statistics
into the driver to compare the drive's performance to recent past as I
suggested ZFS do, and admit no B_FAILFAST requests to queues of drives
that have suddenly slowed down, just fail them immediately without
even trying.  I submit this queueing and statistic collection is
actually _better_ managed by ZFS than the driver because ZFS can
compare a whole floating-point statistic across a whole vdev, while
even a driver which is fancier than we ever dreamed, is still playing
poker with only 1 bit of input ``I'll call,'' or ``I'll fold.''  ZFS
can see all the cards and get better results while being stupider and
requiring less clever poker-guessing than would be required by a
hypothetical driver B_FAILFAST implementation that actually worked.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system

2008-08-28 Thread Tim
On Thu, Aug 28, 2008 at 12:38 PM, Bob Friesenhahn 
[EMAIL PROTECTED] wrote:

 On Thu, 28 Aug 2008, Toby Thain wrote:

  What goes unremarked here is how the original system has coped
  reliably for decades of (one guesses) geometrically growing load.

 Fantastic engineering from a company which went defunct shortly after
 delivering the system.



And let this be a lesson to all of you not to write code that is too good.
If you can't sell an update (patch) every 6 months, you'll be out of
business as well :D

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Miles Nordin
 jl == Jonathan Loran [EMAIL PROTECTED] writes:

jl   Fe = 46% failures/month * 12 months = 5.52 failures

the original statistic wasn't of this kind.  It was ``likelihood a
single drive will experience one or more failures within 12 months''.

so, you could say, ``If I have a thousand drives, about 4.66 of those
drives will silently-corrupt at least once within 12 months.''  It is
0.466% no matter how many drives you have.  

And it's 4.66 drives, not 4.66 corruptions.  The estimated number of
corruptions is higher because some drives will corrupt twice, or
thousands of times.  It's not a BER, so you can't just add it like
Richard did.

If the original statistic in the paper were of the kind you're talking
about, it would be larger than 0.466%.  I'm not sure it would capture
the situation well, though.  I think you'd want to talk about bits of
recoverable data after one year, not corruption ``events'', and this
is not really measured well by the type of telemetry NetApp has.  If
it were, though, it would still be the same size number no matter how
many drives you had.

The 37% I gave was ``one or more within a population of 100 drives
silently corrupts within 12 months.''  The 46% Richard gave has no
meaning, and doesn't mean what you just said.  The only statistic
under discussion which (a) gets intimidatingly large as you increase
the number of drives, and (b) is a ratio rather than, say, an absolute
number of bits, is the one I gave.
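
For concreteness, a minimal sketch of the binomial calculation behind
those two figures, using the numbers quoted in the thread (0.466% per
drive per year, a population of 100 drives) rather than anything
re-derived here:

    # Probability that at least one drive in a population of n silently
    # corrupts within 12 months, given a per-drive probability p.
    p = 0.00466                        # 0.466% per drive per year
    n = 100
    p_at_least_one = 1 - (1 - p) ** n
    print(round(p_at_least_one, 2))    # ~0.37, i.e. the 37% figure above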


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
On Thu, Aug 28, 2008 at 02:17:08PM -0400, Miles Nordin wrote:
 
 you're right in terms of fixed timeouts, but there's no reason it
 can't compare the performance of redundant data sources, and if one
 vdev performs an order of magnitude slower than another set of vdevs
 with sufficient redundancy, stop issuing reads except scrubs/healing
 to the underperformer (issue writes only), and pass an event to FMA.

Yep, latency would be a useful metric to add to mirroring choices.
The current logic is rather naive (round-robin) and could easily be
enhanced.

Making diagnoses based on this is much trickier, particularly at the ZFS
level.  A better option would be to leverage the SCSI FMA work going on
to do a more intimate diagnosis at the scsa level.

Also, the problem you are trying to solve - timing out the first I/O
that takes a long time - is not captured well by the type of hysteresis
you would need to perform in order to do this diagnosis.  It certainly
can be done, but is much better suited to diagnosing a failing drive over
time, not aborting a transaction in response to immediate failure.

 This B_FAILFAST architecture captures the situation really poorly.

I don't think you understand how this works.  Imagine two I/Os, just
with different sd timeouts and retry logic - that's B_FAILFAST.  It's
quite simple, and independent of any hardware implementation.

- Eric

--
Eric Schrock, Fishworks            http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Miles Nordin wrote:

 you're right in terms of fixed timeouts, but there's no reason it
 can't compare the performance of redundant data sources, and if one
 vdev performs an order of magnitude slower than another set of vdevs
 with sufficient redundancy, stop issuing reads except scrubs/healing
 to the underperformer (issue writes only), and pass an event to FMA.

You are saying that I can't split my mirrors between a local disk in 
Dallas and a remote disk in New York accessed via iSCSI?  Why don't 
you want me to be able to do that?

ZFS already backs off from writing to slow vdevs.

 ZFS can also compare the performance of a drive to itself over time,
 and if the performance suddenly decreases, do the same.

While this may be useful for reads, I would hate to disable redundancy 
just because a device is currently slow.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross Smith

Hi guys,

Bob, my thought was to have this timeout as something that can be optionally 
set by the administrator on a per pool basis.  I'll admit I was mainly thinking 
about reads and hadn't considered the write scenario, but even having thought 
about that it's still a feature I'd like.  After all, this would be a timeout 
set by the administrator based on the longest delay they can afford for that 
storage pool.

Personally, if a SATA disk wasn't responding to any requests after 2 seconds I 
really don't care if an error has been detected, as far as I'm concerned that 
disk is faulty.  I'd be quite happy for the array to drop to a degraded mode 
based on that and for writes to carry on with the rest of the array.

Eric, thanks for the extra details, they're very much appreciated.  It's good 
to hear you're working on this, and I love the idea of doing a B_FAILFAST read 
on both halves of the mirror.

I do have a question though.  From what you're saying, the response time can't 
be consistent across all hardware, so you're once again at the mercy of the 
storage drivers.  Do you know how long does B_FAILFAST takes to return a 
response on iSCSI?  If that's over 1-2 seconds I would still consider that too 
slow I'm afraid.

I understand that Sun in general don't want to add fault management to ZFS, but 
I don't see how this particular timeout does anything other than help ZFS when 
it's dealing with such a diverse range of media.  I agree that ZFS can't know 
itself what should be a valid timeout, but that's exactly why this needs to be 
an optional administrator set parameter.  The administrator of a storage array 
who wants to set this certainly knows what a valid timeout is for them, and 
these timeouts are likely to be several orders of magnitude larger than the 
standard response times.  I would configure very different values for my SATA 
drives as for my iSCSI connections, but in each case I would be happier knowing 
that ZFS has more of a chance of catching bad drivers or unexpected scenarios.

I very much doubt hardware RAID controllers would wait 3 minutes for a drive to 
return a response; they will have their own internal timeouts to know when a 
drive has failed, and while ZFS is dealing with very different hardware, I can't 
help but feel it should take the same approach to managing its drives.

However, that said, I'll be more than willing to test the new
B_FAILFAST logic on iSCSI once it's released.  Just let me know when
it's out.


Ross





 Date: Thu, 28 Aug 2008 11:29:21 -0500
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / 
 driver failure better
 
 On Thu, 28 Aug 2008, Ross wrote:
 
  I believe ZFS should apply the same tough standards to pool 
  availability as it does to data integrity.  A bad checksum makes ZFS 
  read the data from elsewhere, why shouldn't a timeout do the same 
  thing?
 
 A problem is that for some devices, a five minute timeout is ok.  For 
 others, there must be a problem if the device does not respond in a 
 second or two.
 
 If the system or device is simply overwhelmed with work, then you would 
 not want the system to go haywire and make the problems much worse.
 
 Which of these do you prefer?
 
o System waits substantial time for devices to (possibly) recover in
  order to ensure that subsequently written data has the least
  chance of being lost.
 
o System immediately ignores slow devices and switches to
  non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
  mode.  When system is under intense load, it automatically
  switches to the may-lose-your-data mode.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
 

_
Get Hotmail on your mobile from Vodafone 
http://clk.atdmt.com/UKM/go/107571435/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
 es == Eric Schrock [EMAIL PROTECTED] writes:

es I don't think you understand how this works.  Imagine two
es I/Os, just with different sd timeouts and retry logic - that's
es B_FAILFAST.  It's quite simple, and independent of any
es hardware implementation.

AIUI the main timeout to which we should be subject, at least for
nearline drives, is about 30 seconds long and is decided by the
drive's firmware, not the driver, and can't be negotiated in any way
that's independent of the hardware implementation, although sometimes
there are dependent ways to negotiate it.  The driver could also
decide through ``retry logic'' to time out the command sooner, before
the drive completes it, but this won't do much good because the drive
won't accept a second command until ITS timeout expires.

which leads to the second problem, that we're talking about timeouts
for individual I/O's, not marking whole devices.  A ``fast'' timeout
of even 1 second could cause a 100- or 1000-fold decrease in
performance, which could end up being equivalent to a freeze depending
on the type of load on the filesystem.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ARCSTAT Kstat Definitions

2008-08-28 Thread Peter Tribble
On Thu, Aug 21, 2008 at 8:47 PM, Ben Rockwood [EMAIL PROTECTED] wrote:
 New version is available (v0.2) :

 * Fixes divide by zero,
 * includes tuning from /etc/system in output
 * if prefetch is disabled I explicitly say so.
 * Accounts for jacked anon count.  Still need improvement here.
 * Added friendly explanations for MRU/MFU  Ghost lists counts.

 Page and examples are updated: cuddletech.com/arc_summary.pl

 Still needs work, but hopefully interest in this will stimulate some improved 
 understanding of ARC internals.

For a bit of light relief (in other words, with pretty graphs) I've hacked up a
graphical java version of Ben's script as part of jkstat (updated to 0.24):

http://www.petertribble.co.uk/Solaris/jkstat.html

Now, this is pretty rough, and chews up a modest amount of CPU, and
I'm not sure of the interpretation, but I've basically taken Ben's code and
lifted it more or less as is.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
 bf == Bob Friesenhahn [EMAIL PROTECTED] writes:

bf If the system or device is simply overwhelmed with work, then
bf you would not want the system to go haywire and make the
bf problems much worse.

None of the decisions I described it making based on performance
statistics are ``haywire''---I said it should funnel reads to the
faster side of the mirror, and do this really quickly and
unconservatively.  What's your issue with that?

bf You are saying that I can't split my mirrors between a local
bf disk in Dallas and a remote disk in New York accessed via
bf iSCSI?

nope, you've misread.  I'm saying reads should go to the local disk
only, and writes should go to both.  See SVM's 'metaparam -r'.  I
suggested that unlike the SVM feature it should be automatic, because
by so being it becomes useful as an availability tool rather than just
performance optimisation.

The performance-statistic logic should influence read scheduling
immediately, and generate events which are fed to FMA, then FMA can
mark devices faulty.  There's no need for both to make the same
decision at the same time.  If the events aren't useful for diagnosis,
ZFS could not bother generating them, or fmd could ignore them in its
diagnosis.  I suspect they *would* be useful, though.

I'm imagining the read rescheduling would happen very quickly, quicker
than one would want a round-trip from FMA, in much less than a second.
That's why it would have to compare devices to others in the same
vdev, and to themselves over time, rather than use fixed timeouts or
punt to haphazard driver and firmware logic.

bfo System waits substantial time for devices to (possibly)
bf recover in order to ensure that subsequently written data has
bf the least chance of being lost.

There's no need for the filesystem to *wait* for data to be written,
unless you are calling fsync.  and maybe not even then if there's a
slog.

I said clearly that you read only one half of the mirror, but write to
both.  But you're right that the trick probably won't work
perfectly---eventually dead devices need to be faulted.  The idea is
that normal write caching will buy you orders of magnitude longer time
in which to make a better decision before anyone notices.

Experience here is that ``waits substantial time'' usually means
``freezes for hours and gets rebooted''.  There's no need to be
abstract: we know what happens when a drive starts taking 1000x -
2000x longer than usual to respond to commands, and we know that this
is THE common online failure mode for drives.  That's what started the
thread.  so, think about this: hanging for an hour trying to write to
a broken device may block other writes to devices which are still
working, until the patiently-waiting data is eventually lost in the
reboot.

bfo System immediately ignores slow devices and switches to
bf non-redundant non-fail-safe non-fault-tolerant
bf may-lose-your-data mode.  When system is under intense load,
bf it automatically switches to the may-lose-your-data mode.

nobody's proposing a system which silently rocks back and forth
between faulted and online.  That's not what we have now, and no such
system would naturally arise.  If FMA marked a drive faulty based on
performance statistics, that drive would get retired permanently and
hot-spare-replaced.  Obviously false positives are bad, just as
obviously as freezes/reboots are bad.

It's not my idea to use FMA in this way.  This is how FMA was pitched,
and the excuse for leaving good exception handling out of ZFS for two
years.  so, where's the beef?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk

2008-08-28 Thread Anton B. Rang
Many mid-range/high-end RAID controllers work by having a small timeout on 
individual disk I/O operations. If the disk doesn't respond quickly, they'll 
issue an I/O to the redundant disk(s) to get the data back to the host in a 
reasonable time. Often they'll change parameters on the disk to limit how long 
the disk retries before returning an error for a bad sector (this is 
standardized for SCSI, I don't recall offhand whether any of this is 
standardized for ATA).

RAID 3 units, e.g. DataDirect, issue I/O to all disks simultaneously and when 
enough (N-1 or N-2) disks return data, they'll return the data to the host. At 
least they do that for full stripes. But this strategy works better for 
sequential I/O, not so good for random I/O, since you're using up extra 
bandwidth.

Host-based RAID/mirroring almost never takes this strategy for two reasons. 
First, the bottleneck is almost always the channel from disk to host, and you 
don't want to clog it. [Yes, I know there's more bandwidth there than the sum 
of the disks, but consider latency.] Second, to read from two disks on a 
mirror, you'd need two memory buffers.
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote:
 
 Personally, if a SATA disk wasn't responding to any requests after 2
 seconds I really don't care if an error has been detected, as far as
 I'm concerned that disk is faulty.

Unless you have power management enabled, or there's a bad region of the
disk, or the bus was reset, or...

 I do have a question though.  From what you're saying, the response
 time can't be consistent across all hardware, so you're once again at
 the mercy of the storage drivers.  Do you know how long does
 B_FAILFAST takes to return a response on iSCSI?  If that's over 1-2
 seconds I would still consider that too slow I'm afraid.

Its main function is how it deals with retryable errors.  If the drive
responds with a retryable error, or any error at all, it won't attempt
to retry again.  If you have a device that is taking arbitrarily long to
respond to successful commands (or to notice that a command won't
succeed), it won't help you.

 I understand that Sun in general don't want to add fault management to
 ZFS, but I don't see how this particular timeout does anything other
 than help ZFS when it's dealing with such a diverse range of media.  I
 agree that ZFS can't know itself what should be a valid timeout, but
 that's exactly why this needs to be an optional administrator set
 parameter.  The administrator of a storage array who wants to set this
 certainly knows what a valid timeout is for them, and these timeouts
 are likely to be several orders of magnitude larger than the standard
 response times.  I would configure very different values for my SATA
 drives as for my iSCSI connections, but in each case I would be
 happier knowing that ZFS has more of a chance of catching bad drivers
 or unexpected scenarios.

The main problem with exposing tunables like this is that they have a
direct correlation to service actions, and mis-diagnosing failures costs
everybody (admin, companies, Sun, etc) lots of time and money.  Once you
expose such a tunable, it will be impossible to trust any FMA diagnosis,
because you won't be able to know whether it was a mistaken tunable.

A better option would be to not use this to perform FMA diagnosis, but
instead work into the mirror child selection code.  This has already
been alluded to before, but it would be cool to keep track of latency
over time, and use this to both a) prefer one drive over another when
selecting the child and b) proactively timeout/ignore results from one
child and select the other if it's taking longer than some historical
standard deviation.  This keeps away from diagnosing drives as faulty,
but does allow ZFS to make better choices and maintain response times.
It shouldn't be hard to keep track of the average and/or standard
deviation and use it for selection; proactively timing out the slow I/Os
is much trickier.
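
A minimal sketch of that kind of latency-tracked child selection,
assuming a hypothetical per-child exponentially weighted mean and
deviation (the class and its parameters are illustrative; this is not
how the ZFS mirror code is actually structured):

    import math

    class MirrorChild:
        """Tracks a smoothed latency estimate for one side of a mirror."""
        def __init__(self, name, alpha=0.125):
            self.name = name
            self.alpha = alpha          # smoothing factor (assumed value)
            self.mean = None            # smoothed latency, in seconds
            self.var = 0.0              # smoothed squared deviation

        def record(self, latency):
            if self.mean is None:
                self.mean = latency
                return
            dev = latency - self.mean
            self.var = (1 - self.alpha) * self.var + self.alpha * dev * dev
            self.mean = (1 - self.alpha) * self.mean + self.alpha * latency

        def score(self):
            # Prefer the child with the lowest mean plus one deviation.
            return self.mean + math.sqrt(self.var)

    def pick_child(children):
        """Choose the side to read from; fall back to the first child
        until at least one side has some history."""
        scored = [c for c in children if c.mean is not None]
        return min(scored, key=MirrorChild.score) if scored else children[0]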

As others have mentioned, things get more difficult with writes.  If I
issue a write to both halves of a mirror, should I return when the first
one completes, or when both complete?  One possibility is to expose this
as a tunable, but any such "best effort" RAS is a little dicey because
you have very little visibility into the state of the pool in this
scenario - "is my data protected?" becomes a very difficult question to
answer.

- Eric

--
Eric Schrock, Fishworks            http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] liveupgrade ufs root - zfs ?

2008-08-28 Thread Paul Floyd
Hi

I'm not sure that the ZFS pool meets this requirement. I have

# lufslist SXCE_94

Filesystem          fstype  device size  Mounted on      Mount Options
------------------  ------  -----------  --------------  -------------
/dev/dsk/c1t2d0s1   swap    2147880960   -               -
/dev/dsk/c1t2d0s0   ufs     8590202880   /               -
/dev/dsk/c1t2d0s7   ufs     5747496960   /export/home    -

# lufslist SXCE_95

Filesystem          fstype  device size  Mounted on      Mount Options
------------------  ------  -----------  --------------  -------------
/dev/dsk/c1t2d0s1   swap    2147880960   -               -
/dev/dsk/c1t2d0s4   ufs     8590202880   /               -
/dev/dsk/c1t2d0s7   ufs     5747496960   /export/home    -

Is it possible to delete SXCE_94, do a zpool create with /dev/dsk/c1t2d0s0, and 
then do a liveupgrade?

I have the impression that it's possible, but that there are some extra steps 
needed (to specify the ZFS mount point?).

A+
Paul
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Miles Nordin wrote:

 None of the decisions I described its making based on performance
 statistics are ``haywire''---I said it should funnel reads to the
 faster side of the mirror, and do this really quickly and
 unconservatively.  What's your issue with that?

From what I understand, this is partially happening now based on 
average service time.  If I/O is backed up for a device, then the 
other device is preferred.  However it is good to keep in mind that if 
data is never read, then it is never validated and corrected.  It is 
good for ZFS to read data sometimes.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bill Sommerfeld
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
 A better option would be to not use this to perform FMA diagnosis, but
 instead work into the mirror child selection code.  This has already
 been alluded to before, but it would be cool to keep track of latency
 over time, and use this to both a) prefer one drive over another when
 selecting the child and b) proactively timeout/ignore results from one
 child and select the other if it's taking longer than some historical
 standard deviation.  This keeps away from diagnosing drives as faulty,
 but does allow ZFS to make better choices and maintain response times.
 It shouldn't be hard to keep track of the average and/or standard
 deviation and use it for selection; proactively timing out the slow I/Os
 is much trickier.

tcp has to solve essentially the same problem: decide when a response is
overdue based only on the timing of recent successful exchanges in a
context where it's difficult to make assumptions about reasonable
expected behavior of the underlying network.

it tracks both the smoothed round trip time and the variance, and
declares a response overdue after (SRTT + K * variance).

I think you'd probably do well to start with something similar to what's
described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on
experience.

- Bill
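
For reference, a minimal sketch of the RFC 2988-style estimator Bill
describes, using the RFC's constants (alpha = 1/8, beta = 1/4, K = 4)
and treating per-I/O latencies as the round-trip samples; that mapping
onto disk I/O, and dropping the RFC's clock-granularity term, are
assumptions of this sketch:

    class OverdueEstimator:
        """Smoothed latency plus mean deviation, RFC 2988 style."""
        ALPHA, BETA, K = 1/8, 1/4, 4

        def __init__(self):
            self.srtt = None            # smoothed latency
            self.rttvar = None          # smoothed mean deviation

        def sample(self, latency):
            if self.srtt is None:
                self.srtt = latency
                self.rttvar = latency / 2
            else:
                self.rttvar = ((1 - self.BETA) * self.rttvar
                               + self.BETA * abs(self.srtt - latency))
                self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * latency

        def deadline(self):
            # A response is declared overdue after SRTT + K * RTTVAR.
            return self.srtt + self.K * self.rttvar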





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ARCSTAT Kstat Definitions

2008-08-28 Thread Brendan Gregg - Sun Microsystems
G'Day Ben,

ARC visibility is important; did you see Neel's arcstat?:

http://www.solarisinternals.com/wiki/index.php/Arcstat

Try -x for various sizes, and -v for definitions.

On Thu, Aug 21, 2008 at 10:23:24AM -0700, Ben Rockwood wrote:
 Its a starting point anyway.   The key is to try and draw useful conclusions 
 from the info to answer the torrent of why is my ARC 30GB???
 
 There are several things I'm unclear on whether or not I'm properly 
 interpreting such as:
 
 * As you state, the anon pages.  Even the comment in code is, to me anyway, a 
 little vague.  I include them because otherwise you look at the hit counters 
 and wonder where a large chunk of them went.

Yes, anon hits don't make sense - they are dirty pages and won't have a DVA,
and so won't be findable by other threads in arc_read().   I can see why
arc_summary.pl thinks they exist - accounting for the discrepancy between
arcstats:hits and the sum of the hits from the four ARC lists.  Ghost list
hits aren't part of arcstats:hits - arcstats:hits are real hits, the ghost
hits are an artifact of the ARC algorithm.  If you do want to break down
arcstats:hits into its components, use:

zfs:0:arcstats:demand_data_hits
zfs:0:arcstats:demand_metadata_hits
zfs:0:arcstats:prefetch_data_hits
zfs:0:arcstats:prefetch_metadata_hits

And for a different perspective on the demand hits:

zfs:0:arcstats:mru_hits
zfs:0:arcstats:mfu_hits
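
A small sketch of pulling those counters on a live system and checking
the breakdown, assuming kstat -p zfs:0:arcstats output in the usual
name/value form (worth verifying on the target rev); counters that
aren't present are simply reported as zero:

    import subprocess

    def arcstats():
        """Return zfs:0:arcstats as a dict of counter name -> value."""
        out = subprocess.run(["kstat", "-p", "zfs:0:arcstats"],
                             capture_output=True, text=True, check=True).stdout
        stats = {}
        for line in out.splitlines():
            fields = line.split()
            if len(fields) != 2:
                continue                    # skip blank or oddly shaped lines
            name, value = fields
            stats[name.split(":")[-1]] = int(value) if value.isdigit() else 0
        return stats

    s = arcstats()
    hit_names = ("demand_data_hits", "demand_metadata_hits",
                 "prefetch_data_hits", "prefetch_metadata_hits")
    for k in hit_names:
        print(k, s.get(k, 0))
    print("sum of the four components:", sum(s.get(k, 0) for k in hit_names))
    print("arcstats:hits             :", s.get("hits", 0))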

Also, arc_summary.pl's reported MRU and MFU sizes aren't actual sizes;
these are target sizes.  The ARC will try to steer itself towards them, but in at
least one case (where the ARC has yet to fill) they can be very different
from actual (until arc_adjust() is called to whip them back to size.)

 * Prefetch... I want to use the Prefetch Data hit ratio as a judgment call on 
 the efficiency of prefetch.  If the value is very low it might be best to 
 turn it off. but I'd like to hear that from someone else before I go 
 saying that.

Sounds good to me.

 In high latency environments, such as ZFS on iSCSI, prefetch can either 
 significantly help or hurt, determining which is difficult without some type 
 of metric as as above.
 
 * There are several instances (based on dtracing) in which the ARC is 
 bypassed... for ZIL I understand, in some other cases I need to spend more 
 time analyzing the DMU (dbuf_*) for why.

 * In answering the Is having a 30GB ARC good? question, I want to say that 
 if MFU is 60% of ARC, and if the hits are mostly MFU that you are deriving 
 significant benefit from your large ARC but on a system with a 2GB ARC or 
 a 30GB ARC the overall hit ratio tends to be 99%.  Which is nuts, and tends 
 to reinforce a misinterpretation of anon hits.

I wouldn't read *too* much into MRU vs MFU hits.  MFU means 2 hits, MRU
means 1.

 The only way I'm seeing to _really_ understand ARC's efficiency is to look at 
 the overall number of reads and then how many are intercepted by ARC and how 
 many actually made it to disk... and why (prefetch or demand).  This is 
 tricky to implement via kstats because you have to pick out and monitor the 
 zpool disks themselves.

This would usually have more to do with the workload than the ARC's efficiency.

 I've spent a lot of time in this code (arc.c) and still have a lot of 
 questions.  I really wish there was an Advanced ZFS Internals talk coming 
 up; I simply can't keep spending so much time on this.

Maybe you could try forgetting about the kstats for a moment and draw a
fantasy arc_summary.pl output.  Then we can look at adding kstats to make
writing that script possible/easy (Mark and I could add the kstats, and
Neel could provide the script, for example).  Of course, if we do add more
kstats, it's not going to help on older rev kernels out there...

cheers,

Brendan

-- 
Brendan
[CA, USA]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss