Re: [zfs-discuss] help diagnosing system hang

2008-12-05 Thread Richard Elling
Ethan Erchinger wrote:
 
 Richard Elling wrote:

asc = 0x29
ascq = 0x0

 ASC/ASCQ 29/00 is POWER ON, RESET, OR BUS DEVICE RESET OCCURRED
 http://www.t10.org/lists/asc-num.htm#ASC_29

 [this should be more descriptive, since the codes are more or less
 standardized; I'll try to file an RFE unless someone beats me to it]

 Depending on which system did the reset, it should be noted in the
 /var/adm/messages log.  This makes me suspect the hardware (firmware,
 actually).

 Firmware of the SSD, or something else?

The answer may lie in the /var/adm/messages file which should report
if a reset was received or sent.
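
For example (illustrative commands; narrow the pattern to the SSD's sd
instance if needed), check whether anything logged a reset or a command
timeout around the time of the hang:

   grep -i reset /var/adm/messages | tail -20
   grep -i timeout /var/adm/messages | tail -20
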
  -- richard


[zfs-discuss] Status of zpool remove in raidz and non-redundant stripes

2008-12-05 Thread Mike Brancato
I've seen discussions as far back as 2006 that say development is underway to 
allow the addition and removal of disks in a raidz vdev to grow/shrink the 
group.  Meaning, if a 4x100GB raidz only used 150GB of space, one could do 
'zpool remove tank c0t3d0' and data residing on c0t3d0 would be migrated to 
other disks in the raidz.  Then, c0t3d0 would be free for removal and reuse.

What is the status of this support in nv101?

If a pool has multiple raidz vdevs, how would one add a disk to the second 
raidz vdev?
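
To be clear, I know the pool can be grown by adding a whole new top-level
vdev, e.g. (hypothetical device names):

   zpool add tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0

What I'm asking about is widening an existing raidz vdev by one disk, which
as far as I can tell has no command today.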


Re: [zfs-discuss] help diagnosing system hang

2008-12-05 Thread Ethan Erchinger
Richard Elling wrote:
 The answer may lie in the /var/adm/messages file which should report
 if a reset was received or sent.
Here is a sample set of messages at that time.  It looks like timeouts 
on the SSD for various requested blocks.  Maybe I need to talk with 
Intel about this issue.

Ethan
==

Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.warning] WARNING: 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED]/[EMAIL 
PROTECTED],0 (sd16):
Dec  2 20:14:01 opensolaris Error for Command: 
write   Error Level: Retryable
Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.notice]   
Requested Block: 840   Error Block: 840
Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.notice]   Vendor: 
ATASerial Number: CVEM840201EU
Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.notice]   Sense 
Key: Unit_Attention
Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.notice]   ASC: 
0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Dec  2 20:15:08 opensolaris scsi: [ID 107833 kern.warning] WARNING: 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:15:08 opensolaris Disconnected command timeout for Target 15
Dec  2 20:15:09 opensolaris scsi: [ID 365881 kern.info] 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:15:09 opensolaris Log info 0x3114 received for target 15.
Dec  2 20:15:09 opensolaris scsi_status=0x0, ioc_status=0x8048, 
scsi_state=0xc
Dec  2 20:15:09 opensolaris scsi: [ID 365881 kern.info] 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:15:09 opensolaris Log info 0x3114 received for target 15.
Dec  2 20:15:09 opensolaris scsi_status=0x0, ioc_status=0x8048, 
scsi_state=0xc
Dec  2 20:15:09 opensolaris scsi: [ID 365881 kern.info] 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:15:09 opensolaris Log info 0x3114 received for target 15.
Dec  2 20:15:09 opensolaris scsi_status=0x0, ioc_status=0x8048, 
scsi_state=0xc
Dec  2 20:15:09 opensolaris scsi: [ID 365881 kern.info] 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:15:09 opensolaris Log info 0x3114 received for target 15.
Dec  2 20:15:09 opensolaris scsi_status=0x0, ioc_status=0x8048, 
scsi_state=0xc
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.warning] WARNING: 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED]/[EMAIL 
PROTECTED],0 (sd16):
Dec  2 20:15:12 opensolaris Error for Command: 
write   Error Level: Retryable
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.notice]   
Requested Block: 810   Error Block: 810
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.notice]   Vendor: 
ATASerial Number: CVEM840201EU
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.notice]   Sense 
Key: Unit_Attention
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.notice]   ASC: 
0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Dec  2 20:16:19 opensolaris scsi: [ID 107833 kern.warning] WARNING: 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:16:19 opensolaris Disconnected command timeout for Target 15
Dec  2 20:16:21 opensolaris scsi: [ID 365881 kern.info] 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:16:21 opensolaris Log info 0x3114 received for target 15.
Dec  2 20:16:21 opensolaris scsi_status=0x0, ioc_status=0x8048, 
scsi_state=0xc
Dec  2 20:16:21 opensolaris scsi: [ID 365881 kern.info] 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:16:21 opensolaris Log info 0x3114 received for target 15.
Dec  2 20:16:21 opensolaris scsi_status=0x0, ioc_status=0x8048, 
scsi_state=0xc
Dec  2 20:16:21 opensolaris scsi: [ID 365881 kern.info] 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:16:21 opensolaris Log info 0x3114 received for target 15.
Dec  2 20:16:21 opensolaris scsi_status=0x0, ioc_status=0x8048, 
scsi_state=0xc
Dec  2 20:16:21 opensolaris scsi: [ID 365881 kern.info] 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1000,[EMAIL PROTECTED] (mpt0):
Dec  2 20:16:21 opensolaris Log info 0x3114 received for target 15.
Dec  2 20:16:21 opensolaris scsi_status=0x0, ioc_status=0x8048, 
scsi_state=0xc



[zfs-discuss] redundancy in non-redundant stripes

2008-12-05 Thread Mike Brancato
With ZFS, we can enable copies=[1,2,3] to configure how many copies of data 
there are.  With copies of 2 or more, in theory, an entire disk can have read 
errors, and the zfs volume still works.  
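
For reference, copies is a per-dataset property and only affects blocks
written after it is set; a minimal example (dataset name is made up):

   zfs set copies=2 tank/data
   zfs get copies tank/data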

The unfortunate part here is that the redundancy lies in the volume, not the 
pool vdev like with raidz or mirroring.  So if a disk were to go missing, the 
zpool (stripe) is missing a vdev and thus becomes offline.  If a single disk in 
a raidz vdev is missing, it would become degraded and still usable.  Now, with 
non-redundant stripes, the disk can't be replaced, but all the data is still 
there with copies=2 if a disk dies.  Is there not a way to force the zpool 
online or prevent it from offlining itself?

One of the key benefits of the metadata copies is that if a single block fails, 
the filesystem is still navigable to grab what data is possible.


Re: [zfs-discuss] redundancy in non-redundant stripes

2008-12-05 Thread Bob Friesenhahn
On Fri, 5 Dec 2008, Mike Brancato wrote:

 With ZFS, we can enable copies=[1,2,3] to configure how many copies 
 of data there are.  With copies of 2 or more, in theory, an entire 
 disk can have read errors, and the zfs volume still works.

So you are saying that if we use copies of 2 or more, and we have 
only one disk drive and it does not spin up, then we should be ok?

My understanding is that the copies function is purely statistical: 
if some drives are overloaded and therefore are not selected as the 
next round-robin device, it is possible that several copies may 
end up on the same drive.  The copies functionality is intended to aid 
with media failure, not whole-drive failure.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] Status of zpool remove in raidz and non-redundant stripes

2008-12-05 Thread Richard Elling
Mike Brancato wrote:
 I've seen discussions as far back as 2006 that say development is underway to 
 allow the addition and removal of disks in a raidz vdev to grow/shrink the 
 group.  Meaning, if a 4x100GB raidz only used 150GB of space, one could do 
 'zpool remove tank c0t3d0' and data residing on c0t3d0 would be migrated to 
 other disks in the raidz.  Then, c0t3d0 would be free for removal and reuse.
 
 What is the status of this support in nv101?

Not available.  I predict that you will see it mentioned everywhere
(billboards, graffiti, Slashdot, etc.) when it arrives.
  -- richard


Re: [zfs-discuss] redundancy in non-redundant stripes

2008-12-05 Thread Mike Brancato
In theory, with two 80GB drives, you would always have a copy somewhere else.  
But with a single drive, no.

I guess I'm thinking of the optimal situation.  With multiple drives, copies 
are spread across the vdevs.  It would work better if we could require that, 
with copies=2 or more, at least one copy land on a different vdev.


Re: [zfs-discuss] Status of zpool remove in raidz and non-redundant stripes

2008-12-05 Thread Mike Brancato
Well, I knew it wasn't available.  I meant to ask: what is the status of the 
feature's development?  Not started, I presume.

Is there no timeline?


Re: [zfs-discuss] redundancy in non-redundant stripes

2008-12-05 Thread Richard Elling
Mike Brancato wrote:
 With ZFS, we can enable copies=[1,2,3] to configure how many copies of data 
 there are.  With copies of 2 or more, in theory, an entire disk can have read 
 errors, and the zfs volume still works.  

No, this is not a completely true statement.

 The unfortunate part here is that the redundancy lies in the volume, not the 
 pool vdev like with raidz or mirroring.  So if a disk were to go missing, the 
 zpool (stripe) is missing a vdev and thus becomes offline.  If a single disk 
 in a raidz vdev is missing, it would become degraded and still usable.  Now, 
 with non-redundant stripes, the disk can't be replaced, but all the data is 
 still there with copies=2 if a disk dies.  Is there not a way to force the 
 zpool online or prevent it from offlining itself?

No. If you want this feature with copies > 1, then consider
mirroring.
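
For example (device names are illustrative), a pool built from mirrored vdevs
keeps the redundancy at the vdev level and merely degrades, rather than
faulting, if one side of a mirror disappears:

   zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0
   zpool status tank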

 One of the key benefits of the metadata copies is that if a single block 
 fails, the filesystem is still navigable to grab what data is possible.

Yes.  But you cannot guarantee that the metadata copies are
on different vdevs.
  -- richard


Re: [zfs-discuss] zpool replace - choke point

2008-12-05 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 Thanks for the tips.  I'm not sure if they will be relevant, though.  We
 don't talk directly with the AMS1000.  We are using a USP-VM to virtualize
 all of our storage and we didn't have to add anything to the drv
 configuration files to see the new disk (mpxio was already turned on).  We
 are using the Sun drivers and mpxio and we didn't require any tinkering to
 see the new LUNs.

Yes, the fact that the USP-VM was recognized automatically by Solaris drivers
is a good sign.  I suggest that you check to see what queue-depth and disksort
values you ended up with from the automatic settings:

  echo "*ssd_state::walk softstate | ::print -t struct sd_lun un_throttle" \
    | mdb -k

The ssd_state would be sd_state on an x86 machine (Solaris-10).
The un_throttle above will show the current max_throttle (queue depth);
replace it with un_min_throttle to see the min, and un_f_disksort_disabled
to see the current queue-sort setting.
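
If you do end up capping the queue depth, the usual place is /etc/system (a
sketch only; verify whether the sd or ssd tunable applies on your platform
and release before using it):

   * limit per-LUN queue depth to what the back-end can actually service
   set ssd:ssd_max_throttle = 32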

The HDS docs for the 9500 series suggested 32 as the max_throttle to use, and
the default setting (Solaris-10) was 256 (hopefully with the USP-VM you get
something more reasonable).  And while 32 did work for us, i.e. no operations
were ever lost as far as I could tell, the array back-end (the drives
themselves and the internal SATA shelf connections) has an actual queue
depth of four for each array controller.  The AMS1000 has the same limitation
for SATA shelves, according to our HDS engineer.

In short, Solaris, especially with ZFS, functions much better if it does
not try to send more FC operations to the array than the actual physical
devices can handle.  We were actually seeing NFS client operations hang
for minutes at a time when the SAN-hosted NFS server was making its ZFS
devices busy -- and this was true even if clients were using different
devices than the busy ones.  We do not see these hangs after making the
described changes, and I believe this is because the OS is no longer waiting
around for a response from devices that aren't going to respond in a
reasonable amount of time.

Yes, having the USP between the host and the AMS1000 will affect things;
there's probably some huge cache in there somewhere.  But unless you've
got a cache hundreds of GB in size, at some point a resilver operation
is going to end up running at the speed of the actual back-end device.

Regards,

Marion




[zfs-discuss] ZFS fragments 32 bits RAM? Problem?

2008-12-05 Thread Orvar Korvar
I see this old post about ZFS fragmenting RAM on 32-bit systems, which makes 
the machine run out of memory.  Is that still true, or has it been fixed?

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-July/003506.html


Re: [zfs-discuss] Status of zpool remove in raidz and non-redundant stripes

2008-12-05 Thread Miles Nordin
 mb == Mike Brancato [EMAIL PROTECTED] writes:

mb if a 4x100GB raidz only used 150GB of space, one could do
mb 'zpool remove tank c0t3d0' and data residing on c0t3d0 would
mb be migrated to other disks in the raidz.

that sounds like in-place changing of stripe width, and wasn't part of
the discussion I remember.  We were wishing for vdev removal, but
you'd have to remove a whole vdev at a time.  It would be analogous to
'zpool add', so just as you can't add 1 disk to widen a 3-disk raidz
vdev to 4 disks, you couldn't do the reverse even with the wished-for
feature.

To change from 4x100GB raidz to 3x100GB raidz, you'd have to:

zpool add pool raidz disk5 disk6 disk7
zpool evacuate pool raidz disk1 disk2 disk3 disk4

RFE 4852783 is to create something like zpool evacuate, removing the
whole vdev at once and migrating onto other vdevs, not other disks.

The feature's advantage as-is would be for pools with many vdevs.  It
could also be an advantage for pools with just one vdev that are
humongous: you want to change the shape of the one vdev, but you need to
do the copy/evacuation online because it takes a week.  If not for the
week, on a 1-vdev pool you could destroy the pool and make a new one
without needing any more media than you would with the new feature.

For home storage with big, slow, cheap pools, what you want sounds
nice.  Someone once told me he'd gotten Veritas to change a plex's
width with the vg online, but for me I think it's scary because, if it
crashed halfway through, I'm not sure how the system could communicate
to me what's happening in a way I'd understand, much less recover from
it.  I'm not saying Veritas doesn't do both, just that I'd chuckle
happily if I saw it actually work (which was the storyteller's
response too).  For vdev removal I think you could harmlessly stop the
evacuation at any time with only O(1) quickie-import-time recovery,
without needing to communicate anything.  Much easier.  So I like the
RFE as-is, analogous to Linux LVM2's pvmove.




Re: [zfs-discuss] ZFS fragments 32 bits RAM? Problem?

2008-12-05 Thread Brian Hechinger
On Fri, Dec 05, 2008 at 11:35:27AM -0800, Orvar Korvar wrote:
 I see this old post about ZFS fragmenting the RAM if it is 32 bit. This makes 
 the memory run out. Is it still true, or has it been fixed?

Don't waste your time trying to run ZFS on a 32-bit machine.  The performance
is horrible.  I really wish I hadn't.

-brian
-- 
Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full of
pop tarts and pancake mix. -- IRC User (http://www.bash.org/?841435)


Re: [zfs-discuss] zfs not yet suitable for HA applications?

2008-12-05 Thread David Anderson
Trying to keep this in the spotlight. Apologies for the lengthy post.

I'd really like to see features as described by Ross in his summary of  
the "Availability: ZFS needs to handle disk removal / driver failure  
better" thread (http://www.opensolaris.org/jive/thread.jspa?messageID=274031#274031).  
I'd like to have these/similar features as well. Have there  
already been internal discussions about adding this type of  
functionality to ZFS itself, and was the outcome approval, disapproval or no  
decision?

Unfortunately my situation has put me in urgent need to find  
workarounds in the meantime.

My setup: I have two iSCSI target nodes, each with six drives exported  
via iscsi (Storage Nodes). I have a ZFS Node that logs into each  
target from both Storage Nodes and creates a mirrored Zpool with one  
drive from each Storage Node comprising each half of the mirrored  
vdevs (6 x 2-way mirrors).

My problem: If a Storage Node crashes completely, is disconnected from  
the network, iscsitgt core dumps, a drive is pulled, or a drive has a  
problem accessing data (read retries), then my ZFS Node hangs while  
ZFS waits patiently for the layers below to report a problem and  
time out the devices. This can lead to a roughly 3-minute or longer  
halt when reading OR writing to the Zpool on the ZFS node. While this  
is acceptable in certain situations, I have a case where my  
availability demand is more severe.

My goal: figure out how to have the zpool pause for NO LONGER than 30  
seconds (roughly within a typical HTTP request timeout) and then issue  
reads/writes to the good devices in the zpool/mirrors while the other  
side comes back online or is fixed.

My ideas:
   1. In the case of the iscsi targets disappearing (iscsitgt core  
dump, Storage Node crash, Storage Node disconnected from network), I  
need to lower the iSCSI login retry/timeout values. Am I correct in  
assuming the only way to accomplish this is to recompile the iscsi  
initiator? If so, can someone help point me in the right direction (I  
have never compiled ONNV sources - do I need to do this or can I just  
recompile the iscsi initiator)?

1.a. I'm not sure in what Initiator session states  
iscsi_sess_max_delay is applicable - only for the initial login, or  
also in the case of reconnect? Ross, if you still have your test boxes  
available, can you please try setting "set iscsi:iscsi_sess_max_delay  
= 5" in /etc/system, reboot, and try failing your iscsi vdevs again? I  
can't find a case where this was tested for quick failover.

 1.b. I would much prefer to have bug 649 addressed and fixed  
rather than having to resort to recompiling the iscsi initiator (if  
iscsi_sess_max_delay doesn't work). This seems like a trivial feature  
to implement. How can I sponsor development?

   2. In the case of the iscsi target being reachable, but the  
physical disk is having problems reading/writing data (retryable  
events that take roughly 60 seconds to timeout), should I change the  
iscsi_rx_max_window tunable with mdb? Is there a tunable for iscsi_tx?  
Ross, I know you tried this recently in the thread referenced above  
(with value 15), which resulted in a 60 second hang. How did you  
offline the iscsi vol to test this failure? Unless iscsi uses a  
multiple of the value for retries, then maybe the way you failed the  
disk caused the iscsi system to follow a different failure path?  
Unfortunately I don't know of a way to introduce read/write retries to  
a disk while the disk is still reachable and presented via iscsitgt,  
so I'm not sure how to test this.

 2.a With the fix of http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 
  , we can set sd_retry_count along with sd_io_time to cause I/O  
failure when a command takes longer than sd_retry_count * sd_io_time.  
Can (or should) these tunables be set on the imported iscsi disks in  
the ZFS Node, or can/should they be applied only to the local disk on  
the Storage Nodes? If there is a way to apply them to ONLY the  
imported iscsi disks (and not the local disks) of the ZFS Node, and  
without rebooting every time a new iscsi disk is imported, then I'm  
thinking this is the way to go.
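
For reference, the kind of /etc/system stanza that 1.a and 2.a describe would  
look something like this (values illustrative only; I have not verified how  
the sd tunables behave for iSCSI-backed LUNs):

   * fail the iSCSI login sooner when a target node is unreachable
   set iscsi:iscsi_sess_max_delay = 5
   * fail stuck commands after roughly sd_retry_count * sd_io_time seconds
   set sd:sd_io_time = 10
   set sd:sd_retry_count = 3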

In a year of having this setup in customer beta I have never had both  
Storage Nodes (or both sides of a mirror) down at the same time. I'd  
like ZFS to take advantage of this. If (and only if) both sides fail  
then ZFS can enter failmode=wait.
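
For reference, that is the pool-level failmode property, set with something  
like:

   zpool set failmode=wait tank
   zpool get failmode tank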

Currently using Nevada b96. Planning to move to 100 shortly to avoid  
zpool commands hanging while the zpool is waiting to reach a device.

David Anderson
Aktiom Networks, LLC

Ross wrote:
  I discussed this exact issue on the forums in February, and filed a  
bug at the time.  I've also e-mailed and chatted with the iSCSI  
developers, and the iSER developers a few times.  There was also been  
another thread about the iSCSI timeouts being made configurable a few  
months back, and finally, I started another discussion on ZFS  

Re: [zfs-discuss] zfs not yet suitable for HA applications?

2008-12-05 Thread Ross Smith
Hi Dan, replying in line:

On Fri, Dec 5, 2008 at 9:19 PM, David Anderson [EMAIL PROTECTED] wrote:
 Trying to keep this in the spotlight. Apologies for the lengthy post.

Heh, don't apologise, you should see some of my posts... o_0

 I'd really like to see features as described by Ross in his summary of the
 Availability: ZFS needs to handle disk removal / driver failure better
  (http://www.opensolaris.org/jive/thread.jspa?messageID=274031#274031 ).
 I'd like to have these/similar features as well. Has there already been
 internal discussions regarding adding this type of functionality to ZFS
 itself, and was there approval, disapproval or no decision?

 Unfortunately my situation has put me in urgent need to find workarounds in
 the meantime.

 My setup: I have two iSCSI target nodes, each with six drives exported via
 iscsi (Storage Nodes). I have a ZFS Node that logs into each target from
 both Storage Nodes and creates a mirrored Zpool with one drive from each
 Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).

 My problem: If a Storage Node crashes completely, is disconnected from the
 network, iscsitgt core dumps, a drive is pulled, or a drive has a problem
 accessing data (read retries), then my ZFS Node hangs while ZFS waits
 patiently for the layers below to report a problem and timeout the devices.
 This can lead to a roughly 3 minute or longer halt when reading OR writing
 to the Zpool on the ZFS node. While this is acceptable in certain
 situations, I have a case where my availability demand is more severe.

 My goal: figure out how to have the zpool pause for NO LONGER than 30
 seconds (roughly within a typical HTTP request timeout) and then issue
 reads/writes to the good devices in the zpool/mirrors while the other side
 comes back online or is fixed.

 My ideas:
  1. In the case of the iscsi targets disappearing (iscsitgt core dump,
 Storage Node crash, Storage Node disconnected from network), I need to lower
 the iSCSI login retry/timeout values. Am I correct in assuming the only way
 to accomplish this is to recompile the iscsi initiator? If so, can someone
 help point me in the right direction (I have never compiled ONNV sources -
 do I need to do this or can I just recompile the iscsi initiator)?

I believe it's possible to just recompile the initiator and install
the new driver.  I have some *very* rough notes that were sent to me
about a year ago, but I've no experience compiling anything in
Solaris, so don't know how useful they will be.  I'll try to dig them
out in case they're useful.


   1.a. I'm not sure in what Initiator session states iscsi_sess_max_delay is
 applicable - only for the initial login, or also in the case of reconnect?
 Ross, if you still have your test boxes available, can you please try
 setting set iscsi:iscsi_sess_max_delay = 5 in /etc/system, reboot and try
 failing your iscsi vdevs again? I can't find a case where this was tested
 quick failover.

Will gladly have a go at this on Monday.

1.b. I would much prefer to have bug 649 addressed and fixed rather
 than having to resort to recompiling the iscsi initiator (if
 iscsi_sess_max_delay) doesn't work. This seems like a trivial feature to
 implement. How can I sponsor development?

  2. In the case of the iscsi target being reachable, but the physical disk
 is having problems reading/writing data (retryable events that take roughly
 60 seconds to timeout), should I change the iscsi_rx_max_window tunable with
 mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently
 in the thread referenced above (with value 15), which resulted in a 60
 second hang. How did you offline the iscsi vol to test this failure? Unless
 iscsi uses a multiple of the value for retries, then maybe the way you
 failed the disk caused the iscsi system to follow a different failure path?
 Unfortunately I don't know of a way to introduce read/write retries to a
 disk while the disk is still reachable and presented via iscsitgt, so I'm
 not sure how to test this.

So far I've just been shutting down the Solaris box hosting the iSCSI
target.  Next step will involve pulling some virtual cables.
Unfortunately I don't think I've got a physical box handy to test
drive failures right now, but my previous testing (of simply pulling
drives) showed that it can be hit and miss as to how well ZFS detects
these types of 'failure'.

Like you I don't know yet how to simulate failures, so I'm doing
simple tests right now, offlining entire drives or computers.
Unfortunately I've found more than enough problems with just those
tests to keep me busy.


2.a With the fix of
 http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set
 sd_retry_count along with sd_io_time to cause I/O failure when a command
 takes longer than sd_retry_count * sd_io_time. Can (or should) these
 tunables be set on the imported iscsi disks in the ZFS Node, or can/should
 they be applied only to the local disk on 

Re: [zfs-discuss] zfs not yet suitable for HA applications?

2008-12-05 Thread Miles Nordin
 da == David Anderson [EMAIL PROTECTED] writes:

da (I have never
da compiled ONNV sources - do I need to do this or can I just
da recompile the iscsi initiator)?

The source offering is disorganized and spread over many
``consolidations'' which are pushed through ``gates'', similar to
Linux with its source tarballs and kernel patch-kits, but only tens of
consolidations instead of thousands of packages.  The downside: the
overall source-to-binary system you get with Gentoo portage or Debian
dpkg or RedHat SRPM's to gather the consolidations and turn them into
an .iso, is Sun-proprietary.  There was talk of an IPS ``distribution
builder'' but it seems to be a binary-only FLAR replacement for
replicating installed systems, not a proper open build system that
consumes sources and produces IPS ``images''.  I don't know which
consolidation holds the iSCSI initiator sources, or how to find it.

Also for sharing your experiences you need to get the exact same
version of the sources as on other people's binary installs, so you
can compare with others while changing only what you're trying to
change (``snv_101 + my timeout change'').  I'm not sure how to do
that---I see some bugs (for example 6684570) refer to versions like
``onnv-gate:2008-04-04'', but how does that map to the snv_nn
version-markers, or is it a different branch entirely?  On Linux or
BSD I'd use the package system to both find the source and get the
exact version of it I'm running.

There are only a few consolidations to dig through.  maybe either the
ON consolidation or the Storage consolidation?  Once you find it all
you have to do is solve the version question.




[zfs-discuss] Problem with ZFS and ACL with GDM

2008-12-05 Thread Brian Cameron

I am the maintainer of GDM, and I am noticing that GDM has a problem when
running on a ZFS filesystem, as with Indiana.

When GDM (the GNOME Display Manager) starts the login GUI, it runs the
following commands on Solaris:

   /usr/bin/setfacl -m user:gdm:rwx,mask:rwx /dev/audio
   /usr/bin/setfacl -m user:gdm:rwx,mask:rwx /dev/audioctl

It does this because the login GUI programs are run as the gdm user,
and in order to support text-to-speech via orca, for users with
accessibility needs, the gdm user needs access to the audio device.
We were using setfacl because logindevperm(3) normally manages the
audio device permissions and we only want the gdm user to have
access on-the-fly when the GDM GUI is started.

However, I notice that when using ZFS on Indiana the above commands fail
with the following error:

   File system doesn't support aclent_t style ACL's.
   See acl(5) for more information on ACL styles support by Solaris.

What is the appropriate command to use with ZFS?  If different commands
are needed based on the file system type, then how can GDM determine which
command to use?  Or is there a better way for GDM to ensure that the
audio devices have the appropriate permissions for the gdm user to
support text-to-speech accessibility?
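
Is something like the following the right direction?  (Untested on my side; I
gather ZFS uses NFSv4-style ACLs manipulated via chmod(1) rather than
setfacl(1), and this assumes the /dev entries live on a filesystem that
accepts those ACLs.)

   /usr/bin/chmod A+user:gdm:read_data/write_data/execute:allow /dev/audio
   /usr/bin/chmod A+user:gdm:read_data/write_data/execute:allow /dev/audioctl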

I am not subscribed to this list, so please cc: me in any response.

Thanks,

Brian

