Re: [OmniOS-discuss] scsi command timeouts

2017-06-22 Thread Michael Talbott
A couple of things that I've discovered over time that might help:

Don't ever use the root user for zpool queries such as "zpool status". If you 
have a really bad failing disk, a zpool status command can take forever to 
complete when run as root. A "su nobody -c 'zpool status'" will return results 
almost instantly. So if your device discovery script(s) use zpool commands, 
that might be a choke point.
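
For discovery scripts, the same idea can be wrapped with a deadline so that one slow query cannot stall everything else. This is only a minimal sketch: the helper names unprivileged_command and run_with_deadline are made up for illustration, the 10-second default is arbitrary, and it assumes the script itself runs as root so that su does not prompt for a password. A process that is stuck inside the kernel may not die promptly even after the deadline fires.

```python
import subprocess


def unprivileged_command(command, user="nobody"):
    """Build the argv that runs `command` through su as `user` (hypothetical helper)."""
    return ["su", user, "-c", command]


def run_with_deadline(argv, timeout_s=10.0):
    """Run argv and return its stdout, or None if it did not finish in time.

    subprocess.run() kills the child when the timeout expires and then waits
    for it, so a child blocked in uninterruptible I/O can still delay the return.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout
```

A discovery script would then call run_with_deadline(unprivileged_command("zpool status")) and treat a None result as "pool query timed out" rather than waiting indefinitely.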

# Make sure to prevent SCSI bus resets (in /kernel/drv/sd.conf), especially in
# an HA environment
allow-bus-device-reset=0;

Also, depending on the disk model, I've found that some of them wreak havoc on 
the SAS topology itself when they start to fail. Some just handle errors really 
badly and can flood the SAS channel. If you have a SAS switch in between, you 
might be able to get an idea of which device is causing the grief from there 
based on the counts.
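
Without a switch to consult, a rough host-side equivalent is to tally the mpt_sas timeout warnings (the "Timeout of 60 seconds expired ..." lines quoted further down in this thread) per target. The sketch below assumes the warning text looks exactly like that quoted message; the function name count_timeouts is hypothetical, and the target number still has to be mapped back to a physical disk by other means.

```python
import re
from collections import Counter

# Matches the warning text quoted in this thread, e.g.
# "Timeout of 60 seconds expired with 1 commands on target 16 lun 0"
TIMEOUT_RE = re.compile(
    r"Timeout of (\d+) seconds expired with (\d+) commands on target (\d+) lun (\d+)"
)


def count_timeouts(lines):
    """Return a Counter mapping (target, lun) to the number of timeout warnings."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            target, lun = int(match.group(3)), int(match.group(4))
            counts[(target, lun)] += 1
    return counts
```

Feeding it the lines of the system log (assumed here to be /var/adm/messages) and looking at count_timeouts(f).most_common(3) should put the noisiest targets at the top.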

In my case I have had horrible experiences with the WD WD4001FYYG. That model 
of drive has caused me an insane number of headaches. The disk scan on boot 
literally takes 13 seconds per disk when the disks are perfectly good, and much 
longer when one is dying. If I replace them with another make/model of 
drive, the disk scan is done in a fraction of a second. Also, when booting the 
same machine into any Linux OS, the scan completes in a fraction of a second. 
There must be something about that model's firmware that doesn't play nicely 
with Illumos's driver. Anyway, that's a story for another time ;)

I've reduced the drive scan time at boot to 5 seconds per disk, instead of 
the 13 seconds per disk for that horrible accursed drive, by adding this to 
/kernel/drv/sd.conf:

sd-config-list= "WD  WD4001FYYG","power-condition:false";

Followed by this command to commit it:
update_drv -vf sd

Hope this helps.


Michael


> On Jun 22, 2017, at 1:41 PM, Schweiss, Chip wrote:
> 
> I'm talking about an offline pool. I started this thread after rebooting a 
> server that is part of an HA pair. The other server has the pools online. 
> It's been over 4 hours now and it still hasn't completed its disk scan.
> 
> Every tool I have that helps me locate disks suffers from the same insane 
> command timeout, which has to happen many times before moving on. Operations 
> that typically take seconds blow up to hours really fast because of a few 
> dead disks.
> 
> -Chip
> 
> 
> 
> On Thu, Jun 22, 2017 at 3:12 PM, Dale Ghent wrote:
> 
> Have you been able to try offlining it in the zpool?
> 
> zpool offline thepool 
> 
> I'm assuming the pool has some redundancy which would allow for this.
> 
> /dale
> 
> > On Jun 22, 2017, at 11:54 AM, Schweiss, Chip wrote:
> >
> > Whenever a disk goes south, several disk-related tasks become painfully 
> > slow. Boot-up times can jump into the hours to complete the disk scans.
> >
> > The logs slowly fill up with messages of this type:
> >
> > genunix: WARNING /pci@0,0/pci8086,340c@5/pci15d9,400@0 (mpt_sas0):
> > Timeout of 60 seconds expired with 1 commands on target 16 lun 0
> >
> > I thought this /etc/system setting would reduce the timeout to 5 seconds:
> > set sd:sd_io_time = 5
> >
> > But this doesn't seem to change anything.
> >
> > Is there any way to make this a more reasonable timeout, besides pulling the 
> > disk that's causing it? Just locating the defective disk is also 
> > painfully slow because of this problem.
> >
> > -Chip
> > ___
> > OmniOS-discuss mailing list
> > OmniOS-discuss@lists.omniti.com 
> > http://lists.omniti.com/mailman/listinfo/omnios-discuss 

___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss


Re: [OmniOS-discuss] scsi command timeouts

2017-06-22 Thread Bob Friesenhahn

On Thu, 22 Jun 2017, Schweiss, Chip wrote:


> I'm talking about an offline pool. I started this thread after rebooting
> a server that is part of an HA pair. The other server has the pools
> online. It's been over 4 hours now and it still hasn't completed its disk
> scan.
>
> Every tool I have that helps me locate disks suffers from the same insane
> command timeout, which has to happen many times before moving on. Operations
> that typically take seconds blow up to hours really fast because of a few
> dead disks.


You forgot to describe your storage topology and the type of drives 
(SAS/SATA) involved.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [OmniOS-discuss] scsi command timeouts

2017-06-22 Thread Jeffry Molanus
Hi,

Certain commands (in particular during attach) are sent by mptsas itself.
These have a timeout set in the driver and are not issued by sd, hence they
are not affected by changing those values. See, for example,
mptsas_access_config_page().

 - Jeffry



Re: [OmniOS-discuss] scsi command timeouts

2017-06-22 Thread Schweiss, Chip
I'm talking about an offline pool.   I started this thread after rebooting
a server that is part of an HA pair. The other server has the pools
online.  It's been over 4 hours now and it still hasn't completed its disk
scan.

Every tool I have that helps me locate disks suffers from the same insane
command timeout, which has to happen many times before moving on. Operations
that typically take seconds blow up to hours really fast because of a few
dead disks.

-Chip





Re: [OmniOS-discuss] scsi command timeouts

2017-06-22 Thread Dale Ghent

Have you been able to try offlining it in the zpool?

zpool offline thepool 

I'm assuming the pool has some redundancy which would allow for this.

/dale

> On Jun 22, 2017, at 11:54 AM, Schweiss, Chip wrote:
> 
> Whenever a disk goes south, several disk-related tasks become painfully 
> slow. Boot-up times can jump into the hours to complete the disk scans.
> 
> The logs slowly fill up with messages of this type:
> 
> genunix: WARNING /pci@0,0/pci8086,340c@5/pci15d9,400@0 (mpt_sas0):
> Timeout of 60 seconds expired with 1 commands on target 16 lun 0
> 
> I thought this /etc/system setting would reduce the timeout to 5 seconds:
> set sd:sd_io_time = 5
> 
> But this doesn't seem to change anything.
> 
> Is there any way to make this a more reasonable timeout, besides pulling the 
> disk that's causing it? Just locating the defective disk is also painfully 
> slow because of this problem.
> 
> -Chip





Re: [OmniOS-discuss] scsi command timeouts

2017-06-22 Thread Schweiss, Chip
On Thu, Jun 22, 2017 at 11:05 AM, Michael Rasmussen wrote:

>
> > I thought this /etc/system setting would reduce the timeout to 5 seconds:
> > set sd:sd_io_time = 5
> >
> I think it expects a hex value so try 0x5 instead.
>
>
Unfortunately, no, I've tried that too.

-Chip




Re: [OmniOS-discuss] scsi command timeouts

2017-06-22 Thread Michael Rasmussen
On Thu, 22 Jun 2017 10:54:25 -0500
"Schweiss, Chip" wrote:

> I thought this /etc/system setting would reduce the timeout to 5 seconds:
> set sd:sd_io_time = 5
> 
I think it expects a hex value so try 0x5 instead.

-- 
Hilsen/Regards
Michael Rasmussen

Get my public GnuPG keys:
michael  rasmussen  cc
http://pgp.mit.edu:11371/pks/lookup?op=get=0xD3C9A00E
mir  datanom  net
http://pgp.mit.edu:11371/pks/lookup?op=get=0xE501F51C
mir  miras  org
http://pgp.mit.edu:11371/pks/lookup?op=get=0xE3E80917
--
/usr/games/fortune -es says:
Look, we play the Star Spangled Banner before every game.  You want us
to pay income taxes, too?
-- Bill Veeck, Chicago White Sox

