Hi Sebastien,

In the first mail you mentioned that you are using MPxIO to control the XP24K array. Why are you using MPxIO here?

Thanks,
Venkata Sreenivasarao Nagineni,
Symantec

> -----Original Message-----
> From: veritas-vx-boun...@mailman.eng.auburn.edu [mailto:veritas-vx-boun...@mailman.eng.auburn.edu] On Behalf Of Sebastien DAUBIGNE
> Sent: Wednesday, October 06, 2010 9:32 AM
> To: undisclosed-recipients
> Cc: Veritas-vx@mailman.eng.auburn.edu
> Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
>
> Hi,
>
> I am coming back to my dmp_fast_recovery issue (VxDMP fails the path before
> MPxIO gets a chance to fail over to the alternate path).
> As stated previously, I am running 5.0GA, and this tunable is not
> supported in this release. However, I still don't know whether VxVM 5.0GA
> silently bypasses the MPxIO stack for error recovery.
>
> Now I am trying to determine whether upgrading to MP3 will resolve this issue
> (which has rarely occurred).
>
> Could anyone (maybe Joshua?) explain whether the behaviour of 5.0GA without
> the tunable is functionally identical to dmp_fast_recovery=0 or
> dmp_fast_recovery=1? Maybe the mechanism was implemented in 5.0
> without the option to disable it (this could explain my issue)?
>
> Joshua, you mentioned another tunable for 5.0, but looking at the list I
> can't identify the corresponding tunable:
>
> > vxdmpadm gettune all
> Tunable                         Current Value  Default Value
> ------------------------------  -------------  -------------
> dmp_failed_io_threshold         57600          57600
> dmp_retry_count                 5              5
> dmp_pathswitch_blks_shift       11             11
> dmp_queue_depth                 32             32
> dmp_cache_open                  on             on
> dmp_daemon_count                10             10
> dmp_scsi_timeout                30             30
> dmp_delayq_interval             15             15
> dmp_path_age                    0              300
> dmp_stat_interval               1              1
> dmp_health_time                 0              60
> dmp_probe_idle_lun              on             on
> dmp_log_level                   4              1
>
> Cheers.
>
> On 16/09/2010 16:50, Joshua Fielden wrote:
> > dmp_fast_recovery is a mechanism by which we bypass the sd/scsi stack
> > and send path inquiry/status CDBs directly from the HBA in order to
> > bypass long SCSI queues and recover paths faster.
> > With a TPD (third-party driver) such as MPxIO, bypassing the stack means
> > we bypass the TPD completely, and interactions such as this can happen.
> > The vxesd (event-source daemon) is another 5.0/MP2 backport addition
> > that's moot in the presence of a TPD.
> >
> > From your modinfo, you're not actually running MP3. This technote
> > (http://seer.entsupport.symantec.com/docs/327057.htm) isn't exactly
> > your scenario, but looking for partially-installed pkgs is a good start
> > to getting your server correctly installed, then the tuneable should
> > work -- very early 5.0 versions had a differently-named tuneable I
> > can't find in my mail archive ATM.
> >
> > Cheers,
> >
> > Jf
> >
> > -----Original Message-----
> > From: veritas-vx-boun...@mailman.eng.auburn.edu [mailto:veritas-vx-boun...@mailman.eng.auburn.edu] On Behalf Of Sebastien DAUBIGNE
> > Sent: Thursday, September 16, 2010 7:41 AM
> > To: Veritas-vx@mailman.eng.auburn.edu
> > Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
> >
> > Thank you Victor and William, it seems to be a very good lead.
> >
> > Unfortunately, this tunable does not seem to be supported in the VxVM
> > version installed on my system:
> >
> > > vxdmpadm gettune dmp_fast_recovery
> > VxVM vxdmpadm ERROR V-5-1-12015 Incorrect tunable
> > vxdmpadm gettune [tunable name]
> > Note - Tunable name can be dmp_failed_io_threshold, dmp_retry_count,
> > dmp_pathswitch_blks_shift, dmp_queue_depth, dmp_cache_open,
> > dmp_daemon_count, dmp_scsi_timeout, dmp_delayq_interval, dmp_path_age,
> > or dmp_stat_interval
> >
> > This is odd, because my version is 5.0 MP3 Solaris SPARC, and according
> > to http://seer.entsupport.symantec.com/docs/316981.htm this tunable
> > should be available.
> >
> > > modinfo | grep -i vx
> >  38 7846a000  3800e 288   1  vxdmp (VxVM 5.0-2006-05-11a: DMP Drive)
> >  40 784a4000 334c40 289   1  vxio (VxVM 5.0-2006-05-11a I/O driver)
> >  42 783ec71d    df8 290   1  vxspec (VxVM 5.0-2006-05-11a control/st)
> > 296 78cfb0a2    c6b 291   1  vxportal (VxFS 5.0_REV-5.0A55_sol portal )
> > 297 78d6c000 1b9d4f   8   1  vxfs (VxFS 5.0_REV-5.0A55_sol SunOS 5)
> > 298 78f18000   a270 292   1  fdd (VxQIO 5.0_REV-5.0A55_sol Quick )
> >
> > On 16/09/2010 12:15, Victor Engle wrote:
> >> Which version of Veritas? Version 4/2MP2 and version 5.x introduced a
> >> feature called DMP fast recovery. It was probably supposed to be
> >> called DMP fast fail but "recovery" sounds better. It is supposed to
> >> fail suspect paths more aggressively to speed up failover. But when
> >> you only have one VxVM DMP path, as is the case with MPxIO, and
> >> fast-recovery fails that path, then you're in trouble. In version 5.x,
> >> it is possible to disable this feature.
> >>
> >> Google DMP fast recovery.
> >>
> >> http://seer.entsupport.symantec.com/docs/307959.htm
> >>
> >> I can imagine there must have been some internal fights at Symantec
> >> between product management and QA to get that feature released.
> >>
> >> Vic
> >>
> >> On Thu, Sep 16, 2010 at 6:03 AM, Sebastien DAUBIGNE
> >> <sebastien.daubi...@atosorigin.com> wrote:
> >>> Dear Vx-addicts,
> >>>
> >>> We encountered a failover issue on this configuration:
> >>>
> >>> - Solaris 9 HW 9/05
> >>> - SUN SAN (SFS) 4.4.15
> >>> - Emulex with SUN generic driver (emlx)
> >>> - VxVM 5.0-2006-05-11a
> >>>
> >>> - storage on HP SAN (XP 24K).
> >>>
> >>> Multipathing is managed by MPxIO (not VxDMP) because the SAN team and HP
> >>> support imposed the Solaris native solution for multipathing:
> >>>
> >>> VxVM ==> VxDMP ==> MPxIO ==> FCP ...
> >>>
> >>> We have 2 paths to the switch, linked to 2 paths to the storage, so the
> >>> LUNs have 4 paths, with active/active support.
> >>> Failover operation has been tested successfully by offlining each port
> >>> successively on the SAN.
> >>>
> >>> We regularly have transient I/O errors (SCSI timeouts, I/O error retries
> >>> with "Unit attention") due to SAN-side issues. Usually these errors are
> >>> transparently managed by MPxIO/VxVM without impact on the applications.
> >>>
> >>> Now for the incident we encountered:
> >>>
> >>> One of the SAN ports was reset, and consequently there were some transient
> >>> I/O errors.
> >>> The other SAN port was OK, so the MPxIO multipathing layer should have
> >>> failed the I/O over to the other path without transmitting the error to
> >>> the VxDMP layer.
> >>> For some reason, it did not fail over the I/O before VxVM caught it as
> >>> an unrecoverable I/O error, disabling the subdisk and consequently the
> >>> filesystem.
> >>>
> >>> Note the "giving up" message from the scsi layer at 06:23:03:
> >>>
> >>> Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60
> >>> Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x60
> >>> Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode 288/0x20
> >>> Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode 288/0x18
> >>> Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x20
> >>> Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x18
> >>> Sep 1 06:18:54 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e80152777000001277700003794 (ssd165):
> >>> Sep 1 06:18:54 myserver SCSI transport failed: reason 'tran_err': retrying command
> >>> Sep 1 06:19:05 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e80152777000001277700003794 (ssd165):
> >>> Sep 1 06:19:05 myserver SCSI transport failed: reason 'timeout': retrying command
> >>> Sep 1 06:21:57 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e8015277700000127770000376d (ssd168):
> >>> Sep 1 06:21:57 myserver SCSI transport failed: reason 'tran_err': retrying command
> >>> Sep 1 06:22:45 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e8015277700000127770000376d (ssd168):
> >>> Sep 1 06:22:45 myserver SCSI transport failed: reason 'timeout': retrying command
> >>> Sep 1 06:23:03 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e80152777000001277700003787 (ssd166):
> >>> Sep 1 06:23:03 myserver SCSI transport failed: reason 'timeout': giving up
> >>> Sep 1 06:23:03 myserver vxio: [ID 539309 kern.warning] WARNING: VxVM vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer 300ce41c340 on device 0x1200000003a to DMP
> >>> Sep 1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
> >>> Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt 1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean - /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 0/5935
> >>> Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt 2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system disabled
> >>> Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt 3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone - /dev/vx/dsk/mydg/vol1 file system
> >>> meta data write error in dev/block 0/265984
> >>>
> >>> It seems VxDMP gets the I/O error at the same time as MPxIO: I thought
> >>> MPxIO would have concealed the I/O error until failover had occurred,
> >>> which is not the case.
> >>>
> >>> As a workaround, I increased the VxDMP
> >>> recoveryoption/fixedretry/retrycount tunable from 5 to 20 to give MPxIO a
> >>> chance to fail over before VxDMP fails, but I still don't understand why
> >>> VxVM catches the SCSI errors.
> >>>
> >>> Any advice?
> >>>
> >>> Thanks.
> >>>
> >>> --
> >>> Sebastien DAUBIGNE
> >>> sebastien.daubi...@atosorigin.com - +33(0)5.57.89.31.09
> >>> AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
> >>>
> >>> _______________________________________________
> >>> Veritas-vx maillist - Veritas-vx@mailman.eng.auburn.edu
> >>> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
>
> --
> Sebastien DAUBIGNE
> sebastien.daubi...@atosorigin.com - +33(0)5.57.89.31.09
> AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
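
A note on the timeline in the quoted syslog excerpt: the order of events can be checked by measuring the gap between the first VxDMP "disabled path" notice and the scsi layer's "giving up" message. The Python sketch below does this on a few of the log lines posted in the thread (wrapped lines rejoined). The helper names `timestamp` and `first_match` are made up for illustration and are not part of any Veritas or Solaris tooling. Syslog stamps carry no year and only one-second resolution, so this shows ordering, not causality.

```python
from datetime import datetime

# A few of the syslog lines quoted in the thread, with mail wrapping undone.
LOG = """\
Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x60
Sep 1 06:18:54 myserver SCSI transport failed: reason 'tran_err': retrying command
Sep 1 06:19:05 myserver SCSI transport failed: reason 'timeout': retrying command
Sep 1 06:23:03 myserver SCSI transport failed: reason 'timeout': giving up
Sep 1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
"""


def timestamp(line):
    """Parse the leading 'Mon D HH:MM:SS' stamp; no year is logged, so 1900 is implied."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{month} {day} {clock}", "%b %d %H:%M:%S")


def first_match(lines, needle):
    """Return the timestamp of the first line containing `needle`, or None."""
    for line in lines:
        if needle in line:
            return timestamp(line)
    return None


lines = LOG.splitlines()
dmp_disabled = first_match(lines, "disabled path")
scsi_gave_up = first_match(lines, "giving up")
gap = scsi_gave_up - dmp_disabled

print(f"VxDMP disabled its path {gap.total_seconds():.0f} s before the scsi layer gave up")
```

On this excerpt the VxDMP notice precedes the "giving up" message by a little over four minutes, which fits the description earlier in the thread of VxDMP failing the path before MPxIO had finished its own retries.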