VxVM failover issue

Venkata Sreenivasa Rao Nagineni Wed, 06 Oct 2010 10:08:43 -0700

Hi Sebastien,

In the first mail you mentioned that you are using MPxIO to control the XP24K 
array. Why are you using MPxIO here?


Thanks,
Venkata Sreenivasarao Nagineni,
Symantec

> -----Original Message-----
> From: veritas-vx-boun...@mailman.eng.auburn.edu [mailto:veritas-vx-
> boun...@mailman.eng.auburn.edu] On Behalf Of Sebastien DAUBIGNE
> Sent: Wednesday, October 06, 2010 9:32 AM
> To: undisclosed-recipients
> Cc: Veritas-vx@mailman.eng.auburn.edu
> Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
> 
>   Hi,
> 
> I am coming back to my dmp_fast_recovery issue (VxDMP fails the path
> before MPxIO gets a chance to fail over to the alternate path).
> As stated previously, I am running 5.0GA, and this tunable is not
> supported in this release. However, I still don't know whether VxVM 5.0GA
> silently bypasses the MPxIO stack for error recovery.
> 
> I am now trying to determine whether upgrading to MP3 will resolve this
> issue (which has occurred only rarely).
> 
> Could anyone (maybe Joshua?) explain whether the behaviour of 5.0GA,
> which lacks the tunable, is functionally identical to
> dmp_fast_recovery=0 or dmp_fast_recovery=1? Maybe the mechanism was
> implemented in 5.0 without the option to disable it (this could explain
> my issue)?
> 
> Joshua, you mentioned another tunable for 5.0, but looking at the list
> I can't identify the corresponding tunable:
> 
>  > vxdmpadm gettune all
>              Tunable               Current Value  Default Value
> ------------------------------    -------------  -------------
> dmp_failed_io_threshold               57600            57600
> dmp_retry_count                           5                5
> dmp_pathswitch_blks_shift                11               11
> dmp_queue_depth                          32               32
> dmp_cache_open                           on               on
> dmp_daemon_count                         10               10
> dmp_scsi_timeout                         30               30
> dmp_delayq_interval                      15               15
> dmp_path_age                              0              300
> dmp_stat_interval                         1                1
> dmp_health_time                           0               60
> dmp_probe_idle_lun                       on               on
> dmp_log_level                             4                1
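
As a rough aid for reading the listing above, the following Python sketch parses `vxdmpadm gettune all` style output and reports which tunables differ from their defaults, and whether dmp_fast_recovery appears at all. The function names (`parse_gettune`, `changed_from_default`) are hypothetical helpers made up for illustration, and the sample text is simply the output quoted above; it assumes every data row has exactly three whitespace-separated fields.

```python
# Sample text copied from the `vxdmpadm gettune all` output quoted in this thread.
GETTUNE_OUTPUT = """\
            Tunable               Current Value  Default Value
------------------------------    -------------  -------------
dmp_failed_io_threshold               57600            57600
dmp_retry_count                           5                5
dmp_pathswitch_blks_shift                11               11
dmp_queue_depth                          32               32
dmp_cache_open                           on               on
dmp_daemon_count                         10               10
dmp_scsi_timeout                         30               30
dmp_delayq_interval                      15               15
dmp_path_age                              0              300
dmp_stat_interval                         1                1
dmp_health_time                           0               60
dmp_probe_idle_lun                       on               on
dmp_log_level                             4                1
"""


def parse_gettune(text):
    """Return {tunable: (current, default)} (hypothetical helper)."""
    tunables = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumption: data rows have three fields and start with "dmp_";
        # the header and the dashed separator line are skipped by this test.
        if len(fields) == 3 and fields[0].startswith("dmp_"):
            name, current, default = fields
            tunables[name] = (current, default)
    return tunables


def changed_from_default(tunables):
    """Keep only tunables whose current value differs from the default."""
    return {n: v for n, v in tunables.items() if v[0] != v[1]}


if __name__ == "__main__":
    tunables = parse_gettune(GETTUNE_OUTPUT)
    for name, (current, default) in sorted(changed_from_default(tunables).items()):
        print(f"{name}: current={current} default={default}")
    print("dmp_fast_recovery listed:", "dmp_fast_recovery" in tunables)
```

On the listing above this would flag dmp_path_age, dmp_health_time and dmp_log_level as changed, and show that no dmp_fast_recovery row is present.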
> 
> Cheers.
> 
> 
> 
> On 16/09/2010 16:50, Joshua Fielden wrote:
> > dmp_fast_recovery is a mechanism by which we bypass the sd/scsi stack
> > and send path inquiry/status CDBs directly from the HBA in order to
> > bypass long SCSI queues and recover paths faster. With a TPD
> > (third-party driver) such as MPxIO, bypassing the stack means we
> > bypass the TPD completely, and interactions such as this can happen.
> > The vxesd (event-source daemon) is another 5.0/MP2 backport addition
> > that's moot in the presence of a TPD.
> >
> > From your modinfo, you're not actually running MP3. This technote
> > (http://seer.entsupport.symantec.com/docs/327057.htm) isn't exactly
> > your scenario, but looking for partially-installed pkgs is a good
> > start to getting your server correctly installed, then the tuneable
> > should work -- very early 5.0 versions had a differently-named
> > tuneable I can't find in my mail archive ATM.
> >
> > Cheers,
> >
> > Jf
> >
> > -----Original Message-----
> > From: veritas-vx-boun...@mailman.eng.auburn.edu [mailto:veritas-vx-
> boun...@mailman.eng.auburn.edu] On Behalf Of Sebastien DAUBIGNE
> > Sent: Thursday, September 16, 2010 7:41 AM
> > To: Veritas-vx@mailman.eng.auburn.edu
> > Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
> >
> >    Thank you Victor and William, it seems to be a very good lead.
> >
> > Unfortunately, this tunable seems not to be supported in the VxVM
> > version installed on my system :
> >
> >   >  vxdmpadm gettune dmp_fast_recovery
> > VxVM vxdmpadm ERROR V-5-1-12015  Incorrect tunable
> > vxdmpadm gettune [tunable name]
> > Note - Tunable name can be dmp_failed_io_threshold, dmp_retry_count,
> > dmp_pathswitch_blks_shift, dmp_queue_depth, dmp_cache_open,
> > dmp_daemon_count, dmp_scsi_timeout, dmp_delayq_interval, dmp_path_age,
> > or dmp_stat_interval
> >
> > This is odd, because my version is 5.0 MP3 on Solaris SPARC, and
> > according to http://seer.entsupport.symantec.com/docs/316981.htm this
> > tunable should be available.
> >
> >   >  modinfo | grep -i vx
> >    38 7846a000  3800e 288   1  vxdmp (VxVM 5.0-2006-05-11a: DMP Drive)
> >    40 784a4000 334c40 289   1  vxio (VxVM 5.0-2006-05-11a I/O driver)
> >    42 783ec71d    df8 290   1  vxspec (VxVM 5.0-2006-05-11a control/st)
> > 296 78cfb0a2    c6b 291   1  vxportal (VxFS 5.0_REV-5.0A55_sol portal )
> > 297 78d6c000 1b9d4f   8   1  vxfs (VxFS 5.0_REV-5.0A55_sol SunOS 5)
> > 298 78f18000   a270 292   1  fdd (VxQIO 5.0_REV-5.0A55_sol Quick )
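
To compare the loaded driver builds with what the packages claim, one option is to pull the product and build strings out of the modinfo listing. The Python sketch below does that for the rows quoted above (with the wrapped lines re-joined); `module_versions` is a hypothetical helper, and the regular expression only assumes the column layout visible in this listing.

```python
import re

# modinfo rows copied from this thread, with wrapped lines re-joined.
MODINFO_OUTPUT = """\
 38 7846a000  3800e 288   1  vxdmp (VxVM 5.0-2006-05-11a: DMP Drive)
 40 784a4000 334c40 289   1  vxio (VxVM 5.0-2006-05-11a I/O driver)
 42 783ec71d    df8 290   1  vxspec (VxVM 5.0-2006-05-11a control/st)
296 78cfb0a2    c6b 291   1  vxportal (VxFS 5.0_REV-5.0A55_sol portal )
297 78d6c000 1b9d4f   8   1  vxfs (VxFS 5.0_REV-5.0A55_sol SunOS 5)
298 78f18000   a270 292   1  fdd (VxQIO 5.0_REV-5.0A55_sol Quick )
"""

# Assumed layout: id, load address, size, info, rev, module name, then a
# parenthesised description that starts with "<product> <build>".
ROW_RE = re.compile(
    r"^\s*\d+\s+\S+\s+\S+\s+\S+\s+\d+\s+(\S+)\s+\((\S+)\s+(\S+)"
)


def module_versions(text):
    """Return {module: (product, build)} for each modinfo row (hypothetical helper)."""
    versions = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            module, product, build = match.groups()
            # The vxdmp row has a trailing colon after the build string.
            versions[module] = (product, build.rstrip(":"))
    return versions


if __name__ == "__main__":
    for module, (product, build) in module_versions(MODINFO_OUTPUT).items():
        print(f"{module:10s} {product:6s} {build}")
```

For the listing above, all three VxVM modules report the build string 5.0-2006-05-11a.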
> >
> >
> >
> >
> >
> > On 16/09/2010 12:15, Victor Engle wrote:
> >> Which version of veritas? Version 4/2MP2 and version 5.x introduced
> >> a feature called DMP fast recovery. It was probably supposed to be
> >> called DMP fast fail but "recovery" sounds better. It is supposed to
> >> fail suspect paths more aggressively to speed up failover. But when
> >> you only have one vxvm DMP path, as is the case with MPxIO, and
> >> fast-recovery fails that path, then you're in trouble. In version
> >> 5.x, it is possible to disable this feature.
> >>
> >> Google DMP fast recovery.
> >>
> >> http://seer.entsupport.symantec.com/docs/307959.htm
> >>
> >> I can imagine there must have been some internal fights at symantec
> >> between product management and QA to get that feature released.
> >>
> >> Vic
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Sep 16, 2010 at 6:03 AM, Sebastien DAUBIGNE
> >> <sebastien.daubi...@atosorigin.com>   wrote:
> >>>    Dear Vx-addicts,
> >>>
> >>> We encountered a failover issue on this configuration :
> >>>
> >>> - Solaris 9 HW 9/05
> >>> - SUN SAN (SFS) 4.4.15
> >>> - Emulex with SUN generic driver (emlx)
> >>> - VxVM 5.0-2006-05-11a
> >>>
> >>> - storage on HP SAN (XP 24K).
> >>>
> >>>
> >>> Multipathing is managed by MPxIO (not VxDMP) because the SAN team
> >>> and HP support imposed the Solaris native solution for multipathing:
> >>>
> >>> VxVM ==>   VxDMP ==>   MPxIO ==>   FCP ...
> >>>
> >>> We have 2 paths to the switch, linked to 2 paths to the storage, so
> >>> the LUNs have 4 paths, with active/active support.
> >>> Failover has been tested successfully by offlining each port in turn
> >>> on the SAN.
> >>>
> >>> We regularly have transient I/O errors (SCSI timeouts, I/O error
> >>> retries with "Unit attention") due to SAN-side issues. Usually these
> >>> errors are handled transparently by MPxIO/VxVM without impact on the
> >>> applications.
> >>>
> >>> Now for the incident we encountered :
> >>>
> >>> One of the SAN ports was reset, and consequently there were some
> >>> transient I/O errors.
> >>> The other SAN port was OK, so the MPxIO multipathing layer should
> >>> have failed the I/O over to the other path without passing the error
> >>> up to the VxDMP layer.
> >>> For some reason, it did not fail the I/O over before VxVM treated it
> >>> as an unrecoverable I/O error, disabling the subdisk and consequently
> >>> the filesystem.
> >>>
> >>> Note the "giving up" message from scsi layer at 06:23:03 :
> >>>
> >>> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60
> >>> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x60
> >>> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode 288/0x20
> >>> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode 288/0x18
> >>> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x20
> >>> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x18
> >>> Sep  1 06:18:54 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e80152777000001277700003794 (ssd165):
> >>> Sep  1 06:18:54 myserver        SCSI transport failed: reason 'tran_err': retrying command
> >>> Sep  1 06:19:05 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e80152777000001277700003794 (ssd165):
> >>> Sep  1 06:19:05 myserver        SCSI transport failed: reason 'timeout': retrying command
> >>> Sep  1 06:21:57 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e8015277700000127770000376d (ssd168):
> >>> Sep  1 06:21:57 myserver        SCSI transport failed: reason 'tran_err': retrying command
> >>> Sep  1 06:22:45 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e8015277700000127770000376d (ssd168):
> >>> Sep  1 06:22:45 myserver        SCSI transport failed: reason 'timeout': retrying command
> >>> Sep  1 06:23:03 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/s...@g60060e80152777000001277700003787 (ssd166):
> >>> Sep  1 06:23:03 myserver        SCSI transport failed: reason 'timeout': giving up
> >>> Sep  1 06:23:03 myserver vxio: [ID 539309 kern.warning] WARNING: VxVM vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer 300ce41c340 on device 0x1200000003a to DMP
> >>> Sep  1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
> >>> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt 1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean - /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 0/5935
> >>> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt 2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system disabled
> >>> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt 3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone - /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 0/265984
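
To make the ordering of events in the excerpt above easier to see, here is a small Python sketch that measures the time between the first vxdmp "disabled path" notice and the SCSI layer's "giving up" message. It uses four lines from the log above (re-joined where the mail client wrapped them), assumes all entries fall on the same day, and `first_event_seconds` is a hypothetical helper name, not part of any Veritas or Solaris tool.

```python
import re

# Four entries taken from the syslog excerpt in this thread, unwrapped.
LOG_EXCERPT = """\
Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60
Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x60
Sep  1 06:23:03 myserver        SCSI transport failed: reason 'timeout': giving up
Sep  1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
"""

# month, day, HH:MM:SS, hostname, then the message text
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})\s+\S+\s+(.*)$")


def first_event_seconds(text, needle):
    """Seconds since midnight of the first line whose message contains needle.

    Hypothetical helper; returns None when nothing matches. Assumes all
    entries are from the same day, as they are in the excerpt.
    """
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and needle in match.group(4):
            hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
            return hours * 3600 + minutes * 60 + seconds
    return None


if __name__ == "__main__":
    dmp_disabled = first_event_seconds(LOG_EXCERPT, "disabled path")
    scsi_gave_up = first_event_seconds(LOG_EXCERPT, "giving up")
    print("seconds between DMP path disable and SCSI giving up:",
          scsi_gave_up - dmp_disabled)
```

With these lines the gap is 249 seconds: vxdmp had already disabled the path at 06:18:54, while the SCSI layer only reported "giving up" at 06:23:03, the same second as the uncorrectable write error.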
> >>>
> >>>
> >>> It seems VxDMP gets the I/O error at the same time as MPxIO: I
> >>> thought MPxIO would have concealed the I/O error until failover had
> >>> occurred, which is not the case.
> >>>
> >>> As a workaround, I increased the VxDMP
> >>> recoveryoption/fixedretry/retrycount tunable from 5 to 20 to give
> >>> MPxIO a chance to fail over before VxDMP fails, but I still don't
> >>> understand why VxVM catches the SCSI errors.
> >>>
> >>> Any advice ?
> >>>
> >>> thanks.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Sebastien DAUBIGNE
> >>> sebastien.daubi...@atosorigin.com  - +33(0)5.57.89.31.09
> >>> AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
> >>>
> >>> _______________________________________________
> >>> Veritas-vx maillist  -  Veritas-vx@mailman.eng.auburn.edu
> >>> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
> >>>
> >
> 
> 
> --
> Sebastien DAUBIGNE
> sebastien.daubi...@atosorigin.com - +33(0)5.57.89.31.09
> AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
> 