VxVM failover issue

Victor Engle Thu, 16 Sep 2010 03:16:58 -0700

Which version of veritas? Version 4/2MP2 and version 5.x introduced a
feature called DMP fast recovery. It was probably supposed to be
called DMP fast fail but "recovery" sounds better. It is supposed to
fail suspect paths more aggressively to speed up failover. But when
you only have one vxvm DMP path, as is the case with MPxIO, and
fast-recovery fails that path, then you're in trouble. In version 5.x,
it is possible to disable this feature.


Google DMP fast recovery.

http://seer.entsupport.symantec.com/docs/307959.htm

I can imagine there must have been some internal fights at symantec
between product management and QA to get that feature released.

Vic





On Thu, Sep 16, 2010 at 6:03 AM, Sebastien DAUBIGNE
<sebastien.daubi...@atosorigin.com> wrote:
>  Dear Vx-addicts,
>
> We encountered a failover issue on this configuration :
>
> - Solaris 9 HW 9/05
> - SUN SAN (SFS) 4.4.15
> - Emulex with SUN generic driver (emlx)
> - VxVM 5.0-2006-05-11a
>
> - storage on HP SAN (XP 24K).
>
>
> Multipathing is managed by MPxIO (not VxDMP) because the SAN team and HP
> support imposed the Solaris native solution for multipathing :
>
> VxVM ==> VxDMP ==> MPxIO ==> FCP ...
>
> We have 2 paths to the switch, linked to 2 paths to the storage, so the
> LUNs have 4 paths, with active/active support.
> Failover operation has been tested successfully by offlining each port
> successively on the SAN.
>
> We regulary have transient I/O errors (scsi timeout, I/O error retries
> with "Unit attention"), due to SAN-side issues. Usually these errors are
> transparently managed by MPxIO/VxVM without impact on the applications.
>
> Now for the incident we encountered :
>
> One of the SAN port was reset , consequently there were some transient
> I/O error.
> The other SAN port was OK, so the MPxIO multipathing layer should have
> failover the I/O on the other path, without transmiting the error to the
> VxDMP layer.
> For some reason, it did not failover the I/O before VxVM caught it as
> unrecoverable I/O error, disabling the subdisk and consequently the
> filesystem.
>
> Note the "giving up" message from scsi layer at 06:23:03 :
>
> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
> vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60
> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
> vxdmp V-5-0-111 disabled dmpnode 288/0x60
> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
> vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode 288/0x20
> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
> vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode 288/0x18
> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
> vxdmp V-5-0-111 disabled dmpnode 288/0x20
> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
> vxdmp V-5-0-111 disabled dmpnode 288/0x18
> Sep  1 06:18:54 myserver scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/s...@g60060e80152777000001277700003794 (ssd165):
> Sep  1 06:18:54 myserver        SCSI transport failed: reason
> 'tran_err': retrying command
> Sep  1 06:19:05 myserver scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/s...@g60060e80152777000001277700003794 (ssd165):
> Sep  1 06:19:05 myserver        SCSI transport failed: reason 'timeout':
> retrying command
> Sep  1 06:21:57 myserver scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/s...@g60060e8015277700000127770000376d (ssd168):
> Sep  1 06:21:57 myserver        SCSI transport failed: reason
> 'tran_err': retrying command
> Sep  1 06:22:45 myserver scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/s...@g60060e8015277700000127770000376d (ssd168):
> Sep  1 06:22:45 myserver        SCSI transport failed: reason 'timeout':
> retrying command
> Sep  1 06:23:03 myserver scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/s...@g60060e80152777000001277700003787 (ssd166):
> Sep  1 06:23:03 myserver        SCSI transport failed: reason 'timeout':
> giving up
> Sep  1 06:23:03 myserver vxio: [ID 539309 kern.warning] WARNING: VxVM
> vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer
> 300ce41c340 on device 0x1200000003a to DMP
> Sep  1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM
> vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
> 1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean -
> /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 0/5935
> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
> 2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system disabled
> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
> 3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone -
> /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block
> 0/265984
>
>
> It seems VxDMP gets the I/O error at the same time as MPxIO  : I though
> MPxIO would have conceal the I/O error until failover has occured, which
> is not the case.
>
> As a workaround, I increased the VxDMP
> recoveryotion/fixedretry/retrycount tunable from 5 to 20 to give MPxIO a
> chance to failover before VxDMP fails, but I still don't understand why
> VxVM catch the scsi errors.
>
> Any advice ?
>
> thanks.
>
>
>
>
>
>
> --
> Sebastien DAUBIGNE
> sebastien.daubi...@atosorigin.com  - +33(0)5.57.89.31.09
> AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
>
> _______________________________________________
> Veritas-vx maillist  -  veritas...@mailman.eng.auburn.edu
> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
>
_______________________________________________
Veritas-vx maillist  -  Veritas-vx@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx

Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue

Reply via email to