Hi Ethan,
Responses inline below.

Ethan Erchinger wrote:
> Hello,
> We have a backup strategy that involves mapping LUNs between a given 
> pair of hosts, and copying data from one LUN (src) to another 
> LUN (dest).  The src LUNs sit on a SAN device, sometimes multiple devices 
> (zpool mirror).  The src LUN is running a MySQL database and typically 
> will be running for weeks without issue.

I'm sorry, I don't quite understand how this can be a serious
"backup strategy" - how on earth did you get to thinking that
it was going to work reliably?


> When we start the backup sequence, we map a previously unmapped LUN to 
> the DB host and issue the following commands:
> 
> root# cfgadm -al
> (sleep 10)
> root# luxadm probe
> (sleep 10)
> root# zpool import <pool_name>

You're kidding, right? Have you RTFMd the cfgadm_fp(1M) manpage?
Ever thought about running something similar to


# cfgadm -c configure c$X::$target-pwwn
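Rather than a blanket rescan plus arbitrary sleeps, you can configure
just the new path and verify it before importing. A rough sketch -- the
controller name, target port WWN and pool name below are placeholders,
not values from your setup:

```shell
#!/bin/sh
# Sketch only: attach a newly-mapped LUN and import its pool, without
# blanket rescans or sleeps. CTRL and TARGET_PWWN are hypothetical --
# substitute the values from your own fabric.
CTRL=c2
TARGET_PWWN=210000aabbccddee   # hypothetical target port WWN

# Configure only the new path, rather than rescanning everything:
cfgadm -c configure ${CTRL}::${TARGET_PWWN}

# Confirm the attachment point is visible before touching ZFS:
cfgadm -al ${CTRL}

# Only once the device has shown up, import the pool:
zpool import pool_name
```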


> After importing we'll perform some minor IO on the dest LUN, such as 
> adding a symlink, removing some old configuration files.  Then we'll 
> start an ibbackup of that database from the src LUN to the dest LUN, and 
> things go bad.

Frankly, I'm surprised it takes this long for you to get to the
"things go bad" stage.


> It's not completely consistent, but sometimes the DB host will crash, 
> sometimes we'll get chksum/read/write errors on the src LUN.  Looking at 
> dmesg (when the host doesn't crash), we see the LUNs' paths all disappear 
> and then reappear usually around 20 seconds later.  Example output 
> below.  Each LUN has 2 paths out of the DB host and 4 paths on each 
> storage device, across two separate SANs.

You're yanking drives in and out of view of your host, you're doing
so while zpool importing (and exporting?), and yet you still
expect your database to be reliable.
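The teardown side matters just as much as the attach side. A sketch of
the sequence I'd expect before the LUN ever gets unmapped on the array
(pool name and attachment point are placeholders):

```shell
# Sketch: quiesce and detach cleanly before the array unmaps the LUN,
# so the host never sees a device vanish out from under a live pool.
zpool export pool_name                       # flush and close the pool
cfgadm -c unconfigure c2::210000aabbccddee   # hypothetical path
# Only after this should the LUN be unmapped on the storage side.
```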

> Usually the host will crash when not running with a zpool mirror, which 
> apparently is expected behavior in Sol10u4.

Sorry, but no. What you're doing is creating inconsistencies in the
host's view of its storage. Don't blame Solaris for this; it's actually
trying to keep your data consistent.

> These hosts are x86_64 servers, running Sol10u4, unpatched.  They use 
> qlogic qla2342 HBAs, and the stock qlc driver.  They are using MPXIO, 
> from what I can tell.

Yes, they're using MPxIO. You can tell that from pathnames
under /scsi_vhci/ - that's a dead giveaway.
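If you want to confirm it for yourself, something along these lines
should show it (assuming the SAN Foundation / STMS bits are installed):

```shell
# Sketch: two ways to confirm MPxIO is managing your LUNs on S10.
stmsboot -L        # lists the non-STMS to STMS device name mappings
mpathadm list lu   # shows each logical unit and its paths
```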

So ... _unpatched_ you say? _Why_ ? I know organisations generally
have rigorous patching methodologies and schedules, but fer cryin'
out loud, S10 Update 4 has been available since the middle of 2007.
That's very nearly 12 months old.


> If anyone has any tips on troubleshooting, or knows of things we are 
> doing wrong, help would be appreciated.

Two major recommendations. Firstly, PATCH YOUR SYSTEM.
Secondly, design a backup methodology which doesn't rely
on playing the fool with your storage.
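On the patching front, a minimal sketch with the stock S10 tooling
(assuming the host is registered for updates):

```shell
# Sketch: checking for and applying recommended patches on Solaris 10.
smpatch analyze   # list the patches applicable to this host
smpatch update    # download and apply them
```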

Assuming that you're posting from your work email address,
_surely_ you could convince your management to implement
a backup strategy based around an enterprise-class backup
package such as NetBackup or Networker.

You should also seriously consider getting a professional
services organisation (such as Sun's) to come in and help
you get your systems set up properly.


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog
_______________________________________________
storage-discuss mailing list
storage-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
