Hi James,
also inline.

James C. McPherson wrote:

Hi Ethan,
responses inline below

Ethan Erchinger wrote:
Hello,
We have a backup strategy that involves mapping LUNs between a given pair of hosts and copying data from one LUN (src) to another LUN (dest). The src LUNs sit on a SAN device, sometimes multiple devices (zpool mirror). The src LUN is running a MySQL database and will typically run for weeks without issue.

I'm sorry, I don't quite understand how this can be a serious
"backup strategy" - how on earth did you get to thinking that
it was going to work reliably?
Well, this is not the final destination of the backup. This is a method of taking periodic snapshots between the primary and secondary hosts of a replication pair. We take the completed ibbackup and stream it to tape afterwards, then do the equivalent of a restore on the dest LUN. It actually is pretty reliable. We do this weekly, mainly because we don't trust MySQL replication; it's somewhat error-prone. While we may have issues with our implementation, I don't believe the strategy is faulty at its core.


When we start the backup sequence, we map a previously unmapped LUN to the DB host and issue the following commands:

root# cfgadm -al
(sleep 10)
root# luxadm probe
(sleep 10)
root# zpool import <pool_name>

You're kidding, right? Have you RTFMd the cfgadm_fp(1M) manpage?
Ever thought about running something similar to


# cfgadm -c configure c$X::$target-pwwn
Well, no, not kidding. I believe that, yes, this may be one of the main issues with our system. We have read quite a bit of documentation, and normally this works pretty darn well. Doing a configure, and I can only assume an unconfigure prior to remapping the LUN, is the proper procedure? We believed that doing a configure was having little to no effect, because in the cfgadm -al output the condition (on a configured LUN) is "unknown". According to the manpage, that can very well mean that the configure command will have zero effect.
"""
        configure       Configure  a  connected  Fibre   Channel
                        Fabric  device  to  a host. When a Fibre
                        Channel device is listed as  an  unknown
                        type in the output of the list operation
                        the device might not be configurable. No
                        attempt  is  made  to  configure devices
                        with unknown types.
"""


After importing we'll perform some minor IO on the dest LUN, such as adding a symlink and removing some old configuration files. Then we'll start an ibbackup of that database from the src LUN to the dest LUN, and things go bad.

Frankly, I'm surprised it takes this long for you to get to the
"things go bad" stage.
I think it's possible that the improper "configure" stage from above is causing things to go bad.


It's not completely consistent, but sometimes the DB host will crash, sometimes we'll get chksum/read/write errors on the src LUN. Looking at dmesg (when the host doesn't crash), we see the LUNs paths all disappear and then reappear usually around 20 seconds later. Example output below. Each LUN has 2 paths out of the DB host and 4 paths on each storage device, across two separate SANs.

You're yanking drives in- and out-of-view of your host, you're
doing so with zpool importing (and exporting?) and yet you still
want your database to be reliable.
We are not attempting to yank the src LUN in and out. Yes, removing the dest LUN from a host's view may inexplicably be causing other MPxIO inconsistencies, though. We typically import/export zpools between hosts because ibbackup from Innobase (the recommended hot-backup solution for InnoDB) cannot write over the network and must copy to a local directory. Mapping LUNs between hosts is one method; NFS is another, and so on. The typical 'Enterprise' backup solution didn't support hot backups of InnoDB until more recently.
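For context, the local-write constraint means each run looks roughly like the following (pool name and config file paths are illustrative, not our actual ones):

root# zpool import backup_pool
root# ibbackup /etc/mysql/my.cnf /etc/mysql/backup-my.cnf
root# zpool export backup_pool

where backup-my.cnf points the backup datadir at the imported pool's mountpoint, and the export hands the pool off for mapping to the other host.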

Usually the host will crash when not running with a zpool mirror, which, apparently, is expected behavior in Sol10u4.

Sorry, but no. What you're doing is creating inconsistencies in the
host's view of it storage. Don't blame Solaris for this, it's actually
trying to keep your data consistent.
As mentioned, we _never_ remove (via WWN host masking) LUNs that are active from a ZFS perspective, so consistency should not be compromised.

These hosts are x86_64 servers, running Sol10u4, unpatched. They use QLogic qla2342 HBAs and the stock qlc driver. They are using MPxIO, from what I can tell.

Yes, they're using MPxIO. You can tell that from the pathnames
such as /scsi_vhci/[EMAIL PROTECTED] - that's a dead giveaway.
Good, that's what I thought. I wasn't positive only because runs of mpathadm --cannot-remember didn't list our storage device as a supported MPxIO device; at least that's what I took the output to mean.
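From memory, what I ran was along these lines (the device path here is a made-up example, not our actual LUN):

root# mpathadm list lu
root# mpathadm show lu /dev/rdsk/c4t60003BA00000000000000000000000A1d0s2

and nothing in the output obviously named our array as a supported device. If the LUNs show up under mpathadm list lu at all, is that sufficient to confirm MPxIO is managing them?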

So ... _unpatched_ you say? _Why_ ? I know organisations generally
have rigorous patching methodologies and schedules, but fer cryin'
out loud, S10 Update 4 has been available since the middle of 2007.
That's very nearly 12 months old.
As mentioned by Bob in another email, we had not been running into issues with u4 that we knew of. We have begun upgrading to U5 and applying the latest patches; in fact, all of our secondary hosts have this completed, we just haven't had a downtime window available to get the primary systems upgraded. We are aware that we should be upgrading, and we are working towards that goal. We have also seen enough instability with software releases (in software in general) that we like some bake time on new releases, so we've been waiting a little bit on the U5 release. That turns out to have been a good idea, given the memory leak in the qlc driver, fixed by patch 125165-10.


If anyone has any tips on troubleshooting, or knows of things we are doing wrong, help would be appreciated.

Two major recommendations. Firstly, PATCH YOUR SYSTEM.
Secondly, design a backup methodology which doesn't rely
on playing the fool with your storage.

Assuming that you're posting from your work email address,
_surely_ you could convince your management to implement
a backup strategy based around an enterprise-class backup
package such as NetBackup or Networker.

You should also seriously consider getting a professional
services organisation (such as Sun's) to come in and help
you get your systems setup properly.

Thanks for your recommendations.
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
