Hi James,
Also inline.
James C. McPherson wrote:
Hi Ethan,
responses inline below
Ethan Erchinger wrote:
Hello,
We have a backup strategy that involves mapping LUNs between a given
pair of hosts and copying data from one LUN (src) to another LUN
(dest). The src LUNs sit on a SAN device, sometimes multiple devices
(zpool mirror). The src LUN is running a MySQL database and will
typically run for weeks without issue.
I'm sorry, I don't quite understand how this can be a serious
"backup strategy" - how on earth did you get to thinking that
it was going to work reliably?
Well, this is not the final destination of the backup. This is a method
of taking periodic snapshots between the primary and secondary hosts of
a replication pair. We take the completed ibbackup and stream it to
tape afterwards, then do the equivalent of a restore on the dest LUN.
It actually is pretty reliable. We do this weekly, mainly because we
don't trust MySQL replication; it's somewhat error-prone. While we may
have issues with our implementation, I don't believe the strategy to be
faulty at its core.
When we start the backup sequence, we map a previously unmapped LUN
to the DB host and issue the following commands:
root# cfgadm -al
(sleep 10)
root# luxadm probe
(sleep 10)
root# zpool import <pool_name>
You're kidding, right? Have you RTFMd the cfgadm_fp(1M) manpage?
Ever thought about running something similar to
# cfgadm -c configure c$X::$target-pwwn
Well, no, not kidding. I believe that yes, this may be one of the main
issues with our system. We have read quite a bit of documentation, and
normally this works pretty darn well. Doing a configure, and I can only
assume an unconfigure prior to remapping the LUN, is the proper
procedure? (See the sketch after the manpage excerpt below.) We
believed that doing a configure was having little to no effect, because
in the cfgadm -al output, the condition (on a configured LUN) is
"unknown". According to the manpage, that can very well mean that the
configure command will have zero effect:
"""
configure Configure a connected Fibre Channel
Fabric device to a host. When a Fibre
Channel device is listed as an unknown
type in the output of the list operation
the device might not be configurable. No
attempt is made to configure devices
with unknown types.
"""
After importing, we'll perform some minor IO on the dest LUN, such as
adding a symlink or removing some old configuration files. Then we'll
start an ibbackup of that database from the src LUN to the dest LUN,
and things go bad.
Frankly, I'm surprised it takes this long for you to get to the
"things go bad" stage.
I think it's possible that the improper "configure" handling described
above is what's causing things to go bad.
It's not completely consistent, but sometimes the DB host will crash,
and sometimes we'll get checksum/read/write errors on the src LUN.
Looking at dmesg (when the host doesn't crash), we see the LUNs' paths
all disappear and then reappear, usually around 20 seconds later.
Example output below. Each LUN has 2 paths out of the DB host and 4
paths on each storage device, across two separate SANs.
You're yanking drives in- and out-of-view of your host, you're
doing so with zpool importing (and exporting?) and yet you still
want your database to be reliable.
We are not attempting to yank the src LUN in and out. That said,
removing the dest LUN from a host's view may inexplicably be causing
other MPxIO inconsistencies. We typically import/export zpools between
hosts because ibbackup from Innobase (the recommended hot-backup
solution for InnoDB) cannot write over the network and must copy to a
local directory. Mapping LUNs between hosts is one method; NFS is
another, and so on. The typical 'Enterprise' backup solution didn't
support hot backups of InnoDB until more recently.
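To make the handoff concrete, our intended flow between the two hosts,
with the unconfigure step folded in, is roughly the following. The pool
name and ap_ids are placeholders.
(on the host giving up the LUN)
root# zpool export backup_pool
root# cfgadm -c unconfigure c4::210000e08b123456
(on the array: unmap the LUN from that host, map it to the other host)
(on the host receiving the LUN)
root# cfgadm -c configure c5::210000e08b123456
root# zpool import backup_pool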
Usually the host will crash when not running with a zpool mirror,
which apparently in Sol10u4, it's expected behavior.
Sorry, but no. What you're doing is creating inconsistencies in the
host's view of its storage. Don't blame Solaris for this; it's actually
trying to keep your data consistent.
As mentioned, we _never_ remove (via WWN host masking) LUNs that are
active from a ZFS perspective, so consistency should not be compromised.
These hosts are x86_64 servers running Sol10u4, unpatched. They use
QLogic QLA2342 HBAs and the stock qlc driver. They are using MPxIO,
from what I can tell.
Yes, they're using MPxIO. You can tell that from pathnames under
/scsi_vhci/... - that's a dead giveaway.
Good, that's what I thought. I wasn't positive only because runs of
mpathadm --cannot-remember didn't list our storage device as a supported
MPxIO device; at least, that's what I took the output to mean.
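For reference, I believe the checks were along these lines (exact
subcommands from memory, so treat them as approximate):
root# mpathadm list mpath-support
root# mpathadm show mpath-support libmpscsi_vhci.so
root# mpathadm list lu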
So ... _unpatched_ you say? _Why_ ? I know organisations generally
have rigorous patching methodologies and schedules, but fer cryin'
out loud, S10 Update 4 has been available since the middle of 2007.
That's very nearly 12 months old.
As mentioned by Bob in another email, we have not been running into
issues with u4 that we knew of. We have begun upgrading to U5 and
applying the latest patches; in fact, all of our secondary hosts have
this completed, we just haven't had a downtime window available to get
the primary systems upgraded. We are aware that we should be upgrading,
and we are working towards that goal. We have also seen enough
instability with software releases (in software in general) that we like
some bake time on new releases, so we've been waiting a little bit on
the U5 release. Which, as it turns out, was a good idea, given the
memory leak in the qlc driver, fixed by patch 125165-10.
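(For anyone wanting to check their own systems, something like the
following should show whether that patch is installed:)
root# showrev -p | grep 125165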
If anyone has any tips on troubleshooting, or knows of things we are
doing wrong, help would be appreciated.
Two major recommendations. Firstly, PATCH YOUR SYSTEM.
Secondly, design a backup methodology which doesn't rely
on playing the fool with your storage.
Assuming that you're posting from your work email address,
_surely_ you could convince your management to implement
a backup strategy based around an enterprise-class backup
package such as NetBackup or Networker.
You should also seriously consider getting a professional
services organisation (such as Sun's) to come in and help
you get your systems set up properly.
Thanks for your recommendations.