I've been reading this thread with interest, while simultaneously
dealing with an outage on our Oracle (née Sun) 7410C cluster. This is
the ZFS NAS appliance offering. Ours is fairly small, about 40TB total,
serving up NFS only to Solaris, RHEL, and VMware hosts.
After a shaky start the cluster has been pretty stable for the last year
or so, but the last two weeks have been a trip back to bizarro world.
About ten days ago 'node one' just dropped off the network for eight
minutes. No cluster failover; both nodes stayed up. It came back
online with no intervention, and ten minutes later a disk failed.
Coincidence? Meh, opened a case. Took 8 days to get a 1TB SATA drive
delivered, lolz.
In the meantime, the management interfaces on both nodes cease to function.
The cluster continues to serve data, though. Seems there's this bug where log
files get too big and cause the management processes to crash. Support
dude logs in remotely, cleans up the egregious log files and restarts
the daemons. OK now? Well, 'node one' is OK. 'Node two' drops off the
network for 8 minutes.
Granted, I am running a one-year-old version of the system code. Support
hasn't been able to say if newer code would have prevented any of this.
They are proposing that the first outage was probably due to the drive
failure because data would be unavailable while the pool was
resilvering. Huh? This is "RAID", right? I'm confused. The second
outage was allegedly due to restarting the daemon. Apparently
restarting the management daemon ifconfigs the interfaces down. So why
would you do that at 3 PM on a Wednesday, then? And hmm, why did the same
procedure only affect 'node two'? Still waiting for answers.
I'm just venting at this point, I guess; take it for what it's worth. I
have a knack for selecting unreliable storage (anyone remember MTI?), so
just watch what I do and do the opposite.
FWIW, we also have a NetApp FAS-3020 and it's been solid, but I don't
ask much of it.
Cheers,
--
Roy McMorran
Bar Harbor, ME