2012-10-01 17:07, Edward Ned Harvey
Well, now I know why it's stupid. Because it doesn't work right - it turns out,
iscsi devices (and I presume SAS devices) are not removable storage. That
means, if the device goes offline and comes back online again, it doesn't just
gracefully resilver and move on without any problems; it's in a perpetual state
of I/O errors, device unreadable. If there were simply cksum errors, or
something like that, I could handle it. But it's bus errors, device errors,
the system can't operate, and I have to remove the device permanently.
The really odd thing is - it doesn't always show as faulted in zpool status.
Even when it does show as faulted, I can zpool online or zpool clear to make
the pool look healthy again. But when an app tries to use something in that
zpool, the system grinds; I can see scsi errors spewing into
/var/adm/messages, and sometimes the system will halt.
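For reference, the recovery attempts described above look roughly like this. The pool and device names are hypothetical, and note the caveat from the post: these commands make the pool *look* healthy, but I/O against it can still produce scsi errors afterward.

```shell
# Check pool state - the failed iscsi device may or may not show as FAULTED
zpool status tank

# Attempt to bring the device back online ("tank" and the device
# name are hypothetical placeholders for the real pool and disk)
zpool online tank c0t600144F0ABCD0001d0

# Clear the error counters so the pool reports healthy again
zpool clear tank
```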
This was all caused by my disconnecting or rebooting either the iscsi
initiator or the target.
Lesson learned: if you create an iscsi target, make *damn* sure it's an
always-on system. And don't use just one. And don't do maintenance on both
of them anywhere near the same week.
And would some sort of clusterware help in this case?
I.e., when the target goes down, it informs the initiator to
"offline" the disk component gracefully (if that is possible).
When the target comes back up, the automation would online the
pool components, or replace them in place, and *properly*
resilver and clear the pool.
Wonder if that's possible, and if that would help your case?
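A manual version of the handoff that such automation would perform might look like the following sketch. The pool and device names are hypothetical, and I haven't verified that this avoids the perpetual-error state described above; the idea is simply that a deliberate offline before target maintenance should let ZFS treat the absence as expected rather than as a device failure.

```shell
# Before taking the iscsi target down for maintenance, deliberately
# offline the vdev it backs (names are hypothetical placeholders):
zpool offline tank c0t600144F0ABCD0001d0

# ... perform the target maintenance / reboot, then once it's back:
zpool online tank c0t600144F0ABCD0001d0

# The pool should resilver the vdev; watch progress and clear any
# transient errors once it completes
zpool status tank
zpool clear tank
```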
zfs-discuss mailing list