On 19.01.13 18:17, Bob Friesenhahn wrote:
On Sat, 19 Jan 2013, Stephan Budach wrote:


Now, this zpool is made of 3-way mirrors and currently 13 out of 15 vdevs are resilvering (which they had gone through yesterday as well) and I never got any error while resilvering. I have been all over the setup to find any glitch or bad part, but I couldn't come up with anything significant.

Doesn't this sound improbable? Wouldn't one expect to encounter other chksum errors while a resilver is running?

I can't attest to chksum errors since I have yet to see one on my machines (have seen several complete disk failures, or disks faulted by the system though). Checksum errors are bad and not seeing them should be the normal case.
I know, and it's really bugging me that I seem to have these chksum errors on all of my machines, be it Sun gear or Dell.

Resilver may in fact be just verifying that the pool disks are coherent via metadata. This might happen if the fiber channel is flapping.

Regarding the dire fiber channel issue, are you using fiber channel switches or direct connections to the storage array(s)? If you are using switches, are they stable or are they doing something terrible like resetting? Do you have duplex connectivity? Have you verified that your FC HBA's firmware is correct?
Looking at my FC switches, I am noticing errors like these:

[656][Thu Dec 06 03:33:04.795 UTC 2012][I][8600.001E][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged out of nameserver.]
[657][Thu Dec 06 03:33:05.829 UTC 2012][I][8600.0020][Port][Port: 2][SYNC_LOSS]
[658][Thu Dec 06 03:37:08.077 UTC 2012][I][8600.001F][Port][Port: 2][SYNC_ACQ]
[659][Thu Dec 06 03:37:10.582 UTC 2012][I][8600.001D][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged into nameserver.]
[660][Sun Dec 09 04:18:32.324 UTC 2012][I][8600.001E][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged out of nameserver.]
[661][Sun Dec 09 04:18:32.326 UTC 2012][I][8600.0020][Port][Port: 10][SYNC_LOSS]
[662][Sun Dec 09 04:18:32.913 UTC 2012][I][8600.001F][Port][Port: 10][SYNC_ACQ]
[663][Sun Dec 09 04:18:33.024 UTC 2012][I][8600.001D][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged into nameserver.]

Just ignore the timestamps, as the clock on the switch does not seem to be set correctly, but the dates match my two issues from today and Thursday, which accounts for three days. I didn't catch that before, but it seems to clearly indicate a problem with the FC connection…
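
To put a number on how long each port was actually down, the SYNC_LOSS and SYNC_ACQ entries can be paired up per port. The following is only a rough Python sketch: the entry layout is inferred from the eight lines quoted above and may not hold for other firmware, and the names ENTRY, parse_ts and link_outages are made up for this illustration.

```python
import re
from datetime import datetime

# The eight entries quoted from the switch log above.
LOG = """\
[656][Thu Dec 06 03:33:04.795 UTC 2012][I][8600.001E][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged out of nameserver.]
[657][Thu Dec 06 03:33:05.829 UTC 2012][I][8600.0020][Port][Port: 2][SYNC_LOSS]
[658][Thu Dec 06 03:37:08.077 UTC 2012][I][8600.001F][Port][Port: 2][SYNC_ACQ]
[659][Thu Dec 06 03:37:10.582 UTC 2012][I][8600.001D][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged into nameserver.]
[660][Sun Dec 09 04:18:32.324 UTC 2012][I][8600.001E][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged out of nameserver.]
[661][Sun Dec 09 04:18:32.326 UTC 2012][I][8600.0020][Port][Port: 10][SYNC_LOSS]
[662][Sun Dec 09 04:18:32.913 UTC 2012][I][8600.001F][Port][Port: 10][SYNC_ACQ]
[663][Sun Dec 09 04:18:33.024 UTC 2012][I][8600.001D][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged into nameserver.]
"""

# Assumed layout: [seq][timestamp][severity][code][Port][Port: N][message]
ENTRY = re.compile(
    r"\[(?P<seq>\d+)\]"
    r"\[(?P<ts>[^\]]+)\]"
    r"\[\w\]\[[0-9A-Fa-f.]+\]\[Port\]"
    r"\[Port: (?P<port>\d+)\]"
    r"\[(?P<msg>[^\]]+)\]"
)

MONTHS = {name: num for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_ts(text):
    """Parse e.g. 'Thu Dec 06 03:33:04.795 UTC 2012' without relying on the locale."""
    _weekday, month, day, clock, _tz, year = text.split()
    hms, millis = clock.split(".")
    hour, minute, second = (int(part) for part in hms.split(":"))
    return datetime(int(year), MONTHS[month], int(day),
                    hour, minute, second, int(millis) * 1000)


def link_outages(log_text):
    """Return (port, time_of_loss, seconds_down) for each SYNC_LOSS/SYNC_ACQ pair."""
    pending = {}   # port -> time of the most recent unmatched SYNC_LOSS
    outages = []
    for match in ENTRY.finditer(log_text):
        port = int(match["port"])
        when = parse_ts(match["ts"])
        if match["msg"] == "SYNC_LOSS":
            pending[port] = when
        elif match["msg"] == "SYNC_ACQ" and port in pending:
            lost_at = pending.pop(port)
            outages.append((port, lost_at, (when - lost_at).total_seconds()))
    return outages


if __name__ == "__main__":
    for port, lost_at, seconds in link_outages(LOG):
        print(f"port {port}: sync lost at {lost_at}, down for {seconds:.3f} s")
```

On the entries above this pairs one outage on port 2 lasting about four minutes and one on port 10 lasting well under a second. Since the switch clock is off, only the durations are meaningful, not the absolute times.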

But, what do I make of this information?


Did you check for messages in /var/adm/messages which might indicate when and how FC connectivity has been lost?
Well, this is the scariest part to me. Neither fmdump nor dmesg showed anything that would indicate a connectivity issue, at least not the last time.

Bob

Thanks,
Stephan
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
