[zfs-discuss] A disk on Thumper giving random CKSUM error counts

Jim Klimov Sun, 10 Jun 2012 06:10:03 -0700

Hello all,

  As some of you might remember, there is a Sun Fire X4500
(Thumper) server that I was asked to help upgrade to modern
disks. It is still in a testing phase, and the one 3TB
UltraStar currently available to the server's owners is humming
in the server, with one small partition at its tail which
earlier replaced a dead 250GB disk in the pool. The OS is
still SXCE snv_117 so far.


  Early tests, which filled the UltraStar with data in a
couple of single-partition pools on it, showed that writes
and scrubs yielded 0 errors.

  However, now that this disk works as part of a larger old
pool (9*5-vdev raidz1 sets), every scrub finds CKSUM errors
on it. They tried to reseat the disk in the same position
(and will soon try the former position of another failed
disk), but reseating did not help. For some reason, while
the early high CKSUM counts led ZFS to degrade the pool,
this no longer happens (and my questionable script to clear
the errors during scrub is not in use anymore).

  The number of CKSUM errors varies, with no dependable pattern:
1852, 317, 146, 83, 32, 1063, 6, 163, 4, 1, 8, 50...
I cannot say there is a trend suggesting that once some
intermediate errors are cleaned up, the counts will stay at zero.
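
  To track such counts over time rather than by eye, a minimal
Python sketch like the one below could pull the READ/WRITE/CKSUM
counters for one device out of saved "zpool status" text. The
sample text, the pool name "tank" and the device name "c1t2d0s1"
are made up for illustration and are not copied from this server;
the handling of abbreviated counters such as "1.5K" is likewise
an assumption about how large counts may be printed.

```python
# Suffix multipliers for abbreviated counters such as "1.5K"
# (assumption: large counts may be shortened this way).
_SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9}


def parse_count(token):
    """Convert a counter token like '83' or '1.5K' to an int."""
    suffix = token[-1].upper()
    if suffix in _SUFFIXES:
        return int(round(float(token[:-1]) * _SUFFIXES[suffix]))
    return int(token)


def device_errors(status_text, device):
    """Return (read, write, cksum) for `device`, or None if not listed."""
    for line in status_text.splitlines():
        fields = line.split()
        # Config rows look like: NAME STATE READ WRITE CKSUM [notes...]
        if len(fields) >= 5 and fields[0] == device:
            try:
                return tuple(parse_count(f) for f in fields[2:5])
            except ValueError:
                continue  # not a counter row after all
    return None


# Hypothetical sample, shaped like "zpool status" output.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c1t2d0s1  ONLINE       0     0    83
"""

if __name__ == "__main__":
    print(device_errors(SAMPLE, "c1t2d0s1"))
```

Running it against status text saved after each scrub and
appending the result to a log would give a history of the counts
to compare, for example, before and after moving the disk to
another slot.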

  Hence the question: what can be wrong, considering (hoping)
that there are high-quality components in play, in a cooled
datacenter room with UPSes powering the box?

  Is there some way to reliably test and blame or rule out:
* HDD itself (media, chips, connectors) as a black box
* OS version
or aging X4500 hardware, including:
* backplane connectors
* Marvell controllers
* power source
* ECC RAM
* CPUs

  So far, two disks have failed on this server in positions
c1t2 and c5t6, and the replacement disk is currently running
in position c1t2. Other disks have not reported errors over
the 4 years that the server has been in 24*7*365 service, so I
doubt that this is a systemwide problem (CPU, RAM, power),
or even a controller-wide or backplane-wide problem; I am more
inclined to suspect the connectors or particular lanes on the
controller.

  Any better ideas? Perhaps someone has had similar experiences?

Thanks,
//Jim Klimov

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
