On Fri, Jan 8, 2010 at 5:28 AM, Frank Batschulat (Home)
<frank.batschu...@sun.com> wrote:
[snip]
> Hey Mike, you're not the only victim of these strange CHKSUM errors. I hit
> the same during my slightly different testing, where I'm NFS-mounting an
> entire, pre-existing remote file that lives in a zpool on the NFS server,
> then using that file to back a new zpool and installing zones into it.

What does your overall setup look like?

Mine is:

T5220 + Sun System Firmware 7.2.4.f 2009/11/05 18:21
   Primary LDom
      Solaris 10u8
      Logical Domains Manager 1.2,REV=2009.06.25.09.48 + 142840-03
      Guest Domain 4 vcpus + 15 GB memory
         OpenSolaris snv_130
            (this is where the problem is observed)

I've seen similar errors on Solaris 10 in the primary domain and on an
M4000.  Unfortunately Solaris 10 doesn't show the checksums in the
ereport.  There I noticed a mixture of read errors and checksum errors -
and lots more of them.  This could be because the S10 zone was a
full-root SUNWCXall zone, compared to the much smaller default
ipkg-branded zone.  On the primary domain running Solaris 10...

(this command was run some time ago)
primary-domain# zpool status myzone
  pool: myzone
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        myzone      DEGRADED     0     0     0
          /foo/20g  DEGRADED 4.53K     0   671  too many errors

errors: No known data errors


(this was run today, many days after the previous command)
primary-domain# fmdump -eV | egrep zio_err | uniq -c | head
   1    zio_err = 5
   1    zio_err = 50
   1    zio_err = 5
   1    zio_err = 50
   1    zio_err = 5
   1    zio_err = 50
   2    zio_err = 5
   1    zio_err = 50
   3    zio_err = 5
   1    zio_err = 50
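
If I'm decoding those right, zio_err 5 is EIO and 50 is EBADE, which ZFS
uses internally as ECKSUM - so this looks like alternating read and
checksum failures.  A quicker way to get a breakdown by ereport class
(assuming the class is the last field of "fmdump -e" output, as it is on
my systems) is:

primary-domain# fmdump -e | grep ereport.fs.zfs | awk '{ print $NF }' | sort | uniq -c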


Note that even though I had thousands of read errors, the zone worked
just fine.  I would never have known (or even suspected) there was a
problem if I hadn't run "zpool status" or the various FMA commands.
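
For anyone who wants to catch this sooner: "zpool status -x" prints only
pools that have problems, so it's cheap enough to run from cron.

primary-domain# zpool status -x
all pools are healthy

(That's the healthy case; here it would have printed the DEGRADED myzone
pool instead.)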


> I've filed today:
>
> 6915265 zpools on files (over NFS) accumulate CKSUM errors with no apparent 
> reason

Thanks.  I'll open a support call to help get some funding on it...

> here's the relevant piece worth investigating (leaving out the actual
> setup etc.).  As in your case, creating the zpool and installing the
> zone into it still gives a healthy zpool, but immediately after booting
> the zone, the zpool served over NFS accumulates CHKSUM errors.
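
For anyone wanting to reproduce this, my reading of Frank's setup is
roughly the following; the mount point and file name here are mine, not
his exact ones:

  client# mount -F nfs nfsserver:/export/files /files
  client# zpool create nfszone /files/nfszone    (pre-existing file on the server)
  client# zoneadm -z nfszone install             (after the usual zonecfg steps)
  client# zoneadm -z nfszone boot
  client# zpool status nfszone                   (CKSUM column starts climbing)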
>
> of particular interest are the 'cksum_actual' values as reported by Mike
> for his test case here:
>
> http://www.mail-archive.com/zfs-disc...@opensolaris.org/msg33041.html
>
> compared to the 'cksum_actual' values I got in the fmdump error output
> on my test case/system:
>
> note: the NFS server's zpool that serves and shares the file we use is
> healthy.
>
> with the zone now halted on my test system, checking fmdump:
>
> osoldev.batschul./export/home/batschul.=> fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail
>      2    cksum_actual = 0x4bea1a77300 0xf6decb1097980 0x217874c80a8d9100 0x7cd81ca72df5ccc0
>      2    cksum_actual = 0x5c1c805253 0x26fa7270d8d2 0xda52e2079fd74 0x3d2827dd7ee4f21
>      6    cksum_actual = 0x28e08467900 0x479d57f76fc80 0x53bca4db5209300 0x983ddbb8c4590e40
> *A   6    cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0
> *B   7    cksum_actual = 0x0 0x0 0x0 0x0
> *C  11    cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40
> *D  14    cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00
> *E  17    cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00
> *F  20    cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
> *G  25    cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0
>
> osoldev.root./export/home/batschul.=> zpool status -v
>  pool: nfszone
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>        attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>        using 'zpool clear' or replace the device with 'zpool replace'.
>   see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>        NAME        STATE     READ WRITE CKSUM
>        nfszone     DEGRADED     0     0     0
>          /nfszone  DEGRADED     0     0   462  too many errors
>
> errors: No known data errors
>
> ==========================================================================
>
> now compare this with Mike's error output as posted here:
>
> http://www.mail-archive.com/zfs-disc...@opensolaris.org/msg33041.html
>
> # fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail
>
>      2    cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62 0x290cbce13fc59dce
> *D   3    cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00
> *E   3    cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00
> *B   4    cksum_actual = 0x0 0x0 0x0 0x0
>      4    cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900 0x330107da7c4bcec0
>      5    cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73 0x4e0b3a8747b8a8
> *C   6    cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40
> *A   6    cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0
> *F  16    cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
> *G  48    cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0
>
> and observe that the 'cksum_actual' values behind our CHKSUM pool errors
> (the checksums that mismatched what was expected) are the SAME for 2
> totally different client systems and 2 different NFS servers (mine vs.
> Mike's) - see the entries marked *A to *G.
>
> This just can't be an accident or mere coincidence, so there's a good
> chance that these CHKSUM errors have a common source, either in ZFS or
> in NFS.

You saved me so much time with this observation.  Thank you!
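
For what it's worth, two of the shared values look meaningful on their
own, unless I'm misreading them: *B (all zeros) is what a Fletcher-style
checksum produces for an all-zero block, and the first word of *F,
0xbaddcafe00, is exactly 0xbaddcafe * 0x100 - and 0xbaddcafe is the
Solaris kmem "uninitialized memory" debug fill pattern.  Both smell like
stale or uninitialized buffers rather than on-disk corruption.

To compare the two lists mechanically rather than by eye (assuming the
fmdump output format above; the file names are made up, and both lists
need to end up on one machine):

# fmdump -eV | awk '/cksum_actual/ { print $3, $4, $5, $6 }' | sort -u > /tmp/site1.cksums
    (same on the other system, saved as /tmp/site2.cksums)
# comm -12 /tmp/site1.cksums /tmp/site2.cksums

comm -12 prints only the checksum tuples common to both systems.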


-- 
Mike Gerdts
http://mgerdts.blogspot.com/