Dear all

On 03/02/2012 07:43 AM, Thomas Nau wrote:
> Dear all
> I asked before but without much feedback. As the issue
> is persistent I want to give it another try. We disabled
> panicing for such kind of error in /etc/system but still
> see messages such as
>
> zfs: accessing past end of object 5b1aa/21a8008 (size=60416 
> access=32603+32768)
>
> in the logs. Is there any way to identify which object (file?)
> causes this?



I'm going to answer my own question, using some really useful hints from
Steve (thanks for that!).

In case your system crashed, there's an option in /etc/system to return an
I/O error instead of panicking; just add
        set zfs:zfs_recover=1
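
To double-check after the reboot that the tunable really took effect,
something like this should do (just a sketch; the exact mdb output format
may differ on your build):
        # echo "zfs_recover/D" | mdb -k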

After rebooting the system will issue a warning such as
        Mar  4 15:04:16 neo genunix: [ID 762447 kern.warning] WARNING:
        zfs: accessing past end of object bab/21a8008 (size=60416 
access=32603+32768)
to syslog whenever the problem shows up. The important numbers are
        dataset ID:     bab
        object ID:      21a8008

Those would also show up in the kernel panic message.
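
If several of these warnings have piled up, a quick way to pull all the
dataset/object pairs out of syslog would be something along these lines
(untested sketch, assuming the default /var/adm/messages location):

        # grep 'accessing past end of object' /var/adm/messages | \
              sed -n 's/.*object \([0-9a-f]*\)\/\([0-9a-f]*\).*/\1 \2/p' | sort -u
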
Assuming the ZFS datasets still exist and the file is also unaltered, we
don't need crash-dump analysis but can use zdb instead. I'm running the
latest S11 bits, by the way.

First, use some bash magic to turn the hex numbers into decimals, as zdb
deals with those:
        # printf "%d %d\n" 0xbab 0x21a8008
        2987 35291144

Now look up the dataset; I assume we already have a pretty good idea which
pool to check:
        # zdb -d -r pool1 | grep "ID 2987"
        Dataset pool1/.../backup-clone [ZPL], ID 2987, cr_txg 190496, 1.36T, 
15590871 objects

Now look up the actual object. Add more "-v" flags to get even more data:
        # zdb -vvv pool1/backup/nfs/home/student1/backup-clone 35291144
        Dataset pool1/.../backup-clone [ZPL], ID 2987, cr_txg 190496, 1.36T, 
15590871 objects,
        rootbp DVA[0]=<2:2a000cada00:c00:RZM:4> [L0 DMU objset] fletcher4 lzjb 
LE contiguous unique
        unencrypted 4-copy size=800L/200P birth=190500L/190500P fill=15590871
        cksum=1abc142ec9:88286e5b5e3:18d73114fd4d4:34d14f5e348c05

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
  35291144    1    16K  59.0K    32K  59.0K  100.00  ZFS plain file
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite
        uid     63883
        gid     400
        atime   Tue Jan 10 14:15:34 2012
        mtime   Tue Jan 10 14:23:01 2012
        ctime   Tue Jan 10 14:23:01 2012
        crtime  Wed Oct 19 09:43:54 2011
        gen     15760229
        mode    100644
        size    2228224
        parent  34303712
        links   1
        pflags  40800000004


So here comes the funny stuff: according to the object data the size is
2228224 bytes, which of course matches the "ls -l" output. On the other
hand, the ZFS read complained after about 32k, which fits the dsize/lsize
columns, as we use compression.
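
To spell out the numbers in the warning (reading access=offset+length, which
is also what the truss output below shows):
        32603 + 32768 = 65371, i.e. past the 60416 bytes (59.0K, the lsize
        above) ZFS thinks are there, yet far below the 2228224 bytes at the
        POSIX level.
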
Strange, isn't it? But wait, it gets even more confusing... The initial
panic, now turned into warnings, was caused by the TSM backup client trying
to back up the file. We use ZFS clones to get as consistent a backup as
possible. Let's truss the client (some path elements cut):

access("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite",
 R_OK) = 0
open64("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite",
 O_RDONLY|O_NONBLOCK) = 6
acl("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite", 
ACE_GETACL, 1024, 0x0846D780) = 3
read(6, " S Q L i t e   f o r m a".., 32603)    = 32603
...
read(6, 0x086B0C98, 32768)                      Err#5 EIO

Now let's just cat the file and see what happens:
        # cat 
/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite > TEST
        # ls -l TEST
        -rw-r--r--   1 root     root     2228224 Mar  4 15:50 TEST

No complaints. Observing the appropriate routine via
        # dtrace -n '::zfs_panic_recover:entry { stack(); }'

shows it does not trigger. Checking the backup client again... no more errors,
as truss also confirms:

open64("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite",
 O_RDONLY|O_NONBLOCK) = 6
read(6, " S Q L i t e   f o r m a".., 32603)    = 32603
read(6, " 3 4 1 0 8 . 2 . 2 . u t".., 32768)    = 32768
read(6, "\0\0\0\0\0\0\0\0\0\0\0\0".., 32768)    = 32768
...


Double checking with zdb and "ls -i" shows the same object ID.
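
For reference, the "ls -i" half of that check is simply (same shortened path
as above; on ZFS the inode number ls prints is the object number):
        # ls -i /backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite
and it returns the same 35291144 as zdb.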

I'm really puzzled!!! Any more ideas on what's going on?

Thomas

