Dear all I'm about to answer my own question with some really useful hints from Steve, thanks for that!!!
On 03/02/2012 07:43 AM, Thomas Nau wrote: > Dear all > I asked before but without much feedback. As the issue > is persistent I want to give it another try. We disabled > panicing for such kind of error in /etc/system but still > see messages such as > > zfs: accessing past end of object 5b1aa/21a8008 (size=60416 > access=32603+32768) > > in the logs. Is there any way to identify which object (file?) > causes this? I'm about to answer my own question with some really useful hints from Steve, thanks for that!!! In case your system crashed there's an option in /etc/system to return an IO error instead of panicing, just add set zfs:zfs_recover=1 After rebooting the system will issue a warning such as Mar 4 15:04:16 neo genunix: [ID 762447 kern.warning] WARNING: zfs: accessing past end of object bab/21a8008 (size=60416 access=32603+32768) to syslog whenever the problem shows. The important numbers are dataset ID: bab object ID: 21a8008 Those would also show up in the kernel panic message. Assuming the ZFS datasets still exist and the file is also unaltered we don't need crash dump analysis but can use zdb instead. I running the latest S11 bits by the way First use some bash magic to turn the hex numbers into decimals as zdb deals with those # printf "%d %d\n" 0xbab 0x21a8008 987 35291144 now lookup the dataset; I assume we already have a pretty good idea about which pool to check # zdb -d -r pool1 | grep "ID 2987" Dataset pool1/.../backup-clone [ZPL], ID 2987, cr_txg 190496, 1.36T, 15590871 objects Now lookup the actual object. Add more "-v" to get even more data # zdb -vvv pool1/backup/nfs/home/student1/backup-clone 35291144 Dataset pool1/.../backup-clone [ZPL], ID 2987, cr_txg 190496, 1.36T, 15590871 objects, rootbp DVA[0]=<2:2a000cada00:c00:RZM:4> [L0 DMU objset] fletcher4 lzjb LE contiguous unique unencrypted 4-copy size=800L/200P birth=190500L/190500P fill=15590871 cksum=1abc142ec9:88286e5b5e3:18d73114fd4d4:34d14f5e348c05 Object lvl iblk dblk dsize lsize %full type 35291144 1 16K 59.0K 32K 59.0K 100.00 ZFS plain file 168 bonus System attributes dnode flags: USED_BYTES USERUSED_ACCOUNTED dnode maxblkid: 0 path /zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite uid 63883 gid 400 atime Tue Jan 10 14:15:34 2012 mtime Tue Jan 10 14:23:01 2012 ctime Tue Jan 10 14:23:01 2012 crtime Wed Oct 19 09:43:54 2011 gen 15760229 mode 100644 size 2228224 parent 34303712 links 1 pflags 40800000004 So here comes the funny stuff: according to the object data the size is 2228224 bytes which matches of course the "ls -l" output. On the other hand ZFS read complained after about 32k which fits the dsize/lsize columns as we use compression. Strange, isn't it but wait, it gets even more confusing... The initial panic, now turned into warnings, was caused by the TSM backup client trying to backup the file. We use zfs clones to get a consistent backup as much as possible. Let's truss the client (cut some path elements): access("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite", R_OK) = 0 open64("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite", O_RDONLY|O_NONBLOCK) = 6 acl("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite", ACE_GETACL, 1024, 0x0846D780) = 3 read(6, " S Q L i t e f o r m a".., 32603) = 32603 ... read(6, 0x086B0C98, 32768) Err#5 EIO now let's just cat the file and see what happens: # cat /backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite > TEST # ls -l TEST -rw-r--r-- 1 root root 2228224 Mar 4 15:50 TEST no complains. Observing the appropriate routine through # dtrace -n '::zfs_panic_recover:entry { stack(); }' does not trigger. Checking the backup client again... no more errors as truss also confirms open64("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite", O_RDONLY|O_NONBLOCK) = 6 read(6, " S Q L i t e f o r m a".., 32603) = 32603 read(6, " 3 4 1 0 8 . 2 . 2 . u t".., 32768) = 32768 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0".., 32768) = 32768 ... Double checking with zdb and "ls -i" shows the same object ID. I'm really puzzled!!! Any more ideas what's going on? Thomas _______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss