On Tue, Mar 27, 2012 at 3:14 AM, Carsten John <cj...@mpi-bremen.de> wrote:
> Hello everybody,
>
> I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic 
> during the import of a zpool (some 30 TB) containing ~500 ZFS filesystems 
> after reboot. This causes a reboot loop until the box is booted single user 
> and /etc/zfs/zpool.cache is removed.
>
>
> From /var/adm/messages:
>
> savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf 
> Page fault) rp=ffffff002f9cec50 addr=20 occurred in module "zfs" due to a 
> NULL pointer dereference
> savecore: [ID 882351 auth.error] Saving compressed system crash dump in 
> /var/crash/vmdump.2
>

    I ran into a very similar problem with Solaris 10U9 and the
replica (zfs send | zfs recv destination) of a zpool holding about 25
TB of data. The cause was an incomplete snapshot (the zfs send | zfs
recv had been interrupted). On boot the system tried to import the
zpool, and as part of that it tried to destroy the offending
(incomplete) snapshot. This was zpool version 22, where destroying a
snapshot is handled as a single TXG, and that one operation ran the
system out of RAM (32 GB worth). There is a fix for this in zpool
version 26 (and newer), but any snapshot created while the zpool is at
a version prior to 26 will still have the problem on-disk. We have a
support contract with Oracle and were able to get a loaner system with
128 GB of RAM to clean up the zpool (it took about 75 GB of RAM to do
so).
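
    Before going further it is worth checking what version your pool
is at. A minimal check, assuming a pool named 'tank' (a placeholder,
not from the original report):

    # show the on-disk version of the pool
    zpool get version tank

    # list the versions this release supports and what each one added
    zpool upgrade -v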

    If you are at zpool version 26 or later this is not your problem.
If you are at a zpool version < 26, test for an incomplete snapshot by
importing the pool read only and running `zdb -d <zpool> | grep '%'`;
an incomplete snapshot shows a '%' instead of a '@' as the dataset /
snapshot separator. You can also run zdb against the _un_imported_
zpool using the -e option, as sketched below.
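
    Roughly, the check would look like this (again, 'tank' is just a
placeholder for your pool name, and the read-only import assumes your
release supports the readonly property on import):

    # import read only so nothing tries to destroy the bad snapshot
    zpool import -o readonly=on tank

    # an incomplete snapshot shows '%' instead of '@' as the separator
    zdb -d tank | grep '%'

    # or run zdb against the exported (un-imported) pool directly
    zdb -e -d tank | grep '%'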

See the following Oracle Bugs for more information.

CR# 6876953
CR# 6910767
CR# 7082249

CR# 7082249 has been marked as a duplicate of CR# 6948890.

P.S. I suspect that the incomplete snapshot was also corrupt in some
strange way, but I could never make a solid determination of that. We
think the zfs send | zfs recv was interrupted when we hit an e1000g
Ethernet device driver bug.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players