To: ZFS Discussions <firstname.lastname@example.org>;
From: Paul Kraus <p...@kraus-haus.org>
Sent: Tue 27-03-2012 15:05
Subject: Re: [zfs-discuss] kernel panic during zfs import
> On Tue, Mar 27, 2012 at 3:14 AM, Carsten John <cj...@mpi-bremen.de> wrote:
> > Hello everybody,
> > I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic
> > during the import of a zpool (some 30TB) containing ~500 zfs filesystems
> > after reboot. This causes a reboot loop until booted single user and the
> > pool removed from the automatic import.
> > From /var/adm/messages:
> > savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf
> > Page fault) rp=ffffff002f9cec50 addr=20 occurred in module "zfs" due to a
> > NULL pointer dereference
> > savecore: [ID 882351 auth.error] Saving compressed system crash dump in
> I ran into a very similar problem with Solaris 10U9 and the
> replica (zfs send | zfs recv destination) of a zpool of about 25 TB of
> data. The problem was an incomplete snapshot (the zfs send | zfs recv
> had been interrupted). On boot the system was trying to import the
> zpool and as part of that it was trying to destroy the offending
> (incomplete) snapshot. This was zpool version 22 and destruction of
> snapshots is handled as a single TXG. The problem was that the
> operation was running the system out of RAM (32 GB worth). There is a
> fix for this and it is in zpool 26 (or newer), but any snapshots
> created while the zpool is at a version prior to 26 will have the
> problem on-disk. We have support with Oracle and were able to get a
> loaner system with 128 GB RAM to clean up the zpool (it took about 75
> GB RAM to do so).
> If you are at zpool 26 or later this is not your problem. If you
> are at zpool < 26, then test for an incomplete snapshot by importing
> the pool read only, then `zdb -d <zpool> | grep '%'` as the incomplete
> snapshot will have a '%' instead of a '@' as the dataset / snapshot
> separator. You can also run the zdb against the _un_imported_ zpool
> using the -e option to zdb.
> See the following Oracle Bugs for more information.
> CR# 6876953
> CR# 6910767
> CR# 7082249
> CR# 7082249 has been marked as a duplicate of CR# 6948890
> P.S. I suspect that the incomplete snapshot was also corrupt in
> some strange way, but could never make a solid determination of that.
> We think what caused the zfs send | zfs recv to be interrupted was
> hitting an e1000g Ethernet device driver bug.
> Paul Kraus
> -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
> -> Sound Coordinator, Schenectady Light Opera Company (
> http://www.sloctheater.org/ )
> -> Technical Advisor, Troy Civic Theatre Company
> -> Technical Advisor, RPI Players
This scenario seems to fit. The machine that was sending the snapshot is on
OpenSolaris Build 111b (which is at zpool version 14).
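The on-disk pool version itself can be checked with either of these (shown here
against our pool; "zpool upgrade" without arguments also lists pools that are
below the latest supported version):

zpool get version san_pool
zpool upgrade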
I rebooted the receiving machine due to a hanging "zfs receive" that couldn't be killed.
zdb -d -e <pool> does not give any useful information:
zdb -d -e san_pool
Dataset san_pool [ZPL], ID 18, cr_txg 1, 36.0K, 11 objects
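The exact test Paul suggests (grepping for the '%' separator that marks an
incomplete snapshot) would be something along these lines against the
unimported pool:

zdb -e -d san_pool | grep '%'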
When importing the pool read-only, I get errors about two datasets:
zpool import -o readonly=on san_pool
cannot set property for 'san_pool/home/someuser': dataset is read-only
cannot set property for 'san_pool/home/someotheruser': dataset is read-only
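I guess the next step is to look at those two datasets directly while the pool
is imported read-only, something along the lines of:

zfs list -r -t all san_pool/home/someuser
zfs get -r all san_pool/home/someuser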
As this is a mirror machine, I still have the option to destroy the pool and
copy everything over again via send/receive from the primary. But there is no
telling how long that will work before I'm hit again...
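If it comes to that, the re-sync from the primary would be the usual recursive
send/receive, roughly like this ("primarypool", "mirrorhost" and the snapshot
name are just placeholders):

zfs snapshot -r primarypool@resync
zfs send -R primarypool@resync | ssh mirrorhost zfs receive -Fd san_pool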
If an interrupted send/receive can screw up a 30TB target pool, then
send/receive isn't an option for replicating data at all; furthermore, it should
be flagged as "don't use it if your target pool might contain any valuable data".
I will reproduce the crash once more and try to file a bug report for S11 as
recommended by Deepak (not so easy these days...).