Hello Matthew,

Tuesday, September 12, 2006, 7:57:45 PM, you wrote:
MA Ben Miller wrote:
I had a strange ZFS problem this morning. The entire system would hang
when mounting the ZFS filesystems. After trial and error I determined
that the problem was with one of the 2500 ZFS filesystems. When mounting
that user's home the system would hang and need to be rebooted. After I
removed the snapshots (9 of them) for that filesystem everything was fine.

I don't know how to reproduce this and didn't get a crash dump. I don't
remember seeing anything about this before, so I wanted to report it and
see if anyone has any ideas.
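(The cleanup described above amounts to something like the following;
the dataset name is hypothetical:)

   # zfs list -t snapshot -r pool/home/user
   # zfs destroy pool/home/user@2006-09-11

The first command lists the snapshots under the problem filesystem (the
9 in this case); the second removes one, repeated per snapshot.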
MA Hmm, that sounds pretty bizarre, since I don't think that mounting a
MA filesystem really interacts with snapshots at all.
MA Unfortunately, I don't think we'll be able to diagnose this without a
MA crash dump or reproducibility. If it happens again, force a crash dump
MA while the system is hung and we can take a look at it.
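(For the archives: on SPARC the usual way to force a dump from a hung box
is to send a break (Stop-A on a local keyboard, or a break from the console
server), then type sync at the OpenBoot ok prompt. That panics the system
and lets savecore(1M) write the dump at the next boot; the "sync initiated"
panic message in the dump below is exactly what this produces. Roughly:)

   ok sync
   ... system panics with "sync initiated", reboots, savecore runs ...
   # mdb -k unix.0 vmcore.0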
Maybe it wasn't hung after all. I've seen similar behavior here sometimes.
Were the disks used in your pool actually working?
There was lots of activity on the disks (iostat and status LEDs) until it
got to this one filesystem, and then everything stopped. 'zpool iostat 5'
stopped running, the shell wouldn't respond, and activity on the disks
stopped. This fs is relatively small (175M used of a 512M quota).
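(Those figures come from something like the following; the dataset name
is made up and the output is just a sketch:)

   # zfs get used,quota pool/home/user
   NAME            PROPERTY  VALUE  SOURCE
   pool/home/user  used      175M   -
   pool/home/user  quota     512M   local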
Sometimes it takes a lot of time (30-50 minutes) to mount a file system -
it's rare, but it happens. And while it's mounting, ZFS reads from the
disks in the pool. I did report it here some time ago.
In my case the system crashed during the evening and it was left hung
when I came in during the morning, so it was hung for a good 9-10 hours.

The problem happened again last night, but for a different user's
filesystem. I took a crash dump while it was hung and the back trace
looks like this:
::status
debugging crash dump vmcore.0 (64-bit) from hostname
operating system: 5.11 snv_40 (sun4u)
panic message: sync initiated
dump content: kernel pages only
::stack
0xf0046a3c(f005a4d8, 2a100047818, 181d010, 18378a8, 1849000, f005a4d8)
prom_enter_mon+0x24(2, 183c000, 18b7000, 2a100046c61, 1812158, 181b4c8)
debug_enter+0x110(0, a, a, 180fc00, 0, 183e000)
abort_seq_softintr+0x8c(180fc00, 18abc00, 180c000, 2a100047d98, 1, 1859800)
intr_thread+0x170(600019de0e0, 0, 6000d7bfc98, 600019de110, 600019de110, 600019de110)
zfs_delete_thread_target+8(600019de080, , 0, 600019de080, 6000d791ae8, 60001aed428)
zfs_delete_thread+0x164(600019de080, 6000d7bfc88, 1, 2a100c4faca, 2a100c4fac8, 600019de0e0)
thread_start+4(600019de080, 0, 0, 0, 0, 0)
In single user I set the mountpoint for that user to be none and then
brought the system up fine. Then I destroyed the snapshots for that user
and their filesystem mounted fine. In this case the quota was reached
with the snapshots and was at 52% used without them.
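(Roughly the sequence described above; dataset and snapshot names are
hypothetical, and the last step assumes the mountpoint was inherited
rather than set locally:)

   # zfs set mountpoint=none pool/home/user
   ... bring the system the rest of the way up ...
   # zfs destroy pool/home/user@snap1   (and so on, for each snapshot)
   # zfs inherit mountpoint pool/home/user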
Ben
Hate to re-open something from a year ago, but we just had this problem
happen again. We have been running Solaris 10u3 on this system for a
while. I searched the bug reports, but couldn't find anything on this.
I also think I understand what happened a little more. We take snapshots
at noon and the system hung up during that time. When trying to reboot,
the system would hang on the ZFS mounts. After I booted into single user
and removed the snapshot from the filesystem causing the problem,
everything was fine. The filesystem in question was at 100% use with
snapshots in place.
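(One way to spot a dataset in this state is to compare 'used' against
'referenced': for a filesystem with no children, 'used' sitting at the
quota while 'referenced' is well below it means snapshots are holding
the difference. Dataset name hypothetical:)

   # zfs list -o name,quota,used,referenced -r pool/home/user
   # zfs list -t snapshot -r pool/home/user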
Here's the back trace for the system when it was hung:
::stack
0xf0046a3c(f005a4d8, 2a10004f828, 0, 181c850, 1848400, f005a4d8)
prom_enter_mon+0x24(0, 0, 183b400, 1, 1812140, 181ae60)
debug_enter+0x118(0, a, a, 180fc00, 0, 183d400)
abort_seq_softintr+0x94(180fc00, 18a9800, 180c000, 2a10004fd98, 1, 1857c00)
intr_thread+0x170(2, 30007b64bc0, 0, c001ed9, 110, 6000240)
0x985c8(300adca4c40, 0, 0, 0, 0, 30007b64bc0)
dbuf_hold_impl+0x28(60008cd02e8, 0, 0, 0, 7b648d73, 2a105bb57c8)
dbuf_hold_level+0x18(60008cd02e8, 0, 0, 7b648d73, 0, 0)
dmu_tx_check_ioerr+0x20(0, 60008cd02e8, 0, 0, 0, 7b648c00)
dmu_tx_hold_zap+0x84(60011fb2c40, 0, 0, 0, 30049b58008, 400)
zfs_rmnode+0xc8(3002410d210, 2a105bb5cc0, 0, 60011fb2c40, 30007b3ff58, 30007b56ac0)
zfs_delete_thread+0x168(30007b56ac0, 3002410d210, 69a4778, 30007b56b28, 2a105bb5aca, 2a105bb5ac8)
thread_start+4(30007b56ac0, 0, 0, 489a48, d83a10bf28, 50386)
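(For anyone who grabs the dump, the usual first look is something like:)

   # mdb -k unix.0 vmcore.0
   > ::status         (panic string, dump contents)
   > ::msgbuf         (console messages leading up to the hang)
   > ::threadlist -v  (stacks for all threads)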
Has this been fixed in more recent code? I can make the crash dump available.
Ben