Hello Matthew,

Tuesday, September 12, 2006, 7:57:45 PM, you wrote:
MA Ben Miller wrote:
I had a strange ZFS problem this morning. The entire system would hang
when mounting the ZFS filesystems. After trial and error I determined
that the problem was with one of the 2500 ZFS filesystems. When mounting
that user's home the system would hang and need to be rebooted. After I
removed the snapshots (9 of them) for that filesystem everything was fine.

I don't know how to reproduce this and didn't get a crash dump. I don't
remember seeing anything about this before, so I wanted to report it and
see if anyone has any ideas.
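(The cleanup described above amounts to something like the following;
the dataset name is hypothetical:)

   # zfs list -t snapshot -r pool/home/user
   # zfs destroy pool/home/user@2006-09-11

The first command lists the snapshots under the problem filesystem (the
9 in this case); the second removes one, repeated per snapshot.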
MA Hmm, that sounds pretty bizarre, since I don't think that mounting a
MA filesystem really interacts with snapshots at all.
MA Unfortunately, I don't think we'll be able to diagnose this without a
MA crash dump or reproducibility. If it happens again, force a crash dump
MA while the system is hung and we can take a look at it.
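(For the archives: on SPARC the usual way to force a dump from a hung box
is to send a break (Stop-A on a local keyboard, or a break from the console
server), then type sync at the OpenBoot ok prompt. That panics the system
and lets savecore(1M) write the dump at the next boot; the "sync initiated"
panic message in the dump below is exactly what this produces. Roughly:)

   ok sync
   ... system panics with "sync initiated", reboots, savecore runs ...
   # mdb -k unix.0 vmcore.0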
Maybe it wasn't hung after all. I've seen similar behavior here sometimes.
Were the disks used in your pool actually working?
There was lots of activity on the disks (iostat and status LEDs) until it
got to this one filesystem, and then everything stopped. 'zpool iostat 5'
stopped running, the shell wouldn't respond, and activity on the disks
stopped. This fs is relatively small (175M used of a 512M quota).
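(Those figures come from something like the following; the dataset name
is made up and the output is just a sketch:)

   # zfs get used,quota pool/home/user
   NAME            PROPERTY  VALUE  SOURCE
   pool/home/user  used      175M   -
   pool/home/user  quota     512M   local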
Sometimes it takes a lot of time (30-50 minutes) to mount a file system -
it's rare, but it happens. And while it's mounting, ZFS reads from the
disks in the pool. I did report it here some time ago.
In my case the system crashed during the evening and it was left hung
when I came in during the morning, so it was hung for a good 9-10 hours.

The problem happened again last night, but for a different user's
filesystem. I took a crash dump while it was hung and the back trace
looks like this:
::status
debugging crash dump vmcore.0 (64-bit) from hostname
operating system: 5.11 snv_40 (sun4u)
panic message: sync initiated
dump content: kernel pages only
::stack
0xf0046a3c(f005a4d8, 2a100047818, 181d010, 18378a8, 1849000, f005a4d8)
prom_enter_mon+0x24(2, 183c000, 18b7000, 2a100046c61, 1812158, 181b4c8)
debug_enter+0x110(0, a, a, 180fc00, 0, 183e000)
abort_seq_softintr+0x8c(180fc00, 18abc00, 180c000, 2a100047d98, 1, 1859800)
intr_thread+0x170(600019de0e0, 0, 6000d7bfc98, 600019de110, 600019de110, 600019de110)
zfs_delete_thread_target+8(600019de080, , 0, 600019de080, 6000d791ae8, 60001aed428)
zfs_delete_thread+0x164(600019de080, 6000d7bfc88, 1, 2a100c4faca, 2a100c4fac8, 600019de0e0)
thread_start+4(600019de080, 0, 0, 0, 0, 0)
In single user I set the mountpoint for that user to be none and then
brought the system up fine. Then I destroyed the snapshots for that user
and their filesystem mounted fine. In this case the quota was reached
with the snapshots and was at 52% used without them.
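(Roughly the sequence described above; dataset and snapshot names are
hypothetical, and the last step assumes the mountpoint was inherited
rather than set locally:)

   # zfs set mountpoint=none pool/home/user
   ... bring the system the rest of the way up ...
   # zfs destroy pool/home/user@snap1   (and so on, for each snapshot)
   # zfs inherit mountpoint pool/home/user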
Ben
Hate to re-open something from a year ago, but we just had this problem
happen again. We have been running Solaris 10u3 on this system for a
while. I searched the bug reports, but couldn't find anything on this.
I also think I understand what happened a little more. We take snapshots
at noon and the system hung up during that time. When trying to reboot,
the system would hang on the ZFS mounts. After I booted into single user
and removed the snapshot from the filesystem causing the problem,
everything was fine. The filesystem in question was at 100% use with
snapshots in place.
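(One way to spot a dataset in this state is to compare 'used' against
'referenced': for a filesystem with no children, 'used' sitting at the
quota while 'referenced' is well below it means snapshots are holding
the difference. Dataset name hypothetical:)

   # zfs list -o name,quota,used,referenced -r pool/home/user
   # zfs list -t snapshot -r pool/home/user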
Here's the back trace for the system when it was hung:
::stack
0xf0046a3c(f005a4d8, 2a10004f828, 0, 181c850, 1848400, f005a4d8)
prom_enter_mon+0x24(0, 0, 183b400, 1, 1812140, 181ae60)
debug_enter+0x118(0, a, a, 180fc00, 0, 183d400)
abort_seq_softintr+0x94(180fc00, 18a9800, 180c000, 2a10004fd98, 1, 1857c00)
intr_thread+0x170(2, 30007b64bc0, 0, c001ed9, 110, 6000240)
0x985c8(300adca4c40, 0, 0, 0, 0, 30007b64bc0)
dbuf_hold_impl+0x28(60008cd02e8, 0, 0, 0, 7b648d73, 2a105bb57c8)
dbuf_hold_level+0x18(60008cd02e8, 0, 0, 7b648d73, 0, 0)
dmu_tx_check_ioerr+0x20(0, 60008cd02e8, 0, 0, 0, 7b648c00)
dmu_tx_hold_zap+0x84(60011fb2c40, 0, 0, 0, 30049b58008, 400)
zfs_rmnode+0xc8(3002410d210, 2a105bb5cc0, 0, 60011fb2c40, 30007b3ff58, 30007b56ac0)
zfs_delete_thread+0x168(30007b56ac0, 3002410d210, 69a4778, 30007b56b28, 2a105bb5aca, 2a105bb5ac8)
thread_start+4(30007b56ac0, 0, 0, 489a48, d83a10bf28, 50386)
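(For anyone who grabs the dump, the usual first look is something like:)

   # mdb -k unix.0 vmcore.0
   > ::status         (panic string, dump contents)
   > ::msgbuf         (console messages leading up to the hang)
   > ::threadlist -v  (stacks for all threads)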
Has this been fixed in more recent code? I can make the crash dump available.
Ben