Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-11 Thread Brent Jones


 After examining the dump we got from you (thanks again), we're relatively
 sure you are hitting

 6826836 Deadlock possible in dmu_object_reclaim()

 This was introduced in nv_111 and fixed in nv_113.

 Sorry for the trouble.

 -tim



Do you know when new builds will show up on pkg.opensolaris.org/dev?


-- 
Brent Jones
br...@servuhome.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-10 Thread Tim Haley

Brent Jones wrote:

On Mon, Jun 8, 2009 at 9:38 PM, Richard Lowe richl...@richlowe.net wrote:

Brent Jones br...@servuhome.net writes:




I've had similar issues with similar traces.  I think you're waiting on
a transaction that's never going to come.

I thought at the time that I was hitting:
  CR 6367701 hang because tx_state_t is inconsistent

But given the rash of reports here, it seems perhaps this is something
different.

I, like you, hit it when sending snapshots. In my case it seems to be
specific to incremental streams rather than full streams: I can send
seemingly any number of full streams, but incremental sends via send -i,
or send -R of datasets with multiple snapshots, will get into a state
like that above.

-- Rich



For now, I'm back on snv_106 (the most stable build I've seen; I like it a lot).
I'll open a case in the morning and see what they suggest.


After examining the dump we got from you (thanks again), we're relatively sure 
you are hitting


6826836 Deadlock possible in dmu_object_reclaim()

This was introduced in nv_111 and fixed in nv_113.

Sorry for the trouble.

-tim
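For anyone checking which boot environments are exposed, the build window
above can be written as a small shell test. This is only a sketch: the
function names affected_by_6826836 and build_number are made up for
illustration, and it assumes the build string looks like the snv_NNN value
that uname -v prints.

```shell
#!/bin/sh
# Hypothetical helper: strip an "snv_" or "nv_" prefix and any trailing
# letter, e.g. "snv_111b" -> "111".
build_number() {
    echo "$1" | sed -e 's/^s*nv_//' -e 's/[^0-9].*$//'
}

# Hypothetical helper: succeeds if the numeric build falls in the window
# where 6826836 is present (introduced in 111, fixed in 113).
affected_by_6826836() {
    [ "$1" -ge 111 ] && [ "$1" -lt 113 ]
}

# Example: classify a few build strings.
for b in snv_106 snv_111b nv_113; do
    if affected_by_6826836 "$(build_number "$b")"; then
        echo "$b: in the affected window"
    else
        echo "$b: outside the affected window"
    fi
done
```

On a live system the argument could come from "$(uname -v)" instead of the
hard-coded list.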



Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-08 Thread Brent Jones
On Sun, Jun 7, 2009 at 3:50 AM, Ian Collins i...@ianshome.com wrote:
 Ian Collins wrote:

 Tim Haley wrote:

 Brent Jones wrote:

 On the sending side, I CAN kill the ZFS send process, but the remote
 side leaves its processes going, and I CANNOT kill -9 them. I also
 cannot reboot the receiving system, at init 6, the system will just
 hang trying to unmount the file systems.
 I have to physically cut power to the server, but a couple days later,
 this issue will occur again.


 A crash dump from the receiving server with the stuck receives would be
 highly useful, if you can get it. Reboot -d would be best, but it might just
 hang. You can try savecore -L.

 I tried a reboot -d (I even had kmem_flags=0xf set), but it did hang. I
 didn't try savecore.

 One thing I didn't try was scat on the running system. What should I look
 for (with scat) if this happens again?

 I now have a system with a hanging zfs receive, any hints on debugging it?

 --
 Ian.

I haven't figured out a way to identify the problem; I'm still trying to
find a way to reproduce it 100% of the time.
The more snapshots I send at a given time, the more likely this seems
to happen, but correlation is not causation  :)

I might try to open a support case with Sun (I have a support contract),
but OpenSolaris doesn't seem to be well understood by the support
folks yet, so I'm not sure how far it will get.

-- 
Brent Jones
br...@servuhome.net


Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-08 Thread Brent Jones

 I haven't figured out a way to identify the problem, still trying to
 find a 100% way to reproduce this problem.
 Seemingly the more snapshots I send at a given time, the likelihood of
 this happening goes up, but, correlation is not causation  :)

 I might try to open a support case with Sun (have a support contract),
 but OpenSolaris doesn't seem to be well understood by the support
 folks yet, so not sure how far it will get.

 --
 Brent Jones
 br...@servuhome.net


I can reproduce this 100% by sending about 6 or more snapshots at once.
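A rough sketch of what "at once" means here: start one job per file system
in the background and wait for all of them. This is not the replicate.ksh
script; SEND_CMD and send_all_at_once are hypothetical names, and by default
the hook only echoes so nothing is actually sent.

```shell
#!/bin/sh
# Hypothetical hook: the command run for each dataset. The default only
# echoes, so the sketch is safe to run on any machine.
SEND_CMD=${SEND_CMD:-echo would-send}

# Start one background job per dataset, then wait for all of them.
send_all_at_once() {
    for ds in "$@"; do
        $SEND_CMD "$ds" &
    done
    wait
}

# Six concurrent jobs, the point at which the hang was reported.
send_all_at_once fs01 fs02 fs03 fs04 fs05 fs06
```

The output order is not deterministic because the jobs run concurrently.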

Here is some output that JBK helped me put together:

Here is a pastebin 'mdb' findstack output:
http://pastebin.com/m4751b08c

Not sure what I'm looking at, but maybe someone at Sun can see what's going on?
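For others who want to collect the same kind of data: since truss -p and
mdb -p fail on the stuck processes, the stacks have to come from the kernel
debugger. The exact command behind the pastebin is not given in this thread;
the pipeline below is one common way to get ::findstack output for every
thread of the matching processes. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build the mdb dcmd pipeline that prints the kernel
# stack of every thread in processes whose name matches the pattern.
kernel_stacks_dcmd() {
    echo "::pgrep $1 | ::walk thread | ::findstack -v"
}

# Print the pipeline; on the hung receiver, as root, it would be piped
# into the kernel debugger:  kernel_stacks_dcmd zfs | mdb -k
kernel_stacks_dcmd zfs
```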



-- 
Brent Jones
br...@servuhome.net


Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-08 Thread Richard Lowe
Brent Jones br...@servuhome.net writes:


 I haven't figured out a way to identify the problem, still trying to
 find a 100% way to reproduce this problem.
 Seemingly the more snapshots I send at a given time, the likelihood of
 this happening goes up, but, correlation is not causation  :)

 I might try to open a support case with Sun (have a support contract),
 but OpenSolaris doesn't seem to be well understood by the support
 folks yet, so not sure how far it will get.

 --
 Brent Jones
 br...@servuhome.net


 I can reproduce this 100% by sending about 6 or more snapshots at once.

 Here is some output that JBK helped me put together:

 Here is a pastebin 'mdb' findstack output:
 http://pastebin.com/m4751b08c

 Not sure what I'm looking at, but maybe someone at Sun can see what's going on?

I've had similar issues with similar traces.  I think you're waiting on
a transaction that's never going to come.

I thought at the time that I was hitting:
   CR 6367701 hang because tx_state_t is inconsistent

But given the rash of reports here, it seems perhaps this is something
different.

I, like you, hit it when sending snapshots. In my case it seems to be
specific to incremental streams rather than full streams: I can send
seemingly any number of full streams, but incremental sends via send -i,
or send -R of datasets with multiple snapshots, will get into a state
like that above.
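To make the distinction concrete, here is a sketch of the two command forms.
The helper send_cmd and the dataset and snapshot names are placeholders; it
only prints the command line rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print the zfs send command for a full or an
# incremental stream.
#   send_cmd DATASET SNAP            -> full stream
#   send_cmd DATASET FROMSNAP SNAP   -> incremental stream (send -i)
send_cmd() {
    if [ $# -eq 3 ]; then
        echo "zfs send -i $1@$2 $1@$3"
    else
        echo "zfs send $1@$2"
    fi
}

send_cmd tank/fs monday            # full stream
send_cmd tank/fs monday tuesday    # incremental stream
```

A send -R of a dataset with several snapshots also carries the intermediate
snapshots, which would explain why it behaves like the -i case.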

-- Rich


Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-08 Thread Brent Jones
On Mon, Jun 8, 2009 at 9:38 PM, Richard Lowe richl...@richlowe.net wrote:
 Brent Jones br...@servuhome.net writes:



 I've had similar issues with similar traces.  I think you're waiting on
 a transaction that's never going to come.

 I thought at the time that I was hitting:
   CR 6367701 hang because tx_state_t is inconsistent

 But given the rash of reports here, it seems perhaps this is something
 different.

 I, like you, hit it when sending snapshots. In my case it seems to be
 specific to incremental streams rather than full streams: I can send
 seemingly any number of full streams, but incremental sends via send -i,
 or send -R of datasets with multiple snapshots, will get into a state
 like that above.

 -- Rich


For now, I'm back on snv_106 (the most stable build I've seen; I like it a lot).
I'll open a case in the morning and see what they suggest.


-- 
Brent Jones
br...@servuhome.net


Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-07 Thread Ian Collins

Ian Collins wrote:

Tim Haley wrote:

Brent Jones wrote:


On the sending side, I CAN kill the ZFS send process, but the remote
side leaves its processes going, and I CANNOT kill -9 them. I also
cannot reboot the receiving system, at init 6, the system will just
hang trying to unmount the file systems.
I have to physically cut power to the server, but a couple days later,
this issue will occur again.


A crash dump from the receiving server with the stuck receives would 
be highly useful, if you can get it. Reboot -d would be best, but it 
might just hang. You can try savecore -L.


I tried a reboot -d (I even had kmem_flags=0xf set), but it did hang. 
I didn't try savecore.


One thing I didn't try was scat on the running system. What should I 
look for (with scat) if this happens again?



I now have a system with a hanging zfs receive, any hints on debugging it?

--
Ian.



Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-05 Thread Ian Collins

Brent Jones wrote:


On the sending side, I CAN kill the ZFS send process, but the remote
side leaves its processes going, and I CANNOT kill -9 them. I also
cannot reboot the receiving system, at init 6, the system will just
hang trying to unmount the file systems.
I have to physically cut power to the server, but a couple days later,
this issue will occur again.

  
I have seen this on Solaris 10.  Something appears to break with a pool 
or filesystem causing zfs receive to hang in the kernel.  Once this 
happens, any zfs command that changes the state of the pool/filesystem 
will hang, including a zpool detach or an init 6.


Can you get truss -p or mdb -p to work on the stuck process?

--
Ian.



Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-05 Thread Brent Jones
On Fri, Jun 5, 2009 at 3:25 PM, Ian Collins i...@ianshome.com wrote:
 Brent Jones wrote:

 On the sending side, I CAN kill the ZFS send process, but the remote
 side leaves its processes going, and I CANNOT kill -9 them. I also
 cannot reboot the receiving system, at init 6, the system will just
 hang trying to unmount the file systems.
 I have to physically cut power to the server, but a couple days later,
 this issue will occur again.



 I have seen this on Solaris 10.  Something appears to break with a pool or
 filesystem causing zfs receive to hang in the kernel.  Once this happens,
 any zfs command that changes the state of the pool/filesystem will hang,
 including a zpool detach or an init 6.

 Can you get truss -p or mdb -p to work on the stuck process?

 --
 Ian.



I cannot.

# truss -p 11308
truss: unanticipated system error: 11308
(r...@pdxfilu02)-(06:29 PM Fri Jun 05)-(log)
# mdb -p 11308
mdb: cannot debug 11308: unanticipated system error
mdb: failed to initialize target: No such file or directory


All the hung zfs receive PIDs have '1' as their PPID.
Is it safe to truss PID 1?  :)

When you saw this, how did you escape it? I've found only pulling the
plug will fix it.
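For anyone wanting to spot the same condition, a small filter over ps -ef
output can list the receives that have been reparented to init. This is a
sketch; orphaned_zfs_recv is a hypothetical name, and it relies only on the
PID and PPID being the second and third columns of ps -ef.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` output on stdin and print the PID of
# every zfs recv/receive process whose parent is init (PPID 1).
# Column 2 is the PID and column 3 is the PPID.
orphaned_zfs_recv() {
    awk '$3 == 1 && (/zfs recv/ || /zfs receive/) { print $2 }'
}

# On a live system:  ps -ef | orphaned_zfs_recv
```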

-- 
Brent Jones
br...@servuhome.net


Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-05 Thread Ian Collins

Brent Jones wrote:

On Fri, Jun 5, 2009 at 3:25 PM, Ian Collins i...@ianshome.com wrote:
  

Brent Jones wrote:


On the sending side, I CAN kill the ZFS send process, but the remote
side leaves its processes going, and I CANNOT kill -9 them. I also
cannot reboot the receiving system, at init 6, the system will just
hang trying to unmount the file systems.
I have to physically cut power to the server, but a couple days later,
this issue will occur again.


  

I have seen this on Solaris 10.  Something appears to break with a pool or
filesystem causing zfs receive to hang in the kernel.  Once this happens,
any zfs command that changes the state of the pool/filesystem will hang,
including a zpool detach or an init 6.

Can you get truss -p or mdb -p to work on the stuck process?


I cannot.

# truss -p 11308
truss: unanticipated system error: 11308
(r...@pdxfilu02)-(06:29 PM Fri Jun 05)-(log)
# mdb -p 11308
mdb: cannot debug 11308: unanticipated system error
mdb: failed to initialize target: No such file or directory

  

Same as me...

All the hung zfs receive PIDs have '1' as their PPID.
Is it safe to truss PID 1?  :)

When you saw this, how did you escape it? I've found only pulling the
plug will fix it.

  
I'm several miles away from the boxes, so I had to resort to a hard 
reset through the ILOM.


I have yet to identify the root cause; all I know is that the problem happens 
sometimes.  Over the past few days I have sent several tens of thousands of 
snapshots to the last system that hung, without incident.


--
Ian.



Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-05 Thread Tim Haley

Brent Jones wrote:

Hello all,
I had been running snv_106 for about 3 or 4 months on a pair of X4540's.
I would ship snapshots from the primary server to the secondary server
nightly, which was working really well.

However, I have upgraded to 2009.06, and my replication scripts appear
to hang when performing zfs send/recv.
When one zfs send/recv process hangs, you cannot send any other
snapshots from any other filesystem to the remote host.
I have about 20 file systems I snapshot and replicate nightly.

The script I use to perform the snapshots is here:
http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh

On the remote side, I end up with many hung processes, like this:

  bjones 11676 11661   0 01:30:03 ?   0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11673 11660   0 01:30:03 ?   0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11664 11653   0 01:30:03 ?   0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 13727 13722   0 14:21:20 ?   0:00 /sbin/zfs recv -vFd pdxfilu02

And so on, one for each file system.

On the receiving end, 'zfs list' shows one filesystem attempting to
receive a snapshot, but I cannot stop it:

$ zfs list
NAME                                     USED  AVAIL  REFER  MOUNTPOINT
pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00



On the sending side, I CAN kill the ZFS send process, but the remote
side leaves its processes going, and I CANNOT kill -9 them. I also
cannot reboot the receiving system, at init 6, the system will just
hang trying to unmount the file systems.
I have to physically cut power to the server, but a couple days later,
this issue will occur again.


A crash dump from the receiving server with the stuck receives would be highly 
useful, if you can get it.  A reboot -d would be best, but it might just hang. 
You can try savecore -L.


-tim
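A sketch of the live-dump route, for reference. The DRY_RUN variable and the
run and live_dump helpers are hypothetical; with the default DRY_RUN=1 the
commands are only printed. Limiting the dump content to kernel pages is an
assumption about what is wanted here, and savecore -L needs a dedicated dump
device configured through dumpadm.

```shell
#!/bin/sh
# Hypothetical switch: 1 prints the commands, 0 runs them (as root).
DRY_RUN=${DRY_RUN:-1}

# Print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

live_dump() {
    # Dump kernel pages only, which keeps the dump smaller on
    # large-memory machines.
    run dumpadm -c kernel
    # Save a dump of the running system without rebooting.
    run savecore -L
}

# Usage: live_dump             (prints the two commands)
#        DRY_RUN=0 live_dump   (actually runs them)
```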


If I boot to my snv_106 BE, everything works fine, this issue has
never occurred on that version.

Any thoughts?





Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-05 Thread Ian Collins

Tim Haley wrote:

Brent Jones wrote:


On the sending side, I CAN kill the ZFS send process, but the remote
side leaves its processes going, and I CANNOT kill -9 them. I also
cannot reboot the receiving system, at init 6, the system will just
hang trying to unmount the file systems.
I have to physically cut power to the server, but a couple days later,
this issue will occur again.


A crash dump from the receiving server with the stuck receives would 
be highly useful, if you can get it.  Reboot -d would be best, but it 
might just hang. You can try savecore -L.


I tried a reboot -d (I even had kmem_flags=0xf set), but it did hang.  I 
didn't try savecore.


One thing I didn't try was scat on the running system.  What should I 
look for (with scat) if this happens again?


--
Ian.



Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-05 Thread Brent Jones
On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley tim.ha...@sun.com wrote:
 Brent Jones wrote:

 Hello all,
 I had been running snv_106 for about 3 or 4 months on a pair of X4540's.
 I would ship snapshots from the primary server to the secondary server
 nightly, which was working really well.

 However, I have upgraded to 2009.06, and my replication scripts appear
 to hang when performing zfs send/recv.
 When one zfs send/recv process hangs, you cannot send any other
 snapshots from any other filesystem to the remote host.
 I have about 20 file systems I snapshot and replicate nightly.

 The script I use to perform the snapshots is here:
 http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh

 On the remote side, I end up with many hung processes, like this:

  bjones 11676 11661   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11673 11660   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11664 11653   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 13727 13722   0 14:21:20 ?           0:00 /sbin/zfs recv -vFd pdxfilu02

 And so on, one for each file system.

 On the receiving end, 'zfs list' shows one filesystem attempting to
 receive a snapshot, but I cannot stop it:

 $ zfs list
 NAME                                     USED  AVAIL  REFER  MOUNTPOINT
 pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00



 On the sending side, I CAN kill the ZFS send process, but the remote
 side leaves its processes going, and I CANNOT kill -9 them. I also
 cannot reboot the receiving system, at init 6, the system will just
 hang trying to unmount the file systems.
 I have to physically cut power to the server, but a couple days later,
 this issue will occur again.


 A crash dump from the receiving server with the stuck receives would be
 highly useful, if you can get it.  Reboot -d would be best, but it might
 just hang. You can try savecore -L.

 -tim

 If I boot to my snv_106 BE, everything works fine, this issue has
 never occurred on that version.

 Any thoughts?




I'm doing a savecore -L, but I have 64GB of RAM, which makes the dumps
a pita to work with.

Is there any additional information I can provide?

-- 
Brent Jones
br...@servuhome.net


Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers

2009-06-05 Thread Brent Jones
On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley tim.ha...@sun.com wrote:
 Brent Jones wrote:

 Hello all,
 I had been running snv_106 for about 3 or 4 months on a pair of X4540's.
 I would ship snapshots from the primary server to the secondary server
 nightly, which was working really well.

 However, I have upgraded to 2009.06, and my replication scripts appear
 to hang when performing zfs send/recv.
 When one zfs send/recv process hangs, you cannot send any other
 snapshots from any other filesystem to the remote host.
 I have about 20 file systems I snapshot and replicate nightly.

 The script I use to perform the snapshots is here:
 http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh

 On the remote side, I end up with many hung processes, like this:

  bjones 11676 11661   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11673 11660   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11664 11653   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 13727 13722   0 14:21:20 ?           0:00 /sbin/zfs recv -vFd pdxfilu02

 And so on, one for each file system.

 On the receiving end, 'zfs list' shows one filesystem attempting to
 receive a snapshot, but I cannot stop it:

 $ zfs list
 NAME                                     USED  AVAIL  REFER  MOUNTPOINT
 pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00



 On the sending side, I CAN kill the ZFS send process, but the remote
 side leaves its processes going, and I CANNOT kill -9 them. I also
 cannot reboot the receiving system, at init 6, the system will just
 hang trying to unmount the file systems.
 I have to physically cut power to the server, but a couple days later,
 this issue will occur again.


 A crash dump from the receiving server with the stuck receives would be
 highly useful, if you can get it.  Reboot -d would be best, but it might
 just hang. You can try savecore -L.

 -tim

 If I boot to my snv_106 BE, everything works fine, this issue has
 never occurred on that version.

 Any thoughts?




Well, I think I found a specific file system that is causing this.
I kicked off a zpool scrub to see if there might be corruption on
either end, but that takes well over 40 hours on these servers.


-- 
Brent Jones
br...@servuhome.net