Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
> After examining the dump we got from you (thanks again), we're relatively sure you are hitting
>
> 6826836 Deadlock possible in dmu_object_reclaim()
>
> This was introduced in nv_111 and fixed in nv_113.
>
> Sorry for the trouble.
>
> -tim

Do you know when new builds will show up on pkg.opensolaris.org/dev?

--
Brent Jones
br...@servuhome.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
Brent Jones wrote:
> On Mon, Jun 8, 2009 at 9:38 PM, Richard Lowe richl...@richlowe.net wrote:
>> Brent Jones br...@servuhome.net writes:
>>
>> I've had similar issues with similar traces. I think you're waiting on a transaction that's never going to come.
>>
>> I thought at the time that I was hitting:
>> CR 6367701 hang because tx_state_t is inconsistent
>>
>> But given the rash of reports here, it seems perhaps this is something different.
>>
>> I, like you, hit it when sending snapshots. It seems (in my case) to be specific to incremental streams rather than full streams: I can send seemingly any number of full streams, but incremental sends via send -i, or send -R of datasets with multiple snapshots, will get into a state like that above.
>>
>> -- Rich
>
> For now, back to snv_106 (the most stable build that I've seen, like it a lot).
> I'll open a case in the morning, and see what they suggest.

After examining the dump we got from you (thanks again), we're relatively sure you are hitting

6826836 Deadlock possible in dmu_object_reclaim()

This was introduced in nv_111 and fixed in nv_113.

Sorry for the trouble.

-tim
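For anyone who wants to check a box against the build range Tim gives (introduced in nv_111, fixed in nv_113), here is a minimal shell sketch. It assumes the build string looks like snv_111b, nv_112 or a bare number, which is what uname -v typically reports on these builds; the function name is_affected_build is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: returns 0 (true) if the given build falls in the
# range described in this thread (introduced in nv_111, fixed in nv_113).
# Assumes a build string such as "snv_111b", "nv_112" or "113".
is_affected_build() {
    n=${1#snv_}            # strip a leading "snv_" if present
    n=${n#nv_}             # or a leading "nv_"
    n=${n%%[!0-9]*}        # drop any trailing suffix such as the "b" in 111b
    case $n in
        '') return 2 ;;    # no build number found
    esac
    [ "$n" -ge 111 ] && [ "$n" -lt 113 ]
}

# Example: is_affected_build "$(uname -v)" && echo "build is in the affected range"
```

With this, snv_111b (the 2009.06 release build) and nv_112 count as affected, while snv_106 and snv_113 do not.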
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Sun, Jun 7, 2009 at 3:50 AM, Ian Collins i...@ianshome.com wrote:
> Ian Collins wrote:
>> Tim Haley wrote:
>>> Brent Jones wrote:
>>>> On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems.
>>>> I have to physically cut power to the server, but a couple days later, this issue will occur again.
>>>
>>> A crash dump from the receiving server with the stuck receives would be highly useful, if you can get it. Reboot -d would be best, but it might just hang. You can try savecore -L.
>>
>> I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang. I didn't try savecore.
>>
>> One thing I didn't try was scat on the running system. What should I look for (with scat) if this happens again?
>
> I now have a system with a hanging zfs receive, any hints on debugging it?
>
> -- Ian.

I haven't figured out a way to identify the problem, still trying to find a 100% way to reproduce this problem.
Seemingly the more snapshots I send at a given time, the likelihood of this happening goes up, but, correlation is not causation :)

I might try to open a support case with Sun (have a support contract), but OpenSolaris doesn't seem to be well understood by the support folks yet, so not sure how far it will get.

--
Brent Jones
br...@servuhome.net
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
> I haven't figured out a way to identify the problem, still trying to find a 100% way to reproduce this problem.
> Seemingly the more snapshots I send at a given time, the likelihood of this happening goes up, but, correlation is not causation :)
>
> I might try to open a support case with Sun (have a support contract), but OpenSolaris doesn't seem to be well understood by the support folks yet, so not sure how far it will get.
>
> --
> Brent Jones
> br...@servuhome.net

I can reproduce this 100% by sending about 6 or more snapshots at once.

Here is some output that JBK helped me put together:

Here is a pastebin 'mdb' findstack output:
http://pastebin.com/m4751b08c

Not sure what I'm looking at, but maybe someone at Sun can see what's going on?

--
Brent Jones
br...@servuhome.net
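For readers wondering how findstack output of this kind is usually gathered: a common approach is to feed a dcmd pipeline to mdb -k on the live kernel. The sketch below only builds that pipeline as a string. It is an assumption about the general technique, not necessarily the exact commands used for the pastebin, and the function name findstack_dcmd is made up for illustration.

```shell
# Hypothetical helper: print an mdb dcmd pipeline that dumps the kernel
# stack of every thread in the process with the given PID.
# The "0t" prefix tells mdb the PID is written in decimal.
findstack_dcmd() {
    printf '0t%s::pid2proc | ::walk thread | ::findstack -v\n' "$1"
}

# Example (as root, on the system with the stuck receive):
#   findstack_dcmd 11308 | mdb -k
```

Because this looks at the thread from the kernel side, it can still work when attaching to the process from userland fails.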
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
Brent Jones br...@servuhome.net writes:
>> I haven't figured out a way to identify the problem, still trying to find a 100% way to reproduce this problem.
>> Seemingly the more snapshots I send at a given time, the likelihood of this happening goes up, but, correlation is not causation :)
>>
>> I might try to open a support case with Sun (have a support contract), but OpenSolaris doesn't seem to be well understood by the support folks yet, so not sure how far it will get.
>>
>> --
>> Brent Jones
>> br...@servuhome.net
>
> I can reproduce this 100% by sending about 6 or more snapshots at once.
>
> Here is some output that JBK helped me put together:
>
> Here is a pastebin 'mdb' findstack output:
> http://pastebin.com/m4751b08c
>
> Not sure what I'm looking at, but maybe someone at Sun can see what's going on?

I've had similar issues with similar traces. I think you're waiting on a transaction that's never going to come.

I thought at the time that I was hitting:
CR 6367701 hang because tx_state_t is inconsistent

But given the rash of reports here, it seems perhaps this is something different.

I, like you, hit it when sending snapshots. It seems (in my case) to be specific to incremental streams rather than full streams: I can send seemingly any number of full streams, but incremental sends via send -i, or send -R of datasets with multiple snapshots, will get into a state like that above.

-- Rich
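To make the full versus incremental distinction concrete, here is a small shell sketch that prints the send half of the pipeline for each case. The dataset and snapshot names (tank/fs01, snap1, snap2) and the function name send_cmd are placeholders invented for illustration; only the zfs send and zfs send -i forms themselves come from the discussion above.

```shell
# Hypothetical helper: print the zfs send command for a full stream, or for
# an incremental stream when a previous snapshot name is supplied.
#   $1 = dataset, $2 = snapshot to send, $3 = optional previous snapshot
send_cmd() {
    fs=$1 snap=$2 prev=${3:-}
    if [ -n "$prev" ]; then
        # Incremental: only the changes between prev and snap are sent.
        printf 'zfs send -i %s@%s %s@%s\n' "$fs" "$prev" "$fs" "$snap"
    else
        # Full: the whole snapshot is sent.
        printf 'zfs send %s@%s\n' "$fs" "$snap"
    fi
}

# Example:
#   send_cmd tank/fs01 snap2          (full stream)
#   send_cmd tank/fs01 snap2 snap1    (incremental stream)
```

Printing the command instead of running it keeps the sketch harmless to try; in a real replication script the output would be piped to a receive on the remote host.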
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Mon, Jun 8, 2009 at 9:38 PM, Richard Lowe richl...@richlowe.net wrote:
> Brent Jones br...@servuhome.net writes:
>
> I've had similar issues with similar traces. I think you're waiting on a transaction that's never going to come.
>
> I thought at the time that I was hitting:
> CR 6367701 hang because tx_state_t is inconsistent
>
> But given the rash of reports here, it seems perhaps this is something different.
>
> I, like you, hit it when sending snapshots. It seems (in my case) to be specific to incremental streams rather than full streams: I can send seemingly any number of full streams, but incremental sends via send -i, or send -R of datasets with multiple snapshots, will get into a state like that above.
>
> -- Rich

For now, back to snv_106 (the most stable build that I've seen, like it a lot).
I'll open a case in the morning, and see what they suggest.

--
Brent Jones
br...@servuhome.net
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
Ian Collins wrote:
> Tim Haley wrote:
>> Brent Jones wrote:
>>> On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems.
>>> I have to physically cut power to the server, but a couple days later, this issue will occur again.
>>
>> A crash dump from the receiving server with the stuck receives would be highly useful, if you can get it. Reboot -d would be best, but it might just hang. You can try savecore -L.
>
> I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang. I didn't try savecore.
>
> One thing I didn't try was scat on the running system. What should I look for (with scat) if this happens again?

I now have a system with a hanging zfs receive, any hints on debugging it?

-- Ian.
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
Brent Jones wrote:
> On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems.
> I have to physically cut power to the server, but a couple days later, this issue will occur again.

I have seen this on Solaris 10. Something appears to break with a pool or filesystem causing zfs receive to hang in the kernel. Once this happens, any zfs command that changes the state of the pool/filesystem will hang, including a zpool detach or an init 6.

Can you get truss -p or mdb -p to work on the stuck process?

-- Ian.
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Fri, Jun 5, 2009 at 3:25 PM, Ian Collins i...@ianshome.com wrote:
> Brent Jones wrote:
>> On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems.
>> I have to physically cut power to the server, but a couple days later, this issue will occur again.
>
> I have seen this on Solaris 10. Something appears to break with a pool or filesystem causing zfs receive to hang in the kernel. Once this happens, any zfs command that changes the state of the pool/filesystem will hang, including a zpool detach or an init 6.
>
> Can you get truss -p or mdb -p to work on the stuck process?
>
> -- Ian.

I cannot.

# truss -p 11308
truss: unanticipated system error: 11308
(r...@pdxfilu02)-(06:29 PM Fri Jun 05)-(log)
# mdb -p 11308
mdb: cannot debug 11308: unanticipated system error
mdb: failed to initialize target: No such file or directory

All the hung zfs receive PIDs have '1' as their PPID. Is it safe to truss PID 1? :)

When you saw this, how did you escape it? I've found only pulling the plug will fix it.

--
Brent Jones
br...@servuhome.net
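The PPID observation above can be checked mechanically. Below is a minimal shell sketch that reads ps -ef output and prints the PID of every zfs recv process whose parent is PID 1. It assumes the usual ps -ef column order (UID, PID, PPID, ...), and the function name orphaned_recv_pids is made up for illustration.

```shell
# Hypothetical helper: read `ps -ef` style output on stdin and print the PID
# of every "zfs recv" process whose parent PID is 1.
# Assumes the usual column order: UID PID PPID C STIME TTY TIME CMD.
orphaned_recv_pids() {
    awk '$3 == 1 && index($0, "zfs recv") > 0 { print $2 }'
}

# Example: ps -ef | orphaned_recv_pids
```

This only lists candidates; a receive that has been re-parented to PID 1 is not necessarily hung, so the list is a starting point for inspection rather than a diagnosis.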
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
Brent Jones wrote:
> On Fri, Jun 5, 2009 at 3:25 PM, Ian Collins i...@ianshome.com wrote:
>> Brent Jones wrote:
>>> On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems.
>>> I have to physically cut power to the server, but a couple days later, this issue will occur again.
>>
>> I have seen this on Solaris 10. Something appears to break with a pool or filesystem causing zfs receive to hang in the kernel. Once this happens, any zfs command that changes the state of the pool/filesystem will hang, including a zpool detach or an init 6.
>>
>> Can you get truss -p or mdb -p to work on the stuck process?
>
> I cannot.
>
> # truss -p 11308
> truss: unanticipated system error: 11308
> (r...@pdxfilu02)-(06:29 PM Fri Jun 05)-(log)
> # mdb -p 11308
> mdb: cannot debug 11308: unanticipated system error
> mdb: failed to initialize target: No such file or directory

Same as me...

> All the hung zfs receive PIDs have '1' as their PPID. Is it safe to truss PID 1? :)
>
> When you saw this, how did you escape it? I've found only pulling the plug will fix it.

I'm several miles away from the boxes, so I had to resort to a hard reset through the ILOM.

I have yet to identify the root cause, all I know is the problem happens sometimes. I have sent over several tens of thousands of snapshots to the last system that hung over the past few days without incident.

-- Ian.
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
Brent Jones wrote:
> Hello all,
> I had been running snv_106 for about 3 or 4 months on a pair of X4540's. I would ship snapshots from the primary server to the secondary server nightly, which was working really well.
>
> However, I have upgraded to 2009.06, and my replication scripts appear to hang when performing zfs send/recv. When one zfs send/recv process hangs, you cannot send any other snapshots from any other filesystem to the remote host. I have about 20 file systems I snapshot and replicate nightly.
>
> The script I use to perform the snapshots is here:
> http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh
>
> On the remote side, I end up with many hung processes, like this:
>
> bjones 11676 11661 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
> bjones 11673 11660 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
> bjones 11664 11653 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
> bjones 13727 13722 0 14:21:20 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
>
> And so on, one for each file system.
>
> On the receiving end, 'zfs list' shows one filesystem attempting to receive a snapshot, but I cannot stop it:
>
> $ zfs list
> NAME                                     USED  AVAIL  REFER  MOUNTPOINT
> pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00
>
> On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems.
> I have to physically cut power to the server, but a couple days later, this issue will occur again.

A crash dump from the receiving server with the stuck receives would be highly useful, if you can get it. Reboot -d would be best, but it might just hang. You can try savecore -L.

-tim

> If I boot to my snv_106 BE, everything works fine, this issue has never occurred on that version.
>
> Any thoughts?
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
Tim Haley wrote:
> Brent Jones wrote:
>> On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems.
>> I have to physically cut power to the server, but a couple days later, this issue will occur again.
>
> A crash dump from the receiving server with the stuck receives would be highly useful, if you can get it. Reboot -d would be best, but it might just hang. You can try savecore -L.

I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang. I didn't try savecore.

One thing I didn't try was scat on the running system. What should I look for (with scat) if this happens again?

-- Ian.
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley tim.ha...@sun.com wrote:
> Brent Jones wrote:
>> Hello all,
>> I had been running snv_106 for about 3 or 4 months on a pair of X4540's. I would ship snapshots from the primary server to the secondary server nightly, which was working really well.
>>
>> However, I have upgraded to 2009.06, and my replication scripts appear to hang when performing zfs send/recv. When one zfs send/recv process hangs, you cannot send any other snapshots from any other filesystem to the remote host. I have about 20 file systems I snapshot and replicate nightly.
>>
>> The script I use to perform the snapshots is here:
>> http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh
>>
>> On the remote side, I end up with many hung processes, like this:
>>
>> bjones 11676 11661 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
>> bjones 11673 11660 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
>> bjones 11664 11653 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
>> bjones 13727 13722 0 14:21:20 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
>>
>> And so on, one for each file system.
>>
>> On the receiving end, 'zfs list' shows one filesystem attempting to receive a snapshot, but I cannot stop it:
>>
>> $ zfs list
>> NAME                                     USED  AVAIL  REFER  MOUNTPOINT
>> pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00
>>
>> On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems.
>> I have to physically cut power to the server, but a couple days later, this issue will occur again.
>
> A crash dump from the receiving server with the stuck receives would be highly useful, if you can get it. Reboot -d would be best, but it might just hang. You can try savecore -L.
>
> -tim
>
>> If I boot to my snv_106 BE, everything works fine, this issue has never occurred on that version.
>>
>> Any thoughts?

I'm doing a savecore -L, but I have 64GB of RAM, which makes the dumps a pita to work with.

Is there any additional information I can provide?

--
Brent Jones
br...@servuhome.net
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley tim.ha...@sun.com wrote:
> Brent Jones wrote:
>> Hello all,
>> I had been running snv_106 for about 3 or 4 months on a pair of X4540's. I would ship snapshots from the primary server to the secondary server nightly, which was working really well.
>>
>> However, I have upgraded to 2009.06, and my replication scripts appear to hang when performing zfs send/recv. When one zfs send/recv process hangs, you cannot send any other snapshots from any other filesystem to the remote host. I have about 20 file systems I snapshot and replicate nightly.
>>
>> The script I use to perform the snapshots is here:
>> http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh
>>
>> On the remote side, I end up with many hung processes, like this:
>>
>> bjones 11676 11661 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
>> bjones 11673 11660 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
>> bjones 11664 11653 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
>> bjones 13727 13722 0 14:21:20 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
>>
>> And so on, one for each file system.
>>
>> On the receiving end, 'zfs list' shows one filesystem attempting to receive a snapshot, but I cannot stop it:
>>
>> $ zfs list
>> NAME                                     USED  AVAIL  REFER  MOUNTPOINT
>> pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00
>>
>> On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems.
>> I have to physically cut power to the server, but a couple days later, this issue will occur again.
>
> A crash dump from the receiving server with the stuck receives would be highly useful, if you can get it. Reboot -d would be best, but it might just hang. You can try savecore -L.
>
> -tim
>
>> If I boot to my snv_106 BE, everything works fine, this issue has never occurred on that version.
>>
>> Any thoughts?

Well, I think I found a specific file system that is causing this.
I kicked off a zpool scrub to see if there might be corruption on either end, but that takes well over 40 hours on these servers.

--
Brent Jones
br...@servuhome.net