Re: [zfs-discuss] Hanging receive
On Wed, Jul 08, 2009 at 08:43:17AM +1200, Ian Collins wrote:
> Ian Collins wrote:
>> Brent Jones wrote:
>>> On Fri, Jul 3, 2009 at 8:31 PM, Ian Collins <i...@ianshome.com> wrote:
>>>> Ian Collins wrote:
>>>>> I was doing an incremental send between pools; the receive side is
>>>>> locked up and no zfs/zpool commands work on that pool. The stacks
>>>>> look different from those reported in the earlier "ZFS snapshot
>>>>> send/recv hangs X4540 servers" thread. Here is the process
>>>>> information from scat (other commands hanging on the pool are also
>>>>> in cv_wait):
>>>> Has anyone else seen anything like this? The box wouldn't even
>>>> reboot; it had to be power cycled. It locks up on receive regularly
>>>> now.
>>> I hit this too: 6826836
>>> Fixed in 117
>>> http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120
>> I don't think this is the same problem (which is why I started a new
>> thread); a single incremental set will eventually lock the pool up,
>> pretty much guaranteed each time.
> One more data point: this didn't happen when I had a single pool (a
> stripe of mirrors) on the server. It started happening when I split the
> mirrors and created a second pool built from 3 x 8-drive raidz2 vdevs.
> Sending to the new pool (either locally or from another machine) causes
> the hangs.

And here are my data points:

We were running two X4500s under Nevada 112 and came across this issue on both of them. On receiving much data through a ZFS receive, they would lock up. Any zpool or zfs command would hang and was unkillable. The only way to resolve the situation was to reboot without syncing disks. I reported this in some posts back in April (http://opensolaris.org/jive/click.jspa?searchID=2021762&messageID=368524).

One of them had an old enough zpool and zfs version to down/up/sidegrade to Solaris 10 u6, so I made this change. The thumper running Solaris 10 is now mostly fine - it normally receives an hourly snapshot with no problem. The thumper running 112 has continued to experience the issues described by Ian and others.
I've just upgraded to 117 and am having even more issues - I'm unable to receive or roll back snapshots. Instead I see:

506 r...@thumper1:~ cat snap | zfs receive -vF thumperpool
receiving incremental stream of vlepool/m...@200906182000 into thumperp...@200906182000
cannot receive incremental stream: most recent snapshot of thumperpool does not match incremental source

511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
cannot destroy 'thumperpool/m...@200906181900': dataset already exists

As a result, I'm a bit scuppered. I'm going to try going back to my 112 installation instead to see if that resolves any of my issues.

All of our thumpers have the following disk configuration:
4 x 11-disk raidz2 arrays with 2 disks as hot spares, in a single pool.
2 disks in a mirror for booting.

When zpool locks up on the main pool, I'm still able to get a zpool status on the boot pool. I can't access any data on the pool which is locked up.

Andrew

--
Systems Developer
e: andrew.nic...@luns.net.uk
im: a.nic...@jabber.lancs.ac.uk
t: +44 (0)1524 5 10147

Lancaster University Network Services is a limited company registered in England and Wales. Registered number: 4311892. Registered office: University House, Lancaster University, Lancaster, LA1 4YW

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
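For what it's worth, a "dataset already exists" failure when destroying or rolling back over a snapshot is often caused by a clone, or on builds that support them, a user hold pinning that snapshot. Below is a minimal sketch of how one might check - dataset names are hypothetical placeholders, and this is a guess at the cause, not a confirmed diagnosis of this bug:

```shell
#!/bin/sh
# Sketch: look for things that can pin a snapshot and make
# "zfs destroy" / "zfs rollback -r" fail with "dataset already exists".
# All names here are hypothetical.
check_snapshot_blockers() {
    snap=$1                 # e.g. tank/home@hourly
    pool=${snap%%/*}
    if command -v zfs >/dev/null 2>&1; then
        # Clones whose origin is this snapshot block its destruction.
        zfs list -H -o name,origin -r "$pool" | \
            awk -v s="$snap" '$2 == s { print "clone: " $1 }'
        # User holds (newer builds only) also block destruction.
        zfs holds "$snap" 2>/dev/null
    else
        echo "zfs unavailable: would check clones and holds for $snap"
    fi
}

check_snapshot_blockers "tank/home@hourly"
```

If a clone shows up, promoting or destroying it first usually lets the rollback or destroy proceed.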
Re: [zfs-discuss] Hanging receive
Andrew Robert Nicols wrote:
> On Wed, Jul 08, 2009 at 08:43:17AM +1200, Ian Collins wrote:
> [...]
> And here are my data points: we were running two X4500s under Nevada 112
> and came across this issue on both of them. On receiving much data
> through a ZFS receive, they would lock up. Any zpool or zfs command
> would hang and was unkillable. The only way to resolve the situation was
> to reboot without syncing disks.
> [...]
> The thumper running Solaris 10 is now mostly fine - it normally receives
> an hourly snapshot with no problem. The thumper running 112 has
> continued to experience the issues described by Ian and others.
> I've just upgraded to 117 and am having even more issues - I'm unable to
> receive or roll back snapshots. Instead I see:
>
> 506 r...@thumper1:~ cat snap | zfs receive -vF thumperpool
> receiving incremental stream of vlepool/m...@200906182000 into thumperp...@200906182000
> cannot receive incremental stream: most recent snapshot of thumperpool does not match incremental source
>
> 511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
> cannot destroy 'thumperpool/m...@200906181900': dataset already exists

Thanks for the additional data Andrew. Can you do a zfs destroy of thumperpool/m...@200906181900?

--
Ian.
Re: [zfs-discuss] Hanging receive
On Wed, Jul 08, 2009 at 08:31:54PM +1200, Ian Collins wrote:
> Andrew Robert Nicols wrote:
>> The thumper running 112 has continued to experience the issues
>> described by Ian and others. I've just upgraded to 117 and am having
>> even more issues - I'm unable to receive or roll back snapshots.
>> Instead I see:
>> [...]
>> 511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
>> cannot destroy 'thumperpool/m...@200906181900': dataset already exists
> Thanks for the additional data Andrew. Can you do a zfs destroy of
> thumperpool/m...@200906181900?

I'm afraid not:

503 r...@thumper1:~ zfs destroy thumperpool/m...@200906181900
cannot destroy 'thumperpool/m...@200906181900': dataset already exists

Andrew
Re: [zfs-discuss] Hanging receive
On Wed, Jul 08, 2009 at 09:41:12AM +0100, Andrew Robert Nicols wrote:
> On Wed, Jul 08, 2009 at 08:31:54PM +1200, Ian Collins wrote:
>> Thanks for the additional data Andrew. Can you do a zfs destroy of
>> thumperpool/m...@200906181900?
> I'm afraid not:
>
> 503 r...@thumper1:~ zfs destroy thumperpool/m...@200906181900
> cannot destroy 'thumperpool/m...@200906181900': dataset already exists

Moving back to Nevada 112, I'm once again able to receive snapshots and destroy datasets as appropriate - thank goodness! However, I'm fairly sure that in a few hours, with the volume of data I'm sending, I'll see zfs hang. Can anyone on the list suggest some diagnostics which may be of use when this happens?

Thanks in advance,

Andrew Nicols
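In case it helps others hitting the same wedge: the most useful diagnostic captured earlier in this thread was the kernel stack of the hung receive (showing txg_wait_synced), so gathering kernel thread stacks while the pool is stuck is a good start. Below is a rough sketch - it assumes mdb is present as root on the affected box; the ::stacks dcmd only exists in newer mdb builds, while ::threadlist is available everywhere:

```shell
#!/bin/sh
# Sketch: gather kernel-side state while the pool is wedged (run as root).
# Read-only; dcmd availability varies by Solaris/Nevada build.
collect_hang_diagnostics() {
    if command -v mdb >/dev/null 2>&1; then
        echo "::stacks -m zfs" | mdb -k      # unique stacks of zfs-module threads
        echo "::threadlist -v" | mdb -k      # every kernel thread, with stacks
    else
        echo "mdb unavailable: run ::stacks -m zfs and ::threadlist -v under mdb -k on the host"
    fi
}

collect_hang_diagnostics
```

Capturing this output both before and during a hang makes it easier to see which threads are newly blocked in cv_wait.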
Re: [zfs-discuss] Hanging receive
Ian Collins wrote:
> Brent Jones wrote:
>> On Fri, Jul 3, 2009 at 8:31 PM, Ian Collins <i...@ianshome.com> wrote:
>>> Ian Collins wrote:
>>>> I was doing an incremental send between pools; the receive side is
>>>> locked up and no zfs/zpool commands work on that pool. The stacks
>>>> look different from those reported in the earlier "ZFS snapshot
>>>> send/recv hangs X4540 servers" thread. Here is the process
>>>> information from scat (other commands hanging on the pool are also
>>>> in cv_wait):
>>> Has anyone else seen anything like this? The box wouldn't even
>>> reboot; it had to be power cycled. It locks up on receive regularly
>>> now.
>> I hit this too: 6826836
>> Fixed in 117
>> http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120
> I don't think this is the same problem (which is why I started a new
> thread); a single incremental set will eventually lock the pool up,
> pretty much guaranteed each time.

One more data point: this didn't happen when I had a single pool (a stripe of mirrors) on the server. It started happening when I split the mirrors and created a second pool built from 3 x 8-drive raidz2 vdevs. Sending to the new pool (either locally or from another machine) causes the hangs.

--
Ian.
Re: [zfs-discuss] Hanging receive
Ian Collins wrote:
> I was doing an incremental send between pools; the receive side is
> locked up and no zfs/zpool commands work on that pool. The stacks look
> different from those reported in the earlier "ZFS snapshot send/recv
> hangs X4540 servers" thread. Here is the process information from scat
> (other commands hanging on the pool are also in cv_wait):

Has anyone else seen anything like this? The box wouldn't even reboot; it had to be power cycled. It locks up on receive regularly now.

> SolarisCAT(live/10X)> proc -L 18500
> [...]
> genunix:cv_wait+0x68()
> zfs:txg_wait_synced+0x83()
> zfs:dsl_sync_task_group_wait+0xed()
> zfs:dsl_sync_task_do+0x54()
> zfs:dmu_objset_create+0xc5()
> zfs:zfs_ioc_create+0xee()
> zfs:zfsdev_ioctl+0x14c()
> [...]
>
> The box is an x4500, Solaris 10u7.

--
Ian.
Re: [zfs-discuss] Hanging receive
On Fri, Jul 3, 2009 at 8:31 PM, Ian Collins <i...@ianshome.com> wrote:
> Ian Collins wrote:
>> I was doing an incremental send between pools; the receive side is
>> locked up and no zfs/zpool commands work on that pool. The stacks look
>> different from those reported in the earlier "ZFS snapshot send/recv
>> hangs X4540 servers" thread. Here is the process information from scat
>> (other commands hanging on the pool are also in cv_wait):
> Has anyone else seen anything like this? The box wouldn't even reboot;
> it had to be power cycled. It locks up on receive regularly now.
>
> [...]
>
> The box is an x4500, Solaris 10u7.
>
> --
> Ian.

I hit this too: 6826836
Fixed in 117
http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120

--
Brent Jones
br...@servuhome.net
Re: [zfs-discuss] Hanging receive
Brent Jones wrote:
> On Fri, Jul 3, 2009 at 8:31 PM, Ian Collins <i...@ianshome.com> wrote:
>> Ian Collins wrote:
>>> I was doing an incremental send between pools; the receive side is
>>> locked up and no zfs/zpool commands work on that pool. The stacks
>>> look different from those reported in the earlier "ZFS snapshot
>>> send/recv hangs X4540 servers" thread. Here is the process
>>> information from scat (other commands hanging on the pool are also
>>> in cv_wait):
>> Has anyone else seen anything like this? The box wouldn't even reboot;
>> it had to be power cycled. It locks up on receive regularly now.
> I hit this too: 6826836
> Fixed in 117
> http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120

I don't think this is the same problem (which is why I started a new thread); a single incremental set will eventually lock the pool up, pretty much guaranteed each time.

--
Ian.
[zfs-discuss] Hanging receive
I was doing an incremental send between pools; the receive side is locked up and no zfs/zpool commands work on that pool. The stacks look different from those reported in the earlier "ZFS snapshot send/recv hangs X4540 servers" thread. Here is the process information from scat (other commands hanging on the pool are also in cv_wait):

SolarisCAT(live/10X)> proc -L 18500
  addr              PID    PPID  RUID/UID     size      RSS   swresv  time  command
================== ====== ====== ======== ======== ======== ======== ===== =========
0xffc8d1990398      18500  14729        0  5369856  2813952  1064960    32  zfs receive -v -d backup

user (LWP_SYS) thread: 0xfe84e0d5bc20  PID: 18500
cmd: zfs receive -v -d backup
t_wchan: 0xa0ed62a2  sobj: condition var (from zfs:txg_wait_synced+0x83)
t_procp: 0xffc8d1990398  p_as: 0xfee19d29c810  size: 5369856  RSS: 2813952
hat: 0xfedb762d2818  cpuset:  zone: global
t_stk: 0xfe8000143f10  sp: 0xfe8000143b10  t_stkbase: 0xfe800013f000
t_pri: 59(TS)  pctcpu: 0.00
t_lwp: 0xfe84e92d6ec0  lwp_regs: 0xfe8000143f10
mstate: LMS_SLEEP  ms_prev: LMS_SYSTEM
ms_state_start: 15 minutes 4.476756638 seconds earlier
ms_start: 15 minutes 8.447715668 seconds earlier
psrset: 0  last CPU: 2
idle: 102425 ticks (17 minutes 4.25 seconds)
start: Thu Jul 2 22:23:06 2009
age: 1029 seconds (17 minutes 9 seconds)
syscall: #54 ioctl(, 0x0) (sysent: genunix:ioctl+0x0)
tstate: TS_SLEEP - awaiting an event
tflg:   T_DFLTSTK - stack is default size
tpflg:  TP_TWAIT - wait to be freed by lwp_wait
        TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SKILLED - SIGKILL has been posted to the process
        SMSACCT - process is keeping micro-state accounting
        SMSFORK - child inherits micro-state accounting

pc: unix:_resume_from_idle+0xf8 resume_return: addq $0x8,%rsp

unix:_resume_from_idle+0xf8 resume_return()
unix:swtch+0x12a()
genunix:cv_wait+0x68()
zfs:txg_wait_synced+0x83()
zfs:dsl_sync_task_group_wait+0xed()
zfs:dsl_sync_task_do+0x54()
zfs:dmu_objset_create+0xc5()
zfs:zfs_ioc_create+0xee()
zfs:zfsdev_ioctl+0x14c()
genunix:cdev_ioctl+0x1d()
specfs:spec_ioctl+0x50()
genunix:fop_ioctl+0x25()
genunix:ioctl+0xac()
unix:_syscall32_save+0xbf()
-- switch to user thread's user stack --

The box is an x4500, Solaris 10u7.

--
Ian.
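For anyone trying to reproduce: the workload that wedges the pool is plain pool-to-pool incremental replication. A sketch of the command pattern involved, with hypothetical pool and snapshot names standing in for the ones above:

```shell
#!/bin/sh
# Sketch: incremental send from one pool, received into another.
# Pool, filesystem and snapshot names are hypothetical placeholders.
replicate_incremental() {
    fs=$1; prev=$2; cur=$3; dstpool=$4
    if command -v zfs >/dev/null 2>&1; then
        # -i sends only the delta between the two snapshots;
        # -d on receive recreates the dataset path under the target pool.
        zfs send -i "@$prev" "$fs@$cur" | zfs receive -v -d "$dstpool"
    else
        echo "zfs unavailable: would run: zfs send -i @$prev $fs@$cur | zfs receive -v -d $dstpool"
    fi
}

replicate_incremental tank/data 200907010000 200907020000 backup
```

Repeating this for each successive snapshot pair is the hourly replication loop described earlier in the thread.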