Re: [zfs-discuss] Hanging receive

2009-07-08 Thread Andrew Robert Nicols
On Wed, Jul 08, 2009 at 08:43:17AM +1200, Ian Collins wrote:
 Ian Collins wrote:
 Brent Jones wrote:
 On Fri, Jul 3, 2009 at 8:31 PM, Ian Collins i...@ianshome.com wrote:
  
 Ian Collins wrote:

 I was doing an incremental send between pools, the receive side is
 locked up and no zfs/zpool commands work on that pool.

 The stacks look different from those reported in the earlier ZFS
 snapshot send/recv hangs X4540 servers thread.

 Here is the process information from scat (other commands hanging on
 the pool are also in cv_wait):

   
 Has anyone else seen anything like this?  The box wouldn't even
 reboot, it had to be power cycled.  It locks up on receive regularly
 now.

 I hit this too:
 6826836

 Fixed in 117

 http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120
   
 I don't think this is the same problem (which is why I started a new
 thread); a single incremental set will eventually lock the pool up,
 pretty much guaranteed each time.

 One more data point: 

 This didn't happen when I had a single pool (stripe of mirrors) on the  
 server.  It started happening when I split the mirrors and created a  
 second pool built from three 8-drive raidz2 vdevs.  Sending to the new pool
 (either locally or from another machine) causes the hangs.

And here are my data points:

We were running two X4500s under Nevada 112 but came across this issue on
both of them. When receiving a large amount of data through a zfs receive,
they would lock up: any zpool or zfs commands would hang and were unkillable.
The only way to resolve the situation was to reboot without syncing disks. I
reported this in some posts back in April
(http://opensolaris.org/jive/click.jspa?searchID=2021762&messageID=368524)

One of them had an old enough zpool and zfs version to down/up/sidegrade to
Solaris 10 u6 and so I made this change.
The thumper running Solaris 10 is now mostly fine - it normally receives an
hourly snapshot with no problem.
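
For context, the hourly replication is just the usual incremental
send/receive pipeline over ssh - roughly the following (the dataset names
below are placeholders, not our real ones):

  # on the sender: take the new snapshot, then send the delta since the last one
  zfs snapshot vlepool/data@200907081000
  zfs send -i vlepool/data@200907080900 vlepool/data@200907081000 | \
      ssh thumper1 zfs receive -vFd thumperpool

The -F rolls the receiving filesystem back to its most recent snapshot
before the stream is applied, and -d recreates the source dataset path
(minus the pool name) underneath thumperpool.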

The thumper running 112 has continued to experience the issues described by
Ian and others. I've just upgraded to 117 and am having even more issues -
I'm unable to receive or roll back snapshots; instead I see:

506 r...@thumper1:~ cat snap | zfs receive -vF thumperpool
receiving incremental stream of vlepool/m...@200906182000 into 
thumperp...@200906182000
cannot receive incremental stream: most recent snapshot of thumperpool does not
match incremental source
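
The obvious first check is whether the newest snapshot on the receiving side
is actually the one the incremental stream expects - something like this
(dataset name is a placeholder):

  # newest snapshots on the receiving side (sorted by creation, newest last)
  zfs list -t snapshot -o name,creation -s creation -r thumperpool | tail -5

It isn't in my case, hence the rollback attempt: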

511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
cannot destroy 'thumperpool/m...@200906181900': dataset already exists

As a result, I'm a bit scuppered. I'm going to try going back to my 112
installation instead to see if that resolves any of my issues.

All of our thumpers have the following disk configuration:
4 x 11-disk raidz2 arrays with 2 disks as hot spares in a single pool.
2 disks in a mirror for booting.
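
(Something like the following would build a pool of that shape - the device
names here are invented for illustration, not our actual layout:

  zpool create thumperpool \
      raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c1t0d0 c1t1d0 c1t2d0 \
      raidz2 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
      raidz2 c2t6d0 c2t7d0 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c4t0d0 \
      raidz2 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c5t0d0 c5t1d0 c5t2d0 c5t3d0 \
      spare c5t4d0 c5t5d0

with the remaining two disks mirrored for booting.)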

When the main pool locks up, I'm still able to get a zpool status on the
boot pool, but I can't access any data on the locked-up pool.

Andrew

-- 
Systems Developer

e: andrew.nic...@luns.net.uk
im: a.nic...@jabber.lancs.ac.uk
t: +44 (0)1524 5 10147

Lancaster University Network Services is a limited company registered in
England and Wales. Registered number: 4311892. Registered office:
University House, Lancaster University, Lancaster, LA1 4YW


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hanging receive

2009-07-08 Thread Ian Collins

Andrew Robert Nicols wrote:

The thumper running 112 has continued to experience the issues described by
Ian and others. I've just upgraded to 117 and am having even more issues -
I'm unable to receive or roll back snapshots; instead I see:

506 r...@thumper1:~ cat snap | zfs receive -vF thumperpool
receiving incremental stream of vlepool/m...@200906182000 into 
thumperp...@200906182000
cannot receive incremental stream: most recent snapshot of thumperpool does not
match incremental source

511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
cannot destroy 'thumperpool/m...@200906181900': dataset already exists

  

Thanks for the additional data Andrew.

Can you do a zfs destroy of thumperpool/m...@200906181900?

--
Ian.



Re: [zfs-discuss] Hanging receive

2009-07-08 Thread Andrew Robert Nicols
On Wed, Jul 08, 2009 at 08:31:54PM +1200, Ian Collins wrote:
 Andrew Robert Nicols wrote:

 The thumper running 112 has continued to experience the issues described by
 Ian and others. I've just upgraded to 117 and am having even more issues -
 I'm unable to receive or roll back snapshots; instead I see:

 506 r...@thumper1:~ cat snap | zfs receive -vF thumperpool
 receiving incremental stream of vlepool/m...@200906182000 into 
 thumperp...@200906182000
 cannot receive incremental stream: most recent snapshot of thumperpool does 
 not
 match incremental source

 511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
 cannot destroy 'thumperpool/m...@200906181900': dataset already exists

   
 Thanks for the additional data Andrew.

 Can you do a zfs destroy of thumperpool/m...@200906181900?

I'm afraid not:

503 r...@thumper1:~ zfs destroy thumperpool/m...@200906181900
cannot destroy 'thumperpool/m...@200906181900': dataset already exists
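
For what it's worth, the two things I plan to look at next are whether some
clone still references that snapshot, and whether an interrupted receive has
left a dataset behind that zfs list doesn't show - roughly along these lines
(the zdb usage is from memory, so treat it as a sketch):

  # does anything list the stuck snapshot as its origin (i.e. is it cloned)?
  zfs get -r -o name,value origin thumperpool | grep 200906181900

  # dump the pool's dataset list, including ones hidden from zfs list
  zdb -d thumperpool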

Andrew



Re: [zfs-discuss] Hanging receive

2009-07-08 Thread Andrew Robert Nicols
On Wed, Jul 08, 2009 at 09:41:12AM +0100, Andrew Robert Nicols wrote:
 On Wed, Jul 08, 2009 at 08:31:54PM +1200, Ian Collins wrote:

  Thanks for the additional data Andrew.
 
  Can you do a zfs destroy of thumperpool/m...@200906181900?
 
 I'm afraid not:
 
 503 r...@thumper1:~ zfs destroy thumperpool/m...@200906181900
 cannot destroy 'thumperpool/m...@200906181900': dataset already exists

Moving back to Nevada 112, I'm once again able to receive snapshots and
destroy datasets as appropriate - thank goodness!

However, I'm fairly sure that in a few hours, with the volume of data I'm
sending, I'll see zfs hang.

Can anyone on the list suggest some diagnostics which may be of use when
this happens?
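
What I'm intending to capture myself next time it wedges is the kernel stack
of each stuck zfs/zpool process plus a live crash dump - something like this
(run as root; the mdb syntax is from memory, so treat it as a sketch):

  # kernel stacks for every thread of any hung zfs or zpool command
  echo "::pgrep zfs | ::walk thread | ::findstack -v" | mdb -k
  echo "::pgrep zpool | ::walk thread | ::findstack -v" | mdb -k

  # grab a live crash dump for later analysis (needs a dump device configured)
  savecore -L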

Thanks in advance,

Andrew Nicols



Re: [zfs-discuss] Hanging receive

2009-07-07 Thread Ian Collins

Ian Collins wrote:

Brent Jones wrote:

I hit this too:
6826836

Fixed in 117

http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120
  
I don't think this is the same problem (which is why I started a new
thread); a single incremental set will eventually lock the pool up,
pretty much guaranteed each time.


One more data point: 

This didn't happen when I had a single pool (stripe of mirrors) on the 
server.  It started happening when I split the mirrors and created a 
second pool built from three 8-drive raidz2 vdevs.  Sending to the new pool
(either locally or from another machine) causes the hangs.


--
Ian.



Re: [zfs-discuss] Hanging receive

2009-07-03 Thread Ian Collins

Ian Collins wrote:
I was doing an incremental send between pools, the receive side is 
locked up and no zfs/zpool commands work on that pool.


The stacks look different from those reported in the earlier ZFS 
snapshot send/recv hangs X4540 servers thread.


Here is the process information from scat (other commands hanging on 
the pool are also in cv_wait):


Has anyone else seen anything like this?  The box wouldn't even reboot, 
it had to be power cycled.  It locks up on receive regularly now.







--
Ian.



Re: [zfs-discuss] Hanging receive

2009-07-03 Thread Brent Jones
On Fri, Jul 3, 2009 at 8:31 PM, Ian Collins i...@ianshome.com wrote:
 Ian Collins wrote:

 I was doing an incremental send between pools, the receive side is locked
 up and no zfs/zpool commands work on that pool.

 The stacks look different from those reported in the earlier ZFS snapshot
 send/recv hangs X4540 servers thread.

 Here is the process information from scat (other commands hanging on the
 pool are also in cv_wait):

 Has anyone else seen anything like this?  The box wouldn't even reboot, it
 had to be power cycled.  It locks up on receive regularly now.



I hit this too:
6826836

Fixed in 117

http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120


-- 
Brent Jones
br...@servuhome.net


Re: [zfs-discuss] Hanging receive

2009-07-03 Thread Ian Collins

Brent Jones wrote:


I hit this too:
6826836

Fixed in 117

http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120
  
I don't think this is the same problem (which is why I started a new
thread); a single incremental set will eventually lock the pool up,
pretty much guaranteed each time.


--
Ian.



[zfs-discuss] Hanging receive

2009-07-02 Thread Ian Collins
I was doing an incremental send between pools, the receive side is 
locked up and no zfs/zpool commands work on that pool.


The stacks look different from those reported in the earlier ZFS 
snapshot send/recv hangs X4540 servers thread.


Here is the process information from scat (other commands hanging on the 
pool are also in cv_wait):


SolarisCAT(live/10X) proc -L 18500
          addr     PID   PPID  RUID/UID     size      RSS   swresv  time  command
==============  ======  =====  ========  =======  =======  =======  ====  ========================
0xffc8d1990398   18500  14729         0  5369856  2813952  1064960    32  zfs receive -v -d backup


 user (LWP_SYS) thread: 0xfe84e0d5bc20  PID: 18500 
cmd: zfs receive -v -d backup
t_wchan: 0xa0ed62a2  sobj: condition var (from 
zfs:txg_wait_synced+0x83)

t_procp: 0xffc8d1990398
 p_as: 0xfee19d29c810  size: 5369856  RSS: 2813952
 hat: 0xfedb762d2818  cpuset:
 zone: global
t_stk: 0xfe8000143f10  sp: 0xfe8000143b10  t_stkbase: 
0xfe800013f000

t_pri: 59(TS)  pctcpu: 0.00
t_lwp: 0xfe84e92d6ec0  lwp_regs: 0xfe8000143f10
 mstate: LMS_SLEEP  ms_prev: LMS_SYSTEM
 ms_state_start: 15 minutes 4.476756638 seconds earlier
 ms_start: 15 minutes 8.447715668 seconds earlier
psrset: 0  last CPU: 2
idle: 102425 ticks (17 minutes 4.25 seconds)
start: Thu Jul  2 22:23:06 2009
age: 1029 seconds (17 minutes 9 seconds)
syscall: #54 ioctl(, 0x0) (sysent: genunix:ioctl+0x0)
tstate: TS_SLEEP - awaiting an event
tflg:   T_DFLTSTK - stack is default size
tpflg:  TP_TWAIT - wait to be freed by lwp_wait
   TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
   TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SKILLED - SIGKILL has been posted to the process
   SMSACCT - process is keeping micro-state accounting
   SMSFORK - child inherits micro-state accounting

pc:  unix:_resume_from_idle+0xf8 resume_return:  addq   $0x8,%rsp

unix:_resume_from_idle+0xf8 resume_return()
unix:swtch+0x12a()
genunix:cv_wait+0x68()
zfs:txg_wait_synced+0x83()
zfs:dsl_sync_task_group_wait+0xed()
zfs:dsl_sync_task_do+0x54()
zfs:dmu_objset_create+0xc5()
zfs:zfs_ioc_create+0xee()
zfs:zfsdev_ioctl+0x14c()
genunix:cdev_ioctl+0x1d()
specfs:spec_ioctl+0x50()
genunix:fop_ioctl+0x25()
genunix:ioctl+0xac()
unix:_syscall32_save+0xbf()
-- switch to user thread's user stack --

The box is an x4500, Solaris 10u7.

--
Ian.
