Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-06 Thread Andrew Jones
 
 Good. Run 'zpool scrub' to make sure there are no
 other errors.
 
 regards
 victor
 

Yes, scrubbed successfully with no errors. Thanks again for all of your 
generous assistance.
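
For anyone following along, the verification amounted to roughly the following
(pool name 'tank' as above; the exact status wording varies slightly between
builds):

  # kick off the scrub, then poll until it completes
  zpool scrub tank
  zpool status -v tank   # look for "... with 0 errors" on the scan line
                         # and "errors: No known data errors"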

/AJ
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-04 Thread Andrew Jones
 
 - Original Message -
  Victor,
  
  The zpool import succeeded on the next attempt
 following the crash
  that I reported to you by private e-mail!
  
  For completeness, this is the final status of the
 pool:
  
  
  pool: tank
  state: ONLINE
  scan: resilvered 1.50K in 165h28m with 0 errors on
 Sat Jul 3 08:02:30
 
 Out of curiosity, what sort of drives are you using
 here? Resilvering in 165h28m is close to a week,
 which is rather bad imho.

I think the resilvering statistic is quite misleading, in this case. We're 
using very average 1TB retail Hitachi disks, which perform just fine when the 
pool is healthy.

What happened here is that the zpool-tank process was performing a resilvering 
task in parallel with the processing of a very large inconsistent dataset, 
which took the overwhelming majority of the time to complete.

Why it actually took over a week to process the 2TB volume in an inconsistent 
state is my primary concern about ZFS performance in this case.

 
 Vennlige hilsener / Best regards
 
 roy
 --
 Roy Sigurd Karlsbakk
 (+47) 97542685
 r...@karlsbakk.net
 http://blogg.karlsbakk.net/
 --
 In all pedagogy it is essential that the curriculum be presented
 intelligibly. It is an elementary imperative for all pedagogues to
 avoid excessive use of idioms of foreign origin. In most cases,
 adequate and relevant synonyms exist in Norwegian.
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-03 Thread Andrew Jones
Victor,

The zpool import succeeded on the next attempt following the crash that I 
reported to you by private e-mail! 

For completeness, this is the final status of the pool:


  pool: tank
 state: ONLINE
 scan: resilvered 1.50K in 165h28m with 0 errors on Sat Jul  3 08:02:30 2010
config:

NAMESTATE READ WRITE CKSUM
tankONLINE   0 0 0
  raidz2-0  ONLINE   0 0 0
c0t0d0  ONLINE   0 0 0
c0t1d0  ONLINE   0 0 0
c0t2d0  ONLINE   0 0 0
c0t3d0  ONLINE   0 0 0
c0t4d0  ONLINE   0 0 0
c0t5d0  ONLINE   0 0 0
c0t6d0  ONLINE   0 0 0
c0t7d0  ONLINE   0 0 0
cache
  c2t0d0ONLINE   0 0 0

errors: No known data errors

Thank you very much for your help. We did not need to add additional RAM to 
solve this, in the end. Instead, we needed to persist with the import through 
several panics to finally work our way through the large inconsistent dataset; 
it is unclear whether the resilvering caused additional processing delay. 
Unfortunately, the delay made much of the data quite stale, now that it's been 
recovered.

It does seem that zfs would benefit tremendously from a better (quicker and 
more intuitive?) set of recovery tools that are available to a wider range of 
users. It's really a shame, because the features and functionality in zfs are 
otherwise absolutely second to none.

/Andrew
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-02 Thread Andrew Jones
 Andrew,
 
 Looks like the zpool is telling you the devices are
 still doing work of 
 some kind, or that there are locks still held.
 

Agreed; it appears the CSV1 volume is in a fundamentally inconsistent state 
following the aborted zfs destroy attempt. See later in this thread where 
Victor has identified this to be the case. I am awaiting his analysis of the 
latest crash.

 The errors are listed in the Intro(2) man page of section 2. Number 16
 looks to be EBUSY.
 
 
   16 EBUSY   Device busy
              An attempt was made to mount a device that was
              already mounted, or an attempt was made to unmount a
              device on which there is an active file (open file,
              current directory, mounted-on file, active text
              segment). It will also occur if an attempt is made
              to enable accounting when it is already enabled.
              The device or resource is currently unavailable.
              EBUSY is also used by mutexes, semaphores, condition
              variables, and r/w locks, to indicate that a lock is
              held, and by the processor control function
              P_ONLINE.
 Andrew Jones wrote:
  Just re-ran 'zdb -e tank' to confirm the CSV1 volume is still exhibiting
  error 16:
 
  snip
  Could not open tank/CSV1, error 16
  snip
 
  Considering my attempt to delete the CSV1 volume led to the failure in the
  first place, I have to think that if I can either 1) complete the deletion
  of this volume, or 2) roll back to a transaction prior to this based on
  logging, or 3) repair whatever corruption has been caused by this partial
  deletion, I will then be able to import the pool.
 
  What does 'error 16' mean in the ZDB output, any suggestions?
 
 
 
 -- 
 Geoff Shipman | Senior Technical Support Engineer
 Phone: +13034644710
 Oracle Global Customer Services
 500 Eldorado Blvd. UBRM-04 | Broomfield, CO 80021
 Email: geoff.ship...@sun.com | Hours:9am-5pm
 MT,Monday-Friday
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-01 Thread Andrew Jones
Victor,

I've reproduced the crash and have vmdump.0 and dump device files. How do I 
query the stack on crash for your analysis? What other analysis should I 
provide?
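
From what I can piece together from the mdb docs, something like the following
should pull the panic stack out of the dump; the /var/crash/HL-SAN path is an
assumption based on the default dumpadm setup, so please correct me if a
different sequence is preferred:

  # expand the compressed dump into unix.0 / vmcore.0
  cd /var/crash/HL-SAN
  savecore -vf vmdump.0

  # open the saved dump with mdb and collect the basics
  mdb unix.0 vmcore.0
  > ::status      # panic string and dump summary
  > ::msgbuf      # console messages leading up to the panic
  > ::stack       # stack of the panicking thread
  > ::panicinfo   # register state at the time of the panic
  > $q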

Thanks
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-01 Thread Andrew Jones
Victor,

A little more info on the crash, taken from the messages file, is included 
below. I have also decompressed the dump with savecore to generate unix.0, 
vmcore.0, and vmdump.0.


Jun 30 19:39:10 HL-SAN unix: [ID 836849 kern.notice] 
Jun 30 19:39:10 HL-SAN ^Mpanic[cpu3]/thread=ff0017909c60: 
Jun 30 19:39:10 HL-SAN genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf 
Page fault) rp=ff0017909790 addr=0 occurred in module unknown due to a 
NULL pointer dereference
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] 
Jun 30 19:39:10 HL-SAN unix: [ID 839527 kern.notice] sched: 
Jun 30 19:39:10 HL-SAN unix: [ID 753105 kern.notice] #pf Page fault
Jun 30 19:39:10 HL-SAN unix: [ID 532287 kern.notice] Bad kernel fault at 
addr=0x0
Jun 30 19:39:10 HL-SAN unix: [ID 243837 kern.notice] pid=0, pc=0x0, 
sp=0xff0017909880, eflags=0x10002
Jun 30 19:39:10 HL-SAN unix: [ID 211416 kern.notice] cr0: 
8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
Jun 30 19:39:10 HL-SAN unix: [ID 624947 kern.notice] cr2: 0
Jun 30 19:39:10 HL-SAN unix: [ID 625075 kern.notice] cr3: 336a71000
Jun 30 19:39:10 HL-SAN unix: [ID 625715 kern.notice] cr8: c
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] 
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rdi:  282 
rsi:15809 rdx: ff03edb1e538
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rcx:5  
r8:0  r9: ff03eb2d6a00
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rax:  202 
rbx:0 rbp: ff0017909880
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]r10: f80d16d0 
r11:4 r12:0
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]r13: ff03e21bca40 
r14: ff03e1a0d7e8 r15: ff03e21bcb58
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]fsb:0 
gsb: ff03e25fa580  ds:   4b
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] es:   4b  
fs:0  gs:  1c3
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]trp:e 
err:   10 rip:0
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] cs:   30 
rfl:10002 rsp: ff0017909880
Jun 30 19:39:10 HL-SAN unix: [ID 266532 kern.notice] ss:   38
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] 
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909670 
unix:die+dd ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909780 
unix:trap+177b ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909790 
unix:cmntrap+e6 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 802836 kern.notice] ff0017909880 0 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098a0 
unix:debug_enter+38 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098c0 
unix:abort_sequence_enter+35 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909910 
kbtrans:kbtrans_streams_key+102 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909940 
conskbd:conskbdlrput+e7 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099b0 
unix:putnext+21e ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099f0 
kbtrans:kbtrans_queueevent+7c ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a20 
kbtrans:kbtrans_queuepress+7c ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a60 
kbtrans:kbtrans_untrans_keypressed_raw+46 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a90 
kbtrans:kbtrans_processkey+32 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909ae0 
kbtrans:kbtrans_streams_key+175 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b10 
kb8042:kb8042_process_key+40 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b50 
kb8042:kb8042_received_byte+109 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b80 
kb8042:kb8042_intr+6a ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909bb0 
i8042:i8042_intr+c5 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c00 
unix:av_dispatch_autovect+7c ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c40 
unix:dispatch_hardint+33 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183552f0 
unix:switch_sp_and_call+13 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355340 
unix:do_interrupt+b8 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355350 
unix:_interrupt+b8 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183554a0 
unix:htable_steal+198 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355510 
unix:htable_alloc+248 ()
Jun 30 19:39:11 

Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-29 Thread Andrew Jones
 
 On Jun 29, 2010, at 8:30 PM, Andrew Jones wrote:
 
  Victor,
  
  The 'zpool import -f -F tank' failed at some point
 last night. The box was completely hung this morning;
 no core dump, no ability to SSH into the box to
 diagnose the problem. I had no choice but to reset,
 as I had no diagnostic ability. I don't know if there
 would be anything in the logs?
 
 It sounds like it might run out of memory. Is it an
 option for you to add more memory to the box
 temporarily?

I'll place the order for more memory or transfer some from another machine. 
Seems quite likely that we did run out of memory.

 
 Even if it is an option, it is good to prepare for
 such outcome and have kmdb loaded either at boot time
 by adding -k to 'kernel$' line in GRUB menu, or by
 loading it from console with 'mdb -K' before
 attempting import (type ':c' at mdb prompt to
 continue). In case it hangs again, you can press
 'F1-A' on the keyboard, drop into kmdb and then use
 '$<systemdump' to force a crashdump.

I'll prepare the machine this way and repeat the import to reproduce the hang, 
then break into the kernel and capture the core dump.
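
For the archives, this is roughly the preparation I have in mind (the menu.lst
path and exact kernel$ line are from my box and may differ by release, so
treat this as a sketch):

  # Boot-time option: append -k to the kernel$ line in
  # /rpool/boot/grub/menu.lst so kmdb is loaded at boot, e.g.
  #   kernel$ /platform/i86pc/kernel/$ISADIR/unix -k -B $ZFS-BOOTFS

  # Run-time option: load kmdb from a root shell just before the import
  mdb -K          # drops to the kmdb prompt on the console
  :c              # continue running the system

  # If the import hangs again: press F1-A on the console keyboard to
  # drop into kmdb, then force a crash dump with
  $<systemdump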

 
 If you hardware has physical or virtual NMI button,
 you can use that too to drop into kmdb, but you'll
 need to set a kernel variable for that to work:
 
 http://blogs.sun.com/darren/entry/sending_a_break_to_opensolaris
 
  Earlier I ran 'zdb -e -bcsvL tank' in write mode
 for 36 hours and gave up to try something different.
 Now the zpool import has hung the box.
 
 What do you mean be running zdb in write mode? zdb
 normally is readonly tool. Did you change it in some
 way?

I had read elsewhere that 'set zfs:zfs_recover=1' and 'set aok=1' placed zdb 
into some kind of a write/recovery mode. I have set these in /etc/system. Is 
this a bad idea in this case?
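
For reference, these are the exact lines now in /etc/system, plus the quickest
check I could think of that they took effect after reboot (the mdb part is my
own assumption, not something I was told to run):

  # /etc/system additions on this box
  set zfs:zfs_recover=1
  set aok=1

  # after reboot, print the live values (zfs module must be loaded)
  echo "zfs_recover/D" | mdb -k
  echo "aok/D" | mdb -k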

 
  Should I try zdb again? Any suggestions?
 
 It sounds like zdb is not going to be helpful, as
 inconsistent dataset processing happens only in
 read-write mode. So you need to try above suggestions
 with more memory and kmdb/nmi.

Will do, thanks!

 
 victor
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Now at 36 hours since zdb process start and:


 PID USERNAME  SIZE   RSS STATE  PRI NICE  TIME  CPU PROCESS/NLWP
   827 root 4936M 4931M sleep   590   0:50:47 0.2% zdb/209

Idling at 0.2% processor for nearly the past 24 hours... feels very stuck. 
Thoughts on how to determine where and why?
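
In case it helps anyone suggest next steps, this is what I was planning to run
to see where it is sitting (pid 827 from the prstat output above; the mdb
pipeline is my best reading of the docs):

  # userland stack of the zdb process
  pstack 827

  # kernel-side view of what its threads are blocked on
  echo "0t827::pid2proc | ::walk thread | ::findstack -v" | mdb -k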
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Update: have given up on the zdb write mode repair effort, at least for now. 
Hoping for any guidance / direction anyone's willing to offer...

Re-running 'zpool import -F -f tank' with some stack trace debug, as suggested 
in similar threads elsewhere. Note that this appears hung at near idle.


ff03e278c520 ff03e9c60038 ff03ef109490   1  60 ff0530db4680
  PC: _resume_from_idle+0xf1    CMD: zpool import -F -f tank
  stack pointer for thread ff03e278c520: ff00182bbff0
  [ ff00182bbff0 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
zio_wait+0x5d()
dbuf_read+0x1e8()
dnode_next_offset_level+0x129()
dnode_next_offset+0xa2()
get_next_chunk+0xa5()
dmu_free_long_range_impl+0x9e()
dmu_free_object+0xe6()
dsl_dataset_destroy+0x122()
dsl_destroy_inconsistent+0x5f()
findfunc+0x23()
dmu_objset_find_spa+0x38c()
dmu_objset_find_spa+0x153()
dmu_objset_find+0x40()
spa_load_impl+0xb23()
spa_load+0x117()
spa_load_best+0x78()
spa_import+0xee()
zfs_ioc_pool_import+0xc0()
zfsdev_ioctl+0x177()
cdev_ioctl+0x45()
spec_ioctl+0x5a()
fop_ioctl+0x7b()
ioctl+0x18e()
dtrace_systrace_syscall32+0x11a()
_sys_sysenter_post_swapgs+0x149()
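
(For anyone wanting to pull a similar trace on their own system: something
along these lines against the live kernel should produce it, assuming the
importing 'zpool' process is the only one running; this is a sketch, not
necessarily how the trace above was captured.)

  echo "::pgrep zpool | ::walk thread | ::findstack -v" | mdb -k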
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Dedup had been turned on in the past for some of the volumes, but I had turned 
it off altogether before entering production due to performance issues. GZIP 
compression was turned on for the volume I was trying to delete.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Malachi,

Thanks for the reply. There were no snapshots for the CSV1 volume that I 
recall... very few snapshots on any volume in the tank.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Just re-ran 'zdb -e tank' to confirm the CSV1 volume is still exhibiting error 
16:

snip
Could not open tank/CSV1, error 16
snip

Considering my attempt to delete the CSV1 volume led to the failure in the 
first place, I have to think that if I can either 1) complete the deletion of 
this volume, or 2) roll back to a transaction prior to this based on logging, 
or 3) repair whatever corruption has been caused by this partial deletion, I 
will then be able to import the pool.

What does 'error 16' mean in the ZDB output, any suggestions?
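
My working assumption is that this is a plain errno value, which would make 16
EBUSY; a quick way to confirm on the box (standard header path assumed):

  grep -w 16 /usr/include/sys/errno.h
  # expect something like: #define EBUSY 16 /* Mount device busy */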
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Thanks Victor. I will give it another 24 hrs or so and will let you know how it 
goes...

You are right, a large 2TB volume (CSV1) was in the process of being deleted, 
as described above. It is showing error 16 on 'zdb -e'.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss