Re: [zfs-discuss] pool died during scrub

2010-09-03 Thread Cia Watson

This may or may not be helpful. I don't run a RAID, but I do have an
external USB drive where I've created a pool for rsync backups and for
importing snapshots, and the current status of that pool is UNAVAIL /
insufficient replicas, as yours shows above. I've found I can get it back
online by turning the drive on and then running 'zpool clear poolname' (in
your case srv, without the quotes of course).
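For example, roughly (a sketch only - it assumes the pool is named srv and
that the missing device has come back after powering the drive on):

    # zpool clear srv       (clear the error state so ZFS retries the devices)
    # zpool status srv      (confirm the pool and its vdevs report ONLINE)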

It just might work for you, though I'm running OpenSolaris snv_134 and your
situation isn't quite the same.

Cia W



Jeff Bacon wrote:

 ny-fs4(71)# zpool import


   pool: srv
     id: 6111323963551805601
  state: UNAVAIL
 status: The pool was last accessed by another system.
 action: The pool cannot be imported due to damaged devices or data.
    see: http://www.sun.com/msg/ZFS-8000-EY
 config:

        srv             UNAVAIL  insufficient replicas
        logs
          srv           UNAVAIL  insufficient replicas
            mirror      ONLINE
              c3t0d0s4  ONLINE  box doesn't even have a c3
              c0t0d0s4  ONLINE  what it's looking at - leftover from who knows what

   pool: srv
     id: 9515618289022845993
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
         devices and try again.
    see: http://www.sun.com/msg/ZFS-8000-6X
 config:











Re: [zfs-discuss] pool died during scrub

2010-09-02 Thread Jeff Bacon

 looks similar to a crash I had here at our site a few months ago. Same
 symptoms, no actual solution. We had to recover from an rsync backup
 server.

Thanks Carsten. And on Sun hardware, too. Boy, that's comforting...

Three way mirrors anyone?
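(If it ever came to that: growing an existing two-way mirror to a three-way
is just an attach - a sketch with hypothetical pool/device names:

    # zpool attach tank c0t0d0 c0t2d0

which adds c0t2d0 as a third side of the mirror that already contains
c0t0d0.)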


Re: [zfs-discuss] pool died during scrub

2010-09-01 Thread Carsten John

Jeff Bacon wrote:
 I have a bunch of sol10U8 boxes with ZFS pools, mostly raidz2 8-disk
 stripes. They're all Supermicro-based with retail LSI cards.
 
 I've noticed a tendency for things to go a little bonkers during the
 weekly scrub (they all scrub over the weekend), and that's when I'll
 lose a disk here and there. OK, fine, that's sort of the point, and
 they're SATA drives so things happen. 
 
 I've never lost a pool though, until now. This is Not Fun. 
 
 ::status
 debugging crash dump vmcore.0 (64-bit) from ny-fs4
 operating system: 5.10 Generic_142901-10 (i86pc)
 panic message:
 BAD TRAP: type=e (#pf Page fault) rp=fe80007cb850 addr=28 occurred
 in module zfs due to a NULL pointer dereference
 dump content: kernel pages only
 $C
 fe80007cb960 vdev_is_dead+2()
 fe80007cb9a0 vdev_mirror_child_select+0x65()
 fe80007cba00 vdev_mirror_io_start+0x44()
 fe80007cba30 zio_vdev_io_start+0x159()
 fe80007cba60 zio_execute+0x6f()
 fe80007cba90 zio_wait+0x2d()
 fe80007cbb40 arc_read_nolock+0x668()
 fe80007cbbd0 dmu_objset_open_impl+0xcf()
 fe80007cbc20 dsl_pool_open+0x4e()
 fe80007cbcc0 spa_load+0x307()
 fe80007cbd00 spa_open_common+0xf7()
 fe80007cbd10 spa_open+0xb()
 fe80007cbd30 pool_status_check+0x19()
 fe80007cbd80 zfsdev_ioctl+0x1b1()
 fe80007cbd90 cdev_ioctl+0x1d()
 fe80007cbdb0 spec_ioctl+0x50()
 fe80007cbde0 fop_ioctl+0x25()
 fe80007cbec0 ioctl+0xac()
 fe80007cbf10 _sys_sysenter_post_swapgs+0x14b()
 
   pool: srv
 id: 9515618289022845993
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
 devices and try again.
see: http://www.sun.com/msg/ZFS-8000-6X
 config:
 
         srv                        UNAVAIL  missing device
           raidz2                   ONLINE
             c2t5000C5001F2CCE1Fd0  ONLINE
             c2t5000C5001F34F5FAd0  ONLINE
             c2t5000C5001F48D399d0  ONLINE
             c2t5000C5001F485EC3d0  ONLINE
             c2t5000C5001F492E42d0  ONLINE
             c2t5000C5001F48549Bd0  ONLINE
             c2t5000C5001F370919d0  ONLINE
             c2t5000C5001F484245d0  ONLINE
           raidz2                   ONLINE
             c2t5F000B5C8187d0      ONLINE
             c2t5F000B5C8157d0      ONLINE
             c2t5F000B5C9101d0      ONLINE
             c2t5F000B5C8167d0      ONLINE
             c2t5F000B5C9120d0      ONLINE
             c2t5F000B5C9151d0      ONLINE
             c2t5F000B5C9170d0      ONLINE
             c2t5F000B5C9180d0      ONLINE
           raidz2                   ONLINE
             c2t5000C50010A88E76d0  ONLINE
             c2t5000C5000DCD308Cd0  ONLINE
             c2t5000C5001F1F456Dd0  ONLINE
             c2t5000C50010920E06d0  ONLINE
             c2t5000C5001F20C81Fd0  ONLINE
             c2t5000C5001F3C7735d0  ONLINE
             c2t5000C500113BC008d0  ONLINE
             c2t5000C50014CD416Ad0  ONLINE

 Additional devices are known to be part of this pool, though their
 exact configuration cannot be determined.
 
 
 All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE
 PART OF THE POOL. How can it be missing a device that didn't exist? 
 
 A zpool import -fF results in the above kernel panic. This also
 creates /etc/zfs/zpool.cache.tmp, which then results in the pool being
 imported, which leads to a continuous reboot/panic cycle. 
 
 I can't obviously use b134 to import the pool without logs, since that
 would imply upgrading the pool first, which is hard to do if it's not
 imported. 
 
 My zdb skills are lacking - zdb -l only gets you so far and that's it.
 (Where the heck are the other zdb options even written down, besides in
 the code?)
 
 OK, so this isn't the end of the world, but it's 15TB of data I'd really
 rather not have to re-copy across a 100Mbit line. What concerns me more
 is that ZFS would do this in the first place - it's not supposed to
 corrupt itself!!


Hi Jeff,



looks similar to a crash I had here at our site a few months ago. Same
symptoms, no actual solution. We had to recover from an rsync backup server.


We had the logs on a mirrored SSD and an additional SSD as cache.

The machine (a SUN 4270 with SUN J4400 JBODs and SUN SAS disks) crashed in
the same manner (core dumping while trying to import the pool). After
booting into single-user mode we found the pool's log mirror corrupted
(one disk unavailable). Even after replacing the disk and resilvering
the log mirror we were not able to import the pool.

I suspect it may have been related to memory (perhaps a lack of memory).


all the best


Carsten





--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1

[zfs-discuss] pool died during scrub

2010-08-30 Thread Jeff Bacon
I have a bunch of sol10U8 boxes with ZFS pools, mostly raidz2 8-disk
stripes. They're all Supermicro-based with retail LSI cards.

I've noticed a tendency for things to go a little bonkers during the
weekly scrub (they all scrub over the weekend), and that's when I'll
lose a disk here and there. OK, fine, that's sort of the point, and
they're SATA drives so things happen. 

I've never lost a pool though, until now. This is Not Fun. 

 ::status
debugging crash dump vmcore.0 (64-bit) from ny-fs4
operating system: 5.10 Generic_142901-10 (i86pc)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=fe80007cb850 addr=28 occurred
in module zfs due to a NULL pointer dereference
dump content: kernel pages only
 $C
fe80007cb960 vdev_is_dead+2()
fe80007cb9a0 vdev_mirror_child_select+0x65()
fe80007cba00 vdev_mirror_io_start+0x44()
fe80007cba30 zio_vdev_io_start+0x159()
fe80007cba60 zio_execute+0x6f()
fe80007cba90 zio_wait+0x2d()
fe80007cbb40 arc_read_nolock+0x668()
fe80007cbbd0 dmu_objset_open_impl+0xcf()
fe80007cbc20 dsl_pool_open+0x4e()
fe80007cbcc0 spa_load+0x307()
fe80007cbd00 spa_open_common+0xf7()
fe80007cbd10 spa_open+0xb()
fe80007cbd30 pool_status_check+0x19()
fe80007cbd80 zfsdev_ioctl+0x1b1()
fe80007cbd90 cdev_ioctl+0x1d()
fe80007cbdb0 spec_ioctl+0x50()
fe80007cbde0 fop_ioctl+0x25()
fe80007cbec0 ioctl+0xac()
fe80007cbf10 _sys_sysenter_post_swapgs+0x14b()

  pool: srv
id: 9515618289022845993
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

        srv                        UNAVAIL  missing device
          raidz2                   ONLINE
            c2t5000C5001F2CCE1Fd0  ONLINE
            c2t5000C5001F34F5FAd0  ONLINE
            c2t5000C5001F48D399d0  ONLINE
            c2t5000C5001F485EC3d0  ONLINE
            c2t5000C5001F492E42d0  ONLINE
            c2t5000C5001F48549Bd0  ONLINE
            c2t5000C5001F370919d0  ONLINE
            c2t5000C5001F484245d0  ONLINE
          raidz2                   ONLINE
            c2t5F000B5C8187d0      ONLINE
            c2t5F000B5C8157d0      ONLINE
            c2t5F000B5C9101d0      ONLINE
            c2t5F000B5C8167d0      ONLINE
            c2t5F000B5C9120d0      ONLINE
            c2t5F000B5C9151d0      ONLINE
            c2t5F000B5C9170d0      ONLINE
            c2t5F000B5C9180d0      ONLINE
          raidz2                   ONLINE
            c2t5000C50010A88E76d0  ONLINE
            c2t5000C5000DCD308Cd0  ONLINE
            c2t5000C5001F1F456Dd0  ONLINE
            c2t5000C50010920E06d0  ONLINE
            c2t5000C5001F20C81Fd0  ONLINE
            c2t5000C5001F3C7735d0  ONLINE
            c2t5000C500113BC008d0  ONLINE
            c2t5000C50014CD416Ad0  ONLINE

Additional devices are known to be part of this pool, though their
exact configuration cannot be determined.


All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE
PART OF THE POOL. How can it be missing a device that didn't exist? 

A zpool import -fF results in the above kernel panic. This also
creates /etc/zfs/zpool.cache.tmp, which then results in the pool being
imported, which leads to a continuous reboot/panic cycle. 

I can't obviously use b134 to import the pool without logs, since that
would imply upgrading the pool first, which is hard to do if it's not
imported. 

My zdb skills are lacking - zdb -l only gets you so far and that's it.
(Where the heck are the other zdb options even written down, besides in
the code?)
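(For the record, what I've been doing is dumping the labels straight off
the raw devices - a sketch only, with the s0 slice assumed:

    # zdb -l /dev/rdsk/c2t5000C5001F2CCE1Fd0s0

which prints the four vdev labels, each carrying the pool GUID and the
vdev tree that device thinks it belongs to. zdb -e is supposed to let you
point zdb at a pool that isn't imported, but how far that gets you on U8
I can't say.)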

OK, so this isn't the end of the world, but it's 15TB of data I'd really
rather not have to re-copy across a 100Mbit line. What concerns me more
is that ZFS would do this in the first place - it's not supposed to
corrupt itself!!


Re: [zfs-discuss] pool died during scrub

2010-08-30 Thread Mark J Musante

On Mon, 30 Aug 2010, Jeff Bacon wrote:




All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE 
PART OF THE POOL. How can it be missing a device that didn't exist?


The device(s) in question are probably the logs you refer to here:

I can't obviously use b134 to import the pool without logs, since that 
would imply upgrading the pool first, which is hard to do if it's not 
imported.


The stack trace you show is indicative of memory corruption that may 
have gotten out to disk.  In other words, ZFS wrote data to RAM, the RAM 
was corrupted, then the checksum was calculated and the corrupted result 
was written out.


Do you have a core dump from the panic?  Also, what kind of DRAM does this 
system use?


If you're lucky, then there's no corruption and instead it's a stale 
config that's causing the problem.  Try removing /etc/zfs/zpool.cache and 
then doing a zpool import -a.
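Something along these lines (a sketch; you may prefer to move the cache
aside rather than delete it, just in case):

    # rm /etc/zfs/zpool.cache    (drop the cached - possibly stale - pool config)
    # zpool import -a            (rescan the devices and import whatever pools are found)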



Re: [zfs-discuss] pool died during scrub

2010-08-30 Thread Jeff Bacon
  All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE
  PART OF THE POOL. How can it be missing a device that didn't exist?
 
 The device(s) in question are probably the logs you refer to here:

There is a log, with a different GUID, from another pool from long ago.
It isn't valid. I clipped that: 

ny-fs4(71)# zpool import
  pool: srv
id: 6111323963551805601
 state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        srv             UNAVAIL  insufficient replicas
        logs
          srv           UNAVAIL  insufficient replicas
            mirror      ONLINE
              c3t0d0s4  ONLINE  box doesn't even have a c3
              c0t0d0s4  ONLINE  what it's looking at - leftover from who knows what

  pool: srv
id: 9515618289022845993
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:



  I can't obviously use b134 to import the pool without logs, since that
  would imply upgrading the pool first, which is hard to do if it's not
  imported.

 The stack trace you show is indicative of memory corruption that may
 have gotten out to disk.  In other words, ZFS wrote data to RAM, the RAM
 was corrupted, then the checksum was calculated and the corrupted result
 was written out.

Now this worries me. Granted, the box works fairly hard, but... no ECC
events reported to IPMI that I can see. It's possible the controller
ka-futzed somehow... but then presumably there should be SOME valid data
to go back to somewhere?

The one fairly unusual item about this box is that it has another pool
with 12 15k SAS drives, holding a MySQL database that gets thrashed
fairly hard on a permanent basis.

 Do you have a core dump from the panic?  Also, what kind of DRAM
 does this system use?

It has twelve 4GB DDR3-1066 ECC REG DIMMs. 

I can regenerate the panic on command (try to import the pool with -F
and it will go back into reboot loop mode). I pulled the stack from a
core dump. 
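For reference, the stack above came out of the crash dump roughly like
this (assuming savecore put it in the usual /var/crash/ny-fs4 as
unix.0/vmcore.0):

    # cd /var/crash/ny-fs4
    # mdb unix.0 vmcore.0
    > ::status       (panic message and dump details)
    > $C             (stack backtrace of the panicking thread)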


 If you're lucky, then there's no corruption and instead it's a stale
 config that's causing the problem.  Try removing /etc/zfs/zpool.cache
 and then doing a zpool import -a.

Not nearly that lucky. It won't import. If it goes into reboot mode, the
only thing you can do is go to single-user, remove the cache, and reboot
so it forgets about the pool.
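In other words, roughly (the exact boot incantation varies, so treat this
as a sketch):

    (boot single-user, e.g. add -s to the kernel line in GRUB)
    # rm -f /etc/zfs/zpool.cache /etc/zfs/zpool.cache.tmp   (so the pool isn't re-imported at boot)
    # reboot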



(Please, no rumblings from the peanut gallery about the evils of SATA or
SAS/SATA encapsulation. This is the only box in this mode. The mysql
database is an RTG stats database whose loss is not the end of the
world. The dataset is replicated in two other sites, this is a local
copy - just that it's 15TB, and as I said, recovery is, well,
time-consuming and therefore not the preferred option.

Real Production Boxes - slowly coming online - are all using the
SuperMicro E26 dual-port backplane with 2TB Constellation SAS drives on
paired LSI 9211-8i's, with the aforementioned ECC REG RAM, and I'm trying
to figure out how to either
 -- get my hands on SAS SSDs (of which there appears to be one, the new
OCZ Vertex 2 Pro), or
 -- install interposers in front of SATA SSDs so at least the
controllers aren't dealing with SATA encap - the big challenge being, of
all things, the form factor and the tray 

I think I'm going to yank the SAS drives out and migrate them so that
they're on a separate backplane and controller)