> > All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE
> > PART OF THE POOL. How can it be missing a device that didn't exist?
> 
> The device(s) in question are probably the logs you refer to here:

There is a log device, with a different GUID, left over from another
pool from long ago. It isn't valid. I've clipped the relevant output:

ny-fs4(71)# zpool import
  pool: srv
    id: 6111323963551805601
 state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        srv           UNAVAIL  insufficient replicas
        logs
        srv           UNAVAIL  insufficient replicas
          mirror      ONLINE
            c3t0d0s4  ONLINE <---- box doesn't even have a c3
            c0t0d0s4  ONLINE <---- what it's looking at - leftover from who knows what

  pool: srv
    id: 9515618289022845993
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:


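(Side note: since there are two configs both claiming to be "srv",
importing by the numeric id - assuming the second id above is the real
pool - should at least keep it from grabbing the stale c0t0d0s4 config:

ny-fs4# zpool import 9515618289022845993

Not that I expect that alone to help with the panic.)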

> > I can't obviously use b134 to import the pool without logs, since that
> > would imply upgrading the pool first, which is hard to do if it's not
> > imported.
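(For what it's worth, newer bits grew an explicit switch for this -
something like

# zpool import -m srv

to import with the log device missing - but my assumption is that still
needs the pool to be at a version that supports log removal, so it's the
same chicken-and-egg.)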
> The stack trace you show is indicative of a memory corruption that may
> have gotten out to disk.  In other words, ZFS wrote data to ram, ram was
> corrupted, then the checksum was calculated and the result was written
> out.

Now this worries me. Granted, the box works fairly hard, but... no ECC
events reported to IPMI that I can see. It's possible the controller
ka-futzed somehow... but then presumably there should be SOME valid data
somewhere to go back to?
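(I should probably also check the FMA side - something like

# fmdump -eV | grep -i mem
# fmadm faulty

on the assumption that correctable ECC errors would show up as ereports
there even if the IPMI log missed them.)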

The one fairly unusual thing about this box is that it has another pool
with 12 15k SAS drives holding a MySQL database that gets thrashed
fairly hard on a permanent basis.

> Do you have a core dump from the panic?  Also, what kind of DRAM
> does this system use?

It has 12 4GB DDR3-1066 ECC REG DIMMs. 

I can reproduce the panic on demand (try to import the pool with -F and
it goes straight back into reboot-loop mode). I pulled the stack from a
core dump.
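(For the record, that was just pulled out of the savecore dump -
assuming the usual /var/crash/<hostname> location, roughly:

# cd /var/crash/ny-fs4
# mdb unix.0 vmcore.0
> ::status
> ::stack
> $q

with ::status giving the panic string and ::stack the panic thread.)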


> If you're lucky, then there's no corruption and instead it's a
> stale config that's causing the problem.  Try removing
> /etc/zfs/zpool.cache and then doing a zpool import -a

Not nearly that lucky. It won't import. Once it goes into the reboot
loop, the only thing you can do is go to single-user, remove the cache,
and reboot so it forgets about the pool.
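(Concretely, the dance is roughly: boot with -s (or -m milestone=none)
from GRUB to get a single-user shell, then

# rm /etc/zfs/zpool.cache
# reboot

and any subsequent "zpool import -a" that touches the pool puts it right
back into the loop.)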



(Please, no rumblings from the peanut gallery about the evils of SATA or
SAS/SATA encapsulation. This is the only box in this mode. The MySQL
database is an RTG stats database whose loss is not the end of the
world. The dataset is replicated at two other sites; this is a "local
copy" - it's just that it's 15TB, and as I said, recovery is, well,
time-consuming and therefore not the preferred option.

Real Production Boxes - slowly coming online - are all using the
SuperMicro E26 dual-port backplane with 2TB Constellation SAS drives on
paired LSI 9211-8is, with the aforementioned ECC REG RAM, and I'm trying
to figure out how to either
 -- get my hands on SAS SSDs (of which there appears to be one, the new
OCZ Vertex 2 Pro), or
 -- install interposers in front of SATA SSDs so at least the
controllers aren't dealing with SATA encapsulation - the big challenge
being, of all things, the form factor and the tray....

I think I'm going to yank the SAS drives out and migrate them so that
they're on a separate backplane and controller....)