Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-06 Thread Tim Cook
On Tue, Apr 6, 2010 at 12:47 AM, Daniel Carosone d...@geek.com.au wrote:

 On Tue, Apr 06, 2010 at 12:29:35AM -0500, Tim Cook wrote:
  On Tue, Apr 6, 2010 at 12:24 AM, Daniel Carosone d...@geek.com.au wrote:
 
   On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote:
    By the way, I see that now one of the disks is listed as degraded - too
    many errors. Is there a good way to identify exactly which of the disks
    it is?
  
   It's hidden in iostat -E, of all places.
  
   --
   Dan.
  
  
  I think he wants to know how to identify which physical drive maps to the
  dev ID in solaris.  The only way I can think of is to run something like DD
  against the drive to light up the activity LED.

 or look at the serial numbers printed in iostat -E

 --
 Dan.



And then what?  Cross your fingers and hope you pull the right drive on the
first go?  I don't know of any drives that come from the factory in a
hot-swap bay with the serial number printed on the front of the caddy.

--Tim


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-06 Thread Willard Korfhage
Yes, I was hoping to find the serial numbers. Unfortunately, it doesn't show 
any serial numbers for the disks attached to the Areca raid card.


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-06 Thread James C. McPherson

On  6/04/10 11:47 PM, Willard Korfhage wrote:

Yes, I was hoping to find the serial numbers. Unfortunately, it doesn't
show any serial numbers for the disks attached to the Areca raid card.



You'll need to reboot and go into the card BIOS to
get that information.


James C. McPherson
--
Senior Software Engineer, Solaris
Oracle
http://www.jmcp.homeunix.com/blog


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-06 Thread Carson Gaspar

Willard Korfhage wrote:

Yes, I was hoping to find the serial numbers. Unfortunately, it
doesn't show any serial numbers for the disks attached to the Areca
raid card.


Does Areca provide any Solaris tools that will show you the drive info?

If you are using the Areca in JBOD mode, smartctl will frequently show 
serial numbers that iostat -E will not (iostat appears to be really 
stupid about getting serial numbers compared to just about any other 
tool out there).
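
Something like this is a reasonable first try (a rough sketch only - the
exact device path and any -d device-type option depend on how the Areca
presents the disks to Solaris):

  # print identify info, including the serial number, for one disk
  smartctl -i /dev/rdsk/c4t1d3s0 | grep -i serial

If plain -i doesn't work, experiment with the -d device-type options that
smartmontools provides for disks behind RAID controllers.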


--
Carson


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Willard Korfhage
Looks like it was RAM. I ran memtest+ 4.00, and it found no problems. I removed 
2 of the 3 sticks of RAM, ran a backup, and had no errors. I'm running more 
extensive tests, but it looks like that was it. A new motherboard, CPU and ECC 
RAM are on the way to me now.


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Daniel Carosone
On Sun, Apr 04, 2010 at 11:46:16PM -0700, Willard Korfhage wrote:
 Looks like it was RAM. I ran memtest+ 4.00, and it found no problems.

Then why do you suspect the ram?

Especially with 12 disks, another likely candidate could be an
overloaded power supply.  While there may be problems showing up in
RAM, it may only be happening under the combined load of disks, cpu
and memory activity that brings the system into marginal power
conditions.  Sometimes it may be just one rail that is out of bounds,
and other devices are unaffected.

If memtest didn't find any problems without the disk and cpu load,
that tends to support this hypothesis.

So, the memory may not be bad per se, though it's still not ECC and
therefore not good either :-)   Perhaps you can still find a good
use for it elsewhere.

 I removed 2 of the 3 sticks of RAM, ran a backup, and had no
 errors. I'm running more extensive tests, but it looks like that was
 it. A new motherboard, CPU and ECC RAM are on the way to me now. 

Switching to ECC is a good thing, but be prepared for possible
continued issues (with different detection thanks to ECC) if the root
cause is the psu.  In fact, ECC memory may draw marginally more power
and maybe make the problem worse (the new cpu and motherboard could go
either way, depending on your choices). 

--
Dan.



Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Willard Korfhage
It certainly has symptoms that match a marginal power supply, but I measured 
the power consumption some time ago and found it comfortably within the power 
supply's capacity. I've also wondered if the RAM is fine, but there is just 
some kind of flaky interaction of the ram configuration I had with the 
motherboard.


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Tim Cook
On Mon, Apr 5, 2010 at 9:39 PM, Willard Korfhage opensola...@familyk.org wrote:

 It certainly has symptoms that match a marginal power supply, but I
 measured the power consumption some time ago and found it comfortably within
 the power supply's capacity. I've also wondered if the RAM is fine, but
 there is just some kind of flaky interaction of the ram configuration I had
 with the motherboard.


I think the confusion is that you said you ran memtest86+ and the memory
tested just fine.  Did you remove some memory before running memtest86+ and
narrow it down to a certain stick being bad or something?  Your post makes
it sound as though you found that all of the ram is working perfectly fine.
 I.e., it's not the problem.

Also, a low power draw doesn't mean much of anything.  The power supply
could just be dying.  Load wouldn't really matter in that scenario (although
a high load will generally help it out the door a bit quicker due to higher
heat/etc.).

--Tim


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Daniel Carosone
On Mon, Apr 05, 2010 at 09:46:58PM -0500, Tim Cook wrote:
 On Mon, Apr 5, 2010 at 9:39 PM, Willard Korfhage 
 opensola...@familyk.org wrote:
 
  It certainly has symptoms that match a marginal power supply, but I
  measured the power consumption some time ago and found it comfortably within
  the power supply's capacity. I've also wondered if the RAM is fine, but
  there is just some kind of flaky interaction of the ram configuration I had
  with the motherboard.
 
 I think the confusion is that you said you ran memtest86+ and the memory
 tested just fine.  Did you remove some memory before running memtest86+ and
 narrow it down to a certain stick being bad or something?  Your post makes
 it sound as though you found that all of the ram is working perfectly fine.

Exactly.

 Also, a low power draw doesn't mean much of anything.  The power supply
 could just be dying.

Or just one part of it could be overloaded (like a particular 5v or
12v rail that happens to be shared between too many drives and the
m/b), even if the overall draw at the wall is less than the total
rating. Sometimes, just moving plugs around can help - or at least
show that a better psu is warranted.

--
Dan.




Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Willard Korfhage
Memtest didn't show any errors, but between Frank, early in the thread, saying 
that he had found memory errors that memtest didn't catch, and the removal of 
DIMMs apparently fixing the problem, I jumped too quickly to the conclusion 
that it was the memory. Certainly there are other explanations. 

I see that I have a spare Corsair 620W power supply that I could try. The 
supply in there now is also a Corsair, of some wattage or other. If I recall 
properly, the steady-state power draw is between 150 and 200 watts.

By the way, I see that now one of the disks is listed as degraded - too many 
errors. Is there a good way to identify exactly which of the disks it is?


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Daniel Carosone
On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote:
 By the way, I see that now one of the disks is listed as degraded - too many 
 errors. Is there a good way to identify exactly which of the disks it is?

It's hidden in iostat -E, of all places.
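
Roughly, something like this (the exact field layout varies a bit by driver):

  iostat -En
  # each per-device block shows Soft/Hard/Transport error counters
  # and, usually, a Vendor / Product / Serial No. line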

--
Dan.



Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Tim Cook
On Tue, Apr 6, 2010 at 12:24 AM, Daniel Carosone d...@geek.com.au wrote:

 On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote:
  By the way, I see that now one of the disks is listed as degraded - too
 many errors. Is there a good way to identify exactly which of the disks it
 is?

 It's hidden in iostat -E, of all places.

 --
 Dan.


I think he wants to know how to identify which physical drive maps to the
dev ID in solaris.  The only way I can think of is to run something like DD
against the drive to light up the activity LED.
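
Something like this, for instance (untested, and c4t1d3 is just a stand-in
for whichever device zpool flagged):

  # keep reading the raw device so its activity LED stays lit
  dd if=/dev/rdsk/c4t1d3s0 of=/dev/null bs=1024k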

--Tim


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-05 Thread Daniel Carosone
On Tue, Apr 06, 2010 at 12:29:35AM -0500, Tim Cook wrote:
 On Tue, Apr 6, 2010 at 12:24 AM, Daniel Carosone d...@geek.com.au wrote:
 
  On Mon, Apr 05, 2010 at 09:35:21PM -0700, Willard Korfhage wrote:
   By the way, I see that now one of the disks is listed as degraded - too
  many errors. Is there a good way to identify exactly which of the disks it
  is?
 
  It's hidden in iostat -E, of all places.
 
  --
  Dan.
 
 
 I think he wants to know how to identify which physical drive maps to the
 dev ID in solaris.  The only way I can think of is to run something like DD
 against the drive to light up the activity LED.

or look at the serial numbers printed in iostat -E
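
e.g., something like this to pair each device with its reported serial
(assuming the driver exposes one):

  iostat -En | egrep 'Errors:|Serial No'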

--
Dan.




[zfs-discuss] Diagnosing Permanent Errors

2010-04-04 Thread Willard Korfhage
I would like to get some help diagnosing permanent errors on my files. The 
machine in question has 12 1TB disks connected to an Areca raid card. I 
installed OpenSolaris build 134 and according to zpool history, created a pool 
with

zpool create bigraid raidz2 c4t0d0 c4t0d1 c4t0d2 c4t0d3 c4t0d4 c4t0d5 c4t0d6 
c4t0d7 c4t1d0 c4t1d1 c4t1d2 c4t1d3

I then backed up 806G of files to the machine and had the backup program 
verify the files. The verification failed: the check is still running, but so 
far it has found 4 files whose backup checksums don't match the checksums of 
the originals. Zpool status shows problems:

 $ sudo zpool status -v
  pool: bigraid
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        bigraid     DEGRADED     0     0   536
          raidz2-0  DEGRADED     0     0 3.14K
            c4t0d0  ONLINE       0     0     0
            c4t0d1  ONLINE       0     0     0
            c4t0d2  ONLINE       0     0     0
            c4t0d3  ONLINE       0     0     0
            c4t0d4  ONLINE       0     0     0
            c4t0d5  ONLINE       0     0     0
            c4t0d6  ONLINE       0     0     0
            c4t0d7  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t1d1  ONLINE       0     0     0
            c4t1d2  ONLINE       0     0     0
            c4t1d3  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

metadata:0x18
metadata:0x3a

So, it appears that one of the disks is bad, but if one disk failed, how would 
a raidz2 pool develop permanent errors? The numbers in the CKSUM column are 
continuing to grow, but is that because the backup verification is tickling the 
errors as it runs?

Previous postings on permanent errors said to look at fmdump -eV, but that has 
437543 lines, and I don't really know how to interpret what I see. I did check 
the vdev_path with fmdump -eV | grep vdev_path | sort | uniq -c to see if 
it was only certain disks, but every disk in the array is listed in the file, 
albeit with different frequencies:

   2189 vdev_path = /dev/dsk/c4t0d0s0
   1077 vdev_path = /dev/dsk/c4t0d1s0
   1077 vdev_path = /dev/dsk/c4t0d2s0
   1097 vdev_path = /dev/dsk/c4t0d3s0
     25 vdev_path = /dev/dsk/c4t0d4s0
     25 vdev_path = /dev/dsk/c4t0d5s0
     20 vdev_path = /dev/dsk/c4t0d6s0
   1072 vdev_path = /dev/dsk/c4t0d7s0
   1092 vdev_path = /dev/dsk/c4t1d0s0
        vdev_path = /dev/dsk/c4t1d1s0
   2221 vdev_path = /dev/dsk/c4t1d2s0
   1149 vdev_path = /dev/dsk/c4t1d3s0
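
A similar tally on the error class would presumably at least show what kind
of ereports these are, e.g.:

  fmdump -eV | grep 'class =' | sort | uniq -c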

What should I make of this? All the disks are bad? That seems unlikely. I found 
another thread

http://opensolaris.org/jive/thread.jspa?messageID=399988

where it finally came down to bad memory, so I'll test that. Any other 
suggestions?


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-04 Thread Frank Middleton

On 04/ 4/10 10:00 AM, Willard Korfhage wrote:


What should I make of this? All the disks are bad? That seems
unlikely. I found another thread

http://opensolaris.org/jive/thread.jspa?messageID=399988

where it finally came down to bad memory, so I'll test that. Any
other suggestions?


It could be the cpu. I had a very bizarre case where the cpu would
sometimes miscalculate the checksums of certain files, mostly when the
cpu was also busy doing other things. Probably the cache.

Days of running memtest and SUNWvts didn't result in any errors
because this was a weirdly pattern sensitive problem. However, I
too am of the opinion that you shouldn't even think of running zfs
without ECC memory (lots of threads about that!) and that this
is far, far more likely to be your problem, but I wouldn't count on
diagnostics finding it, either. Of course it could be the controller too.

For laughs, the cpu calculating bad checksums was discussed in
http://opensolaris.org/jive/message.jspa?messageID=469108
(see last message in the thread).

If you are seriously contemplating using a system with
non-ECC RAM, check out the Google research mentioned in
http://opensolaris.org/jive/thread.jspa?messageID=423770
http://www.cs.toronto.edu/%7Ebianca/papers/sigmetrics09.pdf

Cheers -- Frank



Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-04 Thread Willard Korfhage
Yeah, this morning I concluded I really should be running ECC RAM. I sometimes 
wonder why people don't run ECC RAM more often. I remember a decade ago, when 
RAM was much, much less dense, people fretted about alpha particles randomly 
flipping bits, but that concern seems to have died down.

I know, of course, there is some added expense, but browsing on Newegg, the 
additional RAM cost is pretty minimal. I see 2GB ECC sticks going for about $12 
more than similar non-ECC sticks. It's the motherboards that can handle ECC 
which are the expensive part. Now I've got to see what is a good motherboard 
for a file server.