Re: [zfs-discuss] ZFS still crashing after patch
Hello Richard,

Monday, May 5, 2008, 4:12:23 PM, you wrote:

RE> Rustam wrote:
>> Hello Robert,
>>
>>> Which would happen if you have a problem with HW and you're getting
>>> wrong checksums on both sides of your mirrors. Maybe the PSU?
>>>
>>> Try memtest anyway, or SunVTS
>>>
>> Unfortunately, SunVTS doesn't run on non-Sun/OEM hardware. And memtest
>> requires too much downtime, which I cannot afford right now.
>>
RE> Sometimes if you read the docs, you can get confused by people who
RE> intend to confuse you. SunVTS does work on a wide variety of
RE> hardware, though it may not be "supported." To fully understand the
RE> perspective, SunVTS is used by Sun in the manufacturing process.
RE> It is the tests run on hardware before shipping to customers. It is not
RE> intended to be a generic "test whatever hardware you find laying around"
RE> product.

Nevertheless, you can actually "persuade" it to run on non-Sun HW - it's even in the manual page, IIRC.

-- 
Best regards,
Robert Milkowski  mailto:[EMAIL PROTECTED]  http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS still crashing after patch
On May 5, 2008, at 4:43 PM, Bob Friesenhahn wrote:
> On Mon, 5 May 2008, eric kustarz wrote:
>>
>> That's not true:
>> http://blogs.sun.com/erickustarz/entry/zil_disable
>>
>> Perhaps people are using "consistency" to mean different things
>> here...
>
> Consistency means that fsync() assures that the data will be written
> to disk so no data is lost. It is not the same thing as "no
> corruption". ZFS will happily lose some data in order to avoid some
> corruption if the system loses power.

Ok, that makes more sense. You're talking from the application perspective, whereas my blog entry is from the file system's perspective (disabling the ZIL does not compromise on-disk consistency).

eric
Re: [zfs-discuss] ZFS still crashing after patch
On Mon, 5 May 2008, Marcelo Leal wrote:
> I'm calling consistency, "a coherent local view"...
> I think that was one option to debug (if not an NFS server), without
> generating a corrupted filesystem.

In other words, your flight reservation will not be lost if the system crashes.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS still crashing after patch
On Mon, 5 May 2008, eric kustarz wrote:
>
> That's not true:
> http://blogs.sun.com/erickustarz/entry/zil_disable
>
> Perhaps people are using "consistency" to mean different things here...

Consistency means that fsync() assures that the data will be written to disk so no data is lost. It is not the same thing as "no corruption". ZFS will happily lose some data in order to avoid some corruption if the system loses power.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
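[The durability guarantee Bob describes can be sketched in a few lines. This is only an illustration of the fsync() contract, not code from the thread; the helper name is made up:]

```python
import os

def durable_write(path, data):
    """Write data and force it to stable storage before returning.

    On ZFS, this is the guarantee the ZIL backs: once fsync() returns,
    the write survives a power failure. With the ZIL disabled, fsync()
    can return before the data is on disk, so recent writes may be lost
    after a crash, even though the pool itself still imports as a
    consistent (older) on-disk state.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # push file data (and metadata) to stable storage
    finally:
        os.close(fd)
```

[In other words: the application-level property ("my reservation is durable") depends on the ZIL; the filesystem-level property ("the pool is never corrupt") does not.]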
Re: [zfs-discuss] ZFS still crashing after patch
On May 5, 2008, at 1:43 PM, Bob Friesenhahn wrote:
> On Mon, 5 May 2008, Marcelo Leal wrote:
>
>> Hello, If you believe that the problem can be related to ZIL code,
>> you can try to disable it to debug (isolate) the problem. If it is
>> not a fileserver (NFS), disabling the zil should not impact
>> consistency.
>
> In what way is NFS special when it comes to ZFS consistency? If NFS
> consistency is lost by disabling the zil then local consistency is
> also lost.

That's not true:
http://blogs.sun.com/erickustarz/entry/zil_disable

Perhaps people are using "consistency" to mean different things here...

eric
Re: [zfs-discuss] ZFS still crashing after patch
On Mon, 5 May 2008, Marcelo Leal wrote:
> Hello, If you believe that the problem can be related to ZIL code,
> you can try to disable it to debug (isolate) the problem. If it is
> not a fileserver (NFS), disabling the zil should not impact
> consistency.

In what way is NFS special when it comes to ZFS consistency? If NFS consistency is lost by disabling the zil then local consistency is also lost.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS still crashing after patch
Hello Leal,

I've already been warned (http://www.opensolaris.org/jive/message.jspa?messageID=231349) that the ZIL could be a cause, and I did test with zil_disabled. I ran a scrub and the system crashed after exactly the same period, with the same error. The ZIL is known to cause some problems on writes, while all my problems are with zio_read and checksum_verify.

This is an NFS file server, but it crashed even with NFS unshared and nfs/server disabled. So this is not an NFS problem.

I reduced the panic occasions by setting zfs_prefetch_disable. This avoids unnecessary reads and reduces the chance of reading bad checksums. For now I've had 24 hours without a crash, which is much better than a few times a day. However, I know that the bad checksums are still there and I need to fix them somehow.

-- 
Rustam

This message posted from opensolaris.org
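[For anyone following along, the usual ways to flip this tunable on Solaris 10 are sketched below; this is an aside for reference, not from Rustam's post, so verify the tunable name against your kernel before using it:]

```
# Persistent, in /etc/system (takes effect at next boot):
set zfs:zfs_prefetch_disable = 1

# Or live on the running kernel, until reboot (use mdb -kw with care):
echo "zfs_prefetch_disable/W0t1" | mdb -kw
```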
Re: [zfs-discuss] ZFS still crashing after patch
Hello,

If you believe that the problem can be related to the ZIL code, you can try to disable it to debug (isolate) the problem. If it is not a fileserver (NFS), disabling the zil should not impact consistency.

Leal.
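[For reference, the tunable on Solaris 10-era kernels was zil_disable. A sketch follows; this is for debugging only, and note it only takes effect for filesystems (re)mounted after the change:]

```
# In /etc/system (applies at next boot):
set zfs:zil_disable = 1

# Or live via mdb, then remount the affected filesystems:
echo "zil_disable/W0t1" | mdb -kw
```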
Re: [zfs-discuss] ZFS still crashing after patch
Rustam wrote:
> Hello Robert,
>
>> Which would happen if you have a problem with HW and you're getting
>> wrong checksums on both sides of your mirrors. Maybe the PSU?
>>
>> Try memtest anyway, or SunVTS
>>
> Unfortunately, SunVTS doesn't run on non-Sun/OEM hardware. And memtest
> requires too much downtime, which I cannot afford right now.
>
Sometimes if you read the docs, you can get confused by people who intend to confuse you. SunVTS does work on a wide variety of hardware, though it may not be "supported." To fully understand the perspective, SunVTS is used by Sun in the manufacturing process. It is the tests run on hardware before shipping to customers. It is not intended to be a generic "test whatever hardware you find laying around" product.

-- richard
Re: [zfs-discuss] ZFS still crashing after patch
Hello Robert,

> Which would happen if you have a problem with HW and you're getting
> wrong checksums on both sides of your mirrors. Maybe the PSU?
>
> Try memtest anyway, or SunVTS

Unfortunately, SunVTS doesn't run on non-Sun/OEM hardware. And memtest requires too much downtime, which I cannot afford right now.

However, I have some interesting observations, and I can now reproduce the crash. It seems that I have bad checksum(s) and ZFS crashes each time it tries to read them. Below are two cases:

Case 1: I got a checksum error not striped over mirrors; this time it was the checksum for a file and not <0x0>. I tried to read the file twice. The first try returned an I/O error, the second try caused a panic. Here's the log:

core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     2
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     2
            c2d1    ONLINE       0     0     4
            c1d1    ONLINE       0     0     4

errors: Permanent errors have been detected in the following files:

        box5:<0x0>
        /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file

core# ll /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
-rw------- 1 user group 489 Apr 20 2006 /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
core# cat /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
cat: input error on /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file: I/O error

core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     4
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     4
            c2d1    ONLINE       0     0     8
            c1d1    ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        box5:<0x0>
        /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file

core# cat /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
(Kernel panic: BAD TRAP: type=e (#pf Page fault) rp=fe8001112490 addr=fe80882b7000)

... (after system boot-up)

core# rm /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2d1    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        box5:<0x0>
        box5:<0x4a049a>

core# mdb unix.17 vmcore.17
Loading modules: [ unix krtld genunix specfs dtrace cpu.generic uppc pcplusmp ufs ip hook neti sctp arp usba uhci fctl nca lofs zfs random nfs ipc sppp crypto ptm ]
> ::status
debugging crash dump vmcore.17 (64-bit) from core
operating system: 5.10 Generic_127128-11 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fe8001112490 addr=fe80882b7000
dump content: kernel pages only
> ::stack
fletcher_2_native+0x13()
zio_checksum_verify+0x27()
zio_next_stage+0x65()
zio_wait_for_children+0x49()
zio_wait_children_done+0x15()
zio_next_stage+0x65()
zio_vdev_io_assess+0x84()
zio_next_stage+0x65()
vdev_cache_read+0x14c()
vdev_disk_io_start+0x135()
vdev_io_start+0x12()
zio_vdev_io_start+0x7b()
zio_next_stage_async+0xae()
zio_nowait+9()
vdev_mirror_io_start+0xa9()
vdev_io_start+0x12()
zio_vdev_io_start+0x7b()
zio_next_stage_async+0xae()
zio_nowait+9()
vdev_mirror_io_start+0xa9()
zio_vdev_io_start+0x116()
zio_next_stage+0x65()
zio_ready+0xec()
zio_next_stage+0x65()
zio_wait_for_children+0x49()
zio_wait_chi
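[An aside, not from the thread: once the damaged file has been removed, the standard way to make ZFS re-verify every block and age out the lingering <0x0> object errors is to clear the counters and scrub (two clean scrubs are typically needed). In Rustam's case the scrub itself panics, so this only helps after the underlying fault is fixed:]

```
core# zpool clear box5
core# zpool scrub box5
core# zpool status -v box5    # watch scrub progress and the CKSUM counters
```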
Re: [zfs-discuss] ZFS still crashing after patch
Hello Rustam,

Saturday, May 3, 2008, 9:16:41 AM, you wrote:

R> I don't think that this is a hardware issue, however I can't rule it out. I'll try to explain why.

R> 1. I've replaced all memory modules which are more likely to cause such a problem.

R> 2. There are many different applications running on that server
R> (Apache, PostgreSQL, etc.). However, if you look at the four
R> different crash dump stack traces you see the same picture:

R> -- crash dump st1 --
R> mutex_enter+0xb()
R> zio_buf_alloc+0x1a()
R> zio_read+0xba()
R> spa_scrub_io_start+0xf1()
R> spa_scrub_cb+0x13d()

R> -- crash dump st2 --
R> mutex_enter+0xb()
R> zio_buf_alloc+0x1a()
R> zio_read+0xba()
R> arc_read+0x3cc()
R> dbuf_prefetch+0x11d()
R> dmu_prefetch+0x107()
R> zfs_readdir+0x408()
R> fop_readdir+0x34()

R> -- crash dump st3 --
R> mutex_enter+0xb()
R> zio_buf_alloc+0x1a()
R> zio_read+0xba()
R> arc_read+0x3cc()
R> dbuf_prefetch+0x11d()
R> dmu_prefetch+0x107()
R> zfs_readdir+0x408()
R> fop_readdir+0x34()

R> -- crash dump st4 --
R> mutex_enter+0xb()
R> zio_buf_alloc+0x1a()
R> zio_read+0xba()
R> arc_read+0x3cc()
R> dbuf_prefetch+0x11d()
R> dmu_prefetch+0x107()
R> zfs_readdir+0x408()
R> fop_readdir+0x34()

R> All four crash dumps show the problem at zio_read/zio_buf_alloc. Three
R> of these appeared during metadata prefetch (dmu_prefetch) and one
R> during scrubbing. I don't think that this is a coincidence. IMHO, the
R> checksum errors are the result of this inconsistency.

Which would happen if you have a problem with HW and you're getting wrong checksums on both sides of your mirrors. Maybe the PSU?

Try memtest anyway, or SunVTS

-- 
Best regards,
Robert Milkowski  mailto:[EMAIL PROTECTED]  http://milek.blogspot.com
Re: [zfs-discuss] ZFS still crashing after patch
I don't think that this is a hardware issue, however I can't rule it out. I'll try to explain why.

1. I've replaced all memory modules which are more likely to cause such a problem.

2. There are many different applications running on that server (Apache, PostgreSQL, etc.). However, if you look at the four different crash dump stack traces you see the same picture:

-- crash dump st1 --
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
spa_scrub_io_start+0xf1()
spa_scrub_cb+0x13d()

-- crash dump st2 --
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
arc_read+0x3cc()
dbuf_prefetch+0x11d()
dmu_prefetch+0x107()
zfs_readdir+0x408()
fop_readdir+0x34()

-- crash dump st3 --
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
arc_read+0x3cc()
dbuf_prefetch+0x11d()
dmu_prefetch+0x107()
zfs_readdir+0x408()
fop_readdir+0x34()

-- crash dump st4 --
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
arc_read+0x3cc()
dbuf_prefetch+0x11d()
dmu_prefetch+0x107()
zfs_readdir+0x408()
fop_readdir+0x34()

All four crash dumps show the problem at zio_read/zio_buf_alloc. Three of these appeared during metadata prefetch (dmu_prefetch) and one during scrubbing. I don't think that this is a coincidence. IMHO, the checksum errors are the result of this inconsistency. I tend to think that the problem is in ZFS and that it exists even in the latest Solaris version (maybe OpenSolaris as well).

> Lots of CKSUM errors like you see are often indicative
> of bad hardware. Run
> memtest for 24-48 hours.
>
> -marc
Re: [zfs-discuss] ZFS still crashing after patch
Rustam code.az> writes:
>
> Didn't help. Keeps crashing.
> The worst thing is that I don't know where the problem is. Any more ideas
> on how to find the problem?

Lots of CKSUM errors like you see are often indicative of bad hardware. Run memtest for 24-48 hours.

-marc
Re: [zfs-discuss] ZFS still crashing after patch
> Seems kind of old. I am using Generic_127112-11 here.
>
> Probably many hundreds of nasty bugs have been
> eliminated since the version you are using.

I've updated to the latest available kernel, 127128-11 (from 28 Apr), which included a number of fixes to the AHCI SATA driver and ZFS.

Didn't help. Keeps crashing. The worst thing is that I don't know where the problem is. Any more ideas on how to find the problem?
Re: [zfs-discuss] ZFS still crashing after patch
On Thu, 1 May 2008, Rustam wrote:
> operating system: 5.10 Generic_127112-07 (i86pc)

Seems kind of old. I am using Generic_127112-11 here.

Probably many hundreds of nasty bugs have been eliminated since the version you are using.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
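[As a reminder for anyone comparing versions (standard Solaris commands, not from the thread): the running kernel and installed patch level can be checked with:]

```
# Kernel version string (e.g. Generic_127112-11):
uname -v

# Confirm whether a specific patch is installed:
showrev -p | grep 127112
```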
Re: [zfs-discuss] ZFS still crashing after patch
> Is your ZFS pool configured with redundancy (e.g. mirrors, raidz) or is
> it non-redundant? If non-redundant, then there is not much that ZFS
> can really do if a device begins to fail.

It's RAID 10 (more info here: http://www.opensolaris.org/jive/thread.jspa?threadID=57425):

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     4
          mirror    ONLINE       0     0     2
            c1d0    ONLINE       0     0     4
            c2d0    ONLINE       0     0     4
          mirror    ONLINE       0     0     2
            c2d1    ONLINE       0     0     4
            c1d1    ONLINE       0     0     4

Actually, there's no damaged data so far. I don't get any "unable to read/write" kind of errors. It's just very strange checksum errors, synchronized over all disks.

> That's a bit harsh. ZFS is telling you that you have corrupted data
> based on the checksums. Other types of filesystems would likely simply
> pass the corrupted data on silently.

Checksums are good, no complaints about that.

> Do you have the panic messages? ZFS won't cause panics based on bad
> checksums. It will by default cause panic if it can't write data out to
> any device or if it completely loses access to non-redundant devices or
> loses both redundant devices at the same time.

A number of panic messages and crash dump stack traces are attached to the original post (http://www.opensolaris.org/jive/thread.jspa?threadID=57425). Here is a short snip:

> ::status
debugging crash dump vmcore.5 (64-bit) from core
operating system: 5.10 Generic_127112-07 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fe800017f8d0 addr=238 occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only
> ::stack
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
spa_scrub_io_start+0xf1()
spa_scrub_cb+0x13d()
traverse_callback+0x6a()
traverse_segment+0x118()
traverse_more+0x7b()
spa_scrub_thread+0x147()
thread_start+8()

> Since this seems to show the same number of checksum errors across 2
> different channels and 4 different drives, I'd assume that
> this is likely a dual-channel HBA of some sort. It would appear that
> you either have bad hardware or some sort of driver issue.

You're right, this is the dual-channel Intel ICH6 SATA controller. 10U4 has native support/drivers for this SATA controller (AHCI drivers, afaik). The thing is that this hardware and ZFS were in production for almost 2 years (ok, not the best argument). However, this problem only appeared recently (20 days ago). It's even more strange because I didn't make any OS/driver upgrade or patch during the last 2-3 months.

However, this is a good point. I've seen some new SATA/AHCI drivers available in 10U5. Maybe I should try to upgrade and see if it helps.

Thanks Phil.

-- 
Rustam
Re: [zfs-discuss] ZFS still crashing after patch
Rustam wrote:
> Today my production server crashed 4 times. THIS IS A NIGHTMARE!
> Self-healing file system?! For me ZFS is a SELF-KILLING filesystem.
>
> I cannot fsck it, there's no such tool. I cannot scrub it, it crashes
> 30-40 minutes after the scrub starts. I cannot use it, it crashes a
> number of times every day! And with every crash the number of checksum
> failures grows:
>
> NAME    STATE     READ WRITE CKSUM
> box5    ONLINE       0     0     0
> ...after a few hours...
> box5    ONLINE       0     0     4
> ...after a few hours...
> box5    ONLINE       0     0    62
> ...after another few hours...
> box5    ONLINE       0     0   120
> ...crash! and we start again...
> box5    ONLINE       0     0     0
> ...etc...
>
> actually 120 is the record, sometimes it crashed as soon as it boots.
>
> and always there's a permanent error:
>
> errors: Permanent errors have been detected in the following files:
>         box5:<0x0>
>
> and very wise self-healing advice: http://www.sun.com/msg/ZFS-8000-8A
> Restore the file in question if possible. Otherwise restore the
> entire pool from backup.
>
> Thanks, but if I restore it from backup it won't be ZFS anymore,
> that's for sure.

That's a bit harsh. ZFS is telling you that you have corrupted data based on the checksums. Other types of filesystems would likely simply pass the corrupted data on silently.

> It's not an I/O problem. AFAIK, the default ZFS I/O error behavior is "wait"
> to repair (I've got 10U4, non-configurable). Then why does it panic?

Do you have the panic messages? ZFS won't cause panics based on bad checksums. It will by default cause a panic if it can't write data out to any device, or if it completely loses access to non-redundant devices, or loses both redundant devices at the same time.

> Recently there were discussions on the failure of the OpenSolaris community.
> Now it's been more than half a month since I reported such an error.
> Nobody even posted something like "RTFM". Come on guys, I know you
> are there and busy with enterprise customers... but at least give me
> some troubleshooting ideas. I'm totally lost.
>
> just to remind, it's a heavily loaded fs with 3-4 million files and
> folders.
>
> Link to original post:
> http://www.opensolaris.org/jive/thread.jspa?threadID=57425

Since this seems to show the same number of checksum errors across 2 different channels and 4 different drives, I'd assume that this is likely a dual-channel HBA of some sort. It would appear that you either have bad hardware or some sort of driver issue.

Regards,
Phil
Re: [zfs-discuss] ZFS still crashing after patch
On Thu, 1 May 2008, Rustam wrote:
> Today my production server crashed 4 times. THIS IS A NIGHTMARE!
> Self-healing file system?! For me ZFS is a SELF-KILLING filesystem.
>
> I cannot fsck it, there's no such tool.
> I cannot scrub it, it crashes 30-40 minutes after the scrub starts.
> I cannot use it, it crashes a number of times every day! And with every crash
> the number of checksum failures grows:

Is your ZFS pool configured with redundancy (e.g. mirrors, raidz) or is it non-redundant? If non-redundant, then there is not much that ZFS can really do if a device begins to fail.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS still crashing after patch
Today my production server crashed 4 times. THIS IS A NIGHTMARE! Self-healing file system?! For me ZFS is a SELF-KILLING filesystem.

I cannot fsck it, there's no such tool.
I cannot scrub it, it crashes 30-40 minutes after the scrub starts.
I cannot use it, it crashes a number of times every day! And with every crash the number of checksum failures grows:

NAME    STATE     READ WRITE CKSUM
box5    ONLINE       0     0     0
...after a few hours...
box5    ONLINE       0     0     4
...after a few hours...
box5    ONLINE       0     0    62
...after another few hours...
box5    ONLINE       0     0   120
...crash! and we start again...
box5    ONLINE       0     0     0
...etc...

Actually 120 is the record; sometimes it crashed as soon as it booted.

And always there's a permanent error:

errors: Permanent errors have been detected in the following files:
        box5:<0x0>

And very wise self-healing advice (http://www.sun.com/msg/ZFS-8000-8A): Restore the file in question if possible. Otherwise restore the entire pool from backup.

Thanks, but if I restore it from backup it won't be ZFS anymore, that's for sure.

It's not an I/O problem. AFAIK, the default ZFS I/O error behavior is "wait" to repair (I've got 10U4, non-configurable). Then why does it panic?

Recently there were discussions on the failure of the OpenSolaris community. Now it's been more than half a month since I reported this error. Nobody even posted something like "RTFM". Come on guys, I know you are there and busy with enterprise customers... but at least give me some troubleshooting ideas. I'm totally lost.

Just to remind you: it's a heavily loaded fs with 3-4 million files and folders.

Link to original post: http://www.opensolaris.org/jive/thread.jspa?threadID=57425
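[One troubleshooting avenue not raised in the thread (a suggestion, not from the original posts): Solaris FMA records every ZFS checksum and I/O ereport with device and timestamp detail, which can show whether the errors cluster on one channel, one disk, or hit everything at once - a useful hint for separating bad disks from a bad controller or PSU:]

```
# Summarize fault management error reports:
fmdump -e

# Full detail, including the vdev and offsets for ZFS ereports:
fmdump -eV | more

# Any faults FMA has already diagnosed:
fmadm faulty
```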