Re: redundant zfs pool, system traps and tonns of corrupted files

2017-06-29 Thread Alan Somers
On Thu, Jun 29, 2017 at 6:04 AM, Eugene M. Zheganin  wrote:
> Hi,
>
> On 29.06.2017 16:37, Eugene M. Zheganin wrote:
>>
>> Hi.
>>
>>
>> Say I have a server that traps more and more often (different panics:
>> ZFS panics, GPFs, fatal traps while in kernel mode, etc.), and then I realize
>> it has tons of permanent errors on all of its pools that scrub is unable
>> to heal. Does this situation mean it's a bad memory case? Unfortunately, I
>> switched the hardware to an identical server before I noticed the zpools
>> had errors, so I'm not sure when they appeared. Right now I'm about to run
>> a memtest on the old hardware.
>>
>>
>> So, whadda you say: does this point at memory as the root problem?

Certainly a good guess.

>>
>
> I also don't quite understand how I can have errors at the vdev level
> but 0 errors at the lower device layer (could someone please explain this?):

ZFS checksums whole records at a time.  On RAIDZ, each record is
spread over multiple disks, usually the entire RAID stripe.  So when
ZFS detects a checksum error on a record stored in RAIDZ, it doesn't
know which individual disk was actually responsible.  Instead, it
blames the RAIDZ vdev.  That's why you have thousands of checksum
errors on your raidz vdevs.  The few checksum errors you have on
individual disks might have come from the labels or uberblocks, which
are not raided.
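A toy Python model may make this concrete. It is a sketch under several assumptions: single XOR parity (RAIDZ1-like), equal-size data columns, and SHA-256 standing in for the ZFS block checksum (fletcher4 by default). The function names are hypothetical. The checksum covers the whole record, never a single column.

Two cases are contrasted. In the first, one column is damaged on disk after a correct write; rebuilding each column in turn from parity (roughly what RAIDZ reconstruction attempts) can pin the error on one disk. The second case models one hypothetical failure mode: the buffer is corrupted in RAM after it was checksummed but before parity was generated. Parity then agrees with the bad data, no single-disk explanation fits, and only the record as a whole, and hence the raidz vdev, can be blamed.

```python
import hashlib
from functools import reduce


def xor_columns(columns):
    """XOR equal-length byte strings together (single parity, RAIDZ1-like)."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*columns))


def record_checksum(record):
    # Stand-in for the ZFS block checksum; SHA-256 is used only for simplicity.
    return hashlib.sha256(record).digest()


def write_record(record, ndata, checksum_of=None):
    """Split a record into ndata data columns plus one parity column.

    The checksum covers the whole record, not individual columns. If
    checksum_of is given, the checksum is taken over that buffer instead,
    which models data changing in RAM after it was checksummed.
    """
    assert len(record) % ndata == 0, "toy model: record must split evenly"
    size = len(record) // ndata
    data = [record[i * size:(i + 1) * size] for i in range(ndata)]
    source = record if checksum_of is None else checksum_of
    return data + [xor_columns(data)], record_checksum(source)


def read_record(columns, expected):
    """Return ("ok", None), ("repaired", disk) or ("unrecoverable", None)."""
    data = columns[:-1]
    if record_checksum(b"".join(data)) == expected:
        return "ok", None
    # Assume each data column in turn is the bad one, rebuild it from the
    # remaining columns plus parity, and re-check the whole-record checksum.
    for bad in range(len(data)):
        others = [col for i, col in enumerate(columns) if i != bad]
        rebuilt = data[:bad] + [xor_columns(others)] + data[bad + 1:]
        if record_checksum(b"".join(rebuilt)) == expected:
            return "repaired", bad
    # No single-disk explanation fits: only the vdev as a whole can be blamed.
    return "unrecoverable", None


good = bytes(range(64))

# Case 1: correct write, then one column is damaged on "disk" 2.
columns, csum = write_record(good, ndata=4)
damaged = list(columns)
damaged[2] = bytes([damaged[2][0] ^ 0xFF]) + damaged[2][1:]
print(read_record(damaged, csum))    # ('repaired', 2)

# Case 2: a bit flips in RAM after checksumming, before parity is computed.
flipped = bytearray(good)
flipped[5] ^= 0x01
columns2, csum2 = write_record(bytes(flipped), ndata=4, checksum_of=good)
print(read_record(columns2, csum2))  # ('unrecoverable', None)
```

In the second case, every disk returns exactly what it was given, so no individual device accumulates checksum errors. That is consistent with the bad-memory theory in the original question, though it does not prove it.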

-Alan
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: redundant zfs pool, system traps and tonns of corrupted files

2017-06-29 Thread Eugene M. Zheganin

Hi,

On 29.06.2017 16:37, Eugene M. Zheganin wrote:

> Hi.
>
> Say I have a server that traps more and more often (different
> panics: ZFS panics, GPFs, fatal traps while in kernel mode, etc.), and
> then I realize it has tons of permanent errors on all of its pools
> that scrub is unable to heal. Does this situation mean it's a bad
> memory case? Unfortunately, I switched the hardware to an identical
> server before I noticed the zpools had errors, so I'm not sure when
> they appeared. Right now I'm about to run a memtest on the old hardware.
>
> So, whadda you say: does this point at memory as the root problem?



I also don't quite understand how I can have errors at the vdev
level but 0 errors at the lower device layer (could someone please
explain this?):


  pool: esx
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 3,74G in 0h5m with 0 errors on Tue Dec 27 05:14:32 2016
config:

        NAME        STATE     READ WRITE CKSUM
        esx         ONLINE       0     0 99,0K
          raidz1-0  ONLINE       0     0  113K
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     2
            da3     ONLINE       0     0     0
            da5     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0 84,7K
            da12    ONLINE       0     0     0
            da13    ONLINE       0     0     1
            da14    ONLINE       0     0     0
            da15    ONLINE       0     0     0
            da16    ONLINE       0     0     0

errors: 25 data errors, use '-v' for a list

  pool: gamestop
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Jun 29 12:30:21 2017
        1,67T scanned out of 4,58T at 1002M/s, 0h50m to go
        0 repaired, 36,44% done
config:

        NAME        STATE     READ WRITE CKSUM
        gamestop    ONLINE       0     0     1
          raidz1-0  ONLINE       0     0     2
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da11    ONLINE       0     0     0

errors: 10 data errors, use '-v' for a list

P.S. This is FreeBSD 11.1-BETA2 r320056M (the M is a local modification:
CTL_MAX_PORTS = 1024), with ECC memory.


Thanks.
Eugene.


redundant zfs pool, system traps and tonns of corrupted files

2017-06-29 Thread Eugene M. Zheganin

Hi.


Say I have a server that traps more and more often (different 
panics: ZFS panics, GPFs, fatal traps while in kernel mode, etc.), and 
then I realize it has tons of permanent errors on all of its pools 
that scrub is unable to heal. Does this situation mean it's a bad memory 
case? Unfortunately, I switched the hardware to an identical server 
before I noticed the zpools had errors, so I'm not sure when they 
appeared. Right now I'm about to run a memtest on the old hardware.



So, whadda you say: does this point at memory as the root problem?


Thanks.

Eugene.
