Re: [zfs-discuss] Failing disk(s) or controller in ZFS pool?
Hi Andy,

On Feb 14, 2012, at 12:41 PM, andy thomas wrote:

> On Tue, 14 Feb 2012, Richard Elling wrote:
>
>> Hi Andy
>>
>> On Feb 14, 2012, at 10:37 AM, andy thomas wrote:
>>
>>> On one of our servers, we have a RAIDz1 ZFS pool called 'maths2'
>>> consisting of 7 x 300 GB disks which in turn contains a single ZFS
>>> filesystem called 'home'.
>>>
>>> Yesterday, using the 'ls' command to list the directories within this
>>> pool caused the command to hang for a long period, followed by an
>>> 'i/o error' message. 'zpool status -x maths2' reports the pool is
>>> healthy but 'iostat -en' shows a rather different story:
>>>
>>> root@e450:~# iostat -en
>>>           ---- errors ---
>>>   s/w  h/w  trn  tot device
>>>     0    0    0    0 fd0
>>>     0    0    0    0 c2t3d0
>>>     0    0    0    0 c2t0d0
>>>     0    0    0    0 c2t1d0
>>>     0    0    0    0 c5t3d0
>>>     0    0    0    0 c4t0d0
>>>     0    0    0    0 c4t1d0
>>>     0    0    0    0 c2t2d0
>>>     0    0    0    0 c4t2d0
>>>     0    0    0    0 c4t3d0
>>>     0    0    0    0 c5t0d0
>>>     0    0    0    0 c5t1d0
>>>     0    0    0    0 c8t0d0
>>>     0    0    0    0 c8t1d0
>>>     0    0    0    0 c8t2d0
>>>     0  503 1658 2161 c9t0d0
>>>     0 2515 6260 8775 c9t1d0
>>>     0    0    0    0 c8t3d0
>>>     0  492 2024 2516 c9t2d0
>>>     0  444 1810 2254 c9t3d0
>>>     0    0    0    0 c5t2d0
>>>     0    1    0    1 rmt/2
>>>
>>> Obviously it looks like controller c9 or the cabling associated with
>>> it is in trouble (the server is an Enterprise 450 with multiple disk
>>> controllers). On taking the server down and running the
>>> 'probe-scsi-all' command from the OBP, one disk c9t1d0 was reported
>>> as being faulty (no media present) but the others seemed fine.
>>
>> We see similar symptoms when a misbehaving disk (usually SATA)
>> disrupts the other disks in the same fault zone.
>
> OK, I will replace the disk.
>
>>> After booting back up, I started scrubbing the maths2 pool and for a
>>> long time, only disk c9t1d0 reported it was being repaired. After a
>>> few hours, another disk on this controller reported being repaired:
>>>
>>>     NAME        STATE     READ WRITE CKSUM
>>>     maths2      ONLINE       0     0     0
>>>       raidz1-0  ONLINE       0     0     0
>>>         c5t2d0  ONLINE       0     0     0
>>>         c5t3d0  ONLINE       0     0     0
>>>         c8t3d0  ONLINE       0     0     0
>>>         c9t0d0  ONLINE       0     0     0  21K repaired
>>>         c9t1d0  ONLINE       0     0     0  938K repaired
>>>         c9t2d0  ONLINE       0     0     0
>>>         c9t3d0  ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> Now, does this point to a controller/cabling/backplane problem or
>>> could all 4 disks on this controller have been corrupted in some way?
>>> The O/S is OpenSolaris snv_134 for SPARC and the server has been up &
>>> running for nearly a year with no problems to date - there are two
>>> other RAIDz1 pools on this server but these are working fine.
>>
>> Not likely. More likely the faulty disk is causing issues elsewhere.
>
> It seems odd that 'zpool status' is not reporting a degraded status and
> 'zpool status -x' is still saying "all pools are healthy". This is a
> little worrying as I use remote monitoring to keep an eye on all the
> servers I admin (many of which run Solaris, OpenIndiana and FreeBSD)
> and one thing that is checked every 15 minutes is the pool status using
> 'zpool status -x'. But this seems to result in a false sense of
> security and I could be blissfully unaware that half a pool has dropped
> out!

The integrity of the pool was not in danger. I'll bet you have a whole
bunch of errors logged to syslog.

>> NB, for file and RAID systems that do not use checksums, such
>> corruptions can be catastrophic. Yea ZFS!
>
> Yes indeed!

:-)
 -- richard

--
DTrace Conference, April 3, 2012,
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
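[An aside for anyone scripting the kind of remote monitoring Andy describes: since 'zpool status -x' stayed quiet here, the per-device error counters from 'iostat -en' are worth checking as well. A minimal sketch - the awk field positions assume the two-line header shown in the thread, and the here-doc simply replays a few lines of Andy's captured output; in a real cron job you would pipe live 'iostat -en' output in instead:]

```shell
#!/bin/sh
# Flag any device whose cumulative error total ('tot' column) in
# 'iostat -en' output is nonzero. The sample data below is taken from
# the thread; replace the here-doc with a live pipe from iostat -en.

flag_disk_errors() {
  # Skip the two header lines; $4 is 'tot', $5 is the device name.
  awk 'NR > 2 && $4 > 0 { print $5, "total errors:", $4 }'
}

flag_disk_errors <<'EOF'
          ---- errors ---
  s/w  h/w  trn  tot device
    0    0    0    0 fd0
    0  503 1658 2161 c9t0d0
    0 2515 6260 8775 c9t1d0
    0    0    0    0 c8t3d0
EOF
# Output:
#   c9t0d0 total errors: 2161
#   c9t1d0 total errors: 8775
```

[Pairing this with the existing 15-minute 'zpool status -x' check would have caught the c9 errors long before the scrub.]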
Re: [zfs-discuss] Failing disk(s) or controller in ZFS pool?
On Tue, 14 Feb 2012, Richard Elling wrote:

> Hi Andy
>
> On Feb 14, 2012, at 10:37 AM, andy thomas wrote:
>
>> On one of our servers, we have a RAIDz1 ZFS pool called 'maths2'
>> consisting of 7 x 300 GB disks which in turn contains a single ZFS
>> filesystem called 'home'.
>>
>> Yesterday, using the 'ls' command to list the directories within this
>> pool caused the command to hang for a long period, followed by an
>> 'i/o error' message. 'zpool status -x maths2' reports the pool is
>> healthy but 'iostat -en' shows a rather different story:
>>
>> root@e450:~# iostat -en
>>           ---- errors ---
>>   s/w  h/w  trn  tot device
>>     0    0    0    0 fd0
>>     0    0    0    0 c2t3d0
>>     0    0    0    0 c2t0d0
>>     0    0    0    0 c2t1d0
>>     0    0    0    0 c5t3d0
>>     0    0    0    0 c4t0d0
>>     0    0    0    0 c4t1d0
>>     0    0    0    0 c2t2d0
>>     0    0    0    0 c4t2d0
>>     0    0    0    0 c4t3d0
>>     0    0    0    0 c5t0d0
>>     0    0    0    0 c5t1d0
>>     0    0    0    0 c8t0d0
>>     0    0    0    0 c8t1d0
>>     0    0    0    0 c8t2d0
>>     0  503 1658 2161 c9t0d0
>>     0 2515 6260 8775 c9t1d0
>>     0    0    0    0 c8t3d0
>>     0  492 2024 2516 c9t2d0
>>     0  444 1810 2254 c9t3d0
>>     0    0    0    0 c5t2d0
>>     0    1    0    1 rmt/2
>>
>> Obviously it looks like controller c9 or the cabling associated with
>> it is in trouble (the server is an Enterprise 450 with multiple disk
>> controllers). On taking the server down and running the
>> 'probe-scsi-all' command from the OBP, one disk c9t1d0 was reported as
>> being faulty (no media present) but the others seemed fine.
>
> We see similar symptoms when a misbehaving disk (usually SATA) disrupts
> the other disks in the same fault zone.

OK, I will replace the disk.

>> After booting back up, I started scrubbing the maths2 pool and for a
>> long time, only disk c9t1d0 reported it was being repaired. After a
>> few hours, another disk on this controller reported being repaired:
>>
>>     NAME        STATE     READ WRITE CKSUM
>>     maths2      ONLINE       0     0     0
>>       raidz1-0  ONLINE       0     0     0
>>         c5t2d0  ONLINE       0     0     0
>>         c5t3d0  ONLINE       0     0     0
>>         c8t3d0  ONLINE       0     0     0
>>         c9t0d0  ONLINE       0     0     0  21K repaired
>>         c9t1d0  ONLINE       0     0     0  938K repaired
>>         c9t2d0  ONLINE       0     0     0
>>         c9t3d0  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> Now, does this point to a controller/cabling/backplane problem or
>> could all 4 disks on this controller have been corrupted in some way?
>> The O/S is OpenSolaris snv_134 for SPARC and the server has been up &
>> running for nearly a year with no problems to date - there are two
>> other RAIDz1 pools on this server but these are working fine.
>
> Not likely. More likely the faulty disk is causing issues elsewhere.

It seems odd that 'zpool status' is not reporting a degraded status and
'zpool status -x' is still saying "all pools are healthy". This is a
little worrying as I use remote monitoring to keep an eye on all the
servers I admin (many of which run Solaris, OpenIndiana and FreeBSD) and
one thing that is checked every 15 minutes is the pool status using
'zpool status -x'. But this seems to result in a false sense of security
and I could be blissfully unaware that half a pool has dropped out!

> NB, for file and RAID systems that do not use checksums, such
> corruptions can be catastrophic. Yea ZFS!

Yes indeed!

cheers, Andy

-
Andy Thomas,
Time Domain Systems

Tel: +44 (0)7866 556626
Fax: +44 (0)20 8372 2582
http://www.time-domain.co.uk
Re: [zfs-discuss] Failing disk(s) or controller in ZFS pool?
Hi Andy

On Feb 14, 2012, at 10:37 AM, andy thomas wrote:

> On one of our servers, we have a RAIDz1 ZFS pool called 'maths2'
> consisting of 7 x 300 GB disks which in turn contains a single ZFS
> filesystem called 'home'.
>
> Yesterday, using the 'ls' command to list the directories within this
> pool caused the command to hang for a long period, followed by an
> 'i/o error' message. 'zpool status -x maths2' reports the pool is
> healthy but 'iostat -en' shows a rather different story:
>
> root@e450:~# iostat -en
>           ---- errors ---
>   s/w  h/w  trn  tot device
>     0    0    0    0 fd0
>     0    0    0    0 c2t3d0
>     0    0    0    0 c2t0d0
>     0    0    0    0 c2t1d0
>     0    0    0    0 c5t3d0
>     0    0    0    0 c4t0d0
>     0    0    0    0 c4t1d0
>     0    0    0    0 c2t2d0
>     0    0    0    0 c4t2d0
>     0    0    0    0 c4t3d0
>     0    0    0    0 c5t0d0
>     0    0    0    0 c5t1d0
>     0    0    0    0 c8t0d0
>     0    0    0    0 c8t1d0
>     0    0    0    0 c8t2d0
>     0  503 1658 2161 c9t0d0
>     0 2515 6260 8775 c9t1d0
>     0    0    0    0 c8t3d0
>     0  492 2024 2516 c9t2d0
>     0  444 1810 2254 c9t3d0
>     0    0    0    0 c5t2d0
>     0    1    0    1 rmt/2
>
> Obviously it looks like controller c9 or the cabling associated with it
> is in trouble (the server is an Enterprise 450 with multiple disk
> controllers). On taking the server down and running the
> 'probe-scsi-all' command from the OBP, one disk c9t1d0 was reported as
> being faulty (no media present) but the others seemed fine.

We see similar symptoms when a misbehaving disk (usually SATA) disrupts
the other disks in the same fault zone.

> After booting back up, I started scrubbing the maths2 pool and for a
> long time, only disk c9t1d0 reported it was being repaired. After a few
> hours, another disk on this controller reported being repaired:
>
>     NAME        STATE     READ WRITE CKSUM
>     maths2      ONLINE       0     0     0
>       raidz1-0  ONLINE       0     0     0
>         c5t2d0  ONLINE       0     0     0
>         c5t3d0  ONLINE       0     0     0
>         c8t3d0  ONLINE       0     0     0
>         c9t0d0  ONLINE       0     0     0  21K repaired
>         c9t1d0  ONLINE       0     0     0  938K repaired
>         c9t2d0  ONLINE       0     0     0
>         c9t3d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> Now, does this point to a controller/cabling/backplane problem or could
> all 4 disks on this controller have been corrupted in some way? The O/S
> is OpenSolaris snv_134 for SPARC and the server has been up & running
> for nearly a year with no problems to date - there are two other RAIDz1
> pools on this server but these are working fine.

Not likely. More likely the faulty disk is causing issues elsewhere.

NB, for file and RAID systems that do not use checksums, such
corruptions can be catastrophic. Yea ZFS!

 -- richard

--
DTrace Conference, April 3, 2012,
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
[zfs-discuss] Failing disk(s) or controller in ZFS pool?
On one of our servers, we have a RAIDz1 ZFS pool called 'maths2'
consisting of 7 x 300 GB disks which in turn contains a single ZFS
filesystem called 'home'.

Yesterday, using the 'ls' command to list the directories within this
pool caused the command to hang for a long period, followed by an
'i/o error' message. 'zpool status -x maths2' reports the pool is
healthy but 'iostat -en' shows a rather different story:

root@e450:~# iostat -en
          ---- errors ---
  s/w  h/w  trn  tot device
    0    0    0    0 fd0
    0    0    0    0 c2t3d0
    0    0    0    0 c2t0d0
    0    0    0    0 c2t1d0
    0    0    0    0 c5t3d0
    0    0    0    0 c4t0d0
    0    0    0    0 c4t1d0
    0    0    0    0 c2t2d0
    0    0    0    0 c4t2d0
    0    0    0    0 c4t3d0
    0    0    0    0 c5t0d0
    0    0    0    0 c5t1d0
    0    0    0    0 c8t0d0
    0    0    0    0 c8t1d0
    0    0    0    0 c8t2d0
    0  503 1658 2161 c9t0d0
    0 2515 6260 8775 c9t1d0
    0    0    0    0 c8t3d0
    0  492 2024 2516 c9t2d0
    0  444 1810 2254 c9t3d0
    0    0    0    0 c5t2d0
    0    1    0    1 rmt/2

Obviously it looks like controller c9 or the cabling associated with it
is in trouble (the server is an Enterprise 450 with multiple disk
controllers). On taking the server down and running the 'probe-scsi-all'
command from the OBP, one disk c9t1d0 was reported as being faulty (no
media present) but the others seemed fine.

After booting back up, I started scrubbing the maths2 pool and for a
long time, only disk c9t1d0 reported it was being repaired. After a few
hours, another disk on this controller reported being repaired:

    NAME        STATE     READ WRITE CKSUM
    maths2      ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        c5t2d0  ONLINE       0     0     0
        c5t3d0  ONLINE       0     0     0
        c8t3d0  ONLINE       0     0     0
        c9t0d0  ONLINE       0     0     0  21K repaired
        c9t1d0  ONLINE       0     0     0  938K repaired
        c9t2d0  ONLINE       0     0     0
        c9t3d0  ONLINE       0     0     0

errors: No known data errors

Now, does this point to a controller/cabling/backplane problem or could
all 4 disks on this controller have been corrupted in some way? The O/S
is OpenSolaris snv_134 for SPARC and the server has been up & running
for nearly a year with no problems to date - there are two other RAIDz1
pools on this server but these are working fine.

Andy

-
Andy Thomas,
Time Domain Systems

Tel: +44 (0)7866 556626
Fax: +44 (0)20 8372 2582
http://www.time-domain.co.uk
Re: [zfs-discuss] send only difference between snapshots
On 14.02.2012 15:03, Andrew Gabriel wrote:
> skeletor wrote:
>
> It's called an incremental - it's part of the zfs send command line
> options.

thanks!
Re: [zfs-discuss] send only difference between snapshots
skeletor wrote:
> There is a task: make a backup by sending snapshots to another server.
> But I don't want to send a complete snapshot of the system each time -
> I want to send only the difference between snapshots. For example:
> there are two servers; I want to take a snapshot on the master, send
> only the difference between the current and previous snapshots to the
> backup server, and then deploy it there. Any ideas how this can be
> done?

It's called an incremental - it's part of the zfs send command line
options.

--
Andrew Gabriel
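[For anyone finding this in the archive: the incremental option Andrew mentions is 'zfs send -i old-snapshot new-snapshot', which streams only the blocks that changed between the two snapshots. A minimal sketch of one backup cycle - the pool, snapshot, and host names are made up for illustration, and the script only echoes the commands so it reads as a dry run; drop the 'echo' in run() to execute for real:]

```shell
#!/bin/sh
# Dry-run sketch of an incremental zfs send/receive cycle between a
# master and a backup host. All names below are illustrative.
POOL=tank/home
PREV=backup-2012-02-13   # snapshot already present on both hosts
CURR=backup-2012-02-14   # snapshot to take and transfer now
BACKUP_HOST=backupserver

# Print each command instead of running it (dry run).
run() { echo "$@"; }

# 1. Take today's snapshot on the master.
run zfs snapshot "$POOL@$CURR"

# 2. Send only the delta between the previous and current snapshots,
#    and receive it into the same dataset on the backup host.
run "zfs send -i $POOL@$PREV $POOL@$CURR | ssh $BACKUP_HOST zfs receive $POOL"
```

[The backup host must already have the PREV snapshot for the incremental receive to succeed, which is why each cycle keeps the last transferred snapshot around on both sides.]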
[zfs-discuss] send only difference between snapshots
There is a task: make a backup by sending snapshots to another server.
But I don't want to send a complete snapshot of the system each time - I
want to send only the difference between snapshots. For example: there
are two servers; I want to take a snapshot on the master, send only the
difference between the current and previous snapshots to the backup
server, and then deploy it there. Any ideas how this can be done?

Thanks