Re: [zfs-discuss] Speeding up resilver on x4500

2009-07-22 Thread Roch

Stuart Anderson writes:
  
  On Jun 21, 2009, at 10:21 PM, Nicholas Lee wrote:
  
  
  
   On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson 
   ander...@ligo.caltech.edu 
wrote:
  
   However, it is a bit disconcerting to have to run with reduced data
   protection for an entire week. While I am certainly not going back to
   UFS, it seems like it should be at least theoretically possible to  
   do this
   several orders of magnitude faster, e.g., what if every block on the
   replacement disk had its RAIDZ2 data recomputed from the degraded
  
   Maybe this is also saying - that for large disk sets a single RAIDZ2  
   provides a false sense of security.
  
  This configuration is with 3 large RAIDZ2 devices but I have more  
  recently
  been building thumper/thor systems with a larger number of smaller  
  RAIDZ2's.
  
  Thanks.
  

170M small files reconstructed in 1 week over 3 raid-z
groups is 93 files / sec per raid-z group. That is not too
far from expectations for 7.2K RPM drives (were they?).
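
In rough numbers (a minimal sketch in Python; only the ~170M-file and
~1-week figures come from the report above, the rest is arithmetic):

    # Rough resilver-rate estimate for the reported workload.
    files = 170e6              # ~170M small files (reported)
    seconds = 7 * 24 * 3600    # ~1 week of resilver time (reported)
    groups = 3                 # three raid-z2 groups in the pool
    per_group = files / seconds / groups
    print(round(per_group))    # -> 94, i.e. roughly the 93 files/sec quoted above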

I don't see orders of magnitude improvements here; however,
this CR (integrated in snv_109) might give the workload a boost:

6801507 ZFS read aggregation should not mind the gap

This will enable more read aggregation to occur during a
resilver. We could also contemplate enabling the vdev
prefetch code for data during a resilver. 
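
As I understand that CR, the point is to let aggregation coalesce
nearby-but-not-contiguous reads into one larger I/O, reading and
discarding the small gaps in between. A toy sketch of that coalescing
policy (hypothetical names and thresholds, not the actual vdev queue code):

    # Toy sketch: coalesce sorted read requests (offset, size) into larger
    # I/Os, tolerating small gaps between them. Illustrative only.
    def coalesce(reads, max_gap=64 * 1024, max_io=1024 * 1024):
        out = []
        for off, size in sorted(reads):
            if out:
                cur_off, cur_size = out[-1]
                gap = off - (cur_off + cur_size)
                if 0 <= gap <= max_gap and (off + size) - cur_off <= max_io:
                    out[-1] = (cur_off, off + size - cur_off)  # merge, reading the gap too
                    continue
            out.append((off, size))
        return out

    # Example: three 8K reads separated by 16K gaps become one 56K read.
    print(coalesce([(0, 8192), (24576, 8192), (49152, 8192)]))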

Otherwise, limiting the # of small objects per raid-z group,
as you're doing now, seems wise to me.

-r


  --
  Stuart Anderson  ander...@ligo.caltech.edu
  http://www.ligo.caltech.edu/~anderson
  
  
  


Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-23 Thread Richard Elling

Erik Trimble wrote:
All this discussion hasn't answered one thing for me:   exactly _how_ 
does ZFS do resilvering?  Both in the case of mirrors, and of RAIDZ[2] ?


I've seen some mention that it goes in chronological order (which to 
me, means that the metadata must be read first) of file creation, and 
that only used blocks are rebuilt, but exactly what is the methodology 
being used?


See Jeff Bonwick's blog on the topic
http://blogs.sun.com/bonwick/entry/smokin_mirrors
-- richard



Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-23 Thread Erik Trimble

Richard Elling wrote:

Erik Trimble wrote:
All this discussion hasn't answered one thing for me:   exactly _how_ 
does ZFS do resilvering?  Both in the case of mirrors, and of RAIDZ[2] ?


I've seen some mention that it goes in chronological order (which to 
me, means that the metadata must be read first) of file creation, and 
that only used blocks are rebuilt, but exactly what is the 
methodology being used?


See Jeff Bonwick's blog on the topic
http://blogs.sun.com/bonwick/entry/smokin_mirrors
-- richard



That's very informative. Thanks, Richard.

So, ZFS walks the used block tree to see what still needs rebuilding.   
I guess I have two related questions then:


(1) Are these blocks some fixed size (based on the media - usually 512 
bytes), or are they ZFS blocks - the fungible size based on the 
requirements of the original file size being written? 

(2) is there some reasonable way to read in multiples of these blocks in 
a single IOP?   Theoretically, if the blocks are in chronological 
creation order, they should be (relatively) sequential on the drive(s).  
Thus, ZFS should be able to read in several of them without forcing a 
random seek. That is, you should be able to get multiple blocks in a 
single IOP.



If we can't get multiple ZFS blocks in one sequential read, we're 
screwed - ZFS is going to be IOPS bound on the replacement disk, with no 
real workaround.  Which means rebuild times for disks with lots of small 
files is going to be hideous.




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-23 Thread Toby Thain


On 23-Jun-09, at 1:58 PM, Erik Trimble wrote:


Richard Elling wrote:

Erik Trimble wrote:
All this discussion hasn't answered one thing for me:   exactly  
_how_ does ZFS do resilvering?  Both in the case of mirrors, and  
of RAIDZ[2] ?


I've seen some mention that it goes in chronological order (which  
to me, means that the metadata must be read first) of file  
creation, and that only used blocks are rebuilt, but exactly what  
is the methodology being used?


See Jeff Bonwick's blog on the topic
http://blogs.sun.com/bonwick/entry/smokin_mirrors
-- richard



That's very informative. Thanks, Richard.

So, ZFS walks the used block tree to see what still needs  
rebuilding.   I guess I have two related questions then:


(1) Are these blocks some fixed size (based on the media - usually  
512 bytes), or are they ZFS blocks - the fungible size based on  
the requirements of the original file size being written?
(2) is there some reasonable way to read in multiples of these  
blocks in a single IOP?   Theoretically, if the blocks are in  
chronological creation order, they should be (relatively)  
sequential on the drive(s).  Thus, ZFS should be able to read in  
several of them without forcing a random seek.


(I think) the disk's internal scheduling could help out here if they  
are indeed close to physically sequential.


--Toby


That is, you should be able to get multiple blocks in a single IOP.


If we can't get multiple ZFS blocks in one sequential read, we're  
screwed - ZFS is going to be IOPS bound on the replacement disk,  
with no real workaround.  Which means rebuild times for disks with  
lots of small files is going to be hideous.




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-23 Thread Stuart Anderson


On Jun 23, 2009, at 11:50 AM, Richard Elling wrote:


(2) is there some reasonable way to read in multiples of these  
blocks in a single IOP?   Theoretically, if the blocks are in  
chronological creation order, they should be (relatively)  
sequential on the drive(s).  Thus, ZFS should be able to read in  
several of them without forcing a random seek. That is, you should  
be able to get multiple blocks in a single IOP.


Metadata is prefetched. You can look at the hit rate in kstats.
Stuart, you might post the output of kstat -n vdev_cache_stats
I regularly see cache hit rates in the 60% range, which isn't bad
considering what is being cached.


# kstat -n vdev_cache_stats
module: zfs                             instance: 0
name:   vdev_cache_stats                class:    misc
        crtime                          129.03798177
        delegations                     25873382
        hits                            114064783
        misses                          182253696
        snaptime                        960064.85352608
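
For what it's worth, the hit rate from those counters works out to
hits / (hits + misses), counting delegations as neither; a quick check:

    # vdev cache hit rate from the kstat counters above
    hits, misses = 114064783, 182253696
    print(round(100.0 * hits / (hits + misses), 1))   # -> 38.5 (percent)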


Here are also some zpool iostat numbers taken during this resilver:

# zpool iostat ldas-cit1 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
ldas-cit1   16.9T  3.49T    165    134  5.17M  1.58M
ldas-cit1   16.9T  3.49T    225    237  1.28M  1.98M
ldas-cit1   16.9T  3.49T    288    317  1.53M  2.26M
ldas-cit1   16.9T  3.49T    174    269  1014K  1.68M


And here is the pool configuration,

# zpool status ldas-cit1
  pool: ldas-cit1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 96h49m, 63.69% done, 55h12m to go
config:

        NAME              STATE     READ WRITE CKSUM
        ldas-cit1         DEGRADED     0     0     0
          raidz2          DEGRADED     0     0     0
            c0t1d0        ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
            c3t1d0        ONLINE       0     0     0
            c4t1d0        ONLINE       0     0     0
            c5t1d0        ONLINE       0     0     0
            c6t1d0        ONLINE       0     0     0
            c0t2d0        ONLINE       0     0     0
            c1t2d0        ONLINE       0     0     0
            c3t2d0        ONLINE       0     0     0
            c4t2d0        ONLINE       0     0     0
            c5t2d0        ONLINE       0     0     0
            spare         DEGRADED     0     0     0
              replacing   DEGRADED     0     0     0
                c6t2d0s0/o  FAULTED    0     0     0  corrupted data
                c6t2d0    ONLINE       0     0     0
              c6t0d0      ONLINE       0     0     0
            c0t3d0        ONLINE       0     0     0
            c1t3d0        ONLINE       0     0     0
            c3t3d0        ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            c4t3d0        ONLINE       0     0     0
            c5t3d0        ONLINE       0     0     0
            c6t3d0        ONLINE       0     0     0
            c0t4d0        ONLINE       0     0     0
            c1t4d0        ONLINE       0     0     0
            c3t4d0        ONLINE       0     0     0
            c5t0d0        ONLINE       0     0     0
            c5t4d0        ONLINE       0     0     0
            c6t4d0        ONLINE       0     0     0
            c0t5d0        ONLINE       0     0     0
            c1t5d0        ONLINE       0     0     0
            c3t5d0        ONLINE       0     0     0
            c4t5d0        ONLINE       0     0     0
            c5t5d0        ONLINE       0     0     0
            c6t5d0        ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            c0t6d0        ONLINE       0     0     0
            c1t6d0        ONLINE       0     0     0
            c3t6d0        ONLINE       0     0     0
            c4t6d0        ONLINE       0     0     0
            c5t6d0        ONLINE       0     0     0
            c6t6d0        ONLINE       0     0     0
            c0t7d0        ONLINE       0     0     0
            c1t7d0        ONLINE       0     0     0
            c3t7d0        ONLINE       0     0     0
            c4t7d0        ONLINE       0     0     0
            c5t7d0        ONLINE       0     0     0
            c6t7d0        ONLINE       0     0     0
            c0t0d0        ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c3t0d0        ONLINE       0     0     0
        spares
          c6t0d0          INUSE     currently in use

errors: No known data errors


--
Stuart Anderson  

Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-22 Thread Erik Trimble

Nicholas Lee wrote:



On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson 
ander...@ligo.caltech.edu mailto:ander...@ligo.caltech.edu wrote:



However, it is a bit disconcerting to have to run with reduced data
protection for an entire week. While I am certainly not going back to
UFS, it seems like it should be at least theoretically possible to
do this
several orders of magnitude faster, e.g., what if every block on the
replacement disk had its RAIDZ2 data recomputed from the degraded


Maybe this is also saying - that for large disk sets a single RAIDZ2 
provides a false sense of security.


Nicholas   



  



I'm assuming the problem is that you are IOPS bound.  Since you wrote 
small files, ZFS uses small stripe sizes, which means that when you 
need to do a full-stripe read to reconstruct the RAIDZ2 parity, you're 
reading only a very small amount of data. You're IOPS bound on the 
replacement disk.


For argument's sake, let's assume you have 4k stripe sizes. Thus, you do:

(1) 4k read across all disks
(2) checksum computation
(3) tiny write to re-silver disk

Assuming you might max out at 300 IOPS (not unreasonable for small reads 
on SATA drives), that results in:


(300 / 2) x 4kB = 600kB/s.

That is, you can do 150 stripe reads and writes, each read/write pair 
reconstructing the parity for 4k of data. And, that might be optimal.


At that rate, 1TB of data will take ( (1024 * 1024 * 1024 kB) / 
600 kB/s ) = 1.8 million seconds =~ 500 hours.
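
The same estimate as a quick sketch (using only the 300-IOPS and 4kB
assumptions stated above):

    # Worst case: one 4 kB read + one write per reconstructed stripe,
    # 300 IOPS total on the bottleneck disk (assumptions from the text above).
    iops, stripe_kb = 300, 4
    rate_kb_s = (iops / 2) * stripe_kb        # 600 kB/s of reconstructed data
    seconds = (1024**3) / rate_kb_s           # 1 TB expressed in kB
    print(rate_kb_s, round(seconds / 3600))   # -> 600.0 kB/s, ~497 hours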



I don't know how ZFS does the actual reconstruction, but I have 
two suggestions:


(1) if ZFS is doing a serial resilver (i.e. resilver stripe 1 before 
doing stripe 2, etc.), would it be possible to NOT do a full-stripe write 
when doing the reconstruction?  That is, only write the reconstructed 
data back to the replacement disk?  That would allow the data disks to 
use their full IOPS reading, and the replacement disk its full IOPS 
writing. It's still going to suck rocks, but only half as much.


(2)  Multiple stripe-reconstruction would probably be better; that is, 
ZFS should reconstruct several adjacent stripes together, up to some 
reasonable total size (say 1MB or so). That way, you could get 
reconstruction rates of 100MB/s (that is, reconstruct the parity for 
100MB of data, NOT writing 100MB/s).   1TB of data @ 100MB/s is only 3 
hours.
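
The batched case in the same terms (taking the 100MB/s figure above as
given, not as a measured number):

    # Batched reconstruction: rebuild the parity for ~100 MB of data per second.
    rate_mb_s = 100
    hours = (1024**2) / rate_mb_s / 3600      # 1 TB expressed in MB
    print(round(hours, 1))                    # -> 2.9, i.e. the ~3 hours above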


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-22 Thread Richard Elling

Stuart Anderson wrote:


On Jun 21, 2009, at 8:57 PM, Richard Elling wrote:


Stuart Anderson wrote:

It is currently taking ~1 week to resilver an x4500 running S10U6,
recently patched, with ~170M small files on ~170 datasets after a
disk failure/replacement, i.e.,


wow, that is impressive.  There is zero chance of doing that with a
manageable number of UFS file systems.


However, it is a bit disconcerting to have to run with reduced data
protection for an entire week. While I am certainly not going back to
UFS, it seems like it should be at least theoretically possible to do 
this

several orders of magnitude faster, e.g., what if every block on the
replacement disk had its RAIDZ2 data recomputed from the degraded
array regardless of whether the pool was using it or not. In that case
I would expect it to be able to sequentially reconstruct in the same few
hours it would take a HW RAID controller to do the same RAID6 job.


ZFS reconstruction is done in time order, so the workload is random for
data which has been updated over time.

Nevertheless, in my lab testing, I was not able to create a random-enough
workload to not be write limited on the reconstructing drive.  Anecdotal
evidence shows that some systems are limited by the random reads.


Perhaps there needs to be an option to re-order the loops for
resilvering on pools with lots of small files to resilver in device
order rather than filesystem order?


The information is not there.  Unlike RAID-1/5/6, ZFS does not
require a 1:N mapping of blocks.



scrub: resilver in progress for 53h47m, 30.72% done, 121h19m to go

Is there anything that can be tuned to improve this performance, e.g.,
adding a faster cache device for reading and/or writing?


Resilver tends to be bound by one of two limits:

1. sequential write performance of the resilvering device

2. random I/O performance of the non-resilvering devices



A quick look at iostat leads me to conjecture that the vdev rebuilding is
taking a very low priority compared to ongoing application I/O (NFSD
in this case). Are there any ZFS knobs that control the relative 
priority of

resilvering to other disk I/O tasks?


Yes, it is low priority.  This is one argument for the competing RFEs:
CR 6592835, resilver needs to go faster
CR 6494473, ZFS needs a way to slow down resilvering

-- richard



Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-22 Thread Bill Sommerfeld
On Mon, 2009-06-22 at 06:06 -0700, Richard Elling wrote:
 Nevertheless, in my lab testing, I was not able to create a random-enough
 workload to not be write limited on the reconstructing drive.  Anecdotal
 evidence shows that some systems are limited by the random reads.

Systems I've run which have random-read-limited reconstruction have a
combination of:
 - regular time-based snapshots
 - daily cron jobs which walk the filesystem, accessing all directories
and updating all directory atimes in the process.

Because the directory dnodes are randomly distributed through the dnode
file, each block of the dnode file likely contains at least one
directory dnode, and as a result each of the tree walk jobs causes the
entire dnode file to diverge from the previous day's snapshot.

If the underlying filesystems are mostly static and there are dozens of
snapshots, a pool traverse spends most of its time reading the dnode
files and finding block pointers to older blocks which it knows it has
already seen.
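
To put a rough number on "likely contains at least one directory dnode"
(the 5% directory fraction below is just an assumed figure; the 512-byte
dnode and 32-dnodes-per-16KB-block packing are standard ZFS values):

    # Probability that a 16 KB dnode-file block (32 x 512-byte dnodes)
    # holds at least one directory dnode, for an assumed directory fraction.
    dir_fraction = 0.05                        # assumed share of dnodes that are directories
    p_block_touched = 1 - (1 - dir_fraction) ** 32
    print(round(p_block_touched, 2))           # -> 0.81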


Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-22 Thread Stuart Anderson


On Jun 21, 2009, at 10:21 PM, Nicholas Lee wrote:




On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson ander...@ligo.caltech.edu 
 wrote:


However, it is a bit disconcerting to have to run with reduced data
protection for an entire week. While I am certainly not going back to
UFS, it seems like it should be at least theoretically possible to  
do this

several orders of magnitude faster, e.g., what if every block on the
replacement disk had its RAIDZ2 data recomputed from the degraded

Maybe this is also saying - that for large disk sets a single RAIDZ2  
provides a false sense of security.


This configuration is with 3 large RAIDZ2 devices but I have more  
recently
been building thumper/thor systems with a larger number of smaller  
RAIDZ2's.


Thanks.

--
Stuart Anderson  ander...@ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson





Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-22 Thread Erik Trimble
All this discussion hasn't answered one thing for me:   exactly _how_ 
does ZFS do resilvering?  Both in the case of mirrors, and of RAIDZ[2] ?


I've seen some mention that it goes in chronological order (which to me, 
means that the metadata must be read first) of file creation, and that 
only used blocks are rebuilt, but exactly what is the methodology being 
used?


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-21 Thread Richard Elling

Stuart Anderson wrote:

It is currently taking ~1 week to resilver an x4500 running S10U6,
recently patched, with ~170M small files on ~170 datasets after a
disk failure/replacement, i.e.,


wow, that is impressive.  There is zero chance of doing that with a
manageable number of UFS file systems.



 scrub: resilver in progress for 53h47m, 30.72% done, 121h19m to go

Is there anything that can be tuned to improve this performance, e.g.,
adding a faster cache device for reading and/or writing?


Resilver tends to be bound by one of two limits:

1. sequential write performance of the resilvering device

2. random I/O performance of the non-resilvering devices

A while back, I was doing some characterization of this, but the
funding disappeared :-(  So, it is unclear whether or how caching
might help.
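
As a rough illustration of those two bounds (every number here is a
made-up placeholder, not a measurement), resilver time is roughly the
larger of the two:

    # Rough model of the two resilver bounds (illustrative numbers only).
    data_tb = 1.0                  # allocated data to rebuild on the new disk
    seq_write_mb_s = 80.0          # sustained write rate of the resilvering disk
    blocks = 50_000_000            # allocated blocks to read back
    random_iops = 300              # aggregate random-read IOPS of the surviving disks

    write_bound_h = data_tb * 1024 * 1024 / seq_write_mb_s / 3600
    read_bound_h = blocks / random_iops / 3600
    print(round(write_bound_h, 1), round(read_bound_h, 1))   # -> 3.6 46.3; the larger bound wins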
-- richard



Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-21 Thread Stuart Anderson


On Jun 21, 2009, at 8:57 PM, Richard Elling wrote:


Stuart Anderson wrote:

It is currently taking ~1 week to resilver an x4500 running S10U6,
recently patched, with ~170M small files on ~170 datasets after a
disk failure/replacement, i.e.,


wow, that is impressive.  There is zero chance of doing that with a
manageable number of UFS file systems.


However, it is a bit disconcerting to have to run with reduced data
protection for an entire week. While I am certainly not going back to
UFS, it seems like it should be at least theoretically possible to do  
this

several orders of magnitude faster, e.g., what if every block on the
replacement disk had its RAIDZ2 data recomputed from the degraded
array regardless of whether the pool was using it or not. In that case
I would expect it to be able to sequentially reconstruct in the same few
hours it would take a HW RAID controller to do the same RAID6 job.

Perhaps there needs to be an option to re-order the loops for
resilvering on pools with lots of small files to resilver in device
order rather than filesystem order?





scrub: resilver in progress for 53h47m, 30.72% done, 121h19m to go

Is there anything that can be tuned to improve this performance,  
e.g.,

adding a faster cache device for reading and/or writing?


Resilver tends to be bound by one of two limits:

1. sequential write performance of the resilvering device

2. random I/O performance of the non-resilvering devices



A quick look at iostat leads me to conjecture that the vdev rebuilding  
is

taking a very low priority compared to ongoing application I/O (NFSD
in this case). Are there any ZFS knobs that control the relative  
priority of

resilvering to other disk I/O tasks?

Thanks.

--
Stuart Anderson  ander...@ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson





Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-21 Thread Nicholas Lee
On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson
ander...@ligo.caltech.edu wrote:


 However, it is a bit disconcerting to have to run with reduced data
 protection for an entire week. While I am certainly not going back to
 UFS, it seems like it should be at least theoretically possible to do this
 several orders of magnitude faster, e.g., what if every block on the
 replacement disk had its RAIDZ2 data recomputed from the degraded


Maybe this is also saying - that for large disk sets a single RAIDZ2
provides a false sense of security.

Nicholas