> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
> 
> > Suppose you write a 1G file to disk.  It is a database store.  Now
> > you start running your db server.  It starts performing transactions
> > all over the place.  It overwrites the middle 4k of the file, and it
> > overwrites 512b somewhere else, and so on.  Since this is COW, each
> > one of these little writes in the middle of the file will actually
> > get mapped to unused sectors of disk.  Depending on how quickly
> > they're happening, they may be aggregated
> 
> Oops.  I see an error in the above.  Other than tail blocks, or due to
> compression, zfs will not write a COW data block smaller than the zfs
> filesystem blocksize.  If the blocksize was 128K, then updating just
> one byte in that 128K block results in writing a whole new 128K block.

Before anything else, let's define what "fragmentation" means in this
context, or more importantly, why anyone would care.

Fragmentation, in this context, is a measure of how many blocks sit
sequentially aligned on disk, such that a sequential read will not suffer a
seek/latency penalty.  The reason somebody would care is performance - the
time the disk spends delivering payload versus the time it wastes on
overhead.  But wait!  There are different types of reads.  If you read using
a scrub or a zfs send, then it reads the blocks in temporal order, so
anything which was previously write-coalesced (even from many different
files) will again be read-coalesced (which is nice).  But if you read a file
using something like tar or cp or cat, then it reads the file in sequential
file order, which will differ from temporal order unless the file was
originally written sequentially and never overwritten by COW.

Suppose you have a 1G file open, and a snapshot of this file is on disk from
a previous point in time.
/* illustrative pseudocode: fd is the open 1G file, buf holds 4k of data */
for (i = 0; i < 1000000000000ULL; i++) {                  /* a trillion iterations */
        lseek(fd, rand() % (1024L * 1024 * 1024), SEEK_SET);  /* random offset in [0, 1G) */
        write(fd, buf, 4096);                                 /* overwrite 4k in place */
}

Something like this would quickly try to write a bunch of separate and
scattered 4k blocks at different offsets within the file.  Every 32 of these
4k writes would be write-coalesced into a single 128k on-disk block.  

Sometime later, you read the whole file sequentially, with something like cp
or tar or cat.  The first 4k comes from this 128k block...  The next 4k
comes from another 128k block...  The next 4k comes from yet another 128k
block...  Essentially, the file has become very fragmented and scattered
about on the physical disk.  Every 4k read results in a random disk seek.
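
To put a rough number on that (the 10ms average random access is the same
assumption used further down):

#include <stdio.h>

/* Back-of-the-envelope cost of reading a 1G file in 4k chunks when every
 * chunk lands in a different 128k on-disk block, so each read pays a full
 * random access.  The 10ms access time is an assumption, not a measurement. */
int main(void)
{
    double file_bytes = 1024.0 * 1024 * 1024;  /* the 1G file */
    double read_bytes = 4096.0;                /* 4k application reads */
    double access_ms  = 10.0;                  /* assumed seek + rotational latency */

    double reads  = file_bytes / read_bytes;       /* 262,144 reads */
    double wasted = reads * access_ms / 1000.0;    /* seconds of pure overhead */

    printf("%.0f reads, %.0f sec (~%.0f min) spent just seeking\n",
           reads, wasted, wasted / 60.0);          /* ~2621 sec, ~44 minutes */
    return 0;
}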


> The worst case
> fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25%
> ((100*1/((8*1024)/512))).

You seem to be assuming that reading a 512b disk sector and its neighboring
512b sector counts as contiguous blocks.  And since there are guaranteed to
be exactly 256 sectors in every 128k filesystem block, there is no
fragmentation within those 256 contiguous sectors, guaranteed.
Unfortunately, the 512b sector size is just an arbitrary number (and
variable - actually 4k on modern disks), and the resultant percentage of
fragmentation is equally arbitrary.
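
To illustrate just how arbitrary, here is the quoted formula evaluated for
both sector sizes (a quick sketch, nothing more):

#include <stdio.h>

/* The quoted worst-case formula, 100 * 1 / (blocksize / sectorsize),
 * evaluated for 512b and 4k sectors.  Same 8K blocks, very different
 * "fragmentation" percentages. */
static double worst_case_pct(double blocksize, double sectorsize)
{
    return 100.0 / (blocksize / sectorsize);
}

int main(void)
{
    printf("8K blocks, 512b sectors: %.2f%%\n", worst_case_pct(8 * 1024, 512));   /*  6.25% */
    printf("8K blocks, 4k sectors:   %.2f%%\n", worst_case_pct(8 * 1024, 4096));  /* 50.00% */
    return 0;
}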

To produce a number that actually matters, what you need to do is calculate
the percentage of time the disk is able to deliver payload versus the
percentage of time it spends on time-wasting "overhead" operations - seek
and latency.

Suppose your disk speed is 1Gbit/sec while actively engaging the head, and
suppose the average random access (seek & latency) is 10ms.  Suppose you
wish for 99% efficiency.  The 10ms must be 1% of the time, so the head must
be engaged for the other 99%, which is 990ms - very nearly 1Gbit, or
approximately 123MB of sequential payload for every random disk access.
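
Spelled out as a quick sketch, using the assumed 1Gbit/sec and 10ms figures:

#include <stdio.h>

/* Sequential payload needed per random access to hit a target efficiency,
 * using the assumed figures above: 1Gbit/sec transfer, 10ms random access. */
int main(void)
{
    double rate_bits  = 1e9;    /* sequential transfer rate, bits/sec */
    double access_sec = 0.010;  /* average seek + rotational latency */
    double target     = 0.99;   /* fraction of time spent delivering payload */

    /* If 10ms is to be 1% of the time, the head is engaged 990ms per access. */
    double engaged_sec   = access_sec * target / (1.0 - target);
    double payload_bytes = engaged_sec * rate_bits / 8;

    printf("~%.1f MB of sequential payload per random access\n", payload_bytes / 1e6);
    printf("that is ~%.0f times a 128k block\n", payload_bytes / (128.0 * 1024));  /* ~944 */
    return 0;
}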

That's 944 times larger than 128k, the largest blocksize currently in zfs,
and obviously larger still compared to what you mentioned - 4k or 8k
recordsizes or 512b disk sectors...

Suppose you have 128k blocks written to disk, all scattered about in random
order.  Your disk must seek & rotate for 10ms, then it will be engaged for
about 1ms reading the 128k, then it will seek & rotate again for 10ms...  I
would call that roughly a 9% payload and 91% wasted time.  Fragmentation at
this level hurts you really badly.
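
The same arithmetic for this case, again with the assumed 1Gbit/sec and 10ms
figures:

#include <stdio.h>

/* Payload fraction when every 128k block read costs a full random access,
 * with the same assumed figures: 1Gbit/sec transfer, 10ms seek + latency. */
int main(void)
{
    double block_bits = 128.0 * 1024 * 8;  /* one 128k block */
    double rate_bits  = 1e9;               /* sequential transfer rate */
    double access_sec = 0.010;             /* random access per block */

    double read_sec = block_bits / rate_bits;              /* ~1.05 ms */
    double payload  = read_sec / (read_sec + access_sec);  /* ~9.5% */

    printf("%.2f ms reading vs %.0f ms seeking: %.0f%% payload, %.0f%% overhead\n",
           read_sec * 1000, access_sec * 1000, payload * 100, (1 - payload) * 100);
    return 0;
}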

Suppose there is a TXG flush every 5 seconds.  You write a program that
writes a single byte to disk once every 5.1 seconds, and you leave it
running for a very, very long time.  You now have millions of 128k blocks on
disk, scattered about in random order.  You start a scrub.  It will read
128k, then random seek, then read 128k, and so on.
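
A minimal sketch of that program (fd is assumed to be an open file on the
pool; error handling omitted):

#include <unistd.h>

/* Minimal sketch of the writer described above: one byte every 5.1 seconds,
 * so no two writes ever share a txg.  fd is assumed to be an open file on
 * the pool. */
void drip_writer(int fd)
{
    for (;;) {
        (void) write(fd, "x", 1);  /* one tiny write per transaction group */
        sleep(5);                  /* wait out the 5 second txg interval... */
        usleep(100000);            /* ...plus 0.1 sec, so writes never coalesce */
    }
}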

I would call that 100% fragmentation, because there are no contiguously
aligned sequential blocks on disk anywhere.  But again, any measure of
"percent fragmentation" is purely arbitrary unless you know (a) which type
of read behavior is being measured (temporal or file order), (b) the
sequential engaged disk speed, and (c) the average random access time.
