2012-01-16 8:39, Bob Friesenhahn wrote:
On Sun, 15 Jan 2012, Edward Ned Harvey wrote:

While I'm waiting for this to run, I'll make some predictions:
The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading
the initial sequential file should take ~16 sec
After fragmentation, it should be essentially random 4k fragments (32768
bits). I figure each time the head is able to find useful data, it takes

The 4K fragments are the part I don't agree with. ZFS does not do that.
If you were to run raidzN over a wide enough array of disks you could
end up with 4K fragments (distributed across the disks), but then you
would always have 4K fragments.
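
For what it's worth, here is a rough back-of-the-envelope sketch in
Python of what the two fragment-size assumptions would mean for total
read time (the ~8 ms per seek and the ~1 Gbit/s streaming rate are my
assumptions, not measurements, and every fragment is charged one full
seek):

# Rough estimate of read time for a 2 GB file, sequential vs fragmented.
# Assumptions (mine, not measured): ~1 Gbit/s streaming rate, ~8 ms per
# random seek, and one full seek per fragment.

FILE_BYTES   = 2 * 1024**3      # 2 GB
STREAM_BPS   = 1e9 / 8          # ~1 Gbit/s in bytes per second
SEEK_SECONDS = 0.008            # ~8 ms average seek + rotational delay

def read_time(fragment_bytes):
    fragments = FILE_BYTES / fragment_bytes
    return fragments * SEEK_SECONDS + FILE_BYTES / STREAM_BPS

print("sequential      : %7.1f s" % (FILE_BYTES / STREAM_BPS))
print("4 KB fragments  : %7.1f s" % read_time(4 * 1024))
print("128 KB fragments: %7.1f s" % read_time(128 * 1024))

The point being that whether the fragments are 4K pieces or whole 128K
records (the default recordsize) changes the estimate by more than an
order of magnitude.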


I think that in order to create a truly fragmented ZFS layout,
Edward needs to do sync writes (without a ZIL?) so that every
block and its metadata go to disk (coalesced as they may be)
and no two blocks of the file would end up adjacent on disk.
Although creating snapshots should give that effect...

Overall, he would have to fight hard to defeat ZFS's anti-fragmentation
attempts - though it does become possible on very full pools ;)
Hint: pre-fill Ed's test pool to 90%, then run the tests :)
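
If someone wants to provoke that kind of layout deliberately, a rough
sketch (in Python; the paths, the 4 KB chunk size and the alternating
writes are my assumptions about what would defeat write coalescing, and
the pool should be pre-filled first) could look like this:

#!/usr/bin/env python
# Rough sketch: provoke a fragmented layout by issuing small synchronous
# writes to two files in alternation, so their blocks get allocated
# interleaved rather than coalesced. Pre-fill the pool (e.g. to ~90%)
# before running; the paths below are made up.

import os

CHUNK = 4096                  # small, recordsize-or-less writes
COUNT = 256 * 1024            # 256K chunks ~= 1 GB per file

def open_sync(path):
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)

a = open_sync("/testpool/frag/file_a")
b = open_sync("/testpool/frag/file_b")
buf = os.urandom(CHUNK)

for i in range(COUNT):
    os.write(a, buf)          # each sync write pushes blocks out now;
    os.write(b, buf)          # alternating files interleaves allocations
    # optionally: take a snapshot every so often to pin the old blocks

os.close(a)
os.close(b)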

I think that to move the discussion of defragmentation
tools forward, we should define a metric of fragmentation - as Bob and
Edward have often brought up. This implies accounting for
the effect on the end user of some mix of factors like:

1) Size of "free" reads and writes, i.e. cheap prefetch of
   a HDD's track as opposed to seeking; reads of an SSD block
   (those 256KB that are sliced into 4/8KB pages) as opposed
   to random reads of pages from separate SSD blocks.
   Seeks to neighboring tracks may be faster than full-disk
   seeks, but they are slower than no seeks at all.

   For optimal read performance, we might want to prefetch
   whole tracks/blocks (not just 64KB from the position of the
   block ZFS wants, but the whole track including this block,
   knowing in advance the sector numbers of its start and end).

   Effect: we might not need to fully defragment data, but
   rather position long-enough ranges "correctly" on the
   media. These may span neighboring tracks/blocks.

   We do need to know the media's performance characteristics
   to do this optimally (i.e. which disk tracks have which
   byte lengths, and where each track starts in terms of
   LBA offsets).

   Also, disks' internal reallocation to spare blocks
   may lead to uncontrollable random seeks, degrading
   performance over time, but an FS is unlikely to have
   control or knowledge of that.

   Metric: start-addresses and lengths of fastest-read
   locations (i.e. whole tracks or SSD blocks) on leaf
   storage. May be variable within the storage device.


2) In the case of ZFS - reads of contiguously allocated and
   monotonically increasing block numbers of data from a
   file's or zvol's most current version (the live dataset,
   as opposed to the block-change history in snapshots and
   the monotonic TXG number increase in on-disk blocks).
   This may be in unresolvable conflict with clones and
   deduplication, so some files or volumes cannot be
   made contiguous without breaking the contiguity of others.
   Still, some "overall contiguousness" can be optimised.

   For users it might also be important to have many files
   from the same directory stored close to each other, especially
   if these are small files that are used together somehow
   (source code, thumbnails, whatever).

   Effect: fast reads of the most current datasets.
   Metric: length of contiguous (DVA) stretches of current
   logical block numbers of userdata divided by total data
   size; the number of separate fragments should somehow be
   included too ;) (a sketch of computing this follows after
   point 3).

3) In the case of ZFS - fast access to metadata, especially
   branches of the current block pointer tree in order
   of increasing TXG numbers.

   Effect: fast reads of metadata, e.g. during scrubs.
   Metric: length of contiguous (DVA) stretches of current
   block pointer trees in same-or-increasing TXG order
   divided by the total size of the tree (branch).
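
To make the metric from point 2 concrete, here is a minimal sketch of
how it could be computed from a file's block list. The (logical offset,
DVA offset, length) tuples are assumed to come from something like zdb
output; this is an illustration of the metric, not a real tool:

# Contiguity metric from point 2: given a file's blocks as
# (logical_offset, dva_offset, length) tuples sorted by logical offset,
# report the number of fragments and the longest contiguous stretch
# relative to the total data size.

def contiguity(blocks):
    total = sum(length for _, _, length in blocks)
    stretches = []
    run = 0
    prev_end = None
    for _, dva, length in blocks:
        if prev_end is not None and dva != prev_end:
            stretches.append(run)       # DVA jump => a new fragment starts
            run = 0
        run += length
        prev_end = dva + length
    stretches.append(run)
    return {
        "fragments": len(stretches),
        "longest_stretch_ratio": max(stretches) / float(total) if total else 0.0,
        "stretch_lengths": stretches,
    }

# Example: three 128K blocks, the last one allocated far away.
print(contiguity([(0,      0x100000, 131072),
                  (131072, 0x120000, 131072),
                  (262144, 0x900000, 131072)]))

The same idea, applied to block pointer tree blocks in TXG order, would
give the metric from point 3.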

There is likely no absolute fragmentation or defragmentation,
but there are some optimisations. For example, ZFS's attempt
to coalesce about 10MB of data from one write into one metaslab
may suffice. And we do actually see performance hits when it
can't find long-enough stretches (quickly enough) on pools over
the empirical 80% fill level. Defragmentation might then aim to
clear up enough 10MB-long stretches of free space and to
relocate smaller fragments of current userdata or {monotonic
BPTree} metadata into these clearings.
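
As a toy illustration of that aim (the free-extent list format here is
made up - a real tool would have to read the metaslab space maps), a
defragmenter's first pass might simply look for pairs of free extents
separated by little enough allocated data that evacuating it yields a
10MB+ clearing:

# Toy illustration of "clearing up 10MB stretches": for each pair of
# neighbouring free extents (offset, length), check how much allocated
# data separates them and whether moving it away would merge them into
# a clearing of at least the target size.

TARGET = 10 * 1024 * 1024         # ~10MB stretch, as discussed above

def merge_candidates(free_extents):
    """free_extents: list of (offset, length) gaps, sorted by offset."""
    out = []
    for (o1, l1), (o2, l2) in zip(free_extents, free_extents[1:]):
        between = o2 - (o1 + l1)              # allocated bytes in between
        if l1 + l2 >= TARGET and between <= TARGET // 4:
            out.append({"at_offset": o1,
                        "move_bytes": between,
                        "resulting_clearing": l1 + between + l2})
    return out

free = [(0, 6 << 20), (7 << 20, 5 << 20), (40 << 20, 1 << 20)]
for c in merge_candidates(free):
    print(c)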

In particular, even if we have old data in snapshots, as long
as it is stored in long 10MB+ contiguous stretches, we might
just leave it there. It is already about as good as it gets.

Also, as I proposed elsewhere, the metadata might be stored
in separate stretches of physical disk space - thus different
aims of defragmenting userdata and metadata (and free space)
would not conflict.

What do you think?
//Jim
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
