On Aug 9, 2006, at 8:18 AM, Roch wrote:



    So while I'm feeling optimistic :-), we really ought to be
  able to do this in two I/O operations. If we have, say, 500K
  of data to write (including all of the metadata), we should
  be able to allocate a contiguous 500K block on disk and
  write that with a single operation. Then we update the
  überblock.

Hi Anton. Optimistic? A little, yes.

The data blocks should have aggregated quite well into
near-recordsize I/Os; are you sure they did not? No O_DSYNC
in here, right?

When I repeated this with just 512K written in 1K chunks via dd,
I saw six 16K writes.  Those were the largest.  The others were
around 1K-4K.  No O_DSYNC.

  dd if=/dev/zero of=xyz bs=1k count=512

So some writes are being aggregated, but we're missing a lot.
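
To make concrete what I'd expect the aggregation to do, here's a toy
sketch; this is not ZFS's actual I/O code, and the structure and
function names are invented for illustration. Given 512 contiguous 1K
dirty buffers, something along these lines should collapse them into a
handful of large writes rather than dozens of 1K-16K ones.

  /*
   * Toy sketch (not ZFS code): coalesce physically adjacent pending
   * writes into larger I/Os.  Assumes a singly linked list of pending
   * writes already sorted by target offset; all names are hypothetical.
   */
  #include <stddef.h>
  #include <stdint.h>

  struct pending_write {
      uint64_t offset;              /* target offset on the device   */
      uint64_t size;                /* length in bytes               */
      struct pending_write *next;   /* next entry, sorted by offset  */
  };

  /* Issue one physical write covering [offset, offset + size).
   * Data buffers are omitted for brevity; a real implementation would
   * gather the individual buffers into one request. */
  extern void issue_io(uint64_t offset, uint64_t size);

  /*
   * Walk the sorted list and merge runs of adjacent writes, so that
   * 512 x 1K dirty buffers become a few large writes rather than
   * hundreds of small ones.
   */
  static void
  flush_coalesced(struct pending_write *list)
  {
      while (list != NULL) {
          uint64_t start = list->offset;
          uint64_t end = list->offset + list->size;

          /* Extend the run while the next write starts where this one ends. */
          while (list->next != NULL && list->next->offset == end) {
              list = list->next;
              end = list->offset + list->size;
          }

          issue_io(start, end - start);
          list = list->next;
      }
  }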

Once the data blocks are on disk, we have the information
necessary to update the indirect blocks iteratively up to
the überblock. Those are the smaller I/Os; I guess that
because of ditto blocks they go to physically separate
locations, by design.

We shouldn't have to wait for the data blocks to reach disk,
though.  We know where they're going in advance.  One of the
key advantages of the überblock scheme is that we can, in a
sense, speculatively write to disk.  We don't need the tight
ordering that UFS requires to avoid security exposures and
allow the file system to be repaired.  We can lay out all of
the data and metadata, write them all to disk, choose new
locations if the writes fail, etc. and not worry about any
ordering or state issues, because the on-disk image doesn't
change until we commit it.
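
To spell out the "we know where they're going in advance" point: block
addresses are assigned at allocation time, so every indirect block can
be filled in, with the address and checksum of each child, while
everything is still only in memory. Here is a toy sketch of that single
in-memory pass; it is not the actual ZFS pipeline, and all the names
are invented:

  /*
   * Toy sketch, not ZFS code: prepare an in-memory tree of blocks so
   * that every block, data and metadata alike, can be written in any
   * order.  All names are hypothetical.
   */
  #include <stddef.h>
  #include <stdint.h>

  struct block {
      void          *buf;        /* in-memory contents                   */
      uint64_t       size;
      uint64_t       dva;        /* on-disk address, assigned up front   */
      uint64_t       cksum;      /* checksum of buf                      */
      struct block **children;   /* NULL-terminated array; NULL for data */
  };

  extern uint64_t allocate_space(uint64_t size);    /* pick a free region */
  extern uint64_t checksum(const void *buf, uint64_t size);
  extern void     fill_block_pointer(struct block *parent, int slot,
                      uint64_t dva, uint64_t cksum); /* record child in buf */

  /*
   * Depth-first pass: give every block an address and checksum and
   * record them in its parent's buffer.  Nothing here touches the disk,
   * so no write has to wait for any other write to complete.
   */
  static void
  prepare_tree(struct block *b)
  {
      if (b->children != NULL) {
          for (int i = 0; b->children[i] != NULL; i++) {
              prepare_tree(b->children[i]);
              fill_block_pointer(b, i, b->children[i]->dva,
                  b->children[i]->cksum);
          }
      }
      b->dva = allocate_space(b->size);
      b->cksum = checksum(b->buf, b->size);
  }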

You're right, the ditto block mechanism will mean that some
writes will be spread around (at least when using a
non-redundant pool like mine), but then we should have at
most three writes followed by the überblock update, assuming
three degrees of replication.

All of these, though, are normally done asynchronously with
respect to applications, unless the disks are flooded.

Which is a good thing (I think they're asynchronous anyway,
unless the cache is full).

But I follow you in that it may be remotely possible to
reduce the number of iterations in the process by assuming
that the I/Os will all succeed, then, if some fail, fixing
up the consequences and, when all are done, updating the
überblock. I would not hold my breath quite yet for that.

Hmmm.  I guess my point is that we shouldn't need to iterate
at all.  There are no dependencies between these writes; only
between the complete set of writes and the überblock update.
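
To put the whole argument in one place, here is a sketch of the commit
sequence I have in mind; again, this is not the actual ZFS code, and
every name is hypothetical. Every prepared write, data blocks, indirect
blocks, and ditto copies alike, is issued at once; the only ordering
point is that the überblock write waits for the whole set. The one
simplification is noted in the comments: relocating a failed block also
means updating and rewriting its parent, which is the "fix up the
consequence" step.

  /*
   * Toy sketch of the commit sequence described above; not the actual
   * ZFS code, and every name here is hypothetical.
   */
  #include <stddef.h>
  #include <stdint.h>

  struct io;                                 /* an in-flight asynchronous write */
  struct block {
      const void *buf;                       /* in-memory contents              */
      uint64_t    size;
      uint64_t    dva;                       /* pre-assigned on-disk address    */
  };

  extern struct io *issue_write_async(uint64_t dva, const void *buf,
      uint64_t size);
  extern int        io_wait(struct io *io);  /* returns 0 on success            */
  extern uint64_t   allocate_space(uint64_t size);
  extern void       write_ueberblock_sync(void);

  static void
  commit_txg(struct block *blocks, size_t nblocks, struct io **ios)
  {
      /* Issue every write at once: data, indirect blocks, ditto copies.
       * There is no ordering among them. */
      for (size_t i = 0; i < nblocks; i++)
          ios[i] = issue_write_async(blocks[i].dva, blocks[i].buf,
              blocks[i].size);

      /*
       * Wait for the whole set.  A failed write just gets a new location
       * and is reissued; nothing on disk is live yet, so there is no
       * state to repair.  (Simplified: relocating a block also means
       * updating and rewriting its parent, the "fix up the consequence"
       * step.)
       */
      for (size_t i = 0; i < nblocks; i++) {
          while (io_wait(ios[i]) != 0) {
              blocks[i].dva = allocate_space(blocks[i].size);
              ios[i] = issue_write_async(blocks[i].dva, blocks[i].buf,
                  blocks[i].size);
          }
      }

      /* Only now does the on-disk image change. */
      write_ueberblock_sync();
  }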

-- Anton

