The question put forth is whether the ZFS 128K blocksize is sufficient
to saturate a regular disk. There is a great body of evidence showing
that larger write sizes, with a matching large FS cluster size, lead
to more throughput. The counterpoint is that ZFS schedules its I/O
like nothing seen before and manages to saturate a single disk using
enough concurrent 128K I/Os.

<There are a few things I did here for the first time, so I may have
erred in places. I am proposing this for review by the community.>

I first measured the throughput of a write(2) to a raw device, using
for instance this:

        dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

On Solaris we would see some overhead from reading the block from
/dev/zero and then issuing the write call. The tightest function that
fences the I/O is default_physio(). That function issues the I/O to
the device and then waits for it to complete. If we take the elapsed
time spent in this function and count the bytes that are I/O-ed, this
should give a good hint as to the throughput the device is providing.
The above dd command issues a single I/O at a time (the d-script used
to measure this is attached).
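
The attached phys.d is the actual script; a simplified sketch of the
idea (details here may differ from the attachment, and the bdev_strategy
argument layout is assumed from the buf_t CTF data) looks like this:

    #!/usr/sbin/dtrace -s
    /*
     * Simplified sketch of phys.d: time spent inside default_physio()
     * and the bytes handed to the strategy routine while in there.
     */
    #pragma D option quiet

    fbt::default_physio:entry
    {
            self->ts = timestamp;
    }

    /* assumes args[0] is the buf_t passed to bdev_strategy() */
    fbt::bdev_strategy:entry
    /self->ts/
    {
            @bytes = sum(args[0]->b_bcount);
            @ios   = count();
    }

    fbt::default_physio:return
    /self->ts/
    {
            @ms = sum((timestamp - self->ts) / 1000000);
            self->ts = 0;
    }

    dtrace:::END
    {
            printa("bytes %@d; ms of phys %@d; I/Os %@d\n",
                @bytes, @ms, @ios);
    }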

Trying different blocksizes I see:

   Bytes sent; elapsed time in default_physio; avg I/O size; throughput

     8 MB;  3576 ms of phys; avg sz :   16 KB; throughput  2 MB/s
     9 MB;  1861 ms of phys; avg sz :   32 KB; throughput  4 MB/s
    31 MB;  3450 ms of phys; avg sz :   64 KB; throughput  8 MB/s
    78 MB;  4932 ms of phys; avg sz :  128 KB; throughput 15 MB/s
   124 MB;  4903 ms of phys; avg sz :  256 KB; throughput 25 MB/s
   178 MB;  4868 ms of phys; avg sz :  512 KB; throughput 36 MB/s
   226 MB;  4824 ms of phys; avg sz : 1024 KB; throughput 46 MB/s
   226 MB;  4816 ms of phys; avg sz : 2048 KB; throughput 46 MB/s
    32 MB;   686 ms of phys; avg sz : 4096 KB; throughput 46 MB/s
   224 MB;  4741 ms of phys; avg sz : 8192 KB; throughput 47 MB/s
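
(The throughput column is simply bytes sent divided by the elapsed
default_physio time; for example, 178 MB over 4868 ms is about 36 MB/s.)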

Now let's see what ZFS gets. I measure using a single dd process. ZFS
will chunk the data up into 128K blocks. The dd command itself only
interacts with memory; the I/Os are scheduled under the control of
spa_sync(). So in the d-script (attached, and sketched after the
results below) I watch for the start of an spa_sync and time it by
elapsed time. At the same time I gather the number of bytes and keep a
count of the I/Os (bdev_strategy) being issued. When the spa_sync
completes we are sure that all of those are on stable storage. The
script is a bit more complex because there are 2 threads that issue
spa_sync, but only one of them actually becomes active, so the script
will print out some spurious lines of output at times. I measure I/O
with the script while this runs:


        dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

And I see:

   1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
   1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s
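
A simplified sketch of the idea behind spa_sync.d (the attachment is
the actual script; it copes with the two spa_sync threads, which this
sketch does not, and the global D variables here are updated from
several CPUs, so treat it as an approximation):

    #!/usr/sbin/dtrace -s
    /*
     * Simplified sketch of spa_sync.d: time each spa_sync() pass and
     * sum the bytes and count of bdev_strategy() calls issued while
     * the sync is in progress.
     */
    #pragma D option quiet

    fbt::spa_sync:entry
    /busy == 0/
    {
            busy  = 1;
            start = timestamp;
            bytes = 0;
            ios   = 0;
    }

    /* assumes args[0] is the buf_t passed to bdev_strategy() */
    fbt::bdev_strategy:entry
    /busy/
    {
            bytes += args[0]->b_bcount;
            ios++;
    }

    fbt::spa_sync:return
    /busy/
    {
            this->ms = (timestamp - start) / 1000000;
            printf("%d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
                bytes / 1048576, this->ms,
                ios ? bytes / ios / 1024 : 0,
                this->ms ? (bytes / 1048576) * 1000 / this->ms : 0);
            busy = 0;
    }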


OK, I cheated. Here, ZFS is given a full disk to play with, and in
that case ZFS enables the write cache. Note that even with the write
cache enabled, spa_sync() only completes after a flush of the cache
has been executed, so the 60 MB/s does correspond to data that has
reached the platter. I just tried disabling the cache (with format -e),
but I am not sure whether ZFS takes that into account; the results are
the same 60 MB/s. This will have to be confirmed.

With the write cache enabled, the physio test reaches 66 MB/s as soon
as we issue 16KB I/Os. Clearly, though, the data is not on the platter
when the timed function completes.

Another variable not fully controlled is the physical (cylinder)
location of the I/Os. It could be that some of the differences come
from that.

What do I take away?

        A single 2 MB physical I/O gets 46 MB/s out of my disk.

        35 concurrent 128K I/Os sustained, followed by metadata I/O,
        followed by a flush of the write cache, allow ZFS to get 60
        MB/s out of the same disk.


This is what underwrites my belief that a 128K blocksize is
sufficiently large. Now, nothing here proves that 256K would not give
more throughput, so nothing is really settled. But I hope this helps
put us on common ground.


-r



Attachment: phys.d
Description: Binary data

Attachment: spa_sync.d
Description: Binary data
