Hi,

In the process of testing the Lustre DMU-OSS with a write-intensive 
workload, I have seen a performance issue where IOs were being sent to 
disk in 512-byte sizes (even though we are currently doing 4 KB writes 
per transaction).

I have noticed that vdev_queue.c is not able to aggregate IOs, perhaps 
because vdev_file_io_start() is not doing asynchronous I/O.

To try to fix this, I added ZIO_STAGE_VDEV_IO_START to the list of 
async I/O stages, which somewhat improved the number of IO 
aggregations, but not nearly enough. For some reason, the number of 
nodes in vq_pending_tree and vq_deadline_tree doesn't go much above 1, 
even though the disk is always busy.

I have also noticed that the 1 GB file produced by this benchmark had >2 
million blocks, with an average block size (as reported by zdb -bbc) of 
524 bytes or so, instead of the 128 KB block size I expected. Even 
manually setting the "recordsize" property to 128 KB (which was already 
the default) didn't have any effect.

After changing the Lustre DMU code to call dmu_object_alloc() with a 
blocksize of 128 KB, throughput improved *a lot*.
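
For reference, the allocation now looks roughly like the sketch below. 
It's only illustrative: alloc_data_object() is a made-up wrapper name, 
and the object type and the surrounding objset/tx handling are 
simplified compared to what the Lustre OSS code actually does.

    #include <sys/dmu.h>

    /*
     * Sketch: allocate an object with an explicit 128 KB data block
     * size instead of letting it default to 512 bytes.  The object
     * type and the objset/tx management are placeholders for what the
     * real OSS code does.
     */
    static uint64_t
    alloc_data_object(objset_t *os, dmu_tx_t *tx)
    {
            return (dmu_object_alloc(os, DMU_OT_PLAIN_FILE_CONTENTS,
                128 * 1024,         /* data block size */
                DMU_OT_NONE, 0,     /* no bonus buffer */
                tx));
    }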

Strangely (to me, at least), it seems that in ZFS all regular files are 
created with 512-byte data block sizes, and that the "recordsize" 
property only affects the maximum write size per transaction in 
zfs_write(). Is this correct?
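
For what it's worth, here is a toy userland snippet (not actual ZFS 
code, just my reading of the chunking done in zfs_write()) showing what 
I understand the per-transaction splitting to be: a write is chopped 
into at most recordsize-sized, recordsize-aligned chunks, and I couldn't 
see the property influencing anything beyond that.

    #include <stdio.h>
    #include <stdint.h>

    /* offset within a recordsize-aligned chunk (recordsize is a power of 2) */
    #define P2PHASE(x, align)   ((x) & ((align) - 1))

    int
    main(void)
    {
            uint64_t recordsize = 128 * 1024;   /* "recordsize" property */
            uint64_t woff = 4096;               /* starting write offset */
            uint64_t n = 300 * 1024;            /* bytes left to write */

            while (n > 0) {
                    /* each transaction covers at most one record */
                    uint64_t room = recordsize - P2PHASE(woff, recordsize);
                    uint64_t nbytes = (n < room) ? n : room;

                    printf("tx: offset=%llu, bytes=%llu\n",
                        (unsigned long long)woff, (unsigned long long)nbytes);

                    woff += nbytes;
                    n -= nbytes;
            }
            return (0);
    }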

Comments and suggestions are welcome :)

Regards,
Ricardo
