I am talking about having a write queue which points to ready-to-write full 
stripes.

A full stripe would be ready to write when:
* The last byte of the full stripe has been updated.
* The file has been closed for writing (an exception to the rule above).
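
A rough sketch of that readiness test, using invented structure and field names 
(this is not actual ZFS code):

    #include <stdint.h>

    /* Hypothetical illustration only -- not real ZFS structures. */
    typedef struct pending_record {
        uint64_t rec_size;      /* recordsize, e.g. 128k                */
        uint64_t bytes_dirty;   /* bytes of this record updated so far  */
        int      file_closed;   /* file has been closed for writing     */
    } pending_record_t;

    /* A queued record/stripe is "ready to write" when it is completely
     * dirty, or when the file was closed before it filled up.          */
    static int
    record_ready(const pending_record_t *pr)
    {
        return (pr->bytes_dirty == pr->rec_size || pr->file_closed);
    }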

I believe there is now a scheduler in ZFS to handle read and write conflicts.

For example, on a large multi-gigabyte NVRAM array, the only big considerations 
are how big the Fibre Channel pipe is and the limit on outstanding I/Os.

But with SATA disks off the motherboard, how much RAM cache each disk has is a 
consideration, as well as the speed of the SATA connection and the number of 
outstanding I/Os.

When it comes time to do the txg, some of the record blocks (most of the full 
128k ones) will already have been written out. If we have only written out full 
record blocks then there has been no performance loss.

Eventually a txg is going to happen and these full writes will need to happen, 
but if we can choose a less busy time for them, all the better.

e.g. on a raidz with 5 disks, if I have 4x128k worth of data to write, let's 
write it;
       on a mirror, if I have 128k worth to write, let's write it (recordsize 
128k). Or let the threshold be a tunable per zpool, as some arrays (RAID5) like 
to receive larger chunks of data.
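
For concreteness, a toy calculation of that early-write threshold (the function 
and parameter names are made up; a mirror is treated as one data disk):

    #include <stdint.h>

    /* Hypothetical early-write threshold: one full record per data disk.
     * 2-way mirror:  1 data disk  -> 128k
     * 5-disk raidz1: 4 data disks -> 4x128k = 512k                       */
    static uint64_t
    early_write_threshold(uint64_t recordsize, int ndisks, int nparity)
    {
        int data_disks = ndisks - nparity;
        return (recordsize * (uint64_t)data_disks);
    }

So early_write_threshold(128 * 1024, 5, 1) gives the 4x128k figure used above, 
and early_write_threshold(128 * 1024, 1, 0) gives 128k for the mirror case.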

Why wait for the txg if the disks are not being pressured for reads, rather 
than taking a pause every 30 seconds?
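
Something along those lines, in spirit (all names below are invented for 
illustration; this is not the real ZIO scheduler):

    /* Invented helpers, declared only so the sketch is self-contained.  */
    extern int  pending_read_count(void);           /* reads queued on the vdev */
    extern int  ready_queue_empty(void);            /* any full records queued? */
    extern void issue_next_full_record_write(void);

    /* Push out queued full-record writes opportunistically, but yield as
     * soon as reads show up, instead of waiting for the txg timer.       */
    void
    drain_ready_queue_if_idle(void)
    {
        while (!ready_queue_empty() && pending_read_count() == 0)
            issue_next_full_record_write();
    }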

Bob wrote (I may not have explained it well enough):
>It is not true that there is "no cost" though. Since ZFS uses COW,
>this approach requires that new blocks be allocated and written at a
>much higher rate. There is also an "opportunity cost" in that if a
>read comes in while these continuous writes are occurring, the read
>will be delayed.

At some stage a write needs to happen. **Full** writes have a very small COW 
cost compared with small writes. As I said above, I am talking about a write of 
4x128k on a 5-disk raidz before the write would happen early.
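
To make the cost difference concrete, here is the read-modify-write distinction 
in miniature (illustrative arithmetic only, with an invented function name):

    #include <stdint.h>

    /* Approximate bytes of disk I/O needed to COW one record, ignoring
     * metadata.  A full overwrite just allocates and writes the new
     * record; a partial overwrite must first read the old record (if it
     * is not already cached in the ARC) to fill in the untouched bytes. */
    static uint64_t
    cow_io_bytes(uint64_t recordsize, uint64_t bytes_written, int cached)
    {
        if (bytes_written >= recordsize)      /* full record: write only  */
            return (recordsize);
        return (cached ? recordsize           /* cached: write only       */
                       : recordsize * 2);     /* uncached: read + write   */
    }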

>There are many applications which continually write/overwrite file
>content, or which update a file at a slow pace. For example, log
>files are typically updated at a slow rate. Updating a block requires
>reading it first (if it is not already cached in the ARC), which can
>be quite expensive. By waiting a bit longer, there is a much better
>chance that the whole block is overwritten, so zfs can discard the
>existing block on disk without bothering to re-read it.

Apps which update at a slow pace will not trigger the above early write until 
they have written at least a recordsize worth of data; an application which 
writes less than 128k (recordsize) every 30 secs (roughly 4 KB/s or slower) 
will never trigger the early write on a mirrored disk, let alone a raidz setup.

What this will catch is the big writer: files greater than 128k (recordsize) on 
mirrored disks, and files larger than 4x128k on 5-disk raidz sets.

So commands like dd if=x of=y bs=512k will not cause issues (pauses/delays) 
when the txg times out.

PS: I have already set zfs:zfs_write_limit_override, and I would not recommend 
anyone set it very low to try to get the above effect.
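
For reference, that tunable is normally set in /etc/system (the value is in 
bytes; the number below is only an example, not a recommendation):

    * /etc/system -- example only: cap the per-txg write limit at 256 MB
    set zfs:zfs_write_limit_override = 0x10000000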

It's just an idea on how to prevent the delay effect; it may not be practical.