Re: [zfs-discuss] Very poor small-block random write performance

2012-07-21 Thread Jim Klimov

2012-07-22 1:24, Bob Friesenhahn wrote:

On Sat, 21 Jul 2012, Jim Klimov wrote:

During this quick test I did not manage to craft a test which
would inflate a file in the middle without touching its other
blocks (other than using a text editor which saves the whole
file - so that is irrelevant), in order to see if ZFS can
"insert" smaller blocks in the middle of an existing file,
and whether it would reallocate other blocks to fit the set
recordsizes.


The POSIX filesystem interface does not support such a thing
('insert').  Presumably the underlying zfs pool could support such a
thing if there was a layer on top to request it. The closest equivalent
in a POSIX filesystem would be if a previously-null block in a sparse
file is updated to hold content.



Well then, this settles the matter: you were right
five years ago, and that answer still holds up ;)

//Jim


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-21 Thread Bob Friesenhahn

On Sat, 21 Jul 2012, Jim Klimov wrote:

During this quick test I did not manage to craft a test which
would inflate a file in the middle without touching its other
blocks (other than using a text editor which saves the whole
file - so that is irrelevant), in order to see if ZFS can
"insert" smaller blocks in the middle of an existing file,
and whether it would reallocate other blocks to fit the set
recordsizes.


The POSIX filesystem interface does not support such a thing 
('insert').  Presumably the underlying zfs pool could support such a 
thing if there was a layer on top to request it. The closest 
equivalent in a POSIX filesystem would be if a previously-null block 
in a sparse file is updated to hold content.
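
As an illustration of that closest equivalent, a minimal sketch (the file
name /rpool/test/sparse and the offsets are made up for this example):
write one small block far into a new file so everything before it stays a
hole, then later fill part of the hole in place with a non-truncating write.

# /usr/gnu/bin/dd if=/dev/random of=/rpool/test/sparse bs=1k count=1 seek=2048
# /usr/gnu/bin/dd if=/dev/random of=/rpool/test/sparse bs=1k count=1 \
  seek=512 conv=notrunc

The first command leaves everything below offset 2MB unallocated; the second
allocates just one record in the middle of the hole, without touching the
neighbouring (still unallocated) ones.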


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-21 Thread Jim Klimov

2012-07-20 5:11, Bob Friesenhahn wrote:

On Fri, 20 Jul 2012, Jim Klimov wrote:


Zfs data block sizes are fixed size!  Only tail blocks are shorter.


This is the part I am not sure is either implied by the docs
nor confirmed by my practice. But maybe I've missed something...


This is something that I am quite certain of. :-)

When doing a random write inside a file, the unit of COW is the zfs
filesystem blocksize.


Well, apparently I was wrong, and Bob was right :)

I ran a simple test like this:

# zfs create -o compression=off -o dedup=off -o copies=1 rpool/test

This should rule out complex storage options for user-data
bytes.
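
(To double-check that the dataset really got these plain settings, and the
default 128K recordsize, something like the following can be used:)

# zfs get compression,dedup,copies,recordsize rpool/test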


# cd /rpool/test/ && touch file && ls -lai
total 9
 4 drwxr-xr-x   2 root root   3 Jul 21 22:38 .
 4 drwxr-xr-x   5 root root   5 Jul 21 22:37 ..
 8 -rw-r--r--   1 root root   0 Jul 21 22:38 file

So the file's inode number is 8 (above). This is used for zdb 
inspections (below).



# /usr/gnu/bin/dd if=/dev/random bs=1k count=1 >> file; sync; \
  zdb - rpool/test 8 | grep ' L. '

The last line was repeated a few times. Apparently (as Bob
wrote me off-list), a change in the tail block causes it to be
read from disk completely, the new bytes appended, and the
result written out - up to the dataset recordsize. Thus all
intermediate blocks of a file should consume full recordsizes,
even if the file was appended in small portions spread over
several TXGs.
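
(For instance, the repetition can be scripted roughly as below; the zdb
verbosity level -dddddd is my guess at what produces the ' L. ' lines, and
object 8 is the inode seen in the ls -lai output above:)

# for i in 1 2 3 4 5; do \
    /usr/gnu/bin/dd if=/dev/random bs=1k count=1 >> file; sync; \
    zdb -dddddd rpool/test 8 | grep ' L. '; done

Each pass should show the single L0 block's size growing toward 20000
(hex, i.e. the 128K recordsize) rather than extra small blocks piling up.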

Replacing kilobytes at locations spanning one or two blocks
also caused reallocations and rewrites of zfs recordsized
pieces:

# /usr/gnu/bin/dd if=/dev/random of=/rpool/test/file bs=1k \
  seek=12 count=10 conv=noerror,notrunc; sync

# /usr/gnu/bin/dd if=/dev/random of=/rpool/test/file bs=1k \
  seek=125 count=10 conv=noerror,notrunc; sync

# zdb - rpool/test 8 | grep ' L. '
Dataset rpool/test [ZPL], ID 8412, cr_txg 2110110, 289K, 8 objects, 
rootbp DVA[0]=<0:a69b82a00:200> DVA[1]=<0:264111e00:200> [L0 DMU objset] 
fletcher4 lzjb LE contiguous unique double size=800L/200P 
birth=2110309L/2110309P fill=8 
cksum=1373a7e215:68dacd0d409:12c745c671d52:25e0616461e20c
   0 L1  0:a69b80200:400 0:263f1dc00:400 4000L/400P F=2 B=2110309/2110309

   0  L0 0:a61bcb200:20000 20000L/20000P F=1 B=2110309/2110309
   20000  L0 0:a61b87a00:20000 20000L/20000P F=1 B=2110304/2110304


During this quick test I did not manage to craft a test which
would inflate a file in the middle without touching its other
blocks (other than using a text editor which saves the whole
file - so that is irrelevant), in order to see if ZFS can
"insert" smaller blocks in the middle of an existing file,
and whether it would reallocate other blocks to fit the set
recordsizes.

For generic filesystem uses (append, 1:1 replace), at least,
Bob's assessment is right - zfs stores recordsized blocks
and one possibly smaller tail block, not a series of randomly
sized blocks as I implied.

I might imagine situations like heavily congested systems
where zfs might cut corners to get dirty bytes out to disk
faster - and not read-merge-write tail blocks, but even if
this is implemented at all, it should be a rare condition.

//Jim Klimov



Re: [zfs-discuss] Very poor small-block random write performance

2012-07-20 Thread Bill Sommerfeld

On 07/19/12 18:24, Traffanstead, Mike wrote:

iozone doesn't vary the blocksize during the test; it's a very
artificial test, but it's useful for gauging performance under
different scenarios.

So for this test all of the writes would have been 64k blocks, 128k,
etc. for that particular step.

Just as another point of reference I reran the test with a Crucial M4
SSD and the results for 16G/64k were 35 MB/s (a 5x improvement).

I'll rerun that part of the test with zpool iostat and see what it says.


For random writes to work without forcing a lot of read i/o and 
read-modify-write sequences, set the recordsize on the filesystem used 
for the test to match the iozone recordsize.  For instance:


zfs set recordsize=64k $fsname

and ensure that the files used for the test are re-created after you 
make this setting change ("recordsize" is sticky at file creation time).
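
A rough sketch of the whole sequence (the file name and iozone flags here
are only an example, not from the original run, and the path assumes the
dataset is mounted at /$fsname):

zfs set recordsize=64k $fsname
rm /$fsname/iozone.tmp
iozone -i 0 -i 2 -r 64k -s 16g -f /$fsname/iozone.tmp

The rm/re-create step matters because the existing test file keeps the
record size it was created with; only files (re)created after the change
use 64K records.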







Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread Traffanstead, Mike
iozone doesn't vary the blocksize during the test; it's a very
artificial test, but it's useful for gauging performance under
different scenarios.

So for this test all of the writes would have been 64k blocks, 128k,
etc. for that particular step.

Just as another point of reference I reran the test with a Crucial M4
SSD and the results for 16G/64k were 35 MB/s (a 5x improvement).

I'll rerun that part of the test with zpool iostat and see what it says.

Mike

On Thu, Jul 19, 2012 at 7:27 PM, Jim Klimov  wrote:
>> This is normal.  The problem is that with zfs 128k block sizes, zfs
>> needs to re-read the original 128k block so that it can compose and
>> write the new 128k block.  With sufficient RAM, this is normally avoided
>> because the original block is already cached in the ARC.
>>
>> If you were to reduce the zfs blocksize to 64k then the performance dive
>> at 64k would go away but there would still be write performance loss at
>> sizes other than a multiple of 64k.
>
>
> I am not sure if I misunderstood the question or Bob's answer,
> but I have a gut feeling it is not fully correct: ZFS block
> sizes for files (filesystem datasets) are, at least by default,
> dynamically-sized depending on the contiguous write size as
> queued by the time a ZFS transaction is closed and flushed to
> disk. In case of RAIDZ layouts, this logical block is further
> striped over several sectors on several disks in one of the
> top-level vdevs, starting with parity sectors for each "row".
>
> So, if the test logically overwrites full blocks of test data
> files, reads for recombination are not needed (but that can
> be checked for with "iostat 1" or "zpool iostat" - to see how
> many reads do happen during write-tests?) Note that some reads
> will show up anyway, i.e. to update ZFS metadata (the block
> pointer tree).
>
> However, if the test file was written in 128K blocks and then
> is rewritten with 64K blocks, then Bob's answer is probably
> valid - the block would have to be re-read once for the first
> rewrite of its half; it might be taken from cache for the
> second half's rewrite (if that comes soon enough), and may be
> spooled to disk as a couple of 64K blocks or one 128K block
> (if both changes come soon after each other - within one TXG).
>
> HTH,
> //Jim Klimov
>
>


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread Traffanstead, Mike
vfs.zfs.txg.synctime_ms: 1000
vfs.zfs.txg.timeout: 5
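
(For reference, those values can be read back - and the timeout changed at
runtime while experimenting - with sysctl; the "10" below is only an example:)

sysctl vfs.zfs.txg.synctime_ms vfs.zfs.txg.timeout
sysctl vfs.zfs.txg.timeout=10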

On Thu, Jul 19, 2012 at 8:47 PM, John Martin  wrote:
> On 07/19/12 19:27, Jim Klimov wrote:
>
>> However, if the test file was written in 128K blocks and then
>> is rewritten with 64K blocks, then Bob's answer is probably
>> valid - the block would have to be re-read once for the first
>> rewrite of its half; it might be taken from cache for the
>> second half's rewrite (if that comes soon enough), and may be
>> spooled to disk as a couple of 64K blocks or one 128K block
>> (if both changes come soon after each other - within one TXG).
>
>
> What are the values for zfs_txg_synctime_ms and zfs_txg_timeout
> on this system (FreeBSD, IIRC)?
>
>


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread John Martin

On 07/19/12 19:27, Jim Klimov wrote:


However, if the test file was written in 128K blocks and then
is rewritten with 64K blocks, then Bob's answer is probably
valid - the block would have to be re-read once for the first
rewrite of its half; it might be taken from cache for the
second half's rewrite (if that comes soon enough), and may be
spooled to disk as a couple of 64K blocks or one 128K block
(if both changes come soon after each other - within one TXG).


What are the values for zfs_txg_synctime_ms and zfs_txg_timeout
on this system (FreeBSD, IIRC)?



Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread Bob Friesenhahn

On Fri, 20 Jul 2012, Jim Klimov wrote:


I am not sure if I misunderstood the question or Bob's answer,
but I have a gut feeling it is not fully correct: ZFS block
sizes for files (filesystem datasets) are, at least by default,
dynamically-sized depending on the contiguous write size as
queued by the time a ZFS transaction is closed and flushed to
disk. In case of RAIDZ layouts, this logical block is further


Zfs data block sizes are fixed size!  Only tail blocks are shorter.

The underlying representation (how the data block gets stored) depends 
on if compression, raidz, deduplication, etc., are used.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread Jim Klimov

This is normal.  The problem is that with zfs 128k block sizes, zfs
needs to re-read the original 128k block so that it can compose and
write the new 128k block.  With sufficient RAM, this is normally avoided
because the original block is already cached in the ARC.

If you were to reduce the zfs blocksize to 64k then the performance dive
at 64k would go away but there would still be write performance loss at
sizes other than a multiple of 64k.


I am not sure if I misunderstood the question or Bob's answer,
but I have a gut feeling it is not fully correct: ZFS block
sizes for files (filesystem datasets) are, at least by default,
dynamically-sized depending on the contiguous write size as
queued by the time a ZFS transaction is closed and flushed to
disk. In case of RAIDZ layouts, this logical block is further
striped over several sectors on several disks in one of the
top-level vdevs, starting with parity sectors for each "row".

So, if the test logically overwrites full blocks of test data
files, reads for recombination are not needed (but that can
be checked for with "iostat 1" or "zpool iostat" - to see how
many reads do happen during write-tests?) Note that some reads
will show up anyway, i.e. to update ZFS metadata (the block
pointer tree).
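
For example, in a second terminal while the write phase of the benchmark
runs (the pool name "tank" here is just a placeholder):

# zpool iostat -v tank 1

Read operations staying near zero would suggest whole-record overwrites;
sustained reads during a write-only workload would point at read-modify-write
of partial records (or at metadata/ARC misses).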

However, if the test file was written in 128K blocks and then
is rewritten with 64K blocks, then Bob's answer is probably
valid - the block would have to be re-read once for the first
rewrite of its half; it might be taken from cache for the
second half's rewrite (if that comes soon enough), and may be
spooled to disk as a couple of 64K blocks or one 128K block
(if both changes come soon after each other - within one TXG).

HTH,
//Jim Klimov



Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread Bob Friesenhahn

On Wed, 18 Jul 2012, Michael Traffanstead wrote:


I have an 8 drive ZFS array (RAIDZ2 - 1 Spare) using 5900rpm 2TB SATA drives 
with an hpt27xx controller under FreeBSD 10
(but I've seen the same issue with FreeBSD 9).

The system has 8 GB of RAM and I'm letting FreeBSD auto-size the ARC.

Running iozone (from ports), everything is fine for file sizes up to 8GB, but 
when it runs with a 16GB file the random write
performance plummets using 64K record sizes.


This is normal.  The problem is that with zfs 128k block sizes, zfs 
needs to re-read the original 128k block so that it can compose and 
write the new 128k block.  With sufficient RAM, this is normally 
avoided because the original block is already cached in the ARC.


If you were to reduce the zfs blocksize to 64k then the performance 
dive at 64k would go away but there would still be write performance 
loss at sizes other than a multiple of 64k.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


[zfs-discuss] Very poor small-block random write performance

2012-07-18 Thread Michael Traffanstead
I have an 8 drive ZFS array (RAIDZ2 - 1 Spare) using 5900rpm 2TB SATA drives 
with an hpt27xx controller under FreeBSD 10 (but I've seen the same issue with 
FreeBSD 9). 

The system has 8 GB of RAM and I'm letting FreeBSD auto-size the ARC.
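
(For what it's worth, the auto-sized ARC limits and its current size can be
checked on FreeBSD with sysctls along these lines:)

sysctl vfs.zfs.arc_max vfs.zfs.arc_min
sysctl kstat.zfs.misc.arcstats.size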

Running iozone (from ports), everything is fine for file sizes up to 8GB, but 
when it runs with a 16GB file the random write performance plummets using 64K 
record sizes.

8G - 64K -> 52 MB/s
8G - 128K -> 713 MB/s
8G - 256K -> 442 MB/s

16G - 64K -> 7 MB/s
16G - 128K -> 380 MB/s
16G - 256K -> 392 MB/s

Also, sequential small block performance doesn't show such a dramatic slowdown 
either.

16G - 64K -> 108 MB/s (sequential)

There's nothing else using the zpool at the moment, the system is on a separate 
ssd.

I was expecting performance to drop off at 16GB because that's well above the 
available ARC, but seeing that dramatic a drop-off - and then the sharp 
improvement at 128K and 256K - is surprising.

Are there any configuration settings I should be looking at?

Mike 
