Re: [PERFORM] Revisiting disk layout on ZFS systems

2014-05-01 Thread Josh Berkus
On 04/28/2014 08:47 AM, Karl Denninger wrote:
 The odd thing is that I am getting better performance with a 128k record
 size on this application than I get with an 8k one!  Not only is the
 system subjectively faster to respond and objectively able to sustain a
 higher TPS load, but the I/O busy percentage as measured during
 operation is MARKEDLY lower (by nearly an order of magnitude!).

Thanks for posting your experience!  I'd love it even more if you could
post some numbers to go with it.

Questions:

1) is your database (or the active portion thereof) smaller than RAM?

2) is this a DW workload, where most writes are large writes?
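
A rough way to answer question 1 (the database name and OS commands below
are generic placeholders, not taken from this thread):

  # size of the active database
  psql -d yourdb -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
  # total RAM, depending on platform
  sysctl hw.physmem             # FreeBSD
  grep MemTotal /proc/meminfo   # Linux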

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [PERFORM] Revisiting disk layout on ZFS systems

2014-04-29 Thread Albe Laurenz
Karl Denninger wrote:
 I've been doing a bit of benchmarking and real-world performance
 testing, and have found some curious results.

[...]

 The odd thing is that I am getting better performance with a 128k record
 size on this application than I get with an 8k one!

[...]

 What I am curious about, however, is the xlog -- that appears to suffer
 pretty badly from 128k record size, although it compresses even
 more, materially so: 1.94x (!)

 The files in the xlog directory are large (16MB each), and thus at first
 blush one would expect that having a larger record size for that storage
 area would help.  It appears that instead it hurts.

As has been explained, the access patterns for WAL are quite different.

For your experiment, I'd keep them on different file systems so that
you can tune them independently.
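
A minimal sketch of that separation, assuming a pool called tank and
made-up dataset names:

  zfs create -o recordsize=128k -o compression=lz4 tank/pgdata
  zfs create -o recordsize=8k   -o compression=lz4 tank/pg_xlog
  # put the data directory on the first dataset and mount (or symlink)
  # pg_xlog onto the second, so each can be re-tuned without the other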

Yours,
Laurenz Albe



Re: [PERFORM] Revisiting disk layout on ZFS systems

2014-04-29 Thread Karl Denninger


On 4/29/2014 3:13 AM, Albe Laurenz wrote:

Karl Denninger wrote:

I've been doing a bit of benchmarking and real-world performance
testing, and have found some curious results.

[...]


The odd thing is that I am getting better performance with a 128k record
size on this application than I get with an 8k one!

[...]


What I am curious about, however, is the xlog -- that appears to suffer
pretty badly from 128k record size, although it compresses even
more, materially so: 1.94x (!)

The files in the xlog directory are large (16MB each), and thus at first
blush one would expect that having a larger record size for that storage
area would help.  It appears that instead it hurts.

As has been explained, the access patterns for WAL are quite different.

For your experiment, I'd keep them on different file systems so that
you can tune them independently.

They're on physically different packs (pools and groups of spindles), as 
that has been best practice for performance reasons pretty much always 
-- I just thought it was interesting, and worth noting, that the usual 
recommendation to run an 8k record size for the data store itself may no 
longer be valid.


It certainly isn't with my workload.

--
-- Karl
k...@denninger.net






[PERFORM] Revisiting disk layout on ZFS systems

2014-04-28 Thread Karl Denninger
I've been doing a bit of benchmarking and real-world performance 
testing, and have found some curious results.


The load in question is a fairly-busy machine hosting a web service that 
uses Postgresql as its back end.


Conventional Wisdom is that you want to run an 8k record size to match 
Postgresql's inherent write size for the database.


However, operational experience says this may no longer be the case now 
that modern ZFS systems support LZ4 compression, because modern CPUs can 
compress data faster than the raw I/O path can absorb it. This in turn 
means that the configured recordsize no longer matches what is actually 
written to disk, and Postgresql's on-disk file format is rather 
compressible -- indeed, on my dataset the achieved ratio appears to be 
roughly 1.24x, which is nothing to sneeze at.
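
A hedged sketch of the settings involved, with a placeholder dataset name
since the actual pool layout isn't shown here:

  zfs set recordsize=128k tank/pgdata
  zfs set compression=lz4 tank/pgdata
  # recordsize only affects newly written blocks, so copy the data files
  # back in (or restore from a dump) after changing it
  zfs get recordsize,compression,compressratio tank/pgdata
  # compressratio reports the ratio ZFS actually achieved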


The odd thing is that I am getting better performance with a 128k record 
size on this application than I get with an 8k one!  Not only is the 
system subjectively faster to respond and objectively able to sustain a 
higher TPS load, but the I/O busy percentage as measured during 
operation is MARKEDLY lower (by nearly an order of magnitude!).


This is not expected behavior!

What I am curious about, however, is the xlog -- that appears to suffer 
pretty badly from 128k record size, although it compresses even 
more, materially so: 1.94x (!)


The files in the xlog directory are large (16MB each), and thus at first 
blush one would expect that having a larger record size for that storage 
area would help.  It appears that instead it hurts.


Ideas?

--
-- Karl
k...@denninger.net






Re: [PERFORM] Revisiting disk layout on ZFS systems

2014-04-28 Thread Karl Denninger


On 4/28/2014 1:04 PM, Heikki Linnakangas wrote:

On 04/28/2014 06:47 PM, Karl Denninger wrote:

What I am curious about, however, is the xlog -- that appears to suffer
pretty badly from 128k record size, although it compresses even
more, materially so: 1.94x (!)

The files in the xlog directory are large (16MB each), and thus at first
blush one would expect that having a larger record size for that storage
area would help.  It appears that instead it hurts.


The WAL is fsync'd frequently. My guess is that that causes a lot of 
extra work to repeatedly recompress the same data, or something like 
that.


- Heikki

It shouldn't, as ZFS re-writes on change, and what's showing up is not 
high I/O *count* but rather percentage-busy, which implies lots of head 
movement (that is, lots of sub-allocation-unit writes).
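
A possible way to watch that count-versus-busy distinction while the
workload runs (pool name and sampling interval are arbitrary):

  zpool iostat -v tank 5   # per-vdev read/write operations and bandwidth
  iostat -x 5              # per-disk busy percentage (%b / %util)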


Isn't WAL essentially sequential writes during normal operation?

--
-- Karl
k...@denninger.net






Re: [PERFORM] Revisiting disk layout on ZFS systems

2014-04-28 Thread Heikki Linnakangas

On 04/28/2014 06:47 PM, Karl Denninger wrote:

What I am curious about, however, is the xlog -- that appears to suffer
pretty badly from 128k record size, although it compresses even
more, materially so: 1.94x (!)

The files in the xlog directory are large (16MB each), and thus at first
blush one would expect that having a larger record size for that storage
area would help.  It appears that instead it hurts.


The WAL is fsync'd frequently. My guess is that that causes a lot of 
extra work to repeatedly recompress the same data, or something like that.


- Heikki




Re: [PERFORM] Revisiting disk layout on ZFS systems

2014-04-28 Thread Heikki Linnakangas

On 04/28/2014 09:07 PM, Karl Denninger wrote:

The WAL is fsync'd frequently. My guess is that that causes a lot of
extra work to repeatedly recompress the same data, or something like
that.


It shouldn't, as ZFS re-writes on change, and what's showing up is not
high I/O *count* but rather percentage-busy, which implies lots of head
movement (that is, lots of sub-allocation-unit writes).


That sounds consistent with frequent fsyncs.


Isn't WAL essentially sequential writes during normal operation?


Yes, it's totally sequential. But it's fsync'd at every commit, which 
means a lot of small writes.
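
For orientation only (these are standard PostgreSQL settings, not a
recommendation from this thread), the knobs that govern that per-commit
flush are:

  psql -c "SHOW wal_sync_method;"      # how the WAL flush is issued (fsync, fdatasync, ...)
  psql -c "SHOW synchronous_commit;"   # 'off' batches flushes but risks losing the newest commits on a crash
  psql -c "SHOW commit_delay;"         # optional group-commit window, in microseconds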


- Heikki




Re: [PERFORM] Revisiting disk layout on ZFS systems

2014-04-28 Thread Jeff Janes
On Mon, Apr 28, 2014 at 11:07 AM, Karl Denninger k...@denninger.net wrote:


 On 4/28/2014 1:04 PM, Heikki Linnakangas wrote:

 On 04/28/2014 06:47 PM, Karl Denninger wrote:

 What I am curious about, however, is the xlog -- that appears to suffer
 pretty badly from 128k record size, although it compresses even
 more, materially so: 1.94x (!)

 The files in the xlog directory are large (16MB each), and thus at first
 blush one would expect that having a larger record size for that storage
 area would help.  It appears that instead it hurts.


 The WAL is fsync'd frequently. My guess is that that causes a lot of
 extra work to repeatedly recompress the same data, or something like that.

 - Heikki

 It shouldn't, as ZFS re-writes on change, and what's showing up is not
 high I/O *count* but rather percentage-busy, which implies lots of head
 movement (that is, lots of sub-allocation-unit writes).

 Isn't WAL essentially sequential writes during normal operation?


Only if you have some sort of non-volatile intermediary, or are willing to
risk your data integrity.  Otherwise, the fsync nature trumps the
sequential nature.
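
On ZFS, one common form of such an intermediary is a dedicated intent-log
(SLOG) device, which absorbs the synchronous flushes; a sketch with a
made-up device name:

  zpool add tank log gpt/slog0   # add a fast, power-safe device as the SLOG
  zpool status tank              # it should show up under a "logs" section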

Cheers,

Jeff


Re: [PERFORM] Revisiting disk layout on ZFS systems

2014-04-28 Thread Karl Denninger


On 4/28/2014 1:22 PM, Heikki Linnakangas wrote:

On 04/28/2014 09:07 PM, Karl Denninger wrote:

The WAL is fsync'd frequently. My guess is that that causes a lot of
extra work to repeatedly recompress the same data, or something like
that.


It shouldn't, as ZFS re-writes on change, and what's showing up is not
high I/O *count* but rather percentage-busy, which implies lots of head
movement (that is, lots of sub-allocation-unit writes).


That sounds consistent with frequent fsyncs.


Isn't WAL essentially sequential writes during normal operation?


Yes, it's totally sequential. But it's fsync'd at every commit, which 
means a lot of small writes.


- Heikki


Makes sense; I'll muse on whether there's a way to optimize this 
further... I'm not running into performance problems at present, but I'd 
rather be ahead of it.


--
-- Karl
k...@denninger.net






Re: [PERFORM] Revisiting disk layout on ZFS systems

2014-04-28 Thread Karl Denninger


On 4/28/2014 1:26 PM, Jeff Janes wrote:
On Mon, Apr 28, 2014 at 11:07 AM, Karl Denninger k...@denninger.net wrote:




Isn't WAL essentially sequential writes during normal operation?


Only if you have some sort of non-volatile intermediary, or are 
willing to risk your data integrity.  Otherwise, the fsync nature 
trumps the sequential nature.



That would be a no on the data integrity :-)

--
-- Karl
k...@denninger.net


