Re: [HACKERS] WAL Re-Writes

2016-02-08 Thread Amit Kapila
On Mon, Feb 8, 2016 at 8:16 PM, Andres Freund  wrote:
>
> On 2016-02-08 10:38:55 +0530, Amit Kapila wrote:
> > I think deciding it automatically, without requiring the user to
> > configure it, certainly has merits, but what about some cases where
> > the user can get benefits by configuring it themselves, like the
> > cases where we use the PG_O_DIRECT flag for WAL (with O_DIRECT, it
> > will bypass the OS buffers and won't cause misaligned writes even for
> > smaller chunk sizes like 512 bytes or so)?  Some googling [1] reveals
> > that other databases also provide the user with an option to
> > configure the WAL block/chunk size (as BLOCKSIZE), although they seem
> > to decide the chunk size based on the disk-sector size.
>
> FWIW, you usually can't do such small writes with O_DIRECT. Usually
> they have to be 4KB (page-size) sized, 4KB-aligned writes. And on
> filesystems that do support doing such writes, they essentially fall
> back to doing buffered IO.
>

I have not observed this during my tests (the observation is based on
the fact that whenever the OS buffer cache is used, writing in smaller
chunks (less than 4K) leads to reads and in turn decreases
performance).  I don't see such an implication in the documentation
either.

> > An additional thought, which is not necessarily related to this
> > patch: if the user chooses, and/or we decide, to write in 512-byte
> > chunks, which is usually the disk-sector size, then can't we think
> > of avoiding the CRC for each record in such cases, because each WAL
> > write will in itself be atomic?  While reading, if we process in
> > wal-chunk-sized units, then I think it should be possible to detect
> > end-of-WAL based on the data read.
>
> O_DIRECT doesn't give any useful guarantees to do something like the
> above. It doesn't have any ordering or durability implications. You
> still need to do fdatasyncs and such.
>

It doesn't need to, if we use the O_SYNC flag, which we always set
whenever we use O_DIRECT mode during WAL writes.
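
For illustration, a minimal sketch of the kind of write this relies on
(assuming Linux; the 4KB chunk size, the file name "walfile", and the
error handling are simplified placeholders, not the actual xlog.c
code):

    /* Aligned, synchronous write of one chunk with O_DIRECT | O_SYNC. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK 4096              /* assumed write unit */

    int
    main(void)
    {
        void   *buf;
        int     fd;

        /* O_DIRECT requires the buffer itself to be suitably aligned */
        if (posix_memalign(&buf, CHUNK, CHUNK) != 0)
            return 1;
        memset(buf, 0, CHUNK);

        /* O_SYNC makes each write durable without a separate fdatasync */
        fd = open("walfile", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0600);
        if (fd < 0)
            return 1;

        /* offset and length must both be multiples of the alignment */
        if (pwrite(fd, buf, CHUNK, 0) != CHUNK)
            return 1;

        close(fd);
        free(buf);
        return 0;
    }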


> Besides, with the new CRC implementation, that doesn't really seem
> like such a large win anyway.
>

I haven't yet checked how big a win we can get if we avoid CRCs while
still providing the same reliability, but it would certainly save CPU
instructions both during writes and during replay, and performance
should be better than it is currently.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WAL Re-Writes

2016-02-08 Thread Amit Kapila
On Mon, Feb 8, 2016 at 8:11 PM, Robert Haas  wrote:
>
> On Mon, Feb 8, 2016 at 12:08 AM, Amit Kapila  wrote:
> > I think deciding it automatically, without requiring the user to
> > configure it, certainly has merits, but what about some cases where
> > the user can get benefits by configuring it themselves, like the
> > cases where we use the PG_O_DIRECT flag for WAL (with O_DIRECT, it
> > will bypass the OS buffers and won't cause misaligned writes even for
> > smaller chunk sizes like 512 bytes or so)?  Some googling [1] reveals
> > that other databases also provide the user with an option to
> > configure the WAL block/chunk size (as BLOCKSIZE), although they seem
> > to decide the chunk size based on the disk-sector size.
>
> Well, if you can prove that we need that flexibility, then we should
> have a GUC.  Where's the benchmarking data to support that conclusion?
>

It is not posted yet, as some more work is needed to complete the
benchmark results when PG_O_DIRECT is used (mainly with open_sync and
open_datasync).  I will do so.  But I think the main thing to take care
of is that, since smaller-chunk-sized writes are useful only in some
cases, we need to ensure that users don't get baffled by them.  There
are multiple ways to provide this:

a) at startup, ensure that if the user has set a smaller chunk size
(other than 4KB, which will be the default, decided at configure time
in the way you described upthread) and PG_O_DIRECT can be used as we
decide in get_sync_bit(), then allow it; otherwise, either return an
error or just fall back to the default of 4KB.

b) mention in the docs that it is better not to tinker with the
wal_chunk_size GUC unless you have the other relevant settings (like
wal_sync_method = open_sync or open_datasync) and the default
wal_level.

c) yet another option: stick with 4KB-sized chunks for now, since the
benefit of smaller chunks applies only to a subset of the cases we can
support.

The reason I think it is beneficial to provide an option to write in
smaller chunks is that it could reduce the amount of re-writes by a
higher percentage where it can be used.  For example, at 4KB there is a
~35% reduction; similarly, smaller chunks could give us savings of up
to 50% or 70%, depending on the chunk size.


>
> > An additional thought, which is not necessarily related to this
> > patch: if the user chooses, and/or we decide, to write in 512-byte
> > chunks, which is usually the disk-sector size, then can't we think
> > of avoiding the CRC for each record in such cases, because each WAL
> > write will in itself be atomic?  While reading, if we process in
> > wal-chunk-sized units, then I think it should be possible to detect
> > end-of-WAL based on the data read.
>
> Gosh, taking CRCs off of WAL records sounds like a terrible idea.  I'm
> not sure why you think that writing in sector-sized chunks would make
> that any more safe, because to me it seems like it wouldn't.  But even
> if it does, it's hard to believe that we don't derive some reliability
> from CRCs that we would lose without them.
>

I think the point here is not about more safety; rather, it is about
whether writing in disk-sector sizes gives reliability equal to CRCs,
because if it does, then skipping the CRC calculation for each record,
both while writing and during replay, can save CPU and should in turn
lead to better performance.  The reason I thought it could give equal
reliability is that disk-sector writes are atomic, so that should buy
us the reliability.  I admit that much more analysis/research is
required before doing that, and we can do it later if it proves
valuable in terms of performance and reliability.  I mentioned it here
to say that writing in smaller chunks has other potential benefits.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WAL Re-Writes

2016-02-08 Thread Robert Haas
On Mon, Feb 8, 2016 at 12:08 AM, Amit Kapila  wrote:
> I think deciding it automatically, without requiring the user to
> configure it, certainly has merits, but what about some cases where
> the user can get benefits by configuring it themselves, like the
> cases where we use the PG_O_DIRECT flag for WAL (with O_DIRECT, it
> will bypass the OS buffers and won't cause misaligned writes even for
> smaller chunk sizes like 512 bytes or so)?  Some googling [1] reveals
> that other databases also provide the user with an option to
> configure the WAL block/chunk size (as BLOCKSIZE), although they seem
> to decide the chunk size based on the disk-sector size.

Well, if you can prove that we need that flexibility, then we should
have a GUC.  Where's the benchmarking data to support that conclusion?

> An additional thought, which is not necessarily related to this
> patch: if the user chooses, and/or we decide, to write in 512-byte
> chunks, which is usually the disk-sector size, then can't we think
> of avoiding the CRC for each record in such cases, because each WAL
> write will in itself be atomic?  While reading, if we process in
> wal-chunk-sized units, then I think it should be possible to detect
> end-of-WAL based on the data read.

Gosh, taking CRCs off of WAL records sounds like a terrible idea.  I'm
not sure why you think that writing in sector-sized chunks would make
that any more safe, because to me it seems like it wouldn't.  But even
if it does, it's hard to believe that we don't derive some reliability
from CRCs that we would lose without them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL Re-Writes

2016-02-08 Thread Andres Freund
On 2016-02-08 10:38:55 +0530, Amit Kapila wrote:
> I think deciding it automatically, without requiring the user to
> configure it, certainly has merits, but what about some cases where
> the user can get benefits by configuring it themselves, like the
> cases where we use the PG_O_DIRECT flag for WAL (with O_DIRECT, it
> will bypass the OS buffers and won't cause misaligned writes even for
> smaller chunk sizes like 512 bytes or so)?  Some googling [1] reveals
> that other databases also provide the user with an option to
> configure the WAL block/chunk size (as BLOCKSIZE), although they seem
> to decide the chunk size based on the disk-sector size.

FWIW, you usually can't do such small writes with O_DIRECT. Usually
they have to be 4KB (page-size) sized, 4KB-aligned writes. And on
filesystems that do support doing such writes, they essentially fall
back to doing buffered IO.

> An additional thought, which is not necessarily related to this
> patch: if the user chooses, and/or we decide, to write in 512-byte
> chunks, which is usually the disk-sector size, then can't we think
> of avoiding the CRC for each record in such cases, because each WAL
> write will in itself be atomic?  While reading, if we process in
> wal-chunk-sized units, then I think it should be possible to detect
> end-of-WAL based on the data read.

O_DIRECT doesn't give any useful guarantees to do something like the
above. It doesn't have any ordering or durability implications. You
still need to do fdatasyncs and such.

Besides, with the new CRC implementation, that doesn't really seem like
such a large win anyway.

Greetings,

Andres Freund




Re: [HACKERS] WAL Re-Writes

2016-02-07 Thread Amit Kapila
On Wed, Feb 3, 2016 at 7:12 PM, Robert Haas  wrote:
>
> On Wed, Feb 3, 2016 at 7:28 AM, Amit Kapila  wrote:
> > On further testing, it has been observed that misaligned writes can
> > cause reads even when the blocks related to the file are not in
> > memory, so I think what Jan is describing is right.  The only case
> > where there is absolutely zero chance of reads is when we write on
> > the OS-page boundary, which is generally 4K.  However, I still think
> > it is okay to provide an option for WAL writing in smaller chunks
> > (512 bytes, 1024 bytes, etc.) for the cases where these are
> > beneficial, like when wal_level is greater than or equal to archive,
> > and to keep the default as the OS page size when that is smaller
> > than 8K.
>
> Hmm, a little research seems to suggest that 4kB pages are standard on
> almost every system we might care about: x86_64, x86, Power, Itanium,
> ARMv7.  Sparc uses 8kB, though, and a search through the Linux kernel
> sources (grep for PAGE_SHIFT) suggests that there are other obscure
> architectures that can at least optionally use larger pages, plus a
> few that can use smaller ones.
>
> I'd like this to be something that users don't have to configure, and
> it seems like that should be possible.  We can detect the page size on
> non-Windows systems using sysconf(_SC_PAGESIZE), and on Windows by
> using GetSystemInfo.  And I think it's safe to make this decision at
> configure time, because the page size is a function of the hardware
> architecture (it seems there are obscure systems that support multiple
> page sizes, but I don't care about them particularly).  So what I
> think we should do is set an XLOG_WRITESZ along with XLOG_BLCKSZ and
> set it to the smaller of XLOG_BLCKSZ and the system page size.  If we
> can't determine the system page size, assume 4kB.
>

I think deciding it automatically, without requiring the user to
configure it, certainly has merits, but what about some cases where
the user can get benefits by configuring it themselves, like the
cases where we use the PG_O_DIRECT flag for WAL (with O_DIRECT, it
will bypass the OS buffers and won't cause misaligned writes even for
smaller chunk sizes like 512 bytes or so)?  Some googling [1] reveals
that other databases also provide the user with an option to
configure the WAL block/chunk size (as BLOCKSIZE), although they seem
to decide the chunk size based on the disk-sector size.

An additional thought, which is not necessarily related to this patch:
if the user chooses, and/or we decide, to write in 512-byte chunks,
which is usually the disk-sector size, then can't we think of avoiding
the CRC for each record in such cases, because each WAL write will in
itself be atomic?  While reading, if we process in wal-chunk-sized
units, then I think it should be possible to detect end-of-WAL based
on the data read.

[1] - http://docs.oracle.com/cd/E11882_01/server.112/e41084/clauses004.htm#SQLRF52268

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WAL Re-Writes

2016-02-03 Thread Amit Kapila
On Wed, Feb 3, 2016 at 11:12 AM, Amit Kapila  wrote:
>
> On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby  wrote:
>>
>> On 1/31/16 3:26 PM, Jan Wieck wrote:
>>>
>>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>>>
>>>> operation.  Now the reason the OS couldn't find the corresponding
>>>> block in memory is that, while closing the WAL file, we use
>>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
>>>> leads to this problem.  So with this experiment, the conclusion is
>>>> that though we can avoid re-writes of WAL data by doing exact
>>>> writes, it could lead to a significant reduction in TPS.
>>>
>>>
>>> POSIX_FADV_DONTNEED isn't the only way those blocks would vanish
>>> from the OS buffers. If I am not mistaken, we recycle WAL segments
>>> in a round-robin fashion. In a properly configured system, where the
>>> reason for a checkpoint is usually "time" rather than "xlog", a
>>> recycled WAL file, by the time it is written to again, has been
>>> closed and not touched for about a complete checkpoint_timeout or
>>> longer. You must have a really big amount of spare RAM in the
>>> machine to still find those blocks in memory. Basically we are
>>> talking about the active portion of your database, shared buffers,
>>> the sum of all process-local memory and the complete pg_xlog
>>> directory content fitting into RAM.
>
>
>
> I think that could only be a problem if the reads were happening at
> the write or fsync call, but that is not the case here.  Further
> investigation on this point reveals that the reads are not for the
> fsync operation; rather, they happen when we call
> posix_fadvise(,,POSIX_FADV_DONTNEED).  Although this behaviour
> (writing in non-OS-page-cache-size chunks can lead to reads if
> followed by a call to posix_fadvise(,,POSIX_FADV_DONTNEED)) is not
> very clearly documented, the reason for it is that the fadvise() call
> maps the specified data range (which in our case is the whole file)
> onto its list of pages and then invalidates them, which removes them
> from the OS cache; after that, any misaligned (w.r.t. OS page size)
> writes done while writing/fsyncing to the file can cause additional
> reads, as not everything we wrote will be on an OS-page boundary.
>

On further testing, it has been observed that misaligned writes can
cause reads even when the blocks related to the file are not in
memory, so I think what Jan is describing is right.  The only case
where there is absolutely zero chance of reads is when we write on
the OS-page boundary, which is generally 4K.  However, I still think
it is okay to provide an option for WAL writing in smaller chunks
(512 bytes, 1024 bytes, etc.) for the cases where these are
beneficial, like when wal_level is greater than or equal to archive,
and to keep the default as the OS page size when that is smaller
than 8K.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WAL Re-Writes

2016-02-03 Thread Robert Haas
On Wed, Feb 3, 2016 at 7:28 AM, Amit Kapila  wrote:
> On further testing, it has been observed that misaligned writes can
> cause reads even when the blocks related to the file are not in
> memory, so I think what Jan is describing is right.  The only case
> where there is absolutely zero chance of reads is when we write on
> the OS-page boundary, which is generally 4K.  However, I still think
> it is okay to provide an option for WAL writing in smaller chunks
> (512 bytes, 1024 bytes, etc.) for the cases where these are
> beneficial, like when wal_level is greater than or equal to archive,
> and to keep the default as the OS page size when that is smaller
> than 8K.

Hmm, a little research seems to suggest that 4kB pages are standard on
almost every system we might care about: x86_64, x86, Power, Itanium,
ARMv7.  Sparc uses 8kB, though, and a search through the Linux kernel
sources (grep for PAGE_SHIFT) suggests that there are other obscure
architectures that can at least optionally use larger pages, plus a
few that can use smaller ones.

I'd like this to be something that users don't have to configure, and
it seems like that should be possible.  We can detect the page size on
non-Windows systems using sysconf(_SC_PAGESIZE), and on Windows by
using GetSystemInfo.  And I think it's safe to make this decision at
configure time, because the page size is a function of the hardware
architecture (it seems there are obscure systems that support multiple
page sizes, but I don't care about them particularly).  So what I
think we should do is set an XLOG_WRITESZ along with XLOG_BLCKSZ and
set it to the smaller of XLOG_BLCKSZ and the system page size.  If we
can't determine the system page size, assume 4kB.
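
A minimal sketch of such a probe (assuming a small standalone C program
run at configure time; sysconf(_SC_PAGESIZE) and GetSystemInfo() are
the real interfaces, the surrounding scaffolding is made up):

    /* Probe the OS page size; fall back to 4kB if it can't be found. */
    #ifdef _WIN32
    #include <windows.h>
    #else
    #include <unistd.h>
    #endif
    #include <stdio.h>

    static long
    os_page_size(void)
    {
    #ifdef _WIN32
        SYSTEM_INFO si;

        GetSystemInfo(&si);
        return (long) si.dwPageSize;
    #else
        long    sz = sysconf(_SC_PAGESIZE);

        return (sz > 0) ? sz : 4096;
    #endif
    }

    int
    main(void)
    {
        /* XLOG_WRITESZ would then be Min(XLOG_BLCKSZ, this value) */
        printf("%ld\n", os_page_size());
        return 0;
    }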

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL Re-Writes

2016-02-02 Thread Amit Kapila
On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby  wrote:

> On 1/31/16 3:26 PM, Jan Wieck wrote:
>
>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>
>>> operation.  Now the reason the OS couldn't find the corresponding
>>> block in memory is that, while closing the WAL file, we use
>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
>>> leads to this problem.  So with this experiment, the conclusion is
>>> that though we can avoid re-writes of WAL data by doing exact
>>> writes, it could lead to a significant reduction in TPS.
>>>
>>
>> POSIX_FADV_DONTNEED isn't the only way those blocks would vanish
>> from the OS buffers. If I am not mistaken, we recycle WAL segments
>> in a round-robin fashion. In a properly configured system, where the
>> reason for a checkpoint is usually "time" rather than "xlog", a
>> recycled WAL file, by the time it is written to again, has been
>> closed and not touched for about a complete checkpoint_timeout or
>> longer. You must have a really big amount of spare RAM in the
>> machine to still find those blocks in memory. Basically we are
>> talking about the active portion of your database, shared buffers,
>> the sum of all process-local memory and the complete pg_xlog
>> directory content fitting into RAM.
>>
>

I think that could only be a problem if the reads were happening at
the write or fsync call, but that is not the case here.  Further
investigation on this point reveals that the reads are not for the
fsync operation; rather, they happen when we call
posix_fadvise(,,POSIX_FADV_DONTNEED).  Although this behaviour
(writing in non-OS-page-cache-size chunks can lead to reads if
followed by a call to posix_fadvise(,,POSIX_FADV_DONTNEED)) is not
very clearly documented, the reason for it is that the fadvise() call
maps the specified data range (which in our case is the whole file)
onto its list of pages and then invalidates them, which removes them
from the OS cache; after that, any misaligned (w.r.t. OS page size)
writes done while writing/fsyncing to the file can cause additional
reads, as not everything we wrote will be on an OS-page boundary.
This theory is based on the code of fadvise [1] and some googling [2],
which suggests that misaligned writes followed by POSIX_FADV_DONTNEED
can cause a similar problem.  A colleague of mine, Dilip Kumar, has
verified it by writing a simple program doing
open/write/fsync/fadvise/close as well.
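
(Dilip's program itself isn't posted here; a rough reconstruction of
the sequence described, with a stand-in file name, would be:)

    /* Reconstruction of the described test: one sub-page-sized write,
     * fsync, then POSIX_FADV_DONTNEED.  Comparing the device's read
     * counters before and after runs shows the extra reads. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK 512               /* smaller than the 4K OS page */

    int
    main(void)
    {
        char    buf[CHUNK];
        int     fd;

        memset(buf, 'x', CHUNK);

        fd = open("testfile", O_WRONLY | O_CREAT, 0600);
        if (fd < 0)
            return 1;

        if (write(fd, buf, CHUNK) != CHUNK)  /* misaligned w.r.t. page */
            return 1;
        if (fsync(fd) != 0)
            return 1;

        /* dropping the partially written OS page from the cache is
         * where the extra reads show up */
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

        close(fd);
        return 0;
    }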


>
> But that's only going to matter when the segment is newly recycled. My
> impression from Amit's email is that the OS was repeatedly reading even in
> the same segment?
>
>
As explained above, the reads only happen during file close.


> Either way, I would think it wouldn't be hard to work around this by
> spewing out a bunch of zeros to the OS in advance of where we actually need
> to write, preventing the need for reading back from disk.
>
>
I think we can simply prohibit setting wal_chunk_size to a value other
than the OS page-cache size or XLOG_BLCKSZ (whichever is smaller) if
wal_level is less than archive.  That avoids the problem of extra
reads for misaligned writes, as we won't call fadvise().

We can even choose to always write on the OS-page-cache boundary or in
XLOG_BLCKSZ units (whichever is smaller); in many cases the OS page
size is 4K, which can also save significant re-writes.



> Amit, did you do performance testing with archiving enabled and a
> no-op archive_command?
>

No, but what kind of advantage are you expecting from such
tests?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WAL Re-Writes

2016-02-01 Thread Jim Nasby

On 1/31/16 3:26 PM, Jan Wieck wrote:
> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>
>> operation.  Now the reason the OS couldn't find the corresponding
>> block in memory is that, while closing the WAL file, we use
>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
>> leads to this problem.  So with this experiment, the conclusion is
>> that though we can avoid re-writes of WAL data by doing exact
>> writes, it could lead to a significant reduction in TPS.
>
> POSIX_FADV_DONTNEED isn't the only way those blocks would vanish
> from the OS buffers. If I am not mistaken, we recycle WAL segments
> in a round-robin fashion. In a properly configured system, where the
> reason for a checkpoint is usually "time" rather than "xlog", a
> recycled WAL file, by the time it is written to again, has been
> closed and not touched for about a complete checkpoint_timeout or
> longer. You must have a really big amount of spare RAM in the
> machine to still find those blocks in memory. Basically we are
> talking about the active portion of your database, shared buffers,
> the sum of all process-local memory and the complete pg_xlog
> directory content fitting into RAM.


But that's only going to matter when the segment is newly recycled. My 
impression from Amit's email is that the OS was repeatedly reading even 
in the same segment?


Either way, I would think it wouldn't be hard to work around this by 
spewing out a bunch of zeros to the OS in advance of where we actually 
need to write, preventing the need for reading back from disk.


Amit, did you do performance testing with archiving enabled and a no-op
archive_command?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] WAL Re-Writes

2016-01-31 Thread Jan Wieck

On 01/27/2016 08:30 AM, Amit Kapila wrote:

> operation.  Now the reason the OS couldn't find the corresponding
> block in memory is that, while closing the WAL file, we use
> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which leads
> to this problem.  So with this experiment, the conclusion is that
> though we can avoid re-writes of WAL data by doing exact writes, it
> could lead to a significant reduction in TPS.


POSIX_FADV_DONTNEED isn't the only way those blocks would vanish from
the OS buffers. If I am not mistaken, we recycle WAL segments in a
round-robin fashion. In a properly configured system, where the reason
for a checkpoint is usually "time" rather than "xlog", a recycled WAL
file, by the time it is written to again, has been closed and not
touched for about a complete checkpoint_timeout or longer. You must
have a really big amount of spare RAM in the machine to still find
those blocks in memory. Basically we are talking about the active
portion of your database, shared buffers, the sum of all process-local
memory and the complete pg_xlog directory content fitting into RAM.



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info




Re: [HACKERS] WAL Re-Writes

2016-01-27 Thread Amit Kapila
On Thu, Jan 28, 2016 at 1:34 AM, james  wrote:

> On 27/01/2016 13:30, Amit Kapila wrote:
>>
>> Thoughts?
>
> Are the decreases observed with SSD as well as spinning rust?
>
The test was done with WAL on SSD and data on spinning rust, but I
think the results should be similar if it were done the other way
around as well.  Having said that, I think it is still worthwhile to
test it that way, and I will do it.


> I might imagine that decreasing the wear would be advantageous,


Yes.


> especially if the performance decrease is less with low read latency.
>
>
Let me clarify again that with a 4096-byte chunk size, no performance
decrease is observed; rather, there is a performance increase, though
a relatively small one (1~5%), along with a ~35% reduction in disk
writes.  Only if we do exact writes, or write with a smaller chunk
size (512 or 1024 bytes, basically less than the OS block size), do we
see a performance decrease, mainly for wal_level < archive, but then
the writes are much smaller too.  I would also like to mention that
what we call the reduction in disk writes is the 7th column in the
block stat file [1] (write sectors, i.e. the number of sectors
written; for details you can refer to the documentation of the stat
file [1]).


[1] - https://www.kernel.org/doc/Documentation/block/stat.txt
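
For reference, a small sketch of reading that column (the device name
"sda" is just an example):

    /* Read the 7th field (sectors written) from a block device's stat
     * file. */
    #include <stdio.h>

    int
    main(void)
    {
        unsigned long long v[7];
        FILE   *fp = fopen("/sys/block/sda/stat", "r");

        if (fp == NULL)
            return 1;

        /* fields 1-7: read I/Os, read merges, read sectors, read
         * ticks, write I/Os, write merges, write sectors */
        if (fscanf(fp, "%llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]) == 7)
            printf("sectors written: %llu\n", v[6]);

        fclose(fp);
        return 0;
    }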

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WAL Re-Writes

2016-01-27 Thread james

On 27/01/2016 13:30, Amit Kapila wrote:
> Thoughts?

Are the decreases observed with SSD as well as spinning rust?

I might imagine that decreasing the wear would be advantageous, 
especially if the performance decrease is less with low read latency.




