Re: [HACKERS] WAL Re-Writes
On Mon, Feb 8, 2016 at 8:16 PM, Andres Freund wrote:
> On 2016-02-08 10:38:55 +0530, Amit Kapila wrote:
> > I think deciding it automatically without requiring the user to
> > configure it certainly has merits, but what about cases where the
> > user can get benefits by configuring it themselves, like the cases
> > where we use the PG_O_DIRECT flag for WAL (with O_DIRECT, it will
> > bypass OS buffers and won't cause misaligned writes even for smaller
> > chunk sizes like 512 bytes or so)? Some googling [1] reveals that
> > other databases also provide users with an option to configure the
> > WAL block/chunk size (as BLOCKSIZE), although they seem to decide
> > the chunk size based on the disk-sector size.
>
> FWIW, you usually can't do such small writes with O_DIRECT. Usually
> they have to be 4KB (pagesize) sized, 4KB-aligned writes. And on
> filesystems that do support doing such writes, they essentially fall
> back to doing buffered IO.

I have not observed this during the tests (the observation is based on the fact that whenever the OS buffer cache is in use, writing in chunks smaller than 4K leads to reads, which in turn decreases performance). I don't see such an implication in the documentation either.

> > An additional thought, which is not necessarily related to this
> > patch: if the user chooses, or we decide, to write in 512-byte
> > chunks, which is usually the disk sector size, then can't we think
> > of avoiding the CRC for each record in such cases, because each WAL
> > write in itself will be atomic? While reading, if we process in
> > wal-chunk-sized units, then I think it should be possible to detect
> > end-of-WAL based on the data read.
>
> O_DIRECT doesn't give any useful guarantees for doing something like
> the above. It doesn't have any ordering or durability implications.
> You still need to do fdatasyncs and such.

It doesn't need to, if we use the O_SYNC flag, which we always use whenever we use O_DIRECT mode during WAL writes.

> Besides, with the new CRC implementations, that doesn't really seem
> like such a large win anyway.

I haven't checked yet how big a win we can get by avoiding CRCs while still providing the same reliability, but I think it can certainly save CPU instructions both during writes and replay, and performance should be better than it is now.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL Re-Writes
On Mon, Feb 8, 2016 at 8:11 PM, Robert Haas wrote:
> On Mon, Feb 8, 2016 at 12:08 AM, Amit Kapila wrote:
> > I think deciding it automatically without requiring the user to
> > configure it certainly has merits, but what about cases where the
> > user can get benefits by configuring it themselves, like the cases
> > where we use the PG_O_DIRECT flag for WAL (with O_DIRECT, it will
> > bypass OS buffers and won't cause misaligned writes even for smaller
> > chunk sizes like 512 bytes or so)? Some googling [1] reveals that
> > other databases also provide users with an option to configure the
> > WAL block/chunk size (as BLOCKSIZE), although they seem to decide
> > the chunk size based on the disk-sector size.
>
> Well, if you can prove that we need that flexibility, then we should
> have a GUC. Where's the benchmarking data to support that conclusion?

It is not posted yet, as some more work is needed to complete the benchmark results when PG_O_DIRECT is used (mainly with open_sync and open_datasync). I will do so. But I think the main thing to take care of is that, since smaller-chunk-sized writes are useful only in some cases, users should not get baffled by the option. There are multiple ways to provide it:

a) at startup, ensure that if the user has set a smaller chunk size (other than the 4KB default, decided at configure time in the way you described upthread) and PG_O_DIRECT can be used as we decide in get_sync_bit(), then allow it; otherwise either return an error or just reset it to the 4KB default;
b) mention in the docs that it is better not to tinker with the wal_chunk_size GUC unless the other relevant settings are in place (wal_sync_method = open_sync or open_datasync, and the default wal_level);
c) yet another option is to stick with 4KB-sized chunks for now, since the benefit of not doing so applies only in a subset of the cases we can support.

The reason I think it is beneficial to provide an option of writing in smaller chunks is that it could reduce the amount of re-writes by a higher percentage where it can be used. For example, at 4KB there is a ~35% reduction; at smaller chunk sizes it could give us savings of up to 50% or 70%, depending on the chunk size.

> > An additional thought, which is not necessarily related to this
> > patch: if the user chooses, or we decide, to write in 512-byte
> > chunks, which is usually the disk sector size, then can't we think
> > of avoiding the CRC for each record in such cases, because each WAL
> > write in itself will be atomic? While reading, if we process in
> > wal-chunk-sized units, then I think it should be possible to detect
> > end-of-WAL based on the data read.
>
> Gosh, taking CRCs off of WAL records sounds like a terrible idea. I'm
> not sure why you think that writing in sector-sized chunks would make
> that any more safe, because to me it seems like it wouldn't. But even
> if it does, it's hard to believe that we don't derive some reliability
> from CRCs that we would lose without them.

I think the point here is not more safety; rather, it is whether writing in disk-sector-sized units gives reliability equal to CRCs, because if it does, then skipping the CRC calculation for each record, both while writing and during replay, can save CPU and should in turn lead to better performance. The reason I thought it could give equal reliability is that disk-sector writes are atomic, so that should buy us the reliability. I admit that much more analysis/research is required before doing that, and we can take it up later if it proves valuable in terms of performance and reliability. I mentioned it here only to point out that writing in smaller chunks has other potential benefits.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL Re-Writes
On 2016-02-08 10:38:55 +0530, Amit Kapila wrote:
> I think deciding it automatically without requiring the user to
> configure it certainly has merits, but what about cases where the user
> can get benefits by configuring it themselves, like the cases where we
> use the PG_O_DIRECT flag for WAL (with O_DIRECT, it will bypass OS
> buffers and won't cause misaligned writes even for smaller chunk sizes
> like 512 bytes or so)? Some googling [1] reveals that other databases
> also provide users with an option to configure the WAL block/chunk
> size (as BLOCKSIZE), although they seem to decide the chunk size based
> on the disk-sector size.

FWIW, you usually can't do such small writes with O_DIRECT. Usually they have to be 4KB (pagesize) sized, 4KB-aligned writes. And on filesystems that do support doing such writes, they essentially fall back to doing buffered IO.

> An additional thought, which is not necessarily related to this patch:
> if the user chooses, or we decide, to write in 512-byte chunks, which
> is usually the disk sector size, then can't we think of avoiding the
> CRC for each record in such cases, because each WAL write in itself
> will be atomic? While reading, if we process in wal-chunk-sized units,
> then I think it should be possible to detect end-of-WAL based on the
> data read.

O_DIRECT doesn't give any useful guarantees for doing something like the above. It doesn't have any ordering or durability implications. You still need to do fdatasyncs and such.

Besides, with the new CRC implementations, that doesn't really seem like such a large win anyway.

Greetings, Andres Freund

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL Re-Writes
On Mon, Feb 8, 2016 at 12:08 AM, Amit Kapila wrote:
> I think deciding it automatically without requiring the user to
> configure it certainly has merits, but what about cases where the user
> can get benefits by configuring it themselves, like the cases where we
> use the PG_O_DIRECT flag for WAL (with O_DIRECT, it will bypass OS
> buffers and won't cause misaligned writes even for smaller chunk sizes
> like 512 bytes or so)? Some googling [1] reveals that other databases
> also provide users with an option to configure the WAL block/chunk
> size (as BLOCKSIZE), although they seem to decide the chunk size based
> on the disk-sector size.

Well, if you can prove that we need that flexibility, then we should have a GUC. Where's the benchmarking data to support that conclusion?

> An additional thought, which is not necessarily related to this patch:
> if the user chooses, or we decide, to write in 512-byte chunks, which
> is usually the disk sector size, then can't we think of avoiding the
> CRC for each record in such cases, because each WAL write in itself
> will be atomic? While reading, if we process in wal-chunk-sized units,
> then I think it should be possible to detect end-of-WAL based on the
> data read.

Gosh, taking CRCs off of WAL records sounds like a terrible idea. I'm not sure why you think that writing in sector-sized chunks would make that any more safe, because to me it seems like it wouldn't. But even if it does, it's hard to believe that we don't derive some reliability from CRCs that we would lose without them.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WAL Re-Writes
On Wed, Feb 3, 2016 at 7:12 PM, Robert Haas wrote:
> On Wed, Feb 3, 2016 at 7:28 AM, Amit Kapila wrote:
> > On further testing, it has been observed that misaligned writes can
> > cause reads even when the blocks related to the file are not in
> > memory, so I think what Jan is describing is right. The case where
> > there is absolutely zero chance of reads is when we write on the
> > OS-page boundary, which is generally 4K. However, I still think it
> > is okay to provide an option for WAL writing in smaller chunks (512
> > bytes, 1024 bytes, etc.) for the cases where these are beneficial,
> > like when wal_level is greater than or equal to archive, and to keep
> > the default as the OS page size where it is smaller than 8K.
>
> Hmm, a little research suggests that 4kB pages are standard on almost
> every system we might care about: x86_64, x86, Power, Itanium, ARMv7.
> Sparc uses 8kB, though, and a search through the Linux kernel sources
> (grep for PAGE_SHIFT) suggests that there are other obscure
> architectures that can at least optionally use larger pages, plus a
> few that can use smaller ones.
>
> I'd like this to be something that users don't have to configure, and
> it seems like that should be possible. We can detect the page size on
> non-Windows systems using sysconf(_SC_PAGESIZE), and on Windows by
> using GetSystemInfo. And I think it's safe to make this decision at
> configure time, because the page size is a function of the hardware
> architecture (it seems there are obscure systems that support multiple
> page sizes, but I don't care about them particularly). So what I
> think we should do is set an XLOG_WRITESZ along with XLOG_BLCKSZ and
> set it to the smaller of XLOG_BLCKSZ and the system page size. If we
> can't determine the system page size, assume 4kB.

I think deciding it automatically without requiring the user to configure it certainly has merits, but what about cases where the user can get benefits by configuring it themselves, like the cases where we use the PG_O_DIRECT flag for WAL (with O_DIRECT, it will bypass OS buffers and won't cause misaligned writes even for smaller chunk sizes like 512 bytes or so)? Some googling [1] reveals that other databases also provide users with an option to configure the WAL block/chunk size (as BLOCKSIZE), although they seem to decide the chunk size based on the disk-sector size.

An additional thought, which is not necessarily related to this patch: if the user chooses, or we decide, to write in 512-byte chunks, which is usually the disk sector size, then can't we think of avoiding the CRC for each record in such cases, because each WAL write in itself will be atomic? While reading, if we process in wal-chunk-sized units, then I think it should be possible to detect end-of-WAL based on the data read.

[1] - http://docs.oracle.com/cd/E11882_01/server.112/e41084/clauses004.htm#SQLRF52268

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL Re-Writes
On Wed, Feb 3, 2016 at 7:28 AM, Amit Kapila wrote:
> On further testing, it has been observed that misaligned writes can
> cause reads even when the blocks related to the file are not in
> memory, so I think what Jan is describing is right. The case where
> there is absolutely zero chance of reads is when we write on the
> OS-page boundary, which is generally 4K. However, I still think it is
> okay to provide an option for WAL writing in smaller chunks (512
> bytes, 1024 bytes, etc.) for the cases where these are beneficial,
> like when wal_level is greater than or equal to archive, and to keep
> the default as the OS page size where it is smaller than 8K.

Hmm, a little research suggests that 4kB pages are standard on almost every system we might care about: x86_64, x86, Power, Itanium, ARMv7. Sparc uses 8kB, though, and a search through the Linux kernel sources (grep for PAGE_SHIFT) suggests that there are other obscure architectures that can at least optionally use larger pages, plus a few that can use smaller ones.

I'd like this to be something that users don't have to configure, and it seems like that should be possible. We can detect the page size on non-Windows systems using sysconf(_SC_PAGESIZE), and on Windows by using GetSystemInfo. And I think it's safe to make this decision at configure time, because the page size is a function of the hardware architecture (it seems there are obscure systems that support multiple page sizes, but I don't care about them particularly). So what I think we should do is set an XLOG_WRITESZ along with XLOG_BLCKSZ and set it to the smaller of XLOG_BLCKSZ and the system page size. If we can't determine the system page size, assume 4kB.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
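Robert's proposal above can be mirrored in a few lines. This is a hedged sketch only: the function name `xlog_write_size` and the 8192-byte default stand in for the proposed XLOG_WRITESZ and the usual XLOG_BLCKSZ; the POSIX path uses sysconf(_SC_PAGESIZE) as described, and on Windows one would call GetSystemInfo() instead:

```python
import os

def xlog_write_size(xlog_blcksz=8192):
    """Pick the WAL write size: min(XLOG_BLCKSZ, OS page size).

    If the page size cannot be determined, assume 4kB, per the
    proposal in the thread.
    """
    try:
        page = os.sysconf("SC_PAGESIZE")   # sysconf(_SC_PAGESIZE)
        if page <= 0:
            raise ValueError("indeterminate page size")
    except (AttributeError, ValueError, OSError):
        page = 4096                        # fallback: assume 4kB pages
    return min(xlog_blcksz, page)
```

On the common 4kB-page architectures this yields 4096 for an 8kB XLOG_BLCKSZ, while on an 8kB-page Sparc it would yield the full 8192.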
Re: [HACKERS] WAL Re-Writes
On Wed, Feb 3, 2016 at 11:12 AM, Amit Kapila wrote:
> On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby wrote:
>> On 1/31/16 3:26 PM, Jan Wieck wrote:
>>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>>> operation. Now the reason the OS couldn't find the corresponding
>>>> block in memory is that, while closing the WAL file, we use
>>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
>>>> leads to this problem. So with this experiment, the conclusion is
>>>> that though we can avoid re-writes of WAL data by doing exact
>>>> writes, it could lead to a significant reduction in TPS.
>>>
>>> POSIX_FADV_DONTNEED isn't the only way those blocks could vanish
>>> from OS buffers. If I am not mistaken, we recycle WAL segments in a
>>> round-robin fashion. In a properly configured system, where the
>>> reason for a checkpoint is usually "time" rather than "xlog", a
>>> recycled WAL file being written to had been closed and not touched
>>> for about a complete checkpoint_timeout or longer. You must have a
>>> really big amount of spare RAM in the machine to still find those
>>> blocks in memory. Basically we are talking about the active portion
>>> of your database, shared buffers, the sum of all process local
>>> memory, and the complete pg_xlog directory content fitting into RAM.
>
> I think that could only be a problem if the reads were happening at
> the write or fsync call, but that is not the case here. Further
> investigation on this point reveals that the reads are not for the
> fsync operation; rather, they happen when we call
> posix_fadvise(,,POSIX_FADV_DONTNEED).
>
> Although this behaviour (writing in non-OS-page-cache-size chunks can
> lead to reads if followed by a call to
> posix_fadvise(,,POSIX_FADV_DONTNEED)) is not very clearly documented,
> the reason for it is that the fadvise() call maps the specified data
> range (in our case the whole file) onto a list of pages and then
> invalidates them, removing them from the OS cache; any misaligned
> (w.r.t. the OS page size) writes done while writing/fsyncing the file
> can then cause additional reads, as not everything we wrote will be on
> an OS page boundary.

On further testing, it has been observed that misaligned writes can cause reads even when the blocks related to the file are not in memory, so I think what Jan is describing is right. The case where there is absolutely zero chance of reads is when we write on the OS-page boundary, which is generally 4K. However, I still think it is okay to provide an option for WAL writing in smaller chunks (512 bytes, 1024 bytes, etc.) for the cases where these are beneficial, like when wal_level is greater than or equal to archive, and to keep the default as the OS page size where it is smaller than 8K.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL Re-Writes
On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby wrote:
> On 1/31/16 3:26 PM, Jan Wieck wrote:
>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>> operation. Now the reason the OS couldn't find the corresponding
>>> block in memory is that, while closing the WAL file, we use
>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which leads
>>> to this problem. So with this experiment, the conclusion is that
>>> though we can avoid re-writes of WAL data by doing exact writes, it
>>> could lead to a significant reduction in TPS.
>>
>> POSIX_FADV_DONTNEED isn't the only way those blocks could vanish from
>> OS buffers. If I am not mistaken, we recycle WAL segments in a
>> round-robin fashion. In a properly configured system, where the
>> reason for a checkpoint is usually "time" rather than "xlog", a
>> recycled WAL file being written to had been closed and not touched
>> for about a complete checkpoint_timeout or longer. You must have a
>> really big amount of spare RAM in the machine to still find those
>> blocks in memory. Basically we are talking about the active portion
>> of your database, shared buffers, the sum of all process local
>> memory, and the complete pg_xlog directory content fitting into RAM.

I think that could only be a problem if the reads were happening at the write or fsync call, but that is not the case here. Further investigation on this point reveals that the reads are not for the fsync operation; rather, they happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).

Although this behaviour (writing in non-OS-page-cache-size chunks can lead to reads if followed by a call to posix_fadvise(,,POSIX_FADV_DONTNEED)) is not very clearly documented, the reason for it is that the fadvise() call maps the specified data range (in our case the whole file) onto a list of pages and then invalidates them, removing them from the OS cache; any misaligned (w.r.t. the OS page size) writes done while writing/fsyncing the file can then cause additional reads, as not everything we wrote will be on an OS page boundary. This theory is based on the code of fadvise [1] and some googling [2], which suggest that misaligned writes followed by POSIX_FADV_DONTNEED can cause this problem. A colleague of mine, Dilip Kumar, has also verified it by writing a simple open/write/fsync/fadvise/close program.

> But that's only going to matter when the segment is newly recycled.
> My impression from Amit's email is that the OS was repeatedly reading
> even in the same segment?

As explained above, the reads only happen when the file is closed.

> Either way, I would think it wouldn't be hard to work around this by
> spewing out a bunch of zeros to the OS in advance of where we actually
> need to write, preventing the need for reading back from disk.

I think we can simply prohibit setting wal_chunk_size to a value other than the OS page size or XLOG_BLCKSZ (whichever is smaller) when wal_level is less than archive. That avoids the problem of extra reads for misaligned writes, because we won't call fadvise(). We could even choose to always write on the OS-page-cache boundary or XLOG_BLCKSZ (whichever is smaller); in many cases the OS page size is 4K, which can also save significant re-writes.

> Amit, did you do performance testing with archiving enabled and a
> no-op archive_command?

No, but what kind of advantage are you expecting from such tests?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL Re-Writes
On 1/31/16 3:26 PM, Jan Wieck wrote:
> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>> operation. Now the reason the OS couldn't find the corresponding
>> block in memory is that, while closing the WAL file, we use
>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which leads
>> to this problem. So with this experiment, the conclusion is that
>> though we can avoid re-writes of WAL data by doing exact writes, it
>> could lead to a significant reduction in TPS.
>
> POSIX_FADV_DONTNEED isn't the only way those blocks could vanish from
> OS buffers. If I am not mistaken, we recycle WAL segments in a
> round-robin fashion. In a properly configured system, where the reason
> for a checkpoint is usually "time" rather than "xlog", a recycled WAL
> file being written to had been closed and not touched for about a
> complete checkpoint_timeout or longer. You must have a really big
> amount of spare RAM in the machine to still find those blocks in
> memory. Basically we are talking about the active portion of your
> database, shared buffers, the sum of all process local memory, and the
> complete pg_xlog directory content fitting into RAM.

But that's only going to matter when the segment is newly recycled. My impression from Amit's email is that the OS was repeatedly reading even in the same segment?

Either way, I would think it wouldn't be hard to work around this by spewing out a bunch of zeros to the OS in advance of where we actually need to write, preventing the need for reading back from disk.

Amit, did you do performance testing with archiving enabled and a no-op archive_command?

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] WAL Re-Writes
On 01/27/2016 08:30 AM, Amit Kapila wrote:
> operation. Now the reason the OS couldn't find the corresponding block
> in memory is that, while closing the WAL file, we use
> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which leads
> to this problem. So with this experiment, the conclusion is that
> though we can avoid re-writes of WAL data by doing exact writes, it
> could lead to a significant reduction in TPS.

POSIX_FADV_DONTNEED isn't the only way those blocks could vanish from OS buffers. If I am not mistaken, we recycle WAL segments in a round-robin fashion. In a properly configured system, where the reason for a checkpoint is usually "time" rather than "xlog", a recycled WAL file being written to had been closed and not touched for about a complete checkpoint_timeout or longer. You must have a really big amount of spare RAM in the machine to still find those blocks in memory. Basically we are talking about the active portion of your database, shared buffers, the sum of all process local memory, and the complete pg_xlog directory content fitting into RAM.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] WAL Re-Writes
On Thu, Jan 28, 2016 at 1:34 AM, james wrote:
> On 27/01/2016 13:30, Amit Kapila wrote:
>> Thoughts?
>
> Are the decreases observed with SSD as well as spinning rust?

The test was done with WAL on SSD and data on spinning rust, but I think the results should be similar the other way around as well. Having said that, I think it is still worthwhile to test it that way, and I will do so.

> I might imagine that decreasing the wear would be advantageous,

Yes.

> especially if the performance decrease is less with low read latency.

Let me clarify again here that with a 4096-byte chunk size there is no performance decrease observed; rather there is a relatively small performance increase (1~5%) along with a reduction of ~35% in disk writes. Only if we do exact writes, or write in smaller chunk sizes (512 or 1024 bytes, i.e. less than the OS block size), do we see a performance decrease, mainly for wal_level < archive, but then the writes are much smaller. I would also like to mention that what we call the reduction in disk writes is the 7th column in the stat file ("write sectors" - the number of sectors written; for details, refer to the documentation of the stat file [1]).

[1] - https://www.kernel.org/doc/Documentation/block/stat.txt

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
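The counter Amit refers to can be pulled out of the block-device stat file with a small helper (shown for illustration; the field layout follows the kernel's Documentation/block/stat.txt, where one sector is always 512 bytes regardless of the device's actual sector size):

```python
def write_sectors(stat_text):
    """Return field 7 (1-based) of /sys/block/<dev>/stat: sectors written."""
    fields = stat_text.split()
    return int(fields[6])

def bytes_written(stat_text):
    # The stat file counts in fixed 512-byte sectors.
    return write_sectors(stat_text) * 512
```

Sampling this before and after a benchmark run, e.g. from `/sys/block/sda/stat`, and taking the difference gives the "~35% reduction in disk writes" figure quoted in the thread.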
Re: [HACKERS] WAL Re-Writes
On 27/01/2016 13:30, Amit Kapila wrote:
> Thoughts?

Are the decreases observed with SSD as well as spinning rust?

I might imagine that decreasing the wear would be advantageous, especially if the performance decrease is less with low read latency.