Re: Performance improvement of WAL writing?
On Wed, Aug 28, 2019 at 2:43 AM Moon, Insung wrote:

> So what about modifying the XLogWrite function only to write the size
> that should record?
> Can this idea benefit from WAL writing performance?
> If it's OK to improve, I want to do modification.
> How do you think of it?

I have performed tests in this direction several years ago. The way it works now guarantees that the data of recycled WAL segments is never read from disk, as long as the WAL block size is a multiple of the filesystem's page/fragment size. The OS sees that the write() is on a fragment boundary and covers full fragment(s), so it can replace the fragment without reading it first.

If the write() were smaller, the OS would be forced to combine the new data with the rest of the fragment's old data on disk. To do so, the system has to wait until the old fragment is actually in the OS buffer cache. This means that once you have enough WAL segments to cycle through (so that the checkpoint reason is never XLOG) and the blocks of the segment being cycled in have been evicted since it was last used, partial writes cause real reads. On spinning media, which are still an excellent choice for WAL, this turns into a total disaster, because it adds a rotational delay for every single WAL block that is only partially overwritten.

I believe this could only work if we stopped recycling WAL segments and instead deleted old segments and created new files as needed. That, however, adds the overhead of constant metadata changes to WAL writing, and I have no idea what performance or reliability impact that may have. There were reasons we chose to implement WAL segment recycling many years ago. Those reasons may no longer be valid on modern filesystems, but it is definitely not a performance question alone.

Regards, Jan

--
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info
Re: Performance improvement of WAL writing?
Hi, Moon-san.

At Wed, 28 Aug 2019 15:43:02 +0900, "Moon, Insung" wrote in

> Dear Hackers.
>
> Currently, the XLogWrite function is written in 8k(or 16kb) units
> regardless of the size of the new record.
> For example, even if a new record is only 300 bytes, pg_pwrite is
> called to write data in 8k units (if it cannot be writing on one page
> is 16kb written).
> Let's look at the worst case.
> 1) LogwrtResult.Flush is 8100 pos.
> 2) And the new record size is only 100 bytes.
> In this case, pg_pwrite is called which writes 16 kb to update only 100 bytes.
> It is a rare case, but I think there is overhead for pg_pwrite for some
> systems.
> # For systems that often update one record in one transaction.

If a commit is alone in the system, the 100 bytes ought to be flushed out immediately. If there are many concurrent commits, XLogFlush() waits for all in-flight insertions up to the LSN the backend is writing, then actually flushes. Sometimes it flushes just 100 bytes, but sometimes it flushes far more bytes, involving records from other backends. If another backend has already flushed past the backend's target LSN, the backend skips the flush.

If you want to involve more commits in a flush, commit_delay makes the backend wait for that duration so that more commits can come in within the window. Alternatively, synchronous_commit = off works somewhat more aggressively.

> So what about modifying the XLogWrite function only to write the size
> that should record?

If I understand your proposal correctly, you are proposing to separate fsync from XLogWrite. Actually, as you propose, roughly speaking no flush happens except at a segment switch or a commit. If you are proposing not to flush immediately even at commit, commit_delay (or synchronous_commit) works that way.

> Can this idea benefit from WAL writing performance?
> If it's OK to improve, I want to do modification.
> How do you think of it?

So the proposal seems to be already achieved.
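For reference, the settings mentioned above live in postgresql.conf. The values below are illustrative only, not recommendations:

```
# Wait up to commit_delay microseconds before flushing WAL, so that
# concurrent commits can piggyback on the same flush. The delay is
# only applied when at least commit_siblings other transactions are
# active at the time.
commit_delay = 100          # microseconds, default 0 (disabled)
commit_siblings = 5         # default 5

# Or trade durability for latency: commit returns before the WAL
# flush, and the walwriter flushes shortly afterwards.
synchronous_commit = off
```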
If not, could you elaborate on the proposal, or explain the actual problem?

> Best Regards.
> Moon.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Performance improvement of WAL writing?
Dear Hackers.

Currently, the XLogWrite function writes in 8 kB (or 16 kB) units regardless of the size of the new record. For example, even if a new record is only 300 bytes, pg_pwrite is called to write data in 8 kB units (or 16 kB if the record does not fit on one page).

Let's look at the worst case:
1) LogwrtResult.Flush is at position 8100.
2) The new record size is only 100 bytes.

In this case, pg_pwrite is called to write 16 kB just to update only 100 bytes. It is a rare case, but I think there is pg_pwrite overhead on some systems.
# For systems that often update one record in one transaction.

So what about modifying the XLogWrite function to write only the size that it should record? Can this idea benefit WAL writing performance? If it's OK to improve, I want to do the modification. What do you think of it?

Best Regards.
Moon.