Re: Map WAL segment files on PMEM as WAL buffers
Hi Andres,

Thank you for your report. I rebased and made patchset v9, attached to this email. Note that v9-0009 and v9-0010 are for those who want to pass their own Cirrus CI.

Regards,
Takashi

On Tue, Mar 22, 2022 at 9:44 AM Andres Freund wrote:
>
> Hi,
>
> On 2022-01-20 14:55:13 +0900, Takashi Menjo wrote:
> > Here is patchset v8. It will have "make check-world" and Cirrus to
> > pass.
>
> This unfortunately does not apply anymore:
> http://cfbot.cputube.org/patch_37_3181.log
>
> Could you rebase?
>
> - Andres

--
Takashi Menjo

Attachments (binary data):
- v9-0001-Add-with-libpmem-option-for-PMEM-support.patch
- v9-0002-Add-wal_pmem_map-to-GUC.patch
- v9-0003-Add-wal_pmem_map-to-postgresql.conf.sample.patch
- v9-0004-Export-InstallXLogFileSegment.patch
- v9-0005-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
- v9-0006-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
- v9-0007-Update-document.patch
- v9-0008-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
- v9-0009-For-CI-only-Setup-Cirrus-CI-for-with-libpmem.patch
- v9-0010-For-CI-only-Modify-initdb-for-wal_pmem_map-on.patch
Re: Map WAL segment files on PMEM as WAL buffers
Hi Justin,

Here is patchset v8. It will have "make check-world" and Cirrus to pass. Would you try this one?

The v8 squashes some patches in v7 into related ones, and adds the following patches:

- v8-0003: Add wal_pmem_map to postgresql.conf.sample. It also helps v8-0011.
- v8-0009: Fix wrong handling of missingContrecPtr for test/recovery/t/026 to pass. It is the cause of the error. Thanks for your report.
- v8-0010 and v8-0011: Each of the two is for CI only. v8-0010 adds --with-libpmem and v8-0011 enables "wal_pmem_map = on". Please note that, unlike your suggestion, in my patchset PMEM_IS_PMEM_FORCE=1 will be given as an environment variable in .cirrus.yml and "wal_pmem_map = on" will be given by initdb.

Regards,
Takashi

--
Takashi Menjo

Attachments (binary data):
- v8-0001-Add-with-libpmem-option-for-PMEM-support.patch
- v8-0002-Add-wal_pmem_map-to-GUC.patch
- v8-0003-Add-wal_pmem_map-to-postgresql.conf.sample.patch
- v8-0004-Export-InstallXLogFileSegment.patch
- v8-0005-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
- v8-0006-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
- v8-0007-Update-document.patch
- v8-0008-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
- v8-0009-Fix-wrong-handling-of-missingContrecPtr.patch
- v8-0010-For-CI-only-Setup-Cirrus-CI-for-with-libpmem.patch
- v8-0011-For-CI-only-Modify-initdb-for-wal_pmem_map-on.patch
Re: Map WAL segment files on PMEM as WAL buffers
Hi Justin,

I can reproduce the error you reported, with PMEM_IS_PMEM_FORCE=1. Moreover, I can reproduce it **on a real PMem device**. So the causes are in my patchset, not in the PMem environment. I'll fix it in the next patchset version.

Regards,
Takashi

--
Takashi Menjo
Re: Map WAL segment files on PMEM as WAL buffers
Hi Justin,

Thanks for your help. I'm making an additional patch for Cirrus CI.

I'm also trying to reproduce the "make check-world" error you reported, on my Linux environment that has neither a real PMem nor an emulated one, with PMEM_IS_PMEM_FORCE=1. I'll keep you updated.

Regards,
Takashi

On Mon, Jan 17, 2022 at 4:34 PM Justin Pryzby wrote:
>
> On Thu, Jan 06, 2022 at 10:43:37PM -0600, Justin Pryzby wrote:
> > On Fri, Jan 07, 2022 at 12:50:01PM +0900, Takashi Menjo wrote:
> > > > But in this case it really doesn't work :(
> > > >
> > > > running bootstrap script ... 2022-01-05 23:17:30.244 CST [12088] FATAL: file not on PMEM: path "pg_wal/00010001"
> > >
> > > Do you have a real PMEM device such as NVDIMM-N or Intel Optane PMem?
> >
> > No - the point is that we'd like to have a way to exercise this patch on the
> > cfbot. Particularly the new code introduced by this patch, not just the
> > --without-pmem case...
> ..
> > I think you should add a patch which does what Thomas suggested: 1) add to
> > ./.cirrus.yaml installation of the libpmem package for debian/bsd/mac/windows;
> > 2) add setenv to main(), as above; 3) change configure.ac and guc.c to default
> > to --with-libpmem and wal_pmem_map=on. This should be the last patch, for
> > cfbot only, not meant to be merged.
>
> I was able to get the cirrus CI to compile on linux and bsd with the below
> changes. I don't know if there's an easy package installation for mac OSX. I
> think it's okay if mac CI doesn't use --enable-pmem for now.
>
> > You can test that the package installation part works before mailing patches to
> > the list with the instructions here:
> >
> > src/tools/ci/README: Enabling cirrus-ci in a github repository..
>
> I ran the CI under my own github account.
> Linux crashes in the recovery check.
> And freebsd has been stuck for 45min.
>
> I'm not sure, but maybe those are legitimate consequences of using
> PMEM_IS_PMEM_FORCE (?) If so, maybe the recovery check would need to be
> disabled for this patch to run on CI... Or maybe my suggestion to enable it by
> default for CI doesn't work for this patch. It would need to be specially
> tested with real hardware.
>
> https://cirrus-ci.com/task/6245151591890944
> https://cirrus-ci.com/task/6162551485497344?logs=test_world#L3941
> #2 0x55ff43c6edad in ExceptionalCondition (conditionName=0x55ff43d18108 "!XLogRecPtrIsInvalid(missingContrecPtr)", errorType=0x55ff43d151c4 "FailedAssertion", fileName=0x55ff43d151bd "xlog.c", lineNumber=8297) at assert.c:69
>
> commit 15533794e465a381eb23634d67700afa809a0210
> Author: Justin Pryzby
> Date:   Thu Jan 6 22:53:28 2022 -0600
>
>     tmp: enable pmem by default, for CI
>
> diff --git a/.cirrus.yml b/.cirrus.yml
> index 677bdf0e65e..0cb961c8103 100644
> --- a/.cirrus.yml
> +++ b/.cirrus.yml
> @@ -81,6 +81,7 @@ task:
>       mkdir -m 770 /tmp/cores
>       chown root:postgres /tmp/cores
>       sysctl kern.corefile='/tmp/cores/%N.%P.core'
> +     pkg install -y devel/pmdk
>
>   # NB: Intentionally build without --with-llvm. The freebsd image size is
>   # already large enough to make VM startup slow, and even without llvm
> @@ -99,6 +100,7 @@ task:
>       --with-lz4 \
>       --with-pam \
>       --with-perl \
> +     --with-libpmem \
>       --with-python \
>       --with-ssl=openssl \
>       --with-tcl --with-tclconfig=/usr/local/lib/tcl8.6/ \
> @@ -138,6 +140,7 @@ LINUX_CONFIGURE_FEATURES: &LINUX_CONFIGURE_FEATURES >-
>     --with-lz4
>     --with-pam
>     --with-perl
> +   --with-libpmem
>     --with-python
>     --with-selinux
>     --with-ssl=openssl
> @@ -188,6 +191,9 @@ task:
>       mkdir -m 770 /tmp/cores
>       chown root:postgres /tmp/cores
>       sysctl kernel.core_pattern='/tmp/cores/%e-%s-%p.core'
> +     echo 'deb http://deb.debian.org/debian bullseye universe' >>/etc/apt/sources.list
> +     apt-get update
> +     apt-get -y install libpmem-dev
>
>   configure_script: |
>     su postgres <<-EOF
> @@ -267,6 +273,7 @@ task:
>       make \
>       openldap \
>       openssl \
> +     pmem \
>       python \
>       tcl-tk
>
> @@ -301,6 +308,7 @@ task:
>       --with-libxslt \
>       --with-lz4 \
>       --with-perl \
> +     --with-libpmem \
>       --with-python \
>       --with-ssl=openssl \
>       --with-tcl --with-tclconfig=${brewpath}/opt/tcl-tk/lib/ \
> diff --git a/src/backend/main/main.c b/src/backend/main/main.c
> inde
Re: Map WAL segment files on PMEM as WAL buffers
Hi Justin,

Thank you for your build test and comments. The v7 patchset attached to this email fixes the issues you reported.

> The cfbot showed issues compiling on linux and windows.
> http://cfbot.cputube.org/takashi-menjo.html
>
> https://cirrus-ci.com/task/6125740327436288
> [02:30:06.538] In file included from xlog.c:38:
> [02:30:06.538] ../../../../src/include/access/xlogpmem.h:32:42: error: unknown type name 'tli'
> [02:30:06.538]    32 | PmemXLogEnsurePrevMapped(XLogRecPtr ptr, tli)
> [02:30:06.538]       |                                          ^~~
> [02:30:06.538] xlog.c: In function 'GetXLogBuffer':
> [02:30:06.538] xlog.c:1959:19: warning: implicit declaration of function 'PmemXLogEnsurePrevMapped' [-Wimplicit-function-declaration]
> [02:30:06.538]  1959 | openLogSegNo = PmemXLogEnsurePrevMapped(endptr, tli);
>
> https://cirrus-ci.com/task/6688690280857600?logs=build#L379
> [02:33:25.752] c:\cirrus\src\include\access\xlogpmem.h(33,1): error C2081: 'tli': name in formal parameter list illegal (compiling source file src/backend/access/transam/xlog.c) [c:\cirrus\postgres.vcxproj]
>
> I'm attaching a probable fix. Unfortunately, for patches like this, most of
> the functionality isn't exercised unless the library is installed and
> compilation and runtime are enabled by default.

I got the same error when building without --with-libpmem. Your fix looks reasonable. My v7-0008 fixes this error.

> In 0009: recaluculated => recalculated

v7-0011 fixes this typo.

> 0010-Update-document should be squished with 0003-Add-wal_pmem_map-to-GUC (and
> maybe 0002 and 0001). I believe the patches after 0005 are more WIP, so it's
> fine if they're not squished yet.

As you say, the patch updating the document should be squashed into a related patch, probably "Add wal_pmem_map to GUC". For now I want it to be a separate patch (v7-0014).

> I'm not sure what the point is of this one:
> 0008-Let-wal_pmem_map-be-constant-unl

If USE_LIBPMEM is not defined (that is, no --with-libpmem), wal_pmem_map is always false and is essentially never used. Using #if(n)def everywhere is not good for code readability, so I let wal_pmem_map be a constant. This may help compilers optimize conditional branches. v7-0005 adds the comment above.

> +        ereport(ERROR,
> +                (errcode_for_file_access(),
> +                 errmsg("could not pmem_map_file \"%s\": %m", path)));
>
> => The outer parentheses are not needed since e3a87b4.

v7-0009 fixes this.

> But in this case it really doesn't work :(
>
> running bootstrap script ... 2022-01-05 23:17:30.244 CST [12088] FATAL: file not on PMEM: path "pg_wal/00010001"

Do you have a real PMEM device such as NVDIMM-N or Intel Optane PMem? If so, please use a PMEM mounted with the filesystem DAX option for pg_wal, or the FATAL error will occur. If you don't, you have the two alternatives below. Note that neither of them ensures durability; each of them is just for testing.

1. Emulate PMEM with memmap=nn[KMG]!ss[KMG]. This can be used only on Linux. Please see [1][2] for details; or
2. Set the environment variable PMEM_IS_PMEM_FORCE=1 to tell libpmem to treat any device as if it were PMEM.

Regards,
Takashi

[1] https://www.intel.com/content/www/us/en/developer/articles/training/how-to-emulate-persistent-memory-on-an-intel-architecture-server.html
[2] https://nvdimm.wiki.kernel.org/

--
Takashi Menjo

Attachments (binary data):
- v7-0001-Add-with-libpmem-option-for-PMEM-support.patch
- v7-0002-Support-build-with-MSVC-on-Windows.patch
- v7-0003-Add-wal_pmem_map-to-GUC.patch
- v7-0004-Let-wal_pmem_map-be-constant-unless-with-libpmem.patch
- v7-0005-Comment-for-constant-wal_pmem_map.patch
- v7-0006-Export-InstallXLogFileSegment.patch
- v7-0007-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
- v7-0008-Fix-invalid-declaration-of-PmemXLogEnsurePrevMapp.patch
- v7-0009-Remove-redundant-parentheses-from-ereport-call.patch
- v7-0010-Ensure-WAL-mappings-before-assertion.patch
- v7-0011-Fix-typo-in-comment.patch
- v7-0012-Compatible-to-Windows.patch
- v7-0013-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
- v7-0014-Update-document.patch
- v7-0015-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Re: Map WAL segment files on PMEM as WAL buffers
Rebased.

On Fri, Nov 5, 2021 at 3:47 PM Takashi Menjo wrote:
>
> Hi Daniel,
>
> The issue you reported has been fixed. I attach the v5 patchset to this email.
>
> The v5 has all the patches in the v4, and in addition, has the
> following two new patches:
>
> - (v5-0002) Support build with MSVC on Windows: Please have
> src\tools\msvc\config.pl as follows to "configure --with-libpmem":
>
>     $config->{pmem} = 'C:\path\to\pmdk\x64-windows';
>
> - (v5-0006) Compatible to Windows: This patch resolves conflicting
> mode_t typedefs and libpmem API variants (U or W, like Windows API).
>
> Best regards,
> Takashi
>
> On Thu, Nov 4, 2021 at 5:46 PM Takashi Menjo wrote:
> >
> > Hello Daniel,
> >
> > Thank you for your comment. I had the following error message with
> > MSVC on Windows. It looks the same as what you told me. I'll fix it.
> >
> > | > cd src\tools\msvc
> > | > build
> > | (..snipped..)
> > | Copying pg_config_os.h...
> > | Generating configuration headers...
> > | undefined symbol: HAVE_LIBPMEM at src/include/pg_config.h line 347
> > at C:/Users/menjo/Documents/git/postgres/src/tools/msvc/Mkvcbuild.pm
> > line 860.
> >
> > Best regards,
> > Takashi
> >
> > On Wed, Nov 3, 2021 at 10:04 PM Daniel Gustafsson wrote:
> > >
> > > > On 28 Oct 2021, at 08:09, Takashi Menjo wrote:
> > >
> > > > Rebased, and added the patches below into the patchset.
> > >
> > > Looks like the 0001 patch needs to be updated to support Windows and MSVC. See
> > > src/tools/msvc/Mkvcbuild.pm and Solution.pm et al. for inspiration on how to add
> > > the MSVC equivalent of --with-libpmem. Currently the patch fails in the
> > > "Generating configuration headers" step in Solution.pm.
> > >
> > > --
> > > Daniel Gustafsson   https://vmware.com/
> >
> > --
> > Takashi Menjo
>
> --
> Takashi Menjo

--
Takashi Menjo

Attachments (binary data):
- v6-0001-Add-with-libpmem-option-for-PMEM-support.patch
- v6-0002-Support-build-with-MSVC-on-Windows.patch
- v6-0003-Add-wal_pmem_map-to-GUC.patch
- v6-0004-Export-InstallXLogFileSegment.patch
- v6-0005-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
- v6-0006-Compatible-to-Windows.patch
- v6-0007-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
- v6-0008-Let-wal_pmem_map-be-constant-unless-with-libpmem.patch
- v6-0009-Ensure-WAL-mappings-before-assertion.patch
- v6-0010-Update-document.patch
- v6-0011-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Re: Map WAL segment files on PMEM as WAL buffers
Hi Daniel,

The issue you reported has been fixed. I attach the v5 patchset to this email.

The v5 has all the patches in the v4, and in addition, has the following two new patches:

- (v5-0002) Support build with MSVC on Windows: Please have src\tools\msvc\config.pl as follows to "configure --with-libpmem":

    $config->{pmem} = 'C:\path\to\pmdk\x64-windows';

- (v5-0006) Compatible to Windows: This patch resolves conflicting mode_t typedefs and libpmem API variants (U or W, like Windows API).

Best regards,
Takashi

On Thu, Nov 4, 2021 at 5:46 PM Takashi Menjo wrote:
>
> Hello Daniel,
>
> Thank you for your comment. I had the following error message with
> MSVC on Windows. It looks the same as what you told me. I'll fix it.
>
> | > cd src\tools\msvc
> | > build
> | (..snipped..)
> | Copying pg_config_os.h...
> | Generating configuration headers...
> | undefined symbol: HAVE_LIBPMEM at src/include/pg_config.h line 347
> at C:/Users/menjo/Documents/git/postgres/src/tools/msvc/Mkvcbuild.pm
> line 860.
>
> Best regards,
> Takashi
>
> On Wed, Nov 3, 2021 at 10:04 PM Daniel Gustafsson wrote:
> >
> > > On 28 Oct 2021, at 08:09, Takashi Menjo wrote:
> >
> > > Rebased, and added the patches below into the patchset.
> >
> > Looks like the 0001 patch needs to be updated to support Windows and MSVC. See
> > src/tools/msvc/Mkvcbuild.pm and Solution.pm et al. for inspiration on how to add
> > the MSVC equivalent of --with-libpmem. Currently the patch fails in the
> > "Generating configuration headers" step in Solution.pm.
> >
> > --
> > Daniel Gustafsson   https://vmware.com/
>
> --
> Takashi Menjo

--
Takashi Menjo

Attachments (binary data):
- v5-0001-Add-with-libpmem-option-for-PMEM-support.patch
- v5-0002-Support-build-with-MSVC-on-Windows.patch
- v5-0003-Add-wal_pmem_map-to-GUC.patch
- v5-0004-Export-InstallXLogFileSegment.patch
- v5-0005-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
- v5-0006-Compatible-to-Windows.patch
- v5-0007-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
- v5-0008-Let-wal_pmem_map-be-constant-unless-with-libpmem.patch
- v5-0009-Ensure-WAL-mappings-before-assertion.patch
- v5-0010-Update-document.patch
- v5-0011-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Re: Map WAL segment files on PMEM as WAL buffers
Hello Daniel,

Thank you for your comment. I had the following error message with MSVC on Windows. It looks the same as what you told me. I'll fix it.

| > cd src\tools\msvc
| > build
| (..snipped..)
| Copying pg_config_os.h...
| Generating configuration headers...
| undefined symbol: HAVE_LIBPMEM at src/include/pg_config.h line 347
at C:/Users/menjo/Documents/git/postgres/src/tools/msvc/Mkvcbuild.pm
line 860.

Best regards,
Takashi

On Wed, Nov 3, 2021 at 10:04 PM Daniel Gustafsson wrote:
>
> > On 28 Oct 2021, at 08:09, Takashi Menjo wrote:
>
> > Rebased, and added the patches below into the patchset.
>
> Looks like the 0001 patch needs to be updated to support Windows and MSVC. See
> src/tools/msvc/Mkvcbuild.pm and Solution.pm et al. for inspiration on how to add
> the MSVC equivalent of --with-libpmem. Currently the patch fails in the
> "Generating configuration headers" step in Solution.pm.
>
> --
> Daniel Gustafsson   https://vmware.com/

--
Takashi Menjo
Re: Map WAL segment files on PMEM as WAL buffers
Hi,

Rebased, and added the patches below into the patchset.

- (0006) Let wal_pmem_map be constant unless --with-libpmem: wal_pmem_map never changes from false in that case, so let it be constant. Thanks, Matthias!
- (0007) Ensure WAL mappings before assertion: This fixes a SIGSEGV crash in GetXLogBuffer when built with --enable-cassert.
- (0008) Update document: This adds a new entry for wal_pmem_map in the section Write Ahead Log -> Settings.

Best regards,
Takashi

On Fri, Oct 8, 2021 at 5:07 PM Takashi Menjo wrote:
>
> Hello Matthias,
>
> Thank you for your comment!
>
> > > [ v3-0002-Add-wal_pmem_map-to-GUC.patch ]
> > > +extern bool wal_pmem_map;
> >
> > A lot of the new code in these patches is gated behind this one flag,
> > but the flag should never be true on !pmem systems. Could you instead
> > replace it with something like the following?
> >
> > +#ifdef USE_LIBPMEM
> > +extern bool wal_pmem_map;
> > +#else
> > +#define wal_pmem_map false
> > +#endif
> >
> > A good compiler would then eliminate all the dead code from being
> > generated on non-pmem builds (instead of the compiler needing to keep
> > that code around just in case some extension decides to set
> > wal_pmem_map to true on !pmem systems because it has access to that
> > variable).
>
> That sounds good. I will introduce it in the next update.
>
> > > [ v3-0004-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch ]
> > > +if ((uintptr_t) addr & ~PG_DAX_HUGEPAGE_MASK)
> > > +elog(WARNING,
> > > + "file not mapped on DAX hugepage boundary: path \"%s\" addr %p",
> > > + path, addr);
> >
> > I'm not sure that we should want to log this every time we detect the
> > issue; It's likely that once it happens it will happen for the next
> > file as well. Maybe add a timeout, or do we generally not deduplicate
> > such messages?
>
> Let me give it some thought. I have believed this WARNING is most
> unlikely to happen, and is independent of other events.
> I will try to find a case where the WARNING happens repeatedly; or I
> will de-duplicate the messages if it is easier.
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo

--
Takashi Menjo

Attachments (binary data):
- v4-0001-Add-with-libpmem-option-for-PMEM-support.patch
- v4-0002-Add-wal_pmem_map-to-GUC.patch
- v4-0003-Export-InstallXLogFileSegment.patch
- v4-0004-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
- v4-0005-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
- v4-0006-Let-wal_pmem_map-be-constant-unless-with-libpmem.patch
- v4-0007-Ensure-WAL-mappings-before-assertion.patch
- v4-0008-Update-document.patch
- v4-0009-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Re: Map WAL segment files on PMEM as WAL buffers
Hello Matthias,

Thank you for your comment!

> > [ v3-0002-Add-wal_pmem_map-to-GUC.patch ]
> > +extern bool wal_pmem_map;
>
> A lot of the new code in these patches is gated behind this one flag,
> but the flag should never be true on !pmem systems. Could you instead
> replace it with something like the following?
>
> +#ifdef USE_LIBPMEM
> +extern bool wal_pmem_map;
> +#else
> +#define wal_pmem_map false
> +#endif
>
> A good compiler would then eliminate all the dead code from being
> generated on non-pmem builds (instead of the compiler needing to keep
> that code around just in case some extension decides to set
> wal_pmem_map to true on !pmem systems because it has access to that
> variable).

That sounds good. I will introduce it in the next update.

> > [ v3-0004-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch ]
> > +if ((uintptr_t) addr & ~PG_DAX_HUGEPAGE_MASK)
> > +elog(WARNING,
> > + "file not mapped on DAX hugepage boundary: path \"%s\" addr %p",
> > + path, addr);
>
> I'm not sure that we should want to log this every time we detect the
> issue; It's likely that once it happens it will happen for the next
> file as well. Maybe add a timeout, or do we generally not deduplicate
> such messages?

Let me give it some thought. I have believed this WARNING is most unlikely to happen, and is independent of other events. I will try to find a case where the WARNING happens repeatedly; or I will de-duplicate the messages if it is easier.

Best regards,
Takashi

--
Takashi Menjo
Re: Map WAL segment files on PMEM as WAL buffers
Rebased.

--
Takashi Menjo

Attachments (binary data):
- v3-0001-Add-with-libpmem-option-for-PMEM-support.patch
- v3-0002-Add-wal_pmem_map-to-GUC.patch
- v3-0003-Export-InstallXLogFileSegment.patch
- v3-0004-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
- v3-0005-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
- v3-0006-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Re: [PoC] Non-volatile WAL buffer
Hi Tomas,

> Hello Takashi-san,
>
> On 3/5/21 9:08 AM, Takashi Menjo wrote:
> > Hi Tomas,
> >
> > Thank you so much for your report. I have read it with great interest.
> >
> > Your conclusion sounds reasonable to me. My patchset you call "NTT /
> > segments" got as good performance as the "NTT / buffer" patchset. I have
> > been worried that calling mmap/munmap for each WAL segment file could
> > have a lot of overhead. Based on your performance tests, however, the
> > overhead looks less than what I thought. In addition, the "NTT / segments"
> > patchset is more compatible with the current PG and more friendly to
> > DBAs because that patchset uses WAL segment files and does not
> > introduce any other new WAL-related file.
>
> I agree. I was actually a bit surprised it performs this well, mostly in
> line with the "NTT / buffer" patchset. I've seen significant issues with
> our simple experimental patches, which however went away with larger WAL
> segments. But the "NTT / segments" patch does not have that issue, so
> either our patches were doing something wrong, or perhaps there was some
> other issue (not sure why larger WAL segments would improve that).
>
> Do these results match your benchmarks? Or are you seeing significantly
> different behavior?

I made a performance test for "NTT / segments" and added its results to my previous report [1], under the same conditions. The updated graph is attached to this mail. Note that some legends are renamed: "Mapped WAL file" to "NTT / simple", and "Non-volatile WAL buffer" to "NTT / buffer."

The graph tells me that "NTT / segments" performs as well as "NTT / buffer." This matches the results you reported.

> Do you have any thoughts regarding the impact of full-page writes? So
> far all the benchmarks we did focused on small OLTP transactions on data
> sets that fit into RAM. The assumption was that that's the workload that
> would benefit from this, but maybe that's missing something important
> about workloads producing much larger WAL records? Say, workloads
> working with large BLOBs, bulk loads etc.

I'd say that more work is needed for workloads producing a large amount of WAL (in the number of records or the size per record, or both). Based on the case Gang reported and I have tried to reproduce in this thread [2][3], the current inserting and flushing method can be unsuitable for such workloads. The case was for "NTT / buffer," but I think it can also apply to "NTT / segments."

> The other question is whether simply placing WAL on DAX (without any
> code changes) is safe. If it's not, then all the "speedups" are computed
> with respect to an unsafe configuration and so are useless. And BTT should
> be used instead, which would of course produce very different results.

I think it's safe, thanks to the checksum in the header of each WAL record (xl_crc in struct XLogRecord). In DAX mode, user data (a WAL record here) is written to the PMEM device in a smaller unit (probably a byte or a cache line) than the traditional 512-byte disk sector. So a torn write such that "some bytes in a sector persist, other bytes not" can occur on crash. AFAICS, however, the checksum for WAL records also covers such a torn-write case.

> > I also think that supporting both file I/O and mmap is better than
> > supporting only mmap. I will continue my work on the "NTT / segments"
> > patchset to support both ways.
>
> +1

> > In the following, I will answer the "Issues & Questions" you reported.
>
> >> While testing the "NTT / segments" patch, I repeatedly managed to crash
> >> the cluster with errors like this:
> >>
> >> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating logfile segment just before mapping; path "pg_wal/00010007002F"
> >> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating logfile segment just before mapping; path "pg_wal/000100070030"
> >> ...
> >> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating logfile segment just before mapping; path "pg_wal/000100070030"
> >> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not open file "pg_wal/000100070030": No such file or directory
> >>
> >> I do believe this is a thinko in the 0008 patch, which does XLogFileInit
> >> in XLogFileMap. Not
Re: [PoC] Non-volatile WAL buffer
37145] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000100070030"
> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not
> open file "pg_wal/000100070030": No such file or directory
>
> I do believe this is a thinko in the 0008 patch, which does XLogFileInit
> in XLogFileMap. Notice there are multiple "creating logfile" messages
> with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap
> may be called from multiple backends, so they may call XLogFileInit
> concurrently, likely triggering some sort of race condition. It's a fairly
> rare issue, though - I've only seen it twice from ~20 runs.
>
> The other question I have is about WALInsertLockUpdateInsertingAt. 0003
> removes this function, but leaves behind some of the other bits working
> with insert locks and insertingAt. But it does not explain how it works
> without WaitXLogInsertionsToFinish() - how does it ensure that when we
> commit something, all the preceding WAL is "complete" (i.e. written by
> other backends etc.)?
>
> Conclusion
> ----------
>
> I do think the "NTT / segments" patch is the most promising way forward.
> It performs about as well as the "NTT / buffer" patch (and both
> perform much better than the experimental patches I shared in January).
>
> The "NTT / buffer" patch seems much more disruptive - it introduces one
> large buffer for WAL, which makes various other tasks more complicated
> (i.e. it needs additional complexity to handle WAL archival, etc.). Are
> there some advantages of this patch (compared to the other patch)?
>
> As for the "NTT / segments" patch, I wonder if we can just rework the
> code like this (to use mmap etc.) or whether we need to support both
> ways (file I/O and mmap). I don't have much experience with many
> other platforms, but it seems quite possible that mmap won't work all
> that well on some of them. So my assumption is we'll need to support
> both file I/O and mmap to make any of this committable, but I may be wrong.
>
> [1] https://www.postgresql.org/message-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg%40mail.gmail.com
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Takashi Menjo
Re: [PoC] Non-volatile WAL buffer
Hi Sawada,

I am relieved to hear that the performance problem was solved. And I added a tip about PMEM namespace and partitioning to the PG wiki [1].

Regards,

[1] https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL#Configure_and_verify_DAX_hugepage_faults

--
Takashi Menjo
Re: [PoC] Non-volatile WAL buffer
Hi,

I had a performance test in another environment. The steps, setup, and postgresql.conf of the test are the same as the ones I sent on Feb 17 [1], except for the following items:

# Setup
- Distro: Red Hat Enterprise Linux release 8.2 (Ootpa)
- C compiler: gcc-8.3.1-5.el8.x86_64
- libc: glibc-2.28-101.el8.x86_64
- Linux kernel: kernel-4.18.0-193.el8.x86_64
- PMDK: libpmem-1.6.1-1.el8.x86_64, libpmem-devel-1.6.1-1.el8.x86_64

See the attached figure for the results. In short, the v5 non-volatile WAL buffer got better performance than the original (non-patched) one.

Regards,

[1] https://www.postgresql.org/message-id/caownp3ofofosftmeikqcbmp0ywdjn0kvb4ka_0tj+urq7dt...@mail.gmail.com

--
Takashi Menjo
Re: [PoC] Non-volatile WAL buffer
Hi Sawada, Thank you for your performance report. First, I'd say that the latest v5 non-volatile WAL buffer patchset looks not bad itself. I made a performance test for the v5 and got better performance than the original (non-patched) one and our previous work. See the attached figure for results. I think steps and/or setups of Tomas's, yours, and mine could be different, leading to the different performance results. So I show my steps and setups for my performance test. Please see the tail of this mail for them. Also, I write performance tips to the PMEM page at PostgreSQL wiki [1]. I wish it could be helpful to improve performance. Regards, Takashi [1] https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL#Performance_tips # Environment variables export PGHOST=/tmp export PGPORT=5432 export PGDATABASE="$USER" export PGUSER="$USER" export PGDATA=/mnt/nvme0n1/pgdata # Steps Note that I ran postgres server and pgbench in a single-machine system but separated two NUMA nodes. PMEM and PCI SSD for the server process are on the server-side NUMA node. 01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0) 02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0) 03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1) 04) Make /mnt/pmem0/pg_wal directory for WAL 05) Make /mnt/nvme0n1/pgdata directory for PGDATA 06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...) 
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of "Non-volatile WAL buffer"
07) Edit postgresql.conf as the attached one
08) Start the postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
12) Remount the PMEM and the PCIe SSD
13) Start the postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
14) Run pg_prewarm for all four pgbench_* tables
15) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r -M prepared -T 1800 -c __ -j __)
- It executes the default tpcb-like transactions
I repeated all the steps three times for each (c,j), then took the median "tps = __ (including connections establishing)" of the three as throughput and the "latency average = __ ms" of that run as average latency.

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 sockets (256 GiB per channel x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7.0 (built by myself)
- Filesystem: ext4 (DAX enabled when using Optane PMem)
- PMDK: 1.9 (built by myself)
- PostgreSQL (Original): 9e7dbe3369cd8f5b0136c53b817471002505f934 (Jan 18, 2021 @ master)
- PostgreSQL (Mapped WAL file): Original + v5 of "Applying PMDK to WAL operations for persistent memory" [2]
- PostgreSQL (Non-volatile WAL buffer): Original + v5 of "Non-volatile WAL buffer" [3]; please read the files' prefix "v4-" as "v5-"

[2]
https://www.postgresql.org/message-id/CAOwnP3O3O1GbHpddUAzT%3DCP3aMpX99%3D1WtBAfsRZYe2Ui53MFQ%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg%40mail.gmail.com -- Takashi Menjo postgresql.conf Description: Binary data
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
Rebased to make patchset v5. I also found that my past replies have separated the thread in the pgsql-hackers archive. I try to connect this mail to the original thread [1], and let this point to the separated portions [2][3][4]. Note that the patchset v3 is in [3] and v4 is in [4]. Regards, [1] https://www.postgresql.org/message-id/flat/C20D38E97BCB33DAD59E3A1%40lab.ntt.co.jp [2] https://www.postgresql.org/message-id/flat/000501d4b794%245094d140%24f1be73c0%24%40lab.ntt.co.jp [3] https://www.postgresql.org/message-id/flat/01d4b863%244c9e8fc0%24e5dbaf40%24%40lab.ntt.co.jp [4] https://www.postgresql.org/message-id/flat/01d4c2a1%2488c6cc40%249a5464c0%24%40lab.ntt.co.jp -- Takashi Menjo v5-0001-Add-configure-option-for-PMDK.patch Description: Binary data v5-0003-Walreceiver-WAL-IO-using-PMDK.patch Description: Binary data v5-0002-Read-write-WAL-files-using-PMDK.patch Description: Binary data
Re: [PoC] Non-volatile WAL buffer
Hi Takayuki, Thank you for your helpful comments.

> In "Allocates WAL buffers on shared buffers", "shared buffers" should be DRAM because shared buffers in Postgres means the buffer cache for database data.

That's true. Fixed.

> I haven't tracked the whole thread, but could you collect information like the following? I think such (partly basic) information will be helpful to decide whether it's worth casting more efforts into complex code, or it's enough to place WAL on DAX-aware filesystems and tune the filesystem.
>
> * What approaches other DBMSs take, and their performance gains (Oracle, SQL Server, HANA, Cassandra, etc.)
> The same DBMS should take different approaches depending on the file type: Oracle recommends different things to data files and REDO logs.

I also think it will be helpful. I am adding an "Other DBMSes using PMEM" section.

> * The storage capabilities of PMEM compared to the fast(est) alternatives such as NVMe SSD (read/write IOPS, latency, throughput, concurrency, which may be posted on websites like Tom's Hardware or SNIA)

This will be helpful, too. I am adding a "Basic performance" subsection under "Overview of persistent memory (PMEM)."

> * What's the situation like on Windows?

Sorry, but I don't know much about Windows' PMEM support. All I know is that Windows Server 2016 and 2019 support PMEM (2016 partially) [1] and that PMDK supports Windows [2].

All the above contents will be updated gradually. Please stay tuned.

Regards,

[1] https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/deploy-pmem
[2] https://docs.pmem.io/persistent-memory/getting-started-guide/installing-pmdk/installing-pmdk-on-windows

-- Takashi Menjo
Re: [PoC] Non-volatile WAL buffer
Hi, I made a new page at PostgreSQL Wiki to gather and summarize information and discussion about PMEM-backed WAL designs and implementations. Some parts of the page are TBD. I will continue to maintain the page. Requests are welcome. Persistent Memory for WAL https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL Regards, -- Takashi Menjo
Re: [PoC] Non-volatile WAL buffer
Hi Tomas, I'll answer your questions. (Not all for now, sorry.)

> Do I understand correctly that the patch removes "regular" WAL buffers and instead writes the data into the non-volatile PMEM buffer, without writing that to the WAL segments at all (unless in archiving mode)?
> Firstly, I guess many (most?) instances will have to write the WAL segments anyway because of PITR/backups, so I'm not sure we can save much here.

Mostly yes. My "non-volatile WAL buffer" patchset removes the regular volatile WAL buffers and brings in non-volatile ones. All the WAL gets into the non-volatile buffers and persists there. No write-out from the buffers to WAL segment files is required. However, in archiving mode or when the buffers are full (described later), both the non-volatile buffers and the segment files are used. In archiving mode with my patchset, each time one segment (16MB by default) is completed in the non-volatile buffers, that segment is written to a segment file asynchronously (by XLogBackgroundFlush). Then it will be archived by the existing archiving functionality.

> But more importantly - doesn't that mean the nvwal_size value is essentially a hard limit? With max_wal_size, it's a soft limit i.e. we're allowed to temporarily use more WAL when needed. But with a pre-allocated file, that's clearly not possible. So what would happen in those cases?

Yes, nvwal_size is a hard limit, and I see it's a major weak point of my patchset. When all non-volatile WAL buffers are filled, the oldest segment in the buffers is written (by XLogWrite) to a regular WAL segment file, and then those buffers are cleared (by AdvanceXLInsertBuffer) for new records. All WAL record insertions into the buffers block until that write and clear are complete. Due to that, all write transactions also block.
To make matters worse, if a checkpoint eventually occurs in such a buffer-full situation, record insertions would block for a certain time at the end of the checkpoint, because a large number of the non-volatile buffers will be cleared (see PreallocNonVolatileXlogBuffer). From a client's view, it would look as if the postgres server freezes for a while. Proper checkpointing would prevent such cases, but it could be hard to control. When I reproduced Gang's case reported in this thread, such a buffer-full freeze occurred.

> Also, is it possible to change nvwal_size? I haven't tried, but I wonder what happens with the current contents of the file.

The value of nvwal_size should be equal to the actual size of the nvwal_path file when postgres starts up. If they are not equal, postgres will panic at MapNonVolatileXLogBuffer (see nv_xlog_buffer.c), and the WAL contents of the file will remain as they were. So, if an admin accidentally changes the nvwal_size value, they just cannot get postgres up. The file size may be extended or shrunk offline by the truncate(1) command, but the WAL contents of the file should also be moved to the proper offset, because the insertion/recovery offset is calculated by modulo, that is, the record's LSN % nvwal_size; otherwise we lose WAL. An offline tool for such an operation might be required, but does not exist yet.

> The way I understand the current design is that we're essentially switching from this architecture:
>
> clients -> wal buffers (DRAM) -> wal segments (storage)
>
> to this
>
> clients -> wal buffers (PMEM)
>
> (Assuming there we don't have to write segments because of archiving.)

Yes.
Let me describe how the current PostgreSQL design works and how the patchsets and works discussed in this thread change it, AFAIU:

- Current PostgreSQL:
  clients -[memcpy]-> buffers (DRAM) -[write]-> segments (disk)
- Patch "pmem-with-wal-buffers-master.patch" Tomas posted:
  clients -[memcpy]-> buffers (DRAM) -[pmem_memcpy]-> mmap-ed segments (PMEM)
- My "non-volatile WAL buffer" patchset:
  clients -[pmem_memcpy(*)]-> buffers (PMEM)
- My other patchset, mmap-ing segments as buffers:
  clients -[pmem_memcpy(*)]-> mmap-ed segments as buffers (PMEM)
- "Non-volatile Memory Logging" at PGCon 2016 [1][2][3]:
  clients -[memcpy]-> buffers (WC[4] DRAM as pseudo PMEM) -[async write]-> segments (disk)

(* or memcpy + pmem_flush)

And I'd say that our previous work "Introducing PMDK into PostgreSQL," presented at PGCon 2018 [5], and its patchset ([6] is the latest) are based on the same idea as Tomas's patch above. That's all for this mail. Please be patient for the next mail.

Best regards,
Takashi

[1] https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[2] https://github.com/meistervonperf/postgresql-NVM-logging
[3] https://github.com/meistervonperf/pseudo-pram
[4] https://www.kernel.org/doc/html/latest/x86/pat.html
[5] https://pgcon.org/2018/schedule/events/1154.en.html
[6] https://www.postgresql.org/message-id/CAOwnP3ONd9uXPXKoc5AAfnpCnCyOna1ru6sU=ey_4wfmjak...@mail.gmail.com

-- Takashi Menjo
Re: [PoC] Non-volatile WAL buffer
Hi, Now I have caught up with this thread. I see that many of you are interested in performance profiling. I share my slides in SNIA SDC 2020 [1]. In the slides, I had profiles focused on XLogInsert and XLogFlush (mainly the latter) for my non-volatile WAL buffer patchset. I found that the time for XLogWrite and locking/unlocking WALWriteLock were eliminated by the patchset. Instead, XLogInsert and WaitXLogInsertionsToFinish took more (or a little more) time than ever because memcpy-ing to PMEM (Optane PMem) is slower than to DRAM. For details, please see the slides. Best regards, Takashi [1] https://www.snia.org/educational-library/how-can-persistent-memory-make-databases-faster-and-how-could-we-go-ahead-2020 2021年1月26日(火) 18:50 Takashi Menjo : > Dear everyone, Tomas, > > First of all, the "v4" patchset for non-volatile WAL buffer attached to > the previous mail is actually v5... Please read "v4" as "v5." > > Then, to Tomas: > Thank you for your crash report you gave on Nov 27, 2020, regarding msync > patchset. I applied the latest msync patchset v3 attached to the previous > to master 411ae64 (on Jan18, 2021) then tested it, and I got no error when > pgbench -i -s 500. Please try it if necessary. > > Best regards, > Takashi > > > 2021年1月26日(火) 17:52 Takashi Menjo : > >> Dear everyone, >> >> Sorry but I forgot to attach my patchsets... Please see the files >> attached to this mail. Please also note that they contain some fixes. >> >> Best regards, >> Takashi >> >> >> 2021年1月26日(火) 17:46 Takashi Menjo : >> >>> Dear everyone, >>> >>> I'm sorry for the late reply. I rebase my two patchsets onto the latest >>> master 411ae64.The one patchset prefixed with v4 is for non-volatile WAL >>> buffer; the other prefixed with v3 is for msync. >>> >>> I will reply to your thankful feedbacks one by one within days. Please >>> wait for a moment. 
>>> >>> Best regards, >>> Takashi >>> >>> 01/25/2021(Mon) 11:56 Masahiko Sawada : >>>> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra wrote: >>>> > On 1/21/21 3:17 AM, Masahiko Sawada wrote: >>>> > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra wrote: >>>> > >> Hi, I think I've managed to get the 0002 patch [1] rebased to master and working (with help from Masahiko Sawada). It's not clear to me how it could have worked as submitted - my theory is that an incomplete patch was submitted by mistake, or something like that. Unfortunately, the benchmark results were kinda disappointing. For a pgbench on scale 500 (fits into shared buffers), an average of three 5-minute runs looks like this:
>>>> > >> branch                1      16      32      64      96
>>>> > >> master             7291   87704  165310  150437  224186
>>>> > >> ntt                7912  106095  213206  212410  237819
>>>> > >> simple-no-buffers  7654   96544  115416   95828  103065
>>>> > >> NTT refers to the patch from September 10, pre-allocating a large WAL file on PMEM, and simple-no-buffers is the simpler patch simply removing the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM. Note: The patch is just replacing the old implementation with mmap. That's good enough for experiments like this, but we probably want to keep the old one for setups without PMEM. But it's good enough for testing, benchmarking etc. Unfortunately, the results for this simple approach are pretty bad. Not only compared to the "ntt" patch, but even to master. I'm not entirely sure what's the root cause, but I have a couple hypotheses: 1) bug in the patch - That's clearly a possibility, although I've tried to eliminate this possibility. 2) PMEM is slower than DRAM - From what I know, PMEM is much faster than NVMe storage, but still much slower than DRAM (both in terms of latency and bandwidth, see [2] for some data). It's not terrible, but the latency is maybe 2-3x higher - not a huge difference, but may matter for WAL buffers? 3) PMEM does not handle parallel writes well - If you look at [2], Figure 4(b), you'll see that the throughput actually *drops* as the number of threads increase. That's pretty strange / annoying, because
Re: [PoC] Non-volatile WAL buffer
Dear everyone, Tomas, First of all, the "v4" patchset for non-volatile WAL buffer attached to the previous mail is actually v5... Please read "v4" as "v5." Then, to Tomas: Thank you for your crash report you gave on Nov 27, 2020, regarding msync patchset. I applied the latest msync patchset v3 attached to the previous to master 411ae64 (on Jan18, 2021) then tested it, and I got no error when pgbench -i -s 500. Please try it if necessary. Best regards, Takashi 2021年1月26日(火) 17:52 Takashi Menjo : > Dear everyone, > > Sorry but I forgot to attach my patchsets... Please see the files attached > to this mail. Please also note that they contain some fixes. > > Best regards, > Takashi > > > 2021年1月26日(火) 17:46 Takashi Menjo : > >> Dear everyone, >> >> I'm sorry for the late reply. I rebase my two patchsets onto the latest >> master 411ae64.The one patchset prefixed with v4 is for non-volatile WAL >> buffer; the other prefixed with v3 is for msync. >> >> I will reply to your thankful feedbacks one by one within days. Please >> wait for a moment. >> >> Best regards, >> Takashi >> >> >> 01/25/2021(Mon) 11:56 Masahiko Sawada : >> >>> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra >>> wrote: >>> > >>> > >>> > >>> > On 1/21/21 3:17 AM, Masahiko Sawada wrote: >>> > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra >>> > > wrote: >>> > >> >>> > >> Hi, >>> > >> >>> > >> I think I've managed to get the 0002 patch [1] rebased to master and >>> > >> working (with help from Masahiko Sawada). It's not clear to me how >>> it >>> > >> could have worked as submitted - my theory is that an incomplete >>> patch >>> > >> was submitted by mistake, or something like that. >>> > >> >>> > >> Unfortunately, the benchmark results were kinda disappointing. 
For a pgbench on scale 500 (fits into shared buffers), an average of three 5-minute runs looks like this:
>>> > >> branch                1      16      32      64      96
>>> > >> master             7291   87704  165310  150437  224186
>>> > >> ntt                7912  106095  213206  212410  237819
>>> > >> simple-no-buffers  7654   96544  115416   95828  103065
>>> > >> NTT refers to the patch from September 10, pre-allocating a large WAL file on PMEM, and simple-no-buffers is the simpler patch simply removing the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM. Note: The patch is just replacing the old implementation with mmap. That's good enough for experiments like this, but we probably want to keep the old one for setups without PMEM. But it's good enough for testing, benchmarking etc. Unfortunately, the results for this simple approach are pretty bad. Not only compared to the "ntt" patch, but even to master. I'm not entirely sure what's the root cause, but I have a couple hypotheses: 1) bug in the patch - That's clearly a possibility, although I've tried to eliminate this possibility. 2) PMEM is slower than DRAM - From what I know, PMEM is much faster than NVMe storage, but still much slower than DRAM (both in terms of latency and bandwidth, see [2] for some data). It's not terrible, but the latency is maybe 2-3x higher - not a huge difference, but may matter for WAL buffers? 3) PMEM does not handle parallel writes well - If you look at [2], Figure 4(b), you'll see that the throughput actually *drops* as the number of threads increase. That's pretty strange / annoying, because
Re: [PoC] Non-volatile WAL buffer
master             6635   88524  171106  163387  245307
> > >> ntt                7909  106826  217364  223338  242042
> > >> simple-no-buffers  7871  101575  199403  188074  224716
> > >> with-wal-buffers   7643  101056  206911  223860  261712
> > >> So yeah, there's a clear difference. It changes the values for "master" a bit, but both the "simple" patches (with and without WAL buffers) are much faster. The with-wal-buffers is almost equal to the NTT patch, which was using 96GB file. I presume larger WAL segments would get even closer, if we supported them. I'll continue investigating this, but my conclusion so far seem to be that we can't really replace WAL buffers with PMEM - that seems to perform much worse. The question is what to do about the segment size. Can we reduce the overhead of mmap-ing individual segments, so that this works even for smaller WAL segments, to make this useful for common instances (not everyone wants to run with 1GB WAL). Or whether we need to adopt the design with a large file, mapped just once. Another question is whether it's even worth the extra complexity. On 16MB segments the difference between master and NTT patch seems to be non-trivial, but increasing the WAL segment size kinda reduces that. So maybe just using File I/O on PMEM DAX filesystem seems good enough. Alternatively, maybe we could switch to libpmemblk, which should eliminate the filesystem overhead at least.
> > > I think the performance improvement by NTT patch with the 16MB WAL segment, the most common WAL segment size, is very good (150437 vs. 212410 with 64 clients). But maybe evaluating writing WAL segment files on PMEM DAX filesystem is also worth, as you mentioned, if we don't do that yet.
> > Well, not sure.
I think the question is still open whether it's actually > > safe to run on DAX, which does not have atomic writes of 512B sectors, > > and I think we rely on that e.g. for pg_config. But maybe for WAL that's > > not an issue. > > I think we can use the Block Translation Table (BTT) driver that > provides atomic sector updates. > > > > > > Also, I'm interested in why the through-put of NTT patch saturated at > > > 32 clients, which is earlier than the master's one (96 clients). How > > > many CPU cores are there on the machine you used? > > > > > > > From what I know, this is somewhat expected for PMEM devices, for a > > bunch of reasons: > > > > 1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so > > it takes fewer processes to saturate it. > > > > 2) Internally, the PMEM has a 256B buffer for writes, used for combining > > etc. With too many processes sending writes, it becomes to look more > > random, which is harmful for throughput. > > > > When combined, this means the performance starts dropping at certain > > number of threads, and the optimal number of threads is rather low > > (something like 5-10). This is very different behavior compared to DRAM. > > Makes sense. > > > > > There's a nice overview and measurements in this paper: > > > > Building blocks for persistent memory / How to get the most out of your > > new memory? > > Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons > > Kemper > > > > https://link.springer.com/article/10.1007/s00778-020-00622-9 > > Thank you. I'll read it. > > > > > > > >> I'm also wondering if WAL is the right usage for PMEM. Per [2] > there's a > > >> huge read-write assymmetry (the writes being way slower), and their > > >> recommendation (in "Observation 3" is) > > >> > > >> The read-write asymmetry of PMem im-plies the necessity of > avoiding > > >> writes as much as possible for PMem. 
> > >> > > >> So maybe we should not be trying to use PMEM for WAL, which is pretty > > >> write-heavy (and in most cases even write-only). > > > > > > I think using PMEM for WAL is cost-effective but it leverages the only > > > low-latency (sequential) write, but not other abilities such as > > > fine-grained access and low-latency random write. If we want to > > > exploit its all ability we might need some drastic changes to logging > > > protocol while considering storing data on PMEM. > > > > > > > True. I think investigating whether it's sensible to use PMEM for this > > purpose. It may turn out that replacing the DRAM WAL buffers with writes > > directly to PMEM is not economical, and aggregating data in a DRAM > > buffer is better :-( > > Yes. I think it might be interesting to do an analysis of the > bottlenecks of NTT patch by perf etc. If bottlenecks are moved to > other places by removing WALWriteLock during flush, it's probably a > good sign for further performance improvements. IIRC WALWriteLock is > one of the main bottlenecks on OLTP workload, although my memory might > already be out of date. > > Regards, > > -- > Masahiko Sawada > EDB: https://www.enterprisedb.com/ > -- Takashi Menjo
Re: [PoC] Non-volatile WAL buffer
Hi Gang, I appreciate your patience. I reproduced the results you reported to me, in my environment. First of all, the conditions you gave me were a little unstable in my environment, so I made the values of {max_,min_,nv}wal_size larger and the pre-warm duration longer to get stable performance. I didn't modify your table, query, or benchmark duration. Under the stable conditions, Original (PMEM) still got better performance than Non-volatile WAL Buffer. To sum up, the reason was that Non-volatile WAL Buffer on Optane PMem spent much more time than Original (PMEM) in XLogInsert when using your table and query. That offset the improvement in XLogFlush and degraded performance in total. VTune told me that Non-volatile WAL Buffer took more CPU time than Original (PMEM) in (XLogInsert => XLogInsertRecord => CopyXLogRecordsToWAL =>) memcpy, while it took less time in XLogFlush. This profile was very similar to the one you reported. In general, when WAL buffers are on Optane PMem rather than DRAM, it obviously takes more time to memcpy WAL records into the buffers because Optane PMem is a little slower than DRAM. In return, Non-volatile WAL Buffer reduces the time needed to make the records durable, because it doesn't need to write them out of the buffers to somewhere else; it just needs to flush them out of the CPU caches to the underlying memory-mapped file. Your report shows that Non-volatile WAL Buffer on Optane PMem is not good for certain kinds of transactions, and is good for others. I have tried changing how WAL records are inserted and flushed, and the configurations or constants that could affect performance, such as NUM_XLOGINSERT_LOCKS, but Non-volatile WAL Buffer has not yet achieved better performance than Original (PMEM) when using your table and query. I will continue to work on this issue and will report when I have any update.
By the way, did the performance progress reported by pgbench with the -P option drop to zero when you ran Non-volatile WAL Buffer? If so, your {max_,min_,nv}wal_size might be too small, or your checkpoint configuration might not be appropriate. Could you check your results again? Best regards, Takashi -- Takashi Menjo
Re: [PoC] Non-volatile WAL buffer
Hi Heikki,

> I had a new look at this thread today, trying to figure out where we are. I'm a bit confused.
>
> One thing we have established: mmap()ing WAL files performs worse than the current method, if pg_wal is not on a persistent memory device. This is because the kernel faults in existing content of each page, even though we're overwriting everything.

Yes. In addition, after a certain page (in the sense of an OS page) is msync()ed, another page fault will occur when something is next stored into that page.

> That's unfortunate. I was hoping that mmap() would be a good option even without persistent memory hardware. I wish we could tell the kernel to zero the pages instead of reading them from the file. Maybe clear the file with ftruncate() before mmapping it?

The area extended by ftruncate() appears as if it were zero-filled [1]. Please note that it merely "appears as if." It might not actually be zero-filled as data blocks on the device, so pre-allocating files should improve transaction performance. At least, on Linux 5.7 and ext4, it takes more time to store into a mapped file that was just open(O_CREAT)ed and ftruncate()d than into one whose blocks have actually been filled beforehand.

> That should not be problem with a real persistent memory device, however (or when emulating it with DRAM). With DAX, the storage is memory-mapped directly and there is no page cache, and no pre-faulting.

Yes, with filesystem DAX, there is no page cache for file data. A page fault still occurs, but per 2MiB DAX hugepage, so its overhead is lower than that of 4KiB page faults. Such a DAX hugepage fault applies only to DAX-mapped files and is different from a general transparent hugepage fault.

> Because of that, I'm baffled by what the v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it correctly, it puts the WAL buffers in a separate file, which is stored on the NVRAM. Why?
> I realize that this is just a Proof of Concept, but I'm very much not interested in anything that requires the DBA to manage a second WAL location. Did you test the mmap() patches with persistent memory hardware? Did you compare that with the pmem patchset, on the same hardware? If there's a meaningful performance difference between the two, what's causing it?

Yes, this patchset puts the WAL buffers into the file specified by "nvwal_path" in postgresql.conf. It puts the buffers into that separate file, rather than into the existing segment files in PGDATA/pg_wal, because doing so reduces the overhead of system calls such as open(), mmap(), munmap(), and close(). It open()s and mmap()s the "nvwal_path" file once and keeps that file mapped while running. On the other hand, with the patchset that mmap()s the segment files, a backend process has to munmap() and close() the currently mapped file and open() and mmap() the next one each time its insert location crosses a segment boundary. This causes the performance difference between the two.

Best regards,
Takashi

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

-- Takashi Menjo
RE: [PoC] Non-volatile WAL buffer
Hi Gang, Thanks. I have tried to reproduce performance degrade, using your configuration, query, and steps. And today, I got some results that Original (PMEM) achieved better performance than Non-volatile WAL buffer on my Ubuntu environment. Now I work for further investigation. Best regards, Takashi -- Takashi Menjo NTT Software Innovation Center > -Original Message- > From: Deng, Gang > Sent: Friday, October 9, 2020 3:10 PM > To: Takashi Menjo > Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' > Subject: RE: [PoC] Non-volatile WAL buffer > > Hi Takashi, > > There are some differences between our HW/SW configuration and test steps. I > attached postgresql.conf I used > for your reference. I would like to try postgresql.conf and steps you > provided in the later days to see if I can find > cause. > > I also ran pgbench and postgres server on the same server but on different > NUMA node, and ensure server process > and PMEM on the same NUMA node. I used similar steps are yours from step 1 to > 9. But some difference in later > steps, major of them are: > > In step 10), I created a database and table for test by: > #create database: > psql -c "create database insert_bench;" > #create table: > psql -d insert_bench -c "create table test(crt_time timestamp, info text > default > '75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc > 48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1 > d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d7 > 9a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');" > > in step 15), I did not use pg_prewarm, but just ran pg_bench for 180 seconds > to warm up. > In step 16), I ran pgbench using command: pgbench -M prepared -n -r -P 10 -f > ./test.sql -T 600 -c _ -j _ > insert_bench. 
(test.sql can be found in attachment) > > For HW/SW conf, the major differences are: > CPU: I used Xeon 8268 (24c@2.9Ghz, HT enabled) OS Distro: CentOS 8.2.2004 > Kernel: 4.18.0-193.6.3.el8_2.x86_64 > GCC: 8.3.1 > > Best regards > Gang > > -Original Message- > From: Takashi Menjo > Sent: Tuesday, October 6, 2020 4:49 PM > To: Deng, Gang > Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' > Subject: RE: [PoC] Non-volatile WAL buffer > > Hi Gang, > > I have tried to but yet cannot reproduce performance degrade you reported > when inserting 328-byte records. So > I think the condition of you and me would be different, such as steps to > reproduce, postgresql.conf, installation > setup, and so on. > > My results and condition are as follows. May I have your condition in more > detail? Note that I refer to your "Storage > over App Direct" as my "Original (PMEM)" and "NVWAL patch" to "Non-volatile > WAL buffer." > > Best regards, > Takashi > > > # Results > See the attached figure. In short, Non-volatile WAL buffer got better > performance than Original (PMEM). > > # Steps > Note that I ran postgres server and pgbench in a single-machine system but > separated two NUMA nodes. PMEM > and PCI SSD for the server process are on the server-side NUMA node. > > 01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax > -M dev -e namespace0.0) > 02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo > mkfs.ext4 -q -F /dev/pmem0 ; sudo > mount -o dax /dev/pmem0 /mnt/pmem0) > 03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 > -q -F /dev/nvme0n1 ; sudo mount > /dev/nvme0n1 /mnt/nvme0n1) > 04) Make /mnt/pmem0/pg_wal directory for WAL > 05) Make /mnt/nvme0n1/pgdata directory for PGDATA > 06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...) 
> - Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of > Non-volatile WAL buffer > 07) Edit postgresql.conf as the attached one > - Please remove nvwal_* lines in the case of Original (PMEM) > 08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl > -l pg.log start) > 09) Create a database (createdb --locale=C --encoding=UTF8) > 10) Initialize pgbench tables with s=50 (pgbench -i -s 50) > 11) Change # characters of "filler" column of "pgbench_history" table to 300 > (ALTER TABLE pgbench_history > ALTER filler TYPE character(300);) > - This would make the row size of the table 328 bytes > 12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop) > 13) Remount the PMEM and the PCIe SSD > 14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- > pg_ctl -l p
RE: [PoC] Non-volatile WAL buffer
Hi Gang,

I have tried but cannot yet reproduce the performance degradation you reported when inserting 328-byte records, so our conditions probably differ in some way: steps to reproduce, postgresql.conf, installation setup, and so on.

My results and conditions are as follows. May I have your conditions in more detail? Note that I refer to your "Storage over App Direct" as my "Original (PMEM)" and to your "NVWAL patch" as "Non-volatile WAL buffer."

Best regards,
Takashi


# Results
See the attached figure. In short, Non-volatile WAL buffer got better performance than Original (PMEM).

# Steps
Note that I ran the postgres server and pgbench on a single machine but on two separate NUMA nodes. The PMEM and PCIe SSD for the server process are on the server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM, then mount it with the DAX option (sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for the PCIe SSD, then mount it (sudo mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make the /mnt/pmem0/pg_wal directory for WAL
05) Make the /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
    - Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile WAL buffer
07) Edit postgresql.conf as attached
    - Please remove the nvwal_* lines in the case of Original (PMEM)
08) Start the postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change the number of characters in the "filler" column of the "pgbench_history" table to 300 (ALTER TABLE pgbench_history ALTER filler TYPE character(300);)
    - This makes the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start the postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
15) Run pg_prewarm for all four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r -M prepared -T 1800 -c __ -j __)
    - It executes the default tpcb-like transactions

I repeated all the steps three times for each (c,j), then took the median "tps = __ (including connections establishing)" of the three as throughput and the "latency average = __ ms" of that run as average latency.
# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 sockets (256 GiB per channel x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7 (vanilla)
- Filesystem: ext4 (DAX enabled when using Optane PMem)
- PMDK: 1.9
- PostgreSQL (Original): 14devel (200f610: Jul 26, 2020)
- PostgreSQL (Non-volatile WAL buffer): 14devel (200f610: Jul 26, 2020) + non-volatile WAL buffer patchset v4

--
Takashi Menjo
NTT Software Innovation Center

> -Original Message-
> From: Takashi Menjo
> Sent: Thursday, September 24, 2020 2:38 AM
> To: Deng, Gang
> Cc: pgsql-hack...@postgresql.org; Takashi Menjo
> Subject: Re: [PoC] Non-volatile WAL buffer
>
> Hello Gang,
>
> Thank you for your report. I have not taken care of record size deeply yet, so your report is very interesting. I will also have a test like yours then post results here.
>
> Regards,
> Takashi
>
> On Mon, Sep 21, 2020 at 14:14, Deng, Gang <mailto:gang.d...@intel.com> wrote:
> >
> > Hi Takashi,
> >
> > Thank you for the patch and work on accelerating PG performance with NVM. I applied the patch and made some performance test based on the patch v4. I stored database data files on NVMe SSD and stored WAL file on Intel PMem (NVM). I used two methods to store WAL file(s):
> >
> > 1. Leverage your patch to access PMem with libpmem (NVWAL patch).
> > 2. Access PMem with legacy filesystem interface, that means use PMem as ordinary block device, no PG patch is required to access PMem (Storage over App Direct).
> > > > I tried two insert scenarios: > > A. Insert small record (length of record to be
Re: [PoC] Non-volatile WAL buffer
Hello Gang,

Thank you for your report. I have not looked deeply into record size yet, so your report is very interesting. I will run a test like yours and then post the results here.

Regards,
Takashi

On Mon, Sep 21, 2020 at 14:14, Deng, Gang wrote:
> Hi Takashi,
>
> Thank you for the patch and the work on accelerating PG performance with NVM. I applied the patch and ran some performance tests based on patch v4. I stored database data files on an NVMe SSD and the WAL on Intel PMem (NVM). I used two methods to store the WAL file(s):
>
> 1. Leverage your patch to access PMem with libpmem (NVWAL patch).
> 2. Access PMem through the legacy filesystem interface, that is, use PMem as an ordinary block device; no PG patch is required to access PMem (Storage over App Direct).
>
> I tried two insert scenarios:
>
> A. Insert small records (record length 24 bytes); I think this is similar to your test
> B. Insert large records (record length 328 bytes)
>
> My original purpose was to see a higher performance gain in scenario B, as it is more write-intensive on WAL. But I observed that the NVWAL patch method had a ~5% performance improvement over the Storage over App Direct method in scenario A, while it had a ~20% performance degradation in scenario B.
>
> On further investigation, I found that the NVWAL patch improves the performance of the XLogFlush function but may hurt the performance of the CopyXlogRecordToWAL function. This may be related to the higher latency of memcpy to Intel PMem compared with DRAM.
> Here are key data in my test:
>
> Scenario A (length of record to be inserted: 24 bytes per record):
> ==================================================================
>                                       NVWAL     SoAD
>                                     -------   -------
> Throughput (10^3 TPS)                 310.5     296.0
> CPU Time % of CopyXlogRecordToWAL       0.4       0.2
> CPU Time % of XLogInsertRecord          1.5       0.8
> CPU Time % of XLogFlush                 2.1       9.6
>
> Scenario B (length of record to be inserted: 328 bytes per record):
> ===================================================================
>                                       NVWAL     SoAD
>                                     -------   -------
> Throughput (10^3 TPS)                  13.0      16.9
> CPU Time % of CopyXlogRecordToWAL       3.0       1.6
> CPU Time % of XLogInsertRecord         23.0      16.4
> CPU Time % of XLogFlush                 2.3       5.9
>
> Best Regards,
> Gang
>
> *From:* Takashi Menjo
> *Sent:* Thursday, September 10, 2020 4:01 PM
> *To:* Takashi Menjo
> *Cc:* pgsql-hack...@postgresql.org
> *Subject:* Re: [PoC] Non-volatile WAL buffer
>
> Rebased.
>
> On Wed, Jun 24, 2020 at 16:44, Takashi Menjo wrote:
> > Dear hackers,
> >
> > I update my non-volatile WAL buffer's patchset to v3. Now we can use it in streaming replication mode.
> >
> > Updates from v2:
> >
> > - walreceiver supports non-volatile WAL buffer
> >   Now walreceiver stores received records directly to non-volatile WAL buffer if applicable.
> >
> > - pg_basebackup supports non-volatile WAL buffer
> >   Now pg_basebackup copies received WAL segments onto non-volatile WAL buffer if you run it with "nvwal" mode (-Fn).
> >   You should specify a new NVWAL path with the --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's one.
> >
> > Best regards,
> > Takashi
> >
> > --
> > Takashi Menjo
> > NTT Software Innovation Center
> >
> > > -Original Message-
> > > From: Takashi Menjo
> > > Sent: Wednesday, March 18, 2020 5:59 PM
> > > To: 'PostgreSQL-development'
> > > Cc: 'Robert Haas' ; 'Heikki Linnakangas' <hlinn...@iki.fi>; 'Amit Langote'
> > > Subject: RE: [PoC] Non-volatile WAL buffer
> > >
> > > Dear hackers,
> > >
> > > I rebased my non-volatile WAL buffer's patchset onto master.
A new v2 > patchset is attached to this mail. > > > > I also measured performance before and after patchset, varying > -c/--client and -j/--jobs options of pgbench, for > > each scaling factor s = 50 or 1000. The results are presented in the > following tables and the attached charts. > > Co
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
Dear hackers, I rebased my old patchset. It would be good to compare this v4 patchset to non-volatile WAL buffer's one [1]. [1] https://www.postgresql.org/message-id/002101d649fb$1f5966e0$5e0c34a0$@hco.ntt.co.jp_1 Regards, Takashi -- Takashi Menjo v4-0001-Add-configure-option-for-PMDK.patch Description: Binary data v4-0003-Walreceiver-WAL-IO-using-PMDK.patch Description: Binary data v4-0002-Read-write-WAL-files-using-PMDK.patch Description: Binary data
Re: Remove page-read callback from XLogReaderState.
0 in src/backend/access/transam/xlog.c.

Regards,
Takashi

On Thu, Jul 2, 2020 at 13:53, Kyotaro Horiguchi wrote:
> cfbot is complaining as this is no longer applicable. Rebased.
>
> In v14, some references to the XLogReaderState parameter of the read_pages functions were accidentally replaced by references to the global variable xlogreader. Fixed that, too.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

--
Takashi Menjo
RE: [PoC] Non-volatile WAL buffer
Dear hackers, I update my non-volatile WAL buffer's patchset to v3. Now we can use it in streaming replication mode. Updates from v2: - walreceiver supports non-volatile WAL buffer Now walreceiver stores received records directly to non-volatile WAL buffer if applicable. - pg_basebackup supports non-volatile WAL buffer Now pg_basebackup copies received WAL segments onto non-volatile WAL buffer if you run it with "nvwal" mode (-Fn). You should specify a new NVWAL path with --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is same as the master's one. Best regards, Takashi -- Takashi Menjo NTT Software Innovation Center > -Original Message- > From: Takashi Menjo > Sent: Wednesday, March 18, 2020 5:59 PM > To: 'PostgreSQL-development' > Cc: 'Robert Haas' ; 'Heikki Linnakangas' > ; 'Amit Langote' > > Subject: RE: [PoC] Non-volatile WAL buffer > > Dear hackers, > > I rebased my non-volatile WAL buffer's patchset onto master. A new v2 > patchset is attached to this mail. > > I also measured performance before and after patchset, varying -c/--client > and -j/--jobs options of pgbench, for > each scaling factor s = 50 or 1000. The results are presented in the > following tables and the attached charts. > Conditions, steps, and other details will be shown later. 
>
> Results (s=50)
> ==============
>          Throughput [10^3 TPS]     Average latency [ms]
> ( c, j)  before  after             before  after
> -------  --------------------      --------------------
> ( 8, 8)   35.7   37.1 (+3.9%)      0.224   0.216 (-3.6%)
> (18,18)   70.9   74.7 (+5.3%)      0.254   0.241 (-5.1%)
> (36,18)   76.0   80.8 (+6.3%)      0.473   0.446 (-5.7%)
> (54,18)   75.5   81.8 (+8.3%)      0.715   0.660 (-7.7%)
>
> Results (s=1000)
> ================
>          Throughput [10^3 TPS]     Average latency [ms]
> ( c, j)  before  after             before  after
> -------  --------------------      --------------------
> ( 8, 8)   37.4   40.1 (+7.3%)      0.214   0.199 (-7.0%)
> (18,18)   79.3   86.7 (+9.3%)      0.227   0.208 (-8.4%)
> (36,18)   87.2   95.5 (+9.5%)      0.413   0.377 (-8.7%)
> (54,18)   86.8   94.8 (+9.3%)      0.622   0.569 (-8.5%)
>
> Both throughput and average latency improved for each scaling factor. Throughput seemed to almost reach its upper limit at (c,j)=(36,18).
>
> The improvement percentages for s=1000 look larger than for s=50. I think a larger scaling factor leads to less contention on the same tables and/or indexes, that is, fewer lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.
>
> Conditions
> ==========
> - Used one physical server having 2 NUMA nodes (node 0 and 1)
> - Pinned postgres (server processes) to node 0 and pgbench to node 1
>   - 18 cores and 192GiB DRAM per node
> - Used an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
>   - Both are installed on the server-side node, that is, node 0
>   - Both are formatted with ext4
>   - NVDIMM-N is mounted with the "-o dax" option to enable Direct Access (DAX)
> - Used the attached postgresql.conf
>   - The two new items nvwal_path and nvwal_size are used only after the patch
>
> Steps
> =====
> For each (c,j) pair, I performed the following steps three times, then took the median of the three as the final result shown in the tables above.
>
> (1) Run initdb with proper -D and -X options; also give the --nvwal-path and --nvwal-size options after the patch
> (2) Start postgres and create a database for the pgbench tables
> (3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
> (4) Stop postgres, remount the filesystems, and start postgres again
> (5) Execute the pg_prewarm extension for all four pgbench tables
> (6) Run pgbench for 30 minutes
>
> pgbench command line
>
> $ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname
>
> I gave no -b option, to use the built-in "TPC-B (sort-of)" query.
>
> Software
>
> - Distro: Ubuntu 18.04
> - Kernel: Linux 5.4 (vanilla kernel)
> - C Compiler: gcc 7.4.0
> - PMDK: 1.7
> - PostgreSQL: d677550 (master on Mar 3, 2020)
>
> Hardware
>
> - System: HPE ProLiant DL380 Gen10
> - CPU: Intel Xeon Gold 6154 (Skylake) x 2 sockets
> - DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2 sockets
> - NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2 sockets
> - NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo
> NTT Software Innovation Center
RE: [PoC] Non-volatile WAL buffer
Dear Amit, Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"... I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later. Best regards, Takashi -- Takashi Menjo NTT Software Innovation Center > -Original Message- > From: Amit Langote > Sent: Monday, February 17, 2020 5:21 PM > To: Takashi Menjo > Cc: Robert Haas ; Heikki Linnakangas > ; PostgreSQL-development > > Subject: Re: [PoC] Non-volatile WAL buffer > > Hello, > > On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo > wrote: > > Hello Amit, > > > > > I apologize for not having any opinion on the patches themselves, > > > but let me point out that it's better to base these patches on HEAD > > > (master branch) than REL_12_0, because all new code is committed to > > > the master branch, whereas stable branches such as REL_12_0 only receive > > > bug fixes. Do you have any > specific reason to be working on REL_12_0? > > > > Yes, because I think it's human-friendly to reproduce and discuss > > performance measurement. Of course I know > all new accepted patches are merged into master's HEAD, not stable branches > and not even release tags, so I'm > aware of rebasing my patchset onto master sooner or later. However, if > someone, including me, says that s/he > applies my patchset to "master" and measures its performance, we have to pay > attention to which commit the > "master" really points to. Although we have sha1 hashes to specify which > commit, we should check whether the > specific commit on master has patches affecting performance or not because > master's HEAD gets new patches day > by day. On the other hand, a release tag clearly points the commit all we > probably know. Also we can check more > easily the features and improvements by using release notes and user manuals. > > Thanks for clarifying. I see where you're coming from. 
> > While I do sometimes see people reporting numbers with the latest stable > release' branch, that's normally just one > of the baselines. > The more important baseline for ongoing development is the master branch's > HEAD, which is also what people > volunteering to test your patches would use. Anyone who reports would have > to give at least two numbers -- > performance with a branch's HEAD without patch applied and that with patch > applied -- which can be enough in > most cases to see the difference the patch makes. Sure, the numbers might > change on each report, but that's fine > I'd think. If you continue to develop against the stable branch, you might > miss to notice impact from any relevant > developments in the master branch, even developments which possibly require > rethinking the architecture of your > own changes, although maybe that rarely occurs. > > Thanks, > Amit
RE: [PoC] Non-volatile WAL buffer
Hello Amit, > I apologize for not having any opinion on the patches themselves, but let me > point out that it's better to base these > patches on HEAD (master branch) than REL_12_0, because all new code is > committed to the master branch, > whereas stable branches such as REL_12_0 only receive bug fixes. Do you have > any specific reason to be working > on REL_12_0? Yes, because I think it's human-friendly to reproduce and discuss performance measurement. Of course I know all new accepted patches are merged into master's HEAD, not stable branches and not even release tags, so I'm aware of rebasing my patchset onto master sooner or later. However, if someone, including me, says that s/he applies my patchset to "master" and measures its performance, we have to pay attention to which commit the "master" really points to. Although we have sha1 hashes to specify which commit, we should check whether the specific commit on master has patches affecting performance or not because master's HEAD gets new patches day by day. On the other hand, a release tag clearly points the commit all we probably know. Also we can check more easily the features and improvements by using release notes and user manuals. Best regards, Takashi -- Takashi Menjo NTT Software Innovation Center > -Original Message- > From: Amit Langote > Sent: Monday, February 17, 2020 1:39 PM > To: Takashi Menjo > Cc: Robert Haas ; Heikki Linnakangas > ; PostgreSQL-development > > Subject: Re: [PoC] Non-volatile WAL buffer > > Menjo-san, > > On Mon, Feb 17, 2020 at 1:13 PM Takashi Menjo > wrote: > > I applied my patchset that mmap()-s WAL segments as WAL buffers to > > refs/tags/REL_12_0, and measured and > analyzed its performance with pgbench. Roughly speaking, When I used *SSD > and ext4* to store WAL, it was > "obviously worse" than the original REL_12_0. 
> > I apologize for not having any opinion on the patches themselves, but let me > point out that it's better to base these > patches on HEAD (master branch) than REL_12_0, because all new code is > committed to the master branch, > whereas stable branches such as REL_12_0 only receive bug fixes. Do you have > any specific reason to be working > on REL_12_0? > > Thanks, > Amit
RE: [PoC] Non-volatile WAL buffer
Dear hackers,

I applied my patchset that mmap()-s WAL segments as WAL buffers to refs/tags/REL_12_0, and measured and analyzed its performance with pgbench.

Roughly speaking, when I used *SSD and ext4* to store WAL, it was "obviously worse" than the original REL_12_0. VTune told me that the CPU time of memcpy() called by CopyXLogRecordToWAL() got larger than before.

When I used *NVDIMM-N and ext4 with filesystem DAX* to store WAL, however, it achieved "not bad" performance compared with our previous patchset and the non-volatile WAL buffer. The CPU time of each of XLogInsert() and XLogFlush() was reduced, as with the non-volatile WAL buffer. So I think mmap()-ing WAL segments as WAL buffers is not such a bad idea, as long as we use PMEM, at least NVDIMM-N.

Excuse me, but for now I will refrain from saying how much the performance was, because the mmap()-ing patchset is WIP, so there might be bugs that wrongfully "improve" or "degrade" performance. Also, to explain why the performance improved, we need to understand persistent memory programming and related features such as filesystem DAX, huge page faults, and WAL persistence with cache-flush and memory-barrier instructions. I'd talk about all the details at the appropriate time and place. (The conference, or here later...)

Best regards,
Takashi

--
Takashi Menjo
NTT Software Innovation Center

> -Original Message-
> From: Takashi Menjo
> Sent: Monday, February 10, 2020 6:30 PM
> To: 'Robert Haas' ; 'Heikki Linnakangas'
> Cc: 'pgsql-hack...@postgresql.org'
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Dear hackers,
>
> I made another WIP patchset to mmap WAL segments as WAL buffers. Note that this is not a non-volatile WAL buffer patchset but its competitor. I am measuring and analyzing the performance of this patchset to compare with my N.V.WAL buffer.
>
> Please wait several more days for the result report...
> > Best regards, > Takashi > > -- > Takashi Menjo NTT Software Innovation Center > > > -Original Message- > > From: Robert Haas > > Sent: Wednesday, January 29, 2020 6:00 AM > > To: Takashi Menjo > > Cc: Heikki Linnakangas ; pgsql-hack...@postgresql.org > > Subject: Re: [PoC] Non-volatile WAL buffer > > > > On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo > > wrote: > > > I think our concerns are roughly classified into two: > > > > > > (1) Performance > > > (2) Consistency > > > > > > And your "different concern" is rather into (2), I think. > > > > Actually, I think it was mostly a performance concern (writes > > triggering lots of reading) but there might be a consistency issue as well. > > > > -- > > Robert Haas > > EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL > > Company
RE: [PoC] Non-volatile WAL buffer
Dear hackers,

I made another WIP patchset to mmap WAL segments as WAL buffers. Note that this is not a non-volatile WAL buffer patchset but its competitor. I am measuring and analyzing the performance of this patchset to compare with my N.V.WAL buffer.

Please wait several more days for the result report...

Best regards,
Takashi

--
Takashi Menjo
NTT Software Innovation Center

> -Original Message-
> From: Robert Haas
> Sent: Wednesday, January 29, 2020 6:00 AM
> To: Takashi Menjo
> Cc: Heikki Linnakangas ; pgsql-hack...@postgresql.org
> Subject: Re: [PoC] Non-volatile WAL buffer
>
> On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo wrote:
> > I think our concerns are roughly classified into two:
> >
> > (1) Performance
> > (2) Consistency
> >
> > And your "different concern" is rather into (2), I think.
>
> Actually, I think it was mostly a performance concern (writes triggering lots of reading) but there might be a consistency issue as well.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

0001-Preallocate-more-WAL-segments.patch
Description: Binary data
0002-Use-WAL-segments-as-WAL-buffers.patch
Description: Binary data
0003-Lazy-unmap-WAL-segments.patch
Description: Binary data
0004-Speculative-map-WAL-segments.patch
Description: Binary data
0005-Allocate-WAL-segments-to-utilize-hugepage.patch
Description: Binary data
RE: [PoC] Non-volatile WAL buffer
Hello Robert,

I think our concerns are roughly classified into two:

(1) Performance
(2) Consistency

And your "different concern" falls rather under (2), I think. I'm also worried about it, but I have no good answer for now. I suppose mmap(flags|=MAP_SHARED) called by multiple backend processes on the same file works consistently for both PMEM and non-PMEM devices. However, I have not found any evidence, such as specification documents, yet.

I also made a tiny program that calls memcpy() and msync() on the same mmap()-ed file, but on mutually distinct address ranges, in parallel, and found no corrupted data. However, that result does not ensure the kind of consistency I'm worried about. I could give it up if there *were* corrupted data...

So I will go to (1) first. I will run the test Heikki suggested, to answer whether the cost of mmap() and munmap() per WAL segment, etc., is reasonable or not. If it really is, then I will go to (2).

Best regards,
Takashi

--
Takashi Menjo
NTT Software Innovation Center
RE: [PoC] Non-volatile WAL buffer
Hello Heikki,

> I have the same comments on this that I had on the previous patch, see:
> https://www.postgresql.org/message-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8%40iki.fi

Thanks. I re-read your messages [1][2]. What you meant, AFAIU, is: how about using memory-mapped WAL segment files as WAL buffers, and switching between CPU instructions and msync(), depending on whether the segment files are on PMEM or not, to sync inserted WAL records. It sounds reasonable, but I'm sorry that I haven't tested such a program yet. I'll try it and compare it with my non-volatile WAL buffer. For now, I'm a little worried about the overhead of mmap()/munmap() for each WAL segment file.

You also mentioned the SIGBUS problem with memory-mapped I/O. I think it's real for reading from bad memory blocks, as you mentioned, and also for writing to such blocks [3]. Handling SIGBUS properly, or working around it, is future work.

Best regards,
Takashi

[1] https://www.postgresql.org/message-id/83eafbfd-d9c5-6623-2423-7cab1be3888c%40iki.fi
[2] https://www.postgresql.org/message-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8%40iki.fi
[3] https://pmem.io/2018/11/26/bad-blocks.htm

--
Takashi Menjo
NTT Software Innovation Center
RE: [PoC] Non-volatile WAL buffer
Hello Fabien,

Thank you for your +1 :)

> Is it possible to emulate something without the actual hardware, at least
> for testing purposes?

Yes, you can emulate PMEM using DRAM on Linux, via the "memmap=nnG!ssG" kernel parameter. Please see [1] and [2] for emulation details. If your emulation does not work well, please check whether the kernel configuration options (like CONFIG_FOOBAR) for PMEM and DAX (in [1] and [3]) are set up properly.

Best regards,
Takashi

[1] How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM)
    https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
[2] how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
    https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
[3] Persistent Memory Wiki
    https://nvdimm.wiki.kernel.org/

--
Takashi Menjo
NTT Software Innovation Center
[PoC] Non-volatile WAL buffer
Dear hackers,

I propose "non-volatile WAL buffer," a proof-of-concept new feature. It enables WAL records to be durable without being written out to WAL segment files, by residing on persistent memory (PMEM) instead of DRAM. It improves database performance by reducing copies of WAL and shortening the time of write transactions. I attach the first patchset, which can be applied to PostgreSQL 12.0 (refs/tags/REL_12_0). Please see README.nvwal (added by patch 0003) to use the new feature.

PMEM [1] is fast, non-volatile, byte-addressable memory installed into DIMM slots, and such products are already available. For example, an NVDIMM-N is a type of PMEM module that contains both DRAM and NAND flash. It can be accessed like regular DRAM, but on power loss it saves its contents into the flash area; on power restore, it performs the reverse, that is, the contents are copied back into DRAM. PMEM is also already supported by major operating systems such as Linux and Windows, and by new open-source libraries such as the Persistent Memory Development Kit (PMDK) [2]. Furthermore, several DBMSes have started to support PMEM. It's time for PostgreSQL.

PMEM is faster than a solid-state disk and can naively be used as block storage. However, we cannot gain much performance that way, because PMEM is so fast that the overhead of traditional software stacks (user buffers, filesystems, and block layers) becomes unignorable. Non-volatile WAL buffer is a work to make PostgreSQL PMEM-aware, that is, to access PMEM directly as RAM, bypassing such overhead and achieving the maximum possible benefit. I believe WAL is one of the most important modules to be redesigned for PMEM, because it has assumed slow disks such as HDDs and SSDs, and PMEM is not so.

This work is inspired by "Non-volatile Memory Logging," presented at PGCon 2016 [3], and aims to gain more benefit from PMEM than my and Yoshimi's previous work did [4][5].
I submitted a talk proposal for PGCon this year, and have measured and analyzed the performance of my PostgreSQL with non-volatile WAL buffer, compared with the original that uses PMEM as "a faster-than-SSD storage." I will talk about the results if accepted.

Best regards,
Takashi Menjo

[1] Persistent Memory (SNIA)
    https://www.snia.org/PM
[2] Persistent Memory Development Kit (pmem.io)
    https://pmem.io/pmdk/
[3] Non-volatile Memory Logging (PGCon 2016)
    https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[4] Introducing PMDK into PostgreSQL (PGCon 2018)
    https://www.pgcon.org/2018/schedule/events/1154.en.html
[5] Applying PMDK to WAL operations for persistent memory (pgsql-hackers)
    https://www.postgresql.org/message-id/c20d38e97bcb33dad59e...@lab.ntt.co.jp

--
Takashi Menjo
NTT Software Innovation Center

0001-Support-GUCs-for-external-WAL-buffer.patch
Description: Binary data
0002-Non-volatile-WAL-buffer.patch
Description: Binary data
0003-README-for-non-volatile-WAL-buffer.patch
Description: Binary data
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
Peter Eisentraut wrote:
> I'm concerned with how this would affect the future maintenance of this
> code. You are introducing a whole separate code path for PMDK beside
> the normal file path (and it doesn't seem very well separated either).
> Now everyone who wants to do some surgery in the WAL code needs to take
> that into account. And everyone who wants to do performance work in the
> WAL code needs to check that the PMDK path doesn't regress. AFAICT,
> this hardware isn't very popular at the moment, so it would be very hard
> to peer review any work in this area.

Thank you for your comment. It is reasonable to be concerned about maintainability; our patchset still lacks it. I will address that when I submit the next update. (It may take a long time, so please be patient...)

Regards,
Takashi

--
Takashi Menjo - NTT Software Innovation Center
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
Hi, Sorry but I found that the patchset v2 had a bug in managing WAL segment file offset. I fixed it and updated a patchset as v3 (attached). Regards, Takashi -- Takashi Menjo - NTT Software Innovation Center 0001-Add-configure-option-for-PMDK-v3.patch Description: Binary data 0002-Read-write-WAL-files-using-PMDK-v3.patch Description: Binary data 0003-Walreceiver-WAL-IO-using-PMDK-v3.patch Description: Binary data
RE: static global variable openLogOff in xlog.c seems no longer used
Michael Paquier wrote: > It seems to me that keeping openLogOff is still useful to get a report > about the full chunk area being written if the data gets written in > multiple chunks and fails afterwards. Your patch would modify the > report so as only the area with the partial write is reported. For > debugging, having a static reference is also useful in my opinion. I agree with you on both error reporting and debugging. Now that you mention it, I find that my patch modifies ereport... When I wrote a patchset to xlog.c (in another email thread), I thought that this can be fixed. But now I understand it is not a simple thing. Thank you. Regards, Takashi -- Takashi Menjo - NTT Software Innovation Center
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
Hi, Peter Eisentraut wrote: > When you manage the WAL (or perhaps in the future relation files) > through PMDK, is there still a file system view of it somewhere, for > browsing, debugging, and for monitoring tools? First, I assume that our patchset is used with a filesystem that supports direct access (DAX) feature, and I test it with ext4 on Linux. You can cd into pg_wal directory created by initdb -X pg_wal on such a filesystem, and ls WAL segment files managed by PMDK at runtime. For each PostgreSQL-specific tool, perhaps yes, but I have not tested yet. At least, pg_waldump looks working as before. Regards, Takashi -- Takashi Menjo - NTT Software Innovation Center
static global variable openLogOff in xlog.c seems no longer used
Hi,

Because of pg_pwrite() [1], openLogOff, a static global variable in xlog.c, seems to have been superseded by the local variable startoffset and is no longer used. I wrote the attached patch to remove openLogOff. Both "make check" and "make installcheck" passed, and just after that, "pg_ctl -m immediate stop" followed by "pg_ctl start" looked OK.

Regards,
Takashi

[1] See commit c24dcd0cfd949bdf245814c4c2b3df828ee7db36.

--
Takashi Menjo - NTT Software Innovation Center

Remove-openLogOff.patch
Description: Binary data
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
Hello, On behalf of Yoshimi, I rebased the patchset onto the latest master (e3565fd6). Please see the attachment. It also includes an additional bug fix (in patch 0002) about temporary filename. Note that PMDK 1.4.2+ supports MAP_SYNC and MAP_SHARED_VALIDATE flags, so please use a new version of PMDK when you test. The latest version is 1.5. Heikki Linnakangas wrote: > To re-iterate what I said earlier in this thread, I think the next step > here is to write a patch that modifies xlog.c to use plain old > mmap()/msync() to memory-map the WAL files, to replace the WAL buffers. Sorry but my new patchset still uses PMDK, because PMDK is supported on Linux _and Windows_, and I think someone may want to test this patchset on Windows... Regards, Takashi -- Takashi Menjo - NTT Software Innovation Center 0001-Add-configure-option-for-PMDK-v2.patch Description: Binary data 0002-Read-write-WAL-files-using-PMDK-v2.patch Description: Binary data 0003-Walreceiver-WAL-IO-using-PMDK-v2.patch Description: Binary data