Re: Map WAL segment files on PMEM as WAL buffers

2022-03-23 Thread Takashi Menjo
Hi Andres,

Thank you for your report. I rebased and made patchset v9, attached to
this email. Note that v9-0009 and v9-0010 are only for those who want
to run their own Cirrus CI.

Regards,
Takashi


On Tue, Mar 22, 2022 at 9:44 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-01-20 14:55:13 +0900, Takashi Menjo wrote:
> > Here is patchset v8. It should pass "make check-world" and Cirrus
> > CI.
>
> This unfortunately does not apply anymore: 
> http://cfbot.cputube.org/patch_37_3181.log
>
> Could you rebase?
>
> - Andres



-- 
Takashi Menjo 


v9-0003-Add-wal_pmem_map-to-postgresql.conf.sample.patch
Description: Binary data


v9-0001-Add-with-libpmem-option-for-PMEM-support.patch
Description: Binary data


v9-0002-Add-wal_pmem_map-to-GUC.patch
Description: Binary data


v9-0005-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
Description: Binary data


v9-0004-Export-InstallXLogFileSegment.patch
Description: Binary data


v9-0006-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
Description: Binary data


v9-0007-Update-document.patch
Description: Binary data


v9-0008-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Description: Binary data


v9-0009-For-CI-only-Setup-Cirrus-CI-for-with-libpmem.patch
Description: Binary data


v9-0010-For-CI-only-Modify-initdb-for-wal_pmem_map-on.patch
Description: Binary data


Re: Map WAL segment files on PMEM as WAL buffers

2022-01-19 Thread Takashi Menjo
Hi Justin,

Here is patchset v8. It should pass "make check-world" and Cirrus CI.
Would you try this one?

The v8 squashes some patches in v7 into related ones, and adds the
following patches:

- v8-0003: Add wal_pmem_map to postgresql.conf.sample. It also helps v8-0011.

- v8-0009: Fix wrong handling of missingContrecPtr so that
test/recovery/t/026 passes. This was the cause of the error. Thanks for
your report.

- v8-0010 and v8-0011: Each of the two is for CI only. v8-0010 adds
--with-libpmem and v8-0011 enables "wal_pmem_map = on". Please note
that, unlike your suggestion, in my patchset PMEM_IS_PMEM_FORCE=1 will
be given as an environment variable in .cirrus.yml and "wal_pmem_map =
on" will be given by initdb.

Regards,
Takashi

-- 
Takashi Menjo 


v8-0001-Add-with-libpmem-option-for-PMEM-support.patch
Description: Binary data


v8-0002-Add-wal_pmem_map-to-GUC.patch
Description: Binary data


v8-0005-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
Description: Binary data


v8-0003-Add-wal_pmem_map-to-postgresql.conf.sample.patch
Description: Binary data


v8-0004-Export-InstallXLogFileSegment.patch
Description: Binary data


v8-0007-Update-document.patch
Description: Binary data


v8-0008-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Description: Binary data


v8-0006-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
Description: Binary data


v8-0009-Fix-wrong-handling-of-missingContrecPtr.patch
Description: Binary data


v8-0011-For-CI-only-Modify-initdb-for-wal_pmem_map-on.patch
Description: Binary data


v8-0010-For-CI-only-Setup-Cirrus-CI-for-with-libpmem.patch
Description: Binary data


Re: Map WAL segment files on PMEM as WAL buffers

2022-01-18 Thread Takashi Menjo
Hi Justin,

I can reproduce the error you reported, with PMEM_IS_PMEM_FORCE=1.

Moreover, I can reproduce it **on a real PMem device**. So the causes
are in my patchset, not in the PMem environment.

I'll fix it in the next patchset version.

Regards,
Takashi

--
Takashi Menjo 




Re: Map WAL segment files on PMEM as WAL buffers

2022-01-18 Thread Takashi Menjo
Hi Justin,

Thanks for your help. I'm making an additional patch for Cirrus CI.

I'm also trying to reproduce the "make check-world" error you
reported, on my Linux environment that has neither a real PMem nor an
emulated one, with PMEM_IS_PMEM_FORCE=1. I'll keep you updated.

Regards,
Takashi

On Mon, Jan 17, 2022 at 4:34 PM Justin Pryzby  wrote:
>
> On Thu, Jan 06, 2022 at 10:43:37PM -0600, Justin Pryzby wrote:
> > On Fri, Jan 07, 2022 at 12:50:01PM +0900, Takashi Menjo wrote:
> > > > But in this case it really doesn't work :(
> > > >
> > > > running bootstrap script ... 2022-01-05 23:17:30.244 CST [12088] FATAL: 
> > > >  file not on PMEM: path "pg_wal/00010001"
> > >
> > > Do you have a real PMEM device such as NVDIMM-N or Intel Optane PMem?
> >
> > No - the point is that we'd like to have a way to exercise this patch on the
> > cfbot.  Particularly the new code introduced by this patch, not just the
> > --without-pmem case...
> ..
> > I think you should add a patch which does what Thomas suggested: 1) add to
> > ./.cirrus.yaml installation of the libpmem package for 
> > debian/bsd/mac/windows;
> > 2) add setenv to main(), as above; 3) change configure.ac and guc.c to 
> > default
> > to --with-libpmem and wal_pmem_map=on.  This should be the last patch, for
> > cfbot only, not meant to be merged.
>
> I was able to get the cirrus CI to compile on linux and bsd with the below
> changes.  I don't know if there's an easy package installation for mac OSX.  I
> think it's okay if mac CI doesn't use --enable-pmem for now.
>
> > You can test that the package installation part works before mailing 
> > patches to
> > the list with the instructions here:
> >
> > src/tools/ci/README:
> > Enabling cirrus-ci in a github repository..
>
> I ran the CI under my own github account.
> Linux crashes in the recovery check.
> And freebsd has been stuck for 45min.
>
> I'm not sure, but maybe those are legitimate consequences of using
> PMEM_IS_PMEM_FORCE (?)  If so, maybe the recovery check would need to be
> disabled for this patch to run on CI...  Or maybe my suggestion to enable it 
> by
> default for CI doesn't work for this patch.  It would need to be specially
> tested with real hardware.
>
> https://cirrus-ci.com/task/6245151591890944
>
> https://cirrus-ci.com/task/6162551485497344?logs=test_world#L3941
> #2  0x55ff43c6edad in ExceptionalCondition (conditionName=0x55ff43d18108 
> "!XLogRecPtrIsInvalid(missingContrecPtr)", errorType=0x55ff43d151c4 
> "FailedAssertion", fileName=0x55ff43d151bd "xlog.c", lineNumber=8297) at 
> assert.c:69
>
> commit 15533794e465a381eb23634d67700afa809a0210
> Author: Justin Pryzby 
> Date:   Thu Jan 6 22:53:28 2022 -0600
>
> tmp: enable pmem by default, for CI
>
> diff --git a/.cirrus.yml b/.cirrus.yml
> index 677bdf0e65e..0cb961c8103 100644
> --- a/.cirrus.yml
> +++ b/.cirrus.yml
> @@ -81,6 +81,7 @@ task:
>  mkdir -m 770 /tmp/cores
>  chown root:postgres /tmp/cores
>  sysctl kern.corefile='/tmp/cores/%N.%P.core'
> +pkg install -y devel/pmdk
>
># NB: Intentionally build without --with-llvm. The freebsd image size is
># already large enough to make VM startup slow, and even without llvm
> @@ -99,6 +100,7 @@ task:
>  --with-lz4 \
>  --with-pam \
>  --with-perl \
> +--with-libpmem \
>  --with-python \
>  --with-ssl=openssl \
>  --with-tcl --with-tclconfig=/usr/local/lib/tcl8.6/ \
> @@ -138,6 +140,7 @@ LINUX_CONFIGURE_FEATURES: _CONFIGURE_FEATURES >-
>--with-lz4
>--with-pam
>--with-perl
> +  --with-libpmem
>--with-python
>--with-selinux
>--with-ssl=openssl
> @@ -188,6 +191,9 @@ task:
>  mkdir -m 770 /tmp/cores
>  chown root:postgres /tmp/cores
>  sysctl kernel.core_pattern='/tmp/cores/%e-%s-%p.core'
> +echo 'deb http://deb.debian.org/debian bullseye universe' 
> >>/etc/apt/sources.list
> +apt-get update
> +apt-get -y install libpmem-dev
>
>configure_script: |
>  su postgres <<-EOF
> @@ -267,6 +273,7 @@ task:
>make \
>openldap \
>openssl \
> +  pmem \
>python \
>tcl-tk
>
> @@ -301,6 +308,7 @@ task:
>--with-libxslt \
>--with-lz4 \
>--with-perl \
> +  --with-libpmem \
>--with-python \
>--with-ssl=openssl \
>--with-tcl --with-tclconfig=${brewpath}/opt/tcl-tk/lib/ \
> diff --git a/src/backend/main/main.c b/src/backend/main/main.c
> inde

Re: Map WAL segment files on PMEM as WAL buffers

2022-01-06 Thread Takashi Menjo
Hi Justin,

Thank you for your build test and comments. The v7 patchset attached
to this email fixes the issues you reported.


> The cfbot showed issues compiling on linux and windows.
> http://cfbot.cputube.org/takashi-menjo.html
>
> https://cirrus-ci.com/task/6125740327436288
> [02:30:06.538] In file included from xlog.c:38:
> [02:30:06.538] ../../../../src/include/access/xlogpmem.h:32:42: error: 
> unknown type name ‘tli’
> [02:30:06.538]32 | PmemXLogEnsurePrevMapped(XLogRecPtr ptr, tli)
> [02:30:06.538]   |  ^~~
> [02:30:06.538] xlog.c: In function ‘GetXLogBuffer’:
> [02:30:06.538] xlog.c:1959:19: warning: implicit declaration of function 
> ‘PmemXLogEnsurePrevMapped’ [-Wimplicit-function-declaration]
> [02:30:06.538]  1959 |openLogSegNo = PmemXLogEnsurePrevMapped(endptr, 
> tli);
>
> https://cirrus-ci.com/task/6688690280857600?logs=build#L379
> [02:33:25.752] c:\cirrus\src\include\access\xlogpmem.h(33,1): error C2081: 
> 'tli': name in formal parameter list illegal (compiling source file 
> src/backend/access/transam/xlog.c) [c:\cirrus\postgres.vcxproj]
>
> I'm attaching a probable fix.  Unfortunately, for patches like this, most of
> the functionality isn't exercised unless the library is installed and
> compilation and runtime are enabled by default.

I got the same error when building without --with-libpmem. Your fix
looks reasonable. My v7-0008 fixes this error.


> In 0009: recaluculated => recalculated

v7-0011 fixes this typo.


> 0010-Update-document should be squished with 0003-Add-wal_pmem_map-to-GUC (and
> maybe 0002 and 0001).  I believe the patches after 0005 are more WIP, so it's
> fine if they're not squished yet.

As you say, the patch updating the document should eventually be
squashed into a related one, probably "Add wal_pmem_map to GUC". For
now I keep it as a separate patch (v7-0014).


> I'm not sure what the point is of this one: 
> 0008-Let-wal_pmem_map-be-constant-unl

If USE_LIBPMEM is not defined (that is, built without --with-libpmem),
wal_pmem_map is always false and essentially unused. Scattering
#if(n)def everywhere hurts code readability, so I let wal_pmem_map be a
compile-time constant in that case. This may also help compilers
optimize away the guarded conditional branches.

v7-0005 adds the comment above.


> +   ereport(ERROR,
> +   (errcode_for_file_access(),
> +errmsg("could not pmem_map_file \"%s\": %m", 
> path)));
>
> => The outer parenthesis are not needed since e3a87b4.

v7-0009 fixes this.


> But in this case it really doesn't work :(
>
> running bootstrap script ... 2022-01-05 23:17:30.244 CST [12088] FATAL:  file 
> not on PMEM: path "pg_wal/00010001"

Do you have a real PMEM device such as NVDIMM-N or Intel Optane PMem?
If so, please mount the PMEM with the filesystem DAX option and place
pg_wal on it; otherwise the FATAL error will occur.

If you don't, you have the two alternatives below. Note that neither of
them ensures durability; each is for testing only.

1. Emulate PMEM with the kernel parameter memmap=nn[KMG]!ss[KMG]. This
works only on Linux; see [1][2] for details.
2. Set the environment variable PMEM_IS_PMEM_FORCE=1 to tell libpmem to
treat any device as if it were PMEM.


Regards,
Takashi


[1] 
https://www.intel.com/content/www/us/en/developer/articles/training/how-to-emulate-persistent-memory-on-an-intel-architecture-server.html
[2] https://nvdimm.wiki.kernel.org/

-- 
Takashi Menjo 


v7-0004-Let-wal_pmem_map-be-constant-unless-with-libpmem.patch
Description: Binary data


v7-0002-Support-build-with-MSVC-on-Windows.patch
Description: Binary data


v7-0001-Add-with-libpmem-option-for-PMEM-support.patch
Description: Binary data


v7-0005-Comment-for-constant-wal_pmem_map.patch
Description: Binary data


v7-0003-Add-wal_pmem_map-to-GUC.patch
Description: Binary data


v7-0006-Export-InstallXLogFileSegment.patch
Description: Binary data


v7-0008-Fix-invalid-declaration-of-PmemXLogEnsurePrevMapp.patch
Description: Binary data


v7-0007-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
Description: Binary data


v7-0009-Remove-redundant-parentheses-from-ereport-call.patch
Description: Binary data


v7-0011-Fix-typo-in-comment.patch
Description: Binary data


v7-0013-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
Description: Binary data


v7-0012-Compatible-to-Windows.patch
Description: Binary data


v7-0010-Ensure-WAL-mappings-before-assertion.patch
Description: Binary data


v7-0014-Update-document.patch
Description: Binary data


v7-0015-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Description: Binary data


Re: Map WAL segment files on PMEM as WAL buffers

2022-01-05 Thread Takashi Menjo
Rebased.

On Fri, Nov 5, 2021 at 3:47 PM Takashi Menjo  wrote:
>
> Hi Daniel,
>
> The issue you reported has been fixed.  I attach the v5 patchset to this email.
>
> The v5 has all the patches in the v4, and in addition, has the
> following two new patches:
>
> - (v5-0002) Support build with MSVC on Windows: Please add the
> following line to src\tools\msvc\config.pl as the MSVC equivalent of
> "configure --with-libpmem":
>
> $config->{pmem} = 'C:\path\to\pmdk\x64-windows';
>
> - (v5-0006) Compatible to Windows: This patch resolves conflicting
> mode_t typedefs and libpmem API variants (U or W, like Windows API).
>
> Best regards,
> Takashi
>
> On Thu, Nov 4, 2021 at 5:46 PM Takashi Menjo  wrote:
> >
> > Hello Daniel,
> >
> > Thank you for your comment. I had the following error message with
> > MSVC on Windows. It looks the same as what you told me. I'll fix it.
> >
> > | > cd src\tools\msvc
> > | > build
> > | (..snipped..)
> > | Copying pg_config_os.h...
> > | Generating configuration headers...
> > | undefined symbol: HAVE_LIBPMEM at src/include/pg_config.h line 347
> > at C:/Users/menjo/Documents/git/postgres/src/tools/msvc/Mkvcbuild.pm
> > line 860.
> >
> > Best regards,
> > Takashi
> >
> >
> > On Wed, Nov 3, 2021 at 10:04 PM Daniel Gustafsson  wrote:
> > >
> > > > On 28 Oct 2021, at 08:09, Takashi Menjo  wrote:
> > >
> > > > Rebased, and added the patches below into the patchset.
> > >
> > > Looks like the 0001 patch needs to be updated to support Windows and 
> > > MSVC.  See
> > > src/tools/msvc/Mkvcbuild.pm and Solution.pm et.al for inspiration on how 
> > > to add
> > > the MSVC equivalent of --with-libpmem.  Currently the patch fails in the
> > > "Generating configuration headers" step in Solution.pm.
> > >
> > > --
> > > Daniel Gustafsson   https://vmware.com/
> > >
> >
> >
> > --
> > Takashi Menjo 
>
>
>
> --
> Takashi Menjo 



-- 
Takashi Menjo 


v6-0003-Add-wal_pmem_map-to-GUC.patch
Description: Binary data


v6-0001-Add-with-libpmem-option-for-PMEM-support.patch
Description: Binary data


v6-0004-Export-InstallXLogFileSegment.patch
Description: Binary data


v6-0002-Support-build-with-MSVC-on-Windows.patch
Description: Binary data


v6-0005-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
Description: Binary data


v6-0008-Let-wal_pmem_map-be-constant-unless-with-libpmem.patch
Description: Binary data


v6-0006-Compatible-to-Windows.patch
Description: Binary data


v6-0009-Ensure-WAL-mappings-before-assertion.patch
Description: Binary data


v6-0007-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
Description: Binary data


v6-0010-Update-document.patch
Description: Binary data


v6-0011-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Description: Binary data


Re: Map WAL segment files on PMEM as WAL buffers

2021-11-05 Thread Takashi Menjo
Hi Daniel,

The issue you reported has been fixed.  I attach the v5 patchset to this email.

The v5 has all the patches in the v4, and in addition, has the
following two new patches:

- (v5-0002) Support build with MSVC on Windows: Please add the
following line to src\tools\msvc\config.pl as the MSVC equivalent of
"configure --with-libpmem":

$config->{pmem} = 'C:\path\to\pmdk\x64-windows';

- (v5-0006) Compatible to Windows: This patch resolves conflicting
mode_t typedefs and libpmem API variants (U or W, like Windows API).

Best regards,
Takashi

On Thu, Nov 4, 2021 at 5:46 PM Takashi Menjo  wrote:
>
> Hello Daniel,
>
> Thank you for your comment. I had the following error message with
> MSVC on Windows. It looks the same as what you told me. I'll fix it.
>
> | > cd src\tools\msvc
> | > build
> | (..snipped..)
> | Copying pg_config_os.h...
> | Generating configuration headers...
> | undefined symbol: HAVE_LIBPMEM at src/include/pg_config.h line 347
> at C:/Users/menjo/Documents/git/postgres/src/tools/msvc/Mkvcbuild.pm
> line 860.
>
> Best regards,
> Takashi
>
>
> On Wed, Nov 3, 2021 at 10:04 PM Daniel Gustafsson  wrote:
> >
> > > On 28 Oct 2021, at 08:09, Takashi Menjo  wrote:
> >
> > > Rebased, and added the patches below into the patchset.
> >
> > Looks like the 0001 patch needs to be updated to support Windows and MSVC.  
> > See
> > src/tools/msvc/Mkvcbuild.pm and Solution.pm et.al for inspiration on how to 
> > add
> > the MSVC equivalent of --with-libpmem.  Currently the patch fails in the
> > "Generating configuration headers" step in Solution.pm.
> >
> > --
> > Daniel Gustafsson   https://vmware.com/
> >
>
>
> --
> Takashi Menjo 



-- 
Takashi Menjo 


v5-0001-Add-with-libpmem-option-for-PMEM-support.patch
Description: Binary data


v5-0002-Support-build-with-MSVC-on-Windows.patch
Description: Binary data


v5-0003-Add-wal_pmem_map-to-GUC.patch
Description: Binary data


v5-0004-Export-InstallXLogFileSegment.patch
Description: Binary data


v5-0005-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
Description: Binary data


v5-0006-Compatible-to-Windows.patch
Description: Binary data


v5-0007-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
Description: Binary data


v5-0008-Let-wal_pmem_map-be-constant-unless-with-libpmem.patch
Description: Binary data


v5-0009-Ensure-WAL-mappings-before-assertion.patch
Description: Binary data


v5-0010-Update-document.patch
Description: Binary data


v5-0011-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Description: Binary data


Re: Map WAL segment files on PMEM as WAL buffers

2021-11-04 Thread Takashi Menjo
Hello Daniel,

Thank you for your comment. I had the following error message with
MSVC on Windows. It looks the same as what you told me. I'll fix it.

| > cd src\tools\msvc
| > build
| (..snipped..)
| Copying pg_config_os.h...
| Generating configuration headers...
| undefined symbol: HAVE_LIBPMEM at src/include/pg_config.h line 347
at C:/Users/menjo/Documents/git/postgres/src/tools/msvc/Mkvcbuild.pm
line 860.

Best regards,
Takashi


On Wed, Nov 3, 2021 at 10:04 PM Daniel Gustafsson  wrote:
>
> > On 28 Oct 2021, at 08:09, Takashi Menjo  wrote:
>
> > Rebased, and added the patches below into the patchset.
>
> Looks like the 0001 patch needs to be updated to support Windows and MSVC.  
> See
> src/tools/msvc/Mkvcbuild.pm and Solution.pm et.al for inspiration on how to 
> add
> the MSVC equivalent of --with-libpmem.  Currently the patch fails in the
> "Generating configuration headers" step in Solution.pm.
>
> --
> Daniel Gustafsson   https://vmware.com/
>


-- 
Takashi Menjo 




Re: Map WAL segment files on PMEM as WAL buffers

2021-10-28 Thread Takashi Menjo
Hi,

Rebased, and added the patches below into the patchset.

- (0006) Let wal_pmem_map be constant unless --with-libpmem
wal_pmem_map never changes from false in that case, so let it be
constant.  Thanks, Matthias!

- (0007) Ensure WAL mappings before assertion
This fixes a SIGSEGV crash in GetXLogBuffer when built with --enable-cassert.

- (0008) Update document
This adds a new entry for wal_pmem_map in the section Write Ahead Log
-> Settings.

Best regards,
Takashi

On Fri, Oct 8, 2021 at 5:07 PM Takashi Menjo  wrote:
>
> Hello Matthias,
>
> Thank you for your comment!
>
> > > [ v3-0002-Add-wal_pmem_map-to-GUC.patch ]
> > > +extern bool wal_pmem_map;
> >
> > A lot of the new code in these patches is gated behind this one flag,
> > but the flag should never be true on !pmem systems. Could you instead
> > replace it with something like the following?
> >
> > +#ifdef USE_LIBPMEM
> > +extern bool wal_pmem_map;
> > +#else
> > +#define wal_pmem_map false
> > +#endif
> >
> > A good compiler would then eliminate all the dead code from being
> > generated on non-pmem builds (instead of the compiler needing to keep
> > that code around just in case some extension decides to set
> > wal_pmem_map to true on !pmem systems because it has access to that
> > variable).
>
> That sounds good. I will introduce it in the next update.
>
> > > [ v3-0004-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch ]
> > > +if ((uintptr_t) addr & ~PG_DAX_HUGEPAGE_MASK)
> > > +elog(WARNING,
> > > + "file not mapped on DAX hugepage boundary: path \"%s\" addr 
> > > %p",
> > > + path, addr);
> >
> > I'm not sure that we should want to log this every time we detect the
> > issue; It's likely that once it happens it will happen for the next
> > file as well. Maybe add a timeout, or do we generally not deduplicate
> > such messages?
>
> Let me give it some thought.  I have believed this WARNING is most
> unlikely to happen, and is mutually independent from other happenings.
> I will try to find a case where the WARNING happens repeatedly; or I
> will de-duplicate the messages if it is easier.
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo 



-- 
Takashi Menjo 


v4-0001-Add-with-libpmem-option-for-PMEM-support.patch
Description: Binary data


v4-0002-Add-wal_pmem_map-to-GUC.patch
Description: Binary data


v4-0003-Export-InstallXLogFileSegment.patch
Description: Binary data


v4-0005-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
Description: Binary data


v4-0004-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
Description: Binary data


v4-0006-Let-wal_pmem_map-be-constant-unless-with-libpmem.patch
Description: Binary data


v4-0007-Ensure-WAL-mappings-before-assertion.patch
Description: Binary data


v4-0008-Update-document.patch
Description: Binary data


v4-0009-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Description: Binary data


Re: Map WAL segment files on PMEM as WAL buffers

2021-10-08 Thread Takashi Menjo
Hello Matthias,

Thank you for your comment!

> > [ v3-0002-Add-wal_pmem_map-to-GUC.patch ]
> > +extern bool wal_pmem_map;
>
> A lot of the new code in these patches is gated behind this one flag,
> but the flag should never be true on !pmem systems. Could you instead
> replace it with something like the following?
>
> +#ifdef USE_LIBPMEM
> +extern bool wal_pmem_map;
> +#else
> +#define wal_pmem_map false
> +#endif
>
> A good compiler would then eliminate all the dead code from being
> generated on non-pmem builds (instead of the compiler needing to keep
> that code around just in case some extension decides to set
> wal_pmem_map to true on !pmem systems because it has access to that
> variable).

That sounds good. I will introduce it in the next update.

> > [ v3-0004-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch ]
> > +if ((uintptr_t) addr & ~PG_DAX_HUGEPAGE_MASK)
> > +elog(WARNING,
> > + "file not mapped on DAX hugepage boundary: path \"%s\" addr 
> > %p",
> > + path, addr);
>
> I'm not sure that we should want to log this every time we detect the
> issue; It's likely that once it happens it will happen for the next
> file as well. Maybe add a timeout, or do we generally not deduplicate
> such messages?

Let me give it some thought.  I have assumed this WARNING is very
unlikely to happen and is independent of other events. I will try to
find a case where the WARNING happens repeatedly, or I will
de-duplicate the messages if that is easier.

Best regards,
Takashi

-- 
Takashi Menjo 




Re: Map WAL segment files on PMEM as WAL buffers

2021-06-29 Thread Takashi Menjo
Rebased.

-- 
Takashi Menjo 


v3-0001-Add-with-libpmem-option-for-PMEM-support.patch
Description: Binary data


v3-0002-Add-wal_pmem_map-to-GUC.patch
Description: Binary data


v3-0003-Export-InstallXLogFileSegment.patch
Description: Binary data


v3-0004-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch
Description: Binary data


v3-0005-WAL-statistics-in-cases-of-wal_pmem_map-true.patch
Description: Binary data


v3-0006-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch
Description: Binary data


Re: [PoC] Non-volatile WAL buffer

2021-03-08 Thread Takashi Menjo
Hi Tomas,

> Hello Takashi-san,
>
> On 3/5/21 9:08 AM, Takashi Menjo wrote:
> > Hi Tomas,
> >
> > Thank you so much for your report. I have read it with great interest.
> >
> > Your conclusion sounds reasonable to me. My patchset you call "NTT /
> > segments" got as good performance as "NTT / buffer" patchset. I have
> > been worried that calling mmap/munmap for each WAL segment file could
> > have a lot of overhead. Based on your performance tests, however, the
> > overhead looks less than what I thought. In addition, "NTT / segments"
> > patchset is more compatible to the current PG and more friendly to
> > DBAs because that patchset uses WAL segment files and does not
> > introduce any other new WAL-related file.
> >
>
> I agree. I was actually a bit surprised it performs this well, mostly in
> line with the "NTT / buffer" patchset. I've seen significant issues with
> our simple experimental patches, which however went away with larger WAL
> segments. But the "NTT / segments" patch does not have that issue, so
> either our patches were doing something wrong, or perhaps there was some
> other issue (not sure why larger WAL segments would improve that).
>
> Do these results match your benchmarks? Or are you seeing significantly
> different behavior?

I ran a performance test for "NTT / segments" under the same conditions
and added its results to my previous report [1]. The updated graph is
attached to this mail. Note that some legends are renamed: "Mapped WAL
file" to "NTT / simple", and "Non-volatile WAL buffer" to "NTT /
buffer."

The graph tells me that "NTT / segments" performs as well as "NTT /
buffer." This matches the results you reported.

> Do you have any thoughts regarding the impact of full-page writes? So
> far all the benchmarks we did focused on small OLTP transactions on data
> sets that fit into RAM. The assumption was that that's the workload that
> would benefit from this, but maybe that's missing something important
> about workloads producing much larger WAL records? Say, workloads
> working with large BLOBs, bulk loads etc.

I'd say that more work is needed for workloads producing a large
amount of WAL (in record count, record size, or both). Based on the
case Gang reported, which I have tried to reproduce in this thread
[2][3], the current insertion and flushing method may be unsuitable for
such workloads. That case was for "NTT / buffer," but I think it also
applies to "NTT / segments."

> The other question is whether simply placing WAL on DAX (without any
> code changes) is safe. If it's not, then all the "speedups" are computed
> with respect to unsafe configuration and so are useless. And BTT should
> be used instead, which would of course produce very different results.

I think it's safe, thanks to the checksum in the WAL record header
(xl_crc in struct XLogRecord). In DAX mode, user data (here, a WAL
record) is written to the PMEM device in smaller units (probably a byte
or a cache line) than the traditional 512-byte disk sector. So a torn
write in which some bytes in a sector persist and others do not can
occur on a crash. AFAICS, however, the checksum for WAL records also
covers such a torn-write case.

> > I also think that supporting both file I/O and mmap is better than
> > supporting only mmap. I will continue my work on "NTT / segments"
> > patchset to support both ways.
> >
>
> +1
>
> > In the following, I will answer "Issues & Questions" you reported.
> >
> >
> >> While testing the "NTT / segments" patch, I repeatedly managed to crash 
> >> the cluster with errors like this:
> >>
> >> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING:  creating 
> >> logfile segment just before
> >> mapping; path "pg_wal/00010007002F"
> >> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING:  creating 
> >> logfile segment just before
> >> mapping; path "pg_wal/000100070030"
> >> ...
> >> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING:  creating 
> >> logfile segment just before
> >> mapping; path "pg_wal/000100070030"
> >> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC:  could not 
> >> open file
> >> "pg_wal/000100070030": No such file or directory
> >>
> >> I do believe this is a thinko in the 0008 patch, which does XLogFileInit 
> >> in XLogFileMap. Not

Re: [PoC] Non-volatile WAL buffer

2021-03-05 Thread Takashi Menjo
37145] WARNING:  creating
> logfile segment just before mapping; path "pg_wal/000100070030"
> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC:  could not
> open file "pg_wal/000100070030": No such file or directory
>
> I do believe this is a thinko in the 0008 patch, which does XLogFileInit
> in XLogFileMap. Notice there are multiple "creating logfile" messages
> with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap
> may be called from multiple backends, so they may call XLogFileInit
> concurrently, likely triggering some sort of race condition. It's fairly
> rare issue, though - I've only seen it twice from ~20 runs.
>
>
> The other question I have is about WALInsertLockUpdateInsertingAt. 0003
> removes this function, but leaves behind some of the other bits working
> with insert locks and insertingAt. But it does not explain how it works
> without WaitXLogInsertionsToFinish() - how does it ensure that when we
> commit something, all the preceding WAL is "complete" (i.e. written by
> other backends etc.)?
>
>
> Conclusion
> --
>
> I do think the "NTT / segments" patch is the most promising way forward.
> It does perform about as well as the "NTT / buffer" patch (and both
> perform much better than the experimental patches I shared in January).
>
> The "NTT / buffer" patch seems much more disruptive - it introduces one
> large buffer for WAL, which makes various other tasks more complicated
> (i.e. it needs additional complexity to handle WAL archival, etc.). Are
> there some advantages of this patch (compared to the other patch)?
>
> As for the "NTT / segments" patch, I wonder if we can just rework the
> code like this (to use mmap etc.) or whether we need to support both
> both ways (file I/O and mmap). I don't have much experience with many
> other platforms, but it seems quite possible that mmap won't work all
> that well on some of them. So my assumption is we'll need to support
> both file I/O and mmap to make any of this committable, but I may be wrong.
>
>
> [1]
> https://www.postgresql.org/message-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg%40mail.gmail.com
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Takashi Menjo 




Re: [PoC] Non-volatile WAL buffer

2021-02-28 Thread Takashi Menjo
Hi Sawada,

I am relieved to hear that the performance problem was solved.

I have also added a tip about PMEM namespaces and partitioning to the
PostgreSQL wiki [1].

Regards,

[1] 
https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL#Configure_and_verify_DAX_hugepage_faults

-- 
Takashi Menjo 




Re: [PoC] Non-volatile WAL buffer

2021-02-23 Thread Takashi Menjo
Hi,

I ran a performance test in another environment. The steps, setup,
and postgresql.conf of the test are the same as those I sent on
Feb 17 [1], except for the following items:

# Setup
- Distro: Red Hat Enterprise Linux release 8.2 (Ootpa)
- C compiler: gcc-8.3.1-5.el8.x86_64
- libc: glibc-2.28-101.el8.x86_64
- Linux kernel: kernel-4.18.0-193.el8.x86_64
- PMDK: libpmem-1.6.1-1.el8.x86_64, libpmem-devel-1.6.1-1.el8.x86_64

See the attached figure for the results. In short, the v5 non-volatile
WAL buffer got better performance than the original (non-patched) one.

Regards,

[1] 
https://www.postgresql.org/message-id/caownp3ofofosftmeikqcbmp0ywdjn0kvb4ka_0tj+urq7dt...@mail.gmail.com

-- 
Takashi Menjo 


Re: [PoC] Non-volatile WAL buffer

2021-02-17 Thread Takashi Menjo
Hi Sawada,

Thank you for your performance report.

First, I'd say that the latest v5 non-volatile WAL buffer patchset
itself does not look bad. I ran a performance test of v5 and got better
performance than both the original (unpatched) build and our previous
work. See the attached figure for the results.

I think the steps and/or setups used by Tomas, you, and me could be
different, leading to the different performance results. So I show the
steps and setup for my performance test. Please see the tail of this
mail for them.

Also, I write performance tips to the PMEM page at PostgreSQL wiki
[1]. I wish it could be helpful to improve performance.

Regards,
Takashi

[1] https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL#Performance_tips



# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Steps
Note that I ran postgres server and pgbench in a single-machine system
but separated two NUMA nodes. PMEM and PCI SSD for the server process
are on the server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m
fsdax -M dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option
(sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0
/mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo
mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of
"Non-volatile WAL buffer"
07) Edit postgresql.conf as the attached one
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 --
pg_ctl -l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
12) Remount the PMEM and the PCIe SSD
13) Start postgres server process on NUMA node 0 again (numactl -N 0
-m 0 -- pg_ctl -l pg.log start)
14) Run pg_prewarm for all the four pgbench_* tables
15) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 --
pgbench -r -M prepared -T 1800 -c __ -j __)
- It executes the default tpcb-like transactions

I repeated all the steps three times for each (c,j), then took the
median "tps = __ (including connections establishing)" of the three as
throughput and the "latency average = __ ms" of that run as average
latency.
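Picking the three-run median out of the pgbench logs can be scripted; the sketch below is illustrative only (the `median_tps` helper name and the log file names are assumptions, not part of my setup), and it assumes each log contains one summary line of the form `tps = ... (including connections establishing)`:

```shell
#!/bin/sh
# Illustrative helper (not part of the patchset): given three pgbench
# log files, print the median "tps = ..." value.  Assumes each file has
# one summary line like:
#   tps = 123456.789012 (including connections establishing)
median_tps() {
    grep -h '^tps = .*including' "$@" |   # one summary line per run
        awk '{print $3}' |                # the numeric tps field
        sort -n |                         # order the three samples
        sed -n '2p'                       # the middle one is the median
}
```

For example, `median_tps c64j64_run1.log c64j64_run2.log c64j64_run3.log` would print the middle throughput of the three runs (the file names here are hypothetical).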

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT
disabled by BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6
channels per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket
x2 sockets (256 GiB per channel x 6 channels per socket; interleaving
enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7.0 (built by myself)
- Filesystem: ext4 (DAX enabled when using Optane PMem)
- PMDK: 1.9 (built by myself)
- PostgreSQL (Original): 9e7dbe3369cd8f5b0136c53b817471002505f934 (Jan
18, 2021 @ master)
- PostgreSQL (Mapped WAL file): Original + v5 of "Applying PMDK to WAL
operations for persistent memory" [2]
- PostgreSQL (Non-volatile WAL buffer): Original + v5 of "Non-volatile
WAL buffer" [3]; please read the files' prefix "v4-" as "v5-"

[2] 
https://www.postgresql.org/message-id/CAOwnP3O3O1GbHpddUAzT%3DCP3aMpX99%3D1WtBAfsRZYe2Ui53MFQ%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg%40mail.gmail.com

-- 
Takashi Menjo 


postgresql.conf
Description: Binary data


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2021-02-16 Thread Takashi Menjo
Rebased to make patchset v5.

I also found that my past replies have split the thread in the
pgsql-hackers archive. I am connecting this mail to the original
thread [1] and pointing here to the separated portions [2][3][4].
Note that the patchset v3 is in [3] and v4 is in [4].

Regards,

[1] 
https://www.postgresql.org/message-id/flat/C20D38E97BCB33DAD59E3A1%40lab.ntt.co.jp
[2] 
https://www.postgresql.org/message-id/flat/000501d4b794%245094d140%24f1be73c0%24%40lab.ntt.co.jp
[3] 
https://www.postgresql.org/message-id/flat/01d4b863%244c9e8fc0%24e5dbaf40%24%40lab.ntt.co.jp
[4] 
https://www.postgresql.org/message-id/flat/01d4c2a1%2488c6cc40%249a5464c0%24%40lab.ntt.co.jp

-- 
Takashi Menjo 


v5-0001-Add-configure-option-for-PMDK.patch
Description: Binary data


v5-0003-Walreceiver-WAL-IO-using-PMDK.patch
Description: Binary data


v5-0002-Read-write-WAL-files-using-PMDK.patch
Description: Binary data


Re: [PoC] Non-volatile WAL buffer

2021-02-16 Thread Takashi Menjo
Hi Takayuki,

Thank you for your helpful comments.

> In "Allocates WAL buffers on shared buffers", "shared buffers" should be
> DRAM because shared buffers in Postgres means the buffer cache for database
> data.
>

That's true. Fixed.


> I haven't tracked the whole thread, but could you collect information like
> the following?  I think such (partly basic) information will be helpful to
> decide whether it's worth casting more efforts into complex code, or it's
> enough to place WAL on DAX-aware filesystems and tune the filesystem.
>
> * What approaches other DBMSs take, and their performance gains (Oracle,
> SQL Server, HANA, Cassandra, etc.)
> The same DBMS should take different approaches depending on the file type:
> Oracle recommends different things to data files and REDO logs.
>

I also think it will be helpful. Adding "Other DBMSes using PMEM" section.

> * The storage capabilities of PMEM compared to the fast(est) alternatives
> such as NVMe SSD (read/write IOPS, latency, throughput, concurrency, which
> may be posted on websites like Tom's Hardware or SNIA)
>

This will be helpful, too. Adding "Basic performance" subsection under
"Overview of persistent memory (PMEM)."

> * What's the situation like on Windows?
>

Sorry, but I don't know Windows' PMEM support very well. All I know is that
Windows Server 2016 and 2019 support PMEM (2016 partially) [1] and that PMDK
supports Windows [2].

All the above contents will be updated gradually. Please stay tuned.

Regards,

[1]
https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/deploy-pmem
[2]
https://docs.pmem.io/persistent-memory/getting-started-guide/installing-pmdk/installing-pmdk-on-windows

-- 
Takashi Menjo 


Re: [PoC] Non-volatile WAL buffer

2021-02-15 Thread Takashi Menjo
Hi,

I made a new page at PostgreSQL Wiki to gather and summarize information
and discussion about PMEM-backed WAL designs and implementations. Some
parts of the page are TBD. I will continue to maintain the page. Requests
are welcome.

Persistent Memory for WAL
https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL

Regards,

-- 
Takashi Menjo 


Re: [PoC] Non-volatile WAL buffer

2021-01-29 Thread Takashi Menjo
Hi Tomas,

I'd answer your questions. (Not all for now, sorry.)


> Do I understand correctly that the patch removes "regular" WAL buffers
and instead writes the data into the non-volatile PMEM buffer, without
writing that to the WAL segments at all (unless in archiving mode)?
> Firstly, I guess many (most?) instances will have to write the WAL
segments anyway because of PITR/backups, so I'm not sure we can save much
here.

Mostly yes. My "non-volatile WAL buffer" patchset removes the regular
volatile WAL buffers and introduces non-volatile ones. All WAL goes into
the non-volatile buffers and persists there; no write-out of the buffers
to WAL segment files is required. However, in archiving mode or when the
buffers fill up (described later), both the non-volatile buffers and the
segment files are used.

In archiving mode with my patchset, each time one segment (16MB by
default) is completed in the non-volatile buffers, that segment is
written to a segment file asynchronously (by XLogBackgroundFlush). Then
it is archived by the existing archiving functionality.


> But more importantly - doesn't that mean the nvwal_size value is
essentially a hard limit? With max_wal_size, it's a soft limit i.e. we're
allowed to temporarily use more WAL when needed. But with a pre-allocated
file, that's clearly not possible. So what would happen in those cases?

Yes, nvwal_size is a hard limit, and I see it's a major weak point of my
patchset.

When all non-volatile WAL buffers are filled, the oldest segment on the
buffers is written (by XLogWrite) to a regular WAL segment file, then those
buffers are cleared (by AdvanceXLInsertBuffer) for new records. All WAL
record insertions to the buffers block until that write and clear are
complete. Due to that, all write transactions also block.

To make matters worse, if a checkpoint eventually occurs in such a
buffer-full case, record insertions would block for a certain time at the
end of the checkpoint because a large amount of the non-volatile buffers
will be cleared (see PreallocNonVolatileXlogBuffer). From a client's
view, it would look as if the postgres server freezes for a while.

Proper checkpointing would prevent such cases, but it could be hard to
control. When I reproduced Gang's case reported in this thread, such a
buffer-full freeze occurred.


> Also, is it possible to change nvwal_size? I haven't tried, but I wonder
what happens with the current contents of the file.

The value of nvwal_size should be equal to the actual size of the
nvwal_path file when postgres starts up. If they are not equal, postgres
panics at MapNonVolatileXLogBuffer (see nv_xlog_buffer.c), and the WAL
contents of the file remain as they were. So if an admin accidentally
changes the nvwal_size value, postgres simply will not start.

The file size may be extended or shrunk offline by the truncate(1)
command, but the WAL contents of the file would also have to be moved to
the proper offsets, because the insertion/recovery offset is calculated
by modulo, that is, the record's LSN % nvwal_size; otherwise we would
lose WAL. An offline tool for such an operation might be required, but
does not exist yet.


> The way I understand the current design is that we're essentially
switching from this architecture:
>
>clients -> wal buffers (DRAM) -> wal segments (storage)
>
> to this
>
>clients -> wal buffers (PMEM)
>
> (Assuming there we don't have to write segments because of archiving.)

Yes. Let me describe the current PostgreSQL design and how the patchsets
and works discussed in this thread change it, AFAIU:

  - Current PostgreSQL:
clients -[memcpy]-> buffers (DRAM) -[write]-> segments (disk)

  - Patch "pmem-with-wal-buffers-master.patch" Tomas posted:
clients -[memcpy]-> buffers (DRAM) -[pmem_memcpy]-> mmap-ed segments
(PMEM)

  - My "non-volatile WAL buffer" patchset:
clients -[pmem_memcpy(*)]-> buffers (PMEM)

  - My other patchset, mmap()ing segments as buffers:
clients -[pmem_memcpy(*)]-> mmap-ed segments as buffers (PMEM)

  - "Non-volatile Memory Logging" in PGcon 2016 [1][2][3]:
clients -[memcpy]-> buffers (WC[4] DRAM as pseudo PMEM) -[async
write]-> segments (disk)

  (* or memcpy + pmem_flush)

And I'd say that our previous work "Introducing PMDK into PostgreSQL,"
presented at PGCon 2018 [5], and its patchset [6 for the latest] are
based on the same idea as Tomas's patch above.


That's all for this mail. Please bear with me until the next one.

Best regards,
Takashi

[1] https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[2] https://github.com/meistervonperf/postgresql-NVM-logging
[3] https://github.com/meistervonperf/pseudo-pram
[4] https://www.kernel.org/doc/html/latest/x86/pat.html
[5] https://pgcon.org/2018/schedule/events/1154.en.html
[6]
https://www.postgresql.org/message-id/CAOwnP3ONd9uXPXKoc5AAfnpCnCyOna1ru6sU=ey_4wfmjak...@mail.gmail.com

-- 
Takashi Menjo 


Re: [PoC] Non-volatile WAL buffer

2021-01-27 Thread Takashi Menjo
Hi,

Now I have caught up with this thread. I see that many of you are
interested in performance profiling.

I share my slides in SNIA SDC 2020 [1]. In the slides, I had profiles
focused on XLogInsert and XLogFlush (mainly the latter) for my non-volatile
WAL buffer patchset. I found that the time for XLogWrite and
locking/unlocking WALWriteLock were eliminated by the patchset. Instead,
XLogInsert and WaitXLogInsertionsToFinish took more (or a little more) time
than before, because memcpy-ing to PMEM (Optane PMem) is slower than to DRAM.
For details, please see the slides.

Best regards,
Takashi

[1]
https://www.snia.org/educational-library/how-can-persistent-memory-make-databases-faster-and-how-could-we-go-ahead-2020


On Tue, Jan 26, 2021 at 18:50, Takashi Menjo wrote:

>  Dear everyone, Tomas,
>
> First of all, the "v4" patchset for non-volatile WAL buffer attached to
> the previous mail is actually v5... Please read "v4" as "v5."
>
> Then, to Tomas:
> Thank you for your crash report you gave on Nov 27, 2020, regarding msync
> patchset. I applied the latest msync patchset v3 attached to the previous
> to master 411ae64 (on Jan18, 2021) then tested it, and I got no error when
> pgbench -i -s 500. Please try it if necessary.
>
> Best regards,
> Takashi
>
>
> On Tue, Jan 26, 2021 at 17:52, Takashi Menjo wrote:
>
>> Dear everyone,
>>
>> Sorry but I forgot to attach my patchsets... Please see the files
>> attached to this mail. Please also note that they contain some fixes.
>>
>> Best regards,
>> Takashi
>>
>>
>> On Tue, Jan 26, 2021 at 17:46, Takashi Menjo wrote:
>>
>>> Dear everyone,
>>>
>>> I'm sorry for the late reply. I have rebased my two patchsets onto the
>>> latest master (411ae64). The one prefixed with v4 is for non-volatile WAL
>>> buffer; the other, prefixed with v3, is for msync.
>>>
>>> I will reply to your helpful feedback one by one within a few days. Please
>>> wait a moment.
>>>
>>> Best regards,
>>> Takashi
>>>
>>>
>>> 01/25/2021(Mon) 11:56 Masahiko Sawada :
>>>
>>>> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
>>>>  wrote:
>>>> >
>>>> >
>>>> >
>>>> > On 1/21/21 3:17 AM, Masahiko Sawada wrote:
>>>> > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
>>>> > >  wrote:
>>>> > >>
>>>> > >> Hi,
>>>> > >>
>>>> > >> I think I've managed to get the 0002 patch [1] rebased to master
>>>> and
>>>> > >> working (with help from Masahiko Sawada). It's not clear to me how
>>>> it
>>>> > >> could have worked as submitted - my theory is that an incomplete
>>>> patch
>>>> > >> was submitted by mistake, or something like that.
>>>> > >>
>>>> > >> Unfortunately, the benchmark results were kinda disappointing. For
>>>> a
>>>> > >> pgbench on scale 500 (fits into shared buffers), an average of
>>>> three
>>>> > >> 5-minute runs looks like this:
>>>> > >>
>>>> > >> branch                 1      16      32      64      96
>>>> > >> --------------------------------------------------------
>>>> > >> master              7291   87704  165310  150437  224186
>>>> > >> ntt                 7912  106095  213206  212410  237819
>>>> > >> simple-no-buffers   7654   96544  115416   95828  103065
>>>> > >>
>>>> > >> NTT refers to the patch from September 10, pre-allocating a large
>>>> WAL
>>>> > >> file on PMEM, and simple-no-buffers is the simpler patch simply
>>>> removing
>>>> > >> the WAL buffers and writing directly to a mmap-ed WAL segment on
>>>> PMEM.
>>>> > >>
>>>> > >> Note: The patch is just replacing the old implementation with mmap.
>>>> > >> That's good enough for experiments like this, but we probably want
>>>> to
>>>> > >> keep the old one for setups without PMEM. But it's good enough for
>>>> > >> testing, benchmarking etc.
>>>> > >>
>>>> > >> Unfortunately, the results for this simple approach are pretty
>>>> bad. Not
>>>> > >&

Re: [PoC] Non-volatile WAL buffer

2021-01-26 Thread Takashi Menjo
 Dear everyone, Tomas,

First of all, the "v4" patchset for non-volatile WAL buffer attached to the
previous mail is actually v5... Please read "v4" as "v5."

Then, to Tomas:
Thank you for your crash report you gave on Nov 27, 2020, regarding msync
patchset. I applied the latest msync patchset v3 attached to the previous
to master 411ae64 (on Jan18, 2021) then tested it, and I got no error when
pgbench -i -s 500. Please try it if necessary.

Best regards,
Takashi


On Tue, Jan 26, 2021 at 17:52, Takashi Menjo wrote:

> Dear everyone,
>
> Sorry but I forgot to attach my patchsets... Please see the files attached
> to this mail. Please also note that they contain some fixes.
>
> Best regards,
> Takashi
>
>
> On Tue, Jan 26, 2021 at 17:46, Takashi Menjo wrote:
>
>> Dear everyone,
>>
>> I'm sorry for the late reply. I have rebased my two patchsets onto the
>> latest master (411ae64). The one prefixed with v4 is for non-volatile WAL
>> buffer; the other, prefixed with v3, is for msync.
>>
>> I will reply to your helpful feedback one by one within a few days. Please
>> wait a moment.
>>
>> Best regards,
>> Takashi
>>
>>
>> 01/25/2021(Mon) 11:56 Masahiko Sawada :
>>
>>> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
>>>  wrote:
>>> >
>>> >
>>> >
>>> > On 1/21/21 3:17 AM, Masahiko Sawada wrote:
>>> > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
>>> > >  wrote:
>>> > >>
>>> > >> Hi,
>>> > >>
>>> > >> I think I've managed to get the 0002 patch [1] rebased to master and
>>> > >> working (with help from Masahiko Sawada). It's not clear to me how
>>> it
>>> > >> could have worked as submitted - my theory is that an incomplete
>>> patch
>>> > >> was submitted by mistake, or something like that.
>>> > >>
>>> > >> Unfortunately, the benchmark results were kinda disappointing. For a
>>> > >> pgbench on scale 500 (fits into shared buffers), an average of three
>>> > >> 5-minute runs looks like this:
>>> > >>
>>> > >> branch                 1      16      32      64      96
>>> > >> --------------------------------------------------------
>>> > >> master              7291   87704  165310  150437  224186
>>> > >> ntt                 7912  106095  213206  212410  237819
>>> > >> simple-no-buffers   7654   96544  115416   95828  103065
>>> > >>
>>> > >> NTT refers to the patch from September 10, pre-allocating a large
>>> WAL
>>> > >> file on PMEM, and simple-no-buffers is the simpler patch simply
>>> removing
>>> > >> the WAL buffers and writing directly to a mmap-ed WAL segment on
>>> PMEM.
>>> > >>
>>> > >> Note: The patch is just replacing the old implementation with mmap.
>>> > >> That's good enough for experiments like this, but we probably want
>>> to
>>> > >> keep the old one for setups without PMEM. But it's good enough for
>>> > >> testing, benchmarking etc.
>>> > >>
>>> > >> Unfortunately, the results for this simple approach are pretty bad.
>>> Not
>>> > >> only compared to the "ntt" patch, but even to master. I'm not
>>> entirely
>>> > >> sure what's the root cause, but I have a couple hypotheses:
>>> > >>
>>> > >> 1) bug in the patch - That's clearly a possibility, although I've
>>> tried
>>> > >> tried to eliminate this possibility.
>>> > >>
>>> > >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster
>>> than
>>> > >> NVMe storage, but still much slower than DRAM (both in terms of
>>> latency
>>> > >> and bandwidth, see [2] for some data). It's not terrible, but the
>>> > >> latency is maybe 2-3x higher - not a huge difference, but may
>>> matter for
>>> > >> WAL buffers?
>>> > >>
>>> > >> 3) PMEM does not handle parallel writes well - If you look at [2],
>>> > >> Figure 4(b), you'll see that the throughput actually *drops" as the
>>> > >> number of threads increase. That's pretty strange / annoying,
>>> because

Re: [PoC] Non-volatile WAL buffer

2021-01-26 Thread Takashi Menjo
> > >> branch                 1      16      32      64      96
> > >> --------------------------------------------------------
> > >> master              6635   88524  171106  163387  245307
> > >> ntt                 7909  106826  217364  223338  242042
> > >> simple-no-buffers   7871  101575  199403  188074  224716
> > >> with-wal-buffers    7643  101056  206911  223860  261712
> > >>
> > >> So yeah, there's a clear difference. It changes the values for
> "master"
> > >> a bit, but both the "simple" patches (with and without) WAL buffers
> are
> > >> much faster. The with-wal-buffers is almost equal to the  NTT patch,
> > >> which was using 96GB file. I presume larger WAL segments would get
> even
> > >> closer, if we supported them.
> > >>
> > >> I'll continue investigating this, but my conclusion so far seem to be
> > >> that we can't really replace WAL buffers with PMEM - that seems to
> > >> perform much worse.
> > >>
> > >> The question is what to do about the segment size. Can we reduce the
> > >> overhead of mmap-ing individual segments, so that this works even for
> > >> smaller WAL segments, to make this useful for common instances (not
> > >> everyone wants to run with 1GB WAL). Or whether we need to adopt the
> > >> design with a large file, mapped just once.
> > >>
> > >> Another question is whether it's even worth the extra complexity. On
> > >> 16MB segments the difference between master and NTT patch seems to be
> > >> non-trivial, but increasing the WAL segment size kinda reduces that.
> So
> > >> maybe just using File I/O on PMEM DAX filesystem seems good enough.
> > >> Alternatively, maybe we could switch to libpmemblk, which should
> > >> eliminate the filesystem overhead at least.
> > >
> > > I think the performance improvement by NTT patch with the 16MB WAL
> > > segment, the most common WAL segment size, is very good (150437 vs.
> > > 212410 with 64 clients). But maybe evaluating writing WAL segment
> > > files on PMEM DAX filesystem is also worth, as you mentioned, if we
> > > don't do that yet.
> > >
> >
> > Well, not sure. I think the question is still open whether it's actually
> > safe to run on DAX, which does not have atomic writes of 512B sectors,
> > and I think we rely on that e.g. for pg_config. But maybe for WAL that's
> > not an issue.
>
> I think we can use the Block Translation Table (BTT) driver that
> provides atomic sector updates.
>
> >
> > > Also, I'm interested in why the through-put of NTT patch saturated at
> > > 32 clients, which is earlier than the master's one (96 clients). How
> > > many CPU cores are there on the machine you used?
> > >
> >
> >  From what I know, this is somewhat expected for PMEM devices, for a
> > bunch of reasons:
> >
> > 1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so
> > it takes fewer processes to saturate it.
> >
> > 2) Internally, the PMEM has a 256B buffer for writes, used for combining
> > etc. With too many processes sending writes, it becomes to look more
> > random, which is harmful for throughput.
> >
> > When combined, this means the performance starts dropping at certain
> > number of threads, and the optimal number of threads is rather low
> > (something like 5-10). This is very different behavior compared to DRAM.
>
> Makes sense.
>
> >
> > There's a nice overview and measurements in this paper:
> >
> > Building blocks for persistent memory / How to get the most out of your
> > new memory?
> > Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
> > Kemper
> >
> > https://link.springer.com/article/10.1007/s00778-020-00622-9
>
> Thank you. I'll read it.
>
> >
> >
> > >> I'm also wondering if WAL is the right usage for PMEM. Per [2]
> there's a
> > >> huge read-write assymmetry (the writes being way slower), and their
> > >> recommendation (in "Observation 3" is)
> > >>
> > >>   The read-write asymmetry of PMem im-plies the necessity of
> avoiding
> > >>   writes as much as possible for PMem.
> > >>
> > >> So maybe we should not be trying to use PMEM for WAL, which is pretty
> > >> write-heavy (and in most cases even write-only).
> > >
> > > I think using PMEM for WAL is cost-effective but it leverages the only
> > > low-latency (sequential) write, but not other abilities such as
> > > fine-grained access and low-latency random write. If we want to
> > > exploit its all ability we might need some drastic changes to logging
> > > protocol while considering storing data on PMEM.
> > >
> >
> > True. I think investigating whether it's sensible to use PMEM for this
> > purpose. It may turn out that replacing the DRAM WAL buffers with writes
> > directly to PMEM is not economical, and aggregating data in a DRAM
> > buffer is better :-(
>
> Yes. I think it might be interesting to do an analysis of the
> bottlenecks of NTT patch by perf etc. If bottlenecks are moved to
> other places by removing WALWriteLock during flush, it's probably a
> good sign for further performance improvements. IIRC WALWriteLock is
> one of the main bottlenecks on OLTP workload, although my memory might
> already be out of date.
>
> Regards,
>
> --
> Masahiko Sawada
> EDB:  https://www.enterprisedb.com/
>


-- 
Takashi Menjo 


Re: [PoC] Non-volatile WAL buffer

2020-11-05 Thread Takashi Menjo
Hi Gang,

I appreciate your patience. I reproduced the results you reported to me
in my environment.

First of all, the configuration you gave me was a little unstable in my
environment, so I made the values of {max_,min_,nv}wal_size larger and the
pre-warm duration longer to get stable performance. I did not modify your
table, your query, or the benchmark duration.

Under the stable condition, Original (PMEM) still got better performance
than Non-volatile WAL Buffer. To sum up, the reason was that Non-volatile
WAL Buffer on Optane PMem spent much more time than Original (PMEM) for
XLogInsert when using your table and query. It offset the improvement of
XLogFlush, and degraded performance in total. VTune told me that
Non-volatile WAL Buffer took more CPU time than Original (PMEM) for
(XLogInsert => XLogInsertRecord => CopyXLogRecordsToWAL =>) memcpy while it
took less time for XLogFlush. This profile was very similar to the one you
reported.

In general, when the WAL buffers are on Optane PMem rather than DRAM, it
obviously takes more time to memcpy WAL records into the buffers because
Optane PMem is somewhat slower than DRAM. In return, Non-volatile WAL
Buffer reduces the time needed to make the records durable: it doesn't
need to write them out of the buffers to somewhere else, but just needs
to flush them out of the CPU caches to the underlying memory-mapped
file.

Your report shows that Non-volatile WAL Buffer on Optane PMem is not good
for certain kinds of transactions, and is good for others. I have tried to
fix how to insert and flush WAL records, or the configurations or constants
that could change performance, such as NUM_XLOGINSERT_LOCKS, but
Non-volatile WAL Buffer has not yet achieved better performance than
Original (PMEM) when using your table and query. I will continue to work on this
issue and will report if I have any update.

By the way, did the progress throughput reported by pgbench with the -P
option drop to zero when you ran Non-volatile WAL Buffer? If so, your
{max_,min_,nv}wal_size might be too small or your checkpoint
configuration might be inappropriate. Could you check your results again?

Best regards,
Takashi

-- 
Takashi Menjo 


Re: [PoC] Non-volatile WAL buffer

2020-10-29 Thread Takashi Menjo
Hi Heikki,

> I had a new look at this thread today, trying to figure out where we are.
> I'm a bit confused.
>
> One thing we have established: mmap()ing WAL files performs worse than
> the current method, if pg_wal is not on a persistent memory device. This
> is because the kernel faults in existing content of each page, even
> though we're overwriting everything.
Yes. In addition, after a certain page (in the sense of an OS page) is
msync()ed, another page fault will occur when something is next stored
into that page.

> That's unfortunate. I was hoping that mmap() would be a good option even
> without persistent memory hardware. I wish we could tell the kernel to
> zero the pages instead of reading them from the file. Maybe clear the
> file with ftruncate() before mmapping it?
The area extended by ftruncate() appears as if it were zero-filled [1].
Please note that it merely "appears as if": it might not actually be
backed by zero-filled data blocks on the device, so pre-allocating files
should improve transaction performance. At least on Linux 5.7 and ext4,
it takes more time to store into a mapped file that was just
open(O_CREAT)ed and ftruncate()d than into one that was actually filled
beforehand.

> That should not be problem with a real persistent memory device, however
> (or when emulating it with DRAM). With DAX, the storage is memory-mapped
> directly and there is no page cache, and no pre-faulting.
Yes, with filesystem DAX there is no page cache for file data. A page
fault still occurs, but per 2MiB DAX hugepage, so its overhead decreases
compared with 4KiB page faults. Such DAX hugepage faults apply only to
DAX-mapped files and are different from general transparent hugepage
faults.

> Because of that, I'm baffled by what the
> v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it
> correctly, it puts the WAL buffers in a separate file, which is stored
> on the NVRAM. Why? I realize that this is just a Proof of Concept, but
> I'm very much not interested in anything that requires the DBA to manage
> a second WAL location. Did you test the mmap() patches with persistent
> memory hardware? Did you compare that with the pmem patchset, on the
> same hardware? If there's a meaningful performance difference between
> the two, what's causing it?
Yes, this patchset puts the WAL buffers into the file specified by
"nvwal_path" in postgresql.conf.

The reason this patchset puts the buffers into a separate file, not the
existing segment files in PGDATA/pg_wal, is that doing so reduces the
overhead of system calls such as open(), mmap(), munmap(), and close().
It open()s and mmap()s the "nvwal_path" file once and keeps that file
mapped while running. With the patchset that mmap()s the segment files,
on the other hand, a backend process must munmap() and close() the
currently mapped file and open() and mmap() the next one each time its
insert location crosses into a new segment. This causes the performance
difference between the two.

Best regards,
Takashi

[1]
https://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

-- 
Takashi Menjo 


RE: [PoC] Non-volatile WAL buffer

2020-10-14 Thread Takashi Menjo
Hi Gang,

Thanks. I have tried to reproduce the performance degradation using your
configuration, query, and steps. Today I got results in which Original
(PMEM) achieved better performance than Non-volatile WAL Buffer in my
Ubuntu environment. I am now investigating further.

Best regards,
Takashi

-- 
Takashi Menjo 
NTT Software Innovation Center

> -Original Message-
> From: Deng, Gang 
> Sent: Friday, October 9, 2020 3:10 PM
> To: Takashi Menjo 
> Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' 
> Subject: RE: [PoC] Non-volatile WAL buffer
> 
> Hi Takashi,
> 
> There are some differences between our HW/SW configuration and test steps. I 
> attached postgresql.conf I used
> for your reference. I would like to try postgresql.conf and steps you 
> provided in the later days to see if I can find
> cause.
> 
> I also ran pgbench and postgres server on the same server but on different 
> NUMA node, and ensure server process
> and PMEM on the same NUMA node. I used similar steps are yours from step 1 to 
> 9. But some difference in later
> steps, major of them are:
> 
> In step 10), I created a database and table for test by:
> #create database:
> psql -c "create database insert_bench;"
> #create table:
> psql -d insert_bench -c "create table test(crt_time timestamp, info text 
> default
> '75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc
> 48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1
> d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d7
> 9a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"
> 
> in step 15), I did not use pg_prewarm, but just ran pg_bench for 180 seconds 
> to warm up.
> In step 16), I ran pgbench using command: pgbench -M prepared -n -r -P 10 -f 
> ./test.sql -T 600 -c _ -j _
> insert_bench. (test.sql can be found in attachment)
> 
> For HW/SW conf, the major differences are:
> CPU: I used Xeon 8268 (24c@2.9Ghz, HT enabled) OS Distro: CentOS 8.2.2004
> Kernel: 4.18.0-193.6.3.el8_2.x86_64
> GCC: 8.3.1
> 
> Best regards
> Gang
> 
> -Original Message-
> From: Takashi Menjo 
> Sent: Tuesday, October 6, 2020 4:49 PM
> To: Deng, Gang 
> Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' 
> Subject: RE: [PoC] Non-volatile WAL buffer
> 
> Hi Gang,
> 
> I have tried but cannot yet reproduce the performance degradation you reported 
> when inserting 328-byte records, so I think our conditions differ, such as in 
> steps to reproduce, postgresql.conf, installation setup, and so on.
> 
> My results and condition are as follows. May I have your condition in more 
> detail? Note that I refer to your "Storage
> over App Direct" as my "Original (PMEM)" and "NVWAL patch" to "Non-volatile 
> WAL buffer."
> 
> Best regards,
> Takashi
> 
> 
> # Results
> See the attached figure. In short, Non-volatile WAL buffer got better 
> performance than Original (PMEM).
> 
> # Steps
> Note that I ran postgres server and pgbench in a single-machine system but 
> separated two NUMA nodes. PMEM
> and PCI SSD for the server process are on the server-side NUMA node.
> 
> 01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax 
> -M dev -e namespace0.0)
> 02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo 
> mkfs.ext4 -q -F /dev/pmem0 ; sudo
> mount -o dax /dev/pmem0 /mnt/pmem0)
> 03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 
> -q -F /dev/nvme0n1 ; sudo mount
> /dev/nvme0n1 /mnt/nvme0n1)
> 04) Make /mnt/pmem0/pg_wal directory for WAL
> 05) Make /mnt/nvme0n1/pgdata directory for PGDATA
> 06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
> - Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of 
> Non-volatile WAL buffer
> 07) Edit postgresql.conf as the attached one
> - Please remove nvwal_* lines in the case of Original (PMEM)
> 08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl 
> -l pg.log start)
> 09) Create a database (createdb --locale=C --encoding=UTF8)
> 10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
> 11) Change # characters of "filler" column of "pgbench_history" table to 300 
> (ALTER TABLE pgbench_history
> ALTER filler TYPE character(300);)
> - This would make the row size of the table 328 bytes
> 12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
> 13) Remount the PMEM and the PCIe SSD
> 14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- 
> pg_ctl -l pg.log start)

RE: [PoC] Non-volatile WAL buffer

2020-10-06 Thread Takashi Menjo
Hi Gang,

I have tried but cannot yet reproduce the performance degradation you reported 
when inserting 328-byte records, so I think our conditions differ, such as in 
steps to reproduce, postgresql.conf, installation setup, and so on.

My results and conditions are as follows. Could you share your conditions in 
more detail? Note that I refer to your "Storage over App Direct" as my 
"Original (PMEM)" and your "NVWAL patch" as "Non-volatile WAL buffer."

Best regards,
Takashi


# Results
See the attached figure. In short, Non-volatile WAL buffer got better 
performance than Original (PMEM).

# Steps
Note that I ran the postgres server and pgbench on a single machine but on two 
separate NUMA nodes. The PMEM and PCIe SSD for the server process are on the 
server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M 
dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo 
mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q 
-F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile 
WAL buffer
07) Edit postgresql.conf as the attached one
- Please remove nvwal_* lines in the case of Original (PMEM)
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl 
-l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change # characters of "filler" column of "pgbench_history" table to 300 
(ALTER TABLE pgbench_history ALTER filler TYPE character(300);)
- This would make the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- 
pg_ctl -l pg.log start)
15) Run pg_prewarm for all the four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r 
-M prepared -T 1800 -c __ -j __)
- It executes the default tpcb-like transactions

I repeated all the steps three times for each (c,j) pair, then took the median 
"tps = __ (including connections establishing)" of the three as throughput and 
the "latency average = __ ms" of that run as average latency.

# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by 
BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels 
per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 
sockets (256 GiB per channel x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7 (vanilla)
- Filesystem: ext4 (DAX enabled when using Optane PMem)
- PMDK: 1.9
- PostgreSQL (Original): 14devel (200f610: Jul 26, 2020)
- PostgreSQL (Non-volatile WAL buffer): 14devel (200f610: Jul 26, 2020) + 
non-volatile WAL buffer patchset v4

-- 
Takashi Menjo 
NTT Software Innovation Center

> -Original Message-
> From: Takashi Menjo 
> Sent: Thursday, September 24, 2020 2:38 AM
> To: Deng, Gang 
> Cc: pgsql-hack...@postgresql.org; Takashi Menjo 
> 
> Subject: Re: [PoC] Non-volatile WAL buffer
> 
> Hello Gang,
> 
> Thank you for your report. I have not taken care of record size deeply yet, 
> so your report is very interesting. I will
> also have a test like yours then post results here.
> 
> Regards,
> Takashi
> 
> 
> On Mon, Sep 21, 2020 at 14:14, Deng, Gang  <mailto:gang.d...@intel.com> > wrote:
> 
> 
>   Hi Takashi,
> 
> 
> 
>   Thank you for the patch and work on accelerating PG performance with 
> NVM. I applied the patch and made
> some performance test based on the patch v4. I stored database data files on 
> NVMe SSD and stored WAL file on
> Intel PMem (NVM). I used two methods to store WAL file(s):
> 
>   1.  Leverage your patch to access PMem with libpmem (NVWAL patch).
> 
>   2.  Access PMem with legacy filesystem interface, that means use 
> PMem as ordinary block device, no
> PG patch is required to access PMem (Storage over App Direct).
> 
> 
> 
>   I tried two insert scenarios:
> 
>   A. Insert small record (length of record to be inserted is 24 bytes)

Re: [PoC] Non-volatile WAL buffer

2020-09-23 Thread Takashi Menjo
Hello Gang,

Thank you for your report. I have not looked deeply into record size yet,
so your report is very interesting. I will also run a test like yours and
then post results here.

Regards,
Takashi


On Mon, Sep 21, 2020 at 14:14, Deng, Gang wrote:

> Hi Takashi,
>
>
>
> Thank you for the patch and work on accelerating PG performance with NVM.
> I applied the patch and made some performance test based on the patch v4. I
> stored database data files on NVMe SSD and stored WAL file on Intel PMem
> (NVM). I used two methods to store WAL file(s):
>
> 1.  Leverage your patch to access PMem with libpmem (NVWAL patch).
>
> 2.  Access PMem with legacy filesystem interface, that means use PMem
> as ordinary block device, no PG patch is required to access PMem (Storage
> over App Direct).
>
>
>
> I tried two insert scenarios:
>
> A. Insert small record (length of record to be inserted is 24 bytes);
> I think it is similar to your test
>
> B.  Insert large record (length of record to be inserted is 328 bytes)
>
>
>
> My original purpose was to see a higher performance gain in scenario B, as it
> is more write-intensive on WAL. But I observed that the NVWAL patch method had
> ~5% performance improvement compared with the Storage over App Direct method
> in scenario A, while it had ~20% performance degradation in scenario B.
>
>
>
> I investigated the test further. I found that the NVWAL patch can improve the
> performance of the XLogFlush function, but it may hurt the performance of the
> CopyXlogRecordToWAL function. This may be related to the higher latency of
> memcpy to Intel PMem compared with DRAM. Here are key data from my test:
>
>
>
> Scenario A (length of record to be inserted: 24 bytes per record):
> ===================================================================
>
>                                      NVWAL    SoAD
>                                     ------  ------
> Throughput (10^3 TPS)                310.5   296.0
> CPU Time % of CopyXlogRecordToWAL      0.4     0.2
> CPU Time % of XLogInsertRecord         1.5     0.8
> CPU Time % of XLogFlush                2.1     9.6
>
>
>
> Scenario B (length of record to be inserted: 328 bytes per record):
> ====================================================================
>
>                                      NVWAL    SoAD
>                                     ------  ------
> Throughput (10^3 TPS)                 13.0    16.9
> CPU Time % of CopyXlogRecordToWAL      3.0     1.6
> CPU Time % of XLogInsertRecord        23.0    16.4
> CPU Time % of XLogFlush                2.3     5.9
>
>
>
> Best Regards,
>
> Gang
>
>
>
> *From:* Takashi Menjo 
> *Sent:* Thursday, September 10, 2020 4:01 PM
> *To:* Takashi Menjo 
> *Cc:* pgsql-hack...@postgresql.org
> *Subject:* Re: [PoC] Non-volatile WAL buffer
>
>
>
> Rebased.
>
>
>
>
>
On Wed, Jun 24, 2020 at 16:44, Takashi Menjo wrote:
>
> Dear hackers,
>
> I update my non-volatile WAL buffer's patchset to v3.  Now we can use it
> in streaming replication mode.
>
> Updates from v2:
>
> - walreceiver supports non-volatile WAL buffer
> Now walreceiver stores received records directly to non-volatile WAL
> buffer if applicable.
>
> - pg_basebackup supports non-volatile WAL buffer
> Now pg_basebackup copies received WAL segments onto non-volatile WAL
> buffer if you run it with "nvwal" mode (-Fn).
> You should specify a new NVWAL path with --nvwal-path option.  The path
> will be written to postgresql.auto.conf or recovery.conf.  The size of the
> new NVWAL is same as the master's one.
>
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo 
> NTT Software Innovation Center
>
> > -Original Message-
> > From: Takashi Menjo 
> > Sent: Wednesday, March 18, 2020 5:59 PM
> > To: 'PostgreSQL-development' 
> > Cc: 'Robert Haas' ; 'Heikki Linnakangas' <
> hlinn...@iki.fi>; 'Amit Langote'
> > 
> > Subject: RE: [PoC] Non-volatile WAL buffer
> >
> > Dear hackers,
> >
> > I rebased my non-volatile WAL buffer's patchset onto master.  A new v2
> patchset is attached to this mail.
> >
> > I also measured performance before and after patchset, varying
> -c/--client and -j/--jobs options of pgbench, for
> > each scaling factor s = 50 or 1000.  The results are presented in the
> following tables and the attached charts.
> Conditions, steps, and other details will be shown later.

Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2020-08-04 Thread Takashi Menjo
Dear hackers,

I rebased my old patchset.  It would be good to compare this v4 patchset to
non-volatile WAL buffer's one [1].

[1]
https://www.postgresql.org/message-id/002101d649fb$1f5966e0$5e0c34a0$@hco.ntt.co.jp_1

Regards,
Takashi

-- 
Takashi Menjo 


v4-0001-Add-configure-option-for-PMDK.patch
Description: Binary data


v4-0003-Walreceiver-WAL-IO-using-PMDK.patch
Description: Binary data


v4-0002-Read-write-WAL-files-using-PMDK.patch
Description: Binary data


Re: Remove page-read callback from XLogReaderState.

2020-07-16 Thread Takashi Menjo
0 in
src/backend/access/transam/xlog.c.


Regards,
Takashi



On Thu, Jul 2, 2020 at 13:53, Kyotaro Horiguchi wrote:

> cfbot is complaining as this is no longer applicable. Rebased.
>
> In v14, some reference to XLogReaderState parameter to read_pages
> functions are accidentally replaced by the reference to the global
> variable xlogreader. Fixed it, too.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


-- 
Takashi Menjo 


RE: [PoC] Non-volatile WAL buffer

2020-06-24 Thread Takashi Menjo
Dear hackers,

I have updated my non-volatile WAL buffer patchset to v3.  Now we can use it in 
streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly to non-volatile WAL buffer if 
applicable.

- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto non-volatile WAL buffer if 
you run it with "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option.  The path will 
be written to postgresql.auto.conf or recovery.conf.  The size of the new NVWAL 
is the same as the master's.


Best regards,
Takashi

-- 
Takashi Menjo 
NTT Software Innovation Center

> -Original Message-
> From: Takashi Menjo 
> Sent: Wednesday, March 18, 2020 5:59 PM
> To: 'PostgreSQL-development' 
> Cc: 'Robert Haas' ; 'Heikki Linnakangas' 
> ; 'Amit Langote'
> 
> Subject: RE: [PoC] Non-volatile WAL buffer
> 
> Dear hackers,
> 
> I rebased my non-volatile WAL buffer's patchset onto master.  A new v2 
> patchset is attached to this mail.
> 
> I also measured performance before and after patchset, varying -c/--client 
> and -j/--jobs options of pgbench, for
> each scaling factor s = 50 or 1000.  The results are presented in the 
> following tables and the attached charts.
> Conditions, steps, and other details will be shown later.
> 
> 
> Results (s=50)
> ==============
>          Throughput [10^3 TPS]   Average latency [ms]
> ( c, j)  before   after          before  after
> -------  -------  -------------  ------  -------------
> ( 8, 8)     35.7  37.1 (+3.9%)    0.224  0.216 (-3.6%)
> (18,18)     70.9  74.7 (+5.3%)    0.254  0.241 (-5.1%)
> (36,18)     76.0  80.8 (+6.3%)    0.473  0.446 (-5.7%)
> (54,18)     75.5  81.8 (+8.3%)    0.715  0.660 (-7.7%)
> 
> 
> Results (s=1000)
> ================
>          Throughput [10^3 TPS]   Average latency [ms]
> ( c, j)  before   after          before  after
> -------  -------  -------------  ------  -------------
> ( 8, 8)     37.4  40.1 (+7.3%)    0.214  0.199 (-7.0%)
> (18,18)     79.3  86.7 (+9.3%)    0.227  0.208 (-8.4%)
> (36,18)     87.2  95.5 (+9.5%)    0.413  0.377 (-8.7%)
> (54,18)     86.8  94.8 (+9.3%)    0.622  0.569 (-8.5%)
> 
> 
> Both throughput and average latency are improved for each scaling factor.  
> Throughput seemed to almost reach
> the upper limit when (c,j)=(36,18).
> 
> The percentage in s=1000 case looks larger than in s=50 case.  I think larger 
> scaling factor leads to less
> contentions on the same tables and/or indexes, that is, less lock and unlock 
> operations.  In such a situation,
> write-ahead logging appears to be more significant for performance.
> 
> 
> Conditions
> ==
> - Use one physical server having 2 NUMA nodes (node 0 and 1)
>   - Pin postgres (server processes) to node 0 and pgbench to node 1
>   - 18 cores and 192GiB DRAM per node
> - Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
>   - Both are installed on the server-side node, that is, node 0
>   - Both are formatted with ext4
>   - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
> - Use the attached postgresql.conf
>   - Two new items nvwal_path and nvwal_size are used only after patch
> 
> 
> Steps
> =
> For each (c,j) pair, I did the following steps three times then I found the 
> median of the three as a final result shown
> in the tables above.
> 
> (1) Run initdb with proper -D and -X options; and also give --nvwal-path and 
> --nvwal-size options after patch
> (2) Start postgres and create a database for pgbench tables
> (3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
> (4) Stop postgres, remount filesystems, and start postgres again
> (5) Execute pg_prewarm extension for all the four pgbench tables
> (6) Run pgbench during 30 minutes
> 
> 
> pgbench command line
> 
> $ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ 
> dbname
> 
> I gave no -b option to use the built-in "TPC-B (sort-of)" query.
> 
> 
> Software
> 
> - Distro: Ubuntu 18.04
> - Kernel: Linux 5.4 (vanilla kernel)
> - C Compiler: gcc 7.4.0
> - PMDK: 1.7
> - PostgreSQL: d677550 (master on Mar 3, 2020)
> 
> 
> Hardware
> 
> - System: HPE ProLiant DL380 Gen10
> - CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
> - DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
> - NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
> - NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA
> 
> 
> Best regards,
> Takashi
> 
> --
> Takashi Menjo  NTT Software Innovation Center

RE: [PoC] Non-volatile WAL buffer

2020-02-20 Thread Takashi Menjo
Dear Amit,

Thank you for your advice.  Exactly, it's so to speak "do as the hackers do 
when in pgsql"...

I'm rebasing my branch onto master.  I'll submit an updated patchset and 
performance report later.

Best regards,
Takashi

-- 
Takashi Menjo 
NTT Software Innovation Center

> -Original Message-
> From: Amit Langote 
> Sent: Monday, February 17, 2020 5:21 PM
> To: Takashi Menjo 
> Cc: Robert Haas ; Heikki Linnakangas 
> ; PostgreSQL-development
> 
> Subject: Re: [PoC] Non-volatile WAL buffer
> 
> Hello,
> 
> On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo 
>  wrote:
> > Hello Amit,
> >
> > > I apologize for not having any opinion on the patches themselves,
> > > but let me point out that it's better to base these patches on HEAD
> > > (master branch) than REL_12_0, because all new code is committed to
> > > the master branch, whereas stable branches such as REL_12_0 only receive 
> > > bug fixes.  Do you have any
> specific reason to be working on REL_12_0?
> >
> > Yes, because I think it's human-friendly to reproduce and discuss 
> > performance measurement.  Of course I know
> all new accepted patches are merged into master's HEAD, not stable branches 
> and not even release tags, so I'm
> aware of rebasing my patchset onto master sooner or later.  However, if 
> someone, including me, says that s/he
> applies my patchset to "master" and measures its performance, we have to pay 
> attention to which commit the
> "master" really points to.  Although we have sha1 hashes to specify which 
> commit, we should check whether the
> specific commit on master has patches affecting performance or not because 
> master's HEAD gets new patches day
> by day.  On the other hand, a release tag clearly points the commit all we 
> probably know.  Also we can check more
> easily the features and improvements by using release notes and user manuals.
> 
> Thanks for clarifying. I see where you're coming from.
> 
> While I do sometimes see people reporting numbers with the latest stable 
> release' branch, that's normally just one
> of the baselines.
> The more important baseline for ongoing development is the master branch's 
> HEAD, which is also what people
> volunteering to test your patches would use.  Anyone who reports would have 
> to give at least two numbers --
> performance with a branch's HEAD without patch applied and that with patch 
> applied -- which can be enough in
> most cases to see the difference the patch makes.  Sure, the numbers might 
> change on each report, but that's fine
> I'd think.  If you continue to develop against the stable branch, you might 
> miss to notice impact from any relevant
> developments in the master branch, even developments which possibly require 
> rethinking the architecture of your
> own changes, although maybe that rarely occurs.
> 
> Thanks,
> Amit






RE: [PoC] Non-volatile WAL buffer

2020-02-16 Thread Takashi Menjo
Hello Amit,

> I apologize for not having any opinion on the patches themselves, but let me 
> point out that it's better to base these
> patches on HEAD (master branch) than REL_12_0, because all new code is 
> committed to the master branch,
> whereas stable branches such as REL_12_0 only receive bug fixes.  Do you have 
> any specific reason to be working
> on REL_12_0?

Yes, because I think it makes performance measurements easier to reproduce and 
discuss.  Of course I know all newly accepted patches are merged into master's 
HEAD, not into stable branches and not even into release tags, so I know I will 
have to rebase my patchset onto master sooner or later.  However, if someone, 
including me, says that they applied my patchset to "master" and measured its 
performance, we have to pay attention to which commit that "master" really 
points to.  Although we have SHA-1 hashes to specify a commit, we should check 
whether the specific commit on master contains patches affecting performance, 
because master's HEAD gets new patches day by day.  On the other hand, a 
release tag clearly points to a commit we all know.  We can also check the 
features and improvements more easily by using release notes and user manuals.

Best regards,
Takashi

-- 
Takashi Menjo 
NTT Software Innovation Center
> -Original Message-
> From: Amit Langote 
> Sent: Monday, February 17, 2020 1:39 PM
> To: Takashi Menjo 
> Cc: Robert Haas ; Heikki Linnakangas 
> ; PostgreSQL-development
> 
> Subject: Re: [PoC] Non-volatile WAL buffer
> 
> Menjo-san,
> 
> On Mon, Feb 17, 2020 at 1:13 PM Takashi Menjo 
>  wrote:
> > I applied my patchset that mmap()-s WAL segments as WAL buffers to 
> > refs/tags/REL_12_0, and measured and
> analyzed its performance with pgbench.  Roughly speaking, When I used *SSD 
> and ext4* to store WAL, it was
> "obviously worse" than the original REL_12_0.
> 
> I apologize for not having any opinion on the patches themselves, but let me 
> point out that it's better to base these
> patches on HEAD (master branch) than REL_12_0, because all new code is 
> committed to the master branch,
> whereas stable branches such as REL_12_0 only receive bug fixes.  Do you have 
> any specific reason to be working
> on REL_12_0?
> 
> Thanks,
> Amit






RE: [PoC] Non-volatile WAL buffer

2020-02-16 Thread Takashi Menjo
Dear hackers,

I applied my patchset that mmap()-s WAL segments as WAL buffers to 
refs/tags/REL_12_0, and measured and analyzed its performance with pgbench.  
Roughly speaking, when I used *SSD and ext4* to store WAL, it was "obviously 
worse" than the original REL_12_0.  VTune told me that the CPU time of memcpy() 
called by CopyXlogRecordToWAL() got larger than before.  When I used *NVDIMM-N 
and ext4 with filesystem DAX* to store WAL, however, it achieved "not bad" 
performance compared with our previous patchset and non-volatile WAL buffer.  
Each CPU time of XLogInsert() and XLogFlush() was reduced, as with non-volatile 
WAL buffer.

So I think mmap()-ing WAL segments as WAL buffers is not such a bad idea as 
long as we use PMEM, at least NVDIMM-N.

Excuse me, but for now I will not talk about how much the performance was, 
because the mmap()-ing patchset is WIP, so there might be bugs which wrongly 
"improve" or "degrade" performance.  Also, we need to understand persistent 
memory programming and related features such as filesystem DAX, huge page 
faults, and WAL persistence with cache flush and memory barrier instructions to 
explain why the performance improved.  I will talk about all the details at the 
appropriate time and place. (At the conference, or here later...)

Best regards,
Takashi

-- 
Takashi Menjo 
NTT Software Innovation Center

> -Original Message-
> From: Takashi Menjo 
> Sent: Monday, February 10, 2020 6:30 PM
> To: 'Robert Haas' ; 'Heikki Linnakangas' 
> 
> Cc: 'pgsql-hack...@postgresql.org' 
> Subject: RE: [PoC] Non-volatile WAL buffer
> 
> Dear hackers,
> 
> I made another WIP patchset to mmap WAL segments as WAL buffers.  Note that 
> this is not a non-volatile WAL
> buffer patchset but its competitor.  I am measuring and analyzing the 
> performance of this patchset to compare
> with my N.V.WAL buffer.
> 
> Please wait several more days for the result report...
> 
> Best regards,
> Takashi
> 
> --
> Takashi Menjo  NTT Software Innovation Center
> 
> > -Original Message-
> > From: Robert Haas 
> > Sent: Wednesday, January 29, 2020 6:00 AM
> > To: Takashi Menjo 
> > Cc: Heikki Linnakangas ; pgsql-hack...@postgresql.org
> > Subject: Re: [PoC] Non-volatile WAL buffer
> >
> > On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo 
> >  wrote:
> > > I think our concerns are roughly classified into two:
> > >
> > >  (1) Performance
> > >  (2) Consistency
> > >
> > > And your "different concern" is rather into (2), I think.
> >
> > Actually, I think it was mostly a performance concern (writes
> > triggering lots of reading) but there might be a consistency issue as well.
> >
> > --
> > Robert Haas
> > EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL
> > Company






RE: [PoC] Non-volatile WAL buffer

2020-02-10 Thread Takashi Menjo
Dear hackers,

I made another WIP patchset to mmap WAL segments as WAL buffers.  Note that 
this is not a non-volatile WAL buffer patchset but its competitor.  I am 
measuring and analyzing the performance of this patchset to compare with my 
N.V.WAL buffer.

Please wait several more days for the result report...

Best regards,
Takashi

-- 
Takashi Menjo 
NTT Software Innovation Center

> -Original Message-
> From: Robert Haas 
> Sent: Wednesday, January 29, 2020 6:00 AM
> To: Takashi Menjo 
> Cc: Heikki Linnakangas ; pgsql-hack...@postgresql.org
> Subject: Re: [PoC] Non-volatile WAL buffer
> 
> On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo 
>  wrote:
> > I think our concerns are roughly classified into two:
> >
> >  (1) Performance
> >  (2) Consistency
> >
> > And your "different concern" is rather into (2), I think.
> 
> Actually, I think it was mostly a performance concern (writes triggering lots 
> of reading) but there might be a
> consistency issue as well.
> 
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company


0001-Preallocate-more-WAL-segments.patch
Description: Binary data


0002-Use-WAL-segments-as-WAL-buffers.patch
Description: Binary data


0003-Lazy-unmap-WAL-segments.patch
Description: Binary data


0004-Speculative-map-WAL-segments.patch
Description: Binary data


0005-Allocate-WAL-segments-to-utilize-hugepage.patch
Description: Binary data


RE: [PoC] Non-volatile WAL buffer

2020-01-28 Thread Takashi Menjo
Hello Robert,

I think our concerns are roughly classified into two:

 (1) Performance
 (2) Consistency

And your "different concern" falls rather under (2), I think.

I'm also worried about it, but I have no good answer for now.  I suppose 
mmap(flags|=MAP_SHARED) called by multiple backend processes for the same file 
works consistently for both PMEM and non-PMEM devices.  However, I have not 
found any evidence such as specification documents yet.

I also made a tiny program calling memcpy() and msync() in parallel on the same 
mmap()-ed file but on mutually distinct address ranges, and found no corrupted 
data.  However, that result does not guarantee the consistency I'm worried 
about.  I could have given up if there *were* corrupted data...

So I will go to (1) first.  I will test in the way Heikki told us, to answer 
whether the cost of mmap() and munmap() per WAL segment, etc., is reasonable or 
not.  If it really is, then I will go to (2).

Best regards,
Takashi

-- 
Takashi Menjo 
NTT Software Innovation Center







RE: [PoC] Non-volatile WAL buffer

2020-01-26 Thread Takashi Menjo
Hello Heikki,

> I have the same comments on this that I had on the previous patch, see:
> 
> https://www.postgresql.org/message-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8%40iki.fi

Thanks.  I re-read your messages [1][2].  What you meant, AFAIU, is to use
memory-mapped WAL segment files as WAL buffers, and to switch between CPU
cache-flush instructions and msync() depending on whether the segment files
are on PMEM, to sync inserted WAL records.

It sounds reasonable, but I'm sorry that I haven't tested such a program
yet.  I'll try it to compare with my non-volatile WAL buffer.  For now, I'm
a little worried about the overhead of mmap()/munmap() for each WAL segment
file.

You also mentioned a SIGBUS problem with memory-mapped I/O.  I think it's true for
reading from bad memory blocks, as you mentioned, and also true for writing
to such blocks [3].  Handling SIGBUS properly or working around it is future
work.

Best regards,
Takashi

[1] 
https://www.postgresql.org/message-id/83eafbfd-d9c5-6623-2423-7cab1be3888c%40iki.fi
[2] 
https://www.postgresql.org/message-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8%40iki.fi
[3] https://pmem.io/2018/11/26/bad-blocks.htm

-- 
Takashi Menjo 
NTT Software Innovation Center







RE: [PoC] Non-volatile WAL buffer

2020-01-26 Thread Takashi Menjo
Hello Fabien,

Thank you for your +1 :)

> Is it possible to emulate somthing without the actual hardware, at least
> for testing purposes?

Yes, you can emulate PMEM using DRAM on Linux, via the "memmap=nnG!ssG" kernel
parameter.  Please see [1] and [2] for emulation details.  If your emulation
does not work well, please check whether the kernel configuration options (like
CONFIG_FOOBAR) for PMEM and DAX (in [1] and [3]) are set up properly.

Best regards,
Takashi

[1] How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM)
 
https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
[2] how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
 
https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
[3] Persistent Memory Wiki
 https://nvdimm.wiki.kernel.org/

-- 
Takashi Menjo 
NTT Software Innovation Center







[PoC] Non-volatile WAL buffer

2020-01-24 Thread Takashi Menjo
Dear hackers,

I propose "non-volatile WAL buffer," a proof-of-concept new feature.  It
enables WAL records to be durable without output to WAL segment files by
residing on persistent memory (PMEM) instead of DRAM.  It improves database
performance by reducing copies of WAL and shortening the time of write
transactions.

I attach the first patchset that can be applied to PostgreSQL 12.0 (refs/
tags/REL_12_0).  Please see README.nvwal (added by the patch 0003) to use
the new feature.

PMEM [1] is fast, non-volatile, and byte-addressable memory installed into
DIMM slots.  Such products are already available.  For example, an NVDIMM-N is
a type of PMEM module that contains both DRAM and NAND flash.  It can be
accessed like regular DRAM, but on power loss it saves its contents into the
flash area; on power restore it performs the reverse, copying the contents back
into DRAM.  PMEM is already supported by major operating systems such as Linux
and Windows, and by new open-source libraries such as the Persistent Memory
Development Kit (PMDK) [2].  Furthermore, several DBMSes have started to
support PMEM.

It's time for PostgreSQL.  PMEM is faster than a solid-state disk and can
naively be used as block storage.  However, we cannot gain much performance
that way, because PMEM is so fast that the overhead of traditional software
stacks (user buffers, filesystems, and block layers) becomes non-negligible.
Non-volatile WAL buffer is a work to make PostgreSQL PMEM-aware, that is, to
access PMEM directly as RAM, bypassing such overhead and achieving the maximum
possible benefit.  I believe WAL is one of the most important modules to
redesign for PMEM, because it has assumed slow disks such as HDDs and SSDs,
and PMEM is not one.

This work is inspired by "Non-volatile Memory Logging," presented at PGCon
2016 [3], to gain more benefit from PMEM than my and Yoshimi's previous work
did [4][5].  I submitted a talk proposal for PGCon this year, and have measured
and analyzed the performance of my PostgreSQL with non-volatile WAL buffer,
comparing it with the original one that uses PMEM as "a faster-than-SSD
storage."  I will talk about the results if accepted.

Best regards,
Takashi Menjo

[1] Persistent Memory (SNIA)
  https://www.snia.org/PM
[2] Persistent Memory Development Kit (pmem.io)
  https://pmem.io/pmdk/ 
[3] Non-volatile Memory Logging (PGCon 2016)
  https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[4] Introducing PMDK into PostgreSQL (PGCon 2018)
  https://www.pgcon.org/2018/schedule/events/1154.en.html
[5] Applying PMDK to WAL operations for persistent memory (pgsql-hackers)
  
https://www.postgresql.org/message-id/c20d38e97bcb33dad59e...@lab.ntt.co.jp

-- 
Takashi Menjo 
NTT Software Innovation Center




0001-Support-GUCs-for-external-WAL-buffer.patch
Description: Binary data


0002-Non-volatile-WAL-buffer.patch
Description: Binary data


0003-README-for-non-volatile-WAL-buffer.patch
Description: Binary data


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-02-11 Thread Takashi Menjo
Peter Eisentraut wrote:
> I'm concerned with how this would affect the future maintenance of this
> code.  You are introducing a whole separate code path for PMDK beside
> the normal file path (and it doesn't seem very well separated either).
> Now everyone who wants to do some surgery in the WAL code needs to take
> that into account.  And everyone who wants to do performance work in the
> WAL code needs to check that the PMDK path doesn't regress.  AFAICT,
> this hardware isn't very popular at the moment, so it would be very hard
> to peer review any work in this area.

Thank you for your comment.  It is reasonable that you are concerned about
maintainability; our patchset still falls short there.  I will address that
when I submit the next update.  (It may take a long time, so please be
patient...)


Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center






RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-01-29 Thread Takashi Menjo
Hi,

Sorry, but I found that patchset v2 had a bug in managing the WAL segment
file offset.  I fixed it and updated the patchset as v3 (attached).

Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center




0001-Add-configure-option-for-PMDK-v3.patch
Description: Binary data


0002-Read-write-WAL-files-using-PMDK-v3.patch
Description: Binary data


0003-Walreceiver-WAL-IO-using-PMDK-v3.patch
Description: Binary data


RE: static global variable openLogOff in xlog.c seems no longer used

2019-01-29 Thread Takashi Menjo
Michael Paquier wrote:
> It seems to me that keeping openLogOff is still useful to get a report
> about the full chunk area being written if the data gets written in
> multiple chunks and fails afterwards.  Your patch would modify the
> report so as only the area with the partial write is reported.  For
> debugging, having a static reference is also useful in my opinion.

I agree with you on both error reporting and debugging.  Now that you
mention it, I see that my patch modifies the ereport output...

When I wrote a patchset for xlog.c (in another email thread), I thought
this could be fixed, but now I understand it is not so simple.  Thank
you.


Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center






Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-01-28 Thread Takashi Menjo
Hi,

Peter Eisentraut wrote:
> When you manage the WAL (or perhaps in the future relation files)
> through PMDK, is there still a file system view of it somewhere, for
> browsing, debugging, and for monitoring tools?

First, I assume that our patchset is used with a filesystem that supports
the direct access (DAX) feature; I test it with ext4 on Linux.  You can cd
into the pg_wal directory created by "initdb -X pg_wal" on such a
filesystem, and ls the WAL segment files managed by PMDK at runtime.

As for PostgreSQL-specific tools, perhaps yes, but I have not tested them
yet.  At least, pg_waldump appears to work as before.

Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center







static global variable openLogOff in xlog.c seems no longer used

2019-01-28 Thread Takashi Menjo
Hi,

Because of pg_pwrite() [1], openLogOff, a static global variable in xlog.c,
seems to have been superseded by the local variable startoffset and is no
longer used.

I wrote the attached patch, which removes openLogOff. Both "make check" and
"make installcheck" passed, and just after that, "pg_ctl -m immediate stop"
followed by "pg_ctl start" looked OK.

Regards,
Takashi

[1] See commit c24dcd0cfd949bdf245814c4c2b3df828ee7db36.

-- 
Takashi Menjo - NTT Software Innovation Center




Remove-openLogOff.patch
Description: Binary data


RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-01-25 Thread Takashi Menjo
Hello,


On behalf of Yoshimi, I rebased the patchset onto the latest master
(e3565fd6).
Please see the attachment. It also includes an additional bug fix (in patch
0002) for a temporary filename issue.

Note that PMDK 1.4.2+ supports the MAP_SYNC and MAP_SHARED_VALIDATE flags,
so please use a recent version of PMDK when you test. The latest version is
1.5.


Heikki Linnakangas wrote:
> To re-iterate what I said earlier in this thread, I think the next step 
> here is to write a patch that modifies xlog.c to use plain old 
> mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.

Sorry, but my new patchset still uses PMDK, because PMDK is supported on
Linux _and Windows_, and I think someone may want to test this patchset on
Windows...


Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center




0001-Add-configure-option-for-PMDK-v2.patch
Description: Binary data


0002-Read-write-WAL-files-using-PMDK-v2.patch
Description: Binary data


0003-Walreceiver-WAL-IO-using-PMDK-v2.patch
Description: Binary data