Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2021-02-16 Thread Takashi Menjo
Rebased to make patchset v5.

I also found that my past replies have split the thread in the
pgsql-hackers archive. I am trying to reconnect this mail to the original
thread [1] and to make it point to the separated portions [2][3][4].
Note that the patchset v3 is in [3] and v4 is in [4].

Regards,

[1] 
https://www.postgresql.org/message-id/flat/C20D38E97BCB33DAD59E3A1%40lab.ntt.co.jp
[2] 
https://www.postgresql.org/message-id/flat/000501d4b794%245094d140%24f1be73c0%24%40lab.ntt.co.jp
[3] 
https://www.postgresql.org/message-id/flat/01d4b863%244c9e8fc0%24e5dbaf40%24%40lab.ntt.co.jp
[4] 
https://www.postgresql.org/message-id/flat/01d4c2a1%2488c6cc40%249a5464c0%24%40lab.ntt.co.jp

-- 
Takashi Menjo 


v5-0001-Add-configure-option-for-PMDK.patch
Description: Binary data


v5-0003-Walreceiver-WAL-IO-using-PMDK.patch
Description: Binary data


v5-0002-Read-write-WAL-files-using-PMDK.patch
Description: Binary data


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2020-08-04 Thread Takashi Menjo
Dear hackers,

I rebased my old patchset.  It would be good to compare this v4 patchset to
the non-volatile WAL buffer patchset [1].

[1]
https://www.postgresql.org/message-id/002101d649fb$1f5966e0$5e0c34a0$@hco.ntt.co.jp_1

Regards,
Takashi

-- 
Takashi Menjo 


v4-0001-Add-configure-option-for-PMDK.patch
Description: Binary data


v4-0003-Walreceiver-WAL-IO-using-PMDK.patch
Description: Binary data


v4-0002-Read-write-WAL-files-using-PMDK.patch
Description: Binary data


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-02-11 Thread Takashi Menjo
Peter Eisentraut wrote:
> I'm concerned with how this would affect the future maintenance of this
> code.  You are introducing a whole separate code path for PMDK beside
> the normal file path (and it doesn't seem very well separated either).
> Now everyone who wants to do some surgery in the WAL code needs to take
> that into account.  And everyone who wants to do performance work in the
> WAL code needs to check that the PMDK path doesn't regress.  AFAICT,
> this hardware isn't very popular at the moment, so it would be very hard
> to peer review any work in this area.

Thank you for your comment.  It is reasonable that you are concerned about
maintainability.  Our patchset still lacks it.  I will take that into
account when I submit the next update.  (It may take a long time, so please
be patient...)


Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center






Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-02-07 Thread Peter Eisentraut
On 30/01/2019 07:16, Takashi Menjo wrote:
> Sorry, but I found that the patchset v2 had a bug in managing the WAL segment
> file offset.  I fixed it and updated the patchset to v3 (attached).

I'm concerned with how this would affect the future maintenance of this
code.  You are introducing a whole separate code path for PMDK beside
the normal file path (and it doesn't seem very well separated either).
Now everyone who wants to do some surgery in the WAL code needs to take
that into account.  And everyone who wants to do performance work in the
WAL code needs to check that the PMDK path doesn't regress.  AFAICT,
this hardware isn't very popular at the moment, so it would be very hard
to peer review any work in this area.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-01-29 Thread Takashi Menjo
Hi,

Sorry, but I found that the patchset v2 had a bug in managing the WAL segment
file offset.  I fixed it and updated the patchset to v3 (attached).

Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center




0001-Add-configure-option-for-PMDK-v3.patch
Description: Binary data


0002-Read-write-WAL-files-using-PMDK-v3.patch
Description: Binary data


0003-Walreceiver-WAL-IO-using-PMDK-v3.patch
Description: Binary data


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-01-28 Thread Takashi Menjo
Hi,

Peter Eisentraut wrote:
> When you manage the WAL (or perhaps in the future relation files)
> through PMDK, is there still a file system view of it somewhere, for
> browsing, debugging, and for monitoring tools?

First, I assume that our patchset is used with a filesystem that supports the
direct access (DAX) feature, and I test it with ext4 on Linux.  You can cd
into a pg_wal directory created by initdb -X pg_wal on such a filesystem, and
ls the WAL segment files managed by PMDK at runtime.

As for PostgreSQL-specific tools, perhaps yes, but I have not tested them all
yet.  At least, pg_waldump appears to work as before.

Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center







Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-01-25 Thread Peter Eisentraut
On 25/01/2019 09:52, Takashi Menjo wrote:
> Heikki Linnakangas wrote:
>> To re-iterate what I said earlier in this thread, I think the next step 
>> here is to write a patch that modifies xlog.c to use plain old 
>> mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.
> Sorry, but my new patchset still uses PMDK, because PMDK is supported on
> Linux _and Windows_, and I think someone may want to test this patchset on
> Windows...

When you manage the WAL (or perhaps in the future relation files)
through PMDK, is there still a file system view of it somewhere, for
browsing, debugging, and for monitoring tools?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-01-25 Thread Takashi Menjo
Hello,


On behalf of Yoshimi, I rebased the patchset onto the latest master
(e3565fd6).  Please see the attachment.  It also includes an additional
bug fix (in patch 0002) for a temporary filename issue.

Note that PMDK 1.4.2+ supports the MAP_SYNC and MAP_SHARED_VALIDATE flags,
so please use a recent version of PMDK when you test.  The latest version
is 1.5.


Heikki Linnakangas wrote:
> To re-iterate what I said earlier in this thread, I think the next step 
> here is to write a patch that modifies xlog.c to use plain old 
> mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.

Sorry, but my new patchset still uses PMDK, because PMDK is supported on
Linux _and Windows_, and I think someone may want to test this patchset on
Windows...


Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center




0001-Add-configure-option-for-PMDK-v2.patch
Description: Binary data


0002-Read-write-WAL-files-using-PMDK-v2.patch
Description: Binary data


0003-Walreceiver-WAL-IO-using-PMDK-v2.patch
Description: Binary data


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-01-23 Thread Andres Freund
Hi,

On 2019-01-23 18:45:42 +0200, Heikki Linnakangas wrote:
> To re-iterate what I said earlier in this thread, I think the next step here
> is to write a patch that modifies xlog.c to use plain old mmap()/msync() to
> memory-map the WAL files, to replace the WAL buffers. Let's see what the
> performance of that is, with or without NVM hardware. I think that might
> actually make the code simpler. There's a bunch of really hairy code around
> locking the WAL buffers, which could be made simpler if each backend
> memory-mapped the WAL segment files independently.
> 
> One thing to watch out for, is that if you read() a file, and there's an I/O
> error, you have a chance to ereport() it. If you try to read from a
> memory-mapped file, and there's an I/O error, the process is killed with
> SIGBUS. So I think we have to be careful with using memory-mapped I/O for
> reading files. But for writing WAL files, it seems like a good fit.
> 
> Once we have a reliable mmap()/msync() implementation running, it should be
> straightforward to change it to use MAP_SYNC and the special CPU
> instructions for the flushing.

FWIW, I don't think we should go there as the sole implementation. I'm
fairly convinced that we're going to need to go to direct-IO in more
cases here, and that'll not work well with mmap.  I think this'd be a
worthwhile experiment, but I'm doubtful it'd end up simplifying our
code.

Greetings,

Andres Freund



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2019-01-23 Thread Heikki Linnakangas

On 10/12/2018 23:37, Dmitry Dolgov wrote:

> > On Thu, Nov 29, 2018 at 6:48 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> >
> > > On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier  wrote:
> > >
> > > On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
> > > > libpmem's pmem_map_file() supports 2M/1G (the size of huge pages)
> > > > alignment, which can reduce the number of page faults.
> > > > In addition, libpmem's pmem_memcpy_nodrain() is a function that
> > > > copies data using single instruction, multiple data (SIMD) instructions
> > > > and NT store instructions (MOVNT).
> > > > As a result, using these APIs is faster than using old mmap()/memcpy().
> > > >
> > > > Please see the PGCon2018 presentation[1] for the details.
> > > >
> > > > [1] 
> > > > https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf
> > >
> > > So you say that this represents a 3% gain based on the presentation?
> > > That may be interesting to dig into it.  Could you provide fresher
> > > performance numbers?  I am moving this patch to the next CF 2018-10 for
> > > now, waiting for input from the author.
> >
> > Unfortunately, the patch has some conflicts now, so probably not only fresher
> > performance numbers are necessary, but also a rebased version.
>
> I believe the idea behind this patch is quite important (thanks to CMU DG for
> inspiring lectures), so I decided to put some efforts and rebase it to prevent
> from rotting. At the same time I have a vague impression that the patch itself
> suggests quite narrow way of using of PMDK.


Thanks.

To re-iterate what I said earlier in this thread, I think the next step 
here is to write a patch that modifies xlog.c to use plain old 
mmap()/msync() to memory-map the WAL files, to replace the WAL buffers. 
Let's see what the performance of that is, with or without NVM hardware. 
I think that might actually make the code simpler. There's a bunch of 
really hairy code around locking the WAL buffers, which could be made 
simpler if each backend memory-mapped the WAL segment files independently.
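To illustrate, a minimal sketch of that mmap()/msync() write path, assuming
the default 16 MB segment size and 4 kB pages (the function names are
illustrative, not from any posted patch):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEGMENT_SIZE (16 * 1024 * 1024)   /* default WAL segment size */

/* Map one existing WAL segment file read-write. */
static char *
map_wal_segment(const char *path)
{
    int   fd = open(path, O_RDWR);
    char *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    return (p == MAP_FAILED) ? NULL : p;
}

/* After memcpy()ing records into [seg+off, seg+off+len), make them durable. */
static int
flush_wal_range(char *seg, size_t off, size_t len)
{
    size_t pagestart = off & ~(size_t) 4095;    /* msync() wants a
                                                 * page-aligned address;
                                                 * assumes 4 kB pages */

    return msync(seg + pagestart, off - pagestart + len, MS_SYNC);
}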


One thing to watch out for, is that if you read() a file, and there's an 
I/O error, you have a chance to ereport() it. If you try to read from a 
memory-mapped file, and there's an I/O error, the process is killed with 
SIGBUS. So I think we have to be careful with using memory-mapped I/O 
for reading files. But for writing WAL files, it seems like a good fit.
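Something like this kind of guard would be needed on the read side (a sketch
only; sigsetjmp() out of a SIGBUS handler is the usual, if delicate, approach,
and the names here are illustrative):

#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf mapped_read_env;

static void
sigbus_handler(int signum)
{
    siglongjmp(mapped_read_env, 1);
}

/* Returns 0 on success, -1 if the mapping faulted; the caller can then
 * ereport() an I/O error instead of the whole process dying. */
static int
read_mapped(char *dst, const char *src, size_t n)
{
    struct sigaction sa, oldsa;
    int    rc = 0;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = sigbus_handler;
    sigaction(SIGBUS, &sa, &oldsa);

    if (sigsetjmp(mapped_read_env, 1) == 0)
        memcpy(dst, src, n);    /* may raise SIGBUS on a media error */
    else
        rc = -1;

    sigaction(SIGBUS, &oldsa, NULL);
    return rc;
}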


Once we have a reliable mmap()/msync() implementation running, it should 
be straightforward to change it to use MAP_SYNC and the special CPU 
instructions for the flushing.


- Heikki



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-12-10 Thread Dmitry Dolgov
> On Thu, Nov 29, 2018 at 6:48 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier  wrote:
> >
> > On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
> > > libpmem's pmem_map_file() supports 2M/1G (the size of huge pages)
> > > alignment, which can reduce the number of page faults.
> > > In addition, libpmem's pmem_memcpy_nodrain() is a function that
> > > copies data using single instruction, multiple data (SIMD) instructions
> > > and NT store instructions (MOVNT).
> > > As a result, using these APIs is faster than using old mmap()/memcpy().
> > >
> > > Please see the PGCon2018 presentation[1] for the details.
> > >
> > > [1] 
> > > https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf
> >
> > So you say that this represents a 3% gain based on the presentation?
> > That may be interesting to dig into it.  Could you provide fresher
> > performance numbers?  I am moving this patch to the next CF 2018-10 for
> > now, waiting for input from the author.
>
> Unfortunately, the patch has some conflicts now, so probably not only fresher
> performance numbers are necessary, but also a rebased version.

I believe the idea behind this patch is quite important (thanks to CMU DG for
inspiring lectures), so I decided to put some efforts and rebase it to prevent
from rotting. At the same time I have a vague impression that the patch itself
suggests quite narrow way of using of PMDK.

> On 01/03/18 12:40, Heikki Linnakangas wrote:
> > On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
> >> These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
> >> for reading/writing WAL logs on persistent memory (PMEM).
> >> PMEM is next-generation storage and it has a number of nice features:
> >> fast, byte-addressable and non-volatile.
> >
> > Interesting. How does this compare with using good old mmap()?

E.g. byte-addressability is not used here at all, though it's probably one of
the coolest properties: we could write not a whole block/page, but a small
amount of data, and flush just that using PMDK.
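For illustration, a sketch of what that could look like with libpmem
(assuming the destination is a real-PMEM mapping; the helper name is made up):

#include <libpmem.h>
#include <string.h>

/* Persist one small record without writing out a whole 8 kB page. */
static void
append_small_record(char *pmem_dst, const void *rec, size_t reclen)
{
    memcpy(pmem_dst, rec, reclen);  /* e.g. a few dozen bytes */
    pmem_flush(pmem_dst, reclen);   /* flush just those cache lines */
    pmem_drain();                   /* wait for the stores to reach the media */
}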


0001-Add-configure-option-for-PMDK-v2.patch
Description: Binary data


0003-Walreceiver-WAL-IO-using-PMDK-v2.patch
Description: Binary data


0002-Read-write-WAL-files-using-PMDK-v2.patch
Description: Binary data


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-11-29 Thread Dmitry Dolgov
> On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier  wrote:
>
> On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
> > libpmem's pmem_map_file() supports 2M/1G (the size of huge pages)
> > alignment, which can reduce the number of page faults.
> > In addition, libpmem's pmem_memcpy_nodrain() is a function that
> > copies data using single instruction, multiple data (SIMD) instructions
> > and NT store instructions (MOVNT).
> > As a result, using these APIs is faster than using old mmap()/memcpy().
> >
> > Please see the PGCon2018 presentation[1] for the details.
> >
> > [1] 
> > https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf
>
> So you say that this represents a 3% gain based on the presentation?
> That may be interesting to dig into it.  Could you provide fresher
> performance numbers?  I am moving this patch to the next CF 2018-10 for
> now, waiting for input from the author.

Unfortunately, the patch has some conflicts now, so probably not only fresher
performance numbers are necessary, but also a rebased version.



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-10-01 Thread Michael Paquier
On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
> libpmem's pmem_map_file() supports 2M/1G (the size of huge pages)
> alignment, which can reduce the number of page faults. 
> In addition, libpmem's pmem_memcpy_nodrain() is a function that
> copies data using single instruction, multiple data (SIMD) instructions
> and NT store instructions (MOVNT).
> As a result, using these APIs is faster than using old mmap()/memcpy().
> 
> Please see the PGCon2018 presentation[1] for the details.
> 
> [1] 
> https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf

So you say that this represents a 3% gain based on the presentation?
That may be interesting to dig into it.  Could you provide fresher
performance numbers?  I am moving this patch to the next CF 2018-10 for
now, waiting for input from the author.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-08-06 Thread Yoshimi Ichiyanagi
I'm sorry for the delay in replying to your mail.

<91411837-8c65-bf7d-7ca3-d69bdcb49...@iki.fi>
Thu, 1 Mar 2018 18:40:05 +0800, Heikki Linnakangas wrote:
>Interesting. How does this compare with using good old mmap()?

libpmem's pmem_map_file() supports 2M/1G (the size of huge pages)
alignment, which can reduce the number of page faults. 
In addition, libpmem's pmem_memcpy_nodrain() is a function that
copies data using single instruction, multiple data (SIMD) instructions
and NT store instructions (MOVNT).
As a result, using these APIs is faster than using old mmap()/memcpy().

Please see the PGCon2018 presentation[1] for the details.

[1] 
https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf
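For illustration, a sketch of how those copy APIs are used to batch up WAL
blocks (dest is assumed to come from pmem_map_file() on real PMEM; the 8 kB
block size and the function name are illustrative):

#include <libpmem.h>
#include <stddef.h>

static void
copy_wal_blocks(char *dest, char **blocks, int nblocks)
{
    int i;

    for (i = 0; i < nblocks; i++)
        pmem_memcpy_nodrain(dest + (size_t) i * 8192, blocks[i], 8192);
    pmem_drain();   /* one drain after the whole batch of NT-store copies */
}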


<83eafbfd-d9c5-6623-2423-7cab1be38...@iki.fi>
Fri, 20 Jul 2018 23:18:05 +0300, Heikki Linnakangas wrote:
>I think the way forward with this patch would be to map WAL segments 
>with plain old mmap(), and use msync(). If that's faster than the status 
>quo, great. If not, it would still be a good stepping stone for actually 
>using PMDK. 

I think so too.

I wrote this patch to replace read/write syscalls with libpmem's
API only. I believe that PMDK can make the current PostgreSQL faster.


> If nothing else, it would provide a way to test most of the 
>code paths, without actually having a persistent memory device, or 
>libpmem. The examples at http://pmem.io/pmdk/libpmem/ actually suggest 
>doing exactly that: use libpmem to map a file to memory, and check if it 
>lives on persistent memory using libpmem's pmem_is_pmem() function. If 
>it returns yes, use pmem_drain(); if it returns false, fall back to using 
>msync().

When PMEM_IS_PMEM_FORCE (the environment variable[2]) is set to 1,
pmem_is_pmem() returns true.

Linux 4.15 and later support the MAP_SYNC and MAP_SHARED_VALIDATE
mmap() flags to check whether the mapped file is stored on PMEM.
An application that uses both flags in its mmap() call can be sure
that MAP_SYNC is actually supported by both the kernel and
the filesystem that the mapped file is stored in[3].
But pmem_is_pmem() doesn't support this mechanism for now.

[2] http://pmem.io/pmdk/manpages/linux/v1.4/libpmem/libpmem.7.html
[3] https://lwn.net/Articles/758594/ 
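A sketch of that kernel-level check, assuming Linux 4.15+ and a filesystem
mounted with DAX (the function name and fallback logic are illustrative):

#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#ifndef MAP_SYNC
#include <linux/mman.h>     /* MAP_SYNC/MAP_SHARED_VALIDATE on older glibc */
#endif

/* Try a DAX-backed synchronous mapping; fall back to a plain one. */
static void *
map_wal_file(int fd, size_t len, int *needs_msync)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

    if (p != MAP_FAILED)
    {
        *needs_msync = 0;   /* kernel and filesystem both honor MAP_SYNC */
        return p;
    }
    if (errno != EOPNOTSUPP && errno != EINVAL)
        return MAP_FAILED;
    *needs_msync = 1;       /* no DAX here: flush with msync() as usual */
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}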

--
Yoshimi Ichiyanagi
NTT Software Innovation Center
e-mail : ichiyanagi.yosh...@lab.ntt.co.jp




Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-07-20 Thread Heikki Linnakangas

On 01/03/18 12:40, Heikki Linnakangas wrote:
> On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
>> These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
>> for reading/writing WAL logs on persistent memory (PMEM).
>> PMEM is next-generation storage and it has a number of nice features:
>> fast, byte-addressable and non-volatile.
>
> Interesting. How does this compare with using good old mmap()? I think
> just doing that would allow eliminating much of the complexity around
> managing the shared_buffers. And if the OS is smart about persistent
> memory (I don't know what the state of the art on that is), presumably
> msync() and fsync() on a file that lives in persistent memory is
> lightning fast.


I briefly looked at the docs at pmem.io. pmem_map_file() uses mmap() 
under the hood, but it does some extra checks to test if the file is on 
a persistent memory device, and makes a note of it.


I think the way forward with this patch would be to map WAL segments 
with plain old mmap(), and use msync(). If that's faster than the status 
quo, great. If not, it would still be a good stepping stone for actually 
using PMDK. If nothing else, it would provide a way to test most of the 
code paths, without actually having a persistent memory device, or 
libpmem. The examples at http://pmem.io/pmdk/libpmem/ actually sugest 
doing exactly that: use libpmem to map a file to memory, and check if it 
lives on persistent memory using libpmem's pmem_is_pmem() function. If 
it returns yes, use pmem_drain(), if it return false, fall back to using 
msync().
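A minimal sketch of that pattern, assuming libpmem is installed (the function
name and flag choices are illustrative, not from the patchset):

#include <libpmem.h>
#include <string.h>

static int
write_durably(const char *path, const void *buf, size_t len)
{
    size_t mapped_len;
    int    is_pmem;
    void  *dest = pmem_map_file(path, len, PMEM_FILE_CREATE, 0600,
                                &mapped_len, &is_pmem);

    if (dest == NULL)
        return -1;
    memcpy(dest, buf, len);
    if (is_pmem)
        pmem_persist(dest, len);        /* CPU cache-flush instructions */
    else if (pmem_msync(dest, len) < 0) /* ordinary page-cache mapping */
    {
        pmem_unmap(dest, mapped_len);
        return -1;
    }
    return pmem_unmap(dest, mapped_len);
}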


- Heikki



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-03-01 Thread Yoshimi Ichiyanagi
<20180301103641.tudam4mavba3g...@alap3.anarazel.de>
Thu, 1 Mar 2018 02:36:41 -0800, Andres Freund <and...@anarazel.de> wrote:
>On 2018-02-05 09:59:25 +0900, Yoshimi Ichiyanagi wrote:
>> I added my patches to the CommitFest 2018-3.
>> https://commitfest.postgresql.org/17/1485/
>
>Unfortunately this is the last CF for the v11 development cycle. This is
>a major project submitted late for v11, there's been no code level
>review, the goals aren't agreed upon yet, etc. So I'd unfortunately like
>to move this to the next CF?

I get it. I modified the status to "move to next CF".

-- 
Yoshimi Ichiyanagi
NTT laboratories




Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-03-01 Thread Heikki Linnakangas

On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
> Hi.
>
> These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
> for reading/writing WAL logs on persistent memory (PMEM).
> PMEM is next-generation storage and it has a number of nice features:
> fast, byte-addressable and non-volatile.


Interesting. How does this compare with using good old mmap()? I think 
just doing that would allow eliminating much of the complexity around 
managing the shared_buffers. And if the OS is smart about persistent 
memory (I don't know what the state of the art on that is), presumably 
msync() and fsync() on a file that lives in persistent memory is 
lightning fast.


- Heikki



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-03-01 Thread Andres Freund
On 2018-02-05 09:59:25 +0900, Yoshimi Ichiyanagi wrote:
> I added my patches to the CommitFest 2018-3.
> https://commitfest.postgresql.org/17/1485/

Unfortunately this is the last CF for the v11 development cycle. This is
a major project submitted late for v11, there's been no code level
review, the goals aren't agreed upon yet, etc. So I'd unfortunately like
to move this to the next CF?

Greetings,

Andres Freund



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-02-04 Thread Yoshimi Ichiyanagi
>On Tue, Jan 30, 2018 at 3:37 AM, Yoshimi Ichiyanagi
> wrote:
>> Oracle and Microsoft SQL Server supported PMEM [1][2].
>> I think it is not too early for PostgreSQL to support PMEM.
>
>I agree; it's good to have the option available for those who have
>access to the hardware.
>
>If you haven't added your patch to the next CommitFest, please do so.

Thank you for your time.

I added my patches to the CommitFest 2018-3.
https://commitfest.postgresql.org/17/1485/

Oh, by the way, we submitted this proposal (Introducing PMDK into
PostgreSQL) to PGCon 2018.
If our proposal is accepted and you have time, please listen to
our presentation.

-- 
Yoshimi Ichiyanagi
Mailto : ichiyanagi.yosh...@lab.ntt.co.jp




Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-30 Thread Robert Haas
On Tue, Jan 30, 2018 at 3:37 AM, Yoshimi Ichiyanagi
 wrote:
> Oracle and Microsoft SQL Server supported PMEM [1][2].
> I think it is not too early for PostgreSQL to support PMEM.

I agree; it's good to have the option available for those who have
access to the hardware.

If you haven't added your patch to the next CommitFest, please do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-26 Thread Robert Haas
On Thu, Jan 25, 2018 at 8:54 PM, Tsunakawa, Takayuki
 wrote:
> Yes, that's pg_test_fsync output.  Isn't pg_test_fsync the tool to determine 
> the value for wal_sync_method?  Is this manual misleading?

Hmm.  I hadn't thought about it as misleading, but now that you
mention it, I'd say that it probably is.  I suspect that there should
be a disclaimer saying that the fastest WAL sync method in terms of
ops/second is not necessarily the one that will deliver the best
database performance, and mention the issues around open_sync and
open_datasync specifically.  But let's see what your testing shows;
I'm talking based on now-fairly-old experience with this and a passing
familiarity with the relevant source code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> If I understand correctly, those results are all just pg_test_fsync results.
> That's not reflective of what will happen when the database is actually
> running.  When you use open_sync or open_datasync, you force WAL write and
> WAL flush to happen simultaneously, instead of letting the WAL flush be
> delayed.

Yes, that's pg_test_fsync output.  Isn't pg_test_fsync the tool to determine 
the value for wal_sync_method?  Is this manual misleading?

https://www.postgresql.org/docs/devel/static/pgtestfsync.html
--
pg_test_fsync - determine fastest wal_sync_method for PostgreSQL

pg_test_fsync is intended to give you a reasonable idea of what the fastest 
wal_sync_method is on your specific system, as well as supplying diagnostic 
information in the event of an identified I/O problem.
--


Anyway, I'll use pgbench, and submit a patch if open_datasync is better than 
fdatasync.  I guess the current tweak of making fdatasync the default is a 
holdover from the era before ext4 and XFS became prevalent.


> I don't have the results handy at the moment.  We found it to be faster
> on a database benchmark where the WAL was stored on an NVRAM device.

Oh, NVRAM.  Interesting.  Then I'll try open_datasync/fdatasync comparison on 
HDD and SSD/PCie flash with pgbench.

Regards
Takayuki Tsunakawa




RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:michael.paqu...@gmail.com]
> Or to put it short, the lack of granular syncs in ext3 kills performance
> for some workloads. Tomas Vondra's presentation on such matters are a really
> cool read by the way:
> https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zfs

Yeah, I saw this recently, too.  That was cool.

Regards
Takayuki Tsunakawa






Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Robert Haas
On Thu, Jan 25, 2018 at 8:32 PM, Tsunakawa, Takayuki
 wrote:
> As I showed previously, regular file writes on PCIe flash, *not writes using 
> PMDK on persistent memory*, was 20% faster with open_datasync than with 
> fdatasync.

If I understand correctly, those results are all just pg_test_fsync
results.  That's not reflective of what will happen when the database
is actually running.  When you use open_sync or open_datasync, you
force WAL write and WAL flush to happen simultaneously, instead of
letting the WAL flush be delayed.

> And you said open_datasync was significantly faster than fdatasync.  Could 
> you show your results?  What device and filesystem did you use?

I don't have the results handy at the moment.  We found it to be
faster on a database benchmark where the WAL was stored on an NVRAM
device.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> On Thu, Jan 25, 2018 at 7:08 PM, Tsunakawa, Takayuki
>  wrote:
>> No, I'm not saying we should make the persistent memory mode the default.
>> I'm simply asking whether it's time to make open_datasync the default
>> setting.  We can write a notice in the release note for users who still
>> use ext3 etc. on old systems.  If there's no objection, I'll submit a patch
>> for the next CF.
>
> Well, like I said, I think that will degrade performance for users of SSDs
> or spinning disks.


As I showed previously, regular file writes on PCIe flash, *not writes using 
PMDK on persistent memory*, were 20% faster with open_datasync than with 
fdatasync.

In addition, regular file writes on HDD with ext4 were also 10% faster:

--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync                      3408.905 ops/sec     293 usecs/op
fdatasync                          3111.621 ops/sec     321 usecs/op
fsync                              3609.940 ops/sec     277 usecs/op
fsync_writethrough                              n/a
open_sync                          3356.362 ops/sec     298 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync                      1892.157 ops/sec     528 usecs/op
fdatasync                          3284.278 ops/sec     304 usecs/op
fsync                              3066.655 ops/sec     326 usecs/op
fsync_writethrough                              n/a
open_sync                          1853.415 ops/sec     540 usecs/op
--


And you said open_datasync was significantly faster than fdatasync.  Could you 
show your results?  What device and filesystem did you use?

Regards
Takayuki Tsunakawa




Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Michael Paquier
On Thu, Jan 25, 2018 at 09:30:45AM -0500, Robert Haas wrote:
> On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
>  wrote:
>>> This is just a guess, of course.  You didn't mention what the underlying
>>> storage for your test was?
>>
>> Uh, your guess was correct.  My file system was ext3, where fsync() writes 
>> all dirty buffers in page cache.
> 
> Oh, ext3 is terrible.  I don't think you can do any meaningful
> benchmark results on ext3.  Use ext4 or, if you prefer, xfs.

Or to put it short, the lack of granular syncs in ext3 kills
performance for some workloads. Tomas Vondra's presentation on such
matters is a really cool read, by the way: 
https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zfs
(I would have loved to see this presentation live).
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Robert Haas
On Thu, Jan 25, 2018 at 7:08 PM, Tsunakawa, Takayuki
 wrote:
> No, I'm not saying we should make the persistent memory mode the default.  
> I'm simply asking whether it's time to make open_datasync the default 
> setting.  We can write a notice in the release note for users who still use 
> ext3 etc. on old systems.  If there's no objection, I'll submit a patch for 
> the next CF.

Well, like I said, I think that will degrade performance for users of
SSDs or spinning disks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
>  wrote:
>> As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on
>> a LVM volume with ext4 (mounted with options noatime, nobarrier) on a PCIe
>> flash memory.
>
> So does that mean it was faster than your PMDK implementation?

The PMDK patch is not mine, but is from people in NTT Lab.  I'm very curious 
about the comparison of open_datasync and PMDK, too.


>> What do you think about changing the default value of wal_sync_method
>> on Linux in PG 11?  I can understand the concern that users might hit
>> performance degradation if they are using PostgreSQL on older systems.  But
>> it's also mottainai that many users don't notice the benefits of
>> wal_sync_method = open_datasync on new systems.
>
> Well, some day persistent memory may be a common enough storage technology
> that such a change makes sense, but these days most people have either SSD
> or spinning disks, where the change would probably be a net negative.  It
> seems more like something we might think about changing in PG 20 or PG 30.

No, I'm not saying we should make the persistent memory mode the default.  I'm 
simply asking whether it's time to make open_datasync the default setting.  We 
can write a notice in the release note for users who still use ext3 etc. on old 
systems.  If there's no objection, I'll submit a patch for the next CF.

Regards
Takayuki Tsunakawa





Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Robert Haas
On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
 wrote:
>> This is just a guess, of course.  You didn't mention what the underlying
>> storage for your test was?
>
> Uh, your guess was correct.  My file system was ext3, where fsync() writes 
> all dirty buffers in page cache.

Oh, ext3 is terrible.  I don't think you can do any meaningful
benchmark results on ext3.  Use ext4 or, if you prefer, xfs.

> As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on a LVM 
> volume with ext4 (mounted with options noatime, nobarrier) on a PCIe flash 
> memory.

So does that mean it was faster than your PMDK implementation?

> What do you think about changing the default value of wal_sync_method on 
> Linux in PG 11?  I can understand the concern that users might hit 
> performance degradation if they are using PostgreSQL on older systems.  But 
> it's also mottainai that many users don't notice the benefits of 
> wal_sync_method = open_datasync on new systems.

Well, some day persistent memory may be a common enough storage
technology that such a change makes sense, but these days most people
have either SSD or spinning disks, where the change would probably be
a net negative.  It seems more like something we might think about
changing in PG 20 or PG 30.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-24 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> I think open_datasync will be worse on systems where fsync() is expensive
> -- it forces the data out to disk immediately, even if the data doesn't
> need to be flushed immediately.  That's bad, because we wait immediately
> when we could have deferred the wait until later and maybe gotten the WAL
> writer to do the work in the background.  But it might be better on systems
> where fsync() is basically free, because there you might as well just get
> it out of the way immediately and not leave something left to be done later.
> 
> This is just a guess, of course.  You didn't mention what the underlying
> storage for your test was?

Uh, your guess was correct.  My file system was ext3, where fsync() writes all 
dirty buffers in page cache.

As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on a LVM 
volume with ext4 (mounted with options noatime, nobarrier) on a PCIe flash 
memory.

5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync                     50829.597 ops/sec      20 usecs/op
fdatasync                         42094.381 ops/sec      24 usecs/op
fsync                             42209.972 ops/sec      24 usecs/op
fsync_writethrough                              n/a
open_sync                         48669.605 ops/sec      21 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync                     26366.373 ops/sec      38 usecs/op
fdatasync                         33922.725 ops/sec      29 usecs/op
fsync                             32990.209 ops/sec      30 usecs/op
fsync_writethrough                              n/a
open_sync                         24326.249 ops/sec      41 usecs/op

What do you think about changing the default value of wal_sync_method on Linux 
in PG 11?  I can understand the concern that users might hit performance 
degradation if they are using PostgreSQL on older systems.  But it's also 
mottainai (wasteful) that many users don't notice the benefits of 
wal_sync_method = open_datasync on new systems.

Regards
Takayuki Tsunakawa




Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-24 Thread Robert Haas
On Tue, Jan 23, 2018 at 8:07 PM, Tsunakawa, Takayuki
 wrote:
> From: Robert Haas [mailto:robertmh...@gmail.com]
>> Oh, incidentally -- in our internal testing, we found that
>> wal_sync_method=open_datasync was significantly faster than
>> wal_sync_method=fdatasync.  You might find that open_datasync isn't much
>> different from pmem_drain, even though they're both faster than fdatasync.
>
> That's interesting.  How fast was open_datasync in what environment (Linux 
> distro/kernel version, HDD or SSD etc.)?
>
> Is it now time to change the default setting to open_datasync on Linux, at 
> least when O_DIRECT is not used (i.e. WAL archiving or streaming replication 
> is used)?

I think open_datasync will be worse on systems where fsync() is
expensive -- it forces the data out to disk immediately, even if the
data doesn't need to be flushed immediately.  That's bad, because we
wait immediately when we could have deferred the wait until later and
maybe gotten the WAL writer to do the work in the background.  But it
might be better on systems where fsync() is basically free, because
there you might as well just get it out of the way immediately and not
leave something left to be done later.
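To make the contrast concrete, a sketch of the two behaviors (the file name
and sizes are illustrative):

#include <fcntl.h>
#include <unistd.h>

static void
compare_sync_methods(const char *path, const char *buf)
{
    /* open_datasync: durability is bundled into every write() */
    int fd_sync = open(path, O_WRONLY | O_DSYNC);

    write(fd_sync, buf, 8192);  /* returns only once the data is stable */
    close(fd_sync);

    /* fdatasync: write now, flush later (perhaps by the WAL writer) */
    int fd_lazy = open(path, O_WRONLY);

    write(fd_lazy, buf, 8192);  /* may linger in the page cache */
    /* ... more writes can accumulate before the flush ... */
    fdatasync(fd_lazy);         /* one flush covers all of them */
    close(fd_lazy);
}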

This is just a guess, of course.  You didn't mention what the
underlying storage for your test was?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-23 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> Oh, incidentally -- in our internal testing, we found that
> wal_sync_method=open_datasync was significantly faster than
> wal_sync_method=fdatasync.  You might find that open_datasync isn't much
> different from pmem_drain, even though they're both faster than fdatasync.

That's interesting.  How fast was open_datasync in what environment (Linux 
distro/kernel version, HDD or SSD etc.)?

Is it now time to change the default setting to open_datasync on Linux, at 
least when O_DIRECT is not used (i.e. WAL archiving or streaming replication is 
used)?

[Current port/linux.h]
/*
 * Set the default wal_sync_method to fdatasync.  With recent Linux versions,
 * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
 * perform better and (b) causes outright failures on ext4 data=journal
 * filesystems, because those don't support O_DIRECT.
 */
#define PLATFORM_DEFAULT_SYNC_METHOD	SYNC_METHOD_FDATASYNC


pg_test_fsync showed open_datasync is slower on my RHEL6 VM:

--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync                      4276.373 ops/sec     234 usecs/op
fdatasync                          4895.256 ops/sec     204 usecs/op
fsync                              4797.094 ops/sec     208 usecs/op
fsync_writethrough                              n/a
open_sync                          4575.661 ops/sec     219 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync                      2243.680 ops/sec     446 usecs/op
fdatasync                          4347.466 ops/sec     230 usecs/op
fsync                              4337.312 ops/sec     231 usecs/op
fsync_writethrough                              n/a
open_sync                          2329.700 ops/sec     429 usecs/op
--

Regards
Takayuki Tsunakawa



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-23 Thread Robert Haas
On Fri, Jan 19, 2018 at 9:42 AM, Robert Haas  wrote:
> That's not necessarily an argument against this patch, which by the
> way I have not reviewed.  Even a 5% speedup on this kind of workload
> is potentially worthwhile; everyone likes it when things go faster.
> I'm just not convinced you can get very much more than that on a
> realistic workload.  Of course, I might be wrong.

Oh, incidentally -- in our internal testing, we found that
wal_sync_method=open_datasync was significantly faster than
wal_sync_method=fdatasync.  You might find that open_datasync isn't
much different from pmem_drain, even though they're both faster than
fdatasync.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-19 Thread Robert Haas
On Fri, Jan 19, 2018 at 4:56 AM, Yoshimi Ichiyanagi
 wrote:
>>Was the only non-default configuration setting wal_sync_method?  i.e.
>>synchronous_commit=on?  No change to max_wal_size?
> No, I used the following parameter in postgresql.conf to prevent
> checkpoints from occurring while running the tests.

I think that you really need to include the checkpoints in the tests.
I would suggest setting max_wal_size and/or checkpoint_timeout so that
you reliably complete 2 checkpoints in a 30-minute test, and then do a
comparison on that basis.

> Do you know any good WAL I/O intensive benchmarks? DBT2?

pgbench is quite a WAL-intensive benchmark; it is much more
write-heavy than what most systems experience in real life, at least
in my experience.  Your comparison of DAX FS to DAX FS + PMDK is very
interesting, but in real life the bandwidth of DAX FS is already so
high -- and the latency so low -- that I think most real-world
workloads won't gain very much.  At least, that is my impression based
on internal testing EnterpriseDB did a few months back.  (Thanks to
Mithun and Kuntal for that work.)

That's not necessarily an argument against this patch, which by the
way I have not reviewed.  Even a 5% speedup on this kind of workload
is potentially worthwhile; everyone likes it when things go faster.
I'm just not convinced you can get very much more than that on a
realistic workload.  Of course, I might be wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-17 Thread Robert Haas
On Tue, Jan 16, 2018 at 2:00 AM, Yoshimi Ichiyanagi
 wrote:
> C-5. Running the 2 benchmarks(1. pgbench, 2. my insert benchmark)
> C-5-1. pgbench
> # numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]
>
> The averages of running pgbench three times are:
> wal_sync_method=fdatasync:   tps = 43,179
> wal_sync_method=pmem_drain:  tps = 45,254

What scale factor was used for this test?

Was the only non-default configuration setting wal_sync_method?  i.e.
synchronous_commit=on?  No change to max_wal_size?

This seems like an exceedingly short test -- normally, for write
tests, I recommend the median of 3 30-minute runs.  It also seems
likely to be client-bound, because of the fact that jobs = clients/4.
Normally I use jobs = clients or at least jobs = clients/2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-17 Thread Robert Haas
On Tue, Jan 16, 2018 at 2:00 AM, Yoshimi Ichiyanagi
 wrote:
> Using pgbench which is a PostgreSQL general benchmark, the postgres server
> to which the patches is applied is about 5% faster than original server.
> And using my insert benchmark, it is up to 90% faster than original one.
> I will describe these details later.

Interesting.  But your insert benchmark looks highly artificial... in
real life, you would not insert the same long static string 160
million times.  Or if you did, you would use COPY or INSERT .. SELECT.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-16 Thread Yoshimi Ichiyanagi
Hi.

These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
for reading/writing WAL logs on persistent memory (PMEM).
PMEM is next-generation storage and it has a number of nice features:
fast, byte-addressable and non-volatile.

Using pgbench, which is a general PostgreSQL benchmark, the postgres server
with the patches applied is about 5% faster than the original server.
And using my insert benchmark, it is up to 90% faster than the original one.
I will describe these details later.


This e-mail describes the following:
A) About PMDK
B) About the patches
C) The way of running benchmarks using the patches, and the results


A) About PMDK
PMDK provides functions that allow an application to access PMEM
directly, as memory, without going through the kernel, for the purpose
of high-speed access to PMEM.
The following APIs are available through PMDK.
A-1. APIs to open a file on PMEM, to create a file on PMEM,
 and to map a file on PMEM to virtual addresses
A-2. APIs to read/write data from and to a file on PMEM


A-1. APIs to open a file on PMEM, to create a file on PMEM,
 and to map a file on PMEM to virtual addresses

PMDK provides these APIs using the DAX filesystem (DAX FS)[2] feature. 

DAX FS is a PMEM-aware file system which allows direct access
to PMEM without using the kernel page caches. A file in DAX FS can
be mapped to memory using standard calls like mmap() on Linux. 
Furthermore, by mapping the file on PMEM to virtual addresses (and
after any initial minor page faults that may be required to create
the mappings in the MMU), the applications can access PMEM
using CPU load/store instructions instead of read/write system calls.


A-2. APIs to read/write data from and to a file on PMEM

PMDK provides memcpy()-like APIs to copy data to PMEM
using single instruction, multiple data (SIMD) instructions[3] and
NT store instructions[4]. These instructions improve the performance
of copying data to PMEM. As a result, using these APIs is faster than
using read/write system calls.


[1] http://pmem.io/pmdk/
[2] 
https://www.usenix.org/system/files/login/articles/login_summer17_07_rudoff.pdf
[3] SIMD: a SIMD instruction operates on all loaded data in a single
operation. If the SIMD system loads eight values into registers at once,
the store operation to PMEM will happen to all eight values
at the same time.
[4] NT store instructions: NT store instructions bypass the CPU cache,
so using these instructions does not require a flush.
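Putting A-1 and A-2 together, a sketch of the write path described above
(the path, segment size and offsets are illustrative, not taken verbatim
from the patches):

#include <libpmem.h>
#include <stddef.h>

static void  *wal_seg;
static size_t wal_seg_len;

/* A-1: create/open a WAL segment on DAX FS and map it. */
static int
open_wal_segment(const char *path)
{
    int is_pmem;

    wal_seg = pmem_map_file(path, 16 * 1024 * 1024, PMEM_FILE_CREATE,
                            0600, &wal_seg_len, &is_pmem);
    return (wal_seg != NULL && is_pmem) ? 0 : -1;
}

/* A-2: copy a record with NT stores and drain, instead of write()+fsync(). */
static void
write_wal(size_t offset, const void *rec, size_t len)
{
    pmem_memcpy_nodrain((char *) wal_seg + offset, rec, len);
    pmem_drain();   /* this is what wal_sync_method = pmem_drain waits on */
}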


B) About the patches
Changes by the patches:
0001-Add-configure-option-for-PMDK.patch:
- Added "--with-libpmem" configure option to execute I/O with PMDK library

0002-Read-write-WAL-files-using-PMDK.patch:
- Added PMDK implementation for WAL I/O operations
- Added "pmem-drain" to the wal_sync_method parameter list
  to write logs synchronously on PMEM

0003-Walreceiver-WAL-IO-using-PMDK.patch:
- Added PMDK implementation for Walreceiver of secondary server processes



C) The way of running benchmarks using the patches, and the results
It is as follows:

Experimental setup
Server: HP ProLiant DL360 Gen9
CPU:    Xeon E5-2667 v4 (3.20GHz); 2 processors (without HT)
DRAM:   DDR4-2400; 32 GiB/processor
        (8GiB/socket x 4 sockets/processor) x 2 processors
NVDIMM: DDR4-2133; 32 GiB/processor
        (8GiB/socket x 4 sockets/processor) x 2 processors
HDD:    Seagate Constellation2 2.5inch SATA 3.0 6Gb/s 1TB 7200rpm x 1
OS:     Ubuntu 16.04, linux-4.12
DAX FS: ext4
NVML:   master@Aug 30, 2017
PostgreSQL: master
Note: I bound the postgres processes to one NUMA node, 
  and the benchmarks to the other NUMA node.


C-1. Configuring PMEM for use as a block device
# ndctl list
# ndctl create-namespace -f -e namespace0.0 --mode=memory -M dev

C-2. Creating a file system on PMEM, and mounting it with DAX
# mkfs.ext4 /dev/pmem0
# mount -t ext4 -o dax /dev/pmem0 /mnt/pmem0

C-3. Setting PMEM_IS_PMEM_FORCE to determine if the WAL files are stored
 on PMEM
Note: If this environment variable is not set, the postgres processes
  will not start.
# export PMEM_IS_PMEM_FORCE=1

C-4. Installing PostgreSQL
Note: There are 3 important things in installing PostgreSQL.
a. Executing "./configure --with-libpmem" to link libpmem
b. Setting WAL directory on PMEM
c. Modifying wal_sync_method parameter in postgresql.conf from fdatasync
   to pmem_drain

# cd /path/to/[PG_source dir]
# ./configure --with-libpmem
# make && make install
# initdb /path/to/PG_DATA -X /mnt/pmem0/path/to/[PG_WAL dir]
# cat /path/to/PG_DATA/postgresql.conf | sed -e s/#wal_sync_method\ =\ fsync/wal_sync_method\ =\ pmem_drain/ > /path/to/PG_DATA/postgresql.conf.tmp
# pg_ctl start -D /path/to/PG_DATA
# createdb [DB_NAME]

C-5. Running the 2 benchmarks(1. pgbench, 2. my insert benchmark)
C-5-1. pgbench
# numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]

The averages of running pgbench three times are:
wal_sync_method=fdatasync:   tps = 43,179
wal_sync_method=pmem_drain:  tps = 45,254