Re: [HACKERS] NUMA packaging and patch

2014-07-01 Thread Christoph Berg
Re: Kevin Grittner 2014-06-09 
1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com
 @@ -536,6 +539,24 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port,
 	 */
 	}
 
 +#ifdef USE_LIBNUMA
 +	/*
 +	 * If this is not a private segment and we are using libnuma, make the
 +	 * large memory segment interleaved.
 +	 */
 +	if (!makePrivate && numa_available())
 +	{
 +		void	   *start;
 +
 +		if (AnonymousShmem == NULL)
 +			start = memAddress;
 +		else
 +			start = AnonymousShmem;
 +
 +		numa_interleave_memory(start, size, numa_all_nodes_ptr);
 +	}
 +#endif

How much difference would it make if numactl --interleave=all was used
instead of using numa_interleave_memory() on the shared memory
segments? I guess that would make backend-local memory also
interleaved, but it would avoid having a dependency on libnuma in the
packages.

The numactl manpage even has this example:

numactl --interleave=all bigdatabase arguments
        Run big database with its memory interleaved on all CPUs.

It is probably better to have native support in the postmaster, though
this could be mentioned as an alternative in the documentation.
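
For illustration, a service script using that approach would look
roughly like this (just a sketch; the data directory and log file
paths are placeholders):

# Run the whole cluster with all of its memory (shared and
# backend-local) interleaved across NUMA nodes.
numactl --interleave=all \
    pg_ctl start -D /var/lib/postgresql/data -l /var/log/postgresql/startup.log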

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] NUMA packaging and patch

2014-07-01 Thread Andres Freund
On 2014-07-01 11:01:04 +0200, Christoph Berg wrote:
 Re: Kevin Grittner 2014-06-09 
 1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com
  @@ -536,6 +539,24 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port,
  	 */
  	}
  
  +#ifdef USE_LIBNUMA
  +	/*
  +	 * If this is not a private segment and we are using libnuma, make the
  +	 * large memory segment interleaved.
  +	 */
  +	if (!makePrivate && numa_available())
  +	{
  +		void	   *start;
  +
  +		if (AnonymousShmem == NULL)
  +			start = memAddress;
  +		else
  +			start = AnonymousShmem;
  +
  +		numa_interleave_memory(start, size, numa_all_nodes_ptr);
  +	}
  +#endif
 
 How much difference would it make if numactl --interleave=all was used
 instead of using numa_interleave_memory() on the shared memory
 segments? I guess that would make backend-local memory also
 interleaved, but it would avoid having a dependency on libnuma in the
 packages.

I tested this a while ago, and it's rather painful if you have an OLAP
workload with lots of backend private memory.

 The numactl manpage even has this example:
 
   numactl --interleave=all bigdatabase arguments
           Run big database with its memory interleaved on all CPUs.
 
 It is probably better to have native support in the postmaster, though
 this could be mentioned as an alternative in the documentation.

I wonder if we shouldn't backpatch such a notice.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] NUMA packaging and patch

2014-07-01 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:
 On 2014-07-01 11:01:04 +0200, Christoph Berg wrote:

 How much difference would it make if numactl --interleave=all
 was used instead of using numa_interleave_memory() on the shared
 memory segments? I guess that would make backend-local memory
 also interleaved, but it would avoid having a dependency on
 libnuma in the packages.

 I tested this a while ago, and it's rather painful if you have
 an OLAP workload with lots of backend private memory.

I'm not surprised; I would expect it to generally have a negative
effect, which would be most pronounced with an OLAP workload.

 The numactl manpage even has this example:

 numactl --interleave=all bigdatabase arguments
         Run big database with its memory interleaved on all CPUs.

 It is probably better to have native support in the postmaster,
 though this could be mentioned as an alternative in the
 documentation.

 I wonder if we shouldn't backpatch such a notice.

I would want to see some evidence that it was useful first.  In
most of my tests the benefit of interleaving just the OS cache and
PostgreSQL shared_buffers was about 2%.  That could easily be
erased if work_mem allocations and other process-local memory were
not allocated close to the process which was using it.

I expect that the main benefit of this proposed patch isn't the 2%
typical benefit I was seeing, but that it will be insurance against
occasional, much larger hits.  I haven't had much luck making these
worst case episodes reproducible, though.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] NUMA packaging and patch

2014-07-01 Thread Christoph Berg
Re: Kevin Grittner 2014-07-01 
1404213492.98740.yahoomail...@web122306.mail.ne1.yahoo.com
 Andres Freund and...@2ndquadrant.com wrote:
  On 2014-07-01 11:01:04 +0200, Christoph Berg wrote:
 
  How much difference would it make if numactl --interleave=all
  was used instead of using numa_interleave_memory() on the shared
  memory segments? I guess that would make backend-local memory
  also interleaved, but it would avoid having a dependency on
  libnuma in the packages.
 
  I tested this a while ago, and it's rather painful if you have
  an OLAP workload with lots of backend private memory.
 
 I'm not surprised; I would expect it to generally have a negative
 effect, which would be most pronounced with an OLAP workload.

Ok, then +1 on having this in core, even if it buys us a dependency on
something that isn't in the usual base system after OS install.

  I wonder if we shouldn't backpatch such a notice.
 
 I would want to see some evidence that it was useful first.  In
 most of my tests the benefit of interleaving just the OS cache and
 PostgreSQL shared_buffers was about 2%.  That could easily be
 erased if work_mem allocations and other process-local memory were
 not allocated close to the process which was using it.
 
 I expect that the main benefit of this proposed patch isn't the 2%
 typical benefit I was seeing, but that it will be insurance against
 occasional, much larger hits.  I haven't had much luck making these
 worst case episodes reproducible, though.

Afaict, the numactl notice will only be useful as a postscriptum to
the --with-libnuma docs, with the caveats mentioned. Or we backpatch
(something like) the full docs of the feature, with a note that it's
only 9.5+. (Or the full feature gets backpatched...)

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] NUMA packaging and patch

2014-06-26 Thread Kohei KaiGai
Hello,

Let me comment on this patch.

It applies on the head of the master branch, builds, and passes the
regression tests.
What this patch tries to do is quite simple and obvious: it asks the
operating system to distribute physical pages across all NUMA nodes
on allocation.

One thing I'm concerned about is that it may conflict with the
automatic NUMA balancing feature supported in recent Linux kernels,
which migrates physical pages toward the node of the task that
references them across NUMA zones.
# I'm not sure whether it applies to shared memory regions.
# Please correct me if I misunderstood, but it looks to me like
# physical pages in shared memory are also moved.
http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf
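
For what it's worth, whether that balancing feature is active can be
checked and toggled from userspace (a sketch; the sysctl exists only
on kernels built with NUMA balancing support):

# 1 = automatic NUMA balancing enabled, 0 = disabled
cat /proc/sys/kernel/numa_balancing
# disable it temporarily while benchmarking
sudo sysctl -w kernel.numa_balancing=0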

The interleave policy should probably work well for an OLTP workload.
What about an OLAP workload, though, if physical pages are migrated
to the local node transparently by the operating system?
In the OLAP case there is less concurrency, but a query runs
complicated logic (usually including full scans) on a particular
CPU.

Doesn't it make sense to have a GUC to control the NUMA policy?
In some cases it makes sense to allocate physical memory
according to the operating system's choice.

Thanks,

2014-06-11 2:34 GMT+09:00 Kevin Grittner kgri...@ymail.com:
 Josh Berkus j...@agliodbs.com wrote:
 On 06/08/2014 03:45 PM, Kevin Grittner wrote:
 By default, the OS cache and buffers are allocated in the memory
 node with the shortest distance from the CPU a process is
 running on.

 Note that this will stop being the default in future Linux kernels.
 However, we'll have to deal with the old ones for some time to come.

 I was not aware of that.  Thanks.  Do you have a URL handy?

 In any event, that is the part of the problem which I think falls
 into the realm of packagers and/or sysadmins; a patch for that
 doesn't seem sensible, given how cpusets are implemented.  I did
 figure we would want to add some documentation around it, though.
 Do you agree that is worthwhile?

 --
 Kevin Grittner
 EDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company





-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] NUMA packaging and patch

2014-06-26 Thread Claudio Freire
On Thu, Jun 26, 2014 at 11:18 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 One thing I concern is, it may conflict with numa-balancing
 features that is supported in the recent Linux kernel; that
 migrates physical pages according to the location of tasks
 which references the page beyond the numa zone.
 # I'm not sure whether it is applied on shared memory region.
 # Please correct me if I misunderstood. But it looks to me
 # physical page in shared memory is also moved.
 http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf


Sadly, it excludes the OS cache explicitly (when it mentions libc.so),
which is one of the hottest sources of memory bandwidth consumption in
a database.




Re: [HACKERS] NUMA packaging and patch

2014-06-26 Thread Kevin Grittner
Claudio Freire klaussfre...@gmail.com wrote:

 Sadly, it excludes the OS cache explicitly (when it mentions libc.so),
 which is one of the hottest sources of memory bandwidth consumption in
 a database.

Agreed.  On the bright side, the packagers and/or sysadmins can fix this
without any changes to the PostgreSQL code, by creating a custom cpuset
and using it during launch of the postmaster.  I went through that
exercise in my original email.  This patch complements that by
preventing one CPU from managing all of PostgreSQL shared memory, and
thus becoming a bottleneck.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] NUMA packaging and patch

2014-06-10 Thread Robert Haas
On Mon, Jun 9, 2014 at 1:00 PM, Kevin Grittner kgri...@ymail.com wrote:
 Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:
 *) There is a lot of advice floating around (for example here:
 http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html
  )
 to instruct operators to disable zone_reclaim.  Will your changes
 invalidate any of that advice?

 I expect that it will make the need for that far less acute,
 although it is probably still best to disable zone_reclaim (based
 on the documented conditions under which disabling it makes sense).

 I think it'll still be important unless you're running an OLTP workload
 (i.e. minimal per backend allocations) and your entire workload fits
 into shared buffers. What zone_reclaim > 0 essentially does is to never
 allocate memory from remote nodes. I.e. it will throw away all numa node
 local OS cache to satisfy a memory allocation (including
 pagefaults).

 I don't think that cpuset spreading of OS buffers and cache, and
 the patch to spread shared memory, will make too much difference
 unless the working set is fully cached.  Where I have seen the
 biggest problems is when the active set > one memory node and <
 total machine RAM.

But that's precisely the scenario where vm.zone_reclaim_mode != 0 is a
disaster.  You'll end up throwing away the cached pages and rereading
the data from disk, even though the memory *could* have been kept all
in cache.

 I would agree that unless this patch is
 providing benefit for such a fully-cached load, it won't make any
 difference regarding the need for zone_reclaim_mode.  Where the
 data is heavily cached, zone_reclaim > 0 might discard some cached
 pages to allow, say, a RAM sort to be done in faster memory (for
 the current process's core), so it might be a wash or even make
 zone_reclaim > 0 a win.

I will believe that when, and only when, I see benchmarks convincingly
demonstrating it.  Setting zone_reclaim_mode can only be a win if the
performance benefit from using faster memory is greater than the
performance cost of any rereading-from-disk that happens.  IME, that's
a highly unusual situation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] NUMA packaging and patch

2014-06-10 Thread Josh Berkus
On 06/08/2014 03:45 PM, Kevin Grittner wrote:
 By default, the OS cache and buffers are allocated in the memory
 node with the shortest distance from the CPU a process is running
 on. 

Note that this will stop being the default in future Linux kernels.
However, we'll have to deal with the old ones for some time to come.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] NUMA packaging and patch

2014-06-10 Thread Kevin Grittner
Josh Berkus j...@agliodbs.com wrote:
 On 06/08/2014 03:45 PM, Kevin Grittner wrote:
 By default, the OS cache and buffers are allocated in the memory
 node with the shortest distance from the CPU a process is
 running on.

 Note that this will stop being the default in future Linux kernels.
 However, we'll have to deal with the old ones for some time to come.

I was not aware of that.  Thanks.  Do you have a URL handy?

In any event, that is the part of the problem which I think falls
into the realm of packagers and/or sysadmins; a patch for that
doesn't seem sensible, given how cpusets are implemented.  I did
figure we would want to add some documentation around it, though. 
Do you agree that is worthwhile?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] NUMA packaging and patch

2014-06-09 Thread Merlin Moncure
On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner kgri...@ymail.com wrote:
 I ran into a situation where a machine with 4 NUMA memory nodes and
 40 cores had performance problems due to NUMA.  The problems were
 worst right after they rebooted the OS and warmed the cache by
 running a script of queries to read all tables.  These were all run
 on a single connection.  As it turned out, the size of the database
 was just over one-quarter of the size of RAM, and with default NUMA
 policies both the OS cache for the database and the PostgreSQL
 shared memory allocation were placed on a single NUMA segment, so
 access to the CPU package managing that segment became a
 bottleneck.  On top of that, processes which happened to run on the
 CPU package which had all the cached data had to allocate memory
 for local use on more distant memory because there was none left in
 the more local memory.

 Through normal operations, things eventually tended to shift around
 and get better (after several hours of heavy use with substandard
 performance).  I ran some benchmarks and found that even in
 long-running tests, spreading these allocations among the memory
 segments showed about a 2% benefit in a read-only load.  The
 biggest difference I saw in a long-running read-write load was
 about a 20% hit for unbalanced allocations, but I only saw that
 once.  I talked to someone at PGCon who managed to engineer much
 worse performance hits for an unbalanced load, although the
 circumstances were fairly artificial.  Still, fixing this seems
 like something worth doing if further benchmarks confirm benefits
 at this level.

 By default, the OS cache and buffers are allocated in the memory
 node with the shortest distance from the CPU a process is running
 on.  This is determined by the cpuset associated with the
 process which reads or writes the disk page.  Typically a NUMA
 machine starts with a single cpuset with a policy specifying this
 behavior.  Fixing this aspect of things seems like an issue for
 packagers, although we should probably document it for those
 running from their own source builds.

 To set an alternate policy for PostgreSQL, you first need to find
 or create the location for cpuset specification, which uses a
 filesystem in a way similar to the /proc directory.  On a machine
 with more than one memory node, the appropriate filesystem is
 probably already mounted, although different distributions use
 different filesystem names and mount locations.  I will illustrate
 the process on my Ubuntu machine.  Even though it has only one
 memory node (and so, this makes no difference), I have it handy at
 the moment to confirm the commands as I put them into the email.

 # Sysadmin must create the root cpuset if not already done.  (On a
 # system with NUMA memory, this will probably already be mounted.)
 # Location and options can vary by distro.

 sudo mkdir /dev/cpuset
 sudo mount -t cpuset none /dev/cpuset

 # Sysadmin must create a cpuset for postgres and configure
 # resources.  This will normally be all cores and all RAM.  This is
 # where we specify that this cpuset will spread pages among its
 # memory nodes.

 sudo mkdir /dev/cpuset/postgres
 sudo /bin/bash -c "echo 0-3 > /dev/cpuset/postgres/cpus"
 sudo /bin/bash -c "echo 0 > /dev/cpuset/postgres/mems"
 sudo /bin/bash -c "echo 1 > /dev/cpuset/postgres/memory_spread_page"

 # Sysadmin must grant permissions to the desired setting(s).
 # This could be by user or group.

 sudo chown postgres /dev/cpuset/postgres/tasks

 # The pid of postmaster or an ancestor process must be written to
 # the tasks file of the cpuset.  This can be a shell from which
 # pg_ctl is run, at least for bash shells.  It could also be
 # written by the postmaster itself, essentially as an extra pid
 # file.  Possible snippet from a service script:

 echo $$ > /dev/cpuset/postgres/tasks
 pg_ctl start ...

 Where the OS cache is larger than shared_buffers, the above is
 probably more important than the attached patch, which causes the
 main shared memory segment to be spread among all available memory
 nodes.  This patch only compiles in the relevant code if configure
 is run using the --with-libnuma option, in which case a dependency
 on the numa library is created.  It is v3 to avoid confusion with
 earlier versions I have shared with a few people off-list.  (The
 only difference from v2 is fixing bitrot.)

 I'll add it to the next CF.

Hm, your patch seems to boil down to interleave_memory(start, size,
numa_all_nodes_ptr) inside PGSharedMemoryCreate().  I've read your
email a couple of times and am a little hazy around a couple of
points, in particular: the above is probably more important than the
attached patch.  So I have a couple of questions:

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
to instruct operators to disable zone_reclaim.  Will your changes
invalidate any of that advice?

Re: [HACKERS] NUMA packaging and patch

2014-06-09 Thread Kevin Grittner
Merlin Moncure mmonc...@gmail.com wrote:
 On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner kgri...@ymail.com wrote:

 Hm, your patch seems to boil down to
   interleave_memory(start, size, numa_all_nodes_ptr)
 inside PGSharedMemoryCreate().

That's the functional part -- the rest is about not breaking the
builds for environments which are not NUMA-aware.

 I've read your email a couple of times and am a little hazy
 around a couple of points, in particular: the above is probably
 more important than the attached patch.  So I have a couple of
 questions:

 *) There is a lot of advice floating around (for example here:
 http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html
  )
 to instruct operators to disable zone_reclaim.  Will your changes
 invalidate any of that advice?

I expect that it will make the need for that far less acute,
although it is probably still best to disable zone_reclaim (based
on the documented conditions under which disabling it makes sense).
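
(For anyone following along, disabling it looks roughly like this; a
sketch only, and the sysctl.d file name is just an example:)

# check the current setting (0 means zone reclaim is off)
cat /proc/sys/vm/zone_reclaim_mode
# turn it off until the next reboot
sudo sysctl -w vm.zone_reclaim_mode=0
# persist the setting across reboots
echo "vm.zone_reclaim_mode = 0" | sudo tee /etc/sysctl.d/99-postgres-numa.conf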

 *) is there any downside to enabling --with-libnuma if you have
 support?

Not that I can see.  There are two additional system calls on
postmaster start-up.  I don't expect the time those take to be
significant.

 Do you expect packagers will enable it generally?

I suspect so.

 Why not just always build it in (if configure allows it) and rely
 on a GUC if there is some kind of tradeoff (and if there is one,
 what kinds of things are you looking for to manage it)?

If a build is done on a machine with the NUMA library, and the
executable is deployed on a machine without it, the postmaster will
get an error on the missing library.  I talked about this briefly
with Tom in Ottawa, and he thought that it would be up to packagers
to create a dependency on the library if they build PostgreSQL
using the --with-libnuma option.  The reason to require the option
is so that a build is not created which won't run on target
machines if a packager does nothing to deal with NUMA.
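
So a source build that wants the feature would look roughly like this
(a sketch; the development package name varies by distro, e.g.
libnuma-dev on Debian-family systems or numactl-devel on RPM-based
ones, and other configure options are elided):

# install the libnuma headers first
sudo apt-get install libnuma-dev
./configure --with-libnuma ...
make
sudo make install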

 *) The bash script above, what problem does the 'alternate
 policy' solve?

By default, all OS buffers and cache are located in the memory node
closest to the process doing the read or write that first causes
it to be used.  For something like the cp command, that
probably makes sense.  For something like PostgreSQL it can lead to
unbalanced placement of shared resources (like pages in shared
tables and indexes).

 *) What kinds of improvements (even if in very general terms)
 will we see from better numa management?  Are there further
 optimizations possible?

When I spread both OS cache and PostgreSQL shared memory, I got
about 2% better performance overall for a read-only load on a 4
node system which started with everything on one node.  I used
pgbench and picked a scale which put the database size at about 25%
of machine memory before I initialized the database, so that one
memory node was 100% filled with minimal spill to the other
nodes.  The run times between the two cases had very minimal
overlap.  The balanced memory usage had more consistent results;
the unbalanced load had more variable performance timings, with a
rare run showing better than all the balanced times.

I didn't spend as much time with read/write benchmarks but those
seemed overall worse for the unbalanced load, and one outlier on the
bad side was about 20% below the (again, pretty tightly clustered)
times for the balanced load.

These tests were designed to try to create a pretty bad case for
the unbalanced load in a default cpuset configuration and just an
unlucky sizing of the working set relative to a memory node size. 
At PGCon I had a discussion over lunch with someone who saw far
worse performance from unbalanced memory, but he carefully
engineered a really bad case by using one cpuset to force all data
into one node, and then another cpuset to force PostgreSQL to run
only on cores from which access to that node was relatively slow. 
If I remember correctly, he saw about 20% of the throughput that way
versus using the same cores with balanced memory usage.  He
conceded that this was a pretty artificial case, and you would
have to be *trying* to hurt performance to set things up that way,
but he wanted to establish a worst case so that he had a hard
bounding of what the maximum possible benefit from balancing load
might be.

There is definitely a need for more benchmarks and benchmarks on
more environments, but my preliminary tests all looked favorable to
the combination of this patch and the cpuset changes.  I would have
posted this months ago if I had found enough time to do more
benchmarks and put together a nice presentation of the results, but
I figured it was a good idea to put this in front of people even
with only preliminary results, so that if others were interested in
doing so they could see what results they got in their
environments or with workloads I had not considered.

I will note that given the wide differences I saw between run times
with the 

Re: [HACKERS] NUMA packaging and patch

2014-06-09 Thread Andres Freund
On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:
  *) There is a lot of advice floating around (for example here:
  http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html
   )
  to instruct operators to disable zone_reclaim.  Will your changes
  invalidate any of that advice?
 
 I expect that it will make the need for that far less acute,
 although it is probably still best to disable zone_reclaim (based
 on the documented conditions under which disabling it makes sense).

I think it'll still be important unless you're running an OLTP workload
(i.e. minimal per backend allocations) and your entire workload fits
into shared buffers. What zone_reclaim > 0 essentially does is to never
allocate memory from remote nodes. I.e. it will throw away all numa node
local OS cache to satisfy a memory allocation (including
pagefaults).
I honestly wouldn't expect this to make a huge difference *wrt*
zone_reclaim_mode.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] NUMA packaging and patch

2014-06-09 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:
 *) There is a lot of advice floating around (for example here:
 http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html
  )
 to instruct operators to disable zone_reclaim.  Will your changes
 invalidate any of that advice?

 I expect that it will make the need for that far less acute,
 although it is probably still best to disable zone_reclaim (based
 on the documented conditions under which disabling it makes sense).

 I think it'll still be important unless you're running an OLTP workload
 (i.e. minimal per backend allocations) and your entire workload fits
 into shared buffers. What zone_reclaim > 0 essentially does is to never
 allocate memory from remote nodes. I.e. it will throw away all numa node
 local OS cache to satisfy a memory allocation (including
 pagefaults).

I don't think that cpuset spreading of OS buffers and cache, and
the patch to spread shared memory, will make too much difference
unless the working set is fully cached.  Where I have seen the
biggest problems is when the active set > one memory node and <
total machine RAM.  I would agree that unless this patch is
providing benefit for such a fully-cached load, it won't make any
difference regarding the need for zone_reclaim_mode.  Where the
data is heavily cached, zone_reclaim > 0 might discard some cached
pages to allow, say, a RAM sort to be done in faster memory (for
the current process's core), so it might be a wash or even make
zone_reclaim > 0 a win.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] NUMA packaging and patch

2014-06-08 Thread Kevin Grittner
I ran into a situation where a machine with 4 NUMA memory nodes and
40 cores had performance problems due to NUMA.  The problems were
worst right after they rebooted the OS and warmed the cache by
running a script of queries to read all tables.  These were all run
on a single connection.  As it turned out, the size of the database
was just over one-quarter of the size of RAM, and with default NUMA
policies both the OS cache for the database and the PostgreSQL
shared memory allocation were placed on a single NUMA segment, so
access to the CPU package managing that segment became a
bottleneck.  On top of that, processes which happened to run on the
CPU package which had all the cached data had to allocate memory
for local use on more distant memory because there was none left in
the more local memory.
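
(For anyone wanting to see how their own machine lays out memory, the
numactl package ships inspection tools; a sketch, and the output
format varies by version:)

# nodes, per-node CPUs, memory sizes, and inter-node distances
numactl --hardware
# per-node allocation hit/miss counters
numastat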

Through normal operations, things eventually tended to shift around
and get better (after several hours of heavy use with substandard
performance).  I ran some benchmarks and found that even in
long-running tests, spreading these allocations among the memory
segments showed about a 2% benefit in a read-only load.  The
biggest difference I saw in a long-running read-write load was
about a 20% hit for unbalanced allocations, but I only saw that
once.  I talked to someone at PGCon who managed to engineer much
worse performance hits for an unbalanced load, although the
circumstances were fairly artificial.  Still, fixing this seems
like something worth doing if further benchmarks confirm benefits
at this level.

By default, the OS cache and buffers are allocated in the memory
node with the shortest distance from the CPU a process is running
on.  This is determined by the cpuset associated with the
process which reads or writes the disk page.  Typically a NUMA
machine starts with a single cpuset with a policy specifying this
behavior.  Fixing this aspect of things seems like an issue for
packagers, although we should probably document it for those
running from their own source builds.

To set an alternate policy for PostgreSQL, you first need to find
or create the location for cpuset specification, which uses a
filesystem in a way similar to the /proc directory.  On a machine
with more than one memory node, the appropriate filesystem is
probably already mounted, although different distributions use
different filesystem names and mount locations.  I will illustrate
the process on my Ubuntu machine.  Even though it has only one
memory node (and so, this makes no difference), I have it handy at
the moment to confirm the commands as I put them into the email.

# Sysadmin must create the root cpuset if not already done.  (On a
# system with NUMA memory, this will probably already be mounted.)
# Location and options can vary by distro.

sudo mkdir /dev/cpuset
sudo mount -t cpuset none /dev/cpuset

# Sysadmin must create a cpuset for postgres and configure
# resources.  This will normally be all cores and all RAM.  This is
# where we specify that this cpuset will spread pages among its
# memory nodes.

sudo mkdir /dev/cpuset/postgres
sudo /bin/bash -c "echo 0-3 > /dev/cpuset/postgres/cpus"
sudo /bin/bash -c "echo 0 > /dev/cpuset/postgres/mems"
sudo /bin/bash -c "echo 1 > /dev/cpuset/postgres/memory_spread_page"

# Sysadmin must grant permissions to the desired setting(s).
# This could be by user or group.

sudo chown postgres /dev/cpuset/postgres/tasks

# The pid of postmaster or an ancestor process must be written to
# the tasks file of the cpuset.  This can be a shell from which
# pg_ctl is run, at least for bash shells.  It could also be
# written by the postmaster itself, essentially as an extra pid
# file.  Possible snippet from a service script:

echo $$ > /dev/cpuset/postgres/tasks
pg_ctl start ...
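
# One way to verify that the postmaster really landed in the cpuset
# (a sketch; assumes the first line of postmaster.pid holds the
# postmaster's pid, and the data directory path is a placeholder):

cat /proc/$(head -1 /path/to/data/postmaster.pid)/cpuset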

Where the OS cache is larger than shared_buffers, the above is
probably more important than the attached patch, which causes the
main shared memory segment to be spread among all available memory
nodes.  This patch only compiles in the relevant code if configure
is run using the --with-libnuma option, in which case a dependency
on the numa library is created.  It is v3 to avoid confusion with
earlier versions I have shared with a few people off-list.  (The
only difference from v2 is fixing bitrot.)

I'll add it to the next CF.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/configure b/configure
index ed1ff0a..79a0dea 100755
--- a/configure
+++ b/configure
@@ -702,6 +702,7 @@ EGREP
 GREP
 with_zlib
 with_system_tzdata
+with_libnuma
 with_libxslt
 with_libxml
 XML2_CONFIG
@@ -831,6 +832,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libnuma
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1518,6 +1520,7 @@ Optional Packages:
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libxml   build with XML support
   --with-libxslt  use XSLT support when building contrib/xml2
+  --with-libnuma  use libnuma for NUMA support