Re: [PERFORM] Profiling PostgreSQL

2014-05-23 Thread Dimitris Karampinas
Thanks for your answers. A script around pstack worked for me.
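
For reference, a minimal sketch of what such a pstack wrapper can look like
(illustrative only, assuming Python 3.7+ and a pstack binary on PATH; the
frame parsing is deliberately rough):

#!/usr/bin/env python3
# Poor man's profiler: repeatedly sample a backend's stack with pstack
# and count the most common frames. The PID and sample count are
# illustrative; adjust them for your setup.
import collections
import subprocess
import sys
import time

def sample_stacks(pid, samples=100, interval=0.05):
    counter = collections.Counter()
    for _ in range(samples):
        out = subprocess.run(["pstack", str(pid)],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("#"):  # gdb-style frame lines
                counter[line.split(" in ")[-1]] += 1
        time.sleep(interval)
    return counter

if __name__ == "__main__":
    pid = int(sys.argv[1])  # PID of a postgres backend
    for frame, hits in sample_stacks(pid).most_common(20):
        print(hits, frame)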

(I'm not sure if I should open a new thread, I hope it's OK to ask another
question here)

For the workload I run, PostgreSQL seems to scale with the number of
concurrent clients up to the point where the client count roughly matches the
number of cores.
Increasing the number of clients further leads to dramatic performance
degradation. pstack and perf show that backends block on LWLockAcquire calls,
so one could assume the system slows down because multiple concurrent
transactions access the same data.
However, I did the following two experiments:
1) I completely removed the UPDATE transactions from my workload. The
throughput turned out to be better, yet the trend was the same: increasing
the number of clients still has a very negative performance impact.
2) I deployed PostgreSQL on more cores. The throughput improved a lot. If the
problem were due to concurrency control, the throughput should remain the
same no matter the number of hardware contexts.

Any insight into why the system behaves like this?

Cheers,
Dimitris


On Fri, May 23, 2014 at 1:39 AM, Michael Paquier
michael.paqu...@gmail.com wrote:

 On Thu, May 22, 2014 at 10:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Call graph data usually isn't trustworthy unless you built the program
  with -fno-omit-frame-pointer ...
 This page is full of ideas as well:
 https://wiki.postgresql.org/wiki/Profiling_with_perf
 --
 Michael



Re: [PERFORM] Profiling PostgreSQL

2014-05-23 Thread Pavel Stehule
On 23.5.2014 16:41, Dimitris Karampinas dkaram...@gmail.com wrote:

 Thanks for your answers. A script around pstack worked for me.

 (I'm not sure if I should open a new thread, I hope it's OK to ask
another question here)

 For the workload I run, PostgreSQL seems to scale with the number of
 concurrent clients up to the point where the client count roughly matches
 the number of cores.
 Increasing the number of clients further leads to dramatic performance
 degradation. pstack and perf show that backends block on LWLockAcquire
 calls, so one could assume the system slows down because multiple concurrent
 transactions access the same data.
 However, I did the following two experiments:
 1) I completely removed the UPDATE transactions from my workload. The
 throughput turned out to be better, yet the trend was the same: increasing
 the number of clients still has a very negative performance impact.
 2) I deployed PostgreSQL on more cores. The throughput improved a lot. If
 the problem were due to concurrency control, the throughput should remain
 the same no matter the number of hardware contexts.

 Any insight into why the system behaves like this?

Physical limits: there are two possible bottlenecks, CPU or I/O. Postgres
uses one CPU per session, so with a CPU-intensive benchmark the maximum
should be reached when the number of active workers matches the CPUs. Beyond
that, workers share the CPUs, but total throughput should stay roughly the
same up to circa 10x the CPU count (depending on the test).
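
To see where that ceiling is on a given box, sweeping the client count is the
easiest check (pgbench with increasing -c does this properly; the toy sketch
below, which assumes psycopg2 and uses a placeholder DSN and query, only
illustrates the idea):

# Toy concurrency sweep: measure rough queries/second at increasing client
# counts. Assumes psycopg2; DSN and query are placeholders, and pgbench is
# the proper tool for real measurements.
import threading
import time
import psycopg2

DSN = "dbname=postgres"  # placeholder connection string

def worker(stop, counts, i):
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    cur = conn.cursor()
    while not stop.is_set():
        cur.execute("SELECT 1")  # substitute a representative query
        cur.fetchone()
        counts[i] += 1
    conn.close()

def run(clients, seconds=10):
    stop = threading.Event()
    counts = [0] * clients
    threads = [threading.Thread(target=worker, args=(stop, counts, i))
               for i in range(clients)]
    for t in threads:
        t.start()
    time.sleep(seconds)
    stop.set()
    for t in threads:
        t.join()
    return sum(counts) / seconds

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, "clients:", round(run(n)), "queries/s")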


 Cheers,
 Dimitris


 On Fri, May 23, 2014 at 1:39 AM, Michael Paquier 
michael.paqu...@gmail.com wrote:

 On Thu, May 22, 2014 at 10:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Call graph data usually isn't trustworthy unless you built the program
  with -fno-omit-frame-pointer ...
 This page is full of ideas as well:
 https://wiki.postgresql.org/wiki/Profiling_with_perf
 --
 Michael




Re: [PERFORM] Profiling PostgreSQL

2014-05-23 Thread Jeff Janes
On Fri, May 23, 2014 at 7:40 AM, Dimitris Karampinas dkaram...@gmail.com wrote:

 Thanks for your answers. A script around pstack worked for me.

 (I'm not sure if I should open a new thread, I hope it's OK to ask another
 question here)

 For the workload I run, PostgreSQL seems to scale with the number of
 concurrent clients up to the point where the client count roughly matches
 the number of cores.
 Increasing the number of clients further leads to dramatic performance
 degradation. pstack and perf show that backends block on LWLockAcquire
 calls, so one could assume the system slows down because multiple concurrent
 transactions access the same data.
 However, I did the following two experiments:
 1) I completely removed the UPDATE transactions from my workload. The
 throughput turned out to be better, yet the trend was the same: increasing
 the number of clients still has a very negative performance impact.


Currently, acquisition and release of all LWLocks, even in shared mode, are
protected by spinlocks, which are exclusive, so they cause a lot of
contention even on read-only workloads.  Also, if the working set fits in
RAM but not in shared_buffers, you will have a lot of exclusive locks on the
buffer freelist and the buffer mapping tables.
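
A quick way to check whether that second case applies is to compare the total
on-disk size of your databases with shared_buffers, e.g. (a sketch assuming
psycopg2; the connection string is a placeholder):

# Compare data size with shared_buffers (illustrative; DSN is a placeholder).
# If the data is much larger than shared_buffers, buffer replacement and the
# buffer mapping locks come into play even when the OS has everything cached.
import psycopg2

conn = psycopg2.connect("dbname=postgres")
cur = conn.cursor()
cur.execute("SHOW shared_buffers")
print("shared_buffers:", cur.fetchone()[0])
cur.execute("SELECT pg_size_pretty(sum(pg_database_size(oid))::bigint)"
            " FROM pg_database")
print("total database size:", cur.fetchone()[0])
conn.close()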



 2) I deployed PostgreSQL on more cores. The throughput improved a lot. If
 the problem were due to concurrency control, the throughput should remain
 the same no matter the number of hardware contexts.


Hardware matters!  How did you change the number of cores?

Cheers,

Jeff


Re: [PERFORM] Profiling PostgreSQL

2014-05-23 Thread Dimitris Karampinas
I want to bypass any disk bottleneck, so I store all the data in ramfs (the
purpose of the project is to profile pg, so I don't care about data loss if
anything goes wrong).
Since my data are memory resident, I thought the size of the shared buffers
wouldn't play much of a role, yet I have to admit that I saw a difference in
performance when modifying the shared_buffers parameter.

I use taskset to control the number of cores that PostgreSQL is deployed on.

Is there any parameter/variable in the system that is set dynamically and
depends on the number of cores?

Cheers,
Dimitris


On Fri, May 23, 2014 at 6:52 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Fri, May 23, 2014 at 7:40 AM, Dimitris Karampinas 
 dkaram...@gmail.com wrote:

 Thanks for your answers. A script around pstack worked for me.

 (I'm not sure if I should open a new thread, I hope it's OK to ask
 another question here)

 For the workload I run, PostgreSQL seems to scale with the number of
 concurrent clients up to the point where the client count roughly matches
 the number of cores.
 Increasing the number of clients further leads to dramatic performance
 degradation. pstack and perf show that backends block on LWLockAcquire
 calls, so one could assume the system slows down because multiple concurrent
 transactions access the same data.
 However, I did the following two experiments:
 1) I completely removed the UPDATE transactions from my workload. The
 throughput turned out to be better, yet the trend was the same: increasing
 the number of clients still has a very negative performance impact.


 Currently, acquisition and release of all LWLocks, even in shared mode, are
 protected by spinlocks, which are exclusive, so they cause a lot of
 contention even on read-only workloads.  Also, if the working set fits in
 RAM but not in shared_buffers, you will have a lot of exclusive locks on the
 buffer freelist and the buffer mapping tables.



 2) I deployed PostgreSQL on more cores. The throughput improved a lot. If
 the problem were due to concurrency control, the throughput should remain
 the same no matter the number of hardware contexts.


 Hardware matters!  How did you change the number of cores?

 Cheers,

 Jeff



Re: [PERFORM] Profiling PostgreSQL

2014-05-23 Thread Jeff Janes
On Fri, May 23, 2014 at 10:25 AM, Dimitris Karampinas
dkaram...@gmail.com wrote:

 I want to bypass any disk bottleneck, so I store all the data in ramfs (the
 purpose of the project is to profile pg, so I don't care about data loss if
 anything goes wrong).
 Since my data are memory resident, I thought the size of the shared buffers
 wouldn't play much of a role, yet I have to admit that I saw a difference in
 performance when modifying the shared_buffers parameter.


In which direction?  If making shared_buffers larger improves things, that
suggests that you have contention on the BufFreelistLock.  Increasing
shared_buffers reduces buffer churn (assuming you increase it by enough)
and so decreases that contention.



 I use taskset to control the number of cores that PostgreSQL is deployed
 on.


It can be important which bits you set.  For example, if you have 4 sockets,
each with a quad-core CPU, you would probably maximize the consequences of
spinlock contention by putting one process on each socket, rather than
putting them all on the same socket.
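
Something along these lines can make the placement explicit (a Linux-only
sketch, assuming Python 3; it reads the socket layout from sysfs and pins a
PID, e.g. the postmaster, to the cores of one socket, which is what a
hand-built taskset mask would do):

# Pin a process to all cores of one CPU socket (Linux-only sketch; the
# target PID is passed on the command line). Backends forked after this
# inherit the mask; already-running backends are not re-pinned.
import glob
import os
import re
import sys

def cores_by_socket():
    sockets = {}
    paths = glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/topology/physical_package_id")
    for path in paths:
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            sockets.setdefault(int(f.read()), set()).add(cpu)
    return sockets

if __name__ == "__main__":
    pid = int(sys.argv[1])  # e.g. the postmaster PID
    socket0_cpus = sorted(cores_by_socket().items())[0][1]
    os.sched_setaffinity(pid, socket0_cpus)
    print("pinned", pid, "to CPUs", sorted(socket0_cpus))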



 Is there any parameter/variable in the system that is set dynamically and
 depends on the number of cores?


The number of spins a spinlock goes through before sleeping,
spins_per_delay, is determined dynamically based on how often a tight loop
pays off.  But I don't think this is very sensitive to the exact number
of processors, just the difference between 1 and more than 1.
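
Conceptually the adaptation looks something like this (a toy sketch of the
general shape only, not the actual s_lock.c code or its constants):

# Toy illustration of an adaptive spins-before-sleep counter (not the real
# s_lock.c logic or constants): spin longer next time if spinning paid off,
# back off slowly if we had to sleep anyway.
MIN_SPINS, MAX_SPINS = 10, 1000
spins_per_delay = 100

def update_spins_per_delay(had_to_sleep):
    global spins_per_delay
    if had_to_sleep:
        spins_per_delay = max(spins_per_delay - 1, MIN_SPINS)
    else:
        spins_per_delay = min(spins_per_delay + 100, MAX_SPINS)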