Re: [PERFORM] Full statement logging problematic on larger machines?

2009-03-12 Thread Guillaume Smet
On Thu, Mar 12, 2009 at 2:05 AM, Andrew Dunstan and...@dunslane.net wrote:
 It is buffered at the individual log message level, so that we make sure we
 don't multiplex messages. No more than that.

OK. So if the OP can afford multiplexed queries by using a log
analyzer supporting them, it might be a good idea to try syslog with
full buffering.

-- 
Guillaume



[PERFORM] Entry point for Postgresql Performance

2009-03-12 Thread Nagalingam, Karthikeyan
Hi,
 Can you guide me to the entry point for documentation on PostgreSQL
performance tuning and optimization for PostgreSQL with a storage
controller?

Your recommendations and suggestions are welcome. 
 
Regards 
Karthikeyan.N
 
 


Re: [PERFORM] Entry point for Postgresql Performance

2009-03-12 Thread Ashish Karalkar

Nagalingam, Karthikeyan wrote:

Hi,
 Can you guide me to the entry point for documentation on PostgreSQL 
performance tuning and optimization for PostgreSQL with a storage 
controller?
 
Your recommendations and suggestions are welcome.
 
Regards

Karthikeyan.N
 
 

Take a look at

http://www.postgresql.org/files/documentation/books/aw_pgsql/hw_performance/ 


http://www.scribd.com/doc/4846381/PostgreSQL-Performance-Tuning
http://www.linuxjournal.com/article/4791

With Regards
--Ashish



Re: [PERFORM] Full statement logging problematic on larger machines?

2009-03-12 Thread Frank Joerdens
On Thu, Mar 12, 2009 at 1:45 AM, Tom Lane t...@sss.pgh.pa.us wrote:
[...]
 You could try changing _IOLBF
 to _IOFBF near the head of postmaster/syslogger.c and see if that helps.

I just put the patched .deb on staging and we'll give it a whirl there
for basic sanity checking - we currently have no way to even
approximate the load that we have on live for testing.

If all goes well I expect we'll put it on live early next week. I'll
let you know how it goes.

Regards,

Frank
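
(To make the suggested one-word change concrete: it switches the stdio
buffering mode on the log file stream.  The sketch below is a generic,
self-contained C illustration of that difference, not the actual
postmaster/syslogger.c source; the log file name and buffer size are
invented for the example.)

    #include <stdio.h>

    #define LOG_BUF_SIZE 32768          /* illustrative buffer size */

    int main(void)
    {
        FILE *logfile = fopen("postgres.log", "a");

        if (logfile == NULL)
            return 1;

        /*
         * _IOLBF (line buffering) issues one write() per log line,
         * which is expensive under full statement logging.  The
         * suggestion amounts to using _IOFBF (full buffering) instead,
         * so output is flushed only when the buffer fills:
         */
        setvbuf(logfile, NULL, _IOFBF, LOG_BUF_SIZE);

        fprintf(logfile, "LOG:  statement: SELECT 1;\n");
        fclose(logfile);                /* flushes any buffered output */
        return 0;
    }

(The tradeoff, as noted upthread, is that fully buffered output from
multiple processes can interleave -- hence the earlier comment about
multiplexed messages and log analyzers that tolerate them.)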



Re: [PERFORM] Full statement logging problematic on larger machines?

2009-03-12 Thread Dimitri Fontaine
On Thursday 12 March 2009 14:38:56 Frank Joerdens wrote:
 I just put the patched .deb on staging and we'll give it a whirl there
 for basic sanity checking - we currently have no way to even
 approximate the load that we have on live for testing.

Is it a capacity problem or a tool suite problem?
If the latter, you could try out tsung:
  http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php
  http://tsung.erlang-projects.org/

Regards,
-- 
dim





Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Kevin Grittner
Jignesh K. Shah j.k.s...@sun.com wrote: 
 On 03/11/09 18:27, Kevin Grittner wrote:
 Jignesh K. Shah j.k.s...@sun.com wrote: 
 
 Rerunning similar tests on a 64-thread UltraSPARC T2plus based
 server config
   
 (IO is not a problem... all in RAM .. no disks):
 Time:Users:Type:TPM: Response Time
 60: 100: Medium Throughput: 10552.000 Avg Medium Resp: 0.006
 120: 200: Medium Throughput: 22897.000 Avg Medium Resp: 0.006
 180: 300: Medium Throughput: 33099.000 Avg Medium Resp: 0.009
 240: 400: Medium Throughput: 44692.000 Avg Medium Resp: 0.007
 300: 500: Medium Throughput: 56455.000 Avg Medium Resp: 0.007
 360: 600: Medium Throughput: 67220.000 Avg Medium Resp: 0.008
 420: 700: Medium Throughput: 77592.000 Avg Medium Resp: 0.009
 
 I'm a lot more interested in what's happening between 60 and 180 than
 over 1000, personally.  If there was a RAID involved, I'd put it down
 to better use of the numerous spindles, but when it's all in RAM it
 makes no sense.
 
 The problem is the CPUs are not all busy; there are plenty of idle
 cycles, since PostgreSQL ends up in situations where they are all
 waiting on lock acquires for exclusive locks.
 
Precisely.  This is the area where it seems there is the most to gain.
The area you're looking at seems to have less than a 2X gain
available.
This part of the curve clearly has much more.
 
-Kevin



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Jignesh K. Shah



On 03/11/09 22:01, Scott Carey wrote:

On 3/11/09 3:27 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote:


I'm a lot more interested in what's happening between 60 and 180 than
over 1000, personally.  If there was a RAID involved, I'd put it down
to better use of the numerous spindles, but when it's all in RAM it
makes no sense.

If there is enough lock contention and a common lock case is a short 
lived shared lock, it makes perfect sense.  Fewer readers are 
blocked waiting on writers at any given time.  Readers can 'cut' in 
line ahead of writers within a certain scope (only up to the number 
waiting at the time a shared lock is at the head of the queue). 
 Essentially this clumps up shared and exclusive locks into larger 
streaks, and allows for higher shared lock throughput.  
Exclusive locks may be delayed, but will NOT be starved, since on the 
next iteration, a streak of exclusive locks will occur first in the 
list and they will all process before any more shared locks can go.


This will even help on a single CPU system if it is read dominated, 
lowering read latency and slightly increasing write latency.


If you want to make this more fair, instead of freeing all shared 
locks, limit the count to some number, such as the number of CPU 
cores.  Perhaps rather than wake-up-all-waiters=true, the parameter 
can be an integer representing how many shared locks can be freed at 
once if an exclusive lock is encountered.



Well I am waking up not just shared but shared and exclusives.. However 
I like your idea of waking up the next N waiters where N matches the 
number of CPUs available.  In my case it is 64, so yes this works well: 
the idea being that of all the 64 waiters running right now, one will be 
able to take the next lock immediately, and hence no cycles are wasted 
where nobody gets a lock -- which is often the case when you wake up 
only 1 waiter and hope that the process is on the CPU (in my case there 
are 64 processes) and that it is able to acquire the lock.  The 
probability of acquiring the lock within the next few cycles is much 
lower for only 1 waiter than when giving 64 such processes a chance and 
letting them fight based on who is already on a CPU.  That way the 
period where nobody holds the lock is reduced, and that helps cut out 
artificial idle time on the system.



As soon as I get more cycles I will try variations of it but it would 
help if others can try it out in their own environments to see if it 
helps their instances.



-Jignesh
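
(For readers following along, the no-starvation variant being discussed
-- wake a batch of shared waiters up to a cap such as the CPU count,
but wake an exclusive waiter at the head alone and leave queued
exclusives at the front -- can be sketched in a few lines of C.  This
is a toy model, not the PostgreSQL LWLock code; the queue
representation and names are invented for the illustration.)

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { LW_SHARED, LW_EXCLUSIVE } LockMode;

    typedef struct
    {
        LockMode mode;
        bool     awake;
    } Waiter;

    static int
    wake_waiters(Waiter *queue, int nwaiters, int max_wake)
    {
        int woken = 0;

        for (int i = 0; i < nwaiters && woken < max_wake; i++)
        {
            if (queue[i].mode == LW_EXCLUSIVE)
            {
                if (woken == 0)         /* exclusive at head: wake it alone */
                {
                    queue[i].awake = true;
                    woken = 1;
                }
                break;                  /* stop at the first exclusive */
            }
            queue[i].awake = true;      /* shared: wake and keep going */
            woken++;
        }
        return woken;
    }

    int main(void)
    {
        Waiter q[] = {{LW_SHARED, false}, {LW_SHARED, false},
                      {LW_EXCLUSIVE, false}, {LW_SHARED, false}};

        printf("woke %d waiters\n", wake_waiters(q, 4, 64)); /* woke 2 */
        return 0;
    }

(With max_wake set to the hardware thread count, as suggested, a
64-CPU box would wake at most 64 shared waiters per release; the
exclusive waiter left at the front of the queue goes in the next
round, so it is delayed but never starved.)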



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Kevin Grittner
 Scott Carey sc...@richrelevance.com wrote: 
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 I'm a lot more interested in what's happening between 60 and 180
 than over 1000, personally.  If there was a RAID involved, I'd put
 it down to better use of the numerous spindles, but when it's all
 in RAM it makes no sense.
 
 If there is enough lock contention and a common lock case is a short
 lived shared lock, it makes perfect sense.  Fewer readers are
 blocked waiting on writers at any given time.  Readers can 'cut' in
 line ahead of writers within a certain scope (only up to the number
 waiting at the time a shared lock is at the head of the queue). 
 Essentially this clumps up shared and exclusive locks into larger
 streaks, and allows for higher shared lock throughput.
 
You misunderstood me.  I wasn't addressing the effects of his change,
but rather the fact that his test shows a linear improvement in TPS up
to 1000 connections for a 64 thread machine which is dealing entirely
with RAM -- no disk access.  Where's the bottleneck that allows this
to happen?  Without understanding that, his results are meaningless.
 
-Kevin



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Grzegorz Jaśkiewicz
On Thu, Mar 12, 2009 at 3:13 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Scott Carey sc...@richrelevance.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 I'm a lot more interested in what's happening between 60 and 180
 than over 1000, personally.  If there was a RAID involved, I'd put
 it down to better use of the numerous spindles, but when it's all
 in RAM it makes no sense.

 If there is enough lock contention and a common lock case is a short
 lived shared lock, it makes perfect sense.  Fewer readers are
 blocked waiting on writers at any given time.  Readers can 'cut' in
 line ahead of writers within a certain scope (only up to the number
 waiting at the time a shared lock is at the head of the queue).
 Essentially this clumps up shared and exclusive locks into larger
 streaks, and allows for higher shared lock throughput.

 You misunderstood me.  I wasn't addressing the effects of his change,
 but rather the fact that his test shows a linear improvement in TPS up
 to 1000 connections for a 64 thread machine which is dealing entirely
 with RAM -- no disk access.  Where's the bottleneck that allows this
 to happen?  Without understanding that, his results are meaningless.

I think you are arguing about oranges and he about pears.  Your
argument has little to do with what he is actually saying, as you
should understand.
Scalability is something that is affected by everything, and fixing
this makes as much sense as looking at possible fixes to make RAIDs
more scalable, which someone else is looking at, I think.
So please don't say that this doesn't make sense because he tested it
against a RAM disc.  That was precisely the point of the exercise.


-- 
GJ



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Kevin Grittner
 Grzegorz Jaśkiewicz gryz...@gmail.com wrote: 
 Scalability is something that is affected by everything, and fixing
 this makes as much sense as looking at possible fixes to make RAIDs
 more scalable, which someone else is looking at, I think.
 So please don't say that this doesn't make sense because he tested it
 against a RAM disc.  That was precisely the point of the exercise.
 
I'm probably more inclined to believe that his change may have merit
than many here, but I can't accept anything based on this test until
someone answers the question, so far ignored by all responses, of
where the bottleneck is at the low end which allows linear scalability
up to 1000 users (which I assume means connections).
 
I'm particularly inclined to be suspicious of this test since my own
benchmarks, with real applications replaying real URL requests from a
production website that gets millions of hits per day, show that
response time and throughput are improved by using a connection pool
with queuing to limit the concurrent active queries.
 
My skepticism is not helped by the fact that in a previous discussion
with someone about performance as connections are increased, this
point was covered by introducing a primitive connection pool --
which used a one second sleep for a thread if the maximum number of
connections were already in use, rather than proper queuing and
semaphores.  That really gives no clue how performance would be with a
real connection pool.
 
-Kevin



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Scott Carey
On 3/12/09 7:57 AM, Jignesh K. Shah j.k.s...@sun.com wrote:



On 03/11/09 22:01, Scott Carey wrote:
On 3/11/09 3:27 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote:


If you want to make this more fair, instead of freeing all shared locks, limit 
the count to some number, such as the number of CPU cores.  Perhaps rather than 
wake-up-all-waiters=true, the parameter can be an integer representing how many 
shared locks can be freed at once if an exclusive lock is encountered.



Well I am waking up not just shared but shared and exclusives.. However I like 
your idea of waking up the next N waiters where N matches the number of CPUs 
available.  In my case it is 64, so yes this works well: the idea being that of 
all the 64 waiters running right now, one will be able to take the next lock 
immediately, and hence no cycles are wasted where nobody gets a lock -- which 
is often the case when you wake up only 1 waiter and hope that the process is 
on the CPU (in my case there are 64 processes) and that it is able to acquire 
the lock.  The probability of acquiring the lock within the next few cycles is 
much lower for only 1 waiter than when giving 64 such processes a chance and 
letting them fight based on who is already on a CPU.  That way the period where 
nobody holds the lock is reduced, and that helps cut out artificial idle time 
on the system.

In that case, there can be some starvation of writers.  If all the shared 
waiters are woken up but the exclusives are left at the front of the queue, no 
starvation can occur.
That was a bit of confusion on my part with respect to what the change was 
doing.  Thanks for clarification.



As soon as I get more cycles I will try variations of it but it would help if 
others can try it out in their own environments to see if it helps their 
instances.


-Jignesh




Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Gregory Stark
Grzegorz Jaśkiewicz gryz...@gmail.com writes:

 So please don't say that this doesn't make sense because he tested it
 against a RAM disc.  That was precisely the point of the exercise.

What people are tip-toeing around saying, which I'll just say right out in the
most provocative way, is that Jignesh has simply *misconfigured* the system.
He's contrived to artificially create a lot of unnecessary contention.
Optimizing the system to reduce the cost of that artificial contention at the
expense of a properly configured system would be a bad idea.

It's misconfigured because there are more runnable threads than there are
cpus. A lot more. 15 times as many as necessary. If users couldn't run
connection poolers on their own the right approach for us to address this
contention would be to build one into Postgres, not to re-engineer the
internals around the misuse.

Ram-resident use cases are entirely valid and worth testing, but in those use
cases you would want to have about as many processes as you have processors.

The use case where having larger number of connections than processors makes
sense is when they're blocked on disk i/o (or network i/o or whatever else
other than cpu).

And having it be configurable doesn't mean that it has no cost. Having a test
of a user-settable dynamic variable in the middle of a low-level routine could
very well have some cost. Just the extra code would have some cost in reduced
cache efficiency. It could be that branch prediction and so on save us but that
remains to be proven.

And as always the question would be whether the code designed for this
misconfigured setup is worth the maintenance effort if it's not helping
properly configured setups. Consider for example any work with dtrace to
optimize locks under properly configured setups would lead us to make changes
which would have to be tested twice, once with and once without this option.
What do we do if dtrace says some unrelated change helps systems with this
option disabled but hurts systems with it enabled?



-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Scott Carey

On 3/12/09 8:13 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 Scott Carey sc...@richrelevance.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 I'm a lot more interested in what's happening between 60 and 180
 than over 1000, personally.  If there was a RAID involved, I'd put
 it down to better use of the numerous spindles, but when it's all
 in RAM it makes no sense.

 If there is enough lock contention and a common lock case is a short
 lived shared lock, it makes perfect sense.  Fewer readers are
 blocked waiting on writers at any given time.  Readers can 'cut' in
 line ahead of writers within a certain scope (only up to the number
 waiting at the time a shared lock is at the head of the queue).
 Essentially this clumps up shared and exclusive locks into larger
 streaks, and allows for higher shared lock throughput.

You misunderstood me.  I wasn't addressing the effects of his change,
but rather the fact that his test shows a linear improvement in TPS up
to 1000 connections for a 64 thread machine which is dealing entirely
with RAM -- no disk access.  Where's the bottleneck that allows this
to happen?  Without understanding that, his results are meaningless.

-Kevin

They are not meaningless.  There is certainly more to understand, but the test 
is entirely valid without that.  In a CPU bound / RAM bound case, as concurrency 
increases you look for the throughput trend, the %CPU use trend and the context 
switch rate trend.  More information would be useful but the test is validated 
by the evidence that it is held up by lock contention.

The reasons for not scaling with user count at lower numbers are numerous: 
network, client limitations, or 'lock locality' (if test users access data in 
an organized pattern rather than a random distribution, neighboring clients are 
more likely to block each other than non-neighboring ones).
Furthermore, the MOST valid types of tests don't drive each user in an ASAP 
fashion, but with some pacing to emulate the real world.  In this case you 
expect the user count to be significantly greater than the CPU core count 
before saturation.  We need more info about the relationship between users and 
active postgres backends.  If each user sleeps for 100 ms between queries (or 
processes results and writes HTML for 100ms), your assumption that it should 
take about CPU-core-count users to saturate the CPUs is flawed.

Either way, the result here demonstrates something powerful with respect to CPU 
scalability, and just because 300 clients isn't where it peaks does not mean 
it's invalid; it merely means we don't have enough information to understand 
the test.

The  fact is very simple:  Increasing concurrency does not saturate all the 
CPUs due to lock contention.  That can be shown by the results demonstrated 
without more information.
User count is irrelevant - performance is increasing linearly with user count 
for quite a while and then peaks and slightly dips.  This is the typical curve 
for all tests with a measured pacing per client.
We want to know more though.  More data would help (active postgres backends, 
%CPU, context switch rate would be my top 3 extra columns in the data set). 
From there all that we want to know is what the locks are and if that 
contention is artificial.  What tools are available to show what locks are most 
contended with Postgres?  Once the locks are known, we want to know if the 
locking can be tuned away by one of three general types of strategies:  Less 
locking via smart use of atomics or copy on write (non-blocking strategies, 
probably fully investigated already); finer grained locks (most definitely 
investigated); improved performance of locks (looked into for sure, but is 
highly hardware dependent).



Re: [PERFORM] Entry point for Postgresql Performance

2009-03-12 Thread Rajesh Kumar Mallah
Databases are usually IO bound; vmstat results can confirm individual
cases and setups.
If the server is IO bound, the entry point should be setting up
properly performing IO.  RAID10 helps to a great extent in improving IO
bandwidth by parallelizing IO operations; the more spindles the better.
Write caches also help a great deal by caching writes and making
commits faster.

In my opinion, system level tools (like vmstat) at peak load times can
be an entry point to understanding the bottlenecks of a particular
setup.

If there is swapping, you absolutely need to double the RAM (excess
RAM can be used for disk block caching).
If it is CPU bound, add more cores or higher speed CPUs.
If it is IO bound, use better RAID arrays and controllers.


regds
mallah.

On Thu, Mar 12, 2009 at 4:22 PM, Nagalingam, Karthikeyan
karthikeyan.nagalin...@netapp.com wrote:
 Hi,
  Can you guide me to the entry point for documentation on PostgreSQL
 performance tuning and optimization for PostgreSQL with a storage
 controller?

 Your recommendations and suggestions are welcome.

 Regards
 Karthikeyan.N





Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Scott Carey
On 3/11/09 7:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Scott Carey sc...@richrelevance.com writes:
 If there is enough lock contention and a common lock case is a short lived 
 shared lock, it makes perfect sense.  Fewer readers are blocked waiting 
 on writers at any given time.  Readers can 'cut' in line ahead of writers 
 within a certain scope (only up to the number waiting at the time a shared 
 lock is at the head of the queue).  Essentially this clumps up shared and 
 exclusive locks into larger streaks, and allows for higher shared lock 
 throughput.
 Exclusive locks may be delayed, but will NOT be starved, since on the next 
 iteration, a streak of exclusive locks will occur first in the list and they 
 will all process before any more shared locks can go.

That's a lot of sunny assertions without any shred of evidence behind
them...

The current LWLock behavior was arrived at over multiple iterations and
is not lightly to be toyed with IMHO.  Especially not on the basis of
one benchmark that does not reflect mainstream environments.

Note that I'm not saying no.  I'm saying that I want a lot more
evidence *before* we go to the trouble of making this configurable
and asking users to test it.

regards, tom lane


All I'm adding is that it makes some sense to me based on my experience in CPU 
/ RAM bound scalability tuning.  It was expressed that the test itself didn't 
even make sense.

I was wrong in my understanding of what the change did.  If it wakes ALL 
waiters up there is an indeterminate amount of time a lock will wait.
However, if instead of waking up all of them it only wakes up the shared 
readers and leaves all the exclusive ones at the front of the queue, there is 
no possibility of starvation since those exclusives will be at the front of the 
line after the wake-up batch.

As for this being a use case that is important:

*  SSDs will drive the % of use cases that are not I/O bound up significantly 
over the next couple years.  All postgres installations with less than about 
100GB of data TODAY could avoid being I/O bound with current SSD technology, 
and those less than 2TB can do so as well but at high expense or with less 
proven technology like the ZFS L2ARC flash cache.
*  Intel will have a mainstream CPU that handles 12 threads (6 cores, 2 threads 
each) at the end of this year.  Mainstream two CPU systems will have access to 
24 threads and be common in 2010.  Higher end 4CPU boxes will have access to 48 
CPU threads.  Hardware thread count is only going up.  This is the future.



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 You misunderstood me.  I wasn't addressing the effects of his change,
 but rather the fact that his test shows a linear improvement in TPS up
 to 1000 connections for a 64 thread machine which is dealing entirely
 with RAM -- no disk access.  Where's the bottleneck that allows this
 to happen?  Without understanding that, his results are meaningless.

Yeah, that is a really good point.  For a CPU-bound test you would
ideally expect linear performance improvement up to the point at which
number of active threads equals number of CPUs, and flat throughput
with more threads.  The fact that his results don't look like that
should excite deep suspicion that something is wrong somewhere.

This does not in itself prove that the idea is wrong, but it does say
that there is some major effect happening in this test that we don't
understand.  Without understanding it, it's impossible to guess whether
the proposal is helpful in any other scenario.

regards, tom lane



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Jignesh K. Shah



On 03/12/09 11:13, Kevin Grittner wrote:
 Scott Carey sc...@richrelevance.com wrote: 
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 I'm a lot more interested in what's happening between 60 and 180
 than over 1000, personally.  If there was a RAID involved, I'd put
 it down to better use of the numerous spindles, but when it's all
 in RAM it makes no sense.

 If there is enough lock contention and a common lock case is a short
 lived shared lock, it makes perfect sense.  Fewer readers are
 blocked waiting on writers at any given time.  Readers can 'cut' in
 line ahead of writers within a certain scope (only up to the number
 waiting at the time a shared lock is at the head of the queue).
 Essentially this clumps up shared and exclusive locks into larger
 streaks, and allows for higher shared lock throughput.

 You misunderstood me.  I wasn't addressing the effects of his change,
 but rather the fact that his test shows a linear improvement in TPS up
 to 1000 connections for a 64 thread machine which is dealing entirely
 with RAM -- no disk access.  Where's the bottleneck that allows this
 to happen?  Without understanding that, his results are meaningless.

 -Kevin


Every user has a think time (200ms) to wait before doing the next 
transaction, which results in idle time and theoretically allows other 
users to run in between.


-Jignesh
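
(This think-time pacing is why the user count and the active backend
count diverge.  Below is a minimal sketch of one paced test user, in
self-contained C with a stubbed-out transaction; the harness details
are invented for illustration.)

    #include <stdio.h>
    #include <time.h>

    #define THINK_TIME_MS 200

    static void
    run_transaction(void)
    {
        /* stand-in for a ~10-20ms database transaction */
    }

    static void
    msleep(long ms)
    {
        struct timespec ts = {ms / 1000, (ms % 1000) * 1000000L};

        nanosleep(&ts, NULL);           /* POSIX */
    }

    int main(void)
    {
        /* one simulated user's session loop */
        for (int i = 0; i < 100; i++)
        {
            run_transaction();
            msleep(THINK_TIME_MS);      /* idle: others can run now */
        }
        puts("session done");
        return 0;
    }

(With a 10-20ms transaction and a 200ms think time, each user is busy
well under 10% of the time, so 1000 such users correspond to something
like 50-90 active backends -- consistent with the estimates that come
up later in the thread.)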



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Scott Carey
On 3/12/09 10:09 AM, Gregory Stark st...@enterprisedb.com wrote:


Ram-resident use cases are entirely valid and worth testing, but in those use
cases you would want to have about as many processes as you have processors.

Within a factor of two or so, yes.  However, where in his results does it show 
that there are 1000 active postgres connections?  What if the test script is 
the most valid type:  emulating application compute and sleep time between 
requests?

What it is showing is "Users".  We don't know the relationship between those 
and active postgres connections.  Your contention is ONLY valid for active 
postgres processes.

Yes, the test could be invalid if it is artificially making all users bang up 
on the same locks by, for example, having them all access the same rows.  
However, if this were what explains the results around the user count being 
about equal to CPU threads, then the throughput would have stopped growing 
around where the user count got near the CPU threads, not after a couple 
thousand.

The 'fingerprint' of this load test -- linear scaling up to a point, then a 
peak and dropoff -- is that of a test with paced users, not one with 
artificial locking affecting results at low user counts.  More data would 
help, but artificial lock contention with a low user count would have shown up 
at low user counts, not after 1000 users.  There are some difficult to 
manipulate ways to fake this out (which is why CPU% and context switch rate 
data would help).  This is most likely a 'paced user' profile.

The use case where having larger number of connections than processors makes
sense is when they're blocked on disk i/o (or network i/o or whatever else
other than cpu).

Um, or are idle in a connection pool for 100ms.  There is no such thing as a 
perfectly sized connection pool.  And there is nothing wrong with some idle 
connections.


And as always the question would be whether the code designed for this
misconfigured setup is worth the maintenance effort if it's not helping
properly configured setups.

Now you are just assuming it's misconfigured.  I'd wager quite a bit it helps 
properly configured setups too, so long as they have lots of hardware threads.





Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Scott Carey
On 3/12/09 10:53 AM, Tom Lane t...@sss.pgh.pa.us wrote:

Kevin Grittner kevin.gritt...@wicourts.gov writes:
 You misunderstood me.  I wasn't addressing the effects of his change,
 but rather the fact that his test shows a linear improvement in TPS up
 to 1000 connections for a 64 thread machine which is dealing entirely
 with RAM -- no disk access.  Where's the bottleneck that allows this
 to happen?  Without understanding that, his results are meaningless.

Yeah, that is a really good point.  For a CPU-bound test you would
ideally expect linear performance improvement up to the point at which
number of active threads equals number of CPUs, and flat throughput
with more threads.  The fact that his results don't look like that
should excite deep suspicion that something is wrong somewhere.

This does not in itself prove that the idea is wrong, but it does say
that there is some major effect happening in this test that we don't
understand.  Without understanding it, it's impossible to guess whether
the proposal is helpful in any other scenario.

regards, tom lane

Only on the assumption that each thread in the load test is running in ASAP 
mode rather than at a metered pace.


Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Tom Lane
Scott Carey sc...@richrelevance.com writes:
 They are not meaningless.  There is certainly more to understand, but the test 
 is entirely valid without that.  In a CPU bound / RAM bound case, as 
 concurrency increases you look for the throughput trend, the %CPU use trend 
 and the context switch rate trend.  More information would be useful but the 
 test is validated by the evidence that it is held up by lock contention.

Er ... *what* evidence?  There might be evidence somewhere that proves
that, but Jignesh hasn't shown it.  The available data suggests that the
first-order performance limiter in this test is something else.
Otherwise it should be possible to max out the performance with a lot
less than 1000 active backends.

regards, tom lane



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Ron

At 11:44 AM 3/12/2009, Kevin Grittner wrote:

I'm probably more inclined to believe that his change may have merit 
than many here, but I can't accept anything based on this test until 
someone answers the question, so far ignored by all responses, of 
where the bottleneck is at the low end which allows linear 
scalability up to 1000 users (which I assume means connections).


I'm particularly inclined to be suspicious of this test since my own 
benchmarks, with real applications replaying real URL requests from 
a production website that gets millions of hits per day, show that 
response time and throughput are improved by using a connection pool 
with queuing to limit the concurrent active queries.


My skepticism is not helped by the fact that in a previous 
discussion with someone about performance as connections are 
increased, this point was covered by introducing a primitive 
connection pool -- which used a one second sleep for a thread if the 
maximum number of connections were already in use, rather than 
proper queuing and semaphores.  That really gives no clue how 
performance would be with a real connection pool.


-Kevin


IMHO, Jignesh is looking at performance for a specialized niche in the 
overall space of pg use -- that of memory-resident DBs.  Here are my 
thoughts on the more general problem.  The following seems to explain 
all the performance phenomena discussed so far while suggesting an 
improvement in how pg deals with lock scaling and contention.


 Thoughts on lock scaling and contention

logical limits
...for Exclusive locks
a= the number of non overlapping sets of DB entities (tables, rows, etc)
If every exclusive lock wants a different table,
then the limit is the number of tables.
If any exclusive lock wants the whole DB,
then there can only be one lock.
b= possible HW limits
Even if all exclusive locks in question ask for distinct DB entities, it is
possible that the HW servicing those locks could be saturated.
...for Shared locks
a= HW Limits

HW limits
a= network IO
b= HD IO
Note that a and b may change relative order in some cases.
A possibly unrealistic extreme to demonstrate the point would be a system with
1 HD and 10G networking. It's likely to be HD IO bound before network 
IO bound.

c= RAM IO
d= Internal CPU bandwidth

Since a DB must first and foremost protect the integrity of the data being
processed, the above implies that we should process transactions in time order
of resource access (thus transactions that do not share resources can always
run in parallel) while running as many of them in parallel as we can that
a= do not violate the exclusive criteria, and
b= do not over saturate any resource being used for the processing.

This looks exactly like a job scheduling problem from the days of mainframes.
(Or instruction scheduling in a CPU to maximize the IPC of a thread.)

The solution in the mainframe domain was multi-level feedback queues with
priority aging.
Since the concept of a time slice makes no sense in a DB, this becomes a
multi-level resource coloring problem with dynamic feedback based on
exclusivity and resource contention.

A possible algorithm might be
1= every transaction for a given DB entity has priority over any transaction
submitted at a later time that uses that same DB entity.
2= every transaction that does not conflict with an earlier transaction can
run in parallel with that earlier transaction
3= if any resource becomes saturated, we stop scheduling transactions that use
that resource or that are dependent on that resource until the saturation is
resolved.

To implement this, we need
a= to be able to count the number of locks for any given DB entity
b= some way of detecting HW saturation
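
(A toy sketch of the admission rule above, in self-contained C.  The
data structures, names, and the stubbed saturation check are invented
for illustration; nothing like this exists in pg.)

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_ENTITIES 8

    typedef struct
    {
        long arrival;                   /* submission order, rule 1= */
        int  entities[MAX_ENTITIES];    /* DB entities it touches */
        int  nentities;
    } Txn;

    /* b= detecting HW saturation is the open problem; stubbed here */
    static bool
    resource_saturated(const Txn *t) { (void) t; return false; }

    static bool
    conflicts(const Txn *a, const Txn *b)
    {
        for (int i = 0; i < a->nentities; i++)
            for (int j = 0; j < b->nentities; j++)
                if (a->entities[i] == b->entities[j])
                    return true;
        return false;
    }

    /* a candidate may start iff no earlier transaction shares an
     * entity with it (rules 1= and 2=) and no resource it needs is
     * saturated (rule 3=) */
    static bool
    can_schedule(const Txn *txns, int n, const Txn *cand)
    {
        if (resource_saturated(cand))
            return false;
        for (int i = 0; i < n; i++)
            if (&txns[i] != cand &&
                txns[i].arrival < cand->arrival &&
                conflicts(&txns[i], cand))
                return false;
        return true;
    }

    int main(void)
    {
        Txn t[] = {{1, {42}, 1}, {2, {42}, 1}, {3, {7}, 1}};

        /* t[1] shares entity 42 with earlier t[0], so it must wait;
         * t[2] touches only entity 7 and runs in parallel */
        printf("%d %d\n", can_schedule(t, 3, &t[1]),
               can_schedule(t, 3, &t[2]));      /* prints: 0 1 */
        return 0;
    }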

Hope this is useful,
Ron Peacetree



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Jignesh K. Shah



On 03/12/09 13:48, Scott Carey wrote:

On 3/11/09 7:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:

All I'm adding is that it makes some sense to me based on my 
experience in CPU / RAM bound scalability tuning.  It was expressed 
that the test itself didn't even make sense.


I was wrong in my understanding of what the change did.  If it wakes 
ALL waiters up there is an indeterminate amount of time a lock will wait.
However, if instead of waking up all of them it only wakes up the 
shared readers and leaves all the exclusive ones at the front of the 
queue, there is no possibility of starvation since those exclusives 
will be at the front of the line after the wake-up batch.


As for this being a use case that is important:

*  SSDs will drive the % of use cases that are not I/O bound up 
significantly over the next couple years.  All postgres installations 
with less than about 100GB of data TODAY could avoid being I/O bound 
with current SSD technology, and those less than 2TB can do so as well 
but at high expense or with less proven technology like the ZFS L2ARC 
flash cache.
*  Intel will have a mainstream CPU that handles 12 threads (6 cores, 
2 threads each) at the end of this year.  Mainstream two CPU systems 
will have access to 24 threads and be common in 2010.  Higher end 4CPU 
boxes will have access to 48 CPU threads.  Hardware thread count is 
only going up.  This is the future.




SSDs are precisely my motivation for doing RAM based tests with 
PostgreSQL. While I am waiting for my SSDs to arrive, I started to 
emulate SSDs by putting the whole database in RAM, which in a sense is 
better than SSDs, so if we can tune with RAM disks then SSDs will be 
covered.


What we have is a pool of 2000 users, and we start making each user do 
a series of transactions on different rows and see how much the database 
can handle linearly before some bottleneck (system or database) kicks in 
and there can be no more linear increase in active users. Many times 
there is a drop after reaching some value of active users. If all 2000 
users can scale linearly then another test with, say, 2500 can be 
executed. The goal is to find the limit we can reach before there are 
typically no system resources remaining to be exploited.


That said, the testkit that I am using is a lightweight OLTP-ish 
workload which a user runs against a preknown schema, and between the 
various transactions that it does it emulates a wait time of 200ms. It 
is in some sense emulating a real user who clicks, then waits to see 
what he got, and does another click which results in another transaction 
happening.  (Not exactly, but you get the point.)  Like all workloads it 
is generally used to find bottlenecks in systems before putting 
production stuff on them.



In my current environment I am running similar workloads and seeing how 
many users can go to the point where the system has no more CPU 
resources available for a linear growth in tpm. Generally, as many of 
you mentioned, you will see disk latency, network latency, CPU resource 
problems, etc., and that's the work I am doing right now. I am working 
around network latency by using a private network, and improving 
operating system tunables to improve efficiency out there. I am 
improving disk latency by putting the data on RAM (and soon on SSDs). 
However, if I still cannot consume all the CPU then it means I am 
probably hit by locks. Using the PostgreSQL DTrace probes I can see 
what's happening.


At low user (100 users) counts my lock profiles from a user point of 
view are as follows:



# dtrace -q -s 84_lwlock.d 1764

 Lock Id              Mode        State      Count
 ProcArrayLock        Shared      Waiting        1
 CLogControlLock      Shared      Acquired       2
 ProcArrayLock        Exclusive   Waiting        3
 ProcArrayLock        Exclusive   Acquired      24
 XidGenLock           Exclusive   Acquired      24
 FirstLockMgrLock     Shared      Acquired      25
 CLogControlLock      Exclusive   Acquired      26
 FirstBufMappingLock  Shared      Acquired      55
 WALInsertLock        Exclusive   Acquired      75
 ProcArrayLock        Shared      Acquired     178
 SInvalReadLock       Shared      Acquired     378

 Lock Id              Mode        State      Combined Time (ns)
 SInvalReadLock                   Acquired        29849
 ProcArrayLock        Shared      Waiting         92261
 ProcArrayLock                    Acquired       951470
 FirstLockMgrLock     Exclusive   Acquired      1069064
 CLogControlLock      Exclusive   Acquired      1295551
 ProcArrayLock        Exclusive   Waiting       1758033
 FirstBufMappingLock  Exclusive

Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Alvaro Herrera
Tom Lane wrote:
 Scott Carey sc...@richrelevance.com writes:
  They are not meaningless.  There is certainly more to understand, but the test 
  is entirely valid without that.  In a CPU bound / RAM bound case, as 
  concurrency increases you look for the throughput trend, the %CPU use trend 
  and the context switch rate trend.  More information would be useful but 
  the test is validated by the evidence that it is held up by lock contention.
 
 Er ... *what* evidence?  There might be evidence somewhere that proves
 that, but Jignesh hasn't shown it.  The available data suggests that the
 first-order performance limiter in this test is something else.
 Otherwise it should be possible to max out the performance with a lot
 less than 1000 active backends.

With 200ms of think times as Jignesh just said, 1000 users does not
equate 1000 active backends.  (It's probably closer to 100 backends,
given an avg. response time of ~20ms)
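
(This is just Little's law.  With N users, think time t, and service
time s, the expected number of busy backends is roughly

    N \cdot \frac{s}{s + t} = 1000 \cdot \frac{0.02}{0.02 + 0.2} \approx 91

using the ~20ms response time and 200ms think time above.)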

Something that might be useful for him to report is the avg number of
active backends for each data point ...

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Kevin Grittner
 Jignesh K. Shah j.k.s...@sun.com wrote: 
 What we have is a pool of 2000 users, and we start making each user
 do a series of transactions on different rows and see how much the
 database can handle linearly before some bottleneck (system or
 database) kicks in and there can be no more linear increase in active
 users. Many times there is a drop after reaching some value of active
 users. If all 2000 users can scale linearly then another test with,
 say, 2500 can be executed. The goal is to find the limit we can reach
 before there are typically no system resources remaining to be
 exploited.
 
 I don't think I have misconfigured the system.
 
If you're not using a queuing connection pool with that many users, I
think you have.  Let me illustrate with a simple example.
 
Imagine you have one CPU and negligible hardware resource delays, and
you have 100 queries submitted at the same moment which each take one
second of CPU time.  If you start them all concurrently, they will all
be done in about 100 seconds, with an average run time of 100 seconds.
If you queue them and run them one at a time, the first will be done
in one second, and the last will be done in 100 seconds, with an
average run time of 50.5 seconds.  The context switching and extra RAM
needed for the multiple connections would tend to make the difference
worse.
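
(For the arithmetic: run all n one-second queries at once and each
finishes near time n, so the mean completion time is about n seconds.
Run them one at a time and the k-th finishes at time k, giving a mean of

    \frac{1}{n} \sum_{k=1}^{n} k = \frac{n + 1}{2} = 50.5 \text{ s for } n = 100

which is where the 50.5-second figure comes from.)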
 
What makes concurrent queries helpful is that one might block waiting
on a resource, and another can run during that time.  Still, there is
a concurrency level at which the above effect comes into play.  The
more CPUs and spindles you have, the higher the count of useful
concurrent sessions; but there will always be a point where you're
better off queuing additional requests and scheduling them.  The RAM
usage per connection and the cost of context switching pretty much
guarantee that.
 
With our hardware and workloads, I've been able to spot the pattern
that we settle in best with a pool which allows the number of active
queries to be about 2 times the CPU count plus the number of effective
spindles.  Other hardware environments and workloads will undoubtedly
have different sweet spots; however, 2000 concurrent queries running
on 64 CPUs with no significant latency on storage or network is almost
certainly *not* a sweet spot.  Changing PostgreSQL to be well
optimized for such a misconfigured system seems ill-advised to me.
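
(A minimal sketch of such a queuing admission gate, using POSIX
semaphores in self-contained C.  The sizing constants follow the rule
of thumb above but are otherwise invented; a real pooler -- pgpool,
pgbouncer, or an application-side pool -- would also manage the
connections themselves.)

    #include <semaphore.h>
    #include <stdio.h>

    #define CPUS      8                 /* illustrative */
    #define SPINDLES  4                 /* illustrative */
    #define POOL_SIZE (2 * CPUS + SPINDLES)

    static sem_t pool_slots;

    static void
    run_query(const char *sql)
    {
        /* blocks -- properly queued, no sleep/poll loop -- whenever
         * POOL_SIZE queries are already active */
        sem_wait(&pool_slots);

        /* ... execute sql on a pooled connection here ... */
        (void) sql;

        /* hands the freed slot to exactly one waiter immediately */
        sem_post(&pool_slots);
    }

    int main(void)
    {
        sem_init(&pool_slots, 0, POOL_SIZE);
        run_query("SELECT 1;");
        sem_destroy(&pool_slots);
        return 0;
    }

(Contrast this with the "one second sleep" pool criticized above: a
semaphore wakes the next waiter as soon as a slot frees, so no request
idles after capacity becomes available.)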
 
On the other hand, I'd love to see numbers for your change in a more
optimally configured environment, since we found that allowing the
thundering herd worked pretty well in allowing threads in our
framework's database service to compete for pulling requests off the
prioritized queue of requests -- as long as the herd didn't get too
big.  I just want to see some plausible evidence from a test
environment which seems reasonable to me before I spend time setting
up my own benchmarks.
 
 I am trying another run where I limit the woken-up threads to a 
 pre-configured number to see how various numbers pan out in terms
 of throughput on this server.
 
Please ensure that requests are queued when all allowed connections
are busy, and that when a connection completes a request it will
immediately begin serving another.  Routing requests through a method
which introduces an arbitrary sleep delay before waking up and
checking again is not going to be very convincing.  It would help if
the number of connections used is related to your pool size, and the
max_connections is adjusted proportionally.
 
-Kevin



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Jignesh K. Shah



On 03/12/09 15:10, Alvaro Herrera wrote:
 Tom Lane wrote:
 Scott Carey sc...@richrelevance.com writes:
 They are not meaningless.  There is certainly more to understand, but
 the test is entirely valid without that.  In a CPU bound / RAM bound
 case, as concurrency increases you look for the throughput trend, the
 %CPU use trend and the context switch rate trend.  More information
 would be useful but the test is validated by the evidence that it is
 held up by lock contention.

 Er ... *what* evidence?  There might be evidence somewhere that proves
 that, but Jignesh hasn't shown it.  The available data suggests that the
 first-order performance limiter in this test is something else.
 Otherwise it should be possible to max out the performance with a lot
 less than 1000 active backends.

 With 200ms of think times as Jignesh just said, 1000 users does not
 equate 1000 active backends.  (It's probably closer to 100 backends,
 given an avg. response time of ~20ms)

 Something that might be useful for him to report is the avg number of
 active backends for each data point ...
Short of doing select * from pg_stat_activity and removing the IDLE 
entries, is there any other clean way to get that information?  If 
there is no other latency then active backends should be active users * 
10ms/200ms, or active users/20, on average. However the number is still 
lower than that, since an active user can still be waiting for locks, 
which can be either on CPU (spin) or sleeping (shown by the increase in 
average response time of execution, which includes the wait).


Also, to date I am primarily interested in active backends which are 
waiting to acquire locks, since I find that making that more efficient 
gives me the biggest return on my buck: lower response time and higher 
throughput.


-Jignesh



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Greg Smith

On Thu, 12 Mar 2009, Jignesh K. Shah wrote:

As soon as I get more cycles I will try variations of it but it would 
help if others can try it out in their own environments to see if it 
helps their instances.


What you should do next is see whether you can remove the bottleneck your 
test is running into via using a connection pooler.  That's what I think 
most informed people would do were you to ask how to setup an optimal 
environment using PostgreSQL that aimed to serve thousands of clients. 
If that makes your bottleneck go away, that's what you should be 
recommending to customers who want to scale in this fashion too.  If the 
bottleneck moves to somewhere else, that new hot spot might be one people 
care more about.  Given that there are multiple good pooling solutions 
floating around already, it's hard to justify dumping coding and testing 
resources here if that makes the problem move somewhere else.


It's great that you've identified an alternate scheduling approach that 
helps on your problematic test case, but you're a long ways from having a 
full model of how changes to the locking model impact other database 
workloads.  As for the idea of doing something in this area for 8.4, there 
are a significant number of performance-related changes already committed 
for that version that deserve more focused testing during beta.  You're 
way too late to throw another one into that already crowded area.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Full statement logging problematic on larger machines?

2009-03-12 Thread Laurent Laborde
On Wed, Mar 11, 2009 at 11:42 PM, Frank Joerdens fr...@joerdens.de wrote:

 effective_cache_size            = 4GB

Only 4GB with 64GB of RAM?

About logging: we have 3 partitions:
- data
- index
- everything else, including logging.

Usually, we log to a remote syslog (a dedicated log server for the
whole server farm).

For profiling (pgfouine), we have a crontab that changes the postgresql
logging configuration for just a few minutes, and logs everything to
the "everything else" partition.

Around 2000 queries/second/server, no problem.


-- 
Laurent Laborde
Sysadmin at JFG-Networks / Over-blog



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Scott Carey
On 3/12/09 11:28 AM, Tom Lane t...@sss.pgh.pa.us wrote:

Scott Carey sc...@richrelevance.com writes:
 They are not meaningless.  There is certainly more to understand, but the test 
 is entirely valid without that.  In a CPU bound / RAM bound case, as 
 concurrency increases you look for the throughput trend, the %CPU use trend 
 and the context switch rate trend.  More information would be useful but the 
 test is validated by the evidence that it is held up by lock contention.

Er ... *what* evidence?  There might be evidence somewhere that proves
that, but Jignesh hasn't shown it.  The available data suggests that the
first-order performance limiter in this test is something else.
Otherwise it should be possible to max out the performance with a lot
less than 1000 active backends.

regards, tom lane

Evidence:

Ramp up the concurrency, measure throughput.  Throughput peaks at X with low 
CPU utilization, linear ramp up until then.   Change lock code.  Throughput 
scales past that point to much higher CPU load.
That's evidence.  Please explain a scenario that proves otherwise.  Your last 
statement above is true but not applicable here.  The test is not 1000 
backends, it lists 1000 users.

There is a key difference between users and backends.  In fact, the evidence is 
that the result can't be backends (the column is labeled users).  If it's not 
I/O bound, it must cap out when the number of active backends is roughly near 
the number of CPUs or less, and as noted it does not.  This isn't proof that 
there is something wrong with the test; it's proof that the 1000 number cannot 
be active backends.

I spent a decade solving and tuning CPU scalability problems in CPU/memory 
bound systems.  Sophisticated tests peak at a user count much greater than the 
CPU count, because real users don't execute as fast as possible.  Through a 
chain of servers several layers deep, each tier can have different levels of 
concurrent activity.  It is useful to measure concurrency at each tier, but 
almost impossible in postgres (easy in oracle / mssql).  Most systems have a 
limited thread pool but can queue much more than that number.  Postgres and 
many databases don't do that, so clients must do so via connection pools.  But 
the resulting behavior of too much concurrency is thrashing and inefficiency 
-- this shows up, in a test that ramps up concurrency, as peak throughput 
followed by a steep drop off in throughput as concurrency goes into the 
thrashing state.  At this thrashing time a lot of context switching and 
sometimes RAM pressure is a typical symptom.

The only way to construct a test that shows the current described behavior 
(linear ramp up, then plateau) is to  have lock contention, I/O bottlenecks, or 
CPU saturation.  The number of users is irrelevant, the trend is the same 
regardless of the relationship between user count and active backend count (0 
delay or 1 second delay, same result different X axis).  If it was an I/O or 
client bottleneck, changing the lock code wouldn't have made it faster.

The evidence is 100% certain that the first test result is limited by locks, 
and that changing them increased throughput.


Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Scott Carey

On 3/12/09 1:35 PM, Greg Smith gsm...@gregsmith.com wrote:

On Thu, 12 Mar 2009, Jignesh K. Shah wrote:

 As soon as I get more cycles I will try variations of it but it would
 help if others can try it out in their own environments to see if it
 helps their instances.

What you should do next is see whether you can remove the bottleneck your
test is running into via using a connection pooler.

I doubt it is running into a bottleneck due to that; the symptoms aren't 
right.  He can change his test to have near zero delay to simulate such a 
connection pool.

If it were an issue due to concurrency at that level, the results would not 
have scaled linearly with user count to a plateau the way they did.  There 
would be a steep drop-down from thrashing as concurrency kept going up.  
Context switch data would help, since the thrashing shows up measurably there. 
No evidence of concurrency thrashing yet that I see, but more tests and data 
would help.

The disconnect is that the Users column in his data does not represent 
back-ends.  It represents concurrent users on the front-end.  Whether these 
pool while idle or not is not clear.  It would be useful to rule that 
possibility out, but that looks like an improbable diagnosis to me given the 
lack of performance decrease as concurrency goes up.
Furthermore, if the problem were due to too much concurrency in the database 
with active connections, it is hard to see how changing the lock code would 
change the result the way it did -- increasing CPU and throughput accordingly. 
Again, context switch rate info would help rule out many possibilities.

That's what I think
most informed people would do were you to ask how to setup an optimal
environment using PostgreSQL that aimed to serve thousands of clients.
If that makes your bottleneck go away, that's what you should be
recommending to customers who want to scale in this fashion too.

First just run a test with a tiny delay (5ms? 0?) and fewer users to compare.  
If your theory is that a connection pooler would help, that test should 
provide higher throughput with a low user count and not be lock limited.  This 
may be easier to run than setting up a pooler, though he should investigate 
one regardless.

If the
bottleneck moves to somewhere else, that new hot spot might be one people
care more about.  Given that there are multiple good pooling solutions
floating around already, it's hard to justify dumping coding and testing
resources here if that makes the problem move somewhere else.

It's worth ruling out given that even if the likelihood is small, the fix is 
easy.  However, I don't see the throughput drop from peak as more concurrency 
is added that is the hallmark of this problem -- usually with a lot of context 
switching and a sudden increase in CPU use per transaction.

The biggest disconnect in load testing almost always occurs over the definition 
of concurrent users.
Think of an HTTP app, backed by a db - about as simple as it gets these days 
(this is fun with 5, 6 tier fanned out stuff).

Users could mean:
Number of application user logins used.
Number of test harness threads or processes that are active.
Number of open HTTP connections
Number of HTTP requests being processed
Number of connections from the app to the db
Number of active connections from the app to the db

Knowing which of these is the topic, and what that means in relation to all 
the others, is often messy.  Even without knowing which one a given result 
refers to, you can still learn a lot.  The data in the results here prove 
it's not the last one on the list above, nor the first one.  It could still 
be any of the middle four, but it is most likely #2 or the second-to-last 
one (which might be equivalent).
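
As a concrete note on the last two items: the database-side counts are 
directly observable.  A sketch, assuming psycopg2 and the 8.3/8.4-era 
pg_stat_activity layout (an idle backend reports '<IDLE>' as its 
current_query); the DSN is a placeholder, not the actual test setup:

# Count connections from the app to the db vs. *active* connections.
# Note that '<IDLE> in transaction' counts as active here.
import psycopg2

conn = psycopg2.connect("dbname=igen user=postgres")  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    SELECT CASE WHEN current_query = '<IDLE>' THEN 'idle' ELSE 'active' END,
           count(*)
    FROM pg_stat_activity
    GROUP BY 1
""")
for state, n in cur.fetchall():
    print(state, n)
conn.close()

Sampling that once a second during a run settles whether the test harness 
holds idle connections open or not.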


Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Scott Carey
On 3/12/09 11:37 AM, Jignesh K. Shah j.k.s...@sun.com wrote:


And again, this is the third time I am saying it: the test users also have 
some latency built into them, which is what is generally exploited to get 
more users than the number of CPUs on the system, and that is exactly the 
point we want to exploit.  Otherwise, if all new users began to do their 
job with no latency, we would need 6+ billion CPUs to handle all possible 
users.  Typically, as an administrator (system and database), I can only 
tweak/control latencies within my domain, that is, network, disk, CPUs, 
etc.; those are what I am tweaking to arrive at a *configured* environment, 
and now I am trying to improve lock contention/waits in PostgreSQL so that 
we have an optimized setup.

In general, I suggest that it is useful to run tests with a few different 
types of pacing.  Zero-delay pacing will not have a realistic number of 
connections, but it will expose bottlenecks that are universal and less 
controversial.  Small-latency (100 ms to 1 s) tests are easy to derive from 
the zero-delay ones, and they help expose problems with connection count or 
other forms of 'non-active' concurrency.  End-user-realistic delays are app 
specific, and useful in larger holistic load tests (say, through the 
application interface).  Generally, running them in this order helps 
because at each stage you are adding complexity.  Based on your 
explanations, you've probably done much of this already, and your approach 
sounds solid to me.
If the first case fails (zero delay, smaller user count), there is no way the 
others will pass.
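
To make the pacing knob concrete, a toy driver can look like the sketch 
below.  The DSN and query are placeholders rather than the actual test 
kit, and a real harness would use multiple processes or a dedicated tool 
(pgbench, tsung) since Python's GIL limits client-side parallelism:

# Minimal paced load driver: WORKERS simulated users, each running a
# trivial transaction and then sleeping THINK_TIME seconds.
# THINK_TIME = 0 is the zero-delay test; 0.1-1.0 is the small-latency one.
import threading, time
import psycopg2

THINK_TIME = 0.0      # the pacing knob discussed above
WORKERS = 64
DURATION = 60         # seconds
counts = [0] * WORKERS

def worker(i):
    conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
    cur = conn.cursor()
    deadline = time.time() + DURATION
    while time.time() < deadline:
        cur.execute("SELECT 1")       # stand-in for a real transaction mix
        cur.fetchall()
        counts[i] += 1
        if THINK_TIME:
            time.sleep(THINK_TIME)
    conn.close()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('%.0f TPM' % (sum(counts) * 60.0 / DURATION))

Varying only THINK_TIME between runs keeps everything else comparable, 
which is the point of running the pacing variants in order.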

I am trying another run where I limit the woken-up threads to a 
pre-configured number, to see how various numbers pan out in terms of 
throughput on this server.

Regards,
Jignesh


This would be good, as would waking only the shared-lock waiters, but 
refining the test somewhat to be maximally convincing would help.  The 
first thing to show is either a test with a very small or no sleep delay, 
or one with a connection pooler in between.  I prefer the former since it 
is the simplest.  It will be a test that is less entangled with the 
connection count, and it should peak a lot closer to the CPU core count and 
be more convincing to some.  I'm positive it won't change the basic trend 
(ramp up and plateau, with a higher plateau with the changed lock code), 
but others seem unconvinced and I'm a nobody anyway.
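
For anyone who hasn't followed the patch details, a toy model of the 
wake-up policies may help.  This is my reading of the thread, sketched 
from scratch rather than taken from PostgreSQL's lwlock.c: the stock code 
wakes the waiter at the head of the queue plus any consecutive shared 
waiters immediately behind it, while the variants being discussed wake 
more of the queue, e.g. every shared waiter regardless of interleaved 
exclusive ones:

# Toy model of two LWLock wake-up policies; NOT PostgreSQL source.

def wake_stock(queue):
    # queue: list of 'S' (shared) or 'X' (exclusive) waiters, FIFO order.
    if not queue:
        return [], queue
    if queue[0] == 'X':
        return [queue[0]], queue[1:]      # a lone exclusive waiter wakes
    woken = []
    while queue and queue[0] == 'S':      # the head run of shared waiters
        woken.append(queue.pop(0))
    return woken, queue

def wake_all_shared(queue):
    # Variant: wake every shared waiter, leave all exclusive ones queued.
    return [w for w in queue if w == 'S'], [w for w in queue if w == 'X']

q = list('SSXSSSXS')
print(wake_stock(list(q)))       # (['S', 'S'], ['X', 'S', 'S', 'S', 'X', 'S'])
print(wake_all_shared(list(q)))  # six shared wake; two exclusive keep waiting

Both the throughput gain and the fairness question it raises for the 
exclusive waiters fall straight out of this model.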


Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Robert Haas
 It's worth ruling out: even if the likelihood is small, the fix is easy.
 However, I don't see the drop in throughput from peak as more concurrency
 is added that is the hallmark of this problem, usually accompanied by a
 lot of context switching and a sudden increase in CPU use per transaction.

The problem is that the proposed fix bears a strong resemblance to
attempting to improve your gas mileage by removing a few non-critical
parts from your car, like, say, the bumpers, muffler, turn signals,
windshield wipers, and emergency brake.  While it's true that the car
might be drivable in that condition (as long as nothing unexpected
happens), you're going to have a hard time convincing the manufacturer
to offer that as an options package.

I think that changing the locking behavior is attacking the problem at
the wrong level anyway.  If someone wants to look at optimizing
PostgreSQL for very large numbers of concurrent connections without a
connection pooler, then, at least IMO, it would be more worthwhile to
study WHY there's so much locking contention and, on a lock-by-lock
basis, what can be done about it without harming performance under
more normal loads.  The fact that there IS locking contention is sort of
interesting, but it would be a lot more interesting to know why.
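
In that spirit, one cheap starting point is to sample what is actually 
blocked, at least at the heavyweight-lock level.  The big caveat: LWLocks 
such as ProcArrayLock, the suspect in this thread, never appear in 
pg_locks, so digging into those needs dtrace or oprofile instead.  A 
sketch, with a placeholder DSN:

# Take ten one-second samples of ungranted heavyweight lock requests.
# LWLocks (e.g. ProcArrayLock) are not visible here.
import time
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
cur = conn.cursor()
for _ in range(10):
    cur.execute("""
        SELECT locktype, mode, count(*)
        FROM pg_locks
        WHERE NOT granted
        GROUP BY locktype, mode
        ORDER BY count(*) DESC
    """)
    rows = cur.fetchall()
    print(rows if rows else 'no heavyweight lock waits')
    time.sleep(1)
conn.close()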

...Robert



Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Greg Smith

On Thu, 12 Mar 2009, Scott Carey wrote:

Furthermore, if the problem were due to too much concurrency in the 
database with active connections, it's hard to see how changing the lock 
code would change the result the way it did...


What I wonder about is whether the locking mechanism is accidentally turning 
into a CPU resource scheduling problem on this benchmark.  If the 
connections were pooled instead, control over that scheduling would be 
more explicit, because connections would map more directly onto physical 
CPUs.  What if the fall-off happens because the sum of the working code set 
here simply exceeds the sum of the CPU cache available once the 
number of active connections gets big enough?  The real problem could be 
that the connections waiting on ProcArray are just falling out of cache, 
such that when they do wake up they take a while to page back in and keep 
going.


I wouldn't actually bet anything on that theory, though, or on any of the 
others offered here.  I find wandering into performance bottleneck 
analysis presuming you already know what's going on to be dangerous.  The 
bigger issue is that Jignesh is using a configuration known to be 
problematic (lots of connections), which introduces some uncertainty 
about the true root cause.  Whether that concern is well founded or not, it 
still hurts his case.


And to step back for a second, after reading up on it again I see that 
Sun's internal iGen-OLTP benchmark stresses lock management and 
connectivity[1], which makes me wonder even more than I did before about 
how specific this fix is to this workload.


[1] http://blogs.sun.com/bmseer/entry/t2000_adds_database_leadership_to

First just run a test with a tiny delay (5 ms? 0?) and fewer users to 
compare.  If your theory is right that a connection pooler would help, that 
test will show higher throughput at a low user count and will not be 
lock-limited.


If the symptoms stay the same but are just scaled down to a much lower 
connection count, that might help rule out some types of context switching 
and caching problems from the list of most likely suspects.  Might as well 
make it 0 ms to minimize the number of connections.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

2009-03-12 Thread Greg Smith

On Thu, 12 Mar 2009, Jignesh K. Shah wrote:

That said, the test kit I am using is a lightweight, OLTP-ish 
workload which a user runs against a known schema; between the various 
transactions it performs, it emulates a wait time of 200 ms.


After re-reading all of this again at 
http://blogs.sun.com/jkshah/resource/pgcon_problems.pdf, I remembered I 
wanted more info on just what Sun's iGen OLTP does anyway.  Here's a 
collection of published comments on it that assembles into a reasonably 
detailed picture, as long as you're somewhat familiar with what TPC-C 
does:


http://blogs.sun.com/bmseer/entry/t2000_adds_database_leadership_to

The iGEN-OLTP 1.5 benchmark is a SUN internally developed transaction 
processing database workload. This workload simulates a light-weight 
Global Order System that stresses lock management and connectivity.


http://www.mysqlperformanceblog.com/2008/02/27/a-piece-of-sunmysql-marketing/#comment-246663

The iGen workload was created from actual customer workloads and has a 
lot more complexity than Sysbench which only test very simple operations 
one at a time. The iGen database consist of 6 tables and its executes a 
combination of light, medium and heavy transactions.


http://www.sun.com/third-party/global/oracle/collateral/T2000_Oracle_iGEN_05-12-06.pdf?null

The iGEN-OLTP benchmark is a stress and performance test, measuring the 
throughput and simultaneous user connections of an OLTP database workload. 
The iGEN-OLTP workload is based on customer applications and is 
constructed as a 2-tier orders database application where three 
transactions are executed:


 * light read-only query
 * medium read-only query
 * 'heavy' read and insert operation.

The transactions are comprised of various SQL statements: read-only 
selects, joins, update and insert operations.  iGen OLTP avoids problems 
that plague other OLTP benchmarks like TPC-C. TPC-C has problems with only 
using light-weight queries, allowing artificial data partitioning, and 
only testing a few database functions. The iGen transactions take almost 
twice the computation work compared to the TPC-C transactions.


http://blogs.sun.com/ritu/entry/mysql_benchmark_us_t2_beats

iGen OLTP avoids problems that plague other OLTP benchmarks like TPC-C. 
In particular, it is completely random in table row selections and thus is 
difficult to use artificial optimizations. iGen OLTP stresses process and 
thread creation, process scheduling, and database commit processing...The 
transactions are comprised of various SQL transactions: read-only selects, 
joins, inserts and update operations.
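
Pulling those descriptions together, the driver side of such a workload 
reduces to a weighted random pick among the three transaction types.  A 
sketch; the 45/35/20 weights and the transaction bodies are invented, 
since none of the published descriptions give the actual mix:

# Weighted pick among iGen-style transaction types; weights are made up.
import random

def light_read(cur):
    cur.execute("SELECT 1")    # stand-in: light read-only query

def medium_read(cur):
    cur.execute("SELECT 1")    # stand-in: medium read-only query with joins

def heavy_read_insert(cur):
    cur.execute("SELECT 1")    # stand-in: read plus insert operation

MIX = [(light_read, 45), (medium_read, 35), (heavy_read_insert, 20)]

def pick_transaction():
    r = random.uniform(0, sum(w for _, w in MIX))
    for fn, w in MIX:
        if r < w:
            return fn
        r -= w
    return MIX[-1][0]

print(pick_transaction().__name__)   # e.g. 'medium_read'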


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD
