Re: [HACKERS] testing ProcArrayLock patches

2011-11-22 Thread Pavan Deolasee
On Tue, Nov 22, 2011 at 4:40 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Pavan Deolasee pavan.deola...@gmail.com wrote:

 It will be a great help if you could spare a few minutes to also
 test the patch that moves the frequently accessed PGPROC members to
 a separate array.  We are seeing good improvements on the HP-UX IA
 platform and the AMD Opteron, and it will be interesting to know
 what happens on the Intel platform too.

 For a read only comparison (which was run using the simple
 protocol), using identical settings to the previous master run, but
 with the PGPROC split patch:

 m32 tps = 201738.209348 (including connections establishing)
 p32 tps = 201620.966988 (including connections establishing)

 m128 tps = 352159.631878 (including connections establishing)
 p128 tps = 363998.703900 (including connections establishing)

 Clearly a win at 128 clients; not at 32.

 For updates:

 sm32 tps = 27392.393850 (including connections establishing)
 sp32 tps = 27995.784333 (including connections establishing)

 sm128 tps = 22261.902571 (including connections establishing)
 sp128 tps = 23690.408272 (including connections establishing)

 pm32 tps = 34983.352396 (including connections establishing)
 pp32 tps = 36076.373389 (including connections establishing)

 pm128 tps = 24164.441954 (including connections establishing)
 pp128 tps = 27070.824588 (including connections establishing)

 That's a pretty decisive win all around.


Thanks for running those tests. The numbers are not that bad, but
definitely not as good as we saw on some other platforms. It's
possible, though, that they will improve in percentage terms with
even more clients on this box. And given that we are seeing big
gains on other platforms, hopefully this will give us confidence to
proceed with the patch.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] testing ProcArrayLock patches

2011-11-22 Thread Kevin Grittner
Pavan Deolasee pavan.deola...@gmail.com wrote:
 
 The numbers are not that bad, but definitely not as good as we saw
 on some other platforms.
 
Well, this machine is definitely designed to hold up under high
concurrency.  As I understand it, each core is the memory manager
for two 4GB DIMMs, with two channels to them, each with two buffers.
The way the cores are connected, a core never needs to go through
more than one other core to reach memory it doesn't directly manage,
and the interconnect uses snoop technology that hands the cached
data straight from one core to the other when possible, rather than
making the core which now owns the cache line pull it from RAM.  It
seems the 2.6.32 kernel is able to manage that technology in a
reasonable fashion.
 
At first I was surprised to see performance top out on the update
tests between 80 and 96 clients.  But then, that lands almost
exactly where my old reliable ((2 * core count) + effective spindle
count) would predict.  The SELECT only tests peaked at 64 clients,
but those were fully cached, so effective spindle count was zero,
again fitting the formula.  So these optimizations seem to me to
break down the barriers which had previously capped the number of
clients which could be handled, letting them peak at their natural
levels.
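 
Kevin's rule of thumb above is easy to sanity-check with a little
arithmetic.  A quick sketch (the 24-spindle figure is an assumption
for illustration only; the thread never states the size of the disk
array):

```shell
# Rule of thumb from above: clients at peak ~= (2 * core count) + effective spindles.
cores=32
spindles=24   # assumed for illustration -- the thread doesn't give the spindle count
echo $(( 2 * cores + spindles ))   # prints 88, between the observed peak at 80-96 clients
```

For the fully cached SELECT-only runs the effective spindle count is
zero, which gives 64 -- matching the observed peak there as well.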
 
 But it's possible that they will improve in percentage terms with
 even more clients on this box.
 
I think so; I think this box is just so scalable that at 128
clients we were just barely getting past the knee in the
performance graphs to where these patches help most.
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-21 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 I can run one more set of tests tonight before I have to give it
 back to the guy who's putting it into production.  It sounds like
 a set like the above except with synchronous_commit = off might be
 desirable?
 
OK, that's what I did.  This gave me my best numbers yet for an
updating run of pgbench: tps = 38039.724212 for prepared statements
using the flexlock patch.  This patch is a clear win when you get to
16 clients or more.
 
sm1 tps = 1312.501168 (including connections establishing)
sf1 tps = 1376.678293 (including connections establishing)
sm2 tps = 2705.571856 (including connections establishing)
sf2 tps = 2689.577938 (including connections establishing)
sm4 tps = 5461.403557 (including connections establishing)
sf4 tps = 5447.363103 (including connections establishing)
sm8 tps = 10524.695338 (including connections establishing)
sf8 tps = 10448.012069 (including connections establishing)
sm16 tps = 18952.968472 (including connections establishing)
sf16 tps = 18969.505631 (including connections establishing)
sm32 tps = 27392.393850 (including connections establishing)
sf32 tps = 29225.974112 (including connections establishing)
sm64 tps = 28947.675549 (including connections establishing)
sf64 tps = 31417.536816 (including connections establishing)
sm80 tps = 28053.684182 (including connections establishing)
sf80 tps = 29970.555401 (including connections establishing)
sm96 tps = 25885.679957 (including connections establishing)
sf96 tps = 28581.271436 (including connections establishing)
sm128 tps = 22261.902571 (including connections establishing)
sf128 tps = 24537.566960 (including connections establishing)
 
pm1 tps = 2082.958841 (including connections establishing)
pf1 tps = 2052.328339 (including connections establishing)
pm2 tps = 4287.257860 (including connections establishing)
pf2 tps = 4228.770795 (including connections establishing)
pm4 tps = 8653.196863 (including connections establishing)
pf4 tps = 8592.091631 (including connections establishing)
pm8 tps = 16071.432101 (including connections establishing)
pf8 tps = 16196.992207 (including connections establishing)
pm16 tps = 27146.441216 (including connections establishing)
pf16 tps = 27441.966562 (including connections establishing)
pm32 tps = 34983.352396 (including connections establishing)
pf32 tps = 38039.724212 (including connections establishing)
pm64 tps = 33182.643501 (including connections establishing)
pf64 tps = 34193.732669 (including connections establishing)
pm80 tps = 30686.712607 (including connections establishing)
pf80 tps = 6.011769 (including connections establishing)
pm96 tps = 24692.015615 (including connections establishing)
pf96 tps = 32907.472665 (including connections establishing)
pm128 tps = 24164.441954 (including connections establishing)
pf128 tps = 25742.670928 (including connections establishing)
 
At lower client numbers the tps values within each set of five
samples were very tightly grouped.  With either protocol, and
whether or not the patch was applied, the higher concurrency groups
tended to be bifurcated within a set of five samples between good
and bad numbers.  The patch seemed to increase the number of
clients which could be handled without collapse into the bad
numbers.  It really looks like there's some sort of performance
collapse at higher concurrency which may or may not happen in any
particular five minute run.  Just as one example, running the simple
protocol with the flexlock patch:
 
tps = 24491.653873 (including connections establishing)
tps = 24537.566960 (including connections establishing)
tps = 28462.276323 (including connections establishing)
tps = 24403.373002 (including connections establishing)
tps = 28458.902549 (including connections establishing)
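 
The reported figures are the median of each set of five; for the five
sf128 samples just listed, that works out as follows (pure arithmetic
on the numbers quoted above, nothing assumed):

```shell
# Median of the five sf128 samples: sort numerically and take the middle (third) value.
printf '%s\n' 24491.653873 24537.566960 28462.276323 24403.373002 28458.902549 \
  | sort -n | sed -n '3p'
# prints 24537.566960 -- the sf128 figure reported in the table above
```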

-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-21 Thread Pavan Deolasee
On Mon, Nov 21, 2011 at 10:44 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 I can run one more set of tests tonight before I have to give it
 back to the guy who's putting it into production.  It sounds like
 a set like the above except with synchronous_commit = off might be
 desirable?

 OK, that's what I did.  This gave me my best numbers yet for an
 updating run of pgbench: tps = 38039.724212 for prepared statements
 using the flexlock patch.  This patch is a clear win when you get to
 16 clients or more.


It will be a great help if you could spare a few minutes to also test
the patch that moves the frequently accessed PGPROC members to a
separate array.  We are seeing good improvements on the HP-UX IA
platform and the AMD Opteron, and it will be interesting to know what
happens on the Intel platform too.

http://archives.postgresql.org/message-id/4eb7c4c9.9070...@enterprisedb.com

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com



Re: [HACKERS] testing ProcArrayLock patches

2011-11-21 Thread Kevin Grittner
Pavan Deolasee pavan.deola...@gmail.com wrote:
 
 It will be a great help if you could spare a few minutes to also
 test the patch that moves the frequently accessed PGPROC members to
 a separate array.  We are seeing good improvements on the HP-UX IA
 platform and the AMD Opteron, and it will be interesting to know
 what happens on the Intel platform too.
 

http://archives.postgresql.org/message-id/4eb7c4c9.9070...@enterprisedb.com
 
It's going to be hard to arrange more of the 20-hour runs I've been
doing, but I can work in some more abbreviated tests.  What would be
the best test for this?  (I would hate to run a test and find out I
didn't exercise the right code path.)
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-21 Thread Pavan Deolasee
On Mon, Nov 21, 2011 at 11:01 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Pavan Deolasee pavan.deola...@gmail.com wrote:

 It will be a great help if you could spare a few minutes to also
 test the patch that moves the frequently accessed PGPROC members to
 a separate array.  We are seeing good improvements on the HP-UX IA
 platform and the AMD Opteron, and it will be interesting to know
 what happens on the Intel platform too.


 http://archives.postgresql.org/message-id/4eb7c4c9.9070...@enterprisedb.com

 It's going to be hard to arrange more of the 20-hour runs I've been
 doing, but I can work in some more abbreviated tests.  What would be
 the best test for this?  (I would hate to run a test and find out I
 didn't exercise the right code path.)


I think 2-3 runs each at 32 and 128 clients with prepared statements
should suffice for a quick comparison with the other numbers you
posted for the master branch.
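
A sketch of that abbreviated test matrix as a script, shown as a dry
run that only echoes the commands (drop the echo to execute; the
-T 300 duration and the database name are assumptions, not specified
in the thread):

```shell
# Dry run of the abbreviated comparison: 3 runs each at 32 and 128 clients,
# prepared statements, against both the master and the PGPROC-split builds.
for build in master pgproc-split; do
  for clients in 32 128; do
    for run in 1 2 3; do
      echo "[$build run $run] pgbench -M prepared -T 300 -c $clients -j $clients pgbench"
    done
  done
done
```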

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com



Re: [HACKERS] testing ProcArrayLock patches

2011-11-21 Thread Kevin Grittner
Pavan Deolasee pavan.deola...@gmail.com wrote:
 
 It will be a great help if you could spare a few minutes to also
 test the patch that moves the frequently accessed PGPROC members to
 a separate array.  We are seeing good improvements on the HP-UX IA
 platform and the AMD Opteron, and it will be interesting to know
 what happens on the Intel platform too.
 
For a read only comparison (which was run using the simple
protocol), using identical settings to the previous master run, but
with the PGPROC split patch:
 
m32 tps = 201738.209348 (including connections establishing)
p32 tps = 201620.966988 (including connections establishing)

m128 tps = 352159.631878 (including connections establishing)
p128 tps = 363998.703900 (including connections establishing)
 
Clearly a win at 128 clients; not at 32.
 
For updates:
 
sm32 tps = 27392.393850 (including connections establishing)
sp32 tps = 27995.784333 (including connections establishing) 
 
sm128 tps = 22261.902571 (including connections establishing)
sp128 tps = 23690.408272 (including connections establishing)
 
pm32 tps = 34983.352396 (including connections establishing)
pp32 tps = 36076.373389 (including connections establishing)
 
pm128 tps = 24164.441954 (including connections establishing)
pp128 tps = 27070.824588 (including connections establishing)
 
That's a pretty decisive win all around.
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-20 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Hmm.  There's obviously something that's different in your
 environment or configuration from what I tested, but I don't know
 what it is. The fact that your scale factor is larger than
 shared_buffers might matter; or Intel vs. AMD.  Or maybe you're
 running with synchronous_commit=on?
 
Yes, I had synchronous_commit = on for these runs.  Here are the
settings:
 
cat > $PGDATA/postgresql.conf <<EOM;
max_connections = 200
max_pred_locks_per_transaction = 256
shared_buffers = 10GB
maintenance_work_mem = 1GB
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms
seq_page_cost = 0.1
random_page_cost = 0.1
cpu_tuple_cost = 0.05
effective_cache_size = 40GB
default_transaction_isolation = '$iso'
EOM
 
Is there any chance that having pg_xlog on a separate RAID 10 set of
drives with its own BBU controller would explain anything?  I mean,
I always knew that was a good idea for a big, heavily-loaded box,
but I remember being surprised at how *big* a difference it made
when a box accidentally went into production without moving the
pg_xlog directory there.

There is one other thing which might matter: I didn't use the -n
pgbench option, and in the sample you showed, you were using it.
 
Here is the median of five from the latest runs.  On these
read/write tests there was very little spread within each set of
five samples, with no extreme outliers like I had on the SELECT-only
tests.  In the first position s means simple protocol and p means
prepared protocol.  In the second position m means master, f means
with the flexlock patch.
 
sm1 tps = 1092.269228 (including connections establishing)
sf1 tps = 1090.511552 (including connections establishing)
sm2 tps = 2171.867100 (including connections establishing)
sf2 tps = 2158.609189 (including connections establishing)
sm4 tps = 4278.541453 (including connections establishing)
sf4 tps = 4269.921594 (including connections establishing)
sm8 tps = 8472.257182 (including connections establishing)
sf8 tps = 8476.150588 (including connections establishing)
sm16 tps = 15905.074160 (including connections establishing)
sf16 tps = 15937.372689 (including connections establishing)
sm32 tps = 22331.817413 (including connections establishing)
sf32 tps = 22861.258757 (including connections establishing)
sm64 tps = 26388.391614 (including connections establishing)
sf64 tps = 26529.152361 (including connections establishing)
sm80 tps = 25617.651194 (including connections establishing)
sf80 tps = 26560.541237 (including connections establishing)
sm96 tps = 24105.455175 (including connections establishing)
sf96 tps = 26569.244384 (including connections establishing)
sm128 tps = 21467.530210 (including connections establishing)
sf128 tps = 25883.023093 (including connections establishing)
 
pm1 tps = 1629.265970 (including connections establishing)
pf1 tps = 1619.024905 (including connections establishing)
pm2 tps = 3164.061963 (including connections establishing)
pf2 tps = 3137.469377 (including connections establishing)
pm4 tps = 6114.787505 (including connections establishing)
pf4 tps = 6061.750200 (including connections establishing)
pm8 tps = 11884.534375 (including connections establishing)
pf8 tps = 11870.670086 (including connections establishing)
pm16 tps = 20575.737107 (including connections establishing)
pf16 tps = 20437.648809 (including connections establishing)
pm32 tps = 27664.381103 (including connections establishing)
pf32 tps = 28046.846479 (including connections establishing)
pm64 tps = 26764.294547 (including connections establishing)
pf64 tps = 26631.589294 (including connections establishing)
pm80 tps = 27716.198263 (including connections establishing)
pf80 tps = 28393.642871 (including connections establishing)
pm96 tps = 26616.076293 (including connections establishing)
pf96 tps = 28055.921427 (including connections establishing)
pm128 tps = 23282.912620 (including connections establishing)
pf128 tps = 23072.766829 (including connections establishing)
 
Note that on this 32-core box, performance on the read/write pgbench
peaks at 64 clients, but without a lot of variance between 32 and 96
clients.  And with the patch, performance still hasn't fallen off
too badly at 128 clients.  This is good news in terms of not having
to sweat connection pool sizing quite as much as in earlier
releases.
 
Next I will get the profile for the SELECT-only runs.  It seems to
make sense to profile at the peak performance level, which was 64
clients.
 
I can run one more set of tests tonight before I have to give it
back to the guy who's putting it into production.  It sounds like a
set like the above except with synchronous_commit = off might be
desirable?
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-20 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 I was actually thinking it would be interesting to oprofile the
 read-only test; see if we can figure out where those slowdowns are
 coming from.
 
CPU: Intel Core/i7, speed 2262 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with
a unit mask of 0x00 (No unit mask) count 10
samples   %        image name  symbol name
3124242   5.7137   postgres    s_lock
254       4.6737   postgres    AllocSetAlloc
2403412   4.3954   postgres    GetSnapshotData
1967132   3.5975   postgres    SearchCatCache
1872176   3.4239   postgres    base_yyparse
1327256   2.4273   postgres    hash_search_with_hash_value
1040131   1.9022   postgres    _bt_compare
1038976   1.9001   postgres    LWLockAcquire
817122    1.4944   postgres    MemoryContextAllocZeroAligned
738321    1.3503   postgres    core_yylex
622613    1.1386   postgres    MemoryContextAlloc
597054    1.0919   postgres    PinBuffer
556138    1.0171   postgres    ScanKeywordLookup
552318    1.0101   postgres    expression_tree_walker
494279    0.9039   postgres    LWLockRelease
488628    0.8936   postgres    hash_any
472906    0.8649   postgres    nocachegetattr
396482    0.7251   postgres    grouping_planner
382974    0.7004   postgres    LockAcquireExtended
375186    0.6861   postgres    AllocSetFree
375072    0.6859   postgres    ProcArrayLockRelease
373668    0.6834   postgres    new_list
365917    0.6692   postgres    fmgr_info_cxt_security
301398    0.5512   postgres    ProcArrayLockAcquire
300647    0.5498   postgres    LockReleaseAll
292073    0.5341   postgres    DirectFunctionCall1Coll
285745    0.5226   postgres    MemoryContextAllocZero
284684    0.5206   postgres    FunctionCall2Coll
282701    0.5170   postgres    SearchSysCache
 
max_connections = 100
max_pred_locks_per_transaction = 64
shared_buffers = 8GB
maintenance_work_mem = 1GB
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms
seq_page_cost = 0.1
random_page_cost = 0.1
cpu_tuple_cost = 0.05
effective_cache_size = 40GB
default_transaction_isolation = '$iso'
 
pgbench -i -s 100
pgbench -S -M simple -T 300 -c 80 -j 80
 
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 80
number of threads: 80
duration: 300 s
number of transactions actually processed: 104391011
tps = 347964.636256 (including connections establishing)
tps = 347976.389034 (excluding connections establishing)
 
vmstat 1 showed different numbers this time -- no clue why.
 
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free      buff     cache   si   so    bi    bo     in      cs us sy id wa st
91  0   8196 4189436 203925700 52314492    0    0     0     0  32255 1522807 85 13  1  0  0
92  0   8196 4189404 203925700 52314492    0    0     0     0  32796 1525463 85 14  1  0  0
67  0   8196 4189404 203925700 52314488    0    0     0     0  32343 1527988 85 13  1  0  0
93  0   8196 4189404 203925700 52314488    0    0     0     0  32701 1535827 85 13  1  0  0
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Nate Boley's AMD 6128 box (which has 32 cores) and an HP Integrity
 server (also with 32 cores). 
 
 [clear improvement with flexlock patch]
 
Hmm.  We have a 32-core Intel box (4 x X7560 @ 2.27GHz) with 256 GB
RAM.  It's about a week from going into production, at which point
it will be extremely hard to schedule such tests, but for a few days
more I've got shots at it.  The flexlock patch doesn't appear to be
such a clear win here.
 
I started from Robert's tests, but used these settings so that I
could go to higher client counts and better test serializable
transactions.  Everything is fully cached.
 
max_connections = 200
max_pred_locks_per_transaction = 256
shared_buffers = 8GB
maintenance_work_mem = 1GB
checkpoint_segments = 30
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
seq_page_cost = 0.1
random_page_cost = 0.1
cpu_tuple_cost = 0.05
effective_cache_size = 40GB
default_transaction_isolation = '$iso'
 
Serializable results not shown here -- that's to gather information
for trying to improve SSI locking.
 
m1 tps = 7847.834544 (including connections establishing)
f1 tps = 7917.225382 (including connections establishing)
m2 tps = 18672.145526 (including connections establishing)
f2 tps = 17486.435322 (including connections establishing)
m4 tps = 34371.278253 (including connections establishing)
f4 tps = 34465.898173 (including connections establishing)
m8 tps = 68228.261694 (including connections establishing)
f8 tps = 68505.285830 (including connections establishing)
m16 tps = 127449.815100 (including connections establishing)
f16 tps = 127208.939670 (including connections establishing)
m32 tps = 201738.209348 (including connections establishing)
f32 tps = 201637.237903 (including connections establishing)
m64 tps = 380326.800557 (including connections establishing)
f64 tps = 380628.429408 (including connections establishing)
m80 tps = 366628.197546 (including connections establishing)
f80 tps = 162594.012051 (including connections establishing)
m96 tps = 360922.948775 (including connections establishing)
f96 tps = 366728.987041 (including connections establishing)
m128 tps = 352159.631878 (including connections establishing)
f128 tps = 355475.129448 (including connections establishing)
 
I did five runs each and took the median.  In most cases, the values
were pretty close to one another in a group, so confidence is pretty
high that this is meaningful.  There were a few anomalies where
performance for one or more samples was horrid.  This seems
consistent with the theory of pathological pileups on the LW locks
(or also flexlocks?).
 
The problem groups:
 
m64 tps = 380407.768906 (including connections establishing)
m64 tps = 79197.470389 (including connections establishing)
m64 tps = 381112.194105 (including connections establishing)
m64 tps = 378579.036542 (including connections establishing)
m64 tps = 380326.800557 (including connections establishing)

m96 tps = 360582.945291 (including connections establishing)
m96 tps = 363021.805138 (including connections establishing)
m96 tps = 362468.870516 (including connections establishing)
m96 tps = 59614.322351 (including connections establishing)
m96 tps = 360922.948775 (including connections establishing)

f80 tps = 158905.149822 (including connections establishing)
f80 tps = 157192.460599 (including connections establishing)
f80 tps = 370757.790443 (including connections establishing)
f80 tps = 162594.012051 (including connections establishing)
f80 tps = 372170.638516 (including connections establishing)

f96 tps = 366804.733788 (including connections establishing)
f96 tps = 366728.987041 (including connections establishing)
f96 tps = 365490.380848 (including connections establishing)
f96 tps = 366770.193305 (including connections establishing)
f96 tps = 125225.371140 (including connections establishing)
 
So the lows don't seem to be as low when they happen with the
flexlock patch, but they still happen -- possibly more often?
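 
One mechanical way to spot those pathological samples is to flag any
run that falls below half the group median.  A sketch over the m64
numbers quoted above (the 50% threshold is an arbitrary choice for
illustration, not something from the thread):

```shell
# Flag "collapsed" runs: samples below half the group median (m64 set above).
printf '%s\n' 380407.768906 79197.470389 381112.194105 378579.036542 380326.800557 \
  | sort -n \
  | awk '{ v[NR] = $0 } END {
      med = v[int(NR/2) + 1] + 0           # middle of the sorted samples
      for (i = 1; i <= NR; i++)
        if (v[i] + 0 < med / 2) print v[i], "collapsed"
    }'
# prints: 79197.470389 collapsed
```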
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Robert Haas
On Fri, Nov 18, 2011 at 11:26 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:
 Nate Boley's AMD 6128 box (which has 32 cores) and an HP Integrity
 server (also with 32 cores).

 [clear improvement with flexlock patch]

 Hmm.  We have a 32-core Intel box (4 x X7560 @ 2.27GHz) with 256 GB
 RAM.  It's about a week from going into production, at which point
 it will be extremely hard to schedule such tests, but for a few days
 more I've got shots at it.  The flexlock patch doesn't appear to be
 such a clear win here.

 I started from Robert's tests, but used these settings so that I
 could go to higher client counts and better test serializable
 transactions.  Everything is fully cached.

 max_connections = 200
 max_pred_locks_per_transaction = 256
 shared_buffers = 8GB
 maintenance_work_mem = 1GB
 checkpoint_segments = 30
 checkpoint_timeout = 15min
 checkpoint_completion_target = 0.9
 seq_page_cost = 0.1
 random_page_cost = 0.1
 cpu_tuple_cost = 0.05
 effective_cache_size = 40GB
 default_transaction_isolation = '$iso'

I had a dismaying benchmarking experience recently that involved
settings very similar to the ones you've got there - in particular, I
also had checkpoint_segments set to 30.  When I raised it to 300,
performance improved dramatically at 8 clients and above.

Then again, is this a regular pgbench test or is this SELECT-only?
Because the absolute numbers you're posting are vastly higher than
anything I've ever seen on a write test.

Can you by any chance check top or vmstat during the 32-client test
and see what percentage you have of user time/system time/idle time?

What OS are you running?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Then again, is this a regular pgbench test or is this SELECT-only?
 
SELECT-only
 
 Can you by any chance check top or vmstat during the 32-client
 test and see what percentage you have of user time/system
 time/idle time?
 
You didn't say whether you wanted master or flexlock, but it turned
out that any difference was way too far into the noise to show. 
They both looked like this:
 
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free      buff     cache   si   so    bi    bo     in      cs us sy id wa st
38  0    352 1157400 207177020 52360472    0    0     0    16  13345 1190230 40  7 53  0  0
37  0    352 1157480 207177020 52360472    0    0     0     0  12953 1263310 40  8 52  0  0
36  0    352 1157484 207177020 52360472    0    0     0     0  13411 1233365 38  7 54  0  0
37  0    352 1157476 207177020 52360472    0    0     0     0  12780 1193575 41  7 51  0  0
 
Keep in mind that while there are really 32 cores, the cpu
percentages seem to be based on the threads from hyperthreading. 
Top showed pgbench (running on the same machine) as eating a pretty
steady 5.2 of the cores, leaving 26.8 cores to actually drive the 32
postgres processes.
 
 What OS are you running?
 
Linux new-CIR 2.6.32.43-0.4-default #1 SMP 2011-07-14 14:47:44 +0200
x86_64 x86_64 x86_64 GNU/Linux
 
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 1
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 We have a 32-core Intel box (4 x X7560 @ 2.27GHz) with 256 GB
 RAM.
 
In case anyone cares, this is the same box for which I posted STREAM
test results a while back.  The PostgreSQL tests seem to peak on
this 32-core box at 64 clients, while the STREAM test of raw RAM
speed kept increasing up to 128 clients.  Overall, though, it's
impressive how close PostgreSQL is now coming to the raw RAM access
speed curve.
 
http://archives.postgresql.org/pgsql-hackers/2011-08/msg01306.php
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Robert Haas
On Fri, Nov 18, 2011 at 12:03 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Then again, is this a regular pgbench test or is this SELECT-only?

 SELECT-only

Ah, OK.  I would not expect flexlocks to help with that; Pavan's patch
might, though.

 Can you by any chance check top or vmstat during the 32-client
 test and see what percentage you have of user time/system
 time/idle time?

 You didn't say whether you wanted master or flexlock, but it turned
 out that any difference was way too far into the noise to show.
 They both looked like this:

 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
  r  b   swpd    free      buff     cache   si   so    bi    bo     in      cs us sy id wa st
 38  0    352 1157400 207177020 52360472    0    0     0    16  13345 1190230 40  7 53  0  0
 37  0    352 1157480 207177020 52360472    0    0     0     0  12953 1263310 40  8 52  0  0
 36  0    352 1157484 207177020 52360472    0    0     0     0  13411 1233365 38  7 54  0  0
 37  0    352 1157476 207177020 52360472    0    0     0     0  12780 1193575 41  7 51  0  0

 Keep in mind that while there are really 32 cores, the cpu
 percentages seem to be based on the threads from hyperthreading.
 Top showed pgbench (running on the same machine) as eating a pretty
 steady 5.2 of the cores, leaving 26.8 cores to actually drive the 32
 postgres processes.

It doesn't make any sense for PostgreSQL master to be using only 50%
of the CPU and leaving the rest idle on a lots-of-clients SELECT-only
test.  That could easily happen on 9.1, but my lock manager changes
eliminated the only place where anything gets put to sleep in that
path (except for the emergency sleeps done by s_lock, when a spinlock
is really badly contended).  So I'm confused by these results.  Are we
sure that the processes are being scheduled across all 32 physical
cores?

At any rate, I do think it's likely that you're being bitten by
spinlock contention, but we'd need to do some legwork to verify that
and work out the details.  Any chance you can run oprofile (on either
branch, don't really care) against the 32 client test and post the
results?  If it turns out s_lock is at the top of the heap, I can put
together a patch to help figure out which spinlock is the culprit.

Anyway, this is probably a digression as it relates to FlexLocks:
those are not optimizing for a read-only workload.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 Then again, is this a regular pgbench test or is this
 SELECT-only?

 SELECT-only
 
 Ah, OK.  I would not expect flexlocks to help with that; Pavan's
 patch might, though.
 
OK.  Sorry for misunderstanding that.  I haven't gotten around to a
deep reading of the patch yet.  :-(  I based this on the test script
you posted here (with slight modifications for my preferred
directory structures):
 
http://archives.postgresql.org/pgsql-hackers/2011-10/msg00605.php
 
If I just drop the -S switch will I have a good test, or are there
other adjustments I should make (besides increasing checkpoint
segments)?  (Well, for the SELECT-only test I didn't bother putting
 pg_xlog on a separate RAID 10 on its own BBU controller as we
 normally would for this machine; I'll cover that, too.)
 
 It doesn't make any sense for PostgreSQL master to be using only
 50% of the CPU and leaving the rest idle on a lots-of-clients
 SELECT-only test.  That could easily happen on 9.1, but my lock
 manager changes eliminated the only place where anything gets put
 to sleep in that path (except for the emergency sleeps done by
 s_lock, when a spinlock is really badly contended).  So I'm
 confused by these results. Are we sure that the processes are
 being scheduled across all 32 physical cores?
 
I think so.  My take was that it was showing 32 of 64 *threads*
active -- the hyperthreading funkiness.  Is there something in
particular you'd like me to check?
 
 At any rate, I do think it's likely that you're being bitten by
 spinlock contention, but we'd need to do some legwork to verify
 that and work out the details.  Any chance you can run oprofile
 (on either branch, don't really care) against the 32 client test
 and post the results?  If it turns out s_lock is at the top of the
 heap, I can put together a patch to help figure out which spinlock
 is the culprit.
 
oprofile isn't installed on this machine.  I'll take care of that
and post results when I can.
 
-Kevin




Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Robert Haas
On Fri, Nov 18, 2011 at 12:45 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 OK.  Sorry for misunderstanding that.  I haven't gotten around to a
 deep reading of the patch yet.  :-(  I based this on the test script
 you posted here (with slight modifications for my preferred
 directory structures):

 http://archives.postgresql.org/pgsql-hackers/2011-10/msg00605.php

 If I just drop the -S switch will I have a good test, or are there
 other adjustments I should make (besides increasing checkpoint
 segments)?  (Well, for the SELECT-only test I didn't bother putting
 pg_xlog on a separate RAID 10 on its own BBU controller as we
 normally would for this machine; I'll cover that, too.)

Yeah, I'd just drop -S.  Make sure to use -c N -j N with pgbench, or
you'll probably not be able to saturate it.  I've also had good luck
with wal_writer_delay=20ms, although if you have synchronous_commit=on
that might not matter, and it's much less important since Simon's
recent patch in that area went in.

What scale factor are you testing at?
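Putting the advice above together, the write-test invocation would look something like this (a sketch; the install prefix, duration, and client count are illustrative placeholders, not values anyone reported):

```shell
# Sketch only: a pgbench write test assembled from the advice above.
# PREFIX, DURATION, CLIENTS and SCALE are illustrative placeholders.
PREFIX=/usr/local/pgsql
DURATION=300
CLIENTS=32
SCALE=100

INIT_CMD="$PREFIX/bin/pgbench -i -s $SCALE"
RUN_CMD="$PREFIX/bin/pgbench -T $DURATION -c $CLIENTS -j $CLIENTS"

# On a real system these would be executed; here we just show them.
echo "$INIT_CMD"
echo "$RUN_CMD"
```

Matching -j to -c keeps pgbench's own client threads from becoming the bottleneck.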

 It doesn't make any sense for PostgreSQL master to be using only
 50% of the CPU and leaving the rest idle on a lots-of-clients
 SELECT-only test.  That could easily happen on 9.1, but my lock
 manager changes eliminated the only place where anything gets put
 to sleep in that path (except for the emergency sleeps done by
 s_lock, when a spinlock is really badly contended).  So I'm
 confused by these results. Are we sure that the processes are
 being scheduled across all 32 physical cores?

 I think so.  My take was that it was showing 32 of 64 *threads*
 active -- the hyperthreading funkiness.  Is there something in
 particular you'd like me to check?

Not really, just don't understand the number.

 At any rate, I do think it's likely that you're being bitten by
 spinlock contention, but we'd need to do some legwork to verify
 that and work out the details.  Any chance you can run oprofile
 (on either branch, don't really care) against the 32 client test
 and post the results?  If it turns out s_lock is at the top of the
 heap, I can put together a patch to help figure out which spinlock
 is the culprit.

 oprofile isn't installed on this machine.  I'll take care of that
 and post results when I can.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Yeah, I'd just drop -S.
 
Easily done.
 
 Make sure to use -c N -j N with pgbench, or you'll probably not be
 able to saturate it.
 
Yeah, that's part of the script I copied from you.
 
 I've also had good luck with wal_writer_delay=20ms, although if
 you have synchronous_commit=on that might not matter, and it's
 much less important since Simon's recent patch in that area went
 in.
 
What the heck; will do.
 
 What scale factor are you testing at?
 
100.  Perhaps I should boost that since I'm going as far as 128
clients?
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Any chance you can run oprofile (on either branch, don't really
 care) against the 32 client test and post the results?
 
Besides the other changes we discussed, I boosted scale to 150 and
ran at READ COMMITTED isolation level (because all threads promptly
crashed and burned at REPEATABLE READ -- we desperately need a
pgbench option to retry a transaction on serialization failure). 
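(That missing retry behavior can be sketched as a loop that reruns a transaction after a serialization error, SQLSTATE 40001. run_xact below is a hypothetical stand-in for a single pgbench transaction and simply fails once:)

```shell
# Sketch of the retry loop pgbench lacks: rerun a transaction when it fails
# with a serialization error (SQLSTATE 40001).  run_xact is a hypothetical
# stand-in for one pgbench transaction; here it simply fails on the first try.
attempts=0
run_xact() {
    attempts=$((attempts + 1))
    if [ "$attempts" -lt 2 ]; then
        return 1    # simulate ERROR: could not serialize access
    fi
    return 0
}

until run_xact; do
    echo "serialization failure, retrying (attempt $attempts)"
done
echo "committed after $attempts attempts"
```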
The oprofile hot spots at half a percent or higher:
 
CPU: Intel Core/i7, speed 2262 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with
a unit mask of 0x00 (No unit mask) count 10
samples  %        image name  symbol name
933394    4.9651  postgres    AllocSetAlloc
848476    4.5134  postgres    base_yyparse
719515    3.8274  postgres    SearchCatCache
461275    2.4537  postgres    hash_search_with_hash_value
426411    2.2682  postgres    GetSnapshotData
322938    1.7178  postgres    LWLockAcquire
322236    1.7141  postgres    core_yylex
305471    1.6249  postgres    MemoryContextAllocZeroAligned
281543    1.4976  postgres    expression_tree_walker
270241    1.4375  postgres    XLogInsert
234899    1.2495  postgres    MemoryContextAlloc
210137    1.1178  postgres    ScanKeywordLookup
184857    0.9833  postgres    heap_page_prune
173608    0.9235  postgres    hash_any
153011    0.8139  postgres    _bt_compare
144538    0.7689  postgres    nocachegetattr
131466    0.6993  postgres    fmgr_info_cxt_security
131001    0.6968  postgres    grouping_planner
130808    0.6958  postgres    LWLockRelease
124112    0.6602  postgres    PinBuffer
120745    0.6423  postgres    LockAcquireExtended
112992    0.6010  postgres    ExecInitExpr
112830    0.6002  postgres    lappend
112311    0.5974  postgres    new_list
110368    0.5871  postgres    check_stack_depth
106036    0.5640  postgres    AllocSetFree
102565    0.5456  postgres    MemoryContextAllocZero
94689     0.5037  postgres    SearchSysCache
 
Do you want line numbers or lower percentages?
 
Two runs:
 
tps = 21946.961196 (including connections establishing)
tps = 22911.873227 (including connections establishing)
 
For write transactions, that seems pretty respectable.
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread anara...@anarazel.de


Kevin Grittner kevin.gritt...@wicourts.gov schrieb:

Robert Haas robertmh...@gmail.com wrote:
 
 Any chance you can run oprofile (on either branch, don't really
 care) against the 32 client test and post the results?
 
Besides the other changes we discussed, I boosted scale to 150 and
ran at READ COMMITTED isolation level (because all threads promptly
crashed and burned at REPEATABLE READ -- we desperately need a
pgbench option to retry a transaction on serialization failure). 
The oprofile hot spots at half a percent or higher:
 
CPU: Intel Core/i7, speed 2262 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with
a unit mask of 0x00 (No unit mask) count 10
samples  %        image name  symbol name
933394    4.9651  postgres    AllocSetAlloc
848476    4.5134  postgres    base_yyparse
719515    3.8274  postgres    SearchCatCache
461275    2.4537  postgres    hash_search_with_hash_value
426411    2.2682  postgres    GetSnapshotData
322938    1.7178  postgres    LWLockAcquire
322236    1.7141  postgres    core_yylex
305471    1.6249  postgres    MemoryContextAllocZeroAligned
281543    1.4976  postgres    expression_tree_walker
270241    1.4375  postgres    XLogInsert
234899    1.2495  postgres    MemoryContextAlloc
210137    1.1178  postgres    ScanKeywordLookup
184857    0.9833  postgres    heap_page_prune
173608    0.9235  postgres    hash_any
153011    0.8139  postgres    _bt_compare
144538    0.7689  postgres    nocachegetattr
131466    0.6993  postgres    fmgr_info_cxt_security
131001    0.6968  postgres    grouping_planner
130808    0.6958  postgres    LWLockRelease
124112    0.6602  postgres    PinBuffer
120745    0.6423  postgres    LockAcquireExtended
112992    0.6010  postgres    ExecInitExpr
112830    0.6002  postgres    lappend
112311    0.5974  postgres    new_list
110368    0.5871  postgres    check_stack_depth
106036    0.5640  postgres    AllocSetFree
102565    0.5456  postgres    MemoryContextAllocZero
94689     0.5037  postgres    SearchSysCache
That profile looks like you ran pgbench with -M simple. How does it look with 
prepared instead?

Andres





Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
anara...@anarazel.de and...@anarazel.de wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov schrieb:
 
samples  %        image name  symbol name
933394    4.9651  postgres    AllocSetAlloc
848476    4.5134  postgres    base_yyparse
719515    3.8274  postgres    SearchCatCache
 
 That profile looks like you ran pgbench with -M simple. How does
 it look with prepared instead?
 
samples  %        image name  symbol name
495463    3.6718  postgres    hash_search_with_hash_value
490971    3.6385  postgres    GetSnapshotData
443965    3.2902  postgres    LWLockAcquire
443566    3.2872  postgres    AllocSetAlloc
302388    2.2409  postgres    XLogInsert
286889    2.1261  postgres    SearchCatCache
246417    1.8262  postgres    PostgresMain
235018    1.7417  postgres    heap_page_prune
198442    1.4706  postgres    _bt_compare
181446    1.3447  postgres    hash_any
177131    1.3127  postgres    ExecInitExpr
175775    1.3026  postgres    LWLockRelease
152324    1.1288  postgres    PinBuffer
150285    1.1137  postgres    exec_bind_message
145214    1.0762  postgres    fmgr_info_cxt_security
140493    1.0412  postgres    s_lock
124162    0.9201  postgres    LockAcquireExtended
120429    0.8925  postgres    MemoryContextAlloc
117076    0.8676  postgres    pfree
116493    0.8633  postgres    AllocSetFree
105027    0.7783  postgres    pgstat_report_activity
101407    0.7515  postgres    ProcArrayLockAcquire
100797    0.7470  postgres    MemoryContextAllocZeroAligned
98360     0.7289  postgres    ProcArrayLockRelease
86938     0.6443  postgres    heap_hot_search_buffer
82635     0.6124  postgres    hash_search
79902     0.5921  postgres    errstart
79465     0.5889  postgres    HeapTupleSatisfiesVacuum
78709     0.5833  postgres    ResourceOwnerReleaseInternal
76068     0.5637  postgres    ExecModifyTable
73043     0.5413  postgres    heap_update
72175     0.5349  postgres    strlcpy
71253     0.5280  postgres    MemoryContextAllocZero
 
tps = 27392.219364 (including connections establishing)
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 samples  %        image name  symbol name
 495463    3.6718  postgres    hash_search_with_hash_value
 
When lines like these show up in the annotated version, I'm
impressed that we're still finding gains as big as we are:
 
  44613  0.3306 :        if (segp == NULL)
                :            hash_corrupted(hashp);
 
 101910  0.7552 :        keysize = hashp->keysize;   /* ditto */
 
There goes over 1% of my server run time, right there!
 
Of course, these make no sense unless there is cache line
contention, which is why that area is bearing fruit.
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
  I think so.  My take was that it was showing 32 of 64 *threads*
 active -- the hyperthreading funkiness.  Is there something in
 particular you'd like me to check?
 
 Not really, just don't understand the number.
 
I'm having trouble resolving the vmstat numbers I got during the
32-client pgbench runs which modified data.
 
-M simple:
 
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free      buff      cache    si   so    bi     bo     in      cs us sy id wa st
30  1   4464 513492 205564572 54472124    0    0     0  78170 621724 1246300 30  8 61  1  0
27  1   4464 509288 205564572 54474600    0    0     0 125620 599403 1192046 29  8 63  1  0
35  1   4464 508368 205564572 54476996    0    0     0  89801 595939 1186496 29  8 63  0  0
25  0   4464 506088 205564572 54478668    0    0     0  90121 594800 1189649 28  8 63  0  0
 
-M prepared:
 
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free      buff      cache   si   so    bi     bo     in      cs us sy id wa st
28  0   5612 1204404 205107344 54230536    0    0     0  93212 527284 1456417 22  9 69  0  0
 8  1   5612 1202044 205107344 542600      0    0     0  93217 512819 1417457 21  9 70  1  0
17  1   5612 1201892 205107344 54236048    0    0     0 132699 502333 1412878 21  9 70  0  0
19  1   5612 1199208 205107344 54238936    0    0     0  93612 519113 1484386 21  9 69  0  0
 
So 60% or 70% idle without any I/O wait time.  I don't know how to
explain that.
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Andres Freund
On Friday, November 18, 2011 08:36:59 PM Kevin Grittner wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
  samples  %        image name  symbol name
  495463    3.6718  postgres    hash_search_with_hash_value
 
 When lines like these show up in the annotated version, I'm
 impressed that we're still finding gains as big as we are:
 
  44613  0.3306 :        if (segp == NULL)
                :            hash_corrupted(hashp);

 101910  0.7552 :        keysize = hashp->keysize;   /* ditto */
When doing line-level profiles I would suggest looking at the instructions.
Quite often the line shown doesn't have much to do with what's executed, as the
compiler tries to schedule instructions cleverly.
Also, in many situations the shown cost doesn't actually lie in the instruction
shown but in some previous one: the shown instruction e.g. has to wait for the
result of the earlier instructions. Pipelining makes that hard to observe
correctly.

A simplified example would be something like:

bool func(int a, int b, int c)
{
    int res = a / b;

    if (res == c)
        return true;
    return false;
}

Likely the instruction showing up in the profile would be the comparison,
which obviously is not the really expensive part...


 There goes over 1% of my server run time, right there!
 
 Of course, these make no sense unless there is cache line
 contention, which is why that area is bearing fruit.
I don't think cache line contention is the most likely candidate here.  Simple 
cache-misses seem far more likely. In combination with pipeline stalls...

Newer CPUs (Nehalem+) can measure stalled cycles, which can be really useful
when analyzing performance. I don't remember how to do that with oprofile right
now though, as I use perf these days (it's -e stalled-cycles-{frontend|backend}
there).

Andres



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Andres Freund and...@anarazel.de wrote:
 
 When doing line-level profiles I would suggest looking at the
 instructions.
 
What's the best way to do that?
 
 I don't think cache line contention is the most likely candidate
 here.  Simple cache-misses seem far more likely. In combination
 with pipeline stalls...
 
  Newer CPUs (Nehalem+) can measure stalled cycles, which can be
  really useful when analyzing performance. I don't remember how to
  do that with oprofile right now though, as I use perf these days
  (it's -e stalled-cycles-{frontend|backend} there).
 
When I run oprofile, I still always go back to this post by Tom:
 
http://archives.postgresql.org/pgsql-performance/2009-06/msg00154.php
 
Can anyone provide such a cheat sheet for perf?  I could give that
a try if I knew how.
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Andres Freund
On Friday, November 18, 2011 09:16:01 PM Kevin Grittner wrote:
 Andres Freund and...@anarazel.de wrote:
  When doing line-level profiles I would suggest looking at the
  instructions.
 What's the best way to do that?
I think opannotate -a -s produces output with instructions/code intermingled.

  I don't think cache line contention is the most likely candidate
  here.  Simple cache-misses seem far more likely. In combination
  with pipeline stalls...
  
  Newer CPUs (Nehalem+) can measure stalled cycles, which can be
  really useful when analyzing performance. I don't remember how to
  do that with oprofile right now though, as I use perf these days
  (it's -e stalled-cycles-{frontend|backend} there).
 
 When I run oprofile, I still always go back to this post by Tom:
 http://archives.postgresql.org/pgsql-performance/2009-06/msg00154.php
Hrm. I am on the train, and for unknown reasons the only sensibly working
protocols are smtp + pop... Waiting... Waiting...
Sorry, too slow/high latency atm. I wrote everything below and another mail
and the page still hasn't loaded.

oprofile can produce graphs as well (--callgraph). For both tools you need
-fno-omit-frame-pointer to get usable graphs.

 Can anyone provide such a cheat sheet for perf?  I could give that
 a try if I knew how.
Unfortunately for sensible results the kernel needs to be rather new.
I would say >= 2.6.28 or so (just guessed).

# to record activity
perf record [-g|--call-graph] program|-p pid 

# to view a summation
perf report

graph:
# Overhead   Command  Shared Object              Symbol
# ........  ........  .............  ..................
#
 4.09%  postgres  postgres   [.] slab_alloc_dyn
|
--- slab_alloc_dyn
   |  
   |--18.52%-- new_list
   |  |  
   |  |--63.79%-- lappend
   |  |  |  
   |  |  |--13.40%-- find_usable_indexes
   |  |  |  create_index_paths
   |  |  |  set_rel_pathlist
   |  |  |  make_one_rel

flat:

# Overhead   Command  Shared Object              Symbol
# ........  ........  .............  ..................
#
 5.10%  postgres  [vdso] [.] 0x73d8d770  
 4.26%  postgres  postgres   [.] base_yyparse
 3.88%  postgres  postgres   [.] slab_alloc_dyn
 2.82%  postgres  postgres   [.] core_yylex
 2.37%  postgres  postgres   [.] SearchCatCache
 1.85%  postgres  libc-2.13.so   [.] __memcpy_ssse3
 1.66%  postgres  libc-2.13.so   [.] __GI___strcmp_ssse3
 1.23%  postgres  postgres   [.] MemoryContextAlloc


# to view a line/source/instruction level view
perf annotate -l symbol

...
         :
         :      /*
         :       * one-time startup overhead for each cache
         :       */
         :      if (cache->cc_tupdesc == NULL)
    0.35 :  6e81fd:       48 83 7f 28 00          cmpq   $0x0,0x28(%rdi)
 /home/andres/src/postgresql/build/optimize/../../src/backend/utils/cache/catcache.c:1070
    4.15 :  6e8202:       0f 84 54 04 00 00       je     6e865c <SearchCatCache+0x47c>
         :      #endif
         :
         :      /*
         :       * initialize the search key information
         :       */
         :      memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
    0.00 :  6e8208:       48 8d bd a0 fe ff ff    lea    -0x160(%rbp),%rdi
    0.17 :  6e820f:       49 8d 77 70             lea    0x70(%r15),%rsi
    0.00 :  6e8213:       b9 24 00 00 00          mov    $0x24,%ecx
 /home/andres/src/postgresql/build/optimize/../../src/backend/utils/cache/catcache.c:1080
   33.22 :  6e8218:       f3 48 a5                rep movsq %ds:(%rsi),%es:(%rdi)
         :      cur_skey[0].sk_argument = v1;
 /home/andres/src/postgresql/build/optimize/../../src/backend/utils/cache/catcache.c:1081
    1.56 :  6e821b:       48 89 9d e0 fe ff ff    mov    %rbx,-0x120(%rbp)
...

# get heaps of stats from something
perf stat -ddd someprogram|-p pid

      1242.409965 task-clock                #    0.824 CPUs utilized          [100.00%]
           14,572 context-switches          #    0.012 M/sec                  [100.00%]
              264 CPU-migrations            #    0.000 M/sec                  [100.00%]
                0 page-faults               #    0.000 M/sec
    2,854,775,135 cycles                    #    2.298 GHz                    [26.28%]
  <not supported> stalled-cycles-frontend
  <not supported> stalled-cycles-backend
    2,024,997,785 instructions              #    0.71  insns per cycle

Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Andres Freund
On Friday, November 18, 2011 11:12:02 PM Andres Freund wrote:
 On Friday, November 18, 2011 09:16:01 PM Kevin Grittner wrote:
  Andres Freund and...@anarazel.de wrote:
   When doing line-level profiles I would suggest looking at the
   instructions.
  
  What's the best way to do that?
 
 I think opannotate -a -s produces output with instructions/code
 intermingled.
 
   I don't think cache line contention is the most likely candidate
   here.  Simple cache-misses seem far more likely. In combination
   with pipeline stalls...
   
   Newer CPUs (Nehalem+) can measure stalled cycles, which can be
   really useful when analyzing performance. I don't remember how to
   do that with oprofile right now though, as I use perf these days
   (it's -e stalled-cycles-{frontend|backend} there).
  
  When I run oprofile, I still always go back to this post by Tom:
  http://archives.postgresql.org/pgsql-performance/2009-06/msg00154.php
 
 Hrm. I am on the train, and for unknown reasons the only sensibly working
 protocols are smtp + pop... Waiting... Waiting...
 Sorry, too slow/high latency atm. I wrote everything below and another mail
 and the page still hasn't loaded.
 
 oprofile can produce graphs as well (--callgraph). For both tools you
 need -fno-omit-frame-pointer to get usable graphs.
 
  Can anyone provide such a cheat sheet for perf?  I could give that
  a try if I knew how.
 
 Unfortunately for sensible results the kernel needs to be rather new.
 I would say >= 2.6.28 or so (just guessed).
 
 # to record activity
 perf record [-g|--call-graph] program|-p pid
 
 # to view a summation
 perf report

 # get heaps of stats from something
 perf stat -ddd someprogram|-p pid

 # show what the system is executing overall
 perf top -az
 
 # get help
 perf help (record|report|annotate|stat|...)
 
...
I forgot that there is also 

# get a list of event types
perf list

# measure something for a specific event
perf (record|stat|top) -e some_event_type



Andres



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Andres Freund and...@anarazel.de wrote:
 
 I think opannotate -a -s produces output with instructions/code
 intermingled.
 
Thanks.  I'll check out perf later (thanks for the tips!), but for
now, here's the function which was at the top of my oprofile
results, annotated with those options.  I'm afraid it's a bit
intimidating to me -- the last time I did much with X86 assembly
language was in the mid-80s, on an 80286.  :-/  Hopefully, since
this is at the top of the oprofile results when running with
prepared statements, it will be of use to somebody.
 
The instructions which are shown as having that 1% still seem odd to
me, but as you say, they were probably actually waiting for some
previous operation to finish:
 
 43329  0.3211 :  70b56a:   test   %rbp,%rbp
 
 99903  0.7404 :  70b58a:   mov    %rax,0x18(%rsp)
 
If anyone wants any other detail from what I captured, let me know.
 
-Kevin

0070b520 <hash_search_with_hash_value>: /* hash_search_with_hash_value total: 495463  3.6718 */
               :hash_search_with_hash_value(HTAB *hashp,
               :                            const void *keyPtr,
               :                            uint32 hashvalue,
               :                            HASHACTION action,
               :                            bool *foundPtr)
               :{
  5023  0.0372 :  70b520:   push   %r15
  5967  0.0442 :  70b522:   push   %r14
  1407  0.0104 :  70b524:   mov    %rdi,%r14
    30 2.2e-04 :  70b527:   push   %r13
  2495  0.0185 :  70b529:   push   %r12
  2631  0.0195 :  70b52b:   mov    %edx,%r12d
    18 1.3e-04 :  70b52e:   push   %rbp
  1277  0.0095 :  70b52f:   push   %rbx
               :static inline uint32
               :calc_bucket(HASHHDR *hctl, uint32 hash_val)
               :{
               :    uint32      bucket;
               :
               :    bucket = hash_val & hctl->high_mask;
  2122  0.0157 :  70b530:   mov    %edx,%ebx
               :hash_search_with_hash_value(HTAB *hashp,
               :                            const void *keyPtr,
               :                            uint32 hashvalue,
               :                            HASHACTION action,
               :                            bool *foundPtr)
               :{
   247  0.0018 :  70b532:   sub    $0x58,%rsp
   236  0.0017 :  70b536:   mov    %rsi,0x10(%rsp)
  3851  0.0285 :  70b53b:   mov    %ecx,0xc(%rsp)
  2551  0.0189 :  70b53f:   mov    %r8,(%rsp)
               :    HASHHDR    *hctl = hashp->hctl;
  2225  0.0165 :  70b543:   mov    (%rdi),%r15
               :static inline uint32
               :calc_bucket(HASHHDR *hctl, uint32 hash_val)
               :{
               :    uint32      bucket;
               :
               :    bucket = hash_val & hctl->high_mask;
  4544  0.0337 :  70b546:   and    0x2c(%r15),%ebx
               :    if (bucket > hctl->max_bucket)
 53409  0.3958 :  70b54a:   cmp    0x28(%r15),%ebx
               :  70b54e:   jbe    70b554 <hash_search_with_hash_value+0x34>
               :        bucket = bucket & hctl->low_mask;
  3324  0.0246 :  70b550:   and    0x30(%r15),%ebx
               :    bucket = calc_bucket(hctl, hashvalue);
               :
               :    segment_num = bucket >> hashp->sshift;
               :    segment_ndx = MOD(bucket, hashp->ssize);
               :
               :    segp = hashp->dir[segment_num];
  9702  0.0719 :  70b554:   mov    0x58(%r14),%ecx
  2428  0.0180 :  70b558:   mov    %ebx,%eax
   489  0.0036 :  70b55a:   mov    0x8(%r14),%rdx
               :     * Do the initial lookup
               :     */
               :    bucket = calc_bucket(hctl, hashvalue);
               :
               :    segment_num = bucket >> hashp->sshift;
               :    segment_ndx = MOD(bucket, hashp->ssize);
   391  0.0029 :  70b55e:   mov    0x50(%r14),%r13
               :
               :    segp = hashp->dir[segment_num];
  2062  0.0153 :  70b562:   shr    %cl,%eax
   309  0.0023 :  70b564:   mov    %eax,%eax
   643  0.0048 :  70b566:   mov    (%rdx,%rax,8),%rbp
               :
               :    if (segp == NULL)
 43329  0.3211 :  70b56a:   test   %rbp,%rbp
  1284  0.0095 :  70b56d:   je     70b727 <hash_search_with_hash_value+0x207>
               :        hash_corrupted(hashp);
               :
               :    prevBucketPtr = &segp[segment_ndx];
  1878  0.0139 :  70b573:   lea    -0x1(%r13),%rax
               :    currBucket = *prevBucketPtr;
               :
               :    /*
               :     * Follow collision chain looking for 

Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Robert Haas
On Fri, Nov 18, 2011 at 2:05 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:
 Any chance you can run oprofile (on either branch, don't really
 care) against the 32 client test and post the results?

 [ oprofile results ]

Hmm.  That looks a lot like a profile with no lock contention at all.
Since I see XLogInsert in there, I assume this must be a pgbench write
test on unlogged tables?  How close am I?

I was actually thinking it would be interesting to oprofile the
read-only test; see if we can figure out where those slowdowns are
coming from.

 Two runs:

 tps = 21946.961196 (including connections establishing)
 tps = 22911.873227 (including connections establishing)

 For write transactions, that seems pretty respectable.

Very.  What do you get without the patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: 
 
 Hmm.  That looks a lot like a profile with no lock contention at
 all.  Since I see XLogInsert in there, I assume this must be a
 pgbench write test on unlogged tables?  How close am I?
 
Not unless pgbench on HEAD does that by default.  Here are the
relevant statements:
 
$prefix/bin/pgbench -i -s 150
$prefix/bin/pgbench -T $time -c $clients -j $clients > $resultfile
 
Perhaps the Intel cores implement the relevant primitives better? 
Maybe I didn't run the profile or reports the right way?
 
 I was actually thinking it would be interesting to oprofile the
 read-only test; see if we can figure out where those slowdowns are
 coming from.
 
I'll plan on doing that this weekend.
 
 tps = 21946.961196 (including connections establishing)
 tps = 22911.873227 (including connections establishing)

 For write transactions, that seems pretty respectable.
 
 Very.  What do you get without the patch?
 
[quick runs a couple tests that way]
 
Single run with -M simple:
 
tps = 23018.314292 (including connections establishing)
 
Single run with -M prepared:
 
tps = 27910.621044 (including connections establishing)
 
So, the patch appears to hinder performance in this environment,
although certainty is quite low with so few samples.  I'll schedule
a spectrum of runs before I leave this evening (very soon).
 
-Kevin



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Robert Haas
On Fri, Nov 18, 2011 at 6:46 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 tps = 21946.961196 (including connections establishing)
 tps = 22911.873227 (including connections establishing)

 For write transactions, that seems pretty respectable.

 Very.  What do you get without the patch?

 [quick runs a couple tests that way]

 Single run with -M simple:

 tps = 23018.314292 (including connections establishing)

 Single run with -M prepared:

 tps = 27910.621044 (including connections establishing)

 So, the patch appears to hinder performance in this environment,
 although certainty is quite low with so few samples.  I'll schedule
 a spectrum of runs before I leave this evening (very soon).

Hmm.  There's obviously something that's different in your environment
or configuration from what I tested, but I don't know what it is.  The
fact that your scale factor is larger than shared_buffers might
matter; or Intel vs. AMD.  Or maybe you're running with
synchronous_commit=on?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] testing ProcArrayLock patches

2011-11-18 Thread Andres Freund
On Saturday, November 19, 2011 12:18:07 AM Kevin Grittner wrote:
 Andres Freund and...@anarazel.de wrote:
  I think opannotate -a -s produces output with instructions/code
  intermingled.
 
 Thanks.  I'll check out perf later (thanks for the tips!), but for
 now, here's the function which was at the top of my oprofile
 results, annotated with those options.  I'm afraid it's a bit
 intimidating to me -- the last time I did much with X86 assembly
 language was in the mid-80s, on an 80286.  :-/ 
While my assembly knowledge surely isn't from the '80s, rest assured that I find it 
intimidating as well ;)

 Hopefully, since
 this is at the top of the oprofile results when running with
 prepared statements, it will be of use to somebody.
I think hash_search_with_hash_value is rather noticeable in profiles in quite 
a few situations, even without concurrency...
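(For reference, a profiling run of the sort being discussed might look roughly like this; a sketch using oprofile's legacy opcontrol interface of that era, with the binary path and pgbench options as placeholder assumptions.)

```shell
# Record a system-wide profile while pgbench generates load (needs root)
opcontrol --init
opcontrol --start
pgbench -c 32 -j 32 -T 60 pgbench
opcontrol --dump && opcontrol --stop

# Per-symbol summary, then annotated source with assembly intermingled
opreport -l /usr/local/pgsql/bin/postgres | head -n 20
opannotate -a -s /usr/local/pgsql/bin/postgres > annotated.txt
```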

Looking at your annotation output, the code seems to be almost entirely stalled 
waiting for memory.
The first stall is after the first memory read that is likely to be uncached (the 
first cacheline of the HTAB is accessed before that, but it will be in the cache). 
The interesting thing is that I would have expected this one to stay in the cache 
more often.
  2225  0.0165 :  70b543:   mov    (%rdi),%r15
               :static inline uint32
               :calc_bucket(HASHHDR *hctl, uint32 hash_val)
               :{
               :    uint32      bucket;
               :
               :    bucket = hash_val & hctl->high_mask;
  4544  0.0337 :  70b546:   and    0x2c(%r15),%ebx
               :    if (bucket > hctl->max_bucket)
 53409  0.3958 :  70b54a:   cmp    0x28(%r15),%ebx
               :  70b54e:   jbe    70b554 <hash_search_with_hash_value+0x34>

So a stall here is not that surprising.
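(In source form, the bucket calculation being profiled above is tiny. A self-contained sketch follows; the struct is an illustrative mock that mirrors only the dynahash header fields used here, not PostgreSQL's actual HASHHDR.)

```c
#include <stdint.h>

/* Hypothetical mirror of the dynahash header fields used by calc_bucket(). */
typedef struct
{
    uint32_t max_bucket;    /* ID of the largest bucket in use */
    uint32_t high_mask;     /* mask for the bucket part of a hash value */
    uint32_t low_mask;      /* mask corresponding to the smaller table size */
} HashHeader;

/*
 * Mask the hash value down to a bucket number; if that lands past the last
 * bucket in use (the table is mid-expansion), fall back to the smaller mask.
 * All three fields live in the header, which is why the loads above share
 * the same %r15 base register.
 */
static uint32_t
calc_bucket(const HashHeader *hctl, uint32_t hash_val)
{
    uint32_t bucket = hash_val & hctl->high_mask;

    if (bucket > hctl->max_bucket)
        bucket = bucket & hctl->low_mask;
    return bucket;
}
```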


Here we fetch data from memory that is unlikely to be prefetchable, and then soon 
require the result of that fetch. Note how  segp = hashp->dir[segment_num]; 
is distributed over lines 52, 64, and 83.

               :    segp = hashp->dir[segment_num];
  2062  0.0153 :  70b562:   shr    %cl,%eax
   309  0.0023 :  70b564:   mov    %eax,%eax
   643  0.0048 :  70b566:   mov    (%rdx,%rax,8),%rbp
               :
               :    if (segp == NULL)
 43329  0.3211 :  70b56a:   test   %rbp,%rbp





The next cacheline is referenced here. Again, a fetch from memory whose result is 
soon needed to continue.
Unless I misunderstood the code flow, this disproves my theory that we might 
have many collisions, as that test seems to be outside the loop:
               :    prevBucketPtr = &segp[segment_ndx];
               :    currBucket = *prevBucketPtr;
   122 9.0e-04 :  70b586:   mov    0x0(%rbp),%rbx
               :
               :    /*
               :     * Follow collision chain looking for matching key
               :     */
               :    match = hashp->match;       /* save one fetch in inner loop */
               :    keysize = hashp->keysize;   /* ditto */
 99903  0.7404 :  70b58a:   mov    %rax,0x18(%rsp)
               :
               :    while (currBucket != NULL)
  1066  0.0079 :  70b58f:   test   %rbx,%rbx




Line 136 is the first time the contents of the current bucket are needed. That's 
why the test is so noticeable.
               :    currBucket = *prevBucketPtr;
   655  0.0049 :  70b5a3:   mov    (%rbx),%rbx
               :     * Follow collision chain looking for matching key
               :     */
               :    match = hashp->match;       /* save one fetch in inner loop */
               :    keysize = hashp->keysize;   /* ditto */
               :
               :    while (currBucket != NULL)
   608  0.0045 :  70b5a6:   test   %rbx,%rbx
               :  70b5a9:   je     70b5d0 <hash_search_with_hash_value+0xb0>
               :    {
               :        if (currBucket->hashvalue == hashvalue &&
  3504  0.0260 :  70b5ab:   cmp    %r12d,0x8(%rbx)
 98486  0.7299 :  70b5af:   nop
  1233  0.0091 :  70b5b0:   jne    70b5a0 <hash_search_with_hash_value+0x80>


That covers all the slow points in the function. And unless I am missing 
something, those are all the cachelines fetched by that function... for 
HASH_FIND, that is.


So I think that reinforces my belief that ordinary cache misses are the culprit 
here, which is to be expected in a hash table...
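(The cache-miss pattern described above comes from the collision chain being a linked list: each hop dereferences a pointer loaded by the previous hop, so the fetches serialize. A minimal illustrative analogue of the HASH_FIND loop; types and names here are a simplification, not PostgreSQL's actual code.)

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Bucket
{
    struct Bucket *link;        /* next entry in the collision chain */
    uint32_t       hashvalue;   /* cached hash of the entry's key */
    int            key;         /* stand-in for the real key bytes */
} Bucket;

/*
 * Walk the collision chain. Every iteration loads cur->link from the
 * cacheline fetched in the previous iteration, so the CPU cannot start
 * fetching bucket N+1 until bucket N has arrived from memory -- hence
 * the stalls on the test/cmp instructions in the profile.
 */
static Bucket *
chain_find(Bucket *head, uint32_t hashvalue, int key)
{
    for (Bucket *cur = head; cur != NULL; cur = cur->link)
    {
        if (cur->hashvalue == hashvalue && cur->key == key)
            return cur;
    }
    return NULL;
}
```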


Andres


PS: No idea whether that rambling made sense to anyone... But I looked at that 
function for the first time ;)
