RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-09-02 Thread Tsunakawa, Takayuki
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> Hmm ... is this patch rejected, or is somebody still trying to get it to
> committable state?  David, you're listed as committer.

I don't think it's rejected.  It would be a pity (mottainai) to refuse this, 
because it provides a significant speedup from such a simple modification.

Again, I think the v2 patch is OK.  Tom's comment was as follows:


[Tom's comment against v2]

FWIW, I tried this patch against current HEAD (959d00e9d).
Using the test case described by Amit at

I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case.
(I had to increase max_locks_per_transaction to get it to run
at all.)  We should not be using that sort of edge case to drive
performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192
to something more real-world, like 128, I do still measure a
performance gain, but it's ~ 1.5% which is below what I'd consider
a reproducible win.  I'm accustomed to seeing changes up to 2%
in narrow benchmarks like this one, even when "nothing changes"
except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with
one client and an -s 10 database), it seems that the patch is about
0.5% slower than HEAD.  Again, that's below the noise threshold,
but it's not promising for the net effects of this patch on workloads
that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on
sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
bytes, a 22% increase.  That's a lot if you're considering cases
with many locks.

On the whole I don't think there's an adequate case for committing
this patch.

I'd also point out that this is hardly the only place where we've
seen hash_seq_search on nearly-empty hash tables become a bottleneck.
So I'm not thrilled about attacking that with one-table-at-time patches.
I'd rather see us do something to let hash_seq_search win across
the board.



* Extreme test case: 
Not extreme.  Two of our customers, who are considering adopting PostgreSQL, are 
using thousands of partitions now.  We hit this issue -- a point query gets 
nearly 20% slower after automatically creating a generic plan.  That's the 
reason for this proposal.

* 0.5% slowdown with pgbench:
I think it's below the noise, as Tom said.

* sizeof(LOCALLOCK):
As Andres replied to Tom in the immediately following mail, LOCALLOCK was 
bigger in PG 11.

* Use case is narrow:
No.  The bloated LockMethodLocalHash affects the performance of the items below 
as well as transaction commit/abort:
  - AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice 
in PREPARE!
  - LockReleaseSession: advisory lock
  - LockReleaseCurrentOwner: ??
  - LockReassignCurrentOwner: ??


Regards
Takayuki Tsunakawa



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-09-02 Thread Alvaro Herrera
On 2019-Aug-14, David Rowley wrote:

> For now, I'm out of ideas. If anyone else feels like suggesting
> something or picking this up, feel free.

Hmm ... is this patch rejected, or is somebody still trying to get it to
committable state?  David, you're listed as committer.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-08-15 Thread Tomas Vondra

On Wed, Aug 14, 2019 at 07:25:10PM +1200, David Rowley wrote:

On Thu, 25 Jul 2019 at 05:49, Tom Lane  wrote:

On the whole, I don't especially like this approach, because of the
confusion between peak lock count and end-of-xact lock count.  That
seems way too likely to cause problems.


Thanks for having a look at this.  I've not addressed the points
you've mentioned due to what you mention above.  The only way I can
think of so far to resolve that would be to add something to track
peak lock usage.  The best I can think of to do that, short of adding
something to dynahash.c, is to check how many locks are held each time
we obtain a lock and, if that count is higher than the previous time
we checked, update the maximum locks held (probably in a global
variable).  That seems pretty horrible to me and adds overhead each
time we obtain a lock, which is a pretty performance-critical path.



Would it really be a measurable overhead? I mean, we only really need
one int counter, and you don't need to do the check on every lock
acquisition - you just need to recheck on the first lock release. But
maybe I'm underestimating how expensive it is ...

Talking about dynahash - doesn't it already track this information?
Maybe not directly but surely it has to track the number of entries in
the hash table, in order to compute fill factor. Can't we piggy-back on
that and track the highest fill-factor for a particular period of time?
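
A minimal sketch of that idea, assuming illustrative names (nothing here
is actual PostgreSQL code): dynahash already maintains an entry count for
its fill-factor arithmetic, exposed via hash_get_num_entries(), and the
local lock table only starts shrinking once releasing begins, so sampling
the count at the first release of each transaction approximates the peak.

    /* Illustrative names; not actual PostgreSQL code. */
    static long peak_locks_this_xact = 0;
    static bool peak_sampled_this_xact = false;

    /* Call once, from the first lock release of a transaction. */
    static void
    SamplePeakLockCount(void)
    {
        if (!peak_sampled_this_xact)
        {
            peak_locks_this_xact = hash_get_num_entries(LockMethodLocalHash);
            peak_sampled_this_xact = true;
        }
    }

Locks transiently taken and released mid-transaction would still be
missed, which is the complication Tom raises elsewhere in the thread.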


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-08-14 Thread David Rowley
On Thu, 25 Jul 2019 at 05:49, Tom Lane  wrote:
> On the whole, I don't especially like this approach, because of the
> confusion between peak lock count and end-of-xact lock count.  That
> seems way too likely to cause problems.

Thanks for having a look at this.  I've not addressed the points
you've mentioned due to what you mention above.  The only way I can
think of so far to resolve that would be to add something to track
peak lock usage.  The best I can think of to do that, short of adding
something to dynahash.c, is to check how many locks are held each time
we obtain a lock and, if that count is higher than the previous time
we checked, update the maximum locks held (probably in a global
variable).  That seems pretty horrible to me and adds overhead each
time we obtain a lock, which is a pretty performance-critical path.

I've not tested what Andres mentioned about simplehash instead of
dynahash yet. I did a quick scan of simplehash and it looked like
SH_START_ITERATE would suffer the same problems as dynahash's
hash_seq_search(), albeit perhaps with more efficient memory
lookups; i.e., it still has to skip over empty buckets, which might be
costly in a bloated table.
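
To make that cost concrete, here is a small standalone C toy (not
PostgreSQL's dynahash) showing that iterating a chained hash table costs
O(buckets) rather than O(entries):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Entry { int key; struct Entry *next; } Entry;
    typedef struct { Entry **buckets; long nbuckets; } HashTable;

    static void
    iterate_all(HashTable *ht)
    {
        long visited_buckets = 0, found = 0;

        for (long b = 0; b < ht->nbuckets; b++)
        {
            visited_buckets++;      /* paid even for empty buckets */
            for (Entry *e = ht->buckets[b]; e != NULL; e = e->next)
                found++;
        }
        printf("%ld entries, but %ld buckets walked\n",
               found, visited_buckets);
    }

    int
    main(void)
    {
        /* a table that grew to 8192 buckets but now holds one entry */
        HashTable ht = { calloc(8192, sizeof(Entry *)), 8192 };
        Entry e = { 42, NULL };

        ht.buckets[42 % ht.nbuckets] = &e;
        iterate_all(&ht);           /* 1 entry, 8192 buckets walked */
        free(ht.buckets);
        return 0;
    }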

For now, I'm out of ideas. If anyone else feels like suggesting
something or picking this up, feel free.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-08-01 Thread Thomas Munro
On Thu, Jul 25, 2019 at 5:49 AM Tom Lane  wrote:
> David Rowley  writes:
> > Here's a more polished version with the debug code removed, complete
> > with benchmarks.
>
> A few gripes:
>
> [gripes]

Based on the above, I've set this to "Waiting on Author", in the next CF.

-- 
Thomas Munro
https://enterprisedb.com




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-24 Thread Tom Lane
David Rowley  writes:
> Here's a more polished version with the debug code removed, complete
> with benchmarks.

A few gripes:

You're measuring the number of locks held at completion of the
transaction, which fails to account for locks transiently taken and
released, so that the actual peak usage will be more.  I think we mostly
only do that for system catalog accesses, so *maybe* the number of extra
locks involved isn't very large, but that's a very shaky assumption.

I don't especially like the fact that this seems to have a hard-wired
(and undocumented) assumption that buckets == entries, ie that the
fillfactor of the table is set at 1.0.  lock.c has no business knowing
that.  Perhaps instead of returning the raw bucket count, you could have
dynahash.c return bucket count times fillfactor, so that the number is in
the same units as the entry count.
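
For concreteness, such a helper might look like this inside dynahash.c,
where the HASHHDR fields live (a sketch with assumed field usage, not a
real function):

    long
    hash_get_max_entries(HTAB *hashp)
    {
        HASHHDR    *hctl = hashp->hctl;

        /* max_bucket is zero-based; ffactor is target entries per bucket */
        return (long) (hctl->max_bucket + 1) * hctl->ffactor;
    }

lock.c could then compare a shrink threshold expressed in entry units,
with no knowledge of the fill factor.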

This:
running_avg_locks -= running_avg_locks / 10.0;
running_avg_locks += numLocksHeld / 10.0;
seems like a weird way to do the calculation.  Personally I'd write
running_avg_locks += (numLocksHeld - running_avg_locks) / 10.0;
which is more the way I'm used to seeing exponential moving averages
computed --- at least, it seems clearer to me why this will converge
towards the average value of numLocksHeld over time.  It also makes
it clear that it wouldn't be sane to use two different divisors,
whereas your formulation makes it look like maybe they could be
set independently.
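
A quick standalone check (plain C, nothing PostgreSQL-specific) confirms
the two formulations compute identical values and converge toward the
steady-state numLocksHeld:

    #include <stdio.h>

    int
    main(void)
    {
        double a = 0.0;             /* two-step form */
        double b = 0.0;             /* equivalent one-step form */

        for (int xact = 1; xact <= 60; xact++)
        {
            double numLocksHeld = (xact <= 30) ? 1000.0 : 10.0;

            a -= a / 10.0;
            a += numLocksHeld / 10.0;

            b += (numLocksHeld - b) / 10.0;

            if (xact % 10 == 0)
                printf("xact %2d: a = %8.3f  b = %8.3f\n", xact, a, b);
        }
        return 0;   /* a and b track each other exactly, converging on the input */
    }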

Your compiler might not complain that LockReleaseAll's numLocksHeld
is potentially uninitialized, but other people's compilers will.


On the whole, I don't especially like this approach, because of the
confusion between peak lock count and end-of-xact lock count.  That
seems way too likely to cause problems.

regards, tom lane




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-24 Thread David Rowley
On Wed, 24 Jul 2019 at 16:16, David Rowley  wrote:
>
> On Wed, 24 Jul 2019 at 15:05, David Rowley  wrote:
> > To be able to reduce the threshold down again we'd need to make a
> > hash_get_num_entries(LockMethodLocalHash) call before performing the
> > guts of LockReleaseAll(). We could then weight that onto some running
> > average counter with a weight of, say... 10, so we react to changes
> > fairly quickly, but not instantly. We could then have some sort of
> > logic like "rebuild the hash table if the running average is 4 times
> > less than max_bucket"
> >
> > I've attached a spreadsheet of that idea and the algorithm we could
> > use to track the running average.  Initially, I've mocked it up a
> > series of transactions that use 1000 locks, then at row 123 dropped
> > that to 10 locks. If we assume the max_bucket is 1000, then it takes
> > until row 136 for the running average to drop below the max_bucket
> > count, i.e. 13 xacts. Then we'd reset back to the init size of 16. If
> > the average went up again, then we'd automatically expand the table as
> > we do now.  To make this work we'd need an additional call to
> > hash_get_num_entries(), before we release the locks, so there is more
> > overhead.
>
> Here's a patch with this implemented. I've left a NOTICE in there to
> make it easier for people to follow along at home and see when the
> lock table is reset.

Here's a more polished version with the debug code removed, complete
with benchmarks.

-- Test 1. Select 1 record from a 140-partition table. Tests
creating a large number of locks with a fast query.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
\gexec
create index on hp (b);
insert into hp select x,x from generate_series(1, 14) x;
analyze hp;

select3.sql: select * from hp where b = 1

-
Master:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2098.628975 (excluding connections establishing)
tps = 2101.797672 (excluding connections establishing)
tps = 2085.317292 (excluding connections establishing)
tps = 2094.931999 (excluding connections establishing)
tps = 2092.328908 (excluding connections establishing)

Patched:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2101.691459 (excluding connections establishing)
tps = 2104.533249 (excluding connections establishing)
tps = 2106.499123 (excluding connections establishing)
tps = 2104.033459 (excluding connections establishing)
tps = 2105.463629 (excluding connections establishing)

(I'm surprised there is not more overhead in the additional tracking
added to calculate the running average)

-- Test 2. Tests a prepared query which will generate a generic plan on
the 6th execution, then fall back to a custom plan due to it pruning all
but one partition.  Master suffers from the lock table becoming
bloated after locking all partitions when planning the generic plan.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

Master:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 10207.780843 (excluding connections establishing)
tps = 10205.772688 (excluding connections establishing)
tps = 10214.896449 (excluding connections establishing)
tps = 10157.572153 (excluding connections establishing)
tps = 10147.764477 (excluding connections establishing)

Patched:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 14677.636570 (excluding connections establishing)
tps = 14661.437186 (excluding connections establishing)
tps = 14647.202877 (excluding connections establishing)
tps = 14784.165753 (excluding connections establishing)
tps = 14850.355344 (excluding connections establishing)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable_v8.patch
Description: Binary data


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-21 21:37:28 +1200, David Rowley wrote:
> select.sql:
> \set p 1
> select * from ht where a = :p
> 
> Master:
> 
> $ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 10172.035036 (excluding connections establishing)
> tps = 10192.780529 (excluding connections establishing)
> tps = 10331.306003 (excluding connections establishing)
> 
> Patched:
> 
> $ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 15080.765549 (excluding connections establishing)
> tps = 14994.404069 (excluding connections establishing)
> tps = 14982.923480 (excluding connections establishing)
> 
> That seems fine, 46% faster.
> 
> v6 is attached.
> 
> I plan to push this in a few days unless someone objects.

It does seem far less objectionable than the other case.  I hate to
throw in one more wrench into a topic finally making progress, but: Have
either of you considered just replacing the dynahash table with a
simplehash style one?  Given the obvious speed sensitivity, and the fact
that for it (in contrast to the shared lock table) no partitioning is
needed, that seems like a good thing to try.  It seems quite possible
that both the iteration and plain manipulations are going to be faster,
due to far less indirections - e.g. the iteration through the array will
just be an array walk with a known stride, far easier for the CPU to
prefetch.
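
To make that concrete, here is a rough sketch of what a simplehash
instantiation for the local lock table could look like; the macro
protocol is simplehash.h's, but the element layout, the use of
tag_hash(), and all names are assumptions rather than a worked patch:

    #include <string.h>
    #include "utils/hashutils.h"    /* tag_hash(); header varies by branch */

    typedef struct LocalLockEnt
    {
        LOCALLOCKTAG key;           /* hash key */
        char        status;         /* state field required by simplehash */
        LOCALLOCK   locallock;      /* payload */
    } LocalLockEnt;

    #define SH_PREFIX locallock
    #define SH_ELEMENT_TYPE LocalLockEnt
    #define SH_KEY_TYPE LOCALLOCKTAG
    #define SH_KEY key
    #define SH_HASH_KEY(tb, k) tag_hash(&(k), sizeof(LOCALLOCKTAG))
    #define SH_EQUAL(tb, x, y) (memcmp(&(x), &(y), sizeof(LOCALLOCKTAG)) == 0)
    #define SH_SCOPE static inline
    #define SH_DECLARE
    #define SH_DEFINE
    #include "lib/simplehash.h"

    /* Iteration then becomes a strided walk over a dense array: */
    static void
    WalkAllLocalLocks(locallock_hash *table)
    {
        locallock_iterator iter;
        LocalLockEnt *ent;

        locallock_start_iterate(table, &iter);
        while ((ent = locallock_iterate(table, &iter)) != NULL)
        {
            /* ... process ent->locallock ... */
        }
    }

As David notes elsewhere in the thread, SH_START_ITERATE still skips
empty slots; the win would come from the friendlier memory access.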

Greetings,

Andres Freund




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-23 Thread David Rowley
On Wed, 24 Jul 2019 at 15:05, David Rowley  wrote:
> To be able to reduce the threshold down again we'd need to make a
> hash_get_num_entries(LockMethodLocalHash) call before performing the
> guts of LockReleaseAll(). We could then weight that onto some running
> average counter with a weight of, say... 10, so we react to changes
> fairly quickly, but not instantly. We could then have some sort of
> logic like "rebuild the hash table if the running average is 4 times
> less than max_bucket"
>
> I've attached a spreadsheet of that idea and the algorithm we could
> use to track the running average.  Initially, I've mocked up a
> series of transactions that use 1000 locks, then at row 123 dropped
> that to 10 locks. If we assume the max_bucket is 1000, then it takes
> until row 136 for the running average to drop below the max_bucket
> count, i.e. 13 xacts. Then we'd reset back to the init size of 16. If
> the average went up again, then we'd automatically expand the table as
> we do now.  To make this work we'd need an additional call to
> hash_get_num_entries(), before we release the locks, so there is more
> overhead.

Here's a patch with this implemented. I've left a NOTICE in there to
make it easier for people to follow along at home and see when the
lock table is reset.

There will be a bit of additional overhead in the reset detection
logic over the v7 patch. Namely: an additional hash_get_num_entries()
call before releasing the locks, and more complex, floating-point
maths instead of the simpler integer maths in v7.
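
From the NOTICE lines below, the reset rule works out to roughly the
following (an illustrative reconstruction with assumed names, not the
patch itself):

    #define LOCKMETHODLOCALHASH_SHRINK_THRESHOLD 64

    static double running_avg_locks = 0.0;

    /* Called from LockReleaseAll(), before the locks are released. */
    static void
    MaybeShrinkLocalLockHash(void)
    {
        long    numLocksHeld = hash_get_num_entries(LockMethodLocalHash);
        double  threshold;

        /* running average with a weight of 10 */
        running_avg_locks += (numLocksHeld - running_avg_locks) / 10.0;

        /* rebuild if the table is ~4x larger than typical recent usage */
        threshold = Max(LOCKMETHODLOCALHASH_SHRINK_THRESHOLD,
                        running_avg_locks * 4.0);

        if (hash_get_max_bucket(LockMethodLocalHash) > threshold)
            CreateLocalLockHash();  /* back to the init size of 16 */
    }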

Here's a demo with the debug NOTICE in there to show us what's going on.

-- setup
create table a (a int) partition by range (a);
select 'create table a'||x||' partition of a for values from('||x||')
to ('||x+1||');' from generate_Series(1,1000) x;
\gexec

$ psql postgres
NOTICE:  max_bucket = 15, threshold = 64.00, running_avg_locks 0.10 Reset? No
psql (13devel)
# \o /dev/null
# select * from a where a < 100;
NOTICE:  max_bucket = 101, threshold = 64.00, running_avg_locks 10.09 Reset? Yes
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 76.324005, running_avg_locks 19.081001 Reset? Yes
# select * from a where a < 100;

A couple of needless resets there... Maybe we can get rid of those by
setting the initial running average up to something higher than 0.0.

NOTICE:  max_bucket = 99, threshold = 108.691605, running_avg_locks 27.172901 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 137.822449, running_avg_locks 34.455612 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 164.040207, running_avg_locks 41.010052 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 187.636185, running_avg_locks 46.909046 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 208.872559, running_avg_locks 52.218140 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 227.985306, running_avg_locks 56.996326 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 245.186768, running_avg_locks 61.296692 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 260.668091, running_avg_locks 65.167023 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 274.601288, running_avg_locks 68.650322 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 287.141174, running_avg_locks 71.785294 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 298.427063, running_avg_locks 74.606766 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 308.584351, running_avg_locks 77.146088 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 317.725922, running_avg_locks 79.431480 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 325.953339, running_avg_locks 81.488335 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 333.358002, running_avg_locks 83.339500 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 340.022217, running_avg_locks 85.005554 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 346.019989, running_avg_locks 86.504997 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 351.417999, running_avg_locks 87.854500 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 356.276184, running_avg_locks 89.069046 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 360.648560, running_avg_locks 90.162140 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 364.583710, running_avg_locks 91.145927 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 368.125336, running_avg_locks 92.031334 Reset? No
# select * from a 

Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-23 Thread David Rowley
Thanks for having a look at this.

On Wed, 24 Jul 2019 at 04:13, Tom Lane  wrote:
>
> David Rowley  writes:
> > I'm pretty happy with v7 now. If anyone has any objections to it,
> > please speak up very soon.  Otherwise, I plan to push it about this
> > time tomorrow.
>
> I dunno, this seems close to useless in this form.  As it stands,
> once hash_get_max_bucket has exceeded the threshold, you will
> arbitrarily reset the table 1000 transactions later (since the
> max bucket is certainly not gonna decrease otherwise).  So that's
> got two big problems:
>
> 1. In the assumed-common case where most of the transactions take
> few locks, you wasted cycles for 999 transactions.
>
> 2. You'll reset the table independently of subsequent history,
> even if the session's usage is pretty much always over the
> threshold.  Admittedly, if you do this only once per 1K
> transactions, it's probably not a horrible overhead --- but you
> can't improve point 1 without making it a bigger overhead.

This is true, but I think you might be overestimating just how much
effort is wasted with #1. We're only seeing this overhead in small
very fast to execute xacts. In the test case in [1], I was getting
about 10k TPS unpatched and about 15k patched. This means that, on
average, 1 unpatched xact takes 100 microseconds and 1 average patched
xact takes 66 microseconds, so the additional time spent doing the
hash_seq_search() must be about 34 microseconds. So we'll waste a
total of 34 *milliseconds* if we wait for 1000 xacts before we reset
the table.  With 10k TPS we're going to react to the change in 0.1
seconds.

I think you'd struggle to measure that 1 xact is taking 34
microseconds longer without running a few thousand queries. My view is
that nobody is ever going to notice that it takes 1k xacts to shrink
the table, and I've already shown that the overhead of the shrink
every 1k xacts is tiny. I mentioned 0.34% in [1] using v6. This is
likely a bit smaller in v7 due to swapping the order of the if
condition to put the less likely case first. Since the overhead of
rebuilding the table was 7% when done every xact, then it stands to
reason that this has become 0.007% doing it every 1k xacts and that the
additional overhead to make up that 0.34% is from testing if the reset
condition has been met (or noise).  That's not something we can remove
completely with any solution that resets the hash table.

> I did complain about the cost of tracking the stats proposed by
> some earlier patches, but I don't think the answer to that is to
> not track any stats at all.  The proposed hash_get_max_bucket()
> function is quite cheap, so I think it wouldn't be out of line to
> track the average value of that at transaction end over the
> session's lifespan, and reset if the current value is more than
> some-multiple of the average.
>
> The tricky part here is that if some xact kicks that up to
> say 64 entries, and we don't choose to reset, then the reading
> for subsequent transactions will be 64 even if they use very
> few locks.  So we probably need to not use a plain average,
> but account for that effect somehow.  Maybe we could look at
> how quickly the number goes up after we reset?
>
> [ thinks for awhile... ]  As a straw-man proposal, I suggest
> the following (completely untested) idea:
>
> * Make the table size threshold variable, not constant.
>
> * If, at end-of-transaction when the table is empty,
> the table bucket count exceeds the threshold, reset
> immediately; but if it's been less than K transactions
> since the last reset, increase the threshold (by say 10%).
>
> I think K can be a constant; somewhere between 10 and 100 would
> probably work.  At process start, we should act as though the last
> reset were more than K transactions ago (i.e., don't increase the
> threshold at the first reset).

I think the problem with this idea is that once the threshold has been
enlarged, there is no way to recover and work well again for workloads
that require very few locks. If we end up with some large value for
the variable threshold, there's no way to decrease it again.  All this
method stops is the needless hash table resets
if the typical case requires many locks. The only way to know if we
can reduce the threshold again is to count the locks released during
LockReleaseAll().  Some version of the patch did that, and you
objected.

> The main advantage this has over v7 is that we don't have the
> 1000-transaction delay before reset, which ISTM is giving up
> much of the benefit of the whole idea.  Also, if the session
> is consistently using lots of locks, we'll adapt to that after
> awhile and not do useless table resets.

True, but you neglected to mention the looming and critical drawback,
which pretty much makes that idea useless. All we'd need to do is give
this a workload that throws that variable threshold up high so that it
can't recover. It would be pretty simple then to show that
LockReleaseAll() is still slow with 

Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-23 Thread Tom Lane
David Rowley  writes:
> I've attached v7, which really is v6 with some comments adjusted and
> the order of the hash_get_num_entries and hash_get_max_bucket function
> calls swapped.  I think hash_get_num_entries() will return 0 most of
> the time where we're calling it, so it makes sense to put the
> condition that's less likely to be true first in the if condition.

> I'm pretty happy with v7 now. If anyone has any objections to it,
> please speak up very soon.  Otherwise, I plan to push it about this
> time tomorrow.

I dunno, this seems close to useless in this form.  As it stands,
once hash_get_max_bucket has exceeded the threshold, you will
arbitrarily reset the table 1000 transactions later (since the
max bucket is certainly not gonna decrease otherwise).  So that's
got two big problems:

1. In the assumed-common case where most of the transactions take
few locks, you wasted cycles for 999 transactions.

2. You'll reset the table independently of subsequent history,
even if the session's usage is pretty much always over the
threshold.  Admittedly, if you do this only once per 1K
transactions, it's probably not a horrible overhead --- but you
can't improve point 1 without making it a bigger overhead.

I did complain about the cost of tracking the stats proposed by
some earlier patches, but I don't think the answer to that is to
not track any stats at all.  The proposed hash_get_max_bucket()
function is quite cheap, so I think it wouldn't be out of line to
track the average value of that at transaction end over the
session's lifespan, and reset if the current value is more than
some-multiple of the average.

The tricky part here is that if some xact kicks that up to
say 64 entries, and we don't choose to reset, then the reading
for subsequent transactions will be 64 even if they use very
few locks.  So we probably need to not use a plain average,
but account for that effect somehow.  Maybe we could look at
how quickly the number goes up after we reset?

[ thinks for awhile... ]  As a straw-man proposal, I suggest
the following (completely untested) idea:

* Make the table size threshold variable, not constant.

* If, at end-of-transaction when the table is empty,
the table bucket count exceeds the threshold, reset
immediately; but if it's been less than K transactions
since the last reset, increase the threshold (by say 10%).

I think K can be a constant; somewhere between 10 and 100 would
probably work.  At process start, we should act as though the last
reset were more than K transactions ago (i.e., don't increase the
threshold at the first reset).
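
In code form, that straw-man might look something like this (completely
untested, with illustrative names and K = 50):

    #define RESET_INTERVAL_K 50     /* "somewhere between 10 and 100" */

    static long table_size_threshold = 64;  /* variable, not constant */
    static long xacts_since_reset = RESET_INTERVAL_K;  /* as if long ago */

    /* Called at end-of-transaction, when the table is empty. */
    static void
    AtEOXact_MaybeResetLocalLockHash(void)
    {
        xacts_since_reset++;

        if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
            hash_get_max_bucket(LockMethodLocalHash) > table_size_threshold)
        {
            /* resetting again too soon: raise the threshold by 10% */
            if (xacts_since_reset < RESET_INTERVAL_K)
                table_size_threshold += table_size_threshold / 10;

            CreateLocalLockHash();  /* reset immediately */
            xacts_since_reset = 0;
        }
    }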

The main advantage this has over v7 is that we don't have the
1000-transaction delay before reset, which ISTM is giving up
much of the benefit of the whole idea.  Also, if the session
is consistently using lots of locks, we'll adapt to that after
awhile and not do useless table resets.

regards, tom lane




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-23 Thread David Rowley
On Tue, 23 Jul 2019 at 15:47, Tsunakawa, Takayuki  wrote:
> OTOH, how about my original patch that is based on the local lock list?  I
> expect that it won't cause such a significant slowdown in the same test
> case.  If it's not satisfactory, then v6 is the best to commit.

I think we need to move beyond the versions that have been reviewed
and rejected. I don't think the reasons for their rejection will have
changed.

I've attached v7, which really is v6 with some comments adjusted and
the order of the hash_get_num_entries and hash_get_max_bucket function
calls swapped.  I think hash_get_num_entries() will return 0 most of
the time where we're calling it, so it makes sense to put the
condition that's less likely to be true first in the if condition.

I'm pretty happy with v7 now. If anyone has any objections to it,
please speak up very soon.  Otherwise, I plan to push it about this
time tomorrow.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable_v7.patch
Description: Binary data


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-22 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Another counter-argument to this is that there's already an
> unexplainable slowdown after you run a query which obtains a large
> number of locks in a session or use prepared statements and a
> partitioned table with the default plan_cache_mode setting. Are we not
> just righting a wrong here? Albeit, possibly 1000 queries later.
> 
> I am, of course, open to other ideas which solve the problem that v5
> has, but failing that, I don't see v6 as all that bad.  At least all
> the logic is contained in one function.  I know Tom wanted to steer
> clear of heuristics to reinitialise the table, but most of the stuff
> that was in the patch back when he complained was trying to track the
> average number of locks over the previous N transactions, and his
> concerns were voiced before I showed the 7% performance regression
> with unconditionally rebuilding the table.

I think I understood what you mean.  Sorry, I don't have a better idea.  This 
unexplainability is probably what we should accept to avoid the shocking 7% 
slowdown.

OTOH, how about my original patch that is based on the local lock list?  I
expect that it won't cause such a significant slowdown in the same test case.
If it's not satisfactory, then v6 is the best to commit.


Regards
Takayuki Tsunakawa




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-22 Thread David Rowley
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki  wrote:
>
> From: David Rowley [mailto:david.row...@2ndquadrant.com]
> > I went back to the drawing board on this and I've added some code that 
> > counts
> > the number of times we've seen the table to be oversized and just shrinks
> > the table back down on the 1000th time.  6.93% / 1000 is not all that much.
>
> I'm afraid this kind of hidden behavior would appear mysterious to users.  
> They may wonder "Why is the same query fast at first in the session (5 or 6 
> times of execution), then gets slower for a while, and gets faster again?  Is 
> there something to tune?  Am I missing something wrong with my app (e.g. how 
> to use prepared statements)?"  So I prefer v5.

Another counter-argument to this is that there's already an
unexplainable slowdown after you run a query which obtains a large
number of locks in a session or use prepared statements and a
partitioned table with the default plan_cache_mode setting. Are we not
just righting a wrong here? Albeit, possibly 1000 queries later.

I am, of course, open to other ideas which solve the problem that v5
has, but failing that, I don't see v6 as all that bad.  At least all
the logic is contained in one function.  I know Tom wanted to steer
clear of heuristics to reinitialise the table, but most of the stuff
that was in the patch back when he complained was trying to track the
average number of locks over the previous N transactions, and his
concerns were voiced before I showed the 7% performance regression
with unconditionally rebuilding the table.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-21 Thread David Rowley
On Mon, 22 Jul 2019 at 16:36, Tsunakawa, Takayuki  wrote:
>
> From: David Rowley [mailto:david.row...@2ndquadrant.com]
> > For the use case we've been measuring with partitioned tables and the
> > generic plan generation causing a sudden spike in the number of
> > obtained locks, then having plan_cache_mode = force_custom_plan will
> > cause the lock table not to become bloated. I'm not sure there's
> > anything interesting to measure there.
>
> I meant the difference between the following two cases, where the query only 
> touches one partition (e.g. SELECT ... WHERE pkey = value):
>
> * plan_cache_mode=force_custom_plan: LocalLockHash won't bloat.  The query 
> execution time is steady.
>
> * plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to 
> the creation of the generic plan.  The generic plan is not adopted because 
> its cost is high.  Later executions of the query will suffer from the bloat 
> until the 1006th execution when LocalLockHash is shrunk.

I measured this again in
https://www.postgresql.org/message-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=j-b5a2m...@mail.gmail.com
where I posted the v6 patch.  It's the final results in the email.   I
didn't measure plan_cache_mode = force_custom_plan. There'd be no lock
table bloat in that case and the additional overhead would just be
from the hash_get_num_entries() && hash_get_max_bucket() calls, which
the first results show next to no overhead for.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-21 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> For the use case we've been measuring with partitioned tables and the
> generic plan generation causing a sudden spike in the number of
> obtained locks, then having plan_cache_mode = force_custom_plan will
> cause the lock table not to become bloated. I'm not sure there's
> anything interesting to measure there.

I meant the difference between the following two cases, where the query only 
touches one partition (e.g. SELECT ... WHERE pkey = value):

* plan_cache_mode=force_custom_plan: LocalLockHash won't bloat.  The query 
execution time is steady.

* plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to the 
creation of the generic plan.  The generic plan is not adopted because its cost 
is high.  Later executions of the query will suffer from the bloat until the 
1006th execution when LocalLockHash is shrunk.


Depending on the number of transactions and what each transaction does, I 
thought the difference may or may not be noticeable.


Regards
Takayuki Tsunakawa




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-21 Thread David Rowley
On Mon, 22 Jul 2019 at 14:21, Tsunakawa, Takayuki  wrote:
>
> From: David Rowley [mailto:david.row...@2ndquadrant.com]
> > I personally don't think that's true.  The only way you'll notice the
> > LockReleaseAll() overhead is to execute very fast queries with a
> > bloated lock table.  It's pretty hard to notice that a single 0.1ms
> > query is slow. You'll need to execute thousands of them before you'll
> > be able to measure it, and once you've done that, the lock shrink code
> > will have run and the query will be performing optimally again.
>
> Maybe so.  Will the difference be noticeable between plan_cache_mode=auto 
> (default) and plan_cache_mode=force_custom_plan?

For the use case we've been measuring with partitioned tables and the
generic plan generation causing a sudden spike in the number of
obtained locks, then having plan_cache_mode = force_custom_plan will
cause the lock table not to become bloated. I'm not sure there's
anything interesting to measure there. The only additional code that
gets executed is the hash_get_num_entries() and possibly
hash_get_max_bucket. Maybe it's worth swapping the order of those
calls since most of the time the entry will be 0 and the max bucket
count threshold won't be exceeded.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-21 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I personally don't think that's true.  The only way you'll notice the
> LockReleaseAll() overhead is to execute very fast queries with a
> bloated lock table.  It's pretty hard to notice that a single 0.1ms
> query is slow. You'll need to execute thousands of them before you'll
> be able to measure it, and once you've done that, the lock shrink code
> will have run and the query will be performing optimally again.

Maybe so.  Will the difference be noticeable between plan_cache_mode=auto 
(default) and plan_cache_mode=force_custom_plan?


> I voiced my concerns with v5 and wasn't really willing to push it
> with a known performance regression of 7% in a fairly valid case. v6
> does not suffer from that.

You're right.  We may have to accept the unpredictability this hidden 
behavior presents to users as a compromise for higher throughput.


> > Where else does the extra overhead come from?
> 
> + if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
> + hash_get_max_bucket(LockMethodLocalHash) >
> + LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)
> 
> that's executed every time, not every 1000 times.

I see.  Thanks.


Regards
Takayuki Tsunakawa



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-21 Thread David Rowley
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki  wrote:
>
> From: David Rowley [mailto:david.row...@2ndquadrant.com]
> > I went back to the drawing board on this and I've added some code that 
> > counts
> > the number of times we've seen the table to be oversized and just shrinks
> > the table back down on the 1000th time.  6.93% / 1000 is not all that much.
>
> I'm afraid this kind of hidden behavior would appear mysterious to users.  
> They may wonder "Why is the same query fast at first in the session (5 or 6 
> times of execution), then gets slower for a while, and gets faster again?  Is 
> there something to tune?  Am I missing something wrong with my app (e.g. how 
> to use prepared statements)?"  So I prefer v5.

I personally don't think that's true.  The only way you'll notice the
LockReleaseAll() overhead is to execute very fast queries with a
bloated lock table.  It's pretty hard to notice that a single 0.1ms
query is slow. You'll need to execute thousands of them before you'll
be able to measure it, and once you've done that, the lock shrink code
will have run and the query will be performing optimally again.

I voiced my concerns with v5 and wasn't really willing to push it
with a known performance regression of 7% in a fairly valid case. v6
does not suffer from that.

> > Of course, not all the extra overhead might be from rebuilding the table,
> > so here's a test with the updated patch.
>
> Where else does the extra overhead come from?

+ if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)

that's executed every time, not every 1000 times.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-21 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I went back to the drawing board on this and I've added some code that counts
> the number of times we've seen the table to be oversized and just shrinks
> the table back down on the 1000th time.  6.93% / 1000 is not all that much.

I'm afraid this kind of hidden behavior would appear mysterious to users.  They 
may wonder "Why is the same query fast at first in the session (5 or 6 times of 
execution), then gets slower for a while, and gets faster again?  Is there 
something to tune?  Am I missing something wrong with my app (e.g. how to use 
prepared statements)?"  So I prefer v5.


> Of course, not all the extra overhead might be from rebuilding the table,
> so here's a test with the updated patch.

Where else does the extra overhead come from?


Regards
Takayuki Tsunakawa



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-21 Thread David Rowley
On Thu, 18 Jul 2019 at 14:53, David Rowley  wrote:
> Is anyone particularly concerned about the worst-case slowdown here
> being about 1.54%?  The best case, and arguably a more realistic case
> above, showed a 34% speedup.

I took a bit more time to test the performance on this. I thought I
might have been a bit unfair on the patch by giving it completely
empty tables to look at. It just seems too unrealistic to have a large
number of completely empty partitions. I decided to come up with a
more realistic case where there are 140 partitions but we're
performing an index scan to find a record that can exist in only 1 of
the 140 partitions.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
\gexec
create index on hp (b);
insert into hp select x,x from generate_series(1, 14) x;
analyze hp;

select3.sql: select * from hp where b = 1

Master:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2124.591367 (excluding connections establishing)
tps = 2158.754837 (excluding connections establishing)
tps = 2146.348465 (excluding connections establishing)
tps = 2148.995800 (excluding connections establishing)
tps = 2154.223600 (excluding connections establishing)

Patched:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2002.480729 (excluding connections establishing)
tps = 1997.272944 (excluding connections establishing)
tps = 1992.527079 (excluding connections establishing)
tps = 1995.789367 (excluding connections establishing)
tps = 2001.195760 (excluding connections establishing)

so it turned out it's even slower, and not by a small amount either!
Patched is 6.93% slower on average with this case :-(

I went back to the drawing board on this and I've added some code that
counts the number of times we've seen the table to be oversized and
just shrinks the table back down on the 1000th time.  6.93% / 1000 is
not all that much. Of course, not all the extra overhead might be from
rebuilding the table, so here's a test with the updated patch.

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2142.414323 (excluding connections establishing)
tps = 2139.666899 (excluding connections establishing)
tps = 2138.744789 (excluding connections establishing)
tps = 2138.300299 (excluding connections establishing)
tps = 2137.346278 (excluding connections establishing)

Just a 0.34% drop. Pretty hard to pick that out of the noise.

Testing the original case that shows the speedup:

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

Master:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 10172.035036 (excluding connections establishing)
tps = 10192.780529 (excluding connections establishing)
tps = 10331.306003 (excluding connections establishing)

Patched:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 15080.765549 (excluding connections establishing)
tps = 14994.404069 (excluding connections establishing)
tps = 14982.923480 (excluding connections establishing)

That seems fine, 46% faster.

v6 is attached.

I plan to push this in a few days unless someone objects.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable_v6.patch
Description: Binary data


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-17 Thread David Rowley
On Thu, 27 Jun 2019 at 12:59, Tsunakawa, Takayuki  wrote:
>
> From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Thank you, looks good.  I find it ready for committer (I noticed the status
> is already set to that.)

Thanks for looking.

I've just been looking at this again and I thought I'd better check
the performance of the worst case for the patch, where the hash table
is rebuilt each query.

To do this I first created a single-column, 70-partition partitioned
table ("p") and left it empty.

I then checked the performance of:

SELECT * FROM p;

Having 70 partitions means that the lock table's max bucket goes over
the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD which is set to 64 and
results in the table being rebuilt each time the query is run.

The performance was as follows:

70 partitions: LOCKMETHODLOCALHASH_SHRINK_THRESHOLD = 64

master + shrink_bloated_locallocktable_v5.patch:

ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 8427.053378 (excluding connections establishing)
tps = 8583.251821 (excluding connections establishing)
tps = 8569.587268 (excluding connections establishing)
tps = 8552.988483 (excluding connections establishing)
tps = 8527.735108 (excluding connections establishing)

master (93907478):

ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 8712.919411 (excluding connections establishing)
tps = 8760.190372 (excluding connections establishing)
tps = 8755.069470 (excluding connections establishing)
tps = 8747.389735 (excluding connections establishing)
tps = 8758.275202 (excluding connections establishing)

patched is 2.45% slower


If I increase the partition count to 140 and put the
LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up to 128, then the performance
is as follows:

master + shrink_bloated_locallocktable_v5.patch:

ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 2548.917856 (excluding connections establishing)
tps = 2561.283564 (excluding connections establishing)
tps = 2549.669870 (excluding connections establishing)
tps = 2421.971864 (excluding connections establishing)
tps = 2428.983660 (excluding connections establishing)

Master (93907478):

ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 2605.407529 (excluding connections establishing)
tps = 2600.691426 (excluding connections establishing)
tps = 2594.123983 (excluding connections establishing)
tps = 2455.745644 (excluding connections establishing)
tps = 2450.061483 (excluding connections establishing)

patched is 1.54% slower

I'd rather not put the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up any
higher than 128 since it can detract from the improvement we're trying
to make with this patch.

Now, this case of querying a partitioned table that happens to be
completely empty seems a bit unrealistic. Something more realistic
might be index scanning all partitions to find a value that only
exists in a single partition. Assuming the partitions actually have
some records, then that's going to be a more expensive query, so the
overhead of rebuilding the table will be less noticeable.

A previous version of the patch already had some heuristics to try
to only rebuild the hash table when it's likely beneficial. I'd rather
not go exploring in that area again.

Is anyone particularly concerned about the worst-case slowdown here
being about 1.54%?  The best case, and arguably a more realistic case
above, showed a 34% speedup.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-06-26 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
v5 is attached.


Thank you, looks good.  I find it ready for committer (I noticed the status is
already set to that.)


Regards
Takayuki Tsunakawa




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-06-26 Thread David Rowley
On Mon, 17 Jun 2019 at 15:05, Tsunakawa, Takayuki  wrote:
> (1)
> +#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64
>
> How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the 
> threshold value to trigger shrinkage?  Code in PostgreSQL seems to use the 
> term threshold.

That's probably better. I've renamed it to that.

> (2)
> +/* Complain if the above are not set to something sane */
> +#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
> +#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
> +#endif
>
> I don't think these are necessary, because these are fixed and not 
> configurable.  FYI, src/include/utils/memutils.h doesn't have #error to test 
> these macros.

Yeah. I was thinking it was overkill when I wrote it, but somehow
couldn't bring myself to remove it. Done now.

> (3)
> +   if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
> +   hash_get_max_bucket(LockMethodLocalHash) >
> +   LOCKMETHODLOCALHASH_SHRINK_SIZE)
> +   CreateLocalLockHash();
>
> I get an impression that Create just creates something where there's nothing. 
>  How about Init or Recreate?

Renamed to InitLocalLockHash()

v5 is attached.

Thank you for the review.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable_v5.patch
Description: Binary data


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-06-16 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I've revised the patch to add a new constant named
> LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash

Thank you, and good performance.  The patch passed make check.

I'm OK with the current patch, but I have a few comments.  Please take them as 
you see fit (I wouldn't mind if you don't.)


(1)
+#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64

How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the 
threshold value to trigger shrinkage?  Code in PostgreSQL seems to use the term 
threshold.


(2)
+/* Complain if the above are not set to something sane */
+#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
+#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
+#endif

I don't think these are necessary, because these are fixed and not 
configurable.  FYI, src/include/utils/memutils.h doesn't have #error to test 
these macros.

#define ALLOCSET_DEFAULT_MINSIZE   0
#define ALLOCSET_DEFAULT_INITSIZE  (8 * 1024)
#define ALLOCSET_DEFAULT_MAXSIZE   (8 * 1024 * 1024)


(3)
+   if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+   hash_get_max_bucket(LockMethodLocalHash) >
+   LOCKMETHODLOCALHASH_SHRINK_SIZE)
+   CreateLocalLockHash();

I get an impression that Create just creates something where there's nothing.  
How about Init or Recreate?


Regards
Takayuki Tsunakawa



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-06-14 Thread David Rowley
On Mon, 8 Apr 2019 at 04:09, Tom Lane  wrote:
> Also, I would not define "significantly bloated" as "the table has
> grown at all".  I think the threshold ought to be at least ~100
> buckets, if we're starting at 16.

I've revised the patch to add a new constant named
LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the
hash table grows over that size we shrink it back down to
LOCKMETHODLOCALHASH_INIT_SIZE, which I've kept at 16.
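
Assembled from the fragments quoted elsewhere in this thread, the
mechanism is essentially the following sketch (not the full patch):

    #define LOCKMETHODLOCALHASH_INIT_SIZE 16
    #define LOCKMETHODLOCALHASH_SHRINK_SIZE 64

    /* At the end of LockReleaseAll(), once the table has been emptied. */
    static void
    ShrinkLocalLockHashIfBloated(void)
    {
        if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
            hash_get_max_bucket(LockMethodLocalHash) >
            LOCKMETHODLOCALHASH_SHRINK_SIZE)
            CreateLocalLockHash();  /* rebuild at INIT_SIZE buckets */
    }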

I'm not opposed to setting it to 128. For this particular benchmark,
it won't make any difference as it's only going to affect something
that does not quite use 128 locks and has to work with a slightly
bloated local lock table. I think hitting 64 locks in a transaction is
a good indication that it's not a simple transaction so users are
probably unlikely to notice the small slowdown from the hash table
reinitialisation.

Since quite a bit has changed around partition planning lately, I've
taken a fresh set of benchmarks on today's master. I'm using something
very close to Amit's benchmark from upthread. I just changed the query
so we hit the same partition each time instead of a random one.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

master @ a193cbec119 + shrink_bloated_locallocktable_v4.patch:

plan_cache_mode = 'auto';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 14101.226982 (excluding connections establishing)
tps = 14034.250962 (excluding connections establishing)
tps = 14107.937755 (excluding connections establishing)

plan_cache_mode = 'force_custom_plan';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 14240.366770 (excluding connections establishing)
tps = 14272.244886 (excluding connections establishing)
tps = 14130.684315 (excluding connections establishing)

master @ a193cbec119:

plan_cache_mode = 'auto';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 10467.027666 (excluding connections establishing)
tps = 10333.700917 (excluding connections establishing)
tps = 10633.084426 (excluding connections establishing)

plan_cache_mode = 'force_custom_plan';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 13938.083272 (excluding connections establishing)
tps = 14143.241802 (excluding connections establishing)
tps = 14097.406758 (excluding connections establishing)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable_v4.patch
Description: Binary data


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-08 Thread Peter Eisentraut
On 2019-04-08 05:46, Tsunakawa, Takayuki wrote:
> I'm registering you as another author and me as a reviewer, and marking this 
> ready for committer.

Moved to next commit fest.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> It would be good to get your view on the
> shrink_bloated_locallocktable_v3.patch I worked on last night. I was
> unable to measure any overhead to solving the problem that way.

Thanks, it looks super simple and good.  My understanding of the idea behind 
your patch is:

* Transactions that touch many partitions and/or tables are a special event and 
not normal, and the hash table bloat is an unlucky accident.  So it's 
reasonable to revert the bloated hash table back to the original size.

* Repeated transactions that acquire many locks have to enlarge the hash table 
every time.  However, the overhead of hash table expansion should be hidden 
behind the various other processing (acquiring/releasing locks, reading/writing 
the relations, accessing the catalogs of those relations).


TBH, I think the linked list approach feels more intuitive because the 
resulting code shows what it wants to do (efficiently iterate over the 
acquired locks) and is based on the classic book.  But your approach seems 
to put people at ease.  So I'm OK with your patch.

I'm registering you as another author and me as a reviewer, and marking this 
ready for committer.


Regards
Takayuki Tsunakawa






Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread David Rowley
On Mon, 8 Apr 2019 at 14:54, Tsunakawa, Takayuki
 wrote:
>
> From: 'Andres Freund' [mailto:and...@anarazel.de]
> > Did you see that people measured slowdowns?
>
> Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels 
> like a somewhat extreme test case.  And that might be within noise as was 
> mentioned.
>
> If we want to remove even the noise, we may have to think of removing the 
> LocalLockHash completely.  But it doesn't seem feasible...

It would be good to get your view on the
shrink_bloated_locallocktable_v3.patch I worked on last night. I was
unable to measure any overhead to solving the problem that way.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tsunakawa, Takayuki
From: 'Andres Freund' [mailto:and...@anarazel.de]
> On 2019-04-08 02:28:12 +, Tsunakawa, Takayuki wrote:
> > I think the linked list of LOCALLOCK approach is natural, simple, and
> > good.
> 
> Did you see that people measured slowdowns?

Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like 
a somewhat extreme test case.  And that might be within noise as was mentioned.

If we want to remove even the noise, we may have to think of removing the 
LocalLockHash completely.  But it doesn't seem feasible...


Regards
Takayuki Tsunakawa









Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread 'Andres Freund'
Hi,

On 2019-04-08 02:28:12 +, Tsunakawa, Takayuki wrote:
> I think the linked list of LOCALLOCK approach is natural, simple, and
> good.

Did you see that people measured slowdowns?

Greetings,

Andres Freund




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> On the whole I don't think there's an adequate case for committing
> this patch.

From: Andres Freund [mailto:and...@anarazel.de]
> On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> > If I reduce the number of partitions in Amit's example from 8192
> > to something more real-world, like 128, I do still measure a
> > performance gain, but it's ~ 1.5% which is below what I'd consider
> > a reproducible win.  I'm accustomed to seeing changes up to 2%
> > in narrow benchmarks like this one, even when "nothing changes"
> > except unrelated code.
> 
> I'm not sure it's actually that narrow these days. With all the
> partitioning improvements happening, the numbers of locks commonly held
> are going to rise. And while 8192 partitions is maybe on the more
> extreme side, it's a workload with only a single table, and plenty of
> workloads touch more than a single partitioned table.

I would be happy if I could dismiss such a many-partitions use case as narrow 
or impractical and ignore it, but it's not narrow.  Two of our customers are 
actually requesting such usage: one uses 5,500 partitions and is trying to 
migrate from a commercial database on Linux, and the other requires 200,000 
partitions to migrate from a legacy database on a mainframe.  At first, I 
thought that many partitions indicated a bad application design, but the 
requirements sounded valid (or at least I can't insist they're bad).  
PostgreSQL is now expected to handle such huge workloads.


From: Andres Freund [mailto:and...@anarazel.de]
> I'm not sure I'm quite that concerned. For one, a good bit of that space
> was up for grabs until the recent reordering of LOCALLOCK and nobody
> complained. But more importantly, I think commonly the amount of locks
> around is fairly constrained, isn't it? We can't really have that many
> concurrently held locks, due to the shared memory space, and the size of
> a LOCALLOCK isn't that big compared to say relcache entries.  We also
> probably fairly easily could win some space back - e.g. make nLocks 32
> bits.

+1



From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I'd also point out that this is hardly the only place where we've
> seen hash_seq_search on nearly-empty hash tables become a bottleneck.
> So I'm not thrilled about attacking that with one-table-at-time patches.
> I'd rather see us do something to let hash_seq_search win across
> the board.
> 
> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

I think the linked list of LOCALLOCK approach is natural, simple, and good.  In 
Jim Gray's classic book "Transaction processing: concepts and techniques", 
we can find the following sentence in "8.4.5 Lock Manager Internal Logic."  The 
sample implementation code in the book uses a similar linked list to remember 
and release a transaction's acquired locks.

"All the locks of a transaction are kept in a list so they can be quickly found 
and released at commit or rollback."

And handling this issue with the LOCALLOCK linked list is more natural than 
with a hash table resize.  We just want to quickly find all grabbed locks, so 
we use a linked list.  A hash table is a mechanism for finding a particular 
item quickly, so it was simply the wrong structure for iterating over all 
grabbed locks.  Also, the hash table got big because some operation in the 
session needed it, and subsequent operations in the same session may need it 
again.  So shrinking the hash table would not necessarily bring relief.
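
For illustration, the list-based bookkeeping would look roughly like this 
using PostgreSQL's own ilist facility (the dlist_node member and the 
list-head name are my illustrative choices, not the actual patch):

/*
 * Fragment-level sketch.  Assumes LOCALLOCK gains a "dlist_node node"
 * member; "held_locallocks" is an illustrative name.
 */
#include "lib/ilist.h"

static dlist_head held_locallocks = DLIST_STATIC_INIT(held_locallocks);

/* When a lock is acquired for the first time: */
dlist_push_tail(&held_locallocks, &locallock->node);

/* At commit/abort, visit only the locks actually held: */
dlist_mutable_iter iter;

dlist_foreach_modify(iter, &held_locallocks)
{
    LOCALLOCK  *locallock = dlist_container(LOCALLOCK, node, iter.cur);

    dlist_delete(&locallock->node);
    /* ... release the lock itself ... */
}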


Regards
Takayuki Tsunakawa









Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread David Rowley
On Mon, 8 Apr 2019 at 04:09, Tom Lane  wrote:
> Um ... I don't see where you're destroying the old hash?

In CreateLocalLockHash.

> Also, I entirely dislike wiring in assumptions about hash_seq_search's
> private state structure here.  I think it's worth having an explicit
> entry point in dynahash.c to get the current number of buckets.

Okay. Added hash_get_max_bucket()
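
It's just a thin accessor over the max_bucket counter that dynahash's
header already maintains -- roughly:

/* In dynahash.c: expose the ID of the highest bucket in use. */
uint32
hash_get_max_bucket(HTAB *hashp)
{
    return hashp->hctl->max_bucket;
}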

> Also, I would not define "significantly bloated" as "the table has
> grown at all".  I think the threshold ought to be at least ~100
> buckets, if we're starting at 16.

I wouldn't either, and I don't think the comment says that. It says there
can be slowdowns when it's significantly bloated, and then goes on to
say that we just resize when it's bigger than standard.

> Probably we ought to try to gather some evidence to inform the
> choice of cutoff here.  Maybe instrument the regression tests to
> see how big the table typically gets?

In partition_prune.sql I see use of a bucket as high as 285 on my machine with:

drop table lp, coll_pruning, rlp, mc3p, mc2p, boolpart, rp,
coll_pruning_multi, like_op_noprune, lparted_by_int2, rparted_by_int2;

I've not added any sort of cut-off though as I benchmarked it and
surprisingly I don't see any slowdown with the worst case. So I'm
thinking there might not be any point.

alter system set plan_cache_mode = 'force_generic_plan';
create table hp (a int primary key) partition by hash (a);
select 'create table hp' || x::text || ' partition of hp for values
with (modulus 32, remainder ' || (x)::text || ');' from
generate_series(0,31) x;
\gexec

select.sql
\set p 1
select * from hp where a = :p

Master
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 11834.764309 (excluding connections establishing)
tps = 12279.212223 (excluding connections establishing)
tps = 12007.263547 (excluding connections establishing)

Patched:
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 13380.684817 (excluding connections establishing)
tps = 12790.999279 (excluding connections establishing)
tps = 12568.892558 (excluding connections establishing)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable_v3.patch
Description: Binary data


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tom Lane
David Rowley  writes:
> Okay.  Here's another version with all the average locks code removed
> that only recreates the table when it's completely empty.

Um ... I don't see where you're destroying the old hash?

Also, I entirely dislike wiring in assumptions about hash_seq_search's
private state structure here.  I think it's worth having an explicit
entry point in dynahash.c to get the current number of buckets.

Also, I would not define "significantly bloated" as "the table has
grown at all".  I think the threshold ought to be at least ~100
buckets, if we're starting at 16.

Probably we ought to try to gather some evidence to inform the
choice of cutoff here.  Maybe instrument the regression tests to
see how big the table typically gets?

regards, tom lane




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread David Rowley
On Mon, 8 Apr 2019 at 03:47, Andres Freund  wrote:
> Could you benchmark your adversarial case?

Which case?

I imagine the worst case for v2 is a query that just constantly asks
for over 16 locks. Perhaps a prepared plan, so as not to add planner
overhead.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Andres Freund
Hi,

On 2019-04-08 03:40:52 +1200, David Rowley wrote:
> On Mon, 8 Apr 2019 at 03:20, Tom Lane  wrote:
> >
> > David Rowley  writes:
> > > The reason I thought it was a good idea to track some history there
> > > was to stop the lock table constantly being shrunk back to the default
> > > size every time a simple single table query was executed.
> >
> > I think that's probably gilding the lily, considering that this whole
> > issue is pretty new.  There's no evidence that expanding the local
> > lock table is a significant drag on queries that need a lot of locks.
> 
> Okay.  Here's another version with all the average locks code removed
> that only recreates the table when it's completely empty.

Could you benchmark your adversarial case?

- Andres




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread David Rowley
On Mon, 8 Apr 2019 at 03:20, Tom Lane  wrote:
>
> David Rowley  writes:
> > The reason I thought it was a good idea to track some history there
> > was to stop the lock table constantly being shrunk back to the default
> > size every time a simple single table query was executed.
>
> I think that's probably gilding the lily, considering that this whole
> issue is pretty new.  There's no evidence that expanding the local
> lock table is a significant drag on queries that need a lot of locks.

Okay.  Here's another version with all the average locks code removed
that only recreates the table when it's completely empty.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable_v2.patch
Description: Binary data


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tom Lane
David Rowley  writes:
> On Mon, 8 Apr 2019 at 02:59, Tom Lane  wrote:
>> We *should* be using hash_get_num_entries(), but only to verify
>> that the table is empty before resetting it.  The additional bit
>> that is needed is to see whether the number of buckets is large
>> enough to justify calling the table bloated.

> The reason I thought it was a good idea to track some history there
> was to stop the lock table constantly being shrunk back to the default
> size every time a simple single table query was executed.

I think that's probably gilding the lily, considering that this whole
issue is pretty new.  There's no evidence that expanding the local
lock table is a significant drag on queries that need a lot of locks.

regards, tom lane




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread David Rowley
On Mon, 8 Apr 2019 at 02:59, Tom Lane  wrote:
>
> David Rowley  writes:
> > hash_get_num_entries() looks cheap enough to me. Can you explain why
> > you think that's too expensive?
>
> What I objected to cost-wise was counting the number of lock
> acquisitions/releases, which seems entirely beside the point.
>
> We *should* be using hash_get_num_entries(), but only to verify
> that the table is empty before resetting it.  The additional bit
> that is needed is to see whether the number of buckets is large
> enough to justify calling the table bloated.

The reason I thought it was a good idea to track some history there
was to stop the lock table constantly being shrunk back to the default
size every time a simple single table query was executed. For example,
a workload repeatedly doing:

SELECT * FROM table_with_lots_of_partitions;
SELECT * FROM non_partitioned_table;

I was worried that obtaining locks on the partitioned table would
become a little slower because it would have to expand the hash table
each time the query is executed.

> > As cheap as possible sounds good, but I'm confused at why you think
> > the table will always be empty at the end of transaction.
>
> It's conceivable that it won't be, which is why we need a test.
> I'm simply arguing that if it is not, we can just postpone de-bloating
> till it is.  Session-level locks are so rarely used that there's no
> need to sweat about that case.

That seems fair.  It would certainly simplify the patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tom Lane
David Rowley  writes:
> On Mon, 8 Apr 2019 at 02:20, Tom Lane  wrote:
>> I like the concept ... but the particular implementation, not so much.
>> It seems way overcomplicated.  In the first place, why should we
>> add code to copy entries?  Just don't do it except when the table
>> is empty.  In the second, I think we could probably have a far
>> cheaper test for how big the table is --- maybe we'd need to
>> expose some function in dynahash.c, but the right way here is just
>> to see how many buckets there are.  I don't like adding statistics
>> counting for this, because it's got basically nothing to do with
>> what the actual problem is.  (If you acquire and release one lock,
>> and do that over and over, you don't have a bloat problem no
>> matter how many times you do it.)

> hash_get_num_entries() looks cheap enough to me. Can you explain why
> you think that's too expensive?

What I objected to cost-wise was counting the number of lock
acquisitions/releases, which seems entirely beside the point.

We *should* be using hash_get_num_entries(), but only to verify
that the table is empty before resetting it.  The additional bit
that is needed is to see whether the number of buckets is large
enough to justify calling the table bloated.

> As cheap as possible sounds good, but I'm confused at why you think
> the table will always be empty at the end of transaction.

It's conceivable that it won't be, which is why we need a test.
I'm simply arguing that if it is not, we can just postpone de-bloating
till it is.  Session-level locks are so rarely used that there's no
need to sweat about that case.

regards, tom lane




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread David Rowley
On Mon, 8 Apr 2019 at 02:36, David Rowley  wrote:
> > LockMethodLocalHash is special in that it predictably goes to empty
> > at the end of every transaction, so that de-bloating at that point
> > is a workable strategy.  I think we'd probably need something more
> > robust if we were trying to fix this generally for all hash tables.
> > But if we're going to go with the one-off hack approach, we should
> > certainly try to keep that hack as simple as possible.
>
> As cheap as possible sounds good, but I'm confused at why you think
> the table will always be empty at the end of transaction. It's my
> understanding, and I see from debugging, that session-level locks remain
> in there. If I don't copy those into the new table they'll be lost.

Or we could just skip the table recreation if there are no session-level
locks.  That would require calling hash_get_num_entries() on the table
again and recreating the table only if it holds 0 locks.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread David Rowley
On Mon, 8 Apr 2019 at 02:20, Tom Lane  wrote:
> I like the concept ... but the particular implementation, not so much.
> It seems way overcomplicated.  In the first place, why should we
> add code to copy entries?  Just don't do it except when the table
> is empty.  In the second, I think we could probably have a far
> cheaper test for how big the table is --- maybe we'd need to
> expose some function in dynahash.c, but the right way here is just
> to see how many buckets there are.  I don't like adding statistics
> counting for this, because it's got basically nothing to do with
> what the actual problem is.  (If you acquire and release one lock,
> and do that over and over, you don't have a bloat problem no
> matter how many times you do it.)

hash_get_num_entries() looks cheap enough to me. Can you explain why
you think that's too expensive?

> LockMethodLocalHash is special in that it predictably goes to empty
> at the end of every transaction, so that de-bloating at that point
> is a workable strategy.  I think we'd probably need something more
> robust if we were trying to fix this generally for all hash tables.
> But if we're going to go with the one-off hack approach, we should
> certainly try to keep that hack as simple as possible.

As cheap as possible sounds good, but I'm confused at why you think
the table will always be empty at the end of transaction. It's my
understanding, and I see from debugging, that session-level locks remain
in there. If I don't copy those into the new table they'll be lost.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tom Lane
David Rowley  writes:
> On Sat, 6 Apr 2019 at 16:03, Tom Lane  wrote:
>> My own thought about how to improve this situation was just to destroy
>> and recreate LockMethodLocalHash at transaction end (or start)
>> if its size exceeded $some-value.  Leaving it permanently bloated seems
>> like possibly a bad idea, even if we get rid of all the hash_seq_searches
>> on it.

> Which I thought was an okay idea.  I think the one advantage that
> would have over making hash_seq_search() faster for large and mostly
> empty tables is that over-sized hash tables are just not very cache
> efficient, and if we don't need it to be that large then we should
> probably consider making it smaller again.

> I've had a go at implementing this and using Amit's benchmark the
> performance looks pretty good. I can't detect any slowdown for the
> general case.

I like the concept ... but the particular implementation, not so much.
It seems way overcomplicated.  In the first place, why should we
add code to copy entries?  Just don't do it except when the table
is empty.  In the second, I think we could probably have a far
cheaper test for how big the table is --- maybe we'd need to
expose some function in dynahash.c, but the right way here is just
to see how many buckets there are.  I don't like adding statistics
counting for this, because it's got basically nothing to do with
what the actual problem is.  (If you acquire and release one lock,
and do that over and over, you don't have a bloat problem no
matter how many times you do it.)

LockMethodLocalHash is special in that it predictably goes to empty
at the end of every transaction, so that de-bloating at that point
is a workable strategy.  I think we'd probably need something more
robust if we were trying to fix this generally for all hash tables.
But if we're going to go with the one-off hack approach, we should
certainly try to keep that hack as simple as possible.

regards, tom lane




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread David Rowley
On Sat, 6 Apr 2019 at 16:03, Tom Lane  wrote:
> I'd also point out that this is hardly the only place where we've
> seen hash_seq_search on nearly-empty hash tables become a bottleneck.
> So I'm not thrilled about attacking that with one-table-at-time patches.
> I'd rather see us do something to let hash_seq_search win across
> the board.

Rewinding back to mid-Feb:

You wrote:
> My own thought about how to improve this situation was just to destroy
> and recreate LockMethodLocalHash at transaction end (or start)
> if its size exceeded $some-value.  Leaving it permanently bloated seems
> like possibly a bad idea, even if we get rid of all the hash_seq_searches
> on it.

Which I thought was an okay idea.  I think the one advantage that
would have over making hash_seq_search() faster for large and mostly
empty tables is that over-sized hash tables are just not very cache
efficient, and if we don't need it to be that large then we should
probably consider making it smaller again.

I've had a go at implementing this and using Amit's benchmark the
performance looks pretty good. I can't detect any slowdown for the
general case.

master:

plan_cache_mode = auto:

$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 9373.698212 (excluding connections establishing)
tps = 9356.993148 (excluding connections establishing)
tps = 9367.579806 (excluding connections establishing)

plan_cache_mode = force_custom_plan:

$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12863.758185 (excluding connections establishing)
tps = 12787.766054 (excluding connections establishing)
tps = 12817.878940 (excluding connections establishing)

shrink_bloated_locallocktable.patch:

plan_cache_mode = auto:

$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12756.021211 (excluding connections establishing)
tps = 12800.939518 (excluding connections establishing)
tps = 12804.501977 (excluding connections establishing)

plan_cache_mode = force_custom_plan:

$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12763.448836 (excluding connections establishing)
tps = 12901.673271 (excluding connections establishing)
tps = 12856.512745 (excluding connections establishing)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable.patch
Description: Binary data


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Peter Eisentraut
On 2019-04-06 05:03, Tom Lane wrote:
> Trying a standard pgbench test case (pgbench -M prepared -S with
> one client and an -s 10 database), it seems that the patch is about
> 0.5% slower than HEAD.  Again, that's below the noise threshold,
> but it's not promising for the net effects of this patch on workloads
> that aren't specifically about large and prunable partition sets.

In my testing, I've also noticed that it seems to be slightly on the
slower side for these simple tests.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-05 Thread Tom Lane
Andres Freund  writes:
> I wonder if one approach to solve this wouldn't be to just make the
> hashtable drastically smaller. Right now we'll often have lots of
> empty entries that are 72 bytes + dynahash overhead. That's plenty of
> memory that needs to be skipped over.   I wonder if we instead should
> have an array of held locallocks, and a hashtable with {hashcode,
> offset_in_array} + custom comparator for lookups.

Well, that's not going to work all that well for retail lock releases;
you'll end up with holes in the array, maybe a lot of them.

However, it led me to think of another way we might approach the general
hashtable problem: right now, we are not making any use of the fact that
the hashtable's entries are laid out in big slabs (arrays).  What if we
tried to ensure that the live entries are allocated fairly compactly in
those arrays, and then implemented hash_seq_search as a scan over the
arrays, ignoring the hash bucket structure per se?

We'd need a way to reliably tell a live entry from a free entry, but
again, there's plenty of space for a flag bit or two.

This might perform poorly if you allocated a bunch of entries,
freed most-but-not-all, and then wanted to seqscan the remainder;
you'd end up with the same problem I complained of above that
you're iterating over an array that's mostly uninteresting.
In principle we could keep count of the live vs free entries and
dynamically decide to scan via the hash bucket structure instead of
searching the storage array when the array is too sparse; but that
might be overly complicated.
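
In hypothetical code -- nothing below exists in dynahash today except
HASHELEMENT and the ELEMENTKEY() macro -- the scan might look like:

/*
 * Suppose each HASHELEMENT carried a "live" flag (alignment padding
 * leaves room for one).  A seqscan could then walk the allocation
 * slabs directly instead of the bucket chains.  nslabs, slabs[],
 * nentries_per_slab, entrysize and visit() are all made-up names.
 */
for (int s = 0; s < nslabs; s++)
{
    char       *slab = slabs[s];

    for (int i = 0; i < nentries_per_slab; i++)
    {
        HASHELEMENT *elem = (HASHELEMENT *) (slab + i * entrysize);

        if (elem->live)         /* the hypothetical flag bit */
            visit(ELEMENTKEY(elem));
    }
}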

I haven't tried to work this out in detail, it's just a late
night brainstorm.  But, again, I'd much rather solve this in
dynahash.c than by layering some kind of hack on top of it.

regards, tom lane




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-05 Thread Andres Freund
Hi,

On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
> 
> FWIW, I tried this patch against current HEAD (959d00e9d).
> Using the test case described by Amit at
> 
> I do measure an undeniable speedup, close to 35%.
> 
> However ... I submit that that's a mighty extreme test case.
> (I had to increase max_locks_per_transaction to get it to run
> at all.)  We should not be using that sort of edge case to drive
> performance optimization choices.
> 
> If I reduce the number of partitions in Amit's example from 8192
> to something more real-world, like 128, I do still measure a
> performance gain, but it's ~ 1.5% which is below what I'd consider
> a reproducible win.  I'm accustomed to seeing changes up to 2%
> in narrow benchmarks like this one, even when "nothing changes"
> except unrelated code.

I'm not sure it's actually that narrow these days. With all the
partitioning improvements happening, the numbers of locks commonly held
are going to rise. And while 8192 partitions is maybe on the more
extreme side, it's a workload with only a single table, and plenty of
workloads touch more than a single partitioned table.


> Trying a standard pgbench test case (pgbench -M prepared -S with
> one client and an -s 10 database), it seems that the patch is about
> 0.5% slower than HEAD.  Again, that's below the noise threshold,
> but it's not promising for the net effects of this patch on workloads
> that aren't specifically about large and prunable partition sets.

Yea, that's concerning.


> I'm also fairly concerned about the effects of the patch on
> sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
> bytes, a 22% increase.  That's a lot if you're considering cases
> with many locks.

I'm not sure I'm quite that concerned. For one, a good bit of that space
was up for grabs until the recent reordering of LOCALLOCK and nobody
complained. But more importantly, I think commonly the amount of locks
around is fairly constrained, isn't it? We can't really have that many
concurrently held locks, due to the shared memory space, and the size of
a LOCALLOCK isn't that big compared to say relcache entries.  We also
probably fairly easily could win some space back - e.g. make nLocks 32
bits.

I wonder if one approach to solve this wouldn't be to just make the
hashtable drastically smaller. Right now we'll often have lots of
empty entries that are 72 bytes + dynahash overhead. That's plenty of
memory that needs to be skipped over.   I wonder if we instead should
have an array of held locallocks, and a hashtable with {hashcode,
offset_in_array} + custom comparator for lookups.  That'd mean we could
either scan the array of locallocks at release (which'd need to skip
over entries that have already been released), or we could scan the much
smaller hashtable sequentially.
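
Very roughly, and with every name below made up, the shapes could be:

typedef struct LocalLockRef
{
    uint32      hashcode;       /* lock tag hash, for fast comparison */
    int         offset;         /* index into held_locallocks[] */
} LocalLockRef;

static LOCALLOCK *held_locallocks;      /* dense array, grown as needed */
static int      num_held_locallocks;

/*
 * Lookups would go through a small hash table of LocalLockRef whose
 * custom comparator re-checks the full lock tag at
 * held_locallocks[offset].  Release could either scan the dense array,
 * skipping slots already released, or walk the much smaller hash
 * table sequentially.
 */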

I don't think the above idea is quite there, and I'm tired, but I
thought it might still be worth bringing up.


> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

Yea :(

Greetings,

Andres Freund




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-05 Thread Tom Lane
Peter Eisentraut  writes:
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

FWIW, I tried this patch against current HEAD (959d00e9d).
Using the test case described by Amit at

I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case.
(I had to increase max_locks_per_transaction to get it to run
at all.)  We should not be using that sort of edge case to drive
performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192
to something more real-world, like 128, I do still measure a
performance gain, but it's ~ 1.5% which is below what I'd consider
a reproducible win.  I'm accustomed to seeing changes up to 2%
in narrow benchmarks like this one, even when "nothing changes"
except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with
one client and an -s 10 database), it seems that the patch is about
0.5% slower than HEAD.  Again, that's below the noise threshold,
but it's not promising for the net effects of this patch on workloads
that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on
sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
bytes, a 22% increase.  That's a lot if you're considering cases
with many locks.

On the whole I don't think there's an adequate case for committing
this patch.

I'd also point out that this is hardly the only place where we've
seen hash_seq_search on nearly-empty hash tables become a bottleneck.
So I'm not thrilled about attacking that with one-table-at-time patches.
I'd rather see us do something to let hash_seq_search win across
the board.

I spent some time wondering whether we could adjust the data structure
so that all the live entries in a hashtable are linked into one chain,
but I don't quite see how to do it without adding another list link to
struct HASHELEMENT, which seems pretty expensive.

I'll sketch the idea I had, just in case it triggers a better idea
in someone else.  Assuming we are willing to add another pointer to
HASHELEMENT, use the two pointers to create a doubly-linked circular
list that includes all live entries in the hashtable, with a list
header in the hashtable's control area.  (Maybe we'd use a dlist for
this, but it's not essential.)  Require this list to be organized so
that all entries that belong to the same hash bucket are consecutive in
the list, and make each non-null hash bucket header point to the first
entry in the list for its bucket.  To allow normal searches to detect
when they've run through their bucket, add a flag to HASHELEMENT that
is set only in entries that are the first, or perhaps last, of their
bucket (so you'd detect end-of-bucket by checking the flag instead of
testing for a null pointer).  Adding a bool field is free due to
alignment considerations, at least on 64-bit machines.  Given this,
I think normal hash operations are more-or-less the same cost as
before, while hash_seq_search just has to follow the circular list.
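
In struct form, the sketch amounts to something like this (the two new
fields and the flag are illustrative names; only link and hashvalue
exist today):

typedef struct HASHELEMENT
{
    struct HASHELEMENT *link;         /* existing: bucket chain link */
    uint32      hashvalue;            /* existing: entry's hash value */
    struct HASHELEMENT *all_prev;     /* new: previous live entry overall */
    struct HASHELEMENT *all_next;     /* new: next live entry overall */
    bool        bucket_boundary;      /* new: first (or last) of its bucket */
} HASHELEMENT;

/*
 * hash_seq_search() would then just follow all_next from the list
 * header in the control area; a bucket search stops when it sees the
 * boundary flag instead of testing for a NULL link.
 */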

I tried to figure out how to do the same thing with a singly-linked
instead of doubly-linked list, but it doesn't quite work: if you need
to remove the first element of a bucket, you have no cheap way to find
its predecessor in the overall list (which belongs to some other
bucket, but you don't know which one).  Maybe we could just mark such
entries dead (there's plenty of room for another flag bit) and plan
to clean them up later?  But it's not clear how to ensure that they'd
get cleaned up in any sort of timely fashion.

Another issue is that probably none of this works for the partitioned
hash tables we use for some of the larger shared-memory hashes.  But
I'm not sure we care about hash_seq_search for those, so maybe we just
say those are a different data structure.

regards, tom lane




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-05 Thread Peter Eisentraut
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
> From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
>> Fixed.
> 
> Rebased on HEAD.

Do you need the dlist_foreach_modify() calls?  You are not actually
modifying the list structure.
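
For reference, the difference is only that dlist_foreach_modify() caches the
next node before the loop body runs, which matters just when the body deletes
the current element.  A sketch, with illustrative list and member names:

dlist_mutable_iter miter;

dlist_foreach_modify(miter, &locallock_list)
    dlist_delete(miter.cur);            /* safe: next node already saved */

/* A read-only walk can use the plain iterator: */
dlist_iter  iter;

dlist_foreach(iter, &locallock_list)
    examine(dlist_container(LOCALLOCK, node, iter.cur));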

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Imai, Yoshikazu
On Fri, Apr 5, 2019 at 0:05 AM, Tsunakawa, Takayuki wrote:
> From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
> 
> That's strange...  Peter, Imai-san, can you compare your test procedures?

Just to make sure, I've described my test procedures in detail.

I install and setup HEAD and patched as follows.

[HEAD(413ccaa)]
(git pull)
./configure --prefix=/usr/local/pgsql-dev --enable-depend
make clean
make

make install

su postgres
export PATH=/usr/local/pgsql-dev/bin:$PATH
initdb -D /var/lib/pgsql/data-dev
vi /var/lib/pgsql/data-dev/postgresql.conf

port = 44201
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

pg_ctl -D /var/lib/pgsql/data-dev start



[HEAD(413ccaa) + patch]
(git pull)
patch -u -p1 < 0002.patch
./configure --prefix=/usr/local/pgsql-locallock --enable-depend
make clean
make

make install

su postgres
export PATH=/usr/local/pgsql-locallock/bin:$PATH
initdb -D /var/lib/pgsql/data-locallock
vi /var/lib/pgsql/data-locallock/postgresql.conf

port = 44301
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

pg_ctl -D /var/lib/pgsql/data-locallock start


And I tested as follows.

(creating partitioned table for port 44201)
(creating partitioned table for port 44301)
(creating select4096.sql)
for i in `seq 1 5`; do
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44201 | grep including;
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44301 | grep including;
done
tps = 8146.039546 (including connections establishing)
tps = 9021.340872 (including connections establishing)
tps = 8011.186017 (including connections establishing)
tps = 8926.191054 (including connections establishing)
tps = 8006.769690 (including connections establishing)
tps = 9028.716806 (including connections establishing)
tps = 8057.709961 (including connections establishing)
tps = 9017.713714 (including connections establishing)
tps = 7956.332863 (including connections establishing)
tps = 9126.650533 (including connections establishing)


Thanks
--
Yoshikazu Imai


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Tsunakawa, Takayuki
Hi Amit-san, Imai-san,

From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> I was able to detect it as follows.
> plan_cache_mode = auto
> 
> HEAD: 1915 tps
> Patched: 2394 tps
> 
> plan_cache_mode = custom (non-problematic: generic plan is never created)
> 
> HEAD: 2402 tps
> Patched: 2393 tps

Thanks a lot for very quick confirmation.  I'm relieved to still see the good 
results.


Regards
Takayuki Tsunakawa



RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Imai, Yoshikazu
On Fri, Apr 5, 2019 at 1:31 AM, Amit Langote wrote:
> On 2019/04/05 5:42, Peter Eisentraut wrote:
> > On 2019-04-04 06:58, Amit Langote wrote:
> >> Also, since the "speed up partition planning" patch went in
> >> (428b260f8), it might be possible to see the performance boost even
> >> with the partitioning example you cited upthread.
> >
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
> 
> I was able to detect it as follows.
> 
> * partitioned table setup:
> 
> $ cat ht.sql
> drop table ht cascade;
> create table ht (a int primary key, b int, c int) partition by hash (a);
> select 'create table ht' || x::text || ' partition of ht for values with
> (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,
> 8191) x;
> \gexec
> 
> * pgbench script:
> 
> $ cat select.sql
> \set param random(1, 8192)
> select * from ht where a = :param
> 
> * pgbench (5 minute run with -M prepared)
> 
> pgbench -n -M prepared -T 300 -f select.sql
> 
> * tps:
> 
> plan_cache_mode = auto
> 
> HEAD: 1915 tps
> Patched: 2394 tps
> 
> plan_cache_mode = custom (non-problematic: generic plan is never created)
> 
> HEAD: 2402 tps
> Patched: 2393 tps

Amit-san, thanks for testing this.

I also re-ran my tests(3/19) with HEAD(413ccaa) and HEAD(413ccaa) + patched, 
and I can still detect the performance difference with plan_cache_mode = auto.

Thanks
--
Yoshikazu Imai 



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Amit Langote
On 2019/04/05 5:42, Peter Eisentraut wrote:
> On 2019-04-04 06:58, Amit Langote wrote:
>> Also, since the "speed up partition planning" patch went in (428b260f8),
>> it might be possible to see the performance boost even with the
>> partitioning example you cited upthread.
> 
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

I was able to detect it as follows.

* partitioned table setup:

$ cat ht.sql
drop table ht cascade;
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with
(modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,
8191) x;
\gexec

* pgbench script:

$ cat select.sql
\set param random(1, 8192)
select * from ht where a = :param

* pgbench (5 minute run with -M prepared)

pgbench -n -M prepared -T 300 -f select.sql

* tps:

plan_cache_mode = auto

   HEAD: 1915 tps
Patched: 2394 tps

plan_cache_mode = custom (non-problematic: generic plan is never created)

   HEAD: 2402 tps
Patched: 2393 tps

Thanks,
Amit





RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Tsunakawa, Takayuki
Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

That's strange...  Peter, Imai-san, can you compare your test procedures?

Peter, can you check and see the performance improvement with David's method on 
March 26 instead?


Regards
Takayuki Tsunakawa



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Peter Eisentraut
On 2019-04-04 06:58, Amit Langote wrote:
> Also, since the "speed up partition planning" patch went in (428b260f8),
> it might be possible to see the performance boost even with the
> partitioning example you cited upthread.

I can't detect any performance improvement with the patch applied to
current master, using the test case from Yoshikazu Imai (2019-03-19).

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-03 Thread Amit Langote
Hi,

On 2019/04/04 13:37, Tsunakawa, Takayuki wrote:
> Hi Peter,
> 
> From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
>> I did a bit of performance testing, both a plain pgbench and the
>> suggested test case with 4096 partitions.  I can't detect any
>> performance improvements.  In fact, within the noise, it tends to be
>> just a bit on the slower side.
>>
>> So I'd like to kick it back to the patch submitter now and ask for more
>> justification and performance analysis.
>>
>> Perhaps "speeding up planning with partitions" needs to be accepted first?
> 
> David kindly showed how to demonstrate the performance improvement on March 
> 26, so I changed the status to needs review.  I'd appreciate it if you could 
> continue the final check.

Also, since the "speed up partition planning" patch went in (428b260f8),
it might be possible to see the performance boost even with the
partitioning example you cited upthread.

Thanks,
Amit





RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-03 Thread Tsunakawa, Takayuki
Hi Peter,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> I did a bit of performance testing, both a plain pgbench and the
> suggested test case with 4096 partitions.  I can't detect any
> performance improvements.  In fact, within the noise, it tends to be
> just a bit on the slower side.
> 
> So I'd like to kick it back to the patch submitter now and ask for more
> justification and performance analysis.
> 
> Perhaps "speeding up planning with partitions" needs to be accepted first?

David kindly showed how to demonstrate the performance improvement on March 26, 
so I changed the status to needs review.  I'd appreciate it if you could 
continue the final check.


Regards
Takayuki Tsunakawa




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-26 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Here a benchmark doing that using pgbench's script weight feature.

Wow, I didn't know that pgbench has evolved to have such a convenient feature.  
Thanks for telling me how to utilize it in testing.  PostgreSQL is cool!


Regards
Takayuki Tsunakawa




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-26 Thread Tsunakawa, Takayuki
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> My understanding of what David wrote is that the slowness of a bloated hash
> table is hard to notice, because planning itself is pretty slow.  With the
> "speeding up planning with partitions" patch, planning becomes quite fast,
> so the bloated hash table overhead, and hence your patch's benefit, is easier
> to notice.  This patch is clearly helpful, but it's just hard to notice
> when the other big bottleneck is standing in the way.

Ah, I see.  I failed to recognize the simple fact that without your patch, 
EXECUTE on a table with many partitions is slow due to the custom planning time 
proportional to the number of partitions.  Thanks for waking up my sleeping 
head!


Regards
Takayuki Tsunakawa



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-26 Thread David Rowley
On Tue, 26 Mar 2019 at 21:55, David Rowley  wrote:
>
> On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki
>  wrote:
> > Thank you David for explaining.  Although I may not understand the effect 
> > of the "speeding up planning with partitions" patch, this patch takes effect 
> > even without it.  That is, perform the following in the same session:
> >
> > 1. SELECT count(*) FROM table; on a table with many partitions.  That 
> > bloats the LocalLockHash.
> > 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
> > 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate 
> > transaction.  Without the patch, each transaction's LockReleaseAll() has to 
> > scan the bloated large hash table.
>
> Oh. I think I see what you're saying.  Really the table in #2 would
> have to be some completely different table that's not partitioned. I
> think in that case it should make a difference.

Here's a benchmark doing that, using pgbench's script weight feature.

I've set this up so the query that hits the partitioned table runs
once for every 10k times the other script runs.  I picked that number
so the lock table was expanded fairly early on in the benchmark.

setup:
create table t1 (a int primary key);
create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 4096, remainder ' || x || ');' from generate_series(0,4095)
x;
\gexec

hp.sql
select count(*) from hp;

t1.sql
\set p 1
select a from t1 where a = :p;

Master = c8c885b7a5

Master:
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@1 postgres
SQL script 2: t1.sql
 - 1057306 transactions (100.0% of total, tps = 17621.756473)
 - 1081905 transactions (100.0% of total, tps = 18021.449914)
 - 1122420 transactions (100.0% of total, tps = 18690.804699)

Master + 0002-speed-up-LOCALLOCK-scan.patch

$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@1 postgres
SQL script 2: t1.sql
 - 1277014 transactions (100.0% of total, tps = 21283.551615)
 - 1184052 transactions (100.0% of total, tps = 19734.185872)
 - 1188523 transactions (100.0% of total, tps = 19785.835662)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-26 Thread David Rowley
On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki
 wrote:
> Thank you David for explaining.  Although I may not understand the effect of 
> the "speeding up planning with partitions" patch, this patch takes effect even 
> without it.  That is, perform the following in the same session:
>
> 1. SELECT count(*) FROM table; on a table with many partitions.  That bloats 
> the LocalLockHash.
> 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
> 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate 
> transaction.  Without the patch, each transaction's LockReleaseAll() has to 
> scan the bloated large hash table.

Oh. I think I see what you're saying.  Really the table in #2 would
have to be some completely different table that's not partitioned. I
think in that case it should make a difference.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-26 Thread Amit Langote
Tsunakawa-san,

On 2019/03/26 17:21, Tsunakawa, Takayuki wrote:
> From: David Rowley [mailto:david.row...@2ndquadrant.com]
>> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
>>  wrote:
>>> Perhaps "speeding up planning with partitions" needs to be accepted first?
>>
>> Yeah, I think it likely will require that patch to be able to measure
>> the gains from this patch.
>>
>> If planning a SELECT to a partitioned table with a large number of
>> partitions using PREPAREd statements, when we attempt the generic plan
>> on the 6th execution, it does cause the local lock table to expand to
>> fit all the locks for each partition. This does cause the
>> LockReleaseAll() to become slow due to the hash_seq_search having to
>> skip over many empty buckets.   Since generating a custom plan for a
>> partitioned table with many partitions is still slow in master, then I
>> very much imagine you'll struggle to see the gains brought by this
>> patch.
> 
> Thank you David for explaining.  Although I may not understand the effect of 
> "speeding up planning with partitions" patch, this patch  takes effect even 
> without it.  That is, perform the following in the same session:
> 
> 1. SELECT count(*) FROM table; on a table with many partitions.  That bloats 
> the LocalLockHash.
> 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
> 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate 
> transaction.  Without the patch, each transaction's LockReleaseAll() has to 
> scan the bloated large hash table.

My understanding of what David wrote is that the slowness of a bloated hash
table is hard to notice, because planning itself is pretty slow.  With the
"speeding up planning with partitions" patch, planning becomes quite fast,
so the bloated hash table overhead, and hence your patch's benefit, is easier
to notice.  This patch is clearly helpful, but it's just hard to notice
when the other big bottleneck is standing in the way.

Thanks,
Amit




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-26 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
>  wrote:
> > Perhaps "speeding up planning with partitions" needs to be accepted first?
> 
> Yeah, I think it likely will require that patch to be able to measure
> the gains from this patch.
> 
> If planning a SELECT to a partitioned table with a large number of
> partitions using PREPAREd statements, when we attempt the generic plan
> on the 6th execution, it does cause the local lock table to expand to
> fit all the locks for each partition. This does cause the
> LockReleaseAll() to become slow due to the hash_seq_search having to
> skip over many empty buckets.   Since generating a custom plan for a
> partitioned table with many partitions is still slow in master, then I
> very much imagine you'll struggle to see the gains brought by this
> patch.

Thank you David for explaining.  Although I may not understand the effect of 
the "speeding up planning with partitions" patch, this patch takes effect even 
without it.  That is, perform the following in the same session:

1. SELECT count(*) FROM table; on a table with many partitions.  That bloats 
the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate 
transaction.  Without the patch, each transaction's LockReleaseAll() has to 
scan the bloated large hash table.


Regards
Takayuki Tsunakawa




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-25 Thread David Rowley
On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
 wrote:
> I did a bit of performance testing, both a plain pgbench and the
> suggested test case with 4096 partitions.  I can't detect any
> performance improvements.  In fact, within the noise, it tends to be
> just a bit on the slower side.
>
> So I'd like to kick it back to the patch submitter now and ask for more
> justification and performance analysis.
>
> Perhaps "speeding up planning with partitions" needs to be accepted first?

Yeah, I think it likely will require that patch to be able to measure
the gains from this patch.

If planning a SELECT to a partitioned table with a large number of
partitions using PREPAREd statements, when we attempt the generic plan
on the 6th execution, it does cause the local lock table to expand to
fit all the locks for each partition. This does cause the
LockReleaseAll() to become slow due to the hash_seq_search having to
skip over many empty buckets.   Since generating a custom plan for a
partitioned table with many partitions is still slow in master, then I
very much imagine you'll struggle to see the gains brought by this
patch.

I did a quick benchmark too and couldn't measure anything:

create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 4096, remainder ' || x || ');' from generate_series(0,4095)
x;

bench.sql
\set p_a 13315
select * from hp where a = :p_a;

Master:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 31.844468 (excluding connections establishing)
tps = 32.950154 (excluding connections establishing)
tps = 31.534761 (excluding connections establishing)

Patched:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 30.099133 (excluding connections establishing)
tps = 32.157328 (excluding connections establishing)
tps = 32.329884 (excluding connections establishing)

The situation won't be any better with plan_cache_mode =
force_generic_plan either. In this case, we'll only plan once but
we'll also have to obtain and release a lock for each partition for
each execution of the prepared statement. LockReleaseAll() is going to
be slow in that case because it actually has to release a lot of
locks.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-25 Thread Peter Eisentraut
On 2019-03-19 16:38, Peter Eisentraut wrote:
> On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
>> From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
>>> Fixed.
>>
>> Rebased on HEAD.
> 
> I have committed the first patch that reorganizes the struct.  I'll have
> to spend some time evaluating the performance impact of the second
> patch, but it seems OK in principle.  Performance tests from others welcome.

I did a bit of performance testing, both a plain pgbench and the
suggested test case with 4096 partitions.  I can't detect any
performance improvements.  In fact, within the noise, it tends to be
just a bit on the slower side.

So I'd like to kick it back to the patch submitter now and ask for more
justification and performance analysis.

Perhaps "speeding up planning with partitions" needs to be accepted first?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Peter Eisentraut
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
> From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
>> Fixed.
> 
> Rebased on HEAD.

I have committed the first patch that reorganizes the struct.  I'll have
to spend some time evaluating the performance impact of the second
patch, but it seems OK in principle.  Performance tests from others welcome.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Tsunakawa, Takayuki
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> Fixed.

Rebased on HEAD.


Regards
Takayuki Tsunakawa



0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch
Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch


0002-speed-up-LOCALLOCK-scan.patch
Description: 0002-speed-up-LOCALLOCK-scan.patch


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Imai, Yoshikazu
Hi Tsunakawa-san, Peter

On Tue, Mar 19, 2019 at 7:53 AM, Tsunakawa, Takayuki wrote:
> From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> > You posted a link to some performance numbers, but I didn't see the
> > test setup explained there.  I'd like to get some more information on
> > the impact of this.  Is there an effect with 100 tables, or do you
> > need 10?
> 
> Imai-san, can you tell us the test setup?

I believe I used this test setup[1].

I tested again with those settings, using prepared statements.
I used Tsunakawa-san's locallock patch[2] (which no longer applied to current 
master, so I fixed it) and Amit's v32 patch for speeding up the planner[3]. 

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioning table definitions(with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);

\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[pgbench(with 4096 partitions)]
pgbench -n -f select4096.sql -T 60 -M prepared

[results (TPS)]
        master  locallock  v32    v32+locallock
        ------  ---------  -----  -------------
auto    21.9    22.9       6,834  7,355
custom  19.7    20.0       7,415  7,252


[1] 
https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51256276%40g01jpexmbkw24
[2] 
https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FBDFA00%40G01JPEXMBYT05
[3] 
https://www.postgresql.org/message-id/9feacaf6-ddb3-96dd-5b98-df5e927b1439%40lab.ntt.co.jp

--
Yoshikazu Imai



RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Tsunakawa, Takayuki
Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> Your changes in LOCALLOCK still refer to PGPROC, from your first version
> of the patch.
> 
> I think the reordering of struct members could be done as a separate
> preliminary patch.
> 
> Some more documentation in the comment before dlist_head LocalLocks to
> explain this whole mechanism would be nice.

Fixed.


> You posted a link to some performance numbers, but I didn't see the test
> setup explained there.  I'd like to get some more information on the
> impact of this.  Is there an effect with 100 tables, or do you need 10?

Imai-san, can you tell us the test setup?


Regards
Takayuki Tsunakawa



0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch
Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch


0002-speed-up-LOCALLOCK-scan.patch
Description: 0002-speed-up-LOCALLOCK-scan.patch


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-18 Thread Peter Eisentraut
On 2019-02-20 07:20, Tsunakawa, Takayuki wrote:
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
>> Hm.  Putting a list header for a purely-local data structure into shared
>> memory seems quite ugly.  Isn't there a better place to keep that?
> 
> Agreed.  I put it in the global variable.

I think there is agreement on the principles of this patch.  Perhaps it
could be polished a bit.

Your changes in LOCALLOCK still refer to PGPROC, from your first version
of the patch.

I think the reordering of struct members could be done as a separate
preliminary patch.

Some more documentation in the comment before dlist_head LocalLocks to
explain this whole mechanism would be nice.

You posted a link to some performance numbers, but I didn't see the test
setup explained there.  I'd like to get some more information on the
impact of this.  Is there an effect with 100 tables, or do you need 10?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-19 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Hm.  Putting a list header for a purely-local data structure into shared
> memory seems quite ugly.  Isn't there a better place to keep that?

Agreed.  I put it in the global variable.


> Do we really want a dlist here at all?  I'm concerned that bloating
> LOCALLOCK will cost us when there are many locks involved.  This patch
> increases the size of LOCALLOCK by 25% if I counted right, which does
> not seem like a negligible penalty.

To delete a LOCALLOCK in RemoveLocalLock(), we need a dlist: an slist would
require the list iterator to be passed in from callers.
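
For concreteness, here is a minimal sketch of that difference using
PostgreSQL's lib/ilist.h; the "node" and "snode" member names are assumed
for illustration, not real LOCALLOCK fields.

#include "lib/ilist.h"

/* With a dlist, the entry alone suffices: unlinking is O(1). */
static void
remove_local_lock_dlist(LOCALLOCK *locallock)
{
    dlist_delete(&locallock->node);     /* no list head or iterator needed */
}

/*
 * With an slist there is no back-pointer to the predecessor, so deletion
 * needs either the head -- slist_delete() walks the list, O(n) -- or must
 * happen inside an iteration via slist_delete_current().
 */
static void
remove_local_lock_slist(slist_head *head, LOCALLOCK *locallock)
{
    slist_delete(head, &locallock->snode);  /* linear search for predecessor */
}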


From: Andres Freund [mailto:and...@anarazel.de]
> Sure, we should do that. I don't buy the "illogical" bit, just moving
> hashcode up to after tag isn't more or less logical, and saves most of
> the padding, and moving the booleans to the end isn't better/worse
> either.
> 
> I don't find

Thanks, I've done it.  


From: Simon Riggs [mailto:si...@2ndquadrant.com]
> Can we use our knowledge of the structure of locks, i.e. partition locks
> are all children of the partitioned table, to do a better job?

I couldn't come up with an idea.


Regards
Takayuki Tsunakawa



faster-locallock-scan_v2.patch
Description: faster-locallock-scan_v2.patch


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-19 Thread Tsunakawa, Takayuki
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
> > Imai-san confirmed performance improvement with this patch:
> >
> > https://commitfest.postgresql.org/22/1993/
> >
> 
> Can you quantify the effects? That is, how much slower/faster does it get?

Ugh, sorry, I posted the wrong URL.  The correct page is:

https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512787EC%40g01jpexmbkw24

The quoted figures are:

[v20 + faster-locallock-scan.patch]
auto:   9,069 TPS
custom: 9,015 TPS

[v20]
auto:   8,037 TPS
custom: 8,933 TPS


In the original problematic case, plan_cache_mode = auto (the default), we see
about a 13% improvement.


Regards
Takayuki Tsunakawa





Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-19 Thread Tomas Vondra



On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
> ...
> 
> This problem was uncovered while evaluating partitioning performance.
> When the application PREPAREs a statement once and then
> EXECUTE-COMMIT repeatedly, the server creates a generic plan on the
> 6th EXECUTE.  Unfortunately, creation of the generic plan of
> UPDATE/DELETE currently accesses all partitions of the target table
> (this itself needs improvement), expanding the LOCALLOCK hash table.
> As a result, 7th and later EXECUTEs get slower.
> 
> Imai-san confirmed performance improvement with this patch:
> 
> https://commitfest.postgresql.org/22/1993/
> 

Can you quantify the effects? That is, how much slower/faster does it get?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-19 Thread Simon Riggs
On Tue, 19 Feb 2019 at 00:20, Andres Freund  wrote:


> On 2019-02-18 19:13:31 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
> > >> Mmm ... AIUI, the patches currently proposed can only help for what
> > >> David called "point lookup" queries.  There are still going to be
> > >> queries that scan a large proportion of a partition tree, so if you've
> > >> got tons of partitions, you'll be concerned about this sort of thing.
>


> > I think what Maumau-san is on about here is that not only does your
> > $giant-query take a long time, but it has a permanent negative effect
> > on all subsequent transactions in the session.  That seems worth
> > doing something about.
>
> Ah, yes, that makes sense.  I'm inclined to think however that the
> original approach presented in this thread is better than the
> reset-the-whole-hashtable approach.
>

If it were just many tables, then blowing away the hash table would work fine.

The main issue seems to be with partitioning, not with the general case of
many tables.  For that case, resetting the whole hash table seems like too much.

Can we use our knowledge of the structure of locks, i.e. partition locks
are all children of the partitioned table, to do a better job?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Andres Freund
Hi,

On 2019-02-18 20:29:29 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > but it's smaller (although there's plenty of trailing space).
> 
> I think you're supposing that these things are independently palloc'd, but
> they aren't --- dynahash lays them out in arrays without palloc padding.

I don't think that matters, given that the trailing six bytes are
included in sizeof() (and have to be, to guarantee suitable alignment in
arrays etc).

Greetings,

Andres Freund



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Tom Lane
Andres Freund  writes:
> On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
>> Yeah, but if we want to rearrange the members into an illogical order
>> to save some space, we should do that independently of this patch ---

> Sure, we should do that. I don't buy the "illogical" bit, just moving
> hashcode up to after tag isn't more or less logical, and saves most of
> the padding, and moving the booleans to the end isn't better/worse
> either.

I hadn't looked at the details closely, but if we can squeeze out the
padding space without any loss of intelligibility, sure let's do so.
I still say that's independent of whether to adopt this patch though.

> but it's smaller (although there's plenty of trailing space).

I think you're supposing that these things are independently palloc'd, but
they aren't --- dynahash lays them out in arrays without palloc padding.

> IDK, we, including you, very often make largely independent improvements
> to make the cost of something else more palatable. Why's that not OK
> here?

When we do that, we aren't normally talking about overheads as high as
25% (even more, if it's measured as I think it ought to be).  What I'm
concerned about is that the patch is being advocated for cases where
there are lots of LOCALLOCK entries --- which is exactly where the
space overhead is going to hurt the most.

> Especially because we're not comparing to an alternative where no
> cost is added, keeping track of e.g. a running average of the hashtable
> size isn't free either; nor does it help in the intermittent cases.

What I was hoping for --- though perhaps it's not achievable --- was
statistical overhead amounting to just a few more instructions per
transaction.  Adding dlist linking would add more instructions per
hashtable entry/removal, which seems like it'd be a substantially
bigger time penalty.  As for the intermittent-usage issue, that largely
depends on the details of the when-to-reset heuristic, which we don't
have a concrete design for yet.  But I could certainly imagine it waiting
for a few transactions before deciding to chomp.

Anyway, I'm not trying to veto the patch in this form, just suggesting
that there are alternatives worth thinking about.

regards, tom lane



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Andres Freund
Hi,

On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
> Yeah, but if we want to rearrange the members into an illogical order
> to save some space, we should do that independently of this patch ---

Sure, we should do that. I don't buy the "illogical" bit, just moving
hashcode up to after tag isn't more or less logical, and saves most of
the padding, and moving the booleans to the end isn't better/worse
either.

You always bring up that argument. While I agree that sometimes the
optimal ordering can be less natural, I think most of the time that
argument vastly overstates how intelligent the original ordering was.
Often new members were just added incrementally without consideration
for padding, or the attention paid to padding dates from 32-bit times.

I don't find

struct LOCALLOCK {
LOCALLOCKTAG   tag;  /* 020 */
uint32 hashcode; /*20 4 */
LOCK * lock; /*24 8 */
PROCLOCK * proclock; /*32 8 */
int64  nLocks;   /*40 8 */
intnumLockOwners;/*48 4 */
intmaxLockOwners;/*52 4 */
LOCALLOCKOWNER *   lockOwners;   /*56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
_Bool  holdsStrongLockCount; /*64 1 */
_Bool  lockCleared;  /*65 1 */

/* size: 72, cachelines: 2, members: 10 */
/* padding: 6 */
/* last cacheline: 8 bytes */
};

less clear than

struct LOCALLOCK {
LOCALLOCKTAG   tag;  /* 020 */

/* XXX 4 bytes hole, try to pack */

LOCK * lock; /*24 8 */
PROCLOCK * proclock; /*32 8 */
uint32 hashcode; /*40 4 */

/* XXX 4 bytes hole, try to pack */

int64  nLocks;   /*48 8 */
_Bool  holdsStrongLockCount; /*56 1 */
_Bool  lockCleared;  /*57 1 */

/* XXX 2 bytes hole, try to pack */

intnumLockOwners;/*60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
intmaxLockOwners;/*64 4 */

/* XXX 4 bytes hole, try to pack */

LOCALLOCKOWNER *   lockOwners;   /*72 8 */

/* size: 80, cachelines: 2, members: 10 */
/* sum members: 66, holes: 4, sum holes: 14 */
/* last cacheline: 16 bytes */
};

but it's smaller (although there's plenty of trailing space).


> and then the overhead of this patch would be even worse than 25%.

IDK, we, including you, very often make largely independent improvements
to make the cost of something else more palatable. Why's that not OK
here?  Especially because we're not comparing to an alternative where no
cost is added: keeping track of e.g. a running average of the hashtable
size isn't free either, nor does it help in the intermittent cases.

- Andres



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Tom Lane
Andres Freund  writes:
> On 2019-02-18 18:42:32 -0500, Tom Lane wrote:
>> Do we really want a dlist here at all?  I'm concerned that bloating
>> LOCALLOCK will cost us when there are many locks involved.  This patch
>> increases the size of LOCALLOCK by 25% if I counted right, which does
>> not seem like a negligible penalty.

> It's currently [ 80 bytes with several padding holes ]
> seems we could trivially squeeze most of the bytes for a dlist node out
> of padding.

Yeah, but if we want to rearrange the members into an illogical order
to save some space, we should do that independently of this patch ---
and then the overhead of this patch would be even worse than 25%.

regards, tom lane



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Andres Freund
Hi,

On 2019-02-18 19:13:31 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
> >> Mmm ... AIUI, the patches currently proposed can only help for what
> >> David called "point lookup" queries.  There are still going to be
> >> queries that scan a large proportion of a partition tree, so if you've
> >> got tons of partitions, you'll be concerned about this sort of thing.
> 
> > Agreed - but it seems not unlikely that for those the rest of the
> > planner / executor overhead will entirely swamp any improvement we could
> > make here. If I understand correctly the benchmarks here were made with
> > "point" update and select queries, although the reference in the first
> > post in this thread is a bit vague.
> 
> I think what Maumau-san is on about here is that not only does your
> $giant-query take a long time, but it has a permanent negative effect
> on all subsequent transactions in the session.  That seems worth
> doing something about.

Ah, yes, that makes sense.  I'm inclined to think however that the
original approach presented in this thread is better than the
reset-the-whole-hashtable approach. Because:


> I think David's on the right track --- keep some kind of moving average of
> the LOCALLOCK table size for each transaction, and nuke it if it exceeds
> some multiple of the recent average.  Not sure offhand about how to get
> the data cheaply --- it might not be sufficient to look at transaction
> end, if we release LOCALLOCK entries before that (but do we?)

Seems too complicated for my taste. And it doesn't solve the issue of
having some transactions with few locks (say because the plan can be
nicely pruned) interspersed with transactions where a lot of locks are
acquired.

Greetings,

Andres Freund



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Andres Freund
Hi,

On 2019-02-18 18:42:32 -0500, Tom Lane wrote:
> "Tsunakawa, Takayuki"  writes:
> > The attached patch speeds up transaction completion when any prior 
> > transaction accessed many relations in the same session.
> 
> Hm.  Putting a list header for a purely-local data structure into shared
> memory seems quite ugly.  Isn't there a better place to keep that?

Yea, I think it'd be just as fine to store that in a static
variable (best defined directly beside LockMethodLocalHash).

(Btw, I'd be entirely unsurprised if moving away from a dynahash for
LockMethodLocalHash would be beneficial)


> Do we really want a dlist here at all?  I'm concerned that bloating
> LOCALLOCK will cost us when there are many locks involved.  This patch
> increases the size of LOCALLOCK by 25% if I counted right, which does
> not seem like a negligible penalty.

It's currently

struct LOCALLOCK {
LOCALLOCKTAG   tag;  /* 020 */

/* XXX 4 bytes hole, try to pack */

LOCK * lock; /*24 8 */
PROCLOCK * proclock; /*32 8 */
uint32 hashcode; /*40 4 */

/* XXX 4 bytes hole, try to pack */

int64  nLocks;   /*48 8 */
_Bool  holdsStrongLockCount; /*56 1 */
_Bool  lockCleared;  /*57 1 */

/* XXX 2 bytes hole, try to pack */

intnumLockOwners;/*60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
intmaxLockOwners;/*64 4 */

/* XXX 4 bytes hole, try to pack */

LOCALLOCKOWNER *   lockOwners;   /*72 8 */

/* size: 80, cachelines: 2, members: 10 */
/* sum members: 66, holes: 4, sum holes: 14 */
/* last cacheline: 16 bytes */
};

seems we could trivially squeeze most of the bytes for a dlist node out
of padding.


> My own thought about how to improve this situation was just to destroy
> and recreate LockMethodLocalHash at transaction end (or start)
> if its size exceeded $some-value.  Leaving it permanently bloated seems
> like possibly a bad idea, even if we get rid of all the hash_seq_searches
> on it.

OTOH, that'll force constant incremental resizing of the hashtable, for
workloads that regularly need a lot of locks. And I'd assume in most
cases if one transaction needs a lot of locks it's quite likely that
future ones will need a lot of locks, too.
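
As a side note on that resizing concern: dynahash takes an initial-size
hint at creation time, so a recreate-the-table scheme could seed the new
table from recent usage rather than a small constant.  A rough sketch,
with the sizing policy and function name invented:

static HTAB *
recreate_local_lock_hash(long expected_locks)
{
    HASHCTL info;

    memset(&info, 0, sizeof(info));
    info.keysize = sizeof(LOCALLOCKTAG);
    info.entrysize = sizeof(LOCALLOCK);

    /* size hint taken from recent history, not a fixed small value */
    return hash_create("LOCALLOCK hash",
                       Max(16, expected_locks),
                       &info,
                       HASH_ELEM | HASH_BLOBS);
}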

Greetings,

Andres Freund



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Tom Lane
Andres Freund  writes:
> On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
>> Mmm ... AIUI, the patches currently proposed can only help for what
>> David called "point lookup" queries.  There are still going to be
>> queries that scan a large proportion of a partition tree, so if you've
>> got tons of partitions, you'll be concerned about this sort of thing.

> Agreed - but it seems not unlikely that for those the rest of the
> planner / executor overhead will entirely swamp any improvement we could
> make here. If I understand correctly the benchmarks here were made with
> "point" update and select queries, although the reference in the first
> post in this thread is a bit vague.

I think what Maumau-san is on about here is that not only does your
$giant-query take a long time, but it has a permanent negative effect
on all subsequent transactions in the session.  That seems worth
doing something about.

>> I didn't say it had to be a constant ...

> Do you have good idea?

I think David's on the right track --- keep some kind of moving average of
the LOCALLOCK table size for each transaction, and nuke it if it exceeds
some multiple of the recent average.  Not sure offhand about how to get
the data cheaply --- it might not be sufficient to look at transaction
end, if we release LOCALLOCK entries before that (but do we?)
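
An entirely invented sketch of what such a heuristic could look like,
assuming a counter bumped on every local lock table insertion (a
high-water mark sidesteps the entries-already-released problem); none of
these names or constants come from any posted patch:

static long     peakLocalLocks;     /* bumped whenever a LOCALLOCK is created */
static double   avgLocalLocks;      /* exponential moving average */

static void
AtEOXact_MaybeResetLockTable(void)
{
    /* cheap moving average: avg += (x - avg) / 8 */
    avgLocalLocks += ((double) peakLocalLocks - avgLocalLocks) / 8.0;

    /* nuke the table only when this transaction was a clear outlier */
    if (peakLocalLocks > 10 * avgLocalLocks && peakLocalLocks > 64)
    {
        hash_destroy(LockMethodLocalHash);
        CreateLocalLockHash();      /* invented re-creation helper */
    }

    peakLocalLocks = 0;
}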

regards, tom lane



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Andres Freund
Hi,

On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Isn't a large portion of benefits in this patch going to be mooted by
> > the locking improvements discussed in the other threads? I.e. there's
> > hopefully not going to be a ton of cases with low overhead where we
> > acquire a lot of locks and release them very soon after. Sure, for DDL
> > etc we will, but I can't see this mattering from a performance POV?
> 
> Mmm ... AIUI, the patches currently proposed can only help for what
> David called "point lookup" queries.  There are still going to be
> queries that scan a large proportion of a partition tree, so if you've
> got tons of partitions, you'll be concerned about this sort of thing.

Agreed - but it seems not unlikely that for those the rest of the
planner / executor overhead will entirely swamp any improvement we could
make here. If I understand correctly the benchmarks here were made with
"point" update and select queries, although the reference in the first
post in this thread is a bit vague.


> > I'm not against doing something like Tom proposes, but heuristics with
> > magic constants like this tend to age poorly / are hard to tune well
> > across systems.
> 
> I didn't say it had to be a constant ...

Do you have good idea?

Greetings,

Andres Freund



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread David Rowley
On Tue, 19 Feb 2019 at 12:56, Andres Freund  wrote:
> Isn't a large portion of benefits in this patch going to be mooted by
> the locking improvements discussed in the other threads? I.e. there's
> hopefully not going to be a ton of cases with low overhead where we
> acquire a lot of locks and release them very soon after. Sure, for DDL
> etc we will, but I can't see this mattering from a performance POV?

I think this patch was born from Amit's partition planner improvement
patch. If not that one, which other threads did you have in mind?

A problem exists where, when using a PREPAREd statement to query a
partitioned table containing many partitions, a generic plan will never
be favoured over a custom plan, since the generic plan might not be able
to prune partitions the way the custom plan can.  The actual problem is
that we do need to generate a generic plan at some point in order to
know that it's more costly, and doing so requires locking possibly every
partition.  When plan_cache_mode = auto, this is done on the 6th
execution of the statement (a sketch of that heuristic follows below).
After Amit's partition planner changes [1], the custom plan will only
lock partitions that are not pruned, so the 6th execution of the
statement bloats the local lock table.

[1] https://commitfest.postgresql.org/22/1778/
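
For reference, a hedged paraphrase of the plancache heuristic described
above; the real choose_custom_plan() in plancache.c differs in detail,
and the 1.1 fudge factor here is invented:

static bool
choose_custom_plan_sketch(int num_custom_plans,
                          double avg_custom_cost,
                          double generic_cost)
{
    /* the first five executions always get a custom plan */
    if (num_custom_plans < 5)
        return true;

    /*
     * From the 6th execution on, a generic plan has to be built --
     * locking possibly every partition -- just to learn generic_cost.
     * Keep using custom plans only while they stay clearly cheaper.
     */
    return avg_custom_cost * 1.1 < generic_cost;
}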

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Tom Lane
Andres Freund  writes:
> Isn't a large portion of benefits in this patch going to be mooted by
> the locking improvements discussed in the other threads? I.e. there's
> hopefully not going to be a ton of cases with low overhead where we
> acquire a lot of locks and release them very soon after. Sure, for DDL
> etc we will, but I can't see this mattering from a performance POV?

Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries.  There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.

> I'm not against doing something like Tom proposes, but heuristics with
> magic constants like this tend to age poorly / are hard to tune well
> across systems.

I didn't say it had to be a constant ...

regards, tom lane



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Tom Lane
David Rowley  writes:
> On Tue, 19 Feb 2019 at 12:42, Tom Lane  wrote:
>> My own thought about how to improve this situation was just to destroy
>> and recreate LockMethodLocalHash at transaction end (or start)
>> if its size exceeded $some-value.  Leaving it permanently bloated seems
>> like possibly a bad idea, even if we get rid of all the hash_seq_searches
>> on it.

> That seems like a good idea. Although, it would be good to know that
> it didn't add too much overhead dropping and recreating the table when
> every transaction happened to obtain more locks than $some-value.  If
> it did, then maybe we could track the average locks held by recent
> transactions and just ditch the table after the locks are released if
> the locks held by the last transaction exceeded the average *
> 1.something. No need to go near shared memory to do that.

Yeah, I'd deliberately avoided saying how we'd choose $some-value ;-).
Making it adaptive might not be a bad plan.

regards, tom lane



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Andres Freund
Hi,

On 2019-02-19 12:52:08 +1300, David Rowley wrote:
> On Tue, 19 Feb 2019 at 12:42, Tom Lane  wrote:
> > My own thought about how to improve this situation was just to destroy
> > and recreate LockMethodLocalHash at transaction end (or start)
> > if its size exceeded $some-value.  Leaving it permanently bloated seems
> > like possibly a bad idea, even if we get rid of all the hash_seq_searches
> > on it.
> 
> That seems like a good idea. Although, it would be good to know that
> it didn't add too much overhead dropping and recreating the table when
> every transaction happened to obtain more locks than $some-value.  If
> it did, then maybe we could track the average locks held by recent
> transactions and just ditch the table after the locks are released if
> the locks held by the last transaction exceeded the average *
> 1.something. No need to go near shared memory to do that.

Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?

I'm not against doing something like Tom proposes, but heuristics with
magic constants like this tend to age poorly / are hard to tune well
across systems.

Greetings,

Andres Freund



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread David Rowley
On Tue, 19 Feb 2019 at 12:42, Tom Lane  wrote:
> My own thought about how to improve this situation was just to destroy
> and recreate LockMethodLocalHash at transaction end (or start)
> if its size exceeded $some-value.  Leaving it permanently bloated seems
> like possibly a bad idea, even if we get rid of all the hash_seq_searches
> on it.

That seems like a good idea. Although, it would be good to know that
it didn't add too much overhead dropping and recreating the table when
every transaction happened to obtain more locks than $some-value.  If
it did, then maybe we could track the average locks held by recent
transactions and just ditch the table after the locks are released if
the locks held by the last transaction exceeded the average *
1.something. No need to go near shared memory to do that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-02-18 Thread Tom Lane
"Tsunakawa, Takayuki"  writes:
> The attached patch speeds up transaction completion when any prior 
> transaction accessed many relations in the same session.

Hm.  Putting a list header for a purely-local data structure into shared
memory seems quite ugly.  Isn't there a better place to keep that?

Do we really want a dlist here at all?  I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved.  This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.

My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value.  Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.

regards, tom lane