Re: [External] Join queries slow with predicate, limit, and ordering

2019-12-02 Thread Jeff Janes
On Mon, Dec 2, 2019 at 8:29 AM Aufar Gilbran  wrote:

> Hello,
>
> I'm trying to figure out how to optimise 3-table (many-to-many relation)
> joins
> with predicate, limit, and ordering, where one of the tables returns at
> most one
> row.
>
> This is the query that I have right now:
>
> SELECT entity.id
> FROM (
> SELECT entity_tag.entity_id
> FROM tag
> JOIN entity_tag ON tag.id = entity_tag.tag_id
> WHERE tag.key = 'status'
>   AND tag.value = 'SUCCEEDED'
> ) matched
> JOIN entity ON matched.entity_id = entity.id
> WHERE entity.type = 'execution'
> ORDER BY entity.id DESC
> LIMIT 10;
>

What happens if you set enable_sort to off before running it?
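
That is, in the same session (SET only affects the current session, so
nothing else is disturbed):

SET enable_sort = off;
-- re-run the query, or its EXPLAIN (ANALYZE, BUFFERS), here
RESET enable_sort;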


> ->  Nested Loop  (cost=1.28..723.38 rows=1 width=4) (actual
> time=0.153..5590.717 rows=89222 loops=1)
>

It thinks it will find 1 row, and actually finds 89,222.  I don't know
exactly why that would be; I suppose tag_id has an extremely skewed
distribution.  But yeah, that is going to cause some problems.  For one
thing, if there were actually just one qualifying row, then it wouldn't get
to stop early, as the LIMIT would never be satisfied.  So it thinks that if
it chose to walk the index backwards, it would have to walk the **entire**
index.
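
If the skewed-distribution guess is right, one standard thing to try (my
assumption, not something verified in this thread) is raising that column's
statistics target so ANALYZE keeps a larger most-common-values list:

ALTER TABLE entity_tag ALTER COLUMN tag_id SET STATISTICS 1000;  -- default is 100
ANALYZE entity_tag;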


> ->  Index Only Scan using
> entity_tag_tag_id_entity_id_idx on public.entity_tag (cost=0.43..711.53
> rows=201 width=16) (actual time=0.035..756.829 rows=89222 loops=1)
>   Heap Fetches: 89222
>

You should vacuum this table.  Doing that (and only that) probably won't
make a great deal of difference to this particular query, but still, it
will help some.  And might help other ones you haven't noticed yet as well.
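
For example (the "Heap Fetches: 89222" above is the symptom this addresses:
vacuum updates the visibility map, letting the index-only scan skip the heap):

VACUUM ANALYZE entity_tag;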


>
> Both tag_key_value_key and entity_tag_tag_id_entity_id_idx are UNIQUE
> constraints, on tag(key,value) and entity_tag(tag_id, entity_id)
> respectively.
>
> It seems to me that PostgreSQL runs the nested loop against all of the 90K
> records because it wants to sort the result before limiting it.


It doesn't **know** there are going to be 90K records.  It cannot plan
queries based on knowledge it doesn't possess.


> It
> doesn't take the UNIQUE constraint imposed on the table into account and
> thinks that the join being done inside the subquery will change the
> ordering of entity_id returned by the subquery, thus prompting the sort.
>

This seems like rather adventurous speculation.  It does the sort because
the horrible estimation makes it think that will be faster, not
because it thinks it is the only possible way.  If you set enable_sort =
off and it still does a sort, then you know it thinks there is no other way.



>
> I believe, given how the index is sorted, it should be able to just scan
> the index backwards because at most one tag_id will be returned. When I
> tried changing the predicate to filter by ID with the following query:
>
> -- This runs very fast
> SELECT entity.id
> FROM (
> SELECT entity_tag.entity_id
> FROM tag
> JOIN entity_tag ON tag.id = entity_tag.tag_id
> WHERE tag.id = 24
> ) matched
> JOIN entity ON matched.entity_id = entity.id
> WHERE entity.type = 'execution'
> ORDER BY entity.id DESC
> LIMIT 10;
>

With this query, it can use the join condition to transfer the knowledge of
tag.id=24 to become entity_tag.tag_id=24, and then look up stats on
entity_tag.tag_id for the value 24.  When you specify the single row of tag
indirectly, it can't do that as it doesn't know what specific value of
tag.id is going to be the one it finds (until after the query is done being
planned and starts executing, at which point it is too late).  But the row
with id=24 doesn't seem to be the same one with "tag.key = 'status' AND
tag.value = 'SUCCEEDED'", so you have basically changed the query entirely
on us.

If you replanned this query with ORDER BY entity.id+0 DESC (and with the
true value of tag_id), that might give you some more insight into the hidden
"thought process" behind the planner.

Cheers,

Jeff


Re: Considerable performance downgrade of v11 and 12 on Windows

2019-12-02 Thread Eugene Podshivalov
I have managed to split the 365GB file into 2GB chunks with the help of the
'split' unix utility in a MinGW shell, like so:
split -C 2GB ways.txt
Then I imported the files into a clean database with the following cmd
command:
for /f %f in ('dir /b') do psql -U postgres -w -d osm -t -c "set
client_encoding TO 'UTF8'; copy ways from 'D:\ways\%f';"
The operation took ~3.5 hours, which is the same as in v10!

Prior to that I set 'parallel_leader_participation = on' and
'synchronous_commit = off' in the config file and restarted the server.

Then I logged into the psql interactive terminal and ran
ALTER TABLE ONLY ways ADD CONSTRAINT pk_ways PRIMARY KEY (id);
It took 1h 10m, which is 30m faster than with the default settings (after
the 'type' command, if it really matters) but still 3 times slower than in v10.

Regards

On Mon, Dec 2, 2019 at 12:04 Laurenz Albe  wrote:

> On Sat, 2019-11-30 at 22:47 +0300, Eugene Podshivalov wrote:
> > It sounds strange but the "type" is indeed impacting the overall
> > performance somehow.
> > I've just tried to execute the following sequence of commands on a
> > fresh new database with PostgreSQL v10 and both the copy and primary
> > key commands performed as slow as in v11 and 12.
> >
> > SET synchronous_commit TO OFF;
> > SET client_encoding TO 'UTF8';
> > COPY ways FROM program 'cmd /c "type D:\ways.txt"';
> > ALTER TABLE ONLY ways ADD CONSTRAINT pk_ways PRIMARY KEY (id);
> >
> > Regards
> >
> > On Fri, Nov 29, 2019 at 15:22 Eugene Podshivalov  wrote:
> > > Laurenz,
> > > There is no way to run copy without the "type" on v11. See this thread
> > >
> https://www.postgresql.org/message-id/CAKHmqNCfTMM6%3DPqc6RUMEQ_2BPfo5KGGG-0fzRXZCVooo%3DwdNA%40mail.gmail.com
> > >
> > > My machine is running on NVMe disks, so the I/O subsystem is very strong.
> > > The 100% load is not constant but periodic, as if some kind of recovery
> > > dump is being performed in the background.
>
> Is it an option to split the file into parts of less than 2GB in size?
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
>
>


Re: Make recently inserted/updated records available in the buffer/cache

2019-12-02 Thread MichaelDBA
All updated/dirty records go through PG's internal memory buffer, 
shared_buffers.  Make sure that is configured optimally.  Use the 
pg_buffercache extension to help size it correctly.
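
For example, a query along these lines (close to the example in the
pg_buffercache documentation) shows which relations occupy the most shared
buffers:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;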


Regards,
Michael Vitale

Hüseyin Demir wrote on 12/2/2019 12:13 PM:

I guess there won't be any adverse effect






Re: Make recently inserted/updated records available in the buffer/cache

2019-12-02 Thread Hüseyin Demir
Hi,

PostgreSQL decides which pages should be evicted from memory and written to
disk with the help of an LRU algorithm, so it depends on your query
workload. In OLTP systems, the algorithm serves business requirements
well (almost :) ).

It's hard to be sure that a configuration change will affect performance in
a good way, because the results of the LRU algorithm vary depending on your
business and system workload. Maybe you can use PostgreSQL's warmup features
to make sure the data pages that you need will be available in the cache.
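
The usual warmup feature is the pg_prewarm extension; a minimal sketch (the
relation names here are placeholders for your own):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- read a table and its index into shared_buffers ahead of time
SELECT pg_prewarm('queue_table');
SELECT pg_prewarm('queue_table_pkey');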

Best Regards.


On Mon, Dec 2, 2019 at 20:03 Sachin Divekar  wrote:

> Hi,
>
> I am looking to tune my PG setup such that a recently inserted or updated
> record will be available in the buffer/cache (I believe they are the same in
> this context). Does PostgreSQL do it by default? If yes, is just increasing
> the buffer size sufficient? What will be its effect on LRU performance -- I
> guess there won't be any adverse effect?
>
> My use case is that I am going to use it as a queue and performance will
> be dependent upon whether the recently updated record is available in the
> cache.
>
> Thank you.
>
> regards
> Sachin
>


-- 

*Hüseyin DEMİR*

IT SOLUTION ARCHITECT

0534-614-72-06
demirhuseyinn...@gmail.com

selfarrival.blogspot.com.tr


Make recently inserted/updated records available in the buffer/cache

2019-12-02 Thread Sachin Divekar
Hi,

I am looking to tune my PG setup such that a recently inserted or updated
record will be available in the buffer/cache (I believe they are the same in
this context). Does PostgreSQL do it by default? If yes, is just increasing
the buffer size sufficient? What will be its effect on LRU performance -- I
guess there won't be any adverse effect?

My use case is that I am going to use it as a queue and performance will be
dependent upon whether the recently updated record is available in the
cache.

Thank you.

regards
Sachin


[External] Join queries slow with predicate, limit, and ordering

2019-12-02 Thread Aufar Gilbran
Hello,

I'm trying to figure out how to optimise 3-table (many-to-many relation) joins
with predicate, limit, and ordering, where one of the tables returns at most one
row.

This is the query that I have right now:

SELECT entity.id
FROM (
SELECT entity_tag.entity_id
FROM tag
JOIN entity_tag ON tag.id = entity_tag.tag_id
WHERE tag.key = 'status'
  AND tag.value = 'SUCCEEDED'
) matched
JOIN entity ON matched.entity_id = entity.id
WHERE entity.type = 'execution'
ORDER BY entity.id DESC
LIMIT 10;

It runs very slowly when there are many rows matched on the entity table
(in my case, about 90K rows), even though at most one row is returned
by tag.

Limit  (cost=723.39..723.40 rows=1 width=4) (actual
time=6189.015..6189.242 rows=10 loops=1)
  Output: entity.id
  Buffers: shared hit=411886 read=31282
  ->  Sort  (cost=723.39..723.40 rows=1 width=4) (actual
time=6188.999..6189.059 rows=10 loops=1)
Output: entity.id
Sort Key: entity.id DESC
Sort Method: top-N heapsort  Memory: 25kB
Buffers: shared hit=411886 read=31282
->  Nested Loop  (cost=1.28..723.38 rows=1 width=4) (actual
time=0.153..5590.717 rows=89222 loops=1)
  Output: entity.id
  Buffers: shared hit=411886 read=31282
  ->  Nested Loop  (cost=0.86..721.98 rows=3 width=8)
(actual time=0.108..1851.707 rows=89222 loops=1)
Output: entity_tag.entity_id
Buffers: shared hit=65146 read=20646
->  Index Scan using tag_key_value_key on
public.tag  (cost=0.43..8.45 rows=1 width=4) (actual time=0.043..0.061
rows=1 loops=1)
  Output: tag.id, tag.key, tag.value, tag.created_at
  Index Cond: (((tag.key)::text =
'status'::text) AND ((tag.value)::text = 'SUCCEEDED'::text))
  Buffers: shared hit=1 read=3
->  Index Only Scan using
entity_tag_tag_id_entity_id_idx on public.entity_tag
(cost=0.43..711.53 rows=201 width=16) (actual time=0.035..756.829
rows=89222 loops=1)
  Output: entity_tag.tag_id, entity_tag.entity_id
  Index Cond: (entity_tag.tag_id = tag.id)
  Heap Fetches: 89222
  Buffers: shared hit=65145 read=20643
  ->  Index Scan using entity_pkey on public.entity
(cost=0.42..0.46 rows=1 width=4) (actual time=0.010..0.017 rows=1
loops=89222)
Output: entity.id, entity.entity_id, entity.type,
entity.created_at
Index Cond: (entity.id = entity_tag.entity_id)
Filter: ((entity.type)::text = 'execution'::text)
Buffers: shared hit=346740 read=10636
Planning time: 0.817 ms
Execution time: 6189.419 ms

Both tag_key_value_key and entity_tag_tag_id_entity_id_idx are UNIQUE
constraints, on tag(key,value) and entity_tag(tag_id, entity_id) respectively.

It seems to me that PostgreSQL runs the nested loop against all of the 90K
records because it wants to sort the result before limiting it. It doesn't
take the UNIQUE constraint imposed on the table into account and thinks that
the join being done inside the subquery will change the ordering of entity_id
returned by the subquery, thus prompting the sort.

I believe, given how the index is sorted, it should be able to just scan the
index backwards because at most one tag_id will be returned. When I tried
changing the predicate to filter by ID with the following query:

-- This runs very fast
SELECT entity.id
FROM (
SELECT entity_tag.entity_id
FROM tag
JOIN entity_tag ON tag.id = entity_tag.tag_id
WHERE tag.id = 24
) matched
JOIN entity ON matched.entity_id = entity.id
WHERE entity.type = 'execution'
ORDER BY entity.id DESC
LIMIT 10;

and it's blazing fast. This time PostgreSQL seems to know that the join inside
the subquery won't change the ordering of entity_id returned by the subquery, as
seen in the following query explanation:

Limit  (cost=1.28..1025.56 rows=10 width=4) (actual time=0.144..0.276
rows=1 loops=1)
  Output: entity.id
  Buffers: shared hit=12
  ->  Nested Loop  (cost=1.28..1537.70 rows=15 width=4) (actual
time=0.125..0.238 rows=1 loops=1)
Output: entity.id
Buffers: shared hit=12
->  Nested Loop  (cost=0.86..1529.06 rows=15 width=12) (actual
time=0.057..0.116 rows=1 loops=1)
  Output: entity_tag.tag_id, entity.id
  Buffers: shared hit=8
  ->  Index Only Scan Backward using
entity_tag_tag_id_entity_id_idx on public.entity_tag
(cost=0.43..454.82 rows=128 width=16) (actual time=0.018..0.038 rows=1
loops=1)
Output: entity_tag.tag_id, entity_tag.entity_id
Index Cond: (entity_tag.tag_id = 24)
Heap Fetches: 1
Buffers: shared hit=4
  ->  Index Scan using entity_pkey on public.entity
(cost=0.42..8.38 

Re: Considerable performance downgrade of v11 and 12 on Windows

2019-12-02 Thread Laurenz Albe
On Sat, 2019-11-30 at 22:47 +0300, Eugene Podshivalov wrote:
> It sounds strange but the "type" is indeed impacting the overall
> performance somehow.
> I've just tried to execute the following sequence of commands on a
> fresh new database with PostgreSQL v10 and both the copy and primary
> key commands performed as slow as in v11 and 12.
> 
> SET synchronous_commit TO OFF;
> SET client_encoding TO 'UTF8';
> COPY ways FROM program 'cmd /c "type D:\ways.txt"';
> ALTER TABLE ONLY ways ADD CONSTRAINT pk_ways PRIMARY KEY (id);
> 
> Regards
> 
> On Fri, Nov 29, 2019 at 15:22 Eugene Podshivalov  wrote:
> > Laurenz,
> > There is no way to run copy without the "type" on v11. See this thread
> > https://www.postgresql.org/message-id/CAKHmqNCfTMM6%3DPqc6RUMEQ_2BPfo5KGGG-0fzRXZCVooo%3DwdNA%40mail.gmail.com
> > 
> > My machine is running on NVMe disks, so the I/O subsystem is very strong.
> > The 100% load is not constant but periodic, as if some kind of recovery
> > dump is being performed in the background.

Is it an option to split the file into parts of less than 2GB in size?

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com