Re: [GENERAL] Slow join over three tables

2016-06-08 Thread MOLINA BRAVO FELIPE DE JESUS
Hi!!

what happens if you change your query to:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
  FROM reports r
  INNER JOIN report_drugs d USING (rid) 
  INNER JOIN report_adverses a USING (rid) 
 WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
   AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
 ORDER BY r.created;



I have seen differences in time between "ON" and "USING"; I prefer "USING" 
when possible.
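
A caveat on the suggestion above: USING only works when the join column has 
the same name on both relations. report_drugs.rid and report_adverses.rid 
match, but the reports table exposes "id", so its key would need to be 
exposed as "rid" first. A minimal sketch of one way to do that (the inner 
alias is illustrative, not from the thread):

SELECT r.rid, r.age, r.gender, r.created, a.adverse, d.drug
  FROM (SELECT id AS rid, age, gender, created FROM reports) r
  INNER JOIN report_drugs d USING (rid)
  INNER JOIN report_adverses a USING (rid)
 WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
   AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
 ORDER BY r.created;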



> > 
> > All tables have indexes on the "id"/"rid" columns and on the 
> > "drug"/"adverse" columns.
> > 
> > The query:
> > 
> > SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> > FROM reports r
> > JOIN report_drugs d ON d.rid = r.id
> > JOIN report_adverses a ON a.rid = r.id 
> > WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
> > pain', 'back pain']) 
> > AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
> > 
> I would suggest a few experiments to see how you can modify the plans
> available to the optimizer:
> 
> 1. CREATE INDEX ON report_drugs (drug, rid)
> 2. CREATE INDEX ON report_adverses (adverse, rid)
> 3. CREATE INDEX ON reports (id, created)
> 
> Re-run EXPLAIN ANALYZE of your query after each of these steps to see
> how each one affects planning.  You might also try two variants of the
> query at each step, with and without the ORDER BY.
> 
> Note, the index column order in (1) and (2) above is very important.
> 
> 
> Karl
> 
> 
> 



Re: [GENERAL] Slow join over three tables

2016-04-28 Thread Tim van der Linden
On Thu, 28 Apr 2016 16:38:53 +0200
Alban Hertroys  wrote:

Hi Alban

First off ... thank you for your time on this so far and for giving me the 
educational smacks on the head :) I appreciate this a lot. 

> You're doing ~9 times as many index lookups. A slowdown of 6x of this
> part of the query seems rather reasonable.

Again, very true. But why 9 times as many?
 
> >  Planning time: 16.438 ms
> >  Execution time: 663.583 ms
> >
> > A total of 660 ms instead of the previous 120 ms. The number of rows 
> > returned is 4076, instead of the 448 rows of the previous query. Could 
> > this be the cause of the slowdown? Or is it the fact that the integers in 
> > the ANY clause are further apart from one another and thus make for longer 
> > index searches?
> 
> Technically probably both, but the index node distance (assuming a
> btree index) should be insignificant.

It is indeed a b-tree index.

> The sorting of found candidate rows on rid probably contributes the
> most. I forgot how quicksort scales to the number of items to sort,
> probably something like n log(n) or something like that.

Looked this up and you are right, a quicksort (in C) is of n log(n) complexity 
... and if I'm right that is only slightly worse than linear. Less sorting is 
thus the message :)
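
(The ~9 times comes from the plans themselves: the report_drugs index scan 
returns 135,532 rows for the new drug IDs versus 14,200 for the old ones. A 
rough back-of-envelope using the sorted row counts from the two plans, 
assuming comparison count dominates the sort cost:

  14074 rows:   14074 * log2(14074)  ~  14074 * 13.8 ~   194,000 comparisons
  135370 rows: 135370 * log2(135370) ~ 135370 * 17.0 ~ 2,301,000 comparisons

So ~9.6x the rows costs roughly 12x the comparisons: slightly worse than 
linear, which is why less sorting is the message.)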

> That said, you didn't happen to perform these tests (assuming >1)
> during a busier period of the database server? I ask because later on
> you seem to be getting fast results again after some more index
> changes and I don't see cause for the difference.

No, these tests are on a local development machine with nothing else going on 
but the database cluster.

> > 2. Adding extra columns ignores indexes
> > ...
> > - a.recovery
> > - a.severity
> > - d.reason
> > - d.effectiveness
> > - d.duration
> 
> I have to admit that I'm a bit behind with the current state of the
> art of PostgreSQL, but last time I checked, the database needed to
> look at each row in the result-set for transaction visibility
> information. I recall that there was (at some point) much discussion
> whether that visibility information could be added to indexes and that
> there were strong arguments against doing so. Hence, I doubt that
> those new index-only scans skip that step.

I see.

> Unless I'm wrong there, adding non-queried fields to the index is only
> going to affect your performance adversely. Relevant fields for
> indices are those used in joins, those regularly used in conditions
> (where-clauses) and those that are regularly sorted over.
> 
> Other fields are available from the candidate rows of the result set.
> Putting them in the index just results in storing redundant
> information.

Storing redundant information and making for bloated indexes, which the 
planner might then choose to skip, if I understand correctly. Good to know.
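
Worth noting alongside this: the plans later in this thread report "Heap 
Fetches: 0" for the index-only scans, which is what PostgreSQL (9.2+) shows 
when the visibility map lets it skip the per-row heap visit. A minimal sketch 
of keeping that map current (the single-column query is illustrative only):

VACUUM ANALYZE report_adverses;

EXPLAIN (ANALYZE, BUFFERS)
SELECT rid FROM report_adverses WHERE adverse = 'back pain';
-- "Heap Fetches: 0" in the output means no heap visits were needed for
-- visibility checks on this scan.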

> > ...
> > Still fast enough ... but I was wondering why the planner would not use the 
> > new index and instead fall back on the "report_drugs_drug_idx" single 
> > column index.
> 
> I'm actually a bit suspicious of those numbers; those are different
> drug id's than those from the first explain too.

You are completely right, this was the wrong plan ... the correct plan is:

 Sort  (cost=31757.71..31765.48 rows=3107 width=76) (actual 
time=722.348..722.461 rows=4076 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 495kB
   ->  Nested Loop  (cost=26041.17..31577.48 rows=3107 width=76) (actual 
time=139.407..721.090 rows=4076 loops=1)
 ->  Merge Join  (cost=26040.61..27143.53 rows=3107 width=64) (actual 
time=139.396..170.446 rows=4076 loops=1)
   Merge Cond: (d.rid = a.rid)
   ->  Sort  (cost=16332.08..16671.61 rows=135810 width=35) (actual 
time=108.866..119.143 rows=135370 loops=1)
 Sort Key: d.rid
 Sort Method: quicksort  Memory: 13992kB
 ->  Index Scan using report_drugs_drug_idx on report_drugs 
d  (cost=0.44..4753.44 rows=135810 width=35) (actual time=0.038..53.758 
rows=135532 loops=1)
   Index Cond: (drug = ANY 
('{9557,17848,17880,18223}'::integer[]))
   ->  Sort  (cost=9708.53..9904.94 rows=78565 width=29) (actual 
time=30.517..34.876 rows=77163 loops=1)
 Sort Key: a.rid
 Sort Method: quicksort  Memory: 6702kB
 ->  Index Only Scan using 
report_adverses_adverse_rid_severity_recovery_idx on report_adverses a  
(cost=0.56..3320.57 rows=78565 width=29) (actual time=1.005..16.135 rows=76972 
loops=1)
   Index Cond: (adverse = ANY ('{"back pain - 
nonspecific","nonspecific back pain","back pain"}'::text[]))
   Heap Fetches: 0
 ->  Index Only Scan using reports_id_age_gender_created_idx on reports 
r  (cost=0.56..1.42 rows=1 width=20) (actual time=0.134..0.135 rows=1 
loops=4076)
   Index Cond: (id = d.rid)
   Heap Fetches: 0
 Planning time: 

Re: [GENERAL] Slow join over three tables

2016-04-28 Thread Alban Hertroys
On 28 April 2016 at 08:36, Tim van der Linden  wrote:
> On Wed, 27 Apr 2016 13:48:06 +0200
> Alban Hertroys  wrote:
>
>> In this case, you're using the values in adverse to filter relevant rid's 
>> for the FK join, so you might be better off with the inverse of above index:
>>   create index on report_adverses (adverse, rid);
>>   create index on report_drugs (drug, rid);
>
> Hmm, like I reported yesterday this achieved a huge performance boost.
>
> However, there are two (new) things I'd like to touch on while further 
> experimenting with this query:
>
> 1. Altering drug IDs or adverse names affects the execution time negatively.
>
> In this example altering the drug IDs I search for makes the query 6 times 
> slower again:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
> pain', 'back pain'])
> AND d.drug = ANY (ARRAY[9557, 17848, 17880, 18223]) ORDER BY r.created;
>
> Different drug IDs ([9557, 17848, 17880, 18223]), but the rest is the same. 
> Query plan:
>
>  Sort  (cost=31409.71..31417.48 rows=3107 width=41) (actual 
> time=662.707..662.819 rows=4076 loops=1)
>Sort Key: r.created
>Sort Method: quicksort  Memory: 415kB
>->  Nested Loop  (cost=25693.17..31229.48 rows=3107 width=41) (actual 
> time=71.748..661.743 rows=4076 loops=1)
>  ->  Merge Join  (cost=25692.61..26795.53 rows=3107 width=29) (actual 
> time=70.841..97.094 rows=4076 loops=1)
>Merge Cond: (d.rid = a.rid)
>->  Sort  (cost=16332.08..16671.61 rows=135810 width=8) 
> (actual time=48.946..58.623 rows=135370 loops=1)
>  Sort Key: d.rid
>  Sort Method: quicksort  Memory: 12498kB
>  ->  Index Scan using report_drugs_drug_idx on 
> report_drugs d  (cost=0.44..4753.44 rows=135810 width=8) (actual 
> time=0.681..28.441 rows=135532 loops=1)
>Index Cond: (drug = ANY 
> ('{9557,17848,17880,18223}'::integer[]))


You're doing ~9 times as many index lookups. A slowdown of 6x of this
part of the query seems rather reasonable.

>  Planning time: 16.438 ms
>  Execution time: 663.583 ms
>
> A total of 660 ms instead of the previous 120 ms. The number of rows returned 
> is 4076, instead of the 448 rows of the previous query. Could this be the 
> cause of the slowdown? Or is it the fact that the integers in the ANY clause 
> are further apart from one another and thus make for longer index searches?

Technically probably both, but the index node distance (assuming a
btree index) should be insignificant.

The sorting of found candidate rows on rid probably contributes the
most. I forgot how quicksort scales to the number of items to sort,
probably something like 2log(n) or something like that.

That said, you didn't happen to perform these tests (assuming >1)
during a busier period of the database server? I ask because later on
you seem to be getting fast results again after some more index
changes and I don't see cause for the difference.

> The same happens when changing the adverses: some come back in ~120 ms, 
> others can take up to one second.

Likely the same issue.

> 2. Adding extra columns ignores indexes
>
> This one was kind of expected. Adding extra columns from both "report_drugs" 
> and "report_adverses" slows the lot down again. Extra columns added:
>
> - a.recovery
> - a.severity
> - d.reason
> - d.effectiveness
> - d.duration

I have to admit that I'm a bit behind with the current state of the
art of PostgreSQL, but last time I checked, the database needed to
look at each row in the result-set for transaction visibility
information. I recall that there was (at some point) much discussion
whether that visibility information could be added to indexes and that
there were strong arguments against doing so. Hence, I doubt that
those new index-only scans skip that step.

Unless I'm wrong there, adding non-queried fields to the index is only
going to affect your performance adversely. Relevant fields for
indices are those used in joins, those regularly used in conditions
(where-clauses) and those that are regularly sorted over.

Other fields are available from the candidate rows of the result set.
Putting them in the index just results in storing redundant
information.

That's how it always has been and I'm pretty sure that's still the case.

> As I expected, adding these columns made the previous multi-column indexes 
> useless. And indeed, they were not used anymore.

What adding those fields to the index did was drive up the cost of
using that index. It requires more memory to cache, for example, and
this will be evicted from caches sooner than a smaller index.

I suspect the query planner also noticed that and chose the smaller index, as 
that index's usage costs 

Re: [GENERAL] Slow join over three tables

2016-04-28 Thread Tim van der Linden
On Wed, 27 Apr 2016 13:48:06 +0200
Alban Hertroys  wrote:

> In this case, you're using the values in adverse to filter relevant rid's for 
> the FK join, so you might be better off with the inverse of above index:
>   create index on report_adverses (adverse, rid);
>   create index on report_drugs (drug, rid);

Hmm, like I reported yesterday this achieved a huge performance boost.

However, there are two (new) things I'd like to touch on while further 
experimenting with this query:

1. Altering drug IDs or adverse names affects the execution time negatively.

In this example altering the drug IDs I search for makes the query 6 times 
slower again:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id 
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
pain', 'back pain']) 
AND d.drug = ANY (ARRAY[9557, 17848, 17880, 18223]) ORDER BY r.created;

Different drug IDs ([9557, 17848, 17880, 18223]), but the rest is the same. 
Query plan:

 Sort  (cost=31409.71..31417.48 rows=3107 width=41) (actual 
time=662.707..662.819 rows=4076 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 415kB
   ->  Nested Loop  (cost=25693.17..31229.48 rows=3107 width=41) (actual 
time=71.748..661.743 rows=4076 loops=1)
 ->  Merge Join  (cost=25692.61..26795.53 rows=3107 width=29) (actual 
time=70.841..97.094 rows=4076 loops=1)
   Merge Cond: (d.rid = a.rid)
   ->  Sort  (cost=16332.08..16671.61 rows=135810 width=8) (actual 
time=48.946..58.623 rows=135370 loops=1)
 Sort Key: d.rid
 Sort Method: quicksort  Memory: 12498kB
 ->  Index Scan using report_drugs_drug_idx on report_drugs 
d  (cost=0.44..4753.44 rows=135810 width=8) (actual time=0.681..28.441 
rows=135532 loops=1)
   Index Cond: (drug = ANY 
('{9557,17848,17880,18223}'::integer[]))
   ->  Sort  (cost=9360.53..9556.94 rows=78565 width=21) (actual 
time=21.880..25.969 rows=77163 loops=1)
 Sort Key: a.rid
 Sort Method: quicksort  Memory: 6682kB
 ->  Index Only Scan using report_adverses_adverse_rid_idx 
on report_adverses a  (cost=0.56..2972.57 rows=78565 width=21) (actual 
time=0.983..10.744 rows=76972 loops=1)
   Index Cond: (adverse = ANY ('{"back pain - 
nonspecific","nonspecific back pain","back pain"}'::text[]))
   Heap Fetches: 0
 ->  Index Only Scan using reports_id_age_gender_created_idx on reports 
r  (cost=0.56..1.42 rows=1 width=20) (actual time=0.137..0.138 rows=1 
loops=4076)
   Index Cond: (id = d.rid)
   Heap Fetches: 0
 Planning time: 16.438 ms
 Execution time: 663.583 ms

A total of 660 ms instead of the previous 120 ms. The number of rows returned 
is 4076, instead of the 448 rows of the previous query. Could this be the 
cause of the slowdown? Or is it the fact that the integers in the ANY clause 
are further apart from one another and thus make for longer index searches?

The same happens when changing the adverses: some come back in ~120 ms, others 
can take up to one second.

2. Adding extra columns ignores indexes

This one was kind of expected. Adding extra columns from both "report_drugs" 
and "report_adverses" slows the lot down again. Extra columns added:

- a.recovery
- a.severity
- d.reason
- d.effectiveness
- d.duration

As I expected, adding these columns made the previous multi-column indexes 
useless. And indeed, they were not used anymore.

So my initial thought was to create new indexes on the columns I now query:

- CREATE INDEX ON report_adverses(adverse, rid, severity, recovery);
- CREATE INDEX ON report_drugs(drug, rid, reason, effectiveness, duration);
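
On newer PostgreSQL (11 and later, so not the version used in this thread) 
the non-key payload columns can instead be attached with INCLUDE, which keeps 
the searchable key small. A hedged sketch:

CREATE INDEX ON report_adverses (adverse, rid) INCLUDE (severity, recovery);
CREATE INDEX ON report_drugs (drug, rid) INCLUDE (reason, effectiveness, duration);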

After running the query again, the new index on "report_adverses" got picked 
up, but the index on "report_drugs" did not:

 Sort  (cost=12365.79..12366.61 rows=329 width=76) (actual 
time=129.106..129.120 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 66kB
   ->  Nested Loop  (cost=11212.93..12352.04 rows=329 width=76) (actual 
time=31.558..128.951 rows=448 loops=1)
 Join Filter: (d.rid = r.id)
 ->  Merge Join  (cost=11212.38..11680.44 rows=329 width=64) (actual 
time=30.705..39.527 rows=448 loops=1)
   Merge Cond: (d.rid = a.rid)
   ->  Sort  (cost=1503.85..1539.82 rows=14390 width=35) (actual 
time=6.977..7.993 rows=14074 loops=1)
 Sort Key: d.rid
 Sort Method: quicksort  Memory: 1202kB
 ->  Index Scan using report_drugs_drug_idx on report_drugs 
d  (cost=0.44..510.02 rows=14390 width=35) (actual time=0.567..4.638 rows=14200 
loops=1)
   Index Cond: (drug = ANY 
('{359,360,361,362,363}'::integer[]))
  

Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Karl Czajkowski
On Apr 27, Tim van der Linden modulated:
...
> I'm joining three fairly large tables together, and it is slow. The tables 
> are:
> 
> - "reports": 6 million rows
> - "report_drugs": 20 million rows
> - "report_adverses": 20 million rows
> 
...
> All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" 
> columns.
> 
> The query:
> 
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id 
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
> pain', 'back pain']) 
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
> 

I would suggest a few experiments to see how you can modify the plans
available to the optimizer:

1. CREATE INDEX ON report_drugs (drug, rid)
2. CREATE INDEX ON report_adverses (adverse, rid)
3. CREATE INDEX ON reports (id, created)

Re-run EXPLAIN ANALYZE of your query after each of these steps to see
how each one affects planning.  You might also try two variants of the
query at each step, with and without the ORDER BY.

Note, the index column order in (1) and (2) above is very important.
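
For example, the re-run at each step could look like this (a sketch; run it 
once with and once without the final ORDER BY):

EXPLAIN ANALYZE
SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
ORDER BY r.created;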


Karl





Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 13:48:06 +0200
Alban Hertroys  wrote:

Hi Alban

Thanks for chiming in!

> Since you're not using age and gender in this (particular) query until the 
> rows are combined into a result set already, it doesn't make a whole lot of 
> sense to add them to the index. Moreover, since you're ordering by created, 
> I'd at least put that directly after id:
> 
>   create index on reports (id, created);

The ORDER BY clause did not contribute much to the total query time, but it 
makes sense to index this.
 
> In this case, you're using the values in adverse to filter relevant rid's for 
> the FK join, so you might be better off with the inverse of above index:
>   create index on report_adverses (adverse, rid);
>   create index on report_drugs (drug, rid);

Hold the phone ... you just cracked the code.

I added these two proposed indexes (reversing their order). After purging the 
OS disk cache, the query runs in (*drum roll*) 120 ms. You managed to speed 
this up ~40 times.

The query plan:

 Sort  (cost=12108.99..12109.83 rows=337 width=41) (actual 
time=119.517..119.531 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.84 rows=337 width=41) (actual 
time=26.473..119.392 rows=448 loops=1)
 Join Filter: (d.rid = r.id)
 ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual 
time=25.624..33.650 rows=448 loops=1)
   Merge Cond: (d.rid = a.rid)
   ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual 
time=4.208..4.976 rows=14074 loops=1)
 Sort Key: d.rid
 Sort Method: quicksort  Memory: 1050kB
 ->  Index Only Scan using report_drugs_drug_rid_idx on 
report_drugs d  (cost=0.44..415.86 rows=14496 width=8) (actual 
time=0.648..2.236 rows=14200 loops=1)
   Index Cond: (drug = ANY 
('{359,360,361,362,363}'::integer[]))
   Heap Fetches: 0
   ->  Sort  (cost=9521.91..9721.56 rows=79860 width=21) (actual 
time=21.410..24.174 rows=76974 loops=1)
 Sort Key: a.rid
 Sort Method: quicksort  Memory: 6682kB
 ->  Index Only Scan using report_adverses_adverse_rid_idx 
on report_adverses a  (cost=0.56..3019.24 rows=79860 width=21) (actual 
time=0.916..10.689 rows=76972 loops=1)
   Index Cond: (adverse = ANY ('{"back pain - 
nonspecific","nonspecific back pain","back pain"}'::text[]))
   Heap Fetches: 0
 ->  Index Only Scan using reports_id_age_gender_created_idx on reports 
r  (cost=0.56..2.01 rows=1 width=20) (actual time=0.190..0.191 rows=1 loops=448)
   Index Cond: (id = a.rid)
   Heap Fetches: 0
 Planning time: 15.466 ms
 Execution time: 119.871 ms

This is amazing, and the slowdown was purely due to missing indexes on the 
sweet spot. You can now clearly see that the two new indexes 
(report_drugs_drug_rid_idx and report_adverses_adverse_rid_idx) are used and 
that the nested loop is much faster.

> Do these tables have a primary key and can that be used here?

Only the "reports" table has a PK, the other two don't (only a foreign key to 
"reports").

> I hope I'm not being superfluous here, but don't forget to vacuum analyze 
> after creating those indexes. Eventually autovacuum will do that for you, but 
> you could otherwise be running tests to verify the impact of adding those 
> indexes before autovacuum gets around to it.

I'm becoming the king of manual VACUUM-ing the past few days ... no need to 
point that out ;)

> Finally, perhaps it's more efficient to weed out all unrelated drugs and 
> adverses before relating them to reports, something like:
> 
> SELECT r.id, r.age, r.gender, r.created, x.adverse, x.drug
> FROM reports r
> JOIN (
>   SELECT d.rid, a.adverse, d.drug
>   FROM report_drugs d
>   JOIN report_adverses a ON a.rid = d.rid 
>   WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific 
> back pain', 'back pain']) 
>   AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
> ) x ON x.rid = r.id
> ORDER BY r.created;

With the above indexes created and the disk cache flushed, this query ran at 
the exact same speed. The plan:

 Sort  (cost=12108.14..12108.99 rows=337 width=41) (actual 
time=119.102..119.111 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.00 rows=337 width=41) (actual 
time=26.797..118.969 rows=448 loops=1)
 ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual 
time=25.899..33.203 rows=448 loops=1)
   Merge Cond: (d.rid = a.rid)
   ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual 
time=4.319..4.981 rows=14074 loops=1)
 Sort Key: d.rid
 Sort Method: 

Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Alban Hertroys

> On 27 Apr 2016, at 4:09, David Rowley  wrote:
> 
> On 27 April 2016 at 11:27, Tim van der Linden  wrote:

>> The query:
>> 
>> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
>> FROM reports r
>> JOIN report_drugs d ON d.rid = r.id
>> JOIN report_adverses a ON a.rid = r.id
>> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
>> pain', 'back pain'])
>> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
>> 

> Likely the only way to speed this up would be to create indexes;
> 
> create index on reports (id, age, gender, created);

Since you're not using age and gender in this (particular) query until the rows 
are combined into a result set already, it doesn't make a whole lot of sense to 
add them to the index. Moreover, since you're ordering by created, I'd at least 
put that directly after id:

create index on reports (id, created);

> create index on report_adverses (rid, adverse);

In this case, you're using the values in adverse to filter relevant rid's for 
the FK join, so you might be better off with the inverse of above index:
create index on report_adverses (adverse, rid);
create index on report_drugs (drug, rid);

Do these tables have a primary key and can that be used here?

I hope I'm not being superfluous here, but don't forget to vacuum analyze after 
creating those indexes. Eventually autovacuum will do that for you, but you 
could otherwise be running tests to verify the impact of adding those indexes 
before autovacuum gets around to it.
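
A minimal sketch of that manual step for the three tables involved:

VACUUM ANALYZE reports;
VACUUM ANALYZE report_drugs;
VACUUM ANALYZE report_adverses;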

Finally, perhaps it's more efficient to weed out all unrelated drugs and 
adverses before relating them to reports, something like:

SELECT r.id, r.age, r.gender, r.created, x.adverse, x.drug
FROM reports r
JOIN (
SELECT d.rid, a.adverse, d.drug
FROM report_drugs d
JOIN report_adverses a ON a.rid = d.rid 
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific 
back pain', 'back pain']) 
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
) x ON x.rid = r.id
ORDER BY r.created;

Looking at the cardinality of your tables that does seem a bit unlikely though. 
Still, worth a shot...

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.





Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 22:40:43 +1200
David Rowley  wrote:

Hi David

> > ...
> >  Planning time: 15.968 ms
> >  Execution time: 4313.755 ms
> >
> > Both the (rid, adverse) and the (id, age, gender, created) indexes are now 
> > used.
> >
> 
> Seems the (rid, adverse) is not being used. report_adverses_rid_idx
> is your (rid) index.

Ah ... good catch, I blame it on the late evening over here ;)

> > Yet ... this is "only" 1 second faster, still 4 seconds to complete this 
> > query (the join(s)). Is this truly the most that I could get out of it ... 
> > or could further tweaking to PostgreSQL's configuration be of any help here?
> 
> EXPLAIN ANALYZE also has quite a bit of timing overhead, so it might
> not be taking quite as long as you think.
> 
> How long does it take with EXPLAIN (ANALYZE, TIMING OFF) ... ?

Hmmm, this takes roughly the same amount of time: 

 Planning time: 39.889 ms
 Execution time: 4532.642 ms

> Or perhaps just run the query, as there's only 448 rows anyway.

Also, did not speed up much ... query time pulled from the log statement: 
4305.686 ms.

> -- 
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Cheers,
Tim




Re: [GENERAL] Slow join over three tables

2016-04-27 Thread David Rowley
On 27 April 2016 at 22:29, Tim van der Linden  wrote:
>  Sort  (cost=372968.28..372969.07 rows=317 width=41) (actual 
> time=9308.174..9308.187 rows=448 loops=1)
>Sort Key: r.created
>  Sort  (cost=66065.73..66066.59 rows=344 width=41) (actual 
> time=4313.679..4313.708 rows=448 loops=1)
>Sort Key: r.created
>Sort Method: quicksort  Memory: 60kB
>->  Nested Loop  (cost=1.44..66051.23 rows=344 width=41) (actual 
> time=43.987..4313.435 rows=448 loops=1)
>  Join Filter: (d.rid = a.rid)
>  ->  Nested Loop  (cost=1.00..54700.19 rows=14319 width=28) (actual 
> time=0.772..1158.338 rows=14200 loops=1)
>->  Index Scan using report_drugs_drug_idx on report_drugs d  
> (cost=0.44..507.78 rows=14319 width=8) (actual time=0.579..4.327 rows=14200 
> loops=1)
>  Index Cond: (drug = ANY 
> ('{359,360,361,362,363}'::integer[]))
>->  Index Only Scan using reports_id_age_gender_created_idx on 
> reports r  (cost=0.56..3.77 rows=1 width=20) (actual time=0.081..0.081 rows=1 
> loops=14200)
>  Index Cond: (id = d.rid)
>  Heap Fetches: 0
>  ->  Index Scan using report_adverses_rid_idx on report_adverses a  
> (cost=0.44..0.78 rows=1 width=21) (actual time=0.222..0.222 rows=0 
> loops=14200)
>Index Cond: (rid = r.id)
>Filter: (adverse = ANY ('{"back pain - 
> nonspecific","nonspecific back pain","back pain"}'::text[]))
>Rows Removed by Filter: 5
>  Planning time: 15.968 ms
>  Execution time: 4313.755 ms
>
> Both the (rid, adverse) and the (id, age, gender, created) indexes are now 
> used.
>

Seems the (rid, adverse) is not being used. report_adverses_rid_idx
is your (rid) index.

> Yet ... this is "only" 1 second faster, still 4 seconds to complete this 
> query (the join(s)). Is this truly the most that I could get out of it ... or 
> could further tweaking to PostgreSQL's configuration be of any help here?

EXPLAIN ANALYZE also has quite a bit of timing overhead, so it might
not be taking quite as long as you think.

How long does it take with EXPLAIN (ANALYZE, TIMING OFF) ... ?

Or perhaps just run the query, as there's only 448 rows anyway.
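
For reference, a sketch of that form (TIMING OFF skips the per-node clock 
calls that inflate EXPLAIN ANALYZE, while still reporting total runtime and 
row counts):

EXPLAIN (ANALYZE, TIMING OFF)
SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
ORDER BY r.created;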

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 09:14:27 +0300
Victor Yegorov  wrote:

Hi Victor

> > ...
> 
> Can you post output of `EXPLAIN (analyze, buffers)`, please?
> It'd be good to check how many buffers are hit/read during Index Scans.

Happy to, here it is:

 Sort  (cost=107727.85..107728.71 rows=344 width=41) (actual 
time=5357.453..5357.482 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   Buffers: shared hit=75462 read=37183
   ->  Nested Loop  (cost=1.31..107713.36 rows=344 width=41) (actual 
time=53.866..5357.211 rows=448 loops=1)
 Join Filter: (d.rid = a.rid)
 Buffers: shared hit=75459 read=37183
 ->  Nested Loop  (cost=0.87..96362.32 rows=14319 width=28) (actual 
time=0.569..2214.900 rows=14200 loops=1)
   Buffers: shared hit=39949 read=17072
   ->  Index Scan using report_drugs_drug_idx on report_drugs d  
(cost=0.44..507.78 rows=14319 width=8) (actual time=0.567..4.406 rows=14200 
loops=1)
 Index Cond: (drug = ANY 
('{359,360,361,362,363}'::integer[]))
 Buffers: shared hit=14 read=131
   ->  Index Scan using reports_id_key on reports r  
(cost=0.43..6.68 rows=1 width=20) (actual time=0.155..0.155 rows=1 loops=14200)
 Index Cond: (id = d.rid)
 Buffers: shared hit=39935 read=16941
 ->  Index Scan using report_adverses_rid_idx on report_adverses a  
(cost=0.44..0.78 rows=1 width=21) (actual time=0.221..0.221 rows=0 loops=14200)
   Index Cond: (rid = r.id)
   Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific 
back pain","back pain"}'::text[]))
   Rows Removed by Filter: 5
   Buffers: shared hit=35510 read=20111
 Planning time: 13.962 ms
 Execution time: 5357.520 ms

> -- 
> Victor Y. Yegorov

Cheers,
Tim




Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 14:09:04 +1200
David Rowley  wrote:

Hi David

Thanks for your time on this. I tried your proposals with the results below.

> > ...
> > Under 5 ms. The same goes for querying the "adverse" column in the 
> > "report_adverses" table: under 20 ms.
> 
> I'm not sure why you're comparing this to the join plan above. They're
> very different; I can only imagine that it's because you've not quite
> understood what the EXPLAIN output is saying.

Well, that is more due to my poor wording of the question, I guess. I 
understand that both plans are totally different and that the two queries are 
hardly comparable, as two very different things are happening.

My point here was to show that the "drug" column seems to be indexed correctly 
and, even though it is a very large table, the query executes very fast (all 
these execution times are given after flushing the OS's disk cache).
 
> >  ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual 
> > time=0.822..2038.952 rows=14199 loops=1)
> >  ->  Index Scan using report_drugs_drug_idx on report_drugs d  
> > (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 
> > rows=14199 loops=1)
> >Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> >  ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 
> > rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
> >Index Cond: (id = d.rid)
> 
> This is a parameterised nested loop. This means that the inner side of
> the loop (reports) is parameterised by the outer side of the loop; you
> can see the d.rid in the Index Cond. This means that the
> reports_id_key index is looked up 14199 times. You can see from the
> plan that the nested loop here took 2038.952 milliseconds to complete,
> so about 0.144 ms per lookup, not too bad, right?

That is indeed not bad at all, thanks for pointing this out. This nested loop 
is thus responsible for a little over 40% of the execution time.

> The alternative to a parameterised nested loop plan is that a Hash
> Join plan is generated. If you imagine what would happen here, likely
> the matching report_drugs records would be hashed, then the outer side
> of the join would then perform a SeqScan over the entire 6 million
> reports records... probably not better. You can test this yourself
> with; SET enable_nestloop = 0; you might need to SET enable_mergejoin
> = 0; too. I imagine it'll be slower.

Doing this with the current (unaltered) indexes almost doubles the time and it 
seems to behave exactly as you mentioned:

 Sort  (cost=372968.28..372969.07 rows=317 width=41) (actual 
time=9308.174..9308.187 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Hash Join  (cost=130941.07..372955.11 rows=317 width=41) (actual 
time=8131.509..9307.983 rows=448 loops=1)
 Hash Cond: (r.id = d.rid)
 ->  Seq Scan on reports r  (cost=0.00..105465.00 rows=5749300 
width=20) (actual time=0.237..849.564 rows=5749300 loops=1)
 ->  Hash  (cost=130937.10..130937.10 rows=317 width=29) (actual 
time=8130.175..8130.175 rows=448 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 25kB
   ->  Hash Join  (cost=122150.13..130937.10 rows=317 width=29) 
(actual time=8118.130..8130.090 rows=448 loops=1)
 Hash Cond: (d.rid = a.rid)
 ->  Index Scan using report_drugs_drug_idx on report_drugs 
d  (cost=0.44..496.92 rows=13870 width=8) (actual time=0.830..3.503 rows=14200 
loops=1)
   Index Cond: (drug = ANY 
('{359,360,361,362,363}'::integer[]))
 ->  Hash  (cost=121170.69..121170.69 rows=78320 width=21) 
(actual time=8116.544..8116.544 rows=76972 loops=1)
   Buckets: 8192  Batches: 1  Memory Usage: 3609kB
   ->  Bitmap Heap Scan on report_adverses a  
(cost=2024.68..121170.69 rows=78320 width=21) (actual time=19.303..8093.257 
rows=76972 loops=1)
 Recheck Cond: (adverse = ANY ('{"back pain - 
nonspecific","nonspecific back pain","back pain"}'::text[]))
 Heap Blocks: exact=56959
 ->  Bitmap Index Scan on 
report_adverses_adverse_idx  (cost=0.00..2005.10 rows=78320 width=0) (actual 
time=11.756..11.756 rows=76972 loops=1)
   Index Cond: (adverse = ANY ('{"back pain 
- nonspecific","nonspecific back pain","back pain"}'::text[]))
 Planning time: 11.348 ms
 Execution time: 9308.694 ms

> Likely the only way to speed this up would be to create indexes;
> 
> create index on reports (id, age, gender, created);
> ...
> create index on report_adverses (rid, adverse);
> 
> this might speedup the 2nd join a bit.

Done. This does help a bit. The query runs about 1 second faster. The plan:

 Sort  (cost=66065.73..66066.59 rows=344 width=41) (actual 

Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 01:45:55 +
Sameer Kumar  wrote:

Hi Sameer

Thanks for taking the time to look into this!

> > ... 
> Quite clearly the nested loop joins are the most costly operations here.

Indeed.

> > ... 
> I suppose. It might help if the filters are performed before the join. I am
> not an expert on the optimizer, but I guess it might help if you change the
> join order and add duplicate conditions for reports:
> 
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM report_drugs d
> JOIN report_adverses a ON a.rid = d.rid
> JOIN reports r ON d.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back
> pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;

Looks like a nice approach, but it had no effect on the query time. The plan 
for this approach:

 Sort  (cost=104928.07..104928.86 rows=317 width=41) (actual 
time=5435.210..5435.236 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.31..104914.90 rows=317 width=41) (actual 
time=57.230..5434.930 rows=448 loops=1)
 Join Filter: (d.rid = a.rid)
 ->  Nested Loop  (cost=0.87..93919.79 rows=13870 width=28) (actual 
time=0.569..2240.955 rows=14200 loops=1)
   ->  Index Scan using report_drugs_drug_idx on report_drugs d  
(cost=0.44..496.92 rows=13870 width=8) (actual time=0.565..4.678 rows=14200 
loops=1)
 Index Cond: (drug = ANY 
('{359,360,361,362,363}'::integer[]))
   ->  Index Scan using reports_id_key on reports r  
(cost=0.43..6.73 rows=1 width=20) (actual time=0.157..0.157 rows=1 loops=14200)
 Index Cond: (id = d.rid)
 ->  Index Scan using report_adverses_rid_idx on report_adverses a  
(cost=0.44..0.78 rows=1 width=21) (actual time=0.224..0.225 rows=0 loops=14200)
   Index Cond: (rid = r.id)
   Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back 
pain","back pain"}'::text[]))
   Rows Removed by Filter: 5
 Planning time: 18.512 ms
 Execution time: 5435.293 ms


> Or, since you are using INNER JOIN (as far as I understand the concept of
> joins), it won't hurt the result set if the WHERE clause is pushed into the
> INNER JOIN criteria:

Correct. I have tried those as well, but the planner seems to take the exact 
same path and as a result the query time is unchanged.
 
> > ...
> 
> --
> Best Regards
> Sameer Kumar | DB Solution Architect

Cheers,
Tim




Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Victor Yegorov
2016-04-27 2:27 GMT+03:00 Tim van der Linden :

> The plan:
>
> Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual
> time=5143.162..5143.185 rows=448 loops=1)
>Sort Key: r.created
>Sort Method: quicksort  Memory: 60kB
>->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual
> time=54.784..5142.872 rows=448 loops=1)
>  Join Filter: (d.rid = a.rid)
>  ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual
> time=0.822..2038.952 rows=14199 loops=1)
>  ->  Index Scan using report_drugs_drug_idx on report_drugs d
> (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900
> rows=14199 loops=1)
>Index Cond: (drug = ANY
> ('{359,360,361,362,363}'::integer[]))
>  ->  Index Scan using reports_id_key on reports r
> (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1
> loops=14199)
>Index Cond: (id = d.rid)
>  ->  Index Scan using report_adverses_rid_idx on report_adverses a
> (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0
> loops=14199)
>Index Cond: (rid = r.id)
>Filter: (adverse = ANY ('{"back pain -
> nonspecific","nonspecific back pain","back pain"}'::text[]))
>Rows Removed by Filter: 5
> Planning time: 13.994 ms
> Execution time: 5143.235 ms
>

Can you post output of `EXPLAIN (analyze, buffers)`, please?
It'd be good to check how many buffers are hit/read during Index Scans.
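
That is, a sketch of the requested form:

EXPLAIN (ANALYZE, BUFFERS)
SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
ORDER BY r.created;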

I have had a somewhat similar case where, due to misconfigured autovacuum (not 
aggressive enough for the increased volume), bloat accumulated in the indexes. 
Perhaps REINDEX can help here.
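
A sketch of that remedy (REINDEX takes a lock on the table while it rebuilds; 
the index name is taken from the plans above):

REINDEX INDEX report_adverses_rid_idx;
-- or rebuild every index on the table in one go:
REINDEX TABLE report_adverses;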


-- 
Victor Y. Yegorov


Re: [GENERAL] Slow join over three tables

2016-04-26 Thread David Rowley
On 27 April 2016 at 11:27, Tim van der Linden  wrote:
> Hi all
>
> I have asked this question in a somewhat different form on the DBA 
> Stackexchange site, but without much luck 
> (https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables).
>  So I apologize for possible double posting, but I hope this might get a 
> better response on the mailing list.
>
> I'm joining three fairly large tables together, and it is slow. The tables 
> are:
>
> - "reports": 6 million rows
> - "report_drugs": 20 million rows
> - "report_adverses": 20 million rows
>
> The table "reports" holds main report data and has a primary key column "id". 
> The other two tables have a foreign key to that table with "rid". It are 
> those columns that I use to join them together.
>
> All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" 
> columns.
>
> The query:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
> pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
>
> The plan:
>
> Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual 
> time=5143.162..5143.185 rows=448 loops=1)
>Sort Key: r.created
>Sort Method: quicksort  Memory: 60kB
>->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual 
> time=54.784..5142.872 rows=448 loops=1)
>  Join Filter: (d.rid = a.rid)
>  ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual 
> time=0.822..2038.952 rows=14199 loops=1)
>  ->  Index Scan using report_drugs_drug_idx on report_drugs d  
> (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 
> loops=1)
>Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
>  ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 
> rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
>Index Cond: (id = d.rid)
>  ->  Index Scan using report_adverses_rid_idx on report_adverses a  
> (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 
> loops=14199)
>Index Cond: (rid = r.id)
>Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific 
> back pain","back pain"}'::text[]))
>Rows Removed by Filter: 5
> Planning time: 13.994 ms
> Execution time: 5143.235 ms
>
> This takes well over 5 seconds, which, to me, feels much too slow.
> If I query each table directly with the same conditions, thus:
>
> SELECT reason
> FROM report_drugs
> WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);
>
> I get:
>
> Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.44..500.28 
> rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
>   Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> Planning time: 6.939 ms
> Execution time: 4.759 ms
>
> Under 5 ms. The same goes for querying the "adverse" column in the 
> "report_adverses" table: under 20 ms.

I'm not sure why you're comparing this to the join plan above. They're
very different; I can only imagine that it's because you've not quite
understood what the EXPLAIN output is saying.

>  ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual 
> time=0.822..2038.952 rows=14199 loops=1)
>  ->  Index Scan using report_drugs_drug_idx on report_drugs d  
> (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 
> loops=1)
>Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
>  ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 
> rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
>Index Cond: (id = d.rid)

This is a parameterised nested loop. This means that the inner side of
the loop (reports) is parameterised by the outer side of the loop; you
can see the d.rid in the Index Cond. This means that the
reports_id_key index is looked up 14199 times. You can see from the
plan that the nested loop here took 2038.952 milliseconds to complete,
so about 0.144 ms per lookup, not too bad, right?

The alternative to a parameterised nested loop plan is that a Hash
Join plan is generated. If you imagine what would happen here, likely
the matching report_drugs records would be hashed, then the outer side
of the join would then perform a SeqScan over the entire 6 million
reports records... probably not better. You can test this yourself
with; SET enable_nestloop = 0; you might need to SET enable_mergejoin
= 0; too. I imagine it'll be slower.
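
A sketch of that experiment (the settings are session-local, so RESET undoes 
them):

SET enable_nestloop = off;
SET enable_mergejoin = off;  -- possibly needed as well, per the above

EXPLAIN ANALYZE
SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
ORDER BY r.created;

RESET enable_nestloop;
RESET enable_mergejoin;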

Likely the only way to speed this up would be to create indexes;

create index on reports (id, age, gender, created);

the above might allow an index only scan, which should speed up the
nested loop a bit. This will only be useful if you're never going to

Re: [GENERAL] Slow join over three tables

2016-04-26 Thread Sameer Kumar
On Wed, 27 Apr 2016 07:28 Tim van der Linden,  wrote:

> Hi all
>
> I have asked this question in a somewhat different form on the DBA
> Stackexchange site, but without much luck (
> https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables).
> So I apologize for possible double posting, but I hope this might get a
> better response on the mailing list.
>
> I'm joining three fairly large tables together, and it is slow. The tables
> are:
>
> - "reports": 6 million rows
> - "report_drugs": 20 million rows
> - "report_adverses": 20 million rows
>
> The table "reports" holds main report data and has a primary key column
> "id". The other two tables have a foreign key to that table with "rid". It
> are those columns that I use to join them together.
>
> All tables have indexes on the "id"/"rid" columns and on the
> "drug"/"adverse" columns.
>
> The query:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back
> pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
>
> The plan:
>
> Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual
> time=5143.162..5143.185 rows=448 loops=1)
>Sort Key: r.created
>Sort Method: quicksort  Memory: 60kB
>->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual
> time=54.784..5142.872 rows=448 loops=1)
>  Join Filter: (d.rid = a.rid)
>  ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual
> time=0.822..2038.952 rows=14199 loops=1)
>

Quite clearly the nested loop joins are the most costly operations here.


>  ->  Index Scan using report_drugs_drug_idx on report_drugs d
> (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900
> rows=14199 loops=1)
>Index Cond: (drug = ANY
> ('{359,360,361,362,363}'::integer[]))
>  ->  Index Scan using reports_id_key on reports r
> (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1
> loops=14199)
>Index Cond: (id = d.rid)
>  ->  Index Scan using report_adverses_rid_idx on report_adverses a
> (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0
> loops=14199)
>Index Cond: (rid = r.id)
>Filter: (adverse = ANY ('{"back pain -
> nonspecific","nonspecific back pain","back pain"}'::text[]))
>Rows Removed by Filter: 5
>

I suppose. It might help if the filters are performed before the join. I am
not an expert on the optimizer, but I guess it might help if you change the
join order and add duplicate conditions for reports:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM report_drugs d
JOIN report_adverses a ON a.rid = d.rid
JOIN reports r ON d.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back
pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;


Or, since you are using INNER JOIN (as far as I understand the concept of
joins), it won't hurt the result set if the WHERE clause is pushed into the
INNER JOIN criteria:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM report_drugs d
JOIN report_adverses a ON a.rid = d.rid
  AND a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
  AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
JOIN reports r ON d.rid = r.id
ORDER BY r.created;


> Planning time: 13.994 ms
> Execution time: 5143.235 ms
>
> This takes well over 5 seconds, which, to me, feels much too slow.
>

Well, given your hardware (which seems to be pretty decent), I am not sure if 
this is an unfair expectation.


> If I query each table directly with the same conditions, thus:
>
> SELECT reason
> FROM report_drugs
> WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);
>
> I get:
>
> Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.44..500.28
> rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
>   Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> Planning time: 6.939 ms
> Execution time: 4.759 ms
>
> Under 5 ms. The same goes for querying the "adverse" column in the
> "report_adverses" table: under 20 ms.
>

> This indicates to me that indeed the join itself causes a major
> performance bottleneck.
>
> I'm running the cluster from an SSD drive, as a traditional HDD could not
> even manage the query in under 5 minutes. The system has a total memory of
> 24 GB, runs on Debian and uses a 4 GHz 8-core i7-4790 processor.
>
> Some important postgresql.conf readouts:
>
> - shared_buffers = 4GB
> - work_mem = 64MB
> - maintenance_work_mem = 1GB
> - checkpoint_segments = 50
> - checkpoint_completion_target = 0.9
> - autovacuum = on
>
> Is there something I am missing here? Any help on getting this join faster
> is much appreciated.
>
> Cheers,
> Tim
>
>

[GENERAL] Slow join over three tables

2016-04-26 Thread Tim van der Linden
Hi all

I have asked this question in a somewhat different form on the DBA 
Stackexchange site, but without much luck 
(https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables).
 So I apologize for possible double posting, but I hope this might get a better 
response on the mailing list.

I'm joining three fairly large tables together, and it is slow. The tables are:

- "reports": 6 million rows
- "report_drugs": 20 million rows
- "report_adverses": 20 million rows

The table "reports" holds main report data and has a primary key column "id". 
The other two tables have a foreign key to that table with "rid". It are those 
columns that I use to join them together.

All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" 
columns.

The query:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id 
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
pain', 'back pain']) 
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;

The plan:

Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual 
time=5143.162..5143.185 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual 
time=54.784..5142.872 rows=448 loops=1)
 Join Filter: (d.rid = a.rid)
 ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual 
time=0.822..2038.952 rows=14199 loops=1)
 ->  Index Scan using report_drugs_drug_idx on report_drugs d  
(cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 
loops=1)
   Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
 ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 
rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
   Index Cond: (id = d.rid)
 ->  Index Scan using report_adverses_rid_idx on report_adverses a  
(cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 loops=14199)
   Index Cond: (rid = r.id)
   Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific 
back pain","back pain"}'::text[]))
   Rows Removed by Filter: 5
Planning time: 13.994 ms
Execution time: 5143.235 ms

This takes well over 5 seconds, which, to me, feels much too slow.
If I query each table directly with the same conditions, thus:

SELECT reason
FROM report_drugs 
WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);

I get:

Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.44..500.28 
rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
  Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
Planning time: 6.939 ms
Execution time: 4.759 ms

Under 5 ms. The same goes for querying the "adverse" column in the 
"report_adverses" table: under 20 ms.

This indicates to me that indeed the join itself causes a major performance 
bottleneck.

I'm running the cluster from an SSD drive, as a traditional HDD could not even 
manage the query in under 5 minutes. The system has a total memory of 24 GB, 
runs on Debian and uses a 4 GHz 8-core i7-4790 processor.

Some important postgresql.conf readouts:

- shared_buffers = 4GB
- work_mem = 64MB
- maintenance_work_mem = 1GB
- checkpoint_segments = 50
- checkpoint_completion_target = 0.9
- autovacuum = on

Is there something I am missing here? Any help on getting this join faster is 
much appreciated.

Cheers,
Tim

