Re: [PERFORM] query optimization
Total runtime: 258.245 ms
(13 rows)

krichy=> ANALYZE c;
ANALYZE
krichy=> EXPLAIN ANALYZE SELECT * FROM a LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id WHERE a.id = 1;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=2651.29..6419.30 rows=1 width=16) (actual time=83.295..255.653 rows=1 loops=1)
   Hash Cond: (b.id = a.b)
   ->  Hash Join  (cost=2643.00..6036.00 rows=10 width=8) (actual time=83.275..222.373 rows=10 loops=1)
         Hash Cond: (b.id = c.id)
         ->  Seq Scan on b  (cost=0.00..1393.00 rows=10 width=4) (actual time=0.010..35.726 rows=10 loops=1)
         ->  Hash  (cost=1393.00..1393.00 rows=10 width=4) (actual time=83.238..83.238 rows=10 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 2344kB
               ->  Seq Scan on c  (cost=0.00..1393.00 rows=10 width=4) (actual time=0.009..36.243 rows=10 loops=1)
   ->  Hash  (cost=8.28..8.28 rows=1 width=8) (actual time=0.011..0.011 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) (actual time=0.008..0.009 rows=1 loops=1)
               Index Cond: (id = 1)
 Total runtime: 256.008 ms
(13 rows)

So after analyzing all the tables the result changed a lot: PostgreSQL now chooses different plans, and in my case they are effectively much slower. My configuration file has:

work_mem = 16MB

With this removed, the query is fast again, but I don't know why giving the planner more memory makes it choose a worse plan.

Thanks in advance,

Kojedzinszky Richard
Euronet Magyarorszag Informatikai Zrt.

On Thu, 26 Apr 2012, Tom Lane wrote:

Date: Thu, 26 Apr 2012 15:17:18 -0400
From: Tom Lane t...@sss.pgh.pa.us
To: Richard Kojedzinszky kri...@tvnetwork.hu
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] query optimization

> Um ... did you analyze all the tables, or just some of them? I get
> sub-millisecond runtimes if all four tables have been analyzed, but it
> does seem to pick lousy plans if, say, only a and b have been analyzed.
> What you really need for this query structure is the parameterized-path
> work I've been doing for 9.2; but at least on the exact example given,
> I'm not seeing that 9.1 is that much worse.
>
> regards, tom lane

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
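Richard's work_mem observation can be reproduced per-session without editing postgresql.conf. A hedged sketch against the thread's a/b/c tables (the exact crossover point where the planner switches plans depends on the data and the machine):

```sql
-- Compare plans under different work_mem settings in one session.
-- With a larger work_mem, a single-batch hash join looks cheap to the
-- planner; with a smaller one it tends to fall back to the merge or
-- nested-loop plans that were fast in this thread.
SET work_mem = '16MB';
EXPLAIN ANALYZE
SELECT * FROM a
LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id
WHERE a.id = 1;

SET work_mem = '1MB';   -- closer to the default
EXPLAIN ANALYZE
SELECT * FROM a
LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id
WHERE a.id = 1;

RESET work_mem;
```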
[PERFORM] query optimization
Dear list,

We have a database schema which looks the same as the attached script. When filling the tables with data and skipping ANALYZE on the table (so pg_stats contains no records for table 'a'), the first select in the script runs fast, but after an ANALYZE the planner decides to sequential-scan tables b and c, making the query much slower. Can somebody help me solve this issue, or tune our installation so it does not use sequential scans in this case?

Thanks in advance,

Kojedzinszky Richard
Euronet Magyarorszag Informatikai Zrt.

test.sql Description: application/sql
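The attached test.sql is not reproduced in the archive. Judging from the queries and index names quoted later in the thread, a minimal schema that reproduces the behaviour presumably looks something like the following; the column types and row counts here are guesses, not the actual script:

```sql
-- Hypothetical reconstruction of test.sql: three tables joined on id,
-- each with a btree index, populated with enough rows that the planner's
-- choice between index and sequential scans matters.
CREATE TABLE a (id integer, b integer);
CREATE TABLE b (id integer);
CREATE TABLE c (id integer);

CREATE INDEX a_idx1 ON a (id);
CREATE INDEX b_idx1 ON b (id);
CREATE INDEX c_idx1 ON c (id);

INSERT INTO a SELECT i, i FROM generate_series(1, 100000) i;
INSERT INTO b SELECT i    FROM generate_series(1, 100000) i;
INSERT INTO c SELECT i    FROM generate_series(1, 100000) i;

-- The query discussed in the thread:
SELECT * FROM a
LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id
WHERE a.id = 1;
```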
Re: [PERFORM] query optimization
Richard Kojedzinszky kri...@tvnetwork.hu wrote:
> tuning our installation to not to use sequence scans in this case?

Make sure effective_cache_size is set to the sum of shared_buffers and whatever your OS shows as usable for caching. Try adjusting cost factors: maybe random_page_cost between 1 and 2, and cpu_tuple_cost between 0.03 and 0.05.

-Kevin
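Kevin's suggestions as a session-level sketch. The concrete values here are illustrative assumptions, not tuned recommendations; effective_cache_size in particular should approximate shared_buffers plus the OS filesystem cache on the machine in question:

```sql
-- Planner cost knobs can be tried per-session before committing them
-- to postgresql.conf; lower random_page_cost makes index scans look
-- cheaper relative to sequential scans.
SET effective_cache_size = '4GB';   -- shared_buffers + OS cache (assumed)
SET random_page_cost = 1.5;         -- between 1 and 2, per the suggestion
SET cpu_tuple_cost = 0.04;          -- between 0.03 and 0.05

EXPLAIN ANALYZE
SELECT * FROM a
LEFT JOIN (b INNER JOIN c ON b.id = c.id) ON a.b = b.id
WHERE a.id = 1;
```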
Re: [PERFORM] query optimization
Richard Kojedzinszky kri...@tvnetwork.hu writes:
> Dear list, We have a database schema, which looks the same as the attached
> script. When filling the tables with data, and skipping analyze on the
> table (so pg_stats contains no records for table 'a'), the first select in
> the script runs fast, but after an analyze the planner decides to sequence
> scan tables b and c, thus making the query much slower. Can somebody help
> me solving this issue, or tuning our installation to not to use sequence
> scans in this case?

Um ... did you analyze all the tables, or just some of them? I get sub-millisecond runtimes if all four tables have been analyzed, but it does seem to pick lousy plans if, say, only a and b have been analyzed.

What you really need for this query structure is the parameterized-path work I've been doing for 9.2; but at least on the exact example given, I'm not seeing that 9.1 is that much worse.

regards, tom lane
Re: [PERFORM] query optimization
Tom Lane wrote on 26.04.2012 21:17:
> Richard Kojedzinszky kri...@tvnetwork.hu writes:
>> Dear list, We have a database schema, which looks the same as the
>> attached script. [...]
> Um ... did you analyze all the tables, or just some of them? I get
> sub-millisecond runtimes if all four tables have been analyzed, but it
> does seem to pick lousy plans if, say, only a and b have been analyzed.

Here it's similar to Richard's experience. Before analyzing the four tables, the first statement yields this plan:

 Merge Left Join  (cost=504.89..2634509.91 rows=12500 width=16) (actual time=0.103..0.108 rows=1 loops=1)
   Merge Cond: (a.b = b.id)
   ->  Sort  (cost=504.89..506.14 rows=500 width=8) (actual time=0.043..0.043 rows=1 loops=1)
         Sort Key: a.b
         Sort Method: quicksort  Memory: 17kB
         ->  Bitmap Heap Scan on a  (cost=12.14..482.47 rows=500 width=8) (actual time=0.028..0.029 rows=1 loops=1)
               Recheck Cond: (id = 4)
               ->  Bitmap Index Scan on a_idx1  (cost=0.00..12.01 rows=500 width=0) (actual time=0.021..0.021 rows=1 loops=1)
                     Index Cond: (id = 4)
   ->  Materialize  (cost=0.00..884002.52 rows=5000 width=8) (actual time=0.041..0.057 rows=5 loops=1)
         ->  Merge Join  (cost=0.00..759002.52 rows=5000 width=8) (actual time=0.037..0.051 rows=5 loops=1)
               Merge Cond: (b.id = c.id)
               ->  Index Scan using b_idx1 on b  (cost=0.00..4376.26 rows=10 width=4) (actual time=0.016..0.018 rows=5 loops=1)
               ->  Materialize  (cost=0.00..4626.26 rows=10 width=4) (actual time=0.017..0.022 rows=5 loops=1)
                     ->  Index Scan using c_idx1 on c  (cost=0.00..4376.26 rows=10 width=4) (actual time=0.014..0.017 rows=5 loops=1)
 Total runtime: 0.209 ms

This continues to be the plan for about 10-15 repetitions, then it turns to this one:

 Hash Right Join  (cost=2701.29..6519.30 rows=1 width=16) (actual time=79.604..299.227 rows=1 loops=1)
   Hash Cond: (b.id = a.b)
   ->  Hash Join  (cost=2693.00..6136.00 rows=10 width=8) (actual time=79.550..265.251 rows=10 loops=1)
         Hash Cond: (b.id = c.id)
         ->  Seq Scan on b  (cost=0.00..1443.00 rows=10 width=4) (actual time=0.011..36.158 rows=10 loops=1)
         ->  Hash  (cost=1443.00..1443.00 rows=10 width=4) (actual time=79.461..79.461 rows=10 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 2735kB
               ->  Seq Scan on c  (cost=0.00..1443.00 rows=10 width=4) (actual time=0.010..32.930 rows=10 loops=1)
   ->  Hash  (cost=8.28..8.28 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
               Index Cond: (id = 4)
 Total runtime: 299.564 ms

(I guess autovacuum kicked in, because that is the same plan I get when running ANALYZE on all four tables right after populating them.)

And the second one yields this one here (regardless of analyze or not):

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..16.89 rows=1 width=16) (actual time=0.027..0.031 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=0.00..16.57 rows=1 width=12) (actual time=0.020..0.022 rows=1 loops=1)
         ->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) (actual time=0.011..0.012 rows=1 loops=1)
               Index Cond: (id = 4)
         ->  Index Scan using b_idx1 on b  (cost=0.00..8.28 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=1)
               Index Cond: (a.b = id)
   ->  Index Scan using c_idx1 on c  (cost=0.00..0.31 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=1)
         Index Cond: (b.id = id)
 Total runtime: 0.104 ms

My version: PostgreSQL 9.1.3, compiled by Visual C++ build 1500, 32-bit, running on Windows XP SP3

shared_buffers = 768MB
work_mem = 24MB
effective_cache_size = 1024MB

All other (relevant) settings are at their defaults.

Regards
Thomas
Re: [PERFORM] query optimization
Thomas Kellerer spam_ea...@gmx.net writes:
> Tom Lane wrote on 26.04.2012 21:17:
>> Um ... did you analyze all the tables, or just some of them? I get
>> sub-millisecond runtimes if all four tables have been analyzed, but it
>> does seem to pick lousy plans if, say, only a and b have been analyzed.
> Here it's similar to Richard's experience: Before analyzing the four
> tables, the first statement yields this plan: [ merge joins ]
> This continues to stay the plan for about 10-15 repetitions, then it
> turns to this plan [ hash joins ]

Hmm. I see it liking the merge-join plan (with minor variations) with or without analyze data, but if just some of the tables have been analyzed, it goes for the hash plan, which is a good deal slower. The cost estimates aren't that far apart, though. In any case, the only reason the merge join is so fast is that the data is perfectly ordered in each table; on a less contrived example, it could well be a lot slower.

> And the second one yields this one here (Regardless of analyze or not):

Yeah, the trick there is that it's valid to re-order the joins, since they're both left joins.
In git HEAD I get something like this:

regression=# explain analyze select * from a left join (b inner join c on b.id = c.id) on a.b = b.id where a.id = 4;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..17.18 rows=1 width=16) (actual time=0.024..0.026 rows=1 loops=1)
   ->  Index Scan using a_idx1 on a  (cost=0.00..8.38 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=1)
         Index Cond: (id = 4)
   ->  Nested Loop  (cost=0.00..8.80 rows=1 width=8) (actual time=0.011..0.012 rows=1 loops=1)
         ->  Index Only Scan using b_idx1 on b  (cost=0.00..8.38 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
               Index Cond: (id = a.b)
               Heap Fetches: 1
         ->  Index Only Scan using c_idx1 on c  (cost=0.00..0.41 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=1)
               Index Cond: (id = b.id)
               Heap Fetches: 1
 Total runtime: 0.080 ms
(11 rows)

but 9.1 and older are not smart enough to do it like that when they can't re-order the joins.

regards, tom lane
Re: [PERFORM] query optimization
On 04/26/2012 04:08 PM, Tom Lane wrote:
> Hmm. I see it liking the merge-join plan (with minor variations) with or
> without analyze data, but if just some of the tables have been analyzed,
> it goes for the hash plan which is a good deal slower. The cost estimates
> aren't that far apart though. In any case, the only reason the merge join
> is so fast is that the data is perfectly ordered in each table; on a less
> contrived example, it could well be a lot slower.

It's not so terribly contrived, is it? It's common enough to have tables which are append-only and to join them by something that corresponds to the append order (serial field, timestamp, etc.)

cheers

andrew
Re: [PERFORM] Query optimization using order by and limit
Stephen Frost sfr...@snowman.net writes:
> What I think is happening here is that PG is pushing down that filter
> (not typically a bad thing..), but with that condition, it's going to
> scan the index until it finds a match for that filter before returning
> back up only to have that result cut out due to the limit.

Yeah, it's spending quite a lot of time finding the first matching row in each child table. I'm curious why that is though; are the child tables not set up with nonoverlapping firstloadtime ranges?

> What would be great is if PG would realize that the CHECK constraints
> prevent earlier records from being in these earlier tables,

The explain shows that that isn't the case, because it *is* finding at least one candidate row in each table. It's just running quite far into the firstloadtime sequence to do it.

If you're stuck with this table arrangement, one thing that would help is a two-column index on (host_guid, firstloadtime) on each child table. That would match the search condition exactly, and so reduce the cost to find the first matching row to nearly nil. Whether this query's speed is important enough to justify maintaining such an index is a question I can't answer for you.

regards, tom lane
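Tom's suggested two-column index, written out against one of the child tables named later in the thread (one such index would be needed per child table):

```sql
-- Matches the "WHERE host_guid = ... ORDER BY firstloadtime DESC" shape
-- of the query exactly, so the first qualifying row in this child table
-- can be found directly from the index rather than by walking the
-- single-column firstloadtime index and filtering.
CREATE INDEX idx_osmoduleloads_2011_09_14_host_guid_firstloadtime
    ON osmoduleloads_2011_09_14 (host_guid, firstloadtime);
```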
Re: [PERFORM] Query optimization using order by and limit
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Yeah, it's spending quite a lot of time finding the first matching row
> in each child table. I'm curious why that is though; are the child
> tables not set up with nonoverlapping firstloadtime ranges?

They are set up w/ nonoverlapping firstloadtime ranges, using CHECK constraints such as:

osmoduleloads_2011_09_14_firstloadtime_check CHECK (firstloadtime >= 1296044640::bigint::numeric AND firstloadtime < 1296053280::bigint::numeric)

The issue here is that the query is saying "Give me the first 150 records with this host_id in this week-long range." PG happily eliminates all the tables that are outside of the week-long range during constraint exclusion. After that, however, it hunts down the earliest records (which match 'host_id') from each child table. Sure, from each table there's a record in the week-long range with a host_id that matches. What PG doesn't realize is that it can stop after pulling the 150 records from the most recent table (and flipping the direction of the query or the tables doesn't help: PG still pulls a record from each table).

> > What would be great is if PG would realize that the CHECK constraints
> > prevent earlier records from being in these earlier tables,
> The explain shows that that isn't the case, because it *is* finding at
> least one candidate row in each table. It's just running quite far into
> the firstloadtime sequence to do it.

My point above is that the CHECK constraints ensure an ordering which could be leveraged to use the latest table first and then stop if enough tuples are returned (or immediately go to the next table), without ever considering the other tables. I'm not looking for PG to eliminate those other tables from consideration in all cases; if the limit is large enough, it may get all the way down to them.

I'm pretty sure this isn't something which PG does today and I don't expect teaching it to do this to be trivial, but it certainly would be nice, as this strikes me as a very common use-case.

> If you're stuck with this table arrangement, one thing that would help
> is a two-column index on (host_guid, firstloadtime) on each child table.

Agreed, I mentioned this to the OP previously and it's on his list of things to try.

Thanks,

Stephen
Re: [PERFORM] Query optimization using order by and limit
Stephen Frost sfr...@snowman.net writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Yeah, it's spending quite a lot of time finding the first matching row
>> in each child table. I'm curious why that is though; are the child
>> tables not set up with nonoverlapping firstloadtime ranges?
> The issue here is that the query is saying "Give me the first 150
> records with this host_id in this week-long range."

Oh, I see. So the query range overlaps multiple child tables, even after constraint exclusion eliminates a lot of them.

> My point above is that the CHECK constraints ensure an ordering which
> could be leveraged to use the latest table first and then stop if enough
> tuples are returned (or immediately go to the next table), without ever
> considering the other tables.

Yeah. My opinion is that trying to reverse-engineer that from the CHECK constraints would cost a lot more than it's worth. What we need, and will hopefully have sooner or later, is an abstract concept of partitioned table in which this kind of relationship is known a-priori instead of having to be laboriously re-deduced every time we plan a query.

>> If you're stuck with this table arrangement, one thing that would help
>> is a two-column index on (host_guid, firstloadtime) on each child table.
> Agreed, I mentioned this to the OP previously and it's on his list of
> things to try.

AFAICS the fact that this example would be fast if we were only paying attention to the newest table is mere luck. If it can take a long time to find the first matching host_guid record in several of the child tables, why might it not take just as long to find said record in the other one? I think you really need the two-column indexes, if keeping this query's runtime to a minimum is critical.

regards, tom lane
Re: [PERFORM] Query optimization using order by and limit
On Wed, Sep 21, 2011 at 11:22:53PM -0400, Tom Lane wrote:
> Michael Viscuso michael.visc...@getcarbonblack.com writes:
>> Greg/Tom, you are correct, these columns should be modified to whatever
>> is easiest for Postgres to recognize 64-bit unsigned integers. Would
>> you still recommend bigint for unsigned integers? I likely read the
>> wrong documentation that suggested bigint for signed 64-bit integers
>> and numeric(20) for unsigned 64-bit integers.
> Unsigned? Oh, hm, that's a bit of a problem because we don't have any
> unsigned types. If you really need to go to 2^64 and not 2^63 then
> you're stuck with numeric ... but that last bit is costing ya a lot.
>
> regards, tom lane

Hi Michael,

If you have access to the application, you can map the unsigned 64-bit values to the PostgreSQL signed 64-bit type with a simple subtraction. That will allow you to drop all the numeric use. Also, if the guid is a 64-bit value stuffed into a numeric(20), you can do it there as well. I achieved a hefty performance boost by making those application-level changes in a similar situation.

Regards,
Ken
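Ken's subtraction trick, sketched in SQL: shifting an unsigned 64-bit value down by 2^63 maps the full unsigned range onto bigint while preserving order, so comparisons, sorts and range constraints keep working; only the stored values are offset.

```sql
-- Subtracting 2^63 maps [0, 2^64) onto [-2^63, 2^63), i.e. exactly the
-- bigint range, monotonically. The largest unsigned 64-bit value maps
-- to the largest bigint:
SELECT (18446744073709551615::numeric - 9223372036854775808::numeric)::bigint;
-- => 9223372036854775807

-- And zero maps to the smallest bigint:
SELECT (0::numeric - 9223372036854775808::numeric)::bigint;
-- => -9223372036854775808
```

The application applies the same offset when writing and reading, so no precision is lost and the numeric type (with its per-row overhead) can be dropped entirely.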
Re: [PERFORM] Query optimization using order by and limit
Thanks Ken,

I'm discussing with my coworker how to best make that change *as we speak*. Do you think this will also resolve the original issue I'm seeing, where the query doesn't limit out properly and spends time in child tables that won't yield any results? I was hoping that by using the check constraints I could query over a week's or month's worth of partitioned tables, and the combination of order by and limit would eliminate any time spent searching unnecessary tables, but that doesn't appear to be true. (I'm still very new to high-end Postgres performance, so I could be mistaken.)

Regardless, in the meantime I'll switch those columns to bigint instead of numeric and have an update as soon as possible. Thanks for your help!

Mike

On 9/22/2011 9:41 AM, k...@rice.edu wrote:
> If you have access to the application, you can map the unsigned 64-bits
> to the PostgreSQL signed 64-bit type with a simple subtraction. That
> will allow you to drop all the numeric use. [...]
Re: [PERFORM] Query optimization using order by and limit
* Michael Viscuso (michael.visc...@getcarbonblack.com) wrote:
> Adding the final condition hosts_guid = '2007075705813916178' is what
> ultimately kills it http://explain.depesz.com/s/8zy. By adding the
> host_guid, it spends considerably more time in the older tables than
> without this condition and I'm not sure why.

What I think is happening here is that PG is pushing down that filter (not typically a bad thing..), but with that condition, it's going to scan the index until it finds a match for that filter before returning back up, only to have that result cut out due to the limit. Having it as numerics isn't helping here, but the bigger issue is having to check all those tuples for a match to the filter. Mike, the filter has to be applied before the order by/limit, since those clauses come after the filter has been applied (you wouldn't want a 'where x = 2 limit 10' to return early just because it found 10 records where x didn't equal 2).

What would be great is if PG would realize that the CHECK constraints prevent earlier records from being in these earlier tables, so it shouldn't need to consider them at all once the records from the 'latest' table have been found and the limit reached (reverse all this for an 'ascending' query, of course), which we can do when there's no order by. I don't believe we have that kind of logic or that information available at this late stage: the CHECK constraints are used to eliminate the impossible-to-match tables, but that's it.

One option, which isn't great of course, would be to implement your own 'nested loop' construct (something I typically despise..) in the application, which just walks backwards from the latest and pulls whatever records it can from each day and then stops once it hits the limit.

Thanks,

Stephen
Re: [PERFORM] Query optimization using order by and limit
Stephen,

I spent the better part of the day implementing an application-layer nested loop and it seems to be working well. Of course it's a little slower than a Postgres-only solution, because it has to pass data back and forth for each daily table query until it reaches the limit, but at least I don't have runaway queries like I was seeing before. That should be a pretty good stopgap solution for the time being.

I was really hoping there was a Postgres-exclusive answer though! :) If there are any other suggestions, it's a simple flag in my application to query the other way again...

Thanks for all your help - and I'm still looking to change those numerics to bigints, just haven't figured out the best way yet.

Mike

On 9/22/2011 10:53 AM, Stephen Frost wrote:
> One option, which isn't great of course, would be to implement your own
> 'nested loop' construct (something I typically despise..) in the
> application which just walks backwards from the latest and pulls
> whatever records it can from each day and then stops once it hits the
> limit. [...]
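The per-day loop Michael describes can also be approximated in a single SQL statement, at the price of naming the child tables explicitly. A hedged sketch: the child-table names here are assumed from the thread's naming scheme, and the host_guid and limit of 251 come from the query quoted elsewhere in the thread.

```sql
-- Each branch carries its own ORDER BY ... LIMIT, so each child table can
-- stop early on its (host_guid, firstloadtime) index; the outer sort then
-- merges at most 2 * 251 rows instead of scanning every partition.
SELECT * FROM (
    (SELECT * FROM osmoduleloads_2011_09_15          -- hypothetical child table
      WHERE host_guid = '2007075705813916178'
      ORDER BY firstloadtime DESC LIMIT 251)
    UNION ALL
    (SELECT * FROM osmoduleloads_2011_09_14
      WHERE host_guid = '2007075705813916178'
      ORDER BY firstloadtime DESC LIMIT 251)
) merged
ORDER BY firstloadtime DESC
LIMIT 251;
```

The obvious drawback, as with the application-level loop, is that the list of child tables has to be maintained by hand (or generated by the application) as new daily partitions are created.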
Re: [PERFORM] Query optimization using order by and limit
Mike,

* Michael Viscuso (michael.visc...@getcarbonblack.com) wrote:
> I spent the better part of the day implementing an application layer
> nested loop and it seems to be working well. Of course it's a little
> slower than a Postgres only solution because it has to pass data back
> and forth for each daily table query until it reaches the limit, but at
> least I don't have runaway queries like I was seeing before. That
> should be a pretty good stopgap solution for the time being.

Glad to hear that you were able to get something going which worked for you.

> I was really hoping there was a Postgres exclusive answer though! :) If
> there are any other suggestions, it's a simple flag in my application to
> query the other way again...

I continue to wonder if some combination of multi-column indexes might have made the task of finding the 'lowest' record from each of the tables fast enough that it wouldn't be an issue.

> Thanks for all your help - and I'm still looking to change those
> numerics to bigints, just haven't figured out the best way yet.

Our timestamps are also implemented using 64-bit integers and would allow you to use all the PG date/time functions and operators. Just a thought.

Thanks,

Stephen
Re: [PERFORM] Query optimization using order by and limit
Stephen,

Yes, I couldn't agree more. The next two things I will be looking at very carefully are the timestamps and indexes. I will reply to this post if either dramatically helps. Thanks again for all your help. My eyes were starting to bleed from staring at explain logs!

Mike

On Thu, Sep 22, 2011 at 7:14 PM, Stephen Frost sfr...@snowman.net wrote:
> I continue to wonder if some combination of multi-column indexes might
> have made the task of finding the 'lowest' record from each of the
> tables fast enough that it wouldn't be an issue. [...]
[PERFORM] Query optimization using order by and limit
First of all, thank you for taking the time to review my question. After attending the PostgresOpen conference in Chicago last week, I've been poring over explain logs for hours on end, and although my system is MUCH better, I still can't resolve a few issues. Luckily my data is pretty well structured, so solving one issue will likely solve many more, so I'll start with this one.

Version: PostgreSQL 9.1rc1, compiled by Visual C++ build 1500, 64-bit
OS: Windows 7 64-bit
ORM: SQLAlchemy

Postgres table structure: I have daily partitioned tables for each of 4 core tables (the tables with the majority of my application's data). Each daily table inherits from its parent. I do not explicitly define a REFERENCE between these tables because I cannot guarantee the order in which the events are inserted into the database, but where there are references, the referenced row should exist in the other's daily table. The reason I partitioned the data in this manner is to increase query speed and make it easy to archive old data. (I'm new to high-end Postgres performance, so there are likely several fundamental flaws in my assumptions. I won't turn down any recommendation.)
An example of a daily partitioned table follows:

cb=# \d osmoduleloads_2011_09_14;
                       Table "public.osmoduleloads_2011_09_14"
        Column         |            Type             |                  Modifiers
-----------------------+-----------------------------+---------------------------------------------
 guid                  | numeric(20,0)               | not null
 osprocess_guid        | numeric(20,0)               | not null
 filepath_guid         | numeric(20,0)               | not null
 firstloadtime         | numeric(20,0)               | not null
 md5hash               | bytea                       | not null
 host_guid             | numeric(20,0)               | default NULL::numeric
 process_create_time   | numeric(20,0)               | default NULL::numeric
 process_filepath_guid | numeric(20,0)               | default NULL::numeric
 event_time            | timestamp without time zone | default '2011-09-14 00:00:00'::timestamp without time zone
Indexes:
    "osmoduleloads_2011_09_14_pkey" PRIMARY KEY, btree (guid)
    "idx_osmoduleloads_2011_09_14_filepath_guid" btree (filepath_guid)
    "idx_osmoduleloads_2011_09_14_firstload_time" btree (firstloadtime)
    "idx_osmoduleloads_2011_09_14_host_guid" btree (host_guid)
    "idx_osmoduleloads_2011_09_14_md5hash" btree (md5hash)
    "idx_osmoduleloads_2011_09_14_osprocess_guid" btree (osprocess_guid)
Check constraints:
    "osmoduleloads_2011_09_14_event_time_check" CHECK (event_time = '2011-09-14 00:00:00'::timestamp without time zone)
    "osmoduleloads_2011_09_14_firstloadtime_check" CHECK (firstloadtime >= 1296044640::bigint::numeric AND firstloadtime < 1296053280::bigint::numeric)
Inherits: osmoduleloads

Objective: The firstloadtime check constraint ensures that the record is applicable to that daily table. (In case you were wondering, the large numerics correspond to the Windows 100-nanosecond-intervals-since-the-Epoch timestamp format.) I'm inserting millions of records into each daily table, so query slowness is quite easy to spot. Given that there is so much data per daily table, I was hoping to use the order by and limit clauses to stop a query once it satisfied the limit clause and not be forced to visit each daily table. However, I'm spending way too much time in the older tables, which leads me to believe that I'm doing something wrong.
For ease of viewing, my explain analyze can be found at http://explain.depesz.com/s/tot

I'm still very new to this, so I'm not sure whether explain.depesz.com saves the original query. It wasn't readily apparent that it did, so here is the original query:

SELECT osm_1.*, storefiles_1.*, filepaths_1.*, filepaths_2.*
FROM (SELECT * FROM osmoduleloads
      JOIN hosts ON hosts.guid = osmoduleloads.host_guid
      WHERE hosts.guid = '2007075705813916178'
        AND osmoduleloads.firstloadtime >= 1296043200
        AND osmoduleloads.firstloadtime < 12961036799000
        AND hosts.enabled = true
        AND hosts.user_id = 111
      ORDER BY osmoduleloads.firstloadtime DESC
      LIMIT 251) AS osm_1
LEFT OUTER JOIN storefiles AS storefiles_1 ON osm_1.md5hash = storefiles_1.md5hash
LEFT OUTER JOIN filepaths AS filepaths_1 ON osm_1.process_filepath_guid = filepaths_1.guid AND osm_1.event_time = filepaths_1.event_time
LEFT OUTER JOIN filepaths AS filepaths_2 ON osm_1.filepath_guid = filepaths_2.guid AND osm_1.event_time = filepaths_2.event_time
ORDER BY osm_1.firstloadtime DESC;

Hopefully my assumptions about ORDER BY and LIMIT are correct and this query can be optimized. Again, I appreciate any help you can lend. Thanks in advance. Mike
Re: [PERFORM] Query optimization using order by and limit
On 09/21/2011 07:14 PM, Michael Viscuso wrote:
Check constraints:
    "osmoduleloads_2011_09_14_event_time_check" CHECK (event_time = '2011-09-14 00:00:00'::timestamp without time zone)
    "osmoduleloads_2011_09_14_firstloadtime_check" CHECK (firstloadtime >= 1296044640::bigint::numeric AND firstloadtime < 1296053280::bigint::numeric)
Inherits: osmoduleloads

That weird casting can't be helping. I'm not sure if it's your problem here, but the constraint exclusion code is pretty picky about matching the thing you're looking for against the CHECK constraint, and this is a messy one. The bigint conversion in the middle there isn't doing anything useful for you anyway; you really should simplify this to just look like this:

firstloadtime >= 1296044640::numeric

SELECT osm_1.*, storefiles_1.*, filepaths_1.*, filepaths_2.* FROM (SELECT * FROM osmoduleloads JOIN hosts ON hosts.guid = osmoduleloads.host_guid WHERE hosts.guid = '2007075705813916178' AND osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000 AND hosts.enabled = true AND hosts.user_id = 111 ORDER BY osmoduleloads.firstloadtime DESC LIMIT 251) AS osm_1 LEFT OUTER JOIN storefiles AS storefiles_1 ON osm_1.md5hash = storefiles_1.md5hash LEFT OUTER JOIN filepaths AS filepaths_1 ON osm_1.process_filepath_guid = filepaths_1.guid AND osm_1.event_time = filepaths_1.event_time LEFT OUTER JOIN filepaths AS filepaths_2 ON osm_1.filepath_guid = filepaths_2.guid AND osm_1.event_time = filepaths_2.event_time ORDER BY osm_1.firstloadtime DESC;

What you should start with here is confirming whether or not a simpler query touches all of the partitions or just the ones you expect it to. A simpler one like this:

SELECT * FROM osmoduleloads WHERE osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000;

would be the place to begin. Once you've got that working, then you can build up more pieces, and see if one of them results in the query not excluding partitions anymore.
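Greg's two suggestions, simplifying the constraint and probing with the smallest possible query, might look like this in practice (constraint name taken from the \d output above; the constants are the ones quoted in the thread):

```sql
-- Replace the double-cast CHECK with a plain numeric comparison, which the
-- constraint exclusion matcher can line up against the query's WHERE clause.
ALTER TABLE osmoduleloads_2011_09_14
    DROP CONSTRAINT osmoduleloads_2011_09_14_firstloadtime_check;
ALTER TABLE osmoduleloads_2011_09_14
    ADD CONSTRAINT osmoduleloads_2011_09_14_firstloadtime_check
    CHECK (firstloadtime >= 1296044640::numeric
       AND firstloadtime <  1296053280::numeric);

-- Then verify which partitions a bare range query plans against:
EXPLAIN SELECT * FROM osmoduleloads
 WHERE firstloadtime >= 1296043200
   AND firstloadtime <  12961036799000;
```

If the EXPLAIN output lists only the expected daily children, exclusion is working and the complexity can be added back one piece at a time.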
I can't figure out if you're running into a basic error here, where constraint exclusion just isn't working at all, or if you are only having this problem because the query is too complicated. Figuring that out will narrow the potential solutions. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
Re: [PERFORM] Query optimization using order by and limit
Greg Smith g...@2ndquadrant.com writes: That weird casting can't be helping. I'm not sure if it's your problem here, but the constraint exclusion code is pretty picky about matching the thing you're looking for against the CHECK constraint, and this is a messy one. The bigint conversion in the middle there isn't doing anything useful for you anyway; you really should simplify this to just look like this: firstloadtime >= 1296044640::numeric

I have a more aggressive suggestion: change all the numeric(20,0) fields to bigint. Unless the OP actually needs values wider than 64 bits, the choice to use numeric is a significant performance penalty for nothing.

regards, tom lane

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
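A sketch of Tom's suggestion, assuming the stored values actually fit in 63 bits. Note that ALTER COLUMN ... TYPE rewrites each table, so on a large partitioned set this is a substantial operation and may be better done per partition during a maintenance window:

```sql
-- Altering the parent recurses to the inheriting daily children.
-- This rewrites every table and rebuilds the affected indexes.
ALTER TABLE osmoduleloads
    ALTER COLUMN guid TYPE bigint,
    ALTER COLUMN firstloadtime TYPE bigint;
```

The CHECK constraints on each child would then compare plain bigint literals, which is both cheaper and easier for the constraint exclusion matcher.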
Re: [PERFORM] Query optimization using order by and limit
Thanks guys,

First of all, I should have included my postgresql.conf file with the original submission - sorry about that. It is now attached. Based on a recommendation, I also should have shown the parent/child relationship between osmoduleloads and its daily partitioned tables. To reduce clutter, it is at the end of this message.

Taking this one step at a time, and taking Greg's second suggestion first, issuing

select * from osmoduleloads WHERE osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000;

appears to only query the appropriate daily tables (2011_09_13 through 2011_09_20 - http://explain.depesz.com/s/QCG), so it appears that constraint_exclusion is working properly. Putting a limit on the query like:

select * from osmoduleloads WHERE osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000 limit 251;

has the result that I'd expect to see: http://explain.depesz.com/s/O7fZ. Ordering by firstloadtime AND limiting like:

select * from osmoduleloads WHERE osmoduleloads.firstloadtime >= 1296043200 AND osmoduleloads.firstloadtime < 12961036799000 order by firstloadtime desc limit 251;

also has the result that I'd expect to see: http://explain.depesz.com/s/RDh. Adding the hosts join condition to the mix was still OK: http://explain.depesz.com/s/2Ns. Adding the hosts.enabled condition was still OK: http://explain.depesz.com/s/UYN. Adding hosts.user_id = 111 started the descent, but it appears to still be obeying the constraint exclusion that I'd expect, just with a ton of rows returned from the most recent daily tables: http://explain.depesz.com/s/4WE. Adding the final condition hosts.guid = '2007075705813916178' is what ultimately kills it: http://explain.depesz.com/s/8zy. By adding the host_guid, it spends considerably more time in the older tables than without this condition, and I'm not sure why.
Thanks Greg for the recommendation to step through it like that - hopefully this helps get us closer to a resolution. Greg/Tom, you are correct: these columns should be modified to whatever is easiest for Postgres to recognize as 64-bit unsigned integers. Would you still recommend bigint for unsigned integers? I likely read the wrong documentation that suggested bigint for signed 64-bit integers and numeric(20) for unsigned 64-bit integers. Thanks again for all your help! Perhaps 15 hours of poring over explain logs will finally pan out! Mike

cb=# \d+ osmoduleloads;
                              Table "public.osmoduleloads"
        Column         |            Type             |       Modifiers       | Storage  | Description
-----------------------+-----------------------------+-----------------------+----------+-------------
 guid                  | numeric(20,0)               | not null              | main     |
 osprocess_guid        | numeric(20,0)               | not null              | main     |
 filepath_guid         | numeric(20,0)               | not null              | main     |
 firstloadtime         | numeric(20,0)               | not null              | main     |
 md5hash               | bytea                       | not null              | extended |
 host_guid             | numeric(20,0)               | default NULL::numeric | main     |
 process_create_time   | numeric(20,0)               | default NULL::numeric | main     |
 process_filepath_guid | numeric(20,0)               | default NULL::numeric | main     |
 event_time            | timestamp without time zone |                       | plain    |
Indexes:
    "osmoduleloads_pkey" PRIMARY KEY, btree (guid)
Child tables: osmoduleloads_2001_12_31, osmoduleloads_2010_10_11, osmoduleloads_2010_10_12, osmoduleloads_2010_10_13, osmoduleloads_2011_07_27, osmoduleloads_2011_08_04, osmoduleloads_2011_08_05, osmoduleloads_2011_08_06, osmoduleloads_2011_08_07, osmoduleloads_2011_08_08, osmoduleloads_2011_08_09, osmoduleloads_2011_08_10, osmoduleloads_2011_08_11, osmoduleloads_2011_08_12, osmoduleloads_2011_08_13, osmoduleloads_2011_08_14, osmoduleloads_2011_08_15, osmoduleloads_2011_08_16, osmoduleloads_2011_08_17, osmoduleloads_2011_08_18, osmoduleloads_2011_08_19, osmoduleloads_2011_08_20, osmoduleloads_2011_08_21, osmoduleloads_2011_08_22, osmoduleloads_2011_08_23, osmoduleloads_2011_08_24, osmoduleloads_2011_08_25, osmoduleloads_2011_08_26, osmoduleloads_2011_08_27, osmoduleloads_2011_08_28,
osmoduleloads_2011_08_29, osmoduleloads_2011_08_30, osmoduleloads_2011_08_31,
Re: [PERFORM] Query optimization using order by and limit
Michael Viscuso michael.visc...@getcarbonblack.com writes: Greg/Tom, you are correct, these columns should be modified to whatever is easiest for Postgres to recognize 64-bit unsigned integers. Would you still recommend bigint for unsigned integers? I likely read the wrong documentation that suggested bigint for signed 64-bit integers and numeric(20) for unsigned 64-bit integers. Unsigned? Oh, hm, that's a bit of a problem because we don't have any unsigned types. If you really need to go to 2^64 and not 2^63 then you're stuck with numeric ... but that last bit is costing ya a lot. regards, tom lane
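If the full unsigned 64-bit range really is required, one workaround (my suggestion, not something from the thread) is to shift values by 2^63 so they fit in a signed bigint while preserving sort order. The helper function name here is hypothetical:

```sql
-- Map an unsigned 64-bit value [0, 2^64-1] onto the signed bigint range
-- [-2^63, 2^63-1]. Subtracting the constant offset preserves ordering,
-- so btree indexes and range predicates keep working.
CREATE FUNCTION u64_to_bigint(v numeric) RETURNS bigint AS $$
    SELECT (v - 9223372036854775808)::bigint;
$$ LANGUAGE sql IMMUTABLE;

-- The largest u64 maps to the largest bigint:
SELECT u64_to_bigint(18446744073709551615);  -- 9223372036854775807
```

Application code has to apply the same offset when reading values back, which is why this trick only pays off when the numeric-vs-bigint penalty really matters.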
Re: [PERFORM] Query optimization help
Hi,

On 30 August 2011 15:36, Szymon Kosok szy...@mwg.pl wrote: Hello, I asked that question on StackOverflow, but didn't get any valuable response, so I'll ask it here. :) I have such query:

Could you please re-post your explain using this web site: http://explain.depesz.com/ and post a link to the StackOverflow question? What is your Postgres version? Database settings?

I see a huge discrepancy between predicted and actual row numbers (like 1264420 vs 485). I would try the following:
- check column statistics (pg_stats) and focus on the following columns: n_distinct, null_frac, most_common_vals. If they are way off from the actual values then you should tweak the (auto)analyze process: run manual/auto analyze more often (check pg_stat_user_tables), increase default_statistics_target (per column or globally)
- try to disable nested loop join (set enable_nestloop=off)

-- Ondrej Ivanic (ondrej.iva...@gmail.com)
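Ondrej's checklist translates roughly into the following commands (the table name is a placeholder, since the actual query had not been posted yet at this point in the thread):

```sql
-- Inspect the planner's per-column statistics for suspicious values:
SELECT attname, n_distinct, null_frac, most_common_vals
  FROM pg_stats
 WHERE tablename = 'your_table';

-- Check when (auto)analyze last ran on your tables:
SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables;

-- As an experiment only, per session, to see whether the planner's
-- preference for nested loops is the problem:
SET enable_nestloop = off;
```

Disabling nested loops is a diagnostic tool, not a fix; if it helps, the underlying row-count estimates are usually what need attention.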
Re: [PERFORM] Query optimization help
2011/8/30 Ondrej Ivanič ondrej.iva...@gmail.com: Could you please re-post your explain using this web site: http://explain.depesz.com/ and post links to Stackoverflow question?

Here it is: http://explain.depesz.com/s/Iaa

- try to disable nested loop join (set enable_nestloop=off)

Even worse performance (http://explain.depesz.com/s/mMi). My configuration: http://pastie.org/2453148 (copied and pasted only the uncommented important variables). It's decent hardware: i7, 16 GB of RAM, 3x2 RAID 10 (7200rpm) for OS + data, RAID 1 (2 disks, 7200rpm) for WAL, RAID controller with BBU and 512 MB memory cache (cache is set to write only).

PS. Sorry Ondrej, accidentally I sent the reply to you, not to the list.
Re: [PERFORM] Query optimization help
Hi,

2011/8/30 Szymon Kosok szy...@mwg.pl: Here it is: http://explain.depesz.com/s/Iaa

Thanks. I would try to materialise the spoleczniak_tablica part first. Your query looks like this:

select ... from spoleczniak_tablica inner join ... where ... order by spoleczniak_tablica.id desc limit 21

so I would rewrite your query like this:

select ... from (select ... from spoleczniak_tablica where ... order by spoleczniak_tablica.id desc limit 21) as x inner join ...

-- Ondrej Ivanic (ondrej.iva...@gmail.com)
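Concretely, with the table names from Szymon's query, the suggested shape would be something like the sketch below. The filter is abbreviated to one branch; the rewrite is only equivalent when the WHERE clause references spoleczniak_tablica columns alone:

```sql
-- Push ORDER BY/LIMIT into a subquery so at most 21 rows are joined,
-- instead of filtering and sorting after the join.
SELECT x.*, p.*
FROM (SELECT t.*
        FROM spoleczniak_tablica t
       WHERE t.postac_id = 1           -- plus the remaining OR branches
       ORDER BY t.id DESC
       LIMIT 21) AS x
JOIN postac_postacie p ON x.postac_id = p.id
ORDER BY x.id DESC;
```

The outer ORDER BY is repeated because the join does not guarantee that the subquery's ordering survives.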
[PERFORM] Query optimization help
Hello, I asked that question on StackOverflow, but didn't get any valuable response, so I'll ask it here. :) I have this query:

SELECT spoleczniak_tablica.id, spoleczniak_tablica.postac_id, spoleczniak_tablica.hash, spoleczniak_tablica.typ, spoleczniak_tablica.ikona, spoleczniak_tablica.opis, spoleczniak_tablica.cel, spoleczniak_tablica.data, postac_postacie.id, postac_postacie.user_id, postac_postacie.avatar, postac_postacie.ikonka, postac_postacie.imie, postac_postacie.nazwisko, postac_postacie.pseudonim, postac_postacie.plec, postac_postacie.wzrost, postac_postacie.waga, postac_postacie.ur_tydz, postac_postacie.ur_rok, postac_postacie.ur_miasto_id, postac_postacie.akt_miasto_id, postac_postacie.kasa, postac_postacie.punkty, postac_postacie.zmeczenie, postac_postacie.zdrowie, postac_postacie.kariera
FROM spoleczniak_tablica
INNER JOIN postac_postacie ON (spoleczniak_tablica.postac_id = postac_postacie.id)
WHERE spoleczniak_tablica.postac_id = 1
   or spoleczniak_tablica.id in (select wpis_id from spoleczniak_oznaczone where etykieta_id in (select tag_id from spoleczniak_subskrypcje where postac_id = 1))
   or (spoleczniak_tablica.postac_id in (select obserwowany_id from spoleczniak_obserwatorium where obserwujacy_id = 1) and hash not in ('dyskusja', 'kochanie', 'szturniecie'))
   or (spoleczniak_tablica.cel = 1 and spoleczniak_tablica.hash in ('dyskusja', 'kochanie', 'obserwatorium', 'szturchniecie'))
   or spoleczniak_tablica.hash = 'administracja-info'
   or exists (select 1 from spoleczniak_komentarze where kredka_id = spoleczniak_tablica.id and postac_id = 1)
ORDER BY spoleczniak_tablica.id DESC
LIMIT 21;

and it's a real performance bottleneck for us - it's one of the most frequently executed queries on our site.
Here is EXPLAIN ANALYZE:

 Limit  (cost=52.69..185979.44 rows=21 width=283) (actual time=5.981..149.110 rows=21 loops=1)
   ->  Nested Loop  (cost=52.69..27867127142.57 rows=3147528 width=283) (actual time=5.981..149.103 rows=21 loops=1)
         ->  Index Scan Backward using spoleczniak_tablica_pkey on spoleczniak_tablica  (cost=52.69..27866103743.37 rows=3147528 width=194) (actual time=5.971..148.963 rows=21 loops=1)
               Filter: ((postac_id = 1) OR (SubPlan 1) OR ((hashed SubPlan 2) AND ((hash)::text <> ALL ('{dyskusja,kochanie,szturniecie}'::text[]))) OR ((cel = 1) AND ((hash)::text = ANY ('{dyskusja,kochanie,obserwatorium,szturchniecie}'::text[]))) OR ((hash)::text = 'administracja-info'::text) OR (alternatives: SubPlan 3 or hashed SubPlan 4))
               SubPlan 1
                 ->  Materialize  (cost=13.28..11947.85 rows=1264420 width=4) (actual time=0.000..0.024 rows=485 loops=2137)
                       ->  Nested Loop  (cost=13.28..685.75 rows=1264420 width=4) (actual time=0.119..0.664 rows=485 loops=1)
                             ->  HashAggregate  (cost=5.89..5.90 rows=1 width=4) (actual time=0.015..0.017 rows=7 loops=1)
                                   ->  Index Scan using spoleczniak_subskrypcje_postac_id on spoleczniak_subskrypcje  (cost=0.00..5.89 rows=2 width=4) (actual time=0.005..0.009 rows=7 loops=1)
                                         Index Cond: (postac_id = 1)
                             ->  Bitmap Heap Scan on spoleczniak_oznaczone  (cost=7.38..674.96 rows=391 width=8) (actual time=0.019..0.082 rows=69 loops=7)
                                   Recheck Cond: (etykieta_id = spoleczniak_subskrypcje.tag_id)
                                   ->  Bitmap Index Scan on spoleczniak_oznaczone_etykieta_id  (cost=0.00..7.29 rows=391 width=0) (actual time=0.013..0.013 rows=69 loops=7)
                                         Index Cond: (etykieta_id = spoleczniak_subskrypcje.tag_id)
               SubPlan 2
                 ->  Index Scan using spoleczniak_obserwatorium_obserwujacy_id on spoleczniak_obserwatorium  (cost=0.00..39.36 rows=21 width=4) (actual time=0.006..0.030 rows=26 loops=1)
                       Index Cond: (obserwujacy_id = 1)
               SubPlan 3
                 ->  Bitmap Heap Scan on spoleczniak_komentarze  (cost=18.67..20.68 rows=1 width=0) (never executed)
                       Recheck Cond: ((kredka_id = spoleczniak_tablica.id) AND (postac_id = 1))
                       ->  BitmapAnd  (cost=18.67..18.67 rows=1 width=0) (never executed)
                             ->  Bitmap Index Scan on spoleczniak_komentarze_kredka_id  (cost=0.00..2.98 rows=24 width=0) (never executed)
                                   Index Cond: (kredka_id = spoleczniak_tablica.id)
                             ->  Bitmap Index Scan on spoleczniak_komentarze_postac_id  (cost=0.00..15.44 rows=890 width=0) (never executed)
                                   Index Cond: (postac_id = 1)
               SubPlan 4
                 ->  Index Scan using spoleczniak_komentarze_postac_id on spoleczniak_komentarze  (cost=0.00..1610.46 rows=890 width=4) (actual time=0.013..2.983 rows=3605 loops=1)
                       Index Cond: (postac_id = 1)
         ->  Index Scan using postac_postacie_pkey on
[PERFORM] Query optimization problem
I have a query:

SELECT d1.ID, d2.ID
FROM DocPrimary d1
  JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763) or (d2.ID=234409763)

I think the query optimizer could make it faster (currently it uses seq scans, and on a million records it takes 7 seconds), as if it had been written:

SELECT d1.ID, d2.ID
FROM DocPrimary d1
  JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)

-- Slow Query --
test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off) SELECT d1.ID, d2.ID
test-# FROM DocPrimary d1
test-# JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
test-# WHERE (d1.ID=234409763) or (d2.ID=234409763);
 Hash Join  (cost=58.15..132.35 rows=2 width=8) (actual time=0.007..0.007 rows=0 loops=1)
   Output: d1.id, d2.id
   Hash Cond: (d2.basedon = d1.id)
   Join Filter: ((d1.id = 234409763) OR (d2.id = 234409763))
   ->  Seq Scan on public.docprimary d2  (cost=0.00..31.40 rows=2140 width=8) (actual time=0.002..0.002 rows=0 loops=1)
         Output: d2.id, d2.basedon
   ->  Hash  (cost=31.40..31.40 rows=2140 width=4) (never executed)
         Output: d1.id
         ->  Seq Scan on public.docprimary d1  (cost=0.00..31.40 rows=2140 width=4) (never executed)
               Output: d1.id

-- Fast Query --
test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off) SELECT d1.ID, d2.ID
test-# FROM DocPrimary d1
test-# JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
test-# WHERE (d2.BasedOn=234409763) or (d2.ID=234409763);
 Nested Loop  (cost=8.60..58.67 rows=12 width=8) (actual time=0.026..0.026 rows=0 loops=1)
   Output: d1.id, d2.id
   ->  Bitmap Heap Scan on public.docprimary d2  (cost=8.60..19.31 rows=12 width=8) (actual time=0.023..0.023 rows=0 loops=1)
         Output: d2.id, d2.basedon
         Recheck Cond: ((d2.basedon = 234409763) OR (d2.id = 234409763))
         ->  BitmapOr  (cost=8.60..8.60 rows=12 width=0) (actual time=0.018..0.018 rows=0 loops=1)
               ->  Bitmap Index Scan on basedon_idx  (cost=0.00..4.33 rows=11 width=0) (actual time=0.008..0.008 rows=0 loops=1)
                     Index Cond: (d2.basedon = 234409763)
               ->  Bitmap Index Scan on id_pk  (cost=0.00..4.26 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
                     Index Cond: (d2.id = 234409763)
   ->  Index Scan using id_pk on public.docprimary d1  (cost=0.00..3.27 rows=1 width=4) (never executed)
         Output: d1.id, d1.basedon
         Index Cond: (d1.id = d2.basedon)

PGver: PostgreSQL 9.0b x86
OS: Win7 x64

Create table query:
CREATE TABLE docprimary (
    id integer NOT NULL,
    basedon integer,
    CONSTRAINT id_pk PRIMARY KEY (id)
);
CREATE INDEX basedon_idx ON docprimary USING btree (basedon);
Re: [PERFORM] Query optimization problem
Hello Zotov,

Somehow the equivalence d2.basedon=d1.id is not used in the slow query, probably because the equivalent constant value would have to be applied inside a non-base expression (the OR). You can see that the equivalence values *are* used by changing the OR to an AND and comparing both queries. The only thing you can do to guarantee the planner has all the information in cases like this is to explicitly name the equivalence inside the OR expressions, e.g.

SELECT d1.ID, d2.ID
FROM DocPrimary d1
  JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763 and d2.basedon=234409763) or (d2.ID=234409763);

regards,
Yeb Havinga

PS: the analyze time of the slow query showed 0.007 ms?
[PERFORM] Query Optimization
Hi, Can anyone suggest why this query is so slow?

SELECT version();
 PostgreSQL 8.4.2 on i386-portbld-freebsd7.2, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD], 32-bit
(1 row)

explain analyze
SELECT DT.value, DT.meassure_date, DT.ms_status_id as status_id, S.descr_bg as status_bg, S.descr_en as status_en, VT.id as value_type_id, VT.descr_en as value_type_en, VT.descr_bg as value_type_bg, T.unit as value_type_unit, T.name as general_value_type, T.ms_db_type_id
FROM ms_data AS DT, ms_statuses AS S, ms_value_types AS VT, ms_types AS T, ms_commands_history AS CH
WHERE DT.ms_value_type_id = 88
  AND DT.meassure_date >= '2010-04-01 1:00:00'
  AND DT.meassure_date <= '2010-04-01 1:10:00'
  AND DT.ms_command_history_id = CH.id
  AND CH.ms_device_id = 7
  AND DT.ms_value_type_id = VT.id
  AND VT.ms_type_id = T.id
  AND DT.ms_status_id = S.id
GROUP BY value, meassure_date, status_id, status_bg, status_en, value_type_id, value_type_en, value_type_bg, value_type_unit, general_value_type, ms_db_type_id
ORDER BY meassure_date DESC;

 Group  (cost=23.93..23.96 rows=1 width=229) (actual time=63274.021..63274.021 rows=0 loops=1)
   ->  Sort  (cost=23.93..23.94 rows=1 width=229) (actual time=63274.016..63274.016 rows=0 loops=1)
         Sort Key: dt.meassure_date, dt.value, dt.ms_status_id, s.descr_bg, s.descr_en, vt.id, vt.descr_en, vt.descr_bg, t.unit, t.name, t.ms_db_type_id
         Sort Method: quicksort  Memory: 17kB
         ->  Nested Loop  (cost=0.00..23.92 rows=1 width=229) (actual time=63273.982..63273.982 rows=0 loops=1)
               ->  Nested Loop  (cost=0.00..19.64 rows=1 width=165) (actual time=63273.977..63273.977 rows=0 loops=1)
                     ->  Nested Loop  (cost=0.00..15.36 rows=1 width=101) (actual time=63273.974..63273.974 rows=0 loops=1)
                           ->  Nested Loop  (cost=0.00..11.08 rows=1 width=23) (actual time=63273.970..63273.970 rows=0 loops=1)
                                 ->  Index Scan using ms_commands_history_ms_device_id_idx on ms_commands_history ch  (cost=0.00..4.33 rows=1 width=8) (actual time=0.163..25254.004 rows=9807 loops=1)
                                       Index Cond: (ms_device_id = 7)
                                 ->  Index Scan using ms_data_ms_command_history_id_idx on ms_data dt  (cost=0.00..6.74 rows=1 width=31) (actual time=3.868..3.868 rows=0 loops=9807)
                                       Index Cond: (dt.ms_command_history_id = ch.id)
                                       Filter: ((dt.meassure_date >= '2010-04-01 01:00:00'::timestamp without time zone) AND (dt.meassure_date <= '2010-04-01 01:10:00'::timestamp without time zone) AND (dt.ms_value_type_id = 88))
                           ->  Index Scan using ms_value_types_pkey on ms_value_types vt  (cost=0.00..4.27 rows=1 width=82) (never executed)
                                 Index Cond: (vt.id = 88)
                     ->  Index Scan using ms_types_pkey on ms_types t  (cost=0.00..4.27 rows=1 width=72) (never executed)
                           Index Cond: (t.id = vt.ms_type_id)
               ->  Index Scan using ms_statuses_pkey on ms_statuses s  (cost=0.00..4.27 rows=1 width=68) (never executed)
                     Index Cond: (s.id = dt.ms_status_id)
 Total runtime: 63274.256 ms

Thanks in advance. Kaloyan Iliev
Re: [PERFORM] Query Optimization
In response to Kaloyan Iliev Iliev: Hi, Can anyone suggest why this query so slow.

->  Index Scan using ms_commands_history_ms_device_id_idx on ms_commands_history ch  (cost=0.00..4.33 rows=1 width=8) (actual time=0.163..25254.004 rows=9807 loops=1)

Estimated rows: 1, actual rows: 9807. That's a BIG difference and, maybe, your problem. Btw.: your explain is hard to read (line-wrapping). It's better to attach the explain as its own file...

Andreas
-- Andreas Kretschmer Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: - Header) GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431 2EB0 389D 1DC2 3172 0C99
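A common next step when one scan's estimate is off by four orders of magnitude (my suggestion, filling in what the reply implies) is to raise the statistics target for the offending column and re-analyze; the value 1000 is only an example:

```sql
-- Collect a larger sample for ms_device_id, then refresh the statistics.
ALTER TABLE ms_commands_history ALTER COLUMN ms_device_id SET STATISTICS 1000;
ANALYZE ms_commands_history;

-- Re-check whether the row estimate now tracks reality:
EXPLAIN SELECT * FROM ms_commands_history WHERE ms_device_id = 7;
```

If the estimate improves, the planner should stop nesting an expensive scan inside a loop it believed would run once.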
Re: [PERFORM] query optimization
On Fri, Nov 27, 2009 at 4:47 PM, Faheem Mitha fah...@email.unc.edu wrote: If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. What are the options re pre-marking? Well, what I usually do is - if I'm going to do the same distinct-ification frequently, I add an extra column (say, a boolean) and set it to true for all and only those rows which will pass the distinct-ification filter. Then I can just say WHERE that-column-name. Yes, I see. The problem with premarking is that the selection is somewhat dynamic, in the sense that it depends on the idlink table, which depends on patient data, which can change. Yeah. For things like this I find you have to think hard about how to organize your schema so that you can optimize the queries you care about. There are no "just do this and it works" solutions to performance problems of this type. Still, many of them are solvable by making the right decisions elsewhere. Sometimes you can use triggers to recompute your premarks when the data in the other table changes. Another strategy is to keep a cache of precomputed results somewhere. When the underlying data changes, you use triggers to invalidate anything in the cache that might now be wrong, and set things up so that it will be recomputed when next it is used. But in either case you have to figure out the right place to do the computation so that it gains you more than it saves you, and adjusting your schema is often necessary. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway.
I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno? I think the point of the first join is to reduce the anno table based on information from the geno table. The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Yeah, I didn't think about it in detail, but it looks like it should be possible. Eliminating joins can sometimes have *dramatic* effects on query performance, and it never hurts. Failing all else, couldn't I smoosh together the two queries and do a triple join? For reference, the two CTEs in question, from the PED query, are as follows.

dedup_patient_anno AS (
    SELECT * FROM
        (SELECT *, row_number() OVER (PARTITION BY anno.rsid ORDER BY anno.id)
         FROM anno INNER JOIN geno ON anno.id = geno.anno_id
         WHERE idlink_id = (SELECT MIN(id) FROM idlink)
        ) AS s
    WHERE row_number = '1'
),
patient_geno AS (
    SELECT geno.idlink_id AS idlink_id, geno.anno_id AS anno_id,
           geno.snpval_id AS snpval_id, allelea_id, alleleb_id
    FROM geno INNER JOIN dedup_patient_anno
        ON geno.anno_id = dedup_patient_anno.id
),

If that will give the same results, which I'm not immediately certain about, then I highly recommend it. In general I would recommend only using CTEs to express concepts that can't sensibly be expressed in other ways, not to beautify your queries. Keep in mind that joins can be reordered and/or executed using different methods, but most other operations can't be, so trying to get your joins together in one place is usually a good strategy, in my experience. And of course if that lets you reduce the total number of joins, that's even better. ...Robert -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
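An untested sketch of the "smooshed" version Faheem asks about, folding both CTEs above into a single query and replacing the row_number() window with DISTINCT ON; whether it returns exactly the same rows as the original formulation would need to be verified before use:

```sql
-- Hedged rewrite: collapse dedup_patient_anno + patient_geno into one query.
-- DISTINCT ON (a.rsid) ... ORDER BY a.rsid, a.id keeps the smallest-id row
-- per rsid, mirroring row_number() OVER (PARTITION BY rsid ORDER BY id) = 1.
SELECT g.idlink_id, g.anno_id, g.snpval_id, g.allelea_id, g.alleleb_id
FROM geno g
JOIN (SELECT DISTINCT ON (a.rsid) a.id
      FROM anno a
      JOIN geno g2 ON a.id = g2.anno_id
      WHERE g2.idlink_id = (SELECT min(id) FROM idlink)
      ORDER BY a.rsid, a.id) dedup
  ON g.anno_id = dedup.id;
```

Expressed this way, all three joins sit in one query block, so the planner is free to reorder them, which is exactly the flexibility Robert recommends preserving.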
Re: [PERFORM] query optimization
On Wed, 25 Nov 2009, Robert Haas wrote: On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha fah...@email.unc.edu wrote: Well, any method of DISTINCT-ifying is likely to be somewhat slow, but I've had good luck with SELECT DISTINCT ON (...) in the past, as compared with other methods. YMMV - the only way to find out is to benchmark it. I don't think it's non-deterministic if you order by the DISTINCT-ON columns and enough extras to break any ties - you should get the first one of each set. Right, but adding enough extras to break ties is up to the user, and the language doesn't guarantee anything, so it feels more fragile. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. What are the options re pre-marking? Well, what I usually do is - if I'm going to do the same distinct-ification frequently, I add an extra column (say, a boolean) and set it to true for all and only those rows which will pass the distinct-ification filter. Then I can just filter with WHERE on that column. Yes, I see. The problem with premarking is that the selection is somewhat dynamic, in the sense that this depends on the idlink table, which depends on patient data, which can change. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway. I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno? I think the point of the first join is to reduce the anno table based on information from the geno table.
The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Yeah, I didn't think about it in detail, but it looks like it should be possible. Eliminating joins can sometimes have *dramatic* effects on query performance, and it never hurts. Failing all else, couldn't I smoosh together the two queries and do a triple join? For reference, the two CTEs in question, from the PED query, are as follows.

dedup_patient_anno AS (
    SELECT * FROM
        (SELECT *, row_number() OVER (PARTITION BY anno.rsid ORDER BY anno.id)
         FROM anno INNER JOIN geno ON anno.id = geno.anno_id
         WHERE idlink_id = (SELECT MIN(id) FROM idlink)
        ) AS s
    WHERE row_number = '1'
),
patient_geno AS (
    SELECT geno.idlink_id AS idlink_id, geno.anno_id AS anno_id,
           geno.snpval_id AS snpval_id, allelea_id, alleleb_id
    FROM geno INNER JOIN dedup_patient_anno
        ON geno.anno_id = dedup_patient_anno.id
),

Regards, Faheem. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
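The trigger-maintained premark Robert describes could be sketched against the tables above as follows; the `is_first` column name is invented for illustration and the statements are untested:

```sql
-- Hypothetical premark column: true for the one anno row per rsid that
-- the "row_number = 1" filter would keep (smallest id wins, matching
-- ORDER BY anno.id in the window clause of dedup_patient_anno).
ALTER TABLE anno ADD COLUMN is_first boolean NOT NULL DEFAULT false;

-- One-shot backfill of the flag.
UPDATE anno
SET is_first = (anno.id = s.min_id)
FROM (SELECT rsid, min(id) AS min_id FROM anno GROUP BY rsid) s
WHERE anno.rsid = s.rsid;

-- Later queries can then replace the window-function CTE with a plain filter:
--   SELECT ... FROM anno WHERE is_first;
```

As Faheem notes, the selection also depends on geno and idlink, so the flag would have to be refreshed (for instance by triggers on those tables) whenever the underlying data changes.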
Re: [PERFORM] query optimization
On Mon, Nov 23, 2009 at 5:47 PM, Faheem Mitha fah...@email.unc.edu wrote: Hi everybody, I've got two queries that needs optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. I've found that a good way to approach optimizing queries of this type is to look at the EXPLAIN ANALYZE results and figure out which parts of the query are slow. Then simplify the rest of the query as much as possible without eliminating the slowness. Then try to figure out how to optimize the simplified query: rewrite the logic, add indices, change the schema, etc. Lastly start adding the other bits back in. It looks like the dedup_patient_anno CTE is part of your problem. Try pulling that piece out and optimizing it separately. I wonder if that could be rewritten to use SELECT DISTINCT ON (...) and whether that would be any faster. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. Once you've got those parts of the query as well-optimized as you can, add the next pieces in and start hacking on those. 
...Robert -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
Hi Robert, Thanks very much for your suggestions. On Wed, 25 Nov 2009, Robert Haas wrote: On Mon, Nov 23, 2009 at 5:47 PM, Faheem Mitha fah...@email.unc.edu wrote: Hi everybody, I've got two queries that need optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. I've found that a good way to approach optimizing queries of this type is to look at the EXPLAIN ANALYZE results and figure out which parts of the query are slow. Then simplify the rest of the query as much as possible without eliminating the slowness. Then try to figure out how to optimize the simplified query: rewrite the logic, add indices, change the schema, etc. Lastly start adding the other bits back in. Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough to figure out which bits are slow. :-) It looks like the dedup_patient_anno CTE is part of your problem. Try pulling that piece out and optimizing it separately. I wonder if that could be rewritten to use SELECT DISTINCT ON (...) and whether that would be any faster. Isn't SELECT DISTINCT supposed to be evil, since in general the result is not deterministic? I think I had SELECT DISTINCT earlier, and removed it because of that, with the help of Andrew (RhodiumToad on #postgresql). I didn't compare the corresponding subqueries separately, so I don't know what speed difference this made. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time.
What are the options re pre-marking? Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway. I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno ? I think the point of the first join is to reduce the anno table based on information from the geno table. The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Once you've got those parts of the query as well-optimized as you can, add the next pieces in and start hacking on those. Regards, Faheem. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha fah...@email.unc.edu wrote: Hi Robert, Thanks very much for your suggestions. Hi everybody, I've got two queries that needs optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. I've found that a good way to approach optimizing queries of this type is to look at the EXPLAIN ANALYZE results and figure out which parts of the query are slow. Then simplify the rest of the query as much as possible without eliminating the slowness. Then try to figure out how to optimize the simplified query: rewrite the logic, add indices, change the schema, etc. Lastly start adding the other bits back in. Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough to figure out which bits are slow. :-) Well, you basically just look for the big numbers. The actual numbers are in ms, and each node includes the times for the things beneath it, so usually my approach is to just look at lower and lower levels of the tree (i.e. the parts that are more indented) until I find the lowest level that is slow. Then I look at the query bits presented there to figure out which piece of the SQL it corresponds to. Looking at the estimates (which are not in ms or any other particular unit) can be helpful too, in that it can help you find places where the planner thought it would be fast but it was actually slow. To do this, look at the top level of the query and get a sense of what the ratio between estimated-cost-units and actual-ms is. 
Then look for big (order of magnitude) deviations from this throughout the plan. Those are places where you want to either gather better statistics, or rewrite the query so that it can make better use of statistics. The latter is more of an art than a science - I or someone else on this list can help you with it if we find a specific case to look at. It looks like the dedup_patient_anno CTE is part of your problem. Try pulling that piece out and optimizing it separately. I wonder if that could be rewritten to use SELECT DISTINCT ON (...) and whether that would be any faster. Isn't SELECT DISTINCT supposed to be evil, since in general the result is not deterministic? I think I had SELECT DISTINCT earlier, and removed it because of that, with the help of Andrew (RhodiumToad on #postgresql) I didn't compare the corresponding subqueries separately, so don't know what speed difference this made. Well, any method of DISTINCT-ifying is likely to be somewhat slow, but I've had good luck with SELECT DISTINCT ON (...) in the past, as compared with other methods. YMMV - the only way to find out is to benchmark it. I don't think it's non-deterministic if you order by the DISTINCT-ON columns and enough extras to break any ties - you should get the first one of each set. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. What are the options re pre-marking? Well, what I usually do is - if I'm going to do the same distinct-ification frequently, I add an extra column (say, a boolean) and set it to true for all and only those rows which will pass the distinct-ification filter. Then I can just say WHERE that column name. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. 
It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway. I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno ? I think the point of the first join is to reduce the anno table based on information from the geno table. The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Yeah, I didn't think about it in detail, but it looks like it should be possible. Eliminating joins can sometimes have *dramatic* effects on query performance, and it never hurts. ...Robert -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your
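Robert's point about tie-breaking can be made concrete with the thread's anno table; an illustrative sketch, not a tested query from the discussion:

```sql
-- Deterministic: ORDER BY starts with the DISTINCT ON expression and then
-- a unique column (id), so "first row per rsid" is well defined.
SELECT DISTINCT ON (rsid) *
FROM anno
ORDER BY rsid, id;

-- Underdetermined: without the id tie-breaker, which row survives for a
-- given rsid is left to the planner and may vary between executions.
SELECT DISTINCT ON (rsid) *
FROM anno;
```

The first form is what makes SELECT DISTINCT ON safe to use here: the result is fixed by the data, even though, as Faheem says, the language leaves it to the user to supply enough ordering columns.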
[PERFORM] query optimization
Hi everybody, I've got two queries that need optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. Regards, Faheem. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
2009/11/23 Faheem Mitha fah...@email.unc.edu Hi everybody, I've got two queries that needs optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. Regards, Faheem. Hi Faheem, There appears to be a discrepancy between the 2 PDFs you provided. One says you're using PostgreSQL 8.3, and the other shows you using common table expressions, which are only available in 8.4+. Thom
Re: [PERFORM] query optimization
On Mon, 23 Nov 2009, Thom Brown wrote: Hi Faheem, There appears to be a discrepancy between the 2 PDFs you provided. One says you're using PostgreSQL 8.3, and the other shows you using common table expressions, which are only available in 8.4+. Yes, sorry. I'm using Postgresql 8.4. I guess I should go through diag.pdf and make sure all the information is current. Thanks for pointing out my error. Regards, Faheem. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
On Tue, Nov 24, 2009 at 12:49 AM, Faheem Mitha fah...@email.unc.edu wrote: Yes, sorry. I'm using Postgresql 8.4. I guess I should go through diag.pdf and make sure all the information is current. Thanks for pointing out my error. Excellent report! About the copy problem: you seem to have created the primary key before doing the copy (at least that's what the dump before copy says). This is bad. Create it after the copy. Greetings Marcin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
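Marcin's load-then-index pattern looks roughly like this; the table definition and file path are placeholders for illustration, not taken from Faheem's schema:

```sql
-- Create the table without any index or primary key.
CREATE TABLE geno_load (
    id      bigint,
    anno_id bigint
);

-- Bulk load: with no index present, COPY avoids per-row index maintenance.
COPY geno_load FROM '/tmp/geno.csv' WITH CSV;

-- Build the primary key index once, in a single pass over the loaded data.
ALTER TABLE geno_load ADD PRIMARY KEY (id);
ANALYZE geno_load;
```

Building the unique index after the load is generally much cheaper than maintaining it row by row during the COPY, which is why creating the primary key first is "bad" here.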
Re: [PERFORM] query optimization
How often are the tables you query from updated? Rgds Sebastian On Tue, Nov 24, 2009 at 12:52 AM, marcin mank marcin.m...@gmail.com wrote: On Tue, Nov 24, 2009 at 12:49 AM, Faheem Mitha fah...@email.unc.edu wrote: Yes, sorry. I'm using Postgresql 8.4. I guess I should go through diag.pdf and make sure all the information is current. Thanks for pointing out my error. excellent report! about the copy problem: You seem to have created the primary key before doing the copy (at least that`s what the dump before copy says). This is bad. Create it after the copy. Greetings Marcin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] query optimization
On Tue, 24 Nov 2009, Sebastian Jörgensen wrote: How often are the tables you query from updated? Quite rarely. Once in a while. The large tables, eg. geno, are basically static. Regards, Faheem. Rgds Sebastian On Tue, Nov 24, 2009 at 12:52 AM, marcin mank marcin.m...@gmail.com wrote: On Tue, Nov 24, 2009 at 12:49 AM, Faheem Mitha fah...@email.unc.edu wrote: Yes, sorry. I'm using Postgresql 8.4. I guess I should go through diag.pdf and make sure all the information is current. Thanks for pointing out my error. excellent report! about the copy problem: You seem to have created the primary key before doing the copy (at least that`s what the dump before copy says). This is bad. Create it after the copy. Greetings Marcin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Query optimization
I am struggling with the following query, which fetches a random subset of 200 questions that match certain tags within certain languages. However, the query takes forever to evaluate, even though I have a LIMIT 200 appended. Any ideas on how to optimize it?

QUERY:

SELECT distinct q.question_id
FROM question_tags qt, questions q
WHERE q.question_id = qt.question_id
  AND q.STATUS = 1
  AND not q.author_id = 105
  AND ((qt.language_id = 5 and qt.tag_id in (1101,917,734,550,367,183))
    or (qt.language_id = 4 and qt.tag_id in (856,428))
    or (qt.language_id = 3 and qt.tag_id in (1156,1028,899,771,642,514,385,257,128))
    or (qt.language_id = 2 and qt.tag_id in (1193,1101,1009,917,826,734,642,550,458,367,275,183,91)))
  and q.question_id not in (413)
LIMIT 200

EXPLAIN ANALYZE:

Limit  (cost=1.50..1267.27 rows=200 width=4) (actual time=278.169..880.934 rows=200 loops=1)
  ->  Unique  (cost=1.50..317614.50 rows=50185 width=4) (actual time=278.165..880.843 rows=200 loops=1)
        ->  Merge Join  (cost=1.50..317489.04 rows=50185 width=4) (actual time=278.162..880.579 rows=441 loops=1)
              Merge Cond: (qt.question_id = q.question_id)
              ->  Index Scan using question_tags_question_id on question_tags qt  (cost=0.00..301256.96 rows=82051 width=4) (actual time=24.171..146.811 rows=6067 loops=1)
                    Filter: (((language_id = 5) AND (tag_id = ANY ('{1101,917,734,550,367,183}'::integer[]))) OR ((language_id = 4) AND (tag_id = ANY ('{856,428}'::integer[]))) OR ((language_id = 3) AND (tag_id = ANY ('{1156,1028,899,771,642,514,385,257,128}'::integer[]))) OR ((language_id = 2) AND (tag_id = ANY ('{1193,1101,1009,917,826,734,642,550,458,367,275,183,91}'::integer[]))))
              ->  Index Scan using questions_pkey on questions q  (cost=0.00..15464.12 rows=83488 width=4) (actual time=222.956..731.737 rows=1000 loops=1)
                    Filter: ((q.author_id <> 105) AND (q.question_id <> 413) AND (q.status = 1))
Total runtime: 881.152 ms
(9 rows)

An index on (language_id, tag_id) should be the first thing to try.
Or perhaps even (status, language_id, tag_id) or (language_id, tag_id, status) (but that depends on the stats on the status column). An index on author_id will probably not be useful for this particular query because your condition is author_id != constant. Also CLUSTER question_tags on (language_id, tag_id). What is the database size versus RAM? You must have a hell of a lot of questions to make this slow... (or bloat?) -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
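Spelled out, the suggested index and CLUSTER would look something like this (the index name is arbitrary, and `CLUSTER ... USING` is the modern spelling; older releases used `CLUSTER index ON table`):

```sql
-- Composite index matching the per-language tag filter, so the planner can
-- use index conditions instead of filtering every scanned row.
CREATE INDEX question_tags_lang_tag_idx
    ON question_tags (language_id, tag_id);

-- Physically reorder the table to match the index, then refresh statistics.
CLUSTER question_tags USING question_tags_lang_tag_idx;
ANALYZE question_tags;
```

CLUSTER rewrites the table, so it takes an exclusive lock and is best run during a quiet period; it mainly helps when the query touches many rows that the index would otherwise scatter across the heap.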
[PERFORM] Query optimization
I am struggling with the following query, which fetches a random subset of 200 questions that match certain tags within certain languages. However, the query takes forever to evaluate, even though I have a LIMIT 200 appended. Any ideas on how to optimize it?

QUERY:

SELECT distinct q.question_id
FROM question_tags qt, questions q
WHERE q.question_id = qt.question_id
  AND q.STATUS = 1
  AND not q.author_id = 105
  AND ((qt.language_id = 5 and qt.tag_id in (1101,917,734,550,367,183))
    or (qt.language_id = 4 and qt.tag_id in (856,428))
    or (qt.language_id = 3 and qt.tag_id in (1156,1028,899,771,642,514,385,257,128))
    or (qt.language_id = 2 and qt.tag_id in (1193,1101,1009,917,826,734,642,550,458,367,275,183,91)))
  and q.question_id not in (413)
LIMIT 200

EXPLAIN ANALYZE:

Limit  (cost=1.50..1267.27 rows=200 width=4) (actual time=278.169..880.934 rows=200 loops=1)
  ->  Unique  (cost=1.50..317614.50 rows=50185 width=4) (actual time=278.165..880.843 rows=200 loops=1)
        ->  Merge Join  (cost=1.50..317489.04 rows=50185 width=4) (actual time=278.162..880.579 rows=441 loops=1)
              Merge Cond: (qt.question_id = q.question_id)
              ->  Index Scan using question_tags_question_id on question_tags qt  (cost=0.00..301256.96 rows=82051 width=4) (actual time=24.171..146.811 rows=6067 loops=1)
                    Filter: (((language_id = 5) AND (tag_id = ANY ('{1101,917,734,550,367,183}'::integer[]))) OR ((language_id = 4) AND (tag_id = ANY ('{856,428}'::integer[]))) OR ((language_id = 3) AND (tag_id = ANY ('{1156,1028,899,771,642,514,385,257,128}'::integer[]))) OR ((language_id = 2) AND (tag_id = ANY ('{1193,1101,1009,917,826,734,642,550,458,367,275,183,91}'::integer[]))))
              ->  Index Scan using questions_pkey on questions q  (cost=0.00..15464.12 rows=83488 width=4) (actual time=222.956..731.737 rows=1000 loops=1)
                    Filter: ((q.author_id <> 105) AND (q.question_id <> 413) AND (q.status = 1))
Total runtime: 881.152 ms
(9 rows)

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Query optimization
On Sunday 30 November 2008 19:45:11, tmp wrote: I am struggling with the following query, which fetches a random subset of 200 questions that match certain tags within certain languages. However, the query takes forever to evaluate, even though I have a LIMIT 200 appended. Any ideas on how to optimize it?

QUERY:

SELECT distinct q.question_id
FROM question_tags qt, questions q
WHERE q.question_id = qt.question_id
  AND q.STATUS = 1
  AND not q.author_id = 105
  AND ((qt.language_id = 5 and qt.tag_id in (1101,917,734,550,367,183))
    or (qt.language_id = 4 and qt.tag_id in (856,428))
    or (qt.language_id = 3 and qt.tag_id in (1156,1028,899,771,642,514,385,257,128))
    or (qt.language_id = 2 and qt.tag_id in (1193,1101,1009,917,826,734,642,550,458,367,275,183,91)))
  and q.question_id not in (413)
LIMIT 200

EXPLAIN ANALYZE:

Limit  (cost=1.50..1267.27 rows=200 width=4) (actual time=278.169..880.934 rows=200 loops=1)
  ->  Unique  (cost=1.50..317614.50 rows=50185 width=4) (actual time=278.165..880.843 rows=200 loops=1)
        ->  Merge Join  (cost=1.50..317489.04 rows=50185 width=4) (actual time=278.162..880.579 rows=441 loops=1)
              Merge Cond: (qt.question_id = q.question_id)
              ->  Index Scan using question_tags_question_id on question_tags qt  (cost=0.00..301256.96 rows=82051 width=4) (actual time=24.171..146.811 rows=6067 loops=1)
                    Filter: (((language_id = 5) AND (tag_id = ANY ('{1101,917,734,550,367,183}'::integer[]))) OR ((language_id = 4) AND (tag_id = ANY ('{856,428}'::integer[]))) OR ((language_id = 3) AND (tag_id = ANY ('{1156,1028,899,771,642,514,385,257,128}'::integer[]))) OR ((language_id = 2) AND (tag_id = ANY ('{1193,1101,1009,917,826,734,642,550,458,367,275,183,91}'::integer[]))))
              ->  Index Scan using questions_pkey on questions q  (cost=0.00..15464.12 rows=83488 width=4) (actual time=222.956..731.737 rows=1000 loops=1)
                    Filter: ((q.author_id <> 105) AND (q.question_id <> 413) AND (q.status = 1))
Total runtime: 881.152 ms
(9 rows)

First, because of the DISTINCT, the LIMIT 200 won't reduce the work to be done much: it will still have to sort for the Unique step, and only when everything is sorted will it take only the first 200 records. In any case it seems there are only 441 rows before doing the distinct, so, at least for this query, it won't change the timings a lot. Then it seems to me that you could try creating composite indexes, because there is a lot of filtering after the index scans (that is, if you want the query to be faster). Maybe (q.author_id, q.status). For qt you could try (qt.language_id, qt.tag_id)... Hope it helps Cheers -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Query Optimization with Kruskal’s Algorithm
Hello friends, I'm working on optimizing queries using the Kruskal algorithm ( http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4318118). I ran several tests against the database itself and saw interesting results. I did 10 executions of each query, first with the unchanged Postgres source and then with the source adapted to the Kruskal algorithm. The query I used is composed of 12 tables and 11 joins.

Results with unmodified PostgreSQL (ms, via \timing): 170.690, 168.214, 182.832, 166.172, 174.466, 167.143, 167.287, 172.891, 170.452, 165.665; average = 170.5812 ms

Results with the Kruskal algorithm (ms, via \timing): 520.590, 13.533, 8.410, 5.162, 5.543, 4.999, 9.871, 4.984, 5.010, 8.883; average = 58.6985 ms

As you can see from the results, with the Kruskal algorithm the first execution takes much more time to return results. This does not occur with the original Postgres source. So what is the best method to conduct the tests? Should I take the average of the 10 executions into consideration, or just the first one? Do you think I must clear the cache after each query? (Because the other nine executions may be served from information already in memory.) regards, Tarcizio Bini.
Re: [PERFORM] Query Optimization with Kruskal’s Algorithm
On 5/7/08, Tarcizio Bini [EMAIL PROTECTED] wrote: I'm working on optimizing queries using the Kruskal algorithm (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4318118). That paper looks very interesting. I would love to hear what the PostgreSQL committers think of this algorithm. Alexander. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Query Optimization
Dear Sirs, I am doing a project on optimizing the pg-sql query engine with compressed annealing. I would like to know of any deficiencies in the existing GEQO. If there are any TODO items remaining in GEQO, kindly brief me on the same. Awaiting discussions on this. GN
[PERFORM] Query optimization....
This is a followup to a question I put forward here on performance, which I traced to the stats bug (and fixed it). Now I'm trying to optimize that query and... I'm getting confused fast... I have the following (fairly complex) statement which is run with some frequency:

select post.forum, post.subject, post.replied
from post
where toppost = 1
  and (replied > (select lastview from forumlog
                  where login = 'someone'
                    and forum = post.forum
                    and number is null)) is not false
  and (replied > (select lastview from forumlog
                  where login = 'someone'
                    and forum = post.forum
                    and number = post.number)) is not false
order by pinned desc, replied desc;

This gives me exactly what I'm looking for BUT can be quite slow. The forumlog table has one tuple for each post and user; it has the fields forum, number, login and lastview. The post items have a forum, number and replied field (which is used to match the lastview one.) When you look at a post (which may have replies) the application updates your existing entry in that table if there is one, or INSERTs a new tuple if not. Therefore, for each post you have viewed, there is a tuple in the forumlog table which represents the last time you looked at that item. The problem is that for a person who has NOT visited a specific thread of discussion, there is no forumlog entry for that person and post in the table. Thus, to get all posts which (1) you've not seen at all, or (2) you've seen but someone has added to since you saw them, the above complex query is what I've come up with; there may be a null table entry which is a wildcard match if it's present - if there is no match then the item also must be treated as new. The above statement works - but it's slow. The following query is VERY fast but only returns those in which there IS an entry in the table (e.g. you've visited the item at least once):

select post.forum, post.subject, post.replied
from post, forumlog
where post.number = forumlog.number
  and post.toppost = 1
  and post.replied > forumlog.lastview
  and forumlog.login = 'someone'
order by pinned desc, replied desc;

What I haven't been able to figure out is how to structure a query that is both fast and will return the posts for which you DO NOT have a matching entry in the forumlog table for the specific post but DO either (1) match the null number entry (that is, they're posted later than that) OR (2) have no match at all. (The first statement matches these other two cases.) Any ideas? (It's OK if the queries are separate; in other words, it's cool if I have to execute two or even three queries and get the results separately - in fact, that might be preferable in some circumstances) Ideas? -- Karl Denninger ([EMAIL PROTECTED]) http://www.denninger.net ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
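One common shape for the "no matching forumlog entry at all" case is an anti-join with NOT EXISTS; this is an untested sketch against the schema Karl describes, and it would still need to be combined (e.g. via UNION) with the fast join query and with the number-IS-NULL wildcard check to cover all three cases:

```sql
-- Top posts with no per-post forumlog row for this user (anti-join).
-- The planner can execute NOT EXISTS as a hash or merge anti-join,
-- which is usually far cheaper than a correlated subselect per row.
SELECT p.forum, p.subject, p.replied
FROM post p
WHERE p.toppost = 1
  AND NOT EXISTS (SELECT 1
                  FROM forumlog f
                  WHERE f.login = 'someone'
                    AND f.forum = p.forum
                    AND f.number = p.number)
ORDER BY p.pinned DESC, p.replied DESC;
```

An index on forumlog (login, forum, number) would let the anti-join probe or scan efficiently.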
Re: [PERFORM] Query Optimization
It's not necessarily the join order that's an issue; it could also be due to the merge join that it does in the first case. I've also run into situations where the cost estimate for a merge join is way off the mark. Rather than forcing the join order, you might try setting enable_mergejoin=false.

On Mon, Feb 19, 2007 at 06:03:22PM +0100, Reinhard Vicinus wrote: PostgreSQL version: 8.1.6 OS: Debian etch The following query needs a lot of time because the query planner reorders the joins: select m.message_idnr, v.headervalue, n.headername from dbmail_messages m join dbmail_headervalue v ON v.physmessage_id=m.physmessage_id join dbmail_headername n ON v.headername_id=n.id where m.mailbox_idnr = 1022 AND message_idnr BETWEEN 698928 AND 1496874 and lower(n.headername) IN ('from','to','cc','subject','date','message-id', 'priority','x-priority','references','newsgroups','in-reply-to', 'content-type','x-spam-status','x-spam-flag'); If I prevent the query planner from reordering the joins with 'set join_collapse_limit=1;' then the same query is faster. At the end of the mail is the output of an explain analyze for both cases. The statistics of the database are updated each night. Is there an error (in the statistical data) which is responsible for the reordering of the joins? And if not, are there other alternatives for preventing join reordering?
Thanks Reinhard

Explain analyze with set join_collapse_limit=8:

Merge Join (cost=388657.62..391332.20 rows=821 width=127) (actual time=82677.950..89103.192 rows=2699 loops=1)
  Merge Cond: ("outer".physmessage_id = "inner".physmessage_id)
  -> Sort (cost=2901.03..2902.61 rows=632 width=16) (actual time=247.238..247.578 rows=373 loops=1)
       Sort Key: m.physmessage_id
       -> Bitmap Heap Scan on dbmail_messages m (cost=9.16..2871.63 rows=632 width=16) (actual time=38.072..246.509 rows=373 loops=1)
            Recheck Cond: (mailbox_idnr = 1022)
            Filter: ((message_idnr >= 698928) AND (message_idnr <= 1496874))
            -> Bitmap Index Scan on dbmail_messages_8 (cost=0.00..9.16 rows=902 width=0) (actual time=25.561..25.561 rows=615 loops=1)
                 Index Cond: (mailbox_idnr = 1022)
  -> Sort (cost=385756.58..387089.35 rows=533108 width=127) (actual time=80156.731..85760.186 rows=3278076 loops=1)
       Sort Key: v.physmessage_id
       -> Hash Join (cost=51.00..285787.17 rows=533108 width=127) (actual time=34.519..28260.855 rows=3370242 loops=1)
            Hash Cond: ("outer".headername_id = "inner".id)
            -> Seq Scan on dbmail_headervalue v (cost=0.00..241200.39 rows=7840939 width=115) (actual time=0.006..16844.479 rows=7854485 loops=1)
            -> Hash (cost=50.72..50.72 rows=113 width=28) (actual time=34.493..34.493 rows=35 loops=1)
                 -> Bitmap Heap Scan on dbmail_headername n (cost=28.44..50.72 rows=113 width=28) (actual time=11.796..34.437 rows=35 loops=1)
                      Recheck Cond: ((lower((headername)::text) = 'from'::text) OR (lower((headername)::text) = 'to'::text) OR (lower((headername)::text) = 'cc'::text) OR (lower((headername)::text) = 'subject'::text) OR (lower((headername)::text) = 'date'::text) OR (lower((headername)::text) = 'message-id'::text) OR (lower((headername)::text) = 'priority'::text) OR (lower((headername)::text) = 'x-priority'::text) OR (lower((headername)::text) = 'references'::text) OR (lower((headername)::text) = 'newsgroups'::text) OR (lower((headername)::text) = 'in-reply-to'::text) OR (lower((headername)::text) = 'content-type'::text) OR (lower((headername)::text) = 'x-spam-status'::text) OR (lower((headername)::text) = 'x-spam-flag'::text))
                      -> BitmapOr (cost=28.44..28.44 rows=116 width=0) (actual time=11.786..11.786 rows=0 loops=1)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.037..0.037 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'from'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.013..0.013 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'to'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.013..0.013 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'cc'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.014..0.014 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'subject'::text)
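Both workarounds mentioned in this thread are plain session-level settings, so they can be tried without touching postgresql.conf. A sketch (SET LOCAL scopes the change to the transaction; the trailing query stands in for the one under test):

```sql
BEGIN;
SET LOCAL join_collapse_limit = 1;   -- planner keeps the join order as written
SET LOCAL enable_mergejoin = off;    -- or: rule out the mis-costed merge join
EXPLAIN ANALYZE SELECT ...;          -- the query under test
COMMIT;                              -- settings revert automatically
```

Comparing the EXPLAIN ANALYZE output with and without each setting shows which one the bad plan actually depends on.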
[PERFORM] Query Optimization
PostgreSQL version: 8.1.6 OS: Debian etch

The following query needs a lot of time because the query planner reorders the joins:

select m.message_idnr, v.headervalue, n.headername from dbmail_messages m join dbmail_headervalue v ON v.physmessage_id=m.physmessage_id join dbmail_headername n ON v.headername_id=n.id where m.mailbox_idnr = 1022 AND message_idnr BETWEEN 698928 AND 1496874 and lower(n.headername) IN ('from','to','cc','subject','date','message-id', 'priority','x-priority','references','newsgroups','in-reply-to', 'content-type','x-spam-status','x-spam-flag');

If I prevent the query planner from reordering the joins with 'set join_collapse_limit=1;' then the same query is faster. At the end of the mail is the output of an explain analyze for both cases. The statistics of the database are updated each night. Is there an error (in the statistical data) which is responsible for the reordering of the joins? And if not, are there other alternatives for preventing join reordering?

Thanks Reinhard

Explain analyze with set join_collapse_limit=8:

Merge Join (cost=388657.62..391332.20 rows=821 width=127) (actual time=82677.950..89103.192 rows=2699 loops=1)
  Merge Cond: ("outer".physmessage_id = "inner".physmessage_id)
  -> Sort (cost=2901.03..2902.61 rows=632 width=16) (actual time=247.238..247.578 rows=373 loops=1)
       Sort Key: m.physmessage_id
       -> Bitmap Heap Scan on dbmail_messages m (cost=9.16..2871.63 rows=632 width=16) (actual time=38.072..246.509 rows=373 loops=1)
            Recheck Cond: (mailbox_idnr = 1022)
            Filter: ((message_idnr >= 698928) AND (message_idnr <= 1496874))
            -> Bitmap Index Scan on dbmail_messages_8 (cost=0.00..9.16 rows=902 width=0) (actual time=25.561..25.561 rows=615 loops=1)
                 Index Cond: (mailbox_idnr = 1022)
  -> Sort (cost=385756.58..387089.35 rows=533108 width=127) (actual time=80156.731..85760.186 rows=3278076 loops=1)
       Sort Key: v.physmessage_id
       -> Hash Join (cost=51.00..285787.17 rows=533108 width=127) (actual time=34.519..28260.855 rows=3370242 loops=1)
            Hash Cond: ("outer".headername_id = "inner".id)
            -> Seq Scan on dbmail_headervalue v (cost=0.00..241200.39 rows=7840939 width=115) (actual time=0.006..16844.479 rows=7854485 loops=1)
            -> Hash (cost=50.72..50.72 rows=113 width=28) (actual time=34.493..34.493 rows=35 loops=1)
                 -> Bitmap Heap Scan on dbmail_headername n (cost=28.44..50.72 rows=113 width=28) (actual time=11.796..34.437 rows=35 loops=1)
                      Recheck Cond: ((lower((headername)::text) = 'from'::text) OR (lower((headername)::text) = 'to'::text) OR (lower((headername)::text) = 'cc'::text) OR (lower((headername)::text) = 'subject'::text) OR (lower((headername)::text) = 'date'::text) OR (lower((headername)::text) = 'message-id'::text) OR (lower((headername)::text) = 'priority'::text) OR (lower((headername)::text) = 'x-priority'::text) OR (lower((headername)::text) = 'references'::text) OR (lower((headername)::text) = 'newsgroups'::text) OR (lower((headername)::text) = 'in-reply-to'::text) OR (lower((headername)::text) = 'content-type'::text) OR (lower((headername)::text) = 'x-spam-status'::text) OR (lower((headername)::text) = 'x-spam-flag'::text))
                      -> BitmapOr (cost=28.44..28.44 rows=116 width=0) (actual time=11.786..11.786 rows=0 loops=1)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.037..0.037 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'from'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.013..0.013 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'to'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.013..0.013 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'cc'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.014..0.014 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'subject'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.014..0.014 rows=3 loops=1)
                                Index Cond: (lower((headername)::text) = 'date'::text)
                           -> Bitmap Index Scan on dbmail_headername_lower_headername (cost=0.00..2.03 rows=8 width=0) (actual time=0.019..0.019 rows=4 loops=1)
Re: [PERFORM] Query Optimization
Reinhard Vicinus [EMAIL PROTECTED] writes: PostgreSQL version: 8.1.6 The following query needs a lot of time because the query planner reorders the joins: Try reducing random_page_cost, increasing effective_cache_size, and/or updating to PG 8.2. Any of these are likely to make it like the nestloop plan better... regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
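For anyone trying Tom's suggestions, both settings can be tested per-session before committing anything to postgresql.conf. The values below are purely illustrative, not recommendations:

```sql
-- Illustrative values only; tune to your own hardware.
SET random_page_cost = 2.0;         -- default is 4.0; lower favors index scans
SET effective_cache_size = 131072;  -- in 8kB pages on 8.1: ~1GB of OS cache
EXPLAIN ANALYZE SELECT ...;         -- re-check which plan the planner picks
```

Lowering random_page_cost and raising effective_cache_size both make index-driven nestloop plans look cheaper to the planner, which is the effect Tom describes.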
[PERFORM] Query optimization with X Y JOIN
Hey guys, how u been. This is quite a newbie question, but I need to ask it. I'm trying to wrap my mind around the syntax of join and why and when to use it. I understand the concept of making a query go faster by creating indexes, but it seems that when I want data from multiple tables that link together, the query goes slow. The slowness is typically due to expensive nested loops. The reason is, all my brain understands is: select tablea.data, tableb.data, tablec.data from tablea, tableb, tablec where tablea.pri_key = tableb.foreign_key AND tableb.pri_key = tablec.foreign_key AND... From what I read, it seems you can use inner/outer right/left join on (bla), but when I see syntax examples I see that sometimes tables are omitted from the 'from' section of the query and other times, no. Sometimes I see that the join commands are nested and others, no, and sometimes I see join syntax that only applies to one table. From what I understand, join can be used to tell the database the fast way to merge table data together to get results, by specifying the table that has the primary keys and the table that has the foreign keys. I've read all through the postgres docs on this command and I'm still left lost. Can someone please explain to me in simple language how to use these commands or provide me with a link? I need it to live right now. Thanx.
Re: [PERFORM] Query optimization with X Y JOIN
First, this isn't really the right place to ask -- this forum is about performance, not SQL syntax. Second, this isn't a question anyone can answer in a reasonable length of time. What you're asking for usually is taught in a class on relational database theory, which is typically a semester or two in college. If you really need a crash course, dig around on the web for terms like SQL Tutorial. Good luck, Craig

[EMAIL PROTECTED] wrote: [original message snipped]

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Query optimization with X Y JOIN
If I want my database to go faster, due to X, then I would think that the issue is about performance. I wasn't aware of a particular constraint on X. I have more than a rudimentary understanding of what's going on here; I was just hoping that someone could shed some light on the basic principle of this JOIN command and its syntax. Most people I ask don't give me straight answers, and what I have already read on the web is not very helpful thus far.

- Original Message - From: Craig A. James [EMAIL PROTECTED] To: pgsql-performance@postgresql.org Sent: Thursday, January 26, 2006 11:12 AM Subject: Re: [PERFORM] Query optimization with X Y JOIN

[quoted reply snipped]

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Query optimization with X Y JOIN
[EMAIL PROTECTED] wrote: If I want my database to go faster, due to X, then I would think that the issue is about performance. I wasn't aware of a particular constraint on X. I have more than a rudimentary understanding of what's going on here; I was just hoping that someone could shed some light on the basic principle of this JOIN command and its syntax. Most people I ask don't give me straight answers, and what I have already read on the web is not very helpful thus far.

What you are looking for is here: http://sqlzoo.net/ It is an excellent website that discusses, in depth but at a tutorial level, what SQL is and how to use it - including JOINs. FYI, a JOIN is basically a FROM with an integrated WHERE clause. That is a very simplified description and isn't 100% accurate, but it is close. I strongly suggest the website I mentioned above, as it will resolve your question. Joshua D. Drake -- The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564 PostgreSQL Replication, Consulting, Custom Development, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: PLphp, PLperl - http://www.commandprompt.com/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
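Joshua's "a JOIN is basically a FROM with an integrated WHERE clause" can be shown concretely. Using the generic table names from the original question (these are not real tables), the two forms below return the same rows for an inner join:

```sql
-- Implicit join: plain FROM list plus a WHERE condition.
SELECT a.data, b.data
FROM tablea a, tableb b
WHERE a.pri_key = b.foreign_key;

-- Explicit JOIN: the same condition moves into the ON clause.
SELECT a.data, b.data
FROM tablea a
JOIN tableb b ON a.pri_key = b.foreign_key;
```

The planner treats both identically for inner joins; the explicit syntax mainly matters once outer joins enter the picture, where it is the only way to express the query.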
Re: [PERFORM] Query optimization with X Y JOIN
[EMAIL PROTECTED] wrote: If I want my database to go faster, due to X, then I would think that the issue is about performance. I wasn't aware of a particular constraint on X. You haven't asked a performance question yet though. I have more than a rudimentary understanding of what's going on here; I was just hoping that someone could shed some light on the basic principle of this JOIN command and its syntax. Most people I ask don't give me straight answers, and what I have already read on the web is not very helpful thus far.

OK - firstly, it's not a JOIN command. It's a SELECT query that happens to join (in your example) three tables together. The syntax is specified in the SQL reference section of the manuals, and I don't think it's different from the standard SQL spec here. A query that joins two or more tables (be they real base-tables, views or sub-query result-sets) produces the Cartesian product of the two. Normally you don't want this, so you apply constraints to that join (table_a.col1 = table_b.col2). In some cases you want all the rows from one side of a join, whether or not you get a match on the other side of the join. This is called an outer join and results in NULLs for all the columns on the outside of the join. A left-join returns all rows from the table on the left of the join, a right-join from the table on the right of it. When planning a join, the planner will try to estimate how many matches it will see on each side, taking into account any extra constraints (you might want only some of the rows in table_a anyway). It then decides whether to use any indexes on the relevant column(s). Now, if you think the planner is making a mistake, we'll need to see the output of EXPLAIN ANALYSE for the query, and will want to know that you've vacuumed and analysed the tables in question. Does that help at all? -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
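Richard's outer-join description in concrete form (generic table and column names, not from the original post):

```sql
-- Every row of table_a survives the join; where table_b has no matching
-- row, b.col2 comes back as NULL instead of the row being dropped.
SELECT a.col1, b.col2
FROM table_a a
LEFT JOIN table_b b ON a.col1 = b.col2;
```

A handy corollary: adding `WHERE b.col2 IS NULL` to the above keeps only the table_a rows with no match at all, which is a common way to find "orphan" rows.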
Re: [PERFORM] Query optimization with X Y JOIN
Yes, that helps a great deal. Thank you so much.

- Original Message - From: Richard Huxton dev@archonet.com To: [EMAIL PROTECTED] Cc: pgsql-performance@postgresql.org Sent: Thursday, January 26, 2006 11:47 AM Subject: Re: [PERFORM] Query optimization with X Y JOIN

[quoted reply snipped]

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[PERFORM] Query Optimization - Hash Join estimate off?
Hello! First off, I'm a real newbie at trying to read the output of explain analyze. I have several similar queries in my application that I've got incorporated into views. When they run sub 300ms, the users don't seem to mind. However, one of them (query is below along with some relevant table information) is running about 800ms and my users are starting to grumble. I ran explain analyze on it (explain analyze results are below). I noticed that the biggest chunk of time is being taken by a Hash Join near the top of the output (I'm still not sure what the indentation means and what the order means). If I look at the estimate, it is comparable to several other hash join estimates in the query; however, the actual cost in time is significantly higher than those other hash joins. Is this significant? I tried optimizing according to "SQL Tuning" by Tow, but this actually seemed to slow things down. It also seemed that the query optimizer in PostgreSQL reordered things on its own according to its own plan anyway. Is this correct? I'd appreciate any help I can get to try to get this query below 300ms. Thanks! Mark The platform is a dual 2.2GHz Xeon 1.2GB RAM with mirrored drives (raid 1) running Win2000 Pro. I run "vacuum analyze" every night. The postgresql.conf is basically standard except that I've opened it up to listen to the external network. 
Other changes: max_connections = 100 shared_buffers = 1

query (the person_id = 1 in the where clause is changed on a case by case basis - depending upon who's running the query):

explain analyze SELECT DISTINCT c.job_id, g.person_id, c.job_no, b.deadline, c.name, bid_date(c.job_id) AS bid_date, c.miscq, c.city, c.st, j.name AS eng, c.s_team AS salesteam, CASE WHEN c.file_loc = 0 THEN 'No Bid'::character varying WHEN c.file_loc = -1 THEN 'Bid Board'::character varying WHEN c.file_loc = -2 THEN 'Lost Job'::character varying WHEN c.file_loc = -3 THEN 'See Job Notes'::character varying WHEN c.file_loc < -3 OR c.file_loc IS NULL THEN ''::character varying WHEN h.initials IS NOT NULL THEN h.initials ELSE 'Unknown person'::character varying END AS file_loc, COALESCE(c.city::text || COALESCE(', '::text || c.st::text, ''::text), COALESCE(c.st, ''::character varying)::text) AS "location", c.file_loc AS file_loc_id FROM status a LEFT JOIN status_list b ON a.status_id = b.status_id AND b.active LEFT JOIN job c ON c.job_id = b.job_id LEFT JOIN builder_list d ON c.job_id = d.job_id AND (d.won_heat OR d.won_vent OR d.won_tc OR c.heat AND d.bid_heat AND d.won_heat IS NULL OR c.vent AND d.bid_vent AND d.won_vent IS NULL OR c.tc AND d.bid_tc AND d.won_tc IS NULL) AND d.role = 'C'::bpchar LEFT JOIN company e ON d.company_id = e.company_id LEFT JOIN call_list f ON e.company_id = f.company_id LEFT JOIN person g ON f.person_id = g.person_id OR "position"(c.s_team::text, g.initials::text) > 0 LEFT JOIN person h ON c.file_loc = h.person_id LEFT JOIN builder_list i ON c.job_id = i.job_id AND i.role = 'E'::bpchar LEFT JOIN company j ON i.company_id = j.company_id WHERE a.name::text = 'Awaiting Award'::character varying::text and g.person_id = 1 ORDER BY c.job_id, g.person_id, c.job_no, b.deadline, c.name, bid_date(c.job_id), c.miscq, c.city, COALESCE(c.city::text || COALESCE(', '::text || c.st::text, ''::text), COALESCE(c.st, ''::character varying)::text), c.st, CASE WHEN c.file_loc = 0 THEN 'No Bid'::character varying WHEN c.file_loc = -1 THEN 'Bid Board'::character varying WHEN c.file_loc = -2 THEN 'Lost Job'::character varying WHEN c.file_loc = -3 THEN 'See Job Notes'::character varying WHEN c.file_loc < -3 OR c.file_loc IS NULL THEN ''::character varying WHEN h.initials IS NOT NULL THEN h.initials ELSE 'Unknown person'::character varying END, j.name, c.s_team, c.file_loc;

Tables: status - 14 rows, status_list - 6566 rows, job - 2210 rows, builder_list - 9670 rows, company - 1249 rows, call_list - 4731 rows, person - 27 rows

Primary keys: any field with a "_id" suffix is a primary key, and thus is implicitly indexed. Other indexes: status_list(job_id) btree, status_list(status_id) btree, job(file_loc) btree, builder_list(company_id) btree, call_list(company_id) btree, call_list(person_id) btree, person(company_id) btree

explain analyze:

Unique (cost=1798.47..1809.38 rows=291 width=114) (actual time=766.000..781.000 rows=566 loops=1)
  -> Sort (cost=1798.47..1799.19 rows=291 width=114) (actual time=766.000..766.000 rows=1473 loops=1)
       Sort Key: c.job_id, g.person_id, c.job_no, b.deadline, c.name, bid_date(c.job_id), c.miscq, c.city, COALESCE(((c.city)::text || COALESCE((', '::text || (c.st)::text), ''::text)), (COALESCE(c.st, ''::character varying))::text), c.st, CASE WHEN (c.fi (..)
       -> Hash Left Join (cost=1750.81..1786.56 rows=291 width=114) (actual time=453.000..750.000 rows=1473 loops=1)
            Hash Cond: ("outer".company_id = "inner".company_id)
            -> Merge Left Join (cost=1707.20..1722.53 rows=291 width=95) (actual time=437.000..484.000 rows=1473 loops=1)
                 Merge Cond:
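On the indentation question: each "->" line in EXPLAIN ANALYZE output is a plan node feeding the node one level up, so the Hash Left Join near the top consumes the output of everything indented beneath it, and the outermost (least indented) node finishes last. A quick, session-local way to test whether the hash join itself is the problem (sketch only; the SELECT stands in for the view's query):

```sql
SET enable_hashjoin = off;   -- steer the planner away from hash joins
EXPLAIN ANALYZE SELECT ...;  -- re-run the report query and compare timings
SET enable_hashjoin = on;    -- restore the default for this session
```

If the query gets faster with hash joins disabled, the cost estimate on that node is likely off and the usual levers (statistics targets, work_mem, planner cost settings) are worth a look; if it gets slower, the hash join was the right call and the time is genuinely in the data volume.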
[PERFORM] query optimization help
Hi All, I have the following query to generate a report grouped by "states".

SELECT distinct upper(cd.state) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d left JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0,15) AND d.flags IN (1,9,10,12) AND lower(cd.country) = 'us' AND date_part('year',d.time) = 2004 GROUP BY mystate ORDER BY mystate;

 mystate | total_amount | total_fee
---------+--------------+-----------
         |         3695 |         0
 AR      |         3000 |         0
 AZ      |         1399 |         0
 CA      |       113100 |      6242
 FL      |       121191 |      9796
 GA      |     34826876 |    478888
 GEORGIA |        57990 |      3500
 IEIE    |       114000 |      4849
 MD      |            2 |      1158
 MI      |       906447 |         0
 NY      |         8000 |       600
 PA      |         6200 |       375
 SC      |        25000 |       600
 TN      |      1443681 |      1124
         |        13300 |         0
(15 rows)

If you notice, my problem in this query is that the records for GA and GEORGIA appear separately. But what I want to do is to have them combined into a single entry with their values summed up. Initially we had accepted both formats as input for the state field. Also, there are some invalid entries for the state field (like the "IEIE" and null values), which appear because the input for state was not validated initially. These entries have to be eliminated from the report. This query did not take a long time to complete, but did not meet the needs for the report.
So, the query was rewritten to the following query, which takes nearly 7-8 mins to complete on our test database:

SELECT (SELECT DISTINCT pc.state FROM postalcode pc WHERE UPPER(cd.state) IN (pc.state, pc.state_code)) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26,0,15) AND d.flags IN (1,9,10,12) AND lower(cd.country) = 'us' AND date_part('year', d.time) = 2004 GROUP BY mystate ORDER BY mystate;

 mystate        | total_amount | total_fee
----------------+--------------+-----------
 ARIZONA        |         1399 |         0
 ARKANSAS       |         3000 |         0
 CALIFORNIA     |       113100 |      6242
 FLORIDA        |       121191 |      9796
 GEORGIA        |     34884866 |    482388
 MARYLAND       |            2 |      1158
 MICHIGAN       |       906447 |         0
 NEW YORK       |         8000 |       600
 PENNSYLVANIA   |         6200 |       375
 SOUTH CAROLINA |        25000 |       600
 TENNESSEE      |      1443681 |      1124
                |       130995 |      4849

Here is the explain analyze of this query:

QUERY PLAN
Aggregate (cost=1226.57..1226.58 rows=1 width=38) (actual time=362355.58..362372.09 rows=12 loops=1)
  -> Group (cost=1226.57..1226.57 rows=1 width=38) (actual time=362355.54..362367.73 rows=2197 loops=1)
       -> Sort (cost=1226.57..1226.57 rows=1 width=38) (actual time=362355.53..362356.96 rows=2197 loops=1)
            Sort Key: (subplan)
            -> Nested Loop (cost=0.00..1226.56 rows=1 width=38) (actual time=166.11..362321.46 rows=2197 loops=1)
                 -> Nested Loop (cost=0.00..1220.53 rows=1 width=26) (actual time=1.68..361.32 rows=2115 loops=1)
                      -> Seq Scan on customerdata cd (cost=0.00..274.32 rows=31 width=10) (actual time=0.04..29.87 rows=3303 loops=1)
                           Filter: (lower((country)::text) = 'us'::text)
                      -> Index Scan using data_uid_idx on data d (cost=0.00..30.08 rows=1 width=16) (actual time=0.04..0.09 rows=1 loops=3303)
                           Index Cond: (d.uid = "outer".uid)
                           Filter: (((what = 26) OR (what = 0) OR (what = 15)) AND ((flags = 1) OR (flags = 9) OR (flags = 10) OR (flags = 12)) AND (date_part('year'::text, "time") = 2004::double precision))
                 -> Index Scan using merchant_purchase_data_idx on merchant_purchase mp (cost=0.00..6.01 rows=1 width=12) (actual time=0.05..0.05 rows=1 loops=2115)
                      Index Cond: ("outer".id = mp.data_id)
                 SubPlan
                   -> Unique (cost=2237.12..2243.22 rows=122 width=13) (actual time=161.25..164.68 rows=1 loops=2197)
                        -> Sort (cost=2237.12..2240.17 rows=1220 width=13) (actual time=161.21..161.88 rows=1033 loops=2197)
                             Sort Key: state
                             -> Seq Scan on postalcode pc (cost=0.00..2174.56 rows=1220 width=13) (actual time=35.79..148.33 rows=1033 loops=2197)
                                  Filter: ((upper(($0)::text) = (state)::text) OR (upper(($0)::text) = (state_code)::text))
Total runtime: 362372.57 msec

The postalcode table is used in the query to validate the states and to combine the entries like GA and GEORGIA.

\d postalcode
              Table "public.postalcode"
   Column   |         Type          | Modifiers
------------+-----------------------+-------------------------------------------------------------
 id         | integer               | not null default nextval('public.postalcode_id_seq'::text)
 country    | character(2)          |
 state      | character varying(30) |
 zipcode    | character varying(20) |
 city       | character varying(50) |
 city_alias | character varying(20) |
 state_code | character varying(2)  |
Indexes: postalcode_country_key unique btree (country, state_code, zipcode), postalcode_state_code_idx btree (state_code), postalcode_state_idx btree (state)

The postalcode table has 70328 rows! Can someone please help me
Re: [PERFORM] query optimization help
Please post in plaintext, not html where possible. Your group by clause was 'myst'...was that supposed to be mystate? Here is something you might try...use the original query form and create a function which resolves the state code from the input data...you are already doing that with upper. So,

create function get_state_code(text) returns char(2) as $$
  select case when length($1) = 2 then upper($1)
         else lookup_state_code($1) end;
$$ language sql stable;

lookup_state_code is a similar function which boils down to a select from a lookup table. Or, you could make a giant CASE statement (when GEORGIA then GA, etc.), and now your function becomes IMMUTABLE and should execute very fast. Just make sure all the states are spelled correctly in the original table via a domain constraint. Merlin

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of sarlav kumar Sent: Friday, January 14, 2005 9:40 AM To: pgsqlnovice; pgsqlperform Subject: [PERFORM] query optimization help

[original message snipped]

---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
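Merlin's two-function suggestion as a fuller sketch. Everything here is hypothetical: the states lookup table, its columns, and lookup_state_code are assumptions filled in for illustration (and $$ quoting assumes PostgreSQL 8.0+):

```sql
-- Hypothetical lookup table mapping full state names to two-letter codes.
CREATE TABLE states (name varchar(30) PRIMARY KEY, code char(2) NOT NULL);

-- Resolve a full name ('GEORGIA') to its code ('GA'); NULL if unknown.
CREATE FUNCTION lookup_state_code(text) RETURNS char(2) AS $$
  SELECT code FROM states WHERE name = upper($1);
$$ LANGUAGE sql STABLE;

-- Pass two-letter inputs through; translate everything else.
CREATE FUNCTION get_state_code(text) RETURNS char(2) AS $$
  SELECT CASE WHEN length($1) = 2 THEN upper($1)::char(2)
              ELSE lookup_state_code($1) END;
$$ LANGUAGE sql STABLE;
```

The all-CASE variant Merlin mentions removes the table lookup entirely, which is why it can be declared IMMUTABLE; the tradeoff is that adding a state means editing the function instead of inserting a row.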
Re: [PERFORM] query optimization help
Hi, Thanks for the help. I actually got around it by doing the following. I created a temporary table:

create table statesnew as select distinct state, state_code from postalcode where lower(country)='us';

And then changed the query to:

SELECT (SELECT sn.state FROM statesnew sn WHERE UPPER(cd.state) IN (sn.state, sn.state_code)) as mystate, SUM(d.amount) as total_amount, SUM(COALESCE(d.fee,0) + COALESCE(mp.seller_fee, 0) + COALESCE(mp.buyer_fee,0)) as total_fee FROM data d JOIN customerdata cd ON d.uid = cd.uid LEFT JOIN merchant_purchase mp ON d.id = mp.data_id WHERE d.what IN (26, 0, 15) AND d.flags IN (1, 9, 10, 12) AND lower(cd.country) = 'us' AND date_part('year', d.time) = 2004 GROUP BY mystate ORDER BY mystate;

This worked well, as it reduced the number of entries it had to search. I am not sure how to use the function you have written. Can you give me pointers on that? Thanks, Saranya

Merlin Moncure [EMAIL PROTECTED] wrote: Please post in plaintext, not html where possible. Your group by clause was 'myst'...was that supposed to be mystate?

Yes, it is mystate. It continues on the next line :)

Here is something you might try...use the original query form and create a function which resolves the state code from the input data...you are already doing that with upper. So, create function get_state_code(text) returns char(2) as $$ select case when length($1) = 2 then upper($1) else lookup_state_code($1) end; $$ language sql stable; lookup_state_code is a similar function which boils down to a select from a lookup table. Or, you could make a giant case statement (WHEN 'GEORGIA' THEN 'GA', etc.), and now your function becomes IMMUTABLE and should execute very fast. Just make sure all the states are spelled correctly in the original table via domain constraint. Merlin
Re: [PERFORM] Query Optimization
sarlav kumar wrote: Hi all, Can someone please help me optimize this query? Is there a better way to write this query? I am generating a report of transactions ordered by time and with details of the sender and receiver etc. SELECT distinct a.time::date ||'br'||substring(a.time::time::text,1,8) as Time, CASE WHEN a.what = 0 THEN 'Money Transfer' WHEN a.what = 15 THEN 'Purchase' WHEN a.what = 26 THEN 'Merchant Streamline' WHEN a.what = 13 THEN 'Reversal' END as Transaction_Type , c1.account_no as SenderAccount, c2.account_no as RecieverAccount, b.country as SenderCountry, d.country as RecieverCountry, b.firstname as SenderFirstName, b.lastname as SenderLastName, d.firstname as ReceiverFirstName, d.lastname as ReceiverLastName, a.status as status, (select sum(td.amount * 0.01) from transaction_data td where td.data_id = a2.id and td.dir = 1 and td.uid = a.target_uid) as ReversedAmount, (select sum(td.amount * 0.01) from transaction_data td where td.data_id = a2.id and td.dir = 0 and td.uid = a.uid ) as DepositedAmount, a.flags, (a.amount * 0.01) as Amount, (a.fee * 0.01) as Fee FROM data a, customerdata b, customerdata d, customer c1, customer c2 , participant p, data a2 WHERE p.id = a.partner_id AND (a.uid = c1.id) AND (a.target_uid = c2.id) and c1.id=b.uid and c2.id=d.uid and a.confirmation is not null AND (a2.ref_id = a.id) and ((a2.what = 13) or (a2.what = 17) ) ORDER BY time desc ; (query plan followed) The expensive operation is the UNIQUE. Are you sure, in terms of business logic, that this is necessary? Is it actually possible to have duplicate transactions at the exact same time, and if so, would you really want to eliminate them? As an aside, I prefer to have numeric constants like the 'what' field in a small lookup table of two columns (what_code, what_description); it's easier to extend and to document. 
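Andrew's aside about moving magic numbers into a lookup table might look like the following. The table and column names are illustrative only; they are not part of the original schema:

```sql
-- Hypothetical lookup table for the 'what' transaction-type codes.
CREATE TABLE what_code (
    what_code        integer PRIMARY KEY,
    what_description text NOT NULL
);

INSERT INTO what_code VALUES
    (0,  'Money Transfer'),
    (13, 'Reversal'),
    (15, 'Purchase'),
    (26, 'Merchant Streamline');

-- The CASE expression in the report then collapses to a join:
-- SELECT ..., w.what_description AS Transaction_Type
-- FROM data a JOIN what_code w ON w.what_code = a.what ...
```

New codes can then be added with an INSERT instead of editing every report query.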
[PERFORM] Query Optimization
Hi all, Can someone please help me optimize this query? Is there a better way to write this query? I am generating a report of transactions ordered by time and with details of the sender and receiver etc. SELECT distinct a.time::date ||'br'||substring(a.time::time::text,1,8) as Time,CASE WHEN a.what = 0 THEN 'Money Transfer' WHEN a.what = 15 THEN 'Purchase' WHEN a.what = 26 THEN 'Merchant Streamline' WHEN a.what = 13 THEN 'Reversal' END as Transaction_Type ,c1.account_no as SenderAccount, c2.account_no as RecieverAccount, b.country as SenderCountry, d.country as RecieverCountry,b.firstname as SenderFirstName, b.lastname as SenderLastName, d.firstname as ReceiverFirstName, d.lastname as ReceiverLastName, a.status as status,(select sum(td.amount * 0.01) from transaction_data td where td.data_id = a2.id and td.dir = 1 and td.uid = a.target_uid) as ReversedAmount,(select sum(td.amount * 0.01) from transaction_data td where td.data_id = a2.id and td.dir = 0 and td.uid = a.uid ) as DepositedAmount, a.flags, (a.amount * 0.01) as Amount,(a.fee * 0.01) as Fee FROM data a, customerdata b, customerdata d, customer c1, customer c2 , participant p, data a2 WHERE p.id = a.partner_id AND (a.uid = c1.id) AND (a.target_uid = c2.id) and c1.id=b.uid and c2.id=d.uidand a.confirmation is not null AND (a2.ref_id = a.id) and ((a2.what = 13) or (a2.what = 17) ) ORDER BY time desc ; QUERY PLAN-Unique (cost=2978.27..2981.54 rows=8 width=150) (actual time=502.29..506.75 rows=382 loops=1) - Sort (cost=2978.27..2978.46 rows=77 width=150) (actual time=502.29..502.61 rows=461 loops=1) Sort Key: a."time")::date)::text || 'br'::text) || "substring"(((a."time")::time without time zone)::text, 1, 8)), CASE WHEN (a.what = 0) THEN 'Money Transfer'::text WHEN (a.what = 15) THEN 'Purchase'::text WHEN (a.what = 26) THEN 'Merchant Streamline'::text WHEN (a.what = 13) THEN 'Reversal'::text ELSE NULL::text END, c1.account_no, c2.account_no, b.country, d.country, b.firstname, b.lastname, d.firstname, 
d.lastname, a.status, (subplan), (subplan), a.flags, ((a.amount)::numeric * 0.01), ((a.fee)::numeric * 0.01) - Hash Join (cost=2687.00..2975.86 rows=77 width=150) (actual time=423.91..493.48 rows=461 loops=1) Hash Cond: ("outer".partner_id = "inner".id) - Nested Loop (cost=2494.67..2781.99 rows=77 width=146) (actual time=413.19..441.61 rows=472 loops=1) - Merge Join (cost=2494.67..2526.04 rows=77 width=116) (actual time=413.09..429.86 rows=472 loops=1) Merge Cond: ("outer".id = "inner".ref_id) - Sort (cost=1443.39..1458.57 rows=6069 width=108) (actual time=370.14..377.72 rows=5604 loops=1) Sort Key: a.id - Hash Join (cost=203.50..1062.01 rows=6069 width=108) (actual time=20.35..335.44 rows=5604 loops=1) Hash Cond: ("outer".uid = "inner".id) - Merge Join (cost=0.00..676.43 rows=6069 width=91) (actual time=0.42..255.33 rows=5611 loops=1) Merge Cond: ("outer".target_uid = "inner".uid) - Merge Join (cost=0.00..1224.05 rows=6069 width=61) (actual time=0.34..156.74 rows=5611 loops=1) Merge Cond: ("outer".target_uid = "inner".id) - Index Scan using data_target_uid on data a (cost=0.00..2263.05 rows=6069 width=44) (actual time=0.23..63.87 rows=5630 loops=1) Filter: (confirmation IS NOT NULL) - Index Scan using customer_pkey on customer c2 (cost=0.00..631.03 rows=6120 width=17) (actual time=0.05..50.97 rows=10862 loops=1) - Index Scan using customerdata_uid_idx on customerdata d (cost=0.00..312.36 rows=6085 width=30) (actual time=0.06..48.95 rows=10822 loops=1) - Hash (cost=188.20..188.20 rows=6120 width=17) (actual time=19.81..19.81 rows=0 loops=1) - Seq Scan on customer c1 (cost=0.00..188.20 rows=6120 width=17) (actual time=0.03..12.30 rows=6157 loops=1) - Sort (cost=1051.28..1052.52 rows=497 width=8) (actual time=42.05..42.51 rows=542 loops=1) Sort Key: a2.ref_id - Seq Scan on data a2 (cost=0.00..1029.00 rows=497 width=8) (actual time=0.21..41.14 rows=545 loops=1) Filter: ((what = 13) OR (what = 17)) - Index Scan using customerdata_uid_idx on customerdata b
(cost=0.00..3.31 rows=1 width=30) (actual time=0.01..0.01 rows=1 loops=472) Index Cond: (b.uid = "outer".uid) - Hash (cost=192.26..192.26 rows=26 width=4) (actual time=10.50..10.50 rows=0 loops=1) - Seq Scan on participant p (cost=0.00..192.26 rows=26 width=4) (actual time=10.42..10.46 rows=26 loops=1) SubPlan - Aggregate (cost=6.08..6.08 rows=1 width=4) (actual time=0.03..0.03 rows=1 loops=461) - Index Scan using td_data_id_idx on transaction_data td (cost=0.00..6.08 rows=1 width=4) (actual time=0.02..0.02 rows=1 loops=461) Index Cond: (data_id = $0) Filter: ((dir = 1) AND (uid = $1)) - Aggregate (cost=6.08..6.08 rows=1 width=4) (actual time=0.02..0.02 rows=1 loops=461) - Index Scan using td_data_id_idx on transaction_data td (cost=0.00..6.08 rows=1 width=4) (actual time=0.01..0.01 rows=1 loops=461) Index Cond: (data_id = $0) Filter: ((dir = 0) AND (uid =
Re: [PERFORM] query optimization question
On Thu, 29 Jan 2004, Tom Lane wrote: jackdb-# GROUP BY memberid_ HAVING ( Um, that's not what I had in mind at all. Does GROUP BY actually do anything at all here? (You didn't answer me as to whether memberid_ is a unique identifier or not, but if it is, this GROUP BY is just an expensive no-op.) From your comment I assume that there is no transformation in pg that detects that the group by columns are unique? this is all blue-sky speculation anyway. What I'm actually more interested in is your statement that MSSQL can do the original query quickly. I find that a bit hard to believe because I don't see any relevant optimization techniques. Getting rid of the group by would not give that kind of speedup? Maybe MSSQL manages to rewrite the query like that before executing. -- /Dennis Björklund ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] query optimization question
On Thu, 29 Jan 2004, Jack Coates wrote: Probably better to repost it as a gzip'd attachment. That should complete with a picture of the GUI version. 26k zipped, let's see if this makes it through. Are you sure you attached it? At least when it got here there was no attachment. -- /Dennis Björklund ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] query optimization question
Tom Lane [EMAIL PROTECTED] writes: Jack Coates [EMAIL PROTECTED] writes: yup -- here it is. It will probably be a nasty mess after linewrap gets done with it, yup, sure is :-( If I was familiar with the layout I could probably decipher where the line breaks are supposed to be, but right now I'm just confused. I just replaced all newlines that are followed by lines starting in column 1 with spaces and got something reasonable: SELECT DISTINCT members_.memberid_ FROM members_ WHERE ( members_.List_='list1' AND members_.MemberType_='normal'AND members_.SubType_='mail' ) GROUP BY memberid_ HAVING ( ( select count(*) from lyrActiveRecips, outmail_where outmail11 1 0 NULLNULL1 NULL102274.5 NULLNULLNULL104.10356 NULLNULLSELECT 0 NULL |--Parallelism(Gather Streams)11 2 1 Parallelism Gather Streams NULLNULL102274.50.0 0.22011127 23 104.10356 [members_].[MemberID_] NULLPLAN_ROW-1 1.0 |--Filter(WHERE:(If ([Expr1006] IS NULL) then 0 else [Expr1006]+If ([Expr1012] IS NULL) then 0 else [Expr1012]=3)) 11 3 2 Filter Filter WHERE:(If ([Expr1006] IS NULL) then 0 else [Expr1006]+If ([Expr1012] IS NULL) then 0 else [Expr1012]=3) NULL102274.50.0 3.5393338 23 103.88345 [members_].[MemberID_] NULLPLAN_ROW-1 1.0 |--Hash Match(Right Outer Join, HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]), RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID])) 11 4 3 Hash Match Right Outer Join HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]), RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID]) NULL 4782883.5 0.0 21.874712 23 100.34412 [members_].[MemberID_], [Expr1006], [Expr1012] NULLPLAN_ROW-1 1.0 |--Compute Scalar(DEFINE:([Expr1012]=Convert([Expr1020]))) 11 5 4 Compute Scalar Compute Scalar DEFINE:([Expr1012]=Convert([Expr1020])) [Expr1012]=Convert([Expr1020]) 119575.35 0.0 1.3723248 15 4.3749919 [lyrCompletedRecips].[MemberID], [Expr1012] NULLPLAN_ROW-1 1.0 ||--Hash Match(Aggregate, HASH:([lyrCompletedRecips].[MemberID]), 
RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID]) DEFINE:([Expr1020]=COUNT(*))) 11 6 5 Hash Match Aggregate HASH:([lyrCompletedRecips].[MemberID]), RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID]) [Expr1020]=COUNT(*) 119575.35 0.0 1.3723248 15 4.3749919 [lyrCompletedRecips].[MemberID], [Expr1020] NULLPLAN_ROW-1 1.0 | |--Parallelism(Repartition Streams, PARTITION COLUMNS:([lyrCompletedRecips].[MemberID])) 11 7 6 Parallelism Repartition Streams PARTITION COLUMNS:([lyrCompletedRecips].[MemberID]) NULL 119640.60.0 0.32407209 173 3.002667 [lyrCompletedRecips].[MemberID] NULLPLAN_ROW-1 1.0 | |--Nested Loops(Inner Join, OUTER REFERENCES:([outmail_].[MessageID_])) 11 8 7 Nested LoopsInner Join OUTER REFERENCES:([outmail_].[MessageID_]) NULL119640.60.0 0.75014657 173 2.6785948 [lyrCompletedRecips].[MemberID] NULL PLAN_ROW-1 1.0 | |--Parallelism(Distribute Streams) 11 9 8 Parallelism Distribute Streams NULLNULL1.0 0.0 2.8501874E-2128 9.4664574E-2[outmail_].[MessageID_] NULLPLAN_ROW -1 1.0 | ||--Clustered Index Scan(OBJECT:([lmdb].[dbo].[outmail_].[IX_outmail_list]), WHERE:([outmail_].[Type_]='list'))11 10 9 Clustered Index Scan Clustered Index ScanOBJECT:([lmdb].[dbo].[outmail_].[IX_outmail_list]), WHERE:([outmail_].[Type_]='list') [outmail_].[Type_], [outmail_].[MessageID_] 1.0 0.01878925 3.981E-5128 3.7658099E-2[outmail_].[Type_], [outmail_].[MessageID_] NULLPLAN_ROW0 1.0 | |--Clustered Index Seek(OBJECT:([lmdb].[dbo].[lyrCompletedRecips].[IX_CompletedRecipsMailingID]), SEEK:([lyrCompletedRecips].[MailingID]=[outmail_].[MessageID_]), WHERE:([lyrCompletedRecips].[CompletionStatusID]=300 AN11 11 8 Clustered Index SeekClustered Index Seek OBJECT:([lmdb].[dbo].[lyrCompletedRecips].[IX_CompletedRecipsMailingID]), SEEK:([lyrCompletedRecips].[MailingID]=[outmail_].[MessageID_]),
Re: [PERFORM] query optimization question
On Thu, 2004-01-29 at 23:23, Dennis Bjorklund wrote: On Thu, 29 Jan 2004, Jack Coates wrote: Probably better to repost it as a gzip'd attachment. That should complete with a picture of the GUI version. 26k zipped, let's see if this makes it through. Are you sure you attached it? At least when it got here there was no attachment. argh; attached the 40K version which was in color, removed it to make the new one with greyscale and forgot to attach that. Here it is again: -- Jack Coates, Lyris Technologies Applications Engineer 510-549-4350 x148, [EMAIL PROTECTED] Interoperability is the keyword, uniformity is a dead end. --Olivier Fourdan pg-perf-sql-plan.tgz Description: application/compressed-tar ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] query optimization question
On Wed, 2004-01-28 at 18:04, Tom Lane wrote: Jack Coates [EMAIL PROTECTED] writes: I've got a query that needs some help, please. Is there a way to avoid all the looping? I've got freedom to work with the double-indented sections below ) AND (, but the initial select distinct wrapper is much more difficult to change. This is auto-generated code. Well, you're not going to get any serious improvement without a wholesale rewrite of the query --- I'd think that something driven by a GROUP BY memberid_ HAVING count(*) = whatever at the outer level would be a better way to approach it. As you have it, the system has no choice but to fully evaluate two very expensive subselects, from scratch, for each outer row. I hear you. There's definitely an understanding that this tool can generate some gnarly queries, and we want to redesign in a way that will allow some more intelligence to be applied to the problem. In the meantime, I'll be happy if PG grinds at the same level as other databases. MS-SQL completed that query in 25 minutes on a database with 31 times the data in it. Since I'm one of the bigger *nix fans around here, that doesn't make me happy. However... ( select count(*) from lyrActiveRecips, members_ a, outmail_ where lyrActiveRecips.UserName = a.UserNameLC_ and lyrActiveRecips.Domain = a.Domain_ and a.MemberID_ = members_.MemberID_ and outmail_.MessageID_ = lyrActiveRecips.MailingID Is memberid_ a unique identifier for members_, as one would think from the name? If so, can't you drop the join of members_ a in this subselect, and just use the corresponding fields from the outer table? ( select count(*) from lyrCompletedRecips, members_ a, outmail_ where a.MemberID_ = lyrCompletedRecips.MemberID and a.UserNameLC_ = members_.UserNameLC_ and a.Domain_ = members_.Domain_ and outmail_.MessageID_ = lyrCompletedRecips.MailingID Why are the join conditions different here from the other subselect? 
Can't you rephrase them the same as above, and then again remove the inner appearance of members_ ? regards, tom lane unfortunately, the column names are different between lyrcompletedrecips and lyractiverecips. However, one thing we were able to do is to reduce the number of queries by not trying to match across multiple lists. SELECT DISTINCT members_.emailaddr_, members_.memberid_ FROM members_ WHERE ( members_.List_='list1' AND members_.MemberType_='normal' AND members_.SubType_='mail' AND members_.emailaddr_ IS NOT NULL ) AND ( ( select count(*) from lyrActiveRecips, outmail_ where outmail_.MessageID_ = lyrActiveRecips.MailingID and outmail_.Type_ = 'list' and members_.MemberID_ = lyrActiveRecips.MemberID and lyrActiveRecips.NextAttempt '2004-01-20 00:00:00' ) + ( select count(*) from lyrCompletedRecips, outmail_ where members_.MemberID_ = lyrCompletedRecips.MemberID and outmail_.MessageID_ = lyrCompletedRecips.MailingID and outmail_.Type_ = 'list' and lyrCompletedRecips.FinalAttempt '2004-01-20 00:00:00' and lyrCompletedRecips.CompletionStatusID = 300 ) = 3 ); That completed in 3.5 minutes on MS-SQL. I killed the query this morning after 15 hours on PostgreSQL 7.4. I tried a GROUP BY memberid_ HAVING variation, which completed in 59 seconds on MS-SQL. I killed it after 35 minutes on PostgreSQL. On a more positive note, if you remember the benchmarking I was doing last month, PostgreSQL got some pretty good relative numbers. It requires a lot of hand-holding and tuning relative to MS-SQL, but it certainly beat the pants off of Oracle 8 and 9 for speed and ease of management. Oracle 8 was in fact unable to complete the uglier stress tests. I'll be working on a tuning recommendations white paper today. thanks for all the help, -- Jack Coates, Lyris Technologies Applications Engineer 510-549-4350 x148, [EMAIL PROTECTED] Interoperability is the keyword, uniformity is a dead end. --Olivier Fourdan ---(end of broadcast)--- TIP 6: Have you searched our list archives? 
http://archives.postgresql.org
Re: [PERFORM] query optimization question
Jack Coates [EMAIL PROTECTED] writes: That completed in 3.5 minutes on MS-SQL. I killed the query this morning after 15 hours on PostgreSQL 7.4. I tried a GROUP BY memberid_ HAVING variation, which completed in 59 seconds on MS-SQL. I killed it after 35 minutes on PostgreSQL. Hm. I'd like to think that 7.4 would be competitive on grouping queries. What sort of plan did you get from it? regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] query optimization question
Jack Coates [EMAIL PROTECTED] writes: jackdb=# explain SELECT DISTINCT members_.memberid_ jackdb-# FROM members_ jackdb-# WHERE ( members_.List_='list1' jackdb(# AND members_.MemberType_='normal' jackdb(# AND members_.SubType_='mail' jackdb(# AND members_.emailaddr_ IS NOT NULL ) jackdb-# GROUP BY memberid_ HAVING ( Um, that's not what I had in mind at all. Does GROUP BY actually do anything at all here? (You didn't answer me as to whether memberid_ is a unique identifier or not, but if it is, this GROUP BY is just an expensive no-op.) What I was envisioning was pulling the sub-selects up to the top level and using grouping to calculate the count(*) values for all memberids in parallel. Roughly speaking it would look like (again assuming memberid_ is unique) SELECT memberid_ FROM ( SELECT memberid_ FROM lyrActiveRecips, members_, outmail WHERE (all the conditions for this case) UNION ALL SELECT memberid_ FROM lyrCompletedRecips, members_, outmail WHERE (all the conditions for this case) ) GROUP BY memberid_ HAVING count(*) = 3; However, if you can't change the boilerplate part of your query then this is all blue-sky speculation anyway. What I'm actually more interested in is your statement that MSSQL can do the original query quickly. I find that a bit hard to believe because I don't see any relevant optimization techniques. Do they have any equivalent to EXPLAIN that would give some hint how they're doing it? regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
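Tom's UNION ALL rewrite above, written out as complete SQL. The join conditions are abbreviated as comments just as in his sketch; note that PostgreSQL requires an alias on the derived table, which may explain the "subquery errors" mentioned in the follow-up:

```sql
SELECT memberid_
FROM (
    SELECT m.MemberID_ AS memberid_
    FROM lyrActiveRecips r, members_ m, outmail_ o
    WHERE r.MemberID = m.MemberID_
      AND o.MessageID_ = r.MailingID
      -- ... plus the remaining conditions for the active-recipients case
    UNION ALL
    SELECT m.MemberID_ AS memberid_
    FROM lyrCompletedRecips r, members_ m, outmail_ o
    WHERE r.MemberID = m.MemberID_
      AND o.MessageID_ = r.MailingID
      -- ... plus the remaining conditions for the completed-recipients case
) AS both_recips   -- the subquery alias is mandatory; omitting it is a syntax error
GROUP BY memberid_
HAVING count(*) = 3;
```

This evaluates each branch once over the whole data set and counts matches per member in a single grouping pass, instead of re-running two correlated subselects per outer row.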
Re: [PERFORM] query optimization question
On Thu, 2004-01-29 at 11:31, Tom Lane wrote: Jack Coates [EMAIL PROTECTED] writes: jackdb=# explain SELECT DISTINCT members_.memberid_ jackdb-# FROM members_ jackdb-# WHERE ( members_.List_='list1' jackdb(# AND members_.MemberType_='normal' jackdb(# AND members_.SubType_='mail' jackdb(# AND members_.emailaddr_ IS NOT NULL ) jackdb-# GROUP BY memberid_ HAVING ( Um, that's not what I had in mind at all. Does GROUP BY actually do anything at all here? (You didn't answer me as to whether memberid_ is a unique identifier or not, but if it is, this GROUP BY is just an expensive no-op.) Sorry for the misunderstanding. It should be unique, yes. What I was envisioning was pulling the sub-selects up to the top level and using grouping to calculate the count(*) values for all memberids in parallel. Roughly speaking it would look like (again assuming memberid_ is unique) SELECT memberid_ FROM ( SELECT memberid_ FROM lyrActiveRecips, members_, outmail WHERE (all the conditions for this case) UNION ALL SELECT memberid_ FROM lyrCompletedRecips, members_, outmail WHERE (all the conditions for this case) ) GROUP BY memberid_ HAVING count(*) = 3; However, if you can't change the boilerplate part of your query then this is all blue-sky speculation anyway. Got it now -- I'm running into some subquery errors trying to implement this, anyway. What I'm actually more interested in is your statement that MSSQL can do the original query quickly. I find that a bit hard to believe because I don't see any relevant optimization techniques. Do they have any equivalent to EXPLAIN that would give some hint how they're doing it? yup -- here it is. It will probably be a nasty mess after linewrap gets done with it, so let me know if you'd like me to post a copy on ftp. 
SELECT DISTINCT members_.memberid_ FROM members_ WHERE ( members_.List_='list1'AND members_.MemberType_='normal'AND members_.SubType_='mail' ) GROUP BY memberid_ HAVING ( ( select count(*) from lyrActiveRecips, outmail_where outmail 11 1 0 NULLNULL1 NULL102274.5NULL NULLNULL104.10356 NULLNULLSELECT 0 NULL |--Parallelism(Gather Streams)11 2 1 Parallelism Gather Streams NULLNULL102274.50.0 0.22011127 23 104.10356 [members_].[MemberID_] NULLPLAN_ROW-1 1.0 |--Filter(WHERE:(If ([Expr1006] IS NULL) then 0 else [Expr1006]+If ([Expr1012] IS NULL) then 0 else [Expr1012]=3)) 11 3 2 Filter Filter WHERE:(If ([Expr1006] IS NULL) then 0 else [Expr1006]+If ([Expr1012] IS NULL) then 0 else [Expr1012]=3) NULL102274.50.0 3.5393338 23 103.88345 [members_].[MemberID_] NULLPLAN_ROW-1 1.0 |--Hash Match(Right Outer Join, HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]), RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID])) 11 4 3 Hash Match Right Outer Join HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]), RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID]) NULL 4782883.5 0.0 21.874712 23 100.34412 [members_].[MemberID_], [Expr1006], [Expr1012] NULLPLAN_ROW-1 1.0 |--Compute Scalar(DEFINE:([Expr1012]=Convert([Expr1020]))) 11 5 4 Compute Scalar Compute Scalar DEFINE:([Expr1012]=Convert([Expr1020])) [Expr1012]=Convert([Expr1020]) 119575.35 0.0 1.3723248 15 4.3749919 [lyrCompletedRecips].[MemberID], [Expr1012] NULLPLAN_ROW-1 1.0 ||--Hash Match(Aggregate, HASH:([lyrCompletedRecips].[MemberID]), RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID]) DEFINE:([Expr1020]=COUNT(*)))11 6 5 Hash Match Aggregate HASH:([lyrCompletedRecips].[MemberID]), RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID]) [Expr1020]=COUNT(*) 119575.35 0.0 1.3723248 15 4.3749919 [lyrCompletedRecips].[MemberID], [Expr1020] NULLPLAN_ROW-1 1.0 | |--Parallelism(Repartition Streams, PARTITION 
COLUMNS:([lyrCompletedRecips].[MemberID])) 11 7 6 Parallelism Repartition Streams PARTITION COLUMNS:([lyrCompletedRecips].[MemberID]) NULL 119640.60.0 0.32407209 173 3.002667 [lyrCompletedRecips].[MemberID] NULLPLAN_ROW-1 1.0 | |--Nested Loops(Inner Join, OUTER REFERENCES:([outmail_].[MessageID_])) 11 8 7 Nested LoopsInner JoinOUTER REFERENCES:([outmail_].[MessageID_])NULL119640.60.0 0.75014657 173 2.6785948
Re: [PERFORM] query optimization question
On Thu, 2004-01-29 at 14:01, Tom Lane wrote: Probably better to repost it as a gzip'd attachment. That should protect the formatting and get it into the list archives. regards, tom lane complete with a picture of the GUI version. 26k zipped, let's see if this makes it through. -- Jack Coates, Lyris Technologies Applications Engineer 510-549-4350 x148, [EMAIL PROTECTED] Interoperability is the keyword, uniformity is a dead end. --Olivier Fourdan ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[PERFORM] query optimization differs between view and explicit query
I have a large query which I would like to place in a view. The explicit query is sufficiently fast, but the same query as a view is much slower and uses a different plan. I would appreciate an explanation of why this is, and, more importantly whether/how I might coax the view to use a different plan. The original query: [EMAIL PROTECTED] select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as pstart, max(H.pstop) as pstop,A.ident,(A.ident/Q.len::float*100)::int as pct_ident, sum(H.pstop-H.pstart+1) as aln_length,H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as gstart, max(H.gstop) as gstop from p2gblathsp H join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id join pseq Q on H.pseq_id=Q.pseq_id where H.pseq_id=76 group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len \g /dev/null Time: 277.804 ms Now as a view: [EMAIL PROTECTED] create view v1 as select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as pstart, max(H.pstop) as pstop,A.ident,(A.ident/Q.len::float*100)::int as pct_ident, sum(H.pstop-H.pstart+1) as aln_length,H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as gstart, max(H.gstop) as gstop from p2gblathsp H join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id join pseq Q on H.pseq_id=Q.pseq_id group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len; CREATE VIEW Time: 103.041 ms [EMAIL PROTECTED] select * from v1 where pseq_id=76 \g /dev/null Time: 31973.979 ms Okay, that's ~100x slower. The plans: [EMAIL PROTECTED] explain select distinct on snip... 
same as the first query above QUERY PLAN -- Unique (cost=11157.75..11187.26 rows=454 width=40) - GroupAggregate (cost=11157.75..11186.13 rows=454 width=40) - Sort (cost=11157.75..11158.89 rows=454 width=40) Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len - Nested Loop (cost=11125.62..11137.71 rows=454 width=40) - Index Scan using pseq_pkey on pseq q (cost=0.00..3.01 rows=2 width=6) Index Cond: (76 = pseq_id) - Materialize (cost=11125.62..11127.89 rows=227 width=38) - Nested Loop (cost=546.15..11125.62 rows=227 width=38) - Hash Join (cost=546.15..10438.72 rows=227 width=34) Hash Cond: (outer.p2gblathsp_id = inner.p2gblathsp_id) - Seq Scan on p2gblatalnhsp ah (cost=0.00..6504.03 rows=451503 width=8) - Hash (cost=545.58..545.58 rows=227 width=34) - Index Scan using p2gblathsp_p_lookup on p2gblathsp h (cost=0.00..545.58 rows=227 wid Index Cond: (pseq_id = 76) - Index Scan using p2gblataln_pkey on p2gblataln a (cost=0.00..3.01 rows=1 width=8) Index Cond: (outer.p2gblataln_id = a.p2gblataln_id) (17 rows) [EMAIL PROTECTED] explain select * from v1 where pseq_id=76; QUERY PLAN -- Subquery Scan v1 (cost=246907.54..281897.70 rows=2258 width=77) Filter: (pseq_id = 76) - Unique (cost=246907.54..276254.13 rows=451486 width=40) - GroupAggregate (cost=246907.54..275125.41 rows=451486 width=40) - Sort (cost=246907.54..248036.25 rows=451486 width=40) Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len - Hash Join (cost=14019.29..204503.24 rows=451486 width=40) Hash Cond: (outer.p2gblataln_id = inner.p2gblataln_id) - Hash Join (cost=7632.79..191344.45 rows=451486 width=36) Hash Cond: (outer.p2gblathsp_id = inner.p2gblathsp_id) - Merge Join (cost=0.00..176939.38 rows=451486 width=36) Merge Cond: (outer.pseq_id = inner.pseq_id) - Index Scan using p2gblathsp_p_lookup on p2gblathsp h (cost=0.00..16102.40 rows=451485 widt
Re: [PERFORM] query optimization differs between view and explicit
Stephan Szabo [EMAIL PROTECTED] writes: On Thu, 29 Jan 2004, Reece Hart wrote: I have a large query which I would like to place in a view. The explicit query is sufficiently fast, but the same query as a view is much slower and uses a different plan. I would appreciate an explanation of why this is, and, more importantly whether/how I might coax the view to use a different plan. Well, in general [ they're not the same query ] Right. The reason the performance is so much worse is that the restriction pseq_id=76 cannot be pushed down into the view subquery; we have to form the entire logical output of the view and then filter on pseq_id=76. In your inline query you have done the pushing down anyway and so the restriction is applied much lower in the plan, resulting in lots less work. But the results might be different. The point that Stephan makes is explicitly understood by the planner as of PG 7.4: * 3. If the subquery uses DISTINCT ON, we must not push down any quals that * refer to non-DISTINCT output columns, because that could change the set * of rows returned. It's hard to give any advice on how to make a faster view without more context. What's the actual intention in all this? What's the semantics of pseq_id --- is it unique? It might be you could fix the problem by adding pseq_id to the DISTINCT ON list, but we don't have enough info to understand whether that would break the desired behavior. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
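Tom's closing suggestion, sketched against the view from the original post. The column list is abbreviated, and whether adding pseq_id to the DISTINCT ON list preserves the intended one-row-per-alignment semantics depends on the data, so treat this as an assumption to verify:

```sql
-- Sketch only: with pseq_id in the DISTINCT ON list, a qual on pseq_id
-- no longer changes the set of rows the view returns, so the planner
-- is free to push it down into the view subquery.
CREATE VIEW v2 AS
SELECT DISTINCT ON (H.pseq_id, AH.p2gblataln_id)
       AH.p2gblataln_id, H.pseq_id,
       min(H.pstart) AS pstart, max(H.pstop) AS pstop
       -- ... remaining output columns as in v1 ...
FROM p2gblathsp H
JOIN p2gblatalnhsp AH ON H.p2gblathsp_id = AH.p2gblathsp_id
JOIN p2gblataln A     ON AH.p2gblataln_id = A.p2gblataln_id
JOIN pseq Q           ON H.pseq_id = Q.pseq_id
GROUP BY AH.p2gblataln_id, H.pseq_id, H.genasm_id, H.chr, H.plus_strand,
         A.ident, Q.len;

-- SELECT * FROM v2 WHERE pseq_id = 76;  -- restriction now applied below the DISTINCT
```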
[PERFORM] query optimization question
Hi all, I've got a query that needs some help, please. Is there a way to avoid all the looping? I've got freedom to work with the double-indented sections below ) AND (, but the initial select distinct wrapper is much more difficult to change. This is auto-generated code. explain analyze SELECT DISTINCT members_.emailaddr_, members_.memberid_ FROM members_ WHERE ( members_.List_='list1' AND members_.MemberType_='normal' AND members_.SubType_='mail' AND members_.emailaddr_ IS NOT NULL ) AND ( ( select count(*) from lyrActiveRecips, members_ a, outmail_ where lyrActiveRecips.UserName = a.UserNameLC_ and lyrActiveRecips.Domain = a.Domain_ and a.MemberID_ = members_.MemberID_ and outmail_.MessageID_ = lyrActiveRecips.MailingID and outmail_.Type_ = 'list' and lyrActiveRecips.NextAttempt '2004-01-20 00:00:00' ) + ( select count(*) from lyrCompletedRecips, members_ a, outmail_ where a.MemberID_ = lyrCompletedRecips.MemberID and a.UserNameLC_ = members_.UserNameLC_ and a.Domain_ = members_.Domain_ and outmail_.MessageID_ = lyrCompletedRecips.MailingID and outmail_.Type_ = 'list' and lyrCompletedRecips.FinalAttempt '2004-01-20 00:00:00' and lyrCompletedRecips.CompletionStatusID = 300 ) = 3 ) ; QUERY PLAN - Unique (cost=537.06..537.07 rows=1 width=72) (actual time=114460.908..114460.908 rows=0 loops=1) - Sort (cost=537.06..537.06 rows=1 width=72) (actual time=114460.905..114460.905 rows=0 loops=1) Sort Key: emailaddr_, memberid_ - Index Scan using ix_members_list_notifyerr on members_ (cost=0.00..537.05 rows=1 width=72) (actual time=114460.893..114460.893 rows=0 loops=1) Index Cond: ((list_)::text = 'list1'::text) Filter: (((membertype_)::text = 'normal'::text) AND ((subtype_)::text = 'mail'::text) AND (emailaddr_ IS NOT NULL) AND (((subplan) + (subplan)) = 3)) SubPlan - Aggregate (cost=52.39..52.39 rows=1 width=0) (actual time=0.089..0.090 rows=1 loops=818122) - Hash Join (cost=47.55..52.39 rows=1 width=0) (actual time=0.086..0.086 rows=0 loops=818122) Hash Cond: 
(outer.memberid_ = inner.memberid) - Index Scan using ix_members_emaillc on members_ a (cost=0.00..4.83 rows=1 width=4) (actual time=0.077..0.081 rows=1 loops=818122) Index Cond: (((domain_)::text = ($2)::text) AND ((usernamelc_)::text = ($1)::text)) - Hash (cost=47.55..47.55 rows=1 width=4) (actual time=0.025..0.025 rows=0 loops=1) - Hash Join (cost=25.00..47.55 rows=1 width=4) (actual time=0.023..0.023 rows=0 loops=1) Hash Cond: (outer.messageid_ = inner.mailingid) - Seq Scan on outmail_ (cost=0.00..22.50 rows=6 width=4) (actual time=0.001..0.001 rows=0 loops=1) Filter: ((type_)::text = 'list'::text) - Hash (cost=25.00..25.00 rows=2 width=8) (actual time=0.003..0.003 rows=0 loops=1) - Seq Scan on lyrcompletedrecips (cost=0.00..25.00 rows=2 width=8) (actual time=0.001..0.001 rows=0 loops=1) Filter: ((finalattempt '2004-01-20 00:00:00'::timestamp without time zone) AND (completionstatusid = 300)) - Aggregate (cost=51.59..51.59 rows=1 width=0) (actual time=0.033..0.034 rows=1 loops=818122) - Hash Join (cost=27.35..51.59 rows=1 width=0) (actual time=0.028..0.028 rows=0 loops=818122) Hash Cond: (((outer.username)::text = (inner.usernamelc_)::text) AND ((outer.domain)::text = (inner.domain_)::text)) - Hash Join (cost=22.52..46.72 rows=3 width=211) (actual time=0.003..0.003 rows=0 loops=818122) Hash Cond: (outer.mailingid = inner.messageid_) - Seq Scan on lyractiverecips (cost=0.00..22.50 rows=334 width=215) (actual time=0.001..0.001 rows=0 loops=818122) Filter: (nextattempt '2004-01-20 00:00:00'::timestamp