Re: [PERFORM] bad estimates

2003-08-30 Thread Bruno Wolff III
I haven't come up with any great ideas for this one. It might be interesting
to compare the explain analyze output from the distinct on query with
and without seqscans enabled.
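
Something like this, say (a sketch; enable_seqscan is per-session and
only discourages seq scans in the cost estimates rather than forbidding
them outright):

  explain analyze
  select distinct on (stock_id) stock_id, price_date
    from day_ends
   order by stock_id, price_date;

  set enable_seqscan = off;

  explain analyze
  select distinct on (stock_id) stock_id, price_date
    from day_ends
   order by stock_id, price_date;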


Re: [PERFORM] bad estimates

2003-08-30 Thread Ken Geis
Bruno Wolff III wrote:
> I haven't come up with any great ideas for this one. It might be interesting
> to compare the explain analyze output from the distinct on query with
> and without seqscans enabled.
Can't do that comparison.  Remember, with seqscans enabled it fails,
running out of temporary space.  (Oh, and that nested-loops solution I
thought was fast actually took 31 minutes versus 29 for the index scan
in 7.4b2.)

I ran another query across the same data:

select price_date, count(*) from day_ends group by price_date;

It used a table scan and hashed aggregates, and it ran in 5.5 minutes. 
Considering that, pgsql should be able to do the query that I had been 
running in a little more time than that.  So...

From what I've learned, we want to convince the optimizer to use a 
table scan; that's a good thing.  I want it to use hashed aggregates, 
but I can't convince it to (unless maybe I removed all of the 
statistics).  To use group aggregates, it first sorts the results of the 
table scan (all 17 million rows!).  There ought to be some way to tell 
pgsql not to do sorts above a certain size.  In this case, if I set 
enable_sort=false, it goes back to the index scan.  If I then set 
enable_indexscan=false, it goes back to sorting.
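
For the record, the knob sequence I mean looks like this (a sketch; the
enable_* settings only penalize a plan type in the cost estimates, they
don't forbid it outright, which is why turning both off still yields a
sorted plan):

  set enable_sort = off;       -- planner goes back to the index scan
  set enable_indexscan = off;  -- now it sorts anyway; some plan must be chosen
  explain
  select stock_id, min(price_date)
    from day_ends
   group by stock_id;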


Re: [PERFORM] bad estimates

2003-08-30 Thread Ken Geis
Bruno Wolff III wrote:
> I haven't come up with any great ideas for this one. It might be interesting
> to compare the explain analyze output from the distinct on query with
> and without seqscans enabled.
After digging through planner code, I found that bumping up the sort_mem 
will make the planner prefer a full table scan and hashed aggregation. 
The sort memory is where the hash table is stored.  In the end, the 
query runs in 4.5 minutes, which is reasonable.
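
The change amounts to something like this (a sketch; sort_mem is in
kilobytes, and 64MB here is just an illustrative value, not necessarily
the exact setting I used):

  set sort_mem = 65536;  -- default is 1024, i.e. 1MB
  explain analyze
  select stock_id, min(price_date), max(price_date)
    from day_ends
   group by stock_id;
  -- with enough sort_mem the plan flips from Sort plus GroupAggregate
  -- to Seq Scan plus HashAggregate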

I had planned to try Manfred's index correlation patch to see if it 
would give better estimates for an index scan.  The index scan method 
took maybe 6.5x as long, but the estimate was that it would take 1400x 
as long.  I think instead of trying out his patch I might actually work 
on my application!

Ken


Re: [PERFORM] bad estimates

2003-08-30 Thread Tom Lane
Ken Geis [EMAIL PROTECTED] writes:
> From what I've learned, we want to convince the optimizer to use a
> table scan; that's a good thing.  I want it to use hashed aggregates,
> but I can't convince it to (unless maybe I removed all of the
> statistics).

You probably just need to increase sort_mem.  Multiple aggregates take
more RAM to process in a hashtable style ...

regards, tom lane


Re: [PERFORM] bad estimates / non-scanning aggregates

2003-08-29 Thread Bruno Wolff III
On Thu, Aug 28, 2003 at 17:10:31 -0700,
  Ken Geis [EMAIL PROTECTED] wrote:
> The query I want to run is
>
> select stock_id, min(price_date) from day_ends group by stock_id;

The fast way to do this is:

select distinct on (stock_id) stock_id, price_date
  from day_ends
  order by stock_id, price_date;

> Also, to get the MIN for a given group, not all values of the index need
> to be seen.  Must pgsql do a full scan because it treats all aggregates
> in the same way?  Are MIN and MAX used often enough to justify special
> treatment, and could that be cleanly implemented?  Perhaps the aggregate
> function can request the data in a certain order, be told that it is
> being passed data in a certain order, and return before seeing the
> entire set of data.

Yes, max and min are not treated specially, so they don't benefit from
indexes. This has been discussed repeatedly in the archives.


Re: [PERFORM] bad estimates / non-scanning aggregates

2003-08-29 Thread Ken Geis
Bruno Wolff III wrote:
> On Thu, Aug 28, 2003 at 17:10:31 -0700,
>   Ken Geis [EMAIL PROTECTED] wrote:
>> The query I want to run is
>>
>> select stock_id, min(price_date) from day_ends group by stock_id;
>
> The fast way to do this is:
>
> select distinct on (stock_id) stock_id, price_date
>   from day_ends
>   order by stock_id, price_date;
Not according to the optimizer!  Plus, this is not guaranteed to return 
the correct results.

 Unique  (cost=3711244.30..3795942.42 rows=6366 width=8)
   ->  Sort  (cost=3711244.30..3753593.36 rows=16939624 width=8)
         Sort Key: stock_id, price_date
         ->  Seq Scan on day_ends  (cost=0.00..361892.24 rows=16939624 width=8)


Re: [PERFORM] bad estimates / non-scanning aggregates

2003-08-29 Thread Ken Geis
Bruno Wolff III wrote:
>> Not according to the optimizer!  Plus, this is not guaranteed to return
>> the correct results.
> For it to be fast you need an index on (stock_id, price_date) so that
> you can use an index scan.
I already said that such an index existed.  In fact, it is the primary 
key of the table.  And yes, I *am* analyzed!

> The answers are guaranteed to be correct. See:
> http://developer.postgresql.org/docs/postgres/sql-select.html#SQL-DISTINCT
That's good to know.  Thanks!

Ken


Re: [PERFORM] bad estimates

2003-08-29 Thread Ken Geis
Bruno Wolff III wrote:
> On Thu, Aug 28, 2003 at 20:00:32 -0700,
>   Ken Geis [EMAIL PROTECTED] wrote:
>> Bruno Wolff III wrote:
>>>> Not according to the optimizer!  Plus, this is not guaranteed to return
>>>> the correct results.
>>> For it to be fast you need an index on (stock_id, price_date) so that
>>> you can use an index scan.
>> I already said that such an index existed.  In fact, it is the primary
>> key of the table.  And yes, I *am* analyzed!
>
> Your original example didn't actually match that of the table you are showing
> examples from. In that example the second half of the primary key was the
> date, not the end-of-day price. If this is the case for the real table,
> then that is the reason the distinct on doesn't help.
I had obfuscated the table in the example and forgot to do the same with 
the query.  Serves me right for thinking I care about that.

A big problem is that the values I am working with are *only* the 
primary key and the optimizer is choosing a table scan over an index 
scan.  That is why I titled the email "bad estimates".  The table has 
(stock_id, price_date) as the primary key, and a bunch of other columns. 
What I *really* want to do efficiently is

select stock_id, min(price_date), max(price_date)
  from day_ends
 group by stock_id;
It is not the table or the query that is wrong.  It is either the db 
parameters or the optimizer itself.

Ken


Re: [PERFORM] bad estimates

2003-08-29 Thread Bruno Wolff III
On Thu, Aug 28, 2003 at 20:46:00 -0700,
  Ken Geis [EMAIL PROTECTED] wrote:
 
> A big problem is that the values I am working with are *only* the
> primary key and the optimizer is choosing a table scan over an index
> scan.  That is why I titled the email "bad estimates".  The table has
> (stock_id, price_date) as the primary key, and a bunch of other columns.
> What I *really* want to do efficiently is
>
> select stock_id, min(price_date), max(price_date)
>   from day_ends
>  group by stock_id;
>
> It is not the table or the query that is wrong.  It is either the db
> parameters or the optimizer itself.

If you want both the max and the min, then things are going to be a bit
more work. You are either going to want to do two separate selects,
join two selects, or use subselects. If there aren't enough prices
per stock, the sequential scan might be fastest, since you only need to
go through the table once and don't have to hit the index blocks.

It is still odd that you didn't get a big speed up for just the min though.
Your example did have the stock id and the date as the primary key, which
would make sense, since the stock id and stock price on a day wouldn't
be guaranteed to be unique. Are you absolutely sure you have a combined
key on the stock id and the stock price?
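
One shape the "join two selects" variant could take (an untested sketch;
each subselect can walk the (stock_id, price_date) index in one
direction):

  select mn.stock_id,
         mn.price_date as min_date,
         mx.price_date as max_date
    from (select distinct on (stock_id) stock_id, price_date
            from day_ends
           order by stock_id, price_date asc) as mn
    join (select distinct on (stock_id) stock_id, price_date
            from day_ends
           order by stock_id, price_date desc) as mx
   using (stock_id);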


Re: [PERFORM] bad estimates

2003-08-29 Thread Ken Geis
Bruno Wolff III wrote:
> Can you do a \d on the real table or is that too sensitive?
It was silly of me to think of this as particularly sensitive.

stocks=> \d day_ends
       Table "public.day_ends"
   Column   |     Type     | Modifiers
------------+--------------+-----------
 stock_id   | integer      | not null
 price_date | date         | not null
 open       | numeric(9,4) |
 high       | numeric(9,4) |
 low        | numeric(9,4) |
 close      | numeric(9,4) |
 volume     | integer      |
Indexes: day_ends_pkey primary key btree (stock_id, price_date)
Triggers: RI_ConstraintTrigger_16558399
> It still doesn't make sense that you have a primary key that
> is a stock and its price. What happens when the stock has the
> same price on two different dates? And I doubt that you are looking
> for the minimum and maximum dates for which you have price data.
> So it is hard to believe that the index for your primary key is the
> one you need for your query.
I can see the naming being confusing.  I used price_date because, of 
course, "date" is not a legal name.  day_ends is a horrible name for 
the table; daily_bars would probably be better.  I *am* looking for 
the minimum and maximum dates for which I have price data.  I'm running 
this query to build a chart so I can see visually where the majority of 
my data begins, to use as the start of a window for analysis.

When run on 7.3.3, forcing an index scan by setting 
enable_seqscan=false, the query took 55 minutes to run.  The index is 
about 660M in size, and the table is 1G.  As I mentioned before, with 
table scans enabled, it bombs, running out of temporary space.

Hey Bruno, thanks for your attention here.  I'm not a newbie, but I've 
never really had performance issues with pgsql before.  And I've been 
running this database for a couple of years now, but I haven't run these 
queries against it.

Ken


Re: [PERFORM] bad estimates

2003-08-29 Thread Ken Geis
Ken Geis wrote:
> I went through the code (7.4 beta2) that estimates the cost of an index
> scan path.  What I need to be sure of is that when running a query in
> pgsql that uses only the columns that are in an index, the underlying
> table need not be accessed.  I know that Oracle does this.
Thinking about it some more, it's obvious to me that a pgsql index scan 
must be accessing the underlying table even though all of the 
information needed is in the index itself.  A linear scan of a 660M file 
should not take 55 minutes.  I could confirm this with stats, but 
someone out there probably already knows the answer here.
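
The stats check would be something along these lines (a sketch; the
pg_statio views report cumulative block fetches per table):

  select relname,
         heap_blks_read, heap_blks_hit,  -- table (heap) block fetches
         idx_blks_read,  idx_blks_hit    -- index block fetches
    from pg_statio_user_tables
   where relname = 'day_ends';
  -- heap reads climbing during a query that touches only indexed
  -- columns would confirm the heap is still being visited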

Ken


Re: [PERFORM] bad estimates

2003-08-29 Thread Christopher Kings-Lynne
> I went through the code (7.4 beta2) that estimates the cost of an index
> scan path.  What I need to be sure of is that when running a query in
> pgsql that uses only the columns that are in an index, the underlying
> table need not be accessed.  I know that Oracle does this.

PostgreSQL absolutely does not do this.  It is also not possible to do this
due to MVCC.

Chris


Re: [PERFORM] bad estimates

2003-08-29 Thread Ken Geis
Ken Geis wrote:
> When run on 7.3.3, forcing an index scan by setting
> enable_seqscan=false, the query took 55 minutes to run.  The index is
> about 660M in size, and the table is 1G.  As I mentioned before, with
> table scans enabled, it bombs, running out of temporary space.
Man, I should wait a while before I send mails, because I keep having 
more to say!

Some good news here.  Doing the same as above on 7.4beta2 took 29 
minutes.  Now, the 7.3.3 was on reiser and 7.4 on ext2, so take that as 
you will.  7.4's index selectivity estimate seems much better; 7.3.3's 
estimated row count was ten times the actual; 7.4's is one half of the actual.

Ken


Re: [PERFORM] bad estimates

2003-08-29 Thread Jeff
On Fri, 29 Aug 2003, Ken Geis wrote:

> Some good news here.  Doing the same as above on 7.4beta2 took 29
> minutes.  Now, the 7.3.3 was on reiser and 7.4 on ext2, so take that as
> you will.  7.4's index selectivity estimate seems much better; 7.3.3's
> estimated row count was ten times the actual; 7.4's is one half of the actual.

Min() and Max() unfortunately suck on PG. It will be that way for a while;
perhaps at some point someone will make a special case and convince
-HACKERS it is a Good Thing(tm). (Like select count(*) from table being
'cached' - a lot of people probably get bad first impressions because of
that.)

Would it be possible to rewrite your queries, replacing min/max with a
'select stock_id from bigtable where blah = blorch order by stock_id
(desc|asc) limit 1'? Because that would enable PG to use an index and
magically go fast. You may need a subselect.
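
Spelled out against the table from this thread, that might look like the
following (an untested sketch; the inner distinct that enumerates
stock_ids could instead come from a separate stocks table if one
exists):

  select s.stock_id,
         (select d.price_date
            from day_ends d
           where d.stock_id = s.stock_id
           order by d.stock_id asc, d.price_date asc
           limit 1) as min_date,
         (select d.price_date
            from day_ends d
           where d.stock_id = s.stock_id
           order by d.stock_id desc, d.price_date desc
           limit 1) as max_date
    from (select distinct stock_id from day_ends) as s;
  -- ordering by both key columns helps older planners match the
  -- (stock_id, price_date) index for a forward or backward scan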


--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: [PERFORM] bad estimates

2003-08-29 Thread Stephan Szabo
On Fri, 29 Aug 2003, Ken Geis wrote:

> Ken Geis wrote:
>> I went through the code (7.4 beta2) that estimates the cost of an index
>> scan path.  What I need to be sure of is that when running a query in
>> pgsql that uses only the columns that are in an index, the underlying
>> table need not be accessed.  I know that Oracle does this.
>
> Thinking about it some more, it's obvious to me that a pgsql index scan
> must be accessing the underlying table even though all of the
> information needed is in the index itself.  A linear scan of a 660M file
> should not take 55 minutes.  I could confirm this with stats, but
> someone out there probably already knows the answer here.

Unfortunately not all the information needed is in the index.  You can't
tell from the index alone currently whether or not the row is visible to
you.  Adding said information would be possible, but there are downsides to
that as well (there are some past discussions on the topic, but I'm too
lazy to look them up to give a link; check the archives ;)).


Re: [PERFORM] bad estimates

2003-08-29 Thread Ken Geis
Bruno Wolff III wrote:
> If you want both the max and the min, then things are going to be a bit
> more work. You are either going to want to do two separate selects,
> join two selects, or use subselects. If there aren't enough prices
> per stock, the sequential scan might be fastest, since you only need to
> go through the table once and don't have to hit the index blocks.
>
> It is still odd that you didn't get a big speed up for just the min though.
I found I'm suffering from an effect detailed in a previous thread titled

	Does correlation mislead the optimizer on large tables?

Ken


Re: [PERFORM] bad estimates

2003-08-29 Thread Sean Chittenden
>> If you want both the max and the min, then things are going to be a
>> bit more work. You are either going to want to do two separate
>> selects, join two selects, or use subselects. If there aren't
>> enough prices per stock, the sequential scan might be fastest, since
>> you only need to go through the table once and don't have to hit
>> the index blocks.
>>
>> It is still odd that you didn't get a big speed up for just the min though.
>
> I found I'm suffering from an effect detailed in a previous thread titled
>
>   Does correlation mislead the optimizer on large tables?

I don't know about large tables, but this is a big problem and
something I'm going to spend some time validating later today.  I
think Manfred's patch is pretty good and certainly better than where
we are, but I haven't used it yet to see if it's the magic ticket for
many of these index problems.

-sc

-- 
Sean Chittenden
