[PERFORM] Analyzer is clueless

2004-11-17 Thread David Brown
I'm doing some performance profiling with a simple two-table query:

SELECT L.ProductID, sum(L.Amount)
FROM drinv H
JOIN drinvln L ON L.OrderNo = H.OrderNo
WHERE
(OrderDate between '2003-01-01' AND '2003-04-30')
GROUP BY L.ProductID

drinv and drinvln have about 100,000 and 3,500,000 rows respectively. Actual 
data size in the large table is 500-600MB. OrderNo is indexed in both tables, 
as is OrderDate.

The environment is PGSQL 8 on Win2k with 512MB RAM (results are similar to 7.3 
from Mammoth). I've tried tweaking various conf parameters, but apart from 
using up memory, nothing seems to have had a tangible effect - the Analyzer 
doesn't seem to take resources into account like some of the doco suggests.

The date selection represents about 5% of the range. Here's the plan summaries:

Three months (2003-01-01 to 2003-03-30) = 1 second

HashAggregate  (cost=119365.53..119368.74 rows=642 width=26)
  ->  Nested Loop  (cost=0.00..118791.66 rows=114774 width=26)
        ->  Index Scan using drinv_OrderDate on drinv h  (cost=0.00..200.27 rows=3142 width=8)
              Index Cond: ((OrderDate >= '2003-01-01'::date) AND (OrderDate <= '2003-03-30'::date))
        ->  Index Scan using drinvln_OrderNo on drinvln l  (cost=0.00..28.73 rows=721 width=34)
              Index Cond: (l.OrderNo = "outer".OrderNo)


Four months (2003-01-01 to 2003-04-30) = 60 seconds

HashAggregate  (cost=126110.53..126113.74 rows=642 width=26)
  ->  Hash Join  (cost=277.55..125344.88 rows=153130 width=26)
        Hash Cond: ("outer".OrderNo = "inner".OrderNo)
        ->  Seq Scan on drinvln l  (cost=0.00..106671.35 rows=3372935 width=34)
        ->  Hash  (cost=267.07..267.07 rows=4192 width=8)
              ->  Index Scan using drinv_OrderDate on drinv h  (cost=0.00..267.07 rows=4192 width=8)
                    Index Cond: ((OrderDate >= '2003-01-01'::date) AND (OrderDate <= '2003-04-30'::date))


Four months (2003-01-01 to 2003-04-30) with enable_seqscan disabled = 75 seconds


HashAggregate  (cost=130565.83..130569.04 rows=642 width=26)
  ->  Merge Join  (cost=519.29..129800.18 rows=153130 width=26)
        Merge Cond: ("outer".OrderNo = "inner".OrderNo)
        ->  Sort  (cost=519.29..529.77 rows=4192 width=8)
              Sort Key: h.OrderNo
              ->  Index Scan using drinv_OrderDate on drinv h  (cost=0.00..267.07 rows=4192 width=8)
                    Index Cond: ((OrderDate >= '2003-01-01'::date) AND (OrderDate <= '2003-04-30'::date))
        ->  Index Scan using drinvln_OrderNo on drinvln l  (cost=0.00..119296.29 rows=3372935 width=34)

Statistics were run on each table before query execution. The random page cost 
was lowered to 2, but as you can see, the estimated costs are wild anyway.
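For reference, the tuning described above would have been applied along these lines (a sketch: random_page_cost and enable_seqscan are the settings named in the text; re-running ANALYZE afterwards is an assumption on my part):

```sql
-- Session-level sketch of the tuning described above
SET random_page_cost = 2;   -- lowered from the default of 4
SET enable_seqscan = off;   -- used only for the 75-second plan above
ANALYZE drinv;              -- refresh planner statistics
ANALYZE drinvln;
```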

As a comparison, MS SQL Server took less than 15 seconds, or 4 times faster.

MySQL (InnoDB) took 2 seconds, which is 30 times faster.

The query looks straightforward to me (it might be clearer with a subselect), 
so what on earth is wrong?

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Josh Berkus
David,

> I'm doing some performance profiling with a simple two-table query:

Please send EXPLAIN ANALYZE for each query, and not just EXPLAIN.   Thanks!
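For anyone following along: EXPLAIN ANALYZE actually executes the query and reports real row counts and timings alongside the planner's estimates, which is what makes misestimates visible. It is simply the original query with the keywords prepended:

```sql
EXPLAIN ANALYZE
SELECT L.ProductID, sum(L.Amount)
FROM drinv H
JOIN drinvln L ON L.OrderNo = H.OrderNo
WHERE OrderDate BETWEEN '2003-01-01' AND '2003-04-30'
GROUP BY L.ProductID;
```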

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Joshua D. Drake
Hello,
Have you tried increasing the statistics target for orderdate and 
rerunning analyze?
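Concretely, that suggestion might look like the following (100 is an illustrative target, the 8.0 default being 10; the quoted column name assumes the mixed-case identifiers shown in the plans):

```sql
ALTER TABLE drinv ALTER COLUMN "OrderDate" SET STATISTICS 100;
ANALYZE drinv;   -- the new target only takes effect at the next ANALYZE
```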

Sincerely,
Joshua D. Drake
David Brown wrote:
> I'm doing some performance profiling with a simple two-table query:
> [snip]

--
Command Prompt, Inc., home of PostgreSQL Replication, and plPHP.
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL




Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Simon Riggs
On Thu, 2004-11-18 at 02:08, David Brown wrote:
> Statistics were run on each table before query execution. The random page
> cost was lowered to 2, but as you can see, the estimated costs are wild
> anyway.
>
> As a comparison, MS SQL Server took less than 15 seconds, or 4 times faster.
>
> MySQL (InnoDB) took 2 seconds, which is 30 times faster.
>
> The query looks straightforward to me (it might be clearer with a subselect),
> so what on earth is wrong?

The query is, as you say, straightforward.

You are clearly working with a query that is on the very edge of the
decision between using an index or not. 

The main issue is that PostgreSQL's default histogram statistics setting
is lower than other RDBMS. This means that it is less able to
discriminate between cases such as yours that are close to the edge.
This is a trade-off between run-time of the ANALYZE command and the
benefit it produces. As Joshua suggests, increasing the statistics
target for this table will likely allow the optimizer to correctly
determine the selectivity of the index and take the right path.

If this is a general RDBMS comparison, you may wish to raise the
system-wide default_statistics_target to 80, or at least to something
greater than the default of 10.

To improve this query, you may wish to extend the table's statistics
target using:

ALTER TABLE drinv
    ALTER COLUMN "OrderDate" SET STATISTICS 100;
ANALYZE drinv;   -- the new target takes effect at the next ANALYZE

which should allow the planner to more accurately estimate the
selectivity of the date range and thereby select an index, if appropriate.

The doco has recently been changed with regard to effective_cache_size;
you don't mention what beta release level you're using. That is the only
planner parameter that takes cache size into account, so any other
changes would certainly have zero effect on this *plan* though might
still benefit execution time.
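As a sketch, effective_cache_size can be set per session or in postgresql.conf; in 8.0 the unit is 8KB disk pages, and the value below is purely illustrative (it assumes roughly 300MB of OS cache is available on the 512MB machine):

```sql
-- Illustrative only: 37500 pages * 8KB ≈ 300MB of presumed OS cache
SET effective_cache_size = 37500;
```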

Please post EXPLAIN ANALYZE output for any further questions.

-- 
Best Regards, Simon Riggs




Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Jim C. Nasby
On Wed, Nov 17, 2004 at 10:32:48PM +, Simon Riggs wrote:
> The main issue is that PostgreSQL's default histogram statistics setting
> is lower than other RDBMS. This means that it is less able to
> discriminate between cases such as yours that are close to the edge.
> This is a trade-off between run-time of the ANALYZE command and the
> benefit it produces. As Joshua suggests, increasing the statistics
> target for this table will likely allow the optimizer to correctly
> determine the selectivity of the index and take the right path.

Is there still a good reason to have the histogram stats so low? Should
the default be changed to more like 100 at this point?

Also, how extensively does the planner use n_distinct, null_frac,
reltuples and the histogram to see what the odds are of finding a unique
value or a low number of values? I've seen cases where it seems the
planner doesn't think it'll be getting a unique value or a small set of
values even though stats indicates that it should be.
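Those numbers are visible directly in the standard pg_stats view, so it is easy to compare what the planner believes with reality; for example (the table name is taken from the thread's example):

```sql
SELECT attname, null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'drinvln';
```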

One final question... would there be interest in a process that would
dynamically update the histogram settings for tables based on how
distinct/unique each field was?
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Josh Berkus
Jim,

> Is there still a good reason to have the histogram stats so low? Should
> the default be changed to more like 100 at this point?

Low overhead.  This is actually a TODO for me for 8.1.   I need to find some 
test cases to set a differential level of histogram access for indexed 
fields, so like 10 for most fields but 100/150/200 for indexed fields.

However, I got stalled on finding test cases and then ran out of time.

> Also, how extensively does the planner use n_distinct, null_frac,
> reltuples and the histogram to see what the odds are of finding a unique
> value or a low number of values? I've seen cases where it seems the
> planner doesn't think it'll be getting a unique value or a small set of
> values even though stats indicates that it should be.
>
> One final question... would there be interest in a process that would
> dynamically update the histogram settings for tables based on how
> distinct/unique each field was?

Well, the process by which the analyzer decides that a field is unique could 
probably use some troubleshooting.   And we always, always could use 
suggestions/tests/help with the query planner.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Analyzer is clueless

2004-11-17 Thread Tom Lane
> I've seen cases where it seems the planner doesn't think it'll be
> getting a unique value or a small set of values even though stats
> indicates that it should be.

A test case exhibiting the problem would be helpful.

regards, tom lane
