[PERFORM] Trouble managing planner for timestamptz columns

2006-03-10 Thread Marc Morin
 
We have large tables that hold statistics based on time.  They are of
the form:

CREATE TABLE stats (
    id      serial PRIMARY KEY,
    logtime timestamptz,    -- measurement time
    d1      int,            -- dimension key
    s1      bigint          -- fact value
);

CREATE INDEX idx on stats(logtime);

Some of these tables have new data inserted at a rate of 500,000+ rows /
hour.  The entire table will grow to tens or hundreds of millions of
rows.  (Yes, we are also partitioning these; it's the size of an
individual partition that we're talking about.)
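
For concreteness, each partition is an ordinary inheritance child,
roughly along these lines (names and date boundaries invented for this
example):

CREATE TABLE stats_20060310 (
    CHECK (logtime >= '2006-03-10' AND logtime < '2006-03-11')
) INHERITS (stats);

CREATE INDEX stats_20060310_logtime ON stats_20060310 (logtime);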

We tend to analyze these tables every day or so, and this doesn't always
prove to be sufficient.

Our application is a reporting application, and the end users typically
like to query the newest data the most.  As such, the queries are of
the form:


select *
from stats
inner join dimension_d1 using (d1)
where logtime between X and Y
  and dimension_d1.something = value;

This usually results in a hash join (a good thing) where the dimension
table is loaded into the hash table and stats is index scanned using
the idx index.
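
For illustration, the plan we want looks roughly like this (a sketch,
not real EXPLAIN output; the constants are made up):

explain select *
from stats
inner join dimension_d1 using (d1)
where logtime between '2006-03-10 00:00' and '2006-03-10 12:00'
  and dimension_d1.something = 'value';

 Hash Join
   Hash Cond: (stats.d1 = dimension_d1.d1)
   ->  Index Scan using idx on stats
         Index Cond: ((logtime >= ...) AND (logtime <= ...))
   ->  Hash
         ->  Seq Scan on dimension_d1
               Filter: (something = 'value')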

The trouble starts when both X and Y are times after the last analyze.
This restriction clause is outside the range of values in the histogram
created by the last analyze.  Postgres's estimate of the number of rows
returned here is usually very low and incorrect, as you'd expect...
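
You can see the stale upper boundary directly; e.g. (a quick check
against the pg_stats view):

select histogram_bounds
from pg_stats
where tablename = 'stats' and attname = 'logtime';

-- The last element of the array is the newest logtime captured by the
-- last analyze; a range entirely above it falls outside the histogram.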

Trouble occurs when the planner flips its decision and hash joins by
loading the results of the index scan on idx into the hash table
instead of the dimension table.
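
That is, the plan degrades to the opposite shape (again a sketch):

 Hash Join
   Hash Cond: (dimension_d1.d1 = stats.d1)
   ->  Seq Scan on dimension_d1
         Filter: (something = 'value')
   ->  Hash
         ->  Index Scan using idx on stats
               Index Cond: ((logtime >= ...) AND (logtime <= ...))

The index scan the planner thinks returns a handful of rows actually
returns a huge number, so the hash table is built from the wrong side.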

Since the table is so large and the system is busy (disk not idle at
all), doing an analyze on this table in the production system can take
half an hour!  (statistics target set to 100.)  We can't afford to
analyze more often...

It certainly would be nice if postgres could understand somehow that
some columns are dynamic and that its histogram could be stretched to
the maximal values, or use some other technique for estimating rows to
the right of the range of values in the histogram...

Or have some concept of error bars on its planner decisions.

Suggestions? Comments?


Marc



Re: [PERFORM] Trouble managing planner for timestamptz columns

2006-03-10 Thread Tom Lane
Marc Morin [EMAIL PROTECTED] writes:
> We tend to analyze these tables every day or so, and this doesn't always
> prove to be sufficient.

Seems to me you just stated your problem.  Instead of having the planner
make wild extrapolations, why not set up a cron job to analyze these
tables more often?  Or use autovacuum which will do it for you.
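
For example (a hypothetical crontab entry; substitute your database and
table names and pick an interval that suits your insert rate):

# re-analyze the hot table every 15 minutes
*/15 * * * * psql yourdb -c 'ANALYZE stats'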

> Since the table is so large and the system is busy (disk not idle at
> all), doing an analyze on this table in the production system can take
> half an hour!  (statistics target set to 100.)

I'd believe that for vacuum analyze, but analyze alone should be cheap.
Have you perhaps got some weird datatypes in the table?  Maybe you
should back off the stats target a bit?

We do support analyzing selected columns, so you might try something
like a cron job analyzing only the timestamp column, with a suitably low
stats target for that column.  This would yield numbers far more
reliable than any extrapolation the planner could do.
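
That is, something like this (a sketch; the target of 10 is just an
example):

-- lower the per-column statistics target so the histogram stays cheap
ALTER TABLE stats ALTER COLUMN logtime SET STATISTICS 10;

-- analyze only the timestamp column
ANALYZE stats (logtime);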

regards, tom lane



Re: [PERFORM] Trouble managing planner for timestamptz columns

2006-03-10 Thread Marc Morin
Well, this analyze just took 12 minutes, with a stats target of 100:

# time psql xxx xxx -c 'analyze elem_trafficstats_1'
ANALYZE

real    12m1.070s
user    0m0.001s
sys     0m0.015s


A large table, but by far not the largest...  We have about a dozen
tables like this, so analyzing them all will take 3-4 hours...  No
weird datatypes, just bigints for facts, timestamptz and ints for
dimensions.

My problem is not the analyze itself; it's the fact that our db is
really busy doing other stuff, and the analyze I/O is competing with
it... I am random-I/O bound like crazy.

If I set the stats target to 10, I get

# time psql xxx -c 'set session default_statistics_target to 10; analyze elem_trafficstats_1'
ANALYZE

real    2m15.733s
user    0m0.009s
sys     0m2.255s

Better, but I'm not sure what side effects this would have.

