Re: [PERFORM] Odd number of rows expected

2005-01-23 Thread Jim C. Nasby
On Sat, Jan 22, 2005 at 10:18:00PM -0500, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  (SELECT b.bucket_id AS rrs_bucket_id, s.*
  FROM rrs.bucket b
  JOIN page_log.log s
  ON (
  b.prev_end_time   log_time
  AND b.end_time = log_time )
  WHERE b.rrs_id = '1'
  AND b.end_time = '2005-01-21 20:23:00+00'
  AND b.end_time  '1970-01-01 00:00:00+00'
  ) a
 
  Basically, it seems that it doesn't understand that each row in log will
  match up with at most one row in bucket. There is a unique index on
  bucket(rrs_id, end_time), so it should be able to tell this.
 
 Why should it be able to tell that?

Indexes:
rrs_bucket__rrs_id__end_time unique, btree (rrs_id, end_time)

Err, crap, I guess that wouldn't work, because of prev_end_time not
being in there...

In english, each bucket defines a specific time period, and no two
buckets can over-lap (though there's no constraints defined to actually
prevent that). So reality is that each row in page_log.log will in fact
only match one row in bucket (at least for each value of rrs_id).

Given that, would the optimizer make a better choice if it knew that
(since it means a much smaller result set). Is there any way to tell the
optimizer this is the case?

Maybe what I ultimately need is a timestamp with interval datatype, that
specifies an interval that's fixed in time.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Odd number of rows expected

2005-01-23 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 In english, each bucket defines a specific time period, and no two
 buckets can over-lap (though there's no constraints defined to actually
 prevent that). So reality is that each row in page_log.log will in fact
 only match one row in bucket (at least for each value of rrs_id).

 Given that, would the optimizer make a better choice if it knew that
 (since it means a much smaller result set).

Given that the join condition is not an equality, there's no hope of
using hash or merge join; so the join itself is about as good as you're
gonna get.  With a more accurate rows estimate for the join result, it
might have decided to use HashAggregate instead of Sort/GroupAggregate,
but AFAICS that would not have made a huge difference ... at best maybe
25% of the total query time.

 Is there any way to tell the
 optimizer this is the case?

Nope.  This gets back to the old problem of not having any cross-column
(cross-table in this case) statistics.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Odd number of rows expected

2005-01-22 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 (SELECT b.bucket_id AS rrs_bucket_id, s.*
 FROM rrs.bucket b
 JOIN page_log.log s
 ON (
 b.prev_end_time   log_time
 AND b.end_time = log_time )
 WHERE b.rrs_id = '1'
 AND b.end_time = '2005-01-21 20:23:00+00'
 AND b.end_time  '1970-01-01 00:00:00+00'
 ) a

 Basically, it seems that it doesn't understand that each row in log will
 match up with at most one row in bucket. There is a unique index on
 bucket(rrs_id, end_time), so it should be able to tell this.

Why should it be able to tell that?

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


[PERFORM] Odd number of rows expected

2005-01-21 Thread Jim C. Nasby
I have a query that thinks it's going to generate a huge number of rows,
when in fact it won't:

INSERT INTO page_log.rrs
( bucket_id, page_id,project_id,other, 
hits,min_hits,max_hits,total_duration,min_duration,max_duration )
SELECT a.rrs_bucket_id, page_id,project_id,other
, 
count(*),count(*),count(*),sum(duration),min(duration),max(duration)
FROM
(SELECT b.bucket_id AS rrs_bucket_id, s.*
FROM rrs.bucket b
JOIN page_log.log s
ON (
b.prev_end_time   log_time
AND b.end_time = log_time )
WHERE b.rrs_id = '1'
AND b.end_time = '2005-01-21 20:23:00+00'
AND b.end_time  '1970-01-01 00:00:00+00'
) a
GROUP BY rrs_bucket_id, page_id,project_id,other;


QUERY PLAN  
   
---
 Subquery Scan *SELECT*  (cost=170461360504.98..183419912556.69 rows=91175544 
width=77)
   -  GroupAggregate  (cost=170461360504.98..183418316984.67 rows=91175544 
width=29)
 -  Sort  (cost=170461360504.98..171639141309.21 rows=471112321692 
width=29)
   Sort Key: b.bucket_id, s.page_id, s.project_id, s.other
   -  Nested Loop  (cost=0.00..17287707964.10 rows=471112321692 
width=29)
 -  Seq Scan on bucket b  (cost=0.00..9275.84 rows=281406 
width=20)
   Filter: ((rrs_id = 1) AND (end_time = '2005-01-21 
20:23:00+00'::timestamp with time zone) AND (end_time  '1970-01-01 
00:00:00+00'::timestamp with time zone))
 -  Index Scan using log__log_time on log s  
(cost=0.00..36321.24 rows=1674137 width=33)
   Index Cond: ((outer.prev_end_time  s.log_time) 
AND (outer.end_time = s.log_time))

The final rowcount after the aggregate will actually be about 1.9M
rows:

QUERY PLAN  
   
---
 Subquery Scan *SELECT*  (cost=170461360504.98..183419912556.69 rows=91175544 
width=77) (actual time=156777.374..234613.843 rows=1945123 loops=1)
   -  GroupAggregate  (cost=170461360504.98..183418316984.67 rows=91175544 
width=29) (actual time=156777.345..214246.751 rows=1945123 loops=1)
 -  Sort  (cost=170461360504.98..171639141309.21 rows=471112321692 
width=29) (actual time=156777.296..177517.663 rows=4915567 loops=1)
   Sort Key: b.bucket_id, s.page_id, s.project_id, s.other
   -  Nested Loop  (cost=0.00..17287707964.10 rows=471112321692 
width=29) (actual time=0.662..90702.755 rows=4915567 loops=1)
 -  Seq Scan on bucket b  (cost=0.00..9275.84 rows=281406 
width=20) (actual time=0.063..1591.591 rows=265122 loops=1)
   Filter: ((rrs_id = 1) AND (end_time = '2005-01-21 
20:23:00+00'::timestamp with time zone) AND (end_time  '1970-01-01 
00:00:00+00'::timestamp with time zone))
 -  Index Scan using log__log_time on log s  
(cost=0.00..36321.24 rows=1674137 width=33) (actual time=0.014..0.174 rows=19 
loops=265122)
   Index Cond: ((outer.prev_end_time  s.log_time) 
AND (outer.end_time = s.log_time))
 Total runtime: 299623.954 ms

Everything is analyzed, and the statistics target is set to 1000.
Basically, it seems that it doesn't understand that each row in log will
match up with at most one row in bucket. There is a unique index on
bucket(rrs_id, end_time), so it should be able to tell this.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]