[PERFORM] IN() Optimization issue in 8.0rc5

2005-01-15 Thread Josh Berkus
Tom,

Hmmm ... I'm seeing an issue with IN() optimization -- or rather the lack of 
it -- in 8.0rc5.It seems to me that this worked better in 7.4, although 
I've not been able to load this particular database and test

dm=# explain
dm-# SELECT personid FROM mr.person_attributes_old
dm-#WHERE personid NOT IN (SELECT 
personid FROM mr.person_attributes);
QUERY PLAN
---
 Seq Scan on person_attributes_old  (cost=0.00..3226144059.85 rows=235732 
width=4)
   Filter: (NOT (subplan))
   SubPlan
 -  Seq Scan on person_attributes  (cost=0.00..12671.07 rows=405807 
width=4)
(4 rows)

dm=# explain select pao.personid from mr.person_attributes_old pao
dm-# left outer join mr.person_attributes p on pao.personid = p.personid
dm-# where p.personid is null;
  QUERY PLAN
---
 Merge Left Join  (cost=0.00..34281.83 rows=471464 width=4)
   Merge Cond: (outer.personid = inner.personid)
   Filter: (inner.personid IS NULL)
   -  Index Scan using idx_opa_person on person_attributes_old pao  
(cost=0.00..13789.29 rows=471464 width=4)
   -  Index Scan using idx_pa_person on person_attributes p  
(cost=0.00..14968.25 rows=405807 width=4)
(5 rows)

It seems like the planner ought to recognize that the first form of the query 
is optimizable into the 2nd form, and that I've seen it do so in 7.4.   
However, *no* amount of manipulation of query parameters I did on the 1st 
form of the query were successful in getting the planner to recognize that it 
could use indexes for the IN() form of the query.

Thoughts?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] IN() Optimization issue in 8.0rc5

2005-01-15 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 dm=# explain
 dm-# SELECT personid FROM mr.person_attributes_old
 dm-#WHERE personid NOT IN (SELECT 
 personid FROM mr.person_attributes);
 QUERY PLAN
 ---
  Seq Scan on person_attributes_old  (cost=0.00..3226144059.85 rows=235732 
 width=4)
Filter: (NOT (subplan))
SubPlan
  -  Seq Scan on person_attributes  (cost=0.00..12671.07 rows=405807 
 width=4)
 (4 rows)

Hmm.  What you want for a NOT IN is for it to say
   Filter: (NOT (hashed subplan))
which you are not getting.  What's the datatypes of the two personid
columns?  Is the 400k-row estimate for person_attributes reasonable?
Maybe you need to increase work_mem (nee sort_mem) to allow a
400k-row hash table?

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] IN() Optimization issue in 8.0rc5

2005-01-15 Thread Josh Berkus
Tom,

 Hmm.  What you want for a NOT IN is for it to say
Filter: (NOT (hashed subplan))
 which you are not getting.  What's the datatypes of the two personid
 columns? 

INT

 Is the 400k-row estimate for person_attributes reasonable? 

Yes, the estimates are completely accurate.

 Maybe you need to increase work_mem (nee sort_mem) to allow a
 400k-row hash table?

Aha, that's it.  I thought I'd already set that, but apparently it was a 
different session.Fixed.  Thanks!

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings