Re: [HACKERS] SQL 'in' vs join.

2000-12-10 Thread Bruce Momjian

 Don Baccus [EMAIL PROTECTED] writes:
  The optimizer should do a better job on your first query, sure, but why
  don't you like writing joins?
 
 The join wouldn't give quite the same answers.  If there are multiple
 rows in table2 matching a particular table1 row, then a join would give
 multiple copies of the table1 row, whereas the WHERE foo IN (sub-select)
 way would give only one copy.  SELECT DISTINCT can't be used to fix
 this, because that would eliminate legitimate duplicates from identical
 table1 rows.
 
 Now that the executor understands about multiple join rules (for
 OUTER JOIN support), I've been thinking about inventing a new join rule
 that says "at most one output row per left-hand row" --- this'd be sort
 of the opposite of the LEFT OUTER JOIN rule, "at least one output row
 per left-hand row" --- and then transforming IN (sub-select) clauses 
 that appear at the top level of WHERE into this kind of join.  Won't
 happen for 7.1, though.

Of course, we will have the query tree redesign for 7.2, right, make
that unnecessary.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



RE: [HACKERS] SQL 'in' vs join.

2000-12-04 Thread Andrew Snow


 Now, given the two components, each with very low costs, it chooses to
 do a sequential scan on the table. I don't get it. 


Read the FAQ?

http://www.postgresql.org/docs/faq-english.html#4.23
"4.23) Why are my subqueries using IN so slow?")


- Andrew





Re: [HACKERS] SQL 'in' vs join.

2000-11-30 Thread mlw

Hannu Krosing wrote:
 
 mlw wrote:
 
  Why is a "select * from table1 where field in (select field from table2
  where condition )"
 
  is so dramatically bad compared to:
 
  "select * from table1, table2 where table1.field = table2.field and
  condition"
 
  I can't understand why the first query isn't optimized better than the
  second one. The 'in' query forces a full table scan (it shouldn't) and
  the second one uses the indexes. Does anyone know why?
 
 Its not done yet, and probably hsomewhat difficult to do in a general
 fashion
 
  I know I am no SQL guru, but my gut tells me that the 'in' operator
  should be far more efficient than a join.
 
  Here are the actual queries:
 
  cdinfo=# explain select trackid from zsong where muzenbr in (select
  muzenbr from ztitles where title = 'Mulan') ;
 
 try
 
 explain
  select trackid
from zsong
   where muzenbr in (
  select muzenbr
from ztitles
   where title = 'Mulan'
 and ztitles.muzenbr=zsong.muzenbr
   );
 
 this should hint the current optimizer to do the right thing;
 
 -
 Hannu

Nope:

cdinfo=# explain
cdinfo-#  select trackid
cdinfo-#from zsong
cdinfo-#   where muzenbr in (
cdinfo(#  select muzenbr
cdinfo(#from ztitles
cdinfo(#   where title = 'Mulan'
cdinfo(# and ztitles.muzenbr=zsong.muzenbr
cdinfo(#   );
NOTICE:  QUERY PLAN:
 
Seq Scan on zsong  (cost=1.00..104474515.18 rows=2193213
width=4)
  SubPlan
-  Index Scan using ztitles_pkey on ztitles  (cost=0.00..4.05
rows=1 width=4)  


But what I also find odd is, look at the components:

cdinfo=# explain select muzenbr from ztitles where title = 'Mulan' ;
NOTICE:  QUERY PLAN:
 
Index Scan using ztitles_title_ndx on ztitles  (cost=0.00..7.08 rows=1
width=4)  

cdinfo=# explain select trackid from zsong where muzenbr in ( 1,2,3,4,5)
;
NOTICE:  QUERY PLAN:
 
Index Scan using zsong_muzenbr_ndx, zsong_muzenbr_ndx,
zsong_muzenbr_ndx, zsong_muzenbr_ndx, zsong_muzenbr_ndx on zsong 
(cost=0.00..392.66 rows=102 width=4) 


Now, given the two components, each with very low costs, it chooses to
do a sequential scan on the table. I don't get it. I have have been
having no end of problems with Postgres' optimizer. It just seems to be
brain dead at times. It is a huge point of frustration to me. I am tied
to postgres in my current project, and I fear that I will not be able to
implement certain features because of this sort of behavior.


-- 
http://www.mohawksoft.com



Re: [HACKERS] SQL 'in' vs join.

2000-11-30 Thread Hannu Krosing

mlw wrote:
 
 Hannu Krosing wrote:
 
  mlw wrote:
  
   Why is a "select * from table1 where field in (select field from table2
   where condition )"
  
   is so dramatically bad compared to:
  
   "select * from table1, table2 where table1.field = table2.field and
   condition"
  
   I can't understand why the first query isn't optimized better than the
   second one. The 'in' query forces a full table scan (it shouldn't) and
   the second one uses the indexes. Does anyone know why?
 
  Its not done yet, and probably hsomewhat difficult to do in a general
  fashion
 
   I know I am no SQL guru, but my gut tells me that the 'in' operator
   should be far more efficient than a join.
  
   Here are the actual queries:
  
   cdinfo=# explain select trackid from zsong where muzenbr in (select
   muzenbr from ztitles where title = 'Mulan') ;
 
  try
 
  explain
   select trackid
 from zsong
where muzenbr in (
   select muzenbr
 from ztitles
where title = 'Mulan'
  and ztitles.muzenbr=zsong.muzenbr
);
 
  this should hint the current optimizer to do the right thing;
 
  -
  Hannu

did you have indexes on both ztitles.muzenbr and zsong.muzenbr ?

--
Hannu



Re: [HACKERS] SQL 'in' vs join.

2000-11-30 Thread Don Baccus

At 08:37 AM 11/30/00 -0500, mlw wrote:
 mlw wrote:
 
  Why is a "select * from table1 where field in (select field from table2
  where condition )"
 
  is so dramatically bad compared to:
 
  "select * from table1, table2 where table1.field = table2.field and
  condition"

Now, given the two components, each with very low costs, it chooses to
do a sequential scan on the table. I don't get it. I have have been
having no end of problems with Postgres' optimizer. It just seems to be
brain dead at times. It is a huge point of frustration to me. I am tied
to postgres in my current project, and I fear that I will not be able to
implement certain features because of this sort of behavior.

But but but ...

Not only is the join faster, but it is more readable and cleaner SQL as
well.  I would never write the query in its first form.  I'd change the
second one slightly to "select table1.* from ...", though, since those
are apparently the only fields you want.

The optimizer should do a better job on your first query, sure, but why
don't you like writing joins?




- Don Baccus, Portland OR [EMAIL PROTECTED]
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.



Re: [HACKERS] SQL 'in' vs join.

2000-11-30 Thread Tom Lane

Don Baccus [EMAIL PROTECTED] writes:
 The optimizer should do a better job on your first query, sure, but why
 don't you like writing joins?

The join wouldn't give quite the same answers.  If there are multiple
rows in table2 matching a particular table1 row, then a join would give
multiple copies of the table1 row, whereas the WHERE foo IN (sub-select)
way would give only one copy.  SELECT DISTINCT can't be used to fix
this, because that would eliminate legitimate duplicates from identical
table1 rows.

Now that the executor understands about multiple join rules (for
OUTER JOIN support), I've been thinking about inventing a new join rule
that says "at most one output row per left-hand row" --- this'd be sort
of the opposite of the LEFT OUTER JOIN rule, "at least one output row
per left-hand row" --- and then transforming IN (sub-select) clauses 
that appear at the top level of WHERE into this kind of join.  Won't
happen for 7.1, though.

regards, tom lane



Re: [HACKERS] SQL 'in' vs join.

2000-11-30 Thread Don Baccus

At 10:52 AM 11/30/00 -0500, Tom Lane wrote:
Don Baccus [EMAIL PROTECTED] writes:
 The optimizer should do a better job on your first query, sure, but why
 don't you like writing joins?

The join wouldn't give quite the same answers.  If there are multiple
rows in table2 matching a particular table1 row, then a join would give
multiple copies of the table1 row, whereas the WHERE foo IN (sub-select)
way would give only one copy.  SELECT DISTINCT can't be used to fix
this, because that would eliminate legitimate duplicates from identical
table1 rows.

Hmmm...I was presuming that "field" was a primary key of table1, so
such duplicates wouldn't exist (and SELECT DISTINCT would weed out
duplicates from table2 if "field" isn't a primary key of table2, i.e.
if table2 has a many-to-one relationship to table1).  For many-to-many
relationships yes, you're right, the "in" version returns a different
result.

Now that the executor understands about multiple join rules (for
OUTER JOIN support), I've been thinking about inventing a new join rule
that says "at most one output row per left-hand row" --- this'd be sort
of the opposite of the LEFT OUTER JOIN rule, "at least one output row
per left-hand row" --- and then transforming IN (sub-select) clauses 
that appear at the top level of WHERE into this kind of join.  Won't
happen for 7.1, though.

Same trick could be used for some classes of queries which do a SELECT DISTINCT
on the results of a join, too ...



- Don Baccus, Portland OR [EMAIL PROTECTED]
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.



[HACKERS] SQL 'in' vs join.

2000-11-29 Thread mlw

Why is a "select * from table1 where field in (select field from table2
where condition )"

is so dramatically bad compared to:

"select * from table1, table2 where table1.field = table2.field and
condition"

I can't understand why the first query isn't optimized better than the
second one. The 'in' query forces a full table scan (it shouldn't) and
the second one uses the indexes. Does anyone know why?

I know I am no SQL guru, but my gut tells me that the 'in' operator
should be far more efficient than a join. 

Here are the actual queries:

cdinfo=# explain select trackid from zsong where muzenbr in (select
muzenbr from ztitles where title = 'Mulan') ;
NOTICE:  QUERY PLAN:
 
Seq Scan on zsong  (cost=1.00..219321449380756.66 rows=2193213
width=4)
  SubPlan
-  Materialize  (cost=10022.50..10022.50 rows=10 width=4)
  -  Seq Scan on ztitles  (cost=1.00..10022.50
rows=10 width=4) 

cdinfo=# explain select trackid from zsong, ztitles where
ztitles.muzenbr = zsong.muzenbr and title = 'Mulan' ;
NOTICE:  QUERY PLAN:
 
Merge Join  (cost=0.00..183664.10 rows=219321 width=12)
  -  Index Scan using zsong_muznbr on zsong  (cost=0.00..156187.31
rows=2193213 width=8)
  -  Index Scan using ztitles_pkey on ztitles  (cost=0.00..61.50
rows=10 width=4)  

cdinfo=# \d zsong
   Table "zsong"
 Attribute |   Type| Modifier
---+---+---
 muzenbr   | integer   |
 disc  | integer   |
 trk   | integer   |
 song  | character varying |
 trackid   | integer   | not null default
nextval('trackid'::text)
 artistid  | integer   |
 acd   | character varying |
Indices: zsong_muznbr,
 zsong_pkey

cdinfo=# \d ztitles
  Table "ztitles"
 Attribute  |   Type| Modifier
+---+--
 muzenbr| integer   | not null
 artistid   | integer   |
 cat2   | character varying |
 cat3   | character varying |
 cat4   | character varying |
 performer  | character varying |
 performer2 | character varying |
 title  | character varying |
 artist1| character varying |
 engineer   | character varying |
 producer   | character varying |
 labelname  | character varying |
 catalog| character varying |
 distribut  | character varying |
 released   | character varying |
 origrel| character varying |
 nbrdiscs   | character varying |
 spar   | character varying |
 minutes| character varying |
 seconds| character varying |
 monostereo | character varying |
 studiolive | character varying |
 available  | character(1)  |
 previews   | character varying |
 pnotes | character varying |
 acd| character varying |
Index: ztitles_pkey 

-- 
http://www.mohawksoft.com