On Mon, Feb 20, 2012 at 05:01:17PM +0000, Simon Slavin scratched on the wall:
> 
> On 20 Feb 2012, at 4:11pm, Puneet Kishor <punk.k...@gmail.com> wrote:
> 
> > is there are way to force the query to return only the "first" match
> > so the resulting query has only as many rows as the left table?
> 
> You have two problems:
> 
> A) You are JOINing two columns which contain some null values.  null is
> interpreted as 'I don't know', so the joins tend to match many
> different rows.

  No, that's not how it works.  If you have a join condition like
  (A.col == B.col), and either A.col or B.col (or both) is NULL, then
  the result of the comparison itself is NULL ("Are these values
  equal?"  "I don't know.").  JOIN operation will only match rows where
  the join condition is TRUE.  NULL is not "good enough" to match rows.

  Think about NATURAL JOINs where the two columns are reduced to only
  one column.  This reduction can only take place if the values are
  exactly the same.

> B) You are using a LEFT JOIN which is explicitly defined to include
> many matches, not just one match.  Instead you might try NATURAL JOIN
> or CROSS JOIN to see what difference they make.

  That's not the issue.  Or rather, it is, but that isn't going to show
  us anything we don't already know.  NATURAL JOIN should return the
  exact same results as example #5: 329743  (assuming "col" is the only
  column name shared between tables), except it will merge the two
  "col" columns into a single column.

  A CROSS JOIN has no join condition and is going to return
  2,066,471,848 rows.  That tells us nothing.

> And you have a meta-problem too: the dataset you're experimenting with
> is so big you can't see why it's doing what it's doing.

  Databases are designed to work with sets.  If you want to debug the
  issue, you need to think in sets as well.  Use the database to answer
  your questions, rather than just scanning the results and doing it in
  your head.

  The statement...

    SELECT COUNT(*) as ct, col FROM a GROUP BY 2 HAVING ct > 1

  ...will tell you which rows have duplicate columns.  You can run this
  on either table.

  If you want to have unique values in the JOIN, the best approach is
  to put a unique index across each "col" column, so that the columns
  have unique values.

  Failing that, replace the table reference in example #6 with
  something like "(SELECT * FROM a GROUP BY col) as a".  That's going
  to give you random rows (there is no concept of a "first" row), but
  it will get you the row count you're looking for.

  You might need to do it with both tables, depending on which table(s)
  has the duplicate "col" values.


   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to