On Feb 20, 2012, at 12:46 PM, Jay A. Kreibich wrote:

> On Mon, Feb 20, 2012 at 05:01:17PM +0000, Simon Slavin scratched on the wall:
>> 
>> On 20 Feb 2012, at 4:11pm, Puneet Kishor <punk.k...@gmail.com> wrote:
>> 
>>> is there a way to force the query to return only the "first" match
>>> so the resulting query has only as many rows as the left table?
>> 
>> You have two problems:
>> 
>> A) You are JOINing two columns which contain some null values.  null is
>> interpreted as 'I don't know', so the joins tend to match many
>> different rows.
> 
>  No, that's not how it works.  If you have a join condition like
>  (A.col == B.col), and either A.col or B.col (or both) is NULL, then
>  the result of the comparison itself is NULL ("Are these values
>  equal?"  "I don't know.").  A JOIN operation will only match rows
>  where the join condition is TRUE.  NULL is not "good enough" to
>  match rows.
> 
>  Think about NATURAL JOINs where the two columns are reduced to only
>  one column.  This reduction can only take place if the values are
>  exactly the same.
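A quick way to see this behavior for yourself, using Python's sqlite3 module (the two single-column tables here are made up purely for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (col);
    CREATE TABLE b (col);
    INSERT INTO a VALUES (1), (NULL);
    INSERT INTO b VALUES (1), (NULL);
""")
# Only the 1 = 1 pair matches.  NULL = NULL evaluates to NULL, not TRUE,
# so the NULL rows never join to anything.
rows = con.execute(
    "SELECT a.col, b.col FROM a JOIN b ON a.col = b.col"
).fetchall()
print(rows)  # [(1, 1)]
```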
> 
>> B) You are using a LEFT JOIN which is explicitly defined to include
>> many matches, not just one match.  Instead you might try NATURAL JOIN
>> or CROSS JOIN to see what difference they make.
> 
>  That's not the issue.  Or rather, it is, but that isn't going to show
>  us anything we don't already know.  NATURAL JOIN should return the
>  exact same results as example #5: 329743  (assuming "col" is the only
>  column name shared between tables), except it will merge the two
>  "col" columns into a single column.
> 
>  A CROSS JOIN has no join condition and is going to return
>  2,066,471,848 rows.  That tells us nothing.
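The arithmetic is easy to confirm on a toy pair of tables: a CROSS JOIN always returns |a| x |b| rows, since every row of one table is paired with every row of the other (again, the table contents below are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (col);
    CREATE TABLE b (col);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES ('x'), ('y');
""")
# 3 rows in a, 2 rows in b => 3 * 2 = 6 rows in the cross product.
n = con.execute("SELECT COUNT(*) FROM a CROSS JOIN b").fetchone()[0]
print(n)  # 6
```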
> 
>> And you have a meta-problem too: the dataset you're experimenting with
>> is so big you can't see why it's doing what it's doing.
> 
>  Databases are designed to work with sets.  If you want to debug the
>  issue, you need to think in sets as well.  Use the database to answer
>  your questions, rather than just scanning the results and doing it in
>  your head.
> 
>  The statement...
> 
>    SELECT COUNT(*) as ct, col FROM a GROUP BY 2 HAVING ct > 1
> 
>  ...will tell you which values of "col" appear in more than one row.
>  You can run this on either table.
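Here is that duplicate-finding query run against a small throwaway table via Python's sqlite3 (the data is invented; a real table would have many more rows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE b (col)")
# 'x' appears three times, 'y' once.
con.executemany("INSERT INTO b VALUES (?)", [("x",), ("y",), ("x",), ("x",)])
# GROUP BY 2 groups on the second selected column ("col"); HAVING keeps
# only the groups with more than one row, i.e. the duplicated values.
dupes = con.execute(
    "SELECT COUNT(*) AS ct, col FROM b GROUP BY 2 HAVING ct > 1"
).fetchall()
print(dupes)  # [(3, 'x')]
```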
> 
>  If you want to have unique values in the JOIN, the best approach is
>  to put a unique index across each "col" column, so that the columns
>  have unique values.
> 
>  Failing that, replace the table reference in example #6 with
>  something like "(SELECT * FROM a GROUP BY col) as a".  That's going
>  to give you random rows (there is no concept of a "first" row), but
>  it will get you the row count you're looking for.
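A minimal sketch of that workaround, with invented data: table "b" has two rows for col=1, so a plain LEFT JOIN would return three rows, but collapsing "b" to one arbitrary row per "col" value restores one output row per row of "a":

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (col, name);
    CREATE TABLE b (col, attr);
    INSERT INTO a VALUES (1, 'f1'), (2, 'f2');
    INSERT INTO b VALUES (1, 'dupe-1'), (1, 'dupe-2'), (2, 'unique');
""")
# The GROUP BY subquery keeps one arbitrary row of b per "col" value,
# so the LEFT JOIN matches each row of a at most once.
rows = con.execute("""
    SELECT a.col, a.name, b.attr
    FROM a LEFT JOIN (SELECT * FROM b GROUP BY col) AS b
      ON a.col = b.col
""").fetchall()
print(len(rows))  # 2
```

Which of the two col=1 rows of "b" survives is unspecified, which is exactly the "random rows" caveat above.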


Thanks Jay... brilliant explanation. I ended up finding the dupes in table "b" 
and re-encoding the column so it had only unique values. You are right, "first" 
doesn't mean much in this case... I was referring to "the first matching row, 
and to hell with the subsequent matches." Fortunately, the problem was not too 
laborious to correct manually.


> 
>  You might need to do it with both tables, depending on which table(s)
>  has the duplicate "col" values.

Yup, fortunately I know that; it was table "b". Table "a" has geographic 
features, and I want neither to "create new" nor to "lose existing" features. 
Table "b" has attributes that categorize the features in table "a". Cleaning 
up table "b" cleared the problem.



--
Puneet Kishor

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
