On Feb 20, 2012, at 12:46 PM, Jay A. Kreibich wrote:

> On Mon, Feb 20, 2012 at 05:01:17PM +0000, Simon Slavin scratched on the wall:
>>
>> On 20 Feb 2012, at 4:11pm, Puneet Kishor <punk.k...@gmail.com> wrote:
>>
>>> is there a way to force the query to return only the "first" match
>>> so the resulting query has only as many rows as the left table?
>>
>> You have two problems:
>>
>> A) You are JOINing two columns which contain some null values. null is
>> interpreted as 'I don't know', so the joins tend to match many
>> different rows.
>
> No, that's not how it works. If you have a join condition like
> (A.col == B.col), and either A.col or B.col (or both) is NULL, then
> the result of the comparison itself is NULL ("Are these values
> equal?" "I don't know."). The JOIN operation will only match rows where
> the join condition is TRUE. NULL is not "good enough" to match rows.
>
> Think about NATURAL JOINs, where the two columns are reduced to only
> one column. This reduction can only take place if the values are
> exactly the same.
>
>> B) You are using a LEFT JOIN which is explicitly defined to include
>> many matches, not just one match. Instead you might try NATURAL JOIN
>> or CROSS JOIN to see what difference they make.
>
> That's not the issue. Or rather, it is, but that isn't going to show
> us anything we don't already know. A NATURAL JOIN should return the
> exact same result as example #5: 329743 rows (assuming "col" is the
> only column name shared between the tables), except it will merge the
> two "col" columns into a single column.
>
> A CROSS JOIN has no join condition and is going to return
> 2,066,471,848 rows. That tells us nothing.
>
>> And you have a meta-problem too: the dataset you're experimenting with
>> is so big you can't see why it's doing what it's doing.
>
> Databases are designed to work with sets. If you want to debug the
> issue, you need to think in sets as well.
> Use the database to answer your questions, rather than just scanning
> the results and doing it in your head.
>
> The statement...
>
>     SELECT COUNT(*) AS ct, col FROM a GROUP BY 2 HAVING ct > 1
>
> ...will tell you which values of "col" are duplicated. You can run
> this on either table.
>
> If you want to have unique values in the JOIN, the best approach is
> to put a unique index across each "col" column, so that the columns
> are guaranteed to hold unique values.
>
> Failing that, replace the table reference in example #6 with
> something like "(SELECT * FROM a GROUP BY col) AS a". That's going
> to give you arbitrary rows (there is no concept of a "first" row),
> but it will get you the row count you're looking for.
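[Editor's note: Jay's three points can be sketched against a toy pair of tables. This is a minimal illustration using Python's stdlib sqlite3 module; the table names "a" and "b" match the thread, but the column values and the "attr" column are invented for the demo and are not the original poster's data.]

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE a (col TEXT);
    CREATE TABLE b (col TEXT, attr TEXT);
    INSERT INTO a VALUES ('x'), ('y'), (NULL);
    INSERT INTO b VALUES ('x', 'first'), ('x', 'second'),  -- duplicate join key
                         ('y', 'only'),  (NULL, 'null-attr');
""")

# 1. NULL never satisfies a join condition: the NULL rows do not pair up,
#    but the duplicated key 'x' matches twice.
joined = cur.execute(
    "SELECT a.col, b.attr FROM a JOIN b ON a.col = b.col ORDER BY 1, 2"
).fetchall()
print(joined)   # [('x', 'first'), ('x', 'second'), ('y', 'only')]

# 2. Jay's diagnostic: which join-key values are duplicated?
dupes = cur.execute(
    "SELECT COUNT(*) AS ct, col FROM b GROUP BY 2 HAVING ct > 1"
).fetchall()
print(dupes)    # [(2, 'x')]

# 3. Collapse duplicates with a grouped subquery: one arbitrary row per key,
#    so the LEFT JOIN returns exactly one output row per row of "a".
deduped = cur.execute("""
    SELECT a.col, b.attr
    FROM a LEFT JOIN (SELECT * FROM b GROUP BY col) AS b ON a.col = b.col
    ORDER BY a.col
""").fetchall()
print(len(deduped))  # 3 -- same row count as table "a"
```

Note that in step 3 SQLite is free to pick either of the two "x" rows from "b"; as Jay says, there is no "first" row, only an arbitrary one per group.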
Thanks, Jay... brilliant explanation. I ended up finding the dupes in
table "b" and re-encoding the column so it had only unique values. You
are right, "first" doesn't mean much in this case... I was referring to
"the first matching row, and to hell with the subsequent matches."
Fortunately, the problem was not too laborious to correct manually.

> You might need to do it with both tables, depending on which table(s)
> have the duplicate "col" values.

Yup, fortunately I know that; it was table "b". Table "a" has
geographic features, and I want neither to "create new" nor to "lose
existing" features. Table "b" has attributes to categorize the features
in table "a". Cleaning up table "b" cleared the problem.

--
Puneet Kishor

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users