Re: [sqlite] "is not null" and index
OK...here's a new thumb...data is now randomly distributed at 50%. I'm running on Windows XP with version 3.7.5 And the between logic works about 36% faster. Is your "rule of thumb" based on any benchmark or just a guess? #include #include int main() { int i; char sql[4096]; printf("CREATE TABLE x ('col1','col2','col3');\n"); printf("BEGIN;"); for(i=0;i<100;++i) { double r=rand()/(double)RAND_MAX; if (r < .5) { sprintf(sql,"INSERT INTO x values(null,'col2_%d','col3_%d');",i,i,i); } else { sprintf(sql,"INSERT INTO x values('col_%d','col2_%d','col3_%d');",i,i,i); } printf("%s\n",sql); } printf("COMMIT;"); printf("CREATE INDEX col1index on x('col1');\n"); } D:\SQLite\index1>sqlite3 test.db SQLite version 3.7.5 Enter ".help" for instructions Enter SQL statements terminated with a ";" sqlite> .read data.sql sqlite> .timer on sqlite> select count(*) from x where col1 between (select min(col1) from x) and (select max(col1) from x); 500080 CPU Time: user 0.218750 sys 0.046875 sqlite> select count(*) from x where col1 is not null; 500080 CPU Time: user 0.343750 sys 0.062500 sqlite> select count(*) from x where col1 is null; 499920 CPU Time: user 0.109375 sys 0.015625 Michael D. Black Senior Scientist Advanced Analytics Directorate Advanced GEOINT Solutions Operating Unit Northrop Grumman Information Systems From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on behalf of Igor Tandetnik [itandet...@mvps.org] Sent: Monday, October 17, 2011 10:06 AM To: sqlite-users@sqlite.org Subject: EXT :Re: [sqlite] "is not null" and index On 10/17/2011 9:30 AM, Black, Michael (IS) wrote: > According to this benchmark the break-even point is at 40% nulls. I asssume > you have a different test? I did mention "rule of thumb". Specific cases may vary. I must admit I'm too lazy to build tests for someone else's problem. The fact that all NULL values are clustered together in rows with sequential rowids might have skewed the results in your test. Better locality of reference, fewer pages to read from disk, improved cache utilization. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] "is not null" and index
On 10/17/2011 9:30 AM, Black, Michael (IS) wrote: According to this benchmark the break-even point is at 40% nulls. I asssume you have a different test? I did mention "rule of thumb". Specific cases may vary. I must admit I'm too lazy to build tests for someone else's problem. The fact that all NULL values are clustered together in rows with sequential rowids might have skewed the results in your test. Better locality of reference, fewer pages to read from disk, improved cache utilization. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] "is not null" and index
According to this benchmark the break-even point is at 40% nulls. I asssume you have a different test? #include int main() { int i; char sql[4096]; printf("CREATE TABLE x ('col1','col2','col3');\n"); printf("BEGIN;"); for(i=0;i<100;++i) { if (i < 10) { sprintf(sql,"INSERT INTO x values(null,'col2_%d','col3_%d');",i,i,i); } else { sprintf(sql,"INSERT INTO x values('col_%d','col2_%d','col3_%d');",i,i,i); } printf("%s\n",sql); } printf("COMMIT;"); printf("CREATE INDEX col1index on x('col1');\n"); } sqlite> select count(*) from x where col1 between (select min(col1) from x) and (select max(col1) from x); 90 CPU Time: user 0.453125 sys 0.015625 sqlite> select count(*) from x where col1 is not null; 90 CPU Time: user 0.281250 sys 0.125000 sqlite> update x set col1=null where rowid < 20; CPU Time: user 1.906250 sys 0.234375 sqlite> select count(*) from x where col1 between (select min(col1) from x) and (select max(col1) from x); 81 CPU Time: user 0.390625 sys 0.031250 sqlite> update x set col1=null where rowid >= 20 and rowid < 30; CPU Time: user 0.953125 sys 0.125000 sqlite> select count(*) from x where col1 between (select min(col1) from x) and (select max(col1) from x); 71 CPU Time: user 0.359375 sys 0.015625 sqlite> select count(*) from x where col1 is not null; 71 CPU Time: user 0.296875 sys 0.078125 sqlite> update x set col1=null where rowid >= 30 and rowid < 40; CPU Time: user 0.890625 sys 0.156250 sqlite> select count(*) from x where col1 is not null; 61 CPU Time: user 0.281250 sys 0.109375 sqlite> select count(*) from x where col1 between (select min(col1) from x) and (select max(col1) from x); 61 CPU Time: user 0.281250 sys 0.031250 sqlite> update x set col1=null where rowid >= 40 and rowid < 50; CPU Time: user 0.921875 sys 0.171875 sqlite> select count(*) from x where col1 between (select min(col1) from x) and (select max(col1) from x); 51 CPU Time: user 0.25 sys 0.015625 sqlite> select count(*) from x where col1 is not null; 51 CPU Time: user 0.312500 sys 0.062500 Michael D. Black Senior Scientist Advanced Analytics Directorate Advanced GEOINT Solutions Operating Unit Northrop Grumman Information Systems From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on behalf of Igor Tandetnik [itandet...@mvps.org] Sent: Monday, October 17, 2011 8:14 AM To: sqlite-users@sqlite.org Subject: EXT :Re: [sqlite] "is not null" and index Jean-Christophe Deschamps <j...@antichoc.net> wrote: >>> How can indexes be used with "not null" queries? >> >> They cannot. > > If one sees NOT NULL as the complement of NULL, i.e. values in the > range {min_value, max_value} (min and max depending on the column > expected content and type), then couldn't an index help? > select * from table where col between min_value and max_value If you have an extrinsic knowledge of the domain of the column values, then yes, you can reformulate the query as where col1 between minValue and maxValue I don't think SQLite would be able to rewrite the query this way automatically. > Of course if there are only few NULLs then a table scan will probably > be faster. In fact, unless some 90% of the rows contain NULLs, a table scan will probably be faster. Using an index trades O(N) performance for O(M log N), where M is the number of rows actually satisfying the condition. This is clearly an improvement when M is much smaller than N, and a pessimization when M is close to N. A rule of thumb is that the break-even point is somewhere around M = 0.1*N And if 90% of the rows do contain NULL in this column, I'd consider splitting the data into two tables - one with three columns (containing all non-NULL rows from the original table) and the other with two columns (containing the remaining rows). -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] "is not null" and index
On Mon, Oct 17, 2011 at 9:14 AM, Igor Tandetnikwrote: > Jean-Christophe Deschamps wrote: > >>> How can indexes be used with "not null" queries? > >> > >> They cannot. > > > > If one sees NOT NULL as the complement of NULL, i.e. values in the > > range {min_value, max_value} (min and max depending on the column > > expected content and type), then couldn't an index help? > > select * from table where col between min_value and max_value > > If you have an extrinsic knowledge of the domain of the column values, then > yes, you can reformulate the query as > > where col1 between minValue and maxValue > > I don't think SQLite would be able to rewrite the query this way > automatically. > It will if you use the latest SQLite from trunk, compiled using -DSQLITE_ENABLE_STAT3 and if you run ANALYZE and if enough entries in the table are NULL to justify using an index to find the ones that are NOT NULL. As you point out below, "enough" usually means about 90%. > > > Of course if there are only few NULLs then a table scan will probably > > be faster. > > In fact, unless some 90% of the rows contain NULLs, a table scan will > probably be faster. Using an index trades O(N) performance for O(M log N), > where M is the number of rows actually satisfying the condition. This is > clearly an improvement when M is much smaller than N, and a pessimization > when M is close to N. A rule of thumb is that the break-even point is > somewhere around M = 0.1*N > > And if 90% of the rows do contain NULL in this column, I'd consider > splitting the data into two tables - one with three columns (containing all > non-NULL rows from the original table) and the other with two columns > (containing the remaining rows). > -- > Igor Tandetnik > > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] "is not null" and index
Jean-Christophe Deschampswrote: >>> How can indexes be used with "not null" queries? >> >> They cannot. > > If one sees NOT NULL as the complement of NULL, i.e. values in the > range {min_value, max_value} (min and max depending on the column > expected content and type), then couldn't an index help? > select * from table where col between min_value and max_value If you have an extrinsic knowledge of the domain of the column values, then yes, you can reformulate the query as where col1 between minValue and maxValue I don't think SQLite would be able to rewrite the query this way automatically. > Of course if there are only few NULLs then a table scan will probably > be faster. In fact, unless some 90% of the rows contain NULLs, a table scan will probably be faster. Using an index trades O(N) performance for O(M log N), where M is the number of rows actually satisfying the condition. This is clearly an improvement when M is much smaller than N, and a pessimization when M is close to N. A rule of thumb is that the break-even point is somewhere around M = 0.1*N And if 90% of the rows do contain NULL in this column, I'd consider splitting the data into two tables - one with three columns (containing all non-NULL rows from the original table) and the other with two columns (containing the remaining rows). -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] "is not null" and index
Yoav Apterwrote: > I have the following table: > > CREATE TABLE x ('col1', 'col2', 'col3') > Create col1index on x ('col1') > > When I run this query: "select * from x where col1 is null" I see the index on x is used. > When I run this query: "select * from x where col1 is NOT null" I see the index on x is not used. Why is that suprising? Imagine you are given a book with an index at the end, and are asked to enumerate all pages where a particular term does *not* appear. Would an index be helpful in this task? > How can indexes be used with "not null" queries? They cannot. If one sees NOT NULL as the complement of NULL, i.e. values in the range {min_value, max_value} (min and max depending on the column expected content and type), then couldn't an index help? select * from table where col between min_value and max_value Of course if there are only few NULLs then a table scan will probably be faster. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] "is not null" and index
Black, Michael (IS)wrote: > What does distinct do? > sqlite> explain query plan select distinct (col1) from x where col1 is not > null; > sele order from deta > - > 0 0 0 SCAN TABLE x USING COVERING INDEX col1index > (~50 rows) > OK...we're still using an index here... In a sense, yes - but note that you have SCAN TABLE USING INDEX; compare and contrast with SEARCH TABLE USING INDEX. DISTINCT effectively implies ORDER BY. The way DISTINCT is implemeted, SQLite enumerates the rows in order (and the index is helpful here) and discards any where the value is the same as in the previous row. You've just replaced a full table scan with a full index scan (and changed the meaning of the query in the process). > so using this subselect we do this: > sqlite> explain query plan select * from x where col1 in (select distinct > (col1) from x where col1 is not null); > sele order from deta > - > 0 0 0 SEARCH TABLE x USING INDEX col1index (col1=?) > (~250 rows) > 0 0 0 EXECUTE LIST SUBQUERY 1 > 1 0 0 SCAN TABLE x USING COVERING INDEX col1index > (~50 rows) So now you are scanning the table, and also searching on top of that. How is this an improvement? > Performance probably depends on how many "not null" things there are...if not > many of them this may not be any faster. In fact, this will always be strictly slower than a straightforward table scan. > A count() could be a lot faster though I'd think. How so? I'm not even sure how you would use count() here, let alone use it in a way that leads to performance gains. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] "is not null" and index
Does this make sense to try? First, duplicate the lack of index sqlite> explain query plan select * from x where col1 is null; sele order from deta - 0 0 0 SEARCH TABLE x USING INDEX col1index (col1=?) (~10 rows) sqlite> explain query plan select * from x where col1 is not null; sele order from deta - 0 0 0 SCAN TABLE x (~50 rows) What does distinct do? sqlite> explain query plan select distinct (col1) from x where col1 is not null; sele order from deta - 0 0 0 SCAN TABLE x USING COVERING INDEX col1index (~50 rows) OK...we're still using an index here...so using this subselect we do this: sqlite> explain query plan select * from x where col1 in (select distinct (col1) from x where col1 is not null); sele order from deta - 0 0 0 SEARCH TABLE x USING INDEX col1index (col1=?) (~250 rows) 0 0 0 EXECUTE LIST SUBQUERY 1 1 0 0 SCAN TABLE x USING COVERING INDEX col1index (~50 rows) Performance probably depends on how many "not null" things there are...if not many of them this may not be any faster. A count() could be a lot faster though I'd think. Michael D. Black Senior Scientist Advanced Analytics Directorate Advanced GEOINT Solutions Operating Unit Northrop Grumman Information Systems From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on behalf of Igor Tandetnik [itandet...@mvps.org] Sent: Monday, October 17, 2011 6:56 AM To: sqlite-users@sqlite.org Subject: EXT :Re: [sqlite] "is not null" and index Yoav Apter <yo...@checkpoint.com> wrote: > I have the following table: > > CREATE TABLE x ('col1', 'col2', 'col3') > Create col1index on x ('col1') > > When I run this query: "select * from x where col1 is null" I see the index > on x is used. > When I run this query: "select * from x where col1 is NOT null" I see the > index on x is not used. Why is that suprising? Imagine you are given a book with an index at the end, and are asked to enumerate all pages where a particular term does *not* appear. Would an index be helpful in this task? > How can indexes be used with "not null" queries? They cannot. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] "is not null" and index
Yoav Apterwrote: > I have the following table: > > CREATE TABLE x ('col1', 'col2', 'col3') > Create col1index on x ('col1') > > When I run this query: "select * from x where col1 is null" I see the index > on x is used. > When I run this query: "select * from x where col1 is NOT null" I see the > index on x is not used. Why is that suprising? Imagine you are given a book with an index at the end, and are asked to enumerate all pages where a particular term does *not* appear. Would an index be helpful in this task? > How can indexes be used with "not null" queries? They cannot. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] "is not null" and index
Hi I have the following table: CREATE TABLE x ('col1', 'col2', 'col3') Create col1index on x ('col1') When I run this query: "select * from x where col1 is null" I see the index on x is used. When I run this query: "select * from x where col1 is NOT null" I see the index on x is not used. How can indexes be used with "not null" queries? Thanks ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users