Re: [sqlite] "is not null" and index

2011-10-17 Thread Black, Michael (IS)
OK...here's a new thumb...data is now randomly distributed at 50%.   I'm 
running on Windows XP with version 3.7.5
And the between logic works about 36% faster.  Is your "rule of thumb" based on 
any benchmark or just a guess?

#include 
#include 
int main()
{
  int i;
  char sql[4096];
  printf("CREATE TABLE x ('col1','col2','col3');\n");
  printf("BEGIN;");
  for(i=0;i<100;++i) {
double r=rand()/(double)RAND_MAX;
if (r < .5) {
  sprintf(sql,"INSERT INTO x values(null,'col2_%d','col3_%d');",i,i,i);
}
else {
  sprintf(sql,"INSERT INTO x values('col_%d','col2_%d','col3_%d');",i,i,i);
}
printf("%s\n",sql);
  }
  printf("COMMIT;");
  printf("CREATE INDEX col1index on x('col1');\n");
}

D:\SQLite\index1>sqlite3 test.db
SQLite version 3.7.5
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .read data.sql
sqlite> .timer on
sqlite> select count(*) from x where col1 between (select min(col1) from x) and 
(select max(col1) from x);
500080
CPU Time: user 0.218750 sys 0.046875
sqlite> select count(*) from x where col1 is not null;
500080
CPU Time: user 0.343750 sys 0.062500
sqlite> select count(*) from x where col1 is null;
499920
CPU Time: user 0.109375 sys 0.015625

Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Advanced GEOINT Solutions Operating Unit
Northrop Grumman Information Systems



From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
behalf of Igor Tandetnik [itandet...@mvps.org]
Sent: Monday, October 17, 2011 10:06 AM
To: sqlite-users@sqlite.org
Subject: EXT :Re: [sqlite] "is not null" and index


On 10/17/2011 9:30 AM, Black, Michael (IS) wrote:
> According to this benchmark the break-even point is at 40% nulls.  I asssume 
> you have a different test?

I did mention "rule of thumb". Specific cases may vary. I must admit I'm
too lazy to build tests for someone else's problem.

The fact that all NULL values are clustered together in rows with
sequential rowids might have skewed the results in your test. Better
locality of reference, fewer pages to read from disk, improved cache
utilization.
--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] "is not null" and index

2011-10-17 Thread Igor Tandetnik

On 10/17/2011 9:30 AM, Black, Michael (IS) wrote:

According to this benchmark the break-even point is at 40% nulls.  I asssume 
you have a different test?


I did mention "rule of thumb". Specific cases may vary. I must admit I'm 
too lazy to build tests for someone else's problem.


The fact that all NULL values are clustered together in rows with 
sequential rowids might have skewed the results in your test. Better 
locality of reference, fewer pages to read from disk, improved cache 
utilization.

--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] "is not null" and index

2011-10-17 Thread Black, Michael (IS)
According to this benchmark the break-even point is at 40% nulls.  I asssume 
you have a different test?

#include 

int main()
{
  int i;
  char sql[4096];
  printf("CREATE TABLE x ('col1','col2','col3');\n");
  printf("BEGIN;");
  for(i=0;i<100;++i) {
if (i < 10) {
  sprintf(sql,"INSERT INTO x values(null,'col2_%d','col3_%d');",i,i,i);
}
else {
  sprintf(sql,"INSERT INTO x values('col_%d','col2_%d','col3_%d');",i,i,i);
}
printf("%s\n",sql);
  }
  printf("COMMIT;");
  printf("CREATE INDEX col1index on x('col1');\n");
}



sqlite> select count(*) from x where col1 between (select min(col1) from x) and 
(select max(col1) from x);
90
CPU Time: user 0.453125 sys 0.015625
sqlite> select count(*) from x where col1 is not null;
90
CPU Time: user 0.281250 sys 0.125000
sqlite> update x set col1=null where rowid < 20;
CPU Time: user 1.906250 sys 0.234375
sqlite> select count(*) from x where col1 between (select min(col1) from x) and 
(select max(col1) from x);
81
CPU Time: user 0.390625 sys 0.031250
sqlite> update x set col1=null where rowid >= 20 and rowid < 30;
CPU Time: user 0.953125 sys 0.125000
sqlite> select count(*) from x where col1 between (select min(col1) from x) and 
(select max(col1) from x);
71
CPU Time: user 0.359375 sys 0.015625
sqlite> select count(*) from x where col1 is not null;
71
CPU Time: user 0.296875 sys 0.078125
sqlite> update x set col1=null where rowid >= 30 and rowid < 40;
CPU Time: user 0.890625 sys 0.156250
sqlite> select count(*) from x where col1 is not null;
61
CPU Time: user 0.281250 sys 0.109375
sqlite> select count(*) from x where col1 between (select min(col1) from x) and 
(select max(col1) from x);
61
CPU Time: user 0.281250 sys 0.031250
sqlite> update x set col1=null where rowid >= 40 and rowid < 50;
CPU Time: user 0.921875 sys 0.171875
sqlite> select count(*) from x where col1 between (select min(col1) from x) and 
(select max(col1) from x);
51
CPU Time: user 0.25 sys 0.015625
sqlite> select count(*) from x where col1 is not null;
51
CPU Time: user 0.312500 sys 0.062500



Michael D. Black

Senior Scientist

Advanced Analytics Directorate

Advanced GEOINT Solutions Operating Unit

Northrop Grumman Information Systems


From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
behalf of Igor Tandetnik [itandet...@mvps.org]
Sent: Monday, October 17, 2011 8:14 AM
To: sqlite-users@sqlite.org
Subject: EXT :Re: [sqlite] "is not null" and index

Jean-Christophe Deschamps <j...@antichoc.net> wrote:
>>> How can indexes be used with "not null" queries?
>>
>> They cannot.
>
> If one sees NOT NULL as the complement of NULL, i.e. values in the
> range {min_value, max_value} (min and max depending on the column
> expected content and type), then couldn't an index help?
> select * from table where col between min_value and max_value

If you have an extrinsic knowledge of the domain of the column values, then 
yes, you can reformulate the query as

where col1 between minValue and maxValue

I don't think SQLite would be able to rewrite the query this way automatically.

> Of course if there are only few NULLs then a table scan will probably
> be faster.

In fact, unless some 90% of the rows contain NULLs, a table scan will probably 
be faster. Using an index trades O(N) performance for O(M log N), where M is 
the number of rows actually satisfying the condition. This is clearly an 
improvement when M is much smaller than N, and a pessimization when M is close 
to N. A rule of thumb is that the break-even point is somewhere around M = 0.1*N

And if 90% of the rows do contain NULL in this column, I'd consider splitting 
the data into two tables - one with three columns (containing all non-NULL rows 
from the original table) and the other with two columns (containing the 
remaining rows).
--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] "is not null" and index

2011-10-17 Thread Richard Hipp
On Mon, Oct 17, 2011 at 9:14 AM, Igor Tandetnik  wrote:

> Jean-Christophe Deschamps  wrote:
> >>> How can indexes be used with "not null" queries?
> >>
> >> They cannot.
> >
> > If one sees NOT NULL as the complement of NULL, i.e. values in the
> > range {min_value, max_value} (min and max depending on the column
> > expected content and type), then couldn't an index help?
> > select * from table where col between min_value and max_value
>
> If you have an extrinsic knowledge of the domain of the column values, then
> yes, you can reformulate the query as
>
> where col1 between minValue and maxValue
>
> I don't think SQLite would be able to rewrite the query this way
> automatically.
>

It will if you use the latest SQLite from trunk, compiled using
-DSQLITE_ENABLE_STAT3 and if you run ANALYZE and if enough entries in the
table are NULL to justify using an index to find the ones that are NOT
NULL.  As you point out below, "enough" usually means about 90%.


>
> > Of course if there are only few NULLs then a table scan will probably
> > be faster.
>
> In fact, unless some 90% of the rows contain NULLs, a table scan will
> probably be faster. Using an index trades O(N) performance for O(M log N),
> where M is the number of rows actually satisfying the condition. This is
> clearly an improvement when M is much smaller than N, and a pessimization
> when M is close to N. A rule of thumb is that the break-even point is
> somewhere around M = 0.1*N
>
> And if 90% of the rows do contain NULL in this column, I'd consider
> splitting the data into two tables - one with three columns (containing all
> non-NULL rows from the original table) and the other with two columns
> (containing the remaining rows).
> --
> Igor Tandetnik
>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] "is not null" and index

2011-10-17 Thread Igor Tandetnik
Jean-Christophe Deschamps  wrote:
>>> How can indexes be used with "not null" queries?
>> 
>> They cannot.
> 
> If one sees NOT NULL as the complement of NULL, i.e. values in the
> range {min_value, max_value} (min and max depending on the column
> expected content and type), then couldn't an index help?
> select * from table where col between min_value and max_value

If you have an extrinsic knowledge of the domain of the column values, then 
yes, you can reformulate the query as

where col1 between minValue and maxValue

I don't think SQLite would be able to rewrite the query this way automatically.

> Of course if there are only few NULLs then a table scan will probably
> be faster.

In fact, unless some 90% of the rows contain NULLs, a table scan will probably 
be faster. Using an index trades O(N) performance for O(M log N), where M is 
the number of rows actually satisfying the condition. This is clearly an 
improvement when M is much smaller than N, and a pessimization when M is close 
to N. A rule of thumb is that the break-even point is somewhere around M = 0.1*N

And if 90% of the rows do contain NULL in this column, I'd consider splitting 
the data into two tables - one with three columns (containing all non-NULL rows 
from the original table) and the other with two columns (containing the 
remaining rows).
-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] "is not null" and index

2011-10-17 Thread Jean-Christophe Deschamps



Yoav Apter  wrote:
> I have the following table:
>
> CREATE TABLE x ('col1', 'col2', 'col3')
> Create col1index on x ('col1')
>
> When I run this query: "select * from x where col1 is null" I see 
the index on x is used.
> When I run this query: "select * from x where col1 is NOT null" I 
see the index on x is not used.


Why is that suprising? Imagine you are given a book with an index at 
the end, and are asked to enumerate all pages where a particular term 
does *not* appear. Would an index be helpful in this task?


> How can indexes be used with "not null" queries?

They cannot.


If one sees NOT NULL as the complement of NULL, i.e. values in the 
range {min_value, max_value} (min and max depending on the column 
expected content and type), then couldn't an index help?

select * from table where col between min_value and max_value

Of course if there are only few NULLs then a table scan will probably 
be faster.



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] "is not null" and index

2011-10-17 Thread Igor Tandetnik
Black, Michael (IS)  wrote:
> What does distinct do?
> sqlite> explain query plan select distinct (col1) from x where col1 is not 
> null;
> sele  order  from  deta
>   -    
> 0 0  0 SCAN TABLE x USING COVERING INDEX col1index 
> (~50 rows)
> OK...we're still using an index here...

In a sense, yes - but note that you have SCAN TABLE USING INDEX; compare and 
contrast with SEARCH TABLE USING INDEX.

DISTINCT effectively implies ORDER BY. The way DISTINCT is implemeted, SQLite 
enumerates the rows in order (and the index is helpful here) and discards any 
where the value is the same as in the previous row. You've just replaced a full 
table scan with a full index scan (and changed the meaning of the query in the 
process).

> so using this subselect we do this:
> sqlite> explain query plan select * from x where col1 in (select distinct 
> (col1) from x where col1 is not null);
> sele  order  from  deta
>   -    
> 0 0  0 SEARCH TABLE x USING INDEX col1index (col1=?) 
> (~250 rows)
> 0 0  0 EXECUTE LIST SUBQUERY 1
> 1 0  0 SCAN TABLE x USING COVERING INDEX col1index 
> (~50 rows)

So now you are scanning the table, and also searching on top of that. How is 
this an improvement?

> Performance probably depends on how many "not null" things there are...if not 
> many of them this may not be any faster.

In fact, this will always be strictly slower than a straightforward table scan.

> A count() could be a lot faster though I'd think.

How so? I'm not even sure how you would use count() here, let alone use it in a 
way that leads to performance gains.
-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] "is not null" and index

2011-10-17 Thread Black, Michael (IS)
Does this make sense to try?

First, duplicate the lack of index
sqlite> explain query plan select * from x where col1 is null;
sele  order  from  deta
  -    
0 0  0 SEARCH TABLE x USING INDEX col1index (col1=?) (~10 
rows)
sqlite> explain query plan select * from x where col1 is not null;
sele  order  from  deta
  -    
0 0  0 SCAN TABLE x (~50 rows)

What does distinct do?
sqlite> explain query plan select distinct (col1) from x where col1 is not null;
sele  order  from  deta
  -    
0 0  0 SCAN TABLE x USING COVERING INDEX col1index (~50 
rows)
OK...we're still using an index here...so using this subselect we do this:
sqlite> explain query plan select * from x where col1 in (select distinct 
(col1) from x where col1 is not null);
sele  order  from  deta
  -    
0 0  0 SEARCH TABLE x USING INDEX col1index (col1=?) (~250 
rows)
0 0  0 EXECUTE LIST SUBQUERY 1
1 0  0 SCAN TABLE x USING COVERING INDEX col1index (~50 
rows)

Performance probably depends on how many "not null" things there are...if not 
many of them this may not be any faster.
A count() could be a lot faster though I'd think.

Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Advanced GEOINT Solutions Operating Unit
Northrop Grumman Information Systems



From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
behalf of Igor Tandetnik [itandet...@mvps.org]
Sent: Monday, October 17, 2011 6:56 AM
To: sqlite-users@sqlite.org
Subject: EXT :Re: [sqlite] "is not null" and index


Yoav Apter <yo...@checkpoint.com> wrote:
> I have the following table:
>
> CREATE TABLE x ('col1', 'col2', 'col3')
> Create col1index on x ('col1')
>
> When I run this query: "select * from x where col1 is null" I see the index 
> on x is used.
> When I run this query: "select * from x where col1 is NOT null" I see the 
> index on x is not used.

Why is that suprising? Imagine you are given a book with an index at the end, 
and are asked to enumerate all pages where a particular term does *not* appear. 
Would an index be helpful in this task?

> How can indexes be used with "not null" queries?

They cannot.
--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] "is not null" and index

2011-10-17 Thread Igor Tandetnik
Yoav Apter  wrote:
> I have the following table:
> 
> CREATE TABLE x ('col1', 'col2', 'col3')
> Create col1index on x ('col1')
> 
> When I run this query: "select * from x where col1 is null" I see the index 
> on x is used.
> When I run this query: "select * from x where col1 is NOT null" I see the 
> index on x is not used.

Why is that suprising? Imagine you are given a book with an index at the end, 
and are asked to enumerate all pages where a particular term does *not* appear. 
Would an index be helpful in this task?

> How can indexes be used with "not null" queries?

They cannot.
-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] "is not null" and index

2011-10-17 Thread Yoav Apter
Hi

I have the following table:

CREATE TABLE x ('col1', 'col2', 'col3')
Create col1index on x ('col1')

When I run this query: "select * from x where col1 is null" I see the index on 
x is used.
When I run this query: "select * from x where col1 is NOT null" I see the index 
on x is not used.

How can indexes be used with "not null" queries?


Thanks


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users