[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread R.Smith


On 2015-10-02 07:28 PM, Bart Smissaert wrote:
>> if any single column in those two rows is NULL.
> OK, I got it and that will make it even less likely that this situation
> will occur.
> I think we got this exhausted now.

Sure - but don't be dismayed though, every opportunity for optimization 
should be investigated, that's how progress is made. This one turned out 
after investigation to not be suitable, but the next one might. Keep at it.



[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread R.Smith


On 2015-10-02 05:41 PM, Bart Smissaert wrote:
>> you're just throwing random terms around and hoping something sticks.
> Not sure where you got that idea from, but let me explain better:
>
> Say we have a table:
>
> CREATE TABLE TABLE1
> ([FIELD1] INTEGER,
> [FIELD2] TEXT,
> [FIELD3] TEXT,
> [FIELD4] REAL)
>
> and we have a unique index on all 4 fields:
>
> CREATE UNIQUE INDEX IDX1
> ON TABLE1
> (FIELD1, FIELD2, FIELD3, FIELD4)
>
> Now I want to count all the unique rows of this table (without any limiting
> where or join or whatever).
>
> Unless I am mistaken here this is done with a SQL like this:
>
> SELECT COUNT(*) AS UNIQUE_ROWS
> FROM (SELECT DISTINCT * FROM TABLE1)
>
> But if we take advantage of the fact that this table has a unique index on
> all the fields of the table
> we can simply do this SQL:
>
> SELECT COUNT(*) FROM TABLE1
>
> It will have the same result and that is a lot faster.
> Hope this makes it clear.

This explanation is much more clear - thanks.  I think the initial 
source of confusion is the original explanation being hard to follow, 
but now that it is clear and with SQL, we can see exactly what you mean.

In the case where where the table has rows that are not unique, there is 
no gain.
In the case where we could use the optimization, all the following has 
to be true:

A - The Table must NOT have a Primary Key, but
B - The table must have a complete covering Index and
C - The covering Index must be UNIQUE and
D - All fields must be NOT NULL
E - The Query must be a COUNT() aggregate query
F - There must NOT be any JOINs
G - There must NOT be a WHERE clause.
H - There must be a correlated sub-query which accesses the table,
I - The sub-query MUST force a DISTINCT Query output constraint, and
J - The constraint MUST be on the * wildcard identifier (or 
alternatively, the complete field list).

The only way this optimization can ever be accommodated is if ALL of the 
above are true, and the odds of having all those things true are even 
less than Lotto-winning odds.

One might look at the sub-query alone and say "What if the sub query (or 
any query) that Uses 'DISTINCT *' in a table that has a Unique covering 
index without a where clause or joins can be optimized" - which it 
probably can, but the odds of this are only very slightly less low than 
the first case.

Add to that the code that needs to be added to check for all those 
things - it would probably make the query planner slower by a (very) 
insignificant amount, but more still than the odds of running into such 
a query+schema combination.

That said, if you do have this situation sometimes and it does slow down 
things for you - Perhaps a design change in the schemata would help?



[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Bart Smissaert
> Keep at it.

Will do.

RBS

On Fri, Oct 2, 2015 at 6:46 PM, R.Smith  wrote:

>
>
> On 2015-10-02 07:28 PM, Bart Smissaert wrote:
>
>> if any single column in those two rows is NULL.
>>>
>> OK, I got it and that will make it even less likely that this situation
>> will occur.
>> I think we got this exhausted now.
>>
>
> Sure - but don't be dismayed though, every opportunity for optimization
> should be investigated, that's how progress is made. This one turned out
> after investigation to not be suitable, but the next one might. Keep at it.
>
>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Keith Medcalf

No, whether the column contains a null is irrelevant.  It is whether the column 
CAN contain a null.

> -Original Message-
> From: sqlite-users-bounces at mailinglists.sqlite.org [mailto:sqlite-users-
> bounces at mailinglists.sqlite.org] On Behalf Of Bart Smissaert
> Sent: Friday, 2 October, 2015 11:15
> To: General Discussion of SQLite Database
> Subject: Re: [sqlite] Speed of count distinct rows if unique index on all
> fields
> 
> OK, thanks.
> Do you mean that is only valid if there are no rows where all columns are
> NULL?
> In any case, I can see that this optimization (in SQLite) is just not
> worth
> the trouble.
> 
> RBS
> 
> 
> On Fri, Oct 2, 2015 at 6:02 PM, Richard Hipp  wrote:
> 
> > On 10/2/15, Bart Smissaert  wrote:
> > >
> > > Unless I am mistaken here this is done with a SQL like this:
> > >
> > > SELECT COUNT(*) AS UNIQUE_ROWS
> > > FROM (SELECT DISTINCT * FROM TABLE1)
> > >
> > > But if we take advantage of the fact that this table has a unique
> index
> > on
> > > all the fields of the table
> > > we can simply do this SQL:
> > >
> > > SELECT COUNT(*) FROM TABLE1
> > >
> >
> > No.  That is only valid if all columns are individually NOT NULL.
> >
> > --
> > D. Richard Hipp
> > drh at sqlite.org
> > ___
> > sqlite-users mailing list
> > sqlite-users at mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users





[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Keith Medcalf

You are making an error.

sqlite> create table x(a, b, c, d);
sqlite> create unique index y on x(a,b,c,d);
sqlite> insert into x values(1,2,3,null);
sqlite> insert into x values(1,2,3,null);
sqlite> insert into x values(1,2,3,null);
sqlite> select * from x;
1|2|3|
1|2|3|
1|2|3|
sqlite> select distinct * from x;
1|2|3|
sqlite>

As you can see, the number of "DISTINCT" rows does not equal the number of 
"unique" rows in the index.

Using DISTINCT requires visitating of all rows and counting the DISTINCT 
entries, whereas merely counting rows does not require visiting all the rows.

In other words, where columns are nullable DISTINCT(*) is not the same as 
UNIQUE(*).

> -Original Message-
> From: sqlite-users-bounces at mailinglists.sqlite.org [mailto:sqlite-users-
> bounces at mailinglists.sqlite.org] On Behalf Of Bart Smissaert
> Sent: Friday, 2 October, 2015 09:42
> To: General Discussion of SQLite Database
> Subject: Re: [sqlite] Speed of count distinct rows if unique index on all
> fields
> 
> > you're just throwing random terms around and hoping something sticks.
> 
> Not sure where you got that idea from, but let me explain better:
> 
> Say we have a table:
> 
> CREATE TABLE TABLE1
> ([FIELD1] INTEGER,
> [FIELD2] TEXT,
> [FIELD3] TEXT,
> [FIELD4] REAL)
> 
> and we have a unique index on all 4 fields:
> 
> CREATE UNIQUE INDEX IDX1
> ON TABLE1
> (FIELD1, FIELD2, FIELD3, FIELD4)
> 
> Now I want to count all the unique rows of this table (without any
> limiting
> where or join or whatever).
> 
> Unless I am mistaken here this is done with a SQL like this:
> 
> SELECT COUNT(*) AS UNIQUE_ROWS
> FROM (SELECT DISTINCT * FROM TABLE1)
> 
> But if we take advantage of the fact that this table has a unique index on
> all the fields of the table
> we can simply do this SQL:
> 
> SELECT COUNT(*) FROM TABLE1
> 
> It will have the same result and that is a lot faster.
> Hope this makes it clear.
> 
> 
> RBS
> 
> 
> 
> 
> On Fri, Oct 2, 2015 at 4:01 PM, Scott Hess  wrote:
> 
> > Why does any of that matter?  SELECT COUNT(*) FROM table; already knows
> all
> > of that information.
> >
> > If you have a question about why one query is faster/slower than another
> > query given one schema versus another schema, then post representative
> > schema and queries.  Right now you're just throwing random terms around
> and
> > hoping something sticks.  Earlier you said that you weren't talking
> about
> > DISTINCT as applied to result sets, but just now you're re-introducing
> > COUNT DISTINCT for some reason.  SQL code will make your question
> concrete.
> >
> > -scott
> >
> >
> > On Fri, Oct 2, 2015 at 7:54 AM, Bart Smissaert
> 
> > wrote:
> >
> > > It is faster because if it knows there is no where or join or whatever
> > row
> > > limiting condition and it also knows there is
> > > a unique index on all fields it can simply do select count(rowid) from
> > > table1 and not do any count distinct.
> > >
> > > RBS
> > >
> > >
> > > On Fri, Oct 2, 2015 at 3:51 PM, Scott Hess  wrote:
> > >
> > > > On Fri, Oct 2, 2015 at 7:43 AM, Bart Smissaert <
> > bart.smissaert at gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > > The Uniqueness of the output depends on which fields are
> included,
> > > > JOINs,
> > > > > UNIONs, etc. etc.
> > > > >
> > > > > I am not talking about that situation. I am only referring to a
> > > situation
> > > > > where you want to count all
> > > > > rows in a table. I know it will be uncommon to have an index on
> all
> > > > fields
> > > > > and this is not really a practical
> > > > > question. I suppose as it so uncommon it is not worth it to put
> this
> > > > > optimization in.
> > > >
> > > >
> > > > Is your case of having the unique index on all fields faster than
> > having
> > > a
> > > > unique index on a single field?
> > > >
> > > > Maybe you should include an example of your schema.  I can't think
> of
> > how
> > > > scanning an index on all fields could be smaller than the underlying
> > > table,
> > > > so it's unclear how that could be faster.  But a unique index on a
> > subset
> > > > of the data could be faster simply from being smaller.
> > > >
> > > > Also, 10x faster makes me wonder about whether you're accounting for
> > > > caching effects.  A blunt way to test for that is to run your
> queries a
> > > > couple times.  If the first time is slow and the second and later
> times
> > > are
> > > > much faster, then it's likely the cache is causing the speedup.
> > > >
> > > > -scott
> > > > ___
> > > > sqlite-users mailing list
> > > > sqlite-users at mailinglists.sqlite.org
> > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > >
> > > ___
> > > sqlite-users mailing list
> > > sqlite-users at mailinglists.sqlite.org
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > >
> > ___
> > 

[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Bart Smissaert
> if any single column in those two rows is NULL.

OK, I got it and that will make it even less likely that this situation
will occur.
I think we got this exhausted now.

RBS


On Fri, Oct 2, 2015 at 6:24 PM, Richard Hipp  wrote:

> On 10/2/15, Bart Smissaert  wrote:
> > Do you mean that is only valid if there are no rows where all columns are
> > NULL?
>
> No, I mean that two rows can be identical (not distinct) event if
> there is a unique index on all columns, if any single column in those
> two rows is NULL.  Example:
>
> SQLite version 3.8.11 2015-07-15 23:15:59
> Enter ".help" for usage hints.
> Connected to a transient in-memory database.
> Use ".open FILENAME" to reopen on a persistent database.
> sqlite> create table t1(a,b,c);
> sqlite> create unique index t1u on t1(a,b,c);
> sqlite> insert into t1(a,b,c) values(1,null,3),(1,null,3),(1,null,3);
> sqlite>
>
> --
> D. Richard Hipp
> drh at sqlite.org
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Bart Smissaert
Yes, I can see that there is no point in SQLite having this optimization.
Of course I can do this optimization myself (just run select count(*) from
table, instead of using distinct) if I know all conditions are met. This
wasn't anything practical I needed, just a fleeting thought.

RBS


On Fri, Oct 2, 2015 at 6:14 PM, R.Smith  wrote:

>
>
> On 2015-10-02 05:41 PM, Bart Smissaert wrote:
>
>> you're just throwing random terms around and hoping something sticks.
>>>
>> Not sure where you got that idea from, but let me explain better:
>>
>> Say we have a table:
>>
>> CREATE TABLE TABLE1
>> ([FIELD1] INTEGER,
>> [FIELD2] TEXT,
>> [FIELD3] TEXT,
>> [FIELD4] REAL)
>>
>> and we have a unique index on all 4 fields:
>>
>> CREATE UNIQUE INDEX IDX1
>> ON TABLE1
>> (FIELD1, FIELD2, FIELD3, FIELD4)
>>
>> Now I want to count all the unique rows of this table (without any
>> limiting
>> where or join or whatever).
>>
>> Unless I am mistaken here this is done with a SQL like this:
>>
>> SELECT COUNT(*) AS UNIQUE_ROWS
>> FROM (SELECT DISTINCT * FROM TABLE1)
>>
>> But if we take advantage of the fact that this table has a unique index on
>> all the fields of the table
>> we can simply do this SQL:
>>
>> SELECT COUNT(*) FROM TABLE1
>>
>> It will have the same result and that is a lot faster.
>> Hope this makes it clear.
>>
>
> This explanation is much more clear - thanks.  I think the initial source
> of confusion is the original explanation being hard to follow, but now that
> it is clear and with SQL, we can see exactly what you mean.
>
> In the case where where the table has rows that are not unique, there is
> no gain.
> In the case where we could use the optimization, all the following has to
> be true:
>
> A - The Table must NOT have a Primary Key, but
> B - The table must have a complete covering Index and
> C - The covering Index must be UNIQUE and
> D - All fields must be NOT NULL
> E - The Query must be a COUNT() aggregate query
> F - There must NOT be any JOINs
> G - There must NOT be a WHERE clause.
> H - There must be a correlated sub-query which accesses the table,
> I - The sub-query MUST force a DISTINCT Query output constraint, and
> J - The constraint MUST be on the * wildcard identifier (or alternatively,
> the complete field list).
>
> The only way this optimization can ever be accommodated is if ALL of the
> above are true, and the odds of having all those things true are even less
> than Lotto-winning odds.
>
> One might look at the sub-query alone and say "What if the sub query (or
> any query) that Uses 'DISTINCT *' in a table that has a Unique covering
> index without a where clause or joins can be optimized" - which it probably
> can, but the odds of this are only very slightly less low than the first
> case.
>
> Add to that the code that needs to be added to check for all those things
> - it would probably make the query planner slower by a (very) insignificant
> amount, but more still than the odds of running into such a query+schema
> combination.
>
> That said, if you do have this situation sometimes and it does slow down
> things for you - Perhaps a design change in the schemata would help?
>
>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Bart Smissaert
OK, thanks.
Do you mean that is only valid if there are no rows where all columns are
NULL?
In any case, I can see that this optimization (in SQLite) is just not worth
the trouble.

RBS


On Fri, Oct 2, 2015 at 6:02 PM, Richard Hipp  wrote:

> On 10/2/15, Bart Smissaert  wrote:
> >
> > Unless I am mistaken here this is done with a SQL like this:
> >
> > SELECT COUNT(*) AS UNIQUE_ROWS
> > FROM (SELECT DISTINCT * FROM TABLE1)
> >
> > But if we take advantage of the fact that this table has a unique index
> on
> > all the fields of the table
> > we can simply do this SQL:
> >
> > SELECT COUNT(*) FROM TABLE1
> >
>
> No.  That is only valid if all columns are individually NOT NULL.
>
> --
> D. Richard Hipp
> drh at sqlite.org
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Simon Slavin

On 2 Oct 2015, at 4:41pm, Bart Smissaert  wrote:

> Say we have a table:
> 
> CREATE TABLE TABLE1
> ([FIELD1] INTEGER,
> [FIELD2] TEXT,
> [FIELD3] TEXT,
> [FIELD4] REAL)
> 
> and we have a unique index on all 4 fields:
> 
> CREATE UNIQUE INDEX IDX1
> ON TABLE1
> (FIELD1, FIELD2, FIELD3, FIELD4)

This would be rather unusual.  It implies that the combination of all the 
fields in the table has to be unique, but that it's okay to have two rows with 
the same values for FIELD1, FIELD2 and FIELD3.

As you (I think it was you, may have been someone else) mentioned upthread, the 
extreme rarity of this means that it's not worth checking for it in every use 
of COUNT(*).

The only common case I can think of where there's a UNIQUE requirement for all 
the columns in a table is where all the table is really just a primary key in 
itself. And in that situation the software would just scan the primary key 
index anyway, which would probably be just a rowid, and that is already the 
optimal case.

Simon.


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Bart Smissaert
> you're just throwing random terms around and hoping something sticks.

Not sure where you got that idea from, but let me explain better:

Say we have a table:

CREATE TABLE TABLE1
([FIELD1] INTEGER,
[FIELD2] TEXT,
[FIELD3] TEXT,
[FIELD4] REAL)

and we have a unique index on all 4 fields:

CREATE UNIQUE INDEX IDX1
ON TABLE1
(FIELD1, FIELD2, FIELD3, FIELD4)

Now I want to count all the unique rows of this table (without any limiting
where or join or whatever).

Unless I am mistaken here this is done with a SQL like this:

SELECT COUNT(*) AS UNIQUE_ROWS
FROM (SELECT DISTINCT * FROM TABLE1)

But if we take advantage of the fact that this table has a unique index on
all the fields of the table
we can simply do this SQL:

SELECT COUNT(*) FROM TABLE1

It will have the same result and that is a lot faster.
Hope this makes it clear.


RBS




On Fri, Oct 2, 2015 at 4:01 PM, Scott Hess  wrote:

> Why does any of that matter?  SELECT COUNT(*) FROM table; already knows all
> of that information.
>
> If you have a question about why one query is faster/slower than another
> query given one schema versus another schema, then post representative
> schema and queries.  Right now you're just throwing random terms around and
> hoping something sticks.  Earlier you said that you weren't talking about
> DISTINCT as applied to result sets, but just now you're re-introducing
> COUNT DISTINCT for some reason.  SQL code will make your question concrete.
>
> -scott
>
>
> On Fri, Oct 2, 2015 at 7:54 AM, Bart Smissaert 
> wrote:
>
> > It is faster because if it knows there is no where or join or whatever
> row
> > limiting condition and it also knows there is
> > a unique index on all fields it can simply do select count(rowid) from
> > table1 and not do any count distinct.
> >
> > RBS
> >
> >
> > On Fri, Oct 2, 2015 at 3:51 PM, Scott Hess  wrote:
> >
> > > On Fri, Oct 2, 2015 at 7:43 AM, Bart Smissaert <
> bart.smissaert at gmail.com
> > >
> > > wrote:
> > >
> > > > > The Uniqueness of the output depends on which fields are included,
> > > JOINs,
> > > > UNIONs, etc. etc.
> > > >
> > > > I am not talking about that situation. I am only referring to a
> > situation
> > > > where you want to count all
> > > > rows in a table. I know it will be uncommon to have an index on all
> > > fields
> > > > and this is not really a practical
> > > > question. I suppose as it so uncommon it is not worth it to put this
> > > > optimization in.
> > >
> > >
> > > Is your case of having the unique index on all fields faster than
> having
> > a
> > > unique index on a single field?
> > >
> > > Maybe you should include an example of your schema.  I can't think of
> how
> > > scanning an index on all fields could be smaller than the underlying
> > table,
> > > so it's unclear how that could be faster.  But a unique index on a
> subset
> > > of the data could be faster simply from being smaller.
> > >
> > > Also, 10x faster makes me wonder about whether you're accounting for
> > > caching effects.  A blunt way to test for that is to run your queries a
> > > couple times.  If the first time is slow and the second and later times
> > are
> > > much faster, then it's likely the cache is causing the speedup.
> > >
> > > -scott
> > > ___
> > > sqlite-users mailing list
> > > sqlite-users at mailinglists.sqlite.org
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > >
> > ___
> > sqlite-users mailing list
> > sqlite-users at mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread R.Smith


On 2015-10-02 10:05 AM, Bart Smissaert wrote:
> Noticed that if I have table with a unique index on all fields, counting
> all rows is still a lot faster
> (about a factor 10 on my particular test table) than counting distinct rows.
> Could maybe an optimization be added to SQLite to speed this up, taking
> advantage of the fact that there is a unique index on all fields?
> I am running SQLite 3.8.10.

I'm sure it can and I'm not answering (since the devs may do whatever 
they decide), but I would like to mention that a Unique index on all 
fields is:
A - More uncommon than Lotto-winners.
B - In no way related to a DISTINCT specifier.

The DISTINCT specifier checks that the results (output list) from a 
Query is Unique - There is no way to determine that from the Input 
table's uniqueness. In fact ALL tables with a Primary Key is by 
definition row-unique, but it has no bearing on the output of a query 
from it. The Uniqueness of the output depends on which fields are 
included, JOINs, UNIONs, etc. etc.

In other words: Considering the above points, I believe this is an 
optimization that would serve so small a possible set of use-cases as to 
not deserve the few bytes needed to implement it.



[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Bart Smissaert
It is faster because if it knows there is no where or join or whatever row
limiting condition and it also knows there is
a unique index on all fields it can simply do select count(rowid) from
table1 and not do any count distinct.

RBS


On Fri, Oct 2, 2015 at 3:51 PM, Scott Hess  wrote:

> On Fri, Oct 2, 2015 at 7:43 AM, Bart Smissaert 
> wrote:
>
> > > The Uniqueness of the output depends on which fields are included,
> JOINs,
> > UNIONs, etc. etc.
> >
> > I am not talking about that situation. I am only referring to a situation
> > where you want to count all
> > rows in a table. I know it will be uncommon to have an index on all
> fields
> > and this is not really a practical
> > question. I suppose as it so uncommon it is not worth it to put this
> > optimization in.
>
>
> Is your case of having the unique index on all fields faster than having a
> unique index on a single field?
>
> Maybe you should include an example of your schema.  I can't think of how
> scanning an index on all fields could be smaller than the underlying table,
> so it's unclear how that could be faster.  But a unique index on a subset
> of the data could be faster simply from being smaller.
>
> Also, 10x faster makes me wonder about whether you're accounting for
> caching effects.  A blunt way to test for that is to run your queries a
> couple times.  If the first time is slow and the second and later times are
> much faster, then it's likely the cache is causing the speedup.
>
> -scott
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Bart Smissaert
> The Uniqueness of the output depends on which fields are included, JOINs,
UNIONs, etc. etc.

I am not talking about that situation. I am only referring to a situation
where you want to count all
rows in a table. I know it will be uncommon to have an index on all fields
and this is not really a practical
question. I suppose as it so uncommon it is not worth it to put this
optimization in.

RBS

On Fri, Oct 2, 2015 at 3:02 PM, R.Smith  wrote:

>
>
> On 2015-10-02 10:05 AM, Bart Smissaert wrote:
>
>> Noticed that if I have table with a unique index on all fields, counting
>> all rows is still a lot faster
>> (about a factor 10 on my particular test table) than counting distinct
>> rows.
>> Could maybe an optimization be added to SQLite to speed this up, taking
>> advantage of the fact that there is a unique index on all fields?
>> I am running SQLite 3.8.10.
>>
>
> I'm sure it can and I'm not answering (since the devs may do whatever they
> decide), but I would like to mention that a Unique index on all fields is:
> A - More uncommon than Lotto-winners.
> B - In no way related to a DISTINCT specifier.
>
> The DISTINCT specifier checks that the results (output list) from a Query
> is Unique - There is no way to determine that from the Input table's
> uniqueness. In fact ALL tables with a Primary Key is by definition
> row-unique, but it has no bearing on the output of a query from it. The
> Uniqueness of the output depends on which fields are included, JOINs,
> UNIONs, etc. etc.
>
> In other words: Considering the above points, I believe this is an
> optimization that would serve so small a possible set of use-cases as to
> not deserve the few bytes needed to implement it.
>
>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Paul Sanderson
Counting all rows vs counting distinct rows is a very different task.
In simple terms

For all rows the process is
read every leaf page in the index
query the cell count field (number of records)
add cell count to the to the total count
repeat for the remaining leaf pages


For distinct records
read every leaf page in the index
read each cell on page and identify whether it has been seen before
add count of distinct cells to total count
repeat for the remaining leaf pages

a bit over simplified as there may be unique records that are on different pages

The thrust of this is that to count distinct records page and every
record in the tree must be read and examined, to count all records
just every page header in the tree needs to be read.

Paul
www.sandersonforensics.com
skype: r3scue193
twitter: @sandersonforens
Tel +44 (0)1326 572786
http://sandersonforensics.com/forum/content.php?195-SQLite-Forensic-Toolkit
-Forensic Toolkit for SQLite
email from a work address for a fully functional demo licence


On 2 October 2015 at 14:15, Simon Slavin  wrote:
>
> On 2 Oct 2015, at 9:05am, Bart Smissaert  wrote:
>
>> Noticed that if I have table with a unique index on all fields, counting
>> all rows is still a lot faster
>> (about a factor 10 on my particular test table) than counting distinct rows.
>> Could maybe an optimization be added to SQLite to speed this up, taking
>> advantage of the fact that there is a unique index on all fields?
>
> The fact that your index is so wide is actually slowing it down.  Because it 
> means that the index takes up more space on disk.  The fastest way to count 
> all the rows in a table would be to have an index on just a single numeric 
> field.  This means that reading the whole index in would involve reading the 
> least number of pages.
>
> Simon.
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Simon Slavin

On 2 Oct 2015, at 9:05am, Bart Smissaert  wrote:

> Noticed that if I have table with a unique index on all fields, counting
> all rows is still a lot faster
> (about a factor 10 on my particular test table) than counting distinct rows.
> Could maybe an optimization be added to SQLite to speed this up, taking
> advantage of the fact that there is a unique index on all fields?

The fact that your index is so wide is actually slowing it down.  Because it 
means that the index takes up more space on disk.  The fastest way to count all 
the rows in a table would be to have an index on just a single numeric field.  
This means that reading the whole index in would involve reading the least 
number of pages.

Simon.


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Richard Hipp
On 10/2/15, Bart Smissaert  wrote:
> Do you mean that is only valid if there are no rows where all columns are
> NULL?

No, I mean that two rows can be identical (not distinct) event if
there is a unique index on all columns, if any single column in those
two rows is NULL.  Example:

SQLite version 3.8.11 2015-07-15 23:15:59
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create table t1(a,b,c);
sqlite> create unique index t1u on t1(a,b,c);
sqlite> insert into t1(a,b,c) values(1,null,3),(1,null,3),(1,null,3);
sqlite>

-- 
D. Richard Hipp
drh at sqlite.org


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Richard Hipp
On 10/2/15, Bart Smissaert  wrote:
>
> Unless I am mistaken here this is done with a SQL like this:
>
> SELECT COUNT(*) AS UNIQUE_ROWS
> FROM (SELECT DISTINCT * FROM TABLE1)
>
> But if we take advantage of the fact that this table has a unique index on
> all the fields of the table
> we can simply do this SQL:
>
> SELECT COUNT(*) FROM TABLE1
>

No.  That is only valid if all columns are individually NOT NULL.

-- 
D. Richard Hipp
drh at sqlite.org


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Scott Robison
On Fri, Oct 2, 2015 at 9:41 AM, Bart Smissaert 
wrote:

> > you're just throwing random terms around and hoping something sticks.
>
> Not sure where you got that idea from, but let me explain better:
>
> Say we have a table:
>
> CREATE TABLE TABLE1
> ([FIELD1] INTEGER,
> [FIELD2] TEXT,
> [FIELD3] TEXT,
> [FIELD4] REAL)
>
> and we have a unique index on all 4 fields:
>
> CREATE UNIQUE INDEX IDX1
> ON TABLE1
> (FIELD1, FIELD2, FIELD3, FIELD4)
>
> Now I want to count all the unique rows of this table (without any limiting
> where or join or whatever).
>
> Unless I am mistaken here this is done with a SQL like this:
>
> SELECT COUNT(*) AS UNIQUE_ROWS
> FROM (SELECT DISTINCT * FROM TABLE1)
>
> But if we take advantage of the fact that this table has a unique index on
> all the fields of the table
> we can simply do this SQL:
>
> SELECT COUNT(*) FROM TABLE1
>
> It will have the same result and that is a lot faster.
> Hope this makes it clear.
>

I just ran a quick test to see how this might work. Given your schema above:

sqlite> create table table1(field1 integer, field2 text, field3 text,
field4 real);
sqlite> create unique index idx1 on table1(field1, field2, field3, field4);

I populated the table with this query:

sqlite> with recursive src(f1, f2, f3, f4) as (select 0, null, null, null
union all select f1+1, hex(randomblob(16)), hex(randomblob(16)), f1/47.0
from src limit 501) insert into table1 select * from src where f1 > 0;

So I have 5 million guaranteed unique rows, and an index to insure that is
adhered to. After warming the cache, I ran the following queries with
.timer on:

sqlite> select count(*) from (select distinct * from table1);
500
Run Time: real 3.682 user 2.886019 sys 0.795605
sqlite> select count(*) from (select distinct * from table1);
500
Run Time: real 3.699 user 2.714417 sys 0.951606
sqlite> select count(*) from (select distinct * from table1);
500
Run Time: real 3.684 user 2.886019 sys 0.811205

sqlite> select count(*) from table1 where field1 > 0;
500
Run Time: real 1.344 user 0.436803 sys 0.904806
sqlite> select count(*) from table1 where field1 > 0;
500
Run Time: real 1.363 user 0.546003 sys 0.811205
sqlite> select count(*) from table1 where field1 > 0;
500
Run Time: real 1.357 user 0.592804 sys 0.764405

sqlite> select count(*) from table1;
500
Run Time: real 1.062 user 0.234002 sys 0.811205
sqlite> select count(*) from table1;
500
Run Time: real 1.061 user 0.312002 sys 0.748805
sqlite> select count(*) from table1;
500
Run Time: real 1.061 user 0.234002 sys 0.811205

Note the first query (using your distinct example, which I understand was
just for illustrative purposes) takes about 3.6 seconds to execute.

The second query, which I included just to show the effects of running
count with a condition (in this case including every row) had to do a lot
less work (taking advantage of the fact that all the rows were already
unique) and took about 1.3 seconds.

Finally, the third query just uses the optimized count without where
clause, and it only takes about 1 second.

If I'm misunderstanding something, please let me know. Given that row
counts aren't stored, reading the database or the index should read about
the same amount of data for this type of configuration.

-- 
Scott Robison


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Scott Robison
On Fri, Oct 2, 2015 at 9:41 AM, Bart Smissaert 
wrote:

> > you're just throwing random terms around and hoping something sticks.
>
> Not sure where you got that idea from, but let me explain better:
>
> Say we have a table:
>
> CREATE TABLE TABLE1
> ([FIELD1] INTEGER,
> [FIELD2] TEXT,
> [FIELD3] TEXT,
> [FIELD4] REAL)
>
> and we have a unique index on all 4 fields:
>
> CREATE UNIQUE INDEX IDX1
> ON TABLE1
> (FIELD1, FIELD2, FIELD3, FIELD4)
>
> Now I want to count all the unique rows of this table (without any limiting
> where or join or whatever).
>
> Unless I am mistaken here this is done with a SQL like this:
>
> SELECT COUNT(*) AS UNIQUE_ROWS
> FROM (SELECT DISTINCT * FROM TABLE1)
>
> But if we take advantage of the fact that this table has a unique index on
> all the fields of the table
> we can simply do this SQL:
>
> SELECT COUNT(*) FROM TABLE1
>
> It will have the same result and that is a lot faster.
> Hope this makes it clear.
>

I think you're already going to be getting the "fastest" way with your
structure.

You have a row wide unique index that ensures two duplicate rows can't be
inserted into the table.

Each row of that index is going to be essentially (if not exactly) the same
size as each row of the table, since they include all the same columns.
There is no real speed difference between reading each row of the index vs
each row of the table.

Additionally, SQLite already has an optimization in place for the query
"SELECT COUNT(*) FROM sometable". It still has to walk the entire B-tree,
but it only has to count how many rows are in each page, it doesn't have to
actually decode each row to check conditions.

Now, I am not an expert on SQLite, but based on my understanding of things,
SQLite is going as about fast as it can (without changes to the file
format), and the relative size of the table vs the index means there isn't
really any size difference to exploit.

-- 
Scott Robison


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Scott Hess
On Fri, Oct 2, 2015 at 8:41 AM, Bart Smissaert 
wrote:

> > you're just throwing random terms around and hoping something sticks.
>
> Not sure where you got that idea from, but let me explain better:
>

AFAICT this is the first posting where you said "I want to count all the
unique rows of this table".  Previously, you said that you wanted to count
all the rows of the table (without the qualifier "unique") and you also
said this was faster if you had a unique index.  That didn't make sense,
because the only case where an index should make that faster is if the
index narrows the table rather than containing the table.

I agree, if this is what you want:


> SELECT COUNT(*) AS UNIQUE_ROWS
> FROM (SELECT DISTINCT * FROM TABLE1)
>

Then it could in some cases be reduced to just counting the rows in the
table:

SELECT COUNT(*) FROM TABLE1
>

and that would be faster.  But the latter statement would not itself become
faster with a schema change.

This is why I suggested posting schema and query examples.  Then we're
talking about the same thing.  This could have been a two- or three-post
thread.

-scott


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Bart Smissaert
Noticed that if I have table with a unique index on all fields, counting
all rows is still a lot faster
(about a factor 10 on my particular test table) than counting distinct rows.
Could maybe an optimization be added to SQLite to speed this up, taking
advantage of the fact that there is a unique index on all fields?
I am running SQLite 3.8.10.

RBS


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Scott Hess
Why does any of that matter?  SELECT COUNT(*) FROM table; already knows all
of that information.

If you have a question about why one query is faster/slower than another
query given one schema versus another schema, then post representative
schema and queries.  Right now you're just throwing random terms around and
hoping something sticks.  Earlier you said that you weren't talking about
DISTINCT as applied to result sets, but just now you're re-introducing
COUNT DISTINCT for some reason.  SQL code will make your question concrete.

-scott


On Fri, Oct 2, 2015 at 7:54 AM, Bart Smissaert 
wrote:

> It is faster because if it knows there is no where or join or whatever row
> limiting condition and it also knows there is
> a unique index on all fields it can simply do select count(rowid) from
> table1 and not do any count distinct.
>
> RBS
>
>
> On Fri, Oct 2, 2015 at 3:51 PM, Scott Hess  wrote:
>
> > On Fri, Oct 2, 2015 at 7:43 AM, Bart Smissaert  >
> > wrote:
> >
> > > > The Uniqueness of the output depends on which fields are included,
> > JOINs,
> > > UNIONs, etc. etc.
> > >
> > > I am not talking about that situation. I am only referring to a
> situation
> > > where you want to count all
> > > rows in a table. I know it will be uncommon to have an index on all
> > fields
> > > and this is not really a practical
> > > question. I suppose as it so uncommon it is not worth it to put this
> > > optimization in.
> >
> >
> > Is your case of having the unique index on all fields faster than having
> a
> > unique index on a single field?
> >
> > Maybe you should include an example of your schema.  I can't think of how
> > scanning an index on all fields could be smaller than the underlying
> table,
> > so it's unclear how that could be faster.  But a unique index on a subset
> > of the data could be faster simply from being smaller.
> >
> > Also, 10x faster makes me wonder about whether you're accounting for
> > caching effects.  A blunt way to test for that is to run your queries a
> > couple times.  If the first time is slow and the second and later times
> are
> > much faster, then it's likely the cache is causing the speedup.
> >
> > -scott
> > ___
> > sqlite-users mailing list
> > sqlite-users at mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Speed of count distinct rows if unique index on all fields

2015-10-02 Thread Scott Hess
On Fri, Oct 2, 2015 at 7:43 AM, Bart Smissaert 
wrote:

> > The Uniqueness of the output depends on which fields are included, JOINs,
> UNIONs, etc. etc.
>
> I am not talking about that situation. I am only referring to a situation
> where you want to count all
> rows in a table. I know it will be uncommon to have an index on all fields
> and this is not really a practical
> question. I suppose as it so uncommon it is not worth it to put this
> optimization in.


Is your case of having the unique index on all fields faster than having a
unique index on a single field?

Maybe you should include an example of your schema.  I can't think of how
scanning an index on all fields could be smaller than the underlying table,
so it's unclear how that could be faster.  But a unique index on a subset
of the data could be faster simply from being smaller.

Also, 10x faster makes me wonder about whether you're accounting for
caching effects.  A blunt way to test for that is to run your queries a
couple times.  If the first time is slow and the second and later times are
much faster, then it's likely the cache is causing the speedup.

-scott