[sqlite] Speed of count distinct rows if unique index on all fields

Keith Medcalf Fri, 02 Oct 2015 18:34:53 -0600

You are making an error.

sqlite> create table x(a, b, c, d);
sqlite> create unique index y on x(a,b,c,d);
sqlite> insert into x values(1,2,3,null);
sqlite> insert into x values(1,2,3,null);
sqlite> insert into x values(1,2,3,null);
sqlite> select * from x;
1|2|3|
1|2|3|
1|2|3|
sqlite> select distinct * from x;
1|2|3|
sqlite>


As you can see, the number of "DISTINCT" rows does not equal the number of 
"unique" rows in the index.

Using DISTINCT requires visitating of all rows and counting the DISTINCT 
entries, whereas merely counting rows does not require visiting all the rows.

In other words, where columns are nullable DISTINCT(*) is not the same as 
UNIQUE(*).

> -----Original Message-----
> From: sqlite-users-bounces at mailinglists.sqlite.org [mailto:sqlite-users-
> bounces at mailinglists.sqlite.org] On Behalf Of Bart Smissaert
> Sent: Friday, 2 October, 2015 09:42
> To: General Discussion of SQLite Database
> Subject: Re: [sqlite] Speed of count distinct rows if unique index on all
> fields
> 
> > you're just throwing random terms around and hoping something sticks.
> 
> Not sure where you got that idea from, but let me explain better:
> 
> Say we have a table:
> 
> CREATE TABLE TABLE1
> ([FIELD1] INTEGER,
> [FIELD2] TEXT,
> [FIELD3] TEXT,
> [FIELD4] REAL)
> 
> and we have a unique index on all 4 fields:
> 
> CREATE UNIQUE INDEX IDX1
> ON TABLE1
> (FIELD1, FIELD2, FIELD3, FIELD4)
> 
> Now I want to count all the unique rows of this table (without any
> limiting
> where or join or whatever).
> 
> Unless I am mistaken here this is done with a SQL like this:
> 
> SELECT COUNT(*) AS UNIQUE_ROWS
> FROM (SELECT DISTINCT * FROM TABLE1)
> 
> But if we take advantage of the fact that this table has a unique index on
> all the fields of the table
> we can simply do this SQL:
> 
> SELECT COUNT(*) FROM TABLE1
> 
> It will have the same result and that is a lot faster.
> Hope this makes it clear.
> 
> 
> RBS
> 
> 
> 
> 
> On Fri, Oct 2, 2015 at 4:01 PM, Scott Hess <shess at google.com> wrote:
> 
> > Why does any of that matter?  SELECT COUNT(*) FROM table; already knows
> all
> > of that information.
> >
> > If you have a question about why one query is faster/slower than another
> > query given one schema versus another schema, then post representative
> > schema and queries.  Right now you're just throwing random terms around
> and
> > hoping something sticks.  Earlier you said that you weren't talking
> about
> > DISTINCT as applied to result sets, but just now you're re-introducing
> > COUNT DISTINCT for some reason.  SQL code will make your question
> concrete.
> >
> > -scott
> >
> >
> > On Fri, Oct 2, 2015 at 7:54 AM, Bart Smissaert
> <bart.smissaert at gmail.com>
> > wrote:
> >
> > > It is faster because if it knows there is no where or join or whatever
> > row
> > > limiting condition and it also knows there is
> > > a unique index on all fields it can simply do select count(rowid) from
> > > table1 and not do any count distinct.
> > >
> > > RBS
> > >
> > >
> > > On Fri, Oct 2, 2015 at 3:51 PM, Scott Hess <shess at google.com> wrote:
> > >
> > > > On Fri, Oct 2, 2015 at 7:43 AM, Bart Smissaert <
> > bart.smissaert at gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > > The Uniqueness of the output depends on which fields are
> included,
> > > > JOINs,
> > > > > UNIONs, etc. etc.
> > > > >
> > > > > I am not talking about that situation. I am only referring to a
> > > situation
> > > > > where you want to count all
> > > > > rows in a table. I know it will be uncommon to have an index on
> all
> > > > fields
> > > > > and this is not really a practical
> > > > > question. I suppose as it so uncommon it is not worth it to put
> this
> > > > > optimization in.
> > > >
> > > >
> > > > Is your case of having the unique index on all fields faster than
> > having
> > > a
> > > > unique index on a single field?
> > > >
> > > > Maybe you should include an example of your schema.  I can't think
> of
> > how
> > > > scanning an index on all fields could be smaller than the underlying
> > > table,
> > > > so it's unclear how that could be faster.  But a unique index on a
> > subset
> > > > of the data could be faster simply from being smaller.
> > > >
> > > > Also, 10x faster makes me wonder about whether you're accounting for
> > > > caching effects.  A blunt way to test for that is to run your
> queries a
> > > > couple times.  If the first time is slow and the second and later
> times
> > > are
> > > > much faster, then it's likely the cache is causing the speedup.
> > > >
> > > > -scott
> > > > _______________________________________________
> > > > sqlite-users mailing list
> > > > sqlite-users at mailinglists.sqlite.org
> > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > >
> > > _______________________________________________
> > > sqlite-users mailing list
> > > sqlite-users at mailinglists.sqlite.org
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > >
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users at mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Speed of count distinct rows if unique index on all fields

Reply via email to