[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-27 Thread Simon Slavin

On 26 Mar 2015, at 11:52pm, James K. Lowden  wrote:

> I guess I was unclear.  I was asking about SQLite's implementation,
> specifically whether an IN clause is always sorted using an ephemeral
> table, or if that complexity is subject to some minimum threshold.  

The answer to this is undocumented.  Therefore even if I could answer for the 
current version of SQLite, a future update may change the answer.

Simon.


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-26 Thread James K. Lowden
On Wed, 25 Mar 2015 10:02:24 +
Simon Slavin  wrote:

> 
> On 25 Mar 2015, at 3:28am, James K. Lowden 
> wrote:
> 
> > Is there some lower bound on either the size of the IN list or the
> > number of rows in the table being queried?
> 
> There's nothing in the language to stop you from executing "... WHERE
> c IN (12) ..." on a zero-row table.  It won't be efficient, but it
> will give the correct result.

I guess I was unclear.  I was asking about SQLite's implementation,
specifically whether an IN clause is always sorted using an ephemeral
table, or if that complexity is subject to some minimum threshold.  

--jkl


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-25 Thread Simon Slavin

On 25 Mar 2015, at 3:28am, James K. Lowden  wrote:

> Is there some lower bound on either the size of the IN list or the
> number of rows in the table being queried?

There's nothing in the language to stop you from executing "... WHERE c IN (12) 
..." on a zero-row table.  It won't be efficient, but it will give the correct 
result.

Simon.


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-25 Thread James K. Lowden
On Mon, 23 Mar 2015 06:43:39 +
Hick Gunter  wrote:

> SQLite creates an ephemeral table for the IN list, giving O(log n)
> performance for lookups.

Thank you for pointing that out, Hick.  Good to know!  

Is there some lower bound on either the size of the IN list or the
number of rows in the table being queried?  I ask because a linear scan
of ~10 elements in an IN list won't benefit much, if at all, from a binary
search.  Similarly, if the table had only, say, 6 rows and the IN list
was 100 long, we could imagine it would be faster to scan the table for
each list element -- O(nm), worst case -- than to sort the IN list first
and then scan the table with binary lookups, O((n + m) log n).  
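
(Back-of-the-envelope, base-2 logs, constants ignored: scanning the 6-row table
once per list element is 6 x 100 = 600 comparisons, while merely sorting the
100-element list is already about 100 x log2(100), roughly 660 comparisons,
before any lookups happen -- so the naive scan can come out ahead.)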

--jkl

> >-----Original Message-----
> >From: James K. Lowden [mailto:jklowden at schemamania.org]
> >Sent: Saturday, 21 March 2015 20:43
> >To: sqlite-users at mailinglists.sqlite.org
> >Subject: Re: [sqlite] Query times vary between 0.2 s and 30 s for very
> >similar queries - how to trick the optimizer?
> >
> >On Sat, 21 Mar 2015 19:01:16 +0100
> >"Mario M. Westphal"  wrote:
> >
> >> For now I have increased the threshold for IN clauses (instead of
> >> TEMP tables) and use WHERE IN (SELECT ? from TEMP) instead of a
> >> JOIN.
> >
> >...
> >There is also the question of linear vs binary searches.  When you
> >supply a list of constants to IN, most if not all DBMSs search the
> >list sequentially.  When IN (or EXISTS) is supplied from an indexed
> >column, the search is often binary.  For a small number of elements,
> >there's no distinction.  For 1000 elements, it's 2 orders of
> >magnitude: 1000 hops versus 10.
> >
> >--jkl


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-24 Thread Mario M. Westphal
The temporary table is created as



CREATE TEMPORARY TABLE _tempNNN (oid INTEGER PRIMARY KEY)



So the optimizer must know that the values are unique.
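
A quick way to see whether the planner actually drives the join off that table or
falls back to scanning art_data (a sketch using the schema from my original post;
_temp1 here just stands in for the real _tempNNN):

CREATE TEMPORARY TABLE _temp1 (oid INTEGER PRIMARY KEY);
INSERT INTO _temp1 VALUES (1),(2),(3);
EXPLAIN QUERY PLAN
SELECT d.oid, d.attr_oid, d.tdata
FROM art_data d INNER JOIN _temp1 _t ON d.oid = _t.oid;
-- the fast shape probes art_data once per _temp1 row (a SEARCH on art_data);
-- the slow shape scans all of art_data and looks each row up in _temp1.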







[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-23 Thread Hick Gunter
SQLite creates an ephemeral table for the IN list, giving O(log n) performance 
for lookups.
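
You can see that table from the outside by looking at the bytecode with plain
EXPLAIN (a sketch; the exact opcodes and the cutoff for very short lists vary
between SQLite versions):

EXPLAIN SELECT oid FROM art_data WHERE tdata IN ('a','b','c','d','e');
-- for a list of this size on an unindexed column, the listing should contain an
-- OpenEphemeral opcode: the transient index that is built once and then probed
-- (a b-tree lookup, hence the O(log n)) for each membership test.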

>-----Original Message-----
>From: James K. Lowden [mailto:jklowden at schemamania.org]
>Sent: Saturday, 21 March 2015 20:43
>To: sqlite-users at mailinglists.sqlite.org
>Subject: Re: [sqlite] Query times vary between 0.2 s and 30 s for very similar
>queries - how to trick the optimizer?
>
>On Sat, 21 Mar 2015 19:01:16 +0100
>"Mario M. Westphal"  wrote:
>
>> For now I have increased the threshold for IN clauses (instead of TEMP
>> tables) and use WHERE IN (SELECT ? from TEMP) instead of a JOIN.
>
>...
>There is also the question of linear vs binary searches.  When you supply a 
>list of constants to IN, most if not all DBMSs search the list sequentially.  
>When IN (or EXISTS) is supplied from an indexed column, the search is often 
>binary.  For a small number of elements, there's no distinction.  For 1000 
>elements, it's 2 orders of magnitude: 1000 hops versus 10.
>
>--jkl


___
 Gunter Hick
Software Engineer
Scientific Games International GmbH
FN 157284 a, HG Wien
Klitschgasse 2-4, A-1130 Vienna, Austria
Tel: +43 1 80100 0
E-Mail: hick at scigames.at


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-22 Thread James K. Lowden
On Sat, 21 Mar 2015 14:36:47 -0600
Scott Robison  wrote:

> On Mar 21, 2015 1:43 PM, "James K. Lowden" 
> wrote:
> >
> > The query optimizer has to be sophisticated enough to recognize
> > those conditions, which is unlikely in the case of a temporary
> > table.
> 
> Are temporary tables really that different? Other than being dropped
> automatically at the end of a session, I was under the impression
> they were just another table. 

They're different to the extent ANALYZE matters.  I'm assuming ANALYZE
wouldn't be run between the time the temporary table is created and
when it's used.  

--jkl


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-22 Thread Scott Robison
On Sun, Mar 22, 2015 at 2:47 PM, James K. Lowden 
wrote:

> On Sat, 21 Mar 2015 14:36:47 -0600
> Scott Robison  wrote:
>
> > >
> > > The query optimizer has to be sophisticated enough to recognize
> > > those conditions, which is unlikely in the case of a temporary
> > > table.
> >
> > Are temporary tables really that different? Other than being dropped
> > automatically at the end of a session, I was under the impression
> > they were just another table.
>
> They're different to the extent ANALYZE matters.  I'm assuming ANALYZE
> wouldn't be run between the time the temporary table is created and
> when it's used.
>

Good point. Thanks.

-- 
Scott Robison


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-21 Thread Mario M. Westphal
I don't think I can always run an ANALYZE on the TEMP table for each query. It may 
ruin performance worse than just trying my luck with the temp table.



I think this boils down to why a JOIN with a 500-row table (one column) is so much 
(several hundred times) slower than using an IN clause with 500 values. 

I have not changed the code or query for a long time, so my assumption is that 
a change in one of the recent SQLite updates caused the changed behavior of the 
optimizer/query engine. I recall a similar problem maybe a year ago. Probably 
only one of the SQLite developers can shed some light on this.



For now I have increased the threshold for IN clauses (instead of TEMP tables) 
and use WHERE IN (SELECT ? from TEMP) instead of a JOIN.







[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-21 Thread Keith Medcalf
>I don't think I can always run an analyze on the TEMP table for each
>query. May ruin performance worse than trying my luck with the temp
>table.

>I think this boils down why a JOIN with a 500 row table (one column) is
>so much (several hundred times) slower than using an IN clause with 500
>values.

>I have not changed the code or query for a long time, so my assumption is
>that a change in one of the recent SQLite updates caused the changed
>behavior of the optimizer/query engine. I recall a similar problem maybe
>a year ago. I think only one of the SQLite developers may shed some light
>on this.

>For now I have increased the threshold for IN clauses (instead of TEMP
>tables) and use WHERE IN (SELECT ? from TEMP) instead of a JOIN.

Of course, when you use IN (...) the optimizer KNOWS that the RHS is unique -- 
whether you are using IN (...list of values...) or IN (SELECT x FROM y).

However, if you are using a temp table and using a JOIN, the optimizer only 
knows that the RHS join predicate is unique if you tell it that is the case.  
So for example, if you use:

create table temp_table (x integer);
insert into temp_table values (1),(2),(3) ...

then this query:

select ... from b join temp_table on b.col = temp_table.x;

is significantly different from:

select ... from b where col in (select x from temp_table);

but the following are equivalent to the latter, and to using the list of values 
directly in the IN clause:

create table temp_table (x integer not null primary key);
insert into temp_table values (1),(2),(3) ...

select ... from b join temp_table where b.col = temp_table.x;

which is equivalent to:

select ... from b where exists (select 1 from temp_table where x = b.col);

and

select ... from b where col in (select x from temp_table);

and

select ... from b where col in temp_table;
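
A quick way to see the difference is to compare the plans for the keyed and
un-keyed definitions (a sketch with throw-away names and data; the exact plan
wording depends on your SQLite version):

create table tt_plain (x integer);
create table tt_keyed (x integer not null primary key);
create table b (col integer);
create index b_col on b(col);
explain query plan select b.col from b join tt_plain on b.col = tt_plain.x;
explain query plan select b.col from b join tt_keyed on b.col = tt_keyed.x;
explain query plan select b.col from b where b.col in (select x from tt_plain);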

I don't recall ever seeing your definition of your temp_table ...

---
Theory is when you know everything but nothing works.  Practice is when 
everything works but no one knows why.  Sometimes theory and practice are 
combined:  nothing works and no one knows why.







[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-21 Thread Scott Robison
On Mar 21, 2015 1:43 PM, "James K. Lowden"  wrote:
>
> The query optimizer has to be sophisticated enough to recognize
> those conditions, which is unlikely in the case of a temporary table.

Are temporary tables really that different? Other than being dropped
automatically at the end of a session, I was under the impression they were
just another table. Proper indexing would be key to performance of course,
as you indicated.

Also, is there a limit on the size of IN? By which I mean, there is a limit
on the size of the query text, but if you use "colx in (select coly from
taby)" the table size should not have an arbitrary limit. As long as it is
indexed properly, it should be performant.

SDR


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-21 Thread James K. Lowden
On Sat, 21 Mar 2015 19:01:16 +0100
"Mario M. Westphal"  wrote:

> For now I have increased the threshold for IN clauses (instead of
> TEMP tables) and use WHERE IN (SELECT ? from TEMP) instead of a JOIN.

Because the two are conceptually different, it's not surprising they
run differently.  IN is an existence test: does the value appear in the
set?  JOIN is a product: produce every matching combination.  

For some queries, they reduce to the same thing.  For example, 

FROM A JOIN B ON A.b = B.b

is the same as IN if B.b 

1.  is unique, and 
2.  no other columns from B are used.  

The query optimizer has to be sophisticated enough to recognize
those conditions, which is unlikely in the case of a temporary table.  
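
A two-row illustration of why condition 1 matters (made-up tables):

create table A (b integer);
create table B (b integer);
insert into A values (1);
insert into B values (1), (1);                      -- B.b is not unique
select A.b from A join B on A.b = B.b;              -- two rows: the join is a product
select A.b from A where A.b in (select b from B);   -- one row: IN is just an existence test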

There is also the question of linear vs binary searches.  When you
supply a list of constants to IN, most if not all DBMSs search the list
sequentially.  When IN (or EXISTS) is supplied from an indexed column,
the search is often binary.  For a small number of elements, there's no
distinction.  For 1000 elements, it's 2 orders of magnitude: 1000 hops
versus 10.  

--jkl


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-21 Thread Eduardo Morras
On Wed, 18 Mar 2015 14:43:26 +0100
"Mario M. Westphal"  wrote:

> I'm using 3.8.8.1 on Windows via the 'C' interface.
> 
> I work with SQLite for several years with good success. And I know
> that no optimizer ever will be perfect. But I've just stumbled upon a
> case where a very similar query requires between 0.2 seconds and a
> whopping 30 seconds.
> 
> I've simplified things as much as possible and included the create
> instructions and queries below.

<...snip...>

> The question is: When JOINing large tables with a temporary table,
> how to ensure that the optimizer can work optimal? Running ANALYZE
> with a temporary table probably does not work, and ANALYZE takes
> about 1 minute on this database so this is not feasible for each
> query.
> 
> I'm glad to have found an apparently working solution (IN instead of
> JOIN) but I wonder if this could be somehow automated by the
> optimizer? Or maybe this is a worst-case for the optimizer?
> 

You can run ANALYZE on any table, try:

ANALYZE temp_table;
SELECT 

You can also stress-test IN to find out where its limit is. I 
think/suppose/suspect that in this case there is no win in using a temporary 
table instead of an IN; the temp table should only be faster when it has more 
columns used in the where clause or an additional join restriction.
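
You can also check that statistics were really recorded for the temp table (a
sketch; sqlite_stat1 only exists once ANALYZE has run, and a table with no
secondary indexes may only get a row-count entry with a NULL idx column):

ANALYZE temp_table;
SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 'temp_table';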

---   ---
Eduardo Morras 


[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?

2015-03-18 Thread Mario M. Westphal
I'm using 3.8.8.1 on Windows via the 'C' interface.

I have worked with SQLite for several years with good success, and I know that no 
optimizer will ever be perfect.
But I've just stumbled upon a case where very similar queries take anywhere between 
0.2 seconds and a whopping 30 seconds.

I've simplified things as much as possible and included the create instructions 
and queries below.

1.  My database has a table "art_data" which holds information about an 
article. A second table "attr" holds all available attributes.
art_data may contain any number of attributes per article, typically between 50 
and 200.
The art_data table in the production database has about 22 million rows, the 
attr table 20,000.

art_data
oid INTEGER PRIMARY KEY,
art_oid INTEGER,
attr_oid INTEGER,
tdata TEXT

The attr_oid refers to the attr table, which defines the available attributes, and 
tdata is the value for that attribute. The attr table is defined as:

attr
oid INTEGER PRIMARY KEY
class INTEGER
tag TEXT

For the indices created, please see below.

2.  My application needs to select all the data for a specific article and 
specific attributes. The number of attributes to select is usually between 1 
and 30 so I use an IN clause and provide the attribute ids directly in the 
SELECT query. My application has these 20,000 ids cached and always available.
If the number of articles is < 500, my application supplies a list of articles 
for the d.oid IN (...) as well.

SELECT d.oid, d.tdata FROM art_data d
WHERE d.oid IN (1,890,222,...)  
AND d.attr_oid IN (2188,2191,2251,2272,...) 
ORDER BY d.oid ASC, d.attr_oid ASC, d.rowid ASC

This query takes between 0.2 and 0.5 seconds, even if 500 article numbers are 
in the first WHERE d.oid IN clause!

If more than 500 articles are needed, I estimated that this would probably 
break the IN (is there a limit for IN?) and thus my application puts the 
article numbers into a temporary table and JOINs with this table:

SELECT d.oid, d.attr_oid, d.tdata FROM art_data d 
INNER JOIN _temp _t ON d.oid = _t.oid  
AND d.attr_oid IN (2188,2191,2251,2272,...)
ORDER BY d.oid ASC, d.attr_oid ASC, d.rowid ASC

This query takes between 17 and 30 seconds (!) even if the temporary table only 
has 501 article numbers (one more than the threshold for the IN clause).

The only difference between 0.2 and 17 seconds is replacing an IN clause with 
500 numbers with a JOIN with a temporary table containing 501 numbers.

While playing with that, I used a SQLite GUI tool and created the temporary 
table _temp (oid INTEGER PRIMARY KEY) as a regular table and filled it with 500 
article numbers. This also resulted in the 17 to 30s query times.

For a test, I ran ANALYZE and the query time dropped down to 0.5 seconds. 
AMAZING.
Apparently the query analyzer now had the info about the (no longer) temporary 
table and was able to use it efficiently.

My SOLUTION for now was to change the query with the temporary table to

SELECT d.oid, d.attr_oid, d.tdata FROM art_data d 
WHERE d.attr_oid IN (2188,2191,2251,2272,...) 
AND d.oid IN (SELECT oid FROM _temp)

Instead of a JOIN for the temporary table I use an IN clause with a SELECT. 
This brought the query time down to 0.5 seconds as well. This, too, may be down 
to the optimizer.

The question is: when JOINing large tables with a temporary table, how do I 
ensure that the optimizer can work optimally? Running ANALYZE with a temporary 
table probably does not work, and ANALYZE takes about 1 minute on this database 
so this is not feasible for each query.
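
One option I have not tried yet: ANALYZE also accepts a single table name, so
analyzing only the freshly filled temp table should touch just its ~500 rows
rather than the whole 22-million-row database:

ANALYZE _temp;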

I'm glad to have found an apparently working solution (IN instead of JOIN) but 
I wonder if this could be somehow automated by the optimizer? Or maybe this is 
a worst-case for the optimizer?



If you want to try this out yourself, here is the complete CREATE schema and 
queries:

-- BEGIN -
DROP TABLE IF EXISTS art_data; 
DROP TABLE IF EXISTS attr;

CREATE TABLE art_data (oid INTEGER, attr_oid INTEGER, tdata TEXT, FOREIGN 
KEY(attr_oid) REFERENCES attr(oid) ON DELETE CASCADE);

CREATE INDEX idx_art_data_oid ON art_data(oid);
CREATE INDEX idx_art_data_oid_tag_oid ON art_data(oid,attr_oid);
CREATE INDEX idx_art_data_attr_oid ON art_data(attr_oid);

CREATE TABLE attr (oid INTEGER PRIMARY KEY, class INTEGER, tag TEXT);
CREATE INDEX idx_attr_tag ON attr(tag);

DROP TABLE IF EXISTS _temp;
CREATE TABLE _temp (OID INTEGER PRIMARY KEY);
--insert into _temp select ...

-- Fast: 0.2 seconds
-- explain query plan
SELECT d.oid, d.tdata FROM art_data d
-- Only for specific articles 
WHERE d.oid IN