[sqlite] Query times vary between 0.2 s and 30 s for very
On 26 Mar 2015, at 11:52pm, James K. Lowden wrote: > I guess I was unclear. I was asking about SQLite's implementation, > specifically whether an IN clause is always sorted using an ephemeral > table, or if that complexity is subject to some minimum threshhold. The answer to this is undocumented. Therefore even if I could answer for the current version of SQLite, a future update may change the answer. Simon.
[sqlite] Query times vary between 0.2 s and 30 s for very
On Wed, 25 Mar 2015 10:02:24 + Simon Slavin wrote: > > On 25 Mar 2015, at 3:28am, James K. Lowden > wrote: > > > Is there some lower bound on either the size of the IN list or the > > number of rows in the table being queried? > > There's nothing in the language to stop you from executing "... WHERE > c IN (12) ..." on a zero-row table. It won't be efficient, but it > will give the correct result. I guess I was unclear. I was asking about SQLite's implementation, specifically whether an IN clause is always sorted using an ephemeral table, or if that complexity is subject to some minimum threshhold. --jkl
[sqlite] Query times vary between 0.2 s and 30 s for very
On 25 Mar 2015, at 3:28am, James K. Lowden wrote: > Is there some lower bound on either the size of the IN list or the > number of rows in the table being queried? There's nothing in the language to stop you from executing "... WHERE c IN (12) ..." on a zero-row table. It won't be efficient, but it will give the correct result. Simon.
[sqlite] Query times vary between 0.2 s and 30 s for very
On Mon, 23 Mar 2015 06:43:39 + Hick Gunter wrote: > SQLite creates an ephemeral table for the IN list,giving O(log n) > performance for lookups. Thank you for pointing that out, Hick. Good to know! Is there some lower bound on either the size of the IN list or the number of rows in the table being queried? I ask because a linear scan of ~10 elements in a IN list won't benefit much if at all from a binary search. Similarly, if the table had only, say, 6 rows and the IN list was 100 long, we could imagine it would be faster to scan the table for each list element -- O(nm), worst case -- than to sort the IN list first and scan the table, O(m n log(n)). --jkl > >-Urspr?ngliche Nachricht- > >Von: James K. Lowden [mailto:jklowden at schemamania.org] > >Gesendet: Samstag, 21. M?rz 2015 20:43 > >An: sqlite-users at mailinglists.sqlite.org > >Betreff: Re: [sqlite] Query times vary between 0.2 s and 30 s for > >very > > > >On Sat, 21 Mar 2015 19:01:16 +0100 > >"Mario M. Westphal" wrote: > > > >> For now I have increased the threshold for IN clauses (instead of > >> TEMP tables) and use WHERE IN (SELECT ? from TEMP) instead of a > >> JOIN. > > > >... > >There is also the question of linear vs binary searches. When you > >supply a list of constants to IN, most if not all DBMSs search the > >list sequentially. When IN (or EXISTS) is supplied from an indexed > >column, the search is often binary. For a small number of elements, > >there's no >distinction. For 1000 elements, it's 2 orders of > >magnitude: 1000 hops versus 10. > > > >--jkl > > > ___ > Gunter Hick > Software Engineer > Scientific Games International GmbH > FN 157284 a, HG Wien > Klitschgasse 2-4, A-1130 Vienna, Austria > Tel: +43 1 80100 0 > E-Mail: hick at scigames.at > > This communication (including any attachments) is intended for the > use of the intended recipient(s) only and may contain information > that is confidential, privileged or legally protected. Any > unauthorized use or dissemination of this communication is strictly > prohibited. If you have received this communication in error, please > immediately notify the sender by return e-mail message and delete all > copies of the original communication. Thank you for your cooperation. > > > ___ > sqlite-users mailing list > sqlite-users at mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Query times vary between 0.2 s and 30 s for very
The temporary table is creates as CREATE TEMPORARY TABLE _tempNNN (oid INTEGER PRIMARY KEY) So the optimizer must know that the values are unique.
[sqlite] Query times vary between 0.2 s and 30 s for very
SQLite creates an ephemeral table for the IN list,giving O(log n) performance for lookups. >-Urspr?ngliche Nachricht- >Von: James K. Lowden [mailto:jklowden at schemamania.org] >Gesendet: Samstag, 21. M?rz 2015 20:43 >An: sqlite-users at mailinglists.sqlite.org >Betreff: Re: [sqlite] Query times vary between 0.2 s and 30 s for very > >On Sat, 21 Mar 2015 19:01:16 +0100 >"Mario M. Westphal" wrote: > >> For now I have increased the threshold for IN clauses (instead of TEMP >> tables) and use WHERE IN (SELECT ? from TEMP) instead of a JOIN. > >... >There is also the question of linear vs binary searches. When you supply a >list of constants to IN, most if not all DBMSs search the list sequentially. >When IN (or EXISTS) is supplied from an indexed column, the search is often >binary. For a small number of elements, there's no >distinction. For 1000 >elements, it's 2 orders of magnitude: 1000 hops versus 10. > >--jkl ___ Gunter Hick Software Engineer Scientific Games International GmbH FN 157284 a, HG Wien Klitschgasse 2-4, A-1130 Vienna, Austria Tel: +43 1 80100 0 E-Mail: hick at scigames.at This communication (including any attachments) is intended for the use of the intended recipient(s) only and may contain information that is confidential, privileged or legally protected. Any unauthorized use or dissemination of this communication is strictly prohibited. If you have received this communication in error, please immediately notify the sender by return e-mail message and delete all copies of the original communication. Thank you for your cooperation.
[sqlite] Query times vary between 0.2 s and 30 s for very
On Sat, 21 Mar 2015 14:36:47 -0600 Scott Robison wrote: > On Mar 21, 2015 1:43 PM, "James K. Lowden" > wrote: > > > > The query optimizer has to be sophisticated enough to recognize > > those conditions, which is unlikely in the case of a temporary > > table. > > Are temporary tables really that different? Other than being dropped > automatically at the end of a session, I was under the impression > they were just another table. They're different to the extent ANALYZE matters. I'm assuming ANALYZE wouldn't be run between the time the temporary table is created and when it's used. --jkl
[sqlite] Query times vary between 0.2 s and 30 s for very
On Sun, Mar 22, 2015 at 2:47 PM, James K. Lowden wrote: > On Sat, 21 Mar 2015 14:36:47 -0600 > Scott Robison wrote: > > > > > > > The query optimizer has to be sophisticated enough to recognize > > > those conditions, which is unlikely in the case of a temporary > > > table. > > > > Are temporary tables really that different? Other than being dropped > > automatically at the end of a session, I was under the impression > > they were just another table. > > They're different to the extent ANALYZE matters. I'm assuming ANALYZE > wouldn't be run between the time the temporary table is created and > when it's used. > Good point. Thanks. -- Scott Robison
[sqlite] Query times vary between 0.2 s and 30 s for very
I don?t think I can always run an analyze on the TEMP table for each query. May ruin performance worse than trying my luck with the temp table. I think this boils down why a JOIN with a 500 row table (one column) is so much (several hundred times) slower than using an IN clause with 500 values. I have not changed the code or query for a long time, so my assumption is that a change in one of the recent SQLite updates caused the changed behavior of the optimizer/query engine. I recall a similar problem maybe a year ago. I think only one of the SQLite developers may shed some light on this. For now I have increased the threshold for IN clauses (instead of TEMP tables) and use WHERE IN (SELECT ? from TEMP) instead of a JOIN.
[sqlite] Query times vary between 0.2 s and 30 s for very
>I don?t think I can always run an analyze on the TEMP table for each >query. May ruin performance worse than trying my luck with the temp >table. >I think this boils down why a JOIN with a 500 row table (one column) is >so much (several hundred times) slower than using an IN clause with 500 >values. >I have not changed the code or query for a long time, so my assumption is >that a change in one of the recent SQLite updates caused the changed >behavior of the optimizer/query engine. I recall a similar problem maybe >a year ago. I think only one of the SQLite developers may shed some light >on this. >For now I have increased the threshold for IN clauses (instead of TEMP >tables) and use WHERE IN (SELECT ? from TEMP) instead of a JOIN. Of course, when you use IN (...) the optimizer KNOWS that the RHS is unique -- whether you are using IN (...list of values...) or IN (SELECT x FROM y). However, if you are using a temp table and using a JOIN, the optimizer only knows that the RHS join predicate is unique if you tell it that is the case. So for example, if you use: create table temp_table (x integer); insert into temp_table values (1),(2),(3) ... then this query: select ... from b join temp_table on b.col = temp_table.x; is significantly different from: select ... from b where col in (select x from temp_table); but the following are equivalent to the later, and to using the list of values directly in the IN clause: create table temp_table (x integer not null primary key); insert into temp_table values (1),(2),(3) ... select ... from b join temp_table where b.col = temp_table.x; which is equivalent to: select ... from b where exists (select 1 from temp_table where x = b.col); and select ... from b where col in (select x from temp_table); and select ... from b where col in temp_table; I don't recall ever seeing your definition of your temp_table ... --- Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. Sometimes theory and practice are combined: nothing works and no one knows why.
[sqlite] Query times vary between 0.2 s and 30 s for very
On Mar 21, 2015 1:43 PM, "James K. Lowden" wrote: > > The query optimizer has to be sophisticated enough to recognize > those conditions, which is unlikely in the case of a temporary table. Are temporary tables really that different? Other than being dropped automatically at the end of a session, I was under the impression they were just another table. Proper indexing would be key to performance of course, as you indicated. Also, is there a limit on the size of in? By which I mean, there is a limit on the size of the query text, but if you use "colx in (select coly from taby)" the table size should not have an arbitrary limit. As long as it is indexed properly, it should be performant. SDR
[sqlite] Query times vary between 0.2 s and 30 s for very
On Sat, 21 Mar 2015 19:01:16 +0100 "Mario M. Westphal" wrote: > For now I have increased the threshold for IN clauses (instead of > TEMP tables) and use WHERE IN (SELECT ? from TEMP) instead of a JOIN. Because the two are conceptually different, it's not surprising they run differently. IN is an existence test: does the value appear in the set? JOIN is a product: produce every matching combination. For some queries, they reduce to the same thing. For example, FROM A JOIN B ON A.b = B.b is the same as IN if B.b 1. is unique, and 2. no other columns from B are used. The query optimizer has to be sophisticated enough to recognize those conditions, which is unlikely in the case of a temporary table. There is also the question of linear vs binary searches. When you supply a list of constants to IN, most if not all DBMSs search the list sequentially. When IN (or EXISTS) is supplied from an indexed column, the search is often binary. For a small number of elements, there's no distinction. For 1000 elements, it's 2 orders of magnitude: 1000 hops versus 10. --jkl
[sqlite] Query times vary between 0.2 s and 30 s for very
On Mar 21, 2015 1:43 PM, "James K. Lowden" wrote: > > The query optimizer has to be sophisticated enough to recognize > those conditions, which is unlikely in the case of a temporary table. Are temporary tables really that different? Other than being dropped automatically at the end of a session, I was under the impression they were just another table. Proper indexing would be key to performance of course, as you indicated. Also, is there a limit on the size of in? By which I mean, there is a limit on the size of the query text, but if you use "colx in (select coly from taby)" the table size should not have an arbitrary limit. As long as it is indexed properly, it should be performant. SDR
[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?
On Wed, 18 Mar 2015 14:43:26 +0100 "Mario M. Westphal" wrote: > I?m using 3.8.8.1 on Windows via the ?C? interface. > > I work with SQLite for several years with good success. And I know > that no optimizer ever will be perfect. But I?ve just stumbled upon a > case where a very similar query requires between 0.2 seconds and a > whopping 30 seconds. > > I?ve simplified things as much as possible and included the create > instructions and queries below. <...snip...> > The question is: When JOINing large tables with a temporary table, > how to ensure that the optimizer can work optimal? Running ANALYZE > with a temporary table probably does not work, and ANALYZE takes > about 1 minute on this database so this is not feasible for each > query. > > I'm glad to have found an apparently working solution (IN instead of > JOIN) but I wonder if this could be somehow automated by the > optimizer? Or maybe this is a worst-case for the optimizer? > You can run ANALYZE on any table, try: ANALYZE temp_table; SELECT You can also test-stress IN and know where is the limit. I think/suppouse/suspect that in this case there is no winning for using a temporal table intstead an IN, it should be faster when the temp table has more columns used in the where clause or additional join restriction. --- --- Eduardo Morras
[sqlite] Query times vary between 0.2 s and 30 s for very similar queries - how to trick the optimizer?
I?m using 3.8.8.1 on Windows via the ?C? interface. I work with SQLite for several years with good success. And I know that no optimizer ever will be perfect. But I?ve just stumbled upon a case where a very similar query requires between 0.2 seconds and a whopping 30 seconds. I?ve simplified things as much as possible and included the create instructions and queries below. 1. My database has a table "art_data" which holds information about an article. A second table "attr" holds all available attributes. art_data may contain any number of attribues per article, typically between 50 and 200. The art_data table in the production database has about 22 million rows, the attr table 20,000. art_data oid INTEGER PRIMARY KEY, art_oid INTEGER, attr_oid INTEGER, tdata TEXT the attr_oid refers to the table which defines the available attributes and tdata is the value for that attribute. The attr table is defined as: attr oid INTEGER PRIMARY KEY class INTEGER tag TEXT For the indices created, please see below. 2. My application needs to select all the data for a specific article and specific attributes. The number of attributes to select is usually between 1 and 30 so I use an IN clause and provide the attribute ids directly in the SELECT query. My application has these 20,000 ids cached and always available. If the number of articles is < 500, my application supplies a list of articles for the d.oid IN (...) as well. SELECT d.oid, d.tdata FROM art_data d WHERE d.oid IN (1,890,222,...) AND d.attr_oid IN (2188,2191,2251,2272,...) ORDER BY d.oid ASC, d.attr_oid ASC, d.rowid ASC This query takes between 0.2 and 0.5 seconds, even if 500 article numbers are in the first WHERE d.oid IN clause! If more than 500 articles are needed, I estimated that this would probably break the IN (is there a limit for IN?) and thus my application puts the article into a temporary table and JOINs with this table: SELECT d.oid, d.attr_oid, d.tdata FROM art_data d INNER JOIN _temp _t ON d.oid = _t.oid AND d.attr_oid IN (2188,2191,2251,2272,...) ORDER BY d.oid ASC, d.attr_oid ASC, d.rowid ASC This query takes between 17 and 30 seconds (!) even if the temporary table only has 501 article numbers (one more than the threshold for the IN clause). The only difference between 0.2 and 17 seconds is replacing an IN clause with 500 numbers with a JOIN with a temporary table containing 501 numbers. While playing with that, I used a SQLite GUI tool and created the temporary table _temp (oid INTEGER PRIMARY KEY) as a regular table and filled it with 500 article numbers. This also resulted in the 17 to 30s query times. For a test, I ran ANALYZE and the query time dropped down to 0.5 seconds. AMAZING. Apparently the query analyzer now had the info about the (no longer) temporary table and was able to use it efficiently. My SOLUTION for now was to change the query with the temporary table to SELECT d.oid, d.attr_oid, d.tdata FROM art_data d WHERE d.attr_oid IN (2188,2191,2251,2272,...) AND d.oid IN (SELECT oid FROM _temp) Instead of a JOIN for the temporary table I use an IN clause with a SELECT. This brought the query time down to 0.5 seconds as well. May also be the optimizer. The question is: When JOINing large tables with a temporary table, how to ensure that the optimizer can work optimal? Running ANALYZE with a temporary table probably does not work, and ANALYZE takes about 1 minute on this database so this is not feasible for each query. I'm glad to have found an apparently working solution (IN instead of JOIN) but I wonder if this could be somehow automated by the optimizer? Or maybe this is a worst-case for the optimizer? If you want to try this out yourself, here is the complete CREATE schema and queries: -- BEGIN - DROP TABLE IF EXISTS art_data; DROP TABLE IF EXISTS attr; CREATE TABLE art_data (oid INTEGER, attr_oid INTEGER, tdata TEXT, FOREIGN KEY(attr_oid) REFERENCES attr(oid) ON DELETE CASCADE); CREATE INDEX idx_art_data_oid ON art_data(oid); CREATE INDEX idx_art_data_oid_tag_oid ON art_data(oid,attr_oid); CREATE INDEX idx_art_data_attr_oid ON art_data(attr_oid); CREATE TABLE attr (oid INTEGER PRIMARY KEY, class INTEGER, tag TEXT); CREATE INDEX idx_attr_tag ON attr(tag); DROP TABLE IF EXISTS _temp; CREATE TABLE _temp (OID INTEGER PRIMARY KEY); --insert into _temp select ... -- Fast: 0.2 seconds -- explain query plan SELECT d.oid, d.tdata FROM art_data d -- Only for specific articles WHERE d.oid IN