Thomas,

> > with sum(n1) added query runs twice as slow. as i was told its because
> > sqlite has to fetch data row. fine, but why its soo slow?! and it
> 
>    Because for each row it has to compute the aggregate key, find the
> aggregator for that key and increment the sum for that aggregate key.
> That's a lot more work than just selecting the row and doing nothing
> with it.

well, it could be true, but not in the queries i have posted. i "group
by" column "a" and there is an index on column "a", so sqlite does not
have to do anything to compute the key. it does not even have to go back
to the disk and fetch the value of column "a", because it's right there
in the index. the only work sqlite has to do is to go back to the disk,
fetch the value of column "n1" and sum it.
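
here is a quick way to check that claim (a sketch using python's sqlite3
module; the table name "data" and the columns "a"/"n1" are taken from my
queries, the contents are made up). with an index on "a", the plan should
walk the index in order and never need a temporary b-tree for the grouping:

```python
import sqlite3

# synthetic stand-in for the table discussed in this thread
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (a INTEGER, n1 INTEGER)")
con.execute("CREATE INDEX idx_a ON data(a)")
con.executemany("INSERT INTO data VALUES (?, ?)",
                [(i % 100, i) for i in range(1000)])

# with an index on "a", the GROUP BY can walk the index in sorted order,
# so sqlite does not have to build a temp b-tree to compute the keys
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT a, sum(n1) FROM data WHERE a <= 10 GROUP BY a").fetchall()
for row in plan:
    print(row[-1])
```

at least on recent sqlite versions the plan mentions idx_a and there is no
"USE TEMP B-TREE FOR GROUP BY" line, i.e. the grouping itself is free.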

> > so, just with plain ascii file i get four times the speed i get with
> > sqlite. note that my c program will scale linearly with the size of
> > dataset (just like i see with sqlite).
> 
>    With anything related to computers, there are always tradeoffs - most
> commonly power for complexity, and flexibility for speed.  Your C
> program *should* be faster than anything SQLite can do - it's simpler
> and more specific to the problem you're trying to solve.  On the flip
> side, it'll never do anything other than what it already does - it can
> never be used to solve any other problem.

what you say is correct, but a four-times (!) difference in performance?
please, you have got to admit that something is not right here.

> > - what am i doing wrong here?
> 
>    Your expectations are too high, for starters.  For reasons I will
> never understand, people expect a relational database to be a silver
> bullet that answers all questions instantaneously.  RDBMSs are useful
> because they're flexible, but they're also not 100% optimal because
> they're flexible.  That's the price you pay for being able to answer
> questions with SQL instead of C.

why? i'm using a relational database to do exactly what it should do
fast. that is, i want to select a range of rows from a much bigger set
and do something with it. i'm not asking it to do complex computations
or anything else.

>    The other problem you may be running into is lack of a clear problem
> definition.  So far I've seen mention of a half dozen things which you
> think should perform better, but you haven't dug into any of them to
> find out *why* they're slow.  An earlier poster helped identify disk I/O
> as a bottleneck.  If from there you can prove that that's a bottleneck
> because SQLite is performing too much disk I/O, then you have cause to
> claim that SQLite is slow.  Otherwise, all you can do is blame your
> disks.  Until you get a clear picture of what you want, what you need
> and what's actually happening, however, you'll never be able to tell the
> difference.

i think i know what is going on here. the problem is that every time
i do an indexed scan sqlite has to

1) fetch index pages

2) fetch the data pages that match the "where" condition

because both the index and the data are in the same file, sqlite has to
perform an insane number of seek() calls, and this is clearly
demonstrated in my previous post.

even if i remove the sum() and just add an extra column that is not in
the index, like so

SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);

> time sqlite3 db < test.sql
300000
1024
1417
13.14u 1.06s 0:14.40 98.6%

it still takes a whopping 14 seconds to "group by" 300,000+ records
down to 1417 records, and that is just unacceptable (imo) for a
relational database.

now if you only request a column that is in the index

SELECT count(*) FROM (SELECT a FROM data WHERE a <= 18234721 GROUP BY a);

> time sqlite3 db < test.sql
300000
1024
1417
6.84u 0.24s 0:07.18 98.6%

it only takes 7+ seconds. 

from what i can see, sqlite took an extra 7 seconds to fetch the data
off the disk. that is, it had to seek to the data page, fetch it,
process it, seek back to the next index page, etc.

> > - any suggestions on how to speed things up with sqlite?
> 
>    First, you must understand what SQLite is doing.  Have you tried
> EXPLAINing the query that you're unhappy with to see what SQLite is
> going with it "under the hood"?

yes, i tried all that. it is not that easy to read sqlite's explain
output (imo), but i convinced myself that it uses all the proper indexes
when i run the query.

>    Also, an earlier poster mentioned increasing your block size.  How
> high did you raise it?  I've seen significant performance increases with
> block sizes of 8k and 16k; just how much difference it makes seems to
> depend on the system in question, however.  On some systems I've tested,
> 8k block sizes are faster than 16k block sizes; my guess is that all
> this comes down to the block size used by the file system and/or OS (and
> the size of the data itself can factor in there as well), though I
> haven't yet dug deep enough to be sure.  From my experience though, in
> most cases, there are certainly gains to be had by using larger block
> sizes.

by default you can not raise page_size above SQLITE_MAX_PAGE_SIZE
(which is set to 8192). i have changed this to 65536 and tried all
page sizes up to 32k. 64k does not work for whatever reason: when i
say PRAGMA page_size=65536 it sets page_size to 0. none of this made
any difference. it does not matter what page_size i use or what
cache_size is set to, the results are always the same.

> > - is sqlite optimized to retrieve one row from the table?
> 
>    If there is an index on the column or columns referenced in your
> WHERE clause, yes.  Moreso if the column or columns referenced contain
> unique values.  Note however that this has nothing to do with SQLite -
> all relational databases (all databases with B-tree indexes, actually,
> not just RDBMSs) work this way.

i think sqlite can only do a good job when the result set is very small,
ideally one row. in that case it does not have to go back to disk to
fetch the columns that are not in the index. it is that simple (imo). the
more columns (that are not in the index) you request, the more time it
takes.

here, i just created an index on the "a" and "n1" columns, and now

SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);

> time sqlite3 db < test.sql
300000
1024
1417
6.82u 0.42s 0:07.49 96.6%

this takes 7+ seconds. however, if you ask for the "a" and "n2" columns

SELECT count(*) FROM (SELECT a,n2 FROM data WHERE a <= 18234721 GROUP BY a);

> time sqlite3 db < test.sql
300000
1024
1417
12.48u 0.84s 0:13.38 99.5%

right back to 13+ seconds. 
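
the same effect shows up directly in the query plans (another sketch in
python's sqlite3 module, with synthetic data under my column names). with
an index on (a, n1), asking for (a, n1) is answered from the index alone,
while asking for (a, n2) forces a row fetch per match:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (a INTEGER, n1 INTEGER, n2 INTEGER)")
con.execute("CREATE INDEX idx_a_n1 ON data(a, n1)")
con.executemany("INSERT INTO data VALUES (?, ?, ?)",
                [(i, i, i) for i in range(1000)])

def plan(sql):
    """return the EXPLAIN QUERY PLAN detail text for a query"""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(r[-1]) for r in rows)

# (a, n1) is fully covered by the index: no trips back to the table
p1 = plan("SELECT a, n1 FROM data WHERE a <= 100 GROUP BY a")
# n2 is not in the index: every matching row needs a table lookup
p2 = plan("SELECT a, n2 FROM data WHERE a <= 100 GROUP BY a")
print(p1)  # mentions a COVERING INDEX
print(p2)  # plain index search, rows fetched from the table
```

on recent sqlite versions the first plan says "USING COVERING INDEX" and
the second only "USING INDEX", which is exactly the 7s-vs-13s split above.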

> > i'm starting to think that having index and data in the same file is
> > not such a great idea when you want to perform range requests.
> 
>    Having the data and the index in the same file is irrelevant.  It is
> entirely possible that SQLite might be using an index to satisfy your
> query, and entirely possible that it would be more optimal to execute
> the query without using the index, but the fact that the index and data
> are in the same file has nothing to do with it.  Whether to use the
> index or not is an age-old problem that other RDBMSs solve using
> cost-based optimizers; I'm not sure if there's a solution for that in
> SQLite, as I believe query plan execution is rule-based (I'm not 100%
> sure about that though, someone please correct me if I'm wrong).  If
> that's the case, then you'll need to try to find another way to write
> your query such that SQLite executes it more optimally.

i disagree, see examples above.

>    If I remember correctly, the old trick to make sure you didn't use an
> index was to change your WHERE clause slightly to something like:
> 
>    SELECT COUNT(*) FROM (SELECT a FROM data WHERE (a + 0) <= 18234721
> GROUP BY a);
> 
>    This should force a table scan rather than an index scan.  Try that
> and see how your queries perform; if things go faster, that means the
> index is hurting you, and you need to either drop it or rewrite your
> queries to avoid using it.

again, i tried all that. the problem is that it takes too long to fetch
the columns that are not in the index. i did try creating a separate
database with a much smaller subset of the data and doing a full scan on
it; i'm basically getting the same 6-7 seconds. however, in order to
create that small subset you have to run a 14+ second query :(

thanks,
max
