RE: [sqlite] sqlite performance problem

2005-04-13 Thread Thomas Briggs
 

> well, it could be true, but not in the queries i have posted. i "group
> by" column "a" and there is an index on column "a", so sqlite does not
> have to do anything to compute the key. it does not even have to go back to

   Do not confuse the index key with the aggregator key.  The two may be
the same in your case, but in general they won't be, so I wouldn't
expect SQLite to reuse the index key as the aggregator key even if it
can.

> the disk and fetch the value of column "a" because it's right in the
> index. the only work sqlite has to do is to go back to the disk and
> fetch the value of column "n1" and sum it.

   You're underestimating how much work has to be done to just "fetch
the value of column n1".  The entire row containing n1 has to be read,
then the value of column n1 extracted from it.  That means disk seeks,
disk reads, moving around within the data row, etc.

> what you say is correct, but four (!) times performance increase?
> please, you have got to admit that something is not right here.

   It may well be possible to make your query run faster.  The biggest
problem here is simply that you're expecting a magically
high-performance solution without understanding what needs to be done in
order to satisfy your request.

> why? i'm using a relational database to do exactly what it should do
> fast. that is, i want to select a range of rows from a much bigger set

   Do not confuse what you want it to do fast, or even what you think it
should do fast, with what it actually does or how fast it does it.

> and do something with it. i'm not asking it to do complex computations
> or anything else.

   You don't think you're doing anything complex, but you really are;
you just don't understand that you are.  Locating a variable-sized row
in an unpredictable location and performing arbitrary operations on some
sub-section of that row is a heck of a lot more complicated than
scanning through a flat file and computing hash keys.

> i think i know what is going on here. the problem is that every time
> i do an indexed scan sqlite has to
> 
> 1) fetch index pages
> 
> 2) fetch data pages that match "where" condition
> 
> because both index and data are in the same file sqlite has to perform
> an insane amount of seek() calls and this is clearly demonstrated in my
> previous post.

   Once again, the fact that the data and the index are in the same file
is irrelevant.  Even if the data and the index were in separate files,
the disk heads would be doing a lot of random I/O to move from the
location of one row to the next given that the table rows are not
accessed sequentially when using an index.

> it still takes 14 whopping seconds to "group by" 300,000+ records
> down to 1417 records and that is just unacceptable (imo) for a
> relational database.

   Why does the fact that SQLite is a relational database mean that
response times shouldn't be measured in seconds?  Have you tried this with
any other relational database, as another poster suggested?  You may
well get better performance out of something like MySQL or Oracle, but
it'll come at a price - much more memory usage and an SMP system that
allows you to effectively multithread.  I can guarantee you that any
other relational database, if configured to use the same amount of
memory as SQLite (a few MBs) and the same number of threads (1),
would perform markedly *worse* than SQLite.

> from what i can see sqlite took an extra 7 seconds to fetch the data off
> the disk. that is, it had to seek back to the data page, fetch it, process
> it, seek back to the next index page, etc.

   That's correct.  That's the price you pay for using an index to look
up the data.

> yes, i tried all that. it is not that easy to read sqlite explain
> output (imo) but i convinced myself that it uses all the proper indexes
> when i make the query.

   Just because you think it's using the correct index doesn't mean that
that's the optimal access path.  Assuming that it is, however... Given
that you do seem to understand the overhead incurred by using an index
to look up data that is located elsewhere, what's the issue?  It seems
pretty clear that SQLite isn't doing anything wrong.  It isn't doing it
as optimally as your C program, but we've already covered the reasons
for that, so what's left to discuss?

> i think sqlite can only do a good job when the resulting set is very small,
> ideally one row. in this case it does not have to go back to disk to
> fetch the columns that are not in the index. it is that simple (imo). the
> more columns (that are not in the index) you request the more time it
> takes.

   Why would you expect the processing of a large number of rows to take
the same amount of time as processing a single row?  Once again, your
expectations are unreasonable.  It's going to take more time to process
a large number of rows simply because there's more data to be managed.
   The number of columns requested isn't much of a factor either.
Requesting 10 columns that are not in the index costs little more than
requesting one, because the entire row has to be read from disk either
way.
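
   To make that concrete against the schema posted elsewhere in the
thread (a sketch, not from the original message): the expensive step is
locating and reading the row, not pulling extra columns out of it.

   SELECT n1 FROM data WHERE a <= 18234721;         -- one off-index column: full row fetch
   SELECT n1, n2, e FROM data WHERE a <= 18234721;  -- three columns: same row fetch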

Re: [sqlite] sqlite performance problem

2005-04-13 Thread D. Richard Hipp
On Fri, 2005-04-08 at 11:53 -0700, Maksim Yevmenkin wrote:
> CREATE INDEX data_by_a ON data (a);
> 
> > time sqlite3 db 'select n2 from data where a <= 18234721' > /dev/null
> 25.95u 0.71s 0:27.02 98.6%
> 

If you make the index look like this:

  CREATE INDEX data_by_a ON data(a, n2);

Then SQLite only has to look at the index to get the information
it needs.  It never has to consult the original table.  This will
make the query about twice as fast or maybe even a little faster.
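
The same trick would cover the sum(n1) query discussed elsewhere in the
thread (a sketch, not from the original post): with both the filter
column and the summed column in the index b-tree, the table is never
consulted.

  CREATE INDEX data_by_a_n1 ON data(a, n1);
  SELECT a, sum(n1) FROM data WHERE a <= 18234721 GROUP BY a;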
-- 
D. Richard Hipp <[EMAIL PROTECTED]>



Re: [sqlite] sqlite performance problem

2005-04-13 Thread Christian Smith
On Tue, 12 Apr 2005, Maksim Yevmenkin wrote:

>Dear SQLite users,
>
>consider this
>
> [snip]
>
>it only took 4+ seconds to read, parse, perform hash table lookups and
>sum the data. note that for the 1417 unique keys it had to do hash lookups
>and hash inserts.
>
>so, just with a plain ascii file i get four times the speed i get with
>sqlite. note that my c program will scale linearly with the size of
>the dataset (just like i see with sqlite).
>
>so,
>
>- what am i doing wrong here?


Not managing your expectations. Try the test with another SQL database,
and see what sort of speed you get.


>
>- any suggestions on how to speed things up with sqlite?


Buy a faster machine. No, I'm not being facetious.


>
>- is sqlite optimized to retrieve one row from the table?


When fetching that one row using an index, yes, that's what indexes are
for. But when an index scan touches a significant proportion of the
corresponding table, it becomes less than optimal, due to the reasons
given previously in the thread (see the sketch after this list):
- Cache thrashing
- Index + Table access
- CPU overhead
- Non-linear file access
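
The difference shows up directly with the a+0 trick used elsewhere in
the thread, which hides the column from the optimizer (a sketch):

  SELECT n1 FROM data WHERE a <= 18234721;    -- index scan: index page + table page per row
  SELECT n1 FROM data WHERE a+0 <= 18234721;  -- no index: one sequential pass over the table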

>
>i'm starting to think that having index and data in the same file is
>not such a great idea when you want to perform range requests.


Unless the index and data are on different disks, you'll get no benefit
from splitting them into separate files.


>
>thanks,
>max


Christian



Re: [sqlite] sqlite performance problem

2005-04-12 Thread Gé Weijers
Maksim Yevmenkin wrote:

>
>>>so, just with a plain ascii file i get four times the speed i get with
>>>sqlite. note that my c program will scale linearly with the size of
>>>the dataset (just like i see with sqlite).
>>>  
>>>
>>   With anything related to computers, there are always tradeoffs - most
>>commonly power for complexity, and flexibility for speed.  Your C
>>program *should* be faster than anything SQLite can do - it's simpler
>>and more specific to the problem you're trying to solve.  On the flip
>>side, it'll never do anything other than what it already does - it can
>>never be used to solve any other problem.
>>
>>
>
>what you say is correct, but four (!) times performance increase?
>please, you have got to admit that something is not right here.
>
>  
>
I have to agree with Thomas: your expectations are too high. If I were
using a relational database and could get within a factor of 4 of what
I can concoct in C, I'd declare victory. Relational databases are often
far from speedy, even on simple queries. You pay for:

* variable record formats
* integrity checks
* duplicate storage of keys in the BTree
* duplicate storage of keys in multiple tables
* the ACID property, even if you're not using it in your samples
* the ability to perform queries in a flexible way
* ...

If your database is simple you may be better off, performance-wise,
rolling your own solution or using another database. MySQL is pretty
fast if you run it using ISAM tables, but you pay with data corruption
if the DB or system crashes.

If your queries generally produce a sizeable percentage of the records
stored, you might as well do a sequential scan over a file; if written
with care, performance will be completely I/O bound. Use the 'mmap'
system call or equivalent to map the DB into memory, and you can read
your DB using pointer arithmetic, and use 'memmove' for updates.

Gé



Re: [sqlite] sqlite performance problem

2005-04-12 Thread Maksim Yevmenkin
John,

> >i think i know what is going on here. the problem is that every time
> >i do an indexed scan sqlite has to
> >
> >1) fetch index pages
> >
> >2) fetch data pages that match "where" condition
> >
> >because both index and data are in the same file sqlite has to perform
> >an insane amount of seek() calls and this is clearly demonstrated in my
> >previous post.
> >
> >even if i remove the sum() and just add an extra column that is not in
> >the index, like so
> >
> >SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);
> >
> >
> >
> >>time sqlite3 db < test.sql
> >>
> >>
> >300000
> >1024
> >1417
> >13.14u 1.06s 0:14.40 98.6%
> >
> >it still takes 14 whopping seconds to "group by" 300,000+ records
> >down to 1417 records and that is just unacceptable (imo) for a
> >relational database.
> >
> >
> If seeking was the problem, you'd see more system time. The i/o time is
> at most 1.06 seconds. The file is probably
> being cached in the os, and therefore, the seeks just become memory
> operations.

i agree that most of the data are cached by the filesystem. it would be
even slower if sqlite had to go and fetch all the data from the disk
platters. i'm not sure if system time is counted if the process was put
to sleep due to, say, disk i/o.

> You might also get more help if you presented your requests in a milder
> tone. You might even get responses
> from the people who wrote sqlite, and therefore have knowledge of its
> internals.

 i did not intend to offend anyone; if i did, i apologize.

> but one thing to try:
> 
> CREATE TABLE data(
>   a INTEGER,
>   b INTEGER,
>   c CHAR,
>   d INTEGER,
>   e INTEGER,
>   n1 FLOAT,
>   n2 FLOAT
> );
> CREATE INDEX data_by_a on data (a, b, c, d, e, n1, n2);
> 
> This is a big waste of space, and will slow down all your updates and
> inserts, but since you
> seem to be mostly worried about the speed of selects, those shouldn't
> make a difference to you.

that will not work for me either. i have to fit a 60 million row table
in under 15 gig.

> You may already know this, but when you create an index with multiple
> keys, it can use a subset
> of those keys to look up rows. The limitation is that any key that
> appears in the index before a column
> you would like to limit must also be limited, e.g.,
> 
> select
> *
> from
> a
> where
> a = 12 and b = 16 and c = 16
> 
> will use the index, but
> 
> select
> *
> from
> a
> where
> b = 22 and c = 23
> 
> will not use the index.

yes, i know this. i do not see how this is applicable in my tests.
 
> You claim that the reason sqlite is slower than you expect is because
> sqlite is using the index to
> look up the data. If so, this should fix the problem, by making the
> index have all the data right
> there.

that is not what i said. the index properly gives me the much (30
times) smaller data set. however, the column i have an index on is only
one of the columns i want to fetch. the other columns are still on
disk. in my case it takes longer to get the other columns i want.
again, i'd rather not place an index on all the columns because i might
not meet the size requirements.

> I'm not so sure that this will resolve your problems, but it may take
> you closer to a solution. Another
> thing to try is to try the same experiment with postgresql or mysql, and
> see if they have the same sort
> of performance you're expecting to see. If not, you may want to consider

i will try another database. it's just i already spent too much time
trying to make sqlite work :( i still hope that i'm doing something
wrong here. i just don't know what it is :)

> that Thomas was right, and
> a general solution may not be able to perform the calculations with the
> same speed that you are hoping
> for.

but i'm not doing anything out of the ordinary here. select rows, sum
some columns and do a join on a couple of other tables. why would sqlite
not work for me?

thanks,
max


Re: [sqlite] sqlite performance problem

2005-04-12 Thread Maksim Yevmenkin
> > SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);
> >
> > > time sqlite3 db < test.sql
> > 30
> > 1024
> > 1417
> > 13.14u 1.06s 0:14.40 98.6%
> 
> Have you tried doing the query like this:
> 
>   SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a-18234721<=0 GROUP
> BY a);

i just tried. i think this effectively disables the index on column
"a", and on the 9+ million record data set it took

> time sqlite3 db < test.sql
300000
1024
1417
61.16u 40.69s 2:05.01 81.4%

thanks,
max


Re: [sqlite] sqlite performance problem

2005-04-12 Thread John LeSueur
Maksim Yevmenkin wrote:

> i think i know what is going on here. the problem is that every time
> i do an indexed scan sqlite has to
>
> 1) fetch index pages
>
> 2) fetch data pages that match "where" condition
>
> because both index and data are in the same file sqlite has to perform
> an insane amount of seek() calls and this is clearly demonstrated in my
> previous post.
>
> even if i remove the sum() and just add an extra column that is not in
> the index, like so
>
> SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);
>
> > time sqlite3 db < test.sql
> 300000
> 1024
> 1417
> 13.14u 1.06s 0:14.40 98.6%
>
> it still takes 14 whopping seconds to "group by" 300,000+ records
> down to 1417 records and that is just unacceptable (imo) for a
> relational database.

If seeking was the problem, you'd see more system time. The i/o time is 
at most 1.06 seconds. The file is probably
being cached in the os, and therefore, the seeks just become memory 
operations.

You might also get more help if you presented your requests in a milder 
tone. You might even get responses
from the people who wrote sqlite, and therefore have knowledge of its 
internals.

but one thing to try:
CREATE TABLE data(
 a INTEGER,
 b INTEGER,
 c CHAR,
 d INTEGER,
 e INTEGER,
 n1 FLOAT,
 n2 FLOAT
);
CREATE INDEX data_by_a on data (a, b, c, d, e, n1, n2);
This is a big waste of space, and will slow down all your updates and 
inserts, but since you
seem to be mostly worried about the speed of selects, those shouldn't 
make a difference to you.
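
That wide index is effectively a covering index -- the same idea
D. Richard Hipp suggests elsewhere in the thread. A sketch of a query it
would cover (an illustration, not from the original message):

 SELECT a, n1 FROM data WHERE a <= 18234721;  -- answered from the index alone, no table reads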

You may already know this, but when you create an index with multiple 
keys, it can use a subset
of those keys to look up rows. The limitation is that any key that 
appears in the index before a column
you would like to limit must also be limited, e.g.,

select 
   * 
from 
   a 
where 
   a = 12 and b = 16 and c = 16 

will use the index, but
select
   *
from 
   a
where 
   b = 22 and c = 23

will not use the index.
You claim that the reason sqlite is slower than you expect is because 
sqlite is using the index to
look up the data. If so, this should fix the problem, by making the 
index have all the data right
there.

I'm not so sure that this will resolve your problems, but it may take 
you closer to a solution. Another
thing to try is to try the same experiment with postgresql or mysql, and 
see if they have the same sort
of performance you're expecting to see. If not, you may want to consider 
that Thomas was right, and
a general solution may not be able to perform the calculations with the 
same speed that you are hoping
for.

John LeSueur


Re: [sqlite] sqlite performance problem

2005-04-12 Thread D. Richard Hipp
On Tue, 2005-04-12 at 16:17 -0700, Maksim Yevmenkin wrote:

> SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);
> 
> > time sqlite3 db < test.sql
> 30
> 1024
> 1417
> 13.14u 1.06s 0:14.40 98.6%
> 

Have you tried doing the query like this:

  SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a-18234721<=0 GROUP
BY a);


-- 
D. Richard Hipp <[EMAIL PROTECTED]>



Re: [sqlite] sqlite performance problem

2005-04-12 Thread Maksim Yevmenkin
Thomas,

> > with sum(n1) added the query runs twice as slow. as i was told it's because
> > sqlite has to fetch the data row. fine, but why is it soo slow?! and it
> 
>Because for each row it has to compute the aggregate key, find the
> aggregator for that key and increment the sum for that aggregate key.
> That's a lot more work than just selecting the row and doing nothing
> with it.

well, it could be true, but not in the queries i have posted. i "group
by" column "a" and there is an index on column "a", so sqlite does not
have to do anything to compute the key. it does not even have to go back to
the disk and fetch the value of column "a" because it's right in the
index. the only work sqlite has to do is to go back to the disk and
fetch the value of column "n1" and sum it.

> > so, just with a plain ascii file i get four times the speed i get with
> > sqlite. note that my c program will scale linearly with the size of
> > the dataset (just like i see with sqlite).
> 
>With anything related to computers, there are always tradeoffs - most
> commonly power for complexity, and flexibility for speed.  Your C
> program *should* be faster than anything SQLite can do - it's simpler
> and more specific to the problem you're trying to solve.  On the flip
> side, it'll never do anything other than what it already does - it can
> never be used to solve any other problem.

what you say is correct, but four (!) times performance increase?
please, you have got to admit that something is not right here.

> > - what am i doing wrong here?
> 
>Your expectations are too high, for starters.  For reasons I will
> never understand, people expect a relational database to be a silver
> bullet that answers all questions instantaneously.  RDBMSs are useful
> because they're flexible, but they're also not 100% optimal because
> they're flexible.  That's the price you pay for being able to answer
> questions with SQL instead of C.

why? i'm using a relational database to do exactly what it should do
fast. that is, i want to select a range of rows from a much bigger set
and do something with it. i'm not asking it to do complex computations
or anything else.

>The other problem you may be running into is lack of a clear problem
> definition.  So far I've seen mention of a half dozen things which you
> think should perform better, but you haven't dug into any of them to
> find out *why* they're slow.  An earlier poster helped identify disk I/O
> as a bottleneck.  If from there you can prove that that's a bottleneck
> because SQLite is performing too much disk I/O, then you have cause to
> claim that SQLite is slow.  Otherwise, all you can do is blame your
> disks.  Until you get a clear picture of what you want, what you need
> and what's actually happening, however, you'll never be able to tell the
> difference.

i think i know what is going on here. the problem is that every time
i do an indexed scan sqlite has to

1) fetch index pages

2) fetch data pages that match "where" condition

because both index and data are in the same file sqlite has to perform
an insane amount of seek() calls and this is clearly demonstrated in my
previous post.

even if i remove the sum() and just add an extra column that is not in
the index, like so

SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);

> time sqlite3 db < test.sql
300000
1024
1417
13.14u 1.06s 0:14.40 98.6%

it still takes 14 whopping seconds to "group by" 300,000+ records
down to 1417 records and that is just unacceptable (imo) for a
relational database.

now if you only request a column that is in the index

SELECT count(*) FROM (SELECT a FROM data WHERE a <= 18234721 GROUP BY a);

> time sqlite3 db < test.sql
300000
1024
1417
6.84u 0.24s 0:07.18 98.6%

it only takes 7+ seconds. 

from what i can see sqlite took an extra 7 seconds to fetch the data off
the disk. that is, it had to seek back to the data page, fetch it, process
it, seek back to the next index page, etc.

> > - any suggestions on how to speed things up with sqlite?
> 
>First, you must understand what SQLite is doing.  Have you tried
> EXPLAINing the query that you're unhappy with to see what SQLite is
> going with it "under the hood"?

yes, i tried all that. it is not that easy to read sqlite explain
output (imo) but i convinced myself that it uses all the proper indexes
when i make the query.

>Also, an earlier poster mentioned increasing your block size.  How
> high did you raise it?  I've seen significant performance increases with
> block sizes of 8k and 16k; just how much difference it makes seems to
> depend on the system in question, however.  On some systems I've tested,
> 8k block sizes are faster than 16k block sizes; my guess is that all
> this comes down to the block size used by the file system and/or OS (and
> the size of the data itself can factor in there as well), though I
> haven't yet dug deep enough to be sure.  From my experience though, in
> most cases, there are certainly gains to be had by using larger block
> sizes.

RE: [sqlite] sqlite performance problem

2005-04-12 Thread Thomas Briggs
 
> with sum(n1) added the query runs twice as slow. as i was told it's because
> sqlite has to fetch the data row. fine, but why is it soo slow?! and it

   Because for each row it has to compute the aggregate key, find the
aggregator for that key and increment the sum for that aggregate key.
That's a lot more work than just selecting the row and doing nothing
with it.
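
   To illustrate with the thread's schema (a sketch, not from the
original message), the aggregate key is computed and looked up per row
for a query like:

   SELECT a, sum(n1) FROM data WHERE a <= 18234721 GROUP BY a;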

> so, just with a plain ascii file i get four times the speed i get with
> sqlite. note that my c program will scale linearly with the size of
> the dataset (just like i see with sqlite).

   With anything related to computers, there are always tradeoffs - most
commonly power for complexity, and flexibility for speed.  Your C
program *should* be faster than anything SQLite can do - it's simpler
and more specific to the problem you're trying to solve.  On the flip
side, it'll never do anything other than what it already does - it can
never be used to solve any other problem.

> - what am i doing wrong here?

   Your expectations are too high, for starters.  For reasons I will
never understand, people expect a relational database to be a silver
bullet that answers all questions instantaneously.  RDBMSs are useful
because they're flexible, but they're also not 100% optimal because
they're flexible.  That's the price you pay for being able to answer
questions with SQL instead of C.
   The other problem you may be running into is lack of a clear problem
definition.  So far I've seen mention of a half dozen things which you
think should perform better, but you haven't dug into any of them to
find out *why* they're slow.  An earlier poster helped identify disk I/O
as a bottleneck.  If from there you can prove that that's a bottleneck
because SQLite is performing too much disk I/O, then you have cause to
claim that SQLite is slow.  Otherwise, all you can do is blame your
disks.  Until you get a clear picture of what you want, what you need
and what's actually happening, however, you'll never be able to tell the
difference.

> - any suggestions on how to speed things up with sqlite?

   First, you must understand what SQLite is doing.  Have you tried
EXPLAINing the query that you're unhappy with to see what SQLite is
going with it "under the hood"?

   Also, an earlier poster mentioned increasing your block size.  How
high did you raise it?  I've seen significant performance increases with
block sizes of 8k and 16k; just how much difference it makes seems to
depend on the system in question, however.  On some systems I've tested,
8k block sizes are faster than 16k block sizes; my guess is that all
this comes down to the block size used by the file system and/or OS (and
the size of the data itself can factor in there as well), though I
haven't yet dug deep enough to be sure.  From my experience though, in
most cases, there are certainly gains to be had by using larger block
sizes.

> - is sqlite optimized to retrieve one row from the table?

   If there is an index on the column or columns referenced in your
WHERE clause, yes.  Moreso if the column or columns referenced contain
unique values.  Note however that this has nothing to do with SQLite -
all relational databases (all databases with B-tree indexes, actually,
not just RDBMSs) work this way.

> i'm starting to think that having index and data in the same file is
> not such a great idea when you want to perform range requests.

   Having the data and the index in the same file is irrelevant.  It is
entirely possible that SQLite might be using an index to satisfy your
query, and entirely possible that it would be more optimal to execute
the query without using the index, but the fact that the index and data
are in the same file has nothing to do with it.  Whether to use the
index or not is an age-old problem that other RDBMSs solve using
cost-based optimizers; I'm not sure if there's a solution for that in
SQLite, as I believe query plan execution is rule-based (I'm not 100%
sure about that though, someone please correct me if I'm wrong).  If
that's the case, then you'll need to try to find another way to write
your query such that SQLite executes it more optimally.

   If I remember correctly, the old trick to make sure you didn't use an
index was to change your WHERE clause slightly to something like:

   SELECT COUNT(*) FROM (SELECT a FROM data WHERE (a + 0) <= 18234721
GROUP BY a);

   This should force a table scan rather than an index scan.  Try that
and see how your queries perform; if things go faster, that means the
index is hurting you, and you need to either drop it or rewrite your
queries to avoid using it.

   -Tom


Re: [sqlite] sqlite performance problem

2005-04-12 Thread Maksim Yevmenkin
Gé,

thanks for the suggestion. unfortunately it did not make any
difference :( below are the results. as you can see it takes 7+ seconds
to "group by" 333,392 records and i'm grouping by a column on which i
have an index. again, i'm not a database guy, but i think that is slow.
perhaps someone can comment if that is the best sqlite can do?


sqlite> .schema data
CREATE TABLE data(
  a INTEGER,
  b INTEGER,
  c CHAR,
  d INTEGER,
  e INTEGER,
  n1 FLOAT,
  n2 FLOAT
);
CREATE INDEX data_by_a on data (a);



sqlite> select count(*) from data;
9818210



test-1.sql

PRAGMA cache_size = 300000;
PRAGMA cache_size;
PRAGMA page_size;
SELECT count(*) FROM (SELECT a FROM data WHERE a <= 18234721 GROUP BY a);



test-2.sql

PRAGMA cache_size;
PRAGMA page_size;
SELECT count(*) FROM (SELECT a FROM data WHERE a <= 18234721 GROUP BY a);



> time sqlite3 db < test-1.sql
300000
1024
1417
6.89u 0.33s 0:07.55 95.6%

> time sqlite3 db < test-2.sql
2000
1024
1417
6.91u 0.19s 0:07.39 96.0%



> time sqlite3 db2048 < test-1.sql
300000
2048
1417
6.80u 0.08s 0:07.32 93.9%

> time sqlite3 db2048 < test-2.sql
2000
2048
1417
6.77u 0.12s 0:07.10 97.0%



> time sqlite3 db4096 < test-1.sql
300000
4096
1417
6.80u 0.15s 0:07.21 96.3%

> time sqlite3 db4096 < test-2.sql
2000
4096
1417
6.79u 0.15s 0:07.15 97.0%



> time sqlite3 db8192 < test-1.sql
300000
8192
1417
6.70u 0.11s 0:07.01 97.1%

> time sqlite3 db8192 < test-2.sql
2000
8192
1417
6.73u 0.09s 0:07.01 97.2%

thanks,
max




Re: [sqlite] sqlite performance problem

2005-04-12 Thread Gé Weijers
Maksim,

Some things you could try:

1) increase cache memory

You may be causing a lot of cache misses if the size of the query result
is very large compared to the size of the cache. Index-based searches
can cause multiple reloads of the same page because of a lack of
locality in the cache. An index-less search will just load each page once.

as an experiment, try 'PRAGMA cache_size = 300000' before you run the
query. 1 GB of ram should be able to support 300MB of cache.

2) use 8192-byte pages

Larger pages seem to improve performance quite a bit, in my experience.

Do 'PRAGMA page_size = 8192' before you create the database.
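
The order matters; a minimal transcript (database name hypothetical):

  $ sqlite3 newdb
  sqlite> PRAGMA page_size = 8192;  -- only takes effect before the first table is created
  sqlite> CREATE TABLE data(a INTEGER, n1 FLOAT);
  sqlite> PRAGMA page_size;         -- now reports 8192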

Doing both may cause excessive memory use (300000 * 8K = ...). I've
never tried that.


Gé






Re: [sqlite] sqlite performance problem

2005-04-11 Thread Maksim Yevmenkin
Robert,

> [snip]
> 
> > i said i print these rows to /dev/null too in my perl code. plus the
> > perl code does some other things such as joining these rows with other
> > hashes and summing the numbers.
> 
> That's fine.  I was merely trying to account for the 50% speed difference
> between the two differing column tests, which has been accomplished.
> 
> > > As for the temp table ... I haven't tried this, but isn't
> > "temp" a reserved
> > > word in SQLite?  More importantly, you should be doing this
> > statement inside
> >
> > yes, it is. i really want to create 'temporary table' in memory. i was
> > really hoping it would speed things up.
> 
> I misread the statement, so ignore me on that part.  However, 339,000 rows
> into a temporary in-memory table ... I tried some experiments locally here
> and none of them took more than 2 seconds to execute.  Are you sure you're
> not using up all available memory, which is causing the system to hit the
> swapfile?  What does this same query look like when you drop the "temp" from
> the query?

the system has 1G of ram. i was "monitoring" sqlite3 memory usage with
'top'. the SIZE and RES did not exceed 30M. so i do not think the
memory is the issue here.

> time sqlite3 db 'create table foo as select * from data where a <= 18234721' 
> > /dev/null
22.06u 1.39s 0:27.75 84.5%

so pretty much the same time without 'temp'.

i'm starting to suspect disk. here is what i did. i created a separate
database with only one table. this table contains a subset of 333392
rows from the original data table. it also has the same index on "a"
column, i.e. i did

> sqlite3 db1
sqlite> attach db as s;
sqlite> create table data as select * from s.data where a <= 18234721;
sqlite> create index data_by_a on data (a);

full scan

> time sqlite3 db1 'select n1 from data' > /dev/null
17.19u 0.55s 0:19.06 93.0%

"bad" index scan, because it is guaranteed then the table only has
keys that match "where"

> time sqlite3 db1 'select n1 from data where a <= 18234721' > /dev/null
25.73u 0.59s 0:28.37 92.7%

+10 seconds! is this the overhead of an "indexed" scan? is this what it
really takes to seek back and forth between index and data? what am i
missing here?

thanks,
max


Re: [sqlite] sqlite performance problem

2005-04-11 Thread Maksim Yevmenkin
Robert,

> > time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> > 26.15u 0.59s 0:27.00 99.0%
> >
> > time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> > 26.04u 0.61s 0:26.91 99.0%
> >
> > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> > 12.22u 0.41s 0:12.67 99.6%
> >
> > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> > 12.01u 0.61s 0:12.64 99.8%
> >
> > so just by selecting two different columns (both next to each other) I
> > get another 50% time difference? sure the column types
> > are different,
> > and i can even understand that FLOAT might be 8 bytes and
> > INTEGER 4
> > bytes, but 50% time difference? it just can't be that
> > linear, can it?
> 
> By doing these same two queries using select count(), you've proven my
> original theory that the time difference is due to the volume of bytes being
> transmitted to dev/null and NOT because of the actual data types of the
> columns or ANY OTHER FACTOR.  If you really want, change your pipe output to

ok, i agree, redirecting output to /dev/null might have an impact on times.

> two file names, I'm 100% sure you'll find that the 'select n1 ...' query
> results in an output file significantly larger than the 'select e ...'
> output file.  This is where the differing performance comes from -- the time
> it is taking to parse and print your output.

here you are wrong actually. 

> time sqlite3 db 'select e from data where a <= 18234721' > x
12.01u 0.64s 0:12.80 98.8%

> time sqlite3 db 'select n1 from data where a <= 18234721' > y
26.06u 0.62s 0:26.86 99.3%

the size of x is 1070681 (output of column e) and the size of y is
1004219 (output of column n1), so the file sizes are about the same. x
is 66462 bytes more.  it is possible that printf() is 2 times
slower on floats.

> The select count(*) does indeed use the index and hence the significant
> performance difference.

fine

> > fine, if i ask sqlite just to count the rows it wins hands-down, but i
> > really want these rows. even more, i'd like to then "natural join"
> > these rows with a couple of other tables to really do what the perl
> > code currently does.
> >
> > but, it takes 22 seconds just to create a temp table with the
> > required dataset
> >
> > > time sqlite3 db 'create temp table foo as select * from
> > data where a <= 18234721' > /dev/null
> > 21.93u 0.89s 0:22.95 99.4%
> >
> > and i do not understand what i'm doing wrong here :(
> 
> Again, the only reason I suggested using count() in your timing test was to
> ensure that the command-line sqlite3 program's output was consistent for
> both tests and to eliminate dev/null printf's from factoring into the total
> time.  In your application, you'll call select * (or whatever) without the
> count to retrieve the rows -- but since you're not printf'ing them and
> instead are doing your own thing with them, you will indeed see close to
> identical times in your selects just like you did in the count() test.

i said i print these rows to /dev/null too in my perl code. plus the
perl code does some other things such as joining these rows with other
hashes and summing the numbers.

> As for the temp table ... I haven't tried this, but isn't "temp" a reserved
> word in SQLite?  More importantly, you should be doing this statement inside

yes, it is. i really want to create 'temporary table' in memory. i was
really hoping it would speed things up.

> a transaction.  Transactions are critically important in SQLite.  I would go
> so far as to say NEVER EVER perform ANY bulk write in SQLite outside a
> transaction!  The performance difference is beyond phenomenal.

nope.

> time sqlite3 db 'begin; create temp table foo as select * from data where a 
> <= 18234721; commit' > /dev/null
21.90u 0.77s 0:22.87 99.1%

still 22 seconds to just create a table with 300,000+ records, and
that is, unfortunately, too slow :(

doing strace/truss on sqlite shows that it performs a huge amount of seeks.

so the original questions remain:

- what am i doing wrong here?

- is sqlite going to be not as fast on a fairly large indexed table
because it has to seek back and forth between index and data?

thanks,
max


RE: [sqlite] sqlite performance problem

2005-04-11 Thread Robert Simpson
Let's recap ...

> time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> 26.15u 0.59s 0:27.00 99.0%
>
> time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> 26.04u 0.61s 0:26.91 99.0%
>
> time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> 12.22u 0.41s 0:12.67 99.6%
>
> time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> 12.01u 0.61s 0:12.64 99.8%
>
> so just by selecting two different columns (both next to each other) I
> get another 50% time difference? sure the column types
> are different,
> and i can even understand that FLOAT might be 8 bytes and
> INTEGER 4
> bytes, but 50% time difference? it just can't be that
> linear, can it?

By doing these same two queries using select count(), you've proven my
original theory that the time difference is due to the volume of bytes being
transmitted to dev/null and NOT because of the actual data types of the
columns or ANY OTHER FACTOR.  If you really want, change your pipe output to
two file names, I'm 100% sure you'll find that the 'select n1 ...' query
results in an output file significantly larger than the 'select e ...'
output file.  This is where the differing performance comes from -- the time
it is taking to parse and print your output.

The select count(*) does indeed use the index and hence the significant
performance difference.  

> fine, if i ask sqlite just to count the rows it wins hands-down, but i
> really want these rows. even more, i'd like to then "natural join"
> these rows with a couple of other tables to really do what the perl
> code currently does.
> 
> but, it takes 22 seconds just to create a temp table with the
> required dataset
> 
> > time sqlite3 db 'create temp table foo as select * from 
> data where a <= 18234721' > /dev/null
> 21.93u 0.89s 0:22.95 99.4%
> 
> and i do not understand what i'm doing wrong here :(

Again, the only reason I suggested using count() in your timing test was to
ensure that the command-line sqlite3 program's output was consistent for
both tests and to eliminate dev/null printf's from factoring into the total
time.  In your application, you'll call select * (or whatever) without the
count to retrieve the rows -- but since you're not printf'ing them and
instead are doing your own thing with them, you will indeed see close to
identical times in your selects just like you did in the count() test.

As for the temp table ... I haven't tried this, but isn't "temp" a reserved
word in SQLite?  More importantly, you should be doing this statement inside
a transaction.  Transactions are critically important in SQLite.  I would go
so far as to say NEVER EVER perform ANY bulk write in SQLite outside a
transaction!  The performance difference is beyond phenomenal.
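
A minimal illustration of the pattern (a hypothetical bulk load, not
from the thread):

BEGIN;
INSERT INTO data VALUES(1, 2, 'x', 3, 4, 0.5, 0.25);
-- ... thousands more inserts, all inside the same transaction ...
COMMIT;

Outside a transaction each INSERT commits individually, paying a
journal write and sync per row.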

Robert




Re: [sqlite] sqlite performance problem

2005-04-11 Thread Maksim Yevmenkin
Robert,

> > i guess i can believe this. however it's pretty disappointing to get
> > a 50% improvement on a 30 times smaller dataset :(
> >
> > but how do you explain this?
> >
> > sqlite> .schema data
> > CREATE TABLE data
> > (
> >a INTEGER,
> >b INTEGER,
> >c CHAR,
> >d INTEGER,
> >e INTEGER,
> >n1 FLOAT,
> >n2 FLOAT
> > );
> > CREATE INDEX data_by_a ON data (a);
> >
> > > time sqlite3 db 'select n1 from data where a <= 18234721' >
> > /dev/null
> > 26.15u 0.59s 0:27.00 99.0%
> > > time sqlite3 db 'select n1 from data where a <= 18234721' >
> > /dev/null
> > 26.04u 0.61s 0:26.91 99.0%
> >
> > and
> >
> > > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> > 12.22u 0.41s 0:12.67 99.6%
> > > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> > 12.01u 0.61s 0:12.64 99.8%
> >
> > so just by selecting two different columns (both next to each other) i
> > get another 50% time difference? sure the column types are different,
> > and i can even understand that FLOAT might be 8 bytes and INTEGER 4
> > bytes, but 50% time difference? it just can't be that linear, can it?
> >
> > do you think it is possible to get better results by issuing multiple
> > queries each of which will return an even smaller dataset? and, no, i'm not
> > a database guy.
> 
> The most glaring fault in your time tests that I see is that you're running
> the command-line sqlite3, which is printf'ing all 339,000+ rows to dev/null.
> It doesn't show on your screen, but the work is still being done regardless.

well, i do print these rows to /dev/null in the perl test too.

> If you really want a fair speed test, change those statements to:
> 
> select count(n1) from data where a <= 18234721

> time sqlite3 db 'select count(n1) from data where a <= 18234721' > /dev/null
7.79u 0.70s 0:08.50 99.8%

> And
> 
> select count(e) from data where a <= 18234721

> time sqlite3 db 'select count(e) from data where a <= 18234721' > /dev/null
7.90u 0.42s 0:08.31 100.1%

> Or even
> 
> select count(*) from data where a <= 18234721

> time sqlite3 db 'select count(*) from data where a <= 18234721' > /dev/null
1.35u 0.16s 0:01.47 102.7%

8 times faster than count(n1) or count(e)? i'm confused. i guess it
just used the "a" field (on which it has an index?)

> THEN tell us what the difference in performance is ...

fine, if i ask sqlite just to count the rows it wins hands-down, but i
really want these rows. even more, i'd like to then "natural join"
these rows with a couple of other tables to really do what the perl
code currently does.

but, it takes 22 seconds just to create a temp table with the
required dataset

> time sqlite3 db 'create temp table foo as select * from data where a <= 
> 18234721' > /dev/null
21.93u 0.89s 0:22.95 99.4%

and i do not understand what i'm doing wrong here :(

thanks,
max


RE: [sqlite] sqlite performance problem

2005-04-11 Thread Robert Simpson
> -Original Message-
> From: Maksim Yevmenkin [mailto:[EMAIL PROTECTED] 
> Sent: Monday, April 11, 2005 9:59 AM
> To: Christian Smith
> Cc: sqlite-users@sqlite.org
> Subject: Re: [sqlite] sqlite performance problem
> 
> i guess i can believe this. however it's pretty disappointing to get
> a 50% improvement on a 30 times smaller dataset :(
> 
> but how do you explain this?
> 
> sqlite> .schema data
> CREATE TABLE data
> (
>a INTEGER,
>b INTEGER,
>c CHAR,
>d INTEGER,
>e INTEGER,
>n1 FLOAT,
>n2 FLOAT
> );
> CREATE INDEX data_by_a ON data (a);
> 
> > time sqlite3 db 'select n1 from data where a <= 18234721' > 
> /dev/null
> 26.15u 0.59s 0:27.00 99.0%
> > time sqlite3 db 'select n1 from data where a <= 18234721' > 
> /dev/null
> 26.04u 0.61s 0:26.91 99.0%
> 
> and
> 
> > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> 12.22u 0.41s 0:12.67 99.6%
> > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> 12.01u 0.61s 0:12.64 99.8%
> 
> so just by selecting two different columns (both next to each other) i
> get another 50% time difference? sure the column types are different,
> and i can even understand that FLOAT might be 8 bytes and INTEGER 4
> bytes, but 50% time difference? it just can't be that linear, can it?
> 
> do you think it is possible to get better results by issuing multiple
> queries each of which will return an even smaller dataset? and, no, i'm not
> a database guy.

The most glaring fault in your time tests that I see is that you're running
the command-line sqlite3, which is printf'ing all 339,000+ rows to dev/null.
It doesn't show on your screen, but the work is still being done regardless.
If you really want a fair speed test, change those statements to:

select count(n1) from data where a <= 18234721

And 

select count(e) from data where a <= 18234721

Or even

select count(*) from data where a <= 18234721

THEN tell us what the difference in performance is ...

Robert




Re: [sqlite] sqlite performance problem

2005-04-11 Thread Maksim Yevmenkin
Christian,

thanks for the reply.

>i'm having a strange performance problem with sqlite-3.2.0. consider the
> >following table
> >
> > [snip]
> >
> >now the problem:
> >
>1) if i do a select with an index it takes 27 sec. to get 333392 rows
> >
> >> time sqlite3 db 'select n2 from data where a <= 18234721' > /dev/null
> >25.95u 0.71s 0:27.02 98.6%
> >
> >> time sqlite3 db 'select n2 from data where a <= 18234721' > /dev/null
> >26.02u 0.66s 0:27.53 96.9%
> >
> >2) if i do a select with sequential lookup it takes 1min to get 9818210 rows
> >
> >> time sqlite3 db 'select n2 from data where a+0 <= 18234721' > /dev/null
> >49.54u 14.65s 1:04.98 98.7%
> >
> >> time sqlite3 db 'select n2 from data where a+0 <= 18234721' > /dev/null
> >49.80u 14.42s 1:05.03 98.7%
> >
>- so how come it takes only twice as much time to get 30 times more rows?
> 
> When doing an index scan, you will be accessing the index as well as the
> table pages. For a sequential scan, only the table pages are touched,
> reducing thrashing of the cache.

i guess i can believe this. however it's pretty disappointing to get
a 50% improvement on a 30 times smaller dataset :(

but how do you explain this?

sqlite> .schema data
CREATE TABLE data
(
   a INTEGER,
   b INTEGER,
   c CHAR,
   d INTEGER,
   e INTEGER,
   n1 FLOAT,
   n2 FLOAT
);
CREATE INDEX data_by_a ON data (a);

> time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
26.15u 0.59s 0:27.00 99.0%
> time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
26.04u 0.61s 0:26.91 99.0%

and

> time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
12.22u 0.41s 0:12.67 99.6%
> time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
12.01u 0.61s 0:12.64 99.8%

so just by selecting two different columns (both next to each other) i
get another 50% time difference? sure the column types are different,
and i can even understand that FLOAT might be 8 bytes and INTEGER 4
bytes, but 50% time difference? it just can't be that linear, can it?

do you think it is possible to get better results by issuing multiple
queries each of which will return an even smaller dataset? and, no, i'm not
a database guy.

> Also, if the data is not in index order in the table, and/or dispersed
> across the database file, you may have to visit each page more than once
> when traversing in index order. In the full table scan, you'll read the
> rows in table order, hence only touching each page once before moving on
> to later pages, thus reducing cache thrashing even more.

all rows were inserted into the table in an order that matches the index.
i cannot say if the records on disk will have the same order.
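
   (one way to guarantee that order is to rebuild the table sorted by
the key; a sketch, with an invented table name:

   CREATE TABLE data_sorted AS SELECT * FROM data ORDER BY a;
   CREATE INDEX data_sorted_by_a ON data_sorted(a);

an index range scan then visits the table pages in roughly sequential
order.)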

>- and why is it taking 27 seconds to get 333392 rows anyway?
> 
> You think 12347 rows/s is bad?

because i have the same data set in a perl multilevel hash stored on
disk (Storable format), and it takes about 1 minute to perform the
same query on the same hardware. this time includes reading the data
from the disk, traversing every single key on each hash level, etc. i
was hoping that sqlite would do something like that in just a few (< 10)
seconds.

>- is there any way to create an index in a separate file?
> 
> No, SQLite is a single file embedded database. Keeps administration
> almost non-existent (by design.)

- so, if i have a large table ( > 9 million records) with an index
then sqlite would have to constantly seek back and forth between index
and data (within the same file), probably wasting lots of disk
bandwidth?

- if the above is correct, then can i force sqlite to get all (or as much
as possible) of the index into memory?
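
   (the closest lever is the page cache, as Gé suggests elsewhere in the
thread; a sketch, with the value in pages at the default 1K page size:

   PRAGMA cache_size = 300000;  -- ~300MB of cache, keeps hot index pages resident

no pragma pins the index specifically, though.)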

> >the hardware is sun netra t1 running solaris 5.7. the db file size is
> >about 800 mbytes.
> >
> >just for the record i'd like to have at least 15 times more records in
> >the 'data' table.
> 
> If you can match SQLite for simple read-only throughput with another
> database, you'll be doing well.

i do not know about other databases, but compared to plain perl it's not
that impressive. i must be doing something wrong here.

thanks,
max