RE: [sqlite] sqlite performance problem
> well, it could be true, but not in the queries i have posted. i "group
> by" column "a" and there is an index on column "a", so sqlite does not
> have to do anything to compute the key.

Do not confuse the index key with the aggregator key. The two may be the same in your case, but in general they won't be, so I wouldn't expect SQLite to reuse the index key as the aggregator key even if it could.

> it does not even have to go back to the disk and fetch the value of
> column "a" because it's right in the index. the only work sqlite has to
> do is to go back to the disk and fetch the value of column "n1" and sum it.

You're underestimating how much work has to be done just to "fetch the value of column n1". The entire row containing n1 has to be read, then the value of column n1 extracted from it. That means disk seeks, disk reads, moving around within the data row, etc.

> what you say is correct, but four (!) times performance increase?
> please, you have got to admit that something is not right here.

It may well be possible to make your query run faster. The biggest problem here is simply that you're expecting a magically high-performance solution without understanding what needs to be done in order to satisfy your request.

> why? i'm using a relational database to do exactly what it should do
> fast. that is, i want to select a range of rows from a much bigger set

Do not confuse what you want it to do fast, or even what you think it should do fast, with what it actually does or how fast it does it.

> and do something with it. i'm not asking it to do complex computations
> or anything else.

You don't think you're doing anything complex, but you really are; you just don't realize it. Locating a variable-sized row in an unpredictable location and performing arbitrary operations on some sub-section of that row is a great deal more complicated than scanning through a flat file and computing hash keys.

> i think i know what is going on here. the problem is that every time
> i do an indexed scan sqlite has to
>
> 1) fetch index pages
>
> 2) fetch data pages that match the "where" condition
>
> because both index and data are in the same file sqlite has to perform
> an insane amount of seek() calls and this is clearly demonstrated in my
> previous post.

Once again, the fact that the data and the index are in the same file is irrelevant. Even if the data and the index were in separate files, the disk heads would be doing a lot of random I/O to move from the location of one row to the next, given that the table rows are not accessed sequentially when using an index.

> it still takes 14 whopping seconds to "group by" 300,000+ records
> down to 1417 records and that is just unacceptable (imo) for a
> relational database.

Why does the fact that SQLite is a relational database mean that response times should be measured in seconds? Have you tried this with any other relational database, as another poster suggested? You may well get better performance out of something like MySQL or Oracle, but it'll come at a price: much more memory usage and an SMP system that allows you to effectively multithread. I can guarantee you that any other relational database, if configured to use the same amount of memory as SQLite (a few MB) and the same number of threads (one), would perform markedly *worse* than SQLite.

> from what i can see sqlite took an extra 7 seconds to fetch the data off
> the disk. that is, it had to seek back to the data page, fetch it, process
> it, seek back to the next index page, etc.

That's correct. That's the price you pay for using an index to look up the data.

> yes, i tried all that. it is not that easy to read sqlite explain
> output (imo) but i convinced myself that it uses all the proper indexes
> when i make the query.

Just because you think it's using the correct index doesn't mean that that's the optimal access path. Assuming that it is, however: given that you do seem to understand the overhead incurred by using an index to look up data that is located elsewhere, what's the issue? It seems pretty clear that SQLite isn't doing anything wrong. It isn't doing it as optimally as your C program, but we've already covered the reasons for that, so what's left to discuss?

> i think sqlite can only do a good job when the resulting set is very
> small. ideally one row. in this case it does not have to go back to disk
> to fetch the columns that are not in the index. it is that simple (imo).
> the more columns (that are not in the index) you request the more time
> it takes.

Why would you expect the processing of a large number of rows to take the same amount of time as processing a single row? Once again, your expectations are unreasonable. It's going to take more time to process a large number of rows simply because there's more data to be managed. The number of columns requested isn't much of a factor either. Requesting 10 columns that are not in
Re: [sqlite] sqlite performance problem
On Fri, 2005-04-08 at 11:53 -0700, Maksim Yevmenkin wrote:
> CREATE INDEX data_by_a ON data (a);
>
> > time sqlite3 db 'select n2 from data where a <= 18234721' > /dev/null
> 25.95u 0.71s 0:27.02 98.6%

If you make the index look like this:

  CREATE INDEX data_by_a ON data(a, n2);

then SQLite only has to look at the index to get the information it needs. It never has to consult the original table. This will make the query about twice as fast, or maybe even a little faster.
--
D. Richard Hipp <[EMAIL PROTECTED]>
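Dr. Hipp's covering-index suggestion can be checked from Python's sqlite3 module with EXPLAIN QUERY PLAN. The toy table and values below are stand-ins for illustration, not Maksim's actual data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (a INTEGER, n2 FLOAT)")
con.executemany("INSERT INTO data VALUES (?, ?)",
                [(i, i * 0.5) for i in range(1000)])

# Index on (a) alone: SQLite finds qualifying entries in the index,
# then must seek back into the table for each n2 value.
con.execute("CREATE INDEX data_by_a ON data (a)")
plain = con.execute(
    "EXPLAIN QUERY PLAN SELECT n2 FROM data WHERE a <= 500").fetchone()[3]

# Index on (a, n2): the query is answered from the index alone,
# with no trips back to the table.
con.execute("DROP INDEX data_by_a")
con.execute("CREATE INDEX data_by_a ON data (a, n2)")
covering = con.execute(
    "EXPLAIN QUERY PLAN SELECT n2 FROM data WHERE a <= 500").fetchone()[3]

print(plain)     # typically: SEARCH ... USING INDEX data_by_a (a<?)
print(covering)  # typically: SEARCH ... USING COVERING INDEX data_by_a (a<?)
```

With the (a, n2) index the plan reports a COVERING INDEX, which is exactly the "never has to consult the original table" behavior described above.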
Re: [sqlite] sqlite performance problem
On Tue, 12 Apr 2005, Maksim Yevmenkin wrote:

> Dear SQLite users,
>
> consider this
>
> [snip]
>
> it only took 4+ seconds to read, parse, perform the hash table lookup and
> sum the data. note that for the 1417 unique keys it had to do a hash
> lookup and a hash insert.
>
> so, just with a plain ascii file i get four times the speed i get with
> sqlite. note that my c program will scale linearly with the size of the
> dataset (just like i see with sqlite).
>
> so,
>
> - what am i doing wrong here?

Not managing your expectations. Try the test with another SQL database, and see what sort of speed you get.

> - any suggestions on how to speed things up with sqlite?

Buy a faster machine. No, I'm not being facetious.

> - is sqlite optimized to retrieve one row from the table?

When fetching that one row using an index, yes; that's what indexes are for. But when an index scan touches a significant proportion of the corresponding table, it becomes less than optimal, for the reasons given previously in the thread:
- cache thrashing
- index + table access
- CPU overhead
- non-linear file access

> i'm starting to think that having index and data in the same file is
> not such a great idea when you want to perform range requests.

Unless the index and data are on different disks, you'll get no benefit from splitting them into separate files.

> thanks,
> max

Christian
Re: [sqlite] sqlite performance problem
Maksim Yevmenkin wrote:
>>> so, just with plain ascii file i get four times the speed i get with
>>> sqlite. note that my c program will scale linearly with the size of
>>> dataset (just like i see with sqlite).
>>
>> With anything related to computers, there are always tradeoffs - most
>> commonly power for complexity, and flexibility for speed. Your C
>> program *should* be faster than anything SQLite can do - it's simpler
>> and more specific to the problem you're trying to solve. On the flip
>> side, it'll never do anything other than what it already does - it can
>> never be used to solve any other problem.
>
> what you say is correct, but four (!) times performance increase?
> please, you have got to admit that something is not right here.

I have to agree with Thomas: your expectations are too high. If I were using a relational database and could get within a factor of 4 of what I can concoct in C, I'd declare victory. Relational databases are often far from speedy, even on simple queries. You pay for:

* variable record formats
* integrity checks
* duplicate storage of keys in the B-tree
* duplicate storage of keys in multiple tables
* the ACID property, even if you're not using it in your samples
* the ability to perform queries in a flexible way
* ...

If your database is simple you may be better off, performance-wise, rolling your own solution or using another database. MySQL is pretty fast if you run it using ISAM tables, but you pay with data corruption if the DB or system crashes. If your queries generally produce a sizeable percentage of the records stored, you might as well do a sequential scan over a flat file; written with care, it will be completely I/O bound. Use the 'mmap' system call or equivalent to map the file into memory, and you can read it using pointer arithmetic and use 'memmove' for updates.

Gé
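Gé's sequential-scan suggestion is essentially what Maksim's C program already does: one linear pass over the file, aggregating with a hash table. A minimal Python sketch of the idea, assuming a whitespace-separated file laid out like the data table (a b c d e n1 n2); the filename and cutoff below are placeholders, not the actual test setup:

```python
from collections import defaultdict

def scan(path, cutoff):
    """One sequential pass: group rows on column "a" (field 0) and
    sum column "n1" (field 5), skipping rows where a > cutoff."""
    sums = defaultdict(float)          # a -> sum(n1)
    with open(path) as f:
        for line in f:
            fields = line.split()
            a = int(fields[0])
            if a <= cutoff:
                sums[a] += float(fields[5])
    return sums

# hypothetical usage, mirroring the query in the thread:
# sums = scan("data.txt", 18234721)
```

Because the file is read front to back, the OS read-ahead keeps this I/O-bound rather than seek-bound, which is the whole advantage over an index scan that hops between index and data pages.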
Re: [sqlite] sqlite performance problem
John,

> > i think i know what is going on here. the problem is that every time
> > i do an indexed scan sqlite has to
> >
> > 1) fetch index pages
> >
> > 2) fetch data pages that match the "where" condition
> >
> > because both index and data are in the same file sqlite has to perform
> > an insane amount of seek() calls and this is clearly demonstrated in my
> > previous post.
> >
> > even if i remove the sum() and just add an extra column that is not in
> > the index, like so
> >
> > SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);
> >
> > > time sqlite3 db < test.sql
> > 300000
> > 1024
> > 1417
> > 13.14u 1.06s 0:14.40 98.6%
> >
> > it still takes 14 whopping seconds to "group by" 300,000+ records
> > down to 1417 records and that is just unacceptable (imo) for a
> > relational database.
>
> If seeking was the problem, you'd see more system time. The i/o time is
> at most 1.06 seconds. The file is probably being cached by the os, and
> therefore the seeks just become memory operations.

i agree that most of the data is cached by the filesystem. it would be even slower if sqlite had to go and fetch all the data from the disk platters. i'm not sure that system time is counted if the process was put to sleep due to, say, disk i/o.

> You might also get more help if you presented your requests in a milder
> tone. You might even get responses from the people who wrote sqlite, and
> therefore have knowledge of its internals.

i did not intend to offend anyone; if i did, i apologize.

> but one thing to try:
>
> CREATE TABLE data(
>   a INTEGER,
>   b INTEGER,
>   c CHAR,
>   d INTEGER,
>   e INTEGER,
>   n1 FLOAT,
>   n2 FLOAT
> );
> CREATE INDEX data_by_a on data (a, b, c, d, e, n1, n2);
>
> This is a big waste of space, and will slow down all your updates and
> inserts, but since you seem to be mostly worried about the speed of
> selects, that shouldn't make a difference to you.

that will not work for me either. i have to fit a 60 million row table in under 15 gig.

> You may already know this, but when you create an index with multiple
> keys, it can use a subset of those keys to look up rows. The limitation
> is that any key that appears in the index before a column you would like
> to limit must also be limited, e.g.,
>
>   select * from a where a = 12 and b = 16 and c = 16
>
> will use the index, but
>
>   select * from a where b = 22 and c = 23
>
> will not use the index.

yes, i know this. i do not see how this is applicable in my tests.

> You claim that the reason sqlite is slower than you expect is because
> sqlite is using the index to look up the data. If so, this should fix
> the problem, by making the index have all the data right there.

that is not what i said. the index properly gives me the much (30 times) smaller data set. however, the column i have an index on is only one of the columns i want to fetch; the other columns are still on disk, and in my case it takes longer to get those other columns. again, i'd rather not place an index on all the columns because i might not meet the size requirements.

> I'm not so sure that this will resolve your problems, but it may take
> you closer to a solution. Another thing to try is the same experiment
> with postgresql or mysql, and see if they have the sort of performance
> you're expecting to see. If not, you may want to consider that Thomas
> was right, and a general solution may not be able to perform the
> calculations with the same speed that you are hoping for.

i will try another database. it's just that i've already spent too much time trying to make sqlite work :( i still hope that i'm doing something wrong here; i just don't know what it is :)

but i'm not doing anything out of the ordinary here: select rows, sum some columns, and join on a couple of other tables. why would sqlite not work for me?

thanks,
max
Re: [sqlite] sqlite performance problem
> > SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);
> >
> > > time sqlite3 db < test.sql
> > 300000
> > 1024
> > 1417
> > 13.14u 1.06s 0:14.40 98.6%
>
> Have you tried doing the query like this:
>
> SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a-18234721<=0 GROUP BY a);

i just tried. i think this effectively disables the index on column "a", and on the 9+ million record data set it took

> time sqlite3 db < test.sql
300000
1024
1417
61.16u 40.69s 2:05.01 81.4%

thanks,
max
Re: [sqlite] sqlite performance problem
Maksim Yevmenkin wrote:
> i think i know what is going on here. the problem is that every time
> i do an indexed scan sqlite has to
>
> 1) fetch index pages
>
> 2) fetch data pages that match the "where" condition
>
> because both index and data are in the same file sqlite has to perform
> an insane amount of seek() calls and this is clearly demonstrated in my
> previous post.
>
> even if i remove the sum() and just add an extra column that is not in
> the index, like so
>
> SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);
>
> > time sqlite3 db < test.sql
> 300000
> 1024
> 1417
> 13.14u 1.06s 0:14.40 98.6%
>
> it still takes 14 whopping seconds to "group by" 300,000+ records down
> to 1417 records and that is just unacceptable (imo) for a relational
> database.

If seeking was the problem, you'd see more system time. The i/o time is at most 1.06 seconds. The file is probably being cached by the os, and therefore the seeks just become memory operations.

You might also get more help if you presented your requests in a milder tone. You might even get responses from the people who wrote sqlite, and therefore have knowledge of its internals.

But one thing to try:

  CREATE TABLE data(
    a INTEGER,
    b INTEGER,
    c CHAR,
    d INTEGER,
    e INTEGER,
    n1 FLOAT,
    n2 FLOAT
  );
  CREATE INDEX data_by_a on data (a, b, c, d, e, n1, n2);

This is a big waste of space, and will slow down all your updates and inserts, but since you seem to be mostly worried about the speed of selects, that shouldn't make a difference to you.

You may already know this, but when you create an index with multiple keys, it can use a subset of those keys to look up rows. The limitation is that any key that appears in the index before a column you would like to limit must also be limited, e.g.,

  select * from a where a = 12 and b = 16 and c = 16

will use the index, but

  select * from a where b = 22 and c = 23

will not use the index.

You claim that the reason sqlite is slower than you expect is that sqlite is using the index to look up the data. If so, this should fix the problem, by making the index hold all the data right there.

I'm not so sure that this will resolve your problems, but it may take you closer to a solution. Another thing to try is the same experiment with postgresql or mysql, to see whether they deliver the sort of performance you're expecting. If not, you may want to consider that Thomas was right, and a general solution may not be able to perform the calculations with the speed you are hoping for.

John LeSueur
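The prefix rule described above can be confirmed with EXPLAIN QUERY PLAN. In this sketch the toy table gets a fourth column d (an assumption, purely so that SELECT * cannot be answered from the index alone):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
con.execute("CREATE INDEX t_abc ON t (a, b, c)")

def plan(sql):
    # The fourth column of an EXPLAIN QUERY PLAN row is the plan detail.
    return con.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Leading index column constrained: the composite index is usable.
with_prefix = plan("SELECT * FROM t WHERE a = 12 AND b = 16 AND c = 16")

# Leading column unconstrained: SQLite cannot seek within the index
# and falls back to scanning the table.
without_prefix = plan("SELECT * FROM t WHERE b = 22 AND c = 23")

print(with_prefix)     # mentions index t_abc
print(without_prefix)  # a plain scan, no t_abc
```

The same experiment also shows why the wide (a, b, ..., n2) index helps a query constrained only on "a": any prefix of the index, even a single column, can drive the search.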
Re: [sqlite] sqlite performance problem
On Tue, 2005-04-12 at 16:17 -0700, Maksim Yevmenkin wrote:
> SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);
>
> > time sqlite3 db < test.sql
> 300000
> 1024
> 1417
> 13.14u 1.06s 0:14.40 98.6%

Have you tried doing the query like this:

  SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a-18234721<=0 GROUP BY a);

--
D. Richard Hipp <[EMAIL PROTECTED]>
Re: [sqlite] sqlite performance problem
Thomas,

> > with sum(n1) added query runs twice as slow. as i was told its because
> > sqlite has to fetch data row. fine, but why its soo slow?! and it
>
> Because for each row it has to compute the aggregate key, find the
> aggregator for that key and increment the sum for that aggregate key.
> That's a lot more work than just selecting the row and doing nothing
> with it.

well, it could be true, but not in the queries i have posted. i "group by" column "a" and there is an index on column "a", so sqlite does not have to do anything to compute the key. it does not even have to go back to the disk and fetch the value of column "a" because it's right in the index. the only work sqlite has to do is to go back to the disk and fetch the value of column "n1" and sum it.

> > so, just with a plain ascii file i get four times the speed i get with
> > sqlite. note that my c program will scale linearly with the size of the
> > dataset (just like i see with sqlite).
>
> With anything related to computers, there are always tradeoffs - most
> commonly power for complexity, and flexibility for speed. Your C
> program *should* be faster than anything SQLite can do - it's simpler
> and more specific to the problem you're trying to solve. On the flip
> side, it'll never do anything other than what it already does - it can
> never be used to solve any other problem.

what you say is correct, but a four (!) times performance increase? please, you have got to admit that something is not right here.

> > - what am i doing wrong here?
>
> Your expectations are too high, for starters. For reasons I will
> never understand, people expect a relational database to be a silver
> bullet that answers all questions instantaneously. RDBMSs are useful
> because they're flexible, but they're also not 100% optimal because
> they're flexible. That's the price you pay for being able to answer
> questions with SQL instead of C.

why? i'm using a relational database to do exactly what it should do fast. that is, i want to select a range of rows from a much bigger set and do something with it. i'm not asking it to do complex computations or anything else.

> The other problem you may be running into is lack of a clear problem
> definition. So far I've seen mention of a half dozen things which you
> think should perform better, but you haven't dug into any of them to
> find out *why* they're slow. An earlier poster helped identify disk I/O
> as a bottleneck. If from there you can prove that that's a bottleneck
> because SQLite is performing too much disk I/O, then you have cause to
> claim that SQLite is slow. Otherwise, all you can do is blame your
> disks. Until you get a clear picture of what you want, what you need
> and what's actually happening, however, you'll never be able to tell
> the difference.

i think i know what is going on here. the problem is that every time i do an indexed scan sqlite has to

1) fetch index pages

2) fetch data pages that match the "where" condition

because both index and data are in the same file sqlite has to perform an insane amount of seek() calls, and this is clearly demonstrated in my previous post. even if i remove the sum() and just add an extra column that is not in the index, like so

SELECT count(*) FROM (SELECT a,n1 FROM data WHERE a <= 18234721 GROUP BY a);

> time sqlite3 db < test.sql
300000
1024
1417
13.14u 1.06s 0:14.40 98.6%

it still takes 14 whopping seconds to "group by" 300,000+ records down to 1417 records, and that is just unacceptable (imo) for a relational database. now if you only request the column that is in the index

SELECT count(*) FROM (SELECT a FROM data WHERE a <= 18234721 GROUP BY a);

> time sqlite3 db < test.sql
300000
1024
1417
6.84u 0.24s 0:07.18 98.6%

it only takes 7+ seconds. from what i can see sqlite took an extra 7 seconds to fetch the data off the disk. that is, it had to seek back to the data page, fetch it, process it, seek back to the next index page, etc.

> > - any suggestions on how to speed things up with sqlite?
>
> First, you must understand what SQLite is doing. Have you tried
> EXPLAINing the query that you're unhappy with to see what SQLite is
> doing with it "under the hood"?

yes, i tried all that. it is not that easy to read sqlite explain output (imo), but i convinced myself that it uses all the proper indexes when i make the query.

> Also, an earlier poster mentioned increasing your block size. How
> high did you raise it? I've seen significant performance increases with
> block sizes of 8k and 16k; just how much difference it makes seems to
> depend on the system in question, however. On some systems I've tested,
> 8k block sizes are faster than 16k block sizes; my guess is that all
> this comes down to the block size used by the file system and/or OS (and
> the size of the data itself can factor in there as well), though I
> haven't yet dug deep enough to be sure. From my experience though, in
> most cases, there are certainly gains to be had by using larger block
> sizes.
RE: [sqlite] sqlite performance problem
> with sum(n1) added query runs twice as slow. as i was told its because
> sqlite has to fetch data row. fine, but why its soo slow?! and it

Because for each row it has to compute the aggregate key, find the aggregator for that key and increment the sum for that aggregate key. That's a lot more work than just selecting the row and doing nothing with it.

> so, just with plain ascii file i get four times the speed i get with
> sqlite. note that my c program will scale linearly with the size of
> dataset (just like i see with sqlite).

With anything related to computers, there are always tradeoffs - most commonly power for complexity, and flexibility for speed. Your C program *should* be faster than anything SQLite can do - it's simpler and more specific to the problem you're trying to solve. On the flip side, it'll never do anything other than what it already does - it can never be used to solve any other problem.

> - what am i doing wrong here?

Your expectations are too high, for starters. For reasons I will never understand, people expect a relational database to be a silver bullet that answers all questions instantaneously. RDBMSs are useful because they're flexible, but they're also not 100% optimal because they're flexible. That's the price you pay for being able to answer questions with SQL instead of C.

The other problem you may be running into is lack of a clear problem definition. So far I've seen mention of a half dozen things which you think should perform better, but you haven't dug into any of them to find out *why* they're slow. An earlier poster helped identify disk I/O as a bottleneck. If from there you can prove that that's a bottleneck because SQLite is performing too much disk I/O, then you have cause to claim that SQLite is slow. Otherwise, all you can do is blame your disks. Until you get a clear picture of what you want, what you need and what's actually happening, however, you'll never be able to tell the difference.

> - any suggestions on how to speed things up with sqlite?

First, you must understand what SQLite is doing. Have you tried EXPLAINing the query that you're unhappy with to see what SQLite is doing with it "under the hood"?

Also, an earlier poster mentioned increasing your block size. How high did you raise it? I've seen significant performance increases with block sizes of 8k and 16k; just how much difference it makes seems to depend on the system in question, however. On some systems I've tested, 8k block sizes are faster than 16k block sizes; my guess is that all this comes down to the block size used by the file system and/or OS (and the size of the data itself can factor in there as well), though I haven't yet dug deep enough to be sure. From my experience though, in most cases, there are certainly gains to be had by using larger block sizes.

> - is sqlite optimized to retrieve one row from the table?

If there is an index on the column or columns referenced in your WHERE clause, yes. More so if the columns referenced contain unique values. Note however that this has nothing to do with SQLite - all relational databases (all databases with B-tree indexes, actually, not just RDBMSs) work this way.

> i'm starting to think that having index and data in the same file is
> not such a great idea when you want to perform range requests.

Having the data and the index in the same file is irrelevant. It is entirely possible that SQLite might be using an index to satisfy your query, and entirely possible that it would be more optimal to execute the query without using the index, but the fact that the index and data are in the same file has nothing to do with it. Whether to use the index or not is an age-old problem that other RDBMSs solve using cost-based optimizers; I'm not sure if there's a solution for that in SQLite, as I believe query plan selection is rule-based (I'm not 100% sure about that though, someone please correct me if I'm wrong).

If that's the case, then you'll need to find another way to write your query such that SQLite executes it more optimally. If I remember correctly, the old trick to make sure you didn't use an index was to change your WHERE clause slightly, to something like:

  SELECT COUNT(*) FROM (SELECT a FROM data WHERE (a + 0) <= 18234721 GROUP BY a);

This should force a table scan rather than an index scan. Try that and see how your queries perform; if things go faster, that means the index is hurting you, and you need to either drop it or rewrite your queries to avoid using it.

-Tom
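Tom's (a + 0) trick, like the a-18234721<=0 variant suggested elsewhere in the thread, shows up directly in the query plan. A small sketch against a stand-in table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (a INTEGER, n1 FLOAT)")
con.execute("CREATE INDEX data_by_a ON data (a)")

def plan(sql):
    # Fourth column of an EXPLAIN QUERY PLAN row is the plan detail.
    return con.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Bare comparison against the indexed column: a range SEARCH via the index.
indexed = plan("SELECT a FROM data WHERE a <= 18234721")

# Wrapping the column in arithmetic hides it from the optimizer, so
# SQLite falls back to a full scan (of the table, or possibly of the
# whole index) instead of a range search.
unindexed = plan("SELECT a FROM data WHERE (a + 0) <= 18234721")

print(indexed)
print(unindexed)
```

Whether the resulting full scan is faster than the index search depends entirely on how many rows the range matches, which is the crux of the whole thread.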
Re: [sqlite] sqlite performance problem
Gé,

thanks for the suggestion. unfortunately it did not make any difference :( below are the results. as you can see it takes 7+ seconds to "group by" 333,392 records, and i'm grouping by a column on which i have an index. again, i'm not a database guy, but i think that is slow. perhaps someone can comment on whether that is the best sqlite can do?

sqlite> .schema data
CREATE TABLE data(
  a INTEGER,
  b INTEGER,
  c CHAR,
  d INTEGER,
  e INTEGER,
  n1 FLOAT,
  n2 FLOAT
);
CREATE INDEX data_by_a on data (a);

sqlite> select count(*) from data;
92

test-1.sql:

PRAGMA cache_size = 300000;
PRAGMA cache_size;
PRAGMA page_size;
SELECT count(*) FROM (SELECT a FROM data WHERE a <= 18234721 GROUP BY a);

test-2.sql:

PRAGMA cache_size;
PRAGMA page_size;
SELECT count(*) FROM (SELECT a FROM data WHERE a <= 18234721 GROUP BY a);

> time sqlite3 db < test-1.sql
300000
1024
1417
6.89u 0.33s 0:07.55 95.6%

> time sqlite3 db < test-2.sql
2000
1024
1417
6.91u 0.19s 0:07.39 96.0%

> time sqlite3 db2048 < test-1.sql
300000
2048
1417
6.80u 0.08s 0:07.32 93.9%

> time sqlite3 db2048 < test-2.sql
2000
2048
1417
6.77u 0.12s 0:07.10 97.0%

> time sqlite3 db4096 < test-1.sql
300000
4096
1417
6.80u 0.15s 0:07.21 96.3%

> time sqlite3 db4096 < test-2.sql
2000
4096
1417
6.79u 0.15s 0:07.15 97.0%

> time sqlite3 db8192 < test-1.sql
300000
8192
1417
6.70u 0.11s 0:07.01 97.1%

> time sqlite3 db8192 < test-2.sql
2000
8192
1417
6.73u 0.09s 0:07.01 97.2%

thanks,
max

On Apr 12, 2005 7:10 AM, Gé Weijers <[EMAIL PROTECTED]> wrote:
> Maksim,
>
> Some things you could try:
>
> 1) increase cache memory
>
> You may be causing a lot of cache misses if the size of the query result
> is very large compared to the size of the cache. Index-based searches
> can cause multiple reloads of the same page because of a lack of
> locality in the cache. An index-less search will just load each page once.
>
> as an experiment, try 'PRAGMA cache_size = 300000' before you run the
> query. 1 GB of ram should be able to support 300MB of cache.
> > 2) use 8192-byte pages > > Larger pages seem to improve performance quite a bit, in my experience. > > Do 'PRAGMA page_size = 8192' before you create the database. > > Doing both may cause excessive memory use (20 * 8K = ...). I've > never tried that. > > > Gé > > > Maksim Yevmenkin wrote: > > >Robert, > > > > > > > >>[snip] > >> > >> > >> > >>>i said i print these rows to /dev/null too in my perl code. plus the > >>>perl code does some other things such as joining these rows with other > >>>hashes and summing the numbers. > >>> > >>> > >>That's fine. I was merely trying to account for the 50% speed difference > >>between the two differing column tests, which has been accomplished. > >> > >> > >> > As for the temp table ... I haven't tried this, but isn't > > > >>>"temp" a reserved > >>> > >>> > word in SQLite? More importantly, you should be doing this > > > >>>statement inside > >>> > >>>yes, it is. i really want to create 'temporary table' in memory. i was > >>>really hoping it would speed things up. > >>> > >>> > >>I misread the statement, so ignore me on that part. However, 339,000 rows > >>into a temporary in-memory table ... I tried some experiments locally here > >>and none of them took more than 2 seconds to execute. Are you sure you're > >>not using up all available memory, which is causing the system to hit the > >>swapfile? What does this same query look like when you drop the "temp" from > >>the query? > >> > >> > > > >the system has 1G of ram. i was "monitoring" sqlite3 memory usage with > >'top'. the SIZE and RES did not exceed 30M. so i do not think the > >memory is the issue here. > > > > > > > >>time sqlite3 db 'create table foo as select * from data where a <= > >>18234721' > /dev/null > >> > >> > >22.06u 1.39s 0:27.75 84.5% > > > >so pretty much the same time without 'temp'. > > > >i'm starting to suspect disk. here is what i did. i created a separate > >database with only one table. 
> >this table contains subset of 92 rows from original data table. it also
> >has the same index on "a" column, i.e. i did
> >
> >>sqlite3 db1
> >
> >sqlite> attach db as s;
> >sqlite> create table data as select * from s.data where a <= 18234721;
> >sqlite> create index data_by_a on data (a);
> >
> >full scan
> >
> >>time sqlite3 db1 'select n1 from data' > /dev/null
> >
> >17.19u 0.55s 0:19.06 93.0%
> >
> >"bad" index scan, because it is guaranteed that the table only has
> >keys that match "where"
> >
> >>time sqlite3 db1 'select n1 from data where a <= 18234721' > /dev/null
> >
> >25.73u 0.59s 0:28.37 92.7%
> >
> >+10 seconds! is this the overhead of "indexed" scan? is this what it
> >really takes to seek back and forth between index and data? what am i
> >missing here?
> >
> >thanks,
> >max
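For readers replaying these pragma experiments without building an 800MB database, here is a minimal sketch using Python's sqlite3 module. The file path and row counts are illustrative only; the key points match the thread: page_size must be set before the database file is populated, and cache_size is measured in pages per connection, not bytes.

```python
import os
import sqlite3
import tempfile

# Throwaway database file; the path is illustrative.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)

# page_size only takes effect if set before the database file is
# populated; cache_size counts pages (per connection), not bytes.
con.execute("PRAGMA page_size = 8192")
con.execute("PRAGMA cache_size = 30")

con.execute("CREATE TABLE data (a INTEGER, n1 FLOAT)")
con.executemany("INSERT INTO data VALUES (?, ?)",
                [(i, i * 0.5) for i in range(1000)])
con.commit()

page_size = con.execute("PRAGMA page_size").fetchone()[0]
cache_pages = con.execute("PRAGMA cache_size").fetchone()[0]
print(page_size, cache_pages)  # 8192 30
con.close()
```

As the thread's timing table shows, neither pragma changed the 7-second group-by much, which is consistent with the bottleneck being index-to-table seeks rather than cache pressure.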
Re: [sqlite] sqlite performance problem
Maksim, Some things you could try: 1) increase cache memory You may be causing a lot of cache misses if the size of the query result is very large compared to the size of the cache. Index-based searches can cause multiple reloads of the same page because of a lack of locality in the cache. An index-less search will just load each page once. as an experiment, try 'PRAGMA cache_size = 30', before you run the query. 1 GB of ram should be able to support 300MB of cache. 2) use 8192-byte pages Larger pages seem to improve performance quite a bit, in my experience. Do 'PRAGMA page_size = 8192' before you create the database. Doing both may cause excessive memory use (20 * 8K = ...). I've never tried that. Gé Maksim Yevmenkin wrote: >Robert, > > > >>[snip] >> >> >> >>>i said i print these rows to /dev/null too in my perl code. plus the >>>perl code does some other things such as joining these rows with other >>>hashes and summing the numbers. >>> >>> >>That's fine. I was merely trying to account for the 50% speed difference >>between the two differing column tests, which has been accomplished. >> >> >> As for the temp table ... I haven't tried this, but isn't >>>"temp" a reserved >>> >>> word in SQLite? More importantly, you should be doing this >>>statement inside >>> >>>yes, it is. i really want to create 'temporary table' in memory. i was >>>really hoping it would speed things up. >>> >>> >>I misread the statement, so ignore me on that part. However, 339,000 rows >>into a temporary in-memory table ... I tried some experiments locally here >>and none of them took more than 2 seconds to execute. Are you sure you're >>not using up all available memory, which is causing the system to hit the >>swapfile? What does this same query look like when you drop the "temp" from >>the query? >> >> > >the system has 1G of ram. i was "monitoring" sqlite3 memory usage with >'top'. the SIZE and RES did not exceed 30M. so i do not think the >memory is the issue here. 
> > > >>time sqlite3 db 'create table foo as select * from data where a <= 18234721' >>> /dev/null >> >> >22.06u 1.39s 0:27.75 84.5% > >so pretty much the same time without 'temp'. > >i'm starting to suspect disk. here is what i did. i created a separate >database with only one table. this table contains subset of 92 >rows from original data table. it also has the same index on "a" >column, i.e. i did > > > >>sqlite3 db1 >> >> >sqlite> attach db as s; >sqlite> create table data as select * from s.data where a <= 18234721; >sqlite> create index data_by_a on data (a); > >full scan > > > >>time sqlite3 db1 'select n1 from data' > /dev/null >> >> >17.19u 0.55s 0:19.06 93.0% > >"bad" index scan, because it is guaranteed then the table only has >keys that match "where" > > > >>time sqlite3 db1 'select n1 from data where a <= 18234721' > /dev/null >> >> >25.73u 0.59s 0:28.37 92.7% > >+10 seconds! is this the overhead of "indexed" scan? is this what it >really takes to seek back and forth between index and data? what am i >missing here? > >thanks, >max > >
Re: [sqlite] sqlite performance problem
Robert, > [snip] > > > i said i print these rows to /dev/null too in my perl code. plus the > > perl code does some other things such as joining these rows with other > > hashes and summing the numbers. > > That's fine. I was merely trying to account for the 50% speed difference > between the two differing column tests, which has been accomplished. > > > > As for the temp table ... I haven't tried this, but isn't > > "temp" a reserved > > > word in SQLite? More importantly, you should be doing this > > statement inside > > > > yes, it is. i really want to create 'temporary table' in memory. i was > > really hoping it would speed things up. > > I misread the statement, so ignore me on that part. However, 339,000 rows > into a temporary in-memory table ... I tried some experiments locally here > and none of them took more than 2 seconds to execute. Are you sure you're > not using up all available memory, which is causing the system to hit the > swapfile? What does this same query look like when you drop the "temp" from > the query? the system has 1G of ram. i was "monitoring" sqlite3 memory usage with 'top'. the SIZE and RES did not exceed 30M. so i do not think the memory is the issue here. > time sqlite3 db 'create table foo as select * from data where a <= 18234721' > > /dev/null 22.06u 1.39s 0:27.75 84.5% so pretty much the same time without 'temp'. i'm starting to suspect disk. here is what i did. i created a separate database with only one table. this table contains subset of 92 rows from original data table. it also has the same index on "a" column, i.e. 
i did

> sqlite3 db1

sqlite> attach db as s;
sqlite> create table data as select * from s.data where a <= 18234721;
sqlite> create index data_by_a on data (a);

full scan

> time sqlite3 db1 'select n1 from data' > /dev/null
17.19u 0.55s 0:19.06 93.0%

"bad" index scan, because it is guaranteed that the table only has
keys that match "where"

> time sqlite3 db1 'select n1 from data where a <= 18234721' > /dev/null
25.73u 0.59s 0:28.37 92.7%

+10 seconds! is this the overhead of "indexed" scan? is this what it
really takes to seek back and forth between index and data? what am i
missing here?

thanks,
max
Re: [sqlite] sqlite performance problem
Robert,

> > time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> > 26.15u 0.59s 0:27.00 99.0%
> >
> > time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> > 26.04u 0.61s 0:26.91 99.0%
> >
> > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> > 12.22u 0.41s 0:12.67 99.6%
> >
> > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> > 12.01u 0.61s 0:12.64 99.8%
> >
> > so just by selecting two different rows (both next to each other). I
> > get another 50% time difference? sure the column types are different,
> > and i can even understand that FLOAT might be 8 bytes and INTEGER 4
> > bytes, but 50% time difference? it just cant be that linear can it?
>
> By doing these same two queries using select count(), you've proven my
> original theory that the time difference is due to the volume of bytes being
> transmitted to dev/null and NOT because of the actual data types of the
> columns or ANY OTHER FACTOR. If you really want, change your pipe output to

ok, i agree, redirecting output to /dev/null might have an impact on times.

> two file names, I'm 100% sure you'll find that the 'select n1 ...' query
> results in an output file significantly larger than the 'select e ...'
> output file. This is where the differing performance comes from -- the time
> it is taking to parse and print your output.

here you are wrong actually.

> time sqlite3 db 'select e from data where a <= 18234721' > x
12.01u 0.64s 0:12.80 98.8%
> time sqlite3 db 'select n1 from data where a <= 18234721' > y
26.06u 0.62s 0:26.86 99.3%

the size of x is 1070681 (output of column e) and the size of y is
1004219 (output of column n1), so the file sizes are about the same. x
is 66462 bytes more. it is probably possible that printf() is 2 times
slower on floats.

> The select count(*) does indeed use the index and hence the significant
> performance difference.
fine

> > fine, if i ask sqlite just to count the rows it wins hands-down, but i
> > really want these rows. even more i'd like to then "natural join"
> > these rows with a couple of other tables to really do what the perl
> > code currently does.
> >
> > but, it takes 22 seconds just to create a temp table with the
> > required dataset
> >
> > > time sqlite3 db 'create temp table foo as select * from
> > data where a <= 18234721' > /dev/null
> > 21.93u 0.89s 0:22.95 99.4%
> >
> > and i do not understand what i'm doing wrong here :(
>
> Again, the only reason I suggested using count() in your timing test was to
> ensure that the command-line sqlite3 program's output was consistent for
> both tests and to eliminate dev/null printf's from factoring into the total
> time. In your application, you'll call select * (or whatever) without the
> count to retrieve the rows -- but since you're not printf'ing them and
> instead are doing your own thing with them, you will indeed see close to
> identical times in your selects just like you did in the count() test.

i said i print these rows to /dev/null too in my perl code. plus the
perl code does some other things such as joining these rows with other
hashes and summing the numbers.

> As for the temp table ... I haven't tried this, but isn't "temp" a reserved
> word in SQLite? More importantly, you should be doing this statement inside

yes, it is. i really want to create a 'temporary table' in memory. i was
really hoping it would speed things up.

> a transaction. Transactions are critically important in SQLite. I would go
> so far as to say NEVER EVER perform ANY bulk write in SQLite outside a
> transaction! The performance difference is beyond phenomenal.

nope.
> time sqlite3 db 'begin; create temp table foo as select * from data where a
> <= 18234721; commit' > /dev/null
21.90u 0.77s 0:22.87 99.1%

still 22 seconds just to create a table with 300,000+ records, and that
is, unfortunately, too slow :(

doing strace/truss on sqlite shows that it performs a huge amount of
seeks. so the original questions stay:

- what am i doing wrong here?
- is sqlite going to be not as fast on a fairly large indexed table
  because it has to seek back and forth between index and data?

thanks,
max
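One remedy the thread never tries is a covering index: an index on (a, n1) carries the selected column inside the index b-tree, so a query needing only those two columns never has to seek back into the table. The sketch below uses Python's sqlite3 module; the index name data_by_a_n1 and the in-memory toy data are my own illustration, and automatic use of covering indexes is a feature of SQLite versions well after the 3.2.0 discussed here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (a INTEGER, n1 FLOAT)")
con.executemany("INSERT INTO data VALUES (?, ?)",
                [(i, float(i)) for i in range(10000)])

# A covering index stores n1 next to the key, so a query that needs
# only (a, n1) can be answered from the index b-tree alone, with no
# seek back into the table.
con.execute("CREATE INDEX data_by_a_n1 ON data(a, n1)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT n1 FROM data WHERE a <= 5000").fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # mentions "COVERING INDEX data_by_a_n1" on recent SQLite

total = con.execute("SELECT sum(n1) FROM data WHERE a <= 10").fetchone()[0]
print(total)  # 0+1+...+10 = 55.0
```

This trades extra index size for eliminating exactly the index-to-table seeks that strace/truss was showing.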
RE: [sqlite] sqlite performance problem
Let's recap ...

> time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> 26.15u 0.59s 0:27.00 99.0%
>
> time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> 26.04u 0.61s 0:26.91 99.0%
>
> time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> 12.22u 0.41s 0:12.67 99.6%
>
> time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> 12.01u 0.61s 0:12.64 99.8%
>
> so just by selecting two different rows (both next to each other). I
> get another 50% time difference? sure the column types are different,
> and i can even understand that FLOAT might be 8 bytes and INTEGER 4
> bytes, but 50% time difference? it just cant be that linear can it?

By doing these same two queries using select count(), you've proven my
original theory that the time difference is due to the volume of bytes
being transmitted to dev/null and NOT because of the actual data types
of the columns or ANY OTHER FACTOR. If you really want, change your pipe
output to two file names, I'm 100% sure you'll find that the 'select n1
...' query results in an output file significantly larger than the
'select e ...' output file. This is where the differing performance
comes from -- the time it is taking to parse and print your output.

The select count(*) does indeed use the index and hence the significant
performance difference.

> fine, if i ask sqlite just to count the rows it wins hands-down, but i
> really want these rows. even more i'd like to then "natural join"
> these rows with a couple of other tables to really do what the perl
> code currently does.
> but, it takes 22 seconds just to create a temp table with the
> required dataset
>
> > time sqlite3 db 'create temp table foo as select * from
> data where a <= 18234721' > /dev/null
> 21.93u 0.89s 0:22.95 99.4%
>
> and i do not understand what i'm doing wrong here :(

Again, the only reason I suggested using count() in your timing test was
to ensure that the command-line sqlite3 program's output was consistent
for both tests and to eliminate dev/null printf's from factoring into
the total time. In your application, you'll call select * (or whatever)
without the count to retrieve the rows -- but since you're not
printf'ing them and instead are doing your own thing with them, you will
indeed see close to identical times in your selects just like you did in
the count() test.

As for the temp table ... I haven't tried this, but isn't "temp" a
reserved word in SQLite? More importantly, you should be doing this
statement inside a transaction. Transactions are critically important in
SQLite. I would go so far as to say NEVER EVER perform ANY bulk write in
SQLite outside a transaction! The performance difference is beyond
phenomenal.

Robert
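Robert's transaction advice pays off when a bulk load consists of many separate INSERT statements; a single CREATE TABLE ... AS SELECT is already one statement and hence one transaction, which is why wrapping it in begin/commit changed nothing for Maksim. A small sketch of the batched-INSERT case (Python sqlite3; file path and table name are illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "tx.db")
con = sqlite3.connect(path, isolation_level=None)  # autocommit mode
con.execute("CREATE TABLE t (x INTEGER)")

# One explicit transaction around the whole batch: one journal commit
# (and one fsync) instead of one per INSERT statement.
con.execute("BEGIN")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
con.execute("COMMIT")

count = con.execute("SELECT count(*) FROM t").fetchone()[0]
print(count)  # 1000
con.close()
```

Without the BEGIN/COMMIT, each INSERT in autocommit mode would force its own journal cycle, which is where the "beyond phenomenal" difference comes from.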
Re: [sqlite] sqlite performance problem
Robert,

> > i guess, i can believe this. however its pretty disappointing to get
> > 50% improvement on 30 times less dataset :(
> >
> > but how do you explain this?
> >
> > sqlite> .schema data
> > CREATE TABLE data
> > (
> >        a  INTEGER,
> >        b  INTEGER,
> >        c  CHAR,
> >        d  INTEGER,
> >        e  INTEGER,
> >        n1 FLOAT,
> >        n2 FLOAT
> > );
> > CREATE INDEX data_by_a ON data (a);
> >
> > > time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> > 26.15u 0.59s 0:27.00 99.0%
> > > time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> > 26.04u 0.61s 0:26.91 99.0%
> >
> > and
> >
> > > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> > 12.22u 0.41s 0:12.67 99.6%
> > > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> > 12.01u 0.61s 0:12.64 99.8%
> >
> > so just by selecting two different rows (both next to each other). i
> > get another 50% time difference? sure the column types are different,
> > and i can even understand that FLOAT might be 8 bytes and INTEGER 4
> > bytes, but 50% time difference? it just cant be that linear can it?
> >
> > do you think it is possible to get better results by issuing multiple
> > queries each of which will return even less dataset? and, no, i'm not
> > a database guy.
>
> The most glaring fault in your time tests that I see is that you're running
> the command-line sqlite3, which is printf'ing all 339,000+ rows to dev/null.
> It doesn't show on your screen, but the work is still being done regardless.

well, i do print these rows to /dev/null in perl test too.
> If you really want a fair speed test, change those statements to:
>
> select count(n1) from data where a <= 18234721

> time sqlite3 db 'select count(n1) from data where a <= 18234721' > /dev/null
7.79u 0.70s 0:08.50 99.8%

> And
>
> select count(e) from data where a <= 18234721

> time sqlite3 db 'select count(e) from data where a <= 18234721' > /dev/null
7.90u 0.42s 0:08.31 100.1%

> Or even
>
> select count(*) from data where a <= 18234721

> time sqlite3 db 'select count(*) from data where a <= 18234721' > /dev/null
1.35u 0.16s 0:01.47 102.7%

8 times faster than count(n1) or count(e)? i'm confused. i guess it just
used the "a" field (on which it had the index?)

> THEN tell us what the difference in performance is ...

fine, if i ask sqlite just to count the rows it wins hands-down, but i
really want these rows. even more, i'd like to then "natural join" these
rows with a couple of other tables to really do what the perl code
currently does.

but, it takes 22 seconds just to create a temp table with the required
dataset

> time sqlite3 db 'create temp table foo as select * from data where a <=
> 18234721' > /dev/null
21.93u 0.89s 0:22.95 99.4%

and i do not understand what i'm doing wrong here :(

thanks,
max
RE: [sqlite] sqlite performance problem
> -----Original Message-----
> From: Maksim Yevmenkin [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 11, 2005 9:59 AM
> To: Christian Smith
> Cc: sqlite-users@sqlite.org
> Subject: Re: [sqlite] sqlite performance problem
>
> i guess, i can believe this. however its pretty disappointing to get
> 50% improvement on 30 times less dataset :(
>
> but how do you explain this?
>
> sqlite> .schema data
> CREATE TABLE data
> (
>        a  INTEGER,
>        b  INTEGER,
>        c  CHAR,
>        d  INTEGER,
>        e  INTEGER,
>        n1 FLOAT,
>        n2 FLOAT
> );
> CREATE INDEX data_by_a ON data (a);
>
> > time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> 26.15u 0.59s 0:27.00 99.0%
> > time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
> 26.04u 0.61s 0:26.91 99.0%
>
> and
>
> > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> 12.22u 0.41s 0:12.67 99.6%
> > time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
> 12.01u 0.61s 0:12.64 99.8%
>
> so just by selecting two different rows (both next to each other). i
> get another 50% time difference? sure the column types are different,
> and i can even understand that FLOAT might be 8 bytes and INTEGER 4
> bytes, but 50% time difference? it just cant be that linear can it?
>
> do you think it is possible to get better results by issuing multiple
> queries each of which will return even less dataset? and, no, i'm not
> a database guy.

The most glaring fault in your time tests that I see is that you're
running the command-line sqlite3, which is printf'ing all 339,000+ rows
to dev/null. It doesn't show on your screen, but the work is still being
done regardless. If you really want a fair speed test, change those
statements to:

select count(n1) from data where a <= 18234721

And

select count(e) from data where a <= 18234721

Or even

select count(*) from data where a <= 18234721

THEN tell us what the difference in performance is ...

Robert
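Robert's point can be checked without the shell at all, by running the same pair of queries through an API and comparing the cost of materializing every row against computing one aggregate. A rough sketch (Python sqlite3, synthetic data; absolute times are meaningless here, only the ratio matters):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (a INTEGER, e INTEGER, n1 FLOAT)")
con.executemany("INSERT INTO data VALUES (?, ?, ?)",
                [(i, i, i * 1.5) for i in range(100000)])
con.execute("CREATE INDEX data_by_a ON data(a)")
con.commit()

def timed(sql):
    # Returns (elapsed seconds, fully materialized result set).
    t0 = time.perf_counter()
    rows = con.execute(sql).fetchall()
    return time.perf_counter() - t0, rows

t_rows, rows = timed("SELECT n1 FROM data WHERE a <= 50000")
t_cnt, cnt = timed("SELECT count(n1) FROM data WHERE a <= 50000")

print(len(rows), cnt[0][0])  # 50001 50001
# Materializing every row costs more than computing one aggregate,
# even before any printf'ing happens:
print(t_rows > t_cnt)
```

This separates the two effects the thread keeps conflating: row retrieval and conversion cost versus output formatting cost.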
Re: [sqlite] sqlite performance problem
Christian,

thanks for the reply.

> >i'm having strange performance problem with sqlite-3.2.0. consider the
> >following table
> >
> > [snip]
> >
> >now the problem:
> >
> >1) if i do a select with an index it takes 27 sec. to get 92 rows
> >
> >> time sqlite3 db 'select n2 from data where a <= 18234721' > /dev/null
> >25.95u 0.71s 0:27.02 98.6%
> >
> >> time sqlite3 db 'select n2 from data where a <= 18234721' > /dev/null
> >26.02u 0.66s 0:27.53 96.9%
> >
> >2) if i do a select with sequential lookup it takes 1min to get 9818210 rows
> >
> >> time sqlite3 db 'select n2 from data where a+0 <= 18234721' > /dev/null
> >49.54u 14.65s 1:04.98 98.7%
> >
> >> time sqlite3 db 'select n2 from data where a+0 <= 18234721' > /dev/null
> >49.80u 14.42s 1:05.03 98.7%
> >
> >- so how come it takes only twice as much time to get 30 times more rows?
>
> When doing an index scan, you will be accessing the index as well as the
> table pages. For a sequential scan, only the table pages are touched,
> reducing thrashing of the cache.

i guess, i can believe this. however its pretty disappointing to get
50% improvement on 30 times less dataset :(

but how do you explain this?

sqlite> .schema data
CREATE TABLE data
(
        a  INTEGER,
        b  INTEGER,
        c  CHAR,
        d  INTEGER,
        e  INTEGER,
        n1 FLOAT,
        n2 FLOAT
);
CREATE INDEX data_by_a ON data (a);

> time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
26.15u 0.59s 0:27.00 99.0%
> time sqlite3 db 'select n1 from data where a <= 18234721' > /dev/null
26.04u 0.61s 0:26.91 99.0%

and

> time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
12.22u 0.41s 0:12.67 99.6%
> time sqlite3 db 'select e from data where a <= 18234721' > /dev/null
12.01u 0.61s 0:12.64 99.8%

so just by selecting two different columns (right next to each other) i
get another 50% time difference? sure the column types are different,
and i can even understand that FLOAT might be 8 bytes and INTEGER 4
bytes, but 50% time difference? it just can't be that linear, can it?
do you think it is possible to get better results by issuing multiple
queries each of which will return an even smaller dataset? and, no, i'm
not a database guy.

> Also, if the data is not in index order in the table, and/or dispersed
> across the database file, you may have to visit each page more than once
> when traversing in index order. In the full table scan, you'll read the
> rows in table order, hence only touching each page once before moving on
> to later pages, thus reducing cache thrashing even more.

all rows were inserted into the table in an order that matches the
index. i can not say if the records on disk will have the same order.

> >- and why is it taking 27 seconds to get 92 rows anyway?
>
> You think 12347 rows/s is bad?

because i have the same data set in a perl multilevel hash stored on
disk (storable format), and it takes about 1 minute to perform the same
query on the same hardware. this time includes reading the data from the
disk, traversing every single key on each hash level, etc. i was hoping
that sqlite would do something like this in just a few (< 10) seconds.

> >- is there any way to create an index in separate file?
>
> No, SQLite is a single file embedded database. Keeps administration
> almost non-existent (by design.)

- so, if i have a large table (> 9 million records) with an index, then
  sqlite has to constantly seek back and forth between index and data
  (within the same file), probably wasting lots of disk bandwidth?

- if the above is correct, then can i force sqlite to get all (or as
  much as possible) of the index into memory?

> >the hardware is sun netra t1 running solaris 5.7. the db file size is
> >about 800 mbytes.
> >
> >just for the record i'd like to have at least 15 times more records in
> >the 'data' table.
>
> If you can match SQLite for simple read-only throughput with another
> database, you'll be doing well.

i do not know about other databases, but compared to plain perl it's not
that impressive. i must be doing something wrong here.
thanks, max
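For reference, the a+0 trick in the first message forces the sequential scan because SQLite will not use an index on column a for a predicate on the expression a+0. A small sketch (Python sqlite3, toy data; the plan strings are from a modern SQLite, so exact wording may differ from 3.2.0) showing the two query plans diverge:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (a INTEGER, n2 FLOAT)")
con.executemany("INSERT INTO data VALUES (?, ?)",
                [(i, i / 2.0) for i in range(1000)])
con.execute("CREATE INDEX data_by_a ON data(a)")

def plan(sql):
    # Flatten the EXPLAIN QUERY PLAN detail strings into one line.
    return " ".join(row[-1] for row in
                    con.execute("EXPLAIN QUERY PLAN " + sql))

# A bare column predicate lets the optimizer use the index...
p1 = plan("SELECT n2 FROM data WHERE a <= 10")
# ...but 'a+0' is an expression, not the indexed column, so SQLite
# falls back to a sequential scan of the table.
p2 = plan("SELECT n2 FROM data WHERE a+0 <= 10")

print("data_by_a" in p1, "data_by_a" in p2)  # True False
```

This is the same mechanism Maksim used to compare the 27-second indexed scan against the 65-second full scan of the whole table.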