Thanks everyone for your answers. I made some changes to the database according to the information you gave me, which improved the performance of the query by about 20% (it now takes 4 seconds instead of 5).
Here is some more information regarding the suggestions I received:

- The version of SQLite I use is the one provided by Debian (current stable: wheezy).
- I need the database to be opened in read-only mode, so I did not use WAL mode.
- The existing indexes cannot be removed because they are used by other queries.
- I did, however, remove unused indexes for the following test results.

Note that my benchmarks are run in batch, with sqlite3 as well as with mysql. I stop and start the mysql daemon to avoid most caching (I hope).

# For mysql, I use:
/etc/init.d/mysql stop; /etc/init.d/mysql start; \
time echo 'SELECT SUM(nbBytesDecompOut+nbBytesCompIn) as vol, portLan \
as item FROM flows WHERE timestamp>=1383770600 AND \
timestamp<=1384770600 AND protocol IN (17, 6) GROUP BY portLan \
ORDER BY vol DESC LIMIT 6;' | mysql testperf

2783898050  33722
1374153827  33736
886842830   39155
655809252   51800
363040479   53153
358988337   59757

real    0m1.067s
user    0m0.000s
sys     0m0.000s

# For sqlite, I use:
time echo 'SELECT SUM(nbBytesDecompOut+nbBytesCompIn) as vol, portLan \
as item FROM flows WHERE timestamp>=1383770600 AND \
timestamp<=1384770600 AND protocol IN (17, 6) GROUP BY portLan \
ORDER BY vol DESC LIMIT 6;' | sqlite3 /var/db/udcast/flow_stats_OA_1M.db

2783898050|33722
1374153827|33736
886842830|39155
655809252|51800
363040479|53153
358988337|59757

real    0m4.405s
user    0m1.812s
sys     0m2.580s

Here is the time spent in the query according to the number of rows matching the WHERE clause (ANALYZE has been run beforehand):

PERIOD (s)   MIN TS       MAX TS         LINES    TIME
------------------------------------------------------
      3600   1384767000   1384770600       35113  0:00.06
      7200   1384763400   1384770600       67611  0:00.11
     21600   1384749000   1384770600      154592  0:00.69
     43200   1384727400   1384770600      270728  0:01.18
     86400   1384684200   1384770600      501871  0:02.20
       all   1383770600   1384770600     1000000  0:04.44

The 20% improvement is nice, but mysql (even without caching) is still far ahead for the moment (4 times faster). Other ideas are welcome. Thanks again!

2015-04-01 12:52 GMT+02:00 GB <gbinfo at web.de>:

> In the case of SELECTing "all available", I recommend invoking a
> different statement without the timestamp part instead of providing
> some min and max values for timestamp. This avoids tricking the query
> planner into some wrong decisions (if you have an index with protocol
> as the first column).
>
> And how about WAL mode? If concurrency is of any concern for you, this
> definitely is something worth a try.
>
> -- GB
>
> Jeff Roux schrieb am 31.03.2015 um 12:48:
>
>> Thanks everyone for the answers.
>>
>> I won't be able to run any tests today; I will come back to you soon
>> with additional information. Just to say that, in the worst case, the
>> WHERE clause selects the entire data set, i.e. 1000000 entries. The
>> user can select a time range of 1 hour, 3 hours, 1 day, and "all
>> available". Note: before being added to the database, the timestamps
>> are aggregated over a 180-second period, and a lot of rows have the
>> same timestamp (~ one row per TCP session).
>>
>> All the columns are defined as INTEGER. There are 41 columns in total
>> in the flows table. If I remember correctly, there is no primary key
>> defined for this table.
>>
>> 2015-03-31 8:32 GMT+02:00 GB <gbinfo at web.de>:
>>
>>> From what I see, I assume that timestamp gives the highest
>>> selectivity. Taking into account that protocol is SELECTed for and
>>> portLan is GROUPed BY, I'd try an index (timestamp, protocol,
>>> portLan) (not sure if portLan helps here, but it's worth a try, I
>>> think). Don't forget to ANALYZE, of course. Are your columns of
>>> INTEGER affinity?
>>> If they are of TEXT affinity, they will store anything as TEXT.
>>> That may make a difference in both space consumption and speed. Is
>>> your SQLite lib built with SQLITE_ENABLE_STAT4 enabled? If not, give
>>> it a try. It sometimes makes a big difference.
>>>
>>> Is it possible that data collection and retrieval happen at the same
>>> time? If so, try running the database in WAL mode; it should help
>>> with concurrency issues.
>>>
>>> -- GB
>>>
>>> Jeff Roux schrieb am 30.03.2015 um 11:46:
>>>
>>>> Hi everyone,
>>>>
>>>> I have a daemon that collects information and stores it in an
>>>> SQLite database. The table has 1 million rows.
>>>>
>>>> This daemon is running on an HP server with 12 cores, 32 GB of RAM,
>>>> and an SSD drive. I have performance issues with some requests. For
>>>> instance, the following request takes more than 5 seconds to
>>>> complete with SQLite3 (in that particular case, the WHERE clause
>>>> selects all the data in the database, i.e. 1000000 rows):
>>>>
>>>> SELECT SUM(nbBytesDecompOut + nbBytesCompIn) as vol, portLan as item
>>>> FROM flows
>>>> WHERE timestamp>=1383770600 AND timestamp<=1384770600
>>>> AND protocol IN (17, 6)
>>>> GROUP BY portLan ORDER BY vol DESC LIMIT 6;
>>>>
>>>> I have done some tests with and without "INDEXED BY" clauses and
>>>> got nearly the same results.
>>>>
>>>> I compared the performance with MySQL, and the same request takes
>>>> less than 1 second to complete.
>>>>
>>>> Could you give me some directions to optimize this kind of request
>>>> with SQLite3 when there is a big amount of data in the table? I
>>>> need to triple the number of rows, and the performance will then
>>>> become unacceptable for my application.
>>>>
>>>> Thanks in advance.
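GB's index suggestion is easy to try in isolation. Below is a minimal sketch (Python's stdlib sqlite3, invented data, and the flows table trimmed to just the five columns the query touches, so everything except the column and table names is an assumption). It extends GB's (timestamp, protocol, portLan) index with the two summed columns so the index *covers* the query, and uses EXPLAIN QUERY PLAN to check that the index is actually picked up:

```python
import random
import sqlite3

# Sketch only: the real flows table has 41 columns; here it is trimmed to
# the five columns the slow query touches, and the data is invented.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE flows (
    timestamp        INTEGER,
    protocol         INTEGER,
    portLan          INTEGER,
    nbBytesDecompOut INTEGER,
    nbBytesCompIn    INTEGER)""")

random.seed(0)
con.executemany(
    "INSERT INTO flows VALUES (?, ?, ?, ?, ?)",
    ((1383770600 + i, random.choice((6, 17, 1)), i % 500, i, i)
     for i in range(100000)))

# GB's (timestamp, protocol, portLan), extended with the two summed
# columns so the query can be answered from the index alone, without
# table lookups. On the real 41-column table this duplicates those five
# columns on disk; that extra space is the trade-off.
con.execute("""CREATE INDEX flows_ts_proto_port_cover
    ON flows (timestamp, protocol, portLan,
              nbBytesDecompOut, nbBytesCompIn)""")
con.execute("ANALYZE")

query = """SELECT SUM(nbBytesDecompOut + nbBytesCompIn) AS vol,
                  portLan AS item
           FROM flows
           WHERE timestamp >= 1383770600 AND timestamp <= 1384770600
             AND protocol IN (17, 6)
           GROUP BY portLan ORDER BY vol DESC LIMIT 6"""

# The plan's detail column should mention a COVERING INDEX.
for row in con.execute("EXPLAIN QUERY PLAN " + query):
    print(row[-1])

top6 = con.execute(query).fetchall()
print(top6)
```

Timing this against the original index layout on the real data set would show whether avoiding the table lookups closes part of the gap to MySQL; the index name `flows_ts_proto_port_cover` is made up here.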
>>>> _______________________________________________
>>>> sqlite-users mailing list
>>>> sqlite-users at mailinglists.sqlite.org
>>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
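Since the database must be opened read-only, one thread-relevant detail worth noting: SQLite's URI filename syntax lets the client enforce that with `mode=ro`, and on newer SQLite versions (newer than what wheezy ships) `immutable=1` can additionally skip all locking and change detection when nothing else can write the file. A minimal sketch with Python's stdlib sqlite3 (path and schema invented for illustration):

```python
import os
import sqlite3
import tempfile

# Create a throwaway database to open read-only afterwards
# (path and schema are invented for this sketch).
path = os.path.join(tempfile.mkdtemp(), "flows.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE flows (timestamp INTEGER, portLan INTEGER)")
rw.execute("INSERT INTO flows VALUES (1383770600, 33722)")
rw.commit()
rw.close()

# mode=ro makes SQLite itself reject writes on this connection.
# On SQLite builds recent enough to support it, appending "&immutable=1"
# would also disable locking entirely (only safe if no other process
# can modify the file while it is open).
ro = sqlite3.connect("file:" + path + "?mode=ro", uri=True)
print(ro.execute("SELECT portLan FROM flows").fetchone())

try:
    ro.execute("INSERT INTO flows VALUES (1, 2)")
except sqlite3.OperationalError as exc:
    print("write rejected:", exc)
```

This does not make the query itself faster by much, but it guarantees the read-only constraint is enforced by SQLite rather than by application discipline.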