Thanks everyone for your answers,

I made some changes to the database according to the information you
gave me. They improved the performance of the query by about 20% (it
now takes 4 seconds instead of 5).

Here is some more information regarding the suggestions I received:

  - The version of SQLite I used is the one provided by Debian (current
    stable: wheezy).

  - I need the database to be opened in read-only mode, so I did not
    use WAL mode.

  - None of the existing indexes can be removed, because they are used
    by other queries.

  - However, I removed the unused indexes for the test results below.

Note that my benchmarks are run in batch mode, with sqlite3 as well as
with mysql. I stop and start the mysql daemon to avoid most caching
effects (I hope).
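For sqlite3 there is no daemon to restart, so the closest equivalent I
can think of is flushing the OS page cache between runs. A
self-contained sketch (it builds a tiny stand-in table in /tmp instead
of touching the real database; the drop_caches step needs root and is
skipped silently otherwise):

```shell
#!/bin/sh
# Stand-in database: the real /var/db/udcast/flow_stats_OA_1M.db has
# 41 INTEGER columns and 1M rows; only the columns used by the query
# are reproduced here.
DB=/tmp/flows_bench.db
rm -f "$DB"
sqlite3 "$DB" 'CREATE TABLE flows(timestamp INTEGER, protocol INTEGER,
  portLan INTEGER, nbBytesCompIn INTEGER, nbBytesDecompOut INTEGER);
INSERT INTO flows VALUES (1383770601,  6, 33722, 100, 200),
                         (1383770602, 17, 33722,  50,  50);'

# SQLite has no daemon to restart; it relies on the OS page cache, so
# flush that instead to measure a cold run (root only; ignored otherwise).
sync
sh -c 'echo 3 > /proc/sys/vm/drop_caches' 2>/dev/null || true

time echo 'SELECT SUM(nbBytesDecompOut+nbBytesCompIn) as vol, portLan
     as item FROM flows WHERE timestamp>=1383770600 AND
     timestamp<=1384770600 AND protocol IN (17, 6) GROUP BY portLan
     ORDER BY vol DESC LIMIT 6;' |
     sqlite3 "$DB"
```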

# For mysql, I use:
/etc/init.d/mysql stop; /etc/init.d/mysql start; \
time echo 'SELECT SUM(nbBytesDecompOut+nbBytesCompIn) as vol, portLan \
     as item FROM flows WHERE timestamp>=1383770600 AND \
     timestamp<=1384770600 AND protocol IN (17, 6) GROUP BY portLan \
     ORDER BY vol DESC LIMIT 6;' |
     mysql testperf

2783898050      33722
1374153827      33736
886842830       39155
655809252       51800
363040479       53153
358988337       59757

real    0m1.067s
user    0m0.000s
sys     0m0.000s


# For sqlite, I use:
time echo 'SELECT SUM(nbBytesDecompOut+nbBytesCompIn) as vol, portLan
     as item FROM flows WHERE timestamp>=1383770600 AND
     timestamp<=1384770600 AND protocol IN (17, 6) GROUP BY portLan
     ORDER BY vol DESC LIMIT 6;' |
     sqlite3 /var/db/udcast/flow_stats_OA_1M.db

2783898050|33722
1374153827|33736
886842830|39155
655809252|51800
363040479|53153
358988337|59757

real    0m4.405s
user    0m1.812s
sys     0m2.580s
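To see which plan SQLite picks for this query, EXPLAIN QUERY PLAN
helps. Here is a sketch against a reduced stand-in table (the schema
subset and the index name are my assumptions; the real table has 41
columns). I deliberately skip ANALYZE here, since statistics from a
two-row toy table could push the planner toward a full scan:

```shell
#!/bin/sh
# Reduced stand-in schema; the real flows table has 41 INTEGER columns.
DB=/tmp/flows_eqp.db
rm -f "$DB"
sqlite3 "$DB" 'CREATE TABLE flows(timestamp INTEGER, protocol INTEGER,
  portLan INTEGER, nbBytesCompIn INTEGER, nbBytesDecompOut INTEGER);
CREATE INDEX idx_flows_ts ON flows(timestamp);
INSERT INTO flows VALUES (1383770601,  6, 33722, 100, 200),
                         (1384000000, 17, 39155,  10,  20);
EXPLAIN QUERY PLAN
SELECT SUM(nbBytesDecompOut+nbBytesCompIn) as vol, portLan as item
FROM flows
WHERE timestamp>=1383770600 AND timestamp<=1384770600
  AND protocol IN (17, 6)
GROUP BY portLan ORDER BY vol DESC LIMIT 6;'
```

On the real database this shows whether the timestamp range is served
by an index search or by a table scan, and whether a temporary B-tree
is built for the GROUP BY and ORDER BY.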

Here is the time spent in the query as a function of the number of
rows matching the WHERE clause (ANALYZE was run beforehand):

PERIOD (s)  MIN TS       MAX TS       LINES    TIME
-------------------------------------------------------
3600        1384767000   1384770600   35113    0:00.06
7200        1384763400   1384770600   67611    0:00.11
21600       1384749000   1384770600   154592   0:00.69
43200       1384727400   1384770600   270728   0:01.18
86400       1384684200   1384770600   501871   0:02.20
all         1383770600   1384770600   1000000  0:04.44


The 20% improvement is nice, but mysql (even with caching minimized)
is still far ahead for the moment, about 4 times faster. Other ideas
are welcome.
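In particular I still need to properly try GB's composite index from
up-thread. A sketch on a stand-in table (the index name is mine;
whether it actually beats the existing timestamp index on the real
1M-row data is exactly what needs measuring):

```shell
#!/bin/sh
# Stand-in table; the real one has 41 INTEGER columns and 1M rows.
DB=/tmp/flows_idx.db
rm -f "$DB"
sqlite3 "$DB" 'CREATE TABLE flows(timestamp INTEGER, protocol INTEGER,
  portLan INTEGER, nbBytesCompIn INTEGER, nbBytesDecompOut INTEGER);
INSERT INTO flows VALUES (1383770601,  6, 33722, 100, 200),
                         (1383770700, 17, 39155,  10,  20);
-- Suggested order: timestamp (range) first, then protocol, then portLan.
CREATE INDEX idx_ts_proto_port ON flows(timestamp, protocol, portLan);
ANALYZE;
SELECT SUM(nbBytesDecompOut+nbBytesCompIn) as vol, portLan as item
FROM flows
WHERE timestamp>=1383770600 AND timestamp<=1384770600
  AND protocol IN (17, 6)
GROUP BY portLan ORDER BY vol DESC LIMIT 6;'
```

A wider index that also includes nbBytesCompIn and nbBytesDecompOut
would be covering for this query (SQLite could answer it from the
index alone, without touching the table rows), at the cost of extra
space; with 41-column rows that might be worth the trade.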

Thanks again!


2015-04-01 12:52 GMT+02:00 GB <gbinfo at web.de>:

> In case of SELECTing "all available" I recommend invoking a different
> statement without the timestamp-part instead of providing some min and max
> values for timestamp. This avoids tricking the query planner into some
> wrong decisions (if you have an index with protocol as the first column).
>
> And how about WAL mode? If concurrency is of any concern for you, this
> definitely is something worth a try.
>
> -- GB
>
>
> Jeff Roux schrieb am 31.03.2015 um 12:48:
>
>> Thanks everyone for the answers.
>>
>> I won't be able to run tests today; I will come back to you soon with
>> additional information. Just to say that, in the worst case, the WHERE
>> clause selects the entire data set, i.e. 1000000 entries. The user can
>> select a time range of 1 hour, 3 hours, 1 day, or "all available".
>> Note: before being added to the database, the timestamps are
>> aggregated on a 180-second period, and many rows have the same
>> timestamp (~ one row per TCP session).
>>
>> All the columns are defined as INTEGER. There are 41 columns in total in
>> the flow table. If I remember correctly, there is no primary key
>> defined for this table.
>>
>> 2015-03-31 8:32 GMT+02:00 GB <gbinfo at web.de>:
>>
>>> From what I see, I assume that timestamp gives the highest selectivity.
>>> Taking into account that protocol is SELECTed for and portLan is GROUPed
>>> BY, I'd try an index (timestamp, protocol, portLan) (not sure if portLan
>>> helps here, but it's worth a try, I think). Don't forget to ANALYZE, of
>>> course. Are your columns of INTEGER affinity? If they are of TEXT,
>>> they will store anything as TEXT, which may make a difference in both
>>> space consumption and speed. Is your SQLite lib built with
>>> SQLITE_ENABLE_STAT4 enabled? If not,
>>> give it a try. It sometimes makes a big difference.
>>>
>>> Is it possible that data collection and retrieval happen at the same
>>> time?
>>> If so, try running the database in WAL mode, it should help with
>>> concurrency issues.
>>>
>>> -- GB
>>>
>>>
>>> Jeff Roux schrieb am 30.03.2015 um 11:46:
>>>
>>>> Hi everyone,
>>>>
>>>> I have a daemon that collects information and stores it in a SQLite
>>>> database. The table has 1 million rows.
>>>>
>>>> This daemon is running on a HP server with 12 cores, 32 GB of RAM,
>>>> and a SSD drive. I have performance issues with some requests. For
>>>> instance, the following request takes more than 5 seconds to
>>>> complete with SQLite3 (in that particular case, the WHERE clause
>>>> selects all the data in the database, i.e. 1000000 rows):
>>>>
>>>> SELECT SUM(nbBytesDecompOut + nbBytesCompIn) as vol, portLan as item
>>>> FROM  flows
>>>> WHERE timestamp>=1383770600 AND timestamp<=1384770600 AND protocol IN
>>>> (17, 6)
>>>> GROUP BY portLan ORDER BY vol DESC LIMIT 6;
>>>>
>>>> I have done some tests with or without "INDEXED BY" clauses and got
>>>> nearly the same results.
>>>>
>>>> I compared the performance with a mySQL and the same request takes
>>>> less than 1 second to accomplish.
>>>>
>>>> Could you give me some directions for optimizing this kind of request
>>>> with SQLite3 when there is a large amount of data in the table? I
>>>> need to triple the number of rows, and the performance will then
>>>> become unacceptable for my application.
>>>>
>>>> Thanks in advance.
>>>> _______________________________________________
>>>> sqlite-users mailing list
>>>> sqlite-users at mailinglists.sqlite.org
>>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
