Christian Smith wrote:
On Wed, 29 Mar 2006, Andy Spencer wrote:
> I have a sqlite database with about 3 GB of data, most of which is stored
> in a data table with about 75 million records, having three columns
> (EntryId INTEGER, PropertyId INTEGER, Value NUMERIC) and
> PRIMARY KEY(EntryId, PropertyId).
> This table is not indexed, to allow faster updates.
It is indexed. The primary key clause creates an implied index on
(EntryId,PropertyId).
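You can confirm the implied index from the catalog. A minimal sketch using Python's sqlite3 module against an in-memory copy of the schema from the thread:

```python
import sqlite3

# Recreate the schema described above (in memory, for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Data (
    EntryId INTEGER, PropertyId INTEGER, Value NUMERIC,
    PRIMARY KEY(EntryId, PropertyId))""")

# The PRIMARY KEY clause creates an automatic index behind the scenes.
indexes = conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type='index' AND tbl_name='Data'").fetchall()
print(indexes)  # [('sqlite_autoindex_Data_1',)]
conn.close()
```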
> The problem is that it takes over an hour to access all Values, for a
> specified PropertyId, when the value is obtained for each EntryId
> separately (using "SELECT Value FROM Data WHERE PropertyId=? AND
> EntryId=?", bound to the specified PropertyId and EntryId) and
> the EntryId values for successive database queries are in essentially
> random order (taken from an external list of entries that has been
> sorted by property values).
> This same query (getting the property value for each EntryId,
> separately) only takes about 7 minutes when the EntryId values for
> successive database queries are in the same ascending order in which
> the data was originally inserted into the table.
Yes. You're accessing the database in about as inefficient a way as
possible for your data, resulting in much thrashing of caches. Under
UNIX, if you're thrashing the OS cache, you can monitor this using vmstat.
> I assume that this has to do with better pager caching of successive
> records in the database, whereas random access may re-read the same
> page multiple times (due to the limited cache).
If you're not thrashing the OS cache (do you have lots of RAM?) try
increasing the size of your SQLite cache. Use:
PRAGMA cache_size=20000;
This will make your cache 10x bigger, and may increase the hit rate.
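A minimal sketch of the pragma in use, via Python's sqlite3 module (20000 pages is the 10x increase over the 2000-page default suggested above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The default page cache is 2000 pages; 20000 pages keeps roughly 10x
# more of the database resident, which helps when lookups arrive in
# random order.
conn.execute("PRAGMA cache_size=20000;")
cache = conn.execute("PRAGMA cache_size;").fetchone()[0]
print(cache)  # 20000
conn.close()
```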
> My question is whether it should be faster to
> A) create an index for the table before the query,
> query the value (for the specified PropertyId) for each EntryId
> (in essentially random order, from external list of entries),
> and delete the index after the queries (for each EntryId) are done
Won't help. You already have an index from the primary key.
> or
> B) issue a single "SELECT EntryId, Value FROM Data WHERE PropertyId=?" query
> (bound to the specified PropertyId) and step through the results,
> using something like a hash table lookup to map the EntryId values
> (returned from the query) back to an index into the external list of
> entries.
This may help, as you'll not be using the primary key index, and thus the
index pages will not be competing with the table pages for memory.
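A sketch of approach B, using Python's sqlite3 module; the sample data and the external entry list are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Data (
    EntryId INTEGER, PropertyId INTEGER, Value NUMERIC,
    PRIMARY KEY(EntryId, PropertyId))""")
# Toy data: 100 entries with 3 properties each, Value = EntryId*10 + PropertyId.
conn.executemany("INSERT INTO Data VALUES (?,?,?)",
                 [(e, p, e * 10 + p) for e in range(100) for p in range(3)])

# External list of entries, sorted by some property value, i.e. the
# EntryIds arrive in essentially random order.
entry_order = [42, 7, 99, 0, 13]
position = {entry_id: i for i, entry_id in enumerate(entry_order)}

# One sequential scan for the whole property, then a hash table lookup
# maps each row back to its slot in the external list.
values = [None] * len(entry_order)
for entry_id, value in conn.execute(
        "SELECT EntryId, Value FROM Data WHERE PropertyId=?", (1,)):
    if entry_id in position:
        values[position[entry_id]] = value

print(values)  # [421, 71, 991, 1, 131]
conn.close()
```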
> The values extracted from the database are to be copied into an entry
> property data structure, having the same order as the external list of
> entries.
If you must group the values by PropertyId rather than EntryId, then
insert them into the database in that order. Is that possible?
That, or increase the amount of RAM you have.
Christian
Andy,
There is a change that you can make that might greatly speed things up
if you are thrashing the cache. Add the Value field to your primary key.
create table Data (
EntryId INTEGER,
PropertyId INTEGER,
Value NUMERIC,
PRIMARY KEY(EntryId, PropertyId, Value)
);
This works because the primary key creates an index (as Christian has
already pointed out), and SQLite has an optimization that causes it to
return data directly from the columns in an index when it can. Since the
Value data is now available in the index, it won't need to read a
block from the main table to return a result for your query.
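A sketch of this, using Python's sqlite3 module; EXPLAIN QUERY PLAN shows the query being satisfied from the index alone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Data (
    EntryId INTEGER, PropertyId INTEGER, Value NUMERIC,
    PRIMARY KEY(EntryId, PropertyId, Value))""")

# Because Value is now part of the key, the implied index contains every
# column the query touches; the plan reports a covering index, meaning
# the main table is never read.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT Value FROM Data "
    "WHERE EntryId=? AND PropertyId=?", (1, 1)).fetchall()
detail = plan[-1][-1]
print(detail)  # mentions COVERING INDEX sqlite_autoindex_Data_1
conn.close()
```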
Since you are currently accessing the index and the data table randomly
(i.e. by randomly ordered EntryIds), you are probably doing two disk reads
(one for the index lookup and one for the value lookup) for each result
row. With this change to the primary key, SQLite will not need to access
the data table at all (since all the data is now duplicated in the
index), so you will do only one disk read per result row. This reduction
in disk reading will also help to reduce the cache thrashing.
Adding the value to the index will only cause an increase of about 16% in
the size of your database, since you are already duplicating the other
two fields in your primary key index.
Another approach is to remove your primary key. If you don't need it to
enforce uniqueness constraints on your data, then you could eliminate the
primary key and change the EntryId column into an INTEGER PRIMARY KEY
column. This primary key is not stored as a separate index table; it is
stored in the key field of the B-tree used to hold the data table.
create table Data (
EntryId INTEGER PRIMARY KEY,
PropertyId INTEGER,
Value NUMERIC
);
Now you truly won't have any indexes, and your inserts and updates will
run as quickly as possible. Also when you search for an EntryId and
PropertyId pair, SQLite will use the index on EntryId to locate the
correct section in the data table quickly, and then it will scan through
the rows with that EntryId sequentially looking for a matching
PropertyId. You didn't say how many properties each entry has, but for
reasonable values this may be faster than a fully indexed lookup, because
it eliminates half of the disk reads and should reduce the cache
thrashing. This will also reduce the size of your database file to about
half of its current size by eliminating the index and storing the
EntryId in the rowid of the Data table.
Your B idea should work as well if you create an index on PropertyId.
Then SQLite can go directly to the correct section of the table using
the index, and scan the index sequentially to locate the data records.
You will have to reorder the returned EntryId and Value pairs to match
your required EntryId order.
This can be combined with my first suggestion by creating an index on all
three columns: PropertyId, EntryId, Value. Now SQLite can return the
result directly from a sequential scan of a small section of the index
and won't need to do any data table lookups. Again, this will lead to a
larger database file, and you will still need to reorder the results,
but it should be about as fast as possible.
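A sketch of this combined approach, via Python's sqlite3 module (the index name DataByProperty and the sample data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Data (
    EntryId INTEGER, PropertyId INTEGER, Value NUMERIC,
    PRIMARY KEY(EntryId, PropertyId))""")
# Hypothetical three-column index so the whole-property query is covered.
conn.execute(
    "CREATE INDEX DataByProperty ON Data (PropertyId, EntryId, Value)")
conn.executemany("INSERT INTO Data VALUES (?,?,?)",
                 [(e, p, e * 10 + p) for e in range(50) for p in range(3)])

# The query is answered from one contiguous stretch of the index,
# with no lookups into the main table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT EntryId, Value FROM Data WHERE PropertyId=?", (2,)).fetchall()
detail = plan[-1][-1]
print(detail)  # mentions COVERING INDEX DataByProperty

rows = conn.execute(
    "SELECT EntryId, Value FROM Data WHERE PropertyId=?", (2,)).fetchall()
print(len(rows))  # 50
conn.close()
```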
HTH
Dennis Cote