Re: [sqlite] Improving query performance

2009-04-02 Thread John Elrick
D. Richard Hipp wrote:
> On Apr 1, 2009, at 2:00 PM, John Elrick wrote:
>   
>> explain query plan
>> select DISTINCT  RESPONSES.RESPONSE_OID
>> from DATA_ELEMENTS, RESPONSES, SEQUENCE_ELEMENTS
>> where
>> SEQUENCE_ELEMENTS.SEQUENCE_ELEMENT_NAME = :sequence_element_name and
>> DATA_ELEMENTS.DATA_ELEMENT_NAME = :data_element_name and
>> RESPONSES.instance_parent = SEQUENCE_ELEMENTS.SEQUENCE_ELEMENT_OID and
>> RESPONSES.definition_parent = DATA_ELEMENTS.DATA_ELEMENT_OID
>>
>> order  from  detail
>> 0      0     TABLE DATA_ELEMENTS WITH INDEX data_element_name_idx
>> 1      2     TABLE SEQUENCE_ELEMENTS WITH INDEX sequence_element_name_idx
>> 2      1     TABLE RESPONSES
>> 
>
>
> The index is not being used on the RESPONSES table because your WHERE  
> clause constraint is comparing a TEXT column (instance_parent) against  
> an INTEGER column (sequence_element_oid).  The rules of SQLite are  
> that this requires a NUMERIC comparison, but the index is constructed  
> using a TEXT collation and so the index cannot be used.
>
> Various workarounds:
>
> (1) redefine RESPONSES.INSTANCE_PARENT to be type INTEGER.  (Do the  
> same with RESPONSES.definition_parent).
>
> (2) Add a "+" sign in front of sequence_element_oid in the where clause:
>
>   ... instance_parent = +sequence_element_oid...
>
> This will force the RHS of the expression to be an expression rather  
> than a column name.  That will force the use of TEXT collating for the  
> comparison, and thus make the index usable.
>
> (3) Cast the integer on the RHS to text:
>
>  ... instance_parent = CAST(sequence_element_oid AS varchar) ...
>
> Seems like (1) is probably the right fix, but any of these three will  
> work.

Good heavens Richard!  I didn't notice.  There is no reason they can't 
be integers and I honestly thought they were.  Thanks very much.


John
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Improving query performance

2009-04-01 Thread D. Richard Hipp

On Apr 1, 2009, at 2:00 PM, John Elrick wrote:
>
> explain query plan
> select DISTINCT  RESPONSES.RESPONSE_OID
> from DATA_ELEMENTS, RESPONSES, SEQUENCE_ELEMENTS
> where
> SEQUENCE_ELEMENTS.SEQUENCE_ELEMENT_NAME = :sequence_element_name and
> DATA_ELEMENTS.DATA_ELEMENT_NAME = :data_element_name and
> RESPONSES.instance_parent = SEQUENCE_ELEMENTS.SEQUENCE_ELEMENT_OID and
> RESPONSES.definition_parent = DATA_ELEMENTS.DATA_ELEMENT_OID
>
> order  from  detail
> 0      0     TABLE DATA_ELEMENTS WITH INDEX data_element_name_idx
> 1      2     TABLE SEQUENCE_ELEMENTS WITH INDEX sequence_element_name_idx
> 2      1     TABLE RESPONSES


The index is not being used on the RESPONSES table because your WHERE  
clause constraint is comparing a TEXT column (instance_parent) against  
an INTEGER column (sequence_element_oid).  The rules of SQLite are  
that this requires a NUMERIC comparison, but the index is constructed  
using a TEXT collation and so the index cannot be used.

Various workarounds:

(1) redefine RESPONSES.INSTANCE_PARENT to be type INTEGER.  (Do the  
same with RESPONSES.definition_parent).

(2) Add a "+" sign in front of sequence_element_oid in the where clause:

  ... instance_parent = +sequence_element_oid...

This will force the RHS of the expression to be an expression rather  
than a column name.  That will force the use of TEXT collating for the  
comparison, and thus make the index usable.

(3) Cast the integer on the RHS to text:

 ... instance_parent = CAST(sequence_element_oid AS varchar) ...

Seems like (1) is probably the right fix, but any of these three will  
work.
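The affinity effect can be sketched with Python's sqlite3 module. The schema below is a hypothetical minimal stand-in for the tables in the thread, kept just small enough to show the query plan changing:

```python
import sqlite3

# Hypothetical minimal schema: instance_parent is TEXT, the OID is INTEGER.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE responses (response_oid INTEGER, instance_parent TEXT);
    CREATE INDEX resp_parent_idx ON responses(instance_parent);
    CREATE TABLE sequence_elements (sequence_element_oid INTEGER);
""")

def plan(sql):
    # Join the 'detail' column of EXPLAIN QUERY PLAN output into one string.
    return " | ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# TEXT column compared against an INTEGER column forces a numeric
# comparison, so the TEXT-collated index cannot be used.
mismatched = plan("SELECT response_oid FROM responses, sequence_elements"
                  " WHERE instance_parent = sequence_element_oid")

# Casting the INTEGER side to text makes the index usable again.
fixed = plan("SELECT response_oid FROM responses, sequence_elements"
             " WHERE instance_parent = CAST(sequence_element_oid AS TEXT)")

print("mismatched:", mismatched)
print("fixed:     ", fixed)
```

The fixed plan should show a SEARCH on `responses` using `resp_parent_idx`, while the mismatched one falls back to scanning `responses`.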

D. Richard Hipp
d...@hwaci.com





Re: [sqlite] improving query performance

2006-03-31 Thread Dennis Cote
On 3/30/06, Dennis Cote <[EMAIL PROTECTED]> wrote:
>
>
> Another approach is to remove your primary key. If you don't need it to
> enforce uniqueness constraints on your data then you could eliminate the
> primary key, and change the EntryId column into an integer primary key
> column. This primary key is not stored as a separate index table, it is
> stored in the key fields of the Btree used to hold the data table.
>
> create table Data (
> EntryId INTEGER PRIMARY KEY,
> PropertyId INTEGER,
> Value NUMERIC
> );
>
> Now you truly won't have any indexes, and your inserts and updates will
> run as quickly as possible. Also when you search for an EntryId and
> PropertyId pair, SQLite will use the index on EntryId to locate the
> correct section in the data table quickly, and then it will scan through
> the rows with that EntryId sequentially looking for a matching
> PropertyId. You didn't say how many properties each entry has, but for
> reasonable values this may be faster than fully indexed lookup because
> it eliminates half of the disk reads and should reduce the cache
> thrashing. This will also reduce the size of your database file to about
> half of its current size by eliminating the index and storing the
> EntryId in the rowid of the Data table.


Sorry about the self-reply, but I got to thinking about my suggestion later
and realized that this part was complete crap! Integer primary key fields
must be unique, so this could only store one property for each entry. Please
ignore this blather.

Dennis Cote


Re: [sqlite] improving query performance

2006-03-30 Thread Jay Sprenkle
On 3/30/06, Andy Spencer <[EMAIL PROTECTED]> wrote:
>
> Yes. I tried adding an index to the table, after the data had been
> imported and prior to fetching the entry property values, and the
> construction of the index took longer than it had taken previously to
> fetch all of the property values.

The only way to significantly speed up extracting the data is to add the
right indexes to the table. You pay a small penalty for each insert or
update in exchange for speed of retrieval, so you pay the price in small
increments instead of in one big lump. Users usually won't notice a few
milliseconds lost here and there, but they certainly notice the hour spent
waiting for an answer ;)


Re: [sqlite] improving query performance

2006-03-30 Thread Andy Spencer
On Thu, 30 Mar 2006, Christian Smith wrote:

> On Wed, 29 Mar 2006, Andy Spencer wrote:
> 
> >I have a sqlite database with about 3 GB of data, most of which is stored
> >in a data table with about 75 million records, having three columns
> >(EntryId INTEGER, PropertyId INTEGER, Value NUMERIC) and
> >PRIMARY KEY(EntryId, PropertyId).
> >
> >This table is not indexed, to allow faster updates.
> 
> 
> It is indexed. The primary key clause creates an implied index on
> (EntryId,PropertyId).
> 

You are right. I verified this with:

SQL statement: PRAGMA index_list(Data)
(seq) = (0)
(name) = (sqlite_autoindex_Data_1)
(unique) = (1)

SQL statement: PRAGMA index_info(sqlite_autoindex_Data_1)
(seqno) = (0)
(cid) = (0)
(name) = (EntryId)

(seqno) = (1)
(cid) = (1)
(name) = (PropertyId)
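The same introspection can be reproduced in a few lines with Python's sqlite3 module (a hypothetical fresh copy of the Data table; the autoindex name follows SQLite's standard `sqlite_autoindex_<table>_<n>` convention):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Data (
    EntryId    INTEGER,
    PropertyId INTEGER,
    Value      NUMERIC,
    PRIMARY KEY(EntryId, PropertyId))""")

# PRAGMA index_list shows one automatic, unique index backing the PK.
indexes = [row[1] for row in con.execute("PRAGMA index_list(Data)")]
print(indexes)

# PRAGMA index_info shows it covers (EntryId, PropertyId), in that order.
cols = [row[2] for row in con.execute(
    "PRAGMA index_info(sqlite_autoindex_Data_1)")]
print(cols)
```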


> 
> >
> >The problem is that it takes over an hour to access all Values, for a
> >specified PropertyId, when the value is obtained for each EntryId
> >separately (using "SELECT Value FROM Data WHERE PropertyId=? AND
> >EntryId=?", bound to the specified PropertyId and EntryId) and
> >the EntryId values for successive database queries are in essentially
> >random order (taken from an external list of entries that has been
> >sorted by property values).
> >
> >This same query (getting the property value for each EntryId,
> >separately) only takes about 7 minutes when the EntryId values for
> >successive database queries are in the same ascending order as
> >the data originally inserted into the table.
> 
> 
> Yes. You're accessing the database in about as inefficient a way as is
> possible with your data, resulting in much thrashing of caches. Under
> UNIX, if you're thrashing the OS cache, you can monitor this using vmstat.
> 
> 
> >
> >I assume that this has to do with better pager caching of successive
> >records in the database, whereas random access may re-read the same
> >page multiple times (due to the limited cache).
> 
> 
> If you're not thrashing the OS cache (do you have lots of RAM?) try
> increasing the size of your SQLite cache. Use:
> 
> PRAGMA cache_size=20000;
> 
> This will make your cache 10x bigger, and may increase hit rate.
> 

SQL statement: PRAGMA page_size
(page_size) = (1024)

SQL statement: PRAGMA default_cache_size
(cache_size) = (2000)

SQL statement: PRAGMA cache_size
(cache_size) = (2000)

So, our cached data is currently 2000 pages * 1024 bytes/page = 2 MB.

I think you are right, that it makes sense to increase the number of
cached pages, rather than the page size.
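Inspecting and raising the cache from code looks like this (Python's sqlite3; note the defaults are version-dependent, and modern SQLite reports a negative cache_size meaning a size in KiB rather than pages):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Current defaults; values differ between SQLite versions.
print(con.execute("PRAGMA page_size").fetchone()[0])
print(con.execute("PRAGMA cache_size").fetchone()[0])

# Grow the cache to 20000 pages for this connection only.
con.execute("PRAGMA cache_size = 20000")
cache_pages = con.execute("PRAGMA cache_size").fetchone()[0]
print(cache_pages)   # 20000
```

The setting lasts for the lifetime of the connection; `default_cache_size` (in older SQLite versions) persisted it in the database file.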

> 
> >
> >My question is whether it should be faster to
> >
> >A) create an index for the table before the query,
> >   query the value (for the specified PropertyId) for each EntryId
> >   (in essentially random order, from external list of entries),
> >   and delete the index after the queries (for each EntryId) are done
> 
> 
> Won't help. You already have an index from the primary key.
> 

Yes. I tried adding an index to the table, after the data had been
imported and prior to fetching the entry property values, and the
construction of the index took longer than it had taken previously to
fetch all of the property values.

> 
> >
> >or
> >
> >B) issue a single "SELECT EntryId, Value FROM Data WHERE PropertyId=?" query
> >   (bound to the specified PropertyId) and step through the results,
> >   using something like a hash table lookup to map the EntryId values
> >   (returned from the query) back to an index into the external list of
> >   entries.
> 
> 
> This may help, as you'll not be using the primary key index, and thus the
> index pages will not be competing with the table pages for memory.
> 
> 
> >
> >The values extracted from the database are to be copied into an entry
> >property data structure, having the same order as the external list of
> >entries.
> >
> 
> If you must group the values by PropertyId rather than EntryId, then
> insert them into the database in that order. Is that possible?
> 

During import, the data arrives entry by entry, with each entry having
multiple property values (for possibly same or different properties
as previous entries).

It is not feasible to group the incoming data by PropertyId, and new values
may be added (or replace existing values) in the table at any time.

> That, or increase the amount of RAM you have.
> 
> Christian

Increasing RAM is probably not an option for me or for many of the
customers who will be using the software.

Thank you for your suggestions.

--
Andy Spencer/ Schrodinger, Inc./ [EMAIL PROTECTED] 




Re: [sqlite] improving query performance

2006-03-30 Thread Dennis Cote

Christian Smith wrote:

> On Wed, 29 Mar 2006, Andy Spencer wrote:
>
> > I have a sqlite database with about 3 GB of data, most of which is stored
> > in a data table with about 75 million records, having three columns
> > (EntryId INTEGER, PropertyId INTEGER, Value NUMERIC) and
> > PRIMARY KEY(EntryId, PropertyId).
> >
> > This table is not indexed, to allow faster updates.
>
> It is indexed. The primary key clause creates an implied index on
> (EntryId,PropertyId).
>
> > The problem is that it takes over an hour to access all Values, for a
> > specified PropertyId, when the value is obtained for each EntryId
> > separately (using "SELECT Value FROM Data WHERE PropertyId=? AND
> > EntryId=?", bound to the specified PropertyId and EntryId) and
> > the EntryId values for successive database queries are in essentially
> > random order (taken from an external list of entries that has been
> > sorted by property values).
> >
> > This same query (getting the property value for each EntryId,
> > separately) only takes about 7 minutes when the EntryId values for
> > successive database queries are in the same ascending order as
> > the data originally inserted into the table.
>
> Yes. You're accessing the database in about as inefficient a way as is
> possible with your data, resulting in much thrashing of caches. Under
> UNIX, if you're thrashing the OS cache, you can monitor this using vmstat.
>
> > I assume that this has to do with better pager caching of successive
> > records in the database, whereas random access may re-read the same
> > page multiple times (due to the limited cache).
>
> If you're not thrashing the OS cache (do you have lots of RAM?) try
> increasing the size of your SQLite cache. Use:
>
> PRAGMA cache_size=20000;
>
> This will make your cache 10x bigger, and may increase hit rate.
>
> > My question is whether it should be faster to
> >
> > A) create an index for the table before the query,
> >    query the value (for the specified PropertyId) for each EntryId
> >    (in essentially random order, from external list of entries),
> >    and delete the index after the queries (for each EntryId) are done
>
> Won't help. You already have an index from the primary key.
>
> > or
> >
> > B) issue a single "SELECT EntryId, Value FROM Data WHERE PropertyId=?" query
> >    (bound to the specified PropertyId) and step through the results,
> >    using something like a hash table lookup to map the EntryId values
> >    (returned from the query) back to an index into the external list of
> >    entries.
>
> This may help, as you'll not be using the primary key index, and thus the
> index pages will not be competing with the table pages for memory.
>
> > The values extracted from the database are to be copied into an entry
> > property data structure, having the same order as the external list of
> > entries.
>
> If you must group the values by PropertyId rather than EntryId, then
> insert them into the database in that order. Is that possible?
>
> That, or increase the amount of RAM you have.
>
> Christian

Andy,

There is a change that you can make that might greatly speed things up 
if you are thrashing the cache. Add the Value field to your primary key.


create table Data (
   EntryId INTEGER,
   PropertyId INTEGER,
   Value NUMERIC,
   PRIMARY KEY(EntryId, PropertyId, Value)
);

This works because the primary key creates an index (as Christian has 
already pointed out), and SQLite has an optimization that causes it to 
return data directly from the columns in an index when it can. Since the 
Value data is now available in the index, it won't need to read in a 
block from the main table to return a result for your query.


Since you are currently accessing the index and the data table randomly 
(i.e. by randomly ordered EntryIds), you are probably doing two disk reads 
(one for the index lookup and one for the value lookup) for each result 
row. With this change to the primary key, SQLite will not need to access 
the data table at all (since all the data is now duplicated in the 
index), so you will only do one disk read per result row. This reduction 
in disk reading will also help to reduce the cache thrashing.


Adding the value to the index will only cause an increase of about 16% in 
the size of your database, since you are already duplicating the other 
two fields in your primary key index.


Another approach is to remove your primary key. If you don't need it to 
enforce uniqueness constraints on your data then you could eliminate the 
primary key, and change the EntryId column into an integer primary key 
column. This primary key is not stored as a separate index table, it is 
stored in the key fields of the Btree used to hold the data table.


create table Data (
   EntryId INTEGER PRIMARY KEY,
   PropertyId INTEGER,
   Value NUMERIC
);

Now you truly won't have any indexes, and your inserts and updates will 
run as quickly as possible. Also when you search for an EntryId and 
PropertyId pair, SQLite will use the index on EntryId to locate the 
correct section in the data table quickly, and then it will scan through 
the rows with that EntryId sequentially looking for a matching 
PropertyId. You didn't say how many properties each entry has, but for 
reasonable values this may be faster than fully indexed lookup because 
it eliminates half of the disk reads and should reduce the cache 
thrashing. This will also reduce the size of your database file to about 
half of its current size by eliminating the index and storing the 
EntryId in the rowid of the Data table.

RE: [sqlite] improving query performance

2006-03-30 Thread Dan Petitt
I had a similar problem, importing a lot of data into a database (import
very infrequently but read a lot) and then accessing it. With about 6 million
rows it was taking 12 hours to get halfway through importing, so I gave up.

These are the things that massively helped me:
* Increased default_cache_size to 60000 (people have a lot of RAM now on
non-embedded systems, so why not use it)
* Increased page_size to 32768 (maximum amount)
* Set synchronous = OFF (big difference, data integrity was not important to
us)
* Set temp_store = MEMORY (not sure of the usefulness of this, but used it
anyway)
* Set auto_vacuum = 0 (not deleting anything so probably not useful, set it
anyway)
* Using prepared statements cut import down by more than 50%

I also found that if I was inserting into a multi-indexed table with "on
conflict ignore", it was actually a lot quicker to search for the record and
only insert it if it didn't exist than to rely on insert/ignore failure.
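A rough sketch of that import recipe with Python's sqlite3 (the pragma choices mirror the list above; the row counts and values are illustrative, not Dan's actual data):

```python
import sqlite3

con = sqlite3.connect(":memory:")          # a file path in practice
con.execute("PRAGMA synchronous = OFF")    # trade crash-safety for speed
con.execute("PRAGMA temp_store = MEMORY")
con.execute("CREATE TABLE Data "
            "(EntryId INTEGER, PropertyId INTEGER, Value NUMERIC)")

# Illustrative data: 1000 entries x 10 properties.
rows = ((e, p, e + p / 10.0) for e in range(1000) for p in range(10))

# One transaction plus executemany, which reuses a prepared statement.
with con:
    con.executemany("INSERT INTO Data VALUES (?, ?, ?)", rows)

count = con.execute("SELECT COUNT(*) FROM Data").fetchone()[0]
print(count)   # 10000
```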

Hope this helps.




-Original Message-
From: Christian Smith [mailto:[EMAIL PROTECTED] 
Sent: 30 March 2006 15:54
To: sqlite-users@sqlite.org
Cc: Subhash Mangipudi; Herc Silverstein
Subject: Re: [sqlite] improving query performance

On Wed, 29 Mar 2006, Andy Spencer wrote:

>I have a sqlite database with about 3 GB of data, most of which is stored
>in a data table with about 75 million records, having three columns
>(EntryId INTEGER, PropertyId INTEGER, Value NUMERIC) and
>PRIMARY KEY(EntryId, PropertyId).
>
>This table is not indexed, to allow faster updates.


It is indexed. The primary key clause creates an implied index on
(EntryId,PropertyId).


>
>The problem is that it takes over an hour to access all Values, for a
>specified PropertyId, when the value is obtained for each EntryId
>separately (using "SELECT Value FROM Data WHERE PropertyId=? AND
>EntryId=?", bound to the specified PropertyId and EntryId) and
>the EntryId values for successive database queries are in essentially
>random order (taken from an external list of entries that has been
>sorted by property values).
>
>This same query (getting the property value for each EntryId,
>separately) only takes about 7 minutes when the EntryId values for
>successive database queries are in the same ascending order as
>the data originally inserted into the table.


Yes. You're accessing the database in about as inefficient a way as is
possible with your data, resulting in much thrashing of caches. Under
UNIX, if you're thrashing the OS cache, you can monitor this using vmstat.


>
>I assume that this has to do with better pager caching of successive
>records in the database, whereas random access may re-read the same
>page multiple times (due to the limited cache).


If you're not thrashing the OS cache (do you have lots of RAM?) try
increasing the size of your SQLite cache. Use:

PRAGMA cache_size=20000;

This will make your cache 10x bigger, and may increase hit rate.


>
>My question is whether it should be faster to
>
>A) create an index for the table before the query,
>   query the value (for the specified PropertyId) for each EntryId
>   (in essentially random order, from external list of entries),
>   and delete the index after the queries (for each EntryId) are done


Won't help. You already have an index from the primary key.


>
>or
>
>B) issue a single "SELECT EntryId, Value FROM Data WHERE PropertyId=?"
query
>   (bound to the specified PropertyId) and step through the results,
>   using something like a hash table lookup to map the EntryId values
>   (returned from the query) back to an index into the external list of
>   entries.


This may help, as you'll not be using the primary key index, and thus the
index pages will not be competing with the table pages for memory.


>
>The values extracted from the database are to be copied into an entry
>property data structure, having the same order as the external list of
>entries.
>

If you must group the values by PropertyId rather than EntryId, then
insert them into the database in that order. Is that possible?

That, or increase the amount of RAM you have.

Christian


-- 
/"\
\ /ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
 X   - AGAINST MS ATTACHMENTS
/ \




Re: [sqlite] improving query performance

2006-03-30 Thread Christian Smith
On Wed, 29 Mar 2006, Andy Spencer wrote:

>I have a sqlite database with about 3 GB of data, most of which is stored
>in a data table with about 75 million records, having three columns
>(EntryId INTEGER, PropertyId INTEGER, Value NUMERIC) and
>PRIMARY KEY(EntryId, PropertyId).
>
>This table is not indexed, to allow faster updates.


It is indexed. The primary key clause creates an implied index on
(EntryId,PropertyId).


>
>The problem is that it takes over an hour to access all Values, for a
>specified PropertyId, when the value is obtained for each EntryId
>separately (using "SELECT Value FROM Data WHERE PropertyId=? AND
>EntryId=?", bound to the specified PropertyId and EntryId) and
>the EntryId values for successive database queries are in essentially
>random order (taken from an external list of entries that has been
>sorted by property values).
>
>This same query (getting the property value for each EntryId,
>separately) only takes about 7 minutes when the EntryId values for
>successive database queries are in the same ascending order as
>the data originally inserted into the table.


Yes. You're accessing the database in about as inefficient a way as is
possible with your data, resulting in much thrashing of caches. Under
UNIX, if you're thrashing the OS cache, you can monitor this using vmstat.


>
>I assume that this has to do with better pager caching of successive
>records in the database, whereas random access may re-read the same
>page multiple times (due to the limited cache).


If you're not thrashing the OS cache (do you have lots of RAM?) try
increasing the size of your SQLite cache. Use:

PRAGMA cache_size=20000;

This will make your cache 10x bigger, and may increase hit rate.


>
>My question is whether it should be faster to
>
>A) create an index for the table before the query,
>   query the value (for the specified PropertyId) for each EntryId
>   (in essentially random order, from external list of entries),
>   and delete the index after the queries (for each EntryId) are done


Won't help. You already have an index from the primary key.


>
>or
>
>B) issue a single "SELECT EntryId, Value FROM Data WHERE PropertyId=?" query
>   (bound to the specified PropertyId) and step through the results,
>   using something like a hash table lookup to map the EntryId values
>   (returned from the query) back to an index into the external list of
>   entries.


This may help, as you'll not be using the primary key index, and thus the
index pages will not be competing with the table pages for memory.
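Option (B) above can be sketched like this in Python's sqlite3 (the table contents and the external entry order are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Data (EntryId INTEGER, PropertyId INTEGER, "
            "Value NUMERIC, PRIMARY KEY(EntryId, PropertyId))")
con.executemany("INSERT INTO Data VALUES (?, ?, ?)",
                [(e, 7, e * 1.5) for e in range(1, 6)])

# External list of entries, sorted by some property value externally.
entry_order = [4, 1, 5, 2, 3]
slot = {entry_id: i for i, entry_id in enumerate(entry_order)}

# One query for the whole property; scatter results into place via the map.
values = [None] * len(entry_order)
for entry_id, value in con.execute(
        "SELECT EntryId, Value FROM Data WHERE PropertyId = ?", (7,)):
    values[slot[entry_id]] = value

print(values)   # aligned with entry_order: [6.0, 1.5, 7.5, 3.0, 4.5]
```

The single sequential pass avoids one random index-then-table lookup per entry, which is exactly the cache-thrashing pattern described above.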


>
>The values extracted from the database are to be copied into an entry
>property data structure, having the same order as the external list of
>entries.
>

If you must group the values by PropertyId rather than EntryId, then
insert them into the database in that order. Is that possible?

That, or increase the amount of RAM you have.

Christian


-- 
/"\
\ /ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
 X   - AGAINST MS ATTACHMENTS
/ \


Re: [sqlite] improving query performance

2006-03-30 Thread Jay Sprenkle
On 3/29/06, Andy Spencer <[EMAIL PROTECTED]> wrote:
> I have a sqlite database with about 3 GB of data, most of which is stored
> in a data table with about 75 million records, having three columns
> (EntryId INTEGER, PropertyId INTEGER, Value NUMERIC) and
> PRIMARY KEY(EntryId, PropertyId).
>
> This table is not indexed, to allow faster updates.

Why not index the table? The overhead for updates is fairly small.


>
> The problem is that it takes over an hour to access all Values, for a
> specified PropertyId, when the value is obtained for each EntryId
> separately (using "SELECT Value FROM Data WHERE PropertyId=? AND
> EntryId=?", bound to the specified PropertyId and EntryId) and
> the EntryId values for successive database queries are in essentially
> random order (taken from an external list of entries that has been
> sorted by property values).
>
> This same query (getting the property value for each EntryId,
> separately) only takes about 7 minutes when the EntryId values for
> successive database queries are in the same ascending order as
> the data originally inserted into the table.
>
> I assume that this has to do with better pager caching of successive
> records in the database, whereas random access may re-read the same
> page multiple times (due to the limited cache).
>
> My question is whether it should be faster to
>
> A) create an index for the table before the query,
>query the value (for the specified PropertyId) for each EntryId
>(in essentially random order, from external list of entries),
>and delete the index after the queries (for each EntryId) are done
>
> or
>
> B) issue a single "SELECT EntryId, Value FROM Data WHERE PropertyId=?" query
>(bound to the specified PropertyId) and step through the results,
>using something like a hash table lookup to map the EntryId values
>(returned from the query) back to an index into the external list of
>entries.
>
> The values extracted from the database are to be copied into an entry
> property data structure, having the same order as the external list of
> entries.

If raw blazing data-collection speed is required, try a data-warehouse
approach. Collect your data to flat files (this would be even faster than
using sqlite without indexes). Then import that data into a database without
indexes. Then create the index (this is faster than a mass insert into an
indexed table). Then do all your reporting from the database.
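A toy sketch of that pipeline in Python (file layout, schema, and row counts are illustrative):

```python
import csv
import os
import sqlite3
import tempfile

# 1) Collect to a flat file: append-only writes, no index maintenance.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for e in range(100):
        writer.writerow((e, e % 7, e * 0.5))

# 2) Bulk-load into an unindexed table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Data "
            "(EntryId INTEGER, PropertyId INTEGER, Value NUMERIC)")
with open(path, newline="") as f, con:
    con.executemany("INSERT INTO Data VALUES (?, ?, ?)", csv.reader(f))

# 3) Build the index once, after the load, then report.
con.execute("CREATE INDEX data_prop_idx ON Data(PropertyId, EntryId)")

n = con.execute(
    "SELECT COUNT(*) FROM Data WHERE PropertyId = 0").fetchone()[0]
print(n)   # 15 entries have EntryId divisible by 7
```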



---
SqliteImporter: Command line fixed and delimited text import.
http://www.reddawn.net/~jsprenkl/Sqlite