Re: [sqlite] trying to optimize left outer join for large data set, (multiple indices needed?)

2009-06-09 Thread Simon Slavin

On 10 Jun 2009, at 5:15am, Elizabeth Purdom wrote:

> This seems to
> also imply to me that if I have inequalities in my query on c and d,
> that a index (a,b,c,d) would be *used* but only for the parts dealing
> with a,b, and c and that d would be manually scanned over. So if I do
> EXPLAIN QUERY, it will say the index is being used, but won't tell  
> me if
> it is being used for a,b,c, *and* d or just a,b, and c. This is what I
> make of this, but I'm not sure if this is correct.

You've got the right idea, but even just using that index for a and b  
will much reduce the amount of records that have to be iterated down  
to something a computer can do in a split second.  And by including c  
and d in the index you ensure that the library doesn't have to fetch  
those values from the record data: it can see all the values it needs  
in the index it's already looking at.  This can save the SELECT  
command quite a lot of time and effort fetching the record data as it  
does the 'JOIN'.

> So I don't what do do: can I can keep my same query, make separate
> indices (chr,strand,start) and (chr,strand,stop) and have them both be
> used, or do I need to rewrite my query into two parts so that the two
> indexes can be used and if so how?

Can I urge you first to try your query with the obvious index and find  
out whether it's too slow ?  The sort of stuff we're discussing here  
is useful in a highly technical context, but you almost never have to  
worry about that level of detail in real life.  You're doing the same  
thing most programmers do to start off with: trying to wring every  
possible millisecond out of each query whether it's worth it or not.   
If you've tried the simple and easy job of having SQLite handle the  
job, and it's not fast enough, then's the time to consider the options  
you listed.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] trying to optimize left outer join for large data set, (multiple indices needed?)

2009-06-09 Thread Elizabeth Purdom
Hi,
I appreciate the tips about timing the indexing and the order of the 
variables. However, I am particularly trying to address the following 
documentation on the SQLite page entitled 'The SQLite Query Optimizer 
Overview' (http://www.sqlite.org/optoverview.html).

There they say:

>  If an index is created using a statement like this:
> 
>   CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z);
> 
> Then the index might be used if the initial columns of the index (columns a, 
> b, and so forth) appear in WHERE clause terms. All index columns must be used 
> with the = or IN operators except for the right-most column which can use 
> inequalities. For the right-most column of an index that is used, there can 
> be up to two inequalities that must sandwich the allowed values of the column 
> between two extremes.
 >
>  It is not necessary for every column of an index to appear in a WHERE clause 
> term in order for that index to be used. But there can not be gaps in the 
> columns of the index that are used. Thus for the example index above, if 
> there is no WHERE clause term that constraints column c, then terms that 
> constraint columns a and b can be used with the index but not terms that 
> constraint columns d through z. Similarly, no index column will be used (for 
> indexing purposes) that is to the right of a column that is constrained only 
> by inequalities. For the index above and WHERE clause like this:
> 
>   ... WHERE a=5 AND b IN (1,2,3) AND c>12 AND d='hello'
> 
> Only columns a, b, and c of the index would be usable. The d column would not 
> be usable because it occurs to the right of c and c is constrained only by 
> inequalities.

I have inequalities on both 'start' and 'stop', which would seem to 
imply 1) chr and strand must come first and 2) start and stop in the 
same index won't help. My understanding is that the 'ON ...' phrase is 
equivalent to the WHERE phrase, so this would be relevant. This seems to 
also imply to me that if I have inequalities in my query on c and d, 
that a index (a,b,c,d) would be *used* but only for the parts dealing 
with a,b, and c and that d would be manually scanned over. So if I do 
EXPLAIN QUERY, it will say the index is being used, but won't tell me if 
it is being used for a,b,c, *and* d or just a,b, and c. This is what I 
make of this, but I'm not sure if this is correct.

So I don't what do do: can I can keep my same query, make separate 
indices (chr,strand,start) and (chr,strand,stop) and have them both be 
used, or do I need to rewrite my query into two parts so that the two 
indexes can be used and if so how? I'm thinking the later, because of 
tips I read from elsewhere that seem to say only ONE index can be used 
per WHERE clause (and had suggestions about to break up a WHERE clause 
into two and then intersect, as I mentioned on the original post, but 
again I don't know how this works for the OUTER JOIN).

Thanks,
Elizabeth Purdom


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] trying to optimize left outer join for large data set (multiple indices needed?)

2009-06-09 Thread Ibrahim A
Hi, 

at the first sight i'd suggest that you reorder the rows of your index :

it is most likely that chr and strand will have many equal values in your 
example - especially chr. When chr is the first field of your Index than the 
path to find first differences in the btree to find the matching intervall will 
be longer. 

i'd try an index with reordered columns which wouldn't affect your application 
to much :

you can experiment for the right combination. 

i'd try this combinations for your index and measure the differences :

a) idx(start, stop, strand, chr)
b) idx(start, strand, chr, stop)
c) idx(stop, start, strand, chr)
d) idx(stop, strand, chr, start)
e) idx(strand, chr, start, stop)
f) idx(strand, chr, stop, start)

i think that all this indexes especially b and d will make your query faster.



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] trying to optimize left outer join for large data set (multiple indices needed?)

2009-06-09 Thread Simon Slavin

On 9 Jun 2009, at 7:12am, Elizabeth Purdom wrote:

> SELECT regions.id, TOTAL(reads.data) FROM  regions LEFT OUTER JOIN  
> reads ON
> ( regions.chr = reads.chr AND regions.strand=reads.strand
> AND regions.start<=reads.start AND regions.stop>=reads.stop
> )
> GROUP BY regions.id ORDER BY regions.id
> ;
>
>  [snip]
>
> In reading about optimizing with indexing, then it seems that I would
> want an index of the relevant columns in 'reads' (chr, strand, start,
> stop); however, since I have inequality matches for both start and  
> stop,
> then a joint index like this will not help
> (http://www.sqlite.org/optoverview.html). I can make two indexes (chr,
> strand, start) and (chr,strand,stop), but I don't know how to get an
> outer join to use them both.


Although you can force SQLite to use a specific index when doing some  
things, it's not really your job. SQLite has strategies that allow it  
to look at your SELECT and pick the indices that will let it do the  
job fastest.  It's SQLite's job to figure out how to do this given  
whatever indices you provided, or none at all.

Write some code to execute a SELECT like the above and time how long  
it takes to do the SELECT and iterate through the result.  Then  
experiment with a few different indices.  You could start with two  
separate indices on reads.chr and reads.strand, but you might then try  
one index which combines them both, or even

CREATE INDEX fred ON TABLE reads (chr, strand, start)

Try a few combinations and see which one leads to fastest retrieval.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] trying to optimize left outer join for large data set (multiple indices needed?)

2009-06-09 Thread Elizabeth Purdom
Hello,

I'm basically a newbie, but have been plunged into not just having 
correct code but having to optimize a particular query. The following 
query works but is very slow on my database with millions of entries in 
one of the tables.

SELECT regions.id, TOTAL(reads.data) FROM  regions LEFT OUTER JOIN reads ON
( regions.chr = reads.chr AND regions.strand=reads.strand
 AND regions.start<=reads.start AND regions.stop>=reads.stop
)
GROUP BY regions.id ORDER BY regions.id
;

[I am writing with simplified names, so if there are any typos, its 
probably that, because the code does run]

'regions' is a table of somewhere between 20K to 300K rows and 'reads' 
is the larger one with millions of rows. I'm using sqlite 3.3.3. I 
currently have a single index of (chr,strand,start,stop), though I don't 
think this is helping (see below)

I would be glad as to any advice as to the most efficient way to do this 
(and any idea of whether it will really speed anything up, or if I'm 
just stuck with the speed I see).

Just to say what I've been thinking, though I don't know if its useful:

In reading about optimizing with indexing, then it seems that I would 
want an index of the relevant columns in 'reads' (chr, strand, start, 
stop); however, since I have inequality matches for both start and stop, 
then a joint index like this will not help 
(http://www.sqlite.org/optoverview.html). I can make two indexes (chr, 
strand, start) and (chr,strand,stop), but I don't know how to get an 
outer join to use them both.

There was then an example about breaking up the where clause into two parts
1=SELECT rowid from tablename WHERE (regions.chr = reads.chr AND 
regions.strand=reads.strand
 AND regions.start<=reads.start)
2= SELECT rowid from tablename WHERE (regions.chr = reads.chr AND 
regions.strand=reads.strand
 AND regions.stop>=reads.stop)
and then intersecting the row ids, but I can't figure out to extend this 
to outer join (I'm not clear what are the correct replacement of rowid). 
I think I could do
SELECT * from regions LEFT OUTER JOIN reads ON [1st]
INTERSECT
SELECT * from regions LEFT OUTER JOIN reads ON [2nd]
but then that seems unlikely to be efficient, since you have to 
intersect all of the column information. And I also don't know how to do 
the syntax for summary that should surround this table.

Thanks for any assistance,
Elizabeth Purdom



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users