Re: [sqlite] trying to optimize left outer join for large data set, (multiple indices needed?)
On 10 Jun 2009, at 5:15am, Elizabeth Purdom wrote: > This seems to > also imply to me that if I have inequalities in my query on c and d, > that a index (a,b,c,d) would be *used* but only for the parts dealing > with a,b, and c and that d would be manually scanned over. So if I do > EXPLAIN QUERY, it will say the index is being used, but won't tell > me if > it is being used for a,b,c, *and* d or just a,b, and c. This is what I > make of this, but I'm not sure if this is correct. You've got the right idea, but even just using that index for a and b will much reduce the amount of records that have to be iterated down to something a computer can do in a split second. And by including c and d in the index you ensure that the library doesn't have to fetch those values from the record data: it can see all the values it needs in the index it's already looking at. This can save the SELECT command quite a lot of time and effort fetching the record data as it does the 'JOIN'. > So I don't what do do: can I can keep my same query, make separate > indices (chr,strand,start) and (chr,strand,stop) and have them both be > used, or do I need to rewrite my query into two parts so that the two > indexes can be used and if so how? Can I urge you first to try your query with the obvious index and find out whether it's too slow ? The sort of stuff we're discussing here is useful in a highly technical context, but you almost never have to worry about that level of detail in real life. You're doing the same thing most programmers do to start off with: trying to wring every possible millisecond out of each query whether it's worth it or not. If you've tried the simple and easy job of having SQLite handle the job, and it's not fast enough, then's the time to consider the options you listed. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] trying to optimize left outer join for large data set, (multiple indices needed?)
Hi, I appreciate the tips about timing the indexing and the order of the variables. However, I am particularly trying to address the following documentation on the SQLite page entitled 'The SQLite Query Optimizer Overview' (http://www.sqlite.org/optoverview.html). There they say: > If an index is created using a statement like this: > > CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z); > > Then the index might be used if the initial columns of the index (columns a, > b, and so forth) appear in WHERE clause terms. All index columns must be used > with the = or IN operators except for the right-most column which can use > inequalities. For the right-most column of an index that is used, there can > be up to two inequalities that must sandwich the allowed values of the column > between two extremes. > > It is not necessary for every column of an index to appear in a WHERE clause > term in order for that index to be used. But there can not be gaps in the > columns of the index that are used. Thus for the example index above, if > there is no WHERE clause term that constraints column c, then terms that > constraint columns a and b can be used with the index but not terms that > constraint columns d through z. Similarly, no index column will be used (for > indexing purposes) that is to the right of a column that is constrained only > by inequalities. For the index above and WHERE clause like this: > > ... WHERE a=5 AND b IN (1,2,3) AND c>12 AND d='hello' > > Only columns a, b, and c of the index would be usable. The d column would not > be usable because it occurs to the right of c and c is constrained only by > inequalities. I have inequalities on both 'start' and 'stop', which would seem to imply 1) chr and strand must come first and 2) start and stop in the same index won't help. My understanding is that the 'ON ...' phrase is equivalent to the WHERE phrase, so this would be relevant. This seems to also imply to me that if I have inequalities in my query on c and d, that a index (a,b,c,d) would be *used* but only for the parts dealing with a,b, and c and that d would be manually scanned over. So if I do EXPLAIN QUERY, it will say the index is being used, but won't tell me if it is being used for a,b,c, *and* d or just a,b, and c. This is what I make of this, but I'm not sure if this is correct. So I don't what do do: can I can keep my same query, make separate indices (chr,strand,start) and (chr,strand,stop) and have them both be used, or do I need to rewrite my query into two parts so that the two indexes can be used and if so how? I'm thinking the later, because of tips I read from elsewhere that seem to say only ONE index can be used per WHERE clause (and had suggestions about to break up a WHERE clause into two and then intersect, as I mentioned on the original post, but again I don't know how this works for the OUTER JOIN). Thanks, Elizabeth Purdom ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] trying to optimize left outer join for large data set (multiple indices needed?)
Hi, at the first sight i'd suggest that you reorder the rows of your index : it is most likely that chr and strand will have many equal values in your example - especially chr. When chr is the first field of your Index than the path to find first differences in the btree to find the matching intervall will be longer. i'd try an index with reordered columns which wouldn't affect your application to much : you can experiment for the right combination. i'd try this combinations for your index and measure the differences : a) idx(start, stop, strand, chr) b) idx(start, strand, chr, stop) c) idx(stop, start, strand, chr) d) idx(stop, strand, chr, start) e) idx(strand, chr, start, stop) f) idx(strand, chr, stop, start) i think that all this indexes especially b and d will make your query faster. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] trying to optimize left outer join for large data set (multiple indices needed?)
On 9 Jun 2009, at 7:12am, Elizabeth Purdom wrote: > SELECT regions.id, TOTAL(reads.data) FROM regions LEFT OUTER JOIN > reads ON > ( regions.chr = reads.chr AND regions.strand=reads.strand > AND regions.start<=reads.start AND regions.stop>=reads.stop > ) > GROUP BY regions.id ORDER BY regions.id > ; > > [snip] > > In reading about optimizing with indexing, then it seems that I would > want an index of the relevant columns in 'reads' (chr, strand, start, > stop); however, since I have inequality matches for both start and > stop, > then a joint index like this will not help > (http://www.sqlite.org/optoverview.html). I can make two indexes (chr, > strand, start) and (chr,strand,stop), but I don't know how to get an > outer join to use them both. Although you can force SQLite to use a specific index when doing some things, it's not really your job. SQLite has strategies that allow it to look at your SELECT and pick the indices that will let it do the job fastest. It's SQLite's job to figure out how to do this given whatever indices you provided, or none at all. Write some code to execute a SELECT like the above and time how long it takes to do the SELECT and iterate through the result. Then experiment with a few different indices. You could start with two separate indices on reads.chr and reads.strand, but you might then try one index which combines them both, or even CREATE INDEX fred ON TABLE reads (chr, strand, start) Try a few combinations and see which one leads to fastest retrieval. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] trying to optimize left outer join for large data set (multiple indices needed?)
Hello, I'm basically a newbie, but have been plunged into not just having correct code but having to optimize a particular query. The following query works but is very slow on my database with millions of entries in one of the tables. SELECT regions.id, TOTAL(reads.data) FROM regions LEFT OUTER JOIN reads ON ( regions.chr = reads.chr AND regions.strand=reads.strand AND regions.start<=reads.start AND regions.stop>=reads.stop ) GROUP BY regions.id ORDER BY regions.id ; [I am writing with simplified names, so if there are any typos, its probably that, because the code does run] 'regions' is a table of somewhere between 20K to 300K rows and 'reads' is the larger one with millions of rows. I'm using sqlite 3.3.3. I currently have a single index of (chr,strand,start,stop), though I don't think this is helping (see below) I would be glad as to any advice as to the most efficient way to do this (and any idea of whether it will really speed anything up, or if I'm just stuck with the speed I see). Just to say what I've been thinking, though I don't know if its useful: In reading about optimizing with indexing, then it seems that I would want an index of the relevant columns in 'reads' (chr, strand, start, stop); however, since I have inequality matches for both start and stop, then a joint index like this will not help (http://www.sqlite.org/optoverview.html). I can make two indexes (chr, strand, start) and (chr,strand,stop), but I don't know how to get an outer join to use them both. There was then an example about breaking up the where clause into two parts 1=SELECT rowid from tablename WHERE (regions.chr = reads.chr AND regions.strand=reads.strand AND regions.start<=reads.start) 2= SELECT rowid from tablename WHERE (regions.chr = reads.chr AND regions.strand=reads.strand AND regions.stop>=reads.stop) and then intersecting the row ids, but I can't figure out to extend this to outer join (I'm not clear what are the correct replacement of rowid). I think I could do SELECT * from regions LEFT OUTER JOIN reads ON [1st] INTERSECT SELECT * from regions LEFT OUTER JOIN reads ON [2nd] but then that seems unlikely to be efficient, since you have to intersect all of the column information. And I also don't know how to do the syntax for summary that should surround this table. Thanks for any assistance, Elizabeth Purdom ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users