On Sat, Aug 21, 2010 at 4:25 PM, ve3meo <holden_fam...@sympatico.ca> wrote:
> "Simon Slavin" ...
>>
>>> Here's how I have interpreted Simon's suggested chunky query:
>>>
>>> select * from A as A1, A as A2 where A1.name=A2.name and
>>>    A1.position != A2.position and
>>>    A2.chunk between A1.chunk - 1 and A1.chunk + 1 and
>>>    A2.position between A1.position - 10 and A1.position + 10 ;
>>
>> That's exactly what I meant. It allows you to reject everything not in
>> those three chunks immediately, before even working out what 'position - 10'
>> and 'position + 10' are. But looking again at your formulation with
>> 'between' in ...
>>
>>> select * from A as A1, A as A2 where A1.name=A2.name and
>>>    A1.position != A2.position and
>>>    A2.position between A1.position - 10 and A1.position + 10 ;
>>
>> I think that might be just as fast, and it does not require you to work
>> out the chunks, take up the space the chunk indexes need, or spend the
>> time it takes to generate them. So we're back to the same old song: try
>> the simpler solution. If it's fast enough, use it. Only if it isn't fast
>> enough is it worth optimizing.
>>
>> By the way, you might find that swapping the last two lines makes it
>> faster:
>>
>>> select * from A as A1, A as A2 where A1.name=A2.name and
>>>    A2.position between A1.position - 10 and A1.position + 10 and
>>>    A1.position != A2.position ;
>>
>> But you might not: SQLite's optimizer may already have spotted it.
>
> I did some tests on a 10,000-row table similar to Peng's example, with
> name a random letter and random position 0-99, distance 10. Each query
> returned 333,450 results.
>
> ~62s:  Peng's original query with no index, either order of constraints
> ~4s:   Peng's original query with idxNamePos, either order
> ~2.3s: Jim's 'between' query with idxNamePos, either order
> ~2.7s: Simon's 'chunky' addition to Jim's with idxNameChunkPos, any order
>
> So:
> 1. Simpler is best, in this case.
> The 'chunking' option paid a ~16% penalty in speed and added considerable
> complexity and storage requirements.
> 2. The order of the where constraints did not seem to matter - I guess the
> optimiser sorts it out, or maybe the random nature of the data offers no
> advantage to any particular order.
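As an editorial aside: the 'between' self-join and the (name, position) index discussed above can be reproduced with a few lines of Python's built-in sqlite3 module. This is an illustrative sketch, not the posters' actual test harness; the index name idxNamePos and the 0-99/distance-10 data shape follow ve3meo's description.

```python
# Minimal sketch of Jim's 'between' query with ve3meo's idxNamePos index,
# using Python's built-in sqlite3 module (illustration only).
import random
import sqlite3

random.seed(42)
conn = sqlite3.connect(":memory:")
conn.execute("create table A (name text, position integer)")

# Random single-letter names with positions 0-99, as in ve3meo's test
# (smaller row count here, just to keep the sketch quick).
rows = [(random.choice("abcdefghijklmnopqrstuvwxyz"), random.randrange(100))
        for _ in range(1000)]
conn.executemany("insert into A values (?, ?)", rows)

# The (name, position) index lets SQLite turn the BETWEEN range
# constraint into an index seek instead of a full scan of A1.
conn.execute("create index idxNamePos on A (name, position)")

pairs = conn.execute("""
    select count(*) from A as A1, A as A2
    where A1.name = A2.name and
          A1.position != A2.position and
          A2.position between A1.position - 10 and A1.position + 10
""").fetchone()[0]
print(pairs)
```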
I randomly generate an array of random names and positions (using R).
Essentially, it generates 26 names (a-z). For each name, it generates n
unique positions in the range from 1 to n*40. The trial run at the end
shows that the run-time complexity of the SQL query is O(n^2).

If I do not use sqlite3, but rather use 'sort' from coreutils and perl,
the complexity will be O(n*log(n)). I could first sort the names and
positions (O(n*log(n))), then use perl to go through the sorted names
and positions to figure out which pairs of names and positions have
nearby neighbors (O(n)). In this sense, sqlite3 will never beat the
solution using sort and perl, at least for this particular problem. I'm
wondering if my understanding is correct, or if there is a way to
achieve O(n*log(n)) run-time complexity.

$ cat main.R
options(echo=F)
L=n*40
f=do.call(
  rbind,
  lapply(
    letters[1:26],
    function(x) {
      data.frame(chrom=x, start=sample(L, n))
    }
  )
)
write.table(f, '/dev/stdout', quote=F, sep=',', row.names=F, col.names=F)

$ cat main.sql
create table A (name text, position integer);
create index A_index on A (name, position);
.separator ','
.import /dev/stdin A
select count(*) from A;
.timer on
select count(*) from (
  select name1 as name, position1 as position from (
    select
      A1.name as name1, A1.position as position1,
      A2.name as name2, A2.position as position2
    from A as A1, A as A2
    where A1.name=A2.name and
      A1.position != A2.position and
      A1.position between A2.position - 36 and A2.position + 36
  ) group by name, position
);

$ cat main.sh
#!/usr/bin/env bash

for n in 100 1000 10000 100000; do
  echo '$n'=$n
  rm -rf main.db3
  Rscript -e "n=$n; source('main.R')" | tail -n +2 | sqlite3 main.db3 '.read main.sql'
done

$ ./main.sh
$n=100
2155
CPU Time: user 0.120000 sys 0.000000
$n=1000
21812
CPU Time: user 10.250000 sys 0.020000
$n=10000
217798
CPU Time: user 1048.660000 sys 0.240000
$n=100000
2600000

--
Regards,
Peng
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
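Editorial footnote to the thread: the sort-then-scan approach Peng describes can be sketched as below, in Python rather than coreutils sort + perl (an illustration, not Peng's actual script). The bound is O(n*log(n)) from the sort; the scan is linear because, once positions are sorted within each name, a row has a neighbor within the distance iff its immediate predecessor or successor with the same name is within the distance.

```python
# Sketch of the sort-then-scan idea: sort costs O(n log n), the scan is
# O(n). Counts distinct (name, position) rows that have another row with
# the same name within `distance` - the quantity Peng's main.sql counts.
import random

def rows_with_nearby_neighbor(rows, distance):
    # Positions are assumed unique per name, as in Peng's R generator
    # (sample() draws without replacement).
    srt = sorted(rows)                        # O(n log n)
    count = 0
    for i, (name, pos) in enumerate(srt):     # O(n)
        prev_ok = (i > 0 and srt[i-1][0] == name
                   and pos - srt[i-1][1] <= distance)
        next_ok = (i + 1 < len(srt) and srt[i+1][0] == name
                   and srt[i+1][1] - pos <= distance)
        if prev_ok or next_ok:
            count += 1
    return count

# Data shaped like main.R: for each of 26 names, n unique positions
# drawn from 1..n*40; distance 36 as in main.sql.
random.seed(0)
n = 1000
rows = [(name, p)
        for name in "abcdefghijklmnopqrstuvwxyz"
        for p in random.sample(range(1, n * 40 + 1), n)]
print(rows_with_nearby_neighbor(rows, 36))
```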