On Sat, Aug 21, 2010 at 4:25 PM, ve3meo <holden_fam...@sympatico.ca> wrote:
> "Simon Slavin" ...
>>
>>> Here's how I have interpreted Simon's suggested chunky query:
>>>
>>> select * from A as A1, A as A2 where A1.name=A2.name and
>>> A1.position != A2.position and
>>> A2.chunk between A1.chunk - 1 and A1.chunk + 1 and
>>> A2.position between A1.position - 10 and A1.position + 10 ;
>>
>> That's exactly what I meant.  It allows you to reject everything not in
>> those three chunks immediately, before even working out what 'position -
>> 10' and 'position + 10' are.  But looking again at your formulation with
>> 'between' in ...
>>
>>> select * from A as A1, A as A2 where A1.name=A2.name and
>>> A1.position != A2.position and
>>> A2.position between A1.position - 10 and A1.position + 10 ;
>>
>> I think that that might be just as fast, and not require you to work out
>> the chunks, or take up all the space the chunk indexes will take up, or
>> the time it will take to generate them.  So we're back to the same old
>> song: Try the simpler solution.  If it's fast enough, use it.  Only if it
>> isn't fast enough is it worth optimizing.
>>
>> By the way, you might find that swapping the last two lines makes it
>> faster:
>>
>>
>>> select * from A as A1, A as A2 where A1.name=A2.name and
>>> A2.position between A1.position - 10 and A1.position + 10 and
>>> A1.position != A2.position ;
>>
>> But you might not: SQLite's optimizer may already have spotted it.
>
> I did some tests on a 10,000 row table similar to Peng's example with name a
> random alpha and random position 0-99, distance 10: each query returned
> 333,450 results.
>
> ~62s: Peng's original query with no index, either order of constraints
> ~4s: Peng's original query with idxNamePos, either order
> ~2.3s: Jim's 'between' query with idxNamePos, either order
> ~2.7s: Simon's 'chunky' addition to Jim's with idxNameChunkPos, any order
>
> So:
> 1. Simpler is best, in this case. The 'chunking' option paid a ~16% penalty
> in speed and added considerable complexity and storage requirement.
> 2. The order of the where constraints did not seem to matter - I guess the
> optimiser sorts it out or maybe the random nature of the data offers no
> advantage to any particular order.

I randomly generate an array of random names and positions (using R).
Essentially it generate 26 names (a-z). For each name, it generates n
uniq positions in the range from 1 to n*40. The trial run at the end
shows that the run time complexity of the sql query is O(n^2).

If I do not use sqlite3, rather, use 'sort' from coreutils and perl,
the complexity will be O(log(n)). Because, I could first sort the
names and positions (O(log(n)), then I use perl to go through the
sorted names and positions to figure out which pairs of names and
positions have nearby neighbors (O(n)).

In this sense, sqlite3 will never beat the solution using sort and
perl at least for this particular problem. I'm wondering if my
understand is correct. Or if there is a away to achieve O(log(n)) run
time complexity.

$ cat main.R
options(echo=F)
L=n*40
f=do.call(
    rbind
    , lapply(
        letters[1:26]
        , function(x) {
          data.frame(chrom=x, start=sample(L, n))
        }
        )
    )

write.table(f,'/dev/stdout', quote=F, sep=',', row.names=F, col.names=F)


$ cat main.sql
create table A (name text, position integer);
create index A_index on A (name, position);

.separator ','
.import /dev/stdin A

select count(*) from A;
.timer on
select count(*) from(
  select name1 as name, position1 as position from(
    select A1.name as name1,
    A1.position as position1,
    A2.name as name2,
    A2.position as position2
    from A as A1, A as A2
    where A1.name=A2.name and A1.position != A2.position and
A1.position between A2.position - 36 and A2.position + 36
  )
  group by name,position
)
;


$ cat main.sh
#!/usr/bin/env bash

for n in 100 1000 10000 100000;
do
  echo '$n'=$n
  rm -rf main.db3
  Rscript -e "n=$n; source('main.R')" | tail -n +2 | sqlite3 main.db3
'.read main.sql'
done


$ ./main.sh
$n=100
2155
CPU Time: user 0.120000 sys 0.000000
$n=1000
21812
CPU Time: user 10.250000 sys 0.020000
$n=10000
217798
CPU Time: user 1048.660000 sys 0.240000
$n=100000
2600000



-- 
Regards,
Peng
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to