Right, Jean-Christophe, about "moving too much data around". You'd want each
of the inner selects to produce a vector of integer ids. "select *" is
clearly not going to be very efficient.  200K rows is a very small table for
this sort of test, in my view.  Were the faster intereections any faster
than:

select * from T
where lo ... {low condition}
and hi ......{ hi condition}
and name = 'x'

where only one index is chosen?  The answer would depend on the cardinality.
  If [name] is highly differentiated then name='x' would return very few
rows and the query is fast using standard approach.  But if [name] were,
say, FirstName, and there were hundreds of thousands of records containing
"Joe" in the OP's table, then the name index won't help us very much.
 That's why I am asking about Minimal Perfect Hash and intersection of
vectors.

The Minimal-Perfect-Hash-INTERSECTION-OF-VECTORS approach might benefit
queries against tables having several million rows. What I'm wondering (and
lack the C skills to find out for myself) is whether SQLite's underlying
algorithms for INTERSECT could be optimized with a minimal perfect hash
approach. The algorithms would decide which vector of ids is the better
candidate to be used for the MPH and which is the better candidate to be
iterated, and then iterate over the one vector of ids, testing for the
existence of each id in the MPH using the optimized Exists() function
supplied by the mph library for the particular type of mph being used.

The question, in scenarios where the vectors contain many items, is whether
the overhead of creating the MPH and testing for existence is significantly
less than the overhead of doing whatever INTERSECT is doing now when it
intersects vectors of ids.  You have a ready-made acronym to advertise the
speed if it turns out to be faster: MPH.  ;-)

Regards
Tim Romano
Swarthmore PA

On Wed, May 12, 2010 at 8:52 PM, Jean-Christophe Deschamps 
<j...@q-e-d.org>wrote:

>
> >
> >I would first create an INTEGER primary key and then place an index on
> >name,
> >another on i_from, and another on i_to, and then see if the approach below
> >has any benefit.
> >
> >When I tried this with a geo-queryit was actually slower than the standard
> >select, and I'm curious if that's always going to be the case. It will
> >come
> >down to how efficient the INTERSECT of the vectors of integers is. Each
> >vector will have been the result of an index-scan.  If INTERSECT were
> >optimized (perhaps with a minimal perfect hash function
> >http://cmph.sourceforge.net/index.html) this approach might be useful.
>
>
> All three following queries use only simple indexes (PK, name, lo, hi).
>
> Query#1:
> select * from tst where lo < 345678
>     intersect
> select * from tst where hi > 123456
>     intersect
> select * from tst where name = 'aaelj';
>
> Query#2
> select * from tst
>     join (
>         select rowid from tst where lo < 345678
>             intersect
>         select rowid from tst where hi > 123456
>     ) as interval
>     on tst.rowid = interval.rowid and name = 'aaelj';
>
> Query#3
> select * from tst
>     join (
>         select rowid from tst where lo < 345678
>             intersect
>         select rowid from tst where hi > 123456
>             intersect
>         select rowid from tst where name = 'aaelj'
>     ) as interval
>     on tst.rowid = interval.rowid;
>
> On a 200K-row test table with random data, queries #2 and #3 were
> essentially identical while #1 was twice slower (moving too much data
> around, uselessly).
>
>
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to