You also need to make sure the "no hit" does not degenerate into a table scan.  
RTree works well for this but is overall significantly slower than not using 
RTree since the purpose of RTree is to find the "small number of candidate 
records" that could possibly satisfy the query out of a haystack of records 
(that is, find the candidate needles in the haystack, so that you only need to 
closely examine that small number of candidates to find the needle rather than 
test the whole haystack).  

However, if you know that there can only be one possible record which can 
satisfy the query (ie, there is only one possible needle in the haystack, and 
only one possible candidate, and you can find this candidate directly for 
testing), then the overhead of using RTree where it is not needed exceeds the 
benefits of using it.

I see that the performance of the RTree is significantly slower than the 
equivalent "direct" method.  Am I doing something wrong here or is that 
overhead simply because of the data structures that the RTRee implementation 
must maintain (which are not required in this case).

Without RTree:
>py -3 test.py
Created 100000 random ranges in 00:00:00.681118 Creation Rate = 146817 
Ranges/Second
Looked up 1102019 random range values in 00:00:04.598245 Lookup Rate = 239660 
Values/Second
Failure Rate = 257270 Values/Second
Success Rate = 228828 Values/Second

With RTree:
>py -3 test.py --rtree
Created 100000 random ranges in 00:00:02.139742 Creation Rate = 46734 
Ranges/Second
Looked up 1100681 random range values in 00:00:13.662556 Lookup Rate = 80561 
Values/Second
Failure Rate = 119874 Values/Second
Success Rate = 65627 Values/Second

And that came from the following test program (in python) where the only 
difference is the SQL statements being used.  Because the ranges are random and 
the lookups are random, the timings given are subject to differences on every 
run, however, the averaged rates are relatively stable given a large number of 
random ranges and query values.

--- test.py ---
from __future__ import absolute_import, division, print_function, 
unicode_literals

import datetime
import random
import sys
import time

import sqlite3

# Convert a value in seconds to HMS format
HMS = lambda t: (datetime.datetime.min + 
datetime.timedelta(seconds=t)).time().isoformat()

# Create constants for the SQL statements we will use

if '--rtree' in sys.argv:
    create_sql = 'create virtual table ranges using rtree(id, start, stop, 
+value);'
    query_sql = 'select value from ranges where ? between start and stop;'
else:
    create_sql = 'create table ranges (start integer primary key, stop integer 
not null, value integer not null);'
    query_sql = 'select value from (select stop, value from ranges where start 
<= ?1 order by start desc limit 1) where ?1 <= stop;'

insert_sql = 'insert into ranges (start, stop, value) values (?, ?, ?);'


# Open our database and do not use automagical transactions
db = sqlite3.connect(':memory:', isolation_level=None)

# Create our table
db.execute(create_sql)

# Create our random range data
recs = 100000
start = 0
st = time.time()
for cnt in range(recs):
    start += random.randint(1, 10)
    stop = start + random.randint(1, 10)
    value = int((start + stop) / 2)
    db.execute(insert_sql, (start, stop, value))
    start = stop
stop = stop + random.randint(1, 10)
et = time.time() - st
print('Created', recs, 'random ranges in', HMS(et), 'Creation Rate =', int(recs 
/ et), 'Ranges/Second')

db.execute('analyze;')

# Generate a bunch of random values and perform the range query
eta = 0.0
ets = 0.0
etf = 0.0
fcnt = 0
scnt = 0
tcnt = 0
for i in range(stop):
    x = random.randint(0, stop)
    lst = time.time()
    row = db.execute(query_sql, (x, )).fetchone()
    let = time.time() - lst
    if row:
        value = row[0]
        ets += let
        scnt += 1
    else:
        value = None
        etf += let
        fcnt += 1
    eta += let
    tcnt += 1
print('Looked up', stop, 'random range values in', HMS(eta), 'Lookup Rate =', 
int(tcnt / eta), 'Values/Second')
print('Failure Rate =', int(fcnt / etf), 'Values/Second')
print('Success Rate =', int(scnt / ets), 'Values/Second')

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.


>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of E.Pasma
>Sent: Friday, 26 October, 2018 16:28
>To: SQLite mailing list
>Subject: Re: [sqlite] Optmize queries on ranges
>
>About the rtree extension, which was the first idea.
>
>The extension appears available without any special installation
>option. This is easier than what is mentioned in
>https://sqlite.org/rtree.html <https://sqlite.org/rtree.html> chapter
>2: "Compiling The R*Tree Module".
>This chapter may as well be left out?
>
>With test data where the ranges are mostly non-overlapping, the query
>now runs faster than without rtree. Even though both run within a
>millisecond rtree is ten times faster.
>With order by and limit the timing remains superior. But this relies
>on strictly non-overlapping ranges.
>Below my test script
>
>
>/* query 1: using rtree built-in extension */
>;
>create virtual table ranges using rtree(id, minX, maxX, +value);
>with r as (select 0 as r union all select r+1 from r where r<1000000)
>insert into ranges (minX, maxX, value)
>select r*10+1,r*10+10,r*10+5 from r
>;
>select value from ranges where 123456 between minx and maxx
>;
>123455
>Run Time: real 0.000 user 0.000135 sys 0.000018
>
>/* query 2: using index on minx+maxx */
>drop table ranges
>;
>create table ranges (minx int, maxx int, value int)
>;
>with r as (select 0 as r union all select r+1 from r where r<1000000)
>insert into ranges (minX, maxX, value)
>select r*10+1,r*10+10,r*10+5 from r
>;
>create unique index ranges_minx_maxx on ranges(minx,maxx)
>;
>select value from ranges where 123456 between minx and maxx
>;
>123455
>Run Time: real 0.002 user 0.001415 sys 0.000016
>
>/* query 3: same, assuming non-overlapping ranges */
>select value from ranges where 123456 between minx and maxx
>order by minx desc limit 1
>;
>123455
>Run Time: real 0.000 user 0.000057 sys 0.000000
>
>
>
>_______________________________________________
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to