Re: [sqlite] Fastest way to SELECT on a set of keys?

Keith Medcalf Tue, 17 Sep 2019 14:38:23 -0700

>I edited this second script to use plain apsw.
>In the vi editor:
>%s/db.execute/db.cursor().execute/
>%s/executescript/execute/
>%s/db.commit()/db.cursor().execute("COMMIT")/
>/Method 9
>.,$d
>
>(the carray tests are left out)
>My test output for 1000 keys is:
>$ python3 keith2b.py 1000
>Creating db and sample keys: 1000000 rows; 1000 keys
>Method 1: Individual Row          00:00:00.003748
>Method 2: Individual Row (Sorted) 00:00:00.003545
>Method 3: Rows with ExecMany      00:00:00.003300
>Method 3: Rows with ExecMany Sort 00:00:00.003088
>Method 4: Using IN temp           00:00:00.003838
>Method 5: Using IN temp (sorted)  00:00:00.003850
>Method 6: Using IN temp no rowid  00:00:00.003941
>Method 7: Using IN (dynamic)      00:00:00.003276
>Method 8: Using IN (sorted)       00:00:00.003223
>
>This is much different as in your output.
>- the test is two times faster here (on a moderate system)
>- there is no substantial difference any longer between individual tests
>(excluded carray)
>
>Any idea?
>
>Thanks for the inviting tests.


I can actually.  

My apsw.Connection class is actually a python class that inherits from the real 
apsw.Connection class so that I can add a bunch of extra's to it.  For example, 
the apsw.Connection.execute method(s) are not direct delegates to 
apsw.Connection.Cursor().execute method -- it also does some scanning of the 
bind parameters so that it can handle datetime objects conversion to text.  
Similarly I have an active exec tracer and row tracer so that output tuples are 
converted to objects and datetime data can be converted to real datetime 
objects.

For the most part these overheads are static per execute call and per row 
returned so mostly cancel themselves out (but make everything somewhat slower). 
 It is also why the execution of many select's degrades so quickly -- the setup 
time is per select -- and the executemany degrades slower (it is only calling 
back on the exe tracer for each statement rather than doing the whole overhead 
for each statement).  The row tracer overhead is constant since the actual 
number of rows returned is constant on each set.

Some other discrepancies are apparently native to the differences between 
running on Linux vs Windows 10 -- Windows 10 seems to be somewhat less 
deterministic in its scheduling and even CPython itself seems to be somewhat 
different between.

So, I modified my delegate slightly so that if the exec/row tracer is unhooked 
then the bind parameters are not scanned either (but there is still a bit of 
overhead per statement executed due to the delegation).  Here are the results:

Full exec and row hook processing with bind parameter scanning:

1000000 rows; 1000 keys; 0.100000%
Method 1: Individual Row          00:00:00.018191   54971 rps external order
Method 2: Individual Row (Sorted) 00:00:00.018101   55244 rps external order 
(sorted)
Method 3: Rows ExecuteMany        00:00:00.016854   59332 rps external order
Method 3: Rows ExecuteMany Sorted 00:00:00.016159   61885 rps external order 
(sorted)
Method 4: Using IN temp           00:00:00.007869  127088 rps order by id
Method 5: Using IN temp (sorted)  00:00:00.008125  123083 rps order by id
Method 6: Using IN keyset         00:00:00.009833  101697 rps order by id
Method 7: Using IN keyset sorted  00:00:00.008166  122461 rps order by id
Method 8: Using IN (dynamic)      00:00:00.004820  207474 rps order by id
Method 9: Using IN (sorted)       00:00:00.005196  192452 rps order by id
Method A: Using IN CArray         00:00:00.005440  183815 rps order by id
Method B: Using IN CArray sorted  00:00:00.005891  169741 rps order by id

No row or exec tracers, bind parameter scanning bypassed:

1000000 rows; 1000 keys; 0.100000%
Method 1: Individual Row          00:00:00.003435  291089 rps external order
Method 2: Individual Row (Sorted) 00:00:00.003366  297047 rps external order 
(sorted)
Method 3: Rows ExecuteMany        00:00:00.002942  339950 rps external order
Method 3: Rows ExecuteMany Sorted 00:00:00.002892  345807 rps external order 
(sorted)
Method 4: Using IN temp           00:00:00.003435  291129 rps order by id
Method 5: Using IN temp (sorted)  00:00:00.003419  292449 rps order by id
Method 6: Using IN keyset         00:00:00.003649  274083 rps order by id
Method 7: Using IN keyset sorted  00:00:00.003626  275814 rps order by id
Method 8: Using IN (dynamic)      00:00:00.002526  395950 rps order by id
Method 9: Using IN (sorted)       00:00:00.002706  369574 rps order by id
Method A: Using IN CArray         00:00:00.002902  344557 rps order by id
Method B: Using IN CArray sorted  00:00:00.002656  376508 rps order by id

No row or exec tracers, calling .cursor() methods directly (so minimize 
delegation processing):

1000000 rows; 1000 keys; 0.100000%
Method 1: Individual Row          00:00:00.003267  306108 rps external order
Method 2: Individual Row (Sorted) 00:00:00.003083  324310 rps external order 
(sorted)
Method 3: Rows ExecuteMany        00:00:00.002665  375262 rps external order
Method 3: Rows ExecuteMany Sorted 00:00:00.002772  360738 rps external order 
(sorted)
Method 4: Using IN temp           00:00:00.003426  291858 rps order by id
Method 5: Using IN temp (sorted)  00:00:00.003491  286457 rps order by id
Method 6: Using IN keyset         00:00:00.003555  281326 rps order by id
Method 7: Using IN keyset sorted  00:00:00.003415  292857 rps order by id
Method 8: Using IN (dynamic)      00:00:00.002914  343204 rps order by id
Method 9: Using IN (sorted)       00:00:00.002677  373557 rps order by id
Method A: Using IN CArray         00:00:00.002718  367856 rps order by id
Method B: Using IN CArray sorted  00:00:00.002936  340612 rps order by id

Using --sqlite3 (does not do executemany for select statements):

1000000 rows; 1000 keys; 0.100000%
Method 1: Individual Row          00:00:00.004060  246303 rps external order
Method 2: Individual Row (Sorted) 00:00:00.003985  250945 rps external order 
(sorted)
Method 3: Rows ExecuteMany
Method 3: Rows ExecuteMany Sorted
Method 4: Using IN temp           00:00:00.003985  250930 rps order by id
Method 5: Using IN temp (sorted)  00:00:00.003968  252031 rps order by id
Method 6: Using IN keyset         00:00:00.004013  249201 rps order by id
Method 7: Using IN keyset sorted  00:00:00.004247  235463 rps order by id
Method 8: Using IN (dynamic)      00:00:00.002576  388217 rps order by id
Method 9: Using IN (sorted)       00:00:00.002456  407174 rps order by id

Note that I added rps (rows per second) which is merely int(rows / et) and 
added "order by id" to each query that returns multiple rows so that they will 
all get the same results in the same order for each test (except the first 4 
which will of course retrieve the rows in the order asked for since you get one 
row per select).  So yes, I would conclude that they are all about the same 
speed so it is really a question of which is the least work to implement, 
though overall I would probably favour the dynamic in clause format.

Note that the parameter is now the exponent of the set, so 1 is 10 records, 2 
is 100 records, ... 6 is 1 million records ... and the test runs for each power 
of ten up to that specified ...

#! python3

import datetime
import random
import sys
import time

exponent = int(sys.argv[1])

elapsed = lambda et: datetime.datetime.utcfromtimestamp(et).time()
tuplize = lambda x: (x,)
carrray = None

if '--sqlite3' in sys.argv:
    import sqlite3
    db = sqlite3.connect(':memory:', isolation_level=None)
else:
    import apsw
    if '--unhook' in sys.argv and hasattr(apsw, 'rowunhook'):
        apsw.rowunhook()
    db = apsw.Connection(':memory:')
    if hasattr(apsw, 'carray'):
        carray = apsw.carray

datasize = 10 ** exponent

print('Creating db and sample keys:', end=' ', flush=True)
db.executescript('''
create table x
(
    id      integer primary key,
    data    blob
);
insert into x
with a(x) as (
        select 1
     union all
        select x + 1
          from a
         where x < %d
             )
select x, randomblob(30)
  from a;
analyze;
''' % (datasize,))
print(db.execute('select count(*) from x').fetchone()[0], 'rows')

masterkeys = [i for i in range(datasize)]
random.shuffle(masterkeys)

for rowexp in range(exponent + 1):
    rows = 10 ** rowexp

    rowset = masterkeys[:rows]

    print()
    print(datasize, 'rows;', rows, 'keys;', '%f%%' % (rows*100/datasize,))
    # -------------------------------------------------------------------------
    try:
        print('Method 1: Individual Row         ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        for key in rowset:
            row = db.execute('select * from x where id=?', (key,)).fetchone()
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps external order')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 2: Individual Row (Sorted)', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        for key in sorted(rowset):
            row = db.execute('select * from x where id=?', (key,)).fetchone()
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps external order (sorted)')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 3: Rows ExecuteMany       ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        for row in db.executemany('select * from x where id=?', 
list(map(tuplize, rowset))):
            pass
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps external order')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 3: Rows ExecuteMany Sorted', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        for row in db.executemany('select * from x where id=?',
                                  list(map(tuplize, sorted(rowset)))):
            pass
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps external order (sorted)')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 4: Using IN temp          ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        db.executescript('create temporary table keys (key)')
        db.executemany('insert into keys values (?)', list(map(tuplize, 
rowset)))
        for row in db.execute('select * from x where id in temp.keys order by 
id'):
            pass
        db.executescript('drop table temp.keys')
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 5: Using IN temp (sorted) ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        db.executescript('create temporary table keys (key)')
        db.executemany('insert into keys values (?)', list(map(tuplize, 
sorted(rowset))))
        for row in db.execute('select * from x where id in temp.keys order by 
id'):
            pass
        db.executescript('drop table temp.keys')
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 6: Using IN keyset        ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        db.executescript('create temporary table keys (key primary key) without 
rowid')
        db.executemany('insert or ignore into keys values (?)', 
list(map(tuplize, rowset)))
        for row in db.execute('select * from x where id in temp.keys order by 
id'):
            pass
        db.executescript('drop table temp.keys')
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 7: Using IN keyset sorted ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        db.executescript('create temporary table keys (key primary key) without 
rowid')
        db.executemany('insert or ignore into keys values (?)', 
list(map(tuplize, rowset)))
        for row in db.execute('select * from x where id in temp.keys order by 
id'):
            pass
        db.executescript('drop table temp.keys')
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 8: Using IN (dynamic)     ', end=' ', flush=True)
        st = time.time()
        for row in db.execute('select * from x where id in (' + 
','.join(map(str, rowset)) + ') order by id'):
            pass
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 9: Using IN (sorted)      ', end=' ', flush=True)
        st = time.time()
        for row in db.execute('select * from x where id in (' + 
','.join(map(str, sorted(rowset))) + ') order by id'):
            pass
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()

    if carray:
        # 
-------------------------------------------------------------------------
        try:
            print('Method A: Using IN CArray        ', end=' ', flush=True)
            st = time.time()
            for row in db.execute('select * from x where id in 
carray(:l_address, :l_length, :l_type) order by id', carray('l', rowset)):
                pass
            et = time.time() - st
            print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
        except:
            print()
        # 
-------------------------------------------------------------------------
        try:
            print('Method B: Using IN CArray sorted ', end=' ', flush=True)
            st = time.time()
            for row in db.execute('select * from x where id in 
carray(:l_address, :l_length, :l_type) order by id', carray('l', 
sorted(rowset))):
                pass
            et = time.time() - st
            print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
        except:
            print()
        # 
-------------------------------------------------------------------------

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Fastest way to SELECT on a set of keys?

Reply via email to