Re: [sqlite] Virtual table API performance

Elefterios Stamatogiannakis Sun, 02 Mar 2014 05:22:20 -0800

In our performance tests we try to work with data and queries that arerepresentative of what we would find in a typical DB.

This means a lot of "small" values (ints, floats, small strings), and5-20 columns.

Our main test case is TPCH, a standard DB benchmark. The "lineitem"table of TPCH contains 16 columns, which for 10M rows would require 160MxColumn callbacks, to pass it through the virtual table API. Thesecallbacks are very expensive, especially when at the other end sits a VM(CPython or PyPy) handling them.

For PyPy in particular, which is able to produce JIT compiled andoptimized UDF code, adapted on the "shape" (type and value distribution)of the data flows as they pass through it, every time it faces thevirtual table API it slows down to a crawl, becoming more than 2x slowerthan interpreted Python. This happens because it cannot see the data/row"shape" from the many small and unrelated between each other, singlevalue based, xColumn callbacks.

Changing the subject, i've seen some requests in previous emails frompeople asking for windowing functions to be added to SQLite. I want topropose an alternative that we have been using for years, and is a lotmore generic than adding specific functions for very "narrow" use casesin SQLite.

We have added the "EXPAND" VT function in madIS, which "emulates" nestedtables in SQLite, enabling to have row and aggregate functions thatreturn (in a streaming fashion) multiple values on multiple columns. The"EXPAND" function, takes as input a table containing as values (in ourcase Python) generators, and then it calls the generators "expanding"the input table to its final form. "EXPAND" is automatically insertedwherever is required, so it isn't visible. An example follows:


> select strsplit('one and other');
one|and|other     <-- 3 columns

or

> select strsplitV('one and other');
one
and            <-- 3 individual rows
other

So by adding a single VT function and some syntactic sugar (autoinserting EXPAND VT), we were able to have functionality that is notcase specific, allowing us to run all kinds of analytics inside SQLite.

The performance of above functionality is already very good. But itcould be a lot better with a more efficient VT API.


Regards,

estama

On 2/3/2014 9:15 πμ, Max Vlasov wrote:

Hi,
thanks for explaining your syntax in another post. Now about virtual
tables if you don't mind.

On Fri, Feb 28, 2014 at 8:24 PM, Eleytherios Stamatogiannakis
<est...@gmail.com> wrote:


If we load into SQLite, ....

create table newtable as select * from READCOMPRESSEDFILE('ctable.rc');

it takes: 55 sec


If we create an external program ....

it takes: 19 sec (~3x faster than using the virtual table API)


Looking at your numbers, as a user (and fan :) of virtual tables I
decided to do some tests.

I have a virtual table "all values", it was designed for enumeration
of all tables values to the one single virtual table, so finally it is
a long list of

   TableName, TableRowId, FieldName, Value

so you get the idea. As an example of what it may do, you may open
places.sqlite of mozilla browser and do

   Select * from AllValues where Value Like "%sqlite.org%"

and see actual results even not knowing how they planned their schema.

Internally this virtual table simply uses general selects for all
other tables met in sqlite_master. This is a good (but probably not
the best) test for measuring virtual tables performance, because

   SELECT * FROM AllValues

is equivalent to reading all conventional tables of this database.
Besides
- the tool I use has a tweaker implemented with VFS that allows
measuring speed and other characteristics of the query performed while
the query is in effect.
- I have an option that forces resetting windows cache for the
database file when it is reopened. So with it we exclude the windows
cache from consideration so pure I/O reading is used. Btw, when you do
your comparison, it's very important to reset system cache before
every measurement that involves I/O.


So I took a comparatively large (500 Mb) database consisting of
several small and one big table (Posts) and compared two queries.

(Query1)

   Select sum(length(Body) + length(Title)) from Posts

This ones effectively reads the table data and uses
- length() to force sqlite reading texts that don't fit into single db page
- sum() to exclude accumulating results on my side from comparison, so
we have a single row, single column result from the work completely
done by sqlite.

(Query2)

   Select Sum(Length(Value)) from AllValues

This one performs basically the same but using sqlite virtual tables
api. It also touches other tables, but since they're small, we can
forget about this.

Query1 (General):
   Read: 540MB,
   Time: 24.2 sec,
   CPU Time: 6 Sec (25%)
   Speed: 22.31 MB/Sec

Query2 (Virtual):
   Read: 540MB,
   Time: 27.3 Sec,
   CPU Time: 13 sec (51%)
   Speed: 20 MB/Sec

In my particular test the noticeable difference is at the part of the
CPU spent more with the virtual table. I assume this can be related to
my own implementation of this virtual table since I should retrieve,
store values temporary somewhere and talk to sqlite. But this also may
shed light on your performance drop. If your virtual implementation
spend much time processing a value, you may finally get a big drop.

You may tell that this test is not fair because it does not involve
creating a table from the values of a virtual table. Unfortunately I
can't create good enough test comparing Posts and AllValues table as
sources, because the destination geometry of the tables are different
( Posts have more columns, less rows, AllValue less columns, more
rows). The closest approximation was possible when I created an
intermediate physical table containing the results from AllValues and
compared table creation from this table and from virtual table. The
virtual one took longer, but the values - 56 seconds vs 43 second not
different enough to conclude something.

I'm not sure my tests defend sqlite virtual tables sufficiently, but
currently I don't have evidence of significant inefficiency either.

Max


_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Virtual table API performance

Reply via email to