In our performance tests we try to work with data and queries that are
representative of what we would find in a typical DB.
This means a lot of "small" values (ints, floats, small strings), and
5-20 columns.
Our main test case is TPCH, a standard DB benchmark. The "lineitem"
table of TPCH contains 16 columns, which for 10M rows would require 160M
xColumn callbacks, to pass it through the virtual table API. These
callbacks are very expensive, especially when at the other end sits a VM
(CPython or PyPy) handling them.
For PyPy in particular, which is able to produce JIT compiled and
optimized UDF code, adapted on the "shape" (type and value distribution)
of the data flows as they pass through it, every time it faces the
virtual table API it slows down to a crawl, becoming more than 2x slower
than interpreted Python. This happens because it cannot see the data/row
"shape" from the many small and unrelated between each other, single
value based, xColumn callbacks.
Changing the subject, i've seen some requests in previous emails from
people asking for windowing functions to be added to SQLite. I want to
propose an alternative that we have been using for years, and is a lot
more generic than adding specific functions for very "narrow" use cases
in SQLite.
We have added the "EXPAND" VT function in madIS, which "emulates" nested
tables in SQLite, enabling to have row and aggregate functions that
return (in a streaming fashion) multiple values on multiple columns. The
"EXPAND" function, takes as input a table containing as values (in our
case Python) generators, and then it calls the generators "expanding"
the input table to its final form. "EXPAND" is automatically inserted
wherever is required, so it isn't visible. An example follows:
> select strsplit('one and other');
one|and|other <-- 3 columns
or
> select strsplitV('one and other');
one
and <-- 3 individual rows
other
So by adding a single VT function and some syntactic sugar (auto
inserting EXPAND VT), we were able to have functionality that is not
case specific, allowing us to run all kinds of analytics inside SQLite.
The performance of above functionality is already very good. But it
could be a lot better with a more efficient VT API.
Regards,
estama
On 2/3/2014 9:15 πμ, Max Vlasov wrote:
Hi,
thanks for explaining your syntax in another post. Now about virtual
tables if you don't mind.
On Fri, Feb 28, 2014 at 8:24 PM, Eleytherios Stamatogiannakis
<est...@gmail.com> wrote:
If we load into SQLite, ....
create table newtable as select * from READCOMPRESSEDFILE('ctable.rc');
it takes: 55 sec
If we create an external program ....
it takes: 19 sec (~3x faster than using the virtual table API)
Looking at your numbers, as a user (and fan :) of virtual tables I
decided to do some tests.
I have a virtual table "all values", it was designed for enumeration
of all tables values to the one single virtual table, so finally it is
a long list of
TableName, TableRowId, FieldName, Value
so you get the idea. As an example of what it may do, you may open
places.sqlite of mozilla browser and do
Select * from AllValues where Value Like "%sqlite.org%"
and see actual results even not knowing how they planned their schema.
Internally this virtual table simply uses general selects for all
other tables met in sqlite_master. This is a good (but probably not
the best) test for measuring virtual tables performance, because
SELECT * FROM AllValues
is equivalent to reading all conventional tables of this database.
Besides
- the tool I use has a tweaker implemented with VFS that allows
measuring speed and other characteristics of the query performed while
the query is in effect.
- I have an option that forces resetting windows cache for the
database file when it is reopened. So with it we exclude the windows
cache from consideration so pure I/O reading is used. Btw, when you do
your comparison, it's very important to reset system cache before
every measurement that involves I/O.
So I took a comparatively large (500 Mb) database consisting of
several small and one big table (Posts) and compared two queries.
(Query1)
Select sum(length(Body) + length(Title)) from Posts
This ones effectively reads the table data and uses
- length() to force sqlite reading texts that don't fit into single db page
- sum() to exclude accumulating results on my side from comparison, so
we have a single row, single column result from the work completely
done by sqlite.
(Query2)
Select Sum(Length(Value)) from AllValues
This one performs basically the same but using sqlite virtual tables
api. It also touches other tables, but since they're small, we can
forget about this.
Query1 (General):
Read: 540MB,
Time: 24.2 sec,
CPU Time: 6 Sec (25%)
Speed: 22.31 MB/Sec
Query2 (Virtual):
Read: 540MB,
Time: 27.3 Sec,
CPU Time: 13 sec (51%)
Speed: 20 MB/Sec
In my particular test the noticeable difference is at the part of the
CPU spent more with the virtual table. I assume this can be related to
my own implementation of this virtual table since I should retrieve,
store values temporary somewhere and talk to sqlite. But this also may
shed light on your performance drop. If your virtual implementation
spend much time processing a value, you may finally get a big drop.
You may tell that this test is not fair because it does not involve
creating a table from the values of a virtual table. Unfortunately I
can't create good enough test comparing Posts and AllValues table as
sources, because the destination geometry of the tables are different
( Posts have more columns, less rows, AllValue less columns, more
rows). The closest approximation was possible when I created an
intermediate physical table containing the results from AllValues and
compared table creation from this table and from virtual table. The
virtual one took longer, but the values - 56 seconds vs 43 second not
different enough to conclude something.
I'm not sure my tests defend sqlite virtual tables sufficiently, but
currently I don't have evidence of significant inefficiency either.
Max
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users