We have both input and output virtual tables that avoid hitting the hard disk and can also compress the incoming and outgoing data.

We have a virtual table that takes a query as input and sends its data to a port on another machine; this virtual table is called "OUTPUT". Another virtual table takes data arriving on a port and forwards it into SQLite; let's call it "INPUT". A query that uses these two virtual tables looks like this in madIS:

OUTPUT ip:192.168.0.1 port:8080 select * from INPUT('port:8081');
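
To make the mechanics concrete, below is a minimal sketch of an
INPUT-style virtual table written with APSW, the SQLite wrapper that
madIS builds on. The one-column, line-per-row wire format and all the
names here are illustrative assumptions; the real madIS implementation
handles multiple typed columns and compression, and the
INPUT('port:8081') table-function syntax is a madIS extension, so the
sketch declares the virtual table explicitly instead:

  # Minimal sketch of an INPUT-style virtual table using APSW.
  # Assumed wire format: one text column, one row per line.
  import socket
  import apsw

  class InputModule:
      def Create(self, db, modulename, dbname, tablename, port="8081"):
          # Declare one column; SQLite will call Column() once per
          # value, which is the per-cell callback cost discussed below.
          return "CREATE TABLE x(c0)", InputTable(int(port))
      Connect = Create

  class InputTable:
      def __init__(self, port):
          self.port = port
      def BestIndex(self, constraints, orderbys):
          return None
      def Open(self):
          return InputCursor(self.port)
      def Disconnect(self):
          pass
      Destroy = Disconnect

  class InputCursor:
      def __init__(self, port):
          # Block until a sender connects, then read rows as lines.
          server = socket.create_server(("", port))
          conn, _ = server.accept()
          self.lines = conn.makefile("r")
          self.row = None
          self.rowid = 0
      def Filter(self, indexnum, indexname, constraintargs):
          self.Next()
      def Eof(self):
          return self.row is None
      def Rowid(self):
          return self.rowid
      def Column(self, n):
          return self.row
      def Next(self):
          line = self.lines.readline()
          self.row = line.rstrip("\n") if line else None
          self.rowid += 1
      def Close(self):
          self.lines.close()

  con = apsw.Connection(":memory:")
  con.createmodule("INPUT", InputModule())
  cur = con.cursor()
  cur.execute("CREATE VIRTUAL TABLE temp.incoming USING INPUT(8081)")
  for (value,) in cur.execute("SELECT * FROM incoming"):
      print(value)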

We actually use queries like the above (though not directly over ports; we go through buffered named pipes that are then forwarded via netcat) to run distributed queries on clusters, connecting the local SQLite/madIS instances on the different machines together.

The main point I want to make with the above explanation is that we don't view SQLite only as a traditional database. We also view it as a data stream processing engine that doesn't require the data to be stored on a hard disk.

Under this view, the efficiency of the virtual table API is very important. The above query uses only 2 VTs, but we have other queries that use many more.

estama


On 2/3/2014 9:34 PM, Max Vlasov wrote:
On Sun, Mar 2, 2014 at 5:21 PM, Elefterios Stamatogiannakis
<est...@gmail.com> wrote:

Our main test case is TPCH, a standard DB benchmark. The "lineitem" table of
TPCH contains 16 columns, so passing 10M rows through the virtual table
API requires 160M xColumn callbacks. These callbacks are very expensive,
especially when a VM (CPython or PyPy) sits at the other end handling
them.
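
As a rough sense of scale for those 160M boundary crossings, the
following back-of-the-envelope sketch times a trivial Python-level stub
standing in for an xColumn handler. The stub is an assumption made for
illustration; a real handler does strictly more work per call:

  # Time a no-op Python callback and extrapolate to 160M calls
  # (16 columns x 10M rows, as in the lineitem example above).
  import time

  def xcolumn_stub(col):
      return col  # stands in for fetching one column value

  N = 10_000_000
  start = time.perf_counter()
  for i in range(N):
      xcolumn_stub(i)
  per_call = (time.perf_counter() - start) / N
  print(f"~{per_call * 1e9:.0f} ns per call")
  print(f"~{per_call * 160_000_000:.1f} s for 160M callbacks")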


Ok, I'm not saying that a performance improvement is impossible, but I
will explain why I'm a little sceptical about it.

For every bulk insert there is a theoretical maximum we'd all be glad
to see sqlite perform at: the speed of simple file copying. Sqlite
can't be faster than that, but being on par is a good goal. This is
not possible when an insert also means modifying other parts of the
file, for example when there's an index involved, but let's set that
aside. When new data is added, sqlite has to write a number of
database pages, and the cost of that part is entirely in the hands of
the media and the OS (drivers). But every database page write also
carries a price in CPU units, for the many actions sqlite must perform
before the value the developer provided is translated into what
actually appears on disk.

An illustration of the CPU price is the following example:
  CREATE TABLE t(Value)

On my ssd drive, many inserts (thousands) of the form
   insert into t (Value) values ('123456689...')
   -- the string contains many symbols, for example 1024
performed at a speed of
   30 MB/sec

but the query
   insert into t (Value) values (100000)  -- a small integer value
managed only
   3 MB/sec

Both show almost full cpu load. Why such a difference? Because with
the latter query the system could do more than 30 MB of writes in one
second, but it has to wait while sqlite spends ten seconds in
preparations: with small values, the per-row CPU work dominates, since
each row carries only a few bytes of payload. The former is better
because the CPU cost of passing a large text value to sqlite is
comparatively low compared to the time spent in I/O writing it to
disk.
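
For anyone who wants to reproduce the comparison, here is a minimal
sketch using Python's sqlite3 module; the module choice, row count,
and file name are assumptions, and absolute numbers will vary by
machine, but the large-text inserts should show a much higher payload
throughput than the small-integer ones:

  # Rough reproduction of the two insert workloads above.
  import os
  import sqlite3
  import time

  def bench(label, value, n=100_000):
      if os.path.exists("bench.db"):
          os.remove("bench.db")
      con = sqlite3.connect("bench.db")
      con.execute("CREATE TABLE t(Value)")
      start = time.perf_counter()
      con.executemany("INSERT INTO t(Value) VALUES (?)", ((value,),) * n)
      con.commit()
      elapsed = time.perf_counter() - start
      payload = n * len(str(value))
      print(f"{label}: {payload / elapsed / 1e6:.1f} MB/sec of payload")
      con.close()

  bench("1 KB text", "1" * 1024)  # a large text value per row
  bench("small int", 100000)      # a small integer value per row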

So the CPU price isn't avoidable, and notice that in this example it
is not the virtual table API at work, it is the bind API. I suppose
the CPU price we pay in the virtual table API is on par with the
average price paid in sqlite as a whole. This means that if I
transform the above queries into inserts from virtual tables, the
final speed difference will be similar. And it also means that in your
comparison tests (where you get a x3 difference), the CPU price sqlite
pays inside the bind API and in its code wrapping the xColumn call is
probably similar. The rest is the share your code pays.

Well, I know there are differences in CPU architectures, and there are
probably platforms where the compiled code for the bind API and the
virtual table API behaves a little differently, making the costs
diverge more. But imagine the hard task of fine tuning and refactoring
just to get a noticeable difference on a particular platform.


Max