We have gotten around this problem by defining "virtual" fields that contain a reference to the "current record" and the entry point of a wrapper around the xColumn function. That way, only two fields get passed upwards through the virtual table stack, and the top-level virtual table's xColumn implementation calls straight through to the bottom layer's wrapper.
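The pass-through scheme above can be sketched in Python (the real implementation is C against the sqlite3_module API, where the two "virtual" fields would be hidden columns carrying a record pointer and a function pointer; all class and method names here are hypothetical stand-ins):

```python
# Hypothetical model of the pass-through trick: intermediate layers
# forward only two "virtual" fields -- the current record and the bottom
# layer's column-wrapper entry point -- instead of relaying every
# xColumn call down the stack.

class Bottom:
    """The bottom-layer virtual table that owns the actual rows."""
    def __init__(self, rows):
        self._rows = rows
        self._pos = 0

    @staticmethod
    def column_wrapper(record, i):
        # The wrapper around xColumn: reads a value straight off a record.
        return record[i]

    def virtual_fields(self):
        # The only two fields passed upwards through the stack.
        return self._rows[self._pos], Bottom.column_wrapper


class Layer:
    """An intermediate or top-level virtual table stacked on another one."""
    def __init__(self, lower):
        self._lower = lower

    def virtual_fields(self):
        # Intermediate layers merely forward the two virtual fields.
        return self._lower.virtual_fields()

    def xColumn(self, i):
        # The top level calls straight through to the bottom wrapper:
        # one call, regardless of how many layers are stacked.
        record, wrapper = self.virtual_fields()
        return wrapper(record, i)


# Three layers deep, yet xColumn reads the bottom rows directly.
stack = Layer(Layer(Layer(Bottom([("alpha", 42, 3.14)]))))
value = stack.xColumn(1)
```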
It does take some care to avoid sorting between the layers and re-preparation of statements on schema changes.

-----Original Message-----
From: Elefterios Stamatogiannakis [mailto:est...@gmail.com]
Sent: Sunday, 02 March 2014 20:39
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Virtual table API performance

We have both input and output virtual tables that avoid hitting the hard disk and are also able to compress the incoming and outgoing data.

We have a virtual table that takes a query as input and sends the data to a port on another machine. This virtual table is called "OUTPUT". And another virtual table that takes data from another port as input and forwards it into SQLite. Let's call it "INPUT". A query that uses these two virtual tables would look like this in madIS:

OUTPUT ip:192.168.0.1 port:8080 select * from INPUT('port:8081');

We use queries like the above (though not directly to ports, but to buffered named pipes that are then forwarded via netcat) to run distributed queries on clusters, connecting all the local SQLite/madIS instances on the different machines together.

The main point I want to make with the above explanation is that we don't view SQLite only as a traditional database. We also view it as a data stream processing machine that doesn't require the data to be stored on a hard disk. Under this view, the efficiency of the virtual table API is very important. The query above uses only 2 VTs, but we have other queries that use a lot more.

estama

On 2/3/2014 9:34 μμ, Max Vlasov wrote:
> On Sun, Mar 2, 2014 at 5:21 PM, Elefterios Stamatogiannakis
> <est...@gmail.com> wrote:
>>
>> Our main test case is TPCH, a standard DB benchmark. The "lineitem"
>> table of TPCH contains 16 columns, which for 10M rows would require
>> 160M xColumn callbacks to pass it through the virtual table API.
>> These callbacks are very expensive, especially when at the other end
>> sits a VM (CPython or PyPy) handling them.
>>
>
> Ok, not stating that the performance improvement is impossible, I will
> explain why I'm a little sceptical about it.
>
> For every bulk insert we have a theoretical maximum we'd all be glad to
> see sqlite perform at: the speed of simple file copying. Sqlite can't
> be faster than that, but to be on par is a good goal. This is not
> possible when an insert also means modification of other parts of the
> file, for example when there's an index involved, but let's forget
> about that. Finally, when new data is added, sqlite has to write a
> number of database pages; the cost of this part is absolutely in the
> hands of the media (driver) and OS (driver). But for every database
> page write there's also a price to pay in CPU units, for the many
> actions sqlite must perform before the value the developer provided is
> translated into what actually appears on disk.
>
> An illustration of the CPU price is the following example:
>
> CREATE TABLE t(Value)
>
> On my ssd drive, multiple inserts (thousands) of
>
> insert into t (Value) values ('123456689.... // this string
> contains many symbols, for example 1024
>
> performed at 30 MB/sec,
>
> but the query
>
> insert into t (Value) values (100000) // this is a small integer value
>
> only at 3 MB/sec.
>
> Both show almost full cpu load. Why such a difference? Because with the
> latter query the system could do more than 30 MB of writes in 1 second,
> but it has to wait for sqlite, spending 10 seconds in preparations. The
> former is better because the CPU cost of passing a large text value to
> sqlite is low compared to the time spent in I/O writing it to disk.
>
> So the CPU price isn't avoidable, and notice that in this example it is
> not the virtual table API at work, it is the bind API.
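Max's illustration can be reproduced with a quick sketch using Python's stdlib sqlite3 module (`bench` is a made-up helper; the absolute MB/sec figures depend entirely on the machine, so none are asserted here):

```python
import sqlite3
import time

def bench(value, n=20000):
    """Insert n copies of `value` into a fresh in-memory table and
    return (elapsed seconds, rows actually inserted)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t(Value)")
    start = time.perf_counter()
    con.executemany("INSERT INTO t(Value) VALUES (?)",
                    ((value,) for _ in range(n)))
    con.commit()
    elapsed = time.perf_counter() - start
    rows = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    con.close()
    return elapsed, rows

# A ~1 KB text value versus a small integer, as in the example above.
t_text, _ = bench("1234567890" * 102)
t_int, _ = bench(100000)

# Per-row CPU overhead (parsing, binding, page bookkeeping) is roughly
# the same in both cases, so throughput in MB/sec is far higher for the
# large text value: the fixed per-row cost dominates when values are small.
```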
> I suppose that the price we pay for CPU spent in the virtual table API
> is on par with the average price paid in sqlite as a whole. This means
> that if I transform the above queries into inserts from virtual
> tables, the final speed difference will be similar. And this also
> means that for your comparison tests (where you get a x3 difference),
> the CPU price sqlite pays inside the bind api and in its code wrapping
> the xColumn call is probably similar. The rest is the share your code
> pays.
>
> Well, I know that there are differences in CPU architectures, and
> probably there are platforms where compiled code for the bind api and
> the virtual table api behaves a little differently, making the costs
> more different. But imagine the hard task of fine tuning and
> refactoring just to get a noticeable difference for a particular
> platform.
>
>
> Max
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

-----------------------------------------------------------------------
Gunter Hick
Software Engineer
Scientific Games International GmbH
Klitschgasse 2 – 4, A - 1130 Vienna, Austria
FN 157284 a, HG Wien
Tel: +43 1 80100 0
E-Mail: h...@scigames.at