[sqlite] Ordered, union all of ordered views

2014-08-01 Thread Eleytherios Stamatogiannakis

Hello,

We have two virtual tables that are ordered, and we do a "union all" of 
them, asking SQLite to keep the result ordered.


So the query looks like this:

select * from (select x from ordered_VT1 UNION ALL select x from 
ordered_VT2) order by x;


Both ordered_VT1 and ordered_VT2 report back (via BestIndex) that their 
results are ordered on x. Note that when SQLite negotiates with 
ordered_VT1 and ordered_VT2 (via BestIndex), it doesn't even ask them 
whether an ordering on x already exists.


Right now SQLite does a full scan of ordered_VT1, and then ordered_VT2 
before starting to produce results.


Shouldn't it do a merge union all of the two? Is there some way to help 
SQLite's planner to "see" that such a possibility exists?


Kind regards,

l.


Re: [sqlite] Merge join in SQLite

2014-06-30 Thread Eleytherios Stamatogiannakis
Thank you for the answer as well as the merge-join complexity remark. 
What about the union-all with ordered input and an order-by on the whole 
query? Does SQLite use a "merge" algorithm for that case?


Kind regards,

l.

On 26/06/14 21:45, Richard Hipp wrote:




On Thu, Jun 26, 2014 at 11:41 AM, Eleytherios Stamatogiannakis
<est...@gmail.com> wrote:

Hi,

I've seen that SQLite supports the "SorterOpen" Op, to do merge
sorting. So based on that, does SQLite support merge joins when both
inputs of the join are sorted?


No, not at this time.  Note that a merge-join is more complicated than
it appears at first glance for the common case where the join key is not
unique in one or the other of the two tables being joined.


--
D. Richard Hipp
d...@sqlite.org




[sqlite] Merge join in SQLite

2014-06-26 Thread Eleytherios Stamatogiannakis

Hi,

I've seen that SQLite supports the "SorterOpen" Op, to do merge sorting. 
So based on that, does SQLite support merge joins when both inputs of 
the join are sorted?


Kind regards,

l.


Re: [sqlite] RPAD/LPAD

2014-03-07 Thread Eleytherios Stamatogiannakis
IMHO, SQLite is targeted towards being a relational "core" with very 
wide extensibility. Adding specific scripting engines to it would be 
detrimental to its main purpose (being a very good relational "core").


In our group, we use SQLite + UDFs written in Python. Creating new 
functions is very simple (most of the functions are a couple of Python 
lines). SQLite's API permits very advanced functionality to be tied 
to it (infinite streaming virtual tables, indexes written in Python, 
etc.). And (in our benchmarks) the speed is better than Postgres and a 
LOT better than MySQL.


Finally, the simplicity and extensibility of SQLite have permitted us to 
create hundreds of functions in a very short time [*].


l.

[*] http://doc.madis.googlecode.com/hg/index.html

On 07/03/14 16:59, Dominique Devienne wrote:

On Fri, Mar 7, 2014 at 3:39 PM, Clemens Ladisch  wrote:

Max Vlasov wrote:

On Fri, Mar 7, 2014 at 12:49 PM, Dominique Devienne  wrote:

I think what SQLite lacks is a syntax to define custom functions like
it does for virtual tables. Something like:

create function rpad(x, y) using scripty_module as "return PRINTF('%-*s',y,x)";


Nice suggestion. This probably falls into the case where a small new part
is needed on the SQLite side


Actually, no change to SQLite itself would be needed.  It's possible
to create an extension that provides a function that allows registering
another function that executes a custom SQL expression:

  SELECT register_simple_function('rpad', 2, 'SELECT printf(''%-*s'', ?, ?)');
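
The padding trick itself relies only on SQLite's built-in printf() SQL function
(a minimal sketch; printf() is available in recent SQLite versions and the width
is supplied via '*', exactly as in the expression above):

  SELECT printf('%-*s', 10, 'abc');   -- 'abc' right-padded with spaces to width 10 (an RPAD)
  SELECT printf('%*s', 10, 'abc');    -- 'abc' left-padded to width 10 (an LPAD)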


I've already mentioned that, Clemens. But I also pointed out that you then
don't have a standard way to register those dynamically generated
functions. FWIW. --DD





Re: [sqlite] Virtual table API performance

2014-03-04 Thread Eleytherios Stamatogiannakis

On 04/03/14 20:11, Alek Paunov wrote:

On 04.03.2014 15:05, Eleytherios Stamatogiannakis wrote:

On 03/03/14 03:01, Alek Paunov wrote:

It seems that the easiest optimization for this (very common) VT use case
(bulk streaming) is for an SQLite add-on to be written in _C_, implementing
a vtable interface specialization containing an xNextPage callback that
"buffers" let's say 4K rows, or even better 16KB of data (in addition to
your initial proposal of xNextRow).

The technical question is: how should the rows be encoded? You said
initially that you use some compressed format. But for such an extension
to gain more traction in the future, it would probably be better to choose
a more standard format.

a) Rows represented in native SQLite3 format [3]
b) ... native SQLite4 format
c) Some widely used encoding close to SQLite types [4]
d) ...

[3] http://www.sqlite.org/fileformat.html#record_format
[4] https://github.com/msgpack/msgpack/blob/master/spec.md#formats

IMHO, it would be very nice if some common conventions for binary record
streaming could be discussed and adopted across the SQLite binding and
add-on developers. The possible applications are not limited to
vtables ;-).


SQLite doesn't need any special format for the records to be passed over
to it. It already has the "bind" API which would be very suitable for
the xNextRow function too.



It seems that I do not know the sqlite3_ API very well.

http://www.sqlite.org/c3ref/funclist.html

Would you like to point me to the "bind" API page?


In the link that you posted above, look for all the sqlite3_bind_x 
functions. In SQLite the bind API is used to pass parameters to prepared 
statements. The way the bind API works is that you have a statement 
parameter "row", and you fill it by saying:


Set column 1 of statement parameter "row" to an int with value 10
Set column 2 of statement parameter "row" to a float with value 3.5
...

So instead of SQLite calling back for each column, in the bind API 
"way" the program calls SQLite to fill in a row's values.


Regards.

l.




For a paging API (which IMHO is too complex ATM), the bind API could be
extended with a row number parameter.






Re: [sqlite] Virtual table API performance

2014-03-04 Thread Eleytherios Stamatogiannakis

Thank you for the explanation.

Answers inline.

On 04/03/14 16:16, Hick Gunter wrote:

Your VT1 table already has an xColumn implementation, possibly doing something 
like

 switch( p_column )
 {
 case 1: sqlite3_result_xxx( p_ctx, v_rec->f1, ...); break;
 ...
 case n: sqlite3_result_xxx( p_ctx, v_rec->fn, ...); break;
 }

This needs to have two cases added:

 case n+1: sqlite3_result_int64( p_ctx, (uintptr_t)v_rec );  break;
 case n+2: sqlite3_result_int64( p_ctx, (uintptr_t)func );   break;

where

 static int func( p_rec, p_ctx, p_column );

calls

 xColumn( v_cursor, p_ctx, p_column );

with a dummy cursor structure as defined for your table.

The VT2 table can then prepare "select __rec,__func from VT1", and in its 
xColumn implementation it calls

  v_rec  = (void *)sqlite3_column_int64( v_stmt, 0 ); // can be stored and cleared in the xNext function
  v_func = (func *)sqlite3_column_int64( v_stmt, 1 ); // can be stored once
  v_func( v_rec, p_ctx, p_column );


I see, so you do a trick similar to what we do when passing Python 
generators as values in SQLite.




As for your second example, as written it does not suffer from the effect 
because you are already selecting c1, c2 and c3 at the bottom level.

Rewritten as

Select processrow(c1,c2,c3) from VT2(select * from VT1);


Without knowing what VT2 will do, I don't think that this rewriting can 
happen. For example, "processrow" might return generators (nested 
tables) that get expanded by VT2. If you moved it outside VT2, then the 
generators would not be expanded.


Regards,

l.



results in the VT1 xColumn function getting called (via the VT2 xColumn 
function) just 3 times per row.

Additionally, you may like to "select __func from VT1 limit 1" and store that in your 
xFilter implementation; and then "select __rec from VT1" in your xNext implementation to 
have sqlite3_result_int64() called half as often.

HTH

-----Original Message-----
From: Eleytherios Stamatogiannakis [mailto:est...@gmail.com]
Sent: Tuesday, 04 March 2014 14:15
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Virtual table API performance

Could you explain your solution some more?

Does it work in the following case:

select * from VT2(select * from VT1);

by directly passing the rows from VT1 to VT2 (short-circuiting SQLite)?

What would happen in the following case?:

select * from VT2(select processrow(c1,c2,c3) from VT1);

Regards,

l.

On 03/03/14 14:17, Hick Gunter wrote:

We have gotten around this problem by defining "virtual" fields that contain a reference 
to the "current record" and the entrypoint of a wrapper around the xColumn function. That 
way only two fields get passed upwards through the virtual table stack and the top level virtual 
table's xColumn implementation calls straight through to the bottom layer's wrapper.

It does take some care to avoid sorting in between the layers and 
re-preparation of statements on schema changes.

-----Original Message-----
From: Elefterios Stamatogiannakis [mailto:est...@gmail.com]
Sent: Sunday, 02 March 2014 20:39
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Virtual table API performance

We have both input and output virtual tables that avoid hitting the hard disk 
and are also able to compress the incoming and outgoing data.

We have a virtual table that takes as input a query and sends the data to a port on another 
machine. This virtual table is called "OUTPUT". And another virtual table that takes as 
input data from another port and forwards it into SQLite. Let's call it "INPUT". A query 
that uses these two virtual tables would look like this in madIS:

OUTPUT ip:192.168.0.1 port:8080 select * from INPUT('port:8081');

We actually use queries like the above (actually we don't do it directly to ports 
but to buffered named pipes that are then forwarded via netcat) to run 
distributed queries on clusters, connecting all the local SQLite/madIS 
instances on the different machines together.

The main point that I want to make with the above explanation is that we don't view 
SQLite only as a traditional database. We also view it as a data stream 
processing machine that doesn't require the data to be stored 
on a hard disk.

Under this view, the efficiency of the virtual table API is very important. 
The above query uses only 2 VTs, but we have other queries that use a lot 
more VTs than that.

estama


On 2/3/2014 9:34 PM, Max Vlasov wrote:

On Sun, Mar 2, 2014 at 5:21 PM, Elefterios Stamatogiannakis
<est...@gmail.com> wrote:


Our main test case is TPCH, a standard DB benchmark. The "lineitem"
table of TPCH contains 16 columns, which for 10M rows would require
160M xColumn callbacks, to pass it through the virtual table API.
These callbacks are very expensive, especially when at the other end
sits a VM (CPython or PyPy) handling them.

Re: [sqlite] Virtual table API performance

2014-03-04 Thread Eleytherios Stamatogiannakis
In our tests the bind API can acquire from the Python side more than 20 
values in a single call, in the same time that xColumn acquires 2 values.


Most of the cost is in the callback and not in submitting a row's values 
through the bind API.


So with the exception of queries that need only 1 column, IMHO 
everything else should go through the xNextRow API.


To keep the complexity to a minimum, my proposal is to use the 
xNextRow API only for queries that just "scan" over a virtual table (no 
filtering).
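
To illustrate the distinction (a sketch reusing VT names that appear elsewhere
in this thread; c1 is a hypothetical column name):

  create table cliptab as select * from clipboard();   -- pure scan: xNextRow could stream whole rows
  select * from clipboard() where c1 = 10;              -- filtering: stays on the existing xFilter/xColumn path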


l.

On 04/03/14 18:23, Hick Gunter wrote:

My guess: Yes.

It would require implementing a new opcode, either only for virtual tables or 
for native tables too, that accepts a list of field numbers (currently 
there are only 5 parameters possible for an opcode and some of them have fixed 
meanings).

And the logic to generate these opcodes based on the capabilities of the loaded table 
module combined with the requirements of the subject query (fields required for JOIN are 
fetched separately from those required for the result set) and the result of the 
xBestIndex calls (where it is possible to set the "omit" flag to suppress 
generation of a comparison). This also adds to the complexity of register allocation.

Take for example a join that needs 3 fields for the comparison, 2 of which are 
also required for the result set of 7 fields total. Do you request all 10 
fields at once (wastes up to 9 fields worth of effort if the JOIN is not met)? 
Or the 3 fields first and the 5 others only if the join matches (must allocate 
consecutive registers to build a result set)? Or 3 first and then 7 (which 
approximates the current behavior, as the 2 common fields are fetched twice on 
a match)?

And a set of new sqlite3_result routines that specify which of the various 
requested fields' value is being set.

-----Original Message-----
From: J. Merrill [mailto:j.merr...@enlyton.com]
Sent: Tuesday, 04 March 2014 16:23
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Virtual table API performance

Eleytherios Stamatogiannakis wrote

Our main test case is TPCH, a standard DB benchmark. The "lineitem"
table of TPCH contains 16 columns, which for 10M rows would require
160M xColumn callbacks, to pass it through the virtual table API.
These callbacks are very expensive, especially when at the other end
sits a VM (CPython or PyPy) handling them.


Would it be very difficult to arrange for an option that would request that 
SQLite issue a single more-complex xMultiColumns (a sample name) callback 
request, with a way for multiple results to be returned, rather than many 
xColumn callbacks? This would reduce the number of calls across the VM boundary.

Applications that don't implement xMultiColumns (and request its use) would see 
no change; those that do would get the performance boost.

J. Merrill










Re: [sqlite] Virtual table API performance

2014-03-04 Thread Eleytherios Stamatogiannakis

Could you explain your solution some more?

Does it work in the following case:

select * from VT2(select * from VT1);

by directly passing the rows from VT1 to VT2 (short-circuiting SQLite)?

What would happen in the following case?:

select * from VT2(select processrow(c1,c2,c3) from VT1);

Regards,

l.

On 03/03/14 14:17, Hick Gunter wrote:

We have gotten around this problem by defining "virtual" fields that contain a reference 
to the "current record" and the entrypoint of a wrapper around the xColumn function. That 
way only two fields get passed upwards through the virtual table stack and the top level virtual 
table's xColumn implementation calls straight through to the bottom layer's wrapper.

It does take some care to avoid sorting in between the layers and 
re-preparation of statements on schema changes.

-----Original Message-----
From: Elefterios Stamatogiannakis [mailto:est...@gmail.com]
Sent: Sunday, 02 March 2014 20:39
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Virtual table API performance

We have both input and output virtual tables that avoid hitting the hard disk 
and are also able to compress the incoming and outgoing data.

We have a virtual table that takes as input a query and sends the data to a port on another 
machine. This virtual table is called "OUTPUT". And another virtual table that takes as 
input data from another port and forwards it into SQLite. Let's call it "INPUT". A query 
that uses these two virtual tables would look like this in madIS:

OUTPUT ip:192.168.0.1 port:8080 select * from INPUT('port:8081');

We actually use queries like the above (actually we don't do it directly to ports 
but to buffered named pipes that are then forwarded via netcat) to run 
distributed queries on clusters, connecting all the local SQLite/madIS 
instances on the different machines together.

The main point that I want to make with the above explanation is that we don't view 
SQLite only as a traditional database. We also view it as a data stream 
processing machine that doesn't require the data to be stored 
on a hard disk.

Under this view, the efficiency of the virtual table API is very important. 
The above query uses only 2 VTs, but we have other queries that use a lot 
more VTs than that.

estama


On 2/3/2014 9:34 PM, Max Vlasov wrote:

On Sun, Mar 2, 2014 at 5:21 PM, Elefterios Stamatogiannakis
 wrote:


Our main test case is TPCH, a standard DB benchmark. The "lineitem"
table of TPCH contains 16 columns, which for 10M rows would require
160M xColumn callbacks, to pass it through the virtual table API.
These callbacks are very expensive, especially when at the other end
sits a VM (CPython or PyPy) handling them.



Ok, not stating that the performance improvement is impossible, I will
explain why I'm a little sceptical about it.

For every bulk insert we have a theoretical maximum we'd all be glad to
see sqlite perform at: the speed of simple file copying.
Sqlite can't be faster than that, but to be on par is a good goal.
This is not possible when an insert also means modification of other
parts of the file, for example when there's an index involved. But
let's forget about that. Finally, when new data is added, sqlite has to
write a number of database pages; the cost of this part is absolutely
in the hands of the media (driver) and OS (driver).  But for every
database page write there is also a price to pay in CPU units, for the many
actions sqlite has to do before the value the developer provided is
translated to what actually appears on disk.

An illustration of the CPU price is the following example:
   CREATE TABLE t(Value)

On my ssd drive, multiple inserts (thousands) of the form
   insert into t (Value) values ('123456689...')  // a string containing many symbols, for example 1024
performed at a speed of 30 MB/sec,

but the query
   insert into t (Value) values (10)  // a small integer value only
at 3 MB/sec.

Both show almost full CPU load. Why such a difference? Because with the
latter query the system can do more than 30 MB of writes in 1 second,
but it has to wait for sqlite, which spends 10 seconds in preparations.
The former is better because the CPU cost of passing a large text value to
sqlite is comparatively low compared to the time spent in I/O
writing it to disk.

So the CPU price isn't avoidable, and notice that in this example it's
not the virtual table API, it's the bind API. I suppose that the price we
pay for CPU spent in the virtual table API is on par with the average price
paid in sqlite as a whole. This means that if I transform the above
queries into inserts from virtual tables, the final speed difference
will be similar. And this also means that for your comparison tests
(when you get a 3x difference), the CPU price sqlite pays inside the bind
api and in its code wrapping the xColumn call is probably similar. The
rest is the share your code pays.

Well, I know that there are differences in CPU architectures 

Re: [sqlite] Virtual table API performance

2014-03-04 Thread Eleytherios Stamatogiannakis

On 03/03/14 03:01, Alek Paunov wrote:

On 02.03.2014 21:38, Elefterios Stamatogiannakis wrote:

Under this view, the efficiency of the virtual table api is very
important. Above query only uses 2 VTs in it, but we have other queries
that use a lot more VTs than that.


Max's tests in C show 2x CPU work, but he explains that the test is not
very sound, so let's say somewhere between 1x-2x. Your tests: 3x the time.

As you have already identified, the real reason probably is the million-scale
callback quantity across the VM barrier. I do not follow PyPy,
but see these notes [1] by Mike Pall, the LuaJIT author (LuaJIT is the
leading project in the trace-compiler field):

[1] http://luajit.org/ext_ffi_semantics.html#callback_performance

Also from one of the dozens of threads touching the subject:

[2] http://www.freelists.org/post/luajit/Yielding-across-C-boundaries,3

```
Entering the VM needs a lot of state setup and leaving it isn't
free either. Constantly entering and leaving the VM via a callback
from C *to* Lua has a high overhead. For short callbacks, the
switching overhead between C and Lua may completely dominate the
total CPU time.

Calling an iterator written in C via the FFI *from* a Lua program
is much cheaper -- this compiles down to a simple call instruction.
```


I remember that I had seen the above quote from Mike Pall, but I couldn't 
find it.


Thank you for unearthing it.


Unfortunately, for your "insert into t select * from vt" case the
callback/iterator transformation is not possible (we do not have a
repetitive _step call to invert the control somehow). What to do?

It seems that the easiest optimization for this (very common) VT use case
(bulk streaming) is for an SQLite add-on to be written in _C_, implementing
a vtable interface specialization containing an xNextPage callback that
"buffers" let's say 4K rows, or even better 16KB of data (in addition to
your initial proposal of xNextRow).

The technical question is: how should the rows be encoded? You said
initially that you use some compressed format. But for such an extension
to gain more traction in the future, it would probably be better to choose
a more standard format.

a) Rows represented in native SQLite3 format [3]
b) ... native SQLite4 format
c) Some widely used encoding close to SQLite types [4]
d) ...

[3] http://www.sqlite.org/fileformat.html#record_format
[4] https://github.com/msgpack/msgpack/blob/master/spec.md#formats

IMHO, it would be very nice if some common conventions for binary record
streaming could be discussed and adopted across the SQLite binding and
add-on developers. The possible applications are not limited to
vtables ;-).


SQLite doesn't need any special format for the records to be passed over 
to it. It already has the "bind" API which would be very suitable for 
the xNextRow function too.


For a paging API (which IMHO is too complex ATM), the bind API could be 
extended with a row number parameter.


Regards,

estama



Kind regards,
Alek





[sqlite] Virtual table API performance

2014-02-28 Thread Eleytherios Stamatogiannakis

Hello,

First of all, I would like to say that, IMHO, SQLite's overall design 
and code quality are top notch. Having said that, SQLite's virtual table 
API, while beautifully designed, is not very efficient.


We have been hitting our heads against the virtual table API for the 
last few years, trying to make it work efficiently. During that time, we 
thought that the problem was on our side, in the code that feeds the 
virtual table API. It turns out that the virtual table API itself isn't 
very efficient.


The main cause of inefficiency is that it is extremely "chatty". For an 
external stream that contains many columns, "xColumn" can be called back 
hundreds of millions of times for the stream to be consumed by SQLite. 
These callbacks have a very big cost. Let me describe a test that we did.


For our work, we use compressed streams that are fed into SQLite 
through the virtual table API.


If we load the external compressed stream (containing 3M rows) into 
SQLite through the virtual table API:


create table newtable as select * from READCOMPRESSEDFILE('ctable.rc');

it takes: 55 sec


If we create an external program that inserts the rows of the compressed 
stream into SQLite one by one, using "insert into newtable values (...)" 
and the SQLite bind API:


it takes: 19 sec (~3x faster than using the virtual table API)


Another problem with the virtual table API is that it wreaks havoc with 
VM JIT engines. Especially for tracing JIT engines, the many "small" 
per-column callbacks do not permit them to specialize at all, compared to 
the specialization that a tracing JIT could achieve with one "big" 
get_an_entire_row callback.


A suggestion for improving the efficiency of the virtual table API 
naturally arises when we look at all the virtual table functions that we 
have already created. We have ~15 VT functions dealing with importing 
"all" from external sources (files, http streams, xml, clipboard, etc), 
and only one filtering VT function (a multidimensional index) "picking" 
columns to return.


So most of our queries that use VTs look like this:

create table cliptab as select * from clipboard();

, these queries most of the time select all columns from an external stream.

Based on the above, an addition that would improve the efficiency of the 
VT API is an "xRow" function that SQLite could call to get an entire 
row back (like the bind API).


Even better, and reducing the callback count even more, would be an 
"xNextRow" function that returns the contents of the next row, or EOF.


Regards,

estama.



Re: [sqlite] Why must WITHOUT ROWID tables have PRIMARY KEYs?

2014-01-23 Thread Eleytherios Stamatogiannakis
First of all, please excuse me for bringing forth again a conversation 
that is months old. I've just discovered the "without rowid" ability of 
SQLite and I have to say many, many thanks for it.


Let me describe a use case where a non-unique key and without rowid are 
most welcome. We have a distributed big data system here which uses 
SQLite for the partitions. To be able to efficiently execute join 
queries on split partitions, we need to assemble the partitions of one 
side of the query to create an index on them.


So in the query:

select y,z from A, B where A.x = B.x;

We assemble the pieces of B using this query:

create table B as select * from ALL("b1.db", "b2.db", "b3.db");

and then we create an index on them:

create index idx_b_x on B(x,y);

The above essentially creates 2 copies of table B's data (because all our 
indexes are always covering indexes).


So to avoid these 2 copies, what we would like to have are indexes 
*without backing tables*. These are essentially tables "without rowid" 
having plain keys (as opposed to primary keys).


We could work around the primary key limitation of "without rowid" 
tables by creating a dummy rowid column and putting it to the right of 
the table key that we care about:


CREATE TABLE B(
x, y, rowid, PRIMARY KEY(x, rowid)
) WITHOUT ROWID;

and then insert into it the data:

insert into B select * from ALLWITHDUMMYROWID("b1.db", "b2.db", "b3.db");

but having this capability inside SQLite would be a lot better.

Notwithstanding the above comments, thank you again for "without rowid". 
It will be very useful for our use case.


Regards,

l.

On 19/11/13 21:53, Richard Hipp wrote:

On Mon, Nov 18, 2013 at 5:45 PM, Nico Williams wrote:


Obviously a B-Tree-based table will need *some* key, but it won't need
that key to be UNIQUE.



Yeah it does.  Every storage system has a unique key, be it an address in
memory or a filename and offset on disk.  There must be some way to
distinguish between two records.  Two records with the same key are
indistinguishable from one another and are hence the same record.

The question is whether or not the unique identifier is exposed to the
application.

SQLite allows the unique identifier to be a PRIMARY KEY ... WITHOUT ROWID.
Or it allows it to be the rowid.  You get to choose.  But every table has
to have one or the other.

You can argue that it is theoretically possible to create a table where the
key is something other than PRIMARY KEY or rowid (perhaps it is the
filename+offset suggested above) that is simply not exposed to the
application layer.  Yes, that is theoretically possible.  But SQLite does
not do that and I do not see any reason to add the capability.





Re: [sqlite] Limit of attached databases

2013-06-04 Thread Eleytherios Stamatogiannakis

On 04/06/13 17:37, Simon Slavin wrote:


On 4 Jun 2013, at 3:09pm, Eleytherios Stamatogiannakis <est...@gmail.com> wrote:


Is there any way to go beyond the SQLITE_MAX_ATTACHED limit for *read only* 
attached DBs?


See section 11 of

<http://www.sqlite.org/limits.html>

It's a 64-bit value, and two bits are already taken up.


Yes, I have seen it in SQLite's code. I considered changing it to a 
bitfield, but the problem is that this long int is used in various other 
places in SQLite's internals for transaction metadata.




You can attach databases, copy data from them to the main database, then detach 
those and attach some others.  Or you can create a hierarchy of shards (each of 
62 shards can point to up to 62 others).  Or you can rewrite your code so it 
never uses more than 62 shards no matter how many nodes are available.


Attaching and detaching is only useful when materializing the shards 
into a single table:


create table T
attach T1
insert into T select * from T1.T
detach T1
attach T2
insert into T select * from T2.T
...

How would a hierarchy of shards work? You cannot attach a DB onto 
another attached DB for the hierarchy idea to work.


Also, only using 62 shards at a time can be very constrained/slow in our 
use case (see below).



None of them good solutions, I'm afraid.


Yes :-(.


Also, is there any way for SQLite to create an automatic index on a view (or 
Virtual Table), without having to first materialize the view (or VT)?


I believe that SQLite needs the data to be in one place (i.e. at least a 
virtual table) for the indexing routine to work.


We have tried with both views and VTs but SQLite does not create 
automatic indexes on them at all. So right now, to be able to have 
automatic indexes from SQLite's side we materialize all Virtual Tables 
into plain tables:


create temp table T as select * from UnionAllVT1;
create temp table G as select * from UnionAllVT2;
...

which doubles our I/O to process a single sharded table.

 - 1 full read + 1 full write of all data to materialize the UnionAllVT 
into a plain table.
 - 1 full read + 1 full write of the data in the materialized table to 
create the automatic index.


It would be very nice if the automatic index could be created directly 
from the UnionAllVT, but we haven't found a way to do it.



If you're willing to put a bit of SQLite-only effort in, you could implement 
your own virtual table implementation that consulted data on each of your 
nodes.  This would be quite highly customised for your own application's 
requirements but it would mean you didn't have to do any attaching or detaching 
at all.  Your SQLite API calls could address your data as if it was all in one 
database file but SQLite would understand how data is partitioned between nodes 
and automatically gather it from all the necessary nodes.


We have already done this (creating VTs is very easy in madIS [*]). We 
have the UnionALL virtual table that scans over all the DB shards.


The above UnionALL VT only supports scans, and it is only used to 
materialize the shards into a regular table. It would be very costly to 
have per-shard indexes, because each Filter on the UnionALL VT would 
need to be passed to all of the shards. A single automatic index works 
best.



Another way to do it would be to implement your own VFS which would distribute 
over the nodes not at the row level but as if they were all one huge storage 
medium (i.e. like a RAID).


Each shard that we use is already a self-contained SQLite DB. We would 
need to change our whole approach to convert to a disk-page-based 
sharding approach.


Thanks for your ideas.

l.

[*] 
https://code.google.com/p/madis/source/browse/src/functions/vtable/unionalldb.py




[sqlite] Limit of attached databases

2013-06-04 Thread Eleytherios Stamatogiannakis

Hi,

During our work on a distributed processing system (which uses SQLite 
shards), we have hit the SQLITE_MAX_ATTACHED limit of attached DBs.


The way we use SQLite for distributed processing [*] is the following:
 - Each table is sharded into multiple SQLite DBs on different nodes of 
the cluster.
 - To process a query, we run on each shard a query which produces 
multiple sharded SQLite result DBs.
 - We redistribute the result DBs in the cluster, and the next set of 
cluster nodes attaches all the input shard SQLite DBs and creates a 
temp view that unions all the input DB shards into a single view (see 
the sketch below).
 - It then executes a query on the views that produces new result DB shards,
 - and so on.
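
A sketch of what one such step looks like on a node (the shard file and table 
names here are hypothetical):

  attach database 'input_shard1.db' as s1;
  attach database 'input_shard2.db' as s2;
  create temp view input_t as
      select * from s1.t union all select * from s2.t;
  -- the node's processing query then reads from input_t and writes a new result shard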

We recently got access to a cluster of 64 nodes and it is very easy now 
to hit the SQLITE_MAX_ATTACHED limit (1 DB shard gets produced per node).


So the question that I have is:

Is there any way to go beyond the SQLITE_MAX_ATTACHED limit for *read 
only* attached DBs?


Also, is there any way for SQLite to create an automatic index on a view 
(or Virtual Table), without having to first materialize the view (or VT)?


Thanks in advance.

Lefteris Stamatogiannakis.

[*] The same processing ideas are used in hadapt:

http://hadapt.com/

which uses Postgres for the DB shards.


Re: [sqlite] Query optimizer suboptimal planning with virtual tables

2013-03-14 Thread Eleytherios Stamatogiannakis

On 14/03/13 17:05, Clemens Ladisch wrote:

Eleytherios Stamatogiannakis wrote:

explain query plan select * from t1, t2 where t1.c1=t2.c1;

0 |0 |1 | SCAN TABLE t2 (~100 rows)
0 |1 |0 | SCAN TABLE t1 VIRTUAL TABLE INDEX 0: (~0 rows)

Even though I have put VT t1 first in the join list, SQLite
will do a nested loop join (putting it on the right).


How have you defined index 0 of your virtual table?


The "INDEX 0:" is always there (even when no index is defined in the VT 
function). I don't know what it means.


l.


[sqlite] Query optimizer suboptimal planning with virtual tables

2013-03-14 Thread Eleytherios Stamatogiannakis

Hi,

I have come across a glitch in how SQLite's query optimizer plans 
virtual tables. An example follows:


I have a virtual table function named "range" that produces all numbers 
from 1 to range's arg. This virtual table does not have any index 
functionality.


With this I'll create the virtual table instance "t1":

> create virtual table t1 using range('100');
> select * from t1;
1
2
...
99
100
--Column names--
C1

Let's create a real table now:

> create table t2 as select * from t1;

The plan that the optimizer will produce when I join these two tables is 
this:


> explain query plan select * from t1, t2 where t1.c1=t2.c1;
0 |0 |1 | SCAN TABLE t2 (~100 rows)
0 |1 |0 | SCAN TABLE t1 VIRTUAL TABLE INDEX 0: (~0 rows)

Even though I have put VT t1 first in the join list, SQLite will do a 
nested loop join (putting it on the right).


Wouldn't it have made more sense for SQLite to create an automatic index 
on the real table t2 and do the join as such?


0 |0 |0 | SCAN TABLE t1 VIRTUAL TABLE INDEX 0: (~0 rows)
0 |1 |1 | SEARCH TABLE t2 USING AUTOMATIC COVERING INDEX idx (C1=?) (~10 
rows)


Putting the VT on the right by default doesn't make much sense to me, 
since SQLite cannot create an automatic index on it.


In general it seems to me to be a better default to always have the 
non-automatically-indexable SQLite entities (views, virtual tables) on 
the left of the joins and whatever can be automatically indexed on the 
right of the joins.
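
One workaround that is available today (a sketch; CROSS JOIN is documented to 
preserve the written table order in SQLite) is to pin the virtual table to the 
outer/left position explicitly:

select * from t1 cross join t2 where t1.c1 = t2.c1;

With t1 forced into the outer loop, the planner is at least free to consider an 
automatic index on the real table t2 for the inner loop.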


Also, I think it would be even better if SQLite had the ability to 
scan the virtual table and build a temporary automatic covering index on 
it to do the join (why isn't this case allowed?).


Thank you,

lefteris.


Re: [sqlite] Full covering index without table

2013-03-05 Thread Eleytherios Stamatogiannakis
I'm not so sure how fast SQLite4 will be with respect to full scans. 
IMHO row stores have an advantage when scanning over multiple columns.


Concerning dropping the index and recreating it on the other side: it's 
doable, but very bad from a performance point of view.


If you know how the other side will access the data, then the best 
option is to build the indexes over all the data chunks in parallel (on 
every cluster node), and then send the indexed chunks to the other side. 
Having doubled data (table + full covering index) in these chunks halves 
the effective I/O bandwidth of the whole cluster.


Also, I should point out that our main use case is OLAP processing and not OLTP.

lefteris.

On 05/03/13 10:51, Nico Williams wrote:

SQLite4 gets this right...  Of course, it's not been released.





Re: [sqlite] Full covering index without table

2013-03-04 Thread Eleytherios Stamatogiannakis

On 04/03/13 18:44, Simon Slavin wrote:


On 4 Mar 2013, at 4:13pm, Eleytherios Stamatogiannakis <est...@gmail.com> wrote:


Is there a way in SQLite to have a full covering index on a table without also 
storing the duplicate table?


Can we ask why you care about this ?  Do you have a huge table which is taking a huge amount of 
space, and you're trying to fit it on a Flash Drive ?  Can you distinguish between "I think it 
could be smaller." and "It's just a little too big and that means I can't use SQLite for 
this." ?



We are creating a distributed processing system in the spirit of Hadapt 
[1], but instead of using PostgreSQL we are using SQLite.


For the intermediate result tables (each one inside an SQLite DB) that 
we know how they will be accessed (and so we prepare their indexes), it 
is very wasteful to have to transfer the data twice (index + full table).


These kinds of systems live and die by their I/O.


The most compact way of carrying SQLite databases around is to use the shell 
tool to dump the database to a SQL text file, then use a compression utility 
(e.g. ZIP) to compress that text file.  But without knowing your situation I 
can't tell if that would help you.


For streaming processing we have our own serialization format that is 
compressed on the fly with LZ4. These streams are opened on the other 
side as SQLite Virtual Tables. For store and forward type of processing, 
we use SQLite DBs also compressed on the fly with LZ4. On the other side 
we simply "attach" these DBs.



A first shot toward a partial solution would be to declare all the columns on 
the table as primary keys:

create table t(a,b,c, primary key(a,b,c));


Sorry, but it doesn't help.  Even fields in the primary key are stored twice.


I'm saddened to hear that. I thought that at least we had a partial 
solution with declaring all the columns as a primary key...


Thank you for answering.

l.

[1] http://hadapt.com


[sqlite] Full covering index without table

2013-03-04 Thread Eleytherios Stamatogiannakis

Hi,

I have a question:

Is there a way in SQLite to have a full covering index on a table 
without also storing the duplicate table?


So if I have a table:

create table t(a,b,c);

and an index covering all its columns:

create index idx_t on t(a,b,c);

SQLite will store everything twice: first in the table and then in the 
index. The problem is that if all access to the table happens through 
the covering index, then the information in the table is redundant.


A first shot toward a partial solution would be to declare all the 
columns on the table as primary keys:


create table t(a,b,c, primary key(a,b,c));

The above assumes that, due to the nature of how SQLite always stores 
tables in a B-tree, this would in essence be like an index. But this 
wouldn't work if columns a, b, c are not unique.


Is there any way to declare a primary key without the uniqueness constraint?
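
A dummy trailing key column sidesteps the uniqueness constraint while keeping a 
single, index-organized copy of the data; this is the same trick described in the 
"without rowid" message earlier in this digest (a sketch, requiring SQLite 3.8.2 
or later for WITHOUT ROWID):

CREATE TABLE t(a, b, c, dummy, PRIMARY KEY(a, b, c, dummy)) WITHOUT ROWID;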

Kind regards,

lefteris.


Re: [sqlite] Unql

2012-12-11 Thread Eleytherios Stamatogiannakis

Warning: Only my opinion too :-) .

In madIS, we ended up faking some of the features that document-based 
databases have. Without these features, it would be extremely hard to 
build queries that describe advanced processing workflows.


The main features that we ended up faking are:

- JSON "packs" (or jpacks in madIS), which are in essence JSON lists 
that contain many values. The main difference between madIS's jpacks and 
regular JSON lists is that a jpack(base_type) = base_type . Examples:


 select jpack(1) as C;
 C
 ---
 1

 select jpack(1,2) as C;
 C
 ---
 [1,2]

- Nested tables, without which SQL is severely deficient in describing 
any advanced processing. Example:


 select jsplit("[1,2]");
 C1 | C2
---
 1   |  2

- Virtual tables having queries as parameters.

select * from (XMLPARSE select '<a><b>row1val1</b><b>row1val1b</b></a>');
{"a/b":"row1val1"}
{"a/b":"row1val1b"}

From my standpoint, UnQL is a very nice first try towards solving the 
above DB deficiencies. It would be a shame if this effort were abandoned.


Concerning the other query languages: Cassandra's SQL-inspired queries 
are very limiting. There is a company that merged Cassandra with 
Hadoop/Hive, providing Cassandra with more advanced querying abilities, 
but the last time I looked at it, it wasn't working so well.


Riak and a number of other key-value stores that do not have rich query 
interfaces are useful only for OLTP workloads and do not cover OLAP 
workloads at all.


I don't like MongoDB's approach very much. It seems to me that 
extending/expanding SQL to cover MongoDB's use cases would be a better 
solution than starting from JSON and working towards SQL. While I like 
JSON very much, whenever I needed to do a MongoDB query I had to 
translate from SQL to MongoDB through the use of a translation table. 
For better or worse, SQL is the de facto lingua franca of query languages.


IMHO, UnQL has a lot of promise in expanding SQL into domains that right 
now are served poorly by existing querying solutions.


l.

On 11/12/12 05:28, Roger Binns wrote:


On 10/12/12 19:01, dcharno wrote:

Whatever happened to Unql and is there any chance it will be revived?
It seemed like it would have been incredibly useful.


Warning: My opinion only

Unql seemed to be especially pointless.  Being somewhat compatible with
SQL syntax made no sense - SQL syntax and all the tools using it assume
the relational model and simple table like data layouts.  NoSQL databases
vary, but generally they are non-relational and schemaless, and JSON
interoperable ones have a structure completely different than SQL tables
(eg item nesting).

Every NoSQL database already has its own query mechanism so Unql would
always be a second class citizen if any of them ever adopted Unql.

NoSQL databases also tend to have looser consistency, have multiple
servers available and various other semantics very different than SQL
servers.  The native NoSQL interfaces expose and work with those semantics.

Cassandra did actually end up with something SQL inspired:

   http://cassandra.apache.org/doc/cql/CQL.html

Riak uses something like Google advanced searches (field:value):

   http://docs.basho.com/riak/latest/cookbooks/Riak-Search---Querying/

MongoDB uses JSON with extra magic operators:

   http://docs.mongodb.org/manual/applications/read/

Personally I like MongoDB approach where queries strongly resemble the
underlying stored data which means very little translation between the two.

Roger





[sqlite] Prefix joins

2012-11-28 Thread Eleytherios Stamatogiannakis

Hello,

Is there a version of "LIKE", in SQLite, which makes it clear that we 
only care to have a pattern on the postfix of a column?


An example follows:

select * from a,b where POSTFIX_LIKE(a.c1, b.c1 ,'%');

In above example, POSTFIX_LIKE works in the same way as if we had written:

select * from a,b where a.c1 LIKE b.c1||'%';

but with the additional guarantee for the optimizer that all pattern 
matching will happen on the postfix and not on the prefix, so the 
optimizer will be able to use an index to do the join.
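
A common workaround (a sketch, assuming a.c1 and b.c1 are TEXT with the default 
BINARY collation) is to express the prefix match as a range predicate, which an 
index on a(c1) can serve; char(1114111) is the largest Unicode code point, so the 
upper bound approximates "b.c1 followed by anything":

create index idx_a_c1 on a(c1);

select * from a, b
 where a.c1 >= b.c1
   and a.c1 < b.c1 || char(1114111);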


Thanks in advance,

Lefteris.


[sqlite] Union all patch, preliminary tests

2012-10-30 Thread Eleytherios Stamatogiannakis
In my preliminary tests, the patch has worked wonderfully. In these 
first tests I have only tested plain SQLite functionality (without 
virtual tables or additional functions) on a 1.5GB DB. Some remarks:


- It is lightning fast. Queries that fit in the free space (some of 
them could not finish before due to space constraints) and previously 
took minutes to finish are now instantaneous with the patch, and they do 
not touch the drive. The results are the same.


- I was expecting that only simple scans over unioned queries would be 
affected by the patch. When I tried a filter on a unioned table 
composition and the query was again instantaneous, I was flabbergasted. 
The patch really pushes index accesses down into the unioned 
tables. I didn't expect the patch to go that far. I'm amazed.


- Group by on union-all'ed tables was also instantaneous, without filling 
the hard disk.


- Oddly, order by works the same both with the patch and without the 
patch (SQLite 3.7.14.1). I assume that order by had a similar 
optimization in previous versions of SQLite too?


- On all of the above tests, the free space on the hard disk wasn't 
affected by the running query.


More tests will be done in the context of madIS, but they will take more 
time. I want to test plain SQLite first, because this is what most 
people will use in practice.


Best regards,

lefteris.




Re: [sqlite] Union all writting on /var/tmp

2012-10-30 Thread Eleytherios Stamatogiannakis

Hello,

Mr. Hipp, please excuse my attitude in my first email (the one you 
replied to). It came after 3 days of intense pressure to find out why 
machines were coming to a crawl whenever a particular query with 
sufficiently large union-all'ed tables was run.


Due to the quality of SQLite, it literally was the last thing that I 
looked at. I had even prepared emails to other people asking them to 
look at their software and not buffer on the hard disk so much.


Concerning your effort to fix it: I also suspected the "path of 
least resistance" regarding "union all". This is why in my next 
emails I mainly asked for a switch to change the "/var/tmp" path. The 
other thing that I asked for was for the documentation to have a 
warning about union all's behaviour. I believed that warning could save 
a lot of time and effort for other people who tripped on the same thing 
as me.


Thank you very, very much for your fix. I'm glad that you put 
considerable effort into it. My PhD was in databases, so I can understand 
how much effort this fix required. I have already downloaded the patch 
and I'll test it ASAP.


Having also worked on open source software (madIS, Rigs of Rods), I 
know your second point first-hand. Nevertheless, sometimes I also slide 
into this kind of behaviour. I'll try to be on guard against it in the 
future.


Best regards,

lefteris.

On 30/10/12 03:08, Richard Hipp wrote:



On Fri, Oct 26, 2012 at 2:16 PM, Eleytherios Stamatogiannakis
<est...@gmail.com> wrote:

I have been observing the following freaky behaviour of SQLite. When
I run:

select count(*) from (select * from huge_table union all select *
from huge_table);

SQLite starts writing in /var/tmp/ a file like:

  /var/tmp/etilqs_gblRd6vUPcx91Hl; the root partition fills up
and an error is raised.

Why does SQLite 3.7.14.1 need to write at all when doing union all?
It seems to me that there is no reason for doing so.


The reason for using a temp table for UNION ALL in a subquery is because
that is the path of least resistance.  The same identical code can
handle UNION ALL, UNION, INTERSECT, EXCEPT and various other cases.
Some queries (such as your UNION ALL) can in fact do without the temp
table.  But those are special cases that have to be coded separately.
Adding, testing, and maintaining that extra code involves a lot of
work.  And the extra code risks introducing bugs that might appear even
for people who are not doing a UNION ALL in a subquery.  And in over 12
years of use, in over a million different applications, with over 2
billion deployments, nobody has ever before requested this optimization.

At http://www.sqlite.org/src/info/7af3acbbd4 there is a patch that adds
the optimization to avoid using a temp table for your UNION ALL
queries.  This one small patch represents about 12 hours of intense
work, so far.  Much more work will be required to get the patch
performing to our release standards.  All of this effort on your behalf
you are receiving for free.  In return, we ask two things:

(1) Please download and test the patch and report any problems,
including performance problems.

(2) Please learn to be less grumpy, demanding, and condescending when
requesting help with software towards which you have contributed
nothing.  You have received this latest patch, and indeed all of SQLite,
by grace.  Therefore, please extend the same grace toward others.


Best regards,

lefteris.




--
D. Richard Hipp
d...@sqlite.org




Re: [sqlite] Union all writting on /var/tmp documentation warning

2012-10-29 Thread Eleytherios Stamatogiannakis
My understanding (and what my experiments have shown) is that in both 
cases "COMPOUND SUBQUERIES 1 AND 2" will write a temp file to /var/tmp.


IMHO, the documentation should warn about this writing behaviour, 
because for the second case (union all) it isn't expected/predictable, 
since full buffering is not needed.


lefteris.

On 29/10/12 20:41, Clemens Ladisch wrote:

Eleytherios Stamatogiannakis wrote:

"union all" works exactly like plain "union". It always materializes its input.


sqlite> explain query plan select 1 union select 2;
sele  order  from  deta
----  -----  ----  ----
0     0      0     COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (UNION)
sqlite> explain query plan select 1 union all select 2;
sele  order  from  deta
----  -----  ----  ----
0     0      0     COMPOUND SUBQUERIES 1 AND 2 (UNION ALL)

It does not.  (In the full "explain" output, "OpenEphemeral" is missing.)
Neither with real tables.

What particular query behaves unexpectedly for you?


Regards,
Clemens





Re: [sqlite] Union all writting on /var/tmp documentation warning

2012-10-29 Thread Eleytherios Stamatogiannakis

Look at what that page says about "union all" (emphasis on *all*)

--SNIP--
"Note that the UNION ALL operator for compound queries does not use 
transient indices by itself (though of course the right and left 
subqueries of the UNION ALL might use transient indices depending on how 
they are composed.)"

--/SNIP--

At least to my eyes, the above says what I was expecting before realizing 
what actually happens: that "union all" tries not to materialize its 
results when possible.


The truth, concerning materialization, is that in SQLite "union 
all" works exactly like plain "union": it always materializes its input.


lefteris.

On 29/10/12 16:37, Clemens Ladisch wrote:

Eleytherios Stamatogiannakis wrote:

Can a warning about "union all"'s behaviour of buffering everything in /var/tmp 
be added in SQLite's documentation?


Like this?  http://www.sqlite.org/tempfiles.html


Regards,
Clemens





[sqlite] Union all writting on /var/tmp documentation warning

2012-10-29 Thread Eleytherios Stamatogiannakis
Can a warning about "union all"'s behaviour of buffering everything in 
/var/tmp be added to SQLite's documentation?


I think that such a warning could save a lot of time for other SQLite 
users that trip over the same thing as I did.


Thank you,

lefteris.


[sqlite] Union all writting on /var/tmp

2012-10-26 Thread Eleytherios Stamatogiannakis

I have been observing the following freaky behaviour of SQLite. When I run:

select count(*) from (select * from huge_table union all select * from 
huge_table);


SQLite starts writing in /var/tmp/ a file like:

 /var/tmp/etilqs_gblRd6vUPcx91Hl; the root partition fills up and an 
error is raised.


Why does SQLite 3.7.14.1 need to write at all when doing union all? It 
seems to me that there is no reason for doing so.
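
A partial mitigation in the meantime (a sketch; it does not remove the extra 
buffering itself, it only relocates it): point SQLite's temporary storage at a 
larger partition via the SQLITE_TMPDIR environment variable, or keep transient 
tables in memory:

PRAGMA temp_store = MEMORY;   -- keep transient tables/indices in RAM instead of /var/tmp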


Best regards,

lefteris.


Re: [sqlite] Please test the latest SQLite enhancements

2012-10-11 Thread Eleytherios Stamatogiannakis
Is the covering index scan optimization enabled in the mentioned 
amalgamation?


For large datasets it can speed up the queries a lot.

l.

On 09/10/12 20:19, Richard Hipp wrote:

SUMMARY:

If you have a complex application that uses SQLite, please test your
application with the SQLite amalgamation found at
http://www.sqlite.org/sqlite3-20121009.zip and report to me (via private
email) whether or not you encounter any problems.

DETAILS:

We have recently made a number of significant enhancements to the query
optimizer in SQLite.  In particular, the latest code does a much better job
of recognizing when ORDER BY clauses can be satisfied by careful selection
of indices and scan orders and without having to do any sorting.  This
optimization can result in significant performance improvements for queries
with large result sets.

The current implementation has already been well tested:

(1) All legacy tests pass
(2) 100% branch test coverage
(3) The Fossil server that hosts the SQLite source code is using the latest
SQLite
(4) This email is being composed on an instance of Firefox Nightly that is
running the latest SQLite code

However, with such extensive changes to the query optimizer there is a
heightened risk of missing obscure corner cases in which SQLite omits
sorting for an ORDER BY clause when the sort is actually needed, resulting
in output appearing in the wrong order.  You can help reduce this risk, and
help us to ensure that the forthcoming SQLite version 3.7.15 is
trouble-free, by compiling the SQLite amalgamation snapshot found at
http://www.sqlite.org/sqlite3-20121009.zip into your application and then
testing your application to verify that it is still working correctly.  If
you find any problems, please let me know.  If your application continues
to work normally (though perhaps a little faster) I would appreciate
hearing from you then too.

I am especially interested in hearing from people whose applications
contain large schemas, descending indices, and complex joins containing
ORDER BY clauses with multiple terms and a mixture of ASC and DESC.

Thank you for your help.



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unofficial poll

2012-09-24 Thread Eleytherios Stamatogiannakis
SQLite's dynamic typing made it a natural fit for use with Python 
UDFs in madIS:


https://code.google.com/p/madis/

Without this feature, the whole "functional relational" [*] workflow 
that madIS is built around would have been a lot more complicated.


l.

[*] Instead of Python functions calling SQL, have SQL call Python functions.

On 23/09/12 13:37, Baruch Burstein wrote:

I am curious about the usefulness of sqlite's "unique" type handling, and
so would like to know if anyone has ever actually found any practical use
for it/used it in some project? I am referring to the typeless handling,
e.g. storing strings in integer columns etc., not to the non-truncating
system e.g. storing any size number or any length string (which is
obviously very useful in many cases).
Has anyone ever actually taken advantage of this feature? In what case?



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] SQL:2003 -- Window Functions

2012-09-20 Thread Eleytherios Stamatogiannakis
While I haven't needed this functionality myself, I have been thinking 
about how it could be solved in a performant way.


IMHO, the solution would be to create a virtual table that takes, for 
example, as input a query that produces the following rows:


C1 | C2
-------
a1 | b1
a2 | b2
a3 | b3

and outputs the following:

C1 | C2 | C1prev | C2prev
--
a1 | b1 | null | null
a2 | b2 | a1   | b1
a3 | b3 | a2   | b2

So in essence the window gets transformed to a single row. In a similar 
way, bigger windows could also be transformed.
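
Lacking such a virtual table, the same (C1, C2) -> (C1, C2, C1prev, 
C2prev) reshaping can be sketched outside SQL with a small Python 
generator over a query's rows (just an illustration of the idea, not an 
existing madIS or SQLite feature):

import sqlite3

def with_previous(rows):
    # Append the previous row's values to each row; None for the first one.
    prev = None
    for row in rows:
        yield row + (prev if prev is not None else (None,) * len(row))
        prev = row

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t(c1, c2);
INSERT INTO t VALUES ('a1','b1'), ('a2','b2'), ('a3','b3');
""")

for out in with_previous(con.execute("SELECT c1, c2 FROM t ORDER BY rowid")):
    print(out)
# ('a1', 'b1', None, None)
# ('a2', 'b2', 'a1', 'b1')
# ('a3', 'b3', 'a2', 'b2')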


l.

On 20/09/12 14:46, Gabor Grothendieck wrote:

On Wed, Sep 19, 2012 at 12:51 PM, joe.fis...@tanguaylab.com wrote:

Too bad SQLite doesn't yet support SQL Window Functions.

Are there any SQLite Extension Libraries that support "SQL:2003 type Window
Functions"?
I specifically need LEAD and LAG to calculate an event integer timestamp
delta between consecutive rows.
I've played with some self-join code but that's proving to be complicated.



SQL Window Functions is the number one feature that I could use as
well.  In R, sqlite can be used for manipulating R data frames via the
sqldf package and this sort of functionality would be very useful.
(sqldf also handles PostgreSQL which does have windowing functions but
PostgreSQL requires more set up than sqlite so its not as accessible
to users.)
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Javascript API for SQLite?

2012-09-14 Thread Eleytherios Stamatogiannakis

There is a version of SQLite compiled in Javascript:

http://www.syntensity.com/static/sqlite_wip.html

But as others have said, it is not possible to do block-level access to 
files from inside a Web browser's sandboxed JavaScript VM.


Nevertheless, theoretically you could load the full DB file into memory 
[*], do your operations on it there, and then write the whole file back 
to disk, all on the client side.
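
Outside the browser, that whole-file round trip is easy to sketch; for 
instance with the serialize()/deserialize() calls that Python's sqlite3 
module gained in 3.11 (shown only to illustrate the pattern, not the 
JavaScript build above -- the file and table names are made up):

import sqlite3

with open("document.sqlite", "rb") as f:       # read the whole DB image
    image = f.read()

con = sqlite3.connect(":memory:")
con.deserialize(image)                         # work on it in memory

con.execute("UPDATE meta SET value = 'edited' WHERE key = 'status'")
con.commit()

with open("document.sqlite", "wb") as f:       # write the image back out
    f.write(con.serialize())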


lefteris.

[*] https://github.com/eligrey/FileSaver.js

On 14/09/12 16:05, Simon Slavin wrote:


On 13 Sep 2012, at 3:13pm, Jean-Denis Muys  wrote:


I work on an application that uses an SQLite database as it's binary document 
format (with a custom extension). Now we want to develop a Web App variation of 
that application, that should be able to read and write those 
documents/databases. Since the Web App is written in Javascript, I am now 
looking for a Javascript implementation of the SQLite library.


This can't be done entirely in JavaScript, since JavaScript running in a web 
browser has no way of getting at files on your hard disk.  This is to prevent 
the programmers of a web site spying on your computer's files.


I have used the C SQLite library in the past, so I know about using SQLite from 
C. However, I am just starting with Javascript and Web Apps and I am quite a 
newbie on that side (Other people on the team are experienced, but I have been 
asked to work on the SQLite integration).


There are ways you can allow JavaScript to access data inside a file on a web 
server.  The standard way is to write a shim in PHP or some similar language.  
The PHP code runs on the web server and uses PHP's SQLite3 library to access 
databases.  You ask it to execute a SQLite command, and it returns the results 
in JSON (or some other) format.  So for instance

https://myserver.com/databases/doSQLCommand.php?file=money.sqlite=SELECT
 * FROM transactions WHERE id=123

might return a page of application/json type containing

{id: 123, trandate: "20030205", amount: 274.53}

Of course in real life you're more likely to pass the parameters using POST 
than GET.

Your JavaScript code asks the shim for the data using an XMLHttpRequest and 
uses JSON.parse() to turn the results into an array or an object.

There are, of course, many security concerns with such a setup, so most shim 
programs check to see that they're being called only from their own server, by 
a program they recognise, running on a computer they recognise.  I sometimes 
use a setup like this, though my shim returns the requested results as only a 
small part of the stuff it returns, the rest being things like error messages 
and last-insert-id and stuff like that.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Covering index scan optimization

2012-09-12 Thread Eleytherios Stamatogiannakis

Hello,

I just wanted to ask about using covering indexes for scans. A very 
rudimentary test:


create table t (c1,c2, c3, c4);
create index idxtc1 on t(c1);

The simple "select" scans the full table:

  explain query plan select c1 from t;
SCAN TABLE t (~100 rows)

A select with a dummy "order by" uses the covering index:

  explain query plan select c1 from t order by c1;
SCAN TABLE t USING COVERING INDEX idxtc1 (~100 rows)

It seems to me that using a covering index scan would always be faster 
in both cases (fewer disk page reads). Am I wrong? Is there a reason for 
SQLite not to use a covering index for scans?
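
A rough way to see which plan each form gets, and what it costs, on 
one's own data (a sketch only; figures and plans will vary with the 
SQLite version, row width and cache state, and an in-memory database is 
used here just to keep it self-contained):

import sqlite3, time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(c1, c2, c3, c4)")
con.executemany("INSERT INTO t VALUES (?,?,?,?)",
                ((i, 'x' * 50, 'y' * 50, 'z' * 50) for i in range(200000)))
con.execute("CREATE INDEX idxtc1 ON t(c1)")

def report(sql):
    # Last column of EXPLAIN QUERY PLAN rows is the human-readable plan detail.
    plan = " / ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))
    t0 = time.perf_counter()
    n = sum(1 for _ in con.execute(sql))
    print("%-32s rows=%d  %.3fs  [%s]" % (sql, n, time.perf_counter() - t0, plan))

report("SELECT c1 FROM t")
report("SELECT c1 FROM t ORDER BY c1")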


Thank you in advance,

lefteris
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] madIS v1.4 release

2011-12-13 Thread Eleytherios Stamatogiannakis

madIS is an extensible relational database system built upon the SQLite
database, with extensions written in Python (via the APSW SQLite
wrapper). It is developed at:

http://madis.googlecode.com

Due to madIS’ SQLite core, the database format of madIS is exactly the
same as SQLite’s. This means that all SQLite database files are 
directly usable with madIS.


In usage, madIS feels like a lightweight personal Hive+Hadoop 
programming environment, without the distributed processing capabilities 
of Hadoop. Nevertheless, due to its low overhead when running on a 
single computer (compared to Hadoop), madIS can easily handle tens of 
millions of rows on a single desktop/laptop computer.


In version 1.4 of madIS:

 - XMLPARSE can now work without any provided prototype, producing JSON 
dicts which contain all parsed XML data. XMLPARSE now accepts Jdicts and 
Jlists, in addition to XML snippets, as prototypes.
 - FILE now works with gzip-compressed files and with HTTP and FTP 
streams, directly in a streaming fashion.
 - New functions which work with Jdicts were added (jdictkeys, 
jdictvals, jdictsplit, jdictgroupunion).

 - APACHELOGSPLIT parses and splits Apache log lines.
 - Optimizations in Virtual Tables (up to 3 times faster). XMLPARSE is 
up to 2x faster (using fast:1 switch).


About XMLPARSE:

XMLPARSE can now work without any prototype. In this mode it produces 
JSON dict entries containing all the path:data pairs below the provided root tag:


-- Example
mterm> select * from (XMLPARSE root:entry FILE 
'http://code.google.com/feeds/p/madis/hgchanges/basic') limit 1;


{"entry/updated":"2011-12-09T17:18:43Z","entry/id":"http://code.google.com/feeds/p/madis/hgchanges/basic/92bc61a496b3a34c21c5aed6d9c6cde5ac63121e","entry/link/@/href":"http://code.google.com/p/madis/source/detail?r=92bc61a496b3a34c21c5aed6d9c6cde5ac63121e","entry/link/@/type":"text/html","entry/link/@/rel":"alternate","entry/title":"Revision 
92bc61a496: Fixed help 
formatting","entry/author/name":"est...@servum","entry/content/@/type":"html","entry/content":"Changed 
Paths:\n Modify/src/functions/vtable/xmlparse.py\n \n 
\n Fixed help formatting"}

--

If one wishes to find all the paths that appear in the above feed, one 
could use the jgroupunion aggregate function, which produces the union 
of all the XML paths:


-- Example
mterm> select JGROUPUNION(c1) from (XMLPARSE root:entry FILE 
'http://code.google.com/feeds/p/madis/hgchanges/basic');


["entry/updated","entry/id","entry/link/@/href","entry/link/@/type","entry/link/@/rel","entry/title","entry/author/name","entry/content/@/type","entry/content"]
--
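
For the curious, an aggregate along these lines is easy to register on 
plain SQLite as well. A rough Python sketch (this is not the madIS 
implementation, just the idea of a "union of JSON keys" aggregate):

import json
import sqlite3

class JGroupUnion:
    # Collect the union of the keys of the JSON objects fed to the
    # aggregate, keeping first-seen order.
    def __init__(self):
        self.keys = []

    def step(self, value):
        for key in json.loads(value):
            if key not in self.keys:
                self.keys.append(key)

    def finalize(self):
        return json.dumps(self.keys)

con = sqlite3.connect(":memory:")
con.create_aggregate("jgroupunion", 1, JGroupUnion)

con.execute("CREATE TABLE feed(c1)")
con.executemany("INSERT INTO feed VALUES (?)", [
    ('{"entry/updated": 1, "entry/id": 2}',),
    ('{"entry/id": 2, "entry/title": 3}',),
])
print(con.execute("SELECT jgroupunion(c1) FROM feed").fetchone()[0])
# ["entry/updated", "entry/id", "entry/title"]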

If one wishes to find the common set of XML paths that appear in all of 
the ATOM feed's entries, then one could do:


--Example
mterm> select JGROUPINTERSECTION(c1) from (XMLPARSE root:entry FILE 
'http://code.google.com/feeds/p/madis/hgchanges/basic');


["entry/updated","entry/id","entry/link/@/href","entry/link/@/type","entry/link/@/rel","entry/title","entry/author/name","entry/content/@/type","entry/content"]
-- Note: Intersection in this example is the same as union --

Finally, to output the contents of the ATOM feed in tabular form, one 
simply has to provide the list of paths as a parameter to XMLPARSE:


--Example
mterm> select * from (
XMLPARSE root:entry
'["entry/updated","entry/id","entry/link/@/href","entry/link/@/type","entry/link/@/rel","entry/title","entry/author/name","entry/content/@/type","entry/content"]' 


  FILE 'http://code.google.com/feeds/p/madis/hgchanges/basic')
  limit 1;

[1|2011-12-09T17:18:43Z
[2|http://code.google.com/feeds/p/madis/hgchanges/basic/92bc61a496b3a34c21c5aed6d9c6cde5ac63121e
[3|http://code.google.com/p/madis/source/detail?r=92bc61a496b3a34c21c5aed6d9c6cde5ac63121e
[4|text/html
[5|alternate
[6|Revision 92bc61a496: Fixed help formatting
[7|est...@servum
[8|html
[9|Changed Paths:
 Modify/src/functions/vtable/xmlparse.py

 
 Fixed help formatting
--- [0|Column names ---
[1|updated [2|id [3|link_href [4|link_type [5|link_rel [6|title 
[7|author_name [8|content_type [9|content



-- Lefteris
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] New madIS v1.3 release

2011-07-26 Thread Eleytherios Stamatogiannakis
I've mainly used JSON because it is a well-defined and widely used 
standard. JSON also supports associative arrays (which currently are not 
used in madIS).

From what little I've read about Tcl lists, I believe that JSON lists 
are easier on the eye. Compare this:

["this is the first", "second", "and third sentence"]

to this:

"this is the first" second "and third sentence"

In the top example the commas help the eye to distinguish between the 
values. Nevertheless you could devise an alternative example with a lot 
of commas inside the strings, which would make JSON lists more difficult 
to read.

So in the end I think which of the two is preferable is a matter of 
taste, and of the kind of data that one has to deal with.

l.

On 26/07/11 10:08, Alexey Pechnikov wrote:
> Why JPack? May be Tcl lists will be more useful? The tcl dictionary
(also known as associative array) can be stored as list too.


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] New madIS v1.3 release

2011-07-25 Thread Eleytherios Stamatogiannakis
madIS is an extensible relational database system built upon the SQLite
database, with extensions written in Python (via the APSW SQLite
wrapper). It is developed at:

http://madis.googlecode.com

Due to madIS’ SQLite core, the database format of madIS is exactly the
same as SQLite’s. This means that all SQLite database files are 
directly usable with madIS.

In usage, madIS feels like a lightweight personal Hive+Hadoop 
programming environment, without the distributed processing capabilities 
of Hadoop. Nevertheless, due to its low overhead when running on a 
single computer (compared to Hadoop), madIS can easily handle tens of 
millions of rows on a single desktop/laptop computer.

In version 1.3 of madIS:

  - A great deal of testing has been done on Linux, Windows and Mac OSX.
  - madIS now only depends on APSW on all these systems, so it is easier 
to install.
  - XMLPARSE was added. XMLPARSE processes its input in a streaming 
fashion, and has been tested with very large (~20+ GB) XML source files 
without problems.
  - JPACK functions were added. Jpacks are now the preferred way to 
store a set of values into a single tuple. For easy viewing and 
exporting of jpacks, their format is based on the JSON format.
  - Heavy testing under Windows and Mac OSX. CLIPBOARD and CLIPOUT 
virtual tables work under all OSes.
  - CLIPOUT and CLIPBOARD have been tested with Excel, Libre/Open 
Office Calc, and iWork Numbers.
  - Functions that return tables can now easily be coded by using 
Python's generators (yield).
  - A lot of completions (via TAB) have been added to mterm. Mterm's 
completion engine can automatically complete tables, column names, 
table index names and database filenames in ATTACH DATABASE statements.

In detail:

MTERM changes:

While using madIS's terminal (mterm), mterm completes (via TAB) column 
names, table names, etc. of the opened and attached databases.

Also, by default mterm colours column separators, and if more than 4 
columns are returned, mterm "tags" the columns with numbers:

mterm> select * from deficit;
[1|1 | People's Republic of China [3|272.500 |2010
[1|2 | Japan [3|166.500 |2010
[1|3 | Germany [3|162.300 |2010
--- Column names ---
[1|Rank [2|Country [3|CAB [4|Year

  Note: In mterm the column number tags are coloured red in the above example.


XMLPARSE:

If I wished to retrieve the date and author from the madIS project's Source 
Changes ATOM feed:

mterm> select * from
(XMLPARSE 
'tt'
select * from file('http://code.google.com/feeds/p/madis/hgchanges/basic')
) limit 3;
2011-07-25T14:07:07Z|est...@servum
2011-07-25T14:04:09Z|est...@servum
2011-07-22T14:08:11Z|est...@servum
--- Column names ---
[1|updated [2|author_name
Query executed in 0 min. 0 sec 543 msec
mterm>

In the above query, XMLPARSE is used in an "inverted form", which is 
easier to write when chaining virtual tables.


JPACKS:

Frequently, the need to store multiple values into a tuple arises while 
processing data. Previously in madIS a lot of formats were used to store 
all these multiple values (space separated, comma separated, tab 
separated). Now JPACKs are the recommended way to store multiple values.

JPACKs are based on the JSON format, with the exception that a JPACK of 
a single value is the value itself. Some examples are presented below:

mterm> select jpack('a');
a

mterm> select jpack('a', 'b');
["a","b"]

mterm> select jsplit(jpack('a','b','c'));
a|b|c

mterm> select jsplitv(jpack('a','b','c'));
a
b
c

mterm> select jgroup(c1) from (select 'a' as c1 union select 'b');
["a","b"]


CLIPBOARD and CLIPOUT:

If a selection of data has been made in a web browser or a spreadsheet 
program, then by executing the following in mterm we can access the 
clipboard data as a table:

mterm> select * from CLIPBOARD();
Rank↓ |Country↓ |CAB (billion US dollars)↓ |Year↓
1 | People's Republic of China |272.500 |2010
2 | Japan |166.500 |2010
3 | Germany |162.300 |2010
--- Column names ---
[1|C1 [2|C2 [3|C3 [4|C4
Query executed in 0 min. 0 sec 204 msec

* For the above data I selected the top 3 rows while browsing 
http://en.wikipedia.org/wiki/List_of_sovereign_states_by_current_account_balance .

Similarly, if I wished to process the data and send it to a spreadsheet 
program, then I could execute the following in mterm:

mterm> CLIPOUT select * from deficit;
1
--- Column names ---
[1|return_value
Query executed in 0 min. 0 sec 111 msec

and paste in a spreadsheet program.

Both CLIPBOARD and CLIPOUT are virtual tables. CLIPOUT is used in an 
"inverted form", which is easier to write when chaining queries.

-- Lefteris
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users