Re: [sqlite] Virtual table API performance

2014-03-04 Thread Elefterios Stamatogiannakis
Also, I want to note the pretty clever "hack" from Hick Gunter that 
works by passing a "proxy" for the row (instead of the whole row's 
values) to SQLite. This proxy gets expanded using the xColumn API by a 
virtual table outside the whole query.


In this way a query using multiple VTs needs to incur the xColumn 
performance penalty only once.


Nevertheless, having people resort to crazy/clever hacks like that to avoid the 
inefficiencies of an API is a hint that something might be problematic.


In a previous email of mine, I had said, half seriously/half jokingly, something about 
the cost in watts of SQLite's worldwide usage. For something that is so 
widely used, even a small percentage of efficiency improvement really makes a 
difference. It is not an "SQLite destroys/saves civilization" kind of 
difference, but IMHO it would be measurable in gigawatts or megawatts.


l.



Re: [sqlite] Virtual table API performance

2014-03-04 Thread Elefterios Stamatogiannakis

On 4/3/2014 11:33 μμ, Alek Paunov wrote:

On 04.03.2014 20:25, Eleytherios Stamatogiannakis wrote:

On 04/03/14 20:11, Alek Paunov wrote:

On 04.03.2014 15:05, Eleytherios Stamatogiannakis wrote:

On 03/03/14 03:01, Alek Paunov wrote:

It seems that the easiest optimization for this (very common) VT use
case (bulk streaming) would be an SQLite add-on written in _C_, implementing
a vtable interface specialization containing an xNextPage callback "buffering", let's
say, 4K rows or, even better, 16KB of data (in addition to your initial
proposal of xNextRow).

The technical question is: how should the rows be encoded? You said
initially that you use some compressed format, but for such an extension
to gain more traction in the future it would probably be better to
choose a more standard format.

a) Rows represented in native SQLite3 format [3]
b) ... native SQLite4 format
c) Some widely used encoding close to SQLite's types [4]
d) ...

[3] http://www.sqlite.org/fileformat.html#record_format
[4] https://github.com/msgpack/msgpack/blob/master/spec.md#formats

IMHO, it would be very nice if some common conventions for binary record
streaming could be discussed and adopted across the SQLite binding and
add-on developers. The possible applications are not limited to
vtables ;-).


SQLite doesn't need any special format for the records to be passed
over
to it. It already has the "bind" API which would be very suitable for
the xNextRow function too.



It seems that I do not know the sqlite3_ API very well.

http://www.sqlite.org/c3ref/funclist.html

Would you like to point me to the "bind" API page?


In the link that you posted above, look for all the sqlite3_bind_x
functions. In SQLite the bind API is used to pass parameters to prepared
statements. The way the bind API works is that you have a statement
parameter "row", and you fill it by saying:

Set column 1 of statement parameter "row" to an int with value 10
Set column 2 of statement parameter "row" to a float with value 3.5
...

So instead of SQLite calling back for each column, in the bind API
"way" the program calls SQLite to fill in a row's values.



Ah, OK. Let's see if I finally understood your idea:

- exec "insert into t select * from vt"
- VDBE calls xNextRow
- in the xNextRow implementation, the Python code calls something like
bind_xxx for each column with the scalar addresses (allocated by you)
- xNextRow returns, VDBE inserts the row, you clean up on the next step

Questions:

What stops you from making this wrapper right now (e.g. as an APSW patch or
a standalone SQLite add-on loaded by the PyPy FFI)?


Yes, these things can be done. We have written our own APSW-compatible 
FFI for PyPy, in which we have tried techniques like the one you 
describe. The end result wasn't that much faster, because no matter how 
many wrappers there are above the VT API, its efficiency characteristics 
are still there.


How do you expect this model (managing one scalar allocation per cell
during the query) to perform in comparison with passing
encoded row pages (memory chunks) between SQLite and the script engine,
especially when it is not PyPy or LuaJIT?



Essentially, in my first post about the VT API's efficiency the comparison 
was between the bind API (similar to your "encoded pages" idea but for a 
single row) and the current xColumn-based VT API.


In our tests with Python, the difference between the two APIs was 3x. 
I've seen another post here by somebody else who was using a natively 
compiled VT implementation saying that the difference was 2x. Both of 
these observations suggest that there is some inefficiency in the current 
VT API.


If this API is also used within SQLite's VM, maybe adding an xNextRow 
opcode would speed up SQLite's VM too?


Best,

l.


Regards,
Alek





Re: [sqlite] Virtual table API performance

2014-03-02 Thread Elefterios Stamatogiannakis
We have both input and output virtual tables that avoid hitting the hard 
disk and are also able to compress the incoming and outgoing data.


We have a virtual table that takes as input a query and sends the data 
to a port on another machine. This virtual table is called "OUTPUT". And 
another virtual table takes as input data from another port and 
forwards it into SQLite. Let's call it "INPUT". A query that uses these 
two virtual tables would look like this in madIS:


OUTPUT ip:192.168.0.1 port:8080 select * from INPUT('port:8081');

We actually use queries like the above (actually we don't send directly to 
ports but to buffered named pipes that are then forwarded via netcat) to 
run distributed queries on clusters, connecting all the local 
SQLite/madIS instances on the different machines together.


The main point that I want to make with the above explanation is that we 
don't view SQLite only as a traditional database. We also view it as a 
data stream processing machine that doesn't require 
the data to be stored on a hard disk.


Under this view, the efficiency of the virtual table API is very 
important. The above query uses only 2 VTs, but we have other queries 
that use a lot more VTs than that.


estama


On 2/3/2014 9:34 μμ, Max Vlasov wrote:

On Sun, Mar 2, 2014 at 5:21 PM, Elefterios Stamatogiannakis
<est...@gmail.com> wrote:


Our main test case is TPCH, a standard DB benchmark. The "lineitem" table of
TPCH contains 16 columns, which for 10M rows would require 160M xColumn
callbacks, to pass it through the virtual table API. These callbacks are
very expensive, especially when at the other end sits a VM (CPython or PyPy)
handling them.



Ok, I am not stating that the performance improvement is impossible, but I will
explain why I'm a little sceptical about it.

For every bulk insert we have a theoretical maximum we'd all be glad to
see SQLite perform at: the speed of simple file copying.
SQLite can't be faster than that, but to be on par is a good goal.
This is not possible when an insert also means modification of other
parts of the file, for example when there's an index involved. But
let's forget about that. Finally, when new data is added, SQLite has to
write a number of database pages; the cost of this part is absolutely
in the hands of the media (driver) and OS (driver).  But for every
database page write there's also a price to pay in CPU units, for the many
actions SQLite has to perform before the value the developer provided is
translated into what actually appears on disk.

An illustration of the CPU price is the following example:
  CREATE TABLE t(Value)

On my SSD drive, multiple inserts (thousands) of the form
   insert into t (Value) values ('123456689  // this string
contains many symbols, for example 1024)
performed at a speed of
   30 MB/sec

but the query
   insert into t (Value) values (10)  // this is a small integer value
performed at only
   3 MB/sec

Both show an almost full CPU load. Why such a difference? Because with the
latter query the system could do more than 30 MB of writes in 1 second,
but it has to wait for SQLite spending 10 seconds in preparations.
The former is better because the CPU cost of passing a large text value to
SQLite is comparatively low compared to the time spent in I/O
writing it to disk.

So the CPU price to pay isn't avoidable, and notice that in this example it is
not the virtual table API, it is the bind API. I suppose that the price we
pay for CPU spent in the virtual table API is on par with the average price
paid in SQLite as a whole. This means that if I transform the above
queries into inserts from virtual tables, the final speed difference
will be similar. And this also means that for your comparison tests
(when you get the 3x difference), the CPU price SQLite pays inside the bind
API and in its code wrapping the xColumn call is probably similar. The
rest is the share your code pays.

Well, I know that there are differences in CPU architectures and
probably there are platforms where compiled code for the bind API and the
virtual table API behaves a little differently, making the costs more
different. But imagine the hard task of fine tuning and refactoring
just to get a noticeable difference for a particular platform.


Max


Re: [sqlite] About "speed"

2014-03-02 Thread Elefterios Stamatogiannakis
IMHO, a benchmark like this is useless without more information. 
Some questions that I would like to see answered:


 - Which SQLite and Postgres versions were used?
 - Are the SQLite indexes covering ones?
 - Have any performance pragmas been used?

Also, interval joins ("between") are hard for SQLite's default indexes, 
but converting them to use a multidimensional index (R-Trees) speeds 
them up to speeds similar to Postgres.


estama

On 2/3/2014 3:02 μμ, big stone wrote:

Hello,

This morning I saw Pandas/Wes McKinney communicating figures:
  - starting at 37'37" of http://vimeo.com/79562736,
  - revealing a slide where SQLite "is" 15 times slower than Postgresql.

==> the dataset is public:
http://www.fec.gov/disclosurep/PDownload.do?candId=P0001=2012=All%20Candidates=pNational
==> the SQL is basic.

Wouldn't it be nice to update the "speed.html" page to have an objective
vision?

Rationale:
- better show progress (it's hidden in
http://www.sqlite.org/checklists/3080300/index),
- better show non-time metrics: memory, electricity, I/O...
- better show the effect of options: ":memory:", "compile -o2", ...
- better show SQLite's position in the SQL landscape.


Re: [sqlite] Virtual table API performance

2014-03-02 Thread Elefterios Stamatogiannakis
In our performance tests we try to work with data and queries that are 
representative of what we would find in a typical DB.


This means a lot of "small" values (ints, floats, small strings), and 
5-20 columns.


Our main test case is TPCH, a standard DB benchmark. The "lineitem" 
table of TPCH contains 16 columns, which for 10M rows would require 160M 
xColumn callbacks to pass it through the virtual table API. These 
callbacks are very expensive, especially when at the other end sits a VM 
(CPython or PyPy) handling them.


For PyPy in particular, which is able to produce JIT-compiled and 
optimized UDF code adapted to the "shape" (type and value distribution) 
of the data flows as they pass through it, every time it faces the 
virtual table API it slows down to a crawl, becoming more than 2x slower 
than interpreted Python. This happens because it cannot see the data/row 
"shape" from the many small, mutually unrelated, single-value 
xColumn callbacks.
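
For reference, a minimal sketch in C of an xColumn callback (the cursor
layout and the names are illustrative, not an actual madIS or SQLite
structure), showing why the cost is paid once per cell:

#include "sqlite3.h"

/* Hypothetical cursor for an imaginary virtual table; the fields are
** illustrative only. */
typedef struct vt_cursor {
  sqlite3_vtab_cursor base;      /* base class, must come first */
  sqlite3_int64 row_values[16];  /* the 16 columns of the current row */
} vt_cursor;

/* xColumn: SQLite calls this once per (row, column) pair.  Scanning a
** 16-column table over 10M rows therefore means 160M such calls, each one
** crossing into the VT implementation for a single scalar value. */
static int vt_column(sqlite3_vtab_cursor *cur, sqlite3_context *ctx, int i){
  vt_cursor *c = (vt_cursor*)cur;
  sqlite3_result_int64(ctx, c->row_values[i]);
  return SQLITE_OK;
}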


Changing the subject, I've seen some requests in previous emails from 
people asking for windowing functions to be added to SQLite. I want to 
propose an alternative that we have been using for years and that is a lot 
more generic than adding specific functions for very "narrow" use cases 
to SQLite.


We have added the "EXPAND" VT function in madIS, which "emulates" nested 
tables in SQLite, enabling row and aggregate functions that 
return (in a streaming fashion) multiple values in multiple columns. The 
"EXPAND" function takes as input a table containing as values (in our 
case Python) generators, and then it calls the generators, "expanding" 
the input table to its final form. "EXPAND" is automatically inserted 
wherever it is required, so it isn't visible. An example follows:


> select strsplit('one and other');
one|and|other <-- 3 columns

or

> select strsplitV('one and other');
one
and<-- 3 individual rows
other

So by adding a single VT function and some syntactic sugar (auto-inserting 
the EXPAND VT), we were able to have functionality that is not 
case-specific, allowing us to run all kinds of analytics inside SQLite.


The performance of the above functionality is already very good. But it 
could be a lot better with a more efficient VT API.


Regards,

estama

On 2/3/2014 9:15 πμ, Max Vlasov wrote:

Hi,
thanks for explaining your syntax in another post. Now about virtual
tables if you don't mind.

On Fri, Feb 28, 2014 at 8:24 PM, Eleytherios Stamatogiannakis
 wrote:


If we load into SQLite, 

create table newtable as select * from READCOMPRESSEDFILE('ctable.rc');

it takes: 55 sec


If we create an external program 

it takes: 19 sec (~3x faster than using the virtual table API)




Looking at your numbers, as a user (and fan :) of virtual tables I
decided to do some tests.

I have a virtual table "AllValues"; it was designed to enumerate
all tables' values into one single virtual table, so finally it is
a long list of

   TableName, TableRowId, FieldName, Value

so you get the idea. As an example of what it may do, you may open
places.sqlite of the Mozilla browser and do

   Select * from AllValues where Value Like "%sqlite.org%"

and see actual results even without knowing how they planned their schema.

Internally this virtual table simply uses general selects for all
other tables found in sqlite_master. This is a good (but probably not
the best) test for measuring virtual table performance, because

   SELECT * FROM AllValues

is equivalent to reading all conventional tables of this database.
Besides:
- the tool I use has a tweaker implemented with a VFS that allows
measuring the speed and other characteristics of the query while
the query is in effect.
- I have an option that forces resetting the Windows cache for the
database file when it is reopened. With it we exclude the Windows
cache from consideration, so pure I/O reading is used. Btw, when you do
your comparison, it's very important to reset the system cache before
every measurement that involves I/O.


So I took a comparatively large (500 MB) database consisting of
several small tables and one big table (Posts) and compared two queries.

(Query1)

   Select sum(length(Body) + length(Title)) from Posts

This one effectively reads the table data and uses
- length() to force SQLite to read texts that don't fit into a single db page
- sum() to exclude accumulating results on my side from the comparison, so
we have a single-row, single-column result from work completely
done by SQLite.

(Query2)

   Select Sum(Length(Value)) from AllValues

This one performs basically the same work but using the SQLite virtual
table API. It also touches the other tables, but since they're small, we
can forget about them.

Query1 (General):
   Read: 540 MB,
   Time: 24.2 sec,
   CPU Time: 6 sec (25%)
   Speed: 22.31 MB/sec

Query2 (Virtual):
   Read: 540 MB,
   Time: 27.3 sec,
   CPU Time: 13 sec (51%)
   Speed: 20 MB/sec

In my particular test the noticeable 

Re: [sqlite] Virtual Table "Functions"

2014-03-01 Thread Elefterios Stamatogiannakis

Please excuse me for not explaining.

The syntax that puzzles you is supported by madIS [*], which 
translates it into SQLite. Having been using madIS for many years (we 
created it in 2008), it comes naturally to me, and I forget that SQLite 
doesn't support it.


What madIS essentially does is create and destroy the virtual tables 
in a madIS/SQLite query. It also supports an "inverted" syntax that 
permits virtual table pipelines (like pipes in Unix):


select * from (XMLPARSE select * from FILE("data.xml"));

The above query is the same as writing:

select * from XMLPARSE(' select * from FILE("data.xml") ')

but without the thorny quote escaping problems.

The "select * from " part is optional (it is autocompleted) so above 
query could be rewritten to:


XMLPARSE FILE "data.xml";

Both XMLPARSE and FILE are regular SQLite virtual table functions coded 
in Python (the whole of madIS is in Python). Also, due to the design quality 
of SQLite's virtual table API, both functions are streaming (XML parsing 
works on continuous, multi-GB-long streams of XML input coming from the 
internal query that uses FILE).


You may think that it is crazy to do this pipelining through a 
relational engine (SQLite), but the whole thing is very fast and very 
versatile. We have been processing hundreds of GB of data (in various 
forms) using such "enhanced" SQLite queries for many years.


Having said all of the above, I hope that you can see why we care so 
much about SQLite's virtual table API efficiency.


estama.

[*] https://code.google.com/p/madis/

On 1/3/2014 10:35 πμ, Max Vlasov wrote:

On Fri, Feb 28, 2014 at 10:14 PM, Dominique Devienne
 wrote:

Can someone tell me how the statement below works?


Thanks for any help on this. This is really puzzling to me. --DD



Very puzzling for me too.
For any statement like this

   select * from blablabla(123)

sqlite (3.8.3.1) primarily reports a syntax error...

near "(": syntax error

... before complaining about anything else.

I'd also be glad to see a shortcut where a virtual table is created just
for a select statement and dies automatically, but I suspect it is not
supported and that it was their own modified version of sqlite. I hope the OP
reads this post as well and will have an answer for us, probably
providing further information regarding his issues.

Max


Re: [sqlite] "Common Table Expression"

2014-01-25 Thread Elefterios Stamatogiannakis

On 25/01/14 18:41, James K. Lowden wrote:

On Sat, 25 Jan 2014 11:56:32 +0200
Elefterios Stamatogiannakis <est...@gmail.com> wrote:


Wouldn't it be better instead of creating a new concept "row
constructor", to use the existing row constructors, also known as
virtual tables?


Perhaps.  I didn't make up the term; it exists in various other DBMSs,
sometimes called table value constructor.  I just wanted to make clear
that it's not new, and opens the potential for (I would say) better
queries, because they can be expressed in terms of tuples instead of
just scalars.



Yes, I see what you mean and I agree. Furthermore, to carry that 
train of thought a little further, I hope that tuples will become 
possible as return values from SQLite's user-defined functions.


If that ever becomes the case, then I suggest going for "tuple 
generators" instead of materialized tuples as the return primitive. 
Materialized tuples as a primitive have a lot of nasty side effects in 
a DB engine.


l.



Re: [sqlite] "Common Table Expression"

2014-01-25 Thread Elefterios Stamatogiannakis

On 25/01/14 03:37, James K. Lowden wrote:

On Fri, 24 Jan 2014 23:51:11 +0100
Petite Abeille  wrote:


It's exactly the same as "SELECT ?", but a little bit easier to
write. (It behaves like with INSERT, but is now available in every
place where a SELECT would be allowed.)


Hmmm… seems rather pointless to me.

select 1 as value /* from thin air */ union all
select 2 as value /* from thin air */
… etc …

Seems to be good enough. No point in hijacking a totally unrelated
construct. I would drop such complication if I had a say. There is
already a perfectly fine construct to conjure constants out of thin
air: select.


VALUES is a row constructor.  You should be able to do

VALUES ( ( 'a', 'b', 'c' ) , ('d', 'e', 'f') )
as ( A, B, C ) as T



Wouldn't it be better, instead of creating a new concept, "row 
constructor", to use the existing row constructors, also known as 
virtual tables?


If we had the option of using virtual tables without first creating 
them, and we were also able to have them at the front of the query 
(automatically prepending a "select * from ..."), then the VALUES above 
could be a virtual table [*].


So think of a virtual table named VALUES that gets as parameters the 
values that you want it to emit. Then you can select from it, insert 
from it and so on.


l.

[*] In addition, if we permitted queries as parameters to virtual tables, 
it would also enable virtual table composition.


An example of how this is done in madIS is below (XMLPARSE and FILE are 
virtual tables):


XMLPARSE FILE 'xmldata.xml.gz';

or expanded:

select * from (XMLPARSE select * from FILE('xmldata.xml'));





Re: [sqlite] Why must WITHOUT ROWID tables have PRIMARY KEYs?

2014-01-23 Thread Elefterios Stamatogiannakis

On 23/1/2014 7:12 μμ, Drake Wilson wrote:

Quoth Eleytherios Stamatogiannakis , on 2014-01-23 14:37:23 
+0200:

Let me describe a use case where a non-unique key and WITHOUT ROWID
are most welcome. We have a distributed big data system here which
uses SQLite for the partitions. To be able to efficiently execute
join queries on split partitions, we need to assemble the
partitions of one side of the query to create an index on them.


Do you really need bag rather than set semantics?  That is, can there
be a case where rows that are identical in _all_ columns need to be
treated as separate and (e.g.) have both copies show up in queries?



As we need to emulate (non-table-backed) indexes, yes. In an index you 
can have the same key with multiple "covering values" accompanying it. 
Consider the case where you want a covering index that "covers" the 
whole table, and you know that you'll only ever hit the index (e.g. for 
joins) and not the table that backs the index. In that case, the only 
way to store the data only once is using something like what I've 
described in my previous email.


Also, using the whole row as a primary key isn't a viable solution. There 
are many kinds of data that may have duplicate rows in the index, like 
pre-graph data (co-occurrence lists), on which, for example, you want to 
calculate the frequency of the links before you group by them.


Our data is mainly scientific data (from digital libraries), on which we 
do article text mining (finding citations, funders, classifications, 
protein references, ...). We also deal with graph data (graph 
isomorphisms, graph mining, ...).


All of the above processes are done using madIS [*], which is 
essentially SQLite + extensions (multivalued row and aggregate 
functions, virtual table composition, ...).


l.

[*] http://madis.googlecode.com


Most of the time, the way data is represented in relational databases,
this winds up requiring an arbitrary identity key anyway to be
practical (so one can manipulate a specific instance of an otherwise
identical row), or else it's equivalent to adding a count column to
turn {(x, y, z), (x, y, z)} into {(x, y, z, 2)}, though the latter has
a similar slight complexity hitch in the merge case to what you were
doing.

If you do require the above, I'm curious what data is being handled
here, since it's a rare case (but I understand if you don't wish to
say).  If not, then you may actually have a primary key of the whole
row, in which case I'm not sure why inventing a rowid is needed.

---> Drake Wilson


Re: [sqlite] Please test the latest SQLite snapshot

2012-11-10 Thread Elefterios Stamatogiannakis

Hello,

We've been using the SQLite 3.15-with-union-all-optimizations snapshot that 
you linked to previously to run some of the workflows that we have here, 
without problems.


The workflows use madIS, and they involve around 30 OLAP queries using 
row/aggregate/virtual table functions on ~5 GB of data.


No problems were found during the runs. The speeds were always the same 
as or faster than before, and in some cases, by changing some of the queries 
to fully use covering indexes, they became many times faster (from ~1 hour 
to 6-10 minutes) [*].


We'll continue our testing with more data and we'll report back.

lefteris.

[*] Covering index optimizations are really amazing.

On 10/11/2012 2:50 πμ, Richard Hipp wrote:

Please test the latest snapshot of SQLite in your applications and report
any performance regressions to this mailing list, or directly to me at
d...@sqlite.org.   You can get a tarball or ZIP archive of the latest raw
sources from Fossil at http://www.sqlite.org/src/info/trunk or you can pull
down a snapshot of a recent amalgamation from:

 http://www.sqlite.org/download.html
 http://www.sqlite.org/sqlite-amalgamation-201211092140.zip

We have made many improvements to the SQLite query planner/optimizer since
the last release.  These enhancements should help to make most applications
run faster.  But coding a query planner is more art than science, and so
there is the possibility that some of the recent enhancements have made
some corner-case queries run slower.  Unfortunately, the only way for us to
identify those cases is to test SQLite in as many applications as possible.

So, please do test the latest SQLite amalgamation in your applications, and
please let us know if you encounter any performance regressions.  We cannot
fix performance problems if we do not know about them.  Please do NOT wait
until an official release to run this test - it will be too late then.

FWIW, this email is being typed into Gmail running on a Firefox Nightly
that was recompiled with the SQLite amalgamation snapshot shown above.  And
the Fossil binary that you see at http://www.sqlite.org/src/timeline is
also running the latest SQLite.  (See
http://en.wikipedia.org/wiki/Eating_your_own_dog_food)  But we still need
your help to test the latest SQLite enhancements in as many applications as
possible.

Thanks





Re: [sqlite] Union all writting on /var/tmp documentation warning

2012-10-29 Thread Elefterios Stamatogiannakis

Sorry, I didn't realize before that you had:

select * from (... union all ...)

Try with a count(*), as such:

select count(*) from (... union all ...)

And you'll see that both UNION and UNION ALL will create a temp file. 
UNION needs the temp file to remove the duplicates; UNION ALL doesn't 
need it at all.


l.

On 29/10/2012 10:37 μμ, Clemens Ladisch wrote:

Eleytherios Stamatogiannakis wrote:

My understanding (and what my experiments have shown) is that in both
cases "COMPOUND SUBQUERIES 1 AND 2" will write a temp file to /var/tmp.


Okay, let's create a test database ...

   $ strace -e trace=open sqlite3 test.db
   [...]
   sqlite> create table t(x);
   open("/tmp/test.db-journal", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 4
   open("/dev/urandom", O_RDONLY|O_CLOEXEC) = 5
   open("/tmp", O_RDONLY|O_CLOEXEC)= 5
   sqlite> begin;
   sqlite> insert into t values('long string');
   open("/tmp/test.db-journal", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 4
   sqlite> insert into t select * from t;
   sqlite> insert into t select * from t;
   sqlite> insert into t select * from t;

... that becomes so big that the subquery overflows the cache:

   [...]
   sqlite> insert into t select * from t;
   open("/var/tmp/etilqs_Oekg82a6826YGdz", 
O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 5
   sqlite> commit;

Only the UNION query claims to use temporary storage for the subquery:
(These two queries are constructed so that they have no result records.)

   sqlite> explain query plan select * from (select rowid, x from t union all 
select rowid, x from t) where x = '';
   1|0|0|SCAN TABLE t (~10 rows)
   2|0|0|SCAN TABLE t (~10 rows)
   0|0|0|COMPOUND SUBQUERIES 1 AND 2 (UNION ALL)
   sqlite> explain query plan select * from (select rowid, x from t union 
select rowid, x from t) where x = '';
   2|0|0|SCAN TABLE t (~100 rows)
   3|0|0|SCAN TABLE t (~100 rows)
   1|0|0|COMPOUND SUBQUERIES 2 AND 3 USING TEMP B-TREE (UNION)
   0|0|0|SCAN SUBQUERY 1 (~20 rows)

When executing them, only UNION uses temporary files:

   sqlite> select * from (select rowid, x from t union all select rowid, x from 
t) where x = '';
   sqlite> select * from (select rowid, x from t union select rowid, x from t) 
where x = '';
   open("/var/tmp/etilqs_QNvTpzSHSedfFFM", 
O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 4
   open("/var/tmp/etilqs_RiTrAL6vrIxpnOu", 
O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 5


Now, why would your UNION ALL query want to use a temporary table?
I'd guess that SQLite needs to save the result for some other reason.


Regards,
Clemens


Re: [sqlite] Union all writting on /var/tmp

2012-10-27 Thread Elefterios Stamatogiannakis
Thank you, Simon. Your solution would work for the example I gave. 
Nevertheless my problem is still more complex, because I also use SQLite 
as a generic streaming engine (yes, I know, SQLite wasn't designed for 
doing things like that).


Apart from input VTs (FILE) we also have output VTs, like so [*]:

OUTPUT '10.0.0.1:8080' select upper(c1), c2, lower(c3) from
(
select * from file('http://www.foo.com/list1.tsv.gz')
union all
select * from file('http://www.foo.com/list2.tsv.gz')
)
where c2!=c4;

Internally this is implemented as a VT (OUTPUT) that takes as input a 
query and sends its results to another SQLite. The beauty of the concept 
is that a streaming query flow can go from machine to machine and the 
transactional properties still hold. If anything happens anywhere in the 
query's path, all of the machines will roll back due to the transactional 
properties of SQLite.


In addition to the above, we are developing a parallel processing engine 
(like Hadoop) where we use SQLite "chunks" for processing and data 
storage. So tables are split (or collected) into multiple SQLite DBs and 
transferred to multiple machines, where the queries will be executed in 
parallel [**].


For the above we heavily use UNION ALLs over the attached DBs to scan over 
the many chunks of a table.


A UNION ALL that unnecessarily buffers its inputs to disk (for a 
single scan over them) is very nasty, performance-wise, for all of the 
above machinations.


I can guess that changing UNION ALL processing for single scans isn't 
so easy; nevertheless, it would be nice if the /var/tmp buffering 
directory could at least be changed.


lefteris.

[*] In practice we pipe to named pipes that point to other machines.

[**] Other projects based on the same idea are:

http://hadoopdb.sourceforge.net/

and the company that sprang from above:

http://hadapt.com/


On 27/10/2012 3:05 πμ, Simon Slavin wrote:


Does this work instead ?

CREATE TABLE t ([[whatever columns you want]]);
INSERT INTO t SELECT upper(c1),c2,lower(c3) FROM 
file('http://www.foo.com/list1.tsv.gz') WHERE c2!=c4;
INSERT INTO t SELECT upper(c1),c2,lower(c3) FROM 
file('http://www.foo.com/list2.tsv.gz') WHERE c2!=c4;

Simon.


Re: [sqlite] Union all writting on /var/tmp

2012-10-26 Thread Elefterios Stamatogiannakis
Thank you for answering, Mr. Hipp. The general problem doesn't have to do 
with counting the rows (that was there as a placeholder). I just want to 
merge 2 (or more) tables/streams.


The real query in madIS looks like the following:

create table t as
select upper(c1), c2, lower(c3) from
(
select * from file('http://www.foo.com/list1.tsv.gz')
union all
select * from file('http://www.foo.com/list2.tsv.gz')
)
where c2!=c4;

"list1.tsv.gz" and "list2.tsv.gz" are two enormous streams which i would 
like to process and put into table "t".


I have gone to great lengths to make the virtual table "file" being 
fully streamed. So both of the network files (list1, list2) arrive from 
the network packet by packet, get decompressed without touching the disk 
and then are broken into multiple columns ('tsv' is assumed to mean tab 
separated).


I admire SQLite very much for its predictability. So for the above query 
i would expect from it to scan first over the first file (list1) and 
then over the next (list2), and row by row put them in table "t".


This assumption was so strong that i've been searching all over the 
"file" VT code for a *very* long time to find out the bug in it that 
caused it to grid to a halt my system whenever i executed above query ( 
/ partition was filled ).


I have a request. If "union all" cannot be changed to not write on the 
hard disk when scanning just once over tables/streams. Could the 
/var/tmp position that it writes to, be changed to another directory? I 
prefer to keep a small root (/) partition and right now i'm unable to do 
any "union all" on anything that is bigger than the free space on it.


Thank you again,

lefteris.

On 26/10/2012 9:23 μμ, Richard Hipp wrote:



On Fri, Oct 26, 2012 at 2:16 PM, Eleytherios Stamatogiannakis
> wrote:

I have been observing the following freaky behaviour of SQLite. When
i run:

select count(*) from (select * from huge_table union all select *
from huge_table);

Sqlite starts writting in /var/tmp/ a file like:

  /var/tmp/etilqs___gblRd6vUPcx91Hl, the root partition fills up
and an error is raised.

Why does SQLite 3.7.14.1 need to write at all when doing union all?
It seems to me that there is no reason for doing so.


The only way SQLite knows to evaluate the query is to (1) compute the
UNION ALL into a temporary table then (2) scan the temporary table to
count the rows.  /var/tmp space is used to hold the temporary table.

Try instead:

SELECT (select count(*) from huge_table)+(select count(*) from huge_table);




Best regards,

lefteris.





--
D. Richard Hipp
d...@sqlite.org 




Re: [sqlite] Sqlite, Is it possible to calculate the length of the longest increasing subsequence using an UDF?

2012-10-13 Thread Elefterios Stamatogiannakis

Sorry for hijacking the thread, but I have an answer for that.

IMHO, having the computation application and the data management in 
different domains incurs a very high data transfer cost.


The traditional thinking of shipping the data from the DB to somewhere 
else (application code) to do the computation and then shipping the 
result back to the DB puts an enormous overhead on data processing 
tasks that need to do this back-and-forth shipping many times. Even when 
using SQLite, which lives in the same process as the computation 
application, this overhead is *very* considerable.


By moving the computation closer to, or inside, the DB engine (with UDFs), 
this overhead vanishes. I have already seen many cases of people implementing 
workflows the traditional way (DB server, computation on the app 
server) on high-powered machines being several times slower than me 
doing the same processing using madIS [*] on my low-powered desktop.
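
As a minimal illustration in C (not madIS code) of the mechanism being
argued for here, this is roughly what registering a scalar UDF with SQLite
looks like; the function name "textlen" and its behaviour are made up for
the example:

#include <string.h>
#include "sqlite3.h"

/* A toy scalar UDF: returns the length of its text argument.  The point is
** only the mechanism: once registered, the computation runs row by row
** inside the engine's process, with no shipping of data to an application
** server and back. */
static void textlen_func(sqlite3_context *ctx, int argc, sqlite3_value **argv){
  const unsigned char *s = sqlite3_value_text(argv[0]);
  (void)argc;
  sqlite3_result_int(ctx, s ? (int)strlen((const char*)s) : 0);
}

/* After this call, "SELECT textlen(col) FROM t" is evaluated in-process. */
static int register_udf(sqlite3 *db){
  return sqlite3_create_function(db, "textlen", 1, SQLITE_UTF8, 0,
                                 textlen_func, 0, 0);
}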


All of the above apply for OLAP processing. For OLTP, the traditional 
approach is good enough.


lefteris.

[*] http://code.google.com/p/madis/

On 12/10/12 21:41, Igor Tandetnik wrote:

... In light of this, I don't quite see what you expect to gain from turning it
into a UDF, as opposed to simply loading the sequence from the database
into memory with a vanilla SELECT statement, and working on it in your
application code.



Re: [sqlite] Covering index scan optimization

2012-09-15 Thread Elefterios Stamatogiannakis

On 15/09/12 21:57, Richard Hipp wrote:

So my thinking now is that this optimization should not be merged to trunk
unless it is first disabled by default and only enabled by a compile-time
or start-time option.



IMHO, a pragma that enables/disables it persistently on a DB would be ideal.

Many, many thanks for making this a reality; the performance increase 
will be very profound.


lefteris.


Re: [sqlite] Covering index scan optimization

2012-09-15 Thread Elefterios Stamatogiannakis

I have a question.

Will the covering index scan optimization also cover automatically 
created indexes?


So let's say that for a certain part of a query, the optimizer decides to 
create an automatic index. Will the same index be chosen for doing a 
scan (instead of a full table scan) in another part of the query where 
the automatic index may also be useful?


Thanks,

lefteris.

On 15/09/12 18:29, Richard Hipp wrote:

On Fri, Sep 14, 2012 at 5:24 PM, Roger Binns  wrote:




Is there a chance that the change will go into SQLite mainline?


Not without a copyright release.


And it may require more especially if you are an employee.  See the bottom
section of http://www.sqlite.org/copyright.html

And of course it is more than the few lines of changes - all the test
suites have to be updated to ensure no breakage and 100% coverage.



Furthermore, the change is incomplete.  With the patch shown, EXPLAIN QUERY
PLAN gives an incorrect explanation of what is happening.  And
sqlite3_stmt_status(pStmt,
SQLITE_STMTSTATUS_FULLSCAN_STEP, ...) does not count the full-scan steps
that occur as part of a covering-index scan.  There may be other issues;
these are just the ones I have found so far.



Roger









Re: [sqlite] Covering index scan optimization

2012-09-15 Thread Elefterios Stamatogiannakis

On 15/09/12 17:03, Simon Slavin wrote:


On 15 Sep 2012, at 12:08pm, Elefterios Stamatogiannakis <est...@gmail.com> 
wrote:


What i would really like to have in SQLite concerning OLAP, would be bigger 
pages,


You can set pagesize for a new database using a PRAGMA:

<http://www.sqlite.org/pragma.html#pragma_page_size>

The maximum allowed pagesize is 65536 bytes.  Create a new database file, then 
issue the PRAGMA before any CREATE commands.  If you have an existing database 
and want to change it you have to export the data, make a new database and 
import the data again, but this can all be done in two commands to the shell 
tool.



Yes, I know about the page size pragma and I actively use it. What I 
meant to say was that I would like to have even bigger pages than what 
is currently possible.



and internal page compression in a similar manner that column stores do [^]. 
This would greatly alleviate the storage pain of using denormalized DBs which 
is a must for OLAP.


This feature would indeed be suitable for a server-client database engine 
designed to run on multipurpose computers.  But SQLite is designed more in 
embedded machines in a single-processor environment.  For example, my TV 
recorder uses it to list the TV channels and meta-data about its recordings, 
and I have an extremely low-power GPS device which uses it for Positions of 
Interest.  The fact that SQLite turns out to be so useful as an embedded DBMS 
inside, for example, a web browser is just a bonus.  As the documentation says, 
if you need network-savvy client-server stuff, look elsewhere.



Could you explain why you think that compressed pages are suitable for a 
client-server database engine and not for an embedded one? Because I 
don't see why.


From my point of view, being able to have even smaller databases than 
now, without any major overhead, would be most useful for embedded 
purposes where disk and memory space (pages could be kept compressed in 
the cache) are at a premium.


Also, considering that the compression I proposed doesn't have anything 
at all to do with using a block compressor on a page, but concerns 
the way that the rows are stored in a page, there won't be any 
noticeable CPU usage increase.


l.


Re: [sqlite] Covering index scan optimization

2012-09-15 Thread Elefterios Stamatogiannakis

On 14/09/12 22:56, Clemens Ladisch wrote:

 But do you have any numbers (which would help deciding whether to accept this 
patch)?



I've run two queries on two different DBs:

- DB1: size 414 MB, 2,999,671 records, the table has 17 columns
- DB2: size 1.4 GB, 1,975,986 records, the table has 28 columns

The tables had one of the columns indexed (Column Cidx)

Queries run were:

Tablescan (Ts) query:
  select sum(Cidx) from t;
SCAN TABLE t (~100 rows)

CoveringIndexscan (CIs) query:
  select sum(Cidx) from (select Cidx from t order by Cidx);
SCAN TABLE t USING COVERING INDEX idx_Cidx (~100 rows)
SCAN SUBQUERY 1 (~100 rows)

For each DB, I ran each query 3 times (below I present an average), 
clearing the OS's caches every time.


DB1:
  Ts: ~ 4 sec 100 msec
  CIs: ~ 2 sec 90 msec

DB2:
  Ts: ~ 4 sec 550 msec
  CIs: ~ 1 sec 350 msec

Apart from the covering index scans being faster, another interesting 
fact that I've found is that when both the table's and the index's data are 
in memory (cached), the CIs query is consistently two times slower than Ts.


A rough guess is that SQLite's query execution VM has to do more work 
due to not being able to optimize:


SCAN TABLE t USING COVERING INDEX idx_Cidx (~100 rows)
SCAN SUBQUERY 1 (~100 rows)

to just:

SCAN TABLE t USING COVERING INDEX idx_Cidx (~100 rows)

Regards,

lefteris.


Re: [sqlite] Covering index scan optimization

2012-09-15 Thread Elefterios Stamatogiannakis

On 14/09/12 22:56, Clemens Ladisch wrote:

Elefterios Stamatogiannakis wrote:

On 13/09/12 23:02, Clemens Ladisch wrote:
For my main workload (OLAP) this can make an enormous difference!


OLAP isn't quite the typical SQLite use case.  But do you have any
numbers (which would help deciding whether to accept this patch)?



Concerning the "numbers", i'll run some timings.

Concerning OLAP being the typical SQLite use case. I haven't chosen 
SQLite for OLAP lightly. When the choice was made (some years ago) we 
had benchmarked other DBs too (MySQL, Postgres).


Things may have changed from when we did the benchmarks, but i suspect 
that if we ran the same tests again the final choice would remain the 
same. Some more details concerning the things that we had found follows.


MySQL's query execution engine was very fast for simple selects, counts, 
etc. When the queries became more complex, MySQL tended to "explode" due 
to materializing nested queries to disk or other query optimizer 
stupidities.


Postgres, which I hold dear in my heart, was a lot closer to SQLite's 
behaviour (it also runs single-threaded for single OLAP queries). 
Nevertheless, it was slower than SQLite for queries involving UDFs, and on 
some complex queries it could take many times longer to finish a query 
than SQLite.


In the end we didn't care that much about raw performance, but about UDF 
flexibility/speed and about having a reliable and predictable query 
planner/optimizer.


Concerning the query optimizer: in my opinion, and from watching people more 
experienced than me writing queries, what counts more is not the 
cleverness of the query optimizer but being able to predict what a 
reliable optimizer will do.


There are a number of things that I miss from Postgres, e.g. nested 
tables and the new JSON storage in 9.2. But using SQLite's UDF API (via 
the exceptional APSW [*]) I have created workarounds for them in madIS [*], 
so I don't care that much.


What I would really like to have in SQLite concerning OLAP would be 
bigger pages, and internal page compression in a similar manner to what 
column stores do [^]. This would greatly alleviate the storage pain of 
using denormalized DBs, which is a must for OLAP.


In addition, this feature would bring SQLite's engine very close to also 
being able to function as a document store while at the same time 
speeding up its relational processing (due to having smaller DBs on the 
disk).


Kind regards,

lefteris

[*] APSW: http://code.google.com/p/apsw/
madIS: http://code.google.com/p/madis/

[^] Like http://en.wikipedia.org/wiki/RCFile . Another idea would be for 
each column to have a dictionary at the start and then the column's data 
would reference the column's dictionary.
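
To make the footnote's second idea a bit more concrete, here is a purely
illustrative sketch in C of a page layout with a per-column dictionary;
none of these structures exist in SQLite, and the names are invented for
the example:

/* Illustrative only: a page where each column carries a small dictionary of
** its distinct values and the row cells store indexes into it.  This is the
** footnote's idea sketched as data structures, not anything from SQLite. */
typedef struct ColumnDict {
  int nValues;            /* number of distinct values on this page        */
  const void **aValues;   /* the distinct (serialized) values              */
} ColumnDict;

typedef struct CompressedPage {
  int nCols;
  int nRows;
  ColumnDict *aDict;      /* one dictionary per column                     */
  unsigned short *aCell;  /* nRows*nCols indexes into the dictionaries     */
} CompressedPage;

/* Decoding a cell is then a single array lookup. */
static const void *cellValue(const CompressedPage *p, int iRow, int iCol){
  return p->aDict[iCol].aValues[p->aCell[iRow*p->nCols + iCol]];
}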



Re: [sqlite] Covering index scan optimization

2012-09-13 Thread Elefterios Stamatogiannakis

On 13/09/12 23:02, Clemens Ladisch wrote:

Eleytherios Stamatogiannakis wrote:

It seems to me that using a covering index scan would always be faster
in both cases (fewer disk page reads).


Yes, if the index has fewer columns than the table.



In my experience, the most frequent case is for an index to have fewer 
columns than the table it indexes.



Is there a reason for SQLite to not use a covering index for scans?


The query optimizer does not allow indexes that are not needed for some
DISTINCT, WHERE, or ORDER BY clause:


Do you know if there is a reason for this?



   select c1 from t indexed by idxtc1;
   Error: cannot use index: idxtc1

However, it doesn't appear to be too difficult to allow this case:

--- src/where.c
+++ src/where.c
@@ -3037,6 +3037,7 @@
  int bSort = !!pOrderBy;   /* True if external sort required */
  int bDist = !!pDistinct;  /* True if index cannot help with DISTINCT 
*/
  int bLookup = 0;  /* True if not a covering index */
+int bFullCovIdxScan = 0;  /* True if full covering index scan */
  WhereTerm *pTerm; /* A single term of the WHERE clause */
  #ifdef SQLITE_ENABLE_STAT3
  WhereTerm *pFirstTerm = 0;/* First term matching the index */
@@ -3133,7 +3134,7 @@
  ** using the main table (i.e. if the index is a covering
  ** index for this query). If it is, set the WHERE_IDX_ONLY flag in
  ** wsFlags. Otherwise, set the bLookup variable to true.  */
-if( pIdx && wsFlags ){
+if( pIdx ){
Bitmask m = pSrc->colUsed;
int j;
for(j=0; j<pIdx->nColumn; j++){
@@ -3143,9 +3144,16 @@
  }
}
if( m==0 ){
-wsFlags |= WHERE_IDX_ONLY;
+if( wsFlags ){
+  wsFlags |= WHERE_IDX_ONLY;
+   }else{
+  wsFlags = WHERE_COLUMN_RANGE|WHERE_IDX_ONLY;
+  bFullCovIdxScan = 1;
+   }
}else{
-bLookup = 1;
+if( wsFlags ){
+  bLookup = 1;
+   }
}
  }

@@ -3209,6 +3217,8 @@
** it seems to be working well enough at the moment.
*/
cost = aiRowEst[0]*4;
+}else if(bFullCovIdxScan){
+  cost = aiRowEst[0]*2;
  }else{
log10N = estLog(aiRowEst[0]);
cost = nRow;




Thank you for the patch!! With a three-line change you replicated the 
new index-only scan feature of PostgreSQL 9.2!


Is there a chance that the change will go into SQLite mainline? For my 
main workload (OLAP) this can make an enormous difference!


Thanks again.

lefteris.



[sqlite] Announcing the Madis project

2010-02-23 Thread Elefterios Stamatogiannakis
Madis is an extensible relational database system built upon the SQLite 
database, with extensions written in Python (via the APSW SQLite 
wrapper). It is developed at:

http://madis.googlecode.com

Due to Madis' SQLite core, the database format of Madis is exactly the 
same as SQLite's. This means that all SQLite databases are directly 
usable with Madis.

Madis enables quickly developing (in Python) and testing new relational 
functions and virtual tables. SQL syntax extensions were also created to 
simplify queries which use these functions.

Madis' main goal is to promote the handling of data-related tasks within 
an extended relational model. In doing so, it promotes the database 
from a support role (storing and retrieving data) to being a full data 
processing system on its own. Madis already includes functions for file 
import/export, keyword analysis, data mining tasks, fast indexing, 
pivoting, statistics and workflow execution.

Some of the functionality of Madis is:

- Programmable (in Python) row functions (via APSW):

mterm> select detectlang('Il en est des livres comme du feu de nos foyers');
french

- Programmable (in Python) aggregate functions (via APSW):

mterm> select concatterms(a)
from (select "term1+term2" as a UNION select "term2 term3" as a);
term1+term2 term2 term3

- Programmable (in Python) virtual tables (via APSW):

mterm> select * from file('./demo/continents.tsv') limit 2;
Asia|AF
Europe|AL

- Multisets (functions that return multiple rows/columns):

mterm> select * from table1;
a                  | b
-------------------+---------------
'car wood bike'    | 'first group'
'car car wood'     | 'first group'
'car wood'         | 'first group'
'car wood ice'     | 'first group'
'ice'              | 'second group'
'car ice'          | 'second group'
'car cream'        | 'second group'
'icecream ice car' | 'second group'

mterm> select b, freqitemsets(a, 'threshold:2', 'noautothres:1', 
'maxlen:2') from table1 group by b
b            | itemset_id | itemset_length | itemset_frequency | item
-------------+------------+----------------+-------------------+-----
first group  | 1          | 1              | 4                 | wood
first group  | 2          | 1              | 4                 | car
first group  | 3          | 2              | 4                 | car
first group  | 3          | 2              | 4                 | wood
second group | 1          | 1              | 3                 | ice
second group | 2          | 1              | 3                 | car
second group | 3          | 2              | 2                 | car
second group | 3          | 2              | 2                 | ice

- On the fly multidimensional indexing (the cache virtual table):

The index is based on kdtrees and is extremely fast with queries 
involving multiple constraints.

mterm> select country.c2, continent.c1
   from file('countries.tsv') as country,
file('continents.tsv') as continent
   where country.c1=continent.c2;
Aruba|Americas
Antigua and Barbuda|Americas
United Arab Emirates|Asia
Afghanistan|Asia
. . . . . . . . .
Query executed in 0 min 2 sec 40 msec

mterm> select country.c2, continent.c1
   from file('countries.tsv') as country,
(CACHE file 'continents.tsv') as continent
   where country.c1=continent.c2;
Aruba|Americas
Antigua and Barbuda|Americas
United Arab Emirates|Asia
Afghanistan|Asia
. . . . . . . . .
Query executed in 0 min 0 sec 71 msec

- Workflows:

mterm> exec flow file 'workflow.sql';

The above query uses Madis' inverted SQL syntax.

- Pivoting:

http://madis.googlecode.com/svn/publish/row.html#pivoting

--

All of the above functionality has been created via row/aggregate/vtable 
Python extensions (APSW offers these in a very nice API) and the 
aforementioned SQL syntax extensions.

In practice, Madis has proven to be very fast in data analysis tasks 
and in the development of data processing workflows.

A little note:

The high quality of APSW's and SQLite's code has helped immensely in 
developing Madis. We have strained both of these projects as much as we 
could, and they coped beautifully. We literally had queries that spanned 
multiple pages, which executed in seconds.




[sqlite] Counter function

2008-12-03 Thread Elefterios Stamatogiannakis
Is there any important reason for the counter(X) function not to be included 
in main SQLite?

There is already an implementation of the counter function in 
src\test_func.c, and given the usefulness of the counter function in 
analytics, it is a pity to have to write obnoxious queries to 
work around the lack of a function like that.

Is it possible for a counter function that resets on group boundaries to 
be implemented in SQLite? It would also help immensely with analytics.
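
For illustration, a minimal sketch in C of how such a counter() could be
wired up through sqlite3_create_function; this is not the test_func.c
implementation, and the state handling (a heap-allocated counter passed as
user data) is an assumption made for the example:

#include <stdlib.h>
#include "sqlite3.h"

/* Each call returns the next integer; the running count lives in the
** user-data pointer supplied at registration time. */
static void counter_func(sqlite3_context *ctx, int argc, sqlite3_value **argv){
  sqlite3_int64 *pCount = (sqlite3_int64*)sqlite3_user_data(ctx);
  (void)argc; (void)argv;
  sqlite3_result_int64(ctx, ++(*pCount));
}

/* Register counter() on a connection; the counter is never freed here,
** which is acceptable only for a sketch. */
static int register_counter(sqlite3 *db){
  sqlite3_int64 *pCount = (sqlite3_int64*)calloc(1, sizeof(*pCount));
  if( pCount==0 ) return SQLITE_NOMEM;
  return sqlite3_create_function(db, "counter", 0, SQLITE_UTF8, pCount,
                                 counter_func, 0, 0);
}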

lefteris


Re: [sqlite] Window functions

2008-11-27 Thread Elefterios Stamatogiannakis
I'm using SQLite's counter function to emulate OLAP functionality. 
Unfortunately the counter function is not exposed in SQLite; nevertheless 
it eases the pain of not having analytic functions in SQLite (lead, lag, 
median, etc.).

lefteris


Alexey Pechnikov wrote:
> Hello!
> 
> In a message from Monday 24 November 2008 19:16:46, Constantine Vassil wrote:
>> OLAP functionality includes the concept of a sliding *window* that moves
>> down
>> through the input rows as they are processed. Additional calculations can
>> be
>>
>> performed on the data in the window as it moves, allowing further analysis
>> in a
>> manner that is more efficient than using semantically equivalent self-join
>> queries, or correlated subqueries.
>>
>> I am thinking this functionality is possible to implement working directly
>> with
>> the B-Tree.
>>
>> Does anyone has better ideas?
> 
> I think application-level solutions (the R language, for example) are better. I'm 
> using a safe Tcl interpreter with custom functions and send input data to it row by row.
> 
> Best regards, Alexey.



[sqlite] Counter function

2008-11-19 Thread Elefterios Stamatogiannakis
Is there a reason for a very useful (for analytics) function like 
"counter(X)" to live in test_func.c?

I stumbled on it while trying to write a counter function of my own. 
Please, if possible, make it visible. It could ease the pain of not having 
analytic functions in SQLite (lead, lag, median, etc.).

Is there a way to create a "group_counter" that restarts at group 
boundaries?

lefteris


Re: [sqlite] Terrible performance for one of our tables

2008-11-19 Thread Elefterios Stamatogiannakis
What page size do you use in your database? Maybe you should increase it?

Assuming a fully fragmented table with a page size of 16KB, the I/O 
rate should be (if my calculations are correct) in the 2 MB/s range 
(assuming a hard disk with 50 MB/s bandwidth and a 7 msec seek time).

For a 32KB page size you should get 4 MB/s. Because the I/O rate is heavily 
affected by the seeks (the rate doubles for 32KB because the hard disk 
does roughly half the seeks), if SQLite does additional seeks (like 
trying to lock something), then everything halves, so try to avoid any 
locking if you can.
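
A back-of-the-envelope version of that calculation, in C, using only the
figures assumed above (50 MB/s bandwidth, 7 msec per seek, one seek per
page):

#include <stdio.h>

/* Effective throughput of a fully fragmented scan is roughly
** page_size / (seek_time + page_size / bandwidth). */
int main(void){
  const double bandwidth = 50e6;    /* 50 MB/s sequential bandwidth */
  const double seek      = 0.007;   /* 7 msec per seek              */
  const double pages[2]  = { 16384.0, 32768.0 };
  for( int i=0; i<2; i++ ){
    double perPage = seek + pages[i]/bandwidth;   /* time per page  */
    printf("%2.0f KB pages: ~%.1f MB/s\n",
           pages[i]/1024.0, (pages[i]/perPage)/1e6);
  }
  return 0;   /* prints roughly 2.2 MB/s and 4.3 MB/s */
}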

lefteris

Jens Miltner wrote:
> Hi,
> 
> we're seeing terrible performance problems when fetching data from one  
> of our tables:
> The table contains roughly 1.2 Million rows and a plain "SELECT  
> COUNT(*) FROM t" query takes 8 minutes to finish.
> The table contains 10 fields, but the records average to about 100  
> Bytes of data total. The largest records are ~1k of data, but there  
> are only  a few hundred records with more than 256 Bytes of data.
> 
> Another table in the same database contains over 5 Million records  
> (and probably an average of a few hundred Bytes per record). Counting  
> rows in this table only takes about 40 seconds.
> 
> (All times taken from first run and they're reproducible on several  
> machines).
> 
> 
> When looking at a sample taken while sqlite is executing the COUNT(*)  
> query, almost all the time is spent inside read() calls.
> The average I/O rate is about 500kB - 800kB / seconds read  
> performance. When reading from other databases / other tables, I can  
> easily get 20-30 MB/s overall read performance.
> So, I can only conclude that for some reason, there seems to be some  
> terrible I/O behavior for the data in this specific table.
> 
> Running fs_usage shows that there are almost no consecutive pages  
> being read (i.e. the file offsets are not consecutive), so I suspect  
> that for some reason, this table is fragmented all across the 3 GB  
> database file, which may explain the slow overall read performance.
> 
> Now, there are two things that happened to this table, which didn't  
> happen to the other big table:
> 
> (1) we've updated the schema by appending a couple of columns
> (2) when data is updated, it's updated by first deleting a bunch of  
> rows and then re-inserting the new data. This happens once or twice a  
> day for almost all entries, thus the
> 
> Would either (1) or (2) lead to heavy fragmentation of this kind?
> 
> Vacuuming the database does resolve the performance issue, but  
> vacuuming that specific database took about 75 minutes (MacPro @  
> 3GHz), so it's not something we can do frequently (especially, since  
> the database is used in a server product that has 24/7 uptime) :(
> 
> Is there anything we could do to avoid the table fragmentation in the  
> first place (write queries in a different way, avoid deletions/re- 
> insertions, etc.)?
> (Note that I'm not talking about reclaiming free space here - that's  
> not really an issue for us. We need to avoid the performance hit we've  
> seen which IMHO is caused by the pages for a single table being  
> scattered all over the database file, thus requiring the paging code  
> to jump a lot in the file).
> 
> 
> Thanks,
> -jens
> 
> 
> BTW: this is SQLite 3.6.3, but at least running the COUNT queries is  
> not any faster with 3.5.x. Tests run on Mac OS X.
> 
> 



Re: [sqlite] bug? like-search with german umlaut is case-sensitive, should not be

2008-11-14 Thread Elefterios Stamatogiannakis
Has anybody successfully compiled SQLite with ICU for Win32?

I haven't managed to find a libicu for MinGW. Any tips welcome.

lefteris

D. Richard Hipp wrote:
> On Nov 14, 2008, at 8:08 AM, Martin Engelschalk wrote:
> 
>> Hi all,
>>
>> the ICU project is a very powerful tool to handle codepages, and also
>> supports regular expressions (using a class named "RegexMatcher", see
>> http://icu-project.org/apiref/icu4c/classRegexMatcher.html).
>> So, it should be relatively easy to replace the like() - function in
>> sqlite (see http://www.sqlite.org/lang_corefunc.html#like and
>> http://www.sqlite.org/c3ref/create_function.html)
>>
> 
> http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt=1.2
> 
> D. Richard Hipp
> [EMAIL PROTECTED]
> 
> 
> 


