[sqlite] IS a SQLite db of small size as good as reliable cache?

2016-01-31 Thread Simon Slavin

> On 31 Jan 2016, at 11:24pm, James K. Lowden  
> wrote:
> 
> Howard Chu  wrote:
> 
>> Note that the cache eviction runs quite frequently - once every 5
>> seconds or so, and evicts pages regardless of whether there's any
>> memory pressure in the system. It's quite possibly the stupidest
>> cache manager ever written.
> 
> Any insight into what they were thinking?

Only a cynical one.  It's really good for demo sessions.

When you're giving a demo you log in, immediately start an App, do a few 
things, spend a few seconds talking, start a second App, talk about that, start 
another App, etc.  Having memory cleared while you're talking means you have 
plenty free to start another App.

Simon


[sqlite] Customizing the location of the .sqlite_history

2016-01-31 Thread Jes Slow
Hi - this is kind of a trivial request, but .sqlite_history is one of the
relatively few files that I can't move into the ~/.config folder (or wherever
it fits).

We know that globals can become messy as you scale, and up until the XDG
standard, applications treated the home folder as their global place to drop
configuration files. When you have dozens of programs, that starts to get
messy.

Many applications handle this by allowing the user to set an environment
variable to customize the location, although personally I would prefer another
way, since environment variables are also global. Hope you consider it.
Thanks
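For what it's worth, the XDG-style lookup being asked for is only a few lines of code. Here is a hedged sketch in Python (the APP_HISTORY_FILE override name is hypothetical, invented for illustration, not an existing sqlite3 feature):

```python
import os
import pathlib

def history_path(app="sqlite3", filename="sqlite_history"):
    # Hypothetical resolution order: an explicit override variable,
    # then $XDG_CONFIG_HOME, then the ~/.config default.
    explicit = os.environ.get("APP_HISTORY_FILE")
    if explicit:
        return pathlib.Path(explicit)
    base = os.environ.get("XDG_CONFIG_HOME",
                          str(pathlib.Path.home() / ".config"))
    return pathlib.Path(base) / app / filename

print(history_path())
```

The per-application subdirectory keeps the config root tidy even when dozens of programs follow the same convention.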


[sqlite] Bug: Successfully committed transaction rolled back after power failure

2016-01-31 Thread Scott Robison
On Sun, Jan 31, 2016 at 7:35 PM, Rowan Worth  wrote:

> On 31 January 2016 at 03:56, James K. Lowden 
> wrote:
>
> > Surely SQLite does both -- fsync on file and directory -- as part of a
> > commit. That's not in doubt, is it?
> >
>
> No, SQLite does not. On COMMIT it fsyncs the database file and unlinks the
> journal[1], but does not fsync the directory. This is demonstrated by the
> logs in Stefan's initial post, and indeed the purpose of his patch was to
> introduce an fsync on the directory after unlinking the journal.
>

Perhaps there could be a "paranoid" journal mode, where it first zeros out
the header à la PERSIST, then truncates the file, then deletes the file.
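No "paranoid" mode exists today; the three existing journal_mode values that the proposal would chain together are PERSIST, TRUNCATE, and DELETE. A small sketch using Python's stdlib sqlite3 module (it needs an on-disk database, because in-memory databases always report journal_mode=memory):

```python
import os
import sqlite3
import tempfile

# Each PRAGMA journal_mode=... statement reports the mode now in effect.
# The hypothetical "paranoid" mode would do all three steps on one commit.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
modes = [con.execute(f"PRAGMA journal_mode={m}").fetchone()[0]
         for m in ("persist", "truncate", "delete")]
con.close()
print(modes)
```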

-- 
Scott Robison


[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Yannick Duchêne
On Sun, 31 Jan 2016 17:22:37 +
Simon Slavin  wrote:

> 
> Ignore all the above.  There are rare situations where they're useful but the 
> situation you're in is helped far more by using the phonebook analogy earlier 
> posters used than by trying to use the above.
> 
> Think about pure SQL, and about making one ideal index for each SELECT 
> command, and you'll get very good results.  Work out what you want the SELECT 
> (or UPDATE, etc.) to do.  Then work out the command.  Then work out the index 
> which would be ideal to help the command do its thing.  That's the best way 
> to get fast SQL.
> 
> Optimizing for SQLite peculiarities (some of which no longer apply because 
> inner workings of SQLite have changed since the article was written) is 
> useful only very rarely.
> 

I agree. I'm drifting too far from one of my concerns; I'll stick to just SQL 
(and standard SQL).

In the meantime, I was looking at what SQLite does with the queries, to see if 
it's intuitive, if it matches enough what one would devise without a DB engine. 
I mean I sometimes think about SQL as a data structure modelling language, and I 
may try to re-implement it in a classic procedural language, for an experiment. 
I'm also aware this point of view (SQL as a data structure modelling language) 
is less meaningful with complex queries, as with these, a DB engine adds value 
outweighing the hopes one may get from a procedural implementation; the same 
if the set of queries is not fixed.

That said, thanks for the reminder.

-- 
Yannick Duchêne


[sqlite] Feature request: ANALYZE REMOVE

2016-01-31 Thread Simon Slavin
I would like a version of the ANALYZE command which drops all the tables that 
ANALYZE creates, then updates the query planner so that it knows no stats are 
available.  This command is intended to help with

A) automated testing and time-trials
B) cleanup for transporting datasets from one setup to another
C) cleanup for when changing compilation options

It should delete not just the tables that ANALYZE in its current mode would 
create, but also those which might have been created by other compilation 
options such as SQLITE_ENABLE_STAT3.  If no schema or table name is supplied, 
dropping all tables with names starting with 'sqlite_stat' is fine.

The current syntax for ANALYZE is

ANALYZE [schema-name.table-or-index-name]

I see no need to implement ANALYZE REMOVE for individual tables or indices, so 
I propose that this new command be

ANALYZE [schema-name] REMOVE

This is my best idea but it does present ambiguity if you have a schema called 
REMOVE, and I'm happy to accept alternative ideas.  Or you may prefer 'DROP' to 
'REMOVE' or have some completely different idea about syntax.

PS: For those interested I currently use the following sequence:

DROP TABLE IF EXISTS sqlite_stat1;
DROP TABLE IF EXISTS sqlite_stat2;
DROP TABLE IF EXISTS sqlite_stat3;
DROP TABLE IF EXISTS sqlite_stat4;
ANALYZE sqlite_master;
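The sequence above can be wrapped in a few lines of any binding; here is a hedged sketch using Python's stdlib sqlite3 module (the analyze_remove name is invented, and sqlite_stat2 only exists under old SQLITE_ENABLE_STAT2 builds, so IF EXISTS keeps the sequence safe either way):

```python
import sqlite3

def analyze_remove(con):
    # Sketch of the proposed ANALYZE REMOVE: drop every stat table any
    # build could have created, then run ANALYZE on a schema table so
    # the query planner reloads the (now absent) statistics.
    for n in (1, 2, 3, 4):
        con.execute(f"DROP TABLE IF EXISTS sqlite_stat{n}")
    con.execute("ANALYZE sqlite_master")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a, b)")
con.execute("CREATE INDEX t_b ON t (b)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(i, i % 3) for i in range(100)])
con.execute("ANALYZE")          # populates sqlite_stat1
analyze_remove(con)
stats = con.execute("SELECT name FROM sqlite_master"
                    " WHERE name LIKE 'sqlite_stat%'").fetchall()
print(stats)
```

The sqlite_stat tables are exempt from the usual prohibition on dropping sqlite_-prefixed tables, which is what makes the DROP sequence legal at all.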

Simon.


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread Scott Robison
On Sun, Jan 31, 2016 at 9:42 AM, Keith Medcalf  wrote:

>
> And I thought the "Object Oriented" jihad blew up when it was discovered
> to be counter to productivity and performance in the 1990's and that it did
> not provide a single one of the "advantages" claimed by its mujahedeen
> warriors.
>

Any time one starts using religious terminology to describe a technology,
whether pro or con, it seems suspect to me. There are many successful
projects that utilize object oriented techniques to some extent. SQLite
itself included.

It is possible to write crappy code in any language, and it is possible to
write elegant / efficient code in most languages. Often the choice is not
the language but rather the algorithms implemented.

I would never suggest that everyone must use OO, but to suggest it is
worthless or never works seems suspect to me.

-- 
Scott Robison


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread Gabor Grothendieck
On Sun, Jan 31, 2016 at 6:25 PM, James K. Lowden
 wrote:
> On Sat, 30 Jan 2016 20:50:17 -0500
> Jim Callahan  wrote:
> But you already have pandas.read_sql_query.  While that function
> isn't really what I'd call simple, the complexity afaict -- dates,
> floats, and chunks -- can be laid at Python's feet.

I use R rather than Python, but the problem of dates is significant and
I assume it is the same problem as in Python.  Consider this query:

 select myDate + 1 from myTable

Assume that myTable comes from an R data frame with a myDate column
that has R class of "Date".  Internally R stores the myDate column as
days since 1970-01-01, and since SQLite has no Date class it sends that
number to SQLite.  The select statement then adds 1, giving a number
representing the next day, but when one gets it back to R, how can R
know that the number is intended to represent a Date, and so convert
it to R's Date class?  In fact it can't.   With databases that have a
Date type this is not a problem, but it is a significant problem with
SQLite.

There are some hackish workarounds.  For example, consider this
self-contained reproducible R code:

library(sqldf)  # load package and dependencies
myTable <- data.frame(myDate = as.Date("2000-01-31"))  # define myTable
sqldf("select myDate + 1 myDate from myTable")

sqldf will use the heuristic of assuming that the myDate in the output
has the same class as the myDate in the input (because they have the
same name), and so will convert the output myDate column to the R
"Date" class, but the ability of such a heuristic to work is fairly
limited.

It is also possible to specify to sqldf the class of each output
column but this is tedious and puts SQLite at a disadvantage since it
is unnecessary if used with a different backend database that is able
to return a Date class column.

With numeric and character columns there is not much of a problem, but as
soon as one gets to date and date/time columns this impedance
mismatch appears, and it is one of the main reasons an R user might decide
to use a different backend.
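The same mismatch is visible from Python's stdlib sqlite3 module, which works around it with explicitly registered adapters and converters. A sketch of that stdlib mechanism (not of sqldf's heuristic); note how the conversion only survives where a declared column type does:

```python
import datetime
import sqlite3

# Store dates as ISO-8601 text; convert back on the way out. Registered
# explicitly, since recent Pythons deprecate the implicit defaults.
sqlite3.register_adapter(datetime.date, lambda d: d.isoformat())
sqlite3.register_converter(
    "DATE", lambda b: datetime.date.fromisoformat(b.decode()))

con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute("CREATE TABLE myTable (myDate DATE)")
con.execute("INSERT INTO myTable VALUES (?)", (datetime.date(2000, 1, 31),))

# A bare column keeps its declared DATE type, so the converter runs...
plain = con.execute("SELECT myDate FROM myTable").fetchone()[0]
# ...but any expression loses the declared type, exactly the problem
# described above: the result comes back as a plain string.
expr = con.execute("SELECT date(myDate, '+1 day') FROM myTable").fetchone()[0]
print(type(plain).__name__, plain)
print(type(expr).__name__, expr)
```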


[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Yannick Duchêne
On Sun, 31 Jan 2016 10:45:59 -0700
"Keith Medcalf"  wrote:

> create table PhoneDirectory
> (
>  surname_id integer not null references Surnames,
>  given_id integer not null references GivenNames,
>  address_id integer not null references Addresses,
>  PhoneNumber text collate nocase primary key
> );
> create index PD_Surnameid on PhoneDirectory (surname_id);
> create index PD_Givenid on PhoneDirectory (given_id);
> create index PD_addressid on PhoneDirectory (address_id);
> 
> create view v_PhoneDirectory
> as
>   select Surname, Givenname, Street_No, Suffix, Streetname, PhoneNumber
> from PhoneDirectory
> natural join Addresses
> natural join Streets
> natural join GivenNames
> natural join Surnames;

Now I understand why `NATURAL JOIN` is qualified so, and why it has an implicit 
`USING` clause: it's to help with cases like this.
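A tiny sketch of that equivalence, on a cut-down two-table version of the schema above (the data is invented): NATURAL JOIN behaves like JOIN ... USING over every column name the two tables share, here just surname_id.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE PhoneDirectory (surname_id INTEGER, PhoneNumber TEXT);
    CREATE TABLE Surnames (surname_id INTEGER, Surname TEXT);
    INSERT INTO PhoneDirectory VALUES (1, '555-0101'), (2, '555-0102');
    INSERT INTO Surnames VALUES (1, 'Duchene'), (2, 'Slavin');
""")

# NATURAL JOIN: the shared column name is discovered implicitly.
natural = con.execute(
    "SELECT Surname, PhoneNumber FROM PhoneDirectory"
    " NATURAL JOIN Surnames ORDER BY Surname").fetchall()

# The equivalent spelled out: USING names the shared column explicitly.
explicit = con.execute(
    "SELECT Surname, PhoneNumber FROM PhoneDirectory"
    " JOIN Surnames USING (surname_id) ORDER BY Surname").fetchall()
print(natural == explicit, natural)
```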

> This database would be in BCNF normal form.  (Although the streetno and 
> suffix ought to be moved out to a separate table(s) if you need 4th or 5th 
> normal).  The model gets very much more complicated if you also have to 
> handle municipal names, city names, etc.

The highest normal forms are to be weighed against practicality, anyway.


-- 
Yannick Duchêne


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread Yannick Duchêne
On Sun, 31 Jan 2016 09:42:28 -0700
"Keith Medcalf"  wrote:

> 
> And I thought the "Object Oriented" jihad blew up when it was discovered to 
> be counter to productivity and performance in the 1990's and that it did not 
> provide a single one of the "advantages" claimed by its mujahedeen warriors.
> 
> Of course, it could be that there is not really very much of anything at all 
> (if there is anything at all) that implements an "object oriented" 
> architecture.  It is mostly just glossing appearances and wrappers around 
> inherently non-object oriented things.  But then again, that just goes to 
> show that OO is inherently flawed.
> 
> OO is a dead horse.  It is about time it was finally shot in the head and put 
> out of its misery.

I'm off-topic, but this is going a bit too far in the opposite direction.

I won't be so affirmative about OO being entirely negative; there are 
cases where it fits well. These cases (as I experienced them) are when 1) there 
are not that many objects (I mean 100, as an example, not 10 000 or 1 000 
000) and 2) each object has a strong personality, that is, each differs enough 
from every other instance, while still having enough in common with them. This is 
typically the case with user interfaces (although state machines are nice too in 
this area), as the human being is a complex thing :-D with many peculiarities, 
which are well matched by OO with its deep inheritance and attributes 
everywhere (prototype-based OO is nice too for this purpose).

A fact is that a user interface, as in this example, works at a tempo which is 
that of a human: a quarter of a second or thereabouts; it normally 
does not consume that many resources (the objects are not responsible for heavy 
graphics). It does not do computation, it behaves. That's the word, I believe: 
OO is good at expressing peculiar behaviours, more than at expressing 
computation (if one does not confuse records with OO's objects).

The overall criticism I would still make against OO is that, in its typical 
implementations, it conflates interface inheritance and implementation 
inheritance, which in my opinion should be separate (some OO languages at least 
have private inheritance; unfortunately, the most famous languages don't). 
Another issue (solved more cleanly by Eiffel than by others) is name 
conflicts in inheritance. The implementations and the model are still two 
different things.


-- 
Yannick Duchêne


[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Simon Slavin

On 31 Jan 2016, at 6:51pm, Yannick Duchêne  wrote:
> 
> In the meantime, I was looking at what SQLite does with the queries, to see if 
> it's intuitive, if it matches enough what one would devise without a DB 
> engine. I mean I sometimes think about SQL as a data structure modelling 
> language, and I may try to re-implement it in a classic procedural language, for 
> an experiment. I'm also aware this point of view (SQL as a data structure 
> modelling language) is less meaningful with complex queries, as with these, a 
> DB engine adds value outweighing the hopes one may get from a procedural 
> implementation; the same if the set of queries is not fixed.

It's all good.  SQLite provides two wonderful tools to help you figure out what 
choices it's making and why: EXPLAIN and EXPLAIN QUERY PLAN.  You can also 
experiment with ANALYZE which does a job which many people have wasted hours 
trying to do.
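Both tools are one statement away in any binding. A minimal sketch in Python (the exact detail text varies across SQLite versions, so treat the wording as illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))")

# EXPLAIN QUERY PLAN: one human-readable row per strategy step;
# the last column of each row is the detail text.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 'x'").fetchall()
for row in plan:
    print(row)

# EXPLAIN: the low-level VDBE opcodes that would actually run.
opcodes = con.execute("EXPLAIN SELECT * FROM t WHERE a = 'x'").fetchall()
print(len(opcodes), "VDBE opcodes")
```

Here the plan should show a SEARCH of the primary-key index, since the WHERE clause constrains its leftmost column.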

Simon.


[sqlite] SQLite crashing

2016-01-31 Thread Olivier Mascia
> On 31 Jan 2016, at 18:06, Igor Korot  wrote:
> 
> Ok, I did try to recompile everything with /MDd.
> The crash again happens only on the main frame destruction.
> Everything worked after the pointer had been assigned to the main frame 
> member.
> I don't understand this at all.
> 
> Thank you.

Looking at the stack trace you posted, it is not SQLite code which crashes, but 
some C++ code (which happens to be inside a DLL named 'sqlite.dll'), probably 
wrapping the SQLite3 code itself.  It tries to delete some data, and an integrity 
check reveals the pointer is invalid, by that time at least.

> Here is the stack trace of the crash:
> ntdll.dll!77b973a6()
> [Frames below may be incorrect and/or missing, no symbols loaded
> for ntdll.dll]
> ntdll.dll!77b5164f()
> ntdll.dll!77b20f01()
> KernelBase.dll!762b2844()
>>   sqlite.dll!_CrtIsValidHeapPointer(const void * pUserData)  Line 2036C++
> sqlite.dll!_free_dbg_nolock(void * pUserData, int nBlockUse)
> Line 1322 + 0x9 bytesC++
> sqlite.dll!_free_dbg(void * pUserData, int nBlockUse)  Line 1265
> + 0xd bytesC++
> sqlite.dll!operator delete(void * pUserData)  Line 54 + 0x10 bytesC++
> sqlite.dll!SQLiteDatabase::`vector deleting destructor'()  + 0x65
> bytesC++

-- 
Meilleures salutations, Met vriendelijke groeten, Best Regards,
Olivier Mascia, integral.be/om





[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread James K. Lowden
On Sat, 30 Jan 2016 23:03:29 +
Simon Slavin  wrote:

> On 30 Jan 2016, at 8:13pm, Yannick Duchêne 
> wrote:
> 
> > In my opinion (which some others share), OO is a bag of
> > miscellaneous things which are better tools and better understood
> > when approached individually. Just trying to define what OO is, shows
> > it: is this about late binding? (if it is, then there are sub-program
> > references, first-class functions, or even static polymorphism and
> > signature overloading) About encapsulation? (if it is, then there
> > is already modularity and scopes) About grouping logically related
> > entities? (if it is, there is already modularity, and sometimes
> > physically grouping is a bad physical design).
> 
> There are a number of problems in using a relational database for
> object-oriented purposes.  One is that to provide access to stored
> objects you need to access the database in very inefficient ways
> which are slow and are not helped by caching.  You can read about
> some of the problems here:
> 
> 

To the extent "impedance mismatch" is real, it's a matter of looking
through the wrong end of the telescope.  

Programming languages have almost universally ignored relations, logic,
and constraints, leaving programmers with primitives, pointers, and
loops.   Which is cause and which effect?  Do programmers ignorant of
set theory demand primitive languages?  Or do primitive languages beget
ignorant programmers?  I don't know.  What I do know is that a
programming language with built-in support for relational concepts
remains to be invented.  

--jkl


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread James K. Lowden
On Sat, 30 Jan 2016 20:50:17 -0500
Jim Callahan  wrote:

> I am not interested in a complete ORM; what I am interested is when
> the object-oriented language supports a SQL-R-like object. In R, the
> object is called a data.frame and the package "Pandas" supplies a
> similar data frame object to Python.
> 
> R as I have mentioned has fantastic interfaces to SQL databases that
> allow one to pass a query and have the result populate a data frame.
> The data frame in R or Python can be fed to a machine learning
> algorithm (scikit learn on Python) or to a plotting package such as
> ggplot or bokeh.

OK, good.  It sounds like what you're really interested in is the
simplest function possible to move data from SQLite into Pandas.  (I'll
stick with Python, since I know it better than R.)  

But you already have pandas.read_sql_query.  While that function
isn't really what I'd call simple, the complexity afaict -- dates,
floats, and chunks -- can be laid at Python's feet.  At a minimum, you
have to specify the SQL and the column names you want to use in
Pandas.  

> SQLAlchemy package ... did not understand SQLite VIEWS and one had to
> write an explicit loop to build the data frame

You don't need to use SQLAlchemy with SQLite.  And you don't need to
write loops to move query results into a Pandas DataFrame.  

So, is the problem solved, or am I missing something?  

--jkl



[sqlite] Bug: Successfully committed transaction rolled back after power failure

2016-01-31 Thread James K. Lowden
On Sat, 30 Jan 2016 20:00:19 +
Simon Slavin  wrote:

> On 30 Jan 2016, at 7:56pm, James K. Lowden 
> wrote:
> 
> > Given that the fsync has returned successfully, I don't know of any
> > hardware that then will take 1000 ms to complete the write.  That's
> > the basis for my "subsecond interval" assumption. 
> 
> Writing to a RAID which has other write commands queued.

As Keith says, you get what you pay for.  I would say that hardware
that behaves as you describe is broken.  Hardware must expose some
kind of "yes, definitely written" notification interface.  If it does,
that's what fsync abstracts.  If it doesn't, it's broken, because
"write" only means "whatever, dude".  

A 7200 RPM drive is capable of ~75 I/O operations/second.  To reach my
1-second threshold, the RAID controller would have to have that many
queued operations.  Not only that: it would have to have the chutzpah
to claim to have written the data, knowing it couldn't guarantee its
promise.  Is Bernie Madoff consulting for Seagate?  

You might know more about this than I do.  The last time I dealt with
anything in the vicinity, write-queue depths of 10 were considered
problematic.  100 would have been catastrophic.  

Stefan Meinlschmidt mentioned "on Linux the interval can be, IIRC, up
to 30s".  I think that's a point of confusion.  That 30s sounds to me
suspiciously like the kernel's sync checkpoint interval.  Surely
buffered writes inside the device are not subject to "up to" 30s; the
device (in general) has no such timer.  

--jkl


[sqlite] Find out how many times does SQLite hit the disk?

2016-01-31 Thread James K. Lowden
On Sun, 31 Jan 2016 03:01:30 +0530
dpb  wrote:

> This will help me in deciding if moving to In-memory SQLite will
> improve my application performance. I am done with adding indexes to
> my tables in SQLite DB.

I second Simon's question.  If SQLite isn't fast enough, a good
starting assumption is that it's not being used as efficiently as
possible.  

Remember that problems at the SQL level can introduce delays that are
orders of magnitude greater than the difference in speed between memory
and disk.  That's especially true if your database is small enough to
consider switching to in-memory.  

--jkl


[sqlite] IS a SQLite db of small size as good as reliable cache?

2016-01-31 Thread James K. Lowden
On Sat, 30 Jan 2016 22:23:24 +
Howard Chu  wrote:

> Note that the cache eviction runs quite frequently - once every 5
> seconds or so, and evicts pages regardless of whether there's any
> memory pressure in the system. It's quite possibly the stupidest
> cache manager ever written.

Any insight into what they were thinking?  Back when I used Windows
daily, it used to annoy me that every morning the machine had to warm
up again, to revive the state I'd left it in the night before.  In
NetBSD I learned that unused memory is unused, so why not use it?  

I have a feeling that "fast app launching" is the reason, as though
Windows users were excitedly punching the Start button with a
stopwatch.  But maybe there's more to it than that?  

--jkl



[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread James K. Lowden
On Sun, 31 Jan 2016 17:39:28 +0100
Yannick Duchêne  wrote:

> I saw a page (can't retrieve the URL) suggesting to order table
> columns by names. It was strange to me, as I had the idea of
> hierarchical access for tables. But I thought "there must be a
> good reason for them to say this". 

On average, the quality of advice on the Internet is average; for SQL,
doubly so.  Some of it is truly terrible.  

Because the advice on this list is peer-reviewed (in the sense that
people who wrote the code participate), it tends to be very good.  

My advice, ahem, is to choose chemistry over alchemy.  If you don't
understand why the advice is well founded, keep checking until you do.
If sometimes good foundations are a little mysterious, the contrary is
not true: unfounded assumptions flush out early.  

I want to answer your question a little bit abstractly, and then circle
back to SQLite.  

We know how the table is physically ordered.  But there's no WHERE
clause; the whole table will be scanned.  Building an output table for
the aggregates will be required regardless.  The only difference would
be if the cardinality of a, b, and c were different.  That is, if
GROUP BY A produces many fewer rows than GROUP BY B, we would expect it
to run faster. Otherwise it's an artifact of the implementation, not
something inherent in the problem.  

Yet your counts indicate the opposite: GROUP BY B is the smallest
output.  How to explain?  Or, um, EXPLAIN?  

Consider that GROUP BY A will cause each A output row to be summed
using the input (from t) sequentially.  As the input is consumed, we
move to the next output row whenever A changes.  There's no seeking
from one output row to another.  

For GROUP BY B, each input row, read sequentially, means a seek
(logically speaking) to the appropriate output row.  If the cost of
that seek is not O(1), it will add to the time used to create the
output.  Because SQLite tables are based on B+ trees, that seek cost
is O(log2 n).  

I'd say that's the source of the difference you're seeing. EXPLAIN
shows there's an optimization for GROUP BY A, probably because the
output can be constructed sequentially.  
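That difference shows up directly in the query plan. A sketch, assuming a table with PRIMARY KEY (a, b, c) as in the thread (plan wording varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a, b, c, PRIMARY KEY (a, b, c))")
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(i % 10, i % 7, i) for i in range(1000)])

def plan(sql):
    # Join the detail text of each EXPLAIN QUERY PLAN row.
    return " | ".join(
        row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# GROUP BY a can consume the primary-key index in order; GROUP BY b
# needs a temporary b-tree to seek among the output rows.
by_a = plan("SELECT a, count(*) FROM t GROUP BY a")
by_b = plan("SELECT b, count(*) FROM t GROUP BY b")
print("GROUP BY a:", by_a)
print("GROUP BY b:", by_b)
```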

And that's a defensible choice, because aggregation-seeking isn't
usually a high cost (as in fact it's not in your case, either).

Aggregations by definition are reductions. Outputs are usually small,
because otherwise the GROUP BY produces incomprehensible results.
10,000 rows of aggregated output isn't unheard of, but it's unusually
many, and to find one row in 10,000 in a binary tree requires at most 15
comparisons.  More commonly outputs are in the hundreds of rows, and
need half that many.  It could be faster, but only at the expense of
code complexity and memory footprint.  

HTH, to see the lite.  

--jkl


[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread James K. Lowden
On Sun, 31 Jan 2016 10:45:59 -0700
"Keith Medcalf"  wrote:

> Hierarchical, Network, and Network Extended database models use
> pointers in sets rather than duplicating the data.  This makes them
> orders of magnitude faster (when properly designed) than a Relational
> Model database, 

I was cheering you on ... 


> but means that there is no recovery possible where a
> pointer-chain becomes corrupted -- with the relational model
> everything has a copy (multiple duplicates) of the data so you just
> drop the corrupted thing and re-create it.

... but I have to take exception to your characterization of
the theory.  :-(  

The relational model, as you well know, doesn't describe
implementation.  It's math.  It says what relations are, and how
they're manipulated.  One data type, and an algebra closed over that
domain.  Math doesn't have pointers or duplication or corruption; those
are computer concepts, and as such are entirely outside the model.  

For example, nothing in the model prohibits or describes using
compression to minimize I/O, or a hash of interned strings.  It's up to
the implementation to find the best way to support relational
operations, and to define "best".  SQLite, after all, supports in-memory
databases, about as fragile and unrecoverable a thing as imaginable! 

In explaining the relational model, it's true Codd does mention
pointers by way of contrasting the relational model with the others you
mentioned.  Very much intentionally, the relational model consists only
of values: every join is "value based", meaning the join of A to B is
expressed strictly in terms of the values in each one.   It does not
matter if A is the "parent" and B is the "child"; the syntax is the
same either way.  

We might say Codd's hand was forced, in that *math* is value-based.  But
he emphasized it as a feature that should be manifested in the query
language, and SQL consequently was (and for the most part, still is)
value-based.  That choice is for the user's sake, because it's easier
for humans to reason about values than about pointers.  

In those old pre-relational systems -- names for which Codd had to come
up with, by the way, because they had no "model" per se, no math --
relationships between "tables" (to use a modern term) were expressed by
pointers of some kind.  The connection was manifested in the database.
If you followed it in your application, you got DBMS support, and it
was simple(ish) and fast. If you wanted an unsupported connection --
count of orders by widget, say -- you were forced to write loops, and
well advised to get coffee while the query ran.  

When we say the relational model "has no pointers", we're referring to
the user's interaction with the data.  All tables are created equal,
you might say, and all joins are the same.  That's the simplification
that permits the advice you so often give: to express the problem
logically.  Pre-relational systems offered no such option.  

We now return you to your regularly scheduled programming.  

--jkl


[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Yannick Duchêne
On Sun, 31 Jan 2016 16:14:45 +0200
R Smith  wrote:

I'm replying to Igor too with this message, as both of you gave a similar answer.

> 
> First understand what an Index is and how it works.
> 
> Imagine you are asked to find all the people whose surnames are 
> "Duchêne" from the telephone directory. You would be able to do this 
> quite fast, because the phonebook is indexed first by Surname, then 
> name, then address. Perhaps a Telephone directory schema might look like 
> this:
>CREATE TABLE PhoneBook (Surname TEXT, Name TEXT, Address TEXT, 
> PhoneNo TEXT, PRIMARY KEY (Surname, Name, Address) );
> 
> Your query might look like this:
>SELECT * FROM PhoneBook WHERE Surname='Duchêne';
> 
> Imagine now that you are asked to find all people named "Yannick" in the 
> phone directory, like so:
>SELECT * FROM PhoneBook WHERE Name='Yannick';
> 
> Immediately that would go very slow because you have to look at each 
> Surname and see if there are any Yannicks in there, and the same problem 
> arises if you are asked to look for a specific address.


That makes sense, I agree, and I had this in mind for a short time, until it 
was overshadowed by a bad bet I made, for three reasons. If it's worth the words, 
let me tell:

I saw a page (can't retrieve the URL) suggesting to order table columns by 
names. It was strange to me, as I had the idea of hierarchical access for 
tables. But I thought "there must be a good reason for them to say this". 
Then on an SQLite page [1], there was a suggestion to avoid an index containing 
the same data as a wider index. So after these two things, I tried to imagine 
ways of setting up an index so that this makes sense: I thought a multi-column 
key could be accessed by any column, using fragments whose contents are ordered.

Precisely with the case of your example, I thought the "name" column would be 
partitioned into individually sorted parts. This was also contradicted by 
the fact that adding an index on a single column of a multi-column primary key 
could help grouping (although later again, there was another surprise contradicting 
this too).

[1]: https://www.sqlite.org/queryplanner.html
Which says
> Note that Idx3 contains all the same information as the original Idx1. And so 
> if we have Idx3, we do not really need Idx1 any more.
While reading it again, I see I overlooked what came next:
> your database schema should never contain two indices where one index is a 
> prefix of the other.
My bad.


> You will have a bright idea right the first time you are asked to do 
> this - you will make a list of names in alphabetical order followed by 
> surnames and keep it separate, so if ever you are asked again to find 
> someone by name, you can reference this second list to quickly see the 
> name and surname, and perhaps use that info to find them in the 
> PhoneBook and get the rest of the info. This second list is what is 
> called an Index - but it is not the PRIMARY index.
> 
> If you wish for all those searches to go fast, you need 3 Indices, not 
> simply a 3-section primary Index.
> 
> Perhaps this SCHEMA would better suit your needs:
> 
>CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>CREATE INDEX t_1 ON t (b);
>CREATE INDEX t_2 ON t (c);
> 
> Be careful though, every additional Index will require a full data list plus some 
> overhead worth of space, and will make INSERTs slower because it has to 
> insert more times and re-organize the B-Tree of every index a bit.
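The advice above can be sketched with the PhoneBook schema from earlier in the thread: before the extra index, the name lookup scans the whole table; after it, the plan switches to a search of the new index (plan wording varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE PhoneBook (
        Surname TEXT, Name TEXT, Address TEXT, PhoneNo TEXT,
        PRIMARY KEY (Surname, Name, Address))""")

def detail(sql):
    # Concatenate the detail text of each EXPLAIN QUERY PLAN row.
    return " | ".join(
        row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM PhoneBook WHERE Name = 'Yannick'"
before = detail(query)   # no index leads with Name: full table scan
con.execute("CREATE INDEX PB_Name ON PhoneBook (Name)")
after = detail(query)    # now a search of the secondary index
print("before:", before)
print("after: ", after)
```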

I guessed so, and that's why I have no swapped index (although (a,b) is a 
foreign key to another table with only a and b, which helps for some queries, but 
there is no table for (b,a)), and prefer an attempt to tweak the query on the 
table as-is (for now, as I will have a third refactoring).

> Best is to decide which searches you will do, make all Indices you think will 
> be needed, then try the queries (using explain query plan), see which Indices 
> are used and that the speed is good, then remove those who are not used.

That's what I did, and it showed me that on a query with a single `GROUP BY`, an 
index on "b" helped and was indeed used. Then later, it showed that a variant I had 
the idea to test, with two nested `GROUP BY`s, runs faster (an efficiency 
not far from that of the same query on the first column) while not using this 
column's index at all (which I finally removed). That's how I ended up with these 
tests and the question.



Maybe that's the opportunity for another question I have: given a foreign key 
(a,b) where "a" and "b" are more than a few bytes (not small) or are of 
variable size (still hopefully limited), are the values for "a" and "b" 
duplicated, or does the foreign key create a kind of reference? (maybe with a 
hash or even a short ID for the bigger value) If it's duplicated, then I will 
use integer keys instead. A while ago, I questioned the habit of 
mechanically using integer PKs (and also FKs), feeling that using the literal values 
is more readable and 

[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Simon Slavin

On 31 Jan 2016, at 4:39pm, Yannick Duchêne  wrote:

> I saw a page (can't retrieve the URL) suggesting to order table columns by 
> names. It was strange to me, as I had the idea of a hierarchical access for 
> tables access. But I thought "there must be a good reason for them to say 
> this". Then in an SQLite page [1], there was a suggestion to avoid an index 
> containing the same data as a wider index. So after these two things, I tried 
> to imagine ways of setting up an index so that this makes sense: I thought a 
> multi-column key could be accessed by any column, using fragments whose 
> content are ordered.
> 
> Precisely with the case of your example, I thought the "name" column would be 
> partitioned into individually sorted parts. While it was also contradicted by 
> the fact that adding an index on a single column of a multi-column primary key 
> could help grouping (although later again, there was another surprise 
> contradicting this too).

Ignore all the above.  There are rare situations where they're useful but the 
situation you're in is helped far more by using the phonebook analogy earlier 
posters used than by trying to use the above.

Think about pure SQL, and about making one ideal index for each SELECT command, 
and you'll get very good results.  Work out what you want the SELECT (or 
UPDATE, etc.) to do.  Then work out the command.  Then work out the index which 
would be ideal to help the command do its thing.  That's the best way to get 
fast SQL.

Optimizing for SQLite peculiarities (some of which no longer apply because 
inner workings of SQLite have changed since the article was written) is useful 
only very rarely.

Simon.


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread Simon Slavin

On 31 Jan 2016, at 4:42pm, Keith Medcalf  wrote:

> And I thought the "Object Oriented" jihad blew up when it was discovered to 
> be counter to productivity and performance in the 1990's and that it did not 
> provide a single one of the "advantages" claimed by its mujahedeen warriors.

Hmm.  I think Object Oriented programming is amazing, and it provides the brain 
with a far simpler, safer way to think about what it wants the program to do, 
thus speeding development and reducing errors.  I even use OO in JavaScript and 
PHP, both of which languages can do the same things without using OO.

It's just that an interface between procedural programming and OO has to deal 
with certain exceptions and rarities which OO is not good at dealing with.  As 
a previous poster commented, one of these is error-handling.  If the person who 
writes the library doesn't think all this out clearly ahead of time, you end up 
with a library where it takes four lines of programming just to tell whether 
the last command received an error.  Or some other inelegant and tedious thing 
the programmer has to do, to use the library at all.

It's possible to do this stuff well, but plan to throw away the first two 
versions, not just the first one.



Simon.


[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread R Smith


On 2016/01/31 3:54 PM, Yannick Duchêne wrote:
> Hi all,
>
> Another mystery to me. Given this test table:
>
>  CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>
> …this query:
>
>  SELECT Sum(c) FROM t GROUP BY a
>
> …executes faster than any of these two:
>
>  SELECT Sum(c) FROM t GROUP BY b
>  SELECT Sum(c) FROM t GROUP BY c
>
> … which both execute in about the same time, proportionally to the 
> number of returned rows. With `GROUP BY a`, execution time seems to be about 
> half that of the two others. Adding or not adding a `WITHOUT ROWID` gives 
> the same. I give the number of rows to show that if the first one is faster 
> than the second one, that's not because it returns fewer rows; on the 
> contrary, it returns a bit more than when grouping by `b`:
>
>   * Grouping by `a` results into 1360 rows in about 40ms +/-3;
>   * Grouping by `b` results into 1170 rows in about 65ms +/-5;
>   * Grouping by `c` results into 3154 rows in about 90ms +/-4.
>
> If the primary key declaration is removed, timing when grouping by `b` or `c` 
> does not change, while timing when grouping by `a` becomes the same as with 
> the two others.
>
> I seem to observe this with both the SQLite3 CLI and SQLiteBrowser (a detail 
> I must mention after another thread).
>
> Is there anything special with the first column of a composite primary key? 
> From an implementation point of view, this may make sense, but I still 
> prefer to ask.

Yes, there is something special about it - but not what you think perhaps.

First understand what an Index is and how it works.

Imagine you are asked to find all the people whose surnames are 
"Duch?ne" from the telephone directory. You would be able to do this 
quite fast, because the phonebook is indexed first by Surname, then 
name, then address. Perhaps a Telephone directory schema might look like 
this:
   CREATE TABLE PhoneBook (Surname TEXT, Name TEXT, Address TEXT, 
PhoneNo TEXT, PRIMARY KEY (Surname, Name, Address) );

Your query might look like this:
   SELECT * FROM PhoneBook WHERE Surname='Duchêne';

Imagine now that you are asked to find all people named "Yannick" in the 
phone directory, like so:
   SELECT * FROM PhoneBook WHERE Name='Yannick';

Immediately that would go very slow because you have to look at each 
Surname and see if there are any Yannicks in there, and the same problem 
arises if you are asked to look for a specific address.

You will have a bright idea right the first time you are asked to do 
this - you will make a list of names in alphabetical order followed by 
surnames and keep it separate, so if ever you are asked again to find 
someone by name, you can reference this second list to quickly see the 
name and surname, and perhaps use that info to find them in the 
PhoneBook and get the rest of the info. This second list is what is 
called an Index - but it is not the PRIMARY index.

If you wish for all those searches to go fast, you need 3 Indices, not 
simply a 3-section primary Index.

Perhaps this SCHEMA would better suit your needs:

   CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
   CREATE INDEX t_1 ON t (b);
   CREATE INDEX t_2 ON t (c);

Be careful though: every additional Index requires roughly a full copy of the 
indexed data plus some overhead worth of space, and will make INSERTs slower 
because the engine has to update and re-organize the B-Tree of every index.
Best is to decide which searches you will do, make all Indices you think will 
be needed, then try the queries (using EXPLAIN QUERY PLAN), see which Indices 
are used and that the speed is good, then remove those that are not used.

HTH,
Ryan
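A quick way to see the difference Ryan describes is to ask SQLite for its query plan on both kinds of search. A minimal sketch using Python's built-in sqlite3 module (the table and index name `pb_name` follow the phonebook example above and are illustrative only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE PhoneBook (
    Surname TEXT, Name TEXT, Address TEXT, PhoneNo TEXT,
    PRIMARY KEY (Surname, Name, Address))""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the step description in the last column
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

# Left-most indexed column constrained: the primary-key index is usable.
print(plan("SELECT * FROM PhoneBook WHERE Surname='Duchêne'"))

# Non-leading column constrained: a full table scan, until an index exists.
print(plan("SELECT * FROM PhoneBook WHERE Name='Yannick'"))

con.execute("CREATE INDEX pb_name ON PhoneBook (Name)")
print(plan("SELECT * FROM PhoneBook WHERE Name='Yannick'"))
```

The exact wording of the plan lines varies between SQLite versions, but the SEARCH-vs-SCAN distinction is stable.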



[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Dominique Devienne
On Sun, Jan 31, 2016 at 2:54 PM, Yannick Duch?ne 
wrote:

> Another mystery to me. Given this test table:
>
> CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>
> ??this query:
>
> SELECT Sum(c) FROM t GROUP BY a
>
> ??executes faster than any of these two:
>
> SELECT Sum(c) FROM t GROUP BY b
> SELECT Sum(c) FROM t GROUP BY c
>
> ? which executes in about the same time together, proportionally to the
> number of returned rows. With `GROUP BY a`, execution times seems to be
> about half than with the two formers. Adding or not adding a `WITHOUT
> ROWID` gives the same. I give the number of rows, to show if the first one
> is faster than the second one, that's not because it would returns less
> rows, on the opposite, it returns a bit more then with grouping by `b`:
>
>  * Grouping by `a` results into 1360 rows in about 40ms +/-3;
>  * Grouping by `b` results into 1170 rows in about 65ms +/-5;
>  * Grouping by `c` results into 3154 rows in about 90ms +/-4.
>

Run "explain query plan ..." [1] to see which plan SQLite chooses for your
different queries.

As [2] says "indexes are only useful if there are WHERE-clause constraints
on the left-most columns of the index",
with "left-most" being the keyword here. `a` is your left-most column, so
it works best with it.

But grouping by `b` may also benefit from the index, thanks to the
"skip-scan" optimization. (see [2] again).

If there are few distinct `a` values, the benefit of skip-scan improves.

The NGQP [3] is more sensitive than the old planner to good statistics when
finding the optimum plan, so do compare your plans and performance before
and after running ANALYZE on your table(s).

[1] https://www.sqlite.org/eqp.html
[2] https://www.sqlite.org/optoverview.html#skipscan
[3] https://www.sqlite.org/queryplanner-ng.html
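The before/after-ANALYZE comparison suggested above can be scripted; a sketch with synthetic data (few distinct `a` values, many rows, so a skip-scan becomes plausible once statistics exist — the actual plan chosen depends on the SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))")
# Few distinct `a` values over many rows: the shape skip-scan likes.
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(f"a{i % 3}", f"b{i}", i) for i in range(3000)])

def plan(sql):
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

print(plan("SELECT Sum(c) FROM t WHERE b = 'b7'"))  # before statistics
con.execute("ANALYZE")  # collects statistics into sqlite_stat1
print(plan("SELECT Sum(c) FROM t WHERE b = 'b7'"))  # plan may change after ANALYZE
```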


[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Yannick Duchêne
Hi all,

Another mystery to me. Given this test table:

CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))

…this query:

SELECT Sum(c) FROM t GROUP BY a

…executes faster than any of these two:

SELECT Sum(c) FROM t GROUP BY b
SELECT Sum(c) FROM t GROUP BY c

… which both execute in about the same time, proportionally to the number of 
returned rows. With `GROUP BY a`, execution time seems to be about half that 
of the two others. Adding or not adding a `WITHOUT ROWID` gives the same. I 
give the number of rows to show that if the first one is faster than the 
second one, that's not because it returns fewer rows; on the contrary, it 
returns a bit more than when grouping by `b`:

 * Grouping by `a` results into 1360 rows in about 40ms +/-3;
 * Grouping by `b` results into 1170 rows in about 65ms +/-5;
 * Grouping by `c` results into 3154 rows in about 90ms +/-4.

If the primary key declaration is removed, timing when grouping by `b` or `c` 
does not change, while timing when grouping by `a` becomes the same as with 
the two others.

I seem to observe this with both the SQLite3 CLI and SQLiteBrowser (a detail I 
must mention after another thread).

Is there anything special with the first column of a composite primary key? 
From an implementation point of view, this may make sense, but I still prefer 
to ask.

I first noticed this another way. This test was just to check on a simpler case.

Initially, I indirectly noticed this with something similar to this:

SELECT b, Sum(c) AS c 
  FROM
   (SELECT b, Sum(c) AS c 
FROM t 
GROUP BY a, b) 
  GROUP BY b -- 60 ms on average

… being faster than this second, simpler alternative (something I noticed with 
the test table too, just that the difference is smaller):

SELECT b, Sum(c) AS c
  FROM t 
  GROUP BY b -- 65 to 70 ms on average

… although the first one seems to run more operations, and it's still the same 
if I add an index on `b` for the second alternative so that it does not use a 
temporary B-tree for grouping.
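The two variants can be timed side by side. A rough sketch with synthetic data (the row counts mimic the figures in this thread, but the contents are made up, so the absolute timings will differ):

```python
import sqlite3, time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))")
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(f"a{i % 1360}", f"b{i % 1170}", i) for i in range(50000)])

nested = """SELECT b, Sum(c) AS c
            FROM (SELECT b, Sum(c) AS c FROM t GROUP BY a, b)
            GROUP BY b"""
flat = "SELECT b, Sum(c) AS c FROM t GROUP BY b"

for label, sql in (("nested", nested), ("flat", flat)):
    t0 = time.perf_counter()
    rows = con.execute(sql).fetchall()
    print(label, len(rows), f"{(time.perf_counter() - t0) * 1000:.1f} ms")
```

Both queries necessarily return the same totals per `b`; only the work done to get there differs.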


I also noticed some other cases where queries execute faster on the first 
column of a composite key (with or without indexes), but I won't expose all 
cases, as I'm already too lengthy.


Enough testing for now; I will resume the investigation of these unexpected 
results later.


-- 
Yannick Duchêne


[sqlite] SQLite crashing

2016-01-31 Thread Igor Korot
Teg,

On Sun, Jan 31, 2016 at 12:29 AM, Igor Korot  wrote:
> Teg,
>
> On Sat, Jan 30, 2016 at 1:54 PM, Teg  wrote:
>> Hello Igor,
>
> This is all compiling with /MTd.
>
>>
>> I  might set a break point in the destructor for the DB class and make
>> sure  it's not being fired before the exit crash.  Question marks on a
>> pointer  means  it's  an  invalid  pointer. The pointer to DB might be
>> incorrect  or  the  pointer  to something inside the DB class might be
>> incorrect. Sounds almost like it's getting destructed twice.
>
> No, the breakpoint is hit only once.
>
>>
>> What  happens  if  you  new  the DB class then use it, then immediately
>> delete it?  Does it still crash?
>
> Apparently yes. It still crashes.
> However, it does go thru the destructor and then crashes.
>
> Here is the stack trace of the crash:
>  ntdll.dll!77b973a6()
>  [Frames below may be incorrect and/or missing, no symbols loaded
> for ntdll.dll]
>  ntdll.dll!77b5164f()
>  ntdll.dll!77b20f01()
>  KernelBase.dll!762b2844()
>>sqlite.dll!_CrtIsValidHeapPointer(const void * pUserData)  Line 2036
>> C++
>  sqlite.dll!_free_dbg_nolock(void * pUserData, int nBlockUse)
> Line 1322 + 0x9 bytesC++
>  sqlite.dll!_free_dbg(void * pUserData, int nBlockUse)  Line 1265
> + 0xd bytesC++
>  sqlite.dll!operator delete(void * pUserData)  Line 54 + 0x10 bytesC++
>  sqlite.dll!SQLiteDatabase::`vector deleting destructor'()  + 0x65
> bytesC++
>  dialogs.dll!DatabaseType::OnConnect(wxWizardEvent & __formal)
> Line 125 + 0x3a bytesC++
>  dialogs.dll!wxAppConsoleBase::HandleEvent(wxEvtHandler * handler,
> void (wxEvent &)* func, wxEvent & event)  Line 657 + 0xf bytesC++
>  dialogs.dll!wxAppConsoleBase::CallEventHandler(wxEvtHandler *
> handler, wxEventFunctor & functor, wxEvent & event)  Line 669 + 0x22
> bytesC++
>  dialogs.dll!wxEvtHandler::ProcessEventIfMatchesId(const
> wxEventTableEntryBase & entry, wxEvtHandler * handler, wxEvent &
> event)  Line 1383 + 0x29 bytesC++
>  dialogs.dll!wxEvtHandler::SearchDynamicEventTable(wxEvent &
> event)  Line 1775 + 0x11 bytesC++
>  dialogs.dll!wxEvtHandler::TryHereOnly(wxEvent & event)  Line 1576
> + 0x15 bytesC++
>  dialogs.dll!wxEvtHandler::TryBeforeAndHere(wxEvent & event)  Line
> 3656 + 0x2d bytesC++
>  dialogs.dll!wxEvtHandler::ProcessEventLocally(wxEvent & event)
> Line 1513 + 0xc bytesC++
>  dialogs.dll!wxEvtHandler::ProcessEvent(wxEvent & event)  Line
> 1486 + 0xc bytesC++
>  dialogs.dll!wxWizard::ShowPage(wxWizardPage * page, bool
> goingForward)  Line 577 + 0x2a bytesC++
>  dialogs.dll!wxWizard::OnBackOrNext(wxCommandEvent & event)  Line
> 818 + 0x1e bytesC++
>  dialogs.dll!wxAppConsoleBase::HandleEvent(wxEvtHandler * handler,
> void (wxEvent &)* func, wxEvent & event)  Line 657 + 0xf bytesC++
>  dialogs.dll!wxAppConsoleBase::CallEventHandler(wxEvtHandler *
> handler, wxEventFunctor & functor, wxEvent & event)  Line 669 + 0x22
> bytesC++
>  dialogs.dll!wxEvtHandler::ProcessEventIfMatchesId(const
> wxEventTableEntryBase & entry, wxEvtHandler * handler, wxEvent &
> event)  Line 1383 + 0x29 bytesC++
>  dialogs.dll!wxEventHashTable::HandleEvent(wxEvent & event,
> wxEvtHandler * self)  Line 989 + 0x11 bytesC++
>  dialogs.dll!wxEvtHandler::TryHereOnly(wxEvent & event)  Line 1580
> + 0x25 bytesC++
>  dialogs.dll!wxEvtHandler::TryBeforeAndHere(wxEvent & event)  Line
> 3656 + 0x2d bytesC++
>  dialogs.dll!wxEvtHandler::ProcessEventLocally(wxEvent & event)
> Line 1513 + 0xc bytesC++
>  dialogs.dll!wxEvtHandler::ProcessEvent(wxEvent & event)  Line
> 1486 + 0xc bytesC++
>  dialogs.dll!wxWindowBase::TryAfter(wxEvent & event)  Line 3456 +
> 0x1e bytesC++
>  dialogs.dll!wxEvtHandler::ProcessEvent(wxEvent & event)  Line
> 1499 + 0x13 bytesC++
>  dialogs.dll!wxEvtHandler::SafelyProcessEvent(wxEvent & event)
> Line 1604 + 0x13 bytesC++
>  dialogs.dll!wxWindowBase::HandleWindowEvent(wxEvent & event)
> Line 1543C++
>  dialogs.dll!wxControl::ProcessCommand(wxCommandEvent & event)
> Line 289C++
>  dialogs.dll!wxButton::SendClickEvent()  Line 367 + 0xc bytesC++
>  dialogs.dll!wxButton::MSWCommand(unsigned int param, unsigned
> short __formal)  Line 415 + 0x8 bytesC++
>  dialogs.dll!wxWindow::HandleCommand(unsigned short id_, unsigned
> short cmd, HWND__ * control)  Line 5169 + 0x1c bytesC++
>  dialogs.dll!wxWindow::MSWHandleMessage(long * result, unsigned
> int message, unsigned int wParam, long lParam)  Line 2905 + 0x1f bytes
>C++
>  dialogs.dll!wxWindow::MSWWindowProc(unsigned int message,
> unsigned int wParam, long lParam)  Line 3471 + 0x22 bytesC++
>  dialogs.dll!wxTopLevelWindowMSW::MSWWindowProc(unsigned int
> message, unsigned int wParam, long lParam)  Line 321 + 0x14 bytes
> C++
>  

[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Keith Medcalf

> May be that's the opportunity for another question I have: given a foreign
> key (a,b) where "a" and "b" are more than a few bytes (not small) or are
> of variable size (still hopefully limited), are the values for "a" and "b"
> duplicated or do the foreign key creates a kind of references? (may be
> with an hash or even a short ID for the bigger value) If it's duplicated,
> then I will use integer keys instead. A bit long ago, I questioned the
> habit of mechanically using integers PK (and also FK), feeling using the
> literal values is more readable and simplifies queries' text. If my
> assumption is wrong (i.e. this is not by reference and there are copies
> every where), then I will have views for readable consultation and will
> bother less about more verbose queries.

An index will contain the actual data from the columns being indexed, plus the 
row number of the table to which the entry corresponds (or, for without rowid 
tables, the ENTIRE PRIMARY KEY column data to use to locate the underlying 
record).  This is why it is called a RELATIONAL DATABASE.  The entire thing 
operates only on the DATA and nothing but the data and does not contain 
"pointers".

Hierarchical, Network, and Network Extended database models use pointers in 
sets rather than duplicating the data.  This makes them orders of magnitude 
faster (when properly designed) than a Relational Model database, but means 
that there is no recovery possible where a pointer-chain becomes corrupted -- 
with the relational model everything has a copy (multiple duplicates) of the 
data so you just drop the corrupted thing and re-create it.

If a column contains a bunch of repeated data (say, a phone book), you can 
create the structures in the following ways:

create table PhoneBookEntry
(
  surname text collate nocase not null,
  firstname text collate nocase not null,
  ...
  primary key (surname, firstname)
);

This will duplicate all the surnames and firstnames in the index.  It also 
means that you cannot have two people with the same name.  So maybe you add 
something else to the key, like the address.  This is then duplicated as well 
in the index and now your constraint is that you cannot have two people with 
the same names at the same address.

If you analyzed the problem, you would note that the Phone Number is the actual 
primary key, and the other data are merely attributes of the phone number.  For 
most efficient operations you would probably end up with something like this:

create table Surnames
(
 surname_id integer primary key,
 surname text not null collate nocase unique
);

create table GivenNames
(
 given_id integer primary key,
 givenname text not null collate nocase unique
);

create table Streets
(
  street_id integer primary key,
  streetname text not null collate nocase unique
);

create table Addresses
(
 address_id integer primary key,
 street_no integer,
 suffix text collate nocase,
 street_id integer not null references Streets
);
create index Address_Streetid on Addresses (street_id);
create index Address_Streetno on Addresses (Street_no, Suffix);

create table PhoneDirectory
(
 surname_id integer not null references Surnames,
 given_id integer not null references GivenNames,
 address_id integer not null references Addresses,
 PhoneNumber text collate nocase primary key
);
create index PD_Surnameid on PhoneDirectory (surname_id);
create index PD_Givenid on PhoneDirectory (given_id);
create index PD_addressid on PhoneDirectory (address_id);

create view v_PhoneDirectory
as
  select Surname, Givenname, Street_No, Suffix, Streetname, PhoneNumber
from PhoneDirectory
natural join Addresses
natural join Streets
natural join GivenNames
natural join Surnames;

Then you use the view for all your queries of the PhoneDirectory.  This works 
very well if the optimizer properly prunes out any unnecessary joins caused by 
unreferenced columns in the view.  Even if not, all queries will run very 
quickly even with billions of records in the PhoneDirectory.  (NATURAL JOIN is 
syntactic sugar creating a WHERE clause that joins on like-named columns.)

This database would be in BCNF normal form.  (Although the streetno and suffix 
ought to be moved out to a separate table(s) if you need 4th or 5th normal).  
The model gets very much more complicated if you also have to handle municipal 
names, city names, etc.

Note that with this structure you can find all the "John"s living on "Queen 
Street" in just nanoseconds.  Using it to produce a printed phone book, 
however, is not very efficient.  The structure is designed to optimize querying 
and eliminate update anomalies.  Since printed phone books are only produced 
once a year, the fact that this operation is not efficient is immaterial.
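The schema above is runnable as written; a condensed sketch (same tables, minimal columns, with the suffix/municipality parts omitted and one made-up record) showing that the view reads like the flat table while each name and street is stored once:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Surnames (surname_id INTEGER PRIMARY KEY,
                       surname TEXT NOT NULL COLLATE NOCASE UNIQUE);
CREATE TABLE GivenNames (given_id INTEGER PRIMARY KEY,
                         givenname TEXT NOT NULL COLLATE NOCASE UNIQUE);
CREATE TABLE Streets (street_id INTEGER PRIMARY KEY,
                      streetname TEXT NOT NULL COLLATE NOCASE UNIQUE);
CREATE TABLE Addresses (address_id INTEGER PRIMARY KEY,
                        street_no INTEGER,
                        street_id INTEGER NOT NULL REFERENCES Streets);
CREATE TABLE PhoneDirectory (surname_id INTEGER NOT NULL REFERENCES Surnames,
                             given_id INTEGER NOT NULL REFERENCES GivenNames,
                             address_id INTEGER NOT NULL REFERENCES Addresses,
                             PhoneNumber TEXT COLLATE NOCASE PRIMARY KEY);
CREATE VIEW v_PhoneDirectory AS
  SELECT surname, givenname, street_no, streetname, PhoneNumber
    FROM PhoneDirectory
    NATURAL JOIN Addresses NATURAL JOIN Streets
    NATURAL JOIN GivenNames NATURAL JOIN Surnames;

INSERT INTO Surnames (surname) VALUES ('Duchêne');
INSERT INTO GivenNames (givenname) VALUES ('Yannick');
INSERT INTO Streets (streetname) VALUES ('Queen Street');
INSERT INTO Addresses (street_no, street_id)
  VALUES (42, (SELECT street_id FROM Streets WHERE streetname = 'Queen Street'));
INSERT INTO PhoneDirectory
  SELECT surname_id, given_id, address_id, '555-0101'
    FROM Surnames, GivenNames, Addresses;
""")
# Queries go through the view, which joins the lookup tables back together.
print(con.execute("SELECT * FROM v_PhoneDirectory").fetchall())
```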






[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Keith Medcalf

On Sunday, 31 January, 2016 06:54
> 
> Another mystery to me. Given this test table:
> 
> CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
> 
> …this query:
> 
> SELECT Sum(c) FROM t GROUP BY a
> 
> …executes faster than any of these two:
> 
> SELECT Sum(c) FROM t GROUP BY b
> SELECT Sum(c) FROM t GROUP BY c
> 

Your results are expected.  There is a wonderful invention in SQL called 
EXPLAIN which is designed to EXPLAIN things to you.  It is very simple to use, 
one just prepends the phrase "EXPLAIN QUERY PLAN" to the thing you want 
explained to you.  If you wish to know *how* the solution is implemented, then 
EXPLAIN by itself will give the VDBE code that needs to be executed to 
implement your query.

In the present example,

sqlite> explain query plan SELECT Sum(c) FROM t GROUP BY a;
0|0|0|SCAN TABLE t USING COVERING INDEX sqlite_autoindex_t_1

sqlite> explain query plan SELECT Sum(c) FROM t GROUP BY b;
0|0|0|SCAN TABLE t
0|0|0|USE TEMP B-TREE FOR GROUP BY

sqlite> explain query plan SELECT Sum(c) FROM t GROUP BY c;
0|0|0|SCAN TABLE t
0|0|0|USE TEMP B-TREE FOR GROUP BY

So you see, the first query is able to use an index (your primary key) because 
the data is already sorted in the order required.
Your other two queries do more work because the data is un-ordered and must be 
sorted in order to compute the results requested.

Or, the first does ONE operation, and the others do TWO operations.  You should 
therefore expect (in a rudimentary fashion) the second two queries each to take 
longer than the first.  (Not necessarily true since a large multistep plan can 
produce results much faster than a brute force method in many cases, provided 
that the necessary indexes are present).

> … which both execute in about the same time, proportionally to the
> number of returned rows. With `GROUP BY a`, execution time seems to be
> about half that of the two others. Adding or not adding a `WITHOUT
> ROWID` gives the same. I give the number of rows to show that if the first
> one is faster than the second one, that's not because it returns fewer
> rows; on the contrary, it returns a bit more than when grouping by `b`:
> 
>  * Grouping by `a` results into 1360 rows in about 40ms +/-3;
>  * Grouping by `b` results into 1170 rows in about 65ms +/-5;
>  * Grouping by `c` results into 3154 rows in about 90ms +/-4.
> 
> If the primary key declaration is removed, timing when grouping by `b` or
> `c` does not change, while timing when grouping by `a` becomes the same as
> with the two others.
> 
> I seem to observe this with both the SQLite3 CLI and SQLiteBrowser (a detail
> I must mention after another thread).
> 
> Is there anything special with the first column of a composite primary
> key? From an implementation point of view, this may make sense, but I
> still prefer to ask.

No, there is nothing "special" about the columns in an index.  An index which 
sorts by (a,b,c) is still sorted by (a,b,c), (a,b) and (a).
When you ask for a result which requires the data to be sorted by a, the index 
can be used to retrieve the data in the correct order.
When you ask for a result which requires the data to be sorted by b or c, the 
index is useless in retrieving the data in the correct order.

If you asked for WHERE a=5 GROUP BY b, the index would be useful because it 
would return the data in the correct order.  If you asked for WHERE a=5 GROUP 
BY c, then the index would be useful for finding all the candidate rows (where 
a=5) but would not be helpful in the GROUP BY c (unless the distribution of b 
were small so that a skip-scan was reasonable).
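This WHERE a=5 case is easy to check for oneself with EXPLAIN QUERY PLAN; a small sketch (TEXT columns, so the constant is quoted — purely illustrative data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))")

def plan(sql):
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

# With `a` fixed, the (a, b, c) index delivers rows already ordered by b,
# so the GROUP BY needs no sorter.
print(plan("SELECT Sum(c) FROM t WHERE a = '5' GROUP BY b"))

# Grouping by c under the same constraint still needs a temporary B-tree.
print(plan("SELECT Sum(c) FROM t WHERE a = '5' GROUP BY c"))
```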

> I first noticed this another way. This test was just to check on a simpler
> case.
> 
> Initially, I indirectly noticed this with something similar to this:
> 
> SELECT b, Sum(c) AS c
>   FROM
>(SELECT b, Sum(c) AS c
> FROM t
> GROUP BY a, b)
>   GROUP BY b -- 60 ms on average
> 
> … being faster than this second, simpler alternative (something I noticed
> with the test table too, just that the difference is smaller):
> 
> SELECT b, Sum(c) AS c
>   FROM t
>   GROUP BY b -- 65 to 70 ms on average
> 
> … although the first one seems to run more operations, and it's still the
> same if I add an index on `b` for the second alternative so that it does
> not use a temporary B-tree for grouping.
> 
> 
> I also noticed some other cases where queries execute faster on the first
> column of a composite key (with or without indexes), but I won't expose
> all cases, as I'm already too lengthy.

A composite key (as in a declared primary key or unique constraint) is an index.

> Enough testing for now; I will resume the investigation of these
> unexpected results later.






[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread Keith Medcalf

On Saturday, 30 January, 2016 16:03, Simon Slavin  
said:
> On 30 Jan 2016, at 8:13pm, Yannick Duch?ne 
> wrote:

> > In my opinion (which some others share), OO is a bag of miscellaneous
> things which are better tools and better understood when approached
> individually. Just trying to define what OO is shows it: is this about
> late binding? (if it is, then there are sub-program references, first-class
> functions, or even static polymorphism and signature overloading) About
> encapsulation? (if it is, then there is already modularity and scopes)
> About grouping logically related entities? (if it is, there is already
> modularity, and sometimes physically grouping is a bad physical design).
> 
> There are a number of problems in using a relational database for object-
> oriented purposes.  One is that to provide access to stored objects you
> need to access the database in very inefficient ways which are slow and
> are not helped by caching.  You can read about some of the problems here:

And I thought the "Object Oriented" jihad blew up when it was discovered to be 
counter to productivity and performance in the 1990's and that it did not 
provide a single one of the "advantages" claimed by its mujahedeen warriors.

Of course, it could be that there is not really very much of anything at all 
(if there is anything at all) that implements an "object oriented" 
architecture.  It is mostly just glossing appearances and wrappers around 
inherently non-object oriented things.  But then again, that just goes to show 
that OO is inherently flawed.

OO is a dead horse.  It is about time it was finally shot in the head and put 
out of its misery.







[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread Igor Tandetnik
On 1/31/2016 8:54 AM, Yannick Duchêne wrote:
> Another mystery to me. Given this test table:
>
>  CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>
> …this query:
>
>  SELECT Sum(c) FROM t GROUP BY a
>
> …executes faster than any of these two:
>
>  SELECT Sum(c) FROM t GROUP BY b
>  SELECT Sum(c) FROM t GROUP BY c

Imagine a phone directory book sorted by last name then first name. It's 
easy to find all people named "Smith, John", as well as all people with 
the last name of Smith. But the book's organization is of no help in 
finding all people with the first name of John - you have no choice but 
to scan the whole book.

Same with a database index (whether implicitly created for a primary 
key, or otherwise) - it only helps when the condition involves some 
prefix of the list of columns. In your example, it would help with 
grouping (or sorting, or filtering) on (a), or (a, b), or (a, b, c).
-- 
Igor Tandetnik



[sqlite] Problem with distinct select query

2016-01-31 Thread Gary Baranzini
Thanks for the suggestion, it works great.

> R Smith 
> January 30, 2016 at 6:13 AM
>
>
>
>
> A few ways come to mind, here's an easy one that works for me:
>
>
> WITH RNDCAT(cat) AS (   -- CTE to list 20 random categories
> SELECT DISTINCT formulas.majorcategory FROM formulas ORDER BY 
> RANDOM() LIMIT 20
> ), RNDFRM (cat, pin) AS (-- CTE to randomize the formula list
> SELECT majorcategory, pinyin FROM formulas ORDER BY RANDOM()
> ), RNDVAL(cat, pin) AS ( -- CTE to link one random-list formula to 
> each category
> SELECT RNDCAT.cat, (SELECT pin FROM RNDFRM WHERE 
> RNDFRM.cat=RNDCAT.cat LIMIT 1) FROM RNDCAT
> )  -- Final select to show values and names
> SELECT RNDVAL.pin AS Pinyin, RNDVAL.cat AS Category_ID, MC.item_name 
> AS Category
>   FROM RNDVAL
>   LEFT JOIN majorfcategory AS MC ON MC.id=RNDVAL.cat
> ORDER BY RNDVAL.cat
> ;
>
> Cheers,
> Ryan
>
>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
> Gary Baranzini 
> January 29, 2016 at 6:51 AM
> Hi,
>
> I have a query where I select 20 random distinct rows.
>
> SELECT DISTINCT formulas.pinyin, formulas.majorcategory, 
> majorfcategory.item_name
> FROM formulas
> JOIN majorfcategory ON majorfcategory.id=formulas.majorcategory
> ORDER BY RANDOM() LIMIT 20
>
> Each row contains a major category id, "majorcategory"
> I want to improve on this query so that the select will NOT return 
> duplicate major categories
>
> I thought I could do:
>
> SELECT DISTINCT formulas.pinyin, formulas.majorcategory, 
> majorfcategory.item_name
> FROM formulas
> JOIN majorfcategory ON majorfcategory.id=formulas.majorcategory
> WHERE formulas.majorcategory in (SELECT DISTINCT 
> formulas.majorcategory FROM formulas) ORDER BY RANDOM() LIMIT 20
>
> But that doesn't work.
>
> Any suggestions will be greatly appreciated.
>
> jb
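Ryan's CTE approach (one random formula per random category) can be exercised on a small synthetic dataset; a sketch assuming the two-table schema implied by the thread, with entirely made-up contents:

```python
import sqlite3, random

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE majorfcategory (id INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE formulas (pinyin TEXT, majorcategory INTEGER REFERENCES majorfcategory);
""")
con.executemany("INSERT INTO majorfcategory VALUES (?, ?)",
                [(i, f"category {i}") for i in range(30)])
con.executemany("INSERT INTO formulas VALUES (?, ?)",
                [(f"formula {i}", random.randrange(30)) for i in range(300)])

rows = con.execute("""
WITH RNDCAT(cat) AS (            -- 20 random categories
    SELECT DISTINCT majorcategory FROM formulas ORDER BY RANDOM() LIMIT 20
), RNDFRM(cat, pin) AS (         -- formulas in random order
    SELECT majorcategory, pinyin FROM formulas ORDER BY RANDOM()
), RNDVAL(cat, pin) AS (         -- first random formula for each chosen category
    SELECT RNDCAT.cat,
           (SELECT pin FROM RNDFRM WHERE RNDFRM.cat = RNDCAT.cat LIMIT 1)
    FROM RNDCAT
)
SELECT RNDVAL.pin, RNDVAL.cat, MC.item_name
  FROM RNDVAL LEFT JOIN majorfcategory AS MC ON MC.id = RNDVAL.cat
 ORDER BY RNDVAL.cat
""").fetchall()

# 20 rows, and no major category appears twice
print(len(rows), len({cat for _, cat, _ in rows}))
```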



[sqlite] Find out how many times does SQLite hit the disk?

2016-01-31 Thread Dominique Pellé
dpb  wrote:
> Dear SQLite Community,
>
> On windows, is there a way to find out how many times does my SQLite DB hit
> the disk?
>
> This will help me in deciding if moving to In-memory SQLite will improve my
> application performance. I am done with adding indexes to my tables in
> SQLite DB.
>
> If possible, please point me URLs that will help me.

You could print the number of SQLite page cache misses.
Each miss causes an I/O of page-size bytes.
See sqlite3_db_status():

https://www.sqlite.org/c3ref/db_status.html
https://www.sqlite.org/c3ref/c_dbstatus_options.html

On Linux, I would also use strace to see I/Os. I suppose
that procmon on Windows can also show I/Os.

Regards
Dominique


[sqlite] Find out how many times does SQLite hit the disk?

2016-01-31 Thread Yannick Duchêne
On Sat, 30 Jan 2016 20:36:55 -0800
J Decker  wrote:

> On Sat, Jan 30, 2016 at 8:09 PM, J Decker  wrote:
> > could use a tool like ProcMon and filter to disk activity on a
> > specified file to see...
> > https://technet.microsoft.com/en-us/sysinternals/processmonitor.aspx?f=255=-2147217396
> 
> Might not actually be useful though; if the file is memory mapped (WAL
> Journal?) then you won't see those accesses...
> 
> then you'd end up having to hook into VFS stuff...

If it's memory-mapped, it's less of an efficiency issue, which was the original 
concern if I remember correctly. If he wants to know about physical accesses, 
this may be what he needs.

-- 
Yannick Duchêne


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread Yannick Duchêne
On Sat, 30 Jan 2016 23:03:29 +
Simon Slavin  wrote:

> 
On 30 Jan 2016, at 8:13pm, Yannick Duchêne  
> wrote:
> 
> > In my opinion (which some others share), OO is a bag of miscellaneous 
> > things which are better as tools and better understood when approached 
> > individually. Just trying to define what OO is shows it: is this about 
> > late binding? (if it is, then there are sub-program references, 
> > first-class functions, or even static polymorphism and signature 
> > overloading) About encapsulation? (if it is, then there are already 
> > modularity and scopes) About grouping logically related entities? (if it 
> > is, there is already modularity, and sometimes physical grouping is a bad 
> > physical design).
> 
> There are a number of problems in using a relational database for 
> object-oriented purposes.  One is that to provide access to stored objects 
> you need to access the database in very inefficient ways which are slow and 
> are not helped by caching. […]

This matches what I had in mind with the short sentence “And especially about 
OO and DB, I know there are complaints that OO hides the data model of DBs too 
much”. I heard of an example where a video game engineer explained that OO 
typically impedes efficiency for their use cases, in a way which can be related 
to a DB and its tables. Say an object has an attribute `x` among others. Say a 
big set of objects of this type is often iterated, but it's not really the 
objects which are iterated; it's their `x` attribute alone, and no other 
attributes are accessed during this frequent long operation. Using OO, it is 
suggested `x` should be stored in the object instances (typically a kind of 
C-struct), so a large amount of unused data comes along with each `x` in an 
iteration, which hurts efficiency (poor caching, more memory accesses). In 
such a case, `x` should not be stored in the object, but rather in a dedicated 
list holding only the `x` attributes. That's about the same as when, in a DB, 
some columns are moved from a table to a specific table, while maintaining a 
relation between both.
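[Editor's note: the "dedicated list holding only the `x` attributes" described above is what game programmers call a structure-of-arrays layout. A toy sketch of the two layouts follows; it is illustrative only (in Python the cache-locality benefit is largely lost, the point is the data arrangement), and the `Particle` class and its payload are invented:]

```python
from array import array

# Array-of-structures: each object carries x along with unrelated payload,
# which gets dragged through memory on every iteration.
class Particle:
    def __init__(self, x, payload):
        self.x = x
        self.payload = payload

aos = [Particle(float(i), b"unused" * 16) for i in range(1000)]
total_aos = sum(p.x for p in aos)

# Structure-of-arrays: the hot attribute lives in its own contiguous array,
# much like moving a column out into its own table in a DB.
xs = array("d", (float(i) for i in range(1000)))
total_soa = sum(xs)

print(total_aos == total_soa)  # → True: same result, different memory layout
```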

Where `x` should belong is not decided by a conceptual picture of the 
attributes alone; it is decided by the typical operations referring to the 
attribute, and that matters all the more as the number of object instances 
grows. For a similar reason, a DB may need to be refactored when new queries 
come, while from an OO point of view this would just suggest adding a new 
method to an object, not moving `x` from the object to something else 
(something OO will never suggest).

That's what makes me agree that a relational DB is not well suited to storing 
objects (the models differ too much), unless the model of the object is broken 
down according to the requirements of its individual operations, or the object 
instances are just stored as monolithic blobs.

There are cases where the OO view matches well, and there are cases where it 
does not. There is no need to force one view to be the other.

-- 
Yannick Duchêne


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread Yannick Duchêne
On Sat, 30 Jan 2016 20:50:17 -0500
Jim Callahan  wrote:

> I am not interested in a complete ORM; what I am interested is when the
> object-oriented language supports a SQL-R-like object. In R, the object is
> called a data.frame and the package "Pandas" supplies a similar data frame
> object to Python.
> https://pypi.python.org/pypi/pandas/0.10.0/

The page says:
> Additionally, it has the broader goal of becoming the most powerful
> and flexible open source data analysis / manipulation tool available
> in any language.

There have been so many announcements like this in many areas… (open-source 
or not)

The page mentions HDF5, and Pandas's description seems similar to that of HDF5 
in its intents. If Pandas exists, it may be that its authors believe the HDF5 
library is not good enough. But HDF5 (which is itself subject to criticism) 
made the same promises as Pandas seems to make. All this to say that a tool 
may be presented as universal while it is not, for everyone or for every use, 
so the urgency may not be that high.

Is the concern mainly about hierarchical data? (SQL has a reputation for not 
being well suited to that) About heterogeneous data? (i.e. data not easily 
matching a type or a pattern)

> I am not interested in a complete ORM; what I am interested is when the
> object-oriented language supports [?]

What does object-oriented mean in this context? What properties of the object 
model stand out when you think about typical use cases?

> R as I have mentioned has fantastic interfaces to SQL databases that allow
> one to pass a query and have the result populate a data frame. The data
> frame in R or Python can be fed to a machine learning algorithm (scikit
> learn on Python) or to a plotting package such as ggplot or bokeh.
> https://pypi.python.org/pypi/ggplot/0.6.8
> http://bokeh.pydata.org/en/latest/
> http://scikit-learn.org/stable/
> 
> What I want to do is to demonstrate short scripts in R and Python to
> accomplish the same task.  I don't want the Python scripts to be longer and
> more complicated because Python has a lower-level interface to SQLite. […]

Why should this be SQLite's responsibility? What prevents a Python (or R) 
library from implementing the desired interface? (a library being used does 
not count toward a script's length). Is this in the hope of greater 
efficiency? (i.e. timing and consumed resources)

With my apologies for the naive questions…

-- 
Yannick Duchêne


[sqlite] IS a SQLite db of small size as good as reliable cache?

2016-01-31 Thread dpb
Dear Community,

Assuming that SQLite would arrange the database table contents on page 
boundaries, and

1) the db is small, let us say 5MB at most,

2) further, such a db is part of a Windows service / application running 
continuously, so its pages will likely remain in memory under normal 
conditions.

Q1) An operating system like Windows would cache the series of pages; is this 
behavior not almost the same as that of a cache?

Q2) Does it make sense to use in-memory SQLite for a small db of up to 5MB 
(average size 2MB)?

Thank you
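[Editor's note: regarding Q2, the stdlib `sqlite3` module (Python 3.7+) can copy a small file-backed database into an in-memory one with `Connection.backup`, which makes it easy to benchmark the two variants against each other. A sketch; the file name and table are made up as a stand-in for the ~2MB database in the question:]

```python
import os
import sqlite3
import tempfile

# Build a small file-backed database (stand-in for the db in the question).
path = os.path.join(tempfile.mkdtemp(), "small.db")
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
disk.executemany("INSERT INTO kv VALUES (?, ?)",
                 ((i, "value-%d" % i) for i in range(1000)))
disk.commit()

# Copy the whole file into an in-memory database via the backup API.
mem = sqlite3.connect(":memory:")
disk.backup(mem)

count = mem.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
print(count)  # → 1000
disk.close(); mem.close()
```

Timing the same queries against `disk` and `mem` would show whether the OS page cache already makes the file-backed variant fast enough, which is the crux of Q1.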


[sqlite] Find out how many times does SQLite hit the disk?

2016-01-31 Thread dpb
Dear SQLite Community,

On Windows, is there a way to find out how many times my SQLite DB hits
the disk?

This will help me in deciding if moving to in-memory SQLite will improve my
application performance. I have already added indexes to my tables in the
SQLite DB.

If possible, please point me to URLs that will help me.

Thank you,