Thanks Simon & Dominique,

> On 13 Jan 2016 at 13:29, Simon Slavin <slavins at bigfraud.org> wrote:
> 
>> Does SQLite have to reparse the schema text often to execute the queries? Or 
>> is the schema somehow translated internally to a stored, digested 
>> ('compiled') format, to ease its working?
> 
> The first time SQLite needs the database schema it reads it from the file 
> and, as you suspected, stores it internally (in a more convenient and 
> efficient format).  The internal version is valid only until a transaction 
> ends and the database is unlocked, because after that another process might 
> create/delete a table, or create a UNIQUE index or something.  However, a 
> schema-version counter is stored in the database and if, on the next command, 
> SQLite sees that this hasn't changed it will continue to use its internal 
> copy.

Perfect.
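For the record, the counter Simon describes is visible through PRAGMA schema_version. A minimal sketch with Python's sqlite3 module (table names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x)")

# SQLite bumps schema_version on every schema change and rechecks it
# before reusing its parsed in-memory copy of the schema.
v1 = con.execute("PRAGMA schema_version").fetchone()[0]
con.execute("CREATE TABLE u(y)")          # a schema change...
v2 = con.execute("PRAGMA schema_version").fetchone()[0]
assert v2 > v1                            # ...bumps the counter
```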

>> The application which would use this schema is a server-side application 
>> (quite along the lines described in http://sqlite.org/whentouse.html).  We 
>> have started experimentation and things look very good, excellent should I 
>> say, so the above question is more about which details to supervise, which 
>> traps to avoid.  I'm pretty sure there are people here with valuable 
>> background with similar datasets.
> 
> I wrote server-side front ends to SQLite databases without problems.  Any 
> problems come from network communications, not SQLite.

Network is not an issue for us. :) All the processing happens inside the server, 
including the GUI for users, which is served remotely through a specific 
protocol (it is more or less the same thing as a server serving HTML and a 
browser displaying it, except that what's served is not HTML and the browser is 
a client app dedicated to that task). One hour of typical data-entry work by a 
single person, using a platform-native GUI that makes the whole application 
feel local when it isn't, can weigh one or two hundred KB of network traffic 
in total. Thin. Very thin. Not comparable to remote-desktop solutions: we 
don't transfer pixels but short instructions to draw a window, place a 
control, put this string inside it, and on the reverse channel we get events 
(including updated data) only for the GUI actions that are significant to the 
application logic on the server side.

> My only concern with what you wrote is that you mention 100 tables.  In order 
> to find the table you specify SQLite has to go through a list of (hashed, I 
> think) table names, and going through an average of 50 of them per command 
> can be slow.  If you have many tables with the same columns it's more 
> efficient to merge them into one table with an extra column for the original 
> table-name.  Not only can this speed things up but it makes the database file 
> smaller since each table and index in the database has its own list of pages 
> of data.

Hmm. Indeed this needs further analysis. I'll dig in the code, now that I know 
what to look for.
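For my own notes, a minimal sketch of the merge Simon suggests, using Python's sqlite3 module (the table and column names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Instead of many per-kind tables that share the same columns, one table
# with a discriminator column; the composite primary key keeps lookups
# by (source, id) indexed.
con.execute("""CREATE TABLE documents(
    source  TEXT NOT NULL,      -- the would-be table name
    id      INTEGER NOT NULL,
    payload TEXT,
    PRIMARY KEY (source, id))""")
con.execute("INSERT INTO documents VALUES ('invoices', 1, 'data')")
row = con.execute(
    "SELECT payload FROM documents WHERE source = 'invoices' AND id = 1"
).fetchone()
```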

Thanks a lot Simon.


> On 13 Jan 2016 at 13:36, Dominique Devienne <ddevienne at gmail.com> wrote:
> 
>> The application which would use this schema is a server-side application
>> (quite along the lines described in http://sqlite.org/whentouse.html).
>> We have started experimentation and things look very good, excellent should
>> I say, so the above question is more about which details to supervise,
>> which traps to avoid.  I'm pretty sure there are people here with valuable
>> background with similar datasets.
>> 
> 
> In a server app, it depends what kind of concurrency you want to support.
> Write locks are exclusive, and database wide, and SQLite doesn't implement
> MVCC, so concurrency could be an issue. WAL mode is supposed to be better
> for concurrency too.

That was well understood from the evaluation we did over the last 6 weeks. The 
outcome of our tests:

- each processing task uses its own connection to the database
- each connection has its own private cache
- SQLite is built with SQLITE_THREADSAFE=2 ('MULTITHREAD', not 'SERIALIZED'), 
because we never share a connection (and its dependencies, such as statements) 
between worker threads processing tasks
- the duration of every write transaction is out of the hands of the users; 
all are designed to be as short-lived as possible (we're talking msec for most 
of them), and our server serializes those write transactions
- for now we have opted for WAL mode
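The setup above could be sketched like this with Python's sqlite3 module (the path, names, and the lock standing in for our server-side serialization are all hypothetical):

```python
import sqlite3
import threading
import tempfile
import os

write_lock = threading.Lock()   # stands in for the server's own serialization

def task_connection(path):
    # One private connection (with its own page cache) per processing task.
    con = sqlite3.connect(path)
    con.execute("PRAGMA journal_mode=WAL")     # WAL mode, as chosen above
    return con

def short_write(con, sql, params=()):
    # Writes are serialized server-side and kept as short as possible.
    with write_lock:
        with con:               # BEGIN ... COMMIT (ROLLBACK on exception)
            con.execute(sql, params)

path = os.path.join(tempfile.mkdtemp(), "app.db")
con = task_connection(path)
con.execute("CREATE TABLE jobs(id INTEGER PRIMARY KEY, state TEXT)")
short_write(con, "INSERT INTO jobs(state) VALUES (?)", ("done",))
```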

Some kind of MVCC is very interesting to us for the purpose of running read 
transactions which would see a stable view of the data, not seeing at all the 
writes committed after the start of the read transactions. If I'm not 
misinterpreting http://www.sqlite.org/wal.html this is what SQLite gives us:

"When a read operation begins on a WAL-mode database, it first remembers the 
location of the last valid commit record in the WAL. Call this point the "end 
mark". Because the WAL can be growing and adding new commit records while 
various readers connect to the database, each reader can potentially have its 
own end mark. But for any particular reader, the end mark is unchanged for the 
duration of the transaction, thus ensuring that a single read transaction only 
sees the database content as it existed at a single point in time."

We have assumed that for this to be meaningful we would have to BEGIN 
TRANSACTION when starting such a sequence of read operations (possibly 
multiple SELECTs), and either ROLLBACK or COMMIT when done. We still need to 
understand whether it is better to end a transaction which made only reads 
with a COMMIT or a ROLLBACK. In some other engines it is marginally better to 
ask for a COMMIT (which has nothing to do) than for a ROLLBACK.
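A small experiment, assuming Python's sqlite3 module, that illustrates the end-mark behaviour quoted above (all names hypothetical; note that a deferred BEGIN only pins the snapshot at the first actual read):

```python
import sqlite3
import tempfile
import os

path = os.path.join(tempfile.mkdtemp(), "snap.db")
w = sqlite3.connect(path, isolation_level=None)   # autocommit writer
w.execute("PRAGMA journal_mode=WAL")
w.execute("CREATE TABLE t(x)")
w.execute("INSERT INTO t VALUES (1)")

r = sqlite3.connect(path, isolation_level=None)
r.execute("BEGIN")                        # deferred: the snapshot is pinned...
before = r.execute("SELECT count(*) FROM t").fetchone()[0]  # ...by this read
w.execute("INSERT INTO t VALUES (2)")     # committed after the read txn began
during = r.execute("SELECT count(*) FROM t").fetchone()[0]  # still the old view
r.execute("COMMIT")                       # read-only: COMMIT or ROLLBACK, same
after = r.execute("SELECT count(*) FROM t").fetchone()[0]   # new txn, new view
```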

> Another gotcha is related to large blobs, if you need partial update of
> them. SQLite must rewrite the whole blob. May not apply to you though. Same
> issue with alter table add column if you have large blobs, SQLite must
> rewrite the rows (and the blobs), which can be very expensive. Typical
> work-around (hack IMHO) is to put your blobs in separate aux tables.

Most of the blobs we handle are archived PDFs of documents (invoices, orders, 
the like). There are always exceptions of course, but most blobs, once 
inserted, will never be updated again. Some will be replaced by new versions 
(some pictures, for instance). Partial update of a blob's content is not on 
the agenda. What's more, we currently think we will indeed use two databases: 
one storing the classic data, and one storing those archived scans. The idea 
is to push the page size of the blob db to something like 16K while probably 
using the default 1K (maybe 2K) for the ordinary-data db. No need for 
cross-database foreign keys (which don't exist anyway); an integer key is 
quite enough to link them.
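A sketch of that two-database plan with Python's sqlite3 module (file names, page sizes, and keys are illustrative only; page_size must be set before the database is first populated):

```python
import sqlite3
import tempfile
import os

d = tempfile.mkdtemp()
main_path = os.path.join(d, "data.db")
blob_path = os.path.join(d, "blobs.db")

# Blob database: large pages for the archived PDFs.
blobs = sqlite3.connect(blob_path)
blobs.execute("PRAGMA page_size=16384")
blobs.execute("CREATE TABLE archive(doc_id INTEGER PRIMARY KEY, pdf BLOB)")
blobs.close()

# Ordinary-data database: small pages, plus the blob db attached.
con = sqlite3.connect(main_path)
con.execute("PRAGMA page_size=2048")
con.execute("CREATE TABLE invoices(id INTEGER PRIMARY KEY, doc_id INTEGER)")
con.execute("ATTACH DATABASE ? AS blobdb", (blob_path,))

# No cross-database foreign key: invoices.doc_id is just an integer that
# happens to match blobdb.archive.doc_id.
con.execute("INSERT INTO blobdb.archive VALUES (1, x'255044462D')")  # '%PDF-'
con.execute("INSERT INTO invoices VALUES (10, 1)")
row = con.execute("""SELECT pdf FROM invoices
                     JOIN blobdb.archive USING (doc_id)
                     WHERE invoices.id = 10""").fetchone()
```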

> There are other small gotchas (untyped / dynamically typed columns, legacy
> non-standard behaviors, etc...) but the above are the two things that come
> to mind given what you describe. --DD

On the contrary, untyped / dynamically typed columns are seen as a *great* 
benefit for us. It's a refreshing concept which maps so well onto the rest of 
our programming! :)

Thanks a lot too, Dominique.

> On 13 Jan 2016 at 15:22, Paul <devgs at ukr.net> wrote:
> 
> If you are ever going to use ANALYZE on your database, and the database is 
> going to be opened frequently (like once per request), consider dropping the 
> sqlite_stat3 and sqlite_stat4 tables.
> 
> SQLite reads the content of those tables on each open, and the number of 
> tables greatly contributes to the amount of data stored in there. For me, 
> with under 30 tables, the penalty of keeping sqlite_stat3 and sqlite_stat4 
> after ANALYZE amounts to some 5 milliseconds of overhead on each open.

Hmm. Here too I will have to investigate, though we're not in a disconnected 
model like a web server / web browser. When a user enters the GUI and starts 
working, he or she will do so for minutes or hours (unlike successive web 
requests that each live for only msec). We can open a connection for this user 
and let him or her use it for the duration of the work. All statements will be 
properly closed while the user is connected but otherwise idle, leaving the 
connection open, but without contention against any other.
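If we do end up running ANALYZE, Paul's suggestion might look like this with Python's sqlite3 module (the schema is made up; IF EXISTS covers builds where stat3/stat4 were never compiled in):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
con.execute("CREATE INDEX t_x ON t(x)")
con.execute("ANALYZE")   # fills sqlite_stat1 (and stat3/stat4 if enabled)

# Keep the compact sqlite_stat1 but drop the bulkier histogram tables,
# so frequent opens don't pay the cost of reloading them.
con.execute("DROP TABLE IF EXISTS sqlite_stat3")
con.execute("DROP TABLE IF EXISTS sqlite_stat4")
stats = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE name LIKE 'sqlite_stat%'")]
```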

Thanks too Paul,

-- 
Meilleures salutations, Met vriendelijke groeten, Best Regards,
Olivier Mascia, integral.be/om

