Re: [sqlite] Concurrency, MVCC

2004-04-16 Thread Christian Smith
On Thu, 15 Apr 2004, Doug Currie wrote:

>
>Thursday, April 15, 2004, 9:16:01 AM, Christian Smith wrote:
>
>> On Wed, 14 Apr 2004, Doug Currie wrote:
>
>>>One way to get table level locking without a great deal of pain is to
>>>integrate the shadow paging ideas with BTree management. Rather than
>>>using page tables for the shadow pages, use the BTrees themselves.
>>>This means that any change to a BTree requires changes along the
>>>entire path back to the root so that only free pages are used to store
>>>new data, including the BTree itself. Writing the root page(s) of the
>>>BTree(s) commits the changes to that table (these tables).
>
>> Actually, this gets my vote. Keeps the pager layer the same,
>
>The pager gets *much* simpler because it doesn't need to make a log
>file. The log file is not necessary because writes only go to free
>pages.
>
>Well, there would be one write-ahead log. It's needed to prevent
>partial updates to the page number pointers to the root page(s) of
>the BTree(s) at commit. This log is created at commit time, and is
>much simpler and much smaller than the present log file.

I'd have thought it'd be better to preserve the pager layer as is. If it
ain't broke...


> [...]
>
>This design works well. It has the advantage (compared with shadow
>pager) that reads are not burdened with page table indirection. It has
>the potential disadvantage (compared with SQLite 2.8) that small
>writes can modify several pages (based on the depth of the BTree).

So for reads there is basically no extra burden (other than the caching
of the initial tree roots), and writing will be slightly slower, but with
a penalty that decreases as updates get bigger and is probably
insignificant next to the dumping of the page cache when a transaction
finishes. All of this happens in parallel with reads, so overall
performance should improve in many scenarios.

It would of course be limited, like shadow paging, to a single address
space (writes would block reads in other address spaces.)

>
>I used this design in a proprietary database in the late 1980s. The
>only reason I didn't consider modifying SQLite this way up until now
>is that I was anticipating BTree changes for 3.0, so I confined my
>efforts to the pager layer.

Given this design, if it is adopted, it would also be trivial (and free in
terms of I/O) to maintain a running total of records in a given btree, as
was requested some weeks back, since any new or deleted record updates the
btree path up to the root anyway.

Is this design feasible given the time constraints on 3.0? I've not
studied the btree layer in much detail, so don't know how much existing
code would need to change.

Christian

-- 
/"\
\ /ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
 X   - AGAINST MS ATTACHMENTS
/ \

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Mark D. Anderson

On Thu, 15 Apr 2004 20:16:32 -0400, "Doug Currie" <[EMAIL PROTECTED]> said:

> I used this design in a proprietary database in the late 1980s. The
> only reason I didn't consider modifying SQLite this way up until now
> is that I was anticipating BTree changes for 3.0, so I confined my
> efforts to the pager layer.

btw, another example of this class of approach, with a bsd-style
license, is GigaBASE from the prolific Konstantin Knizhnik:
   http://www.garret.ru/~knizhnik/gigabase/GigaBASE.htm

It does not offer anything approaching full SQL.
It does however have several features not available in sqlite:
- online backup [1]
- master-slave replication
- group commit [2]
- parallel query (multiple threads for full table scans)

[1] There is a kind of support in sqlite for online backup, via
   echo '.dump' | sqlite ex1 > backup.sql
though this results in a largish file and blocks
everything else.

[2] Grouping commits is a mechanism that allows pending transactions
to be fsync'd together.
This allows for greater performance with a risk only of losing
some transactions (at most the size of the group), but no
greater risk of a corrupted database.
This is more flexible than sqlite's big knob of OFF/NORMAL/FULL.
It is also offered by DB2, Oracle, and MySQL.



In idle moments I've toyed with what it would take to splice
GigaBASE with the query parser and planner from
lambda-db or sqlite.
But then I wake up...

-mda




Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Doug Currie

Thursday, April 15, 2004, 9:16:01 AM, Christian Smith wrote:

> On Wed, 14 Apr 2004, Doug Currie wrote:

>>One way to get table level locking without a great deal of pain is to
>>integrate the shadow paging ideas with BTree management. Rather than
>>using page tables for the shadow pages, use the BTrees themselves.
>>This means that any change to a BTree requires changes along the
>>entire path back to the root so that only free pages are used to store
>>new data, including the BTree itself. Writing the root page(s) of the
>>BTree(s) commits the changes to that table (these tables).

> Actually, this gets my vote. Keeps the pager layer the same,

The pager gets *much* simpler because it doesn't need to make a log
file. The log file is not necessary because writes only go to free
pages.

Well, there would be one write-ahead log. It's needed to prevent
partial updates to the page number pointers to the root page(s) of
the BTree(s) at commit. This log is created at commit time, and is
much simpler and much smaller than the present log file.

> and only requires a cache of the root btree for each object
> (table/index) in the database to be maintained on a per-transaction
> basis

Yes, you need to cache the page number of each BTree root at
transaction start.

You'd also need a forest of free pages organized by transaction so
they can be freed at the right time (when the oldest read-transaction
that can reference them has completed).

> , reducing the complications of what to do under memory pressure
> when pages are spilled from the cache as we should be able to keep
> them in memory all the time.

Yes.

> Committing a transaction would then be an atomic update of the root
> btree page number in the catalog table.

Yes, atomically for all the BTrees modified. This is probably a single
page of data (4 to 8 bytes of root page number per BTree, i.e., per
table and per index). Well, I usually assume fairly large pages
compared with SQLite's default of 1K. Using larger pages also
decreases the depth of the BTree which reduces the number of pages
written.

This design works well. It has the advantage (compared with shadow
pager) that reads are not burdened with page table indirection. It has
the potential disadvantage (compared with SQLite 2.8) that small
writes can modify several pages (based on the depth of the BTree).

I used this design in a proprietary database in the late 1980s. The
only reason I didn't consider modifying SQLite this way up until now
is that I was anticipating BTree changes for 3.0, so I confined my
efforts to the pager layer.

e





Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Christian Smith
On Thu, 15 Apr 2004, Andrew Piskorski wrote:

>On Thu, Apr 15, 2004 at 02:16:01PM +0100, Christian Smith wrote:
>> Right tool for the job. Multiple writers has client/server database
>> written all over it. KISS.
>
>No, not true, at least not when the multiple writers are all threads
>within one single process, which appears to be the common case for
>people who'd like greater concurrency in SQLite.

SQLite is an embedded database engine, which I guess is its main use
(I'm using it at my place to implement a package management tool), and
the concurrency desired, from what I gather on the list, is read access
while writing, not multiple writers.

>
>Also, if multiple writers worked well for the one-process many-threads
>case, then if you wished you could write a small multi-threaded
>client/server database using SQLite as the underlying storage engine.
>As things stand now, the concurrency limitations mean there isn't much
>point to doing that.

But why re-invent the wheel? There are plenty of client/server databases
that support concurrent writes.

>
>Simplicity however, is of course an important concern.

Absolutely, that's my main point. I'm all for concurrent
readers/writer, but multiple writers is a whole new ball game, probably
best left to the big boys.

Cheers,
Christian





Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Andrew Piskorski
On Thu, Apr 15, 2004 at 02:16:01PM +0100, Christian Smith wrote:
> Right tool for the job. Multiple writers has client/server database
> written all over it. KISS.

No, not true, at least not when the multiple writers are all threads
within one single process, which appears to be the common case for
people who'd like greater concurrency in SQLite.

Also, if multiple writers worked well for the one-process many-threads
case, then if you wished you could write a small multi-threaded
client/server database using SQLite as the underlying storage engine.
As things stand now, the concurrency limitations mean there isn't much
point to doing that.

Simplicity however, is of course an important concern.

-- 
Andrew Piskorski <[EMAIL PROTECTED]>
http://www.piskorski.com/




Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Christian Smith
On Wed, 14 Apr 2004, Doug Currie wrote:

>Wednesday, April 14, 2004, 1:16:54 AM, Andrew Piskorski wrote:
>
>> How could this be extended to support table locking and PostgreSQL's
>> default "read committed" isolation level? Would the smallest locking
>> granularity possible in Currie's design be one page of rows, however
>> many rows that happens to be?
>
>Things get *much* more complicated once you have multiple simultaneous
>write transactions. I didn't want to go there.

Right tool for the job. Multiple writers has client/server database
written all over it. KISS.

>
>One way to get table level locking without a great deal of pain is to
>integrate the shadow paging ideas with BTree management. Rather than
>using page tables for the shadow pages, use the BTrees themselves.
>This means that any change to a BTree requires changes along the
>entire path back to the root so that only free pages are used to store
>new data, including the BTree itself. Writing the root page(s) of the
>BTree(s) commits the changes to that table (these tables).

Actually, this gets my vote. Keeps the pager layer the same, and only
requires a cache of the root btree for each object (table/index) in the
database to be maintained on a per-transaction basis, reducing the
complications of what to do under memory pressure when pages are spilled
from the cache as we should be able to keep them in memory all the time.

Committing a transaction would then be an atomic update of the root btree
page number in the catalog table.

Christian





Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Mark D. Anderson

On Wed, 14 Apr 2004 08:13:39 -0400, "D. Richard Hipp" <[EMAIL PROTECTED]>
said:

>* Support for atomic commits of multi-database transactions,
>  which gives you a limited kind of table-level locking,
>  assuming you are willing to put each table in a separate
>  database.

and also a limited form of concurrent writers as a consequence, right?
Assuming that table locks are acquired in a consistent order to avoid
deadlock, there could be concurrent writers that do not touch the same
tables (in this database-per-table model).

btw, what about offering better behavior around throwing away cache
pages? One approach would be something like a commit_begin() function,
which is offered by some rdbms native apis. It says "commit what I've
done, but at the same time attempt to acquire the write lock."
Failure to "win" and actually retain the write lock might not be
reported -- the idea is that the application can at least indicate
its desire.

This could also be done as some sort of connection option.

So in the case where a single writer is keeping up with all requests,
it can do so efficiently without throwing away its pages.

-mda




Re: [sqlite] Concurrency, MVCC

2004-04-14 Thread Mark D. Anderson
Wednesday, April 14, 2004, 1:16:54 AM, Andrew Piskorski wrote:
> as far as I can tell, it seems to be describing a system with
> the usual Oracle/PostgreSQL MVCC semantics, EXCEPT of course that
> Currie proposes that each Write transaction must take a lock on the
> database as a whole.

Well, I suppose from a sufficient distance they look alike,
but in practice MVCC and shadow paging are rather different.

In MVCC, each row typically has two hidden fields identifying the first
and last transaction ids for which the row is relevant.
The last transaction id is used to skip rows that have been deleted.
There are many variants of MVCC, but you get the idea.

Any reader (or writer) knows its own transaction id, and simply
ignores rows that are not applicable.

A "vacuum" process is necessary to periodically reclaim space
taken by rows whose last transaction id is lower than any live
transaction.

In shadow paging, the basic idea is that any reader or writer
gets a view onto the data based on reachability from "pointers"
in a particular root block. Pages that are reachable from any
live root block are never modified. A vacuum process is required 
to collect the space from blocks that are no longer reachable.
Updates to indexes must be treated in roughly the same way as
data pages, because they contain pointers to different data.

Shadow paging can be used for a table-based database, or
a persistent object store.
It certainly is much older than the HUT work; see for example Lorie 77,
"Physical Integrity in a Large Segmented Database."
It falls into the general class of logless transaction systems,
as opposed to the log-based approach that predominates in
current day non-academic database implementations.

-mda




Re: [sqlite] Concurrency, MVCC

2004-04-14 Thread Doug Currie
> D. Richard Hipp wrote:
>> 
>> My thoughts on BlueSky have been added to the wiki page:
>>http://www.sqlite.org/cvstrac/wiki?p=BlueSky

I added some responses; I do not agree with Richard's concerns about
Shadow Paging, and I corrected some mistaken conclusions. I apologize
if my paper was not clear enough in these areas.

Thank you, Richard, for taking the time to review the Shadow Paging
option.

Regards,

e





Re: [sqlite] Concurrency, MVCC

2004-04-14 Thread D. Richard Hipp
D. Richard Hipp wrote:
My thoughts on BlueSky have been added to the wiki page:

  http://www.sqlite.org/wiki?p=BlueSky

That URL should have been:

   http://www.sqlite.org/cvstrac/wiki?p=BlueSky

Left out the "cvstrac".  Sorry for the confusion.

--
D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565


Re: [sqlite] Concurrency, MVCC

2004-04-14 Thread D. Richard Hipp
Andrew Piskorski wrote:
How feasible would it be to add support for higher concurrency to
SQLite, especially via MVCC?  
My thoughts on BlueSky have been added to the wiki page:

  http://www.sqlite.org/wiki?p=BlueSky

The current plan for version 3.0 is as follows:

  * Support for a modification of the Carlyle method for
allowing writes to begin while reads are still pending.
All reads must finish before the write commits, however.
  * Support for atomic commits of multi-database transactions,
which gives you a limited kind of table-level locking,
assuming you are willing to put each table in a separate
database.
Business constraints require that version 3.0 be working
no later than May 31.  So if you have any alternative
suggestions, you should get them in quickly.


[sqlite] Concurrency, MVCC

2004-04-13 Thread Andrew Piskorski
How feasible would it be to add support for higher concurrency to
SQLite, especially via MVCC?  (If it turns out to be feasible and
desirable, I'd be willing to work on it in my free time.)

What I would really like in SQLite would be:
- Good concurrency, preferably similar to or better than that of nsv/tsv
  (see below).
- MVCC (multi-version concurrency control) model preferred, as in
  PostgreSQL and Oracle.

How feasible is that?  What designs have been or should be considered
for it?  How much work does this seem to be?  Does it become any
easier if it's considered for in-memory databases only, rather than
both in-memory and on-disk?

After some searching around I found these previous proposals or
discussions of increasing concurrency in SQLite:

  http://www.sqlite.org/concurrency.html

  http://www.sqlite.org/cvstrac/wiki?p=BlueSky
  http://www.sqlite.org/cvstrac/attach_get/100/shadow.pdf
  http://citeseer.ist.psu.edu/131901.html

  
http://www.mail-archive.com/cgi-bin/htsearch?config=sqlite-users_sqlite_org===concurrency

Many of the suggestions in Dr. Hipp's 2003-11-22 concurrency draft
sound very useful, especially blocking (rather than polling) locks.
(In fact, I was surprised to learn that SQLite does NOT block on locks
now.)  But, that draft never mentions MVCC at all.  Why not?  Since
MVCC both gives better concurrency and is friendlier to use than
pessimistic locking models, I was surprised not to at least see it
mentioned.  Is an MVCC implementation thought to be too complicated?
Or?

Doug Currie's <[EMAIL PROTECTED]> "Shadow Paging" design sounds
promising.  Unfortunately, I have not been able to download the
referenced papers at all (where can I get them?), but as far as I can
tell, it seems to be describing a system with the usual
Oracle/PostgreSQL MVCC semantics, EXCEPT of course that Currie
proposes that each Write transaction must take a lock on the database
as a whole.

But, other than the locking granularity, in what way is Currie's
Shadow Paging design the same or different from PostgreSQL's MVCC
implementation, both in terms of user-visible transaction semantics,
and the underlying implementation?

I believe PostgreSQL basically marks each row with a transaction id,
and keeps track of whether each transaction id is in progress,
committed, or aborted.  Here are a few links about that:

  http://developer.postgresql.org/pdf/transactions.pdf
  http://openacs.org/forums/message-view?message_id=176198

Since Currie's design has only one db-wide write lock, it is
semantically equivalent to PostgreSQL's "serializable" isolation
level, correct?  How could this be extended to support table locking
and PostgreSQL's default "read committed" isolation level?  Would the
smallest locking granularity possible in Currie's design be one page
of rows, however many rows that happens to be?

The one process, many threads aspect of Currie's design sounds just
fine to me.  The one write lock for the whole database, on the other
hand, could be quite limiting.  How much more difficult would it be to
add table locks to the design?  It would also be a nice bonus if the
design at least contemplates how to add row locks (or locks for pages
of rows) in the future, but my guess is that table locks would be good
enough in practice.

Currie's design also seems to defer writing any data to disk until the
transaction commits, which seems odd to me.  I did not follow many of
the details of that design so I'm probably missing something here, but
since most write transactions commit rather than abort, in any sort of
MVCC model wouldn't it be better to write data to disk earlier rather
than later?  I'm pretty sure that's what both Oracle and PostgreSQL
do.


My particular interest in SQLite:


Now that I've asked lots of questions above, I'll describe some of the
real-world use cases that got me thinking about this, in case it helps
clarify how and why I'm interested in a high-concurrency SQLite:

A while back I wrote a high-performance multi-threaded application
that basically just accepted data requests, used various ugly low
level proprietary C APIs to fetch data from remote servers, and then
organized the data and fed it back to the client application as simple
rows of CSV-style text.

Using all those low-level C APIs meant tracking lots of somewhat
complicated housekeeping data.  If my application ever crashed all
housekeeping data instantly became worthless anyway, so I definitely
didn't want to store it persistently in Oracle or PostgreSQL; for both
simplicity and performance, I wanted it in-memory only.

So in my case, I used AOLserver's nsv's for this.  (The Tcl Thread
Extension has "tsv" which is just like "nsv", except better.)  Nsvs
are basically just associative arrays for storing key/value pairs
(like Tcl arrays or Perl hashes), but nsv operations are both atomic
and highly concurrent, as they were designed for inter-thread
communication in AOLserver.  (Mutex locking is