Re: [sqlite] Concurrency, MVCC
On Thu, 15 Apr 2004, Doug Currie wrote:

>Thursday, April 15, 2004, 9:16:01 AM, Christian Smith wrote:
>
>> On Wed, 14 Apr 2004, Doug Currie wrote:
>
>>>One way to get table level locking without a great deal of pain is to
>>>integrate the shadow paging ideas with BTree management. Rather than
>>>using page tables for the shadow pages, use the BTrees themselves.
>>>This means that any change to a BTree requires changes along the
>>>entire path back to the root so that only free pages are used to store
>>>new data, including the BTree itself. Writing the root page(s) of the
>>>BTree(s) commits the changes to that table (these tables).
>
>> Actually, this gets my vote. Keeps the pager layer the same,
>
>The pager gets *much* simpler because it doesn't need to make a log
>file. The log file is not necessary because writes only go to free
>pages.
>
>Well, there would be one write-ahead log. It's needed to prevent
>partial updates to the page number pointers to the root page(s) of
>the BTree(s) at commit. This log is created at commit time, and is
>much simpler and much smaller than the present log file.

I'd have thought it'd be better to preserve the pager layer as is. If
it ain't broke...

> [...]
>
>This design works well. It has the advantage (compared with shadow
>pager) that reads are not burdened with page table indirection. It has
>the potential disadvantage (compared with SQLite 2.8) that small
>writes can modify several pages (based on the depth of the BTree).

So for reads there is basically no extra burden (other than the
caching of the initial tree roots), and writing will be slightly
slower, but with a decreasing penalty as updates get bigger, probably
insignificant against the dumping of the page cache when transactions
finish, and all of course in parallel with reads. So overall
performance should improve in many scenarios.
It would of course be limited, like shadow paging, to a single address
space (writes would block reads in other address spaces).

>I used this design in a proprietary database in the late 1980s. The
>only reason I didn't consider modifying SQLite this way up until now
>is that I was anticipating BTree changes for 3.0, so I confined my
>efforts to the pager layer.

Given this design, if it is adopted, it would also be trivial (and
free in terms of IO) to maintain a running total of records in a given
btree, as was requested some weeks back, since any new/deleted records
would update the btree path to the root anyway.

Is this design feasible given the time constraints on 3.0? I've not
studied the btree layer in much detail, so I don't know how much
existing code would need to change.

Christian

--
 /"\
 \ /  ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
  X   - AGAINST MS ATTACHMENTS
 / \

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
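If it helps make the proposal concrete, here is a toy sketch in Python of the copy-on-write BTree idea being discussed. All names here are invented for illustration; this is not SQLite code. Every insert allocates fresh pages along the path from leaf to root, and commit is just publishing the new root page number, so a reader holding the old root page number is never disturbed.

```python
import bisect

class ShadowBTree:
    """Pages live in a dict keyed by page number; the committed state
    of the tree is nothing but a root page number."""

    def __init__(self):
        self.pages = {}
        self.next_page = 1
        self.root = self._alloc({"leaf": True, "items": {}})

    def _alloc(self, node):
        # New data only ever goes to a fresh (free) page.
        n, self.next_page = self.next_page, self.next_page + 1
        self.pages[n] = node
        return n

    def insert(self, key, value):
        """Copy-on-write insert. Returns a new root page number;
        self.root (the committed root) is untouched until commit."""
        return self._cow(self.root, key, value)

    def _cow(self, page_no, key, value):
        node = self.pages[page_no]
        if node["leaf"]:
            items = dict(node["items"])
            items[key] = value
            return self._alloc({"leaf": True, "items": items})
        # Interior node: descend into the child covering `key`, then
        # rewrite this node to point at the child's new copy -- this is
        # the "changes along the entire path back to the root".
        i = bisect.bisect_right(node["keys"], key)
        children = list(node["children"])
        children[i] = self._cow(children[i], key, value)
        return self._alloc({"leaf": False,
                            "keys": node["keys"],
                            "children": children})

    def commit(self, new_root):
        # The whole transaction becomes visible by writing one number.
        self.root = new_root

    def lookup(self, key, root=None):
        node = self.pages[self.root if root is None else root]
        while not node["leaf"]:
            i = bisect.bisect_right(node["keys"], key)
            node = self.pages[node["children"][i]]
        return node["items"].get(key)
```

A running record count per btree, as suggested above, would just be one more field copied and bumped in each rewritten root.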
Re: [sqlite] Concurrency, MVCC
On Thu, 15 Apr 2004 20:16:32 -0400, "Doug Currie" <[EMAIL PROTECTED]> said:

> I used this design in a proprietary database in the late 1980s. The
> only reason I didn't consider modifying SQLite this way up until now
> is that I was anticipating BTree changes for 3.0, so I confined my
> efforts to the pager layer.

btw, another example of this class of approach, with a BSD-style
license, is GigaBASE from the prolific Konstantin Knizhnik:

http://www.garret.ru/~knizhnik/gigabase/GigaBASE.htm

It does not offer anything approaching full SQL. It does however have
several features not available in sqlite:

- online backup [1]
- master-slave replication
- group commit [2]
- parallel query (multiple threads for full table scans)

[1] There is a kind of support in sqlite for online backup, via

      echo '.dump' | sqlite ex1 > backup.sql

though this results in a largish file and blocks everything else.

[2] Group commit is a mechanism that allows pending transactions to be
fsync'd together. This allows greater performance with a risk only of
losing some transactions (at most the size of the group), but no
greater risk of a corrupted database. This is more flexibility than
sqlite's big knob of OFF/NORMAL/FULL. It is also offered by DB2,
Oracle, and MySQL.

In idle moments I've toyed with what it would take to splice GigaBASE
with the query parser and planner from lambda-db or sqlite. But then I
wake up.

-mda
Re: [sqlite] Concurrency, MVCC
Thursday, April 15, 2004, 9:16:01 AM, Christian Smith wrote:

> On Wed, 14 Apr 2004, Doug Currie wrote:

>>One way to get table level locking without a great deal of pain is to
>>integrate the shadow paging ideas with BTree management. Rather than
>>using page tables for the shadow pages, use the BTrees themselves.
>>This means that any change to a BTree requires changes along the
>>entire path back to the root so that only free pages are used to store
>>new data, including the BTree itself. Writing the root page(s) of the
>>BTree(s) commits the changes to that table (these tables).

> Actually, this gets my vote. Keeps the pager layer the same,

The pager gets *much* simpler because it doesn't need to make a log
file. The log file is not necessary because writes only go to free
pages.

Well, there would be one write-ahead log. It's needed to prevent
partial updates to the page number pointers to the root page(s) of
the BTree(s) at commit. This log is created at commit time, and is
much simpler and much smaller than the present log file.

> and only requires a cache of the root btree for each object
> (table/index) in the database to be maintained on a per-transaction
> basis

Yes, you need to cache the page number of each BTree root at
transaction start. You'd also need a forest of free pages organized by
transaction so they can be freed at the right time (when the oldest
read-transaction that can reference them has completed).

> , reducing the complications of what to do under memory pressure
> when pages are spilled from the cache as we should be able to keep
> them in memory all the time.

Yes.

> Committing of a transaction would then be an atomic update of the
> root btree page number in the catalog table.

Yes, atomically for all the BTrees modified. This is probably a single
page of data (4 to 8 bytes of root page number per BTree, i.e., per
table and per index). Well, I usually assume fairly large pages
compared with SQLite's default of 1K.
Using larger pages also decreases the depth of the BTree, which
reduces the number of pages written.

This design works well. It has the advantage (compared with shadow
pager) that reads are not burdened with page table indirection. It has
the potential disadvantage (compared with SQLite 2.8) that small
writes can modify several pages (based on the depth of the BTree).

I used this design in a proprietary database in the late 1980s. The
only reason I didn't consider modifying SQLite this way up until now
is that I was anticipating BTree changes for 3.0, so I confined my
efforts to the pager layer.

e
Re: [sqlite] Concurrency, MVCC
On Thu, Apr 15, 2004 at 02:16:01PM +0100, Christian Smith wrote:

> Right tool for the job. Multiple writers has client/server database
> written all over it. KISS.

No, not true, at least not when the multiple writers are all threads
within one single process, which appears to be the common case for
people who'd like greater concurrency in SQLite.

Also, if multiple writers worked well for the one-process many-threads
case, then if you wished you could write a small multi-threaded
client/server database using SQLite as the underlying storage engine.
As things stand now, the concurrency limitations mean there isn't much
point to doing that.

Simplicity, however, is of course an important concern.

--
Andrew Piskorski <[EMAIL PROTECTED]>
http://www.piskorski.com/
Re: [sqlite] Concurrency, MVCC
On Wed, 14 Apr 2004, Doug Currie wrote:

>Wednesday, April 14, 2004, 1:16:54 AM, Andrew Piskorski wrote:
>
>> How could this be extended to support table locking and PostgreSQL's
>> default "read committed" isolation level? Would the smallest locking
>> granularity possible in Currie's design be one page of rows, however
>> many rows that happens to be?
>
>Things get *much* more complicated once you have multiple simultaneous
>write transactions. I didn't want to go there.

Right tool for the job. Multiple writers has client/server database
written all over it. KISS.

>One way to get table level locking without a great deal of pain is to
>integrate the shadow paging ideas with BTree management. Rather than
>using page tables for the shadow pages, use the BTrees themselves.
>This means that any change to a BTree requires changes along the
>entire path back to the root so that only free pages are used to store
>new data, including the BTree itself. Writing the root page(s) of the
>BTree(s) commits the changes to that table (these tables).

Actually, this gets my vote. Keeps the pager layer the same, and only
requires a cache of the root btree for each object (table/index) in
the database to be maintained on a per-transaction basis, reducing the
complications of what to do under memory pressure when pages are
spilled from the cache, as we should be able to keep them in memory
all the time.

Committing of a transaction would then be an atomic update of the root
btree page number in the catalog table.

Christian

--
 /"\
 \ /  ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
  X   - AGAINST MS ATTACHMENTS
 / \
Re: [sqlite] Concurrency, MVCC
On Wed, 14 Apr 2004 08:13:39 -0400, "D. Richard Hipp" <[EMAIL PROTECTED]> said:

>* Support for atomic commits of multi-database transactions,
>  which gives you a limited kind of table-level locking,
>  assuming you are willing to put each table in a separate
>  database.

and also a limited form of concurrent writers, as a consequence,
right? assuming that table locks are acquired in a consistent order
to avoid deadlock, there could be concurrent writers that do not touch
the same tables (in this database-per-table model).

btw, what about offering better behavior about throwing away cache
pages? one approach would be something like a commit_begin() function
which is offered by some rdbms native apis. It says "commit what i've
done, but at the same time attempt to acquire the write lock". Failure
to "win" and actually be able to retain the write lock might not be
reported -- the idea is that the application can at least indicate its
desire. This could also be done as some sort of connection option.

So in the case that a single writer is keeping up with all requests,
it can do so efficiently without throwing away its pages.

-mda
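The commit_begin() idea might look something like the following toy Python sketch. WriterConn and everything in it are hypothetical names for illustration, not a real SQLite (or any rdbms) API; the point is only the lock-retention behavior: a plain commit gives up the write lock and must discard its page cache, while commit_begin() commits but keeps the lock, so the cache stays valid.

```python
import threading

class WriterConn:
    """Toy connection with a shared write lock and a private page cache."""

    write_lock = threading.Lock()   # one writer at a time, process-wide

    def __init__(self):
        self.page_cache = {}        # page number -> page contents
        self.flushed = {}           # stand-in for pages written to disk
        self.holds_write_lock = False

    def begin(self):
        WriterConn.write_lock.acquire()
        self.holds_write_lock = True

    def commit(self):
        # Plain commit: flush, release the lock, and throw away the
        # cache, since another writer may now change any page.
        self.flushed.update(self.page_cache)
        self.page_cache.clear()
        WriterConn.write_lock.release()
        self.holds_write_lock = False

    def commit_begin(self):
        # Commit but retain the write lock: the changes are flushed,
        # yet no other writer can intervene, so the page cache stays
        # valid and need not be discarded.
        self.flushed.update(self.page_cache)
        return self.holds_write_lock   # True if we kept the lock
```

A busy single writer calling commit_begin() in a loop never re-reads pages it already has, which is the efficiency win described above.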
Re: [sqlite] Concurrency, MVCC
Wednesday, April 14, 2004, 1:16:54 AM, Andrew Piskorski wrote:

> as far as I can tell, it seems to be describing a system with
> the usual Oracle/PostgreSQL MVCC semantics, EXCEPT of course that
> Currie proposes that each Write transaction must take a lock on the
> database as a whole.

Well, I suppose from a sufficient distance they look alike, but in
practice MVCC and shadow paging are rather different.

In MVCC, each row typically has two hidden fields identifying the
first and last transaction ids for which the row is relevant. The
last transaction id is to skip rows that are deleted. There are many
variants of MVCC, but you get the idea. Any reader (or writer) knows
its own transaction id, and just ignores rows that are not applicable.
A "vacuum" process is necessary to periodically reclaim space taken by
rows whose last transaction id is lower than any live transaction.

In shadow paging, the basic idea is that any reader or writer gets a
view onto the data based on reachability from "pointers" in a
particular root block. Pages that are reachable from any live root
block are never modified. A vacuum process is required to collect the
space from blocks that are no longer reachable. Updates to indexes
must be treated in roughly the same way as data pages, because they
contain pointers to different data. Shadow paging can be used for a
table-based database, or a persistent object store. It certainly is
much older than the HUT work; see for example Lorie 77, "Physical
Integrity in a Large Segmented Database." It falls into the general
class of logless transaction systems, as opposed to the log-based
approach that predominates in current day non-academic database
implementations.

-mda
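The row-visibility rule described above can be sketched as one small function. This is Python, with the xmin/xmax field names borrowed from PostgreSQL's convention, but it is a simplified illustration, not any system's actual visibility test (real systems compare against a snapshot of in-progress transactions rather than a plain txid ordering).

```python
def row_visible(row, my_txid, committed):
    """A row is visible to transaction `my_txid` if it was created by
    a committed transaction no later than ours, and has not been
    deleted by such a transaction.

    row: dict with "xmin" (creating txid) and "xmax" (deleting txid
         or None); committed: set of committed txids.
    """
    created = row["xmin"] in committed and row["xmin"] <= my_txid
    deleted = (row["xmax"] is not None
               and row["xmax"] in committed
               and row["xmax"] <= my_txid)
    return created and not deleted
```

The vacuum rule follows directly: once every live transaction's txid exceeds a row's committed xmax, no one can ever see that row again and its space can be reclaimed.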
Re: [sqlite] Concurrency, MVCC
> D. Richard Hipp wrote:
>>
>> My thoughts on BlueSky have been added to the wiki page:
>> http://www.sqlite.org/cvstrac/wiki?p=BlueSky

I added some responses; I do not agree with Richard's concerns about
Shadow Paging, and I corrected some mistaken conclusions. I apologize
if my paper was not clear enough in these areas.

Thank you, Richard, for taking the time to review the Shadow Paging
option.

Regards,

e
Re: [sqlite] Concurrency, MVCC
D. Richard Hipp wrote:
> My thoughts on BlueSky have been added to the wiki page:
> http://www.sqlite.org/wiki?p=BlueSky

That URL should have been:

   http://www.sqlite.org/cvstrac/wiki?p=BlueSky

Left out the "cvstrac". Sorry for the confusion.

--
D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565
Re: [sqlite] Concurrency, MVCC
Andrew Piskorski wrote:
> How feasible would it be to add support for higher concurrency to
> SQLite, especially via MVCC?

My thoughts on BlueSky have been added to the wiki page:

   http://www.sqlite.org/wiki?p=BlueSky

The current plan for version 3.0 is as follows:

  * Support for a modification of the Carlyle method for allowing
    writes to begin while reads are still pending. All reads must
    finish before the write commits, however.

  * Support for atomic commits of multi-database transactions,
    which gives you a limited kind of table-level locking,
    assuming you are willing to put each table in a separate
    database.

Business constraints require that version 3.0 be working no later
than May 31. So if you have any alternative suggestions, you should
get them in quickly.

--
D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565
[sqlite] Concurrency, MVCC
How feasible would it be to add support for higher concurrency to
SQLite, especially via MVCC? (If it turns out to be feasible and
desirable, I'd be willing to work on it in my free time.)

What I would really like in SQLite would be:

- Good concurrency, preferably similar to or better than that of
  nsv/tsv (see below).
- MVCC (multi-version concurrency control) model preferred, as in
  PostgreSQL and Oracle.

How feasible is that? What designs have been or should be considered
for it? How much work does this seem to be? Does it become any easier
if it's considered for in-memory databases only, rather than both
in-memory and on-disk?

After some searching around I found these previous proposals or
discussions of increasing concurrency in SQLite:

http://www.sqlite.org/concurrency.html
http://www.sqlite.org/cvstrac/wiki?p=BlueSky
http://www.sqlite.org/cvstrac/attach_get/100/shadow.pdf
http://citeseer.ist.psu.edu/131901.html
http://www.mail-archive.com/cgi-bin/htsearch?config=sqlite-users_sqlite_org===concurrency

Many of the suggestions in Dr. Hipp's 2003-11-22 concurrency draft
sound very useful, especially blocking (rather than polling) locks.
(In fact, I was surprised to learn that SQLite does NOT block on locks
now.) But that draft never mentions MVCC at all. Why not? Since MVCC
both gives better concurrency and is friendlier to use than
pessimistic locking models, I was surprised not to at least see it
mentioned. Is an MVCC implementation thought to be too complicated?
Or?

Doug Currie's <[EMAIL PROTECTED]> "Shadow Paging" design sounds
promising. Unfortunately, I have not been able to download the
referenced papers at all (where can I get them?), but as far as I can
tell, it seems to be describing a system with the usual
Oracle/PostgreSQL MVCC semantics, EXCEPT of course that Currie
proposes that each Write transaction must take a lock on the database
as a whole.
But, other than the locking granularity, in what way is Currie's
Shadow Paging design the same as or different from PostgreSQL's MVCC
implementation, both in terms of user-visible transaction semantics
and the underlying implementation? I believe PostgreSQL basically
marks each row with a transaction id, and keeps track of whether each
transaction id is in progress, committed, or aborted. Here are a few
links about that:

http://developer.postgresql.org/pdf/transactions.pdf
http://openacs.org/forums/message-view?message_id=176198

Since Currie's design has only one db-wide write lock, it is
semantically equivalent to PostgreSQL's "serializable" isolation
level, correct? How could this be extended to support table locking
and PostgreSQL's default "read committed" isolation level? Would the
smallest locking granularity possible in Currie's design be one page
of rows, however many rows that happens to be?

The one process, many threads aspect of Currie's design sounds just
fine to me. The one write lock for the whole database, on the other
hand, could be quite limiting. How much more difficult would it be to
add table locks to the design? It would also be a nice bonus if the
design at least contemplates how to add row locks (or locks for pages
of rows) in the future, but my guess is that table locks would be
good enough in practice.

Currie's design also seems to defer writing any data to disk until
the transaction commits, which seems odd to me. I did not follow many
of the details of that design so I'm probably missing something here,
but since most write transactions commit rather than abort, in any
sort of MVCC model wouldn't it be better to write data to disk
earlier rather than later? I'm pretty sure that's what both Oracle
and PostgreSQL do.
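To make the isolation-level question concrete, here is a toy Python illustration (all class and field names invented) of the usual distinction: a serializable transaction keeps reading against the snapshot taken at transaction start, while a read committed transaction takes a fresh snapshot for each statement and so sees rows committed mid-transaction.

```python
class Snapshot:
    """Frozen view of which transactions had committed at some instant."""

    def __init__(self, committed):
        self.committed = frozenset(committed)

    def visible(self, row):
        return row["xmin"] in self.committed


class DB:
    """Toy store: rows tagged with their creating txid (xmin)."""

    def __init__(self):
        self.rows = []
        self.committed = set()

    def commit_insert(self, txid, value):
        self.rows.append({"xmin": txid, "value": value})
        self.committed.add(txid)


class Txn:
    def __init__(self, db, serializable):
        self.db = db
        self.serializable = serializable
        self.start_snap = Snapshot(db.committed)  # taken at BEGIN

    def read(self):
        # Serializable: reuse the start-of-transaction snapshot.
        # Read committed: take a fresh snapshot for this statement.
        snap = (self.start_snap if self.serializable
                else Snapshot(self.db.committed))
        return [r for r in self.db.rows if snap.visible(r)]
```

Currie's single db-wide write lock gives the serializable behavior for free, since no one can commit underneath an open transaction; read committed would require exactly this per-statement re-snapshotting.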
My particular interest in SQLite:

Now that I've asked lots of questions above, I'll describe some of
the real-world use cases that got me thinking about this, in case it
helps clarify how and why I'm interested in a high-concurrency
SQLite:

A while back I wrote a high-performance multi-threaded application
that basically just accepted data requests, used various ugly
low-level proprietary C APIs to fetch data from remote servers, and
then organized the data and fed it back to the client application as
simple rows of CSV-style text. Using all those low-level C APIs meant
tracking lots of somewhat complicated housekeeping data. If my
application ever crashed, all housekeeping data instantly became
worthless anyway, so I definitely didn't want to store it
persistently in Oracle or PostgreSQL; for both simplicity and
performance, I wanted it in-memory only.

So in my case, I used AOLserver's nsv's for this. (The Tcl Thread
Extension has "tsv" which is just like "nsv", except better.) Nsvs
are basically just associative arrays for storing key/value pairs
(like Tcl arrays or Perl hashes), but nsv operations are both atomic
and highly concurrent, as they were designed for inter-thread
communication in AOLserver. (Mutex locking is