Two suggestions inline (towards the bottom).

On Tue, 30 Mar 2004, D. Richard Hipp wrote:

>I think the gist of Ben's proposal is as follows
>(please correct me if I am wrong):
>
>    Writes do not modify the main database file until
>    they are ready to commit - meaning that readers can
>    continue to read from the file during lengthy
>    transactions.  The journal contains modified pages,
>    not the original pages, and is thus a roll-forward
>    journal rather than a roll-back journal.
>
>The main difficulty in this approach is locating
>a particular page in the journal file when it needs
>to be reloaded.  For example, suppose page X is read
>from the main database, modified, then written into
>the journal.  Due to cache pressure, that page is then
>flushed from the in-memory cache.  Later on, we need
>to read page X again.  How do we locate the position
>in the journal where the current data for page X was
>written so that we can reread it?
>
>I think we can rule out a linear scan of the journal
>file for efficiency reasons.
>
>Do you record the offset of each page in the journal
>with an in-memory hash table, perhaps?  If so, consider
>what happens when you do a massive update to a large
>(say 10GB) database.  If we touch most pages in the
>database and assume (reasonably) that each entry in
>the mapping table requires about 48 bytes, then the
>resulting mapping table would require 500MB of RAM.
>Granted, this is an extreme case - how often do you
>modify every page of a 10GB database? But it might
>happen, and we would like SQLite to be able to do it
>without having to malloc for 0.5GB of memory to pull
>it off.  (To be fair, SQLite already does some other
>per-page allocations, but those would amount to less
>than 50MB of RAM in the example above, an order of
>magnitude less than a mapping table.)
>
>Another idea is to store the mapping in a separate,
>temporary, unjournaled database file.  I have not
>pursued that approach because of efficiency concerns.
>Perhaps the mapping could be kept in memory until it
>got too big, then spilled to an external file.  That
>way it would be fast for the common case but memory
>efficient for the occasional massive update.  Either
>way, the mapping table would be the responsibility
>of the pager layer which (up until now) has known
>nothing about the btree layer.  But with this approach,
>the pager layer would need to call the btree layer
>recursively.  Thinking about that is starting to
>make my head spin.

This mapping file could be a multi-tier page table, like a page table for
a processor.

Each tier could be a page-sized page table, just like the i386 page
table, and the page-table pages could compete for memory just like data
pages. Or use a separate TLB with LRU/NRU replacement.

With a maximum of 1G pages and a 1KB page size, each page-table page
can hold 256 four-byte mappings, so a four-level page table would be
enough to map the entire page range.
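
Something along these lines, as a minimal in-memory sketch (PtNode,
pt_lookup and the "journal page index" entries are made-up names, not
SQLite's pager API; on disk each interior entry would instead hold the
4-byte page number of the next page-table page):

  #include <stdlib.h>

  #define PT_ENTRIES 256            /* 1KB page-table page / 4-byte entry */

  typedef struct PtNode PtNode;
  struct PtNode {
    union {
      PtNode *child[PT_ENTRIES];    /* interior levels */
      unsigned jidx[PT_ENTRIES];    /* leaf level: page's position in the
                                    ** journal, 0 = not in the journal */
    } u;
  };

  /* Find the leaf slot for database page pgno, allocating page-table
  ** pages along the way when createFlag is true.  Returns NULL if the
  ** page is unmapped and createFlag is false (or on malloc failure). */
  static unsigned *pt_lookup(PtNode **pRoot, unsigned pgno, int createFlag){
    PtNode **pp = pRoot;
    int level;
    for(level=3; level>=0; level--){
      if( *pp==0 ){
        if( !createFlag ) return 0;
        *pp = (PtNode*)calloc(1, sizeof(PtNode));
        if( *pp==0 ) return 0;
      }
      if( level==0 ) break;
      pp = &(*pp)->u.child[(pgno >> (8*level)) & 0xff];
    }
    return &(*pp)->u.jidx[pgno & 0xff];
  }

Recording a page's position in the journal is then pt_lookup(&pRoot,
pgno, 1) followed by a store; re-reading page X is pt_lookup(&pRoot,
pgno, 0), falling back to the main database when that returns NULL or
the slot holds zero.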

When committing, just walk the page table and copy any page with a
non-null mapping from the journal to the main database.
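
Continuing the sketch above, that walk is a short recursion; copy_page()
is just a stand-in for whatever pread/pwrite pair moves one page from
the journal into the main database:

  /* Depth-first walk of the four-level table; leaves are at level 0. */
  static void pt_walk(PtNode *pNode, int level, unsigned base,
                      void (*copy_page)(unsigned pgno, unsigned jidx)){
    unsigned i;
    if( pNode==0 ) return;
    for(i=0; i<PT_ENTRIES; i++){
      if( level>0 ){
        pt_walk(pNode->u.child[i], level-1, (base<<8)|i, copy_page);
      }else if( pNode->u.jidx[i] ){
        copy_page((base<<8)|i, pNode->u.jidx[i]);
      }
    }
  }

  /* At commit time: pt_walk(pRoot, 3, 0, copy_page); */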

A lack of locality in the page updates may, however, make the
multi-tier approach inefficient: depending on how sparse the updates
are, a big update could leave us with hundreds or thousands of
page-table pages, each mapping only a single page.

>
>Does anybody have any other ideas for doing this
>mapping?  The constraints are that it needs to be
>fast for the common case (where no more than a few
>hundred or a few thousand pages change) but should
>not require too much memory for the occasional
>massive update.

How about taking advantage of sparse files (most OSes support them) for
the journal, and saving the redo pages in a shadow file? That way the
burden of handling the mapping efficiently is pushed down to the OS.

The journal could then take the form of a pair of files (sketched just
below):
- One holding the list of page numbers from the shadow file that must
  be committed.
- One holding the shadow file itself: the updated pages, stored as a
  sparse file.
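
Concretely, something like this (ShadowJournal and the field names are
illustrative only, and the 1KB page size is just SQLite's default):

  #include <unistd.h>

  #define PAGE_SIZE 1024

  /* The list file is an append-only array of 32-bit page numbers; the
  ** shadow file is sparse and laid out exactly like the main database. */
  typedef struct ShadowJournal ShadowJournal;
  struct ShadowJournal {
    int listFd;      /* fd of the page-number list */
    int shadowFd;    /* fd of the sparse shadow file */
    int dbFd;        /* fd of the main database file */
  };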

When re-reading an updated page that has been bumped from the cache, read
the corresponding page from the shadow file first, and if it is all zeros,
read the page from the main file. This assumes that SQLite will not write
a page of all zeros (safe assumption?)
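
As a rough sketch, continuing the structure above (read_page and
all_zeros are made-up names; error handling abbreviated):

  static int all_zeros(const unsigned char *p, int n){
    int i;
    for(i=0; i<n; i++){
      if( p[i] ) return 0;
    }
    return 1;
  }

  /* Read page pgno (pages numbered from 1).  A hole in the sparse
  ** shadow file reads back as zeros, which we treat as "not updated in
  ** this transaction", so we fall back to the main database file. */
  static int read_page(ShadowJournal *pJ, unsigned pgno, unsigned char *buf){
    off_t off = (off_t)(pgno-1) * PAGE_SIZE;
    if( pread(pJ->shadowFd, buf, PAGE_SIZE, off)==PAGE_SIZE
     && !all_zeros(buf, PAGE_SIZE) ){
      return 0;                     /* updated copy from the shadow file */
    }
    return pread(pJ->dbFd, buf, PAGE_SIZE, off)==PAGE_SIZE ? 0 : -1;
  }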

When updating a page, update the corresponding page in the shadow file
and list its page number in the journal if it is not already there
(keep a cache of updated page numbers to try to avoid duplicates).
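
Again as a sketch, with a small direct-mapped table standing in for
that cache (a duplicate entry in the list is harmless because the
commit copy is idempotent):

  #define DUP_CACHE 4096
  static unsigned dupCache[DUP_CACHE];   /* recently listed page numbers */

  static int write_page(ShadowJournal *pJ, unsigned pgno,
                        const unsigned char *buf){
    off_t off = (off_t)(pgno-1) * PAGE_SIZE;
    unsigned h = pgno % DUP_CACHE;
    if( pwrite(pJ->shadowFd, buf, PAGE_SIZE, off)!=PAGE_SIZE ) return -1;
    if( dupCache[h]!=pgno ){
      dupCache[h] = pgno;                /* probably not listed yet */
      if( write(pJ->listFd, &pgno, sizeof(pgno))!=(ssize_t)sizeof(pgno) ){
        return -1;
      }
    }
    return 0;
  }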

Committing is then simply a matter of reading the list of pages from
the list journal and copying those pages from the shadow file to the
main file. Once we have reached the commit stage we know we will
succeed, as we are guaranteed to have enough disk space and the
operation is idempotent.
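
That stage might look like the following (same caveats as above, and
assuming the list and shadow files have already been flushed to disk so
the loop can simply be re-run from the top after a crash):

  static int commit_pages(ShadowJournal *pJ){
    unsigned pgno;
    unsigned char buf[PAGE_SIZE];
    if( lseek(pJ->listFd, 0, SEEK_SET)==(off_t)-1 ) return -1;
    while( read(pJ->listFd, &pgno, sizeof(pgno))==(ssize_t)sizeof(pgno) ){
      off_t off = (off_t)(pgno-1) * PAGE_SIZE;
      if( pread(pJ->shadowFd, buf, PAGE_SIZE, off)!=PAGE_SIZE ) return -1;
      if( pwrite(pJ->dbFd, buf, PAGE_SIZE, off)!=PAGE_SIZE ) return -1;
    }
    return fsync(pJ->dbFd);
  }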

Does anyone use a filesystem that doesn't support sparse files? The one
I can think of off the top of my head is FAT. Would that be common?

Cheers,
Christian

-- 
    /"\
    \ /    ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
     X                           - AGAINST MS ATTACHMENTS
    / \
