Thanks for the helpful reply. Sorry I've taken so long to get back to this; I've had some hardware trouble and am only catching up on email now...
On Wed, Feb 01, 2006 at 07:27:06AM -0500, [EMAIL PROTECTED] wrote:
> Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> > I was wondering if there were any docs or explanations available on
> > how SQLite decides to lay out data on disk.
>
> Free pages in the middle of the file are filled first.  Some effort
> is made to use pages that are close together for related information.
> In monotone, where you seldom if ever delete anything, you probably
> never have any free pages, so new information is always added to the
> end of the file.

I'm going to ask a bunch of finicky boring questions to make sure I'm
understanding :-).  So rows are basically written to the file in the same
order that the INSERT statements are executed?

Oh, and should I assume that individual row cells are kept together on
disk, even if they are (much) larger than a db block?  I assume so, but
just want to make sure...

> After you VACUUM, everything will be on disk in row order.

I assume this means "sorted by primary key"?  (And with tables in some
random order relative to each other, but I don't think I care about that
at all.)

> If you see a big performance improvement after VACUUMing, then the
> disk layout is perhaps an optimization worth looking into.  If
> however (as I suspect) your performance is similar after vacuuming,
> then changing the way information is added to the disk probably
> will not help, since after a VACUUM the information is pretty much
> optimally ordered for minimum seeking.

I think you left out the end of the sentence, "...assuming an in-order
access pattern".  Unless you just mean, during the btree traversals
involved in each key lookup?  Man, there's a whole 'nother part I don't
know much about :-).  I guess walking the btrees can obviously be another
source of disk latency; I'm not sure whether I should worry about this
or not.

If I do an INSERT of a row that has some indexes on it, where do those
index entries get written?  Next to the actual row data, at the end of
the file?  (Assuming there are no free blocks earlier in the file.)  And
then at VACUUM time each index gets grouped into one spot on disk?

I was actually thinking more about the cost of looking up many items from
a table.  Here, unfortunately, our current access pattern is quite
pessimal.  The current schema is:

  CREATE TABLE files (id primary key, data not null);

'id' is the SHA1 hash of the file; 'data' is a compressed raw file.

  CREATE TABLE file_deltas
    (id not null, base not null, delta not null,
     unique(id, base));

'id' is the SHA1 of the file this delta lets us construct, 'base' is the
SHA1 of the file that the delta is against, and 'delta' is the compressed
xdelta.

So, when we traverse delta chains, we go wandering all over this table,
indexing by the SHA1 of intermediate versions.  Our access isn't just
random, it's _cryptographically strongly_ random! :-)

So, we've been throwing around ways to overhaul this stuff.  Obviously
sqlite is not going to be able to improve on the current situation
without some help from us.
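(To make the access pattern concrete, here's a rough Python sketch of
what reconstructing one old version looks like against this schema.
apply_delta() is a placeholder for the xdelta patcher, and I'm assuming
zlib-style compression for the "compressed" columns; the real monotone
code is C++ and differs in detail.  The point is just that every hop in
the chain keys a fresh lookup on an essentially random SHA1, i.e. a seek
to some unrelated part of the file.)

  import sqlite3
  import zlib

  def apply_delta(base_text, delta):
      # Stand-in for xdelta application: the real routine patches
      # base_text with the instructions in delta to produce the newer
      # version.
      raise NotImplementedError

  def reconstruct(conn, ident):
      """Rebuild the full text of file version 'ident' (a SHA1 hex string)."""
      cur = conn.cursor()
      row = cur.execute("SELECT data FROM files WHERE id = ?",
                        (ident,)).fetchone()
      if row is not None:
          # Stored in full (the head of its chain): a single lookup.
          return zlib.decompress(row[0])
      row = cur.execute("SELECT base, delta FROM file_deltas WHERE id = ?",
                        (ident,)).fetchone()
      if row is None:
          raise KeyError("no such file version: %s" % ident)
      base_id, delta = row
      # 'base_id' is another SHA1, so this recursive lookup lands on an
      # effectively random page of the table -- one extra seek per step
      # in the delta chain.
      return apply_delta(reconstruct(conn, base_id), zlib.decompress(delta))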
> Let me propose a radical solution:  I've been experimenting with adding
> a VCS component to CVSTrac (http://www.cvstrac.org/) to replace CVS and
> thus provide a complete project management system in a standalone CGI
> program.  My latest thinking on this (backed up by experiment) is to

Entering the VCS game?  Good luck!  It's an interesting (and surprisingly
deep) field.  (Total tangent: I basically know how to make monotone work
over a CGI transport; it's some work, but doable, just no-one's picked it
up yet.  It might be worth considering such a solution before trying to
build a new system from scratch.  The basic trade-off would be a CGI
script plus a statically linked binary instead of just a CGI script, but
on the other hand, building Yet Another VCS from scratch is a significant
undertaking.  The detailed trade-off would of course be more complicated
:-).  Something to throw out there in case it leads to discussion...)

> avoid storing long series of xdeltas.  Each file version is instead stored
> as a baseline and a single xdelta.  The manifest stores two UUIDs instead
> of one.  That way, you can always load a particular file version with
> at most two lookups.  As a file evolves, the baseline version stays the
> same and the xdelta changes from one version to the next.  When the size
> of the xdelta reaches some fraction of the baseline file size, create a
> new baseline.  Experimentally, I have found it works well to create a
> new baseline when the xdelta becomes 15% of the size of the baseline.

Ah, indeed, I'd forgotten about this technique.  Thanks for bringing it
up!  It inspired me to go and sketch out some notes on different options:
  http://venge.net/monotone/wiki/DeltaStorageStrategies

There are a few different things we're thinking about here.  Right now we
do backwards linear delta chaining, so we always have the latest version
of everything stored in full, and then it takes O(n) time to fetch a
version n steps back in history.  While theoretically problematic, this
has actually caused zero complaints; in practice people always check out
the latest head...

However, it is causing us problems when synchronizing databases over the
network, because when synchronizing you want to send _forward_ deltas.
So we're wasting a tremendous amount of time shuffling data around to
reverse the deltas (and keep the database consistent in the meantime),
and are considering storage formats that would be more friendly to
push/pull/sync -- while still being acceptably fast for things like
checkout.

> My studies (based on the revision history of SQLite) indicate that using
> a baseline plus single xdelta representation results in a repository that
> is about 2x to 3x larger than using a long chain of xdeltas.  But access
> is faster.  And the repository is still 6x to 10x smaller than a git-style
> repository that uses no deltas anywhere.

git-style repositories no longer work that way; they only do that for new
stuff (to make operations like commit fast, since they don't have to deal
with deltas at all), and then rely on users to run "packing" commands
every once in a while, which use some heuristics to find things to delta
against each other.  I still haven't managed to track down exactly what
these heuristics are, but the results are _very_ small.  (Possibly just
because they can batch a bunch of deltas together and then gzip them all
at once, or some such thing.)  However, the required maintenance is a
serious barrier for a general-use system.

While size is definitely not everything -- I doubt anyone notices 10%
here or there -- a factor of 2-3x is probably going to be hard to sell.
Which is unfortunate, since it's a nice scheme.
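(For concreteness, here's a little Python sketch of the
baseline-plus-single-xdelta rule you describe.  make_delta() is a
placeholder for the xdelta generator, the dicts stand in for the
files/file_deltas tables, and the 0.15 threshold is just the 15% figure
you found experimentally.)

  def make_delta(old_text, new_text):
      # Stand-in for xdelta: a real implementation returns a compact,
      # compressed description of how to turn old_text into new_text.
      raise NotImplementedError

  def store_version(baselines, deltas, baseline_id, new_id, new_text):
      """Record 'new_text' (keyed by its SHA1 'new_id') either as a single
      delta against the current baseline, or as a fresh baseline of its
      own.  Returns the id of whichever baseline later versions should
      delta against -- the second UUID the manifest would record
      alongside new_id."""
      baseline_text = baselines[baseline_id]
      delta = make_delta(baseline_text, new_text)
      if len(delta) < 0.15 * len(baseline_text):
          # The delta is still small relative to the baseline: keep the
          # old baseline and store just this one per-version delta, so
          # any version is at most two lookups away.
          deltas[new_id] = (baseline_id, delta)
          return baseline_id
      else:
          # The delta has reached 15% of the baseline's size: promote the
          # new version to be the baseline from here on out.
          baselines[new_id] = new_text
          return new_id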
As far as I can tell (Matt, who's actually doing more of the work here,
might disagree), the "ideal" system would be:

-- store forward delta chains, cutting them off every once in a while

-- somehow organize the storage of these chains per-file, so that first
   we have all the deltas pertaining to one file, sorted by application
   order, then all the deltas for the next file, etc.  Magically keep
   this true even as new deltas arrive :-).

-- when synchronizing, the sending side just sends the raw deltas out of
   its db, and sorts them so it sends all deltas for one file, then all
   deltas for the next file, etc.  This is nice because it involves low
   work for the server (who tends to receive each version once and then
   send it many times), it can be a streaming read on the sender, it is
   cheap to reconstruct each version as we receive it (which we have to
   do to check hashes) (because if we have just reconstructed version A,
   and get the delta from A to A+1, we can immediately reconstruct A+1
   in constant time), and the receiver gets to write each delta straight
   to disk without having to re-deltaify or re-compress (gzip is much
   more expensive than gunzip).

-- when we need to reconstruct files for checkout or the like, choose to
   reconstruct the files in the order they are listed in the db.  So
   first we write out whichever file's deltas come first, then we
   process the deltas that come after that, etc.  Since I'm postulating
   that our db is magically kept in such a nice 0-fragmentation state,
   this actually does the _whole_ checkout in a single streaming read.

This is kind of the holy grail for such a system; even mercurial, the
system to beat for such things, requires 2 reads from different disk
locations per file being reconstructed, rather than 1 stream of reads
per tree.  We would probably be faster than 'cp -r'.

But, since running VACUUM basically isn't an option, I'm not sure how we
could get this magically unfragmented db, or what the next-best thing
would be :-).  Some sort of incremental defragmenting might be handy, I
guess.  "Hey, we've got a transaction open anyway, let's spend 200ms
reducing fragmentation a bit before we return to the user."  I have no
idea how such a thing would be implemented; this stuff seems like magic
to me anyway.

So, this was a bit of a brain dump, but I hope it helps clarify some of
what we're thinking about...

-- Nathaniel

--
In mathematics, it's not enough to read the words
you have to hear the music