Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> So any rows are basically written to the file in the same order
> that the INSERT statements are executed?
Right.  If there are no free pages in the database file (which is the
usual case for Monotone, I expect) then new pages are allocated from
the end of the file.  If the INSERT is small and will fit on an
existing page, then no new page allocations are required and the data
gets inserted in exactly the right spot.  But when inserting large
blobs, as monotone does, you will typically require at least one new
page, and that page will come at the end.

> Oh, and should I assume that individual row cells are kept together on
> disk, even if they are (much) larger than a db block?  I assume so,
> but just want to make sure...

If a table row is too big to fit on a page, then the excess spills
onto a linked list of overflow pages.  SQLite tries to allocate the
base page and the overflow pages near each other and in order.

> > After you VACUUM, everything will be on disk in row order.  If
>
> I assume this means "sorted by primary key"?  (And with tables in some
> random order relative to each other, but I don't think I care about
> that at all.)

Tables are always sorted by rowid - which is the same as the INTEGER
PRIMARY KEY if you have one.  The "true" primary key for every SQLite
table is the rowid.  If you specify a primary key that is not of type
INTEGER, then what SQLite really does is create a UNIQUE index on that
field.  There is still a rowid, which is the "true" primary key in the
sense that the table is stored in rowid order.

> > you see a big performance improvement after VACUUMing, then the
> > disk layout is perhaps an optimization worth looking into.  If
> > however (as I suspect) your performance is similar after vacuuming,
> > then changing the way information is added to the disk probably
> > will not help, since after a VACUUM the information is pretty much
> > optimally ordered for minimum seeking.
>
> I think you left out the end of the sentence, "...assuming an in-order
> access pattern".

You have 64 bits of rowid space.
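(As a quick aside: the rowid behavior above is easy to see from Python's
sqlite3 module.  This is just an illustrative sketch - the table name
and data are made up - showing that an INTEGER PRIMARY KEY column is an
alias for the rowid, and that a plain table scan comes back in rowid
order rather than insertion order.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT)")

# Insert rows deliberately out of key order.
conn.executemany("INSERT INTO t(k, v) VALUES (?, ?)",
                 [(30, "c"), (10, "a"), (20, "b")])

# k and rowid are the same value under two names, and a full scan
# walks the table btree in rowid order, not insertion order:
rows = conn.execute("SELECT rowid, k, v FROM t").fetchall()
# -> [(10, 10, 'a'), (20, 20, 'b'), (30, 30, 'c')]
```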
You could assign rowids to deltas in clumps.  Whenever you encounter a
new file, assign it a block of (say) a billion rowids.  Each delta to
that file goes into successive rowids.  Since the table is stored in
rowid order, all deltas for a particular file are therefore close to
each other in the table.  This does not guarantee that the btree will
be laid out on disk in order - it probably will not be unless you run
a VACUUM - but it will help.  And I suspect it will help a lot.

> Unless you just mean, during the btree traversals involved in each key
> lookup?  Man, there's a whole 'nother part I don't know much about
> :-).  I guess walking the btrees can obviously be another source of
> disk latency; I'm not sure whether I should worry about this or not.

The fanout on tables is typically large - about 50 to 75.  Even more
if you select a larger page size.  Fanout on indices is much smaller,
10 or 20, because index keys are typically larger than the integer
rowid keys of tables.  So to reduce your disk latency, you want to try
to always search by rowid.

Something you should experiment with, by the way, is increasing the
page size so that more records fit on one page and you get larger
fanout.  Do you get better performance if you rebuild your database
with, say, a 16K or 32K page size instead of the default 1K?

> If I do an INSERT of a row that has some indexes on it, where do those
> index entries get written?  Next to the actual row data, at the end of
> the file?  (Assuming there are no free blocks earlier in the file.)
> And then at VACUUM time each index gets grouped into one spot on disk?

Indices are stored in completely separate btrees from the tables.  An
index entry has a key only, and the key is the fields being indexed
followed by the rowid.  So to look up a record by index, you first do
a search of the index btree to find the entry with the matching
fields.  Then you pull the rowid off of the end of the index entry and
use that rowid to do a separate search in the table btree to get your
actual data.  So an index search actually does two binary lookups -
one on the index and another on the table.

> I was actually thinking more about the cost of looking up many items
> from a table.  Here, unfortunately, our current access pattern is
> quite pessimal.  The current schema is:
>
>   CREATE TABLE files (id primary key, data not null);
>
> 'id' is the SHA1 hash of the file; 'data' is a compressed raw file.
>
>   CREATE TABLE file_deltas
>     (id not null, base not null, delta not null,
>      unique(id, base)
>     );
>
> 'id' is the SHA1 of the file this delta lets us construct, 'base' is
> the SHA1 of the file that the delta is against, and 'delta' is the
> compressed xdelta.
>
> So, when we traverse delta chains, we go wandering all over this table
> indexing by the SHA1 of intermediate versions.  Our access isn't just
> random, it's _cryptographically strongly_ random!  :-)

I would consider redoing the table like this:

   CREATE TABLE files(
     localid INTEGER PRIMARY KEY,
     id TEXT UNIQUE,
     data BLOB NOT NULL
   );

The files.localid field is the 64-bit integer rowid.  Lookups by rowid
are at least twice as fast as lookups by files.id because you only
have to do a single binary search.  Use the localid in the file_deltas
table.  Then if you also use the trick of assigning contiguous blocks
of localids to each file, all information about a file will be
contiguous in the btree - and hopefully closer to contiguous on disk.
(Fully contiguous after a VACUUM.)

The name "localid" is intended to convey the concept that this ID is
used only in the local repository.  Separate repositories to which you
might push or pull have their own localids that are likely different
from yours.  The SHA1 hash id is universal.  The localid is not.
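(To make the two-lookups-vs-one point concrete, here is a minimal
sketch in Python's sqlite3 module.  The files schema is the one
proposed above; the file_deltas columns and the sample data are my own
illustrative assumptions, not anything monotone actually uses.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files(
      localid INTEGER PRIMARY KEY,   -- alias for the 64-bit rowid
      id      TEXT UNIQUE,           -- universal SHA1 hash
      data    BLOB NOT NULL
    );
    -- Hypothetical reworking of file_deltas to use localids:
    CREATE TABLE file_deltas(
      localid INTEGER NOT NULL,      -- files.localid of the result file
      base    INTEGER NOT NULL,      -- files.localid of the base version
      delta   BLOB NOT NULL,
      UNIQUE(localid, base)
    );
""")

sha1 = "da39a3ee5e6b4b0d3255bfef95601890afd80709"  # sample key
conn.execute("INSERT INTO files(id, data) VALUES (?, ?)",
             (sha1, b"content"))

# Slow path - two binary searches: one on the UNIQUE index on id to
# find the rowid, then one on the table btree to fetch the row.
(localid,) = conn.execute(
    "SELECT localid FROM files WHERE id = ?", (sha1,)).fetchone()

# Fast path - a single binary search of the table btree by rowid.
(data,) = conn.execute(
    "SELECT data FROM files WHERE localid = ?", (localid,)).fetchone()
```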
> > Let me propose a radical solution:  I've been experimenting with adding
> > a VCS component to CVSTrac (http://www.cvstrac.org/) to replace CVS and
> > thus provide a complete project management system in a standalone CGI
> > program.  My latest thinking on this (backed up by experiment) is to
>
> Entering the VCS game?  Good luck!  It's an interesting (and
> surprisingly deep) field.
>
> (Total tangent: I basically know how to make monotone work over a CGI
> transport; it's some work, but doable, just no-one's picked it up yet.
> It might be worth considering such a solution before trying to build a
> new system from scratch.  The basic trade-off would be a CGI script
> plus a statically linked binary instead of just a CGI script, but on
> the other hand, building Yet Another VCS from scratch is a significant
> undertaking.  The detailed trade-off would of course be more
> complicated :-).  Something to throw out there in case it leads to
> discussion...)

What I'm looking for is a VCS + bug-tracker + wiki that I can just
drop into a cgi-bin of some low-cost web hosting site (like Hurricane
Electric, which I notice we both use) and have a ready-to-roll project
management system with no setup or other details to worry about.  Kind
of a sourceforge-in-a-box, only a lot better than sourceforge.  I'm
looking for something like monotone with CVSTrac enhancements that
works over HTTP.  Nothing like that currently exists, I'm afraid, so
I've been working on my own.  Yes, it is a big job, though perhaps not
any bigger than writing an SQL database engine ;-)

Monotone is my favorite of the current crop of VCSes, by the way.  It
has 80% of what I am looking for.  What I'm really after is a better
monotone.  I have been greatly inspired by your work, as have others,
I notice.

> While size is definitely not everything -- I doubt anyone notices 10%
> here or there -- a factor of 2-3x is probably going to be hard to
> sell.  Unfortunately, since it's a nice scheme.
The CVS repository for SQLite is 15MiB.  Under a baseline+delta schema
it would grow to (perhaps) 45MiB.  The cheapest account on Hurricane
Electric includes 1GiB of storage.  Why should I care about the extra
30MiB?

> So, this was a bit of a brain dump, but I hope it helps clarify some
> of what we're thinking about...

Very helpful.  I am keen to help you get monotone working better.  Let
me know if there is anything I can do to help or if I can answer any
additional questions for you.

--
D. Richard Hipp <[EMAIL PROTECTED]>