subject:"\[sqlite\] disk locality"

[sqlite] Re: [Monotone-devel] [sqlite] disk locality (and delta storage)

2006-02-11 Thread drh

Daniel Carosone <[EMAIL PROTECTED]> wrote:
> 
> > Just type (for example):
> > 
> >monotone httpserver &
> > 
> > and then point your webbrowser at 127.0.0.1.
> 
> My personal favourite is jetty in Java for this kind of thing; I'm not
> sure monotone itself should grow a http server :) 

The built-in webserver need not have all the features or
performance of apache.  Something very simple will suffice.
CVSTrac has the ability to act as its own web server, or 
to let inetd be the "server" that calls CVSTrac on each 
request.  Both features combined are implemented in less 
than 150 lines of C code.

To see the code visit
http://www.cvstrac.org/cvstrac/getfile/cvstrac/cgi.c
Search for cgi_http_server and cgi_handle_http_request.

--
D. Richard Hipp   <[EMAIL PROTECTED]>

[sqlite] Re: [Monotone-devel] [sqlite] disk locality (and delta storage)

2006-02-11 Thread Nathaniel Smith

On Fri, Feb 10, 2006 at 07:06:27PM -0500, [EMAIL PROTECTED] wrote:
> Daniel Carosone <[EMAIL PROTECTED]> wrote:
> > Wiki pages doesn't seem so hard, they're pretty much text documents
> > stored in a VCS anyway. 
> 
> There are some complications.  Each wiki page acts if it where
> its own independent project or branch.  And then you probably want
> some way see all current leaves and to do merges from the web
> interface.  
>
> If you intend your system to be publically accessible, then
> you will also need some mechanism to delete (or at least 
> permanently hide from public view) the spam that miscreants
> occasionally post on wiki pages and sometimes also on tickets.
> Some kind of "death cert" perhaps.

These all seem reasonably straightforward; basically just UI issues.

The trickiest bit is the making each page independent bit... but you
can just do that by having each page be a single file named "page.txt"
in its own branch.  Alternatively, sometimes it is very nice to allow
renaming in a wiki.  (I have definitely wanted this feature, when I
realized that the spec I have been working out on a wiki page actually
should be called something different than what it currently is.)  You
can allow this pretty straightforwardly even in the distributed wiki
case if you just use monotone's rename support...

> > Bug tracking state would require a little
> > more smarts, with suitable text formats stored in the VCS and
> > code/hooks to assist in making sane merges between multiple heads of a
> > given ticket, but even that doesn't seem terribly hard.
> 
> My thinking about tickets is that each change to a ticket
> is just a small, signed, immutable record (like a cert)
> that contains a variable number of name/value pairs.  The
> names are things like "title", "description", "status",
> "assigned-to", etc.  To reconstruct the current state of
> a ticket, you sort all the pieces by their creation time
> (their actual creation time, not the time they happened to
> arrive at your database) and read through them one by one.
> Values overwrite prior values with the same name.  So
> in the end, you have one value for each field name - and
> that is the current state of your ticket.

You seem to be reinventing history tracking with certs.

It seems a lot simpler to just make tickets, like, a text file, and
then use the existing history tracking tools.  They're quite tuned and
involve some non-obvious bits.  In particular, compared to this
scheme, they're insensitive to clock skew, they provide real merging
(in your scheme, if I change a field locally, and you change the same
field locally, then whichever of us has our clock set later at that
time will silently clobber the other person), and users can
investigate the history of a ticket using normal history browsing
tools...

> It is also very useful to have certs that record a link 
> between a revision and a ticket.  In this way you can record 
> that a bug was introduced by, or fixed by, a particular 
> revision.  The linkage between tickets and revisions has 
> proven to be particularly valuable in CVSTrac.

Indeed!  Enabling this kind of workflow integration is _exactly_ why
we have this generic cert thing built in.  No-one's done much with it
yet, though, so really we don't even know how far the possibilities go
-- so I get all excited when someone does :-).

> Once you have cross linking between revisions and tickets,
> if you add a google-like search you then have a very powerful
> system for doing source tree archeological work.  This comes
> up a lot in maintaining SQLite.  Somebody on the web reports
> a problem.  I have some vague memory of fixing a similar
> problem 9 months ago, but do not recall where it was or how
> I fixed it.  By using the search feature of CVSTrac together
> with the links between tickets and check-in comments, I can
> quickly find the historical problem and pinpoint exactly when
> and how the problem was fixed.  Then post a URL to the specific
> ticket that described the problem and the specific revision that
> fixed it.

Yep!  Monotone has much more structured information than CVS, and it's
much easier to get at; one can get at all sorts of things.  You can
track not just the list of commits, but things like branches, figure
out whether they're merged or not yet, see which branches work is
occurring on...

Or how about another cool use of certs:
  http://mt.xaraya.com/com.xaraya.core/index.psp

I think the move from CVS to systems with simpler, saner data models
opens up huge opportunities for better visualization, data mining,
data navigation, community awareness, workflow management... sadly, I
never get to work on this, because I end up spending my time trying to
make the basic pieces just work :-).  But I can't wait to see what
other people come up with, and can cheer loudly while they do...

-- Nathaniel

-- 
So let us espouse a less contested notion of truth and falsehood, even
if it is philosophically debatable (if we

[sqlite] Re: [Monotone-devel] Re: [sqlite] disk locality (and delta storage)

2006-02-11 Thread Nathaniel Smith

Thanks for this detailed email!  I've been a bit crazy with codecon
the last week, and most of what I have to say is "hmm, lots to think
about here", so, yeah :-).

I did want to reply to your VCS system comments, though:

On Tue, Feb 07, 2006 at 10:17:54AM -0500, [EMAIL PROTECTED] wrote:
> What I'm looking for is a VCS + bug-tracker + wiki that I can just
> drop into a cgi-bin of some low cost web hosting site (like Hurricane
> Electric, which I notice we both use) and have a ready-to-roll project
> management system with no setup or other details to worry about.  Kind
> of a sourceforge-in-a-box, only a lot better than sourceforge.  I'm

That sounds awesome.  I'm _really_ interested in this area myself,
actually, though the stuff I've been playing with is somewhat
orthogonal.  (Nothing really working yet, but playing with ways to
improve data integration -- by folding in even more, like IRC logs and
mailing list archives.  These kinds of things require a much heavier
server architecture, though, so sort of exploring different parts of
the design space.)

> looking for something like monotone with CVSTrac enhancements that
> works over HTTP.  Nothing like that currently exists, I'm afraid, 
> so I've been working on my own.
>
> Yes, it is a big job, though perhaps not any bigger than writing 
> an SQL database engine ;-)

I can't say monotone works with CVSTrac, but then, I assume your
system doesn't either off the bat :-).  And, monotone can work over
HTTP -- or at least, _I_ say it can :-).  There is a branch where I
worked out a proof-of-concept implementation of this:
  http://viewmtn.angrygoats.net/branch.psp?branch=net.venge.monotone.dumb
It's as a python script that does some rather Clever Things.  I don't
have time to explain the algorithm in detail now, but it basically
supports monotone's full push/pull/sync semantics, you can talk to one
friend's repo and then talk to another's, etc., that all just works;
and it should be transactional (except that in some rare cases someone
has to rollback by hand; in the dumb remove filesystem case, readers
can't in general do a rollback, and renames can't in general be
atomic, so if you're extremely unlikely you might get wedged).  And it
should be really speedy on top of some well done network backends (you
really want things like pipelining).  If you do take a look at it,
there's basically "dumb.py" doing monotone glue, "fs.py" giving the
generic filesystem interface that needs to be implemented for each
transport, and "merkle_dir.py" in between, where all the magic lives.

I haven't actually convinced anyone yet to take care of it and polish
it up, though (or rewrite more sanely, or whatever), so it's just sort
of been sitting there, waiting forlornly for someone who will give it
love...

> Monotone is my favorite of the current crop of VCSes by the way.
> It has 80% of what I am looking for.  What I'm really after is 
> a better monotone.  I have been greatly inspired by your work,
> as have others, I notice.

Thank you very much!  I would be interested in hearing what you think
of as the last 20%, beyond HTTP support.

I also innocently observe that if what you want is a better monotone,
the quickest route may be to... make monotone better ;-).

> > While size is definitely not everything -- I doubt anyone notices 10%
> > here or there -- a factor of 2-3x is probably going to be hard to
> > sell.  Unfortunately, since it's a nice scheme.
> 
> The CVS repository for SQLite is 15MiB.  Under a baseline+delta schema
> it would grow to (perhaps) 45MiB.  The cheapest account on Hurricane
> Electric includes 1GiB of storage.  Why should I care about the extra
> 30MiB?

Well, because there are enough people out there with gig-sized source
checkouts.  (Yeah, not even history, but... checkouts.)  Or hundreds
and hundreds of thousands of commits (gcc, mozilla, linux kernel...).
While there's definitely value to a tool that works well for the 95%
of projects that are not that big, we really would like to have a
single tool that 100% of projects can viably use.

The extraordinarily intense competition in the VCS field is also a
factor; note that mercurial scales to those projects just fine,
without excessive space usage, and is experiencing great uptake on
small projects too -- so since we think we offer some other
compelling advantages, we really would like to compete directly across
the whole range :-).

-- Nathaniel

-- 
In mathematics, it's not enough to read the words
you have to hear the music

[sqlite] Re: [Monotone-devel] [sqlite] disk locality (and delta storage)

2006-02-10 Thread drh

Daniel Carosone <[EMAIL PROTECTED]> wrote:
> 
> It seems a little odd to me to build a centralised, online
> information system for tracking state and documenting activity around
> and about source code in a distributed and disconnected VCS.  
> 

Ah yes, you're right.  But in the system I envision, the wiki
and bug-tracking are also decentralized, disconnected, and
distributed.

> 
> What I'd really like to see is something that, instead of just
> plugging into monotone to show source code state and patches, actually
> used monotone for all storage and 'information transport', and allowed
> developers to update the wiki pages and bug tracking information in
> the same way they can update the code: offline, with syncs and merges
> later as needed.  

This is more in line with my thinking.

> 
> Wiki pages doesn't seem so hard, they're pretty much text documents
> stored in a VCS anyway. 

There are some complications.  Each wiki page acts if it where
its own independent project or branch.  And then you probably want
some way see all current leaves and to do merges from the web
interface.  

If you intend your system to be publically accessible, then
you will also need some mechanism to delete (or at least 
permanently hide from public view) the spam that miscreants
occasionally post on wiki pages and sometimes also on tickets.
Some kind of "death cert" perhaps.

> Bug tracking state would require a little
> more smarts, with suitable text formats stored in the VCS and
> code/hooks to assist in making sane merges between multiple heads of a
> given ticket, but even that doesn't seem terribly hard.

My thinking about tickets is that each change to a ticket
is just a small, signed, immutable record (like a cert)
that contains a variable number of name/value pairs.  The
names are things like "title", "description", "status",
"assigned-to", etc.  To reconstruct the current state of
a ticket, you sort all the pieces by their creation time
(their actual creation time, not the time they happened to
arrive at your database) and read through them one by one.
Values overwrite prior values with the same name.  So
in the end, you have one value for each field name - and
that is the current state of your ticket.

This approach gives you automatic merging and a complete
change history/audit trail.  Some people are initially shocked
that any user can update any field of the ticket, but that
kind of openness is in keeping with the wiki tradition and
has actually been used very successfully in CVSTrac.  If
you wanted to restrict changes on selected fields, you could
just ignore the name/value pairs for that field on certs
from unauthorized users.

Tickets also benefit from having "remarks" that people can
append to the ticket (without overwriting) and attachments.
Both are handled by separate certs. 

It is also very useful to have certs that record a link 
between a revision and a ticket.  In this way you can record 
that a bug was introduced by, or fixed by, a particular 
revision.  The linkage between tickets and revisions has 
proven to be particularly valuable in CVSTrac.

Once you have cross linking between revisions and tickets,
if you add a google-like search you then have a very powerful
system for doing source tree archeological work.  This comes
up a lot in maintaining SQLite.  Somebody on the web reports
a problem.  I have some vague memory of fixing a similar
problem 9 months ago, but do not recall where it was or how
I fixed it.  By using the search feature of CVSTrac together
with the links between tickets and check-in comments, I can
quickly find the historical problem and pinpoint exactly when
and how the problem was fixed.  Then post a URL to the specific
ticket that described the problem and the specific revision that
fixed it.

> 
> It could still be a web ui if people find that comfortable, just one
> that developers would often run pointing the browser at a local
> server, with a commit at the end of a session, and a later sync and
> merge. Of course, you could/would always have an instance of this
> running on a public webserver in a well-known place, just like
> monotone projects typically do for their source databases: for
> convenience, rather than necessity.

My thinking exactly.  I want the ability to drop a standalone
binary into a cgi-bin on a $7/mo hosting site and have an
instant sourceforge for some small project.  It is also nice
to be able to run things out of a database file in your home
directory.  Or do both.

Some things, like ticket reports, tend to want to use a web
interface.  So how do you do that when you're riding on an
airplane or otherwise cut off from your favorite server?
Just type (for example):

   monotone httpserver &

and then point your webbrowser at 127.0.0.1.

> 
> Some of these things might eventually go into monotone itself, perhaps
> building more tools for project policy and practice assistance around
> base mechanisms such as certs and the DAG structure.  More

RE: [sqlite] disk locality (and delta storage)

2006-02-07 Thread DeMarco, Paul

Trac (http://projects.edgewall.com/trac) is a VCS (Subversion) + Wiki + 
Ticketing... And is built on top of a sqlite db.  It has a fair amount of 
installation pre-requisites but is a very clean interface and trys to adhere to 
KISS for their feature set.

Its of course not usable if your still debating your VCS, Trac forces you into 
SVN (mostly).  There is currently work to support varying VCS backends, 
monotone is not on the list of supported systems yet... But there is a ticket 
to have it added.

http://projects.edgewall.com/trac/ticket/1492

--paul

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 07, 2006 7:18 AM
To: sqlite-users@sqlite.org
Cc: monotone-devel@nongnu.org
Subject: Re: [sqlite] disk locality (and delta storage)

Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> 
> So and rows are basically written to the file in the same order that 
> the INSERT statements are executed?

Right.  If there are no free pages in the database file (which is the usual 
case for Monotone, I expect) then new pages are allocated from the end of the 
file.  If the INSERT is small and will fit on an existing page, then no new 
page allocations are required and the data gets inserted in exactly the right 
spot.
But when inserting large blobs, as monotone does, you typically will require a 
least one new page and that page will come at the end.

> 
> Oh, and should I assume that individual row cells are kept together on 
> disk, even if they are (much) larger than a db block?  I assume so, 
> but just want to make sure...

If a table row is too big to fit on a page, then the excess spills onto a 
linked list of overflow pages.  SQLite tries to allocate the base page and the 
overflow pages near each other and in order.

> 
> > After you VACUUM, everything will be on disk in row order.  If
> 
> I assume this means "sorted by primary key"?  (And with tables in some 
> random order relative to each other, but I don't think I care about 
> that at all.)

Tables are always sorted by rowid - which is the same as the INTEGER PRIMARY 
KEY if you have one.

The "true" primary key for every SQLite table is the rowid.  If you specify a 
primary key that is not of type INTEGER, then what SQLite does really is create 
a UNIQUE index on that field.  There is still a rowid which is the "true" 
primary key in the sense that the table is stored in rowid order.

> 
> > you see a big performance improvement after VACUUMing, then the disk 
> > layout is perhaps an optimization worth looking into.  If however 
> > (as I suspect) your performance is similar after vacuuming, then 
> > changing the way information is added to the disk probably will not 
> > help, since after a VACUUM the information is pretty much optimally 
> > ordered for minimum seeking.
> 
> I think you left out the end of the sentence, "...assuming an in-order 
> access pattern".

You have 64 bits of rowid space.  You could assign rowids to deltas in clumps.  
Whenever you encounter a new file, assign it a block of (say) a billion rowids. 
 Each delta to that file goes into successive rowids.  Since the table is 
stored in rowid order, all delta for a particular file are therefore close to 
each other in the table.  This does not guarantee that the btree will be laid 
out on disk in order - it probably will not be unless you run a VACUUM - but it 
will help.  And I suspect it will help a lot.

> 
> Unless you just mean, during the btree traversals involved in each key 
> lookup?  Man, there's a whole 'nother part I don't know much about 
> :-).  I guess walking the btrees can obviously be another source of 
> disk latency; I'm not sure whether I should worry about this or not.

The fanout on tables is typically large - about 50 to 75.  Even more if you 
select a larger page size.  Fanout on indices is much smaller, 10 or 20, 
because index keys are typically larger than the integer rowid keys of tables.  
So to reduce your disk latency, you want to try to always search by rowid.

Something you should experiment with, by the way, is increasing the page size 
so that more records fit on one page and you get larger fanout.  Do you get 
better performance if you rebuild your database with say a 16K or 32K page size 
instead of the default 1K?

> If I do an INSERT of a row that has some indexes on it, where do those 
> index entries get written?  Next to the actual row data, at the end of 
> the file?  (Assuming there are no free blocks earlier in the file.) 
> And then at VACUUM time each index gets groups into one spot on disk?

Indices are stored in completely separate btrees from the tables.  An index has 
key only, and the key is the fields being indexed followed by a the rowid.  So 
to lookup a record by index, you first do a search of the index btree to

Re: [sqlite] disk locality (and delta storage)

2006-02-07 Thread drh

Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> 
> So and rows are basically written to the file in the same order
> that the INSERT statements are executed?

Right.  If there are no free pages in the database file (which
is the usual case for Monotone, I expect) then new pages are
allocated from the end of the file.  If the INSERT is small and
will fit on an existing page, then no new page allocations are
required and the data gets inserted in exactly the right spot.
But when inserting large blobs, as monotone does, you typically
will require a least one new page and that page will come at the
end.

> 
> Oh, and should I assume that individual row cells are kept together on
> disk, even if they are (much) larger than a db block?  I assume so,
> but just want to make sure...

If a table row is too big to fit on a page, then the excess spills
onto a linked list of overflow pages.  SQLite tries to allocate
the base page and the overflow pages near each other and in order.

> 
> > After you VACUUM, everything will be on disk in row order.  If
> 
> I assume this means "sorted by primary key"?  (And with tables in some
> random order relative to each other, but I don't think I care about
> that at all.)

Tables are always sorted by rowid - which is the same as the INTEGER
PRIMARY KEY if you have one.

The "true" primary key for every SQLite table is the rowid.  If you
specify a primary key that is not of type INTEGER, then what SQLite
does really is create a UNIQUE index on that field.  There is still
a rowid which is the "true" primary key in the sense that the table
is stored in rowid order.

> 
> > you see a big performance improvement after VACUUMing, then the
> > disk layout is perhaps an optimization worth looking into.  If
> > however (as I suspect) your performance is similar after vacuuming,
> > then changing the way information is added to the disk probably
> > will not help, since after a VACUUM the information is pretty much
> > optimally ordered for minimum seeking.
> 
> I think you left out the end of the sentence, "...assuming an in-order
> access pattern".

You have 64 bits of rowid space.  You could assign rowids to deltas
in clumps.  Whenever you encounter a new file, assign it a block
of (say) a billion rowids.  Each delta to that file goes into
successive rowids.  Since the table is stored in rowid order, all
delta for a particular file are therefore close to each other in
the table.  This does not guarantee that the btree will be laid out
on disk in order - it probably will not be unless you run a VACUUM -
but it will help.  And I suspect it will help a lot.

> 
> Unless you just mean, during the btree traversals involved in each key
> lookup?  Man, there's a whole 'nother part I don't know much about
> :-).  I guess walking the btrees can obviously be another source of
> disk latency; I'm not sure whether I should worry about this or not.

The fanout on tables is typically large - about 50 to 75.  Even more
if you select a larger page size.  Fanout on indices is much smaller,
10 or 20, because index keys are typically larger than the integer
rowid keys of tables.  So to reduce your disk latency, you want to
try to always search by rowid.

Something you should experiment with, by the way, is increasing
the page size so that more records fit on one page and you get
larger fanout.  Do you get better performance if you rebuild 
your database with say a 16K or 32K page size instead of the
default 1K?

> If I do an INSERT of a row that has some indexes on it, where do those
> index entries get written?  Next to the actual row data, at the end of
> the file?  (Assuming there are no free blocks earlier in the file.)
> And then at VACUUM time each index gets groups into one spot on disk?

Indices are stored in completely separate btrees from the tables.  An
index has key only, and the key is the fields being indexed followed
by a the rowid.  So to lookup a record by index, you first do a
search of the index btree to find the entry with the matching fields.
Then you pull the rowid off of the end of the index entry and use
that rowid to do a separate search in the table btree to get your
actual data.  So an index search actually does two binary lookups - 
one on the index and another on the table.

> 
> I was actually thinking more about the cost of looking up many items
> from a table.  Here, unfortunately, our current access pattern is
> quite pessimal.  The current schema is:
> 
> CREATE TABLE files (id primary key, data not null); 
> 
> 'id' is the SHA1 hash of the file; 'data' is a compressed raw file.
> 
> CREATE TABLE file_deltas
>   (id not null, base not null, delta not null,
>unique(id, base)
>   );
> 
> 'id' is the SHA1 of the file this delta lets us construct, 'base' is
> the SHA1 of the file that the delta is against, and 'delta' is the
> compressed xdelta.
> 
> So, when we traverse delta chains, we go wandering all over this table
> indexing by the SHA1 of intermediate versions.  Our

Re: [sqlite] disk locality

2006-02-07 Thread Nathaniel Smith

On Wed, Feb 01, 2006 at 08:56:37PM -0800, Joe Wilson wrote:
> Another question... Does Monotone still Base64 encode all its data before 
> putting it into blobs?
> If so, using raw binary SQLite blobs would likely give Monotone a 33% speedup 
> and smaller
> database.

It does, actually, but that's going away as soon as we get around to
finishing reviewing/merging the relevant branch...

-- Nathaniel

-- 
- Don't let your informants burn anything.
- Don't grow old.
- Be good grad students.
  -- advice of Murray B. Emeneau on the occasion of his 100th birthday

Re: [sqlite] disk locality (and delta storage)

2006-02-07 Thread Nathaniel Smith

Thanks for the helpful reply.  Sorry I've taken so long to get back to
this; I've had some hardware trouble and am only catching up on email
now...

On Wed, Feb 01, 2006 at 07:27:06AM -0500, [EMAIL PROTECTED] wrote:
> Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> > I was wondering if there were any docs or explanations available on
> > how SQLite decides to lay out data on disk.
> 
> Free pages in the middle of the file are filled first.  Some effort
> is made to uses pages that are close together for related information.
> In mototone, where you seldom if ever delete anything, you probably
> never have any free pages, so new information is always added to the
> end of the file.

I'm going to ask a bunch of finicky boring questions to make sure I'm
understanding :-).

So and rows are basically written to the file in the same order
that the INSERT statements are executed?

Oh, and should I assume that individual row cells are kept together on
disk, even if they are (much) larger than a db block?  I assume so,
but just want to make sure...

> After you VACUUM, everything will be on disk in row order.  If

I assume this means "sorted by primary key"?  (And with tables in some
random order relative to each other, but I don't think I care about
that at all.)

> you see a big performance improvement after VACUUMing, then the
> disk layout is perhaps an optimization worth looking into.  If
> however (as I suspect) your performance is similar after vacuuming,
> then changing the way information is added to the disk probably
> will not help, since after a VACUUM the information is pretty much
> optimally ordered for minimum seeking.

I think you left out the end of the sentence, "...assuming an in-order
access pattern".

Unless you just mean, during the btree traversals involved in each key
lookup?  Man, there's a whole 'nother part I don't know much about
:-).  I guess walking the btrees can obviously be another source of
disk latency; I'm not sure whether I should worry about this or not.
If I do an INSERT of a row that has some indexes on it, where do those
index entries get written?  Next to the actual row data, at the end of
the file?  (Assuming there are no free blocks earlier in the file.)
And then at VACUUM time each index gets groups into one spot on disk?

I was actually thinking more about the cost of looking up many items
from a table.  Here, unfortunately, our current access pattern is
quite pessimal.  The current schema is:

CREATE TABLE files (id primary key, data not null); 

'id' is the SHA1 hash of the file; 'data' is a compressed raw file.

CREATE TABLE file_deltas
  (id not null, base not null, delta not null,
   unique(id, base)
  );

'id' is the SHA1 of the file this delta lets us construct, 'base' is
the SHA1 of the file that the delta is against, and 'delta' is the
compressed xdelta.

So, when we traverse delta chains, we go wandering all over this table
indexing by the SHA1 of intermediate versions.  Our access isn't just
random, it's _cryptographically strongly_ random! :-)

So, we've been throwing around ways to overhaul this stuff.  Obviously
sqlite is not going to be able to improve on the current situation
without some help from us.

> Let me propose a radical solution:  I've been experimenting with adding
> a VCS component to CVSTrac (http://www.cvstrac.org/) to replace CVS and
> thus provide a complete project management system in a standalone CGI
> program.  My latest thinking on this (backed up by experiment) is to

Entering the VCS game?  Good luck!  It's an interesting (and
surprisingly deep) field.

(Total tangent: I basically know how to make monotone work over a CGI
transport; it's some work, but doable, just no-one's picked it up yet.
It might be worth considering such a solution before trying to build a
new system from scratch.  The basic trade-off would be a CGI script
plus a statically linked binary instead of just a CGI script, but on
the other hand, building Yet Another VCS from scratch is a significant
undertaking.  The detailed trade-off would of course be more
complicated :-).  Something to throw out there in case it leads to
discussion...)

> avoid storing long series of xdeltas.  Each file version is instead stored
> as a baseline and a single xdelta.  The manifest stores two UUIDs instead
> of one.  That way, you can always load a particular file version with
> at most two lookups.  As a file evolves, the baseline version stays the
> same and the xdelta changes from one version to the next.  When the size
> of the xdelta reachs some fraction of the baseline file size, create a
> new baseline.  Experimentally, I have found it works well to create a
> new baseline when the xdelta becomes 15% of the size of the baseline.

Ah, indeed, I'd forgotten about this technique.  Thanks for bringing
it up!  It inspired me to go and sketch out some notes on different
options:
  http://venge.net/monotone/wiki/DeltaStorageStrategies

There are a few different things we're thinking

Re: [sqlite] disk locality

2006-02-01 Thread Joe Wilson

Another question... Does Monotone still Base64 encode all its data before 
putting it into blobs?
If so, using raw binary SQLite blobs would likely give Monotone a 33% speedup 
and smaller
database.

--- Joe Wilson <[EMAIL PROTECTED]> wrote:

> Does Monotone spend most of its database time in a few long running queries 
> or thousands of
> sub-second queries?
> 
> Could you point us to an example monotone database tarball and give some 
> actual examples of
> speed-deficient queries?
> 
> When working with large SQLite databases with blobs I generally find I get 
> much greater speed
> when
> I move expensive data manipulation from the application layer into a custom 
> SQLite function. I
> don't know whether anything in Monotone lends itself to this optimization.
> 
> --- Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> > In its current implementation in monotone, this algorithm seems to be
> > seek-bound.  Some profiling shows that there are cases where we're
> > spending more time blocked waiting for read()s for the db, than it
> > takes to read the entire db 5 times over.  This makes sense.  We're
> > basically doing random reads scattered all over the file, so the OS
> > has no way to do any sort of readahead, which means we're probably
> > hitting disk seek latency on every read, and like usual these days,
> > latency trumps bandwidth.
> 
> 
> 
> __
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around 
> http://mail.yahoo.com 
> 


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com

Re: [sqlite] disk locality

2006-02-01 Thread Joe Wilson

Does Monotone spend most of its database time in a few long running queries or 
thousands of
sub-second queries?

Could you point us to an example monotone database tarball and give some actual 
examples of
speed-deficient queries?

When working with large SQLite databases with blobs I generally find I get much 
greater speed when
I move expensive data manipulation from the application layer into a custom 
SQLite function. I
don't know whether anything in Monotone lends itself to this optimization.

--- Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> In its current implementation in monotone, this algorithm seems to be
> seek-bound.  Some profiling shows that there are cases where we're
> spending more time blocked waiting for read()s for the db, than it
> takes to read the entire db 5 times over.  This makes sense.  We're
> basically doing random reads scattered all over the file, so the OS
> has no way to do any sort of readahead, which means we're probably
> hitting disk seek latency on every read, and like usual these days,
> latency trumps bandwidth.

__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com

Re: [sqlite] disk locality

2006-02-01 Thread drh

Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> I was wondering if there were any docs or explanations available on
> how SQLite decides to lay out data on disk.

Free pages in the middle of the file are filled first.  Some effort
is made to uses pages that are close together for related information.
In mototone, where you seldom if ever delete anything, you probably
never have any free pages, so new information is always added to the
end of the file.

After you VACUUM, everything will be on disk in row order.  If
you see a big performance improvement after VACUUMing, then the
disk layout is perhaps an optimization worth looking into.  If
however (as I suspect) your performance is similar after vacuuming,
then changing the way information is added to the disk probably
will not help, since after a VACUUM the information is pretty much
optimally ordered for minimum seeking.

> 
> In particular, consider the problem of reconstructing bitstrings
> given a bunch of xdelta's (basically one-way patches) and full texts;
> a typical problem would be to calculate a chain of deltas that applied
> one-by-one to a full text will give the desired text, then pull those
> deltas and the full text out of the db and apply them.  We just store
> these text objects as columns in two tables.
> 
> In its current implementation in monotone, this algorithm seems to be
> seek-bound.  Some profiling shows that there are cases where we're
> spending more time blocked waiting for read()s for the db, than it
> takes to read the entire db 5 times over.  This makes sense.  We're
> basically doing random reads scattered all over the file, so the OS
> has no way to do any sort of readahead, which means we're probably
> hitting disk seek latency on every read, and like usual these days,
> latency trumps bandwidth.
> 
> So, the question is how to fix this.

If you can provide specific schemas and query examples and information
on how big the database is, I might could help.  So far, the problem
is too vague for me to give much useful input.

Let me propose a radical solution:  I've been experimenting with adding
a VCS component to CVSTrac (http://www.cvstrac.org/) to replace CVS and
thus provide a complete project management system in a standalone CGI
program.  My latest thinking on this (backed up by experiment) is to
avoid storing long series of xdeltas.  Each file version is instead stored
as a baseline and a single xdelta.  The manifest stores two UUIDs instead
of one.  That way, you can always load a particular file version with
at most two lookups.  As a file evolves, the baseline version stays the
same and the xdelta changes from one version to the next.  When the size
of the xdelta reachs some fraction of the baseline file size, create a
new baseline.  Experimentally, I have found it works well to create a
new baseline when the xdelta becomes 15% of the size of the baseline.

My studies (based on the revision history of SQLite) indicate that using 
a baseline plus single xdelta representation results in a repository that 
is about 2x to 3x larger than using a long change of xdeltas.  But access 
is faster.  And the repository is still 6x to 10x smaller than a git-style
repository that uses no deltas anywhere.

--
D. Richard Hipp   <[EMAIL PROTECTED]>

[sqlite] Re: [Monotone-devel] [sqlite] disk locality (and delta storage)

[sqlite] Re: [Monotone-devel] [sqlite] disk locality (and delta storage)

[sqlite] Re: [Monotone-devel] Re: [sqlite] disk locality (and delta storage)

[sqlite] Re: [Monotone-devel] [sqlite] disk locality (and delta storage)

RE: [sqlite] disk locality (and delta storage)

Re: [sqlite] disk locality (and delta storage)

Re: [sqlite] disk locality

Re: [sqlite] disk locality (and delta storage)

Re: [sqlite] disk locality

Re: [sqlite] disk locality

Re: [sqlite] disk locality

11 matches

Site Navigation

Mail list logo

Footer information