Thanks for the helpful reply. Sorry I've taken so long to get back to this; I've had some hardware trouble and am only catching up on email now...
On Wed, Feb 01, 2006 at 07:27:06AM -0500, [EMAIL PROTECTED] wrote:
> Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> > I was wondering if there were any docs or explanations available on
> > how SQLite decides to lay out data on disk.
>
> Free pages in the middle of the file are filled first.  Some effort
> is made to use pages that are close together for related information.
> In monotone, where you seldom if ever delete anything, you probably
> never have any free pages, so new information is always added to the
> end of the file.

I'm going to ask a bunch of finicky boring questions to make sure I'm
understanding :-).  So rows are basically written to the file in the same
order that the INSERT statements are executed?

Oh, and should I assume that individual row cells are kept together on
disk, even if they are (much) larger than a db block?  I assume so, but
just want to make sure...

> After you VACUUM, everything will be on disk in row order.

I assume this means "sorted by primary key"?  (And with tables in some
random order relative to each other, but I don't think I care about that
at all.)

> If you see a big performance improvement after VACUUMing, then the
> disk layout is perhaps an optimization worth looking into.  If
> however (as I suspect) your performance is similar after vacuuming,
> then changing the way information is added to the disk probably
> will not help, since after a VACUUM the information is pretty much
> optimally ordered for minimum seeking.

I think you left out the end of the sentence, "...assuming an in-order
access pattern".  Unless you just mean, during the btree traversals
involved in each key lookup?  Man, there's a whole 'nother part I don't
know much about :-).  I guess walking the btrees can obviously be another
source of disk latency; I'm not sure whether I should worry about this
or not.

If I do an INSERT of a row that has some indexes on it, where do those
index entries get written?  Next to the actual row data, at the end of
the file?  (Assuming there are no free blocks earlier in the file.)  And
then at VACUUM time each index gets grouped into one spot on disk?

I was actually thinking more about the cost of looking up many items from
a table.  Here, unfortunately, our current access pattern is quite
pessimal.  The current schema is:

  CREATE TABLE files (id primary key, data not null);

'id' is the SHA1 hash of the file; 'data' is a compressed raw file.

  CREATE TABLE file_deltas
    (id not null, base not null, delta not null,
     unique(id, base));

'id' is the SHA1 of the file this delta lets us construct, 'base' is the
SHA1 of the file that the delta is against, and 'delta' is the compressed
xdelta.

So, when we traverse delta chains, we go wandering all over this table,
indexing by the SHA1 of intermediate versions.  Our access isn't just
random, it's _cryptographically strongly_ random! :-)

So, we've been throwing around ways to overhaul this stuff.  Obviously
sqlite is not going to be able to improve on the current situation
without some help from us.
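(To make the access pattern concrete, here's a rough Python sketch of
what reconstructing one old version looks like against this schema.
apply_delta() is a placeholder for the xdelta patcher, and I'm assuming
zlib-style compression for the "compressed" columns; the real monotone
code is C++ and differs in detail.  The point is just that every hop in
the chain keys a fresh lookup on an essentially random SHA1, i.e. a seek
to some unrelated part of the file.)

  import sqlite3
  import zlib

  def apply_delta(base_text, delta):
      # Stand-in for xdelta application: the real routine patches
      # base_text with the instructions in delta to produce the newer
      # version.
      raise NotImplementedError

  def reconstruct(conn, ident):
      """Rebuild the full text of file version 'ident' (a SHA1 hex string)."""
      cur = conn.cursor()
      row = cur.execute("SELECT data FROM files WHERE id = ?",
                        (ident,)).fetchone()
      if row is not None:
          # Stored in full (the head of its chain): a single lookup.
          return zlib.decompress(row[0])
      row = cur.execute("SELECT base, delta FROM file_deltas WHERE id = ?",
                        (ident,)).fetchone()
      if row is None:
          raise KeyError("no such file version: %s" % ident)
      base_id, delta = row
      # 'base_id' is another SHA1, so this recursive lookup lands on an
      # effectively random page of the table -- one extra seek per step
      # in the delta chain.
      return apply_delta(reconstruct(conn, base_id), zlib.decompress(delta))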
> Let me propose a radical solution:  I've been experimenting with adding
> a VCS component to CVSTrac (http://www.cvstrac.org/) to replace CVS and
> thus provide a complete project management system in a standalone CGI
> program.  My latest thinking on this (backed up by experiment) is to

Entering the VCS game?  Good luck!  It's an interesting (and surprisingly
deep) field.  (Total tangent: I basically know how to make monotone work
over a CGI transport; it's some work, but doable, just no-one's picked it
up yet.  It might be worth considering such a solution before trying to
build a new system from scratch.  The basic trade-off would be a CGI
script plus a statically linked binary instead of just a CGI script, but
on the other hand, building Yet Another VCS from scratch is a significant
undertaking.  The detailed trade-off would of course be more complicated
:-).  Something to throw out there in case it leads to discussion...)

> avoid storing long series of xdeltas.  Each file version is instead stored
> as a baseline and a single xdelta.  The manifest stores two UUIDs instead
> of one.  That way, you can always load a particular file version with
> at most two lookups.  As a file evolves, the baseline version stays the
> same and the xdelta changes from one version to the next.  When the size
> of the xdelta reaches some fraction of the baseline file size, create a
> new baseline.  Experimentally, I have found it works well to create a
> new baseline when the xdelta becomes 15% of the size of the baseline.

Ah, indeed, I'd forgotten about this technique.  Thanks for bringing it
up!  It inspired me to go and sketch out some notes on different options:
  http://venge.net/monotone/wiki/DeltaStorageStrategies

There are a few different things we're thinking about here.  Right now we
do backwards linear delta chaining, so we always have the latest version
of everything stored in full, and then it takes O(n) time to fetch a
version n steps back in history.  While theoretically problematic, this
has actually caused zero complaints; in practice people always check out
the latest head...

However, it is causing us problems when synchronizing databases over the
network, because when synchronizing you want to send _forward_ deltas.
So we're wasting a tremendous amount of time shuffling data around to
reverse the deltas (and keep the database consistent in the meantime),
and are considering storage formats that would be more friendly to
push/pull/sync -- while still being acceptably fast for things like
checkout.

> My studies (based on the revision history of SQLite) indicate that using
> a baseline plus single xdelta representation results in a repository that
> is about 2x to 3x larger than using a long chain of xdeltas.  But access
> is faster.  And the repository is still 6x to 10x smaller than a git-style
> repository that uses no deltas anywhere.

git-style repositories no longer work that way; they only do that for new
stuff (to make operations like commit fast, since they don't have to deal
with deltas at all), and then rely on users to run "packing" commands
every once in a while, which use some heuristics to find things to delta
against each other.  I still haven't managed to track down exactly what
these heuristics are, but the results are _very_ small.  (Possibly just
because they can batch a bunch of deltas together and then gzip them all
at once, or some such thing.)  However, the required maintenance is a
serious barrier for a general-use system.

While size is definitely not everything -- I doubt anyone notices 10%
here or there -- a factor of 2-3x is probably going to be hard to sell.
Which is unfortunate, since it's a nice scheme.
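(For concreteness, here's a little Python sketch of the
baseline-plus-single-xdelta rule you describe.  make_delta() is a
placeholder for the xdelta generator, the dicts stand in for the
files/file_deltas tables, and the 0.15 threshold is just the 15% figure
you found experimentally.)

  def make_delta(old_text, new_text):
      # Stand-in for xdelta: a real implementation returns a compact,
      # compressed description of how to turn old_text into new_text.
      raise NotImplementedError

  def store_version(baselines, deltas, baseline_id, new_id, new_text):
      """Record 'new_text' (keyed by its SHA1 'new_id') either as a single
      delta against the current baseline, or as a fresh baseline of its
      own.  Returns the id of whichever baseline later versions should
      delta against -- the second UUID the manifest would record
      alongside new_id."""
      baseline_text = baselines[baseline_id]
      delta = make_delta(baseline_text, new_text)
      if len(delta) < 0.15 * len(baseline_text):
          # The delta is still small relative to the baseline: keep the
          # old baseline and store just this one per-version delta, so
          # any version is at most two lookups away.
          deltas[new_id] = (baseline_id, delta)
          return baseline_id
      else:
          # The delta has reached 15% of the baseline's size: promote the
          # new version to be the baseline from here on out.
          baselines[new_id] = new_text
          return new_id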
As far as I can tell (Matt, who's actually doing more of the work here,
might disagree), the "ideal" system would be:

-- store forward delta chains, cutting them off every once in a while

-- somehow organize the storage of these chains per-file, so that first
   we have all the deltas pertaining to one file, sorted by application
   order, then all the deltas for the next file, etc.  Magically keep
   this true even as new deltas arrive :-).

-- when synchronizing, the sending side just sends the raw deltas out of
   its db, and sorts them so it sends all deltas for one file, then all
   deltas for the next file, etc.  This is nice because it involves low
   work for the server (who tends to receive each version once and then
   send it many times), it can be a streaming read on the sender, it is
   cheap to reconstruct each version as we receive it (which we have to
   do to check hashes) (because if we have just reconstructed version A,
   and get the delta from A to A+1, we can immediately reconstruct A+1
   in constant time), and the receiver gets to write each delta straight
   to disk without having to re-deltaify or re-compress (gzip is much
   more expensive than gunzip).

-- when we need to reconstruct files for checkout or the like, choose to
   reconstruct the files in the order they are listed in the db.  So
   first we write out whichever file's deltas come first, then we
   process the deltas that come after that, etc.  Since I'm postulating
   that our db is magically kept in such a nice 0-fragmentation state,
   this actually does the _whole_ checkout in a single streaming read.

This is kind of the holy grail for such a system; even mercurial, the
system to beat for such things, requires 2 reads from different disk
locations per file being reconstructed, rather than 1 stream of reads
per tree.  We would probably be faster than 'cp -r'.

But, since running VACUUM basically isn't an option, I'm not sure how we
could get this magically unfragmented db, or what the next-best thing
would be :-).  Some sort of incremental defragmenting might be handy, I
guess.  "Hey, we've got a transaction open anyway, let's spend 200ms
reducing fragmentation a bit before we return to the user."  I have no
idea how such a thing would be implemented; this stuff seems like magic
to me anyway.

So, this was a bit of a brain dump, but I hope it helps clarify some of
what we're thinking about...

-- Nathaniel

--
In mathematics, it's not enough to read the words
you have to hear the music