Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> So any rows are basically written to the file in the same order
> that the INSERT statements are executed?
Right.  If there are no free pages in the database file (which is the
usual case for Monotone, I expect) then new pages are allocated from
the end of the file.  If the INSERT is small and will fit on an
existing page, then no new page allocations are required and the data
gets inserted in exactly the right spot.  But when inserting large
blobs, as monotone does, you will typically require at least one new
page, and that page will come at the end.

> Oh, and should I assume that individual row cells are kept together on
> disk, even if they are (much) larger than a db block?  I assume so,
> but just want to make sure...

If a table row is too big to fit on a page, then the excess spills
onto a linked list of overflow pages.  SQLite tries to allocate the
base page and the overflow pages near each other and in order.

> > After you VACUUM, everything will be on disk in row order.  If
>
> I assume this means "sorted by primary key"?  (And with tables in some
> random order relative to each other, but I don't think I care about
> that at all.)

Tables are always sorted by rowid - which is the same as the INTEGER
PRIMARY KEY if you have one.  The "true" primary key for every SQLite
table is the rowid.  If you specify a primary key that is not of type
INTEGER, then what SQLite really does is create a UNIQUE index on that
field.  There is still a rowid, which is the "true" primary key in the
sense that the table is stored in rowid order.

> > you see a big performance improvement after VACUUMing, then the
> > disk layout is perhaps an optimization worth looking into.  If
> > however (as I suspect) your performance is similar after vacuuming,
> > then changing the way information is added to the disk probably
> > will not help, since after a VACUUM the information is pretty much
> > optimally ordered for minimum seeking.
>
> I think you left out the end of the sentence, "...assuming an in-order
> access pattern".

You have 64 bits of rowid space.
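(As a quick aside: the rowid behavior above is easy to see from Python's
sqlite3 module.  This is just an illustrative sketch - the table name
and data are made up - showing that an INTEGER PRIMARY KEY column is an
alias for the rowid, and that a plain table scan comes back in rowid
order rather than insertion order.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT)")

# Insert rows deliberately out of key order.
conn.executemany("INSERT INTO t(k, v) VALUES (?, ?)",
                 [(30, "c"), (10, "a"), (20, "b")])

# k and rowid are the same value under two names, and a full scan
# walks the table btree in rowid order, not insertion order:
rows = conn.execute("SELECT rowid, k, v FROM t").fetchall()
# -> [(10, 10, 'a'), (20, 20, 'b'), (30, 30, 'c')]
```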
You could assign rowids to deltas in clumps.  Whenever you encounter a
new file, assign it a block of (say) a billion rowids.  Each delta to
that file goes into successive rowids.  Since the table is stored in
rowid order, all deltas for a particular file are therefore close to
each other in the table.  This does not guarantee that the btree will
be laid out on disk in order - it probably will not be unless you run
a VACUUM - but it will help.  And I suspect it will help a lot.

> Unless you just mean, during the btree traversals involved in each key
> lookup?  Man, there's a whole 'nother part I don't know much about
> :-).  I guess walking the btrees can obviously be another source of
> disk latency; I'm not sure whether I should worry about this or not.

The fanout on tables is typically large - about 50 to 75.  Even more
if you select a larger page size.  Fanout on indices is much smaller,
10 or 20, because index keys are typically larger than the integer
rowid keys of tables.  So to reduce your disk latency, you want to try
to always search by rowid.

Something you should experiment with, by the way, is increasing the
page size so that more records fit on one page and you get larger
fanout.  Do you get better performance if you rebuild your database
with, say, a 16K or 32K page size instead of the default 1K?

> If I do an INSERT of a row that has some indexes on it, where do those
> index entries get written?  Next to the actual row data, at the end of
> the file?  (Assuming there are no free blocks earlier in the file.)
> And then at VACUUM time each index gets grouped into one spot on disk?

Indices are stored in completely separate btrees from the tables.  An
index entry has a key only, and the key is the fields being indexed
followed by the rowid.  So to look up a record by index, you first do
a search of the index btree to find the entry with the matching
fields.  Then you pull the rowid off of the end of the index entry and
use that rowid to do a separate search in the table btree to get your
actual data.  So an index search actually does two binary lookups -
one on the index and another on the table.

> I was actually thinking more about the cost of looking up many items
> from a table.  Here, unfortunately, our current access pattern is
> quite pessimal.  The current schema is:
>
>   CREATE TABLE files (id primary key, data not null);
>
> 'id' is the SHA1 hash of the file; 'data' is a compressed raw file.
>
>   CREATE TABLE file_deltas
>     (id not null, base not null, delta not null,
>      unique(id, base)
>     );
>
> 'id' is the SHA1 of the file this delta lets us construct, 'base' is
> the SHA1 of the file that the delta is against, and 'delta' is the
> compressed xdelta.
>
> So, when we traverse delta chains, we go wandering all over this table
> indexing by the SHA1 of intermediate versions.  Our access isn't just
> random, it's _cryptographically strongly_ random!  :-)

I would consider redoing the table like this:

   CREATE TABLE files(
     localid INTEGER PRIMARY KEY,
     id TEXT UNIQUE,
     data BLOB NOT NULL
   );

The files.localid field is the 64-bit integer rowid.  Lookups by rowid
are at least twice as fast as lookups by files.id because you only
have to do a single binary search.  Use the localid in the file_deltas
table.  Then if you also use the trick of assigning contiguous blocks
of localids to each file, all information about a file will be
contiguous in the btree - and hopefully closer to contiguous on disk.
(Fully contiguous after a VACUUM.)

The name "localid" is intended to convey the concept that this ID is
used only in the local repository.  Separate repositories to which you
might push or pull have their own localids that are likely different
from yours.  The SHA1 hash id is universal.  The localid is not.
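(To make the two-lookups-vs-one point concrete, here is a minimal
sketch in Python's sqlite3 module.  The files schema is the one
proposed above; the file_deltas columns and the sample data are my own
illustrative assumptions, not anything monotone actually uses.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files(
      localid INTEGER PRIMARY KEY,   -- alias for the 64-bit rowid
      id      TEXT UNIQUE,           -- universal SHA1 hash
      data    BLOB NOT NULL
    );
    -- Hypothetical reworking of file_deltas to use localids:
    CREATE TABLE file_deltas(
      localid INTEGER NOT NULL,      -- files.localid of the result file
      base    INTEGER NOT NULL,      -- files.localid of the base version
      delta   BLOB NOT NULL,
      UNIQUE(localid, base)
    );
""")

sha1 = "da39a3ee5e6b4b0d3255bfef95601890afd80709"  # sample key
conn.execute("INSERT INTO files(id, data) VALUES (?, ?)",
             (sha1, b"content"))

# Slow path - two binary searches: one on the UNIQUE index on id to
# find the rowid, then one on the table btree to fetch the row.
(localid,) = conn.execute(
    "SELECT localid FROM files WHERE id = ?", (sha1,)).fetchone()

# Fast path - a single binary search of the table btree by rowid.
(data,) = conn.execute(
    "SELECT data FROM files WHERE localid = ?", (localid,)).fetchone()
```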
> > Let me propose a radical solution:  I've been experimenting with adding
> > a VCS component to CVSTrac (http://www.cvstrac.org/) to replace CVS and
> > thus provide a complete project management system in a standalone CGI
> > program.  My latest thinking on this (backed up by experiment) is to
>
> Entering the VCS game?  Good luck!  It's an interesting (and
> surprisingly deep) field.
>
> (Total tangent: I basically know how to make monotone work over a CGI
> transport; it's some work, but doable, just no-one's picked it up yet.
> It might be worth considering such a solution before trying to build a
> new system from scratch.  The basic trade-off would be a CGI script
> plus a statically linked binary instead of just a CGI script, but on
> the other hand, building Yet Another VCS from scratch is a significant
> undertaking.  The detailed trade-off would of course be more
> complicated :-).  Something to throw out there in case it leads to
> discussion...)

What I'm looking for is a VCS + bug-tracker + wiki that I can just
drop into a cgi-bin of some low-cost web hosting site (like Hurricane
Electric, which I notice we both use) and have a ready-to-roll project
management system with no setup or other details to worry about.  Kind
of a sourceforge-in-a-box, only a lot better than sourceforge.  I'm
looking for something like monotone with CVSTrac enhancements that
works over HTTP.  Nothing like that currently exists, I'm afraid, so
I've been working on my own.  Yes, it is a big job, though perhaps not
any bigger than writing an SQL database engine ;-)

Monotone is my favorite of the current crop of VCSes, by the way.  It
has 80% of what I am looking for.  What I'm really after is a better
monotone.  I have been greatly inspired by your work, as have others,
I notice.

> While size is definitely not everything -- I doubt anyone notices 10%
> here or there -- a factor of 2-3x is probably going to be hard to
> sell.  Unfortunately, since it's a nice scheme.
The CVS repository for SQLite is 15MiB.  Under a baseline+delta schema
it would grow to (perhaps) 45MiB.  The cheapest account on Hurricane
Electric includes 1GiB of storage.  Why should I care about the extra
30MiB?

> So, this was a bit of a brain dump, but I hope it helps clarify some
> of what we're thinking about...

Very helpful.  I am keen to help you get monotone working better.  Let
me know if there is anything I can do to help or if I can answer any
additional questions for you.

--
D. Richard Hipp <[EMAIL PROTECTED]>