Re: [sqlite] The term "flat-file" as applied to sqlite
On Tue, Sep 26, 2006 at 08:05:34PM +0200, [EMAIL PROTECTED] wrote:
> Maybe a structured Textfile is a flatfile, if
> possibly readable as a tableview. But it is so
> only in humans view and humans understanding.

Well, and English is a language used by humans to convey their understandings to other humans :-). You can do technical hairsplitting all you want, but the fact is that the term "flat file" has a long history of being used to refer to text files with minimal internal structure, no indexing, accessed with no or minimal locking, no transactions, etc.

-- Nathaniel

--
"...these, like all words, have single, decontextualized meanings: everyone knows what each of these words means, everyone knows what constitutes an instance of each of their referents. Language is fixed. Meaning is certain. Santa Claus comes down the chimney at midnight on December 24."
  -- The Language War, Robin Lakoff

- To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] UNICODE Support
On Fri, Aug 04, 2006 at 10:02:58PM -0700, Cory Nelson wrote:
> On 8/4/06, Trevor Talbot <[EMAIL PROTECTED]> wrote:
> > On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> >
> > > But, since you brought it up - I have no expectations of SQLite
> > > integrating a full Unicode locale library, however it would be a great
> > > improvement if it would respect the current locale and use wcs*
> > > functions when available, or at least order by standard Unicode order
> > > instead of completely mangling things on UTF-8 codes.
> >
> > What do you mean by "standard Unicode order" in this context?
>
> Convert UTF-8 to UTF-16 (or both to UCS-4 if you want to be entirely
> correct) while sorting, to at least make them follow the same pattern.

Huh? UTF-8 handled in the naive way (using "memcmp", like sqlite does) will automagically give you sorting by Unicode code point (probably the only useful meaning of "standard Unicode order" here). UTF-16 handled in the naive way (either using "memcmp" or lexicographically on 2-byte integers) will sort things by code point, mostly, sort of, and otherwise by a weird order that falls out of the details of the UTF-16 encoding accidentally.[1]

Perhaps you're using a legacy system that standardized on UTF-16 before the BMP ran out, and want to be compatible with its idiosyncratic sorting -- then converting things to UTF-16 before comparing makes sense. But that's not really appropriate to make as a general recommendation... better to convert UTF-16 to UTF-8, if you want to be entirely correct :-).

[1] see e.g. http://icu.sourceforge.net/docs/papers/utf16_code_point_order.html

-- Nathaniel

--
Details are all that matters; God dwells there, and you never get to see Him if you don't struggle to get them right. -- Stephen Jay Gould
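The claim about naive byte-wise comparison is easy to check empirically. Here is a small Python sketch (the code points are chosen arbitrarily for illustration): byte-wise "memcmp-style" sorting of UTF-8 reproduces code point order exactly, while byte-wise sorting of UTF-16 misplaces characters outside the BMP.

```python
# Code points chosen to straddle the BMP boundary: U+E000 and U+FF5E are
# inside the BMP, U+10000 is outside it (a surrogate pair in UTF-16).
chars = ["\uff5e", "\U00010000", "\ue000"]

by_codepoint = sorted(chars, key=ord)

# Naive byte-wise comparison of the UTF-8 encodings reproduces
# code point order exactly (EE.. < EF.. < F0..).
by_utf8 = sorted(chars, key=lambda c: c.encode("utf-8"))
assert by_utf8 == by_codepoint

# Naive byte-wise comparison of UTF-16 does not: U+10000 becomes the
# surrogate pair D800 DC00, which sorts *below* U+E000 and U+FF5E.
by_utf16 = sorted(chars, key=lambda c: c.encode("utf-16-be"))
assert by_utf16 != by_codepoint
```

The same effect is what the linked ICU paper is about: surrogate pairs live at D800-DFFF, below the top of the BMP, so raw UTF-16 comparison ranks supplementary-plane characters before some BMP characters.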
Re: [sqlite] Reading the same table from two threads
On Wed, Jul 26, 2006 at 02:32:58PM +, [EMAIL PROTECTED] wrote:
> It has been suggested that I add a mutex to every SQLite
> database connection. This would cause access to a database
> connection to automatically serialize even when two or more
> threads try to use that connection at once, thus preventing
> problems such as the above.
>
> The downside is the (minimal) additional overhead of acquiring
> and releasing a mutex on each API call. There will also be
> some changes to the OS-layer backend which will break
> private ports to unusual OSes.

I'm -0 on this. For monotone, we ship sqlite in our source package (to avoid version skew), but do not otherwise use threads. I think this would mean we had to change our build/configury to figure out how the local thread library worked for sqlite alone, and for no gain to us (perhaps even a speed loss -- mutexes used by libc already show up surprisingly high on our profiles).

-- Nathaniel

--
The Universe may / Be as large as they say
But it wouldn't be missed / If it didn't exist.
  -- Piet Hein
Re: [sqlite] Problems with multiple threads?
On Wed, Jun 07, 2006 at 01:24:38PM -0400, [EMAIL PROTECTED] wrote:
> If it is inconvenient to rollback and retry the entire transaction,
> then start the transaction initially with BEGIN EXCLUSIVE. This
> will acquire the reserved lock immediately (instead of waiting until
> the first write occurs) and so you will either get an SQLITE_BUSY
> right away (when it is a simple matter to just rerun the BEGIN EXCLUSIVE
> statement until it works) or you can be assured of never getting
> another SQLITE_BUSY again until you try to COMMIT (and there too,
> you can simply rerun COMMIT repeatedly until it works.)

It would be convenient to have another form of "BEGIN", in between DEFERRED and IMMEDIATE, whose effect was to immediately acquire the shared lock. That would allow read-only transactions to get this same level of programming convenience you describe, where one only has to be able to handle SQLITE_BUSY in one place. (Of course, one could simulate this now by immediately running a meaningless SELECT after each call to BEGIN, solely for the side effect of acquiring the lock, but it seems less elegant and perhaps not guaranteed to continue working in the future.)

-- Nathaniel

--
.i dei jitfa fanmo xatra
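The "rerun BEGIN EXCLUSIVE until it works" pattern can be sketched with Python's sqlite3 module like this. The retry helper and its busy-detection (matching the "database is locked" error text) are assumptions of this sketch, not anything sqlite itself provides:

```python
import sqlite3
import time

def begin_exclusive(conn, retries=100, delay=0.01):
    """Hypothetical helper: keep re-running BEGIN EXCLUSIVE until the
    lock is acquired. Once it returns, the transaction holds the lock,
    so later statements in it should not see SQLITE_BUSY."""
    for _ in range(retries):
        try:
            conn.execute("BEGIN EXCLUSIVE")
            return
        except sqlite3.OperationalError as e:
            if "locked" not in str(e):
                raise          # some other error: don't mask it
            time.sleep(delay)  # busy: back off briefly and retry
    raise sqlite3.OperationalError("database is locked")

conn = sqlite3.connect(":memory:")
conn.isolation_level = None    # manage transactions explicitly
conn.execute("CREATE TABLE t (x)")
begin_exclusive(conn)
conn.execute("INSERT INTO t VALUES (1)")
conn.execute("COMMIT")
```

With an in-memory database there is no contention, so the first BEGIN EXCLUSIVE succeeds; against a shared file the loop is where all the SQLITE_BUSY handling concentrates, which is the convenience being discussed.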
Re: [sqlite] unsuscribe
On Wed, May 31, 2006 at 01:42:48PM -0500, Bob Dankert wrote:
> Especially considering they are all spelling the word wrong with the
> same mis-spelling.

Wild guess: the list actually receives a lot more of these messages than you see, but mailman has the ability to filter out messages that were obviously intended to go to the -request address. I don't know what logic it uses, but I bet it only knows about "unsubscribe", not "unsuscribe" :-).

-- Nathaniel

--
"If you can explain how you do something, then you're very very bad at it."
  -- John Hopfield
[sqlite] checking 'PRAGMA integrity_check'
Background: Monotone has a "db check" command that goes through and verifies that everything in the database is valid and accessible. While our existing tests basically hit every row of every table, so most things should be caught, we didn't previously call sqlite's "PRAGMA integrity_check". In principle, I guess, this could mean that subtle problems, like a corrupted index that only causes issues on certain access patterns, could be missed...

So, we just committed a patch to call the PRAGMA as part of the "db check" process, but now this raises the issue of how we can _test_ this new code path.

So, my question is -- what is a simple way to reliably corrupt a sqlite db so that "PRAGMA integrity_check" will fail?

Thanks,
-- Nathaniel

--
"...these, like all words, have single, decontextualized meanings: everyone knows what each of these words means, everyone knows what constitutes an instance of each of their referents. Language is fixed. Meaning is certain. Santa Claus comes down the chimney at midnight on December 24."
  -- The Language War, Robin Lakoff
[sqlite] Re: semicolon in string (was Re: [sqlite] File locking additions)
On Thu, Mar 09, 2006 at 10:00:50AM +0200, Roger wrote:
> Hello guys.
>
> I have a small problem. I am trying to write some sql queries. Whenever i
> put a semicolon, sqlite returns an error immediately. For instance if i
> select as follows
>
> Select Name||" "||Surname||" ; "||Address as "Details
> From Person
> where PersID=1098

What interface are you using to access sqlite? If I put a semicolon in a string from the sqlite3 command line client it works fine... e.g.:

  sqlite> select ';';
  ;
  sqlite> select ";";
  ;

but I could easily imagine a poorly written library getting confused by such a thing.

-- Nathaniel

--
Details are all that matters; God dwells there, and you never get to see Him if you don't struggle to get them right. -- Stephen Jay Gould
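The same experiment works from Python's sqlite3 module too; the SQL parser treats a semicolon inside a string literal as plain data. The table and column names below are invented to mirror the quoted query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema modeled on the query in the question.
conn.execute("CREATE TABLE person (pers_id INTEGER, name TEXT, "
             "surname TEXT, address TEXT)")
conn.execute("INSERT INTO person VALUES (1098, 'Ada', 'Lovelace', 'London')")

# The semicolon inside the string literal is just data, not a
# statement terminator.
row = conn.execute(
    "SELECT name || ' ' || surname || ' ; ' || address AS details "
    "FROM person WHERE pers_id = 1098").fetchone()
print(row[0])   # Ada Lovelace ; London
```

If a wrapper library errors out on this, it is probably splitting statements naively on ";" before handing them to sqlite.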
Re: [sqlite] Running App state in db?
On Thu, Mar 02, 2006 at 12:18:21AM +0100, Elrond wrote:
> I intended to only put the locks as such in the db.
> When it comes to a new lock, I'll select all relevant old
> locks, that might conflict, handle the conflict check in
> app logic and finally insert a new lock record. (all inside
> a proper table lock/transaction).
>
> I just hoped for some cool way to let the db cleanup all
> those records, when the session dies.

I don't know enough about your app to know if this would work, but if you only have the one app instance managing locks, and locks expire whenever that app dies... perhaps you can just have that app, on startup, unconditionally erase all locks in the db?

-- Nathaniel

--
Details are all that matters; God dwells there, and you never get to see Him if you don't struggle to get them right. -- Stephen Jay Gould
Re: [sqlite] Strange execution times
On Wed, Feb 22, 2006 at 01:13:19AM +0200, Ulrich Schöbel wrote:
> I tried the 10 reps expecting even more. Then came the surprise:
> only 67 microsecs.
>
> My first feeling was, something like a busy disk or so came
> in just when I tried the 100 reps. But the results were reproducible,
> deviating only by a few microseconds.

Perhaps your timing harness simply can't get accurate results at a mere 10 repetitions?

-- Nathaniel

--
Eternity is very long, especially towards the end.
  -- Woody Allen
[sqlite] Re: [Monotone-devel] [sqlite] disk locality (and delta storage)
On Fri, Feb 10, 2006 at 07:06:27PM -0500, [EMAIL PROTECTED] wrote:
> Daniel Carosone <[EMAIL PROTECTED]> wrote:
> > Wiki pages doesn't seem so hard, they're pretty much text documents
> > stored in a VCS anyway.
>
> There are some complications. Each wiki page acts as if it were
> its own independent project or branch. And then you probably want
> some way to see all current leaves and to do merges from the web
> interface.
>
> If you intend your system to be publically accessible, then
> you will also need some mechanism to delete (or at least
> permanently hide from public view) the spam that miscreants
> occasionally post on wiki pages and sometimes also on tickets.
> Some kind of "death cert" perhaps.

These all seem reasonably straightforward; basically just UI issues. The trickiest bit is making each page independent... but you can do that by having each page be a single file named "page.txt" in its own branch.

Alternatively, sometimes it is very nice to allow renaming in a wiki. (I have definitely wanted this feature, when I realized that the spec I had been working out on a wiki page actually should be called something different than what it currently was.) You can allow this pretty straightforwardly even in the distributed wiki case if you just use monotone's rename support...

> > Bug tracking state would require a little
> > more smarts, with suitable text formats stored in the VCS and
> > code/hooks to assist in making sane merges between multiple heads of a
> > given ticket, but even that doesn't seem terribly hard.
>
> My thinking about tickets is that each change to a ticket
> is just a small, signed, immutable record (like a cert)
> that contains a variable number of name/value pairs. The
> names are things like "title", "description", "status",
> "assigned-to", etc. To reconstruct the current state of
> a ticket, you sort all the pieces by their creation time
> (their actual creation time, not the time they happened to
> arrive at your database) and read through them one by one.
> Values overwrite prior values with the same name. So
> in the end, you have one value for each field name - and
> that is the current state of your ticket.

You seem to be reinventing history tracking with certs. It seems a lot simpler to just make tickets, like, a text file, and then use the existing history tracking tools. They're quite tuned and involve some non-obvious bits. In particular, compared to this scheme, they're insensitive to clock skew, they provide real merging (in your scheme, if I change a field locally, and you change the same field locally, then whichever of us has our clock set later will silently clobber the other person), and users can investigate the history of a ticket using normal history browsing tools...

> It is also very useful to have certs that record a link
> between a revision and a ticket. In this way you can record
> that a bug was introduced by, or fixed by, a particular
> revision. The linkage between tickets and revisions has
> proven to be particularly valuable in CVSTrac.

Indeed! Enabling this kind of workflow integration is _exactly_ why we have this generic cert thing built in. No-one's done much with it yet, though, so really we don't even know how far the possibilities go -- so I get all excited when someone does :-).

> Once you have cross linking between revisions and tickets,
> if you add a google-like search you then have a very powerful
> system for doing source tree archeological work. This comes
> up a lot in maintaining SQLite. Somebody on the web reports
> a problem. I have some vague memory of fixing a similar
> problem 9 months ago, but do not recall where it was or how
> I fixed it. By using the search feature of CVSTrac together
> with the links between tickets and check-in comments, I can
> quickly find the historical problem and pinpoint exactly when
> and how the problem was fixed. Then post a URL to the specific
> ticket that described the problem and the specific revision that
> fixed it.

Yep! Monotone has much more structured information than CVS, and it's much easier to get at; one can get at all sorts of things. You can track not just the list of commits, but things like branches, figure out whether they're merged or not yet, see which branches work is occurring on... Or how about another cool use of certs: http://mt.xaraya.com/com.xaraya.core/index.psp

I think the move from CVS to systems with simpler, saner data models opens up huge opportunities for better visualization, data mining, data navigation, community awareness, workflow management... sadly, I never get to work on this, because I end up spending my time trying to make the basic pieces just work :-). But I can't wait to see what other people come up with, and can cheer loudly while they do...

-- Nathaniel

--
So let us espouse a less contested notion of truth and falsehood, even if it is philosophically debatable (if we lis
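For reference, the reconstruct-by-sorted-replay scheme quoted above is tiny to sketch, and sketching it also makes the clock-skew clobbering objection concrete. All field names, timestamps, and values below are invented for illustration:

```python
def ticket_state(certs):
    """Fold a ticket's certs, oldest first; later values for a field
    overwrite earlier ones, exactly as the quoted scheme describes."""
    state = {}
    for _ts, fields in sorted(certs, key=lambda c: c[0]):
        state.update(fields)
    return state

# Hypothetical certs: (creation_time, {field: value, ...})
certs = [
    (100, {"title": "crash on open", "status": "new"}),
    (250, {"status": "fixed", "fixed-by": "rev-abc123"}),
    (175, {"assigned-to": "drh", "status": "open"}),
]
print(ticket_state(certs))
```

The final state has status "fixed" because the timestamp-250 cert replays last. That is also where the objection bites: if two disconnected writers both set "status", the cert whose clock happens to read later silently wins, with no merge or conflict ever surfacing.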
[sqlite] Re: [Monotone-devel] Re: [sqlite] disk locality (and delta storage)
Thanks for this detailed email! I've been a bit crazy with codecon the last week, and most of what I have to say is "hmm, lots to think about here", so, yeah :-). I did want to reply to your VCS system comments, though:

On Tue, Feb 07, 2006 at 10:17:54AM -0500, [EMAIL PROTECTED] wrote:
> What I'm looking for is a VCS + bug-tracker + wiki that I can just
> drop into a cgi-bin of some low cost web hosting site (like Hurricane
> Electric, which I notice we both use) and have a ready-to-roll project
> management system with no setup or other details to worry about. Kind
> of a sourceforge-in-a-box, only a lot better than sourceforge. I'm

That sounds awesome. I'm _really_ interested in this area myself, actually, though the stuff I've been playing with is somewhat orthogonal. (Nothing really working yet, but playing with ways to improve data integration -- by folding in even more, like IRC logs and mailing list archives. These kinds of things require a much heavier server architecture, though, so I'm sort of exploring different parts of the design space.)

> looking for something like monotone with CVSTrac enhancements that
> works over HTTP. Nothing like that currently exists, I'm afraid,
> so I've been working on my own.
>
> Yes, it is a big job, though perhaps not any bigger than writing
> an SQL database engine ;-)

I can't say monotone works with CVSTrac, but then, I assume your system doesn't either off the bat :-). And, monotone can work over HTTP -- or at least, _I_ say it can :-). There is a branch where I worked out a proof-of-concept implementation of this:

  http://viewmtn.angrygoats.net/branch.psp?branch=net.venge.monotone.dumb

It's a python script that does some rather Clever Things. I don't have time to explain the algorithm in detail now, but it basically supports monotone's full push/pull/sync semantics: you can talk to one friend's repo and then talk to another's, etc., and that all just works; and it should be transactional (except that in some rare cases someone has to roll back by hand; in the dumb remote filesystem case, readers can't in general do a rollback, and renames can't in general be atomic, so if you're extremely unlucky you might get wedged). And it should be really speedy on top of a well done network backend (you really want things like pipelining).

If you do take a look at it, there's basically "dumb.py" doing monotone glue, "fs.py" giving the generic filesystem interface that needs to be implemented for each transport, and "merkle_dir.py" in between, where all the magic lives. I haven't actually convinced anyone yet to take care of it and polish it up, though (or rewrite it more sanely, or whatever), so it's just sort of been sitting there, waiting forlornly for someone who will give it love...

> Monotone is my favorite of the current crop of VCSes by the way.
> It has 80% of what I am looking for. What I'm really after is
> a better monotone. I have been greatly inspired by your work,
> as have others, I notice.

Thank you very much! I would be interested in hearing what you think of as the last 20%, beyond HTTP support. I also innocently observe that if what you want is a better monotone, the quickest route may be to... make monotone better ;-).

> > While size is definitely not everything -- I doubt anyone notices 10%
> > here or there -- a factor of 2-3x is probably going to be hard to
> > sell. Unfortunately, since it's a nice scheme.
>
> The CVS repository for SQLite is 15MiB. Under a baseline+delta schema
> it would grow to (perhaps) 45MiB. The cheapest account on Hurricane
> Electric includes 1GiB of storage. Why should I care about the extra
> 30MiB?

Well, because there are enough people out there with gig-sized source checkouts. (Yeah, not even history, but... checkouts.) Or hundreds and hundreds of thousands of commits (gcc, mozilla, linux kernel...). While there's definitely value to a tool that works well for the 95% of projects that are not that big, we really would like to have a single tool that 100% of projects can viably use.

The extraordinarily intense competition in the VCS field is also a factor; note that mercurial scales to those projects just fine, without excessive space usage, and is experiencing great uptake on small projects too -- so since we think we offer some other compelling advantages, we really would like to compete directly across the whole range :-).

-- Nathaniel

--
In mathematics, it's not enough to read the words you have to hear the music
Re: [sqlite] Tracing Statements...
On Sat, Feb 11, 2006 at 03:37:11AM -0500, Tim Murison wrote:
> I would like to be able to generate a log of all sql statements executed
> by my program.
>
> Unfortunately all of my queries use parameters, and I would like to get
> the full query (all parameters replaced with their sqlite3_bind_*
> values).

In monotone, we log "executing command <...> with parameter 1 bound to <...>, parameter 2 bound to <...>," etc.

-- Nathaniel

--
"Of course, the entire effort is to put oneself
 Outside the ordinary range
 Of what are called statistics."
  -- Stephan Spender
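That monotone-style logging (statement plus each bound value, rather than trying to splice values back into the SQL text) is easy to approximate by wrapping the connection. This is a hypothetical wrapper sketched with Python's sqlite3 module, not anything sqlite itself provides:

```python
import sqlite3

class LoggingConnection(sqlite3.Connection):
    """Hypothetical wrapper: log every statement together with its
    bound parameter values, monotone-style."""
    def execute(self, sql, params=()):
        binds = ", ".join(f"parameter {i} bound to {v!r}"
                          for i, v in enumerate(params, 1))
        print(f"executing {sql!r}" + (f" with {binds}" if binds else ""))
        return super().execute(sql, params)

# The factory= keyword lets sqlite3 hand back our subclass.
conn = sqlite3.connect(":memory:", factory=LoggingConnection)
conn.execute("CREATE TABLE t (a, b)")
conn.execute("INSERT INTO t VALUES (?, ?)", (42, "hello"))
```

Logging values separately also sidesteps the quoting problems you would hit trying to substitute them back into the SQL string yourself.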
Re: [sqlite] disk locality
On Wed, Feb 01, 2006 at 08:56:37PM -0800, Joe Wilson wrote:
> Another question... Does Monotone still Base64 encode all its data before
> putting it into blobs?
> If so, using raw binary SQLite blobs would likely give Monotone a 33% speedup
> and smaller database.

It does, actually, but that's going away as soon as we get around to finishing reviewing/merging the relevant branch...

-- Nathaniel

--
- Don't let your informants burn anything.
- Don't grow old.
- Be good grad students.
  -- advice of Murray B. Emeneau on the occasion of his 100th birthday
Re: [sqlite] disk locality (and delta storage)
Thanks for the helpful reply. Sorry I've taken so long to get back to this; I've had some hardware trouble and am only catching up on email now...

On Wed, Feb 01, 2006 at 07:27:06AM -0500, [EMAIL PROTECTED] wrote:
> Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> > I was wondering if there were any docs or explanations available on
> > how SQLite decides to lay out data on disk.
>
> Free pages in the middle of the file are filled first. Some effort
> is made to use pages that are close together for related information.
> In monotone, where you seldom if ever delete anything, you probably
> never have any free pages, so new information is always added to the
> end of the file.

I'm going to ask a bunch of finicky boring questions to make sure I'm understanding :-). So rows are basically written to the file in the same order that the INSERT statements are executed? Oh, and should I assume that individual row cells are kept together on disk, even if they are (much) larger than a db block? I assume so, but just want to make sure...

> After you VACUUM, everything will be on disk in row order. If

I assume this means "sorted by primary key"? (And with tables in some random order relative to each other, but I don't think I care about that at all.)

> you see a big performance improvement after VACUUMing, then the
> disk layout is perhaps an optimization worth looking into. If
> however (as I suspect) your performance is similar after vacuuming,
> then changing the way information is added to the disk probably
> will not help, since after a VACUUM the information is pretty much
> optimally ordered for minimum seeking.

I think you left out the end of the sentence, "...assuming an in-order access pattern". Unless you just mean, during the btree traversals involved in each key lookup? Man, there's a whole 'nother part I don't know much about :-). I guess walking the btrees can obviously be another source of disk latency; I'm not sure whether I should worry about this or not.

If I do an INSERT of a row that has some indexes on it, where do those index entries get written? Next to the actual row data, at the end of the file? (Assuming there are no free blocks earlier in the file.) And then at VACUUM time each index gets grouped into one spot on disk?

I was actually thinking more about the cost of looking up many items from a table. Here, unfortunately, our current access pattern is quite pessimal. The current schema is:

  CREATE TABLE files (id primary key, data not null);

'id' is the SHA1 hash of the file; 'data' is a compressed raw file.

  CREATE TABLE file_deltas
    (id not null, base not null, delta not null,
     unique(id, base));

'id' is the SHA1 of the file this delta lets us construct, 'base' is the SHA1 of the file that the delta is against, and 'delta' is the compressed xdelta.

So, when we traverse delta chains, we go wandering all over this table, indexing by the SHA1 of intermediate versions. Our access isn't just random, it's _cryptographically strongly_ random! :-)

So, we've been throwing around ways to overhaul this stuff. Obviously sqlite is not going to be able to improve on the current situation without some help from us.

> Let me propose a radical solution: I've been experimenting with adding
> a VCS component to CVSTrac (http://www.cvstrac.org/) to replace CVS and
> thus provide a complete project management system in a standalone CGI
> program. My latest thinking on this (backed up by experiment) is to

Entering the VCS game? Good luck! It's an interesting (and surprisingly deep) field.

(Total tangent: I basically know how to make monotone work over a CGI transport; it's some work, but doable, just no-one's picked it up yet. It might be worth considering such a solution before trying to build a new system from scratch. The basic trade-off would be a CGI script plus a statically linked binary instead of just a CGI script, but on the other hand, building Yet Another VCS from scratch is a significant undertaking. The detailed trade-off would of course be more complicated :-). Something to throw out there in case it leads to discussion...)

> avoid storing long series of xdeltas. Each file version is instead stored
> as a baseline and a single xdelta. The manifest stores two UUIDs instead
> of one. That way, you can always load a particular file version with
> at most two lookups. As a file evolves, the baseline version stays the
> same and the xdelta changes from one version to the next. When the size
> of the xdelta reaches some fraction of the baseline file size, create a
> new baseline. Experimentally, I have found it works well to create a
> new baseline when the xdelta becomes 15% of the size of the baseline.

Ah,
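The baseline-plus-single-xdelta scheme quoted above, including the 15% baseline-rotation heuristic and the at-most-two-lookups read path, can be sketched as follows. difflib opcodes stand in for a real xdelta coder, and all class and variable names are invented:

```python
import difflib

def make_delta(base, text):
    # Toy delta encoder: difflib opcodes carrying the inserted text
    # (a stand-in for a real xdelta implementation).
    sm = difflib.SequenceMatcher(None, base, text, autojunk=False)
    return [(tag, i1, i2, text[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]

def apply_delta(base, delta):
    return "".join(base[i1:i2] if tag == "equal" else new
                   for tag, i1, i2, new in delta)

class BaselineDeltaStore:
    def __init__(self):
        self.baselines = {}           # version id -> full text
        self.deltas = {}              # version id -> (baseline id, delta)
        self.current_baseline = None

    def put(self, vid, text):
        if self.current_baseline is not None:
            base = self.baselines[self.current_baseline]
            delta = make_delta(base, text)
            changed = sum(len(new) for tag, _, _, new in delta
                          if tag != "equal")
            # Keep deltaing against the same baseline until the delta
            # grows past 15% of the baseline's size...
            if changed <= 0.15 * len(base):
                self.deltas[vid] = (self.current_baseline, delta)
                return
        # ...then cut a new baseline.
        self.baselines[vid] = text
        self.current_baseline = vid

    def get(self, vid):
        # At most two lookups for any version, ever.
        if vid in self.baselines:
            return self.baselines[vid]
        base_id, delta = self.deltas[vid]
        return apply_delta(self.baselines[base_id], delta)

store = BaselineDeltaStore()
v1 = "hello world " * 50
v2 = v1 + "tail"
store.put("v1", v1)
store.put("v2", v2)   # small change: stored as a delta against v1
```

The space cost drh mentions (perhaps 3x over chained deltas) buys a hard bound on read latency: no delta chain ever needs walking.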
[sqlite] disk locality
Hello,

I was wondering if there were any docs or explanations available on how SQLite decides to lay out data on disk.

The context is the monotone project -- http://venge.net/monotone -- a VCS where we store project history in a sqlite database. Sqlite has been _very_ good to us, and we'd rather not get into the game of writing our own transactional storage system, for really obvious reasons. However, it's starting to look like we have some performance bottlenecks here that might push us that way.

In particular, consider the problem of reconstructing bitstrings given a bunch of xdeltas (basically one-way patches) and full texts; a typical problem would be to calculate a chain of deltas that, applied one-by-one to a full text, will give the desired text, then pull those deltas and the full text out of the db and apply them. We just store these text objects as columns in two tables.

In its current implementation in monotone, this algorithm seems to be seek-bound. Some profiling shows that there are cases where we're spending more time blocked waiting for read()s from the db than it would take to read the entire db 5 times over. This makes sense: we're basically doing random reads scattered all over the file, so the OS has no way to do any sort of readahead, which means we're probably hitting disk seek latency on every read, and as usual these days, latency trumps bandwidth.

So, the question is how to fix this. Some (well, one) of our competitors use a custom file format tweaked to minimize this disk latency. Like I said, we'd rather not go this way if we can avoid it; single-file storage is _so_ nice to have. But, while obviously our current mechanism is sub-optimal -- we neither write out the deltas in a helpful order, nor index them in any useful way, so there's no way SQLite _could_ optimize for our access pattern even if it wanted to -- I have no idea what sort of tricks I could use to help SQLite help me.

How does SQLite choose where to put new records on disk, and is there any way I can give it useful hints?

-- Nathaniel

P.S.: Sorry for the over-long message :-).

--
"Of course, the entire effort is to put oneself
 Outside the ordinary range
 Of what are called statistics."
  -- Stephan Spender
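The delta-chain reconstruction described above looks roughly like this toy sketch, against the files/file_deltas schema given elsewhere in the thread. The "deltas" here are just full texts standing in for real xdeltas, and the hashes are fake; the point is the access pattern, where every hop in the chain is another random lookup keyed by a SHA1:

```python
import sqlite3

def apply_delta(base, delta):
    # Toy stand-in: the "delta" is simply the reconstructed text.
    return delta

def reconstruct(conn, fid):
    # Base case: a full text is stored directly in `files`.
    row = conn.execute("SELECT data FROM files WHERE id = ?",
                       (fid,)).fetchone()
    if row is not None:
        return row[0]
    # Otherwise, one more random-by-SHA1 lookup, then recurse up the
    # chain and apply the delta on the way back down.
    base_id, delta = conn.execute(
        "SELECT base, delta FROM file_deltas WHERE id = ?",
        (fid,)).fetchone()
    return apply_delta(reconstruct(conn, base_id), delta)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id PRIMARY KEY, data NOT NULL);
    CREATE TABLE file_deltas (id NOT NULL, base NOT NULL, delta NOT NULL,
                              UNIQUE (id, base));
    -- fake hashes: 'aaa' is a full text; 'bbb', 'ccc' are delta-chained
    INSERT INTO files VALUES ('aaa', 'version 1');
    INSERT INTO file_deltas VALUES ('bbb', 'aaa', 'version 2');
    INSERT INTO file_deltas VALUES ('ccc', 'bbb', 'version 3');
""")
print(reconstruct(conn, "ccc"))   # walks ccc -> bbb -> aaa, prints "version 3"
```

On a real on-disk database, each of those SELECTs by SHA1 can land anywhere in the file, which is exactly the seek-bound pattern being complained about.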
Re: [sqlite] querying hierarchy
On Sun, Jan 29, 2006 at 08:30:57AM +0100, Marten Feldtmann wrote:
> Databases like DB2, MaxDB or Oracle offer you these recursive
> SQL statements. The main win is that you execute ONE query
> and retrieve all rows belonging to a sub tree, which is MUCH
> faster than executing multiple queries ...

Is this true in the sqlite context? For client/server systems, getting the server to execute the recursive queries is surely a win, because you cut out all the back-and-forth communication latencies when specifying each new query. In sqlite, though, you can run a second query from your own code just as efficiently as sqlite can from its code...

-- Nathaniel

--
When the flush of a new-born sun fell first on Eden's green and gold,
Our father Adam sat under the Tree and scratched with a stick in the mould;
And the first rude sketch that the world had seen was joy to his mighty heart,
Till the Devil whispered behind the leaves, "It's pretty, but is it Art?"
  -- The Conundrum of the Workshops, Rudyard Kipling
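The point about in-process queries can be made concrete: a level-by-level subtree walk issues one query per depth, and with an embedded library each of those queries is just a function call rather than a network round trip. The schema and data below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER);
    INSERT INTO node VALUES (1, NULL), (2, 1), (3, 1), (4, 2), (5, 4);
""")

def subtree(conn, root):
    """Collect a subtree one level at a time: one query per depth,
    each an in-process call with no client/server latency."""
    found, frontier = [root], [root]
    while frontier:
        qmarks = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT id FROM node WHERE parent IN ({qmarks})",
            frontier).fetchall()
        frontier = [r[0] for r in rows]
        found.extend(frontier)
    return found

print(subtree(conn, 1))   # [1, 2, 3, 4, 5]
```

(For what it's worth, much later SQLite versions grew WITH RECURSIVE common table expressions, which postdate this thread; at the time, doing the recursion in application code as above was the available option.)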