[sqlite] I/O error on creating index with 3.7.14
Using 3.7.14, when creating an index on a 27G database (on the table that contains almost all of the data), I consistently (on Windows XP and Debian GNU/Linux, on three different machines) get a disk I/O error. This does not happen using 3.7.13 (only tested on Debian GNU/Linux), nor does it happen when creating the same index on a smaller (~2G) version of the database.

I ran the process under strace; the final relevant lines (as far as I can judge, knowing nothing of this) are:

lseek(5, 23934032896, SEEK_SET) = 23934032896
write(5, "\231\216\344\271\213\351\235\240\345\261\261\n+\367K\340*\5\2I \1\4\v\301\351\276\215\345\202\276\346\271"..., 1024) = 1024
lseek(5, 23934033920, SEEK_SET) = 23934033920
write(5, "\207\345\210\245\347\267\207\347\264\240\346\211\200\350\254\202\346\211\223\351\274\223\345\274\204\n+\367u\334*\5"..., 632) = 632
lseek(5, 0, SEEK_SET) = 0
read(5, 0x7f5462cb06b8, 18446744071873782392) = -1 EFAULT (Bad address)
close(5) = 0

The command that causes the error is "CREATE INDEX IF NOT EXISTS TextNGramIndex ON TextNGram (text, ngram, size)".

The database schema is (without the failing index):

CREATE TABLE Text (
    id INTEGER PRIMARY KEY ASC,
    filename TEXT UNIQUE NOT NULL,
    checksum TEXT NOT NULL,
    label TEXT NOT NULL
);
CREATE TABLE TextHasNGram (
    text INTEGER NOT NULL REFERENCES Text (id),
    size INTEGER NOT NULL
);
CREATE TABLE TextNGram (
    text INTEGER NOT NULL REFERENCES Text (id),
    ngram TEXT NOT NULL,
    size INTEGER NOT NULL,
    count INTEGER NOT NULL
);
CREATE UNIQUE INDEX TextHasNGramIndex ON TextHasNGram (text, size);
CREATE INDEX TextIndexLabel ON Text (label);

Is there more information I should provide as part of a proper bug report, or is this a known issue, or have I missed a trick somewhere?

Jamie
[sqlite] FW: how to select " char in sqlite
char bh1[320];
memset(bh1, 0, 320);
strcpy(bh1, "sqlite3 -html -header t9_engine.db \"select id,partnumber,substr(\'\',1,180) as img,pcs from engine where id>7;\" >> n.html");
system(bh1);   // here this couldn't work

The error is:

sqlite3 -html -header t9_engine.db "select id,partnumber,substr('',1,180) as img,pcs from engine where id>7;" >> n.html
Error: near "'\',1,180) as img,pcs from engine where id>7;\" >> n.html");"

This could work:

strcpy(bh1, "sqlite3 -html -header t9_engine.db \"select id,partnumber,'' as img,pcs from engine where id>7;\" >> n.html");
system(bh1);

The result is:

8 AA34841687 000 INSONO-SOUS-MOTEUR-- 1

Here I want to add the " character around 'C:\t9\images\INSONO-SOUS-MOTEUR.jpg', and the best way is change to >
Re: [sqlite] light weight write barriers
On Thu, Oct 25, 2012 at 11:03:13AM -0700, da...@lang.hm wrote:
> I agree, this is why I'm trying to figure out the recommended way to
> do this without needing to do full commits.
>
> Since in most cases it's acceptable to lose the last few chunks
> written, if we had some way of specifying ordering, without having
> to specify "write this NOW", the solution would be pretty obvious.

Well, using data journalling with ext3/4 may do what you want. If you don't do any fsync, the changes will get written every 5 seconds when the automatic journal sync happens (and sub-4k writes will also get coalesced to a 5 second granularity).

Even with plain text files, it's pretty easy to tell whether the final record is partially written after a crash: just look for a trailing newline.

Better yet, if you are writing to multiple log files with data journalling, all of the writes will happen at the same time, and they will be streamed to the file system journal, minimizing random writes for at least the journal writes.

- Ted
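A minimal sketch of that trailing-newline check, assuming records are newline-terminated (the helper name and file handling are illustrative, not from the original mail):

/* After a crash, the last record of a newline-terminated log is
 * complete iff the file's last byte is '\n'. Returns 1 if complete,
 * 0 if truncated, -1 if the file can't be examined. */
#include <unistd.h>
#include <fcntl.h>

int last_record_complete(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    char c = 0;
    int complete = 0;
    off_t end = lseek(fd, -1, SEEK_END);   /* seek to the last byte */
    if (end >= 0 && read(fd, &c, 1) == 1)
        complete = (c == '\n');            /* complete iff it's a newline */
    close(fd);
    return complete;
}

On recovery the application would simply truncate the file back to the last newline before resuming appends.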
Re: [sqlite] light weight write barriers
On Thu, 25 Oct 2012, Theodore Ts'o wrote:

> Or does rsyslog *really* need to issue an fsync after each log
> message? Or could it batch updates so that every N seconds, it
> flushes writes to the disk?

In part this depends on how paranoid the admin is. By default rsyslog doesn't do fsyncs, but admins can configure it to do so and can configure the batch size.

However, what I'm talking about here is not normal message traffic, it's the case where the admin has decided that they don't want to use the normal in-memory queues, they want the queues to be on disk so that if the system crashes the queued data will still be there to be processed after the crash. (In addition, this can get used to cover cases where you want queue sizes larger than your available RAM.)

In this case, the extreme, and only at the explicit direction of the admin, is to fsync after every message. The norm is that it's acceptable to lose the last few messages, but losing a chunk out of the middle of the queue file can cause a whole lot more to be lost, passing the threshold of acceptable.

> Sometimes, the answer is not to try to create exotic database-like
> functionality in the file system --- the answer is to be more
> intelligent at the application layer. Not only will the application
> be more portable, it will also in the end be more efficient, since
> even with the most exotic database technologies, the most efficient
> transactional commit is the unneeded commit that you optimize away
> at the application layer.

I agree, this is why I'm trying to figure out the recommended way to do this without needing to do full commits.

Since in most cases it's acceptable to lose the last few chunks written, if we had some way of specifying ordering, without having to specify "write this NOW", the solution would be pretty obvious.

David Lang
Re: [sqlite] Disk I/O Error
Caio Honma wrote:
>
> I was trying to query from a huge table (150,000,000+ rows) using a left
> join statement, using mode csv and output to a file. After several
> minutes I got the Disk I/O error. I searched this mailing list for some
> possible answers (tried integrity_check, but the db was ok), then I
> installed Process Monitor.
>

Can you enable logging via the SQLITE_CONFIG_LOG configuration option before calling any other SQLite APIs? This should help to narrow down the operation returning the error.

http://www.sqlite.org/c3ref/c_config_getmalloc.html#sqliteconfiglog

Example:

void logging_callback(void *notUsed, int iCode, const char *zErrmsg){
  /* handle output here... */
  fprintf(stdout, "%d: %s\n", iCode, zErrmsg);
}

sqlite3_config(SQLITE_CONFIG_LOG, logging_callback, 0);

--
Joe Mistachkin
[sqlite] Disk I/O Error
Hi!

I was trying to query from a huge table (150,000,000+ rows) using a left join statement, using mode csv and output to a file. After several minutes I got the Disk I/O error. I searched this mailing list for some possible answers (tried integrity_check, but the db was ok), then I installed Process Monitor.

I saw the results after the query, but the only unusual result was a RegQueryValue pointing to HKLM\System\CurrentControlSet\Control\Nls\CustomLocale\EMPTY with the result NAME NOT FOUND and a RegQueryValue pointing to HKLM\System\CurrentControlSet\Control\Nls\Language\InstallLanguageFallback that returned with a BUFFER OVERFLOW.

Does anybody know a way to solve this error?

Thanks in advance and sorry about my English.

Caio

PS: The DB is located on my hard drive and I'm using Windows 7 x64.
Re: [sqlite] segfault in 3.7.13 (and older) with UNION ALL
This bug was fixed on 2012-08-28 (by the checkin at http://www.sqlite.org/src/info/200a81358c). The fix is available in SQLite versions 3.7.14 and 3.7.14.1 as well as on trunk.

On Thu, Oct 25, 2012 at 11:50 AM, Steve Pomeroy wrote:
> This bug was originally found on a version of sqlite3 (tested in 3.5.9
> and 3.7.11) in Android (2.2.1 and 4.1.2 respectively), but I've been
> able to recreate it on sqlite3 3.7.13 from Debian/wheezy.
>
> Test case:
>
> CREATE TABLE t1(a,b,c);
> INSERT INTO t1 VALUES(1,2,3), (4,5,6);
> CREATE TABLE t2(d,e,f);
> INSERT INTO t2 VALUES(3,6,9), (4,8,12);
>
> # This works as expected
> SELECT * FROM (SELECT a AS x, b AS y FROM t1 UNION ALL
>                SELECT d AS x, e AS y FROM t2);
> #1|2
> #4|5
> #3|6
> #4|8
>
> # this doesn't crash...
> SELECT * FROM (SELECT a AS x, b AS y FROM t1 UNION ALL
>                SELECT d AS x, e AS y, f AS z FROM t2);
> #1|2
> #4|5
> #3|6
> #4|8
>
> # but this causes a segfault (it's missing column "y" from the second
> # inner SELECT):
> SELECT * FROM (SELECT a AS x, b AS y FROM t1 UNION ALL
>                SELECT d AS x FROM t2);
>
> --
> Steve Pomeroy
> MIT Mobile Experience Lab
> http://mobile.mit.edu/

--
D. Richard Hipp
d...@sqlite.org
[sqlite] segfault in 3.7.13 (and older) with UNION ALL
This bug was originally found on a version of sqlite3 (tested in 3.5.9 and 3.7.11) in Android (2.2.1 and 4.1.2 respectively), but I've been able to recreate it on sqlite3 3.7.13 from Debian/wheezy.

Test case:

CREATE TABLE t1(a,b,c);
INSERT INTO t1 VALUES(1,2,3), (4,5,6);
CREATE TABLE t2(d,e,f);
INSERT INTO t2 VALUES(3,6,9), (4,8,12);

# This works as expected
SELECT * FROM (SELECT a AS x, b AS y FROM t1 UNION ALL
               SELECT d AS x, e AS y FROM t2);
#1|2
#4|5
#3|6
#4|8

# this doesn't crash...
SELECT * FROM (SELECT a AS x, b AS y FROM t1 UNION ALL
               SELECT d AS x, e AS y, f AS z FROM t2);
#1|2
#4|5
#3|6
#4|8

# but this causes a segfault (it's missing column "y" from the second
# inner SELECT):
SELECT * FROM (SELECT a AS x, b AS y FROM t1 UNION ALL
               SELECT d AS x FROM t2);

--
Steve Pomeroy
MIT Mobile Experience Lab
http://mobile.mit.edu/
Re: [sqlite] light weight write barriers
On 25 Oct 2012, at 2:04am, da...@lang.hm wrote:

> But unless you are a filesystem, how can you make sure that the message data
> is written to file1 before you write the metadata about the message to file2?

Wait for long enough for the disk subsystem to clear its backlog of write commands. A few seconds should do it.

> right now it seems that there is no way for an application to do this other
> than doing a fsync(file1) before writing the metadata to file2

No, as I've posted previously to this thread, you can assume that fsync() literally does nothing. It really is implemented as a 'noop' in many cases.

Simon.
Re: [sqlite] light weight write barriers
> > Hopefully, eventually the storage developers will realize the value
> > behind ordered commands and learn corresponding SCSI facilities to
> > deal with them.
>
> Eventually, drive manufacturers will realize that trying to price
> gouge people who want advanced features such as TCQ, DIF/DIX, is the
> best way to guarantee that most people won't bother to purchase them,
> and hence the features will remain largely unused.

I doubt they care. The profit on high-end features from the people who really need them, I would bet, far exceeds any other benefit of giving it to others. Welcome to capitalism 8)

Plus - spinning rust for those end users is on the way out, SATA to flash is a bit of a hack, and people are already putting a lot of focus onto things like NVM Express.

Alan
Re: [sqlite] transaction commit is successful despite I/O error
I guess this can stand as one more argument for not using SQLite over NFS. Another argument can be found here: http://www.sqlite.org/lockingv3.html#how_to_corrupt.

Pavel

On Wed, Oct 24, 2012 at 2:01 PM, Efim Dyadkin wrote:
> I was probably not quite specific, so I would like to rephrase the problem
> and give more details.
>
> I run a default configuration of SQLite on Linux and I don't use WAL or
> anything that changes transaction behavior. A database is located on a remote
> file server and accessed via NFS. There is a single process only that
> accesses the database. The problem is that when the auto-mounter is stopped
> (on the machine where the process is running) in the middle of a database
> transaction, commit returns successfully though it leaves a hot journal on
> disk, because the journal is not seen for unlink (ENOENT)!!!
>
> Note, when the auto-mounter is down, you can still operate on a database
> file and its journal if they were opened before the stop.
>
> The problem can be reproduced very easily with an SQLite command shell:
> 1) Run the shell
> 2) Begin exclusive transaction
> 3) Insert a few records
> 4) Stop auto-mounter
> 5) Optionally insert a few more records (the stopped auto-mounter does not
>    affect these inserts because the database and journal are already open)
> 6) Commit (no errors!!!)
> 7) Quit the shell
>
> If you then restart the auto-mounter and check the database directory you
> will find a hot journal! If you reopen the database the transaction will be
> rolled back and the recently committed changes will be gone!
>
> Thanks for your feedback,
> Efim
Re: [sqlite] light weight write barriers
On Wed, Oct 24, 2012 at 11:58:49PM -0700, da...@lang.hm wrote:
> The frustrating thing is that when people point out how things like
> sqlite are so horribly slow, the reply seems to be "well, that's
> what you get for doing so many fsyncs, don't do that", when there is
> a 'problem' like the KDE "config loss" problem a few years ago, the
> response is "well, that's what you get for not doing fsync"

Sure... but the answer is to only do the fsync's when you need to. For example, if GNOME and KDE are rewriting the entire registry file each time the application changes a single registry key, then sure, if you rewrite the entire registry file and fsync after each rewrite before you replace the file, you will be safe. And if the application needs to update dozens or hundreds of registry keys (or every time the window gets moved or resized), then yes, it will be slow. But the application didn't have to do that! It could have updated all the registry keys in memory, and then updated the registry file periodically instead.

Similarly, Firefox didn't need to do a sqlite commit after every single time its history file was written, causing a third of a megabyte of write traffic each time you clicked on a web page. It could have batched its updates to the history file, since most of the time, you don't care about making sure the web history is written to stable store before you're allowed to click on a web page and visit the next web page.

Or does rsyslog *really* need to issue an fsync after each log message? Or could it batch updates so that every N seconds, it flushes writes to the disk?

(And this is a problem with most Android applications as well. Apparently the framework APIs are such that it's easier for an application to treat each sqlite statement as an atomic update, so many/most application writers don't use explicit transaction boundaries, so updates don't get batched even though it would be more efficient if they did so.)

Sometimes, the answer is not to try to create exotic database-like functionality in the file system --- the answer is to be more intelligent at the application layer. Not only will the application be more portable, it will also in the end be more efficient, since even with the most exotic database technologies, the most efficient transactional commit is the unneeded commit that you optimize away at the application layer.

- Ted
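To make the batching concrete, here is a minimal sketch using the SQLite C API; the database and table names are made up. The whole change is wrapping the statements in one explicit transaction, so a thousand inserts pay for one commit (and one sync) instead of a thousand:

/* Explicit transaction batching: without BEGIN/COMMIT each INSERT is
 * its own transaction and each pays for a full journal commit.
 * "history.db" and the visits table are hypothetical examples. */
#include <stdio.h>
#include "sqlite3.h"

int main(void)
{
    sqlite3 *db;
    sqlite3_stmt *stmt;
    if (sqlite3_open("history.db", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS visits(url TEXT)", 0, 0, 0);

    sqlite3_exec(db, "BEGIN", 0, 0, 0);            /* one transaction... */
    sqlite3_prepare_v2(db, "INSERT INTO visits VALUES(?)", -1, &stmt, 0);
    for (int i = 0; i < 1000; i++) {
        sqlite3_bind_text(stmt, 1, "http://example.com/", -1, SQLITE_STATIC);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT", 0, 0, 0);           /* ...one commit, one sync */

    sqlite3_close(db);
    return 0;
}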
Re: [sqlite] light weight write barriers
On Thu, Oct 25, 2012 at 02:03:25PM +0100, Alan Cox wrote:
>
> I doubt they care. The profit on high end features from the people who
> really need them I would bet far exceeds any other benefit of giving it to
> others. Welcome to capitalism 8)

Yes, but it's a question of pricing. If they had priced it just a wee bit higher, then there would have been incentive to add support for TCQ so it could actually be used in various Linux file systems, since there would have been lots of users of it. But as it is, the folks who are purchasing huge, vast numbers of these drives --- such as the large cloud providers: Amazon, Facebook, Rackspace, et al. --- will choose to purchase large numbers of commodity drives, and then find ways to work around the missing functionality in userspace.

For example, DIF/DIX would be nice, and if it were available for cheap, I could imagine it being used. But you can accomplish the same thing in userspace, and in fact at Google I've implemented a special not-for-mainline patch which spikes out stable writes (required for DIF/DIX) because it has significant performance overhead, and DIF/DIX has zero benefit if you're not willing to shell out $$$ for hardware that supports it.

Maybe the HDD manufacturers have been able to price-gouge a small number of enterprise I/T shops with more dollars than sense, but personally, I'm not convinced they picked an optimal pricing strategy. Put another way, I accept that Toyota should price a Lexus ES higher than a Camry, but if it's priced at, say, 3x the price of a Camry instead of 20% more, they might find that precious few people are willing to pay that kind of money for what is essentially the same car with minor luxury tweaks added to it.

> Plus - spinning rust for those end users is on the way out, SATA to flash
> is a bit of a hack and people are already putting a lot of focus onto
> things like NVM Express.

Yeah, I don't buy that. One, flash is still too expensive. Two, the capital costs to build enough silicon foundries to replace the current production volume of HDDs is way too expensive for any company to afford (the cloud providers are buying *huge* numbers of HDDs) --- and that's assuming companies wouldn't choose to use those foundries for products with larger margins --- such as, for example, CPU/GPU chips. :-)

And third and finally, if you study the long-term trends in terms of data retention time (going down), program and read disturb (going up), and write endurance (going down) as a function of feature size and/or time, you'd be wise to treat flash as nothing more than a short-term cache, and not as a long-term stable store. If end users completely give up on spinning rust, and store all of their precious family pictures on flash storage, after a couple of years they are likely going to be very disappointed. Speaking personally, I wouldn't want to have anything on flash for more than a few months at *most* before I made sure I had another copy saved on spinning rust platters for long-term retention.

- Ted
Re: [sqlite] Which non-default options are used for the shell?
Baruch Burstein wrote:
> I know FTS is enabled (SQLITE_ENABLE_FTS*), but are any other flags used in
> compiling the default shell?

sqlite> pragma compile_options;
CURDIR
ENABLE_FTS3
ENABLE_RTREE
TEMP_STORE=1
THREADSAFE=0

Regards,
Clemens
[sqlite] Which non-default options are used for the shell?
I know FTS is enabled (SQLITE_ENABLE_FTS*), but are any other flags used in compiling the default shell?

--
˙uʍop-ǝpısdn sı ɹoʇıuoɯ ɹnoʎ 'sıɥʇ pɐǝɹ uɐɔ noʎ ɟı
Re: [sqlite] Bug report concerning the juliandate function
On Wed, Oct 24, 2012 at 1:06 PM, wrote:
> Dear sir or madam,
>
> I have tested sqlite version 3.6 and version 3.7.14.1 with the following
> inputs:
>
> select julianday('2005-02-29');
> select julianday('2005-02-30');
>
> In both cases, it should have returned NULL, but it returned a number.

The date-time functions all accept day numbers up to and including 31 on all months, even months that have fewer days. These extra days on the end of each month roll over into the following month. So, for example, all of the following are equivalent:

julianday('2005-02-29') == julianday('2005-03-01')
julianday('2005-02-30') == julianday('2005-03-02')
julianday('2005-02-31') == julianday('2005-03-03')
julianday('2005-04-31') == julianday('2005-05-01')

> In the first case, the year 2005 is not a leap year. Therefore, this date
> is not valid.
> In the second case, February never has 30 days.
>
> Regards,
>
> Kagan Kayal

--
D. Richard Hipp
d...@sqlite.org
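A sketch of one way to detect such rollover in an application, relying only on the normalization behavior described above: re-render the value with date() and compare it to the input (the literals here are just the ones from the report):

sqlite> SELECT date('2005-02-29');                 -- rolls over
2005-03-01
sqlite> SELECT date('2005-02-29') = '2005-02-29';  -- 0: not a real date
0
sqlite> SELECT date('2005-02-28') = '2005-02-28';  -- 1: a real date
1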
Re: [sqlite] light weight write barriers
On Thu, 25 Oct 2012, Theodore Ts'o wrote:

> On Wed, Oct 24, 2012 at 03:03:00PM -0700, da...@lang.hm wrote:
>> Like what is being described for sqlite, losing the tail end of the
>> messages is not a big problem under normal conditions. But there is
>> a need to be sure that what is there is complete up to the point
>> where it's lost.
>>
>> this is similar in concept to write-ahead-logs done for databases
>> (without the absolute durability requirement)
>
> If that's what you require, and you are using ext3/4, using data
> journalling might meet your requirements. It's something you can
> enable on a per-file basis, via chattr +j; you don't have to force
> all file systems to use data journaling via the data=journalled
> mount option.
>
> The potential downsides that you may or may not care about for this
> particular application:
>
> (a) This will definitely have a performance impact, especially if
> you are doing lots of small (less than 4k) writes, since the data
> blocks will get run through the journal, and only later be written
> to their final location on disk.
>
> (b) You don't get atomicity if the write spans a 4k block boundary.
> All of the bytes before i_size will be written, so you don't have to
> worry about "holes"; but the last message written to the log file
> might be truncated.
>
> (c) There will be a performance impact, since the contents of data
> blocks will be written at least twice (once to the journal, and once
> to the final location on disk). If you do lots of small, sub-4k
> writes, the performance might be even worse, since data blocks might
> be written multiple times to the journal.

I'll have to dig into this option. In the case of rsyslog it sounds like it could work (not as good as a filesystem-independent way of doing things, but better than full fsyncs).

Truncated messages are not great, but they are a detectable, and acceptable, risk.

While the average message size is much smaller than 4K (on my network it's ~250 bytes), the metadata that's broken out expands this somewhat, and we can afford to waste disk space if it makes things safer or more efficient.

If we do update-in-place with flags with each message, each message will need to be written up to three times (on receipt, being processed, finished processing). With high message burst rates, I'm worried that we would fill up the journal; is there a good way to deal with this? I believe that ext4 can put the journal on a different device from the filesystem, would this help a lot?

If you were to put the journal for an ext4 filesystem on a ram disk, you would lose the data recovery protection of the journal, but could you use this trick to get ordered data writes onto the filesystem?

David Lang
Re: [sqlite] light weight write barriers
On Thu, 25 Oct 2012, Theodore Ts'o wrote:

> On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>> By trusting fsync(). And if you don't care about immediate Durability
>> you can run the fsync() in a background thread and mark the associated
>> transaction as completed in the next transaction to be written after
>> the fsync() completes.
>
> The challenge is when you have entangled metadata updates. That is,
> you update file A, and file B, and file A and B might share metadata.
> In order to sync file A, you also have to update part of the metadata
> for the updates to file B, which means calculating the dependencies of
> what you have to drag in can get very complicated. You can keep track
> of what bits of the metadata you have to undo and then redo before
> writing out the metadata for fsync(A), but that basically means you
> have to implement soft updates, and all of the complexity this
> implies: http://lwn.net/Articles/339337/
>
> If you can keep all of the metadata separate, this can be somewhat
> mitigated, but usually the block allocation records (regardless of
> whether you use a tree, or a bitmap, or some other data structure)
> tend to have entanglement problems.

hmm, two thoughts occur to me.

1. to avoid entanglement, put the two files in separate directories

2. take advantage of entanglement to enforce ordering

thread 1 (repeated): write new message to file 1, spawn new thread to fsync
thread 2: write to file 2 that messages 1-5 are being worked on
thread 2 (later): write to file 2 that messages 1-5 are done

When thread 1 spawns the new thread to do the fsync, the system will be forced to write the data to file 2 as of the time it does the fsync. This should make it so that you never have data written to file2 that refers to data that hasn't been written to file1 yet.

> It certainly is not impossible; RDBMS's have implemented this. On the
> other hand, they generally aren't as fast as file systems for
> non-transactional workloads, and people really care about performance
> on those sorts of workloads for file systems.

The RDBMS's have implemented stronger guarantees than what we need here.

A few years ago I was investigating this for logging. With the reliable (RDBMS-style), but inefficient, disk queue that rsyslog has, writing to a high-end Fusion-io SSD, ext2 resulted in ~8K logs/sec, ext3 resulted in ~2K logs/sec, and JFS/XFS resulted in ~4K logs/sec (ext4 wasn't considered stable enough at the time to be tested).

> Still, if you want to try to implement such a thing, by all means,
> give it a try. But I think you'll find that creating a file system
> that can compete with existing file systems for performance, and
> *then* also supports a transactional model, is going to be quite a
> challenge.

The question is trying to figure out a way to get ordering right with existing filesystems (preferably without using something too tied to a single filesystem implementation), not to try and create a new one.

The frustrating thing is that when people point out how things like sqlite are so horribly slow, the reply seems to be "well, that's what you get for doing so many fsyncs, don't do that", while when there is a 'problem' like the KDE "config loss" problem a few years ago, the response is "well, that's what you get for not doing fsync".

Both responses are correct, from a purely technical point of view. But what's missing is any way to get the result of ordered I/O that will let you do something pretty fast, but with the guarantee that, if you lose data in a crash, the only loss you are risking is that your most recent data may be missing (either for one file, or using multiple files if that's what it takes).

Since this topic came up again, I figured I'd poke a bit and try to either get educated on how to do this "right" or try and see if there's something that could be added to the kernel to make it possible for userspace programs to do this.

What I think userspace really needs is something like a barrier function call: "for this fd, don't re-order writes as they go down through the stack".

If the hardware is going to reorder things once it hits the hardware, this is going to hurt performance (how much depends on a lot of stuff), but the filesystems are able to make their journals work, so there should be some way to let userspace do some sort of similar ordering.

David Lang
Re: [sqlite] light weight write barriers
On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>
> By trusting fsync(). And if you don't care about immediate Durability
> you can run the fsync() in a background thread and mark the associated
> transaction as completed in the next transaction to be written after
> the fsync() completes.

The challenge is when you have entangled metadata updates. That is, you update file A, and file B, and file A and B might share metadata. In order to sync file A, you also have to update part of the metadata for the updates to file B, which means calculating the dependencies of what you have to drag in can get very complicated. You can keep track of what bits of the metadata you have to undo and then redo before writing out the metadata for fsync(A), but that basically means you have to implement soft updates, and all of the complexity this implies: http://lwn.net/Articles/339337/

If you can keep all of the metadata separate, this can be somewhat mitigated, but usually the block allocation records (regardless of whether you use a tree, or a bitmap, or some other data structure) tend to have entanglement problems.

It certainly is not impossible; RDBMS's have implemented this. On the other hand, they generally aren't as fast as file systems for non-transactional workloads, and people really care about performance on those sorts of workloads for file systems. (About a decade ago, Oracle tried to claim that you could run file system workloads using an Oracle database as a back-end. Everyone laughed at them, and the idea died a quick, merciful death.)

Still, if you want to try to implement such a thing, by all means, give it a try. But I think you'll find that creating a file system that can compete with existing file systems for performance, and *then* also supports a transactional model, is going to be quite a challenge.

- Ted
Re: [sqlite] light weight write barriers
On Wed, Oct 24, 2012 at 03:03:00PM -0700, da...@lang.hm wrote:
> Like what is being described for sqlite, losing the tail end of the
> messages is not a big problem under normal conditions. But there is
> a need to be sure that what is there is complete up to the point
> where it's lost.
>
> this is similar in concept to write-ahead-logs done for databases
> (without the absolute durability requirement)

If that's what you require, and you are using ext3/4, using data journalling might meet your requirements. It's something you can enable on a per-file basis, via chattr +j; you don't have to force all file systems to use data journaling via the data=journalled mount option.

The potential downsides that you may or may not care about for this particular application:

(a) This will definitely have a performance impact, especially if you are doing lots of small (less than 4k) writes, since the data blocks will get run through the journal, and only later be written to their final location on disk.

(b) You don't get atomicity if the write spans a 4k block boundary. All of the bytes before i_size will be written, so you don't have to worry about "holes"; but the last message written to the log file might be truncated.

(c) There will be a performance impact, since the contents of data blocks will be written at least twice (once to the journal, and once to the final location on disk). If you do lots of small, sub-4k writes, the performance might be even worse, since data blocks might be written multiple times to the journal.

- Ted
Re: [sqlite] light weight write barriers
On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
> Yes, SCSI has full support for ordered/simple commands designed
> exactly for that task: to have a steady flow of commands even in the
> case when some of them are ordered.

SCSI does, yes --- *if* the device actually implements Tagged Command Queuing (TCQ). Not all devices do.

More importantly, SATA drives do *not* have this capability, and when you compare the price of SATA drives to uber-expensive "enterprise drives", it's not surprising that most people don't actually use SCSI/SAS drives that have implemented TCQ.

SATA's Native Command Queuing (NCQ) is not equivalent; this allows the drive to reorder requests (in particular read requests) so they can be serviced more efficiently, but it does *not* allow the OS to specify a partial, relative ordering of requests.

Yes, you can turn off writeback caching, but that has pretty huge performance costs; and there is the FUA bit, but that's just an unconditional high-priority bypass of the writeback cache, which is useful in some cases, but which again does not give the OS the ability to specify a partial order, while letting the drive reorder other requests for efficiency/performance's sake, since the drive has a lot more information about the optimal way to reorder requests based on the current location of the drive head and where certain blocks may have been remapped due to bad block sparing, etc.

> Hopefully, eventually the storage developers will realize the value
> behind ordered commands and learn corresponding SCSI facilities to
> deal with them.

Eventually, drive manufacturers will realize that trying to price gouge people who want advanced features such as TCQ, DIF/DIX, is the best way to guarantee that most people won't bother to purchase them, and hence the features will remain largely unused.

- Ted
Re: [sqlite] light weight write barriers
On Wed, 24 Oct 2012, Nico Williams wrote:

> On Wed, Oct 24, 2012 at 5:03 PM, wrote:
>> I'm doing some work with rsyslog and its disk-based queues and there
>> is a similar issue there. The good news is that we can have a version
>> that is linux specific (rsyslog is used on other OSs, but there is an
>> existing queue implementation that they can use; if the faster one is
>> linux-only, but is significantly faster, that's just a win for Linux)
>>
>> Like what is being described for sqlite, losing the tail end of the
>> messages is not a big problem under normal conditions. But there is a
>> need to be sure that what is there is complete up to the point where
>> it's lost.
>>
>> this is similar in concept to write-ahead-logs done for databases
>> (without the absolute durability requirement)
>>
>> [...]
>>
>> I am not fully understanding how what you are describing (COW,
>> separate fsync threads, etc) would be implemented on top of existing
>> filesystems. Most of what you are describing seems like it requires
>> access to the underlying storage to implement.
>>
>> could you give a more detailed explanation?
>
> COW is "copy on write", which is actually a bit of a misnomer -- all
> COW means is that blocks aren't over-written; instead new blocks are
> written. In particular this means that inodes, indirect blocks, data
> blocks, and so on, that are changed are actually written to new
> locations, and the on-disk format needs to handle this indirection.

so how can you do this, and keep the writes in order (especially between two files), without being the filesystem?

> As for fsync() and background threads... fsync() is synchronous, but
> in this scheme we want it to happen asynchronously and then we want to
> update each transaction with a pointer to the last transaction that is
> known stable given an fsync()'s return.

If you could specify ordering between two writes, I could see a process along the lines of:

Append new message to file1.
Append tiny status updates to file2.
Every million messages, move to new files. Once the last message has been processed for the old set of files, delete them.

Since file2 is small, you can reconstruct state fairly cheaply.

But unless you are a filesystem, how can you make sure that the message data is written to file1 before you write the metadata about the message to file2?

Right now it seems that there is no way for an application to do this other than doing an fsync(file1) before writing the metadata to file2.

And there is no way for the application to tell the filesystem to write the data in file2 in order (to make sure that block 3 is not written and then have the system crash before block 2 is written), so the application needs to do frequent fsync(file2) calls.

If you need complete durability of your data, there are well documented ways of enforcing it (including the lwn.net article http://lwn.net/Articles/457667/ ). But if you don't need the guarantee that your data is on disk now, and you just need to have it ordered so that if you crash you can be guaranteed only to lose data off of the tail of your file, there doesn't seem to be any way to do this other than using the fsync() hammer and waiting for the overhead of forcing the data to disk now.

Or, as I type this, it occurs to me that you may be saying that every time you want an ordering guarantee, spawn a new thread to do the fsync and then just keep processing. The fsync will happen at some point, and the writes will not be re-ordered across the fsync, but you can keep going, writing more data while the fsync's are pending.

Then if you have a filesystem and I/O subsystem that can consolidate the fsyncs from all the different threads together into one I/O operation without having to flush the entire I/O queue for each one, you can get acceptable performance, with ordering. If the system crashes, data that hasn't had its fsync() complete will be the only thing that is lost.

David Lang
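A minimal sketch of that spawn-a-thread-per-fsync idea, purely illustrative: the file name is made up, error handling is minimal, and fsync() here still only orders what has reached the file by the time it runs:

/* Write a record, then hand the fsync to a detached background
 * thread so the writer never blocks on the flush. */
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>

static void *fsync_worker(void *arg)
{
    int fd = (int)(long)arg;
    fsync(fd);                 /* durability point for all writes so far */
    return NULL;
}

static void write_and_async_fsync(int fd, const char *msg)
{
    write(fd, msg, strlen(msg));          /* append the record */
    pthread_t t;
    pthread_attr_t a;
    pthread_attr_init(&a);
    pthread_attr_setdetachstate(&a, PTHREAD_CREATE_DETACHED);
    pthread_create(&t, &a, fsync_worker, (void *)(long)fd);
    pthread_attr_destroy(&a);
    /* caller continues immediately; the fsync completes in background */
}

int main(void)
{
    int fd = open("queue.log", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0) return 1;
    write_and_async_fsync(fd, "message 1\n");
    write_and_async_fsync(fd, "message 2\n");
    sleep(1);   /* a real program would wait for pending fsyncs before exit */
    close(fd);
    return 0;
}

Whether the kernel can actually consolidate the resulting flushes, rather than draining the whole I/O queue for each one, is exactly the open question in this thread.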
Re: [sqlite] light weight write barriers
On Wed, 24 Oct 2012, Nico Williams wrote:

>> Before that happens, people will keep returning again and again with
>> those simple questions: why must the queue be flushed for any ordered
>> operation? Isn't it an obvious overkill?
>
> That [cache flushing] is not what's being asked for here. Just a
> light-weight barrier. My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,
> d) have an array of well-known uberblocks large enough to accommodate
> as many transactions as possible without having to wait for any one
> fsync() to complete, e) do not reclaim space from any one past
> transaction until at least one subsequent transaction is fully
> committed. This obtains ACI- transaction semantics (survives power
> failures but without durability for the last N transactions at
> power-failure time) without requiring changes to the OS at all, and
> with support for delayed D (durability) notification.

I'm doing some work with rsyslog and its disk-based queues and there is a similar issue there. The good news is that we can have a version that is linux specific (rsyslog is used on other OSs, but there is an existing queue implementation that they can use; if the faster one is linux-only, but is significantly faster, that's just a win for Linux).

Like what is being described for sqlite, losing the tail end of the messages is not a big problem under normal conditions. But there is a need to be sure that what is there is complete up to the point where it's lost.

this is similar in concept to write-ahead-logs done for databases (without the absolute durability requirement)

1. new messages arrive and get added to the end of the queue file.
2. a thread updates the queue to indicate that it is in the process of delivering a block of messages
3. the thread updates the queue to indicate that the block of messages has been delivered
4. garbage collection happens to delete the old messages to free up space (if queues go into files, this can just be to limit the file size, spilling to multiple files; when an old file is completely marked as delivered, delete it)

I am not fully understanding how what you are describing (COW, separate fsync threads, etc) would be implemented on top of existing filesystems. Most of what you are describing seems like it requires access to the underlying storage to implement.

could you give a more detailed explanation?

David Lang
Re: [sqlite] transaction commit is successful despite I/O error
I was probably not quite specific, so I would like to rephrase the problem and give more details.

I run a default configuration of SQLite on Linux and I don't use WAL or anything that changes transaction behavior. A database is located on a remote file server and accessed via NFS. There is a single process only that accesses the database. The problem is that when the auto-mounter is stopped (on the machine where the process is running) in the middle of a database transaction, commit returns successfully though it leaves a hot journal on disk, because the journal is not seen for unlink (ENOENT)!!!

Note, when the auto-mounter is down, you can still operate on a database file and its journal if they were opened before the stop.

The problem can be reproduced very easily with an SQLite command shell:

1) Run the shell
2) Begin exclusive transaction
3) Insert a few records
4) Stop auto-mounter
5) Optionally insert a few more records (the stopped auto-mounter does not affect these inserts because the database and journal are already open)
6) Commit (no errors!!!)
7) Quit the shell

If you then restart the auto-mounter and check the database directory you will find a hot journal! If you reopen the database the transaction will be rolled back and the recently committed changes will be gone!

Thanks for your feedback,
Efim
[sqlite] Bug report concerning the juliandate function
Dear sir or madam,

I have tested sqlite version 3.6 and version 3.7.14.1 with the following inputs:

select julianday('2005-02-29');
select julianday('2005-02-30');

In both cases, it should have returned NULL, but it returned a number.

In the first case, the year 2005 is not a leap year; therefore, this date is not valid. In the second case, February never has 30 days.

Regards,

Kagan Kayal
Re: [sqlite] how to select " char in sqlite
On Thu, 25 Oct 2012 01:20:24 +, YAN HONG YE wrote:

> char bh1[320];
> memset(bh1,0,320);
> strcpy(bh1,"sqlite3 -html -header t9_engine.db \"select id,partnumber,substr(\'\',1,180) as img,pcs from engine where id>7;\" >> n.html");
> system(bh1); //here couldn't work

As we said before, "here couldn't work" is insufficient to describe your problem. Please show us the contents of bh1 just before the system() call, and the error codes and error messages returned.

It is bad practice to shell out to the sqlite3 commandline tool from within a program. As you see, you'll get all kinds of quoting issues (probably the root cause of the failure here). You are supposed to use the API, either the C API http://sqlite.org/c3ref/intro.html or some API provided by a "wrapper" for the computer language your application is coded in. There is excellent sample code on the internet, like http://icculus.org/~chunky/stuff/sqlite3_example/sqlite3_example_bind.c

By the way, I don't think the substr() is at the right place.

Hope this helps.

--
Groet, Cordialement, Pozdrawiam, Regards,
Kees Nuyt
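A minimal sketch of the C-API route suggested above: run the query directly and write the HTML yourself, so no shell quoting is involved and the " character can be emitted literally. The "imgpath" column and the HTML row layout are assumptions about the original schema, not what sqlite3 -html produces:

/* Query the engine table through the SQLite C API and emit HTML rows
 * with literal " characters -- no escaping battle with system().
 * The 'imgpath' column is hypothetical. */
#include <stdio.h>
#include "sqlite3.h"

int main(void)
{
    sqlite3 *db;
    sqlite3_stmt *stmt;
    if (sqlite3_open("t9_engine.db", &db) != SQLITE_OK) return 1;

    const char *sql =
        "SELECT id, partnumber, imgpath, pcs FROM engine WHERE id > ?";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, 0) != SQLITE_OK) return 1;
    sqlite3_bind_int(stmt, 1, 7);              /* id > 7, as in the post */

    FILE *out = fopen("n.html", "a");
    if (out == NULL) return 1;
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        fprintf(out, "<tr><td>%d</td><td>%s</td>"
                     "<td><img src=\"%s\"></td><td>%d</td></tr>\n",
                sqlite3_column_int(stmt, 0),
                (const char *)sqlite3_column_text(stmt, 1),
                (const char *)sqlite3_column_text(stmt, 2),
                sqlite3_column_int(stmt, 3));
    }
    fclose(out);
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}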