On Tue, Dec 7, 2010 at 9:49 AM, Yoni Londner <yonih...@gmail.com> wrote:
> Hi, > > Yes, in this scheme the checksum is based on salt values and own frame > content.a > > Note that the current design solve a potential DB corruption bug in > sqlite. current WAL design is base on the fact that once sqlite writes > pages successfully to the WAL, they will never get corrupted. but this > assumption is not true. take for example the following situation: > > H 1 1 1 2 2 2 3 3 3 3 > > We have here 10 pages in 3 transactions. lets say that sqlite stated a > checkpoint, succesfully checkpointed transaction 1 and 2, and started > copy transaction 3 to the DB. while copying the first pages of > transaction 3, pages from transaction 4 are written to the WAL. > now, since the pages most likely are not aligned to the sector size, the > OS might read part of last page of transaction 3, and write it along > with the first page of transaction 4. > If a power failure occur at this point, then the first pages of > transactions 3 already copied to the DB, while last page of transaction > 3 is corrupted, so when recovering, sqlite will not complete copying > transaction 3 to the DB, and DB we stay corrupted. > > In my design, I used a padding to avoid this situation. > The current SQLite WAL also pads each transaction out to a sector boundary for exactly this reason. > > while this problem can occur on any device, it is more likely to happen > on devices which use flash memory (mostly mobile devices), since the > size of a sector of flash memory tend to be larger than on non flash > memory. > > Yoni. > > On 3/12/2010 7:33 AM, Dan Kennedy wrote: > > > > In the current WAL format, the checksum for each frame is based on > > the contents of the frame, the salt-values in the wal header and the > > checksum of the previous frame. > > > > In this scheme is each frame checksum independent? i.e. each frame > > checksum is computed based only on the salt values in the WAL header > > and the frames own contents? > > > > > > Dan. > > > > > > > > On 12/02/2010 11:04 PM, Yoni Londner wrote: > >> Hi, > >> > >> I will start out by stating that I am not deeply familiar with the > >> sqlite& WAL file layout, but I will try anyway, from my current > >> understanding: > >> > >> The general problem that needs to be solved is to allow SQLITE to be > >> constantly used over time (with no 'idle' time where no sql operations > >> are done, and it is not inside any transaction). > >> > >> The current WAL design does not allow this, since if there are all the > >> time open read transactions and/or write transactions, the WAL will > >> continue to grow forever. > >> I copy/paste below a tiny C program that re-creates this with write > >> transactions. > >> > >> Remember also that in the WAL file we 'gather' up a lot of work. If we > >> do it in the background, this work-load can be smoothed up to run in > >> parallel to the regular system work, but if we must make the SQL 'idle' > >> (close all transactions) in order to execute the checkpoint, on a heavy > >> load system this can mean halting the system for a long period of time > >> (in our case, typically for 30 seconds!). > >> > >> This needs to be solved by two "features" to be added to WAL: > >> > >> 1) 'incremental' checkpoint. > >> > >> 2) WAL file recycling (this item can also be solved by two WAL files, > >> but I think its better to make the WAL file format a little bit more > >> complex than start having to handle in the code management of multiple > >> files). > >> > >> Incremental checkpointing means that checkpointing can be done up until > >> the first open transaction (mxFrame). This means both copying the WAL > >> pages up-until mxFrame of the first open transaction to the DB file, and > >> then MARKING those frames as 'DONE' (I will talk later on how to do the > >> 'DONE' marking). > >> > >> This means that from that point onwards - accessing those pages will be > >> to the DB file, not the WAL file. > >> > >> WAL recycling will be done by writing pages to the beginning of the WAL > >> file when a certain amount of pages from the beginning of the WAL file > >> are 'checkpointed' (marked as DONE). This can also happen in the middle > >> of a transaction. > >> > >> Example: > >> > >> Legend: > >> H - header. > >> 1, 2, 3.. - page of transaction 1, 2, 3.. > >> C - commit marker. > >> BOF1: beginning of WAL-1. EOF1: end of WAL-1 > >> BOF2: beginning of WAL-2. EOF2: end of WAL-2 > >> P: 64K of padding (junk data) > >> > >> WAL file with transactions 1 2 and 3 committed, and transaction 4 open: > >> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 > >> +-- BOF1 +--- EOF1 > >> > >> We continue to add to transaction 4: > >> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4 > >> +-- BOF1 +--- EOF1 > >> > >> In the meantime, we checkpointed transactions 1 and 2, because there is > >> a read transaction working on transaction 3: > >> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4 > >> +-- BOF1 +--- EOF1 > >> +----- checkpointed up to here > >> > >> No we decided we want to recycle. Since there is no read transaction > >> open on transaction 1 and 2 (cannot be, since if you need a page from > >> transaction 1, you will find it in the DB), we can reuse them: > >> H 4 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4 > >> +-- BOF2 +-- BOF1 +--- EOF1 > >> +-- EOF2 +---- checkpointed up to here > >> +------ the 6th page of transaction 4 > >> > >> And now we close transaction 4: > >> H 4 4C 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4 > >> | +-- EOF2 +-- BOF1 +--- EOF1 > >> +--- BOF2 +---- checkpointed up to here > >> > >> Lets open transaction 5, and write faster than the background > checkpointing: > >> > >> H 4 4C 5 5 5 5 5 3 3 3 3C 4 4 4 4 4 > >> | | +-- BOF1 +--- EOF1 > >> +--- BOF2 | +---- checkpointed up to here > >> +-- EOF2 > >> > >> So we need more place for 5, we will write it at the end: > >> H 4 4C 5 5 5 5 5 3 3 3 3C 4 4 4 4 4 5 5 5 5C > >> | | | | | +-- EOF3 > >> | | | | +--- BOF3 > >> | | +-- BOF1 +--- EOF1 > >> +--- BOF2 | +---- checkpointed up to here > >> +-- EOF2 > >> > >> Ok. This is the most complicated 'mess up' of this WAL file possible. It > >> cannot get 'messier' than this, since we recyclye ONLY when there is a > >> minimum of N (configurable) amount of checkpointed pages at the > >> beginning of the WAL file AND if there are no 'gaps' in the not-yet > >> checkpointed pages. > >> The first condition is mainly for efficiency or large workloads (since > >> recycling has its costs). This second condition is to prevent the WAL > >> files layout to have multiple recycling points in it (prevents it from > >> getting 'messed up'). A wal file can only have at MAX one recycle point > >> at one point in file, and MAX of 2 empty gaps. > >> If we continue the example above, this can happen when we checkpoint > >> transactions 3 and 4 but not transaction 5. So we have a gap at the > >> beginning (end of transaction 4) and in the middle (transaction 3 and > >> beginning of transaction 4) > >> > >> H 4 4C 5 5 5 5 5 3 3 3 3C 4 4 4 4 4 5 5 5 5C > >> +-- BOF1| +- BOF2| > >> +--- EOF1 +--- EOF2 > >> > >> > >> How to mark as DONE: > >> -------------------- > >> We actually don't mark a frame as DONE. > >> We give each frame a sequence number, and we write on each commit frame > >> the current BOF1. > >> We can figure out which frame is good and which is junk using the > >> following algorithm: > >> 1. scan the WAL file and build the index. some of the frames in the > >> index are invalid. > >> 2. search for the commit frame with largest sequence number, and get > >> the BOF1 from it. > >> 3. pass on all frames from BOF1 to last commit frame and make sure it's > >> valid (salt-1, salt-2 and the checksum are valid). > >> 4. In case one of the frames is not valid, find the commit frame > >> previous to the one chosen in step 2, and do steps 3 and 4 again. > >> 5. remove all invalid frames from the index. > >> > >> First example: > >> 0 0 0 0 0 0 0 0 0 1 1 1 1 1 > >> 1 2 3 4 5 6 7 8 9 0 1 2 3 4 > >> H P 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 > >> +-- BOF1 +--- EOF1 > >> > >> 1. we build index file for the above WAL. > >> 2. Last commit frame is frame 11, so we check for the BOF1 and find it > >> is frame 01. > >> 3. we check that each frame between frame 01 and frame 11 is valid > >> > >> Second example: > >> > >> 1 1 1 2 2 0 0 1 1 1 1 1 1 1 2 2 2 2 > >> 7 8 9 0 1 8 9 0 1 2 3 4 5 6 2 3 4 5 > >> H P 4C 5 5 5 5 P 3 3 3 3C 4 4 4 4 4 P 5 5 5 5C > >> | | | | | +-- EOF3 > >> | | | | +--- BOF3 > >> | | +-- BOF1 +--- EOF1 > >> +--- BOF2 | +---- checkpointed up to here > >> +-- EOF2 > >> > >> 1. we build index file for the above WAL. > >> 2. Last commit frame is frame 25, so we check for the BOF1 and find it > >> is frame 08. > >> 3. we check that each frame between frame 08 and frame 25 is valid. > >> 4. lets say that we tried to write the HD frames 20, 21, 24 and 25, and > >> in the middle of this the system crashed. the OS and HD can write frames > >> 24 and 25 before frames 20-21, so it is possible to frames 24-25 to be > >> good, and frames 20-21 to be corrupted. > >> 5. once we know frames 20-21 are corrupted, we search for previus commit > >> frame, which is frame 17, and do the check again (this time is should > pass). > >> > >> > >> How to avoid writing sectors near 'critical' sectors: > >> ----------------------------------------------------- > >> HD's and OS's work with basic 'sector' units that start from 512, but > >> can in many cases even reach up to 32K! > >> 'critical sectors' in a WAL file are sectors that if lost can lead to DB > >> corruption. For example, a WAL frame of a checkpoint that stopped in the > >> middle of transfer to the DB, and then that WAL frame is lost (due to > >> corrupt), then next time the system is up - the DB may be corrupt. > >> So pages of such a WAL frame I call 'critical'. > >> > >> In the current 'append only' WAL implementation, such corruption cannot > >> happen. But, once we add recycling feature, this could happen, by > >> writing to a adjacent sector in the FILE (just before or just after) - > >> which if the system is crashes in the middle - could corrupt that > >> logical sector (since the HD and OS have a minimum sector size which > >> they write to). > >> > >> To avoid this, we need to keep a 'safe distance' from 'critical sectors' > >> in the WAL. How we do this: whenever we recycle, we add at the end of > >> the file PAD ('PAD' is 64K of pages - with no content - just used as > >> padding). This PAD at EOF will be used later on when BOF3 needs to > >> append at end of file. > >> At the beginning of the file, in order not to overwrite the header H, it > >> will skip PAD bytes and start only there writing the BOF2. > >> EOF2 will stop PAD bytes before BOF1, and then skip to BOF3 (which > >> already from before had PAD bytes distance from EOF1). > >> > >> So the real layout of the previous example will be more like this: > >> H P 4C 5 5 5 5 P 3 3 3 3C 4 4 4 4 4 P 5 5 5 5C > >> | | | | | +-- EOF3 > >> | | | | +--- BOF3 > >> | | +-- BOF1 +--- EOF1 > >> +--- BOF2 | +---- checkpointed up to here > >> +-- EOF2 > >> > >> This allows us to write to the WAL file without getting 'too close' to > >> 'critical sectors'. > >> > >> I can continue deeper to the headers structures (some fields needs to be > >> added) and to what is the structure of the index file, but I want to > >> hear your opinion about all the above first, and if this sounds > >> reasonable, we can continue work on it. > >> > >> Yoni > >> > >> > >> Example program (WAL file grow limitlessly): > >> ---------------- > >> /* to compile: gcc -g test.c PATH_TO_SQLITE/libsqlite3.a -lrt */ > >> #include "sqlite3.h" > >> #include "stdio.h" > >> #include "stdlib.h" > >> #include "fcntl.h" > >> > >> static void sql_exec(sqlite3 *conn, char *query) > >> { > >> char *err; > >> if (sqlite3_exec(conn, query, NULL, 0,&err)) > >> { > >> printf("sqlite: failed exec '%s'. err: %s\n", query, err); > >> exit(1); > >> } > >> } > >> > >> static sqlite3 *sql_open_conn(void) > >> { > >> sqlite3 *conn; > >> if (sqlite3_open_v2("test.db",&conn, SQLITE_OPEN_READWRITE, > NULL)) > >> { > >> printf("sqlite3_open_v2 failed\n"); > >> exit(1); > >> } > >> return conn; > >> } > >> > >> static int do_checkpoint() > >> { > >> sqlite3 *conn; > >> while (1) > >> { > >> sleep(1); > >> printf("calling wal checkpoint\n"); > >> fflush(0); > >> conn = sql_open_conn(); > >> sql_exec(conn, "PRAGMA wal_checkpoint"); > >> sqlite3_close(conn); > >> } > >> } > >> > >> int main(int argc, char **argv) > >> { > >> sqlite3 *conn = NULL; > >> char *err_msg = NULL; > >> pthread_t thread; > >> int fd, i = 0; > >> unlink("test.db"); > >> unlink("test.db-wal"); > >> unlink("test.db-shm"); > >> if ((fd = open("test.db", O_CREAT|O_RDWR, 0666)<0)) > >> { > >> printf("could not open test.db\n"); > >> exit(1); > >> } > >> close(fd); > >> conn = sql_open_conn(); > >> sql_exec(conn, "PRAGMA journal_mode=WAL"); > >> sql_exec(conn, "PRAGMA wal_autocheckpoint=-1"); > >> sql_exec(conn, "create table tbl1 (one varchar(20), two > varchar(20))"); > >> if (pthread_create(&thread, NULL, do_checkpoint, NULL)) > >> { > >> printf("could not start thread\n"); > >> exit(1); > >> } > >> sql_exec(conn, "BEGIN TRANSACTION"); > >> while (1) > >> { > >> if (!(i++%10000)) > >> { > >> sql_exec(conn, "END TRANSACTION"); > >> sql_exec(conn, "BEGIN TRANSACTION"); > >> } > >> sql_exec(conn, "INSERT INTO tbl1 values('aaaaaaaaaaaaaaaaaaa', " > >> "'bbbbbbbbbbbbbbbbbbb')"); > >> } > >> sql_exec(conn, "END TRANSACTION"); > >> sql_exec(conn, "PRAGMA wal_checkpoint"); > >> sqlite3_close(conn); > >> return 0; > >> } > >> > >> On 29/11/2010 5:13 PM, Pavel Ivanov wrote: > >>>> Well, I love sqlite, and I want to continue using it (small, fast, > >>>> reliable ...). > >>>> I think it is better to solve such problems inside sqlite > >>> > >>> It's impossible. Just try to design the solution you want. Think of > >>> how SQLite should behave to make you happy, think of it with all > >>> details and don't forget that SQLite is a library living inside > >>> several processes simultaneously, sometimes even on different > >>> computers (although it's discouraged), those processes can crash > >>> unpredictably and you have to manage all crashes predictably. > >>> So if you are able to come up with some solution and post all > >>> technical details here I think SQLite developers will be happy to > >>> implement it. > >>> > >>> > >>> Pavel > >>> > >>> On Sun, Nov 28, 2010 at 3:19 AM, Yoni Londner<yonih...@gmail.com> > wrote: > >>>> Hi, > >>>> > >>>> > For a large scale system you have a third choice: use some > other RDBMS > >>>> > which is implemented in one process and has a much better > control over > >>>> > its data and much better communication between readers and > writers. > >>>> Well, I love sqlite, and I want to continue using it (small, fast, > >>>> reliable ...). > >>>> I think it is better to solve such problems inside sqlite (Not to > >>>> mention you don't need large scale system to reproduce this issue, 50 > >>>> lines of C code is enough). > >>>> > >>>> Yoni. > >>>> _______________________________________________ > >>>> sqlite-users mailing list > >>>> sqlite-users@sqlite.org > >>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > >>>> > >>> _______________________________________________ > >>> sqlite-users mailing list > >>> sqlite-users@sqlite.org > >>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > >> _______________________________________________ > >> sqlite-users mailing list > >> sqlite-users@sqlite.org > >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > >> > > > > _______________________________________________ > > sqlite-users mailing list > > sqlite-users@sqlite.org > > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > -- D. Richard Hipp d...@sqlite.org _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users