Re: [sqlite] WAL file size

Richard Hipp Wed, 08 Dec 2010 03:26:16 -0800

On Tue, Dec 7, 2010 at 9:49 AM, Yoni Londner <yonih...@gmail.com> wrote:


> Hi,
>
> Yes, in this scheme the checksum is based on salt values and own frame
> content.a
>
> Note that the current design solve a potential DB corruption bug in
> sqlite. current WAL design is base on the fact that once sqlite writes
> pages successfully to the WAL, they will never get corrupted. but this
> assumption is not true. take for example the following situation:
>
> H 1 1 1 2 2 2 3 3 3 3
>
> We have here 10 pages in 3 transactions. lets say that sqlite stated a
> checkpoint, succesfully checkpointed transaction 1 and 2, and started
> copy transaction 3 to the DB. while copying the first pages of
> transaction 3, pages from transaction 4 are written to the WAL.
> now, since the pages most likely are not aligned to the sector size, the
> OS might read part of last page of transaction 3, and write it along
> with the first page of transaction 4.
> If a power failure occur at this point, then the first pages of
> transactions 3 already copied to the DB, while last page of transaction
> 3 is corrupted, so when recovering, sqlite will not complete copying
> transaction 3 to the DB, and DB we stay corrupted.
>
> In my design, I used a padding to avoid this situation.
>

The current SQLite WAL also pads each transaction out to a sector boundary
for exactly this reason.


>
> while this problem can occur on any device, it is more likely to happen
> on devices which use flash memory (mostly mobile devices), since the
> size of a sector of flash memory tend to be larger than on non flash
> memory.
>
> Yoni.
>
> On 3/12/2010 7:33 AM, Dan Kennedy wrote:
> >
> > In the current WAL format, the checksum for each frame is based on
> > the contents of the frame, the salt-values in the wal header and the
> > checksum of the previous frame.
> >
> > In this scheme is each frame checksum independent? i.e. each frame
> > checksum is computed based only on the salt values in the WAL header
> > and the frames own contents?
> >
> >
> > Dan.
> >
> >
> >
> > On 12/02/2010 11:04 PM, Yoni Londner wrote:
> >> Hi,
> >>
> >> I will start out by stating that I am not deeply familiar with the
> >> sqlite&   WAL file layout, but I will try anyway, from my current
> >> understanding:
> >>
> >> The general problem that needs to be solved is to allow SQLITE to be
> >> constantly used over time (with no 'idle' time where no sql operations
> >> are done, and it is not inside any transaction).
> >>
> >> The current WAL design does not allow this, since if there are all the
> >> time open read transactions and/or write transactions, the WAL will
> >> continue to grow forever.
> >> I copy/paste below a tiny C program that re-creates this with write
> >> transactions.
> >>
> >> Remember also that in the WAL file we 'gather' up a lot of work. If we
> >> do it in the background, this work-load can be smoothed up to run in
> >> parallel to the regular system work, but if we must make the SQL 'idle'
> >> (close all transactions) in order to execute the checkpoint, on a heavy
> >> load system this can mean halting the system for a long period of time
> >> (in our case, typically for 30 seconds!).
> >>
> >> This needs to be solved by two "features" to be added to WAL:
> >>
> >> 1) 'incremental' checkpoint.
> >>
> >> 2) WAL file recycling (this item can also be solved by two WAL files,
> >> but I think its better to make the WAL file format a little bit more
> >> complex than start having to handle in the code management of multiple
> >> files).
> >>
> >> Incremental checkpointing means that checkpointing can be done up until
> >> the first open transaction (mxFrame). This means both copying the WAL
> >> pages up-until mxFrame of the first open transaction to the DB file, and
> >> then MARKING those frames as 'DONE' (I will talk later on how to do the
> >> 'DONE' marking).
> >>
> >> This means that from that point onwards - accessing those pages will be
> >> to the DB file, not the WAL file.
> >>
> >> WAL recycling will be done by writing pages to the beginning of the WAL
> >> file when a certain amount of pages from the beginning of the WAL file
> >> are 'checkpointed' (marked as DONE). This can also happen in the middle
> >> of a transaction.
> >>
> >> Example:
> >>
> >> Legend:
> >> H - header.
> >> 1, 2, 3.. - page of transaction 1, 2, 3..
> >> C - commit marker.
> >> BOF1: beginning of WAL-1. EOF1: end of WAL-1
> >> BOF2: beginning of WAL-2. EOF2: end of WAL-2
> >> P: 64K of padding (junk data)
> >>
> >> WAL file with transactions 1 2 and 3 committed, and transaction 4 open:
> >> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4
> >>      +-- BOF1                     +--- EOF1
> >>
> >> We continue to add to transaction 4:
> >> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
> >>      +-- BOF1                         +--- EOF1
> >>
> >> In the meantime, we checkpointed transactions 1 and 2, because there is
> >> a read transaction working on transaction 3:
> >> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
> >>                      +-- BOF1         +--- EOF1
> >>                      +----- checkpointed up to here
> >>
> >> No we decided we want to recycle. Since there is no read transaction
> >> open on transaction 1 and 2 (cannot be, since if you need a page from
> >> transaction 1, you will find it in the DB), we can reuse them:
> >> H 4 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
> >>      +-- BOF2        +-- BOF1         +--- EOF1
> >>      +-- EOF2        +---- checkpointed up to here
> >>      +------ the 6th page of transaction 4
> >>
> >> And now we close transaction 4:
> >> H 4 4C 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
> >>      | +-- EOF2       +-- BOF1         +--- EOF1
> >>      +--- BOF2        +---- checkpointed up to here
> >>
> >> Lets open transaction 5, and write faster than the background
> checkpointing:
> >>
> >> H 4 4C 5  5 5 5 5  3 3 3 3C 4 4 4 4 4
> >>      |             |  +-- BOF1         +--- EOF1
> >>      +--- BOF2     |  +---- checkpointed up to here
> >>                    +-- EOF2
> >>
> >> So we need more place for 5, we will write it at the end:
> >> H 4 4C  5 5 5 5 5  3 3 3 3C 4 4 4 4 4 5 5 5 5C
> >>      |             |  |                | |     +-- EOF3
> >>      |             |  |                | +--- BOF3
> >>      |             |  +-- BOF1         +--- EOF1
> >>      +--- BOF2     |  +---- checkpointed up to here
> >>                    +-- EOF2
> >>
> >> Ok. This is the most complicated 'mess up' of this WAL file possible. It
> >> cannot get 'messier' than this, since we recyclye ONLY when there is a
> >> minimum of N (configurable) amount of checkpointed pages at the
> >> beginning of the WAL file AND if there are no 'gaps' in the not-yet
> >> checkpointed pages.
> >> The first condition is mainly for efficiency or large workloads (since
> >> recycling has its costs). This second condition is to prevent the WAL
> >> files layout to have multiple recycling points in it (prevents it from
> >> getting 'messed up'). A wal file can only have at MAX one recycle point
> >> at one point in file, and MAX of 2 empty gaps.
> >> If we continue the example above, this can happen when we checkpoint
> >> transactions 3 and 4 but not transaction 5. So we have a gap at the
> >> beginning (end of transaction 4) and in the middle (transaction 3 and
> >> beginning of transaction 4)
> >>
> >> H 4 4C  5 5 5 5 5  3 3 3 3C 4 4 4 4 4 5 5 5 5C
> >>            +-- BOF1|                     +- BOF2|
> >>                    +--- EOF1                    +--- EOF2
> >>
> >>
> >> How to mark as DONE:
> >> --------------------
> >> We actually don't mark a frame as DONE.
> >> We give each frame a sequence number, and we write on each commit frame
> >> the current BOF1.
> >> We can figure out which frame is good and which is junk using the
> >> following algorithm:
> >> 1. scan the WAL file and build the index. some of the frames in the
> >> index are invalid.
> >> 2. search for the commit frame with largest sequence number, and get
> >> the BOF1 from it.
> >> 3. pass on all frames from BOF1 to last commit frame and make sure it's
> >> valid (salt-1, salt-2 and the checksum are valid).
> >> 4. In case one of the frames is not valid, find the commit frame
> >> previous to the one chosen in step 2, and do steps 3 and 4 again.
> >> 5. remove all invalid frames from the index.
> >>
> >> First example:
> >>        0 0 0  0 0 0 0  0 0 1 1  1 1 1
> >>        1 2 3  4 5 6 7  8 9 0 1  2 3 4
> >> H P 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4
> >>        +-- BOF1                     +--- EOF1
> >>
> >> 1. we build index file for the above WAL.
> >> 2. Last commit frame is frame 11, so we check for the BOF1 and find it
> >> is frame 01.
> >> 3. we check that each frame between frame 01 and frame 11 is valid
> >>
> >> Second example:
> >>
> >>        1   1 1 2 2    0 0 1 1  1 1 1 1 1   2 2 2 2
> >>        7   8 9 0 1    8 9 0 1  2 3 4 5 6   2 3 4 5
> >> H P 4C  5 5 5 5 P  3 3 3 3C 4 4 4 4 4 P 5 5 5 5C
> >>        |         |    |                |   |     +-- EOF3
> >>        |         |    |                |   +--- BOF3
> >>        |         |    +-- BOF1         +--- EOF1
> >>        +--- BOF2 |    +---- checkpointed up to here
> >>                  +-- EOF2
> >>
> >> 1. we build index file for the above WAL.
> >> 2. Last commit frame is frame 25, so we check for the BOF1 and find it
> >> is frame 08.
> >> 3. we check that each frame between frame 08 and frame 25 is valid.
> >> 4. lets say that we tried to write the HD frames 20, 21, 24 and 25, and
> >> in the middle of this the system crashed. the OS and HD can write frames
> >> 24 and 25 before frames 20-21, so it is possible to frames 24-25 to be
> >> good, and frames 20-21 to be corrupted.
> >> 5. once we know frames 20-21 are corrupted, we search for previus commit
> >> frame, which is frame 17, and do the check again (this time is should
> pass).
> >>
> >>
> >> How to avoid writing sectors near 'critical' sectors:
> >> -----------------------------------------------------
> >> HD's and OS's work with basic 'sector' units that start from 512, but
> >> can in many cases even reach up to 32K!
> >> 'critical sectors' in a WAL file are sectors that if lost can lead to DB
> >> corruption. For example, a WAL frame of a checkpoint that stopped in the
> >> middle of transfer to the DB, and then that WAL frame is lost (due to
> >> corrupt), then next time the system is up - the DB may be corrupt.
> >> So pages of such a WAL frame I call 'critical'.
> >>
> >> In the current 'append only' WAL implementation, such corruption cannot
> >> happen. But, once we add recycling feature, this could happen, by
> >> writing to a adjacent sector in the FILE (just before or just after) -
> >> which if the system is crashes in the middle - could corrupt that
> >> logical sector (since the HD and OS have a minimum sector size which
> >> they write to).
> >>
> >> To avoid this, we need to keep a 'safe distance' from 'critical sectors'
> >> in the WAL. How we do this: whenever we recycle, we add at the end of
> >> the file PAD ('PAD' is 64K of pages - with no content - just used as
> >> padding). This PAD at EOF will be used later on when BOF3 needs to
> >> append at end of file.
> >> At the beginning of the file, in order not to overwrite the header H, it
> >> will skip PAD bytes and start only there writing the BOF2.
> >> EOF2 will stop PAD bytes before BOF1, and then skip to BOF3 (which
> >> already from before had PAD bytes distance from EOF1).
> >>
> >> So the real layout of the previous example will be more like this:
> >> H P 4C  5 5 5 5 P  3 3 3 3C 4 4 4 4 4 P 5 5 5 5C
> >>        |         |    |                |   |     +-- EOF3
> >>        |         |    |                |   +--- BOF3
> >>        |         |    +-- BOF1         +--- EOF1
> >>        +--- BOF2 |    +---- checkpointed up to here
> >>                  +-- EOF2
> >>
> >> This allows us to write to the WAL file without getting 'too close' to
> >> 'critical sectors'.
> >>
> >> I can continue deeper to the headers structures (some fields needs to be
> >> added) and to what is the structure of the index file, but I want to
> >> hear your opinion about all the above first, and if this sounds
> >> reasonable, we can continue work on it.
> >>
> >> Yoni
> >>
> >>
> >> Example program (WAL file grow limitlessly):
> >> ----------------
> >> /* to compile: gcc -g test.c PATH_TO_SQLITE/libsqlite3.a -lrt */
> >> #include "sqlite3.h"
> >> #include "stdio.h"
> >> #include "stdlib.h"
> >> #include "fcntl.h"
> >>
> >> static void sql_exec(sqlite3 *conn, char *query)
> >> {
> >>        char *err;
> >>        if (sqlite3_exec(conn, query, NULL, 0,&err))
> >>        {
> >>            printf("sqlite: failed exec '%s'. err: %s\n", query, err);
> >>      exit(1);
> >>        }
> >> }
> >>
> >> static sqlite3 *sql_open_conn(void)
> >> {
> >>        sqlite3 *conn;
> >>        if (sqlite3_open_v2("test.db",&conn, SQLITE_OPEN_READWRITE,
> NULL))
> >>        {
> >>      printf("sqlite3_open_v2 failed\n");
> >>      exit(1);
> >>        }
> >>        return conn;
> >> }
> >>
> >> static int do_checkpoint()
> >> {
> >>        sqlite3 *conn;
> >>        while (1)
> >>        {
> >>      sleep(1);
> >>      printf("calling wal checkpoint\n");
> >>      fflush(0);
> >>      conn = sql_open_conn();
> >>      sql_exec(conn, "PRAGMA wal_checkpoint");
> >>      sqlite3_close(conn);
> >>        }
> >> }
> >>
> >> int main(int argc, char **argv)
> >> {
> >>        sqlite3 *conn = NULL;
> >>        char *err_msg = NULL;
> >>        pthread_t thread;
> >>        int fd, i = 0;
> >>        unlink("test.db");
> >>        unlink("test.db-wal");
> >>        unlink("test.db-shm");
> >>        if ((fd = open("test.db", O_CREAT|O_RDWR, 0666)<0))
> >>        {
> >>            printf("could not open test.db\n");
> >>            exit(1);
> >>        }
> >>        close(fd);
> >>        conn = sql_open_conn();
> >>        sql_exec(conn, "PRAGMA journal_mode=WAL");
> >>        sql_exec(conn, "PRAGMA wal_autocheckpoint=-1");
> >>        sql_exec(conn, "create table tbl1 (one varchar(20), two
> varchar(20))");
> >>        if (pthread_create(&thread, NULL, do_checkpoint, NULL))
> >>        {
> >>            printf("could not start thread\n");
> >>      exit(1);
> >>        }
> >>        sql_exec(conn, "BEGIN TRANSACTION");
> >>        while (1)
> >>        {
> >>            if (!(i++%10000))
> >>      {
> >>          sql_exec(conn, "END TRANSACTION");
> >>          sql_exec(conn, "BEGIN TRANSACTION");
> >>      }
> >>      sql_exec(conn, "INSERT INTO tbl1 values('aaaaaaaaaaaaaaaaaaa', "
> >>          "'bbbbbbbbbbbbbbbbbbb')");
> >>        }
> >>        sql_exec(conn, "END TRANSACTION");
> >>        sql_exec(conn, "PRAGMA wal_checkpoint");
> >>        sqlite3_close(conn);
> >>        return 0;
> >> }
> >>
> >> On 29/11/2010 5:13 PM, Pavel Ivanov wrote:
> >>>> Well, I love sqlite, and I want to continue using it (small, fast,
> >>>> reliable ...).
> >>>> I think it is better to solve such problems inside sqlite
> >>>
> >>> It's impossible. Just try to design the solution you want. Think of
> >>> how SQLite should behave to make you happy, think of it with all
> >>> details and don't forget that SQLite is a library living inside
> >>> several processes simultaneously, sometimes even on different
> >>> computers (although it's discouraged), those processes can crash
> >>> unpredictably and you have to manage all crashes predictably.
> >>> So if you are able to come up with some solution and post all
> >>> technical details here I think SQLite developers will be happy to
> >>> implement it.
> >>>
> >>>
> >>> Pavel
> >>>
> >>> On Sun, Nov 28, 2010 at 3:19 AM, Yoni Londner<yonih...@gmail.com>
>  wrote:
> >>>> Hi,
> >>>>
> >>>>     >    For a large scale system you have a third choice: use some
> other RDBMS
> >>>>     >    which is implemented in one process and has a much better
> control over
> >>>>     >    its data and much better communication between readers and
> writers.
> >>>> Well, I love sqlite, and I want to continue using it (small, fast,
> >>>> reliable ...).
> >>>> I think it is better to solve such problems inside sqlite (Not to
> >>>> mention you don't need large scale system to reproduce this issue, 50
> >>>> lines of C code is enough).
> >>>>
> >>>> Yoni.
> >>>> _______________________________________________
> >>>> sqlite-users mailing list
> >>>> sqlite-users@sqlite.org
> >>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>>>
> >>> _______________________________________________
> >>> sqlite-users mailing list
> >>> sqlite-users@sqlite.org
> >>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >> _______________________________________________
> >> sqlite-users mailing list
> >> sqlite-users@sqlite.org
> >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>
> >
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@sqlite.org
> > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
D. Richard Hipp
d...@sqlite.org
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] WAL file size

Reply via email to