Re: [sqlite] WAL file size

Yoni Londner Thu, 02 Dec 2010 08:40:42 -0800

Hi,

I meant to run the checkpoint in another thread (see my attached example 
program), but it can be even from another process.
What happens when it crashes? well, the programmer have to decide how he 
handle checkpoints. he can use auto_checkpoint, he can stop 
auto_checkpoint and do checkpoint in the main thread or in another 
thread, and he can do it in another process. but if this another process 
crash, that is the programmer responsibility to handle (e.g run it again).


The initial prolem I approched is a system that constantly read/write 
from/to the DB, and want to do it fast, so we can't use auto_checkpoint 
or run checkpoint periodically in the main thread.
What we do want is to run checkpoint in the background, but then the WAL 
grow forever...

Yoni.

On 2/12/2010 6:29 PM, Pavel Ivanov wrote:
> You seem to use such term as "background checkpointing". What's that?
> Who runs this background process and what happens when it crashes
> (when all other readers/writers are still working)?
>
>
> Pavel
>
> On Thu, Dec 2, 2010 at 11:04 AM, Yoni Londner<[email protected]>  wrote:
>> Hi,
>>
>> I will start out by stating that I am not deeply familiar with the
>> sqlite&  WAL file layout, but I will try anyway, from my current
>> understanding:
>>
>> The general problem that needs to be solved is to allow SQLITE to be
>> constantly used over time (with no 'idle' time where no sql operations
>> are done, and it is not inside any transaction).
>>
>> The current WAL design does not allow this, since if there are all the
>> time open read transactions and/or write transactions, the WAL will
>> continue to grow forever.
>> I copy/paste below a tiny C program that re-creates this with write
>> transactions.
>>
>> Remember also that in the WAL file we 'gather' up a lot of work. If we
>> do it in the background, this work-load can be smoothed up to run in
>> parallel to the regular system work, but if we must make the SQL 'idle'
>> (close all transactions) in order to execute the checkpoint, on a heavy
>> load system this can mean halting the system for a long period of time
>> (in our case, typically for 30 seconds!).
>>
>> This needs to be solved by two "features" to be added to WAL:
>>
>> 1) 'incremental' checkpoint.
>>
>> 2) WAL file recycling (this item can also be solved by two WAL files,
>> but I think its better to make the WAL file format a little bit more
>> complex than start having to handle in the code management of multiple
>> files).
>>
>> Incremental checkpointing means that checkpointing can be done up until
>> the first open transaction (mxFrame). This means both copying the WAL
>> pages up-until mxFrame of the first open transaction to the DB file, and
>> then MARKING those frames as 'DONE' (I will talk later on how to do the
>> 'DONE' marking).
>>
>> This means that from that point onwards - accessing those pages will be
>> to the DB file, not the WAL file.
>>
>> WAL recycling will be done by writing pages to the beginning of the WAL
>> file when a certain amount of pages from the beginning of the WAL file
>> are 'checkpointed' (marked as DONE). This can also happen in the middle
>> of a transaction.
>>
>> Example:
>>
>> Legend:
>> H - header.
>> 1, 2, 3.. - page of transaction 1, 2, 3..
>> C - commit marker.
>> BOF1: beginning of WAL-1. EOF1: end of WAL-1
>> BOF2: beginning of WAL-2. EOF2: end of WAL-2
>> P: 64K of padding (junk data)
>>
>> WAL file with transactions 1 2 and 3 committed, and transaction 4 open:
>> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4
>>    +-- BOF1                     +--- EOF1
>>
>> We continue to add to transaction 4:
>> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
>>    +-- BOF1                         +--- EOF1
>>
>> In the meantime, we checkpointed transactions 1 and 2, because there is
>> a read transaction working on transaction 3:
>> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
>>                    +-- BOF1         +--- EOF1
>>                    +----- checkpointed up to here
>>
>> No we decided we want to recycle. Since there is no read transaction
>> open on transaction 1 and 2 (cannot be, since if you need a page from
>> transaction 1, you will find it in the DB), we can reuse them:
>> H 4 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
>>    +-- BOF2        +-- BOF1         +--- EOF1
>>    +-- EOF2        +---- checkpointed up to here
>>    +------ the 6th page of transaction 4
>>
>> And now we close transaction 4:
>> H 4 4C 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
>>    | +-- EOF2       +-- BOF1         +--- EOF1
>>    +--- BOF2        +---- checkpointed up to here
>>
>> Lets open transaction 5, and write faster than the background checkpointing:
>>
>> H 4 4C 5  5 5 5 5  3 3 3 3C 4 4 4 4 4
>>    |             |  +-- BOF1         +--- EOF1
>>    +--- BOF2     |  +---- checkpointed up to here
>>                  +-- EOF2
>>
>> So we need more place for 5, we will write it at the end:
>> H 4 4C  5 5 5 5 5  3 3 3 3C 4 4 4 4 4 5 5 5 5C
>>    |             |  |                | |     +-- EOF3
>>    |             |  |                | +--- BOF3
>>    |             |  +-- BOF1         +--- EOF1
>>    +--- BOF2     |  +---- checkpointed up to here
>>                  +-- EOF2
>>
>> Ok. This is the most complicated 'mess up' of this WAL file possible. It
>> cannot get 'messier' than this, since we recyclye ONLY when there is a
>> minimum of N (configurable) amount of checkpointed pages at the
>> beginning of the WAL file AND if there are no 'gaps' in the not-yet
>> checkpointed pages.
>> The first condition is mainly for efficiency or large workloads (since
>> recycling has its costs). This second condition is to prevent the WAL
>> files layout to have multiple recycling points in it (prevents it from
>> getting 'messed up'). A wal file can only have at MAX one recycle point
>> at one point in file, and MAX of 2 empty gaps.
>> If we continue the example above, this can happen when we checkpoint
>> transactions 3 and 4 but not transaction 5. So we have a gap at the
>> beginning (end of transaction 4) and in the middle (transaction 3 and
>> beginning of transaction 4)
>>
>> H 4 4C  5 5 5 5 5  3 3 3 3C 4 4 4 4 4 5 5 5 5C
>>          +-- BOF1|                     +- BOF2|
>>                  +--- EOF1                    +--- EOF2
>>
>>
>> How to mark as DONE:
>> --------------------
>> We actually don't mark a frame as DONE.
>> We give each frame a sequence number, and we write on each commit frame
>> the current BOF1.
>> We can figure out which frame is good and which is junk using the
>> following algorithm:
>> 1. scan the WAL file and build the index. some of the frames in the
>> index are invalid.
>> 2. search for the commit frame with largest sequence number, and get
>> the BOF1 from it.
>> 3. pass on all frames from BOF1 to last commit frame and make sure it's
>> valid (salt-1, salt-2 and the checksum are valid).
>> 4. In case one of the frames is not valid, find the commit frame
>> previous to the one chosen in step 2, and do steps 3 and 4 again.
>> 5. remove all invalid frames from the index.
>>
>> First example:
>>      0 0 0  0 0 0 0  0 0 1 1  1 1 1
>>      1 2 3  4 5 6 7  8 9 0 1  2 3 4
>> H P 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4
>>      +-- BOF1                     +--- EOF1
>>
>> 1. we build index file for the above WAL.
>> 2. Last commit frame is frame 11, so we check for the BOF1 and find it
>> is frame 01.
>> 3. we check that each frame between frame 01 and frame 11 is valid
>>
>> Second example:
>>
>>      1   1 1 2 2    0 0 1 1  1 1 1 1 1   2 2 2 2
>>      7   8 9 0 1    8 9 0 1  2 3 4 5 6   2 3 4 5
>> H P 4C  5 5 5 5 P  3 3 3 3C 4 4 4 4 4 P 5 5 5 5C
>>      |         |    |                |   |     +-- EOF3
>>      |         |    |                |   +--- BOF3
>>      |         |    +-- BOF1         +--- EOF1
>>      +--- BOF2 |    +---- checkpointed up to here
>>                +-- EOF2
>>
>> 1. we build index file for the above WAL.
>> 2. Last commit frame is frame 25, so we check for the BOF1 and find it
>> is frame 08.
>> 3. we check that each frame between frame 08 and frame 25 is valid.
>> 4. lets say that we tried to write the HD frames 20, 21, 24 and 25, and
>> in the middle of this the system crashed. the OS and HD can write frames
>> 24 and 25 before frames 20-21, so it is possible to frames 24-25 to be
>> good, and frames 20-21 to be corrupted.
>> 5. once we know frames 20-21 are corrupted, we search for previus commit
>> frame, which is frame 17, and do the check again (this time is should pass).
>>
>>
>> How to avoid writing sectors near 'critical' sectors:
>> -----------------------------------------------------
>> HD's and OS's work with basic 'sector' units that start from 512, but
>> can in many cases even reach up to 32K!
>> 'critical sectors' in a WAL file are sectors that if lost can lead to DB
>> corruption. For example, a WAL frame of a checkpoint that stopped in the
>> middle of transfer to the DB, and then that WAL frame is lost (due to
>> corrupt), then next time the system is up - the DB may be corrupt.
>> So pages of such a WAL frame I call 'critical'.
>>
>> In the current 'append only' WAL implementation, such corruption cannot
>> happen. But, once we add recycling feature, this could happen, by
>> writing to a adjacent sector in the FILE (just before or just after) -
>> which if the system is crashes in the middle - could corrupt that
>> logical sector (since the HD and OS have a minimum sector size which
>> they write to).
>>
>> To avoid this, we need to keep a 'safe distance' from 'critical sectors'
>> in the WAL. How we do this: whenever we recycle, we add at the end of
>> the file PAD ('PAD' is 64K of pages - with no content - just used as
>> padding). This PAD at EOF will be used later on when BOF3 needs to
>> append at end of file.
>> At the beginning of the file, in order not to overwrite the header H, it
>> will skip PAD bytes and start only there writing the BOF2.
>> EOF2 will stop PAD bytes before BOF1, and then skip to BOF3 (which
>> already from before had PAD bytes distance from EOF1).
>>
>> So the real layout of the previous example will be more like this:
>> H P 4C  5 5 5 5 P  3 3 3 3C 4 4 4 4 4 P 5 5 5 5C
>>      |         |    |                |   |     +-- EOF3
>>      |         |    |                |   +--- BOF3
>>      |         |    +-- BOF1         +--- EOF1
>>      +--- BOF2 |    +---- checkpointed up to here
>>                +-- EOF2
>>
>> This allows us to write to the WAL file without getting 'too close' to
>> 'critical sectors'.
>>
>> I can continue deeper to the headers structures (some fields needs to be
>> added) and to what is the structure of the index file, but I want to
>> hear your opinion about all the above first, and if this sounds
>> reasonable, we can continue work on it.
>>
>> Yoni
>>
>>
>> Example program (WAL file grow limitlessly):
>> ----------------
>> /* to compile: gcc -g test.c PATH_TO_SQLITE/libsqlite3.a -lrt */
>> #include "sqlite3.h"
>> #include "stdio.h"
>> #include "stdlib.h"
>> #include "fcntl.h"
>>
>> static void sql_exec(sqlite3 *conn, char *query)
>> {
>>      char *err;
>>      if (sqlite3_exec(conn, query, NULL, 0,&err))
>>      {
>>          printf("sqlite: failed exec '%s'. err: %s\n", query, err);
>>         exit(1);
>>      }
>> }
>>
>> static sqlite3 *sql_open_conn(void)
>> {
>>      sqlite3 *conn;
>>      if (sqlite3_open_v2("test.db",&conn, SQLITE_OPEN_READWRITE, NULL))
>>      {
>>         printf("sqlite3_open_v2 failed\n");
>>         exit(1);
>>      }
>>      return conn;
>> }
>>
>> static int do_checkpoint()
>> {
>>      sqlite3 *conn;
>>      while (1)
>>      {
>>         sleep(1);
>>         printf("calling wal checkpoint\n");
>>         fflush(0);
>>         conn = sql_open_conn();
>>         sql_exec(conn, "PRAGMA wal_checkpoint");
>>         sqlite3_close(conn);
>>      }
>> }
>>
>> int main(int argc, char **argv)
>> {
>>      sqlite3 *conn = NULL;
>>      char *err_msg = NULL;
>>      pthread_t thread;
>>      int fd, i = 0;
>>      unlink("test.db");
>>      unlink("test.db-wal");
>>      unlink("test.db-shm");
>>      if ((fd = open("test.db", O_CREAT|O_RDWR, 0666)<0))
>>      {
>>          printf("could not open test.db\n");
>>          exit(1);
>>      }
>>      close(fd);
>>      conn = sql_open_conn();
>>      sql_exec(conn, "PRAGMA journal_mode=WAL");
>>      sql_exec(conn, "PRAGMA wal_autocheckpoint=-1");
>>      sql_exec(conn, "create table tbl1 (one varchar(20), two varchar(20))");
>>      if (pthread_create(&thread, NULL, do_checkpoint, NULL))
>>      {
>>          printf("could not start thread\n");
>>         exit(1);
>>      }
>>      sql_exec(conn, "BEGIN TRANSACTION");
>>      while (1)
>>      {
>>          if (!(i++%10000))
>>         {
>>             sql_exec(conn, "END TRANSACTION");
>>             sql_exec(conn, "BEGIN TRANSACTION");
>>         }
>>         sql_exec(conn, "INSERT INTO tbl1 values('aaaaaaaaaaaaaaaaaaa', "
>>             "'bbbbbbbbbbbbbbbbbbb')");
>>      }
>>      sql_exec(conn, "END TRANSACTION");
>>      sql_exec(conn, "PRAGMA wal_checkpoint");
>>      sqlite3_close(conn);
>>      return 0;
>> }
>>
>> On 29/11/2010 5:13 PM, Pavel Ivanov wrote:
>>>> Well, I love sqlite, and I want to continue using it (small, fast,
>>>> reliable ...).
>>>> I think it is better to solve such problems inside sqlite
>>>
>>> It's impossible. Just try to design the solution you want. Think of
>>> how SQLite should behave to make you happy, think of it with all
>>> details and don't forget that SQLite is a library living inside
>>> several processes simultaneously, sometimes even on different
>>> computers (although it's discouraged), those processes can crash
>>> unpredictably and you have to manage all crashes predictably.
>>> So if you are able to come up with some solution and post all
>>> technical details here I think SQLite developers will be happy to
>>> implement it.
>>>
>>>
>>> Pavel
>>>
>>> On Sun, Nov 28, 2010 at 3:19 AM, Yoni Londner<[email protected]>    wrote:
>>>> Hi,
>>>>
>>>>    >    For a large scale system you have a third choice: use some other 
>>>> RDBMS
>>>>    >    which is implemented in one process and has a much better control 
>>>> over
>>>>    >    its data and much better communication between readers and writers.
>>>> Well, I love sqlite, and I want to continue using it (small, fast,
>>>> reliable ...).
>>>> I think it is better to solve such problems inside sqlite (Not to
>>>> mention you don't need large scale system to reproduce this issue, 50
>>>> lines of C code is enough).
>>>>
>>>> Yoni.
>>>> _______________________________________________
>>>> sqlite-users mailing list
>>>> [email protected]
>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>>
>>> _______________________________________________
>>> sqlite-users mailing list
>>> [email protected]
>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>> _______________________________________________
>> sqlite-users mailing list
>> [email protected]
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>
> _______________________________________________
> sqlite-users mailing list
> [email protected]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] WAL file size

Reply via email to