Re: [HACKERS] WAL write of full pages

2004-03-16 Thread Bruce Momjian
Shridhar Daithankar wrote:
 Hi,
 
 I was thinking the other way round. What if we write to WAL pages only those 
 portions we need to modify, and let the kernel do the job the way it sees fit? 
 What will happen if it fails?

So you are saying only write the part of the page that we modify?  I
think the kernel reads in the entire page, makes the modification, then
writes it.  However, we still don't know whether our 1.5k of changes
made it onto the platters completely.

  Our current WAL implementation writes copies of full pages to WAL before
  modifying the page on disk.  This is done to prevent partially written
  pages from corrupting the database in case the operating system crashes
  during a page write.
 
 Assuming a WAL page is zero at the start and later, say, a 128-byte block is 
 written to it: how exactly is writing 128 bytes different from writing the 
 entire 8K page, especially when we control neither the kernel/buffer cache 
 nor the disk?
 
 What counts as a partial write? PostgreSQL will always flush an entire data 
 block to the WAL page, won't it? If write() returns, we can assume it is written.

If write returns, it means the data is in the kernel cache, not on the
disks.  Fsync is the only thing that forces it to disk, and it is slow.
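
To make the distinction concrete, here is a minimal, self-contained C sketch
(illustration only, not PostgreSQL source): after write() returns, the page is
only in the kernel's buffer cache and a crash can persist it partially or not
at all; only the fsync() call forces it to the platters, and that is the slow
step.

    /* write_vs_fsync.c -- illustration only */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char page[8192];
        memset(page, 'x', sizeof(page));

        int fd = open("datafile", O_WRONLY | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* After write() returns, the page sits in the kernel buffer
         * cache; a crash here can lose it or write it partially. */
        if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page)) {
            perror("write");
            return 1;
        }

        /* Only after fsync() returns do we know the kernel has pushed
         * the data to the drive -- the slow, durable step. */
        if (fsync(fd) != 0) {
            perror("fsync");
            return 1;
        }

        close(fd);
        return 0;
    }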

  For example, suppose an 8k block is being written to a heap file.  
  First the backend issues a write(), which copies the page into the
  kernel buffer cache.  Later, the kernel sends the write request to the
  drive. Even if the file system uses 8k blocks, the disk is typically
  made up of 512-byte sectors, so the OS translates the 8k block into a
  contiguous number of disk sectors, in this case 16.  There is no
  guarantee that all 16 sectors will be written --- perhaps 8 could be
  written, then the system crashes, or perhaps part of a 512-byte sector
  is written, but the remainder left unchanged.  In all these cases,
  restarting the system will yield corrupt heap blocks.
 
 We are hoping to prevent WAL page corruption, which is a form of file system 
 corruption. Do we propose to tackle file system corruption in order to 
 guarantee WAL integrity?

We assume the file system will come back with an xlog directory with
files in it because we fsync it.

  The WAL writes copies of full pages so that on restore, it can check
  each page to make sure it hasn't been corrupted.  The system records an
  LSN (log sequence number) on every page.  When a page is modified, its
  pre-change image is written to WAL, but not fsync'ed.  Later, if a
  backend wants to write a page, it must make sure the LSN of the page is
  between the LSN of the last checkpoint and the LSN of the last fsync by
  a committed transaction.  Only in those cases can the page be written,
  because we are sure that a copy of the page is in the WAL in case there
  is a partial write.
 
 Do we have a per-page checksum? It could be in the control log, not necessarily 
 in WAL. Just asking, since I don't know.

Yes, in WAL.

  Now, as you can imagine, these WAL page writes take up a considerable
  amount of space in the WAL and cause slowness, but no one has come up
  with a way to recover from partial page writes without them.  The only
  way to minimize page writes is to increase checkpoint_segments and
  checkpoint_timeout so that checkpoints are less frequent, and pages have
  to be written fewer times to the WAL because old copies of the pages
  remain in WAL longer.
 
 If I am not mistaken, we rely upon WAL being consistent to ensure transaction 
 recovery. We write() WAL and fsync/open/close it to make sure it goes to disk 
 before the data pages. What else can we do?
 
 I cannot see why writing an 8K block is any safer than writing just the 
 changes.
 
 I may be dead wrong, but just putting my thoughts together...

The problem is that we need to record what was on the page before we
made the modification because there is no way to know that a write
hasn't corrupted some part of the page.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] WAL write of full pages

2004-03-16 Thread Dennis Haney




Bruce Momjian wrote:

  Our current WAL implementation writes copies of full pages to WAL before
  modifying the page on disk.  This is done to prevent partially written
  pages from corrupting the database in case the operating system crashes
  during a page write.
  

InnoDB uses a doublewrite system instead.
http://www.innodb.com/ibman.php#File.space.management

quote:
Starting from 3.23.40b, InnoDB uses a novel file flush technique called
doublewrite. It adds safety to crash recovery after an operating system
crash or a power outage, and improves performance on most Unix flavors
by reducing the need for fsync() operations.

Doublewrite means that InnoDB, before writing pages to a data file,
first writes them to a contiguous tablespace area called the doublewrite
buffer. Only after the write and the flush to the doublewrite buffer
have completed, InnoDB writes the pages to their proper positions in the
data file. If the operating system crashes in the middle of a page
write, InnoDB will in recovery find a good copy of the page from the
doublewrite buffer.
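
The protocol is easy to sketch. The C fragment below is only a conceptual
illustration of the doublewrite idea under a simplified layout (one scratch
slot per write), not InnoDB's actual implementation: the page is first written
and flushed to the doublewrite area, and only then written to its final
location, so a crash at either step leaves at least one intact copy to recover
from.

    /* doublewrite_sketch.c -- conceptual sketch only, not InnoDB code */
    #define _XOPEN_SOURCE 500
    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_SIZE 8192

    /*
     * Write one page using a doublewrite-style protocol.
     * dw_fd is a preallocated scratch ("doublewrite") file,
     * data_fd is the data file.  Returns 0 on success, -1 on error.
     */
    int doublewrite_page(int dw_fd, int data_fd, off_t page_offset,
                         const char page[PAGE_SIZE])
    {
        /* Step 1: write and flush the page to the doublewrite area.
         * A crash during this step leaves the data file untouched. */
        if (pwrite(dw_fd, page, PAGE_SIZE, 0) != PAGE_SIZE)
            return -1;
        if (fsync(dw_fd) != 0)
            return -1;

        /* Step 2: write and flush the page to its real location.
         * A torn write here can be repaired at recovery time from the
         * intact copy sitting in the doublewrite area. */
        if (pwrite(data_fd, page, PAGE_SIZE, page_offset) != PAGE_SIZE)
            return -1;
        if (fsync(data_fd) != 0)
            return -1;

        return 0;
    }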


-- 
Dennis





Re: [HACKERS] WAL write of full pages

2004-03-16 Thread Bruce Momjian
Shridhar Daithankar wrote:
 Bruce Momjian wrote:
 
  Shridhar Daithankar wrote:
 I cannot see why writing an 8K block is any safer than writing just the 
 changes.
 
 I may be dead wrong, but just putting my thoughts together...
  The problem is that we need to record what was on the page before we
  made the modification because there is no way to know that a write
  hasn't corrupted some part of the page.
 
 OK... I think there is hardly any way around the fact that we need to flush a 
 page the way we do it now. But that is slow. So what do we do?
 
 How feasible would it be to push fsyncing those pages/files to the background 
 writer and have it done at high priority? That way the disk I/O wait could get 
 out of the critical execution path. Maybe that could yield the performance 
 benefit we are looking for.

We already allow committing transactions to flush WAL.  We don't do the
flush when we write the page image to WAL, unless we can't get any other
buffer and have to write it ourselves and it hasn't already been
fsync'ed by another transaction.  This is where the current LSN comes in ---
it tells us how far fsync has gone, and each page has an LSN that tells
us when it was written to WAL.
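
Put as simplified pseudocode in C (the names wal_flushed_upto, wal_flush and
flush_buffer_to_datafile are hypothetical, not the actual backend routines):
before a dirty buffer may be written to its data file, the WAL must already be
flushed at least up to that buffer's LSN; normally a committing transaction
has done the flush, and an evicting backend only does it as a last resort.

    /* Simplified sketch of the WAL-before-data rule; all names are
     * hypothetical stand-ins for the real backend functions. */
    typedef unsigned long long lsn_t;

    extern lsn_t wal_flushed_upto(void);          /* how far fsync has gone  */
    extern void  wal_flush(lsn_t upto);           /* fsync WAL up to 'upto'  */
    extern void  flush_buffer_to_datafile(int buf_id);

    void write_dirty_buffer(int buf_id, lsn_t buffer_lsn)
    {
        /* The WAL covering this buffer (including any page image) must
         * reach disk before the buffer itself does. */
        if (wal_flushed_upto() < buffer_lsn)
            wal_flush(buffer_lsn);   /* the rare case: nobody flushed yet */

        /* Now it is safe to write() the buffer to the data file. */
        flush_buffer_to_datafile(buf_id);
    }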

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



[HACKERS] WAL write of full pages

2004-03-15 Thread Bruce Momjian
Our current WAL implementation writes copies of full pages to WAL before
modifying the page on disk.  This is done to prevent partially written
pages from corrupting the database in case the operating system crashes
during a page write.

For example, suppose an 8k block is being written to a heap file.  
First the backend issues a write(), which copies the page into the
kernel buffer cache.  Later, the kernel sends the write request to the
drive. Even if the file system uses 8k blocks, the disk is typically
made up of 512-byte sectors, so the OS translates the 8k block into a
contiguous number of disk sectors, in this case 16.  There is no
guarantee that all 16 sectors will be written --- perhaps 8 could be
written, then the system crashes, or perhaps part of a 512-byte sector
is written, but the remainder left unchanged.  In all these cases,
restarting the system will yield corrupt heap blocks.
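
To put numbers on the example: 8192 / 512 = 16 sectors, and write() gives no
atomicity guarantee across them. The toy C program below (illustration only)
fakes a crash after half of the sectors have been persisted, which is exactly
the "torn page" that the full-page copies in WAL let us repair.

    /* torn_page_demo.c -- toy illustration of a torn (partial) page write */
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE   8192
    #define SECTOR_SIZE 512
    #define SECTORS     (PAGE_SIZE / SECTOR_SIZE)   /* 16 */

    int main(void)
    {
        char on_disk[PAGE_SIZE], new_page[PAGE_SIZE];
        memset(on_disk, 'O', sizeof(on_disk));    /* Old page contents      */
        memset(new_page, 'N', sizeof(new_page));  /* New page being written */

        /* Pretend the system crashed after the drive persisted only
         * the first 8 of the 16 sectors. */
        int persisted = SECTORS / 2;
        memcpy(on_disk, new_page, (size_t) persisted * SECTOR_SIZE);

        printf("first sector holds: %c (new)\n", on_disk[0]);
        printf("last sector holds:  %c (old)\n", on_disk[PAGE_SIZE - 1]);
        /* Neither the old nor the new image is intact -- which is why a
         * full copy of the page is kept in WAL for recovery. */
        return 0;
    }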

The WAL writes copies of full pages so that on restore, it can check
each page to make sure it hasn't been corrupted.  The system records an
LSN (log sequence number) on every page.  When a page is modified, its
pre-change image is written to WAL, but not fsync'ed.  Later, if a
backend wants to write a page, it must make sure the LSN of the page is
between the LSN of the last checkpoint and the LSN of the last fsync by
a committed transaction.  Only in those cases can the page be written,
because we are sure that a copy of the page is in the WAL in case there
is a partial write.
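
A simplified rendering of that check in C (the names are hypothetical; the
real backend logic is more involved): a data page may be flushed only if its
LSN shows it was logged after the last checkpoint, and the WAL covering it has
already been fsync'ed.

    /* Simplified sketch of the condition described above; the names are
     * hypothetical, not the actual backend functions. */
    typedef unsigned long long lsn_t;

    extern lsn_t last_checkpoint_lsn(void);   /* LSN of the last checkpoint */
    extern lsn_t wal_flushed_upto(void);      /* LSN the WAL is fsync'ed to */

    /* May this data page be written to its file right now? */
    int page_safe_to_write(lsn_t page_lsn)
    {
        /* The page was logged since the last checkpoint, so a full copy
         * of it is in WAL, and that WAL has already reached disk. */
        return page_lsn >= last_checkpoint_lsn()
            && page_lsn <= wal_flushed_upto();
    }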

Now, as you can imagine, these WAL page writes take up a considerable
amount of space in the WAL and cause slowness, but no one has come up
with a way to recover from partial page writes without them.  The only
way to minimize page writes is to increase checkpoint_segments and
checkpoint_timeout so that checkpoints are less frequent, and pages have
to be written fewer times to the WAL because old copies of the pages
remain in WAL longer.
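
Concretely, that tuning lives in postgresql.conf; the values below are purely
illustrative (they trade more WAL disk space and a longer crash recovery for
fewer full-page writes), not recommendations:

    # postgresql.conf -- illustrative values only
    checkpoint_segments = 30     # default is 3; more segments push checkpoints apart
    checkpoint_timeout = 1800    # seconds between forced checkpoints; default 300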

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] WAL write of full pages

2004-03-15 Thread Marty Scholes
If I understand WAL correctly (and I may not), it is essentially a write 
cache for writes to the data files, because:

1. Data file writes are notoriously random, and writing the log is 
sequential.  Ironically, the sectors mapped by the OS to the disk are 
likely not at all sequential, but they likely are more sequential than 
the random data writes.

2. Log writing allows use of small, super fast drives (e.g. Solid State 
Disks) to speed up total database performance.  You can have slower 
drives for the large files in the database and still get acceptable 
performance.

3. WAL allows for syncing only the pages changed.  For example, suppose 
14 transactions are in flight and each one modifies 40 pages of a data 
file.  When one transaction commits, 560 pages are dirty, but only 40 
need to be written.  Without very close control over which buffers get 
written out to the OS (and Pg may have this, I am not sure), all 560 
pages may get written instead of the 40 that actually need to be written.

My only complaint is about larger systems which have a single (or 
mirrored) large array.  If I have a very fast array of some sort that 
has proper caching, and my data files are on the array, look at my 
options for log files:

1. Put them on the array.
Pros:
* Fastest drive available
* RAID, so most reliable drive available
Cons:
* All changes get dumped twice: once for WAL, once at checkpoint.
* The array is no slower on random writes than sequential ones, which 
means that the benefits of writing to WAL vs. the data files are lost.

2. Put them on an actual (or mirrored actual) spindle
Pros:
* Keeps WAL and data file I/O separate
Cons:
* All of the non array drives are still slower than the array

3. Put them on mirrored solid state disks or another array
Pros:
* Very fast
* WAL and data file I/O is separate
Cons:
* Big $.  Extremely large $/GB ratio.
* If an array, hordes of unused space.

I suspect (but cannot prove) that performance would jump for systems 
like ours if WAL was done away with entirely and the individual data 
files were synchronized on commit.

Is there a simple way to turn off WAL in the config files so that I may 
do some benchmarking?





Re: [HACKERS] WAL write of full pages

2004-03-15 Thread Rod Taylor
 I suspect (but cannot prove) that performance would jump for systems 
 like ours if WAL was done away with entirely and the individual data 
 files were synchronized on commit.

You know... that's exactly what WAL is designed to prevent, right? Grab a copy
of 7.0 and 7.1 and benchmark the two with fsync on in both cases.

I think you'll find one is about 50% faster than the other on a single-disk
system, and about the same if you have gobs of battery-backed write cache.





Re: [HACKERS] WAL write of full pages

2004-03-15 Thread Tom Lane
Marty Scholes [EMAIL PROTECTED] writes:
 I suspect (but cannot prove) that performance would jump for systems 
 like ours if WAL was done away with entirely and the individual data 
 files were synchronized on commit.

I rather doubt this, since we used to do things that way and we saw an
across-the-board performance improvement when we got rid of it in favor
of WAL.

 Is there a simple way to turn off WAL in the config files so that I may 
 do some benchmarking?

No, there's no way to turn it off at all.  You can disable fsync'ing it,
but that's hardly representative of what would happen if the data writes
had to be fsync'd instead.

Your analysis is missing an important point, which is what happens when
multiple transactions successively modify the same page.  With a
sync-the-data-files approach, we'd have to write the data page again for
each commit.  With WAL, the data page will likely not get written at all
(until a checkpoint happens).  Instead there will be per-transaction
writes to the WAL, but the data volume will be less since WAL records
are generally tuple-sized not page-sized.  There's probably no win for
large transactions that touch most of the tuples on a given data page,
but for small transactions it's a win.
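
To put rough, purely illustrative numbers on that: if 100 small transactions
each update one tuple on the same 8k page, syncing the data file at every
commit writes on the order of 100 x 8k = 800k. With WAL, the full page image
goes to the log roughly once per checkpoint cycle (8k), the per-transaction
records are tuple-sized (say a few hundred bytes each, so a few tens of
kilobytes in total), and the data page itself is written once at checkpoint
time.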

regards, tom lane



Re: [HACKERS] WAL write of full pages

2004-03-15 Thread Marty Scholes
Tom Lane wrote:
Your analysis is missing an important point, which is what happens when
multiple transactions successively modify the same page.  With a
sync-the-data-files approach, we'd have to write the data page again for
each commit.  With WAL, the data page will likely not get written at all
(until a checkpoint happens).  Instead there will be per-transaction
writes to the WAL, but the data volume will be less since WAL records
are generally tuple-sized not page-sized.  There's probably no win for
large transactions that touch most of the tuples on a given data page,
but for small transactions it's a win.


Well said.  I had not considered that the granularity of WAL entries was 
different than that of dirtying data pages.

I have no doubt that all of these issues have been hashed out before, 
and I appreciate you sharing the rationale behind the design decisions.

I can't help but wonder if there is a better way for update-intensive 
environments, which probably did not play a large role in the design decisions.

Since I live it, I know of other shops that use an industrial-strength 
RDBMS (Oracle, Sybase, MS SQL, etc.) for batch data processing, not just 
transaction processing.  Oftentimes a large data set comes in, gets 
loaded, gets churned for a few minutes or hours, then gets spit out, with 
relatively little residual data held in the RDBMS.

Why use an RDBMS for this kind of work?  Because it's 
faster/cheaper/better than any alternative we have seen.

I have a 100 GB Oracle installation, small by most standards, but it has 
well over 1 TB per month flushed through it.

Bulk loads are not a once in a while undertaking.

At any rate, thanks again.
Marty


Re: [HACKERS] WAL write of full pages

2004-03-15 Thread Manfred Spraul
Marty Scholes wrote:

2. Put them on an actual (or mirrored actual) spindle
Pros:
* Keeps WAL and data file I/O separate
Cons:
* All of the non array drives are still slower than the array
Are you sure this is a problem? The dbt-2 benchmarks from OSDL run on an 
8-way Intel computer with several RAID arrays distributed over 40 disks. 
IIRC it generates around 1.5 MB of WAL per second - well within the 
capability of a single drive. My laptop can write around 10 MB/sec 
(measured with dd if=/dev/zero of=fill and vmstat); fast drives should 
be above 20 MB/sec.

How much WAL data is generated by large postgres setups? Are there any 
setups that are limited by the WAL logs?

--
   Manfred




Re: [HACKERS] WAL write of full pages

2004-03-15 Thread Shridhar Daithankar
Hi,

I was thinking the other way round. What if we write to WAL pages only those 
portions we need to modify, and let the kernel do the job the way it sees fit? 
What will happen if it fails?

Bruce Momjian wrote:

Our current WAL implementation writes copies of full pages to WAL before
modifying the page on disk.  This is done to prevent partially written
pages from corrupting the database in case the operating system crashes
during a page write.
Assuming a WAL page is zero at the start and later, say, a 128-byte block is 
written to it: how exactly is writing 128 bytes different from writing the 
entire 8K page, especially when we control neither the kernel/buffer cache 
nor the disk?

What counts as a partial write? PostgreSQL will always flush an entire data 
block to the WAL page, won't it? If write() returns, we can assume it is written.

For example, suppose an 8k block is being written to a heap file.  
First the backend issues a write(), which copies the page into the
kernel buffer cache.  Later, the kernel sends the write request to the
drive. Even if the file system uses 8k blocks, the disk is typically
made up of 512-byte sectors, so the OS translates the 8k block into a
contiguous number of disk sectors, in this case 16.  There is no
guarantee that all 16 sectors will be written --- perhaps 8 could be
written, then the system crashes, or perhaps part of a 512-byte sector
is written, but the remainder left unchanged.  In all these cases,
restarting the system will yield corrupt heap blocks.
We are hoping to prevent WAL page corruption, which is a form of file system 
corruption. Do we propose to tackle file system corruption in order to 
guarantee WAL integrity?

The WAL writes copies of full pages so that on restore, it can check
each page to make sure it hasn't been corrupted.  The system records an
LSN (log sequence number) on every page.  When a page is modified, its
pre-change image is written to WAL, but not fsync'ed.  Later, if a
backend wants to write a page, it must make sure the LSN of the page is
between the LSN of the last checkpoint and the LSN of the last fsync by
a committed transaction.  Only in those cases can the page be written,
because we are sure that a copy of the page is in the WAL in case there
is a partial write.
Do we have a per-page checksum? It could be in the control log, not necessarily 
in WAL. Just asking, since I don't know.

Now, as you can imagine, these WAL page writes take up a considerable
amount of space in the WAL and cause slowness, but no one has come up
with a way to recover from partial page writes without them.  The only
way to minimize page writes is to increase checkpoint_segments and
checkpoint_timeout so that checkpoints are less frequent, and pages have
to be written fewer times to the WAL because old copies of the pages
remain in WAL longer.
If I am not mistaken, we rely upon WAL being consistent to ensure transaction 
recovery. We write() WAL and fsync/open/close it to make sure it goes to disk 
before the data pages. What else can we do?

I cannot see why writing an 8K block is any safer than writing just the 
changes.

I may be dead wrong, but just putting my thoughts together...

 Shridhar
