Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-05 Thread Simon Riggs
On Tue, 2007-04-03 at 19:45 +0900, Koichi Suzuki wrote:
 Bruce Momjian wrote:
  Your patch has been added to the PostgreSQL unapplied patches list at:
  
  http://momjian.postgresql.org/cgi-bin/pgpatches
 
 Thank you very much for including it.   Attached is an update of the patch 
 according to Simon Riggs's comment about the GUC name.

The patch comes with its own install kit, which is great for review
(many thanks), but makes it hard to determine where you think the code
should go when committed.

My guess, based on your patch:
- the patch gets applied to core :-)
- pg_compresslog *and* pg_decompresslog go to a contrib directory called
contrib/lesslog?

Can you please produce a combined patch that does all of the above, plus
edits the contrib Makefile to add all of those, as well as editing the
README so it doesn't mention the patch, just the contrib executables?

The patch looks correct to me now. I haven't tested it yet, but will be
doing so in the last week of April, which is when I'll be doing docs for
this and other stuff, since time is pressing now.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Heikki Linnakangas

ITAGAKI Takahiro wrote:

Here is the latest version of Load distributed checkpoint patch.


Unfortunately because of the recent instrumentation and 
CheckpointStartLock patches this patch doesn't apply cleanly to CVS HEAD 
anymore. Could you fix the bitrot and send an updated patch, please?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] [HACKERS] --enable-xml instead of --with-libxml?

2007-04-05 Thread Andrew Dunstan

Bruce Momjian wrote:

I have applied the following patch which adds documentation and an
improved error message for an installation that does not use
--with-libxml.

  


and you broke the buildfarm ... you need to fix the regression test.

cheers

andrew



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Greg Smith

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

Unfortunately because of the recent instrumentation and CheckpointStartLock 
patches this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix 
the bitrot and send an updated patch, please?


The Logging checkpoints and other slowdown causes patch I submitted 
touches some of the same code as well; that's another possible merge 
coming, depending on what order this all gets committed in.  Running into 
what I dubbed "perpetual checkpoints" was one of the reasons I started 
logging timing information for the various portions of the checkpoint, to 
tell when it was bogged down with slow writes versus being held up in sync 
for various (possibly fixed with your CheckpointStartLock) issues.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] CREATE TABLE LIKE INCLUDING INDEXES support

2007-04-05 Thread Bruce Momjian
Gregory Stark wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 
  Uh, shouldn't CREATE TABLE LIKE INCLUDING CONSTRAINTS already be including
  any indexes in the parent table?
 
 You could argue it should for unique indexes since our unique indexes are how
 we implement unique constraints. But I see no particular reason to expect it
 to copy random other indexes. At least its name doesn't lead one to expect it
 to.
 
 I also thought it was sort of strange to have a command that otherwise is just
 copying definitions suddenly start building whole new objects. I think I was
 thinking it would be a long slow operation but I suppose creating an empty
 index isn't really noticeably slow. It could print a NOTICE similar to what's
 printed when you create a primary key or unique constraint.
 
 It does mean that users would be unable to create a partition, load data, then
 build indexes. Perhaps it would be nice to have an ALTER TABLE foo LIKE bar
 INCLUDING CONSTRAINTS as well.

The patch already _only_ does constraint (unique) indexes:

 So, that's what this patch does. When a table is created with 'CREATE
 TABLE ... LIKE parent INCLUDING INDEXES'  this iterates over the parent
 table indexes looking for constraint indexes, and alters the
 CreateStmtContext to include equivalent indexes on the child table.

so I am just suggesting it do that always for INCLUDING CONSTRAINTS,
with a notice as you suggest.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Heikki Linnakangas

Greg Smith wrote:

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:


Bgwriter has two goals:
1. keep enough buffers clean that normal backends never need to do a write
2. smooth checkpoints by writing buffers ahead of time

Load distributed checkpoints will do 2. in a much better way than the 
bgwriter_all_* guc options. I think we should remove that aspect of 
bgwriter in favor of this patch.


...

Let me suggest a different way of looking at this problem.  At any 
moment, some percentage of your buffer pool is dirty.  Whether it's 0% 
or 100% dramatically changes what the background writer should be 
doing.  Whether most of the data is usage_count > 0 or not also makes a 
difference.  None of the current code has any idea what type of buffer 
pool they're working with, and therefore they don't have enough 
information to make a well-informed prediction about what is going to 
happen in the near future.


The purpose of the bgwriter_all_* settings is to shorten the duration of 
the eventual checkpoint. The reason to shorten the checkpoint duration 
is to limit the damage to other I/O activity it causes. My thinking is 
that assuming the LDC patch is effective (agreed, needs more testing) at 
smoothing the checkpoint, the duration doesn't matter anymore. Do you 
want to argue there's other reasons to shorten the checkpoint duration?


I'll tell you what I did to the all-scan.  I ran a few hundred hours 
worth of background writer tests to collect data on what it does wrong, 
then wrote a prototype automatic background writer that resets the 
all-scan parameters based on what I found.  It keeps a running estimate 
of how dirty the pool at large is using a weighted average of the most 
recent scan with the past history.  From there, I have a simple model 
that predicts how much of the buffer we can scan in any interval, and 
intends to enforce a maximum bound on the amount of physical I/O you're 
willing to stream out.  The beta code is sitting at 
http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want 
to see what I've done so far.  The parts that are done work fine--as 
long as you give it a reasonable % to scan by default, it will correct 
all_max_pages and the interval in real time to meet the scan rate 
you requested, given how much is currently dirty; the I/O rate is 
computed but doesn't limit properly yet.


Nice. Enforcing a max bound on the I/O seems reasonable, if we accept 
that shortening the checkpoint is a goal.


Why haven't I brought this all up yet?  Two reasons.  The first is 
because it doesn't work on my system; checkpoints and overall throughput 
get worse when you try to shorten them by running the background writer 
at optimal aggressiveness.  Under really heavy load, the writes slow 
down as all the disk caches fill, the background writer fights with 
reads on the data that isn't in the mostly dirty cache (introducing 
massive seek delays), it stops cleaning effectively, and it's better for 
it to not even try.  My next generation of code was going to start with 
the LRU flush and then only move onto the all-scan if there's time 
leftover.


The second is that I just started to get useful results here in the last 
few weeks, and I assumed it's too big of a topic to start suggesting 
major redesigns to the background writer mechanism at that point (from 
me at least!).  I was waiting for 8.3 to freeze before even trying.  If 
you want to push through a redesign there, maybe you can get away with 
it at this late moment.  But I ask that you please don't remove anything 
from the current design until you have significant test results to back 
up that change.


Point taken. I need to start testing the LDC patch.

Since we're discussing this, let me tell what I've been thinking about 
the lru cleaning behavior of bgwriter. ISTM that that's more 
straightforward to tune automatically. Bgwriter basically needs to 
ensure that the next X buffers with usage_count=0 in the clock sweep are 
clean. X is the predicted number of buffers backends will evict until 
the next bgwriter round.


The number of buffers evicted by normal backends in a bgwriter_delay 
period is simple to keep track of, just increase a counter in 
StrategyGetBuffer and reset it when bgwriter wakes up. We can use that 
as an estimate of X with some safety margin.
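
A minimal sketch of what I have in mind (the counter and
CleanBuffersAheadOfSweep() are illustrative names, not actual bufmgr
code):

    /* Shared counter, bumped once per buffer a backend has to evict. */
    static volatile int NumBackendEvictions = 0;

    /* In StrategyGetBuffer(), once per buffer handed to a backend: */
    NumBackendEvictions++;

    /* In the bgwriter loop, once per bgwriter_delay: */
    static void
    BgLruCleaningRound(void)
    {
        int     evictions = NumBackendEvictions;

        NumBackendEvictions = 0;        /* reset for the next round */

        /* Estimate of X: recent demand plus a safety margin, then keep
         * that many usage_count == 0 buffers ahead of the sweep clean.
         * CleanBuffersAheadOfSweep() is a placeholder, not a real
         * function. */
        CleanBuffersAheadOfSweep(evictions * 2);
    }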


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Fix mdsync never-ending loop problem

2007-04-05 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
Here's a fix for the problem that on a busy system, mdsync never 
finishes. See the original problem description on hackers:

http://archives.postgresql.org/pgsql-hackers/2007-04/msg00259.php

The solution is taken from ITAGAKI Takahiro's Load Distributed 
Checkpoint patch. At the beginning of mdsync, the pendingOpsTable is 
copied to a linked list, and that list is then processed until it's empty.


Here's an updated patch, the one I sent earlier is broken. I ignored the 
return value of list_delete_cell.


We could just review and apply ITAGAKI's patch as it is instead of this 
snippet of it, but because that can take some time I'd like to see this 
applied before that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.127
diff -c -r1.127 md.c
*** src/backend/storage/smgr/md.c	17 Jan 2007 16:25:01 -	1.127
--- src/backend/storage/smgr/md.c	5 Apr 2007 16:09:31 -
***
*** 863,989 
  void
  mdsync(void)
  {
! 	bool		need_retry;
  
  	if (!pendingOpsTable)
  		elog(ERROR, "cannot sync without a pendingOpsTable");
  
  	/*
! 	 * The fsync table could contain requests to fsync relations that have
! 	 * been deleted (unlinked) by the time we get to them.  Rather than
! 	 * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
! 	 * what we will do is retry the whole process after absorbing fsync
! 	 * request messages again.  Since mdunlink() queues a revoke message
! 	 * before actually unlinking, the fsync request is guaranteed to be gone
! 	 * the second time if it really was this case.  DROP DATABASE likewise
! 	 * has to tell us to forget fsync requests before it starts deletions.
  	 */
! 	do {
! 		HASH_SEQ_STATUS hstat;
! 		PendingOperationEntry *entry;
! 		int			absorb_counter;
  
! 		need_retry = false;
  
  		/*
! 		 * If we are in the bgwriter, the sync had better include all fsync
! 		 * requests that were queued by backends before the checkpoint REDO
! 		 * point was determined. We go that a little better by accepting all
! 		 * requests queued up to the point where we start fsync'ing.
  		 */
  		AbsorbFsyncRequests();
  
! 		absorb_counter = FSYNCS_PER_ABSORB;
! 		hash_seq_init(hstat, pendingOpsTable);
! 		while ((entry = (PendingOperationEntry *) hash_seq_search(hstat)) != NULL)
  		{
! 			/*
! 			 * If fsync is off then we don't have to bother opening the file
! 			 * at all.  (We delay checking until this point so that changing
! 			 * fsync on the fly behaves sensibly.)
! 			 */
! 			if (enableFsync)
! 			{
! SMgrRelation reln;
! MdfdVec*seg;
  
! /*
!  * If in bgwriter, we want to absorb pending requests every so
!  * often to prevent overflow of the fsync request queue.  This
!  * could result in deleting the current entry out from under
!  * our hashtable scan, so the procedure is to fall out of the
!  * scan and start over from the top of the function.
!  */
! if (--absorb_counter <= 0)
! {
! 	need_retry = true;
! 	break;
! }
  
! /*
!  * Find or create an smgr hash entry for this relation. This
!  * may seem a bit unclean -- md calling smgr?  But it's really
!  * the best solution.  It ensures that the open file reference
!  * isn't permanently leaked if we get an error here. (You may
!  * say but an unreferenced SMgrRelation is still a leak! Not
!  * really, because the only case in which a checkpoint is done
!  * by a process that isn't about to shut down is in the
!  * bgwriter, and it will periodically do smgrcloseall(). This
!  * fact justifies our not closing the reln in the success path
!  * either, which is a good thing since in non-bgwriter cases
!  * we couldn't safely do that.)  Furthermore, in many cases
!  * the relation will have been dirtied through this same smgr
!  * relation, and so we can save a file open/close cycle.
!  */
! reln = smgropen(entry->tag.rnode);
! 
! /*
!  * It is possible that the relation has been dropped or
!  * truncated since the fsync request was entered.  Therefore,
!  * allow ENOENT, but only if we didn't fail once already on
!  * this file.  This applies both during _mdfd_getseg() and
!  * during FileSync, since fd.c might have closed the file
!  * behind our back.
!  */
! seg = _mdfd_getseg(reln,
!    entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
!    false, EXTENSION_RETURN_NULL);
! if (seg == NULL ||
! 	FileSync(seg->mdfd_vfd) < 0)
! {
! 	/*
! 	 * XXX is there any point in allowing more than one try?
! 	 * Don't see one at the moment, but easy to change the
! 	 * test here if so.
! 	 */
! 	if (!FILE_POSSIBLY_DELETED(errno) ||
! 		++(entry->failures)  

Re: [PATCHES] Fix mdsync never-ending loop problem

2007-04-05 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 I wonder if it wouldn't be better to clean this up by creating a
 separate typedef for segment numbers, with its own special values?

Probably.  I remember having thought about it when I put in the
FORGET_DATABASE_FSYNC hack.  I think I didn't do it because I needed
to backpatch and so I wanted a minimal-size patch.  Feel free to do it
in HEAD ...

regards, tom lane



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 The number of buffers evicted by normal backends in a bgwriter_delay 
 period is simple to keep track of, just increase a counter in 
 StrategyGetBuffer and reset it when bgwriter wakes up. We can use that 
 as an estimate of X with some safety margin.

You'd want some kind of moving-average smoothing in there, probably with
a lot shorter ramp-up than ramp-down time constant, but this seems
reasonable enough to try.
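
Something along these lines, perhaps (a sketch; the constants and names
are made up):

    /* Asymmetric exponential smoothing of the eviction count: react
     * quickly when demand rises, decay slowly when it falls. */
    static double smoothed_evictions = 0.0;

    static void
    update_eviction_estimate(int evictions_this_round)
    {
        double  alpha = (evictions_this_round > smoothed_evictions)
            ? 0.5                       /* short ramp-up time constant */
            : 0.05;                     /* long ramp-down time constant */

        smoothed_evictions +=
            alpha * (evictions_this_round - smoothed_evictions);
    }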

regards, tom lane



Re: [PATCHES] Fix mdsync never-ending loop problem

2007-04-05 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Here's a fix for the problem that on a busy system, mdsync never 
finishes. See the original problem description on hackers:


This leaks memory, no?  (list_delete_cell only deletes the ListCell.)


Oh, I just spotted another problem with it and posted an updated patch, 
but I missed that.



But I dislike copying the table entries anyway, see comment on -hackers.


Frankly the cycle id idea sounds uglier and more fragile to me. You'll 
need to do multiple scans of the hash table that way, starting from the 
top every time you call AbsorbFsyncRequests (like we do now). But whatever...



BTW, it's very hard to see what a patch like this is actually changing.
It might be better to submit a version that doesn't reindent the chunks
of code you aren't changing, so as to reduce the visual size of the
diff.  A note to the committer to reindent the whole function is
sufficient (or if he forgets, pg_indent will fix it eventually).


Ok, will do that. Or would you like to just take over from here?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Fix mdsync never-ending loop problem

2007-04-05 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 But I dislike copying the table entries anyway, see comment on -hackers.

 Frankly the cycle id idea sounds uglier and more fragile to me. You'll 
 need to do multiple scans of the hash table that way, starting from the 
 top every time you call AbsorbFsyncRequests (like we do now).

How so?  You just ignore entries whose cycleid is too large.  You'd have
to be careful about wraparound in the comparisons, but that's not hard
to deal with.  Also, AFAICS you still have the retry problem (and an
even bigger memory leak problem) with this coding --- the to-do list
doesn't eliminate the issue of correct handling of a failure.
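
The comparison could be done the way we already handle XID wraparound,
e.g. (a sketch; the type and field names are illustrative):

    typedef uint16 CycleCtr;

    /* True if cycle a is not later than cycle b, modulo wraparound;
     * valid as long as the counters never get more than 2^15 apart. */
    static bool
    CycleCtrPrecedesOrEquals(CycleCtr a, CycleCtr b)
    {
        return (int16) (a - b) <= 0;
    }

    /* In the mdsync() scan: skip entries queued after this cycle began. */
    if (!CycleCtrPrecedesOrEquals(entry->cycle_ctr, mdsync_cycle_ctr))
        continue;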

 Ok, will do that. Or would you like to just take over from here?

No, I'm up to my ears in varlena.  You're the one in a position to test
this, anyway.

regards, tom lane



Re: [PATCHES] Fix mdsync never-ending loop problem

2007-04-05 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

Tom Lane wrote:

But I dislike copying the table entries anyway, see comment on -hackers.


Frankly the cycle id idea sounds uglier and more fragile to me. You'll 
need to do multiple scans of the hash table that way, starting from the 
top every time you call AbsorbFsyncRequests (like we do now).


How so?  You just ignore entries whose cycleid is too large.  You'd have
to be careful about wraparound in the comparisons, but that's not hard
to deal with.  Also, AFAICS you still have the retry problem (and an
even bigger memory leak problem) with this coding --- the to-do list
doesn't eliminate the issue of correct handling of a failure.


You have to start the hash_seq_search from scratch after each call to 
AbsorbFsyncRequests because it can remove entries, including the one the 
scan is stopped on.


I think the failure handling is correct in the to-do list approach: 
when an entry is read from the list, it's checked that the entry hasn't 
been removed from the hash table. Actually, there was a bug in the 
original LDC patch in the failure handling: it replaced the per-entry 
failures counter with a local retry_counter variable, but that wasn't 
cleared after a successful write, which would lead to bogus ERRORs when 
multiple relations are dropped during the mdsync. I kept the original 
per-entry counter, though the local variable approach could be made to work.


The memory leak obviously needs to be fixed, but that's just a matter of 
adding a pfree.
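
For reference, the shape of the loop I'm describing (a condensed
sketch, not the patch itself; copy_entry() and mdsync_one_entry() are
invented helper names):

    List   *todo = NIL;
    HASH_SEQ_STATUS hstat;
    PendingOperationEntry *entry;

    /* Snapshot the pending-ops table into a to-do list. */
    hash_seq_init(&hstat, pendingOpsTable);
    while ((entry = hash_seq_search(&hstat)) != NULL)
        todo = lappend(todo, copy_entry(entry));

    while (todo != NIL)
    {
        PendingOperationEntry *copy = linitial(todo);

        todo = list_delete_first(todo);

        /* The real entry may have been removed by an absorbed revoke
         * message; look it up again and skip it if it's gone.  The
         * failures counter lives in the hash entry, not the copy. */
        entry = hash_search(pendingOpsTable, &copy->tag,
                            HASH_FIND, NULL);
        if (entry != NULL)
            mdsync_one_entry(entry);

        pfree(copy);            /* the missing pfree mentioned above */
    }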



Ok, will do that. Or would you like to just take over from here?


No, I'm up to my ears in varlena.  You're the one in a position to test
this, anyway.


Ok.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Fix mdsync never-ending loop problem

2007-04-05 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I think the failure handling is correct in the to-do list approach: 
 when an entry is read from the list, it's checked that the entry hasn't 
 been removed from the hash table. Actually, there was a bug in the 
 original LDC patch in the failure handling: it replaced the per-entry 
 failures counter with a local retry_counter variable, but that wasn't 
 cleared after a successful write, which would lead to bogus ERRORs when 
 multiple relations are dropped during the mdsync. I kept the original 
 per-entry counter, though the local variable approach could be made to work.

Yeah.  One of the things that bothered me about the patch was that it
would be easy to mess up by updating state in the copied entry instead
of the real info in the hashtable.  It would be clearer what's
happening if the to-do list contains only the lookup keys and not the
whole struct.

regards, tom lane



Re: [PATCHES] Fix mdsync never-ending loop problem

2007-04-05 Thread Simon Riggs
On Thu, 2007-04-05 at 17:14 +0100, Heikki Linnakangas wrote:

 We could just review and apply ITAGAKI's patch as it is instead of
 this snippet of it, but because that can take some time I'd like to
 see this applied before that. 

I think we are just beginning to understand the quality of Itagaki's
thinking.

We should give him a chance to interact on this, and if there are parts
of his patch that we want, then he should be the one to do it. I'm not
sure that carving the good bits off each other's patches is likely to
help teamwork in the long term. At the very least he deserves much credit
for his farsighted work.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Fix mdsync never-ending loop problem

2007-04-05 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2007-04-05 at 17:14 +0100, Heikki Linnakangas wrote:


We could just review and apply ITAGAKI's patch as it is instead of
this snippet of it, but because that can take some time I'd like to
see this applied before that. 


I think we are just beginning to understand the quality of Itagaki's
thinking.

We should give him a chance to interact on this, and if there are parts
of his patch that we want, then he should be the one to do it. 


Itagaki, would you like to take a stab at this?


I'm not
sure that carving the good bits off each other's patches is likely to
help teamwork in the long term. At the very least he deserves much credit
for his farsighted work.


Oh sure! Thank you for your efforts, Itagaki!

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
The number of buffers evicted by normal backends in a bgwriter_delay 
period is simple to keep track of, just increase a counter in 
StrategyGetBuffer and reset it when bgwriter wakes up. We can use that 
as an estimate of X with some safety margin.


You'd want some kind of moving-average smoothing in there, probably with
a lot shorter ramp-up than ramp-down time constant, but this seems
reasonable enough to try.


Ironically, I just noticed that we already have a patch in the patch 
queue that implements exactly that, again by Itagaki. I need to start 
paying more attention :-). Keep up the good work!


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] autovacuum multiworkers, patch 5

2007-04-05 Thread Alvaro Herrera
Another problem seems to be that I'm not checking anywhere that a
regular connection (not autovac) is not using an autovac-reserved PGPROC
slot :-(  I think I should tweak the logic that deals with
ReservedBackends but it doesn't look entirely trivial.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



[PATCHES] Reviewers Guide to Deferred Transactions/Transaction Guarantee

2007-04-05 Thread Simon Riggs
transaction_guarantee.v11.patch 
- keep current, cleanup, more comments and docs

Brief Performance Analysis
--

I've tested 3 scenarios:
1. normal
2. wal_writer_delay = 100ms
3. wal_writer_delay = 100ms and transaction_guarantee = off

On my laptop, with a scale=1 pgbench database with 1 connection I
consistently get around 85 tps in mode (1), with a slight performance
drop in mode (2). In mode (3) I get anywhere from 200 tps to 900 tps,
depending upon how well cached everything is, with 700 tps being fairly
typical. fsync = off gives around 900 tps. 

Also good speedups with multiple session tests.

make installcheck passes in 120 sec in mode (3), versus 155 sec in mode
(1) and 158 sec in mode (2).

Basic Implementation


xact.c
xact.h

The basic implementation simply records the LSN of the xlog commit
record in a shared memory area, the deferred fsync cache. 

ipci.c

The cache is protected by an LWlock called DeferredFsyncLock.

lwlock.h

A WALWriter process wakes up regularly to perform a background flush of
WAL up to the point of the highest LSN in the deferred fsync cache.

walwriter.c
walwriter.h
postmaster.c

WALWriter can be enabled only at server start.
(All above same as March 11 version)

Correctness
---

postgres.c

Only certain code paths can execute transaction_guarantee = off
transactions, though the main code paths for OLTP allow it.

xlog.c 

CreateCheckpoint() must protect against starting a checkpoint when
commits are not yet flushed, so an additional flush must occur here.

vacuum.c 

VACUUM FULL cannot move tuples until their states are all known, so this
command triggers a background flush also.

clog.c
clog.h
slru.c
slru.h

Changes to Clog and SLRU enforce the basic rule of WAL-before-data,
which otherwise might allow the record of a commit to reach disk before
the flush of the WAL. This is implemented by storing an LSN for each
clog page.
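
In outline the rule looks like this (a sketch; page_lsn is an
illustrative field on the shared SLRU state, and the tracking could be
finer-grained than one LSN per page):

    /* At commit, when setting the xact's status in its clog page:
     * remember the highest commit-record LSN seen for that page. */
    if (XLByteLT(shared->page_lsn[slotno], commit_lsn))
        shared->page_lsn[slotno] = commit_lsn;

    /* In the SLRU write path, before the clog page goes to disk:
     * flush WAL at least that far -- WAL before data. */
    XLogFlush(shared->page_lsn[slotno]);
    /* ... only now is it safe to write() the page ... */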

transam.c
transam.h
twophase.c
xact.c

The above files have API changes that allow the LSN at transaction
commit to be passed through to the Clog.

tqual.c
tqual.h
multixact.c
multixact.h

Visibility hint bits must also not be set before the transaction is
flushed, so other changes are required to ensure we store the LSN of
each transaction, not just the maximum LSN. Changes to tqual.c appear
extensive, though this is just refactoring to allow us to make
additional function calls before setting bits - there are no functional
changes to any HeapTupleSatisfies... functions.
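
The shape of the check is roughly this (a sketch; TransactionGetCommitLSN()
and GetFlushedRecPtr() are illustrative names, not the patch's):

    /* Only set a visibility hint bit once the commit record of the
     * xact in question is known to be flushed to disk. */
    static void
    SetHintBitIfSafe(HeapTupleHeader tuple, uint16 infomask,
                     TransactionId xid)
    {
        XLogRecPtr  commitLSN = TransactionGetCommitLSN(xid);

        if (XLByteLE(commitLSN, GetFlushedRecPtr()))
            tuple->t_infomask |= infomask;  /* commit safely on disk */
        /* otherwise leave the bit unset; a later reader can retry */
    }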

xact.c

Contains the module for the Deferred Transaction functions and in
particular the deferred transaction cache. This could be a separate
module, since there is only a slight link with the other xact.c code. 

User Interface
--

guc.c
postgresql.conf.sample
guc_table.h

New parameters have been added, with a new parameter grouping of
WAL_COMMITS created to control the various commit parameters.

Performance Tuning
--

The WALWriter wakes up every wal_writer_delay milliseconds. There are two
protections against mis-setting this parameter.

pmsignal.h

The WALWriter will also be woken by a signal if the DF cache has nearly
filled and flushing would be desirable.

The WALWriter will also loop without any delay if the number of
transactions committed while it was writing WAL is above a threshold
value.
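
Putting the two protections together, the main loop would be roughly
(a sketch; FlushDeferredCommits() and the threshold are illustrative,
not the patch's actual names):

    /* WALWriter main loop. */
    for (;;)
    {
        int     ncommits;

        /* Flush WAL up to the highest LSN in the DF cache; returns the
         * number of commits that arrived while we were writing. */
        ncommits = FlushDeferredCommits();

        if (ncommits <= LOOP_AGAIN_THRESHOLD)
            pg_usleep(WalWriterDelay * 1000L);
        /* else loop again immediately, without sleeping */
    }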

Docs

The fsync parameter has been removed from postgresql.conf.sample and the
docs, though it still exists in this patch to allow performance testing
during Beta. It is suggested that fsync = off should mean the same thing as
transaction_guarantee = off, wal_writer_delay = 100ms, if it is
specified in postgresql.conf or on the server command line.

A new section in wal.sgml will describe this in more detail, later.

Open Questions
--

1. Should the DFC use a standard hash table? Custom code allows both
additional speed and the ability to signal when it fills.

2. Should tqual.c update the LSN of a heap page with the LSN of the
transaction commit that it can read from the DF cache?

3. Should the WALWriter also do the wal_buffers half-full write at the
start of XLogInsert()?

4. The recent changes to remove CheckpointStartLock haven't changed the
code path for deferred transactions, so a similar solution might be
possible there also.

5. Is it correct to do WAL-before-flush for clog only, or should this
be multixact also?

All of the above are fairly minor changes.

Any other thoughts/comments/tests welcome.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



transaction_guarantee.v11.patch.gz
Description: GNU Zip compressed data



Re: [PATCHES] pgbench transaction timestamps

2007-04-05 Thread Tatsuo Ishii
  Tatsuo, would you please comment on this patch too?
 
 No problem. I will come up with a comment by the end of this week.

Here are comments.

1) The latency log file format extension looks useful (-x option).

2) It seems the cleanup feature (-X option) was withdrawn by the
   author, but the patches still include it, so I'm confused.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

  ---
  
  Greg Smith wrote:
   This patch changes the way pgbench outputs its latency log files so that 
   every transaction gets a timestamp and notes which transaction type was 
   executed.  It's a one-line change that just dumps some additional 
   information that was already sitting in that area of code. I also made a 
   couple of documentation corrections and clarifications on some of the 
   more 
   confusing features of pgbench.
   
   It's straightforward to parse log files in this format to analyze what 
   happened during the test at a higher level than was possible with the 
   original format.  You can find some rough sample code to convert this 
   latency format into CSV files and then into graphs at 
   http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm which I'll 
   be expanding on once I get all my little patches sent in here.
   
   If you recall the earlier version of this patch I submitted, it added a 
   cleanup feature that did a vacuum and checkpoint after the test was 
   finished and reported two TPS results.  The idea was to quantify how much 
   of a hit the eventual table maintenance required to clean up after the 
   test would take.  While those things do influence results and cause some 
   of the run-to-run variation in TPS (checkpoints are particularly visible 
   in the graphs), after further testing I concluded running a VACUUM 
   VERBOSE 
   and CHECKPOINT in a script afterwards and analyzing the results was more 
   useful than integrating something into pgbench itself.
   
   --
   * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
  [ Attachment, skipping... ]
  
   
  
  -- 
Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
EnterpriseDB   http://www.enterprisedb.com
  
+ If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] Optimized pgbench for 8.3

2007-04-05 Thread Tatsuo Ishii
  Tatsuo, would you please comment on this patch?
 
 Sure. I will come up with a comment by the end of this week.

The patches look good to me.

BTW, is anybody working on adding fillfactor support to the tables used
by pgbench? 8.3 will introduce HOT, and I think adding that feature
will make it easier to test HOT.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

  ---
  
  ITAGAKI Takahiro wrote:
   The attached is a patch to optimize contrib/pgbench using new 8.3 
   features.
   
   - Use DROP IF EXISTS to suppress errors for initial loadings.
   - Use a combination of TRUNCATE and COPY to reduce WAL on creating
 the accounts table.
   
   Also, there are some cosmetic changes.
   
   - Change the output of the -v option from "starting full vacuum..."
     to "starting vacuum accounts..." to reflect what it actually does.
   - Shape duplicated error checks into executeStatement().
   
   
   There is a big performance win in COPY with no WAL feature.
   Thanks for the efforts!
   
   Regards,
   ---
   ITAGAKI Takahiro
   NTT Open Source Software Center
  
  [ Attachment, skipping... ]
  
   
  
  -- 
Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
EnterpriseDB   http://www.enterprisedb.com
  
+ If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Takayuki Tsunakawa
Hello, long time no see.

I'm sorry to interrupt your discussion. I'm afraid the code is getting
more and more complicated in order to keep using fsync(). Though I don't
intend to say the current approach is wrong, could anyone evaluate again
the O_SYNC approach that commercial databases use, and tell me if and why
PostgreSQL's fsync() approach is better than theirs?

This January, I got a good result with O_SYNC, which I haven't
reported here. I'll show it briefly. Please forgive me for my abrupt
email, because I don't have enough time.
# Personally, I want to work in the community, if I'm allowed.
And sorry again: last year I reported that O_SYNC resulted in very bad
performance, but that was wrong. The PC server I borrowed was configured
so that all the disks formed one RAID5 device. So the disks
for data and WAL (/dev/sdd and /dev/sde) came from the same RAID5
device, resulting in I/O contention.

What I modified is md.c only. I just added O_SYNC to the open flag in
mdopen() and _mdfd_openseg(), if am_bgwriter is true. I didn't want
backends to use O_SYNC because mdextend() does not have to transfer
data to disk.
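
Essentially this (a sketch of the change, not the exact diff):

    /* In mdopen()/_mdfd_openseg(): the bgwriter opens data files with
     * O_SYNC so each write() reaches disk before returning. */
    int     openflags = O_RDWR | PG_BINARY;

    if (am_bgwriter)
        openflags |= O_SYNC;

    fd = PathNameOpenFile(path, openflags, 0600);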

My evaluation environment was:

CPU: Intel Xeon 3.2GHz * 2 (HT on)
Memory: 4GB
Disk: Ultra320 SCSI (perhaps configured as write back)
OS: RHEL3.0 Update 6
Kernel: 2.4.21-37.ELsmp
PostgreSQL: 8.2.1

The relevant settings of PostgreSQL are:

shared_buffers = 2GB
wal_buffers = 1MB
wal_sync_method = open_sync
checkpoint_* and bgwriter_* parameters are left as their defaults.

I used pgbench with data at scaling factor 50.


[without O_SYNC, original behavior]
- pgbench -c1 -t16000
  best response: 1ms
  worst response: 6314ms
  10th worst response: 427ms
  tps: 318
- pgbench -c32 -t500
  best response: 1ms
  worst response: 8690ms
  10th worst response: 8668ms
  tps: 330

[with O_SYNC]
- pgbench -c1 -t16000
  best response: 1ms
  worst response: 350ms
  10th worst response: 91ms
  tps: 427
- pgbench -c32 -t500
  best response: 1ms
  worst response: 496ms
  10th worst response: 435ms
  tps: 1117

If the write-back cache were disabled, the difference would be
smaller.
The Windows version showed similar improvements.


However, this approach has two big problems.


(1) Slows down bulk updates

Updates of large amounts of data get much slower because bgwriter seeks
and writes dirty buffers synchronously page by page. For example:

- COPY of accounts (5m records) and CHECKPOINT command after COPY
  without O_SYNC: 100sec
  with O_SYNC: 1046sec
- UPDATE of all records of accounts
  without O_SYNC: 139sec
  with O_SYNC: 639sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
  without O_SYNC: 24sec
  with O_SYNC: 126sec

To mitigate this problem, I sorted dirty buffers by their relfilenode
and block numbers (comparator sketched below) and wrote multiple pages
that are adjacent both in memory and on disk. The result was:

- COPY of accounts (5m records) and CHECKPOINT command after COPY

  227sec
- UPDATE of all records of accounts
  569sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
  71sec

Still bad...
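
The comparator I used was along these lines (a sketch; the struct is
illustrative, not taken from the patch):

    typedef struct
    {
        RelFileNode rnode;          /* which relation */
        BlockNumber blocknum;       /* which block within it */
        int         buf_id;         /* the dirty buffer to write */
    } DirtyPageItem;

    static int
    dirty_page_cmp(const void *a, const void *b)
    {
        const DirtyPageItem *x = (const DirtyPageItem *) a;
        const DirtyPageItem *y = (const DirtyPageItem *) b;
        int         r = memcmp(&x->rnode, &y->rnode, sizeof(RelFileNode));

        if (r != 0)
            return r;
        if (x->blocknum != y->blocknum)
            return (x->blocknum < y->blocknum) ? -1 : 1;
        return 0;
    }

    /* qsort(items, n, sizeof(DirtyPageItem), dirty_page_cmp); then
     * write runs of consecutive blocks together. */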


(2) Can't utilize tablespaces

Though I didn't evaluate, update activity would be much less efficient
with O_SYNC than with fsync() when using multiple tablespaces, because
there is only one bgwriter.


Can anyone solve these problems?
One of my ideas is to use scattered I/O. I hear that readv()/writev()
have done real scattered I/O since kernel 2.6 (RHEL4.0); with
kernels before 2.6, readv()/writev() just performed the I/Os sequentially.
Windows has provided reliable scattered I/O for years.
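
For pages that are contiguous on disk but scattered in shared buffers,
one writev() could replace a whole run of write() calls, e.g. (a
sketch; BLCKSZ as in PostgreSQL):

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Gather-write npages buffers that map to consecutive disk blocks
     * starting at first_offset, with a single writev() call. */
    static ssize_t
    write_run(int fd, char *pages[], int npages, off_t first_offset)
    {
        struct iovec iov[32];           /* assume npages <= 32 here */
        int         i;

        for (i = 0; i < npages; i++)
        {
            iov[i].iov_base = pages[i]; /* scattered in shared memory */
            iov[i].iov_len = BLCKSZ;
        }

        if (lseek(fd, first_offset, SEEK_SET) < 0)
            return -1;
        return writev(fd, iov, npages); /* one syscall for the whole run */
    }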

Another idea is to use async I/O, possibly combined with multiple
bgwriter approach on platforms where async I/O is not available. How
about the chance Josh-san has brought?





Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-05 Thread Koichi Suzuki

Hi,

I agree to putting the patch in core and the others (pg_compresslog and 
pg_decompresslog) in contrib/lesslog.


I will make separate materials to go to core and contrib.

As for the patches, we have tested against pgbench, DBT-2 and our 
proprietary benchmarks, and they appear to run correctly.


Regards;

Simon Riggs wrote:

On Tue, 2007-04-03 at 19:45 +0900, Koichi Suzuki wrote:

Bruce Momjian wrote:

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches
Thank you very much for including it.   Attached is an update of the patch 
according to Simon Riggs's comment about the GUC name.


The patch comes with its own install kit, which is great for review
(many thanks), but makes it hard to determine where you think the code
should go when committed.

My guess, based on your patch:
- the patch gets applied to core :-)
- pg_compresslog *and* pg_decompresslog go to a contrib directory called
contrib/lesslog?

Can you please produce a combined patch that does all of the above, plus
edits the contrib Makefile to add all of those, as well as editing the
README so it doesn't mention the patch, just the contrib executables?

The patch looks correct to me now. I haven't tested it yet, but will be
doing so in the last week of April, which is when I'll be doing docs for
this and other stuff, since time is pressing now.




--
Koichi Suzuki



Re: [PATCHES] Packed Varlena Update (v21)

2007-04-05 Thread Tom Lane
stark [EMAIL PROTECTED] writes:
 [ packed varlena patch ]

Applied with revisions.  Waiting to see how badly the buildfarm
complains ;-) ... but I did test on little- and big-endian machines
with different MAXALIGN values.

regards, tom lane



Re: [PATCHES] Packed Varlena Update (v21)

2007-04-05 Thread Tom Lane
I wrote:
 stark [EMAIL PROTECTED] writes:
 [ packed varlena patch ]

 Applied with revisions.

Forgot to mention: one of the revisions was to not add the sizes.sql
test, because the output was platform-dependent and is likely to get
more so if any ability to change the toast thresholds gets put in.
Can we improve that test to not expose any platform-dependent numbers?

regards, tom lane



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Greg Smith

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

The purpose of the bgwriter_all_* settings is to shorten the duration of 
the eventual checkpoint. The reason to shorten the checkpoint duration 
is to limit the damage to other I/O activity it causes. My thinking is 
that assuming the LDC patch is effective (agreed, needs more testing) at 
smoothing the checkpoint, the duration doesn't matter anymore. Do you 
want to argue there's other reasons to shorten the checkpoint duration?


My testing results suggest that LDC doesn't smooth the checkpoint usefully 
when under a high (30 clients here) load, because (on Linux at least) the 
way the OS caches writes clashes badly with how buffers end up being 
evicted if the buffer pool fills back up before the checkpoint is done. 
In that context, anything that slows down the checkpoint duration is going 
to make the problem worse rather than better, because it makes it more 
likely that the tail end of the checkpoint will have to fight with the 
clients for write bandwidth, at which point they both suffer.  If you just 
get the checkpoint done fast, the clients can't fill the pool as fast as 
the BufferSync is writing it out, and things are as happy as they can be 
without a major rewrite to all this code.  I can get a tiny improvement in 
some respects by delaying 2-5 seconds between finishing the writes and 
calling fsync, because that gives Linux a moment to usefully spool some of 
the data to the disk controller's cache; beyond that any additional delay 
is a problem.


Since it's only the high load cases I'm having trouble dealing with, this 
basically makes it a non-starter for me.  The focus on checkpoint_timeout 
and ignoring checkpoint_segments in the patch is also a big issue for me. 
At the same time, I recognize that the approach taken in LDC probably is a 
big improvement for many systems, it's just a step backwards for my 
highest throughput one.  I'd really enjoy hearing some results from 
someone else.


The number of buffers evicted by normal backends in a bgwriter_delay period 
is simple to keep track of, just increase a counter in StrategyGetBuffer and 
reset it when bgwriter wakes up.


I see you've already found the other helpful Itagaki patch in this area. 
I know I would like to see his code for tracking evictions committed, then 
I'd like that to be added as another counter in pg_stat_bgwriter (I 
mentioned that to Magnus in passing when he was setting the stats up but 
didn't press it because of the patch dependency).  Ideally, and this idea 
was also in Itagaki's patch with the writtenByBgWriter/ByBackEnds debug 
hook, I think it's important that you know how every buffer written to 
disk got there--was it a background writer, a checkpoint, or an eviction 
that wrote it out?  Track all those and you can really learn something 
about your write performance, data that's impossible to collect right now.


However, as Itagaki himself points out, doing something useful with 
bgwriter_lru_maxpages is only one piece of automatically tuning the 
background writer.  I hate to join in on chopping his patches up, but 
without some additional work I don't think the exact auto-tuning logic he 
then applies will work in all cases, which could make it more a problem 
than the current crude yet predictable method.  The whole way 
bgwriter_lru_maxpages and num_to_clean play off each other in his code 
currently has a number of failure modes I'm concerned about.  I'm not sure 
if a re-write using a moving-average approach (as I did in my auto-tuning 
writer prototype and as Tom just suggested here) will be sufficient to fix 
all of them.  Was already on my to-do list to investigate that further.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
