Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Heikki Linnakangas
We just discussed this in detail with Simon, and it looks like we have 
5 (!) different but related problems:


1) The original problem of freeze then crash, leaving too-high values in 
relminxid and datminxid. If you then run vacuum, it might truncate the CLOG 
and you lose the commit status of the tuples that were supposed to be 
frozen.


To fix this, we need to WAL log freezing as already discussed.
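
For illustration, the WAL record could look roughly like this (a sketch only;
field names are mine, not a committed patch). The point is just that replay
must be able to redo the xmin changes even if none of vacuum's page writes
made it to disk:

/* sketch only - not a committed record layout */
typedef struct xl_heap_freeze
{
	RelFileNode	node;		/* relation being vacuumed */
	BlockNumber	block;		/* heap page whose tuples were frozen */
	TransactionId	cutoff_xid;	/* FreezeLimit used by this vacuum */
	/* offset numbers of the frozen tuples follow at end of record */
} xl_heap_freeze;

One record per page that vacuum actually freezes should be enough; redo just
re-stamps the listed tuples with FrozenTransactionId.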

2) vactuple_get_minxid doesn't take into account xmax's of tuples that 
have HEAP_XMAX_INVALID set. That's a problem:


transaction 1001 - BEGIN; DELETE FROM foo where key = 1;
transaction 1001 - ROLLBACK;
transaction 1002 - VACUUM foo;
crash

VACUUM foo will set relminxid to 1002, because HEAP_XMAX_INVALID was set 
(possibly by vacuum itself) on the tuple touched by the deletion that rolled 
back. However, that hint-bit update hasn't hit the disk yet, so after 
recovery the tuple will have an xmax of 1001 with no hint bit, and 
relminxid is 1002.


The simplest fix for this issue is to ignore the HEAP_XMAX_INVALID hint 
bit, and take any xmax other than InvalidXid into account when 
calculating the relminxid.
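
As a rough sketch of what that rule looks like in the tuple scan (illustrative
code, not the actual vactuple_get_minxid):

	/* sketch; always consider xmax, whatever the hint bits claim */
	TransactionId	xmax = HeapTupleHeaderGetXmax(tuple);

	if (TransactionIdIsNormal(xmax) && TransactionIdPrecedes(xmax, minxid))
		minxid = xmax;

The hint bit can still be used for visibility checks; it just can't be used to
decide which xids we may assume are gone from the table.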


3) If you recover from a PITR backup (or have a hot standby) with a base 
backup that's more than 4 billion transactions older than the newest WAL 
record, the clog entries of old transactions in the base backup will 
overlap with the clog entries of new transactions that are in the WAL 
records. This is the problem you also pointed out below.


To fix this, we need to emit a WAL record when truncating the clog. We 
must also make sure that recovery of any WAL record type doesn't rely on 
clog, because if we truncate the clog and then crash, recovery won't 
have the clog available for the old transactions. At the moment, 
TruncateCLog issues a checkpoint to protect from that, but that's not 
going to work when rolling forward logs in PITR, right?


4) If we fix issue 2 so that vactuple_get_minxid always takes xmax into 
account, even if HEAP_XMAX_INVALID is set, a tuple with an aborted xmax 
will keep us from advancing relminxid and truncating clog etc. That 
doesn't lead to data corruption, but you will eventually hit the 
transaction wrap-around limit. We don't have the same problem with xmin, 
because we freeze tuples that are older than FreezeLimit to avoid it, 
but we don't do that for xmax.


To fix this, replace any xmax older than FreezeLimit with InvalidXid 
during vacuum. That also needs to be WAL logged.
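
Sketched out (illustrative code again, and assuming we only do this once we
know the xmax belongs to an aborted transaction - a committed xmax that old
would mean the tuple is dead and gets removed anyway):

	/* sketch: an old aborted xmax is reset so it can't hold back relminxid */
	if (TransactionIdIsNormal(xmax) &&
		TransactionIdPrecedes(xmax, FreezeLimit))
	{
		HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
		tuple->t_infomask |= HEAP_XMAX_INVALID;
		/* and WAL log it, like the xmin freezing above */
	}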


5) We don't freeze tuples that are in RECENTLY_DEAD or 
DELETE_IN_PROGRESS state. That doesn't lead to data corruption, but it 
might make you hit the transaction wrap-around limit. That can happen if 
you have a transaction that deletes or updates a very old, but not yet 
frozen tuple. If you run vacuum while the deleting transaction is in 
progress, vacuum won't freeze the tuple, and won't advance the 
wrap-around limit because of the old tuple. That's not serious if the 
deleting transaction commits, because the next vacuum will then remove 
the tuple, but if it aborts, we might run into the same problem on the 
next vacuum, and the next one, and the next one, until we reach the 
wrap-around.


To fix this, simply do the freezing for tuples in RECENTLY_DEAD and 
DELETE_IN_PROGRESS states as well.


Am I missing something? Finding this many bugs makes me nervous...

Simon volunteered to make the clog changes for 3 because it's a PITR 
related issue. I can write a patch/patches for the other changes if it 
helps.


Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

Tom Lane wrote:

I think it's premature to start writing
patches until we've decided how this really needs to work.


Not logging hint-bit updates seems safe to me. As long as we have the 
clog, the hint-bit is just a hint. The problem with freezing is that 
after freezing tuples, the corresponding clog page can go away.


Actually clog can go away much sooner than that, at least in normal
operation --- that's what datvacuumxid is for, to track where we can
truncate clog.  Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation.  Maybe we should somehow not xlog updates of datvacuumxid?

Another thing I'm concerned about is the scenario where a PITR
hot-standby machine tracks a master over a period of more than 4 billion
transactions.  I'm not sure what will happen in the slave's pg_clog
directory, but I'm afraid it won't be good :-(

regards, tom lane




--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

Tom Lane wrote:

I think it's premature to start writing
patches until we've decided how this really needs to work.


Not logging hint-bit updates seems safe to me. As long as we have the 
clog, the hint-bit is just a hint. The problem with freezing is that 
after freezing tuples, the corresponding clog page can go away.


Actually clog can go away much sooner than that, at least in normal
operation --- that's what datvacuumxid is for, to track where we can
truncate clog.  Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation.  Maybe we should somehow not xlog updates of datvacuumxid?


Oh, I just understood what you were trying to say. We really don't keep 
clog around all the way back to the freeze horizon.


Well, how about doing just that? With a FreezeLimit of 1 billion 
transactions, at two clog bits per transaction that's roughly 250 megabytes 
of clog. We could freeze more aggressively to make it less.


We can't just not xlog updates of datvacuumxid, because that buffer 
might still be written to disk before all the hint-bit updates. One 
trick would be to include the old value in the WAL record. WAL recovery 
would have to keep track of those records, and reset any datvacuumxid to 
the old value if there's no checkpoint record afterwards.
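
Such a record might look roughly like this (purely illustrative, nothing like
it exists today):

/* sketch only */
typedef struct xl_dbase_vacuumxid
{
	Oid		dbid;		/* database whose datvacuumxid is updated */
	TransactionId	old_vacuumxid;	/* value replay falls back to */
	TransactionId	new_vacuumxid;	/* value being written */
} xl_dbase_vacuumxid;

Replay would apply new_vacuumxid as usual but remember the record; if it
reaches the end of WAL without seeing a later checkpoint record, it puts
old_vacuumxid back.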


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Alvaro Herrera
Heikki Linnakangas wrote:
 We just discussed this in detail with Simon, and it looks like we have 
 5 (!) different but related problems:

Wow, four of them are mine :-(

 2) vactuple_get_minxid doesn't take into account xmax's of tuples that 
 have HEAP_XMAX_INVALID set. That's a problem:
 
 transaction 1001 - BEGIN; DELETE FROM foo where key = 1;
 transaction 1001 - ROLLBACK;
 transaction 1002 - VACUUM foo;
 crash
 
 VACUUM foo will set relminxid to 1002, because HEAP_XMAX_INVALID was set 
 (possibly by vacuum itself) on the tuple touched by the deletion that rolled 
 back. However, that hint-bit update hasn't hit the disk yet, so after 
 recovery the tuple will have an xmax of 1001 with no hint bit, and 
 relminxid is 1002.
 
 The simplest fix for this issue is to ignore the HEAP_XMAX_INVALID hint 
 bit, and take any xmax other than InvalidXid into account when 
 calculating the relminxid.

Ugh.  Is there another solution to this?  Say, sync the buffer so that
the hint bits are written to disk?  The bug (4) below is problematic if
you take this approach; basically it removes all the optimization won by
the relminxid patch.


 Simon volunteered to make the clog changes for 3 because it's a PITR 
 related issue. I can write a patch/patches for the other changes if it 
 helps.

I'm swamped at the moment, so I'd appreciate it.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] bug in on_error_rollback !?

2006-10-30 Thread Jim C. Nasby
On Fri, Oct 27, 2006 at 01:19:25PM -, Greg Sabino Mullane wrote:
  This is documented clearly on the psql man page, so it is simply not a
  bug, and changing this would probably break lots of legacy scripts.
 
 In a general sense, perhaps, but in this *particular* case, I don't
 see what harm allowing \set on_error_rollback would be: it certainly
 won't break any existing scripts. I wrote this feature (but someone else
 chose the name!) and I still occasionally write it lowercase and wonder
 why it isn't working. :)
 
 Perhaps even allowing all of the \set commands to be case-insensitive
 may be a good idea?

The problem is that people may be depending on the case-sensitivity in
their scripts...

\set ACounter 1
\set aCounter 2

Of course, they're just asking for trouble, but suddenly making psql
variables case-insensitive would break that code.

Perhaps a good compromise would be adopting SQL case syntax (ie: you
have to wrap in double quotes to preserve case).
-- 
Jim Nasby                                    [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Ugh.  Is there another solution to this?  Say, sync the buffer so that
 the hint bits are written to disk?

Yeah.  The original design for all this is explained by the notes for
TruncateCLOG:

 * When this is called, we know that the database logically contains no
 * reference to transaction IDs older than oldestXact.  However, we must
 * not truncate the CLOG until we have performed a checkpoint, to ensure
 * that no such references remain on disk either; else a crash just after
 * the truncation might leave us with a problem.

The pre-8.2 coding is actually perfectly safe within a single database,
because TruncateCLOG is only called at the end of a database-wide
vacuum, and so the checkpoint is guaranteed to have flushed valid hint
bits for all tuples to disk.  There is a risk in other databases though.
I think that in the 8.2 structure the equivalent notion must be that
VACUUM has to flush and fsync a table before it can advance the table's
relminxid.

That still leaves us with the problem of hint bits not being updated
during WAL replay.  I think the best solution for this is for WAL replay
to force relvacuumxid to equal relminxid (btw, these field names seem
poorly chosen, and the comment in catalogs.sgml isn't self-explanatory...)
rather than adopting the value shown in the WAL record.  This probably
is best done by abandoning the generic overwrite tuple WAL record type
in favor of something specific to minxid updates.  The effect would then
be that a PITR slave would not truncate its clog beyond the freeze
horizon until it had performed a vacuum of its own.
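
(Concretely, the record I'm imagining would carry little more than

	Oid		relid;		/* pg_class row being updated; illustrative only */
	TransactionId	minxid;		/* new relminxid */

and redo would set both relminxid and relvacuumxid to that value, so a slave
only ever allows clog to be truncated below the oldest xid that can still
appear in the table, which is safe even without the hint bits.)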

The point about aborted xmax being a risk factor is a good one.  I don't
think the risk is material for ordinary crash recovery scenarios,
because ordinarily we'd have many opportunities to set the hint bit
before anything really breaks, but it's definitely an issue for
long-term PITR replay scenarios.

I'll work on this as soon as I get done with the btree-index issue I'm
messing with now.

regards, tom lane



[HACKERS] Index greater than 8k

2006-10-30 Thread Joshua D. Drake
Hello,

I recently posted about a word being too long with Tsearch2. That isn't
actually the problem I am trying to solve (thanks for the feedback
though, now I understand it).

The problem I am after is the 8k index size issue. It is very easy to
get a GIST index (especially when using tsearch2) that is larger than that.

Is recompiling with a larger block size the option there?
What are the downsides, except for the custom build?
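
(As far as I can tell the custom build itself is just one define in
src/include/pg_config_manual.h - assuming I'm looking at the right place:

#define BLCKSZ	16384	/* example: double the default 8192; power of 2, max 32768 */

followed by a full rebuild and a fresh initdb, since the on-disk page format
changes.)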

Joshua D. Drake

-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




Re: [HACKERS] Deadlock with pg_dump?

2006-10-30 Thread Simon Riggs
On Thu, 2006-10-26 at 18:45 -0400, Tom Lane wrote:
 Chris Campbell [EMAIL PROTECTED] writes:
  Is there additional logging information I can turn on to get more  
  details? I guess I need to see exactly what locks both processes  
  hold, and what queries they were running when the deadlock occurred?  
  Is that easily done, without turning on logging for *all* statements?
 
 log_min_error_statement = error would at least get you the statements
 reporting the deadlocks, though not what they're conflicting against.

Yeh, we need a much better locking logger for performance analysis.

We really need to dump the whole wait-for graph for deadlocks, since
this might be more complex than just two statements involved. Deadlocks
ought to be so infrequent that we can afford the log space to do this -
plus if we did this it would likely lead to fewer deadlocks.

For 8.3 I'd like to have a log_min_duration_lockwait (secs) parameter
that would allow you to dump the wait-for graph for any data-level locks
that wait too long, rather than just those that deadlock. Many
applications experience heavy locking because of lack of holistic
design. That would also highlight the need for other utilities to be able
to act CONCURRENTLY, if possible.

 [ Memo to hackers: why is it that log_min_error_statement = error
 isn't the default? ]

For production systems where we expect fewer ERRORs, each one is
important, so this would be a good default.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Simon Riggs
On Mon, 2006-10-30 at 12:05 -0500, Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Ugh.  Is there another solution to this?  Say, sync the buffer so that
  the hint bits are written to disk?
 
 Yeah.  The original design for all this is explained by the notes for
 TruncateCLOG:
 
  * When this is called, we know that the database logically contains no
  * reference to transaction IDs older than oldestXact.  However, we must
  * not truncate the CLOG until we have performed a checkpoint, to ensure
  * that no such references remain on disk either; else a crash just after
  * the truncation might leave us with a problem.
 
 The pre-8.2 coding is actually perfectly safe within a single database,
 because TruncateCLOG is only called at the end of a database-wide
 vacuum, and so the checkpoint is guaranteed to have flushed valid hint
 bits for all tuples to disk.  There is a risk in other databases though.
 I think that in the 8.2 structure the equivalent notion must be that
 VACUUM has to flush and fsync a table before it can advance the table's
 relminxid.

Ouch! We did discuss that also. Flushing the buffercache is nasty with
very large caches, so this makes autovacuum much less friendly - and
could take a seriously long time if you enforce the vacuum delay
costings.

ISTM we only need to flush iff the clog would be truncated when we
update relminxid. Otherwise we are safe to update even if we crash,
since the clog will not have been truncated. 

 That still leaves us with the problem of hint bits not being updated
 during WAL replay.  I think the best solution for this is for WAL replay
 to force relvacuumxid to equal relminxid (btw, these field names seem
 poorly chosen, and the comment in catalogs.sgml isn't self-explanatory...)
 rather than adopting the value shown in the WAL record.  This probably
 is best done by abandoning the generic overwrite tuple WAL record type
 in favor of something specific to minxid updates.  The effect would then
 be that a PITR slave would not truncate its clog beyond the freeze
 horizon until it had performed a vacuum of its own.

Sounds good. Methinks we do still need the TruncateCLOG patch to ensure
we do WAL replay for the truncation? I'm posting that now to -patches as
a prototype.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/access/transam/clog.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/clog.c,v
retrieving revision 1.40
diff -c -r1.40 clog.c
*** src/backend/access/transam/clog.c	4 Oct 2006 00:29:49 -0000	1.40
--- src/backend/access/transam/clog.c	30 Oct 2006 14:32:14 -0000
***
*** 68,74 
  
  static int	ZeroCLOGPage(int pageno, bool writeXlog);
  static bool CLOGPagePrecedes(int page1, int page2);
! static void WriteZeroPageXlogRec(int pageno);
  
  
  /*
--- 68,74 
  
  static int	ZeroCLOGPage(int pageno, bool writeXlog);
  static bool CLOGPagePrecedes(int page1, int page2);
! static void WriteClogXlogRec(int pageno, int rectype);
  
  
  /*
***
*** 198,204 
  	slotno = SimpleLruZeroPage(ClogCtl, pageno);
  
  	if (writeXlog)
! 		WriteZeroPageXlogRec(pageno);
  
  	return slotno;
  }
--- 198,204 
  	slotno = SimpleLruZeroPage(ClogCtl, pageno);
  
  	if (writeXlog)
! 		WriteClogXlogRec(pageno, CLOG_ZEROPAGE);
  
  	return slotno;
  }
***
*** 338,343 
--- 338,345 
  	/* Perform a CHECKPOINT */
  	RequestCheckpoint(true, false);
  
+ 	WriteClogXlogRec(cutoffPage, CLOG_TRUNCATE);
+ 
  	/* Now we can remove the old CLOG segment(s) */
  	SimpleLruTruncate(ClogCtl, cutoffPage);
  }
***
*** 375,389 
   * (Besides which, this is normally done just before entering a transaction.)
   */
  static void
! WriteZeroPageXlogRec(int pageno)
  {
  	XLogRecData rdata;
  
  	rdata.data = (char *) (&pageno);
  	rdata.len = sizeof(int);
  	rdata.buffer = InvalidBuffer;
  	rdata.next = NULL;
! 	(void) XLogInsert(RM_CLOG_ID, CLOG_ZEROPAGE | XLOG_NO_TRAN, &rdata);
  }
  
  /*
--- 377,393 
   * (Besides which, this is normally done just before entering a transaction.)
   */
  static void
! WriteClogXlogRec(int pageno, int rectype)
  {
  	XLogRecData rdata;
  
+ 	Assert(rectype == CLOG_ZEROPAGE || rectype == CLOG_TRUNCATE);
+ 
  	rdata.data = (char *) (&pageno);
  	rdata.len = sizeof(int);
  	rdata.buffer = InvalidBuffer;
  	rdata.next = NULL;
! 	(void) XLogInsert(RM_CLOG_ID, rectype | XLOG_NO_TRAN, &rdata);
  }
  
  /*
***
*** 409,414 
--- 413,432 
  
  		LWLockRelease(CLogControlLock);
  	}
+ 	else if (info == CLOG_TRUNCATE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ 
+ 		LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
+ 
+ 		SimpleLruTruncate(ClogCtl, pageno);
+ 
+ 		LWLockRelease(CLogControlLock);
+ 	}
+ 	else
+ 		elog(PANIC, "clog_redo: unknown op code %u", info);
  }
  
  void

Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Alvaro Herrera
Simon Riggs wrote:
 On Mon, 2006-10-30 at 12:05 -0500, Tom Lane wrote:
  Alvaro Herrera [EMAIL PROTECTED] writes:
   Ugh.  Is there another solution to this?  Say, sync the buffer so that
   the hint bits are written to disk?
  
  Yeah.  The original design for all this is explained by the notes for
  TruncateCLOG:
  
   * When this is called, we know that the database logically contains no
   * reference to transaction IDs older than oldestXact.  However, we must
   * not truncate the CLOG until we have performed a checkpoint, to ensure
   * that no such references remain on disk either; else a crash just after
   * the truncation might leave us with a problem.
  
  The pre-8.2 coding is actually perfectly safe within a single database,
  because TruncateCLOG is only called at the end of a database-wide
  vacuum, and so the checkpoint is guaranteed to have flushed valid hint
  bits for all tuples to disk.  There is a risk in other databases though.
  I think that in the 8.2 structure the equivalent notion must be that
  VACUUM has to flush and fsync a table before it can advance the table's
  relminxid.
 
 Ouch! We did discuss that also. Flushing the buffercache is nasty with
 very large caches, so this makes autovacuum much less friendly - and
 could take a seriously long time if you enforce the vacuum delay
 costings.
 
 ISTM we only need to flush iff the clog would be truncated when we
 update relminxid. Otherwise we are safe to update even if we crash,
 since the clog will not have been truncated. 

I don't understand.  When clog is actually going to be truncated, if
it's determined that there's any page that can be truncated, then a
checkpoint is forced.  If no page is going to be removed then there's no
checkpoint, which makes a lot of sense and of course avoids the problem
of useless flushes.

In fact I don't understand what's the point about multiple databases vs.
a single database.  Surely a checkpoint would flush all buffers in all
databases, no?  This would flush all hint bits, everywhere.  So this bug
does not really exist.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 ISTM we only need to flush iff the clog would be truncated when we
 update relminxid.

Wrong :-(  If the relvacuumxid change (not relminxid ... as I said, these
names aren't very transparent) makes it to disk but not all the hint
bits do, you're at risk.  Crash, restart, vacuum some other table, and
*now* the global min vacuumxid advances.  The fact that we're
WAL-logging the relvacuumxid change makes this scenario exceedingly
probable, if no action is taken to force out the hint bits.

The only alternative I can see is the one Heikki suggested: don't
truncate clog until the freeze horizon.  That's safe (given the planned
change to WAL-log tuple freezing) and clean and simple, but a permanent
requirement of 250MB+ for pg_clog would put the final nail in the coffin
of PG's usability in small-disk-footprint environments.  So I don't like
it much.  I suppose it could be made more tolerable by reducing the
freeze horizon, say to 100M instead of 1G transactions.  Anyone for a
GUC parameter?  In a high-volume DB you'd want the larger setting to
minimize the amount of tuple freezing work.  OTOH it seems like making
this configurable creates a nasty risk for PITR situations: a slave
that's configured with a smaller freeze window than the master is
probably not safe.

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 In fact I don't understand what's the point about multiple databases vs.
 a single database.  Surely a checkpoint would flush all buffers in all
 databases, no?

Yeah --- all the ones that are dirty *now*.  Consider the case where you
vacuum DB X, update its datvacuumxid, and don't checkpoint because the
global min didn't advance.  Now you crash, possibly leaving some hint
bits unwritten; but the datvacuumxid change did make it to disk.  After
restart, vacuum DB Y, update its datvacuumxid, and find that the global
min *did* advance.  You checkpoint, and that guarantees that DB Y is
clean for the clog truncation.  But DB X isn't.

The 8.2 changes have created the equivalent risk at the level of each
individual table.  We can't write a vacuumxid change unless we are sure
that the hint-bit changes it promises are actually down to disk.

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Simon Riggs
On Mon, 2006-10-30 at 16:58 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  ISTM we only need to flush iff the clog would be truncated when we
  update relminxid.
 
 Wrong :-(  If the relvacuumxid change (not relminxid ... as I said, these
 names aren't very transparent) makes it to disk but not all the hint
 bits do, you're at risk.  Crash, restart, vacuum some other table, and
 *now* the global min vacuumxid advances.  The fact that we're
 WAL-logging the relvacuumxid change makes this scenario exceedingly
 probable, if no action is taken to force out the hint bits.

I don't agree: If the truncation points are at 1 million, 2 million etc,
then if we advance the relvacuumxid from 1.2 million to 1.5 million,
then crash, the hint bits for that last vacuum are lost. Sounds bad,
but we have not truncated clog, so there is no danger. In order to
truncate up to 2 million we need to re-vacuum; at that point we discover
that the 1.5 million setting was wrong, realise it should have been 1.2
million but don't care because we now set it to 1.8 million etc. No
problem, even with repeated crashes. We only flush when we move the
counter past a truncation point.

If you look at this another way, maybe you'll see what I'm saying: Only
update relvacuumxid iff the update would allow us to truncate the clog.
That way we leap forwards in 1 million Xid chunks, rounded down. No
change to clog = no update = no danger that we need to flush to avoid.

 The only alternative I can see is the one Heikki suggested: don't
 truncate clog until the freeze horizon.  That's safe (given the planned
 change to WAL-log tuple freezing) and clean and simple, but a permanent
 requirement of 250MB+ for pg_clog would put the final nail in the coffin
 of PG's usability in small-disk-footprint environments.  So I don't like
 it much.  I suppose it could be made more tolerable by reducing the
 freeze horizon, say to 100M instead of 1G transactions.  Anyone for a
 GUC parameter?  In a high-volume DB you'd want the larger setting to
 minimize the amount of tuple freezing work.  OTOH it seems like making
 this configurable creates a nasty risk for PITR situations: a slave
 that's configured with a smaller freeze window than the master is
 probably not safe.

If we need to, just put the CLOG seg size in pg_config_manual.h

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 I don't agree: If the truncation points are at 1 million, 2 million etc,
 then if we advance the relvacuumxid from 1.2 million to 1.5 million,
 then crash, the hint bits for that last vacuum are lost. Sounds bad,
 but we have not truncated clog, so there is no danger.

You're still wrong though.  Suppose that VACUUM moves a particular rel's
relvacuumxid from 1.9 to 2.1 million, but because this rel is not
currently the oldest vacuumxid, it doesn't truncate clog.  Then we crash
and lose hint bits, but not the relvacuumxid change.  Then VACUUM
vacuums some other rel and advances its relvacuumxid from 1.9 to 2.1
million --- but this time that *was* the globally oldest value, and now
we think we can truncate clog at 2 million.  But the first rel might
still have some unhinted xids around 1.9 million.

 If you look at this another way, maybe you'll see what I'm saying: Only
 update relvacuumxid iff the update would allow us to truncate the clog.

Then you'll never update it at all, because there will always be some
other rel constraining the global min.

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Simon Riggs
On Mon, 2006-10-30 at 19:18 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I don't agree: If the truncation points are at 1 million, 2 million etc,
  then if we advance the relvacuumxid from 1.2 million to 1.5 million,
  then crash, the hint bits for that last vacuum are lost. Sounds bad,
  but we have not truncated clog, so there is no danger.
 
 You're still wrong though. 

Frequently, I'd say :-)

  Suppose that VACUUM moves a particular rel's
 relvacuumxid from 1.9 to 2.1 million, but because this rel is not
 currently the oldest vacuumxid, it doesn't truncate clog.  Then we crash
 and lose hint bits, but not the relvacuumxid change.  Then VACUUM
 vacuums some other rel and advances its relvacuumxid from 1.9 to 2.1
 million --- but this time that *was* the globally oldest value, and now
 we think we can truncate clog at 2 million.  But the first rel might
 still have some unhinted xids around 1.9 million.

That was understood; in the above example I agree you need to flush. If
you don't pass a truncation point, you don't need to flush whether or
not you actually truncate. So we don't need to flush *every* time, so
IMHO we don't need to play safe and keep clog the size of an iceberg.

Anyway, if PITR is safe again, I'd like to sleep... zzz

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 That was understood; in the above example I agree you need to flush. If
 you don't pass a truncation point, you don't need to flush whether or
 not you actually truncate. So we don't need to flush *every* time,

OK, but does that actually do much of anything for your performance
complaint?  Just after GlobalXmin has passed a truncation point, *every*
vacuum the system does will start performing a flush-n-fsync, which
seems like exactly what you didn't like.  If the syncs were spread out
in time for different rels then maybe this idea would help, but AFAICS
they won't be.

regards, tom lane



Re: [HACKERS] [PATCHES] --single-transaction doc clarification

2006-10-30 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 On Tue, 2006-10-31 at 01:07 +, Simon Riggs wrote:
 As requested.

 Applied, thanks for the patch.

This patch converted a correct statement into a lie: there is not
anything that will cause begin/commit in a script file to fail just
because you wrapped begin/commit around them.  I rewrote the text to

If the script itself uses <command>BEGIN</command>, <command>COMMIT</command>,
or <command>ROLLBACK</command>, this option will not have the desired
effects.
Also, if the script contains any command that cannot be executed
inside a transaction block, specifying this option will cause that
command (and hence the whole transaction) to fail.
 
regards, tom lane



Re: [HACKERS] [SQL] Case Preservation disregarding case sensitivity?

2006-10-30 Thread beau hargis
On Friday 27 October 2006 19:38, Joe wrote:
 Hi Beau,

 On Fri, 2006-10-27 at 16:23 -0700, beau hargis wrote:
  I am hoping that there is an easy way to obtain case-preservation with
  case-insensitivity, or at the very least, case-preservation and complete
  case-sensitivity, or case-preservation and a consistent case-conversion
  strategy.
 
  The case of the column names need to be preserved because that is the way
  the schema is designed and most importantly (VERY, VERY IMPORTANT),
  column names are used in apps as hash values, or as named references
  which are case sensitive and as such need to be delivered to the client
  exactly in the manner specified at the time of table creation.
 

 I went through the same issue in my conversion from MySQL to Postgres
 and (since I had a small application) I ended up changing all my
 tables and columns from UserProfile to user_profile.

 I'm afraid however, that it's MySQL that is the odd man out.  I haven't
 researched this completely but I believe PG follows either the FIPS-127
 or SQL-92 standard with respect to what are called delimited
 identifiers.  Basically, this says if you want case sensitivity in
 identifier names, you have to use double quotes wherever you refer to
 the identifier.  Without the double quotes, the SQL implementor can
 either use UPPERCASE (as I believe Oracle and DB2 do) or lowercase (as
 PG does) when it displays those identifiers.

 Joe

Again, I am at the same point I was at when searching and searching for 
information on the problem, and I am not sure what the SQL standard has to 
say about it: I do not, and I don't think that anyone else who is struggling 
to deal with the problem does, care about case-sensitivity. I am interested 
in case preservation of column names. I do, indeed, want identifiers treated 
in a case insensitive way, but I want the case PRESERVED in the table 
definitions and I want that case, as preserved, to be reflected in the field 
names as returned by the server to any client library that connects and 
initiates a query. 

Case-preservation is not the same as case-sensitivity; nor is 
case-normalization the same as case-insensitivity. What PostgreSQL is doing 
is converting any and all identifiers to lower case and then matching 
those against the identifiers (as stored in the table definition) in a 
case-sensitive manner. It 'normalizes' the case of the identifiers so that it 
has a common internal representation; the desires of the programmer and 
database architect be damned. 

Referenced specification details:
From FIPS-127:
===
3. Delimited identifiers. In the previous ANSI SQL specification, it was not 
possible for an application to specify identifiers with spaces or other 
special symbols. Also, it was not possible to protect against future assaults 
on the name space for (identifier) by additions to the (reserved word) list. 
The new facility for (delimited identifier) allows a user to enclose all 
identifiers in double-quotation marks, thereby ensuring that the name defined 
or referenced may contain spaces or other special symbols and will not be 
impacted by future additions to the (reserved word) list.
===

From SQL-92/Sec. 5.2:
===
10)The identifier body of a regular identifier is equivalent
to an identifier body in which every letter that is a lower-
case letter is replaced by the equivalent upper-case letter
or letters. This treatment includes determination of equiva-
lence, representation in the Information and Definition Schemas,
representation in the diagnostics area, and similar uses.

11)The identifier body of a regular identifier (with every
letter that is a lower-case letter replaced by the equivalent
upper-case letter or letters), treated as the repetition of
a character string literal that specifies a character set
specification of SQL_TEXT, shall not be equal, according to
the comparison rules in Subclause 8.2, comparison predicate,
to any reserved word (with every letter that is a lower-case
letter replaced by the equivalent upper-case letter or letters),
treated as the repetition of a character string literal that
specifies a character set specification of SQL_TEXT.

12)Two regular identifiers are equivalent if their identifier
bodys, considered as the repetition of a character string
literal that specifies a character set specification of
SQL_TEXT, compare equally according to the comparison rules
in Subclause 8.2, comparison predicate.

13)A regular identifier and a delimited identifier are equiva-
lent if the identifier body of the regular identifier (with
every letter that is a lower-case letter replaced by the equiva-
lent upper-case letter or letters) and the delimited identifier
body of the delimited identifier (with all occurrences of
quote replaced by quote symbol and all occurrences of dou-
blequote symbol replaced by double quote), considered as
the 

Re: [HACKERS] [SQL] Case Preservation disregarding case sensitivity?

2006-10-30 Thread Tom Lane
beau hargis [EMAIL PROTECTED] writes:
 Considering the differences that already exist between database systems and 
 their varying compliance with SQL and the various extensions that have been 
 created, I do not consider that the preservation of case for identifiers 
 would violate any SQL standard.

That's not how I read the spec.  It is true that we are not 100% spec
compliant, but that isn't a good argument for moving further away from
spec.  Not to mention breaking backwards compatibility with our
historical behavior.  The change you propose would fix your application
at the cost of breaking other people's applications.   Perhaps you
should consider fixing your app instead.

regards, tom lane



Re: [HACKERS] [GENERAL] Index greater than 8k

2006-10-30 Thread Teodor Sigaev

The problem I am after is the 8k index size issue. It is very easy to
get a GIST index (especially when using tsearch2) that is larger than that.

Hmm, the tsearch2 GiST index is specifically designed to support huge index
entries: first, every lexeme in the tsvector is transformed into a hash value
(with the help of crc32); second, all position information is stripped; third,
if the size of the array is still greater than TOAST_INDEX_TARGET, tsearch2
makes a bit signature of the tsvector. The signature's length is fixed and
equal to 252 bytes by default (+ 8 bytes for the datum header). All values on
internal pages are represented as signatures.
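
Just to illustrate the idea (a simplified sketch, not the actual tsearch2
code):

#include <string.h>

#define SIGLEN		252			/* default signature size, in bytes */
#define SIGLENBIT	(SIGLEN * 8)

typedef unsigned char BITVEC[SIGLEN];

/* sketch: fold a set of crc32 lexeme hashes into a fixed-size bit signature */
static void
hashes_to_signature(const unsigned int *hashes, int nhashes, BITVEC sign)
{
	int	i;

	memset(sign, 0, SIGLEN);
	for (i = 0; i < nhashes; i++)
	{
		unsigned int	bitpos = hashes[i] % SIGLENBIT;

		sign[bitpos / 8] |= (unsigned char) (1 << (bitpos % 8));
	}
}

However long the original tsvector is, the index entry built this way is never
bigger than SIGLEN bytes.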


So tsearch2 guarantees that an index entry will be small enough. If that's not 
true, then there is a bug - please put together a test case demonstrating the 
problem.



Is recompiling with a larger block size the option there?
What are the downsides, except for the custom build?


Can you send the exact error message?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] [SQL] Case Preservation disregarding case

2006-10-30 Thread Chuck McDevitt
At Teradata, we certainly interpreted the spec to allow case-preserving,
but case-insensitive, identifiers.
Users really liked it that way:  If you re-created a CREATE TABLE
statement from the catalog, you could get back exactly the case the user
had entered, but people using the table didn't need to worry about case.
And column titles in reports would carry the nicely cased names.
Sort of like how Windows systems treat file names... The case is
preserved, but you don't need to know it to access the file.

I know UNIX users usually think of case-preserving with case-insensitive
as a foreign concept, but that doesn't mean the average user feels the
same.

If I want my column named WeeklyTotalSales, I really don't want to
have to always quote it and type in the exact case.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
Sent: Monday, October 30, 2006 7:24 PM
To: beau hargis
Cc: pgsql-sql@postgresql.org; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [SQL] Case Preservation disregarding case
sensitivity?

beau hargis [EMAIL PROTECTED] writes:
 Considering the differences that already exist between database
systems and 
 their varying compliance with SQL and the various extensions that have
been 
 created, I do not consider that the preservation of case for
identifiers 
 would violate any SQL standard.

That's not how I read the spec.  It is true that we are not 100% spec
compliant, but that isn't a good argument for moving further away from
spec.  Not to mention breaking backwards compatibility with our
historical behavior.  The change you propose would fix your application
at the cost of breaking other people's applications.   Perhaps you
should consider fixing your app instead.

regards, tom lane






Re: [HACKERS] [SQL] Case Preservation disregarding case sensitivity?

2006-10-30 Thread Tom Lane
Chuck McDevitt [EMAIL PROTECTED] writes:
 At Teradata, we certainly interpreted the spec to allow case-preserving,
 but case-insensitive, identifiers.

Really?

As I see it, the controlling parts of the SQL spec are (SQL99 sec 5.2)

26) A regular identifier and a delimited identifier are
equivalent if the identifier body of the regular identifier
(with every letter that is a lower-case letter replaced by the
corresponding upper-case letter or letters) and the delimited
identifier body of the delimited identifier (with all
occurrences of quote replaced by quote symbol and all
occurrences of doublequote symbol replaced by double quote),
considered as the repetition of a character string literal
that specifies a character set specification of SQL_IDENTIFIER
and an implementation-defined collation that is sensitive to
case, compare equally according to the comparison rules in
Subclause 8.2, comparison predicate.

27) Two delimited identifiers are equivalent if their delimited
identifier bodys, considered as the repetition of a character
string literal that specifies a character set specification
of SQL_IDENTIFIER and an implementation-defined collation
that is sensitive to case, compare equally according to the
comparison rules in Subclause 8.2, comparison predicate.

Note well the "sensitive to case" bits there.  Now consider

CREATE TABLE tab (
	"foobar" int,
	"FooBar" timestamp,
	"FOOBAR" varchar(3)
);

We can *not* reject this as containing duplicate column names, else we
have certainly violated rule 27.  Now what will you do with

SELECT fooBar FROM tab;

?  The spec is unquestionably on the side of "you selected the varchar
column"; historical Postgres practice is on the side of "you selected
the int column".  AFAICS a case-insensitive approach would have to
fail with some "I can't identify which column you mean" error.  I am
interested to see where you find support for that in the spec...

regards, tom lane
