Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-18 Thread Jacky Leng
 You need to set $PGDATA before running the script. And psql,pg_ctl and
 pg_resetxlog need to be in $PATH. After running the script, restart
 postmaster and run SELECT * FROM t2. There should be one row in the
 table, but it's empty.

I've tried this script, and surprisingly found that T2 is not empty, just
as it should be.
Then I looked at how CLUSTER is done, and found that





Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-18 Thread Jacky Leng
Sorry, I sent the previous mail by mistake.

 You need to set $PGDATA before running the script. And psql,pg_ctl and
 pg_resetxlog need to be in $PATH. After running the script, restart
 postmaster and run SELECT * FROM t2. There should be one row in the
 table, but it's empty.

I've tried this script on postgres (PostgreSQL) 8.3devel, and found that
T2 is not empty after recovery (just as it should be), but the latest
version behaves just as you described.

Then I looked at how CLUSTER is done, and found that:
In postgres (PostgreSQL) 8.3devel, unlike AlterTableSetTablespace (which
copies the whole relation block-by-block, and doesn't use WAL in
non-archiving mode), CLUSTER copies the relation row-by-row
(simple_heap_insert), which always writes WAL regardless of the archiving
mode. Since the WAL records exist, recovery handles everything correctly.
The latest version acts differently, probably because it skips WAL-logging
for CLUSTER in non-archiving mode.
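
Roughly, the difference looks like this in code. This is only a simplified
sketch written from memory, not the actual source: the two helper names are
invented, the smgr calls follow the 8.3-era API as far as I recall, and the
WAL record construction is reduced to a comment.

/* Simplified sketch, not the real copy_relation_data()/CLUSTER code. */
#include "postgres.h"
#include "access/heapam.h"
#include "access/xlog.h"
#include "storage/smgr.h"
#include "utils/rel.h"
#include "utils/tqual.h"

/* Block-by-block copy, as ALTER TABLE ... SET TABLESPACE does (invented helper name) */
static void
copy_block_by_block(Relation rel, SMgrRelation src, SMgrRelation dst)
{
    char       *buf = palloc(BLCKSZ);
    BlockNumber nblocks = smgrnblocks(src);
    BlockNumber blkno;
    bool        use_wal = XLogArchivingActive() && !rel->rd_istemp;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        smgrread(src, blkno, buf);
        if (use_wal)
        {
            /* emit an XLOG_HEAP_NEWPAGE record carrying this page image */
        }
        smgrwrite(dst, blkno, buf, rel->rd_istemp);
    }

    /*
     * Without WAL, the only crash protection is an immediate sync of the
     * new file; if this relfilenode was recently freed and replay
     * re-applies the drop, the synced data is lost.
     */
    if (!use_wal)
        smgrimmedsync(dst);

    pfree(buf);
}

/* Row-by-row copy, as CLUSTER does via simple_heap_insert() (invented helper name) */
static void
copy_row_by_row(Relation oldrel, Relation newrel)
{
    HeapScanDesc scan = heap_beginscan(oldrel, SnapshotNow, 0, NULL);
    HeapTuple    tuple;

    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
        simple_heap_insert(newrel, tuple);      /* always WAL-logged */

    heap_endscan(scan);
}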

So the conclusion is: we can replace the WAL mechanism with smgrimmedsync
only if relfilenodes are never allowed to be reused; but that isn't the
case, so we should keep the WAL-logging.

Is it right? 




Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Heikki Linnakangas
Simon Riggs wrote:
 On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
 Second, suppose that no checkpoint has occured during the upper 
 series--authough not quite possible; 
 
 That part is irrelevant. It's forced out to disk and doesn't need
 recovery, with or without the checkpoint.
 
 There's no hole that I can see.

No, Jacky is right. The same problem exists at least with CLUSTER, and I
think there's other commands that rely on immediate fsync as well.

Attached is a shell script that demonstrates the problem on CVS HEAD
with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
happens to get the same relfilenode that T1 had. Then we crash the
server, forcing a WAL replay. After that, T2 is empty. Oops.

Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Jacky Leng

 On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
 Second, suppose that no checkpoint has occured during the upper
 series--authough not quite possible;

 That part is irrelevant. It's forced out to disk and doesn't need
 recovery, with or without the checkpoint.

 There's no hole that I can see.

Yes, it's really forced out.
But if there's no checkpoint, recovery will begin from a point before
T1 was created, and since T1 was dropped, replaying the drop will
remove T2's (reused) file!

 -- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Heikki Linnakangas
Forgot to attach the script I promised..

You need to set $PGDATA before running the script. And psql,pg_ctl and
pg_resetxlog need to be in $PATH. After running the script, restart
postmaster and run SELECT * FROM t2. There should be one row in the
table, but it's empty.

Heikki Linnakangas wrote:
 Simon Riggs wrote:
 On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
 Second, suppose that no checkpoint has occured during the upper 
 series--authough not quite possible; 
 That part is irrelevant. It's forced out to disk and doesn't need
 recovery, with or without the checkpoint.

 There's no hole that I can see.
 
 No, Jacky is right. The same problem exists at least with CLUSTER, and I
 think there's other commands that rely on immediate fsync as well.
 
 Attached is a shell script that demonstrates the problem on CVS HEAD
 with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
 T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
 happens to get the same relfilenode that T1 had. Then we crash the
 server, forcing a WAL replay. After that, T2 is empty. Oops.
 
 Unfortunately I don't see any easy way to fix it. One approach would be
 to avoid reusing the relfilenodes until next checkpoint, but I don't see
 any nice place to keep track of OIDs that have been dropped since last
 checkpoint.
 


-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


cluster-relfilenode-clash.sh.gz
Description: GNU Zip compressed data



Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Heikki Linnakangas
I wrote:
 Unfortunately I don't see any easy way to fix it. One approach would be
 to avoid reusing the relfilenodes until next checkpoint, but I don't see
 any nice place to keep track of OIDs that have been dropped since last
 checkpoint.

Ok, here's one idea:

Instead of deleting the file immediately on commit of DROP TABLE, the
file is truncated to release the space, but not unlink()ed, to avoid
reusing that relfilenode. The truncated file can be deleted after next
checkpoint.

Now, how does checkpoint know what to delete? We can use the fsync
request mechanism for that. When a file is truncated, a new kind of
fsync request, a deletion request, is sent to the bgwriter, which
collects all such requests to a list. Before checkpoint calculates new
RedoRecPtr, the list is swapped with an empty one, and after writing the
new checkpoint record, all the files that were in the list are deleted.
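
To make the ordering concrete, here's a rough sketch of how the bgwriter
side could look. Everything in it is invented for illustration
(PendingUnlink, RememberUnlinkRequest, UnlinkAfterCheckpoint); only
RelFileNode and relpath() are borrowed from the real tree.

/* Illustrative sketch only; none of these functions exist in the tree. */
#include "postgres.h"
#include "catalog/catalog.h"            /* relpath() */
#include "storage/relfilenode.h"
#include <unistd.h>

typedef struct PendingUnlink
{
    RelFileNode rnode;                  /* file to unlink after the next checkpoint */
    struct PendingUnlink *next;
} PendingUnlink;

static PendingUnlink *pendingUnlinks = NULL;

/* Called when the bgwriter absorbs a "deletion request" that a backend
 * queued (via the fsync-request mechanism) after truncating the file. */
static void
RememberUnlinkRequest(RelFileNode rnode)
{
    PendingUnlink *entry = palloc(sizeof(PendingUnlink));

    entry->rnode = rnode;
    entry->next = pendingUnlinks;
    pendingUnlinks = entry;
}

/* Checkpoint sequence, simplified:
 *   1. detach the current list *before* computing the new RedoRecPtr,
 *      so later requests wait for the following checkpoint;
 *   2. write and flush the checkpoint record;
 *   3. only then unlink the files captured in step 1.
 */
static void
UnlinkAfterCheckpoint(PendingUnlink *captured)
{
    while (captured != NULL)
    {
        PendingUnlink *next = captured->next;
        char       *path = relpath(captured->rnode);

        unlink(path);
        pfree(path);
        pfree(captured);
        captured = next;
    }
}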

We would leak empty files on crashes, but we leak files on crashes
anyway, so that shouldn't be an issue. This scheme wouldn't require
catalog changes, so it would be suitable for backpatching.

Any better ideas?

Do we care enough about this to fix this? Enough to backpatch? The
probability of this happening is pretty small, but the consequences are
really bad, so my vote is yes and yes.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Florian G. Pflug

Heikki Linnakangas wrote:

I wrote:

Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.


Ok, here's one idea:

Instead of deleting the file immediately on commit of DROP TABLE, the
file is truncated to release the space, but not unlink()ed, to avoid
reusing that relfilenode. The truncated file can be deleted after next
checkpoint.

Now, how does checkpoint know what to delete? We can use the fsync
request mechanism for that. When a file is truncated, a new kind of
fsync request, a deletion request, is sent to the bgwriter, which
collects all such requests to a list. Before checkpoint calculates new
RedoRecPtr, the list is swapped with an empty one, and after writing the
new checkpoint record, all the files that were in the list are deleted.

We would leak empty files on crashes, but we leak files on crashes
anyway, so that shouldn't be an issue. This scheme wouldn't require
catalog changes, so it would be suitable for backpatching.

Any better ideas?
Couldn't we fix this by forcing a checkpoint before we commit the transaction 
that created the new pg_class entry for the clustered table? Or rather, more 
generally, before committing a transaction that created a new non-temporary 
relfilenode but didn't WAL-log any subsequent inserts.


That's of course a rather sledgehammer-like approach to this problem, but
at least for the back branches the fix would be less intrusive...


regards, Florian Pflug




Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Simon Riggs
On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
  Second, suppose that no checkpoint has occured during the upper 
  series--authough not quite possible; 
  
  That part is irrelevant. It's forced out to disk and doesn't need
  recovery, with or without the checkpoint.
  
  There's no hole that I can see.
 
 No, Jacky is right. The same problem exists at least with CLUSTER, and I
 think there's other commands that rely on immediate fsync as well.
 
 Attached is a shell script that demonstrates the problem on CVS HEAD
 with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
 T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
 happens to get the same relfilenode that T1 had. Then we crash the
 server, forcing a WAL replay. After that, T2 is empty. Oops.
 
 Unfortunately I don't see any easy way to fix it. 

So, what you are saying is that re-using relfilenodes can cause problems
during recovery in any command that alters the relfilenode of a
relation?

If you've got a better problem statement it would be good to get that
right first before we discuss solutions.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Florian G. Pflug

Simon Riggs wrote:

On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
Second, suppose that no checkpoint has occured during the upper 
series--authough not quite possible;
That part is irrelevant. It's forced out to disk and doesn't need 
recovery, with or without the checkpoint.


There's no hole that I can see.
No, Jacky is right. The same problem exists at least with CLUSTER, and I 
think there's other commands that rely on immediate fsync as well.


Attached is a shell script that demonstrates the problem on CVS HEAD with
CLUSTER. It creates two tables, T1 and T2, both with one row. Then T1 is
dropped, and T2 is CLUSTERed, so that the new T2 relation file happens to
get the same relfilenode that T1 had. Then we crash the server, forcing a
WAL replay. After that, T2 is empty. Oops.

Unfortunately I don't see any easy way to fix it.


So, what you are saying is that re-using relfilenodes can cause problems 
during recovery in any command that alters the relfilenode of a relation?


From what I understand, I'd say that creating a relfilenode *and* subsequently
inserting data without WAL-logging causes the problem. If the relfilenode was
recently deleted, the inserts might be effectively undone upon recovery (because
we first replay the delete), but later *not* redone (because we didn't WAL-log
the inserts).

That brings me to another idea for a fix that is less heavyweight than my
previous checkpoint-before-commit suggestion.

We could make relfilenodes globally unique if we added the xid and epoch of the
creating transaction to the filename. Those are 64 bits, so if we encode them
in base 36 (using A-Z,0-9), that'd increase the length of the filenames by 13 characters.
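
A quick self-contained sketch of such an encoding, purely illustrative (the
helper name, the digit alphabet order and the fixed-width handling are all
made up here):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: encode a 64-bit (epoch << 32 | xid) value in base 36.
 * Since 36^12 < 2^64 <= 36^13, a fixed width of 13 characters suffices. */
static void
encode_base36(uint64_t value, char out[14])
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    int         i;

    for (i = 12; i >= 0; i--)
    {
        out[i] = digits[value % 36];
        value /= 36;
    }
    out[13] = '\0';
}

int
main(void)
{
    char        suffix[14];
    uint64_t    epoch = 0;
    uint64_t    xid = 12345;            /* example values only */

    encode_base36((epoch << 32) | xid, suffix);
    printf("%s\n", suffix);             /* prints "00000000009IX" for this example */
    return 0;
}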

regards, Florian Pflug




Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Heikki Linnakangas
Florian G. Pflug wrote:
 Heikki Linnakangas wrote:
 I wrote:
 Unfortunately I don't see any easy way to fix it. One approach would be
 to avoid reusing the relfilenodes until next checkpoint, but I don't see
 any nice place to keep track of OIDs that have been dropped since last
 checkpoint.

 Ok, here's one idea:

 Instead of deleting the file immediately on commit of DROP TABLE, the
 file is truncated to release the space, but not unlink()ed, to avoid
 reusing that relfilenode. The truncated file can be deleted after next
 checkpoint.

 Now, how does checkpoint know what to delete? We can use the fsync
 request mechanism for that. When a file is truncated, a new kind of
 fsync request, a deletion request, is sent to the bgwriter, which
 collects all such requests to a list. Before checkpoint calculates new
 RedoRecPtr, the list is swapped with an empty one, and after writing the
 new checkpoint record, all the files that were in the list are deleted.

 We would leak empty files on crashes, but we leak files on crashes
 anyway, so that shouldn't be an issue. This scheme wouldn't require
 catalog changes, so it would be suitable for backpatching.

 Any better ideas?
 Couldn't we fix this by forcing a checkpoint before we commit the
 transaction that created the new pg_class entry for the clustered table?
 Or rather, more generally, before committing a transaction that created
 a new non-temporary relfilenode but didn't WAL-log any subsequent inserts.

Yes, that would work. As a small optimization, you could set a flag in
shared mem whenever you delete a rel file, and skip the checkpoint when
that flag isn't set.
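
Something like this, perhaps. Illustrative only: the struct, the flag and
both functions are invented, and the shared-memory setup and the reset of
the flag at checkpoint time are omitted.

#include "postgres.h"
#include "storage/spin.h"

/* Hypothetical flag in shared memory: has any relation file been
 * deleted since the last checkpoint? */
typedef struct
{
    slock_t     mutex;
    bool        relFileDeletedSinceCheckpoint;
} RelFileDropFlag;

static RelFileDropFlag *dropFlag;       /* points into shared memory */

/* Call wherever a relation file is actually deleted. */
static void
NoteRelFileDeleted(void)
{
    SpinLockAcquire(&dropFlag->mutex);
    dropFlag->relFileDeletedSinceCheckpoint = true;
    SpinLockRelease(&dropFlag->mutex);
}

/* At commit of a transaction that created a relfilenode without
 * WAL-logging its contents: checkpoint only if something was deleted. */
static bool
CheckpointNeededBeforeCommit(void)
{
    bool        needed;

    SpinLockAcquire(&dropFlag->mutex);
    needed = dropFlag->relFileDeletedSinceCheckpoint;
    SpinLockRelease(&dropFlag->mutex);
    return needed;
}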

 That's of course a rather sledgehammer-like approach to this problem,
 but at least for the back branches the fix would be less intrusive...

Too much of a sledgehammer IMHO.

BTW, CREATE INDEX is also vulnerable. And in 8.3, COPY to a table
created/truncated in the same transaction.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
