Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-14 Thread Kyotaro HORIGUCHI
At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170913.174239.25978735.horiguchi.kyot...@lab.ntt.co.jp>
> filterdiff seems to did something wrong..

# to did...

The patch was broken by filterdiff, so I am sending a new patch made
directly with git format-patch. I confirmed that a build completes
after applying this.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 7086b5855080065f73de4d099cbaab09511f01fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem

---
 src/backend/access/heap/heapam.c| 113 +---
 src/backend/access/heap/pruneheap.c |   3 +-
 src/backend/access/heap/rewriteheap.c   |   4 +-
 src/backend/access/heap/visibilitymap.c |   3 +-
 src/backend/access/transam/xact.c   |   7 +
 src/backend/catalog/storage.c   | 318 +---
 src/backend/commands/copy.c |  13 +-
 src/backend/commands/createas.c |   9 +-
 src/backend/commands/matview.c  |   6 +-
 src/backend/commands/tablecmds.c|   8 +-
 src/backend/commands/vacuumlazy.c   |   6 +-
 src/backend/storage/buffer/bufmgr.c |  40 +++-
 src/backend/utils/cache/relcache.c  |  13 ++
 src/include/access/heapam.h |   8 +-
 src/include/catalog/storage.h   |   5 +-
 src/include/storage/bufmgr.h|   2 +
 src/include/utils/rel.h |   8 +
 17 files changed, 476 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d20f038..e40254d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *	  the POSTGRES heap access method used for all POSTGRES
  *	  relations.
  *
+ * WAL CONSIDERATIONS
+ *	  All heap operations are normally WAL-logged, but there are a few
+ *	  exceptions. Temporary and unlogged relations never need to be
+ *	  WAL-logged, but we can also skip WAL-logging for a table that was
+ *	  created in the same transaction, if we don't need WAL for PITR or
+ *	  WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *	  the file to disk at COMMIT instead.
+ *
+ *	  The same-relation optimization is not employed automatically on all
+ *	  updates to a table that was created in the same transaction, because
+ *	  for a small number of changes, it's cheaper to just create the WAL
+ *	  records than to fsync() the whole relation at COMMIT. It is only
+ *	  worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *	  or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *	  operation; it will cause any subsequent updates to the table to skip
+ *	  WAL-logging, if possible, and cause the heap to be synced to disk at
+ *	  COMMIT.
+ *
+ *	  To make that work, all modifications to heap must use
+ *	  HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *	  for the given block.
+ *
  *-
  */
 #include "postgres.h"
@@ -56,6 +78,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -2373,12 +2396,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2409,6 +2426,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored.  But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+extern HTAB *pendingSyncs;
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			int options, BulkInsertState bistate)
@@ -2482,7 +2500,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+	if (BufferNeedsWAL(relation, buffer))
 	{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
@@ -2681,12 +2699,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	int			ndone;
 	char	   *scratch = NULL;
 	Page		page;
-	bool		needwal;
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = 
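
The rule stated in the patch's new comment block can be condensed into a
small decision function. The following is a standalone model for
illustration only, with invented names (WalLevel, RelState, needs_wal);
it is not the actual PostgreSQL code:

```c
#include <stdbool.h>

/* Illustrative stand-ins; real PostgreSQL represents these differently. */
typedef enum { WAL_LEVEL_MINIMAL, WAL_LEVEL_REPLICA } WalLevel;

typedef struct
{
    bool is_temp_or_unlogged;   /* temporary/unlogged relations never need WAL */
    bool created_this_xact;     /* relfilenode was created in this transaction */
    bool sync_registered;       /* heap_register_sync() was called (e.g. COPY) */
} RelState;

/*
 * Model of the rule in the comment above: WAL can be skipped for
 * temp/unlogged relations, or, under wal_level=minimal, for a relation
 * created in the current transaction whose heap will be fsync'd at COMMIT.
 */
static bool
needs_wal(WalLevel level, const RelState *rel)
{
    if (rel->is_temp_or_unlogged)
        return false;
    if (level == WAL_LEVEL_MINIMAL &&
        rel->created_this_xact &&
        rel->sync_registered)
        return false;           /* skip WAL; rely on the fsync at commit */
    return true;
}
```

In the patch itself this decision is made per block by BufferNeedsWAL();
the sketch only captures the relation-level eligibility test.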

Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-13 Thread Kyotaro HORIGUCHI
At Wed, 13 Sep 2017 15:05:31 +1200, Thomas Munro 
 wrote in 

Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-13 Thread Alvaro Herrera
Kyotaro HORIGUCHI wrote:

> The CF status of this patch turned into "Waiting on Author" by
> automated CI checking.

I object to automated switching of patches to Waiting on Author by
machinery.  Sending occasional reminder messages to authors letting them
know about outdated patches seems acceptable to me at this stage.

It'll take some time for this machinery to be perfected; only when it
is beyond the experimental stage will it be acceptable to change patches'
status in an automated fashion.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-12 Thread Thomas Munro
On Wed, Sep 13, 2017 at 1:04 PM, Kyotaro HORIGUCHI
 wrote:
> The CF status of this patch turned into "Waiting on Author" by
> automated CI checking. However, I still don't get any error even
> on the current master (69835bc) after make distclean. Also, the
> "problematic" patch and my working branch have no differences other
> than shifted patch-line offsets. (So I haven't posted a new one.)
>
> I looked at heapam.c:2502, the location the CI complains about, in my
> working branch, and found code there that differs from the complaint.
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/27450
>
> 1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in 
> this function)
> 1364   if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
>
> heapam.c:2502@work branch
> 2502:   /* XLOG stuff */
> 2503:   if (BufferNeedsWAL(relation, buffer))
>
> So I conclude that the CI machinery failed to apply the patch
> correctly.

Hi Horiguchi-san,

Hmm.  Here is that line in heapam.c in unpatched master:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/heapam.c;h=d20f0381f3bc23f99c505ef8609d63240ac5d44b;hb=HEAD#l2485

It says:

2485 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

After applying fix-wal-level-minimal-michael-horiguchi-3.patch from
this message:

https://www.postgresql.org/message-id/20170912.131441.20602611.horiguchi.kyotaro%40lab.ntt.co.jp

... that line is unchanged, although it has moved to line number 2502.
It doesn't compile for me, because your patch removed the definition
of HEAP_INSERT_SKIP_WAL but hasn't removed that reference to it.

I'm not sure what happened.  Is it possible that your patch was not
created by diffing against master?

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-12 Thread Kyotaro HORIGUCHI
Hello, (does this seem to be a top post?)

The CF status of this patch turned into "Waiting on Author" by
automated CI checking. However, I still don't get any error even
on the current master (69835bc) after make distclean. Also, the
"problematic" patch and my working branch have no differences other
than shifted patch-line offsets. (So I haven't posted a new one.)

I looked at heapam.c:2502, the location the CI complains about, in my
working branch, and found code there that differs from the complaint.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/27450

1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in 
this function)
1364   if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

heapam.c:2502@work branch
2502:   /* XLOG stuff */
2503:   if (BufferNeedsWAL(relation, buffer))

So I conclude that the CI machinery failed to apply the patch
correctly.


At Thu, 13 Apr 2017 15:29:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170413.152935.100104316.horiguchi.kyot...@lab.ntt.co.jp>
> > > > I'll post new patch in this way soon.
> > > 
> > > Here it is.
> > 
> > It contained trailing space and a missing test script.  This is the
> > correct patch.
> > 
> > > - Relation has new members no_pending_sync and pending_sync that
> > >   works as instant cache of an entry in pendingSync hash.
> > > 
> > > - Commit-time synchronizing is restored as Michael's patch.
> > > 
> > > - If relfilenode is replaced, pending_sync for the old node is
> > >   removed. Anyway this is ignored on abort and meaningless on
> > >   commit.
> > > 
> > > - TAP test is renamed to 012 since some new files have been added.
> > > 
> > > Accessing the pending sync hash occurred on every call of
> > > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> > > the accessed relations had a pending sync.  Almost all of them are
> > > eliminated as a result.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-11 Thread Kyotaro HORIGUCHI
Hello,

At Fri, 08 Sep 2017 16:30:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170908.163001.53230385.horiguchi.kyot...@lab.ntt.co.jp>
> > >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> > >> STATEMENT:  ANALYZE;
> > >> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> > >> = 0x0, no_pending_sync = 0
> > >> 
> > >> -   lsn = XLogInsert(RM_SMGR_ID,
> > >> -XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> > >> +   rel->no_pending_sync= false;
> > >> +   rel->pending_sync = pending;
> > >> +   }
> > >> 
> > >> It seems to me that those flags and the pending_sync data should be
> > >> kept in the context of backend process and not be part of the Relation
> > >> data...
> > > 
> > > I understand that the context of "backend process" means
> > > storage.c local. I don't mind the context on which the data is,
> > > but I found only there that can get rid of frequent hash
> > > searching. For pending deletions, just appending to a list is
> > > enough and costs almost nothing; on the other hand, pending syncs
> > > are required to be referenced, sometimes very frequently.
> > > 
> > >> +void
> > >> +RecordPendingSync(Relation rel)
> > >> I don't think I agree that this should be part of relcache.c. The
> > >> syncs should be tracked outside of the relation context.
> > > 
> > > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > > duplicate name). I think it is a kind of bond between smgr and
> > > relation.
> > > 
> > >> Seeing how invasive this change is, I would also advocate for this
> > >> patch as only being a HEAD-only change, not many people are
> > >> complaining about this optimization of TRUNCATE missing when wal_level
> > >> = minimal, and this needs a very careful review.
> > > 
> > > Agreed.
> > > 
> > >> Should I code something? Or Horiguchi-san, would you take care of it?
> > >> The previous crash I saw has been taken care of, but it's been really
> > >> some time since I looked at this patch...
> > > 
> > > My point is that hash-search on every tuple insertion should be
> > > avoided even if it happens rarely. Once it was a bit apart from your
> > > original patch, but in the latest patch the significant part
> > > (pending-sync hash) is revived from the original one.
> > 
> > This patch has followed along since CF 2016-03, do we think we can reach a
> > conclusion in this CF?  It was marked as "Waiting on Author", based on
> > developments since in this thread, I've changed it back to "Needs Review"
> > again.
> 
> I managed to reload its context into my head. It doesn't apply on
> the current master and needs some amendment. I'm going to work on
> this.

Rebased and slightly modified.

Michael's latest patch, on which this patch piggybacks, seems to
work perfectly. The motivation for my addition is to avoid the
frequent hash accesses (specifically, one per tuple modification)
that occur while pending syncs exist. The hash contains at least
six entries.

The attached patch emits extra log messages, to be removed in the
final version, that show how much the addition reduces hash
access.  As a basis for judging the worth of the additional
mechanism, I'll show an example set of queries below.

In the log messages, "r" is the relation OID, "b" is the buffer
number, and "hash" is the pointer to the backend-global hash table
for pending syncs. "ent" is the hash entry belonging to the
relation, and "neg" is a flag indicating that the existing pending
sync hash has no entry for the relation.

=# set log_min_message to debug2;
=# begin;
=# create table test1(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = (nil), ent=(nil), neg = 0
# relid=2608 buf=55, hash has not been created

=# insert into test1 values ('inserted row');
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = (nil), ent=(nil), neg = 0
# relid=24807, first buffer, hash has not been created

=# copy test1 from '//copy_data.txt';
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = 0x171de00, ent=0x171f390, neg = 0
# hash created, pending sync entry linked, no longer needs hash access
# (repeats for the number of buffers)
COPY 200

=# create table test3(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = 0x171de00, ent=(nil), neg = 1
# no pending sync entry for this relation, no longer needs hash access.

=# insert into test3 (select a from generate_series(0, 99) a);
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG:  BufferNeedsWAL: accessing hash : not found
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 1
# This table no longer needs hash access (repeats for the number of tuples)

=#  truncate test3;
=#  insert into test3 (select a from generate_series(0, 99) a);
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG:  BufferNeedsWAL: accessing hash : found
> 
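
The caching idea shown in these logs, probing the pending-sync hash once
and then remembering the outcome in the Relation, can be modeled in
miniature. This is a self-contained sketch with invented names
(PendingSync, buffer_needs_wal, lookup_pending_sync), not the actual
patch code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the patch's structures (names are illustrative). */
typedef struct PendingSync { int relid; } PendingSync;

typedef struct Relation
{
    int          relid;
    bool         no_pending_sync;   /* negative cache: hash probed, no entry */
    PendingSync *pending_sync;      /* positive cache: the hash entry itself */
} Relation;

static int hash_probe_count = 0;    /* counts how often the hash is consulted */

/* Pretend hash lookup: only relation 42 has a pending sync in this model. */
static PendingSync *
lookup_pending_sync(int relid)
{
    static PendingSync entry = { 42 };

    hash_probe_count++;
    return (relid == 42) ? &entry : NULL;
}

/*
 * Model of BufferNeedsWAL(): WAL is needed unless the relation has a
 * pending sync registered.  The hash is probed at most once per Relation;
 * afterwards either pending_sync or no_pending_sync answers directly.
 */
static bool
buffer_needs_wal(Relation *rel)
{
    if (rel->pending_sync == NULL && !rel->no_pending_sync)
    {
        rel->pending_sync = lookup_pending_sync(rel->relid);
        rel->no_pending_sync = (rel->pending_sync == NULL);
    }
    return rel->pending_sync == NULL;
}
```

Under this model, repeated calls for the same Relation (one per tuple in
a COPY, say) touch the hash only on the first call, which is the
reduction the log messages above are demonstrating.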

Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-08 Thread Kyotaro HORIGUCHI
Thank you for your notification.

At Tue, 5 Sep 2017 12:05:01 +0200, Daniel Gustafsson  wrote in 

> > On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI 
> >  wrote:
> > 
> > At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier 
> >  wrote in 
> > 
> >> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
> >>  wrote:
> >>> Sorry, what I have just sent was broken.
> >> 
> >> You can use PROVE_TESTS when running make check to select a subset of
> >> tests you want to run. I use that all the time when working on patches
> >> dedicated to certain code paths.
> > 
> > Thank you for the information. Removing unwanted test scripts
> > from t/ directories was an annoyance. This makes me happy.
> > 
>  - Relation has new members no_pending_sync and pending_sync that
>   works as instant cache of an entry in pendingSync hash.
>  - Commit-time synchronizing is restored as Michael's patch.
>  - If relfilenode is replaced, pending_sync for the old node is
>   removed. Anyway this is ignored on abort and meaningless on
>   commit.
>  - TAP test is renamed to 012 since some new files have been added.
>  
>  Accessing the pending sync hash occurred on every call of
>  HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
>  the accessed relations had a pending sync.  Almost all of them are
>  eliminated as a result.
> >> 
> >> Did you actually test this patch? One of the added log messages makes
> >> the tests take a long time to run:
> > 
> > Maybe this patch requires make clean since it extends the
> > structure RelationData. (Perhaps I saw the same trouble.)
> > 
> >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> >> STATEMENT:  ANALYZE;
> >> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> >> = 0x0, no_pending_sync = 0
> >> 
> >> -   lsn = XLogInsert(RM_SMGR_ID,
> >> -XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> >> +   rel->no_pending_sync= false;
> >> +   rel->pending_sync = pending;
> >> +   }
> >> 
> >> It seems to me that those flags and the pending_sync data should be
> >> kept in the context of backend process and not be part of the Relation
> >> data...
> > 
> > I understand that the context of "backend process" means
> > storage.c local. I don't mind the context on which the data is,
> > but I found only there that can get rid of frequent hash
> > searching. For pending deletions, just appending to a list is
> > enough and costs almost nothing; on the other hand, pending syncs
> > are required to be referenced, sometimes very frequently.
> > 
> >> +void
> >> +RecordPendingSync(Relation rel)
> >> I don't think I agree that this should be part of relcache.c. The
> >> syncs should be tracked outside of the relation context.
> > 
> > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > duplicate name). I think it is a kind of bond between smgr and
> > relation.
> > 
> >> Seeing how invasive this change is, I would also advocate for this
> >> patch as only being a HEAD-only change, not many people are
> >> complaining about this optimization of TRUNCATE missing when wal_level
> >> = minimal, and this needs a very careful review.
> > 
> > Agreed.
> > 
> >> Should I code something? Or Horiguchi-san, would you take care of it?
> >> The previous crash I saw has been taken care of, but it's been really
> >> some time since I looked at this patch...
> > 
> > My point is that hash-search on every tuple insertion should be
> > avoided even if it happens rarely. Once it was a bit apart from your
> > original patch, but in the latest patch the significant part
> > (pending-sync hash) is revived from the original one.
> 
> This patch has followed along since CF 2016-03, do we think we can reach a
> conclusion in this CF?  It was marked as "Waiting on Author", based on
> developments since in this thread, I've changed it back to "Needs Review"
> again.

I managed to reload its context into my head. It doesn't apply on
the current master and needs some amendment. I'm going to work on
this.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] WAL logging problem in 9.4.3?

2017-09-05 Thread Daniel Gustafsson
> On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI  
> wrote:
> 
> At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier 
>  wrote in 
> 
>> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
>>  wrote:
>>> Sorry, what I have just sent was broken.
>> 
>> You can use PROVE_TESTS when running make check to select a subset of
>> tests you want to run. I use that all the time when working on patches
>> dedicated to certain code paths.
> 
> Thank you for the information. Removing unwanted test scripts
> from t/ directories was an annoyance. This makes me happy.
> 
 - Relation has new members no_pending_sync and pending_sync that
  works as instant cache of an entry in pendingSync hash.
 - Commit-time synchronizing is restored as Michael's patch.
 - If relfilenode is replaced, pending_sync for the old node is
  removed. Anyway this is ignored on abort and meaningless on
  commit.
 - TAP test is renamed to 012 since some new files have been added.
 
 Accessing the pending sync hash occurred on every call of
 HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
 the accessed relations had a pending sync.  Almost all of them are
 eliminated as a result.
>> 
>> Did you actually test this patch? One of the added log messages makes
>> the tests take a long time to run:
> 
> Maybe this patch requires make clean since it extends the
> structure RelationData. (Perhaps I saw the same trouble.)
> 
>> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
>> STATEMENT:  ANALYZE;
>> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
>> = 0x0, no_pending_sync = 0
>> 
>> -   lsn = XLogInsert(RM_SMGR_ID,
>> -XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
>> +   rel->no_pending_sync= false;
>> +   rel->pending_sync = pending;
>> +   }
>> 
>> It seems to me that those flags and the pending_sync data should be
>> kept in the context of backend process and not be part of the Relation
>> data...
> 
> I understand that the context of "backend process" means
> storage.c local. I don't mind the context on which the data is,
> but I found only there that can get rid of frequent hash
> searching. For pending deletions, just appending to a list is
> enough and costs almost nothing; on the other hand, pending syncs
> are required to be referenced, sometimes very frequently.
> 
>> +void
>> +RecordPendingSync(Relation rel)
>> I don't think I agree that this should be part of relcache.c. The
>> syncs should be tracked outside of the relation context.
> 
> Yeah.. It's in storage.c in the latest patch. (Sorry for the
> duplicate name). I think it is a kind of bond between smgr and
> relation.
> 
>> Seeing how invasive this change is, I would also advocate for this
>> patch as only being a HEAD-only change, not many people are
>> complaining about this optimization of TRUNCATE missing when wal_level
>> = minimal, and this needs a very careful review.
> 
> Agreed.
> 
>> Should I code something? Or Horiguchi-san, would you take care of it?
>> The previous crash I saw has been taken care of, but it's been really
>> some time since I looked at this patch...
> 
> My point is that hash-search on every tuple insertion should be
> avoided even if it happens rarely. Once it was a bit apart from your
> original patch, but in the latest patch the significant part
> (pending-sync hash) is revived from the original one.

This patch has followed along since CF 2016-03, do we think we can reach a
conclusion in this CF?  It was marked as "Waiting on Author", based on
developments since in this thread, I've changed it back to "Needs Review"
again.

cheers ./daniel






Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-13 Thread Kyotaro HORIGUCHI
At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier  
wrote in 
> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
>  wrote:
> > Sorry, what I have just sent was broken.
> 
> You can use PROVE_TESTS when running make check to select a subset of
> tests you want to run. I use that all the time when working on patches
> dedicated to certain code paths.

Thank you for the information. Removing unwanted test scripts
from t/ directories was an annoyance. This makes me happy.

> >> - Relation has new members no_pending_sync and pending_sync that
> >>   works as instant cache of an entry in pendingSync hash.
> >> - Commit-time synchronizing is restored as Michael's patch.
> >> - If relfilenode is replaced, pending_sync for the old node is
> >>   removed. Anyway this is ignored on abort and meaningless on
> >>   commit.
> >> - TAP test is renamed to 012 since some new files have been added.
> >>
> >> Accessing the pending sync hash occurred on every call of
> >> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> >> the accessed relations had a pending sync.  Almost all of them are
> >> eliminated as a result.
> 
> Did you actually test this patch? One of the added log messages makes
> the tests take a long time to run:

Maybe this patch requires make clean since it extends the
structure RelationData. (Perhaps I saw the same trouble.)

> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> STATEMENT:  ANALYZE;
> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> = 0x0, no_pending_sync = 0
> 
> -   lsn = XLogInsert(RM_SMGR_ID,
> -XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> +   rel->no_pending_sync= false;
> +   rel->pending_sync = pending;
> +   }
> 
> It seems to me that those flags and the pending_sync data should be
> kept in the context of backend process and not be part of the Relation
> data...

I understand that the context of "backend process" means
storage.c local. I don't mind which context the data lives in,
but that was the only place I found that can get rid of frequent
hash searching. For pending deletions, just appending to a list is
enough and costs almost nothing; on the other hand, pending syncs
need to be referenced, sometimes very frequently.

> +void
> +RecordPendingSync(Relation rel)
> I don't think I agree that this should be part of relcache.c. The
> syncs should be tracked outside of the relation context.

Yeah.. It's in storage.c in the latest patch. (Sorry for the
duplicate name). I think it is a kind of bond between smgr and
relation.

> Seeing how invasive this change is, I would also advocate for this
> patch as only being a HEAD-only change, not many people are
> complaining about this optimization of TRUNCATE missing when wal_level
> = minimal, and this needs a very careful review.

Agreed.

> Should I code something? Or Horiguchi-san, would you take care of it?
> The previous crash I saw has been taken care of, but it's been really
> some time since I looked at this patch...

My point is that hash-search on every tuple insertion should be
avoided even if it happens rarely. Once it was a bit apart from your
original patch, but in the latest patch the significant part
(pending-sync hash) is revived from the original one.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-13 Thread Kyotaro HORIGUCHI
I'd like to add a supplementary explanation.

At Tue, 11 Apr 2017 17:38:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170411.173812.133964522.horiguchi.kyot...@lab.ntt.co.jp>
> Sorry, what I have just sent was broken.
> 
> At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20170411.173341.257028732.horiguchi.kyot...@lab.ntt.co.jp>
> > At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> >  wrote in 
> > <20170411.095606.245908357.horiguchi.kyot...@lab.ntt.co.jp>
> > > Hello, thank you for looking this.
> > > 
> > > At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane  wrote 
> > > in <27309.1491611...@sss.pgh.pa.us>
> > > > Alvaro Herrera  writes:
> > > > > Interesting.  I wonder if it's possible that a relcache invalidation
> > > > > would cause these values to get lost for some reason, because that 
> > > > > would
> > > > > be dangerous.
> > > > 
> > > > > I suppose the rationale is that this shouldn't happen because any
> > > > > operation that does things this way must hold an exclusive lock on the
> > > > > relation.  But that doesn't guarantee that the relcache entry is
> > > > > completely stable,
> > > > 
> > > > It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
> > > > how strong a lock you hold.
> > > > 
> > > > regards, tom lane
> > > 
> > > Ugh. Yes, relcache invalidation happens anytime and it resets the

The pending locations are not stored in the relcache hash, so the
problem here is not invalidation but that Relation objects are
created as necessary, anywhere. Even if no invalidation happens,
the same thing will happen in a slightly different form.

> > > added values. pg_stat_info misled me into thinking it could store
> > > transient values. But I came up with another thought.
> > > 
> > > The reason I proposed it was I thought that hash_search for every
> > > buffer is not good. Instead, like pg_stat_info, we can link the
> > 
> > buffer => buffer modification
> > 
> > > pending-sync hash entry to Relation. This greatly reduces the
> > > frequency of hash-searching.
> > > 
> > > I'll post new patch in this way soon.
> > 
> > Here it is.
> 
> It contained trailing space and a missing test script.  This is the
> correct patch.
> 
> > - Relation has new members no_pending_sync and pending_sync that
> >   works as instant cache of an entry in pendingSync hash.
> > 
> > - Commit-time synchronizing is restored as Michael's patch.
> > 
> > - If relfilenode is replaced, pending_sync for the old node is
> >   removed. Anyway this is ignored on abort and meaningless on
> >   commit.
> > 
> > - TAP test is renamed to 012 since some new files have been added.
> > 
> > Accessing the pending sync hash occurred on every call of
> > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> > the accessed relations had a pending sync.  Almost all of them are
> > eliminated as a result.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-12 Thread Michael Paquier
On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
 wrote:
> Sorry, what I have just sent was broken.

You can use PROVE_TESTS when running make check to select a subset of
tests you want to run. I use that all the time when working on patches
dedicated to certain code paths.

>> - Relation has new members no_pending_sync and pending_sync that
>>   works as instant cache of an entry in pendingSync hash.
>> - Commit-time synchronizing is restored as Michael's patch.
>> - If relfilenode is replaced, pending_sync for the old node is
>>   removed. Anyway this is ignored on abort and meaningless on
>>   commit.
>> - TAP test is renamed to 012 since some new files have been added.
>>
>> Accessing the pending sync hash occurred on every call of
>> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
>> the accessed relations had a pending sync.  Almost all of them are
>> eliminated as a result.

Did you actually test this patch? One of the added log messages makes
the tests take a long time to run:
2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
STATEMENT:  ANALYZE;
2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
= 0x0, no_pending_sync = 0

-   lsn = XLogInsert(RM_SMGR_ID,
-XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+   rel->no_pending_sync= false;
+   rel->pending_sync = pending;
+   }
It seems to me that those flags and the pending_sync data should be
kept in the context of backend process and not be part of the Relation
data...

+void
+RecordPendingSync(Relation rel)
I don't think I agree that this should be part of relcache.c. The
syncs should be tracked outside of the relation context.

Seeing how invasive this change is, I would also advocate for this
patch as only being a HEAD-only change, not many people are
complaining about this optimization of TRUNCATE missing when wal_level
= minimal, and this needs a very careful review.

Should I code something? Or Horiguchi-san, would you take care of it?
The previous crash I saw has been taken care of, but it's been really
some time since I looked at this patch...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-11 Thread Kyotaro HORIGUCHI
Sorry, what I have just sent was broken.

At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170411.173341.257028732.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20170411.095606.245908357.horiguchi.kyot...@lab.ntt.co.jp>
> > Hello, thank you for looking this.
> > 
> > At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane  wrote in 
> > <27309.1491611...@sss.pgh.pa.us>
> > > Alvaro Herrera  writes:
> > > > Interesting.  I wonder if it's possible that a relcache invalidation
> > > > would cause these values to get lost for some reason, because that would
> > > > be dangerous.
> > > 
> > > > I suppose the rationale is that this shouldn't happen because any
> > > > operation that does things this way must hold an exclusive lock on the
> > > > relation.  But that doesn't guarantee that the relcache entry is
> > > > completely stable,
> > > 
> > > It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
> > > how strong a lock you hold.
> > > 
> > >   regards, tom lane
> > 
> > Ugh. Yes, relcache invalidation can happen at any time, and it resets
> > the added values. pg_stat_info misled me into thinking it could store
> > transient values. But I came up with another thought.
> > 
> > The reason I proposed it was I thought that hash_search for every
> > buffer is not good. Instead, like pg_stat_info, we can link the
> 
> buffer => buffer modification
> 
> > pending-sync hash entry to Relation. This greatly reduces the
> > frequency of hash-searching.
> > 
> > I'll post new patch in this way soon.
> 
> Here it is.

It contained trailing spaces and was missing the test script.  This is
the corrected patch.

> - Relation has new members no_pending_sync and pending_sync that
>   work as an instant cache of an entry in the pendingSync hash.
> 
> - Commit-time synchronizing is restored as Michael's patch.
> 
> - If relfilenode is replaced, pending_sync for the old node is
>   removed. Anyway this is ignored on abort and meaningless on
>   commit.
> 
> - TAP test is renamed to 012 since some new files have been added.
> 
> The pending-sync hash was searched on every call of
> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> the accessed relations had a pending sync.  Almost all of those
> lookups are eliminated as a result.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..23a6d56 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *	  the POSTGRES heap access method used for all POSTGRES
  *	  relations.
  *
+ * WAL CONSIDERATIONS
+ *	  All heap operations are normally WAL-logged, but there are a few
+ *	  exceptions. Temporary and unlogged relations never need to be
+ *	  WAL-logged, but we can also skip WAL-logging for a table that was
+ *	  created in the same transaction, if we don't need WAL for PITR or
+ *	  WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *	  the file to disk at COMMIT instead.
+ *
+ *	  The same-relation optimization is not employed automatically on all
+ *	  updates to a table that was created in the same transaction, because
+ *	  for a small number of changes, it's cheaper to just create the WAL
+ *	  records than fsyncing() the whole relation at COMMIT. It is only
+ *	  worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *	  or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *	  operation; it will cause any subsequent updates to the table to skip
+ *	  WAL-logging, if possible, and cause the heap to be synced to disk at
+ *	  COMMIT.
+ *
+ *	  To make that work, all modifications to heap must use
+ *	  HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *	  for the given block.
+ *
  *-
  */
 #include "postgres.h"
@@ -56,6 +78,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2392,6 

Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-11 Thread Kyotaro HORIGUCHI
At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170411.095606.245908357.horiguchi.kyot...@lab.ntt.co.jp>
> Hello, thank you for looking this.
> 
> At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane  wrote in 
> <27309.1491611...@sss.pgh.pa.us>
> > Alvaro Herrera  writes:
> > > Interesting.  I wonder if it's possible that a relcache invalidation
> > > would cause these values to get lost for some reason, because that would
> > > be dangerous.
> > 
> > > I suppose the rationale is that this shouldn't happen because any
> > > operation that does things this way must hold an exclusive lock on the
> > > relation.  But that doesn't guarantee that the relcache entry is
> > > completely stable,
> > 
> > It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
> > how strong a lock you hold.
> > 
> > regards, tom lane
> 
> Ugh. Yes, relcache invalidation can happen at any time, and it resets
> the added values. pg_stat_info misled me into thinking it could store
> transient values. But I came up with another thought.
> 
> The reason I proposed it was I thought that hash_search for every
> buffer is not good. Instead, like pg_stat_info, we can link the

buffer => buffer modification

> pending-sync hash entry to Relation. This greatly reduces the
> frequency of hash-searching.
> 
> I'll post new patch in this way soon.

Here it is.

- Relation has new members no_pending_sync and pending_sync that
  work as an instant cache of an entry in the pendingSync hash.

- Commit-time synchronizing is restored as Michael's patch.

- If relfilenode is replaced, pending_sync for the old node is
  removed. Anyway this is ignored on abort and meaningless on
  commit.

- TAP test is renamed to 012 since some new files have been added.

The pending-sync hash was searched on every call of
HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
the accessed relations had a pending sync.  Almost all of those
lookups are eliminated as a result.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..aa1b97d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *	  the POSTGRES heap access method used for all POSTGRES
  *	  relations.
  *
+ * WAL CONSIDERATIONS
+ *	  All heap operations are normally WAL-logged, but there are a few
+ *	  exceptions. Temporary and unlogged relations never need to be
+ *	  WAL-logged, but we can also skip WAL-logging for a table that was
+ *	  created in the same transaction, if we don't need WAL for PITR or
+ *	  WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *	  the file to disk at COMMIT instead.
+ *
+ *	  The same-relation optimization is not employed automatically on all
+ *	  updates to a table that was created in the same transaction, because
+ *	  for a small number of changes, it's cheaper to just create the WAL
+ *	  records than fsyncing() the whole relation at COMMIT. It is only
+ *	  worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *	  or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *	  operation; it will cause any subsequent updates to the table to skip
+ *	  WAL-logging, if possible, and cause the heap to be synced to disk at
+ *	  COMMIT.
+ *
+ *	  To make that work, all modifications to heap must use
+ *	  HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *	  for the given block.
+ *
  *-
  */
 #include "postgres.h"
@@ -56,6 +78,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2465,7 +2482,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+	if (BufferNeedsWAL(relation, buffer))
 	{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
@@ -2664,12 +2681,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	int			ndone;
 	char	

Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-10 Thread Kyotaro HORIGUCHI
Hello, thank you for looking this.

At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane  wrote in 
<27309.1491611...@sss.pgh.pa.us>
> Alvaro Herrera  writes:
> > Interesting.  I wonder if it's possible that a relcache invalidation
> > would cause these values to get lost for some reason, because that would
> > be dangerous.
> 
> > I suppose the rationale is that this shouldn't happen because any
> > operation that does things this way must hold an exclusive lock on the
> > relation.  But that doesn't guarantee that the relcache entry is
> > completely stable,
> 
> It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
> how strong a lock you hold.
> 
>   regards, tom lane

Ugh. Yes, relcache invalidation can happen at any time, and it resets
the added values. pg_stat_info misled me into thinking it could store
transient values. But I came up with another thought.

The reason I proposed it was that I thought a hash_search for every
buffer is not good. Instead, like pg_stat_info, we can link the
pending-sync hash entry to the Relation. This greatly reduces the
frequency of hash searching.

I'll post new patch in this way soon.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-07 Thread Tom Lane
Alvaro Herrera  writes:
> Interesting.  I wonder if it's possible that a relcache invalidation
> would cause these values to get lost for some reason, because that would
> be dangerous.

> I suppose the rationale is that this shouldn't happen because any
> operation that does things this way must hold an exclusive lock on the
> relation.  But that doesn't guarantee that the relcache entry is
> completely stable,

It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
how strong a lock you hold.

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-07 Thread Alvaro Herrera
Alvaro Herrera wrote:

> I suppose the rationale is that this shouldn't happen because any
> operation that does things this way must hold an exclusive lock on the
> relation.  But that doesn't guarantee that the relcache entry is
> completely stable, does it?  If we can get proof of that, then this
> technique should be safe, I think.

It occurs to me that in order to test this we could run the recovery
tests (including Michael's new 006 file, which you didn't include in
your patch) under -D CLOBBER_CACHE_ALWAYS.  I think that'd be sufficient
proof that it is solid.
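For reference, a hypothetical invocation (flags and paths are illustrative; adjust to your build setup):

```shell
# Rebuild with relcache clobbering enabled, then run the recovery TAP suite.
# CLOBBER_CACHE_ALWAYS forces relcache entries to be flushed and rebuilt at
# every opportunity, so any transient state kept only in Relation is lost.
./configure CFLAGS='-O0 -DCLOBBER_CACHE_ALWAYS' --enable-tap-tests
make -j4
make -C src/test/recovery check
```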

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-07 Thread Alvaro Herrera
I have claimed this patch as committer FWIW.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WAL logging problem in 9.4.3?

2017-04-07 Thread Alvaro Herrera
Kyotaro HORIGUCHI wrote:

> The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> just as a PoC of my proposal quoted above. This also passes the
> 006 test.  The major changes are the following.
> 
> - Moved sync_above and truncated_to into RelationData.

Interesting.  I wonder if it's possible that a relcache invalidation
would cause these values to get lost for some reason, because that would
be dangerous.

I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation.  But that doesn't guarantee that the relcache entry is
completely stable, does it?  If we can get proof of that, then this
technique should be safe, I think.

In your version of the patch, which I spent some time skimming, I am
missing comments on various functions.  I added some as I went along,
including one XXX indicating it must be filled.

RecordPendingSync() should really live in relcache.c (and probably get a
different name).

> X I feel that I have dropped one of the features of the original
>   patch during the hack, but I don't recall it clearly now:(

Hah :-)

> X I haven't considered relfilenode replacement, which didn't matter
>   for the original patch. (but there's few places to consider).

Hmm ...  Please provide.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..aa1b97d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *   the POSTGRES heap access method used for all POSTGRES
  *   relations.
  *
+ * WAL CONSIDERATIONS
+ *   All heap operations are normally WAL-logged, but there are a few
+ *   exceptions. Temporary and unlogged relations never need to be
+ *   WAL-logged, but we can also skip WAL-logging for a table that was
+ *   created in the same transaction, if we don't need WAL for PITR or
+ *   WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *   the file to disk at COMMIT instead.
+ *
+ *   The same-relation optimization is not employed automatically on all
+ *   updates to a table that was created in the same transaction, because
+ *   for a small number of changes, it's cheaper to just create the WAL
+ *   records than fsyncing() the whole relation at COMMIT. It is only
+ *   worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *   or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *   operation; it will cause any subsequent updates to the table to skip
+ *   WAL-logging, if possible, and cause the heap to be synced to disk at
+ *   COMMIT.
+ *
+ *   To make that work, all modifications to heap must use
+ *   HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *   for the given block.
+ *
  *-
  */
 #include "postgres.h"
@@ -56,6 +78,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2465,7 +2482,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId 
cid,
MarkBufferDirty(buffer);
 
/* XLOG stuff */
-   if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+   if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2664,12 +2681,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, 
int ntuples,
int ndone;
char   *scratch = NULL;
Pagepage;
-   boolneedwal;
SizesaveFreeSpace;
boolneed_tuple_data = RelationIsLogicallyLogged(relation);
boolneed_cids = 
RelationIsAccessibleInLogicalDecoding(relation);
 
-   needwal = !(options & HEAP_INSERT_SKIP_WAL) && 
RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,

 

Re: [HACKERS] WAL logging problem in 9.4.3?

2017-03-01 Thread David Steele
On 1/30/17 11:33 PM, Michael Paquier wrote:
> On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi  
> wrote:
>> The latest proposed patch still has problems.
>> Closed in 2016-11 commitfest with "moved to next CF" status because of a bug
>> fix patch.
>> Please feel free to update the status once you submit the updated patch.
> And moved to CF 2017-03...

Are there any plans to post a new patch?  This thread is now 18 months
old and it would be good to get a resolution in this CF.

Thanks,

-- 
-David
da...@pgmasters.net





Re: [HACKERS] WAL logging problem in 9.4.3?

2017-01-30 Thread Michael Paquier
On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi  wrote:
> The latest proposed patch still has problems.
> Closed in 2016-11 commitfest with "moved to next CF" status because of a bug
> fix patch.
> Please feel free to update the status once you submit the updated patch.

And moved to CF 2017-03...
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-12-01 Thread Haribabu Kommi
On Wed, Nov 9, 2016 at 5:55 PM, Michael Paquier 
wrote:

>
>
> On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquier 
> wrote:
> > On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas 
> wrote:
> >> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas 
> wrote:
> >>> I dropped the ball on this one back in July, so here's an attempt to
> revive
> >>> this thread.
> >>>
> >>> I spent some time fixing the remaining issues with the prototype patch
> I
> >>> posted earlier, and rebased that on top of current git master. See
> attached.
> >>>
> >>> Some review of that would be nice. If there are no major issues with
> it, I'm
> >>> going to create backpatchable versions of this for 9.4 and below.
> >>
> >> Are you going to do commit something here?  This thread and patch are
> >> now 14 months old, which is a long time to make people wait for a bug
> >> fix.  The status in the CF is "Ready for Committer" although I am not
> >> sure if that's accurate.
> >
> > "Needs Review" is definitely a better definition of its current state.
> > The last time I had a look at this patch I thought that it was in
> > pretty good shape (not Horiguchi-san's version, but the one in
> > https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=
> axxkxcyeb-q77u-erw7_ukajctwnt...@mail.gmail.com).
> > With some of the recent changes, surely it needs a second look, things
> > related to heap handling tend to rot quickly.
> >
> > I'll look into it once again by the end of this week if Heikki does
> > not show up, the rest will be on him I am afraid...
>
> I have been able to hit a crash with recovery test 008:
> (lldb) bt
> * thread #1: tid = 0x, 0x7fff96d48f06 
> libsystem_kernel.dylib`__pthread_kill
> + 10, stop reason = signal SIGSTOP
>   * frame #0: 0x7fff96d48f06 libsystem_kernel.dylib`__pthread_kill +
> 10
> frame #1: 0x7fff9102e4ec libsystem_pthread.dylib`pthread_kill + 90
> frame #2: 0x7fff8e5cc6df libsystem_c.dylib`abort + 129
> frame #3: 0x000106ef10f0 
> postgres`ExceptionalCondition(conditionName="!((
> !( ((void) ((bool) (! (!((buffer) <= NBuffers && (buffer) >= -NLocBuffer))
> || (ExceptionalCondition(\"!((buffer) <= NBuffers && (buffer) >=
> -NLocBuffer)\", (\"FailedAssertion\"), \"bufmgr.c\", 2593), 0, (buffer)
> != 0 ) ? ((bool) 0) : ((buffer) < 0) ? (LocalRefCount[-(buffer) - 1] > 0) :
> (GetPrivateRefCount(buffer) > 0) ))", errorType="FailedAssertion",
> fileName="bufmgr.c", lineNumber=2593) + 128 at assert.c:54
> frame #4: 0x000106cf4a2c postgres`BufferGetBlockNumber(buffer=0)
> + 204 at bufmgr.c:2593
> frame #5: 0x00010694e6ad postgres`HeapNeedsWAL(rel=0x7f9454804118,
> buf=0) + 61 at heapam.c:9234
> frame #6: 0x00010696d8bd 
> postgres`visibilitymap_set(rel=0x7f9454804118,
> heapBlk=1, heapBuf=0, recptr=50841176, vmBuf=118, cutoff_xid=866,
> flags='\x01') + 989 at visibilitymap.c:310
> frame #7: 0x00010695d020 
> postgres`heap_xlog_visible(record=0x7f94520035d0)
> + 896 at heapam.c:8148
> frame #8: 0x00010695c582 
> postgres`heap2_redo(record=0x7f94520035d0)
> + 242 at heapam.c:9107
> frame #9: 0x0001069d132d postgres`StartupXLOG + 9181 at xlog.c:6950
> frame #10: 0x000106c9d783 postgres`StartupProcessMain + 339 at
> startup.c:216
> frame #11: 0x0001069ee6ec postgres`AuxiliaryProcessMain(argc=2,
> argv=0x7fff59316d80) + 1676 at bootstrap.c:420
> frame #12: 0x000106c98002 
> postgres`StartChildProcess(type=StartupProcess)
> + 322 at postmaster.c:5221
> frame #13: 0x000106c96031 postgres`PostmasterMain(argc=3,
> argv=0x7f9451c04210) + 6033 at postmaster.c:1301
> frame #14: 0x000106bc30cf postgres`main(argc=3,
> argv=0x7f9451c04210) + 751 at main.c:228
> (lldb) up 1
> frame #4: 0x000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) +
> 204 at bufmgr.c:2593
>    2590    {
>    2591        BufferDesc *bufHdr;
>    2592
> -> 2593        Assert(BufferIsPinned(buffer));
>    2594
>    2595        if (BufferIsLocal(buffer))
>    2596            bufHdr = GetLocalBufferDescriptor(-buffer - 1);
>

The latest proposed patch still has problems.
It was closed in the 2016-11 commitfest with "moved to next CF" status
because it is a bug-fix patch.
Please feel free to update the status once you submit the updated patch.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] WAL logging problem in 9.4.3?

2016-11-08 Thread Michael Paquier
On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquier 
wrote:
> On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas  wrote:
>> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas 
wrote:
>>> I dropped the ball on this one back in July, so here's an attempt to
revive
>>> this thread.
>>>
>>> I spent some time fixing the remaining issues with the prototype patch I
>>> posted earlier, and rebased that on top of current git master. See
attached.
>>>
>>> Some review of that would be nice. If there are no major issues with
it, I'm
>>> going to create backpatchable versions of this for 9.4 and below.
>>
>> Are you going to do commit something here?  This thread and patch are
>> now 14 months old, which is a long time to make people wait for a bug
>> fix.  The status in the CF is "Ready for Committer" although I am not
>> sure if that's accurate.
>
> "Needs Review" is definitely a better definition of its current state.
> The last time I had a look at this patch I thought that it was in
> pretty good shape (not Horiguchi-san's version, but the one in
>
https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=axxkxcyeb-q77u-erw7_ukajctwnt...@mail.gmail.com
).
> With some of the recent changes, surely it needs a second look, things
> related to heap handling tend to rot quickly.
>
> I'll look into it once again by the end of this week if Heikki does
> not show up, the rest will be on him I am afraid...

I have been able to hit a crash with recovery test 008:
(lldb) bt
* thread #1: tid = 0x, 0x7fff96d48f06
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x7fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff9102e4ec libsystem_pthread.dylib`pthread_kill + 90
frame #2: 0x7fff8e5cc6df libsystem_c.dylib`abort + 129
frame #3: 0x000106ef10f0
postgres`ExceptionalCondition(conditionName="!(( !( ((void) ((bool) (!
(!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)) ||
(ExceptionalCondition(\"!((buffer) <= NBuffers && (buffer) >=
-NLocBuffer)\", (\"FailedAssertion\"), \"bufmgr.c\", 2593), 0, (buffer)
!= 0 ) ? ((bool) 0) : ((buffer) < 0) ? (LocalRefCount[-(buffer) - 1] > 0) :
(GetPrivateRefCount(buffer) > 0) ))", errorType="FailedAssertion",
fileName="bufmgr.c", lineNumber=2593) + 128 at assert.c:54
frame #4: 0x000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) +
204 at bufmgr.c:2593
frame #5: 0x00010694e6ad
postgres`HeapNeedsWAL(rel=0x7f9454804118, buf=0) + 61 at heapam.c:9234
frame #6: 0x00010696d8bd
postgres`visibilitymap_set(rel=0x7f9454804118, heapBlk=1, heapBuf=0,
recptr=50841176, vmBuf=118, cutoff_xid=866, flags='\x01') + 989 at
visibilitymap.c:310
frame #7: 0x00010695d020
postgres`heap_xlog_visible(record=0x7f94520035d0) + 896 at heapam.c:8148
frame #8: 0x00010695c582
postgres`heap2_redo(record=0x7f94520035d0) + 242 at heapam.c:9107
frame #9: 0x0001069d132d postgres`StartupXLOG + 9181 at xlog.c:6950
frame #10: 0x000106c9d783 postgres`StartupProcessMain + 339 at
startup.c:216
frame #11: 0x0001069ee6ec postgres`AuxiliaryProcessMain(argc=2,
argv=0x7fff59316d80) + 1676 at bootstrap.c:420
frame #12: 0x000106c98002
postgres`StartChildProcess(type=StartupProcess) + 322 at postmaster.c:5221
frame #13: 0x000106c96031 postgres`PostmasterMain(argc=3,
argv=0x7f9451c04210) + 6033 at postmaster.c:1301
frame #14: 0x000106bc30cf postgres`main(argc=3,
argv=0x7f9451c04210) + 751 at main.c:228
(lldb) up 1
frame #4: 0x000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204
at bufmgr.c:2593
   2590    {
   2591        BufferDesc *bufHdr;
   2592
-> 2593        Assert(BufferIsPinned(buffer));
   2594
   2595        if (BufferIsLocal(buffer))
   2596            bufHdr = GetLocalBufferDescriptor(-buffer - 1);
-- 
Michael


Re: [HACKERS] WAL logging problem in 9.4.3?

2016-11-08 Thread Michael Paquier
On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas  wrote:
> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas  wrote:
>> I dropped the ball on this one back in July, so here's an attempt to revive
>> this thread.
>>
>> I spent some time fixing the remaining issues with the prototype patch I
>> posted earlier, and rebased that on top of current git master. See attached.
>>
>> Some review of that would be nice. If there are no major issues with it, I'm
>> going to create backpatchable versions of this for 9.4 and below.
>
> Are you going to do commit something here?  This thread and patch are
> now 14 months old, which is a long time to make people wait for a bug
> fix.  The status in the CF is "Ready for Committer" although I am not
> sure if that's accurate.

"Needs Review" is definitely a better definition of its current state.
The last time I had a look at this patch I thought that it was in
pretty good shape (not Horiguchi-san's version, but the one in
https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=axxkxcyeb-q77u-erw7_ukajctwnt...@mail.gmail.com).
With some of the recent changes, surely it needs a second look, things
related to heap handling tend to rot quickly.

I'll look into it once again by the end of this week if Heikki does
not show up, the rest will be on him I am afraid...
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-11-08 Thread Robert Haas
On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas  wrote:
> I dropped the ball on this one back in July, so here's an attempt to revive
> this thread.
>
> I spent some time fixing the remaining issues with the prototype patch I
> posted earlier, and rebased that on top of current git master. See attached.
>
> Some review of that would be nice. If there are no major issues with it, I'm
> going to create backpatchable versions of this for 9.4 and below.

Heikki:

Are you going to do commit something here?  This thread and patch are
now 14 months old, which is a long time to make people wait for a bug
fix.  The status in the CF is "Ready for Committer" although I am not
sure if that's accurate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-10-31 Thread Kyotaro HORIGUCHI
Hi,

At Sun, 2 Oct 2016 21:43:46 +0900, Michael Paquier  
wrote in 
> On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI
>  wrote:
> > Hello,
> >
> > At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier 
> >  wrote in 
> > 
> >> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
> >>  wrote:
> >> > Hello, I return to this before my things:)
> >> >
> >> > Though I haven't played with the patch yet..
> >>
> >> Be sure to run the test cases in the patch or base your tests on them then!
> >
> > All items of 006_truncate_opt fail on ed0b228 and they are fixed
> > with the patch.
> >
>> > Though I don't know how it actually impacts the performance, it
> >> > seems to me that we can live with truncated_to and sync_above in
> >> > RelationData and BufferNeedsWAL(rel, buf) instead of
> >> > HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
> >> > seems to exist at once in the hash.
> >>
> >> TBH, I still think that the design of this patch as proposed is pretty
> >> cool and easy to follow.
> >
> > It is clean from certain viewpoint but additional hash,
> > especially hash-searching on every HeapNeedsWAL seems to me to be
> > unacceptable. Do you see it as acceptable?
> >
> >
> > The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> > just as a PoC of my proposal quoted above. This also passes the
> > 006 test.  The major changes are the following.
> >
> > - Moved sync_above and truncated_to into RelationData.
> >
> > - Cleaning up is done in AtEOXact_cleanup instead of an explicit
> >   call to smgrDoPendingSyncs().
> >
> > * BufferNeedsWAL (replace of HeapNeedsWAL) no longer requires
> >   hash_search. It just refers to the additional members in the
> >   given Relation.
> >
> > X I feel that I have dropped one of the features of the original
> >   patch during the hack, but I don't recall it clearly now:(
> >
> > X I haven't considered relfilenode replacement, which didn't matter
> >   for the original patch. (but there are a few places to consider).
> >
> > What do you think about this?
> 
> I have moved this patch to the next CF. (I still need to look at your patch.)

Thanks for considering that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL logging problem in 9.4.3?

2016-10-02 Thread Michael Paquier
On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier 
>  wrote in 
> 
>> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
>>  wrote:
>> > Hello, I'm returning to this before my other tasks :)
>> >
>> > Though I haven't played with the patch yet..
>>
>> Be sure to run the test cases in the patch or base your tests on them then!
>
> All items of 006_truncate_opt fail on ed0b228 and they are fixed
> with the patch.
>
>> > Though I don't know how it actually impacts performance, it
>> > seems to me that we can live with truncated_to and sync_above in
>> > RelationData and BufferNeedsWAL(rel, buf) instead of
>> > HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
>> > seems to exist at once in the hash.
>>
>> TBH, I still think that the design of this patch as proposed is pretty
>> cool and easy to follow.
>
> It is clean from a certain viewpoint, but the additional hash,
> especially the hash search on every HeapNeedsWAL call, seems
> unacceptable to me. Do you find it acceptable?
>
>
> The attached patch is a quick-and-dirty hack of Michael's patch,
> just as a PoC of my proposal quoted above. It also passes the
> 006 test.  The major changes are the following.
>
> - Moved sync_above and truncated_to into RelationData.
>
> - Cleaning up is done in AtEOXact_cleanup instead of explicitly
>   calling smgrDoPendingSyncs().
>
> * BufferNeedsWAL (replacement for HeapNeedsWAL) no longer requires
>   hash_search. It just refers to the additional members in the
>   given Relation.
>
> X I feel that I have dropped one of the features of the original
>   patch during the hack, but I don't recall it clearly now:(
>
> X I haven't considered relfilenode replacement, which didn't matter
>   for the original patch. (but there are a few places to consider).
>
> What do you think about this?

I have moved this patch to the next CF. (I still need to look at your patch.)
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-09-29 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier  
wrote in 
> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
>  wrote:
> > Hello, I'm returning to this before my other tasks :)
> >
> > Though I haven't played with the patch yet..
> 
> Be sure to run the test cases in the patch or base your tests on them then!

All items of 006_truncate_opt fail on ed0b228 and they are fixed
with the patch.

> > Though I don't know how it actually impacts performance, it
> > seems to me that we can live with truncated_to and sync_above in
> > RelationData and BufferNeedsWAL(rel, buf) instead of
> > HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
> > seems to exist at once in the hash.
> 
> TBH, I still think that the design of this patch as proposed is pretty
> cool and easy to follow.

It is clean from a certain viewpoint, but the additional hash,
especially the hash search on every HeapNeedsWAL call, seems
unacceptable to me. Do you find it acceptable?


The attached patch is a quick-and-dirty hack of Michael's patch,
just as a PoC of my proposal quoted above. It also passes the
006 test.  The major changes are the following.

- Moved sync_above and truncated_to into RelationData.

- Cleaning up is done in AtEOXact_cleanup instead of explicitly
  calling smgrDoPendingSyncs().

* BufferNeedsWAL (replacement for HeapNeedsWAL) no longer requires
  hash_search. It just refers to the additional members in the
  given Relation.

X I feel that I have dropped one of the features of the original
  patch during the hack, but I don't recall it clearly now:(

X I haven't considered relfilenode replacement, which didn't matter
  for the original patch. (but there are a few places to consider).

What do you think about this?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..02e33cc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *	  the POSTGRES heap access method used for all POSTGRES
  *	  relations.
  *
+ * WAL CONSIDERATIONS
+ *	  All heap operations are normally WAL-logged, but there are a few
+ *	  exceptions. Temporary and unlogged relations never need to be
+ *	  WAL-logged, but we can also skip WAL-logging for a table that was
+ *	  created in the same transaction, if we don't need WAL for PITR or
+ *	  WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *	  the file to disk at COMMIT instead.
+ *
+ *	  The same-relation optimization is not employed automatically on all
+ *	  updates to a table that was created in the same transaction, because
+ *	  for a small number of changes, it's cheaper to just create the WAL
+ *	  records than fsyncing the whole relation at COMMIT. It is only
+ *	  worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *	  or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *	  operation; it will cause any subsequent updates to the table to skip
+ *	  WAL-logging, if possible, and cause the heap to be synced to disk at
+ *	  COMMIT.
+ *
+ *	  To make that work, all modifications to heap must use
+ *	  HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *	  for the given block.
+ *
  *-
  */
 #include "postgres.h"
@@ -55,6 +77,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -2331,12 +2354,6 @@ FreeBulkInsertState(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2440,7 +2457,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+	if (BufferNeedsWAL(relation, buffer))
 	{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
@@ -2639,12 +2656,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	int			ndone;
 	char	   *scratch = NULL;
 	Page		page;
-	bool		needwal;
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-	needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);

Re: [HACKERS] WAL logging problem in 9.4.3?

2016-09-29 Thread Michael Paquier
On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
 wrote:
> Hello, I'm returning to this before my other tasks :)
>
> Though I haven't played with the patch yet..

Be sure to run the test cases in the patch or base your tests on them then!

> Though I don't know how it actually impacts performance, it
> seems to me that we can live with truncated_to and sync_above in
> RelationData and BufferNeedsWAL(rel, buf) instead of
> HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
> seems to exist at once in the hash.

TBH, I still think that the design of this patch as proposed is pretty
cool and easy to follow.
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-09-26 Thread Kyotaro HORIGUCHI
Hello, I'm returning to this before my other tasks :)

Though I haven't played with the patch yet..

At Fri, 29 Jul 2016 16:54:42 +0900, Michael Paquier  
wrote in 
> > Well, not that soon in the end, but I am back on this... I have not
> > completely reviewed all the code yet, and the case of an index
> > relation referring to a relation optimized with truncate is still
> > broken, but for now here is a rebased patch if people are interested.
> > I am also going to pull a TAP test out of my pocket to ease testing.
> 
> The patch I sent yesterday was based on an incorrect version. Attached
> is a slightly-modified version of the last one I found here
> (https://www.postgresql.org/message-id/56b342f5.1050...@iki.fi), which
> is rebased on HEAD at ed0b228. I have also converted the test case
> script of upthread into a TAP test in src/test/recovery that covers 3
> cases and I included that in the patch:
> 1) CREATE + INSERT + COPY => crash
> 2) CREATE + trigger + COPY => crash
> 3) CREATE + TRUNCATE + COPY => incorrect number of rows.
> The first two tests make the system crash, the third one reports an
> incorrect number of rows.

At first glance, managing sync_above and truncated_to is
workable for these cases, but it seems too complicated for the
problem being solved.

This gives smgr the capability to manage pending page syncs. But
the decision whether to postpone page syncs seems to be a matter
for the users of smgr, who are responsible for issuing WAL. In any
case, heap_register_sync does not use any smgr internals, so I
think this approach binds smgr to Relation too tightly.

With this patch, many calls to RelationNeedsWAL, which just reads
a local struct, are replaced with HeapNeedsWAL, which ultimately
searches a hash added by this patch. In log_heap_update in
particular, it is called for every single-tuple update (on a
relation that needs WAL).

Though I don't know how much it actually impacts performance, it
seems to me that we can live with truncated_to and sync_above in
RelationData and BufferNeedsWAL(rel, buf) instead of
HeapNeedsWAL(rel, buf).  In any case, at most one entry per
relation exists in the hash at a time.
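
To make the proposed shape concrete, here is a toy C model of the idea
(type and field names are assumptions loosely modeled on the patch, not
backend code): with sync_above held directly in the relation struct, the
per-buffer WAL decision becomes a plain field test instead of a
hash_search().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct ToyRelation
{
    bool        needs_wal;      /* stand-in for RelationNeedsWAL() */
    BlockNumber sync_above;     /* blocks >= this are fsync'd at COMMIT */
    BlockNumber truncated_to;   /* pending truncation point, if any */
} ToyRelation;

/* Rough stand-in for the proposed BufferNeedsWAL(rel, buf). */
static bool
buffer_needs_wal(const ToyRelation *rel, BlockNumber blkno)
{
    if (!rel->needs_wal)
        return false;           /* temp/unlogged: never WAL-logged */
    if (rel->sync_above != InvalidBlockNumber && blkno >= rel->sync_above)
        return false;           /* covered by the commit-time sync */
    return true;
}
```

Registering a pending sync is then just a one-time assignment to
sync_above, and every subsequent per-tuple check stays O(1) with no hash
lookup.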

What do you think?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] WAL logging problem in 9.4.3?

2016-07-29 Thread Michael Paquier
On Thu, Jul 28, 2016 at 4:59 PM, Michael Paquier
 wrote:
> On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier
>  wrote:
>> On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
>>  wrote:
>>> On Wed, Mar 23, 2016 at 11:11 AM, David Steele  wrote:
 I would prefer not to bump it to the next CF unless we decide this will
 not get fixed for 9.6.
>>>
>>> It may make sense to add that to the list of open items for 9.6
>>> instead. That's not a feature.
>>
>> So I have moved this patch to the next CF for now, and will work on
>> fixing it rather soonishly as an effort to stabilize 9.6 as well as
>> back-branches.
>
> Well, not that soon in the end, but I am back on this... I have not
> completely reviewed all the code yet, and the case of an index
> relation referring to a relation optimized with truncate is still
> broken, but for now here is a rebased patch if people are interested.
> I am also going to pull a TAP test out of my pocket to ease testing.

The patch I sent yesterday was based on an incorrect version. Attached
is a slightly-modified version of the last one I found here
(https://www.postgresql.org/message-id/56b342f5.1050...@iki.fi), which
is rebased on HEAD at ed0b228. I have also converted the test case
script of upthread into a TAP test in src/test/recovery that covers 3
cases and I included that in the patch:
1) CREATE + INSERT + COPY => crash
2) CREATE + trigger + COPY => crash
3) CREATE + TRUNCATE + COPY => incorrect number of rows.
The first two tests make the system crash, the third one reports an
incorrect number of rows.
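
To illustrate the third case, here is a toy C model of the underlying
hazard (all names are illustrative, not backend code): when WAL is
skipped for a relation under wal_level = minimal, durability depends
entirely on a compensating sync at COMMIT, and if that sync never
happens the committed rows are simply gone after crash recovery.

```c
#include <assert.h>
#include <stdbool.h>

/* Rows become durable either via WAL replay or via a sync at COMMIT. */
typedef struct ToyTable
{
    int buffered_rows;   /* rows written to shared buffers */
    int wal_rows;        /* rows covered by WAL records */
    int synced_rows;     /* rows made durable by an fsync at COMMIT */
} ToyTable;

/* COPY appends rows; under the optimization, WAL logging is skipped. */
static void
toy_copy(ToyTable *t, int rows, bool skip_wal)
{
    t->buffered_rows += rows;
    if (!skip_wal)
        t->wal_rows += rows;
}

/* The commit-time sync the optimization relies on. */
static void
toy_sync(ToyTable *t)
{
    t->synced_rows = t->buffered_rows;
}

/* Crash: shared buffers are lost; recovery sees synced data plus WAL. */
static int
toy_recover(const ToyTable *t)
{
    return t->synced_rows > t->wal_rows ? t->synced_rows : t->wal_rows;
}
```

In the broken path the skip-WAL write is never followed by a sync for
the relation, so recovery finds neither WAL records nor synced pages.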

This is registered in the next CF, by the way:
https://commitfest.postgresql.org/10/528/
Thoughts?
-- 
Michael


fix-wal-level-minimal-michael-2.patch
Description: invalid/octet-stream



Re: [HACKERS] WAL logging problem in 9.4.3?

2016-07-28 Thread Michael Paquier
On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier
 wrote:
> On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
>  wrote:
>> On Wed, Mar 23, 2016 at 11:11 AM, David Steele  wrote:
>>> I would prefer not to bump it to the next CF unless we decide this will
>>> not get fixed for 9.6.
>>
>> It may make sense to add that to the list of open items for 9.6
>> instead. That's not a feature.
>
> So I have moved this patch to the next CF for now, and will work on
> fixing it rather soonishly as an effort to stabilize 9.6 as well as
> back-branches.

Well, not that soon in the end, but I am back on this... I have not
completely reviewed all the code yet, and the case of an index
relation referring to a relation optimized with truncate is still
broken, but for now here is a rebased patch if people are interested.
I am also going to pull a TAP test out of my pocket to ease testing.
-- 
Michael
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..bbc09cd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -55,6 +55,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -2331,12 +2332,6 @@ FreeBulkInsertState(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2440,7 +2435,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
@@ -2639,12 +2634,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	int			ndone;
 	char	   *scratch = NULL;
 	Page		page;
-	bool		needwal;
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-	needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
 	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2659,7 +2652,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	 * palloc() within a critical section is not safe, so we allocate this
 	 * beforehand.
 	 */
-	if (needwal)
+	if (RelationNeedsWAL(relation))
 		scratch = palloc(BLCKSZ);
 
 	/*
@@ -2727,7 +2720,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			 * We don't use heap_multi_insert for catalog tuples yet, but
 			 * better be prepared...
 			 */
-			if (needwal && need_cids)
+			if (HeapNeedsWAL(relation, buffer) && need_cids)
 log_heap_new_cid(relation, heaptup);
 		}
 
@@ -2747,7 +2740,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (needwal)
+		if (HeapNeedsWAL(relation, buffer))
 		{
 			XLogRecPtr	recptr;
 			xl_heap_multi_insert *xlrec;
@@ -3261,7 +3254,7 @@ l1:
 	 * NB: heap_abort_speculative() uses the same xlog record and replay
 	 * routines.
 	 */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		xl_heap_delete xlrec;
 		XLogRecPtr	recptr;
@@ -3982,7 +3975,7 @@ l2:
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(relation))
+		if (HeapNeedsWAL(relation, buffer))
 		{
 			xl_heap_lock xlrec;
 			XLogRecPtr	recptr;
@@ -4194,7 +4187,7 @@ l2:
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		XLogRecPtr	recptr;
 
@@ -5148,7 +5141,7 @@ failed:
 	 * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
 	 * entries for everything anyway.)
 	 */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, *buffer))
 	{
 		xl_heap_lock xlrec;
 		XLogRecPtr	recptr;
@@ -5825,7 +5818,7 @@ l4:
 		MarkBufferDirty(buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (HeapNeedsWAL(rel, buf))
 		{
 			xl_heap_lock_updated xlrec;
 			XLogRecPtr	recptr;
@@ -5980,7 +5973,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 	htup->t_ctid = tuple->t_self;
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		xl_heap_confirm xlrec;
 		XLogRecPtr	recptr;
@@ 

Re: [HACKERS] WAL logging problem in 9.4.3?

2016-04-06 Thread Michael Paquier
On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
 wrote:
> On Wed, Mar 23, 2016 at 11:11 AM, David Steele  wrote:
>> I would prefer not to bump it to the next CF unless we decide this will
>> not get fixed for 9.6.
>
> It may make sense to add that to the list of open items for 9.6
> instead. That's not a feature.

So I have moved this patch to the next CF for now, and will work on
fixing it rather soonishly as an effort to stabilize 9.6 as well as
back-branches.
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-03-22 Thread Michael Paquier
On Wed, Mar 23, 2016 at 11:11 AM, David Steele  wrote:
> I would prefer not to bump it to the next CF unless we decide this will
> not get fixed for 9.6.

It may make sense to add that to the list of open items for 9.6
instead. That's not a feature.
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-03-22 Thread David Steele
On 3/22/16 8:54 PM, Michael Paquier wrote:
> On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier
>  wrote:
>> On Wed, Mar 23, 2016 at 1:38 AM, David Steele  wrote:
>>> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
>>>
 OK, I understand that this is not an urgent issue. I'll move on to
 another patch that needs review.
>>>
>>> Since we're getting towards the end of the CF is it time to pick this up
>>> again?
>>
>> Perhaps not. This is a legit bug with an unfinished patch (see index
>> relation truncation) that is going to need a careful review. I don't
>> think that this should be impacted by the 4/8 feature freeze, so we
>> could still work on that after the embargo and we've had this bug for
>> months actually. FWIW, I am still planning to work on it once the CF
>> is done, in order to keep my manpower focused on actual patch reviews
>> as much as possible...
> 
> In short, we may want to bump this to the next CF... I have already
> marked this ticket as something to work on soonish on my side, so it
> does not change much from here whether it is part of the next CF. What
> we should make sure of is not to lose track of its existence.

I would prefer not to bump it to the next CF unless we decide this will
not get fixed for 9.6.

-- 
-David
da...@pgmasters.net




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-03-22 Thread Michael Paquier
On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier
 wrote:
> On Wed, Mar 23, 2016 at 1:38 AM, David Steele  wrote:
>> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
>>
>>> OK, I understand that this is not an urgent issue. I'll move on to
>>> another patch that needs review.
>>
>> Since we're getting towards the end of the CF is it time to pick this up
>> again?
>
> Perhaps not. This is a legit bug with an unfinished patch (see index
> relation truncation) that is going to need a careful review. I don't
> think that this should be impacted by the 4/8 feature freeze, so we
> could still work on that after the embargo and we've had this bug for
> months actually. FWIW, I am still planning to work on it once the CF
> is done, in order to keep my manpower focused on actual patch reviews
> as much as possible...

In short, we may want to bump this to the next CF... I have already
marked this ticket as something to work on soonish on my side, so it
does not change much from here whether it is part of the next CF. What
we should make sure of is not to lose track of its existence.
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-03-22 Thread Michael Paquier
On Wed, Mar 23, 2016 at 1:38 AM, David Steele  wrote:
> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
>
>> OK, I understand that this is not an urgent issue. I'll move on to
>> another patch that needs review.
>
> Since we're getting towards the end of the CF is it time to pick this up
> again?

Perhaps not. This is a legit bug with an unfinished patch (see index
relation truncation) that is going to need a careful review. I don't
think that this should be impacted by the 4/8 feature freeze, so we
could still work on that after the embargo and we've had this bug for
months actually. FWIW, I am still planning to work on it once the CF
is done, in order to keep my manpower focused on actual patch reviews
as much as possible...
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-03-22 Thread David Steele
On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:

> OK, I understand that this is not an urgent issue. I'll move on to
> another patch that needs review.

Since we're getting towards the end of the CF is it time to pick this up
again?

Thanks,
-- 
-David
da...@pgmasters.net




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-03-15 Thread Kyotaro HORIGUCHI
Thank you for the comment.

I understand that this is not an urgent issue, so don't feel obliged
to reply.

At Tue, 15 Mar 2016 18:21:34 +0100, Michael Paquier  
wrote in 

Re: [HACKERS] WAL logging problem in 9.4.3?

2016-03-15 Thread Michael Paquier
On Fri, Mar 11, 2016 at 9:32 AM, Kyotaro HORIGUCHI
 wrote:
> At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier 
>  wrote in 
> 
>> > Worth noting that this patch does not address the problem with index
>> > relations when a TRUNCATE is used in the same transaction as its
>
> Focusing on this issue, what we should do is somehow build an empty
> index just after an index truncation. The attached patch does the
> following things to fix this.
>
> - make index_build use ambuildempty when the relation on which
>   the index will be built is apparently empty. That is, when the
>   relation has no blocks.
> - add one parameter "persistent" to ambuildempty(). It behaves as
>   before if the parameter is false. If true, it creates an empty
>   index on MAIN_FORKNUM and emits WAL even if wal_level is minimal.

Hm. It seems to me that this patch is just a band-aid for the real
problem, which is that we should not TRUNCATE the underlying index
relations when the TRUNCATE optimization is running. In short, I would
leave the empty routines in the AM code paths alone, and just continue
using them for the generation of INIT_FORKNUM with unlogged relations.
I don't think your patch is backpatchable anyway.

> The new parameter 'persistent' would better be a forknum, because
> it actually represents the persistence of the index to be
> created. But I'm out of time now...

I actually have some users running with wal_level set to minimal;
even if I don't think they use this optimization, we had better fix
index relations at the same time as table relations. I'll try to find
some time once the patch review storm dies down a little, unless
someone beats me to it first.
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-03-11 Thread Kyotaro HORIGUCHI
Hello, I considered on the original issue.

At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier  
wrote in 
> > Worth noting that this patch does not address the problem with index
> > relations when a TRUNCATE is used in the same transaction as its

Focusing on this issue, what we should do is somehow build an empty
index just after an index truncation. The attached patch does the
following things to fix this.

- make index_build use ambuildempty when the relation on which
  the index will be built is apparently empty. That is, when the
  relation has no blocks.

- add one parameter "persistent" to ambuildempty(). It behaves as
  before if the parameter is false. If true, it creates an empty
  index on MAIN_FORKNUM and emits WAL even if wal_level is minimal.

Creation of an index on an empty table can safely be done by
ambuildempty, since it builds the image meant for the init fork,
which can simply be copied as the main fork on initialization. And
the heap is always empty when RelationTruncateIndexes calls
index_build.

For nonempty tables, ambuild properly initializes the new index.

The new parameter 'persistent' would better be a forknum, because
it actually represents the persistence of the index to be
created. But I'm out of time now...
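
As a rough illustration of the proposed dispatch, here is a toy C
sketch (illustrative names only; the real index_build() and
ambuildempty() signatures differ):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef enum { MAIN_FORKNUM, INIT_FORKNUM } ToyForkNumber;

typedef struct ToyIndex
{
    ToyForkNumber built_in;   /* fork the empty image was written to */
    bool          wal_logged; /* whether the build emitted WAL */
} ToyIndex;

/* ambuildempty() per the proposal: target fork depends on 'persistent'. */
static void
toy_buildempty(ToyIndex *idx, bool persistent)
{
    idx->built_in = persistent ? MAIN_FORKNUM : INIT_FORKNUM;
    idx->wal_logged = true;     /* always logged, even at wal_level=minimal */
}

/* ambuild(): a full build; WAL may be skipped at wal_level=minimal. */
static void
toy_build(ToyIndex *idx)
{
    idx->built_in = MAIN_FORKNUM;
    idx->wal_logged = false;
}

/* index_build() per the proposal: use the empty builder for empty heaps. */
static void
toy_index_build(ToyIndex *idx, BlockNumber heap_nblocks)
{
    if (heap_nblocks == 0)
        toy_buildempty(idx, true);  /* persistent: target the main fork */
    else
        toy_build(idx);
}
```

The point is that for an empty heap the metapage-only image is cheap to
build and is always WAL-logged, so recovery can recreate the index even
when wal_level is minimal.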


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c740952..7f0d3f9 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -675,13 +675,14 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 }
 
 void
-brinbuildempty(Relation index)
+brinbuildempty(Relation index, bool persistent)
 {
 	Buffer		metabuf;
+	ForkNumber	forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);
 
 	/* An empty BRIN index has a metapage only. */
 	metabuf =
-		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+		ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
 	/* Initialize and xlog metabuffer. */
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cd21e0e..c041360 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -430,20 +430,23 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 }
 
 /*
- *	ginbuildempty() -- build an empty gin index in the initialization fork
+ *	ginbuildempty() -- build an empty gin index
+ *	  The new index is built in the initialization fork or the main fork
+ *	  according to the parameter 'persistent'.
  */
 void
-ginbuildempty(Relation index)
+ginbuildempty(Relation index, bool persistent)
 {
 	Buffer		RootBuffer,
 MetaBuffer;
+	ForkNumber	forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);
 
 	/* An empty GIN index has two pages. */
 	MetaBuffer =
-		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+		ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
 	RootBuffer =
-		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+		ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
 	/* Initialize and xlog metabuffer and root buffer. */
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 996363c..3d73083 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -110,15 +110,18 @@ createTempGistContext(void)
 }
 
 /*
- *	gistbuildempty() -- build an empty gist index in the initialization fork
+ *	gistbuildempty() -- build an empty gist index
+ *	  The new index is built in the initialization fork or the main fork
+ *	  according to the parameter 'persistent'.
  */
 void
-gistbuildempty(Relation index)
+gistbuildempty(Relation index, bool persistent)
 {
 	Buffer		buffer;
+	ForkNumber	forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);
 
 	/* Initialize the root page */
-	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+	buffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 	/* Initialize and xlog buffer */
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3d48c4f..3b9cd66 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -156,12 +156,14 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 }
 
 /*
- *	hashbuildempty() -- build an empty hash index in the initialization fork
+ *	hashbuildempty() -- build an empty hash index
+ *	  The new index is built in the initialization fork or the main fork
+ *	  according to the parameter 'persistent'.
  */
 void
-hashbuildempty(Relation index)
+hashbuildempty(Relation index, bool persistent)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_metapinit(index, 0, persistent ? MAIN_FORKNUM : 

Re: [HACKERS] WAL logging problem in 9.4.3?

2016-02-19 Thread Michael Paquier
On Fri, Feb 19, 2016 at 4:33 PM, Michael Paquier
 wrote:
> On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier
>  wrote:
>> On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas  wrote:
>>> I dropped the ball on this one back in July, so here's an attempt to revive
>>> this thread.
>>>
>>> I spent some time fixing the remaining issues with the prototype patch I
>>> posted earlier, and rebased that on top of current git master. See attached.
>>>
>>> Some review of that would be nice. If there are no major issues with it, I'm
>>> going to create backpatchable versions of this for 9.4 and below.
>>
>> I am going to look into that very soon. For now and to not forget
>> about this bug, I have added an entry in the CF app:
>> https://commitfest.postgresql.org/9/528/
>
> Worth noting that this patch does not address the problem with index
> relations when a TRUNCATE is used in the same transaction as its
> CREATE TABLE, take that for example when wal_level = minimal:
> 1) Run transaction
> =# begin;
> BEGIN
> =# create table ab (a int primary key);
> CREATE TABLE
> =# truncate ab;
> TRUNCATE TABLE
> =# commit;
> COMMIT
> 2) Restart server with immediate mode.
> 3) Failure:
> =# table ab;
> ERROR:  XX001: could not read block 0 in file "base/16384/16388": read
> only 0 of 8192 bytes
> LOCATION:  mdread, md.c:728
>
> The case where a COPY is issued after TRUNCATE is fixed though, so
> that's still an improvement.
>
> Here are other comments.
>
> +   /* Flush updates to relations that we didn't WAL-logged */
> +   smgrDoPendingSyncs(true);
> "Flush updates to relations there were not WAL-logged"?
>
> +void
> +FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
> +{
> +   FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
> +}
> islocal is always set as false, I'd rather remove this argument
> from FlushRelationBuffersWithoutRelCache.
>
> for (i = 0; i < nrels; i++)
> +   {
> smgrclose(srels[i]);
> +   }
> Looks like noise.
>
> +   if (!found)
> +   {
> +   pending->truncated_to = InvalidBlockNumber;
> +   pending->sync_above = nblocks;
> +
> +   elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at
> block %u",
> +rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
> +
> +   }
> +   else if (pending->sync_above == InvalidBlockNumber)
> +   {
> +   elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
> +rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
> +   pending->sync_above = nblocks;
> +   }
> +   else
> Here couldn't it be possible that when (sync_above !=
> InvalidBlockNumber), nblocks can be higher than sync_above? In which
> case we had better increase sync_above to nblocks, no?
>
> +   if (!pendingSyncs)
> +   createPendingSyncsHash();
> +   pending = (PendingRelSync *) hash_search(pendingSyncs,
> +(void *) &relation->rd_node,
> +HASH_ENTER, &found);
> This is lacking comments.
>
> -   if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
> +   BufferGetTag(buffer, &rnode, &forknum, &blknum);
> +   if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
> +   !smgrIsSyncPending(rnode, blknum))
> Here as well, explaining in more detail why the buffer does not need
> to go through XLogSaveBufferForHint would be nice.

An additional one:
-   XLogRegisterBuffer(0, newbuf, bufflags);
-   if (oldbuf != newbuf)
-   XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
In log_heap_update, the new buffer is now conditionally logged
depending on whether the heap needs WAL or not.

Now during replay the following thing is done:
-   oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- );
+   if (oldblk == newblk)
+   oldaction = XLogReadBufferForRedo(record, 0, );
+   else if (XLogRecHasBlockRef(record, 1))
+   oldaction = XLogReadBufferForRedo(record, 1, );
+   else
+   oldaction = BLK_DONE;
Shouldn't we check for XLogRecHasBlockRef(record, 0) when the tuple is
updated on the same page?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL logging problem in 9.4.3?

2016-02-18 Thread Michael Paquier
On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier
 wrote:
> On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas  wrote:
>> I dropped the ball on this one back in July, so here's an attempt to revive
>> this thread.
>>
>> I spent some time fixing the remaining issues with the prototype patch I
>> posted earlier, and rebased that on top of current git master. See attached.
>>
>> Some review of that would be nice. If there are no major issues with it, I'm
>> going to create backpatchable versions of this for 9.4 and below.
>
> I am going to look into that very soon. For now and to not forget
> about this bug, I have added an entry in the CF app:
> https://commitfest.postgresql.org/9/528/

Worth noting that this patch does not address the problem with index
relations when a TRUNCATE is used in the same transaction as its
CREATE TABLE; take this example with wal_level = minimal:
1) Run transaction
=# begin;
BEGIN
=# create table ab (a int primary key);
CREATE TABLE
=# truncate ab;
TRUNCATE TABLE
=# commit;
COMMIT
2) Restart server with immediate mode.
3) Failure:
=# table ab;
ERROR:  XX001: could not read block 0 in file "base/16384/16388": read
only 0 of 8192 bytes
LOCATION:  mdread, md.c:728

The case where a COPY is issued after TRUNCATE is fixed though, so
that's still an improvement.

Here are other comments.

+   /* Flush updates to relations that we didn't WAL-logged */
+   smgrDoPendingSyncs(true);
"Flush updates to relations there were not WAL-logged"?

+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+   FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
islocal is always set as false, I'd rather remove this argument
from FlushRelationBuffersWithoutRelCache.

for (i = 0; i < nrels; i++)
+   {
smgrclose(srels[i]);
+   }
Looks like noise.

+   if (!found)
+   {
+   pending->truncated_to = InvalidBlockNumber;
+   pending->sync_above = nblocks;
+
+   elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at
block %u",
+rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+
+   }
+   else if (pending->sync_above == InvalidBlockNumber)
+   {
+   elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+   pending->sync_above = nblocks;
+   }
+   else
Here couldn't it be possible that when (sync_above !=
InvalidBlockNumber), nblocks can be higher than sync_above? In which
case we had better increase sync_above to nblocks, no?
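The registration logic under discussion can be modeled as a small standalone sketch. Names mirror the quoted patch, but the structure is simplified (a plain struct stands in for the dynahash entry, and `valid` stands in for the `found` flag that `hash_search()` returns); this is an illustration of the watermark bookkeeping, not the actual implementation. It takes the conservative answer to the question above: the watermark is only ever moved down, since blocks created while WAL was already being skipped were never logged and must stay above it.

```c
#include <stdbool.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Simplified stand-in for the patch's pending-sync hash entry. */
typedef struct PendingRelSync
{
    BlockNumber sync_above;     /* blocks >= this skip WAL, fsync'd at COMMIT */
    BlockNumber truncated_to;
    bool        valid;          /* stands in for hash_search()'s 'found' */
} PendingRelSync;

/*
 * Register that blocks >= nblocks of a relation will not be WAL-logged.
 * Only ever move the watermark down: if the relation was already
 * registered with a smaller size, the blocks between the two sizes were
 * created while WAL was skipped, so they must remain above the watermark.
 */
static void
record_pending_sync(PendingRelSync *pending, BlockNumber nblocks)
{
    if (!pending->valid)
    {
        pending->valid = true;
        pending->truncated_to = InvalidBlockNumber;
        pending->sync_above = nblocks;
    }
    else if (pending->sync_above == InvalidBlockNumber ||
             nblocks < pending->sync_above)
        pending->sync_above = nblocks;
}
```

Registering twice with a larger second size leaves the watermark at the smaller value, which syncs more data at COMMIT but can never lose a skipped block.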

+   if (!pendingSyncs)
+   createPendingSyncsHash();
+   pending = (PendingRelSync *) hash_search(pendingSyncs,
+(void *) &relation->rd_node,
+HASH_ENTER, &found);
This is lacking comments.

-   if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
+   BufferGetTag(buffer, &rnode, &forknum, &blknum);
+   if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
+   !smgrIsSyncPending(rnode, blknum))
Here as well, explaining in more detail why the buffer does not need
to go through XLogSaveBufferForHint would be nice.
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-02-17 Thread Michael Paquier
On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas  wrote:
> I dropped the ball on this one back in July, so here's an attempt to revive
> this thread.
>
> I spent some time fixing the remaining issues with the prototype patch I
> posted earlier, and rebased that on top of current git master. See attached.
>
> Some review of that would be nice. If there are no major issues with it, I'm
> going to create backpatchable versions of this for 9.4 and below.

I am going to look into that very soon. For now and to not forget
about this bug, I have added an entry in the CF app:
https://commitfest.postgresql.org/9/528/
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-02-04 Thread Heikki Linnakangas

On 22/10/15 03:56, Michael Paquier wrote:

On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
 wrote:

Heikki Linnakangas wrote:


Thanks. For comparison, I wrote a patch to implement what I had in mind.

When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap that
would normally be WAL-logged, we check if the relation is in the hash table,
and skip WAL-logging if so.


I think this wasn't applied, was it?


No, it was not applied.


I dropped the ball on this one back in July, so here's an attempt to 
revive this thread.


I spent some time fixing the remaining issues with the prototype patch I 
posted earlier, and rebased that on top of current git master. See attached.


Some review of that would be nice. If there are no major issues with it, 
I'm going to create backpatchable versions of this for 9.4 and below.


- Heikki

>From 063e1aa258800873783190a9678d551b43c0e39e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 4 Feb 2016 15:21:09 +0300
Subject: [PATCH 1/1] Fix the optimization to skip WAL-logging on table created
 in same xact.

There were several bugs in the optimization to skip WAL-logging for a table
that was created (or truncated) in the same transaction, with
wal_level=minimal, leading to data loss if a crash happened after the
optimization was used:

* If the table was created, and then truncated, and then loaded with COPY,
  we would replay the truncate record at commit, and the table would end
  up being empty after replay.

* If there is a trigger on a table that modifies the same table, and you
  COPY to the table in the transaction that created it, you might have some
  WAL-logged operations on a page, performed by the trigger, intermixed with
  the non-WAL-logged inserts done by the COPY. That can lead to crash at
  recovery, because we might try to replay a WAL record that e.g. updates
  a tuple, but insertion of the tuple was not WAL-logged.
---
 src/backend/access/heap/heapam.c| 254 +++-
 src/backend/access/heap/pruneheap.c |   2 +-
 src/backend/access/heap/rewriteheap.c   |   4 +-
 src/backend/access/heap/visibilitymap.c |   2 +-
 src/backend/access/transam/xact.c   |   7 +
 src/backend/catalog/storage.c   | 250 ---
 src/backend/commands/copy.c |  14 +-
 src/backend/commands/createas.c |   9 +-
 src/backend/commands/matview.c  |   6 +-
 src/backend/commands/tablecmds.c|   5 +-
 src/backend/commands/vacuumlazy.c   |   6 +-
 src/backend/storage/buffer/bufmgr.c |  47 --
 src/include/access/heapam.h |   8 +-
 src/include/access/heapam_xlog.h|   2 +
 src/include/catalog/storage.h   |   3 +
 src/include/storage/bufmgr.h|   2 +
 16 files changed, 487 insertions(+), 134 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f443742..79298e2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *	  the POSTGRES heap access method used for all POSTGRES
  *	  relations.
  *
+ * WAL CONSIDERATIONS
+ *	  All heap operations are normally WAL-logged, but there are a few
+ *	  exceptions. Temporary and unlogged relations never need to be
+ *	  WAL-logged, but we can also skip WAL-logging for a table that was
+ *	  created in the same transaction, if we don't need WAL for PITR or
+ *	  WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *	  the file to disk at COMMIT instead.
+ *
+ *	  The same-relation optimization is not employed automatically on all
+ *	  updates to a table that was created in the same transaction, because
+ *	  for a small number of changes, it's cheaper to just create the WAL
+ *	  records than fsyncing() the whole relation at COMMIT. It is only
+ *	  worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *	  or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *	  operation; it will cause any subsequent updates to the table to skip
+ *	  WAL-logging, if possible, and cause the heap to be synced to disk at
+ *	  COMMIT.
+ *
+ *	  To make that work, all modifications to heap must use
+ *	  HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *	  for the given block.
+ *
  *-
  */
 #include "postgres.h"
@@ -55,6 +77,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -2332,12 +2355,6 @@ FreeBulkInsertState(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is 

Re: [HACKERS] WAL logging problem in 9.4.3?

2015-10-21 Thread Alvaro Herrera
Heikki Linnakangas wrote:

> Thanks. For comparison, I wrote a patch to implement what I had in mind.
> 
> When a WAL-skipping COPY begins, we add an entry for that relation in a
> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
> would normally be WAL-logged, we check if the relation is in the hash table,
> and skip WAL-logging if so.

I think this wasn't applied, was it?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-10-21 Thread Michael Paquier
On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
 wrote:
> Heikki Linnakangas wrote:
>
>> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>>
>> When a WAL-skipping COPY begins, we add an entry for that relation in a
>> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
>> would normally be WAL-logged, we check if the relation is in the hash table,
>> and skip WAL-logging if so.
>
> I think this wasn't applied, was it?

No, it was not applied.
-- 
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-24 Thread Heikki Linnakangas

On 07/23/2015 09:38 PM, Robert Haas wrote:

On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas hlinn...@iki.fi wrote:


This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.


Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?


Yes. But it's still very cheap, especially in the common case that the 
pending syncs hash table is empty.



Should we be worried about that?


It doesn't worry me.

- Heikki





Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-23 Thread Robert Haas
On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 This is more invasive than I'd like to backpatch, but I think it's the
 simplest approach that works, and doesn't disable any of the important
 optimizations we have.

Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?
Should we be worried about that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-23 Thread Simon Riggs
On 22 July 2015 at 17:21, Heikki Linnakangas hlinn...@iki.fi wrote:


 When a WAL-skipping COPY begins, we add an entry for that relation in a
 pending-fsyncs hash table. Whenever we perform any action on a heap that
 would normally be WAL-logged, we check if the relation is in the hash
 table, and skip WAL-logging if so.

 That was a simplified explanation. In reality, when WAL-skipping COPY
 begins, we also memorize the current size of the relation. Any actions on
 blocks greater than the old size are not WAL-logged, and any actions on
 smaller-numbered blocks are. This ensures that if you did any INSERTs on
 the table before the COPY, any new actions on the blocks that were already
 WAL-logged by the INSERT are also WAL-logged. And likewise if you perform
 any INSERTs after (or during, by trigger) the COPY, and they modify the new
 pages, those actions are not WAL-logged. So starting a WAL-skipping COPY
 splits the relation into two parts: the first part that is WAL-logged as
 usual, and the later part that is not WAL-logged. (there is one loose end
 marked with XXX in the patch on this, when one of the pages involved in a
 cold UPDATE is before the watermark and the other is after)

 The actual fsync() has been moved to the end of transaction, as we are now
 skipping WAL-logging of any actions after the COPY as well.

 And truncations complicate things further. If we emit a truncation WAL
 record in the transaction, we also make an entry in the hash table to
 record that. All operations on a relation that has been truncated must be
 WAL-logged as usual, because replaying the truncate record will destroy all
 data even if we fsync later. But we still optimize for BEGIN; CREATE;
 COPY; TRUNCATE; COPY; style patterns, because if we truncate a relation
 that has already been marked for fsync-at-COMMIT, we don't need to WAL-log
 the truncation either.


 This is more invasive than I'd like to backpatch, but I think it's the
 simplest approach that works, and doesn't disable any of the important
 optimizations we have.


I didn't like it when I first read this, but I do now. As a by-product of
fixing the bug, it actually extends the optimization.

You can optimize this approach so we always write WAL unless one of the two
subid fields is set, so there is no need to call smgrIsSyncPending() every
time. I couldn't see where this depended upon wal_level, but I guess it's
there somewhere.

I'm unhappy about the call during MarkBufferDirtyHint() which is just too
costly. The only way to do this cheaply is to specifically mark buffers as
being BM_WAL_SKIPPED, so they do not need to be hinted. That flag would be
removed when we flush the buffers for the relation.
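The cheaper test Simon proposes can be sketched as a single bit test on the buffer header flags. The flag name follows his suggestion, but neither the flag nor this layout exists in PostgreSQL; this is a hypothetical illustration of why it is cheap compared with a hash-table probe per hint-bit update.

```c
#include <stdbool.h>

/* Illustrative buffer-header flag: set on buffers filled while WAL
 * logging was skipped, cleared when the relation's buffers are flushed.
 * Hypothetical; not an actual PostgreSQL BM_* flag. */
#define BM_WAL_SKIPPED  (1U << 0)

/*
 * The hot-path check MarkBufferDirtyHint() would perform under this
 * scheme: one bit test on flags already in cache, instead of a
 * smgrIsSyncPending() hash lookup.
 */
static bool
hint_needs_fpi(unsigned int buf_flags, bool xlog_hint_needed)
{
    if (!xlog_hint_needed)
        return false;           /* checksums/wal_log_hints don't require it */

    /* Buffers loaded while WAL was skipped are fsync'd at COMMIT,
     * so no full-page image is needed for their hint bits. */
    return (buf_flags & BM_WAL_SKIPPED) == 0;
}
```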



  And what reason is there to think that this would fix all the problems?


 I don't think either suggested fix could be claimed to be a great
 solution,
 since there is little principle here, only heuristic. Heikki's solution
 would be the only safe way, but is not backpatchable.


 I can't get too excited about a half-fix that leaves you with data
 corruption in some scenarios.


On further consideration, it seems obvious that Andres' suggestion would
not work for UPDATE or DELETE, so I now agree.

It does seem a big thing to backpatch; alternative suggestions?

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-22 Thread Heikki Linnakangas

On 07/22/2015 11:18 AM, Simon Riggs wrote:

On 10 July 2015 at 00:06, Tom Lane t...@sss.pgh.pa.us wrote:


Andres Freund and...@anarazel.de writes:

On 2015-07-06 11:49:54 -0400, Tom Lane wrote:

Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this.  That should be safe if the table was
previously empty.



I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.


What evidence have you got to base that value judgement on?

cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction.  On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast.  The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.



We have to backpatch this fix, so it must be both simple and effective.

Heikki's suggestions may be best, maybe not, but they don't seem
backpatchable.

Tom's suggestion looks good. So does Andres' suggestion. I have coded both.


Thanks. For comparison, I wrote a patch to implement what I had in mind.

When a WAL-skipping COPY begins, we add an entry for that relation in a 
pending-fsyncs hash table. Whenever we perform any action on a heap 
that would normally be WAL-logged, we check if the relation is in the 
hash table, and skip WAL-logging if so.


That was a simplified explanation. In reality, when WAL-skipping COPY 
begins, we also memorize the current size of the relation. Any actions 
on blocks greater than the old size are not WAL-logged, and any actions 
on smaller-numbered blocks are. This ensures that if you did any INSERTs 
on the table before the COPY, any new actions on the blocks that were 
already WAL-logged by the INSERT are also WAL-logged. And likewise if 
you perform any INSERTs after (or during, by trigger) the COPY, and they 
modify the new pages, those actions are not WAL-logged. So starting a 
WAL-skipping COPY splits the relation into two parts: the first part 
that is WAL-logged as usual, and the later part that is not WAL-logged. 
(there is one loose end marked with XXX in the patch on this, when one 
of the pages involved in a cold UPDATE is before the watermark and the 
other is after)


The actual fsync() has been moved to the end of transaction, as we are 
now skipping WAL-logging of any actions after the COPY as well.


And truncations complicate things further. If we emit a truncation WAL 
record in the transaction, we also make an entry in the hash table to 
record that. All operations on a relation that has been truncated must 
be WAL-logged as usual, because replaying the truncate record will 
destroy all data even if we fsync later. But we still optimize for 
BEGIN; CREATE; COPY; TRUNCATE; COPY; style patterns, because if we 
truncate a relation that has already been marked for fsync-at-COMMIT, we 
don't need to WAL-log the truncation either.
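The watermark scheme described above condenses into a small decision function. This is a simplified standalone model under assumed names (the real patch keys a hash table by RelFileNode and tracks truncation more finely), meant only to make the "two parts of the relation" rule concrete.

```c
#include <stdbool.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Per-relation entry in the pending-fsyncs table (illustrative). */
typedef struct
{
    BlockNumber sync_above;      /* relation size when WAL skipping began,
                                  * or InvalidBlockNumber if not registered */
    bool        logged_truncate; /* a truncate WAL record was emitted since */
} RelSyncState;

/* Must a modification of block 'blkno' be WAL-logged? */
static bool
heap_needs_wal(const RelSyncState *s, BlockNumber blkno)
{
    if (s->sync_above == InvalidBlockNumber)
        return true;            /* not registered: normal WAL logging */
    if (s->logged_truncate)
        return true;            /* replaying the truncate destroys all data,
                                 * so every later change must be in WAL */
    return blkno < s->sync_above;   /* old part logged, new part fsync'd */
}
```

So a COPY that starts when the relation has 5 blocks logs changes to blocks 0-4 as usual, skips WAL for blocks 5 and up, and falls back to full logging everywhere once a truncate record has been emitted.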



This is more invasive than I'd like to backpatch, but I think it's the 
simplest approach that works, and doesn't disable any of the important 
optimizations we have.



And what reason is there to think that this would fix all the problems?


I don't think either suggested fix could be claimed to be a great solution,
since there is little principle here, only heuristic. Heikki's solution
would be the only safe way, but is not backpatchable.


I can't get too excited about a half-fix that leaves you with data 
corruption in some scenarios.


I wrote a little test script to test all these different scenarios 
(attached). Both of your patches fail with the script.


- Heikki

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2e3b9d2..9ef688b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2021,12 +2021,6 @@ FreeBulkInsertState(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2115,7 +2109,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 

Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-22 Thread Andres Freund
On 2015-07-21 21:37:41 +0200, Martijn van Oosterhout wrote:
 On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
  Hi,
  
  This thread seemed to trail off without a resolution.  Was anything done?
 
 Not that I can tell.

Heikki and I had some in-person conversation about it at a conference,
but we didn't really find anything we both liked...

 I was the original poster of this thread. We've
 worked around the issue by placing a CHECKPOINT command at the end of
 the migration script.  For us it's not a performance issue, more a
 correctness one, tables were empty when they shouldn't have been.

If it's just correctness, you could just use wal_level = archive.

 I'm hoping a fix will appear in the 9.5 release, since we're intending
 to release with that version.  A forced checkpoint every now and them
 probably won't be a serious problem though.

We're imo going to have to fix this in the back branches.

Andres




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-22 Thread Simon Riggs
On 10 July 2015 at 00:06, Tom Lane t...@sss.pgh.pa.us wrote:

 Andres Freund and...@anarazel.de writes:
  On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
  Rather than reverting cab9a0656c36739f, which would re-introduce a
  different performance problem, perhaps we could have COPY create a new
  relfilenode when it does this.  That should be safe if the table was
  previously empty.

  I'm not convinced that cab9a0656c36739f needs to survive in that
  form. To me only allowing one COPY to benefit from the wal_level =
  minimal optimization has a significantly higher cost than
  cab9a0656c36739f.

 What evidence have you got to base that value judgement on?

 cab9a0656c36739f was based on an actual user complaint, so we have good
 evidence that there are people out there who care about the cost of
 truncating a table many times in one transaction.  On the other hand,
 I know of no evidence that anyone's depending on multiple sequential
 COPYs, nor intermixed COPY and INSERT, to be fast.  The original argument
 for having this COPY optimization at all was to make restoring pg_dump
 scripts in a single transaction fast; and that use-case doesn't care
 about anything but a single COPY into a virgin table.


We have to backpatch this fix, so it must be both simple and effective.

Heikki's suggestions may be best, maybe not, but they don't seem
backpatchable.

Tom's suggestion looks good. So does Andres' suggestion. I have coded both.

And what reason is there to think that this would fix all the problems?


I don't think either suggested fix could be claimed to be a great solution,
since there is little principle here, only heuristic. Heikki's solution
would be the only safe way, but is not backpatchable.

Forcing SKIP_FSM to always extend has no negative side effects in other
code paths, AFAICS.

Patches attached. Martijn, please verify.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


fix_wal_logging_copy_truncate.v1.patch
Description: Binary data


fix_copy_zero_blocks.v1.patch
Description: Binary data



Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-21 Thread Todd A. Cook

Hi,

This thread seemed to trail off without a resolution.  Was anything done?
(See more below.)

On 07/09/15 19:06, Tom Lane wrote:

Andres Freund and...@anarazel.de writes:

On 2015-07-06 11:49:54 -0400, Tom Lane wrote:

Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this.  That should be safe if the table was
previously empty.



I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.


What evidence have you got to base that value judgement on?

cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction.


I'm the complainer mentioned in the cab9a0656c36739f commit message. :)

FWIW, we use a temp table to split a join across 4 largish tables
(10^8 rows or more each) and 2 small tables (10^6 rows each).  We
write the results of joining the 2 largest tables into the temp
table, and then join that to the other 4.  This gave significant
performance benefits because the planner would know the exact row
count of the 2-way join heading into the 4-way join.  After commit
cab9a0656c36739f, we got another noticeable performance improvement
(I did timings before and after, but I can't seem to put my hands
on the numbers right now).

We do millions of these queries every day in batches.  Each batch
reuses a single temp table (truncating it before each pair of joins)
so as to reduce the churn in the system catalogs.  In case it matters,
the temp table is created with ON COMMIT DROP.

This was (and still is) done on 9.2.x.

HTH.

-- todd cook
-- tc...@blackducksoftware.com


 On the other hand,

I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast.  The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.

I think you're worrying about exactly the wrong case.


My tentative guess is that the best course is to
a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.


And what reason is there to think that this would fix all the problems?
We know of those two, but we've not exactly looked hard for other cases.
Again, the only known field usage for the COPY optimization is the pg_dump
scenario; were that not so, we'd have noticed the problem long since.
So I don't have any faith that this is a well-tested area.

regards, tom lane








Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-21 Thread Martijn van Oosterhout
On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
 Hi,
 
 This thread seemed to trail off without a resolution.  Was anything done?

Not that I can tell. I was the original poster of this thread. We've
worked around the issue by placing a CHECKPOINT command at the end of
the migration script.  For us it's not a performance issue, more a
correctness one, tables were empty when they shouldn't have been.

I'm hoping a fix will appear in the 9.5 release, since we're intending
to release with that version.  A forced checkpoint every now and them
probably won't be a serious problem though.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-10 Thread Heikki Linnakangas

On 07/10/2015 02:06 AM, Tom Lane wrote:

Andres Freund and...@anarazel.de writes:

On 2015-07-06 11:49:54 -0400, Tom Lane wrote:

Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this.  That should be safe if the table was
previously empty.



I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.


What evidence have you got to base that value judgement on?

cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction.


Yeah, if we specifically made that case cheap, in response to a 
complaint, it would be a regression to make it expensive again. We might 
get away with it in a major version, but would hate to backpatch that.



My tentative guess is that the best course is to
a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.


And what reason is there to think that this would fix all the problems?
We know of those two, but we've not exactly looked hard for other cases.


Hmm. Perhaps that could be made to work, but it feels pretty fragile. 
For example, you could have an insert trigger on the table that inserts 
additional rows to the same table, and those inserts would be intermixed 
with the rows inserted by COPY. You'll have to avoid that somehow. 
Full-page images in general are a problem. If a checkpoint happens, and 
a trigger modifies the page we're COPYing to in any way, you have the 
same issue. Even reading a page can cause a full-page image of it to be 
written: If you update a hint bit on the page while reading it, and 
checksums are enabled, and a checkpoint happened since the page was last 
updated, bang. I don't think that's a problem in this case because there 
are no hint bits to be set on pages that we're COPYing to, but it's a 
whole new subtle assumption.


I think we should
1. reliably and explicitly keep track of whether we've WAL-logged any 
TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations 
on the relation, and

2. make sure we never skip WAL-logging again if we have.

Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set 
when a new relfilenode is created, i.e. whenever rd_createSubid or 
rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image 
(including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c, 
only skip WAL-logging if the flag is still set. To deal with the case 
that the flag gets cleared in the middle of COPY, also check the flag 
whenever we're about to skip WAL-logging in heap_insert, and if it's 
been cleared, ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.


Compared to the status quo, that disables the WAL-skipping optimization 
in the scenario where you CREATE, INSERT, then COPY to a table in the 
same transaction. I think that's acceptable.


(Alternatively, to handle the case that the flag gets cleared in the 
middle of COPY, add another flag to RelationData indicating that a 
WAL-skipping COPY is in-progress, and refrain from WAL-logging any 
FPW-writing operations on the table when it's set (or any operations 
whatsoever). That'd be more efficient, but it's such a rare corner case 
that it hardly matters.)
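For illustration only, the proposed bookkeeping might look like this toy
model (Python purely for brevity; aside from rd_skip_wal_safe, which the
proposal names, every identifier here is hypothetical):

```python
class RelationData:
    """Toy model of the proposed per-relation WAL-skip bookkeeping."""
    def __init__(self):
        # Set when a new relfilenode is created in this transaction,
        # i.e. whenever rd_createSubid or rd_newRelfilenodeSubid is set.
        self.rd_skip_wal_safe = True

def wal_log_fpw_or_truncate(rel):
    # Any WAL-logged TRUNCATE, INSERT/UPDATE+INIT, or other full-page
    # image clears the flag.
    rel.rd_skip_wal_safe = False

def copy_can_skip_wal(rel):
    # copy.c consults the flag; heap_insert would re-check it, so a flag
    # cleared mid-COPY forces WAL logging for the remaining rows.
    return rel.rd_skip_wal_safe

rel = RelationData()
assert copy_can_skip_wal(rel)        # CREATE then COPY: skipping is safe
wal_log_fpw_or_truncate(rel)         # e.g. an INSERT emitted INSERT+INIT
assert not copy_can_skip_wal(rel)    # a later COPY must WAL-log
```

This also shows why CREATE, INSERT, COPY in one transaction loses the
optimization under this scheme: the INSERT clears the flag first.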


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-10 Thread Andres Freund
On 2015-07-09 19:06:11 -0400, Tom Lane wrote:
 What evidence have you got to base that value judgement on?

 cab9a0656c36739f was based on an actual user complaint, so we have good
 evidence that there are people out there who care about the cost of
 truncating a table many times in one transaction.  On the other hand,
 I know of no evidence that anyone's depending on multiple sequential
 COPYs, nor intermixed COPY and INSERT, to be fast.  The original argument
 for having this COPY optimization at all was to make restoring pg_dump
 scripts in a single transaction fast; and that use-case doesn't care
 about anything but a single COPY into a virgin table.

Well, you'll hardly have heard complaints about COPY, given that we've
behaved the way we currently do for a long while.

I definitely know of ETL like processes that have relied on subsequent
COPYs into truncated relations being cheaper. Can't remember the same
for intermixed COPY and INSERT, but it'd not surprise me if somebody
mixed COPY and UPDATEs rather freely for ETL.

 I think you're worrying about exactly the wrong case.

  My tentative guess is that the best course is to
  a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
 truncation replay issue.
  b) Force new pages to be used when using the heap_sync mode in
 COPY. That avoids the INIT danger you found. It seems rather
 reasonable to avoid using pages that have already been the target of
 WAL logging here in general.

 And what reason is there to think that this would fix all the
 problems?

Yea, that's the big problem.

 Again, the only known field usage for the COPY optimization is the pg_dump
 scenario; were that not so, we'd have noticed the problem long since.
 So I don't have any faith that this is a well-tested area.

You need to crash at the right moment. I don't think that's that
frequently exercised...




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-10 Thread Andres Freund
On 2015-07-10 19:23:28 +0900, Fujii Masao wrote:
 Maybe I'm missing something. But I start wondering why TRUNCATE
 and INSERT (or even all the operations on the table created at
 the current transaction) need to be WAL-logged while COPY can be
 optimized. If no WAL records are generated on that table, the problem
 we're talking about seems not to occur. Also this seems safe and
 doesn't degrade the performance of data loading. Thought?

Skipping WAL logging means that you need to scan through the whole
shared buffers to write out dirty buffers and fsync the segments. A
single insert wal record is a couple orders of magnitude cheaper than
that.  Essentially doing this just for COPY is a heuristic.




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-10 Thread Andres Freund
On 2015-07-10 13:38:50 +0300, Heikki Linnakangas wrote:
 In the long-term, I'd like to refactor this whole thing so that we never
 WAL-log any operations on a relation that's created in the same transaction
 (when wal_level=minimal). Instead, at COMMIT, we'd fsync() the relation, or
 if it's smaller than some threshold, WAL-log the contents of the whole file
 at that point. That would move all that
 more-difficult-than-it-seems-at-first-glance logic from COPY and indexam's
 to a central location, and it would allow the same optimization for all
 operations, not just COPY. But that probably isn't feasible to backpatch.

I don't think that's really realistic until we have a buffer manager
that lets you efficiently scan for all pages of a relation :(




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-10 Thread Fujii Masao
On Fri, Jul 10, 2015 at 2:27 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Fujii Masao masao.fu...@gmail.com writes:
 On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 One idea I had was to allow the COPY optimization only if the heap file is
 physically zero-length at the time the COPY starts.

 This seems not helpful for the case where TRUNCATE is executed
 before COPY. No?

 Huh?  The heap file would be zero length in that case.

 So, if COPY is executed multiple times at the same transaction,
 only first COPY can be optimized?

 This is true, and I don't think we should care, especially not if we're
 going to take risks of incorrect behavior in order to optimize that
 third-order case.  The fact that we're dealing with this bug at all should
 remind us that this stuff is harder than it looks.  I want a simple,
 reliable, back-patchable fix, and I do not believe that what you are
 suggesting would be any of those.

Maybe I'm missing something. But I start wondering why TRUNCATE
and INSERT (or even all the operations on the table created at
the current transaction) need to be WAL-logged while COPY can be
optimized. If no WAL records are generated on that table, the problem
we're talking about seems not to occur. Also this seems safe and
doesn't degrade the performance of data loading. Thought?

Regards,

-- 
Fujii Masao




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-09 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 One idea I had was to allow the COPY optimization only if the heap file is
 physically zero-length at the time the COPY starts.

 This seems not helpful for the case where TRUNCATE is executed
 before COPY. No?

Huh?  The heap file would be zero length in that case.

 So, if COPY is executed multiple times at the same transaction,
 only first COPY can be optimized?

This is true, and I don't think we should care, especially not if we're
going to take risks of incorrect behavior in order to optimize that
third-order case.  The fact that we're dealing with this bug at all should
remind us that this stuff is harder than it looks.  I want a simple,
reliable, back-patchable fix, and I do not believe that what you are
suggesting would be any of those.

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-09 Thread Andres Freund
On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
 One idea I had was to allow the COPY optimization only if the heap file is
 physically zero-length at the time the COPY starts.  That would still be
 able to optimize in all the cases we care about making COPY fast for.
 Rather than reverting cab9a0656c36739f, which would re-introduce a
 different performance problem, perhaps we could have COPY create a new
 relfilenode when it does this.  That should be safe if the table was
 previously empty.

I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.

My tentative guess is that the best course is to

a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
   truncation replay issue.

b) Force new pages to be used when using the heap_sync mode in
   COPY. That avoids the INIT danger you found. It seems rather
   reasonable to avoid using pages that have already been the target of
   WAL logging here in general.

Andres




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-09 Thread Fujii Masao
On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andres Freund and...@anarazel.de writes:
 On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
 The COUNT() correctly says 11 rows, but after crash-and-recover,
 only the row with -1 is there.  This is because the INSERT writes
 out an INSERT+INIT WAL record, which we happily replay, clobbering
 the data added later by COPY.

 ISTM any WAL logged action that touches a relfilenode essentially needs
 to disable further optimization based on the knowledge that the relation
 is new.

 After a bit more thought, I think it's not so much any WAL logged action
 as any unconditionally-replayed action.  INSERT+INIT breaks this
 example because heap_xlog_insert will unconditionally replay the action,
 even if the page is valid and has same or newer LSN.  Similarly, TRUNCATE
 is problematic because we redo it unconditionally (and in that case it's
 hard to see an alternative).

 It'd not be impossible to add more state to the relcache entry for the
 relation. Whether it's likely that we'd find all the places that'd need
 updating that state, I'm not sure.

 Yeah, the sticking point is mainly being sure that the state is correctly
 tracked, both now and after future changes.  We'd need to identify a state
 invariant that we could be pretty confident we'd not break.

 One idea I had was to allow the COPY optimization only if the heap file is
 physically zero-length at the time the COPY starts.

This seems not helpful for the case where TRUNCATE is executed
before COPY. No?

 That would still be
 able to optimize in all the cases we care about making COPY fast for.
 Rather than reverting cab9a0656c36739f, which would re-introduce a
 different performance problem, perhaps we could have COPY create a new
 relfilenode when it does this.  That should be safe if the table was
 previously empty.

So, if COPY is executed multiple times at the same transaction,
only first COPY can be optimized?


After second thought, I'm thinking that we can safely optimize
COPY if no problematic WAL records like INSERT+INIT or TRUNCATE
are generated since current REDO location or the table was created
at the same transaction. That is, if INSERT or TRUNCATE is executed
after the table creation, but if CHECKPOINT happens subsequently,
we don't need to log COPY. The subsequent crash recovery will not
replay such problematic WAL records. So the example cases where
we can optimize COPY are:

BEGIN
CREATE TABLE
COPY
COPY-- subsequent COPY also can be optimized

BEGIN
CREATE TABLE
TRUNCATE
CHECKPOINT
COPY

BEGIN
CREATE TABLE
INSERT
CHECKPOINT
COPY

A crash recovery can start from previous REDO location (i.e., REDO
location of the last checkpoint record). So we might need to check
whether such problematic WAL records are generated since the previous
REDO location instead of current one.
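The decision rule sketched above reduces to a single LSN comparison.
A toy model (Python for illustration; the function and variable names
are hypothetical, and real LSNs are pg_lsn values, not ints):

```python
def can_skip_copy_wal(last_problematic_lsn, prev_redo_lsn):
    """COPY may skip WAL iff no problematic record (INSERT+INIT,
    TRUNCATE, ...) was emitted for this table at or after the previous
    checkpoint's REDO point: crash recovery never replays anything
    older than that."""
    return last_problematic_lsn is None or last_problematic_lsn < prev_redo_lsn

# CREATE; INSERT logged at LSN 100; checkpoint advances REDO to 200; COPY:
assert can_skip_copy_wal(100, 200)
# INSERT logged at LSN 300, after REDO 200; a following COPY must be logged:
assert not can_skip_copy_wal(300, 200)
# Table created in this transaction, nothing problematic logged yet:
assert can_skip_copy_wal(None, 200)
```

As the paragraph above notes, the safe comparison point is the previous
checkpoint's REDO location, since recovery can start replaying there.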

Regards,

-- 
Fujii Masao




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-09 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
 Rather than reverting cab9a0656c36739f, which would re-introduce a
 different performance problem, perhaps we could have COPY create a new
 relfilenode when it does this.  That should be safe if the table was
 previously empty.

 I'm not convinced that cab9a0656c36739f needs to survive in that
 form. To me only allowing one COPY to benefit from the wal_level =
 minimal optimization has a significantly higher cost than
 cab9a0656c36739f.

What evidence have you got to base that value judgement on?

cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction.  On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast.  The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.

I think you're worrying about exactly the wrong case.

 My tentative guess is that the best course is to
 a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
 b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.

And what reason is there to think that this would fix all the problems?
We know of those two, but we've not exactly looked hard for other cases.
Again, the only known field usage for the COPY optimization is the pg_dump
scenario; were that not so, we'd have noticed the problem long since.
So I don't have any faith that this is a well-tested area.

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-06 Thread Fujii Masao
On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund and...@anarazel.de wrote:
 On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
 Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
 right now missing how the whole "skip wal logging if relation has just
 been truncated" optimization can ever actually be crashsafe unless we
 use a new relfilenode (which we don't!).

Agreed... When I ran the following test scenario, I found that
the loaded data disappeared after the crash recovery.

1. start PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
\copy (SELECT num FROM generate_series(1,10) num) to /tmp/num.csv with csv
BEGIN;
CREATE TABLE test (i int primary key);
TRUNCATE TABLE test;
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test;  -- returns 10

3. shutdown the server with immediate mode
4. restart the server
5. execute the following SQL statement after crash recovery ends
SELECT COUNT(*) FROM test;  -- returns 0..

In #2, 10 rows were copied and the transaction was committed.
The subsequent statement of select count(*) obviously returned 10.
However, after crash recovery, in #5, the same statement returned 0.
That is, the loaded (+ committed) 10 data was lost after the crash.

 We actually used to use a different relfilenode, but optimized that
 away: cab9a0656c36739f59277b34fea8ab9438395869

 commit cab9a0656c36739f59277b34fea8ab9438395869
 Author: Tom Lane t...@sss.pgh.pa.us
 Date:   Sun Aug 23 19:23:41 2009 +

 Make TRUNCATE do truncate-in-place when processing a relation that was
 created or previously truncated in the current (sub)transaction.  This
 is safe since if the (sub)transaction later rolls back, we'd just
 discard the rel's current physical file anyway.  This avoids
 unreasonable growth in the number of transient files when a relation
 is repeatedly truncated.  Per a performance gripe a couple weeks ago
 from Todd Cook.

 to me the reasoning here looks flawed.

Before this commit, when I ran the above test scenario, no data loss happened.

Regards,

-- 
Fujii Masao




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-06 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund and...@anarazel.de wrote:
 We actually used to use a different relfilenode, but optimized that
 away: cab9a0656c36739f59277b34fea8ab9438395869
 
 commit cab9a0656c36739f59277b34fea8ab9438395869
 Author: Tom Lane t...@sss.pgh.pa.us
 Date:   Sun Aug 23 19:23:41 2009 +
 
 Make TRUNCATE do truncate-in-place when processing a relation that was
 created or previously truncated in the current (sub)transaction.  This is safe since
 if the (sub)transaction later rolls back, we'd just discard the rel's current
 physical file anyway.  This avoids unreasonable growth in the number of
 transient files when a relation is repeatedly truncated.  Per a performance
 gripe a couple weeks ago from Todd Cook.
 
 to me the reasoning here looks flawed.

 Before this commit, when I ran the above test scenario, no data loss happened.

Actually I think what is broken here is COPY's test to decide whether it
can omit writing WAL:

 * Check to see if we can avoid writing WAL
 *
 * If archive logging/streaming is not enabled *and* either
 *- table was created in same transaction as this COPY
 *- data is being written to relfilenode created in this transaction
 * then we can skip writing WAL.  It's safe because if the transaction
 * doesn't commit, we'll discard the table (or the new relfilenode file).
 * If it does commit, we'll have done the heap_sync at the bottom of this
 * routine first.

The problem with that analysis is that it supposes that, if we crash and
recover, the WAL replay sequence will not touch the data.  What's killing
us in this example is the replay of the TRUNCATE, but that is not the only
possibility.  For example consider this modification of Fujii-san's test
case:

BEGIN;
CREATE TABLE test (i int primary key);
INSERT INTO test VALUES(-1);
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test;

The COUNT() correctly says 11 rows, but after crash-and-recover,
only the row with -1 is there.  This is because the INSERT writes
out an INSERT+INIT WAL record, which we happily replay, clobbering
the data added later by COPY.

We might have to give up on this COPY optimization :-(.  I'm not
sure what would be a safe rule for deciding that we can skip WAL
logging in this situation, but I am pretty sure that it would
require keeping information we don't currently keep about what's
happened earlier in the transaction.
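The failure mode above can be reduced to a toy replay model (Python,
purely illustrative; the real logic lives in heap_xlog_insert, which
reinitializes the page unconditionally for an INSERT+INIT record):

```python
# The INSERT of -1 was WAL-logged with an "init page" record; the 10
# COPY rows were then added to the same page without any WAL.
wal = [("insert_init", -1)]
page_before_crash = [-1] + list(range(1, 11))   # 11 rows visible

def replay(wal_records):
    """Rebuild the page from WAL alone, as crash recovery does."""
    page = []
    for rec, val in wal_records:
        if rec == "insert_init":
            # Replayed unconditionally: the page is re-initialized
            # regardless of its current contents or LSN.
            page = [val]
    return page

assert len(page_before_crash) == 11
assert replay(wal) == [-1]   # after recovery, only the -1 row survives
```

This matches the observed behavior: COUNT(*) returns 11 before the
crash and 1 afterwards, because replay clobbers the un-logged COPY data.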

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-06 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
 The COUNT() correctly says 11 rows, but after crash-and-recover,
 only the row with -1 is there.  This is because the INSERT writes
 out an INSERT+INIT WAL record, which we happily replay, clobbering
 the data added later by COPY.

 ISTM any WAL logged action that touches a relfilenode essentially needs
 to disable further optimization based on the knowledge that the relation
 is new.

After a bit more thought, I think it's not so much any WAL logged action
as any unconditionally-replayed action.  INSERT+INIT breaks this
example because heap_xlog_insert will unconditionally replay the action,
even if the page is valid and has same or newer LSN.  Similarly, TRUNCATE
is problematic because we redo it unconditionally (and in that case it's
hard to see an alternative).

 It'd not be impossible to add more state to the relcache entry for the
 relation. Whether it's likely that we'd find all the places that'd need
 updating that state, I'm not sure.

Yeah, the sticking point is mainly being sure that the state is correctly
tracked, both now and after future changes.  We'd need to identify a state
invariant that we could be pretty confident we'd not break.

One idea I had was to allow the COPY optimization only if the heap file is
physically zero-length at the time the COPY starts.  That would still be
able to optimize in all the cases we care about making COPY fast for.
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this.  That should be safe if the table was
previously empty.

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-06 Thread Andres Freund
On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
 BEGIN;
 CREATE TABLE test (i int primary key);
 INSERT INTO test VALUES(-1);
 \copy test from /tmp/num.csv with csv
 COMMIT;
 SELECT COUNT(*) FROM test;
 
 The COUNT() correctly says 11 rows, but after crash-and-recover,
 only the row with -1 is there.  This is because the INSERT writes
 out an INSERT+INIT WAL record, which we happily replay, clobbering
 the data added later by COPY.

ISTM any WAL logged action that touches a relfilenode essentially needs
to disable further optimization based on the knowledge that the relation
is new.

 We might have to give up on this COPY optimization :-(.

A crazy, not well thought through, bandaid for the INSERT+INIT case would
be to force COPY to use a new page when using the SKIP_WAL codepath.

 I'm not sure what would be a safe rule for deciding that we can skip
 WAL logging in this situation, but I am pretty sure that it would
 require keeping information we don't currently keep about what's
 happened earlier in the transaction.

It'd not be impossible to add more state to the relcache entry for the
relation. Whether it's likely that we'd find all the places that'd need
updating that state, I'm not sure.




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Fujii Masao
On Fri, Jul 3, 2015 at 3:01 PM, Martijn van Oosterhout
klep...@svana.org wrote:
 On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
  Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
  can probably construct a Dockerfile that reproduces it pretty reliably.

 I could reproduce the problem in the master branch by doing
 the following steps.

 Thank you, I wasn't sure if you could kill the server fast enough
 without containers, but it looks like immediate mode is enough.

 1. start the PostgreSQL server with wal_level = minimal
 2. execute the following SQL statements
  begin;
  create table test(id serial primary key);
  truncate table test;
  commit;
 3. shutdown the server with immediate mode
 4. restart the server (crash recovery occurs)
 5. execute the following SQL statement
 select * from test;

 The optimization of TRUNCATE operation that we can use when
 CREATE TABLE and TRUNCATE are executed in the same transaction block
 seems to cause the problem. In this case, only index file truncation is
 logged, and index creation in btbuild() is not logged because wal_level
 is minimal. Then at the subsequent crash recovery, index file is truncated
 to 0 byte... Very simple fix is to log an index creation in that case,
 but not sure if that's ok to do..

In 9.2 or before, this problem doesn't occur because no such error is thrown
even if an index file size is zero. But in 9.3 or later, since the planner
tries to read a meta page of an index to get the height of the btree tree,
an empty index file causes such error. The planner was changed that way by
commit 31f38f28, and the problem seems to be an oversight of that commit.

I'm not familiar with that change of the planner, but ISTM that we can
simply change _bt_getrootheight() so that 0 is returned if an index file is
empty, i.e., meta page cannot be read, in order to work around the problem.
Thought?

Regards,

-- 
Fujii Masao




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 The optimization of TRUNCATE operation that we can use when
 CREATE TABLE and TRUNCATE are executed in the same transaction block
 seems to cause the problem. In this case, only index file truncation is
 logged, and index creation in btbuild() is not logged because wal_level
 is minimal. Then at the subsequent crash recovery, index file is truncated
 to 0 byte... Very simple fix is to log an index creation in that case,
 but not sure if that's ok to do..

 In 9.2 or before, this problem doesn't occur because no such error is thrown
 even if an index file size is zero. But in 9.3 or later, since the planner
 tries to read a meta page of an index to get the height of the btree,
 an empty index file causes such error. The planner was changed that way by
 commit 31f38f28, and the problem seems to be an oversight of that commit.

What?  You want to blame the planner for failing because the index was
left corrupt by broken WAL replay?  A failure would occur anyway at
execution.

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Martijn van Oosterhout
On Fri, Jul 03, 2015 at 12:53:56PM -0400, Tom Lane wrote:
 Fujii Masao masao.fu...@gmail.com writes:
  Okay, so probably we need to change WAL replay of TRUNCATE so that
  the index file is truncated to one containing only meta page instead of
  empty one. That is, the WAL replay of TRUNCATE would need to call
  index_build() after smgrtruncate() maybe.
 
 That seems completely unworkable.  For one thing, index_build would expect
 to be able to do catalog lookups, but we can't assume that the catalogs
 are in a good state yet.
 
 I think the responsibility has to be on the WAL-writing end to emit WAL
 instructions that lead to a correct on-disk state.  Putting complex
 behavior into the reading side is fundamentally misguided.

Am I missing something? ISTM that if the truncate record was simply not
logged at all everything would work fine. The whole point is that the
table was created in this transaction and so if it exists the table on
disk must be the correct representation.

The broken index is just one symptom. The heap also shouldn't be
truncated at all. If you insert a row before commit then after replay
the tuple should be there still.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 Okay, so probably we need to change WAL replay of TRUNCATE so that
 the index file is truncated to one containing only meta page instead of
 empty one. That is, the WAL replay of TRUNCATE would need to call
 index_build() after smgrtruncate() maybe.

That seems completely unworkable.  For one thing, index_build would expect
to be able to do catalog lookups, but we can't assume that the catalogs
are in a good state yet.

I think the responsibility has to be on the WAL-writing end to emit WAL
instructions that lead to a correct on-disk state.  Putting complex
behavior into the reading side is fundamentally misguided.

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Andres Freund
On 2015-07-03 18:49:31 +0200, Andres Freund wrote:
 But the more interesting question is why that's not happening
 today. RelationTruncateIndexes() does call the index_build() which
 should end up WAL logging the index creation.

So that's because there's an XLogIsNeeded() preventing it.

Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
right now missing how the whole "skip wal logging if relation has just
been truncated" optimization can ever actually be crashsafe unless we
use a new relfilenode (which we don't!).

Sure, we do a heap_sync() at the end of the transaction. That's
nice and all. But it doesn't help if we crash and re-start WAL apply
from a checkpoint before the table was created. Because that'll replay
the truncation.

That's much worse than just the indexes - the rows added by a COPY
without WAL logging will also be truncated away, no?




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Andres Freund
On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
 Am I missing something. ISTM that if the truncate record was simply not
 logged at all everything would work fine. The whole point is that the
 table was created in this transaction and so if it exists the table on
 disk must be the correct representation.

That'd not work either. Consider:

BEGIN;
CREATE TABLE ...
INSERT;
TRUNCATE;
INSERT;
COMMIT;

If you replay that without a truncation wal record the second INSERT
will try to add stuff to already occupied space. And they can have
different lengths and stuff, so you cannot just ignore that fact.
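The collision can be sketched with a toy replay (invented structures, not the real WAL format): heap insert records carry a fixed target tid, so replaying them without the intervening truncate lands two tuples on the same slot.

```python
# Toy WAL stream for: CREATE TABLE; INSERT; TRUNCATE (omitted!); INSERT.
wal = [
    ("insert", (0, 1), "a-long-tuple"),  # first INSERT, before TRUNCATE
    # the TRUNCATE record is deliberately not logged in this experiment
    ("insert", (0, 1), "tuple2"),        # after TRUNCATE, same tid (0, 1)
]

pages = {}       # replayed heap pages, keyed by tid (block, offset)
conflicts = []
for _, tid, tup in wal:
    if tid in pages:
        conflicts.append(tid)  # slot already occupied: page corruption
    pages[tid] = tup

print(conflicts)  # [(0, 1)]
```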

 The broken index is just one symptom.

Agreed. I think the problem is something else though. Namely that we
reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
broken afaics. We need to allocate a new relfilenode and write stuff
into that. Then we can forgo WAL logging the truncation record.

 If you insert a row before commit then after replay the tuple should be there 
 still.

The insert would be WAL-logged. COPY skips WAL logging, though.




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Andres Freund
On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
 Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
 right now missing how the whole "skip WAL logging if relation has just
 been truncated" optimization can ever actually be crash-safe unless we
 use a new relfilenode (which we don't!).

We actually used to use a different relfilenode, but optimized that
away: cab9a0656c36739f59277b34fea8ab9438395869

commit cab9a0656c36739f59277b34fea8ab9438395869
Author: Tom Lane t...@sss.pgh.pa.us
Date:   Sun Aug 23 19:23:41 2009 +

Make TRUNCATE do truncate-in-place when processing a relation that was created
or previously truncated in the current (sub)transaction.  This is safe since
if the (sub)transaction later rolls back, we'd just discard the rel's current
physical file anyway.  This avoids unreasonable growth in the number of
transient files when a relation is repeatedly truncated.  Per a performance
gripe a couple weeks ago from Todd Cook.

to me the reasoning here looks flawed.




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Fujii Masao
On Fri, Jul 3, 2015 at 11:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Fujii Masao masao.fu...@gmail.com writes:
 The optimization of the TRUNCATE operation that we can use when
 CREATE TABLE and TRUNCATE are executed in the same transaction block
 seems to cause the problem. In this case, only the index file truncation is
 logged, and index creation in btbuild() is not logged because wal_level
 is minimal. Then at the subsequent crash recovery, the index file is truncated
 to 0 bytes... A very simple fix is to log the index creation in that case,
 but I'm not sure if that's OK to do..

 In 9.2 or before, this problem doesn't occur because no such error is thrown
 even if an index file size is zero. But in 9.3 or later, since the planner
 tries to read the meta page of an index to get the height of the btree,
 an empty index file causes such error. The planner was changed that way by
 commit 31f38f28, and the problem seems to be an oversight of that commit.

 What?  You want to blame the planner for failing because the index was
 left corrupt by broken WAL replay?  A failure would occur anyway at
 execution.

Yep, right. I was not thinking that such an index with file size 0 is corrupted,
because the reported problem didn't happen before that commit was added.
But that's my fault. Such an index can cause an error even in other code paths.

Okay, so probably we need to change WAL replay of TRUNCATE so that
the index file is truncated to one containing only the meta page instead of
an empty one. That is, the WAL replay of TRUNCATE would need to call
index_build() after smgrtruncate(), maybe.

Then how should we implement that? Invent a new WAL record type that
calls smgrtruncate() and index_build() during WAL replay? Or add a
special flag to the XLOG_SMGR_TRUNCATE record, and make WAL replay
call index_build() only if the flag is set? Any other good idea?
Anyway ISTM that we might need to add or modify a WAL record.
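The second option could be sketched like this (the flag and field names are hypothetical, invented here for illustration; the real XLOG_SMGR_TRUNCATE record has no such flag):

```python
# Toy replay of a truncate record carrying a hypothetical flag that tells
# recovery to restore a minimal index metapage instead of leaving the
# index file at zero length.
def replay_smgr_truncate(record, index_file):
    index_file.clear()                    # smgrtruncate() analogue
    if record.get("rebuild_index_meta"):  # proposed special flag (invented)
        index_file.append("metapage")     # log_newpage()-style restore

idx = ["metapage", "rootpage"]
replay_smgr_truncate({"rebuild_index_meta": True}, idx)
print(idx)  # ['metapage'] -- truncated, but the metapage is still readable
```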

Regards,

-- 
Fujii Masao




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Andres Freund
On 2015-07-04 01:39:42 +0900, Fujii Masao wrote:
 Okay, so probably we need to change WAL replay of TRUNCATE so that
 the index file is truncated to one containing only the meta page instead of
 an empty one. That is, the WAL replay of TRUNCATE would need to call
 index_build() after smgrtruncate(), maybe.
 
 Then how should we implement that? Invent a new WAL record type that
 calls smgrtruncate() and index_build() during WAL replay? Or add a
 special flag to the XLOG_SMGR_TRUNCATE record, and make WAL replay
 call index_build() only if the flag is set? Any other good idea?
 Anyway ISTM that we might need to add or modify a WAL record.

It's easy enough to log something like a metapage with
log_newpage().

But the more interesting question is why that's not happening
today. RelationTruncateIndexes() does call index_build(), which
should end up WAL-logging the index creation.





Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Andres Freund
On 2015-07-03 18:38:37 -0400, Tom Lane wrote:
  Why exactly?  The first truncation in the (sub)xact would have assigned a
 new relfilenode, why do we need another one?  The file in question will
 go away on crash/rollback in any case, and no other transaction can see
 it yet.

Consider:

BEGIN;
CREATE TABLE;
INSERT largeval;
TRUNCATE;
INSERT 1;
COPY;
INSERT 2;
COMMIT;

INSERT 1 is going to be WAL-logged. For that to work correctly, TRUNCATE
has to be WAL-logged as well, as otherwise there'll be conflicting/overlapping
tuples on the target page.

But:

The truncation itself is not fully WAL-logged, and neither is the COPY. Both
rely on heap_sync()/immedsync(). For that to be correct, the current
relfilenode's truncation may *not* be WAL-logged, because the contents
of the COPY or the truncation itself will only be on disk, not in the
WAL.

Only being on-disk but not in the WAL is a problem if we crash and
replay the truncate record.

 I'm prepared to believe that some bit of logic is doing the wrong
 thing in this state, but I do not agree that truncate-in-place is
 unworkable.

Unless we're prepared to make everything that potentially WAL-logs
something do the rel->rd_createSubid == mySubid dance, I can't see
that working.




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Tom Lane
Martijn van Oosterhout klep...@svana.org writes:
 With inserts the WAL records look as follows (relfilenodes changed):
 ...
 And amazingly, the database cluster successfully recovers and there's no
 error now.  So the problem is *only* because there is no data in the
 table at commit time.  Which indicates that it's the 'newroot' record
 that saves the day normally.  And it's apparently generated by the
 first insert.

Yeah, because the correct empty state of a btree index is to have a
metapage but no root page, so the first insert forces creation of a root
page.  And, by chance, btree_xlog_newroot restores the metapage from
scratch, so this works even if the metapage had been missing or corrupt.

However, things would still break if the first access to the index was
a read attempt rather than an insert.
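A toy model of that asymmetry (invented structures, not the real nbtree code): the newroot replay rewrites the metapage from scratch, so an insert-first workload repairs the damage, while a read-first access hits exactly the reported error.

```python
# Toy index file: a list of pages, where page 0 is the metapage.
def replay_newroot(index):
    meta = {"root": 1}
    if index:
        index[0] = meta          # btree_xlog_newroot rewrites the metapage
    else:
        index.append(meta)       # ... even restoring it if it was missing
    index.append({"level": 0})   # the new root page

def read_metapage(index):
    if not index:
        raise IOError("could not read block 0: read only 0 of 8192 bytes")
    return index[0]

broken = []                      # zero-length index after faulty replay
replay_newroot(broken)           # first INSERT after recovery repairs it
assert read_metapage(broken) == {"root": 1}

try:                             # read-first instead: the reported failure
    read_metapage([])
except IOError as e:
    print(e)
```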

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Andres Freund
On 2015-07-03 19:26:05 +0200, Andres Freund wrote:
 On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
  Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
  right now missing how the whole "skip WAL logging if relation has just
  been truncated" optimization can ever actually be crash-safe unless we
  use a new relfilenode (which we don't!).
 
 We actually used to use a different relfilenode, but optimized that
 away: cab9a0656c36739f59277b34fea8ab9438395869
 
 commit cab9a0656c36739f59277b34fea8ab9438395869
 Author: Tom Lane t...@sss.pgh.pa.us
 Date:   Sun Aug 23 19:23:41 2009 +
 
 Make TRUNCATE do truncate-in-place when processing a relation that was created
 or previously truncated in the current (sub)transaction.  This is safe since
 if the (sub)transaction later rolls back, we'd just discard the rel's current
 physical file anyway.  This avoids unreasonable growth in the number of
 transient files when a relation is repeatedly truncated.  Per a performance
 gripe a couple weeks ago from Todd Cook.
 
 to me the reasoning here looks flawed.

It looks to me like we need to renege on this a bit. I think we can still be
more efficient than the general code path: we can drop the old
relfilenode immediately. But pg_class.relfilenode has to differ from the
old one after the truncation.
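That proposal can be sketched as a toy model (under the stated assumption that TRUNCATE assigns a fresh relfilenode; all names are invented, this is not PostgreSQL code):

```python
# If TRUNCATE switches to a new relfilenode, a replayed WAL stream never
# truncates the file that holds the un-WAL-logged COPY rows.
files = {}                       # on-disk files, keyed by relfilenode
wal = []
counter = iter(range(100, 200))  # toy relfilenode allocator

def create_rel():
    node = next(counter)
    files[node] = []
    wal.append(("create", node))
    return node

def truncate_rel(node):
    new = next(counter)          # fresh relfilenode instead of in-place
    files[new] = []
    del files[node]              # old file can be dropped immediately
    wal.append(("create", new))  # no truncate record needed at all
    return new

def copy_no_wal(node, rows):
    files[node].extend(rows)     # heap_sync()'d at commit, never logged

node = create_rel()
node = truncate_rel(node)
copy_no_wal(node, ["r1", "r2"])

# Crash and replay: file creation is idempotent, and nothing in the WAL
# stream touches the relfilenode that holds the synced rows.
disk = {n: list(v) for n, v in files.items()}
for kind, n in wal:
    disk.setdefault(n, [])
print(disk[node])  # ['r1', 'r2'] -- the synced rows survive replay
```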




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On 2015-07-03 19:26:05 +0200, Andres Freund wrote:
 commit cab9a0656c36739f59277b34fea8ab9438395869
 Author: Tom Lane t...@sss.pgh.pa.us
 Date:   Sun Aug 23 19:23:41 2009 +
 
 Make TRUNCATE do truncate-in-place when processing a relation that was created
 or previously truncated in the current (sub)transaction.  This is safe since
 if the (sub)transaction later rolls back, we'd just discard the rel's current
 physical file anyway.  This avoids unreasonable growth in the number of
 transient files when a relation is repeatedly truncated.  Per a performance
 gripe a couple weeks ago from Todd Cook.
 
 to me the reasoning here looks flawed.

 It looks to me like we need to renege on this a bit. I think we can still be
 more efficient than the general code path: we can drop the old
 relfilenode immediately. But pg_class.relfilenode has to differ from the
 old one after the truncation.

Why exactly?  The first truncation in the (sub)xact would have assigned a
new relfilenode, why do we need another one?  The file in question will
go away on crash/rollback in any case, and no other transaction can see
it yet.

I'm prepared to believe that some bit of logic is doing the wrong thing in
this state, but I do not agree that truncate-in-place is unworkable.

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Martijn van Oosterhout
On Fri, Jul 03, 2015 at 07:21:21PM +0200, Andres Freund wrote:
 On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
  Am I missing something. ISTM that if the truncate record was simply not
  logged at all everything would work fine. The whole point is that the
  table was created in this transaction and so if it exists the table on
  disk must be the correct representation.
 
 That'd not work either. Consider:
 
 BEGIN;
 CREATE TABLE ...
 INSERT;
 TRUNCATE;
 INSERT;
 COMMIT;
 
 If you replay that without a truncation wal record the second INSERT
 will try to add stuff to already occupied space. And they can have
 different lengths and stuff, so you cannot just ignore that fact.

I was about to disagree with you by suggesting that if the table was
created in this transaction then WAL logging is skipped. But testing
shows that inserts are indeed logged, as you point out.

With inserts the WAL records look as follows (relfilenodes changed):

martijn@martijn-jessie:~/git/ctm/docker$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /tmp/pgtest/postgres/pg_xlog/ 000000010000000000000001 | grep -wE '16386|16384|16390'
rmgr: Storage  len (rec/tot): 16/48, tx: 0, lsn: 0/016A79C8, prev 0/016A79A0, bkp: , desc: file create: base/12139/16384
rmgr: Sequence len (rec/tot): 158/190, tx: 683, lsn: 0/016B4258, prev 0/016B2508, bkp: , desc: log: rel 1663/12139/16384
rmgr: Storage  len (rec/tot): 16/48, tx: 683, lsn: 0/016B4318, prev 0/016B4258, bkp: , desc: file create: base/12139/16386
rmgr: Storage  len (rec/tot): 16/48, tx: 683, lsn: 0/016B9468, prev 0/016B9418, bkp: , desc: file create: base/12139/16390
rmgr: Sequence len (rec/tot): 158/190, tx: 683, lsn: 0/016BC938, prev 0/016BC880, bkp: , desc: log: rel 1663/12139/16384
rmgr: Sequence len (rec/tot): 158/190, tx: 683, lsn: 0/016BCAF0, prev 0/016BCAA0, bkp: , desc: log: rel 1663/12139/16384
rmgr: Heap     len (rec/tot): 35/67, tx: 683, lsn: 0/016BCBB0, prev 0/016BCAF0, bkp: , desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree    len (rec/tot): 20/52, tx: 683, lsn: 0/016BCBF8, prev 0/016BCBB0, bkp: , desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree    len (rec/tot): 34/66, tx: 683, lsn: 0/016BCC30, prev 0/016BCBF8, bkp: , desc: insert: rel 1663/12139/16390; tid 1/1
rmgr: Storage  len (rec/tot): 16/48, tx: 683, lsn: 0/016BCC78, prev 0/016BCC30, bkp: , desc: file truncate: base/12139/16386 to 0 blocks
rmgr: Storage  len (rec/tot): 16/48, tx: 683, lsn: 0/016BCCA8, prev 0/016BCC78, bkp: , desc: file truncate: base/12139/16390 to 0 blocks
rmgr: Heap     len (rec/tot): 35/67, tx: 683, lsn: 0/016BCCD8, prev 0/016BCCA8, bkp: , desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree    len (rec/tot): 20/52, tx: 683, lsn: 0/016BCD20, prev 0/016BCCD8, bkp: , desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree    len (rec/tot): 34/66, tx: 683, lsn: 0/016BCD58, prev 0/016BCD20, bkp: , desc: insert: rel 1663/12139/16390; tid 1/1
 
   relname   | relfilenode
-------------+-------------
 test        |       16386
 test_id_seq |       16384
 test_pkey   |       16390
(3 rows)

And amazingly, the database cluster successfully recovers and there's no
error now.  So the problem is *only* because there is no data in the
table at commit time.  Which indicates that it's the 'newroot' record
that saves the day normally.  And it's apparently generated by the
first insert.

 Agreed. I think the problem is something else though. Namely that we
 reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
 broken afaics. We need to allocate a new relfilenode and write stuff
 into that. Then we can forgo WAL logging the truncation record.

Would that properly initialise the index though?

Anyway, this is way outside my expertise, so I'll bow out now. Let me
know if I can be of more assistance.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Martijn van Oosterhout
On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
  Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
  can probably construct a Dockerfile that reproduces it pretty reliably.
 
 I could reproduce the problem in the master branch by doing
 the following steps.

Thank you, I wasn't sure if you could kill the server fast enough
without containers, but it looks like immediate mode is enough.

 1. start the PostgreSQL server with wal_level = minimal
 2. execute the following SQL statements
  begin;
  create table test(id serial primary key);
  truncate table test;
  commit;
 3. shutdown the server with immediate mode
 4. restart the server (crash recovery occurs)
 5. execute the following SQL statement
 select * from test;
 
 The optimization of the TRUNCATE operation that we can use when
 CREATE TABLE and TRUNCATE are executed in the same transaction block
 seems to cause the problem. In this case, only the index file truncation is
 logged, and index creation in btbuild() is not logged because wal_level
 is minimal. Then at the subsequent crash recovery, the index file is truncated
 to 0 bytes... A very simple fix is to log the index creation in that case,
 but I'm not sure if that's OK to do..

Looks plausible to me.

For reference I attach a small tarball for reproduction with docker.

1. Unpack tarball into empty dir (it has three small files)
2. docker build -t test .
3. docker run -v /tmp/pgtest:/data test
4. docker run -v /tmp/pgtest:/data test

Data dir is in /tmp/pgtest

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


postgresql-test.tgz
Description: GNU Unix tar archive




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-02 Thread Fujii Masao
On Fri, Jul 3, 2015 at 2:20 PM, Martijn van Oosterhout
klep...@svana.org wrote:
 On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
 Hi,

 On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
  === Start with an empty database

 My guess is you have wal_level = minimal?

 Default config, was just initdb'd. So yes, the default wal_level =
 minimal.

  === Note the index file is 8KB.
  === At this point nuke the database server (in this case it was simply
  === destroying the container it was running in.

 How did you continue from there? The container has persistent storage?
 Or are you reapplying the WAL to somewhere else?

 The container has persistent storage on the host. What I think is
 actually unusual is that the script that started postgres was missing
 an 'exec', so postgres never gets the signal to shut down.

  martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 | grep -wE '16389|16387|16393'
  rmgr: XLOG     len (rec/tot): 72/104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
  rmgr: Storage  len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
  rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
  rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
  pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740

 Note that the truncate will lead to a new, different, relfilenode.

 Really? Comparing the relfilenodes gives the same values before and
 after the truncate.

Yep, the relfilenodes are not changed in this case because CREATE TABLE and
TRUNCATE were executed in the same transaction block.

  ctmp=# select * from test;
  ERROR:  could not read block 0 in file base/16385/16393: read only 0 of 
  8192 bytes

 Hm. I can't reproduce this. Can you include a bit more details about how
 to reproduce?

 Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
 can probably construct a Dockerfile that reproduces it pretty reliably.

I could reproduce the problem in the master branch by doing
the following steps.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
 begin;
 create table test(id serial primary key);
 truncate table test;
 commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
select * from test;

The optimization of the TRUNCATE operation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only the index file truncation is
logged, and index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, the index file is truncated
to 0 bytes... A very simple fix is to log the index creation in that case,
but I'm not sure if that's OK to do..

Regards,

-- 
Fujii Masao




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-02 Thread Martijn van Oosterhout
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
 Hi,
 
 On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
  === Start with an empty database
 
 My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level =
minimal.

  === Note the index file is 8KB.
  === At this point nuke the database server (in this case it was simply 
  === destroying the container it was running in.
 
 How did you continue from there? The container has persistent storage?
 Or are you reapplying the WAL to somewhere else?

The container has persistent storage on the host. What I think is
actually unusual is that the script that started postgres was missing
an 'exec', so postgres never gets the signal to shut down.

  martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 | grep -wE '16389|16387|16393'
  rmgr: XLOG     len (rec/tot): 72/104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
  rmgr: Storage  len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
  rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
  rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
  pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740
 
 Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and
after the truncate.
 
  ctmp=# select * from test;
  ERROR:  could not read block 0 in file base/16385/16393: read only 0 of 
  8192 bytes
 
 Hm. I can't reproduce this. Can you include a bit more details about how
 to reproduce?

Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




[HACKERS] WAL logging problem in 9.4.3?

2015-07-02 Thread Martijn van Oosterhout
Hoi,

I ran into this in our CI setup and I think it's an actual bug. The
issue appears to be that when a table is created *and truncated* in a
single transaction, the WAL contains a truncate record it
shouldn't, such that if the database crashes you have a broken index.
It would also lose any data that was in the table at commit time.

I didn't test 9.4.4 yet, though I don't see anything in the release
notes that resemble this.

Detail:

=== Start with an empty database

martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

ctmp=# begin;
BEGIN
ctmp=# create table test(id serial primary key);
CREATE TABLE
ctmp=# truncate table test;
TRUNCATE TABLE
ctmp=# commit;
COMMIT
ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
   relname   | relfilenode
-------------+-------------
 test        |       16389
 test_id_seq |       16387
 test_pkey   |       16393
(3 rows)

=== Note: if you do a CHECKPOINT here the issue doesn't happen
=== obviously.

ctmp=# \q
martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
[sudo] password for martijn:
-rw------- 1 messagebus ssl-cert 8192 Jul  2 23:34 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert    0 Jul  2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert 8192 Jul  2 23:34 /data/postgres/base/16385/16393

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply 
=== destroying the container it was running in.

=== Dump the xlogs just to show what got recorded. Note there's a
=== truncate for the data file and the index file.

martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 | grep -wE '16389|16387|16393'
rmgr: XLOG     len (rec/tot): 72/104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage  len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740

=== Start the DB up again

database_1 | LOG:  database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
database_1 | LOG:  database system was not properly shut down; automatic recovery in progress
database_1 | LOG:  redo starts at 0/16A92A8
database_1 | LOG:  record with zero length at 0/16BE740
database_1 | LOG:  redo done at 0/16BE710
database_1 | LOG:  last completed transaction was at log time 2015-07-02 21:34:45.664989+00
database_1 | LOG:  database system is ready to accept connections
database_1 | LOG:  autovacuum launcher started

=== Oops, the index file is empty now

martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
-rw------- 1 messagebus ssl-cert 8192 Jul  2 23:37 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert    0 Jul  2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert    0 Jul  2 23:37 /data/postgres/base/16385/16393

martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

=== And now the index is broken. I think the only reason it doesn't
=== complain about the data file is because zero bytes there is OK.  But if
=== the table had data before it would be gone now.

ctmp=# select * from test;
ERROR:  could not read block 0 in file base/16385/16393: read only 0 of 8192 
bytes

ctmp=# select version();
                                            version
-----------------------------------------------------------------------------------------------
 PostgreSQL 9.4.3 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
(1 row)

Hope this 

Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-02 Thread Andres Freund
Hi,

On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
 === Start with an empty database

My guess is you have wal_level = minimal?

 ctmp=# begin;
 BEGIN
 ctmp=# create table test(id serial primary key);
 CREATE TABLE
 ctmp=# truncate table test;
 TRUNCATE TABLE
 ctmp=# commit;
 COMMIT
 ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
    relname   | relfilenode
 -------------+-------------
  test        |       16389
  test_id_seq |       16387
  test_pkey   |       16393
 (3 rows)
 

 === Note the index file is 8KB.
 === At this point nuke the database server (in this case it was simply 
 === destroying the container it was running in.

How did you continue from there? The container has persistent storage?
Or are you reapplying the WAL to somewhere else?

 === Dump the xlogs just to show what got recorded. Note there's a
 === truncate for the data file and the index file.

That should be ok.

 martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 | grep -wE '16389|16387|16393'
 rmgr: XLOG     len (rec/tot): 72/104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
 rmgr: Storage  len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
 rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
 rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
 rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
 rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
 rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
 rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
 pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740

Note that the truncate will lead to a new, different, relfilenode.

 === Start the DB up again
 
 database_1 | LOG:  database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
 database_1 | LOG:  database system was not properly shut down; automatic recovery in progress
 database_1 | LOG:  redo starts at 0/16A92A8
 database_1 | LOG:  record with zero length at 0/16BE740
 database_1 | LOG:  redo done at 0/16BE710
 database_1 | LOG:  last completed transaction was at log time 2015-07-02 21:34:45.664989+00
 database_1 | LOG:  database system is ready to accept connections
 database_1 | LOG:  autovacuum launcher started
 
 === Oops, the index file is empty now

That's probably just the old index file?

 martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
 -rw------- 1 messagebus ssl-cert 8192 Jul  2 23:37 /data/postgres/base/16385/16387
 -rw------- 1 messagebus ssl-cert    0 Jul  2 23:34 /data/postgres/base/16385/16389
 -rw------- 1 messagebus ssl-cert    0 Jul  2 23:37 /data/postgres/base/16385/16393
 
 martijn@martijn-jessie:$ psql ctmp -h localhost -U username
 Password for user username:
 psql (9.4.3)
 Type "help" for help.
 
 === And now the index is broken. I think the only reason it doesn't
 === complain about the data file is because zero bytes there is OK.  But if
 === the table had data before it would be gone now.
 
 ctmp=# select * from test;
 ERROR:  could not read block 0 in file base/16385/16393: read only 0 of 
 8192 bytes

Hm. I can't reproduce this. Can you include a bit more details about how
to reproduce?

Greetings,

Andres Freund

