Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-03-14 Thread Peter Geoghegan
On Mon, Sep 21, 2020 at 1:11 PM Andres Freund  wrote:
> > I'm not sure I really understand how that's happening, because surely
> > HOT chains and non-HOT chains are pruned by the same code, but it
> > doesn't sound good.
>
> Not necessarily, unfortunately:
>
> case HEAPTUPLE_DEAD:

> So if e.g. a transaction aborts between the heap_page_prune and this
> check the pruning behaviour depends on whether the tuple is part of a
> HOT chain or not.

I have a proposal that includes removing this "tupgone = true" special case:

https://postgr.es/m/CAH2-Wzm7Y=_g3fjvhv7a85afubusydggdneqn1hodveoctl...@mail.gmail.com

Of course this won't change the fact that vacuumlazy.c can disagree
with pruning about what is dead -- that is a necessary consequence of
having multiple HTSV calls for the same tuple in vacuumlazy.c (it can
change in the presence of concurrently aborted transactions). But
removing the "tupgone = true" special case does seem much more
consistent, and simpler overall. We have lots of code that is only
needed to make that special case work. For example, the whole idea of
a dedicated XLOG_HEAP2_CLEANUP_INFO record for recovery conflicts can
go -- we can get by with only XLOG_HEAP2_CLEAN records from pruning,
IIRC.

Have I missed something? The special case in question seems pretty
awful to me, so I have to wonder why somebody else didn't remove it
long ago...

--
Peter Geoghegan




Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-01-18 Thread Andrey Borodin



> On 18 Jan 2021, at 18:54, Robert Haas wrote:
> 
> On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin  wrote:
>> Does anyone maintain open-source pg_surgery analogs for released versions
>> of PG? It seems to me I'll have to use something like this, and I just
>> thought that I should consider pg_surgery in favour of our pg_dirty_hands.
> 
> I do not. I'm still of the opinion that we ought to back-patch
> pg_surgery.
+1.
Yesterday I spent a few hours packaging pg_dirty_hands and pg_surgery (BTW,
it works fine for 12).
It's a kind of 911 tool: one doesn't think they will need it until they
actually do. And the clock is ticking.
OTOH, it opens new ways to shoot yourself in the foot.

Best regards, Andrey Borodin.



Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-01-18 Thread Robert Haas
On Mon, Jan 18, 2021 at 9:25 AM Michael Banck  wrote:
> One other possibility would be to push a version of pg_surgery that is
> compatible with the back-branches somewhere external (e.g. either
> git.postgresql.org and/or Github), so that it can be picked up by
> distributions and/or individual users in need.

Sure, but I don't see how that's better.

> That is assuming it does not need assorted server changes to go with; I
> did not read the thread in detail but I was under the assumption it is a
> client program?

It's a server extension. It does not require core changes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-01-18 Thread Michael Banck
On Mon, Jan 18, 2021 at 08:54:10AM -0500, Robert Haas wrote:
> On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin  wrote:
> > Does anyone maintain open-source pg_surgery analogs for released
> > versions of PG?  It seems to me I'll have to use something like this,
> > and I just thought that I should consider pg_surgery in favour of our
> > pg_dirty_hands.
> 
> I do not. I'm still of the opinion that we ought to back-patch
> pg_surgery. This didn't attract a consensus before, and it's hard to
> dispute that it's a new feature in what would be a back branch. But
> it's unclear to me how users are otherwise supposed to recover from
> some of the bugs that are or have been present in those back branches.

One other possibility would be to push a version of pg_surgery that is
compatible with the back-branches somewhere external (e.g. either
git.postgresql.org and/or Github), so that it can be picked up by
distributions and/or individual users in need.

That is assuming it does not need assorted server changes to go with; I
did not read the thread in detail but I was under the assumption it is a
client program?


Michael

-- 
Michael Banck
Project Lead / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de





Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-01-18 Thread Robert Haas
On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin  wrote:
> Does anyone maintain open-source pg_surgery analogs for released versions
> of PG? It seems to me I'll have to use something like this, and I just
> thought that I should consider pg_surgery in favour of our pg_dirty_hands.

I do not. I'm still of the opinion that we ought to back-patch
pg_surgery. This didn't attract a consensus before, and it's hard to
dispute that it's a new feature in what would be a back branch. But
it's unclear to me how users are otherwise supposed to recover from
some of the bugs that are or have been present in those back branches.
I'm not sure that I see the logic in telling people we'll try to
prevent them from getting hosed in the future but if they're already
hosed they can wait for v14 to fix it. They can't wait that long, and
a dump-and-restore cycle is awfully painful.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-01-16 Thread Andrey Borodin
Hi hackers!

Does anyone maintain open-source pg_surgery analogs for released versions of PG?
It seems to me I'll have to use something like this, and I just thought that I
should consider pg_surgery in favour of our pg_dirty_hands.

Thanks!

Best regards, Andrey Borodin.



Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-10-28 Thread Andres Freund
Hi,

On 2020-10-28 21:00:30 -0700, Andres Freund wrote:
> On 2020-10-28 19:09:14 -0700, Andres Freund wrote:
> > On 2020-10-28 18:13:44 -0700, Andres Freund wrote:
> > > Just pushed this. Let's see what the BF says...
> > 
> > It says that apparently something is unstable about my new test. It
> > first passed on a few animals, but then failed a lot in a row. Looking.
> 
> The differentiating factor is force_parallel_mode=regress.
> 
> Ugh, this is nasty: the problem is that we can end up computing the
> horizons for the first time before MyDatabaseId is even set, which leads
> us to compute too aggressive a horizon for plain tables, because we skip
> over them while MyDatabaseId is still InvalidOid:
> 
>   /*
>    * Normally queries in other databases are ignored for anything but
>    * the shared horizon. But in recovery we cannot compute an accurate
>    * per-database horizon as all xids are managed via the
>    * KnownAssignedXids machinery.
>    */
>   if (in_recovery ||
>       proc->databaseId == MyDatabaseId ||
>       proc->databaseId == 0)  /* always include WalSender */
>       h->data_oldest_nonremovable =
>           TransactionIdOlder(h->data_oldest_nonremovable, xmin);
> 
> That then subsequently leads us to consider a row fully dead in
> heap_hot_search_buffers(), triggering the killtuples logic and causing
> the test to fail.
> 
> With force_parallel_mode=regress we constantly start parallel workers,
> which makes it much more likely that this case is hit.
> 
> It's trivial to fix luckily...

Pushed that fix, hopefully that calms the BF.

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-10-28 Thread Andres Freund
Hi,

On 2020-10-28 19:09:14 -0700, Andres Freund wrote:
> On 2020-10-28 18:13:44 -0700, Andres Freund wrote:
> > Just pushed this. Let's see what the BF says...
> 
> It says that apparently something is unstable about my new test. It
> first passed on a few animals, but then failed a lot in a row. Looking.

The differentiating factor is force_parallel_mode=regress.

Ugh, this is nasty: the problem is that we can end up computing the
horizons for the first time before MyDatabaseId is even set, which leads
us to compute too aggressive a horizon for plain tables, because we skip
over them while MyDatabaseId is still InvalidOid:

/*
 * Normally queries in other databases are ignored for anything but
 * the shared horizon. But in recovery we cannot compute an accurate
 * per-database horizon as all xids are managed via the
 * KnownAssignedXids machinery.
 */
if (in_recovery ||
    proc->databaseId == MyDatabaseId ||
    proc->databaseId == 0)  /* always include WalSender */
    h->data_oldest_nonremovable =
        TransactionIdOlder(h->data_oldest_nonremovable, xmin);

That then subsequently leads us to consider a row fully dead in
heap_hot_search_buffers(), triggering the killtuples logic and causing
the test to fail.

With force_parallel_mode=regress we constantly start parallel workers,
which makes it much more likely that this case is hit.

It's trivial to fix luckily...

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-10-28 Thread Andres Freund
On 2020-10-28 18:13:44 -0700, Andres Freund wrote:
> Just pushed this. Let's see what the BF says...

It says that apparently something is unstable about my new test. It
first passed on a few animals, but then failed a lot in a row. Looking.




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-10-28 Thread Andres Freund
Hi,

On 2020-10-27 20:51:10 -0700, Andres Freund wrote:
> On 2020-10-15 01:37:35 -0700, Andres Freund wrote:
> > Attached is a *prototype* implementation of this concept, which clearly is
> > lacking some comment work (and is intentionally lacking some
> > re-indentation).
> > 
> > I described my thoughts about how to limit the horizons for temp tables in
> > https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de
> 
> Attached is an updated version of this patch. Quite a bit of polish,
> added removal of the isTopLevel arguments added by a7212be8b9e that are now
> unnecessary, and changed the initialization of the temp table horizons
> to be latestCompletedXid + 1 instead of just latestCompletedXid when no
> xid is assigned.
> 
> 
> > Besides comments this probably mainly needs a bit more tests around temp
> > table vacuuming. Should have at least an isolation test that verifies
> > that temp table rows can be a) vacuumed b) pruned away in the presence
> > of other sessions with xids.
> 
> I added an isolationtester test for this. It verifies that dead rows in
> temp tables get vacuumed and pruned despite concurrent sessions having
> older snapshots. It does so by forcing an IOS and checking the number of
> heap fetches reported by EXPLAIN. I also added a companion test for
> permanent relations, ensuring that such rows do not get removed.
> 
> 
> Any comments? Otherwise I'll push that patch tomorrow.

Just pushed this. Let's see what the BF says...

It's kinda cool how much more aggressive hot pruning / killtuples now is
for temp tables.

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-10-27 Thread Andres Freund
Hi,

On 2020-10-15 01:37:35 -0700, Andres Freund wrote:
> Attached is a *prototype* implementation of this concept, which clearly is
> lacking some comment work (and is intentionally lacking some
> re-indentation).
> 
> I described my thoughts about how to limit the horizons for temp tables in
> https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de

Attached is an updated version of this patch. Quite a bit of polish,
added removal of the isTopLevel arguments added by a7212be8b9e that are now
unnecessary, and changed the initialization of the temp table horizons
to be latestCompletedXid + 1 instead of just latestCompletedXid when no
xid is assigned.


> Besides comments this probably mainly needs a bit more tests around temp
> table vacuuming. Should have at least an isolation test that verifies
> that temp table rows can be a) vacuumed b) pruned away in the presence
> of other sessions with xids.

I added an isolationtester test for this. It verifies that dead rows in
temp tables get vacuumed and pruned despite concurrent sessions having
older snapshots. It does so by forcing an IOS and checking the number of
heap fetches reported by EXPLAIN. I also added a companion test for
permanent relations, ensuring that such rows do not get removed.


Any comments? Otherwise I'll push that patch tomorrow.

Greetings,

Andres Freund
>From 1ba3c5eeb250d3e755ad5c044aede1816d040ae4 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 27 Oct 2020 20:39:57 -0700
Subject: [PATCH v2] Centralize horizon determination for temp tables, fix bug
 due to skew.

Also add tests to verify that temp tables can be pruned despite other
sessions having older snapshots.

Author: Andres Freund
Discussion: https://postgr.es/m/20201014203103.72oke6hqywcyh...@alap3.anarazel.de
Discussion: https://postgr.es/m/20201015083735.derdzysdtqdvx...@alap3.anarazel.de
---
 src/include/commands/cluster.h   |   3 +-
 src/include/commands/vacuum.h|   1 -
 src/backend/access/heap/vacuumlazy.c |   1 -
 src/backend/commands/cluster.c   |  28 +--
 src/backend/commands/vacuum.c|  76 +++---
 src/backend/storage/ipc/procarray.c  |  59 -
 src/test/isolation/expected/horizons.out | 281 +++
 src/test/isolation/isolation_schedule|   1 +
 src/test/isolation/specs/horizons.spec   | 165 +
 9 files changed, 541 insertions(+), 74 deletions(-)
 create mode 100644 src/test/isolation/expected/horizons.out
 create mode 100644 src/test/isolation/specs/horizons.spec

diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 1eb144204b6..e05884781b9 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -19,8 +19,7 @@
 
 
 extern void cluster(ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, int options,
-		bool isTopLevel);
+extern void cluster_rel(Oid tableOid, Oid indexOid, int options);
 extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
 	   bool recheck, LOCKMODE lockmode);
 extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d9475c99890..a4cd7214009 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -267,7 +267,6 @@ extern void vacuum_set_xid_limits(Relation rel,
   int freeze_min_age, int freeze_table_age,
   int multixact_freeze_min_age,
   int multixact_freeze_table_age,
-  bool isTopLevel,
   TransactionId *oldestXmin,
   TransactionId *freezeLimit,
   TransactionId *xidFullScanLimit,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168dc..be5439dd9d8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -471,7 +471,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 		  params->freeze_table_age,
 		  params->multixact_freeze_min_age,
 		  params->multixact_freeze_table_age,
-		  true, /* we must be a top-level command */
 		  , , ,
 		  , );
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 0d647e912c4..04d12a7ece6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -67,13 +67,10 @@ typedef struct
 } RelToCluster;
 
 
-static void rebuild_relation(Relation OldHeap, Oid indexOid,
-			 bool isTopLevel, bool verbose);
+static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
 static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
-			bool isTopLevel, bool verbose,
-			bool *pSwapToastByContent,
-			TransactionId *pFreezeXid,
-			MultiXactId *pCutoffMulti);
+			bool verbose, bool *pSwapToastByContent,
+			TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
 static List 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-10-15 Thread Andres Freund
Hi,

On 2020-09-21 14:20:03 -0700, Andres Freund wrote:
> I can give it a try. I can see several paths of varying invasiveness,
> not sure yet what the best approach is. Let me think about it for a bit.

Ugh, sorry for taking so long to get around to this.

Attached is a *prototype* implementation of this concept, which clearly is
lacking some comment work (and is intentionally lacking some
re-indentation).

I described my thoughts about how to limit the horizons for temp tables in
https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de

Besides comments this probably mainly needs a bit more tests around temp
table vacuuming. Should have at least an isolation test that verifies
that temp table rows can be a) vacuumed b) pruned away in the presence
of other sessions with xids.

Greetings,

Andres Freund
diff --git i/src/backend/commands/vacuum.c w/src/backend/commands/vacuum.c
index ddeec870d81..8851fafed64 100644
--- i/src/backend/commands/vacuum.c
+++ w/src/backend/commands/vacuum.c
@@ -950,24 +950,8 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId mxactLimit;
 	MultiXactId safeMxactLimit;
 
-	if (RELATION_IS_LOCAL(rel) && !IsInTransactionBlock(isTopLevel))
 	{
 		/*
-		 * If we are processing a temp relation (which by prior checks must be
-		 * one belonging to our session), and we are not inside any
-		 * transaction block, then there can be no tuples in the rel that are
-		 * still in-doubt, nor can there be any that are dead but possibly
-		 * still interesting to some snapshot our session holds.  We don't
-		 * need to care whether other sessions could see such tuples, either.
-		 * So we can aggressively set the cutoff xmin to be the nextXid.
-		 */
-		*oldestXmin = ReadNewTransactionId();
-	}
-	else
-	{
-		/*
-		 * Otherwise, calculate the cutoff xmin normally.
-		 *
 		 * We can always ignore processes running lazy vacuum.  This is
 		 * because we use these values only for deciding which tuples we must
 		 * keep in the tables.  Since lazy vacuum doesn't write its XID
diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 07c5eeb7495..c3cfe34ffee 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -153,6 +153,8 @@ typedef struct ProcArrayStruct
  *I.e. the difference to GlobalVisCatalogRels is that
  *replication slot's catalog_xmin is not taken into account.
  *
+ * 4) GlobalVisTempRels, 
+ *
  * GlobalVisTestFor(relation) returns the appropriate state
  * for the relation.
  *
@@ -234,6 +236,9 @@ typedef struct ComputeXidHorizonsResult
 	 * defined tables.
 	 */
 	TransactionId data_oldest_nonremovable;
+
+	TransactionId temp_oldest_nonremovable;
+
 } ComputeXidHorizonsResult;
 
 
@@ -257,12 +262,13 @@ static TransactionId standbySnapshotPendingXmin;
 
 /*
  * State for visibility checks on different types of relations. See struct
- * GlobalVisState for details. As shared, catalog, and user defined
+ * GlobalVisState for details. As shared, catalog, normal and temporary
  * relations can have different horizons, one such state exists for each.
  */
 static GlobalVisState GlobalVisSharedRels;
 static GlobalVisState GlobalVisCatalogRels;
 static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisTempRels;
 
 /*
  * This backend's RecentXmin at the last time the accurate xmin horizon was
@@ -1668,6 +1674,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		h->oldest_considered_running = initial;
 		h->shared_oldest_nonremovable = initial;
 		h->data_oldest_nonremovable = initial;
+		if (TransactionIdIsValid(MyProc->xid))
+			h->temp_oldest_nonremovable = MyProc->xid;
+		else
+			h->temp_oldest_nonremovable = initial;
 	}
 
 	/*
@@ -1760,6 +1770,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
 		h->data_oldest_nonremovable =
 			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+		/* temp relations cannot be accessed in recovery */
 	}
 	else
 	{
@@ -1785,6 +1796,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		h->data_oldest_nonremovable =
 			TransactionIdRetreatedBy(h->data_oldest_nonremovable,
 	 vacuum_defer_cleanup_age);
+		/* defer doesn't apply to temp relations */
 	}
 
 	/*
@@ -1844,6 +1856,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 h->catalog_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 		 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+		 h->temp_oldest_nonremovable));
 	Assert(!TransactionIdIsValid(h->slot_xmin) ||
 		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 		 h->slot_xmin));
@@ -1878,6 +1892,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
 		return horizons.shared_oldest_nonremovable;
 	else if (RelationIsAccessibleInLogicalDecoding(rel))
 		return 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Andres Freund
On 2020-09-21 17:03:53 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
> >> Andres Freund  writes:
> >>> I think that's an argument for what I suggested elsewhere, which is that
> >>> we should move the logic for a different horizon for temp tables out of
> >>> vacuum_set_xid_limits, and into procarray.
> 
> >> But procarray does not seem like a great place for
> >> table-persistence-dependent decisions either?
> 
> > That ship has sailed a long long time ago though. GetOldestXmin() has
> > looked at the passed in relation for a quite a while, and even before
> > that we had logic about 'allDbs' etc.  It doesn't easily seem possible
> > to avoid that, given how intimately that's coupled with how snapshots
> > are built and used, database & vacuumFlags checks etc.
> 
> OK.  Given that you've got strong feelings about this, do you want to
> propose a patch?  I'm happy to fix it, since it's at least in part my
> bug, but I probably won't do it exactly like you would.

I can give it a try. I can see several paths of varying invasiveness,
not sure yet what the best approach is. Let me think about it for a bit.

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Andres Freund  writes:
> On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> I think that's an argument for what I suggested elsewhere, which is that
>>> we should move the logic for a different horizon for temp tables out of
>>> vacuum_set_xid_limits, and into procarray.

>> But procarray does not seem like a great place for
>> table-persistence-dependent decisions either?

> That ship has sailed a long long time ago though. GetOldestXmin() has
> looked at the passed in relation for a quite a while, and even before
> that we had logic about 'allDbs' etc.  It doesn't easily seem possible
> to avoid that, given how intimately that's coupled with how snapshots
> are built and used, database & vacuumFlags checks etc.

OK.  Given that you've got strong feelings about this, do you want to
propose a patch?  I'm happy to fix it, since it's at least in part my
bug, but I probably won't do it exactly like you would.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Andres Freund
Hi,

On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
> Andres Freund  writes:
> >> I think to move forward, we need to figure out what the freezing
> >> behavior ought to be for temp tables.  We could make it the same
> >> as it was before a7212be8b, which'd just require some more complexity
> >> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> >> like VACUUM's behavior on a temp table to be fully independent of
> >> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> >> behavior to stand, but then it seems we need to lobotomize the error
> >> check in heap_prepare_freeze_tuple to some extent.
> 
> > I think that's an argument for what I suggested elsewhere, which is that
> > we should move the logic for a different horizon for temp tables out of
> > vacuum_set_xid_limits, and into procarray.
> 
> But procarray does not seem like a great place for
> table-persistence-dependent decisions either?

That ship has sailed a long long time ago though. GetOldestXmin() has
looked at the passed in relation for a quite a while, and even before
that we had logic about 'allDbs' etc.  It doesn't easily seem possible
to avoid that, given how intimately that's coupled with how snapshots
are built and used, database & vacuumFlags checks etc.

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Andres Freund  writes:
> The reason for that is that the GlobalVisState stuff is computed
> heuristically (and then re-checked if that's not sufficient to prune a
> tuple, unless already done so). That's done so GetSnapshotData() doesn't
> have to look at each backend's ->xmin, which is quite a massive speedup
> at higher connection counts, as each backend's ->xmin changes much more
> often than each backend's xid.

OK.

> What exactly do you mean by the "going to huge effort to decide" bit?

I'd looked at all the complexity around GlobalVisState, but failed to
register that it should be pretty cheap on a per-tuple basis.  So never
mind that complaint.  The point that remains is just that it's different
from HeapTupleSatisfiesVacuum's rules.

>> I think to move forward, we need to figure out what the freezing
>> behavior ought to be for temp tables.  We could make it the same
>> as it was before a7212be8b, which'd just require some more complexity
>> in vacuum_set_xid_limits.  However, that negates the idea that we'd
>> like VACUUM's behavior on a temp table to be fully independent of
>> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
>> behavior to stand, but then it seems we need to lobotomize the error
>> check in heap_prepare_freeze_tuple to some extent.

> I think that's an argument for what I suggested elsewhere, which is that
> we should move the logic for a different horizon for temp tables out of
> vacuum_set_xid_limits, and into procarray.

But procarray does not seem like a great place for
table-persistence-dependent decisions either?

>> Independently of that, it seems like we need to fix things so that
>> when pruneheap.c is called by vacuum, it makes EXACTLY the same
>> dead-or-not-dead decisions that the main vacuum code makes.  This
>> business with applying some GlobalVisState rule or other instead
>> seems just as unsafe as can be.

> It's not great, I agree. Not sure there is a super nice answer
> though. Note that, even before my changes, vacuumlazy can decide
> differently than pruning whether a tuple is live. E.g. when an inserting
> transaction aborts. That's pretty much unavoidable as long as we have
> multiple HTSV calls for a tuple, since none of our locking can (nor
> should) prevent concurrent transactions from aborting.

It's clear that if the environment changes between test A and test B,
we might get different results.  What I'm not happy about is that the
rules are different, so we might get different results even if the
environment did not change.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Andres Freund
Hi,

On 2020-09-20 13:13:16 -0400, Tom Lane wrote:
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.

The reason for that is that the GlobalVisState stuff is computed
heuristically (and then re-checked if that's not sufficient to prune a
tuple, unless already done so). That's done so GetSnapshotData() doesn't
have to look at each backend's ->xmin, which is quite a massive speedup
at higher connection counts, as each backend's ->xmin changes much more
often than each backend's xid.

But for VACUUM we need to do the accurate scan of the procarray anyway,
because we need an accurate value for things other than HOT pruning
decisions.

What exactly do you mean by the "going to huge effort to decide" bit?


> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables.  We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.

I think that's an argument for what I suggested elsewhere, which is that
we should move the logic for a different horizon for temp tables out of
vacuum_set_xid_limits, and into procarray.


> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes.  This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.

It's not great, I agree. Not sure there is a super nice answer
though. Note that, even before my changes, vacuumlazy can decide
differently than pruning whether a tuple is live. E.g. when an inserting
transaction aborts. That's pretty much unavoidable as long as we have
multiple HTSV calls for a tuple, since none of our locking can (nor
should) prevent concurrent transactions from aborting.

Before your new code avoiding the GetOldestNonRemovableTransactionId()
call for temp tables, the GlobalVis* state could never be more pessimistic
than decisions based on a prior GetOldestNonRemovableTransactionId() call
(as that internally updates the heuristic horizons used by GlobalVis*).

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Sep 21, 2020 at 2:21 PM Tom Lane  wrote:
>> Right, but what we end up with is that the very same tuple xmin and
>> xmax might result in pruning/deletion, or not, depending on whether
>> it's part of a HOT chain or not.  That's at best pretty weird, and
>> at worst it means that corner-case bugs in other places are triggered
>> in only one of the two scenarios ... which is what we have here.

> I'm not sure I really understand how that's happening, because surely
> HOT chains and non-HOT chains are pruned by the same code, but it
> doesn't sound good.

No, they're not.  lazy_scan_heap() will never remove a tuple that
is HeapTupleIsHotUpdated or HeapTupleIsHeapOnly, even if it thinks
it's DEAD -- cf. vacuumlazy.c, about line 1350.  So tuples in
a HOT chain are deleted exactly when pruneheap.c sees fit to do so.
OTOH, for tuples not in a HOT chain, the decision is (I believe)
entirely up to lazy_scan_heap().  And the core of my complaint is that
pruneheap.c's decisions about what is DEAD are not reliably identical
to what HeapTupleSatisfiesVacuum thinks.

I don't mind if a free-standing prune operation has its own rules,
but when it's invoked by VACUUM it ought to follow VACUUM's rules about
what is dead or alive.  What remains unclear at this point is whether
we ought to import some of the intelligence added by the GlobalVisState
patch into VACUUM's behavior, or just dumb down pruneheap.c so that
it applies exactly the HeapTupleSatisfiesVacuum rules when invoked
by VACUUM.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Andres Freund
Hi,

On 2020-09-21 16:02:29 -0400, Robert Haas wrote:
> On Mon, Sep 21, 2020 at 2:21 PM Tom Lane  wrote:
> > Right, but what we end up with is that the very same tuple xmin and
> > xmax might result in pruning/deletion, or not, depending on whether
> > it's part of a HOT chain or not.  That's at best pretty weird, and
> > at worst it means that corner-case bugs in other places are triggered
> > in only one of the two scenarios ... which is what we have here.
> 
> I'm not sure I really understand how that's happening, because surely
> HOT chains and non-HOT chains are pruned by the same code, but it
> doesn't sound good.

Not necessarily, unfortunately:

case HEAPTUPLE_DEAD:

/*
 * Ordinarily, DEAD tuples would have been removed by
 * heap_page_prune(), but it's possible that the tuple
 * state changed since heap_page_prune() looked.  In
 * particular an INSERT_IN_PROGRESS tuple could have
 * changed to DEAD if the inserter aborted.  So this
 * cannot be considered an error condition.
 *
 * If the tuple is HOT-updated then it must only be
 * removed by a prune operation; so we keep it just as if
 * it were RECENTLY_DEAD.  Also, if it's a heap-only
 * tuple, we choose to keep it, because it'll be a lot
 * cheaper to get rid of it in the next pruning pass than
 * to treat it like an indexed tuple. Finally, if index
 * cleanup is disabled, the second heap pass will not
 * execute, and the tuple will not get removed, so we must
 * treat it like any other dead tuple that we choose to
 * keep.
 *
 * If this were to happen for a tuple that actually needed
 * to be deleted, we'd be in trouble, because it'd
 * possibly leave a tuple below the relation's xmin
 * horizon alive.  heap_prepare_freeze_tuple() is prepared
 * to detect that case and abort the transaction,
 * preventing corruption.
 */
if (HeapTupleIsHotUpdated(&tuple) ||
HeapTupleIsHeapOnly(&tuple) ||
params->index_cleanup == VACOPT_TERNARY_DISABLED)
nkeep += 1;
else
tupgone = true; /* we can delete the tuple */
all_visible = false;


So if e.g. a transaction aborts between the heap_page_prune and this
check the pruning behaviour depends on whether the tuple is part of a
HOT chain or not.
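That window can be mimicked with a small Python sketch (illustrative only; the state names are simplified stand-ins for the HTSV return codes in the vacuumlazy.c branch quoted above):

```python
# Toy model of the abort window: heap_page_prune() and the later
# lazy_scan_heap() check each consult HTSV, an abort in between changes
# the answer, and what happens next depends only on HOT membership.

INSERT_IN_PROGRESS = "insert_in_progress"
DEAD = "dead"

def htsv(state):
    """Stand-in for HeapTupleSatisfiesVacuum: reports the state at call time."""
    return state

def lazy_scan_decision(state_at_prune, state_at_scan, is_hot):
    if htsv(state_at_prune) == DEAD:
        return "removed by pruning"      # prune already saw it dead
    if htsv(state_at_scan) == DEAD:      # inserter aborted in between
        if is_hot:
            return "kept (nkeep)"        # HeapTupleIsHotUpdated/HeapOnly path
        return "tupgone = true"          # plain tuple: vacuum deletes it
    return "kept"

assert lazy_scan_decision(INSERT_IN_PROGRESS, DEAD, is_hot=True) == "kept (nkeep)"
assert lazy_scan_decision(INSERT_IN_PROGRESS, DEAD, is_hot=False) == "tupgone = true"
```

The kept HOT tuple then flows into the freezing code, which is where heap_prepare_freeze_tuple() can end up complaining.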

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Robert Haas
On Mon, Sep 21, 2020 at 2:21 PM Tom Lane  wrote:
> Right, but what we end up with is that the very same tuple xmin and
> xmax might result in pruning/deletion, or not, depending on whether
> it's part of a HOT chain or not.  That's at best pretty weird, and
> at worst it means that corner-case bugs in other places are triggered
> in only one of the two scenarios ... which is what we have here.

I'm not sure I really understand how that's happening, because surely
HOT chains and non-HOT chains are pruned by the same code, but it
doesn't sound good.

> FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is
> *not* my preferred fix here.  But it'll take some work in other
> places to preserve them.

Makes sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Robert Haas  writes:
> On Sun, Sep 20, 2020 at 1:13 PM Tom Lane  wrote:
>> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
>> rules for which tuples are removable and vacuum.c's rules for that.
>> This seems like a massive bug in its own right: what's the point of
>> pruneheap.c going to huge effort to decide whether it should keep a
>> tuple if vacuum will then kill it anyway?  I do not understand why
>> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
>> and not VACUUM proper.

> I am not sure I fully understand why you're contrasting pruneheap.c
> with vacuum here, because vacuum just does HOT pruning to remove dead
> tuples - maybe calling the relevant functions with different
> arguments, but it doesn't have its own independent logic for that.

Right, but what we end up with is that the very same tuple xmin and
xmax might result in pruning/deletion, or not, depending on whether
it's part of a HOT chain or not.  That's at best pretty weird, and
at worst it means that corner-case bugs in other places are triggered
in only one of the two scenarios ... which is what we have here.

> The key point is that the freezing code isn't, or at least
> historically wasn't, very smart about dead tuples. For example, I
> think if you told it to freeze something that was dead it would just
> do it, which is obviously bad. And that's why Andres stuck those
> sanity checks in there. But it's still pretty fragile. I think perhaps
> the pruning code should be rewritten in such a way that it can be
> combined with the code that freezes and marks pages all-visible, so
> that there's not so much action at a distance, but such an endeavor is
> in itself pretty scary, and certainly not back-patchable.

Not sure.  The pruning code is trying to serve two masters, that is
both VACUUM and on-the-fly cleanup during ordinary queries.  If you
try to merge it with other tasks that VACUUM does, you're going to
have a mess for the second usage.  I fear there's going to be pretty
strong conservation of cruft either way.

FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is
*not* my preferred fix here.  But it'll take some work in other
places to preserve them.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Robert Haas
On Sun, Sep 20, 2020 at 1:13 PM Tom Lane  wrote:
> 1. My patch a7212be8b does indeed have a problem.  It will allow
> vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
> table if freeze_min_age is zero (ie VACUUM FREEZE).  If there's
> any concurrent transactions, this falls foul of
> heap_prepare_freeze_tuple's expectation that
>
>  * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
>  * XID older than it could neither be running nor seen as running by any
>  * open transaction.  This ensures that the replacement will not change
>  * anyone's idea of the tuple state.
>
> The "cannot freeze committed xmax" error message appears to be banking on
> the assumption that we'd not reach heap_prepare_freeze_tuple for any
> committed-dead tuple unless its xmax is past the specified cutoff_xid.
>
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.

I am not sure I fully understand why you're contrasting pruneheap.c
with vacuum here, because vacuum just does HOT pruning to remove dead
tuples - maybe calling the relevant functions with different
arguments, but it doesn't have its own independent logic for that.

The key point is that the freezing code isn't, or at least
historically wasn't, very smart about dead tuples. For example, I
think if you told it to freeze something that was dead it would just
do it, which is obviously bad. And that's why Andres stuck those
sanity checks in there. But it's still pretty fragile. I think perhaps
the pruning code should be rewritten in such a way that it can be
combined with the code that freezes and marks pages all-visible, so
that there's not so much action at a distance, but such an endeavor is
in itself pretty scary, and certainly not back-patchable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Amit Kapila  writes:
> On Sun, Sep 20, 2020 at 10:43 PM Tom Lane  wrote:
>> AFAICS, there is no chance of the existing pg_surgery regression test
>> being fully stable if we don't fix both things.

> What if we ensure that it runs with autovacuum = off and there is no
> parallel test running? I am not sure about the second part, but if we
> can do that then the test will probably be stable.

Then it'll not be usable under "make installcheck", which is not
very nice.  It's also arguable that you aren't testing pg_surgery
under real-world conditions if you do it like that.

Moreover, I think that both of these points need to be addressed
anyway, as they represent bugs that are reachable independently
of pg_surgery.  Admittedly, we do not have a test case that
proves that the inconsistency between pruneheap and vacuum has
any bad effects in the absence of a7212be8b.  But do you really
want to bet that there are none?

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-20 Thread Amit Kapila
On Sun, Sep 20, 2020 at 10:43 PM Tom Lane  wrote:
>
> I wrote:
> > So this confirms the suspicion that the cause of the buildfarm
> > failures is a concurrently-open transaction, presumably from
> > autovacuum.  I don't have time to poke further right now.
>
..
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.
>
> These two points interact, in that we don't get to the "cannot freeze"
> failure except when pruneheap has decided not to remove something that
> is removable according to VACUUM's rules.  (VACUUM doesn't actually
> remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
> even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
> the tuple, and heap_prepare_freeze_tuple spits up.)  However, if I revert
> a7212be8b then the pg_surgery test still fails in the presence of a
> concurrent transaction (both of the expected "skipping TID" notices
> disappear).  So reverting that patch wouldn't get us out of trouble.
>
> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables.  We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.
>
> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes.  This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.
>

Yeah, on a quick look it seems before commit dc7420c2c9 the
pruneheap.c and the main Vacuum code use to make the same decision and
that is commit which has introduced GlobalVisState stuff.

> AFAICS, there is no chance of the existing pg_surgery regression test
> being fully stable if we don't fix both things.
>

What if we ensure that it runs with autovacuum = off and there is no
parallel test running? I am not sure about the second part, but if we
can do that then the test will probably be stable.

-- 
With Regards,
Amit Kapila.




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-20 Thread Ashutosh Sharma
Hi Tom,

On Sun, Sep 20, 2020 at 10:43 PM Tom Lane  wrote:
>
> I wrote:
> > So this confirms the suspicion that the cause of the buildfarm
> > failures is a concurrently-open transaction, presumably from
> > autovacuum.  I don't have time to poke further right now.
>
> I spent some more time analyzing this, and there seem to be two distinct
> issues:
>
> 1. My patch a7212be8b does indeed have a problem.  It will allow
> vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
> table if freeze_min_age is zero (ie VACUUM FREEZE).  If there's
> any concurrent transactions, this falls foul of
> heap_prepare_freeze_tuple's expectation that
>
>  * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
>  * XID older than it could neither be running nor seen as running by any
>  * open transaction.  This ensures that the replacement will not change
>  * anyone's idea of the tuple state.
>
> The "cannot freeze committed xmax" error message appears to be banking on
> the assumption that we'd not reach heap_prepare_freeze_tuple for any
> committed-dead tuple unless its xmax is past the specified cutoff_xid.
>
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.
>
> These two points interact, in that we don't get to the "cannot freeze"
> failure except when pruneheap has decided not to remove something that
> is removable according to VACUUM's rules.  (VACUUM doesn't actually
> remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
> even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
> the tuple, and heap_prepare_freeze_tuple spits up.)  However, if I revert
> a7212be8b then the pg_surgery test still fails in the presence of a
> concurrent transaction (both of the expected "skipping TID" notices
> disappear).  So reverting that patch wouldn't get us out of trouble.
>
> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables.  We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.
>
> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes.  This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.
>
> AFAICS, there is no chance of the existing pg_surgery regression test
> being fully stable if we don't fix both things.
>

From the explanation you provided above it seems like the test-case
for pg_surgery module is failing because there is some issue with the
changes done in a7212be8b commit (shown below). In other words, I
believe that the test-case for pg_surgery has actually detected an
issue in this commit.

commit a7212be8b9e0885ee769e8c55f99ef742cda487b
Author: Tom Lane 
Date:   Tue Sep 1 18:37:12 2020 -0400

Set cutoff xmin more aggressively when vacuuming a temporary table.



So, do you mean to say that if the issues related to temp tables
induced by the above commit are fixed, it will make the regression test
for pg_surgery stable?

Please let me know if I am missing something here. Thank you.

> BTW, attached is a quick-hack patch to allow automated testing
> of this scenario, along the lines I sketched yesterday.  This
> test passes if you run the two scripts serially, but not when
> you run them in parallel.  I'm not proposing this for commit;
> it's a hack and its timing behavior is probably not stable enough
> for the buildfarm.  But it's pretty useful for poking at these
> issues.
>

Yeah, understood, thanks for sharing this.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-20 Thread Tom Lane
I wrote:
> So this confirms the suspicion that the cause of the buildfarm
> failures is a concurrently-open transaction, presumably from
> autovacuum.  I don't have time to poke further right now.

I spent some more time analyzing this, and there seem to be two distinct
issues:

1. My patch a7212be8b does indeed have a problem.  It will allow
vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
table if freeze_min_age is zero (ie VACUUM FREEZE).  If there's
any concurrent transactions, this falls foul of
heap_prepare_freeze_tuple's expectation that

 * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
 * XID older than it could neither be running nor seen as running by any
 * open transaction.  This ensures that the replacement will not change
 * anyone's idea of the tuple state.

The "cannot freeze committed xmax" error message appears to be banking on
the assumption that we'd not reach heap_prepare_freeze_tuple for any
committed-dead tuple unless its xmax is past the specified cutoff_xid.
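Why the cutoff must not exceed the global xmin can be illustrated with a tiny visibility model (Python, invented numbers; real XID comparisons are circular and far more involved than this):

```python
# Illustrative model: freezing replaces xmin with FrozenTransactionId,
# which every snapshot treats as visible. If an open snapshot could not
# yet see the tuple, freezing past its xmin changes that snapshot's idea
# of the tuple state, which is exactly what the NB comment forbids.

FROZEN_XID = 2  # stand-in for FrozenTransactionId

def visible(tuple_xmin, snapshot_xmin):
    return tuple_xmin == FROZEN_XID or tuple_xmin < snapshot_xmin

def freeze(tuple_xmin, cutoff_xid):
    return FROZEN_XID if tuple_xmin < cutoff_xid else tuple_xmin

snapshot_xmin = 100   # oldest open transaction (global xmin)
tuple_xmin = 150      # committed after that snapshot: invisible to it
assert not visible(tuple_xmin, snapshot_xmin)

# Safe: cutoff_xid <= global xmin leaves the tuple alone.
assert not visible(freeze(tuple_xmin, cutoff_xid=100), snapshot_xmin)

# Unsafe: cutoff_xid = nextXid (e.g. VACUUM FREEZE on a temp table under
# a7212be8b) freezes it, and the old snapshot suddenly sees it.
assert visible(freeze(tuple_xmin, cutoff_xid=200), snapshot_xmin)
```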

2. As Amit suspected, there's an inconsistency between pruneheap.c's
rules for which tuples are removable and vacuum.c's rules for that.
This seems like a massive bug in its own right: what's the point of
pruneheap.c going to huge effort to decide whether it should keep a
tuple if vacuum will then kill it anyway?  I do not understand why
whoever put in the GlobalVisState stuff only applied it in pruneheap.c
and not VACUUM proper.

These two points interact, in that we don't get to the "cannot freeze"
failure except when pruneheap has decided not to remove something that
is removable according to VACUUM's rules.  (VACUUM doesn't actually
remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
the tuple, and heap_prepare_freeze_tuple spits up.)  However, if I revert
a7212be8b then the pg_surgery test still fails in the presence of a
concurrent transaction (both of the expected "skipping TID" notices
disappear).  So reverting that patch wouldn't get us out of trouble.

I think to move forward, we need to figure out what the freezing
behavior ought to be for temp tables.  We could make it the same
as it was before a7212be8b, which'd just require some more complexity
in vacuum_set_xid_limits.  However, that negates the idea that we'd
like VACUUM's behavior on a temp table to be fully independent of
whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
behavior to stand, but then it seems we need to lobotomize the error
check in heap_prepare_freeze_tuple to some extent.

Independently of that, it seems like we need to fix things so that
when pruneheap.c is called by vacuum, it makes EXACTLY the same
dead-or-not-dead decisions that the main vacuum code makes.  This
business with applying some GlobalVisState rule or other instead
seems just as unsafe as can be.

AFAICS, there is no chance of the existing pg_surgery regression test
being fully stable if we don't fix both things.

BTW, attached is a quick-hack patch to allow automated testing
of this scenario, along the lines I sketched yesterday.  This
test passes if you run the two scripts serially, but not when
you run them in parallel.  I'm not proposing this for commit;
it's a hack and its timing behavior is probably not stable enough
for the buildfarm.  But it's pretty useful for poking at these
issues.

regards, tom lane

diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
index a66776c4c4..a7bcfca0fb 100644
--- a/contrib/pg_surgery/Makefile
+++ b/contrib/pg_surgery/Makefile
@@ -9,7 +9,7 @@ EXTENSION = pg_surgery
 DATA = pg_surgery--1.0.sql
 PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
 
-REGRESS = heap_surgery
+REGRESS = --schedule=surgery_schedule
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index d4a757ffa0..6ef5597e7c 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,4 +1,10 @@
 create extension pg_surgery;
+select pg_sleep(0.1);  -- ensure concurrent transaction is ready
+ pg_sleep 
+--
+ 
+(1 row)
+
 -- create a normal heap table and insert some rows.
 -- use a temp table so that vacuum behavior doesn't depend on global xmin
 create temp table htab (a int);
diff --git a/contrib/pg_surgery/expected/sleep.out b/contrib/pg_surgery/expected/sleep.out
new file mode 100644
index 00..208e31163d
--- /dev/null
+++ b/contrib/pg_surgery/expected/sleep.out
@@ -0,0 +1,6 @@
+select pg_sleep(1.0);  -- long enough for whole heap_surgery test
+ pg_sleep 
+--
+ 
+(1 row)
+
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 6526b27535..212c657f3c 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-19 Thread Tom Lane
I wrote:
> I was able to partially reproduce whelk's failure here.  I got a
> couple of cases of "cannot freeze committed xmax", which then leads
> to the second NOTICE diff; but I couldn't reproduce the first
> NOTICE diff.  That was out of about a thousand tries :-( so it's not
> looking like a promising thing to reproduce without modifying the test.

... however, it's trivial to reproduce via manual interference,
using the same strategy discussed recently for another case:
add a pg_sleep at the start of the heap_surgery.sql script,
run "make installcheck", and while that's running start another
session in which you begin a serializable transaction, execute
any old SELECT, and wait.  AFAICT this reproduces all of whelk's
symptoms with 100% reliability.

With a little more effort, this could be automated by putting
some long-running transaction (likely, it needn't be any more
complicated than "select pg_sleep(10)") in a second test script
launched in parallel with heap_surgery.sql.

So this confirms the suspicion that the cause of the buildfarm
failures is a concurrently-open transaction, presumably from
autovacuum.  I don't have time to poke further right now.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-19 Thread Tom Lane
Amit Kapila  writes:
> I think our assumption that changing the tests to have temp tables
> will make them safe w.r.t concurrent activity doesn't seem to be
> correct. We do set OldestXmin for temp tables aggressively enough that
> it allows us to remove all dead tuples, but the test case behavior depends
> on whether we are able to prune the chain. AFAICS, we are using
> different cut-offs in heap_page_prune when it is called via
> lazy_scan_heap. So that seems to be causing both the failures.

Hm, reasonable theory.

I was able to partially reproduce whelk's failure here.  I got a
couple of cases of "cannot freeze committed xmax", which then leads
to the second NOTICE diff; but I couldn't reproduce the first
NOTICE diff.  That was out of about a thousand tries :-( so it's not
looking like a promising thing to reproduce without modifying the test.

I wonder whether "cannot freeze committed xmax" doesn't represent an
actual bug, ie is a7212be8b setting the cutoff *too* aggressively?
But if so, why's it so hard to reproduce?

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-19 Thread Amit Kapila
On Sat, Sep 19, 2020 at 4:03 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > Cool, thanks to both of you for looking. Committed.
>
> Hmph, according to whelk this is worse not better:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-09-18%2017%3A42%3A11
>
> I'm at a loss to understand what's going on there.
>

I think our assumption that changing the tests to have temp tables
will make them safe w.r.t concurrent activity doesn't seem to be
correct. We do set OldestXmin for temp tables aggressively enough that
it allows us to remove all dead tuples, but the test case behavior depends
on whether we are able to prune the chain. AFAICS, we are using
different cut-offs in heap_page_prune when it is called via
lazy_scan_heap. So that seems to be causing both the failures.

-- 
With Regards,
Amit Kapila.




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-18 Thread Tom Lane
Robert Haas  writes:
> Cool, thanks to both of you for looking. Committed.

Hmph, according to whelk this is worse not better:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-09-18%2017%3A42%3A11

I'm at a loss to understand what's going on there.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-18 Thread Robert Haas
On Wed, Sep 16, 2020 at 10:27 PM Ashutosh Sharma  wrote:
> > This is OK by me.
>
> Looks good to me too.

Cool, thanks to both of you for looking. Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-16 Thread Ashutosh Sharma
On Thu, Sep 17, 2020 at 12:14 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > OK. After some more study of the thread and some more experimentation,
> > I came up with the attached. I'll go ahead and commit this if nobody
> > objects.
>
> This is OK by me.
>

Looks good to me too.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-16 Thread Tom Lane
Robert Haas  writes:
> OK. After some more study of the thread and some more experimentation,
> I came up with the attached. I'll go ahead and commit this if nobody
> objects.

This is OK by me.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-16 Thread Robert Haas
On Wed, Sep 16, 2020 at 11:48 AM Tom Lane  wrote:
> Robert Haas  writes:
> > Tom, I know that you often have strong feelings about the exact
> > wording and details of this kind of stuff, so if you feel moved to
> > commit something that is fine with me. If not, I will take my best
> > shot at it.
>
> I'm not feeling terribly picky about it --- so it's all yours.

OK. After some more study of the thread and some more experimentation,
I came up with the attached. I'll go ahead and commit this if nobody
objects.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-pg_surgery-Try-to-stabilize-regression-tests.patch
Description: Binary data


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-16 Thread Tom Lane
Robert Haas  writes:
> Tom, I know that you often have strong feelings about the exact
> wording and details of this kind of stuff, so if you feel moved to
> commit something that is fine with me. If not, I will take my best
> shot at it.

I'm not feeling terribly picky about it --- so it's all yours.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-16 Thread Robert Haas
On Wed, Sep 16, 2020 at 1:48 AM Ashutosh Sharma  wrote:
> Attached is the patch with the changes suggested here. I've tried to
> use a normal heap table with (autovacuum = off) wherever possible.
> Please have a look and let me know for any other issues.

I think the comment needs some wordsmithing -- "unlike other cases" is
not that informative, and "we get a stable vacuum results" isn't
either very clear or all that grammatical. If we're going to add a
comment here, why not just "use a temp table, so autovacuum can't
interfere"?

Tom, I know that you often have strong feelings about the exact
wording and details of this kind of stuff, so if you feel moved to
commit something that is fine with me. If not, I will take my best
shot at it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Ashutosh Sharma
On Wed, Sep 16, 2020 at 10:40 AM Ashutosh Sharma  wrote:
>
> On Wed, Sep 16, 2020 at 9:14 AM Tom Lane  wrote:
> >
> > Ashutosh Sharma  writes:
> > > On Wed, Sep 16, 2020 at 1:25 AM Tom Lane  wrote:
> > >> * Should any of the other tables in the test be converted to temp?
> >
> > > Are you trying to say that we can achieve the things being done in
> > > test-case 1 and 2 by having a single temp table and we should aim for
> > > it because it will make the test-case more efficient and easy to
> > > maintain?
> >
> > Well, I'm just looking at the comment that says the reason for the
> > begin/rollback structure is to keep autovacuum's hands off the table.
> > In most if not all of the other places where we need that, the preferred
> > method is to make the table temp or mark it with (autovacuum = off).
> > While this way isn't wrong exactly, nor inefficient, it does seem
> > a little restrictive.  For instance, you can't easily test cases that
> > involve intentional errors.
> >
> > Another point is that we have a few optimizations that apply to tables
> > created in the current transaction.  I'm not sure whether any of them
> > would fire in this test case, but if they do (now or in the future)
> > that might mean you aren't testing the usual scenario for use of
> > pg_surgery, which is surely not going to be a new-in-transaction
> > table.  (That might be an argument for preferring autovacuum = off
> > over a temp table, too.)
> >
>
> I agree with you on both the above points. I'll try to make the
> necessary changes to address all your comments. Also, I'd prefer
> creating a normal heap table with autovacuum = off over the temp table
> for the reasons you mentioned in the second point.
>

Attached is the patch with the changes suggested here. I've tried to
use a normal heap table with (autovacuum = off) wherever possible.
Please have a look and let me know for any other issues.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index 9451c57..033e3aa 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,8 +1,8 @@
 create extension pg_surgery;
 -- create a normal heap table and insert some rows.
--- note that we don't commit the transaction, so autovacuum can't interfere.
-begin;
-create table htab(a int);
+-- note that we don't want autovacuum to interfere and therefore we disable it
+-- on the test table.
+create table htab(a int) with (autovacuum_enabled = off);
 insert into htab values (100), (200), (300), (400), (500);
 -- test empty TID array
 select heap_force_freeze('htab'::regclass, ARRAY[]::tid[]);
@@ -85,9 +85,10 @@ NOTICE:  skipping tid (0, 6) for relation "htab" because the item number is out
  
 (1 row)
 
-rollback;
 -- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+-- note that unlike other test cases we need a temp table here to ensure that we
+-- get a stable vacuum results.
+create temp table htab2(a int);
 insert into htab2 values (100);
 update htab2 set a = 200;
 vacuum htab2;
@@ -175,6 +176,6 @@ ERROR:  "vw" is not a table, materialized view, or TOAST table
 select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 ERROR:  "vw" is not a table, materialized view, or TOAST table
 -- cleanup.
-drop table htab2;
+drop table htab;
 drop view vw;
 drop extension pg_surgery;
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 8a27214..234e667 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -1,9 +1,9 @@
 create extension pg_surgery;
 
 -- create a normal heap table and insert some rows.
--- note that we don't commit the transaction, so autovacuum can't interfere.
-begin;
-create table htab(a int);
+-- note that we don't want autovacuum to interfere and therefore we disable it
+-- on the test table.
+create table htab(a int) with (autovacuum_enabled = off);
 insert into htab values (100), (200), (300), (400), (500);
 
 -- test empty TID array
@@ -38,10 +38,10 @@ select ctid, xmax from htab where xmin = 2;
 -- out-of-range TIDs should be skipped
 select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
 
-rollback;
-
 -- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+-- note that unlike other test cases we need a temp table here to ensure that we
+-- get stable vacuum results.
+create temp table htab2(a int);
 insert into htab2 values (100);
 update htab2 set a = 200;
 vacuum htab2;
@@ -86,6 +86,6 @@ select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 
 -- cleanup.
-drop table htab2;
+drop table htab;
 drop view vw;
 drop extension pg_surgery;


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Ashutosh Sharma
On Wed, Sep 16, 2020 at 9:14 AM Tom Lane  wrote:
>
> Ashutosh Sharma  writes:
> > On Wed, Sep 16, 2020 at 1:25 AM Tom Lane  wrote:
> >> * Should any of the other tables in the test be converted to temp?
>
> > Are you trying to say that we can achieve the things being done in
> > test-case 1 and 2 by having a single temp table and we should aim for
> > it because it will make the test-case more efficient and easy to
> > maintain?
>
> Well, I'm just looking at the comment that says the reason for the
> begin/rollback structure is to keep autovacuum's hands off the table.
> In most if not all of the other places where we need that, the preferred
> method is to make the table temp or mark it with (autovacuum = off).
> While this way isn't wrong exactly, nor inefficient, it does seem
> a little restrictive.  For instance, you can't easily test cases that
> involve intentional errors.
>
> Another point is that we have a few optimizations that apply to tables
> created in the current transaction.  I'm not sure whether any of them
> would fire in this test case, but if they do (now or in the future)
> that might mean you aren't testing the usual scenario for use of
> pg_surgery, which is surely not going to be a new-in-transaction
> table.  (That might be an argument for preferring autovacuum = off
> over a temp table, too.)
>

I agree with you on both the above points. I'll try to make the
necessary changes to address all your comments. Also, I'd prefer
creating a normal heap table with autovacuum = off over the temp table
for the reasons you mentioned in the second point.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Tom Lane
Ashutosh Sharma  writes:
> On Wed, Sep 16, 2020 at 1:25 AM Tom Lane  wrote:
>> * Should any of the other tables in the test be converted to temp?

> Are you trying to say that we can achieve the things being done in
> test-case 1 and 2 by having a single temp table and we should aim for
> it because it will make the test-case more efficient and easy to
> maintain?

Well, I'm just looking at the comment that says the reason for the
begin/rollback structure is to keep autovacuum's hands off the table.
In most if not all of the other places where we need that, the preferred
method is to make the table temp or mark it with (autovacuum = off).
While this way isn't wrong exactly, nor inefficient, it does seem
a little restrictive.  For instance, you can't easily test cases that
involve intentional errors.

Another point is that we have a few optimizations that apply to tables
created in the current transaction.  I'm not sure whether any of them
would fire in this test case, but if they do (now or in the future)
that might mean you aren't testing the usual scenario for use of
pg_surgery, which is surely not going to be a new-in-transaction
table.  (That might be an argument for preferring autovacuum = off
over a temp table, too.)

Like I said, I don't have a big problem with leaving the rest of the
test as-is.  It just seems to be doing things in an unusual way for
no very good reason.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Ashutosh Sharma
On Wed, Sep 16, 2020 at 1:25 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma  
> > wrote:
> >> Thanks for reporting. I'm able to reproduce the issue by creating some
> >> delay just before "-- now create an unused line pointer" and use the
> >> delay to start a new session either with repeatable read or
> >> serializable transaction isolation level and run some query on the
> >> test table. To fix this, as you suggested I've converted the test
> >> table to the temp table. Attached is the patch with the changes.
> >> Please have a look and let me know about any concerns.
>
> > Tom, do you have any concerns about this fix?
>
> It seems OK as far as it goes.  Two thoughts:
>
> * Do we need a comment in the test pointing out that the table must be
> temp to ensure that we get stable vacuum results?  Or will the commit
> log message be enough documentation?
>

I'll add a note for this.

> * Should any of the other tables in the test be converted to temp?
> I see that the other test cases are kluging around related issues
> by dint of never committing their tables at all.  It's not clear
> to me how badly those test cases have been distorted by that, or
> whether it means they're testing less-than-typical situations.
>

Are you trying to say that we can achieve what test cases 1 and 2 do by
having a single temp table, and that we should aim for that because it
will make the test case more efficient and easier to maintain? If so, I
will make the necessary changes and submit a new patch for it. Please
confirm.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Tom Lane
Robert Haas  writes:
> On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma  wrote:
>> Thanks for reporting. I'm able to reproduce the issue by creating some
>> delay just before "-- now create an unused line pointer" and use the
>> delay to start a new session either with repeatable read or
>> serializable transaction isolation level and run some query on the
>> test table. To fix this, as you suggested I've converted the test
>> table to the temp table. Attached is the patch with the changes.
>> Please have a look and let me know about any concerns.

> Tom, do you have any concerns about this fix?

It seems OK as far as it goes.  Two thoughts:

* Do we need a comment in the test pointing out that the table must be
temp to ensure that we get stable vacuum results?  Or will the commit
log message be enough documentation?

* Should any of the other tables in the test be converted to temp?
I see that the other test cases are kluging around related issues
by dint of never committing their tables at all.  It's not clear
to me how badly those test cases have been distorted by that, or
whether it means they're testing less-than-typical situations.

Anyway, if you're satisfied with leaving the other cases as-is,
I have no objections.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Robert Haas
On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma  wrote:
> Thanks for reporting. I'm able to reproduce the issue by creating some
> delay just before "-- now create an unused line pointer" and using that
> delay to start a new session, with either repeatable read or
> serializable transaction isolation level, that runs some query on the
> test table. To fix this, as you suggested, I've converted the test
> table to a temp table. Attached is the patch with the changes.
> Please have a look and let me know about any concerns.

Tom, do you have any concerns about this fix?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-14 Thread Ashutosh Sharma
On Sun, Sep 13, 2020 at 3:30 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > I have committed this version.
>
> This failure says that the test case is not entirely stable:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer=2020-09-12%2005%3A13%3A12
>
> diff -U3 /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out
> --- /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out	2020-09-11 06:31:36.0 +
> +++ /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out	2020-09-12 11:40:26.0 +
> @@ -116,7 +116,6 @@
>   vacuum freeze htab2;
>   -- unused TIDs should be skipped
>   select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
> - NOTICE:  skipping tid (0, 2) for relation "htab2" because it is marked unused
>heap_force_kill
>   -
>
>
> sungazer's first run after pg_surgery went in was successful, so it's
> not a hard failure.  I'm guessing that it's timing dependent.
>
> The most obvious theory for the cause is that what VACUUM does with
> a tuple depends on whether the tuple's xmin is below global xmin,
> and a concurrent autovacuum could very easily be holding back global
> xmin.  While I can't easily get autovac to run at just the right
> time, I did verify that a concurrent regular session holding back
> global xmin produces the symptom seen above.  (To replicate, insert
> "select pg_sleep(...)" in heap_surgery.sql before "-- now create an unused
> line pointer"; run make installcheck; and use the delay to connect
> to the database manually, start a serializable transaction, and do
> any query to acquire a snapshot.)
>

Thanks for reporting. I'm able to reproduce the issue by creating some
delay just before "-- now create an unused line pointer" and using that
delay to start a new session, with either repeatable read or
serializable transaction isolation level, that runs some query on the
test table. To fix this, as you suggested, I've converted the test
table to a temp table. Attached is the patch with the changes.
Please have a look and let me know about any concerns.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index 9451c57..d2cf647 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -87,7 +87,7 @@ NOTICE:  skipping tid (0, 6) for relation "htab" because the item number is out
 
 rollback;
 -- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+create temp table htab2(a int);
 insert into htab2 values (100);
 update htab2 set a = 200;
 vacuum htab2;
@@ -175,6 +175,5 @@ ERROR:  "vw" is not a table, materialized view, or TOAST table
 select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 ERROR:  "vw" is not a table, materialized view, or TOAST table
 -- cleanup.
-drop table htab2;
 drop view vw;
 drop extension pg_surgery;
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 8a27214..3635bd5 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -41,7 +41,7 @@ select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
 rollback;
 
 -- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+create temp table htab2(a int);
 insert into htab2 values (100);
 update htab2 set a = 200;
 vacuum htab2;
@@ -86,6 +86,5 @@ select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 
 -- cleanup.
-drop table htab2;
 drop view vw;
 drop extension pg_surgery;


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-12 Thread Tom Lane
Robert Haas  writes:
> I have committed this version.

This failure says that the test case is not entirely stable:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer=2020-09-12%2005%3A13%3A12

diff -U3 /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out
--- /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out	2020-09-11 06:31:36.0 +
+++ /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out	2020-09-12 11:40:26.0 +
@@ -116,7 +116,6 @@
  vacuum freeze htab2;
  -- unused TIDs should be skipped
  select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
- NOTICE:  skipping tid (0, 2) for relation "htab2" because it is marked unused
   heap_force_kill 
  -
   

sungazer's first run after pg_surgery went in was successful, so it's
not a hard failure.  I'm guessing that it's timing dependent.

The most obvious theory for the cause is that what VACUUM does with
a tuple depends on whether the tuple's xmin is below global xmin,
and a concurrent autovacuum could very easily be holding back global
xmin.  While I can't easily get autovac to run at just the right
time, I did verify that a concurrent regular session holding back
global xmin produces the symptom seen above.  (To replicate, insert
"select pg_sleep(...)" in heap_surgery.sql before "-- now create an unused
line pointer"; run make installcheck; and use the delay to connect
to the database manually, start a serializable transaction, and do
any query to acquire a snapshot.)

I suggest that the easiest way to make this test reliable is to
make the test tables be temp tables (which allows dropping the
autovacuum_enabled = off property, too).  In the wake of commit
a7212be8b, that should guarantee that vacuum has stable tuple-level
behavior regardless of what is happening concurrently.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-10 Thread Justin Pryzby
On Thu, Sep 10, 2020 at 11:21:02AM -0400, Robert Haas wrote:
> On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma  wrote:
> > Please have a look into the attached patch for the changes and let me know 
> > for any other concerns. Thank you.
> 
> I have committed this version.

Thanks; I marked it as such in the CF app.
https://commitfest.postgresql.org/29/2700/

-- 
Justin




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-10 Thread Robert Haas
On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma  wrote:
> Please have a look into the attached patch for the changes and let me know 
> for any other concerns. Thank you.

I have committed this version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-01 Thread Masahiko Sawada
On Fri, 28 Aug 2020 at 23:39, Robert Haas  wrote:
>
> On Fri, Aug 28, 2020 at 4:07 AM Masahiko Sawada
>  wrote:
> > You've removed the description about executing VACUUM with
> > DISABLE_PAGE_SKIPPING option on the target relation after using
> > pg_surgery functions from the doc but I guess it’s better to recommend
> > that in the doc for safety. Could you please tell me the reason for
> > removing that?
>
> Well, I think that was added because there wasn't code to clear the
> visibility map bits, either page-level or in the map, but we added code
> for that, so now I don't really see why it's necessary or even
> desirable.
>
> Here are a few example scenarios:
>
> 1. My table is not corrupt. For no particular reason, I force-freeze
> or force-kill a tuple which is neither dead nor all-visible.
> Concurrent queries might return wrong answers, but the table is not
> corrupt. It does not require VACUUM and would not benefit from it.
> Actually, it doesn't need anything at all.
>
> 2. My table is not corrupt. For no particular reason, I force-freeze a
> tuple which is dead. I believe it's possible that the index entries
> for that tuple might be gone already, but VACUUM won't fix that.
> REINDEX or a table rewrite would, though. It's also possible, if the
> dead tuple was added by an aborted transaction which added columns to
> the table, that the tuple might have been created using a tuple
> descriptor that differs from the table's current tuple descriptor. If
> so, I think scanning the table could produce a crash. VACUUM won't fix
> this, either. I would need to delete or force-kill the offending
> tuple.
>
> 3. I have one or more tuples in my table that are intact except that
> they have garbage values for xmin, resulting in VACUUM failure or
> possibly even SELECT failure if the CLOG entries are also missing. I
> force-kill or force-freeze them. If by chance the affected tuples were
> also omitted from one or more indexes, a REINDEX or table rewrite is
> needed to fix them, but a VACUUM will not help. On the other hand, if
> those tuples are present in the indexes, there's no remaining problem
> and VACUUM is not needed for the purpose of restoring the integrity of
> the table. If the problem has been ongoing for a while, VACUUM might
> be needed to advance relfrozenxid, but that doesn't require
> DISABLE_PAGE_SKIPPING.
>
> 4. I have some pages in my table that have incorrect visibility map
> bits. In this case, I need VACUUM (DISABLE_PAGE_SKIPPING). However, I
> don't need the functions we're talking about here at all unless I also
> have tuples with corrupted visibility information. If I do happen to
> have both tuples with corrupted visibility information and also pages
> with incorrect visibility map bits, then I suppose I need both these
> tools and also VACUUM (DISABLE_PAGE_SKIPPING). Probably, I'll want to
> do the VACUUM second. But, if I happened to do the VACUUM first and
> then use these functions afterward, the worst thing that could happen
> is that I might end up with some dead tuples that could've gotten
> removed faster if I'd switched the order. And that's not a disaster.
>
> Basically, I can see no real reason to recommend VACUUM
> (DISABLE_PAGE_SKIPPING) here. There are problems that can be fixed
> with that command, and there are problems that can be fixed by this
> method, but they are mostly independent of each other. We should not
> recommend that people run VACUUM "just in case." That kind of fuzzy
> thinking seems relatively prevalent already, and it leads to people
> spending a lot of time running slow maintenance commands that do
> nothing to help them, and which occasionally make things worse.
>

Thank you for your explanation. That makes a lot of sense to me.

If vacuum could fix a particular kind of problem when used together with
pg_surgery, we could recommend running vacuum. But I agree that heap
table corruption is not such a case.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-28 Thread Robert Haas
On Fri, Aug 28, 2020 at 4:07 AM Masahiko Sawada
 wrote:
> You've removed the description about executing VACUUM with
> DISABLE_PAGE_SKIPPING option on the target relation after using
> pg_surgery functions from the doc but I guess it’s better to recommend
> that in the doc for safety. Could you please tell me the reason for
> removing that?

Well, I think that was added because there wasn't code to clear the
visibility map bits, either page-level or in the map, but we added code
for that, so now I don't really see why it's necessary or even
desirable.

Here are a few example scenarios:

1. My table is not corrupt. For no particular reason, I force-freeze
or force-kill a tuple which is neither dead nor all-visible.
Concurrent queries might return wrong answers, but the table is not
corrupt. It does not require VACUUM and would not benefit from it.
Actually, it doesn't need anything at all.

2. My table is not corrupt. For no particular reason, I force-freeze a
tuple which is dead. I believe it's possible that the index entries
for that tuple might be gone already, but VACUUM won't fix that.
REINDEX or a table rewrite would, though. It's also possible, if the
dead tuple was added by an aborted transaction which added columns to
the table, that the tuple might have been created using a tuple
descriptor that differs from the table's current tuple descriptor. If
so, I think scanning the table could produce a crash. VACUUM won't fix
this, either. I would need to delete or force-kill the offending
tuple.

3. I have one or more tuples in my table that are intact except that
they have garbage values for xmin, resulting in VACUUM failure or
possibly even SELECT failure if the CLOG entries are also missing. I
force-kill or force-freeze them. If by chance the affected tuples were
also omitted from one or more indexes, a REINDEX or table rewrite is
needed to fix them, but a VACUUM will not help. On the other hand, if
those tuples are present in the indexes, there's no remaining problem
and VACUUM is not needed for the purpose of restoring the integrity of
the table. If the problem has been ongoing for a while, VACUUM might
be needed to advance relfrozenxid, but that doesn't require
DISABLE_PAGE_SKIPPING.

4. I have some pages in my table that have incorrect visibility map
bits. In this case, I need VACUUM (DISABLE_PAGE_SKIPPING). However, I
don't need the functions we're talking about here at all unless I also
have tuples with corrupted visibility information. If I do happen to
have both tuples with corrupted visibility information and also pages
with incorrect visibility map bits, then I suppose I need both these
tools and also VACUUM (DISABLE_PAGE_SKIPPING). Probably, I'll want to
do the VACUUM second. But, if I happened to do the VACUUM first and
then use these functions afterward, the worst thing that could happen
is that I might end up with some dead tuples that could've gotten
removed faster if I'd switched the order. And that's not a disaster.

Basically, I can see no real reason to recommend VACUUM
(DISABLE_PAGE_SKIPPING) here. There are problems that can be fixed
with that command, and there are problems that can be fixed by this
method, but they are mostly independent of each other. We should not
recommend that people run VACUUM "just in case." That kind of fuzzy
thinking seems relatively prevalent already, and it leads to people
spending a lot of time running slow maintenance commands that do
nothing to help them, and which occasionally make things worse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-28 Thread Robert Haas
On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma  wrote:
> We can certainly do it this way, but I would still prefer having a comparator
> function (tidcmp) here: it makes the code look a bit cleaner, and it keeps us
> more consistent with the way comparator function arguments are passed to
> qsort at several other places in postgres, which increases code readability
> and simplicity. For example, gin has a comparator function that does the same
> thing as tidcmp is doing here.

Me too. Casting one kind of function pointer to another kind of
function pointer assumes that the compiler is using the same
argument-passing conventions in both cases, which seems slightly
risky. It also means that if the signature for the function were to
diverge further from the signature that we need in the future, the
compiler might not warn us about it. Perhaps there is some case where
the performance gains would be sufficiently to justify those risks,
but this is certainly not that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-28 Thread Ashutosh Sharma
> > The tidcmp function can be removed, and ItemPointerCompare used directly by qsort as:
> >
> > -   qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
> > +   qsort((void*) tids, ntids, sizeof(ItemPointerData),
> > + (int (*) (const void *, const void *)) ItemPointerCompare);
> >
>
> Will have a look into this.
>

We can certainly do it this way, but I would still prefer having a comparator
function (tidcmp) here: it makes the code look a bit cleaner, and it keeps us
more consistent with the way comparator function arguments are passed to
qsort at several other places in postgres, which increases code readability
and simplicity. For example, gin has a comparator function that does the same
thing as tidcmp is doing here. See below:

static int
qsortCompareItemPointers(const void *a, const void *b)
{
	int			res = ginCompareItemPointers((ItemPointer) a, (ItemPointer) b);

	/* Assert that there are no equal item pointers being sorted */
	Assert(res != 0);
	return res;
}

In this case as well, it could have been done the way you are suggesting, but
writing a small comparator function with the prototype that qsort accepts
seemed like the better option. Considering this, I am leaving it as-is.
Please let me know if you feel otherwise.

> > The header comment for function find_tids_one_page should state the
requirement that the tids array must be sorted.
> >
>
> Okay, will add a comment for this.
>

Added a comment for this in the attached patch.

Please have a look at the attached patch for the changes and let me know
of any other concerns. Thank you.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
From ee49cfbfc153bbf6212d90ac81a2f9d13fa0eef7 Mon Sep 17 00:00:00 2001
From: ashu 
Date: Fri, 28 Aug 2020 15:00:41 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged
 relation.

This contrib module provides couple of functions named heap_force_kill
and heap_force_freeze that can be used to perform surgery on a damaged
heap table.

Author: Ashutosh Sharma and Robert Haas.
Reviewed by: Sawada Masahiko, Andrey Borodin, Beena Emerson, Mark Dilger and Robert Haas
Tested by: Rajkumar Raghuwanshi
Discussion: https://www.postgresql.org/message-id/CA%2BTgmoZW1fsU-QUNCRUQMGUygBDPVeOTLCqRdQZch%3DEYZnctSA%40mail.gmail.com
---
 contrib/Makefile |   1 +
 contrib/pg_surgery/.gitignore|   4 +
 contrib/pg_surgery/Makefile  |  23 ++
 contrib/pg_surgery/expected/heap_surgery.out | 180 +++
 contrib/pg_surgery/heap_surgery.c| 428 +++
 contrib/pg_surgery/pg_surgery--1.0.sql   |  18 ++
 contrib/pg_surgery/pg_surgery.control|   5 +
 contrib/pg_surgery/sql/heap_surgery.sql  |  91 ++
 doc/src/sgml/contrib.sgml|   1 +
 doc/src/sgml/filelist.sgml   |   1 +
 doc/src/sgml/pgsurgery.sgml  | 107 +++
 src/tools/pgindent/typedefs.list |   1 +
 12 files changed, 860 insertions(+)
 create mode 100644 contrib/pg_surgery/.gitignore
 create mode 100644 contrib/pg_surgery/Makefile
 create mode 100644 contrib/pg_surgery/expected/heap_surgery.out
 create mode 100644 contrib/pg_surgery/heap_surgery.c
 create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
 create mode 100644 contrib/pg_surgery/pg_surgery.control
 create mode 100644 contrib/pg_surgery/sql/heap_surgery.sql
 create mode 100644 doc/src/sgml/pgsurgery.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..c8d2a16 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
 		pg_prewarm	\
 		pg_standby	\
 		pg_stat_statements \
+		pg_surgery	\
 		pg_trgm		\
 		pgcrypto	\
 		pgrowlocks	\
diff --git a/contrib/pg_surgery/.gitignore b/contrib/pg_surgery/.gitignore
new file mode 100644
index 000..5dcb3ff
--- /dev/null
+++ b/contrib/pg_surgery/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 000..a66776c
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+	$(WIN32RES) \
+	heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = heap_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
new file mode 100644
index 000..9451c57
--- /dev/null
+++ 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-28 Thread Masahiko Sawada
On Fri, 28 Aug 2020 at 05:14, Robert Haas  wrote:
>
> On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma  
> wrote:
> > Okay, point noted.
>
> I spent some time today working on this patch. I'm fairly happy with
> it now and intend to commit it if nobody sees a big problem with that.
> Per discussion, I do not intend to back-patch at this time. The two
> most significant changes I made to your version are:

Thank you for updating the patch.

>
> 1. I changed things around to avoid using any form of ereport() in a
> critical section. I'm not actually sure whether it is project policy
> to avoid ereport(NOTICE, ...) or similar in a critical section, but it
> seems prudent: if anything fails in a critical section, we will PANIC,
> so doing fewer things there is safer.
>
> 2. I changed the code so that it does not try to follow redirected
> line pointers; instead, it skips them with an appropriate message, as
> we were already doing for dead and unused line pointers. I think the
> way you had it coded might've been my suggestion originally, but the
> more I looked into it the less I liked it. One problem is that it
> didn't match the docs. A second is that following a corrupted line
> pointer might index off the end of the line pointer array, and while
> that probably shouldn't happen, we are talking about corruption
> recovery here. Then I realized that, as you coded it, if the line
> pointer was redirected to a line pointer that is in turn dead (or
> unused, if there's corruption) the user would get a NOTICE complaining
> about a TID they hadn't specified, which seems like it would be very
> confusing. I thought about trying to fix all that stuff, but it just
> didn't seem worth it, because I can't think of a good reason to pass
> this function the TID of a redirected line pointer in the first place.
> If you're doing surgery, you should probably specify the exact thing
> upon which you want to operate, not some other thing that points to
> it.
>
> Here is a list of other changes I made:
>
> * Added a .gitignore file.
> * Renamed the regression test file from pg_surgery to heap_surgery to
> match the name of the single C source file we currently have.
> * Capitalized TID in a few places.
> * Ran pgindent.
> * Adjusted various comments.
> * Removed the check for an empty TID array. I don't see any reason why
> this should be an error case and I don't see much precedent for having
> such a check.
> * Fixed the code to work properly with that change and added a test case.
> * Added a check that the array is not multi-dimensional.
> * Put the AM type check before the relkind check, following existing 
> precedent.
> * Adjusted the check to use the AM OID rather than the handler OID,
> following existing precedent. Fixed the message wording accordingly.
> * Changed the documentation wording to say less about specific
> recovery procedures and focus more on the general idea that this is
> dangerous.

You've removed from the doc the description about executing VACUUM with
the DISABLE_PAGE_SKIPPING option on the target relation after using the
pg_surgery functions, but I think it's better to recommend that for
safety. Could you please tell me the reason for removing it?

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-27 Thread Ashutosh Sharma
Hi Mark,

Thanks for the review. Please find my comments inline below:

> HeapTupleForceOption should be added to src/tools/pgindent/typedefs.list.
>

This has been fixed in the v9 patch.

>
> The tidcmp function can be removed, and ItemPointerCompare used directly by 
> qsort as:
>
> -   qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
> +   qsort((void*) tids, ntids, sizeof(ItemPointerData),
> + (int (*) (const void *, const void *)) 
> ItemPointerCompare);
>

Will have a look into this.

>
> sanity_check_tid_array() has two error messages:
>
>   "array must not contain nulls"
>   "empty tid array"
>
> I would change the first to say "tid array must not contain nulls", as "tid" 
> is the name of the parameter being checked.  It is also more consistent with 
> the second error message, but that doesn't matter to me so much, as I'd argue 
> for removing the second check.  I don't see why an empty array should draw an 
> error.  It seems more reasonable to just return early since there is no work 
> to do.  Consider if somebody uses a function that returns the tids for all 
> corrupt tuples in a table, aggregates that into an array, and hands that to 
> this function.  It doesn't seem like an error for that aggregated array to 
> have zero elements in it.  I suppose you could emit a NOTICE in this case?
>

This comment is no longer valid given the changes made in the v9 patch.

>
> Upthread:
> > On Aug 13, 2020, at 12:03 PM, Robert Haas  wrote:
> >
> >> This looks like a very good suggestion to me. I will do this change in
> >> the next version. Just wondering if we should be doing similar changes
> >> in other contrib modules (like pgrowlocks, pageinspect and
> >> pgstattuple) as well?
> >
> > It seems like it should be consistent, but I'm not sure the proposed
> > change is really an improvement.
>
> You have used Asim's proposed check:
>
> if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
> ereport(ERROR,
> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>  errmsg("only the relation using heap_tableam_handler is 
> supported")));
>
> which Robert seems unenthusiastic about, but if you are going that direction, 
> I think at least the language of the error message should be changed. I 
> recommend something like:
>
> if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
> ereport(ERROR,
> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -errmsg("only the relation using 
> heap_tableam_handler is supported")));
> +errmsg("\"%s\" does not use a heap access 
> method",
> +   
> RelationGetRelationName(rel)));
>
> where "a heap access method" could also be written as "a heap table type 
> access method", "a heap table compatible access method", and so forth.  There 
> doesn't seem to be enough precedent to dictate exactly how to phrase this, or 
> perhaps I'm just not looking in the right place.
>

Same here. This also no longer applies given the changes made in the v9 patch.

>
> The header comment for function find_tids_one_page should state the 
> requirement that the tids array must be sorted.
>

Okay, will add a comment for this.

>
> The heap_force_common function contains multiple ereport(NOTICE,...) within a 
> critical section.  I don't think that is normal practice.  Can those reports 
> be buffered until after the critical section is exited?  You also have a 
> CHECK_FOR_INTERRUPTS within the critical section.
>

This has been fixed in the v9 patch.

> [1] https://commitfest.postgresql.org/29/2700/
> —

Thanks for adding a commitfest entry for this.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-27 Thread Ashutosh Sharma
On Fri, Aug 28, 2020 at 1:44 AM Robert Haas  wrote:
>
> On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma  
> wrote:
> > Okay, point noted.
>
> I spent some time today working on this patch. I'm fairly happy with
> it now and intend to commit it if nobody sees a big problem with that.
> Per discussion, I do not intend to back-patch at this time. The two
> most significant changes I made to your version are:
>
> 1. I changed things around to avoid using any form of ereport() in a
> critical section. I'm not actually sure whether it is project policy
> to avoid ereport(NOTICE, ...) or similar in a critical section, but it
> seems prudent, because if anything fails in a critical section, we
> will PANIC, so doing fewer things there seems prudent.
>
> 2. I changed the code so that it does not try to follow redirected
> line pointers; instead, it skips them with an appropriate message, as
> we were already doing for dead and unused line pointers. I think the
> way you had it coded might've been my suggestion originally, but the
> more I looked into it the less I liked it. One problem is that it
> didn't match the docs. A second is that following a corrupted line
> pointer might index off the end of the line pointer array, and while
> that probably shouldn't happen, we are talking about corruption
> recovery here. Then I realized that, as you coded it, if the line
> pointer was redirected to a line pointer that is in turn dead (or
> unused, if there's corruption) the user would get a NOTICE complaining
> about a TID they hadn't specified, which seems like it would be very
> confusing. I thought about trying to fix all that stuff, but it just
> didn't seem worth it, because I can't think of a good reason to pass
> this function the TID of a redirected line pointer in the first place.
> If you're doing surgery, you should probably specify the exact thing
> upon which you want to operate, not some other thing that points to
> it.
>
> Here is a list of other changes I made:
>
> * Added a .gitignore file.
> * Renamed the regression test file from pg_surgery to heap_surgery to
> match the name of the single C source file we currently have.
> * Capitalized TID in a few places.
> * Ran pgindent.
> * Adjusted various comments.
> * Removed the check for an empty TID array. I don't see any reason why
> this should be an error case and I don't see much precedent for having
> such a check.
> * Fixed the code to work properly with that change and added a test case.
> * Added a check that the array is not multi-dimensional.
> * Put the AM type check before the relkind check, following existing 
> precedent.
> * Adjusted the check to use the AM OID rather than the handler OID,
> following existing precedent. Fixed the message wording accordingly.
> * Changed the documentation wording to say less about specific
> recovery procedures and focus more on the general idea that this is
> dangerous.
> * Removed all but one of the test cases that checked what happens if
> you use this on a non-heap; three tests for basically the same thing
> seemed excessive.
> * Added some additional tests to improve code coverage. There are now
> only a handful of lines not covered.
> * Reorganized the test cases somewhat.
>
> New patch attached.
>

Thank you Robert for the patch. I've looked into the changes you've
made to the v8 patch and they all look good to me.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-27 Thread Mark Dilger



> On Aug 26, 2020, at 4:36 AM, Ashutosh Sharma  wrote:
> 
> This patch also takes care of all the other review comments from - [1].
> 
> [1] - 
> https://www.postgresql.org/message-id/CA%2Bfd4k6%2BJWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA%40mail.gmail.com
> 
> 
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
> 


Hi Ashutosh,

I took a look at the v8 patch, created a commitfest entry [1] because I did not 
find one already existent, and have the following review comments:


HeapTupleForceOption should be added to src/tools/pgindent/typedefs.list.


The tidcmp function can be removed, and ItemPointerCompare used directly by 
qsort as:

-   qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+   qsort((void*) tids, ntids, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) 
ItemPointerCompare);


sanity_check_tid_array() has two error messages:

  "array must not contain nulls"
  "empty tid array"

I would change the first to say "tid array must not contain nulls", as "tid" is 
the name of the parameter being checked.  It is also more consistent with the 
second error message, but that doesn't matter to me so much, as I'd argue for 
removing the second check.  I don't see why an empty array should draw an 
error.  It seems more reasonable to just return early since there is no work to 
do.  Consider if somebody uses a function that returns the tids for all corrupt 
tuples in a table, aggregates that into an array, and hands that to this 
function.  It doesn't seem like an error for that aggregated array to have zero 
elements in it.  I suppose you could emit a NOTICE in this case?


Upthread:
> On Aug 13, 2020, at 12:03 PM, Robert Haas  wrote:
> 
>> This looks like a very good suggestion to me. I will do this change in
>> the next version. Just wondering if we should be doing similar changes
>> in other contrib modules (like pgrowlocks, pageinspect and
>> pgstattuple) as well?
> 
> It seems like it should be consistent, but I'm not sure the proposed
> change is really an improvement.

You have used Asim's proposed check:

if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("only the relation using heap_tableam_handler is 
supported")));

which Robert seems unenthusiastic about, but if you are going that direction, I 
think at least the language of the error message should be changed. I recommend 
something like:

if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-errmsg("only the relation using 
heap_tableam_handler is supported")));
+errmsg("\"%s\" does not use a heap access 
method",
+   RelationGetRelationName(rel)));

where "a heap access method" could also be written as "a heap table type access 
method", "a heap table compatible access method", and so forth.  There doesn't 
seem to be enough precedent to dictate exactly how to phrase this, or perhaps 
I'm just not looking in the right place.


The header comment for function find_tids_one_page should state the requirement 
that the tids array must be sorted.


The heap_force_common function contains multiple ereport(NOTICE,...) within a 
critical section.  I don't think that is normal practice.  Can those reports be 
buffered until after the critical section is exited?  You also have a 
CHECK_FOR_INTERRUPTS within the critical section.

[1] https://commitfest.postgresql.org/29/2700/
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-27 Thread Robert Haas
On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma  wrote:
> Okay, point noted.

I spent some time today working on this patch. I'm fairly happy with
it now and intend to commit it if nobody sees a big problem with that.
Per discussion, I do not intend to back-patch at this time. The two
most significant changes I made to your version are:

1. I changed things around to avoid using any form of ereport() in a
critical section. I'm not actually sure whether it is project policy
to avoid ereport(NOTICE, ...) or similar in a critical section, but it
seems prudent, because if anything fails in a critical section, we
will PANIC, so doing fewer things there seems prudent.

2. I changed the code so that it does not try to follow redirected
line pointers; instead, it skips them with an appropriate message, as
we were already doing for dead and unused line pointers. I think the
way you had it coded might've been my suggestion originally, but the
more I looked into it the less I liked it. One problem is that it
didn't match the docs. A second is that following a corrupted line
pointer might index off the end of the line pointer array, and while
that probably shouldn't happen, we are talking about corruption
recovery here. Then I realized that, as you coded it, if the line
pointer was redirected to a line pointer that is in turn dead (or
unused, if there's corruption) the user would get a NOTICE complaining
about a TID they hadn't specified, which seems like it would be very
confusing. I thought about trying to fix all that stuff, but it just
didn't seem worth it, because I can't think of a good reason to pass
this function the TID of a redirected line pointer in the first place.
If you're doing surgery, you should probably specify the exact thing
upon which you want to operate, not some other thing that points to
it.

Here is a list of other changes I made:

* Added a .gitignore file.
* Renamed the regression test file from pg_surgery to heap_surgery to
match the name of the single C source file we currently have.
* Capitalized TID in a few places.
* Ran pgindent.
* Adjusted various comments.
* Removed the check for an empty TID array. I don't see any reason why
this should be an error case and I don't see much precedent for having
such a check.
* Fixed the code to work properly with that change and added a test case.
* Added a check that the array is not multi-dimensional.
* Put the AM type check before the relkind check, following existing precedent.
* Adjusted the check to use the AM OID rather than the handler OID,
following existing precedent. Fixed the message wording accordingly.
* Changed the documentation wording to say less about specific
recovery procedures and focus more on the general idea that this is
dangerous.
* Removed all but one of the test cases that checked what happens if
you use this on a non-heap; three tests for basically the same thing
seemed excessive.
* Added some additional tests to improve code coverage. There are now
only a handful of lines not covered.
* Reorganized the test cases somewhat.

New patch attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v9-0001-pg_surgery-rmh-based-on-ashutosh-sharma-v8.patch
Description: Binary data


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-26 Thread Ashutosh Sharma
On Wed, Aug 26, 2020 at 9:19 PM Robert Haas  wrote:
>
> On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma  wrote:
> > Removed this note from the documentation and added a note saying: "The
> > user needs to ensure that they do not operate pg_force_freeze function
> > on a deleted tuple because it may revive the deleted tuple."
>
> I do not agree with that note, either. I believe that trying to tell
> people what things specifically they should do or avoid doing with the
> tool is the wrong approach. Instead, the thrust of the message should
> be to tell people that if you use this, it may corrupt your database,
> and that's your problem. The difficulty with telling people what
> specifically they ought to avoid doing is that experts will be annoyed
> to be told that something is not safe when they know that it is fine,
> and non-experts will think that some uses are safer than they really
> are.
>

Okay, point noted.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-26 Thread Robert Haas
On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma  wrote:
> Removed this note from the documentation and added a note saying: "The
> user needs to ensure that they do not operate pg_force_freeze function
> on a deleted tuple because it may revive the deleted tuple."

I do not agree with that note, either. I believe that trying to tell
people what things specifically they should do or avoid doing with the
tool is the wrong approach. Instead, the thrust of the message should
be to tell people that if you use this, it may corrupt your database,
and that's your problem. The difficulty with telling people what
specifically they ought to avoid doing is that experts will be annoyed
to be told that something is not safe when they know that it is fine,
and non-experts will think that some uses are safer than they really
are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-26 Thread Ashutosh Sharma
Thanks for the review. Please find my comments inline below:

On Tue, Aug 25, 2020 at 5:47 PM Masahiko Sawada
 wrote:
>
> Let me share other comments on the latest version patch:
>
> Some words need to be tagged. For instance, I found the following words:
>
> VACUUM
> DISABLE_PAGE_SKIPPING
> HEAP_XMIN_FROZEN
> HEAP_XMAX_INVALID
>

Okay, done.

> ---
> +test=# select ctid from t1 where xmin = 507;
> + ctid
> +---
> + (0,3)
> +(1 row)
> +
> +test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
> +-[ RECORD 1 ]-+-
> +heap_force_freeze |
>
> I think it's better to use a consistent output format. The former uses
> the normal format whereas the latter uses the expanded format.
>

Yep, makes sense, done.

> ---
> + 
> + 
> +  While performing surgery on a damaged relation, we must not be doing 
> anything
> +  else on that relation in parallel. This is to ensure that when we are
> +  operating on a damaged tuple there is no other transaction trying to modify
> +  that tuple.
> + 
> + 
>
> If we prefer to avoid concurrent operations on the target relation why
> don't we use AccessExclusiveLock?
>

Removed this note from the documentation and added a note saying: "The
user needs to ensure that they do not operate pg_force_freeze function
on a deleted tuple because it may revive the deleted tuple."

> ---
> +CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'heap_force_kill'
> +LANGUAGE C STRICT;
>
> +CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'heap_force_freeze'
> +LANGUAGE C STRICT;
>
> I think these functions should be PARALLEL UNSAFE.
>

By default the functions are marked PARALLEL UNSAFE, so I think there
is nothing to do here.

Attached patch with above changes.

This patch also takes care of all the other review comments from - [1].

[1] - 
https://www.postgresql.org/message-id/CA%2Bfd4k6%2BJWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA%40mail.gmail.com


--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
From a8067f03cf5ea900790a96ea0059ae080a164ed6 Mon Sep 17 00:00:00 2001
From: ashu 
Date: Wed, 26 Aug 2020 15:57:59 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
 table.

This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.

Ashutosh Sharma.
---
 contrib/Makefile   |   1 +
 contrib/pg_surgery/Makefile|  23 ++
 contrib/pg_surgery/expected/pg_surgery.out | 161 
 contrib/pg_surgery/heap_surgery.c  | 404 +
 contrib/pg_surgery/pg_surgery--1.0.sql |  18 ++
 contrib/pg_surgery/pg_surgery.control  |   5 +
 contrib/pg_surgery/sql/pg_surgery.sql  |  89 +++
 doc/src/sgml/contrib.sgml  |   1 +
 doc/src/sgml/filelist.sgml |   1 +
 doc/src/sgml/pgsurgery.sgml| 144 ++
 10 files changed, 847 insertions(+)
 create mode 100644 contrib/pg_surgery/Makefile
 create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
 create mode 100644 contrib/pg_surgery/heap_surgery.c
 create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
 create mode 100644 contrib/pg_surgery/pg_surgery.control
 create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
 create mode 100644 doc/src/sgml/pgsurgery.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..c8d2a16 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
 		pg_prewarm	\
 		pg_standby	\
 		pg_stat_statements \
+		pg_surgery	\
 		pg_trgm		\
 		pgcrypto	\
 		pgrowlocks	\
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+	$(WIN32RES) \
+	heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 000..9858de2
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,161 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a 
+---

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-25 Thread Ashutosh Sharma
On Tue, Aug 25, 2020 at 11:51 PM Robert Haas  wrote:
>
> On Tue, Aug 25, 2020 at 8:17 AM Masahiko Sawada
>  wrote:
> > + 
> > + 
> > +  While performing surgery on a damaged relation, we must not be doing 
> > anything
> > +  else on that relation in parallel. This is to ensure that when we are
> > +  operating on a damaged tuple there is no other transaction trying to 
> > modify
> > +  that tuple.
> > + 
> > + 
> >
> > If we prefer to avoid concurrent operations on the target relation why
> > don't we use AccessExclusiveLock?
>
> I disagree with the content of the note. It's up to the user whether
> to perform any concurrent operations on the target relation, and in
> many cases it would be fine to do so. Users who can afford to take the
> table off-line to repair the problem don't really need this tool in
> the first place.
>

The only reason I added this note was to ensure that we do not revive
the tuple that is deleted but not yet vacuumed. There is one
corner-case scenario as reported by you in - [1] where you have
explained a scenario under which vacuum can report "found xmin ...
from before relfrozenxid ..." sort of error for the  deleted tuples.
And as per the explanation provided there, it can happen when there
are multiple transactions operating on the same tuple. However, I
think we can take care of this scenario by doing some code changes in
heap_force_freeze to identify the deleted tuples and maybe skip such
tuples. So, yeah, I will do the code changes for handling this and
remove the note added in the documentation. Thank you Robert and
Masahiko-san for pointing this out.

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmobfJ8CkabKJZ-1FGfvbSz%2Bb8bBX807Y6hHEtVfzVe%2Bg6A%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-25 Thread Robert Haas
On Tue, Aug 25, 2020 at 8:17 AM Masahiko Sawada
 wrote:
> + 
> + 
> +  While performing surgery on a damaged relation, we must not be doing 
> anything
> +  else on that relation in parallel. This is to ensure that when we are
> +  operating on a damaged tuple there is no other transaction trying to modify
> +  that tuple.
> + 
> + 
>
> If we prefer to avoid concurrent operations on the target relation why
> don't we use AccessExclusiveLock?

I disagree with the content of the note. It's up to the user whether
to perform any concurrent operations on the target relation, and in
many cases it would be fine to do so. Users who can afford to take the
table off-line to repair the problem don't really need this tool in
the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-25 Thread Ashutosh Sharma
Hi Masahiko-san,

Thank you for the review. Please check my comments inline below:

On Tue, Aug 25, 2020 at 1:39 PM Masahiko Sawada
 wrote:
>
> On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma  wrote:
> >
> > Hi Masahiko-san,
> >
> > Please find the updated patch with the following new changes:
> >
>
> Thank you for updating the patch!
>
> > 1) It adds the code changes in heap_force_kill function to clear an
> > all-visible bit on the visibility map corresponding to the page that
> > is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
> > flag on the page header.
>
> I think we need to clear all visibility map bits by using
> VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but
> not all-visible bit, which is not a valid state.
>

Yeah, makes sense, I will do that change in the next version of patch.

> >
> > 2) It adds the code changes in heap_force_freeze function to reset the
> > ctid value in a tuple header if it is corrupted.
> >
> > 3) It adds several notes and examples in the documentation stating
> > when and how we need to use the functions provided by this module.
> >
> > Please have a look and let me know for any other concern.
> >
>
> And here are small comments on the heap_surgery.c:
>
> +   /*
> +* Get the offset numbers from the tids belonging to one particular 
> page
> +* and process them one by one.
> +*/
> +   blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
> +offnos);
> +
> +   /* Calculate the number of offsets stored in offnos array. */
> +   noffs = next_start_ptr - curr_start_ptr;
> +
> +   /*
> +* Update the current start pointer so that next time when
> +* tids_same_page_fetch_offnums() is called, we can calculate the 
> number
> +* of offsets present in the offnos array.
> +*/
> +   curr_start_ptr = next_start_ptr;
> +
> +   /* Check whether the block number is valid. */
> +   if (blkno >= nblocks)
> +   {
> +   ereport(NOTICE,
> +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("skipping block %u for relation \"%s\"
> because the block number is out of range",
> +   blkno, RelationGetRelationName(rel)));
> +   continue;
> +   }
> +
> +   CHECK_FOR_INTERRUPTS();
>
> I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
> of the do loop for safety. I think it's unlikely to happen but the
> user might mistakenly specify a lot of wrong block numbers.
>

Okay, np, will shift it to top of the do loop.

> 
> +   offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
> +   noffs = curr_start_ptr = next_start_ptr = 0;
> +   nblocks = RelationGetNumberOfBlocks(rel);
> +
> +   do
> +   {
>
> (snip)
>
> +
> +   /*
> +* Get the offset numbers from the tids belonging to one particular 
> page
> +* and process them one by one.
> +*/
> +   blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
> +offnos);
> +
> +   /* Calculate the number of offsets stored in offnos array. */
> +   noffs = next_start_ptr - curr_start_ptr;
> +
>
> (snip)
>
> +   /* No ereport(ERROR) from here until all the changes are logged. */
> +   START_CRIT_SECTION();
> +
> +   for (i = 0; i < noffs; i++)
>
> You copy all offset numbers belonging to the same page to palloc'd
> array, offnos, and iterate it while processing the tuples. I might be
> missing something but I think we can do that without allocating the
> space for offset numbers. Is there any reason for this? I guess we can
> do that by just iterating the sorted tids array.
>

Hmmm.. okay, I see your point. I think probably what you are trying to
suggest here is to make use of the current and next start pointers to
get the tids belonging to the same page and process them one by one
instead of fetching the offset numbers of all tids belonging to one
page into the offnos array and then iterate through the offnos array.
I think that is probably possible and I will try to do that in the
next version of patch. If there is something else that you have in
your mind, please let me know.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-25 Thread Masahiko Sawada
On Tue, 25 Aug 2020 at 17:08, Masahiko Sawada
 wrote:
>
> On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma  wrote:
> >
> > Hi Masahiko-san,
> >
> > Please find the updated patch with the following new changes:
> >
>
> Thank you for updating the patch!
>
> > 1) It adds the code changes in heap_force_kill function to clear an
> > all-visible bit on the visibility map corresponding to the page that
> > is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
> > flag on the page header.
>
> I think we need to clear all visibility map bits by using
> VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but
> not all-visible bit, which is not a valid state.
>
> >
> > 2) It adds the code changes in heap_force_freeze function to reset the
> > ctid value in a tuple header if it is corrupted.
> >
> > 3) It adds several notes and examples in the documentation stating
> > when and how we need to use the functions provided by this module.
> >
> > Please have a look and let me know for any other concern.
> >
>
> And here are small comments on the heap_surgery.c:
>
> +   /*
> +* Get the offset numbers from the tids belonging to one particular 
> page
> +* and process them one by one.
> +*/
> +   blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
> +offnos);
> +
> +   /* Calculate the number of offsets stored in offnos array. */
> +   noffs = next_start_ptr - curr_start_ptr;
> +
> +   /*
> +* Update the current start pointer so that next time when
> +* tids_same_page_fetch_offnums() is called, we can calculate the 
> number
> +* of offsets present in the offnos array.
> +*/
> +   curr_start_ptr = next_start_ptr;
> +
> +   /* Check whether the block number is valid. */
> +   if (blkno >= nblocks)
> +   {
> +   ereport(NOTICE,
> +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("skipping block %u for relation \"%s\"
> because the block number is out of range",
> +   blkno, RelationGetRelationName(rel)));
> +   continue;
> +   }
> +
> +   CHECK_FOR_INTERRUPTS();
>
> I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
> of the do loop for safety. I think it's unlikely to happen but the
> user might mistakenly specify a lot of wrong block numbers.
>
> 
> +   offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
> +   noffs = curr_start_ptr = next_start_ptr = 0;
> +   nblocks = RelationGetNumberOfBlocks(rel);
> +
> +   do
> +   {
>
> (snip)
>
> +
> +   /*
> +* Get the offset numbers from the tids belonging to one particular 
> page
> +* and process them one by one.
> +*/
> +   blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
> +offnos);
> +
> +   /* Calculate the number of offsets stored in offnos array. */
> +   noffs = next_start_ptr - curr_start_ptr;
> +
>
> (snip)
>
> +   /* No ereport(ERROR) from here until all the changes are logged. */
> +   START_CRIT_SECTION();
> +
> +   for (i = 0; i < noffs; i++)
>
> You copy all offset numbers belonging to the same page to palloc'd
> array, offnos, and iterate it while processing the tuples. I might be
> missing something but I think we can do that without allocating the
> space for offset numbers. Is there any reason for this? I guess we can
> do that by just iterating the sorted tids array.
>

Let me share other comments on the latest version patch:

Some words need to be tagged. For instance, I found the following words:

VACUUM
DISABLE_PAGE_SKIPPING
HEAP_XMIN_FROZEN
HEAP_XMAX_INVALID

---
+test=# select ctid from t1 where xmin = 507;
+ ctid
+---
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+-[ RECORD 1 ]-+-
+heap_force_freeze |

I think it's better to use a consistent output format. The former uses
the normal format whereas the latter uses the expanded format.

---
+ <note>
+ <para>
+  While performing surgery on a damaged relation, we must not be doing anything
+  else on that relation in parallel. This is to ensure that when we are
+  operating on a damaged tuple there is no other transaction trying to modify
+  that tuple.
+ </para>
+ </note>

If we prefer to avoid concurrent operations on the target relation why
don't we use AccessExclusiveLock?

---
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;

+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;

I think these functions should be PARALLEL UNSAFE.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-25 Thread Masahiko Sawada
On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma  wrote:
>
> Hi Masahiko-san,
>
> Please find the updated patch with the following new changes:
>

Thank you for updating the patch!

> 1) It adds the code changes in heap_force_kill function to clear an
> all-visible bit on the visibility map corresponding to the page that
> is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
> flag on the page header.

I think we need to clear all visibility map bits by using
VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but
not all-visible bit, which is not a valid state.

>
> 2) It adds the code changes in heap_force_freeze function to reset the
> ctid value in a tuple header if it is corrupted.
>
> 3) It adds several notes and examples in the documentation stating
> when and how we need to use the functions provided by this module.
>
> Please have a look and let me know if you have any other concerns.
>

And here are small comments on the heap_surgery.c:

+   /*
+* Get the offset numbers from the tids belonging to one particular page
+* and process them one by one.
+*/
+   blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+offnos);
+
+   /* Calculate the number of offsets stored in offnos array. */
+   noffs = next_start_ptr - curr_start_ptr;
+
+   /*
+* Update the current start pointer so that next time when
+* tids_same_page_fetch_offnums() is called, we can calculate the number
+* of offsets present in the offnos array.
+*/
+   curr_start_ptr = next_start_ptr;
+
+   /* Check whether the block number is valid. */
+   if (blkno >= nblocks)
+   {
+   ereport(NOTICE,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("skipping block %u for relation \"%s\"
because the block number is out of range",
+   blkno, RelationGetRelationName(rel))));
+   continue;
+   }
+
+   CHECK_FOR_INTERRUPTS();

I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
of the do loop for safety. I think it's unlikely to happen but the
user might mistakenly specify a lot of wrong block numbers.


+   offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+   noffs = curr_start_ptr = next_start_ptr = 0;
+   nblocks = RelationGetNumberOfBlocks(rel);
+
+   do
+   {

(snip)

+
+   /*
+* Get the offset numbers from the tids belonging to one particular page
+* and process them one by one.
+*/
+   blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+offnos);
+
+   /* Calculate the number of offsets stored in offnos array. */
+   noffs = next_start_ptr - curr_start_ptr;
+

(snip)

+   /* No ereport(ERROR) from here until all the changes are logged. */
+   START_CRIT_SECTION();
+
+   for (i = 0; i < noffs; i++)

You copy all offset numbers belonging to the same page to palloc'd
array, offnos, and iterate it while processing the tuples. I might be
missing something but I think we can do that without allocating the
space for offset numbers. Is there any reason for this? I guess we can
do that by just iterating the sorted tids array.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-24 Thread Ashutosh Sharma
On Mon, Aug 24, 2020 at 11:00 PM Andrey M. Borodin  wrote:
>
> Hi!
>
> > 21 авг. 2020 г., в 18:24, Ashutosh Sharma  
> > написал(а):
> >
> > Please find the updated patch with the following new changes:
>
> Do you have plans to support pg_surgery as an external extension, for
> earlier versions of Postgres and for new features, the way amcheck_next is
> maintained?
> ISTM that I'll have to use something like that tomorrow and I'm in doubt -
> should I resurrect our pg_dirty_hands or try your new pg_surgery...
>

AFAICS, we don't have any plans to support pg_surgery as an external
extension as of now. Based on the discussion that has happened earlier
in this thread, I think we might also back-patch this contrib module.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-24 Thread Ashutosh Sharma
On Mon, Aug 24, 2020 at 7:15 PM Robert Haas  wrote:
>
> On Tue, Aug 18, 2020 at 12:14 PM Alvaro Herrera
>  wrote:
> > It makes sense to recommend VACUUM after fixing the page, but I agree
> > with Sawada-san that it would be sensible to reset the VM bit while
> > doing surgery, since that's the state that the page would be in.  We
> > should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
> > but if users fail to do so, then leaving the VM bit set just means that
> > we know *for certain* that there will be further corruption as soon as
> > the XID counter advances sufficiently.
>
> +1.
>

This has been taken care of in the latest v7 patch.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-24 Thread Andrey M. Borodin
Hi!

> 21 авг. 2020 г., в 18:24, Ashutosh Sharma  написал(а):
> 
> Please find the updated patch with the following new changes:

Do you have plans to support pg_surgery as an external extension, for earlier
versions of Postgres and for new features, the way amcheck_next is maintained?
ISTM that I'll have to use something like that tomorrow and I'm in doubt -
should I resurrect our pg_dirty_hands or try your new pg_surgery...

Thanks!

Best regards, Andrey Borodin.



Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-24 Thread Robert Haas
On Tue, Aug 18, 2020 at 12:14 PM Alvaro Herrera
 wrote:
> It makes sense to recommend VACUUM after fixing the page, but I agree
> with Sawada-san that it would be sensible to reset the VM bit while
> doing surgery, since that's the state that the page would be in.  We
> should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
> but if users fail to do so, then leaving the VM bit set just means that
> we know *for certain* that there will be further corruption as soon as
> the XID counter advances sufficiently.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-21 Thread Ashutosh Sharma
Hi Masahiko-san,

Please find the updated patch with the following new changes:

1) It adds the code changes in heap_force_kill function to clear an
all-visible bit on the visibility map corresponding to the page that
is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
flag on the page header.

2) It adds the code changes in heap_force_freeze function to reset the
ctid value in a tuple header if it is corrupted.

3) It adds several notes and examples in the documentation stating
when and how we need to use the functions provided by this module.

Please have a look and let me know if you have any other concerns.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, Aug 20, 2020 at 11:43 AM Ashutosh Sharma  wrote:
>
> On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada
>  wrote:
> >
> > On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma  wrote:
> > >
> > > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma  
> > > > wrote:
> > > > >
> > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > > > >  wrote:
> > > > > >
> > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma 
> > > > > >  wrote:
> > > > > > >
> > > > > > > > pg_force_freeze() can revive a tuple that is already deleted 
> > > > > > > > but not
> > > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes 
> > > > > > > > after
> > > > > > > > using that function. For instance, with the following script, 
> > > > > > > > the last
> > > > > > > > two queries: index scan and seq scan, will return different 
> > > > > > > > results.
> > > > > > > >
> > > > > > > > set enable_seqscan to off;
> > > > > > > > set enable_bitmapscan to off;
> > > > > > > > set enable_indexonlyscan to off;
> > > > > > > > create table tbl (a int primary key);
> > > > > > > > insert into tbl values (1);
> > > > > > > >
> > > > > > > > update tbl set a = a + 100 where a = 1;
> > > > > > > >
> > > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > > >
> > > > > > > > -- revive deleted tuple on heap
> > > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > > > > >
> > > > > > > > -- index scan returns 2 tuples
> > > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > > >
> > > > > > > > -- seq scan returns 1 tuple
> > > > > > > > set enable_seqscan to on;
> > > > > > > > explain analyze select * from tbl;
> > > > > > > >
> > > > > > >
> > > > > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > > > > function. I think we should only be running pg_force_freeze 
> > > > > > > function
> > > > > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the 
> > > > > > > things
> > > > > > > instead of making it better.
> > > > > >
> > > > > > Should this also be documented? I think that it's hard to force the
> > > > > > user to always use this module in the right situation but we need to
> > > > > > show at least when to use.
> > > > > >
> > > > >
> > > > > I've already added some examples in the documentation explaining the
> > > > > use-case of force_freeze function. If required, I will also add a note
> > > > > about it.
> > > > >
> > > > > > > > Also, if a tuple updated and moved to another partition is 
> > > > > > > > revived by
> > > > > > > > heap_force_freeze(), its ctid still has special values:
> > > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I 
> > > > > > > > don't
> > > > > > > > see a problem yet caused by a visible tuple having the special 
> > > > > > > > ctid
> > > > > > > > value, but it might be worth considering either to reset ctid 
> > > > > > > > value as
> > > > > > > > well or to not freezing already-deleted tuple.
> > > > > > > >
> > > > > > >
> > > > > > > For this as well, the answer remains the same as above.
> > > > > >
> > > > > > Perhaps the same is true when a tuple header is corrupted including
> > > > > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > > > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > > > > PANIC error in the future.
> > > > > >
> > > > >
> > > > > If a tuple header itself is corrupted, then I think we must kill that
> > > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > > > > can think of resetting the ctid value of that tuple. However, it won't
> > > > > be always possible to detect the corrupted ctid value. It's quite
> > > > > possible that the corrupted ctid field has valid values for block
> > > > > number and offset number in it, but it's actually corrupted and it
> > > > > would be difficult to consider such ctid as corrupted. Hence, we can't
> > > > > do anything about such types of corruption. Probably in such cases we
> > > > > need to run VACUUM FULL on such tables so that new ctid gets generated
> > > > > for 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-20 Thread Ashutosh Sharma
On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada
 wrote:
>
> On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma  wrote:
> >
> > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
> >  wrote:
> > >
> > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma  
> > > wrote:
> > > >
> > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > > >  wrote:
> > > > >
> > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  
> > > > > wrote:
> > > > > >
> > > > > > > pg_force_freeze() can revive a tuple that is already deleted but 
> > > > > > > not
> > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes 
> > > > > > > after
> > > > > > > using that function. For instance, with the following script, the 
> > > > > > > last
> > > > > > > two queries: index scan and seq scan, will return different 
> > > > > > > results.
> > > > > > >
> > > > > > > set enable_seqscan to off;
> > > > > > > set enable_bitmapscan to off;
> > > > > > > set enable_indexonlyscan to off;
> > > > > > > create table tbl (a int primary key);
> > > > > > > insert into tbl values (1);
> > > > > > >
> > > > > > > update tbl set a = a + 100 where a = 1;
> > > > > > >
> > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > >
> > > > > > > -- revive deleted tuple on heap
> > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > > > >
> > > > > > > -- index scan returns 2 tuples
> > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > >
> > > > > > > -- seq scan returns 1 tuple
> > > > > > > set enable_seqscan to on;
> > > > > > > explain analyze select * from tbl;
> > > > > > >
> > > > > >
> > > > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > > > function. I think we should only be running pg_force_freeze function
> > > > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > > > instead of making it better.
> > > > >
> > > > > Should this also be documented? I think that it's hard to force the
> > > > > user to always use this module in the right situation but we need to
> > > > > show at least when to use.
> > > > >
> > > >
> > > > I've already added some examples in the documentation explaining the
> > > > use-case of force_freeze function. If required, I will also add a note
> > > > about it.
> > > >
> > > > > > > Also, if a tuple updated and moved to another partition is 
> > > > > > > revived by
> > > > > > > heap_force_freeze(), its ctid still has special values:
> > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I 
> > > > > > > don't
> > > > > > > see a problem yet caused by a visible tuple having the special 
> > > > > > > ctid
> > > > > > > value, but it might be worth considering either to reset ctid 
> > > > > > > value as
> > > > > > > well or to not freezing already-deleted tuple.
> > > > > > >
> > > > > >
> > > > > > For this as well, the answer remains the same as above.
> > > > >
> > > > > Perhaps the same is true when a tuple header is corrupted including
> > > > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > > > PANIC error in the future.
> > > > >
> > > >
> > > > If a tuple header itself is corrupted, then I think we must kill that
> > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > > > can think of resetting the ctid value of that tuple. However, it won't
> > > > be always possible to detect the corrupted ctid value. It's quite
> > > > possible that the corrupted ctid field has valid values for block
> > > > number and offset number in it, but it's actually corrupted and it
> > > > would be difficult to consider such ctid as corrupted. Hence, we can't
> > > > do anything about such types of corruption. Probably in such cases we
> > > > need to run VACUUM FULL on such tables so that new ctid gets generated
> > > > for each tuple in the table.
> > >
> > > Understood.
> > >
> > > Perhaps such corruption will be able to be detected by new heapam
> > > check functions discussed on another thread. My point was that it
> > > might be better to attempt making the tuple header sane state as much
> > > as possible when fixing a live tuple in order to prevent further
> > > problems such as databases crash by malicious attackers.
> > >
> >
> > Agreed. So, to handle the ctid related concern that you raised, I'm
> > planning to do the following changes to ensure that the tuple being
> > marked as frozen contains the correct item pointer value. Please let
> > me know if you are okay with these changes.
>
> Given that a live tuple never indicates having been moved to another partition, I
> guess the first condition in the if statement is not necessary. The
> rest looks good to me, although other hackers might think differently.
>

Okay, thanks for confirming. I am planning to go ahead with this
approach. Will later 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-19 Thread Masahiko Sawada
On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma  wrote:
>
> On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
>  wrote:
> >
> > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma  wrote:
> > >
> > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  
> > > > wrote:
> > > > >
> > > > > > pg_force_freeze() can revive a tuple that is already deleted but 
> > > > > > not
> > > > > > vacuumed yet. Therefore, the user might need to reindex indexes 
> > > > > > after
> > > > > > using that function. For instance, with the following script, the 
> > > > > > last
> > > > > > two queries: index scan and seq scan, will return different results.
> > > > > >
> > > > > > set enable_seqscan to off;
> > > > > > set enable_bitmapscan to off;
> > > > > > set enable_indexonlyscan to off;
> > > > > > create table tbl (a int primary key);
> > > > > > insert into tbl values (1);
> > > > > >
> > > > > > update tbl set a = a + 100 where a = 1;
> > > > > >
> > > > > > explain analyze select * from tbl where a < 200;
> > > > > >
> > > > > > -- revive deleted tuple on heap
> > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > > >
> > > > > > -- index scan returns 2 tuples
> > > > > > explain analyze select * from tbl where a < 200;
> > > > > >
> > > > > > -- seq scan returns 1 tuple
> > > > > > set enable_seqscan to on;
> > > > > > explain analyze select * from tbl;
> > > > > >
> > > > >
> > > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > > function. I think we should only be running pg_force_freeze function
> > > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > > instead of making it better.
> > > >
> > > > Should this also be documented? I think that it's hard to force the
> > > > user to always use this module in the right situation but we need to
> > > > show at least when to use.
> > > >
> > >
> > > I've already added some examples in the documentation explaining the
> > > use-case of force_freeze function. If required, I will also add a note
> > > about it.
> > >
> > > > > > Also, if a tuple updated and moved to another partition is revived 
> > > > > > by
> > > > > > heap_force_freeze(), its ctid still has special values:
> > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > > > see a problem yet caused by a visible tuple having the special ctid
> > > > > > value, but it might be worth considering either to reset ctid value 
> > > > > > as
> > > > > > well or to not freezing already-deleted tuple.
> > > > > >
> > > > >
> > > > > For this as well, the answer remains the same as above.
> > > >
> > > > Perhaps the same is true when a tuple header is corrupted including
> > > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > > PANIC error in the future.
> > > >
> > >
> > > If a tuple header itself is corrupted, then I think we must kill that
> > > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > > can think of resetting the ctid value of that tuple. However, it won't
> > > be always possible to detect the corrupted ctid value. It's quite
> > > possible that the corrupted ctid field has valid values for block
> > > number and offset number in it, but it's actually corrupted and it
> > > would be difficult to consider such ctid as corrupted. Hence, we can't
> > > do anything about such types of corruption. Probably in such cases we
> > > need to run VACUUM FULL on such tables so that new ctid gets generated
> > > for each tuple in the table.
> >
> > Understood.
> >
> > Perhaps such corruption will be able to be detected by new heapam
> > check functions discussed on another thread. My point was that it
> > might be better to attempt making the tuple header sane state as much
> > as possible when fixing a live tuple in order to prevent further
> > problems such as databases crash by malicious attackers.
> >
>
> Agreed. So, to handle the ctid related concern that you raised, I'm
> planning to do the following changes to ensure that the tuple being
> marked as frozen contains the correct item pointer value. Please let
> me know if you are okay with these changes.

Given that a live tuple never indicates having been moved to another partition, I
guess the first condition in the if statement is not necessary. The
rest looks good to me, although other hackers might think differently.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-19 Thread Ashutosh Sharma
On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
 wrote:
>
> On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma  wrote:
> >
> > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  
> > > wrote:
> > > >
> > > > > pg_force_freeze() can revive a tuple that is already deleted but not
> > > > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > > > using that function. For instance, with the following script, the last
> > > > > two queries: index scan and seq scan, will return different results.
> > > > >
> > > > > set enable_seqscan to off;
> > > > > set enable_bitmapscan to off;
> > > > > set enable_indexonlyscan to off;
> > > > > create table tbl (a int primary key);
> > > > > insert into tbl values (1);
> > > > >
> > > > > update tbl set a = a + 100 where a = 1;
> > > > >
> > > > > explain analyze select * from tbl where a < 200;
> > > > >
> > > > > -- revive deleted tuple on heap
> > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > >
> > > > > -- index scan returns 2 tuples
> > > > > explain analyze select * from tbl where a < 200;
> > > > >
> > > > > -- seq scan returns 1 tuple
> > > > > set enable_seqscan to on;
> > > > > explain analyze select * from tbl;
> > > > >
> > > >
> > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > function. I think we should only be running pg_force_freeze function
> > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > instead of making it better.
> > >
> > > Should this also be documented? I think that it's hard to force the
> > > user to always use this module in the right situation but we need to
> > > show at least when to use.
> > >
> >
> > I've already added some examples in the documentation explaining the
> > use-case of force_freeze function. If required, I will also add a note
> > about it.
> >
> > > > > Also, if a tuple updated and moved to another partition is revived by
> > > > > heap_force_freeze(), its ctid still has special values:
> > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > > see a problem yet caused by a visible tuple having the special ctid
> > > > > value, but it might be worth considering either to reset ctid value as
> > > > > well or to not freezing already-deleted tuple.
> > > > >
> > > >
> > > > For this as well, the answer remains the same as above.
> > >
> > > Perhaps the same is true when a tuple header is corrupted including
> > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > PANIC error in the future.
> > >
> >
> > If a tuple header itself is corrupted, then I think we must kill that
> > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > can think of resetting the ctid value of that tuple. However, it won't
> > be always possible to detect the corrupted ctid value. It's quite
> > possible that the corrupted ctid field has valid values for block
> > number and offset number in it, but it's actually corrupted and it
> > would be difficult to consider such ctid as corrupted. Hence, we can't
> > do anything about such types of corruption. Probably in such cases we
> > need to run VACUUM FULL on such tables so that new ctid gets generated
> > for each tuple in the table.
>
> Understood.
>
> Perhaps such corruption will be able to be detected by new heapam
> check functions discussed on another thread. My point was that it
> might be better to attempt making the tuple header sane state as much
> as possible when fixing a live tuple in order to prevent further
> problems such as databases crash by malicious attackers.
>

Agreed. So, to handle the ctid related concern that you raised, I'm
planning to do the following changes to ensure that the tuple being
marked as frozen contains the correct item pointer value. Please let
me know if you are okay with these changes.

   HeapTupleHeader htup;
+  ItemPointerData ctid;

   Assert(heap_force_opt == HEAP_FORCE_FREEZE);

+  ItemPointerSet(&ctid, blkno, offnos[i]);
+
   htup = (HeapTupleHeader) PageGetItem(page, itemid);

+  /*
+   * Make sure that this tuple holds the correct item pointer value.
+   */
+  if (!HeapTupleHeaderIndicatesMovedPartitions(htup) &&
+      !ItemPointerEquals(&ctid, &htup->t_ctid))
+      ItemPointerSet(&htup->t_ctid, blkno, offnos[i]);
+
   HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-19 Thread Masahiko Sawada
On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma  wrote:
>
> On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
>  wrote:
> >
> > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  wrote:
> > >
> > > > pg_force_freeze() can revive a tuple that is already deleted but not
> > > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > > using that function. For instance, with the following script, the last
> > > > two queries: index scan and seq scan, will return different results.
> > > >
> > > > set enable_seqscan to off;
> > > > set enable_bitmapscan to off;
> > > > set enable_indexonlyscan to off;
> > > > create table tbl (a int primary key);
> > > > insert into tbl values (1);
> > > >
> > > > update tbl set a = a + 100 where a = 1;
> > > >
> > > > explain analyze select * from tbl where a < 200;
> > > >
> > > > -- revive deleted tuple on heap
> > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > >
> > > > -- index scan returns 2 tuples
> > > > explain analyze select * from tbl where a < 200;
> > > >
> > > > -- seq scan returns 1 tuple
> > > > set enable_seqscan to on;
> > > > explain analyze select * from tbl;
> > > >
> > >
> > > I am not sure if this is the right use-case of pg_force_freeze
> > > function. I think we should only be running pg_force_freeze function
> > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > instead of making it better.
> >
> > Should this also be documented? I think that it's hard to force the
> > user to always use this module in the right situation but we need to
> > show at least when to use.
> >
>
> I've already added some examples in the documentation explaining the
> use-case of force_freeze function. If required, I will also add a note
> about it.
>
> > > > Also, if a tuple updated and moved to another partition is revived by
> > > > heap_force_freeze(), its ctid still has special values:
> > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > see a problem yet caused by a visible tuple having the special ctid
> > > > value, but it might be worth considering either to reset ctid value as
> > > > well or to not freezing already-deleted tuple.
> > > >
> > >
> > > For this as well, the answer remains the same as above.
> >
> > Perhaps the same is true when a tuple header is corrupted including
> > xmin and ctid for some reason and the user wants to fix it? I'm
> > concerned that a live tuple having the wrong ctid will cause SEGV or
> > PANIC error in the future.
> >
>
> If a tuple header itself is corrupted, then I think we must kill that
> tuple. If only xmin and t_ctid fields are corrupted, then probably we
> can think of resetting the ctid value of that tuple. However, it won't
> be always possible to detect the corrupted ctid value. It's quite
> possible that the corrupted ctid field has valid values for block
> number and offset number in it, but it's actually corrupted and it
> would be difficult to consider such ctid as corrupted. Hence, we can't
> do anything about such types of corruption. Probably in such cases we
> need to run VACUUM FULL on such tables so that new ctid gets generated
> for each tuple in the table.

Understood.

Perhaps such corruption will be able to be detected by new heapam
check functions discussed on another thread. My point was that it
might be better to attempt making the tuple header sane state as much
as possible when fixing a live tuple in order to prevent further
problems such as databases crash by malicious attackers.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-19 Thread Ashutosh Sharma
On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
 wrote:
>
> On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  wrote:
> >
> > > pg_force_freeze() can revive a tuple that is already deleted but not
> > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > using that function. For instance, with the following script, the last
> > > two queries: index scan and seq scan, will return different results.
> > >
> > > set enable_seqscan to off;
> > > set enable_bitmapscan to off;
> > > set enable_indexonlyscan to off;
> > > create table tbl (a int primary key);
> > > insert into tbl values (1);
> > >
> > > update tbl set a = a + 100 where a = 1;
> > >
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- revive deleted tuple on heap
> > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > >
> > > -- index scan returns 2 tuples
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- seq scan returns 1 tuple
> > > set enable_seqscan to on;
> > > explain analyze select * from tbl;
> > >
> >
> > I am not sure if this is the right use-case of pg_force_freeze
> > function. I think we should only be running pg_force_freeze function
> > on a tuple for which VACUUM reports "found xmin ABC from before
> > relfrozenxid PQR" sort of error; otherwise it might worsen things
> > instead of making them better.
>
> Should this also be documented? I think that it's hard to force the
> user to always use this module in the right situation, but we need to
> at least show when to use it.
>

I've already added some examples in the documentation explaining the
use-case of force_freeze function. If required, I will also add a note
about it.

> > > Also, if a tuple updated and moved to another partition is revived by
> > > heap_force_freeze(), its ctid still has special values:
> > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > see a problem yet caused by a visible tuple having the special ctid
> > > value, but it might be worth considering either to reset ctid value as
> > > well or to not freezing already-deleted tuple.
> > >
> >
> > For this as well, the answer remains the same as above.
>
> Perhaps the same is true when a tuple header is corrupted including
> xmin and ctid for some reason and the user wants to fix it? I'm
> concerned that a live tuple having the wrong ctid will cause SEGV or
> PANIC error in the future.
>

If a tuple header itself is corrupted, then I think we must kill that
tuple. If only xmin and t_ctid fields are corrupted, then probably we
can think of resetting the ctid value of that tuple. However, it won't
be always possible to detect the corrupted ctid value. It's quite
possible that the corrupted ctid field has valid values for block
number and offset number in it, but it's actually corrupted and it
would be difficult to consider such ctid as corrupted. Hence, we can't
do anything about such types of corruption. Probably in such cases we
need to run VACUUM FULL on such tables so that new ctid gets generated
for each tuple in the table.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
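
The VACUUM FULL fallback described above can be illustrated with a short
session (table name and values are made up): a table rewrite assigns a
fresh ctid to every surviving tuple, which is exactly what would repair a
tuple whose t_ctid field had been silently corrupted.

```sql
create table demo (a int);
insert into demo select generate_series(1, 3);
delete from demo where a = 2;

select ctid, a from demo;  -- surviving tuples keep their original ctids

vacuum full demo;

select ctid, a from demo;  -- the rewrite has assigned new, dense ctids
```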




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Ashutosh Sharma
On Tue, Aug 18, 2020 at 9:44 PM Alvaro Herrera  wrote:
>
> On 2020-Aug-17, Ashutosh Sharma wrote:
>
> > > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > > +   ItemIdSetDead(itemid);
> > >
> > > I think that if the page is an all-visible page, we should clear an
> > > all-visible bit on the visibility map corresponding to the page and
> > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > > return the wrong results.
> >
> > I think we should let VACUUM do that. Please note that this module is
> > intended to be used only on a damaged relation and should only be
> > operated on damaged tuples of such relations. And the execution of any
> > of the functions provided by this module on a damaged relation must be
> > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> > This is necessary to bring back a damaged relation to the sane state
> > once a surgery is performed on it. I will try to add this note in the
> > documentation for this module.
>
> It makes sense to recommend VACUUM after fixing the page, but I agree
> with Sawada-san that it would be sensible to reset the VM bit while
> doing surgery, since that's the state that the page would be in.

Sure, I will try to do that change but I would still recommend to
always run VACUUM with DISABLE_PAGE_SKIPPING option on the relation
that underwent surgery.


> We should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
> but if users fail to do so, then leaving the VM bit set just means that
> we know *for certain* that there will be further corruption as soon as
> the XID counter advances sufficiently.
>

Yeah, I've already added a note for this in the documentation:

Note: "After a surgery is performed on a damaged relation using this
module, we must run VACUUM with DISABLE_PAGE_SKIPPING option on that
relation to bring it back into a sane state."
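
Assuming a damaged table named tbl, the recommended follow-up from the note
above would look like this; DISABLE_PAGE_SKIPPING forces VACUUM to visit
even pages marked all-visible or all-frozen in the visibility map:

```sql
-- After heap_force_kill()/heap_force_freeze() have been applied to tbl:
VACUUM (DISABLE_PAGE_SKIPPING, VERBOSE) tbl;
```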

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Masahiko Sawada
On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  wrote:
>
> Hello Masahiko-san,
>
> Thanks for the review. Please check the comments inline below:
>
> On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
>  wrote:
>
> > Thank you for updating the patch! Here are my comments on v5 patch:
> >
> > --- a/contrib/Makefile
> > +++ b/contrib/Makefile
> > @@ -35,6 +35,7 @@ SUBDIRS = \
> > pg_standby  \
> > pg_stat_statements \
> > pg_trgm \
> > +   pg_surgery  \
> > pgcrypto\
> >
> > I guess we use alphabetical order here. So pg_surgery should be placed
> > before pg_trgm.
> >
>
> Okay, will take care of this in the next version of patch.
>
> > ---
> > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > +   ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> >
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.
>
> > ---
> > +   /*
> > +* We do not mark the buffer dirty or do WAL logging for unmodifed
> > +* pages.
> > +*/
> > +   if (!did_modify_page)
> > +   goto skip_wal;
> > +
> > +   /* Mark buffer dirty before we write WAL. */
> > +   MarkBufferDirty(buf);
> > +
> > +   /* XLOG stuff */
> > +   if (RelationNeedsWAL(rel))
> > +   log_newpage_buffer(buf, true);
> > +
> > +skip_wal:
> > +   END_CRIT_SECTION();
> > +
> >
> > s/unmodifed/unmodified/
> >
>
> okay, will fix this typo.
>
> > Do we really need to use goto? I think we can modify it like follows:
> >
> > if (did_modify_page)
> > {
> >/* Mark buffer dirty before we write WAL. */
> >MarkBufferDirty(buf);
> >
> >/* XLOG stuff */
> >if (RelationNeedsWAL(rel))
> >log_newpage_buffer(buf, true);
> > }
> >
> > END_CRIT_SECTION();
> >
>
> No, we don't need it. We can achieve the same by checking the status
> of did_modify_page flag as you suggested. I will do this change in the
> next version.
>
> > ---
> > pg_force_freeze() can revive a tuple that is already deleted but not
> > vacuumed yet. Therefore, the user might need to reindex indexes after
> > using that function. For instance, with the following script, the last
> > two queries: index scan and seq scan, will return different results.
> >
> > set enable_seqscan to off;
> > set enable_bitmapscan to off;
> > set enable_indexonlyscan to off;
> > create table tbl (a int primary key);
> > insert into tbl values (1);
> >
> > update tbl set a = a + 100 where a = 1;
> >
> > explain analyze select * from tbl where a < 200;
> >
> > -- revive deleted tuple on heap
> > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> >
> > -- index scan returns 2 tuples
> > explain analyze select * from tbl where a < 200;
> >
> > -- seq scan returns 1 tuple
> > set enable_seqscan to on;
> > explain analyze select * from tbl;
> >
>
> I am not sure if this is the right use-case of pg_force_freeze
> function. I think we should only be running pg_force_freeze function
> on a tuple for which VACUUM reports "found xmin ABC from before
> relfrozenxid PQR" sort of error; otherwise it might worsen things
> instead of making them better.

Should this also be documented? I think that it's hard to force the
user to always use this module in the right situation, but we need to
at least show when to use it.

> > Also, if a tuple updated and moved to another partition is revived by
> > heap_force_freeze(), its ctid still has special values:
> > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > see a problem yet caused by a visible tuple having the special ctid
> > value, but it might be worth considering either to reset ctid value as
> > well or to not freezing already-deleted tuple.
> >
>
> For this as well, the answer remains the same as above.

Perhaps the same is true when a tuple header is corrupted including
xmin and ctid for some reason and the user wants to fix it? I'm
concerned that a live tuple having the wrong ctid will cause SEGV or
PANIC error in the future.

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Alvaro Herrera
On 2020-Aug-17, Ashutosh Sharma wrote:

> > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > +   ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> 
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.

It makes sense to recommend VACUUM after fixing the page, but I agree
with Sawada-san that it would be sensible to reset the VM bit while
doing surgery, since that's the state that the page would be in.  We
should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
but if users fail to do so, then leaving the VM bit set just means that
we know *for certain* that there will be further corruption as soon as
the XID counter advances sufficiently.

-- 
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Rajkumar Raghuwanshi
Thanks for the suggestions, Ashutosh. I have done testing around these
suggestions and found no issues. I will continue the same testing with the
updated patch posted on this thread.

On Fri, Aug 7, 2020 at 12:45 PM Ashutosh Sharma 
wrote:

> Thanks Rajkumar for testing the patch.
>
> Here are some of the additional test-cases that I would suggest you to
> execute, if possible:
>
> 1) You may try running the test-cases that you have executed so far
> with SR setup and see if the changes are getting reflected on the
> standby.
>
> 2) You may also try running some concurrent test-cases for e.g. try
> running these functions with VACUUM or some other sql commands
> (preferable DML commands) in parallel.
>
> 3) See what happens when you pass some invalid tids (containing
> invalid block or offset number) to these functions. You may also try
> running these functions on the same tuple repeatedly and see the
> behaviour.
>
> ...
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com


Thanks & Regards,
Rajkumar Raghuwanshi
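
Test point 3 above (invalid tids) can be sketched as follows; the table
and tid values are illustrative, and the exact NOTICE/ERROR wording depends
on the patch version under test:

```sql
create table t (a int);
insert into t values (1);

-- offset number past the end of the line pointer array on block 0
select heap_force_kill('t'::regclass, array['(0,2)'::tid]);

-- block number past the end of the relation
select heap_force_kill('t'::regclass, array['(1,1)'::tid]);

-- run the function twice on the same tuple
select heap_force_kill('t'::regclass, array['(0,1)'::tid]);
select heap_force_kill('t'::regclass, array['(0,1)'::tid]);
```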


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Chris Travers
On Tue, Jul 14, 2020 at 12:28 AM Peter Geoghegan  wrote:

> On Mon, Jul 13, 2020 at 2:12 PM Robert Haas  wrote:
> > 1. There's nothing to identify the tuple that has the problem, and no
> > way to know how many more of them there might be. Back-patching
> > b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> > part of this.
>
> I am in favor of backpatching such changes in cases where senior
> community members feel that it could help with hypothetical
> undiscovered data corruption issues -- if they're willing to take
> responsibility for the change. It certainly wouldn't be the first
> time. A "defense in depth" mindset seems like the right one when it
> comes to data corruption bugs. Early detection is really important.
>
> > Moreover, not everyone is as
> > interested in an extended debugging exercise as they are in getting
> > the system working again, and VACUUM failing repeatedly is a pretty
> > serious problem.
>
> That's absolutely consistent with my experience. Most users want to
> get back to business as usual now, while letting somebody else do the
> hard work of debugging.
>

Also, even if you do trace the problem, you still have to recover.

And sometimes I have found latent corruption from times when dbs were
running on older versions and older servers, making debugging largely a
futile exercise.

>
> > Therefore, one of my colleagues has - at my request - created a couple
> > of functions called heap_force_kill() and heap_force_freeze() which
> > take an array of TIDs.
>
> > So I have these questions:
> >
> > - Do people think it would me smart/good/useful to include something
> > like this in PostgreSQL?
>
> I'm in favor of it.
>

+1

It would be worth extending it with some functions to grab rows that
have various TOAST OIDs too.

>
> > - If so, how? I would propose a new contrib module that we back-patch
> > all the way, because the VACUUM errors were back-patched all the way,
> > and there seems to be no advantage in making people wait 5 years for a
> > new version that has some kind of tooling in this area.
>
> I'm in favor of it being *possible* to backpatch tooling that is
> clearly related to correctness in a fundamental way. Obviously this
> would mean that we'd be revising our general position on backpatching
> to allow some limited exceptions around corruption. I'm not sure that
> this meets that standard, though. It's hardly something that we can
> expect all that many users to be able to use effectively.
>
> I may be biased, but I'd be inclined to permit it in the case of
> something like amcheck, or pg_visibility, on the grounds that they're
> more or less the same as the new VACUUM errcontext instrumentation you
> mentioned. The same cannot be said of something like this new
> heap_force_kill() stuff.
>
> > - Any ideas for additional things we should include, or improvements
> > on the sketch above?
>
> Clearly you should work out a way of making it very hard to
> accidentally (mis)use. For example, maybe you make the functions check
> for the presence of a sentinel file in the data directory.
>

Agreed.

>
>
> --
> Peter Geoghegan
>
>
>

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Ashutosh Sharma
Attached is the new version of patch that addresses the comments from
Asim Praveen and Masahiko-san. It also improves the documentation to
some extent.


On Tue, Aug 18, 2020 at 1:46 PM Ashutosh Sharma  wrote:
>
> Hello Masahiko-san,
>
> I've spent some more time trying to understand the code in
> lazy_scan_heap function to know under what all circumstances a VACUUM
> can fail with "found xmin ... before relfrozenxid ..." error for a
> tuple whose xmin is behind relfrozenxid. Here are my observations:
>
> 1) It can fail with this error for a live tuple
>
> OR,
>
> 2) It can also fail with this error if a tuple (that went through
> update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.
>
> OR,
>
> 3) If there are any concurrent transactions, then the tuple might be
> marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
> or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
> this error.
>
> Now, AFAIU, as we will be dealing with a damaged table, the chances of
> point #3 being the cause of this error look impossible in our case
> because I don't think we will be doing anything in parallel when
> performing surgery on a damaged table, in fact we shouldn't be doing
> any such things. However, it is quite possible that reason #2 could
> cause VACUUM to fail with this sort of error, but, as we are already
> skipping redirected item pointers in heap_force_common(), I think, we
> would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't
> see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we
> may not need to handle point #2 as well.
>
> Further, I also don't see VACUUM reporting this error for a tuple that
> has been moved from one partition to another. So, I think we might not
> need to do any special handling for a tuple that got updated and its
> new version was moved to another partition.
>
> If you feel I am missing something here, please correct me. Thank you.
>
> Moreover, while I was exploring on above, I noticed that in
> lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check
> for a redirected item pointers and if any redirected item pointer is
> detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how
> HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked
> with HEAP_HOT_UPDATED. I am referring to the following code in
> lazy_scan_heap().
>
> for (offnum = FirstOffsetNumber;
>  offnum <= maxoff;
>  offnum = OffsetNumberNext(offnum))
> {
> ItemId  itemid;
>
> itemid = PageGetItemId(page, offnum);
>
> .
> .
>
>
> /* Redirect items mustn't be touched */ <-- this check
> would bypass the redirected item pointers from being checked for
> HeapTupleSatisfiesVacuum.
> if (ItemIdIsRedirected(itemid))
> {
> hastup = true;  /* this page won't be truncatable */
> continue;
> }
>
> ..
> ..
>
> switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
> {
> case HEAPTUPLE_DEAD:
>
> if (HeapTupleIsHotUpdated(&tuple) ||
> HeapTupleIsHeapOnly(&tuple) ||
> params->index_cleanup == VACOPT_TERNARY_DISABLED)
> nkeep += 1;
> else
> tupgone = true; /* we can delete the tuple */
> ..
> ..
>  }
>
>
> So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true?
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
>
> On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma  
> wrote:
> >
> > Hello Masahiko-san,
> >
> > Thanks for the review. Please check the comments inline below:
> >
> > On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
> >  wrote:
> >
> > > Thank you for updating the patch! Here are my comments on v5 patch:
> > >
> > > --- a/contrib/Makefile
> > > +++ b/contrib/Makefile
> > > @@ -35,6 +35,7 @@ SUBDIRS = \
> > > pg_standby  \
> > > pg_stat_statements \
> > > pg_trgm \
> > > +   pg_surgery  \
> > > pgcrypto\
> > >
> > > I guess we use alphabetical order here. So pg_surgery should be placed
> > > before pg_trgm.
> > >
> >
> > Okay, will take care of this in the next version of patch.
> >
> > > ---
> > > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > > +   ItemIdSetDead(itemid);
> > >
> > > I think that if the page is an all-visible page, we should clear an
> > > all-visible bit on the visibility map corresponding to the page and
> > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > > return the wrong results.
> > >
> >
> > I think we should let VACUUM do that. Please note that this module is
> > intended to be used only on a damaged relation and should only be
> > operated on damaged tuples of such relations. And the execution of any
> > of the functions provided by this module on a damaged relation must be
> > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Ashutosh Sharma
Hello Masahiko-san,

I've spent some more time trying to understand the code in
lazy_scan_heap function to know under what all circumstances a VACUUM
can fail with "found xmin ... before relfrozenxid ..." error for a
tuple whose xmin is behind relfrozenxid. Here are my observations:

1) It can fail with this error for a live tuple

OR,

2) It can also fail with this error if a tuple (that went through
update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.

OR,

3) If there are any concurrent transactions, then the tuple might be
marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
this error.

Now, AFAIU, as we will be dealing with a damaged table, the chances of
point #3 being the cause of this error look impossible in our case
because I don't think we will be doing anything in parallel when
performing surgery on a damaged table, in fact we shouldn't be doing
any such things. However, it is quite possible that reason #2 could
cause VACUUM to fail with this sort of error, but, as we are already
skipping redirected item pointers in heap_force_common(), I think, we
would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't
see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we
may not need to handle point #2 as well.

Further, I also don't see VACUUM reporting this error for a tuple that
has been moved from one partition to another. So, I think we might not
need to do any special handling for a tuple that got updated and its
new version was moved to another partition.

If you feel I am missing something here, please correct me. Thank you.

Moreover, while I was exploring on above, I noticed that in
lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check
for a redirected item pointers and if any redirected item pointer is
detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how
HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked
with HEAP_HOT_UPDATED. I am referring to the following code in
lazy_scan_heap().

for (offnum = FirstOffsetNumber;
 offnum <= maxoff;
 offnum = OffsetNumberNext(offnum))
{
ItemId  itemid;

itemid = PageGetItemId(page, offnum);

.
.


/* Redirect items mustn't be touched */ <-- this check
would bypass the redirected item pointers from being checked for
HeapTupleSatisfiesVacuum.
if (ItemIdIsRedirected(itemid))
{
hastup = true;  /* this page won't be truncatable */
continue;
}

..
..

    switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
{
case HEAPTUPLE_DEAD:

            if (HeapTupleIsHotUpdated(&tuple) ||
                HeapTupleIsHeapOnly(&tuple) ||
params->index_cleanup == VACOPT_TERNARY_DISABLED)
nkeep += 1;
else
tupgone = true; /* we can delete the tuple */
..
..
 }


So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true?

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma  wrote:
>
> Hello Masahiko-san,
>
> Thanks for the review. Please check the comments inline below:
>
> On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
>  wrote:
>
> > Thank you for updating the patch! Here are my comments on v5 patch:
> >
> > --- a/contrib/Makefile
> > +++ b/contrib/Makefile
> > @@ -35,6 +35,7 @@ SUBDIRS = \
> > pg_standby  \
> > pg_stat_statements \
> > pg_trgm \
> > +   pg_surgery  \
> > pgcrypto\
> >
> > I guess we use alphabetical order here. So pg_surgery should be placed
> > before pg_trgm.
> >
>
> Okay, will take care of this in the next version of patch.
>
> > ---
> > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > +   ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> >
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.
>
> > ---
> > +   /*
> > +* We do not mark the buffer dirty or do WAL logging for unmodifed
> > +* pages.
> > +*/

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-17 Thread Ashutosh Sharma
Hello Masahiko-san,

Thanks for the review. Please check the comments inline below:

On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
 wrote:

> Thank you for updating the patch! Here are my comments on v5 patch:
>
> --- a/contrib/Makefile
> +++ b/contrib/Makefile
> @@ -35,6 +35,7 @@ SUBDIRS = \
> pg_standby  \
> pg_stat_statements \
> pg_trgm \
> +   pg_surgery  \
> pgcrypto\
>
> I guess we use alphabetical order here. So pg_surgery should be placed
> before pg_trgm.
>

Okay, will take care of this in the next version of patch.

> ---
> +   if (heap_force_opt == HEAP_FORCE_KILL)
> +   ItemIdSetDead(itemid);
>
> I think that if the page is an all-visible page, we should clear an
> all-visible bit on the visibility map corresponding to the page and
> PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> return the wrong results.
>

I think we should let VACUUM do that. Please note that this module is
intended to be used only on a damaged relation and should only be
operated on damaged tuples of such relations. And the execution of any
of the functions provided by this module on a damaged relation must be
followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
This is necessary to bring back a damaged relation to the sane state
once a surgery is performed on it. I will try to add this note in the
documentation for this module.

> ---
> +   /*
> +* We do not mark the buffer dirty or do WAL logging for unmodifed
> +* pages.
> +*/
> +   if (!did_modify_page)
> +   goto skip_wal;
> +
> +   /* Mark buffer dirty before we write WAL. */
> +   MarkBufferDirty(buf);
> +
> +   /* XLOG stuff */
> +   if (RelationNeedsWAL(rel))
> +   log_newpage_buffer(buf, true);
> +
> +skip_wal:
> +   END_CRIT_SECTION();
> +
>
> s/unmodifed/unmodified/
>

okay, will fix this typo.

> Do we really need to use goto? I think we can modify it like follows:
>
> if (did_modify_page)
> {
>/* Mark buffer dirty before we write WAL. */
>MarkBufferDirty(buf);
>
>/* XLOG stuff */
>if (RelationNeedsWAL(rel))
>log_newpage_buffer(buf, true);
> }
>
> END_CRIT_SECTION();
>

No, we don't need it. We can achieve the same by checking the status
of did_modify_page flag as you suggested. I will do this change in the
next version.

> ---
> pg_force_freeze() can revive a tuple that is already deleted but not
> vacuumed yet. Therefore, the user might need to reindex indexes after
> using that function. For instance, with the following script, the last
> two queries: index scan and seq scan, will return different results.
>
> set enable_seqscan to off;
> set enable_bitmapscan to off;
> set enable_indexonlyscan to off;
> create table tbl (a int primary key);
> insert into tbl values (1);
>
> update tbl set a = a + 100 where a = 1;
>
> explain analyze select * from tbl where a < 200;
>
> -- revive deleted tuple on heap
> select heap_force_freeze('tbl', array['(0,1)'::tid]);
>
> -- index scan returns 2 tuples
> explain analyze select * from tbl where a < 200;
>
> -- seq scan returns 1 tuple
> set enable_seqscan to on;
> explain analyze select * from tbl;
>

I am not sure if this is the right use-case of pg_force_freeze
function. I think we should only be running pg_force_freeze function
on a tuple for which VACUUM reports "found xmin ABC from before
relfrozenxid PQR" sort of error; otherwise it might worsen things
instead of making them better. Now, the question is - can VACUUM report
this type of error for a deleted tuple or it would only report it for
a live tuple? AFAIU this won't be reported for the deleted tuples
because VACUUM wouldn't consider freezing a tuple that has been
deleted.

> Also, if a tuple updated and moved to another partition is revived by
> heap_force_freeze(), its ctid still has special values:
> MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> see a problem yet caused by a visible tuple having the special ctid
> value, but it might be worth considering either to reset ctid value as
> well or to not freezing already-deleted tuple.
>

For this as well, the answer remains the same as above.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-13 Thread Masahiko Sawada
On Wed, 12 Aug 2020 at 22:27, Ashutosh Sharma  wrote:
>
> Thanks Robert for the review. Please find my comments inline below:
>
> On Fri, Aug 7, 2020 at 9:21 PM Robert Haas  wrote:
> >
> > On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma  
> > wrote:
> > > Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
> >
> > Compiler warning:
> >
> > heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
> > always false [-Werror,-Wtautological-compare]
> > if (blkno < 0 || blkno >= nblocks)
> > ~ ^ ~
> >
>
> Fixed.
>
> > There's a certain inconsistency to these messages:
> >
> > rhaas=# create table foo (a int);
> > CREATE TABLE
> > rhaas=# insert into foo values (1);
> > INSERT 0 1
> > rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> > NOTICE:  skipping tid (0, 2) because it contains an invalid offset
> >  heap_force_kill
> > -
> >
> > (1 row)
> >
> > rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> > ERROR:  invalid item pointer
> > LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> > rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> > ERROR:  block number 1 is out of range for relation "foo"
> >
> > From a user perspective it seems like I've made three very similar
> > mistakes: in the first case, the offset is too high, in the second
> > case it's too low, and in the third case the block number is out of
> > range. But in one case I get a NOTICE and in the other two cases I get
> > an ERROR. In one case I get the relation name and in the other two
> > cases I don't. The two complaints about an invalid offset are phrased
> > completely differently from each other. For example, suppose you do
> > this:
> >
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> > number is out of range (%u..%u)
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> > number is out of range for this block (%u..%u)
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
> >
>
> Corrected.
>
> > I think I misled you when I said to use pg_class_aclcheck. I think it
> > should actually be pg_class_ownercheck.
> >
>
> okay, I've changed it to pg_class_ownercheck.
>
> > I think the relkind sanity check should permit RELKIND_MATVIEW also.
> >
>
> Yeah, actually we should allow MATVIEW; I don't know why I thought of
> blocking it earlier.
>
> It's unclear to me why the freeze logic here shouldn't do this part of
> what heap_prepare_freeze_tuple() does when freezing xmax:
> >
> > frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
> > frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
> >
>
> Yeah, we should have these changes when freezing the xmax.
>
> > Likewise, why should we not freeze or invalidate xvac in the case
> > where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
> > would do?
> >
>
> Again, we should have this as well.
>
> Apart from above, this time I've also added the documentation on
> pg_surgery module and added a few more test-cases.
>
> Attached patch with above changes.
>

Thank you for updating the patch! Here are my comments on the v5 patch:

--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
pg_standby  \
pg_stat_statements \
pg_trgm \
+   pg_surgery  \
pgcrypto\

I guess we use alphabetical order here. So pg_surgery should be placed
before pg_trgm.

---
+   if (heap_force_opt == HEAP_FORCE_KILL)
+   ItemIdSetDead(itemid);

I think that if the page is an all-visible page, we should clear the
all-visible bit in the visibility map corresponding to the page and
PD_ALL_VISIBLE on the page header. Otherwise, an index-only scan would
return wrong results.
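
To illustrate the hazard with a toy model (plain Python, not the actual
PostgreSQL C code; the data layout is made up): an index-only scan trusts
the visibility map and skips the heap check, so if the item is killed but
the page's all-visible bit is left set, the killed tuple still appears in
index-only-scan results while a seq scan sees the truth.

```python
# Toy model of the visibility-map hazard described above; not the
# heap_surgery.c implementation.

heap = {(0, 1): "AAA", (0, 2): "BBB"}   # (block, offset) -> tuple value
index = {"AAA": (0, 1), "BBB": (0, 2)}  # indexed value -> tid
vm_all_visible = {0: True}              # block -> all-visible bit
dead = set()                            # tids marked dead

def force_kill(tid, clear_vm):
    dead.add(tid)
    if clear_vm:
        vm_all_visible[tid[0]] = False  # what the review comment asks for

def index_only_scan(key):
    tid = index[key]
    # Index-only scan: if the VM says all-visible, answer from the index
    # alone without checking the heap item at all.
    if vm_all_visible[tid[0]]:
        return key
    return key if tid not in dead else None

def seq_scan(key):
    tid = index[key]
    return heap[tid] if tid not in dead else None

force_kill((0, 2), clear_vm=False)       # kill without clearing the VM bit
assert index_only_scan("BBB") == "BBB"   # wrong: killed tuple still visible
assert seq_scan("BBB") is None           # seq scan gives the right answer
```

Killing the same tid again with `clear_vm=True` makes the two scans agree.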

---
+   /*
+* We do not mark the buffer dirty or do WAL logging for unmodifed
+* pages.
+*/
+   if (!did_modify_page)
+   goto skip_wal;
+
+   /* Mark buffer dirty before we write WAL. */
+   MarkBufferDirty(buf);
+
+   /* XLOG stuff */
+   if (RelationNeedsWAL(rel))
+   log_newpage_buffer(buf, true);
+
+skip_wal:
+   END_CRIT_SECTION();
+

s/unmodifed/unmodified/

Do we really need to use goto? I think we can modify it as follows:

if (did_modify_page)
{
   /* Mark buffer dirty before we write WAL. */
   MarkBufferDirty(buf);

   /* XLOG stuff */
   if (RelationNeedsWAL(rel))
   log_newpage_buffer(buf, true);
}

END_CRIT_SECTION();

---
pg_force_freeze() can revive a tuple that is already deleted but not
vacuumed yet. Therefore, the user might need to reindex indexes after
using that function. For instance, with the following script, the last
two queries (an index scan and a seq scan) will return different results.

set enable_seqscan to off;
set enable_bitmapscan to off;

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-13 Thread Robert Haas
On Thu, Aug 13, 2020 at 3:52 AM Ashutosh Sharma  wrote:
> This looks like a very good suggestion to me. I will do this change in
> the next version. Just wondering if we should be doing similar changes
> in other contrib modules (like pgrowlocks, pageinspect and
> pgstattuple) as well?

It seems like it should be consistent, but I'm not sure the proposed
change is really an improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-13 Thread Ashutosh Sharma
Hi Asim,

Thanks for having a look into the patch and for sharing your feedback.
Please find my comments inline below:

On Thu, Aug 13, 2020 at 12:36 PM Asim Praveen  wrote:
>
> Hi Ashutosh
>
> I stumbled upon this thread today, went through your patch and it looks good. 
>  A minor suggestion in sanity_check_relation():
>
> if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> ereport(ERROR,
> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>  errmsg("only heap AM is supported")));
>
> Instead of checking the access method OID, it seems better to check the 
> handler OID like so:
>
> if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
>
> The reason is that the current version of sanity_check_relation() would emit 
> an error for the following case even when the table structure is actually heap.
>
> create access method myam type table handler heap_tableam_handler;
> create table mytable (…) using myam;
>

This looks like a very good suggestion to me. I will do this change in
the next version. Just wondering if we should be doing similar changes
in other contrib modules (like pgrowlocks, pageinspect and
pgstattuple) as well?

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-13 Thread Asim Praveen
Hi Ashutosh

I stumbled upon this thread today, went through your patch and it looks good.  
A minor suggestion in sanity_check_relation():

if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("only heap AM is supported")));

Instead of checking the access method OID, it seems better to check the handler 
OID like so:

if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)

The reason is that the current version of sanity_check_relation() would emit an 
error for the following case even when the table structure is actually heap.

create access method myam type table handler heap_tableam_handler;
create table mytable (…) using myam;

Asim


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-12 Thread Ashutosh Sharma
Thanks Robert for the review. Please find my comments inline below:

On Fri, Aug 7, 2020 at 9:21 PM Robert Haas  wrote:
>
> On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma  wrote:
> > Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
>
> Compiler warning:
>
> heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
> always false [-Werror,-Wtautological-compare]
> if (blkno < 0 || blkno >= nblocks)
> ~ ^ ~
>

Fixed.

> There's a certain inconsistency to these messages:
>
> rhaas=# create table foo (a int);
> CREATE TABLE
> rhaas=# insert into foo values (1);
> INSERT 0 1
> rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> NOTICE:  skipping tid (0, 2) because it contains an invalid offset
>  heap_force_kill
> -
>
> (1 row)
>
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> ERROR:  invalid item pointer
> LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> ERROR:  block number 1 is out of range for relation "foo"
>
> From a user perspective it seems like I've made three very similar
> mistakes: in the first case, the offset is too high, in the second
> case it's too low, and in the third case the block number is out of
> range. But in one case I get a NOTICE and in the other two cases I get
> an ERROR. In one case I get the relation name and in the other two
> cases I don't. The two complaints about an invalid offset are phrased
> completely differently from each other. For example, suppose you do
> this:
>
> ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> number is out of range (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> number is out of range for this block (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
>

Corrected.

> I think I misled you when I said to use pg_class_aclcheck. I think it
> should actually be pg_class_ownercheck.
>

okay, I've changed it to pg_class_ownercheck.

> I think the relkind sanity check should permit RELKIND_MATVIEW also.
>

Yeah, actually we should allow MATVIEW; I don't know why I thought of
blocking it earlier.

> It's unclear to me why the freeze logic here shouldn't do this part of
> what heap_prepare_freeze_tuple() does when freezing xmax:
>
> frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
> frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
>

Yeah, we should have these changes when freezing the xmax.

> Likewise, why should we not freeze or invalidate xvac in the case
> where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
> would do?
>

Again, we should have this as well.

Apart from above, this time I've also added the documentation on
pg_surgery module and added a few more test-cases.

Attached patch with above changes.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
From 2aa56b9632cf1d291f4433afd972dc647d354dcb Mon Sep 17 00:00:00 2001
From: ashu 
Date: Wed, 12 Aug 2020 14:38:14 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
 table.

This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.

Ashutosh Sharma.
---
 contrib/Makefile   |   1 +
 contrib/pg_surgery/Makefile|  23 ++
 contrib/pg_surgery/expected/pg_surgery.out | 161 +
 contrib/pg_surgery/heap_surgery.c  | 375 +
 contrib/pg_surgery/pg_surgery--1.0.sql |  18 ++
 contrib/pg_surgery/pg_surgery.control  |   5 +
 contrib/pg_surgery/sql/pg_surgery.sql  |  89 +++
 doc/src/sgml/contrib.sgml  |   1 +
 doc/src/sgml/filelist.sgml |   1 +
 doc/src/sgml/pgsurgery.sgml|  94 
 10 files changed, 768 insertions(+)
 create mode 100644 contrib/pg_surgery/Makefile
 create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
 create mode 100644 contrib/pg_surgery/heap_surgery.c
 create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
 create mode 100644 contrib/pg_surgery/pg_surgery.control
 create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
 create mode 100644 doc/src/sgml/pgsurgery.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..07d5734 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
 		pg_standby	\
 		pg_stat_statements \
 		pg_trgm		\
+		pg_surgery	\
 		pgcrypto	\
 		pgrowlocks	\
 		pgstattuple	\
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-11 Thread Ashutosh Sharma
On Tue, Aug 11, 2020 at 7:33 PM Robert Haas  wrote:
>
> On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma  wrote:
> > The other two messages for reporting unused items and dead items
> > remain the same. Hence, with above change, we would be reporting the
> > following 4 messages:
> >
> > NOTICE:  skipping all the tids in block %u for relation "%s" because
> > the block number is out of range
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
> > number is out of range for this block
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused
> >
> > Please let me know if you are okay with the above changes or not?
>
> That seems broadly reasonable, but I would suggest phrasing the first
> message like this:
>
> skipping block %u for relation "%s" because the block number is out of range
>

Okay, thanks for the confirmation. I'll do that.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-11 Thread Robert Haas
On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma  wrote:
> The other two messages for reporting unused items and dead items
> remain the same. Hence, with above change, we would be reporting the
> following 4 messages:
>
> NOTICE:  skipping all the tids in block %u for relation "%s" because
> the block number is out of range
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
> number is out of range for this block
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused
>
> Please let me know if you are okay with the above changes or not?

That seems broadly reasonable, but I would suggest phrasing the first
message like this:

skipping block %u for relation "%s" because the block number is out of range

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-11 Thread Ashutosh Sharma
On Fri, Aug 7, 2020 at 9:21 PM Robert Haas  wrote:
>
> On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma  wrote:
> There's a certain inconsistency to these messages:
>
> rhaas=# create table foo (a int);
> CREATE TABLE
> rhaas=# insert into foo values (1);
> INSERT 0 1
> rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> NOTICE:  skipping tid (0, 2) because it contains an invalid offset
>  heap_force_kill
> -
>
> (1 row)
>
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> ERROR:  invalid item pointer
> LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> ERROR:  block number 1 is out of range for relation "foo"
>
> From a user perspective it seems like I've made three very similar
> mistakes: in the first case, the offset is too high, in the second
> case it's too low, and in the third case the block number is out of
> range. But in one case I get a NOTICE and in the other two cases I get
> an ERROR. In one case I get the relation name and in the other two
> cases I don't. The two complaints about an invalid offset are phrased
> completely differently from each other. For example, suppose you do
> this:
>
> ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> number is out of range (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> number is out of range for this block (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
>

Thank you for your suggestions. To make this consistent, I am planning
to make the following changes:

Remove the error message to report "invalid item pointer" from
tids_same_page_fetch_offnums() and expand the if-check to detect any
invalid offset number in the CRITICAL section such that it not just
checks if the offset number is > maxoffset, but also checks if the
offset number is equal to 0. That way it would also do the job that
"if (!ItemPointerIsValid)" was doing for us.

Further, if any invalid block number is detected, then I am planning
to skip all the tids associated with this block and move to the next
block. Hence, instead of reporting an error, I would report a NOTICE
message to the user.

The other two messages for reporting unused items and dead items
remain the same. Hence, with above change, we would be reporting the
following 4 messages:

NOTICE:  skipping all the tids in block %u for relation "%s" because
the block number is out of range

NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
number is out of range for this block

NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead

NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused

Please let me know if you are okay with the above changes or not?
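
For what it's worth, the proposed validation order can be sketched as a toy
model (plain Python with made-up block and offset bounds; the messages are
abbreviated relative to the proposed NOTICE texts, and this is not the
heap_surgery.c code):

```python
# Toy model of the proposed per-tid validation: out-of-range block first,
# then offset 0 or > maxoffset, then dead, then unused.

NBLOCKS = 2
MAXOFFSET = {0: 3, 1: 2}   # per-block max offset (assumed values)
DEAD = {(0, 2)}            # tids whose items are marked dead
UNUSED = {(1, 2)}          # tids whose items are marked unused

def check_tid(blkno, offnum):
    """Return a skip message, or None if the tid can be processed."""
    if blkno >= NBLOCKS:
        return "skipping block %u because the block number is out of range" % blkno
    if offnum == 0 or offnum > MAXOFFSET[blkno]:
        # Checking offnum == 0 here subsumes the old ItemPointerIsValid check.
        return ("skipping tid (%u, %u) because the item number is out of "
                "range for this block" % (blkno, offnum))
    if (blkno, offnum) in DEAD:
        return "skipping tid (%u, %u) because it is marked dead" % (blkno, offnum)
    if (blkno, offnum) in UNUSED:
        return "skipping tid (%u, %u) because it is marked unused" % (blkno, offnum)
    return None

assert check_tid(5, 1).startswith("skipping block 5")
assert check_tid(0, 0) is not None   # offset 0 is now caught up front
assert check_tid(0, 2).endswith("marked dead")
assert check_tid(0, 1) is None       # a good tid passes all four checks
```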

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-07 Thread Robert Haas
On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma  wrote:
> Attached v4 patch fixes the latest comments from Robert and Masahiko-san.

Compiler warning:

heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
always false [-Werror,-Wtautological-compare]
if (blkno < 0 || blkno >= nblocks)
~ ^ ~

There's a certain inconsistency to these messages:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
NOTICE:  skipping tid (0, 2) because it contains an invalid offset
 heap_force_kill
-

(1 row)

rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
ERROR:  invalid item pointer
LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
ERROR:  block number 1 is out of range for relation "foo"

From a user perspective it seems like I've made three very similar
mistakes: in the first case, the offset is too high, in the second
case it's too low, and in the third case the block number is out of
range. But in one case I get a NOTICE and in the other two cases I get
an ERROR. In one case I get the relation name and in the other two
cases I don't. The two complaints about an invalid offset are phrased
completely differently from each other. For example, suppose you do
this:

ERROR: tid (%u, %u) is invalid for relation "%s" because the block
number is out of range (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item
number is out of range for this block (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead

I think I misled you when I said to use pg_class_aclcheck. I think it
should actually be pg_class_ownercheck.

I think the relkind sanity check should permit RELKIND_MATVIEW also.

It's unclear to me why the freeze logic here shouldn't do this part of
what heap_prepare_freeze_tuple() does when freezing xmax:

frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;

Likewise, why should we not freeze or invalidate xvac in the case
where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
would do?
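
The xmax-freezing flag clearing above can be modeled as plain bit
operations (the constants below are illustrative placeholders for this
sketch, not verified values from htup_details.h):

```python
# Illustrative model of clearing t_infomask2 flags while freezing xmax.
# The bit values are assumptions made for the sketch only.

HEAP_KEYS_UPDATED = 0x2000
HEAP_HOT_UPDATED = 0x4000
OTHER_FLAG = 0x8000  # stands in for any unrelated infomask2 bit

def freeze_xmax_infomask2(t_infomask2):
    """Clear the two update-related flags, leaving other bits intact."""
    t_infomask2 &= ~HEAP_HOT_UPDATED
    t_infomask2 &= ~HEAP_KEYS_UPDATED
    return t_infomask2

before = OTHER_FLAG | HEAP_HOT_UPDATED | HEAP_KEYS_UPDATED
after = freeze_xmax_infomask2(before)
assert after == OTHER_FLAG            # both update flags cleared
assert after & HEAP_HOT_UPDATED == 0
```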

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-07 Thread Ashutosh Sharma
Thanks Rajkumar for testing the patch.

Here are some additional test-cases that I would suggest you execute,
if possible:

1) You may try running the test-cases that you have executed so far
with a streaming replication setup and see if the changes are getting
reflected on the standby.

2) You may also try running some concurrent test-cases for e.g. try
running these functions with VACUUM or some other sql commands
(preferable DML commands) in parallel.

3) See what happens when you pass some invalid tids (containing
invalid block or offset number) to these functions. You may also try
running these functions on the same tuple repeatedly and see the
behaviour.

...

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:25 PM Rajkumar Raghuwanshi
 wrote:
>
> I have been doing some user-level testing of this feature, apart from sanity 
> tests for the extension and its functions
>
> I have tried to corrupt tuples and then able to fix it using 
> heap_force_freeze/kill functions.
>
>
> --corrupt relfrozenxid, cause vacuum failed.
>
> update pg_class set relfrozenxid = (relfrozenxid::text::integer + 
> 10)::text::xid where relname = 'test_tbl';
>
> UPDATE 1
>
> insert into test_tbl values (2, 'BBB');
>
>
> postgres=# vacuum test_tbl;
>
> ERROR:  found xmin 507 from before relfrozenxid 516
>
> CONTEXT:  while scanning block 0 of relation "public.test_tbl"
>
>
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>
>  a |  b  | ctid  | xmin | xmax
>
> ---+-+---+--+--
>
>  1 | AAA | (0,1) |  505 |0
>
>  2 | BBB | (0,2) |  507 |0
>
> (2 rows)
>
>
> --fixed using heap_force_freeze
>
> postgres=# select heap_force_freeze('test_tbl'::regclass, 
> ARRAY['(0,2)']::tid[]);
>
>  heap_force_freeze
>
> ---
>
>
> postgres=# vacuum test_tbl;
>
> VACUUM
>
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>
>  a |  b  | ctid  | xmin | xmax
>
> ---+-+---+--+--
>
>  1 | AAA | (0,1) |  505 |0
>
>  2 | BBB | (0,2) |2 |0
>
> (2 rows)
>
>
> --corrupt table headers in base/oid. file, cause table access failed.
>
> postgres=# select ctid, * from test_tbl;
>
> ERROR:  could not access status of transaction 4294967295
>
> DETAIL:  Could not open file "pg_xact/0FFF": No such file or directory.
>
>
> --removed corrupted tuple using heap_force_kill
>
> postgres=# select heap_force_kill('test_tbl'::regclass, 
> ARRAY['(0,2)']::tid[]);
>
>  heap_force_kill
>
> -
>
>
>
> (1 row)
>
>
> postgres=# select ctid, * from test_tbl;
>
>  ctid  | a |  b
>
> ---+---+-
>
>  (0,1) | 1 | AAA
>
> (1 row)
>
>
> I will be continuing with my testing with the latest patch and will update 
> here if I find anything.
>
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada 
>  wrote:
>>
>> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma  wrote:
>> >
>> > Hi Robert,
>> >
>> > Thanks for the review. Please find my comments inline:
>> >
>> > On Sat, Aug 1, 2020 at 12:18 AM Robert Haas  wrote:
>> > >
>> > > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma  
>> > > wrote:
>> > > > Attached is the new version of patch that addresses the comments from 
>> > > > Andrey and Beena.
>> > >
>> > > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
>> > >
>> > > the -> a
>> > >
>> > > I also suggest: heap table -> relation, because we might want to
>> > > extend this to other cases later.
>> > >
>> >
>> > Corrected.
>> >
>> > > +-- toast table.
>> > > +begin;
>> > > +create table ttab(a text);
>> > > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
>> > > 65), '') from generate_series(1,1);
>> > > +select * from ttab where xmin = 2;
>> > > + a
>> > > +---
>> > > +(0 rows)
>> > > +
>> > > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
>> > > + heap_force_freeze
>> > > +---
>> > > +
>> > > +(1 row)
>> > > +
>> > >
>> > > I don't understand the point of this. You're not testing the function
>> > > on the TOAST table; you're testing it on the main table when there
>> > > happens to be a TOAST table that is probably getting used for
>> > > something. But that's not really relevant to what is being tested
>> > > here, so as written this seems redundant with the previous cases.
>> > >
>> >
>> > Yeah, it's being tested on the main table, not on a toast table. I've
>> > removed this test-case and also restricted direct access to the toast
>> > table using heap_force_kill/freeze functions. I think we shouldn't be
>> > using these functions to do any changes in the toast table. We will
>> > only use these functions with the main table and let VACUUM remove the
>> > corresponding data chunks (pointed by the tuple that got removed from
>> > the main table).
>> >
>> > Another option would be to identify all the data chunks corresponding
>> > to the tuple (ctid) being killed from the main table and remove them
>> > one by one. We will only do this if 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-07 Thread Ashutosh Sharma
On Thu, Aug 6, 2020 at 9:19 PM Robert Haas  wrote:
>
> On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma  wrote:
> > Okay, If you want I can remove the restriction on a toast table, but,
> > then that means a user can directly remove the data chunks from the
> > toast table without changing anything in the main table. This means we
> > won't be able to query the main table as it will fail with an error
> > like "ERROR:  unexpected chunk number ...". So, we will have to find
> > some way to identify the pointer to the chunks that got deleted from
> > the toast table and remove that pointer from the main table. We also
> > need to make sure that before we remove a tuple (pointer) from the
> > main table, we identify all the remaining data chunks pointed by this
> > tuple and remove them completely only then that table would be
> > considered to be in a good state. Now, I am not sure if we can
> > currently do all these things.
>
> That's the user's problem. If they don't have a plan for that, they
> shouldn't use this tool. I don't think we can or should try to handle
> it in this code.
>

Okay, thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-06 Thread Robert Haas
On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma  wrote:
> Okay, If you want I can remove the restriction on a toast table, but,
> then that means a user can directly remove the data chunks from the
> toast table without changing anything in the main table. This means we
> won't be able to query the main table as it will fail with an error
> like "ERROR:  unexpected chunk number ...". So, we will have to find
> some way to identify the pointer to the chunks that got deleted from
> the toast table and remove that pointer from the main table. We also
> need to make sure that before we remove a tuple (pointer) from the
> main table, we identify all the remaining data chunks pointed by this
> tuple and remove them completely only then that table would be
> considered to be in a good state. Now, I am not sure if we can
> currently do all these things.

That's the user's problem. If they don't have a plan for that, they
shouldn't use this tool. I don't think we can or should try to handle
it in this code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-06 Thread Ashutosh Sharma
Attached v4 patch fixes the latest comments from Robert and Masahiko-san.

Changes:
1) Allow the heap_force_kill and freeze functions to be used with toast tables.
2) Replace "int nskippedItems" with a "bool did_modify_page" flag to
know whether any modification happened in the page.
3) Declare some of the variables such as buf, page inside of the do
loop instead of declaring them at the top of heap_force_common
function.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:49 PM Masahiko Sawada
 wrote:
>
> On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma  wrote:
> >
> > Hello Masahiko-san,
> >
> > Thanks for looking into the patch. Please find my comments inline below:
> >
> > On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
> >  wrote:
> > >
> > > On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma  
> > > wrote:
> > > > Please check the attached patch for the changes.
> > >
> > > I also looked at this version patch and have some small comments:
> > >
> > > +   Oid relid = PG_GETARG_OID(0);
> > > +   ArrayType  *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
> > > +   ItemPointer tids;
> > > +   int ntids;
> > > +   Relationrel;
> > > +   Buffer  buf;
> > > +   Pagepage;
> > > +   ItemId  itemid;
> > > +   BlockNumber blkno;
> > > +   OffsetNumber   *offnos;
> > > +   OffsetNumberoffno,
> > > +   noffs,
> > > +   curr_start_ptr,
> > > +   next_start_ptr,
> > > +   maxoffset;
> > > +   int i,
> > > +   nskippedItems,
> > > +   nblocks;
> > >
> > > You declare all variables at the top of heap_force_common() function
> > > but I think we can declare some variables such as buf, page inside of
> > > the do loop.
> > >
> >
> > Sure, I will do this in the next version of patch.
> >
> > > ---
> > > +   if (offnos[i] > maxoffset)
> > > +   {
> > > +   ereport(NOTICE,
> > > +errmsg("skipping tid (%u, %u) because it
> > > contains an invalid offset",
> > > +   blkno, offnos[i]));
> > > +   continue;
> > > +   }
> > >
> > > If all tids on a page take the above path, we will end up logging FPI
> > > in spite of modifying nothing on the page.
> > >
> >
> > Yeah, that's right. I've taken care of this in the new version of
> > patch which I am yet to share.
> >
> > > ---
> > > +   /* XLOG stuff */
> > > +   if (RelationNeedsWAL(rel))
> > > +   log_newpage_buffer(buf, true);
> > >
> > > I think we need to set the returned LSN by log_newpage_buffer() to the 
> > > page lsn.
> > >
> >
> > I think we are already setting the page lsn in the log_newpage() which
> > is being called from log_newpage_buffer(). So, AFAIU, no change would
> > be required here. Please let me know if I am missing something.
>
> You're right. I'd missed the comment of log_newpage_buffer(). Thanks!
>
> Regards,
>
> --
> Masahiko Sawada            http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 6d616af51a8f9e6352145f799f635ec907ed83c2 Mon Sep 17 00:00:00 2001
From: ashu 
Date: Thu, 6 Aug 2020 18:02:29 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
 table.

This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.

Ashutosh Sharma.
---
 contrib/Makefile   |   1 +
 contrib/pg_surgery/Makefile|  23 ++
 contrib/pg_surgery/expected/pg_surgery.out |  78 +++
 contrib/pg_surgery/heap_surgery.c  | 362 +
 contrib/pg_surgery/pg_surgery--1.0.sql |  18 ++
 contrib/pg_surgery/pg_surgery.control  |   5 +
 contrib/pg_surgery/sql/pg_surgery.sql  |  61 +
 7 files changed, 548 insertions(+)
 create mode 100644 contrib/pg_surgery/Makefile
 create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
 create mode 100644 contrib/pg_surgery/heap_surgery.c
 create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
 create mode 100644 contrib/pg_surgery/pg_surgery.control
 create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..07d5734 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
 		pg_standby	\
 		pg_stat_statements \
 		pg_trgm		\
+		pg_surgery	\
 		pgcrypto	\
 		pgrowlocks	\
 		pgstattuple	\
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+	$(WIN32RES) \
+	heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-06 Thread Masahiko Sawada
On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma  wrote:
>
> Hello Masahiko-san,
>
> Thanks for looking into the patch. Please find my comments inline below:
>
> On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
>  wrote:
> >
> > On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma  wrote:
> > > Please check the attached patch for the changes.
> >
> > I also looked at this version patch and have some small comments:
> >
> > +   Oid relid = PG_GETARG_OID(0);
> > +   ArrayType  *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
> > +   ItemPointer tids;
> > +   int ntids;
> > +   Relationrel;
> > +   Buffer  buf;
> > +   Pagepage;
> > +   ItemId  itemid;
> > +   BlockNumber blkno;
> > +   OffsetNumber   *offnos;
> > +   OffsetNumberoffno,
> > +   noffs,
> > +   curr_start_ptr,
> > +   next_start_ptr,
> > +   maxoffset;
> > +   int i,
> > +   nskippedItems,
> > +   nblocks;
> >
> > You declare all variables at the top of heap_force_common() function
> > but I think we can declare some variables such as buf, page inside of
> > the do loop.
> >
>
> Sure, I will do this in the next version of patch.
>
> > ---
> > +   if (offnos[i] > maxoffset)
> > +   {
> > +   ereport(NOTICE,
> > +errmsg("skipping tid (%u, %u) because it
> > contains an invalid offset",
> > +   blkno, offnos[i]));
> > +   continue;
> > +   }
> >
> > If all tids on a page take the above path, we will end up logging FPI
> > in spite of modifying nothing on the page.
> >
>
> Yeah, that's right. I've taken care of this in the new version of
> patch which I am yet to share.
>
> > ---
> > +   /* XLOG stuff */
> > +   if (RelationNeedsWAL(rel))
> > +   log_newpage_buffer(buf, true);
> >
> > I think we need to set the returned LSN by log_newpage_buffer() to the page 
> > lsn.
> >
>
> I think we are already setting the page lsn in the log_newpage() which
> is being called from log_newpage_buffer(). So, AFAIU, no change would
> be required here. Please let me know if I am missing something.

You're right. I'd missed the comment of log_newpage_buffer(). Thanks!

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-06 Thread Ashutosh Sharma
Hello Masahiko-san,

Thanks for looking into the patch. Please find my comments inline below:

On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
 wrote:
>
> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma  wrote:
> > Please check the attached patch for the changes.
>
> I also looked at this version patch and have some small comments:
>
> +   Oid relid = PG_GETARG_OID(0);
> +   ArrayType  *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
> +   ItemPointer tids;
> +   int ntids;
> +   Relationrel;
> +   Buffer  buf;
> +   Pagepage;
> +   ItemId  itemid;
> +   BlockNumber blkno;
> +   OffsetNumber   *offnos;
> +   OffsetNumberoffno,
> +   noffs,
> +   curr_start_ptr,
> +   next_start_ptr,
> +   maxoffset;
> +   int i,
> +   nskippedItems,
> +   nblocks;
>
> You declare all variables at the top of heap_force_common() function
> but I think we can declare some variables such as buf, page inside of
> the do loop.
>

Sure, I will do this in the next version of patch.

> ---
> +   if (offnos[i] > maxoffset)
> +   {
> +   ereport(NOTICE,
> +errmsg("skipping tid (%u, %u) because it
> contains an invalid offset",
> +   blkno, offnos[i]));
> +   continue;
> +   }
>
> If all tids on a page take the above path, we will end up logging FPI
> in spite of modifying nothing on the page.
>

Yeah, that's right. I've taken care of this in the new version of
patch which I am yet to share.

> ---
> +   /* XLOG stuff */
> +   if (RelationNeedsWAL(rel))
> +   log_newpage_buffer(buf, true);
>
> I think we need to set the returned LSN by log_newpage_buffer() to the page 
> lsn.
>

I think we are already setting the page lsn in the log_newpage() which
is being called from log_newpage_buffer(). So, AFAIU, no change would
be required here. Please let me know if I am missing something.
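To make the point concrete, here is a minimal standalone sketch of the
pattern being described: the WAL-log helper stamps the returned LSN on the
page itself, so the caller does not need to set it again. All names here
(Lsn, FakePage, wal_log_full_page) are invented stand-ins for illustration,
not PostgreSQL APIs.

```c
#include <string.h>

/*
 * Sketch only: the WAL-logging helper returns the record's LSN and is
 * itself responsible for stamping it on the page, mirroring the behaviour
 * of log_newpage() discussed above. Types and names are stand-ins.
 */
typedef unsigned long long Lsn;

typedef struct
{
	Lsn		lsn;			/* LSN of the last WAL record touching this page */
	char	data[32];
} FakePage;

static Lsn	insert_position = 0x1000;	/* pretend WAL insert pointer */

static Lsn
wal_log_full_page(FakePage *page)
{
	Lsn		recptr = insert_position;

	insert_position += sizeof(FakePage);	/* "write" the record */
	page->lsn = recptr;		/* the helper sets the page LSN itself */
	return recptr;
}
```

A caller can therefore just invoke the helper and rely on the page LSN
being set, which is the reason no extra change is needed in the patch.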

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-06 Thread Rajkumar Raghuwanshi
I have been doing some user-level testing of this feature, apart from the
sanity tests for the extension and its functions.

I have tried corrupting tuples and was then able to fix them using the
heap_force_freeze/kill functions.


--corrupt relfrozenxid, causing vacuum to fail.

update pg_class set relfrozenxid = (relfrozenxid::text::integer +
10)::text::xid where relname = 'test_tbl';
UPDATE 1

insert into test_tbl values (2, 'BBB');

postgres=# vacuum test_tbl;
ERROR:  found xmin 507 from before relfrozenxid 516
CONTEXT:  while scanning block 0 of relation "public.test_tbl"

postgres=# select *, ctid, xmin, xmax from test_tbl;
 a |  b  | ctid  | xmin | xmax
---+-----+-------+------+------
 1 | AAA | (0,1) |  505 |    0
 2 | BBB | (0,2) |  507 |    0
(2 rows)

--fixed using heap_force_freeze

postgres=# select heap_force_freeze('test_tbl'::regclass,
ARRAY['(0,2)']::tid[]);
 heap_force_freeze
-------------------

(1 row)

postgres=# vacuum test_tbl;
VACUUM

postgres=# select *, ctid, xmin, xmax from test_tbl;
 a |  b  | ctid  | xmin | xmax
---+-----+-------+------+------
 1 | AAA | (0,1) |  505 |    0
 2 | BBB | (0,2) |    2 |    0
(2 rows)

--corrupt table headers in base/oid. file, causing table access to fail.

postgres=# select ctid, * from test_tbl;
ERROR:  could not access status of transaction 4294967295
DETAIL:  Could not open file "pg_xact/0FFF": No such file or directory.

--removed the corrupted tuple using heap_force_kill

postgres=# select heap_force_kill('test_tbl'::regclass,
ARRAY['(0,2)']::tid[]);
 heap_force_kill
-----------------

(1 row)

postgres=# select ctid, * from test_tbl;
 ctid  | a |  b
-------+---+-----
 (0,1) | 1 | AAA
(1 row)


I will be continuing with my testing with the latest patch and update here
if found anything.


Thanks & Regards,
Rajkumar Raghuwanshi


On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada <
masahiko.saw...@2ndquadrant.com> wrote:

> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma 
> wrote:
> >
> > Hi Robert,
> >
> > Thanks for the review. Please find my comments inline:
> >
> > On Sat, Aug 1, 2020 at 12:18 AM Robert Haas 
> wrote:
> > >
> > > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma 
> wrote:
> > > > Attached is the new version of patch that addresses the comments
> from Andrey and Beena.
> > >
> > > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
> > >
> > > the -> a
> > >
> > > I also suggest: heap table -> relation, because we might want to
> > > extend this to other cases later.
> > >
> >
> > Corrected.
> >
> > > +-- toast table.
> > > +begin;
> > > +create table ttab(a text);
> > > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> > > 65), '') from generate_series(1,1);
> > > +select * from ttab where xmin = 2;
> > > + a
> > > +---
> > > +(0 rows)
> > > +
> > > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> > > + heap_force_freeze
> > > +---
> > > +
> > > +(1 row)
> > > +
> > >
> > > I don't understand the point of this. You're not testing the function
> > > on the TOAST table; you're testing it on the main table when there
> > > happens to be a TOAST table that is probably getting used for
> > > something. But that's not really relevant to what is being tested
> > > here, so as written this seems redundant with the previous cases.
> > >
> >
> > Yeah, it's being tested on the main table, not on a toast table. I've
> > removed this test-case and also restricted direct access to the toast
> > table using heap_force_kill/freeze functions. I think we shouldn't be
> > using these functions to do any changes in the toast table. We will
> > only use these functions with the main table and let VACUUM remove the
> > corresponding data chunks (pointed by the tuple that got removed from
> > the main table).
> >
> > Another option would be to identify all the data chunks corresponding
> > to the tuple (ctid) being killed from the main table and remove them
> > one by one. We will only do this if the tuple from the main table that
> > has been marked as killed has an external storage. We will have to add
> > a bunch of code for this otherwise we can let VACUUM do this for us.
> > Let me know your thoughts on this.
> >
> > > +-- test pg_surgery functions with the unsupported relations. Should
> fail.
> > >
> > > Please name the specific functions being tested here in case we add
> > > more in the future that are tested separately.
> > >
> >
> > Done.
> >
> > > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
> > >
> > > I think we could drop _funcs from the file name.
> > >
> >
> > Done.
> >
> > > +#ifdef PG_MODULE_MAGIC
> > > +PG_MODULE_MAGIC;
> > > +#endif
> > >
> > > The #ifdef here is not required, and if you look at other contrib
> > > modules you'll see that they don't have it.
> > >
> >
> > Okay, done.
> >
> > > I don't like all the macros at the top of the file much. CHECKARRVALID
> > > is only used in one place, so it seems to me that you might as well
> > > just inline it 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-06 Thread Masahiko Sawada
On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma  wrote:
>
> Hi Robert,
>
> Thanks for the review. Please find my comments inline:
>
> On Sat, Aug 1, 2020 at 12:18 AM Robert Haas  wrote:
> >
> > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma  
> > wrote:
> > > Attached is the new version of patch that addresses the comments from 
> > > Andrey and Beena.
> >
> > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
> >
> > the -> a
> >
> > I also suggest: heap table -> relation, because we might want to
> > extend this to other cases later.
> >
>
> Corrected.
>
> > +-- toast table.
> > +begin;
> > +create table ttab(a text);
> > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> > 65), '') from generate_series(1,1);
> > +select * from ttab where xmin = 2;
> > + a
> > +---
> > +(0 rows)
> > +
> > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> > + heap_force_freeze
> > +---
> > +
> > +(1 row)
> > +
> >
> > I don't understand the point of this. You're not testing the function
> > on the TOAST table; you're testing it on the main table when there
> > happens to be a TOAST table that is probably getting used for
> > something. But that's not really relevant to what is being tested
> > here, so as written this seems redundant with the previous cases.
> >
>
> Yeah, it's being tested on the main table, not on a toast table. I've
> removed this test-case and also restricted direct access to the toast
> table using heap_force_kill/freeze functions. I think we shouldn't be
> using these functions to do any changes in the toast table. We will
> only use these functions with the main table and let VACUUM remove the
> corresponding data chunks (pointed by the tuple that got removed from
> the main table).
>
> Another option would be to identify all the data chunks corresponding
> to the tuple (ctid) being killed from the main table and remove them
> one by one. We will only do this if the tuple from the main table that
> has been marked as killed has an external storage. We will have to add
> a bunch of code for this otherwise we can let VACUUM do this for us.
> Let me know your thoughts on this.
>
> > +-- test pg_surgery functions with the unsupported relations. Should fail.
> >
> > Please name the specific functions being tested here in case we add
> > more in the future that are tested separately.
> >
>
> Done.
>
> > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
> >
> > I think we could drop _funcs from the file name.
> >
>
> Done.
>
> > +#ifdef PG_MODULE_MAGIC
> > +PG_MODULE_MAGIC;
> > +#endif
> >
> > The #ifdef here is not required, and if you look at other contrib
> > modules you'll see that they don't have it.
> >
>
> Okay, done.
>
> > I don't like all the macros at the top of the file much. CHECKARRVALID
> > is only used in one place, so it seems to me that you might as well
> > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
> >
>
> Done.
>
> > Once you do that, heap_force_common() can just compute the number of
> > array elements once, instead of doing it once inside ARRISEMPTY, then
> > again inside SORT, and then a third time to initialize ntids. You can
> > just have a local variable in that function that is set once and then
> > used as needed. Then you are only doing ARRNELEMS once, so you can get
> > rid of that macro too. The same technique can be used to get rid of
> > ARRPTR. So then all the macros go away without introducing any code
> > duplication.
> >
>
> Done.
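As an aside, the block-first ordering behind the old SORT macro can be
sketched in a few lines of standalone C: sort the tid array by block number
so all tids on the same block are processed together with a single buffer
read. TidLite and count_blocks here are simplified stand-ins for
illustration, not the actual patch code.

```c
#include <stdlib.h>

/*
 * Sketch only: a simplified stand-in for PostgreSQL's ItemPointerData,
 * sorted block-first and then offset-first so per-block processing can
 * walk the array in one pass.
 */
typedef struct
{
	unsigned int blkno;		/* block number */
	unsigned int offno;		/* offset number within the block */
} TidLite;

static int
tid_cmp(const void *a, const void *b)
{
	const TidLite *x = (const TidLite *) a;
	const TidLite *y = (const TidLite *) b;

	if (x->blkno != y->blkno)
		return (x->blkno < y->blkno) ? -1 : 1;
	if (x->offno != y->offno)
		return (x->offno < y->offno) ? -1 : 1;
	return 0;
}

/* Sort the tids, then count how many distinct blocks they touch. */
static int
count_blocks(TidLite *tids, int ntids)
{
	int			nblocks = 0;

	qsort(tids, ntids, sizeof(TidLite), tid_cmp);
	for (int i = 0; i < ntids; i++)
		if (i == 0 || tids[i].blkno != tids[i - 1].blkno)
			nblocks++;
	return nblocks;
}
```

With the array sorted this way, the number of array elements and the number
of blocks can each be computed once into locals, which is what makes the
ARRNELEMS/SORT macros unnecessary.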
>
> > +/* Options to forcefully change the state of a heap tuple. */
> > +typedef enum HTupleForceOption
> > +{
> > + FORCE_KILL,
> > + FORCE_FREEZE
> > +} HTupleForceOption;
> >
> > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
> > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE.
>
> Done.
>
> > Also, how about
> > option -> operation?
> >
>
> I think both look okay to me.
>
> > + return heap_force_common(fcinfo, FORCE_KILL);
> >
> > I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
> > know it's the same thing, though, and perhaps I'm even wrong about the
> > prevailing style.
> >
>
> Done.
>
> > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
> >
> > I think this is unnecessary. It's an enum with 2 values.
> >
>
> Removed.
>
> > + if (ARRISEMPTY(ta))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("empty tid array")));
> >
> > I don't see why this should be an error. Why can't we just continue
> > normally and if it does nothing, it does nothing? You'd need to change
> > the do..while loop to a while loop so that the end condition is tested
> > at the top, but that seems fine.
> >
>
> I think it's okay to have this check. So, just left it as-is. We do
> have such checks in other contrib modules as well wherever the array
> is being passed as an input to the function.
>
> > + rel = relation_open(relid, AccessShareLock);
> >
> > Maybe we should take 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-06 Thread Ashutosh Sharma
On Thu, Aug 6, 2020 at 1:29 AM Robert Haas  wrote:
>
> On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma  wrote:
> > Yeah, it's being tested on the main table, not on a toast table. I've
> > removed this test-case and also restricted direct access to the toast
> > table using heap_force_kill/freeze functions. I think we shouldn't be
> > using these functions to do any changes in the toast table. We will
> > only use these functions with the main table and let VACUUM remove the
> > corresponding data chunks (pointed by the tuple that got removed from
> > the main table).
>
> I agree with removing the test case, but I disagree with restricting
> this from being used on the TOAST table. These are tools for experts,
> who may use them as they see fit. It's unlikely that it would be
> useful to use this on a TOAST table, I think, but not impossible.
>

Okay, if you want, I can remove the restriction on a toast table, but
then that means a user can directly remove the data chunks from the
toast table without changing anything in the main table. This means we
won't be able to query the main table as it will fail with an error
like "ERROR:  unexpected chunk number ...". So, we will have to find
some way to identify the pointer to the chunks that got deleted from
the toast table and remove that pointer from the main table. We also
need to make sure that before we remove a tuple (pointer) from the
main table, we identify all the remaining data chunks pointed by this
tuple and remove them completely only then that table would be
considered to be in a good state. Now, I am not sure if we can
currently do all these things.

> > Another option would be to identify all the data chunks corresponding
> > to the tuple (ctid) being killed from the main table and remove them
> > one by one. We will only do this if the tuple from the main table that
> > has been marked as killed has an external storage. We will have to add
> > a bunch of code for this otherwise we can let VACUUM do this for us.
> > Let me know your thoughts on this.
>
> I don't think VACUUM will do anything for us automatically -- it isn't
> going to know that we force-killed the tuple in the main table.
> Normally, a tuple delete would have to set xmax on the TOAST tuples
> and then VACUUM would do its thing, but in this case that won't
> happen. So if you force-kill a tuple in the main table you would end
> up with a space leak in the TOAST table.
>
> The problem here is that one reason you might force-killing a tuple in
> the main table is because it's full of garbage. If so, trying to
> decode the tuple so that you can find the TOAST pointers might crash
> or error out, or maybe that part will work but then you'll error out
> trying to look up the corresponding TOAST tuples, either because the
> values are not valid or because the TOAST table itself is generally
> hosed in some way. So I think it is probably best if we keep this tool
> as simple as possible, with as few dependencies as we can, and
> document the possible negative outcomes of using it.

I completely agree with you.

> It's not
> impossible to recover from a space-leak like this; you can always move
> the data into a new table with CTAS and then drop the old one. Not
> sure whether CLUSTER or VACUUM FULL would also be sufficient.
>

Yeah, I think we can use either CTAS or VACUUM FULL; both look fine.

> Separately, we might want to add a TOAST-checker to amcheck, or
> enhance the heap-checker Mark is working on, and one of the things it
> could do is check for TOAST entries to which nothing points. Then if
> you force-kill tuples in the main table you could also use that tool
> to look for things in the TOAST table that ought to also be
> force-killed.
>

Okay, good to know that. Thanks for sharing this info.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-06 Thread Ashutosh Sharma
On Thu, Aug 6, 2020 at 1:04 AM Robert Haas  wrote:
>
> On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma  wrote:
> > If the above path is taken that means none of the items in the page
> > got changed.
>
> Oops. I didn't realize that, sorry. Maybe it would be a little more
> clear if instead of "int nSkippedItems" you had "bool
> did_modify_page"? Then you could initialize it to false and set it to
> true just before doing the page modifications.
>

Okay, np, in that case, as you suggested, I will replace "int
nSkippedItems" with "did_modify_page" to increase the clarity. I will
do this change in the next version of patch. Thanks.
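For illustration, the "did_modify_page" pattern under discussion might look
like this in standalone C: walk the requested offsets, skip the invalid
ones, and remember whether anything actually changed so a full-page WAL
record is emitted only when the page was really modified. The page
representation and fix_item() are invented stand-ins, not the actual
pg_surgery code.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_OFFSET 10

/* Pretend item fixer: returns true only if the item actually changed. */
static bool
fix_item(int page[], unsigned int offno)
{
	if (page[offno] == 0)		/* pretend 0 means "already dead" */
		return false;
	page[offno] = 0;
	return true;
}

/* Returns true if a WAL record (e.g. an FPI) would have been emitted. */
static bool
process_page(int page[], const unsigned int *offnos, int noffs)
{
	bool		did_modify_page = false;

	for (int i = 0; i < noffs; i++)
	{
		if (offnos[i] >= MAX_OFFSET)
		{
			fprintf(stderr, "skipping offset %u: out of range\n", offnos[i]);
			continue;
		}
		if (fix_item(page, offnos[i]))
			did_modify_page = true;	/* set only when something changed */
	}
	/* WAL-log only if the page really changed */
	return did_modify_page;
}
```

This also covers the earlier review point: if every tid on a page is
skipped as invalid, the flag stays false and no FPI is logged.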

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



