Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sun, Aug 7, 2016 at 5:46 PM, Tom Lane wrote: > Robert Haas writes: >> On Sun, Aug 7, 2016 at 2:03 PM, Tom Lane wrote: >>> What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might >>> be implementable with a couple hours' work, though. Do you have a >>> reason to think it'd be insufficient? > >> No - if you can implement that quickly, I think it sounds like a great idea. > > Pushed. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
Robert Haas writes: > On Sun, Aug 7, 2016 at 2:03 PM, Tom Lane wrote: >> What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might >> be implementable with a couple hours' work, though. Do you have a >> reason to think it'd be insufficient? > No - if you can implement that quickly, I think it sounds like a great idea. Pushed. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
Robert Haas writes: > On Sun, Aug 7, 2016 at 2:03 PM, Tom Lane wrote: >> What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might >> be implementable with a couple hours' work, though. Do you have a >> reason to think it'd be insufficient? > No - if you can implement that quickly, I think it sounds like a great idea. I'll look into it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sun, Aug 7, 2016 at 2:03 PM, Tom Lane wrote: > Robert Haas writes: >> So I think in the short term what we should do about this is just fix >> it so it doesn't crash. > > Well, we clearly need to fix GetOldestSnapshot so it won't crash, > but I do not think that having RETURNING queries randomly returning > "ERROR: no known snapshots" is acceptable even for a beta release. > If we aren't prepared to do something about that before Monday, > I think we need to revert 3e2f3c2e until we do have a fix for it. > > What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might > be implementable with a couple hours' work, though. Do you have a > reason to think it'd be insufficient? No - if you can implement that quickly, I think it sounds like a great idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sun, Aug 7, 2016 at 1:57 PM, Tom Lane wrote: > Andrew Gierth writes: >> In a similar case in the past involving holdable cursors, the solution >> was to detoast _before_ storing in the tuplestore (see >> PersistHoldablePortal). I guess the question now is, under what >> circumstances is it now allowable to detoast a datum with no active >> snapshot? (Wouldn't it be necessary in such a case to know what the >> oldest snapshot ever used in the transaction was?) > > After looking at this a bit, I think probably the appropriate solution > is to register the snapshot that was used by the query and store it as > a property of the Portal, releasing it when the Portal is destroyed. > Essentially, this views possession of a relevant snapshot as a resource > that is required to make toast dereferences safe. Hmm, good idea. > I think there has been a bug here for awhile. Consider a committed-dead > row with some associated toast data, and suppose the query's snapshot > was the last one that could see that row. Once we destroy the query's > snapshot, there is nothing preventing a concurrent VACUUM from removing > the dead row and the toast data. Yeah: as you see, I came to the same conclusion. > When the RETURNING code was originally > written, I think this was safe enough, because the bookkeeping that > determined when VACUUM could remove data was based on transactions' > advertised xmins, and those did not move once set for the life of the > transaction. So dereferencing a toast pointer you'd fetched was safe > for the rest of the transaction. But when we changed over to > oldest-snapshot-based xmin advertisement, and made it so that a > transaction holding no snapshots advertised no xmin, we created a hazard > for data held in portals. But I missed this aspect of it. > In this view of things, flattening toast pointers in "held" tuplestores > is still necessary, but it's because their protective snapshot is going > away not because the transaction as a whole is going away. But as long > as we hold onto the relevant snapshot, we don't have to do that for > portals used intra-transaction. > > It's interesting to think about whether we could let snapshots outlive > transactions and thereby not need to flatten "held" tuplestores either. > I kinda doubt it's a good idea because of the potential bad effects > for vacuum not being able to remove dead rows for a long time; but > it seems at least possible to do it, in this world-view. EnterpriseDB has had complaints from customers about the cost of flattening TOAST pointers when tuplestores are held, so I think providing an option to skip the flattening (at the risk of increased bloat) would be a good idea even if we chose not to change the default behavior. It's worth noting that the ability to set old_snapshot_threshold serves as a way of limiting the damage that can be caused by the open snapshot, so that optional behavior might be more appealing now than heretofore. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
Robert Haas writes: > So I think in the short term what we should do about this is just fix > it so it doesn't crash. Well, we clearly need to fix GetOldestSnapshot so it won't crash, but I do not think that having RETURNING queries randomly returning "ERROR: no known snapshots" is acceptable even for a beta release. If we aren't prepared to do something about that before Monday, I think we need to revert 3e2f3c2e until we do have a fix for it. What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might be implementable with a couple hours' work, though. Do you have a reason to think it'd be insufficient? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sat, Aug 6, 2016 at 9:00 AM, Andrew Gierth wrote: > Hmm. > > So this happens because RETURNING queries run to completion immediately > and populate a tuplestore with the results, and the portal then fetches > from the tuplestore to send to the destination. The assumption is that > the tuplestore output can be processed without needing a snapshot, which > obviously is not true now if it contains toasted data. > > In a similar case in the past involving holdable cursors, the solution > was to detoast _before_ storing in the tuplestore (see > PersistHoldablePortal). I guess the question now is, under what > circumstances is it now allowable to detoast a datum with no active > snapshot? (Wouldn't it be necessary in such a case to know what the > oldest snapshot ever used in the transaction was?) Yes, I think you're right. Suppose we leave "snapshot too old" to one side; assume that feature is disabled. If the backend fetches the heap tuples without de-TOAST-ing, releases its snapshot (presumably resetting xmin), and then goes into the tank for a while, those tuples could be pruned. When the backend wakes up again and tries to TOAST, we would get the the exact sort of heap:toast disagreement that we set out to prevent here. That's not likely to occur because in most cases the number of tuples returned will be small, and VACUUM is quite unlikely to remove them before we de-TOAST. But this report makes me suspect it can happen (I have not tested). With "snapshot too old" enabled, it becomes much more likely. The offending prune operation can happen at any time after our snapshot times out and before we de-TOAST, rather than needing to happen after the query ends and before we de-TOAST. So I think in the short term what we should do about this is just fix it so it doesn't crash. In the longer term, we might want to think a bit more carefully about the way we handle de-TOASTing overall; we've had a number of different bugs that are all rooted in failure to think carefully about the fact that the query's snapshot needs to live at least as long as any TOAST pointers that we might want to de-TOAST later (3f8c8e3c61cef5729980ee4372ec159862a979f1, ec543db77b6b72f24d0a637c4a4a419cf8311d0b). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
Andrew Gierth writes: > In a similar case in the past involving holdable cursors, the solution > was to detoast _before_ storing in the tuplestore (see > PersistHoldablePortal). I guess the question now is, under what > circumstances is it now allowable to detoast a datum with no active > snapshot? (Wouldn't it be necessary in such a case to know what the > oldest snapshot ever used in the transaction was?) After looking at this a bit, I think probably the appropriate solution is to register the snapshot that was used by the query and store it as a property of the Portal, releasing it when the Portal is destroyed. Essentially, this views possession of a relevant snapshot as a resource that is required to make toast dereferences safe. I think there has been a bug here for awhile. Consider a committed-dead row with some associated toast data, and suppose the query's snapshot was the last one that could see that row. Once we destroy the query's snapshot, there is nothing preventing a concurrent VACUUM from removing the dead row and the toast data. When the RETURNING code was originally written, I think this was safe enough, because the bookkeeping that determined when VACUUM could remove data was based on transactions' advertised xmins, and those did not move once set for the life of the transaction. So dereferencing a toast pointer you'd fetched was safe for the rest of the transaction. But when we changed over to oldest-snapshot-based xmin advertisement, and made it so that a transaction holding no snapshots advertised no xmin, we created a hazard for data held in portals. In this view of things, flattening toast pointers in "held" tuplestores is still necessary, but it's because their protective snapshot is going away not because the transaction as a whole is going away. But as long as we hold onto the relevant snapshot, we don't have to do that for portals used intra-transaction. It's interesting to think about whether we could let snapshots outlive transactions and thereby not need to flatten "held" tuplestores either. I kinda doubt it's a good idea because of the potential bad effects for vacuum not being able to remove dead rows for a long time; but it seems at least possible to do it, in this world-view. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
> "Amit" == Amit Kapila writes: Amit> Sure, that is the reason of crash, but even if we do that it will Amit> lead to an error "no known snapshots". Here, what is going on is Amit> that we initialized toast snapshot when there is no active Amit> snapshot in the backend, so GetOldestSnapshot() won't return any Amit> snapshot. Hmm. So this happens because RETURNING queries run to completion immediately and populate a tuplestore with the results, and the portal then fetches from the tuplestore to send to the destination. The assumption is that the tuplestore output can be processed without needing a snapshot, which obviously is not true now if it contains toasted data. In a similar case in the past involving holdable cursors, the solution was to detoast _before_ storing in the tuplestore (see PersistHoldablePortal). I guess the question now is, under what circumstances is it now allowable to detoast a datum with no active snapshot? (Wouldn't it be necessary in such a case to know what the oldest snapshot ever used in the transaction was?) Amit> I think for such situations, we need to initialize the lsn and Amit> whenTaken of ToastSnapshot as we do in GetSnapshotData() [1]. Would that not give a too-recent LSN, resulting in possibly failing to fetch the toast rows? -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sat, Aug 6, 2016 at 5:51 PM, Andrew Gierth wrote: >> "Andreas" == Andreas Seltenreich writes: > > 418 if (OldestActiveSnapshot != NULL) > 419 ActiveLSN = OldestActiveSnapshot->as_snap->lsn; > 420 > 421 if (XLogRecPtrIsInvalid(RegisteredLSN) || RegisteredLSN > ActiveLSN) > 422 return OldestActiveSnapshot->as_snap; > > This second conditional should clearly be inside the first one... > Sure, that is the reason of crash, but even if we do that it will lead to an error "no known snapshots". Here, what is going on is that we initialized toast snapshot when there is no active snapshot in the backend, so GetOldestSnapshot() won't return any snapshot. I think for such situations, we need to initialize the lsn and whenTaken of ToastSnapshot as we do in GetSnapshotData() [1]. We need to do this when snapshot returned by GetOldestSnapshot() is NULL. Thoughts? [1] In below code if (old_snapshot_threshold < 0) { .. } else { /* * Capture the current time and WAL stream location in case this * snapshot becomes old enough to need to fall back on the special * "old snapshot" logic. */ snapshot->lsn = GetXLogInsertRecPtr(); snapshot->whenTaken = GetSnapshotCurrentTimestamp(); MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin); } -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
> "Andreas" == Andreas Seltenreich writes: 418 if (OldestActiveSnapshot != NULL) 419 ActiveLSN = OldestActiveSnapshot->as_snap->lsn; 420 421 if (XLogRecPtrIsInvalid(RegisteredLSN) || RegisteredLSN > ActiveLSN) 422 return OldestActiveSnapshot->as_snap; This second conditional should clearly be inside the first one... -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sat, Aug 6, 2016 at 6:32 PM, Andreas Seltenreich wrote: > since updating master from c93d873..fc509cd, I see crashes in > GetOldestSnapshot() on update/delete returning statements. > > I reduced the triggering statements down to this: > > update clstr_tst set d = d returning d; > > Backtrace below. 3e2f3c2e is likely to blame here.. I have moved the open item "old_snapshot_threshold allows heap:toast disagreement" back to the list of open items. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers