Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sun, Aug 7, 2016 at 5:46 PM, Tom Lanewrote: > Robert Haas writes: >> On Sun, Aug 7, 2016 at 2:03 PM, Tom Lane wrote: >>> What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might >>> be implementable with a couple hours' work, though. Do you have a >>> reason to think it'd be insufficient? > >> No - if you can implement that quickly, I think it sounds like a great idea. > > Pushed. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
Robert Haaswrites: > On Sun, Aug 7, 2016 at 2:03 PM, Tom Lane wrote: >> What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might >> be implementable with a couple hours' work, though. Do you have a >> reason to think it'd be insufficient? > No - if you can implement that quickly, I think it sounds like a great idea. Pushed. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
Robert Haaswrites: > On Sun, Aug 7, 2016 at 2:03 PM, Tom Lane wrote: >> What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might >> be implementable with a couple hours' work, though. Do you have a >> reason to think it'd be insufficient? > No - if you can implement that quickly, I think it sounds like a great idea. I'll look into it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sun, Aug 7, 2016 at 2:03 PM, Tom Lanewrote: > Robert Haas writes: >> So I think in the short term what we should do about this is just fix >> it so it doesn't crash. > > Well, we clearly need to fix GetOldestSnapshot so it won't crash, > but I do not think that having RETURNING queries randomly returning > "ERROR: no known snapshots" is acceptable even for a beta release. > If we aren't prepared to do something about that before Monday, > I think we need to revert 3e2f3c2e until we do have a fix for it. > > What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might > be implementable with a couple hours' work, though. Do you have a > reason to think it'd be insufficient? No - if you can implement that quickly, I think it sounds like a great idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sun, Aug 7, 2016 at 1:57 PM, Tom Lanewrote: > Andrew Gierth writes: >> In a similar case in the past involving holdable cursors, the solution >> was to detoast _before_ storing in the tuplestore (see >> PersistHoldablePortal). I guess the question now is, under what >> circumstances is it now allowable to detoast a datum with no active >> snapshot? (Wouldn't it be necessary in such a case to know what the >> oldest snapshot ever used in the transaction was?) > > After looking at this a bit, I think probably the appropriate solution > is to register the snapshot that was used by the query and store it as > a property of the Portal, releasing it when the Portal is destroyed. > Essentially, this views possession of a relevant snapshot as a resource > that is required to make toast dereferences safe. Hmm, good idea. > I think there has been a bug here for awhile. Consider a committed-dead > row with some associated toast data, and suppose the query's snapshot > was the last one that could see that row. Once we destroy the query's > snapshot, there is nothing preventing a concurrent VACUUM from removing > the dead row and the toast data. Yeah: as you see, I came to the same conclusion. > When the RETURNING code was originally > written, I think this was safe enough, because the bookkeeping that > determined when VACUUM could remove data was based on transactions' > advertised xmins, and those did not move once set for the life of the > transaction. So dereferencing a toast pointer you'd fetched was safe > for the rest of the transaction. But when we changed over to > oldest-snapshot-based xmin advertisement, and made it so that a > transaction holding no snapshots advertised no xmin, we created a hazard > for data held in portals. But I missed this aspect of it. > In this view of things, flattening toast pointers in "held" tuplestores > is still necessary, but it's because their protective snapshot is going > away not because the transaction as a whole is going away. But as long > as we hold onto the relevant snapshot, we don't have to do that for > portals used intra-transaction. > > It's interesting to think about whether we could let snapshots outlive > transactions and thereby not need to flatten "held" tuplestores either. > I kinda doubt it's a good idea because of the potential bad effects > for vacuum not being able to remove dead rows for a long time; but > it seems at least possible to do it, in this world-view. EnterpriseDB has had complaints from customers about the cost of flattening TOAST pointers when tuplestores are held, so I think providing an option to skip the flattening (at the risk of increased bloat) would be a good idea even if we chose not to change the default behavior. It's worth noting that the ability to set old_snapshot_threshold serves as a way of limiting the damage that can be caused by the open snapshot, so that optional behavior might be more appealing now than heretofore. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
Robert Haaswrites: > So I think in the short term what we should do about this is just fix > it so it doesn't crash. Well, we clearly need to fix GetOldestSnapshot so it won't crash, but I do not think that having RETURNING queries randomly returning "ERROR: no known snapshots" is acceptable even for a beta release. If we aren't prepared to do something about that before Monday, I think we need to revert 3e2f3c2e until we do have a fix for it. What I suggested just now in <2850.1470592...@sss.pgh.pa.us> might be implementable with a couple hours' work, though. Do you have a reason to think it'd be insufficient? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sat, Aug 6, 2016 at 9:00 AM, Andrew Gierthwrote: > Hmm. > > So this happens because RETURNING queries run to completion immediately > and populate a tuplestore with the results, and the portal then fetches > from the tuplestore to send to the destination. The assumption is that > the tuplestore output can be processed without needing a snapshot, which > obviously is not true now if it contains toasted data. > > In a similar case in the past involving holdable cursors, the solution > was to detoast _before_ storing in the tuplestore (see > PersistHoldablePortal). I guess the question now is, under what > circumstances is it now allowable to detoast a datum with no active > snapshot? (Wouldn't it be necessary in such a case to know what the > oldest snapshot ever used in the transaction was?) Yes, I think you're right. Suppose we leave "snapshot too old" to one side; assume that feature is disabled. If the backend fetches the heap tuples without de-TOAST-ing, releases its snapshot (presumably resetting xmin), and then goes into the tank for a while, those tuples could be pruned. When the backend wakes up again and tries to TOAST, we would get the the exact sort of heap:toast disagreement that we set out to prevent here. That's not likely to occur because in most cases the number of tuples returned will be small, and VACUUM is quite unlikely to remove them before we de-TOAST. But this report makes me suspect it can happen (I have not tested). With "snapshot too old" enabled, it becomes much more likely. The offending prune operation can happen at any time after our snapshot times out and before we de-TOAST, rather than needing to happen after the query ends and before we de-TOAST. So I think in the short term what we should do about this is just fix it so it doesn't crash. In the longer term, we might want to think a bit more carefully about the way we handle de-TOASTing overall; we've had a number of different bugs that are all rooted in failure to think carefully about the fact that the query's snapshot needs to live at least as long as any TOAST pointers that we might want to de-TOAST later (3f8c8e3c61cef5729980ee4372ec159862a979f1, ec543db77b6b72f24d0a637c4a4a419cf8311d0b). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
Andrew Gierthwrites: > In a similar case in the past involving holdable cursors, the solution > was to detoast _before_ storing in the tuplestore (see > PersistHoldablePortal). I guess the question now is, under what > circumstances is it now allowable to detoast a datum with no active > snapshot? (Wouldn't it be necessary in such a case to know what the > oldest snapshot ever used in the transaction was?) After looking at this a bit, I think probably the appropriate solution is to register the snapshot that was used by the query and store it as a property of the Portal, releasing it when the Portal is destroyed. Essentially, this views possession of a relevant snapshot as a resource that is required to make toast dereferences safe. I think there has been a bug here for awhile. Consider a committed-dead row with some associated toast data, and suppose the query's snapshot was the last one that could see that row. Once we destroy the query's snapshot, there is nothing preventing a concurrent VACUUM from removing the dead row and the toast data. When the RETURNING code was originally written, I think this was safe enough, because the bookkeeping that determined when VACUUM could remove data was based on transactions' advertised xmins, and those did not move once set for the life of the transaction. So dereferencing a toast pointer you'd fetched was safe for the rest of the transaction. But when we changed over to oldest-snapshot-based xmin advertisement, and made it so that a transaction holding no snapshots advertised no xmin, we created a hazard for data held in portals. In this view of things, flattening toast pointers in "held" tuplestores is still necessary, but it's because their protective snapshot is going away not because the transaction as a whole is going away. But as long as we hold onto the relevant snapshot, we don't have to do that for portals used intra-transaction. It's interesting to think about whether we could let snapshots outlive transactions and thereby not need to flatten "held" tuplestores either. I kinda doubt it's a good idea because of the potential bad effects for vacuum not being able to remove dead rows for a long time; but it seems at least possible to do it, in this world-view. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
> "Amit" == Amit Kapilawrites: Amit> Sure, that is the reason of crash, but even if we do that it will Amit> lead to an error "no known snapshots". Here, what is going on is Amit> that we initialized toast snapshot when there is no active Amit> snapshot in the backend, so GetOldestSnapshot() won't return any Amit> snapshot. Hmm. So this happens because RETURNING queries run to completion immediately and populate a tuplestore with the results, and the portal then fetches from the tuplestore to send to the destination. The assumption is that the tuplestore output can be processed without needing a snapshot, which obviously is not true now if it contains toasted data. In a similar case in the past involving holdable cursors, the solution was to detoast _before_ storing in the tuplestore (see PersistHoldablePortal). I guess the question now is, under what circumstances is it now allowable to detoast a datum with no active snapshot? (Wouldn't it be necessary in such a case to know what the oldest snapshot ever used in the transaction was?) Amit> I think for such situations, we need to initialize the lsn and Amit> whenTaken of ToastSnapshot as we do in GetSnapshotData() [1]. Would that not give a too-recent LSN, resulting in possibly failing to fetch the toast rows? -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sat, Aug 6, 2016 at 5:51 PM, Andrew Gierthwrote: >> "Andreas" == Andreas Seltenreich writes: > > 418 if (OldestActiveSnapshot != NULL) > 419 ActiveLSN = OldestActiveSnapshot->as_snap->lsn; > 420 > 421 if (XLogRecPtrIsInvalid(RegisteredLSN) || RegisteredLSN > ActiveLSN) > 422 return OldestActiveSnapshot->as_snap; > > This second conditional should clearly be inside the first one... > Sure, that is the reason of crash, but even if we do that it will lead to an error "no known snapshots". Here, what is going on is that we initialized toast snapshot when there is no active snapshot in the backend, so GetOldestSnapshot() won't return any snapshot. I think for such situations, we need to initialize the lsn and whenTaken of ToastSnapshot as we do in GetSnapshotData() [1]. We need to do this when snapshot returned by GetOldestSnapshot() is NULL. Thoughts? [1] In below code if (old_snapshot_threshold < 0) { .. } else { /* * Capture the current time and WAL stream location in case this * snapshot becomes old enough to need to fall back on the special * "old snapshot" logic. */ snapshot->lsn = GetXLogInsertRecPtr(); snapshot->whenTaken = GetSnapshotCurrentTimestamp(); MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin); } -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
> "Andreas" == Andreas Seltenreichwrites: 418 if (OldestActiveSnapshot != NULL) 419 ActiveLSN = OldestActiveSnapshot->as_snap->lsn; 420 421 if (XLogRecPtrIsInvalid(RegisteredLSN) || RegisteredLSN > ActiveLSN) 422 return OldestActiveSnapshot->as_snap; This second conditional should clearly be inside the first one... -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
On Sat, Aug 6, 2016 at 6:32 PM, Andreas Seltenreichwrote: > since updating master from c93d873..fc509cd, I see crashes in > GetOldestSnapshot() on update/delete returning statements. > > I reduced the triggering statements down to this: > > update clstr_tst set d = d returning d; > > Backtrace below. 3e2f3c2e is likely to blame here.. I have moved the open item "old_snapshot_threshold allows heap:toast disagreement" back to the list of open items. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [sqlsmith] Crash in GetOldestSnapshot()
Hi, since updating master from c93d873..fc509cd, I see crashes in GetOldestSnapshot() on update/delete returning statements. I reduced the triggering statements down to this: update clstr_tst set d = d returning d; Backtrace below. regards, Andreas Program received signal SIGSEGV, Segmentation fault. (gdb) bt #0 GetOldestSnapshot () at snapmgr.c:422 #1 0x004b8279 in init_toast_snapshot (toast_snapshot=0x7ffcd824b010) at tuptoaster.c:2314 #2 0x004b83bc in toast_fetch_datum (attr=) at tuptoaster.c:1869 #3 0x004b9ab5 in heap_tuple_untoast_attr (attr=0x18226c8) at tuptoaster.c:179 #4 0x007f71ad in pg_detoast_datum_packed (datum=) at fmgr.c:2266 #5 0x007cfc12 in text_to_cstring (t=0x18226c8) at varlena.c:186 #6 0x007f5735 in FunctionCall1Coll (flinfo=flinfo@entry=0x18221c0, collation=collation@entry=0, arg1=arg1@entry=25306824) at fmgr.c:1297 #7 0x007f68ee in OutputFunctionCall (flinfo=0x18221c0, val=25306824) at fmgr.c:1946 #8 0x00478bc1 in printtup (slot=0x1821f80, self=0x181ce48) at printtup.c:359 #9 0x006f9c8e in RunFromStore (portal=portal@entry=0x177cbf8, direction=direction@entry=ForwardScanDirection, count=count@entry=0, dest=0x181ce48) at pquery.c:1117 #10 0x006f9d52 in PortalRunSelect (portal=portal@entry=0x177cbf8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x181ce48) at pquery.c:942 #11 0x006fb41e in PortalRun (portal=portal@entry=0x177cbf8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x181ce48, altdest=altdest@entry=0x181ce48, completionTag=completionTag@entry=0x7ffcd824b920 "") at pquery.c:787 #12 0x006f822b in exec_simple_query (query_string=0x17db878 "update clstr_tst set d = d returning d;") at postgres.c:1094 #13 PostgresMain (argc=, argv=argv@entry=0x1781ce0, dbname=0x1781b40 "regression", username=) at postgres.c:4074 #14 0x0046c9bd in BackendRun (port=0x1786920) at postmaster.c:4262 #15 BackendStartup (port=0x1786920) at postmaster.c:3936 #16 ServerLoop () at postmaster.c:1693 #17 0x00693044 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x175d5f0) at postmaster.c:1301 #18 0x0046dd26 in main (argc=3, argv=0x175d5f0) at main.c:228 (gdb) list 417 418 if (OldestActiveSnapshot != NULL) 419 ActiveLSN = OldestActiveSnapshot->as_snap->lsn; 420 421 if (XLogRecPtrIsInvalid(RegisteredLSN) || RegisteredLSN > ActiveLSN) 422 return OldestActiveSnapshot->as_snap; 423 424 return OldestRegisteredSnapshot; 425 } 426 (gdb) bt full #0 GetOldestSnapshot () at snapmgr.c:422 OldestRegisteredSnapshot = RegisteredLSN = ActiveLSN = #1 0x004b8279 in init_toast_snapshot (toast_snapshot=0x7ffcd824b010) at tuptoaster.c:2314 snapshot = #2 0x004b83bc in toast_fetch_datum (attr=) at tuptoaster.c:1869 toastrel = 0x7f8b4ca88920 toastidxs = 0x18447c8 toastkey = { sk_flags = 0, sk_attno = 1, sk_strategy = 3, sk_subtype = 0, sk_collation = 100, sk_func = { fn_addr = 0x77c490 , fn_oid = 184, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x18282a8, fn_expr = 0x0 }, sk_argument = 34491 } toastscan = ttup = toasttupDesc = 0x7f8b4ca88c50 result = 0x18422d8 toast_pointer = ressize = 5735 residx = nextidx = 0 numchunks = 3 chunk = isnull = chunkdata = chunksize = num_indexes = 1 validIndex = 0 SnapshotToast = { satisfies = 0x112, xmin = 3626283536, xmax = 32764, xip = 0xf8ac628, xcnt = 5221870, subxip = 0x0, subxcnt = 0, suboverflowed = 0 '\000', takenDuringRecovery = 0 '\000', copied = 0 '\000', curcid = 14, speculativeToken = 0, active_count = 260753304, regd_count = 0, ph_node = { first_child = 0xf8ac680, next_sibling = 0xa400112, prev_or_parent = 0x0 }, whenTaken = 274, lsn = 0 } __func__ = "toast_fetch_datum" #3 0x004b9ab5 in heap_tuple_untoast_attr (attr=0x18226c8) at tuptoaster.c:179 No locals. #4 0x007f71ad in pg_detoast_datum_packed (datum=) at fmgr.c:2266 No locals. #5 0x007cfc12 in text_to_cstring (t=0x18226c8) at varlena.c:186 tunpacked = result = #6 0x007f5735 in FunctionCall1Coll (flinfo=flinfo@entry=0x18221c0,