Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2006-06-14 Thread Bruce Momjian
Added to TODO: * Consider increasing internal areas when shared buffers is increased http://archives.postgresql.org/pgsql-hackers/2005-10/msg01419.php --- Alvaro Herrera wrote: Jim C. Nasby

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-11-03 Thread Alvaro Herrera
Jim C. Nasby wrote: BTW, is MEMORY_CONTEXT_CHECKING that expensive? It seems like it shouldn't be, but I'm only guessing at what exactly it does... Yes, because not only it checks marker bytes at the end of palloc chunks and similar stuff, but it also overwrites whole contexts with 0x7f when

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-11-03 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes: May I propose to make Assert() yield only WARNINGs, That is a horrid idea --- for one thing, it means that Asserts inside the elog machinery itself would be instant infinite recursion, and even elsewhere you'd have to think a bit about whether it's ok to

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-11-03 Thread Jim C. Nasby
On Thu, Nov 03, 2005 at 11:11:42AM -0500, Tom Lane wrote: Perhaps rather than an all-or-nothing debug_assertions GUC variable, what we want is something that turns on or off expensive assertion checks at runtime. This could include MEMORY_CONTEXT_CHECKING and anything else where the actual

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-11-03 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes: Seriously, I am wondering about the performance hit of always checking debug_assertions. http://archives.postgresql.org/pgsql-hackers/2005-08/msg00389.php indicates that even with debug_assertions=false, --enable-cassert has a big performance impact.

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-11-02 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes: Greg Stark [EMAIL PROTECTED] writes: I happen to think that except for the rare assertion that has major performance impact all the assertions should be on in production builds. The goal of assertions is to catch corruption quickly and that's something

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-11-02 Thread Jim C. Nasby
On Wed, Nov 02, 2005 at 07:03:57AM -0500, Greg Stark wrote: I would bet that ninety percent of the Asserts in the existing code are on conditions that could represent, at worst, corruption of backend-local or even transaction-local data structures. Taking down the entire database cluster

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-11-02 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes: BTW, that's a reversal from what I was originally arguing for, which was due to the performance penalty associated with --enable-cassert. My client is now running with Tom's suggestion of commenting out CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING and

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-11-01 Thread Alvaro Herrera
Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: AFAIK they're not using subtransactions at all, but I'll check. Well, yeah, they are ... else you'd never have seen this failure. Maybe it's in plpgsql EXCEPTION clauses. -- Alvaro Herrera Valdivia, Chile ICBM: S 39ยบ 49' 17.7,

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-11-01 Thread Jim C. Nasby
On Tue, Nov 01, 2005 at 11:23:55AM -0300, Alvaro Herrera wrote: Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: AFAIK they're not using subtransactions at all, but I'll check. Well, yeah, they are ... else you'd never have seen this failure. Maybe it's in plpgsql EXCEPTION

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-11-01 Thread Jim C. Nasby
On Tue, Nov 01, 2005 at 11:23:55AM -0300, Alvaro Herrera wrote: Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: AFAIK they're not using subtransactions at all, but I'll check. Well, yeah, they are ... else you'd never have seen this failure. Maybe it's in plpgsql EXCEPTION

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-11-01 Thread Tom Lane
I wrote: Even though the bug seems completely clog.c's fault, s/clog.c/slru.c/ of course :-( regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-11-01 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes: Doesn't clog use the same code? Yeah, but all three of your examples were referencing pg_subtrans, as proven by the stack traces and the contents of the shared control block. Even though the bug seems completely clog.c's fault, this is not a coincidence:

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-11-01 Thread Greg Stark
Bruce Momjian pgman@candle.pha.pa.us writes: Jim C. Nasby wrote: My assumption is that the asserts that are currently in place fall into one of two categories: either they check for something that if false could result in data corruption in the heap, or they check for something that

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-11-01 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes: I happen to think that except for the rare assertion that has major performance impact all the assertions should be on in production builds. The goal of assertions is to catch corruption quickly and that's something that's just as important in production as

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-10-31 Thread Jim C. Nasby
On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote: I'd like Jim to test this theory by seeing if it helps to reverse the order of the if-test elements at lines 294/295, ie make it look like if (shared-page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS ||

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-10-31 Thread Jim C. Nasby
Sorry, two more things... Will increasing shared_buffers make this less likely to occur? Or is this just something that's likely to happen when there are things like seqscans that are putting buffers near the front of the LRU? (The 8.0.3 buffer manager does something like that, right?) Is this

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-10-31 Thread Bruce Momjian
Jim C. Nasby wrote: On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote: I'd like Jim to test this theory by seeing if it helps to reverse the order of the if-test elements at lines 294/295, ie make it look like if (shared-page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS ||

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-10-31 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes: On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote: This won't do as a permanent patch, because it isn't guaranteed to fix the problem on machines that don't strongly order writes, but it should work on Opterons, at least well enough to confirm the

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-10-31 Thread Jim C. Nasby
On Mon, Oct 31, 2005 at 01:05:06PM -0500, Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote: This won't do as a permanent patch, because it isn't guaranteed to fix the problem on machines that don't strongly order writes, but it

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-10-31 Thread Bruce Momjian
Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: On Sun, Oct 30, 2005 at 06:17:53PM -0500, Tom Lane wrote: This won't do as a permanent patch, because it isn't guaranteed to fix the problem on machines that don't strongly order writes, but it should work on Opterons, at least well

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-10-31 Thread Jim C. Nasby
On Mon, Oct 31, 2005 at 01:01:14PM -0500, Bruce Momjian wrote: This incident has made me wonder if it's worth creating two classes of assertions. The (hopefully more common) set of assertions would be for things that shouldn't happen, but if go un-caught won't result in heap corruption. A

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-10-31 Thread Bruce Momjian
Jim C. Nasby wrote: On Mon, Oct 31, 2005 at 01:01:14PM -0500, Bruce Momjian wrote: This incident has made me wonder if it's worth creating two classes of assertions. The (hopefully more common) set of assertions would be for things that shouldn't happen, but if go un-caught won't result

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-10-31 Thread Jim C. Nasby
On Mon, Oct 31, 2005 at 01:34:17PM -0500, Bruce Momjian wrote: There is no way if the system has some incorrect value whether that would later corrupt the data or not. Anything the system does that it shouldn't do is a potential corruption problem. But is it safe to say that there are areas

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-10-31 Thread Gregory Maxwell
On 10/31/05, Jim C. Nasby [EMAIL PROTECTED] wrote: On Mon, Oct 31, 2005 at 01:34:17PM -0500, Bruce Momjian wrote: There is no way if the system has some incorrect value whether that would later corrupt the data or not. Anything the system does that it shouldn't do is a potential corruption

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-10-31 Thread Jim C. Nasby
Now that I've got a little better idea of what this code does, I've noticed something interesting... this issue is happening on an 8-way machine, and NUM_SLRU_BUFFERS is currently defined at 8. Doesn't this greatly increase the odds of buffer conflicts? Bug aside, would it be better to set

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-10-31 Thread Alvaro Herrera
Jim C. Nasby wrote: Now that I've got a little better idea of what this code does, I've noticed something interesting... this issue is happening on an 8-way machine, and NUM_SLRU_BUFFERS is currently defined at 8. Doesn't this greatly increase the odds of buffer conflicts? Bug aside, would it

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-10-31 Thread Jim C. Nasby
On Mon, Oct 31, 2005 at 09:02:59PM -0300, Alvaro Herrera wrote: Jim C. Nasby wrote: Now that I've got a little better idea of what this code does, I've noticed something interesting... this issue is happening on an 8-way machine, and NUM_SLRU_BUFFERS is currently defined at 8. Doesn't this

Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-10-31 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes: AFAIK they're not using subtransactions at all, but I'll check. Well, yeah, they are ... else you'd never have seen this failure. regards, tom lane ---(end of broadcast)--- TIP 9: In

slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags 0x01),)

2005-10-30 Thread Tom Lane
OK, I think I see it. The problem is that the code in slru.c is careful about not modifying state when it doesn't hold the proper lock, but not so careful about not *inspecting* state without the proper lock. In particular consider these lines in SimpleLruReadPage (line numbers are as in CVS