Re: [HACKERS] windows shared memory error

2009-05-05 Thread Magnus Hagander
Tom Lane wrote: Magnus Hagander mag...@hagander.net writes: Passes my tests, but I can't really reproduce the requirement to retry, so I haven't been able to test that part :( The patch looks sane to me. If you want to test, perhaps reducing the sleep to 1 msec or so would reproduce the

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Magnus Hagander
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Now presumably we sleep for 1 sec between the CloseHandle() call and the CreateFileMapping() call in that code for a reason. I'm not sure. Magnus never did answer my question about why the sleep and retry was put in at all; it

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Magnus Hagander
Tom Lane wrote: Magnus Hagander mag...@hagander.net writes: Tom Lane wrote: It says here: http://msdn.microsoft.com/en-us/library/ms885627.aspx FWIW, this is the Windows CE documentation. The one for win32 is at: http://msdn.microsoft.com/en-us/library/ms679360(VS.85).aspx Sorry, that

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Andrew Dunstan
Magnus Hagander wrote: Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Now presumably we sleep for 1 sec between the CloseHandle() call and the CreateFileMapping() call in that code for a reason. I'm not sure. Magnus never did answer my question about why the

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes: Magnus Hagander wrote: The actual 1 second value was completely random - it fixed all the issues on my test VM at the time. I don't recall exactly the details, but I do recall having to run a lot of tests before I managed to provoke an error, and

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Andrew Dunstan
Tom Lane wrote: I still think there's absolutely no evidence suggesting that a variable backoff is necessary. Given how little this code is going to be exercised in the real world, how long will it take till we find out if you get it wrong? Use a simple retry loop and be done with it.

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Magnus Hagander
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Magnus Hagander wrote: The actual 1 second value was completely random - it fixed all the issues on my test VM at the time. I don't recall exactly the details, but I do recall having to run a lot of tests before I managed to provoke

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Alvaro Herrera
Magnus Hagander wrote: Andrew, you want to write up a patch or do you want me to do it? This is going to be backpatched, I assume? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes: This is going to be backpatched, I assume? Yeah, back to 8.2 I suppose. regards, tom lane

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes: Tom Lane wrote: I still think there's absolutely no evidence suggesting that a variable backoff is necessary. Given how little this code is going to be exercised in the real world, how long will it take till we find out if you get it wrong? Use a

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Andrew Dunstan
Magnus Hagander wrote: Andrew, you want to write up a patch or do you want me to do it? Go for it. cheers andrew

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Magnus Hagander
Andrew Dunstan wrote: Magnus Hagander wrote: Andrew, you want to write up a patch or do you want me to do it? Go for it. How does this look? Passes my tests, but I can't really reproduce the requirement to retry, so I haven't been able to test that part :( //Magnus ***

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Alvaro Herrera
Magnus Hagander wrote: How does this look? Passes my tests, but I can't really reproduce the requirement to retry, so I haven't been able to test that part :( I'm disappointed :-( I thought this thread (without reading it too deeply) was about fixing the problem that backends sometimes

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes: I'm disappointed :-( I thought this thread (without reading it too deeply) was about fixing the problem that backends sometimes fail to connect to shmem, on a system that's been running for a while. Nobody knows yet what's wrong there or how

Re: [HACKERS] windows shared memory error

2009-05-04 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes: Passes my tests, but I can't really reproduce the requirement to retry, so I haven't been able to test that part :( The patch looks sane to me. If you want to test, perhaps reducing the sleep to 1 msec or so would reproduce the need to go around the

Re: [HACKERS] windows shared memory error

2009-05-03 Thread Magnus Hagander
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: I am seeing Postgres 8.3.7 running as a service on Windows Server 2003 repeatedly fail to restart after a backend crash because of the following code in port/win32_shmem.c: On further review, I see an entirely different

Re: [HACKERS] windows shared memory error

2009-05-03 Thread Magnus Hagander
Andrew Dunstan wrote: Tom Lane wrote: Now this would only explain problems if there were some code path through the postmaster that could leave the errno set to ERROR_ALREADY_EXISTS (a/k/a EEXIST) when this code is reached. I'm not sure there is one, and I have even less of a theory as

Re: [HACKERS] windows shared memory error

2009-05-03 Thread Andrew Dunstan
Magnus Hagander wrote: Andrew, just to confirm: you've found a case where this happens *repeatably*? That's what we've failed to do before - it's happened now and then, but never during testing... Well, it happened several times to my client within a matter of hours. I didn't see any

Re: [HACKERS] windows shared memory error

2009-05-03 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes: Tom Lane wrote: It says here: http://msdn.microsoft.com/en-us/library/ms885627.aspx FWIW, this is the Windows CE documentation. The one for win32 is at: http://msdn.microsoft.com/en-us/library/ms679360(VS.85).aspx Sorry, that was the one that came

Re: [HACKERS] windows shared memory error

2009-05-03 Thread Andrew Dunstan
Tom Lane wrote: The quick try would be to stick a SetLastError(0) in there, just to be sure... Could be worth a try? I kinda think we should do that whether or not it can be proven to have anything to do with Andrew's report. It's just like errno = 0 for Unix --- sometimes you have to

Re: [HACKERS] windows shared memory error

2009-05-03 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes: Now presumably we sleep for 1 sec between the CloseHandle() call and the CreateFileMapping() call in that code for a reason. I'm not sure. Magnus never did answer my question about why the sleep and retry was put in at all; it seems not unlikely from

Re: [HACKERS] windows shared memory error

2009-05-02 Thread Andrew Dunstan
Tom Lane wrote: Now this would only explain problems if there were some code path through the postmaster that could leave the errno set to ERROR_ALREADY_EXISTS (a/k/a EEXIST) when this code is reached. I'm not sure there is one, and I have even less of a theory as to why system load might

Re: [HACKERS] windows shared memory error

2009-05-02 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes: Maybe we need to look at all the places we call GetLastError(). There are quite a few of them. It would only be an issue with syscalls that have badly designed APIs like this one. Most of the time you know that the function has failed and is supposed

[HACKERS] windows shared memory error

2009-05-01 Thread Andrew Dunstan
I am seeing Postgres 8.3.7 running as a service on Windows Server 2003 repeatedly fail to restart after a backend crash because of the following code in port/win32_shmem.c: /* * If the segment already existed, CreateFileMapping() will return a * handle to the existing one. */

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Dave Page
On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan and...@dunslane.net wrote: It strikes me that we really need to try reconnecting to the shared memory here several times, and maybe the backoff needs to increase each time. On a loaded server this causes postgres to fail to restart fairly reliably.

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Greg Stark
On Fri, May 1, 2009 at 8:42 AM, Dave Page dp...@pgadmin.org wrote: On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan and...@dunslane.net wrote: It strikes me that we really need to try reconnecting to the shared memory here several times, and maybe the backoff needs to increase each time. On a

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Dave Page
On Fri, May 1, 2009 at 11:05 AM, Greg Stark st...@enterprisedb.com wrote: Do we have any idea why it may take a short while before it gets dropped from the global namespace? Is there some daemon running which only wakes up periodically? Or any specific reason it takes so long? That might give

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Heikki Linnakangas
Dave Page wrote: On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan and...@dunslane.net wrote: It strikes me that we really need to try reconnecting to the shared memory here several times, and maybe the backoff needs to increase each time. On a loaded server this causes postgres to fail to restart

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Dave Page
On Fri, May 1, 2009 at 4:10 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Dave Page wrote: On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan and...@dunslane.net wrote: It strikes me that we really need to try reconnecting to the shared memory here several times, and maybe

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Andrew Dunstan
Heikki Linnakangas wrote: Dave Page wrote: On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan and...@dunslane.net wrote: It strikes me that we really need to try reconnecting to the shared memory here several times, and maybe the backoff needs to increase each time. On a loaded server this

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Heikki Linnakangas
Dave Page wrote: On Fri, May 1, 2009 at 4:10 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Dave Page wrote: On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan and...@dunslane.net wrote: It strikes me that we really need to try reconnecting to the shared memory here several

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes: It strikes me that we really need to try reconnecting to the shared memory here several times, and maybe the backoff needs to increase each time. Adding a backoff would make the code significantly more complex, with no gain that I can see. Just loop

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: It strikes me that we really need to try reconnecting to the shared memory here several times, and maybe the backoff needs to increase each time. Adding a backoff would make the code significantly more complex, with no gain

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes: We've seen similar things with other Windows file operations, IIRC. What bothers me is that the problem might be precisely because the 1 second sleep between the CloseHandle() call and the CreateFileMapping() call might not be enough due to system

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: We've seen similar things with other Windows file operations, IIRC. What bothers me is that the problem might be precisely because the 1 second sleep between the CloseHandle() call and the CreateFileMapping() call might not be

Re: [HACKERS] windows shared memory error

2009-05-01 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes: I am seeing Postgres 8.3.7 running as a service on Windows Server 2003 repeatedly fail to restart after a backend crash because of the following code in port/win32_shmem.c: On further review, I see an entirely different explanation for possible
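
Note: two ideas come up repeatedly in this thread: clearing the last-error value before the call (the SetLastError(0) suggestion), and using a simple fixed-interval retry loop rather than a variable backoff. The sketch below illustrates that control flow in portable C. It is not the patch that was posted. All `sim_*` names, the retry count and the no-op sleep are hypothetical stand-ins for CreateFileMapping(), GetLastError()/SetLastError(), CloseHandle() and a real sleep, so that the logic can be compiled and run on any platform.

```c
#include <stdbool.h>

/* Numeric value of ERROR_ALREADY_EXISTS in the Win32 headers. */
#define SIM_ERROR_ALREADY_EXISTS 183

/* Hypothetical stand-in for the thread's last-error value. */
static int sim_last_error = 0;

/* Hypothetical: number of upcoming create attempts that will still find
 * the old segment lingering in the namespace. */
static int sim_stale_calls = 0;

/* Hypothetical stand-in for CreateFileMapping(). It always returns a
 * nonzero "handle". It sets the last-error value only when the segment
 * already exists and otherwise leaves that value untouched, which is the
 * behaviour the thread worries about. */
static int sim_create_mapping(void)
{
    if (sim_stale_calls > 0)
    {
        sim_stale_calls--;
        sim_last_error = SIM_ERROR_ALREADY_EXISTS;
    }
    return 1;
}

/* Hypothetical stand-ins for CloseHandle() and a sleep call (no-ops). */
static void sim_close(int handle) { (void) handle; }
static void sim_sleep(int msec) { (void) msec; }

/* Try to create a fresh segment, retrying at a fixed interval.
 * Returns true on success and stores the number of attempts used. */
static bool create_segment_with_retry(int max_tries, int sleep_msec,
                                      int *tries_used)
{
    for (int i = 0; i < max_tries; i++)
    {
        /* Clear any stale error first, like errno = 0 on Unix. */
        sim_last_error = 0;

        int handle = sim_create_mapping();
        if (handle == 0)
            return false;           /* hard failure, do not retry */

        if (sim_last_error != SIM_ERROR_ALREADY_EXISTS)
        {
            *tries_used = i + 1;    /* got a brand-new segment */
            return true;
        }

        /* We were handed the old segment: release it, wait, retry. */
        sim_close(handle);
        sim_sleep(sleep_msec);
    }
    return false;                   /* old segment never went away */
}
```

With the reset in place, a stale ERROR_ALREADY_EXISTS left over from an earlier call no longer causes a spurious retry, and a lingering segment is tolerated for up to max_tries attempts before giving up.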