[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Noah Misch
On Mon, Jul 10, 2017 at 10:46:09AM -0400, Tom Lane wrote:
> Magnus Hagander  writes:
> > On Jul 10, 2017 16:08, "Tom Lane"  wrote:
> >> Okay, so that leaves us with a decision to make: push it into beta2, or
> >> wait till after wrap?  I find it pretty scary to push a patch with
> >> portability implications so soon before wrap, but a quick look at the
> >> buildfarm suggests we'd get runs from 3 or 4 Windows members before the
> >> wrap, if I do it quickly.  If we wait, then it will hit the field in
> >> production releases with reasonable buildfarm testing but little more.
> >> That's also pretty scary.
> >> On balance I'm tempted to push it now for beta2, but it's a close call.
> >> Thoughts?
> 
> > Given the lack of windows testing on non packaged releases I think we
> > should definitely push this for beta2. That will give it a much better real
> > world testing on what is still a beta.
> > If it breaks its a lot better to do it in beta2 (and possibly quickly roll
> > a beta3) than in production.
> 
> Shall we go for broke and also remove the ASLR-disabling patch in beta2?
> I do not feel a need to back-patch that one, at least not yet.  But if
> we're going to do it any time soon, a beta seems like the time.

As I mentioned in my message eight hours ago, no.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Noah Misch
On Mon, Jul 10, 2017 at 10:08:53AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > I recommend pushing your patch so the August back-branch releases have it.
> > One can see by inspection that your patch has negligible effect on systems
> > healthy today.  I have a reasonable suspicion it will help some systems.  If
> > others remain broken after this, that fact will provide a useful clue.
> 
> Okay, so that leaves us with a decision to make: push it into beta2, or
> wait till after wrap?  I find it pretty scary to push a patch with
> portability implications so soon before wrap, but a quick look at the
> buildfarm suggests we'd get runs from 3 or 4 Windows members before the
> wrap, if I do it quickly.  If we wait, then it will hit the field in
> production releases with reasonable buildfarm testing but little more.
> That's also pretty scary.
> 
> On balance I'm tempted to push it now for beta2, but it's a close call.
> Thoughts?

Pushing now for beta2 sounds good.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Magnus Hagander
On Jul 10, 2017 16:08, "Tom Lane"  wrote:

Noah Misch  writes:
> On Mon, Jun 05, 2017 at 09:56:33AM -0400, Tom Lane wrote:
>> Yeah, being able to reproduce the problem reliably enough to say whether
>> it's fixed or not is definitely the sticking point here.  I have some
>> ideas about that: ...

> I tried this procedure without finding a single failure.

Thanks for testing!  But apparently we still lack some critical part
of the recipe for getting failures in the wild.

> I watched the ensuing memory maps, which led me to these interesting
facts:

>   An important result of the ASLR design in Windows Vista is that some
address
>   space layout parameters, such as PEB, stack, and heap locations, are
>   selected once per program execution. Other parameters, such as the
location
>   of the program code, data segment, BSS segment, and libraries, change
only
>   between reboots.
>   ...
>   This offset is selected once per reboot, although we’ve uncovered at
least
>   one other way to cause this offset to be reset without a reboot (see
>   Appendix II).
>   -- http://www.symantec.com/avcenter/reference/Address_
Space_Layout_Randomization.pdf

That is really interesting, though I'm not quite sure what to make of it.
It suggests though that you might need certain types of antivirus products
to be active, to create more variability in the initial process address
space layout than would happen from Windows ASLR alone.

> I recommend pushing your patch so the August back-branch releases have it.
> One can see by inspection that your patch has negligible effect on systems
> healthy today.  I have a reasonable suspicion it will help some systems.
If
> others remain broken after this, that fact will provide a useful clue.

Okay, so that leaves us with a decision to make: push it into beta2, or
wait till after wrap?  I find it pretty scary to push a patch with
portability implications so soon before wrap, but a quick look at the
buildfarm suggests we'd get runs from 3 or 4 Windows members before the
wrap, if I do it quickly.  If we wait, then it will hit the field in
production releases with reasonable buildfarm testing but little more.
That's also pretty scary.

On balance I'm tempted to push it now for beta2, but it's a close call.
Thoughts?


Given the lack of windows testing on non packaged releases I think we
should definitely push this for beta2. That will give it a much better real
world testing on what is still a beta.

If it breaks its a lot better to do it in beta2 (and possibly quickly roll
a beta3) than in production.

/Magnus


[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-10 Thread Michael Paquier
On Mon, Jul 10, 2017 at 3:36 PM, Noah Misch  wrote:
> I recommend pushing your patch so the August back-branch releases have it.
> One can see by inspection that your patch has negligible effect on systems
> healthy today.  I have a reasonable suspicion it will help some systems.  If
> others remain broken after this, that fact will provide a useful clue.

+1. Thanks for taking the time to test and look at the patch.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)

2017-07-09 Thread Noah Misch
On Mon, Jun 05, 2017 at 09:56:33AM -0400, Tom Lane wrote:
> Amit Kapila  writes:
> > Sure.  I think it is slightly tricky because specs don't say clearly
> > how ASLR can impact the behavior of any API and in my last attempt I
> > could not reproduce the issue.
> 
> > I can try to do basic verification with the patch you have proposed,
> > but I fear, to do the actual tests as suggested by you is difficult as
> > it is not reproducible on my machine, but I can still try.
> 
> Yeah, being able to reproduce the problem reliably enough to say whether
> it's fixed or not is definitely the sticking point here.  I have some
> ideas about that:
> 
> 1. Be sure to use Win32 not Win64 --- the odds of a failure in the larger
> address space are so small you'd never prove anything.  (And of course
> it has to be a version that has ASLR enabled.)
> 
> 2. Revert 7f3e17b48 so that you have an ASLR-enabled build.
> 
> 3. Crank shared_buffers to the maximum the machine will allow, reducing
> the amount of free address space and improving the odds of a collision.
> 
> 4. Spawn lots of sessions --- pgbench with -C option might be a useful
> testing tool.
> 
> With luck, that will get failures often enough that you can be pretty
> sure whether a patch improves the situation or not.

I tried this procedure without finding a single failure.  Attributes:

- 32-bit build of commit fad7873 w/ 7f3e17b48 reverted
- confirmed ASLR-enabled with "dumpbin /headers postgres.exe"
- OS = 64-bit Windows Server 2016
- compiler = Visual Studio 2015 Express
- no config.pl, so not linked with any optional library
- tried shared_buffers=1100M and shared_buffers=24M
- echo 'select 1;' >pgbench-trivial; pgbench -n -f pgbench-trivial -C -c 30 -j5 
-T900
- tried starting as a service at boot time, in addition to manual start

=== postgresql.conf ===
log_connections = on
log_line_prefix = '%p %m '
autovacuum = off
listen_addresses = '127.0.0.1'
log_min_messages = debug1
max_connections = 40
shared_buffers = 24MB
deadlock_timeout = '20ms'
wal_buffers = '16MB'
fsync = off


I watched the ensuing memory maps, which led me to these interesting facts:

  An important result of the ASLR design in Windows Vista is that some address
  space layout parameters, such as PEB, stack, and heap locations, are
  selected once per program execution. Other parameters, such as the location
  of the program code, data segment, BSS segment, and libraries, change only
  between reboots.
  ...
  This offset is selected once per reboot, although we’ve uncovered at least
  one other way to cause this offset to be reset without a reboot (see
  Appendix II).
  -- 
http://www.symantec.com/avcenter/reference/Address_Space_Layout_Randomization.pdf
 

So, reattach failures might be reproducible on some reboots and not others.
While I had a few reboots during the test work, I did not exercise that
dimension systematically.  This information also implies we should not
re-enable ASLR, even if your patch helps, due to addresses that change less
than once per process creation but occasionally more than once per boot.


I did try the combination of your patch and the following to simulate a 95%
failure rate:

--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -410,2 +410,4 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)
 MEM_RESERVE, 
PAGE_READWRITE);
+   if (random() > MAX_RANDOM_VALUE / 20)
+   address = NULL;
if (address == NULL)

This confirmed retries worked.  During a 900s test run, one connection attempt
failed permanently by exhausting its 100 retries.  The run achieved 11
connections per second.  That is an order of magnitude slower than a run
without simulated failures, but most applications would tolerate it.

I recommend pushing your patch so the August back-branch releases have it.
One can see by inspection that your patch has negligible effect on systems
healthy today.  I have a reasonable suspicion it will help some systems.  If
others remain broken after this, that fact will provide a useful clue.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers