Jaroslav, Looks good for me!
> I've changed the PortAllocator to allocate an array of unique random > ports instead of letting ServerSocket to take care of it. > > I ran the test 200x in a tight loop without a failure. > > I hope this will resolve this test's intermittent failures due to port > conflicts once and for all. > > Update: http://cr.openjdk.java.net/~jbachorik/8066708/webrev.03 > -Dmitry On 2015-01-09 15:17, Jaroslav Bachorik wrote: > Thank you all for the valuable input! > > On 9.1.2015 03:50, Stuart Marks wrote: >> Hi Jaroslav, >> >> I'm distant enough from this code that I don't think I'm in a position >> to say "no you can't check this in," and I'm mindful of the fact that >> this bug is a high priority and you want to get a fix in. But having >> said that I think there's a surprising amount of complexity here for >> what it does. >> >> Implementing a retry-on-BindException policy seems to be fairly >> sensible, since I assume it would be fairly invasive to modify the code >> in the JVMs being forked to use an ephemeral port and send that >> information back to the test. >> >> My conjecture is however that the open/close/reopen logic is the primary >> cause of the BindExceptions that occur. If you're going to retry on >> BindException, you might as well choose a random port number instead of >> doing the open/close to get a port number from the system. >> >> The range that Dmitry suggests is reasonable, though I note that the >> actual ephemeral port range used by the kernel will differ from OS to OS >> and even from system to system. I don't know if that's really >> significant though. If you end up choosing a port outside the ephemeral >> range for some system, does it really matter? >> >> If you do decide to have PortAllocator open and close a ServerSocket (in >> order to find a previously unused port) I'd suggest removing the logic >> that leaves the socket open until the first call to get(). That logic >> reduces the possibility that some other process will open the socket >> after the close but before the reopen. In my experience that's not >> what's causing the BindExceptions. It could still happen, though, but >> you're protected by the retry logic anyway. Leaving the socket open >> longer actually hurts, I think, because it increases the chance that the >> kernel won't have cleaned up the port by the time the test wants to >> reopen it. >> >> If you change PortAllocator to close the socket immediately, you can get >> rid of the need to call release() in a finally-block of the caller. You >> could also change the type of the functional interface to be >> >> int[] -> void >> >> since the PortAllocator doesn't hold onto any resources that need to be >> cleaned up. It just calls the execute() method and passes an array of n >> port numbers. >> >> It's probably necessary to have the socket close() call in a retry loop. >> The socket is always closed immediately from the user process point of >> view, so isClosed() will always return true immediately after the >> close() call returns. You can verify this easily by looking in the >> ServerSocket.java source code. I believe the state that prevents the >> port from being reused immediately is private to the kernel and cannot >> be observed from a user process, at least not without attempting to >> reopen the socket. >> >> Side note: one of the jcmd() overloads says that parameter 'c' (a >> Consumer) may be null. It doesn't look like this is handled. If you >> really want to support this, I'd assign () -> { } to it if it's null so >> that it can be called unconditionally. (Or just disallow null.) > > I've changed the PortAllocator to allocate an array of unique random > ports instead of letting ServerSocket to take care of it. > > I ran the test 200x in a tight loop without a failure. > > I hope this will resolve this test's intermittent failures due to port > conflicts once and for all. > > Update: http://cr.openjdk.java.net/~jbachorik/8066708/webrev.03 > > Thanks, > > -JB- > > >> >> s'marks >> >> >> On 1/6/15 2:00 PM, Dmitry Samersoff wrote: >>> Jaroslav, >>> >>> It might be better to just choose a random digit between 49152–65535 >>> and attempt to use it. >>> >>> -Dmitry >>> >>> >>> On 2014-12-18 17:29, Jaroslav Bachorik wrote: >>>> On 12/11/2014 03:43 PM, Dmitry Samersoff wrote: >>>>> Jaroslav, >>>>> >>>>> You can set SO_LINGER to zero, in this case socket will be closed >>>>> immediately without waiting in TIME_WAIT >>>>> >>>>> But there are no reliable way to predict whether you can take this >>>>> port >>>>> or not after you close it. >>>>> >>>>> So the only valid solution is to try to connect to a random port >>>>> and if >>>>> this attempt fails try another random port. Everything else will cause >>>>> more or less frequent intermittent failures. >>>> >>>> Thanks for all the suggestions! >>>> >>>> http://cr.openjdk.java.net/~jbachorik/8066708/webrev.02 >>>> >>>> I've enhanced the original patch with the retry logic using different >>>> random port if starting the JMX agent on the provided port fails with >>>> BindException. >>>> >>>> I'm keeping there the changes for properly closing the ports opened for >>>> the test purposes and also setting the SO_REUSEADDR - anyway, it does >>>> not make sense to reuse the ephemeral test ports. >>>> >>>> I've split the original "test_06" test case in order to keep it >>>> readable >>>> even with the new retry logic - and also to make each test case to test >>>> just one scenario. >>>> >>>> Cheers, >>>> >>>> -JB- >>>> >>>>> >>>>> -Dmitry >>>>> >>>>> >>>>> On 2014-12-11 17:06, Jaroslav Bachorik wrote: >>>>>> On 12/09/2014 01:25 PM, Jaroslav Bachorik wrote: >>>>>>> On 12/09/2014 01:39 AM, Stuart Marks wrote: >>>>>>>> On 12/8/14 12:35 PM, Jaroslav Bachorik wrote: >>>>>>>>> Please, review the following test change >>>>>>>>> >>>>>>>>> Issue : https://bugs.openjdk.java.net/browse/JDK-8066708 >>>>>>>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8066708/webrev.00 >>>>>>>>> >>>>>>>>> The test fails very intermittently when RMI registry is trying to >>>>>>>>> bind >>>>>>>>> to a port >>>>>>>>> previously used in the test (via ServerSocket). >>>>>>>>> >>>>>>>>> This seems to be caused by the sockets created via `new >>>>>>>>> ServerSocket(0)` and >>>>>>>>> being in reusable mode. The fix attempts to prevent this by >>>>>>>>> explicitly >>>>>>>>> forbidding the reusable mode. >>>>>>>> >>>>>>>> Hi Jaroslav, >>>>>>>> >>>>>>>> I happened to see this fly by, and there are (I think) some similar >>>>>>>> issues going on in the RMI tests. >>>>>>>> >>>>>>>> But first I'll note that I don't think setReuseAddress() will >>>>>>>> have the >>>>>>>> effect that you want. Typically it's set to true before binding a >>>>>>>> socket, so that a subsequent bind operation will succeed even if >>>>>>>> the >>>>>>>> address/port is already in use. ServerSockets created with new >>>>>>>> ServerSocket(0) are already bound, and I'm not sure what calling >>>>>>>> setReuseAddress(false) will do on such sockets. The spec says >>>>>>>> behavior >>>>>>>> is undefined, but my bet is that it does nothing. >>>>>>>> >>>>>>>> I guess it doesn't hurt to try this out to see if it makes a >>>>>>>> difference, >>>>>>>> but I don't have much confidence it will help. >>>>>>>> >>>>>>>> The potential similarity to the RMI tests is exemplified by >>>>>>>> JDK-8049202 >>>>>>>> (sorry, this bug report isn't open) but briefly this tests the RMI >>>>>>>> registry as follows: >>>>>>>> >>>>>>>> 1. Opens port 1099 using new ServerSocket(1099) [1099 is the >>>>>>>> default >>>>>>>> RMI registry port] in order to ensure that 1099 isn't in >>>>>>>> use by >>>>>>>> something else already; >>>>>>>> >>>>>>>> 2. If this succeeds, it immediately closes the ServerSocket. >>>>>>>> >>>>>>>> 3. Then it creates a new RMI registry on port 1099. >>>>>>>> >>>>>>>> In principle, this should succeed, yet it fails around 10% of the >>>>>>>> time >>>>>>>> on some systems. The error is "port already in use". My best >>>>>>>> theory is >>>>>>>> that even though the socket has just been closed by a user program, >>>>>>>> the >>>>>>>> kernel has to run the socket through some of the socket states >>>>>>>> such as >>>>>>>> FIN_WAIT_1, FIN_WAIT_2, or CLOSING before the socket is actually >>>>>>>> closed >>>>>>>> and is available for reuse. If a program -- even the same one -- >>>>>>>> attempts to open a socket on the same port before the socket has >>>>>>>> reached >>>>>>>> its final state, it will get an "already in use error". >>>>>>>> >>>>>>>> If this is true I don't believe that setting SO_REUSEADDR will >>>>>>>> work if >>>>>>>> the socket is in one of these final states. (I remember reading >>>>>>>> this >>>>>>>> somewhere but I'm not sure where at the moment. I can try to dig >>>>>>>> it up >>>>>>>> if there is interest.) >>>>>>>> >>>>>>>> I admit this is just a theory and I'm open to alternatives, and I'm >>>>>>>> also >>>>>>>> open to hearing about ways to deal with this problem. >>>>>>>> >>>>>>>> Could something similar be going on with this JMX test? >>>>>>> >>>>>>> Hm, this is exactly what happened with this test :( >>>>>>> >>>>>>> The problem is that the port is reported as available while it is >>>>>>> still >>>>>>> occupied and RMI registry attempts to start using that port. >>>>>>> >>>>>>> If setting SO_REUSEADDR does not work then the only solution would >>>>>>> be to >>>>>>> retry the test case when this exception occurs. >>>>>> >>>>>> Further investigation shows that the problem was rather the client >>>>>> connecting to a socket being shut down. >>>>>> >>>>>> It sounds like setting SO_REUSEADDR to false should prevent this >>>>>> failure. >>>>>> >>>>>> From the ServerSocket javadoc: >>>>>> "When a TCP connection is closed the connection may remain in a >>>>>> timeout >>>>>> state for a period of time after the connection is closed (typically >>>>>> known as the TIME_WAIT state or 2MSL wait state). For applications >>>>>> using >>>>>> a well known socket address or port it may not be possible to bind a >>>>>> socket to the required SocketAddress if there is a connection in the >>>>>> timeout state involving the socket address or port." >>>>>> >>>>>> It also turns out that the test does not close the server sockets >>>>>> properly so there might be several sockets being opened or timed out >>>>>> dangling around. >>>>>> >>>>>> I've updated the test so it is setting SO_REUSEADDR for all the new >>>>>> ServerSockets instances + introduced the mechanism to run the test >>>>>> code >>>>>> while properly cleaning up any allocated ports. >>>>>> >>>>>> http://cr.openjdk.java.net/~jbachorik/8066708/webrev.01/ >>>>>> >>>>>> -JB- >>>>>> >>>>>>> >>>>>>> -JB- >>>>>>> >>>>>>>> >>>>>>>> s'marks >>>>>>> >>>>>> >>>>> >>>>> >>>> >>> >>> > -- Dmitry Samersoff Oracle Java development team, Saint Petersburg, Russia * I would love to change the world, but they won't give me the sources.