On 25/01/2013 3:06 p.m., Alex Rousskov wrote:
Hello,

     The attached patch fixes several ConnOpener problems by relying on
AsyncJob protections while maintaining a tighter grip on various I/O and
sleep states. It is in PREVIEW state because I would like to do more
testing, but it did pass basic tests, and I am not currently aware of
serious problems with the patch code.

I started with Rainer Weikusat's timeout polishing patch posted
yesterday, but all bugs are mine.


Here are some of the addressed problems:

* Connection descriptor was not closed when attempting to reconnect
after failures. We now properly close on failures, sleep with descriptor
closed, and then reopen.

* Timeout handler was not cleaned up properly in some cases, causing
memory leaks (for the handler Pointer) and possibly timeouts that were
fired (for then-active handler), after the connection was passed to the
initiator.

* Comm close handler was not cleaned up properly.

* Connection timeout was enforced for each connection attempt instead of
all attempts together.

and possibly other problems. The full extent of the side-effects of
mishandled race conditions and state conflicts is unknown.


TODO: Needs more testing, especially around corner cases.
       Does somebody need more specific callback cancellation reasons?
       Consider calling comm_close instead of direct write_data cleanup.
       Make connect_timeout documentation in squid.conf less ambiguous.
       Move prevalent conn_ debugging to the status() method?
       Polish Comm timeout handling to always reset .timeout on callback?
       Consider revising eventDelete() to delete between-I/O sleep
       timeout.

Feedback welcomed.

NP: This is way beyond the simple fixes Rainer was working on. The changes here rely on code behaviour which will limit the patch to trunk or 3.3. I was already borderline on the size of Rainer's earlier patches, and this goes beyond the amount of change I'm comfortable porting to the stable branch with a beta cycle coming to an end.

Auditing anyway:

* You are still making comments about what "Comm" should do (XXX: Comm should!). ConnOpener *is* "Comm" at this point in the transaction. If "Comm" needs to do anything, then it is within *this* object's responsibility to see that it happens. If there is a *simple* helper function elsewhere in comm_*() or Comm:: or fd_*() which can help, so be it, but this object *is* Comm and needs to perform the "Comm should do X" operations related to the state of opening an FD.

* It was probably happening beforehand, but it is much clearer now that the sleep()/DelayedRetry mechanism leaks a Pointer(), just as the InProgress mechanism does.
  +++ IMHO: leave it leaking; the use-case is a rarity, and we can update the event API separately, and faster than we can fix all the callers to work around it.

* Looking at your comment in there about comm_close(), I see why we are at such loggerheads about it.
  +++ comm_close() does *not* replace the write_data cleanup.
- The write_data cleanup exists explicitly and solely to remove the leaked Pointer()s, *nothing else*. The extra two lines of code ensure that hack does not corrupt anything. Until those fd_table pointers stop being dynamic, this code is non-optional.
- Even if you called comm_close(), you would need to perform the write_data cleanup before calling it to prevent the leak.

* Apart from the timeout unset, the relevant parts of the comm_close() sequence are all in comm_close_complete(). Perhaps you should schedule one of those calls instead of clearing the handlers and calling fd_table() synchronously.
  ++ But notice how that (and comm_close()) adds a second async delay before the FD becomes available for re-use. Avoiding that delay is a performance optimization: you will need to set up a synchronous comm_close_complete() to achieve the same speed.
  ++ The other operations in the comm_close() function itself are all NOP here. No other component has been given the FD or conn_ to set any state themselves.

* The naming of "ConnOpener::open()" is ambiguous, since it is not the active open() operation in this Job. It is the getSocket() operation and should be named as such.

Amos
