Re: client_side and comm_close
Alex Rousskov wrote:
> On Mon, 2008-04-21 at 15:48 +1200, Amos Jeffries wrote:
>>> comm_close(fd) API:
>>>
>>> 1) No I/O callbacks will be dialed after comm_close is called (including the time when comm_close is running).
>>
>> Sounds good.
>>
>>> 2) All close callbacks registered before comm_close was called will be called asynchronously, sometime after comm_close exits.
>>
>> Sounds good.
>>
>>> 3) The code can ask Comm whether a FD associated with a close callback has been closed. The answer will be "yes" after comm_close is called and "no" before that. This interface needs to be added. Direct access to fd_table[fd].flags will not work because the same FD could have been assigned to another connection already. The code will have to use its close callback to determine FD status.
>>
>> Sounds good, BUT direct access to fd_table pieces may need to be blocked entirely (private:) so code is forced to go through the Comm API properly.
>
> Yes, except if we want to avoid modifying old code that can still access those flags directly because it gets immediate close callbacks (more on that below).
>
>> (2) states that the higher-level close callbacks may be run at any time, i.e. after the callback (3) refers to is run. This leaves a window open for disaster, unless the closing callbacks are made immediate, and back we go to recursion...
>
> Those are the same close callbacks! There are no low-level and high-level close callbacks here. The external code can use its stored callback pointer to get FD status even after that close callback has been called. There is no problem with that. The callback will not look at fd_table in that case; it will just say "yes, the fd is closed as far as you should be concerned".
>
> And, per recent suggestions, old code will get immediate close callbacks so it does not need to be modified to use the close callback pointer to ask about FD status.

Alright. I got to this branch of the thread before the other, which makes things clearer. Same for the below.

>>> 4) Registering any callback after comm_close has been called is wrong. Comm will assert if it can detect such late registrations. Not all late registrations will be detected because the same FD can already be assigned to a different (new) connection[*].
>>
>> That non-detection seems to me to be a worry. The same problem as in (3) occurs here. (4) can guarantee that the closing callbacks don't play nasty re-registrations, but only if they are called immediately instead of scheduled.
>
> Sorry, I do not understand. The late registration of a close callback can come from anywhere. The old code relies on the fd only. There is no way for comm to distinguish whether the caller is using a FD from a closed connection or a new one. This problem exists in the old code as well, so there is no new danger here! It can only be solved when we have comm handlers of one sort or another.
>
>>> The above comm_close API is easy to implement without massive code changes. Do you think it is the right API? If not, what should be changed?
>>
>> Apart from the worry with immediate vs delayed closing callbacks. To reduce that worry somewhat, I think the callbacks which actually use ERR_COMM_CLOSING for anything other than immediate abort will need to be replaced with two: a normal callback that checks the FD is open, and a simple closing callback.
>
> I am confused by the "normal callback that checks the FD is open" part. What is that for? Are you talking about some internal comm calls to close the FD? I believe that handler code that currently does not ignore ERR_COMM_CLOSING notifications will need to be moved into the close handler code (because only the close handler will be called if we eliminate ERR_COMM_CLOSING).

Nevermind. I was saying it wrong, but meaning what you have re-stated.

Amos
-- 
Please use Squid 2.6.STABLE19 or 3.0.STABLE4
Re: client_side and comm_close
Alex Rousskov wrote:
> On Mon, 2008-04-21 at 16:02 +1200, Amos Jeffries wrote:
>>> On Sun, 2008-04-20 at 22:01 +0300, Tsantilas Christos wrote:
>>>> Maybe it can be easier:
>>>
>>> The ease of implementation is a separate question. We still have to agree on what we are implementing, and the items below attempt to define that. Ambiguity and lack of clear APIs is quite dangerous here (as the related bugs illustrate!).
>>>
>>>> Just keep comm_close as is, and in AsyncCallsQueue simply cancel (do not execute) asyncCalls for which fd_table[fd].flags.closing is set.
>>>
>>> Implementing #1 below can indeed be as simple as you sketch above. We need to make sure that the final call that closes the FD and makes the fd value available to other connections is placed last (that may already be true and is easy to fix if not).
>>
>> Not true. IMO, the proposed API's _very_ first two lines of code in comm_close are to register a special Comm callback to perform the fclose() call, and then to immediately set the fd_table "closed" flag for the rest of the comm_close process.
>
> Agreed on the flag, disagreed on the call. The special/internal Comm call (to self) should be scheduled last (after all close callbacks) and not first, because the close handlers might need access to some FD-related info. That info should be preserved until all close handlers have been called.
>
>> With that condition at the start we can guarantee that any registrations made during the close sequence are either non-FD-relevant or caught.
>
> Yes, the flag is sufficient for that. The internal close-for-good call can still be last.

I was thinking the close-for-good call would get caught as an fd operation on a closed fd by the Async stuff if set after the flag.

>> The special Comm callback is only special in that it is not required to check "flags open" before fclose()'ing the system-level FD, which will allow new connections to open on the FD.
>
> It is special because it is calling an internal comm method, not some external close handler. Its profile and parameters are different.

I would not call it a callback because of that, but the label is not important. It is not a close callback in terms of comm_add_close_handler. Yet it seems to me it needs to be run as an async call after the other async close handlers are run, due to the non-determinism of the async timing.

>> Between the initial comm_close() call and the special Comm callback, we don't need to care if callbacks write their shutdown statements to the fd (it's still technically open), but the closed flag prevents comm accepting any new delayed event registrations or reads.
>
> Exactly. Our only problem is with code that calls comm with a stale fd, after that fd has been really closed and a new connection was opened with the same fd. That's not a new problem, and we will solve it in v3.2 using comm handlers.
>
> I hope the above puts us on the same page about the implementation sketch for the comm_close API, but please yell if something still seems broken.

Okay. It looks the way I would implement it from scratch.

Amos
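[Editor's sketch] The sequence the participants converge on (set the closing flag immediately, schedule the registered close handlers, schedule the internal close-for-good call last so handlers can still read FD-related state) can be modeled with a toy FIFO queue. All names below (FdState, comm_close_model, drainQueue) are illustrative, not the real comm internals:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Toy model of the agreed comm_close sequence:
// 1. set the closing flag immediately, so no new work is accepted;
// 2. schedule every registered close handler asynchronously;
// 3. schedule the internal close-for-good call last, so handlers can
//    still access FD-related info before the descriptor is released.
struct FdState {
    bool closing = false;
    bool reallyClosed = false;
    std::vector<std::function<void()>> closeHandlers;
};

std::queue<std::function<void()>> asyncQueue;  // FIFO, dialed later

void comm_close_model(FdState &fd) {
    fd.closing = true;                  // step 1: flag set synchronously
    for (auto &h : fd.closeHandlers)    // step 2: close handlers first
        asyncQueue.push(h);
    asyncQueue.push([&fd] {             // step 3: close-for-good is last
        assert(fd.closing);             // the internal call asserts the flag
        fd.reallyClosed = true;         // fd now reusable by new connections
    });
}

void drainQueue() {                     // stands in for the main loop
    while (!asyncQueue.empty()) {
        asyncQueue.front()();
        asyncQueue.pop();
    }
}
```

Because the internal call is pushed after all close handlers, FIFO dispatch guarantees it is dialed last, which is the property Amos wanted despite the non-deterministic absolute timing.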
Re: client_side and comm_close
On Mon, 2008-04-21 at 23:45 +1200, Amos Jeffries wrote:
>>> IMO, the proposed API's _very_ first two lines of code in comm_close are to register a special Comm callback to perform the fclose() call, and then to immediately set the fd_table "closed" flag for the rest of the comm_close process.
>>
>> Agreed on the flag, disagreed on the call. The special/internal Comm call (to self) should be scheduled last (after all close callbacks) and not first, because the close handlers might need access to some FD-related info. That info should be preserved until all close handlers have been called.
>>
>>> With that condition at the start we can guarantee that any registrations made during the close sequence are either non-FD-relevant or caught.
>>
>> Yes, the flag is sufficient for that. The internal close-for-good call can still be last.
>
> I was thinking the close-for-good call would get caught as an fd operation on a closed fd by the Async stuff if set after the flag.

Actually, the internal close-for-good call handler asserts that the flag is set! This internal handler does not have the structure or the restrictions of the public comm_* functions...

>>> The special Comm callback is only special in that it is not required to check "flags open" before fclose()'ing the system-level FD, which will allow new connections to open on the FD.
>>
>> It is special because it is calling an internal comm method, not some external close handler. Its profile and parameters are different.
>
> I would not call it a callback because of that, but the label is not important. It is not a close callback in terms of comm_add_close_handler. Yet it seems to me it needs to be run as an async call after the other async close handlers are run, due to the non-determinism of the async timing.

Yes, it must be called asynchronously. The call is scheduled after all close callbacks are scheduled so that it will be dialed last.

Alex.
Re: client_side and comm_close
On Sun, 2008-04-20 at 11:47 +0800, Adrian Chadd wrote:
>>> Yes, then you left it by introducing something which exposed the bugs.
>>
>> The problems this thread is discussing were known before AsyncCalls.
>
> But they weren't exhibiting as -bugs-.

They were: the assertion in bug 2309 (and many other problems I personally worked on) was discovered in Squid3 way before AsyncCalls (v3.0 and earlier).

>>> Its great that we now know about the bugs - but now you've got a codebase that you're trying to stabilise
>>
>> This thread is not about stabilizing client-side code. It is about changing its key classes and the relationships among them (for many reasons). This will certainly destabilize the code short-term. If it would not, I would not need to ask others' opinions on whether we should wait with that work!
>
> I thought the focus on 3.1 - and I'd check, but the Wiki history for the Roadmap stuff I put in early in the 3.1 roadmap cycle - was to focus on stability and performance. ... _I_ helped start building the 3.1 roadmap, if you remember. _I_ helped draft the notion that 3.1 should be about performance and stability fixes.

This thread is (was) dedicated to the discussion of whether _adjusting_ the v3.1 roadmap to add client-side cleanup is a good idea. Since we have more than one person working on the project, we should try to coordinate significant changes. If you think client-side cleanup is a bad idea for v3.1, just post your arguments -- no reason to talk about AsyncCalls or other unrelated matters (which can be discussed in other threads, of course).

As for the v3.1 original focus that you put in early, I do not even want to argue about that, and I am happy to assume that your original v3.1 roadmap was exactly what we should have done. What matters, though, is the _current_ Squid3 roadmap, which we are trying to base on specific commitments of specific folks. It is far from perfect, but I think it is improving and reflects the current reality reasonably well.

> If you want people to adopt Squid-3 then you should try and bring another stable release out with IPv6 (once all of -those- issues are fixed) and without too much quirkiness, and soon.

We already have a Squid3 release that is reasonably stable and improving. I am not sure why we need to rush v3.1 out when the planned features have not been completed yet. (And did you not claim that IPv6 as a Squid feature is not really important? This thread is not about IPv6, though.)

Alex.
Re: client_side and comm_close
On Sat, 2008-04-19 at 21:01 -0600, Alex Rousskov wrote:
> I agree with many of your specific suggestions (and will discuss them separately), but I want to understand what problem you are proposing to solve first. For example:
>
> - Fix a few new bugs introduced by async calls.
> - Fix certain comm API problems.
> - Improve client-side code.
> - Something else.
>
> This thread was just about improving client-side code (more specifically, when to do it), but the responses seem to be all over the map, so I am trying to understand what the goal is before diving into specific development ideas.

I am focusing more on comm than client_side at the moment. Fixing client_side without a well-defined and reliable comm layer is not very meaningful.

Regards
Henrik
Re: client_side and comm_close
On Sun, 2008-04-20 at 09:22 +0200, Henrik Nordstrom wrote:
> On Sat, 2008-04-19 at 23:05 -0600, Alex Rousskov wrote:
>> It is easy to implement two kinds of call scheduling, but I am not sure I understand the merits of doing so. What do you mean by the event loops in this context? Are you talking about the eventAdd() queue? Or I/O event (i.e., comm) callbacks? I am guessing it is the latter.
>
> Both. The event checks (I/O, timed events, etc.) run from the main loop and are always the first AsyncCall callers for anything going on.
>
>> The immediate calls from comm were not really safe (because of the recursive/reentrant behavior) so I am not sure what we would achieve here.
>
> As everything going on in Squid is initiated by the event loop, you don't get recursion from this change: main loop - event loop - immediate async call - handler - other I/O actions etc., but from there always async.

I agree regarding the first initiation point, but disagree that it is sufficient to guarantee non-recursive behavior: in addition to callbacks, we have direct function calls to comm and other APIs. The combination of immediate callbacks and those function calls will cause recursion, as control will re-enter the jobs multiple times. For example: event loop - immediate callback - handler - comm_* call - immediate callback - handler - ...

>> Please clarify which of the following comm_close implementations would solve the problems you want to solve in this context:
>>
>> A) Immediately call all not-yet-dialed callbacks for the affected FD. All relevant I/O callbacks are called with the ERR_COMM_CLOSING flag set (if we still have that). The closing callbacks, if any, are called last.
>>
>> B) Asynchronously call all not-yet-dialed callbacks for the affected FD. All relevant I/O callbacks are called with the ERR_COMM_CLOSING flag set (if we still have that). The closing callbacks, if any, are called last.
>>
>> C) Other.
>
> Don't call them. See the ERR_COMM_CLOSING discussion. The I/O calls are no longer relevant the moment the fd has been semantically closed (comm_close called by the code). Only the close callbacks matter at that time.
>
> Note: the close callbacks as such need to be async, just not the closing of the fd (or at least at a semantic level).

OK, I think you picked option B then :-). Let me polish it to better match the current level of understanding and the agreement on dropping ERR_COMM_CLOSING:

comm_close(fd) API:

1) No I/O callbacks will be dialed after comm_close is called (including the time when comm_close is running).

2) All close callbacks registered before comm_close was called will be called asynchronously, sometime after comm_close exits.

3) The code can ask Comm whether a FD associated with a close callback has been closed. The answer will be "yes" after comm_close is called and "no" before that. This interface needs to be added. Direct access to fd_table[fd].flags will not work because the same FD could have been assigned to another connection already. The code will have to use its close callback to determine FD status.

4) Registering any callback after comm_close has been called is wrong. Comm will assert if it can detect such late registrations. Not all late registrations will be detected because the same FD can already be assigned to a different (new) connection[*].

The above comm_close API is easy to implement without massive code changes. Do you think it is the right API? If not, what should be changed?

[*] We can reliably detect #4 violations by replacing fd integers outside of comm with a unique integer ID or, better, a handler object containing metadata. That kind of change may be too big for v3.1 though.

Thank you,

Alex.
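[Editor's sketch] The footnote's idea of replacing raw fd integers with a handle carrying metadata can be illustrated with a generation counter, which makes a late registration against a closed-and-reused fd detectable. ConnHandle, FdSlot and the function names are hypothetical, not Squid code:

```cpp
#include <cassert>

// A handle pairs the fd with a generation number that is bumped every
// time the descriptor is (re)opened for a new connection.
struct ConnHandle {
    int fd;
    unsigned generation;
};

struct FdSlot {
    unsigned generation = 0;  // current generation of this descriptor
};

FdSlot slots[8];  // toy fd table

ConnHandle openConn(int fd) {
    return ConnHandle{fd, ++slots[fd].generation};
}

// Registration succeeds only if the handle matches the slot's current
// generation; a stale handle (fd reused by a new connection) is caught,
// where a bare int fd would be silently accepted.
bool registerCallback(const ConnHandle &h) {
    return slots[h.fd].generation == h.generation;
}
```

With bare fd integers, comm cannot tell a late registration from a legitimate one (API point 4); the generation check turns that silent corruption into a detectable error.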
Re: client_side and comm_close
Maybe it can be easier: just keep comm_close as it is, and in AsyncCallsQueue simply cancel (do not execute) asyncCalls for which fd_table[fd].flags.closing is set. We only need a way to identify comm_close handlers and the asyncCall which will do the actual close of the socket.

Alex Rousskov wrote:
> OK, I think you picked option B then :-). Let me polish it to better match the current level of understanding and the agreement on dropping ERR_COMM_CLOSING:
>
> comm_close(fd) API:
>
> 1) No I/O callbacks will be dialed after comm_close is called (including the time when comm_close is running).
>
> 2) All close callbacks registered before comm_close was called will be called asynchronously, sometime after comm_close exits.
>
> 3) The code can ask Comm whether a FD associated with a close callback has been closed. The answer will be "yes" after comm_close is called and "no" before that. This interface needs to be added. Direct access to fd_table[fd].flags will not work because the same FD could have been assigned to another connection already. The code will have to use its close callback to determine FD status.
>
> 4) Registering any callback after comm_close has been called is wrong. Comm will assert if it can detect such late registrations. Not all late registrations will be detected because the same FD can already be assigned to a different (new) connection[*].
>
> The above comm_close API is easy to implement without massive code changes. Do you think it is the right API? If not, what should be changed?
>
> [*] We can reliably detect #4 violations by replacing fd integers outside of comm with a unique integer ID or, better, a handler object containing metadata. That kind of change may be too big for v3.1 though.
>
> Thank you,
>
> Alex.
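[Editor's sketch] Christos's shortcut, the queue itself skipping calls whose fd is flagged closing while still letting through the one call that really closes the socket, might look roughly like this toy model (QueuedCall, closingFlag and fireQueue are illustrative names, not the real AsyncCallsQueue):

```cpp
#include <cassert>
#include <functional>
#include <queue>

// A scheduled call tagged with the fd it belongs to and whether it is
// the special call performing the actual socket close.
struct QueuedCall {
    int fd;
    bool isCloser;
    std::function<void()> body;
};

bool closingFlag[8] = {false};  // toy stand-in for fd_table[fd].flags.closing

// Dispatch loop: cancel (skip) every pending call on a closing fd,
// except the closer itself, which must still run to release the fd.
void fireQueue(std::queue<QueuedCall> &q) {
    while (!q.empty()) {
        QueuedCall c = q.front();
        q.pop();
        if (closingFlag[c.fd] && !c.isCloser)
            continue;  // canceled: the fd is semantically closed already
        c.body();
    }
}
```

This is why the sketch needs "a way to identify comm_close handlers and the asyncCall which will do the actual close": without the isCloser tag, the queue would cancel the close itself along with the I/O calls.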
Re: client_side and comm_close
On Sun, 2008-04-20 at 22:19 +0300, Christos Tsantilas wrote:
> The important files are CommCalls.*, AsyncCalls.* and ICAP/AsyncJob.*. But I believe you understand it well enough.

I think I understand most of the code now, but I am not entirely sure about the design and intended use model yet. The basic principle is simple and neat.

Regards
Henrik
Re: client_side and comm_close
On Sun, 2008-04-20 at 22:56 -0600, Alex Rousskov wrote:
> - I hope we can avoid similar problems with comm_cancel_read because the two or three places using that call can be adjusted to cancel the read callback directly. We have already talked about that elsewhere.

It's not only comm_cancel_read but also commSetTimeout.

Regards
Henrik
Re: client_side and comm_close
On Mon, 2008-04-21 at 15:48 +1200, Amos Jeffries wrote:
>> comm_close(fd) API:
>>
>> 1) No I/O callbacks will be dialed after comm_close is called (including the time when comm_close is running).
>
> Sounds good.
>
>> 2) All close callbacks registered before comm_close was called will be called asynchronously, sometime after comm_close exits.
>
> Sounds good.
>
>> 3) The code can ask Comm whether a FD associated with a close callback has been closed. The answer will be "yes" after comm_close is called and "no" before that. This interface needs to be added. Direct access to fd_table[fd].flags will not work because the same FD could have been assigned to another connection already. The code will have to use its close callback to determine FD status.
>
> Sounds good, BUT direct access to fd_table pieces may need to be blocked entirely (private:) so code is forced to go through the Comm API properly.

Yes, except if we want to avoid modifying old code that can still access those flags directly because it gets immediate close callbacks (more on that below).

> (2) states that the higher-level close callbacks may be run at any time, i.e. after the callback (3) refers to is run. This leaves a window open for disaster, unless the closing callbacks are made immediate, and back we go to recursion...

Those are the same close callbacks! There are no low-level and high-level close callbacks here. The external code can use its stored callback pointer to get FD status even after that close callback has been called. There is no problem with that. The callback will not look at fd_table in that case; it will just say "yes, the fd is closed as far as you should be concerned".

And, per recent suggestions, old code will get immediate close callbacks so it does not need to be modified to use the close callback pointer to ask about FD status.

>> 4) Registering any callback after comm_close has been called is wrong. Comm will assert if it can detect such late registrations. Not all late registrations will be detected because the same FD can already be assigned to a different (new) connection[*].
>
> That non-detection seems to me to be a worry. The same problem as in (3) occurs here. (4) can guarantee that the closing callbacks don't play nasty re-registrations, but only if they are called immediately instead of scheduled.

Sorry, I do not understand. The late registration of a close callback can come from anywhere. The old code relies on the fd only. There is no way for comm to distinguish whether the caller is using a FD from a closed connection or a new one. This problem exists in the old code as well, so there is no new danger here! It can only be solved when we have comm handlers of one sort or another.

>> The above comm_close API is easy to implement without massive code changes. Do you think it is the right API? If not, what should be changed?
>
> Apart from the worry with immediate vs delayed closing callbacks. To reduce that worry somewhat, I think the callbacks which actually use ERR_COMM_CLOSING for anything other than immediate abort will need to be replaced with two: a normal callback that checks the FD is open, and a simple closing callback.

I am confused by the "normal callback that checks the FD is open" part. What is that for? Are you talking about some internal comm calls to close the FD? I believe that handler code that currently does not ignore ERR_COMM_CLOSING notifications will need to be moved into the close handler code (because only the close handler will be called if we eliminate ERR_COMM_CLOSING).

Thanks,

Alex.
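[Editor's sketch] Alex's point that code should ask its stored close-callback pointer for FD status, never fd_table[fd].flags, can be shown with a minimal toy model. CloseCall, dialCloseHandler and fdClosedFor are hypothetical names for illustration only:

```cpp
#include <cassert>

// The owner keeps a pointer to the close callback it registered.
// Comm marks it dialed when the close handler is invoked.
struct CloseCall {
    bool dialed = false;
};

// Comm side: invoking the close handler marks the callback as dialed.
void dialCloseHandler(CloseCall &c) {
    c.dialed = true;
}

// Owner side: FD status comes from the stored callback, not from
// fd_table, because the raw fd may already belong to a new connection.
bool fdClosedFor(const CloseCall &c) {
    return c.dialed;
}
```

The key property is that the answer stays correct even after the fd integer has been recycled: the callback object belongs to one connection, while the fd_table slot does not.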
Re: client_side and comm_close
On Mon, 2008-04-21 at 16:02 +1200, Amos Jeffries wrote:
>> On Sun, 2008-04-20 at 22:01 +0300, Tsantilas Christos wrote:
>>> Maybe it can be easier:
>>
>> The ease of implementation is a separate question. We still have to agree on what we are implementing, and the items below attempt to define that. Ambiguity and lack of clear APIs is quite dangerous here (as the related bugs illustrate!).
>>
>>> Just keep comm_close as is, and in AsyncCallsQueue simply cancel (do not execute) asyncCalls for which fd_table[fd].flags.closing is set.
>>
>> Implementing #1 below can indeed be as simple as you sketch above. We need to make sure that the final call that closes the FD and makes the fd value available to other connections is placed last (that may already be true and is easy to fix if not).
>
> Not true. IMO, the proposed API's _very_ first two lines of code in comm_close are to register a special Comm callback to perform the fclose() call, and then to immediately set the fd_table "closed" flag for the rest of the comm_close process.

Agreed on the flag, disagreed on the call. The special/internal Comm call (to self) should be scheduled last (after all close callbacks) and not first, because the close handlers might need access to some FD-related info. That info should be preserved until all close handlers have been called.

> With that condition at the start we can guarantee that any registrations made during the close sequence are either non-FD-relevant or caught.

Yes, the flag is sufficient for that. The internal close-for-good call can still be last.

> The special Comm callback is only special in that it is not required to check "flags open" before fclose()'ing the system-level FD, which will allow new connections to open on the FD.

It is special because it is calling an internal comm method, not some external close handler. Its profile and parameters are different.

> Between the initial comm_close() call and the special Comm callback, we don't need to care if callbacks write their shutdown statements to the fd (it's still technically open), but the closed flag prevents comm accepting any new delayed event registrations or reads.

Exactly. Our only problem is with code that calls comm with a stale fd, after that fd has been really closed and a new connection was opened with the same fd. That's not a new problem, and we will solve it in v3.2 using comm handlers.

I hope the above puts us on the same page about the implementation sketch for the comm_close API, but please yell if something still seems broken.

Thank you,

Alex.
Re: client_side and comm_close
On Fri, 2008-04-18 at 08:25 -0600, Alex Rousskov wrote:
> The changes needed on the client side are pretty much unrelated to async calls. Async calls just make the design problems more obvious and help in solving them. Removing async calls from comm will fix a few bugs, but it will be a step backwards and will make the ongoing code cleanup more difficult.

Here is a different proposal. Split async calls in two: queued and immediate, with the event loops using immediate calls and everything else queued. The simplest way of doing so safely would be to make async calls immediate if not already within an async call, but adjusting the event loops is also a possibility.

This would make the code behave a lot more like how it was before async, eliminating most races. The very few cases which may depend on comm_close being immediate should be possible to identify by cbdataInternalLock calls, I think, as the state object previously may disappear immediately otherwise. I can't see any in the client_side_* code today. But there very likely are assumptions that after comm_close has been called, no further events will fire positively on that fd (only as ERR_COMM_CLOSING).

I don't remember why comm_close was made async, but the more I think about it, the more I think it's most likely wrong. comm_close SHOULD invalidate any pending comm operations on that fd; anything else just leads to nightmares.

Related to this, I also think the COMM_ERR_CLOSING callback from comm should be dropped. Nearly all handlers looking for COMM_ERR_CLOSING do nothing else than return immediately to abort execution, and the few exceptions there are can be dealt with by a close callback if at all needed:

- errorSendComplete: invalidates the err object
- ftpStateData::dataRead: sets a boolean that the read is done
- HttpStateData::handleRequestBodyProducerAborted: injects a COMM_ERR_CLOSING
- HttpStateData::readReply: sets a boolean that the read is done
- ServerStateData::sentRequestBody: clears requestSender

The other 20 or so users of comm callbacks just return immediately on COMM_ERR_CLOSING, aborting the call.

Regards
Henrik
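[Editor's sketch] The proposed split, dropping the closing notification from the I/O path and moving the residual cleanup (such as a "read is done" boolean) into a close handler, might look roughly like this. DataChannel and the function names are hypothetical, not the actual ftpStateData/HttpStateData code:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// State that used to be updated from inside an ERR_COMM_CLOSING branch
// of the read handler, now owned by a dedicated close handler instead.
struct DataChannel {
    bool readDone = false;
    std::vector<std::function<void()>> closeHandlers;
};

// Normal I/O path: it never sees a closing notification anymore,
// so it only handles real read results.
void onRead(DataChannel &ch, int nread) {
    if (nread == 0)
        ch.readDone = true;  // peer finished sending
}

// The cleanup formerly done on ERR_COMM_CLOSING is registered as a
// close handler, dialed asynchronously after comm_close.
void registerCloseCleanup(DataChannel &ch) {
    ch.closeHandlers.push_back([&ch] { ch.readDone = true; });
}

void runCloseHandlers(DataChannel &ch) {
    for (auto &h : ch.closeHandlers)
        h();
}
```

The roughly 20 handlers that only aborted on COMM_ERR_CLOSING need no replacement at all under this scheme: once I/O callbacks are never dialed after comm_close, the abort branch becomes dead code.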
Re: client_side and comm_close
On Sat, Apr 19, 2008, Henrik Nordstrom wrote:
> I don't remember why comm_close was made async, but the more I think about it, the more I think it's most likely wrong. comm_close SHOULD invalidate any pending comm operations on that fd; anything else just leads to nightmares.

The back-end of comm_close() may have to be async to properly support Windows overlapped IO in the future. You have to cancel pending IOs, which IIRC may actually fire before the cancel can be processed. It's been a few years since I toyed with overlapped IO, so I can't be sure. Someone with more Windows overlapped IO experience should chime in.

Also, I reckon it would be a smart idea to sit down and code up some very basic Windows overlapped IO examples (like a TCP proxy! :) to properly explore its behaviour before any changes to comm are made. (That's what I'll be doing before my comm changes in the cacheboy branch, as I really don't want to use libevent/libev's Windows mode for IO.)

Adrian

-- 
- Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid Support -
- $25/pm entry-level VPSes w/ capped bandwidth charges available in WA -
Re: client_side and comm_close
Hi all,

I think we have two different problems here: the client_side code, and comm_close in async calls. If I am not wrong, the bug which triggered this discussion is bug 2309, a Squid 3.0 bug, where AsyncCalls do not exist. The problem there is the client-side code, which truly does need cleanup. That is well known to everyone who has tried to fix bugs or add code in this part of Squid3.

About comm_close: AsyncCalls were designed and implemented in such a way as to require the minimum of changes in the Squid3 code, and this was limiting some things...

Henrik Nordstrom wrote:
> Here is a different proposal. Split async calls in two: queued and immediate, with the event loops using immediate calls and everything else queued. The simplest way of doing so safely would be to make async calls immediate if not already within an async call, but adjusting the event loops is also a possibility.
>
> This would make the code behave a lot more like how it was before async, eliminating most races. The very few cases which may depend on comm_close being immediate should be possible to identify by cbdataInternalLock calls, I think, as the state object previously may disappear immediately otherwise. I can't see any in the client_side_* code today. But there very likely are assumptions that after comm_close has been called, no further events will fire positively on that fd (only as ERR_COMM_CLOSING).
>
> I don't remember why comm_close was made async, but the more I think

I think (but I am not sure now) there were cases where some of the pending asyncCalls had to be executed.

> about it, the more I think it's most likely wrong. comm_close SHOULD invalidate any pending comm operations on that fd; anything else just leads to nightmares.

To invalidate any pending comm operation for a particular fd, you would have to scan the whole AsyncCalls queue, which may be costly in a busy proxy with thousands of asyncCalls scheduled (maybe use secondary asyncCall indexes for each fd?).

If we had only AsyncJob-based asyncCalls, it would be easy to not execute (invalidate) AsyncCalls which refer to invalid AsyncJobs, but there still exist functions which are scheduled the old way...

> Related to this, I also think the COMM_ERR_CLOSING callback from comm should be dropped. Nearly all handlers looking for COMM_ERR_CLOSING do nothing else than return immediately to abort execution, and the few exceptions there are can be dealt with by a close callback if at all needed:
>
> - errorSendComplete: invalidates the err object
> - ftpStateData::dataRead: sets a boolean that the read is done
> - HttpStateData::handleRequestBodyProducerAborted: injects a COMM_ERR_CLOSING
> - HttpStateData::readReply: sets a boolean that the read is done
> - ServerStateData::sentRequestBody: clears requestSender
>
> The other 20 or so users of comm callbacks just return immediately on COMM_ERR_CLOSING, aborting the call.
>
> Regards
> Henrik
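[Editor's sketch] The parenthetical suggestion of a secondary per-fd index can be illustrated simply: keep the pending calls for each fd in their own bucket, so comm_close invalidates only that fd's calls instead of scanning the whole queue. PendingCall, callsByFd and the function names are illustrative only:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// A pending async call that can be marked canceled in place; the main
// dispatch loop (not shown) would skip canceled calls when dialing.
struct PendingCall {
    bool canceled = false;
    std::function<void()> body;
};

// Secondary index: pending calls grouped by the fd they belong to.
std::map<int, std::vector<PendingCall*>> callsByFd;

void scheduleForFd(int fd, PendingCall *call) {
    callsByFd[fd].push_back(call);
}

// comm_close touches only this fd's bucket: cost is proportional to the
// calls pending on that one fd, not to the length of the global queue.
void cancelCallsFor(int fd) {
    for (PendingCall *c : callsByFd[fd])
        c->canceled = true;
    callsByFd.erase(fd);
}
```

This keeps invalidation cheap on a busy proxy with thousands of scheduled calls, at the cost of maintaining the index on every schedule and dial.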
Re: client_side and comm_close
On Sat, 2008-04-19 at 10:50 +0800, Adrian Chadd wrote: On Fri, Apr 18, 2008, Alex Rousskov wrote: If the client_side's main problem was a few coding bugs caused by async call changes, you may have been right. Unfortunately, our main problem here is the client_side design flaws. Removing async calls from comm will not undo those flaws. I know this. The changes needed on the client-side are pretty much unrelated to async calls. Async calls just make the design problems more obvious and help in solving them. Removing async calls from comm will fix a few bugs, but will be a step backwards and will make ongoing code cleanup more difficult. But now you have a better understanding of the bugs that are introduced with AsyncCalls. You can then plan design changes to the client-side code to work around these. I am not sure how to rephrase this so that it is clear, but client-side design changes are needed to fix client-side bugs, not to work around AsyncCall bugs. AsyncCalls are just a helpful tool here, not the focus. Removing AsyncCalls will not help but will hurt. I simply do not understand why you focus on AsyncCalls in the context of this thread when this thread discusses client-side problems that were there before AsyncCalls and are there after AsyncCalls were added. Another way to look at it is to note that client_side was already pretty stable before async calls (i.e., Squid 3.0). Thus, we have already been where you are proposing to go. Yes, then you left it by introducing something which exposed the bugs. The problems this thread is discussing were known before AsyncCalls. Its great that we now know about the bugs - but now you've got a codebase that you're trying to stabilise This thread is not about stabilizing client-side code. It is about changing its key classes and the relationships among them (for many reasons). This will certainly destabilize the code short-term. If it would not, I would not need to ask others opinion whether we should wait with that work! 
and you can't guarantee you understand all of the possible ways it's going to break in the future. True. In fact, nobody can guarantee he understands all of the possible ways something is going to break in the future. If you back out the async related changes then you'll have a stable codebase again - you can then make smaller incremental changes to the codebase and test them to make sure you haven't broken anything, rather than spending possibly 6+ months finding more surprises. I do not think we can fix the problems discussed here by doing small incremental changes. We are not talking about bug workarounds or a few places that need internal polishing. Remember, the initial idea was to use 3.1 to tidy up the codebase to stabilise it, not tidy up the codebase to unstabilise it. Redesigning chunks of the comm codebase -was- on my todo list for 3.1, but then people dumped stuff into 3 to make it less stable again. This thread is not about redesigning comm APIs, and I do not know much about your todo list, but the Squid 3.1 roadmap has been built in public. If any proposed Squid 3.1 features went against your plans, you should have added your feature(s) to the roadmap and discussed conflicts. This procedure does not guarantee anything (those who work together on the project have to compromise), but it works better than complaining about the first steps while approaching the end of the road. And nobody is tidying up the code to make it unstable, obviously. In summary, I do not understand how removing AsyncCalls would help fix client-side design problems that are the subject of this thread. I do not even understand why we are talking about AsyncCalls here. Alex.
Re: client_side and comm_close
On Sat, 2008-04-19 at 11:52 +0200, Henrik Nordstrom wrote: On Fri, 2008-04-18 at 08:25 -0600, Alex Rousskov wrote: The changes needed on the client-side are pretty much unrelated to async calls. Async calls just make the design problems more obvious and help in solving them. Removing async calls from comm will fix a few bugs, but will be a step backwards and will make ongoing code cleanup more difficult. Here is a different proposal. I agree with many of your specific suggestions (and will discuss them separately), but I want to understand what problem you are proposing to solve first. For example: - Fix a few new bugs introduced by async calls. - Fix certain comm API problems. - Improve client-side code. - Something else. This thread was just about improving client-side code (more specifically, when to do it), but the responses seem to be all over the map, so I am trying to understand what the goal is before diving into specific development ideas. Thank you, Alex.
Re: client_side and comm_close
On Sat, Apr 19, 2008, Alex Rousskov wrote: But now you have a better understanding of the bugs that are introduced with AsyncCalls. You can then plan design changes to the client-side code to work around these. I am not sure how to rephrase this so that it is clear, but client-side design changes are needed to fix client-side bugs, not to work around AsyncCall bugs. AsyncCalls are just a helpful tool here, not the focus. Removing AsyncCalls will not help but will hurt. I know that client-side changes are required. We've all known that the client-side codebase needs major surgery for various reasons. I simply do not understand why you focus on AsyncCalls in the context of this thread when this thread discusses client-side problems that were there before AsyncCalls and are there after AsyncCalls were added. I'm not focusing on AsyncCalls. I'm focusing on the breakage that it's introduced in code which isn't yet ready for it. Yes, and then you left it by introducing something which exposed the bugs. The problems this thread is discussing were known before AsyncCalls. But they weren't exhibiting as -bugs-. It's great that we now know about the bugs - but now you've got a codebase that you're trying to stabilise This thread is not about stabilizing client-side code. It is about changing its key classes and the relationships among them (for many reasons). This will certainly destabilize the code short-term. If it would not, I would not need to ask others' opinion on whether we should wait with that work! I thought the focus of 3.1 - and I'd have to check the Wiki history for the Roadmap stuff I put in early in the 3.1 roadmap cycle - was to be on stability and performance. The recent work has broken the stability, and doing performance work on an unstable codebase is silly.
If you back out the async related changes then you'll have a stable codebase again - you can then make smaller incremental changes to the codebase and test them to make sure you haven't broken anything, rather than spending possibly 6+ months finding more surprises. I do not think we can fix the problems discussed here by doing small incremental changes. We are not talking about bug workarounds or a few places that need internal polishing. And that's where we differed. Remember, the initial idea was to use 3.1 to tidy up the codebase to stabilise it, not tidy up the codebase to unstabilise it. Redesigning chunks of the comm codebase -was- on my todo list for 3.1, but then people dumped stuff into 3 to make it less stable again. This thread is not about redesigning comm APIs, and I do not know much about your todo list, but the Squid 3.1 roadmap has been built in public. If any proposed Squid 3.1 features went against your plans, you should have added your feature(s) to the roadmap and discussed conflicts. This procedure does not guarantee anything (those who work together on the project have to compromise), but it works better than complaining about the first steps while approaching the end of the road. _I_ helped start building the 3.1 roadmap, if you remember. _I_ helped draft the notion that 3.1 should be about performance and stability fixes. The introduction of AsyncCalls, whilst a good idea in the long run, just shows that the current client-side codebase isn't ready for it. And nobody is tidying up the code to make it unstable, obviously. But it is! And instead of backing the changes out - giving us a stable codebase again, but with an understanding of what needs to be changed before the AsyncCalls stuff is reintroduced - people still seem to persist along the same path which brought all the quirky instability into Squid-3 in the first place.
Stop thinking that I'm focusing on AsyncCalls, and start thinking that I'm focusing on software engineering process with a lean towards stability. If you want people to adopt Squid-3 then you should try and bring another stable release out with IPv6 (once all of -those- issues are fixed) and without too much quirkiness, and soon. -- - Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid Support - - $25/pm entry-level VPSes w/ capped bandwidth charges available in WA -
Re: client_side and comm_close
On Sat, 2008-04-19 at 11:52 +0200, Henrik Nordstrom wrote: On Fri, 2008-04-18 at 08:25 -0600, Alex Rousskov wrote: The changes needed on the client-side are pretty much unrelated to async calls. Async calls just make the design problems more obvious and help in solving them. Removing async calls from comm will fix a few bugs, but will be a step backwards and will make ongoing code cleanup more difficult. Here is a different proposal. Split async calls in two: queued and immediate, with the event loops using immediate calls and everything else queued. It is easy to implement two kinds of call scheduling, but I am not sure I understand the merits of doing so. What do you mean by the event loops in this context? Are you talking about the eventAdd() queue? Or I/O event (i.e., comm) callbacks? I am guessing it is the latter. The simplest way of doing so safely would be to make async calls immediate if not already within an async call, but adjusting the event loops is also a possibility. This would make the code behave a lot more like how it was before async, eliminating most races. The immediate calls from comm were not really safe (because of the recursive/reentrant behavior) so I am not sure what we would achieve here. I hope answers to earlier questions would clarify the intent. Again, I would like to understand and agree on the problem/goal before debating specific solutions... The very few cases which may depend on comm_close being immediate should be possible to identify by cbdataInternalLock calls, I think, as the state object previously might disappear immediately otherwise. I can't see any in the client_side_* code today. But there very likely are assumptions that after comm_close has been called no further events will fire positively on that fd (only as ERR_COMM_CLOSING). I don't remember why comm_close was made async, but the more I think about it, the more I think it's most likely wrong.
comm_close SHOULD invalidate any pending comm operations on that fd, anything else just leads to nightmares. Please clarify which of the following comm_close implementations would solve the problems you want to solve in this context: A) Immediately call all not-yet-dialed callbacks for the affected FD. All relevant I/O callbacks are called with the ERR_COMM_CLOSING flag set (if we still have that). The closing callbacks, if any, are called last. B) Asynchronously call all not-yet-dialed callbacks for the affected FD. All relevant I/O callbacks are called with the ERR_COMM_CLOSING flag set (if we still have that). The closing callbacks, if any, are called last. C) Other. Implementation A would lead to reentrant job calls, which is the problem we were trying to solve earlier, I think. Implementation B avoids reentrant/recursive behavior. We can have both if B is the Right Thing to do but A is what some old code required. (I still would like to know what problem we are trying to solve though: is the problem you are trying to solve specific to comm_close?) Thank you, Alex.
Re: client_side and comm_close
On Sat, 2008-04-19 at 17:24 +0300, Tsantilas Christos wrote: I think we have two different problems here, the client_side code and the comm_close in async calls. You are correct. General async call bugs and general comm API problems have been mentioned on this thread as well. Most of these problems are related, of course, but it is rather important to understand which problem the person is talking about when discussing a specific solution :-). If I am not wrong, the bug which triggered this discussion is bug 2309, which is a Squid 3.0 bug where AsyncCalls do not exist. If that is indeed the case, it is a good point! The problem here is the client-side code, and the truth is that it needs a clean up. That is well known to everyone who has tried to fix bugs or add code in this part of Squid 3. Yes, I hope everybody who worked with the client-side code is in agreement that it needs a clean up. I believe the question opening this thread was when to do it (v3.1 or v3.2), not whether to do it. It is important to note that I can hit bug2309-like assertions in trunk as well. The exact causes may be a little different, but all seem to revolve around comm_close and transaction abort handling (before and after async calls). Thank you, Alex.
Re: client_side and comm_close
Alex Rousskov wrote: On Fri, 2008-04-18 at 15:40 +1200, Amos Jeffries wrote: IMO, this should be queued as part of the 3.1 cleanups, but not as urgently as some other issues. The priorities as I see them now are: - immediately fixable and urgent bugs. - completing the dedicated 3.1 feature list (DEVEL is overdue). - cleanups for fixing more intrusive or design bugs (such as this, and the 407 one). - shakedown for release cycle. We have a few months of shakedown before anything like a good stable. Plenty of time for cleanup not blocking new features. I estimate that properly cleaning up the client side would delay the v3.1 stable release by 2-3 months unless we suddenly get more experts involved. Is that delay acceptable to you? Others? If we were in a position to start it now, maybe. But we are not, I think. My vote is for the timeline above, leaving step 3 as optional and deferring the decision on whether to even start it until step 2 is fully completed. We may later decide that it is worthwhile or vital, but I don't think anyone should start in on more major sweeping changes until we know what all the fallout from the currently underway set is going to be. Amos -- Please use Squid 2.6.STABLE19 or 3.0.STABLE4
Re: client_side and comm_close
On Wed, Apr 16, 2008, Alex Rousskov wrote: In short, we have several related problems here: (a) client_side code is incapable of reliably identifying whether comm_close has been called; (b) ConnStateData::isOpen may not work anymore; (c) client_side code uses many different ways to identify whether more data should be read from the connection; (d) comm_close is used a lot but no longer has an immediate effect and some client_side code may still depend on that effect to be immediate; (e) client_side comm handlers descend very deep, making it difficult to propagate errors and comm_close status up. We should decide whether to continue patching holes here and there or clean up the client_side*cc connection management mess for good. Should we continue to patch isolated v3.0 problems and clean up in v3.1? Or is this a v3.2 project? Or am I exaggerating the problems since common cases usually work fine? I'd suggest another option - roll back all of the async calls changes to the comm code, stabilise the codebase without it and re-evaluate what should occur (in smaller chunks, rather than dropping in a new comm manager) before reintroducing it. I think we've seen that there's still a gap in how development branches are tested and that the best way we have to test code is to put it in production :) Adrian
Re: client_side and comm_close
On Fri, 2008-04-18 at 18:34 +1200, Amos Jeffries wrote: Alex Rousskov wrote: On Fri, 2008-04-18 at 15:40 +1200, Amos Jeffries wrote: IMO, this should be queued as part of the 3.1 cleanups, but not as urgently as some other issues. The priorities as I see them now are: - immediately fixable and urgent bugs. - completing the dedicated 3.1 feature list (DEVEL is overdue). - cleanups for fixing more intrusive or design bugs (such as this, and the 407 one). - shakedown for release cycle. We have a few months of shakedown before anything like a good stable. Plenty of time for cleanup not blocking new features. I estimate that properly cleaning up the client side would delay the v3.1 stable release by 2-3 months unless we suddenly get more experts involved. Is that delay acceptable to you? Others? If we were in a position to start it now, maybe. But we are not, I think. You are probably right. FWIW, I will not be able to work on that until eCAP is done. My vote is for the timeline above, leaving step 3 as optional and deferring the decision on whether to even start it until step 2 is fully completed. We may later decide that it is worthwhile or vital, but I don't think anyone should start in on more major sweeping changes until we know what all the fallout from the currently underway set is going to be. In the unlikely event somebody with enough client-side knowledge can start on this now, I would probably support it (and help with the design). Otherwise, I agree that your plan is the best option available. Thank you, Alex.
Re: client_side and comm_close
On Fri, 2008-04-18 at 16:15 +0800, Adrian Chadd wrote: On Wed, Apr 16, 2008, Alex Rousskov wrote: In short, we have several related problems here: (a) client_side code is incapable of reliably identifying whether comm_close has been called; (b) ConnStateData::isOpen may not work anymore; (c) client_side code uses many different ways to identify whether more data should be read from the connection; (d) comm_close is used a lot but no longer has an immediate effect and some client_side code may still depend on that effect to be immediate; (e) client_side comm handlers descend very deep, making it difficult to propagate errors and comm_close status up. We should decide whether to continue patching holes here and there or clean up the client_side*cc connection management mess for good. Should we continue to patch isolated v3.0 problems and clean up in v3.1? Or is this a v3.2 project? Or am I exaggerating the problems since common cases usually work fine? I'd suggest another option - roll back all of the async calls changes to the comm code, stabilise the codebase without it and re-evaluate what should occur (in smaller chunks, rather than dropping in a new comm manager) before reintroducing it. If the client_side's main problem was a few coding bugs caused by async call changes, you may have been right. Unfortunately, our main problem here is the client_side design flaws. Removing async calls from comm will not undo those flaws. The changes needed on the client-side are pretty much unrelated to async calls. Async calls just make the design problems more obvious and help in solving them. Removing async calls from comm will fix a few bugs, but will be a step backwards and will make ongoing code cleanup more difficult. Another way to look at it is to note that client_side was already pretty stable before async calls (i.e., Squid 3.0). Thus, we have already been where you are proposing to go. Alex.
Re: client_side and comm_close
On Fri, Apr 18, 2008, Alex Rousskov wrote: I'd suggest another option - roll back all of the async calls changes to the comm code, stabilise the codebase without it and re-evaluate what should occur (in smaller chunks, rather than dropping in a new comm manager) before reintroducing it. If the client_side's main problem was a few coding bugs caused by async call changes, you may have been right. Unfortunately, our main problem here is the client_side design flaws. Removing async calls from comm will not undo those flaws. I know this. The changes needed on the client-side are pretty much unrelated to async calls. Async calls just make the design problems more obvious and help in solving them. Removing async calls from comm will fix a few bugs, but will be a step backwards and will make ongoing code cleanup more difficult. But now you have a better understanding of the bugs that are introduced with AsyncCalls. You can then plan design changes to the client-side code to work around these. Another way to look at it is to note that client_side was already pretty stable before async calls (i.e., Squid 3.0). Thus, we have already been where you are proposing to go. Yes, and then you left it by introducing something which exposed the bugs. It's great that we now know about the bugs - but now you've got a codebase that you're trying to stabilise and you can't guarantee you understand all of the possible ways it's going to break in the future. If you back out the async related changes then you'll have a stable codebase again - you can then make smaller incremental changes to the codebase and test them to make sure you haven't broken anything, rather than spending possibly 6+ months finding more surprises. Remember, the initial idea was to use 3.1 to tidy up the codebase to stabilise it, not tidy up the codebase to unstabilise it. Redesigning chunks of the comm codebase -was- on my todo list for 3.1, but then people dumped stuff into 3 to make it less stable again.
Adrian
Re: client_side and comm_close
Hi all, Alex Rousskov wrote: In reply to comment #8 for bug #2309 comm_read assertion failed: ccb->active == false http://www.squid-cache.org/bugs/show_bug.cgi?id=2309 Note to the one looking into this bug: The assert happens if a comm_read() is called while there already is a comm_read() pending. Yes. I would suspect the half-closed client check has been wrongly adapted to call comm_read where it used commSetSelect periodically before.. The above may be true, but there are other conditions causing the same assertion in my tests, with and without the pending comm_close patch from Christos. As far as I can understand, the problem does not exist only in squid3.0 or only in squid3-trunk. The main problem with the client_side code is that comm_read is done by the ConnStateData class, comm_writes by the ClientSocketContext class, and comm_close (and other comm operations) by all. This makes it very difficult to manage the comm-related problems in the client_side code. In short, we have several related problems here: (a) client_side code is incapable of reliably identifying whether comm_close has been called; (b) ConnStateData::isOpen may not work anymore; (c) client_side code uses many different ways to identify whether more data should be read from the connection; (d) comm_close is used a lot but no longer has an immediate effect and some client_side code may still depend on that effect to be immediate; (e) client_side comm handlers descend very deep, making it difficult to propagate errors and comm_close status up. We should decide whether to continue patching holes here and there or clean up the client_side*cc connection management mess for good. Should we continue to patch isolated v3.0 problems and clean up in v3.1? Or is this a v3.2 project? Or am I exaggerating the problems since common cases usually work fine? Thank you, Alex.
Re: client_side and comm_close
On Thu, 2008-04-17 at 20:13 +0300, Tsantilas Christos wrote: Alex Rousskov wrote: In reply to comment #8 for bug #2309 comm_read assertion failed: ccb->active == false http://www.squid-cache.org/bugs/show_bug.cgi?id=2309 Note to the one looking into this bug: The assert happens if a comm_read() is called while there already is a comm_read() pending. Yes. I would suspect the half-closed client check has been wrongly adapted to call comm_read where it used commSetSelect periodically before.. The above may be true, but there are other conditions causing the same assertion in my tests, with and without the pending comm_close patch from Christos. As far as I can understand, the problem does not exist only in squid3.0 or only in squid3-trunk. The main problem with the client_side code is that comm_read is done by the ConnStateData class, comm_writes by the ClientSocketContext class, and comm_close (and other comm operations) by all. This makes it very difficult to manage the comm-related problems in the client_side code. Yes. If we decide to fix this, there will be a single object responsible for client connection management. Others will simply use its services as needed. Comm cannot be such an object because it is too low-level. Currently, the functionality is spread among many classes. We should decide whether to continue patching holes here and there or clean up the client_side*cc connection management mess for good. Should we continue to patch isolated v3.0 problems and clean up in v3.1? Or is this a v3.2 project? Or am I exaggerating the problems since common cases usually work fine? The question remains. Thank you, Alex.
Re: client_side and comm_close
Alex Rousskov wrote: On Thu, 2008-04-17 at 20:13 +0300, Tsantilas Christos wrote: Alex Rousskov wrote: In reply to comment #8 for bug #2309 comm_read assertion failed: ccb->active == false http://www.squid-cache.org/bugs/show_bug.cgi?id=2309 Note to the one looking into this bug: The assert happens if a comm_read() is called while there already is a comm_read() pending. Yes. I would suspect the half-closed client check has been wrongly adapted to call comm_read where it used commSetSelect periodically before.. The above may be true, but there are other conditions causing the same assertion in my tests, with and without the pending comm_close patch from Christos. As far as I can understand, the problem does not exist only in squid3.0 or only in squid3-trunk. The main problem with the client_side code is that comm_read is done by the ConnStateData class, comm_writes by the ClientSocketContext class, and comm_close (and other comm operations) by all. This makes it very difficult to manage the comm-related problems in the client_side code. Yes. If we decide to fix this, there will be a single object responsible for client connection management. Others will simply use its services as needed. Comm cannot be such an object because it is too low-level. Currently, the functionality is spread among many classes. We should decide whether to continue patching holes here and there or clean up the client_side*cc connection management mess for good. Should we continue to patch isolated v3.0 problems and clean up in v3.1? Or is this a v3.2 project? Or am I exaggerating the problems since common cases usually work fine? The question remains. IMO, this should be queued as part of the 3.1 cleanups, but not as urgently as some other issues. The priorities as I see them now are: - immediately fixable and urgent bugs. - completing the dedicated 3.1 feature list (DEVEL is overdue). - cleanups for fixing more intrusive or design bugs (such as this, and the 407 one). - shakedown for release cycle.
We have a few months of shakedown before anything like a good stable. Plenty of time for cleanup not blocking new features. Amos
Re: client_side and comm_close
On Fri, 2008-04-18 at 15:40 +1200, Amos Jeffries wrote: IMO, This should be queued as part of the 3.1 cleanups, but not as urgently as some other issues. The priorities as I see them now are: - immediately fixable and urgent bugs. - completing the dedicated 3.1 feature list (DEVEL is overdue). - cleanups for fixing more intrusive or design bugs (such as this, and the 407 one). - shakedown for release cycle. We have a few months of shakedown before anything like a good stable. Plenty of time for cleanup not blocking new features. I estimate that properly cleaning up the client side would delay v3.1 stable release by 2-3 months unless we suddenly get more experts involved. Is that delay acceptable to you? Others? Thank you, Alex.