[jira] [Commented] (TS-947) AIO Race condition on non NT systems
[ https://issues.apache.org/jira/browse/TS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758195#comment-13758195 ] Leif Hedstrom commented on TS-947: -- Can this be closed ? It sounds like it can ? AIO Race condition on non NT systems Key: TS-947 URL: https://issues.apache.org/jira/browse/TS-947 Project: Traffic Server Issue Type: Bug Components: Core Environment: stock build with static libts, running on a 4 core server Reporter: B Wyatt Assignee: John Plevyak Fix For: 4.2.0 Attachments: lock-safe-AIO.patch, timed-wait-AIO.patch Refer to code below. The timeslice starting when a consumer thread determines that the temp_list is empty (A) and ending when it releases the aio_mutex(C) is unsafe if the work queues are empty and it breaks loop execution at B. During this timeslice (A-C) the consumer holds the aio_mutex and as a result request producers enqueue items on the temporary atomic list (D). As a consumer in this state will wait for a signal on aio_cond to proceed before processing the temp_list again, any requests on the temp_list are effectively stalled until a future request produces this signal or manually processes the temp_list. In the case of cache volume initialization, there is no future request and the initialization sequence soft locks. {code:title=iocore/aio/AIO.cc(annotated)} void * aio_thread_main(void *arg) { ... ink_mutex_acquire(my_aio_req-aio_mutex); for (;;) { do { current_req = my_aio_req; /* check if any pending requests on the atomic list */ A if (!INK_ATOMICLIST_EMPTY(my_aio_req-aio_temp_list)) aio_move(my_aio_req); if (!(op = my_aio_req-aio_todo.pop()) !(op = my_aio_req-http_aio_todo.pop())) Bbreak; ... service request ... } while (1); Cink_cond_wait(my_aio_req-aio_cond, my_aio_req-aio_mutex); } ... } static void aio_queue_req(AIOCallbackInternal *op, int fromAPI = 0) { ... if (!ink_mutex_try_acquire(req-aio_mutex)) { Dink_atomiclist_push(req-aio_temp_list, op); } else { /* check if any pending requests on the atomic list */ if (!INK_ATOMICLIST_EMPTY(req-aio_temp_list)) aio_move(req); /* now put the new request */ aio_insert(op, req); ink_cond_signal(req-aio_cond); ink_mutex_release(req-aio_mutex); } ... } {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TS-947) AIO Race condition on non NT systems
[ https://issues.apache.org/jira/browse/TS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758339#comment-13758339 ] John Plevyak commented on TS-947: - Yes, this has been fixed. john AIO Race condition on non NT systems Key: TS-947 URL: https://issues.apache.org/jira/browse/TS-947 Project: Traffic Server Issue Type: Bug Components: Core Environment: stock build with static libts, running on a 4 core server Reporter: B Wyatt Assignee: John Plevyak Fix For: 4.2.0 Attachments: lock-safe-AIO.patch, timed-wait-AIO.patch Refer to code below. The timeslice starting when a consumer thread determines that the temp_list is empty (A) and ending when it releases the aio_mutex(C) is unsafe if the work queues are empty and it breaks loop execution at B. During this timeslice (A-C) the consumer holds the aio_mutex and as a result request producers enqueue items on the temporary atomic list (D). As a consumer in this state will wait for a signal on aio_cond to proceed before processing the temp_list again, any requests on the temp_list are effectively stalled until a future request produces this signal or manually processes the temp_list. In the case of cache volume initialization, there is no future request and the initialization sequence soft locks. {code:title=iocore/aio/AIO.cc(annotated)} void * aio_thread_main(void *arg) { ... ink_mutex_acquire(my_aio_req-aio_mutex); for (;;) { do { current_req = my_aio_req; /* check if any pending requests on the atomic list */ A if (!INK_ATOMICLIST_EMPTY(my_aio_req-aio_temp_list)) aio_move(my_aio_req); if (!(op = my_aio_req-aio_todo.pop()) !(op = my_aio_req-http_aio_todo.pop())) Bbreak; ... service request ... } while (1); Cink_cond_wait(my_aio_req-aio_cond, my_aio_req-aio_mutex); } ... } static void aio_queue_req(AIOCallbackInternal *op, int fromAPI = 0) { ... if (!ink_mutex_try_acquire(req-aio_mutex)) { Dink_atomiclist_push(req-aio_temp_list, op); } else { /* check if any pending requests on the atomic list */ if (!INK_ATOMICLIST_EMPTY(req-aio_temp_list)) aio_move(req); /* now put the new request */ aio_insert(op, req); ink_cond_signal(req-aio_cond); ink_mutex_release(req-aio_mutex); } ... } {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TS-947) AIO Race condition on non NT systems
[ https://issues.apache.org/jira/browse/TS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13131692#comment-13131692 ] Alan M. Carroll commented on TS-947: The timed wait fix was put on trunk, which ameliorates this problem. However, the mailing list consensus was that this is a hack and the queuing logic should be fixed to be correct. John Pleyvack has volunteered to take lead on that although he welcomes contributions from others. AIO Race condition on non NT systems Key: TS-947 URL: https://issues.apache.org/jira/browse/TS-947 Project: Traffic Server Issue Type: Bug Components: Core Environment: stock build with static libts, running on a 4 core server Reporter: B Wyatt Assignee: Alan M. Carroll Fix For: 3.1.2 Attachments: lock-safe-AIO.patch Refer to code below. The timeslice starting when a consumer thread determines that the temp_list is empty (A) and ending when it releases the aio_mutex(C) is unsafe if the work queues are empty and it breaks loop execution at B. During this timeslice (A-C) the consumer holds the aio_mutex and as a result request producers enqueue items on the temporary atomic list (D). As a consumer in this state will wait for a signal on aio_cond to proceed before processing the temp_list again, any requests on the temp_list are effectively stalled until a future request produces this signal or manually processes the temp_list. In the case of cache volume initialization, there is no future request and the initialization sequence soft locks. {code:title=iocore/aio/AIO.cc(annotated)} void * aio_thread_main(void *arg) { ... ink_mutex_acquire(my_aio_req-aio_mutex); for (;;) { do { current_req = my_aio_req; /* check if any pending requests on the atomic list */ A if (!INK_ATOMICLIST_EMPTY(my_aio_req-aio_temp_list)) aio_move(my_aio_req); if (!(op = my_aio_req-aio_todo.pop()) !(op = my_aio_req-http_aio_todo.pop())) Bbreak; ... service request ... } while (1); Cink_cond_wait(my_aio_req-aio_cond, my_aio_req-aio_mutex); } ... } static void aio_queue_req(AIOCallbackInternal *op, int fromAPI = 0) { ... if (!ink_mutex_try_acquire(req-aio_mutex)) { Dink_atomiclist_push(req-aio_temp_list, op); } else { /* check if any pending requests on the atomic list */ if (!INK_ATOMICLIST_EMPTY(req-aio_temp_list)) aio_move(req); /* now put the new request */ aio_insert(op, req); ink_cond_signal(req-aio_cond); ink_mutex_release(req-aio_mutex); } ... } {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TS-947) AIO Race condition on non NT systems
[ https://issues.apache.org/jira/browse/TS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13131804#comment-13131804 ] Alan M. Carroll commented on TS-947: [Mailing list discussion|http://mail-archives.apache.org/mod_mbox/trafficserver-dev/201109.mbox/%3ccamkwmbeeu8uqa+ymaqvsoqgy0hajjbuevdex17eqjbog9v7...@mail.gmail.com%3E] Key quote by John Pleyvak {quote}So the situation is that the existing design dates from a time when threads and scheduling for Unix were primitive. It was not uncommon in those days for threads to go away for half a second, a second or more in a loaded system. To deal with this, the current design is non-blocking all the way down. This means that no code (correctly written) ever blocks on a mutex (we use try_lock) which is the origin of that AIO code. Now, threading and scheduling has dramatically improved and the problems that motivated that design decision have (most probably :) gone away. It would be dramatically easier to do away with this requirement and write the code with small critical sections and just block if we miss a mutex. First, that is the way most programs are written these days and second, having to back out of the entire stack, saving context is a huge burden. It makes the code hard to read, write and debug. The upshot is that my suggestion to just add a 10 msec timeout is for the short term. I'd like to see a blocking, small critical section AIO/Cache/RamCache patch performance tested against the existing code *then* pull the trigger on such a new design and fix the offending code (including the AIO race). I plan to work on such a patch and would be very interested in talking to anyone who might be interested in such a thing as well. {quote} AIO Race condition on non NT systems Key: TS-947 URL: https://issues.apache.org/jira/browse/TS-947 Project: Traffic Server Issue Type: Bug Components: Core Environment: stock build with static libts, running on a 4 core server Reporter: B Wyatt Assignee: John Plevyak Fix For: 3.1.2 Attachments: lock-safe-AIO.patch Refer to code below. The timeslice starting when a consumer thread determines that the temp_list is empty (A) and ending when it releases the aio_mutex(C) is unsafe if the work queues are empty and it breaks loop execution at B. During this timeslice (A-C) the consumer holds the aio_mutex and as a result request producers enqueue items on the temporary atomic list (D). As a consumer in this state will wait for a signal on aio_cond to proceed before processing the temp_list again, any requests on the temp_list are effectively stalled until a future request produces this signal or manually processes the temp_list. In the case of cache volume initialization, there is no future request and the initialization sequence soft locks. {code:title=iocore/aio/AIO.cc(annotated)} void * aio_thread_main(void *arg) { ... ink_mutex_acquire(my_aio_req-aio_mutex); for (;;) { do { current_req = my_aio_req; /* check if any pending requests on the atomic list */ A if (!INK_ATOMICLIST_EMPTY(my_aio_req-aio_temp_list)) aio_move(my_aio_req); if (!(op = my_aio_req-aio_todo.pop()) !(op = my_aio_req-http_aio_todo.pop())) Bbreak; ... service request ... } while (1); Cink_cond_wait(my_aio_req-aio_cond, my_aio_req-aio_mutex); } ... } static void aio_queue_req(AIOCallbackInternal *op, int fromAPI = 0) { ... if (!ink_mutex_try_acquire(req-aio_mutex)) { Dink_atomiclist_push(req-aio_temp_list, op); } else { /* check if any pending requests on the atomic list */ if (!INK_ATOMICLIST_EMPTY(req-aio_temp_list)) aio_move(req); /* now put the new request */ aio_insert(op, req); ink_cond_signal(req-aio_cond); ink_mutex_release(req-aio_mutex); } ... } {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TS-947) AIO Race condition on non NT systems
[ https://issues.apache.org/jira/browse/TS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13107719#comment-13107719 ] mohan_zl commented on TS-947: - One question: The variable sleep_wait in struct AIOThreadInfo is not used now, why not nuke it away or use it for Jplevyak's patch, looks like: timespec ten_msec_timespec = ink_based_hrtime_to_timespec(ink_get_hrtime() + HRTIME_MSECONDS(thr_info-sleep_wait)); AIO Race condition on non NT systems Key: TS-947 URL: https://issues.apache.org/jira/browse/TS-947 Project: Traffic Server Issue Type: Bug Components: Core Environment: stock build with static libts, running on a 4 core server Reporter: B Wyatt Fix For: 3.1.1 Attachments: lock-safe-AIO.patch Refer to code below. The timeslice starting when a consumer thread determines that the temp_list is empty (A) and ending when it releases the aio_mutex(C) is unsafe if the work queues are empty and it breaks loop execution at B. During this timeslice (A-C) the consumer holds the aio_mutex and as a result request producers enqueue items on the temporary atomic list (D). As a consumer in this state will wait for a signal on aio_cond to proceed before processing the temp_list again, any requests on the temp_list are effectively stalled until a future request produces this signal or manually processes the temp_list. In the case of cache volume initialization, there is no future request and the initialization sequence soft locks. {code:title=iocore/aio/AIO.cc(annotated)} void * aio_thread_main(void *arg) { ... ink_mutex_acquire(my_aio_req-aio_mutex); for (;;) { do { current_req = my_aio_req; /* check if any pending requests on the atomic list */ A if (!INK_ATOMICLIST_EMPTY(my_aio_req-aio_temp_list)) aio_move(my_aio_req); if (!(op = my_aio_req-aio_todo.pop()) !(op = my_aio_req-http_aio_todo.pop())) Bbreak; ... service request ... } while (1); Cink_cond_wait(my_aio_req-aio_cond, my_aio_req-aio_mutex); } ... } static void aio_queue_req(AIOCallbackInternal *op, int fromAPI = 0) { ... if (!ink_mutex_try_acquire(req-aio_mutex)) { Dink_atomiclist_push(req-aio_temp_list, op); } else { /* check if any pending requests on the atomic list */ if (!INK_ATOMICLIST_EMPTY(req-aio_temp_list)) aio_move(req); /* now put the new request */ aio_insert(op, req); ink_cond_signal(req-aio_cond); ink_mutex_release(req-aio_mutex); } ... } {code} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TS-947) AIO Race condition on non NT systems
[ https://issues.apache.org/jira/browse/TS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13099273#comment-13099273 ] B Wyatt commented on TS-947: I am testing a patch that removes the use of the atomic temp_list entirely. Instead it relies on acquiring the aio_mutex for every request enqueue. This may be sub-optimal, but it is preferable to occasional soft-locking on initialization or request delays. I have attached the patch. AIO Race condition on non NT systems Key: TS-947 URL: https://issues.apache.org/jira/browse/TS-947 Project: Traffic Server Issue Type: Bug Components: Core Environment: stock build with static libts, running on a 4 core server Reporter: B Wyatt Refer to code below. The timeslice starting when a consumer thread determines that the temp_list is empty (A) and ending when it releases the aio_mutex(C) is unsafe if the work queues are empty and it breaks loop execution at B. During this timeslice (A-C) the consumer holds the aio_mutex and as a result request producers enqueue items on the temporary atomic list (D). As a consumer in this state will wait for a signal on aio_cond to proceed before processing the temp_list again, any requests on the temp_list are effectively stalled until a future request produces this signal or manually processes the temp_list. In the case of cache volume initialization, there is no future request and the initialization sequence soft locks. {code:title=iocore/aio/AIO.cc(annotated)} void * aio_thread_main(void *arg) { ... ink_mutex_acquire(my_aio_req-aio_mutex); for (;;) { do { current_req = my_aio_req; /* check if any pending requests on the atomic list */ A if (!INK_ATOMICLIST_EMPTY(my_aio_req-aio_temp_list)) aio_move(my_aio_req); if (!(op = my_aio_req-aio_todo.pop()) !(op = my_aio_req-http_aio_todo.pop())) Bbreak; ... service request ... } while (1); Cink_cond_wait(my_aio_req-aio_cond, my_aio_req-aio_mutex); } ... } static void aio_queue_req(AIOCallbackInternal *op, int fromAPI = 0) { ... if (!ink_mutex_try_acquire(req-aio_mutex)) { Dink_atomiclist_push(req-aio_temp_list, op); } else { /* check if any pending requests on the atomic list */ if (!INK_ATOMICLIST_EMPTY(req-aio_temp_list)) aio_move(req); /* now put the new request */ aio_insert(op, req); ink_cond_signal(req-aio_cond); ink_mutex_release(req-aio_mutex); } ... } {code} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira