[jira] Commented: (THRIFT-488) ThreadManager crashing bugs

Rush Manbert (JIRA) Wed, 06 May 2009 13:00:56 -0700

    [ 
https://issues.apache.org/jira/browse/THRIFT-488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706570#action_12706570
 ]


Rush Manbert commented on THRIFT-488:
-------------------------------------

Doh! Can't use shared_ptr because then the ThreadManager::Impl destructor is 
never called, so stop() is never called, so nothing is ever deleted and you run 
out of resources after a while.

I could look into using weak_ptr. I think it would work, but it's going to mask 
a problem. I will atach a new patch file called ThreadManagerAssertPatch.txt. 
If you apply it to an unmodified ThreadManager.cpp and apply the test patches 
from THRIFT-487 you will find that it asserts at line 412 in ThreadManager.cpp. 
If you remove that assert, it will start asserting at line 417, and if you 
remove that assert it will start asserting at line 377.

This is because there is a race which I'll try to describe. I will use line 
numbers from the patched version of ThreadManager.cpp.
1) Manager notifies all idle workers at line 405
2) Workers waiting at line 248 wake up. They are active, but the manager's task 
queue is empty, so they nark themselves as idle and decrement the worker count. 
In our case, the count was one and now it is zero, which is equal to 
workerMaxCount_.
3) Sometime before the worker can add himself to the deadWorkers queue, and 
sometimes before he even atempts to get the workerMonitor_ mutex, the manager 
runs again.
4) The manager tests workerCount_ against workerMaxCount_ at line 413 and finds 
that they are equal, so he does not wait. He goes on to the loop at line 418 
and does nothing because deadWorkers is empty. The thing that hapens the most 
presently is that the manager completes his destructor call, then the worker 
wakes up and tries to get the mutex at line 288 and things go sour.

If we add the weak_ptr, then we can factor it into isActive(), and we can also 
check it outside of the scope that starts at line 287. But that masks the race 
condition, it doesn't fix it. My original patch was intended to fix the race 
conditions, including the strange case where a Worker starts but finds that he 
is not active.

I'll do a version that passes a weak_ptr and post my results.

> ThreadManager crashing bugs
> ---------------------------
>
>                 Key: THRIFT-488
>                 URL: https://issues.apache.org/jira/browse/THRIFT-488
>             Project: Thrift
>          Issue Type: Bug
>          Components: Library (C++)
>    Affects Versions: 0.1
>         Environment: Mac OS X 10.5.6, Xcode
>            Reporter: Rush Manbert
>         Attachments: ThreadManager.cpp.patch, ThreadManagerAssertPatch.txt
>
>
> The ThreadManager::Impl and the ThreadManager::Worker classes work together 
> to execute client threads. There are race conditions between the two classes 
> that can cause Bus Error exceptions in certain cases. In general, these occur 
> when a ThreadManager instance is destructed, so might not be seen in long 
> running programs. They happen frequently enough, though, that looped 
> repetitions of ThreadManagerTests::blockTest() (part of the concurrency_test 
> program) fail quite often.
> These errors are generally not seen with the current version of 
> ThreadManagerTests::blockTest() due to errors in the test itself that cause 
> failures at a far higher frequency. In order to see them, you need to apply 
> the patches that are attached to THRIFT-487 
> (https://issues.apache.org/jira/browse/THRIFT-487).
> Test procedure:
> 1) Apply the patch from THRIFT-487 for the Tests.cpp file.
> 2) Run make in lib/cpp in order to rebuild concurrency_test
> 3) Run concurrency_test with the command line argument "thread-manager" and 
> observe that the test hangs in no time.
> 4) Apply the patch from THRIFT-487 for the ThreadManagerTests.h file.
> 5) Run make in lib/cpp
> 6) Run concurrency_test as before. Observe that now it runs for longer 
> (generally) and usually fails with an assert in Monitor.cpp. This failure is 
> because of one of the bugs in ThreadManager.
> 7) Apply the attached patch file for ThreadManager.cpp
> 8) Run make in lib/cpp
> 9) Run concurrency_test as before. It should just run, and run, and run.
> Note that there is a possible path through the original 
> ThreadManager::Worker::run() method where active never becomes true. In 
> practice, exercising this code path is difficuly. The way that I exercised it 
> was to edit line 322 in the patched version of ThreadManager.cpp. I changed 
> the for statement to read:
> for (size_t ix = 0; ix < value + 1; ix++)
> so that the ThreadManager always created more workers than were needed. That 
> extra worker caused quite a bit of trouble until I moved his handling up to 
> the top of the run() method. I don't understand how this situation could 
> occur in real life, but this code appears to handle it correctly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-488) ThreadManager crashing bugs

Reply via email to