[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1105 @vrozov Thanks for catching this, I believe you are right. hasClouseoutThread guarantees innerNext and close won't be called concurrently. However, I still believe innerNext and receivingFragmentFinished could be called concurrently, since the ControlMessageHandler thread executes recievingFragmentFinished. Additionally in rare cases where a limit query is cancelled recievingFragmentFinished and close could be called concurrently as well. While reflecting on your comments I also saw another issue where the root could be blocked on a next call but a Finished event would not cause root to terminate. In light of all of this I actually think the **synchronized** is not sufficient. We will have to have some way to interrupt the execution of the root when we received a finish signal and only close out the resources after receivingFragmentFinished has been called. Similarly if we receive a finish signal we should ignore any cancellation requests instead of trying to cancel and finish simultaneously and vice versa. I will rework the solution to address these issues and let you know when I have an update. ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1105 @ilooner @arina-ielchiieva How `innerNext()` and `close()` can execute concurrently? Does not `FragmentExecutor.hasCloseoutThread` ensure that either `close()` is called on the `run()` thread or `run()` exits if the fragment is already cancelled? ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1105 Looks good, @vrozov are you ok with the changes? ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1105 @arina-ielchiieva Applied review comments. ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1105 @vrozov The main issue is that InnerNext and Close should not execute concurrently. Even when we are using AtomicReferences and volatile the following sequence of events could happen which could cause a memory leak: 1. Let's say there are two Threads. The **Close thread** which starts at the beginning of the close method, and the **Next thread** which starts at the beginning of the innerNext method. 1. Now let's say the **Next Thread** runs and checks **ok**. Since close has not been called yet **ok** is true. 1. Now the **Next Thread** is after the ok check. 1. The **Close thread** now starta executing. And the close thread clears the partitioner. 1. Now after the partitioner is cleared the **Next Thread** can resume executing. If the next thread receives an OK_SCHEMA he will allocate a new partitioner. Since the OK_SCHEMA message may include records the partitioner may partition some data as well. 1. Now the **Close thread** is done, but there is a partitioner that has not been closed, and we will leak memory. In order to property solve this problem we need to make sure that the innerNext and close methods are mutually exclusive so the above scenario can never happen. The easiest way to do that is to use the synchronized key word. If we use the synchronized keyword then we don't have to use volatile or atomic reference. Also as a side note using synchronized will probably be more efficient since a cache flush would only be triggered at the start of the innerNext and close method. Alternatively if we used volatile and AtomicReference a cache flush would be triggered every time we accessed the ok and partitioner variables. ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1105 @ilooner What state is protected by `syncrhonized`? Why is it not sufficient to use `volatile` and `AtomicReference`? ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1105 @arina-ielchiieva @vrozov I believe I have a solution. There were several issues with the original code. 1. It made incorrect assumptions about how cache invalidation works with java **synchronized**. 2. It assumed **innerNext** and **close** would be called sequentially. I believe this PR fixes these issues now and I have gone into more detail about the problems below. # 1. Incorrect Cache Invalidation Assumptions The original code was trying to be smart by trying to reduce synchronization overhead on **innerNext**. So the code in **innerNext** did not synchronize before changing the partitioner object since this would be called often. The code in **close()** and ** receivingFragmentFinished()** synchronized before accessing the partitioner with the intention that this would trigger an update of the partitioner variable state across all threads. Unfortunately, this assumption is invalid (see https://stackoverflow.com/questions/22706739/does-synchronized-guarantee-a-thread-will-see-the-latest-value-of-a-non-volatile). Every thread that accesses a variable must synchronize before accessing a variable in order to properly invalidate cached data on a core. For example if **Thread A** modifies **Variable 1** then **Thread B** synchronizes before accessing **Variable 1** then there is no guarantee **Thread B** will see the most updated value for **Variable 1** since it might . ## Solution In summary the right thing to do is the simple thing. Make the methods synchronized. Unfortunately there is no way to outsmart the system and reduce synchronization overhead without causing race conditions. # 2. Concurrent InnerNext and Close Calls The original code did not consider the case that innerNext was in the middle of execution when close was called. It did try to handle the case where **innerNext** could be called after **close** by setting the **ok** variable. But it didn't even do that right because there was no synchronization around the **ok** variable. ## Solution The right thing to do is the simple thing. Make sure the methods are synchronized so close has to wait until innerNext is done before executing. Also when a query is cancelled the executing thread should be interrupted the thread running innerNext incase it is on a blocking call. ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1105 @priteshm @arina-ielchiieva I should have updated this PR earlier this week, here is my update. After reflecting on Arina's comments and reading some more docs about how java implements volatile and synchronization, I think this solution might not fix the original race condition. I need to to more reading to get a better understanding. Additionally I realized there is another race condition where two threads are simultaneously calling close and innerNext which could cause a memory leak. Haven't had a chance to dig further this week, so I will try to wrap this up next week. ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user priteshm commented on the issue: https://github.com/apache/drill/pull/1105 @arina-ielchiieva is this bug ready to commit? ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1105 @sachouche I traced through the code. The updateAggregateStats method is called, which then calls the getOutgoingBatches method of the code generated Partitioners. That method is just a simple getter. So no one is acquiring the same lock. But even if someone else was, the code in the close method is single threaded, and synchronize blocks are reentrant. ---
[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is cancel...
Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1105 @sachouche @arina-ielchiieva ---