On Sep 2, 2015, at 12:45 AM, Daniel D. Daugherty <daniel.daughe...@oracle.com> wrote: > > Greetings, > > I've updated the "fix" for this bug based on code review comments > received in round 1. I've dropped most of the changes from round 1 > with a couple of exceptions. > > JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc() > https://bugs.openjdk.java.net/browse/JDK-8049304 > > Webrev URL: http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/ > > The easiest way to re-review is to download the two patch files > (round 0 and round 2) and view them in your favorite file merge tool: > > http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch > http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/hotspot.patch
------------------------------------------------------------------------------ src/share/vm/runtime/perfMemory.cpp 68 // Only destroy PerfData objects if we're at a safepoint and the 69 // StatSampler is not active. Otherwise, we risk removing PerfData 70 // objects that are currently being used by running JavaThreads 71 // or the StatSampler. This method is invoked while we are not at 72 // a safepoint during a VM abort so leaving the PerfData objects 73 // around may also help diagnose the failure. 74 // 75 if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) { 76 PerfDataManager::destroy(); 77 } I don't think that's the right predicate to use to control the call to PerfDataManager::destroy. We may be at a safepoint because (for example) we're in a GC pause, but there are multiple GC worker threads running that might attempt to touch the PerfData, and we get here because of a VM abort. Another problematic example is the one which got me here in the first place, e.g. vm_exit_during_initialization. At that point we haven't started any Java threads to be brought to a safepoint. That is, I think normal exit implies we're at a safepoint here, but the converse is not true. And what we need to know is whether we're in a VM abort or a normal exit. [And vm_shutdown_during_initialization is an additional case that I don't fully understand. I'm not sure whether that should be treated as a normal exit or a VM abort.] I think a reason the testing of these changes didn't show any of these problem cases is that the has_PerfData check has been added to monitor code that doesn't strictly need it. Based on your analysis, I think the additional protection from PerfData deletion should only be needed in the two FutileWakeups cases. Applying it to other cases imposes an extra performance cost and masks some cases of the wrong test being used to decide whether to delete the PerfData. [See also below regarding the OM_PERFDATA_OP macro.] I think if we test for VM abort context then the StatSampler::is_active() test is also unnecessary. The comment in the old code says that normally the StatSampler is disengaged, but not during VM abort, e.g. that part of the test is to dodge a similar problem that would also be solved by distinguishing between VM abort and normal exit. If trying to distinguish which case we're in based on call stack and passing around an argument proves difficult or overly intrusive, a different approach would be to use an (atomic) flag set when we know we're in VM abort context, and checked in perfMemory_exit. ------------------------------------------------------------------------------ src/share/vm/runtime/objectMonitor.hpp 181 // Only perform a PerfData operation if the PerfData object has been 182 // allocated and if the PerfDataManager has not freed the PerfData 183 // objects which can happen at normal VM shutdown. 184 // 185 #define OM_PERFDATA_OP(f, op_str) \ 186 do { \ 187 if (ObjectMonitor::_sync_ ## f != NULL && \ 188 PerfDataManager::has_PerfData()) { \ 189 ObjectMonitor::_sync_ ## f->op_str; \ 190 } \ 191 } while (0) This part looks good to me. We should consider adding PerfDataManager::has_PerfData checks to all uses of PerfData. This can be done in a followup. --TLDR-- I liked the rev1 version's is_safe argument, though I think the implementation could be simpler with the right predicate being used for calling PerfDataManager::destroy (as discussed above). Note that I agree with all the discussion that led to the elimination of OrderAccess or Atomic usage. So I think it could be written as #define OM_PERFDATA_OP(f, opt_str, is_safe) \ do { \ if (ObjectMonitor::_sync_ ## f != NULL && \ ((is_safe) || PerfDataManager::has_PerfData())) { \ ObjectMonitor::_sync_ ## f->op_str; \ } \ } while (0) Distinguishing the two problem cases eliminates unnecessary overhead from the others, only doing the extra work where it is actually required. Distinguishing the two problem cases would also let us consider a more expensive implementation for them, in order to eliminate the race condition. I think an atomic "counter" can do the trick, but I want to think about it some more and do some more literature seaching. I also think this extra step of eliminating the race could be deferred to a followup (if done at all; see below). I recall there being comments that the "is_safe" name and description were confusing. I don't recall whether that discussion went anywhere before the decision was made to abandon that part. I think Dan's response: When 'is_safe' is true, the calling thread can safely and directly check has_PerfData because we are not at a safepoint so we can't be racing with a normal exit. In a normal exit, has_PerfData is only changed (and memory freed) at a safepoint. When 'is_safe' is false, then the code path we're on could be executing in parallel with a safepoint so we need to be more careful with accessing the has_PerfData flag... contained useful information that wasn't clear from the original description of the argument. If the argument were to be reinstated, the above quoted text should be mined to improve the code comment. All that said, I'm now going to argue the contrary position. 1. The cost of the extra test is pretty small. 2. In combination with the delayed destruction, the extra test seems likely to be effectively sufficient. While there is a theoretical possibility of delayed destruction winning the race and leading to a post-destruction PerfData access failure, the likelyhood of that happening seems quite small, especially since we're at a safepoint if we're going to destroy the PerfData. The extra cost of a solution to the race, both in code complexity and performance, may not be worthwhile. 3. We should consider adding PerfDataManager::has_PerfData checks to all uses of PerfData. That's a nice and simple idiom. Doing analysis similar to what Dan did for the monitor cases would likely be hard in at least some cases, might also be fragile, and probably not worth the effort. Adding the test can be done in a followup. All this is my long-winded way of saying this part of rev2 looks good to me. ------------------------------------------------------------------------------ src/share/vm/runtime/perfData.hpp 671 volatile static jint _has_PerfData; 873 static bool has_PerfData() { return _has_PerfData != 0; } Given that _has_PerfData is being treated as an ordinary boolean, with no atomic operations involved to limit the possible type, shouldn't the type just be bool? Also need to update initialization and assignments. Also, I think the more usual order is "static volatile", both in hotspot and in other code bases I'm familiar with. Presently in hotspot/src there are 94 occurrences of "static volatile", and only one "volatile static". ------------------------------------------------------------------------------ src/share/vm/runtime/objectMonitor.cpp 1752 void ObjectMonitor::notify(TRAPS) { ... 1760 OM_PERFDATA_OP(Notifications, inc(1)); Minor style inconsistency: This uses "inc(1)" where all other places use "inc()". Of course, I expect the generated code to be identical. It just looked odd. ------------------------------------------------------------------------------