This is a fix for the following JVMTI scalability issue. A closed benchmark with millions of virtual threads shows a 4X overhead when a JVMTI agent has been loaded. For instance, this is observable when an app is executed under control of the Oracle Studio `collect` utility.
The benchmark takes a little more than 3 seconds without any JVMTI agent:

```
Total: in 3045 ms
```

The benchmark takes ~3.2X as long when executed under the `collect` utility:

```
Creating experiment database test.1.er (Process ID: 25262) ...
Picked up JAVA_TOOL_OPTIONS: -agentlib:collector
Total: in 9864 ms
```

With the fix in place the overhead of a JVMTI agent is around 1.2X:

```
Creating experiment database test.1.er (Process ID: 26442) ...
Picked up JAVA_TOOL_OPTIONS: -agentlib:collector
Total: in 3765 ms
```

Most of the overhead is taken by two functions:
- `JvmtiVTMSTransitionDisabler::start_VTMS_transition()`
- `JvmtiVTMSTransitionDisabler::finish_VTMS_transition()`

The Oracle Studio Performance Analyzer `er_print` utility shows the following performance data for these functions:

```
% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::start_VTMS_transition test.1.er
Attr. Total    Name
CPU sec.    %
=============== Callers
42.930   50.06   SharedRuntime::notify_jvmti_vthread_mount(oopDesc*, unsigned char, JavaThread*)
21.505   25.08   JvmtiVTMSTransitionDisabler::VTMS_vthread_end(_jobject*)
21.315   24.86   JvmtiVTMSTransitionDisabler::VTMS_vthread_unmount(_jobject*, bool)
=============== Stack Fragment
81.407   94.94   JvmtiVTMSTransitionDisabler::start_VTMS_transition(_jobject*, bool)
=============== Callees
 4.083    4.76   java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
 0.140    0.16   __tls_get_addr
 0.120    0.14   JNIHandles::resolve_external_guard(_jobject*)

% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::finish_VTMS_transition test.1.er
Attr. Total    Name
CPU sec.    %
=============== Callers
47.363   52.59   SharedRuntime::notify_jvmti_vthread_unmount(oopDesc*, unsigned char, JavaThread*)
21.355   23.71   JvmtiVTMSTransitionDisabler::VTMS_vthread_mount(_jobject*, bool)
21.345   23.70   JvmtiVTMSTransitionDisabler::VTMS_vthread_start(_jobject*)
=============== Stack Fragment
64.145   71.22   JvmtiVTMSTransitionDisabler::finish_VTMS_transition(_jobject*, bool)
=============== Callees
25.288   28.08   java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
 0.240    0.27   __tls_get_addr
 0.200    0.22   JavaThread::set_is_in_VTMS_transition(bool)
 0.190    0.21   JNIHandles::resolve_external_guard(_jobject*)
```

The main source of this overhead (~90% of it) is the atomic increment and decrement of the global counter `_VTMS_transition_count`:
- `Atomic::inc(&_VTMS_transition_count);`
- `Atomic::dec(&_VTMS_transition_count);`

The fix is to replace this global counter with mark bits `_VTMS_transition_mark` distributed over all `JavaThread`s; a sketch of the idea follows.
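For illustration only, here is a minimal standalone C++ sketch of the before/after shapes, not the actual HotSpot code: the `CarrierThread` type, the `all_threads` registry, the function names, and the memory orderings are assumptions made for the example; in the real fix the mark is a field on `JavaThread` and the disabler walks the VM's thread list.

```c++
#include <atomic>
#include <mutex>
#include <vector>

// BEFORE: every virtual thread mount/unmount atomically updates one
// global counter, so all carrier threads contend on a single cache line.
static std::atomic<long> VTMS_transition_count{0};

void start_VTMS_transition_global()  { VTMS_transition_count.fetch_add(1); }  // Atomic::inc
void finish_VTMS_transition_global() { VTMS_transition_count.fetch_sub(1); }  // Atomic::dec

// AFTER: each thread flips a private mark bit, so the hot path touches
// only its own cache line. (CarrierThread and the registry below are
// hypothetical stand-ins for JavaThread and HotSpot's thread list.)
struct CarrierThread {
  std::atomic<bool> VTMS_transition_mark{false};
};

static std::vector<CarrierThread*> all_threads;   // hypothetical registry
static std::mutex                  registry_lock;

void start_VTMS_transition_local(CarrierThread* self) {
  self->VTMS_transition_mark.store(true, std::memory_order_release);
}
void finish_VTMS_transition_local(CarrierThread* self) {
  self->VTMS_transition_mark.store(false, std::memory_order_release);
}

// The disabler side now pays the aggregation cost: instead of reading one
// counter, it scans all threads for a set mark (the real code must also
// handle races with marks being set concurrently while it waits).
bool any_VTMS_transition_in_progress() {
  std::lock_guard<std::mutex> guard(registry_lock);
  for (CarrierThread* t : all_threads) {
    if (t->VTMS_transition_mark.load(std::memory_order_acquire)) {
      return true;
    }
  }
  return false;
}
```

The trade-off is deliberate: mount/unmount happens millions of times in this benchmark while a `JvmtiVTMSTransitionDisabler` is comparatively rare, so moving the O(number of threads) scan to the disabler side removes the contended atomics from the hot path.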
If these atomic operations are commented out or replaced with the distributed thread-local marks, the main performance overhead is gone:

```
% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::start_VTMS_transition test.2.er
Attr. Total    Name
CPU sec.    %
============== Callers
 1.801   64.29   SharedRuntime::notify_jvmti_vthread_mount(oopDesc*, unsigned char, JavaThread*)
 0.580   20.71   JvmtiVTMSTransitionDisabler::VTMS_vthread_unmount(_jobject*, bool)
 0.420   15.00   JvmtiVTMSTransitionDisabler::VTMS_vthread_end(_jobject*)
============== Stack Fragment
 0.630   22.50   JvmtiVTMSTransitionDisabler::start_VTMS_transition(_jobject*, bool)
============== Callees
 1.931   68.93   java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
 0.220    7.86   __tls_get_addr
 0.020    0.71   JNIHandles::resolve_external_guard(_jobject*)

% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::finish_VTMS_transition test.2.er
Attr. Total    Name
CPU sec.    %
============== Callers
 1.661   39.15   JvmtiVTMSTransitionDisabler::VTMS_vthread_mount(_jobject*, bool)
 1.351   31.84   JvmtiVTMSTransitionDisabler::VTMS_vthread_start(_jobject*)
 1.231   29.01   SharedRuntime::notify_jvmti_vthread_unmount(oopDesc*, unsigned char, JavaThread*)
============== Stack Fragment
 0.500   11.79   JvmtiVTMSTransitionDisabler::finish_VTMS_transition(_jobject*, bool)
============== Callees
 2.972   70.05   java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
 0.350    8.25   JavaThread::set_is_in_VTMS_transition(bool)
 0.340    8.02   __tls_get_addr
 0.080    1.89   JNIHandles::resolve_external_guard(_jobject*)
```

The rest of the overhead (~10% of the total) is taken by calls to the function `java_lang_Thread::set_is_in_VTMS_transition()`. The plan is to address that in a separate fix, but it is expected to be a little bit more tricky.

Testing:
- Tested with mach5 tiers 1-6

-------------

Commit messages:
 - 8330969: scalability issue with loaded JVMTI agent

Changes: https://git.openjdk.org/jdk/pull/18937/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18937&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8330969
  Stats: 38 lines in 5 files changed: 14 ins; 11 del; 13 mod
  Patch: https://git.openjdk.org/jdk/pull/18937.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18937/head:pull/18937

PR: https://git.openjdk.org/jdk/pull/18937