On Wed, 24 Apr 2024 16:04:30 GMT, Serguei Spitsyn <sspit...@openjdk.org> wrote:
> This is a fix of the following JVMTI scalability issue. A closed benchmark
> with millions of virtual threads shows 3X-4X overhead when a JVMTI agent has
> been loaded. For instance, this is observable when an app is executed under
> control of the Oracle Studio `collect` utility.
> For performance analysis, experiments and numbers, please see the comment
> below this description.
>
> The fix is to replace the global counter `_VTMS_transition_count` with the
> mark bit `_VTMS_transition_mark` in each `JavaThread`.
>
> Testing:
>  - Tested with mach5 tiers 1-6

The benchmark takes a little more than 3 seconds without any JVMTI agent:

```
Total: in 3045 ms
```

The benchmark takes about 3.2X as long when executed with the `collect` utility:

```
Creating experiment database test.1.er (Process ID: 25262) ...
Picked up JAVA_TOOL_OPTIONS: -agentlib:collector
Total: in 9864 ms
```

With the fix in place the overhead of a JVMTI agent is around 1.2X:

```
Creating experiment database test.1.er (Process ID: 26442) ...
Picked up JAVA_TOOL_OPTIONS: -agentlib:collector
Total: in 3765 ms
```

Most of the overhead is taken by two functions:
 - `JvmtiVTMSTransitionDisabler::start_VTMS_transition()`
 - `JvmtiVTMSTransitionDisabler::finish_VTMS_transition()`

The Oracle Studio Performance Analyzer `er_print` utility shows the following performance data for these functions:

```
% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::start_VTMS_transition test.1.er
Attr. Total     Name
CPU sec.    %
=============== Callers
  42.930   50.06  SharedRuntime::notify_jvmti_vthread_mount(oopDesc*, unsigned char, JavaThread*)
  21.505   25.08  JvmtiVTMSTransitionDisabler::VTMS_vthread_end(_jobject*)
  21.315   24.86  JvmtiVTMSTransitionDisabler::VTMS_vthread_unmount(_jobject*, bool)
=============== Stack Fragment
  81.407   94.94  JvmtiVTMSTransitionDisabler::start_VTMS_transition(_jobject*, bool)
=============== Callees
   4.083    4.76  java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
   0.140    0.16  __tls_get_addr
   0.120    0.14  JNIHandles::resolve_external_guard(_jobject*)

% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::finish_VTMS_transition test.1.er
Attr. Total     Name
CPU sec.    %
=============== Callers
  47.363   52.59  SharedRuntime::notify_jvmti_vthread_unmount(oopDesc*, unsigned char, JavaThread*)
  21.355   23.71  JvmtiVTMSTransitionDisabler::VTMS_vthread_mount(_jobject*, bool)
  21.345   23.70  JvmtiVTMSTransitionDisabler::VTMS_vthread_start(_jobject*)
=============== Stack Fragment
  64.145   71.22  JvmtiVTMSTransitionDisabler::finish_VTMS_transition(_jobject*, bool)
=============== Callees
  25.288   28.08  java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
   0.240    0.27  __tls_get_addr
   0.200    0.22  JavaThread::set_is_in_VTMS_transition(bool)
   0.190    0.21  JNIHandles::resolve_external_guard(_jobject*)
```

The main source of this overhead (~90% of it) is the atomic increment and decrement of the global counter `_VTMS_transition_count`:
 - `Atomic::inc(&_VTMS_transition_count);`
 - `Atomic::dec(&_VTMS_transition_count);`

The fix is to replace this global counter with mark bits `_VTMS_transition_mark` distributed over all `JavaThread`s.
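For illustration, here is a minimal standalone sketch of the pattern behind the change, not the actual HotSpot sources: only the names `_VTMS_transition_count` and `_VTMS_transition_mark` come from this PR, while `JavaThreadSketch` and the helper functions are simplified assumptions. It contrasts the contended global atomic counter with per-thread mark bits that only their owning thread writes:

```c++
// Standalone sketch of the pattern behind the fix (not HotSpot code).
#include <atomic>
#include <vector>

// Before the fix: every mount/unmount transition performs an atomic
// read-modify-write on one global counter, so all carrier threads
// contend on the same cache line.
static std::atomic<long> _VTMS_transition_count{0};

void start_VTMS_transition_old()  { _VTMS_transition_count.fetch_add(1); }
void finish_VTMS_transition_old() { _VTMS_transition_count.fetch_sub(1); }

bool any_transition_in_progress_old() {
  return _VTMS_transition_count.load() != 0;  // cheap check for the disabler
}

// After the fix: each thread flips only its own mark bit, so the hot
// transition path touches thread-local state and never contends.
struct JavaThreadSketch {                      // stand-in for JavaThread
  std::atomic<bool> _VTMS_transition_mark{false};
};

void start_VTMS_transition_new(JavaThreadSketch* t) {
  t->_VTMS_transition_mark.store(true);
}
void finish_VTMS_transition_new(JavaThreadSketch* t) {
  t->_VTMS_transition_mark.store(false);
}

// The cost moves to the much rarer disabler path, which now scans the
// thread list instead of reading a single counter.
bool any_transition_in_progress_new(const std::vector<JavaThreadSketch*>& threads) {
  for (JavaThreadSketch* t : threads) {
    if (t->_VTMS_transition_mark.load()) return true;
  }
  return false;
}
```

The trade-off is deliberate: with millions of virtual-thread transitions and comparatively rare `JvmtiVTMSTransitionDisabler` activations, moving the expensive step from the per-transition fast path to the disabler's thread-list scan removes the global contention point.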
If these `Atomic::inc()`/`Atomic::dec()` lines are commented out or replaced with the distributed thread-local marks, the main performance overhead is gone:

```
% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::start_VTMS_transition test.2.er
Attr. Total     Name
CPU sec.    %
=============== Callers
   1.801   64.29  SharedRuntime::notify_jvmti_vthread_mount(oopDesc*, unsigned char, JavaThread*)
   0.580   20.71  JvmtiVTMSTransitionDisabler::VTMS_vthread_unmount(_jobject*, bool)
   0.420   15.00  JvmtiVTMSTransitionDisabler::VTMS_vthread_end(_jobject*)
=============== Stack Fragment
   0.630   22.50  JvmtiVTMSTransitionDisabler::start_VTMS_transition(_jobject*, bool)
=============== Callees
   1.931   68.93  java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
   0.220    7.86  __tls_get_addr
   0.020    0.71  JNIHandles::resolve_external_guard(_jobject*)

% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::finish_VTMS_transition test.2.er
Attr. Total     Name
CPU sec.    %
=============== Callers
   1.661   39.15  JvmtiVTMSTransitionDisabler::VTMS_vthread_mount(_jobject*, bool)
   1.351   31.84  JvmtiVTMSTransitionDisabler::VTMS_vthread_start(_jobject*)
   1.231   29.01  SharedRuntime::notify_jvmti_vthread_unmount(oopDesc*, unsigned char, JavaThread*)
=============== Stack Fragment
   0.500   11.79  JvmtiVTMSTransitionDisabler::finish_VTMS_transition(_jobject*, bool)
=============== Callees
   2.972   70.05  java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
   0.350    8.25  JavaThread::set_is_in_VTMS_transition(bool)
   0.340    8.02  __tls_get_addr
   0.080    1.89  JNIHandles::resolve_external_guard(_jobject*)
```

The rest of the overhead (~10% of the total) is taken by calls to the function `java_lang_Thread::set_is_in_VTMS_transition()`. The plan is to address this in a separate fix, but it is expected to be a little bit more tricky.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18937#issuecomment-2075566469