Summary: why does the JVM for my (successful) map task exit before the yourkit agent has written the snapshot?
I'm trying to use YourKit[1] to profile map reduce jobs running in yarn. In a similar exercise running on MRv1, this was straight forward ( -agentpath:/opt/yjp/bin/linux-x86-64/libyjpagent.so=sampling,onexit=snapshot.... ). When I try the same thing in our MRv2 environment [2], the yourkit agent logs suggest that the process is being interrupted before the snapshot can be generated. For example, in this case, an "empty" snapshot is created -- the snapshot file is available, and readable, but does not contain any telemetry data. 15064 67.067: SP: pin java.lang.InterruptedException 15064 67.092: f_t_a_d: #56 trn=17 conv=0/1011 ms=1 15064 67.198: vmDeath: 1 15064 67.198: vmDeath: 2 15064 67.198: vmDeath: 3 15064 67.258: [m: 3927 1163] WSN: starting... <eof> But the problem is clearly a race, as I don't always see the WSN messages start. For instance: 15064 73.234: f_t_a_d: #50 trn=71 conv=0/691 ms=1 15064 74.620: SP: pin java.lang.InterruptedException 15064 74.647: f_t_a_d: #56 trn=17 conv=0/1352 ms=1 15064 74.752: vmDeath: 1 15064 74.752: vmDeath: 2 15064 74.752: vmDeath: 3 <eof> In this case, there was no snapshot generated at all. These two snippets both represent my attempts to run a trivial M/R job hadoop --config ~/hadoop-performance jar ...[3] The yourkit agent reports that it is running as 15064 0.000: Command line: /usr/lib/jvm/java-7-oracle-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Djava.net.preferIPv4Stack=true -Xmx3435973837 -Djava.io.tmpdir=/opt/yarn/nm/usercache/analytics/appcache/application_1426523383307_4394/container_e18_1426523383307_4394_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1426523383307_4394/container_e18_1426523383307_4394_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -agentpath:/opt/yjp/bin/linux-x86-64/libyjpagent.so=sampling,onexit=snapshot,dir=/var/log/hadoop-yarn/yjp/pnq.1550/snapshot,logdir=/var/log/hadoop-yarn/yjp/pnq.1550/logDir org.apache.hadoop.mapred.YarnChild 172.16.39.75 37815 attempt_1426523383307_4394_m_000000_0 2 When we try to profile Spark applications, we find the snapshot as I expected. The yourkit log in that instance looks like 15064 22.900: vmDeath: 1 15064 22.900: vmDeath: 2 15064 22.900: vmDeath: 3 15064 23.019: [m: 8207 959] WSN: starting... 15064 23.680: [m: 8207 960] WSN: tables written 15064 23.680: [m: 8207 960] WSN: heap segment closed 15064 23.681: [m: 8207 960] WSN: map done 15064 23.726: SP: gc #0 28 ms=45 sz=36370 p=20 t=125 15064 23.726: [m: 8207 960] WSN: done 15064 23.727: Snapshot captured: /var/log/hadoop-yarn/yjp/pnq.6666/CoarseGrainedExecutorBackend-2015-08-04-shutdown-3.snapshot 15064 23.728: vmDeath: 4 15064 23.728: vmDeath: OK Having been run as the command: 15064 0.001: Command line: /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms7192m -Xmx7192m -agentpath:/opt/yjp/bin/linux-x86-64/libyjpagent.so=tracing,onexit=snapshot,dir=/var/log/hadoop-yarn/yjp/pnq.6666,logdir=/var/log/hadoop-yarn/yjp/logDir -Djava.io.tmpdir=/opt/yarn/nm/usercache/rshah/appcache/application_1426523383307_4409/container_e18_1426523383307_4409_01_000002/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1426523383307_4409/container_e18_1426523383307_4409_01_000002 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp:// [email protected]:53957/user/CoarseGrainedScheduler 4 beta-hadoop1.oak.vast.com 8 application_1426523383307_4409 I've also verified, by connecting to the yourkit agent while the map task is running, that the telemetry is available. I did find this question in stackoverflow, which describes some parameters that might help with the race during shutdown http://stackoverflow.com/questions/25983999/hadoop-hprof-profiling-no-cpu-samples-written However, using these parameters have not improved my results; in fact, I can't detect any change to the outcome at all -- as far as I can tell, the code path I'm on is not impacted by changing these settings. Has anybody documented this investigation before me? Pointers? Suggestions? Sympathy? I'm not stuck yet, but I've definitely reached the point in this investigation where daily progress is purely hypothetical. Thanks, Danil [1] YourKit Java Profiler 2015 build 15064 [2] 2.5.0-cdh5.3.0 -- I don't have any reason at this point to assume this is specific to this version. [3] ~/hadoop-performance is my copy of our default configuration, with minor(?) changes to specify properties to enable the profiling.
