Hi,
I am trying to set something up to automatically profile my Crunch jobs
on a Hadoop cluster.
I have been a long-time user of hprof & "mapred.task.profile" because it
is so easy to use on Hadoop. However, I am now moving away from it:
- it will be removed in Java 9
- it suffers from safepoint bias
- it cannot profile native code
- gathering metrics other than stack trace samples can be useful
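For context, "easy to use" means hprof can be enabled purely through job
properties, with no code changes. A rough sketch of that setup (property
names and hprof options from memory, so treat them as approximate):

```shell
# Enable hprof on a few sample tasks via job configuration only;
# %s is expanded by Hadoop to a per-task output file.
hadoop jar my-crunch-job.jar com.example.MyPipeline \
  -Dmapred.task.profile=true \
  -Dmapred.task.profile.maps=0-1 \
  -Dmapred.task.profile.reduces=0-1 \
  -Dmapred.task.profile.params="-agentlib:hprof=cpu=samples,depth=20,force=n,thread=y,verbose=n,file=%s"
```

Flight Recorder and perf have no equivalent property-driven hook, hence
the options below.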
I would like to replace hprof with Flight Recorder and/or perf. Unlike
hprof, both need to be started and stopped programmatically since there
is no glue for them in Hadoop. I can see three options:
1. Hack the app
It can be done using DoFn.initialize/cleanup: either all DoFns invoke
the same idempotent code, or dedicated DoFns are inserted at specific
points. Both seem horrific and disgusting :)
2. Java agent
Profiling is not tied to Crunch, so any tool can be profiled. The main
drawbacks are that the agent must be deployed on all the nodes and that
it does not have easy access to metadata like user, job name, stage, etc.
A good example of such agent is statsd-jvm-profiler, see
https://github.com/etsy/statsd-jvm-profiler. They even have a small
bridge to push Cascading metadata to the agent, see
https://github.com/etsy/statsd-jvm-profiler/blob/master/example/StatsDProfilerFlowListener.scala.
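The skeleton of such an agent is small; the sketch below only toggles a
flag where a real agent would start and stop the profiler, and the class
name and agent arguments are invented for illustration (the jar would
also need a Premain-Class manifest entry):

```java
import java.lang.instrument.Instrumentation;

/**
 * Hypothetical agent for option 2, attached to every task JVM with
 * e.g. -javaagent:profiler-agent.jar=dumpFile=/tmp/task.jfr
 */
final class ProfilerAgent {
    static volatile boolean running = false;

    public static void premain(String args, Instrumentation inst) {
        // args can carry agent options such as a dump path, but note the
        // drawback from above: no job metadata (user, job name, stage)
        // is available here.
        running = true; // stand-in for: start JFR / fork "perf record"
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            running = false; // stand-in for: stop and dump the recording
        }));
    }
}
```

Deploying this means shipping the jar to every node and wiring the
-javaagent flag into the task JVM options, which is exactly the
operational cost mentioned above.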
3. Dedicated Crunch API
Some code needs to be executed on JVM startup / shutdown. AFAIK this is
not currently possible but could be added (however, I am not sure how to
implement it on Spark). Unlike a javaagent, it does not require
deploying anything on the nodes, metadata can be pushed to the services
(i.e. ctx), and it is more flexible.
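To sketch what I have in mind: a per-task-JVM lifecycle listener that
the pipeline could register. None of these names exist in Crunch today;
this is only the shape such an API could take:

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical option 3 API: a hook Crunch would invoke once per task
 *  JVM, with job metadata pushed in by the framework. */
interface TaskJvmListener {
    /** Called once when the task JVM starts, before any DoFn runs. */
    void jvmStarted(Map<String, String> metadata);

    /** Called once before the task JVM exits. */
    void jvmStopped();
}

/** Example listener that would start/stop a profiler, tagging the
 *  recording with the metadata a plain javaagent cannot see. */
class ProfilingListener implements TaskJvmListener {
    String recordingName;
    boolean profiling;

    @Override public void jvmStarted(Map<String, String> metadata) {
        recordingName = metadata.getOrDefault("job.name", "unknown");
        profiling = true;  // stand-in for: start JFR / perf here
    }

    @Override public void jvmStopped() {
        profiling = false; // stand-in for: stop and ship the recording
    }
}
```

The listener would ship inside the job jar like any DoFn, which is why
nothing extra needs to be deployed on the nodes.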
I believe that allowing users to easily run code at JVM startup /
shutdown would be a useful improvement. Any opinions?
Clément MATHIEU