I am using Spark 1.5.2 in YARN mode with Hadoop 2.6.0 (cdh5.4.2), and I am
consistently seeing the exception below in the map container logs for Spark
jobs (full stack trace at the end of the message):
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
        at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:304)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:329)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
        at org.apache.hadoop.yarn.conf.YarnConfiguration.<clinit>(YarnConfiguration.java:605)
        at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.newConfiguration(YarnSparkHadoopUtil.scala:61)
This is the code (Shell.checkHadoopHome, per the trace) that throws the
above exception:
String home = System.getProperty("hadoop.home.dir");
// fall back to the system/user-global env variable
if (home == null) {
    home = System.getenv("HADOOP_HOME");
}
try {
    // couldn't find either setting for hadoop's home directory
    if (home == null) {
        throw new IOException("HADOOP_HOME or hadoop.home.dir are not set.");
    }
    // ... (rest of the method elided)
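Note that the lookup runs in Shell's static initializer (the <clinit>
frames in the trace), so whatever sets hadoop.home.dir has to execute
before the first class that touches Shell is loaded. A minimal sketch of
that timing constraint, with a made-up path:

// Illustrative only: the property must be in place before
// org.apache.hadoop.util.Shell is first referenced, because the
// check runs once, during class initialization, not per call.
public class ShellInitTiming {
    public static void main(String[] args) {
        // Effective: set before any Hadoop class is touched
        System.setProperty("hadoop.home.dir", "/usr/lib/hadoop"); // made-up path
        // The first reference to Shell triggers <clinit>, which does the
        // lookup; setting the property after this line would be too late.
        System.out.println("on windows? " + org.apache.hadoop.util.Shell.WINDOWS);
    }
}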
I have the Hadoop home set in multiple places:
- in bin/yarn, as a system property
- in libexec/hadoop-config.sh, as an environment variable
- in conf/spark-env.sh, as an environment variable
However, none of these get passed through to the container JVMs. In fact,
that is the case even with a plain YARN job: I took a simple WordCount
application and added a setup() method with the code below:
String homeDirProp = System.getProperty("hadoop.home.dir");
String homeDirEnv = System.getenv("HADOOP_HOME");
System.out.println("hadoop.home.dir=" + homeDirProp + " HADOOP_HOME=" + homeDirEnv);
and when I check the stdout of the containers, I see this:
hadoop.home.dir=null HADOOP_HOME=null
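For reference, the probe sits in the mapper roughly like this (a sketch;
the job wiring and the actual WordCount logic are omitted):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the diagnostic described above; only the environment
// probe in setup() is shown.
public class DiagnosticMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        String homeDirProp = System.getProperty("hadoop.home.dir");
        String homeDirEnv = System.getenv("HADOOP_HOME");
        // This line ends up in the container's stdout log
        System.out.println("hadoop.home.dir=" + homeDirProp
                + " HADOOP_HOME=" + homeDirEnv);
    }
}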
As it stands, the IOException doesn't immediately fail the job (it appears
to be caught and logged during Shell's class initialization), but I am
trying to track down another issue with determining the proxy IP and want
to rule this out. Interestingly, there doesn't seem to be any way to pass
a system property or an environment variable down to the map/reduce
containers, so there is no direct way to satisfy the Shell class; it
would, however, be possible for some other class to inject the system
property as a workaround before Shell looks it up, as sketched below.
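A helper along these lines (purely hypothetical; the class name and the
path are invented) could do the injection from application code, provided
it runs before anything loads Shell:

// Hypothetical workaround sketch: set hadoop.home.dir early, before
// any code path triggers org.apache.hadoop.util.Shell's static
// initialization.
public final class HadoopHomeShim {
    private HadoopHomeShim() {}

    public static void ensureHadoopHome() {
        if (System.getProperty("hadoop.home.dir") == null
                && System.getenv("HADOOP_HOME") == null) {
            // Invented path; it would have to match the actual Hadoop
            // layout on the NodeManager hosts.
            System.setProperty("hadoop.home.dir", "/usr/lib/hadoop");
        }
    }
}

The catch is ordering: the trace shows Shell being initialized from inside
CoarseGrainedExecutorBackend.main, before any user code runs, so for Spark
executors the hook would have to fire even earlier; for a plain map task,
calling it at the top of setup() might be early enough if nothing has
touched Shell yet.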
Has anyone else seen this issue? Could I be missing something here?
Thank you,
Hari
Full stack trace:
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
        at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:304)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:329)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
        at org.apache.hadoop.yarn.conf.YarnConfiguration.<clinit>(YarnConfiguration.java:605)
        at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.newConfiguration(YarnSparkHadoopUtil.scala:61)
        at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:52)
        at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.<init>(YarnSparkHadoopUtil.scala:46)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at java.lang.Class.newInstance(Class.java:442)
        at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:386)
        at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:384)
        at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:384)
        at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:401)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:149)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:250)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)