Hi everyone,

I used to launch EC2 clusters with the Spark scripts running Hadoop 1. I recently changed this and launched a new cluster with the Hadoop major version set to 2:
    spark-ec2 <args> --hadoop-major-version=2 <more-args>

On the old cluster, I would start persistent-hdfs and migrate data from S3 with distcp:

    persistent-hdfs/bin/hadoop distcp <src> <dst>

However, when I do the same thing on the new cluster, I get an error:

    /root/persistent-hdfs/sbin/start-all.sh
    /root/persistent-hdfs/bin/hadoop distcp <src> <dst>

    2013-12-06 20:38:44,808 INFO mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid "mapreduce.jobtracker.address" configuration value for LocalJobRunner : "ec2-54-193-48-31.us-west-1.compute.amazonaws.com:9001"
    2013-12-06 20:38:44,809 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
    java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

I'm wondering how the cluster is configured differently when Hadoop major version 2 is passed to the EC2 scripts, and why distcp no longer works.
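From the error, it looks like the client is falling back to the LocalJobRunner: in Hadoop 2, org.apache.hadoop.mapred.LocalClientProtocolProvider is only tried when mapreduce.framework.name is unset or "local", and it rejects any mapreduce.jobtracker.address other than "local". So my guess (unverified; I haven't checked what spark-ec2 actually writes) is that the new cluster's mapred-site.xml looks something like this:

    <?xml version="1.0"?>
    <!-- Hypothetical /root/persistent-hdfs/conf/mapred-site.xml on the new
         cluster; the property values below are my assumption, not verified. -->
    <configuration>
      <!-- A host:port here is invalid for the LocalJobRunner, which is what
           gets tried when mapreduce.framework.name is left unset. -->
      <property>
        <name>mapreduce.jobtracker.address</name>
        <value>ec2-54-193-48-31.us-west-1.compute.amazonaws.com:9001</value>
      </property>
      <!-- Presumably missing: a mapreduce.framework.name of "classic" or
           "yarn" that would select a non-local ClientProtocolProvider. -->
    </configuration>

If that's right, distcp fails because no ClientProtocolProvider can initialize a Cluster against this configuration, which would match the IOException above.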
Thanks!

-Matt Cheah