Joe Mudd created SPARK-5435:
-------------------------------

             Summary: saveAsNewAPIHadoopDataset is not setting up the local configuration
                 Key: SPARK-5435
                 URL: https://issues.apache.org/jira/browse/SPARK-5435
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 1.2.0
         Environment: Cloudera 5.3.0
            Reporter: Joe Mudd
The HCatOutputFormat utilizes FileOutputFormatContainer, which refers to the MRv1 FileOutputFormat.getUniqueName() method. Since the local configuration has not been set up, getUniqueName() ends up throwing an IllegalArgumentException. It appears the writeShard closure inside saveAsNewAPIHadoopDataset() needs to record Job information in the local Hadoop configuration, similar to HadoopRDD.addLocalConfiguration(). In a test build, I ended up setting both the MRv1 and MRv2 names, since setting just the MRv2 names did not work.

Here's the traceback:

java.lang.IllegalArgumentException: This method can only be called from within a Job
	at org.apache.hadoop.mapred.FileOutputFormat.getUniqueName(FileOutputFormat.java:286)
	at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:101)
	at org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:984)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:965)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
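As a rough sketch of the workaround described above (a hypothetical helper modeled on HadoopRDD.addLocalConfiguration, not the actual Spark patch): before the record writer is obtained in writeShard, the task identity would be recorded in the task-local Hadoop configuration under both the MRv1 and MRv2 property names. A plain Map stands in for org.apache.hadoop.conf.Configuration so the sketch is self-contained; the ID formats follow the standard Hadoop job/task/attempt naming scheme.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalConfSketch {
    // Hypothetical helper: record job/task identity in the task-local
    // configuration so MRv1 code paths such as
    // FileOutputFormat.getUniqueName() can find it.
    static void addLocalConfiguration(String jobTrackerId, int jobId,
                                      int splitId, int attemptId,
                                      Map<String, String> conf) {
        String jobIdStr = String.format("job_%s_%04d", jobTrackerId, jobId);
        String taskIdStr = String.format("task_%s_%04d_m_%06d",
                jobTrackerId, jobId, splitId);
        String attemptIdStr = String.format("attempt_%s_%04d_m_%06d_%d",
                jobTrackerId, jobId, splitId, attemptId);

        // MRv1 names -- these are what the MRv1 code paths read.
        conf.put("mapred.job.id", jobIdStr);
        conf.put("mapred.tip.id", taskIdStr);
        conf.put("mapred.task.id", attemptIdStr);
        conf.put("mapred.task.is.map", "true");
        conf.put("mapred.task.partition", String.valueOf(splitId));

        // MRv2 equivalents -- per the report, setting only these was
        // not sufficient, so both families are set.
        conf.put("mapreduce.job.id", jobIdStr);
        conf.put("mapreduce.task.id", taskIdStr);
        conf.put("mapreduce.task.attempt.id", attemptIdStr);
        conf.put("mapreduce.task.ismap", "true");
        conf.put("mapreduce.task.partition", String.valueOf(splitId));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        addLocalConfiguration("20150127", 3, 0, 0, conf);
        System.out.println(conf.get("mapred.task.id"));
        System.out.println(conf.get("mapreduce.task.attempt.id"));
    }
}
```

With both name families populated, getRecordWriter() would no longer hit the "This method can only be called from within a Job" check, since getUniqueName() can resolve the task attempt ID from the configuration.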