I have a spark job that creates 6 million rows in RDDs. I convert the RDD
into Data-frame and write it to HDFS. Currently it takes 3 minutes to write
it to HDFS.
I am using spark 1.5.1 with YARN.
Here is the snippet:-
RDDList.parallelStream().forEach(mapJavaRDD -> {
if (mapJavaRDD != null) {
JavaRDD<Row> rowRDD =
mapJavaRDD.mapPartitionsWithIndex((integer, v2) -> {
<logical operation>
return new ArrayList<Row>(1).iterator();
}, false);
DataFrame dF = sqlContext.createDataFrame(rowRDD,
schema).coalesce(3);
synchronized (finalLock) {
dF.write().mode(SaveMode.Append).parquet("hdfs
location");
}
});
After looking into the logs I know the following is the reason for the job
taking too long:-
dF.write().mode(SaveMode.Append).parquet("hdfs
location");
I also get the following errors due to it:-
15/10/21 21:12:30 WARN scheduler.TaskSetManager: Stage 31 contains a task of
very large size (378 KB). The maximum recommended task size is 100 KB.4 of
these kind of warnings appeared.
java.lang.IllegalArgumentException: java.lang.IllegalArgumentException:
spark.sql.execution.id is already set
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Improve-parquet-write-speed-to-HDFS-and-spark-sql-execution-id-is-already-set-ERROR-tp25184.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]