Hi all,

We're doing some simple column transformations (e.g. trimming strings) on a DataFrame using UDFs. The DataFrame is in Avro format and is loaded from HDFS. The job has about 16,000 partitions/tasks.
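For context, the transformation itself is trivial. A minimal sketch of the kind of thing we're doing (the column name and paths are made up for illustration; the Spark wiring is shown in comments since it only runs inside the shell):

```scala
// The UDF body is a plain string-trimming function (null-safe).
val trimUdfBody: String => String = s => if (s == null) null else s.trim

// Inside the Spark 1.6 shell this gets wired up roughly as:
//   import org.apache.spark.sql.functions.udf
//   val trimUdf = udf(trimUdfBody)
//   val df = sqlContext.read.format("com.databricks.spark.avro").load("/in/path")
//   df.withColumn("mycol", trimUdf(df("mycol"))).write.parquet("/my/path")

println(trimUdfBody("  hello  "))  // prints "hello"
```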
About half way through, the job fails with:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 6843 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

This seems strange, as we are saving the DataFrame using the df.write.parquet("/my/path") API and never collecting anything to the driver.

As far as I understand, the code that runs this check, https://github.com/apache/spark/blob/7a375bb87a8df56d9dde0c484e725e5c497a9876/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L595, is only used in one method, https://github.com/apache/spark/blob/7a375bb87a8df56d9dde0c484e725e5c497a9876/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L47, in TaskResultGetter. So it looks like even though we are not doing a .collect in the driver, it is still returning metadata about the successful tasks. I would assume each result comes back as a DirectTaskResult with BoxedUnit as the return type, which should be a tiny data structure. However, that doesn't seem to be the case: 1024 MB across 6843 tasks works out to roughly 150 KB per task.

This code is being run from the Spark 1.6 shell. Any ideas what could be happening here?

Cheers,
~N
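For now we can keep the job alive by raising the cap when launching the shell (spark.driver.maxResultSize is the documented setting being tripped above; the 4g value below is just an example, and 0 disables the check entirely), though that only hides whatever is making the per-task results so large:

```shell
# Launch the Spark 1.6 shell with a larger driver result-size cap.
# 4g is an arbitrary example value; 0 would remove the limit altogether.
spark-shell --conf spark.driver.maxResultSize=4g
```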