[jira] [Updated] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer

[ https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-42439:
    Labels: bug pull-request-available (was: bug)

> Job description in v2 FileWrites can have the wrong committer
> -------------------------------------------------------------
>
>                 Key: SPARK-42439
>                 URL: https://issues.apache.org/jira/browse/SPARK-42439
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.1
>            Reporter: Lorenzo Martini
>            Priority: Minor
>              Labels: bug, pull-request-available
>
> v1 and v2 file writes configure the file writer and the committer in a
> different order:
>
> v1:
> # writer.prepareWrite()
> # committer.setupJob()
>
> v2:
> # committer.setupJob()
> # writer.prepareWrite()
>
> This happens because the `prepareWrite()` call (the one that performs
> `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])`) runs inside
> `createWriteJobDescription`, which is a `lazy val` in `toBatch` and is
> therefore only evaluated after `committer.setupJob()`, at the end of
> `toBatch`.
>
> As a result, the committer can be set up from an incomplete job
> configuration. For example, with the output format class not yet set, the
> committer is configured for a generic write instead of a Parquet write.
>
> The fix is simple: make the `createWriteJobDescription` call non-lazy.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
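The ordering problem described in the issue can be sketched in plain Scala. The object and method names below are illustrative, not Spark's actual v2 write path; the sketch only shows how a `lazy val` defers the initializer's side effects (here, standing in for `prepareWrite()`) until after the code that precedes its first read (standing in for `committer.setupJob()`):

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of the SPARK-42439 ordering bug, not Spark code.
object LazyValOrdering {
  val buggyOrder = ArrayBuffer.empty[String]
  val fixedOrder = ArrayBuffer.empty[String]

  // Mirrors the reported v2 behavior: the lazy description is only
  // forced at the end of the method, after setupJob has already run.
  def buggyToBatch(): Unit = {
    lazy val description: String = {
      buggyOrder += "writer.prepareWrite" // e.g. setOutputFormatClass(...)
      "job description"
    }
    buggyOrder += "committer.setupJob"    // sees an unconfigured job
    description                           // lazy val forced here, too late
    ()
  }

  // The fix: a strict val runs its initializer immediately, so the
  // writer is prepared before the committer is set up (v1 ordering).
  def fixedToBatch(): Unit = {
    val description: String = {
      fixedOrder += "writer.prepareWrite"
      "job description"
    }
    fixedOrder += "committer.setupJob"
    ()
  }

  def main(args: Array[String]): Unit = {
    buggyToBatch()
    fixedToBatch()
    println(s"lazy:   ${buggyOrder.mkString(" -> ")}")
    println(s"strict: ${fixedOrder.mkString(" -> ")}")
  }
}
```

Making the `val` strict changes nothing else about the job description; it only moves the initializer's side effects back to before `setupJob`, which is why the one-word fix restores the v1 ordering.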
[jira] [Updated] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer
[ https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lorenzo Martini updated SPARK-42439:
    Issue Type: Bug (was: Improvement)
[jira] [Updated] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer
[ https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lorenzo Martini updated SPARK-42439:
    Labels: bug (was: )