[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
dongjoon-hyun commented on code in PR #38397:
URL: https://github.com/apache/spark/pull/38397#discussion_r1007642185

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: ##
@@ -173,6 +181,22 @@ class ParquetFileFormat
     val datetimeRebaseModeInRead = parquetOptions.datetimeRebaseModeInRead
     val int96RebaseModeInRead = parquetOptions.int96RebaseModeInRead
+    // Should always be set by FileSourceScanExec creating this.
+    // Check conf before checking option, to allow working around an issue by changing conf.
+    val returningBatch = sparkSession.sessionState.conf.parquetVectorizedReaderEnabled &&
+      options.get(FileFormat.OPTION_RETURNING_BATCH)
+        .getOrElse {
+          throw new IllegalArgumentException(
+            "OPTION_RETURNING_BATCH should always be set for ParquetFileFormat." +

Review Comment:
Ditto. nit. Add one more space at the end of the message.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
dongjoon-hyun commented on code in PR #38397:
URL: https://github.com/apache/spark/pull/38397#discussion_r1007641451

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala: ##
@@ -126,9 +136,24 @@ class OrcFileFormat
     val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
     val sqlConf = sparkSession.sessionState.conf
-    val enableVectorizedReader = supportBatch(sparkSession, resultSchema)
     val capacity = sqlConf.orcVectorizedReaderBatchSize
+    // Should always be set by FileSourceScanExec creating this.
+    // Check conf before checking option, to allow working around an issue by changing conf.
+    val enableVectorizedReader = sqlConf.orcVectorizedReaderEnabled &&
+      options.get(FileFormat.OPTION_RETURNING_BATCH)
+        .getOrElse {
+          throw new IllegalArgumentException(
+            "OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
+              "To workaround this issue, set spark.sql.orc.enableVectorizedReader=false.")

Review Comment:
Is this a correct recommendation? Why not recommend to set `OPTION_RETURNING_BATCH`?
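For readers following the question above, here is a minimal standalone sketch of why changing the conf can act as a workaround: `&&` short-circuits, so when the conf flag is false the option lookup (and its `throw`) is never evaluated. This is not the PR's code. `ReturningBatchSketch`, `OptionReturningBatch`, its key string, and the `.toBoolean` parsing are assumptions for illustration only; the quoted hunk ends before it shows how the option value is interpreted.

```scala
object ReturningBatchSketch {
  // Hypothetical stand-in for FileFormat.OPTION_RETURNING_BATCH.
  // The real key string is not shown in the quoted diff.
  val OptionReturningBatch: String = "returning_batch"

  // confEnabled plays the role of sqlConf.orcVectorizedReaderEnabled.
  def enableVectorizedReader(confEnabled: Boolean, options: Map[String, String]): Boolean =
    confEnabled &&
      options.get(OptionReturningBatch)
        .getOrElse {
          // Only reached when confEnabled is true and the option is missing.
          throw new IllegalArgumentException(
            "OPTION_RETURNING_BATCH should always be set. " +
              "To work around this issue, disable the vectorized reader conf.")
        }
        .toBoolean // assumption: the option value is a "true"/"false" string
}
```

With the conf disabled, a missing option is never inspected, so no exception is raised; with the conf enabled, a missing option fails fast.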
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
dongjoon-hyun commented on code in PR #38397:
URL: https://github.com/apache/spark/pull/38397#discussion_r1007640461

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala: ##
@@ -126,9 +136,24 @@ class OrcFileFormat
     val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
     val sqlConf = sparkSession.sessionState.conf
-    val enableVectorizedReader = supportBatch(sparkSession, resultSchema)
     val capacity = sqlConf.orcVectorizedReaderBatchSize
+    // Should always be set by FileSourceScanExec creating this.
+    // Check conf before checking option, to allow working around an issue by changing conf.
+    val enableVectorizedReader = sqlConf.orcVectorizedReaderEnabled &&
+      options.get(FileFormat.OPTION_RETURNING_BATCH)
+        .getOrElse {
+          throw new IllegalArgumentException(
+            "OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +

Review Comment:
nit. Add one space at the end?
```
"OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
"OPTION_RETURNING_BATCH should always be set for OrcFileFormat. " +
```
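To illustrate the nit, a small standalone sketch of what the concatenated message looks like with and without the trailing space. The two string literals are taken from the quoted OrcFileFormat hunks in this thread; the object and value names (`MessageSpacing`, `withoutSpace`, `withSpace`) are hypothetical.

```scala
object MessageSpacing {
  // As written in the diff: the two sentences run together after concatenation.
  val withoutSpace: String =
    "OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
      "To workaround this issue, set spark.sql.orc.enableVectorizedReader=false."

  // With the suggested trailing space on the first literal.
  val withSpace: String =
    "OPTION_RETURNING_BATCH should always be set for OrcFileFormat. " +
      "To workaround this issue, set spark.sql.orc.enableVectorizedReader=false."
}
```

Without the space, the message reads `...OrcFileFormat.To workaround...`, which is what the suggestion is meant to avoid.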