[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1095704158

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:

```diff
@@ -49,24 +49,28 @@ import org.apache.spark.sql.types._
  * @param caseSensitive Whether use case sensitive analysis when comparing Spark catalyst read
  *                      schema with Parquet schema.
  * @param inferTimestampNTZ Whether TimestampNTZType type is enabled.
+ * @param nanosAsLong Whether timestamps with nanos are converted to long.
  */
 class ParquetToSparkSchemaConverter(
     assumeBinaryIsString: Boolean = SQLConf.PARQUET_BINARY_AS_STRING.defaultValue.get,
     assumeInt96IsTimestamp: Boolean = SQLConf.PARQUET_INT96_AS_TIMESTAMP.defaultValue.get,
     caseSensitive: Boolean = SQLConf.CASE_SENSITIVE.defaultValue.get,
     inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get) {
+    nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {
```

Review Comment:
```suggestion
    inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get,
    nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {
```

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:

```diff
@@ -65,12 +68,14 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
       caseSensitive: Boolean = false,
       inferTimestampNTZ: Boolean = true,
       sparkReadSchema: Option[StructType] = None,
-      expectedParquetColumn: Option[ParquetColumn] = None): Unit = {
+      expectedParquetColumn: Option[ParquetColumn] = None,
+      nanosAsLong: Boolean = false): Unit = {
     val converter = new ParquetToSparkSchemaConverter(
       assumeBinaryIsString = binaryAsString,
       assumeInt96IsTimestamp = int96AsTimestamp,
       caseSensitive = caseSensitive,
       inferTimestampNTZ = inferTimestampNTZ)
+      nanosAsLong = nanosAsLong)
```

Review Comment:
```suggestion
      inferTimestampNTZ = inferTimestampNTZ,
      nanosAsLong = nanosAsLong)
```

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:

```diff
@@ -454,13 +459,15 @@ object ParquetFileFormat extends Logging {
     val assumeBinaryIsString = sparkSession.sessionState.conf.isParquetBinaryAsString
     val assumeInt96IsTimestamp = sparkSession.sessionState.conf.isParquetINT96AsTimestamp
     val inferTimestampNTZ = sparkSession.sessionState.conf.parquetInferTimestampNTZEnabled
+    val nanosAsLong = sparkSession.sessionState.conf.legacyParquetNanosAsLong
     val reader = (files: Seq[FileStatus], conf: Configuration, ignoreCorruptFiles: Boolean) => {
       // Converter used to convert Parquet `MessageType` to Spark SQL `StructType`
       val converter = new ParquetToSparkSchemaConverter(
         assumeBinaryIsString = assumeBinaryIsString,
         assumeInt96IsTimestamp = assumeInt96IsTimestamp,
         inferTimestampNTZ = inferTimestampNTZ)
+        nanosAsLong = nanosAsLong)
```

Review Comment:
```suggestion
        inferTimestampNTZ = inferTimestampNTZ,
        nanosAsLong = nanosAsLong)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
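To make the effect of the new flag concrete, here is a simplified, hypothetical model of the conversion rule this PR wires through the converter (written in Python for brevity; the function and type names are illustrative, not Spark's actual code): a Parquet `TIMESTAMP(NANOS, ...)` column maps to a plain long only when the legacy flag is set, and remains an error otherwise.

```python
# Hypothetical, simplified model of the schema-conversion rule discussed in
# this PR; names and structure are illustrative, not Spark's actual code.
def convert_parquet_timestamp(unit: str, nanos_as_long: bool) -> str:
    """Map a Parquet TIMESTAMP logical-type unit to a Spark SQL type name."""
    if unit in ("MILLIS", "MICROS"):
        return "TimestampType"          # natively supported precisions
    if unit == "NANOS" and nanos_as_long:
        return "LongType"               # legacy: raw nanoseconds since epoch
    raise ValueError(f"Illegal Parquet type: TIMESTAMP({unit}, true)")

# With the legacy flag, nanos survive as a long; without it, reading fails.
print(convert_parquet_timestamp("NANOS", nanos_as_long=True))   # LongType
```

The point of routing the flag through the constructor default (`SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get`) is that pre-3.2 behaviour is opt-in rather than silently restored for everyone.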
[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1095651650

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:

```diff
@@ -4946,6 +4953,8 @@ class SQLConf extends Serializable with Logging {
   def parquetInferTimestampNTZEnabled: Boolean = getConf(PARQUET_INFER_TIMESTAMP_NTZ_ENABLED)

+  def parquetTimestampNTZEnabled: Boolean = getConf(PARQUET_TIMESTAMP_NTZ_ENABLED)
+
```

Review Comment: Looks like this sneaked in while rebasing... Renamed recently here: https://github.com/apache/spark/commit/ae24327e8984e7ee8ba948e3be0b6b2f28df68d0#diff-13c5b65678b327277c68d17910ae93629801af00117a0e3da007afd95b6c6764L4946-R4947
[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025422089

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:

```diff
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }

+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+    val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
+    val data = spark.read.parquet(testDataPath.toString).select("birthday")
+
+    assert(data.schema.fields.head.dataType == LongType)
+    assert(data.take(1).head.getAs[Long](0) == 16685371290L)
```

Review Comment: Can we have true nanosecond values, to be sure they are not zeroed by the loader but loaded as in the file, e.g. `1665532812012345678L`?
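The value suggested above, `1665532812012345678L`, is useful precisely because its sub-second part is non-zero. A quick stdlib check (a sketch to illustrate the review point, not part of the test file) shows what would be lost if the loader silently zeroed everything below second precision:

```python
# Split a nanoseconds-since-epoch value into whole seconds and the
# sub-second nanosecond remainder.
def split_epoch_nanos(nanos: int) -> tuple[int, int]:
    return divmod(nanos, 1_000_000_000)

seconds, sub_nanos = split_epoch_nanos(1665532812012345678)
print(seconds, sub_nanos)  # 1665532812 12345678
```

A truncated value with all-zero nanos could not distinguish a loader that preserves the raw long from one that rounds it to seconds.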
[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025413969

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:

```diff
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }

+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
```

Review Comment: nit: other tests in this file follow the naming convention "SPARK-X: ..."
```suggestion
  test("SPARK-40819: ability to read parquet file with TIMESTAMP(NANOS, true)") {
```
[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025407302

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:

```diff
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }

+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+    val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
+    val data = spark.read.parquet(testDataPath.toString).select("birthday")
+
+    assert(data.schema.fields.head.dataType == LongType)
+    assert(data.take(1).head.getAs[Long](0) == 16685371290L)
```

Review Comment: Shall we sort the read dataframe to guarantee we compare the first row (unless all rows have the same timestamp)?
[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1017736181

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:

```diff
@@ -198,6 +205,31 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
 }

 class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
+  testSchemaInference[Tuple1[Long]](
```

Review Comment: @cloud-fan moving Parquet from `1.10.1` to `1.12.3` introduced this regression: Spark 3.1 returned `LongType`, while Spark 3.2 fails on the illegal type.
[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1002966819

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:

```diff
@@ -271,6 +271,10 @@ class ParquetToSparkSchemaConverter(
         } else {
           TimestampNTZType
         }
+      // SPARK-40819: NANOS are not supported as a Timestamp, convert to LongType without
+      // timezone awareness to address behaviour regression introduced by SPARK-34661
```

Review Comment: Yes, the mere purpose of this exercise is to get access to the nano precision.
[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1001531184

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:

```diff
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
         } else {
           TimestampNTZType
         }
+      case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>
```

Review Comment: So this is not a valid workaround. @sunchao what is your feeling about restoring the earlier, useful behaviour?
[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1001088325

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:

```diff
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
         } else {
           TimestampNTZType
         }
+      case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>
```

Review Comment: Yes, the earlier existing behaviour was useful and should be restored until nano timestamps are properly supported as a typed timestamp, unless a workaround can be found that restores the earlier behaviour. Does providing a schema (`spark.read.schema(...)`) with a long type override the Parquet timestamp type from the Parquet file? Would that be a workaround?