[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2023-02-03 Thread via GitHub


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1095704158


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##
@@ -49,24 +49,28 @@ import org.apache.spark.sql.types._
  * @param caseSensitive Whether use case sensitive analysis when comparing Spark catalyst read
  *  schema with Parquet schema.
  * @param inferTimestampNTZ Whether TimestampNTZType type is enabled.
+ * @param nanosAsLong Whether timestamps with nanos are converted to long.
  */
 class ParquetToSparkSchemaConverter(
     assumeBinaryIsString: Boolean = SQLConf.PARQUET_BINARY_AS_STRING.defaultValue.get,
     assumeInt96IsTimestamp: Boolean = SQLConf.PARQUET_INT96_AS_TIMESTAMP.defaultValue.get,
     caseSensitive: Boolean = SQLConf.CASE_SENSITIVE.defaultValue.get,
     inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get) {
+    nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {

Review Comment:
   ```suggestion
    inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get,
    nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {
   ```
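For illustration, here is a minimal end-to-end sketch of the behaviour this constructor argument is wired to. The config key `spark.sql.legacy.parquet.nanosAsLong` is inferred from the `SQLConf.LEGACY_PARQUET_NANOS_AS_LONG` constant and the file path is a placeholder, so treat this as an assumption rather than the documented API:

```scala
import org.apache.spark.sql.SparkSession

object NanosAsLongExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nanos-as-long-example")
      .master("local[*]")
      // Assumed config key, inferred from SQLConf.LEGACY_PARQUET_NANOS_AS_LONG
      .config("spark.sql.legacy.parquet.nanosAsLong", "true")
      .getOrCreate()

    // With the flag enabled, a Parquet TIMESTAMP(NANOS, *) column is expected to
    // be read back as LongType instead of failing schema conversion.
    spark.read.parquet("/path/to/timestamp-nanos.parquet").printSchema()

    spark.stop()
  }
}
```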



##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##
@@ -65,12 +68,14 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
   caseSensitive: Boolean = false,
   inferTimestampNTZ: Boolean = true,
   sparkReadSchema: Option[StructType] = None,
-  expectedParquetColumn: Option[ParquetColumn] = None): Unit = {
+  expectedParquetColumn: Option[ParquetColumn] = None,
+  nanosAsLong: Boolean = false): Unit = {
 val converter = new ParquetToSparkSchemaConverter(
   assumeBinaryIsString = binaryAsString,
   assumeInt96IsTimestamp = int96AsTimestamp,
   caseSensitive = caseSensitive,
   inferTimestampNTZ = inferTimestampNTZ)
+  nanosAsLong = nanosAsLong)

Review Comment:
   ```suggestion
 inferTimestampNTZ = inferTimestampNTZ,
 nanosAsLong = nanosAsLong)
   ```



##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##
@@ -454,13 +459,15 @@ object ParquetFileFormat extends Logging {
     val assumeBinaryIsString = sparkSession.sessionState.conf.isParquetBinaryAsString
     val assumeInt96IsTimestamp = sparkSession.sessionState.conf.isParquetINT96AsTimestamp
     val inferTimestampNTZ = sparkSession.sessionState.conf.parquetInferTimestampNTZEnabled
+    val nanosAsLong = sparkSession.sessionState.conf.legacyParquetNanosAsLong

     val reader = (files: Seq[FileStatus], conf: Configuration, ignoreCorruptFiles: Boolean) => {
       // Converter used to convert Parquet `MessageType` to Spark SQL `StructType`
       val converter = new ParquetToSparkSchemaConverter(
         assumeBinaryIsString = assumeBinaryIsString,
         assumeInt96IsTimestamp = assumeInt96IsTimestamp,
         inferTimestampNTZ = inferTimestampNTZ)
+        nanosAsLong = nanosAsLong)

Review Comment:
   ```suggestion
   inferTimestampNTZ = inferTimestampNTZ,
   nanosAsLong = nanosAsLong)
   ```






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2023-02-03 Thread via GitHub


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1095651650


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -4946,6 +4953,8 @@ class SQLConf extends Serializable with Logging {
 
   def parquetInferTimestampNTZEnabled: Boolean = getConf(PARQUET_INFER_TIMESTAMP_NTZ_ENABLED)
 
+  def parquetTimestampNTZEnabled: Boolean = getConf(PARQUET_TIMESTAMP_NTZ_ENABLED)
+

Review Comment:
   Looks like this sneaked in while rebasing...
   
   Renamed recently here: 
https://github.com/apache/spark/commit/ae24327e8984e7ee8ba948e3be0b6b2f28df68d0#diff-13c5b65678b327277c68d17910ae93629801af00117a0e3da007afd95b6c6764L4946-R4947






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-11-17 Thread GitBox


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025422089


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
 }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, 
true)") {
+val testDataPath = 
getClass.getResource("/test-data/timestamp-nanos.parquet")
+val data = spark.read.parquet(testDataPath.toString).select("birthday")
+
+assert(data.schema.fields.head.dataType == LongType)
+assert(data.take(1).head.getAs[Long](0) == 16685371290L)

Review Comment:
   Can we use true nanosecond values, to be sure they are not zeroed by the loader but read exactly as stored in the file, e.g. `1665532812012345678L`?
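For illustration, a minimal sketch of the assertion with such a full-precision value; it assumes the test resource is regenerated to contain `1665532812012345678L` and reuses the names from the test above:

```scala
// Sketch only: assumes /test-data/timestamp-nanos.parquet is regenerated so the
// "birthday" column holds a full-precision nanosecond value.
val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
val data = spark.read.parquet(testDataPath.toString).select("birthday")

assert(data.schema.fields.head.dataType == LongType)
// The sub-microsecond digits are non-zero, so any truncation by the reader
// would make this comparison fail.
assert(data.take(1).head.getAs[Long](0) == 1665532812012345678L)
```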






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-11-17 Thread GitBox


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025413969


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
 }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, 
true)") {

Review Comment:
   nit: other tests in this file follow the naming convention "SPARK-X: ..."
   
   ```suggestion
 test("SPARK-40819: ability to read parquet file with TIMESTAMP(NANOS, 
true)") {
   ```






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-11-17 Thread GitBox


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1025407302


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
 }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, 
true)") {
+val testDataPath = 
getClass.getResource("/test-data/timestamp-nanos.parquet")
+val data = spark.read.parquet(testDataPath.toString).select("birthday")
+
+assert(data.schema.fields.head.dataType == LongType)
+assert(data.take(1).head.getAs[Long](0) == 16685371290L)

Review Comment:
   Shall we sort the read dataframe to guarantee we compare the first row (unless all rows have the same timestamp)?
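A possible shape for that, sketched under the assumption that the test file contains more than one row; `expectedSmallestNanos` is a hypothetical placeholder for the smallest value written into the file:

```scala
// Sketch: sort before comparing so the assertion does not depend on the row
// order the Parquet reader happens to return.
val sorted = spark.read.parquet(testDataPath.toString)
  .select("birthday")
  .orderBy("birthday")

assert(sorted.take(1).head.getAs[Long](0) == expectedSmallestNanos)
```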






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-11-09 Thread GitBox


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1017736181


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:
##
@@ -198,6 +205,31 @@ abstract class ParquetSchemaTest extends ParquetTest with SharedSparkSession {
 }
 
 class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
+  testSchemaInference[Tuple1[Long]](

Review Comment:
   @cloud-fan moving Parquet from `1.10.1` to `1.12.3` introduced this regression: Spark 3.1 returned `LongType`, while Spark 3.2 fails with an illegal-type error.






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-10-24 Thread GitBox


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1002966819


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##
@@ -271,6 +271,10 @@ class ParquetToSparkSchemaConverter(
 } else {
   TimestampNTZType
 }
+      // SPARK-40819: NANOS are not supported as a Timestamp, convert to LongType without
+      // timezone awareness to address behaviour regression introduced by SPARK-34661

Review Comment:
   Yes, the sole purpose of this exercise is to get access to the nanosecond precision.
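For reference, the conversion rule under discussion amounts to roughly the following. This is a sketch derived from the diff above plus the `nanosAsLong` flag added later in this PR, isolated as a standalone function rather than the converter's actual pattern match:

```scala
import org.apache.parquet.schema.LogicalTypeAnnotation.{TimestampLogicalTypeAnnotation, TimeUnit}
import org.apache.spark.sql.types.{DataType, LongType}

// Sketch: map Parquet TIMESTAMP(NANOS, *) to LongType instead of throwing, so the
// raw nanosecond count stays accessible; `nanosAsLong` mirrors the legacy flag.
def nanosBranch(annotation: TimestampLogicalTypeAnnotation, nanosAsLong: Boolean): Option[DataType] =
  if (annotation.getUnit == TimeUnit.NANOS && nanosAsLong) Some(LongType) else None
```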






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-10-24 Thread GitBox


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1002966819


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##
@@ -271,6 +271,10 @@ class ParquetToSparkSchemaConverter(
 } else {
   TimestampNTZType
 }
+      // SPARK-40819: NANOS are not supported as a Timestamp, convert to LongType without
+      // timezone awareness to address behaviour regression introduced by SPARK-34661

Review Comment:
   Yes, the main purpose of a nanosecond timestamp is its nanosecond precision.






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-10-21 Thread GitBox


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1001531184


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
 } else {
   TimestampNTZType
 }
+      case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   So this is not a valid workaround. @sunchao how do you feel about restoring the earlier, useful behaviour?






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-10-20 Thread GitBox


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1001088325


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
 } else {
   TimestampNTZType
 }
+      case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>

Review Comment:
   Yes, the earlier behaviour was useful and should be restored until nanosecond timestamps are properly supported as a typed value, unless a workaround can be found that restores that behaviour.
   
   Does providing a schema (`spark.read.schema(...)`) with a long type override the Parquet timestamp type from the file? Would that be a workaround?
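If someone wants to try that, the attempt would look roughly like the sketch below; whether the user-supplied schema actually overrides the Parquet logical type is exactly the open question, and the column and file names are illustrative (an active `spark` session is assumed):

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch of the proposed workaround: force the nanos column to LongType through
// a user-provided read schema instead of relying on schema inference.
val readSchema = StructType(Seq(StructField("birthday", LongType, nullable = true)))

val df = spark.read
  .schema(readSchema)
  .parquet("/path/to/timestamp-nanos.parquet")
```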


