[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata

2020-10-29 Thread GitBox


guykhazma commented on pull request #28826:
URL: https://github.com/apache/spark/pull/28826#issuecomment-718496266


   @viirya @maropu any comments?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata

2020-08-23 Thread GitBox


guykhazma commented on pull request #28826:
URL: https://github.com/apache/spark/pull/28826#issuecomment-678771791


   @viirya @maropu could you please take a look and see whether this can get in?
   Thanks.






[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata

2020-07-10 Thread GitBox


guykhazma commented on pull request #28826:
URL: https://github.com/apache/spark/pull/28826#issuecomment-656699933


   @viirya @maropu any comments?
   






[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata

2020-06-22 Thread GitBox


guykhazma commented on pull request #28826:
URL: https://github.com/apache/spark/pull/28826#issuecomment-647571832


   @viirya `derivedFromAtt` is set to `false` when the expression requests a
nested field.
   The metadata for a nested column is not preserved (this is also the case in
Spark 2.4), so I am not sure what the expected behaviour should be here.
   Note that the metadata for a nested field is also not preserved when using:
   ```Scala
   df.select("col_a.name").schema
   ```
   where `df` is the dataframe that was created locally with the specified
schema.
   If this is considered a bug then we can resolve it as well (it will require
some more changes).
   
   For example, this test triggers a code path where `derivedFromAtt` is `false`,
but it currently passes because metadata is not preserved for nested columns:
   
   ```Scala
   test("SPARK-31988 - make sure schema metadata is preserved - nested schema") {
     withSQLConf((SQLConf.USE_V1_SOURCE_LIST.key,
       "avro,csv,json,kafka,orc,text,parquet")) {
       withTempPath { f =>
         // create custom dataset with schema metadata
         val data = Seq(
           Row(Row("a", 45), "b")
         )
         val schema = List(
           StructField("col_a", StructType(
             List(
               StructField("name", StringType, true,
                 new MetadataBuilder().putString("check", "b").build()),
               StructField("age", IntegerType, true)
             )
           ), true,
             new MetadataBuilder().putString("key", "value").build()),
           StructField("col_b", StringType, true)
         )

         val df = spark.createDataFrame(
           spark.sparkContext.parallelize(data),
           StructType(schema)
         )
         df.write.parquet(f.getAbsolutePath)

         // read from storage
         val readDF = spark.read.parquet(f.getAbsolutePath)
         // write again
         withTempPath { f =>
           readDF.select("col_a.name").write.parquet(f.getAbsolutePath)
           // read again and verify the schema is equal (including the metadata)
           val readDF2 = spark.read.parquet(f.getAbsolutePath)
           assert(readDF.select("col_a.name").schema == readDF2.schema)
         }
       }
     }
   }
   ```
   






[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata

2020-06-21 Thread GitBox


guykhazma commented on pull request #28826:
URL: https://github.com/apache/spark/pull/28826#issuecomment-647140487


   @HyukjinKwon @maropu @dongjoon-hyun @viirya I added a test and another fix
that was needed: the function `sortLeftFieldsByRight` also ignored the
`metadata` field.
   
   This happens only for v2 because `V2ScanRelationPushDown` calls
`pruneColumns`, which triggers this code path (see
[here](https://github.com/apache/spark/blob/d2a656c81ef784657a02e7347bfe87e4331fd2c9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L50))
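
To illustrate the kind of bug described above, here is a minimal sketch in plain Scala with no Spark dependency. `Field`, `MetadataSketch`, `sortDroppingMetadata` and `sortKeepingMetadata` are hypothetical names made up for this example; they are simplified stand-ins and not Spark's `StructField` or the actual `sortLeftFieldsByRight` code. The sketch only assumes that the reordering step rebuilds each field from its name, type and nullability, so the metadata silently falls back to its default.

```scala
// Simplified stand-in for a schema field; NOT Spark's StructField.
final case class Field(
    name: String,
    dataType: String,
    nullable: Boolean = true,
    metadata: Map[String, String] = Map.empty)

object MetadataSketch {
  // Reorders `left` to follow the field order of `right`, rebuilding each
  // field from name, type and nullability only. The metadata argument is
  // omitted, so every rebuilt field ends up with empty metadata.
  def sortDroppingMetadata(left: Seq[Field], right: Seq[Field]): Seq[Field] =
    right.flatMap { r =>
      left.find(_.name == r.name).map(l => Field(l.name, l.dataType, l.nullable))
    }

  // Same reordering, but the metadata of the left field is carried over.
  def sortKeepingMetadata(left: Seq[Field], right: Seq[Field]): Seq[Field] =
    right.flatMap { r =>
      left.find(_.name == r.name)
        .map(l => Field(l.name, l.dataType, l.nullable, l.metadata))
    }

  // Hypothetical pruned schema (left) and original field order (right).
  val pruned: Seq[Field] = Seq(
    Field("col_b", "string"),
    Field("col_a", "string", metadata = Map("key" -> "value")))
  val original: Seq[Field] = Seq(
    Field("col_a", "string"),
    Field("col_b", "string"))
}
```

Both functions return the fields in the order `col_a`, `col_b`; they differ only in whether `col_a` still carries its `key` entry afterwards, which is the difference a schema-equality assertion would catch.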






[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata

2020-06-21 Thread GitBox


guykhazma commented on pull request #28826:
URL: https://github.com/apache/spark/pull/28826#issuecomment-647112895


   @dongjoon-hyun did you make sure to run it with datasource v2?
   I will add tests for both and a better explanation later today.






[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata

2020-06-15 Thread GitBox


guykhazma commented on pull request #28826:
URL: https://github.com/apache/spark/pull/28826#issuecomment-644146555


   @maropu which test do you suggest adding?
   This is a private function that is not tested anywhere. Also, not all file
formats are able to save the metadata (for example, csv), so it seems to me that
adding a test to
[SchemaPruningSuite](https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala)
 is not the right thing.
   I can add a test similar to the above code snippet. If that seems ok to you,
where would you suggest putting it?


