[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata
guykhazma commented on pull request #28826: URL: https://github.com/apache/spark/pull/28826#issuecomment-718496266 @viirya @maropu any comments? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata
guykhazma commented on pull request #28826: URL: https://github.com/apache/spark/pull/28826#issuecomment-678771791 @viirya @maropu can you please take a look and see whether this can be merged? Thanks.
[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata
guykhazma commented on pull request #28826: URL: https://github.com/apache/spark/pull/28826#issuecomment-656699933 @viirya @maropu any comments?
[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata
guykhazma commented on pull request #28826: URL: https://github.com/apache/spark/pull/28826#issuecomment-647571832 @viirya `derivedFromAtt` is set to `false` when the expression requests a nested field. The metadata for a nested column is not preserved (this is also the case in Spark 2.4), so I am not sure what the expected behaviour should be here. Note that the metadata for a nested field is also not preserved when using:

```Scala
df.select("col_a.name").schema
```

where `df` is the DataFrame that was created locally with the specified schema. If this is considered a bug, we can resolve it as well (it will require some more changes). For example, this test triggers a code path where `derivedFromAtt` is `false`, but it currently passes because metadata is not preserved for nested columns:

```Scala
test("SPARK-31988 - make sure schema metadata is preserved - nested schema") {
  withSQLConf((SQLConf.USE_V1_SOURCE_LIST.key, "avro,csv,json,kafka,orc,text,parquet")) {
    withTempPath { f =>
      // create custom dataset with schema metadata
      val data = Seq(
        Row(Row("a", 45), "b")
      )
      val schema = List(
        StructField("col_a",
          StructType(
            List(
              StructField("name", StringType, true,
                new MetadataBuilder().putString("check", "b").build()),
              StructField("age", IntegerType, true)
            )
          ), true, new MetadataBuilder().putString("key", "value").build()),
        StructField("col_b", StringType, true)
      )
      val df = spark.createDataFrame(
        spark.sparkContext.parallelize(data),
        StructType(schema)
      )
      df.write.parquet(f.getAbsolutePath)
      // read from storage
      val readDF = spark.read.parquet(f.getAbsolutePath)
      // write again
      withTempPath { f =>
        readDF.select("col_a.name").write.parquet(f.getAbsolutePath)
        // read again and verify the schema is equal (including the metadata)
        val readDF2 = spark.read.parquet(f.getAbsolutePath)
        assert(readDF.select("col_a.name").schema == readDF2.schema)
      }
    }
  }
}
```
[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata
guykhazma commented on pull request #28826: URL: https://github.com/apache/spark/pull/28826#issuecomment-647140487 @HyukjinKwon @maropu @dongjoon-hyun @viirya I added a test and also another fix that was needed: the function `sortLeftFieldsByRight` also ignored the `metadata` field. As for why this happens only with v2: `V2ScanRelationPushDown` calls `pruneColumns`, which triggers this code path (see [here](https://github.com/apache/spark/blob/d2a656c81ef784657a02e7347bfe87e4331fd2c9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L50)).
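To illustrate the `sortLeftFieldsByRight` point, here is a minimal sketch of how a reordering helper can lose metadata. It is not the Spark implementation: `Field`, `MetadataSortSketch`, `sortDroppingMetadata`, and `sortKeepingMetadata` are hypothetical names, and `Field` is a simplified stand-in for `StructField` that uses a plain `Map` instead of Spark's `Metadata`. The assumption is only that the problematic code rebuilds each field from its name, type, and nullability.

```scala
// Simplified stand-in for Spark's StructField (hypothetical, for illustration only).
final case class Field(
    name: String,
    dataType: String,
    nullable: Boolean = true,
    metadata: Map[String, String] = Map.empty)

object MetadataSortSketch {
  // Reorders `left` to follow the field order of `right`, but rebuilds each
  // field from name, type, and nullability only, so metadata falls back to
  // the empty default.
  def sortDroppingMetadata(left: Seq[Field], right: Seq[Field]): Seq[Field] =
    right.flatMap { r =>
      left.find(_.name == r.name).map(l => Field(l.name, l.dataType, l.nullable))
    }

  // Same ordering, but each matching left field is carried over whole,
  // so its metadata survives.
  def sortKeepingMetadata(left: Seq[Field], right: Seq[Field]): Seq[Field] =
    right.flatMap(r => left.find(_.name == r.name))
}
```

Calling both helpers on a `left` schema whose fields carry metadata shows the difference: the first returns fields with empty metadata, the second returns the original fields in the order given by `right`.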
[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata
guykhazma commented on pull request #28826: URL: https://github.com/apache/spark/pull/28826#issuecomment-647112895 @dongjoon-hyun did you make sure to run it with datasource v2? I will add tests for both and a better explanation later today.
[GitHub] [spark] guykhazma commented on pull request #28826: [SPARK-31988][SQL] Schema pruning may discard attribute metadata
guykhazma commented on pull request #28826: URL: https://github.com/apache/spark/pull/28826#issuecomment-644146555 @maropu which test do you suggest adding? This is a private function that is not tested anywhere. Also, not all file formats can save the metadata (CSV, for example), so it seems to me that adding a test to [SchemaPruningSuite](https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala) is not the right approach. I can add a test similar to the code snippet above. If that seems OK to you, where would you suggest putting it?