[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-662801978 > Looks almost okay now, so could you split this PR into pieces? I think its somewhat big fro reviews. For example; > > 1. More refactoring PR for `HiveScriptTransformationExec` and `BaseScriptTransfromationExec` just like [SPARK-32105](https://issues.apache.org/jira/browse/SPARK-32105) > 2. PR to improve test coverage of `HiveScriptTransformationExec` > 3. Then, PR to implement Spark-native TRRANSFORM > > WDTY? Yea, raise pr one by one. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-662787840 All done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-662529996 @maropu add UT of ROW FORMAT DELIMIT in SQLQueryTestSuite. and seems all have done, slow net work and too comment, check tomorrow in company. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-662283426 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-662213113 > The current main code part looks almost okay, but IMO we need more tests for better test coverage as commented above. Sure, update later. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-662200594 > I meant the behaivour of v3.0.0. Either way, this PR is getting larger, so I think its okay to support it in follow-ups. All right, really too big. Keep like this This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-662183023 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-661678311 > > @maropu Update for spark defult serde support ArrayType/MapType/StructType and UT changed too > > How about separating PRs for the support? I think this PR should focus on implementing a spark default serde that have the same behaivour with the hive default serde (`LazySimpleSerde`) as much as possible. Since current 'LazySimpleSerde' support Array/Map/Struct since pr #27556, so I did this. So I just revert latest commit? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-661663602 @maropu Update for spark defult serde support ArrayType/MapType/StructType and UT changed too This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-661597390 @maropu ArrayType/MapType/StructType data format show as below ``` test("SPARK-32106: TRANSFORM support complex data types as input and ouput type (hive serde)") { assume(TestUtils.testCommandAvailable("/bin/bash")) withTempView("v") { val df = Seq( (1, "1", Array(0, 1, 2), Map("a" -> 1)), (2, "2", Array(3, 4, 5), Map("b" -> 2))) .toDF("a", "b", "c", "d") .select('a, 'b, 'c, 'd, struct('a, 'b).as("e")) df.createTempView("v") sql( """ |SELECT TRANSFORM(c, d, e) |USING 'cat' as (c, d, e) |FROM v """.stripMargin).show() sql( """ |SELECT TRANSFORM(c, d, e) |USING 'cat' as (c int, d string, e string) |FROM v """.stripMargin).show() sql( """ |SELECT TRANSFORM(c, d, e) |USING 'cat' as (c array, d map, e struct) |FROM v """.stripMargin).show() } } ``` ``` +-+---+---+ |c| d| e| +-+---+---+ |012|a1|11| |345|b2|22| +-+---+---+ ++---+---+ | c| d| e| ++---+---+ |null|a1|11| |null|b2|22| ++---+---+ +-++--+ |c| d| e| +-++--+ |[0, 1, 2]|[a -> 1]|[1, 1]| |[3, 4, 5]|[b -> 2]|[2, 2]| +-++--+ ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-660879513 > **[Test build #126162 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126162/testReport)** for PR 29085 at commit [`e16c136`](https://github.com/apache/spark/commit/e16c13620032f8062cb0fcd6ecad9836c97febf7). > > * This patch **fails Spark unit tests**. > * This patch merges cleanly. > * This patch adds no public classes. @maropu according to this failed UT, https://user-images.githubusercontent.com/46485123/87915370-ab8fd980-caa4-11ea-8792-9bba53fa5b4d.png";> Add back ``` case DateType if conf.datetimeJava8ApiEnabled => wrapperConvertException(data => DateTimeUtils.stringToDate( UTF8String.fromString(data), DateTimeUtils.getZoneId(conf.sessionLocalTimeZone)) .map(DateTimeUtils.daysToLocalDate).orNull, converter) case TimestampType if conf.datetimeJava8ApiEnabled => wrapperConvertException(data => DateTimeUtils.stringToTimestamp( UTF8String.fromString(data), DateTimeUtils.getZoneId(conf.sessionLocalTimeZone)) .map(DateTimeUtils.microsToInstant).orNull, converter) ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-660848422 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-660611390 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-660580865 > > Some some can't run without hive > > Which kind of queries? DataType such as CalenderIntervalType, https://github.com/apache/spark/pull/29085#discussion_r456770157 Schemaless when input column more than 2. https://github.com/apache/spark/pull/29085#discussion_r456751726 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-660493862 @maropu Add ut in SQLQueryTest suite. Some some can't run without hive, so I didn't add a lot, some will be add in spark's serde pr, and some will add in pr about support transform with aggregation This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-660469205 > Could you add tests in `SQLQueryTestSuite`, too? Probably, we can add `sql-tests/inputs/transform.sql`? Sure, update later. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-660422452 > As @cloud-fan [suggested above](https://github.com/apache/spark/pull/29085#discussion_r454299734), you need to move the two tests into `sql/core`; could you check my PR for your branch? [AngersZh#5](https://github.com/AngersZh/spark/pull/5) Thanks a lot, I tried a lot but can't solve the conflict. Now I can understand the inheritance relationship more clearly This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-658286746 > so eventually we don't need to use `Cast` to convert catalyst value to string? There will always be a serde (default or user-specified). Yes, but in this step, we should do this to keep data and UT right. need to `Cast` and `Wrapper` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-658223626 > What's the behavior of hive if the script transformation doesn't specify a serde? Does Hive pick a default serde, or it well defines the behavior of non-serde? In current pr, I just add a temporary method, and wait @alfozan 's spark's serde. As far as I know, `DelimitedJSONSerDe` can handle complex data type such as Array, Map, Struct. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-658222144 > What's the behavior of hive if the script transformation doesn't specify a serde? Does Hive pick a default serde, or it well defines the behavior of non-serde? In current code, when we don't write serde with transform, it will use LazySimpleSerde https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L717-L723 it means only when you write a wrong serde, and ScriptTransformationExec can't find corresponding serde class, it will execute code about https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala#L236-L238 https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala#L175-L187 As this https://github.com/apache/spark/pull/29085#issuecomment-658057950 comment, we know that without serde, we can't handle input data string correctly, same reason, we can't handle output data too, so add a `wrapper` method to convert string to corresponding data type. In this https://github.com/apache/spark/pull/29085#issuecomment-658213516 Jenkins result you can see output data type probelm This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-658166682 > @AngersZh Yes, we implemented two native SerDe classes (`SimpleSerDe` for `ROW FORMAT DELIMITED ` and `DelimitedJSONSerDe` for the JSON variant) so there's no longer dependency on Hive's SerDes. I'd be happy to create a PR after this one with SerDe classes. Yea, that's what I want to do next, glad to hear that you will share your code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-658104535 > Can we use `Cast` to turn catalyst value to string and pass to the script? Nice advise! Updated This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-657602335 > At least we should define how to convert catalyst values to strings, right? `UnsafeArray.toString` just gives you meaningless binary string. So we need to handle string format in `BaseScriptTransformationWriterThread.processRowsWithoutSerde()` to make sure for each data type, the final data is write This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-657553668 @alfozan Hi, alfozan, I know that in facebook using script transform a lot, in your case, do you will use script transform with serde? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-657550973 > can you explain the serde part? How can we do script transformation in sql/core without the hive serde lib? In most case, we won't use script with serde, so we can implement script transform in sql/core first only with default format. You can see the code, when inputSerdeClass & outputSerdeClass is empty, the code script use won't relay on hive. In our product, we don't have transform script with serde. And we can discusses if we need to support spark's serde in sql/core. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
AngersZh commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-657487168 cc @cloud-fan @maropu @HyukjinKwon This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org