[GitHub] [spark] AmplabJenkins removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins removed a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667819031 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29331: [SPARK-32517][CORE] Add StorageLevel.DISK_ONLY_3
AmplabJenkins removed a comment on pull request #29331: URL: https://github.com/apache/spark/pull/29331#issuecomment-667819083
[GitHub] [spark] AmplabJenkins commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667819031
[GitHub] [spark] AmplabJenkins commented on pull request #29331: [SPARK-32517][CORE] Add StorageLevel.DISK_ONLY_3
AmplabJenkins commented on pull request #29331: URL: https://github.com/apache/spark/pull/29331#issuecomment-667819083
[GitHub] [spark] SparkQA commented on pull request #29331: [SPARK-32517][CORE] Add StorageLevel.DISK_ONLY_3
SparkQA commented on pull request #29331: URL: https://github.com/apache/spark/pull/29331#issuecomment-667818351 **[Test build #126955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126955/testReport)** for PR 29331 at commit [`0cf67c4`](https://github.com/apache/spark/commit/0cf67c43d225c198607d6957fc26b64a26aeefaa).
[GitHub] [spark] SparkQA commented on pull request #29291: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
SparkQA commented on pull request #29291: URL: https://github.com/apache/spark/pull/29291#issuecomment-667818385 **[Test build #126956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126956/testReport)** for PR 29291 at commit [`883973b`](https://github.com/apache/spark/commit/883973b9bc8a9c530a002cf4b48217546929fb5e).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins removed a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667816168 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126952/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins removed a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667816161 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
SparkQA removed a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667807499 **[Test build #126952 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126952/testReport)** for PR 28953 at commit [`70d8719`](https://github.com/apache/spark/commit/70d8719e8877ac7b4f4d0b0b8bb309ee1611df07).
[GitHub] [spark] dongjoon-hyun opened a new pull request #29331: [SPARK-32517][CORE] Add StorageLevel.DISK_ONLY_3
dongjoon-hyun opened a new pull request #29331: URL: https://github.com/apache/spark/pull/29331

### What changes were proposed in this pull request?
This PR aims to add `StorageLevel.DISK_ONLY_3` as a built-in `StorageLevel`.

### Why are the changes needed?
Disaggregated clusters and clusters without storage services like HDFS are becoming more common. Previously, users could use the similar `MEMORY_AND_DISK_2` or a user-created `StorageLevel`. This PR aims to support it officially.

### Does this PR introduce _any_ user-facing change?
Yes. This provides a new built-in option.

### How was this patch tested?
Pass the GitHub Action or Jenkins with the revised test cases.
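To make the difference between the levels named in the PR description concrete, here is a minimal sketch in plain Python. `StorageLevelSketch` is a hypothetical stand-in, not Spark's class. Its five fields mirror the flags of Spark's `StorageLevel` (use disk, use memory, use off-heap, deserialized, replication), and the constant values are assumed to follow the Scala definitions.

```python
from typing import NamedTuple


class StorageLevelSketch(NamedTuple):
    """Hypothetical stand-in for the five flags of Spark's StorageLevel."""
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1


# Existing built-in levels, modelled with the stand-in type.
DISK_ONLY = StorageLevelSketch(True, False, False, False)
MEMORY_AND_DISK_2 = StorageLevelSketch(True, True, False, True, 2)

# What the PR proposes as a built-in: disk only, replicated three times.
DISK_ONLY_3 = DISK_ONLY._replace(replication=3)
```

In this model, `DISK_ONLY_3` differs from `DISK_ONLY` only in its replication factor, and unlike `MEMORY_AND_DISK_2` it never keeps blocks in memory.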
[GitHub] [spark] AmplabJenkins commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667816161
[GitHub] [spark] SparkQA commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
SparkQA commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667816089 **[Test build #126952 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126952/testReport)** for PR 28953 at commit [`70d8719`](https://github.com/apache/spark/commit/70d8719e8877ac7b4f4d0b0b8bb309ee1611df07). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] leanken edited a comment on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken edited a comment on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667811949 @agrawaldevesh I finally understand the complexity of multi-column support, thanks to your repeated reminders, and I am sorry for my naivety. Do you think it is still worth carrying on to support multiple columns? I sincerely ask for your suggestion.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29291: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
AmplabJenkins removed a comment on pull request #29291: URL: https://github.com/apache/spark/pull/29291#issuecomment-667815478
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29330: [SPARK-32432] Added support for reading ORC/Parquet files with SymlinkTextInputFormat
AmplabJenkins removed a comment on pull request #29330: URL: https://github.com/apache/spark/pull/29330#issuecomment-667815514
[GitHub] [spark] AmplabJenkins commented on pull request #29291: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
AmplabJenkins commented on pull request #29291: URL: https://github.com/apache/spark/pull/29291#issuecomment-667815478
[GitHub] [spark] AmplabJenkins commented on pull request #29330: [SPARK-32432] Added support for reading ORC/Parquet files with SymlinkTextInputFormat
AmplabJenkins commented on pull request #29330: URL: https://github.com/apache/spark/pull/29330#issuecomment-667815514
[GitHub] [spark] SparkQA commented on pull request #29330: [SPARK-32432] Added support for reading ORC/Parquet files with SymlinkTextInputFormat
SparkQA commented on pull request #29330: URL: https://github.com/apache/spark/pull/29330#issuecomment-667814945 **[Test build #126954 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126954/testReport)** for PR 29330 at commit [`c97f003`](https://github.com/apache/spark/commit/c97f0031eb7c18d53ef6c302213e8766cb5d2e99).
[GitHub] [spark] manuzhang commented on pull request #29321: [SPARK-32083][SQL][3.0] AQE coalesce should at least return one partition
manuzhang commented on pull request #29321: URL: https://github.com/apache/spark/pull/29321#issuecomment-667814511 @cloud-fan The title seems not to be related to the partial backport.
[GitHub] [spark] imback82 commented on a change in pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
imback82 commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r464203736
## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ##
@@ -245,15 +245,22 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
"read files of Hive data source directly.") }
+val updatedPaths = if (paths.length == 1) {
Review comment: +1 for your suggestion
[GitHub] [spark] leanken commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667814305
> @agrawaldevesh I finally understand the complexity of multi-column support, thanks to your repeated reminders, and I am sorry for my naivety. Do you think it is still worth carrying on to support multiple columns? I sincerely ask for your suggestion.

As for how to support it, I think it might work as follows.

On the build side:
1. Scan the buildSide to gather information about which columns contain null.
2. Build a HashedRelation from the original input, including keys that contain any null.
3. Build an extra HashedRelation that holds all null-padded combinations.

When probing from the streamedSide:
1. If the streamedSide key has no null values, use the null information gathered from the right side to look for a match in the original HashedRelation. For example, for (1,2,3) with buildSide columns c2 and c3 containing null values, try to match using the keys (1,2,3), (1,null,3), (1,2,null), (1,null,null).
2. If the streamedSide key contains a null value in any column, for example (null,2,3), use the key to look up in the extra HashedRelation, because it contains all possible combinations.
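The first probing case in the comment above (a streamedSide key with no nulls) can be sketched in plain Python. This is only an illustration, not the PR's implementation. Tuples with `None` stand in for rows with null, a Python `set` stands in for the original HashedRelation, and the names `null_columns`, `candidate_keys`, `may_match` and `probe_non_null` are hypothetical. The extra null-padded relation for keys that contain null is not modelled here.

```python
from itertools import combinations


def null_columns(build_rows):
    """Indices of build-side columns that contain at least one None."""
    return sorted({i for row in build_rows for i, v in enumerate(row) if v is None})


def candidate_keys(key, null_cols):
    """Variants of a non-null key with every subset of null_cols set to None."""
    keys = []
    for size in range(len(null_cols) + 1):
        for subset in combinations(null_cols, size):
            keys.append(tuple(None if i in subset else v for i, v in enumerate(key)))
    return keys


def may_match(key, row):
    """Reference check: no column pair is definitely unequal."""
    return all(a is None or b is None or a == b for a, b in zip(key, row))


def probe_non_null(key, build_set, null_cols):
    """True if a non-null streamed key may match some build row."""
    return any(k in build_set for k in candidate_keys(key, null_cols))


build_rows = [(1, None, 3), (4, 5, None), (7, 8, 9)]
build_set = set(build_rows)
null_cols = null_columns(build_rows)  # columns c2 and c3, i.e. indices [1, 2]
```

With these inputs, `candidate_keys((1, 2, 3), null_cols)` produces the four keys listed in the comment, and `probe_non_null` agrees with a brute-force scan using `may_match`. The number of candidate keys doubles with each build-side column that contains null, which is part of the complexity under discussion.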
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r464203078
## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ##
@@ -245,15 +245,22 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
"read files of Hive data source directly.") }
+val updatedPaths = if (paths.length == 1) {
Review comment: If we are worried about silently changing results, we can fail if the `path` option is set and `load` is called with path parameters. The error message should ask users to either remove the `path` option or put it into the `load` parameters.
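The fail-fast behavior suggested in this review comment can be sketched as a small standalone check. This is a hypothetical illustration in plain Python, not the code in `DataFrameReader.scala`. The function name `resolve_load_paths`, its parameters, and the wording of the error message are all assumptions.

```python
def resolve_load_paths(options, load_paths):
    """Return the paths to read, failing if both sources of paths are set.

    options    -- dict of reader options; may contain a "path" entry
    load_paths -- sequence of paths passed directly to load()
    """
    option_path = options.get("path")
    if option_path is not None and load_paths:
        # Ambiguous input: refuse rather than silently pick one source.
        raise ValueError(
            "Both the 'path' option and load() path parameters are set. "
            "Remove the 'path' option or pass it as a load() parameter."
        )
    if load_paths:
        return list(load_paths)
    return [option_path] if option_path is not None else []
```

Raising an error trades convenience for safety. The alternative raised elsewhere in the thread, dropping the `path` option whenever `load` receives paths, would never fail but could change results silently.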
[GitHub] [spark] moomindani opened a new pull request #29330: [SPARK-32432] Added support for reading ORC/Parquet files with SymlinkTextInputFormat
moomindani opened a new pull request #29330: URL: https://github.com/apache/spark/pull/29330

### What changes were proposed in this pull request?
This pull request adds support for reading ORC/Parquet files with SymlinkTextInputFormat in Apache Spark.

### Why are the changes needed?
Hive-style symlinks (SymlinkTextInputFormat) are commonly used in different analytic engines, including prestodb and prestosql. Currently SymlinkTextInputFormat works with JSON/CSV files but does not work with ORC/Parquet files in Apache Spark (and Apache Hive). On the other hand, prestodb and prestosql support SymlinkTextInputFormat with ORC/Parquet files. This pull request adds support for reading ORC/Parquet files with SymlinkTextInputFormat in Apache Spark. See details in the JIRA: SPARK-32432.

### Does this PR introduce _any_ user-facing change?
Yes. Currently Spark throws exceptions if users try to use SymlinkTextInputFormat with ORC/Parquet files. With this patch, Spark can handle symlinks that point to the locations of ORC/Parquet files.

### How was this patch tested?
I added a new test suite `SymlinkSuite` and confirmed it passed.
```
$ ./build/sbt "project hive" "test-only org.apache.spark.sql.hive.SymlinkSuite"
```
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r464202756
## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ##
@@ -245,15 +245,22 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
"read files of Hive data source directly.") }
+val updatedPaths = if (paths.length == 1) {
Review comment: I think the most intuitive behavior is to drop the `path` option if `load` is called with path parameters, whether it is one path or more.
[GitHub] [spark] leanken commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667811949 @agrawaldevesh I finally understand the complexity of multi-column support, thanks to your repeated reminders, and I am sorry for my naivety. Do you think it is still worth carrying on to support multiple columns? I sincerely ask for your suggestion.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29291: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
AmplabJenkins removed a comment on pull request #29291: URL: https://github.com/apache/spark/pull/29291#issuecomment-667810979
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
AmplabJenkins removed a comment on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-667810713 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126949/ Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
SparkQA removed a comment on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-667792091 **[Test build #126949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126949/testReport)** for PR 29328 at commit [`296d4bb`](https://github.com/apache/spark/commit/296d4bbab647189fb32f3ffc0051086f244bcfca).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
AmplabJenkins removed a comment on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-667810710 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #29291: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
AmplabJenkins commented on pull request #29291: URL: https://github.com/apache/spark/pull/29291#issuecomment-667810979
[GitHub] [spark] AmplabJenkins commented on pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
AmplabJenkins commented on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-667810710
[GitHub] [spark] SparkQA commented on pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
SparkQA commented on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-667810614 **[Test build #126949 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126949/testReport)** for PR 29328 at commit [`296d4bb`](https://github.com/apache/spark/commit/296d4bbab647189fb32f3ffc0051086f244bcfca). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] HyukjinKwon closed pull request #29329: Investigate JUnit XML test reporter
HyukjinKwon closed pull request #29329: URL: https://github.com/apache/spark/pull/29329
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
AmplabJenkins removed a comment on pull request #29320: URL: https://github.com/apache/spark/pull/29320#issuecomment-667809951
[GitHub] [spark] HyukjinKwon opened a new pull request #29329: Investigate JUnit XML test reporter
HyukjinKwon opened a new pull request #29329: URL: https://github.com/apache/spark/pull/29329
[GitHub] [spark] SparkQA commented on pull request #29291: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
SparkQA commented on pull request #29291: URL: https://github.com/apache/spark/pull/29291#issuecomment-667810108 **[Test build #126953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126953/testReport)** for PR 29291 at commit [`39583dd`](https://github.com/apache/spark/commit/39583dde43da9580245cd34768d3f613fab8b090).
[GitHub] [spark] AmplabJenkins commented on pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
AmplabJenkins commented on pull request #29320: URL: https://github.com/apache/spark/pull/29320#issuecomment-667809951
[GitHub] [spark] SparkQA removed a comment on pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
SparkQA removed a comment on pull request #29320: URL: https://github.com/apache/spark/pull/29320#issuecomment-667802282 **[Test build #126951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126951/testReport)** for PR 29320 at commit [`6d5f6ef`](https://github.com/apache/spark/commit/6d5f6ef069cb8e0fbb65616ca98f919cdd367fda).
[GitHub] [spark] SparkQA commented on pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
SparkQA commented on pull request #29320: URL: https://github.com/apache/spark/pull/29320#issuecomment-667809709 **[Test build #126951 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126951/testReport)** for PR 29320 at commit [`6d5f6ef`](https://github.com/apache/spark/commit/6d5f6ef069cb8e0fbb65616ca98f919cdd367fda). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins removed a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667808043
[GitHub] [spark] AmplabJenkins commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667808043
[GitHub] [spark] SparkQA commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
SparkQA commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667807499 **[Test build #126952 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126952/testReport)** for PR 28953 at commit [`70d8719`](https://github.com/apache/spark/commit/70d8719e8877ac7b4f4d0b0b8bb309ee1611df07).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29322: [SPARK-32511][SQL] Add dropFields method to Column class
AmplabJenkins removed a comment on pull request #29322: URL: https://github.com/apache/spark/pull/29322#issuecomment-667804781
[GitHub] [spark] AmplabJenkins commented on pull request #29322: [SPARK-32511][SQL] Add dropFields method to Column class
AmplabJenkins commented on pull request #29322: URL: https://github.com/apache/spark/pull/29322#issuecomment-667804781
[GitHub] [spark] SparkQA commented on pull request #29322: [SPARK-32511][SQL] Add dropFields method to Column class
SparkQA commented on pull request #29322: URL: https://github.com/apache/spark/pull/29322#issuecomment-667804240 **[Test build #126947 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126947/testReport)** for PR 29322 at commit [`19587e8`](https://github.com/apache/spark/commit/19587e830a7889616583f48b44da61ca296c5215). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #29322: [SPARK-32511][SQL] Add dropFields method to Column class
SparkQA removed a comment on pull request #29322: URL: https://github.com/apache/spark/pull/29322#issuecomment-667748670 **[Test build #126947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126947/testReport)** for PR 29322 at commit [`19587e8`](https://github.com/apache/spark/commit/19587e830a7889616583f48b44da61ca296c5215).
[GitHub] [spark] viirya commented on a change in pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
viirya commented on a change in pull request #29320: URL: https://github.com/apache/spark/pull/29320#discussion_r464195184 ## File path: python/docs/source/index.rst ## @@ -21,8 +21,42 @@ PySpark Documentation = +PySpark is an interface for Apache Spark in Python. It not only allows you to write +Spark applications using Python APIs, but also provides the PySpark shell for +interactively analyzing your data in a distributed environment. PySpark supports most +of Spark's features such as Spark SQL, DataFrmae, Streaming, MLlib +(Machine Learning) and Spark Core. + +.. image:: ../../../docs/img/pyspark-components.png + :alt: PySpark Compoenents + +**Spark SQL and DataFrame** + +Spark SQL is a Spark module for structured data processing. It provides +a programming abstraction called DataFrame and can also act as distributed +SQL query engine. + +**Streaming** + +Running on top of Spark, the streaming feature in Apache Spark enables powerful +interactive and analytical applications across both streaming and historical data, +while inheriting Spark’s ease of use and fault tolerance characteristics. + +**MLlib** + +Built on top of Spark, MLlib is a scalable machine learning library that provides +a uniform set of high-level APIs that help users create and tune practical machine +learning pipelines. + +**Spark Core** + +Spark Core is the underlying general execution engine for the Spark platform that all +other functionality is built on top of. It provides an RDD (Resilient Disributed Dataset) Review comment: Disributed -> Distributed
[GitHub] [spark] cloud-fan commented on a change in pull request #28527: [SPARK-31709][SQL] Proper base path for database/table location when it is a relative path
cloud-fan commented on a change in pull request #28527: URL: https://github.com/apache/spark/pull/28527#discussion_r464195071 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ## @@ -350,6 +358,16 @@ class SessionCatalog( } } + private def makeQualifiedTablePath(locationUri: URI, database: String): URI = { +if (locationUri.isAbsolute) { + locationUri +} else { + val dbName = formatDatabaseName(database) + val dbLocation = makeQualifiedDBPath(getDatabaseMetadata(dbName).locationUri) Review comment: I'm a bit concerned about it as it adds an extra database lookup. Is it better to push this work to the underlying external catalog?
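A minimal sketch of the idea under review, in plain Python rather than the PR's Scala: a table location that already carries a scheme is kept as-is, while a relative one is resolved against the database location. The function name `qualify_table_path`, the `DB_LOCATIONS` dict standing in for the catalog lookup, and the example URIs are all hypothetical; the body of `makeQualifiedTablePath` is truncated in the quoted diff and may well differ.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the catalog's database metadata.
DB_LOCATIONS = {"sales": "file:/warehouse/sales.db"}


def qualify_table_path(location: str, database: str) -> str:
    """Keep a location that has a scheme; resolve a relative one under the database location."""
    if urlparse(location).scheme:
        return location
    # This dictionary access models the extra database lookup that the
    # review comment is concerned about.
    db_location = DB_LOCATIONS[database]
    return db_location.rstrip("/") + "/" + location


print(qualify_table_path("hdfs://nn/data/t1", "sales"))
print(qualify_table_path("t1", "sales"))
```

In this toy version the lookup happens only on the relative-path branch, which is the cost the reviewer suggests pushing down to the external catalog.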
[GitHub] [spark] viirya commented on a change in pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
viirya commented on a change in pull request #29320: URL: https://github.com/apache/spark/pull/29320#discussion_r464194094 ## File path: python/docs/source/index.rst ## @@ -21,8 +21,42 @@ PySpark Documentation = +PySpark is an interface for Apache Spark in Python. It not only allows you to write +Spark applications using Python APIs, but also provides the PySpark shell for +interactively analyzing your data in a distributed environment. PySpark supports most +of Spark's features such as Spark SQL, DataFrmae, Streaming, MLlib Review comment: DataFrmae -> DataFrame
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
AmplabJenkins removed a comment on pull request #29320: URL: https://github.com/apache/spark/pull/29320#issuecomment-667802688
[GitHub] [spark] AmplabJenkins commented on pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
AmplabJenkins commented on pull request #29320: URL: https://github.com/apache/spark/pull/29320#issuecomment-667802688
[GitHub] [spark] SparkQA commented on pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
SparkQA commented on pull request #29320: URL: https://github.com/apache/spark/pull/29320#issuecomment-667802282 **[Test build #126951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126951/testReport)** for PR 29320 at commit [`6d5f6ef`](https://github.com/apache/spark/commit/6d5f6ef069cb8e0fbb65616ca98f919cdd367fda).
[GitHub] [spark] viirya commented on pull request #29169: [SPARK-32357][INFRA] Add a step in GitHub Actions to show failed tests
viirya commented on pull request #29169: URL: https://github.com/apache/spark/pull/29169#issuecomment-667801606 @HyukjinKwon Hi, this has been open for a while. Do you have any more thoughts? Thanks.
[GitHub] [spark] viirya commented on pull request #29326: [WIP][SPARK-32502][BUILD] Upgrade Guava to 27.0-jre and Hadoop to 3.2.1
viirya commented on pull request #29326: URL: https://github.com/apache/spark/pull/29326#issuecomment-667801138 The trouble is that hive-exec uses a method that has been package-private since Guava version 20, so it is incompatible with Guava versions > 19.0.
```
sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
    at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
    at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
```
hive-exec doesn't shade Guava until https://issues.apache.org/jira/browse/HIVE-22126, which targets 4.0.0. This seems to be a dead end for upgrading Guava in Spark for now.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins removed a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667799383 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126950/ Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
SparkQA commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667799372 **[Test build #126950 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126950/testReport)** for PR 28953 at commit [`7ca42b1`](https://github.com/apache/spark/commit/7ca42b1cbb0917658874a058c999f092a290fcd8). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667799382
[GitHub] [spark] SparkQA removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
SparkQA removed a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667798788 **[Test build #126950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126950/testReport)** for PR 28953 at commit [`7ca42b1`](https://github.com/apache/spark/commit/7ca42b1cbb0917658874a058c999f092a290fcd8).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins removed a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667799185
[GitHub] [spark] AmplabJenkins commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667799185
[GitHub] [spark] SparkQA commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
SparkQA commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-667798788 **[Test build #126950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126950/testReport)** for PR 28953 at commit [`7ca42b1`](https://github.com/apache/spark/commit/7ca42b1cbb0917658874a058c999f092a290fcd8).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
AmplabJenkins removed a comment on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-667792444
[GitHub] [spark] AmplabJenkins commented on pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
AmplabJenkins commented on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-667792444
[GitHub] [spark] SparkQA commented on pull request #29328: [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs
SparkQA commented on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-667792091 **[Test build #126949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126949/testReport)** for PR 29328 at commit [`296d4bb`](https://github.com/apache/spark/commit/296d4bbab647189fb32f3ffc0051086f244bcfca).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29321: [SPARK-32083][SQL][3.0] AQE coalesce should at least return one partition
AmplabJenkins removed a comment on pull request #29321: URL: https://github.com/apache/spark/pull/29321#issuecomment-667789608
[GitHub] [spark] AmplabJenkins commented on pull request #29321: [SPARK-32083][SQL][3.0] AQE coalesce should at least return one partition
AmplabJenkins commented on pull request #29321: URL: https://github.com/apache/spark/pull/29321#issuecomment-667789608
[GitHub] [spark] SparkQA commented on pull request #29321: [SPARK-32083][SQL][3.0] AQE coalesce should at least return one partition
SparkQA commented on pull request #29321: URL: https://github.com/apache/spark/pull/29321#issuecomment-667789287 **[Test build #126948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126948/testReport)** for PR 29321 at commit [`6a06fba`](https://github.com/apache/spark/commit/6a06fba70cce84cf23b6d85951fb99d25c7adcc7).
[GitHub] [spark] leanken edited a comment on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken edited a comment on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667788398 I just found a negative case for it: it should return (1,2,3) in the expansion solution, but it returns nothing in BNLJ. You are right about the correctness; let me rethink and come back to you later.
```
spark.sql(
  """
    |CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES
    |  (1, 2, 3)
    |  AS m(a, b, c)
  """.stripMargin).collect()

spark.sql(
  """
    |CREATE TEMPORARY VIEW s AS SELECT * FROM VALUES
    |  (1, null, 3)
    |  AS s(c, d, e)
  """.stripMargin).collect()

spark.sql(
  """
    |select * from m where (a,b,c) not in (select * from s)
  """.stripMargin).collect().foreach(println)
```
We should also do something on streamedSide if we want this hash lookup to apply correctly.
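The negative case above can be checked against standard SQL three-valued logic with a small plain-Python model. This is only a sketch of the semantics, not Spark's implementation: `None` stands for SQL NULL, and the helper names (`and3`, `eq3`, `row_eq3`, `not_in3`, `where`) are made up for illustration.

```python
def and3(a, b):
    """Three-valued AND: False dominates, then unknown (None), else True."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def eq3(x, y):
    """SQL equality: comparing with NULL yields unknown."""
    if x is None or y is None:
        return None
    return x == y


def row_eq3(left, right):
    """(a, b, c) = (x, y, z) is the AND of the column-wise comparisons."""
    result = True
    for x, y in zip(left, right):
        result = and3(result, eq3(x, y))
    return result


def not_in3(row, subquery_rows):
    """row NOT IN (subquery): False on any match, unknown if any comparison is unknown."""
    any_unknown = False
    for other in subquery_rows:
        cmp = row_eq3(row, other)
        if cmp is True:
            return False
        if cmp is None:
            any_unknown = True
    return None if any_unknown else True


def where(rows, predicate):
    """A WHERE clause keeps a row only when the predicate is strictly True."""
    return [r for r in rows if predicate(r) is True]


m = [(1, 2, 3)]
s = [(1, None, 3)]
print(where(m, lambda r: not_in3(r, s)))  # []
```

Here `(1, 2, 3) = (1, NULL, 3)` evaluates to unknown, so the NOT IN predicate is unknown and the row is filtered out, which agrees with the empty result reported for BNLJ.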
[GitHub] [spark] cloud-fan commented on a change in pull request #27066: [SPARK-31317][SQL] Add withField method to Column
cloud-fan commented on a change in pull request #27066: URL: https://github.com/apache/spark/pull/27066#discussion_r464181036 ## File path: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ## @@ -871,6 +871,72 @@ class Column(val expr: Expression) extends Logging { */ def getItem(key: Any): Column = withExpr { UnresolvedExtractValue(expr, Literal(key)) } + // scalastyle:off line.size.limit + /** + * An expression that adds/replaces field in `StructType` by name. + * + * {{{ + * val df = sql("SELECT named_struct('a', 1, 'b', 2) struct_col") + * df.select($"struct_col".withField("c", lit(3))) Review comment: Yes, please
[GitHub] [spark] leanken commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667788398 I just found a negative case for it: it should return (1,2,3) in the expansion solution, but it returns nothing in BNLJ. You are right about the correctness; let me rethink and come back to you later.
```
spark.sql(
  """
    |CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES
    |  (1, 2, 3)
    |  AS m(a, b, c)
  """.stripMargin).collect()

spark.sql(
  """
    |CREATE TEMPORARY VIEW s AS SELECT * FROM VALUES
    |  (1, null, 3)
    |  AS s(c, d, e)
  """.stripMargin).collect()

spark.sql(
  """
    |select * from m where (a,b,c) not in (select * from s)
  """.stripMargin).collect().foreach(println)
```
[GitHub] [spark] leanken removed a comment on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken removed a comment on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667786567 ![image](https://user-images.githubusercontent.com/17242071/89143652-e8b49b00-d57d-11ea-8fd5-b0f03f812cf3.png) `build a secondary access structure` In my case, I am building all possible secondary access structures beforehand. @agrawaldevesh
[GitHub] [spark] leanken commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667786567 ![image](https://user-images.githubusercontent.com/17242071/89143652-e8b49b00-d57d-11ea-8fd5-b0f03f812cf3.png) `build a secondary access structure` In my case, I am building all possible secondary access structures beforehand. @agrawaldevesh
[GitHub] [spark] fqaiser94 commented on a change in pull request #27066: [SPARK-31317][SQL] Add withField method to Column
fqaiser94 commented on a change in pull request #27066: URL: https://github.com/apache/spark/pull/27066#discussion_r464177138 ## File path: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ## @@ -871,6 +871,72 @@ class Column(val expr: Expression) extends Logging { */ def getItem(key: Any): Column = withExpr { UnresolvedExtractValue(expr, Literal(key)) } + // scalastyle:off line.size.limit + /** + * An expression that adds/replaces field in `StructType` by name. + * + * {{{ + * val df = sql("SELECT named_struct('a', 1, 'b', 2) struct_col") + * df.select($"struct_col".withField("c", lit(3))) Review comment: I failed to write a test case to cover this scenario, my bad. And yea, I just tried this example again, and I can see that it fails. The issue is that I `override foldable` for this `Unevaluable` Expression. And so, when `foldable` returns true, Spark tries to evaluate the expression and it fails at that point. I kind-of realized this as well recently and in my PR for `dropFields` [here](https://github.com/apache/spark/pull/29322/files#diff-c1758d627a06084e577be0d33d47f44eL566), I've fixed the issue (basically I just don't `override foldable` anymore, which by default returns `false`). I guess I should submit a follow-up PR to fix this immediately with associated unit tests?
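The failure mode described in this comment can be illustrated with a toy expression model in plain Python. It is a sketch only: the classes and the `constant_fold` function are hypothetical stand-ins and do not mirror Catalyst's actual classes or optimizer rules.

```python
class Literal:
    """A constant expression; always safe to evaluate."""
    foldable = True

    def __init__(self, value):
        self.value = value

    def eval(self):
        return self.value


class Unevaluable:
    """Toy placeholder expression that must be rewritten before evaluation."""
    foldable = False

    def eval(self):
        raise NotImplementedError("cannot evaluate a placeholder expression")


class BadWithField(Unevaluable):
    # Mirrors the bug described above: claims to be foldable even though
    # eval() cannot succeed.
    foldable = True


class GoodWithField(Unevaluable):
    # Mirrors the fix: no override, so the default foldable = False applies.
    pass


def constant_fold(expr):
    """Replace a foldable expression by a literal holding its evaluated value."""
    if expr.foldable:
        return Literal(expr.eval())
    return expr


good = GoodWithField()
print(constant_fold(good) is good)  # True: left untouched for a later rewrite

try:
    constant_fold(BadWithField())
except NotImplementedError as err:
    print("folding failed:", err)
```

Leaving `foldable` at its default keeps the placeholder out of the folding path, which is the fix the comment says was adopted in the `dropFields` PR.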
[GitHub] [spark] cloud-fan commented on pull request #29125: [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value
cloud-fan commented on pull request #29125: URL: https://github.com/apache/spark/pull/29125#issuecomment-667784890 @skambha the `sum` shouldn't fail without ANSI mode, this PR fixes it. It's indeed a bug that we can write an overflowed decimal to UnsafeRow but can't read it. The `sum` is also buggy but we can't backport the fix due to streaming compatibility reasons. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] leanken edited a comment on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken edited a comment on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667783096 ![image](https://user-images.githubusercontent.com/17242071/89143099-fb2dd500-d57b-11ea-881e-9d248403db9d.png) This is much the same as the expansion. It first goes through all the data on the build side to gather information about which columns contain null values. Say the build side has columns c1, c2, c3, and the scan finds that only c1 and c2 contain nulls; then the left record (1, 2, 3) will try to find a match among (1, null, 3), (null, 2, 3), and (null, null, 3). What I am trying to do instead is skip the build-side scan for null information: I just assume that every column might contain a null value and expand with all combinations of null padding.
[GitHub] [spark] cloud-fan closed pull request #29318: [SPARK-32509][SQL] Ignore unused DPP True Filter in Canonicalization
cloud-fan closed pull request #29318: URL: https://github.com/apache/spark/pull/29318 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #29318: [SPARK-32509][SQL] Ignore unused DPP True Filter in Canonicalization
cloud-fan commented on pull request #29318: URL: https://github.com/apache/spark/pull/29318#issuecomment-667783276 thanks, merging to master/3.0! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] leanken commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667783096 ![image](https://user-images.githubusercontent.com/17242071/89143099-fb2dd500-d57b-11ea-881e-9d248403db9d.png) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] leanken edited a comment on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
leanken edited a comment on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667781762

> Step 2: Say there is a right (build) side row (1, null, 3). It should be counted as a match against a row on the left side (1, 2, 3). What makes this tricky is that say you have a build row (1, 5, 3), then (1, 5, 3) should NOT match the probe row (1, 2, 3). But if you explode (1, 5, 3) into a (1, null, 3) then it might incorrectly match (1, 2, 3). How do you handle both of these subcases ?
>
> Step 3: Consider a build row (1, 5, null), it should match the left row (1, null, 3). In addition, it should not match the build row (1, 5, 7). How do you handle these subcases ?
>
> Above, when I mean "match" -- I mean that the left side would match the build row and WON'T be returned. Whereas with non match I mean that the left side would not match the build side and thus WILL be returned. We have different meanings for the words 'match' and 'not-match'. So please read my 'match' == 'NAAJ should not return the left row', and conversely for non-match.
>
> I would really really really encourage you to:
>
> * Please reread the paper section 6.2 in its entirety many times and understand the above cases. I had to read it many times myself. It is very tricky as you pointed out.
> * Add them as test cases comparing them with the original BNLJ implementation, both the negative and positive cases.
>
> This is really tricky and I don't think the current implementation you have of expanding the hash table with a simple lookup on the stream side would suffice. I will also try to play around with your PR locally and run them as tests to convince myself. I hope I am wrong ;-).

Yes, I do understand section 6.2 of the paper. Basically, the paper describes the algorithm from the perspective of the streamed side, whereas the expansion takes the perspective of the build side. Let's reason in reverse for the following case.

If the build side contains a row (1, 2, 3), which rows on the streamed side will be evaluated as TRUE or UNKNOWN and dropped? It should be (null, 2, 3), (1, null, 3), (1, 2, null), (null, null, 3), (null, 2, null), (1, null, null), and of course (1, 2, 3), right? Apart from the all-null case, only these combinations cause the streamed-side row to be dropped, right? Once you find an exactly equal record in the HashedRelation, null columns included, the row is dropped.

```
if (lookupKey.allNull()) {
  false
} else {
  // Anti Join: Drop the row on the streamed side if it is a match on the build
  hashed.get(lookupKey) == null
}
```

I suppose this solution works because it passes all the NOT IN cases in SQLQueryTestSuite.
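The enumeration in this comment can be sketched as follows. This is only an illustration of the null-padding expansion for one build row, under the assumption that rows are modelled as `Vector[Option[Int]]` with `None` standing for SQL NULL; `NullPaddingSketch` and `expand` are hypothetical names, and the sketch makes no claim about whether an exact-match lookup against the expanded table is sufficient.

```scala
object NullPaddingSketch {
  // None models SQL NULL; this row type is an assumption made for illustration.
  type Row = Vector[Option[Int]]

  /** Every variant of `row` with some subset of columns nulled out, excluding the all-null row. */
  def expand(row: Row): Set[Row] = {
    val n = row.length
    (0 until (1 << n)).map { mask =>
      // Bit i of the mask decides whether column i is replaced by NULL.
      row.zipWithIndex.map { case (v, i) => if ((mask & (1 << i)) != 0) None else v }
    }.filter(_.exists(_.isDefined)).toSet
  }
}
```

For a three-column row with no nulls this yields the seven combinations listed above; the all-null key is excluded because the comment handles it separately via `lookupKey.allNull()`. The expansion grows as 2^n in the number of key columns.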
[GitHub] [spark] cloud-fan commented on pull request #29317: [SPARK-32510][SQL] Check duplicate nested columns in read from JDBC datasource
cloud-fan commented on pull request #29317: URL: https://github.com/apache/spark/pull/29317#issuecomment-667781962 thanks, merging to master! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #29317: [SPARK-32510][SQL] Check duplicate nested columns in read from JDBC datasource
cloud-fan closed pull request #29317: URL: https://github.com/apache/spark/pull/29317 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #29067: [SPARK-32274][SQL] Make SQL cache serialization pluggable
cloud-fan closed pull request #29067: URL: https://github.com/apache/spark/pull/29067 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #29067: [SPARK-32274][SQL] Make SQL cache serialization pluggable
cloud-fan commented on pull request #29067: URL: https://github.com/apache/spark/pull/29067#issuecomment-667780468 github action passes, I'm merging to master, thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29146: [SPARK-32257][SQL] Reports explicit errors for invalid usage of SET/RESET command
cloud-fan commented on a change in pull request #29146: URL: https://github.com/apache/spark/pull/29146#discussion_r464172835 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala ## @@ -61,6 +64,80 @@ class SparkSqlParserSuite extends AnalysisTest { private def intercept(sqlCommand: String, messages: String*): Unit = interceptParseException(parser.parsePlan)(sqlCommand, messages: _*) + test("Checks if SET/RESET can parse all the configurations") { +// Force to build static SQL configurations +StaticSQLConf +(SQLConf.sqlConfEntries.values.asScala ++ ConfigEntry.knownConfigs.values.asScala) Review comment: `SQLConf` also uses `ConfigEntry`, I think `ConfigEntry.knownConfigs` already covers all the registered configs. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29146: [SPARK-32257][SQL] Reports explicit errors for invalid usage of SET/RESET command
cloud-fan commented on a change in pull request #29146: URL: https://github.com/apache/spark/pull/29146#discussion_r464172446 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ## @@ -66,17 +68,29 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { * character in the raw string. */ override def visitSetConfiguration(ctx: SetConfigurationContext): LogicalPlan = withOrigin(ctx) { -// Construct the command. -val raw = remainder(ctx.SET.getSymbol) -val keyValueSeparatorIndex = raw.indexOf('=') -if (keyValueSeparatorIndex >= 0) { - val key = raw.substring(0, keyValueSeparatorIndex).trim - val value = raw.substring(keyValueSeparatorIndex + 1).trim - SetCommand(Some(key -> Option(value))) -} else if (raw.nonEmpty) { - SetCommand(Some(raw.trim -> None)) +val configKeyValueDef = """([a-zA-Z_\d\\.:]+)\s*=(.*)""".r +remainder(ctx.SET.getSymbol).trim match { + case configKeyValueDef(key, value) => +SetCommand(Some(key -> Option(value.trim))) + case configKeyDef(key) => +SetCommand(Some(key -> None)) + case s if s == "-v" => +SetCommand(Some("-v" -> None)) + case s if s.isEmpty => +SetCommand(None) + case _ => throw new ParseException("Expected format is 'SET', 'SET key', or " + +"'SET key=value'. If you want to include special characters in key, " + +"please use quotes, e.g., SET `ke y`=value.", ctx) +} + } + + override def visitSetQuotedConfiguration(ctx: SetQuotedConfigurationContext) +: LogicalPlan = withOrigin(ctx) { +val keyStr = ctx.quotedConfigKey().getText +if (ctx.value != null) { + SetCommand(Some(keyStr -> Option(remainder(ctx.EQ().getSymbol).trim))) Review comment: `(EQ value=.*)` we have an alias, can we use it here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29146: [SPARK-32257][SQL] Reports explicit errors for invalid usage of SET/RESET command
cloud-fan commented on a change in pull request #29146: URL: https://github.com/apache/spark/pull/29146#discussion_r464172133 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ## @@ -66,17 +68,29 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { * character in the raw string. */ override def visitSetConfiguration(ctx: SetConfigurationContext): LogicalPlan = withOrigin(ctx) { -// Construct the command. -val raw = remainder(ctx.SET.getSymbol) -val keyValueSeparatorIndex = raw.indexOf('=') -if (keyValueSeparatorIndex >= 0) { - val key = raw.substring(0, keyValueSeparatorIndex).trim - val value = raw.substring(keyValueSeparatorIndex + 1).trim - SetCommand(Some(key -> Option(value))) -} else if (raw.nonEmpty) { - SetCommand(Some(raw.trim -> None)) +val configKeyValueDef = """([a-zA-Z_\d\\.:]+)\s*=(.*)""".r +remainder(ctx.SET.getSymbol).trim match { + case configKeyValueDef(key, value) => +SetCommand(Some(key -> Option(value.trim))) + case configKeyDef(key) => Review comment: ah nvm, we will also match `configKeyValueDef` first. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29146: [SPARK-32257][SQL] Reports explicit errors for invalid usage of SET/RESET command
cloud-fan commented on a change in pull request #29146: URL: https://github.com/apache/spark/pull/29146#discussion_r464172042 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ## @@ -66,17 +68,29 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { * character in the raw string. */ override def visitSetConfiguration(ctx: SetConfigurationContext): LogicalPlan = withOrigin(ctx) { -// Construct the command. -val raw = remainder(ctx.SET.getSymbol) -val keyValueSeparatorIndex = raw.indexOf('=') -if (keyValueSeparatorIndex >= 0) { - val key = raw.substring(0, keyValueSeparatorIndex).trim - val value = raw.substring(keyValueSeparatorIndex + 1).trim - SetCommand(Some(key -> Option(value))) -} else if (raw.nonEmpty) { - SetCommand(Some(raw.trim -> None)) +val configKeyValueDef = """([a-zA-Z_\d\\.:]+)\s*=(.*)""".r +remainder(ctx.SET.getSymbol).trim match { + case configKeyValueDef(key, value) => +SetCommand(Some(key -> Option(value.trim))) + case configKeyDef(key) => Review comment: Will it match something like `a ###`? Shall we use `([a-zA-Z_\d\\.:]+)$`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
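A small sketch may help with the `a ###` question. In Scala, a `Regex` used as a pattern-match extractor requires the entire input to match unless `.unanchored` is called, so a trailing `$` is not needed for that purpose. The code below is an assumption-laden illustration, not the PR's code: the diff excerpt does not show how `configKeyDef` is defined, so its pattern here is guessed from the suggestion in this comment, and `SetCommandSketch` and `parse` are hypothetical names.

```scala
object SetCommandSketch {
  // Defined once in the object body so the patterns are not recompiled per call.
  private val configKeyValueDef = """([a-zA-Z_\d\\.:]+)\s*=(.*)""".r
  // Assumed pattern; the excerpt under review does not include this definition.
  private val configKeyDef = """([a-zA-Z_\d\\.:]+)""".r

  /** Right(None) for a bare SET, Right(Some(key -> value)) otherwise, Left for malformed input. */
  def parse(raw: String): Either[String, Option[(String, Option[String])]] =
    raw.trim match {
      case configKeyValueDef(key, value) => Right(Some(key -> Option(value.trim)))
      case configKeyDef(key)             => Right(Some(key -> None))
      case "-v"                          => Right(Some("-v" -> None))
      case ""                            => Right(None)
      case other                         => Left(s"Expected 'SET', 'SET key', or 'SET key=value': $other")
    }
}
```

Under these assumptions `parse("a ###")` falls through to the error case, because neither extractor accepts a partial match.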
[GitHub] [spark] cloud-fan commented on a change in pull request #29146: [SPARK-32257][SQL] Reports explicit errors for invalid usage of SET/RESET command
cloud-fan commented on a change in pull request #29146: URL: https://github.com/apache/spark/pull/29146#discussion_r464171781 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ## @@ -66,17 +68,29 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { * character in the raw string. */ override def visitSetConfiguration(ctx: SetConfigurationContext): LogicalPlan = withOrigin(ctx) { -// Construct the command. -val raw = remainder(ctx.SET.getSymbol) -val keyValueSeparatorIndex = raw.indexOf('=') -if (keyValueSeparatorIndex >= 0) { - val key = raw.substring(0, keyValueSeparatorIndex).trim - val value = raw.substring(keyValueSeparatorIndex + 1).trim - SetCommand(Some(key -> Option(value))) -} else if (raw.nonEmpty) { - SetCommand(Some(raw.trim -> None)) +val configKeyValueDef = """([a-zA-Z_\d\\.:]+)\s*=(.*)""".r Review comment: Can we put it in the class body so we don't need to compile the regex repeatedly? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29146: [SPARK-32257][SQL] Reports explicit errors for invalid usage of SET/RESET command
cloud-fan commented on a change in pull request #29146: URL: https://github.com/apache/spark/pull/29146#discussion_r464171523 ## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ## @@ -246,11 +246,17 @@ statement | SET TIME ZONE interval #setTimeZone | SET TIME ZONE timezone=(STRING | LOCAL) #setTimeZone | SET TIME ZONE .*? #setTimeZone +| SET quotedConfigKey (EQ value=.*)? #setQuotedConfiguration | SET .*? #setConfiguration +| RESET quotedConfigKey #resetQuotedConfiguration | RESET .*? #resetConfiguration | unsupportedHiveNativeCommands .*? #failNativeCommand ; +quotedConfigKey Review comment: hmm, is it necessary to create an alias? How about `SET key= quotedIdentifier (EQ value=.*)?` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #27066: [SPARK-31317][SQL] Add withField method to Column
cloud-fan commented on a change in pull request #27066: URL: https://github.com/apache/spark/pull/27066#discussion_r464170447 ## File path: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ## @@ -871,6 +871,72 @@ class Column(val expr: Expression) extends Logging { */ def getItem(key: Any): Column = withExpr { UnresolvedExtractValue(expr, Literal(key)) } + // scalastyle:off line.size.limit + /** + * An expression that adds/replaces field in `StructType` by name. + * + * {{{ + * val df = sql("SELECT named_struct('a', 1, 'b', 2) struct_col") + * df.select($"struct_col".withField("c", lit(3))) Review comment: weird, we have tests to cover these examples. @fqaiser94 can you take a look? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] agrawaldevesh commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column
agrawaldevesh commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667773518 > Step 2: Say there is a right (build) side row (1, null, 3). It should be counted as a match against a row on the left side (1, 2, 3). What makes this tricky is that say you have a build row (1, 5, 3), then (1, 5, 3) should NOT match the probe row (1, 2, 3). But if you explode (1, 5, 3) into a (1, null, 3) then it might incorrectly match (1, 2, 3). How do you handle both of these subcases? Step 3: Consider a build row (1, 5, null), it should match the left row (1, null, 3). In addition, it should not match the build row (1, 5, 7). How do you handle these subcases? Above, when I say "match", I mean that the left side would match the build row and WON'T be returned. Whereas with non-match I mean that the left side would not match the build side and thus WILL be returned. We have different meanings for the words 'match' and 'not-match'. So please read my 'match' == 'NAAJ should not return the left row', and conversely for non-match. I would really really really encourage you to: - Please reread the paper section 6.2 in its entirety many times and understand the above cases. I had to read it many times myself. It is very tricky as you pointed out. - Add them as test cases comparing them with the original BNLJ implementation, both the negative and positive cases. This is really tricky and I don't think the current implementation you have of expanding the hash table with a simple lookup on the stream side would suffice. I will also try to play around with your PR locally and run them as tests to convince myself. I hope I am wrong ;-).
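The "match" definition used in this comment corresponds to SQL `NOT IN` semantics under three-valued logic, and it can be written down as a brute-force reference for tests. The sketch below is not Spark's BNLJ implementation; it assumes rows are `Vector[Option[Int]]` with `None` for NULL, and `NaajReferenceSketch`, `trueOrUnknown`, and `antiJoin` are hypothetical names.

```scala
object NaajReferenceSketch {
  // None models SQL NULL; this row type is an assumption made for illustration.
  type Row = Vector[Option[Int]]

  // Tuple equality is FALSE only if some column is non-null on both sides and differs;
  // otherwise it is TRUE or UNKNOWN, which counts as a "match" in the sense above.
  def trueOrUnknown(left: Row, build: Row): Boolean =
    left.zip(build).forall {
      case (Some(a), Some(b)) => a == b
      case _                  => true
    }

  // Null-aware anti join: return a left row only if it matches no build row.
  def antiJoin(left: Seq[Row], build: Seq[Row]): Seq[Row] =
    left.filterNot(l => build.exists(b => trueOrUnknown(l, b)))
}
```

Run against the cases in this comment: build row (1, null, 3) suppresses left row (1, 2, 3), build row (1, 5, 3) does not, and build row (1, 5, null) suppresses left row (1, null, 3). The check is quadratic, so it is only suitable as an oracle for comparing against a hash-based implementation.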
[GitHub] [spark] yaooqinn commented on pull request #28527: [SPARK-31709][SQL] Proper base path for database/table location when it is a relative path
yaooqinn commented on pull request #28527: URL: https://github.com/apache/spark/pull/28527#issuecomment-667772325 gentle ping @cloud-fan This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on pull request #29192: [SPARK-32393][SQL] Support PostgreSQL `bpchar` array
maropu commented on pull request #29192: URL: https://github.com/apache/spark/pull/29192#issuecomment-667772167 kindly ping. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangshisan commented on pull request #29266: [SPARK-32464][SQL] Support skew handling on join that has one side wi…
wangshisan commented on pull request #29266: URL: https://github.com/apache/spark/pull/29266#issuecomment-667772074 @cloud-fan @JkSelf Could you have a look? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #29320: [WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
HyukjinKwon commented on pull request #29320: URL: https://github.com/apache/spark/pull/29320#issuecomment-667772011
> What's docs/img/pyspark-components.pptx for?
It is for the image I used on the main page, in case some people want to edit it. There are other pptx files in `docs/img` as well for that purpose.
[GitHub] [spark] yaooqinn commented on a change in pull request #29303: [SPARK-32492][SQL] Fulfill missing column meta information COLUMN_SIZE /DECIMAL_DIGITS/NUM_PREC_RADIX/ORDINAL_POSITION for thrif
yaooqinn commented on a change in pull request #29303: URL: https://github.com/apache/spark/pull/29303#discussion_r464165400

## File path: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala

@@ -126,12 +124,52 @@ private[hive] class SparkGetColumnsOperation(
     HiveThriftServer2.eventManager.onStatementFinish(statementId)
   }
 
+  /**
+   * For numeric and datetime types, it returns the default size of its catalyst type
+   * For struct type, when its elements are fixed-size, the summation of all element sizes will be
+   * returned.
+   * For array, map, string, and binaries, the column size is variable, return null as unknown.
+   */
+  private def getColumnSize(typ: DataType): Option[Int] = typ match {

Review comment:
Hive does not return same result for each type
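
Editor's note: the rule described in the doc comment above (fixed-size types report a size, a struct reports the sum of its element sizes only when every element is fixed-size, variable-size types report unknown) can be sketched without Spark on the classpath. This is only an illustration, not the code from the pull request: the `Sketch*` types, the `ColumnSizeSketch` object, and the byte sizes 4 and 8 are hypothetical stand-ins for the Catalyst types and their default sizes.

```scala
// Hypothetical stand-ins for Catalyst data types; these are NOT Spark's classes.
sealed trait SketchType
case object SketchInt extends SketchType
case object SketchLong extends SketchType
case object SketchTimestamp extends SketchType
case object SketchString extends SketchType
final case class SketchArray(element: SketchType) extends SketchType
final case class SketchStruct(fields: List[SketchType]) extends SketchType

object ColumnSizeSketch {
  // Returns Some(size) for fixed-size types and None when the size is unknown.
  def columnSize(typ: SketchType): Option[Int] = typ match {
    // Fixed-size types: sizes here are assumed values for illustration only.
    case SketchInt => Some(4)
    case SketchLong | SketchTimestamp => Some(8)
    // Struct: sum the element sizes; a single unknown element makes the sum unknown.
    case SketchStruct(fields) =>
      fields.foldLeft(Option(0)) { (acc, field) =>
        for (total <- acc; size <- columnSize(field)) yield total + size
      }
    // Variable-size types: reported as unknown.
    case SketchString | SketchArray(_) => None
  }
}
```

Under these assumptions, a struct of an int and a long has a known size, while adding a string field to the same struct makes the whole result `None`. Modelling "unknown" as `None` rather than a sentinel number keeps the struct case from silently summing a bogus value.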