[GitHub] spark issue #13517: [SPARK-14839][SQL] Support for other types for `tablePro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13517 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61699/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13517: [SPARK-14839][SQL] Support for other types for `tablePro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13517 Merged build finished. Test PASSed.
[GitHub] spark issue #13517: [SPARK-14839][SQL] Support for other types for `tablePro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13517 **[Test build #61699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61699/consoleFull)** for PR 13517 at commit [`1307f8c`](https://github.com/apache/spark/commit/1307f8cbdd4b26885a81ad6e5770c2bb82a0159e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13517: [SPARK-14839][SQL] Support for other types for `tablePro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13517 Merged build finished. Test PASSed.
[GitHub] spark issue #13517: [SPARK-14839][SQL] Support for other types for `tablePro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13517 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61698/
[GitHub] spark issue #13517: [SPARK-14839][SQL] Support for other types for `tablePro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13517 **[Test build #61698 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61698/consoleFull)** for PR 13517 at commit [`30dfea0`](https://github.com/apache/spark/commit/30dfea05bb0ce864a7ccb5fe6a2d091c7fe3c988).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #12203: [SPARK-14423][YARN] Avoid same name files added to distr...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/12203 @RicoGit This is a behavior change for how jars are uploaded to the distributed cache, so I'm not sure it is suitable to back-port to branch 1.6. Also, this problem is not so severe in 1.6, since we still build an assembly jar for packaging.
[GitHub] spark issue #14008: [SPARK-16281][SQL] Implement parse_url SQL function
Github user janplus commented on the issue: https://github.com/apache/spark/pull/14008 @dongjoon-hyun @cloud-fan It is nice to have you review my PR. Thank you! I have added a new commit with the following changes: 1. Reverted the driver-side literal key invalidation. 2. Resolved conflicts with master.
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user janplus commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69401574 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 0)
     checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 0)
   }
+
+  test("ParseUrl") {
+    def checkParseUrl(expected: String, urlStr: String, partToExtract: String): Unit = {
+      checkEvaluation(
+        ParseUrl(Seq(Literal.create(urlStr, StringType),
+          Literal.create(partToExtract, StringType))), expected)
+    }
+    def checkParseUrlWithKey(
+        expected: String, urlStr: String,
+        partToExtract: String, key: String): Unit = {
+      checkEvaluation(
+        ParseUrl(Seq(Literal.create(urlStr, StringType), Literal.create(partToExtract, StringType),
+          Literal.create(key, StringType))), expected)
+    }
+
+    checkParseUrl("spark.apache.org", "http://spark.apache.org/path?query=1", "HOST")
+    checkParseUrl("/path", "http://spark.apache.org/path?query=1", "PATH")
+    checkParseUrl("query=1", "http://spark.apache.org/path?query=1", "QUERY")
+    checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref", "REF")
+    checkParseUrl("http", "http://spark.apache.org/path?query=1", "PROTOCOL")
+    checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1", "FILE")
+    checkParseUrl("spark.apache.org:8080", "http://spark.apache.org:8080/path?query=1", "AUTHORITY")
+    checkParseUrl("userinfo", "http://useri...@spark.apache.org/path?query=1", "USERINFO")
+    checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1", "QUERY", "query")
+
+    // Null checking
+    checkParseUrl(null, null, "HOST")
+    checkParseUrl(null, "http://spark.apache.org/path?query=1", null)
+    checkParseUrl(null, null, null)
+    checkParseUrl(null, "test", "HOST")
+    checkParseUrl(null, "http://spark.apache.org/path?query=1", "NO")
+    checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "HOST", "query")
+    checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "QUERY", "quer")
+    checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "QUERY", null)
+    checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "QUERY", "")
+
+    // exceptional cases
+    intercept[java.util.regex.PatternSyntaxException] {
--- End diff --
OK, @cloud-fan
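For readers following the test cases above: `parse_url` extracts a named part (HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, USERINFO) from a URL and returns NULL for null or unparsable input; with a third argument it pulls out a single query parameter. A rough pure-Python sketch of these semantics using the standard library (the helper name and its edge-case behavior are our approximation, not Spark's actual Java-based implementation):

```python
from urllib.parse import urlparse, parse_qs

def parse_url(url, part, key=None):
    """Approximate Spark SQL's parse_url: extract one component of a URL."""
    if url is None or part is None:
        return None
    try:
        u = urlparse(url)
    except ValueError:
        return None
    if part == "QUERY" and key is not None:
        # Three-argument form: extract a single query parameter's value.
        vals = parse_qs(u.query).get(key)
        return vals[0] if vals else None
    mapping = {
        "HOST": u.hostname,
        "PATH": u.path,
        "QUERY": u.query or None,
        "REF": u.fragment or None,
        "PROTOCOL": u.scheme or None,
        "FILE": u.path + ("?" + u.query if u.query else ""),
        "AUTHORITY": u.netloc or None,
        "USERINFO": u.username,
    }
    # Unknown parts (e.g. "NO") fall through to None, matching the null checks.
    return mapping.get(part)

print(parse_url("http://spark.apache.org/path?query=1", "HOST"))            # spark.apache.org
print(parse_url("http://spark.apache.org/path?query=1", "QUERY", "query"))  # 1
```

Note this sketch does not reproduce every corner case in the suite (for instance the null-key behavior or the `PatternSyntaxException` path, which comes from the key being compiled into a Java regex).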
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14035#discussion_r69400558 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala ---
@@ -158,7 +159,7 @@ class RandomForestClassifierSuite
   }
   test("Fitting without numClasses in metadata") {
-    val df: DataFrame = spark.createDataFrame(TreeTests.featureImportanceData(sc))
+    val df: DataFrame = TreeTests.featureImportanceData(sc).toDF()
--- End diff --
I also agree with this, but it seems both are actually fine, judging from this discussion: https://github.com/apache/spark/pull/12452
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14035#discussion_r69400523 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
@@ -116,7 +117,7 @@ class MultilayerPerceptronClassifierSuite
     // the input seed is somewhat magic, to make this test pass
     val rdd = sc.parallelize(generateMultinomialLogisticInput(
       coefficients, xMean, xVariance, true, nPoints, 1), 2)
-    val dataFrame = spark.createDataFrame(rdd).toDF("label", "features")
+    val dataFrame = rdd.toDF("label", "features")
--- End diff --
Again, I also agree with this, but I am hesitant to change it because it is explicitly set.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14035#discussion_r69400465 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -55,7 +56,7 @@ class LogisticRegressionSuite
       generateMultinomialLogisticInput(coefficients, xMean, xVariance,
         addIntercept = true, nPoints, 42)
-    spark.createDataFrame(sc.parallelize(testData, 4))
+    sc.parallelize(testData, 4).toDF()
--- End diff --
I guess, to be strict, `sc.parallelize(testData, 4).toDF()` and `testData.toDF.repartition(4)` would not be exactly the same. It seems the author of this test code intended to explicitly set the initial number of partitions to `4`, so I left it as it is. Although I think it is as you said, I am not 100% sure, and it is not part of this issue.
[GitHub] spark issue #13517: [SPARK-14839][SQL] Support for other types for `tablePro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13517 **[Test build #61699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61699/consoleFull)** for PR 13517 at commit [`1307f8c`](https://github.com/apache/spark/commit/1307f8cbdd4b26885a81ad6e5770c2bb82a0159e).
[GitHub] spark issue #13517: [SPARK-14839][SQL] Support for other types for `tablePro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13517 **[Test build #61698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61698/consoleFull)** for PR 13517 at commit [`30dfea0`](https://github.com/apache/spark/commit/30dfea05bb0ce864a7ccb5fe6a2d091c7fe3c988).
[GitHub] spark pull request #13517: [SPARK-14839][SQL] Support for other types for `t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13517#discussion_r69399556 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala ---
@@ -1117,4 +1117,26 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
       }
     }
   }
+
+  test("SPARK-14839: Support for other types as option in OPTIONS clause") {
--- End diff --
Sure!
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14033 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61696/
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14033 Merged build finished. Test PASSed.
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14033 **[Test build #61696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61696/consoleFull)** for PR 14033 at commit [`f02e1dd`](https://github.com/apache/spark/commit/f02e1dd0928992e530ea8d8a0663050fecdcd4ce).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Stack(children: Seq[Expression])`
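The `Stack` expression being added in this PR implements Hive's `stack(n, v1, ..., vk)` table-generating function, which distributes k values row-wise into n rows of ceil(k/n) columns, padding the last row with NULLs. A small Python sketch of that semantics (our illustrative helper, not the Catalyst implementation):

```python
def stack(n, *values):
    """Distribute values row-wise into n rows, padding the last row with None."""
    width = -(-len(values) // n)  # ceil(len(values) / n)
    rows = []
    for r in range(n):
        chunk = values[r * width:(r + 1) * width]
        rows.append(tuple(chunk) + (None,) * (width - len(chunk)))
    return rows

print(stack(2, 1, 2, 3))  # [(1, 2), (3, None)]
```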
[GitHub] spark issue #13532: [SPARK-15204][SQL] improve nullability inference for Agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13532 Merged build finished. Test PASSed.
[GitHub] spark issue #13532: [SPARK-15204][SQL] improve nullability inference for Agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13532 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61697/
[GitHub] spark issue #13532: [SPARK-15204][SQL] improve nullability inference for Agg...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13532 **[Test build #61697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61697/consoleFull)** for PR 13532 at commit [`23263e4`](https://github.com/apache/spark/commit/23263e4940f5b6e67ee7b06b9e0fad72bbe7606f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13532: [SPARK-15204][SQL] improve nullability inference for Agg...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13532 **[Test build #61697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61697/consoleFull)** for PR 13532 at commit [`23263e4`](https://github.com/apache/spark/commit/23263e4940f5b6e67ee7b06b9e0fad72bbe7606f).
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14033 **[Test build #61696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61696/consoleFull)** for PR 14033 at commit [`f02e1dd`](https://github.com/apache/spark/commit/f02e1dd0928992e530ea8d8a0663050fecdcd4ce).
[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13532#discussion_r69397207 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala ---
@@ -305,4 +305,13 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext {
     val ds = Seq(1, 2, 3).toDS()
     checkDataset(ds.select(MapTypeBufferAgg.toColumn), 1)
   }
+
+  test("spark-15204 improve nullability inference for Aggregator") {
+    val ds1 = Seq(1, 3, 2, 5).toDS()
+    assert(ds1.select(typed.sum((i: Int) => i)).schema.head.nullable === false)
+    val ds2 = Seq(AggData(1, "a"), AggData(2, "a")).toDS()
+    assert(ds2.groupByKey(_.b).agg(SeqAgg.toColumn).schema(1).nullable === true)
--- End diff --
The last assert with NameAgg tests a String as the output of the Aggregator. Is that good enough?
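The asserts above check that a result column's nullability now follows the Aggregator's output type. A loose Python sketch of the rule being tested, stated over Scala type names (the rule table and helper are our simplification for illustration, not Spark's actual encoder-based inference):

```python
def inferred_nullable(result_type):
    """Simplified nullability rule: Scala primitive results are non-nullable,
    object results (String, Seq, case classes, ...) remain nullable."""
    primitives = {"Int", "Long", "Float", "Double", "Short", "Byte", "Boolean"}
    return result_type not in primitives

# Mirrors the test: typed.sum over Int -> non-nullable, SeqAgg -> nullable.
print(inferred_nullable("Int"))  # False
print(inferred_nullable("Seq"))  # True
```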
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14036 Merged build finished. Test PASSed.
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14036 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61695/
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14036 **[Test build #61695 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61695/consoleFull)** for PR 14036 at commit [`054d54e`](https://github.com/apache/spark/commit/054d54e1dae7c372fb1c7ef5ee52f7d9ff6a619d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14004 Merged build finished. Test PASSed.
[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14004 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61694/
[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14004 **[Test build #61694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61694/consoleFull)** for PR 14004 at commit [`d021d39`](https://github.com/apache/spark/commit/d021d394a630fd92e3f6f410139372db368c542e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14033 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61693/
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14033 Merged build finished. Test PASSed.
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14033 **[Test build #61693 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61693/consoleFull)** for PR 14033 at commit [`d6691f7`](https://github.com/apache/spark/commit/d6691f700737f4e4146b0c7861bff025f5af7eec).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Stack(children: Seq[Expression])`
[GitHub] spark pull request #13971: [SPARK-16289][SQL] Implement posexplode table gen...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13971#discussion_r69395876 --- Diff: python/pyspark/sql/functions.py ---
@@ -1637,6 +1637,27 @@ def explode(col):
     return Column(jc)
+
+@since(2.1)
+def posexplode(col):
--- End diff --
For this one, I thought the reason is that `explode` is already registered, and `posexplode` is its counterpart.
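For context on the function being discussed: `posexplode` behaves like `explode`, turning each element of an array into its own row, but additionally returns the element's position. A minimal pure-Python sketch of that semantics (an illustrative helper of our own; the real pyspark function just wraps the JVM expression and operates on Columns):

```python
def posexplode(values):
    """Return (pos, col) pairs like Spark's posexplode generator.
    A null or empty array produces no rows."""
    if values is None:
        return []
    return list(enumerate(values))

print(posexplode(["a", "b", "c"]))  # [(0, 'a'), (1, 'b'), (2, 'c')]
```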
[GitHub] spark issue #13976: [SPARK-16288][SQL] Implement inline table generating fun...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/13976 Thank you, @cloud-fan and @rxin ! :)
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 Sorry. This is my misunderstanding. I was confused between ```UnsafeRow``` and ```UnsafeArrayData```. ```UnsafeArrayData``` keeps only one type in an instance. ```[integer] [offset] [float] [offset]``` is invalid. ```[integer] [integer] [integer]``` or ```[offset] [offset] [offset]``` is valid. Option 1 could work.
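The resolution above is that the "option 1" layout works because all slots in an `UnsafeArrayData` instance hold the same kind of entry, so offsets are always adjacent. A plain-Python sketch of that idea, under my reading of the thread (an assumption, not the actual `UnsafeArrayData` code): when every slot stores an offset into the variable-length region, element `i`'s byte length is the gap to the next offset.

```python
def element_lengths(offsets, data_region_end):
    """Sketch of the adjacent-offsets idea: the byte length of element i
    is the gap between its offset and the next one; the last element is
    closed by the end of the data region."""
    lengths = []
    for i, start in enumerate(offsets):
        end = offsets[i + 1] if i + 1 < len(offsets) else data_region_end
        lengths.append(end - start)
    return lengths

# Three variable-length elements stored back to back at byte offsets
# 0, 5, and 12, with the data region ending at byte 20:
print(element_lengths([0, 5, 12], 20))  # → [5, 7, 8]
```

A mixed layout such as `[integer] [offset] [float] [offset]` would break this subtraction, which is why only a homogeneous layout is valid.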
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14036 **[Test build #61695 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61695/consoleFull)** for PR 14036 at commit [`054d54e`](https://github.com/apache/spark/commit/054d54e1dae7c372fb1c7ef5ee52f7d9ff6a619d).
[GitHub] spark pull request #13971: [SPARK-16289][SQL] Implement posexplode table gen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13971#discussion_r69395230 --- Diff: python/pyspark/sql/functions.py --- @@ -1637,6 +1637,27 @@ def explode(col): return Column(jc) +@since(2.1) +def posexplode(col): --- End diff -- cc @rxin , is `posexplode` a special hive fallback function that we need to register? other ones don't get registered in `functions`
[GitHub] spark pull request #13976: [SPARK-16288][SQL] Implement inline table generat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13976
[GitHub] spark issue #13976: [SPARK-16288][SQL] Implement inline table generating fun...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13976 merging to master, thanks!
[GitHub] spark pull request #13976: [SPARK-16288][SQL] Implement inline table generat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13976#discussion_r69395123

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala ---

@@ -195,3 +195,38 @@ case class Explode(child: Expression) extends ExplodeBase(child, position = fals
   extended = "> SELECT _FUNC_(array(10,20));\n 0\t10\n 1\t20")
 // scalastyle:on line.size.limit
 case class PosExplode(child: Expression) extends ExplodeBase(child, position = true)
+
+/**
+ * Explodes an array of structs into a table.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a) - Explodes an array of structs into a table.",
+  extended = "> SELECT _FUNC_(array(struct(1, 'a'), struct(2, 'b')));\n [1,a]\n [2,b]")
+case class Inline(child: Expression) extends UnaryExpression with Generator with CodegenFallback {
+
+  override def children: Seq[Expression] = child :: Nil
+
+  override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
+    case ArrayType(et, _) if et.isInstanceOf[StructType] =>
+      TypeCheckResult.TypeCheckSuccess
+    case _ =>
+      TypeCheckResult.TypeCheckFailure(
+        s"input to function $prettyName should be array of struct type, not ${child.dataType}")
+  }
+
+  override def elementSchema: StructType = child.dataType match {
+    case ArrayType(et : StructType, _) => et
+  }
+
+  private lazy val numFields = elementSchema.fields.length
+
+  override def eval(input: InternalRow): TraversableOnce[InternalRow] = {
+    val inputArray = child.eval(input).asInstanceOf[ArrayData]
+    if (inputArray == null) {
+      Nil
+    } else {
+      for (i <- 0 until inputArray.numElements())
+        yield inputArray.getStruct(i, numFields)

--- End diff --

ah i see, `for-yield` returns an iterator.
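The lazy-evaluation point being discussed in the `Inline.eval` diff above can be sketched in plain Python: a generator flattens an array of structs into rows one at a time, without materializing the whole result, and a null input array yields no rows. This is an illustration of the semantics only, not Spark's implementation.

```python
def inline(array_of_structs):
    """Sketch of inline semantics: explode an array of structs into rows,
    lazily, one row per struct. A null (None) input array yields no rows."""
    if array_of_structs is None:
        return
    for struct in array_of_structs:
        yield tuple(struct)  # each struct's fields become one output row

print(list(inline([(1, "a"), (2, "b")])))  # → [(1, 'a'), (2, 'b')]
```

In SQL terms this corresponds to `SELECT inline(array(struct(1, 'a'), struct(2, 'b')))` producing the rows `(1, a)` and `(2, b)`.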
[GitHub] spark issue #13494: [SPARK-15752] [SQL] support optimization for metadata on...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13494 LGTM except for some naming/testing comments. This is not a small patch and definitely needs more reviewers. Let's also add more comments inside the new rule to make it easier to review. cc @yhuai @liancheng to take a look.
[GitHub] spark issue #14008: [SPARK-16281][SQL] Implement parse_url SQL function
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14008 So far, there have been some differing opinions on `new URL` error handling and `Literal Pattern` handling. It's a frequent pattern in review comments. :) I agree with @cloud-fan 's opinions too. Where I had strong objections, I wrote more comments. @janplus , you can consider @cloud-fan 's comments to supersede mine. I enjoyed the reviewing and discussion on this PR. I hope your PR is merged to master soon!
[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13494#discussion_r69395021 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2865,4 +2865,15 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { sql(s"SELECT '$literal' AS DUMMY"), Row(s"$expected") :: Nil) } + + test("spark-15752 metadata only optimizer for datasource table") { +val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) "even" else "odd")) --- End diff -- also use `withSQLConf` here to make sure the optimization is enabled.
[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13494#discussion_r69395014 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/MetadataOnlyOptimizerSuite.scala --- @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution + +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SharedSQLContext + +class MetadataOnlyOptimizerSuite extends QueryTest with SharedSQLContext { + import testImplicits._ + + override def beforeAll(): Unit = { +super.beforeAll() +val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) "even" else "odd")) + .toDF("id", "data", "partId", "part") +data.write.partitionBy("partId", "part").mode("append").saveAsTable("srcpart_15752") + } + + private def checkWithMetadataOnly(df: DataFrame): Unit = { +val localRelations = df.queryExecution.optimizedPlan.collect { + case l @ LocalRelation(_, _) => l +} +assert(localRelations.size == 1) + } + + private def checkWithoutMetadataOnly(df: DataFrame): Unit = { +val localRelations = df.queryExecution.optimizedPlan.collect{ + case l @ LocalRelation(_, _) => l +} +assert(localRelations.size == 0) + } + + test("spark-15752 metadata only optimizer for partition table") { +withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") { + checkWithMetadataOnly(sql("select part from srcpart_15752 where part = 0 group by part")) + checkWithMetadataOnly(sql("select max(part) from srcpart_15752")) + checkWithMetadataOnly(sql("select max(part) from srcpart_15752 where part = 0")) + checkWithMetadataOnly( +sql("select part, min(partId) from srcpart_15752 where part = 0 group by part")) + checkWithMetadataOnly( +sql("select max(x) from (select part + 1 as x from srcpart_15752 where part = 1) t")) + checkWithMetadataOnly(sql("select distinct part from srcpart_15752")) + checkWithMetadataOnly(sql("select distinct part, partId from srcpart_15752")) + checkWithMetadataOnly( +sql("select distinct x from (select part + 1 as x from srcpart_15752 where part = 0) t")) + + // Now donot support metadata only optimizer + checkWithoutMetadataOnly(sql("select part, max(id) 
from srcpart_15752 group by part")) + checkWithoutMetadataOnly(sql("select distinct part, id from srcpart_15752")) + checkWithoutMetadataOnly(sql("select part, sum(partId) from srcpart_15752 group by part")) + checkWithoutMetadataOnly( +sql("select part from srcpart_15752 where part = 1 group by rollup(part)")) + checkWithoutMetadataOnly( +sql("select part from (select part from srcpart_15752 where part = 0 union all " + --- End diff -- the last 2 cases can be added in follow-up PRs :)
[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13494#discussion_r69394970

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/MetadataOnlyOptimizerSuite.scala ---

@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class MetadataOnlyOptimizerSuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) "even" else "odd"))
+      .toDF("id", "data", "partId", "part")
+    data.write.partitionBy("partId", "part").mode("append").saveAsTable("srcpart_15752")

--- End diff --

The session is shared among all test suites, so we should drop the table after all tests here, or we may pollute other test suites.
[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13494#discussion_r69394921 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala --- @@ -30,6 +30,7 @@ class SparkOptimizer( extends Optimizer(catalog, conf) { override def batches: Seq[Batch] = super.batches :+ +Batch("Metadata Only Optimization", Once, MetadataOnlyOptimizer(catalog, conf)) :+ --- End diff -- `Metadata Only` seems not so intuitive, any ideas here? cc @yhuai @liancheng
[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13494#discussion_r69394890

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/MetadataOnlyOptimizer.scala ---

@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition data without scanning files.
+ * It's used for operators that only need distinct values. Currently only [[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *    e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *    e.g. SELECT count(DISTINCT col) FROM tbl GROUP BY col.
+ * 3. aggregate function on partition columns which have same result w or w/o DISTINCT keyword.
+ *    e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class MetadataOnlyOptimizer(
+    catalog: SessionCatalog,
+    conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+    if (!conf.optimizerMetadataOnly) {
+      return plan
+    }
+
+    plan.transform {
+      case a @ Aggregate(_, aggExprs, child @ PartitionedRelation(partAttrs, relation)) =>
+        if (a.references.subsetOf(partAttrs)) {
+          val aggFunctions = aggExprs.flatMap(_.collect {
+            case agg: AggregateExpression => agg
+          })
+          val isPartitionDataOnly = aggFunctions.isEmpty || aggFunctions.forall { agg =>
+            agg.isDistinct || (agg.aggregateFunction match {
+              case _: Max => true
+              case _: Min => true
+              case _ => false
+            })
+          }
+          if (isPartitionDataOnly) {
+            a.withNewChildren(Seq(usePartitionData(child, relation)))
+          } else {
+            a
+          }
+        } else {
+          a
+        }
+    }
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): LogicalPlan = {
+    child transform {
+      case plan if plan eq relation =>
+        relation match {
+          case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
+            val partColumns = fsRelation.partitionSchema.map(_.name.toLowerCase).toSet
+            val partAttrs = l.output.filter(a => partColumns.contains(a.name.toLowerCase))
+            val partitionData = fsRelation.location.listFiles(Nil)
+            LocalRelation(partAttrs, partitionData.map(_.values))

--- End diff --

This is something we need to discuss, there may be a lot of partition values and using `LocalRelation` may not give enough parallelism here. cc @yhuai @liancheng
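The eligibility test in the rule above (every referenced column is a partition column, and every aggregate call is either DISTINCT or distinct-insensitive like max/min) can be sketched in plain Python. The function and parameter names here are illustrative, not Spark API:

```python
# Aggregates whose result is the same with or without DISTINCT, so they
# can be answered from the distinct partition values alone.
DISTINCT_SAFE = {"max", "min"}

def is_metadata_only(referenced_cols, agg_calls, partition_cols):
    """Sketch of the metadata-only eligibility check: every referenced
    column must be a partition column, and every aggregate call (given as
    (function_name, is_distinct) pairs) must be DISTINCT or
    distinct-insensitive."""
    if not set(referenced_cols) <= set(partition_cols):
        return False
    return all(is_distinct or fn in DISTINCT_SAFE
               for fn, is_distinct in agg_calls)

# SELECT max(part) FROM tbl -> eligible:
print(is_metadata_only({"part"}, [("max", False)], {"part", "partId"}))  # → True
# SELECT sum(partId) FROM tbl GROUP BY part -> not eligible (sum needs all rows):
print(is_metadata_only({"part", "partId"}, [("sum", False)], {"part", "partId"}))  # → False
```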
[GitHub] spark pull request #13976: [SPARK-16288][SQL] Implement inline table generat...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13976#discussion_r69394891

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala ---

@@ -195,3 +195,38 @@ case class Explode(child: Expression) extends ExplodeBase(child, position = fals
   extended = "> SELECT _FUNC_(array(10,20));\n 0\t10\n 1\t20")
 // scalastyle:on line.size.limit
 case class PosExplode(child: Expression) extends ExplodeBase(child, position = true)
+
+/**
+ * Explodes an array of structs into a table.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a) - Explodes an array of structs into a table.",
+  extended = "> SELECT _FUNC_(array(struct(1, 'a'), struct(2, 'b')));\n [1,a]\n [2,b]")
+case class Inline(child: Expression) extends UnaryExpression with Generator with CodegenFallback {
+
+  override def children: Seq[Expression] = child :: Nil
+
+  override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
+    case ArrayType(et, _) if et.isInstanceOf[StructType] =>
+      TypeCheckResult.TypeCheckSuccess
+    case _ =>
+      TypeCheckResult.TypeCheckFailure(
+        s"input to function $prettyName should be array of struct type, not ${child.dataType}")
+  }
+
+  override def elementSchema: StructType = child.dataType match {
+    case ArrayType(et : StructType, _) => et
+  }
+
+  private lazy val numFields = elementSchema.fields.length
+
+  override def eval(input: InternalRow): TraversableOnce[InternalRow] = {
+    val inputArray = child.eval(input).asInstanceOf[ArrayData]
+    if (inputArray == null) {
+      Nil
+    } else {
+      for (i <- 0 until inputArray.numElements())
+        yield inputArray.getStruct(i, numFields)

--- End diff --

Thank you, @cloud-fan . By the way, about this, @rxin gave me advice at the first commit of this PR:

> we don't need to materialize the array, do we? We can create an iterator to return the results.
[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13494#discussion_r69394848 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2865,4 +2865,15 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { sql(s"SELECT '$literal' AS DUMMY"), Row(s"$expected") :: Nil) } + + test("spark-15752 metadata only optimizer for datasource table") { +val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) "even" else "odd")) --- End diff -- put this code inside `withTable` when we create a table in the tests, then our framework will clean up the table automatically:
```
withTable("xxx") {
  ...
}
```
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14033 **[Test build #61693 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61693/consoleFull)** for PR 14033 at commit [`d6691f7`](https://github.com/apache/spark/commit/d6691f700737f4e4146b0c7861bff025f5af7eec).
[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14004 **[Test build #61694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61694/consoleFull)** for PR 14004 at commit [`d021d39`](https://github.com/apache/spark/commit/d021d394a630fd92e3f6f410139372db368c542e).
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14033 cc @rxin and @cloud-fan .
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14033 Rebased to resolve conflicts.
[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14004 Rebased to resolve conflicts.
[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13494#discussion_r69394772 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/MetadataOnlyOptimizer.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow} +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, SessionCatalog} +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.aggregate._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation} + +/** + * When scanning only partition columns, get results based on partition data without scanning files. + * It's used for operators that only need distinct values. Currently only [[Aggregate]] operator + * which satisfy the following conditions are supported: + * 1. aggregate expression is partition columns. + * e.g. SELECT col FROM tbl GROUP BY col. + * 2. 
aggregate function on partition columns with DISTINCT. + * e.g. SELECT count(DISTINCT col) FROM tbl GROUP BY col. --- End diff -- this example is wrong, we can not aggregate on grouping columns, it should be `SELECT count(DISTINCT col1) FROM tbl GROUP BY col2`
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14036 **[Test build #61692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61692/consoleFull)** for PR 14036 at commit [`e30747a`](https://github.com/apache/spark/commit/e30747a9126e88e04fa4adc36c4b42d70245cb85).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14036 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61692/ Test FAILed.
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 Option 1 can work for this array: ```UnsafeDataArray: ...[integer] [offset] [offset] [float]```. This is because 2 offsets are adjacent. Can option 1 work for this ```UnsafeDataArray: ...[integer] [offset] [float] [offset]```? How do we subtract 2 offsets that are not adjacent?
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14036 Merged build finished. Test FAILed.
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13680 Hmm, it looks like we are not on the same page... How could 2 offsets not be adjacent? We only keep offsets in the `value or offset` region, and put them one by one.
[GitHub] spark issue #13967: [SPARK-16278][SPARK-16279][SQL] Implement map_keys/map_v...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/13967 Thank you, @cloud-fan and @rxin !
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69394475 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.SparkConf +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder +import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeRow} +import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeArrayWriter} +import org.apache.spark.unsafe.Platform +import org.apache.spark.util.Benchmark + +/** + * Benchmark [[UnsafeArrayDataBenchmark]] for UnsafeArrayData + * To run this: + * build/sbt "sql/test-only *benchmark.UnsafeArrayDataBenchmark" + * + * Benchmarks in this file are skipped in normal builds. 
+ */ +class UnsafeArrayDataBenchmark extends BenchmarkBase { + + def calculateHeaderPortionInBytes(count: Int) : Int = { +// Use this assignment for SPARK-15962 +// val size = 4 + 4 * count +val size = UnsafeArrayData.calculateHeaderPortionInBytes(count) +size + } + + def readUnsafeArray(iters: Int): Unit = { +val count = 1024 * 1024 * 16 + +var intResult: Int = 0 +val intBuffer = new Array[Int](count) +val intEncoder = ExpressionEncoder[Array[Int]].resolveAndBind() +val intInternalRow = intEncoder.toRow(intBuffer) +val intUnsafeArray = intInternalRow.getArray(0) +val readIntArray = { i: Int => + var n = 0 + while (n < iters) { +val len = intUnsafeArray.numElements +var sum = 0.toInt +var i = 0 +while (i < len) { + sum += intUnsafeArray.getInt(i) + i += 1 +} +intResult = sum +n += 1 + } +} + +var doubleResult: Double = 0 +val doubleBuffer = new Array[Double](count) +val doubleEncoder = ExpressionEncoder[Array[Double]].resolveAndBind() +val doubleInternalRow = doubleEncoder.toRow(doubleBuffer) +val doubleUnsafeArray = doubleInternalRow.getArray(0) +val readDoubleArray = { i: Int => + var n = 0 + while (n < iters) { +val len = doubleUnsafeArray.numElements +var sum = 0.toDouble +var i = 0 +while (i < len) { + sum += doubleUnsafeArray.getDouble(i) + i += 1 +} +doubleResult = sum +n += 1 + } +} + +val benchmark = new Benchmark("Read UnsafeArrayData", count * iters) +benchmark.addCase("Int")(readIntArray) +benchmark.addCase("Double")(readDoubleArray) +benchmark.run +/* +Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.4 +Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz + +Read UnsafeArrayData:Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +Int281 / 296597.5 1.7 1.0X +Double 298 / 301562.3 1.8 0.9X +*/ + } + + def writeUnsafeArray(iters: Int): Unit = { +val count = 1024 * 1024 * 16 + +val intUnsafeRow = new UnsafeRow(1) +val intUnsafeArrayWriter = new UnsafeArrayWriter +val intBufferHolder = new BufferHolder(intUnsafeRow, 64) +intBufferHolder.reset() 
+intUnsafeArrayWriter.initialize(intBufferHolder, count, 4) +val intCursor = intBufferHolder.cursor +val writeIntArray = { i: Int => + var n = 0 + while (n < iters) { +intBufferHolder.cursor = intCursor +val len = co
[GitHub] spark issue #14012: [SPARK-16343][SQL] Improve the PushDownPredicate rule to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14012 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61691/ Test PASSed.
[GitHub] spark issue #14012: [SPARK-16343][SQL] Improve the PushDownPredicate rule to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14012 Merged build finished. Test PASSed.
[GitHub] spark issue #14012: [SPARK-16343][SQL] Improve the PushDownPredicate rule to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14012 **[Test build #61691 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61691/consoleFull)** for PR 14012 at commit [`856d86d`](https://github.com/apache/spark/commit/856d86d788b318c2975a5318b181678f4b71f5bc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13990 Merged build finished. Test PASSed.
[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13990 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61690/ Test PASSed.
[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13990 **[Test build #61690 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61690/consoleFull)** for PR 13990 at commit [`6b2390d`](https://github.com/apache/spark/commit/6b2390d2dc141553eef3df9dd34744b9d08ccfdb). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class StringToMap(child: Expression, pairDelim: Expression, keyValueDelim: Expression)`
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14036 **[Test build #61692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61692/consoleFull)** for PR 14036 at commit [`e30747a`](https://github.com/apache/spark/commit/e30747a9126e88e04fa4adc36c4b42d70245cb85).
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14036 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61689/ Test FAILed.
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14036 Merged build finished. Test FAILed.
[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14036 **[Test build #61689 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61689/consoleFull)** for PR 14036 at commit [`faa2fd3`](https://github.com/apache/spark/commit/faa2fd3b3d8a91f46bac6260e3ddd9745cbce758). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14012: [SPARK-16343][SQL] Improve the PushDownPredicate rule to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14012 **[Test build #61691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61691/consoleFull)** for PR 14012 at commit [`856d86d`](https://github.com/apache/spark/commit/856d86d788b318c2975a5318b181678f4b71f5bc).
[GitHub] spark issue #14012: [SPARK-16343][SQL] Improve the PushDownPredicate rule to...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14012 ok to test
[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13765 LGTM, cc @yhuai to take another look
[GitHub] spark issue #14007: [SPARK-16329] [SQL] Star Expansion over Table Containing...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14007 Sure, will do it soon. Thanks!
[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13990#discussion_r69393130 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala --- @@ -393,3 +393,73 @@ case class CreateNamedStructUnsafe(children: Seq[Expression]) extends Expression override def prettyName: String = "named_struct_unsafe" } + +/** + * Creates a map after splitting the input text into key/value pairs using delimeters + */ +@ExpressionDescription( + usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map after splitting the text into +key/value pairs using delimiters. +Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""") +case class StringToMap(child: Expression, pairDelim: Expression, keyValueDelim: Expression) + extends TernaryExpression with ExpectsInputTypes { + + def this(child: Expression) = { +this(child, Literal(","), Literal("=")) + } + + override def children: Seq[Expression] = Seq(child, pairDelim, keyValueDelim) + + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, StringType) --- End diff -- Does Hive allow a non-literal delimiter? e.g. `str_to_map(text, d1, d2)` where `d1` and `d2` are table columns
[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13990#discussion_r69393113 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala --- @@ -393,3 +393,73 @@ case class CreateNamedStructUnsafe(children: Seq[Expression]) extends Expression override def prettyName: String = "named_struct_unsafe" } + +/** + * Creates a map after splitting the input text into key/value pairs using delimeters + */ +@ExpressionDescription( + usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map after splitting the text into +key/value pairs using delimiters. +Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""") +case class StringToMap(child: Expression, pairDelim: Expression, keyValueDelim: Expression) --- End diff -- How about renaming `child` to `text`, to make it consistent with the comment: `_FUNC_(text[, pairDelim, keyValueDelim])`?
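For readers unfamiliar with the function under review, `str_to_map` behaves roughly as below. This is a plain-Python approximation of the semantics described in the quoted docstring (defaults `','` and `'='`), not the Catalyst implementation:

```python
def str_to_map(text, pair_delim=",", kv_delim="="):
    """Split text into pairs, then each pair into key and value.
    A pair with no key/value delimiter maps its whole text to None."""
    result = {}
    for pair in text.split(pair_delim):
        key, sep, value = pair.partition(kv_delim)
        result[key] = value if sep else None
    return result

print(str_to_map("a=1,b=2"))          # {'a': '1', 'b': '2'}
print(str_to_map("k:v;x:y", ";", ":"))  # {'k': 'v', 'x': 'y'}
```

The review question about non-literal delimiters matters because the Catalyst expression would need per-row delimiter evaluation if `pairDelim`/`keyValueDelim` could be columns rather than constants.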
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13680 One more thought about the format: `UnsafeRow` uses 8 bytes to store the offset and length for variable-length types. This is because `UnsafeRow` is word-aligned, so we can't calculate the element size by subtracting 2 adjacent offsets, but need extra length information. This is also why the unsafe row writer calls `zeroOutPaddingBytes` before writing a variable-length field. However, `UnsafeArrayData` doesn't have this requirement, as it's already a field of a row and will itself be word-aligned. So we can: 1. only use 4 bytes to store the offset for variable-length elements, and calculate the length by subtracting 2 adjacent offsets. 2. still use 8 bytes to store offset and length, and follow the unsafe row writer in calling `zeroOutPaddingBytes` before writing. Personally I prefer option 1 as it's more compact.
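The length-by-subtraction trick in option 1 can be sketched outside Spark. The buffer layout below is illustrative only (it is not the real `UnsafeArrayData` format): a 4-byte offset per element plus one end sentinel, so element `i` spans `[offsets[i], offsets[i+1])` and needs no separate length field:

```python
import struct

# Three variable-length byte strings packed back to back.
values = [b"ab", b"", b"hello"]
data = b"".join(values)

# Option-1 style header: one offset per element plus an end sentinel.
offsets = [0]
for v in values:
    offsets.append(offsets[-1] + len(v))
header = struct.pack(f"<{len(offsets)}i", *offsets)  # 4 bytes per offset

def element(i):
    # Length comes from subtracting two adjacent offsets.
    start, end = offsets[i], offsets[i + 1]
    return data[start:end]

assert [element(i) for i in range(len(values))] == values
assert len(element(2)) == offsets[3] - offsets[2]  # length by subtraction
```

This also answers the earlier question in the thread: the offsets live together in one region and are written one by one, so two consecutive offsets are always adjacent regardless of how values and fixed-length fields interleave.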
[GitHub] spark issue #13409: [SPARK-15667][SQL]Throw exception if columns number of o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13409 Merged build finished. Test FAILed.
[GitHub] spark issue #13409: [SPARK-15667][SQL]Throw exception if columns number of o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13409 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61688/ Test FAILed.
[GitHub] spark issue #13409: [SPARK-15667][SQL]Throw exception if columns number of o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13409 **[Test build #61688 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61688/consoleFull)** for PR 13409 at commit [`50977da`](https://github.com/apache/spark/commit/50977da0e4705b14ab0332e3b79515190a6bc8a9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392928 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala ---
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392901 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala ---
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392883 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala --- (quotes the same hunk of UnsafeArrayDataBenchmark.scala as the comment above)
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392853 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala --- (quotes the same hunk of UnsafeArrayDataBenchmark.scala as the first comment above)
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392823 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala --- @@ -0,0 +1,298 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.SparkConf +import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeRow} +import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeArrayWriter} +import org.apache.spark.unsafe.Platform +import org.apache.spark.util.Benchmark + +/** + * Benchmark [[UnsafeArrayDataBenchmark]] for UnsafeArrayData + * To run this: + * build/sbt "sql/test-only *benchmark.UnsafeArrayDataBenchmark" + * + * Benchmarks in this file are skipped in normal builds. 
+ */ +class UnsafeArrayDataBenchmark extends BenchmarkBase { + + new SparkConf() +.setMaster("local[1]") +.setAppName("microbenchmark") +.set("spark.driver.memory", "3g") + + def calculateHeaderPortionInBytes(count: Int) : Int = { +// Use this assignment for SPARK-15962 +// val size = 4 + 4 * count +val size = UnsafeArrayData.calculateHeaderPortionInBytes(count) +size + } + + def readUnsafeArray(iters: Int): Unit = { --- End diff -- `encoder.toRow(array)` actually writes the unsafe array. It will generate code to use the array writer and write the data to buffer holder, I think it's good to benchmark it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
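The commented-out assignment in the quoted benchmark (`val size = 4 + 4 * count`) contrasts the old header layout, which spent 4 bytes on `numElements` plus a 4-byte offset per element, with the new one. A minimal sketch of both formulas, assuming (from the test expectations elsewhere in this PR) that the new header is 4 bytes for `numElements` followed by one 8-byte null-bit word per 64 elements:

```java
public class HeaderSizeSketch {
    // Old format: 4 bytes for numElements + a 4-byte offset per element.
    static int oldHeaderAndOffsets(int count) {
        return 4 + 4 * count;
    }

    // New format (assumed from this PR's tests): 4 bytes for numElements +
    // ceil(count / 64) 8-byte words of null-tracking bits; fixed-length
    // element types need no per-element offsets at all.
    static int newHeader(int count) {
        return 4 + ((count + 63) / 64) * 8;
    }
}
```

For the 16M-element arrays used in the benchmark, the old layout spends 64 MB on offsets alone, while the new header stays around 2 MB of null bits, which is part of why the two code paths are worth benchmarking separately.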
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392720 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala --- + def readUnsafeArray(iters: Int): Unit = { +val count = 1024 * 1024 * 16 + +var intResult: Int = 0 +val intBuffer = new Array[Int](count) --- End diff -- we should assign some random values for this array.
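The suggestion above is about benchmark hygiene: `new Array[Int](count)` is all zeros, so the measurement runs over constant data. A hypothetical helper (name and shape are mine, not the PR's) that seeds the input with random values while keeping runs reproducible:

```java
import java.util.Random;

public class RandomFillSketch {
    // Populate the benchmark input with random values rather than the
    // all-zero default, so the JIT and memory system don't get trivially
    // predictable data; a fixed seed keeps successive runs comparable.
    static int[] randomIntArray(int count, long seed) {
        Random rand = new Random(seed);
        int[] buffer = new int[count];
        for (int i = 0; i < count; i++) {
            buffer[i] = rand.nextInt();
        }
        return buffer;
    }
}
```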
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392701 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala --- +class UnsafeArrayDataBenchmark extends BenchmarkBase { + + def calculateHeaderPortionInBytes(count: Int) : Int = { --- End diff -- looks like we can remove this method?
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392691 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/UnsafeArraySuite.scala --- @@ -18,27 +18,110 @@ package org.apache.spark.sql.catalyst.util import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData +import org.apache.spark.unsafe.Platform class UnsafeArraySuite extends SparkFunSuite { - test("from primitive int array") { -val array = Array(1, 10, 100) -val unsafe = UnsafeArrayData.fromPrimitiveArray(array) -assert(unsafe.numElements == 3) -assert(unsafe.getSizeInBytes == 4 + 4 * 3 + 4 * 3) -assert(unsafe.getInt(0) == 1) -assert(unsafe.getInt(1) == 10) -assert(unsafe.getInt(2) == 100) + val booleanArray = Array(false, true) + val shortArray = Array(1.toShort, 10.toShort, 100.toShort) + val intArray = Array(1, 10, 100) + val longArray = Array(1.toLong, 10.toLong, 100.toLong) + val floatArray = Array(1.1.toFloat, 2.2.toFloat, 3.3.toFloat) + val doubleArray = Array(1.1, 2.2, 3.3) + + test("read array") { +val unsafeBoolean = ExpressionEncoder[Array[Boolean]].resolveAndBind(). + toRow(booleanArray).getArray(0) +assert(unsafeBoolean.isInstanceOf[UnsafeArrayData]) +assert(unsafeBoolean.numElements == 2) +booleanArray.zipWithIndex.map { case (e, i) => + assert(unsafeBoolean.getBoolean(i) == e) +} + +val unsafeShort = ExpressionEncoder[Array[Short]].resolveAndBind(). + toRow(shortArray).getArray(0) +assert(unsafeShort.isInstanceOf[UnsafeArrayData]) +assert(unsafeShort.numElements == 3) +shortArray.zipWithIndex.map { case (e, i) => + assert(unsafeShort.getShort(i) == e) +} + +val unsafeInt = ExpressionEncoder[Array[Int]].resolveAndBind(). 
+ toRow(intArray).getArray(0) +assert(unsafeInt.isInstanceOf[UnsafeArrayData]) +assert(unsafeInt.numElements == 3) +intArray.zipWithIndex.map { case (e, i) => + assert(unsafeInt.getInt(i) == e) +} + +val unsafeLong = ExpressionEncoder[Array[Long]].resolveAndBind(). + toRow(longArray).getArray(0) +assert(unsafeLong.isInstanceOf[UnsafeArrayData]) +assert(unsafeLong.numElements == 3) +longArray.zipWithIndex.map { case (e, i) => + assert(unsafeLong.getLong(i) == e) +} + +val unsafeFloat = ExpressionEncoder[Array[Float]].resolveAndBind(). + toRow(floatArray).getArray(0) +assert(unsafeFloat.isInstanceOf[UnsafeArrayData]) +assert(unsafeFloat.numElements == 3) +floatArray.zipWithIndex.map { case (e, i) => + assert(unsafeFloat.getFloat(i) == e) +} + +val unsafeDouble = ExpressionEncoder[Array[Double]].resolveAndBind(). + toRow(doubleArray).getArray(0) +assert(unsafeDouble.isInstanceOf[UnsafeArrayData]) +assert(unsafeDouble.numElements == 3) +doubleArray.zipWithIndex.map { case (e, i) => + assert(unsafeDouble.getDouble(i) == e) +} } - test("from primitive double array") { -val array = Array(1.1, 2.2, 3.3) -val unsafe = UnsafeArrayData.fromPrimitiveArray(array) -assert(unsafe.numElements == 3) -assert(unsafe.getSizeInBytes == 4 + 4 * 3 + 8 * 3) -assert(unsafe.getDouble(0) == 1.1) -assert(unsafe.getDouble(1) == 2.2) -assert(unsafe.getDouble(2) == 3.3) + test("from primitive array") { +val unsafeInt = UnsafeArrayData.fromPrimitiveArray(intArray) +assert(unsafeInt.numElements == 3) +assert(unsafeInt.getSizeInBytes == + ((4 + scala.math.ceil(3/64.toDouble) * 8 + 4 * 3 + 7).toInt / 8) * 8) +intArray.zipWithIndex.map { case (e, i) => + assert(unsafeInt.getInt(i) == e) +} + +val unsafeDouble = UnsafeArrayData.fromPrimitiveArray(doubleArray) +assert(unsafeDouble.numElements == 3) +assert(unsafeDouble.getSizeInBytes == + ((4 + scala.math.ceil(3/64.toDouble) * 8 + 8 * 3 + 7).toInt / 8) * 8) +doubleArray.zipWithIndex.map { case (e, i) => + assert(unsafeDouble.getDouble(i) == e) +} + } + 
+ test("to primitive array") { +val intCount = intArray.length +val intUnsafeArray = new UnsafeArrayData +val intHeader = UnsafeArrayData.calculateHeaderPortionInBytes(intCount) +val intSize = intHeader + 4 * intCount +val intBuffer = new Array[Byte](intSize) +Platform.putInt(intBuffer, Platform.BYTE_ARRAY_OFFSET, intCount) +intUnsafeArray.pointTo(intBuffer, Platform.BYTE_ARRAY_O
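The `getSizeInBytes` expectations in the suite above encode the full new layout: header, then the value region, rounded up to a whole number of 8-byte words. A sketch of that arithmetic (method names are mine; the formula mirrors `((4 + ceil(n/64)*8 + elementSize*n + 7) / 8) * 8` from the test):

```java
public class TotalSizeSketch {
    // Header (assumed): 4 bytes for numElements + one 8-byte null-bit word
    // per 64 elements.
    static int headerInBytes(int numElements) {
        return 4 + ((numElements + 63) / 64) * 8;
    }

    // Total size = header + value region, rounded up to an 8-byte word
    // boundary, since the backing store is word-aligned.
    static int totalSizeInBytes(int numElements, int elementSize) {
        int raw = headerInBytes(numElements) + elementSize * numElements;
        return ((raw + 7) / 8) * 8;
    }
}
```

For the 3-element arrays in the test this works out to 24 bytes for ints and 40 bytes for doubles, matching the asserted values.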
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392605 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala --- @@ -222,16 +226,17 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro case _ => s"$arrayWriter.write($index, $element);" } +val dataType = if (ctx.isPrimitiveType(jt)) ctx.primitiveTypeName(et) else "" --- End diff -- how about `primitiveTypeName`?
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392598 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala --- @@ -189,29 +189,33 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro val jt = ctx.javaType(et) -val fixedElementSize = et match { +val elementOrOffsetSize = et match { case t: DecimalType if t.precision <= Decimal.MAX_LONG_DIGITS => 8 case _ if ctx.isPrimitiveType(jt) => et.defaultSize - case _ => 0 + case _ => 8 // we need 8 bytes to store offset and length for variable-length types] --- End diff -- remove the trailing `]`?
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392581 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java --- @@ -33,91 +38,144 @@ // The offset of the global buffer where we start to write this array. private int startingOffset; + // The number of elements in this array + private int numElements; + + private int headerInBytes; + + private void assertIndexIsValid(int index) { +assert index >= 0 : "index (" + index + ") should >= 0"; +assert index < numElements : "index (" + index + ") should < " + numElements; + } + public void initialize(BufferHolder holder, int numElements, int fixedElementSize) { -// We need 4 bytes to store numElements and 4 bytes each element to store offset. -final int fixedSize = 4 + 4 * numElements; +// We need 4 bytes to store numElements in header +this.numElements = numElements; +this.headerInBytes = calculateHeaderPortionInBytes(numElements); this.holder = holder; this.startingOffset = holder.cursor; -holder.grow(fixedSize); -Platform.putInt(holder.buffer, holder.cursor, numElements); -holder.cursor += fixedSize; +// Grows the global buffer ahead for header and fixed size data. +holder.grow(headerInBytes + fixedElementSize * numElements); + +// Initialize information in header --- End diff -- we should explain this more clearly, e.g. that we write `numElements` to the head and clear out the null bits
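What the reviewer asks the comment to spell out can be sketched in plain Java. This is an assumption-laden analogue of `initialize()`'s header step, using a byte array instead of `Platform` off-heap writes, and assuming little-endian layout (the real code uses `Platform.putInt`, which follows native byte order):

```java
public class HeaderInitSketch {
    // Header (assumed): 4 bytes of numElements, then ceil(n/64) 8-byte
    // null-bit words, all zeroed to mean "no nulls yet".
    static byte[] initializeHeader(int numElements) {
        int headerInBytes = 4 + ((numElements + 63) / 64) * 8;
        byte[] buffer = new byte[headerInBytes];
        // Write numElements into the first 4 bytes (little-endian here).
        for (int i = 0; i < 4; i++) {
            buffer[i] = (byte) (numElements >>> (8 * i));
        }
        // Bytes 4..headerInBytes are already zero in a fresh Java array,
        // which is exactly the "clear out the null bits" step; the real
        // writer must zero them explicitly because it reuses a buffer.
        return buffer;
    }
}
```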
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392548 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java --- @@ -33,91 +38,144 @@ (same hunk as the previous comment) + public void initialize(BufferHolder holder, int numElements, int fixedElementSize) { --- End diff -- let's also update the name for `fixedElementSize` here
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392530 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java --- @@ -341,63 +325,115 @@ public UnsafeArrayData copy() { return arrayCopy; } - public static UnsafeArrayData fromPrimitiveArray(int[] arr) { -if (arr.length > (Integer.MAX_VALUE - 4) / 8) { - throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " + -"it's too big."); -} + @Override + public boolean[] toBooleanArray() { +int size = numElements(); +boolean[] values = new boolean[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size); +return values; + } -final int offsetRegionSize = 4 * arr.length; -final int valueRegionSize = 4 * arr.length; -final int totalSize = 4 + offsetRegionSize + valueRegionSize; -final byte[] data = new byte[totalSize]; + @Override + public byte[] toByteArray() { +int size = numElements(); +byte[] values = new byte[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size); +return values; + } -Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length); + @Override + public short[] toShortArray() { +int size = numElements(); +short[] values = new short[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.SHORT_ARRAY_OFFSET, size * 2); +return values; + } -int offsetPosition = Platform.BYTE_ARRAY_OFFSET + 4; -int valueOffset = 4 + offsetRegionSize; -for (int i = 0; i < arr.length; i++) { - Platform.putInt(data, offsetPosition, valueOffset); - offsetPosition += 4; - valueOffset += 4; -} + @Override + public int[] toIntArray() { +int size = numElements(); +int[] values = new int[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.INT_ARRAY_OFFSET, size * 4); +return values; + } -Platform.copyMemory(arr, 
Platform.INT_ARRAY_OFFSET, data, - Platform.BYTE_ARRAY_OFFSET + 4 + offsetRegionSize, valueRegionSize); + @Override + public long[] toLongArray() { +int size = numElements(); +long[] values = new long[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.LONG_ARRAY_OFFSET, size * 8); +return values; + } -UnsafeArrayData result = new UnsafeArrayData(); -result.pointTo(data, Platform.BYTE_ARRAY_OFFSET, totalSize); -return result; + @Override + public float[] toFloatArray() { +int size = numElements(); +float[] values = new float[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.FLOAT_ARRAY_OFFSET, size * 4); +return values; } - public static UnsafeArrayData fromPrimitiveArray(double[] arr) { -if (arr.length > (Integer.MAX_VALUE - 4) / 12) { + @Override + public double[] toDoubleArray() { +int size = numElements(); +double[] values = new double[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.DOUBLE_ARRAY_OFFSET, size * 8); +return values; + } + + private static UnsafeArrayData fromPrimitiveArray( + Object arr, int offset, int length, int elementSize) { +final long headerSize = calculateHeaderPortionInBytes(length); +final long valueRegionSize = (long)elementSize * (long)length; +final long allocationSize = (headerSize + valueRegionSize + 7) / 8; +if (allocationSize > (long)Integer.MAX_VALUE) { throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " + "it's too big."); } -final int offsetRegionSize = 4 * arr.length; -final int valueRegionSize = 8 * arr.length; -final int totalSize = 4 + offsetRegionSize + valueRegionSize; -final byte[] data = new byte[totalSize]; +final long[] data = new long[(int)allocationSize]; -Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length); - -int offsetPosition = Platform.BYTE_ARRAY_OFFSET + 4; -int valueOffset = 4 + offsetRegionSize; -for (int i = 0; i < arr.length; i++) { - 
Platform.putInt(data, offsetPosition, valueOffset); - offsetPosition += 4; - valueOffset += 8; -} - -Platform.copyMemory(
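Each `toXxxArray` method in the hunk above copies the value region, which starts immediately after the header, into a fresh primitive array with a single `Platform.copyMemory` call. A plain-Java analogue of `toIntArray` (buffer layout and little-endian order are assumptions for the sketch; the real code reads raw memory in native order):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ToIntArraySketch {
    // Buffer layout (assumed): [headerInBytes bytes][numElements * 4 bytes
    // of int values]. Reads each element from the value region.
    static int[] toIntArray(byte[] buffer, int headerInBytes, int numElements) {
        int[] values = new int[numElements];
        ByteBuffer bb = ByteBuffer.wrap(buffer).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < numElements; i++) {
            values[i] = bb.getInt(headerInBytes + i * 4);
        }
        return values;
    }
}
```

The real implementation replaces the loop with one bulk copy from `baseOffset + headerInBytes` to `Platform.INT_ARRAY_OFFSET`, which is the point of the refactoring: fixed-width elements sit contiguously, so no per-element offsets need decoding.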
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392506

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -341,63 +325,115 @@ public UnsafeArrayData copy() {
     return arrayCopy;
   }

-  public static UnsafeArrayData fromPrimitiveArray(int[] arr) {
-    if (arr.length > (Integer.MAX_VALUE - 4) / 8) {
-      throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
-        "it's too big.");
-    }
+  @Override
+  public boolean[] toBooleanArray() {
+    int size = numElements();
+    boolean[] values = new boolean[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size);
+    return values;
+  }

-    final int offsetRegionSize = 4 * arr.length;
-    final int valueRegionSize = 4 * arr.length;
-    final int totalSize = 4 + offsetRegionSize + valueRegionSize;
-    final byte[] data = new byte[totalSize];
+  @Override
+  public byte[] toByteArray() {
+    int size = numElements();
+    byte[] values = new byte[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size);
+    return values;
+  }

-    Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length);
+  @Override
+  public short[] toShortArray() {
+    int size = numElements();
+    short[] values = new short[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.SHORT_ARRAY_OFFSET, size * 2);
+    return values;
+  }

-    int offsetPosition = Platform.BYTE_ARRAY_OFFSET + 4;
-    int valueOffset = 4 + offsetRegionSize;
-    for (int i = 0; i < arr.length; i++) {
-      Platform.putInt(data, offsetPosition, valueOffset);
-      offsetPosition += 4;
-      valueOffset += 4;
-    }
+  @Override
+  public int[] toIntArray() {
+    int size = numElements();
+    int[] values = new int[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.INT_ARRAY_OFFSET, size * 4);
+    return values;
+  }

-    Platform.copyMemory(arr, Platform.INT_ARRAY_OFFSET, data,
-      Platform.BYTE_ARRAY_OFFSET + 4 + offsetRegionSize, valueRegionSize);
+  @Override
+  public long[] toLongArray() {
+    int size = numElements();
+    long[] values = new long[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.LONG_ARRAY_OFFSET, size * 8);
+    return values;
+  }

-    UnsafeArrayData result = new UnsafeArrayData();
-    result.pointTo(data, Platform.BYTE_ARRAY_OFFSET, totalSize);
-    return result;
+  @Override
+  public float[] toFloatArray() {
+    int size = numElements();
+    float[] values = new float[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.FLOAT_ARRAY_OFFSET, size * 4);
+    return values;
   }

-  public static UnsafeArrayData fromPrimitiveArray(double[] arr) {
-    if (arr.length > (Integer.MAX_VALUE - 4) / 12) {
+  @Override
+  public double[] toDoubleArray() {
+    int size = numElements();
+    double[] values = new double[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.DOUBLE_ARRAY_OFFSET, size * 8);
+    return values;
+  }
+
+  private static UnsafeArrayData fromPrimitiveArray(
+      Object arr, int offset, int length, int elementSize) {
+    final long headerSize = calculateHeaderPortionInBytes(length);
+    final long valueRegionSize = (long)elementSize * (long)length;
+    final long allocationSize = (headerSize + valueRegionSize + 7) / 8;
+    if (allocationSize > (long)Integer.MAX_VALUE) {
       throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
         "it's too big.");
     }

-    final int offsetRegionSize = 4 * arr.length;
-    final int valueRegionSize = 8 * arr.length;
-    final int totalSize = 4 + offsetRegionSize + valueRegionSize;
-    final byte[] data = new byte[totalSize];
+    final long[] data = new long[(int)allocationSize];

-    Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length);
-
-    int offsetPosition = Platform.BYTE_ARRAY_OFFSET + 4;
-    int valueOffset = 4 + offsetRegionSize;
-    for (int i = 0; i < arr.length; i++) {
-      Platform.putInt(data, offsetPosition, valueOffset);
-      offsetPosition += 4;
-      valueOffset += 8;
-    }
-
-    Platform.copyMemory(
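The new word-aligned format the diff introduces can be sketched in plain Java without Spark's `Platform` helper. This is a hypothetical illustration, not the actual implementation: the 8-byte `numElements` field and one-null-bit-per-element header are assumptions about the layout, and `fromIntArray`/`WordLayoutSketch` are names invented for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the word-aligned layout: [numElements word][null bitmap][value region],
// packed into an array of 8-byte words the way the unified fromPrimitiveArray does.
public class WordLayoutSketch {
  static long[] fromIntArray(int[] arr) {
    // Header: one 8-byte word for numElements, plus the null bitmap rounded
    // up to whole words (assumed layout).
    long headerInBytes = 8 + ((arr.length + 63) / 64) * 8L;
    long valueRegionInBytes = 4L * arr.length;                    // 4 bytes per int
    long totalSizeInWords = (headerInBytes + valueRegionInBytes + 7) / 8;
    if (totalSizeInWords > Integer.MAX_VALUE) {
      throw new UnsupportedOperationException("array too big for unsafe format");
    }
    // Build the bytes in a ByteBuffer, then view them as 8-byte words.
    ByteBuffer buf = ByteBuffer.allocate((int) totalSizeInWords * 8)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(arr.length);            // numElements word; null bits stay zero
    buf.position((int) headerInBytes);  // skip to the value region
    for (int v : arr) buf.putInt(v);
    buf.rewind();
    long[] data = new long[(int) totalSizeInWords];
    for (int i = 0; i < data.length; i++) data[i] = buf.getLong();
    return data;
  }

  public static void main(String[] args) {
    long[] words = fromIntArray(new int[] {1, 2, 3});
    // 16-byte header + 12 bytes of values = 28 bytes, rounded up to 4 words.
    System.out.println(words.length);  // 4
    System.out.println(words[0]);      // 3 (numElements)
  }
}
```

The point of the rewrite the diff shows is that this layout keeps the value region bit-compatible with a Java primitive array, so `toIntArray` and friends become a single bulk copy instead of a per-element loop over offsets.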
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392481

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
[quoted hunk @@ -341,63 +325,115 @@, identical to the diff above; anchor lines:]
+  private static UnsafeArrayData fromPrimitiveArray(
+      Object arr, int offset, int length, int elementSize) {
+    final long headerSize = calculateHeaderPortionInBytes(length);
+    final long valueRegionSize = (long)elementSize * (long)length;
+    final long allocationSize = (headerSize + valueRegionSize + 7) / 8;
--- End diff --

it's confusing to use `size` for all the names, how about `headerInBytes`, `valueRegionInBytes`, `totalSizeInWords`?

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
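The size calculation under review can be restated with the names the reviewer proposes, which makes the byte/word distinction explicit. This is a hypothetical sketch, not Spark's code: `SizeCalc` and its header formula (one 8-byte word for the element count plus a null bitmap rounded up to whole words) are assumptions for illustration.

```java
// Sketch of the allocation-size math with unambiguous names: quantities
// measured in bytes carry an InBytes suffix, the final word count InWords.
public class SizeCalc {
  // Assumed header layout: 8-byte numElements field plus one null bit per
  // element, rounded up to whole 8-byte words.
  static long headerInBytes(int numElements) {
    return 8 + ((numElements + 63) / 64) * 8L;
  }

  static long totalSizeInWords(int numElements, int elementSize) {
    long headerInBytes = headerInBytes(numElements);
    // Widen before multiplying so a large array cannot overflow int.
    long valueRegionInBytes = (long) elementSize * (long) numElements;
    // Round the byte total up to whole 8-byte words.
    return (headerInBytes + valueRegionInBytes + 7) / 8;
  }

  public static void main(String[] args) {
    // 3 ints: 16-byte header + 12 bytes of values = 28 bytes -> 4 words.
    System.out.println(totalSizeInWords(3, 4));  // 4
  }
}
```

With these names, the bounds check reads naturally as "does the word count fit in an `int` array index", which is exactly what the `allocationSize > Integer.MAX_VALUE` guard in the diff is testing.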
[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13990

**[Test build #61690 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61690/consoleFull)** for PR 13990 at commit [`6b2390d`](https://github.com/apache/spark/commit/6b2390d2dc141553eef3df9dd34744b9d08ccfdb).
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69392452

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
[quoted hunk @@ -341,63 +325,115 @@, identical to the diff above; anchor lines:]
+    final long allocationSize = (headerSize + valueRegionSize + 7) / 8;
+    if (allocationSize > (long)Integer.MAX_VALUE) {
--- End diff --

hmm, why cast to long here? It's ok to compare long and int directly.
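The reviewer's point is standard Java behavior: in a comparison between a `long` and an `int`, binary numeric promotion widens the `int` operand to `long` automatically, so the explicit `(long)` cast on `Integer.MAX_VALUE` is redundant. A minimal sketch (the class and variable names are illustrative only):

```java
public class Promotion {
  public static void main(String[] args) {
    long allocationSize = 3_000_000_000L;  // larger than any int
    // No cast needed: Integer.MAX_VALUE is widened to long before comparing,
    // so this is equivalent to allocationSize > (long) Integer.MAX_VALUE.
    boolean tooBig = allocationSize > Integer.MAX_VALUE;
    System.out.println(tooBig);  // true
  }
}
```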