[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177098340 **[Test build #50442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50442/consoleFull)** for PR 10982 at commit [`d1bb199`](https://github.com/apache/spark/commit/d1bb1997497a8a1b3f18c47bb0c394d4bf3029f3). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6363][BUILD] Make Scala 2.11 the defaul...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10608#issuecomment-177098407 It's a bit hard to know whether the repl changes make sense or not, but I think we just need to try it out and see if problems come up. LGTM.
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177098416 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50440/ Test FAILed.
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177109563 **[Test build #50443 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50443/consoleFull)** for PR 10982 at commit [`964193d`](https://github.com/apache/spark/commit/964193d920bf494148bbd0deee58c4d1e6dc3327).
[GitHub] spark pull request: [SPARK-12982][SQL] Add table name validation i...
Github user jayadevanmurali commented on the pull request: https://github.com/apache/spark/pull/10983#issuecomment-177109715

@hvanhovell I was able to replicate this in Spark 2.0.0. Steps:

ayadevan@Satellite-L640:~/spark$ ./bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
16/01/30 14:19:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/30 14:19:22 WARN Utils: Your hostname, Satellite-L640 resolves to a loopback address: 127.0.1.1; using 100.86.225.72 instead (on interface ppp0)
16/01/30 14:19:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context available as sc (master = local[*], app id = local-1454143767817).
SQL context available as sqlContext.

scala> import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

scala> import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

scala> import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.{SparkContext, SparkConf}

scala> val rows = List(Row("foo"), Row("bar"));
rows: List[org.apache.spark.sql.Row] = List([foo], [bar])

scala> val schema = StructType(Seq(StructField("col", StringType)));
schema: org.apache.spark.sql.types.StructType = StructType(StructField(col,StringType,true))

scala> val rdd = sc.parallelize(rows);
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at :32

scala> val df = sqlContext.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [col: string]

scala> df.registerTempTable("t~")

scala> df.sqlContext.dropTempTable("t~")
java.lang.RuntimeException: [1.2] failure: ``.'' expected but `~' found

t~
 ^
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:836)
    at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:39)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:46)
    at $iwC$$iwC$$iwC$$iwC$$iwC.(:48)
    at $iwC$$iwC$$iwC$$iwC.(:50)
    at $iwC$$iwC$$iwC.(:52)
    at $iwC$$iwC.(:54)
    at $iwC.(:56)
    at (:58)
    at .(:62)
    at .()
    at .(:7)
    at .()
    at $print()
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1326)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:821)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:852)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:800)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at
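The session above shows an asymmetry: registering "t~" succeeds, but dropping it fails because only the drop path parses the name. A minimal, self-contained sketch of the idea of validating on both paths follows. It does not use Spark; `TempTableRegistrySketch`, its methods, and the identifier regex are all hypothetical stand-ins chosen for illustration, not Spark's actual parser rules.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a temp-table registry; not Spark's catalog.
object TempTableRegistrySketch {
  // Assumed identifier rule for this sketch only: a letter or underscore,
  // followed by letters, digits, or underscores.
  private val ValidName = "[A-Za-z_][A-Za-z0-9_]*".r

  private val tables = mutable.Map.empty[String, String]

  // Reject a bad name up front, so that register and drop agree on what is legal.
  private def validate(name: String): String = name match {
    case ValidName() => name
    case _ => throw new IllegalArgumentException(s"Invalid table name: $name")
  }

  def register(name: String, plan: String): Unit = {
    tables(validate(name)) = plan
  }

  def drop(name: String): Unit = {
    tables -= validate(name)
  }

  def contains(name: String): Boolean = tables.contains(name)
}
```

With this shape, `register("t~", ...)` fails immediately instead of leaving behind a table that can no longer be dropped.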
[GitHub] spark pull request: [SPARK-8171] [Web UI] Simulated infinite scrol...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10910#issuecomment-177129741 **[Test build #50439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50439/consoleFull)** for PR 10910 at commit [`35e08c7`](https://github.com/apache/spark/commit/35e08c7d3f7b89a04405795aa806cf5bbf76d9ec). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
GitHub user wangyang1992 opened a pull request: https://github.com/apache/spark/pull/10994 [SPARK-13100] [SQL] improving the performance of stringToDate method in DateTimeUtils.scala Use an instance variable to hold a GMT TimeZone object instead of instantiating it every time. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyang1992/spark datetimeUtil Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10994.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10994 commit 19defc9c83da6206288c7ee70ce97f2e08603f72 Author: wangyang Date: 2016-01-30T08:33:40Z improving the performance of stringToDate method in DateTimeUtils.scala
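The change described in this pull request, holding one GMT `TimeZone` in a field rather than looking it up on every call, can be illustrated roughly as below. This is a sketch under assumptions, not the patch itself: `CachedTimeZoneSketch`, `GmtTimeZone`, and `daysSinceEpoch` are hypothetical names, and only standard `java.util` calls are used.

```scala
import java.util.{Calendar, TimeZone}

// Hypothetical illustration; not the code from DateTimeUtils.scala.
object CachedTimeZoneSketch {
  // Looked up once when the object is initialised, then reused by every call.
  private val GmtTimeZone: TimeZone = TimeZone.getTimeZone("GMT")

  private val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Converts a calendar date (month is 1-based) to days since 1970-01-01 in GMT.
  def daysSinceEpoch(year: Int, month: Int, day: Int): Int = {
    val c = Calendar.getInstance(GmtTimeZone) // reuses the cached zone
    c.clear()
    c.set(year, month - 1, day, 0, 0, 0) // Calendar months are 0-based
    (c.getTimeInMillis / MillisPerDay).toInt
  }
}
```

The saving comes from skipping the repeated `TimeZone.getTimeZone("GMT")` lookup on a hot path; how much that matters in practice would need to be measured.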
[GitHub] spark pull request: [SPARK-12982][SQL] Add table name validation i...
Github user jayadevanmurali commented on a diff in the pull request: https://github.com/apache/spark/pull/10983#discussion_r51342667 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -747,7 +747,7 @@ class SQLContext private[sql]( * only during the lifetime of this instance of SQLContext. */ private[sql] def registerDataFrameAsTable(df: DataFrame, tableName: String): Unit = { -catalog.registerTable(TableIdentifier(tableName), df.logicalPlan) +catalog.registerTable(SqlParser.parseTableIdentifier(tableName), df.logicalPlan) --- End diff -- OK, I can see the variable definition at line 211 of SQLContext.scala: @transient protected[sql] val sqlParser = new SparkSQLParser(getSQLDialect().parse(_)) But this variable is not used anywhere. All methods use SqlParser.parseTableIdentifier(), for example: @Experimental def createExternalTable( tableName: String, source: String, options: Map[String, String]): DataFrame = { **val tableIdent = SqlParser.parseTableIdentifier(tableName)** val cmd = CreateTableUsing( tableIdent, userSpecifiedSchema = None, source, temporary = false, options, allowExisting = false, managedIfNoPath = false) executePlan(cmd).toRdd table(tableIdent) } Correct me if I am wrong.
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10989#issuecomment-177111484 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10989#issuecomment-177111489 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50438/ Test PASSed.
[GitHub] spark pull request: [SPARK-6363][BUILD] Make Scala 2.11 the defaul...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10608#issuecomment-177098650 Merging this in master. Hopefully compilation will be faster.
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177098415 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10994#issuecomment-177106034 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10994#issuecomment-177114527 **[Test build #50444 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50444/consoleFull)** for PR 10994 at commit [`19defc9`](https://github.com/apache/spark/commit/19defc9c83da6206288c7ee70ce97f2e08603f72).
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10942#issuecomment-177098020 **[Test build #50441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50441/consoleFull)** for PR 10942 at commit [`925827b`](https://github.com/apache/spark/commit/925827bc01e484c1d1ffb584fde86324b0640ca2).
[GitHub] spark pull request: [SPARK-13070][SQL] Better error message when P...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10979#issuecomment-177099376 cc @liancheng
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10942#discussion_r51342432 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala --- @@ -59,6 +61,141 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet } } + // To verify if pruning works, we compare the results before filtering + private def checkPrunedAnswers( + sourceDataFrame: DataFrame, + filterCondition: Column, + expectedAnswer: DataFrame): Unit = { +val filter = sourceDataFrame.filter(filterCondition).queryExecution.executedPlan +assert( + filter.isInstanceOf[execution.Filter] || + (filter.isInstanceOf[WholeStageCodegen] && + filter.asInstanceOf[WholeStageCodegen].plan.isInstanceOf[execution.Filter])) +checkAnswer( + expectedAnswer.orderBy(expectedAnswer.logicalPlan.output.map(attr => Column(attr)) : _*), + filter.children.head.executeCollectPublic().sortBy(_.toString())) + } + + test("read partitioning bucketed tables with bucket pruning filters") { +val df = (10 until 50).map(i => (i % 5, i % 13 + 10, i.toString)).toDF("i", "j", "k") + +withTable("bucketed_table") { + // The number of buckets should be large enough to make sure each bucket contains + // at most one bucketing key value. + // json does not support predicate push-down, and thus json is used here --- End diff -- Does it mean bucket pruning is not very useful for parquet?
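For readers unfamiliar with the idea under test, bucket pruning can be sketched in a few lines: if rows are assigned to buckets by hashing the bucketing key, an equality filter on that key only needs the buckets its values hash to. The sketch below is an illustration under assumptions. `BucketPruningSketch` and its methods are hypothetical, and the hash used here (`hashCode` with a non-negative modulo) is a placeholder, not necessarily the function Spark uses.

```scala
// Hypothetical illustration of bucket pruning; not Spark's implementation.
object BucketPruningSketch {
  // Assumed bucketing rule for this sketch: non-negative modulo of hashCode.
  def bucketId(key: Int, numBuckets: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numBuckets)

  // For a filter like `key IN (values)`, only these buckets need to be read.
  def bucketsToRead(values: Seq[Int], numBuckets: Int): Set[Int] =
    values.map(bucketId(_, numBuckets)).toSet
}
```

For example, with 8 buckets a filter on the keys 3 and 11 would touch a single bucket under this rule, since both hash to the same bucket id.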
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10942#discussion_r51342422 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala --- @@ -59,6 +61,141 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet } } + // To verify if pruning works, we compare the results before filtering + private def checkPrunedAnswers( + sourceDataFrame: DataFrame, + filterCondition: Column, + expectedAnswer: DataFrame): Unit = { +val filter = sourceDataFrame.filter(filterCondition).queryExecution.executedPlan +assert( + filter.isInstanceOf[execution.Filter] || + (filter.isInstanceOf[WholeStageCodegen] && --- End diff -- damn forgot about the `WholeStageCodegen` stuff. How about we call `filter.find` to get the underlying relation operator directly?
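The suggestion here is to search the plan tree for the operator of interest rather than asserting on the exact shape of its wrappers. A toy, self-contained version of that idea is sketched below. The node types (`Scan`, `Filter`, `Codegen`) and the `find` method are hypothetical stand-ins written for this example; they only mimic the pre-order, first-match behaviour of a tree `find` and are not Spark's classes.

```scala
// Hypothetical plan tree used only to illustrate a pre-order `find`.
object PlanFindSketch {
  sealed trait PlanNode {
    def children: Seq[PlanNode]

    // Returns the first node (pre-order) that satisfies the predicate.
    def find(p: PlanNode => Boolean): Option[PlanNode] =
      if (p(this)) Some(this)
      else children.foldLeft(Option.empty[PlanNode]) { (found, child) =>
        found.orElse(child.find(p)) // orElse is lazy, so the search stops at the first hit
      }
  }

  final case class Scan(table: String) extends PlanNode {
    def children: Seq[PlanNode] = Nil
  }
  final case class Filter(condition: String, child: PlanNode) extends PlanNode {
    def children: Seq[PlanNode] = Seq(child)
  }
  final case class Codegen(plan: PlanNode) extends PlanNode {
    def children: Seq[PlanNode] = Seq(plan)
  }

  // Works the same whether or not the filter is wrapped in a Codegen node.
  def findScan(plan: PlanNode): Option[PlanNode] = plan.find(_.isInstanceOf[Scan])
}
```

A test written this way would not need a separate branch for the wrapped and unwrapped cases.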
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177100554 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177100558 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50442/ Test FAILed.
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177100464 **[Test build #50442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50442/consoleFull)** for PR 10982 at commit [`d1bb199`](https://github.com/apache/spark/commit/d1bb1997497a8a1b3f18c47bb0c394d4bf3029f3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10942#discussion_r51342412 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala --- @@ -59,6 +61,141 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet } } + // To verify if pruning works, we compare the results before filtering + private def checkPrunedAnswers( + sourceDataFrame: DataFrame, + filterCondition: Column, --- End diff -- instead of having these 2 parameters, how about we just ask caller to pass in a `filteredDataFrame`?
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10942#issuecomment-177102470 @gatorsmile thanks for your work, it's very close now :)
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/10994#issuecomment-177108672 Jenkins, this is ok to test.
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10982#discussion_r51342217 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -0,0 +1,172 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.catalog + +import org.apache.spark.sql.AnalysisException + + +/** + * Interface for the system catalog (of columns, partitions, tables, and databases). + * + * This is only used for non-temporary items, and implementations must be thread-safe as they + * can be accessed in multiple threads. + */ +abstract class Catalog { + + // -- + // Databases + // -- + + def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit --- End diff -- need to define when we should throw exceptions in api contract
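One possible reading of the contract the reviewer asks for is sketched below: `createDatabase` throws when the database already exists unless `ifNotExists` is true, in which case it is a no-op. This is only an assumption about what such a contract could look like, not the behaviour defined by the pull request. `CatalogContractSketch`, `InMemoryCatalog`, `databaseExists`, and `DatabaseAlreadyExistsException` are hypothetical names, and `Database` is reduced to a single field for the example.

```scala
import scala.collection.mutable

// Hypothetical sketch of an exception contract; not the Catalog from the diff.
object CatalogContractSketch {
  final case class Database(name: String)

  class DatabaseAlreadyExistsException(name: String)
    extends RuntimeException(s"Database '$name' already exists")

  class InMemoryCatalog {
    private val databases = mutable.Map.empty[String, Database]

    /**
     * Creates a database.
     * Throws DatabaseAlreadyExistsException if it already exists and
     * `ifNotExists` is false; does nothing if it exists and `ifNotExists` is true.
     */
    def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit = synchronized {
      if (databases.contains(dbDefinition.name)) {
        if (!ifNotExists) throw new DatabaseAlreadyExistsException(dbDefinition.name)
      } else {
        databases(dbDefinition.name) = dbDefinition
      }
    }

    def databaseExists(name: String): Boolean = synchronized {
      databases.contains(name)
    }
  }
}
```

The methods are `synchronized` because the interface comment in the diff requires implementations to be thread-safe; a real implementation might choose a different locking strategy.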
[GitHub] spark pull request: [SPARK-6363][BUILD] Make Scala 2.11 the defaul...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10608
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10989#issuecomment-177111361 **[Test build #50438 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50438/consoleFull)** for PR 10989 at commit [`0139fde`](https://github.com/apache/spark/commit/0139fdeeefc2038e995c44c7e966e09e30063418). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8171] [Web UI] Simulated infinite scrol...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10910#issuecomment-177129764 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8171] [Web UI] Simulated infinite scrol...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10910#issuecomment-177129767 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50439/ Test PASSed.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10942#issuecomment-177288104 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50445/ Test PASSed.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10942#issuecomment-177288102 Merged build finished. Test PASSed.
[GitHub] spark pull request: [test-maven] Shade protobuf-java
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10995#issuecomment-177290339 **[Test build #50447 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50447/consoleFull)** for PR 10995 at commit [`21cbc45`](https://github.com/apache/spark/commit/21cbc45d9971f7c64356709fc7d3b5c5ffbb06c8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10521][SQL] Utilize Docker for test DB2...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9893#issuecomment-177286324 **[Test build #50446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50446/consoleFull)** for PR 9893 at commit [`c59a1e6`](https://github.com/apache/spark/commit/c59a1e667e0142c20ee982171f13bfce02b93aa4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10521][SQL] Utilize Docker for test DB2...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9893#issuecomment-177286495 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-10521][SQL] Utilize Docker for test DB2...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9893#issuecomment-177286497 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50446/ Test FAILed.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10942#issuecomment-177288023 **[Test build #50445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50445/consoleFull)** for PR 10942 at commit [`f5acd00`](https://github.com/apache/spark/commit/f5acd00d4c28a6e65ca8200ec93b1874e921e0f0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [test-maven] Shade protobuf-java
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10995#issuecomment-177291160 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50447/ Test PASSed.
[GitHub] spark pull request: [test-maven] Shade protobuf-java
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10995#issuecomment-177291156 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10942#discussion_r51347234

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala ---
    @@ -59,6 +61,141 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
         }
       }

    +  // To verify if pruning works, we compare the results before filtering
    +  private def checkPrunedAnswers(
    +      sourceDataFrame: DataFrame,
    +      filterCondition: Column,
    +      expectedAnswer: DataFrame): Unit = {
    +    val filter = sourceDataFrame.filter(filterCondition).queryExecution.executedPlan
    +    assert(
    +      filter.isInstanceOf[execution.Filter] ||
    +      (filter.isInstanceOf[WholeStageCodegen] &&
    +        filter.asInstanceOf[WholeStageCodegen].plan.isInstanceOf[execution.Filter]))
    +    checkAnswer(
    +      expectedAnswer.orderBy(expectedAnswer.logicalPlan.output.map(attr => Column(attr)) : _*),
    +      filter.children.head.executeCollectPublic().sortBy(_.toString()))
    +  }
    +
    +  test("read partitioning bucketed tables with bucket pruning filters") {
    +    val df = (10 until 50).map(i => (i % 5, i % 13 + 10, i.toString)).toDF("i", "j", "k")
    +
    +    withTable("bucketed_table") {
    +      // The number of buckets should be large enough to make sure each bucket contains
    +      // at most one bucketing key value.
    +      // json does not support predicate push-down, and thus json is used here
    --- End diff --

    Bucket pruning avoids scanning bucket files that cannot contain matching rows. Each bucket file can still hold many different values, so row filtering in Parquet remains a really great feature for scanning a given bucket efficiently. We need both to get the best performance.

    Let me try to explain why record filtering in Parquet alone does not resolve all the issues:
    - The current approach is very limited. Row groups are filtered based on the min/max values in each row group, so many row groups that contain no matching rows may still be scanned.
    - It is not free. Metadata still has to be scanned to prune row groups.
    - The Parquet team is trying to improve it by adding more advanced statistics to the metadata (e.g., bloom filters in PARQUET-41 and dictionaries in PARQUET-384). A few limitations also remain (e.g., PARQUET-295).
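The contrast drawn in the comment above can be sketched in plain Scala without any Spark or Parquet dependency. This is only an illustration under stated assumptions: `RowGroup`, `rowGroupsToScan`, and `bucketId` are hypothetical names, and the modulo-based bucket assignment is a stand-in, not the hash function Spark actually uses for bucketing.

```scala
// Hypothetical model of a row group: only min/max statistics are available for pruning.
final case class RowGroup(values: Seq[Int]) {
  val min: Int = values.min
  val max: Int = values.max
}

object PruningSketch {
  // Min/max pruning: a row group can be skipped only when the key lies outside [min, max].
  // A group whose range covers the key must be scanned even if the key is absent from it.
  def rowGroupsToScan(groups: Seq[RowGroup], key: Int): Seq[RowGroup] =
    groups.filter(g => g.min <= key && key <= g.max)

  // Illustrative bucket assignment (an assumption, NOT Spark's real bucketing hash).
  def bucketId(key: Int, numBuckets: Int): Int =
    java.lang.Math.floorMod(key, numBuckets)

  def main(args: Array[String]): Unit = {
    val groups = Seq(
      RowGroup(Seq(1, 50)),       // range covers 15, but 15 is not stored here
      RowGroup(Seq(10, 20, 30)),  // range covers 15, but 15 is not stored here
      RowGroup(Seq(40, 60)))      // range excludes 15, so this group is skipped
    val key = 15
    val scanned = rowGroupsToScan(groups, key)
    println(s"row groups scanned by min/max pruning: ${scanned.size} of ${groups.size}")
    println(s"bucket files read with bucket pruning: 1 of 8 (bucket ${bucketId(key, 8)})")
  }
}
```

In this toy data, min/max statistics still force a scan of two row groups that hold no matching row, while an equality predicate on the bucketing key narrows the read to a single bucket file. That is the sense in which the two techniques complement each other.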
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10942#discussion_r51347237

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala ---
    @@ -59,6 +61,141 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
         }
       }

    +  // To verify if pruning works, we compare the results before filtering
    +  private def checkPrunedAnswers(
    +      sourceDataFrame: DataFrame,
    +      filterCondition: Column,
    +      expectedAnswer: DataFrame): Unit = {
    +    val filter = sourceDataFrame.filter(filterCondition).queryExecution.executedPlan
    +    assert(
    +      filter.isInstanceOf[execution.Filter] ||
    +      (filter.isInstanceOf[WholeStageCodegen] &&
    --- End diff --

    Sure, will do.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10942#discussion_r51347247

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala ---
    @@ -59,6 +61,141 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
         }
       }

    +  // To verify if pruning works, we compare the results before filtering
    +  private def checkPrunedAnswers(
    +      sourceDataFrame: DataFrame,
    +      filterCondition: Column,
    --- End diff --

    Sure, will do.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10942#issuecomment-177265449 **[Test build #50445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50445/consoleFull)** for PR 10942 at commit [`f5acd00`](https://github.com/apache/spark/commit/f5acd00d4c28a6e65ca8200ec93b1874e921e0f0).
[GitHub] spark pull request: [test-maven] Shade protobuf-java
Github user tedyu commented on the pull request: https://github.com/apache/spark/pull/10995#issuecomment-177266287 Would like some feedback before creating JIRA. Thanks
[GitHub] spark pull request: [test-maven] Shade protobuf-java
GitHub user tedyu opened a pull request:

    https://github.com/apache/spark/pull/10995

    [test-maven] Shade protobuf-java

    See this thread for background information:
    http://search-hadoop.com/m/q3RTtdkUFK11xQhP1/Spark+not+able+to+fetch+events+from+Amazon+Kinesis

    This PR shades com.google.protobuf:protobuf-java as org.spark-project.protobuf

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tedyu/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10995.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10995

commit 21cbc45d9971f7c64356709fc7d3b5c5ffbb06c8
Author: tedyu
Date: 2016-01-30T18:16:09Z

    Shade protobuf-java
[GitHub] spark pull request: [test-maven] Shade protobuf-java
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10995#issuecomment-177268114 **[Test build #50447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50447/consoleFull)** for PR 10995 at commit [`21cbc45`](https://github.com/apache/spark/commit/21cbc45d9971f7c64356709fc7d3b5c5ffbb06c8).
[GitHub] spark pull request: [SPARK-10521][SQL] Utilize Docker for test DB2...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9893#issuecomment-177265711 **[Test build #50446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50446/consoleFull)** for PR 9893 at commit [`c59a1e6`](https://github.com/apache/spark/commit/c59a1e667e0142c20ee982171f13bfce02b93aa4).
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10942#issuecomment-177133947 **[Test build #50441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50441/consoleFull)** for PR 10942 at commit [`925827b`](https://github.com/apache/spark/commit/925827bc01e484c1d1ffb584fde86324b0640ca2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10942#issuecomment-177134000 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50441/ Test PASSed.
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177136524 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50443/ Test PASSed.
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177136523 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13078][SQL] Infrastructure for the inte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10982#issuecomment-177136488 **[Test build #50443 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50443/consoleFull)** for PR 10982 at commit [`964193d`](https://github.com/apache/spark/commit/964193d920bf494148bbd0deee58c4d1e6dc3327). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class CatalogTestCases extends SparkFunSuite `
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10994#issuecomment-177141259 **[Test build #50444 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50444/consoleFull)** for PR 10994 at commit [`19defc9`](https://github.com/apache/spark/commit/19defc9c83da6206288c7ee70ce97f2e08603f72). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10994#issuecomment-177147895 LGTM. Are there other such instances in the code? It is best to look for these all at once.
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user wangyang1992 commented on the pull request: https://github.com/apache/spark/pull/10994#issuecomment-177151828 @srowen No, just that one in this file.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10942#issuecomment-177133999 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12982][SQL] Add table name validation i...
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/10983#issuecomment-177137733 You are using an older version of the master branch (last commit 25 days ago). Your version still has the ```org.apache.spark.sql.catalyst.SqlParser``` class, which was removed in commit https://github.com/apache/spark/commit/7cd7f2202547224593517b392f56e49e4c94cabc. Please update your master and try again.
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10994#issuecomment-177142164 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50444/ Test PASSed.
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10994#issuecomment-177142154 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10994
[GitHub] spark pull request: [ML][MINOR] Invalid MulticlassClassification r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10996#issuecomment-177350346 Merged build finished. Test PASSed.
[GitHub] spark pull request: [ML][MINOR] Invalid MulticlassClassification r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10996#issuecomment-177350347 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50451/ Test PASSed.
[GitHub] spark pull request: [ML][MINOR] Invalid MulticlassClassification r...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10996#issuecomment-177350319 **[Test build #50451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50451/consoleFull)** for PR 10996 at commit [`41f5338`](https://github.com/apache/spark/commit/41f533825e080b47f2a31f1dc4cbac0adf39e40f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-6847][Core][Streaming]Fix stack overflo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10934#issuecomment-177353166 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-6847][Core][Streaming]Fix stack overflo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10934#issuecomment-177353167 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50448/ Test PASSed.
[GitHub] spark pull request: [SPARK-6847][Core][Streaming]Fix stack overflo...
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10934#discussion_r51352293

    --- Diff: streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala ---
    @@ -821,6 +821,75 @@ class CheckpointSuite extends TestSuiteBase with DStreamCheckpointTester
         checkpointWriter.stop()
       }
     
    +  test("SPARK-6847: stack overflow when updateStateByKey is followed by a checkpointed dstream") {
    +    // In this test, there are two updateStateByKey operators. The RDD DAG is as follows:
    +    //
    +    //     batch 1            batch 2            batch 3     ...
    +    //
    +    // 1) input rdd          input rdd          input rdd
    +    //       |                  |                  |
    +    //       v                  v                  v
    +    // 2) cogroup rdd   ---> cogroup rdd   ---> cogroup rdd  ...
    +    //       |         /        |         /        |
    +    //       v        /         v        /         v
    +    // 3)  map rdd ---        map rdd ---        map rdd     ...
    +    //       |                  |                  |
    +    //       v                  v                  v
    +    // 4) cogroup rdd   ---> cogroup rdd   ---> cogroup rdd  ...
    +    //       |         /        |         /        |
    +    //       v        /         v        /         v
    +    // 5)  map rdd ---        map rdd ---        map rdd     ...
    +    //
    +    // Every batch depends on its previous batch, so "updateStateByKey" needs to do checkpoint to
    +    // break the RDD chain. However, before SPARK-6847, when the state RDD (layer 5) of the second
    +    // "updateStateByKey" does checkpoint, it won't checkpoint the state RDD (layer 3) of the first
    +    // "updateStateByKey" (Note: "updateStateByKey" has already marked that its state RDD (layer 3)
    +    // should be checkpointed). Hence, the connections between layer 2 and layer 3 won't be broken
    +    // and the RDD chain will grow infinitely and cause StackOverflow.
    +    //
    +    // Therefore SPARK-6847 introduces "spark.checkpoint.checkpointAllMarked" to force checkpointing
    +    // all marked RDDs in the DAG to resolve this issue. (For the previous example, it will break
    +    // connections between layer 2 and layer 3)
    +    ssc = new StreamingContext(master, framework, batchDuration)
    +    val batchCounter = new BatchCounter(ssc)
    +    ssc.checkpoint(checkpointDir)
    +    val inputDStream = new CheckpointInputDStream(ssc)
    +    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    +      Some(values.sum + state.getOrElse(0))
    +    }
    +    @volatile var shouldCheckpointAllMarkedRDDs = false
    +    @volatile var rddsCheckpointed = false
    +    inputDStream.map(i => (i, i))
    +      .updateStateByKey(updateFunc).checkpoint(batchDuration)
    +      .updateStateByKey(updateFunc).checkpoint(batchDuration)
    +      .foreachRDD { rdd =>
    +        /**
    +         * Find all RDDs that are marked for checkpointing in the specified RDD and its ancestors.
    +         */
    +        def findAllMarkedRDDs(rdd: RDD[_]): List[RDD[_]] = {
    --- End diff --

    > I meant put this in a private def outside of this test actually. It would make the test body smaller.

    But it will refer to the CheckpointSuite class which is not serializable.
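The serialization concern in the reply above can be reproduced outside Spark. The following is a minimal sketch, assuming plain Java serialization as a stand-in for the closure serialization that happens during checkpointing; `NotSerializableSuite`, `Helpers`, and `ClosureCaptureDemo` are hypothetical names that do not appear in the pull request.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical helper object: calling it from a lambda captures no instance.
object Helpers {
  def increment(x: Int): Int = x + 1
}

// Hypothetical stand-in for a test suite class that is not Serializable.
class NotSerializableSuite {
  private def increment(x: Int): Int = x + 1

  // Calls an instance method, so the function value holds a reference to `this`.
  def closureUsingInstanceMethod: Int => Int = (x: Int) => increment(x)

  // Calls a method on a standalone object, so nothing from the suite is captured.
  def closureUsingObjectMethod: Int => Int = (x: Int) => Helpers.increment(x)
}

object ClosureCaptureDemo {
  // Returns true if plain Java serialization accepts the object.
  def isJavaSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

Under these assumptions, a function that calls a private method of the enclosing class drags the whole instance into the serialized closure, whereas a helper that lives in a standalone object (or inside the closure itself, as in the diff) does not.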
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-177337405 **[Test build #50450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50450/consoleFull)** for PR 10702 at commit [`e83b822`](https://github.com/apache/spark/commit/e83b8223846cc41942469fc4b78e9f0500239e0f).
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-177343128 **[Test build #50450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50450/consoleFull)** for PR 10702 at commit [`e83b822`](https://github.com/apache/spark/commit/e83b8223846cc41942469fc4b78e9f0500239e0f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-177343337 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-6847][Core][Streaming]Fix stack overflo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10934#issuecomment-177353119 **[Test build #50448 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50448/consoleFull)** for PR 10934 at commit [`20e4509`](https://github.com/apache/spark/commit/20e45095506067f3f5195470e3a390cd4872e531). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10678#issuecomment-177353382 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50449/ Test PASSed.
[GitHub] spark pull request: [SPARK-6847][Core][Streaming]Fix stack overflo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10934#issuecomment-177327392 **[Test build #50448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50448/consoleFull)** for PR 10934 at commit [`20e4509`](https://github.com/apache/spark/commit/20e45095506067f3f5195470e3a390cd4872e531).
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-177343340 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50450/ Test PASSed.
[GitHub] spark pull request: [SPARK-12988][SQL] Can't drop columns that con...
Github user dilipbiswal commented on the pull request: https://github.com/apache/spark/pull/10943#issuecomment-177351900 @cloud-fan Hi Wenchen, let me know if I have interpreted your suggestion correctly. Please let me know if something is amiss. df.resolve() has many callers, so I have not changed its name but have added a comment. Let me know if you want me to refactor it. Thanks.
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10678#discussion_r51352501

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -521,38 +522,96 @@ class Analyzer(
        */
       object ResolveSortReferences extends Rule[LogicalPlan] {
         def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    -      case s @ Sort(ordering, global, p @ Project(projectList, child))
    -          if !s.resolved && p.resolved =>
    -        val (newOrdering, missing) = resolveAndFindMissing(ordering, p, child)
    +      // Here, this rule only resolves the missing sort references if the child is not Aggregate
    +      // Another rule ResolveAggregateFunctions will resolve that case.
    --- End diff --

    @cloud-fan I kept the function implementation in `ResolveAggregateFunctions`, but I call the function from `ResolveSortReferences`. Since the rule `ResolveAggregateFunctions` covers two cases (`filter` and `sort`), I am afraid readers of the code might be confused if we split them into two rules. The function is public. I am not sure whether this approach is appropriate.
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10678#issuecomment-177331561 **[Test build #50449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50449/consoleFull)** for PR 10678 at commit [`ba02f46`](https://github.com/apache/spark/commit/ba02f4695e4bfd07a9bef72f783bef3894d8191e).
[GitHub] spark pull request: [ML][MINOR] Invalid MulticlassClassification r...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10996#issuecomment-177347951 **[Test build #50451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50451/consoleFull)** for PR 10996 at commit [`41f5338`](https://github.com/apache/spark/commit/41f533825e080b47f2a31f1dc4cbac0adf39e40f).
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10678#issuecomment-177353314 **[Test build #50449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50449/consoleFull)** for PR 10678 at commit [`ba02f46`](https://github.com/apache/spark/commit/ba02f4695e4bfd07a9bef72f783bef3894d8191e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13100] [SQL] improving the performance ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10994#issuecomment-177328648 Thanks - merging this in.
[GitHub] spark pull request: [ML][MINOR] Invalid MulticlassClassification r...
GitHub user Lewuathe opened a pull request:

    https://github.com/apache/spark/pull/10996

    [ML][MINOR] Invalid MulticlassClassification reference in ml-guide

    In [ml-guide](https://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation), there is an invalid reference to the `MulticlassClassificationEvaluator` apidoc:

    https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.MultiClassClassificationEvaluator

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Lewuathe/spark fix-typo-in-ml-guide

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10996.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10996

commit 41f533825e080b47f2a31f1dc4cbac0adf39e40f
Author: Lewuathe
Date: 2016-01-31T00:23:17Z

    [ML][MINOR] Invalid MulticlassClassification reference in ml-guide
[GitHub] spark pull request: [SPARK-12988][SQL] Can't drop columns that con...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10943#issuecomment-177352984 **[Test build #50452 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50452/consoleFull)** for PR 10943 at commit [`8201994`](https://github.com/apache/spark/commit/82019947e9777a93ac4d137aed52e09a6434b56e).
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10678#issuecomment-177353380 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13105] Reject NATURAL JOIN queries rath...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10997#issuecomment-177376532 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13105] Reject NATURAL JOIN queries rath...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10997#issuecomment-177376534 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50453/ Test PASSed.
[GitHub] spark pull request: [SPARK-12850] [SQL] Support Bucket Pruning (Pr...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10942#discussion_r51355104

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala ---
    @@ -59,6 +60,136 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
         }
       }
     
    +  // To verify bucket pruning, we compare the contents of remaining buckets (before filtering)
    +  // with the expectedAnswer.
    +  private def checkPrunedAnswers(
    +      bucketedDataFrame: DataFrame,
    +      expectedAnswer: DataFrame): Unit = {
    +    val rdd = bucketedDataFrame.queryExecution.executedPlan.find(_.isInstanceOf[PhysicalRDD])
    +    assert(rdd.isDefined)
    +    checkAnswer(
    +      expectedAnswer.orderBy(expectedAnswer.logicalPlan.output.map(attr => Column(attr)) : _*),
    +      rdd.get.executeCollectPublic().sortBy(_.toString()))
    +  }
    +
    +  test("read partitioning bucketed tables with bucket pruning filters") {
    +    val df = (10 until 50).map(i => (i % 5, i % 13 + 10, i.toString)).toDF("i", "j", "k")
    +
    +    withTable("bucketed_table") {
    +      // The number of buckets should be large enough to make sure each bucket contains
    --- End diff --

    why this?
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user iyounus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10702#discussion_r51355081

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -558,6 +575,47 @@ class LinearRegressionSuite
         }
       }
     
    +  test("linear regression model with constant label") {
    +    /*
    +       R code:
    +       for (formula in c(b.const ~ . -1, b.const ~ .)) {
    +         model <- lm(formula, data=df.const.label, weights=w)
    +         print(as.vector(coef(model)))
    +       }
    +      [1] -9.221298  3.394343
    +      [1] 17  0  0
    +    */
    +    val expected = Seq(
    +      Vectors.dense(0.0, -9.221298, 3.394343),
    +      Vectors.dense(17.0, 0.0, 0.0))
    +
    +    Seq("auto", "l-bfgs", "normal").foreach { solver =>
    +      var idx = 0
    +      for (fitIntercept <- Seq(false, true)) {
    +        val model = new LinearRegression()
    +          .setFitIntercept(fitIntercept)
    +          .setWeightCol("weight")
    +          .setSolver(solver)
    +          .fit(datasetWithWeightConstantLabel)
    +        val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
    +        assert(actual ~== expected(idx) absTol 1e-4)
    +        idx += 1
    --- End diff --

    I'm not sure how to _check the size of lost history_. Could you please point me to some example?
[GitHub] spark pull request: [SPARK-13105] Reject NATURAL JOIN queries rath...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10997#issuecomment-177381245 How about hive context? Should we update `HiveQl.scala` too?
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-177381602 **[Test build #50455 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50455/consoleFull)** for PR 10702 at commit [`c0744d8`](https://github.com/apache/spark/commit/c0744d8a3c08756546925c9f82274f50d1d4affd).
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/10702#issuecomment-177385145

    Commenting on your issues.

    Issue 1: With `WeightedLeastSquares`, we have the option to standardize the label and the features separately. As a result, if the label is not standardized, the problem can be solved even if `yStd == 0`. As a result, in your case 4, when label is not standardized, and the features are standardized, this is not defined, so the users should get the result. For case 3, can you elaborate on why an analytical solution exists even when the label is standardized?

    Issue 2: In my opinion, even case 1 and case 2 are ill-defined, since in GLMNET the label is standardized by default, and GLMNET will not return any result at all. It just happens that, without regularization, standardizing or not standardizing the label does not change the solution, so we just treat them as if we don't standardize the label. This can explain your case 3.

    Issue 3: I think this is because your normal equation solver doesn't standardize the label, so the discrepancies occur.
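The `yStd == 0` situation discussed above can be made concrete with a few lines of plain Scala. This is a minimal sketch, not Spark's implementation: it assumes a weighted population standard deviation and scaling by the standard deviation only, and `ConstantLabelSketch` and its methods are hypothetical names.

```scala
object ConstantLabelSketch {
  // Weighted mean of `values` under non-negative `weights`.
  def weightedMean(values: Seq[Double], weights: Seq[Double]): Double =
    values.zip(weights).map { case (v, w) => v * w }.sum / weights.sum

  // Weighted population standard deviation (an assumption; Spark's estimator may differ).
  def weightedStd(values: Seq[Double], weights: Seq[Double]): Double = {
    val mean = weightedMean(values, weights)
    val variance =
      values.zip(weights).map { case (v, w) => w * (v - mean) * (v - mean) }.sum / weights.sum
    math.sqrt(variance)
  }

  // Scales each label by the weighted standard deviation, with no guard for a zero divisor.
  def scaleByStd(values: Seq[Double], weights: Seq[Double]): Seq[Double] = {
    val std = weightedStd(values, weights)
    values.map(_ / std)
  }
}
```

With a constant label such as 17.0 the standard deviation is zero, so any step that divides the label by it produces non-finite values. That is why a solver that standardizes the label has no usable problem to solve, while one that leaves the label unscaled can still proceed.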
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-177396200 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50455/ Test PASSed.
[GitHub] spark pull request: [SPARK-12988][SQL] Can't drop columns that con...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10943#discussion_r51355492

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -150,6 +153,17 @@ class DataFrame private[sql](
         }
       }
     
    +  /**
    +   * Resolves a column name. This is called when it is required to resolve a column by its
    +   * name only and not as a column path..
    +   */
    +  private[sql] def resolveColName(colName: String, userSuppliedName: String): Boolean = {
    --- End diff --

    how about
    ```
    private[sql] def indexOf(colName: String): Option[Int] = {
      val resolver = sqlContext.analyzer.resolver
      val index = queryExecution.analyzed.output.indexWhere(f => resolver(f.name, colName))
      if (index >= 0) Some(index) else None
    }
    ```
    then we can rewrite `withColumn` to:
    ```
    indexOf(colName).map { index =>
      select(output.updated(index, col.as(colName)).map(Column(_)) : _*)
    }.getOrElse {
      select(Column("*"), col.as(colName))
    }
    ```
    There may be a better name for this, like `resolveToIndex`.
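The lookup-then-update shape of the suggestion above can be tried without a `DataFrame`. This is a minimal sketch under stated assumptions: columns are modeled as plain strings, `Resolver` and `caseInsensitive` are hypothetical stand-ins for the analyzer's resolver, and `ColumnIndexSketch` is not part of Spark.

```scala
object ColumnIndexSketch {
  // Hypothetical stand-in for the analyzer's name resolver.
  type Resolver = (String, String) => Boolean
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Matches the whole column name, so "a.b" is one name and never a path into "a".
  def indexOf(names: Seq[String], colName: String, resolver: Resolver): Option[Int] = {
    val index = names.indexWhere(n => resolver(n, colName))
    if (index >= 0) Some(index) else None
  }

  // Replaces the matching column in place, or appends a new one if nothing matches.
  def withColumn(names: Seq[String], colName: String, resolver: Resolver): Seq[String] =
    indexOf(names, colName, resolver)
      .map(index => names.updated(index, colName))
      .getOrElse(names :+ colName)
}
```

Returning `Option[Int]` instead of a `Boolean` lets the caller reuse the position for the in-place update, which appears to be the point of the proposed rewrite of `withColumn`.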
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-177396130 **[Test build #50455 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50455/consoleFull)** for PR 10702 at commit [`c0744d8`](https://github.com/apache/spark/commit/c0744d8a3c08756546925c9f82274f50d1d4affd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10678#discussion_r51355710

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -521,38 +522,99 @@ class Analyzer(
        */
       object ResolveSortReferences extends Rule[LogicalPlan] {
         def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    -      case s @ Sort(ordering, global, p @ Project(projectList, child))
    -          if !s.resolved && p.resolved =>
    -        val (newOrdering, missing) = resolveAndFindMissing(ordering, p, child)
    +      case s @ Sort(_, _, a: Aggregate) if a.resolved =>
    --- End diff --

    @cloud-fan Sure, let me change it. Thanks!
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10678#discussion_r51355714

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -521,38 +522,99 @@ class Analyzer(
        */
       object ResolveSortReferences extends Rule[LogicalPlan] {
         def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    -      case s @ Sort(ordering, global, p @ Project(projectList, child))
    -          if !s.resolved && p.resolved =>
    -        val (newOrdering, missing) = resolveAndFindMissing(ordering, p, child)
    +      case s @ Sort(_, _, a: Aggregate) if a.resolved =>
    --- End diff --

    `ResolveAggregateFunctions` can handle missing attributes that can be resolved in the grandchild. If there are more complex cases, I think that rule can at least resolve the aggregate functions and then come back to this rule to complete the resolution.
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10678#discussion_r51355706

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -521,38 +522,99 @@ class Analyzer(
        */
       object ResolveSortReferences extends Rule[LogicalPlan] {
         def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    -      case s @ Sort(ordering, global, p @ Project(projectList, child))
    -          if !s.resolved && p.resolved =>
    -        val (newOrdering, missing) = resolveAndFindMissing(ordering, p, child)
    +      case s @ Sort(_, _, a: Aggregate) if a.resolved =>
    --- End diff --

    @davies The missing attributes are also handled in `ResolveAggregateFunctions`. Thus it works.

    To answer your first question regarding `!s.resolved`, this is part of the algorithm design in the rule `ResolveAggregateFunctions`, as shown below:
    https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L706-L708
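How the two `case` patterns in the diff above split up plans can be seen on a toy plan tree. This is only a sketch under stated assumptions: the `Plan` classes below are simplified, hypothetical stand-ins for Catalyst's logical operators, and the rule names returned are labels taken from the thread, not a claim about what each rule does internally.

```scala
// Hypothetical, heavily simplified plan nodes; `resolved` is a plain flag here.
sealed trait Plan { def resolved: Boolean }
case class Relation(name: String) extends Plan { val resolved: Boolean = true }
case class Project(child: Plan, resolved: Boolean) extends Plan
case class Aggregate(child: Plan, resolved: Boolean) extends Plan
case class Sort(child: Plan, resolved: Boolean) extends Plan

object RuleDispatchSketch {
  // Reports which pattern from the diff a plan would match first.
  def handledBy(plan: Plan): String = plan match {
    // Added pattern: a Sort directly over a resolved Aggregate.
    case Sort(a: Aggregate, _) if a.resolved => "ResolveAggregateFunctions"
    // Removed pattern: an unresolved Sort over a resolved Project.
    case s @ Sort(p: Project, _) if !s.resolved && p.resolved => "ResolveSortReferences"
    case _ => "neither"
  }
}
```

In this toy model an already resolved `Sort` over a `Project` matches neither pattern, which mirrors the role of the `!s.resolved` guard raised in the question.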
[GitHub] spark pull request: [SPARK-12567][SQL] Add aes_{encrypt,decrypt} U...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10527#issuecomment-177414971
**[Test build #50457 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50457/consoleFull)** for PR 10527 at commit [`04a14cf`](https://github.com/apache/spark/commit/04a14cf072630fbe3619bf241ff3d10d383594a5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12567][SQL] Add aes_{encrypt,decrypt} U...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10527#issuecomment-177415021 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50457/ Test PASSed.
[GitHub] spark pull request: [SPARK-12567][SQL] Add aes_{encrypt,decrypt} U...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10527#issuecomment-177415020 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/10678#discussion_r51356255
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -521,38 +522,99 @@ class Analyzer(
    */
   object ResolveSortReferences extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
-      case s @ Sort(ordering, global, p @ Project(projectList, child))
-          if !s.resolved && p.resolved =>
-        val (newOrdering, missing) = resolveAndFindMissing(ordering, p, child)
+      case s @ Sort(_, _, a: Aggregate) if a.resolved =>
--- End diff --
So it seems that the rule in `ResolveAggregateFunctions` does not really resolve the missing attributes; we could keep that rule unchanged in this PR. If it's not trivial to fix this, we could create another JIRA for that.
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10678#discussion_r51356658
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -521,38 +522,99 @@ class Analyzer(
    */
   object ResolveSortReferences extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
-      case s @ Sort(ordering, global, p @ Project(projectList, child))
-          if !s.resolved && p.resolved =>
-        val (newOrdering, missing) = resolveAndFindMissing(ordering, p, child)
+      case s @ Sort(_, _, a: Aggregate) if a.resolved =>
--- End diff --
@cloud-fan I will let `ResolveAggregateFunctions` handle the missing attribute resolution whenever the child of `Sort` is an `Aggregate`.
```scala
// Skip sort with aggregate. This will be handled in ResolveAggregateFunctions
case sa @ Sort(_, _, child: Aggregate) => sa
```
When rewriting `ResolveSortReferences` in another PR, I will try to make both rules behave identically when resolving the missing attributes.
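The skip pattern proposed in the comment above can be illustrated outside Spark with a toy model. This is only a sketch under stated assumptions: the `Plan` node classes and `ToyResolveSortReferences` below are hypothetical stand-ins, not Catalyst's real `LogicalPlan` classes or the rule in the PR, and column names are plain strings rather than resolved attributes.

```scala
// Toy stand-ins for logical plan nodes (hypothetical, NOT the Catalyst classes).
sealed trait Plan
case class Relation(name: String) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
case class Aggregate(grouping: Seq[String], child: Plan) extends Plan
case class Sort(ordering: Seq[String], global: Boolean, child: Plan) extends Plan

// Hypothetical rule: returns the (possibly rewritten) plan plus a tag naming
// the branch that fired, so the behaviour is easy to inspect.
object ToyResolveSortReferences {
  def apply(plan: Plan): (Plan, String) = plan match {
    // Sort directly over Aggregate: return it untouched and leave it to an
    // aggregate-specific rule (ResolveAggregateFunctions in the discussion).
    case sa @ Sort(_, _, _: Aggregate) =>
      (sa, "skipped")

    // Sort over Project: if the ordering uses columns the Project drops,
    // add them below the Sort and project them away again on top.
    case s @ Sort(ordering, global, Project(columns, child)) =>
      val missing = ordering.filterNot(columns.contains)
      if (missing.isEmpty) (s, "unchanged")
      else {
        val widened = Project(columns ++ missing, child)
        (Project(columns, Sort(ordering, global, widened)), "rewritten")
      }

    case other =>
      (other, "unchanged")
  }
}
```

Applying `ToyResolveSortReferences` to `Sort(Seq("b"), true, Aggregate(Seq("a"), Relation("t")))` takes the first branch and hands the plan back unchanged, while the same `Sort` over `Project(Seq("a"), Relation("t"))` is rewritten. Putting the skip case first is what keeps the two rules from competing over the same plan shape.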
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/10989#discussion_r51356688
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala ---
@@ -81,6 +82,30 @@ class BenchmarkWholeStageCodegen extends SparkFunSuite {
     benchmark.run()
   }
 
+  def testBroadcastHashJoin(values: Int): Unit = {
+    val benchmark = new Benchmark("BroadcastHashJoin", values)
+
+    val dim = broadcast(sqlContext.range(1 << 16).selectExpr("id as k", "cast(id as string) as v"))
+
+    benchmark.addCase("BroadcastHashJoin w/o codegen") { iter =>
+      sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
+      sqlContext.range(values).join(dim, (col("id") % 6) === col("k")).count()
+    }
+    benchmark.addCase(s"BroadcastHashJoin w codegen") { iter =>
+      sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
+      sqlContext.range(values).join(dim, (col("id") % 6) === col("k")).count()
+    }
+
+    /*
+      Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
+      BroadcastHashJoin:               Avg Time(ms)    Avg Rate(M/s)  Relative Rate
+      -------------------------------------------------------------------------------
+      BroadcastHashJoin w/o codegen         3053.41             3.43         1.00 X
+      BroadcastHashJoin w codegen           1028.40            10.20         2.97 X
--- End diff --
Since the dimension table is pretty small, the overhead of the broadcast is also low. When I ran it with a larger range, the improvement did not change much, because the lookup in BytesToBytes is the bottleneck. I will have another PR to improve the join with a small dimension table.
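As a quick check on the benchmark table in the comment above, the Relative Rate column is consistent with dividing the baseline average time by the codegen average time. A minimal Scala sketch, in which the object and method names are hypothetical and only the two timings are taken from the benchmark output:

```scala
object RelativeRate {
  // Average times in milliseconds, copied from the benchmark comment.
  val withoutCodegenMs: Double = 3053.41
  val withCodegenMs: Double = 1028.40

  // Speedup of a candidate over a baseline: baseline time / candidate time.
  def speedup(baselineMs: Double, candidateMs: Double): Double =
    baselineMs / candidateMs

  // Round to two decimal places, matching the precision shown in the table.
  def round2(x: Double): Double = math.round(x * 100) / 100.0
}
```

`RelativeRate.round2(RelativeRate.speedup(RelativeRate.withoutCodegenMs, RelativeRate.withCodegenMs))` evaluates to 2.97, which agrees with the reported `2.97 X`.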