[GitHub] spark pull request #19841: [SPARK-22642][SQL] the createdTempDir will not be...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19841#discussion_r154490891 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -104,147 +105,153 @@ case class InsertIntoHiveTable( val partitionColumns = fileSinkConf.getTableInfo.getProperties.getProperty("partition_columns") val partitionColumnNames = Option(partitionColumns).map(_.split("/")).getOrElse(Array.empty) -// By this time, the partition map must match the table's partition columns -if (partitionColumnNames.toSet != partition.keySet) { - throw new SparkException( -s"""Requested partitioning does not match the ${table.identifier.table} table: - |Requested partitions: ${partition.keys.mkString(",")} - |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin) -} - -// Validate partition spec if there exist any dynamic partitions -if (numDynamicPartitions > 0) { - // Report error if dynamic partitioning is not enabled - if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) { -throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg) +def processInsert = { + // By this time, the partition map must match the table's partition columns + if (partitionColumnNames.toSet != partition.keySet) { +throw new SparkException( + s"""Requested partitioning does not match the ${table.identifier.table} table: + |Requested partitions: ${partition.keys.mkString(",")} + |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin) } - // Report error if dynamic partition strict mode is on but no static partition is found - if (numStaticPartitions == 0 && -hadoopConf.get("hive.exec.dynamic.partition.mode", "strict").equalsIgnoreCase("strict")) { -throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_STRICT_MODE.getMsg) - } + // Validate partition spec if there exist any dynamic partitions + if (numDynamicPartitions > 0) { +// Report error if dynamic partitioning is not enabled +if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) { + throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg) +} + +// Report error if dynamic partition strict mode is on but no static partition is found +if (numStaticPartitions == 0 && + hadoopConf.get("hive.exec.dynamic.partition.mode", "strict").equalsIgnoreCase("strict")) { + throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_STRICT_MODE.getMsg) +} - // Report error if any static partition appears after a dynamic partition - val isDynamic = partitionColumnNames.map(partitionSpec(_).isEmpty) - if (isDynamic.init.zip(isDynamic.tail).contains((true, false))) { -throw new AnalysisException(ErrorMsg.PARTITION_DYN_STA_ORDER.getMsg) +// Report error if any static partition appears after a dynamic partition +val isDynamic = partitionColumnNames.map(partitionSpec(_).isEmpty) +if (isDynamic.init.zip(isDynamic.tail).contains((true, false))) { + throw new AnalysisException(ErrorMsg.PARTITION_DYN_STA_ORDER.getMsg) +} } -} -table.bucketSpec match { - case Some(bucketSpec) => -// Writes to bucketed hive tables are allowed only if user does not care about maintaining -// table's bucketing ie. both "hive.enforce.bucketing" and "hive.enforce.sorting" are -// set to false -val enforceBucketingConfig = "hive.enforce.bucketing" -val enforceSortingConfig = "hive.enforce.sorting" + table.bucketSpec match { +case Some(bucketSpec) => + // Writes to bucketed hive tables are allowed only if user does not care about maintaining + // table's bucketing ie. 
both "hive.enforce.bucketing" and "hive.enforce.sorting" are + // set to false + val enforceBucketingConfig = "hive.enforce.bucketing" + val enforceSortingConfig = "hive.enforce.sorting" -val message = s"Output Hive table ${table.identifier} is bucketed but Spark" + - "currently does NOT populate bucketed output which is compatible with Hive." + val message = s"Output Hive table ${table.identifier} is bucketed but Spark" + +"currently does NOT populate bucketed output which is compatible with Hive." -if (hadoopConf.get(enforceBucketingConfig, "true").toBoolean || -
[GitHub] spark pull request #19841: [SPARK-22642][SQL] the createdTempDir will not be...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19841#discussion_r154490878 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -104,147 +105,153 @@ case class InsertIntoHiveTable( val partitionColumns = fileSinkConf.getTableInfo.getProperties.getProperty("partition_columns") val partitionColumnNames = Option(partitionColumns).map(_.split("/")).getOrElse(Array.empty) -// By this time, the partition map must match the table's partition columns -if (partitionColumnNames.toSet != partition.keySet) { - throw new SparkException( -s"""Requested partitioning does not match the ${table.identifier.table} table: - |Requested partitions: ${partition.keys.mkString(",")} - |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin) -} - -// Validate partition spec if there exist any dynamic partitions -if (numDynamicPartitions > 0) { - // Report error if dynamic partitioning is not enabled - if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) { -throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg) +def processInsert = { --- End diff -- Generally, we do not like nested method. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast beh...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19858#discussion_r154490770 --- Diff: docs/sql-programming-guide.md --- @@ -1776,6 +1776,8 @@ options. Note that, for DecimalType(38,0)*, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type. - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas related functionalities, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc. - In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details. + + - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489) for details. --- End diff -- Sorry, there is a duplicate `for details` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/cor...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19863#discussion_r154490741 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/OrcTest.scala --- @@ -15,18 +15,17 @@ * limitations under the License. */ -package org.apache.spark.sql.hive.orc +package org.apache.spark.sql.execution.datasources import java.io.File import scala.reflect.ClassTag import scala.reflect.runtime.universe.TypeTag import org.apache.spark.sql._ -import org.apache.spark.sql.hive.test.TestHiveSingleton import org.apache.spark.sql.test.SQLTestUtils -private[sql] trait OrcTest extends SQLTestUtils with TestHiveSingleton { +abstract class OrcTest extends QueryTest with SQLTestUtils { --- End diff -- Since we have not yet moved the ORC test suites themselves, why do we need to move this out of Hive now? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19805#discussion_r154490687 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -537,9 +537,48 @@ class Dataset[T] private[sql]( */ @Experimental @InterfaceStability.Evolving - def checkpoint(eager: Boolean): Dataset[T] = { --- End diff -- Could you check the test case of `def checkpoint`? At least we need to add a test case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
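Not the actual suite, but a minimal sketch of the kind of test being asked for, assuming the usual `SQLTestUtils`/`SharedSQLContext` helpers and that the PR adds `localCheckpoint(eager)` mirroring the existing `checkpoint(eager)`:

```scala
test("checkpoint and localCheckpoint preserve the data") {
  withTempDir { dir =>
    // reliable checkpoint needs a checkpoint directory; localCheckpoint does not
    spark.sparkContext.setCheckpointDir(dir.getCanonicalPath)
    Seq(true, false).foreach { eager =>
      val ds = spark.range(10)
      assert(ds.checkpoint(eager).count() === 10)
      assert(ds.localCheckpoint(eager).count() === 10) // hypothetical API under review
    }
  }
}
```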
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19865 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19865 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84386/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19865 **[Test build #84386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84386/testReport)** for PR 19865 at commit [`3b25b5f`](https://github.com/apache/spark/commit/3b25b5f2fdaec7b1e97732f1b8a0bf4c588a44f2). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19865 **[Test build #84386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84386/testReport)** for PR 19865 at commit [`3b25b5f`](https://github.com/apache/spark/commit/3b25b5f2fdaec7b1e97732f1b8a0bf4c588a44f2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Exclude global variables from ...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/19865 [SPARK-22668][SQL] Exclude global variables from arguments of method split by CodegenContext.splitExpressions()

## What changes were proposed in this pull request?

This PR fixes an incorrect result that can be produced when a method split out by `CodegenContext.splitExpressions()` has global variables among its arguments. Because those arguments are declared as local variables of the split method, an assignment that was meant to update the original global variable only updates the local copy. To fix this problem, this PR excludes global variables from the arguments of the split method.

Before this PR:
```
class Test {
  int globalInt;

  void splittedFunction(int globalInt) {
    ...
    globalInt = 2;   // updates only the parameter, not the field
  }

  void apply() {
    globalInt = 1;
    ...
    splittedFunction(globalInt);
    // globalInt should be 2 here since it is 2 if statements are not split
  }
}
```

After this PR:
```
class Test {
  int globalInt;

  void splittedFunction() {
    ...
    globalInt = 2;
  }

  void apply() {
    globalInt = 1;
    ...
    splittedFunction();
    // globalInt is 2
  }
}
```

## How was this patch tested?

Added a test case to `CodeGenerationSuite`.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-22668 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19865.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19865 commit 3b25b5f2fdaec7b1e97732f1b8a0bf4c588a44f2 Author: Kazuaki Ishizaki Date: 2017-12-02T06:07:03Z initial commit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84385/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19783 **[Test build #84385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84385/testReport)** for PR 19783 at commit [`9d2a463`](https://github.com/apache/spark/commit/9d2a463f76ecd164dd15a834aa971a4d222113b6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19864 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19864 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84384/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19864 **[Test build #84384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84384/testReport)** for PR 19864 at commit [`2082c0e`](https://github.com/apache/spark/commit/2082c0ed776206db4f064b936d9f560a4633b6a0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #10803: [SPARK-12875] [ML] Add Weight of Evidence and Informatio...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/10803 No it's not merged. Feel free to use the code as you wish. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19845 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19845 Merged to master. Thanks for reviewing this @viirya, @jiangxb1987, @dongjoon-hyun and @imatiach-msft. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19783 **[Test build #84385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84385/testReport)** for PR 19783 at commit [`9d2a463`](https://github.com/apache/spark/commit/9d2a463f76ecd164dd15a834aa971a4d222113b6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19864 **[Test build #84384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84384/testReport)** for PR 19864 at commit [`2082c0e`](https://github.com/apache/spark/commit/2082c0ed776206db4f064b936d9f560a4633b6a0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19863 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84381/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19863 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19863 **[Test build #84381 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84381/testReport)** for PR 19863 at commit [`a58ceba`](https://github.com/apache/spark/commit/a58ceba87f4e5a2b70d934e45a20030d6ddca3c2). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class OrcTest extends QueryTest with SQLTestUtils ` * `class OrcFilterSuite extends OrcTest with TestHiveSingleton ` * `class OrcQuerySuite extends OrcTest with TestHiveSingleton with BeforeAndAfterAll ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19858 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19858 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84383/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19858 **[Test build #84383 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84383/testReport)** for PR 19858 at commit [`76148f5`](https://github.com/apache/spark/commit/76148f582404a9f5272ae55d9994133020832536). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19841: [SPARK-22642][SQL] the createdTempDir will not be...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/19841#discussion_r154480032 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -104,147 +105,153 @@ case class InsertIntoHiveTable( val partitionColumns = fileSinkConf.getTableInfo.getProperties.getProperty("partition_columns") val partitionColumnNames = Option(partitionColumns).map(_.split("/")).getOrElse(Array.empty) -// By this time, the partition map must match the table's partition columns -if (partitionColumnNames.toSet != partition.keySet) { - throw new SparkException( -s"""Requested partitioning does not match the ${table.identifier.table} table: - |Requested partitions: ${partition.keys.mkString(",")} - |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin) -} - -// Validate partition spec if there exist any dynamic partitions -if (numDynamicPartitions > 0) { - // Report error if dynamic partitioning is not enabled - if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) { -throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg) +def processInsert = { + // By this time, the partition map must match the table's partition columns + if (partitionColumnNames.toSet != partition.keySet) { +throw new SparkException( + s"""Requested partitioning does not match the ${table.identifier.table} table: + |Requested partitions: ${partition.keys.mkString(",")} + |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin) } - // Report error if dynamic partition strict mode is on but no static partition is found - if (numStaticPartitions == 0 && -hadoopConf.get("hive.exec.dynamic.partition.mode", "strict").equalsIgnoreCase("strict")) { -throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_STRICT_MODE.getMsg) - } + // Validate partition spec if there exist any dynamic partitions + if (numDynamicPartitions > 0) { +// Report error if dynamic partitioning is not enabled +if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) { + throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg) +} + +// Report error if dynamic partition strict mode is on but no static partition is found +if (numStaticPartitions == 0 && + hadoopConf.get("hive.exec.dynamic.partition.mode", "strict").equalsIgnoreCase("strict")) { + throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_STRICT_MODE.getMsg) +} - // Report error if any static partition appears after a dynamic partition - val isDynamic = partitionColumnNames.map(partitionSpec(_).isEmpty) - if (isDynamic.init.zip(isDynamic.tail).contains((true, false))) { -throw new AnalysisException(ErrorMsg.PARTITION_DYN_STA_ORDER.getMsg) +// Report error if any static partition appears after a dynamic partition +val isDynamic = partitionColumnNames.map(partitionSpec(_).isEmpty) +if (isDynamic.init.zip(isDynamic.tail).contains((true, false))) { + throw new AnalysisException(ErrorMsg.PARTITION_DYN_STA_ORDER.getMsg) +} } -} -table.bucketSpec match { - case Some(bucketSpec) => -// Writes to bucketed hive tables are allowed only if user does not care about maintaining -// table's bucketing ie. both "hive.enforce.bucketing" and "hive.enforce.sorting" are -// set to false -val enforceBucketingConfig = "hive.enforce.bucketing" -val enforceSortingConfig = "hive.enforce.sorting" + table.bucketSpec match { +case Some(bucketSpec) => + // Writes to bucketed hive tables are allowed only if user does not care about maintaining + // table's bucketing ie. 
both "hive.enforce.bucketing" and "hive.enforce.sorting" are + // set to false + val enforceBucketingConfig = "hive.enforce.bucketing" + val enforceSortingConfig = "hive.enforce.sorting" -val message = s"Output Hive table ${table.identifier} is bucketed but Spark" + - "currently does NOT populate bucketed output which is compatible with Hive." + val message = s"Output Hive table ${table.identifier} is bucketed but Spark" + +"currently does NOT populate bucketed output which is compatible with Hive." -if (hadoopConf.get(enforceBucketingConfig, "true").toBoolean || -
[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19858 **[Test build #84383 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84383/testReport)** for PR 19858 at commit [`76148f5`](https://github.com/apache/spark/commit/76148f582404a9f5272ae55d9994133020832536). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19627: [SPARK-21088][ML][WIP] CrossValidator, TrainValidationSp...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19627 @jkbradley I think it is better to review #19857 (fix python model specific optimization) and merge it first, and then I will rebase & update this PR. :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19848 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84380/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19848 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19848 **[Test build #84380 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84380/testReport)** for PR 19848 at commit [`8e68683`](https://github.com/apache/spark/commit/8e6868395e9291ad8ddf2a0251f7e6163618b132). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19864 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19864 **[Test build #84382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84382/testReport)** for PR 19864 at commit [`32f7c74`](https://github.com/apache/spark/commit/32f7c74a9b5cf4f19e7d14357bb87064383e11b3). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19864 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84382/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...
Github user a10y commented on the issue: https://github.com/apache/spark/pull/18424 @ptkool are you still tracking this at all? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19864 **[Test build #84382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84382/testReport)** for PR 19864 at commit [`32f7c74`](https://github.com/apache/spark/commit/32f7c74a9b5cf4f19e7d14357bb87064383e11b3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...
Github user CodingCat commented on the issue: https://github.com/apache/spark/pull/19864 @cloud-fan @viirya @gatorsmile @felixcheung @hvanhovell @HyukjinKwon @dongjoon-hyun @liancheng --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
GitHub user CodingCat opened a pull request: https://github.com/apache/spark/pull/19864 [SPARK-22673][SQL] InMemoryRelation should utilize on-disk table stats whenever possible ## What changes were proposed in this pull request? The current implementation of InMemoryRelation always uses the most expensive execution plan when writing the cache. With CBO enabled, we can actually get a more exact estimate of the underlying table size... ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/CodingCat/spark SPARK-22673 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19864.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19864 commit b2fb1d25804b7bdbe1a767306a319dc748965bce Author: CodingCat Date: 2016-03-07T14:37:37Z improve the doc for "spark.memory.offHeap.size" commit 0971900d562cb1a18af6f7de02bb8eb95637a640 Author: CodingCat Date: 2016-03-07T19:00:16Z fix commit 32f7c74a9b5cf4f19e7d14357bb87064383e11b3 Author: CodingCat Date: 2017-12-01T23:05:35Z use cbo stats in inmemoryrelation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
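For context, a rough sketch of how the on-disk stats would come into play (the config and command are standard Spark SQL; the table name is illustrative):

```scala
// With CBO enabled and ANALYZE run, the catalog already holds a size/row-count estimate,
// so building the InMemoryRelation can start from that instead of the default,
// most expensive plan-based estimate.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.sql("ANALYZE TABLE src COMPUTE STATISTICS") // writes table-level stats to the metastore
val cached = spark.table("src").cache()
cached.count()                                    // materializes the cache
```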
[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19746 @WeichenXu123 From what I've seen, it's more common for people to use VectorAssembler to assemble a bunch of Numeric columns, rather than a bunch of Vector columns. I'd recommend we do things incrementally, adding single-column support before multi-column support (especially since we're still trying to achieve consensus about design for multi-column support, per my recent comment in the umbrella JIRA). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
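For reference, the common pattern described above — assembling a handful of numeric columns into one feature vector — looks roughly like this (column names are illustrative):

```scala
import org.apache.spark.ml.feature.VectorAssembler

val df = spark.createDataFrame(Seq(
  (25.0, 180.0, 75.0),
  (31.0, 165.0, 61.0)
)).toDF("age", "height", "weight")

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "height", "weight"))
  .setOutputCol("features")

assembler.transform(df).select("features").show()
```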
[GitHub] spark issue #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19863 **[Test build #84381 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84381/testReport)** for PR 19863 at commit [`a58ceba`](https://github.com/apache/spark/commit/a58ceba87f4e5a2b70d934e45a20030d6ddca3c2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/cor...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/19863 [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core` ## What changes were proposed in this pull request? To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` instead of `sql/hive`. ## How was this patch tested? This is a test suite only change. It should pass all existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-22672 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19863.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19863 commit a58ceba87f4e5a2b70d934e45a20030d6ddca3c2 Author: Dongjoon Hyun Date: 2017-12-01T21:58:54Z [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19527 Question about this PR description comment: > Note that keep can't be used at the same time with dropLast as true. Because they will conflict in encoded vector by producing a vector of zeros. Why is this necessary? With ```n``` categories found in fitting, shouldn't the behavior be the following? * ```keep=true, dropLast=true``` ==> vector size n * ```keep=true, dropLast=false``` ==> vector size n+1 * ```keep=false, dropLast=true``` ==> vector size n-1 * ```keep=false, dropLast=false``` ==> vector size n --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
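A sketch of the four combinations being asked about, assuming the proposed estimator exposes `dropLast` and `handleInvalid` ("keep"/"error"); the sizes in the comments restate the question above, not confirmed behavior:

```scala
import org.apache.spark.ml.feature.OneHotEncoderEstimator // the class proposed in this PR

// A toy input whose "categoryIndex" column has n = 3 distinct categories at fit time.
val df = spark.range(0, 6).selectExpr("cast(id % 3 as double) as categoryIndex")

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))
  .setDropLast(true)        // drop the last category slot
  .setHandleInvalid("keep") // reserve an extra slot for unseen/invalid values
val model = encoder.fit(df)

// Vector sizes as proposed in the question above, for n = 3:
//   keep=true,  dropLast=true  -> 3
//   keep=true,  dropLast=false -> 4
//   keep=false, dropLast=true  -> 2
//   keep=false, dropLast=false -> 3
```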
[GitHub] spark pull request #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast beh...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19858#discussion_r154453799 --- Diff: docs/sql-programming-guide.md --- @@ -1776,6 +1776,8 @@ options. Note that, for DecimalType(38,0)*, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type. - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas related functionalities, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc. - In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details. + + - Since Spark 2.3, broadcast behaviour changed to broadcast the join side with an explicit broadcast hint first. See [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489) for details. --- End diff -- ``` Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [JDBC/ODBC](#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489) for details. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
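For reference, the hint the proposed sentence documents can be given either through the DataFrame API or as a SQL comment hint (table names are illustrative):

```scala
import org.apache.spark.sql.functions.broadcast

// DataFrame API: explicitly request broadcasting one join side
spark.table("records").join(broadcast(spark.table("src")), "key")

// SQL hint form, per the "Broadcast Hint for SQL Queries" section linked above
spark.sql("SELECT /*+ BROADCAST(s) */ * FROM records r JOIN src s ON r.key = s.key")
```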
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r154452715 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala --- @@ -41,8 +41,12 @@ import org.apache.spark.sql.types.{DoubleType, NumericType, StructType} * The output vectors are sparse. * * @see `StringIndexer` for converting categorical values into category indices + * @deprecated `OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder` --- End diff -- Note for the future: For 3.0, it'd be nice to do what you're describing here but also leave OneHotEncoderEstimator as a deprecated alias. That way, user code won't break but will have deprecation warnings when upgrading to 3.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19838: [SPARK-22638][SS]Use a separate queue for Streami...
Github user zsxwing closed the pull request at: https://github.com/apache/spark/pull/19838 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate queue for StreamingQuery...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19838 Thanks! Merging to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84378/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84378 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84378/testReport)** for PR 19651 at commit [`74cb053`](https://github.com/apache/spark/commit/74cb053ffaaa5c850810cfbd71e76aad38e2ba5b). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class OrcDeserializer(` * ` sealed trait CatalystDataUpdater ` * ` final class RowUpdater(row: InternalRow) extends CatalystDataUpdater ` * ` final class ArrayDataUpdater(array: ArrayData) extends CatalystDataUpdater ` * `class OrcSerializer(dataSchema: StructType) ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19848 **[Test build #84380 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84380/testReport)** for PR 19848 at commit [`8e68683`](https://github.com/apache/spark/commit/8e6868395e9291ad8ddf2a0251f7e6163618b132). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/19045 So it seems like the YARN changes are only going to happen in Hadoop 3+, so this might make sense regardless of what happens in https://github.com/apache/spark/pull/19267 (since folks like K8 or whoever can send the message as desired). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate queue for StreamingQuery...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/19838 LGTM. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19848 LGTM but I'll leave it here a bit for others to take a look. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/19858 cc @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19840 Instead of setting `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python`, what happens if you set `PYSPARK_DRIVER_PYTHON=~/anaconda3/envs/py3/bin/python`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/19831 Instead of manually setting up table statistics, I'm just trying to simulate the statistics for these tables in this way. If `totalSize (or rawDataSize) > 0` and `rowCount = 0`, at least one of these parameters is incorrect, and we should not optimize based on such incorrect statistics. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
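A sketch of the kind of inconsistent statistics being described, using Hive-style table properties (`rowCount` corresponds to Hive's `numRows`; the table name is illustrative):

```scala
// A table whose properties claim bytes on disk (totalSize > 0) but zero rows (numRows = 0):
// at least one of the two must be wrong, so neither should drive optimizer decisions.
spark.sql("CREATE TABLE t (key INT) STORED AS PARQUET")
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('totalSize' = '1024', 'numRows' = '0')")
spark.sql("DESC EXTENDED t").show(truncate = false) // inspect the statistics Spark picks up
```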
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113 @cloud-fan done, could you please have a look? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor R...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19758 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84379/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor R...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19758 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor R...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19758 **[Test build #84379 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84379/testReport)** for PR 19758 at commit [`5bcccda`](https://github.com/apache/spark/commit/5bcccda6a599f60d2d084ecf871649675a792d5d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19627: [SPARK-21088][ML][WIP] CrossValidator, TrainValidationSp...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19627 Is this still WIP or ready? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/19631 +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r154408356 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -0,0 +1,210 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, SearchArgumentFactory} +import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types._ + +/** + * Helper object for building ORC `SearchArgument`s, which are used for ORC predicate push-down. + * + * Due to limitation of ORC `SearchArgument` builder, we had to end up with a pretty weird double- + * checking pattern when converting `And`/`Or`/`Not` filters. + * + * An ORC `SearchArgument` must be built in one pass using a single builder. For example, you can't + * build `a = 1` and `b = 2` first, and then combine them into `a = 1 AND b = 2`. This is quite + * different from the cases in Spark SQL or Parquet, where complex filters can be easily built using + * existing simpler ones. + * + * The annoying part is that, `SearchArgument` builder methods like `startAnd()`, `startOr()`, and + * `startNot()` mutate internal state of the builder instance. This forces us to translate all + * convertible filters with a single builder instance. However, before actually converting a filter, + * we've no idea whether it can be recognized by ORC or not. Thus, when an inconvertible filter is + * found, we may already end up with a builder whose internal state is inconsistent. + * + * For example, to convert an `And` filter with builder `b`, we call `b.startAnd()` first, and then + * try to convert its children. Say we convert `left` child successfully, but find that `right` + * child is inconvertible. Alas, `b.startAnd()` call can't be rolled back, and `b` is inconsistent + * now. + * + * The workaround employed here is that, for `And`/`Or`/`Not`, we first try to convert their + * children with brand new builders, and only do the actual conversion with the right builder + * instance when the children are proven to be convertible. + * + * P.S.: Hive seems to use `SearchArgument` together with `ExprNodeGenericFuncDesc` only. Usage of + * builder methods mentioned above can only be found in test code, where all tested filters are + * known to be convertible. + */ +private[orc] object OrcFilters { --- End diff -- Yes. It's logically the same with old version. Only API usage is updated here. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
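A condensed sketch of the "probe children with throwaway builders first" pattern that the comment block above describes; `convert` stands in for the real per-filter conversion logic and is not the actual private API:

```scala
import org.apache.orc.storage.ql.io.sarg.SearchArgumentFactory
import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder
import org.apache.spark.sql.sources.Filter

// Placeholder: returns the builder if `filter` is convertible, None otherwise.
def convert(filter: Filter, builder: Builder): Option[Builder] = ???

def convertAnd(left: Filter, right: Filter, builder: Builder): Option[Builder] = {
  // Probe both children with brand-new builders; these may end up in an inconsistent
  // state, but they are simply discarded afterwards.
  val bothConvertible =
    convert(left, SearchArgumentFactory.newBuilder()).isDefined &&
      convert(right, SearchArgumentFactory.newBuilder()).isDefined

  if (bothConvertible) {
    // Only now mutate the shared builder, knowing the conversion cannot fail halfway.
    for {
      b1 <- convert(left, builder.startAnd())
      b2 <- convert(right, b1)
    } yield b2.end()
  } else {
    None
  }
}
```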
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r154407217 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -0,0 +1,210 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, SearchArgumentFactory} +import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types._ + +/** + * Helper object for building ORC `SearchArgument`s, which are used for ORC predicate push-down. + * + * Due to limitation of ORC `SearchArgument` builder, we had to end up with a pretty weird double- + * checking pattern when converting `And`/`Or`/`Not` filters. + * + * An ORC `SearchArgument` must be built in one pass using a single builder. For example, you can't + * build `a = 1` and `b = 2` first, and then combine them into `a = 1 AND b = 2`. This is quite + * different from the cases in Spark SQL or Parquet, where complex filters can be easily built using + * existing simpler ones. + * + * The annoying part is that, `SearchArgument` builder methods like `startAnd()`, `startOr()`, and + * `startNot()` mutate internal state of the builder instance. This forces us to translate all + * convertible filters with a single builder instance. However, before actually converting a filter, + * we've no idea whether it can be recognized by ORC or not. Thus, when an inconvertible filter is + * found, we may already end up with a builder whose internal state is inconsistent. + * + * For example, to convert an `And` filter with builder `b`, we call `b.startAnd()` first, and then + * try to convert its children. Say we convert `left` child successfully, but find that `right` + * child is inconvertible. Alas, `b.startAnd()` call can't be rolled back, and `b` is inconsistent + * now. + * + * The workaround employed here is that, for `And`/`Or`/`Not`, we first try to convert their + * children with brand new builders, and only do the actual conversion with the right builder + * instance when the children are proven to be convertible. + * + * P.S.: Hive seems to use `SearchArgument` together with `ExprNodeGenericFuncDesc` only. Usage of + * builder methods mentioned above can only be found in test code, where all tested filters are + * known to be convertible. + */ +private[orc] object OrcFilters { --- End diff -- I didn't review it careful, just assume it's same with the old version, with API update. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor R...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19758 **[Test build #84379 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84379/testReport)** for PR 19758 at commit [`5bcccda`](https://github.com/apache/spark/commit/5bcccda6a599f60d2d084ecf871649675a792d5d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19651 Thank you so much, @cloud-fan . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84378 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84378/testReport)** for PR 19651 at commit [`74cb053`](https://github.com/apache/spark/commit/74cb053ffaaa5c850810cfbd71e76aad38e2ba5b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19848#discussion_r154403383 --- Diff: core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala --- @@ -70,7 +70,8 @@ object SparkHadoopMapRedUtil extends Logging { if (shouldCoordinateWithDriver) { val outputCommitCoordinator = SparkEnv.get.outputCommitCoordinator val taskAttemptNumber = TaskContext.get().attemptNumber() -val canCommit = outputCommitCoordinator.canCommit(jobId, splitId, taskAttemptNumber) +val stageId = TaskContext.get().stageId() +val canCommit = outputCommitCoordinator.canCommit(stageId, splitId, taskAttemptNumber) --- End diff -- Ah, great, another public class that has no good reason to be public... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r154403001 --- Diff: project/MimaExcludes.scala --- @@ -36,6 +36,12 @@ object MimaExcludes { // Exclude rules for 2.3.x lazy val v23excludes = v22excludes ++ Seq( +// SPARK-22372: Make cluster submission use SparkApplication. + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.SparkHadoopUtil.getSecretKeyFromUserCredentials"), --- End diff -- The change is 2.3 only. I've always questioned why `SparkHadoopUtil` is public in the first place. I'm not that worried about removing the functionality in this method because Spark apps don't really need to depend on it; it's easy to call the Hadoop API directly, and the API did nothing before except in YARN mode. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
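For what it's worth, the direct Hadoop call is indeed a one-liner; this is roughly what the removed YARN-mode helper boiled down to (the alias name is illustrative):

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.security.UserGroupInformation

// Reads a secret key stored in the current user's Hadoop Credentials.
val secret: Array[Byte] =
  UserGroupInformation.getCurrentUser.getCredentials.getSecretKey(new Text("my.secret.alias"))
```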
[GitHub] spark issue #19789: [SPARK-22562][Streaming] CachedKafkaConsumer unsafe evic...
Github user daroo commented on the issue: https://github.com/apache/spark/pull/19789 @HyukjinKwon @zsxwing Any chance to go ahead with this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19861 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19861 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84377/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin read less data whe...
Github user gczsjdy commented on the issue: https://github.com/apache/spark/pull/19862 cc @cloud-fan @viirya @ConeyLiu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19861 **[Test build #84377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84377/testReport)** for PR 19861 at commit [`eaa6cae`](https://github.com/apache/spark/commit/eaa6cae1fd61f21238acdcfc7477d559cc47). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19862: Make SortMergeJoin read less data when wholeStageCodegen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19862 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19862: Make SortMergeJoin read less data when wholeStage...
GitHub user gczsjdy opened a pull request: https://github.com/apache/spark/pull/19862 Make SortMergeJoin read less data when wholeStageCodegen is off

## What changes were proposed in this pull request?

In SortMergeJoin (with wholeStageCodegen), an optimization already exists: if the left table of a partition is empty, there is no need to read the right table of the corresponding partition. This benefits the case in which many partitions of the left table are empty and the right table is big. In the code path without wholeStageCodegen, however, this optimization doesn't happen. This is mainly due to the lack of this optimization in the non-codegen SortMergeJoin path and the lack of support in `SortExec`, which doesn't do lazy evaluation.

UI when wholeStageCodegen is off: ![image](https://user-images.githubusercontent.com/7685352/33493586-8311ac58-d6fb-11e7-816c-7a0fb2065345.PNG)

When the switch is on: ![image](https://user-images.githubusercontent.com/7685352/33493675-c821b81a-d6fb-11e7-8cf8-2e5baff75be3.png)

This PR lazily evaluates the sort, and only triggers the right-table read after reading the left table and finding it's not empty. After this PR, with wholeStageCodegen off: ![image](https://user-images.githubusercontent.com/7685352/33493784-2e1ee89a-d6fc-11e7-8201-71273de0b857.png)

## How was this patch tested?

Existing test suites.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gczsjdy/spark smj_less_read

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19862.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19862

commit a8511e8b7e0240c471f88e703545301425960901
Author: GuoChenzhao
Date: 2017-11-28T09:01:34Z

    Init solution

commit a5c14b473636e571c6d4f17798f220be080a27a1
Author: GuoChenzhao
Date: 2017-11-28T09:31:07Z

    Style

commit a26ca57d56b0b2df81daa82da49f4ff564fc10f5
Author: GuoChenzhao
Date: 2017-11-28T09:37:54Z

    Comments

commit 6d875f84de95d67f84ef9774cb6b6ee8273d46a6
Author: GuoChenzhao
Date: 2017-12-01T11:32:06Z

    lazy evaluation in SortExec

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
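The optimization described above can be pictured with a small, self-contained sketch using plain Scala iterators (this is not Spark's `SortMergeJoinExec` code, and the 1:1 key matching is a deliberate simplification). The key point is that the right side is a by-name parameter, so it is never read or sorted when the left side of the partition is empty:

```scala
object SmjSketch {
  // Toy merge join over two iterators sorted by key; at most one match per key
  // is emitted, just to keep the sketch short.
  def joinPartition[K: Ordering, L, R](
      left: Iterator[(K, L)],
      right: => Iterator[(K, R)]): Iterator[(K, (L, R))] = {
    if (!left.hasNext) return Iterator.empty   // left empty: never touch the right side
    val ord = implicitly[Ordering[K]]
    val lb = left.buffered
    val rb = right.buffered                    // right side is only evaluated here
    new Iterator[(K, (L, R))] {
      private var nextElem: Option[(K, (L, R))] = advance()
      private def advance(): Option[(K, (L, R))] = {
        while (lb.hasNext && rb.hasNext) {
          val cmp = ord.compare(lb.head._1, rb.head._1)
          if (cmp < 0) lb.next()
          else if (cmp > 0) rb.next()
          else {
            val (k, l) = lb.next()
            val (_, r) = rb.next()
            return Some((k, (l, r)))
          }
        }
        None
      }
      override def hasNext: Boolean = nextElem.isDefined
      override def next(): (K, (L, R)) = {
        val res = nextElem.get
        nextElem = advance()
        res
      }
    }
  }
}
```

Making the right side by-name (or otherwise lazy) is the same idea the PR applies to `SortExec`: the expensive work is deferred until we know it is actually needed.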
[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19860 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19860 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84376/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19860 **[Test build #84376 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84376/testReport)** for PR 19860 at commit [`ce74fb8`](https://github.com/apache/spark/commit/ce74fb8baff13332bed8c64db6c2a49be815f81f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19859 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19859 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84375/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19859 **[Test build #84375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84375/testReport)** for PR 19859 at commit [`222d369`](https://github.com/apache/spark/commit/222d3690c09fdc4e99e6f2f45e786175d3153049). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/19631

> If I understand what you're saying correctly, that should be considered a security issue in that server application regardless of this change. The server should not be exposing its environment to unprivileged users.

Yes, this is what I was saying, and I agree one could argue it's their fault; it's more a matter of us being nice, since we are not the ones adding it and they might not realize it's something they shouldn't expose. I'm not sure there was anything that sensitive in there before, but it's still good practice not to expose it either way.

> Let me think a little about this. I prefer the environment to the credentials approach because the latter are written to disk, but at the same time, that's less problematic than exposing the secret to users in the STS.

I agree written to disk isn't ideal either, but it's only local disk and not HDFS, so a user would have to compromise the box, versus someone simply and accidentally allowing access to env/conf via a UI or console.

Reviewing the changes now.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r154378257 --- Diff: project/MimaExcludes.scala --- @@ -36,6 +36,12 @@ object MimaExcludes { // Exclude rules for 2.3.x lazy val v23excludes = v22excludes ++ Seq( +// SPARK-22372: Make cluster submission use SparkApplication. + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.SparkHadoopUtil.getSecretKeyFromUserCredentials"), --- End diff -- This is only going into Spark 2.3? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19831 Is it really an issue? If you manually set wrong statistics, what would you expect the system to do? I think data source tables don't allow you to set the statistics manually, so this problem is inherited from Hive. cc @wzhfy to confirm. This PR treats a row count of 0 as invalid, which is arguable: if we analyze an empty table, then the 0 row count is valid. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
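To make the trade-off being debated concrete, here is a hedged sketch (illustrative names, not the PR's actual code) of what "treating a row count of 0 as invalid" means when estimating table size:

```scala
object StatsSketch {
  final case class HiveStats(totalSizeInBytes: BigInt, rowCount: Option[BigInt])

  // Treat rowCount == 0 as "unknown" and fall back to the on-disk size.
  // A stale numRows = 0 then no longer shrinks the estimate to zero and turns a
  // large table into a broadcast-join candidate; the cost, as noted above, is that
  // a genuinely empty, freshly analyzed table is also treated as unknown.
  def estimatedSizeInBytes(stats: HiveStats, avgRowSizeInBytes: BigInt): BigInt =
    stats.rowCount match {
      case Some(n) if n > 0 => n * avgRowSizeInBytes
      case _                => stats.totalSizeInBytes
    }
}
```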
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r154373635 --- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala --- @@ -542,7 +496,54 @@ private[spark] class SecurityManager( * Gets the secret key. * @return the secret key as a String if authentication is enabled, otherwise returns null */ - def getSecretKey(): String = secretKey + def getSecretKey(): String = { +if (isAuthenticationEnabled) { + Option(sparkConf.getenv(ENV_AUTH_SECRET)) --- End diff -- I agree SPARK_AUTH_SECRET_CONF is awkward and not really secure; when I initially did this, it was what was requested by other committers, since standalone and Mesos needed more security work around it anyway. I don't fully follow how MesosSecretConfig is going to be used. Are these just regular Spark configs passed around, or distributed through Mesos somehow? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19855: [SPARK-22662] [SQL] Failed to prune columns after rewrit...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19855 cc @jiangxb1987 who has more context. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18906

For https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L42, I don't think it's a public API. So, the Scala-side signatures are:

```scala
case class UserDefinedFunction protected[sql] (
    f: AnyRef,
    dataType: DataType,
    inputTypes: Option[Seq[DataType]]) {
```

```scala
def udf[RT: TypeTag](f: Function0[RT]): UserDefinedFunction = {
```

```scala
def register[RT: TypeTag](name: String, func: Function0[RT]): UserDefinedFunction = {
```

You are proposing adding `nullable` as below:

```python
class UserDefinedFunction(object):
    """
    User defined function in Python

    .. versionadded:: 1.3
    """
    def __init__(self, func, returnType=StringType(), name=None,
                 evalType=PythonEvalType.SQL_BATCHED_UDF, nullable=True):
```

```python
def udf(f=None, returnType=StringType(), nullable=True):
```

```python
def registerFunction(self, name, f, returnType=StringType(), nullable=True):
```

I don't think these are quite consistent.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19861 **[Test build #84377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84377/testReport)** for PR 19861 at commit [`eaa6cae`](https://github.com/apache/spark/commit/eaa6cae1fd61f21238acdcfc7477d559cc47). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...
GitHub user jiangxb1987 opened a pull request: https://github.com/apache/spark/pull/19861 [SPARK-22387][SQL] Propagate session configs to data source read/write options

## What changes were proposed in this pull request?

Introduce a new interface `ConfigSupport` for `DataSourceV2`; it helps propagate session configs with chosen key prefixes to the particular data source.

## How was this patch tested?

Added a corresponding test case in `DataSourceV2Suite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark datasource-configs

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19861.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19861

commit 356c55eb5d50db1ab0fe9f15285cf31d993fad8a
Author: Xingbo Jiang
Date: 2017-12-01T12:17:58Z

    propagate session configs to data source options.

commit b076a69b0180825d1726019ef6213d1be2324c26
Author: Xingbo Jiang
Date: 2017-12-01T13:36:22Z

    fix scalastyle

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
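A hypothetical sketch of the kind of interface being proposed; the trait and method names here are assumptions for illustration, and the PR diff has the real definition:

```scala
object ConfigSupportSketch {
  // A data source opts in by declaring which session-config key prefixes it cares about.
  trait ConfigSupport {
    def configKeyPrefixes: Seq[String]   // e.g. Seq("spark.datasource.mysource.")
  }

  // The session then copies only the matching configs into the source's options map.
  def propagatedOptions(
      sessionConfs: Map[String, String],
      source: ConfigSupport): Map[String, String] =
    sessionConfs.filter { case (key, _) =>
      source.configKeyPrefixes.exists(prefix => key.startsWith(prefix))
    }
}
```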
[GitHub] spark issue #19850: [SPARK-22653][CORE] executorAddress registered in Coarse...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/19850 thanks @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user ptkool commented on the issue: https://github.com/apache/spark/pull/18906 @HyukjinKwon As requested, here are the related Scala API changes: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L42 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala#L69 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3281 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L167 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19860 **[Test build #84376 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84376/testReport)** for PR 19860 at commit [`ce74fb8`](https://github.com/apache/spark/commit/ce74fb8baff13332bed8c64db6c2a49be815f81f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19860: [SPARK-22669][SQL] Avoid unnecessary function cal...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19860 [SPARK-22669][SQL] Avoid unnecessary function calls in code generation

## What changes were proposed in this pull request?

In many parts of the code generation codebase, we split the generated code into separate methods to avoid exceptions due to the 64KB method size limit. This produces a lot of methods which are called every time, even though sometimes this is not needed. As pointed out here: https://github.com/apache/spark/pull/19752#discussion_r153081547, this is a non-negligible overhead which can be avoided. The PR applies the same approach used in #19752 to the other places where this was feasible.

## How was this patch tested?

Existing UTs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22669

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19860.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19860

commit ce74fb8baff13332bed8c64db6c2a49be815f81f
Author: Marco Gaido
Date: 2017-12-01T12:55:58Z

    [SPARK-22669][SQL] Avoid unnecessary function calls in code generation

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
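The overhead being removed can be pictured with a toy example in plain Scala (this is not Spark's `CodegenContext` API, and the null guard is just one assumed example of a check that can be hoisted): when a shared guard stays inside each split-out helper, the helper calls still happen on every row; moving the guard to the single call site skips them entirely.

```scala
object SplitCallSketch {
  final class MutableState { var isNull = false; var value = 0L }

  // Before: every helper re-checks the guard, but the calls themselves still happen.
  def helperBefore1(s: MutableState): Unit = if (!s.isNull) s.value += 1
  def helperBefore2(s: MutableState): Unit = if (!s.isNull) s.value *= 2
  def evalBefore(s: MutableState): Unit = { helperBefore1(s); helperBefore2(s) }

  // After: one guard at the call site; no helper calls at all when it fails.
  def helperAfter1(s: MutableState): Unit = s.value += 1
  def helperAfter2(s: MutableState): Unit = s.value *= 2
  def evalAfter(s: MutableState): Unit =
    if (!s.isNull) { helperAfter1(s); helperAfter2(s) }
}
```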
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r154341528 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala --- @@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]] } } + +class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] { + override def zero: Double = Double.PositiveInfinity + override def reduce(b: Double, a: IN): Double = math.min(b, f(a)) + override def merge(b1: Double, b2: Double): Double = math.min(b1, b2) + override def finish(reduction: Double): Double = { +if (Double.PositiveInfinity == reduction) { --- End diff -- +1 for option 1. It sounds the best among the three to me too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19859 **[Test build #84375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84375/testReport)** for PR 19859 at commit [`222d369`](https://github.com/apache/spark/commit/222d3690c09fdc4e99e6f2f45e786175d3153049). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/19859 [SPARK-22634][BUILD] Update Bouncy Castle to 1.58

## What changes were proposed in this pull request?

Update Bouncy Castle to 1.58, and jets3t to 0.9.4 to (sort of) match.

## How was this patch tested?

Existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-22634

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19859.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19859

commit 222d3690c09fdc4e99e6f2f45e786175d3153049
Author: Sean Owen
Date: 2017-12-01T12:11:25Z

    Update Bouncy Castle to 1.58

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19846: [SPARK-22393][SPARK-SHELL] spark-shell can't find...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19846 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org