[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19843 **[Test build #84287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84287/testReport)** for PR 19843 at commit [`08954fe`](https://github.com/apache/spark/commit/08954fe0e299a3dcf992d5a6cbd5f12d907ac453). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19839: SPARK-22373 Bump Janino dependency version to fix thread...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19839 I think we don't need this, since we need to upgrade to the next Janino release anyway for the issue related to SPARK-18016.
[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19843 **[Test build #84286 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84286/testReport)** for PR 19843 at commit [`072f4b9`](https://github.com/apache/spark/commit/072f4b9f330af61e7de07c8c2e57421448aa306b).
[GitHub] spark pull request #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite su...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19843 [SPARK-22644][ML][TEST][WIP] Make ML testsuite support StructuredStreaming test ## What changes were proposed in this pull request? We need to add some helper code to make testing ML transformers & models easier with streaming data. These tests might help us catch any remaining issues, and we could encourage future PRs to use them to prevent new models & transformers from having issues. I add an `MLTest` trait which extends the `StreamTest` trait and overrides `createSparkSession`. An ML test suite then only needs to extend `MLTest` to use both the ML and streaming test util functions. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark ml_stream_test_helper Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19843.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19843 commit 072f4b9f330af61e7de07c8c2e57421448aa306b Author: WeichenXu Date: 2017-11-28T14:04:35Z init pr
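The mixin idea behind `trait MLTest extends StreamTest` can be sketched with interface default methods (Java here, since this digest offers no runnable Scala context; all names are illustrative stand-ins, not Spark's actual test classes): one base type pulls in both sets of helpers, so a suite only extends a single thing.

```java
// Sketch of the PR's design: MLTestHelpers extends StreamTestHelpers the way
// `MLTest extends StreamTest`, so a suite implementing only MLTestHelpers
// gets both the ML and the streaming test utilities. Names are hypothetical.
public class MLTestSketch {
    interface StreamTestHelpers {
        default String checkStreamOutput() { return "stream helper"; }
    }

    interface MLTestHelpers extends StreamTestHelpers {
        default String checkModelTransform() { return "ml helper"; }
    }

    // A concrete test suite needs to name only one base type.
    public static class MyTransformerSuite implements MLTestHelpers { }
}
```

The same single-inheritance convenience is what lets existing ML suites switch their base class to `MLTest` without further changes.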
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19805 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84282/ Test PASSed.
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19805 Merged build finished. Test PASSed.
[GitHub] spark pull request #19793: [SPARK-22574] [Mesos] [Submit] Check submission r...
Github user Gschiavon commented on a diff in the pull request: https://github.com/apache/spark/pull/19793#discussion_r153709924 --- Diff: core/src/test/scala/org/apache/spark/deploy/rest/SubmitRestProtocolSuite.scala --- @@ -86,6 +86,8 @@ class SubmitRestProtocolSuite extends SparkFunSuite { message.clientSparkVersion = "1.2.3" message.appResource = "honey-walnut-cherry.jar" message.mainClass = "org.apache.spark.examples.SparkPie" +message.appArgs = Array("hdfs://tmp/auth") +message.environmentVariables = Map("SPARK_HOME" -> "/test") --- End diff -- @susanxhuynh Yes, we have to make sure we don't break standalone mode. I've reviewed the StandaloneSubmitRequestServlet class, and I think we might be facing the same problem there (L.126-131): there are checks for 'appResource' and 'mainClass', but not for 'appArgs' (L.140) or 'environmentVariables' (L.143), even though they could be null since they are initialised as null. I think it's the same case; let me know what you think.
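The guard being discussed can be sketched as a pair of null checks (the class and method names below are hypothetical, not the actual StandaloneSubmitRequestServlet code): fields such as `appArgs` and `environmentVariables` are initialised as null in the protocol message, so the servlet should substitute empty defaults before dereferencing them.

```java
import java.util.Collections;
import java.util.Map;

// Hedged sketch of the null-guard under review: fall back to empty values
// for protocol-message fields that may still be null. Names are illustrative.
public class SubmitRequestGuards {
    public static String[] appArgsOrEmpty(String[] appArgs) {
        // appArgs is initialised as null, so avoid a later NullPointerException.
        return appArgs != null ? appArgs : new String[0];
    }

    public static Map<String, String> envOrEmpty(Map<String, String> environmentVariables) {
        return environmentVariables != null
                ? environmentVariables
                : Collections.emptyMap();
    }
}
```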
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19805 **[Test build #84282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84282/testReport)** for PR 19805 at commit [`59b5562`](https://github.com/apache/spark/commit/59b5562d1cfe738dc991ef17afad71abd891195d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user fjh100456 commented on the issue: https://github.com/apache/spark/pull/19218 @gatorsmile I tested the performance of the 'uncompressed', 'snappy', and 'gzip' compression algorithms for Parquet with input data volumes of 22MB, 220MB, and 1100MB, each run 10 times; 'snappy' performed best in several cases. The test results are as follows (TimeUnit: ms): ![default](https://user-images.githubusercontent.com/26785576/33362659-74c0cf06-d518-11e7-9907-2f353ffed37d.png)
[GitHub] spark issue #19842: [SPARK-22643][SQL] ColumnarArray should be an immutable ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19842 **[Test build #84285 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84285/testReport)** for PR 19842 at commit [`aaa33dd`](https://github.com/apache/spark/commit/aaa33dde6c392ad96f4673a70480d7bbb73e0d41).
[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/19842 [SPARK-22643][SQL] ColumnarArray should be an immutable view ## What changes were proposed in this pull request? To make `ColumnVector` public, `ColumnarArray` needs to be public too, and we should not have mutable public fields in a public class. This PR proposes to make `ColumnarArray` an immutable view of the data, and to always create a new instance of `ColumnarArray` in `ColumnVector#getArray`. ## How was this patch tested? Locally tested with TPCDS; no performance regression is observed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark column-vector Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19842.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19842 commit aaa33dde6c392ad96f4673a70480d7bbb73e0d41 Author: Wenchen Fan Date: 2017-11-29T07:16:57Z ColumnarArray should be an immutable view
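The design change can be sketched as follows (Java, with hypothetical names rather than Spark's real `ColumnVector`/`ColumnarArray` classes): the getter returns a fresh, immutable view over the backing data on every call, instead of exposing or reusing a mutable object.

```java
// Illustrative sketch of "an immutable view of the data": the view holds only
// final fields and offers no setters, and getArray() allocates a new view per
// call, mirroring "always create a new instance in ColumnVector#getArray".
public class ColumnDemo {
    private final int[] data;

    public ColumnDemo(int[] data) { this.data = data; }

    public static final class IntArrayView {
        private final int[] backing;
        private final int offset;
        private final int length;

        IntArrayView(int[] backing, int offset, int length) {
            this.backing = backing;
            this.offset = offset;
            this.length = length;
        }

        public int get(int i) { return backing[offset + i]; }
        public int length() { return length; }
    }

    // A new instance each time: callers can hold on to a view safely.
    public IntArrayView getArray(int offset, int length) {
        return new IntArrayView(data, offset, length);
    }
}
```

The cost is one small allocation per call, which is why the PR description reports checking TPCDS for regressions.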
[GitHub] spark issue #19842: [SPARK-22643][SQL] ColumnarArray should be an immutable ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19842 cc @michal-databricks @hvanhovell @kiszk @gatorsmile
[GitHub] spark issue #19812: [SPARK-22598][CORE] ExecutorAllocationManager does not r...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19812 Did this failure ("For some reason, all of the 3 executors failed.") happen during task running or before task submission? Besides, if you're running on YARN, YARN will bring up new executors to meet the requirement, and `ExecutorAllocationManager` will also be notified of executor lost/register events. Can you please tell us how to reproduce your scenario?
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Merged build finished. Test PASSed.
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84281/ Test PASSed.
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19750 **[Test build #84281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84281/testReport)** for PR 19750 at commit [`81b3f3d`](https://github.com/apache/spark/commit/81b3f3dd223785e02ad2df2a6e40046a76539681). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19812: [SPARK-22598][CORE] ExecutorAllocationManager does not r...
Github user liutang123 commented on the issue: https://github.com/apache/spark/pull/19812 Hi @jerryshao , I modified the info of this PR.
[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19468
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19468 For future pull requests, can you create subtasks under https://issues.apache.org/jira/browse/SPARK-18278 ?
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19468 Thanks - merging in master!
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19840 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84280/ Test FAILed.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19840 Merged build finished. Test FAILed.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19840 **[Test build #84280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84280/testReport)** for PR 19840 at commit [`8ff5663`](https://github.com/apache/spark/commit/8ff5663fe9a32eae79c8ee6bc310409170a8da64). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19825: [SPARK-22615][SQL] Handle more cases in PropagateEmptyRe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19825 **[Test build #84284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84284/testReport)** for PR 19825 at commit [`c359fe3`](https://github.com/apache/spark/commit/c359fe33de8425f932c0d447de358e98c0d3eb84).
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19841 Can one of the admins verify this patch?
[GitHub] spark pull request #19841: [SPARK-22642][SQL] the createdTempDir will not be...
GitHub user zuotingbing opened a pull request: https://github.com/apache/spark/pull/19841 [SPARK-22642][SQL] the createdTempDir will not be deleted if an exception occurs ## What changes were proposed in this pull request? We found that staging directories are sometimes not dropped in our production environment. The createdTempDir will not be deleted if an exception occurs, so we should delete createdTempDir in a finally block. Refer to SPARK-18703. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zuotingbing/spark SPARK-stagedir Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19841.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19841 commit 99d1fd74b7f25a7ddbda4dd01a8b4c03b3da778a Author: zuotingbing Date: 2017-11-29T06:22:20Z [SPARK-22642][SQL] the createdTempDir will not be deleted if an exception occurs
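The fix described in the PR summary boils down to a try/finally pattern, which can be sketched like this (a minimal stand-in, not Spark's actual insert logic; `runJob` and the class name are hypothetical):

```java
import java.io.File;

// Sketch of "delete createdTempDir in finally": the staging directory is
// removed on both the success path and the exception path, so a failed job
// no longer leaks it.
public class StagingDirCleanup {
    public static void runWithCleanup(File createdTempDir, Runnable runJob) {
        try {
            runJob.run();
        } finally {
            // Runs even when runJob throws, which is the point of the change.
            if (createdTempDir.exists()) {
                createdTempDir.delete();
            }
        }
    }
}
```

Moving the cleanup out of the success-only path is exactly why the directory was previously left behind whenever an exception occurred mid-write.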
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19823 Merged build finished. Test FAILed.
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19823 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84283/ Test FAILed.
[GitHub] spark issue #19821: [WIP][SPARK-22608][SQL] add new API to CodeGeneration.sp...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19821 Sure, I have resolved the conflict in my environment. I will commit soon.
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19823 **[Test build #84283 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84283/testReport)** for PR 19823 at commit [`a60843c`](https://github.com/apache/spark/commit/a60843c12f197282ae046a96772cf12c548f43a9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84278/ Test PASSed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test PASSed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84278 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84278/testReport)** for PR 19651 at commit [`e13dfa3`](https://github.com/apache/spark/commit/e13dfa3600b1a1be3d659eb79ceb73b4078067f8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19834: [SPARK-22585][Core] Path in addJar is not url encoded
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19834 LGTM.
[GitHub] spark issue #18714: [SPARK-20236][SQL] runtime partition overwrite
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18714 ping, very interested in this.
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19805#discussion_r153693534 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -537,9 +536,55 @@ class Dataset[T] private[sql]( */ @Experimental @InterfaceStability.Evolving - def checkpoint(eager: Boolean): Dataset[T] = { + def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager) + + /** + * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be + * used to truncate the logical plan of this Dataset, which is especially useful in iterative + * algorithms where the plan may grow exponentially. Local checkpoints are written to executor + * storage and despite potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true) + + /** + * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate + * the logical plan of this Dataset, which is especially useful in iterative algorithms where the + * plan may grow exponentially. Local checkpoints are written to executor storage and despite + * potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true) --- End diff -- ditto.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84279/ Test FAILed.
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19805#discussion_r153693511 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -537,9 +536,55 @@ class Dataset[T] private[sql]( */ @Experimental @InterfaceStability.Evolving - def checkpoint(eager: Boolean): Dataset[T] = { + def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager) --- End diff -- We don't need the default value for `eager` here.
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19805#discussion_r153693403 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -518,13 +518,12 @@ class Dataset[T] private[sql]( * the logical plan of this Dataset, which is especially useful in iterative algorithms where the * plan may grow exponentially. It will be saved to files inside the checkpoint * directory set with `SparkContext#setCheckpointDir`. - --- End diff -- Maybe we should revert this back?
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19805#discussion_r153694085 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -537,9 +536,55 @@ class Dataset[T] private[sql]( */ @Experimental @InterfaceStability.Evolving - def checkpoint(eager: Boolean): Dataset[T] = { + def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager) + + /** + * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be + * used to truncate the logical plan of this Dataset, which is especially useful in iterative + * algorithms where the plan may grow exponentially. Local checkpoints are written to executor + * storage and despite potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true) + + /** + * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate + * the logical plan of this Dataset, which is especially useful in iterative algorithms where the + * plan may grow exponentially. Local checkpoints are written to executor storage and despite + * potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true) + + /** + * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the + * logical plan of this Dataset, which is especially useful in iterative algorithms where the + * plan may grow exponentially. + * By default reliable checkpoints are created and saved to files inside the checkpoint + * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint + * is performed instead. Local checkpoints are written to executor storage and despite + * potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = { --- End diff -- We should make this `private`?
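The API shape the reviewer is pointing at in these comments can be sketched with plain overloads (Java here for a self-contained example; the real API is Scala, and `CheckpointDemo` is a hypothetical stand-in for `Dataset`): with an explicit no-arg overload delegating to the one-argument version, a default value on `eager` is redundant, and the shared `_checkpoint` helper can stay private.

```java
// Sketch of the overload structure under review: public no-arg overloads
// supply the "eager = true" default, and the shared helper is private.
public class CheckpointDemo {
    public String checkpoint() { return checkpoint(true); }
    public String checkpoint(boolean eager) { return doCheckpoint(eager, false); }

    public String localCheckpoint() { return localCheckpoint(true); }
    public String localCheckpoint(boolean eager) { return doCheckpoint(eager, true); }

    // Kept private, matching the comment "We should make this `private`?".
    private String doCheckpoint(boolean eager, boolean local) {
        return (local ? "local" : "reliable") + (eager ? "-eager" : "-lazy");
    }
}
```

Keeping defaults in the overloads rather than in parameter defaults also avoids the ambiguity of offering both a no-arg method and a defaulted parameter for the same call site.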
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test FAILed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84277/testReport)** for PR 19651 at commit [`decaa50`](https://github.com/apache/spark/commit/decaa50be159ecf6b87dfb68cbe44bec91eaea9e). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` trait Converter ` * ` trait OrcDataUpdater ` * ` final class RowUpdater(row: InternalRow, i: Int) extends OrcDataUpdater ` * ` final class ArrayDataUpdater(updater: OrcDataUpdater) extends OrcDataUpdater ` * ` final class MapDataUpdater(`
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84277/ Test PASSed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84279/testReport)** for PR 19651 at commit [`fdab6a7`](https://github.com/apache/spark/commit/fdab6a7eb3acc63f78889fd31f8f078fca66aa0f). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test PASSed.
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19823 **[Test build #84283 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84283/testReport)** for PR 19823 at commit [`a60843c`](https://github.com/apache/spark/commit/a60843c12f197282ae046a96772cf12c548f43a9).
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19823 retest this please
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/19823#discussion_r153693386 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } } +test("load command invalid path validation ") { --- End diff -- fixed. thanks
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/19823#discussion_r153693359 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } } +test("load command invalid path validation ") { --- End diff -- updated it. In my local build it was working fine. Thanks for the feedback
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19805 @ferdonline Could you file a JIRA issue and add the id to the title like `[SPARK-xxx][PYTHON][SQL] ...`?
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19805 **[Test build #84282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84282/testReport)** for PR 19805 at commit [`59b5562`](https://github.com/apache/spark/commit/59b5562d1cfe738dc991ef17afad71abd891195d).
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19805 ok to test
[GitHub] spark pull request #19792: [SPARK-22566][PYTHON] Better error message for `_...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19792#discussion_r153692220 --- Diff: python/pyspark/sql/types.py --- @@ -1108,19 +1109,23 @@ def _has_nulltype(dt): return isinstance(dt, NullType) -def _merge_type(a, b): +def _merge_type(a, b, path=''): --- End diff -- Well, can you follow the #18521 format for now? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
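The `_merge_type(a, b, path='')` change under review threads a field path through the recursive merge so that a type conflict can report *where* it occurred. A minimal sketch of that idea, not PySpark's actual implementation (schemas here are simplified to dicts and type-name strings):

```python
def merge_type(a, b, path=""):
    """Merge two schema descriptions; on conflict, raise with the field path.

    A schema is either a primitive type name (str) or a dict mapping
    field name -> schema, standing in for a struct type.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for name, b_type in b.items():
            # Extend the path as we descend into nested fields.
            child = f"{path}.{name}" if path else name
            merged[name] = merge_type(a[name], b_type, child) if name in a else b_type
        return merged
    if a == b:
        return a
    # The accumulated path makes the error message actionable.
    raise TypeError(f"Cannot merge {a!r} and {b!r} at field '{path or '<root>'}'")
```

With this, merging `{"x": {"y": "int"}}` against `{"x": {"y": "string"}}` fails with an error naming `x.y` instead of just the two incompatible types.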
[GitHub] spark issue #19833: [SPARK-22605][SQL] SQL write job should also set Spark t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19833 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19833: [SPARK-22605][SQL] SQL write job should also set Spark t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19833 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84276/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19833: [SPARK-22605][SQL] SQL write job should also set Spark t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19833 **[Test build #84276 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84276/testReport)** for PR 19833 at commit [`be1c62e`](https://github.com/apache/spark/commit/be1c62e5b16a8b4a7bb177be5a5042c9006d2903). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19835: [SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups and tes...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/19835 the changes look good to me, the extra verification logic for arguments is a great addition --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19750 **[Test build #84281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84281/testReport)** for PR 19750 at commit [`81b3f3d`](https://github.com/apache/spark/commit/81b3f3dd223785e02ad2df2a6e40046a76539681). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19835: [SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups ...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/19835#discussion_r153689403 --- Diff: python/pyspark/ml/image.py --- @@ -146,7 +163,12 @@ def toImage(self, array, origin=""): mode = ocvTypes["CV_8UC4"] else: raise ValueError("Invalid number of channels") -data = bytearray(array.astype(dtype=np.uint8).ravel()) + +# Running `bytearray(numpy.array([1]))` fails in specific Python versions +# with a specific Numpy version, for example in Python 3.6.0 and NumPy 1.13.3. +# Here, it avoids it by converting it to bytes. +data = bytearray(array.astype(dtype=np.uint8).ravel().tobytes()) --- End diff -- strange, but the comment explains the issue well and I think this is a good workaround --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
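The workaround in the diff above converts the NumPy array to bytes before wrapping it in a `bytearray`, because `bytearray(numpy_array)` fails on some Python/NumPy combinations (e.g. Python 3.6.0 with NumPy 1.13.3). The same pattern, sketched with the stdlib `array` module standing in for a NumPy `uint8` array:

```python
import array

# Unsigned 8-bit values, as image channel data would be after
# array.astype(dtype=np.uint8).ravel() in the PR.
pixels = array.array("B", [0, 127, 255])

# Instead of bytearray(pixels), which relies on buffer-protocol
# behavior that varied across interpreter/library versions, go through
# an explicit bytes conversion first:
data = bytearray(pixels.tobytes())
```

The explicit `.tobytes()` step makes the intent unambiguous and sidesteps the version-specific `bytearray(...)` constructor behavior.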
[GitHub] spark pull request #19835: [SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups ...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/19835#discussion_r153689356 --- Diff: python/pyspark/ml/tests.py --- @@ -1836,6 +1836,24 @@ def test_read_images(self): self.assertEqual(ImageSchema.imageFields, expected) self.assertEqual(ImageSchema.undefinedImageType, "Undefined") +with QuietTest(self.sc): --- End diff -- nice tests! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19750 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19833: [SPARK-22605][SQL] SQL write job should also set ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19833#discussion_r153687993 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala --- @@ -106,6 +105,13 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration) override def getFinalStats(): WriteTaskStats = { statCurrentFile() + +// Reports bytesWritten and recordsWritten to the Spark output metrics. +Option(TaskContext.get()).map(_.taskMetrics().outputMetrics).foreach { outputMetrics => --- End diff -- Which test suite covers this newly added logic? Previously, BasicWriteTaskStatsTrackerSuite seems to fail (8 of 10 test cases) because TaskContext is absent in the test suite. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec in execut...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19840 **[Test build #84280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84280/testReport)** for PR 19840 at commit [`8ff5663`](https://github.com/apache/spark/commit/8ff5663fe9a32eae79c8ee6bc310409170a8da64). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19840: [SPARK-22640][PYSPARK][YARN]switch python exec in...
GitHub user yaooqinn opened a pull request: https://github.com/apache/spark/pull/19840 [SPARK-22640][PYSPARK][YARN] Switch python exec in executor side ## What changes were proposed in this pull request? ``` PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python \ bin/spark-submit --master yarn --deploy-mode client \ --archives ~/anaconda3/envs/py3.zip \ --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \ --conf spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \ /home/hadoop/data/apache-spark/spark-2.1.2-bin-hadoop2.7/examples/src/main/python/mllib/correlations_example.py ``` In the case above, I created a Python environment, delivered it via `--archives`, then accessed it on the executor node via `spark.executorEnv.PYSPARK_PYTHON`. But the executor seemed to use `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python` instead of `spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python`, so the application ended with an IOException. This PR aims to switch the Python executable when the user specifies it. ## How was this patch tested? Manually verified with the case above. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yaooqinn/spark SPARK-22640 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19840.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19840 commit 8ff5663fe9a32eae79c8ee6bc310409170a8da64 Author: Kent Yao, Date: 2017-11-29T03:26:47Z switch python exec in executor side --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
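The resolution order SPARK-22640 argues for can be sketched as follows: a `spark.executorEnv.PYSPARK_PYTHON` setting shipped with the job should win over whatever `PYSPARK_PYTHON` happens to be set on the executor host. This is an illustrative sketch of that precedence, not Spark's actual resolution code:

```python
def resolve_executor_python(spark_conf, host_env):
    """Pick the Python executable an executor should launch.

    Precedence (per the PR's intent): job-level executorEnv setting,
    then the host's PYSPARK_PYTHON, then a plain `python` fallback.
    """
    return (
        spark_conf.get("spark.executorEnv.PYSPARK_PYTHON")
        or host_env.get("PYSPARK_PYTHON")
        or "python"
    )
```

With the submit command from the PR description, the archived interpreter (`py3.zip/py3/bin/python`) would be chosen even though the host exports its own `PYSPARK_PYTHON`.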
[GitHub] spark pull request #19836: [SPARK-22618][CORE] Catch exception in removeRDD ...
Github user brad-kaiser commented on a diff in the pull request: https://github.com/apache/spark/pull/19836#discussion_r153686556 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala --- @@ -159,11 +160,18 @@ class BlockManagerMasterEndpoint( // Ask the slaves to remove the RDD, and put the result in a sequence of Futures. // The dispatcher is used as an implicit argument into the Future sequence construction. val removeMsg = RemoveRdd(rddId) -Future.sequence( - blockManagerInfo.values.map { bm => -bm.slaveEndpoint.ask[Int](removeMsg) - }.toSeq -) + +val handleRemoveRddException: PartialFunction[Throwable, Int] = { + case e: IOException => +logError(s"Error trying to remove rdd $rddId", e) --- End diff -- Thanks for looking at my change. I changed the log to warning and "rdd" to "RDD". I had pulled the partial function out because I felt the expression was getting too deeply nested and hard to read. I certainly don't have to do that though. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
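The fix under review recovers from a single peer's `IOException` (logging it and contributing `0` removed blocks) instead of letting one failure break the whole `Future.sequence`. A sketch of the same idea, with Python's `concurrent.futures` standing in for Spark's RPC futures:

```python
import concurrent.futures
import logging


def remove_rdd_everywhere(endpoints, rdd_id):
    """Ask every endpoint to remove an RDD; tolerate per-endpoint I/O errors."""

    def safe_remove(endpoint):
        try:
            return endpoint(rdd_id)  # number of blocks removed on this peer
        except IOError as e:
            # Recover, like the PartialFunction[Throwable, Int] in the diff.
            logging.warning("Error trying to remove RDD %s: %s", rdd_id, e)
            return 0

    with concurrent.futures.ThreadPoolExecutor() as pool:
        # pool.map preserves input order, like Future.sequence over a Seq.
        return list(pool.map(safe_remove, endpoints))
```

One flaky block manager no longer fails the entire removal; the caller still learns how many blocks each healthy peer removed.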
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153686107 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin/bucket into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin/bucket into which a column value falls. + */ + + def findFirstBucketForValue(value: Double, histogram: Histogram): Int = { --- End diff -- Shall we unify all names to `bin`/`bins` in code and comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
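The helper under review finds the first equi-height histogram bin containing a value. A runnable sketch of that lookup, assuming bins are described by a sorted list of boundaries (Spark's `Histogram` stores per-bin lo/hi endpoints; the boundary-list representation here is a simplification):

```python
import bisect


def find_first_bin_for_value(value, bin_bounds):
    """Return the 0-based index of the first bin containing `value`.

    bin_bounds: sorted boundaries [b0, b1, ..., bn] defining n bins
    [b0, b1], [b1, b2], ..., [b(n-1), bn]. A value equal to a shared
    boundary belongs to the *first* bin ending at that boundary.
    """
    if value < bin_bounds[0] or value > bin_bounds[-1]:
        raise ValueError(f"{value} is outside the histogram range")
    # bisect_left finds the first boundary >= value; stepping back one
    # gives the first bin whose range includes the value.
    i = bisect.bisect_left(bin_bounds, value)
    return max(i - 1, 0)
```

For boundaries `[0, 10, 20, 30]`, the value 10 falls into bin 0 (the first of the two bins sharing that boundary), which is the "first bin" semantics the Scaladoc describes.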
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84279/testReport)** for PR 19651 at commit [`fdab6a7`](https://github.com/apache/spark/commit/fdab6a7eb3acc63f78889fd31f8f078fca66aa0f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19839: SPARK-22373 Bump Janino dependency version to fix thread...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19839 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19839: SPARK-22373 Bump Janino dependency version to fix...
GitHub user Victsm opened a pull request: https://github.com/apache/spark/pull/19839 SPARK-22373 Bump Janino dependency version to fix thread safety issue… … with Janino when compiling generated code. ## What changes were proposed in this pull request? Bump up the Janino dependency version to fix a thread safety issue when compiling generated code. ## How was this patch tested? Check https://issues.apache.org/jira/browse/SPARK-22373 for details. Converted part of the code in CodeGenerator into a standalone application, so the issue can be consistently reproduced locally. Verified that changing the Janino dependency version resolved this issue. You can merge this pull request into a Git repository by running: $ git pull https://github.com/Victsm/spark SPARK-22373 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19839.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19839 commit 5f54a89fddacad5f041fb35e46b7faba27c4789d Author: Min Shen, Date: 2017-11-29T03:14:51Z SPARK-22373 Bump Janino dependency version to fix thread safety issue with Janino when compiling generated code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84278 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84278/testReport)** for PR 19651 at commit [`e13dfa3`](https://github.com/apache/spark/commit/e13dfa3600b1a1be3d659eb79ceb73b4078067f8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r153683384 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala --- @@ -0,0 +1,304 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcDeserializer( +dataSchema: StructType, +requiredSchema: StructType, +missingColumnNames: Seq[String]) { + + private[this] val currentRow = new SpecificInternalRow(requiredSchema.map(_.dataType)) + + private[this] val length = requiredSchema.length + + private[this] val fieldConverters: Array[Converter] = requiredSchema.zipWithIndex.map { +case (f, ordinal) => + if (missingColumnNames.contains(f.name)) { +null + } else { +newConverter(f.dataType, new RowUpdater(currentRow, ordinal)) + } + }.toArray + + def deserialize(orcStruct: OrcStruct): InternalRow = { +val names = orcStruct.getSchema.getFieldNames +val fieldRefs = requiredSchema.map { f => + val name = f.name + if (missingColumnNames.contains(name)) { +null + } else { +if (names.contains(name)) { + orcStruct.getFieldValue(name) +} else { + orcStruct.getFieldValue("_col" + dataSchema.fieldIndex(name)) +} + } +}.toArray + +var i = 0 +while (i < length) { + val writable = fieldRefs(i) + if (writable == null) { +currentRow.setNullAt(i) + } else { +fieldConverters(i).set(writable) + } + i += 1 +} +currentRow + } + + private[this] def newConverter(dataType: DataType, updater: OrcDataUpdater): Converter = +dataType match { + case NullType => +new Converter { + override def set(value: Any): Unit = updater.setNullAt() +} + + case BooleanType => +new Converter { + override def set(value: Any): Unit = 
+updater.setBoolean(value.asInstanceOf[BooleanWritable].get) +} + + case ByteType => +new Converter { + override def set(value: Any): Unit = updater.setByte(value.asInstanceOf[ByteWritable].get) +} + + case ShortType => +new Converter { + override def set(value: Any): Unit = +updater.setShort(value.asInstanceOf[ShortWritable].get) +} + + case IntegerType => +new Converter { + override def set(value: Any): Unit = updater.setInt(value.asInstanceOf[IntWritable].get) +} + + case LongType => +new Converter { + override def set(value: Any): Unit = updater.setLong(value.asInstanceOf[LongWritable].get) +} + + case FloatType => +new Converter { + override def set(value: Any): Unit = +updater.setFloat(value.asInstanceOf[FloatWritable].get) +} + + case DoubleType => +new Converter { + override def set(value: Any): Unit = +updater.setDouble(value.asInstanceOf[DoubleWritable].get) +} + + case StringType => +new Converter {
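The (truncated) `OrcDeserializer` diff above builds one converter per required field up front, with `null` entries for missing columns, and then per row either nulls the slot out or applies the prebuilt converter. A small Python sketch of that shape, with plain functions standing in for the Scala `Converter` objects:

```python
def build_deserializer(required_fields, missing, converters):
    """Precompute per-field converters once; reuse them for every row.

    required_fields: ordered field names to produce.
    missing: set of field names absent from the data (emit None).
    converters: map of field name -> one-argument conversion function.
    """
    # Analogous to the fieldConverters array: None marks a missing column.
    field_converters = [
        None if name in missing else converters[name] for name in required_fields
    ]

    def deserialize(record):
        row = []
        for name, conv in zip(required_fields, field_converters):
            # Mirrors currentRow.setNullAt(i) vs fieldConverters(i).set(...).
            row.append(None if conv is None else conv(record[name]))
        return row

    return deserialize
```

The design point is the same as in the diff: the type dispatch (`dataType match { ... }`) happens once at construction, so the per-row loop is just an array walk.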
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r153682330 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala --- @@ -0,0 +1,304 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcDeserializer( +dataSchema: StructType, +requiredSchema: StructType, +missingColumnNames: Seq[String]) { + + private[this] val currentRow = new SpecificInternalRow(requiredSchema.map(_.dataType)) + + private[this] val length = requiredSchema.length + + private[this] val fieldConverters: Array[Converter] = requiredSchema.zipWithIndex.map { +case (f, ordinal) => + if (missingColumnNames.contains(f.name)) { +null + } else { +newConverter(f.dataType, new RowUpdater(currentRow, ordinal)) + } + }.toArray + + def deserialize(orcStruct: OrcStruct): InternalRow = { +val names = orcStruct.getSchema.getFieldNames +val fieldRefs = requiredSchema.map { f => + val name = f.name + if (missingColumnNames.contains(name)) { +null + } else { +if (names.contains(name)) { + orcStruct.getFieldValue(name) +} else { + orcStruct.getFieldValue("_col" + dataSchema.fieldIndex(name)) +} + } +}.toArray + +var i = 0 +while (i < length) { + val writable = fieldRefs(i) + if (writable == null) { +currentRow.setNullAt(i) + } else { +fieldConverters(i).set(writable) + } + i += 1 +} +currentRow + } + + private[this] def newConverter(dataType: DataType, updater: OrcDataUpdater): Converter = +dataType match { + case NullType => +new Converter { + override def set(value: Any): Unit = updater.setNullAt() +} + + case BooleanType => +new Converter { + override def set(value: Any): Unit = 
+updater.setBoolean(value.asInstanceOf[BooleanWritable].get) +} + + case ByteType => +new Converter { + override def set(value: Any): Unit = updater.setByte(value.asInstanceOf[ByteWritable].get) +} + + case ShortType => +new Converter { + override def set(value: Any): Unit = +updater.setShort(value.asInstanceOf[ShortWritable].get) +} + + case IntegerType => +new Converter { + override def set(value: Any): Unit = updater.setInt(value.asInstanceOf[IntWritable].get) +} + + case LongType => +new Converter { + override def set(value: Any): Unit = updater.setLong(value.asInstanceOf[LongWritable].get) +} + + case FloatType => +new Converter { + override def set(value: Any): Unit = +updater.setFloat(value.asInstanceOf[FloatWritable].get) +} + + case DoubleType => +new Converter { + override def set(value: Any): Unit = +updater.setDouble(value.asInstanceOf[DoubleWritable].get) +} + + case StringType => +new Converter {
[GitHub] spark pull request #19129: [SPARK-13656][SQL] Delete spark.sql.parquet.cache...
Github user zzl1787 commented on a diff in the pull request: https://github.com/apache/spark/pull/19129#discussion_r153682329 --- Diff: docs/sql-programming-guide.md --- @@ -1587,6 +1580,10 @@ options. Note that this is different from the Hive behavior. - As a result, `DROP TABLE` statements on those tables will not remove the data. + - From Spark 2.0.1, `spark.sql.parquet.cacheMetadata` is no longer used. See + [SPARK-16321](https://issues.apache.org/jira/browse/SPARK-16321) and + [SPARK-15639](https://issues.apache.org/jira/browse/SPARK-15639) for details. --- End diff -- @dongjoon-hyun Ok, got it, and thank you. I finally found the parameter to control this: `spark.sql.filesourceTableRelationCacheSize = 0` This will disable the metadata cache. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r153682051 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala --- @@ -0,0 +1,226 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.datasources.orc.OrcUtils.withNullSafe +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcDeserializer( +dataSchema: StructType, +requiredSchema: StructType, +missingColumnNames: Seq[String]) { + + private[this] val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType)) + + private[this] val length = requiredSchema.length + + private[this] val unwrappers = requiredSchema.map { f => +if (missingColumnNames.contains(f.name)) { + (value: Any, row: InternalRow, ordinal: Int) => row.setNullAt(ordinal) +} else { + unwrapperFor(f.dataType) +} + }.toArray + + def deserialize(orcStruct: OrcStruct): InternalRow = { +val names = orcStruct.getSchema.getFieldNames +val fieldRefs = requiredSchema.map { f => + val name = f.name + if (missingColumnNames.contains(name)) { +null + } else { +if (names.contains(name)) { + orcStruct.getFieldValue(name) +} else { + orcStruct.getFieldValue("_col" + dataSchema.fieldIndex(name)) --- End diff -- `OrcStruct` will raise Exception here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19714#discussion_r153680715 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -153,6 +151,27 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { // --- BroadcastHashJoin + case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right) +if canBuildRight(joinType) && canBuildLeft(joinType) + && left.stats.hints.broadcast && right.stats.hints.broadcast => --- End diff -- I think we can create new methods for it ``` def shouldBuildLeft(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): Boolean = { if (left.stats.hints.broadcast) { if (canBuildRight(joinType) && right.stats.hints.broadcast) { // if both sides have the broadcast hint, only broadcast the left side if its estimated physical size is smaller than the right side left.stats.sizeInBytes <= right.stats.sizeInBytes } else { // if only the left side has the broadcast hint, broadcast the left side true } } else { if (canBuildRight(joinType) && right.stats.hints.broadcast) { // if only the right side has the broadcast hint, do not broadcast the left side false } else { // if neither side has a broadcast hint, only broadcast the left side if its estimated physical size is smaller than the threshold and smaller than the right side canBroadcast(left) && left.stats.sizeInBytes <= right.stats.sizeInBytes } } } def shouldBuildRight... ``` and use it like ``` case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right) if canBuildRight(joinType) && shouldBuildRight(joinType, left, right) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
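The Scala pseudocode in the review comment above is a decision table over hints and estimated sizes. The same table as a runnable sketch (sizes in bytes; function name and parameters are illustrative, not Spark's actual method signatures):

```python
def should_build_left(left_size, right_size, left_hint, right_hint,
                      can_build_right, threshold):
    """Decide whether the left side should be the broadcast build side."""
    if left_hint:
        if can_build_right and right_hint:
            # Both sides hinted: broadcast the smaller estimated side.
            return left_size <= right_size
        return True  # only the left side is hinted
    if can_build_right and right_hint:
        return False  # only the right side is hinted
    # No hints: left must be under the auto-broadcast threshold
    # and be the smaller of the two sides.
    return left_size <= threshold and left_size <= right_size
```

A symmetric `should_build_right` would mirror this, which is exactly the `shouldBuildLeft`/`shouldBuildRight` pair the comment proposes.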
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19714#discussion_r153679188 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -91,10 +91,10 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { * predicates can be evaluated by matching join keys. If found, Join implementations are chosen * with the following precedence: * - * - Broadcast: if one side of the join has an estimated physical size that is smaller than the - * user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold - * or if that side has an explicit broadcast hint (e.g. the user applied the - * [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side + * - Broadcast: if one side of the join has an explicit broadcast hint (e.g. the user applied the --- End diff -- ``` Broadcast: We prefer to broadcast the join side with an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame). If both sides have the broadcast hint, we prefer to broadcast the side with a smaller estimated physical size. If neither side has the broadcast hint, we only broadcast the join side if its estimated physical size is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-18278] [Submission] Spark on Kubernetes - ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r153678912 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.util.concurrent.TimeUnit + +import org.apache.spark.{SPARK_VERSION => sparkVersion} +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.ConfigBuilder +import org.apache.spark.network.util.ByteUnit + +private[spark] object Config extends Logging { + + val KUBERNETES_NAMESPACE = +ConfigBuilder("spark.kubernetes.namespace") + .doc("The namespace that will be used for running the driver and executor pods. When using " + +"spark-submit in cluster mode, this can also be passed to spark-submit via the " + +"--kubernetes-namespace command line argument.") + .stringConf + .createWithDefault("default") + + val DRIVER_DOCKER_IMAGE = +ConfigBuilder("spark.kubernetes.driver.docker.image") + .doc("Docker image to use for the driver. 
Specify this using the standard Docker tag format.") + .stringConf + .createWithDefault(s"spark-driver:$sparkVersion") + + val EXECUTOR_DOCKER_IMAGE = +ConfigBuilder("spark.kubernetes.executor.docker.image") + .doc("Docker image to use for the executors. Specify this using the standard Docker tag " + +"format.") + .stringConf + .createWithDefault(s"spark-executor:$sparkVersion") + + val DOCKER_IMAGE_PULL_POLICY = +ConfigBuilder("spark.kubernetes.docker.image.pullPolicy") + .doc("Docker image pull policy when pulling any docker image in Kubernetes integration") + .stringConf + .createWithDefault("IfNotPresent") + + + val KUBERNETES_AUTH_DRIVER_CONF_PREFIX = + "spark.kubernetes.authenticate.driver" + val KUBERNETES_AUTH_DRIVER_MOUNTED_CONF_PREFIX = + "spark.kubernetes.authenticate.driver.mounted" + val OAUTH_TOKEN_CONF_SUFFIX = "oauthToken" + val OAUTH_TOKEN_FILE_CONF_SUFFIX = "oauthTokenFile" + val CLIENT_KEY_FILE_CONF_SUFFIX = "clientKeyFile" + val CLIENT_CERT_FILE_CONF_SUFFIX = "clientCertFile" + val CA_CERT_FILE_CONF_SUFFIX = "caCertFile" + + val KUBERNETES_SERVICE_ACCOUNT_NAME = + ConfigBuilder(s"$KUBERNETES_AUTH_DRIVER_CONF_PREFIX.serviceAccountName") + .doc("Service account that is used when running the driver pod. The driver pod uses " + +"this service account when requesting executor pods from the API server. 
If specific " + +"credentials are given for the driver pod to use, the driver will favor " + +"using those credentials instead.") + .stringConf + .createOptional + + val KUBERNETES_DRIVER_LIMIT_CORES = +ConfigBuilder("spark.kubernetes.driver.limit.cores") + .doc("Specify the hard cpu limit for the driver pod") + .stringConf + .createOptional + + val KUBERNETES_EXECUTOR_LIMIT_CORES = +ConfigBuilder("spark.kubernetes.executor.limit.cores") + .doc("Specify the hard cpu limit for a single executor pod") + .stringConf + .createOptional + + val KUBERNETES_DRIVER_MEMORY_OVERHEAD = +ConfigBuilder("spark.kubernetes.driver.memoryOverhead") + .doc("The amount of off-heap memory (in megabytes) to be allocated for the driver and the " + +"driver submission server. This is memory that accounts for things like VM overheads, " + +"interned strings, other native overheads, etc. This tends to grow with the driver's " + +"memory size (typically 6-10%).") + .bytesConf(ByteUnit.MiB) + .createOptional + + // Note that while we set a default for this when we start up the + // scheduler, the specific default value is
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19823#discussion_r153678875 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } } +test("load command invalid path validation ") { --- End diff -- nit: insert an empty line before the test
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84277/testReport)** for PR 19651 at commit [`decaa50`](https://github.com/apache/spark/commit/decaa50be159ecf6b87dfb68cbe44bec91eaea9e).
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19823#discussion_r153678491 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } } +test("load command invalid path validation ") { --- End diff -- @sujith71955 according to Jenkins, there's a trailing whitespace at the end of the line
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19831 Besides, if the size stats `totalSize` or `rawDataSize` are wrong, the problem also exists whether CBO is enabled or not. We should reflect that in the title too.
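To see why wrong size stats matter even without CBO: the planner's broadcast decision compares an estimated plan size against `spark.sql.autoBroadcastJoinThreshold`, so an understated `totalSize` can flip the join strategy. A minimal sketch of that decision rule (the function name and simplified logic are illustrative, not Spark's actual planner code):

```python
# Illustrative sketch of a size-threshold broadcast decision.
# A table whose stored size stat wrongly reports a few kilobytes would be
# broadcast even if its real data is huge.

DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, Spark's default

def choose_join_strategy(left_size_bytes, right_size_bytes,
                         threshold=DEFAULT_AUTO_BROADCAST_THRESHOLD):
    """Pick a join strategy from estimated plan sizes (in bytes)."""
    if right_size_bytes <= threshold:
        return "broadcast-right"   # right side small enough to ship to all executors
    if left_size_bytes <= threshold:
        return "broadcast-left"
    return "sort-merge"            # both sides too large: fall back

print(choose_join_strategy(500 * 1024 * 1024, 4 * 1024))          # broadcast-right
print(choose_join_strategy(500 * 1024 * 1024, 50 * 1024 * 1024))  # sort-merge
```

The sketch shows why the bug is independent of CBO: only the size estimate feeds the threshold check, so a corrupted stat misleads the planner either way.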
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19714#discussion_r153677941 --- Diff: docs/sql-programming-guide.md --- @@ -1492,6 +1492,61 @@ that these options will be deprecated in future release as more optimizations ar +## Broadcast Hint for SQL Queries + +Broadcast hint is a way for users to manually annotate a query and suggest to the query optimizer the join method. +It is very useful when the query optimizer cannot make optimal decision with respect to join methods +due to conservativeness or the lack of proper statistics. +Spark Broadcast Hint has higher priority than autoBroadcastJoin mechanism, examples: + + + + + +{% highlight scala %} +val src = sql("SELECT * FROM src") +broadcast(src).join(recordsDF, Seq("key")).show() --- End diff -- a more standard way: ``` import org.apache.spark.sql.functions.broadcast broadcast(spark.table("src")).join(spark.table("records"), Seq("key")).show() ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19831 BTW, the case here is not about join reorder; it's actually about the broadcast decision. Could you update the title of this PR?
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19714#discussion_r153677668 --- Diff: docs/sql-programming-guide.md --- @@ -1492,6 +1492,61 @@ that these options will be deprecated in future release as more optimizations ar +## Broadcast Hint for SQL Queries + +Broadcast hint is a way for users to manually annotate a query and suggest to the query optimizer the join method. +It is very useful when the query optimizer cannot make optimal decision with respect to join methods +due to conservativeness or the lack of proper statistics. +Spark Broadcast Hint has higher priority than autoBroadcastJoin mechanism, examples: + + + + + +{% highlight scala %} +val src = sql("SELECT * FROM src") +broadcast(src).join(recordsDF, Seq("key")).show() +{% endhighlight %} + + + + + +{% highlight java %} +Dataset src = sql("SELECT * FROM src"); +broadcast(src).join(recordsDF, Seq("key")).show(); +{% endhighlight %} + + + + + +{% highlight python %} +src = spark.sql("SELECT * FROM src") +recordsDF.join(broadcast(src), "key").show() +{% endhighlight %} + + + + + +{% highlight r %} +src <- sql("SELECT COUNT(*) FROM src") +showDF(join(broadcast(src), recordsDF, src$key == recordsDF$key))) +{% endhighlight %} + + + + + +{% highlight sql %} +SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key +{% endhighlight %} + + + +(Note that we accept `BROADCAST`, `BROADCASTJOIN` and `MAPJOIN` for broadcast hint) --- End diff -- shall we treat it as a comment on the SQL example? ``` --we accept BROADCAST, BROADCASTJOIN and MAPJOIN for broadcast hint SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19831#discussion_r153677676 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -418,7 +418,7 @@ private[hive] class HiveClientImpl( // Note that this statistics could be overridden by Spark's statistics if that's available. val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_)) val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_)) - val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0) + val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0) --- End diff -- The root problem is that user can set "wrong" table properties. So if we want to prevent using wrong stats, we need to detect changes in properties. Otherwise your case can't be avoided. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19831#discussion_r153677300 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -418,7 +418,7 @@ private[hive] class HiveClientImpl( // Note that this statistics could be overridden by Spark's statistics if that's available. val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_)) val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_)) - val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0) + val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0) --- End diff -- Hive has a flag called `StatsSetupConst.COLUMN_STATS_ACCURATE`. If I remember correctly, this flag will become **false** if user changes table properties or table data. Can you check if the flag exists in your case? If so, we can use the flag to decide whether to read statistics from Hive. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
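The suggestion above is to trust Hive-sourced stats only when Hive's `COLUMN_STATS_ACCURATE` flag says they are accurate. A rough sketch of that gating, working on a plain table-properties dict (the function name is hypothetical; newer Hive versions store the flag as a JSON blob like `{"BASIC_STATS":"true"}`, older ones as a bare `"true"`/`"false"` string):

```python
import json

def read_hive_stats(properties):
    """Return (total_size, row_count) from a Hive table-properties dict,
    or (None, None) when the accuracy flag is absent or false."""
    flag = properties.get("COLUMN_STATS_ACCURATE", "false")
    try:
        # Newer Hive: a JSON object such as {"BASIC_STATS":"true"}.
        accurate = json.loads(flag).get("BASIC_STATS") == "true"
    except (ValueError, AttributeError):
        # Older Hive: a plain "true"/"false" string.
        accurate = flag.lower() == "true"
    if not accurate:
        return None, None
    total_size = properties.get("totalSize")   # StatsSetupConst.TOTAL_SIZE
    row_count = properties.get("numRows")      # StatsSetupConst.ROW_COUNT
    return (int(total_size) if total_size else None,
            int(row_count) if row_count else None)
```

This is only a sketch of the idea under discussion, not what `HiveClientImpl` does today; the thread's open question is whether the flag is actually cleared in the reporter's scenario.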
[GitHub] spark issue #19833: [SPARK-22605][SQL] SQL write job should also set Spark t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19833 **[Test build #84276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84276/testReport)** for PR 19833 at commit [`be1c62e`](https://github.com/apache/spark/commit/be1c62e5b16a8b4a7bb177be5a5042c9006d2903). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19821: [WIP][SPARK-22608][SQL] add new API to CodeGeneration.sp...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19821 @kiszk Can you fix the conflict? Now we can add an intermediate version: ``` def splitExpressions( expressions: Seq[String], funcName: String, extraArguments: Seq[(String, String)]) ```
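The proposed `splitExpressions` groups generated statements into helper functions of bounded size (to stay under the JVM's 64 KB method limit) and emits calls to them, threading `extraArguments` through each helper. A loose Python analogue of that code-splitting idea (Spark's `CodeGenerator` emits Java source, and the names below are illustrative only):

```python
# Sketch: split a flat list of statement strings into small helper
# functions plus a call site, passing the same extra arguments to each.

def split_expressions(expressions, func_name, extra_arguments,
                      max_stmts_per_func=2):
    """Return (helper_defs, call_site)."""
    params = ", ".join(name for _type, name in extra_arguments)
    helpers, calls = [], []
    for i in range(0, len(expressions), max_stmts_per_func):
        chunk = expressions[i:i + max_stmts_per_func]
        name = f"{func_name}_{i // max_stmts_per_func}"
        body = "\n".join(f"    {stmt}" for stmt in chunk)
        helpers.append(f"def {name}({params}):\n{body}")
        calls.append(f"{name}({params})")
    return helpers, "\n".join(calls)

helpers, call_site = split_expressions(
    ["acc[0] += x", "acc[0] += y", "acc[0] += z"],
    "apply", [("list", "acc"), ("int", "x"), ("int", "y"), ("int", "z")])
# Three statements with a limit of 2 per function -> two helpers,
# apply_0 and apply_1, each taking (acc, x, y, z).
```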
[GitHub] spark pull request #19838: [SPARK-22638][SS]Use a separate query for Streami...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19838#discussion_r153674694 --- Diff: core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala --- @@ -87,7 +87,9 @@ private[spark] class LiveListenerBus(conf: SparkConf) { * of each other (each one uses a separate thread for delivering events), allowing slower * listeners to be somewhat isolated from others. */ - private def addToQueue(listener: SparkListenerInterface, queue: String): Unit = synchronized { + private[spark] def addToQueue( --- End diff -- Is it necessary to make this change?
[GitHub] spark pull request #19717: [SPARK-18278] [Submission] Spark on Kubernetes - ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r15367 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/Client.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.submit + +import java.util.{Collections, UUID} + +import scala.collection.JavaConverters._ +import scala.collection.mutable +import scala.util.control.NonFatal + +import io.fabric8.kubernetes.api.model._ +import io.fabric8.kubernetes.client.KubernetesClient + +import org.apache.spark.SparkConf +import org.apache.spark.deploy.SparkApplication +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.KubernetesClientFactory +import org.apache.spark.deploy.k8s.submit.steps.DriverConfigurationStep +import org.apache.spark.internal.Logging +import org.apache.spark.util.Utils + +/** + * Encapsulates arguments to the submission client. 
+ * + * @param mainAppResource the main application resource + * @param mainClass the main class of the application to run + * @param driverArgs arguments to the driver + */ +private[spark] case class ClientArguments( + mainAppResource: MainAppResource, + mainClass: String, + driverArgs: Array[String]) + +private[spark] object ClientArguments { + + def fromCommandLineArgs(args: Array[String]): ClientArguments = { +var mainAppResource: Option[MainAppResource] = None +var mainClass: Option[String] = None +val driverArgs = mutable.Buffer.empty[String] + +args.sliding(2, 2).toList.collect { + case Array("--primary-java-resource", primaryJavaResource: String) => +mainAppResource = Some(JavaMainAppResource(primaryJavaResource)) + case Array("--main-class", clazz: String) => +mainClass = Some(clazz) + case Array("--arg", arg: String) => +driverArgs += arg + case other => +val invalid = other.mkString(" ") +throw new RuntimeException(s"Unknown arguments: $invalid") +} + +require(mainAppResource.isDefined, + "Main app resource must be defined by --primary-java-resource.") +require(mainClass.isDefined, "Main class must be specified via --main-class") + +ClientArguments( + mainAppResource.get, + mainClass.get, + driverArgs.toArray) + } +} + +/** + * Submits a Spark application to run on Kubernetes by creating the driver pod and starting a + * watcher that monitors and logs the application status. Waits for the application to terminate if + * spark.kubernetes.submission.waitAppCompletion is true. 
+ * + * @param submissionSteps steps that collectively configure the driver + * @param submissionSparkConf the submission client Spark configuration + * @param kubernetesClient the client to talk to the Kubernetes API server + * @param waitForAppCompletion a flag indicating whether the client should wait for the application + * to complete + * @param appName the application name + * @param loggingPodStatusWatcher a watcher that monitors and logs the application status + */ +private[spark] class Client( +submissionSteps: Seq[DriverConfigurationStep], +submissionSparkConf: SparkConf, +kubernetesClient: KubernetesClient, +waitForAppCompletion: Boolean, +appName: String, +loggingPodStatusWatcher: LoggingPodStatusWatcher) extends Logging { + + private val driverJavaOptions = submissionSparkConf.get( +org.apache.spark.internal.config.DRIVER_JAVA_OPTIONS) + + /** +* Run command that initializes a DriverSpec that will be updated after each +* DriverConfigurationStep in the sequence that is passed in. The final
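The `ClientArguments.fromCommandLineArgs` parser in the quoted diff walks the argument array in flag/value pairs via `args.sliding(2, 2)`. The same pairwise scheme in plain Python, with a dict standing in for the `ClientArguments` case class (a sketch that, for simplicity, assumes an even number of arguments):

```python
# Pairwise CLI parsing mirroring the Scala args.sliding(2, 2) pattern.

def parse_client_args(args):
    main_resource, main_class, driver_args = None, None, []
    # zip the even-indexed flags with the odd-indexed values
    for flag, value in zip(args[::2], args[1::2]):
        if flag == "--primary-java-resource":
            main_resource = value
        elif flag == "--main-class":
            main_class = value
        elif flag == "--arg":
            driver_args.append(value)   # --arg may repeat; collect in order
        else:
            raise ValueError(f"Unknown arguments: {flag} {value}")
    if main_resource is None:
        raise ValueError("Main app resource must be defined by --primary-java-resource.")
    if main_class is None:
        raise ValueError("Main class must be specified via --main-class")
    return {"mainAppResource": main_resource,
            "mainClass": main_class,
            "driverArgs": driver_args}
```

Usage, mirroring how the submission client is invoked: `parse_client_args(["--primary-java-resource", "local:///opt/app.jar", "--main-class", "org.example.Main", "--arg", "a"])`.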
[GitHub] spark issue #19813: [WIP][SPARK-22600][SQL] Fix 64kb limit for deeply nested...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test FAILed.
[GitHub] spark issue #19813: [WIP][SPARK-22600][SQL] Fix 64kb limit for deeply nested...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84275/ Test FAILed.
[GitHub] spark issue #19813: [WIP][SPARK-22600][SQL] Fix 64kb limit for deeply nested...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84275 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84275/testReport)** for PR 19813 at commit [`8c7f749`](https://github.com/apache/spark/commit/8c7f7496e610fdf4b512c57efd108ccf0238b126). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate query for StreamingQuery...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19838 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84274/ Test PASSed.
[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate query for StreamingQuery...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19838 Merged build finished. Test PASSed.
[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate query for StreamingQuery...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19838 **[Test build #84274 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84274/testReport)** for PR 19838 at commit [`60035fa`](https://github.com/apache/spark/commit/60035fa865fd85cc7e9441d2dc55d46693b16dee). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19468 LGTM, thanks for the awesome work!
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19594#discussion_r153665547 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala --- @@ -67,6 +68,205 @@ class JoinEstimationSuite extends StatsEstimationTestBase { rowCount = 2, attributeStats = AttributeMap(Seq("key-1-2", "key-2-3").map(nameToColInfo))) + private def estimateByHistogram( + histogram1: Histogram, + histogram2: Histogram, + expectedMin: Double, + expectedMax: Double, + expectedNdv: Long, + expectedRows: Long): Unit = { +val col1 = attr("key1") +val col2 = attr("key2") +val c1 = generateJoinChild(col1, histogram1, expectedMin, expectedMax) +val c2 = generateJoinChild(col2, histogram2, expectedMin, expectedMax) + +val c1JoinC2 = Join(c1, c2, Inner, Some(EqualTo(col1, col2))) +val c2JoinC1 = Join(c2, c1, Inner, Some(EqualTo(col2, col1))) +val expectedStatsAfterJoin = Statistics( + sizeInBytes = expectedRows * (8 + 2 * 4), + rowCount = Some(expectedRows), + attributeStats = AttributeMap(Seq( +col1 -> c1.stats.attributeStats(col1).copy( + distinctCount = expectedNdv, min = Some(expectedMin), max = Some(expectedMax)), +col2 -> c2.stats.attributeStats(col2).copy( + distinctCount = expectedNdv, min = Some(expectedMin), max = Some(expectedMax +) + +// Join order should not affect estimation result. 
+Seq(c1JoinC2, c2JoinC1).foreach { join => + assert(join.stats == expectedStatsAfterJoin) +} + } + + private def generateJoinChild( + col: Attribute, + histogram: Histogram, + expectedMin: Double, + expectedMax: Double): LogicalPlan = { +val colStat = inferColumnStat(histogram) +val t = StatsTestPlan( + outputList = Seq(col), + rowCount = (histogram.height * histogram.bins.length).toLong, + attributeStats = AttributeMap(Seq(col -> colStat))) + +val filterCondition = new ArrayBuffer[Expression]() +if (expectedMin > colStat.min.get.toString.toDouble) { + filterCondition += GreaterThanOrEqual(col, Literal(expectedMin)) +} +if (expectedMax < colStat.max.get.toString.toDouble) { + filterCondition += LessThanOrEqual(col, Literal(expectedMax)) +} +if (filterCondition.isEmpty) t else Filter(filterCondition.reduce(And), t) + } + + private def inferColumnStat(histogram: Histogram): ColumnStat = { +var ndv = 0L +for (i <- histogram.bins.indices) { + val bin = histogram.bins(i) + if (i == 0 || bin.hi != histogram.bins(i - 1).hi) { +ndv += bin.ndv + } +} +ColumnStat(distinctCount = ndv, min = Some(histogram.bins.head.lo), + max = Some(histogram.bins.last.hi), nullCount = 0, avgLen = 4, maxLen = 4, + histogram = Some(histogram)) + } + + test("equi-height histograms: a bin is contained by another one") { +val histogram1 = Histogram(height = 300, Array( + HistogramBin(lo = 10, hi = 30, ndv = 10), HistogramBin(lo = 30, hi = 60, ndv = 30))) +val histogram2 = Histogram(height = 100, Array( + HistogramBin(lo = 0, hi = 50, ndv = 50), HistogramBin(lo = 50, hi = 100, ndv = 40))) +// test bin trimming +val (t1, h1) = trimBin(histogram2.bins(0), height = 100, min = 10, max = 60) +assert(t1 == HistogramBin(lo = 10, hi = 50, ndv = 40) && h1 == 80) +val (t2, h2) = trimBin(histogram2.bins(1), height = 100, min = 10, max = 60) +assert(t2 == HistogramBin(lo = 50, hi = 60, ndv = 8) && h2 == 20) + +val expectedRanges = Seq( + OverlappedRange(10, 30, math.min(10, 40*1/2), math.max(10, 40*1/2), 300, 
80*1/2), + OverlappedRange(30, 50, math.min(30*2/3, 40*1/2), math.max(30*2/3, 40*1/2), 300*2/3, 80*1/2), + OverlappedRange(50, 60, math.min(30*1/3, 8), math.max(30*1/3, 8), 300*1/3, 20) +) +assert(expectedRanges.equals( + getOverlappedRanges(histogram1, histogram2, newMin = 10D, newMax = 60D))) + +estimateByHistogram( + histogram1 = histogram1, + histogram2 = histogram2, + expectedMin = 10D, + expectedMax = 60D, + // 10 + 20 + 8 + expectedNdv = 38L, + // 300*40/20 + 200*40/20 + 100*20/10 + expectedRows = 1200L) + } + + test("equi-height histograms: a bin has only one value") { +val histogram1 = Histogram(height = 300, Array( + HistogramBin(lo = 30, hi = 30, ndv = 1), HistogramBin(lo = 30, hi = 60, ndv = 30))) +val histogram2 = Histogram(height = 100, Array( +
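The `trimBin` assertions quoted in the test above follow from a simple rule: clip a bin to the join's `[min, max]` overlap and scale its `ndv` and height by the kept fraction, assuming values spread uniformly inside the bin. A pure-Python sketch reproducing the test's numbers (`trim_bin` here is an illustrative stand-in for the `trimBin` under review, not Spark's code):

```python
from collections import namedtuple

HistogramBin = namedtuple("HistogramBin", ["lo", "hi", "ndv"])

def trim_bin(b, height, lo, hi):
    """Clip bin `b` to [lo, hi]; return (trimmed_bin, trimmed_height)."""
    new_lo, new_hi = max(b.lo, lo), min(b.hi, hi)
    if b.hi == b.lo:
        # Single-value bin: it is either entirely kept or entirely dropped.
        ratio = 1.0 if lo <= b.lo <= hi else 0.0
    else:
        # Uniformity assumption: ndv and height shrink with the kept width.
        ratio = (new_hi - new_lo) / (b.hi - b.lo)
    return HistogramBin(new_lo, new_hi, round(b.ndv * ratio)), height * ratio

# Reproduces the assertions in the quoted test (histogram2, height 100,
# trimmed to the join range [10, 60]):
b1, h1 = trim_bin(HistogramBin(0, 50, 50), height=100, lo=10, hi=60)
# b1 == HistogramBin(10, 50, 40), h1 == 80.0
b2, h2 = trim_bin(HistogramBin(50, 100, 40), height=100, lo=10, hi=60)
# b2 == HistogramBin(50, 60, 8), h2 == 20.0
```

The trimmed bins then feed the overlapped-range computation, which pairs up overlapping bin segments from the two histograms to estimate the join's NDV and row count.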
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19594#discussion_r153665092 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala --- @@ -67,6 +68,205 @@ class JoinEstimationSuite extends StatsEstimationTestBase { rowCount = 2, attributeStats = AttributeMap(Seq("key-1-2", "key-2-3").map(nameToColInfo))) + private def estimateByHistogram( + histogram1: Histogram, + histogram2: Histogram, + expectedMin: Double, + expectedMax: Double, + expectedNdv: Long, + expectedRows: Long): Unit = { +val col1 = attr("key1") +val col2 = attr("key2") +val c1 = generateJoinChild(col1, histogram1, expectedMin, expectedMax) +val c2 = generateJoinChild(col2, histogram2, expectedMin, expectedMax) + +val c1JoinC2 = Join(c1, c2, Inner, Some(EqualTo(col1, col2))) +val c2JoinC1 = Join(c2, c1, Inner, Some(EqualTo(col2, col1))) +val expectedStatsAfterJoin = Statistics( + sizeInBytes = expectedRows * (8 + 2 * 4), + rowCount = Some(expectedRows), + attributeStats = AttributeMap(Seq( +col1 -> c1.stats.attributeStats(col1).copy( + distinctCount = expectedNdv, min = Some(expectedMin), max = Some(expectedMax)), +col2 -> c2.stats.attributeStats(col2).copy( + distinctCount = expectedNdv, min = Some(expectedMin), max = Some(expectedMax +) + +// Join order should not affect estimation result. 
+Seq(c1JoinC2, c2JoinC1).foreach { join => + assert(join.stats == expectedStatsAfterJoin) +} + } + + private def generateJoinChild( + col: Attribute, + histogram: Histogram, + expectedMin: Double, + expectedMax: Double): LogicalPlan = { +val colStat = inferColumnStat(histogram) +val t = StatsTestPlan( + outputList = Seq(col), + rowCount = (histogram.height * histogram.bins.length).toLong, + attributeStats = AttributeMap(Seq(col -> colStat))) + +val filterCondition = new ArrayBuffer[Expression]() +if (expectedMin > colStat.min.get.toString.toDouble) { + filterCondition += GreaterThanOrEqual(col, Literal(expectedMin)) +} +if (expectedMax < colStat.max.get.toString.toDouble) { + filterCondition += LessThanOrEqual(col, Literal(expectedMax)) +} +if (filterCondition.isEmpty) t else Filter(filterCondition.reduce(And), t) + } + + private def inferColumnStat(histogram: Histogram): ColumnStat = { +var ndv = 0L +for (i <- histogram.bins.indices) { + val bin = histogram.bins(i) + if (i == 0 || bin.hi != histogram.bins(i - 1).hi) { +ndv += bin.ndv + } +} +ColumnStat(distinctCount = ndv, min = Some(histogram.bins.head.lo), + max = Some(histogram.bins.last.hi), nullCount = 0, avgLen = 4, maxLen = 4, + histogram = Some(histogram)) + } + + test("equi-height histograms: a bin is contained by another one") { +val histogram1 = Histogram(height = 300, Array( + HistogramBin(lo = 10, hi = 30, ndv = 10), HistogramBin(lo = 30, hi = 60, ndv = 30))) +val histogram2 = Histogram(height = 100, Array( + HistogramBin(lo = 0, hi = 50, ndv = 50), HistogramBin(lo = 50, hi = 100, ndv = 40))) +// test bin trimming +val (t1, h1) = trimBin(histogram2.bins(0), height = 100, min = 10, max = 60) +assert(t1 == HistogramBin(lo = 10, hi = 50, ndv = 40) && h1 == 80) +val (t2, h2) = trimBin(histogram2.bins(1), height = 100, min = 10, max = 60) +assert(t2 == HistogramBin(lo = 50, hi = 60, ndv = 8) && h2 == 20) + +val expectedRanges = Seq( + OverlappedRange(10, 30, math.min(10, 40*1/2), math.max(10, 40*1/2), 300, 
80*1/2), + OverlappedRange(30, 50, math.min(30*2/3, 40*1/2), math.max(30*2/3, 40*1/2), 300*2/3, 80*1/2), + OverlappedRange(50, 60, math.min(30*1/3, 8), math.max(30*1/3, 8), 300*1/3, 20) +) +assert(expectedRanges.equals( + getOverlappedRanges(histogram1, histogram2, newMin = 10D, newMax = 60D))) + +estimateByHistogram( + histogram1 = histogram1, + histogram2 = histogram2, + expectedMin = 10D, + expectedMax = 60D, + // 10 + 20 + 8 + expectedNdv = 38L, + // 300*40/20 + 200*40/20 + 100*20/10 + expectedRows = 1200L) + } + + test("equi-height histograms: a bin has only one value") { +val histogram1 = Histogram(height = 300, Array( + HistogramBin(lo = 30, hi = 30, ndv = 1), HistogramBin(lo = 30, hi = 60, ndv = 30))) +val histogram2 = Histogram(height = 100, Array( +
[GitHub] spark issue #19837: [SPARK-22637][SQL] Only refresh a logical plan once.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19837 Merged build finished. Test PASSed.