[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19868 Can somebody explain to me what the PR description has to do with missingFiles? I'm probably missing something, but I feel the implementation is very different from the PR description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96211/ Test PASSed.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Merged build finished. Test PASSed.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22192 **[Test build #96211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96211/testReport)** for PR 22192 at commit [`447c5e5`](https://github.com/apache/spark/commit/447c5e5974ca2a176026e63518a7a6cf29b78008).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #22457: [SPARK-24626] Add statistics prefix to parallelFi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22457
[GitHub] spark pull request #22459: [SPARK-23173][SQL] rename spark.sql.fromJsonForce...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22459
[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22173 Merged build finished. Test FAILed.
[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22173 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96208/ Test FAILed.
[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22173 **[Test build #96208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96208/testReport)** for PR 22173 at commit [`0a43f22`](https://github.com/apache/spark/commit/0a43f223fb97da4fcf355dba945a3a350245c5f3).
* This patch **fails from timeout after a configured wait of `400m`**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22462 Merged build finished. Test PASSed.
[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22462 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3219/ Test PASSed.
[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22462 **[Test build #96225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96225/testReport)** for PR 22462 at commit [`d238a66`](https://github.com/apache/spark/commit/d238a66f0ebcfe0d93a14e6685d6294668b82953).
[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22462 cc @cloud-fan, this actually bugged me. Mind taking a look when you are available please?
[GitHub] spark pull request #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22462 [SPARK-25460][SS] DataSourceV2: SS sources do not respect SessionConfigSupport

## What changes were proposed in this pull request?

This PR proposes to respect `SessionConfigSupport` in SS datasources as well. Currently these configs are only respected in batch sources: https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L198-L203 https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L249

If a developer makes a datasource V2 that supports both structured streaming and batch jobs, batch jobs respect a specific configuration, let's say, a URL to connect to and fetch data from (which end users might not be aware of); however, structured streaming ends up not supporting this (the value must be set explicitly in the options).

## How was this patch tested?

Unit tests were added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25460

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22462.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22462

commit d238a66f0ebcfe0d93a14e6685d6294668b82953
Author: hyukjinkwon
Date: 2018-09-19T04:44:40Z

DataSourceV2: Structured Streaming does not respect SessionConfigSupport
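For context, the contract `SessionConfigSupport` provides can be sketched in plain Scala. This is an illustrative toy, not the Spark implementation: the `spark.datasource.<keyPrefix>.` naming convention is the documented one, but `SessionConfigSketch` and its signature are made up for the example.

~~~scala
// Toy model of SessionConfigSupport's propagation rule: session confs under
// "spark.datasource.<keyPrefix>." become data source options, and options the
// user passed explicitly take precedence over session-level values.
object SessionConfigSketch {
  def extract(
      sessionConfs: Map[String, String],
      keyPrefix: String,
      explicitOptions: Map[String, String]): Map[String, String] = {
    val prefix = s"spark.datasource.$keyPrefix."
    val fromSession = sessionConfs.collect {
      case (k, v) if k.startsWith(prefix) => k.stripPrefix(prefix) -> v
    }
    fromSession ++ explicitOptions // explicit options win
  }
}
~~~

With `keyPrefix = "mysource"`, a session conf `spark.datasource.mysource.url` would surface as the option `url`; the PR's point is that only the batch read/write paths performed this extraction, not the streaming ones.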
[GitHub] spark issue #22449: [SPARK-22666][ML][FOLLOW-UP] Return a correctly formatte...
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22449 @WeichenXu123 I think we should fix the test instead of removing "//" from the URI if the authority is empty, because both "scheme:/" and "scheme:///" are valid:

~~~scala
scala> val u1 = new URI("file:///a/b/c")
u1: java.net.URI = file:///a/b/c

scala> val u2 = new URI("file:/a/b/c")
u2: java.net.URI = file:/a/b/c

scala> u1 == u2
res1: Boolean = true
~~~

Shall we update the test? Instead of comparing the row record as a whole, we could compare its fields one by one and convert `origin` to `URI` before comparison.
[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22461 @gatorsmile Thanks a lot!
[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22461 **[Test build #96224 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96224/testReport)** for PR 22461 at commit [`fb7b9d2`](https://github.com/apache/spark/commit/fb7b9d22812f789b970f8dae8a9cd3763c0eccff).
[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22461 Jenkins, retest this please.
[GitHub] spark issue #22433: [SPARK-25442][SQL][K8S] Support STS to run in k8s deploy...
Github user suryag10 commented on the issue: https://github.com/apache/spark/pull/22433 @mridulm @liyinan926 @jacobdr @ifilonenko The code that handles spaces and "/" is already present at https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L259 I have reverted the fix in start-thriftserver.sh. Please review and merge.
[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22461 add to whitelist
[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22392
[GitHub] spark issue #22433: [SPARK-25442][SQL][K8S] Support STS to run in k8s deploy...
Github user suryag10 commented on the issue: https://github.com/apache/spark/pull/22433

> > Agreed with @mridulm that the naming restriction is specific to k8s and should be handled in a k8s specific way, e.g., somewhere around https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L208.
>
> Ok, will update the PR with the same.

Hi, handling of this conversion is already present in https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L259 I have reverted the change in the start-thriftserver.sh file. Please review and merge.
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22443 **[Test build #96223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96223/testReport)** for PR 22443 at commit [`075ef7a`](https://github.com/apache/spark/commit/075ef7a0696332c3b9d35ff1750ea7ade08e7c3d).
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22443 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3218/ Test PASSed.
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22443 Merged build finished. Test PASSed.
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22443 retest this please
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16677 I'm convinced. There are 2 major issues:
1. Abusing shuffle: we need a new mechanism for the driver to analyze some statistics about the data (records per map task).
2. Too many small tasks: we need a better algorithm to decide the parallelism of limit.
[GitHub] spark issue #22459: [SPARK-23173][SQL] rename spark.sql.fromJsonForceNullabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22459 Merged build finished. Test PASSed.
[GitHub] spark issue #22459: [SPARK-23173][SQL] rename spark.sql.fromJsonForceNullabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22459 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96210/ Test PASSed.
[GitHub] spark issue #22459: [SPARK-23173][SQL] rename spark.sql.fromJsonForceNullabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22459 **[Test build #96210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96210/testReport)** for PR 22459 at commit [`f6f427f`](https://github.com/apache/spark/commit/f6f427fd06a87274d6950550137ce0f551276ff6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96212/ Test FAILed.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Merged build finished. Test FAILed.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22192 **[Test build #96212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96212/testReport)** for PR 22192 at commit [`447c5e5`](https://github.com/apache/spark/commit/447c5e5974ca2a176026e63518a7a6cf29b78008).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22443 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96213/ Test FAILed.
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22443 Merged build finished. Test FAILed.
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22443 **[Test build #96213 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96213/testReport)** for PR 22443 at commit [`075ef7a`](https://github.com/apache/spark/commit/075ef7a0696332c3b9d35ff1750ea7ade08e7c3d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16677 OK, after thinking about it more, I think we should just revert all of these changes and go back to the drawing board. Here's why:

1. The PRs change some of the most common/core parts of Spark, and are not properly designed (as in they haven't gone through actual discussions; there's not even a doc on how they work). The PRs created a much more complicated implementation for limit / top-k. You might be able to justify the complexity with the perf improvements, but we had better write them down, discuss them, and make sure they are the right design choices. This is just a comment about the process, not the actual design.

2. Now onto the design; I am having issues with two major parts:

   2a. This PR really wanted an abstraction to buffer data, have the driver analyze some statistics about the data (records per map task), and then make decisions. Because Spark doesn't yet have that infrastructure, this PR just adds some hacks to shuffle to make it work. There is no proper abstraction here.

   2b. I'm not even sure the algorithm here is the right one. The PR tries to parallelize as much as possible by keeping the same number of tasks. IMO a simpler design that would work for more common cases is to buffer the data, get the records per map task, and create a new RDD with the first N partitions that reach the limit. That way, we don't launch too many tasks, and we retain ordering.

3. The PR implementation quality is poor. Variable names are confusing (output vs records); it's severely lacking documentation; the doc for the config option is arcane.

Sorry about all of the above, but we gotta do better.
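The simpler design sketched in point 2b can be illustrated independently of Spark. This is a hedged sketch of the idea only; `partitionsNeeded` is a made-up helper, not anything in the PR.

~~~scala
// Given records-per-map-task statistics, find how many leading partitions are
// needed to satisfy the limit, so only those partitions need to be launched
// as tasks (preserving partition order).
object PrefixLimit {
  def partitionsNeeded(recordsPerPartition: Seq[Long], limit: Long): Int = {
    var taken = 0L
    var n = 0
    // Scan partitions in order, accumulating record counts until the limit is met.
    while (n < recordsPerPartition.length && taken < limit) {
      taken += recordsPerPartition(n)
      n += 1
    }
    n
  }
}
~~~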
[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22448 **[Test build #96222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96222/testReport)** for PR 22448 at commit [`2304c7d`](https://github.com/apache/spark/commit/2304c7d759ee950785b81219b48d8a441741bd83).
[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22448 Merged build finished. Test PASSed.
[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22448 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3217/ Test PASSed.
[GitHub] spark pull request #22456: [SPARK-19355][SQL] Fix variable names numberOfOut...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22456#discussion_r218666270

--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -31,7 +31,7 @@ import org.apache.spark.util.Utils
 /**
  * Result returned by a ShuffleMapTask to a scheduler. Includes the block manager address that the
- * task ran on, the sizes of outputs for each reducer, and the number of outputs of the map task,
+ * task ran on, the sizes of outputs for each reducer, and the number of records of the map task,
--- End diff --

Size was about bytes, so it doesn't really matter whether it's a record or a row or a block. It's also already pointed out below that it's about bytes.
[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16677#discussion_r218665902

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -93,25 +96,93 @@ trait BaseLimitExec extends UnaryExecNode with CodegenSupport {
 }

 /**
- * Take the first `limit` elements of each child partition, but do not collect or shuffle them.
+ * Take the `limit` elements of the child output.
  */
-case class LocalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec {
+case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode {

-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+  override def output: Seq[Attribute] = child.output

   override def outputPartitioning: Partitioning = child.outputPartitioning
-}

-/**
- * Take the first `limit` elements of the child's single output partition.
- */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec {
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering

-  override def requiredChildDistribution: List[Distribution] = AllTuples :: Nil
+  private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)

-  override def outputPartitioning: Partitioning = child.outputPartitioning
+  protected override def doExecute(): RDD[InternalRow] = {
+    val childRDD = child.execute()
+    val partitioner = LocalPartitioning(childRDD)
+    val shuffleDependency = ShuffleExchangeExec.prepareShuffleDependency(
+      childRDD, child.output, partitioner, serializer)
+    val numberOfOutput: Seq[Long] = if (shuffleDependency.rdd.getNumPartitions != 0) {
+      // submitMapStage does not accept RDD with 0 partition.
+      // So, we will not submit this dependency.
+      val submittedStageFuture = sparkContext.submitMapStage(shuffleDependency)
+      submittedStageFuture.get().recordsByPartitionId.toSeq
+    } else {
+      Nil
+    }

+    // During global limit, try to evenly distribute limited rows across data
+    // partitions. If disabled, scanning data partitions sequentially until reaching limit number.
+    // Besides, if child output has certain ordering, we can't evenly pick up rows from
+    // each partition.
+    val flatGlobalLimit = sqlContext.conf.limitFlatGlobalLimit && child.outputOrdering == Nil
+
+    val shuffled = new ShuffledRowRDD(shuffleDependency)
+
+    val sumOfOutput = numberOfOutput.sum
+    if (sumOfOutput <= limit) {
+      shuffled
+    } else if (!flatGlobalLimit) {
+      var numRowTaken = 0
+      val takeAmounts = numberOfOutput.map { num =>
+        if (numRowTaken + num < limit) {
+          numRowTaken += num.toInt
+          num.toInt
+        } else {
+          val toTake = limit - numRowTaken
+          numRowTaken += toTake
+          toTake
+        }
+      }
+      val broadMap = sparkContext.broadcast(takeAmounts)
+      shuffled.mapPartitionsWithIndexInternal { case (index, iter) =>
+        iter.take(broadMap.value(index).toInt)
+      }
+    } else {
+      // We try to evenly require the asked limit number of rows across all child rdd's partitions.
+      var rowsNeedToTake: Long = limit
+      val takeAmountByPartition: Array[Long] = Array.fill[Long](numberOfOutput.length)(0L)
+      val remainingRowsByPartition: Array[Long] = Array(numberOfOutput: _*)
+
+      while (rowsNeedToTake > 0) {
+        val nonEmptyParts = remainingRowsByPartition.count(_ > 0)
+        // If the rows needed to take are less than the number of non-empty partitions, take one
+        // row from each non-empty partition until we reach `limit` rows.
+        // Otherwise, evenly divide the needed rows among the non-empty partitions.
+        val takePerPart = math.max(1, rowsNeedToTake / nonEmptyParts)
+        remainingRowsByPartition.zipWithIndex.foreach { case (num, index) =>
+          // In case `rowsNeedToTake` < `nonEmptyParts`, we may run out of `rowsNeedToTake` during
+          // the traversal, so we need to add this check.
+          if (rowsNeedToTake > 0 && num > 0) {
+            if (num >= takePerPart) {
+              rowsNeedToTake -= takePerPart
+              takeAmountByPartition(index) += takePerPart
+              remainingRowsByPartition(index) -= takePerPart
+            } else {
+              rowsNeedToTake -= num
+              takeAmountByPartition(index) += num
+              remainingRowsByPartition(index) -= num
+            }
+          }
+        }
+      }
+      val broadMap
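Stripped of the shuffle machinery, the even-distribution loop in the diff above amounts to the following standalone sketch (object and method names here are illustrative, not the PR's API):

~~~scala
// Given per-partition record counts and a global limit, compute how many rows
// to take from each partition so the limited rows are spread evenly across
// non-empty partitions, mirroring the while-loop in the diff.
object EvenLimit {
  def takeAmounts(counts: Array[Long], limit: Long): Array[Long] = {
    val take = Array.fill[Long](counts.length)(0L)
    val remaining = counts.clone()
    var rowsNeeded = math.min(limit, counts.sum) // guard against sum < limit
    while (rowsNeeded > 0) {
      val nonEmpty = remaining.count(_ > 0)
      // Evenly divide the still-needed rows over the non-empty partitions,
      // taking at least one row from each per round.
      val perPart = math.max(1, rowsNeeded / nonEmpty)
      remaining.indices.foreach { i =>
        if (rowsNeeded > 0 && remaining(i) > 0) {
          val t = math.min(perPart, math.min(remaining(i), rowsNeeded))
          take(i) += t
          remaining(i) -= t
          rowsNeeded -= t
        }
      }
    }
    take
  }
}
~~~

For example, with counts `[10, 10, 10]` and limit 9, each partition contributes 3 rows instead of the first partition contributing all 9.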
[GitHub] spark pull request #22448: [SPARK-25417][SQL] Improve findTightestCommonType...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22448#discussion_r218662840

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -106,6 +107,22 @@ object TypeCoercion {
     case (t1, t2) => findTypeForComplex(t1, t2, findTightestCommonType)
   }

+  /**
+   * Finds a wider decimal type between the two supplied decimal types without
+   * any loss of precision.
+   */
+  def findWiderDecimalType(d1: DecimalType, d2: DecimalType): Option[DecimalType] = {
+    val scale = max(d1.scale, d2.scale)
+    val range = max(d1.precision - d1.scale, d2.precision - d2.scale)
+
+    // Check the resultant decimal type does not exceed the allowable limits.
+    if (range + scale <= DecimalType.MAX_PRECISION && scale <= DecimalType.MAX_SCALE) {
--- End diff --

@maropu OK, I will remove this check.
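For reference, the widening rule in this hunk works out as follows on plain precision/scale pairs (a standalone sketch; `Dec` is a stand-in for Spark's `DecimalType`, and the MAX_PRECISION check discussed above is deliberately omitted):

~~~scala
// The wider decimal keeps the larger scale and the larger integral range
// (digits left of the decimal point) of the two inputs, so neither value
// loses precision.
case class Dec(precision: Int, scale: Int)

object DecimalWidening {
  def widerDecimalType(d1: Dec, d2: Dec): Dec = {
    val scale = math.max(d1.scale, d2.scale)
    val range = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
    Dec(range + scale, scale)
  }
}
~~~

For example, widening Dec(5, 2) and Dec(10, 4) gives Dec(10, 4): scale max(2, 4) = 4, range max(3, 6) = 6, precision 6 + 4 = 10.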
[GitHub] spark pull request #22381: [SPARK-25394][CORE] Add an application status met...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22381#discussion_r218660634

--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
@@ -503,9 +503,12 @@ private[spark] object AppStatusStore {
   /**
    * Create an in-memory store for a live application.
    */
-  def createLiveStore(conf: SparkConf): AppStatusStore = {
+  def createLiveStore(
+      conf: SparkConf,
+      appStatusSource: Option[AppStatusSource] = None):
--- End diff --

Yep, sorry for the late reply. :(
[GitHub] spark issue #22460: DO NOT MERGE
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22460 **[Test build #4341 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4341/testReport)** for PR 22460 at commit [`76c99af`](https://github.com/apache/spark/commit/76c99afc9c5013968f34e9603a896e21c92a4eb2).
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96218/ Test FAILed.
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Merged build finished. Test FAILed.
[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/19773 @maropu @dongjoon-hyun Great thanks for your guidance!

```
Apache Spark already supports changing column types as a part of schema evolution. Especially, ORC vectorized reader support upcasting although it's not the same with canCast. For the detail support Spark coverage, see SPARK-23007. It covered all built-in data source at that time.
```

Great thanks, I'll study this background soon.

```
Please note that every data sources have different capability. So, this PR needs to prevent ALTER TABLE CHANGE COLUMN for those data sources case-by-case. And, we need corresponding test cases.
```

Got it. I'll keep following the cases in this PR; I have roughly split them into 4 tasks and updated the description of this PR first. I'll pay attention to the corresponding test cases in each task.
[GitHub] spark issue #22447: [SPARK-25450][SQL] PushProjectThroughUnion rule uses the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22447 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3216/ Test PASSed.
[GitHub] spark issue #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgumentExce...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22461 To `[SPARK-25453][SQL][TEST]`
[GitHub] spark issue #22447: [SPARK-25450][SQL] PushProjectThroughUnion rule uses the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22447 Merged build finished. Test PASSed.
[GitHub] spark issue #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgumentExce...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22461 Thanks! LGTM, too. @gatorsmile @HyukjinKwon Can you trigger tests? BTW, doesn't Jenkins run these docker integration tests?
[GitHub] spark issue #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgumentExce...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22461 cc @maropu
[GitHub] spark issue #22447: [SPARK-25450][SQL] PushProjectThroughUnion rule uses the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22447 **[Test build #96221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96221/testReport)** for PR 22447 at commit [`7193de3`](https://github.com/apache/spark/commit/7193de3ad8675229eef131214ed62f2ece5cd416).
[GitHub] spark issue #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgumentExce...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22461 Could you add `[TEST]` to the title? Otherwise LGTM.
[GitHub] spark issue #22447: [SPARK-25450][SQL] PushProjectThroughUnion rule uses the...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22447 retest this please
[GitHub] spark issue #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgumentExce...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22461 Can one of the admins verify this patch?
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3215/ Test PASSed.
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21632 Merged build finished. Test PASSed.
[GitHub] spark issue #22456: [SPARK-19355][SQL] Fix variable names numberOfOutput -> ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22456 Merged build finished. Test PASSed.
[GitHub] spark issue #22456: [SPARK-19355][SQL] Fix variable names numberOfOutput -> ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22456 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96206/ Test PASSed.
[GitHub] spark pull request #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgum...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22461 [SPARK-25453] OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] ## What changes were proposed in this pull request? This PR aims to fix the failed test of `OracleIntegrationSuite`. ## How was this patch tested? Existing integration tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25453 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22461.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22461 commit fb7b9d22812f789b970f8dae8a9cd3763c0eccff Author: seancxmao Date: 2018-09-19T03:18:38Z [SPARK-25453] OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
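The exception in the title comes from handing a date-only string to a strict timestamp parser. A rough Python analogy (the helper name is illustrative; the real check lives in `java.sql.Timestamp.valueOf` on the JDBC side, which raises IllegalArgumentException instead of ValueError):

```python
from datetime import datetime

def parse_timestamp(s: str) -> datetime:
    """Parse 'yyyy-mm-dd hh:mm:ss' strictly; a date-only string like
    '2018-09-19' fails, mirroring the failure seen in OracleIntegrationSuite."""
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
```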
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21632 **[Test build #96220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96220/testReport)** for PR 21632 at commit [`b9f2425`](https://github.com/apache/spark/commit/b9f2425bcf29997160b4f582ecc74d8a94c708cf).
[GitHub] spark issue #22456: [SPARK-19355][SQL] Fix variable names numberOfOutput -> ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22456 **[Test build #96206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96206/testReport)** for PR 22456 at commit [`793fc19`](https://github.com/apache/spark/commit/793fc19d2519f47f5f3278b79e827f1159d9e440). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to de...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/21632#discussion_r218657097 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -66,6 +69,9 @@ class DecisionTreeClassifier @Since("1.4.0") ( override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ + @Since("2.2.0") --- End diff -- done
[GitHub] spark pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to de...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/21632#discussion_r218657039 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala --- @@ -65,6 +68,9 @@ class DecisionTreeRegressor @Since("1.4.0") (@Since("1.4.0") override val uid: S override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ + @Since("2.2.0") --- End diff -- done
[GitHub] spark pull request #21632: [SPARK-19591][ML][MLlib] Add sample weights to de...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/21632#discussion_r218657065 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -97,28 +103,48 @@ class DecisionTreeClassifier @Since("1.4.0") ( @Since("1.6.0") override def setSeed(value: Long): this.type = set(seed, value) + /** + * Sets the value of param [[weightCol]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one. + * + * @group setParam + */ + @Since("2.2.0") --- End diff -- done
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21632 **[Test build #96219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96219/testReport)** for PR 21632 at commit [`8a18157`](https://github.com/apache/spark/commit/8a18157b9ba7952457d91c771cae5c70405cacf7).
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3214/ Test PASSed.
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21632 Merged build finished. Test PASSed.
[GitHub] spark pull request #21527: [SPARK-24519] Make the threshold for highly compr...
Github user hthuynh2 commented on a diff in the pull request: https://github.com/apache/spark/pull/21527#discussion_r218656012 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -50,7 +50,9 @@ private[spark] sealed trait MapStatus { private[spark] object MapStatus { def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = { -if (uncompressedSizes.length > 2000) { +if (uncompressedSizes.length > Option(SparkEnv.get) --- End diff -- How about creating a "static" val shuffleMinNumPartsToHighlyCompress for this? Please let me know if this is good for you so I can update it. Thanks.
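The pattern under discussion above — consult the conf for the threshold, falling back to the historical default of 2000 when no SparkEnv is available — can be sketched as follows. All names here are hypothetical illustrations, not the actual Spark API:

```python
# Sketch of a configurable "highly compressed" threshold with a fallback
# default, instead of hard-coding 2000 at the call site. The conf key and
# function names are made up for illustration.
DEFAULT_MIN_PARTS_TO_HIGHLY_COMPRESS = 2000

def choose_map_status(num_partitions, conf=None):
    """Pick a MapStatus representation by partition count."""
    threshold = (conf or {}).get(
        "spark.shuffle.minNumPartitionsToHighlyCompress",
        DEFAULT_MIN_PARTS_TO_HIGHLY_COMPRESS)
    return ("HighlyCompressedMapStatus" if num_partitions > threshold
            else "CompressedMapStatus")
```

Centralizing the default in one named constant, as the reviewer suggests, keeps the fallback value from drifting between call sites.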
[GitHub] spark issue #22460: DO NOT MERGE
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22460 **[Test build #96218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96218/testReport)** for PR 22460 at commit [`76c99af`](https://github.com/apache/spark/commit/76c99afc9c5013968f34e9603a896e21c92a4eb2).
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Merged build finished. Test PASSed.
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3213/ Test PASSed.
[GitHub] spark pull request #22460: DO NOT MERGE
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/22460 DO NOT MERGE You can merge this pull request into a Git repository by running: $ git pull https://github.com/squito/spark debugging Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22460.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22460 commit 76c99afc9c5013968f34e9603a896e21c92a4eb2 Author: Imran Rashid Date: 2018-09-19T03:04:03Z debugging
[GitHub] spark issue #22457: [SPARK-24626] Add statistics prefix to parallelFileListi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22457 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96209/ Test PASSed.
[GitHub] spark issue #22457: [SPARK-24626] Add statistics prefix to parallelFileListi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22457 Merged build finished. Test PASSed.
[GitHub] spark issue #22451: [SPARK-24777][SQL] Add write benchmark for AVRO
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22451 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3212/ Test PASSed.
[GitHub] spark issue #22451: [SPARK-24777][SQL] Add write benchmark for AVRO
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22451 **[Test build #96217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96217/testReport)** for PR 22451 at commit [`34af59d`](https://github.com/apache/spark/commit/34af59db044274e0c32a3d01cea7f260f0a36bdd).
[GitHub] spark issue #22457: [SPARK-24626] Add statistics prefix to parallelFileListi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22457 **[Test build #96209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96209/testReport)** for PR 22457 at commit [`dde781f`](https://github.com/apache/spark/commit/dde781fa51782150f84932a5e2e701d0fdf2d355). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22451: [SPARK-24777][SQL] Add write benchmark for AVRO
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22451 Merged build finished. Test PASSed.
[GitHub] spark issue #22451: [SPARK-24777][SQL] Add write benchmark for AVRO
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22451 retest this please.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16677 Let me take an example from the PR description > For example, we have three partitions with rows (100, 100, 50) respectively. In global limit of 100 rows, we may take (34, 33, 33) rows for each partition locally. After global limit we still have three partitions. Without this patch, we need to take the first 100 rows from each partition, and then perform a shuffle to send all data into one partition and take the first 100 rows. So if the limit is big, this patch is super useful; if the limit is small, it is not that useful but should not be slower. The only overhead I can think of is that `MapStatus` needs to carry the numRecords metric. It should be a small overhead, as `MapStatus` already carries a lot of information.
[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16677#discussion_r218652707
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -93,25 +96,93 @@ trait BaseLimitExec extends UnaryExecNode with CodegenSupport {
 }

 /**
- * Take the first `limit` elements of each child partition, but do not collect or shuffle them.
+ * Take the `limit` elements of the child output.
  */
-case class LocalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec {
+case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode {

-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+  override def output: Seq[Attribute] = child.output

   override def outputPartitioning: Partitioning = child.outputPartitioning
-}

-/**
- * Take the first `limit` elements of the child's single output partition.
- */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec {
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering

-  override def requiredChildDistribution: List[Distribution] = AllTuples :: Nil
+  private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)

-  override def outputPartitioning: Partitioning = child.outputPartitioning
+  protected override def doExecute(): RDD[InternalRow] = {
+    val childRDD = child.execute()
+    val partitioner = LocalPartitioning(childRDD)
+    val shuffleDependency = ShuffleExchangeExec.prepareShuffleDependency(
+      childRDD, child.output, partitioner, serializer)
+    val numberOfOutput: Seq[Long] = if (shuffleDependency.rdd.getNumPartitions != 0) {
+      // submitMapStage does not accept RDD with 0 partition.
+      // So, we will not submit this dependency.
+      val submittedStageFuture = sparkContext.submitMapStage(shuffleDependency)
+      submittedStageFuture.get().recordsByPartitionId.toSeq
+    } else {
+      Nil
+    }

-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+    // During global limit, try to evenly distribute limited rows across data
+    // partitions. If disabled, scanning data partitions sequentially until reaching limit number.
+    // Besides, if child output has certain ordering, we can't evenly pick up rows from
+    // each parititon.
+    val flatGlobalLimit = sqlContext.conf.limitFlatGlobalLimit && child.outputOrdering == Nil
+
+    val shuffled = new ShuffledRowRDD(shuffleDependency)
+
+    val sumOfOutput = numberOfOutput.sum
+    if (sumOfOutput <= limit) {
+      shuffled
+    } else if (!flatGlobalLimit) {
+      var numRowTaken = 0
+      val takeAmounts = numberOfOutput.map { num =>
+        if (numRowTaken + num < limit) {
+          numRowTaken += num.toInt
+          num.toInt
+        } else {
+          val toTake = limit - numRowTaken
+          numRowTaken += toTake
+          toTake
+        }
+      }
+      val broadMap = sparkContext.broadcast(takeAmounts)
+      shuffled.mapPartitionsWithIndexInternal { case (index, iter) =>
+        iter.take(broadMap.value(index).toInt)
+      }
+    } else {
+      // We try to evenly require the asked limit number of rows across all child rdd's partitions.
+      var rowsNeedToTake: Long = limit
+      val takeAmountByPartition: Array[Long] = Array.fill[Long](numberOfOutput.length)(0L)
+      val remainingRowsByPartition: Array[Long] = Array(numberOfOutput: _*)
+
+      while (rowsNeedToTake > 0) {
+        val nonEmptyParts = remainingRowsByPartition.count(_ > 0)
+        // If the rows needed to take are less the number of non-empty partitions, take one row from
+        // each non-empty partitions until we reach `limit` rows.
+        // Otherwise, evenly divide the needed rows to each non-empty partitions.
+        val takePerPart = math.max(1, rowsNeedToTake / nonEmptyParts)
+        remainingRowsByPartition.zipWithIndex.foreach { case (num, index) =>
+          // In case `rowsNeedToTake` < `nonEmptyParts`, we may run out of `rowsNeedToTake` during
+          // the traversal, so we need to add this check.
+          if (rowsNeedToTake > 0 && num > 0) {
+            if (num >= takePerPart) {
+              rowsNeedToTake -= takePerPart
+              takeAmountByPartition(index) += takePerPart
+              remainingRowsByPartition(index) -= takePerPart
+            } else {
+              rowsNeedToTake -= num
+              takeAmountByPartition(index) += num
+              remainingRowsByPartition(index) -= num
+            }
+          }
+        }
+      }
+      val
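The even-distribution loop quoted in the diff can be modeled in Python (a simplified sketch, assuming, as the surrounding Scala code guarantees, that the caller only reaches this path when the partitions hold more rows than the limit):

```python
def distribute_limit(counts, limit):
    """Spread `limit` rows as evenly as possible across partitions holding
    `counts` rows each, mirroring the flat-global-limit loop in the diff.
    Precondition: sum(counts) > limit (checked by the caller in Spark)."""
    rows_needed = limit
    take = [0] * len(counts)
    remaining = list(counts)
    while rows_needed > 0:
        non_empty = sum(1 for n in remaining if n > 0)
        # Evenly divide what is still needed over the non-empty partitions,
        # taking at least one row per partition per round.
        per_part = max(1, rows_needed // non_empty)
        for i, num in enumerate(remaining):
            if rows_needed > 0 and num > 0:
                taken = min(num, per_part)
                rows_needed -= taken
                take[i] += taken
                remaining[i] -= taken
    return take
```

With partitions of (100, 100, 50) rows and a limit of 100, this yields the (34, 33, 33) split described in cloud-fan's comment above.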
[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22237 Merged build finished. Test PASSed.
[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22237 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96205/ Test PASSed.
[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22237 **[Test build #96205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96205/testReport)** for PR 22237 at commit [`5884173`](https://github.com/apache/spark/commit/58841735e14fe334c779b83eb915bff3a4f900e6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 This issue was addressed a long time ago. @cloud-fan @maropu
[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22448 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96204/ Test PASSed.
[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16677#discussion_r218651545
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -93,25 +96,93 @@ trait BaseLimitExec extends UnaryExecNode with CodegenSupport {
 }

 /**
- * Take the first `limit` elements of each child partition, but do not collect or shuffle them.
+ * Take the `limit` elements of the child output.
  */
-case class LocalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec {
+case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode {

-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+  override def output: Seq[Attribute] = child.output

   override def outputPartitioning: Partitioning = child.outputPartitioning
-}

-/**
- * Take the first `limit` elements of the child's single output partition.
- */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec {
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering

-  override def requiredChildDistribution: List[Distribution] = AllTuples :: Nil
+  private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)

-  override def outputPartitioning: Partitioning = child.outputPartitioning
+  protected override def doExecute(): RDD[InternalRow] = {
+    val childRDD = child.execute()
+    val partitioner = LocalPartitioning(childRDD)
+    val shuffleDependency = ShuffleExchangeExec.prepareShuffleDependency(
+      childRDD, child.output, partitioner, serializer)
+    val numberOfOutput: Seq[Long] = if (shuffleDependency.rdd.getNumPartitions != 0) {
+      // submitMapStage does not accept RDD with 0 partition.
+      // So, we will not submit this dependency.
+      val submittedStageFuture = sparkContext.submitMapStage(shuffleDependency)
+      submittedStageFuture.get().recordsByPartitionId.toSeq
+    } else {
+      Nil
+    }

-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+    // During global limit, try to evenly distribute limited rows across data
+    // partitions. If disabled, scanning data partitions sequentially until reaching limit number.
+    // Besides, if child output has certain ordering, we can't evenly pick up rows from
+    // each parititon.
+    val flatGlobalLimit = sqlContext.conf.limitFlatGlobalLimit && child.outputOrdering == Nil
+
+    val shuffled = new ShuffledRowRDD(shuffleDependency)
+
+    val sumOfOutput = numberOfOutput.sum
+    if (sumOfOutput <= limit) {
+      shuffled
+    } else if (!flatGlobalLimit) {
+      var numRowTaken = 0
+      val takeAmounts = numberOfOutput.map { num =>
+        if (numRowTaken + num < limit) {
+          numRowTaken += num.toInt
+          num.toInt
+        } else {
+          val toTake = limit - numRowTaken
+          numRowTaken += toTake
+          toTake
+        }
+      }
+      val broadMap = sparkContext.broadcast(takeAmounts)
+      shuffled.mapPartitionsWithIndexInternal { case (index, iter) =>
+        iter.take(broadMap.value(index).toInt)
+      }
+    } else {
+      // We try to evenly require the asked limit number of rows across all child rdd's partitions.
+      var rowsNeedToTake: Long = limit
+      val takeAmountByPartition: Array[Long] = Array.fill[Long](numberOfOutput.length)(0L)
+      val remainingRowsByPartition: Array[Long] = Array(numberOfOutput: _*)
+
+      while (rowsNeedToTake > 0) {
+        val nonEmptyParts = remainingRowsByPartition.count(_ > 0)
+        // If the rows needed to take are less the number of non-empty partitions, take one row from
+        // each non-empty partitions until we reach `limit` rows.
+        // Otherwise, evenly divide the needed rows to each non-empty partitions.
+        val takePerPart = math.max(1, rowsNeedToTake / nonEmptyParts)
+        remainingRowsByPartition.zipWithIndex.foreach { case (num, index) =>
+          // In case `rowsNeedToTake` < `nonEmptyParts`, we may run out of `rowsNeedToTake` during
+          // the traversal, so we need to add this check.
+          if (rowsNeedToTake > 0 && num > 0) {
+            if (num >= takePerPart) {
+              rowsNeedToTake -= takePerPart
+              takeAmountByPartition(index) += takePerPart
+              remainingRowsByPartition(index) -= takePerPart
+            } else {
+              rowsNeedToTake -= num
+              takeAmountByPartition(index) += num
+              remainingRowsByPartition(index) -= num
+            }
+          }
+        }
+      }
+      val
[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22448 Merged build finished. Test PASSed.
[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22448 **[Test build #96204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96204/testReport)** for PR 22448 at commit [`0de3328`](https://github.com/apache/spark/commit/0de332801c4b50f175b4c2eb0205096bcd4093b3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22456: [SPARK-19355][SQL] Fix variable names numberOfOut...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22456#discussion_r218651070 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -31,7 +31,7 @@ import org.apache.spark.util.Utils /** * Result returned by a ShuffleMapTask to a scheduler. Includes the block manager address that the - * task ran on, the sizes of outputs for each reducer, and the number of outputs of the map task, + * task ran on, the sizes of outputs for each reducer, and the number of records of the map task, --- End diff -- Shall we also change `the sizes of outputs for each reducer` to `the sizes of output records for each reducer`?
[GitHub] spark issue #22381: [SPARK-25394][CORE] Add an application status metrics so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22381 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3211/ Test PASSed.
[GitHub] spark issue #22381: [SPARK-25394][CORE] Add an application status metrics so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22381 Merged build finished. Test PASSed.
[GitHub] spark issue #22381: [SPARK-25394][CORE] Add an application status metrics so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22381 **[Test build #96215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96215/testReport)** for PR 22381 at commit [`3a2db16`](https://github.com/apache/spark/commit/3a2db16813a2bab3160f444fe0855855187a6178).
[GitHub] spark issue #22414: [SPARK-25424][SQL] Window duration and slide duration wi...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22414 How about this? https://github.com/apache/spark/compare/master...maropu:pr22414 IMO simple fixes and tests are better.
[GitHub] spark issue #22165: [SPARK-25017][Core] Add test suite for BarrierCoordinato...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22165 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3210/ Test PASSed.