[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21804 **[Test build #93236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93236/testReport)** for PR 21804 at commit [`eb78665`](https://github.com/apache/spark/commit/eb786655387ecf7320d9b4957b45564253fb1af4). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21804 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93236/ Test FAILed.
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21804 Merged build finished. Test FAILed.
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21804 **[Test build #93236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93236/testReport)** for PR 21804 at commit [`eb78665`](https://github.com/apache/spark/commit/eb786655387ecf7320d9b4957b45564253fb1af4).
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21804 Merged build finished. Test PASSed.
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21804 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1096/ Test PASSed.
[GitHub] spark pull request #21754: [SPARK-24705][SQL] Cannot reuse an exchange opera...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/21754#discussion_r203416454 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala --- @@ -85,14 +85,20 @@ case class ReusedExchangeExec(override val output: Seq[Attribute], child: Exchan */ case class ReuseExchange(conf: SQLConf) extends Rule[SparkPlan] { + private def supportReuseExchange(exchange: Exchange): Boolean = exchange match { +// If a coordinator defined in an exchange operator, the exchange cannot be reused --- End diff -- This seems overstated if this comment in the JIRA description is correct: "When the cache tabel device_loc is executed before this query is executed, everything is fine". In fact, if Xiao Li is correct in that statement, then this PR is eliminating a useful optimization in cases where it doesn't need to -- i.e. it is preventing Exchange reuse any time adaptive execution is used instead of only preventing reuse when it will actually cause a problem.
[GitHub] spark pull request #21131: [SPARK-23433][CORE] Late zombie task completions ...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21131#discussion_r203415036 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -764,6 +769,19 @@ private[spark] class TaskSetManager( maybeFinishTaskSet() } + private[scheduler] def markPartitionCompleted(partitionId: Int): Unit = { +partitionToIndex.get(partitionId).foreach { index => + if (!successful(index)) { +tasksSuccessful += 1 +successful(index) = true +if (tasksSuccessful == numTasks) { + isZombie = true +} +maybeFinishTaskSet() --- End diff -- I think you're right, it's not needed: it's called when the tasks succeed, fail, or are aborted, and when this is called while that taskset still has running tasks, it's a no-op, as it would fail the `runningTasks == 0` check inside `maybeFinishTaskSet()`. Do you think it's worth removing? I'm fine either way.
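To make the no-op argument above concrete, here is a minimal plain-Python sketch of the logic in the diff. It is not the Spark code: `TaskSetSketch` and its snake_case members are hypothetical stand-ins, the partition-to-index lookup is dropped, and the guard inside `maybe_finish_task_set` is assumed from the `runningTasks == 0` check mentioned in the comment.

```python
class TaskSetSketch:
    """Toy model of a task set; names mirror the diff, the logic is simplified."""

    def __init__(self, num_tasks, running_tasks):
        self.num_tasks = num_tasks
        self.running_tasks = running_tasks
        self.successful = [False] * num_tasks
        self.tasks_successful = 0
        self.is_zombie = False
        self.finished = False

    def maybe_finish_task_set(self):
        # Assumed guard: finish only when the set is a zombie and nothing is running.
        if self.is_zombie and self.running_tasks == 0:
            self.finished = True

    def mark_partition_completed(self, index):
        # Same shape as the method in the diff, minus the partition lookup.
        if not self.successful[index]:
            self.tasks_successful += 1
            self.successful[index] = True
            if self.tasks_successful == self.num_tasks:
                self.is_zombie = True
            self.maybe_finish_task_set()


# One task is still running when its partition is marked completed elsewhere.
ts = TaskSetSketch(num_tasks=1, running_tasks=1)
ts.mark_partition_completed(0)
finished_while_running = ts.finished   # the call above could not finish the set

# The later task-end path (success, failure, or abort) is what finishes it.
ts.running_tasks = 0
ts.maybe_finish_task_set()
```

Under these assumptions the call inside `mark_partition_completed` only matters when no tasks are running, which is the case the comment treats as already covered by the normal task-end paths.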
[GitHub] spark pull request #21804: [SPARK-24268][SQL] Use datatype.catalogString in ...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/21804 [SPARK-24268][SQL] Use datatype.catalogString in error messages ## What changes were proposed in this pull request? As stated in https://github.com/apache/spark/pull/21321, in the error messages we should use `catalogString`. This is not the case, as SPARK-22893 used `simpleString` in order to have the same representation everywhere, and it missed some places. The PR unifies the messages by always using the `catalogString` representation of the data types. ## How was this patch tested? existing/modified UTs You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-24268_catalog Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21804.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21804 commit eb786655387ecf7320d9b4957b45564253fb1af4 Author: Marco Gaido Date: 2018-07-18T14:47:12Z [SPARK-24268][SQL] Use datatype.catalogString in error messages
[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21803 How about the case where a column name has special characters that should be backquoted, e.g., 'aaa:bbb'?
[GitHub] spark pull request #21802: [SPARK-23928][SQL] Add shuffle collection functio...
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/21802#discussion_r203388798 --- Diff: python/pyspark/sql/functions.py --- @@ -2382,6 +2382,20 @@ def array_sort(col): return Column(sc._jvm.functions.array_sort(_to_java_column(col))) +@since(2.4) +def shuffle(col): +""" +Collection function: Generates a random permutation of the given array. + +.. note:: The function is non-deterministic because its results depends on order of rows which --- End diff -- Isn't it non-deterministic rather because the permutation is determined randomly?
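The reviewer's point can be illustrated with a small plain-Python sketch. This is not the PySpark `shuffle` function from the diff: `shuffle_array` and its `seed` parameter are hypothetical, introduced only to show that the result depends on the random source rather than on the input rows.

```python
import random


def shuffle_array(values, seed=None):
    """Return a random permutation of `values`; None stays None, like a null array."""
    if values is None:
        return None
    rng = random.Random(seed)   # seed=None draws from system entropy or the clock
    result = list(values)       # copy so the caller's list is left untouched
    rng.shuffle(result)
    return result


# With the same seed the permutation is reproducible.
a = shuffle_array([1, 9, 8, 7], seed=42)
b = shuffle_array([1, 9, 8, 7], seed=42)

# Without a seed, two calls on identical input may return different orders.
c = shuffle_array([1, 9, 8, 7])
```

In this sketch the input is identical across calls, so any difference between unseeded results comes only from the random choice of permutation.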
[GitHub] spark pull request #21802: [SPARK-23928][SQL] Add shuffle collection functio...
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/21802#discussion_r203407122 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -1444,6 +1444,51 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { ) } + test("shuffle function") { +// Shuffle expressions should produce same results at retries in the same DataFrame. +def checkResult(df: DataFrame): Unit = { + checkAnswer(df, df.collect()) +} + +// primitive-type elements +val idf = Seq( + Seq(1, 9, 8, 7), + Seq(5, 8, 9, 7, 2), + Seq.empty, + null +).toDF("i") + +def checkResult1(): Unit = { --- End diff -- Maybe a different name for the method?
[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21803 Can one of the admins verify this patch?
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21533 **[Test build #4219 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4219/testReport)** for PR 21533 at commit [`eb46ccf`](https://github.com/apache/spark/commit/eb46ccfec084c2439a26eee38015381f091fe164).
[GitHub] spark pull request #21612: [SPARK-24628][DOC]Typos of the example code in do...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21612
[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21803 **[Test build #93235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93235/testReport)** for PR 21803 at commit [`34511db`](https://github.com/apache/spark/commit/34511db4c283e1013de203ca03ce152b26cf62f4).
[GitHub] spark pull request #21803: [SPARK-24849][SQL] Converting a value of StructTy...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/21803 [SPARK-24849][SQL] Converting a value of StructType to a DDL string ## What changes were proposed in this pull request? In the PR, I propose to extend the `StructType` object with a new method `toDDL`, which converts a value of the `StructType` type to a string formatted in DDL style. The resulting string can be used in a table creation. ## How was this patch tested? I added a test for checking the new method and 2 round trip tests: `fromDDL` -> `toDDL` and `toDDL` -> `fromDDL` You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 to-ddl Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21803.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21803 commit 38f905ad61f9197d12213bd93f2f755d428ee431 Author: Maxim Gekk Date: 2018-07-18T14:33:31Z New method - toDDL commit 6e0509326393ab0554b66df0ae65ba263b2c4fa9 Author: Maxim Gekk Date: 2018-07-18T14:39:16Z Simplification of a test commit 34511db4c283e1013de203ca03ce152b26cf62f4 Author: Maxim Gekk Date: 2018-07-18T14:44:38Z New test for cases
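As a rough illustration of what a DDL-style rendering involves, including the review question about column names such as 'aaa:bbb', here is a plain-Python sketch. It is not the PR's `toDDL` implementation: `quote_identifier` and `to_ddl` are hypothetical helpers, the comma-separated `name type` layout is assumed, and escaping an embedded backtick by doubling it is an assumption about the quoting rule.

```python
def quote_identifier(name):
    # Wrap the name in backticks; double any embedded backtick (assumed escaping rule).
    return "`" + name.replace("`", "``") + "`"


def to_ddl(fields):
    """Render (name, type_string) pairs as a comma-separated DDL column list."""
    return ",".join(f"{quote_identifier(name)} {type_string}"
                    for name, type_string in fields)


ddl = to_ddl([("id", "INT"), ("aaa:bbb", "STRING")])
print(ddl)  # `id` INT,`aaa:bbb` STRING
```

Quoting every name unconditionally keeps the sketch simple and sidesteps deciding which characters are "special"; whether the actual method quotes always or only when needed is not stated in the thread.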
[GitHub] spark issue #21612: [SPARK-24628][DOC]Typos of the example code in docs/mlli...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21612 Merged to master
[GitHub] spark pull request #21767: SPARK-24804 There are duplicate words in the test...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21767
[GitHub] spark issue #21767: SPARK-24804 There are duplicate words in the test title ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21767 yeah, please avoid PRs that are this trivial, it's just not worth the overhead. But I merged it this time. Also please read https://spark.apache.org/contributing.html
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748 @mccheah Tried this PR in client-mode In-Cluster on minikube v0.25.2: Executors are started but immediately removed. As the start/remove is so fast, I can hardly see logs (and the logs I have seen don't show any stacktrace). Maybe something in my env? The config I have for the client mode is: ``` # DRIVER_POD_NAME=$HOSTNAME --conf spark.kubernetes.driver.pod.name="$DRIVER_POD_NAME" \ --conf spark.driver.host="$DRIVER_POD_NAME" \ --conf spark.driver.port=7077 \ --conf spark.driver.blockManager.port=1 \ ``` The driver log is: ``` 2018-07-18 14:29:43 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1039 2018-07-18 14:29:43 INFO DAGScheduler:54 - Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)) 2018-07-18 14:29:43 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 10 tasks 2018-07-18 14:29:45 INFO ExecutorPodsAllocator:54 - Going to request 1 executors from Kubernetes. 2018-07-18 14:29:46 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 5 from BlockManagerMaster. 2018-07-18 14:29:46 INFO BlockManagerMaster:54 - Removal of executor 5 requested 2018-07-18 14:29:46 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asked to remove non-existent executor 5 2018-07-18 14:29:52 INFO BlockManagerMaster:54 - Removal of executor 6 requested 2018-07-18 14:29:52 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asked to remove non-existent executor 6 2018-07-18 14:29:52 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 6 from BlockManagerMaster. 2018-07-18 14:29:52 INFO ExecutorPodsAllocator:54 - Going to request 1 executors from Kubernetes. 
2018-07-18 14:29:55 INFO BlockManagerMaster:54 - Removal of executor 7 requested 2018-07-18 14:29:55 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asked to remove non-existent executor 7 ```
[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21638 Because this method is internal to Spark, why not just take out the parameter? Yes it's superfluous now, but it's been this way for a while, and seems perhaps better to avoid a behavior change. In fact you can pull a `minPartitions` parameter out of several private methods then. You can't remove the parameter to `binaryFiles`, sure, but it can be documented as doing nothing.
[GitHub] spark issue #21781: [INFRA] Close stale PR
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21781 I would add: https://github.com/apache/spark/pull/19233 https://github.com/apache/spark/pull/20100
[GitHub] spark pull request #21765: [MINOR][CORE] Add test cases for RDD.cartesian
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21765
[GitHub] spark issue #21765: [MINOR][CORE] Add test cases for RDD.cartesian
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21765 Merged to master
[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21774 Merged build finished. Test PASSed.
[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21774 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93230/ Test PASSed.
[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21774 **[Test build #93230 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93230/testReport)** for PR 21774 at commit [`204a59d`](https://github.com/apache/spark/commit/204a59d0088b9a3c959c6e3bce6b2fd663d991be). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AvroFunctionsSuite extends QueryTest with SharedSQLContext `
[GitHub] spark pull request #21795: [SPARK-24840][SQL] do not use dummy filter to swi...
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/21795#discussion_r203379244 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -1147,65 +1149,66 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { val nseqi : Seq[Int] = null val nseqs : Seq[String] = null val df = Seq( - (Seq(1), Seq(2, 3), Seq(5L, 6L), nseqi, Seq("a", "b", "c"), Seq("d", "e"), Seq("f"), nseqs), (Seq(1, 0), Seq.empty[Int], Seq(2L), nseqi, Seq("a"), Seq.empty[String], Seq(null), nseqs) ).toDF("i1", "i2", "i3", "in", "s1", "s2", "s3", "sn") -val dummyFilter = (c: Column) => c.isNull || c.isNotNull // switch codeGen on - // Simple test cases -checkAnswer( --- End diff -- Good catch!
[GitHub] spark pull request #21795: [SPARK-24840][SQL] do not use dummy filter to swi...
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/21795#discussion_r203378508 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -924,26 +926,26 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { null ).toDF("i") -checkAnswer( - idf.select(reverse('i)), - Seq(Row(Seq(7, 8, 9, 1)), Row(Seq(2, 7, 9, 8, 5)), Row(Seq.empty), Row(null)) -) -checkAnswer( - idf.filter(dummyFilter('i)).select(reverse('i)), - Seq(Row(Seq(7, 8, 9, 1)), Row(Seq(2, 7, 9, 8, 5)), Row(Seq.empty), Row(null)) -) -checkAnswer( - idf.selectExpr("reverse(i)"), - Seq(Row(Seq(7, 8, 9, 1)), Row(Seq(2, 7, 9, 8, 5)), Row(Seq.empty), Row(null)) -) -checkAnswer( - oneRowDF.selectExpr("reverse(array(1, null, 2, null))"), - Seq(Row(Seq(null, 2, null, 1))) -) -checkAnswer( - oneRowDF.filter(dummyFilter('i)).selectExpr("reverse(array(1, null, 2, null))"), - Seq(Row(Seq(null, 2, null, 1))) -) +def checkResult2(): Unit = { --- End diff -- What about using more specific names for functions ```checkResult2```, ```checkResult3``` etc.? Maybe ```checkStringTestCases```, ```checkCasesWithArraysOfComplexTypes``` or something like that?
[GitHub] spark issue #20856: [SPARK-23731][SQL] FileSourceScanExec throws NullPointer...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20856 @HyukjinKwon thanks for your great analysis. I agree with you that the proposed fix is more a "workaround" than a real fix for the issue we have here. The main problem here, as you pointed out, is that we have a bad (invalid?) `FileSourceScanExec` on the executors. Probably this has never been an issue because on the executors we accessed only some properties which were correctly populated, and we assumed that the other operations would be performed only on the driver side. I think the cleanest approach (not sure it is entirely feasible) would be to choose one of the following options: - check that all exec expressions (in this case `FileSourceScanExec`) work properly on both the driver and the executor side; - define which operations/attributes can be accessed on the executor side too and which only on the driver side, document it and enforce it (if feasible). What do you think?
[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21440#discussion_r203381863 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks: Array[ByteBuffer]) { } +object ChunkedByteBuffer { + // TODO eliminate this method if we switch BlockManager to getting InputStreams + def fromManagedBuffer(data: ManagedBuffer, maxChunkSize: Int): ChunkedByteBuffer = { +data match { + case f: FileSegmentManagedBuffer => +map(f.getFile, maxChunkSize, f.getOffset, f.getLength) + case other => +new ChunkedByteBuffer(other.nioByteBuffer()) +} + } + + def map(file: File, maxChunkSize: Int, offset: Long, length: Long): ChunkedByteBuffer = { +Utils.tryWithResource(new FileInputStream(file).getChannel()) { channel => --- End diff -- great, thanks for the explanation
[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21440 **[Test build #93234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93234/testReport)** for PR 21440 at commit [`4664942`](https://github.com/apache/spark/commit/4664942f0509b8d34ff27ddc9427351ed836f663).
[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21440 Merged build finished. Test PASSed.
[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21440 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1095/ Test PASSed.
[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20949 Merged build finished. Test PASSed.
[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20949 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93229/ Test PASSed.
[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20949 **[Test build #93229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93229/testReport)** for PR 20949 at commit [`025958a`](https://github.com/apache/spark/commit/025958a7d9e8a741875db2af8878f60cb07409d3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21770 **[Test build #93233 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93233/testReport)** for PR 21770 at commit [`5d33d53`](https://github.com/apache/spark/commit/5d33d535f7c04a7231c3b088ac3fcde313f5da8c).
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Merged build finished. Test PASSed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1094/ Test PASSed.
[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20949 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93226/ Test FAILed.
[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20949 Build finished. Test FAILed.
[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/21774 This is ready for review. @cloud-fan
[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20949 **[Test build #93226 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93226/testReport)** for PR 20949 at commit [`fd857b0`](https://github.com/apache/spark/commit/fd857b005abba233eb7409479436c0abe4e23e4f). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #21596: [SPARK-24601] Bump Jackson version
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21596 **[Test build #93232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93232/testReport)** for PR 21596 at commit [`7d4ac0b`](https://github.com/apache/spark/commit/7d4ac0b25ca0b38e48e20f288e7389fbbf83a01a).
[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21802 **[Test build #93231 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93231/testReport)** for PR 21802 at commit [`b4cbb55`](https://github.com/apache/spark/commit/b4cbb5558088356fe6be1cda053c9f91fbe7c538).
[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21802 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1093/ Test PASSed.
[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21802 Merged build finished. Test PASSed.
[GitHub] spark issue #21386: [SPARK-23928][SQL][WIP] Add shuffle collection function.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21386 @pkuwm I submitted a PR #21802 based on this. Could you take a look if you have time? Thanks!
[GitHub] spark pull request #21772: [SPARK-24809] [SQL] Serializing LongHashedRelatio...
Github user liutang123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21772#discussion_r203365167

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -726,8 +726,9 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap
     writeLong(array.length)
     writeLongArray(writeBuffer, array, array.length)
-    val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
-    writeLong(used)
+    val cursorFlag = cursor - Platform.LONG_ARRAY_OFFSET
+    writeLong(cursorFlag)
+    val used = (cursorFlag / 8).toInt
--- End diff --

![image](https://issues.apache.org/jira/secure/attachment/12932027/Spark%20LongHashedRelation%20serialization.svg)
[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21802 cc @pkuwm --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21802: [SPARK-23928][SQL] Add shuffle collection functio...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/21802 [SPARK-23928][SQL] Add shuffle collection function. ## What changes were proposed in this pull request? This PR adds a new collection function: shuffle. It generates a random permutation of the given array. ## How was this patch tested? New tests are added to CollectionExpressionsSuite.scala and DataFrameFunctionsSuite.scala. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-23928/shuffle Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21802.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21802 commit a3dbd93c0acbb2a3f3fb50574ae1e126c66c4d2d Author: pkuwm Date: 2018-07-17T23:18:03Z Add shuffle collection function. commit b4cbb5558088356fe6be1cda053c9f91fbe7c538 Author: Takuya UESHIN Date: 2018-07-18T12:17:59Z Refactor Shuffle function. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
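The PR description says the new function generates a random permutation of the given array. As a rough illustration of that idea only, here is a minimal Fisher-Yates sketch in plain Scala. The object name `ShuffleSketch` and the method `shuffled` are hypothetical and are not taken from the PR; the actual expression in Spark may be implemented differently.

```scala
import scala.util.Random

object ShuffleSketch {
  /** Returns a new array holding a random permutation of `input` (Fisher-Yates).
    * Hypothetical helper for illustration; not the code from the pull request. */
  def shuffled(input: Array[Int], rng: Random): Array[Int] = {
    val out = input.clone() // leave the caller's array untouched
    var i = out.length - 1
    while (i > 0) {
      val j = rng.nextInt(i + 1) // uniform index in 0..i
      val tmp = out(i)
      out(i) = out(j)
      out(j) = tmp
      i -= 1
    }
    out
  }
}
```

Calling `ShuffleSketch.shuffled(Array(1, 2, 3, 4), new Random())` returns the same four elements in some random order; passing a seeded `Random` makes the order repeatable in tests.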
[GitHub] spark issue #20856: [SPARK-23731][SQL] FileSourceScanExec throws NullPointer...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20856

Okay, I was investigating this and the fix itself looks quite inappropriate. This looks like what is happening. I can reproduce it in a slightly messy way:

```diff
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
index 8d06804ce1e..d25fc9a7ba9 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
@@ -37,7 +37,9 @@ class EquivalentExpressions {
       case _ => false
     }
-    override def hashCode: Int = e.semanticHash()
+    override def hashCode: Int = {
+      1
+    }
   }
```

```scala
spark.range(1).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").createOrReplaceTempView("foo")
spark.conf.set("spark.sql.codegen.wholeStage", false)
sql("SELECT (SELECT id FROM foo) == (SELECT id FROM foo)").collect()
```

This is what I see and think:

1. A scalar subquery is created (for instance `SELECT (SELECT id FROM foo)`).
2. Spark tries to extract common expressions (via `CodeGenerator.subexpressionElimination`) so that it can generate common code that can be reused.
3. During this, it seems to extract expressions that can be reused (via `EquivalentExpressions.addExprTree`) https://github.com/apache/spark/blob/b2deef64f604ddd9502a31105ed47cb63470ec85/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1102
4. During this, if the hash (`EquivalentExpressions.Expr.hashCode`) happens to be the same at `EquivalentExpressions.addExpr`, `EquivalentExpressions.Expr.equals` is called to identify the object among those with the same hash, which eventually calls `semanticEquals` in `ScalarSubquery` https://github.com/apache/spark/blob/087879a77acb37b790c36f8da67355b90719c2dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala#L54 https://github.com/apache/spark/blob/087879a77acb37b790c36f8da67355b90719c2dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala#L36
5. `ScalarSubquery`'s `semanticEquals` needs `SubqueryExec`'s `sameResult` https://github.com/apache/spark/blob/77a2fc5b521788b406bb32bcc3c637c1d7406e58/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala#L58
6. `SubqueryExec`'s `sameResult` requires a canonicalized plan, which calls `FileSourceScanExec`'s `doCanonicalize` https://github.com/apache/spark/blob/e008ad175256a3192fdcbd2c4793044d52f46d57/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L258
7. In `FileSourceScanExec`'s `doCanonicalize`, `FileSourceScanExec`'s `relation` is required, but it seems to be `@transient`, so it becomes `null`. https://github.com/apache/spark/blob/e76b0124fbe463def00b1dffcfd8fd47e04772fe/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L527 https://github.com/apache/spark/blob/e76b0124fbe463def00b1dffcfd8fd47e04772fe/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L160
8. NPE is thrown:

```
java.lang.NullPointerException
  at org.apache.spark.sql.execution.FileSourceScanExec.<init>(DataSourceScanExec.scala:169)
  at org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:526)
  at org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:159)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:211)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:210)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$3.apply(QueryPlan.scala:225)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$3.apply(QueryPlan.scala:225)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:225)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:211)
  at
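Step 4 of the walkthrough above relies on a general property of hash maps: `equals` is consulted only when two keys share a hash, which is why forcing `hashCode` to `1` makes the problem reproducible. The following is a small standalone sketch of that property using `scala.collection.mutable.HashMap`. The names `HashCollisionSketch`, `Key`, and `countEqualsCalls` are hypothetical stand-ins and are not Spark classes; `Key` only plays the role that the comment attributes to `EquivalentExpressions.Expr`.

```scala
import scala.collection.mutable

object HashCollisionSketch {
  // Counts how often Key.equals runs; global state is fine for a sketch.
  var equalsCalls = 0

  /** Hypothetical wrapper key. `forcedHash` mimics the patched hashCode that always returns 1. */
  final class Key(val name: String, forcedHash: Option[Int]) {
    override def hashCode: Int = forcedHash.getOrElse(name.hashCode)
    override def equals(other: Any): Boolean = {
      equalsCalls += 1 // in the scenario described above, this is where the expensive comparison would run
      other match {
        case k: Key => k.name == name
        case _      => false
      }
    }
  }

  /** Inserts two distinct keys and reports how many times equals was invoked. */
  def countEqualsCalls(forcedHash: Option[Int]): Int = {
    equalsCalls = 0
    val seen = mutable.HashMap.empty[Key, Int]
    seen.put(new Key("a", forcedHash), 1)
    seen.put(new Key("b", forcedHash), 2)
    equalsCalls
  }
}
```

`HashCollisionSketch.countEqualsCalls(Some(1))` reports at least one `equals` call, because both keys collide. With `None` the keys keep their natural hashes and the map can usually tell them apart without calling `equals`, which would explain why the failure appears only when hashes happen to match.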
[GitHub] spark issue #21596: [SPARK-24601] Bump Jackson version
Github user Fokko commented on the issue: https://github.com/apache/spark/pull/21596 Rebased onto master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21795 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93225/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21795 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21795 **[Test build #93225 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93225/testReport)** for PR 21795 at commit [`de5a232`](https://github.com/apache/spark/commit/de5a2323b5b46a4c073e3ff1dce6daea395dd1dd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21589 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21589 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93223/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21589 **[Test build #93223 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93223/testReport)** for PR 21589 at commit [`eebb310`](https://github.com/apache/spark/commit/eebb31099f078cc05bf0f6d6e32c94d4ee818f9e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 This does not seem reasonable: `failFunctionLookup` throws `NoSuchFunctionException`, but the function actually exists in the currently selected database. We should throw an exception that reports the initialization failure rather than `NoSuchFunctionException`.
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21469 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93224/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21469 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21469 **[Test build #93224 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93224/testReport)** for PR 21469 at commit [`5b203d4`](https://github.com/apache/spark/commit/5b203d4967eda3a09f7c8d83cf86e7ac6a427182). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21589

> Users are not expected to override it unless they want fine grained control over the value

This is actually one of the use cases in which a user needs to take control of, or tune, a query. The `defaultParallelism` is used in many places, for example https://github.com/apache/spark/blob/9549a2814951f9ba969955d78ac4bd2240f85989/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L594-L597 . If a user wants to tune the behavior of those methods, they have to change `defaultParallelism`, and then the factor `5` in `df.repartition(5 * sc.defaultParallelism)` has to be tuned accordingly. In this way we just force users to introduce absolutely unnecessary complexity and dependencies into their code. If I need the number of cores in my cluster, I would like a direct way to get it instead of hoping that some method implicitly returns this number.

> One thing to be kept in mind is that dynamic resource allocation will kick in after tasks are submitted ...

Let me show you another use case that I have observed. Our customers write code in notebooks and can attach their notebooks to different clusters. Usually code is developed and debugged on a small (staging) cluster. After that, the notebooks are re-attached to a production cluster, which may have a completely different size. Pretty often users just leave existing params/constants, such as those in `repartition()`, as is. This usually leads to underloading or overloading the cluster. Why can't they use `defaultParallelism` everywhere? Look at the use case above: tuning one part of a user's app requires changing factors in other, completely independent parts.
[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21733 Build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21733 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21733 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93222/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21733 **[Test build #93222 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93222/testReport)** for PR 21733 at commit [`4754469`](https://github.com/apache/spark/commit/4754469ebdb36da1d3ae1234a49472716a143119). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` sealed trait StreamingAggregationStateManager extends Serializable ` * ` abstract class StreamingAggregationStateManagerBaseImpl(` * ` class StreamingAggregationStateManagerImplV1(` * ` class StreamingAggregationStateManagerImplV2(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21733 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93221/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21733 **[Test build #93221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93221/testReport)** for PR 21733 at commit [`db9d9ce`](https://github.com/apache/spark/commit/db9d9ce6dc4912672ca0af14833b5d0c239f9562). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds the following public classes _(experimental)_: * ` sealed trait StreamingAggregationStateManager extends Serializable ` * ` abstract class StreamingAggregationStateManagerBaseImpl(` * ` class StreamingAggregationStateManagerImplV1(` * ` class StreamingAggregationStateManagerImplV2(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21752: [SPARK-24788][SQL] fixed UnresolvedException when toStri...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21752 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix l...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21801 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21764: [SPARK-24802] Optimization Rule Exclusion
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21764 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93220/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21764: [SPARK-24802] Optimization Rule Exclusion
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21764 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21764: [SPARK-24802] Optimization Rule Exclusion
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21764 **[Test build #93220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93220/testReport)** for PR 21764 at commit [`84f1a6b`](https://github.com/apache/spark/commit/84f1a6b5cba08df8684179e9d7195545be655e76). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21801 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21801 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21801 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93218/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21801 **[Test build #93218 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93218/testReport)** for PR 21801 at commit [`7f78d75`](https://github.com/apache/spark/commit/7f78d750411a4098527b2b332495f5dd4f20c63e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 +CC @markhamstra since you were looking at API stability. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589

I am not convinced by the rationale given in the JIRA for adding the new APIs. The examples given there can easily be modeled using `defaultParallelism` (to get the current state) and executor events (to get numCores and memory per executor). For example: `df.repartition(5 * sc.defaultParallelism)`

The other argument seems to be that users can override this value and set it to a static constant. Users are not expected to override it unless they want fine grained control over the value, and Spark is expected to honor it when specified.

One thing to be kept in mind is that dynamic resource allocation will kick in after tasks are submitted (when there are insufficient resources available), so trying to fine-tune this for an application using these APIs, in the presence of DRA, is not going to be effective anyway.

If there are corner cases where `defaultParallelism` is not accurate, we should fix those to reflect the current value.
[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203322643

--- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
    * to a new position (in the new data array).
    */
   def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
-    if (_size > _growThreshold) {
+    if (_occupied > _growThreshold) {
--- End diff --

For accuracy's sake: my example snippet above will fail much earlier, due to `OpenHashSet.MAX_CAPACITY`. Though that is probably not the point anyway :-)
[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203322056

--- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
    * to a new position (in the new data array).
    */
   def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
-    if (_size > _growThreshold) {
+    if (_occupied > _growThreshold) {
--- End diff --

There is no explicit entry here; it is simply an unoccupied slot in an array. The slot is free, and it can be used by some other (new) entry when insert is called. It should be easy to see how very bad behavior can arise while the actual size of the set stays very small: a series of adds and removes results in unending growth of the set. Something like this, for example, is enough to cause the set to blow up to 2B entries:

```
var i = 0
while (i < Int.MaxValue) {
  set.add(1)
  set.remove(1)
  assert (0 == set.size)
  i += 1
}
```
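The growth concern in the comment above can be modeled with a toy counter. This is only a sketch under a stated assumption: that the counter used for the grow check is incremented on every successful insert and never decremented on removal. `OccupiedGrowthSketch`, `ToySet`, and the 70% load factor are hypothetical and are not Spark's `OpenHashSet`.

```scala
import scala.collection.mutable

object OccupiedGrowthSketch {
  /** Toy set that decides when to grow from `occupied`, not from the live size.
    * Assumption (not confirmed by the PR): `occupied` never shrinks on remove. */
  final class ToySet(initialCapacity: Int) {
    private val live = mutable.HashSet.empty[Int]
    var capacity: Int = initialCapacity
    var occupied: Int = 0

    def size: Int = live.size

    def add(k: Int): Unit = {
      if (live.add(k)) { // true only when k was not already present
        occupied += 1
        // Grow check driven by `occupied`; 7/10 is an arbitrary load factor for the sketch.
        if (occupied > capacity * 7 / 10) capacity *= 2
      }
    }

    def remove(k: Int): Unit = {
      live.remove(k) // the live size shrinks, but `occupied` stays as it was
    }
  }

  /** Runs `rounds` add/remove pairs on one key and returns (size, capacity). */
  def churn(rounds: Int): (Int, Int) = {
    val set = new ToySet(8)
    var i = 0
    while (i < rounds) {
      set.add(1)
      set.remove(1)
      i += 1
    }
    (set.size, set.capacity)
  }
}
```

Under that assumption, `OccupiedGrowthSketch.churn(100)` ends with a live size of 0 while the capacity has grown well beyond its initial 8, which is the pattern the review comment warns about.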
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21589

> I am not seeing the utility of these two methods.

@mridulm I describe the utility of the methods in the ticket: https://issues.apache.org/jira/browse/SPARK-24591

> defaultParallelism already captures the current number of cores.

The `defaultParallelism` can be changed by users, and quite often it does not reflect the number of cores.

> For monitoring usecases, existing events fired via listener can be used to keep track of current executor population (if that is the intended usecase).

I believe the basic cluster properties should be easily discoverable via APIs, and monitoring is just one of the use cases.
[GitHub] spark pull request #21789: [SPARK-24829][SQL]In Spark Thrift Server, CAST AS...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21789#discussion_r203320896 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala --- @@ -766,6 +774,14 @@ class HiveThriftHttpServerSuite extends HiveThriftJdbcTest { assert(resultSet.getString(2) === HiveUtils.builtinHiveVersion) } } + + test("Checks cast as float") { --- End diff -- then probably better to add it into HiveThriftJdbcTest? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21789: [SPARK-24829][SQL]In Spark Thrift Server, CAST AS...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21789#discussion_r203321155

--- Diff: sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/Column.java ---
@@ -349,7 +349,7 @@ public void addValue(Type type, Object field) {
         break;
       case FLOAT_TYPE:
         nulls.set(size, field == null);
-        doubleVars()[size] = field == null ? 0 : ((Float)field).doubleValue();
+        doubleVars()[size] = field == null ? 0 : new Double(field.toString());
--- End diff --

If the problem is the precision, isn't it enough to cast it to Double instead of creating a double out of a string?
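The precision question above can be explored with a small standalone sketch. It compares widening a float directly to a double with going through the float's string form, which are the two approaches visible in the diff. The object and method names (`FloatWideningSketch`, `widened`, `viaString`) are hypothetical, and the sketch is written in Scala rather than the Java of `Column.java`; it is meant to show the numeric behavior, not to settle which fix the PR should adopt.

```scala
object FloatWideningSketch {
  /** Direct widening: keeps the float's exact binary value, as `doubleValue()` does. */
  def widened(f: Float): Double = f.toDouble

  /** Round trip through the decimal string, mirroring `new Double(field.toString())`. */
  def viaString(f: Float): Double = f.toString.toDouble

  def main(args: Array[String]): Unit = {
    val f = 0.1f
    println(s"widened:   ${widened(f)}")
    println(s"viaString: ${viaString(f)}")
  }
}
```

For `0.1f`, direct widening yields a double that is not equal to the literal `0.1`, because the float's binary approximation is carried over unchanged, whereas the string round trip yields exactly `0.1`. Any conversion that only widens the float behaves like `widened` here.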
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21758 I had left a few comments on SPARK-24375, @jiangxb1987 ... unfortunately the JIRAs have moved around a bit. If this is the active PR for introducing the feature, it would be great to get clarity on them.
[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r203319952

--- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo) extends
  * Periodic updates from executors.
  * @param execId executor id
  * @param accumUpdates sequence of (taskId, stageId, stageAttemptId, accumUpdates)
+ * @param executorUpdates executor level metrics updates
  */
 @DeveloperApi
 case class SparkListenerExecutorMetricsUpdate(
     execId: String,
-    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])])
+    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])],
+    executorUpdates: Option[Array[Long]] = None)
+  extends SparkListenerEvent
+
+/**
+ * Peak metric values for the executor for the stage, written to the history log at stage
+ * completion.
+ * @param execId executor id
+ * @param stageId stage id
+ * @param stageAttemptId stage attempt
+ * @param executorMetrics executor level metrics, indexed by MetricGetter.values
+ */
+@DeveloperApi
+case class SparkListenerStageExecutorMetrics(
+    execId: String,
+    stageId: Int,
+    stageAttemptId: Int,
+    executorMetrics: Array[Long])
--- End diff --

+1 on enums, @squito! The only concern would be evolving the enums in a later release: changing an enum could result in source incompatibility.
[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203319710

--- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
    * to a new position (in the new data array).
    */
   def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
-    if (_size > _growThreshold) {
+    if (_occupied > _growThreshold) {
--- End diff --

When `remove` is called, `_size` is decremented, but the entry is not released. This is the motivation for introducing `_occupied`. I will try another implementation that does not use `remove`, although it may introduce some overhead.
[GitHub] spark issue #21729: [SPARK-24755][Core] Executor loss can cause task to not ...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21729 Looks good to me, thanks for fixing this @hthuynh2 ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21774 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21652 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/1091/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21774 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1092/ Test PASSed.
[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21652 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1091/ Test PASSed.
[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21652 Merged build finished. Test PASSed.
[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21652 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93228/ Test PASSed.