[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...
Github user httfighter commented on the issue: https://github.com/apache/spark/pull/22683 @srowen OK. Thank you very much for your advice. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22683 I believe most of the Spark code really uses units like KiB, multiples of 1024, as you see here. However it's often referred to as KB, incorrectly, and so on. It seems like the UI is the only place that uses 1000 and KB, which is consistent in itself, but should really be 1024/KiB. that change to the UI seems fine. I think there are a few comments and docs in the code that really also should change to say KiB and so on, and we can make those changes here, but it's OK to focus on making the UI based on KiB. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5314/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/23128 @gatorsmile Thanks Xiao! Conflicts resolve done, as Reynold comments in https://github.com/apache/spark/pull/23105#discussion_r235950427, when the ShuffleMetricsReporter move to ShuffleReadMetricsReporter in write pr, it will conflict again here, I'll keep tracking the relevant pr. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23128 **[Test build #99223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99223/testReport)** for PR 23128 at commit [`cb46bfe`](https://github.com/apache/spark/commit/cb46bfeb930b71d560340393e95097ee66303862). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/23128#discussion_r236032855 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala --- @@ -0,0 +1,60 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.metric + +import org.apache.spark.executor.TempShuffleReadMetrics + +/** + * A shuffle metrics reporter for SQL exchange operators. + * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext. + * @param metrics All metrics in current SparkPlan. + */ +class SQLShuffleMetricsReporter( + tempMetrics: TempShuffleReadMetrics, + metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics { + + override def incRemoteBlocksFetched(v: Long): Unit = { +metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v) --- End diff -- Sorry for the less consideration on per-row operation here, I should be more careful. Fix done in cb46bfe. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...
Github user httfighter commented on the issue: https://github.com/apache/spark/pull/22683 @srowen @ajbozarth I am not sure about some things, can you give me some advice? In the process of modification, I have a question. In Spark, whether M and MB represent MiB. Spark does not use the unit of kilobytes to convert between numbers. private static final ImmutableMap byteSuffixes = ImmutableMap.builder() .put("b", ByteUnit.BYTE) .put("k", ByteUnit.KiB) .put("kb", ByteUnit.KiB) .put("m", ByteUnit.MiB) .put("mb", ByteUnit.MiB) .put("g", ByteUnit.GiB) .put("gb", ByteUnit.GiB) .put("t", ByteUnit.TiB) .put("tb", ByteUnit.TiB) .put("p", ByteUnit.PiB) .put("pb", ByteUnit.PiB) .build(); spark.kryoserializer.buffer 64k Initial size of Kryo's serialization buffer, in KiB unless otherwise specified. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed. If this is the case, can we only guarantee the uniform use of 1024 for digital conversion, no changes to the unit displays in log, UI, comments and configured messages. Otherwise, we need to modify all the UI, log, comments, and configuration information to ensure consistency, there is no guarantee that all can be modified, and there will be no problems after the modification. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99222/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21732 **[Test build #99222 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99222/testReport)** for PR 21732 at commit [`62fdb17`](https://github.com/apache/spark/commit/62fdb17b4f72d935f25041c801708e3939e16074). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23052 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99221/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23052 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23052 **[Test build #99221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99221/testReport)** for PR 23052 at commit [`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` class RuleSummary(` * `class QueryPlanningTracker ` * `class QueryExecution(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99220/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23128 **[Test build #99220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99220/testReport)** for PR 23128 at commit [`1b556ec`](https://github.com/apache/spark/commit/1b556ecf869685af8f34d448ac3f08102a758124). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class SQLShuffleMetricsReporter(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5313/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21732 **[Test build #99222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99222/testReport)** for PR 21732 at commit [`62fdb17`](https://github.com/apache/spark/commit/62fdb17b4f72d935f25041c801708e3939e16074). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23128#discussion_r236025838 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala --- @@ -0,0 +1,60 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.metric + +import org.apache.spark.executor.TempShuffleReadMetrics + +/** + * A shuffle metrics reporter for SQL exchange operators. + * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext. + * @param metrics All metrics in current SparkPlan. + */ +class SQLShuffleMetricsReporter( + tempMetrics: TempShuffleReadMetrics, + metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics { + + override def incRemoteBlocksFetched(v: Long): Unit = { +metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v) --- End diff -- (Iâm not referring to just this function, but in general, especially for per-row). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23128#discussion_r236025817 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala --- @@ -0,0 +1,60 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.metric + +import org.apache.spark.executor.TempShuffleReadMetrics + +/** + * A shuffle metrics reporter for SQL exchange operators. + * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext. + * @param metrics All metrics in current SparkPlan. + */ +class SQLShuffleMetricsReporter( + tempMetrics: TempShuffleReadMetrics, + metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics { + + override def incRemoteBlocksFetched(v: Long): Unit = { +metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v) --- End diff -- Doing a hashmap lookup here could introduce serious performance regressions. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23052 **[Test build #99221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99221/testReport)** for PR 23052 at commit [`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23105 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/23128 @xuanyuanking Could you address the conflicts? Thanks for you fast work! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23125 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99216/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23125 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23125 **[Test build #99216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99216/testReport)** for PR 23125 at commit [`a2d7c7a`](https://github.com/apache/spark/commit/a2d7c7a39c265ce41882f3bef1363af4c3962cdb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26139][SQL] Support passing shuffle metrics to ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5312/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26139][SQL] Support passing shuffle metrics to ex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23128 **[Test build #99220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99220/testReport)** for PR 23128 at commit [`1b556ec`](https://github.com/apache/spark/commit/1b556ecf869685af8f34d448ac3f08102a758124). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26139][SQL] Support passing shuffle metrics to ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23105#discussion_r236020103 --- Diff: core/src/main/scala/org/apache/spark/shuffle/metrics.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.shuffle + +/** + * An interface for reporting shuffle read metrics, for each shuffle. This interface assumes + * all the methods are called on a single-threaded, i.e. concrete implementations would not need + * to synchronize. + * + * All methods have additional Spark visibility modifier to allow public, concrete implementations + * that still have these methods marked as private[spark]. + */ +private[spark] trait ShuffleReadMetricsReporter { --- End diff -- @xuanyuanking just submitted a PR on how to use it :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26139][SQL] Support passing shuffle metric...
GitHub user xuanyuanking opened a pull request: https://github.com/apache/spark/pull/23128 [SPARK-26139][SQL] Support passing shuffle metrics to exchange operator ## What changes were proposed in this pull request? Implement `SQLShuffleMetricsReporter` on the sql side as the customized ShuffleMetricsReporter, which extended the `TempShuffleReadMetrics` and update SQLMetrics, in this way shuffle metrics can be reported in the SQL UI. ## How was this patch tested? Add UT in SQLMetricsSuite. Manual test locally, before: ![image](https://user-images.githubusercontent.com/4833765/48960517-30f97880-efa8-11e8-982c-92d05938fd1d.png) after: ![image](https://user-images.githubusercontent.com/4833765/48960587-b54bfb80-efa8-11e8-8e95-7a3c8c74cc5c.png) You can merge this pull request into a Git repository by running: $ git pull https://github.com/xuanyuanking/spark SPARK-26142 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23128.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23128 commit fba590f040c8fd7ce75df3f733f246db18e79ee6 Author: Reynold Xin Date: 2018-11-21T14:56:23Z [SPARK-26140] Pull TempShuffleReadMetrics creation out of shuffle reader commit 35b48b21028110742aed7f7f5b5d62109c2f0adf Author: Reynold Xin Date: 2018-11-21T15:02:04Z less movement of code commit 1b556ecf869685af8f34d448ac3f08102a758124 Author: liyuanjian Date: 2018-11-23T21:02:25Z [SPARK-26142] Implement shuffle read metric in SQL --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23103 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23103 **[Test build #99219 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99219/testReport)** for PR 23103 at commit [`9196623`](https://github.com/apache/spark/commit/9196623d2b1fa29522dcc400e27edccf8fea946a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23103 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99219/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/23127 Looks good. One more higher level question that can also be addressed in a follow-up. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23127: [SPARK-26159] Codegen for LocalTableScanExec and ...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/23127#discussion_r236017398 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -406,14 +415,39 @@ trait BlockingOperatorWithCodegen extends CodegenSupport { override def limitNotReachedChecks: Seq[String] = Nil } +/** + * Leaf codegen node reading from a single RDD. + */ +trait InputRDDCodegen extends CodegenSupport { --- End diff -- Should we reconcile this with the code gen for `RowDataSourceScanExec`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99215/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23080 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23080 **[Test build #99215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99215/testReport)** for PR 23080 at commit [`a4c4b67`](https://github.com/apache/spark/commit/a4c4b6710cb67bddd9badbb53aa07b0d93242bc5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23127 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23127 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99218/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23127 **[Test build #99218 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99218/testReport)** for PR 23127 at commit [`23c2d91`](https://github.com/apache/spark/commit/23c2d9111f1cff9059746bb7b48bb8ef7ad7027b). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait InputRDDCodegen extends CodegenSupport ` * `case class InputAdapter(child: SparkPlan) extends UnaryExecNode with InputRDDCodegen ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23103 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23103 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5311/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23103 **[Test build #99219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99219/testReport)** for PR 23103 at commit [`9196623`](https://github.com/apache/spark/commit/9196623d2b1fa29522dcc400e27edccf8fea946a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23022: [SPARK-26038] Decimal toScalaBigInt/toJavaBigInte...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23022 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22512 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22512 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99217/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22512 **[Test build #99217 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99217/testReport)** for PR 22512 at commit [`5227e42`](https://github.com/apache/spark/commit/5227e422cda4d110c6b5a950ebdc49d6e25914f1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and Existin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23127 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and Existin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23127 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5310/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23022: [SPARK-26038] Decimal toScalaBigInt/toJavaBigInteger for...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/23022 Merging to master. Thank! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and Existin...
Github user juliuszsompolski commented on the issue: https://github.com/apache/spark/pull/23127 cc @hvanhovell @rednaxelafx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and Existin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23127 **[Test build #99218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99218/testReport)** for PR 23127 at commit [`23c2d91`](https://github.com/apache/spark/commit/23c2d9111f1cff9059746bb7b48bb8ef7ad7027b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23127: [SPARK-26159] Codegen for LocalTableScanExec and ...
GitHub user juliuszsompolski opened a pull request: https://github.com/apache/spark/pull/23127 [SPARK-26159] Codegen for LocalTableScanExec and ExistingRDDExec ## What changes were proposed in this pull request? Implement codegen for LocalTableScanExec and ExistingRDDExec. Refactor to share code with InputAdapter. ## How was this patch tested? Covered and used in existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/juliuszsompolski/apache-spark SPARK-26159 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23127.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23127 commit 23c2d9111f1cff9059746bb7b48bb8ef7ad7027b Author: Juliusz Sompolski Date: 2018-11-13T09:19:09Z localtablescanexec codegen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23022: [SPARK-26038] Decimal toScalaBigInt/toJavaBigInteger for...
Github user juliuszsompolski commented on the issue: https://github.com/apache/spark/pull/23022 ping @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22466: [SPARK-25464][SQL] Create Database to the location,only ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22466 Regarding the potentially high cost of file listing, `CREATE DATABASE` is not a frequent operation. The cost is high only if the target directory is non-empty with many many files. We are blocking users from creating such a database. Thus, the cost is not a big deal I think. We need to list this behavior change in the SQL migration guide. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22466#discussion_r236005686 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -2370,4 +2370,17 @@ class HiveDDLSuite )) } } + + test("SPARK-25464 create a database with a non empty location") { --- End diff -- Do we have a test case to check "create a database with an empty location"? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23119: [SPARK-25954][SS][FOLLOWUP][test-maven] Add Zookeeper 3....
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/23119 Thanks, @gaborgsomogyi and @HyukjinKwon . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/23124#discussion_r235999222 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, DataType, MapType} + +/** + * A builder of [[ArrayBasedMapData]], which fails if a null map key is detected, and removes + * duplicated map keys w.r.t. the last wins policy. + */ +class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable { + assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map cannot be/contain map") + + private lazy val keyToIndex = keyType match { +case _: AtomicType | _: CalendarIntervalType => mutable.HashMap.empty[Any, Int] +case _ => + // for complex types, use interpreted ordering to be able to compare unsafe data with safe + // data, e.g. UnsafeRow vs GenericInternalRow. + mutable.TreeMap.empty[Any, Int](TypeUtils.getInterpretedOrdering(keyType)) --- End diff -- After merging this PR, I'll check again and file a JIRA for that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/23124#discussion_r235999040 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, DataType, MapType} + +/** + * A builder of [[ArrayBasedMapData]], which fails if a null map key is detected, and removes + * duplicated map keys w.r.t. the last wins policy. + */ +class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable { + assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map cannot be/contain map") + + private lazy val keyToIndex = keyType match { +case _: AtomicType | _: CalendarIntervalType => mutable.HashMap.empty[Any, Int] +case _ => + // for complex types, use interpreted ordering to be able to compare unsafe data with safe + // data, e.g. UnsafeRow vs GenericInternalRow. + mutable.TreeMap.empty[Any, Int](TypeUtils.getInterpretedOrdering(keyType)) --- End diff -- Sure. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22512 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22512 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5309/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23125 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23125 **[Test build #99216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99216/testReport)** for PR 23125 at commit [`a2d7c7a`](https://github.com/apache/spark/commit/a2d7c7a39c265ce41882f3bef1363af4c3962cdb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22512 **[Test build #99217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99217/testReport)** for PR 22512 at commit [`5227e42`](https://github.com/apache/spark/commit/5227e422cda4d110c6b5a950ebdc49d6e25914f1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23125 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5308/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23125: [SPARK-26156][WebUI] Revise summary section of st...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23125#discussion_r235995724 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala --- @@ -79,6 +79,9 @@ private[ui] class StagePage(parent: StagesTab, store: AppStatusStore) extends We localityNamesAndCounts.sorted.mkString("; ") } + private def jobURL(request: HttpServletRequest, jobId: Int): String = --- End diff -- Make sense. I have updated the code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22512 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23022: [SPARK-26038] Decimal toScalaBigInt/toJavaBigInteger for...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/23022 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22824: [SPARK-25834] [Structured Streaming]Update Mode should n...
Github user koeninger commented on the issue: https://github.com/apache/spark/pull/22824 @tdas or @jose-torres any opinion on whether it's worth refactoring these checks as suggested by @arunmahadevan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23084: [SPARK-26117][CORE][SQL]use SparkOutOfMemoryError instea...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/23084 @cloud-fan,thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18304: [SPARK-21098] Set lineseparator csv multiline and csv wr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18304 @danielvdende now the newlines are automatically derected. should be not an issue anymore. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23080 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23080 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23080 **[Test build #99215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99215/testReport)** for PR 23080 at commit [`a4c4b67`](https://github.com/apache/spark/commit/a4c4b6710cb67bddd9badbb53aa07b0d93242bc5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5307/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23080 @MaxGekk, thanks for working on this one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23080 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23080 Last changes were only doc changes. Let me get this in. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23080 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 Hey all, I will merge this in few days if there's no more comments. It's going to speed up the tests roughly 12 ~ 15 mins. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23102: [SPARK-26137][CORE] Use Java system property "fil...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/23102#discussion_r235975268 --- Diff: core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala --- @@ -61,11 +62,12 @@ private[deploy] object DependencyUtils extends Logging { hadoopConf: Configuration, secMgr: SecurityManager): String = { val targetDir = Utils.createTempDir() +val fileSeparator = Pattern.quote(System.getProperty("file.separator")) Option(jars) .map { resolveGlobPaths(_, hadoopConf) .split(",") - .filterNot(_.contains(userJar.split("/").last)) + .filterNot(_.contains(userJar.split(fileSeparator).last)) --- End diff -- Beyond the original purpose of this PR, is it better to move `userJar.split(fileSeparator).last` before line 66? This is because `userJar` is not changed in `map { ... }`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23102: [SPARK-26137][CORE] Use Java system property "file.separ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/23102 @MaxGekk This PR may change a separator for `userJar` that has `\` on Windows. `resolveGlobPaths` is not applied to `userJar`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Plug do some more test on real data after add this commit we use data from http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals and data from http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems to do some more accuracy test, the accuracy result is OK --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 After add this commit We get the result for RowMatrix computeCovariance function: For the input data 1.0,2.0,3.0,4.0,5.0 2.0,3.0,1.0,2.0,6.0 RowMatrix function computeCovariance result: 2.5 1.75 1.75 3.7 For the input data generated by data1 = np.random.normal(loc=10, scale=0.09, size=1000) data2 = np.random.normal(loc=20, scale=0.02,size=1000) RowMatrix function computeCovariance result: 8.109505250896888E-11 -5.003160564607658E-15 -5.003160564607658E-15 4.08276584628234E-12 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23126 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23126 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23126 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Compare Spark computeCovariance function in RowMatrix for DenseVector and Numpy's function cov, Find two problem, below is the result: 1)The Spark function computeCovariance in RowMatrix is not accuracy input data 1.0,2.0,3.0,4.0,5.0 2.0,3.0,1.0,2.0,6.0 Numpy function cov result: [[2.5 1.75] [ 1.75 3.7 ]] RowMatrix function computeCovariance result: 2.5 1.75 1.75 3.701 2)For some input case, the result is not good generate input data by below logic data1 = np.random.normal(loc=10, scale=0.09, size=1000) data2 = np.random.normal(loc=20, scale=0.02,size=1000) Numpy function cov result: [[ 8.10536442e-11 -4.35439574e-15] [ -4.35439574e-15 3.99928264e-12]] RowMatrix function computeCovariance result: -0.0027484893798828125 0.001491546630859375 0.0014915466308593758.087158203125E-4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
GitHub user KyleLi1985 opened a pull request: https://github.com/apache/spark/pull/23126 [SPARK-26158] [MLLIB] fix covariance accuracy problem for DenseVector ## What changes were proposed in this pull request? Enhance accuracy of the covariance logic in RowMatrix for function computeCovariance ## How was this patch tested? Unit test Accuracy test You can merge this pull request into a Git repository by running: $ git pull https://github.com/KyleLi1985/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23126.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23126 commit a8bfdfffbe82a77943adb4bf84ca939d786afc8a Author: æ亮 Date: 2018-11-23T14:35:27Z fix covariance accuracy problem for DenseVector --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23083: [SPARK-26114][CORE] ExternalSorter Leak
Github user advancedxy commented on a diff in the pull request: https://github.com/apache/spark/pull/23083#discussion_r235955118 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -727,9 +727,10 @@ private[spark] class ExternalSorter[K, V, C]( spills.clear() forceSpillFiles.foreach(s => s.file.delete()) forceSpillFiles.clear() -if (map != null || buffer != null) { +if (map != null || buffer != null || readingIterator != null) { map = null // So that the memory can be garbage-collected buffer = null // So that the memory can be garbage-collected + readingIterator = null // So that the memory can be garbage-collected --- End diff -- Hi @szhem , I discussed with wenchen offline. I think this is the key point. After nulling out `readingIterator`, `ExternalSorter` should released all the memories it occupied. Yes, `ExternalSorter` is leaked in `TaskCompletionListener`, but it would already be stopped in `CompletionIterator` in happy path. The stopped sorter wouldn't occupy too much memory. The `readingIterator` is occupying memory because it may reference `map/buffer.partitionedDestructiveSortedIterator`, which itself references `map/buffer`. So only nulling out map or buffer is not enough. Can you try with this modification only and see whether OOM still occurs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/23124#discussion_r235952965 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, DataType, MapType} + +/** + * A builder of [[ArrayBasedMapData]], which fails if a null map key is detected, and removes + * duplicated map keys w.r.t. the last wins policy. + */ +class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable { + assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map cannot be/contain map") + + private lazy val keyToIndex = keyType match { +case _: AtomicType | _: CalendarIntervalType => mutable.HashMap.empty[Any, Int] +case _ => + // for complex types, use interpreted ordering to be able to compare unsafe data with safe + // data, e.g. UnsafeRow vs GenericInternalRow. + mutable.TreeMap.empty[Any, Int](TypeUtils.getInterpretedOrdering(keyType)) + } + + // TODO: specialize it + private lazy val keys = mutable.ArrayBuffer.empty[Any] + private lazy val values = mutable.ArrayBuffer.empty[Any] + + private lazy val keyGetter = InternalRow.getAccessor(keyType) + private lazy val valueGetter = InternalRow.getAccessor(valueType) + + def reset(): Unit = { +keyToIndex.clear() +keys.clear() +values.clear() + } + + def put(key: Any, value: Any): Unit = { +if (key == null) { + throw new RuntimeException("Cannot use null as map key.") +} + +val maybeExistingIdx = keyToIndex.get(key) +if (maybeExistingIdx.isDefined) { + // Overwrite the previous value, as the policy is last wins. + values(maybeExistingIdx.get) = value +} else { + keyToIndex.put(key, values.length) + keys.append(key) + values.append(value) +} + } + + // write a 2-field row, the first field is key and the second field is value. + def put(entry: InternalRow): Unit = { +if (entry.isNullAt(0)) { + throw new RuntimeException("Cannot use null as map key.") +} +put(keyGetter(entry, 0), valueGetter(entry, 1)) + } + + def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = { +if (keyArray.length != valueArray.length) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.length) { + put(keyArray(i), valueArray(i)) + i += 1 +} + } + + def putAll(keyArray: ArrayData, valueArray: ArrayData): Unit = { +if (keyArray.numElements() != valueArray.numElements()) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.numElements()) { + put(keyGetter(keyArray, i), valueGetter(valueArray, i)) + i += 1 +} + } + + def build(): ArrayBasedMapData = { +new ArrayBasedMapData(new GenericArrayData(keys.toArray), new GenericArrayData(values.toArray)) --- End diff -- Is it better to call reset() after calling new ArrayBasedMapData to reduce memory consumption in Java heap? At caller side, ArrayBasedMapBuilder is not released. Therefore, until reset() will be called next time, each ArrayBasedMapBuilder keeps unused data in keys, values, and keyToIndex. They consumes Java heap unexpectedly. --- - To unsubscribe, e-mail: reviews-unsubscr...@sp
[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/23124#discussion_r235950666 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, DataType, MapType} + +/** + * A builder of [[ArrayBasedMapData]], which fails if a null map key is detected, and removes + * duplicated map keys w.r.t. the last wins policy. + */ +class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable { + assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map cannot be/contain map") + + private lazy val keyToIndex = keyType match { +case _: AtomicType | _: CalendarIntervalType => mutable.HashMap.empty[Any, Int] +case _ => + // for complex types, use interpreted ordering to be able to compare unsafe data with safe + // data, e.g. UnsafeRow vs GenericInternalRow. + mutable.TreeMap.empty[Any, Int](TypeUtils.getInterpretedOrdering(keyType)) + } + + // TODO: specialize it + private lazy val keys = mutable.ArrayBuffer.empty[Any] + private lazy val values = mutable.ArrayBuffer.empty[Any] + + private lazy val keyGetter = InternalRow.getAccessor(keyType) + private lazy val valueGetter = InternalRow.getAccessor(valueType) + + def reset(): Unit = { +keyToIndex.clear() +keys.clear() +values.clear() + } + + def put(key: Any, value: Any): Unit = { +if (key == null) { + throw new RuntimeException("Cannot use null as map key.") +} + +val maybeExistingIdx = keyToIndex.get(key) +if (maybeExistingIdx.isDefined) { + // Overwrite the previous value, as the policy is last wins. + values(maybeExistingIdx.get) = value +} else { + keyToIndex.put(key, values.length) + keys.append(key) + values.append(value) +} + } + + // write a 2-field row, the first field is key and the second field is value. + def put(entry: InternalRow): Unit = { +if (entry.isNullAt(0)) { + throw new RuntimeException("Cannot use null as map key.") +} +put(keyGetter(entry, 0), valueGetter(entry, 1)) + } + + def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = { +if (keyArray.length != valueArray.length) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.length) { + put(keyArray(i), valueArray(i)) + i += 1 +} + } + + def putAll(keyArray: ArrayData, valueArray: ArrayData): Unit = { +if (keyArray.numElements() != valueArray.numElements()) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.numElements()) { + put(keyGetter(keyArray, i), valueGetter(valueArray, i)) + i += 1 +} + } + + def build(): ArrayBasedMapData = { +new ArrayBasedMapData(new GenericArrayData(keys.toArray), new GenericArrayData(values.toArray)) + } + + def from(keyArray: ArrayData, valueArray: ArrayData): ArrayBasedMapData = { +assert(keyToIndex.isEmpty, "'from' can only be called with a fresh GenericMapBuilder.") +putAll(keyArray, valueArray) +if (keyToIndex.size == keyArray.numElements()) { + // If there is no duplicated map keys, creates the MapData with the input key and value array, + // as they might already in unsafe format and are more efficient. + new ArrayBasedM
[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23105#discussion_r235950427 --- Diff: core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala --- @@ -48,7 +48,8 @@ private[spark] trait ShuffleManager { handle: ShuffleHandle, startPartition: Int, endPartition: Int, - context: TaskContext): ShuffleReader[K, C] + context: TaskContext, + metrics: ShuffleMetricsReporter): ShuffleReader[K, C] --- End diff -- It is a read metrics here actually. In the write PR this is renamed ShuffleReadMetricsReporter. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/23124#discussion_r235950148 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, DataType, MapType} + +/** + * A builder of [[ArrayBasedMapData]], which fails if a null map key is detected, and removes + * duplicated map keys w.r.t. the last wins policy. + */ +class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable { + assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map cannot be/contain map") + + private lazy val keyToIndex = keyType match { +case _: AtomicType | _: CalendarIntervalType => mutable.HashMap.empty[Any, Int] +case _ => + // for complex types, use interpreted ordering to be able to compare unsafe data with safe + // data, e.g. UnsafeRow vs GenericInternalRow. + mutable.TreeMap.empty[Any, Int](TypeUtils.getInterpretedOrdering(keyType)) + } + + // TODO: specialize it + private lazy val keys = mutable.ArrayBuffer.empty[Any] + private lazy val values = mutable.ArrayBuffer.empty[Any] + + private lazy val keyGetter = InternalRow.getAccessor(keyType) + private lazy val valueGetter = InternalRow.getAccessor(valueType) + + def reset(): Unit = { +keyToIndex.clear() +keys.clear() +values.clear() + } + + def put(key: Any, value: Any): Unit = { +if (key == null) { + throw new RuntimeException("Cannot use null as map key.") +} + +val maybeExistingIdx = keyToIndex.get(key) +if (maybeExistingIdx.isDefined) { + // Overwrite the previous value, as the policy is last wins. + values(maybeExistingIdx.get) = value +} else { + keyToIndex.put(key, values.length) + keys.append(key) + values.append(value) +} + } + + // write a 2-field row, the first field is key and the second field is value. + def put(entry: InternalRow): Unit = { +if (entry.isNullAt(0)) { + throw new RuntimeException("Cannot use null as map key.") +} +put(keyGetter(entry, 0), valueGetter(entry, 1)) + } + + def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = { +if (keyArray.length != valueArray.length) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.length) { + put(keyArray(i), valueArray(i)) + i += 1 +} + } + + def putAll(keyArray: ArrayData, valueArray: ArrayData): Unit = { +if (keyArray.numElements() != valueArray.numElements()) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.numElements()) { + put(keyGetter(keyArray, i), valueGetter(valueArray, i)) + i += 1 +} + } + + def build(): ArrayBasedMapData = { +new ArrayBasedMapData(new GenericArrayData(keys.toArray), new GenericArrayData(values.toArray)) + } --- End diff -- Is it better to call `reset()` after calling `new ArrayBasedMapData` to reduce memory consumption? At caller side, `ArrayBasedMapBuilder` is not released. Therefore, until reset() will be called next time, each `ArrayBasedMapBuilder` keeps unused data in `keys`, `values`, and `keyToIndex`. They consumes Java heap unexpectedly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spa
[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/23124#discussion_r235947044 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, DataType, MapType} + +/** + * A builder of [[ArrayBasedMapData]], which fails if a null map key is detected, and removes + * duplicated map keys w.r.t. the last wins policy. + */ +class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable { + assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map cannot be/contain map") + + private lazy val keyToIndex = keyType match { +case _: AtomicType | _: CalendarIntervalType => mutable.HashMap.empty[Any, Int] +case _ => + // for complex types, use interpreted ordering to be able to compare unsafe data with safe + // data, e.g. UnsafeRow vs GenericInternalRow. + mutable.TreeMap.empty[Any, Int](TypeUtils.getInterpretedOrdering(keyType)) + } + + // TODO: specialize it + private lazy val keys = mutable.ArrayBuffer.empty[Any] + private lazy val values = mutable.ArrayBuffer.empty[Any] + + private lazy val keyGetter = InternalRow.getAccessor(keyType) + private lazy val valueGetter = InternalRow.getAccessor(valueType) + + def reset(): Unit = { +keyToIndex.clear() +keys.clear() +values.clear() + } + + def put(key: Any, value: Any): Unit = { +if (key == null) { + throw new RuntimeException("Cannot use null as map key.") +} + +val maybeExistingIdx = keyToIndex.get(key) +if (maybeExistingIdx.isDefined) { + // Overwrite the previous value, as the policy is last wins. + values(maybeExistingIdx.get) = value +} else { + keyToIndex.put(key, values.length) + keys.append(key) + values.append(value) +} + } + + // write a 2-field row, the first field is key and the second field is value. + def put(entry: InternalRow): Unit = { +if (entry.isNullAt(0)) { + throw new RuntimeException("Cannot use null as map key.") +} +put(keyGetter(entry, 0), valueGetter(entry, 1)) + } + + def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = { +if (keyArray.length != valueArray.length) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.length) { + put(keyArray(i), valueArray(i)) + i += 1 +} + } + + def putAll(keyArray: ArrayData, valueArray: ArrayData): Unit = { +if (keyArray.numElements() != valueArray.numElements()) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.numElements()) { + put(keyGetter(keyArray, i), valueGetter(valueArray, i)) + i += 1 +} + } + + def build(): ArrayBasedMapData = { +new ArrayBasedMapData(new GenericArrayData(keys.toArray), new GenericArrayData(values.toArray)) + } + + def from(keyArray: ArrayData, valueArray: ArrayData): ArrayBasedMapData = { +assert(keyToIndex.isEmpty, "'from' can only be called with a fresh GenericMapBuilder.") +putAll(keyArray, valueArray) --- End diff -- Can we call `new ArrayBasedMapData(keyArray, valueArray)` without calling `putAll(keyArray, valueArray)` if `keyArray.asInstanceOf[ArrayData].containsNull` is false? ---
[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23125 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org