date:20181123

[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-11-23 Thread httfighter

Github user httfighter commented on the issue:

https://github.com/apache/spark/pull/22683
  
@srowen OK. Thank you very much for your advice.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-11-23 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22683
  
I believe most of the Spark code really uses units like KiB, multiples of 
1024, as you see here. However it's often referred to as KB, incorrectly, and 
so on. It seems like the UI is the only place that uses 1000 and KB, which is 
consistent in itself, but should really be 1024/KiB. that change to the UI 
seems fine. I think there are a few comments and docs in the code that really 
also should change to say KiB and so on, and we can make those changes here, 
but it's OK to focus on making the UI based on KiB.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5314/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread xuanyuanking

Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/23128
  
@gatorsmile Thanks Xiao! Conflicts resolve done, as Reynold comments in 
https://github.com/apache/spark/pull/23105#discussion_r235950427, when the 
ShuffleMetricsReporter move to ShuffleReadMetricsReporter in write pr, it will 
conflict again here, I'll keep tracking the relevant pr.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23128
  
**[Test build #99223 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99223/testReport)**
 for PR 23128 at commit 
[`cb46bfe`](https://github.com/apache/spark/commit/cb46bfeb930b71d560340393e95097ee66303862).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...

2018-11-23 Thread xuanyuanking

Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/23128#discussion_r236032855
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala
 ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.executor.TempShuffleReadMetrics
+
+/**
+ * A shuffle metrics reporter for SQL exchange operators.
+ * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext.
+ * @param metrics All metrics in current SparkPlan.
+ */
+class SQLShuffleMetricsReporter(
+  tempMetrics: TempShuffleReadMetrics,
+  metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics {
+
+  override def incRemoteBlocksFetched(v: Long): Unit = {
+metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v)
--- End diff --

Sorry for the less consideration on per-row operation here, I should be 
more careful. Fix done in cb46bfe.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-11-23 Thread httfighter

Github user httfighter commented on the issue:

https://github.com/apache/spark/pull/22683
  
@srowen @ajbozarth I am not sure about some things, can you give me some 
advice? In the process of modification, I have a question. In Spark, whether M 
and MB represent MiB.  Spark does not use the unit of kilobytes to convert 
between numbers.
  private static final ImmutableMap byteSuffixes =
ImmutableMap.builder()
  .put("b", ByteUnit.BYTE)
  .put("k", ByteUnit.KiB)
  .put("kb", ByteUnit.KiB)
  .put("m", ByteUnit.MiB)
  .put("mb", ByteUnit.MiB)
  .put("g", ByteUnit.GiB)
  .put("gb", ByteUnit.GiB)
  .put("t", ByteUnit.TiB)
  .put("tb", ByteUnit.TiB)
  .put("p", ByteUnit.PiB)
  .put("pb", ByteUnit.PiB)
  .build();

  spark.kryoserializer.buffer
  64k
  
Initial size of Kryo's serialization buffer, in KiB unless otherwise 
specified. 
Note that there will be one buffer per core on each worker. This 
buffer will grow up to
spark.kryoserializer.buffer.max if needed.
  


 If this is the case, can we only guarantee the uniform use of 1024 for 
digital conversion, no changes to the  unit displays in log, UI, comments and 
configured messages. Otherwise, we need to modify all the UI, log, comments, 
and configuration information to ensure consistency, there is no guarantee that 
all can be modified, and there will be no problems after the modification.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99222/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21732
  
**[Test build #99222 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99222/testReport)**
 for PR 21732 at commit 
[`62fdb17`](https://github.com/apache/spark/commit/62fdb17b4f72d935f25041c801708e3939e16074).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23052
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99221/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23052
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23052
  
**[Test build #99221 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99221/testReport)**
 for PR 23052 at commit 
[`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  class RuleSummary(`
  * `class QueryPlanningTracker `
  * `class QueryExecution(`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99220/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23128
  
**[Test build #99220 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99220/testReport)**
 for PR 23128 at commit 
[`1b556ec`](https://github.com/apache/spark/commit/1b556ecf869685af8f34d448ac3f08102a758124).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class SQLShuffleMetricsReporter(`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21732
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5313/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21732
  
**[Test build #99222 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99222/testReport)**
 for PR 21732 at commit 
[`62fdb17`](https://github.com/apache/spark/commit/62fdb17b4f72d935f25041c801708e3939e16074).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...

2018-11-23 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23128#discussion_r236025838
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala
 ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.executor.TempShuffleReadMetrics
+
+/**
+ * A shuffle metrics reporter for SQL exchange operators.
+ * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext.
+ * @param metrics All metrics in current SparkPlan.
+ */
+class SQLShuffleMetricsReporter(
+  tempMetrics: TempShuffleReadMetrics,
+  metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics {
+
+  override def incRemoteBlocksFetched(v: Long): Unit = {
+metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v)
--- End diff --

(Iâm not referring to just this function, but in general, especially for 
per-row).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...

2018-11-23 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23128#discussion_r236025817
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala
 ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.executor.TempShuffleReadMetrics
+
+/**
+ * A shuffle metrics reporter for SQL exchange operators.
+ * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext.
+ * @param metrics All metrics in current SparkPlan.
+ */
+class SQLShuffleMetricsReporter(
+  tempMetrics: TempShuffleReadMetrics,
+  metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics {
+
+  override def incRemoteBlocksFetched(v: Long): Unit = {
+metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v)
--- End diff --

Doing a hashmap lookup here could introduce serious performance regressions.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23052
  
**[Test build #99221 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99221/testReport)**
 for PR 23052 at commit 
[`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...

2018-11-23 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23105


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/23128
  
@xuanyuanking Could you address the conflicts? Thanks for you fast work!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23125
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99216/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23125
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23125
  
**[Test build #99216 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99216/testReport)**
 for PR 23125 at commit 
[`a2d7c7a`](https://github.com/apache/spark/commit/a2d7c7a39c265ce41882f3bef1363af4c3962cdb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26139][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5312/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26139][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23128
  
**[Test build #99220 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99220/testReport)**
 for PR 23128 at commit 
[`1b556ec`](https://github.com/apache/spark/commit/1b556ecf869685af8f34d448ac3f08102a758124).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26139][SQL] Support passing shuffle metrics to ex...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...

2018-11-23 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23105#discussion_r236020103
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/metrics.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle
+
+/**
+ * An interface for reporting shuffle read metrics, for each shuffle. This 
interface assumes
+ * all the methods are called on a single-threaded, i.e. concrete 
implementations would not need
+ * to synchronize.
+ *
+ * All methods have additional Spark visibility modifier to allow public, 
concrete implementations
+ * that still have these methods marked as private[spark].
+ */
+private[spark] trait ShuffleReadMetricsReporter {
--- End diff --

@xuanyuanking just submitted a PR on how to use it :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23128: [SPARK-26139][SQL] Support passing shuffle metric...

2018-11-23 Thread xuanyuanking

GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/23128

[SPARK-26139][SQL] Support passing shuffle metrics to exchange operator

## What changes were proposed in this pull request?

Implement `SQLShuffleMetricsReporter` on the sql side as the customized 
ShuffleMetricsReporter, which extended the `TempShuffleReadMetrics` and update 
SQLMetrics, in this way shuffle metrics can be reported in the SQL UI.

## How was this patch tested?

Add UT in SQLMetricsSuite.
Manual test locally, before:

![image](https://user-images.githubusercontent.com/4833765/48960517-30f97880-efa8-11e8-982c-92d05938fd1d.png)
after:

![image](https://user-images.githubusercontent.com/4833765/48960587-b54bfb80-efa8-11e8-8e95-7a3c8c74cc5c.png)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-26142

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23128.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23128


commit fba590f040c8fd7ce75df3f733f246db18e79ee6
Author: Reynold Xin 
Date:   2018-11-21T14:56:23Z

[SPARK-26140] Pull TempShuffleReadMetrics creation out of shuffle reader

commit 35b48b21028110742aed7f7f5b5d62109c2f0adf
Author: Reynold Xin 
Date:   2018-11-21T15:02:04Z

less movement of code

commit 1b556ecf869685af8f34d448ac3f08102a758124
Author: liyuanjian 
Date:   2018-11-23T21:02:25Z

[SPARK-26142] Implement shuffle read metric in SQL




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23103
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23103
  
**[Test build #99219 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99219/testReport)**
 for PR 23103 at commit 
[`9196623`](https://github.com/apache/spark/commit/9196623d2b1fa29522dcc400e27edccf8fea946a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23103
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99219/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...

2018-11-23 Thread hvanhovell

Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/23127
  
Looks good. One more higher level question that can also be addressed in a 
follow-up.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23127: [SPARK-26159] Codegen for LocalTableScanExec and ...

2018-11-23 Thread hvanhovell

Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/23127#discussion_r236017398
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -406,14 +415,39 @@ trait BlockingOperatorWithCodegen extends 
CodegenSupport {
   override def limitNotReachedChecks: Seq[String] = Nil
 }
 
+/**
+ * Leaf codegen node reading from a single RDD.
+ */
+trait InputRDDCodegen extends CodegenSupport {
--- End diff --

Should we reconcile this with the code gen for `RowDataSourceScanExec`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23080
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99215/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23080
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23080
  
**[Test build #99215 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99215/testReport)**
 for PR 23080 at commit 
[`a4c4b67`](https://github.com/apache/spark/commit/a4c4b6710cb67bddd9badbb53aa07b0d93242bc5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23127
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23127
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99218/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23127
  
**[Test build #99218 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99218/testReport)**
 for PR 23127 at commit 
[`23c2d91`](https://github.com/apache/spark/commit/23c2d9111f1cff9059746bb7b48bb8ef7ad7027b).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait InputRDDCodegen extends CodegenSupport `
  * `case class InputAdapter(child: SparkPlan) extends UnaryExecNode with 
InputRDDCodegen `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23103
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23103
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5311/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23103: [SPARK-26121] [Structured Streaming] Allow users to defi...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23103
  
**[Test build #99219 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99219/testReport)**
 for PR 23103 at commit 
[`9196623`](https://github.com/apache/spark/commit/9196623d2b1fa29522dcc400e27edccf8fea946a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23022: [SPARK-26038] Decimal toScalaBigInt/toJavaBigInte...

2018-11-23 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23022


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22512
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22512
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99217/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22512
  
**[Test build #99217 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99217/testReport)**
 for PR 22512 at commit 
[`5227e42`](https://github.com/apache/spark/commit/5227e422cda4d110c6b5a950ebdc49d6e25914f1).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and Existin...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23127
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and Existin...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23127
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5310/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23022: [SPARK-26038] Decimal toScalaBigInt/toJavaBigInteger for...

2018-11-23 Thread hvanhovell

Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/23022
  
Merging to master. Thank!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and Existin...

2018-11-23 Thread juliuszsompolski

Github user juliuszsompolski commented on the issue:

https://github.com/apache/spark/pull/23127
  
cc @hvanhovell @rednaxelafx


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and Existin...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23127
  
**[Test build #99218 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99218/testReport)**
 for PR 23127 at commit 
[`23c2d91`](https://github.com/apache/spark/commit/23c2d9111f1cff9059746bb7b48bb8ef7ad7027b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23127: [SPARK-26159] Codegen for LocalTableScanExec and ...

2018-11-23 Thread juliuszsompolski

GitHub user juliuszsompolski opened a pull request:

https://github.com/apache/spark/pull/23127

[SPARK-26159] Codegen for LocalTableScanExec and ExistingRDDExec

## What changes were proposed in this pull request?

Implement codegen for LocalTableScanExec and ExistingRDDExec. Refactor to 
share code with InputAdapter.


## How was this patch tested?

Covered and used in existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/juliuszsompolski/apache-spark SPARK-26159

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23127


commit 23c2d9111f1cff9059746bb7b48bb8ef7ad7027b
Author: Juliusz Sompolski 
Date:   2018-11-13T09:19:09Z

localtablescanexec codegen




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23022: [SPARK-26038] Decimal toScalaBigInt/toJavaBigInteger for...

2018-11-23 Thread juliuszsompolski

Github user juliuszsompolski commented on the issue:

https://github.com/apache/spark/pull/23022
  
ping @hvanhovell 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22466: [SPARK-25464][SQL] Create Database to the location,only ...

2018-11-23 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22466
  
Regarding the potentially high cost of file listing, `CREATE DATABASE` is 
not a frequent operation. The cost is high only if the target directory is 
non-empty with many many files. We are blocking users from creating such a 
database. Thus, the cost is not a big deal I think. 

We need to list this behavior change in the SQL migration guide.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...

2018-11-23 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22466#discussion_r236005686
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -2370,4 +2370,17 @@ class HiveDDLSuite
   ))
 }
   }
+
+  test("SPARK-25464 create a database with a non empty location") {
--- End diff --

Do we have a test case to check "create a database with an empty location"?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23119: [SPARK-25954][SS][FOLLOWUP][test-maven] Add Zookeeper 3....

2018-11-23 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/23119
  
Thanks, @gaborgsomogyi and @HyukjinKwon .


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...

2018-11-23 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23124#discussion_r235999222
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, 
DataType, MapType}
+
+/**
+ * A builder of [[ArrayBasedMapData]], which fails if a null map key is 
detected, and removes
+ * duplicated map keys w.r.t. the last wins policy.
+ */
+class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends 
Serializable {
+  assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map 
cannot be/contain map")
+
+  private lazy val keyToIndex = keyType match {
+case _: AtomicType | _: CalendarIntervalType => 
mutable.HashMap.empty[Any, Int]
+case _ =>
+  // for complex types, use interpreted ordering to be able to compare 
unsafe data with safe
+  // data, e.g. UnsafeRow vs GenericInternalRow.
+  mutable.TreeMap.empty[Any, 
Int](TypeUtils.getInterpretedOrdering(keyType))
--- End diff --

After merging this PR, I'll check again and file a JIRA for that.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...

2018-11-23 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23124#discussion_r235999040
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, 
DataType, MapType}
+
+/**
+ * A builder of [[ArrayBasedMapData]], which fails if a null map key is 
detected, and removes
+ * duplicated map keys w.r.t. the last wins policy.
+ */
+class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends 
Serializable {
+  assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map 
cannot be/contain map")
+
+  private lazy val keyToIndex = keyType match {
+case _: AtomicType | _: CalendarIntervalType => 
mutable.HashMap.empty[Any, Int]
+case _ =>
+  // for complex types, use interpreted ordering to be able to compare 
unsafe data with safe
+  // data, e.g. UnsafeRow vs GenericInternalRow.
+  mutable.TreeMap.empty[Any, 
Int](TypeUtils.getInterpretedOrdering(keyType))
--- End diff --

Sure.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22512
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22512
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5309/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23125
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23125
  
**[Test build #99216 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99216/testReport)**
 for PR 23125 at commit 
[`a2d7c7a`](https://github.com/apache/spark/commit/a2d7c7a39c265ce41882f3bef1363af4c3962cdb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22512
  
**[Test build #99217 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99217/testReport)**
 for PR 22512 at commit 
[`5227e42`](https://github.com/apache/spark/commit/5227e422cda4d110c6b5a950ebdc49d6e25914f1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23125
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5308/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23125: [SPARK-26156][WebUI] Revise summary section of st...

2018-11-23 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23125#discussion_r235995724
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -79,6 +79,9 @@ private[ui] class StagePage(parent: StagesTab, store: 
AppStatusStore) extends We
 localityNamesAndCounts.sorted.mkString("; ")
   }
 
+  private def jobURL(request: HttpServletRequest, jobId: Int): String =
--- End diff --

Make sense. I have updated the code.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22512: [SPARK-25498][SQL] InterpretedMutableProjection should h...

2018-11-23 Thread kiszk

Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22512
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23022: [SPARK-26038] Decimal toScalaBigInt/toJavaBigInteger for...

2018-11-23 Thread kiszk

Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/23022
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22824: [SPARK-25834] [Structured Streaming]Update Mode should n...

2018-11-23 Thread koeninger

Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/22824
  
@tdas or @jose-torres any opinion on whether it's worth refactoring these 
checks as suggested by @arunmahadevan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23084: [SPARK-26117][CORE][SQL]use SparkOutOfMemoryError instea...

2018-11-23 Thread heary-cao

Github user heary-cao commented on the issue:

https://github.com/apache/spark/pull/23084
  
@cloud-fan,thanks


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18304: [SPARK-21098] Set lineseparator csv multiline and csv wr...

2018-11-23 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18304
  
@danielvdende now the newlines are automatically derected. should be not an 
issue anymore.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...

2018-11-23 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23080


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23080
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23080
  
**[Test build #99215 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99215/testReport)**
 for PR 23080 at commit 
[`a4c4b67`](https://github.com/apache/spark/commit/a4c4b6710cb67bddd9badbb53aa07b0d93242bc5).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23080
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5307/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
@MaxGekk, thanks for working on this one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
Last changes were only doc changes. Let me get this in.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...

2018-11-23 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
Hey all, I will merge this in few days if there's no more comments. It's 
going to speed up the tests roughly 12 ~ 15 mins.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23102: [SPARK-26137][CORE] Use Java system property "fil...

2018-11-23 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23102#discussion_r235975268
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala 
---
@@ -61,11 +62,12 @@ private[deploy] object DependencyUtils extends Logging {
   hadoopConf: Configuration,
   secMgr: SecurityManager): String = {
 val targetDir = Utils.createTempDir()
+val fileSeparator = Pattern.quote(System.getProperty("file.separator"))
 Option(jars)
   .map {
 resolveGlobPaths(_, hadoopConf)
   .split(",")
-  .filterNot(_.contains(userJar.split("/").last))
+  .filterNot(_.contains(userJar.split(fileSeparator).last))
--- End diff --

Beyond the original purpose of this PR, is it better to move 
`userJar.split(fileSeparator).last` before line 66? This is because `userJar` 
is not changed in `map { ... }`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23102: [SPARK-26137][CORE] Use Java system property "file.separ...

2018-11-23 Thread kiszk

Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/23102
  
@MaxGekk This PR may change a separator for `userJar` that has `\` on 
Windows. `resolveGlobPaths` is not applied to `userJar`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread KyleLi1985

Github user KyleLi1985 commented on the issue:

https://github.com/apache/spark/pull/23126
  
Plug do some more test on real data after add this commit

we use data from

http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals

and data from

http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems

to do some more accuracy test, the accuracy result is OK


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread KyleLi1985

Github user KyleLi1985 commented on the issue:

https://github.com/apache/spark/pull/23126
  
After add this commit
We get the result for RowMatrix computeCovariance function:

For the input data
1.0,2.0,3.0,4.0,5.0
2.0,3.0,1.0,2.0,6.0

RowMatrix function computeCovariance result:
2.5 1.75
1.75 3.7

For the input data generated by 
data1 = np.random.normal(loc=10, scale=0.09, size=1000)
data2 = np.random.normal(loc=20, scale=0.02,size=1000)

RowMatrix function computeCovariance result:
8.109505250896888E-11   -5.003160564607658E-15 
-5.003160564607658E-15  4.08276584628234E-12






---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23126
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23126
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23126
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread KyleLi1985

Github user KyleLi1985 commented on the issue:

https://github.com/apache/spark/pull/23126
  
Compare Spark computeCovariance function in RowMatrix for DenseVector and 
Numpy's function cov,

Find two problem, below is the result:

1)The Spark function computeCovariance in RowMatrix is not accuracy

input data

1.0,2.0,3.0,4.0,5.0
2.0,3.0,1.0,2.0,6.0

Numpy function cov result:

[[2.5   1.75]

 [ 1.75  3.7 ]]

RowMatrix function computeCovariance result:

2.5   1.75  

1.75  3.701

 

2)For some input case, the result is not good

generate input data by below logic

data1 = np.random.normal(loc=10, scale=0.09, size=1000)
data2 = np.random.normal(loc=20, scale=0.02,size=1000)

 

Numpy function cov result:

[[  8.10536442e-11  -4.35439574e-15]

[ -4.35439574e-15   3.99928264e-12]]

 

RowMatrix function computeCovariance result:

-0.0027484893798828125  0.001491546630859375 

0.0014915466308593758.087158203125E-4


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-23 Thread KyleLi1985

GitHub user KyleLi1985 opened a pull request:

https://github.com/apache/spark/pull/23126

[SPARK-26158] [MLLIB] fix covariance accuracy problem for DenseVector

## What changes were proposed in this pull request?
Enhance accuracy of the covariance logic in RowMatrix for function 
computeCovariance

## How was this patch tested?
Unit test
Accuracy test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/KyleLi1985/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23126


commit a8bfdfffbe82a77943adb4bf84ca939d786afc8a
Author: æäº® 
Date:   2018-11-23T14:35:27Z

fix covariance accuracy problem for DenseVector




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23083: [SPARK-26114][CORE] ExternalSorter Leak

2018-11-23 Thread advancedxy

Github user advancedxy commented on a diff in the pull request:

https://github.com/apache/spark/pull/23083#discussion_r235955118
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -727,9 +727,10 @@ private[spark] class ExternalSorter[K, V, C](
 spills.clear()
 forceSpillFiles.foreach(s => s.file.delete())
 forceSpillFiles.clear()
-if (map != null || buffer != null) {
+if (map != null || buffer != null || readingIterator != null) {
   map = null // So that the memory can be garbage-collected
   buffer = null // So that the memory can be garbage-collected
+  readingIterator = null // So that the memory can be garbage-collected
--- End diff --

Hi @szhem , I discussed with wenchen offline. I think this is the key 
point. After nulling out `readingIterator`, `ExternalSorter` should released 
all the memories it occupied.

Yes, `ExternalSorter` is leaked in `TaskCompletionListener`,  but it would 
already be stopped in `CompletionIterator` in happy path. The stopped sorter 
wouldn't occupy too much memory. The `readingIterator` 
is occupying memory because it may reference 
`map/buffer.partitionedDestructiveSortedIterator`, which itself references 
`map/buffer`. So only nulling out map or buffer is not enough.

Can you try with this modification only and see whether OOM still occurs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...

2018-11-23 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23124#discussion_r235952965
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, 
DataType, MapType}
+
+/**
+ * A builder of [[ArrayBasedMapData]], which fails if a null map key is 
detected, and removes
+ * duplicated map keys w.r.t. the last wins policy.
+ */
+class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends 
Serializable {
+  assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map 
cannot be/contain map")
+
+  private lazy val keyToIndex = keyType match {
+case _: AtomicType | _: CalendarIntervalType => 
mutable.HashMap.empty[Any, Int]
+case _ =>
+  // for complex types, use interpreted ordering to be able to compare 
unsafe data with safe
+  // data, e.g. UnsafeRow vs GenericInternalRow.
+  mutable.TreeMap.empty[Any, 
Int](TypeUtils.getInterpretedOrdering(keyType))
+  }
+
+  // TODO: specialize it
+  private lazy val keys = mutable.ArrayBuffer.empty[Any]
+  private lazy val values = mutable.ArrayBuffer.empty[Any]
+
+  private lazy val keyGetter = InternalRow.getAccessor(keyType)
+  private lazy val valueGetter = InternalRow.getAccessor(valueType)
+
+  def reset(): Unit = {
+keyToIndex.clear()
+keys.clear()
+values.clear()
+  }
+
+  def put(key: Any, value: Any): Unit = {
+if (key == null) {
+  throw new RuntimeException("Cannot use null as map key.")
+}
+
+val maybeExistingIdx = keyToIndex.get(key)
+if (maybeExistingIdx.isDefined) {
+  // Overwrite the previous value, as the policy is last wins.
+  values(maybeExistingIdx.get) = value
+} else {
+  keyToIndex.put(key, values.length)
+  keys.append(key)
+  values.append(value)
+}
+  }
+
+  // write a 2-field row, the first field is key and the second field is 
value.
+  def put(entry: InternalRow): Unit = {
+if (entry.isNullAt(0)) {
+  throw new RuntimeException("Cannot use null as map key.")
+}
+put(keyGetter(entry, 0), valueGetter(entry, 1))
+  }
+
+  def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = {
+if (keyArray.length != valueArray.length) {
+  throw new RuntimeException(
+"The key array and value array of MapData must have the same 
length.")
+}
+
+var i = 0
+while (i < keyArray.length) {
+  put(keyArray(i), valueArray(i))
+  i += 1
+}
+  }
+
+  def putAll(keyArray: ArrayData, valueArray: ArrayData): Unit = {
+if (keyArray.numElements() != valueArray.numElements()) {
+  throw new RuntimeException(
+"The key array and value array of MapData must have the same 
length.")
+}
+
+var i = 0
+while (i < keyArray.numElements()) {
+  put(keyGetter(keyArray, i), valueGetter(valueArray, i))
+  i += 1
+}
+  }
+
+  def build(): ArrayBasedMapData = {
+new ArrayBasedMapData(new GenericArrayData(keys.toArray), new 
GenericArrayData(values.toArray))
--- End diff --



Is it better to call reset() after calling new ArrayBasedMapData to reduce 
memory consumption in Java heap?

At caller side, ArrayBasedMapBuilder is not released. Therefore, until 
reset() will be called next time, each ArrayBasedMapBuilder keeps unused data 
in keys, values, and keyToIndex. They consumes Java heap unexpectedly.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@sp

[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...

2018-11-23 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23124#discussion_r235950666
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, 
DataType, MapType}
+
+/**
+ * A builder of [[ArrayBasedMapData]], which fails if a null map key is 
detected, and removes
+ * duplicated map keys w.r.t. the last wins policy.
+ */
+class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends 
Serializable {
+  assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map 
cannot be/contain map")
+
+  private lazy val keyToIndex = keyType match {
+case _: AtomicType | _: CalendarIntervalType => 
mutable.HashMap.empty[Any, Int]
+case _ =>
+  // for complex types, use interpreted ordering to be able to compare 
unsafe data with safe
+  // data, e.g. UnsafeRow vs GenericInternalRow.
+  mutable.TreeMap.empty[Any, 
Int](TypeUtils.getInterpretedOrdering(keyType))
+  }
+
+  // TODO: specialize it
+  private lazy val keys = mutable.ArrayBuffer.empty[Any]
+  private lazy val values = mutable.ArrayBuffer.empty[Any]
+
+  private lazy val keyGetter = InternalRow.getAccessor(keyType)
+  private lazy val valueGetter = InternalRow.getAccessor(valueType)
+
+  def reset(): Unit = {
+keyToIndex.clear()
+keys.clear()
+values.clear()
+  }
+
+  def put(key: Any, value: Any): Unit = {
+if (key == null) {
+  throw new RuntimeException("Cannot use null as map key.")
+}
+
+val maybeExistingIdx = keyToIndex.get(key)
+if (maybeExistingIdx.isDefined) {
+  // Overwrite the previous value, as the policy is last wins.
+  values(maybeExistingIdx.get) = value
+} else {
+  keyToIndex.put(key, values.length)
+  keys.append(key)
+  values.append(value)
+}
+  }
+
+  // write a 2-field row, the first field is key and the second field is 
value.
+  def put(entry: InternalRow): Unit = {
+if (entry.isNullAt(0)) {
+  throw new RuntimeException("Cannot use null as map key.")
+}
+put(keyGetter(entry, 0), valueGetter(entry, 1))
+  }
+
+  def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = {
+if (keyArray.length != valueArray.length) {
+  throw new RuntimeException(
+"The key array and value array of MapData must have the same 
length.")
+}
+
+var i = 0
+while (i < keyArray.length) {
+  put(keyArray(i), valueArray(i))
+  i += 1
+}
+  }
+
+  def putAll(keyArray: ArrayData, valueArray: ArrayData): Unit = {
+if (keyArray.numElements() != valueArray.numElements()) {
+  throw new RuntimeException(
+"The key array and value array of MapData must have the same 
length.")
+}
+
+var i = 0
+while (i < keyArray.numElements()) {
+  put(keyGetter(keyArray, i), valueGetter(valueArray, i))
+  i += 1
+}
+  }
+
+  def build(): ArrayBasedMapData = {
+new ArrayBasedMapData(new GenericArrayData(keys.toArray), new 
GenericArrayData(values.toArray))
+  }
+
+  def from(keyArray: ArrayData, valueArray: ArrayData): ArrayBasedMapData 
= {
+assert(keyToIndex.isEmpty, "'from' can only be called with a fresh 
GenericMapBuilder.")
+putAll(keyArray, valueArray)
+if (keyToIndex.size == keyArray.numElements()) {
+  // If there is no duplicated map keys, creates the MapData with the 
input key and value array,
+  // as they might already in unsafe format and are more efficient.
+  new ArrayBasedM

[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...

2018-11-23 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23105#discussion_r235950427
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala 
---
@@ -48,7 +48,8 @@ private[spark] trait ShuffleManager {
   handle: ShuffleHandle,
   startPartition: Int,
   endPartition: Int,
-  context: TaskContext): ShuffleReader[K, C]
+  context: TaskContext,
+  metrics: ShuffleMetricsReporter): ShuffleReader[K, C]
--- End diff --

It is a read metrics here actually. In the write PR this is renamed 
ShuffleReadMetricsReporter.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...

2018-11-23 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23124#discussion_r235950148
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, 
DataType, MapType}
+
+/**
+ * A builder of [[ArrayBasedMapData]], which fails if a null map key is 
detected, and removes
+ * duplicated map keys w.r.t. the last wins policy.
+ */
+class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends 
Serializable {
+  assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map 
cannot be/contain map")
+
+  private lazy val keyToIndex = keyType match {
+case _: AtomicType | _: CalendarIntervalType => 
mutable.HashMap.empty[Any, Int]
+case _ =>
+  // for complex types, use interpreted ordering to be able to compare 
unsafe data with safe
+  // data, e.g. UnsafeRow vs GenericInternalRow.
+  mutable.TreeMap.empty[Any, 
Int](TypeUtils.getInterpretedOrdering(keyType))
+  }
+
+  // TODO: specialize it
+  private lazy val keys = mutable.ArrayBuffer.empty[Any]
+  private lazy val values = mutable.ArrayBuffer.empty[Any]
+
+  private lazy val keyGetter = InternalRow.getAccessor(keyType)
+  private lazy val valueGetter = InternalRow.getAccessor(valueType)
+
+  def reset(): Unit = {
+keyToIndex.clear()
+keys.clear()
+values.clear()
+  }
+
+  def put(key: Any, value: Any): Unit = {
+if (key == null) {
+  throw new RuntimeException("Cannot use null as map key.")
+}
+
+val maybeExistingIdx = keyToIndex.get(key)
+if (maybeExistingIdx.isDefined) {
+  // Overwrite the previous value, as the policy is last wins.
+  values(maybeExistingIdx.get) = value
+} else {
+  keyToIndex.put(key, values.length)
+  keys.append(key)
+  values.append(value)
+}
+  }
+
+  // write a 2-field row, the first field is key and the second field is 
value.
+  def put(entry: InternalRow): Unit = {
+if (entry.isNullAt(0)) {
+  throw new RuntimeException("Cannot use null as map key.")
+}
+put(keyGetter(entry, 0), valueGetter(entry, 1))
+  }
+
+  def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = {
+if (keyArray.length != valueArray.length) {
+  throw new RuntimeException(
+"The key array and value array of MapData must have the same 
length.")
+}
+
+var i = 0
+while (i < keyArray.length) {
+  put(keyArray(i), valueArray(i))
+  i += 1
+}
+  }
+
+  def putAll(keyArray: ArrayData, valueArray: ArrayData): Unit = {
+if (keyArray.numElements() != valueArray.numElements()) {
+  throw new RuntimeException(
+"The key array and value array of MapData must have the same 
length.")
+}
+
+var i = 0
+while (i < keyArray.numElements()) {
+  put(keyGetter(keyArray, i), valueGetter(valueArray, i))
+  i += 1
+}
+  }
+
+  def build(): ArrayBasedMapData = {
+new ArrayBasedMapData(new GenericArrayData(keys.toArray), new 
GenericArrayData(values.toArray))
+  }
--- End diff --

Is it better to call `reset()` after calling `new ArrayBasedMapData` to 
reduce memory consumption?

At caller side, `ArrayBasedMapBuilder` is not released. Therefore, until 
reset() will be called next time, each `ArrayBasedMapBuilder` keeps unused data 
in `keys`, `values`, and `keyToIndex`. They consumes Java heap unexpectedly.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spa

[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...

2018-11-23 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23124#discussion_r235947044
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, 
DataType, MapType}
+
+/**
+ * A builder of [[ArrayBasedMapData]], which fails if a null map key is 
detected, and removes
+ * duplicated map keys w.r.t. the last wins policy.
+ */
+class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends 
Serializable {
+  assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map 
cannot be/contain map")
+
+  private lazy val keyToIndex = keyType match {
+case _: AtomicType | _: CalendarIntervalType => 
mutable.HashMap.empty[Any, Int]
+case _ =>
+  // for complex types, use interpreted ordering to be able to compare 
unsafe data with safe
+  // data, e.g. UnsafeRow vs GenericInternalRow.
+  mutable.TreeMap.empty[Any, 
Int](TypeUtils.getInterpretedOrdering(keyType))
+  }
+
+  // TODO: specialize it
+  private lazy val keys = mutable.ArrayBuffer.empty[Any]
+  private lazy val values = mutable.ArrayBuffer.empty[Any]
+
+  private lazy val keyGetter = InternalRow.getAccessor(keyType)
+  private lazy val valueGetter = InternalRow.getAccessor(valueType)
+
+  def reset(): Unit = {
+keyToIndex.clear()
+keys.clear()
+values.clear()
+  }
+
+  def put(key: Any, value: Any): Unit = {
+if (key == null) {
+  throw new RuntimeException("Cannot use null as map key.")
+}
+
+val maybeExistingIdx = keyToIndex.get(key)
+if (maybeExistingIdx.isDefined) {
+  // Overwrite the previous value, as the policy is last wins.
+  values(maybeExistingIdx.get) = value
+} else {
+  keyToIndex.put(key, values.length)
+  keys.append(key)
+  values.append(value)
+}
+  }
+
+  // write a 2-field row, the first field is key and the second field is 
value.
+  def put(entry: InternalRow): Unit = {
+if (entry.isNullAt(0)) {
+  throw new RuntimeException("Cannot use null as map key.")
+}
+put(keyGetter(entry, 0), valueGetter(entry, 1))
+  }
+
+  def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = {
+if (keyArray.length != valueArray.length) {
+  throw new RuntimeException(
+"The key array and value array of MapData must have the same 
length.")
+}
+
+var i = 0
+while (i < keyArray.length) {
+  put(keyArray(i), valueArray(i))
+  i += 1
+}
+  }
+
+  def putAll(keyArray: ArrayData, valueArray: ArrayData): Unit = {
+if (keyArray.numElements() != valueArray.numElements()) {
+  throw new RuntimeException(
+"The key array and value array of MapData must have the same 
length.")
+}
+
+var i = 0
+while (i < keyArray.numElements()) {
+  put(keyGetter(keyArray, i), valueGetter(valueArray, i))
+  i += 1
+}
+  }
+
+  def build(): ArrayBasedMapData = {
+new ArrayBasedMapData(new GenericArrayData(keys.toArray), new 
GenericArrayData(values.toArray))
+  }
+
+  def from(keyArray: ArrayData, valueArray: ArrayData): ArrayBasedMapData 
= {
+assert(keyToIndex.isEmpty, "'from' can only be called with a fresh 
GenericMapBuilder.")
+putAll(keyArray, valueArray)
--- End diff --

Can we call `new ArrayBasedMapData(keyArray, valueArray)` without calling 
`putAll(keyArray, valueArray)` if 
`keyArray.asInstanceOf[ArrayData].containsNull` is false?


---

[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...

2018-11-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23125
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 >

1 - 100 of 156 matches

Mail list logo