[GitHub] spark issue #23269: [SPARK-26316] Currently the wrong implementation in the ...

2018-12-10 Thread JkSelf
Github user JkSelf commented on the issue:

https://github.com/apache/spark/pull/23269
  
@viirya OK, I will update. Thanks.


---




[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"

2018-12-09 Thread JkSelf
Github user JkSelf commented on the issue:

https://github.com/apache/spark/pull/23204
  
@cloud-fan The new PR is [here](https://github.com/apache/spark/pull/23269). I will close this one.


---




[GitHub] spark pull request #23269: partial revert 21052 because of the performance d...

2018-12-09 Thread JkSelf
GitHub user JkSelf opened a pull request:

https://github.com/apache/spark/pull/23269

Partially revert SPARK-21052 because of the performance degradation in TPC-DS

## What changes were proposed in this pull request?
We tested TPC-DS on Spark 2.3 with and without 
[L486](https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486)
 and 
[L487](https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487)
 of HashedRelation.scala, using the cluster configuration below. The [TPC-DS 
results](https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0)
 show a performance degradation when those lines are present, so we partially revert SPARK-21052.
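
For context, the lines being reverted are per-lookup metric updates inside the hash-lookup hot path of HashedRelation.scala. The sketch below is purely illustrative, not the actual Spark code: the class name and the `numKeyLookups`/`numProbes` counters are assumed names, and the revert removes the equivalent of the two marked increments from the real lookup path.

```scala
// Illustrative sketch only -- not the real HashedRelation code.
// The counter names are assumptions; the pattern is what matters: two metric
// updates executed on every single key lookup of the join hash map.
class IllustrativeLongHashMap(capacity: Int = 1 << 20) {
  private var numKeyLookups = 0L                           // assumed per-lookup counter
  private var numProbes = 0L                               // assumed per-probe counter
  private val mask = capacity - 1
  private val keys = Array.fill(capacity)(Long.MinValue)   // Long.MinValue marks an empty slot
  private val values = new Array[Long](capacity)

  private def slot(key: Long): Int = ((key ^ (key >>> 32)).toInt) & mask

  def put(key: Long, value: Long): Unit = {
    var pos = slot(key)
    while (keys(pos) != Long.MinValue && keys(pos) != key) pos = (pos + 1) & mask
    keys(pos) = key
    values(pos) = value
  }

  def getValue(key: Long): Option[Long] = {
    numKeyLookups += 1                                     // <- the kind of line being reverted
    var pos = slot(key)
    while (keys(pos) != Long.MinValue) {
      numProbes += 1                                       // <- one increment per probed slot
      if (keys(pos) == key) return Some(values(pos))
      pos = (pos + 1) & mask                               // linear probing
    }
    None
  }
}
```

Removing the two increments does not change lookup results; it only drops the per-row bookkeeping from the innermost probe loop.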
**Cluster info:**

  | Master Node | Worker Nodes
-- | -- | --
Node | 1x | 4x
Processor | Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz | Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
Memory | 192 GB | 384 GB
Storage Main | 8 x 960G SSD | 8 x 960G SSD
Network | 10GbE |
Role | CM Management, NameNode, Secondary NameNode, Resource Manager, Hive Metastore Server | DataNode, NodeManager
OS Version | CentOS 7.2 | CentOS 7.2
Hadoop | Apache Hadoop 2.7.5 | Apache Hadoop 2.7.5
Hive | Apache Hive 2.2.0 |
Spark | Apache Spark 2.1.0 & Apache Spark 2.3.0 |
JDK version | 1.8.0_112 | 1.8.0_112

**Related parameter settings:**

Component | Parameter | Value
-- | -- | --
Yarn Resource Manager | yarn.scheduler.maximum-allocation-mb | 120GB
  | yarn.scheduler.minimum-allocation-mb | 1GB
  | yarn.scheduler.maximum-allocation-vcores | 121
  | yarn.resourcemanager.scheduler.class | Fair Scheduler
Yarn Node Manager | yarn.nodemanager.resource.memory-mb | 120GB
  | yarn.nodemanager.resource.cpu-vcores | 121
Spark | spark.executor.memory | 110GB
  | spark.executor.cores | 50
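
For reference, the Spark-side settings from the table above could be wired into a driver program as sketched below. This is only an assumption of how the benchmark driver might be set up; in a real YARN run these values are normally passed on the spark-submit command line, and the application name is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the Spark-side settings from the table above (placeholder app name).
// In practice spark.executor.memory and spark.executor.cores are usually supplied
// via spark-submit rather than hard-coded.
object TpcdsSessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tpcds-benchmark")
      .config("spark.executor.memory", "110g")
      .config("spark.executor.cores", "50")
      .getOrCreate()

    // ... run the TPC-DS queries here ...

    spark.stop()
  }
}
```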





## How was this patch tested?
N/A

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JkSelf/spark partial-revert-21052

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23269.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23269


commit 03cfe2b7506f5c5421aaf2858f3f31f2153db8fb
Author: jiake 
Date:   2018-12-10T06:50:32Z

partial revert 21052 because of the performance degradation in tpc-ds




---




[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"

2018-12-09 Thread JkSelf
Github user JkSelf commented on the issue:

https://github.com/apache/spark/pull/23204
  
@cloud-fan @dongjoon-hyun I have updated the patch; please help review when you have 
time. Thanks.


---




[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"

2018-12-08 Thread JkSelf
Github user JkSelf commented on the issue:

https://github.com/apache/spark/pull/23204
  
@cloud-fan OK, I will revert according to your comments later.


---




[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"

2018-12-08 Thread JkSelf
Github user JkSelf commented on the issue:

https://github.com/apache/spark/pull/23204
  
The results of all TPC-DS queries at the 1 TB scale factor are in the [TPC-DS 
results](https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0).


---




[GitHub] spark issue #23214: [SPARK-26155] Optimizing the performance of LongToUnsafe...

2018-12-03 Thread JkSelf
Github user JkSelf commented on the issue:

https://github.com/apache/spark/pull/23214
  
@LuciferYang The patch works fine in my test environment.
@adrian-wang I will run all the TPC-DS queries on Spark 2.3, with and without this 
patch, later.


---




[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"

2018-12-03 Thread JkSelf
Github user JkSelf commented on the issue:

https://github.com/apache/spark/pull/23204
  
**Cluster info:**

  | Master Node | Worker Nodes
-- | -- | --
Node | 1x | 4x
Processor | Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz | Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
Memory | 192 GB | 384 GB
Storage Main | 8 x 960G SSD | 8 x 960G SSD
Network | 10GbE |
Role | CM Management, NameNode, Secondary NameNode, Resource Manager, Hive Metastore Server | DataNode, NodeManager
OS Version | CentOS 7.2 | CentOS 7.2
Hadoop | Apache Hadoop 2.7.5 | Apache Hadoop 2.7.5
Hive | Apache Hive 2.2.0 |
Spark | Apache Spark 2.1.0 & Apache Spark 2.3.0 |
JDK version | 1.8.0_112 | 1.8.0_112

**Related parameter settings:**

Component | Parameter | Value
-- | -- | --
Yarn Resource Manager | yarn.scheduler.maximum-allocation-mb | 40GB
  | yarn.scheduler.minimum-allocation-mb | 1GB
  | yarn.scheduler.maximum-allocation-vcores | 121
  | yarn.resourcemanager.scheduler.class | Fair Scheduler
Yarn Node Manager | yarn.nodemanager.resource.memory-mb | 40GB
  | yarn.nodemanager.resource.cpu-vcores | 121
Spark | spark.executor.memory | 34GB
  | spark.executor.cores | 40

In the above test environment, we found a serious performance degradation in Spark 2.3 
when running TPC-DS on SKX 8180. We investigated the problem and traced the root cause 
to SPARK-21052, which adds metrics to the hash join process. The affected code is 
[L486](https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486)
 and 
[L487](https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487)
 of HashedRelation.scala.

Below are the TPC-DS Q19 results for Spark 2.1, Spark 2.3 without L486 & L487, 
Spark 2.3 with L486 & L487, and Spark 2.4.

Spark 2.1 | Spark 2.3 (without L486 & L487) | Spark 2.3 (with L486 & L487) | Spark 2.4
-- | -- | -- | --
49s | 47s | 307s | 270s
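
To make the with/without comparison reproducible at a small scale, here is a standalone, purely illustrative micro-benchmark sketch (an assumption of this write-up, not part of the PR) that isolates per-lookup counter updates in a tight loop. A rigorous measurement would use JMH with proper warm-up, and this sketch does not claim to reproduce the numbers in the table.

```scala
import scala.util.Random

// Purely illustrative: isolates the cost of per-lookup bookkeeping in a tight loop.
// It mirrors the "with vs. without L486 & L487" methodology above, but it is not
// the TPC-DS setup.
object ProbeMetricOverheadSketch {
  def main(args: Array[String]): Unit = {
    val n = 1 << 22
    val keys = Array.fill(n)(Random.nextLong())
    var numKeyLookups = 0L
    var numProbes = 0L

    def run(withMetrics: Boolean): Long = {
      var sink = 0L                      // keeps the loop from being optimized away
      val start = System.nanoTime()
      var i = 0
      while (i < n) {
        if (withMetrics) {               // the two per-lookup updates under discussion
          numKeyLookups += 1
          numProbes += 1
        }
        sink ^= keys(i)                  // stand-in for the actual hash-map probe
        i += 1
      }
      val millis = (System.nanoTime() - start) / 1000000
      println(s"sink=$sink")             // observable side effect
      millis
    }

    run(withMetrics = false)             // warm-up passes
    run(withMetrics = true)
    println(s"without metrics: ${run(withMetrics = false)} ms")
    println(s"with metrics:    ${run(withMetrics = true)} ms")
  }
}
```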





---




[GitHub] spark pull request #23204: Revert "[SPARK-21052][SQL] Add hash map metrics t...

2018-12-03 Thread JkSelf
GitHub user JkSelf opened a pull request:

https://github.com/apache/spark/pull/23204

Revert "[SPARK-21052][SQL] Add hash map metrics to join"

Because of the performance degradation discussed in 
[SPARK-26155](https://issues.apache.org/jira/browse/SPARK-26155), we revert 
[SPARK-21052](https://issues.apache.org/jira/browse/SPARK-21052).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JkSelf/spark revert-21052

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23204.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23204


commit 7d5008d11e37086f4a8206276791b9424bcf60b7
Author: jiake 
Date:   2018-12-03T08:18:44Z

Revert "[SPARK-21052][SQL] Add hash map metrics to join"

This reverts commit 18066f2e61f430b691ed8a777c9b4e5786bf9dbc.

Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala




---
