[jira] [Resolved] (SPARK-33740) hadoop configs in hive-site.xml can override pre-existing hadoop ones

2020-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33740.
---
Fix Version/s: 3.1.0
   3.0.2
 Assignee: Kent Yao
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/30709

> hadoop configs in hive-site.xml can override pre-existing hadoop ones
> --
>
> Key: SPARK-33740
> URL: https://issues.apache.org/jira/browse/SPARK-33740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0, 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>







[jira] [Created] (SPARK-33756) BytesToBytesMap's iterator hasNext method should be idempotent.

2020-12-11 Thread Xianjin YE (Jira)
Xianjin YE created SPARK-33756:
--

 Summary: BytesToBytesMap's iterator hasNext method should be 
idempotent.
 Key: SPARK-33756
 URL: https://issues.apache.org/jira/browse/SPARK-33756
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Xianjin YE


The hasNext method of BytesToBytesMap's MapIterator is not idempotent:
{code:java}
public boolean hasNext() {
  if (numRecords == 0) {
    if (reader != null) {
      // if called multiple times, this throws a NoSuchElementException
      handleFailedDelete();
    }
  }
  return numRecords > 0;
}
{code}
Calling `hasNext` repeatedly after the iterator is exhausted calls 
`handleFailedDelete()` multiple times, which throws a NoSuchElementException 
because the list of spill writers is already empty.

 

We observed this issue in one of our production jobs after upgrading to 
Spark 3.0.
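
For illustration only, and not the actual Spark patch: a minimal, self-contained 
sketch of the usual guard that makes this kind of hasNext() idempotent. The 
class, field, and method names below are hypothetical stand-ins for the real 
MapIterator.
{code:java}
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical example: the one-time cleanup in hasNext() is guarded by a flag
// so that calling hasNext() repeatedly after exhaustion stays side-effect free.
final class OneShotCleanupIterator implements Iterator<Integer> {
  private int numRecords;
  private boolean cleanedUp = false;  // guard that makes hasNext() idempotent

  OneShotCleanupIterator(int numRecords) { this.numRecords = numRecords; }

  private void handleFailedDelete() {
    // stands in for deleting spill files; must not run twice
    System.out.println("cleaning up spill files");
  }

  @Override
  public boolean hasNext() {
    if (numRecords == 0 && !cleanedUp) {
      handleFailedDelete();
      cleanedUp = true;
    }
    return numRecords > 0;
  }

  @Override
  public Integer next() {
    if (numRecords == 0) throw new NoSuchElementException();
    return numRecords--;
  }

  public static void main(String[] args) {
    Iterator<Integer> it = new OneShotCleanupIterator(2);
    while (it.hasNext()) it.next();
    // extra hasNext() calls are now harmless
    System.out.println(it.hasNext() + " " + it.hasNext());
  }
}
{code}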






[jira] [Created] (SPARK-33755) Allow creating orc table when row format separator is defined

2020-12-11 Thread xiepengjie (Jira)
xiepengjie created SPARK-33755:
--

 Summary: Allow creating orc table when row format separator is 
defined
 Key: SPARK-33755
 URL: https://issues.apache.org/jira/browse/SPARK-33755
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.2
Reporter: xiepengjie









[jira] [Commented] (SPARK-33730) Standardize warning types

2020-12-11 Thread Shril Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247788#comment-17247788
 ] 

Shril Kumar commented on SPARK-33730:
-

[~hyukjin.kwon], [~zero323] can I pick this up?

> Standardize warning types
> -
>
> Key: SPARK-33730
> URL: https://issues.apache.org/jira/browse/SPARK-33730
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should use warnings properly per 
> [https://docs.python.org/3/library/warnings.html#warning-categories]
> In particular,
>  - we should use {{FutureWarning}} instead of {{DeprecationWarning}} in the 
> places where warnings should be shown to end users by default.
>  - we should __maybe__ think about customizing stacklevel 
> ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas 
> does.
>  - ...
> Current warnings are a bit messy and somewhat arbitrary.
> To be more explicit, we'll have to fix:
> {code:java}
> pyspark/context.py:warnings.warn(
> pyspark/context.py:warnings.warn(
> pyspark/ml/classification.py:warnings.warn("weightCol is 
> ignored, "
> pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will 
> be removed in future versions. Use "
> pyspark/mllib/classification.py:warnings.warn(
> pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd 
> are false. The model does nothing.")
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; "
> pyspark/rdd.py:warnings.warn(
> pyspark/shell.py:warnings.warn("Failed to initialize Spark session.")
> pyspark/shuffle.py:warnings.warn("Please install psutil to have 
> better "
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict 
> and value is not None. value will be ignored.")
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use 
> approx_count_distinct instead.", DeprecationWarning)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/functions.py:warnings.warn(
> pyspark/sql/pandas/group_ops.py:warnings.warn(
> pyspark/sql/session.py:warnings.warn("Fall back to non-hive 
> support because failing to access HiveConf, "
> {code}
> PySpark also prints warnings via {{print}} in some places. We should see 
> whether those should be switched to {{warnings.warn}} as well.






[jira] [Commented] (SPARK-33527) Extend the function of decode so as to be consistent with mainstream databases

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247786#comment-17247786
 ] 

Apache Spark commented on SPARK-33527:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/30727

> Extend the function of decode so as to be consistent with mainstream databases
> 
>
> Key: SPARK-33527
> URL: https://issues.apache.org/jira/browse/SPARK-33527
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> In Spark, decode(bin, charset) decodes the first argument using the character 
> set given as the second argument.
> Unfortunately, this is NOT what any other SQL vendor understands DECODE to do.
> DECODE is generally shorthand for a simple CASE expression:
> {code:java}
> SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS 
> T(c1)
> =>
> (Hello),
> (World),
> (!)
> {code}






[jira] [Commented] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the updated default Hadoop profile

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247776#comment-17247776
 ] 

Apache Spark commented on SPARK-33754:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/30726

> Update kubernetes/integration-tests/README.md to follow the updated default 
> Hadoop profile
> --
>
> Key: SPARK-33754
> URL: https://issues.apache.org/jira/browse/SPARK-33754
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> kubernetes/integration-tests/README.md describes how to run the integration 
> tests for Kubernetes as follows.
> {code}
> To run tests with Hadoop 3.2 instead of Hadoop 2.7, use `--hadoop-profile`.
> ./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-2.7
> {code}
> In the current master, the default Hadoop profile is hadoop-3.2, so the 
> document should be updated.






[jira] [Comment Edited] (SPARK-33731) Standardize exception types

2020-12-11 Thread Shril Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1724#comment-1724
 ] 

Shril Kumar edited comment on SPARK-33731 at 12/11/20, 9:20 AM:


[~hyukjin.kwon] can you help me pick up this issue? I can contribute to this.


was (Author: shril):
[~hyukjin.kwon] can you help me pick up this issue?

> Standardize exception types
> ---
>
> Key: SPARK-33731
> URL: https://issues.apache.org/jira/browse/SPARK-33731
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should:
> - have a better hierarchy for exception types
> - or at least use the default type of exceptions correctly instead of just 
> throwing a plain Exception.






[jira] [Assigned] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the updated default Hadoop profile

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33754:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Update kubernetes/integration-tests/README.md to follow the updated default 
> Hadoop profile
> --
>
> Key: SPARK-33754
> URL: https://issues.apache.org/jira/browse/SPARK-33754
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> kubernetes/integration-tests/README.md describes how to run the integration 
> tests for Kubernetes as follows.
> {code}
> To run tests with Hadoop 3.2 instead of Hadoop 2.7, use `--hadoop-profile`.
> ./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-2.7
> {code}
> In the current master, the default Hadoop profile is hadoop-3.2, so the 
> document should be updated.






[jira] [Assigned] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the updated default Hadoop profile

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33754:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Update kubernetes/integration-tests/README.md to follow the updated default 
> Hadoop profile
> --
>
> Key: SPARK-33754
> URL: https://issues.apache.org/jira/browse/SPARK-33754
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> kubernetes/integration-tests/README.md describes how to run the integration 
> tests for Kubernetes as follows.
> {code}
> To run tests with Hadoop 3.2 instead of Hadoop 2.7, use `--hadoop-profile`.
> ./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-2.7
> {code}
> In the current master, the default Hadoop profile is hadoop-3.2, so the 
> document should be updated.






[jira] [Commented] (SPARK-33731) Standardize exception types

2020-12-11 Thread Shril Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1724#comment-1724
 ] 

Shril Kumar commented on SPARK-33731:
-

[~hyukjin.kwon] can you help me pick up this issue?

> Standardize exception types
> ---
>
> Key: SPARK-33731
> URL: https://issues.apache.org/jira/browse/SPARK-33731
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should:
> - have a better hierarchy for exception types
> - or at least use the default type of exceptions correctly instead of just 
> throwing a plain Exception.






[jira] [Created] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the updated default Hadoop profile

2020-12-11 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-33754:
--

 Summary: Update kubernetes/integration-tests/README.md to follow the 
updated default Hadoop profile
 Key: SPARK-33754
 URL: https://issues.apache.org/jira/browse/SPARK-33754
 Project: Spark
  Issue Type: Improvement
  Components: docs, Kubernetes, Tests
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


kubernetes/integration-tests/README.md describes how to run the integration 
tests for Kubernetes as follows.

{code}
To run tests with Hadoop 3.2 instead of Hadoop 2.7, use `--hadoop-profile`.

./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-2.7
{code}

In the current master, the default Hadoop profile is hadoop-3.2, so the 
document should be updated.







[jira] [Commented] (SPARK-33737) Use an Informer+Lister API in the ExecutorPodWatcher

2020-12-11 Thread Stavros Kontopoulos (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247759#comment-17247759
 ] 

Stavros Kontopoulos commented on SPARK-33737:
-

In addition, the current implementation has been out for a long time and is 
stable. We need to be sure that any updates will not cause any issues.
I can work on a PR and see how things integrate.

> Use an Informer+Lister API in the ExecutorPodWatcher
> 
>
> Key: SPARK-33737
> URL: https://issues.apache.org/jira/browse/SPARK-33737
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> The Kubernetes backend uses the Fabric8 client and a 
> [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42]
>  to monitor the K8s API server for pod changes. Every watcher keeps a 
> websocket connection open and has no caching mechanism at that layer. Caching 
> in the Spark K8s resource manager exists in other areas where we hit the API 
> server for Pod CRUD ops, for example 
> [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49].
> In an environment where many connections are kept open due to large-scale 
> jobs, this could be problematic and impose a lot of load on the API server. 
> Many long-running jobs (e.g. streaming jobs) do not create enough pod changes 
> to justify a continuous watching mechanism.
> Recent Fabric8 client versions have implemented a SharedInformer API plus a 
> Lister; an example can be found 
> [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37].
> This new API follows the implementation of the official Java K8s client and 
> its Go counterpart, and it is backed by a caching mechanism that is re-synced 
> after a configurable period to avoid hitting the API server all the time. 
> There is also a lister that keeps track of the current status of resources. 
> Using such a mechanism is commonplace when implementing a K8s controller.
> The suggestion is to update the client to v4.13.0 (which has all the updates 
> related to that API) and use the informer+lister API where applicable.
> I think the lister could also replace part of the snapshotting/notification 
> mechanism.
> /cc [~dongjoon] [~eje] [~holden] WDYTH?
>  
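
For reference, a rough sketch of informer+lister usage along the lines of the 
Fabric8 InformerExample linked above. The class and method names follow the 4.x 
informer API and may differ between client versions; this is illustrative only, 
not the proposed Spark change.
{code:java}
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodList;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;
import io.fabric8.kubernetes.client.informers.SharedInformerFactory;
import io.fabric8.kubernetes.client.informers.cache.Lister;

public class PodInformerSketch {
  public static void main(String[] args) throws InterruptedException {
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      SharedInformerFactory factory = client.informers();

      // The informer keeps a locally cached, periodically re-synced view of
      // pods instead of every watcher holding its own uncached websocket.
      SharedIndexInformer<Pod> podInformer =
          factory.sharedIndexInformerFor(Pod.class, PodList.class, 30 * 1000L);

      podInformer.addEventHandler(new ResourceEventHandler<Pod>() {
        @Override public void onAdd(Pod pod) {
          System.out.println("added " + pod.getMetadata().getName());
        }
        @Override public void onUpdate(Pod oldPod, Pod newPod) {
          System.out.println("updated " + newPod.getMetadata().getName());
        }
        @Override public void onDelete(Pod pod, boolean unknownFinalState) {
          System.out.println("deleted " + pod.getMetadata().getName());
        }
      });

      factory.startAllRegisteredInformers();

      // The lister answers "what pods exist right now" from the informer's
      // local cache, without another round trip to the API server.
      Lister<Pod> podLister = new Lister<>(podInformer.getIndexer(), "default");
      Thread.sleep(10_000L);
      podLister.list().forEach(p -> System.out.println(p.getMetadata().getName()));
    }
  }
}
{code}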






[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-33753:
---
Description: 
 

HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
 When the number of Hive partitions read by the driver is large, 
HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
 The executor also creates a jobconf, adds it to the cache, and shares it 
among executors.

The number of jobconfs in the driver cache increases memory pressure. When the 
driver memory configuration is low, full GC becomes very frequent, and these 
jobconfs are hardly ever reused.

For example, with spark.driver.memory=2560m, about 14,000 partitions are read, 
and each jobconf is about 96 KB.

!jobconf.png!

The following is a before/after comparison: full GC time decreased from 62s to 
0.8s, and the number of full GCs decreased from 31 to 5. The driver also used 
less memory (Old Gen 1.667G -> 968M), and the job execution time is reduced.

 

Current:

!current_job_finish_time.png!

jstat -gcutil PID 2s

!current_gcutil.png!

!current_visual_gc.png!

 

Try to change softValues to weakValues

!fix_job_finish_time.png!

!fix_gcutil.png!

!fix_visual_gc.png!

 

 

 

 

 

  was:
 

HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
 When the number of Hive partitions read by the driver is large, 
HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
 The executor also creates a jobconf, adds it to the cache, and shares it 
among executors.

The number of jobconfs in the driver cache increases memory pressure. When the 
driver memory configuration is low, full GC runs very frequently, and these 
jobconfs are hardly ever reused.

For example, with spark.driver.memory=2560m, about 14,000 partitions are read, 
and each jobconf is about 96 KB.

!jobconf.png!

The following is a before/after comparison: full GC time decreased from 62s to 
0.8s, and the number of full GCs decreased from 31 to 5. The driver also used 
less memory (Old Gen 1.667G -> 968M), and the job execution time is reduced.

 

Current:

!current_job_finish_time.png!

jstat -gcutil PID 2s

!current_gcutil.png!

!current_visual_gc.png!

 

Try to change softValues to weakValues

!fix_job_finish_time.png!

!fix_gcutil.png!

!fix_visual_gc.png!

 

 

 

 

 


> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, 
> fix_visual_gc.png, jobconf.png
>
>
>  
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC becomes very frequent, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> !jobconf.png!
>  
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M), and the job execution time is 
> reduced.
>  
> Current:
> !current_job_finish_time.png!
> jstat -gcutil PID 2s
> !current_gcutil.png!
> !current_visual_gc.png!
>  
> Try to change softValues to weakValues
> !fix_job_finish_time.png!
> !fix_gcutil.png!
> !fix_visual_gc.png!
>  
>  
>  
>  
>  
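
For illustration, a stand-alone Guava sketch of the difference between the two 
reference strengths discussed above. The class and key names are made up for 
the example; this is not the actual SparkEnv code.
{code:java}
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.ConcurrentMap;

public class JobConfCacheSketch {
  public static void main(String[] args) {
    // Soft values are only collected when the JVM is under memory pressure, so
    // thousands of rarely reused jobconfs linger in old gen and drive full GCs.
    ConcurrentMap<String, Object> softCache =
        CacheBuilder.newBuilder().softValues().<String, Object>build().asMap();

    // Weak values become collectible as soon as nothing else references the
    // jobconf, so the cache stops pinning them in memory.
    ConcurrentMap<String, Object> weakCache =
        CacheBuilder.newBuilder().weakValues().<String, Object>build().asMap();

    Object conf = new Object();  // stands in for a Hadoop JobConf
    softCache.put("rdd_0_jobconf", conf);
    weakCache.put("rdd_0_jobconf", conf);
    System.out.println(softCache.size() + " soft / " + weakCache.size() + " weak");
  }
}
{code}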






[jira] [Commented] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247755#comment-17247755
 ] 

Apache Spark commented on SPARK-33753:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/30725

> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, 
> fix_visual_gc.png, jobconf.png
>
>
>  
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC becomes very frequent, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> !jobconf.png!
>  
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M), and the job execution time is 
> reduced.
>  
> Current:
> !current_job_finish_time.png!
> jstat -gcutil PID 2s
> !current_gcutil.png!
> !current_visual_gc.png!
>  
> Try to change softValues to weakValues
> !fix_job_finish_time.png!
> !fix_gcutil.png!
> !fix_visual_gc.png!
>  
>  
>  
>  
>  






[jira] [Assigned] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33753:


Assignee: (was: Apache Spark)

> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, 
> fix_visual_gc.png, jobconf.png
>
>
>  
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> !jobconf.png!
>  
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M), and the job execution time is 
> reduced.
>  
> Current:
> !current_job_finish_time.png!
> jstat -gcutil PID 2s
> !current_gcutil.png!
> !current_visual_gc.png!
>  
> Try to change softValues to weakValues
> !fix_job_finish_time.png!
> !fix_gcutil.png!
> !fix_visual_gc.png!
>  
>  
>  
>  
>  






[jira] [Assigned] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33753:


Assignee: Apache Spark

> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: Apache Spark
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, 
> fix_visual_gc.png, jobconf.png
>
>
>  
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> !jobconf.png!
>  
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M), and the job execution time is 
> reduced.
>  
> Current:
> !current_job_finish_time.png!
> jstat -gcutil PID 2s
> !current_gcutil.png!
> !current_visual_gc.png!
>  
> Try to change softValues to weakValues
> !fix_job_finish_time.png!
> !fix_gcutil.png!
> !fix_visual_gc.png!
>  
>  
>  
>  
>  






[jira] [Commented] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247754#comment-17247754
 ] 

Apache Spark commented on SPARK-33753:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/30725

> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, 
> fix_visual_gc.png, jobconf.png
>
>
>  
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> !jobconf.png!
>  
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M), and the job execution time is 
> reduced.
>  
> Current:
> !current_job_finish_time.png!
> jstat -gcutil PID 2s
> !current_gcutil.png!
> !current_visual_gc.png!
>  
> Try to change softValues to weakValues
> !fix_job_finish_time.png!
> !fix_gcutil.png!
> !fix_visual_gc.png!
>  
>  
>  
>  
>  






[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-33753:
---
Description: 
 

HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
 When the number of Hive partitions read by the driver is large, 
HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
 The executor also creates a jobconf, adds it to the cache, and shares it 
among executors.

The number of jobconfs in the driver cache increases memory pressure. When the 
driver memory configuration is low, full GC runs very frequently, and these 
jobconfs are hardly ever reused.

For example, with spark.driver.memory=2560m, about 14,000 partitions are read, 
and each jobconf is about 96 KB.

!jobconf.png!

The following is a before/after comparison: full GC time decreased from 62s to 
0.8s, and the number of full GCs decreased from 31 to 5. The driver also used 
less memory (Old Gen 1.667G -> 968M), and the job execution time is reduced.

 

Current:

!current_job_finish_time.png!

jstat -gcutil PID 2s

!current_gcutil.png!

!current_visual_gc.png!

 

Try to change softValues to weakValues

!fix_job_finish_time.png!

!fix_gcutil.png!

!fix_visual_gc.png!

 

 

 

 

 

  was:
 

HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
 When the number of Hive partitions read by the driver is large, 
HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
 The executor also creates a jobconf, adds it to the cache, and shares it 
among executors.

The number of jobconfs in the driver cache increases memory pressure. When the 
driver memory configuration is low, full GC runs very frequently, and these 
jobconfs are hardly ever reused.

For example, with spark.driver.memory=2560m, about 14,000 partitions are read, 
and each jobconf is about 96 KB.

The following is a before/after comparison: full GC time decreased from 62s to 
0.8s, and the number of full GCs decreased from 31 to 5. The driver also used 
less memory (Old Gen 1.667G -> 968M), and the job execution time is reduced.

 

Current:

!current_job_finish_time.png!

jstat -gcutil PID 2s

!current_gcutil.png!

!current_visual_gc.png!

 

Try to change softValues to weakValues

!fix_job_finish_time.png!

!fix_gcutil.png!

!fix_visual_gc.png!

 

 

 

 

 


> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, 
> fix_visual_gc.png, jobconf.png
>
>
>  
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> !jobconf.png!
>  
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M), and the job execution time is 
> reduced.
>  
> Current:
> !current_job_finish_time.png!
> jstat -gcutil PID 2s
> !current_gcutil.png!
> !current_visual_gc.png!
>  
> Try to change softValues to weakValues
> !fix_job_finish_time.png!
> !fix_gcutil.png!
> !fix_visual_gc.png!
>  
>  
>  
>  
>  






[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-33753:
---
Attachment: jobconf.png

> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, 
> fix_visual_gc.png, jobconf.png
>
>
>  
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M), and the job execution time is 
> reduced.
>  
> Current:
> !current_job_finish_time.png!
> jstat -gcutil PID 2s
> !current_gcutil.png!
> !current_visual_gc.png!
>  
> Try to change softValues to weakValues
> !fix_job_finish_time.png!
> !fix_gcutil.png!
> !fix_visual_gc.png!
>  
>  
>  
>  
>  






[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-33753:
---
Description: 
 

HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
 When the number of Hive partitions read by the driver is large, 
HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
 The executor also creates a jobconf, adds it to the cache, and shares it 
among executors.

The number of jobconfs in the driver cache increases memory pressure. When the 
driver memory configuration is low, full GC runs very frequently, and these 
jobconfs are hardly ever reused.

For example, with spark.driver.memory=2560m, about 14,000 partitions are read, 
and each jobconf is about 96 KB.

The following is a before/after comparison: full GC time decreased from 62s to 
0.8s, and the number of full GCs decreased from 31 to 5. The driver also used 
less memory (Old Gen 1.667G -> 968M), and the job execution time is reduced.

 

Current:

!current_job_finish_time.png!

jstat -gcutil PID 2s

!current_gcutil.png!

!current_visual_gc.png!

 

Try to change softValues to weakValues

!fix_job_finish_time.png!

!fix_gcutil.png!

!fix_visual_gc.png!

 

 

 

 

 

  was:
HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
 When the number of Hive partitions read by the driver is large, 
HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
 The executor also creates a jobconf, adds it to the cache, and shares it 
among executors.

The number of jobconfs in the driver cache increases memory pressure. When the 
driver memory configuration is low, full GC runs very frequently, and these 
jobconfs are hardly ever reused.

For example, with spark.driver.memory=2560m, about 14,000 partitions are read, 
and each jobconf is about 96 KB.

The following is a before/after comparison: full GC time decreased from 62s to 
0.8s, and the number of full GCs decreased from 31 to 5. The driver also used 
less memory (Old Gen 1.667G -> 968M).

 

Current:

!image-2020-12-11-16-17-28-991.png!

jstat -gcutil PID 2s

!image-2020-12-11-16-08-53-656.png!

!image-2020-12-11-16-10-07-363.png!

 

Try to change softValues to weakValues

!image-2020-12-11-16-11-26-673.png!

!image-2020-12-11-16-11-35-988.png!

!image-2020-12-11-16-12-22-035.png!

 

 

 

 

 


> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, 
> fix_visual_gc.png
>
>
>  
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M), and the job execution time is 
> reduced.
>  
> Current:
> !current_job_finish_time.png!
> jstat -gcutil PID 2s
> !current_gcutil.png!
> !current_visual_gc.png!
>  
> Try to change softValues to weakValues
> !fix_job_finish_time.png!
> !fix_gcutil.png!
> !fix_visual_gc.png!
>  
>  
>  
>  
>  






[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-33753:
---
Attachment: fix_visual_gc.png
fix_job_finish_time.png
fix_gcutil.png

> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, 
> fix_visual_gc.png
>
>
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M).
>  
> Current:
> !image-2020-12-11-16-17-28-991.png!
> jstat -gcutil PID 2s
> !image-2020-12-11-16-08-53-656.png!
> !image-2020-12-11-16-10-07-363.png!
>  
> Try to change softValues to weakValues
> !image-2020-12-11-16-11-26-673.png!
> !image-2020-12-11-16-11-35-988.png!
> !image-2020-12-11-16-12-22-035.png!
>  
>  
>  
>  
>  






[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-33753:
---
Attachment: current_gcutil.png

> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png
>
>
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M).
>  
> Current:
> !image-2020-12-11-16-17-28-991.png!
> jstat -gcutil PID 2s
> !image-2020-12-11-16-08-53-656.png!
> !image-2020-12-11-16-10-07-363.png!
>  
> Try to change softValues to weakValues
> !image-2020-12-11-16-11-26-673.png!
> !image-2020-12-11-16-11-35-988.png!
> !image-2020-12-11-16-12-22-035.png!
>  
>  
>  
>  
>  






[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-33753:
---
Attachment: current_visual_gc.png

> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_gcutil.png, current_job_finish_time.png, 
> current_visual_gc.png
>
>
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M).
>  
> Current:
> !image-2020-12-11-16-17-28-991.png!
> jstat -gcutil PID 2s
> !image-2020-12-11-16-08-53-656.png!
> !image-2020-12-11-16-10-07-363.png!
>  
> Try to change softValues to weakValues
> !image-2020-12-11-16-11-26-673.png!
> !image-2020-12-11-16-11-35-988.png!
> !image-2020-12-11-16-12-22-035.png!
>  
>  
>  
>  
>  






[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-33753:
---
Attachment: current_job_finish_time.png

> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
> Attachments: current_job_finish_time.png
>
>
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M).
>  
> Current:
> !image-2020-12-11-16-17-28-991.png!
> jstat -gcutil PID 2s
> !image-2020-12-11-16-08-53-656.png!
> !image-2020-12-11-16-10-07-363.png!
>  
> Try to change softValues to weakValues
> !image-2020-12-11-16-11-26-673.png!
> !image-2020-12-11-16-11-35-988.png!
> !image-2020-12-11-16-12-22-035.png!
>  
>  
>  
>  
>  






[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-33753:
---
Description: 
HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
 When the number of Hive partitions read by the driver is large, 
HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
 The executor also creates a jobconf, adds it to the cache, and shares it 
among executors.

The number of jobconfs in the driver cache increases memory pressure. When the 
driver memory configuration is low, full GC runs very frequently, and these 
jobconfs are hardly ever reused.

For example, with spark.driver.memory=2560m, about 14,000 partitions are read, 
and each jobconf is about 96 KB.

The following is a before/after comparison: full GC time decreased from 62s to 
0.8s, and the number of full GCs decreased from 31 to 5. The driver also used 
less memory (Old Gen 1.667G -> 968M).

 

Current:

!image-2020-12-11-16-17-28-991.png!

jstat -gcutil PID 2s

!image-2020-12-11-16-08-53-656.png!

!image-2020-12-11-16-10-07-363.png!

 

Try to change softValues to weakValues

!image-2020-12-11-16-11-26-673.png!

!image-2020-12-11-16-11-35-988.png!

!image-2020-12-11-16-12-22-035.png!

 

 

 

 

 

  was:
HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
 When the number of Hive partitions read by the driver is large, 
HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
 The executor also creates a jobconf, adds it to the cache, and shares it 
among executors.

The number of jobconfs in the driver cache increases memory pressure. When the 
driver memory configuration is low, full GC runs very frequently, and these 
jobconfs are hardly ever reused.

For example, with spark.driver.memory=2560m, about 14,000 partitions are read, 
and each jobconf is about 96 KB.

The following is a before/after comparison: full GC time decreased from 62s to 
0.8s, and the number of full GCs decreased from 31 to 5. The driver also used 
less memory (Old Gen 1.667G -> 968M).

 

Current:

!image-2020-12-11-16-08-23-861.png!

jstat -gcutil PID 2s

!image-2020-12-11-16-08-53-656.png!

!image-2020-12-11-16-10-07-363.png!

 

Try to change softValues to weakValues

!image-2020-12-11-16-11-26-673.png!

!image-2020-12-11-16-11-35-988.png!

!image-2020-12-11-16-12-22-035.png!

 

 

 

 

 


> Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
> ---
>
> Key: SPARK-33753
> URL: https://issues.apache.org/jira/browse/SPARK-33753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Priority: Minor
>
> HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
>  When the number of Hive partitions read by the driver is large, 
> HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
>  The executor also creates a jobconf, adds it to the cache, and shares it 
> among executors.
> The number of jobconfs in the driver cache increases memory pressure. When 
> the driver memory configuration is low, full GC runs very frequently, and 
> these jobconfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions are 
> read, and each jobconf is about 96 KB.
> The following is a before/after comparison: full GC time decreased from 62s 
> to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also 
> used less memory (Old Gen 1.667G -> 968M).
>  
> Current:
> !image-2020-12-11-16-17-28-991.png!
> jstat -gcutil PID 2s
> !image-2020-12-11-16-08-53-656.png!
> !image-2020-12-11-16-10-07-363.png!
>  
> Try to change softValues to weakValues
> !image-2020-12-11-16-11-26-673.png!
> !image-2020-12-11-16-11-35-988.png!
> !image-2020-12-11-16-12-22-035.png!
>  
>  
>  
>  
>  






[jira] [Created] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

2020-12-11 Thread dzcxzl (Jira)
dzcxzl created SPARK-33753:
--

 Summary: Reduce the memory footprint and gc of the cache 
(hadoopJobMetadata)
 Key: SPARK-33753
 URL: https://issues.apache.org/jira/browse/SPARK-33753
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: dzcxzl


HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf).
 When the number of Hive partitions read by the driver is large, 
HadoopRDD.getPartitions creates many jobconfs and adds them to the cache.
 The executor also creates a jobconf, adds it to the cache, and shares it 
among executors.

The number of jobconfs in the driver cache increases memory pressure. When the 
driver memory configuration is low, full GC runs very frequently, and these 
jobconfs are hardly ever reused.

For example, with spark.driver.memory=2560m, about 14,000 partitions are read, 
and each jobconf is about 96 KB.

The following is a before/after comparison: full GC time decreased from 62s to 
0.8s, and the number of full GCs decreased from 31 to 5. The driver also used 
less memory (Old Gen 1.667G -> 968M).

 

Current:

!image-2020-12-11-16-08-23-861.png!

jstat -gcutil PID 2s

!image-2020-12-11-16-08-53-656.png!

!image-2020-12-11-16-10-07-363.png!

 

Try to change softValues to weakValues

!image-2020-12-11-16-11-26-673.png!

!image-2020-12-11-16-11-35-988.png!

!image-2020-12-11-16-12-22-035.png!

 

 

 

 

 


