[jira] [Resolved] (SPARK-33740) hadoop configs in hive-site.xml can override pre-existing hadoop ones
[ https://issues.apache.org/jira/browse/SPARK-33740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33740. --- Fix Version/s: 3.1.0 3.0.2 Assignee: Kent Yao Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/30709 > hadoop configs in hive-site.xml can override pre-existing hadoop ones > -- > > Key: SPARK-33740 > URL: https://issues.apache.org/jira/browse/SPARK-33740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0, 3.2.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.2, 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
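The intended precedence can be illustrated with a short sketch (a hypothetical helper, not Spark's actual code path): keys already present in the Hadoop configuration should win over values loaded later from hive-site.xml.

```python
def merge_configs(hadoop_conf, hive_site_conf):
    """Sketch of the precedence SPARK-33740 asks for (hypothetical helper,
    not Spark's API): pre-existing hadoop entries must not be overridden
    by values coming from hive-site.xml."""
    merged = dict(hive_site_conf)
    merged.update(hadoop_conf)  # hadoop entries take precedence on key collisions
    return merged
```

The bug was the reverse order of this merge: hive-site.xml values clobbering keys the user had already set.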
[jira] [Created] (SPARK-33756) BytesToBytesMap's iterator hasNext method should be idempotent.
Xianjin YE created SPARK-33756: -- Summary: BytesToBytesMap's iterator hasNext method should be idempotent. Key: SPARK-33756 URL: https://issues.apache.org/jira/browse/SPARK-33756 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Xianjin YE BytesToBytesMap's MapIterator's hasNext method is not idempotent. {code:java} public boolean hasNext() { if (numRecords == 0) { if (reader != null) { // if called multiple times, this will throw NoSuchElementException handleFailedDelete(); } } return numRecords > 0; } {code} Multiple calls to `hasNext` will call `handleFailedDelete()` multiple times, which throws NoSuchElementException because spillWrites is already empty. We observed this issue in one of our production jobs after upgrading to Spark 3.0.
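The fix the issue asks for can be sketched in a language-neutral way (Python here, with hypothetical names; Spark's real MapIterator is Java): guard the one-shot cleanup with a flag so repeated hasNext calls after exhaustion stay side-effect free.

```python
class SpillAwareIterator:
    """Sketch of an iterator whose has_next is safe to call repeatedly.

    Hypothetical stand-in for BytesToBytesMap's MapIterator: the cleanup
    step (deleting spill files) must run at most once, so calling
    has_next() again after exhaustion has no side effects.
    """

    def __init__(self, records):
        self._records = list(records)
        self._cleaned_up = False

    def _cleanup_spills(self):
        # In Spark this deletes on-disk spill writers; running it twice
        # is what raised NoSuchElementException in the original bug.
        if self._cleaned_up:
            raise RuntimeError("cleanup ran twice")
        self._cleaned_up = True

    def has_next(self):
        if not self._records and not self._cleaned_up:
            self._cleanup_spills()  # guarded: runs exactly once
        return bool(self._records)

    def next(self):
        return self._records.pop(0)
```

Without the `_cleaned_up` guard, a second `has_next()` on an exhausted iterator would re-enter the cleanup and fail.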
[jira] [Created] (SPARK-33755) Allow creating orc table when row format separator is defined
xiepengjie created SPARK-33755: -- Summary: Allow creating orc table when row format separator is defined Key: SPARK-33755 URL: https://issues.apache.org/jira/browse/SPARK-33755 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.2 Reporter: xiepengjie
[jira] [Commented] (SPARK-33730) Standardize warning types
[ https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247788#comment-17247788 ] Shril Kumar commented on SPARK-33730: - [~hyukjin.kwon], [~zero323] can I pick this up? > Standardize warning types > - > > Key: SPARK-33730 > URL: https://issues.apache.org/jira/browse/SPARK-33730 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should use warnings properly per > [https://docs.python.org/3/library/warnings.html#warning-categories] > In particular, > - we should use {{FutureWarning}} instead of {{DeprecationWarning}} for the > places we should show the warnings to end-users by default. > - we should __maybe__ think about customizing stacklevel > ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas > does. > - ... > Current warnings are a bit messy and somewhat arbitrary. > To be more explicit, we'll have to fix: > {code:java} > pyspark/context.py:warnings.warn( > pyspark/context.py:warnings.warn( > pyspark/ml/classification.py:warnings.warn("weightCol is > ignored, " > pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will > be removed in future versions. Use " > pyspark/mllib/classification.py:warnings.warn( > pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd > are false. 
The model does nothing.") > pyspark/mllib/regression.py:warnings.warn( > pyspark/mllib/regression.py:warnings.warn( > pyspark/mllib/regression.py:warnings.warn( > pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; " > pyspark/rdd.py:warnings.warn( > pyspark/shell.py:warnings.warn("Failed to initialize Spark session.") > pyspark/shuffle.py:warnings.warn("Please install psutil to have > better " > pyspark/sql/catalog.py:warnings.warn( > pyspark/sql/catalog.py:warnings.warn( > pyspark/sql/column.py:warnings.warn( > pyspark/sql/column.py:warnings.warn( > pyspark/sql/context.py:warnings.warn( > pyspark/sql/context.py:warnings.warn( > pyspark/sql/context.py:warnings.warn( > pyspark/sql/context.py:warnings.warn( > pyspark/sql/context.py:warnings.warn( > pyspark/sql/dataframe.py:warnings.warn( > pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict > and value is not None. value will be ignored.") > pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees > instead.", DeprecationWarning) > pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians > instead.", DeprecationWarning) > pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use > approx_count_distinct instead.", DeprecationWarning) > pyspark/sql/pandas/conversion.py:warnings.warn(msg) > pyspark/sql/pandas/conversion.py:warnings.warn(msg) > pyspark/sql/pandas/conversion.py:warnings.warn(msg) > pyspark/sql/pandas/conversion.py:warnings.warn(msg) > pyspark/sql/pandas/conversion.py:warnings.warn(msg) > pyspark/sql/pandas/functions.py:warnings.warn( > pyspark/sql/pandas/group_ops.py:warnings.warn( > pyspark/sql/session.py:warnings.warn("Fall back to non-hive > support because failing to access HiveConf, " > {code} > PySpark also prints warnings using {{print}} in some places. We should > see whether to switch those to {{warnings.warn}} as well.
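As a sketch of the convention proposed in SPARK-33730 (function names here are illustrative, not PySpark APIs), a user-facing deprecation would use {{FutureWarning}} plus an explicit stacklevel so the warning points at the caller's line, as pandas does:

```python
import warnings


def old_api():
    # FutureWarning is shown to end users by default, whereas
    # DeprecationWarning is hidden outside of __main__/test runs.
    # stacklevel=2 attributes the warning to the caller, not this frame.
    warnings.warn(
        "old_api is deprecated; use new_api instead.",
        FutureWarning,
        stacklevel=2,
    )
    return new_api()


def new_api():
    return 42
```

With {{stacklevel=2}}, the reported file and line belong to the user's script, which is what makes the deprecation actionable.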
[jira] [Commented] (SPARK-33527) Extend the function of decode so as to be consistent with mainstream databases
[ https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247786#comment-17247786 ] Apache Spark commented on SPARK-33527: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/30727 > Extend the function of decode so as to be consistent with mainstream databases > > > Key: SPARK-33527 > URL: https://issues.apache.org/jira/browse/SPARK-33527 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.2.0 > > > In Spark, decode(bin, charset) - Decodes the first argument using the second > argument character set. > Unfortunately this is NOT what any other SQL vendor understands DECODE to do. > DECODE is generally shorthand for a simple case expression: > {code:java} > SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS > T(c1) > => > (Hello), > (World), > (!) > {code}
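The DECODE semantics described above can be modeled in a few lines of Python (a sketch of the Oracle-style behavior, not Spark's implementation):

```python
def decode(expr, *args):
    """Sketch of Oracle-style DECODE(expr, s1, r1, s2, r2, ..., default).

    Compare expr against each search value in turn and return the paired
    result, falling back to the trailing default (or None) when nothing
    matches -- i.e. shorthand for a simple CASE expression.
    """
    pairs, default = args, None
    if len(args) % 2 == 1:  # odd argument count: the last one is the default
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[::2], pairs[1::2]):
        if expr == search:
            return result
    return default
```

Applied to the issue's example, `decode(c1, 1, 'Hello', 2, 'World', '!')` yields 'Hello', 'World', '!' for c1 = 1, 2, 3.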
[jira] [Commented] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the default Hadoop profile updated
[ https://issues.apache.org/jira/browse/SPARK-33754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247776#comment-17247776 ] Apache Spark commented on SPARK-33754: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/30726 > Update kubernetes/integration-tests/README.md to follow the default Hadoop > profile updated > -- > > Key: SPARK-33754 > URL: https://issues.apache.org/jira/browse/SPARK-33754 > Project: Spark > Issue Type: Improvement > Components: docs, Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > kubernetes/integration-tests/README.md describes how to run the integration > tests for Kubernetes as follows. > {code} > To run tests with Hadoop 3.2 instead of Hadoop 2.7, use `--hadoop-profile`. > ./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-2.7 > {code} > In the current master, the default Hadoop profile is hadoop-3.2, so it's > better to update the document.
[jira] [Comment Edited] (SPARK-33731) Standardize exception types
[ https://issues.apache.org/jira/browse/SPARK-33731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1724#comment-1724 ] Shril Kumar edited comment on SPARK-33731 at 12/11/20, 9:20 AM: [~hyukjin.kwon] can you help me pick up this issue? I can contribute to this. was (Author: shril): [~hyukjin.kwon] can you help me pick up this issue? > Standardize exception types > --- > > Key: SPARK-33731 > URL: https://issues.apache.org/jira/browse/SPARK-33731 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should: > - have a better hierarchy for exception types > - or at least use the default type of exceptions correctly instead of just > throwing a plain Exception.
[jira] [Assigned] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the default Hadoop profile updated
[ https://issues.apache.org/jira/browse/SPARK-33754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33754: Assignee: Kousuke Saruta (was: Apache Spark) > Update kubernetes/integration-tests/README.md to follow the default Hadoop > profile updated > --
[jira] [Assigned] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the default Hadoop profile updated
[ https://issues.apache.org/jira/browse/SPARK-33754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33754: Assignee: Apache Spark (was: Kousuke Saruta) > Update kubernetes/integration-tests/README.md to follow the default Hadoop > profile updated > --
[jira] [Commented] (SPARK-33731) Standardize exception types
[ https://issues.apache.org/jira/browse/SPARK-33731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1724#comment-1724 ] Shril Kumar commented on SPARK-33731: - [~hyukjin.kwon] can you help me pick up this issue? > Standardize exception types > ---
[jira] [Created] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the default Hadoop profile updated
Kousuke Saruta created SPARK-33754: -- Summary: Update kubernetes/integration-tests/README.md to follow the default Hadoop profile updated Key: SPARK-33754 URL: https://issues.apache.org/jira/browse/SPARK-33754 Project: Spark Issue Type: Improvement Components: docs, Kubernetes, Tests Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta kubernetes/integration-tests/README.md describes how to run the integration tests for Kubernetes as follows. {code} To run tests with Hadoop 3.2 instead of Hadoop 2.7, use `--hadoop-profile`. ./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-2.7 {code} In the current master, the default Hadoop profile is hadoop-3.2, so it's better to update the document.
[jira] [Commented] (SPARK-33737) Use an Informer+Lister API in the ExecutorPodWatcher
[ https://issues.apache.org/jira/browse/SPARK-33737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247759#comment-17247759 ] Stavros Kontopoulos commented on SPARK-33737: - In addition, the current implementation has been out for a long time and is stable. We need to be sure that any updates will not cause issues. I can work on a PR and see how things integrate. > Use an Informer+Lister API in the ExecutorPodWatcher > > > Key: SPARK-33737 > URL: https://issues.apache.org/jira/browse/SPARK-33737 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Stavros Kontopoulos >Priority: Major > > The Kubernetes backend uses the Fabric8 client and a > [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] > to monitor the K8s API server for pod changes. Every watcher keeps a > websocket connection open and has no caching mechanism at that layer. Caching > at the Spark K8s resource manager exists in other areas where we are hitting > the API server for Pod CRUD ops, like > [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. > In an environment where many connections are kept open due to large-scale jobs, this > could be problematic and impose a lot of load on the API server. Many > long-running jobs (e.g. streaming jobs) do not produce enough pod changes to > justify a continuous watching mechanism. > Latest Fabric8 client versions have implemented a SharedInformer API+Lister; > an example can be found > [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. 
> This new API follows the implementation of the official Java K8s client and > its Go counterpart, and it is backed by a caching mechanism that is > re-synced after a configurable period to avoid hitting the API server all the > time. There is also a lister that keeps track of the current status of resources. > Using such a mechanism is commonplace when implementing a K8s controller. > The suggestion is to update the client to v4.13.0 (which has all the relevant updates for that API) and use the informer+lister API where applicable. > I think the lister could also replace part of the snapshotting/notification > mechanism. > /cc [~dongjoon] [~eje] [~holden] WDYT? >
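A minimal model of the informer+lister pattern discussed above (all names are hypothetical; this is not the Fabric8 API): watch events update a local cache, reads are served from the cache, and a periodic full resync bounds staleness instead of hitting the API server on every lookup.

```python
import threading
import time


class SharedInformerSketch:
    """Hypothetical sketch of an informer+lister cache.

    fetch_all is an illustrative callback standing in for a full LIST call
    to the API server; on_event stands in for watch notifications.
    """

    def __init__(self, fetch_all, resync_seconds=30.0):
        self._fetch_all = fetch_all          # full-list call to the API server
        self._resync_seconds = resync_seconds
        self._cache = {}
        self._lock = threading.Lock()
        self._last_sync = None

    def _maybe_resync(self):
        now = time.monotonic()
        if self._last_sync is None or now - self._last_sync >= self._resync_seconds:
            fresh = self._fetch_all()        # only path that touches the server
            with self._lock:
                self._cache = dict(fresh)
                self._last_sync = now

    def on_event(self, name, pod):
        # Watch events update the cache incrementally; pod=None means deleted.
        with self._lock:
            if pod is None:
                self._cache.pop(name, None)
            else:
                self._cache[name] = pod

    def get(self, name):
        # Lister: reads are served from the local cache, not the API server.
        self._maybe_resync()
        with self._lock:
            return self._cache.get(name)
```

The design point is the one the comment makes: between resyncs, any number of lookups cost zero API-server round trips.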
[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-33753: --- Description: HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf). When the number of hive partitions read by the driver is large, HadoopRDD.getPartitions will create many jobconfs and add them to the cache. The executor will also create a jobconf, add it to the cache, and share it among executors. The number of jobconfs in the driver cache increases memory pressure. When the driver memory configuration is not high, full GC becomes very frequent, and these jobconfs are hardly ever reused. For example, with spark.driver.memory=2560m, about 14,000 partitions are read and each jobconf is about 96 KB. !jobconf.png! The following compares behavior before and after the fix: full GC time decreased from 62s to 0.8s, and the number of collections decreased from 31 to 5. The driver also allocated less memory (Old Gen 1.667G->968M), and job execution time is reduced. Current: !current_job_finish_time.png! jstat -gcutil PID 2s !current_gcutil.png! !current_visual_gc.png! Try changing softValues to weakValues: !fix_job_finish_time.png! !fix_gcutil.png! !fix_visual_gc.png! was: HadoopRDD uses soft-reference map to cache jobconf (rdd_id -> jobconf). When the number of hive partitions read by the driver is large, HadoopRDD.getPartitions will create many jobconfs and add them to the cache. The executor will also create a jobconf, add it to the cache, and share it among exeuctors. The number of jobconfs in the driver cache increases the memory pressure. When the driver memory configuration is not high, full gc will be frequently used, and these jobconfs are hardly reused. For example, spark.driver.memory=2560m, the read partition is about 14,000, and a jobconf 96kb. !jobconf.png! The following is a repair comparison, full gc decreased from 62s to 0.8s, and the number of times decreased from 31 to 5. And the driver applied for less memory (Old Gen 1.667G->968M), the job execution time is also reduced. Current: !current_job_finish_time.png! jstat -gcutil PID 2s !current_gcutil.png! !current_visual_gc.png! Try to change softValues to weakValues !fix_job_finish_time.png! !fix_gcutil.png! !fix_visual_gc.png! > Reduce the memory footprint and gc of the cache (hadoopJobMetadata) > --- > > Key: SPARK-33753 > URL: https://issues.apache.org/jira/browse/SPARK-33753 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: dzcxzl >Priority: Minor > Attachments: current_gcutil.png, current_job_finish_time.png, > current_visual_gc.png, fix_gcutil.png, fix_job_finish_time.png, > fix_visual_gc.png, jobconf.png > --
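Python's {{weakref.WeakValueDictionary}} is the closest stdlib analog to the proposed softValues -> weakValues change (a sketch, not Spark's Guava-backed cache): a weakly referenced value becomes collectable as soon as its last strong reference goes away, so rarely reused jobconfs stop pinning driver memory.

```python
import gc
import weakref


class JobConf:
    """Stand-in for a large per-partition Hadoop JobConf (~96 KB each)."""
    def __init__(self, rdd_id):
        self.rdd_id = rdd_id


# Weak-valued cache: entries vanish once the last strong reference is gone,
# which is the behavior the issue proposes for the hadoopJobMetadata cache.
cache = weakref.WeakValueDictionary()

conf = JobConf(rdd_id=1)
cache["rdd_1"] = conf
assert cache.get("rdd_1") is conf   # reachable while strongly referenced

del conf                            # drop the only strong reference
gc.collect()                        # CPython frees immediately; collect for safety
assert cache.get("rdd_1") is None   # the weak value was reclaimed
```

Soft references, by contrast, are only cleared under memory pressure, which is why the old entries survived long enough to drive the frequent full GCs described above.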
[jira] [Commented] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247755#comment-17247755 ] Apache Spark commented on SPARK-33753: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/30725 > Reduce the memory footprint and gc of the cache (hadoopJobMetadata) > ---
[jira] [Assigned] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33753: Assignee: (was: Apache Spark) > Reduce the memory footprint and gc of the cache (hadoopJobMetadata) > ---
[jira] [Assigned] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33753: Assignee: Apache Spark > Reduce the memory footprint and gc of the cache (hadoopJobMetadata) > ---
[jira] [Commented] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247754#comment-17247754 ] Apache Spark commented on SPARK-33753: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/30725 > Reduce the memory footprint and gc of the cache (hadoopJobMetadata) > ---
[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-33753: --- Description: HadoopRDD uses soft-reference map to cache jobconf (rdd_id -> jobconf). When the number of hive partitions read by the driver is large, HadoopRDD.getPartitions will create many jobconfs and add them to the cache. The executor will also create a jobconf, add it to the cache, and share it among executors. The number of jobconfs in the driver cache increases the memory pressure. When the driver memory configuration is not high, full gc will be frequently used, and these jobconfs are hardly reused. For example, spark.driver.memory=2560m, the read partition is about 14,000, and a jobconf 96kb. !jobconf.png! The following is a repair comparison, full gc decreased from 62s to 0.8s, and the number of times decreased from 31 to 5. And the driver applied for less memory (Old Gen 1.667G->968M), the job execution time is also reduced. Current: !current_job_finish_time.png! jstat -gcutil PID 2s !current_gcutil.png! !current_visual_gc.png! Try to change softValues to weakValues !fix_job_finish_time.png! !fix_gcutil.png! !fix_visual_gc.png! was: HadoopRDD uses soft-reference map to cache jobconf (rdd_id -> jobconf). When the number of hive partitions read by the driver is large, HadoopRDD.getPartitions will create many jobconfs and add them to the cache. The executor will also create a jobconf, add it to the cache, and share it among executors. The number of jobconfs in the driver cache increases the memory pressure. When the driver memory configuration is not high, full gc will be frequently used, and these jobconfs are hardly reused. For example, spark.driver.memory=2560m, the read partition is about 14,000, and a jobconf 96kb. The following is a repair comparison, full gc decreased from 62s to 0.8s, and the number of times decreased from 31 to 5. 
And the driver applied for less memory (Old Gen 1.667G->968M), the job execution time is also reduced. Current: !current_job_finish_time.png! jstat -gcutil PID 2s !current_gcutil.png! !current_visual_gc.png! Try to change softValues to weakValues !fix_job_finish_time.png! !fix_gcutil.png! !fix_visual_gc.png! > Reduce the memory footprint and gc of the cache (hadoopJobMetadata) > ---
> > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
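The proposed fix swaps the value-reference type of the driver-side hadoopJobMetadata cache from soft to weak. A minimal, stdlib-only Java sketch of that distinction (the map, class, and variable names here are illustrative, not Spark's actual code; Spark builds this cache with Guava): a weakly-referenced value becomes collectible as soon as no strong reference remains, while a softly-referenced one is retained until the JVM is under memory pressure, which is why soft-valued jobconf entries can pile up in the driver's old generation.

```java
import java.lang.ref.Reference;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: compares SoftReference vs WeakReference
// cache values, mirroring the softValues -> weakValues change.
public class RefCacheSketch {
    public static void main(String[] args) {
        Map<Integer, Reference<byte[]>> cache = new HashMap<>();

        byte[] conf = new byte[96 * 1024];        // ~96 KB, like one jobconf
        cache.put(0, new SoftReference<>(conf));  // current behavior (softValues)
        cache.put(1, new WeakReference<>(conf));  // proposed behavior (weakValues)

        // While a strong reference ('conf') exists, both lookups succeed.
        System.out.println(cache.get(0).get() != null); // true
        System.out.println(cache.get(1).get() != null); // true

        conf = null;  // drop the strong reference
        System.gc();  // hint only; timing is not guaranteed by the JLS

        // On HotSpot the weak entry is typically cleared promptly, while
        // the soft entry usually survives until the heap is nearly full.
        System.out.println("soft cleared: " + (cache.get(0).get() == null));
        System.out.println("weak cleared: " + (cache.get(1).get() == null));
    }
}
```

Note that neither clearing point is deterministic; the observed GC improvement comes from weak values being reclaimable on every collection rather than only under memory pressure.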
[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-33753:
---
Attachment: jobconf.png

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-33753:
---
Description:
HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf). When the number of Hive partitions read by the driver is large, HadoopRDD.getPartitions creates many jobconfs and adds them to the cache. The executor also creates a jobconf, adds it to the cache, and shares it among executors. The jobconfs in the driver cache increase memory pressure. When the driver memory configuration is not high, full GC runs frequently, and these jobconfs are rarely reused. For example, with spark.driver.memory=2560m, about 14,000 partitions are read and each jobconf is about 96 KB.

The following compares runs before and after the fix: total full GC time decreased from 62s to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also used less memory (Old Gen 1.667G -> 968M), and the job execution time was reduced.

Current:
!current_job_finish_time.png!
jstat -gcutil PID 2s
!current_gcutil.png!
!current_visual_gc.png!

After changing softValues to weakValues:
!fix_job_finish_time.png!
!fix_gcutil.png!
!fix_visual_gc.png!

was:
HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf). When the number of Hive partitions read by the driver is large, HadoopRDD.getPartitions creates many jobconfs and adds them to the cache. The executor also creates a jobconf, adds it to the cache, and shares it among executors. The jobconfs in the driver cache increase memory pressure. When the driver memory configuration is not high, full GC runs frequently, and these jobconfs are rarely reused. For example, with spark.driver.memory=2560m, about 14,000 partitions are read and each jobconf is about 96 KB. The following compares runs before and after the fix: total full GC time decreased from 62s to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also used less memory (Old Gen 1.667G -> 968M). Current: !image-2020-12-11-16-17-28-991.png! jstat -gcutil PID 2s !image-2020-12-11-16-08-53-656.png! !image-2020-12-11-16-10-07-363.png! After changing softValues to weakValues: !image-2020-12-11-16-11-26-673.png! !image-2020-12-11-16-11-35-988.png! !image-2020-12-11-16-12-22-035.png!

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-33753:
---
Attachment: fix_visual_gc.png fix_job_finish_time.png fix_gcutil.png

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-33753:
---
Attachment: current_gcutil.png

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-33753:
---
Attachment: current_visual_gc.png

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-33753:
---
Attachment: current_job_finish_time.png

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-33753:
---
Description:
HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf). When the number of Hive partitions read by the driver is large, HadoopRDD.getPartitions creates many jobconfs and adds them to the cache. The executor also creates a jobconf, adds it to the cache, and shares it among executors. The jobconfs in the driver cache increase memory pressure. When the driver memory configuration is not high, full GC runs frequently, and these jobconfs are rarely reused. For example, with spark.driver.memory=2560m, about 14,000 partitions are read and each jobconf is about 96 KB.

The following compares runs before and after the fix: total full GC time decreased from 62s to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also used less memory (Old Gen 1.667G -> 968M).

Current:
!image-2020-12-11-16-17-28-991.png!
jstat -gcutil PID 2s
!image-2020-12-11-16-08-53-656.png!
!image-2020-12-11-16-10-07-363.png!

After changing softValues to weakValues:
!image-2020-12-11-16-11-26-673.png!
!image-2020-12-11-16-11-35-988.png!
!image-2020-12-11-16-12-22-035.png!

was:
HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf). When the number of Hive partitions read by the driver is large, HadoopRDD.getPartitions creates many jobconfs and adds them to the cache. The executor also creates a jobconf, adds it to the cache, and shares it among executors. The jobconfs in the driver cache increase memory pressure. When the driver memory configuration is not high, full GC runs frequently, and these jobconfs are rarely reused. For example, with spark.driver.memory=2560m, about 14,000 partitions are read and each jobconf is about 96 KB. The following compares runs before and after the fix: total full GC time decreased from 62s to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also used less memory (Old Gen 1.667G -> 968M). Current: !image-2020-12-11-16-08-23-861.png! jstat -gcutil PID 2s !image-2020-12-11-16-08-53-656.png! !image-2020-12-11-16-10-07-363.png! After changing softValues to weakValues: !image-2020-12-11-16-11-26-673.png! !image-2020-12-11-16-11-35-988.png! !image-2020-12-11-16-12-22-035.png!

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33753) Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
dzcxzl created SPARK-33753:
---
Summary: Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
Key: SPARK-33753
URL: https://issues.apache.org/jira/browse/SPARK-33753
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.0.1
Reporter: dzcxzl

HadoopRDD uses a soft-reference map to cache jobconfs (rdd_id -> jobconf). When the number of Hive partitions read by the driver is large, HadoopRDD.getPartitions creates many jobconfs and adds them to the cache. The executor also creates a jobconf, adds it to the cache, and shares it among executors. The jobconfs in the driver cache increase memory pressure. When the driver memory configuration is not high, full GC runs frequently, and these jobconfs are rarely reused. For example, with spark.driver.memory=2560m, about 14,000 partitions are read and each jobconf is about 96 KB.

The following compares runs before and after the fix: total full GC time decreased from 62s to 0.8s, and the number of full GCs decreased from 31 to 5. The driver also used less memory (Old Gen 1.667G -> 968M).

Current:
!image-2020-12-11-16-08-23-861.png!
jstat -gcutil PID 2s
!image-2020-12-11-16-08-53-656.png!
!image-2020-12-11-16-10-07-363.png!

After changing softValues to weakValues:
!image-2020-12-11-16-11-26-673.png!
!image-2020-12-11-16-11-35-988.png!
!image-2020-12-11-16-12-22-035.png!

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org