[jira] [Commented] (SPARK-25837) Web UI does not respect spark.ui.retainedJobs in some instances
[ https://issues.apache.org/jira/browse/SPARK-25837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794932#comment-16794932 ]

Xiaoju Wu commented on SPARK-25837:
-----------------------------------

Did you verify this fix with the reproduction case above? I tried it and found the issue is still there: the cleanup still gets backed up, although less than in the version without this fix.

> Web UI does not respect spark.ui.retainedJobs in some instances
> ---------------------------------------------------------------
>
> Key: SPARK-25837
> URL: https://issues.apache.org/jira/browse/SPARK-25837
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.3.1
> Environment: Reproduction Environment:
>   Spark 2.3.1
>   Dataproc 1.3-deb9
>   1x master 4 vCPUs, 15 GB
>   2x workers 4 vCPUs, 15 GB
> Reporter: Patrick Brown
> Assignee: Patrick Brown
> Priority: Minor
> Fix For: 2.3.3, 2.4.1, 3.0.0
> Attachments: Screen Shot 2018-10-23 at 4.40.51 PM (1).png
>
> Expected Behavior: Web UI only displays 1 completed job and remains
> responsive.
>
> Actual Behavior: Both during job execution and for a considerable time after
> all jobs have completed, the UI retains many completed jobs, causing limited
> responsiveness.
>
> To reproduce:
>
> spark-shell --conf spark.ui.retainedJobs=1
>
> scala> import scala.concurrent._
> scala> import scala.concurrent.ExecutionContext.Implicits.global
> scala> for (i <- 0 until 5) { Future { println(sc.parallelize(0 until i).collect.length) } }
>
> The attached screenshot shows the state of the web UI after running the repro
> code. The UI is displaying some 43k completed jobs (and takes a long time to
> load). After a few minutes of inactivity this clears out; however, in an
> application that continues to submit jobs every once in a while, the issue
> persists.
>
> The issue seems to appear when running multiple jobs at once as well as in
> sequence for a while, and may also have something to do with high master
> CPU usage (thus the collect in the repro code). My rough guess would be that
> whatever is managing clearing out completed jobs gets overwhelmed (on the
> master during repro, htop reported almost full CPU usage across all 4 cores).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25837) Web UI does not respect spark.ui.retainedJobs in some instances
[ https://issues.apache.org/jira/browse/SPARK-25837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667707#comment-16667707 ]

Apache Spark commented on SPARK-25837:
--------------------------------------

User 'patrickbrownsync' has created a pull request for this issue:
https://github.com/apache/spark/pull/22883
[jira] [Commented] (SPARK-25837) Web UI does not respect spark.ui.retainedJobs in some instances
[ https://issues.apache.org/jira/browse/SPARK-25837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667646#comment-16667646 ]

Patrick Brown commented on SPARK-25837:
---------------------------------------

The fundamental problem seems to be in the `cleanupStages` method of `AppStatusListener`. With the repro code above, stages and tasks sometimes (not always) get slightly backed up. When this occurs, the iteration through tasks starts taking longer and longer:

```
val tasks = kvstore.view(classOf[TaskDataWrapper])
  .index("stage")
  .first(key)
  .last(key)
  .asScala
```

This seems to be because, for each stage, we then iterate through all the tasks (of which there can be ~400k in this repro code). That iteration can go from taking ~10ms before the backup to ~300ms afterwards due to the large number of tasks. This causes a feedback loop in which the `cleanupStages` method cannot keep up.
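
The feedback loop described in the comment above can be illustrated with a toy model. This is only a sketch and not Spark code: `CleanupBacklogSketch`, `simulate`, and every parameter and constant below are hypothetical, and the model simply assumes that cleaning one stage costs a fixed amount plus an amount proportional to the number of tasks still held in the store.

```scala
// Toy model of a cleanup loop whose per-stage cost grows with the backlog.
// Hypothetical names and units throughout; this does not use any Spark API.
object CleanupBacklogSketch {

  /** Runs `ticks` rounds and returns the number of stages still awaiting
    * cleanup at the end of each round.
    */
  def simulate(
      ticks: Int,
      stagesPerTick: Int,       // stages that finish in each round
      tasksPerStage: Int,       // tasks stored for each finished stage
      budgetPerTick: Long,      // time units the cleaner may spend per round
      baseCost: Long,           // fixed cost of cleaning one stage
      costPerStoredTask: Long   // extra cost per task currently in the store
  ): Vector[Int] = {
    var pendingStages = 0
    val history = Vector.newBuilder[Int]

    for (_ <- 0 until ticks) {
      pendingStages += stagesPerTick
      var remaining = budgetPerTick
      var canClean = true

      while (canClean && pendingStages > 0) {
        // Assumption: the scan for one stage touches every stored task.
        val storedTasks = pendingStages.toLong * tasksPerStage
        val cost = baseCost + costPerStoredTask * storedTasks
        if (cost <= remaining) {
          remaining -= cost
          pendingStages -= 1
        } else {
          canClean = false
        }
      }
      history += pendingStages
    }
    history.result()
  }
}
```

Calling `CleanupBacklogSketch.simulate(5, 5, 100, 1000L, 100L, 0L)` models a scan whose cost does not depend on the backlog, and the cleaner keeps up. Changing the last argument to `1L` makes each scan more expensive as tasks accumulate, and in this model the backlog grows every round once the cleaner falls behind.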
[jira] [Commented] (SPARK-25837) Web UI does not respect spark.ui.retainedJobs in some instances
[ https://issues.apache.org/jira/browse/SPARK-25837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664051#comment-16664051 ]

Patrick Brown commented on SPARK-25837:
---------------------------------------

I would be interested in and happy to tackle this, if it's an issue that the community agrees should be addressed.