[jira] [Commented] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources
[ https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150055#comment-15150055 ]

Zhen Peng commented on SPARK-8119:
----------------------------------

Hi [~srowen], I think this is a serious bug. Is there any reason for not back-porting the fix to 1.4.x?

> HeartbeatReceiver should not adjust application executor resources
> ------------------------------------------------------------------
>
>                 Key: SPARK-8119
>                 URL: https://issues.apache.org/jira/browse/SPARK-8119
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: SaintBacchus
>            Assignee: Andrew Or
>            Priority: Critical
>             Fix For: 1.5.0
>
> Dynamic allocation sets the total executor count to a small number when it
> wants to kill some executors. But Spark also sets the total executor count
> in the non-dynamic-allocation scenario, which causes this problem: when an
> executor goes down, no new executor is ever brought up by Spark to replace it.
>
> === EDIT by andrewor14 ===
> The issue is that the AM forgets the original number of executors it wants
> after sc.killExecutor is called. Even if dynamic allocation is not enabled,
> this is still possible because of heartbeat timeouts.
> I think the problem is that sc.killExecutor is used incorrectly in
> HeartbeatReceiver. The intention of the method is to permanently adjust the
> number of executors the application will get. In HeartbeatReceiver, however,
> it is used as a best-effort mechanism to ensure that the timed-out executor
> is dead.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
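The two kill semantics described above can be illustrated with a toy model. This is not Spark code; `ExecutorAllocator` and its methods are hypothetical names invented for the sketch. It only shows why a "permanently lower the target" kill, reused for best-effort cleanup, leaves the lost executor unreplaced:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the SPARK-8119 failure mode (hypothetical class, not Spark's).
class ExecutorAllocator {
    int targetTotal;                        // executors the application wants
    final Set<String> live = new HashSet<>();

    ExecutorAllocator(int targetTotal) {
        this.targetTotal = targetTotal;
    }

    // Semantics of sc.killExecutor as described in the ticket:
    // permanently lowers the application's target executor count.
    void killAndAdjustTarget(String id) {
        live.remove(id);
        targetTotal -= 1;
    }

    // What a heartbeat-timeout handler actually needs: ensure the executor
    // is gone without changing how many executors the app wants overall.
    void killButKeepTarget(String id) {
        live.remove(id);
    }

    // How many replacement executors the cluster manager should launch.
    int replacementsNeeded() {
        return targetTotal - live.size();
    }
}
```

With the first method, a heartbeat timeout leaves `replacementsNeeded()` at zero, so the dead executor is never replaced; with the second, the gap between target and live executors triggers a replacement request.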
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724609#comment-14724609 ]

Zhen Peng commented on SPARK-6235:
----------------------------------

Hi [~rxin], is there any update on this issue?

> Address various 2G limits
> -------------------------
>
>                 Key: SPARK-6235
>                 URL: https://issues.apache.org/jira/browse/SPARK-6235
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Shuffle, Spark Core
>            Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limits we have in Spark, due to
> the use of byte arrays and ByteBuffers.
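The 2G limit the umbrella ticket refers to comes from the JVM itself: arrays and `ByteBuffer`s are indexed by `int`, so a single buffer caps out at `Integer.MAX_VALUE` (2^31 - 1) bytes. The sketch below (an illustration, not Spark's fix; `chunksNeeded` is a made-up helper) shows the cap and the usual workaround of splitting data into sub-2GB chunks:

```java
import java.nio.ByteBuffer;

public class TwoGigLimit {
    // The hard cap: byte[] and ByteBuffer use an int index, so one block can
    // hold at most Integer.MAX_VALUE (2^31 - 1) bytes, just under 2 GB.
    static final long MAX_BLOCK_BYTES = Integer.MAX_VALUE;

    // Common workaround: ceiling-divide a large payload into chunks that
    // each fit in a single buffer, instead of materializing one giant one.
    static int chunksNeeded(long totalBytes, int chunkSize) {
        return (int) ((totalBytes + chunkSize - 1) / chunkSize);
    }

    // ByteBuffer.allocate takes an int capacity: small buffers are fine,
    // but a >2GB allocation cannot even be expressed through this API.
    static ByteBuffer smallBuffer() {
        return ByteBuffer.allocate(64 * 1024);
    }
}
```

For example, a 5 GB shuffle block would need five 1 GB chunks, since no single `ByteBuffer` can hold it.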
[jira] [Commented] (SPARK-938) OpenStack Swift Storage Support
[ https://issues.apache.org/jira/browse/SPARK-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016231#comment-14016231 ]

Zhen Peng commented on SPARK-938:
---------------------------------

Hi, is there any follow-up on this issue?

> OpenStack Swift Storage Support
> -------------------------------
>
>                 Key: SPARK-938
>                 URL: https://issues.apache.org/jira/browse/SPARK-938
>             Project: Spark
>          Issue Type: New Feature
>          Components: Documentation, Examples, Input/Output, Spark Core
>    Affects Versions: 0.8.1
>            Reporter: Murali Raju
>            Priority: Minor
>
> This issue is to track OpenStack Swift Storage support (development in
> progress) in addition to S3 for Spark.
[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM
[ https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009532#comment-14009532 ]

Zhen Peng commented on SPARK-1928:
----------------------------------

[~gq] I hit this case in our local-mode Spark Streaming application, and in
the unit tests I have added a test case to simulate it.

> DAGScheduler suspended by local task OOM
> ----------------------------------------
>
>                 Key: SPARK-1928
>                 URL: https://issues.apache.org/jira/browse/SPARK-1928
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0
>            Reporter: Zhen Peng
>             Fix For: 1.0.0
>
> DAGScheduler does not handle a local task OOM properly, and will wait for
> the job result forever.
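The hang happens because `OutOfMemoryError` is an `Error`, not an `Exception`: a task runner that catches only `Exception` never reports the failure, so the scheduler waits for a result forever. A minimal sketch of the difference (hypothetical method names, not the actual Spark patch):

```java
// Sketch of the SPARK-1928 failure mode: an Exception-only handler lets an
// OutOfMemoryError escape, so no failure is ever reported to the scheduler.
public class TaskRunnerSketch {

    static String runTaskCatchingException(Runnable task) {
        try {
            task.run();
            return "success";
        } catch (Exception e) {
            // OutOfMemoryError is NOT an Exception, so it propagates past
            // this handler and no result reaches the DAGScheduler.
            return "failed: " + e.getClass().getSimpleName();
        }
    }

    static String runTaskCatchingThrowable(Runnable task) {
        try {
            task.run();
            return "success";
        } catch (Throwable t) {
            // Throwable covers Error as well, so the OOM is reported back
            // as a task failure instead of silently killing the run.
            return "failed: " + t.getClass().getSimpleName();
        }
    }
}
```

With `catch (Throwable)`, a simulated OOM is turned into a reported failure; with `catch (Exception)`, it escapes the handler entirely.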
[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM
[ https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009477#comment-14009477 ]

Zhen Peng commented on SPARK-1928:
----------------------------------

https://github.com/apache/spark/pull/883

> DAGScheduler suspended by local task OOM
> ----------------------------------------
>
>                 Key: SPARK-1928
>                 URL: https://issues.apache.org/jira/browse/SPARK-1928
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0
>            Reporter: Zhen Peng
>             Fix For: 1.0.0
>
> DAGScheduler does not handle a local task OOM properly, and will wait for
> the job result forever.
[jira] [Created] (SPARK-1929) DAGScheduler suspended by local task OOM
Zhen Peng created SPARK-1929:
--------------------------------

             Summary: DAGScheduler suspended by local task OOM
                 Key: SPARK-1929
                 URL: https://issues.apache.org/jira/browse/SPARK-1929
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 0.9.0
            Reporter: Zhen Peng
             Fix For: 1.0.0


DAGScheduler does not handle a local task OOM properly, and will wait for the
job result forever.
[jira] [Created] (SPARK-1928) DAGScheduler suspended by local task OOM
Zhen Peng created SPARK-1928:
--------------------------------

             Summary: DAGScheduler suspended by local task OOM
                 Key: SPARK-1928
                 URL: https://issues.apache.org/jira/browse/SPARK-1928
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 0.9.0
            Reporter: Zhen Peng
             Fix For: 1.0.0


DAGScheduler does not handle a local task OOM properly, and will wait for the
job result forever.
[jira] [Commented] (SPARK-1901) Standalone worker updates executor's state ahead of executor process exit
[ https://issues.apache.org/jira/browse/SPARK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005749#comment-14005749 ]

Zhen Peng commented on SPARK-1901:
----------------------------------

https://github.com/apache/spark/pull/854

> Standalone worker updates executor's state ahead of executor process exit
> -------------------------------------------------------------------------
>
>                 Key: SPARK-1901
>                 URL: https://issues.apache.org/jira/browse/SPARK-1901
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 0.9.0
>         Environment: spark-1.0 rc10
>            Reporter: Zhen Peng
>             Fix For: 1.0.0
>
> The standalone Worker updates an executor's state prematurely, leaving the
> resource status inconsistent until the executor process actually exits.
> In our cluster, we found this can cause newly submitted applications to be
> removed by the Master due to executor launch failures.
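The fix direction is to publish an executor's terminal state only after its process has really exited, never before. A minimal sketch under stated assumptions: `ExecutorWatcher` is a hypothetical class (not Spark's Worker code), and the executor process is modeled as a thread so the example is self-contained, with `Thread.join()` standing in for `Process.waitFor()`:

```java
// Sketch: report the terminal state only after the "process" really exits,
// so resource accounting is never updated ahead of the actual exit.
public class ExecutorWatcher {
    public enum State { RUNNING, EXITED }

    private volatile State state = State.RUNNING;

    public State state() {
        return state;
    }

    // Blocks until the executor "process" (modeled here as a thread) has
    // exited, and only then publishes the terminal state.
    public void watch(Thread executorProcess) throws InterruptedException {
        executorProcess.start();
        executorProcess.join();   // analogous to Process.waitFor()
        state = State.EXITED;     // update *after* exit, never before
    }
}
```

Updating the state before the join (as the premature update effectively did) would leave a window where the executor still holds its cores and memory while the Master already believes them free.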
[jira] [Created] (SPARK-1901) Standalone worker updates executor's state ahead of executor process exit
Zhen Peng created SPARK-1901:
--------------------------------

             Summary: Standalone worker updates executor's state ahead of executor process exit
                 Key: SPARK-1901
                 URL: https://issues.apache.org/jira/browse/SPARK-1901
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 0.9.0
         Environment: spark-1.0 rc10
            Reporter: Zhen Peng
             Fix For: 1.0.0


The standalone Worker updates an executor's state prematurely, leaving the
resource status inconsistent until the executor process actually exits.
In our cluster, we found this can cause newly submitted applications to be
removed by the Master due to executor launch failures.
[jira] [Commented] (SPARK-1886) Workers keep dying from an uncaught exception when an executor id is not found
[ https://issues.apache.org/jira/browse/SPARK-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002688#comment-14002688 ]

Zhen Peng commented on SPARK-1886:
----------------------------------

https://github.com/apache/spark/pull/827

> Workers keep dying from an uncaught exception when an executor id is not found
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-1886
>                 URL: https://issues.apache.org/jira/browse/SPARK-1886
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 0.9.0
>         Environment: spark-1.0-rc8
>            Reporter: Zhen Peng
>             Fix For: 1.0.0
>
> 14/05/19 15:43:30 ERROR OneForOneStrategy: key not found: app-20140519154218-0132/6
> java.util.NoSuchElementException: key not found: app-20140519154218-0132/6
>         at scala.collection.MapLike$class.default(MapLike.scala:228)
>         at scala.collection.AbstractMap.default(Map.scala:58)
>         at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>         at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:266)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
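The stack trace above shows Scala's `HashMap.apply` throwing `NoSuchElementException` for an unknown executor id, which the actor's supervision strategy turns into a dead Worker. The defensive alternative is a guarded lookup that logs and ignores messages for unknown executors. A sketch in Java (hypothetical `ExecutorRegistry` class, not the actual Worker code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a guarded lookup: return null (caller logs and ignores) instead
// of throwing when a message arrives for an executor id the Worker no longer
// tracks, so one stale message cannot crash the whole Worker.
public class ExecutorRegistry {
    private final Map<String, String> executors = new HashMap<>();

    public void register(String fullId, String info) {
        executors.put(fullId, info);
    }

    // Map.get returns null for a missing key, unlike Scala's HashMap.apply,
    // which throws NoSuchElementException as seen in the trace above.
    public String lookup(String fullId) {
        String info = executors.get(fullId);
        if (info == null) {
            System.err.println("Ignoring message for unknown executor " + fullId);
        }
        return info;
    }
}
```

In Scala terms, the same effect comes from using `executors.get(fullId)` (an `Option`) instead of `executors(fullId)` inside the message handler.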
[jira] [Created] (SPARK-1886) Workers keep dying from an uncaught exception when an executor id is not found
Zhen Peng created SPARK-1886:
--------------------------------

             Summary: Workers keep dying from an uncaught exception when an executor id is not found
                 Key: SPARK-1886
                 URL: https://issues.apache.org/jira/browse/SPARK-1886
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 0.9.0
         Environment: spark-1.0-rc8
            Reporter: Zhen Peng
             Fix For: 1.0.0


14/05/19 15:43:30 ERROR OneForOneStrategy: key not found: app-20140519154218-0132/6
java.util.NoSuchElementException: key not found: app-20140519154218-0132/6
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
        at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:266)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)