[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19145 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 Sorry for the late response. IIUC, in MR this case is handled as below:
1. The AM receives the container-failed message.
2. The AM checks whether any attempt of the same task is RUNNING or SUCCEEDED.
2.1 If step 2 returns true, MR ignores the failed message.
2.2 If step 2 returns false, MR requests a new container and reruns the specified task.
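The MR-side check described in that comment might be sketched as follows (a hypothetical illustration; the enum and method names are made up for clarity and are not actual MR internals):

```java
import java.util.List;

// Hypothetical sketch of the MR-style decision described above:
// on a container-failed message, only request a new container if no
// other attempt of the same task is already RUNNING or SUCCEEDED.
public class FailedContainerHandler {

    enum AttemptState { RUNNING, SUCCEEDED, FAILED, KILLED }

    /**
     * Returns true if the AM should request a new container for the task,
     * false if the failed message can be ignored (step 2.1 above).
     */
    public static boolean shouldRequestNewContainer(List<AttemptState> attemptsOfTask) {
        for (AttemptState state : attemptsOfTask) {
            if (state == AttemptState.RUNNING || state == AttemptState.SUCCEEDED) {
                return false; // step 2.1: another attempt covers this task
            }
        }
        return true; // step 2.2: request a new container and rerun the task
    }
}
```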
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19145 Have you guys reached a consensus on whether this PR is needed or not?
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 @jerryshao thank you for your comment, I will try to find out how MR/Tez handle this.
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19145 @klion26, this is not a problem specific to Spark Streaming or Structured Streaming; any Spark application can run into it. This is basically a YARN problem and looks hard to address in Spark. This point fix might work for your case, but what about other behaviors during RM/NM restart? This may be just one case of inconsistency during RM/NM restart, and I'm not sure how to fix it well. Can you please check whether MR/Tez have a proper fix for this problem? I assume they may suffer from it as well.
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 @squito I agree with you that this should be handled by YARN. In my opinion, this is some form of defensive programming: both Spark Streaming and Structured Streaming will request more resources than they want if these things happen.
Github user squito commented on the issue: https://github.com/apache/spark/pull/19145 I'm not sure I totally follow the sequence of events, but I get the feeling this should be handled in YARN, not Spark. Also, I agree with Jerry: it seems like your `completedContainerIdSet` may grow continuously. You'll remove from it *if* you happen to get a duplicate message, but I think in most cases you will not get a duplicate message, if I understand correctly.
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 My colleague created an [issue](https://issues.apache.org/jira/browse/YARN-7214) for this; I'll restate the description here. A Spark Streaming app (app1) is running on YARN, and one of app1's containers (c1) runs on NM1.
1. NM1 crashes, and RM marks NM1 as expired after 10 minutes.
2. RM removes all containers on NM1 (RMNodeImpl), and app1 receives a completed message for c1. But RM cannot send c1 (to be removed) to NM1 because NM1 is lost.
3. NM1 restarts and registers with RM (c1 is in the register request), but RM sees NM1 as lost and does not handle containers from NM1.
4. NM1 does not heartbeat c1 (c1 is not in the heartbeat request), so c1 is never removed from NM1's context.
5. RM restarts and NM1 re-registers with RM. Now c1 is handled and recovered, and RM sends a c1-completed message to the AM of app1.
So, app1 receives a duplicated completed message for c1.
For the fix:
1. I changed the code from `completedContainerIdSet.contains(containerId)` to `completedContainerIdSet.remove(containerId)` to reclaim the memory (the same container will not be reported as completed more than twice).
2. The code I added ignores the duplicated completed messages; ignoring them avoids requesting new containers.
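A minimal sketch of the deduplication idea in the fix above (container IDs are stood in by plain strings, and the class/method names are hypothetical; this illustrates the approach, not the actual patch):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: remember completed container IDs, and when the same
// completion is reported a second time, treat it as a duplicate and drop
// the ID from the set to reclaim memory (assuming a container is reported
// completed at most twice, as discussed in the thread).
public class CompletedContainerDedup {
    private final Set<String> completedContainerIds = new HashSet<>();

    /** Returns true if this completion report is a duplicate and should be ignored. */
    public boolean isDuplicate(String containerId) {
        // remove() returns true when the id was already recorded: that is
        // the duplicate case, and removing it also frees the entry.
        if (completedContainerIds.remove(containerId)) {
            return true;
        }
        // First report of this container: record it and process normally.
        completedContainerIds.add(containerId);
        return false;
    }
}
```

Using `remove` instead of `contains` is what keeps the set from growing without bound in the at-most-twice case, which is the question raised earlier in the thread.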
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19145 And based on your fix: 1. It looks like you don't have a retention mechanism, which could potentially introduce a memory leak. 2. I don't see the logic that avoids requesting new containers; is your current logic enough to fix the issue?
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19145 > But if we restart the RM, then the lost containers in the NM will be reported to RM as lost again because of recovery

Since you already enabled RM and NM recovery, IIUC the failure of RM/NM will not lead to container exit. And after RM/NM restart, the persisted container metadata will be recovered, so I think there should be no lost containers reported. Sorry, I'm not so familiar with this part of YARN.
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 We enabled both RM and NM recovery. If we assume there are 2 containers running on this NM: after 10 minutes, RM detects the failure of the NM and relaunches the 2 lost containers on other NMs. This is OK. But if we then restart the RM, the lost containers on the NM will be **reported to RM as lost again** because of recovery, we will relaunch 2 more containers on other NMs, and we will get 2 more executors than we expected.
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19145 Did you enable RM or NM recovery? Can you please clarify? Normally, if we assume there are 2 containers running on this NM, after 10 minutes RM will detect the failure of the NM and relaunch 2 lost containers on other NMs, and the total number of executors should still be the same. But things will be different if NM recovery is enabled, because then the failure of the NM will not lead to container loss.
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 Hi @jerryshao, thank you for your reply.
# Problem
Long-running jobs on **YARN with HA** will be granted more executors than they request.
# How to reproduce
1. Start a Spark Streaming job on YARN.
2. Mark one of the NodeManagers that runs a container of the Spark Streaming program as lost (this step takes 10 minutes in my environment).
3. The NodeManager lost in step 2 comes back.
4. Restart the ResourceManager.
5. After the ResourceManager restarts, we get more resources than we requested.
# Question
I have one question: should I use `completedContainerIdSet.remove(containerId)` instead of `completedContainerIdSet.contains(containerId)`? If the container-lost message will only be reported twice, we should use `remove` instead of `contains`.
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19145 Hi @klion26, sorry for the late response. Let's understand the problem first: would you please describe your problem in detail and explain how to reproduce it?
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 @jerryshao Could you help review this patch?
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 Will the same completed message be reported more than twice? If not, I could use `completedContainerIdSet.remove(containerId)` instead of `completedContainerIdSet.contains(containerId)` to save memory. Could anyone help review this patch?
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 @HyukjinKwon @vanzin @srowen @foxish @djvulee @squito Could you please help review this PR?
Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145 @HyukjinKwon I am sorry for that; I have changed the title format.