[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19145
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2018-02-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19145
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2018-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19145
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19145
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-10-08 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
Sorry for the late response. IIUC, in MR this case is handled as follows (see the sketch below):
1. The AM receives the container-failed message.
2. The AM checks whether any attempt of the same task is RUNNING or SUCCEEDED.
2.1 If step 2 returns true, MR ignores the failed message.
2.2 If step 2 returns false, MR requests a new container and reruns the task.
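
As a rough sketch only (this is not MR's actual code; the names and types here are hypothetical), the check could look like:

```scala
// Hypothetical sketch of the MR-style check described above.
sealed trait AttemptState
case object Running extends AttemptState
case object Succeeded extends AttemptState
case object Failed extends AttemptState

// On a container-failed message, request a replacement only if no other
// attempt of the same task is RUNNING or SUCCEEDED (steps 2, 2.1, 2.2).
def onContainerFailed(
    taskId: String,
    attemptStates: Map[String, Seq[AttemptState]],
    requestNewContainer: String => Unit): Unit = {
  val hasLiveOrFinishedAttempt = attemptStates
    .getOrElse(taskId, Nil)
    .exists(s => s == Running || s == Succeeded)
  if (!hasLiveOrFinishedAttempt) {
    requestNewContainer(taskId) // step 2.2: rerun the task
  }
  // step 2.1: otherwise, ignore the failed message
}
```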



---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-28 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19145
  
Have you guys reached a consensus on whether this PR is needed or not?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-21 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
@jerryshao thank you for your comment, I will try to find out how MR/TEZ handle 
this.


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19145
  
@klion26 , this is not a problem specific to Spark Streaming or Structured 
Streaming; any Spark application can run into it. It is basically a YARN 
problem and looks hard to address in Spark. You may have made this point fix 
work, but what if some other behavior happens during an RM/NM restart? This 
may be just one case of inconsistency during RM/NM restart, and I'm not sure 
how to fix it properly. Can you please check whether MR/TEZ have a proper fix 
for this problem? I assume they may suffer from it as well.


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-19 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
@squito I agree with you that this should be handled by YARN.

In my opinion, this patch is a form of defensive programming: without it, both 
Spark Streaming and Structured Streaming will request more resources than they 
want when this situation occurs.


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-19 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/19145
  
I'm not sure I totally follow the sequence of events, but I get the feeling 
this should be handled in yarn, not spark.

Also, I agree with Jerry: it seems like your `completedContainerIdSet` may 
grow continuously. You'll remove from it *if* you happen to get a duplicate 
message, but if I understand correctly, in most cases you will not get a 
duplicate message.


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-19 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
My colleague filed an 
[issue](https://issues.apache.org/jira/browse/YARN-7214) for this; I'll 
rewrite the description here.

A Spark Streaming application (app1) is running on YARN, and one of app1's 
containers (c1) runs on NM1.
1. NM1 crashes, and after 10 minutes the RM finds that NM1 has expired.
2. The RM removes all containers on NM1 (RMNodeImpl), and app1 receives a 
completed message for c1. But the RM cannot send c1 (to be removed) to NM1, 
because NM1 is lost.
3. NM1 restarts and registers with the RM (with c1 in the register request), 
but the RM sees NM1 as lost and will not handle containers from NM1.
4. NM1 does not heartbeat c1 (c1 is not in the heartbeat request), so c1 is 
never removed from NM1's context.
5. The RM restarts and NM1 re-registers with it. Now c1 is handled and 
recovered, and the RM sends a c1-completed message to app1's AM. So app1 
receives a duplicated completed message for c1.


For the fix (see the sketch below):
1. I changed the code from `completedContainerIdSet.contains(containerId)` 
to `completedContainerIdSet.remove(containerId)` to reclaim the memory. (The 
same container will not be reported as completed more than twice.)
2. The code I added ignores duplicated completed messages; ignoring them 
avoids requesting new containers.
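
As a rough illustration of the two points, here is a minimal sketch of such a 
guard; the names (`CompletedContainerTracker`, `shouldProcess`) are 
hypothetical and this is not the actual `YarnAllocator` change:

```scala
import scala.collection.mutable

// Hypothetical, simplified guard; not the actual YarnAllocator code.
class CompletedContainerTracker {
  // IDs of containers whose completion has been seen exactly once.
  private val completedContainerIds = mutable.HashSet[String]()

  /** Returns true if this completed message should be processed,
    * false if it is a duplicate that should be ignored. */
  def shouldProcess(containerId: String): Boolean = {
    // remove() returns true only when the ID is already present, i.e. this
    // is the second report. Removing (rather than just checking contains)
    // frees the entry once the duplicate arrives; IDs that never see a
    // duplicate still remain, which is the retention concern raised in review.
    if (completedContainerIds.remove(containerId)) {
      false // duplicate: ignore, so no replacement container is requested
    } else {
      completedContainerIds += containerId
      true  // first report: process normally (may request a replacement)
    }
  }
}
```

In the scenario above, the first completed message for c1 (step 2 of the 
sequence) would return true and be processed, while the re-reported completion 
after the RM restart (step 5) would return false and be dropped.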


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-18 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19145
  
And based on your fix:

1. It looks like you don't have a retention mechanism, which will potentially 
introduce a memory leak.
2. I don't see your logic to avoid requesting new containers; is your 
current logic enough to fix the issue?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-18 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19145
  
> But if we restart the RM, then the lost containers in the NM will be 
reported to RM as lost again because of recovery

Since you have already enabled RM and NM recovery, IIUC the failure of RM/NM 
will not lead to container exit. And after an RM/NM restart, it will recover 
the persisted container metadata, so I think there should be no lost 
containers reported. Sorry, I'm not so familiar with this part of YARN.


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-18 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
We enabled RM and NM recovery.

If we assume there are 2 containers running on this NM, then after 10 minutes 
the RM detects the failure of the NM and relaunches the 2 lost containers on 
other NMs. This is OK.

But if we restart the RM, the lost containers on the NM will be 
**reported to the RM as lost again** because of recovery, so we will relaunch 
2 more containers on other NMs and end up with 2 more executors than expected.


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-18 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19145
  
Did you enable RM or NM recovery? Can you please clarify?

Normally, if we assume there are 2 containers running on this NM, then after 
10 minutes the RM will detect the failure of the NM and relaunch the 2 lost 
containers on other NMs, and the total number of executors should still be 
the same. But things will be different if NM recovery is enabled, because then 
the failure of the NM will not lead to the containers being lost.


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-18 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
Hi @jerryshao, thank you for your reply.

# Problem
The problem is that long-running jobs on **yarn with HA** will 
request more executors than they actually need.

# How to reproduce
1. Start a Spark Streaming job on YARN.
2. Mark one of the NodeManagers that runs a container of the Spark Streaming 
job as lost (this step takes 10 minutes in my environment).
3. Bring the NodeManager lost in step 2 back.
4. Restart the ResourceManager.
5. After the ResourceManager restarts, the job gets more resources than it 
requested.


# Question
I have one question: should I use 
`completedContainerIdSet.remove(containerId)` instead of 
`completedContainerIdSet.contains(containerId)`? If the container-lost message 
will only be reported twice, we should use `remove` instead of `contains`.


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-18 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19145
  
Hi @klion26 , sorry for the late response. Let's understand the 
problem first: would you please describe your problem in detail and explain 
how to reproduce the issue?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-14 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
@jerryshao Could you help me review this patch?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-13 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
Will the same completed message be reported more than twice? If such a 
message will not be reported more than twice, then I could use 
`completedContainerIdSet.remove(containerId)` instead of 
`completedContainerIdSet.contains(containerId)` to save memory.

Could anyone help to review this patch?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-07 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
@HyukjinKwon @vanzin @srowen @foxish @djvulee @squito Could you please 
help review this PR?


---




[GitHub] spark issue #19145: [spark-21933][yarn] Spark Streaming request more executo...

2017-09-06 Thread klion26
Github user klion26 commented on the issue:

https://github.com/apache/spark/pull/19145
  
@HyukjinKwon I am sorry for that; I have changed the title format.


---
