[jira] [Commented] (YARN-2579) Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time

2021-11-17 Thread stefanlee (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445044#comment-17445044
 ] 

stefanlee commented on YARN-2579:
-

Why is this patch not found in hadoop-2.7.7? [~rohithsharma]

> Deadlock when EmbeddedElectorService and FatalEventDispatcher try to 
> transition RM to StandBy at the same time
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2579-20141105.1.patch, YARN-2579-20141105.2.patch, 
> YARN-2579-20141105.3.patch, YARN-2579-20141105.patch, YARN-2579.patch, 
> YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> both displayed their state as Active, but one RM's ActiveServices were 
> stopped.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6307) Refactor FairShareComparator#compare

2018-09-19 Thread stefanlee (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620523#comment-16620523
 ] 

stefanlee commented on YARN-6307:
-

Hi [~yufeigu], after merging this patch we ran into the problem described in YARN-4743.

> Refactor FairShareComparator#compare
> 
>
> Key: YARN-6307
> URL: https://issues.apache.org/jira/browse/YARN-6307
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6307.001.patch, YARN-6307.002.patch, 
> YARN-6307.003.patch
>
>
> The method does three things: compare min share usage, compare fair share 
> usage by checking the weight ratio, and break ties by submit time and name. 
> These steps are mixed together, which makes the code hard to read and maintain. 
> Additionally, there are potential performance issues; for example, there is no 
> need to check the weight ratio if the minShare usage comparison already 
> determines the order. It is worth improving given how often the scheduler invokes it.
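A minimal sketch of the short-circuit ordering described above, using a hypothetical simplified Schedulable (not the actual Hadoop classes): each step returns as soon as it decides the order, so the weight-ratio check is skipped whenever the min share comparison is already decisive.

{code:java}
// Self-contained sketch of the three-step, short-circuiting comparison
// described above (hypothetical Schedulable fields; not the Hadoop classes).
import java.util.Comparator;

class Schedulable {
  long usage, minShare, weight, startTime;
  String name;
}

class FairShareComparatorSketch implements Comparator<Schedulable> {
  @Override
  public int compare(Schedulable s1, Schedulable s2) {
    // 1) min share usage: a schedulable still below its min share sorts first.
    int res = Boolean.compare(s1.usage >= s1.minShare, s2.usage >= s2.minShare);
    if (res == 0) {
      // 2) only when step 1 ties: compare usage-to-weight ratios.
      res = Double.compare((double) s1.usage / s1.weight,
                           (double) s2.usage / s2.weight);
    }
    if (res == 0) {
      res = Long.compare(s1.startTime, s2.startTime);   // 3a) earlier submit first
    }
    if (res == 0) {
      res = s1.name.compareTo(s2.name);                 // 3b) stable name tie-break
    }
    return res;
  }
}
{code}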



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8436) FSParentQueue: Comparison method violates its general contract

2018-09-19 Thread stefanlee (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620509#comment-16620509
 ] 

stefanlee commented on YARN-8436:
-

[~wilfreds] thanks for this jira. As you mentioned above:
{quote}If during this sorting a different node update changes a child queue 
then we allow that. 
{quote}
doesn't the RM handle NODE_UPDATE events one by one?

> FSParentQueue: Comparison method violates its general contract
> --
>
> Key: YARN-8436
> URL: https://issues.apache.org/jira/browse/YARN-8436
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: YARN-8436.001.patch, YARN-8436.002.patch, 
> YARN-8436.003.patch
>
>
> The ResourceManager can fail while sorting queues if an update comes in:
> {code:java}
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>   at java.util.TimSort.mergeLo(TimSort.java:777)
>   at java.util.TimSort.mergeAt(TimSort.java:514)
> ...
>   at java.util.Collections.sort(Collections.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:223){code}
> The reason it breaks is a change in the sorted object itself. 
> This is why it fails:
>  * an update from a node comes in as a heartbeat.
>  * the update triggers a check to see if we can assign a container on the 
> node.
>  * walk over the queue hierarchy to find a queue to assign a container to: 
> top down.
>  * for each parent queue we sort the child queues in {{assignContainer}} to 
> decide which queue to descend into.
>  * we lock the parent queue while sorting to prevent changes, but we do not lock 
> the child queues that we are sorting.
> If during this sorting a different node update changes a child queue then we 
> allow that. This means that the objects that we are trying to sort now might 
> be out of order. That causes the issue with the comparator. The comparator 
> itself is not broken.
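As a self-contained illustration of this failure mode (a hypothetical stand-in, not Hadoop code): sorting a list while another thread mutates the comparison keys can make TimSort detect an inconsistent ordering and throw the same exception. It is timing-dependent, so it may take several sort passes to trigger.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class ComparatorContractDemo {
  // Stand-in for a child queue whose usage changes under a concurrent update.
  static class ChildQueue {
    volatile long usage;
    ChildQueue(long usage) { this.usage = usage; }
  }

  public static void main(String[] args) throws Exception {
    List<ChildQueue> queues = new ArrayList<>();
    Random rnd = new Random();
    for (int i = 0; i < 10_000; i++) {
      queues.add(new ChildQueue(rnd.nextInt(1_000)));
    }

    // Simulates a second node update mutating sort keys mid-sort.
    Thread updater = new Thread(() -> {
      Random r = new Random();
      while (!Thread.currentThread().isInterrupted()) {
        queues.get(r.nextInt(queues.size())).usage = r.nextInt(1_000);
      }
    });
    updater.start();

    Comparator<ChildQueue> byUsage = Comparator.comparingLong(q -> q.usage);
    try {
      for (int i = 0; i < 1_000; i++) {
        // TimSort may detect the inconsistency and throw
        // "Comparison method violates its general contract!"
        Collections.sort(queues, byUsage);
      }
    } finally {
      updater.interrupt();
    }
  }
}
{code}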



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8651) We must increase min Resource in FairScheduler after increase number of NM

2018-08-10 Thread stefanlee (JIRA)
stefanlee created YARN-8651:
---

 Summary: We must increase min Resource in FairScheduler after 
increase number of NM
 Key: YARN-8651
 URL: https://issues.apache.org/jira/browse/YARN-8651
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.4.0
Reporter: stefanlee


Our cluster recently showed a strange phenomenon. Before we increased the number 
of NodeManagers, resource utilization could reach 100%, but utilization did not 
improve with the cluster expansion: many queues' used resources stayed at their 
min resource even though they still had a lot of pending demand.

After we dynamically increased those queues' min resources, their utilization 
went up and the resources of the entire cluster were used again.

So I suspect a bug in *FairSharePolicy#compare*.
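For reference, a simplified paraphrase of the minShare-first ordering in FairSharePolicy#compare (hypothetical Queue fields; not the exact Hadoop implementation):

{code:java}
// Simplified paraphrase of the minShare-first ordering (hypothetical fields).
class Queue {
  long usage, minShare, weight;
  String name;
}

class MinShareFirstOrdering {
  static int compare(Queue q1, Queue q2) {
    boolean q1Needy = q1.usage < q1.minShare;
    boolean q2Needy = q2.usage < q2.minShare;
    if (q1Needy && !q2Needy) return -1;      // queues under minShare are served first
    if (!q1Needy && q2Needy) return 1;
    if (q1Needy && q2Needy) {
      // both under minShare: lower usage-to-minShare ratio is served first
      return Double.compare((double) q1.usage / q1.minShare,
                            (double) q2.usage / q2.minShare);
    }
    // neither under minShare: fall back to the usage-to-weight ratio
    return Double.compare((double) q1.usage / q1.weight,
                          (double) q2.usage / q2.weight);
  }
}
{code}

Under such an ordering a queue loses its "needy" priority as soon as its usage reaches the configured min resource, which at least seems consistent with the observation that raising minResources let the queues grow again.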



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6307) Refactor FairShareComparator#compare

2018-07-25 Thread stefanlee (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1626#comment-1626
 ] 

stefanlee commented on YARN-6307:
-

Thanks for this jira, [~yufeigu] [~templedf]. I have a question: what is the 
difference between the *fair share* used in _FairSharePolicy#compare_ and the 
*fair share* in _FairSharePolicy#computeShares_? I think the latter is related 
to *preemption*, and the distinction is hard to follow.

> Refactor FairShareComparator#compare
> 
>
> Key: YARN-6307
> URL: https://issues.apache.org/jira/browse/YARN-6307
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6307.001.patch, YARN-6307.002.patch, 
> YARN-6307.003.patch
>
>
> The method does three things: compare min share usage, compare fair share 
> usage by checking the weight ratio, and break ties by submit time and name. 
> These steps are mixed together, which makes the code hard to read and maintain. 
> Additionally, there are potential performance issues; for example, there is no 
> need to check the weight ratio if the minShare usage comparison already 
> determines the order. It is worth improving given how often the scheduler invokes it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5188) FairScheduler performance bug

2018-03-04 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385711#comment-16385711
 ] 

stefanlee commented on YARN-5188:
-

[~chenfolin] thanks for this jira. You say *I checked the resourcemanager logs, 
I found that every assign container may cost 5 ~ 10 ms, but just 0 ~ 1 ms at 
usual time.* Is the *0~1ms* the time to assign each container, or the time to 
complete each *NODE_UPDATE*? BTW, were *continuous-scheduling* and 
*assignmultiple* enabled, and what was the value of *max.assign*? Could you also 
tell me how much performance improvement this patch brings?

> FairScheduler performance bug
> -
>
> Key: YARN-5188
> URL: https://issues.apache.org/jira/browse/YARN-5188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.5.0
>Reporter: ChenFolin
>Priority: Major
> Attachments: YARN-5188-1.patch
>
>
>  My Hadoop Cluster has recently encountered a performance problem. Details as 
> Follows.
> There are two points which can cause this performance issue.
> 1: applications are sorted before assigning a container in FSLeafQueue. TreeSet 
> is not the best choice; why not keep the collection ordered, and then use binary 
> search to restore the order when an application's resource usage changes?
> 2: queue sorting and assignContainerPreCheck compute the resource usage of all 
> leaf queues. Why not cache the leaf queue usage in memory and update it when a 
> container is assigned or released?
>
> The efficiency of container assignment in the ResourceManager may fall 
> as the number of running and pending applications grows. In fact the cluster 
> had a large amount of pending MB and pending vcores while the current cluster 
> utilization rate was below 20%.
>I checked the resourcemanager logs, I found that every assign 
> container may cost 5 ~ 10 ms, but just 0 ~ 1 ms at usual time.
>  
>I use TestFairScheduler to reproduce the scene:
>  
> Just one queue: root.default
>  10240 apps.
>  
>assign container avg time:  6753.9 us ( 6.7539 ms)  
>  apps sort time (FSLeafQueue : Collections.sort(runnableApps, 
> comparator); ): 4657.01 us ( 4.657 ms )
>  compute LeafQueue Resource usage : 905.171 us ( 0.905171 ms )
>  
> When there is just root.default, one assign-container operation consists of 
> (one application sort, ~4657 us) + 2 * (compute leaf queue usage, ~905 us), 
> which accounts for roughly 6467 us of the 6754 us average measured above.
> Based on these numbers, I think the assign-container operation has a 
> performance problem.
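To make the second point concrete, a minimal sketch of the caching idea (a hypothetical class, not the attached patch): keep the leaf queue's aggregate usage up to date on container events instead of summing over all applications on every scheduling pass.

{code:java}
// Hypothetical cached-usage holder: O(1) reads during queue sorting and
// assignContainerPreCheck, updated only when containers change state.
class CachedLeafQueueUsage {
  private long usedMemoryMb;
  private int usedVcores;

  synchronized void containerAssigned(long memoryMb, int vcores) {
    usedMemoryMb += memoryMb;
    usedVcores += vcores;
  }

  synchronized void containerReleased(long memoryMb, int vcores) {
    usedMemoryMb -= memoryMb;
    usedVcores -= vcores;
  }

  synchronized long getUsedMemoryMb() { return usedMemoryMb; }
  synchronized int getUsedVcores()    { return usedVcores; }
}
{code}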



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient by caching resource usage

2018-03-04 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385556#comment-16385556
 ] 

stefanlee commented on YARN-4090:
-

Yes, I see it, it's a valuable optimization. Were *assignmultiple* and 
*continuous-scheduling* enabled, and what was the value of *max.assign* in this 
test case? [~yufeigu] [~xinxianyin]

> Make Collections.sort() more efficient by caching resource usage
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Affects Versions: 2.8.1, 3.0.0-alpha3
>Reporter: Xianyin Xin
>Assignee: Yufei Gu
>Priority: Major
> Fix For: 2.9.0, 3.0.0, 3.1.0
>
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, 
> YARN-4090.004.patch, YARN-4090.005.patch, YARN-4090.006.patch, 
> YARN-4090.007.patch, YARN-4090.008.patch, YARN-4090.009.patch, sampling1.jpg, 
> sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4088) RM should be able to process heartbeats from NM concurrently

2018-03-04 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385503#comment-16385503
 ] 

stefanlee commented on YARN-4088:
-

Yep! *processing those events in the scheduler thread is serialized* is the key 
point. Thanks a lot, [~jlowe]
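A minimal sketch of that serialization point (hypothetical classes, not the Hadoop AsyncDispatcher): many heartbeat handler threads can enqueue events concurrently, but a single worker drains the queue, so scheduler-side processing still happens one event at a time.

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical single-threaded event loop: dispatch() is called concurrently
// by many NM heartbeat RPC handlers, but events are processed one at a time.
class SchedulerEventLoop {
  private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
  private final Thread worker = new Thread(() -> {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        events.take().run();        // one NODE_UPDATE handled at a time
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }, "SchedulerEventProcessor");

  SchedulerEventLoop() { worker.start(); }

  void dispatch(Runnable event) { events.add(event); }   // thread-safe enqueue
}
{code}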

 

> RM should be able to process heartbeats from NM concurrently
> 
>
> Key: YARN-4088
> URL: https://issues.apache.org/jira/browse/YARN-4088
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, scheduler
>Reporter: Srikanth Kandula
>Priority: Major
>
> Today, the RM sequentially processes one heartbeat after another. 
> Imagine a 3000 server cluster with each server heart-beating every 3s. This 
> gives the RM 1ms on average to process each NM heartbeat. That is tough.
> It is true that there are several underlying datastructures that will be 
> touched during heartbeat processing. So, it is non-trivial to parallelize the 
> NM heartbeat. Yet, it is quite doable...
> Parallelizing the NM heartbeat would substantially improve the scalability of 
> the RM, allowing it to either 
> a) run larger clusters or 
> b) support faster heartbeats or dynamic scaling of heartbeats
> c) take more asks from each application, or 
> d) use cleverer / more expensive algorithms such as node labels or better 
> packing, or ...
> Indeed the RM's scalability limit has been cited as the motivating reason for 
> a variety of efforts which will become less needed if this can be solved. 
> Ditto for slow heartbeats.  See Sparrow and Mercury papers for example.
> Can we take a shot at this?
> If not, could we discuss why.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient by caching resource usage

2018-03-02 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383390#comment-16383390
 ] 

stefanlee commented on YARN-4090:
-

[~yufeigu] [~xinxianyin] Thanks for this jira. How much performance improvement 
does this optimization bring?

> Make Collections.sort() more efficient by caching resource usage
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Affects Versions: 2.8.1, 3.0.0-alpha3
>Reporter: Xianyin Xin
>Assignee: Yufei Gu
>Priority: Major
> Fix For: 2.9.0, 3.0.0, 3.1.0
>
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, 
> YARN-4090.004.patch, YARN-4090.005.patch, YARN-4090.006.patch, 
> YARN-4090.007.patch, YARN-4090.008.patch, YARN-4090.009.patch, sampling1.jpg, 
> sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4088) RM should be able to process heartbeats from NM concurrently

2018-03-01 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383305#comment-16383305
 ] 

stefanlee commented on YARN-4088:
-

[~jlowe] could OOB heartbeats reduce the ResourceManager's scheduling ability?

> RM should be able to process heartbeats from NM concurrently
> 
>
> Key: YARN-4088
> URL: https://issues.apache.org/jira/browse/YARN-4088
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, scheduler
>Reporter: Srikanth Kandula
>Priority: Major
>
> Today, the RM sequentially processes one heartbeat after another. 
> Imagine a 3000 server cluster with each server heart-beating every 3s. This 
> gives the RM 1ms on average to process each NM heartbeat. That is tough.
> It is true that there are several underlying datastructures that will be 
> touched during heartbeat processing. So, it is non-trivial to parallelize the 
> NM heartbeat. Yet, it is quite doable...
> Parallelizing the NM heartbeat would substantially improve the scalability of 
> the RM, allowing it to either 
> a) run larger clusters or 
> b) support faster heartbeats or dynamic scaling of heartbeats
> c) take more asks from each application, or 
> d) use cleverer / more expensive algorithms such as node labels or better 
> packing, or ...
> Indeed the RM's scalability limit has been cited as the motivating reason for 
> a variety of efforts which will become less needed if this can be solved. 
> Ditto for slow heartbeats.  See Sparrow and Mercury papers for example.
> Can we take a shot at this?
> If not, could we discuss why.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6407) Improve and fix locks of RM scheduler

2018-02-27 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378337#comment-16378337
 ] 

stefanlee commented on YARN-6407:
-

Hi [~zhengchenyu], regarding *Setting max.assign to 5 decreased the assignment 
ability*: what was the value before you changed it to 5?

> Improve and fix locks of RM scheduler
> -
>
> Key: YARN-6407
> URL: https://issues.apache.org/jira/browse/YARN-6407
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
> Environment: CentOS 7, 1 Gigabit Ethernet
>Reporter: zhengchenyu
>Priority: Major
> Fix For: 2.7.1
>
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> First, this issue does not duplicate YARN-3091.
> In our cluster we have 5k nodes, and the servers are configured with 1 Gigabit 
> Ethernet, so the network is the bottleneck in our cluster.
> We must distcp data from the warehouse; because of the 1 Gigabit Ethernet we 
> must set yarn.scheduler.fair.max.assign to 5, or it leads to hotspots.
> Setting max.assign to 5 decreased the assignment ability, so we started the 
> ContinuousSchedulingThread. 
> With more applications running in our cluster, and with the 
> ContinuousSchedulingThread, the lock contention problem became more serious. 
> In our cluster the call queue of the ApplicationMasterService RPC is 
> occasionally high, and we worry that more problems will occur in the future as 
> more applications run.
> Here is our logical graph:
> "1 Gigabit Ethernet" and "data hot spot" ==> "set 
> yarn.scheduler.fair.max.assign to 5" ==> "ContinuousSchedulingThread is 
> started" and "more applications" ==> "lock contention"
> I know YARN-3091 addressed this problem, but that patch changes the object lock 
> to a read-write lock, which is still coarse-grained. So I think we should lock 
> the resources themselves rather than large sections of code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2018-02-26 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376500#comment-16376500
 ] 

stefanlee commented on YARN-7672:
-

[~yufeigu] thanks a lot.

> hadoop-sls can not simulate huge scale of YARN
> --
>
> Key: YARN-7672
> URL: https://issues.apache.org/jira/browse/YARN-7672
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhangshilong
>Assignee: zhangshilong
>Priority: Major
> Attachments: YARN-7672.patch
>
>
> Our YARN cluster has scaled to nearly 10 thousand nodes, and we need to do 
> scheduler pressure testing.
> Using SLS, we start 2000+ threads to simulate NMs and AMs, but the CPU load 
> rises above 100, which I think affects the performance evaluation of the 
> scheduler. 
> So I decided to separate the scheduler from the simulator:
> I start a real RM; then SLS registers nodes with the RM and submits apps to 
> the RM using the RM RPC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2018-02-25 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376364#comment-16376364
 ] 

stefanlee commented on YARN-7672:
-

[~yufeigu] thanks. My hadoop version is 2.4.0 and 
*yarn.scheduler.fair.dynamic.max.assign* is not in my configuration file. What 
I mean is that the SLS test with *8* should get more containers than the SLS 
test with *2*, so the former should complete more quickly than the latter, but I 
found the difference in completion time between them is not obvious. My 
understanding of *yarn.scheduler.fair.max.assign* is that the greater the value, 
the better the scheduling performance. Please correct me if I am wrong.
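For context, a simplified paraphrase of how assignmultiple and max.assign interact on a single heartbeat (hypothetical interfaces and method; a paraphrase of the behavior, not the actual FairScheduler source). Raising max.assign from 2 to 8 only matters if each node actually has both headroom and demand for more than 2 containers per heartbeat.

{code:java}
// Simplified per-heartbeat assignment loop (hypothetical types; a paraphrase
// of the behavior, not the actual FairScheduler code).
interface Node  { boolean hasAvailableResource(); }
interface Queue { boolean assignContainer(Node node); }  // true if one was assigned

class AssignLoopSketch {
  static void attemptScheduling(Node node, Queue rootQueue,
                                boolean assignMultiple, int maxAssign) {
    int assigned = 0;
    while (node.hasAvailableResource() && rootQueue.assignContainer(node)) {
      assigned++;
      // Without assignmultiple, stop after the first container; with it,
      // keep going until maxAssign is reached (<= 0 meaning unlimited).
      if (!assignMultiple || (maxAssign > 0 && assigned >= maxAssign)) {
        break;
      }
    }
  }
}
{code}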

> hadoop-sls can not simulate huge scale of YARN
> --
>
> Key: YARN-7672
> URL: https://issues.apache.org/jira/browse/YARN-7672
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhangshilong
>Assignee: zhangshilong
>Priority: Major
> Attachments: YARN-7672.patch
>
>
> Our YARN cluster has scaled to nearly 10 thousand nodes, and we need to do 
> scheduler pressure testing.
> Using SLS, we start 2000+ threads to simulate NMs and AMs, but the CPU load 
> rises above 100, which I think affects the performance evaluation of the 
> scheduler. 
> So I decided to separate the scheduler from the simulator:
> I start a real RM; then SLS registers nodes with the RM and submits apps to 
> the RM using the RM RPC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2018-02-25 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376064#comment-16376064
 ] 

stefanlee commented on YARN-7672:
-

[~yufeigu] [~ywskycn] could you please tell me whether the value of 
*yarn.scheduler.fair.max.assign* can increase the number of containers the 
FairScheduler assigns? I set this value to *2* and ran SLS in the first round, 
then in the next round I updated it to *8* and ran SLS again. After that, I 
found no difference in their assignment ability. (I have enabled 
*yarn.scheduler.fair.assignmultiple* and disabled *continuous-scheduling*.)

thanks.

> hadoop-sls can not simulate huge scale of YARN
> --
>
> Key: YARN-7672
> URL: https://issues.apache.org/jira/browse/YARN-7672
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhangshilong
>Assignee: zhangshilong
>Priority: Major
> Attachments: YARN-7672.patch
>
>
> Our YARN cluster has scaled to nearly 10 thousand nodes, and we need to do 
> scheduler pressure testing.
> Using SLS, we start 2000+ threads to simulate NMs and AMs, but the CPU load 
> rises above 100, which I think affects the performance evaluation of the 
> scheduler. 
> So I decided to separate the scheduler from the simulator:
> I start a real RM; then SLS registers nodes with the RM and submits apps to 
> the RM using the RM RPC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-3933) FairScheduler: Multiple calls to completedContainer are not safe

2018-02-07 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356599#comment-16356599
 ] 

stefanlee edited comment on YARN-3933 at 2/8/18 7:37 AM:
-

[~yufeigu] thanks. Yesterday I found that the resource utilization of our 
cluster is very low even though there are a lot of pending applications, and the 
RM shows no exceptions. I then found a queue with negative usage that also has 
pending resources, so I wonder whether a queue with negative resource usage can 
prevent the FairScheduler from assigning containers to any other queue. Thanks 
for this jira [YARN-3933|https://issues.apache.org/jira/browse/YARN-3933]; it 
seems the same as my scenario.


was (Author: imstefanlee):
[~yufeigu] thanks, yesterday, i found in our cluster the utilization rate of 
resource  is very low , but there is a lot of pending applications in it, and 
RM has no exception, then  i found a queue has negative-usage and also has 
pending resource, so i doubt Whether a queue has negative-usage resource can 
lead to FairScheduler do not assign containers to any other queues. thanks for 
this jira[link title|https://issues.apache.org/jira/browse/YARN-3933] it seems 
as same as my scenario.

> FairScheduler: Multiple calls to completedContainer are not safe
> 
>
> Key: YARN-3933
> URL: https://issues.apache.org/jira/browse/YARN-3933
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lavkesh Lahngir
>Assignee: Shiwei Guo
>Priority: Major
>  Labels: oct16-medium
> Fix For: 2.8.0, 3.0.0-alpha4
>
> Attachments: YARN-3933.001.patch, YARN-3933.002.patch, 
> YARN-3933.003.patch, YARN-3933.004.patch, YARN-3933.005.patch, 
> YARN-3933.006.patch, yarn-3933-branch-2.8.patch
>
>
> In our cluster we are seeing available memory and cores go negative. 
> Initial inspection:
> Scenario no. 1: 
> In the capacity scheduler, the method allocateContainersToNode() checks 
> whether there are excess container reservations for an application; if they 
> are no longer needed it calls queue.completedContainer(), which causes 
> resources to go negative even though they were never assigned in the first 
> place. 
> I am still looking through the code. Can somebody suggest how to simulate 
> excess container assignments?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-3933) FairScheduler: Multiple calls to completedContainer are not safe

2018-02-07 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356599#comment-16356599
 ] 

stefanlee edited comment on YARN-3933 at 2/8/18 7:36 AM:
-

[~yufeigu] thanks. Yesterday I found that the resource utilization of our 
cluster is very low even though there are a lot of pending applications, and the 
RM shows no exceptions. I then found a queue with negative usage that also has 
pending resources, so I wonder whether a queue with negative resource usage can 
prevent the FairScheduler from assigning containers to any other queue. Thanks 
for this jira [link title|https://issues.apache.org/jira/browse/YARN-3933]; it 
seems the same as my scenario.


was (Author: imstefanlee):
[~yufeigu] thanks, yesterday, i found in our cluster the utilization rate of 
resource  is very low , but there is a lot of pending applications in it, and 
RM has no exception, then  i found a queue has negative-usage and also has 
pending resource, so i doubt Whether a queue has negative-usage resource can 
lead to FairScheduler do not assign containers to any other queues. thanks for 
this jira[https://issues.apache.org/jira/browse/YARN-3933], it seems as same as 
my scenario.

> FairScheduler: Multiple calls to completedContainer are not safe
> 
>
> Key: YARN-3933
> URL: https://issues.apache.org/jira/browse/YARN-3933
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lavkesh Lahngir
>Assignee: Shiwei Guo
>Priority: Major
>  Labels: oct16-medium
> Fix For: 2.8.0, 3.0.0-alpha4
>
> Attachments: YARN-3933.001.patch, YARN-3933.002.patch, 
> YARN-3933.003.patch, YARN-3933.004.patch, YARN-3933.005.patch, 
> YARN-3933.006.patch, yarn-3933-branch-2.8.patch
>
>
> In our cluster we are seeing available memory and cores go negative. 
> Initial inspection:
> Scenario no. 1: 
> In the capacity scheduler, the method allocateContainersToNode() checks 
> whether there are excess container reservations for an application; if they 
> are no longer needed it calls queue.completedContainer(), which causes 
> resources to go negative even though they were never assigned in the first 
> place. 
> I am still looking through the code. Can somebody suggest how to simulate 
> excess container assignments?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3933) FairScheduler: Multiple calls to completedContainer are not safe

2018-02-07 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356599#comment-16356599
 ] 

stefanlee commented on YARN-3933:
-

[~yufeigu] thanks. Yesterday I found that the resource utilization of our 
cluster is very low even though there are a lot of pending applications, and the 
RM shows no exceptions. I then found a queue with negative usage that also has 
pending resources, so I wonder whether a queue with negative resource usage can 
prevent the FairScheduler from assigning containers to any other queue. Thanks 
for this jira [https://issues.apache.org/jira/browse/YARN-3933]; it seems the 
same as my scenario.

> FairScheduler: Multiple calls to completedContainer are not safe
> 
>
> Key: YARN-3933
> URL: https://issues.apache.org/jira/browse/YARN-3933
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lavkesh Lahngir
>Assignee: Shiwei Guo
>Priority: Major
>  Labels: oct16-medium
> Fix For: 2.8.0, 3.0.0-alpha4
>
> Attachments: YARN-3933.001.patch, YARN-3933.002.patch, 
> YARN-3933.003.patch, YARN-3933.004.patch, YARN-3933.005.patch, 
> YARN-3933.006.patch, yarn-3933-branch-2.8.patch
>
>
> In our cluster we are seeing available memory and cores go negative. 
> Initial inspection:
> Scenario no. 1: 
> In the capacity scheduler, the method allocateContainersToNode() checks 
> whether there are excess container reservations for an application; if they 
> are no longer needed it calls queue.completedContainer(), which causes 
> resources to go negative even though they were never assigned in the first 
> place. 
> I am still looking through the code. Can somebody suggest how to simulate 
> excess container assignments?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-3933) FairScheduler: Multiple calls to completedContainer are not safe

2018-02-07 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355169#comment-16355169
 ] 

stefanlee edited comment on YARN-3933 at 2/7/18 8:59 AM:
-

Because of *Needy* in *FairSharePolicy*: can a queue with negative used 
resources cause the FairScheduler to not assign containers to any other queue 
for many scheduling rounds? [~yufeigu] [~djp] [~kasha]


was (Author: imstefanlee):
Because of **Needy** in **FairSharePolicy** , Whether a queue has negative  
used resource can lead to FairScheduler do not assign containers  to any other 
queues in many scheduler rounds or not? [~yufeigu] [~djp] [~kasha]

> FairScheduler: Multiple calls to completedContainer are not safe
> 
>
> Key: YARN-3933
> URL: https://issues.apache.org/jira/browse/YARN-3933
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lavkesh Lahngir
>Assignee: Shiwei Guo
>Priority: Major
>  Labels: oct16-medium
> Fix For: 2.8.0, 3.0.0-alpha4
>
> Attachments: YARN-3933.001.patch, YARN-3933.002.patch, 
> YARN-3933.003.patch, YARN-3933.004.patch, YARN-3933.005.patch, 
> YARN-3933.006.patch, yarn-3933-branch-2.8.patch
>
>
> In our cluster we are seeing available memory and cores go negative. 
> Initial inspection:
> Scenario no. 1: 
> In the capacity scheduler, the method allocateContainersToNode() checks 
> whether there are excess container reservations for an application; if they 
> are no longer needed it calls queue.completedContainer(), which causes 
> resources to go negative even though they were never assigned in the first 
> place. 
> I am still looking through the code. Can somebody suggest how to simulate 
> excess container assignments?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3933) FairScheduler: Multiple calls to completedContainer are not safe

2018-02-07 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355169#comment-16355169
 ] 

stefanlee commented on YARN-3933:
-

Because of **Needy** in **FairSharePolicy**: can a queue with negative used 
resources cause the FairScheduler to not assign containers to any other queue 
for many scheduling rounds? [~yufeigu] [~djp] [~kasha]

> FairScheduler: Multiple calls to completedContainer are not safe
> 
>
> Key: YARN-3933
> URL: https://issues.apache.org/jira/browse/YARN-3933
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lavkesh Lahngir
>Assignee: Shiwei Guo
>Priority: Major
>  Labels: oct16-medium
> Fix For: 2.8.0, 3.0.0-alpha4
>
> Attachments: YARN-3933.001.patch, YARN-3933.002.patch, 
> YARN-3933.003.patch, YARN-3933.004.patch, YARN-3933.005.patch, 
> YARN-3933.006.patch, yarn-3933-branch-2.8.patch
>
>
> In our cluster we are seeing available memory and cores go negative. 
> Initial inspection:
> Scenario no. 1: 
> In the capacity scheduler, the method allocateContainersToNode() checks 
> whether there are excess container reservations for an application; if they 
> are no longer needed it calls queue.completedContainer(), which causes 
> resources to go negative even though they were never assigned in the first 
> place. 
> I am still looking through the code. Can somebody suggest how to simulate 
> excess container assignments?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2018-01-12 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323728#comment-16323728
 ] 

stefanlee edited comment on YARN-7672 at 1/12/18 9:18 AM:
--

[~zsl2007] thanks for this jira. I have merged this patch into my hadoop version, 
and a problem occurred during my testing.

{code:java}
1. RM1 is active, RM2 is standby
2. I run SLSRunnerForRealRM and my jobs run in my cluster with the correct
   user name and queue name.
then:
1. RM1 is standby, RM2 is active
2. I run SLSRunnerForRealRM and my jobs fail over to RM2, but they then run
   in my cluster as the user who ran SLSRunnerForRealRM; that is, they all
   run in one queue.
{code}
I reviewed the hadoop source and found this problem occurs in 
*ConfiguredRMFailoverProxyProvider.getProxyInternal -> RMProxy.getProxy*:

{code:java}
  static <T> T getProxy(final Configuration conf,
      final Class<T> protocol, final InetSocketAddress rmAddress)
      throws IOException {
    return UserGroupInformation.getCurrentUser().doAs(
        new PrivilegedAction<T>() {
          @Override
          public T run() {
            return (T) YarnRPC.create(conf).getProxy(protocol, rmAddress, conf);
          }
        });
  }
{code}

Here it calls *getCurrentUser()*, so we should come up with a solution to 
resolve it.
But if we have only one RM, it runs well. :D


was (Author: imstefanlee):
[~zsl2007] thanks for this jira. i have merged this patch to my hadoop version 
and there is a problem occurred during my testing.

{code:java}
1. RM1 is active ,RM2 is standby
2. i run SLSRunnerForRealRM and my jobs will running in my cluster with correct 
user name and queue name.
then:
1. RM1 is standby , RM2 is active
2. i run SLSRunnerForRealRM and my jobs will failover to RM2, then them will 
running in my cluster with the user who 
 run SLSRunnerForRealRM. that is ,them will running in one queue.
{code}
i review the hadoop resource and found this prolem occurred in 
*ConfiguredRMFailoverProxyProivder.getProxyInternal->RMProxy.getProxy*

{code:java}
  static <T> T getProxy(final Configuration conf,
      final Class<T> protocol, final InetSocketAddress rmAddress)
      throws IOException {
    return UserGroupInformation.getCurrentUser().doAs(
        new PrivilegedAction<T>() {
          @Override
          public T run() {
            return (T) YarnRPC.create(conf).getProxy(protocol, rmAddress, conf);
          }
        });
  }
{code}

here, it will *getCurrentUser()*, so we should come up with a solution to 
resolve it.

> hadoop-sls can not simulate huge scale of YARN
> --
>
> Key: YARN-7672
> URL: https://issues.apache.org/jira/browse/YARN-7672
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: YARN-7672.patch
>
>
> Our YARN cluster has scaled to nearly 10 thousand nodes, and we need to do 
> scheduler pressure testing.
> Using SLS, we start 2000+ threads to simulate NMs and AMs, but the CPU load 
> rises above 100, which I think affects the performance evaluation of the 
> scheduler. 
> So I decided to separate the scheduler from the simulator:
> I start a real RM; then SLS registers nodes with the RM and submits apps to 
> the RM using the RM RPC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2018-01-12 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323728#comment-16323728
 ] 

stefanlee commented on YARN-7672:
-

[~zsl2007] thanks for this jira. I have merged this patch into my hadoop version, 
and a problem occurred during my testing.

{code:java}
1. RM1 is active, RM2 is standby
2. I run SLSRunnerForRealRM and my jobs run in my cluster with the correct
   user name and queue name.
then:
1. RM1 is standby, RM2 is active
2. I run SLSRunnerForRealRM and my jobs fail over to RM2, but they then run
   in my cluster as the user who ran SLSRunnerForRealRM; that is, they all
   run in one queue.
{code}
I reviewed the hadoop source and found this problem occurs in 
*ConfiguredRMFailoverProxyProvider.getProxyInternal -> RMProxy.getProxy*:

{code:java}
  static <T> T getProxy(final Configuration conf,
      final Class<T> protocol, final InetSocketAddress rmAddress)
      throws IOException {
    return UserGroupInformation.getCurrentUser().doAs(
        new PrivilegedAction<T>() {
          @Override
          public T run() {
            return (T) YarnRPC.create(conf).getProxy(protocol, rmAddress, conf);
          }
        });
  }
{code}

Here it calls *getCurrentUser()*, so we should come up with a solution to 
resolve it.
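One possible direction, sketched only. UserGroupInformation.createProxyUser and ClientRMProxy.createRMProxy are real Hadoop APIs, but the factory class, the jobUserName parameter, and whether this fits the SLS code path are assumptions; proxy-user impersonation also has to be allowed by the RM's hadoop.proxyuser configuration.

{code:java}
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.client.ClientRMProxy;

class SlsRmClientFactory {
  // jobUserName is hypothetical: the user recorded in the SLS trace for this
  // job. The proxy is created as that user instead of the process user, so
  // failover re-connects keep the intended identity and queue placement.
  static ApplicationClientProtocol createAs(final String jobUserName,
      final Configuration conf) throws Exception {
    UserGroupInformation submitter = UserGroupInformation.createProxyUser(
        jobUserName, UserGroupInformation.getLoginUser());
    return submitter.doAs(
        new PrivilegedExceptionAction<ApplicationClientProtocol>() {
          @Override
          public ApplicationClientProtocol run() throws Exception {
            return ClientRMProxy.createRMProxy(conf, ApplicationClientProtocol.class);
          }
        });
  }
}
{code}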

> hadoop-sls can not simulate huge scale of YARN
> --
>
> Key: YARN-7672
> URL: https://issues.apache.org/jira/browse/YARN-7672
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhangshilong
>Assignee: zhangshilong
> Attachments: YARN-7672.patch
>
>
> Our YARN cluster has scaled to nearly 10 thousand nodes, and we need to do 
> scheduler pressure testing.
> Using SLS, we start 2000+ threads to simulate NMs and AMs, but the CPU load 
> rises above 100, which I think affects the performance evaluation of the 
> scheduler. 
> So I decided to separate the scheduler from the simulator:
> I start a real RM; then SLS registers nodes with the RM and submits apps to 
> the RM using the RM RPC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7695) when active RM transit to standby , this RM will new another FairSchedulerUpdate Thread

2018-01-03 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309278#comment-16309278
 ] 

stefanlee commented on YARN-7695:
-

There is another problem in this scenario. When I turn on 
*ContinuousScheduling* and submit a lot of applications, so that my cluster has 
no available resource, active RM1's log prints
{code:java}
2018-01-03 16:05:49,860 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: 
Making reservation: node=datanode2 app_id=application_1514952157240_0019
2018-01-03 16:05:49,860 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 Application application_1514952157240_0019 reserved container 
container_1514952157240_0019_02_03 on node host: datanode2:37528 
#containers=2 available= used=, 
currently has 1 at priority 10; currentReservation 3072
2018-01-03 16:05:49,860 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
Updated reserved container container_1514952157240_0019_02_03 on node host: 
datanode2:37528 #containers=2 available= 
used= for application 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp@a9790a8
2018-01-03 16:05:49,868 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Trying to fulfill reservation for application 
appattempt_1514952157240_0019_02 on node: host: datanode2:37528 
#containers=2 available= used=
2018-01-03 16:05:49,868 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: 
Making reservation: node=datanode2 app_id=application_1514952157240_0019
2018-01-03 16:05:49,868 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 Application application_1514952157240_0019 reserved container 
container_1514952157240_0019_02_03 on node host: datanode2:37528 
#containers=2 available= used=, 
currently has 1 at priority 10; currentReservation 3072
{code}
Then I repeat step 3 in the description: active RM1 transitions to standby and 
RM2 runs normally, but standby RM1's log still prints the info above; it seems 
*ContinuousScheduling* runs in a dead loop. So I think it is best to stop these 
threads when RM1 transitions to standby in this scenario.
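To make the suggestion concrete, a minimal sketch of a stoppable scheduling thread (a generic pattern, not the actual FairScheduler code): the transition-to-standby path would call shutdown() on the old instance so it cannot keep reserving containers in the background.

{code:java}
// Generic stoppable scheduling thread (a sketch, not the Hadoop code).
class ContinuousSchedulingThread extends Thread {
  private volatile boolean running = true;

  @Override
  public void run() {
    while (running) {
      // ... one continuous-scheduling pass over the sorted nodes ...
      try {
        Thread.sleep(5);                    // scheduling interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  // Called from the scheduler's stop / transition-to-standby path.
  void shutdown() {
    running = false;
    interrupt();
  }
}
{code}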

> when active RM transit to standby , this RM will new another 
> FairSchedulerUpdate Thread
> ---
>
> Key: YARN-7695
> URL: https://issues.apache.org/jira/browse/YARN-7695
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.4.0
>Reporter: stefanlee
>
> 1. i test hadoop-2.4.0 in my cluster.
> 2. RM1 is active and  RM2 is standby
> 3. i delete /yarn-leader-election/Yarn/ActiveStandbyElectorLock from ZK
> 4. RM1 then transit from active to standby success.
> 5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" 
> and two "FairSchedulerUpdateThread" in RM1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7695) when active RM transit to standby , this RM will new another FairSchedulerUpdate Thread

2018-01-02 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309245#comment-16309245
 ] 

stefanlee commented on YARN-7695:
-

I have a simple fix in *RMActiveServices.serviceInit*:
{code:java}
  // Initialize the scheduler
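  // Reuse any existing scheduler on re-init (e.g. after a later transition)
  // so a second FairSchedulerUpdateThread and AllocationFileReloader are not
  // created.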
  if (scheduler == null) {
    scheduler = createScheduler();
  }
{code}
[~yufeigu] [~templedf] please have a look.

> when active RM transit to standby , this RM will new another 
> FairSchedulerUpdate Thread
> ---
>
> Key: YARN-7695
> URL: https://issues.apache.org/jira/browse/YARN-7695
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.4.0
>Reporter: stefanlee
>
> 1. i test hadoop-2.4.0 in my cluster.
> 2. RM1 is active and  RM2 is standby
> 3. i delete /yarn-leader-election/Yarn/ActiveStandbyElectorLock from ZK
> 4. RM1 then transit from active to standby success.
> 5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" 
> and two "FairSchedulerUpdateThread" in RM1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7695) when active RM transit to standby , this RM will new another FairSchedulerUpdate Thread

2018-01-02 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309071#comment-16309071
 ] 

stefanlee edited comment on YARN-7695 at 1/3/18 7:27 AM:
-

[~templedf] please have a look.


was (Author: imstefanlee):
[~dan...@cloudera.com] please have a look.

> when active RM transit to standby , this RM will new another 
> FairSchedulerUpdate Thread
> ---
>
> Key: YARN-7695
> URL: https://issues.apache.org/jira/browse/YARN-7695
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.4.0
>Reporter: stefanlee
>
> 1. i test hadoop-2.4.0 in my cluster.
> 2. RM1 is active and  RM2 is standby
> 3. i delete /yarn-leader-election/Yarn/ActiveStandbyElectorLock from ZK
> 4. RM1 then transit from active to standby success.
> 5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" 
> and two "FairSchedulerUpdateThread" in RM1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7695) when active RM transit to standby , this RM will new another FairSchedulerUpdate Thread

2018-01-02 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309112#comment-16309112
 ] 

stefanlee edited comment on YARN-7695 at 1/3/18 7:27 AM:
-

I think this problem occurs in 
*transitionToStandby->createAndInitActiveServices->RMActiveServices.serviceInit->scheduler.reinitialize(conf,
 rmContext)*. Is the *scheduler* a new object? Please correct me if I am 
wrong. [~templedf]


was (Author: imstefanlee):
i think this problem occured in 
*transitionToStandby->createAndInitActiveServices->RMActiveServices.serviceInit->scheduler.reinitialize(conf,
 rmContext)*, the *scheduler* is a new object?please correct me if i am 
wrong.[~dan...@cloudera.com]

> when active RM transit to standby , this RM will new another 
> FairSchedulerUpdate Thread
> ---
>
> Key: YARN-7695
> URL: https://issues.apache.org/jira/browse/YARN-7695
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.4.0
>Reporter: stefanlee
>
> 1. i test hadoop-2.4.0 in my cluster.
> 2. RM1 is active and  RM2 is standby
> 3. i delete /yarn-leader-election/Yarn/ActiveStandbyElectorLock from ZK
> 4. RM1 then transit from active to standby success.
> 5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" 
> and two "FairSchedulerUpdateThread" in RM1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7695) when active RM transit to standby , this RM will new another FairSchedulerUpdate Thread

2018-01-02 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309112#comment-16309112
 ] 

stefanlee commented on YARN-7695:
-

I think this problem occurs in 
*transitionToStandby->createAndInitActiveServices->RMActiveServices.serviceInit->scheduler.reinitialize(conf,
 rmContext)*. Is the *scheduler* a new object? Please correct me if I am 
wrong. [~dan...@cloudera.com]

> when active RM transit to standby , this RM will new another 
> FairSchedulerUpdate Thread
> ---
>
> Key: YARN-7695
> URL: https://issues.apache.org/jira/browse/YARN-7695
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.4.0
>Reporter: stefanlee
>
> 1. i test hadoop-2.4.0 in my cluster.
> 2. RM1 is active and  RM2 is standby
> 3. i delete /yarn-leader-election/Yarn/ActiveStandbyElectorLock from ZK
> 4. RM1 then transit from active to standby success.
> 5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" 
> and two "FairSchedulerUpdateThread" in RM1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7695) when active RM transit to standby , this RM will new another FairSchedulerUpdate Thread

2018-01-02 Thread stefanlee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stefanlee updated YARN-7695:

Description: 
1. i test hadoop-2.4.0 in my cluster.
2. RM1 is active and  RM2 is standby
3. i delete /yarn-leader-election/Yarn/ActiveStandbyElectorLock from ZK
4. RM1 then transit from active to standby success.
5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" and 
two "FairSchedulerUpdateThread" in RM1.

  was:
1. i test hadoop-2.4.0 in my cluster.
2. RM1 is active and  RM2 is standby
3. i delete /yarn-leader-election/DevSuningYarn/ActiveStandbyElectorLock from ZK
4. RM1 then transit from active to standby success.
5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" and 
two "FairSchedulerUpdateThread" in RM1.


> when active RM transit to standby , this RM will new another 
> FairSchedulerUpdate Thread
> ---
>
> Key: YARN-7695
> URL: https://issues.apache.org/jira/browse/YARN-7695
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.4.0
>Reporter: stefanlee
>
> 1. i test hadoop-2.4.0 in my cluster.
> 2. RM1 is active and  RM2 is standby
> 3. i delete /yarn-leader-election/Yarn/ActiveStandbyElectorLock from ZK
> 4. RM1 then transit from active to standby success.
> 5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" 
> and two "FairSchedulerUpdateThread" in RM1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7695) when active RM transit to standby , this RM will new another FairSchedulerUpdate Thread

2018-01-02 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309071#comment-16309071
 ] 

stefanlee commented on YARN-7695:
-

[~dan...@cloudera.com] please have a look.

> when active RM transit to standby , this RM will new another 
> FairSchedulerUpdate Thread
> ---
>
> Key: YARN-7695
> URL: https://issues.apache.org/jira/browse/YARN-7695
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.4.0
>Reporter: stefanlee
>
> 1. i test hadoop-2.4.0 in my cluster.
> 2. RM1 is active and  RM2 is standby
> 3. i delete /yarn-leader-election/DevSuningYarn/ActiveStandbyElectorLock from 
> ZK
> 4. RM1 then transit from active to standby success.
> 5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" 
> and two "FairSchedulerUpdateThread" in RM1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7695) when active RM transit to standby , this RM will new another FairSchedulerUpdate Thread

2018-01-02 Thread stefanlee (JIRA)
stefanlee created YARN-7695:
---

 Summary: when active RM transit to standby , this RM will new 
another FairSchedulerUpdate Thread
 Key: YARN-7695
 URL: https://issues.apache.org/jira/browse/YARN-7695
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, resourcemanager
Affects Versions: 2.4.0
Reporter: stefanlee


1. i test hadoop-2.4.0 in my cluster.
2. RM1 is active and  RM2 is standby
3. i delete /yarn-leader-election/DevSuningYarn/ActiveStandbyElectorLock from ZK
4. RM1 then transit from active to standby success.
5. at last ,i print RM1 jstack info and found two "AllocationFileReloader" and 
two "FairSchedulerUpdateThread" in RM1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2017-12-27 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305098#comment-16305098
 ] 

stefanlee commented on YARN-3979:
-

[~piaoyu zhang] thanks for this jira. Could you please tell me why you tuned 
*yarn.resourcemanager.client.thread-count 50 -> 100*, 
*yarn.resourcemanager.scheduler.client.thread-count 50 -> 100*, and 
*yarn.resourcemanager.resource-tracker.client.thread-count 50 -> 80*?

> Am in ResourceLocalizationService hang 10 min cause RM kill  AM
> ---
>
> Key: YARN-3979
> URL: https://issues.apache.org/jira/browse/YARN-3979
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.2.0
> Environment: CentOS 6.5  Hadoop-2.2.0
>Reporter: zhangyubiao
> Attachments: ERROR103.log
>
>
> 2015-07-27 02:46:17,348 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1437735375558
> _104282_01_01
> 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE)
> 2015-07-27 02:56:18,510 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for appattempt_1437735375558_104282_0
> 1 (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.api.ContainerManagementProtocolPB



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2017-12-25 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303440#comment-16303440
 ] 

stefanlee commented on YARN-3136:
-

[~jlowe] thanks for this jira. Could you please tell me which jira covers *We've 
already done similar work during AM allocate calls to make sure they don't 
needlessly grab the scheduler lock*? I found that *AM allocate* calls easily get 
BLOCKED in my RM.

> getTransferredContainers can be a bottleneck during AM registration
> ---
>
> Key: YARN-3136
> URL: https://issues.apache.org/jira/browse/YARN-3136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Sunil G
>  Labels: 2.7.2-candidate
> Fix For: 2.8.0, 2.7.2, 3.0.0-alpha1
>
> Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 
> 00011-YARN-3136.patch, 00012-YARN-3136.patch, 00013-YARN-3136.patch, 
> 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 
> 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 
> 0008-YARN-3136.patch, 0009-YARN-3136.patch, YARN-3136.branch-2.7.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4980) SchedulerEventDispatcher should not process by only a single EventProcessor thread

2017-12-25 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303252#comment-16303252
 ] 

stefanlee commented on YARN-4980:
-

[~chenfolin] Hi, have you resolved this problem, or optimized it?

> SchedulerEventDispatcher should not process by only a single EventProcessor 
> thread
> --
>
> Key: YARN-4980
> URL: https://issues.apache.org/jira/browse/YARN-4980
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.4
> Environment: 1 resourcemanager
> 500 nodemanager
> nodemanager heartbeat interval is 3 secs.
>Reporter: ChenFolin
>
> I think only a single EventProcessor thread in SchedulerEventDispatcher to 
> process event is unreasonable.
> I often see "Size of scheduler event-queue is 1000(2000..)..." at the 
> resourcemanager log.
> Now I have 500 nodemanager , event process is slowly. If I  add another 200 
> nodemanagers , the situation will badly.
> ... so,can node events(node add/delete/update...) processed by multi threads?
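As a purely illustrative sketch of the idea raised in the description (not the real SchedulerEventDispatcher), node events could be hashed onto a small pool of single-threaded workers so that events for the same node stay ordered while different nodes are handled in parallel; the class below is hypothetical.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartitionedNodeEventDispatcher {
  private final ExecutorService[] workers;

  public PartitionedNodeEventDispatcher(int parallelism) {
    workers = new ExecutorService[parallelism];
    for (int i = 0; i < parallelism; i++) {
      // One thread per partition preserves per-node event ordering.
      workers[i] = Executors.newSingleThreadExecutor();
    }
  }

  // nodeKey could be the NodeId; the same node always maps to the same worker.
  public void dispatch(Object nodeKey, Runnable handleEvent) {
    int idx = Math.floorMod(nodeKey.hashCode(), workers.length);
    workers[idx].execute(handleEvent);
  }
}
{code}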



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6487) FairScheduler: remove continuous scheduling (YARN-1010)

2017-12-08 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283668#comment-16283668
 ] 

stefanlee commented on YARN-6487:
-

It seems continuous scheduling can impact scheduler performance.

> FairScheduler: remove continuous scheduling (YARN-1010)
> ---
>
> Key: YARN-6487
> URL: https://issues.apache.org/jira/browse/YARN-6487
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>
> Remove deprecated FairScheduler continuous scheduler code



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1010) FairScheduler: decouple container scheduling from nodemanager heartbeats

2017-12-08 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283660#comment-16283660
 ] 

stefanlee commented on YARN-1010:
-

Thanks [~ywskycn]. I wonder whether the _if (!completedContainers.isEmpty())_ check impacts 
scheduler performance; why was this condition added here?


> FairScheduler: decouple container scheduling from nodemanager heartbeats
> 
>
> Key: YARN-1010
> URL: https://issues.apache.org/jira/browse/YARN-1010
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Alejandro Abdelnur
>Assignee: Wei Yan
>Priority: Critical
> Fix For: 2.3.0
>
> Attachments: YARN-1010.patch, YARN-1010.patch
>
>
> Currently scheduling for a node is done when a node heartbeats.
> For large cluster where the heartbeat interval is set to several seconds this 
> delays scheduling of incoming allocations significantly.
> We could have a continuous loop scanning all nodes and doing scheduling. If 
> there is availability AMs will get the allocation in the next heartbeat after 
> the one that placed the request.
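A rough sketch of the continuous-scheduling loop described above (not the exact FairScheduler implementation; the node-handle type and helper methods are placeholders):
{code}
public class ContinuousSchedulingLoop implements Runnable {
  private final long intervalMs;
  private volatile boolean running = true;

  public ContinuousSchedulingLoop(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  @Override
  public void run() {
    while (running) {
      // Walk every node and try to schedule, instead of waiting for each heartbeat.
      for (NodeHandle node : snapshotOfNodes()) {
        attemptSchedulingOn(node);
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  public void stop() { running = false; }

  // Placeholders standing in for scheduler internals.
  interface NodeHandle {}
  private Iterable<NodeHandle> snapshotOfNodes() { return java.util.Collections.emptyList(); }
  private void attemptSchedulingOn(NodeHandle node) { /* one scheduling attempt */ }
}
{code}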



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-402) Dispatcher warn message is too late

2017-12-07 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283150#comment-16283150
 ] 

stefanlee commented on YARN-402:


Thanks [~djp]. Our cluster is very busy between 3:00 a.m. and 7:00 a.m., and the event-queue 
size grows to 4000-5000. How can I optimize this?

> Dispatcher warn message is too late
> ---
>
> Key: YARN-402
> URL: https://issues.apache.org/jira/browse/YARN-402
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Lohit Vijayarenu
>Priority: Minor
>
> AsyncDispatcher throws out Warn when capacity remaining is less than 1000
> {noformat}
> if (remCapacity < 1000) {
> LOG.warn("Very low remaining capacity in the event-queue: "
> + remCapacity);
>   }
> {noformat}
> What would be useful is to warn much before that, may be half full instead of 
> when queue is completely full. I see that eventQueue capacity is int value. 
> So, if one warn's queue has only 1000 capacity left, then service definitely 
> has serious problem.
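A minimal sketch of the "warn earlier" suggestion, assuming the dispatcher knows its queue's total capacity; the half-full threshold and class name are illustrative only.
{code}
import java.util.concurrent.BlockingQueue;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class QueueCapacityMonitor {
  private static final Logger LOG = LoggerFactory.getLogger(QueueCapacityMonitor.class);

  // Warn once the queue is more than half full, not only when <1000 slots remain.
  public static void checkCapacity(BlockingQueue<?> eventQueue, int totalCapacity) {
    int remaining = eventQueue.remainingCapacity();
    if (remaining < totalCapacity / 2) {
      LOG.warn("Event queue is more than half full, remaining capacity: " + remaining);
    }
  }
}
{code}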



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6107) ResourceManager recovered with NPE Exception due to zk store failed

2017-08-03 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113838#comment-16113838
 ] 

stefanlee commented on YARN-6107:
-

Are there Flink-type applications running in your cluster? This bug is fixed in hadoop-2.6; 
you can refer to [https://issues.apache.org/jira/browse/YARN-2823]

> ResourceManager recovered with NPE Exception due to zk store failed
> ---
>
> Key: YARN-6107
> URL: https://issues.apache.org/jira/browse/YARN-6107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.5.1
>Reporter: liuxiangwei
>
> Firstly, RM is stopped by the exception below:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /nmg01-khan-yarn-on-normandy-rmstore/ZKRM
> StateRoot/RMAppRoot/application_1484014091623_3711/appattempt_1484014091623_3711_01
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:960)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:957)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1007)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1026)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:957)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:65
> 4)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:236)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:219)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:774)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:845)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:840)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:662)
> Secondly, Restart the RM but never success due to exception below:
> 2017-01-18 15:07:48,130 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ATTEMPT_ADDED t
> o the scheduler
> java.lang.NullPointerException
> The stack trace points to the code blow:
> SchedulerApplication application =
> applications.get(appAttemptId.getApplicationId());
> It seems application does not exist.
> And we found log like this
> 2017-01-18 15:11:21,204 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering 
> app: application_1484014091623_3711 wi
> th 1 attempts and final state = FINISHED
> 2017-01-18 15:11:21,204 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Recovering attempt: appattempt_148
> 4014091623_3711_01 with final state: null
> 2017-01-18 15:11:21,204 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1484014091623_3711_
> 01 State change from NEW to LAUNCHED
> 2017-01-18 15:11:21,204 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1484014091623_3711 State change from 
> NEW to FINISHED
> the final states do not make equal.  
> We have to check the application whether is null to avoid this problem and 
> make this failover success.



--
This message was sent by Atlassian JIRA

[jira] [Commented] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster

2017-08-03 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113767#comment-16113767
 ] 

stefanlee commented on YARN-2823:
-

IMO, the NPE happens when *transferStateFromPreviousAttempt* is *true*, and the value of 
*transferStateFromPreviousAttempt* depends on *KeepContainersAcrossApplicationAttempts* in 
*ApplicationSubmissionContext*. I hit this NPE because there are *FLINK*-type applications 
running in my cluster, and I saw that the default value of 
*KeepContainersAcrossApplicationAttempts* in the Flink code is *true*. So I want to know: if 
*KeepContainersAcrossApplicationAttempts* is *false*, can this NPE still happen? [~jianhe] Thanks.
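For context, a minimal client-side sketch of the flag being discussed, using the public ApplicationSubmissionContext API; whether disabling it avoids the NPE is exactly the open question above.
{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

public class SubmissionContextExample {
  public static ApplicationSubmissionContext newContext() {
    ApplicationSubmissionContext ctx =
        Records.newRecord(ApplicationSubmissionContext.class);
    // Default is false; Flink sets this to true so containers survive AM restarts,
    // which is the code path discussed above.
    ctx.setKeepContainersAcrossApplicationAttempts(false);
    return ctx;
  }
}
{code}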

> NullPointerException in RM HA enabled 3-node cluster
> 
>
> Key: YARN-2823
> URL: https://issues.apache.org/jira/browse/YARN-2823
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Gour Saha
>Assignee: Jian He
>Priority: Critical
> Fix For: 2.6.0
>
> Attachments: logs_with_NPE_in_RM.zip, YARN-2823.1.patch
>
>
> Branch:
> 2.6.0
> Environment: 
> A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used 
> Ambari) and then installed HBase using Slider. After some time the RMs went 
> down and would not come back up anymore. Following is the NPE we see in both 
> the RM logs.
> {noformat}
> 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:run(612)) - Error in handling event type 
> APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603)
> at java.lang.Thread.run(Thread.java:744)
> 2014-09-16 01:36:28,042 INFO  resourcemanager.ResourceManager 
> (ResourceManager.java:run(616)) - Exiting, bbye..
> {noformat}
> All the logs for this 3-node cluster has been uploaded.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6125) The application attempt's diagnostic message should have a maximum size

2017-05-24 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022397#comment-16022397
 ] 

stefanlee commented on YARN-6125:
-

[~templedf] thanks

> The application attempt's diagnostic message should have a maximum size
> ---
>
> Key: YARN-6125
> URL: https://issues.apache.org/jira/browse/YARN-6125
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: Daniel Templeton
>Assignee: Andras Piros
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: YARN-6125.000.patch, YARN-6125.001.patch, 
> YARN-6125.002.patch, YARN-6125.003.patch, YARN-6125.004.patch, 
> YARN-6125.005.patch, YARN-6125.006.patch, YARN-6125.007.patch, 
> YARN-6125.008.patch, YARN-6125.009.patch
>
>
> We've found through experience that the diagnostic message can grow 
> unbounded.  I've seen attempts that have diagnostic messages over 1MB.  Since 
> the message is stored in the state store, it's a bad idea to allow the 
> message to grow unbounded.  Instead, there should be a property that sets a 
> maximum size on the message.
> I suspect that some of the ZK state store issues we've seen in the past were 
> due to the size of the diagnostic messages and not to the size of the 
> classpath, as is the current prevailing opinion.
> An open question is how best to prune the message once it grows too large.  
> Should we
> # truncate the tail,
> # truncate the head,
> # truncate the middle,
> # add another property to make the behavior selectable, or
> # none of the above?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-5006) ResourceManager quit due to ApplicationStateData exceed the limit size of znode in zk

2017-05-23 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021078#comment-16021078
 ] 

stefanlee edited comment on YARN-5006 at 5/23/17 11:39 AM:
---

[~bibinchundatt] Thanks, but why can "add 1 file into DistributedCache" cause 
*ApplicationStateData* to exceed *1M*?


was (Author: imstefanlee):
thanks, but  why  "add 1 file into DistributedCache" can due to 
ApplicationStateData exceed 1M?

> ResourceManager quit due to ApplicationStateData exceed the limit  size of 
> znode in zk
> --
>
> Key: YARN-5006
> URL: https://issues.apache.org/jira/browse/YARN-5006
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0, 2.7.2
>Reporter: dongtingting
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-5006.001.patch, YARN-5006.002.patch
>
>
> Client submit a job, this job add 1 file into DistributedCache. when the 
> job is submitted, ResourceManager stores ApplicationStateData into zk. 
> ApplicationStateData exceeds the znode size limit. RM exits with code 1.   
> The related code in RMStateStore.java :
> {code}
>   private static class StoreAppTransition
>   implements SingleArcTransition {
> @Override
> public void transition(RMStateStore store, RMStateStoreEvent event) {
>   if (!(event instanceof RMStateStoreAppEvent)) {
> // should never happen
> LOG.error("Illegal event type: " + event.getClass());
> return;
>   }
>   ApplicationState appState = ((RMStateStoreAppEvent) 
> event).getAppState();
>   ApplicationId appId = appState.getAppId();
>   ApplicationStateData appStateData = ApplicationStateData
>   .newInstance(appState);
>   LOG.info("Storing info for app: " + appId);
>   try {  
> store.storeApplicationStateInternal(appId, appStateData);  //store 
> the appStateData
> store.notifyApplication(new RMAppEvent(appId,
>RMAppEventType.APP_NEW_SAVED));
>   } catch (Exception e) {
> LOG.error("Error storing app: " + appId, e);
> store.notifyStoreOperationFailed(e);   //handle fail event, system 
> exit 
>   }
> };
>   }
> {code}
> The Exception log:
> {code}
>  ...
> 2016-04-20 11:26:35,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore 
> AsyncDispatcher event handler: Maxed out ZK retries. Giving up!
> 2016-04-20 11:26:35,732 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> AsyncDispatcher event handler: Error storing app: 
> application_1461061795989_17671
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> 

[jira] [Commented] (YARN-5006) ResourceManager quit due to ApplicationStateData exceed the limit size of znode in zk

2017-05-23 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021078#comment-16021078
 ] 

stefanlee commented on YARN-5006:
-

Thanks, but why can "add 1 file into DistributedCache" cause ApplicationStateData to exceed 1M?

> ResourceManager quit due to ApplicationStateData exceed the limit  size of 
> znode in zk
> --
>
> Key: YARN-5006
> URL: https://issues.apache.org/jira/browse/YARN-5006
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0, 2.7.2
>Reporter: dongtingting
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-5006.001.patch, YARN-5006.002.patch
>
>
> Client submit a job, this job add 1 file into DistributedCache. when the 
> job is submitted, ResourceManager stores ApplicationStateData into zk. 
> ApplicationStateData exceeds the znode size limit. RM exits with code 1.   
> The related code in RMStateStore.java :
> {code}
>   private static class StoreAppTransition
>   implements SingleArcTransition {
> @Override
> public void transition(RMStateStore store, RMStateStoreEvent event) {
>   if (!(event instanceof RMStateStoreAppEvent)) {
> // should never happen
> LOG.error("Illegal event type: " + event.getClass());
> return;
>   }
>   ApplicationState appState = ((RMStateStoreAppEvent) 
> event).getAppState();
>   ApplicationId appId = appState.getAppId();
>   ApplicationStateData appStateData = ApplicationStateData
>   .newInstance(appState);
>   LOG.info("Storing info for app: " + appId);
>   try {  
> store.storeApplicationStateInternal(appId, appStateData);  //store 
> the appStateData
> store.notifyApplication(new RMAppEvent(appId,
>RMAppEventType.APP_NEW_SAVED));
>   } catch (Exception e) {
> LOG.error("Error storing app: " + appId, e);
> store.notifyStoreOperationFailed(e);   //handle fail event, system 
> exit 
>   }
> };
>   }
> {code}
> The Exception log:
> {code}
>  ...
> 2016-04-20 11:26:35,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore 
> AsyncDispatcher event handler: Maxed out ZK retries. Giving up!
> 2016-04-20 11:26:35,732 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> AsyncDispatcher event handler: Error storing app: 
> application_1461061795989_17671
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
> at 
> 

[jira] [Commented] (YARN-6125) The application attempt's diagnostic message should have a maximum size

2017-05-23 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020819#comment-16020819
 ] 

stefanlee commented on YARN-6125:
-

Thanks for this JIRA. I have a question: why is the class *BoundedAppender* declared 
*static*? [~templedf] [~andras.piros]
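Not the actual YARN BoundedAppender, just a sketch of the pattern: a static nested class carries no implicit reference to its enclosing instance, so it can be constructed and tested on its own. The names below are made up for illustration.
{code}
public class AttemptDiagnostics {

  // Static nested helper: no hidden pointer to the enclosing AttemptDiagnostics object.
  static final class BoundedBuffer {
    private final int limit;
    private final StringBuilder buffer = new StringBuilder();

    BoundedBuffer(int limit) {
      this.limit = limit;
    }

    void append(String message) {
      buffer.append(message);
      // Keep only the newest 'limit' characters (truncate the head).
      if (buffer.length() > limit) {
        buffer.delete(0, buffer.length() - limit);
      }
    }

    @Override
    public String toString() {
      return buffer.toString();
    }
  }
}
{code}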

> The application attempt's diagnostic message should have a maximum size
> ---
>
> Key: YARN-6125
> URL: https://issues.apache.org/jira/browse/YARN-6125
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: Daniel Templeton
>Assignee: Andras Piros
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: YARN-6125.000.patch, YARN-6125.001.patch, 
> YARN-6125.002.patch, YARN-6125.003.patch, YARN-6125.004.patch, 
> YARN-6125.005.patch, YARN-6125.006.patch, YARN-6125.007.patch, 
> YARN-6125.008.patch, YARN-6125.009.patch
>
>
> We've found through experience that the diagnostic message can grow 
> unbounded.  I've seen attempts that have diagnostic messages over 1MB.  Since 
> the message is stored in the state store, it's a bad idea to allow the 
> message to grow unbounded.  Instead, there should be a property that sets a 
> maximum size on the message.
> I suspect that some of the ZK state store issues we've seen in the past were 
> due to the size of the diagnostic messages and not to the size of the 
> classpath, as is the current prevailing opinion.
> An open question is how best to prune the message once it grows too large.  
> Should we
> # truncate the tail,
> # truncate the head,
> # truncate the middle,
> # add another property to make the behavior selectable, or
> # none of the above?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5006) ResourceManager quit due to ApplicationStateData exceed the limit size of znode in zk

2017-05-21 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16019128#comment-16019128
 ] 

stefanlee commented on YARN-5006:
-

thanks for this jira, i have a doubt that why "add 1 file into 
DistributedCache" can due to *ApplicationStateData*  exceed *1M*.  IMO, 
*ApplicationStateData*  contains _startTime, user, 
ApplicationSubmissionContext, state ,diagnostics and finishTime_,  they are all 
small except *diagnostics*, in my scenario, a failed  spark applicaiton has 
*4M* info of *diagnostics*,when it update info to ZK, the 
*ApplicationStateData*  exceed *1M*.  then  RM lost connection with ZK,  so i 
think it is important to fix the size of  *diagnostics*  when 
 operate  *updateApplicationAttemptStateInternal* and 
*updateApplicationStateInternal*  in *ZKRMStateStore*, am i wrong with this 
question?[~dongtingting8...@163.com] [~templedf] [~bibinchundatt]
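A minimal sketch of the mitigation suggested above, assuming a hypothetical helper that bounds the diagnostics string before the state is serialized and written to ZooKeeper; the limit and class name are illustrative, not an existing YARN configuration.
{code}
public class DiagnosticsLimiter {
  // Illustrative cap; ZooKeeper's default jute.maxbuffer rejects payloads around 1MB.
  static final int MAX_DIAGNOSTICS_CHARS = 64 * 1024;

  public static String bound(String diagnostics) {
    if (diagnostics == null || diagnostics.length() <= MAX_DIAGNOSTICS_CHARS) {
      return diagnostics;
    }
    // Keep the tail, which usually contains the most recent (and relevant) errors.
    return "...(truncated)..."
        + diagnostics.substring(diagnostics.length() - MAX_DIAGNOSTICS_CHARS);
  }
}
{code}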

> ResourceManager quit due to ApplicationStateData exceed the limit  size of 
> znode in zk
> --
>
> Key: YARN-5006
> URL: https://issues.apache.org/jira/browse/YARN-5006
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0, 2.7.2
>Reporter: dongtingting
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-5006.001.patch, YARN-5006.002.patch
>
>
> Client submit a job, this job add 1 file into DistributedCache. when the 
> job is submitted, ResourceManager stores ApplicationStateData into zk. 
> ApplicationStateData exceeds the znode size limit. RM exits with code 1.   
> The related code in RMStateStore.java :
> {code}
>   private static class StoreAppTransition
>   implements SingleArcTransition {
> @Override
> public void transition(RMStateStore store, RMStateStoreEvent event) {
>   if (!(event instanceof RMStateStoreAppEvent)) {
> // should never happen
> LOG.error("Illegal event type: " + event.getClass());
> return;
>   }
>   ApplicationState appState = ((RMStateStoreAppEvent) 
> event).getAppState();
>   ApplicationId appId = appState.getAppId();
>   ApplicationStateData appStateData = ApplicationStateData
>   .newInstance(appState);
>   LOG.info("Storing info for app: " + appId);
>   try {  
> store.storeApplicationStateInternal(appId, appStateData);  //store 
> the appStateData
> store.notifyApplication(new RMAppEvent(appId,
>RMAppEventType.APP_NEW_SAVED));
>   } catch (Exception e) {
> LOG.error("Error storing app: " + appId, e);
> store.notifyStoreOperationFailed(e);   //handle fail event, system 
> exit 
>   }
> };
>   }
> {code}
> The Exception log:
> {code}
>  ...
> 2016-04-20 11:26:35,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore 
> AsyncDispatcher event handler: Maxed out ZK retries. Giving up!
> 2016-04-20 11:26:35,732 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> AsyncDispatcher event handler: Error storing app: 
> application_1461061795989_17671
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)

[jira] [Comment Edited] (YARN-3269) Yarn.nodemanager.remote-app-log-dir could not be configured to fully qualified path

2017-05-03 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994501#comment-15994501
 ] 

stefanlee edited comment on YARN-3269 at 5/3/17 8:59 AM:
-

Thanks for this JIRA, but when we do not configure *yarn.nodemanager.remote-app-log-dir*, it 
defaults to */tmp/logs*, and the NM or the MR history server checks the scheme of */tmp/logs* 
when writing logs to or reading logs from HDFS; they then throw {{No AbstractFileSystem for 
scheme: null}} from *AbstractFileSystem.java*. So I suggest that when 
*yarn.nodemanager.remote-app-log-dir* is the default value (scheme is null), we still use 
*FileContext.getFileContext(conf)* in *AggregatedLogFormat.java*, and otherwise use 
*FileContext.getFileContext(remoteAppLogFile.toUri(), conf)*. Am I wrong about this? My Hadoop 
version is 2.4.0. [~xgong] [~zhz]
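A small sketch of the suggested fallback, using the two existing FileContext factory methods mentioned above; the wrapper class itself is hypothetical.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.UnsupportedFileSystemException;

public class LogDirFileContext {
  public static FileContext forRemoteAppLogFile(Path remoteAppLogFile, Configuration conf)
      throws UnsupportedFileSystemException {
    if (remoteAppLogFile.toUri().getScheme() == null) {
      // No scheme (e.g. the default /tmp/logs): fall back to the default file system.
      return FileContext.getFileContext(conf);
    }
    // Fully qualified path: resolve the FileContext from the URI.
    return FileContext.getFileContext(remoteAppLogFile.toUri(), conf);
  }
}
{code}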


was (Author: imstefanlee):
thanks for this jira, but  when we do not config 
*yarn.nodemanager.remote-app-log-dir*, it will be */tmp/logs* and NM   or  mr 
historyserver  will check */tmp/logs* 's  scheme  when write log to HDFS or 
read log from HDFS,then they will throw exception of {{No AbstractFileSystem 
for scheme: null}}  in class *AbstractFileSystem.java* ,so i suggest that when  
*yarn.nodemanager.remote-app-log-dir*  is default value(scheme is null ), we 
still use *FileContext.getFileContext(conf)*  in *AggregatedLogFormat.java*, 
else we can use *FileContext.getFileContext(remoteAppLogFile.toUri(), conf)* , 
am i wrong with this question?  [~xgong]  [~zhz]

> Yarn.nodemanager.remote-app-log-dir could not be configured to fully 
> qualified path
> ---
>
> Key: YARN-3269
> URL: https://issues.apache.org/jira/browse/YARN-3269
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: YARN-3269.1.patch, YARN-3269.2.patch
>
>
> Log aggregation currently is always relative to the default file system, not 
> an arbitrary file system identified by URI. So we can't put an arbitrary 
> fully-qualified URI into yarn.nodemanager.remote-app-log-dir.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3269) Yarn.nodemanager.remote-app-log-dir could not be configured to fully qualified path

2017-05-03 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994501#comment-15994501
 ] 

stefanlee commented on YARN-3269:
-

Thanks for this JIRA, but when we do not configure *yarn.nodemanager.remote-app-log-dir*, it 
defaults to */tmp/logs*, and the NM or the MR history server checks the scheme of */tmp/logs* 
when writing logs to or reading logs from HDFS; they then throw {{No AbstractFileSystem for 
scheme: null}} from *AbstractFileSystem.java*. So I suggest that when 
*yarn.nodemanager.remote-app-log-dir* is the default value (scheme is null), we still use 
*FileContext.getFileContext(conf)* in *AggregatedLogFormat.java*, and otherwise use 
*FileContext.getFileContext(remoteAppLogFile.toUri(), conf)*. Am I wrong about this? 
[~xgong] [~zhz]

> Yarn.nodemanager.remote-app-log-dir could not be configured to fully 
> qualified path
> ---
>
> Key: YARN-3269
> URL: https://issues.apache.org/jira/browse/YARN-3269
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: YARN-3269.1.patch, YARN-3269.2.patch
>
>
> Log aggregation currently is always relative to the default file system, not 
> an arbitrary file system identified by URI. So we can't put an arbitrary 
> fully-qualified URI into yarn.nodemanager.remote-app-log-dir.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2016-11-25 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695267#comment-15695267
 ] 

stefanlee commented on YARN-3795:
-

Hi, I have the same problem, but my scenario is this: when I failed over from RM2 to RM1, the 
ZooKeeper on RM1 reported a large watcher count while RM1 was healthy. I then rebooted the 
ZooKeeper on RM1, and after that RM1's web UI could not be accessed, there were a lot of 
"Broken pipe" messages in RM1's log, and "java.io.IOException: Len error" appeared in the ZK 
server's log. So I want to know: was your ZK healthy when the above problem occurred?

> ZKRMStateStore crashes due to IOException: Broken pipe
> --
>
> Key: YARN-3795
> URL: https://issues.apache.org/jira/browse/YARN-3795
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: lachisis
>Priority: Critical
> Fix For: 2.7.1
>
>
> 2015-06-05 06:06:54,848 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to dap88/134.41.33.88:2181, initiating session
> 2015-06-05 06:06:54,876 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server dap88/134.41.33.88:2181, sessionid = 
> 0x34db2f72ac50c86, negotiated timeout = 1
> 2015-06-05 06:06:54,881 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Watcher event type: None with state:SyncConnected for path:null for Service 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2015-06-05 06:06:54,881 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2015-06-05 06:06:54,881 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2015-06-05 06:06:54,881 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x34db2f72ac50c86 for server dap88/134.41.33.88:2181, unexpected error, 
> closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
>   at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
>   at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)
> 2015-06-05 06:06:54,986 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Watcher event type: None with state:Disconnected for path:null for Service 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2015-06-05 06:06:54,986 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session disconnected
> 2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server dap87/134.41.33.87:2181. Will not attempt to 
> authenticate using SASL (unknown error)
> 2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to dap87/134.41.33.87:2181, initiating session
> 2015-06-05 06:06:55,330 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server dap87/134.41.33.87:2181, sessionid = 
> 0x34db2f72ac50c86, negotiated timeout = 1
> 2015-06-05 06:06:55,343 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Watcher event type: None with state:SyncConnected for path:null for Service 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2015-06-05 06:06:55,343 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2015-06-05 06:06:55,344 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2015-06-05 06:06:55,345 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x34db2f72ac50c86 for server dap87/134.41.33.87:2181, unexpected error, 
> closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
>   at 

[jira] [Commented] (YARN-4205) Add a service for monitoring application life time out

2016-11-20 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682384#comment-15682384
 ] 

stefanlee commented on YARN-4205:
-

I wonder whether this is suitable for long-running jobs, e.g. Spark Streaming jobs?

> Add a service for monitoring application life time out
> --
>
> Key: YARN-4205
> URL: https://issues.apache.org/jira/browse/YARN-4205
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: nijel
>Assignee: Rohith Sharma K S
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: 0001-YARN-4205.patch, 0002-YARN-4205.patch, 
> 0003-YARN-4205.patch, 0004-YARN-4205.patch, 0005-YARN-4205.patch, 
> 0006-YARN-4205.patch, 0007-YARN-4205.1.patch, 0007-YARN-4205.2.patch, 
> 0007-YARN-4205.patch, YARN-4205-addendum.001.patch, YARN-4205_01.patch, 
> YARN-4205_02.patch, YARN-4205_03.patch
>
>
> This JIRA intend to provide a lifetime monitor service. 
> The service will monitor the applications where the life time is configured. 
> If the application is running beyond the lifetime, it will be killed. 
> The lifetime will be considered from the submit time.
> The thread monitoring interval is configurable.
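For reference, a minimal sketch of how a per-application lifetime is requested, assuming the ApplicationTimeouts API that ships with this feature (2.9+/3.x); long-running jobs such as Spark Streaming that never set a LIFETIME timeout would not be affected. Method and enum names are as I recall them and may differ by version.
{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ApplicationTimeoutType;
import org.apache.hadoop.yarn.util.Records;

public class LifetimeExample {
  public static ApplicationSubmissionContext withLifetime(long lifetimeSeconds) {
    ApplicationSubmissionContext ctx =
        Records.newRecord(ApplicationSubmissionContext.class);
    // Lifetime is counted from submit time; the monitor kills the app if it runs longer.
    Map<ApplicationTimeoutType, Long> timeouts = new HashMap<>();
    timeouts.put(ApplicationTimeoutType.LIFETIME, lifetimeSeconds);
    ctx.setApplicationTimeouts(timeouts);
    return ctx;
  }
}
{code}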



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority

2016-10-26 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607700#comment-15607700
 ] 

stefanlee commented on YARN-4014:
-

It's done. I had forgotten to add |optional PriorityProto applicationPriority = 1;| 
in yarn_service_protos.proto.

> Support user cli interface in for Application Priority
> --
>
> Key: YARN-4014
> URL: https://issues.apache.org/jira/browse/YARN-4014
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client, resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 
> 0002-YARN-4014.patch, 0003-YARN-4014.patch, 0004-YARN-4014.patch, 
> 0004-YARN-4014.patch
>
>
> Track the changes for user-RM client protocol i.e ApplicationClientProtocol 
> changes and discussions in this jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5006) ResourceManager quit due to ApplicationStateData exceed the limit size of znode in zk

2016-10-26 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607596#comment-15607596
 ] 

stefanlee commented on YARN-5006:
-

Hi, what is the status of this patch?

> ResourceManager quit due to ApplicationStateData exceed the limit  size of 
> znode in zk
> --
>
> Key: YARN-5006
> URL: https://issues.apache.org/jira/browse/YARN-5006
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0, 2.7.2
>Reporter: dongtingting
>Priority: Critical
>
> Client submit a job, this job add 1 file into DistributedCache. when the 
> job is submitted, ResourceManager stores ApplicationStateData into zk. 
> ApplicationStateData exceeds the znode size limit. RM exits with code 1.   
> The related code in RMStateStore.java :
> {code}
>   private static class StoreAppTransition
>   implements SingleArcTransition {
> @Override
> public void transition(RMStateStore store, RMStateStoreEvent event) {
>   if (!(event instanceof RMStateStoreAppEvent)) {
> // should never happen
> LOG.error("Illegal event type: " + event.getClass());
> return;
>   }
>   ApplicationState appState = ((RMStateStoreAppEvent) 
> event).getAppState();
>   ApplicationId appId = appState.getAppId();
>   ApplicationStateData appStateData = ApplicationStateData
>   .newInstance(appState);
>   LOG.info("Storing info for app: " + appId);
>   try {  
> store.storeApplicationStateInternal(appId, appStateData);  //store 
> the appStateData
> store.notifyApplication(new RMAppEvent(appId,
>RMAppEventType.APP_NEW_SAVED));
>   } catch (Exception e) {
> LOG.error("Error storing app: " + appId, e);
> store.notifyStoreOperationFailed(e);   //handle fail event, system 
> exit 
>   }
> };
>   }
> {code}
> The Exception log:
> {code}
>  ...
> 2016-04-20 11:26:35,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore 
> AsyncDispatcher event handler: Maxed out ZK retries. Giving up!
> 2016-04-20 11:26:35,732 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> AsyncDispatcher event handler: Error storing app: 
> application_1461061795989_17671
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:860)
> at 
> 

[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority

2016-10-25 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607338#comment-15607338
 ] 

stefanlee commented on YARN-4014:
-

[~rohithsharma] [~jianhe] Thanks for sharing this JIRA. I backported this code to hadoop-2.4.0, 
but when I ran mvn package on the project, the following errors occurred:
|[ERROR] 
/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/UpdateApplicationPriorityResponsePBImpl.java:[76,10]
 error: cannot find symbol
[ERROR] symbol:   method setApplicationPriority(PriorityProto)
[ERROR] location: variable builder of type Builder
[ERROR] 
/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/UpdateApplicationPriorityResponsePBImpl.java:[88,10]
 error: cannot find symbol
[ERROR] symbol:   method hasApplicationPriority()
[ERROR] location: variable p of type 
UpdateApplicationPriorityResponseProtoOrBuilder
[ERROR] 
/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/UpdateApplicationPriorityResponsePBImpl.java:[92,32]
 error: cannot find symbol
[ERROR] symbol:   method getApplicationPriority()
[ERROR] location: variable p of type 
UpdateApplicationPriorityResponseProtoOrBuilder
[ERROR]/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/UpdateApplicationPriorityResponsePBImpl.java:[100,13]
 error: cannot find symbol|
But I have modified ApplicationClientProtocol.java, applicationclient_protocol.proto, etc., 
and imported the related classes at the beginning:
|import 
org.apache.hadoop.yarn.proto.YarnServiceProtos.UpdateApplicationPriorityResponseProto;
import 
org.apache.hadoop.yarn.proto.YarnServiceProtos.UpdateApplicationPriorityResponseProtoOrBuilder;|
How can I solve this problem?

> Support user cli interface in for Application Priority
> --
>
> Key: YARN-4014
> URL: https://issues.apache.org/jira/browse/YARN-4014
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client, resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 
> 0002-YARN-4014.patch, 0003-YARN-4014.patch, 0004-YARN-4014.patch, 
> 0004-YARN-4014.patch
>
>
> Track the changes for user-RM client protocol i.e ApplicationClientProtocol 
> changes and discussions in this jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-4743) ResourceManager crash because TimSort

2016-09-26 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524820#comment-15524820
 ] 

stefanlee edited comment on YARN-4743 at 9/27/16 2:27 AM:
--

Thanks [~gzh1992n]. The patch fixes the exception that happens in +FairSharePolicy+, but in my 
scenario +Collections.sort(nodeIdList, nodeAvailableResourceComparator)+ throws the exception 
when decommissioning nodemanagers; +nodeAvailableResourceComparator+ only compares nodes' 
available memory.
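An illustrative workaround sketch for the comparator-contract problem described here: sort a copy of the node list against a snapshot of each node's available memory, so the values cannot change underneath TimSort. Types and names are simplified placeholders, not the actual scheduler code.
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

public class StableNodeSort {
  public static <N> List<N> sortByAvailableMemory(List<N> nodes,
      ToLongFunction<N> availableMemory) {
    // Capture the available memory of every node once, before sorting starts.
    Map<N, Long> snapshot = new HashMap<>();
    for (N node : nodes) {
      snapshot.put(node, availableMemory.applyAsLong(node));
    }
    // Sort a copy against the frozen snapshot, keeping the comparison transitive.
    List<N> copy = new ArrayList<>(nodes);
    copy.sort(Comparator.comparingLong(snapshot::get));
    return copy;
  }
}
{code}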


was (Author: imstefanlee):
thanks [~gzh1992n] ,the patch you fixed is that the exception happended in 
||FairSharePolicy||,but my scenario is that ||Collections.sort(nodeIdList, 
nodeAvailableResourceComparator)|| throws exception when decommission 
nodemanagers, ||nodeAvailableResourceComparator|| only compares node available 
memory. 

> ResourceManager crash because TimSort
> -
>
> Key: YARN-4743
> URL: https://issues.apache.org/jira/browse/YARN-4743
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Zephyr Guo
>Assignee: Yufei Gu
> Fix For: 3.0.0-alpha1
>
> Attachments: YARN-4743-v1.patch, YARN-CDH5.4.7.patch, timsort.log
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>  at java.util.TimSort.mergeHi(TimSort.java:868)
>  at java.util.TimSort.mergeAt(TimSort.java:485)
>  at java.util.TimSort.mergeCollapse(TimSort.java:410)
>  at java.util.TimSort.sort(TimSort.java:214)
>  at java.util.TimSort.sort(TimSort.java:173)
>  at java.util.Arrays.sort(Arrays.java:659)
>  at java.util.Collections.sort(Collections.java:217)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>  at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resouce}} while we are sorting 
> {{runnableApps}}.
> {code:title=FSLeafQueue.java}
> Comparator comparator = policy.getComparator();
> writeLock.lock();
> try {
>   Collections.sort(runnableApps, comparator);
> } finally {
>   writeLock.unlock();
> }
> readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ..
>   s1.getResourceUsage(), minShare1);
>   boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>   s2.getResourceUsage(), minShare2);
>   minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare1, 
> ONE).getMemory();
>   minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare2, 
> ONE).getMemory();
> ..
> {code}
> {{getResourceUsage}} will return current Resource. The current Resource is 
> unstable. 
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
> // Here the getPreemptedResources() always return zero, except in
> // a preemption round
> return Resources.subtract(getCurrentConsumption(), 
> getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>  public Resource getCurrentConsumption() {
> return currentConsumption;
>   }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ..
> Resources.addTo(currentConsumption, rmContainer.getContainer()
>   .getResource());
> ..
>   }
> {code}
> I suggest that use stable Resource in comparator.
> Is there something i 

[jira] [Commented] (YARN-4743) ResourceManager crash because TimSort

2016-09-26 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524820#comment-15524820
 ] 

stefanlee commented on YARN-4743:
-

Thanks [~gzh1992n]. The patch fixes the exception that happens in ||FairSharePolicy||, but in my 
scenario ||Collections.sort(nodeIdList, nodeAvailableResourceComparator)|| throws the exception 
when decommissioning nodemanagers; ||nodeAvailableResourceComparator|| only compares nodes' 
available memory.

> ResourceManager crash because TimSort
> -
>
> Key: YARN-4743
> URL: https://issues.apache.org/jira/browse/YARN-4743
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Zephyr Guo
>Assignee: Yufei Gu
> Fix For: 3.0.0-alpha1
>
> Attachments: YARN-4743-v1.patch, YARN-CDH5.4.7.patch, timsort.log
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>  at java.util.TimSort.mergeHi(TimSort.java:868)
>  at java.util.TimSort.mergeAt(TimSort.java:485)
>  at java.util.TimSort.mergeCollapse(TimSort.java:410)
>  at java.util.TimSort.sort(TimSort.java:214)
>  at java.util.TimSort.sort(TimSort.java:173)
>  at java.util.Arrays.sort(Arrays.java:659)
>  at java.util.Collections.sort(Collections.java:217)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>  at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resouce}} while we are sorting 
> {{runnableApps}}.
> {code:title=FSLeafQueue.java}
> Comparator comparator = policy.getComparator();
> writeLock.lock();
> try {
>   Collections.sort(runnableApps, comparator);
> } finally {
>   writeLock.unlock();
> }
> readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ..
>   s1.getResourceUsage(), minShare1);
>   boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>   s2.getResourceUsage(), minShare2);
>   minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare1, 
> ONE).getMemory();
>   minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare2, 
> ONE).getMemory();
> ..
> {code}
> {{getResourceUsage}} will return current Resource. The current Resource is 
> unstable. 
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
> // Here the getPreemptedResources() always return zero, except in
> // a preemption round
> return Resources.subtract(getCurrentConsumption(), 
> getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>  public Resource getCurrentConsumption() {
> return currentConsumption;
>   }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ..
> Resources.addTo(currentConsumption, rmContainer.getContainer()
>   .getResource());
> ..
>   }
> {code}
> I suggest that use stable Resource in comparator.
> Is there something i think wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4743) ResourceManager crash because TimSort

2016-09-22 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515206#comment-15515206
 ] 

stefanlee commented on YARN-4743:
-

Thanks. My hadoop version is 2.4.0, and I just noticed that the 
{{Collections.sort}} call has been removed from the continuousScheduling thread 
in hadoop-3.0.0. I will review that code carefully.

> ResourceManager crash because TimSort
> -
>
> Key: YARN-4743
> URL: https://issues.apache.org/jira/browse/YARN-4743
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Zephyr Guo
>Assignee: Yufei Gu
> Attachments: YARN-4743-cdh5.4.7.patch
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>  at java.util.TimSort.mergeHi(TimSort.java:868)
>  at java.util.TimSort.mergeAt(TimSort.java:485)
>  at java.util.TimSort.mergeCollapse(TimSort.java:410)
>  at java.util.TimSort.sort(TimSort.java:214)
>  at java.util.TimSort.sort(TimSort.java:173)
>  at java.util.Arrays.sort(Arrays.java:659)
>  at java.util.Collections.sort(Collections.java:217)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>  at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resouce}} while we are sorting 
> {{runnableApps}}.
> {code:title=FSLeafQueue.java}
> Comparator comparator = policy.getComparator();
> writeLock.lock();
> try {
>   Collections.sort(runnableApps, comparator);
> } finally {
>   writeLock.unlock();
> }
> readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ..
>   s1.getResourceUsage(), minShare1);
>   boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>   s2.getResourceUsage(), minShare2);
>   minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare1, 
> ONE).getMemory();
>   minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare2, 
> ONE).getMemory();
> ..
> {code}
> {{getResourceUsage}} will return current Resource. The current Resource is 
> unstable. 
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
> // Here the getPreemptedResources() always return zero, except in
> // a preemption round
> return Resources.subtract(getCurrentConsumption(), 
> getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>  public Resource getCurrentConsumption() {
> return currentConsumption;
>   }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ..
> Resources.addTo(currentConsumption, rmContainer.getContainer()
>   .getResource());
> ..
>   }
> {code}
> I suggest that use stable Resource in comparator.
> Is there something i think wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4743) ResourceManager crash because TimSort

2016-09-22 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513222#comment-15513222
 ] 

stefanlee commented on YARN-4743:
-

Thanks [~yufeigu], I also hit this problem. In my scenario I decommissioned all 
of the cluster's nodemanagers; after that the "continuous scheduling" thread 
went down, and I found the exception had been thrown from {{Collections.sort}} 
inside that thread. 

> ResourceManager crash because TimSort
> -
>
> Key: YARN-4743
> URL: https://issues.apache.org/jira/browse/YARN-4743
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Zephyr Guo
>Assignee: Yufei Gu
> Attachments: YARN-4743-cdh5.4.7.patch
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>  at java.util.TimSort.mergeHi(TimSort.java:868)
>  at java.util.TimSort.mergeAt(TimSort.java:485)
>  at java.util.TimSort.mergeCollapse(TimSort.java:410)
>  at java.util.TimSort.sort(TimSort.java:214)
>  at java.util.TimSort.sort(TimSort.java:173)
>  at java.util.Arrays.sort(Arrays.java:659)
>  at java.util.Collections.sort(Collections.java:217)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>  at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resouce}} while we are sorting 
> {{runnableApps}}.
> {code:title=FSLeafQueue.java}
> Comparator comparator = policy.getComparator();
> writeLock.lock();
> try {
>   Collections.sort(runnableApps, comparator);
> } finally {
>   writeLock.unlock();
> }
> readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ..
>   s1.getResourceUsage(), minShare1);
>   boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>   s2.getResourceUsage(), minShare2);
>   minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare1, 
> ONE).getMemory();
>   minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare2, 
> ONE).getMemory();
> ..
> {code}
> {{getResourceUsage}} will return current Resource. The current Resource is 
> unstable. 
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
> // Here the getPreemptedResources() always return zero, except in
> // a preemption round
> return Resources.subtract(getCurrentConsumption(), 
> getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>  public Resource getCurrentConsumption() {
> return currentConsumption;
>   }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ..
> Resources.addTo(currentConsumption, rmContainer.getContainer()
>   .getResource());
> ..
>   }
> {code}
> I suggest that use stable Resource in comparator.
> Is there something i think wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3250) Support admin cli interface in for Application Priority

2016-09-20 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508306#comment-15508306
 ] 

stefanlee commented on YARN-3250:
-

ok, thank you.

> Support admin cli interface in for Application Priority
> ---
>
> Key: YARN-3250
> URL: https://issues.apache.org/jira/browse/YARN-3250
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Rohith Sharma K S
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: 0001-YARN-3250-V1.patch, 0002-YARN-3250.patch, 
> 0003-YARN-3250.patch
>
>
> Current Application Priority Manager supports only configuration via file. 
> To support runtime configurations for admin cli and REST, a common management 
> interface has to be added which can be shared with NodeLabelsManager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-3250) Support admin cli interface in for Application Priority

2016-09-20 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508306#comment-15508306
 ] 

stefanlee edited comment on YARN-3250 at 9/21/16 12:59 AM:
---

thank you.


was (Author: imstefanlee):
ok,thank you。

> Support admin cli interface in for Application Priority
> ---
>
> Key: YARN-3250
> URL: https://issues.apache.org/jira/browse/YARN-3250
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Rohith Sharma K S
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: 0001-YARN-3250-V1.patch, 0002-YARN-3250.patch, 
> 0003-YARN-3250.patch
>
>
> Current Application Priority Manager supports only configuration via file. 
> To support runtime configurations for admin cli and REST, a common management 
> interface has to be added which can be shared with NodeLabelsManager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3250) Support admin cli interface in for Application Priority

2016-09-19 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502995#comment-15502995
 ] 

stefanlee commented on YARN-3250:
-

[~rohithsharma] [~sunilg] [~leftnoteasy] Thanks for sharing this patch. I have 
a question about {{ResourceManagerAdministrationProtocol}}: because of protocol 
buffers we have to modify {{yarn_server_resourcemanager_service_protos.proto}}, 
but I am not sure whether 
{{ResourceManagerAdministrationProtocolPBServiceImpl}}, 
{{RefreshClusterMaxPriorityRequestPBImpl}} and 
{{RefreshClusterMaxPriorityResponsePBImpl}} should be written by hand or are 
produced by the PB command. In other words, can 
{{RefreshClusterMaxPriorityRequestPBImpl}} be generated automatically?

> Support admin cli interface in for Application Priority
> ---
>
> Key: YARN-3250
> URL: https://issues.apache.org/jira/browse/YARN-3250
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Rohith Sharma K S
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: 0001-YARN-3250-V1.patch, 0002-YARN-3250.patch, 
> 0003-YARN-3250.patch
>
>
> Current Application Priority Manager supports only configuration via file. 
> To support runtime configurations for admin cli and REST, a common management 
> interface has to be added which can be shared with NodeLabelsManager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue

2016-08-31 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451780#comment-15451780
 ] 

stefanlee commented on YARN-1963:
-

[~sunilg] Thanks for this jira. I have a doubt about the phrase "creating 
multiple queues and making users submit applications to higher priority and 
lower priority queues separately" in your doc. Does it mean that we create 
multiple queues, e.g. queue A and queue B, label A the higher-priority queue 
and B the lower-priority queue, and then a user submits higher-priority 
applications to A? Looking at your code, it seems instead that a user can 
submit applications with different priorities to the same queue, where the 
queue has a default priority and the cluster has a max priority.
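
With the approach in this jira, priority is a per-application attribute set at 
submission time within a single queue, not a property of separate queues. A 
hedged sketch using the public client API follows; {{YarnClient}}, 
{{ApplicationSubmissionContext}} and {{Priority}} are existing APIs, while the 
within-queue ordering itself depends on the scheduler-side patches.

{code:title=SubmitWithPriority.java (sketch)}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitWithPriority {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setQueue("default");                  // same queue for every priority
    ctx.setPriority(Priority.newInstance(5)); // capped by the cluster max priority

    // ... set the AM container spec, resource, application name, etc., then:
    // yarnClient.submitApplication(ctx);
    yarnClient.stop();
  }
}
{code}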

> Support priorities across applications within the same queue 
> -
>
> Key: YARN-1963
> URL: https://issues.apache.org/jira/browse/YARN-1963
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api, resourcemanager
>Reporter: Arun C Murthy
>Assignee: Sunil G
> Attachments: 0001-YARN-1963-prototype.patch, YARN Application 
> Priorities Design _02.pdf, YARN Application Priorities Design.pdf, YARN 
> Application Priorities Design_01.pdf
>
>
> It will be very useful to support priorities among applications within the 
> same queue, particularly in production scenarios. It allows for finer-grained 
> controls without having to force admins to create a multitude of queues, plus 
> allows existing applications to continue using existing queues which are 
> usually part of institutional memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4995) FairScheduler: Display per-queue demand on the scheduler page

2016-08-18 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426123#comment-15426123
 ] 

stefanlee commented on YARN-4995:
-

[~xupeng] Thanks for this jira. I have a question: the demand resource you 
mention can exceed the max resource by a lot. IMO the demand resource is the 
used resource plus the requested resource, and because of node and rack 
locality a task's request is replicated across several resource names, e.g. 
<20,"node1","memory:1G,cpu:1",1,true>, <20,"node2","memory:1G,cpu:1",1,true>, 
<20,"rack1","memory:1G,cpu:1",1,true>, <20,"rack2","memory:1G,cpu:1",1,true>, 
<20,"*","memory:1G,cpu:1",1,true>. An MR application may have 20 maps and only 
need 20*<memory:1G,cpu:1>, but the demand resource would then reach 
5*20*<memory:1G,cpu:1>, so the displayed demand cannot clearly help the user 
tune a queue's min/max resources. Is my understanding correct?
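
A back-of-the-envelope version of the concern above, purely as an illustration: 
if every locality copy of the same ask were summed, a 20-map job with 1 GB maps 
would show 100 GB of demand instead of 20 GB. Whether the live code sums every 
resource name or only the {{ResourceRequest.ANY}} entry is worth verifying in 
{{FSAppAttempt#updateDemand}} for the version in question.

{code:title=DemandEstimate.java (illustration only)}
public class DemandEstimate {
  public static void main(String[] args) {
    int maps = 20;
    int mbPerMap = 1024;
    int resourceNames = 5; // node1, node2, rack1, rack2, ANY ("*")

    int neededMB = maps * mbPerMap;                   // what the job really needs
    int summedOverNamesMB = neededMB * resourceNames; // if every locality copy were added

    System.out.println("needed = " + neededMB + " MB, "
        + "summed over resource names = " + summedOverNamesMB + " MB");
  }
}
{code}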

> FairScheduler: Display per-queue demand on the scheduler page
> -
>
> Key: YARN-4995
> URL: https://issues.apache.org/jira/browse/YARN-4995
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: xupeng
>Assignee: xupeng
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-4995.001.patch, YARN-4995.002.patch, 
> demo_screenshot.png
>
>
> For now there is no demand resource information for queues on the scheduler 
> page. 
> Just using used resource information, it is hard for us to judge whether the 
> queue is needy (demand > used , but cluster has no available resource). And 
> without demand resource information, modifying min/max resource for queue is 
> not accurate. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2098) App priority support in Fair Scheduler

2016-08-08 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412839#comment-15412839
 ] 

stefanlee commented on YARN-2098:
-

[~ywskycn] What is the current status of this jira?

> App priority support in Fair Scheduler
> --
>
> Key: YARN-2098
> URL: https://issues.apache.org/jira/browse/YARN-2098
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Affects Versions: 2.5.0
>Reporter: Ashwin Shankar
>Assignee: Wei Yan
> Attachments: YARN-2098.patch, YARN-2098.patch
>
>
> This jira is created for supporting app priorities in fair scheduler. 
> AppSchedulable hard codes priority of apps to 1,we should
> change this to get priority from ApplicationSubmissionContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4048) Linux kernel panic under strict CPU limits

2016-07-25 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392949#comment-15392949
 ] 

stefanlee commented on YARN-4048:
-

I have the same problem; my hadoop version is 2.6.2.

> Linux kernel panic under strict CPU limits
> --
>
> Key: YARN-4048
> URL: https://issues.apache.org/jira/browse/YARN-4048
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Chengbing Liu
>Priority: Critical
> Attachments: panic.png
>
>
> With YARN-2440 and YARN-2531, we have seen some kernel panics happening under 
> heavy pressure. Even with YARN-2809, it still panics.
> We are using CentOS 6.5, hadoop 2.5.0-cdh5.2.0 with the above patches. I 
> guess the latest version also has the same issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2497) Changes for fair scheduler to support allocate resource respect labels

2016-06-27 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352287#comment-15352287
 ] 

stefanlee commented on YARN-2497:
-

Version 2.6 does not support it.

> Changes for fair scheduler to support allocate resource respect labels
> --
>
> Key: YARN-2497
> URL: https://issues.apache.org/jira/browse/YARN-2497
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Wangda Tan
>Assignee: Naganarasimha G R
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4022) queue not remove from webpage(/cluster/scheduler) when delete queue in xxx-scheduler.xml

2015-11-30 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032978#comment-15032978
 ] 

stefanlee commented on YARN-4022:
-

Hi forrestchen, my cluster configuration is similar to yours. When I delete a 
queue in fair-scheduler.xml and submit an application to the deleted queue, the 
application runs in the 'root.default' queue instead; submitting to a queue 
that never existed also runs in 'root.default' with no exception. Why? My 
hadoop version is 2.4.0.
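
The silent fallback to root.default is driven by whichever queue placement 
rules are in effect; with a policy that ends in a reject rule, a submission 
naming an undeclared (or deleted) queue is refused instead of being re-routed. 
A sketch of such a fair-scheduler.xml is below; the queue name is made up, and 
rule support should be verified against the FairScheduler documentation for the 
Hadoop version actually deployed.

{code:title=fair-scheduler.xml (sketch)}
<allocations>
  <queue name="prod">
    <minResources>1024 mb,1 vcores</minResources>
  </queue>

  <!-- Use the queue named in the submission only if it was declared above;
       otherwise reject the application instead of running it in root.default. -->
  <queuePlacementPolicy>
    <rule name="specified" create="false"/>
    <rule name="reject"/>
  </queuePlacementPolicy>
</allocations>
{code}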

> queue not remove from webpage(/cluster/scheduler) when delete queue in 
> xxx-scheduler.xml
> 
>
> Key: YARN-4022
> URL: https://issues.apache.org/jira/browse/YARN-4022
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: forrestchen
>  Labels: scheduler
> Attachments: YARN-4022.001.patch, YARN-4022.002.patch, 
> YARN-4022.003.patch, YARN-4022.004.patch
>
>
> When I delete an existing queue by modify the xxx-schedule.xml, I can still 
> see the queue information block in webpage(/cluster/scheduler) though the 
> 'Min Resources' items all become to zero and have no item of 'Max Running 
> Applications'.
> I can still submit an application to the deleted queue and the application 
> will run using 'root.default' queue instead, but submit to an un-exist queue 
> will cause an exception.
> My expectation is the deleted queue will not displayed in webpage and submit 
> application to the deleted queue will act just like the queue doesn't exist.
> PS: There's no application running in the queue I delete.
> Some related config in yarn-site.xml:
> {code}
> 
> yarn.scheduler.fair.user-as-default-queue
> false
> 
> 
> yarn.scheduler.fair.allow-undeclared-pools
> false
> 
> {code}
> a related question is here: 
> http://stackoverflow.com/questions/26488564/hadoop-yarn-why-the-queue-cannot-be-deleted-after-i-revise-my-fair-scheduler-xm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling thread when we lose a node

2015-08-09 Thread StefanLee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14679170#comment-14679170
 ] 

StefanLee commented on YARN-2273:
-

Hi Wei Yan, thank you for taking this JIRA. After this NPE, why were containers 
no longer assigned? I dumped the RM's jstack output and found no deadlock. 
Thank you.

 NPE in ContinuousScheduling thread when we lose a node
 --

 Key: YARN-2273
 URL: https://issues.apache.org/jira/browse/YARN-2273
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, resourcemanager
Affects Versions: 2.3.0, 2.4.1
 Environment: cdh5.0.2 wheezy
Reporter: Andy Skelton
Assignee: Wei Yan
 Fix For: 2.6.0

 Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, 
 YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch


 One DN experienced memory errors and entered a cycle of rebooting and 
 rejoining the cluster. After the second time the node went away, the RM 
 produced this:
 {code}
 2014-07-09 21:47:36,571 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Application attempt appattempt_1404858438119_4352_01 released container 
 container_1404858438119_4352_01_04 on node: host: 
 node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 
 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL
 2014-07-09 21:47:36,571 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: 
 memory:335872, vCores:328
 2014-07-09 21:47:36,571 ERROR 
 org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
 Thread[ContinuousScheduling,5,main] threw an Exception.
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040)
   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
   at java.util.TimSort.sort(TimSort.java:203)
   at java.util.TimSort.sort(TimSort.java:173)
   at java.util.Arrays.sort(Arrays.java:659)
   at java.util.Collections.sort(Collections.java:217)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 A few cycles later YARN was crippled. The RM was running and jobs could be 
 submitted but containers were not assigned and no progress was made. 
 Restarting the RM resolved it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)