[jira] [Updated] (YARN-2194) Add Cgroup support for RedHat 7

2015-01-14 Thread bc Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bc Wong updated YARN-2194:
--
Description: In previous versions of RedHat, we could build custom cgroup 
hierarchies using the cgconfig command from the libcgroup package. Starting with 
RedHat 7, the libcgroup package is deprecated and its use is discouraged, since 
it can easily create conflicts with the default cgroup hierarchy. systemd is 
provided and recommended for cgroup management. We need to add support for 
this.  (was: In previous versions of RedHat, we could build custom cgroup 
hierarchies using the cgconfig command from the libcgroup package. Starting with 
RedHat 7, the libcgroup package is deprecated and its use is discouraged, since 
it can easily create conflicts with the default cgroup hierarchy. systemd is 
provided and recommended for cgroup management. We need to add support for 
this.)

 Add Cgroup support for RedHat 7
 ---

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-2194-1.patch


In previous versions of RedHat, we could build custom cgroup hierarchies 
 using the cgconfig command from the libcgroup package. Starting with RedHat 7, 
 the libcgroup package is deprecated and its use is discouraged, since it 
 can easily create conflicts with the default cgroup hierarchy. systemd is 
 provided and recommended for cgroup management. We need to add support for 
 this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner

2014-12-08 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238734#comment-14238734
 ] 

bc Wong commented on YARN-2931:
---

Thanks for the fix! Some nits. ResourceLocalizationService.java
* Instead of commenting out code, would just remove it.

TestResourceLocalizationService.java
* L950: Remove the code that is commented out.

 PublicLocalizer may fail with FileNotFoundException until directory gets 
 initialized by LocalizeRunner
 --

 Key: YARN-2931
 URL: https://issues.apache.org/jira/browse/YARN-2931
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
 Attachments: YARN-2931.001.patch, YARN-2931.002.patch, 
 YARN-2931.002.patch


 When the data directory is cleaned up and NM is started with existing 
 recovery state, because of YARN-90, it will not recreate the local dirs.
 This causes a PublicLocalizer to fail until getInitializedLocalDirs is called 
 by some LocalizeRunner doing private localization.
 Example error 
 {noformat}
 2014-12-02 22:57:32,629 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Failed to download rsrc { { hdfs:/blah 
 machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml,
  1417589819618, FILE, null 
 },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING}
 java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514)
   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162)
   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720)
   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720)
   at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 2014-12-02 22:57:32,629 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
  Container container_1417589109512_0001_02_03 transitioned from 
 LOCALIZING to LOCALIZATION_FAILED
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2669) FairScheduler: queueName shouldn't allow periods the allocation.xml

2014-10-28 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186448#comment-14186448
 ] 

bc Wong commented on YARN-2669:
---

Replacing . with \_dot\_ sounds fine here. While it doesn't eliminate 
collisions, it makes them unlikely. Again, I'd leave it for another patch to do 
the real fix, which is more involved.

 FairScheduler: queueName shouldn't allow periods the allocation.xml
 ---

 Key: YARN-2669
 URL: https://issues.apache.org/jira/browse/YARN-2669
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Minor
 Attachments: YARN-2669-1.patch, YARN-2669-2.patch, YARN-2669-3.patch


 For an allocation file like:
 {noformat}
 <allocations>
   <queue name="root.q1">
     <minResources>4096mb,4vcores</minResources>
   </queue>
 </allocations>
 {noformat}
 Users may wish to configure minResources for a queue with the full path root.q1. 
 However, right now, the fair scheduler will treat this configuration as belonging to 
 the queue with the full name root.root.q1. We need to print out a warning msg to 
 notify users about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Add Cgroup support for RedHat 7

2014-10-27 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185368#comment-14185368
 ] 

bc Wong commented on YARN-2194:
---

container-executor.c
* L1188: If initialize_user() fails, do you not need to cleanup?
* L1194: Same for create_log_dirs(). Seems that goto cleanup is still warranted.
* L1207: Missing space before S_IRWXU.
* L1243: Nit. Hardcoding 55 here is error-prone. You could allocate a 4K buffer 
here, and use snprintf.
* L1244: You need to check the return value from malloc(). Since you're running 
as root here, everything has to be extra careful.
* L1255: On failure, would log the command being executed.


 Add Cgroup support for RedHat 7
 ---

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-2194-1.patch


 In previous versions of RedHat, we could build custom cgroup hierarchies using 
 the cgconfig command from the libcgroup package. Starting with RedHat 7, the 
 libcgroup package is deprecated and its use is discouraged, since it 
 can easily create conflicts with the default cgroup hierarchy. systemd is 
 provided and recommended for cgroup management. We need to add support for 
 this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2669) FairScheduler: queueName shouldn't allow periods the allocation.xml

2014-10-26 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184419#comment-14184419
 ] 

bc Wong commented on YARN-2669:
---

AllocationFileLoaderService.java
* When throwing an error, would also output the offending queueName.

QueuePlacementRule.java
* Would log when you convert the username to one that doesn't have a queue.
* I'm worried about username conflicts after the conversion, e.g. eric.koffee 
== erick.offee. Replacing the dot with something else helps, but doesn't 
eliminate the problem. You'd need some escaping rule, like replacing any 
naturally occurring single underscore with two underscores, and then replacing 
a dot with a single underscore.
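
Roughly what I have in mind for that escaping rule, as a sketch (the method name is 
made up, not anything in the patch):

{code}
// Sketch of the escaping rule described above: double any naturally occurring
// underscore first, then map the dot to a single underscore. This keeps a real
// "eric_koffee" and a dotted "eric.koffee" from colliding after conversion.
public static String escapeUserForQueueName(String user) {
  // "a_b"   -> "a__b"   (natural underscores are doubled)
  // "a.b"   -> "a_b"    (dots become a single underscore)
  // "a_b.c" -> "a__b_c"
  return user.replace("_", "__").replace(".", "_");
}
{code}

Whether that is worth the complexity is exactly the discussion to have in the 
follow-up jira.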



 FairScheduler: queueName shouldn't allow periods the allocation.xml
 ---

 Key: YARN-2669
 URL: https://issues.apache.org/jira/browse/YARN-2669
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Minor
 Attachments: YARN-2669-1.patch, YARN-2669-2.patch


 For an allocation file like:
 {noformat}
 <allocations>
   <queue name="root.q1">
     <minResources>4096mb,4vcores</minResources>
   </queue>
 </allocations>
 {noformat}
 Users may wish to configure minResources for a queue with the full path root.q1. 
 However, right now, the fair scheduler will treat this configuration as belonging to 
 the queue with the full name root.root.q1. We need to print out a warning msg to 
 notify users about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2669) FairScheduler: queueName shouldn't allow periods the allocation.xml

2014-10-26 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184540#comment-14184540
 ] 

bc Wong commented on YARN-2669:
---

To qualify what I wrote:

bq. You'd need some escaping rule, like replacing any naturally occurring 
single underscore with two underscores, and then replacing a dot with a single 
underscore.

That seems to be out of scope here, and could use more discussion and feedback.

 FairScheduler: queueName shouldn't allow periods the allocation.xml
 ---

 Key: YARN-2669
 URL: https://issues.apache.org/jira/browse/YARN-2669
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Minor
 Attachments: YARN-2669-1.patch, YARN-2669-2.patch


 For an allocation file like:
 {noformat}
 <allocations>
   <queue name="root.q1">
     <minResources>4096mb,4vcores</minResources>
   </queue>
 </allocations>
 {noformat}
 Users may wish to configure minResources for a queue with the full path root.q1. 
 However, right now, the fair scheduler will treat this configuration as belonging to 
 the queue with the full name root.root.q1. We need to print out a warning msg to 
 notify users about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2669) FairScheduler: queueName shouldn't allow periods the allocation.xml

2014-10-19 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176327#comment-14176327
 ] 

bc Wong commented on YARN-2669:
---

Thanks for the patch, Wei! What if the username has a period in it, and FS is 
configured to take the username as queue name? Is there a separate jira 
tracking that?

 FairScheduler: queueName shouldn't allow periods the allocation.xml
 ---

 Key: YARN-2669
 URL: https://issues.apache.org/jira/browse/YARN-2669
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Minor
 Attachments: YARN-2669-1.patch


 For an allocation file like:
 {noformat}
 <allocations>
   <queue name="root.q1">
     <minResources>4096mb,4vcores</minResources>
   </queue>
 </allocations>
 {noformat}
 Users may wish to configure minResources for a queue with the full path root.q1. 
 However, right now, the fair scheduler will treat this configuration as belonging to 
 the queue with the full name root.root.q1. We need to print out a warning msg to 
 notify users about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly

2014-09-25 Thread bc Wong (JIRA)
bc Wong created YARN-2605:
-

 Summary: [RM HA] Rest api endpoints doing redirect incorrectly
 Key: YARN-2605
 URL: https://issues.apache.org/jira/browse/YARN-2605
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: bc Wong


The standby RM's webui tries to do a redirect via meta-refresh. That is fine 
for pages designed to be viewed by web browsers. But the API endpoints 
shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd 
suggest HTTP 303, or returning a well-defined error message (json or xml) stating 
the standby status and including a link to the active RM.

The standby RM is returning this today:
{noformat}
$ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Thu, 25 Sep 2014 18:34:53 GMT
Date: Thu, 25 Sep 2014 18:34:53 GMT
Pragma: no-cache
Expires: Thu, 25 Sep 2014 18:34:53 GMT
Date: Thu, 25 Sep 2014 18:34:53 GMT
Pragma: no-cache
Content-Type: text/plain; charset=UTF-8
Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics
Content-Length: 117
Server: Jetty(6.1.26)

This is standby RM. Redirecting to the current active RM: 
http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics
{noformat}
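
To make the 303 suggestion concrete, here is a rough servlet-filter style sketch 
(the class name, the /ws/ prefix check, and how the active RM address is discovered 
are all made up, not the actual RM web setup):

{code}
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

// Hypothetical filter sketch: while this RM is in standby, answer REST calls
// under /ws/ with 303 See Other plus a Location header pointing at the active
// RM, instead of a meta-refresh page.
public class StandbyRedirectFilter implements Filter {
  private volatile String activeRmWebAddress; // e.g. "http://active-rm:8088", discovered elsewhere (assumed)
  private volatile boolean standby;           // updated by HA state transitions (assumed)

  @Override public void init(FilterConfig cfg) {}
  @Override public void destroy() {}

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest httpReq = (HttpServletRequest) req;
    HttpServletResponse httpRes = (HttpServletResponse) res;
    if (standby && httpReq.getRequestURI().startsWith("/ws/")) {
      // 303 is understood by programmatic HTTP clients; no meta-refresh needed.
      httpRes.setStatus(HttpServletResponse.SC_SEE_OTHER);
      httpRes.setHeader("Location", activeRmWebAddress + httpReq.getRequestURI());
      return;
    }
    chain.doFilter(req, res);
  }
}
{code}

The browser-facing pages can keep the meta-refresh behaviour; only the API paths need 
the status-code treatment.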



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-09-22 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143401#comment-14143401
 ] 

bc Wong commented on YARN-1530:
---

Hi [~zjshen]. First, glad to see that we're discussing approaches. You seem to 
agree with the premise that *ATS write path should not slow down apps*.

bq. Therefore, is making the timeline server reliable (or always-up) the 
essential solution? If the timeline server is reliable, ...

In theory, you can make the ATS *always-up*. In practice, we both know what 
real life distributed systems do. Always-up isn't the only thing. The write 
path needs to have good uptime and latency regardless of what's happening to 
the read path or the backing store.

HDFS is a good default for the write channel because:
* We don't have to design an ATS that is always-up. If you really want to, I'm 
sure you can eventually build something with good uptime. But it took other 
projects (HDFS, ZK) lots of hard work to get to that point.
* If we reuse HDFS, cluster admins know how to operate HDFS and get good uptime 
from it. But it'll take training and hard-learned lessons for operators to 
figure out how to get good uptime from ATS, even after you build an always-up 
ATS.
* All the popular YARN app frameworks (MR, Spark, etc.) already rely on HDFS by 
default. So do most of the 3rd party applications that I know of. 
Architecturally, it seems easier for admins to accept that ATS write path 
depends on HDFS for reliability, instead of a new component that (we claim) 
will be as reliable as HDFS/ZK.

bq. given the whole roadmap of the timeline service, let's think critically of 
work that can improve the timeline service most significantly.

Exactly. Strong +1. If we can drop the high uptime + low write latency 
requirement from the ATS service, we can avoid tons of effort. ATS doesn't need 
to be as reliable as HDFS. We don't need to worry about insulating the write 
path from the read path. We don't need to worry about occasional hiccups in 
HBase (or whatever the store is). And at the end of all this, everybody sleeps 
better because ATS service going down isn't a catastrophic failure.

 [Umbrella] Store, manage and serve per-framework application-timeline data
 --

 Key: YARN-1530
 URL: https://issues.apache.org/jira/browse/YARN-1530
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
 Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, 
 ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, 
 application timeline design-20140116.pdf, application timeline 
 design-20140130.pdf, application timeline design-20140210.pdf


 This is a sibling JIRA for YARN-321.
 Today, each application/framework has to store and serve per-framework 
 data all by itself, as YARN doesn't have a common solution. This JIRA attempts 
 to solve the storage, management and serving of per-framework data from 
 various applications, both running and finished. The aim is to change YARN to 
 collect and store data in a generic manner with plugin points for frameworks 
 to do their own thing w.r.t. interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-09-14 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133483#comment-14133483
 ] 

bc Wong commented on YARN-1530:
---

Hi [~zjshen]. My main concern with the write path is: *Does the ATS write path 
have the right reliability, robustness and scalability so that its failures 
would not affect my apps?* I'll try to explain it with specific scenarios and 
technology choices. Then maybe you can tell me if those are valid concerns.

First, to make it easy for other readers here, I'm advocating that this event 
flow:\\
_Client/App -> Reliable channel where event is persisted (HDFS/Kafka) -> ATS_ \\
is a lot better than:\\
_Client/App -> RPC -> ATS_
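
To make the first flow concrete, a bare-bones sketch of what the app side of an 
HDFS-backed channel could look like (the class name and path layout are invented for 
illustration; this is not an existing YARN API):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical illustration of the "reliable channel" flow: the app writes
// events durably to HDFS; the ATS reads and indexes them out of band, so an
// ATS outage never blocks the app.
public class HdfsEventChannel {
  private final FSDataOutputStream out;

  public HdfsEventChannel(Configuration conf, String appId) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // One event file per application; the directory name is made up.
    Path file = new Path("/ats/events/" + appId + ".jsonl");
    this.out = fs.create(file, true);
  }

  // Append one serialized (e.g. JSON) event and flush it to the pipeline.
  public synchronized void append(String eventJson) throws IOException {
    out.write((eventJson + "\n").getBytes("UTF-8"));
    out.hflush(); // visible to readers; hsync() would be stronger
  }

  public void close() throws IOException {
    out.close();
  }
}
{code}

The ATS then tails these files out of band and indexes them into whatever store it 
likes; an ATS or backing-store outage delays indexing but never blocks the app.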

h4. Scenario 1. ATS service goes down
If we use a reliable channel (e.g. HDFS) for writes, then apps do not suffer at 
all even when the ATS goes down. The ATS service going down is a valid 
scenario, due to causes ranging from bugs to hardware failures. Having the write 
path decoupled from the ATS service being up all the time seems a clear win to 
me. Writing decoupled components is also a good distributed systems design 
principle.

On the other hand, one may argue that _the ATS service will never go down 
entirely, or is not supposed to go down entirely_, just like we don't expect 
all the ZK nodes or all the RM nodes to go down. That argument then justifies 
using direct RPC for writes. Yes, you can design such an ATS service. To this 
I'll say:

* YARN apps already depend on ZK/RM/HDFS being up. Every new service dependency 
we add will only increase the chances of YARN apps failing or slowing down. 
That's true even if the ATS service's uptime is as good as ZK or RM.
* Realistically, getting the ATS service's uptime to the same level as ZK or 
HDFS is a long and winding road. Especially when most discussions here assume 
HBase as the backing store. HBase's uptime is lower than HDFS/ZK/RM because 
it's more complex to operate. If HBase going down means ATS service going down, 
then we certainly should guard against this failure scenario.

h4. Scenario 2. ATS service partially down
If the client writes directly to the ATS service using an unreliable channel 
(RPC), then the write path will do failover if one of the ATS nodes fails. This 
transient failure still affects the performance of YARN apps. One can argue 
that _non-blocking RPC writes resolve this issue_. To this I'll say:

* Non-blocking RPC writes only work for *long-duration apps*. We already have 
short-lived applications, in the range of a few minutes. With Spark getting 
more popular, this will continue to happen. How short will the app duration 
get? The answer is a few seconds, if we want YARN to be the generic cluster 
scheduler. Google already sees that kind of job profile, if you look at their 
cluster traces. Of course, our scheduler and container allocation needs to get 
a lot better for that to happen. But I think that's the goal. Our ATS design 
here should consider short-lived applications.
* It sucks if you're running an app that's supposed to finish under a minute, 
but then the ATS writes are stalled for an extra minute because one ATS node 
does a failover. Again, we can go back to the counter-argument in scenario #1, 
about how unlikely this is. I'll repeat that it's more likely than we think. 
And if we have a choice to decouple the write path from the ATS service, why 
not?

h4. Scenario 3. ATS backing store fails
By backing store, I mean the storage system where ATS persists the events, such 
as LevelDB and HBase. In a naive implementation, it seems that if the backing 
store fails, then the ATS service will be unavailable. Does that mean the event 
write path will fail, and the YARN apps will stall or fail? I hope not. It's 
not an issue if we use HDFS as the default write channel, because most YARN 
apps already depend on HDFS.

One may argue that _the ATS service will buffer writes (persist them elsewhere) 
if the backing store fails_. To this I'll say:

* If we have an alternate code path to persist events first before they hit the 
final backing store, why not do that all the time? Such a path will address 
scenarios #1 and #2 as well.
* HBase has been mentioned as if it's the penicillin of event storage here. 
That is probably true for big shops like Twitter and Yahoo, who have the 
expertise to operate an HBase cluster well. But most enterprise users or 
startups don't. We should assume that those HBase instances will run 
suboptimally with occasional widespread failures. Using HBase for event storage 
is a poor fit for most people. And I think it's difficult to achieve good 
uptime for the ATS service as a whole.

 [Umbrella] Store, manage and serve per-framework application-timeline data
 --

 Key: YARN-1530
 URL: https://issues.apache.org/jira/browse/YARN-1530
 

[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-09-09 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127403#comment-14127403
 ] 

bc Wong commented on YARN-1530:
---

bq. The current writing channel allows the data to be available on the timeline 
server immediately

Let's have reliability before speed. I think one of the requirements of ATS is: 
*The channel for writing events should be reliable.*

I'm using *reliable* here in a strong sense, not the TCP-best-effort style 
reliability. HDFS is reliable. Kafka is reliable. (They are also scalable and 
robust.) A normal RPC connection is not. I don't want the ATS to be able to 
slow down my writes, and therefore, my applications, at all. For example, an 
ATS failover shouldn't pause all my applications for N seconds. A direct RPC to 
the ATS for writing seems a poor choice in general.

Yes, you could make a distributed, reliable, scalable ATS service to accept 
event writes. But that seems like a lot of work, while we can leverage existing 
technologies.

If the channel itself is pluggable, then we have lots of options. Kafka is a 
very good choice, for sites that already deploy Kafka and know how to operate 
it. Using HDFS as a channel is also a good default implementation, since people 
already know how to scale and manage HDFS. Embedding a Kafka broker with each 
ATS daemon is also an option, if we're ok with that dependency.
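
For reference, the plug point could be as small as something like this (the interface 
name is hypothetical):

{code}
// Hypothetical plug point for the event write channel. The implementation
// (HDFS file, Kafka topic, direct RPC, ...) would be chosen via configuration;
// apps and frameworks only ever see this interface.
public interface TimelineEventChannel extends java.io.Closeable {
  /** Durably hand off one serialized event; must not block on ATS availability. */
  void append(byte[] serializedEvent) throws java.io.IOException;

  /** Flush anything buffered so an out-of-band reader can pick it up. */
  void flush() throws java.io.IOException;
}
{code}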

 [Umbrella] Store, manage and serve per-framework application-timeline data
 --

 Key: YARN-1530
 URL: https://issues.apache.org/jira/browse/YARN-1530
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
 Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, 
 ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, 
 application timeline design-20140116.pdf, application timeline 
 design-20140130.pdf, application timeline design-20140210.pdf


 This is a sibling JIRA for YARN-321.
 Today, each application/framework has to store and serve per-framework 
 data all by itself, as YARN doesn't have a common solution. This JIRA attempts 
 to solve the storage, management and serving of per-framework data from 
 various applications, both running and finished. The aim is to change YARN to 
 collect and store data in a generic manner with plugin points for frameworks 
 to do their own thing w.r.t. interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt

2014-09-02 Thread bc Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bc Wong updated YARN-596:
-
Description: 
  In the fair scheduler, containers are chosen for preemption in the following 
way:
All containers for all apps that are in queues that are over their fair share 
are put in a list.
The list is sorted in order of the priority that the container was requested in.

This means that an application can shield itself from preemption by requesting 
its containers at higher priorities, which doesn't really make sense.

Also, an application that is not over its fair share, but that is in a queue 
that is over its fair share is just as likely to have containers preempted as 
an application that is over its fair share.

  was:
In the fair scheduler, containers are chosen for preemption in the following 
way:
All containers for all apps that are in queues that are over their fair share 
are put in a list.
The list is sorted in order of the priority that the container was requested in.

This means that an application can shield itself from preemption by requesting 
its containers at higher priorities, which doesn't really make sense.

Also, an application that is not over its fair share, but that is in a queue 
that is over its fair share is just as likely to have containers preempted as 
an application that is over its fair share.


 Use scheduling policies throughout the queue hierarchy to decide which 
 containers to preempt
 

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Wei Yan
 Fix For: 2.5.0

 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


   In the fair scheduler, containers are chosen for preemption in the 
 following way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted in order of the priority that the container was requested 
 in.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-07 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053457#comment-14053457
 ] 

bc Wong commented on YARN-796:
--

[~yufeldman]  [~sdaingade], just read your proposal 
(LabelBasedScheduling.pdf). A few comments:

1. *Would let each node report its own labels.* The current proposal specifies 
the node-label mapping in a centralized file. This seems operationally 
unfriendly, as the file is hard to maintain.
* You need to get the DNS name right, which could be hard for a multi-homed 
setup.
* The proposal uses regexes on FQDN, such as {{perfnode.*}}. This may work if 
the hostnames are set up by IT like that. But in reality, I've seen lots of 
sites where the FQDN is like {{stmp09wk0013.foobar.com}}, where stmp refers 
to the data center, and wk0013 refers to worker 13, and other weird stuff 
like that. Now imagine a centralized node-label mapping file with 2000 
nodes with such names. It'd be a nightmare.

Instead, each node can supply its own labels, via 
{{yarn.nodemanager.node.labels}} (which specifies labels directly) or 
{{yarn.nodemanager.node.labelFile}} (which points to a file that has a single 
line containing all the labels). It's easy to generate the label file for each 
node. The admin can have puppet push it out, or populate it when the VM is 
built, or compute it in a local script by inspecting /proc. (Oh I have 192GB, 
so add the label largeMem.) There is little room for mistake.
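
For illustration, the per-node discovery step could be as simple as this kind of 
helper (the /proc rule, the label name, and the output path are all made up):

{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical label-discovery helper: inspect the local machine and emit a
// one-line label file for the NM to pick up (per the proposal above).
public class NodeLabelDiscovery {
  public static void main(String[] args) throws IOException {
    List<String> labels = new ArrayList<>();

    // Example rule: large-memory nodes. MemTotal is reported in kB.
    for (String line : Files.readAllLines(Paths.get("/proc/meminfo"), StandardCharsets.UTF_8)) {
      if (line.startsWith("MemTotal:")) {
        long kb = Long.parseLong(line.replaceAll("[^0-9]", ""));
        if (kb > 128L * 1024 * 1024) {   // more than 128 GB of RAM
          labels.add("largeMem");
        }
      }
    }

    // Single line with all labels, as proposed for yarn.nodemanager.node.labelFile.
    Files.write(Paths.get("/etc/yarn/node.labels"),
        (String.join(",", labels) + "\n").getBytes(StandardCharsets.UTF_8));
  }
}
{code}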

The NM can still periodically refresh its own labels, and update the RM via 
the heartbeat mechanism. The RM should also expose a node label report, which 
is the real-time information of all nodes and their labels.

2. *Labels are per-container, not per-app. Right?* The doc keeps mentioning 
application label, ApplicationLabelExpression, etc. Should those be 
container label instead? I just want to confirm that each container request 
can carry its own label expression. Example use case: Only the mappers need 
GPU, not the reducers.

3. *Can we fail container requests with no satisfying nodes?* In 
Considerations, #5, you wrote that the app would be in waiting state. Seems 
that a fail-fast behaviour would be better. If no node can satisfy the label 
expression, then it's better to tell the client no. Very likely somebody made 
a typo somewhere.



 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: LabelBasedScheduling.pdf, YARN-796.patch


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-07-01 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049414#comment-14049414
 ] 

bc Wong commented on YARN-941:
--

I'm fine with [~xgong]'s solution. I'd still like to see something more generic 
to make tokens (HDFS token, HBase token, etc) work with long running apps 
though. Perhaps I'll pursue the arbitrary expiration time approach in another 
jira.

{quote}
RPC privacy is a very expensive solution for AM-RM communication. First, it 
needs setup so AM/RM have access to key infrastructure - having this burden on 
all applications is not reasonable. This is compounded by the fact that we use 
AMRMTokens in non-secure mode too. Second, AM - RM communication is a very 
chatty protocol, it's likely the overhead is huge..
{quote}

True security is often costly. The web/consumer industry went through the same 
exercise with HTTP vs HTTPS. You can get at least 10x better performance with 
HTTP. But in the end, everybody decided that it's worth it. And passing tokens 
around without RPC privacy is just like sending passwords around on HTTP 
without SSL.

{quote}
Unfortunately with long running services (the focus of this JIRA), this attack 
and its success is not as unlikely. This is the very reason why we roll 
master-keys every so often in the first place.
{quote}

With the rolling master key, it's unlikely for the attacker to gather enough 
ciphertext to mount that attack. Besides, a longer key would require so much 
computation to attack that it'd be infeasible.

Anyway, appreciate your response, and I'll follow up in another jira.

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
 Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, 
 YARN-941.preview.4.patch, YARN-941.preview.patch


 When an application is submitted to the RM it includes with it a set of 
 tokens that the RM will renew on behalf of the application, that will be 
 passed to the AM when the application is launched, and will be used when 
 launching the application to access HDFS to download files on behalf of the 
 application.
 For long lived applications/services these tokens can expire, and then the 
 tokens that the AM has will be invalid, and the tokens that the RM had will 
 also not work to launch a new AM.
 We need to provide an API that will allow the RM to replace the current 
 tokens for this application with a new set.  To avoid any real race issues, I 
 think this API should be something that the AM calls, so that the client can 
 connect to the AM with a new set of tokens it got using kerberos, then the AM 
 can inform the RM of the new set of tokens and quickly update its tokens 
 internally to use these new ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-06-26 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045062#comment-14045062
 ] 

bc Wong commented on YARN-941:
--

Hi [~vinodkv], could you elaborate more on the point about master keys?
{quote}
Once we roll the master-keys, together with the fact that we want to support 
services that run for ever, the only way we can support not expiring tokens is 
by making ResourceManager remember master-keys for ever which is not feasible.
{quote}

The server side currently remembers the (id, password) pair for all valid 
tokens. The master key isn't involved in the token verification 
process, from what I understand. So if we allow an arbitrary expiration time for 
AMRM tokens (to be specified by applications), then:
* RM needs to persist the (id, password) for all tokens that correspond to a 
running application.
* Once an application finishes, RM can invalidate its token, and forget about 
it.
* RM can keep rolling the master key, since it only affects how the password 
(aka token secret) is generated.

Is my understanding of the password/master key interaction correct?

This arbitrary expiration time idea is conceptually a lot simpler than the 
token replacement patch in this jira. So I feel it's worth a bit more 
discussion. There are a few obvious attack vectors:
# *The attacker gains access to the persistence store, where the RM stores its 
(id, password) map.* In this case, all bets are off. Neither solution is more 
secure than the other.
# *The attacker snoops an insecure RPC channel and reads valid tokens from the 
network.* The proper solution is to turn on RPC privacy. The token replacement 
patch does not offer any real protection. On the contrary, it may give people a 
_false sense of security_, which would be worse.
# *The attacker mounts a cryptographic attack, or somehow manages to guess a 
valid (id, password) pair.* Token replacement is better because it limits the 
exposure. But this attack is very unlikely. And we can counter that by using a 
stronger hash function.

To me, the arbitrary expiration time approach is a lot simpler, without 
compromising on security. What do you think?

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
 Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, 
 YARN-941.preview.4.patch, YARN-941.preview.patch


 When an application is submitted to the RM it includes with it a set of 
 tokens that the RM will renew on behalf of the application, that will be 
 passed to the AM when the application is launched, and will be used when 
 launching the application to access HDFS to download files on behalf of the 
 application.
 For long lived applications/services these tokens can expire, and then the 
 tokens that the AM has will be invalid, and the tokens that the RM had will 
 also not work to launch a new AM.
 We need to provide an API that will allow the RM to replace the current 
 tokens for this application with a new set.  To avoid any real race issues, I 
 think this API should be something that the AM calls, so that the client can 
 connect to the AM with a new set of tokens it got using kerberos, then the AM 
 can inform the RM of the new set of tokens and quickly update its tokens 
 internally to use these new ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-05-24 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008277#comment-14008277
 ] 

bc Wong commented on YARN-796:
--

Having the NMs specify their own labels is probably better from an 
administrative point of view. It's harder for the labels to get out of sync. 
Each node can have a discovery script that updates its labels, which feeds 
into the NM. So an admin can take a bunch of nodes out for upgrade, and put 
them back in without having to carefully reconfigure any central mapping file.

 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: YARN-796.patch


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application

2014-05-20 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003440#comment-14003440
 ] 

bc Wong commented on YARN-941:
--

Hi [~xgong], thanks for the patch! I'm interested in talking through the 
changes and their security implications, for everybody who's following along. I 
think the following are worth highlighting:

# The token update mechanism is via the AM heartbeat. So if the previous AMRM 
token has been compromised, the attacker can get the new token.
** I don't think it's a big problem as the RM will only hand out the new token 
in _exactly_ one AllocateResponse (except for the case of RM restart). So if 
the attacker has the new token, the real AM won't, and it'll die and the token 
will get revoked.
# How frequently a running AM gets an updated token is at the mercy of the 
configuration (the roll interval and activation delay). In addition, whenever 
the RM restarts, all AMs will get a new token on the next heartbeat.
** Should the RM check that the roll interval and activation delay are both 
shorter than the token expiration interval? (A sketch of such a check follows 
this list.)
# The client app is not responsible for renewing the token. The RM will renew 
it proactively and update the apps.
** The loss of control may be inconvenient to the app. The AM must also 
heartbeat frequently enough to catch the update in time. In practice, it's not 
an issue. But it still makes me slightly uncomfortable, since the client is 
usually the one renewing its credentials in all the other security protocols I know 
of. Here, the RM doesn't have any explicit logic to update an AMRM token before 
it expires. The math just generally works out if the admin sets the token 
expiry, roll interval and activation delay to the right values.\\
\\
Again, I think this is better than making it the AM's responsibility to get a 
new token, which is more burden on the AM. I just want to bring this up for 
discussion.
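
On the config-sanity question in #2, a minimal sketch of what such a startup check 
could look like, reading the sum of roll interval and activation delay as the thing 
that must stay under the token lifetime (class, method and parameter names are made 
up, not actual RM code):

{code}
// Hypothetical startup validation for the concern above: a rolled AMRM token
// must be activated and delivered to AMs well before the old one expires.
public final class AmrmTokenConfigCheck {
  private AmrmTokenConfigCheck() {}

  public static void validate(long rollIntervalMs, long activationDelayMs,
                              long tokenExpiryMs) {
    if (rollIntervalMs + activationDelayMs >= tokenExpiryMs) {
      throw new IllegalArgumentException(
          "AMRM token roll interval + activation delay ("
              + (rollIntervalMs + activationDelayMs)
              + " ms) must be shorter than the token expiration interval ("
              + tokenExpiryMs + " ms)");
    }
  }
}
{code}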

 RM Should have a way to update the tokens it has for a running application
 --

 Key: YARN-941
 URL: https://issues.apache.org/jira/browse/YARN-941
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
 Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, 
 YARN-941.preview.patch


 When an application is submitted to the RM it includes with it a set of 
 tokens that the RM will renew on behalf of the application, that will be 
 passed to the AM when the application is launched, and will be used when 
 launching the application to access HDFS to download files on behalf of the 
 application.
 For long lived applications/services these tokens can expire, and then the 
 tokens that the AM has will be invalid, and the tokens that the RM had will 
 also not work to launch a new AM.
 We need to provide an API that will allow the RM to replace the current 
 tokens for this application with a new set.  To avoid any real race issues, I 
 think this API should be something that the AM calls, so that the client can 
 connect to the AM with a new set of tokens it got using kerberos, then the AM 
 can inform the RM of the new set of tokens and quickly update its tokens 
 internally to use these new ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-04-30 Thread bc Wong (JIRA)
bc Wong created YARN-2010:
-

 Summary: RM can't transition to active if it can't recover an app 
attempt
 Key: YARN-2010
 URL: https://issues.apache.org/jira/browse/YARN-2010
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: bc Wong


If the RM fails to recover an app attempt, it won't come up. We should make it 
more resilient.

Specifically, the underlying error is that the app was submitted before 
Kerberos security got turned on. Makes sense for the app to fail in this case. 
But YARN should still start.

{noformat}
2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Exception handling the winning of election 
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active 
at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
 
at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
 
at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
 
at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
transitioning to Active mode 
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
 
... 4 more 
Caused by: org.apache.hadoop.service.ServiceStateException: 
org.apache.hadoop.yarn.exceptions.YarnException: 
java.lang.IllegalArgumentException: Missing argument 
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
 
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
 
... 5 more 
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
java.lang.IllegalArgumentException: Missing argument 
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
 
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
... 8 more 
Caused by: java.lang.IllegalArgumentException: Missing argument 
at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
at 
org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
 
... 13 more 
{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1913) Cluster logjam when all resources are consumed by AM

2014-04-08 Thread bc Wong (JIRA)
bc Wong created YARN-1913:
-

 Summary: Cluster logjam when all resources are consumed by AM
 Key: YARN-1913
 URL: https://issues.apache.org/jira/browse/YARN-1913
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.3.0
Reporter: bc Wong


It's possible to deadlock a cluster by submitting many applications at once, 
so that all cluster resources are taken up by AMs.

One solution is for the scheduler to limit resources taken up by AMs, as a 
percentage of total cluster resources, via a maxApplicationMasterShare config.
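
A rough sketch of the check such a config would imply (the names and the 
single-resource math are simplified, not actual scheduler code):

{code}
// Hypothetical sketch of the proposed guard: before launching another AM,
// make sure total AM memory stays under a configurable share of the cluster.
public class AmShareLimiter {
  private final double maxApplicationMasterShare; // e.g. 0.5, from config

  public AmShareLimiter(double maxApplicationMasterShare) {
    this.maxApplicationMasterShare = maxApplicationMasterShare;
  }

  /** @return true if one more AM of amMemoryMb may be started. */
  public boolean canStartAm(long usedAmMemoryMb, long amMemoryMb, long clusterMemoryMb) {
    return usedAmMemoryMb + amMemoryMb
        <= (long) (clusterMemoryMb * maxApplicationMasterShare);
  }
}
{code}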



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1790) FairSchedule UI not showing apps table

2014-03-06 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922132#comment-13922132
 ] 

bc Wong commented on YARN-1790:
---

Seems that the fix of YARN-1407 forgot to change the FairSchedulerAppsBlock to 
use the user-facing app state.

 FairSchedule UI not showing apps table
 --

 Key: YARN-1790
 URL: https://issues.apache.org/jira/browse/YARN-1790
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: bc Wong
 Attachments: fs_ui.png


 There is a running job, which shows up in the summary table in the 
 FairScheduler UI, the queue display, etc. Just not in the apps table at the 
 bottom.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table

2014-03-06 Thread bc Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bc Wong updated YARN-1790:
--

Attachment: fs_ui_fixed.png
0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch

Trivial fix. Also ported YARN-563 to FairScheduler UI. Tested manually (see 
screenshot).

 FairSchedule UI not showing apps table
 --

 Key: YARN-1790
 URL: https://issues.apache.org/jira/browse/YARN-1790
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: bc Wong
 Attachments: 
 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, 
 fs_ui_fixed.png


 There is a running job, which shows up in the summary table in the 
 FairScheduler UI, the queue display, etc. Just not in the apps table at the 
 bottom.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table

2014-03-06 Thread bc Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bc Wong updated YARN-1790:
--

Attachment: (was: 
0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch)

 FairSchedule UI not showing apps table
 --

 Key: YARN-1790
 URL: https://issues.apache.org/jira/browse/YARN-1790
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: bc Wong
 Attachments: 
 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, 
 fs_ui_fixed.png


 There is a running job, which shows up in the summary table in the 
 FairScheduler UI, the queue display, etc. Just not in the apps table at the 
 bottom.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table

2014-03-06 Thread bc Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bc Wong updated YARN-1790:
--

Attachment: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch

Same patch with --no-prefix.

 FairSchedule UI not showing apps table
 --

 Key: YARN-1790
 URL: https://issues.apache.org/jira/browse/YARN-1790
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: bc Wong
 Attachments: 
 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, 
 fs_ui_fixed.png


 There is a running job, which shows up in the summary table in the 
 FairScheduler UI, the queue display, etc. Just not in the apps table at the 
 bottom.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table

2014-03-05 Thread bc Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bc Wong updated YARN-1790:
--

Attachment: fs_ui.png

 FairSchedule UI not showing apps table
 --

 Key: YARN-1790
 URL: https://issues.apache.org/jira/browse/YARN-1790
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: bc Wong
 Attachments: fs_ui.png


 There is a running job, which shows up in the summary table in the 
 FairScheduler UI, the queue display, etc. Just not in the apps table at the 
 bottom.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1785) FairScheduler treats app lookup failures as ERRORs

2014-03-04 Thread bc Wong (JIRA)
bc Wong created YARN-1785:
-

 Summary: FairScheduler treats app lookup failures as ERRORs
 Key: YARN-1785
 URL: https://issues.apache.org/jira/browse/YARN-1785
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.2.0
Reporter: bc Wong


When invoking the /ws/v1/cluster/apps endpoint, RM will eventually get to 
RMAppImpl#createAndGetApplicationReport, which calls 
RMAppAttemptImpl#getApplicationResourceUsageReport, which looks up the app in 
the scheduler, which may or may not exist. So FairScheduler shouldn't log an 
error for every lookup failure:

{noformat}
2014-02-17 08:23:21,240 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Request for appInfo of unknown attemptappattempt_1392419715319_0135_01

{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1785) FairScheduler treats app lookup failures as ERRORs

2014-03-04 Thread bc Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bc Wong updated YARN-1785:
--

Attachment: 0001-YARN-1785.-FairScheduler-treats-app-lookup-failures-.patch

Attaching 0001-YARN-1785.-FairScheduler-treats-app-lookup-failures-.patch. 
Verified via manual testing.

 FairScheduler treats app lookup failures as ERRORs
 --

 Key: YARN-1785
 URL: https://issues.apache.org/jira/browse/YARN-1785
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.2.0
Reporter: bc Wong
 Attachments: 
 0001-YARN-1785.-FairScheduler-treats-app-lookup-failures-.patch


 When invoking the /ws/v1/cluster/apps endpoint, RM will eventually get to 
 RMAppImpl#createAndGetApplicationReport, which calls 
 RMAppAttemptImpl#getApplicationResourceUsageReport, which looks up the app in 
 the scheduler, where the app may or may not exist. So FairScheduler shouldn't log an 
 error for every lookup failure:
 {noformat}
 2014-02-17 08:23:21,240 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Request for appInfo of unknown attemptappattempt_1392419715319_0135_01
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)