Hi Ravi, thanks for your reply.

I have two screenshots that show the nodes and the scheduler output:

Nodes: http://img42.com/gfrGH
Scheduler: http://img42.com/6i8x8

Maybe I'm doing something wrong in my code. The screenshots indicate that
the scheduler assigned the resources to my ApplicationMaster, but it does
not allocate the container. The only thing I'm doing in my code is calling
rmClient.allocate(0) and checking response.getAllocatedContainers() until
all containers are allocated.
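To make that concrete, here is a simplified sketch of the loop. The names
waitForContainers and numWorkers are just placeholders, and I'm assuming
rmClient is an AMRMClient<ContainerRequest> that has already been started
and registered, with one ContainerRequest per worker added via
addContainerRequest():

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    // Block until the expected number of containers has been allocated.
    int waitForContainers(AMRMClient<ContainerRequest> rmClient,
                          int numWorkers) throws Exception {
        int allocated = 0;
        while (allocated < numWorkers) {
            // Heartbeat to the ResourceManager; the argument is the
            // progress indicator, which we don't use here.
            AllocateResponse response = rmClient.allocate(0);
            for (Container container : response.getAllocatedContainers()) {
                allocated++;
                // ... launch the worker in this container via NMClient ...
            }
            Thread.sleep(250); // avoid flooding the RM with heartbeats
        }
        return allocated;
    }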
But my code never receives the response for that last container.

Is there anything else I have to do in the ApplicationMaster? Are there
any log files you suggest looking into?

Best,
Robert

On Fri, Nov 21, 2014 at 12:03 AM, Ravi Prakash <[email protected]> wrote:

> Hi Robert!
>
> Do you see a NodeManager really with 46GB free in the web UI
> (http://resourcemanager:8088/cluster/nodes)? How big is your
> ApplicationMaster itself (because that too is launched in a container)?
> What happens if you ask for, say, 30GB containers?
>
> Cheers,
> Ravi
>
>
> On Thursday, November 20, 2014 3:48 AM, Robert Metzger <
> [email protected]> wrote:
>
> Hi,
>
> I'm a developer of Apache Flink (incubating). Flink has a YARN client
> that allows users to start Flink in their YARN cluster. The current
> implementation does not do fine-grained resource isolation; we just
> allocate our master and worker nodes within YARN and keep them running
> until the user stops the cluster again.
>
> The issue I'm facing is that I cannot allocate all containers in my
> YARN cluster.
>
> My cluster has 42 nodes with 46.09 GB each (according to the
> NodeManager). If my Flink client requests 41 + 1 (workers + master)
> containers with 46 GB each, I only get 41 containers (instead of 42).
> My ApplicationMaster sends allocate requests until all worker nodes
> have been started:
>
> AllocateResponse response = rmClient.allocate(0);
> for (Container container : response.getAllocatedContainers()) ...
>
> My code never gets the allocation response for the last container from
> YARN, even though one NodeManager in the cluster is free.
>
> Is there anything that I'm doing wrong here? How can I resolve the
> issue?
>
> Another question: some of our users reported issues with (worker)
> containers that keep running after the Flink YARN session has been
> stopped. The NodeManagers reported that no containers were running,
> but "jps" showed that the JVMs were still there.
> I tried to reproduce the issue on Amazon EMR, which was impossible,
> but on Google Compute Cloud (HD 2.4.1) I was seeing the issue as well.
> Is this a configuration / YARN version issue? Am I, as a YARN
> application, supposed to kill my own JVMs, or is YARN doing that for
> me?
>
> Cheers,
> Robert
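P.S. Regarding the cleanup question from my first mail above: this is the
shutdown sequence I would expect to be sufficient, as a simplified sketch.
Here nmClient is assumed to be a started NMClient, and launchedContainers
a list of the containers we launched; both names are placeholders:

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.NMClient;

    // Stop every container we launched, then unregister the AM. Once the
    // AM has unregistered, the RM should reclaim any containers that are
    // still running.
    void stopSession(AMRMClient<ContainerRequest> rmClient,
                     NMClient nmClient,
                     List<Container> launchedContainers) throws Exception {
        for (Container c : launchedContainers) {
            nmClient.stopContainer(c.getId(), c.getNodeId());
        }
        rmClient.unregisterApplicationMaster(
                FinalApplicationStatus.SUCCEEDED, "Flink session stopped", "");
        nmClient.stop();
        rmClient.stop();
    }

If YARN is supposed to kill the container processes after the AM
unregisters, then the surviving JVMs would point to a NodeManager or
configuration issue rather than to our code.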
