I added a couple of bigger servers and now I see multiple containers running, but I still can’t get a job to run. The job details now say:
Diagnostics: [Wed Jul 10 17:27:59 +0000 2019] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:2048, vCores:1>; Queue Resource Limit for AM = <memory:3072, vCores:1>; User AM Resource Limit of the queue = <memory:3072, vCores:1>; Queue AM Resource Usage = <memory:2048, vCores:2>;

I understand WHAT that’s saying, but I don’t understand WHY. Here’s what my scheduler details look like; I don’t see why it’s complaining about the AM, unless something’s not talking to something else right:

Queue State: RUNNING
Used Capacity: 12.5%
Configured Capacity: 100.0%
Configured Max Capacity: 100.0%
Absolute Used Capacity: 12.5%
Absolute Configured Capacity: 100.0%
Absolute Configured Max Capacity: 100.0%
Used Resources: <memory:3072, vCores:3>
Configured Max Application Master Limit: 10.0
Max Application Master Resources: <memory:3072, vCores:1>
Used Application Master Resources: <memory:3072, vCores:3>
Max Application Master Resources Per User: <memory:3072, vCores:1>
Num Schedulable Applications: 3
Num Non-Schedulable Applications: 38
Num Containers: 3
Max Applications: 10000
Max Applications Per User: 10000
Configured Minimum User Limit Percent: 100%
Configured User Limit Factor: 1.0
Accessible Node Labels: *
Ordering Policy: FifoOrderingPolicy
Preemption: disabled
Intra-queue Preemption: disabled
Default Node Label Expression: <DEFAULT_PARTITION>
Default Application Priority: 0

User Name | Max Resource             | Weight | Used Resource           | Max AM Resource         | Used AM Resource        | Schedulable Apps | Non-Schedulable Apps
hdfs      | <memory:0, vCores:0>     | 1.0    | <memory:0, vCores:0>    | <memory:3072, vCores:1> | <memory:0, vCores:0>    | 0                | 1
dr.who    | <memory:24576, vCores:1> | 1.0    | <memory:3072, vCores:3> | <memory:3072, vCores:1> | <memory:3072, vCores:3> | 3                | 37

> On Jul 10, 2019, at 3:37 AM, yangtao.yt <yangtao...@alibaba-inc.com> wrote:
>
> Hi, Jason.
>
> According to the information you provided, your cluster has two nodes with
> the same resource <memory:1732, vCores:2>, and the single running container
> is the AM container, which already takes <memory:1024, vCores:1>.
> I think one possible cause is that the available resource of your cluster is
> insufficient for requesting new containers; please refer to the application
> attempt UI
> (http://<RM-HOST>:<RM-HTTP-PORT>/cluster/appattempt/<APP-ATTEMPT-ID>), where
> you can find the outstanding requests with their required resources. Another
> possible cause is the queue/user limit; you can refer to the scheduler UI
> (http://<RM-HOST>:<RM-HTTP-PORT>/cluster/scheduler) to check the resource
> quotas and usage of the queue.
> Hope it helps.
>
> Best,
> Tao Yang
>
>> On Jul 10, 2019, at 8:23 AM, Jason Laughman <ja...@bernetechconsulting.com> wrote:
>>
>> I’ve been setting up a Hadoop 2.9.1 cluster and have data replicating
>> through HDFS, but when I try to run a job via Hive (I see that it’s
>> deprecated, but it’s what I’m working with for now) it never gets out of
>> ACCEPTED state in the web tool. I’ve done some Googling and the general
>> consensus is that it’s resource constraints, so can someone tell me if I’ve
>> got enough horsepower here?
>>
>> I’ve got one small name server, three small data servers, and two larger
>> data servers. I figured out that the small data servers were too small
>> because even when I tried to tweak the YARN parameters for RAM and CPU, the
>> resource managers would immediately shut down.
>> I added the two larger data servers, and now I see two active nodes, but
>> only a total of one container:
>>
>> $ yarn node -list
>> 19/07/09 23:54:11 INFO client.RMProxy: Connecting to ResourceManager at <resource_manager>:8032
>> Total Nodes:2
>>         Node-Id    Node-State    Node-Http-Address    Number-of-Running-Containers
>>     node1:40079       RUNNING           node1:8042                               1
>>     node2:36311       RUNNING           node2:8042                               0
>>
>> There are a ton of some sort of automated jobs backed up on there, and when
>> I try to run anything through Hive it just sits there and eventually times
>> out (I do see it get accepted). My larger nodes have 4 GB RAM and 2 vcores,
>> and I set YARN to do automatic resource allocation with
>> yarn.nodemanager.resource.detect-hardware-capabilities. Is that enough to
>> even get a POC lab working? I don’t care about having the three smaller
>> servers running as resource nodes, but I’d like to have a better
>> understanding of what’s going on with the larger servers, because it seems
>> like they’re close to working.
>>
>> Here’s the metrics data from the website; hopefully somebody can parse it.
>> Cluster Metrics
>> Apps Submitted: 292    Apps Pending: 284    Apps Running: 1    Apps Completed: 7
>> Containers Running: 1    Memory Used: 1 GB    Memory Total: 3.38 GB    Memory Reserved: 0 B
>> VCores Used: 1    VCores Total: 4    VCores Reserved: 0
>>
>> Cluster Nodes Metrics
>> Active Nodes: 2    Decommissioning Nodes: 0    Decommissioned Nodes: 0    Lost Nodes: 0
>> Unhealthy Nodes: 0    Rebooted Nodes: 0    Shutdown Nodes: 4
>>
>> Scheduler Metrics
>> Scheduler Type: Capacity Scheduler    Scheduling Resource Type: [MEMORY]
>> Minimum Allocation: <memory:1024, vCores:1>    Maximum Allocation: <memory:1732, vCores:2>
>> Maximum Cluster Application Priority: 0
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
>> For additional commands, e-mail: user-h...@hadoop.apache.org
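[Editor's note] If the AM limit is what blocks activation, as in the later diagnostics, the usual knob is the queue's maximum AM resource percent in capacity-scheduler.xml. A sketch, with 0.5 purely as an example value for a small POC lab, not a general recommendation:

```xml
<!-- capacity-scheduler.xml: let ApplicationMasters use up to 50% of the
     queue's resources. The default is 0.1 (10%); on a cluster this small,
     a couple of 1-2 GB AMs can exhaust the AM limit on their own. -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
```

The change can be applied without restarting the ResourceManager via `yarn rmadmin -refreshQueues`. Giving the NodeManagers more than ~1.7 GB of registered memory (so a node can hold an AM plus at least one task container) addresses the other half of the problem.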
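[Editor's note] The arithmetic behind both symptoms in this thread can be sketched in a few lines, using only the numbers quoted above. The function and variable names are illustrative, not YARN internals:

```python
# Back-of-the-envelope checks for the two symptoms in this thread,
# using the numbers quoted above. Names are illustrative only.

def containers_per_node(node_mb, min_alloc_mb):
    """How many minimum-size containers fit on one NodeManager."""
    return node_mb // min_alloc_mb

def am_can_activate(queue_am_used_mb, am_request_mb, am_limit_mb):
    """The Capacity Scheduler only activates a new application's AM if the
    queue's current AM usage plus the new AM request fits under the AM limit."""
    return queue_am_used_mb + am_request_mb <= am_limit_mb

# Each node registers <memory:1732> and the minimum allocation is 1024 MB,
# so a single 1024 MB AM container fills a node: no room for a task container.
print(containers_per_node(1732, 1024))    # -> 1

# From the later diagnostics: AM usage 2048 MB + request 2048 MB exceeds the
# 3072 MB queue AM limit, so the new application is "added to the scheduler
# and is not yet activated".
print(am_can_activate(2048, 2048, 3072))  # -> False
```

In other words, the original "stuck in ACCEPTED" and the later "AM resource limit exceeded" are the same shortage seen from two angles: the cluster is small enough that ApplicationMasters alone consume most of what the queue is allowed to give them.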