Re: All but one TMs connect when JM has more than 16G of memory

Robert Schmidtke Wed, 30 Sep 2015 09:34:32 -0700

Hi Robert,

thanks for your reply. It got me digging into my setup, and I discovered
that one TM was scheduled next to the JM. The documentation suggests that
-yn 7 specifies the number of TMs (of which I wanted 7), and I thought an
additional container would be used for the JM (my YARN cluster has 8
containers). Anyway, with this setup the memory requested on that machine
added up to 56G and 1M (40G for the TM and 16G 1M for the JM), but I set a
hard maximum of 56G in my yarn-site.xml, which is why the request could not
be fulfilled.

It is interesting to note that when I set both
yarn.nodemanager.resource.memory-mb and
yarn.scheduler.maximum-allocation-mb to 56G, I get a proper error when
requesting 56G and 1M. When I set yarn.nodemanager.resource.memory-mb to
56G and yarn.scheduler.maximum-allocation-mb to 54G, I don't get an error
but the aforementioned endless loop. Note that I have
yarn.nodemanager.vmem-check-enabled set to false. So this is probably a
YARN issue or my bad configuration.
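
The arithmetic can be spelled out in a few lines. This is only a minimal
Python sketch, not Flink or YARN code: the names NODE_MEMORY_MB,
TM_MEMORY_MB and fits_on_one_node are hypothetical, it assumes the JM and
one TM end up on the same node, and it ignores any rounding or overhead
YARN may apply to container requests.

```python
# Memory figures taken from this thread, in MB.
NODE_MEMORY_MB = 56 * 1024  # yarn.nodemanager.resource.memory-mb (56G)
TM_MEMORY_MB = 40 * 1024    # -ytm 40960


def fits_on_one_node(jm_memory_mb, tm_memory_mb=TM_MEMORY_MB,
                     node_memory_mb=NODE_MEMORY_MB):
    """Return True if one JM container plus one TM container fit on a node.

    Hypothetical helper: a plain sum against the node limit, nothing more.
    """
    return jm_memory_mb + tm_memory_mb <= node_memory_mb


# The two JM sizes tried in the original report (-yjm 16384 and -yjm 16385).
for jm_memory_mb in (16384, 16385):
    total = jm_memory_mb + TM_MEMORY_MB
    print(jm_memory_mb, total, fits_on_one_node(jm_memory_mb))
```

With -yjm 16384 the sum is exactly 57344M (56G), so both containers fit;
with -yjm 16385 it is 57345M, one megabyte over the limit.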


I'm in a rush now (to get to the Flink meetup) and thus will check the
documentation later to see how to deploy the JM and each TM on a separate
machine, since that is what I'd like to have but not what's happening at
the moment. Thanks again and see you in an hour.

Cheers
Robert

On Wed, Sep 30, 2015 at 5:19 PM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi Robert,
>
> the problem here is that YARN's scheduler (there are different schedulers
> in YARN: FIFO, CapacityScheduler, ...) is not giving Flink's
> ApplicationMaster/JobManager all the containers it is requesting. By
> increasing the size of the AM/JM container, there is probably no memory
> left to fit the last TaskManager container.
> I also experienced this issue when I wanted to run a Flink job on YARN
> and the containers were fitting theoretically, but YARN was not giving me
> all the containers I requested.
> Back then, I asked on the yarn-dev list [1] (there were also some off-list
> emails) but we could not resolve the issue.
>
> Can you check the resource manager logs? Maybe there is a log message
> which explains why the container request of Flink's AM is not fulfilled.
>
>
> [1]
> http://search-hadoop.com/m/AsBtCilK5r1pKLjf1&subj=Re+QUESTION+Allocating+a+full+YARN+cluster
>
> On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke <ro.schmid...@gmail.com>
> wrote:
>
>> It's me again. This is a strange issue, I hope I managed to find the
>> right keywords. I got 8 machines, 1 for the JM, the other 7 are TMs with
>> 64G of memory each.
>>
>> When running my job like so:
>>
>> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7
>> .....
>>
>> The job completes without any problems. When running it like so:
>>
>> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7
>> .....
>>
>> (note the one more M of memory for the JM), the execution stalls,
>> continuously reporting:
>>
>> .....
>> TaskManager status (6/7)
>> TaskManager status (6/7)
>> TaskManager status (6/7)
>> .....
>>
>> I did some poking around, but I couldn't find any direct correlation with
>> the code.
>>
>> The JM log says:
>>
>> .....
>> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$        -  JVM Options:
>> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$        -     -Xmx12289M
>> .....
>>
>> but then continues to report
>>
>> .....
>> 16:52:59,311 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
>> 16:52:59,831 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
>> 16:53:00,351 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
>> .....
>>
>> forever until I cancel the job.
>>
>> If you have any ideas I'm happy to try them out. Thanks in advance for
>> any hints! Cheers.
>>
>> Robert
>> --
>> My GPG Key ID: 336E2680
>>
>
>


-- 
My GPG Key ID: 336E2680
