Hello. I have seen very rare cases of map and/or reduce tasks stopping, or continuing at a very slow pace (0.1% in an hour). A piece of advice here: instead of killing the whole job, kill only the slow/stuck task; it will restart and this time (hopefully) not get stuck.
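For example, from the MapReduce CLI (the job and attempt IDs below are placeholders, not real ones):

```shell
# List running jobs to find the job ID.
mapred job -list

# Kill just the stuck task attempt; the AM will schedule a fresh attempt
# (a killed attempt does not count against the task's retry limit).
mapred job -kill-task attempt_1456300000000_0001_m_000042_0

# Or mark the attempt as failed, which DOES count against the retry limit:
mapred job -fail-task attempt_1456300000000_0001_m_000042_0
```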
Regards,
Lloyd

On 24 February 2016 at 08:56, Varun saxena <[email protected]> wrote:
> Hi Silnov,
>
> Can you check your AM logs and compare them with the MAPREDUCE-6513
> scenario? I suspect it is the same issue. MAPREDUCE-6513 is marked to go
> into 2.7.3.
>
> Regards,
> Varun Saxena.
>
> *From:* Silnov [mailto:[email protected]]
> *Sent:* 24 February 2016 14:52
> *To:* user
> *Subject:* MapReduce job doesn't make any progress for a very very long
> time after one Node become unusable.
>
> Hello everyone! I am new to Hadoop, and I have a question I would like
> your help with.
>
> I have some nodes running Hadoop 2.6.0. The cluster's configuration
> largely remains at the defaults. Every day I run some jobs on the cluster,
> including jobs that process a lot of data.
> Sometimes a job stays at the same progress value for a very, very long
> time, so I have to kill it manually and re-submit it to the cluster. This
> worked before (the re-submitted job would run to the end), but today
> something went wrong.
> After I re-submitted the same job 3 times, each run deadlocked: the
> progress value stopped changing for a long time, and each run stalled at a
> different value (e.g. 33.01%, 45.8%, 73.21%).
>
> I checked the Hadoop web UI and found 98 map tasks suspended while the
> running reduce tasks had consumed all the available memory. I stopped
> YARN, added the configuration below to yarn-site.xml, and then restarted
> YARN:
>
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
>   <value>0.1</value>
> </property>
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
>   <value>1.0</value>
> </property>
>
> (the intent being that YARN would preempt the reduce tasks' resources to
> run the suspended map tasks)
> After restarting YARN, I submitted the job with the property
> mapreduce.job.reduce.slowstart.completedmaps=1.
> But the same thing happened again: my job stayed at the same progress
> value for a very, very long time.
>
> I checked the Hadoop web UI again and found that the suspended map tasks
> had been re-created, each with the previous note: "TaskAttempt killed
> because it ran on unusable node node02:21349".
> Then I checked the ResourceManager's log and found some useful messages:
> ******Deactivating Node node02:21349 as it is now LOST.
> ******node02:21349 Node Transitioned from RUNNING to LOST.
>
> I think this may happen because the network across my cluster is not
> good, so the RM doesn't receive the NM's heartbeat in time.
>
> But I wonder why the YARN framework can't preempt the running reduce
> tasks' resources to run the suspended map tasks? (This is what leaves the
> job stuck at the same progress value for a very long time.)
>
> Can anyone help?
> Thank you very much!
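If missed heartbeats over a flaky network really are the cause of the LOST transitions, one RM-side knob worth checking is the NodeManager liveness expiry interval in yarn-site.xml (a sketch only; 600000 ms is the usual default, and raising it merely delays the LOST declaration rather than fixing the network):

```
<!-- yarn-site.xml: how long the RM waits without a NodeManager heartbeat
     before declaring the node LOST (milliseconds). -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value>
</property>
```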
