It looks like the ApplicationMaster keeps trying to connect to the node
that was removed for ~30 minutes before it gives up.
Can I reduce this wait time and the number of retries?
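
For reference, these are the retry knobs I've been looking at; this is
only a sketch of my current understanding, not a confirmed fix, and the
values are illustrative. If I read the AM log right, each connect cycle
is maxRetries=10 with a 1 s sleep, and a layer above keeps re-running
those cycles, which would roughly account for the ~30 minutes:

  <!-- yarn-site.xml: how long the NM client (which, as far as I can
       tell, the AM's ContainerLauncher goes through) keeps retrying an
       unreachable NodeManager, and how often. -->
  <property>
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <value>60000</value>
  </property>
  <property>
    <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
    <value>10000</value>
  </property>

  <!-- core-site.xml: the per-connection IPC retry policy; these two
       appear to correspond to the maxRetries=10 / sleepTime=1000 ms
       visible in the AM log below. -->
  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>3</value>
  </property>
  <property>
    <name>ipc.client.connect.retry.interval</name>
    <value>1000</value>
  </property>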


Thanks
-Manoj

On Fri, Aug 14, 2015 at 4:54 PM, <[email protected]> wrote:

> *Delivery failed for the following recipients or groups:*
>
> [email protected]
> The recipient's mailbox is full and cannot accept messages at this
> time. Try sending the message again later.
>
> Your message was rejected by the following organization:
> MAIL2.scai.intra.
>
> *Diagnostic information for administrators:*
>
> Generating server: mail2.scai.intra
>
> [email protected]
> MAIL2.scai.intra
> Remote Server returned '554 5.2.2 mailbox full' [Stage: CreateSession]
>
>
> ---------- Forwarded message ----------
> From: manoj <[email protected]>
> To: <[email protected]>
> Cc:
> Date: Fri, 14 Aug 2015 16:53:35 -0700
> Subject: Map tasks keep Running even after the node is killed on Apache
> Yarn.
> Hi,
>
> I'm on Apache Hadoop 2.6.0 YARN and I'm trying to test dynamic
> addition and removal of nodes in the cluster.
>
> The test starts a job with 2 nodes and, while the job is progressing,
> removes one of the nodes* by killing its DataNode and NodeManager
> daemons. (Is it OK to remove a node like this? A gentler alternative
> is sketched below.)
>
> *This node is definitely not running the ResourceManager or the
> ApplicationMaster.
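>
> On the "is it OK to remove a node like this" question above: a gentler
> route, as far as I understand the docs, is decommissioning through an
> exclude file and then running "yarn rmadmin -refreshNodes". A minimal
> sketch, with a hypothetical file path:
>
>   <!-- yarn-site.xml: point the RM at an exclude file; list the host
>        in that file, then run "yarn rmadmin -refreshNodes" so the RM
>        decommissions it instead of waiting for it to expire. -->
>   <property>
>     <name>yarn.resourcemanager.nodes.exclude-path</name>
>     <value>/etc/hadoop/conf/yarn.exclude</value>
>   </property>
>
> The HDFS side has the analogous dfs.hosts.exclude plus "hdfs dfsadmin
> -refreshNodes" for taking the DataNode out.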
>
> After the node is successfully removed (I can confirm this from the
> ResourceManager logs, attached), the test adds the node back and waits
> until the job completes.
>
> Node Removal Logs:
>
> 2015-08-14 11:15:56,902 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:host172:36158 
> Timed out after 60 secs
> 2015-08-14 11:15:56,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
> Node host172:36158 as it is now LOST
> 2015-08-14 11:15:56,904 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> host172:36158 Node Transitioned from RUNNING to LOST
> 2015-08-14 11:15:56,905 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1439575616861_0001_01_000006 Container Transitioned from RUNNING to 
> KILLED
> 2015-08-14 11:15:56,906 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Completed container: container_1439575616861_0001_01_000006 in state: KILLED 
> event:KILL
> 2015-08-14 11:15:56,906 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   
> OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
> APPID=application_1439575616861_0001    
> CONTAINERID=container_1439575616861_0001_01_000006
> 2015-08-14 11:15:56,906 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: 
> Released container container_1439575616861_0001_01_000006 of capacity 
> <memory:1024, vCores:1> on host host172:36158, which currently has 1 
> containers, <memory:1024, vCores:1> used and <memory:1024, vCores:7> 
> available, release resources=true
> 2015-08-14 11:15:56,906 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> default used=<memory:3584, vCores:3> numContainers=3 user=hadoop 
> user-resources=<memory:3584, vCores:3>
> 2015-08-14 11:15:56,906 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> completedContainer container=Container: [ContainerId: 
> container_1439575616861_0001_01_000006, NodeId: host172:36158, 
> NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 
> 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] 
> queue=default: capacity=1.0, absoluteCapacity=1.0, 
> usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, 
> absoluteUsedCapacity=1.75, numApps=1, numContainers=3 cluster=<memory:2048, 
> vCores:8>
> 2015-08-14 11:15:56,906 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> completedContainer queue=root usedCapacity=1.75 absoluteUsedCapacity=1.75 
> used=<memory:3584, vCores:3> cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,906 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting completed queue: root.default stats: default: capacity=1.0, 
> absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, 
> usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3
> 2015-08-14 11:15:56,906 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1439575616861_0001_000001 released container 
> container_1439575616861_0001_01_000006 on node: host: host172:36158 
> #containers=1 available=1024 used=1024 with event: KILL
> 2015-08-14 11:15:56,907 INFO   
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1439575616861_0001_01_000005 Container Transitioned from RUNNING to 
> KILLED
> 2015-08-14 11:15:56,907 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Completed container: container_1439575616861_0001_01_000005 in state: KILLED 
> event:KILL
> 2015-08-14 11:15:56,907 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   
> OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
> APPID=application_1439575616861_0001    
> CONTAINERID=container_1439575616861_0001_01_000005
> 2015-08-14 11:15:56,907 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: 
> Released container container_1439575616861_0001_01_000005 of capacity 
> <memory:1024, vCores:1> on host host172:36158, which currently has 0 
> containers, <memory:0, vCores:0> used and <memory:2048, vCores:8> available, 
> release resources=true
> 2015-08-14 11:15:56,907 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> default used=<memory:2560, vCores:2> numContainers=2 user=hadoop 
> user-resources=<memory:2560, vCores:2>
> 2015-08-14 11:15:56,907 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> completedContainer container=Container: [ContainerId: 
> container_1439575616861_0001_01_000005, NodeId: host172:36158, 
> NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 
> 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] 
> queue=default: capacity=1.0, absoluteCapacity=1.0, 
> usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, 
> absoluteUsedCapacity=1.25, numApps=1, numContainers=2 cluster=<memory:2048, 
> vCores:8>
> 2015-08-14 11:15:56,907 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> completedContainer queue=root usedCapacity=1.25 absoluteUsedCapacity=1.25 
> used=<memory:2560, vCores:2> cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,907 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting completed queue: root.default stats: default: capacity=1.0, 
> absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, 
> usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2
> 2015-08-14 11:15:56,907 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1439575616861_0001_000001 released container 
> container_1439575616861_0001_01_000005 on node: host: host172:36158 
> #containers=0 available=2048 used=0 with event: KILL
> 2015-08-14 11:15:56,907 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Removed node host172:36158 clusterResource: <memory:2048, vCores:8>
>
> Node Addition logs:
>
> 2015-08-14 11:19:43,529 INFO org.apache.hadoop.yarn.util.RackResolver: 
> Resolved host172 to /default-rack
> 2015-08-14 11:19:43,530 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> NodeManager from node host172(cmPort: 59426 httpPort: 8042) registered with 
> capability: <memory:2048, vCores:8>, assigned nodeId host172:59426
> 2015-08-14 11:19:43,533 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> host172:59426 Node Transitioned from NEW to RUNNING
> 2015-08-14 11:19:43,535 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Added node host172:59426 clusterResource: <memory:4096, vCores:16>
>
> *Here's the problem:*
>
> The job never completes! According to the logs, the map tasks that
> were scheduled on the removed node are still "RUNNING" with a map
> progress of 100%, and they stay in that state forever.
>
> In the ApplicationMaster container logs I see that it continuously
> tries to connect to the node's old address host172/XX.XX.XX.XX:36158,
> even though that node was removed and re-registered on a different
> port (host172/XX.XX.XX.XX:59426):
>
> ......
> ...
> 2015-08-14 11:25:21,662 INFO [ContainerLauncher #7] 
> org.apache.hadoop.ipc.Client: Retrying connect to server: 
> host172/XX.XX.XX.XX:36158. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> ...
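>
> That RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS) looks like the stock Hadoop IPC connect policy, which as
> far as I can tell is driven by ipc.client.connect.max.retries and
> ipc.client.connect.retry.interval; connects that time out (rather than
> get refused) retry on a separate counter. An illustrative, hedged
> tweak for that separate counter:
>
>   <!-- core-site.xml: retries for connect *timeouts*, as opposed to
>        refused connects; this may be what stretches the total wait.
>        The value is illustrative only. -->
>   <property>
>     <name>ipc.client.connect.max.retries.on.timeouts</name>
>     <value>5</value>
>   </property>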
>
> Please let me know if you need to see any more logs.
>
> P.S.: The job completes normally, without dynamic addition and removal
> of nodes, on the same cluster with the same memory settings.
> Thanks,
> --Manoj Kumar M
>
>


-- 
--Manoj Kumar M
