Looks like the ApplicationMaster keeps trying to connect to the removed node for ~30 minutes before it gives up. Can I reduce this wait time and the number of retries?
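These look like the relevant knobs to me, though I'm not sure they are the right ones, so treat this as a guess I still need to verify. The inner retry loop in the AM log (RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)) seems to match the IPC client defaults, and the outer wait looks like it is governed by the YARN client's NodeManager-connect settings:

    <!-- yarn-site.xml: how long the AM-side client keeps retrying a NodeManager -->
    <property>
      <name>yarn.client.nodemanager-connect.max-wait-ms</name>
      <value>60000</value> <!-- default appears to be 900000, i.e. 15 min -->
    </property>
    <property>
      <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
      <value>10000</value> <!-- default 10000 -->
    </property>

    <!-- core-site.xml: the inner IPC retry loop seen in the AM log -->
    <property>
      <name>ipc.client.connect.max.retries</name>
      <value>3</value> <!-- default 10 -->
    </property>
    <property>
      <name>ipc.client.connect.retry.interval</name>
      <value>1000</value> <!-- default 1000 ms -->
    </property>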
Thanks,
-Manoj
---------- Forwarded message ----------
From: manoj <[email protected]>
To: <[email protected]>
Date: Fri, 14 Aug 2015 16:53:35 -0700
Subject: Map tasks keep Running even after the node is killed on Apache Yarn.

Hi,

I'm on Apache Hadoop 2.6.0 YARN and I'm trying to test dynamic addition and removal of nodes from the cluster.

The test starts a job with 2 nodes and, while the job is progressing, removes one of the nodes* by killing the DataNode and NodeManager daemons (is it OK to remove a node like this? see the sketch below).

*This node is definitely not running the ResourceManager/ApplicationMaster.

After the node is successfully removed (I can confirm this from the ResourceManager logs below), the test adds it back and waits until the job completes.
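For reference, the removal and re-addition steps look roughly like this (a sketch assuming the stock sbin daemon scripts; the test may equally well kill -9 the daemon PIDs):

    # on the node being removed (host172): stop the slave daemons
    $HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
    $HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager

    # later, to add the node back: restart them
    $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
    $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager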
Node Removal Logs:

2015-08-14 11:15:56,902 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:host172:36158 Timed out after 60 secs
2015-08-14 11:15:56,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node host172:36158 as it is now LOST
2015-08-14 11:15:56,904 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:36158 Node Transitioned from RUNNING to LOST
2015-08-14 11:15:56,905 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000006 Container Transitioned from RUNNING to KILLED
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000006 in state: KILLED event:KILL
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1439575616861_0001 CONTAINERID=container_1439575616861_0001_01_000006
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000006 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:1024, vCores:7> available, release resources=true
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:3584, vCores:3> numContainers=3 user=hadoop user-resources=<memory:3584, vCores:3>
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000006, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3 cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.75 absoluteUsedCapacity=1.75 used=<memory:3584, vCores:3> cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000006 on node: host: host172:36158 #containers=1 available=1024 used=1024 with event: KILL
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000005 Container Transitioned from RUNNING to KILLED
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000005 in state: KILLED event:KILL
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1439575616861_0001 CONTAINERID=container_1439575616861_0001_01_000005
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000005 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 0 containers, <memory:0, vCores:0> used and <memory:2048, vCores:8> available, release resources=true
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:2560, vCores:2> numContainers=2 user=hadoop user-resources=<memory:2560, vCores:2>
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000005, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2 cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.25 absoluteUsedCapacity=1.25 used=<memory:2560, vCores:2> cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000005 on node: host: host172:36158 #containers=0 available=2048 used=0 with event: KILL
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removed node host172:36158 clusterResource: <memory:2048, vCores:8>

Node Addition Logs:

2015-08-14 11:19:43,529 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved host172 to /default-rack
2015-08-14 11:19:43,530 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node host172(cmPort: 59426 httpPort: 8042) registered with capability: <memory:2048, vCores:8>, assigned nodeId host172:59426
2015-08-14 11:19:43,533 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:59426 Node Transitioned from NEW to RUNNING
2015-08-14 11:19:43,535 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Added node host172:59426 clusterResource: <memory:4096, vCores:16>

Here's the problem:

The job never completes! According to the logs, the map tasks that were scheduled on the removed node are still "RUNNING" with a map progress of 100%, and they stay in that state forever.

In the ApplicationMaster container logs I can see that it continuously retries the node's old address host172/XX.XX.XX.XX:36158, even though the node was removed and re-registered on a different port as host172/XX.XX.XX.XX:59426:

...
2015-08-14 11:25:21,662 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: host172/XX.XX.XX.XX:36158. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
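For what it's worth, the re-registration is also visible from the RM command line (assuming I'm reading the CLI right and that yarn node -list -all lists nodes in all states):

    # I'd expect the old host172:36158 entry to show as LOST
    # and the new host172:59426 entry as RUNNING
    yarn node -list -all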
Please let me know if you need to see any more logs.

P.S.: The job completes normally on the same cluster, with the same memory settings, when nodes are not dynamically added or removed.

Thanks,
--Manoj Kumar M
