How have you implemented the failover? Also, can you attach the JT HA logs? If you have implemented it using ZKFC, it would be interesting to look at the ZooKeeper logs as well.
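If ZKFC is in the picture, one quick way to line up the failover with ZKFC and ZooKeeper activity is a small log filter like the sketch below. The patterns match the transition lines already seen in this thread; the log paths in the usage comment are assumptions for a typical CDH4 install, not something confirmed here.

```shell
# Hypothetical helper: pull HA transition and ZooKeeper session events out
# of the JobTracker, ZKFC, and ZooKeeper logs. Reads files given as
# arguments, or stdin when called with none.
extract_ha_events() {
    grep -hE 'Transitioning to (active|standby)|Transitioned to active|Session expired|Fencing' "$@"
}

# Illustrative usage (log locations are assumptions and vary by install):
# extract_ha_events /var/log/hadoop-0.20-mapreduce/*jobtracker*.log \
#                   /var/log/hadoop-0.20-mapreduce/*zkfc*.log \
#                   /var/log/zookeeper/zookeeper.log
```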
Sent from my iPhone

> On Jan 27, 2014, at 3:00 PM, "Karthik Kambatla" <[email protected]> wrote:
>
> (Redirecting to cdh-user, moving user@hadoop to bcc.)
>
> Hi Oren,
>
> Can you attach slightly longer versions of the log files on both the JTs?
> Also, if this is something recurring, it would be nice to monitor the JT
> heap usage and GC timeouts using jstat -gcutil <jt-pid>.
>
> Thanks,
> Karthik
>
>> On Thu, Jan 23, 2014 at 8:11 AM, Oren Marmor <[email protected]> wrote:
>> Hi,
>> We have two HA JobTrackers in active/standby mode (CDH4.2 on Ubuntu Server).
>> We had a problem during which the active node suddenly became standby, and
>> the standby server's attempt to start resulted in a Java heap space failure.
>> Any ideas as to why the active node turned standby?
>>
>> Logs attached.
>>
>> On the (original) active node:
>> 2014-01-22 06:48:41,289 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201401041634_5858
>> 2014-01-22 06:48:41,289 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201401041634_5858
>> 2014-01-22 06:50:27,386 INFO org.apache.hadoop.mapred.JobTrackerHAServiceProtocol: Transitioning to standby
>> 2014-01-22 06:50:27,386 INFO org.apache.hadoop.mapred.JobTracker: Stopping pluginDispatcher
>> 2014-01-22 06:50:27,386 INFO org.apache.hadoop.mapred.JobTracker: Stopping infoServer
>> 2014-01-22 06:50:44,093 WARN org.apache.hadoop.ipc.Client: interrupted waiting to send params to server
>> java.lang.InterruptedException
>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:979)
>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
>>   at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
>>   at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>>   at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:913)
>>   at org.apache.hadoop.ipc.Client.call(Client.java:1198)
>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>>   at $Proxy9.getFileInfo(Unknown Source)
>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:628)
>>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>>   at $Proxy10.getFileInfo(Unknown Source)
>>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1532)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:803)
>>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332)
>>   at org.apache.hadoop.mapred.JobTrackerHAServiceProtocol$SystemDirectoryMonitor.run(JobTrackerHAServiceProtocol.java:96)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>   at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>   at java.lang.Thread.run(Thread.java:662)
>> 2014-01-22 06:51:55,637 INFO org.mortbay.log: Stopped [email protected]:50031
>>
>> On the standby node:
>> 2014-01-22 06:50:05,010 INFO org.apache.hadoop.mapred.JobTrackerHAServiceProtocol: Transitioning to active
>> 2014-01-22 06:50:05,010 INFO org.apache.hadoop.mapred.JobTrackerHAHttpRedirector: Stopping JobTrackerHAHttpRedirector on port 50030
>> 2014-01-22 06:50:05,098 INFO org.mortbay.log: Stopped [email protected]:50030
>> 2014-01-22 06:50:05,198 INFO org.apache.hadoop.mapred.JobTrackerHAHttpRedirector: Stopped
>> 2014-01-22 06:50:05,201 INFO org.apache.hadoop.mapred.JobTrackerHAServiceProtocol: Renaming previous system directory hdfs://***/tmp/mapred/system/seq-000000000022 to hdfs://taykey/tmp/mapred/system/seq-000000000023
>> 2014-01-22 06:50:05,244 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
>> 2014-01-22 06:50:05,248 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
>> 2014-01-22 06:50:05,248 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list
>> 2014-01-22 06:50:11,839 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
>> ...
>> 2014-01-22 06:52:00,870 INFO org.apache.hadoop.mapred.JobTracker: Starting RUNNING
>> 2014-01-22 06:52:06,560 INFO org.apache.hadoop.mapred.JobTrackerHAServiceProtocol: Transitioned to active
>> 2014-01-22 06:52:06,560 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.ha.HAServiceProtocol.transitionToActive from ****:32931: output error
>> 2014-01-22 06:52:06,561 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8023 caught an exception
>> java.nio.channels.ClosedChannelException
>>   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:135)
>>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:326)
>>   at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2134)
>>   at org.apache.hadoop.ipc.Server.access$2000(Server.java:108)
>>   at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:931)
>>   at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:997)
>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1741)
>> 2014-01-22 06:52:13,168 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from ****:60965: output error
>> 2014-01-22 06:52:13,168 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8023 caught an exception
>> java.nio.channels.ClosedChannelException
>>   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:135)
>>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:326)
>>   at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2134)
>>   at org.apache.hadoop.ipc.Server.access$2000(Server.java:108)
>>   at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:931)
>>   at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:997)
>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1741)
>>
>> Thanks,
>> Oren
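On Karthik's `jstat -gcutil` suggestion above, a minimal sketch of what that monitoring could look like follows. The 90% old-generation threshold and the `pgrep` pattern in the usage comment are illustrative assumptions, not something specified in the thread.

```shell
# Minimal monitoring sketch (assumptions: JobTracker runs locally; 90% is
# an arbitrary alarm threshold for old-gen occupancy).

# Print jstat -gcutil rows whose old-generation usage ("O", 4th column)
# exceeds the given percentage; the header row is skipped.
flag_high_oldgen() {
    awk -v limit="$1" 'NR > 1 && $4 + 0 > limit'
}

# Intended use against the live JT: sample every 5 s, 720 samples (~1 h).
# jstat -gcutil "$(pgrep -f mapred.JobTracker)" 5000 720 | flag_high_oldgen 90
```

Sustained old-gen occupancy near 100% just before the transition, together with long full-GC times in the FGCT column, would support the heap-pressure theory for the failover.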
