Is it possible for you to get the whole app log from your customer?

On Thu, Feb 25, 2016 at 6:26 PM, Johannes Zillmann <[email protected]> wrote:
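Once you have it, it may also be worth checking how often the standby warning actually shows up and from which threads. A rough sketch of what I mean (purely for illustration — `count_standby_warnings` and the sample lines are made up here, not part of any Hadoop or Tez tooling):

```python
import re
from collections import Counter

def count_standby_warnings(log_text):
    """Count StandbyException warnings per logging thread in an aggregated app log."""
    counts = Counter()
    for line in log_text.splitlines():
        if "StandbyException" in line:
            # The first bracketed token in a log4j-style line is the thread name.
            m = re.search(r"\[([^\]]+)\]", line)
            counts[m.group(1) if m else "<unknown>"] += 1
    return counts

# Two example lines in the same shape as the ones quoted below.
sample = (
    "2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : "
    "org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): "
    "Operation category READ is not supported in state standby\n"
    "2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : "
    "org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): "
    "Operation category READ is not supported in state standby\n"
)
print(count_standby_warnings(sample))
```

If every hit is an ipc.Client WARN that appears once per (re)connect and then stops, it is most likely just the failover probe against the standby NameNode rather than a stuck retry loop.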
> Looks like there is the same warning in MR as well:
>
> 2016-02-23 11:59:24,568 WARN [main] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>
> But in this case it doesn’t seem to have a bad effect…
> Johannes
>
> On 25 Feb 2016, at 10:46, Jeff Zhang <[email protected]> wrote:
>
> Might be. Are only Tez jobs affected? What about other jobs like MR?
>
> On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <[email protected]> wrote:
>
>> So yes, I think HA is enabled.
>> But this whole behaviour is not a one-shot thing; it happens repeatedly for this job. So my guess is that it is an unhappy configuration rather than the name node switching again and again!?
>>
>> Johannes
>>
>> On 25 Feb 2016, at 10:29, Jeff Zhang <[email protected]> wrote:
>>
>> The standby warning appears because the client tries to connect to the standby node (HA should be enabled, I think). Maybe this causes the missing output. Could you attach the whole app log?
>>
>> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <[email protected]> wrote:
>>
>>> Hey guys,
>>>
>>> I have a Tez job on a customer deployment which shows very strange symptoms. Maybe you have some pointers for me on what to look into!
>>>
>>> App master log:
>>> —————————————
>>>
>>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-03-03T18:01:25Z ]
>>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
>>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
>>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
>>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
>>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
>>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
>>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
>>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
>>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
>>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317,
>>>
>>> … user specific code
>>>
>>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
>>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
>>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
>>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
>>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
>>>
>>> —————————————
>>>
>>> DAG-log:
>>> --------------------------
>>>
>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>>>
>>> … user specific code
>>>
>>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
>>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration. Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>
>>> --------------------------
>>>
>>> Latest line of the (single) task:
>>> --------------------------
>>>
>>> ...
>>> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>>> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
>>> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
>>> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
>>> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>>>
>>> … user specific code
>>>
>>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
>>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>
>>> —————————————
>>>
>>> The client log contains a shutdown exception:
>>> —————————————
>>>
>>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>> INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>> INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
>>> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>> at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>>> at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>>> at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>>> at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>>> at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>>> at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>>> at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>>> at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>> at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>>> at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>> at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>>> at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>>> at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>>> at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>>> at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>>> at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>> INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>>
>>> —————————————
>>>
>>> The counters and the events show that:
>>> - the correct amount of input records has been ingested
>>> - the expected output records are missing (zero records produced, which is reflected in both the counters and the output itself)
>>> - for Tez, the task and the DAG are both treated as successful
>>>
>>> The customer says that the symptoms of the job are always the same:
>>> - there is that standby warning
>>> - there is the problem with the shutdown
>>> - (and the data is missing)
>>>
>>> To me it's still unclear whether the strange things are happening in Tez or in our application logic.
>>> The same job is successful when executed with pure MapReduce.
>>> Is the standby thing something to worry about?
>>>
>>> best
>>> Johannes
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>
> --
> Best Regards
>
> Jeff Zhang

--
Best Regards

Jeff Zhang
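As background to the standby question: on a healthy HA cluster this warning by itself is expected whenever the first NameNode listed in the client configuration happens to be the standby — the client probes it, logs the WARN, and fails over to the active one. For that failover to happen the client has to address the logical nameservice rather than a single NameNode. A sketch of the relevant hdfs-site.xml properties ("mycluster", nn1/nn2 and the hosts are placeholder names, not the customer's values):

```xml
<!-- Client-side HDFS HA failover configuration (placeholder names). -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1-host:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2-host:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With this in place, fs.defaultFS in core-site.xml should point at hdfs://mycluster; a job that still sees `fs.default.name is deprecated` with a concrete NameNode host configured would keep hitting the standby directly.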
