Could be. Are only Tez jobs affected? What about other jobs like MR?

On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <[email protected]
> wrote:

> So yes, I think HA is enabled.
> But this whole behaviour is not a one-shot thing; it happens repeatedly
> for this job. So my guess is that it's an unhappy configuration rather than the
> name node switching back and forth again and again?!
>
> Johannes
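
For reference, if it is a configuration problem: with HA the client should talk to the logical nameservice and let the failover proxy provider pick the active NameNode. Below is a minimal sketch of what a healthy client-side HA configuration usually resolves to, written as programmatic Configuration calls just for illustration; the nameservice name, NameNode IDs and hosts are placeholders, not the customer's actual values. Comparing the job's effective configuration against something like this might show where it points at a single (possibly standby) NameNode.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigSketch {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Logical nameservice instead of a single NameNode host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // Without a failover proxy provider the client keeps retrying whichever
        // NameNode it was pointed at, even if that one is in standby state.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Should transparently reach whichever NameNode is currently active.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
    }
}
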
>
>
> On 25 Feb 2016, at 10:29, Jeff Zhang <[email protected]> wrote:
>
> The standby warning is there because it tries to connect to the standby node (HA
> should be enabled, I think). Maybe this is causing the missing output. Could you
> attach the whole app log?
>
> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <
> [email protected]> wrote:
>
>> Hey guys,
>>
>> I have a Tez job on a customer deployment which shows very strange symptoms.
>> Maybe you have some pointers on what I should look into!
>>
>> App master log:
>> —————————————
>>
>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster 
>> for application appattempt_1455209830330_35852_000001, versionInfo=[ 
>> component=tez-dag, version=0.6.0, 
>> revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, 
>> SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, 
>> buildTime=2015-03-03T18:01:25Z ]
>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: 
>> fs.default.name is deprecated. Instead, use fs.defaultFS
>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode 
>> is deprecated. Instead, use fs.permissions.umask-mode
>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load 
>> native-hadoop library for your platform... using builtin-java classes where 
>> applicable
>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client 
>> version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while 
>> connecting to the server : 
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>  Operation category READ is not supported in state standby
>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token 
>> to jobTokenSecretManager for application
>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.rm.container.AMContainerEventType for class 
>> org.apache.tez.dag.app.rm.container.AMContainerMap
>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.rm.node.AMNodeEventType for class 
>> org.apache.tez.dag.app.rm.node.AMNodeTracker
>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class 
>> org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.dag.event.DAGEventType for class 
>> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.dag.event.VertexEventType for class 
>> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.dag.event.TaskEventType for class 
>> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class 
>> org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.dag.event.SpeculatorEventType for class 
>> org.apache.tez.dag.app.DAGAppMaster$4
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.dag.event.SpeculatorEventType for class 
>> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.rm.AMSchedulerEventType for class 
>> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class 
>> org.apache.tez.dag.app.rm.NMCommunicatorEventType for class 
>> org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: 
>> blacklistDisablePercent is 33, blacklistingEnabled: true, 
>> maxTaskFailuresPerNode: 10
>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper 
>> limit on the thread pool size is 500
>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: 
>> Initializing HistoryEventHandler
>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log 
>> file location for SimpleHistoryLoggingService not specified, defaulting to 
>> containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: 
>> Initializing SimpleHistoryLoggingService, 
>> logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001,
>>  maxErrors=10
>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing 
>> RecoveryService
>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: 
>> [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: 
>> applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, 
>> launchTime=1455642316658
>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: 
>> [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: 
>> appAttemptId=appattempt_1455209830330_35852_000001, 
>> appSubmitTime=1455642293317,
>>
>>
>> … user specific code
>>
>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] 
>> app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): 
>> ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected 
>> record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] 
>> Configuration.deprecation: mapred.output.compress is deprecated. Instead, 
>> use mapreduce.output.fileoutputformat.compress
>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] 
>> Configuration.deprecation: fs.default.name is deprecated. Instead, use 
>> fs.defaultFS
>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] 
>> Configuration.deprecation: mapred.map.tasks.speculative.execution is 
>> deprecated. Instead, use mapreduce.map.speculative
>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] 
>> Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use 
>> mapreduce.job.reduces
>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] 
>> Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use 
>> mapreduce.input.fileinputformat.split.minsize
>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] 
>> Configuration.deprecation: mapred.reduce.tasks.speculative.execution is 
>> deprecated. Instead, use mapreduce.reduce.speculative
>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] 
>> Configuration.deprecation: mapred.output.compression.type is deprecated. 
>> Instead, use mapreduce.output.fileoutputformat.compress.type
>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] 
>> Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use 
>> mapreduce.input.fileinputformat.split.maxsize
>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] 
>> Configuration.deprecation: mapred.compress.map.output is deprecated. 
>> Instead, use mapreduce.map.output.compress
>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] 
>> app.DAGAppMaster: Generating DAG graphviz file, 
>> dagId=dag_1455209830330_35852_1, 
>> filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] 
>> common.TezUtilsInternal: Redirecting log file based on addend: 
>> dag_1455209830330_35852_1
>>
>> —————————————
>>
>> DAG-log:
>> --------------------------
>>
>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] 
>> app.DAGAppMaster: Running DAG: Export job (349): 
>> ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected 
>> record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] 
>> impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>>
>> … user specific code
>>
>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job 
>> (349): ClaimLineExportToDB2#export(Disconnected record stream) 
>> (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root 
>> logger level for hadoop task to WARN:
>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] 
>> hdfs.ClientContext: Existing client context 'default' does not match 
>> requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, 
>> shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, 
>> shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 
>> 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 
>> 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, 
>> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, 
>> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: 
>> shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 
>> 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 
>> 3600000, shortCircuitMmapCacheRetryTimeout = 300000, 
>> shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, 
>> socketCacheExpiry = 3000, shortCircuitLocalReads = true, 
>> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, 
>> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: 
>> Exception encountered while connecting to the server : 
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>  Operation category READ is not supported in state standby
>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: 
>> Exception encountered while connecting to the server : 
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>  Operation category READ is not supported in state standby
>>
>> --------------------------
>>
>>
>> Latest lines of the (single) task:
>> --------------------------
>>
>> ...
>>
>> 2016-02-16 12:07:00,166 INFO [Initializer 1] 
>> runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: 
>> { destinationVertexName=output, physicalEdgeCount=0, 
>> outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>> 2016-02-16 12:07:00,168 INFO [Initializer 0] 
>> runtime.LogicalIOProcessorRuntimeTask: Creating Input
>> 2016-02-16 12:07:00,168 INFO [Initializer 1] 
>> runtime.LogicalIOProcessorRuntimeTask: Creating Output
>> 2016-02-16 12:07:00,175 INFO [Initializer 1] 
>> runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: 
>> output
>> 2016-02-16 12:07:00,176 INFO [Initializer 0] 
>> runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: 
>> Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record 
>> stream)
>> 2016-02-16 12:07:00,176 INFO [Initializer 0] 
>> resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, 
>> componentVertexName: Input:Export job (349): 
>> ClaimLineExportToDB2#export(Disconnected record stream)
>> 2016-02-16 12:07:00,176 INFO [Initializer 1] 
>> resources.MemoryDistributor$RequestorInfo: Received request: 0, type: 
>> OUTPUT, componentVertexName: output
>> 2016-02-16 12:07:00,177 INFO [Initializer 1] 
>> runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: 
>> output
>>
>>
>> … user specific code
>>
>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root
>> logger level for hadoop task to WARN:
>>
>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered 
>> while connecting to the server : 
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>  Operation category READ is not supported in state standby
>>
>> —————————————
>>
>> The client log contains a shutdown exception:
>> —————————————
>>
>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - 
>> Shutting down Tez Session, sessionName=Export job (349): 
>> ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected 
>> record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), 
>> applicationId=application_1455209830330_35852
>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - 
>> Application not running, applicationId=application_1455209830330_35852, 
>> yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, 
>> trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, 
>> diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, 
>> killedDAGs=0
>>
>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - 
>> Failed to shutdown Tez Session via proxy
>> org.apache.tez.dag.api.SessionNotRunning: Application not running, 
>> applicationId=application_1455209830330_35852, 
>> yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, 
>> trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, 
>> diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, 
>> killedDAGs=0
>>
>>      at 
>> org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>>      at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>>      at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>>      at 
>> datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>>      at 
>> datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>>      at 
>> datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>>      at 
>> datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>>      at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>      at 
>> datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>>      at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>      at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>>      at 
>> datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>>      at 
>> datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>>      at 
>> datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>>      at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>      at 
>> datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>>      at 
>> datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>>      at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - 
>> Could not connect to AM, killing session via YARN, sessionName=Export job 
>> (349): ClaimLineExportToDB2#export(Disconnected record 
>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), 
>> applicationId=application_1455209830330_35852
>>
>>
>> —————————————
>>
>>
>> The counters and the events show that:
>> - the correct number of input records has been ingested
>> - the expected output records are missing (zero records were
>> produced, which is reflected in both the counters and the output itself)
>> - for Tez, the task and the DAG are both treated as successful
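
Side note on verifying this from the client: the aggregated DAG counters are available through the Tez client API once the DAG finishes, so the zero output records should be observable before the session is shut down. A rough sketch, assuming a TezClient and DAG set up the way the application already does it; the method name and the specific counters used here (TaskCounter.INPUT_RECORDS_PROCESSED / OUTPUT_RECORDS) are just illustrative.

import java.io.IOException;
import java.util.EnumSet;

import org.apache.tez.client.TezClient;
import org.apache.tez.common.counters.TaskCounter;
import org.apache.tez.common.counters.TezCounters;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.TezException;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.dag.api.client.StatusGetOpts;

public class DagCounterCheck {

    // Submit the DAG, wait for completion and print the record counters.
    static void runAndCheck(TezClient tezClient, DAG dag)
            throws TezException, IOException, InterruptedException {
        DAGClient dagClient = tezClient.submitDAG(dag);

        DAGStatus status = dagClient.waitForCompletionWithStatusUpdates(
            EnumSet.of(StatusGetOpts.GET_COUNTERS));

        TezCounters counters = status.getDAGCounters();
        long in = counters.findCounter(TaskCounter.INPUT_RECORDS_PROCESSED).getValue();
        long out = counters.findCounter(TaskCounter.OUTPUT_RECORDS).getValue();

        // A SUCCEEDED state together with out == 0 would confirm the symptom
        // on the client side, independent of the written output.
        System.out.println("state=" + status.getState()
            + ", input records=" + in + ", output records=" + out);
    }
}
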
>>
>> The customer says that the symptoms of the job are always the same:
>> - there is that standby warning
>> - there is the problem with the shutdown
>> - (and the data is missing)
>>
>> To me it is still unclear whether there is something strange going on with Tez or
>> with our application logic.
>> The same job is successful when executed with pure MapReduce.
>> Is the standby thing something to worry about?
>>
>> best
>> Johannes
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>


-- 
Best Regards

Jeff Zhang
