[ https://issues.apache.org/jira/browse/YARN-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908977#comment-16908977 ]

Prabhu Joseph commented on YARN-9734:
-------------------------------------

It works fine after setting 
dfs.client.block.write.replace-datanode-on-failure.policy = NEVER. It looks like 
this config is needed on a small cluster, as per the HDFS config documentation:

{code}
    When the cluster size is extremely small, e.g. 3 nodes or less, cluster
    administrators may want to set the policy to NEVER in the default
    configuration file or disable this feature.  Otherwise, users may
    experience an unusually high rate of pipeline failures since it is
    impossible to find new datanodes for replacement.
{code}
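
For reference, a minimal sketch of how that setting could be applied in hdfs-site.xml on the NodeManager hosts (assuming the NodeManager is the HDFS client doing the appends here):

{code}
<!-- hdfs-site.xml: on a very small cluster (3 nodes or less), do not try to
     replace a failed datanode in the write pipeline. NEVER avoids the
     "no more good datanodes" append failure quoted below. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>
{code}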


> LogAggregationIndexedFileController fails to upload logs in rolling fashion
> ---------------------------------------------------------------------------
>
>                 Key: YARN-9734
>                 URL: https://issues.apache.org/jira/browse/YARN-9734
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation
>    Affects Versions: 3.3.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Major
>
> LogAggregationIndexedFileController fails to upload logs in rolling fashion.
> *Configs:*
> {code}
> yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds = 60
> yarn.nodemanager.log-aggregation.debug-enabled = true
> yarn.log-aggregation.file-formats=IFile
> yarn.log-aggregation.file-controller.IFile.class=org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController
> {code}
> *Initializing the writer fails with the error below:*
> {code}
> 2019-08-09 07:46:12,411 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Cannot create writer for app application_1565102314214_0007. Skip log upload 
> this time.
> java.io.IOException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.RecoveryInProgressException):
>  Failed to APPEND_FILE 
> /app-logs/ambari-qa/bucket-logs-ifile/0007/application_1565102314214_0007/yarnDocker-1_45454_1565335809907
>  for DFSClient_NONMAPREDUCE_-1185242013_202 on 172.26.86.24 because lease 
> recovery is in progress. Try again later.
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2697)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:125)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2745)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:823)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2920)
>         at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:227)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:312)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:482)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:449)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:295)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)      
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.RecoveryInProgressException):
>  Failed to APPEND_FILE 
> /app-logs/ambari-qa/bucket-logs-ifile/0007/application_1565102314214_0007/yarnDocker-1_45454_1565335809907
>  for DFSClient_NONMAPREDUCE_-1185242013_202 on 172.26.86.24 because lease 
> recovery is in progress. Try again later.
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2697)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:125)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2745)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:823)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2920)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1495)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1392)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy12.append(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:404)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy13.append(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1346)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1368)
>         at 
> org.apache.hadoop.hdfs.DFSClient.primitiveAppend(DFSClient.java:1271)
>         at 
> org.apache.hadoop.hdfs.DFSClient.primitiveCreate(DFSClient.java:1287)
>         at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:105)
>         at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:60)
>         at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:624)
>         at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:697)
>         at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:693)
>         at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>         at org.apache.hadoop.fs.FileContext.create(FileContext.java:699)
>         at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriterInRolling(LogAggregationIndexedFileController.java:294)
>         at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.access$600(LogAggregationIndexedFileController.java:96)
>         at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:194)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>         at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:175)
>               
> {code}
> *Uploading logs for containers fails with the error below:*
> {code}
> 2019-08-09 07:45:12,318 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Uploading logs for container container_e43_1565102314214_0007_01_000003. 
> Current good log dirs are /hadoop/yarn/log
> 2019-08-09 07:45:12,321 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Couldn't upload logs for container_e43_1565102314214_0007_01_000003. 
> Skipping this container.
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[172.26.86.24:9866,DS-f28457aa-bba0-411f-acd6-82fcf7947583,DISK]],
>  
> original=[DatanodeInfoWithStorage[172.26.86.24:9866,DS-f28457aa-bba0-411f-acd6-82fcf7947583,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>         at 
> org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:719)
> {code}
> The same scenario works fine with LogAggregationTFileController.
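> For comparison, a minimal sketch of the TFile-based configuration (the TFile controller class name below is an assumption, taken to mirror the IFile entry above):
> {code}
> yarn.log-aggregation.file-formats=TFile
> # class name assumed to follow the same pattern as the IFile entry above
> yarn.log-aggregation.file-controller.TFile.class=org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.LogAggregationTFileController
> {code}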


