(In particular, I'm wondering if something in CRUNCH-481 is related to this problem.)
On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <[email protected]> wrote:

> Hey Everett,
>
> Shot in the dark -- would you mind trying it with 0.11.0-hadoop2 with
> the CRUNCH-553 patch? Is that easy to do?
>
> J
>
> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <[email protected]> wrote:
>
>> Hi,
>>
>> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
>> when setting crunch.max.running.jobs to 1. At this point, I generally
>> feel that the pipeline application logic itself is sound. It could be
>> that this is just taxing these machines too hard and we need to
>> increase the number of retries?
>>
>> It reliably fails on this hardware when crunch.max.running.jobs is set
>> to its default.
>>
>> Can you explain a little what the /tmp/crunch-XXXXXXX files are, as
>> well as how Crunch uses side effect files? Do you know if HDFS would
>> clean up those directories from underneath Crunch?
>>
>> There are usually 4 failed applications, all failing on reduces. The
>> failures seem to be one of the following three kinds: (1) no lease on
>> a side effect file, (2) a /tmp/crunch-XXXXXXX file not found, and
>> (3) a SocketTimeoutException.
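For anyone reproducing the serial run described above, a minimal sketch of a driver that pins Crunch to one concurrent job; the class name is hypothetical, and the mapreduce.*.maxattempts properties are the stock Hadoop 2.x retry knobs -- one guess at what "increase the number of retries" would mean here:

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class SerialRunSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run the pipeline's MapReduce jobs one at a time instead of
        // concurrently (the setting used for the passing run above).
        conf.setInt("crunch.max.running.jobs", 1);
        // Stock MapReduce task retry limits (defaults are 4 attempts).
        conf.setInt("mapreduce.map.maxattempts", 8);
        conf.setInt("mapreduce.reduce.maxattempts", 8);

        Pipeline pipeline = new MRPipeline(SerialRunSketch.class, conf);
        // ... construct the real pipeline here, then:
        pipeline.done();
      }
    }

The same settings can also be passed as -Dcrunch.max.running.jobs=1 etc. on the command line if the driver goes through ToolRunner.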
>> Examples:
>>
>> [1] No lease exception
>>
>> Error: org.apache.crunch.CrunchRuntimeException:
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>> File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1
>> does not have any open files.
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>   at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>> File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1
>> does not have any open files.
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>   at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>   at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>   at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>   at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>   at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>   at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>   at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>   at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>   at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>   at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>   ... 9 more
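The file in [1] that fails to close is a classic MapReduce "side effect" file: it lives under the task attempt's work directory (the .../_temporary/1/_temporary/attempt_*/ path above) and is promoted to the real output directory by the OutputCommitter only if the attempt succeeds. A toy illustration of the mechanism, not Crunch's internals -- the reducer, types, and file name here are all hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SideEffectFileSketch
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      private FSDataOutputStream side;

      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        // Resolves to .../_temporary/1/_temporary/attempt_XXXX_r_YYYYYY_Z
        Path workDir = FileOutputFormat.getWorkOutputPath(context);
        FileSystem fs = workDir.getFileSystem(context.getConfiguration());
        // Hypothetical side effect file, created under the attempt's
        // work directory just like the out7-r-00003 file in [1].
        side = fs.create(new Path(workDir, "side-r-00003"));
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        // Closing during cleanup is the step that throws in [1]: if the
        // attempt directory has vanished or been renamed, the NameNode
        // reports "No lease ... File does not exist" on close().
        side.close();
      }
    }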
>> [2] File does not exist
>>
>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler]
>> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
>> report from attempt_1439917295505_0034_r_000004_1: Error:
>> org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>   at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>
>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>   at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>   at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>   at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>   at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>   at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>   ... 9 more
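The missing file in [2], /tmp/crunch-4694113/p470/REDUCE, is the runtime node information for the reduce phase that each Crunch task attempt reads back at setup time (the DistCache.read call in the trace). It sits under Crunch's scratch directory, which defaults to /tmp on the cluster filesystem. One way to rule out something scrubbing /tmp is to relocate that directory; a sketch, assuming crunch.tmp.dir is still the relevant property in this Crunch version, with a made-up path:

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class ScratchDirSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Move the /tmp/crunch-XXXXXXX working tree somewhere no
        // tmp-cleanup policy should touch; the path is a made-up example.
        conf.set("crunch.tmp.dir", "/user/hadoop/crunch-scratch");

        Pipeline pipeline = new MRPipeline(ScratchDirSketch.class, conf);
        // ... construct and run the pipeline as usual, then:
        pipeline.done();
      }
    }

If the failures stop after the move, that would point at /tmp cleanup rather than at the pipeline itself.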
>> [3] SocketTimeoutException
>>
>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException:
>> 70000 millis timeout while waiting for channel to be ready for read.
>> ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>   at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while
>> waiting for channel to be ready for read.
>> ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>   at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
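The timeout in [3] fires inside DFSOutputStream's pipeline-recovery path, meaning the HDFS client gave up waiting on a DataNode while rebuilding a write pipeline. If the disk-backed cc2.8xlarges are simply slower under this load, lengthening the HDFS client socket timeouts is a common mitigation to try; the property names below are standard Hadoop 2.x keys, but the values are illustrative guesses:

    import org.apache.hadoop.conf.Configuration;

    public class HdfsTimeoutSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Client-side read timeout toward DataNodes; the observed
        // 70000 ms appears consistent with the 60 s default plus
        // per-datanode extensions during pipeline recovery.
        conf.setInt("dfs.client.socket-timeout", 180000);
        // Client-side write timeout toward DataNodes.
        conf.setInt("dfs.datanode.socket.write.timeout", 180000);
        // Pass this conf to the pipeline driver as in the sketches above.
      }
    }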
>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <[email protected]> wrote:
>>
>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <[email protected]> wrote:
>>>
>>>> Hey Everett,
>>>>
>>>> Initial thought -- there are lots of reasons for lease expired
>>>> exceptions, and they're usually more symptomatic of other problems in
>>>> the pipeline. Are you sure none of the jobs in the Crunch pipeline on
>>>> the non-SSD instances are failing for some other reason? I'd be
>>>> surprised if no other errors showed up in the app master, although
>>>> there are reports of some weirdness around LeaseExpireds when writing
>>>> to S3 -- but you're not doing that here, right?
>>>
>>> We're reading from and writing to HDFS here. (We copied the input from
>>> S3 to HDFS in another step.)
>>>
>>> There are a few exceptions in the logs. Most seem related to missing
>>> temp files.
>>>
>>> Let me see if I can reproduce it with crunch.max.running.jobs set to 1
>>> to try to narrow down the originating failure.
>>>
>>>> J
>>>>
>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I recently started trying to run our Crunch pipeline on more data
>>>>> and have been trying out different AWS instance types in
>>>>> anticipation of our storage and compute needs.
>>>>>
>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with
>>>>> the CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>
>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>
>>>>> - 50 c3.4xlarge Core, 0 Task
>>>>> - 10 c3.8xlarge Core, 0 Task
>>>>> - 25 c3.8xlarge Core, 0 Task
>>>>>
>>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>>> Core instances.
>>>>>
>>>>> The biggest obvious hardware difference is that the cc2.8xlarges use
>>>>> hard disks instead of SSDs.
>>>>>
>>>>> While it's a little hard to track down the exact originating
>>>>> failure, I think it's from errors like:
>>>>>
>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>> File does not exist. Holder DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1
>>>>> does not have any open files.
>>>>>
>>>>> Those paths look like these side effect files
>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>.
>>>>>
>>>>> Would Crunch have generated applications that depend on side effect
>>>>> paths as input across MapReduce applications, with something in HDFS
>>>>> cleaning up those paths, unaware of the higher-level dependencies?
>>>>> AWS configures Hadoop differently for each instance type and might
>>>>> have more aggressive cleanup settings on hard disks, though this is
>>>>> a very uninformed hypothesis.
>>>>>
>>>>> A sample full log is attached.
>>>>>
>>>>> Thanks for any guidance!
>>>>>
>>>>> - Everett

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
