You can interleave Pipeline.run and Pipeline.cleanup calls to control job
execution and cleanup.
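
As a rough sketch, that pattern looks something like the following. The paths
and stage structure are made up, and it assumes the cleanup(boolean force)
signature on the Pipeline interface, so treat it as illustrative rather than a
drop-in:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;

public class StagedPipeline {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(StagedPipeline.class);

    // Stage 1: persist an intermediate result to a durable (non-temp) path
    // and execute the jobs planned so far.
    PCollection<String> stage1 = pipeline.read(From.textFile("/data/input"));
    pipeline.write(stage1, To.textFile("/data/stage1"));
    pipeline.run();

    // Remove the /tmp/crunch-* artifacts for the work that has completed.
    // With force=false this only happens once the targets written so far are
    // done; force=true would delete them unconditionally.
    pipeline.cleanup(false);

    // Stage 2: later work reads the persisted output rather than the
    // temporary files that were just cleaned up.
    PCollection<String> stage2 = pipeline.read(From.textFile("/data/stage1"));
    pipeline.write(stage2, To.textFile("/data/stage2"));

    // done() runs anything still pending and then does the final cleanup.
    pipeline.done();
  }
}

The key is that anything a later stage needs has been written to a durable
Target before cleanup runs, so nothing downstream still points at the
/tmp/crunch-* directories.
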
On Sat, Sep 26, 2015 at 1:48 PM Everett Anderson <[email protected]> wrote:

> On Thu, Sep 24, 2015 at 5:46 PM, Josh Wills <[email protected]> wrote:
>
>> Hrm. If you never call Pipeline.done, you should never cleanup the
>> temporary files for the job...
>>
>
> Interesting.
>
> We're currently exploring giving the datanodes more memory as there's some
> evidence they were getting overloaded.
>
> Right now, our Crunch pipeline is long, with many stages, but not all data
> is used in each stage. If our problem is that we're overloading some part
> of HDFS (and in other cluster configs we have seen ourselves hit our disk
> capacity cap), I wonder if it'd help if we DID somehow prune away temporary
> outputs that were no longer necessary.
>
>
>
>
>
>
>>
>> On Thu, Sep 24, 2015 at 5:44 PM, Everett Anderson <[email protected]>
>> wrote:
>>
>>> While we tried to take comfort in the fact that we'd only seen this on
>>> HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing
>>> larger amounts of data on SSD-based c3 instances.
>>>
>>> My two hypotheses are
>>>
>>> 1) Somehow these temp files are getting cleaned up before they're
>>> accessed for the last time. Perhaps something in HDFS or Hadoop is
>>> cleaning up these temp directories, or perhaps there's a bug in Crunch's
>>> planner.
>>>
>>> 2) HDFS has chosen 3 machines to replicate data to, but it is performing
>>> a very lopsided replication. While the cluster overall looks like it has
>>> HDFS capacity, perhaps a small subset of the machines is actually at
>>> capacity. Things seem to fail in obscure ways when running out of disk.
>>>
>>>
>>> 2015-09-24 23:28:58,850 WARN [main] org.apache.hadoop.mapred.YarnChild: 
>>> Exception running child : org.apache.crunch.CrunchRuntimeException: Could 
>>> not read runtime node information
>>>     at 
>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>     at 
>>> org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>     at 
>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>>     at 
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>> Caused by: java.io.FileNotFoundException: File does not exist: 
>>> /tmp/crunch-2031291770/p567/REDUCE
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>     at 
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>     at 
>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>     at 
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>>     at 
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>
>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>     at 
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>     at 
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>     at 
>>> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>     at 
>>> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>     at 
>>> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>     at 
>>> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>     at 
>>> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>     at 
>>> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>     at 
>>> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>     at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>     at 
>>> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>     at 
>>> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>     at 
>>> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>     at 
>>> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>     at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>     at 
>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>     ... 9 more
>>> Caused by: 
>>> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
>>> does not exist: /tmp/crunch-2031291770/p567/REDUCE
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>     at 
>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>     at 
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>     at 
>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>     at 
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>>     at 
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>>     at 
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>     at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>     at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>     at 
>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>     at 
>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>     at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>>     at 
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
>>>     at 
>>> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
>>>     ... 22 more
>>>
>>>
>>> On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn <[email protected]> wrote:
>>>
>>>> Also worth noting: we inspected the Hadoop configuration defaults that
>>>> the AWS EMR service populates for the two different instance types. For
>>>> mapred-site.xml, core-site.xml, and hdfs-site.xml, all settings were
>>>> identical except for slight differences in the JVM memory allotted. We
>>>> also checked the max number of file descriptors for each instance type
>>>> via ulimit and saw no differences there either.
>>>>
>>>> So we're not sure what the main difference is between these two clusters
>>>> that would cause such different outcomes, other than the cc2.8xlarges
>>>> having spinning disks and the c3.8xlarges having SSDs.
>>>>
>>>> On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <[email protected]>
>>>> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Jeff graciously agreed to try it out.
>>>>>
>>>>> I'm afraid we're still getting failures on that instance type, and with
>>>>> 0.11 plus the patches, the cluster also ended up in a state where no new
>>>>> applications could be submitted afterwards.
>>>>>
>>>>> The errors when running the pipeline seem to be similarly HDFS
>>>>> related. It's quite odd.
>>>>>
>>>>> Examples when using 0.11 + the patches:
>>>>>
>>>>>
>>>>> 2015-08-20 23:17:50,455 WARN [Thread-38]
>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>> file
>>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
>>>>> - Aborting...
>>>>>
>>>>>
>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
>>>>> (inode 83784): File does not exist. [Lease.  Holder:
>>>>> DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
>>>>> pendingcreates: 24]
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at
>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>>>>>
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>>>>> at
>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
>>>>> at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
>>>>> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>> at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>> at
>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>>>> at
>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>>>> at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>> file
>>>>> "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
>>>>> - Aborting...
>>>>>
>>>>>
>>>>>
>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>>> 10.55.1.103:50010
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning
>>>>> BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
>>>>> 2015-08-20 23:34:59,278 INFO [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>> java.io.IOException: Unable to create new block.
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>> file
>>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
>>>>> - Aborting...
>>>>> 2015-08-20 23:34:59,279 WARN [main]
>>>>> org.apache.hadoop.mapred.YarnChild: Exception running child :
>>>>> org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect
>>>>> ack with firstBadLink as 10.55.1.103:50010
>>>>> at
>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>> at
>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>> at
>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
>>>>> Caused by: java.io.IOException: Bad connect ack with firstBadLink as
>>>>> 10.55.1.103:50010
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Curious how this went. :)
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>>>>>
>>>>>>> as we also rely on 517.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> (In particular, I'm wondering if something in CRUNCH-481 is related
>>>>>>>> to this problem.)
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Everett,
>>>>>>>>>
>>>>>>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the
>>>>>>>>> 553 patch? Is that easy to do?
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>>>>>> hardware when setting crunch.max.running.jobs to 1, so at this point
>>>>>>>>>> I generally feel like the pipeline application logic itself is sound.
>>>>>>>>>> It could be that this is just taxing these machines too hard and we
>>>>>>>>>> need to increase the number of retries?
>>>>>>>>>>
>>>>>>>>>> It reliably fails on this hardware when crunch.max.running.jobs is
>>>>>>>>>> set to its default.
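>>>>>>>>>>
>>>>>>>>>> For reference, serializing the jobs is just a matter of setting that
>>>>>>>>>> property on the pipeline's Configuration before the first run() call,
>>>>>>>>>> roughly like this sketch (the OurPipeline class name is made up):
>>>>>>>>>>
>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>> import org.apache.crunch.Pipeline;
>>>>>>>>>> import org.apache.crunch.impl.mr.MRPipeline;
>>>>>>>>>>
>>>>>>>>>> public class OurPipeline {
>>>>>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>>>>>     Configuration conf = new Configuration();
>>>>>>>>>>     // Run one MapReduce job at a time instead of letting several
>>>>>>>>>>     // independent jobs from the plan run concurrently (the default).
>>>>>>>>>>     conf.setInt("crunch.max.running.jobs", 1);
>>>>>>>>>>     Pipeline pipeline = new MRPipeline(OurPipeline.class, conf);
>>>>>>>>>>     // ... same reads/writes as before ...
>>>>>>>>>>     pipeline.done();
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> (The same property can also be passed as -Dcrunch.max.running.jobs=1
>>>>>>>>>> on the command line when the driver goes through ToolRunner.)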
>>>>>>>>>>
>>>>>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are
>>>>>>>>>> as well as how Crunch uses side effect files? Do you know if HDFS 
>>>>>>>>>> would
>>>>>>>>>> clean up those directories from underneath Crunch?
>>>>>>>>>>
>>>>>>>>>> There are usually 4 failed applications, all failing in their
>>>>>>>>>> reduces. The failures seem to be one of the following three kinds:
>>>>>>>>>> (1) No lease on a <side effect file>, (2) File not found for a
>>>>>>>>>> </tmp/crunch-XXXXXXX> file, (3) SocketTimeoutException.
>>>>>>>>>>
>>>>>>>>>> Examples:
>>>>>>>>>>
>>>>>>>>>> [1] No lease exception
>>>>>>>>>>
>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>> No lease on
>>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>>> File does not exist. Holder
>>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not 
>>>>>>>>>> have
>>>>>>>>>> any open files. at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>> at
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>>  at
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) 
>>>>>>>>>> Caused by:
>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>> No lease on
>>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>>> File does not exist. Holder
>>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not 
>>>>>>>>>> have
>>>>>>>>>> any open files. at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>>> at
>>>>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>>>>>> at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>>>>>> at 
>>>>>>>>>> org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>>>>>> at
>>>>>>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) 
>>>>>>>>>> at
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>>>>>> ... 9 more
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [2] File does not exist
>>>>>>>>>>
>>>>>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] 
>>>>>>>>>> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
>>>>>>>>>> Diagnostics report from attempt_1439917295505_0034_r_000004_1: 
>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: Could not read 
>>>>>>>>>> runtime node information
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>>>>>>      at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>>      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>>      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>>      at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>      at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: 
>>>>>>>>>> /tmp/crunch-4694113/p470/REDUCE
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>>>>      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>>>>      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>>>>      at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>      at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>>>
>>>>>>>>>>      at 
>>>>>>>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>>>      at 
>>>>>>>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>>>>      at 
>>>>>>>>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>>>      at 
>>>>>>>>>> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>>>>>>      at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>>>>>>      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>>>>>>      at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>>>>>>      at 
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>>>>>>      ... 9 more
>>>>>>>>>>
>>>>>>>>>> [3] SocketTimeoutException
>>>>>>>>>>
>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: 
>>>>>>>>>> java.net.SocketTimeoutException: 70000 millis timeout while waiting 
>>>>>>>>>> for channel to be ready for read. ch : 
>>>>>>>>>> java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 
>>>>>>>>>> remote=/10.55.1.230:9200] at 
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>>  at 
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at 
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>>  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at 
>>>>>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at 
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at 
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at 
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) 
>>>>>>>>>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout 
>>>>>>>>>> while waiting for channel to be ready for read. ch : 
>>>>>>>>>> java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 
>>>>>>>>>> remote=/10.55.1.230:9200] at 
>>>>>>>>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>>>>>>>  at 
>>>>>>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>>>>>>>>>  at 
>>>>>>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>>>>>>>>>  at 
>>>>>>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>>>>>>>>>  at java.io.FilterInputStream.read(FilterInputStream.java:83) at 
>>>>>>>>>> java.io.FilterInputStream.read(FilterInputStream.java:83) at 
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>>>>>>>>>  at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>>>>>>>>>  at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>>>>>>>>>  at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>>>>>>>>>  at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>>>>>>>>>  at 
>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <[email protected]
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Everett,
>>>>>>>>>>>>
>>>>>>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>>>>>>> exceptions, and they're usually more symptomatic of other problems
>>>>>>>>>>>> in the pipeline. Are you sure none of the jobs in the Crunch
>>>>>>>>>>>> pipeline on the non-SSD instances are failing for some other
>>>>>>>>>>>> reason? I'd be surprised if no other errors showed up in the app
>>>>>>>>>>>> master, although there are reports of some weirdness around
>>>>>>>>>>>> LeaseExpireds when writing to S3-- but you're not doing that here,
>>>>>>>>>>>> right?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We're reading from and writing to HDFS, here. (We've copied in
>>>>>>>>>>> input from S3 to HDFS in another step.)
>>>>>>>>>>>
>>>>>>>>>>> There are a few exceptions in the logs. Most seem related to
>>>>>>>>>>> missing temp files.
>>>>>>>>>>>
>>>>>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs
>>>>>>>>>>> set to 1 to try to narrow down the originating failure.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> J
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I recently started trying to run our Crunch pipeline on more
>>>>>>>>>>>>> data and have been trying out different AWS instance types in 
>>>>>>>>>>>>> anticipation
>>>>>>>>>>>>> of our storage and compute needs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12
>>>>>>>>>>>>> (patched with the CRUNCH-553
>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, it always fails on the same data when using 10
>>>>>>>>>>>>> cc2.8xlarge Core instances.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The biggest obvious hardware difference is that the
>>>>>>>>>>>>> cc2.8xlarges use hard disks instead of SSDs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>>>>>>> failure, I think it's from errors like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>>>>> No lease on
>>>>>>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>>>>>>> File does not exist. Holder
>>>>>>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does 
>>>>>>>>>>>>> not have
>>>>>>>>>>>>> any open files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Those paths look like these side effect files
>>>>>>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>>>>>>> .
>>>>>>>>>>>>>
>>>>>>>>>>>>> Would Crunch have generated applications that depend on side
>>>>>>>>>>>>> effect paths as input across MapReduce applications, with
>>>>>>>>>>>>> something in HDFS cleaning up those paths, unaware of the
>>>>>>>>>>>>> higher-level dependencies? AWS configures Hadoop differently for
>>>>>>>>>>>>> each instance type, and might have more aggressive cleanup
>>>>>>>>>>>>> settings on HDs, though this is a very uninformed hypothesis.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A sample full log is attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for any guidance!
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Everett
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Director of Data Science
>>>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Director of Data Science
>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Director of Data Science
>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
