Hey Everett,

Shot in the dark: would you mind trying it with 0.11.0-hadoop2 plus the CRUNCH-553 patch? Is that easy to do?

J

On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <[email protected]> wrote:

> Hi,
>
> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
> when setting crunch.max.running.jobs to 1. I generally feel like the
> pipeline application logic itself is sound, at this point. It could be that
> this is just taxing these machines too hard and we need to increase the
> number of retries?
>
> It reliably fails on this hardware when crunch.max.running.jobs is set to
> its default.
>
> Can you explain a little what the /tmp/crunch-XXXXXXX files are as well as
> how Crunch uses side effect files? Do you know if HDFS would clean up those
> directories from underneath Crunch?
>
> There are usually 4 failed applications, failing due to reduces. The
> failures seem to be one of the following three kinds: (1) No lease on
> <side effect file>, (2) File not found for a </tmp/crunch-XXXXXXX> file, (3)
> SocketTimeoutException.
>
> Examples:
>
> [1] No lease exception
>
> Error: org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>     at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>     at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003: File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>     at com.sun.proxy.$Proxy13.complete(Unknown Source)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>     at com.sun.proxy.$Proxy13.complete(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>     at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>     at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>     at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>     at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>     at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>     at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
>     at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>     ... 9 more
>
> [2] File does not exist
>
> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>     at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>     at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>     at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>     at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>     at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>     at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>     at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>     at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>     at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>     at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>     ... 9 more
>
> [3] SocketTimeoutException
>
> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>     at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>     at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>
> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <[email protected]> wrote:
>
>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <[email protected]> wrote:
>>
>>> Hey Everett,
>>>
>>> Initial thought: there are lots of reasons for lease expired
>>> exceptions, and they're usually more symptomatic of other problems in the
>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>> other errors showed up in the app master, although there are reports of
>>> some weirdness around LeaseExpireds when writing to S3, but you're not
>>> doing that here, right?
>>
>> We're reading from and writing to HDFS, here. (We've copied in input from
>> S3 to HDFS in another step.)
>>
>> There are a few exceptions in the logs. Most seem related to missing temp
>> files.
>>
>> Let me see if I can reproduce it with crunch.max.running.jobs set to 1
>> to try to narrow down the originating failure.
>>
>>> J
>>>
>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I recently started trying to run our Crunch pipeline on more data and
>>>> have been trying out different AWS instance types in anticipation of our
>>>> storage and compute needs.
>>>>
>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with
>>>> the CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>
>>>> Our pipeline finishes fine in these cluster configurations:
>>>>
>>>> - 50 c3.4xlarge Core, 0 Task
>>>> - 10 c3.8xlarge Core, 0 Task
>>>> - 25 c3.8xlarge Core, 0 Task
>>>>
>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>> Core instances.
>>>>
>>>> The biggest obvious hardware difference is that the cc2.8xlarges use
>>>> hard disks instead of SSDs.
>>>>
>>>> While it's a little hard to track down the exact originating failure, I
>>>> think it's from errors like:
>>>>
>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1439499407003_0028_r_000153_1 - exited : org.apache.crunch.CrunchRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153: File does not exist. Holder DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have any open files.
>>>>
>>>> Those paths look like these side effect files
>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>.
>>>>
>>>> Could Crunch have generated applications that depend on side effect
>>>> paths as input across MapReduce applications, and could something in HDFS
>>>> be cleaning up those paths, unaware of the higher level dependencies? AWS
>>>> configures Hadoop differently for each instance type, and might have more
>>>> aggressive cleanup settings on HDs, though this is a very uninformed
>>>> hypothesis.
>>>>
>>>> A sample full log is attached.
>>>>
>>>> Thanks for any guidance!
>>>>
>>>> - Everett
>>>>
>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>> may contain information that is confidential, proprietary in nature,
>>>> protected health information (PHI), or otherwise protected by law from
>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>> are not the intended recipient, you are hereby notified that any use,
>>>> disclosure or copying of this email, including any attachments, is
>>>> unauthorized and strictly prohibited. If you have received this email in
>>>> error, please notify the sender of this email. Please delete this and all
>>>> copies of this email from your system. Any opinions either expressed or
>>>> implied in this email and all attachments, are those of its author only,
>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
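
A side note on triaging the three failure kinds listed in the thread (no lease, file not found, socket timeout): the sketch below is one possible way to bucket task-attempt diagnostics strings so it is easier to see which kind dominates a run. It is only an illustration under stated assumptions. `FailureTriage`, `Kind`, `classify`, and `tally` are hypothetical names and are not part of Crunch or Hadoop, and matching on exception class names assumes the diagnostics text contains those names, as the examples quoted above do.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: buckets diagnostics text into the failure kinds named in this thread. */
final class FailureTriage {

    enum Kind { NO_LEASE, FILE_NOT_FOUND, SOCKET_TIMEOUT, OTHER }

    /**
     * Classifies one diagnostics string by the exception class name it mentions.
     * Class names are matched rather than message text, because the lease
     * message in the thread also contains the words "File does not exist".
     */
    static Kind classify(String diagnostics) {
        if (diagnostics.contains("LeaseExpiredException")) {
            return Kind.NO_LEASE;
        }
        if (diagnostics.contains("FileNotFoundException")) {
            return Kind.FILE_NOT_FOUND;
        }
        if (diagnostics.contains("SocketTimeoutException")) {
            return Kind.SOCKET_TIMEOUT;
        }
        return Kind.OTHER;
    }

    /** Counts how many diagnostics strings fall into each kind (zero counts included). */
    static Map<Kind, Integer> tally(List<String> diagnostics) {
        Map<Kind, Integer> counts = new EnumMap<>(Kind.class);
        for (Kind kind : Kind.values()) {
            counts.put(kind, 0);
        }
        for (String d : diagnostics) {
            counts.merge(classify(d), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened versions of the messages quoted in the thread.
        List<String> sample = Arrays.asList(
            "org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on a side effect file: File does not exist.",
            "java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE",
            "java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read.");
        System.out.println(tally(sample));
    }
}
```

Feeding it one diagnostics string per failed attempt (for example, copied from the app master log) gives a per-kind count, which may help when comparing runs with crunch.max.running.jobs set to 1 against runs at the default.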
