(In particular, I'm wondering if something in CRUNCH-481 is related to this problem.)
On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <[email protected]> wrote:

> Hey Everett,
>
> Shot in the dark -- would you mind trying it with 0.11.0-hadoop2 with
> the CRUNCH-553 patch? Is that easy to do?
>
> J
>
> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <[email protected]> wrote:
>
>> Hi,
>>
>> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
>> when setting crunch.max.running.jobs to 1. At this point, I generally
>> feel that the pipeline application logic itself is sound. It could be
>> that this is just taxing these machines too hard and we need to
>> increase the number of retries?
>>
>> It reliably fails on this hardware when crunch.max.running.jobs is set
>> to its default.
>>
>> Can you explain a little what the /tmp/crunch-XXXXXXX files are, as
>> well as how Crunch uses side effect files? Do you know if HDFS would
>> clean up those directories from underneath Crunch?
>>
>> There are usually 4 failed applications, all failing on reduces. The
>> failures seem to be one of the following three kinds: (1) no lease on
>> a side effect file, (2) a /tmp/crunch-XXXXXXX file not found, and
>> (3) a SocketTimeoutException.
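For anyone reproducing the serial run described above, a minimal sketch of a driver that pins Crunch to one concurrent job; the class name is hypothetical, and the mapreduce.*.maxattempts properties are the stock Hadoop 2.x retry knobs -- one guess at what "increase the number of retries" would mean here:

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class SerialRunSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run the pipeline's MapReduce jobs one at a time instead of
        // concurrently (the setting used for the passing run above).
        conf.setInt("crunch.max.running.jobs", 1);
        // Stock MapReduce task retry limits (defaults are 4 attempts).
        conf.setInt("mapreduce.map.maxattempts", 8);
        conf.setInt("mapreduce.reduce.maxattempts", 8);

        Pipeline pipeline = new MRPipeline(SerialRunSketch.class, conf);
        // ... construct the real pipeline here, then:
        pipeline.done();
      }
    }

The same settings can also be passed as -Dcrunch.max.running.jobs=1 etc. on the command line if the driver goes through ToolRunner.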
>> Examples:
>>
>> [1] No lease exception
>>
>> Error: org.apache.crunch.CrunchRuntimeException:
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>> File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1
>> does not have any open files.
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>   at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>> File does not exist. Holder DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1
>> does not have any open files.
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>   at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>   at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>   at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>   at com.sun.proxy.$Proxy13.complete(Unknown Source)
>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>   at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>   at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>   at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>   at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>   at org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>   at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180)
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>   ... 9 more
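The file in [1] that fails to close is a classic MapReduce "side effect" file: it lives under the task attempt's work directory (the .../_temporary/1/_temporary/attempt_*/ path above) and is promoted to the real output directory by the OutputCommitter only if the attempt succeeds. A toy illustration of the mechanism, not Crunch's internals -- the reducer, types, and file name here are all hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SideEffectFileSketch
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      private FSDataOutputStream side;

      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        // Resolves to .../_temporary/1/_temporary/attempt_XXXX_r_YYYYYY_Z
        Path workDir = FileOutputFormat.getWorkOutputPath(context);
        FileSystem fs = workDir.getFileSystem(context.getConfiguration());
        // Hypothetical side effect file, created under the attempt's
        // work directory just like the out7-r-00003 file in [1].
        side = fs.create(new Path(workDir, "side-r-00003"));
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        // Closing during cleanup is the step that throws in [1]: if the
        // attempt directory has vanished or been renamed, the NameNode
        // reports "No lease ... File does not exist" on close().
        side.close();
      }
    }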
>> [2] File does not exist
>>
>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler]
>> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
>> report from attempt_1439917295505_0034_r_000004_1: Error:
>> org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>   at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>
>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>   at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>   at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>   at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>   at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>   at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>   ... 9 more
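The missing file in [2], /tmp/crunch-4694113/p470/REDUCE, is the runtime node information for the reduce phase that each Crunch task attempt reads back at setup time (the DistCache.read call in the trace). It sits under Crunch's scratch directory, which defaults to /tmp on the cluster filesystem. One way to rule out something scrubbing /tmp is to relocate that directory; a sketch, assuming crunch.tmp.dir is still the relevant property in this Crunch version, with a made-up path:

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class ScratchDirSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Move the /tmp/crunch-XXXXXXX working tree somewhere no
        // tmp-cleanup policy should touch; the path is a made-up example.
        conf.set("crunch.tmp.dir", "/user/hadoop/crunch-scratch");

        Pipeline pipeline = new MRPipeline(ScratchDirSketch.class, conf);
        // ... construct and run the pipeline as usual, then:
        pipeline.done();
      }
    }

If the failures stop after the move, that would point at /tmp cleanup rather than at the pipeline itself.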
>> [3] SocketTimeoutException
>>
>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException:
>> 70000 millis timeout while waiting for channel to be ready for read.
>> ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>   at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>   at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while
>> waiting for channel to be ready for read.
>> ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>   at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
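The timeout in [3] fires inside DFSOutputStream's pipeline-recovery path, meaning the HDFS client gave up waiting on a DataNode while rebuilding a write pipeline. If the disk-backed cc2.8xlarges are simply slower under this load, lengthening the HDFS client socket timeouts is a common mitigation to try; the property names below are standard Hadoop 2.x keys, but the values are illustrative guesses:

    import org.apache.hadoop.conf.Configuration;

    public class HdfsTimeoutSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Client-side read timeout toward DataNodes; the observed
        // 70000 ms appears consistent with the 60 s default plus
        // per-datanode extensions during pipeline recovery.
        conf.setInt("dfs.client.socket-timeout", 180000);
        // Client-side write timeout toward DataNodes.
        conf.setInt("dfs.datanode.socket.write.timeout", 180000);
        // Pass this conf to the pipeline driver as in the sketches above.
      }
    }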
>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <[email protected]> wrote:
>>
>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <[email protected]> wrote:
>>>
>>>> Hey Everett,
>>>>
>>>> Initial thought -- there are lots of reasons for lease expired
>>>> exceptions, and they're usually more symptomatic of other problems in
>>>> the pipeline. Are you sure none of the jobs in the Crunch pipeline on
>>>> the non-SSD instances are failing for some other reason? I'd be
>>>> surprised if no other errors showed up in the app master, although
>>>> there are reports of some weirdness around LeaseExpireds when writing
>>>> to S3 -- but you're not doing that here, right?
>>>
>>> We're reading from and writing to HDFS here. (We copied the input from
>>> S3 to HDFS in another step.)
>>>
>>> There are a few exceptions in the logs. Most seem related to missing
>>> temp files.
>>>
>>> Let me see if I can reproduce it with crunch.max.running.jobs set to 1
>>> to try to narrow down the originating failure.
>>>
>>>> J
>>>>
>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I recently started trying to run our Crunch pipeline on more data
>>>>> and have been trying out different AWS instance types in
>>>>> anticipation of our storage and compute needs.
>>>>>
>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with
>>>>> the CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>
>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>
>>>>> - 50 c3.4xlarge Core, 0 Task
>>>>> - 10 c3.8xlarge Core, 0 Task
>>>>> - 25 c3.8xlarge Core, 0 Task
>>>>>
>>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>>> Core instances.
>>>>>
>>>>> The biggest obvious hardware difference is that the cc2.8xlarges use
>>>>> hard disks instead of SSDs.
>>>>>
>>>>> While it's a little hard to track down the exact originating
>>>>> failure, I think it's from errors like:
>>>>>
>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>> File does not exist. Holder DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1
>>>>> does not have any open files.
>>>>>
>>>>> Those paths look like these side effect files
>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>.
>>>>>
>>>>> Would Crunch have generated applications that depend on side effect
>>>>> paths as input across MapReduce applications, with something in HDFS
>>>>> cleaning up those paths, unaware of the higher-level dependencies?
>>>>> AWS configures Hadoop differently for each instance type and might
>>>>> have more aggressive cleanup settings on hard disks, though this is
>>>>> a very uninformed hypothesis.
>>>>>
>>>>> A sample full log is attached.
>>>>>
>>>>> Thanks for any guidance!
>>>>>
>>>>> - Everett

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
