Hi everyone,

I've been running the TestDFSIO benchmark on HDFS with the following setup: 8 nodes (1 namenode with a co-located resource manager, 7 datanodes with co-located node managers), an HDFS block size of 32M, replication of 1, and 21 files of 1G each (i.e. 3 mappers per datanode). I run TestDFSIO ten times in a row, each run being a cycle of write, read and cleanup operations, and in some of the runs (though never the first) I get a LeaseExpiredException. I was hoping you could point me to where I might have gone wrong in my configuration; my HDFS config files are pretty vanilla, and I am using Hadoop 2.7.1.
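In case it helps, this is roughly how I drive each cycle. I'm reproducing it from memory, so treat the jar path and exact flags as a sketch rather than a verbatim copy of my script; dfs.blocksize (32M = 33554432) and dfs.replication (1) are set in hdfs-site.xml rather than passed per run:

  # write phase: 21 files of 1GB each
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar \
      TestDFSIO -write -nrFiles 21 -fileSize 1GB
  # read phase over the same files
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar \
      TestDFSIO -read -nrFiles 21 -fileSize 1GB
  # cleanup before the next cycle
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar \
      TestDFSIO -clean

Below is a stack trace with some context: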
...
15/11/10 11:44:15 INFO mapreduce.Job: Running job: job_1447152143064_0003
15/11/10 11:44:21 INFO mapreduce.Job: Job job_1447152143064_0003 running in uber mode : false
15/11/10 11:44:21 INFO mapreduce.Job:  map 0% reduce 0%
15/11/10 11:44:27 INFO mapreduce.Job:  map 5% reduce 0%
15/11/10 11:44:28 INFO mapreduce.Job:  map 38% reduce 0%
15/11/10 11:44:29 INFO mapreduce.Job:  map 48% reduce 0%
15/11/10 11:44:30 INFO mapreduce.Job:  map 57% reduce 0%
15/11/10 11:44:35 INFO mapreduce.Job:  map 73% reduce 0%
15/11/10 11:44:37 INFO mapreduce.Job:  map 86% reduce 0%
15/11/10 11:44:38 INFO mapreduce.Job:  map 86% reduce 19%
15/11/10 11:44:47 INFO mapreduce.Job: Task Id : attempt_1447152143064_0003_m_000008_0, Status : FAILED
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /benchmarks/TestDFSIO/io_data/test_io_18 (inode 16554): File does not exist. Holder DFSClient_attempt_1447152143064_0003_m_000008_0_690388761_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3431)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3236)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3074)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3034)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:723)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

        at org.apache.hadoop.ipc.Client.call(Client.java:1476)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy12.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1430)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1226)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
15/11/10 11:44:48 INFO mapreduce.Job:  map 83% reduce 19%
15/11/10 11:44:50 INFO mapreduce.Job:  map 89% reduce 22%
15/11/10 11:44:51 INFO mapreduce.Job:  map 100% reduce 22%
15/11/10 11:44:52 INFO mapreduce.Job:  map 100% reduce 100%
15/11/10 11:44:53 INFO mapreduce.Job: Job job_1447152143064_0003 completed successfully
15/11/10 11:44:53 INFO mapreduce.Job: Counters: 51
...

I am also seeing an extremely high standard deviation in the read rate (up to almost 100%), and the read running times vary widely (between 20s and 160s). The locality of the map task placement is also only roughly 15 out of 21. Could this be related to the exception above?

Thanks a lot in advance; I'm happy to supply any more information if you need it.

Robert

--
My GPG Key ID: 336E2680
