The "Connection reset by peer" error on the Master, combined with the "Lock not acquired" exception on the tablet server, makes me wonder whether the process owner for the tablet server is able to access HDFS correctly.
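One quick way to test that, as a rough sketch only: run a listing of the Accumulo directory as the account that owns the tablet server process. The helper name check_hdfs_access is hypothetical (it is not part of Hadoop or Accumulo), /accumulo is assumed to be the directory created by 'accumulo init', and the sketch assumes `hadoop fs -ls` exits nonzero when the listing fails.

```shell
# Sketch: check whether the current user can list a path in HDFS.
# Run this as the account that owns the tablet server process.
# check_hdfs_access is a hypothetical helper, not a Hadoop or Accumulo command.
check_hdfs_access() {
  # Default to /accumulo, assumed to be the path created by 'accumulo init'.
  path="${1:-/accumulo}"
  # Assumption: 'hadoop fs -ls' returns a nonzero exit status on
  # permission errors or a missing path.
  if hadoop fs -ls "$path" >/dev/null 2>&1; then
    echo "OK: $(id -un) can list $path"
  else
    echo "FAIL: $(id -un) cannot list $path"
    return 1
  fi
}
```

Calling `check_hdfs_access /accumulo` on each tablet server host and on the master should make a difference between them visible: a FAIL on a tablet server but OK on the master would point toward permissions or the HDFS client configuration on that host.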
Are dfs permissions enabled on your HDFS? I suspect the tablet server does not have permission to read from the /accumulo path that was initialized on the master. Did you use the same account for 'accumulo init'?

From: [email protected] [mailto:[email protected]] On Behalf Of Alex Lee
Sent: Wednesday, March 05, 2014 12:17 PM
To: [email protected]
Subject: Tablet server stuck waiting for lock

Hello,

I'm trying to create a virtualized Accumulo 1.4.4 cluster with 4 tablet servers using Hadoop 0.20.2 and ZooKeeper 3.3.5. It didn't seem to be working correctly with 4 tablet servers, so I first tried running with just one tablet server, which seemed to work fine. When I tried to run it with just 2 tablet servers, I ran into some issues.

Just to preface, I double-checked the configs within ZooKeeper and Accumulo, and everything matches. All hostnames are resolving correctly, and passwordless SSH for the accumulo user is also functional between all nodes. Running "echo stat | nc <zk-server> <zk port>" responds appropriately.

Here's the first error log for the Tablet Master:

2014-03-05 11:18:16,626 [master.Master] ERROR: Error processing table state for store Root Tablet
org.apache.thrift.transport.TTransportException: java.io.IOException: Connection reset by peer
    at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
    at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:158)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.flush(ThriftTransportPool.java:299)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.send_loadTablet(TabletClientService.java:653)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.loadTablet(TabletClientService.java:640)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$2.invoke(TraceWrap.java:84)
    at com.sun.proxy.$Proxy4.loadTablet(Unknown Source)
    at org.apache.accumulo.server.master.LiveTServerSet$TServerConnection.assignTablet(LiveTServerSet.java:86)
    at org.apache.accumulo.server.master.Master$TabletGroupWatcher.flushChanges(Master.java:1818)
    at org.apache.accumulo.server.master.Master$TabletGroupWatcher.run(Master.java:1426)
Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(Unknown Source)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
    at sun.nio.ch.IOUtil.write(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
    ... 13 more

Here are the error logs for Tablet Server #1:

2014-03-05 11:17:15,152 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
2014-03-05 11:17:15,187 [util.FileSystemMonitor] INFO : Filesystem monitor started
2014-03-05 11:17:15,194 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
2014-03-05 11:17:15,499 [tabletserver.TabletServer] INFO : port = 9997
2014-03-05 11:17:15,540 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
2014-03-05 11:17:16,633 [tabletserver.TabletServer] WARN : Got loadTablet message from master before lock acquired, ignoring...
2014-03-05 11:17:16,634 [server.TNonblockingServer] ERROR: Unexpected exception while invoking!
java.lang.RuntimeException: Lock not acquired
    at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.checkPermission(TabletServer.java:1782)
    at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.loadTablet(TabletServer.java:1814)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
    at com.sun.proxy.$Proxy1.loadTablet(Unknown Source)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$loadTablet.process(TabletClientService.java:2510)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2053)
    at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
    at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
    at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
    at java.lang.Thread.run(Unknown Source)
2014-03-05 11:17:20,564 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
2014-03-05 11:17:25,589 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
(continues until too many retries, then exits)

Tablet Server #2's logs get as far as this (below), and then just stop.

2014-03-05 11:17:14,112 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
2014-03-05 11:17:14,149 [util.FileSystemMonitor] INFO : Filesystem monitor started
2014-03-05 11:17:14,157 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
2014-03-05 11:17:14,481 [tabletserver.TabletServer] INFO : port = 9997

Also, interestingly, the master logs never show any calls to Tablet Server #2's IP address.

Any thoughts? We have another cluster that is set up identically in just about every way (besides hostnames), but it has never experienced any of these issues. My research shows that these issues can exist in 1.4.3, which we were using at first, but we switched to 1.4.4 because these types of issues were supposed to be resolved in that release.

Any help would be greatly appreciated.

Thanks,
Alex Lee
