Hello,
I'm trying to create a virtualized Accumulo 1.4.4 cluster with 4 tablet servers
using Hadoop 0.20.2 and ZooKeeper 3.3.5. It didn't seem to be working correctly
with 4 tablet servers, so I first tried just running with one tablet server,
which seemed to work fine. When I tried to run it with just 2 tablet servers, I
ran into some issues.
Just to preface, I double-checked the configs for ZooKeeper and Accumulo, and
everything matches. All hostnames are resolving correctly, and passwordless SSH
for the accumulo user is also functional between all nodes. Running "echo stat
| nc <zk-server> <zk port>" responds appropriately.
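For anyone who wants to repeat the ZooKeeper check from a script, here is a
rough Python equivalent of the nc command. This is only a sketch: the function
names zk_four_letter and zk_mode are my own, and it assumes the server answers
the "stat" four-letter command and then closes the connection, as it does with
nc.

```python
import socket


def zk_four_letter(host, port, command=b"stat", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def zk_mode(reply):
    """Return the value of the 'Mode:' line in a stat reply, or None if absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

Calling zk_mode(zk_four_letter(host, port)) with the same <zk-server> and
<zk port> values as above should return something like "standalone", "leader",
or "follower" if the server is healthy.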
Here's the first error from the Master log:
2014-03-05 11:18:16,626 [master.Master] ERROR: Error processing table state for store Root Tablet
org.apache.thrift.transport.TTransportException: java.io.IOException: Connection reset by peer
    at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
    at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:158)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.flush(ThriftTransportPool.java:299)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.send_loadTablet(TabletClientService.java:653)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.loadTablet(TabletClientService.java:640)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$2.invoke(TraceWrap.java:84)
    at com.sun.proxy.$Proxy4.loadTablet(Unknown Source)
    at org.apache.accumulo.server.master.LiveTServerSet$TServerConnection.assignTablet(LiveTServerSet.java:86)
    at org.apache.accumulo.server.master.Master$TabletGroupWatcher.flushChanges(Master.java:1818)
    at org.apache.accumulo.server.master.Master$TabletGroupWatcher.run(Master.java:1426)
Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(Unknown Source)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
    at sun.nio.ch.IOUtil.write(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
    ... 13 more
Here are the error logs for Tablet Server #1:
2014-03-05 11:17:15,152 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
2014-03-05 11:17:15,187 [util.FileSystemMonitor] INFO : Filesystem monitor started
2014-03-05 11:17:15,194 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
2014-03-05 11:17:15,499 [tabletserver.TabletServer] INFO : port = 9997
2014-03-05 11:17:15,540 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
2014-03-05 11:17:16,633 [tabletserver.TabletServer] WARN : Got loadTablet message from master before lock acquired, ignoring...
2014-03-05 11:17:16,634 [server.TNonblockingServer] ERROR: Unexpected exception while invoking!
java.lang.RuntimeException: Lock not acquired
    at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.checkPermission(TabletServer.java:1782)
    at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.loadTablet(TabletServer.java:1814)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
    at com.sun.proxy.$Proxy1.loadTablet(Unknown Source)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$loadTablet.process(TabletClientService.java:2510)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2053)
    at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
    at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
    at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
    at java.lang.Thread.run(Unknown Source)
2014-03-05 11:17:20,564 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
2014-03-05 11:17:25,589 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
(continues until too many retries, then exits)
Tablet Server #2's log gets only as far as the lines below, and then just stops:
2014-03-05 11:17:14,112 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
2014-03-05 11:17:14,149 [util.FileSystemMonitor] INFO : Filesystem monitor started
2014-03-05 11:17:14,157 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
2014-03-05 11:17:14,481 [tabletserver.TabletServer] INFO : port = 9997
Also, interestingly, the Master logs never show any calls to Tablet Server #2's
IP address.
Any thoughts? We have another cluster that is set up identically in just about
every way (besides hostnames), but it has never experienced any of these
issues. From what I've read, these issues can occur in 1.4.3, which we were
using at first, but we switched to 1.4.4 because they were supposed to be
resolved there. Any help would be greatly appreciated.
Thanks,
Alex Lee