It looks like both tablet servers are resolving their own address to 172.16.111.3: both tserver logs below report "Tablet server starting on 172.16.111.3".

-Eric
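One way to check this on each node is to print what the local hostname resolves to and compare the results across nodes. The following is a minimal sketch in Python, not what Accumulo itself runs: it assumes the tablet server derives its advertised address from local hostname resolution, which the logs suggest but this thread does not confirm. The function names and the host names in the usage note (node1, node2) are hypothetical.

```python
import socket


def resolve_local_address(hostname=None, resolver=socket.gethostbyname):
    """Return (hostname, ip) as seen by the given resolver.

    With no arguments this uses the machine's own hostname and the
    system resolver (/etc/hosts, DNS, etc.).
    """
    name = hostname or socket.gethostname()
    return name, resolver(name)


def find_duplicate_addresses(resolved):
    """Given {hostname: ip}, return {ip: [hostnames]} for every IP
    that more than one host claims as its own."""
    by_ip = {}
    for host, ip in resolved.items():
        by_ip.setdefault(ip, []).append(host)
    return {ip: sorted(hosts) for ip, hosts in by_ip.items() if len(hosts) > 1}


if __name__ == "__main__":
    # Run this on each tablet server node and compare the output.
    name, ip = resolve_local_address()
    print(f"{name} resolves to {ip}")
```

If the per-node results are collected into a dict such as {"node1": "172.16.111.3", "node2": "172.16.111.3"}, find_duplicate_addresses returns the shared IP together with the hosts claiming it, which would match the symptom in the logs.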
On Wed, Mar 5, 2014 at 1:33 PM, Alex Lee <[email protected]> wrote:

> Eric,
>
> I had previously done a stop-all.sh, so I started everything back up again
> to check this.
>
> Zoo1:2181, follower, 4 clients
> Zoo2:2181, leader, 3 clients
> Zoo3:2181, follower, 5 clients
>
> The behavior of the tablet servers seems to be inconsistent. Upon
> restarting Accumulo, the overview page gives the impression that everything
> is fine. However, it is only listing 1 tablet server (Node 1), and it is
> now tablet server #2 that is timing out while "Waiting for tablet server
> lock". There are no errors in tablet server #1's logs right now, while
> tablet server #2 failed after too many retries of waiting for the lock.
>
> Also, the master log still has no mention of the second node's IP address,
> even though the start-all.sh script indicates that the tserver and logger
> are being started on both nodes.
>
> Thanks,
>
> Alex
>
> *From:* Eric Newton [mailto:[email protected]]
> *Sent:* Wednesday, March 05, 2014 1:25 PM
> *To:* [email protected]
> *Subject:* Re: Tablet server stuck waiting for lock
>
> On the monitor page, there's a box that shows your zookeepers and their
> status. What does it say?
>
> -Eric
>
> On Wed, Mar 5, 2014 at 1:09 PM, Alex Lee <[email protected]> wrote:
>
> Dfs permissions is currently disabled. I'm using the accumulo user for
> "accumulo init" and for "start-all.sh", and it is also the user that has
> passwordless SSH enabled.
>
> I ran "hadoop fs -ls /accumulo" as the accumulo user on both tablet
> servers, and I am able to see inside of the /accumulo directory on hdfs.
>
> Alex
>
> *From:* Ott, Charlie H. [mailto:[email protected]]
> *Sent:* Wednesday, March 05, 2014 1:02 PM
> *To:* [email protected]
> *Subject:* RE: Tablet server stuck waiting for lock
>
> The connection reset by peer from the Master in combination with the lock
> not acquired by the tablet server makes me wonder if the process owner for
> the tablet server is able to access HDFS correctly.
>
> Are dfs permissions enabled on your HDFS? It makes me think the tablet
> server does not have permissions to read from the /accumulo path that was
> initialized on the master. Did you use the same account for 'accumulo
> init' ?
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* Alex Lee
> *Sent:* Wednesday, March 05, 2014 12:17 PM
> *To:* [email protected]
> *Subject:* Tablet server stuck waiting for lock
>
> Hello,
>
> I'm trying to create a virtualized Accumulo 1.4.4 cluster with 4 tablet
> servers using Hadoop 0.20.2 and ZooKeeper 3.3.5. It didn't seem to be
> working correctly with 4 tablet servers, so I first tried just running with
> one tablet server, which seemed to work fine. When I tried to run it with
> just 2 tablet servers, I ran into some issues.
>
> Just to preface, I double checked configs within zookeeper and accumulo,
> and everything matches. All hostnames are resolving correctly, and
> passwordless SSH for the accumulo user is also functional between all
> nodes. Running "echo stat | nc <zk-server> <zk port>" responds
> appropriately.
>
> Here's the first error log for the Tablet Master:
>
> 2014-03-05 11:18:16,626 [master.Master] ERROR: Error processing table state for store Root Tablet
> org.apache.thrift.transport.TTransportException: java.io.IOException: Connection reset by peer
>     at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
>     at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:158)
>     at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.flush(ThriftTransportPool.java:299)
>     at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.send_loadTablet(TabletClientService.java:653)
>     at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.loadTablet(TabletClientService.java:640)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.lang.reflect.Method.invoke(Unknown Source)
>     at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$2.invoke(TraceWrap.java:84)
>     at com.sun.proxy.$Proxy4.loadTablet(Unknown Source)
>     at org.apache.accumulo.server.master.LiveTServerSet$TServerConnection.assignTablet(LiveTServerSet.java:86)
>     at org.apache.accumulo.server.master.Master$TabletGroupWatcher.flushChanges(Master.java:1818)
>     at org.apache.accumulo.server.master.Master$TabletGroupWatcher.run(Master.java:1426)
> Caused by: java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>     at sun.nio.ch.SocketDispatcher.write(Unknown Source)
>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
>     at sun.nio.ch.IOUtil.write(Unknown Source)
>     at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>     at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>     at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>     at java.io.BufferedOutputStream.flush(Unknown Source)
>     at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
>     ... 13 more
>
> Here are the error logs for Tablet Server #1:
>
> 2014-03-05 11:17:15,152 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
> 2014-03-05 11:17:15,187 [util.FileSystemMonitor] INFO : Filesystem monitor started
> 2014-03-05 11:17:15,194 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
> 2014-03-05 11:17:15,499 [tabletserver.TabletServer] INFO : port = 9997
> 2014-03-05 11:17:15,540 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2014-03-05 11:17:16,633 [tabletserver.TabletServer] WARN : Got loadTablet message from master before lock acquired, ignoring...
> 2014-03-05 11:17:16,634 [server.TNonblockingServer] ERROR: Unexpected exception while invoking!
> java.lang.RuntimeException: Lock not acquired
>     at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.checkPermission(TabletServer.java:1782)
>     at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.loadTablet(TabletServer.java:1814)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.lang.reflect.Method.invoke(Unknown Source)
>     at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
>     at com.sun.proxy.$Proxy1.loadTablet(Unknown Source)
>     at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$loadTablet.process(TabletClientService.java:2510)
>     at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2053)
>     at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
>     at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
>     at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>     at java.lang.Thread.run(Unknown Source)
> 2014-03-05 11:17:20,564 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2014-03-05 11:17:25,589 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
>
> (continues until too many retries, then exits)
>
> Tablet Server #2's logs get as far as this (below), and then just stop.
>
> 2014-03-05 11:17:14,112 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
> 2014-03-05 11:17:14,149 [util.FileSystemMonitor] INFO : Filesystem monitor started
> 2014-03-05 11:17:14,157 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
> 2014-03-05 11:17:14,481 [tabletserver.TabletServer] INFO : port = 9997
>
> Also, the master logs interestingly never make any calls to Tablet #2's IP
> address.
>
> Any thoughts? We have another cluster that is setup identically in just
> about every way (besides hostnames), but it has never experienced any of
> these issues. My research shows that these issues can exist within 1.4.3,
> which we were using at first, but we switched to 1.4.4 because these types
> of issues were supposed to be resolved. Any help would be greatly
> appreciated.
>
> Thanks,
>
> Alex Lee
