Thanks Josh. But what do you mean by "jstack'ing"? I'm unfamiliar with that term. A better question would be: how can one troubleshoot such a thing?
btw I am the sole user on this cluster.

On Tue, Oct 7, 2014 at 4:18 PM, Josh Elser <josh.el...@gmail.com> wrote:
> Ok, this record:
>
> tcp 0 0 0.0.0.0:9997 0.0.0.0:* LISTEN
>
> Means that your tserver is listening on the correct port on all interfaces.
> There shouldn't be issues connecting to the tserver. This is also
> confirmed by the fact that you authenticated and got a Connector (this
> does an RPC to the tserver).
>
> So, your tserver is up, and your client can communicate with it. The
> real question is why is the scan hanging. Perhaps jstack'ing the
> tserver when your client is blocked waiting for results.
>
> On Tue, Oct 7, 2014 at 2:07 PM, Geoffry Roberts <threadedb...@gmail.com> wrote:
> > "...it's when
> > you make a Connector, and your client will talk to a tabletserver to
> > authenticate, that your program should hang. It would be good to
> > verify that."
> >
> > My program should hang? Would you expand? That is exactly what it is
> > doing. I am able to get a connector. But when I try to iterate the result
> > of a scan, that's when it hangs.
> >
> > Here's what comes from netstat:
> >
> > $ netstat -na | grep 9997
> > tcp 0 0 0.0.0.0:9997 0.0.0.0:* LISTEN
> > tcp 0 0 204.9.140.36:35679 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53146 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33896 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53282 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53188 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:35609 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33901 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:35588 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33877 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33946 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53167 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33949 204.9.140.38:9997 ESTABLISHED
> > tcp 0 0 204.9.140.36:35546 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33852 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53125 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33922 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33747 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33961 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33793 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:35768 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33917 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33814 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:35567 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33444 204.9.140.38:9997 FIN_WAIT2
> > tcp 0 0 204.9.140.36:35701 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33969 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53258 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33831 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53210 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53104 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33789 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33856 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53237 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33835 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:35651 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33938 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33041 204.9.140.36:9997 ESTABLISHED
> > tcp 0 0 204.9.140.36:53285 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:53305 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33768 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:35630 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33754 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:35745 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:35724 204.9.140.36:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:9997 204.9.140.36:33041 ESTABLISHED
> > tcp 0 0 204.9.140.36:53083 204.9.140.37:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:50623 204.9.140.37:9997 ESTABLISHED
> > tcp 0 0 204.9.140.36:33772 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33732 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33874 204.9.140.38:9997 TIME_WAIT
> > tcp 0 0 204.9.140.36:33810 204.9.140.38:9997 TIME_WAIT
> >
> > On Tue, Oct 7, 2014 at 11:34 AM, Josh Elser <josh.el...@gmail.com> wrote:
> >> Can you provide the output from netstat, lsof or /proc/$pid/fd for the
> >> tserver? Assuming you haven't altered tserv.port.client in
> >> accumulo-site.xml, we want the line for port 9997.
> >>
> >> From my laptop running a tserver on localhost:
> >>
> >> $ netstat -na | grep 9997
> >> tcp4 0 0 127.0.0.1.9997 *.* LISTEN
> >>
> >> Depending on the tool you use, you can grep out the pid of the tserver
> >> or just that port itself.
> >>
> >> Just so you know, ZK binds to all available interfaces when it starts,
> >> so it should work seamlessly with localhost or the FQDN for the host.
> >> As such, it shouldn't matter what you provide to the
> >> ZooKeeperInstance. That should connect in all cases for you, it's when
> >> you make a Connector, and your client will talk to a tabletserver to
> >> authenticate, that your program should hang. It would be good to
> >> verify that.
> >>
> >> On Tue, Oct 7, 2014 at 11:23 AM, Geoffry Roberts <threadedb...@gmail.com> wrote:
> >> > All,
> >> >
> >> > Thanks for the responses.
> >> >
> >> > Is this a problem for Accumulo?
> >> > Reverse DNS is yielding my ISP's host name. You know the drill, my IP in
> >> > reverse followed by their domain name, as opposed to my FQDN, which is
> >> > what I use in my config files.
> >> >
> >> > Running Accumulo 1.5.1
> >> > I have only one interface.
> >> > I have the FQDN in both master and slaves files for both Hadoop and
> >> > Accumulo; in zoo.cfg; and in accumulo-site.xml where the Zookeepers are
> >> > referenced.
> >> > Also, I am passing in all Zk FQDN when I instantiate ZookeeperInstance.
> >> > Forward DNS works
> >> > Reverse DNS... well (See above).
> >> >
> >> > On Mon, Oct 6, 2014 at 10:26 PM, Adam Fuchs <afu...@apache.org> wrote:
> >> >> Accumulo tservers typically listen on a single interface. If you have a
> >> >> server with multiple interfaces (e.g. loopback and eth0), you might
> >> >> have a problem in which the tablet servers are not listening on
> >> >> externally reachable interfaces.
> >> >> Tablet servers will list the interfaces that they are
> >> >> listening to when they boot, and you can also use tools like lsof to
> >> >> find them.
> >> >>
> >> >> If that is indeed the problem, then you might just need to change your
> >> >> conf/slaves file to use <hostname> instead of localhost, and then
> >> >> restart.
> >> >>
> >> >> Adam
> >> >>
> >> >> On Oct 6, 2014 4:27 PM, "Geoffry Roberts" <threadedb...@gmail.com> wrote:
> >> >>> I have been happily working with Acc, but today things changed. No
> >> >>> errors.
> >> >>>
> >> >>> Until now I ran everything server side, which meant the URL was
> >> >>> localhost:2181, and life was good. Today I tried running some of the
> >> >>> same code as a remote client, which means <host name>:2181. Things hang
> >> >>> when BatchWriter tries to commit anything and Scan hangs when it tries
> >> >>> to iterate through a Map.
> >> >>>
> >> >>> Let's focus on the scan part:
> >> >>>
> >> >>> scan.fetchColumnFamily(new Text("colfY")); // This executes then hangs.
> >> >>> for(Entry<Key,Value> entry : scan) {
> >> >>>     def row = entry.getKey().getRow();
> >> >>>     def value = entry.getValue();
> >> >>>     println "value=" + value;
> >> >>> }
> >> >>>
> >> >>> This is what appears in the console:
> >> >>>
> >> >>> 17:22:39.802 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x148c6f03388005e after 21ms
> >> >>>
> >> >>> 17:22:49.803 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x148c6f03388005e after 21ms
> >> >>>
> >> >>> <and on and on>
> >> >>>
> >> >>> The only difference between success and a hang is a URL change, and of
> >> >>> course being remote.
> >> >>>
> >> >>> I don't believe this is a firewall issue. I shut down the firewall.
> >> >>>
> >> >>> Am I missing something?
> >> >>>
> >> >>> Thanks all.
> >> >>>
> >> >>> --
> >> >>> There are ways and there are ways,
> >> >>>
> >> >>> Geoffry Roberts
> >> >
> >> > --
> >> > There are ways and there are ways,
> >> >
> >> > Geoffry Roberts
> >
> > --
> > There are ways and there are ways,
> >
> > Geoffry Roberts

--
There are ways and there are ways,

Geoffry Roberts
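
On the "jstack'ing" question above: jstack is a command-line tool shipped with the JDK that prints the stack trace of every thread in a running JVM (jstack <pid>), which shows where a blocked process is waiting. As a rough, in-process illustration of the same kind of information, here is a minimal sketch using the standard java.lang.management API. The class name ThreadDumpSketch is made up for this example, and nothing in it is Accumulo-specific.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration class; not part of Accumulo or the JDK tools.
public class ThreadDumpSketch {

    /** Builds a jstack-like text dump of every live thread in this JVM. */
    public static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked-monitor and locked-synchronizer details
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            // Header: thread name and state (RUNNABLE, BLOCKED, WAITING, ...)
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            // One line per stack frame, innermost first
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Against a tserver, the equivalent would be to run jstack with the tserver's process id while the client scan is hung, then look for threads that are BLOCKED or WAITING and see which frames they are stuck in.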