Did you find the message in the tserver*.out, tserver*.err or the monitor page?
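For anyone else chasing the same symptom, here is a rough sketch of searching those files. It assumes the message in question is a `java.lang.OutOfMemoryError` line, and the log directory is something you pass in yourself; the class and method names are made up for illustration and are not part of Accumulo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: looks for OutOfMemoryError lines in tserver*.out / tserver*.err. */
final class TserverLogSearch {

    /** Returns "file:line: text" for every line mentioning OutOfMemoryError. */
    static List<String> findOutOfMemory(Path logDir) throws IOException {
        List<String> hits = new ArrayList<>();
        // The glob mirrors the file names mentioned in this thread.
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(logDir, "tserver*.{out,err}")) {
            for (Path file : files) {
                int lineNo = 0;
                // ISO-8859-1 decodes any byte sequence, so odd bytes cannot abort the read.
                for (String line : Files.readAllLines(file, StandardCharsets.ISO_8859_1)) {
                    lineNo++;
                    if (line.contains("OutOfMemoryError")) {
                        hits.add(file.getFileName() + ":" + lineNo + ": " + line);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // The directory argument is an assumption: pass wherever your tserver logs live.
        Path logDir = Paths.get(args.length > 0 ? args[0] : ".");
        for (String hit : findOutOfMemory(logDir)) {
            System.out.println(hit);
        }
    }
}
```

Run it on each tablet server host with that host's log directory as the only argument. An empty result only means nothing matched in those two file patterns, so the monitor page is still worth a look.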
(Thanks for the follow-up message.)

On Wed, Oct 8, 2014 at 6:39 PM, Geoffry Roberts <threadedb...@gmail.com> wrote:

> Just for the record, I finally got to the bottom of things. One of my
> Tservers was running out of memory. I hadn't noticed. I had my SA
> allocate a little more--each node now has 6G up from 2G--and things are
> working better.
> On Oct 8, 2014 10:09 AM, "Josh Elser" <josh.el...@gmail.com> wrote:
>
>> Jstack is a tool which can be used to tell a java process to dump the
>> current stack traces for all of its threads. It's usually included with the
>> JDK. `kill -3 $pid` also does the same. If the output isn't redirected
>> automatically to your shell, check the stdout of the process you gave as
>> an argument.
>>
>> When your client is sitting waiting on data from the tabletserver, you
>> can get the stack traces from the tserver and you should be able to find a
>> thread with scan in the name, along with your client's IP, and we can help
>> debug exactly what the server is doing that is preventing it from returning
>> data to your client.
>> On Oct 8, 2014 9:43 AM, "Geoffry Roberts" <threadedb...@gmail.com> wrote:
>>
>>> Thanks Josh. But what do you mean by "jstack'ing"? I'm unfamiliar
>>> with that term. A better question would be how can one troubleshoot such a
>>> thing?
>>>
>>> btw
>>> I am the sole user on this cluster.
>>>
>>> On Tue, Oct 7, 2014 at 4:18 PM, Josh Elser <josh.el...@gmail.com> wrote:
>>>
>>>> Ok, this record:
>>>>
>>>> tcp 0 0 0.0.0.0:9997 0.0.0.0:* LISTEN
>>>>
>>>> Means that your tserver is listening on the correct port on all interfaces.
>>>> There shouldn't be issues connecting to the tserver. This is also
>>>> confirmed by the fact that you authenticated and got a Connector (this
>>>> does an RPC to the tserver).
>>>>
>>>> So, your tserver is up, and your client can communicate with it. The
>>>> real question is why is the scan hanging. Perhaps try jstack'ing the
>>>> tserver while your client is blocked waiting for results.
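A thread dump from a busy tserver can be long, so here is a minimal sketch of pulling out only the threads with "scan" in the name, as suggested above. It assumes the usual jstack layout, where each thread block starts with a line beginning with the quoted thread name and blocks are separated by blank lines. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical helper: keeps only the thread blocks whose name contains a keyword. */
final class ThreadDumpFilter {

    /**
     * Splits a thread dump into per-thread blocks and returns the blocks whose
     * quoted thread name contains the keyword (case-insensitive).
     */
    static List<String> threadsNamed(String dump, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        StringBuilder block = null; // block being collected, or null when skipping
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"")) {
                // Thread header: close the previous block, then decide on this one.
                flush(block, matches);
                int end = line.indexOf('"', 1);
                String name = end > 0 ? line.substring(1, end) : line.substring(1);
                block = name.toLowerCase(Locale.ROOT).contains(needle)
                        ? new StringBuilder(line) : null;
            } else if (block != null) {
                if (line.trim().isEmpty()) {
                    // A blank line ends the current thread block.
                    flush(block, matches);
                    block = null;
                } else {
                    block.append('\n').append(line);
                }
            }
        }
        flush(block, matches);
        return matches;
    }

    private static void flush(StringBuilder block, List<String> out) {
        if (block != null) {
            out.add(block.toString());
        }
    }

    public static void main(String[] args) {
        // Reads a dump from stdin; the keyword defaults to "scan".
        String keyword = args.length > 0 ? args[0] : "scan";
        String dump = new BufferedReader(new InputStreamReader(System.in))
                .lines().collect(Collectors.joining("\n"));
        for (String block : threadsNamed(dump, keyword)) {
            System.out.println(block);
            System.out.println();
        }
    }
}
```

Feed it the saved output of jstack for the tserver pid on stdin while the client is blocked. Passing your client's IP as the keyword instead of "scan" narrows it further, provided the IP really does appear in the thread name.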
>>>>
>>>> On Tue, Oct 7, 2014 at 2:07 PM, Geoffry Roberts <threadedb...@gmail.com> wrote:
>>>> > "...it's when
>>>> > you make a Connector, and your client will talk to a tabletserver to
>>>> > authenticate, that your program should hang. It would be good to
>>>> > verify that."
>>>> >
>>>> > My program should hang? Would you expand? That is exactly what it is
>>>> > doing. I am able to get a connector. But when I try to iterate the result
>>>> > of a scan, that's when it hangs.
>>>> >
>>>> > Here's what comes from netstat:
>>>> >
>>>> > $ netstat -na | grep 9997
>>>> > tcp  0  0  0.0.0.0:9997        0.0.0.0:*           LISTEN
>>>> > tcp  0  0  204.9.140.36:35679  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53146  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33896  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53282  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53188  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:35609  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33901  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:35588  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33877  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33946  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53167  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33949  204.9.140.38:9997   ESTABLISHED
>>>> > tcp  0  0  204.9.140.36:35546  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33852  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53125  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33922  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33747  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33961  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33793  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:35768  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33917  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33814  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:35567  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33444  204.9.140.38:9997   FIN_WAIT2
>>>> > tcp  0  0  204.9.140.36:35701  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33969  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53258  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33831  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53210  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53104  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33789  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33856  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53237  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33835  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:35651  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33938  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33041  204.9.140.36:9997   ESTABLISHED
>>>> > tcp  0  0  204.9.140.36:53285  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:53305  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33768  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:35630  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33754  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:35745  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:35724  204.9.140.36:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:9997   204.9.140.36:33041  ESTABLISHED
>>>> > tcp  0  0  204.9.140.36:53083  204.9.140.37:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:50623  204.9.140.37:9997   ESTABLISHED
>>>> > tcp  0  0  204.9.140.36:33772  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33732  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33874  204.9.140.38:9997   TIME_WAIT
>>>> > tcp  0  0  204.9.140.36:33810  204.9.140.38:9997   TIME_WAIT
>>>> >
>>>> > On Tue, Oct 7, 2014 at 11:34 AM, Josh Elser <josh.el...@gmail.com> wrote:
>>>> >>
>>>> >> Can you provide the output from netstat, lsof or /proc/$pid/fd for the
>>>> >> tserver? Assuming you haven't altered tserver.port.client in
>>>> >> accumulo-site.xml, we want the line for port 9997.
>>>> >>
>>>> >> From my laptop running a tserver on localhost:
>>>> >>
>>>> >> $ netstat -na | grep 9997
>>>> >> tcp4 0 0 127.0.0.1.9997 *.* LISTEN
>>>> >>
>>>> >> Depending on the tool you use, you can grep out the pid of the tserver
>>>> >> or just that port itself.
>>>> >>
>>>> >> Just so you know, ZK binds to all available interfaces when it starts,
>>>> >> so it should work seamlessly with localhost or the FQDN for the host.
>>>> >> As such, it shouldn't matter what you provide to the
>>>> >> ZooKeeperInstance. That should connect in all cases for you, it's when
>>>> >> you make a Connector, and your client will talk to a tabletserver to
>>>> >> authenticate, that your program should hang. It would be good to
>>>> >> verify that.
>>>> >>
>>>> >> On Tue, Oct 7, 2014 at 11:23 AM, Geoffry Roberts <threadedb...@gmail.com>
>>>> >> wrote:
>>>> >> > All,
>>>> >> >
>>>> >> > Thanks for the responses.
>>>> >> >
>>>> >> > Is this a problem for Accumulo?
>>>> >> > Reverse DNS is yielding my ISP's host name.
>>>> >> > You know the drill, my IP in
>>>> >> > reverse followed by their domain name, as opposed to my FQDN, which is
>>>> >> > what I use in my config files.
>>>> >> >
>>>> >> > Running Accumulo 1.5.1
>>>> >> > I have only one interface.
>>>> >> > I have the FQDN in both master and slaves files for both Hadoop and
>>>> >> > Accumulo; in zoo.cfg; and in accumulo-site.xml where the Zookeepers are
>>>> >> > referenced.
>>>> >> > Also, I am passing in all Zk FQDN when I instantiate ZooKeeperInstance.
>>>> >> > Forward DNS works
>>>> >> > Reverse DNS... well (See above).
>>>> >> >
>>>> >> > On Mon, Oct 6, 2014 at 10:26 PM, Adam Fuchs <afu...@apache.org> wrote:
>>>> >> >>
>>>> >> >> Accumulo tservers typically listen on a single interface. If you have a
>>>> >> >> server with multiple interfaces (e.g. loopback and eth0), you might have a
>>>> >> >> problem in which the tablet servers are not listening on externally
>>>> >> >> reachable interfaces. Tablet servers will list the interfaces that they are
>>>> >> >> listening to when they boot, and you can also use tools like lsof to find
>>>> >> >> them.
>>>> >> >>
>>>> >> >> If that is indeed the problem, then you might just need to change your
>>>> >> >> conf/slaves file to use <hostname> instead of localhost, and then restart.
>>>> >> >>
>>>> >> >> Adam
>>>> >> >>
>>>> >> >> On Oct 6, 2014 4:27 PM, "Geoffry Roberts" <threadedb...@gmail.com> wrote:
>>>> >> >>>
>>>> >> >>> I have been happily working with Acc, but today things changed. No errors
>>>> >> >>>
>>>> >> >>> Until now I ran everything server side, which meant the URL was
>>>> >> >>> localhost:2181, and life was good. Today I tried running some of the same
>>>> >> >>> code as a remote client, which means <host name>:2181.
>>>> >> >>> Things hang when
>>>> >> >>> BatchWriter tries to commit anything and Scan hangs when it tries to
>>>> >> >>> iterate through a Map.
>>>> >> >>>
>>>> >> >>> Let's focus on the scan part:
>>>> >> >>>
>>>> >> >>> scan.fetchColumnFamily(new Text("colfY")); // This executes then hangs.
>>>> >> >>> for(Entry<Key,Value> entry : scan) {
>>>> >> >>>   def row = entry.getKey().getRow();
>>>> >> >>>   def value = entry.getValue();
>>>> >> >>>   println "value=" + value;
>>>> >> >>> }
>>>> >> >>>
>>>> >> >>> This is what appears in the console:
>>>> >> >>>
>>>> >> >>> 17:22:39.802 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping
>>>> >> >>> response for sessionid: 0x148c6f03388005e after 21ms
>>>> >> >>>
>>>> >> >>> 17:22:49.803 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping
>>>> >> >>> response for sessionid: 0x148c6f03388005e after 21ms
>>>> >> >>>
>>>> >> >>> <and on and on>
>>>> >> >>>
>>>> >> >>> The only difference between success and a hang is a URL change, and of
>>>> >> >>> course being remote.
>>>> >> >>>
>>>> >> >>> I don't believe this is a firewall issue. I shut down the firewall.
>>>> >> >>>
>>>> >> >>> Am I missing something?
>>>> >> >>>
>>>> >> >>> Thanks all.
>>>> >> >>>
>>>> >> >>> --
>>>> >> >>> There are ways and there are ways,
>>>> >> >>>
>>>> >> >>> Geoffry Roberts
>>>> >> >
>>>> >> > --
>>>> >> > There are ways and there are ways,
>>>> >> >
>>>> >> > Geoffry Roberts
>>>> >
>>>> > --
>>>> > There are ways and there are ways,
>>>> >
>>>> > Geoffry Roberts
>>>
>>> --
>>> There are ways and there are ways,
>>>
>>> Geoffry Roberts