Did you find the message in the tserver*.out, tserver*.err or the monitor
page?

(Thanks for the follow-up message.)

On Wed, Oct 8, 2014 at 6:39 PM, Geoffry Roberts <threadedb...@gmail.com>
wrote:

> Just for the record, I finally got to the bottom of things.  One of my
> Tservers was running out of memory.  I hadn't noticed.  I had my SA
> allocate a little more--each node now has 6G up from 2G--and things are
> working better.
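
For anyone hitting the same symptom, here is a minimal sketch of checking the tserver*.out and tserver*.err files for a memory error. It assumes the JVM wrote a line containing "OutOfMemoryError" to one of those files; the class name, the default "logs" directory, and the file names in the usage note are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class TserverLogScan {

    /** Returns "file:line: text" for each line in tserver*.out / tserver*.err that contains the marker. */
    static List<String> find(Path logDir, String marker) throws IOException {
        List<String> hits = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(logDir, "tserver*.{out,err}")) {
            for (Path file : files) {
                // ISO-8859-1 decodes any byte sequence, so stray bytes in a raw log cannot fail the read.
                List<String> lines = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
                for (int i = 0; i < lines.size(); i++) {
                    if (lines.get(i).contains(marker)) {
                        hits.add(file.getFileName() + ":" + (i + 1) + ": " + lines.get(i));
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // The log directory is an assumption; pass the real one as the first argument.
        Path logDir = Paths.get(args.length > 0 ? args[0] : "logs");
        find(logDir, "OutOfMemoryError").forEach(System.out::println);
    }
}
```

Run it against the Accumulo log directory on each node; an empty result only means the marker text was not found in those files, not that memory is fine.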
>  On Oct 8, 2014 10:09 AM, "Josh Elser" <josh.el...@gmail.com> wrote:
>
>> Jstack is a tool which can be used to tell a java process to dump the
>> current stack traces for all of its threads. It's usually included with the
>> JDK. `kill -3 $pid` also does the same. If the output doesn't appear
>> in your shell, check the stdout of the process you gave as
>> an argument.
>>
>> When your client is sitting waiting on data from the tabletserver, you
>> can get the stack traces from the tserver and you should be able to find a
>> thread with scan in the name, along with your client's IP, and we can help
>> debug exactly what the server is doing that is preventing it from returning
>> data to your client.
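
A rough sketch of the filtering step described above: it reads a thread dump (for example, saved jstack output) and keeps only the stack blocks of threads whose name contains a keyword such as "scan". It assumes the usual dump layout, where each thread block starts with the quoted thread name and blocks are separated by blank lines. The class name is hypothetical, and Java 11 or newer is assumed.

```java
import java.util.ArrayList;
import java.util.List;

class ThreadDumpFilter {

    /** Returns the stack blocks of all threads whose quoted name contains the keyword (case-insensitive). */
    static List<String> threadsNamed(String dump, String keyword) {
        List<String> matches = new ArrayList<>();
        String needle = keyword.toLowerCase();
        // Thread blocks are separated by one or more blank lines.
        for (String rawBlock : dump.split("\\R\\s*\\R")) {
            String block = rawBlock.strip();
            String header = block.lines().findFirst().orElse("");
            if (!header.startsWith("\"")) {
                continue; // not a thread block (e.g. the dump banner)
            }
            int closingQuote = header.indexOf('"', 1);
            if (closingQuote < 0) {
                continue;
            }
            String threadName = header.substring(1, closingQuote);
            if (threadName.toLowerCase().contains(needle)) {
                matches.add(block);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws Exception {
        // Reads the dump from stdin; the keyword defaults to "scan".
        String dump = new String(System.in.readAllBytes());
        String keyword = args.length > 0 ? args[0] : "scan";
        threadsNamed(dump, keyword).forEach(block -> System.out.println(block + "\n"));
    }
}
```

One way to use it is to pipe the jstack output for the tserver pid into this program while the client is blocked, then pass the client's IP as the keyword on a second run to narrow things down.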
>> On Oct 8, 2014 9:43 AM, "Geoffry Roberts" <threadedb...@gmail.com> wrote:
>>
>>> Thanks Josh.  But what do you mean by "jstack'ing"?  I'm unfamiliar
>>> with that term.  A better question would be how can one troubleshoot such a
>>> thing?
>>>
>>> btw
>>> I am the sole user on this cluster.
>>>
>>> On Tue, Oct 7, 2014 at 4:18 PM, Josh Elser <josh.el...@gmail.com> wrote:
>>>
>>>> Ok, this record:
>>>>
>>>> tcp        0      0 0.0.0.0:9997                0.0.0.0:*                   LISTEN
>>>>
>>>> Means that your tserver is listening on the correct port on all interfaces.
>>>> There shouldn't be issues connecting to the tserver. This is also
>>>> confirmed by the fact that you authenticated and got a Connector (this
>>>> does an RPC to the tserver).
>>>>
>>>> So, your tserver is up, and your client can communicate with it. The
>>>> real question is why the scan is hanging. Perhaps try jstack'ing the
>>>> tserver when your client is blocked waiting for results.
>>>>
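
The connectivity half of that reasoning can be double-checked from the client machine with a plain TCP connect. This is only a sketch: the class name is made up, 9997 is the default client port mentioned in this thread, and a successful connect says nothing about why a scan hangs afterwards.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortProbe {

    /** Returns true if a TCP connection to host:port can be opened within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host all count as "cannot connect".
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9997;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 3000));
    }
}
```

Running it once per tserver host from the remote client separates "cannot reach the port" from "connected but the server never answers".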
>>>> On Tue, Oct 7, 2014 at 2:07 PM, Geoffry Roberts <threadedb...@gmail.com>
>>>> wrote:
>>>> > "...it's when
>>>> > you make a Connector, and your client will talk to a tabletserver to
>>>> > authenticate, that your program should hang. It would be good to
>>>> > verify that."
>>>> >
>>>> >
>>>> > My program should hang?  Would you expand?  That is exactly what it is
>>>> > doing.  I am able to get a connector.  But when I try to iterate the
>>>> > result of a scan, that's when it hangs.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Here's what comes from netstat:
>>>> >
>>>> >
>>>> > $ netstat -na | grep 9997
>>>> >
>>>> > tcp        0      0 0.0.0.0:9997                0.0.0.0:*                   LISTEN
>>>> > tcp        0      0 204.9.140.36:35679          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53146          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33896          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53282          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53188          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:35609          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33901          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:35588          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33877          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33946          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53167          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33949          204.9.140.38:9997           ESTABLISHED
>>>> > tcp        0      0 204.9.140.36:35546          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33852          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53125          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33922          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33747          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33961          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33793          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:35768          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33917          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33814          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:35567          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33444          204.9.140.38:9997           FIN_WAIT2
>>>> > tcp        0      0 204.9.140.36:35701          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33969          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53258          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33831          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53210          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53104          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33789          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33856          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53237          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33835          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:35651          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33938          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33041          204.9.140.36:9997           ESTABLISHED
>>>> > tcp        0      0 204.9.140.36:53285          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:53305          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33768          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:35630          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33754          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:35745          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:35724          204.9.140.36:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:9997           204.9.140.36:33041          ESTABLISHED
>>>> > tcp        0      0 204.9.140.36:53083          204.9.140.37:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:50623          204.9.140.37:9997           ESTABLISHED
>>>> > tcp        0      0 204.9.140.36:33772          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33732          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33874          204.9.140.38:9997           TIME_WAIT
>>>> > tcp        0      0 204.9.140.36:33810          204.9.140.38:9997           TIME_WAIT
>>>> >
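
A listing this long is easier to read as a tally per connection state. The following is a small sketch that counts state keywords in raw netstat output (quote markers stripped); the class name is made up, and the state list is the set of names Linux netstat is assumed to print.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class NetstatStates {

    // TCP state names as printed by Linux netstat (assumed list).
    private static final Set<String> STATES = Set.of(
        "LISTEN", "ESTABLISHED", "SYN_SENT", "SYN_RECV", "FIN_WAIT1", "FIN_WAIT2",
        "TIME_WAIT", "CLOSE", "CLOSE_WAIT", "LAST_ACK", "CLOSING");

    /** Counts how often each TCP state appears; works even if the state wrapped onto its own line. */
    static Map<String, Integer> tally(String netstatOutput) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : netstatOutput.split("\\s+")) {
            if (STATES.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Reads netstat output from stdin, e.g. the result of `netstat -na | grep 9997`.
        String output = new String(System.in.readAllBytes());
        tally(output).forEach((state, n) -> System.out.println(state + " " + n));
    }
}
```

Because it only looks at whitespace-separated tokens, it does not care whether the mail client wrapped the state column onto the next line.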
>>>> >
>>>> > On Tue, Oct 7, 2014 at 11:34 AM, Josh Elser <josh.el...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Can you provide the output from netstat, lsof or /proc/$pid/fd for the
>>>> >> tserver? Assuming you haven't altered tserver.port.client in
>>>> >> accumulo-site.xml, we want the line for port 9997.
>>>> >>
>>>> >> From my laptop running a tserver on localhost:
>>>> >>
>>>> >> $ netstat -na | grep 9997
>>>> >> tcp4       0      0  127.0.0.1.9997         *.*                    LISTEN
>>>> >>
>>>> >> Depending on the tool you use, you can grep out the pid of the tserver
>>>> >> or just that port itself.
>>>> >>
>>>> >> Just so you know, ZK binds to all available interfaces when it starts,
>>>> >> so it should work seamlessly with localhost or the FQDN for the host.
>>>> >> As such, it shouldn't matter what you provide to the
>>>> >> ZooKeeperInstance. That should connect in all cases for you; it's when
>>>> >> you make a Connector, and your client will talk to a tabletserver to
>>>> >> authenticate, that your program should hang. It would be good to
>>>> >> verify that.
>>>> >>
>>>> >> On Tue, Oct 7, 2014 at 11:23 AM, Geoffry Roberts <
>>>> threadedb...@gmail.com>
>>>> >> wrote:
>>>> >> > All,
>>>> >> >
>>>> >> > Thanks for the responses.
>>>> >> >
>>>> >> > Is this a problem for Accumulo?
>>>> >> > Reverse DNS is yielding my ISP's host name. You know the drill, my IP in
>>>> >> > reverse followed by their domain name, as opposed to my FQDN, which is
>>>> >> > what I use in my config files.
>>>> >> >
>>>> >> > Running Accumulo 1.5.1
>>>> >> > I have only one interface.
>>>> >> > I have the FQDN in both master and slaves files for both Hadoop and
>>>> >> > Accumulo; in zoo.cfg; and in accumulo-site.xml where the ZooKeepers are
>>>> >> > referenced.
>>>> >> > Also, I am passing in all ZK FQDNs when I instantiate ZooKeeperInstance.
>>>> >> > Forward DNS works
>>>> >> > Reverse DNS... well (See above).
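
The forward/reverse mismatch described here can be shown from Java with only the standard library. This is a sketch with a made-up class name: it resolves a host name to an address, then asks for the name that reverse DNS reports for that address. Whether such a mismatch actually affects Accumulo is not something this snippet can tell you.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class DnsRoundTrip {

    /** Returns the name reverse DNS reports for the address that the given host resolves to. */
    static String reverseOf(String host) throws UnknownHostException {
        InetAddress forward = InetAddress.getByName(host); // forward lookup
        // Rebuilding the address from raw bytes drops the name we supplied,
        // so getCanonicalHostName() has to perform a real reverse lookup.
        InetAddress bare = InetAddress.getByAddress(forward.getAddress());
        // If the reverse lookup fails, this returns the textual IP address instead.
        return bare.getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        String host = args.length > 0 ? args[0] : "localhost";
        String reverse = reverseOf(host);
        System.out.println("forward: " + host + " -> " + InetAddress.getByName(host).getHostAddress());
        System.out.println("reverse: " + reverse);
        System.out.println("names match: " + host.equalsIgnoreCase(reverse));
    }
}
```

Pass the FQDN used in the config files as the argument; "names match: false" reproduces the situation described above.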
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > On Mon, Oct 6, 2014 at 10:26 PM, Adam Fuchs <afu...@apache.org>
>>>> wrote:
>>>> >> >>
>>>> >> >> Accumulo tservers typically listen on a single interface. If you have a
>>>> >> >> server with multiple interfaces (e.g. loopback and eth0), you might
>>>> >> >> have a problem in which the tablet servers are not listening on
>>>> >> >> externally reachable interfaces. Tablet servers will list the
>>>> >> >> interfaces that they are listening on when they boot, and you can also
>>>> >> >> use tools like lsof to find them.
>>>> >> >>
>>>> >> >> If that is indeed the problem, then you might just need to change your
>>>> >> >> conf/slaves file to use <hostname> instead of localhost, and then
>>>> >> >> restart.
>>>> >> >>
>>>> >> >> Adam
>>>> >> >>
>>>> >> >> On Oct 6, 2014 4:27 PM, "Geoffry Roberts" <threadedb...@gmail.com
>>>> >
>>>> >> >> wrote:
>>>> >> >>>
>>>> >> >>>
>>>> >> >>> I have been happily working with Acc, but today things changed.  No
>>>> >> >>> errors.
>>>> >> >>>
>>>> >> >>> Until now I ran everything server side, which meant the URL was
>>>> >> >>> localhost:2181, and life was good.  Today I tried running some of the
>>>> >> >>> same code as a remote client, which means <host name>:2181.  Things
>>>> >> >>> hang when BatchWriter tries to commit anything, and Scan hangs when it
>>>> >> >>> tries to iterate through a Map.
>>>> >> >>>
>>>> >> >>> Let's focus on the scan part:
>>>> >> >>>
>>>> >> >>> // fetchColumnFamily executes; iterating in the loop below is what hangs.
>>>> >> >>> scan.fetchColumnFamily(new Text("colfY"));
>>>> >> >>> for (Entry<Key,Value> entry : scan) {
>>>> >> >>>     def row = entry.getKey().getRow();
>>>> >> >>>     def value = entry.getValue();
>>>> >> >>>     println "value=" + value;
>>>> >> >>> }
>>>> >> >>>
>>>> >> >>> This is what appears in the console :
>>>> >> >>>
>>>> >> >>> 17:22:39.802 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x148c6f03388005e after 21ms
>>>> >> >>>
>>>> >> >>> 17:22:49.803 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x148c6f03388005e after 21ms
>>>> >> >>>
>>>> >> >>> <and on and on>
>>>> >> >>>
>>>> >> >>>
>>>> >> >>>
>>>> >> >>> The only difference between success and a hang is a URL change, and of
>>>> >> >>> course being remote.
>>>> >> >>>
>>>> >> >>> I don't believe this is a firewall issue.  I shut down the firewall.
>>>> >> >>>
>>>> >> >>> Am I missing something?
>>>> >> >>>
>>>> >> >>> Thanks all.
>>>> >> >>>
>>>> >> >>> --
>>>> >> >>> There are ways and there are ways,
>>>> >> >>>
>>>> >> >>> Geoffry Roberts
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > There are ways and there are ways,
>>>> >> >
>>>> >> > Geoffry Roberts
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > There are ways and there are ways,
>>>> >
>>>> > Geoffry Roberts
>>>>
>>>
>>>
>>>
>>> --
>>> There are ways and there are ways,
>>>
>>> Geoffry Roberts
>>>
>>
