I found the message in tserver*.out.  tserver*.err has 0 in it.

When I posted last night, life was good.  I sat down this morning and saw
that another tserver had crashed overnight, with no activity.  ??  Its
tserver*.out again says out of heap space.

ACCUMULO_TSERVER_OPTS=-Xmx2G -Xms1G. I would have thought it sufficient.

The fact that the log entries lack timestamps but have hash marks makes
me wonder if I am reading things correctly.

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 3241"...
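
On the timestamp question: the hash-prefixed lines appear to be the banner
the JVM itself prints to stdout when -XX:OnOutOfMemoryError fires, not
entries from Accumulo's logger, which would explain why they carry no
timestamps. Below is a minimal sketch for pulling such banners out of a
tserver*.out file. It assumes the banner looks exactly like the excerpt
above; the function name find_oom_events and both regular expressions are
illustrative and are not part of Accumulo or the JDK.

```python
import re

# Patterns modelled on the excerpt above; other JVM versions may word the
# banner differently, so treat these as assumptions.
OOM_RE = re.compile(r"^#\s*java\.lang\.OutOfMemoryError:\s*(.+)$")
KILL_RE = re.compile(r'^#\s*Executing /bin/sh -c "kill -9 (\d+)"')


def find_oom_events(text):
    """Return a list of (reason, pid) tuples, one per OOM banner in text."""
    events = []
    reason = None
    for raw in text.splitlines():
        line = raw.strip()
        oom = OOM_RE.match(line)
        if oom:
            # Remember the reason until the matching kill line shows up.
            reason = oom.group(1).strip()
            continue
        kill = KILL_RE.match(line)
        if kill and reason is not None:
            events.append((reason, int(kill.group(1))))
            reason = None
    return events


if __name__ == "__main__":
    import sys

    # Usage: python find_oom.py path/to/tserver_host.out
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for reason, pid in find_oom_events(handle.read()):
                print(f"{path}: {reason} (killed pid {pid})")
```

Running this over every tserver*.out on each node would show which tablet
servers were killed this way and which pid went with each kill.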


Is there a way to start a particular tablet server?

On Wed, Oct 8, 2014 at 6:55 PM, Eric Newton <eric.new...@gmail.com> wrote:

> Did you find the message in the tserver*.out, tserver*.err or the monitor
> page?
>
> (Thanks for the follow-up message.)
>
> On Wed, Oct 8, 2014 at 6:39 PM, Geoffry Roberts <threadedb...@gmail.com>
> wrote:
>
>> Just for the record, I finally got to the bottom of things.  One of my
>> Tservers was running out of memory.  I hadn't noticed.  I had my SA
>> allocate a little more--each node now has 6G up from 2G--and things are
>> working better.
>>  On Oct 8, 2014 10:09 AM, "Josh Elser" <josh.el...@gmail.com> wrote:
>>
>>> jstack is a tool which can be used to tell a Java process to dump the
>>> current stack traces for all of its threads. It's usually included with the
>>> JDK. `kill -3 $pid` also does the same. If the output isn't redirected
>>> automatically to your shell, check the stdout for the process you gave as
>>> an argument.
>>>
>>> When your client is sitting waiting on data from the tabletserver, you
>>> can get the stack traces from the tserver and you should be able to find a
>>> thread with scan in the name, along with your client's IP, and we can help
>>> debug exactly what the server is doing that is preventing it from returning
>>> data to your client.
>>> On Oct 8, 2014 9:43 AM, "Geoffry Roberts" <threadedb...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Josh.  But what do you mean by "jstack'ing"?  I'm unfamiliar
>>>> with that term.  A better question would be how can one troubleshoot such a
>>>> thing?
>>>>
>>>> btw
>>>> I am the sole user on this cluster.
>>>>
>>>> On Tue, Oct 7, 2014 at 4:18 PM, Josh Elser <josh.el...@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok, this record:
>>>>>
>>>>> tcp        0      0 0.0.0.0:9997                0.0.0.0:*                   LISTEN
>>>>>
>>>>> Means that your tserver is listening on the correct port on all interfaces.
>>>>> There shouldn't be issues connecting to the tserver. This is also
>>>>> confirmed by the fact that you authenticated and got a Connector (this
>>>>> does an RPC to the tserver).
>>>>>
>>>>> So, your tserver is up, and your client can communicate with it. The
>>>>> real question is why the scan is hanging. Perhaps try jstack'ing the
>>>>> tserver while your client is blocked waiting for results.
>>>>>
>>>>> On Tue, Oct 7, 2014 at 2:07 PM, Geoffry Roberts <
>>>>> threadedb...@gmail.com> wrote:
>>>>> > "...it's when
>>>>> > you make a Connector, and your client will talk to a tabletserver to
>>>>> > authenticate, that your program should hang. It would be good to
>>>>> > verify that."
>>>>> >
>>>>> >
>>>>> > My program should hang?  Would you expand?  That is exactly what it is
>>>>> > doing.  I am able to get a connector.  But when I try to iterate the
>>>>> > result of a scan, that's when it hangs.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Here's what comes from netstat:
>>>>> >
>>>>> >
>>>>> > $ netstat -na | grep 9997
>>>>> >
>>>>> > tcp        0      0 0.0.0.0:9997                0.0.0.0:*                   LISTEN
>>>>> > tcp        0      0 204.9.140.36:35679          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53146          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33896          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53282          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53188          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:35609          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33901          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:35588          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33877          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33946          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53167          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33949          204.9.140.38:9997           ESTABLISHED
>>>>> > tcp        0      0 204.9.140.36:35546          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33852          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53125          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33922          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33747          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33961          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33793          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:35768          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33917          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33814          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:35567          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33444          204.9.140.38:9997           FIN_WAIT2
>>>>> > tcp        0      0 204.9.140.36:35701          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33969          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53258          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33831          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53210          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53104          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33789          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33856          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53237          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33835          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:35651          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33938          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33041          204.9.140.36:9997           ESTABLISHED
>>>>> > tcp        0      0 204.9.140.36:53285          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:53305          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33768          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:35630          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33754          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:35745          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:35724          204.9.140.36:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:9997           204.9.140.36:33041          ESTABLISHED
>>>>> > tcp        0      0 204.9.140.36:53083          204.9.140.37:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:50623          204.9.140.37:9997           ESTABLISHED
>>>>> > tcp        0      0 204.9.140.36:33772          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33732          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33874          204.9.140.38:9997           TIME_WAIT
>>>>> > tcp        0      0 204.9.140.36:33810          204.9.140.38:9997           TIME_WAIT
>>>>> >
>>>>> >
>>>>> > On Tue, Oct 7, 2014 at 11:34 AM, Josh Elser <josh.el...@gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >> Can you provide the output from netstat, lsof or /proc/$pid/fd for the
>>>>> >> tserver? Assuming you haven't altered tserver.port.client in
>>>>> >> accumulo-site.xml, we want the line for port 9997.
>>>>> >>
>>>>> >> From my laptop running a tserver on localhost:
>>>>> >>
>>>>> >> $ netstat -na | grep 9997
>>>>> >> tcp4       0      0  127.0.0.1.9997         *.*                    LISTEN
>>>>> >>
>>>>> >> Depending on the tool you use, you can grep out the pid of the tserver
>>>>> >> or just that port itself.
>>>>> >>
>>>>> >> Just so you know, ZK binds to all available interfaces when it starts,
>>>>> >> so it should work seamlessly with localhost or the FQDN for the host.
>>>>> >> As such, it shouldn't matter what you provide to the
>>>>> >> ZooKeeperInstance. That should connect in all cases for you, it's when
>>>>> >> you make a Connector, and your client will talk to a tabletserver to
>>>>> >> authenticate, that your program should hang. It would be good to
>>>>> >> verify that.
>>>>> >>
>>>>> >> On Tue, Oct 7, 2014 at 11:23 AM, Geoffry Roberts <
>>>>> threadedb...@gmail.com>
>>>>> >> wrote:
>>>>> >> > All,
>>>>> >> >
>>>>> >> > Thanks for the responses.
>>>>> >> >
>>>>> >> > Is this a problem for Accumulo?
>>>>> >> > Reverse DNS is yielding my ISP's host name. You know the drill, my IP in
>>>>> >> > reverse followed by their domain name, as opposed to my FQDN, which is
>>>>> >> > what I use in my config files.
>>>>> >> >
>>>>> >> > Running Accumulo 1.5.1
>>>>> >> > I have only one interface.
>>>>> >> > I have the FQDN in both master and slaves files for both Hadoop and
>>>>> >> > Accumulo; in zoo.cfg; and in accumulo-site.xml where the ZooKeepers are
>>>>> >> > referenced.
>>>>> >> > Also, I am passing in all ZK FQDNs when I instantiate ZooKeeperInstance.
>>>>> >> > Forward DNS works
>>>>> >> > Reverse DNS... well (See above).
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> > On Mon, Oct 6, 2014 at 10:26 PM, Adam Fuchs <afu...@apache.org>
>>>>> wrote:
>>>>> >> >>
>>>>> >> >> Accumulo tservers typically listen on a single interface. If you have a
>>>>> >> >> server with multiple interfaces (e.g. loopback and eth0), you might have
>>>>> >> >> a problem in which the tablet servers are not listening on externally
>>>>> >> >> reachable interfaces. Tablet servers will list the interfaces that they
>>>>> >> >> are listening to when they boot, and you can also use tools like lsof to
>>>>> >> >> find them.
>>>>> >> >>
>>>>> >> >> If that is indeed the problem, then you might just need to change your
>>>>> >> >> conf/slaves file to use <hostname> instead of localhost, and then
>>>>> >> >> restart.
>>>>> >> >>
>>>>> >> >> Adam
>>>>> >> >>
>>>>> >> >> On Oct 6, 2014 4:27 PM, "Geoffry Roberts" <
>>>>> threadedb...@gmail.com>
>>>>> >> >> wrote:
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> I have been happily working with Accumulo, but today things changed.
>>>>> >> >>> No errors.
>>>>> >> >>>
>>>>> >> >>> Until now I ran everything server side, which meant the URL was
>>>>> >> >>> localhost:2181, and life was good.  Today I tried running some of the
>>>>> >> >>> same code as a remote client, which means <host name>:2181.  Things
>>>>> >> >>> hang when BatchWriter tries to commit anything, and Scan hangs when it
>>>>> >> >>> tries to iterate through a Map.
>>>>> >> >>>
>>>>> >> >>> Let's focus on the scan part:
>>>>> >> >>>
>>>>> >> >>> scan.fetchColumnFamily(new Text("colfY")); // This executes then hangs.
>>>>> >> >>> for(Entry<Key,Value> entry : scan) {
>>>>> >> >>>     def row = entry.getKey().getRow();
>>>>> >> >>>     def value = entry.getValue();
>>>>> >> >>>     println "value=" + value;
>>>>> >> >>> }
>>>>> >> >>>
>>>>> >> >>> This is what appears in the console :
>>>>> >> >>>
>>>>> >> >>> 17:22:39.802 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping
>>>>> >> >>> response for sessionid: 0x148c6f03388005e after 21ms
>>>>> >> >>>
>>>>> >> >>> 17:22:49.803 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping
>>>>> >> >>> response for sessionid: 0x148c6f03388005e after 21ms
>>>>> >> >>>
>>>>> >> >>> <and on and on>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> The only difference between success and a hang is a URL change, and of
>>>>> >> >>> course being remote.
>>>>> >> >>>
>>>>> >> >>> I don't believe this is a firewall issue.  I shut down the firewall.
>>>>> >> >>>
>>>>> >> >>> Am I missing something?
>>>>> >> >>>
>>>>> >> >>> Thanks all.
>>>>> >> >>>
>>>>> >> >>> --
>>>>> >> >>> There are ways and there are ways,
>>>>> >> >>>
>>>>> >> >>> Geoffry Roberts
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> > --
>>>>> >> > There are ways and there are ways,
>>>>> >> >
>>>>> >> > Geoffry Roberts
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > There are ways and there are ways,
>>>>> >
>>>>> > Geoffry Roberts
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> There are ways and there are ways,
>>>>
>>>> Geoffry Roberts
>>>>
>>>
>


-- 
There are ways and there are ways,

Geoffry Roberts
