I found the message in tserver*.out; tserver*.err has 0 bytes in it. I posted last night, life was good; I sat down this morning and saw that another tserver had crashed, overnight, with no activity. ?? tserver*.out again says out of heap space.

ACCUMULO_TSERVER_OPTS=-Xmx2G -Xms1G. I would have thought that sufficient. The fact that the log entries lack timestamps, but have hash marks, makes me wonder if I am reading things correctly:

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 3241"...

Is there a way to start a particular tablet server?

On Wed, Oct 8, 2014 at 6:55 PM, Eric Newton <eric.new...@gmail.com> wrote:

> Did you find the message in the tserver*.out, tserver*.err or the monitor
> page?
>
> (Thanks for the follow-up message.)
>
> On Wed, Oct 8, 2014 at 6:39 PM, Geoffry Roberts <threadedb...@gmail.com> wrote:
>
>> Just for the record, I finally got to the bottom of things. One of my
>> tservers was running out of memory. I hadn't noticed. I had my SA
>> allocate a little more--each node now has 6G, up from 2G--and things are
>> working better.
>>
>> On Oct 8, 2014 10:09 AM, "Josh Elser" <josh.el...@gmail.com> wrote:
>>
>>> jstack is a tool which can be used to tell a Java process to dump the
>>> current stack traces for all of its threads. It's usually included
>>> with the JDK. `kill -3 $pid` also does the same. If the output isn't
>>> redirected automatically to your shell, check the stdout of the
>>> process whose pid you gave as an argument.
>>>
>>> When your client is sitting waiting on data from the tabletserver, you
>>> can get the stack traces from the tserver. You should be able to find
>>> a thread with "scan" in its name, along with your client's IP, and we
>>> can help debug exactly what the server is doing that is preventing it
>>> from returning data to your client.
>>>
>>> On Oct 8, 2014 9:43 AM, "Geoffry Roberts" <threadedb...@gmail.com> wrote:
>>>
>>>> Thanks Josh. But what do you mean by "jstack'ing"? I'm unfamiliar
>>>> with that term. A better question would be: how can one troubleshoot
>>>> such a thing?
>>>>
>>>> btw
>>>> I am the sole user on this cluster.
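[Editor's note: for readers unfamiliar with jstack, the thread dump it produces can be approximated from inside any JVM with the standard `Thread.getAllStackTraces()` API. This is only an illustrative sketch of what a dump contains, not Accumulo code:]

```java
import java.util.Map;

public class StackDump {
    // Print the name and current stack of every live thread -- similar in
    // spirit to the per-thread output of `jstack <pid>` or `kill -3 <pid>`.
    public static void main(String[] args) {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            System.out.println("\"" + e.getKey().getName() + "\"");
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

When triggered with `kill -3` against another process, the dump goes to that process's stdout; per Josh's note above, for a tserver whose stdout is redirected to a file, that is where to look for it.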
>>>>
>>>> On Tue, Oct 7, 2014 at 4:18 PM, Josh Elser <josh.el...@gmail.com> wrote:
>>>>
>>>>> Ok, this record:
>>>>>
>>>>> tcp  0  0 0.0.0.0:9997   0.0.0.0:*   LISTEN
>>>>>
>>>>> means that your tserver is listening on the correct port on all
>>>>> interfaces. There shouldn't be issues connecting to the tserver. This
>>>>> is also confirmed by the fact that you authenticated and got a
>>>>> Connector (this does an RPC to the tserver).
>>>>>
>>>>> So, your tserver is up, and your client can communicate with it. The
>>>>> real question is why the scan is hanging. Perhaps try jstack'ing the
>>>>> tserver when your client is blocked waiting for results.
>>>>>
>>>>> On Tue, Oct 7, 2014 at 2:07 PM, Geoffry Roberts <threadedb...@gmail.com> wrote:
>>>>> > "...it's when
>>>>> > you make a Connector, and your client will talk to a tabletserver to
>>>>> > authenticate, that your program should hang. It would be good to
>>>>> > verify that."
>>>>> >
>>>>> > My program should hang? Would you expand? That is exactly what it
>>>>> > is doing. I am able to get a Connector. But when I try to iterate
>>>>> > the result of a scan, that's when it hangs.
>>>>> >
>>>>> > Here's what comes from netstat:
>>>>> >
>>>>> > $ netstat -na | grep 9997
>>>>> > tcp  0  0 0.0.0.0:9997         0.0.0.0:*              LISTEN
>>>>> > tcp  0  0 204.9.140.36:35679   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53146   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33896   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53282   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53188   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:35609   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33901   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:35588   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33877   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33946   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53167   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33949   204.9.140.38:9997      ESTABLISHED
>>>>> > tcp  0  0 204.9.140.36:35546   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33852   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53125   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33922   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33747   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33961   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33793   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:35768   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33917   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33814   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:35567   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33444   204.9.140.38:9997      FIN_WAIT2
>>>>> > tcp  0  0 204.9.140.36:35701   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33969   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53258   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33831   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53210   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53104   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33789   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33856   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53237   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33835   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:35651   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33938   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33041   204.9.140.36:9997      ESTABLISHED
>>>>> > tcp  0  0 204.9.140.36:53285   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:53305   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33768   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:35630   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33754   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:35745   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:35724   204.9.140.36:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:9997    204.9.140.36:33041     ESTABLISHED
>>>>> > tcp  0  0 204.9.140.36:53083   204.9.140.37:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:50623   204.9.140.37:9997      ESTABLISHED
>>>>> > tcp  0  0 204.9.140.36:33772   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33732   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33874   204.9.140.38:9997      TIME_WAIT
>>>>> > tcp  0  0 204.9.140.36:33810   204.9.140.38:9997      TIME_WAIT
>>>>> >
>>>>> > On Tue, Oct 7, 2014 at 11:34 AM, Josh Elser <josh.el...@gmail.com> wrote:
>>>>> >>
>>>>> >> Can you provide the output from netstat, lsof or /proc/$pid/fd for
>>>>> >> the tserver? Assuming you haven't altered tserv.port.client in
>>>>> >> accumulo-site.xml, we want the line for port 9997.
>>>>> >>
>>>>> >> From my laptop running a tserver on localhost:
>>>>> >>
>>>>> >> $ netstat -na | grep 9997
>>>>> >> tcp4  0  0  127.0.0.1.9997  *.*  LISTEN
>>>>> >>
>>>>> >> Depending on the tool you use, you can grep out the pid of the
>>>>> >> tserver or just that port itself.
>>>>> >>
>>>>> >> Just so you know, ZK binds to all available interfaces when it
>>>>> >> starts, so it should work seamlessly with localhost or the FQDN for
>>>>> >> the host. As such, it shouldn't matter what you provide to the
>>>>> >> ZooKeeperInstance. That should connect in all cases for you; it's
>>>>> >> when you make a Connector, and your client will talk to a
>>>>> >> tabletserver to authenticate, that your program should hang. It
>>>>> >> would be good to verify that.
>>>>> >>
>>>>> >> On Tue, Oct 7, 2014 at 11:23 AM, Geoffry Roberts <threadedb...@gmail.com> wrote:
>>>>> >> > All,
>>>>> >> >
>>>>> >> > Thanks for the responses.
>>>>> >> >
>>>>> >> > Is this a problem for Accumulo?
>>>>> >> > Reverse DNS is yielding my ISP's host name. You know the drill:
>>>>> >> > my IP in reverse followed by their domain name, as opposed to my
>>>>> >> > FQDN, which is what I use in my config files.
>>>>> >> >
>>>>> >> > Running Accumulo 1.5.1.
>>>>> >> > I have only one interface.
>>>>> >> > I have the FQDN in both the masters and slaves files for both
>>>>> >> > Hadoop and Accumulo; in zoo.cfg; and in accumulo-site.xml where
>>>>> >> > the ZooKeepers are referenced.
>>>>> >> > Also, I am passing in all ZK FQDNs when I instantiate
>>>>> >> > ZooKeeperInstance.
>>>>> >> > Forward DNS works.
>>>>> >> > Reverse DNS... well (see above).
>>>>> >> >
>>>>> >> > On Mon, Oct 6, 2014 at 10:26 PM, Adam Fuchs <afu...@apache.org> wrote:
>>>>> >> >>
>>>>> >> >> Accumulo tservers typically listen on a single interface. If
>>>>> >> >> you have a server with multiple interfaces (e.g. loopback and
>>>>> >> >> eth0), you might have a problem in which the tablet servers are
>>>>> >> >> not listening on externally reachable interfaces. Tablet
>>>>> >> >> servers will list the interfaces that they are listening to
>>>>> >> >> when they boot, and you can also use tools like lsof to find
>>>>> >> >> them.
>>>>> >> >>
>>>>> >> >> If that is indeed the problem, then you might just need to
>>>>> >> >> change your conf/slaves file to use <hostname> instead of
>>>>> >> >> localhost, and then restart.
>>>>> >> >>
>>>>> >> >> Adam
>>>>> >> >>
>>>>> >> >> On Oct 6, 2014 4:27 PM, "Geoffry Roberts" <threadedb...@gmail.com> wrote:
>>>>> >> >>>
>>>>> >> >>> I have been happily working with Acc, but today things
>>>>> >> >>> changed. No errors.
>>>>> >> >>>
>>>>> >> >>> Until now I ran everything server side, which meant the URL
>>>>> >> >>> was localhost:2181, and life was good. Today I tried running
>>>>> >> >>> some of the same code as a remote client, which means
>>>>> >> >>> <host name>:2181. Things hang when BatchWriter tries to commit
>>>>> >> >>> anything, and Scan hangs when it tries to iterate through a
>>>>> >> >>> Map.
>>>>> >> >>>
>>>>> >> >>> Let's focus on the scan part:
>>>>> >> >>>
>>>>> >> >>> scan.fetchColumnFamily(new Text("colfY")); // This executes, then hangs.
>>>>> >> >>> for (Entry<Key,Value> entry : scan) {
>>>>> >> >>>     def row = entry.getKey().getRow();
>>>>> >> >>>     def value = entry.getValue();
>>>>> >> >>>     println "value=" + value;
>>>>> >> >>> }
>>>>> >> >>>
>>>>> >> >>> This is what appears in the console:
>>>>> >> >>>
>>>>> >> >>> 17:22:39.802 C{0} M DEBUG org.apache.zookeeper.ClientCnxn -
>>>>> >> >>> Got ping response for sessionid: 0x148c6f03388005e after 21ms
>>>>> >> >>> 17:22:49.803 C{0} M DEBUG org.apache.zookeeper.ClientCnxn -
>>>>> >> >>> Got ping response for sessionid: 0x148c6f03388005e after 21ms
>>>>> >> >>> <and on and on>
>>>>> >> >>>
>>>>> >> >>> The only difference between success and a hang is a URL
>>>>> >> >>> change, and of course being remote.
>>>>> >> >>>
>>>>> >> >>> I don't believe this is a firewall issue. I shut down the
>>>>> >> >>> firewall.
>>>>> >> >>>
>>>>> >> >>> Am I missing something?
>>>>> >> >>>
>>>>> >> >>> Thanks all.
>>>>> >> >>>
>>>>> >> >>> --
>>>>> >> >>> There are ways and there are ways,
>>>>> >> >>>
>>>>> >> >>> Geoffry Roberts

--
There are ways and there are ways,

Geoffry Roberts
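[Editor's note: the two client-side checks discussed in this thread -- whether the tserver port (9997 by default, per tserv.port.client) is reachable from the client machine at all, and what forward/reverse DNS the client JVM actually sees -- can be sketched in plain Java. The host name below is a placeholder; substitute the tserver's FQDN:]

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ClientSideCheck {
    // True if a TCP connect to host:port succeeds within timeoutMs.
    static boolean reachable(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder host/port; pass the tserver's FQDN and client port.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9997;

        System.out.println("reachable: " + reachable(host, port, 2000));

        // Forward and reverse DNS as this JVM resolves them.
        InetAddress addr = InetAddress.getByName(host);
        System.out.println("forward: " + host + " -> " + addr.getHostAddress());
        System.out.println("reverse: " + addr.getHostAddress()
                + " -> " + addr.getCanonicalHostName());
    }
}
```

If "reachable" is false the problem is network-level (binding, routing, firewall); if the reverse lookup returns the ISP's name rather than the FQDN used in the config files, that is the DNS mismatch described above.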