I'm having a problem recovering from an improper shutdown of a tablet server.

Originally, the tablet server was giving me warnings about being low on memory, 
so I wanted to update its memory settings and restart it. Before I did 
anything, the server worked fine running the tablet server and logger; it 
warned about low memory but still operated.


After editing the configuration, I called stop-here.sh on the machine the 
server was running on, which stopped the tablet server and logger processes. 
Calling start-here.sh, however, did nothing, and calling start-all.sh on the 
master started the tablet server and logger processes, but the server still 
appeared offline on the monitoring webpage. Eventually, I undid the 
configuration changes I had made and was able to start the processes by 
manually killing the tablet server and logger processes and calling 
start-all.sh again, but then a new problem arose.


The tablet server was now online, but its walog still needed to be recovered. 
When the recovery began, it started the copy/sort process on the walogs of not 
only the server that had been offline but also another server (which, I have 
since discovered, has a walog with the same contents as the walog of the 
previously offline tablet server but a different name). As soon as the 
recovery process starts, the loggers of the two servers go offline, and the 
recovery lingers without making progress until the master gives up when the 
maximum recovery time is reached. While the loggers are offline, I am able to 
bring them back online by calling start-all.sh on the master, but that does 
not affect the progress of any current recovery, and they go offline again 
once the next recovery is attempted.


Log files seem to reveal the core error: the 
logger_server-address.in-addr.arpa.out log reports an OutOfMemoryError (Java 
heap space) and that the pid of the logger has been killed. This raises the 
question: how could the server have had enough room for these processes before 
but not now? A monitoring service (Ganglia) shows that, at the time the tablet 
server and logger are killed during recovery, one server still has 1 MB of 
memory free and the other 7 MB.


Is the solution to allocate more heap space to Java, to change the Accumulo 
memory configuration, or something else? All the machines in our cluster run 
CentOS 6.2, some x86 and some x86_64. The two servers in question are x86_64, 
and other x86_64 machines with the same configuration have shown no problems.


Thanks for working to help me understand this,


Patrick Lynch
