Cross posting my recent blog entry...
As documented in THRIFT-601, sending random data to Thrift can cause it
to leak memory.
At Mozilla, we use a web load balancer to distribute traffic to our
Thrift machines, and the default liveness check it uses is a simple TCP
connect. We also had Nagios performing TCP connect checks on these nodes
for general alerting.
All of these bare connects caused the Thrift servers to start throwing
OOM errors, sometimes within a few days of being started.
I wrote a test utility that performs a legitimate Thrift API call (it
actually tries to get the schema of the .META. table) and returns a
success if it can execute the call.
The utility can run from the command line, or it can use the lightweight
HTTP server class that ships with the Sun JRE 6 to listen for requests
to /thrift/health and report back the status.
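For anyone curious what the listen mode looks like in outline, here is a
minimal sketch of a /thrift/health endpoint built on the JRE's
com.sun.net.httpserver.HttpServer. It is not the code from the project:
the class name, the port, the 200/503 status codes, and the always-true
placeholder check are assumptions for illustration, and it uses lambdas
(Java 8+) for brevity. A real check would make the Thrift call described
above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not the HbaseThriftTester source.
class HealthEndpointSketch {

    // Runs the supplied check against every host:port and counts the failures.
    static int countFailures(List<String> targets, Predicate<String> check) {
        int failures = 0;
        for (String target : targets) {
            if (!check.test(target)) {
                failures++;
            }
        }
        return failures;
    }

    // Starts an HTTP server that re-runs the checks on every request
    // to /thrift/health. The 200/503 status codes are an assumption.
    static HttpServer start(int port, List<String> targets,
                            Predicate<String> check) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/thrift/health", exchange -> {
            int failures = countFailures(targets, check);
            String text = failures == 0 ? "OK\n" : "FAILED: " + failures + "\n";
            byte[] body = text.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(failures == 0 ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder check that always succeeds; the real utility would
        // perform a legitimate Thrift API call here instead.
        Predicate<String> placeholderCheck = hostPort -> true;
        start(8080, Arrays.asList("localhost:9090"), placeholderCheck);
    }
}
```

The point of the design is that the load balancer only ever speaks HTTP
to this small daemon, so nothing but well-formed Thrift calls reach the
Thrift port.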
    $ java -jar HbaseThriftTester.jar
    Missing required option: [-check Immediately checks the following
    host:port combinations and returns a summary message with an exit value
    of the number of failures., -listen Run as an HTTP daemon listening on
    port. Checks the hosts every time /thrift/health URL is requested.]
    usage: HbaseThriftTester [-timeout <ms>] <mode> <host:port>...
     -check               Immediately checks the following host:port
                          combinations and returns a summary message with an
                          exit value of the number of failures.
     -listen <port>       Run as an HTTP daemon listening on port. Checks the
                          hosts every time /thrift/health URL is requested.
     -timeout <seconds>   Number of seconds to wait for Thrift call to
                          complete
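One way a per-check timeout like -timeout could be enforced is to run
the check on a worker thread and give up if it does not finish in time.
The sketch below shows that general technique with
java.util.concurrent. It is an assumption about how such an option might
work, not the project's actual implementation. The class and method
names are hypothetical, and because the usage text shows both <ms> and
<seconds>, the choice of milliseconds here is also an assumption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not the HbaseThriftTester source.
class TimeoutCheckSketch {

    // Runs the check on a worker thread. Returns true only if the check
    // returns true within timeoutMs; a timeout, an exception, or an
    // interruption all count as a failed check.
    static boolean checkWithTimeout(Callable<Boolean> check, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> result = executor.submit(check);
        try {
            return Boolean.TRUE.equals(result.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the stuck call
            return false;
        } catch (ExecutionException e) {
            return false; // the check itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A check that hangs on an unresponsive server is then reported as a
failure instead of blocking the Nagios script or the health endpoint.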
The app is bundled with one-jar, so it is easy to call from a Nagios
script or similar. Maybe it will be useful to someone else. Just pull
down the project and build it with ant.
https://svn.mozilla.org/metrics/hadoop/hbase/HbaseThriftTester/
-Daniel