Thanks Ted,

I was not familiar with that JIRA, though I have read it now. The next time
it happens I will run the test from the top of that issue:
netstat -nap | grep CLOSE_WAIT | grep 21592 | wc -l
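
For reference, roughly the sort of check I have in mind (assuming the
regionserver was started the usual way, so pgrep can find it by the
proc_regionserver marker on its command line; adjust the pattern as needed):

  RS_PID=$(pgrep -f proc_regionserver | head -1)
  # total open file descriptors held by the regionserver
  ls /proc/$RS_PID/fd | wc -l
  # connections stuck in CLOSE_WAIT that belong to that pid
  netstat -nap 2>/dev/null | grep CLOSE_WAIT | grep "$RS_PID/" | wc -l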

However, from what I can tell, that JIRA should affect almost all versions
of HBase above 0.94. The issue we are hitting does not appear to affect
0.98/CDH5.3.8, and we never saw it when we were on 0.94, so it seems to be
new in either 1.0+ or 1.2+.

On Mon, May 23, 2016 at 12:59 PM Ted Yu <yuzhih...@gmail.com> wrote:

> Have you taken a look at HBASE-9393 ?
>
> On Mon, May 23, 2016 at 9:55 AM, Bryan Beaudreault
> <bbeaudrea...@hubspot.com> wrote:
>
> > Hey everyone,
> >
> > We are noticing a file descriptor leak that is only affecting nodes in
> > our cluster running 5.7.0, not those still running 5.3.8. I ran lsof
> > against an affected regionserver and noticed 10k+ unix sockets that are
> > just called "socket", as well as another 10k+ entries of the form
> > "/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>".
> > The two seem related, based on how closely the counts match.
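> >
> > If it helps for comparison, the counts came from ordinary lsof output,
> > along these lines (the pgrep pattern is just one way to find the RS pid;
> > adjust for your environment):
> >
> >   RS_PID=$(pgrep -f proc_regionserver | head -1)
> >   # short-circuit read shared memory segments
> >   lsof -p "$RS_PID" | grep -c HadoopShortCircuitShm
> >   # unix domain sockets held by the same process
> >   lsof -p "$RS_PID" | grep -c ' unix '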
> >
> > We are in the middle of a rolling upgrade from CDH5.3.8 to CDH5.7.0 (we
> > handled the namenode upgrade separately). The 5.3.8 nodes *do not*
> > experience this issue. The 5.7.0 nodes *do*. We are holding off upgrading
> > more regionservers until we can figure this out. I'm not sure whether any
> > intermediate versions between the two have the issue.
> >
> > We traced the root cause to a hadoop job running against a basic table:
> >
> > 'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE => '107374182400',
> > MEMSTORE_FLUSHSIZE => '67108864'}}, {NAME => '0', VERSIONS => '50',
> > BLOOMFILTER => 'NONE', COMPRESSION => 'LZO', METADATA =>
> > {'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' => 'true'}}
> >
> > This is very similar to all of our other tables (we have many). However,
> > its regions are getting up there in size, 40+ GB per region, compressed.
> > This has not been an issue for us previously.
> >
> > The hadoop job is a simple TableMapper job with no special parameters,
> > though we haven't yet updated our client to the latest (we'll do that
> > once we finish the server side). The job runs on a separate hadoop
> > cluster, remotely accessing the HBase cluster, and does no reads or
> > writes outside of the TableMapper scans.
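> >
> > (For context, "remotely accessing" just means the job's configuration
> > points at this cluster's ZK quorum; roughly, assuming the job goes
> > through ToolRunner, something like:
> >
> >   hadoop jar my-scan-job.jar com.example.MyScanJob \
> >     -Dhbase.zookeeper.quorum=zk1,zk2,zk3 \
> >     -Dzookeeper.znode.parent=/hbase
> >
> > where the jar and class names are placeholders.)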
> >
> > Moving the regions off of an affected server, or killing the hadoop job,
> > causes the file descriptors to gradually go back down to normal.
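> >
> > (By "moving the regions off" I just mean draining the server, e.g. with
> > the shell's move command, along the lines of:
> >
> >   echo "move 'ENCODED_REGION_NAME', 'host,port,startcode'" | hbase shell
> >
> > with placeholders for the real region and server names.)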
> >
> > Any ideas?
> >
> > Thanks,
> >
> > Bryan
> >
>
