Thanks Ted, I was not familiar with that JIRA, though I have read it now. The next time it happens I will run the test at the top:

netstat -nap | grep CLOSE_WAIT | grep 21592 | wc -l
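In the meantime, a rough way to watch the descriptor counts directly on an affected node is to read /proc instead of running a full lsof. This is just a sketch: $$ (the current shell) stands in for the real regionserver PID, and the grep pattern is taken from the lsof output below.

```shell
# Sketch: inspect open file descriptors for a process via /proc.
# Substitute the regionserver PID for $$ (used here only as a stand-in).
PID=$$

# Total open descriptors for the process
FD_COUNT=$(ls /proc/"$PID"/fd | wc -l)

# Descriptors whose link targets are short-circuit-read shm segments
# (grep -c prints 0 when nothing matches)
SHM_COUNT=$(ls -l /proc/"$PID"/fd 2>/dev/null | grep -c 'HadoopShortCircuitShm')

echo "fds=$FD_COUNT shm=$SHM_COUNT"
```

Running this in a loop (e.g. under watch) should show whether the two counts climb together.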
However, from what I can tell in the JIRA, this should affect almost all versions of HBase above 0.94. The issue we are hitting does not appear to affect 0.98/CDH5.3.8, and we never saw it when we were on 0.94. This seems new in either 1.0+ or 1.2+.

On Mon, May 23, 2016 at 12:59 PM Ted Yu <yuzhih...@gmail.com> wrote:

> Have you taken a look at HBASE-9393 ?
>
> On Mon, May 23, 2016 at 9:55 AM, Bryan Beaudreault <
> bbeaudrea...@hubspot.com> wrote:
>
> > Hey everyone,
> >
> > We are noticing a file descriptor leak that is only affecting nodes in
> > our cluster running 5.7.0, not those still running 5.3.8. I ran an lsof
> > against an affected regionserver, and noticed that there were 10k+ unix
> > sockets that are just called "socket", as well as another 10k+ of the
> > form
> > "/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>".
> > The two seem related based on how closely the counts match.
> >
> > We are in the middle of a rolling upgrade from CDH5.3.8 to CDH5.7.0 (we
> > handled the namenode upgrade separately). The 5.3.8 nodes *do not*
> > experience this issue; the 5.7.0 nodes *do*. We are holding off
> > upgrading more regionservers until we can figure this out. I'm not sure
> > if any intermediate versions between the two have the issue.
> >
> > We traced the root cause to a hadoop job running against a basic table:
> >
> > 'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE => '107374182400',
> > MEMSTORE_FLUSHSIZE => '67108864'}, {NAME => '0', VERSIONS => '50',
> > BLOOMFILTER => 'NONE', COMPRESSION => 'LZO', METADATA =>
> > {'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' => 'true'}}
> >
> > This is very similar to all of our other tables (we have many).
> > However, its regions are getting up there in size, 40+ GB per region,
> > compressed. This has not been an issue for us previously.
> >
> > The hadoop job is a simple TableMapper job with no special parameters,
> > though we haven't updated our client yet to the latest (we will do that
> > once we finish the server side). The hadoop job runs on a separate
> > hadoop cluster, remotely accessing the HBase cluster. It does not do
> > any other reads or writes outside of the TableMapper scans.
> >
> > Moving the regions off of an affected server, or killing the hadoop
> > job, causes the file descriptors to gradually go back down to normal.
> >
> > Any ideas?
> >
> > Thanks,
> >
> > Bryan
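For anyone else chasing this: the segments named in the lsof output above also appear directly under /dev/shm, so a crude check that doesn't need lsof at all is just counting them there. This is a sketch; the grep pattern comes from the filenames quoted above, and on an affected node the count should roughly track the per-process lsof count.

```shell
# Count short-circuit-read shared-memory segments under /dev/shm.
# Expect a small, stable number on a healthy node; the thread above
# reports it climbing into the 10k+ range on affected regionservers.
COUNT=$(ls /dev/shm 2>/dev/null | grep -c 'HadoopShortCircuitShm')
echo "shm segments: $COUNT"
```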