I've forced the issue to happen again. netstat takes a while to run on this host while the issue is occurring, but I do not see an abnormal number of CLOSE_WAIT sockets (compared to other hosts).
I forced a larger-than-usual number of regions for the affected table onto the host to speed up the process. File descriptors are now growing quite rapidly, about 8-10 per second. This is what lsof looks like, multiplied by a couple thousand:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 23180 hbase DEL REG 0,16 3848784656 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1702253823
java 23180 hbase DEL REG 0,16 3847643924 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1614925966
java 23180 hbase DEL REG 0,16 3847614191 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_888427288

The only thing that varies is the final integer on the end.

> Anything about the job itself that is holding open references or throwing away files w/o closing them?

The MR job does a TableMapper directly against HBase (a rough sketch is at the end of this message), which as far as I know uses the HBase RPC and does not hit HDFS directly at all. Is it possible that a long-running scan (one with many, many next() calls) could keep some references to HDFS open for the duration of the overall scan?

On Mon, May 23, 2016 at 2:19 PM Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

> We run MR against many tables in all of our clusters; they mostly have
> similar schema definitions, though they vary in terms of key length,
> # columns, etc. This is the only cluster and only table we've seen leak
> so far. It's probably the table with the biggest regions that we MR
> against, though it's hard to verify that (anyone in engineering can run
> such a job).
>
> dfs.client.read.shortcircuit.streams.cache.size = 256
>
> Our typical FD count is around 3000. When this hadoop job runs, that can
> climb up to our limit of over 30k if we don't act -- it is a gradual
> build-up over the course of a couple of hours. When we move the regions
> off or kill the job, the FDs will gradually go back down at roughly the
> same pace. It forms a graph in the shape of a pyramid.
>
> We don't use CM; we use mostly the default *-site.xml. We haven't
> overridden anything related to this. The configs between CDH5.3.8 and
> 5.7.0 are identical for us.
>
> On Mon, May 23, 2016 at 2:03 PM Stack <st...@duboce.net> wrote:
>
>> On Mon, May 23, 2016 at 9:55 AM, Bryan Beaudreault
>> <bbeaudrea...@hubspot.com> wrote:
>>
>> > Hey everyone,
>> >
>> > We are noticing a file descriptor leak that is only affecting nodes in our
>> > cluster running 5.7.0, not those still running 5.3.8.
>>
>> Translation: roughly hbase-1.2.0+hadoop-2.6.0 vs hbase-0.98.6+hadoop-2.5.0.
>>
>> > I ran an lsof against an affected regionserver, and noticed that there
>> > were 10k+ unix sockets that are just called "socket", as well as another
>> > 10k+ of the form
>> > "/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>".
>> > The 2 seem related based on how closely the counts match.
>> >
>> > We are in the middle of a rolling upgrade from CDH5.3.8 to CDH5.7.0 (we
>> > handled the namenode upgrade separately). The 5.3.8 nodes *do not*
>> > experience this issue. The 5.7.0 nodes *do*. We are holding off upgrading
>> > more regionservers until we can figure this out. I'm not sure if any
>> > intermediate versions between the 2 have the issue.
>> >
>> > We traced the root cause to a hadoop job running against a basic table:
>> >
>> > 'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE => '107374182400',
>> > MEMSTORE_FLUSHSIZE => '67108864'}, {NAME => '0', VERSIONS => '50',
>> > BLOOMFILTER => 'NONE', COMPRESSION => 'LZO', METADATA =>
>> > {'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' => 'true'}}
>> >
>> > This is very similar to all of our other tables (we have many).
>>
>> You are doing MR against some of these also? They have different schemas?
>> No leaks here?
>>
>> > However, its regions are getting up there in size, 40+ GB per region,
>> > compressed. This has not been an issue for us previously.
>> >
>> > The hadoop job is a simple TableMapper job with no special parameters,
>> > though we haven't updated our client yet to the latest (we will do that
>> > once we finish the server side). The hadoop job runs on a separate hadoop
>> > cluster, remotely accessing the HBase cluster. It does not do any other
>> > reads or writes outside of the TableMapper scans.
>> >
>> > Moving the regions off of an affected server, or killing the hadoop job,
>> > causes the file descriptors to gradually go back down to normal.
>> >
>> > Any ideas?
>>
>> Is it just the FD cache running 'normally'? 10k seems like a lot though.
>> 256 seems to be the default in hdfs, but maybe it is different in CM or in
>> hbase?
>>
>> What is your dfs.client.read.shortcircuit.streams.cache.size set to?
>> St.Ack
>>
>> > Thanks,
>> >
>> > Bryan
>> >
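
In case it helps, here is roughly the shape of the job. This is a minimal sketch, not our exact code: the class names, table name handling, and caching values below are placeholders, but the HBase access pattern (scan over the table via TableMapper, nothing else) is the same.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MyTableScanJob {

  // One map() call per row; all data arrives over HBase RPC from the
  // regionservers -- the client side never opens HDFS files itself.
  static class RowMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      // ... per-row processing, no other reads or writes ...
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "my-table-1 scan");
    job.setJarByClass(MyTableScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // rows fetched per next() RPC (placeholder value)
    scan.setCacheBlocks(false);  // don't churn the regionserver block cache from MR

    TableMapReduceUtil.initTableMapperJob(
        "my-table-1", scan, RowMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

So on the client side it is a single long-lived Scan per region, driven by next() calls for the lifetime of each map task.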
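On the short-circuit stream cache Stack asked about: we can double-check what the regionserver's DFSClient actually resolves with a small throwaway class like the one below. This is just a sketch; the property names are the standard HDFS ones, and the defaults passed to getInt/getLong are only fallbacks, not an assertion about what is set on the box.

import org.apache.hadoop.conf.Configuration;

public class PrintShortCircuitCacheSettings {
  public static void main(String[] args) {
    // Plain Configuration loads core-default.xml/core-site.xml; pull in
    // hdfs-site.xml explicitly so we see the same values the DFSClient sees.
    Configuration conf = new Configuration();
    conf.addResource("hdfs-site.xml");

    System.out.println("dfs.client.read.shortcircuit = "
        + conf.getBoolean("dfs.client.read.shortcircuit", false));
    // Cap on cached short-circuit read streams (each holds file descriptors).
    System.out.println("dfs.client.read.shortcircuit.streams.cache.size = "
        + conf.getInt("dfs.client.read.shortcircuit.streams.cache.size", 256));
    // How long an idle cached stream is kept before its descriptors are released.
    System.out.println("dfs.client.read.shortcircuit.streams.cache.expiry.ms = "
        + conf.getLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms", 300000L));
  }
}

Since we run mostly default *-site.xml, I'd expect this to print the 256 default, which doesn't obviously explain 10k+ /dev/shm segments on its own.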