I'll try to run with a higher caching value to see how that changes things, thanks.
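To make the experiment concrete, here is roughly what I'll change -- the 5000 is just a first stab at "bigger", not a tuned value, and I'll flip the block-cache setting in a separate run so we get two datapoints:

    Scan scan = new Scan();
    scan.setCaching(5000);             // experiment 1: up from 500; 5000 is a guess, not a tuned value
    scan.setCacheBlocks(false);        // experiment 2 (separate run): flip to true to read via the block cache
    scan.setScanMetricsEnabled(true);
    scan.setMaxVersions(1);
    scan.setTimeRange(startTime, stopTime);  // startTime/stopTime unchanged from the original job

I'll report back with FD counts for both runs.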
On Mon, May 23, 2016 at 4:07 PM Stack <st...@duboce.net> wrote:

> How hard to change the below, if only temporarily? (Trying to get a
> datapoint or two to act on; the short circuit code hasn't changed that we
> know of... perhaps the scan chunking facility in 1.1 has some side effect
> we've not noticed up to this.)
>
> If you up the caching to be bigger, does it lower the rate of FD leak
> creation?
>
> If you cache the blocks, assuming it does not blow the cache for others,
> does that make a difference?
>
> Hang on... will be back in a sec... just sending this in the meantime...
>
> St.Ack
>
> On Mon, May 23, 2016 at 12:20 PM, Bryan Beaudreault <
> bbeaudrea...@hubspot.com> wrote:
>
> > For reference, the Scan backing the job is pretty basic:
> >
> > Scan scan = new Scan();
> > scan.setCaching(500); // probably too small for the data size we're
> > dealing with
> > scan.setCacheBlocks(false);
> > scan.setScanMetricsEnabled(true);
> > scan.setMaxVersions(1);
> > scan.setTimeRange(startTime, stopTime);
> >
> > Otherwise it is using the out-of-the-box TableInputFormat.
> >
> > On Mon, May 23, 2016 at 3:13 PM Bryan Beaudreault <
> > bbeaudrea...@hubspot.com> wrote:
> >
> > > I've forced the issue to happen again. netstat takes a while to run
> > > on this host while it's happening, but I do not see an abnormal
> > > amount of CLOSE_WAIT (compared to other hosts).
> > >
> > > I forced more than the usual number of regions for the affected
> > > table onto the host to speed up the process. File descriptors are
> > > now growing quite rapidly, about 8-10 per second.
> > >
> > > This is what lsof looks like, multiplied by a couple thousand:
> > >
> > > COMMAND   PID USER  FD  TYPE DEVICE SIZE/OFF       NODE NAME
> > > java    23180 hbase DEL  REG   0,16          3848784656
> > > /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1702253823
> > > java    23180 hbase DEL  REG   0,16          3847643924
> > > /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1614925966
> > > java    23180 hbase DEL  REG   0,16          3847614191
> > > /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_888427288
> > >
> > > The only thing that varies is the last int on the end.
> > >
> > > > Anything about the job itself that is holding open references or
> > > > throwing away files w/o closing them?
> > >
> > > The MR job does a TableMapper directly against HBase, which as far
> > > as I know uses the HBase RPC and does not hit HDFS directly at all.
> > > Is it possible that a long-running scan (one with many, many next()
> > > calls) could keep some references to HDFS open for the duration of
> > > the overall scan?
> > >
> > > On Mon, May 23, 2016 at 2:19 PM Bryan Beaudreault <
> > > bbeaudrea...@hubspot.com> wrote:
> > >
> > >> We run MR against many tables in all of our clusters; they mostly
> > >> have similar schema definitions, though they vary in terms of key
> > >> length, # columns, etc. This is the only cluster and only table
> > >> we've seen leak so far. It's probably the table with the biggest
> > >> regions that we MR against, though it's hard to verify that (anyone
> > >> in engineering can run such a job).
> > >>
> > >> dfs.client.read.shortcircuit.streams.cache.size = 256
> > >>
> > >> Our typical FD count is around 3000. When this hadoop job runs,
> > >> that can climb up to our limit of over 30k if we don't act -- it is
> > >> a gradual build-up over the course of a couple hours. When we move
> > >> the regions off or kill the job, the FDs will gradually go back
> > >> down at roughly the same pace. It forms a graph in the shape of a
> > >> pyramid.
> > >>
> > >> We don't use CM; we use mostly the default *-site.xml. We haven't
> > >> overridden anything related to this. The configs between CDH5.3.8
> > >> and 5.7.0 are identical for us.
> > >>
> > >> On Mon, May 23, 2016 at 2:03 PM Stack <st...@duboce.net> wrote:
> > >>
> > >>> On Mon, May 23, 2016 at 9:55 AM, Bryan Beaudreault <
> > >>> bbeaudrea...@hubspot.com> wrote:
> > >>>
> > >>> > Hey everyone,
> > >>> >
> > >>> > We are noticing a file descriptor leak that is only affecting
> > >>> > nodes in our cluster running 5.7.0, not those still running
> > >>> > 5.3.8.
> > >>>
> > >>> Translation: roughly hbase-1.2.0+hadoop-2.6.0 vs
> > >>> hbase-0.98.6+hadoop-2.5.0.
> > >>>
> > >>> > I ran an lsof against an affected regionserver, and noticed that
> > >>> > there were 10k+ unix sockets that are just called "socket", as
> > >>> > well as another 10k+ of the form
> > >>> > "/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>".
> > >>> > The 2 seem related based on how closely the counts match.
> > >>> >
> > >>> > We are in the middle of a rolling upgrade from CDH5.3.8 to
> > >>> > CDH5.7.0 (we handled the namenode upgrade separately). The 5.3.8
> > >>> > nodes *do not* experience this issue. The 5.7.0 nodes *do*. We
> > >>> > are holding off upgrading more regionservers until we can figure
> > >>> > this out. I'm not sure if any intermediate versions between the
> > >>> > 2 have the issue.
> > >>> >
> > >>> > We traced the root cause to a hadoop job running against a basic
> > >>> > table:
> > >>> >
> > >>> > 'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE =>
> > >>> > '107374182400', MEMSTORE_FLUSHSIZE => '67108864'}, {NAME => '0',
> > >>> > VERSIONS => '50', BLOOMFILTER => 'NONE', COMPRESSION => 'LZO',
> > >>> > METADATA => {'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' =>
> > >>> > 'true'}}
> > >>> >
> > >>> > This is very similar to all of our other tables (we have many).
> > >>>
> > >>> You are doing MR against some of these also? They have different
> > >>> schemas? No leaks there?
> > >>>
> > >>> > However, its regions are getting up there in size, 40+gb per
> > >>> > region, compressed. This has not been an issue for us
> > >>> > previously.
> > >>> >
> > >>> > The hadoop job is a simple TableMapper job with no special
> > >>> > parameters, though we haven't updated our client yet to the
> > >>> > latest (will do that once we finish the server side). The hadoop
> > >>> > job runs on a separate hadoop cluster, remotely accessing the
> > >>> > HBase cluster. It does not do any other reads or writes, outside
> > >>> > of the TableMapper scans.
> > >>> >
> > >>> > Moving the regions off of an affected server, or killing the
> > >>> > hadoop job, causes the file descriptors to gradually go back
> > >>> > down to normal.
> > >>> >
> > >>> > Any ideas?
> > >>>
> > >>> Is it just the FD cache running 'normally'? 10k seems like a lot
> > >>> though. 256 seems to be the default in hdfs but maybe it is
> > >>> different in CM or in hbase?
> > >>>
> > >>> What is your dfs.client.read.shortcircuit.streams.cache.size set
> > >>> to?
> > >>>
> > >>> St.Ack
> > >>>
> > >>> > Thanks,
> > >>> >
> > >>> > Bryan
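P.S. For anyone finding this thread later: the knob Stack asked about is a
plain HDFS client setting, so one way to rule out a config mismatch is to pin
it explicitly in the client Configuration (org.apache.hadoop.conf.Configuration
via org.apache.hadoop.hbase.HBaseConfiguration). A minimal sketch -- the key
names are the standard HDFS ones, but the values shown are just the documented
defaults, not a fix:

    Configuration conf = HBaseConfiguration.create();
    // Short-circuit stream cache: defaults to 256 entries, each expiring
    // after 5 minutes. 10k+ shm segments in lsof means something is growing
    // well past this cache, not that the cache itself is sized at 10k.
    conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 256);
    conf.setLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms", 300000L);
    // Bluntest possible test: disable short-circuit reads entirely and see
    // whether the FD growth stops.
    // conf.setBoolean("dfs.client.read.shortcircuit", false);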