We ran into a similar case with replication at the DR cluster. It turned out that I had filed HBASE-19320 <https://issues.apache.org/jira/browse/HBASE-19320> without knowing about the work done here. The way I detected the direct memory leak was to use metrics to find the direct memory usage, and then a heap dump to analyze which objects were holding the direct memory.
Another alternative to avoid the OOME for direct memory, as mentioned in the discussion in the jira, is to use the async RPC client, just FYI.
Thanks,
Huaxiang

On Nov 8, 2017, at 10:25 AM, Stack <[email protected]> wrote:

On Wed, Nov 8, 2017 at 3:31 AM, Abhishek Singh Chouhan <[email protected]> wrote:

I faced the same issue and have been debugging this for some time now (the logging is not very helpful, as Daniel mentions :)). Looking deeper into this, I realized that the side effects also include large incorrect byte buffer allocations on the server side, apart from the call timeouts on the client side. Have filed HBASE-19215 <https://issues.apache.org/jira/browse/HBASE-19215> for this.

Thank you lads for the info. Let's carry on over in HBASE-19215. Good one.
S

On Wed, Nov 8, 2017 at 4:05 PM, Daniel Jeliński <[email protected]> wrote:

2017-11-07 18:22 GMT+01:00 Stack <[email protected]>:

On Mon, Nov 6, 2017 at 6:33 AM, Daniel Jeliński <[email protected]> wrote:

For others that run into a similar issue: it turned out that the OutOfMemoryError was thrown (and subsequently hidden) on the client side. The error was caused by excessive direct memory usage in Java NIO's bytebuffer caching (described here: http://www.evanjones.ca/java-bytebuffer-leak.html), and setting
-Djdk.nio.maxCachedBufferSize=262144
allowed the application to complete.

Suggestions for how to expose the client-side OOME, Daniel? We should add a note to the thrown exception about "-Djdk.nio.maxCachedBufferSize" (and make sure the exception makes it out!)

Well, I found the problem by adding printStackTrace to the AsyncProcess.createLog function, which was responsible for logging the original OOME.
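[Editor's note] The leak Daniel links to comes from the JDK's per-thread cache of temporary direct buffers: when a large *heap* ByteBuffer is written to a channel, the JDK copies it into a direct buffer of the same size and caches that buffer on the writing thread, so a thread that once wrote a 10 MB value keeps roughly 10 MB of direct memory pinned. A minimal sketch of the triggering pattern follows; it is an illustration, not HBase code, and a FileChannel stands in for the socket write seen in the stack trace (both go through the same sun.nio.ch.IOUtil path).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TempDirectBufferDemo {

    /** Writes one large heap buffer through a channel; this is the pattern
     *  that makes the JDK allocate (and cache, per thread) a temporary
     *  direct buffer of the same size. */
    static long writeLargeHeapBuffer() throws IOException {
        Path tmp = Files.createTempFile("direct-buffer-demo", ".bin");
        // 10 MB heap buffer, the upper end of the Put sizes in the report.
        ByteBuffer heap = ByteBuffer.allocate(10 * 1024 * 1024);
        long total = 0;
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            while (heap.hasRemaining()) {
                // Under the hood, sun.nio.ch.IOUtil copies `heap` into a
                // temporary DIRECT buffer sized to heap.remaining(), then
                // caches that buffer on the calling thread. With
                // -Djdk.nio.maxCachedBufferSize=262144, buffers larger than
                // 256 KB are freed after the write instead of being cached.
                total += ch.write(heap);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeLargeHeapBuffer());
    }
}
```

With many pooled client threads each writing large values, the cached buffers add up until Bits.reserveMemory fails, which matches the allocateDirect frame in the trace below.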
This is not very elegant, and I wouldn't recommend adding it to the official codebase, but the stack trace offers some hints:

java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:329)
    at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:130)
    at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:53)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:727)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:240)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.multi(ClientProtos.java:34142)
    at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:128)
    ... 8 more
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Unknown Source)
    at java.nio.DirectByteBuffer.<init>(Unknown Source)
    at java.nio.ByteBuffer.allocateDirect(Unknown Source)
    at sun.nio.ch.Util.getTemporaryDirectBuffer(Unknown Source)
    at sun.nio.ch.IOUtil.write(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at org.apache.hadoop.security.SaslOutputStream.write(SaslOutputStream.java:169)
    at java.io.BufferedOutputStream.write(Unknown Source)
    at java.io.DataOutputStream.write(Unknown Source)
    at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:277)
    at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:266)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:921)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:874)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1243)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
    ... 11 more

This stack trace comes from the cdh5.10.2 version, but the master branch is sufficiently similar. So, depending on what we want to achieve, we could:
- just replace catch(Throwable e) in AbstractRpcClient.callBlockingMethod with something more fine-grained and fail the application,
- or forward the OOME in callBlockingMethod but add information about maxCachedBufferSize, also failing the application but suggesting possible corrective action to the user,
- or pass the error to the user, allowing the application to intercept it. Not sure yet how to do that, but we would need to do something about the connection becoming unusable after the OOME, in case the user decides to keep going.
What's your take?

Thanks for updating the list,
S

Yet another proof that correct handling of OOME is hard.
Thanks,
Daniel

2017-10-11 11:33 GMT+02:00 Daniel Jeliński <[email protected]>:

Thanks for the hints. I'll see if we can explicitly set MaxDirectMemorySize to a safe number.
Thanks,
Daniel

2017-10-10 21:10 GMT+02:00 Esteban Gutierrez <[email protected]>:

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/misc/VM.java#l184

    // The initial value of this field is arbitrary; during JRE initialization
    // it will be reset to the value specified on the command line, if any,
    // otherwise to Runtime.getRuntime().maxMemory().

which goes all the way down to memory/heap.cpp, to whatever was left to the reserved memory depending on the flags and the platform used, as Vladimir says.

Also, depending on which distribution and features are used, there are specific guidelines about setting that parameter, so mileage might vary.

thanks,
esteban.

--
Cloudera, Inc.
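[Editor's note] Daniel's second option above could look roughly like the following. This is a hypothetical sketch, not the actual AbstractRpcClient code; the class and method names here (OomeHint, wrapDirectOome) are made up for illustration.

```java
/**
 * Sketch of option 2: when the cause of an RPC failure is a
 * direct-memory OOME, attach a hint about corrective JVM flags
 * before rethrowing, so the user sees actionable advice.
 */
public class OomeHint {

    static RuntimeException wrapDirectOome(Throwable t) {
        if (t instanceof OutOfMemoryError
                && String.valueOf(t.getMessage()).contains("Direct buffer memory")) {
            // Keep the original error as the cause so nothing is hidden,
            // but surface the likely remedy in the message.
            return new RuntimeException(
                "Direct buffer OOME on RPC write; consider setting "
                    + "-Djdk.nio.maxCachedBufferSize=262144 or an explicit "
                    + "-XX:MaxDirectMemorySize", t);
        }
        return new RuntimeException(t);
    }

    public static void main(String[] args) {
        RuntimeException e =
            wrapDirectOome(new OutOfMemoryError("Direct buffer memory"));
        System.out.println(e.getMessage());
    }
}
```

Whether the wrapped error should still fail the application (options 1 and 2) or be catchable by the caller (option 3) is exactly the trade-off discussed in the thread, since the connection is unusable after the OOME either way.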
On Tue, Oct 10, 2017 at 1:35 PM, Vladimir Rodionov <[email protected]> wrote:

> The default value is zero, which means the maximum direct memory is unbounded.

That is not correct. If you do not specify MaxDirectMemorySize, the default is platform specific.

The link above is for the JRockit JVM, I presume?

On Tue, Oct 10, 2017 at 11:19 AM, Esteban Gutierrez <[email protected]> wrote:

I don't think it is truly unbounded; IIRC it is limited to the maximum allocated heap.

thanks,
esteban.

--
Cloudera, Inc.

On Tue, Oct 10, 2017 at 1:11 PM, Ted Yu <[email protected]> wrote:

From https://docs.oracle.com/cd/E15289_01/doc.40/e15062/optionxx.htm :

java -XX:MaxDirectMemorySize=2g myApp

Default Value

The default value is zero, which means the maximum direct memory is unbounded.

On Tue, Oct 10, 2017 at 11:04 AM, Vladimir Rodionov <[email protected]> wrote:

> XXMaxDirectMemorySize is set to the default 0, which means unlimited as far as I can tell.

Not sure if this is true. The only confirming link I found was for the JRockit JVM.

On Mon, Oct 9, 2017 at 11:29 PM, Daniel Jeliński <[email protected]> wrote:

Vladimir,
XXMaxDirectMemorySize is set to the default 0, which means unlimited as far as I can tell.
Thanks,
Daniel

2017-10-09 19:30 GMT+02:00 Vladimir Rodionov <[email protected]>:

Have you tried increasing the direct memory size for the server process?
-XX:MaxDirectMemorySize=?

On Mon, Oct 9, 2017 at 2:12 AM, Daniel Jeliński <[email protected]> wrote:

Hello,
I'm running an application doing a lot of Puts (size anywhere between 0 and 10MB, one cell at a time); occasionally I'm getting an error like the below:

2017-10-09 04:29:29,811 WARN [AsyncProcess] - #13368, table=researchplatform:repo_stripe, attempt=1/1 failed=1ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on c169dzv.int.westgroup.com,60020,1506476748534, tracking started Mon Oct 09 04:29:29 EDT 2017; not retrying 1 - final failure

After that, the connection to the RegionServer becomes unusable. Every subsequent attempt to execute a Put on that connection results in a CallTimeoutException. I only found the OutOfMemory by reducing the number of tries to 1.

The host running HBase appears to have at least a few GB of free memory available. Server logs do not mention anything about this error. The cluster is running HBase 1.2.0-cdh5.10.2.

Is this a known problem? Are there workarounds available?
Thanks,
Daniel
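[Editor's note] Besides the JVM flags discussed in the thread, one application-side mitigation is to keep each individual channel write small, so the temporary direct buffer the JDK allocates per write stays small regardless of the value size. The sketch below is illustrative, not from the HBase codebase; the 256 KB chunk size mirrors the -Djdk.nio.maxCachedBufferSize=262144 cap mentioned above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

public class ChunkedWriter {

    // Matches the 262144-byte cache cap discussed in the thread.
    static final int CHUNK = 256 * 1024;

    /** Writes `src` in slices of at most 256 KB, so each write needs only a
     *  small temporary direct buffer rather than one sized to the whole value. */
    static long writeChunked(WritableByteChannel ch, ByteBuffer src) throws IOException {
        long total = 0;
        while (src.hasRemaining()) {
            // Duplicate shares the backing array but has independent
            // position/limit, so we can clamp the slice without touching src.
            ByteBuffer slice = src.duplicate();
            slice.limit(Math.min(src.limit(), src.position() + CHUNK));
            int n = ch.write(slice);
            src.position(src.position() + n);
            total += n;
        }
        return total;
    }
}
```

This trades one large write for a few dozen small ones; for a 10 MB Put that is 40 channel calls, each needing at most a 256 KB temporary direct buffer.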
