I faced the same issue and have been debugging it for some time now (the logging is not very helpful, as Daniel mentions :)). Looking deeper into this, I realized that besides the call timeouts on the client side, a further side effect is large, incorrect byte buffer allocations on the server side. I have filed HBASE-19215 <https://issues.apache.org/jira/browse/HBASE-19215> for this.
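
For anyone who wants to watch the allocation pattern locally, here is a minimal standalone sketch (my own demo, not code from the JIRA) of the client-side mechanism visible in Daniel's stack trace below: writing a large heap buffer through an NIO channel makes the JDK allocate, and then cache per thread, a temporary direct buffer of the same size.

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class DirectBufferCacheDemo {
        public static void main(String[] args) throws Exception {
            Path tmp = Files.createTempFile("direct-buffer-demo", ".bin");
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                // Writing a *heap* buffer through an NIO channel makes the JDK copy
                // it into a temporary *direct* buffer of the same size, which is
                // then cached per thread (sun.nio.ch.Util.getTemporaryDirectBuffer
                // in the stack trace below).
                ch.write(ByteBuffer.wrap(new byte[10 * 1024 * 1024])); // ~10 MB, like the large Puts
            }
            for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                if ("direct".equals(pool.getName())) {
                    // Without jdk.nio.maxCachedBufferSize this reports ~10 MB still held.
                    System.out.printf("direct pool: count=%d used=%d bytes%n",
                            pool.getCount(), pool.getMemoryUsed());
                }
            }
            Files.deleteIfExists(tmp);
        }
    }

Run it once as-is and once with -Djdk.nio.maxCachedBufferSize=262144 (the property needs a reasonably recent JDK 8): with the limit set, the oversized temporary buffer is freed after the write instead of being cached, which is exactly why that flag unblocked Daniel's application.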
On Wed, Nov 8, 2017 at 4:05 PM, Daniel Jeliński <[email protected]> wrote:
> 2017-11-07 18:22 GMT+01:00 Stack <[email protected]>:
> > On Mon, Nov 6, 2017 at 6:33 AM, Daniel Jeliński <[email protected]> wrote:
> > > For others that run into a similar issue, it turned out that the
> > > OutOfMemoryError was thrown (and subsequently hidden) on the client side.
> > > The error was caused by excessive direct memory usage in Java NIO's
> > > bytebuffer caching (described here:
> > > http://www.evanjones.ca/java-bytebuffer-leak.html), and setting
> > > -Djdk.nio.maxCachedBufferSize=262144
> > > allowed the application to complete.
> >
> > Suggestions for how to expose the client-side OOME, Daniel? We should add
> > a note to the thrown exception about "-Djdk.nio.maxCachedBufferSize" (and
> > make sure the exception makes it out!)
>
> Well, I found the problem by adding printStackTrace to the
> AsyncProcess.createLog function, which was responsible for logging the
> original OOME. This is not very elegant, and I wouldn't recommend adding it
> to the official codebase, but the stack trace offers some hints:
>
> java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory
>         at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:329)
>         at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:130)
>         at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:53)
>         at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>         at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:727)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>         at java.util.concurrent.FutureTask.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> Caused by: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory
>         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:240)
>         at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.multi(ClientProtos.java:34142)
>         at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:128)
>         ... 8 more
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Unknown Source)
>         at java.nio.DirectByteBuffer.<init>(Unknown Source)
>         at java.nio.ByteBuffer.allocateDirect(Unknown Source)
>         at sun.nio.ch.Util.getTemporaryDirectBuffer(Unknown Source)
>         at sun.nio.ch.IOUtil.write(Unknown Source)
>         at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
>         at org.apache.hadoop.security.SaslOutputStream.write(SaslOutputStream.java:169)
>         at java.io.BufferedOutputStream.write(Unknown Source)
>         at java.io.DataOutputStream.write(Unknown Source)
>         at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:277)
>         at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:266)
>         at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:921)
>         at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:874)
>         at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1243)
>         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
>         ... 11 more
>
> This stack trace comes from the cdh5.10.2 version, but the master branch is
> sufficiently similar. So, depending on what we want to achieve, we could:
> - just replace catch(Throwable e) in AbstractRpcClient.callBlockingMethod
>   with something more fine-grained and fail the application,
> - or forward the OOME in callBlockingMethod, but add information about
>   maxCachedBufferSize (rough sketch below), also failing the application but
>   suggesting possible corrective action to the user,
> - or pass the error to the user, allowing the application to intercept it.
>   Not sure yet how to do that, but we would need to do something about the
>   connection becoming unusable after an OOME, in case the user decides to
>   keep going.
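>
> For concreteness, a rough, untested sketch of the second option, around the
> spot in AbstractRpcClient.callBlockingMethod where we currently catch
> Throwable (the hint text is just a strawman):
>
>     try {
>         // ... the existing blocking RPC invocation ...
>     } catch (OutOfMemoryError oom) {
>         // Don't swallow a direct-memory OOME; surface it with a hint.
>         if (String.valueOf(oom.getMessage()).contains("Direct buffer memory")) {
>             throw new com.google.protobuf.ServiceException(
>                 "Direct buffer OOME during RPC; consider setting "
>                     + "-Djdk.nio.maxCachedBufferSize=262144 to bound the "
>                     + "NIO thread-local direct buffer cache", oom);
>         }
>         throw oom;
>     }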
> What's your take?
>
> > Thanks for updating the list,
> > S
> >
> > > Yet another proof that correct handling of OOME is hard.
> > > Thanks,
> > > Daniel
> > >
> > > 2017-10-11 11:33 GMT+02:00 Daniel Jeliński <[email protected]>:
> > > > Thanks for the hints. I'll see if we can explicitly set
> > > > MaxDirectMemorySize to a safe number.
> > > > Thanks,
> > > > Daniel
> > > >
> > > > 2017-10-10 21:10 GMT+02:00 Esteban Gutierrez <[email protected]>:
> > > > > http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/misc/VM.java#l184
> > > > >
> > > > >     // The initial value of this field is arbitrary; during JRE initialization
> > > > >     // it will be reset to the value specified on the command line, if any,
> > > > >     // otherwise to Runtime.getRuntime().maxMemory().
> > > > >
> > > > > which goes all the way down to memory/heap.cpp, to whatever was left of
> > > > > the reserved memory, depending on the flags and the platform used, as
> > > > > Vladimir says.
> > > > >
> > > > > Also, depending on which distribution and features are used, there are
> > > > > specific guidelines about setting that parameter, so mileage might vary.
> > > > >
> > > > > thanks,
> > > > > esteban.
> > > > >
> > > > > --
> > > > > Cloudera, Inc.
> > > > >
> > > > > On Tue, Oct 10, 2017 at 1:35 PM, Vladimir Rodionov <[email protected]> wrote:
> > > > > > > The default value is zero, which means the maximum direct memory
> > > > > > > is unbounded.
> > > > > >
> > > > > > That is not correct. If you do not specify MaxDirectMemorySize, the
> > > > > > default is platform specific.
> > > > > >
> > > > > > The link above is for the JRockit JVM, I presume?
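> > > > > >
> > > > > > A quick way to check what a given JVM actually picked (JDK 8 here;
> > > > > > sun.misc.VM is an internal API, so treat this as debugging code only):
> > > > > >
> > > > > >     // With no -XX:MaxDirectMemorySize flag, HotSpot 8 resets this
> > > > > >     // value to Runtime.getRuntime().maxMemory(), per the VM.java
> > > > > >     // comment quoted above.
> > > > > >     System.out.println("max heap:   " + Runtime.getRuntime().maxMemory());
> > > > > >     System.out.println("max direct: " + sun.misc.VM.maxDirectMemory());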
> > > > > >
> > > > > > On Tue, Oct 10, 2017 at 11:19 AM, Esteban Gutierrez <[email protected]> wrote:
> > > > > > > I don't think it is truly unbounded; IIRC it is limited to the
> > > > > > > maximum allocated heap.
> > > > > > >
> > > > > > > thanks,
> > > > > > > esteban.
> > > > > > >
> > > > > > > --
> > > > > > > Cloudera, Inc.
> > > > > > >
> > > > > > > On Tue, Oct 10, 2017 at 1:11 PM, Ted Yu <[email protected]> wrote:
> > > > > > > > From https://docs.oracle.com/cd/E15289_01/doc.40/e15062/optionxx.htm :
> > > > > > > >
> > > > > > > >     java -XX:MaxDirectMemorySize=2g myApp
> > > > > > > >
> > > > > > > > Default Value
> > > > > > > >
> > > > > > > > The default value is zero, which means the maximum direct memory
> > > > > > > > is unbounded.
> > > > > > > >
> > > > > > > > On Tue, Oct 10, 2017 at 11:04 AM, Vladimir Rodionov <[email protected]> wrote:
> > > > > > > > > > MaxDirectMemorySize is set to the default 0, which means
> > > > > > > > > > unlimited as far as I can tell.
> > > > > > > > >
> > > > > > > > > Not sure if this is true. The only confirmation of that I found
> > > > > > > > > was for the JRockit JVM.
> > > > > > > > >
> > > > > > > > > On Mon, Oct 9, 2017 at 11:29 PM, Daniel Jeliński <[email protected]> wrote:
> > > > > > > > > > Vladimir,
> > > > > > > > > > MaxDirectMemorySize is set to the default 0, which means
> > > > > > > > > > unlimited as far as I can tell.
> > > > > > > > > > Thanks,
> > > > > > > > > > Daniel
> > > > > > > > > >
> > > > > > > > > > 2017-10-09 19:30 GMT+02:00 Vladimir Rodionov <[email protected]>:
> > > > > > > > > > > Have you tried increasing the direct memory size for the
> > > > > > > > > > > server process? -XX:MaxDirectMemorySize=?
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Oct 9, 2017 at 2:12 AM, Daniel Jeliński <[email protected]> wrote:
> > > > > > > > > > > > Hello,
> > > > > > > > > > > > I'm running an application doing a lot of Puts (size anywhere
> > > > > > > > > > > > between 0 and 10MB, one cell at a time); occasionally I'm
> > > > > > > > > > > > getting an error like the below:
> > > > > > > > > > > >
> > > > > > > > > > > > 2017-10-09 04:29:29,811 WARN [AsyncProcess] - #13368,
> > > > > > > > > > > > table=researchplatform:repo_stripe, attempt=1/1 failed=1ops, last
> > > > > > > > > > > > exception: java.io.IOException: com.google.protobuf.ServiceException:
> > > > > > > > > > > > java.lang.OutOfMemoryError: Direct buffer memory on
> > > > > > > > > > > > c169dzv.int.westgroup.com,60020,1506476748534, tracking started
> > > > > > > > > > > > Mon Oct 09 04:29:29 EDT 2017; not retrying 1 - final failure
> > > > > > > > > > > >
> > > > > > > > > > > > After that the connection to the RegionServer becomes unusable.
> > > > > > > > > > > > Every subsequent attempt to execute a Put on that connection
> > > > > > > > > > > > results in a CallTimeoutException. I only found the OutOfMemory
> > > > > > > > > > > > by reducing the number of tries to 1.
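> > > > > > > > > > > >
> > > > > > > > > > > > For reference, the write path boils down to the following
> > > > > > > > > > > > (a simplified sketch, not my code verbatim; only the table
> > > > > > > > > > > > name is real, and the retries setting is how I surfaced
> > > > > > > > > > > > the error):
> > > > > > > > > > > >
> > > > > > > > > > > >     import org.apache.hadoop.conf.Configuration;
> > > > > > > > > > > >     import org.apache.hadoop.hbase.HBaseConfiguration;
> > > > > > > > > > > >     import org.apache.hadoop.hbase.TableName;
> > > > > > > > > > > >     import org.apache.hadoop.hbase.client.*;
> > > > > > > > > > > >     import org.apache.hadoop.hbase.util.Bytes;
> > > > > > > > > > > >
> > > > > > > > > > > >     public class LargePutSample {
> > > > > > > > > > > >         public static void main(String[] args) throws Exception {
> > > > > > > > > > > >             Configuration conf = HBaseConfiguration.create();
> > > > > > > > > > > >             conf.setInt("hbase.client.retries.number", 1); // fail fast; retries were hiding the OOME
> > > > > > > > > > > >             try (Connection conn = ConnectionFactory.createConnection(conf);
> > > > > > > > > > > >                  Table table = conn.getTable(TableName.valueOf("researchplatform:repo_stripe"))) {
> > > > > > > > > > > >                 byte[] value = new byte[10 * 1024 * 1024]; // cell values: anywhere from 0 to ~10 MB
> > > > > > > > > > > >                 Put put = new Put(Bytes.toBytes("row-key"));
> > > > > > > > > > > >                 put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), value);
> > > > > > > > > > > >                 table.put(put); // one cell at a time
> > > > > > > > > > > >             }
> > > > > > > > > > > >         }
> > > > > > > > > > > >     }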
> > > > > > > > > > > >
> > > > > > > > > > > > The host running HBase appears to have at least a few GB of
> > > > > > > > > > > > free memory available. Server logs do not mention anything
> > > > > > > > > > > > about this error. The cluster is running HBase 1.2.0-cdh5.10.2.
> > > > > > > > > > > >
> > > > > > > > > > > > Is this a known problem? Are there workarounds available?
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Daniel
