Same results with 2G tserver.memory.maps.max. Maybe we just reached the limit :)
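
For reference, the change under test lives in accumulo-site.xml and looks like
this (a minimal sketch; the property names and the 2G value come from this
thread, everything else is illustrative):

  <!-- Native maps live outside the JVM heap, so keep
       total_ram > JVM_heap + tserver.memory.maps.max
       (here: 7.5 GB total, 3G heap, 2G native maps). -->
  <property>
    <name>tserver.memory.maps.native.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>tserver.memory.maps.max</name>
    <value>2G</value>
  </property>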
On Mon, Feb 10, 2014 at 7:08 PM, Diego Woitasen <[email protected]> wrote:
> On Mon, Feb 10, 2014 at 6:21 PM, Josh Elser <[email protected]> wrote:
>> I assume you're running a datanode alongside the tserver on that node? That
>> may be stretching the capabilities of that node (not to mention ec2 nodes
>> tend to be a little flaky in general). 2G for tserver.memory.maps.max
>> might be a little safer.
>>
>> You got an error in a tserver log about that IOException in internalRead.
>> After that, the tserver was still alive? And the proxy client was dead -
>> quit normally?
>
> Yes, everything is still alive.
>
>> If that's the case, the proxy might just be disconnecting in a noisy manner?
>
> Right!
>
> I'll try with 2G tserver.memory.maps.max.
>
>> On 2/10/14, 3:38 PM, Diego Woitasen wrote:
>>> Hi,
>>>   I tried increasing tserver.memory.maps.max to 3G and it failed
>>> again, but with a different error. I have a heap size of 3G and 7.5 GB of
>>> total RAM.
>>>
>>> The error that I found in the failing tserver is:
>>>
>>> 2014-02-08 03:37:35,497 [util.TServerUtils$THsHaServer] WARN : Got an
>>> IOException in internalRead!
>>>
>>> The tserver hasn't crashed, but the client was disconnected during the
>>> test.
>>>
>>> Another hint is welcome :)
>>>
>>> On Mon, Feb 3, 2014 at 3:58 PM, Josh Elser <[email protected]> wrote:
>>>> Oh, ok. So that isn't quite as bad as it seems.
>>>>
>>>> The "commits are held" exception is thrown when the tserver is running low
>>>> on memory. The tserver will block new mutations coming in until it can
>>>> process the ones it already has and free up some memory. It makes sense
>>>> that you would see this more often when you have more proxy servers, as the
>>>> total amount of Mutations you can send to your Accumulo instance is
>>>> increased. With one proxy server, your tserver had enough memory to process
>>>> the incoming data. With many proxy servers, your tservers would likely fall
>>>> over eventually because they'll get bogged down in JVM garbage collection.
>>>>
>>>> If you have more memory that you can give the tservers, that would help.
>>>> Also, you should make sure that you're using the Accumulo native maps, as
>>>> these use off-JVM-heap space instead of JVM heap, which should help
>>>> tremendously with your ingest rates.
>>>>
>>>> Native maps should be on by default unless you turned them off using the
>>>> property 'tserver.memory.maps.native.enabled' in accumulo-site.xml.
>>>> Additionally, you can try increasing the size of the native maps using
>>>> 'tserver.memory.maps.max' in accumulo-site.xml. Just be aware that with the
>>>> native maps, you need to ensure that total_ram > JVM_heap +
>>>> tserver.memory.maps.max.
>>>>
>>>> - Josh
>>>>
>>>> On 2/3/14, 1:33 PM, Diego Woitasen wrote:
>>>>> I've launched the cluster again and I was able to reproduce the error.
>>>>>
>>>>> In the proxy I had the same error that I mentioned in one of my previous
>>>>> messages, about a failure in a tablet server. I checked the log of that
>>>>> tablet server and I found:
>>>>>
>>>>> 2014-02-03 18:02:24,065 [thrift.ProcessFunction] ERROR: Internal error
>>>>> processing update
>>>>> org.apache.accumulo.server.tabletserver.HoldTimeoutException: Commits are
>>>>> held
>>>>>
>>>>> A lot of times.
>>>>>
>>>>> Full log if someone wants to have a look:
>>>>>
>>>>> http://www.vhgroup.net/diegows/tserver_matrix-slave-07.accumulo-ec2-test.com.debug.log
>>>>>
>>>>> Regards,
>>>>>   Diego
>>>>>
>>>>> On Mon, Feb 3, 2014 at 12:11 PM, Josh Elser <[email protected]> wrote:
>>>>>> I would assume that the proxy service would become a bottleneck fairly
>>>>>> quickly and your throughput would benefit from running multiple proxies,
>>>>>> but I don't have substantive numbers to back up that assertion.
>>>>>>
>>>>>> I'll put this on my list and see if I can reproduce something.
>>>>>>
>>>>>> On 2/3/14, 7:42 AM, Diego Woitasen wrote:
>>>>>>> I have to run the tests again because they were ec2 instances and I've
>>>>>>> destroyed them. It's easy to reproduce, BTW.
>>>>>>>
>>>>>>> My question is, does it make sense to run multiple proxies? Is there
>>>>>>> a limit? Right now I'm trying with 10 nodes and 10 proxies (running on
>>>>>>> every node). Maybe that doesn't make sense or it's a buggy
>>>>>>> configuration.
>>>>>>>
>>>>>>> On Fri, Jan 31, 2014 at 7:29 PM, Josh Elser <[email protected]> wrote:
>>>>>>>> When you had multiple proxies, what were the failures on that tablet
>>>>>>>> server (10.202.6.46:9997)?
>>>>>>>>
>>>>>>>> I'm curious why using one proxy didn't cause errors but multiple did.
>>>>>>>>
>>>>>>>> On 1/31/14, 4:44 PM, Diego Woitasen wrote:
>>>>>>>>> I've reproduced the error and I've found this in the proxy logs:
>>>>>>>>>
>>>>>>>>> 2014-01-31 19:47:50,430 [server.THsHaServer] WARN : Got an
>>>>>>>>> IOException in internalRead!
>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>>>>>>>>>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
>>>>>>>>>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
>>>>>>>>> 2014-01-31 19:51:13,185 [impl.ThriftTransportPool] WARN : Server
>>>>>>>>> 10.202.6.46:9997:9997 (30000) had 20 failures in a short time period,
>>>>>>>>> will not complain anymore
>>>>>>>>>
>>>>>>>>> A lot of these messages appear in all the proxies.
>>>>>>>>>
>>>>>>>>> I tried the same stress test against one proxy and I was able to
>>>>>>>>> increase the load without getting any error.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>   Diego
>>>>>>>>>
>>>>>>>>> On Thu, Jan 30, 2014 at 2:47 PM, Keith Turner <[email protected]> wrote:
>>>>>>>>>> Do you see more information in the proxy logs? "# exceptions 1" indicates
>>>>>>>>>> an unexpected exception occurred in the batch writer client code. The
>>>>>>>>>> proxy uses this client code, so maybe there will be a more detailed
>>>>>>>>>> stack trace in its logs.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 30, 2014 at 9:46 AM, Diego Woitasen <[email protected]> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>   I'm testing with a ten node cluster with the proxy enabled on all
>>>>>>>>>>> the nodes. I'm doing a stress test, balancing the connections across
>>>>>>>>>>> the proxies using round robin. When I increase the load (400 workers
>>>>>>>>>>> writing) I get this error:
>>>>>>>>>>>
>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>> # constraint violations : 0 security codes: [] # server errors 0 #
>>>>>>>>>>> exceptions 1')
>>>>>>>>>>>
>>>>>>>>>>> The complete message is:
>>>>>>>>>>>
>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>> # constraint violations : 0 security codes: [] # server errors 0 #
>>>>>>>>>>> exceptions 1')
>>>>>>>>>>> kvlayer-test client failed!
>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>   File "tests/kvlayer/test_accumulo_throughput.py", line 64, in __call__
>>>>>>>>>>>     self.client.put('t1', ((u,), self.one_mb))
>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_decorators.py", line 26, in wrapper
>>>>>>>>>>>     return method(*args, **kwargs)
>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_accumulo.py", line 154, in put
>>>>>>>>>>>     batch_writer.close()
>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/__init__.py", line 126, in close
>>>>>>>>>>>     self._conn.client.closeWriter(self._writer)
>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3149, in closeWriter
>>>>>>>>>>>     self.recv_closeWriter()
>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3172, in recv_closeWriter
>>>>>>>>>>>     raise result.ouch2
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if the error is produced by the way I'm using the
>>>>>>>>>>> cluster with multiple proxies; maybe I should use just one.
>>>>>>>>>>>
>>>>>>>>>>> Ideas are welcome.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>   Diego
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Diego Woitasen
>>>>>>>>>>> VHGroup - Linux and Open Source solutions architect
>>>>>>>>>>> www.vhgroup.net
>
> --
> Diego Woitasen
> VHGroup - Linux and Open Source solutions architect
> www.vhgroup.net

--
Diego Woitasen
VHGroup - Linux and Open Source solutions architect
www.vhgroup.net
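
(For readers reproducing the setup above: a minimal pyaccumulo sketch of the
round-robin write pattern described in the thread. This is not the actual
kvlayer test code; the hostnames, credentials, default proxy port 42424, and
column names are illustrative assumptions.)

import itertools
from pyaccumulo import Accumulo, Mutation

# Hypothetical endpoints: one proxy per node, ten nodes, as in the thread.
PROXIES = itertools.cycle([("node%02d.example.com" % i, 42424)
                           for i in range(1, 11)])

def write_one_mb(table, row, one_mb_value):
    # Round-robin proxy selection: each call goes to the next proxy.
    host, port = next(PROXIES)
    conn = Accumulo(host=host, port=port, user="root", password="secret")
    writer = conn.create_batch_writer(table)
    m = Mutation(row)
    m.put(cf="data", cq="payload", val=one_mb_value)
    writer.add_mutation(m)
    # Mutations are buffered on the proxy; server-side failures (like the
    # MutationsRejectedException with "# exceptions 1" above) surface here,
    # on close() - which is why the traceback ends in recv_closeWriter()
    # rather than in put().
    writer.close()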
