I'm using random keys for these tests. They are uuid4 keys.

On Tue, Feb 11, 2014 at 1:04 PM, Josh Elser <[email protected]> wrote:
> The other thing I thought about: what's the distribution of Key-Values
> that you're writing? Specifically, do many of the Keys sort "near" each
> other? Similarly, do you notice excessive load on some tservers but not all
> (the "Tablet Servers" page on the Monitor is a good check)?
>
> Consider the following: you have 10 tservers and you have 10 proxy servers.
> The first thought is that 10 tservers should be plenty to balance the load
> of those 10 proxy servers. However, a problem arises if the data that
> each of those proxy servers is writing happens to reside on a _small number
> of tablet servers_. Thus, your 10 proxy servers might only be writing to one
> or two tablet servers.
>
> If you notice that you're getting skew like this (or even just know that
> you're apt to have a situation where multiple clients might write data that
> sorts close to one another), it would be a good idea to add splits to your
> table before starting your workload.
>
> e.g. if you consider that your Key-space is the numbers from 1 to 10, and
> you have ten tservers, it would be a good idea to add splits 1, 2, ... 10,
> so that each tserver hosts at least one tablet (e.g. [1,2), [2,3) ...
> [10,+inf)). Having at least 5 or 10 tablets per tserver per table (split
> according to the distribution of your data) might help ease the load.
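Since the keys here are uuid4 strings (uniformly distributed hex), splits on
fixed-length hex prefixes follow directly from that advice. A minimal sketch
using the proxy's addSplits Thrift call; the host, credentials, and the table
name "t1" are illustrative, and it assumes pyaccumulo exposes the generated
Thrift client and login token as conn.client and conn.login (the traceback
later in the thread shows the client attribute):

    from pyaccumulo import Accumulo

    # Any single proxy will do for a one-time table operation. Hypothetical
    # connection details; 42424 is the proxy's usual default port.
    conn = Accumulo(host="proxy-01", port=42424, user="root", password="secret")

    # uuid4 keys spread uniformly over hex strings, so two-hex-digit prefix
    # splits give evenly loaded tablets: the 64 splits here work out to
    # roughly 6-7 tablets per tserver on a 10-tserver cluster.
    splits = set("%02x" % i for i in range(0, 256, 4))

    # addSplits is part of the proxy Thrift API.
    conn.client.addSplits(conn.login, "t1", splits)

The same splits could also be added up front from the Accumulo shell with the
addsplits command.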
> On 2/11/14, 10:47 AM, Diego Woitasen wrote:
>> Same results with 2G tserver.memory.maps.max.
>>
>> Maybe we just reached the limit :)
>>
>> On Mon, Feb 10, 2014 at 7:08 PM, Diego Woitasen <[email protected]> wrote:
>>> On Mon, Feb 10, 2014 at 6:21 PM, Josh Elser <[email protected]> wrote:
>>>> I assume you're running a datanode alongside the tserver on that node?
>>>> That may be stretching the capabilities of that node (not to mention ec2
>>>> nodes tend to be a little flaky in general). 2G for the
>>>> tserver.memory.maps.max might be a little safer.
>>>>
>>>> You got an error in a tserver log about that IOException in internalRead.
>>>> After that, the tserver was still alive? And the proxy client was dead -
>>>> quit normally?
>>>
>>> Yes, everything is still alive.
>>>
>>>> If that's the case, the proxy might just be disconnecting in a noisy
>>>> manner?
>>>
>>> Right!
>>>
>>> I'll try with 2G tserver.memory.maps.max.
>>>
>>>> On 2/10/14, 3:38 PM, Diego Woitasen wrote:
>>>>> Hi,
>>>>>   I tried increasing the tserver.memory.maps.max to 3G and it failed
>>>>> again, but with a different error. I have a heap size of 3G and 7.5 GB
>>>>> of total RAM.
>>>>>
>>>>> The error that I've found in the crashed tserver is:
>>>>>
>>>>> 2014-02-08 03:37:35,497 [util.TServerUtils$THsHaServer] WARN : Got an
>>>>> IOException in internalRead!
>>>>>
>>>>> The tserver hasn't crashed, but the client was disconnected during the
>>>>> test.
>>>>>
>>>>> Another hint is welcome :)
>>>>>
>>>>> On Mon, Feb 3, 2014 at 3:58 PM, Josh Elser <[email protected]> wrote:
>>>>>> Oh, ok. So that isn't quite as bad as it seems.
>>>>>>
>>>>>> The "commits are held" exception is thrown when the tserver is running
>>>>>> low on memory. The tserver will block new mutations coming in until it
>>>>>> can process the ones it already has and free up some memory. It makes
>>>>>> sense that you would see this more often when you have more proxy
>>>>>> servers, as the total amount of Mutations you can send to your Accumulo
>>>>>> instance is increased. With one proxy server, your tserver had enough
>>>>>> memory to process the incoming data. With many proxy servers, your
>>>>>> tservers would likely fall over eventually because they'll get bogged
>>>>>> down in JVM garbage collection.
>>>>>>
>>>>>> If you have more memory that you can give the tservers, that would
>>>>>> help. Also, you should make sure that you're using the Accumulo native
>>>>>> maps, as these use off-JVM-heap space instead of JVM heap, which should
>>>>>> help tremendously with your ingest rates.
>>>>>>
>>>>>> Native maps should be on by default unless you turned them off using
>>>>>> the property 'tserver.memory.maps.native.enabled' in accumulo-site.xml.
>>>>>> Additionally, you can try increasing the size of the native maps using
>>>>>> 'tserver.memory.maps.max' in accumulo-site.xml. Just be aware that with
>>>>>> the native maps, you need to ensure that total_ram > JVM_heap +
>>>>>> tserver.memory.maps.max.
>>>>>>
>>>>>> - Josh
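For reference, those two properties as they would sit in accumulo-site.xml.
The values are only illustrative (2G is the figure suggested earlier in the
thread), and the total_ram > JVM_heap + tserver.memory.maps.max rule still has
to hold for the node:

    <property>
      <name>tserver.memory.maps.native.enabled</name>
      <value>true</value>
      <!-- the default; native maps live outside the JVM heap -->
    </property>
    <property>
      <name>tserver.memory.maps.max</name>
      <value>2G</value>
      <!-- e.g. on a 7.5 GB node with a 3G JVM heap, 2G still leaves
           headroom for the OS and a co-located datanode -->
    </property>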
>>>>>>
>>>>>> On 2/3/14, 1:33 PM, Diego Woitasen wrote:
>>>>>>> I've launched the cluster again and I was able to reproduce the error:
>>>>>>>
>>>>>>> In the proxy I had the same error that I mentioned in one of my
>>>>>>> previous messages, about a failure in a tablet server. I checked the
>>>>>>> log of that tablet server and I found:
>>>>>>>
>>>>>>> 2014-02-03 18:02:24,065 [thrift.ProcessFunction] ERROR: Internal error
>>>>>>> processing update
>>>>>>> org.apache.accumulo.server.tabletserver.HoldTimeoutException: Commits
>>>>>>> are held
>>>>>>>
>>>>>>> A lot of times.
>>>>>>>
>>>>>>> Full log if someone wants to have a look:
>>>>>>>
>>>>>>> http://www.vhgroup.net/diegows/tserver_matrix-slave-07.accumulo-ec2-test.com.debug.log
>>>>>>>
>>>>>>> Regards,
>>>>>>> Diego
>>>>>>>
>>>>>>> On Mon, Feb 3, 2014 at 12:11 PM, Josh Elser <[email protected]> wrote:
>>>>>>>> I would assume that that proxy service would become a bottleneck
>>>>>>>> fairly quickly and your throughput would benefit from running multiple
>>>>>>>> proxies, but I don't have substantive numbers to back up that
>>>>>>>> assertion.
>>>>>>>>
>>>>>>>> I'll put this on my list and see if I can reproduce something.
>>>>>>>>
>>>>>>>> On 2/3/14, 7:42 AM, Diego Woitasen wrote:
>>>>>>>>> I have to run the tests again because they were ec2 instances and
>>>>>>>>> I've destroyed them. It's easy to reproduce, BTW.
>>>>>>>>>
>>>>>>>>> My question is, does it make sense to run multiple proxies? Is there
>>>>>>>>> a limit? Right now I'm trying with 10 nodes and 10 proxies (running
>>>>>>>>> on every node). Maybe that doesn't make sense, or it's a buggy
>>>>>>>>> configuration.
>>>>>>>>>
>>>>>>>>> On Fri, Jan 31, 2014 at 7:29 PM, Josh Elser <[email protected]> wrote:
>>>>>>>>>> When you had multiple proxies, what were the failures on that
>>>>>>>>>> tablet server (10.202.6.46:9997)?
>>>>>>>>>>
>>>>>>>>>> I'm curious why using one proxy didn't cause errors but multiple
>>>>>>>>>> did.
>>>>>>>>>>
>>>>>>>>>> On 1/31/14, 4:44 PM, Diego Woitasen wrote:
>>>>>>>>>>> I've reproduced the error and I've found this in the proxy logs:
>>>>>>>>>>>
>>>>>>>>>>> 2014-01-31 19:47:50,430 [server.THsHaServer] WARN : Got an
>>>>>>>>>>> IOException in internalRead!
>>>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>>>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>>>>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>>>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>>>>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>>>>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>>>>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>>>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>>>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>>>>>>>>>>>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
>>>>>>>>>>>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
>>>>>>>>>>> 2014-01-31 19:51:13,185 [impl.ThriftTransportPool] WARN : Server
>>>>>>>>>>> 10.202.6.46:9997:9997 (30000) had 20 failures in a short time
>>>>>>>>>>> period, will not complain anymore
>>>>>>>>>>>
>>>>>>>>>>> A lot of these messages appear in all the proxies.
>>>>>>>>>>>
>>>>>>>>>>> I tried the same stress test against one proxy and I was able to
>>>>>>>>>>> increase the load without getting any error.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Diego
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 30, 2014 at 2:47 PM, Keith Turner <[email protected]> wrote:
>>>>>>>>>>>> Do you see more information in the proxy logs? "# exceptions 1"
>>>>>>>>>>>> indicates an unexpected exception occurred in the batch writer
>>>>>>>>>>>> client code. The proxy uses this client code, so maybe there will
>>>>>>>>>>>> be a more detailed stack trace in its logs.
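To make that failure path concrete: a minimal pyaccumulo write (hypothetical
connection details), showing where an error raised in the batch writer client
code inside the proxy comes back to the Python caller, as in the traceback
quoted below:

    from pyaccumulo import Accumulo, Mutation

    conn = Accumulo(host="proxy-01", port=42424, user="root", password="secret")
    writer = conn.create_batch_writer("t1")

    m = Mutation("row_1")
    m.put(cf="cf", cq="cq", val="value")
    writer.add_mutation(m)   # mutations are buffered, not sent immediately

    # A tserver-side failure (e.g. "Commits are held") typically surfaces
    # when the buffered mutations are flushed at close(), re-wrapped by the
    # proxy -- which is why the traceback below ends in closeWriter().
    writer.close()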
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 30, 2014 at 9:46 AM, Diego Woitasen <[email protected]> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>   I'm testing with a ten-node cluster with the proxy enabled on
>>>>>>>>>>>>> all the nodes. I'm doing a stress test, balancing the connections
>>>>>>>>>>>>> between the proxies using round robin. When I increase the load
>>>>>>>>>>>>> (400 workers writing) I get this error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>>>> # constraint violations : 0 security codes: [] # server errors 0
>>>>>>>>>>>>> # exceptions 1')
>>>>>>>>>>>>>
>>>>>>>>>>>>> The complete message is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>>>> # constraint violations : 0 security codes: [] # server errors 0
>>>>>>>>>>>>> # exceptions 1')
>>>>>>>>>>>>> kvlayer-test client failed!
>>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>>   File "tests/kvlayer/test_accumulo_throughput.py", line 64, in __call__
>>>>>>>>>>>>>     self.client.put('t1', ((u,), self.one_mb))
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_decorators.py", line 26, in wrapper
>>>>>>>>>>>>>     return method(*args, **kwargs)
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_accumulo.py", line 154, in put
>>>>>>>>>>>>>     batch_writer.close()
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/__init__.py", line 126, in close
>>>>>>>>>>>>>     self._conn.client.closeWriter(self._writer)
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3149, in closeWriter
>>>>>>>>>>>>>     self.recv_closeWriter()
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3172, in recv_closeWriter
>>>>>>>>>>>>>     raise result.ouch2
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure if the error is produced by the way I'm using the
>>>>>>>>>>>>> cluster with multiple proxies; maybe I should use one.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ideas are welcome.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Diego
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Diego Woitasen
>>>>>>>>>>>>> VHGroup - Linux and Open Source solutions architect
>>>>>>>>>>>>> www.vhgroup.net
>>>
>>> --
>>> Diego Woitasen
>>> VHGroup - Linux and Open Source solutions architect
>>> www.vhgroup.net
--
Diego Woitasen
VHGroup - Linux and Open Source solutions architect
www.vhgroup.net
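For completeness, the round-robin, multi-proxy client setup described in the
original message at the bottom of the thread might look like this sketch; the
host names, port, and credentials are hypothetical, and each worker simply
takes the next proxy in turn:

    from itertools import cycle

    from pyaccumulo import Accumulo

    # One proxy per node, as in the ten-node test setup described above.
    PROXIES = ["node-%02d.example.com" % i for i in range(1, 11)]
    _proxy_cycle = cycle(PROXIES)

    def connect():
        # Hand each worker a connection to the next proxy in round-robin
        # order; 42424 is the proxy's usual default port.
        return Accumulo(host=next(_proxy_cycle), port=42424,
                        user="root", password="secret")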
