For uuid4 keys, you might want to do [00, 01, 02, ..., 0e, 0f, 10, ..., fd, fe, ff] to cover the full range.
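Something like the sketch below will generate the full set and dump it to a file for the shell's `addsplits` command (a rough sketch only; the table name, file name, and shell invocation are placeholders -- check `help addsplits` in your shell for the exact option):

# Rough sketch: write the 256 two-hex-digit prefixes (00..ff) to a splits file
# for a table whose rows are uuid4 strings. File and table names are made up.
splits = ["%02x" % i for i in range(256)]

with open("splits.txt", "w") as f:
    f.write("\n".join(splits) + "\n")

# Then, in the Accumulo shell (check `help addsplits` for the exact flags):
#   addsplits -t mytable -sf /path/to/splits.txt
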
On Tue, Feb 11, 2014 at 9:16 AM, Josh Elser <[email protected]> wrote:
> Ok. Even so, try adding some split points to the tables before you begin (if you aren't already) as it will *greatly* smooth the startup.
>
> Something like [00, 01, 02, ... 10, 11, 12, .. 97, 98, 99] would be good. You can easily dump this to a file on local disk and run the `addsplits` command in the Accumulo shell and provide it that file with the -sf (I think) option.
>
> On 2/11/14, 12:00 PM, Diego Woitasen wrote:
>> I'm using random keys for these tests. They are uuid4 keys.
>>
>> On Tue, Feb 11, 2014 at 1:04 PM, Josh Elser <[email protected]> wrote:
>>> The other thing I thought about: what's the distribution of Key-Values that you're writing? Specifically, do many of the Keys sort "near" each other? Similarly, do you notice excessive load on some tservers, but not all (the "Tablet Servers" page on the Monitor is a good check)?
>>>
>>> Consider the following: you have 10 tservers and you have 10 proxy servers. The first thought is that 10 tservers should be plenty to balance the load of those 10 proxy servers. However, a problem arises if the data that each of those proxy servers is writing happens to reside on a _small number of tablet servers_. Thus, your 10 proxy servers might only be writing to one or two tabletservers.
>>>
>>> If you notice that you're getting skew like this (or even just know that you're apt to have a situation where multiple clients might write data that sorts close to one another), it would be a good idea to add splits to your table before starting your workload.
>>>
>>> e.g. if you consider that your Key-space is the numbers from 1 to 10, and you have ten tservers, it would be a good idea to add splits 1, 2, ... 10, so that each tserver hosts at least one tablet (e.g. [1,2), [2,3), ... [10,+inf)). Having at least 5 or 10 tablets per tserver per table (split according to the distribution of your data) might help ease the load.
>>>
>>> On 2/11/14, 10:47 AM, Diego Woitasen wrote:
>>>> Same results with 2G tserver.memory.maps.max.
>>>>
>>>> Maybe we just reached the limit :)
>>>>
>>>> On Mon, Feb 10, 2014 at 7:08 PM, Diego Woitasen <[email protected]> wrote:
>>>>> On Mon, Feb 10, 2014 at 6:21 PM, Josh Elser <[email protected]> wrote:
>>>>>> I assume you're running a datanode alongside the tserver on that node? That may be stretching the capabilities of that node (not to mention ec2 nodes tend to be a little flaky in general). 2G for the tserver.memory.maps.max might be a little safer.
>>>>>>
>>>>>> You got an error in a tserver log about that IOException in internalRead. After that, the tserver was still alive? And the proxy client was dead - quit normally?
>>>>>
>>>>> Yes, everything is still alive.
>>>>>
>>>>>> If that's the case, the proxy might just be disconnecting in a noisy manner?
>>>>>
>>>>> Right!
>>>>>
>>>>> I'll try with 2G tserver.memory.maps.max.
>>>>>
>>>>>> On 2/10/14, 3:38 PM, Diego Woitasen wrote:
>>>>>>> Hi,
>>>>>>> I tried increasing tserver.memory.maps.max to 3G and it failed again, but with a different error. I have a heap size of 3G and 7.5 GB of total RAM.
>>>>>>>
>>>>>>> The error that I've found in the crashed tserver is:
>>>>>>>
>>>>>>> 2014-02-08 03:37:35,497 [util.TServerUtils$THsHaServer] WARN : Got an IOException in internalRead!
>>>>>>>
>>>>>>> The tserver hasn't crashed, but the client was disconnected during the test.
>>>>>>>
>>>>>>> Another hint is welcome :)
>>>>>>>
>>>>>>> On Mon, Feb 3, 2014 at 3:58 PM, Josh Elser <[email protected]> wrote:
>>>>>>>> Oh, ok. So that isn't quite as bad as it seems.
>>>>>>>>
>>>>>>>> The "commits are held" exception is thrown when the tserver is running low on memory. The tserver will block new mutations coming in until it can process the ones it already has and free up some memory. It makes sense that you would see this more often when you have more proxy servers, as the total amount of Mutations you can send to your Accumulo instance is increased. With one proxy server, your tserver had enough memory to process the incoming data. With many proxy servers, your tservers would likely fall over eventually because they'll get bogged down in JVM garbage collection.
>>>>>>>>
>>>>>>>> If you have more memory that you can give the tservers, that would help. Also, you should make sure that you're using the Accumulo native maps, as this will use off-JVM-heap space instead of JVM heap, which should help tremendously with your ingest rates.
>>>>>>>>
>>>>>>>> Native maps should be on by default unless you turned them off using the property 'tserver.memory.maps.native.enabled' in accumulo-site.xml. Additionally, you can try increasing the size of the native maps using 'tserver.memory.maps.max' in accumulo-site.xml. Just be aware that with the native maps, you need to ensure that total_ram > JVM_heap + tserver.memory.maps.max.
>>>>>>>>
>>>>>>>> - Josh
>>>>>>>>
>>>>>>>> On 2/3/14, 1:33 PM, Diego Woitasen wrote:
>>>>>>>>> I've launched the cluster again and I was able to reproduce the error:
>>>>>>>>>
>>>>>>>>> In the proxy I had the same error that I mentioned in one of my previous messages, about a failure in a tablet server. I checked the log of that tablet server and I found:
>>>>>>>>>
>>>>>>>>> 2014-02-03 18:02:24,065 [thrift.ProcessFunction] ERROR: Internal error processing update
>>>>>>>>> org.apache.accumulo.server.tabletserver.HoldTimeoutException: Commits are held
>>>>>>>>>
>>>>>>>>> A lot of times.
>>>>>>>>>
>>>>>>>>> Full log if someone wants to have a look:
>>>>>>>>>
>>>>>>>>> http://www.vhgroup.net/diegows/tserver_matrix-slave-07.accumulo-ec2-test.com.debug.log
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Diego
>>>>>>>>>
>>>>>>>>> On Mon, Feb 3, 2014 at 12:11 PM, Josh Elser <[email protected]> wrote:
>>>>>>>>>> I would assume that that proxy service would become a bottleneck fairly quickly and your throughput would benefit from running multiple proxies, but I don't have substantive numbers to back up that assertion.
>>>>>>>>>>
>>>>>>>>>> I'll put this on my list and see if I can reproduce something.
>>>>>>>>>>
>>>>>>>>>> On 2/3/14, 7:42 AM, Diego Woitasen wrote:
>>>>>>>>>>> I have to run the tests again because they were ec2 instances and I've destroyed them. It's easy to reproduce, BTW.
>>>>>>>>>>>
>>>>>>>>>>> My question is, does it make sense to run multiple proxies? Is there a limit? Right now I'm trying with 10 nodes and 10 proxies (running on every node). Maybe that doesn't make sense or it's a buggy configuration.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 31, 2014 at 7:29 PM, Josh Elser <[email protected]> wrote:
>>>>>>>>>>>> When you had multiple proxies, what were the failures on that tablet server (10.202.6.46:9997)?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm curious why using one proxy didn't cause errors but multiple did.
>>>>>>>>>>>>
>>>>>>>>>>>> On 1/31/14, 4:44 PM, Diego Woitasen wrote:
>>>>>>>>>>>>> I've reproduced the error and I've found this in the proxy logs:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2014-01-31 19:47:50,430 [server.THsHaServer] WARN : Got an IOException in internalRead!
>>>>>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>>>>>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>>>>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>>>>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>>>>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>>>>>>     at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>>>>>>>>>>>>     at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>>>>>>>>>>>>>     at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>>>>>>>>>>>>>     at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>>>>>>>>>>>>>     at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
>>>>>>>>>>>>>     at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
>>>>>>>>>>>>> 2014-01-31 19:51:13,185 [impl.ThriftTransportPool] WARN : Server 10.202.6.46:9997:9997 (30000) had 20 failures in a short time period, will not complain anymore
>>>>>>>>>>>>>
>>>>>>>>>>>>> A lot of these messages appear in all the proxies.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried the same stress test against one proxy and I was able to increase the load without getting any error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Diego
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jan 30, 2014 at 2:47 PM, Keith Turner <[email protected]> wrote:
>>>>>>>>>>>>>> Do you see more information in the proxy logs? "# exceptions 1" indicates an unexpected exception occurred in the batch writer client code. The proxy uses this client code, so maybe there will be a more detailed stack trace in its logs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jan 30, 2014 at 9:46 AM, Diego Woitasen <[email protected]> wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> I'm testing with a ten node cluster with the proxy enabled on all the nodes. I'm doing a stress test, balancing the connections between the proxies using round robin. When I increase the load (400 workers writing) I get this error:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> AccumuloSecurityException: AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: [] # server errors 0 # exceptions 1')
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The complete message is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> AccumuloSecurityException: AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: [] # server errors 0 # exceptions 1')
>>>>>>>>>>>>>>> kvlayer-test client failed!
>>>>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>>>>   File "tests/kvlayer/test_accumulo_throughput.py", line 64, in __call__
>>>>>>>>>>>>>>>     self.client.put('t1', ((u,), self.one_mb))
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_decorators.py", line 26, in wrapper
>>>>>>>>>>>>>>>     return method(*args, **kwargs)
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_accumulo.py", line 154, in put
>>>>>>>>>>>>>>>     batch_writer.close()
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/__init__.py", line 126, in close
>>>>>>>>>>>>>>>     self._conn.client.closeWriter(self._writer)
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3149, in closeWriter
>>>>>>>>>>>>>>>     self.recv_closeWriter()
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3172, in recv_closeWriter
>>>>>>>>>>>>>>>     raise result.ouch2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure if the error is produced by the way I'm using the cluster with multiple proxies; maybe I should use only one.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ideas are welcome.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Diego
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Diego Woitasen
>>>>>>>>>>>>>>> VHGroup - Linux and Open Source solutions architect
>>>>>>>>>>>>>>> www.vhgroup.net
>>>>>
>>>>> --
>>>>> Diego Woitasen
>>>>> VHGroup - Linux and Open Source solutions architect
>>>>> www.vhgroup.net
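
For anyone trying to reproduce this, a minimal pyaccumulo sketch of the multi-proxy, round-robin write pattern described above looks roughly like the following. The hostnames, credentials, port, and table name are made up, and the pyaccumulo calls (Accumulo, create_batch_writer, Mutation, add_mutation, close) should be checked against the version you actually have installed:

# Sketch only: one pyaccumulo connection per proxy, handed out round-robin to
# the writer workers. Hostnames, credentials, port and table are placeholders.
import itertools

from pyaccumulo import Accumulo, Mutation

PROXIES = ["proxy-%02d.example.com" % i for i in range(1, 11)]  # 10 proxy hosts

# One connection per proxy; cycle() hands them out round-robin.
conns = [Accumulo(host=h, port=42424, user="root", password="secret")
         for h in PROXIES]
rr = itertools.cycle(conns)

def write_batch(rows, value):
    # Pick the next proxy in the rotation and push one batch through it.
    conn = next(rr)
    writer = conn.create_batch_writer("t1")
    for row in rows:
        m = Mutation(row)
        m.put(cf="cf", cq="cq", val=value)
        writer.add_mutation(m)
    # close() flushes the batch; this is where a MutationsRejectedException
    # from a tserver holding commits would surface to the client.
    writer.close()

Pre-splitting the table as suggested at the top of this message should spread those batches across all ten tservers instead of concentrating them on one or two.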
