I'm using random keys for these tests. They are uuid4 keys.

On Tue, Feb 11, 2014 at 1:04 PM, Josh Elser <[email protected]> wrote:
> The other thing I thought about: what's the distribution of Key-Values
> that you're writing? Specifically, do many of the Keys sort "near" each
> other? Similarly, do you notice excessive load on some tservers but not all
> (the "Tablet Servers" page on the Monitor is a good check)?
>
> Consider the following: you have 10 tservers and you have 10 proxy servers.
> The first thought is that 10 tservers should be plenty to balance the load
> of those 10 proxy servers. However, a problem arises if the data that
> each of those proxy servers is writing happens to reside on a _small number
> of tablet servers_. Thus, your 10 proxy servers might only be writing to one
> or two tablet servers.
>
> If you notice that you're getting skew like this (or even just know that
> you're apt to have a situation where multiple clients might write data that
> sorts close to one another), it would be a good idea to add splits to your
> table before starting your workload.
>
> e.g. if you consider that your Key-space is the numbers from 1 to 10, and
> you have ten tservers, it would be a good idea to add splits 1, 2, ... 10,
> so that each tserver hosts at least one tablet (e.g. [1,2), [2,3) ...
> [10,+inf)). Having at least 5 or 10 tablets per tserver per table (split
> according to the distribution of your data) might help ease the load.
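Since the keys here are uuid4 strings (uniformly distributed hex), splits on
fixed-length hex prefixes follow directly from that advice. A minimal sketch
using the proxy's addSplits Thrift call; the host, credentials, and the table
name "t1" are illustrative, and it assumes pyaccumulo exposes the generated
Thrift client and login token as conn.client and conn.login (the traceback
later in the thread shows the client attribute):

    from pyaccumulo import Accumulo

    # Any single proxy will do for a one-time table operation. Hypothetical
    # connection details; 42424 is the proxy's usual default port.
    conn = Accumulo(host="proxy-01", port=42424, user="root", password="secret")

    # uuid4 keys spread uniformly over hex strings, so two-hex-digit prefix
    # splits give evenly loaded tablets: the 64 splits here work out to
    # roughly 6-7 tablets per tserver on a 10-tserver cluster.
    splits = set("%02x" % i for i in range(0, 256, 4))

    # addSplits is part of the proxy Thrift API.
    conn.client.addSplits(conn.login, "t1", splits)

The same splits could also be added up front from the Accumulo shell with the
addsplits command.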
> On 2/11/14, 10:47 AM, Diego Woitasen wrote:
>> Same results with 2G tserver.memory.maps.max.
>>
>> Maybe we just reached the limit :)
>>
>> On Mon, Feb 10, 2014 at 7:08 PM, Diego Woitasen <[email protected]> wrote:
>>> On Mon, Feb 10, 2014 at 6:21 PM, Josh Elser <[email protected]> wrote:
>>>> I assume you're running a datanode alongside the tserver on that node?
>>>> That may be stretching the capabilities of that node (not to mention ec2
>>>> nodes tend to be a little flaky in general). 2G for the
>>>> tserver.memory.maps.max might be a little safer.
>>>>
>>>> You got an error in a tserver log about that IOException in internalRead.
>>>> After that, the tserver was still alive? And the proxy client was dead -
>>>> quit normally?
>>>
>>> Yes, everything is still alive.
>>>
>>>> If that's the case, the proxy might just be disconnecting in a noisy
>>>> manner?
>>>
>>> Right!
>>>
>>> I'll try with 2G tserver.memory.maps.max.
>>>
>>>> On 2/10/14, 3:38 PM, Diego Woitasen wrote:
>>>>> Hi,
>>>>>   I tried increasing the tserver.memory.maps.max to 3G and it failed
>>>>> again, but with a different error. I have a heap size of 3G and 7.5 GB
>>>>> of total RAM.
>>>>>
>>>>> The error that I've found in the crashed tserver is:
>>>>>
>>>>> 2014-02-08 03:37:35,497 [util.TServerUtils$THsHaServer] WARN : Got an
>>>>> IOException in internalRead!
>>>>>
>>>>> The tserver hasn't crashed, but the client was disconnected during the
>>>>> test.
>>>>>
>>>>> Another hint is welcome :)
>>>>>
>>>>> On Mon, Feb 3, 2014 at 3:58 PM, Josh Elser <[email protected]> wrote:
>>>>>> Oh, ok. So that isn't quite as bad as it seems.
>>>>>>
>>>>>> The "commits are held" exception is thrown when the tserver is running
>>>>>> low on memory. The tserver will block new mutations coming in until it
>>>>>> can process the ones it already has and free up some memory. It makes
>>>>>> sense that you would see this more often when you have more proxy
>>>>>> servers, as the total amount of Mutations you can send to your Accumulo
>>>>>> instance is increased. With one proxy server, your tserver had enough
>>>>>> memory to process the incoming data. With many proxy servers, your
>>>>>> tservers would likely fall over eventually because they'll get bogged
>>>>>> down in JVM garbage collection.
>>>>>>
>>>>>> If you have more memory that you can give the tservers, that would
>>>>>> help. Also, you should make sure that you're using the Accumulo native
>>>>>> maps, as these use off-JVM-heap space instead of JVM heap, which should
>>>>>> help tremendously with your ingest rates.
>>>>>>
>>>>>> Native maps should be on by default unless you turned them off using
>>>>>> the property 'tserver.memory.maps.native.enabled' in accumulo-site.xml.
>>>>>> Additionally, you can try increasing the size of the native maps using
>>>>>> 'tserver.memory.maps.max' in accumulo-site.xml. Just be aware that with
>>>>>> the native maps, you need to ensure that total_ram > JVM_heap +
>>>>>> tserver.memory.maps.max.
>>>>>>
>>>>>> - Josh
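For reference, those two properties as they would sit in accumulo-site.xml.
The values are only illustrative (2G is the figure suggested earlier in the
thread), and the total_ram > JVM_heap + tserver.memory.maps.max rule still has
to hold for the node:

    <property>
      <name>tserver.memory.maps.native.enabled</name>
      <value>true</value>
      <!-- the default; native maps live outside the JVM heap -->
    </property>
    <property>
      <name>tserver.memory.maps.max</name>
      <value>2G</value>
      <!-- e.g. on a 7.5 GB node with a 3G JVM heap, 2G still leaves
           headroom for the OS and a co-located datanode -->
    </property>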
>>>>>>
>>>>>> On 2/3/14, 1:33 PM, Diego Woitasen wrote:
>>>>>>> I've launched the cluster again and I was able to reproduce the error:
>>>>>>>
>>>>>>> In the proxy I had the same error that I mentioned in one of my
>>>>>>> previous messages, about a failure in a tablet server. I checked the
>>>>>>> log of that tablet server and I found:
>>>>>>>
>>>>>>> 2014-02-03 18:02:24,065 [thrift.ProcessFunction] ERROR: Internal error
>>>>>>> processing update
>>>>>>> org.apache.accumulo.server.tabletserver.HoldTimeoutException: Commits
>>>>>>> are held
>>>>>>>
>>>>>>> A lot of times.
>>>>>>>
>>>>>>> Full log if someone wants to have a look:
>>>>>>>
>>>>>>> http://www.vhgroup.net/diegows/tserver_matrix-slave-07.accumulo-ec2-test.com.debug.log
>>>>>>>
>>>>>>> Regards,
>>>>>>> Diego
>>>>>>>
>>>>>>> On Mon, Feb 3, 2014 at 12:11 PM, Josh Elser <[email protected]> wrote:
>>>>>>>> I would assume that that proxy service would become a bottleneck
>>>>>>>> fairly quickly and your throughput would benefit from running multiple
>>>>>>>> proxies, but I don't have substantive numbers to back up that
>>>>>>>> assertion.
>>>>>>>>
>>>>>>>> I'll put this on my list and see if I can reproduce something.
>>>>>>>>
>>>>>>>> On 2/3/14, 7:42 AM, Diego Woitasen wrote:
>>>>>>>>> I have to run the tests again because they were ec2 instances and
>>>>>>>>> I've destroyed them. It's easy to reproduce, BTW.
>>>>>>>>>
>>>>>>>>> My question is, does it make sense to run multiple proxies? Is there
>>>>>>>>> a limit? Right now I'm trying with 10 nodes and 10 proxies (running
>>>>>>>>> on every node). Maybe that doesn't make sense, or it's a buggy
>>>>>>>>> configuration.
>>>>>>>>>
>>>>>>>>> On Fri, Jan 31, 2014 at 7:29 PM, Josh Elser <[email protected]> wrote:
>>>>>>>>>> When you had multiple proxies, what were the failures on that
>>>>>>>>>> tablet server (10.202.6.46:9997)?
>>>>>>>>>>
>>>>>>>>>> I'm curious why using one proxy didn't cause errors but multiple
>>>>>>>>>> did.
>>>>>>>>>>
>>>>>>>>>> On 1/31/14, 4:44 PM, Diego Woitasen wrote:
>>>>>>>>>>> I've reproduced the error and I've found this in the proxy logs:
>>>>>>>>>>>
>>>>>>>>>>> 2014-01-31 19:47:50,430 [server.THsHaServer] WARN : Got an
>>>>>>>>>>> IOException in internalRead!
>>>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>>>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>>>>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>>>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>>>>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>>>>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>>>>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>>>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>>>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>>>>>>>>>>>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
>>>>>>>>>>>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
>>>>>>>>>>> 2014-01-31 19:51:13,185 [impl.ThriftTransportPool] WARN : Server
>>>>>>>>>>> 10.202.6.46:9997:9997 (30000) had 20 failures in a short time
>>>>>>>>>>> period, will not complain anymore
>>>>>>>>>>>
>>>>>>>>>>> A lot of these messages appear in all the proxies.
>>>>>>>>>>>
>>>>>>>>>>> I tried the same stress test against one proxy and I was able to
>>>>>>>>>>> increase the load without getting any error.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Diego
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 30, 2014 at 2:47 PM, Keith Turner <[email protected]> wrote:
>>>>>>>>>>>> Do you see more information in the proxy logs? "# exceptions 1"
>>>>>>>>>>>> indicates an unexpected exception occurred in the batch writer
>>>>>>>>>>>> client code. The proxy uses this client code, so maybe there will
>>>>>>>>>>>> be a more detailed stack trace in its logs.
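To make that failure path concrete: a minimal pyaccumulo write (hypothetical
connection details), showing where an error raised in the batch writer client
code inside the proxy comes back to the Python caller, as in the traceback
quoted below:

    from pyaccumulo import Accumulo, Mutation

    conn = Accumulo(host="proxy-01", port=42424, user="root", password="secret")
    writer = conn.create_batch_writer("t1")

    m = Mutation("row_1")
    m.put(cf="cf", cq="cq", val="value")
    writer.add_mutation(m)   # mutations are buffered, not sent immediately

    # A tserver-side failure (e.g. "Commits are held") typically surfaces
    # when the buffered mutations are flushed at close(), re-wrapped by the
    # proxy -- which is why the traceback below ends in closeWriter().
    writer.close()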
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 30, 2014 at 9:46 AM, Diego Woitasen <[email protected]> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>   I'm testing with a ten-node cluster with the proxy enabled on
>>>>>>>>>>>>> all the nodes. I'm doing a stress test, balancing the connections
>>>>>>>>>>>>> between the proxies using round robin. When I increase the load
>>>>>>>>>>>>> (400 workers writing) I get this error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>>>> # constraint violations : 0 security codes: [] # server errors 0
>>>>>>>>>>>>> # exceptions 1')
>>>>>>>>>>>>>
>>>>>>>>>>>>> The complete message is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>>>> # constraint violations : 0 security codes: [] # server errors 0
>>>>>>>>>>>>> # exceptions 1')
>>>>>>>>>>>>> kvlayer-test client failed!
>>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>>   File "tests/kvlayer/test_accumulo_throughput.py", line 64, in __call__
>>>>>>>>>>>>>     self.client.put('t1', ((u,), self.one_mb))
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_decorators.py", line 26, in wrapper
>>>>>>>>>>>>>     return method(*args, **kwargs)
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_accumulo.py", line 154, in put
>>>>>>>>>>>>>     batch_writer.close()
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/__init__.py", line 126, in close
>>>>>>>>>>>>>     self._conn.client.closeWriter(self._writer)
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3149, in closeWriter
>>>>>>>>>>>>>     self.recv_closeWriter()
>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3172, in recv_closeWriter
>>>>>>>>>>>>>     raise result.ouch2
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure if the error is produced by the way I'm using the
>>>>>>>>>>>>> cluster with multiple proxies; maybe I should use one.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ideas are welcome.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Diego
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Diego Woitasen
>>>>>>>>>>>>> VHGroup - Linux and Open Source solutions architect
>>>>>>>>>>>>> www.vhgroup.net
>>>
>>> --
>>> Diego Woitasen
>>> VHGroup - Linux and Open Source solutions architect
>>> www.vhgroup.net
--
Diego Woitasen
VHGroup - Linux and Open Source solutions architect
www.vhgroup.net
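For completeness, the round-robin, multi-proxy client setup described in the
original message at the bottom of the thread might look like this sketch; the
host names, port, and credentials are hypothetical, and each worker simply
takes the next proxy in turn:

    from itertools import cycle

    from pyaccumulo import Accumulo

    # One proxy per node, as in the ten-node test setup described above.
    PROXIES = ["node-%02d.example.com" % i for i in range(1, 11)]
    _proxy_cycle = cycle(PROXIES)

    def connect():
        # Hand each worker a connection to the next proxy in round-robin
        # order; 42424 is the proxy's usual default port.
        return Accumulo(host=next(_proxy_cycle), port=42424,
                        user="root", password="secret")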
