For uuid4 keys, you might want to do [00, 01, 02, ..., 0e, 0f, 10, ..., fd, fe, ff] to cover the full range.
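Something like the sketch below will generate the full set and dump it to a file for the shell's `addsplits` command (a rough sketch only; the table name, file name, and shell invocation are placeholders -- check `help addsplits` in your shell for the exact option):

# Rough sketch: write the 256 two-hex-digit prefixes (00..ff) to a splits file
# for a table whose rows are uuid4 strings. File and table names are made up.
splits = ["%02x" % i for i in range(256)]

with open("splits.txt", "w") as f:
    f.write("\n".join(splits) + "\n")

# Then, in the Accumulo shell (check `help addsplits` for the exact flags):
#   addsplits -t mytable -sf /path/to/splits.txt
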
On Tue, Feb 11, 2014 at 9:16 AM, Josh Elser <[email protected]> wrote:
> Ok. Even so, try adding some split points to the tables before you begin (if you aren't already) as it will *greatly* smooth the startup.
>
> Something like [00, 01, 02, ... 10, 11, 12, .. 97, 98, 99] would be good. You can easily dump this to a file on local disk and run the `addsplits` command in the Accumulo shell and provide it that file with the -sf (I think) option.
>
> On 2/11/14, 12:00 PM, Diego Woitasen wrote:
>> I'm using random keys for these tests. They are uuid4 keys.
>>
>> On Tue, Feb 11, 2014 at 1:04 PM, Josh Elser <[email protected]> wrote:
>>> The other thing I thought about: what's the distribution of Key-Values that you're writing? Specifically, do many of the Keys sort "near" each other? Similarly, do you notice excessive load on some tservers, but not all (the "Tablet Servers" page on the Monitor is a good check)?
>>>
>>> Consider the following: you have 10 tservers and you have 10 proxy servers. The first thought is that 10 tservers should be plenty to balance the load of those 10 proxy servers. However, a problem arises if the data that each of those proxy servers is writing happens to reside on a _small number of tablet servers_. Thus, your 10 proxy servers might only be writing to one or two tabletservers.
>>>
>>> If you notice that you're getting skew like this (or even just know that you're apt to have a situation where multiple clients might write data that sorts close to one another), it would be a good idea to add splits to your table before starting your workload.
>>>
>>> e.g. if you consider that your Key-space is the numbers from 1 to 10, and you have ten tservers, it would be a good idea to add splits 1, 2, ... 10, so that each tserver hosts at least one tablet (e.g. [1,2), [2,3), ... [10,+inf)). Having at least 5 or 10 tablets per tserver per table (split according to the distribution of your data) might help ease the load.
>>>
>>> On 2/11/14, 10:47 AM, Diego Woitasen wrote:
>>>> Same results with 2G tserver.memory.maps.max.
>>>>
>>>> Maybe we just reached the limit :)
>>>>
>>>> On Mon, Feb 10, 2014 at 7:08 PM, Diego Woitasen <[email protected]> wrote:
>>>>> On Mon, Feb 10, 2014 at 6:21 PM, Josh Elser <[email protected]> wrote:
>>>>>> I assume you're running a datanode alongside the tserver on that node? That may be stretching the capabilities of that node (not to mention ec2 nodes tend to be a little flaky in general). 2G for the tserver.memory.maps.max might be a little safer.
>>>>>>
>>>>>> You got an error in a tserver log about that IOException in internalRead. After that, the tserver was still alive? And the proxy client was dead - quit normally?
>>>>>
>>>>> Yes, everything is still alive.
>>>>>
>>>>>> If that's the case, the proxy might just be disconnecting in a noisy manner?
>>>>>
>>>>> Right!
>>>>>
>>>>> I'll try with 2G tserver.memory.maps.max.
>>>>>
>>>>>> On 2/10/14, 3:38 PM, Diego Woitasen wrote:
>>>>>>> Hi,
>>>>>>> I tried increasing tserver.memory.maps.max to 3G and it failed again, but with a different error. I have a heap size of 3G and 7.5 GB of total RAM.
>>>>>>>
>>>>>>> The error that I've found in the crashed tserver is:
>>>>>>>
>>>>>>> 2014-02-08 03:37:35,497 [util.TServerUtils$THsHaServer] WARN : Got an IOException in internalRead!
>>>>>>>
>>>>>>> The tserver hasn't crashed, but the client was disconnected during the test.
>>>>>>>
>>>>>>> Another hint is welcome :)
>>>>>>>
>>>>>>> On Mon, Feb 3, 2014 at 3:58 PM, Josh Elser <[email protected]> wrote:
>>>>>>>> Oh, ok. So that isn't quite as bad as it seems.
>>>>>>>>
>>>>>>>> The "commits are held" exception is thrown when the tserver is running low on memory. The tserver will block new mutations coming in until it can process the ones it already has and free up some memory. It makes sense that you would see this more often when you have more proxy servers, as the total amount of Mutations you can send to your Accumulo instance is increased. With one proxy server, your tserver had enough memory to process the incoming data. With many proxy servers, your tservers would likely fall over eventually because they'll get bogged down in JVM garbage collection.
>>>>>>>>
>>>>>>>> If you have more memory that you can give the tservers, that would help. Also, you should make sure that you're using the Accumulo native maps, as this will use off-JVM-heap space instead of JVM heap, which should help tremendously with your ingest rates.
>>>>>>>>
>>>>>>>> Native maps should be on by default unless you turned them off using the property 'tserver.memory.maps.native.enabled' in accumulo-site.xml. Additionally, you can try increasing the size of the native maps using 'tserver.memory.maps.max' in accumulo-site.xml. Just be aware that with the native maps, you need to ensure that total_ram > JVM_heap + tserver.memory.maps.max.
>>>>>>>>
>>>>>>>> - Josh
>>>>>>>>
>>>>>>>> On 2/3/14, 1:33 PM, Diego Woitasen wrote:
>>>>>>>>> I've launched the cluster again and I was able to reproduce the error:
>>>>>>>>>
>>>>>>>>> In the proxy I had the same error that I mentioned in one of my previous messages, about a failure in a tablet server. I checked the log of that tablet server and I found:
>>>>>>>>>
>>>>>>>>> 2014-02-03 18:02:24,065 [thrift.ProcessFunction] ERROR: Internal error processing update
>>>>>>>>> org.apache.accumulo.server.tabletserver.HoldTimeoutException: Commits are held
>>>>>>>>>
>>>>>>>>> A lot of times.
>>>>>>>>>
>>>>>>>>> Full log if someone wants to have a look:
>>>>>>>>>
>>>>>>>>> http://www.vhgroup.net/diegows/tserver_matrix-slave-07.accumulo-ec2-test.com.debug.log
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Diego
>>>>>>>>>
>>>>>>>>> On Mon, Feb 3, 2014 at 12:11 PM, Josh Elser <[email protected]> wrote:
>>>>>>>>>> I would assume that that proxy service would become a bottleneck fairly quickly and your throughput would benefit from running multiple proxies, but I don't have substantive numbers to back up that assertion.
>>>>>>>>>>
>>>>>>>>>> I'll put this on my list and see if I can reproduce something.
>>>>>>>>>>
>>>>>>>>>> On 2/3/14, 7:42 AM, Diego Woitasen wrote:
>>>>>>>>>>> I have to run the tests again because they were ec2 instances and I've destroyed them. It's easy to reproduce, BTW.
>>>>>>>>>>>
>>>>>>>>>>> My question is, does it make sense to run multiple proxies? Is there a limit? Right now I'm trying with 10 nodes and 10 proxies (running on every node). Maybe that doesn't make sense or it's a buggy configuration.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 31, 2014 at 7:29 PM, Josh Elser <[email protected]> wrote:
>>>>>>>>>>>> When you had multiple proxies, what were the failures on that tablet server (10.202.6.46:9997)?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm curious why using one proxy didn't cause errors but multiple did.
>>>>>>>>>>>>
>>>>>>>>>>>> On 1/31/14, 4:44 PM, Diego Woitasen wrote:
>>>>>>>>>>>>> I've reproduced the error and I've found this in the proxy logs:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2014-01-31 19:47:50,430 [server.THsHaServer] WARN : Got an IOException in internalRead!
>>>>>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>>>>>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>>>>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>>>>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>>>>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>>>>>>     at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>>>>>>>>>>>>     at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>>>>>>>>>>>>>     at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>>>>>>>>>>>>>     at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>>>>>>>>>>>>>     at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
>>>>>>>>>>>>>     at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
>>>>>>>>>>>>> 2014-01-31 19:51:13,185 [impl.ThriftTransportPool] WARN : Server 10.202.6.46:9997:9997 (30000) had 20 failures in a short time period, will not complain anymore
>>>>>>>>>>>>>
>>>>>>>>>>>>> A lot of these messages appear in all the proxies.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried the same stress test against one proxy and I was able to increase the load without getting any error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Diego
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jan 30, 2014 at 2:47 PM, Keith Turner <[email protected]> wrote:
>>>>>>>>>>>>>> Do you see more information in the proxy logs? "# exceptions 1" indicates an unexpected exception occurred in the batch writer client code. The proxy uses this client code, so maybe there will be a more detailed stack trace in its logs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jan 30, 2014 at 9:46 AM, Diego Woitasen <[email protected]> wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> I'm testing with a ten node cluster with the proxy enabled on all the nodes. I'm doing a stress test, balancing the connections between the proxies using round robin. When I increase the load (400 workers writing) I get this error:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> AccumuloSecurityException: AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: [] # server errors 0 # exceptions 1')
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The complete message is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> AccumuloSecurityException: AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: [] # server errors 0 # exceptions 1')
>>>>>>>>>>>>>>> kvlayer-test client failed!
>>>>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>>>>   File "tests/kvlayer/test_accumulo_throughput.py", line 64, in __call__
>>>>>>>>>>>>>>>     self.client.put('t1', ((u,), self.one_mb))
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_decorators.py", line 26, in wrapper
>>>>>>>>>>>>>>>     return method(*args, **kwargs)
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_accumulo.py", line 154, in put
>>>>>>>>>>>>>>>     batch_writer.close()
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/__init__.py", line 126, in close
>>>>>>>>>>>>>>>     self._conn.client.closeWriter(self._writer)
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3149, in closeWriter
>>>>>>>>>>>>>>>     self.recv_closeWriter()
>>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3172, in recv_closeWriter
>>>>>>>>>>>>>>>     raise result.ouch2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure if the error is produced by the way I'm using the cluster with multiple proxies; maybe I should use only one.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ideas are welcome.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Diego
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Diego Woitasen
>>>>>>>>>>>>>>> VHGroup - Linux and Open Source solutions architect
>>>>>>>>>>>>>>> www.vhgroup.net
>>>>>
>>>>> --
>>>>> Diego Woitasen
>>>>> VHGroup - Linux and Open Source solutions architect
>>>>> www.vhgroup.net
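
For anyone trying to reproduce this, a minimal pyaccumulo sketch of the multi-proxy, round-robin write pattern described above looks roughly like the following. The hostnames, credentials, port, and table name are made up, and the pyaccumulo calls (Accumulo, create_batch_writer, Mutation, add_mutation, close) should be checked against the version you actually have installed:

# Sketch only: one pyaccumulo connection per proxy, handed out round-robin to
# the writer workers. Hostnames, credentials, port and table are placeholders.
import itertools

from pyaccumulo import Accumulo, Mutation

PROXIES = ["proxy-%02d.example.com" % i for i in range(1, 11)]  # 10 proxy hosts

# One connection per proxy; cycle() hands them out round-robin.
conns = [Accumulo(host=h, port=42424, user="root", password="secret")
         for h in PROXIES]
rr = itertools.cycle(conns)

def write_batch(rows, value):
    # Pick the next proxy in the rotation and push one batch through it.
    conn = next(rr)
    writer = conn.create_batch_writer("t1")
    for row in rows:
        m = Mutation(row)
        m.put(cf="cf", cq="cq", val=value)
        writer.add_mutation(m)
    # close() flushes the batch; this is where a MutationsRejectedException
    # from a tserver holding commits would surface to the client.
    writer.close()

Pre-splitting the table as suggested at the top of this message should spread those batches across all ten tservers instead of concentrating them on one or two.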
