Ok. Even so, try adding some split points to the tables before you begin (if you aren't already) as it will *greatly* smooth the startup.

Something like [00, 01, 02, ... 10, 11, 12, ... 97, 98, 99] would be good. You can easily dump this to a file on local disk, then run the `addsplits` command in the Accumulo shell and point it at that file with the -sf (I think) option.
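
A minimal sketch of generating that splits file (the path and table name below are just placeholders):

```python
# Dump the two-digit split points "00".."99" to a local file, one per line.
with open("/tmp/splits.txt", "w") as out:
    for i in range(100):
        out.write("%02d\n" % i)
```

Then, in the Accumulo shell, something along the lines of `addsplits -t <your_table> -sf /tmp/splits.txt` (double-check the flag name against the shell's help for addsplits).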

On 2/11/14, 12:00 PM, Diego Woitasen wrote:
I'm using random keys for these tests. They are uuid4 keys.

On Tue, Feb 11, 2014 at 1:04 PM, Josh Elser <[email protected]> wrote:
The other thing I thought about: what's the distribution of Key-Values that you're writing? Specifically, do many of the Keys sort "near" each other? Similarly, do you notice excessive load on some tservers but not others (the "Tablet Servers" page on the Monitor is a good check)?

Consider the following: you have 10 tservers and you have 10 proxy servers. The first thought is that 10 tservers should be plenty to balance the load of those 10 proxy servers. However, a problem arises if the data that each of those proxy servers is writing happens to reside on a _small number of tablet servers_. Thus, your 10 proxy servers might only be writing to one or two tablet servers.

If you notice that you're getting skew like this (or even just know that
you're apt to have a situation where multiple clients might write data that
sorts close to one another), it would be a good idea to add splits to your
table before starting your workload.

e.g. if you consider that your Key-space is the numbers from 1 to 10, and you have ten tservers, it would be a good idea to add splits 1, 2, ... 10, so that each tserver hosts at least one tablet (e.g. [1,2), [2,3)... [10,+inf)). Having at least 5 or 10 tablets per tserver per table (split according to the distribution of your data) might help ease the load.
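
To make that toy example concrete, a quick sketch (plain Python, nothing Accumulo-specific) of the tablet ranges those splits would produce:

```python
# Toy example from above: ten split points "1".."10" yield eleven tablets,
# so ten tservers each host at least one.
splits = [str(i) for i in range(1, 11)]

bounds = ["-inf"] + splits + ["+inf"]
for lo, hi in zip(bounds, bounds[1:]):
    print("[%s, %s)" % (lo, hi))   # [-inf, 1), [1, 2), ..., [10, +inf)

# Caveat: Accumulo sorts rows lexicographically (as byte strings), so for
# real numeric keys you'd zero-pad the split points (e.g. "01".."10").
```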


On 2/11/14, 10:47 AM, Diego Woitasen wrote:

Same results with 2G tserver.memory.maps.max.

Maybe we just reached the limit :)

On Mon, Feb 10, 2014 at 7:08 PM, Diego Woitasen
<[email protected]> wrote:

On Mon, Feb 10, 2014 at 6:21 PM, Josh Elser <[email protected]> wrote:

I assume you're running a datanode alongside the tserver on that node? That may be stretching the capabilities of that node (not to mention EC2 nodes tend to be a little flaky in general). 2G for the tserver.memory.maps.max might be a little safer.

You got an error in a tserver log about that IOException in internalRead. After that, the tserver was still alive? And the proxy client was dead - quit normally?


Yes, everything is still alive.


If that's the case, the proxy might just be disconnecting in a noisy manner?


Right!

I'll try with 2G  tserver.memory.maps.max.



On 2/10/14, 3:38 PM, Diego Woitasen wrote:


Hi,
    I tried increasing the tserver.memory.maps.max to 3G and it failed again, but with a different error. I have a heap size of 3G and 7.5 GB of total RAM.

The error that I've found in the crashed tserver is:

2014-02-08 03:37:35,497 [util.TServerUtils$THsHaServer] WARN : Got an IOException in internalRead!

The tserver hasn't crashed, but the client was disconnected during the test.

Any other hints are welcome :)

On Mon, Feb 3, 2014 at 3:58 PM, Josh Elser <[email protected]>
wrote:


Oh, ok. So that isn't quite as bad as it seems.

The "commits are held" exception is thrown when the tserver is running
low
on memory. The tserver will block new mutations coming in until it can
process the ones it already has and free up some memory. This makes
sense
that you would see this more often when you have more proxy servers as
the
total amount of Mutations you can send to your Accumulo instance is
increased. With one proxy server, your tserver had enough memory to
process
the incoming data. With many proxy servers, your tservers would likely
fall
over eventually because they'll get bogged down in JVM garbage
collection.

If you have more memory that you can give the tservers, that would help. Also, you should make sure that you're using the Accumulo native maps, as this will use off-JVM-heap space instead of JVM heap, which should help tremendously with your ingest rates.

Native maps should be on by default unless you turned them off using the property 'tserver.memory.maps.native.enabled' in accumulo-site.xml. Additionally, you can try increasing the size of the native maps using 'tserver.memory.maps.max' in accumulo-site.xml. Just be aware that with the native maps, you need to ensure that total_ram > JVM_heap + tserver.memory.maps.max.
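
To put rough numbers on that constraint with the figures mentioned in this thread (7.5 GB of total RAM and a 3G JVM heap), a back-of-the-envelope sketch:

```python
# With native maps, total_ram must exceed JVM_heap + tserver.memory.maps.max,
# and the datanode and OS on the same node need headroom as well.
total_ram_gb = 7.5   # total RAM on the node (figure from this thread)
jvm_heap_gb = 3.0    # tserver JVM heap (figure from this thread)

for maps_max_gb in (3.0, 2.0):
    leftover = total_ram_gb - jvm_heap_gb - maps_max_gb
    print("maps.max=%gG leaves %.1fG for datanode/OS" % (maps_max_gb, leftover))

# 3G for the native maps leaves only ~1.5G for everything else, which is
# tight; 2G leaves ~2.5G, which is why the smaller value is the safer choice.
```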

- Josh


On 2/3/14, 1:33 PM, Diego Woitasen wrote:



I've launched the cluster again and I was able to reproduce the error:

In the proxy I had the same error that I mentioned in one of my previous messages, about a failure in a tablet server. I checked the log of that tablet server and I found:

2014-02-03 18:02:24,065 [thrift.ProcessFunction] ERROR: Internal error processing update
org.apache.accumulo.server.tabletserver.HoldTimeoutException: Commits are held

A lot of times.

Full log if someone wants to have a look:



http://www.vhgroup.net/diegows/tserver_matrix-slave-07.accumulo-ec2-test.com.debug.log

Regards,
      Diego

On Mon, Feb 3, 2014 at 12:11 PM, Josh Elser <[email protected]>
wrote:



I would assume that the proxy service would become a bottleneck fairly quickly and your throughput would benefit from running multiple proxies, but I don't have substantive numbers to back up that assertion.

I'll put this on my list and see if I can reproduce something.


On 2/3/14, 7:42 AM, Diego Woitasen wrote:




I have to run the tests again because they were EC2 instances and I've destroyed them. It's easy to reproduce, BTW.

My question is, does it make sense to run multiple proxies? Is there a limit? Right now I'm trying with 10 nodes and 10 proxies (one running on every node). Maybe that doesn't make sense or it's a buggy configuration.



On Fri, Jan 31, 2014 at 7:29 PM, Josh Elser <[email protected]>
wrote:




When you had multiple proxies, what were the failures on that tablet server (10.202.6.46:9997)?

I'm curious why using one proxy didn't cause errors but multiple did.


On 1/31/14, 4:44 PM, Diego Woitasen wrote:





I've reproduced the error and I've found this in the proxy logs:

2014-01-31 19:47:50,430 [server.THsHaServer] WARN : Got an IOException in internalRead!
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
        at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
        at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
        at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
        at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
        at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
2014-01-31 19:51:13,185 [impl.ThriftTransportPool] WARN : Server 10.202.6.46:9997:9997 (30000) had 20 failures in a short time period, will not complain anymore

A lot of these messages appear in all the proxies.

I tried the same stress test against one proxy and I was able to increase the load without getting any errors.

Regards,
        Diego

On Thu, Jan 30, 2014 at 2:47 PM, Keith Turner <[email protected]>
wrote:





Do you see more information in the proxy logs?  "# exceptions 1" indicates an unexpected exception occurred in the batch writer client code. The proxy uses this client code, so maybe there will be a more detailed stack trace in its logs.


On Thu, Jan 30, 2014 at 9:46 AM, Diego Woitasen
<[email protected]>
wrote:






Hi,
       I'm testing with a ten-node cluster with the proxy enabled on all the nodes. I'm doing a stress test, balancing the connections between the proxies using round robin. When I increase the load (400 workers writing) I get this error:

AccumuloSecurityException:
AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: []  # server errors 0  # exceptions 1')

The complete message is:

AccumuloSecurityException:
AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: []  # server errors 0  # exceptions 1')
kvlayer-test client failed!
Traceback (most recent call last):
  File "tests/kvlayer/test_accumulo_throughput.py", line 64, in __call__
    self.client.put('t1', ((u,), self.one_mb))
  File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_decorators.py", line 26, in wrapper
    return method(*args, **kwargs)
  File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_accumulo.py", line 154, in put
    batch_writer.close()
  File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/__init__.py", line 126, in close
    self._conn.client.closeWriter(self._writer)
  File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3149, in closeWriter
    self.recv_closeWriter()
  File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3172, in recv_closeWriter
    raise result.ouch2

I'm not sure if the error is produced by the way I'm using the cluster with multiple proxies; maybe I should use just one.

Ideas are welcome.

Regards,
        Diego

--
Diego Woitasen
VHGroup - Linux and Open Source solutions architect
www.vhgroup.net