Hi Marc,
How many tablets are in the table you're running MR over (see the
monitor)? Might adding some more splits to your table (`addsplits` in
the Accumulo shell) get you better parallelism?
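A rough sketch of how split points could be picked, in case it helps. It assumes (and this is only an assumption about your data) that row IDs start with a lowercase hex prefix and are spread roughly evenly over it. The class and method names are made up for illustration and nothing here talks to Accumulo; it only prints candidate split points that could then be handed to `addsplits`. As far as I understand, the input format gives you roughly one map task per tablet, so the 64 below is chosen to match the 64 map slots you mention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes evenly spaced split points over a hex row prefix.
class SplitPoints {

    // Returns (numTablets - 1) split points, each hexDigits characters wide.
    // Assumes row IDs begin with a uniformly distributed lowercase hex prefix.
    static List<String> hexSplits(int numTablets, int hexDigits) {
        if (hexDigits < 1 || hexDigits > 7) {
            throw new IllegalArgumentException("hexDigits must be between 1 and 7");
        }
        long space = 1L << (4 * hexDigits); // number of distinct prefixes
        if (numTablets > space) {
            throw new IllegalArgumentException("more tablets than distinct prefixes");
        }
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numTablets; i++) {
            long point = space * i / numTablets; // i-th boundary in prefix space
            splits.add(String.format("%0" + hexDigits + "x", point));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 64 tablets to line up with 64 map task slots (an illustrative choice).
        List<String> splits = hexSplits(64, 2);
        // Print space-separated so the points can be pasted after `addsplits`.
        System.out.println(String.join(" ", splits));
    }
}
```

If your row IDs are not hex-prefixed or are skewed, evenly spaced points like these will not balance the tablets, and split points sampled from the actual data would be a better fit.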
What does the data in your table look like? Lots of small rows? A few
very large rows?
On 4/2/13 10:56 AM, Marc Reichman wrote:
Hello,
I am running an Accumulo-based MR job using the AccumuloRowInputFormat
on 1.4.1. Config is more-or-less default, using the native-standalone
3GB template, but with the TServer memory raised to 2GB in
accumulo-env.sh from its default. accumulo-site.xml has
tserver.memory.maps.max at 1G, tserver.cache.data.size at 50M, and
tserver.cache.index.size at 512M.
My tables are created with maxversions for all three types (scan,
minc, majc) at 1 and compress type as gz.
I am finding, on an 8 node test cluster with 64 map task slots, that
when a job is running, the 'Running Scans' count in the monitor is
roughly 0-4 on average for each tablet server. In the table view,
this puts the running scans anywhere from 4-24 on average.
I would expect/hope the scans to be somewhere close to the map task
count. To me, this means one of the following.
1. There is a configuration setting inhibiting the number of scans
from accumulating (excuse the pun) to about the same number as my map
tasks.
2. My map tasks are CPU-intensive enough to introduce delays between
scans, and everything is fine.
3. Some combination of 1 and 2.
On an alternate cluster, 40 nodes with 320 task slots, running map
tasks with the same performance, we haven't seen scanning anywhere
near full capacity, and the problem seems much worse.
I am experimenting with some of the readahead configuration variables
for the tablet servers in the meantime, but haven't found any smoking
guns yet.
Thank you,
Marc
--
http://saucyandbossy.wordpress.com