Hello,

I am running an Accumulo-based MR job using the AccumuloRowInputFormat on 1.4.1. The config is more or less default, using the native-standalone 3GB template, but with the TServer memory raised to 2GB in accumulo-env.sh. accumulo-site.xml has tserver.memory.maps.max at 1G, tserver.cache.data.size at 50M, and tserver.cache.index.size at 512M.
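
For reference, the three accumulo-site.xml settings listed above would look roughly like the excerpt below. This is a sketch reconstructed from the values in this message, not a copy of the actual file; all other properties are assumed to be left at the template defaults.

```xml
<!-- accumulo-site.xml (excerpt): only the three properties named above.
     Reconstructed from the values stated in this message; everything
     else is assumed to be unchanged from the 3GB native-standalone template. -->
<configuration>
  <property>
    <name>tserver.memory.maps.max</name>
    <value>1G</value>
  </property>
  <property>
    <name>tserver.cache.data.size</name>
    <value>50M</value>
  </property>
  <property>
    <name>tserver.cache.index.size</name>
    <value>512M</value>
  </property>
</configuration>
```
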
My tables are created with maxversions at 1 for all three scopes (scan, minc, majc) and with the compress type set to gz.

On an 8-node test cluster with 64 map task slots, I am finding that while a job is running, the 'Running Scans' count in the monitor averages roughly 0-4 for each tablet server. In the table view, this puts the running scans for the table at anywhere from 4 to 24 on average. I would expect/hope the scan count to be somewhere close to the map task count. To me, this means one of the following:

1. There is a configuration setting inhibiting the number of scans from accumulating (excuse the pun) to about the same number as my map tasks.
2. My map task job is CPU-intensive enough to introduce delays between scans, and everything is fine.
3. Some combination of 1 and 2.

On an alternate cluster, 40 nodes with 320 task slots, running map tasks with the same performance characteristics, we haven't seen anywhere near full-capacity scanning, and the problem seems much worse.

I am experimenting with some of the readahead configuration variables for the tablet servers in the meantime, but haven't found any smoking guns yet.

Thank you,
Marc
--
http://saucyandbossy.wordpress.com
