One aspect of Accumulo's architecture is still unclear to me. Would you get better scan performance if you could guarantee that a tablet and its ISAM file lived on the same node? I'm guessing ISAM files are not splittable, so each one pretty much stays on one HDFS data node (plus the replicas). Or is the theory that local SATA and a 10 Gbps network provide more or less the same throughput?
I generally understand that as a table grows and Accumulo creates more splits (tablets), you get better distribution over the cluster, but it seems like data locality would still be important. The HBase folks seem to think you can approximately double your throughput if you let the region server read the file directly (dfs.client.read.shortcircuit=true) rather than going through the data node (http://files.meetup.com/1350427/hug_ebay_jdcryans.pdf). Perhaps that gain comes more from avoiding HDFS overhead? I do get that one really nice thing about Accumulo's architecture is that it costs almost nothing to reassign a tablet to a different tserver, whereas reassignment is a huge problem for other systems.
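
For reference, the HBase-side setting I am referring to would look roughly like this in hdfs-site.xml. This is only a sketch: the first property is the one named in the slides, while the second is my assumption that the cluster runs a Hadoop version using the domain-socket implementation of short-circuit reads, and the socket path shown is just a placeholder.

```xml
<!-- hdfs-site.xml fragment (sketch, not a tested configuration) -->
<property>
  <!-- lets the client read local block files directly instead of via the data node -->
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- assumption: required by the domain-socket implementation; placeholder path -->
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

A setting like this only helps when the reader and the block are on the same node, which is why I am asking about locality in the first place.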
