On 1/12/14, 6:17 PM, Sean Busbey wrote:
On Sun, Jan 12, 2014 at 4:42 PM, William Slacum
<[email protected]> wrote:
Some data on short circuit reads would be great to have.
What kind of data are you looking for? Just HDFS read rates, or
specifically Accumulo when set up to make use of it?
I believe what Bill means, and what I'm also curious about, is
specifically the impact on performance for Accumulo's workload: a merged
read over multiple files. An easy test might be to create multiple
RFiles (1 to 10 files?) which contain interspersed data, then run some
sort of random-read and random-seek+sequential-read workloads over 1 to
10 RFiles, with short-circuit reads on and off.
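To make the proposed workload concrete, here's a rough sketch of its shape in plain Python: write N sorted files whose keys interleave, then do a merged sequential read across all of them. This is only an illustration of the access pattern; the file names and key format are made up, and the real test would of course read actual RFiles through HDFS with short-circuit reads toggled.

```python
import heapq
import os

def write_interspersed_files(num_files, keys_per_file, directory):
    """Write num_files sorted files whose keys interleave:
    key i lands in file (i % num_files), so a full scan must merge all files."""
    paths = []
    total = num_files * keys_per_file
    for f in range(num_files):
        path = os.path.join(directory, "rfile-%d.txt" % f)
        with open(path, "w") as out:
            for k in range(f, total, num_files):
                out.write("%010d\n" % k)  # zero-padded so lexicographic == numeric order
        paths.append(path)
    return paths

def merged_read(paths):
    """Merged sequential read over all files, like a scan over multiple RFiles."""
    files = [open(p) for p in paths]
    try:
        return [line.strip() for line in heapq.merge(*files)]
    finally:
        for f in files:
            f.close()
```

Timing `merged_read` for 1 through 10 files, with and without short-circuit reads enabled on the real cluster, would be the comparison of interest.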
Perhaps a slightly more accurate test would be to raise the compaction
ratio on a table, bulk import the RFiles into it, and then just use the
regular client API.
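In the Accumulo shell that could look something like the following (table name and staging paths are hypothetical; the ratio value is just an example to delay major compactions):

```
config -t test -s table.compaction.major.ratio=10
importdirectory /tmp/bulk /tmp/bulk_fail true
```

Raising `table.compaction.major.ratio` keeps the imported files from being merged away immediately, so scans exercise the multi-file merged-read path.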
I'm unsure how correct the "compaction leading to eventual
locality" postulation is. It seems, to me at least, that for a
multi-block file the filesystem would eventually try to distribute
those blocks rather than leave them all on a single host.
I know in HBase setups it's common to either disable the HDFS Balancer
entirely or disable it for the part of the filesystem that HBase
manages. Otherwise, when blocks are moved off to other hosts, you get
performance degradation until compaction can restore locality.
I would expect the same thing ought to be done for Accumulo.
AFAIK, HBase also does a lot more to assign regions (its equivalent of
tablets) based on the locations of the blocks that serve them, no? To my
knowledge, Accumulo doesn't do anything like this. I don't want users to
think that disabling the HDFS balancer is a good idea for Accumulo
unless we have actual evidence.