Re: ISAM file location vs. read performance

Eric Newton Thu, 16 Jan 2014 11:42:40 -0800

You may find org.apache.accumulo.server.util.LocalityCheck useful.

-Eric




On Thu, Jan 16, 2014 at 2:12 PM, Arshak Navruzyan <[email protected]> wrote:

> I did some manual testing on this to see where HDFS is placing blocks in
> relation to the location of the tablets.  I used the following command to
> determine where HDFS is replicating the various blocks of the RFiles.
>
> hadoop fsck /accumulo/tables/a -locations -blocks -files
>
> From my limited testing, it appears that John's observation that "tserver
> will ultimately end up major compacting its files, ensuring locality" is
> indeed true.  In all cases, the node responsible for the tablet held a
> copy of every block of that RFile.
>
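
The manual check above (does one node hold a replica of every block of a file?) can be sketched as a small script over saved fsck output. This is only a rough sketch: it assumes the block lines end in a bracketed, comma-separated list of host:port locations, which varies between Hadoop versions, and the sample text, file names, and addresses below are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed line shapes (they differ between Hadoop versions):
#   /path/to/file 2048 bytes, 2 block(s):  OK
#   0. blk_1001_1 len=1024 repl=3 [10.0.0.1:50010, 10.0.0.2:50010]
FILE_LINE = re.compile(r"^(/\S+) \d+ bytes, \d+ block\(s\)")
BLOCK_LINE = re.compile(r"^\d+\.\s+\S+.*\[(.*)\]\s*$")


def hosts_per_block(fsck_text):
    """Map each file path to a list of host sets, one set per block."""
    blocks = defaultdict(list)
    current = None
    for raw in fsck_text.splitlines():
        line = raw.strip()
        m = FILE_LINE.match(line)
        if m:
            current = m.group(1)
            continue
        m = BLOCK_LINE.match(line)
        if m and current is not None:
            # Drop the port so replicas can be compared by host only.
            hosts = {loc.strip().rsplit(":", 1)[0]
                     for loc in m.group(1).split(",") if loc.strip()}
            blocks[current].append(hosts)
    return dict(blocks)


def fully_local_hosts(block_hosts):
    """Hosts that hold a replica of every block of one file."""
    if not block_hosts:
        return set()
    return set.intersection(*block_hosts)


# Hypothetical sample, not real cluster output.
SAMPLE = """\
/accumulo/tables/a/t-001/F0001.rf 2048 bytes, 2 block(s):  OK
0. blk_1001_1 len=1024 repl=3 [10.0.0.1:50010, 10.0.0.2:50010, 10.0.0.3:50010]
1. blk_1002_1 len=1024 repl=3 [10.0.0.1:50010, 10.0.0.4:50010, 10.0.0.5:50010]

/accumulo/tables/a/t-002/F0002.rf 2048 bytes, 2 block(s):  OK
0. blk_1003_1 len=1024 repl=2 [10.0.0.2:50010, 10.0.0.3:50010]
1. blk_1004_1 len=1024 repl=2 [10.0.0.4:50010, 10.0.0.5:50010]
"""

if __name__ == "__main__":
    for path, per_block in hosts_per_block(SAMPLE).items():
        print(path, sorted(fully_local_hosts(per_block)))
```

An empty result for a file means no single host holds all of its blocks. Comparing the result against the tablet's assigned tserver still has to be done by hand.
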
> More extensive testing in bigger environments would probably still be
> helpful before we write this into the documentation.  Also, I'm not sure
> what happens during tserver failures/reassignments.
>
> One thing that would make testing much easier is if "getsplits -v"
> reported the HDFS location of the tablet.  Right now you have to trawl
> through !METADATA to figure it out.
>
>
> On Mon, Jan 13, 2014 at 10:25 AM, Arshak Navruzyan <[email protected]> wrote:
>
>> Thanks for all the explanations.  Perhaps this is something we should
>> clearly spell out in the documentation once all the facts are in.  I'll
>> keep a task open for now. (
>> https://issues.apache.org/jira/browse/ACCUMULO-2185)
>>
>>
>> On Sun, Jan 12, 2014 at 4:26 PM, Donald Miner <[email protected]> wrote:
>>
>>> HDFS-385 (
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/HDFS-385 )
>>> is for custom pluggable block placement policies, and there has been some
>>> talk (I think) about using it to improve mean time to recovery and data
>>> locality in HBase.
>>>
>>> Basically this would allow accumulo to have a policy for its blocks and
>>> control its own destiny... Instead of things like the rebalancer screwing
>>> things up.
>>>
>>> I honestly don't know much else about this. Just thought it might be
>>> relevant to the conversation.
>>>
>>> > On Jan 12, 2014, at 6:42 PM, Josh Elser <[email protected]> wrote:
>>> >
>>> >
>>> >
>>> >> On 1/12/14, 6:17 PM, Sean Busbey wrote:
>>> >> On Sun, Jan 12, 2014 at 4:42 PM, William Slacum
>>> >> <[email protected]> wrote:
>>> >>
>>> >>    Some data on short circuit reads would be great to have.
>>> >>
>>> >>
>>> >> What kind of data are you looking for? Just HDFS read rates? or
>>> >> specifically Accumulo when set up to make use of it?
>>> >
>>> > I believe what Bill means, and what I'm also curious about, is
>>> > specifically the impact on performance for Accumulo's workload: a merged
>>> > read over multiple files. An easy test might be to create multiple RFiles
>>> > (1 to 10 files?) which contain interspersed data. Test some sort of
>>> > random-read and random-seek+sequential-read workloads, from 1 to 10 RFiles,
>>> > and with short-circuit reads on and off.
>>> >
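
As a toy illustration of the access pattern described above (a seek followed by a sequential, merged read over several sorted files with interspersed keys), here is a minimal in-memory sketch. It does not touch Accumulo, RFiles, or HDFS; the function names and the integer keys are hypothetical, and plain sorted lists stand in for the files.

```python
import bisect
import heapq


def make_sorted_files(num_files, keys_per_file):
    """Build sorted lists whose keys are interspersed across files.

    File i holds keys i, i+n, i+2n, ... so that consecutive keys
    always live in different files.
    """
    total = num_files * keys_per_file
    return [list(range(i, total, num_files)) for i in range(num_files)]


def merged_scan(files, start_key, count):
    """Seek every file to start_key, then read `count` keys in merged order."""
    # Seek: binary search within each sorted file.
    tails = (f[bisect.bisect_left(f, start_key):] for f in files)
    # Sequential read: k-way merge of the sorted tails.
    merged = heapq.merge(*tails)
    return [key for _, key in zip(range(count), merged)]


if __name__ == "__main__":
    files = make_sorted_files(num_files=3, keys_per_file=4)
    print(merged_scan(files, start_key=4, count=5))
```

With more files, every step of the sequential read pulls from a different file, which is where local versus remote block reads would be expected to matter in a real test.
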
>>> > Perhaps a slightly more accurate test would be to up the compaction
>>> > ratio on a table, and then bulk import them to a single table, and then
>>> > just use the regular client API.
>>> >
>>> >>    I'm unsure of how correct the "compaction leading to eventual
>>> >>    locality" postulation is. It seems, to me at least, that in the case
>>> >>    of a multi-block file, the file system would eventually try to
>>> >>    distribute those blocks rather than leave them all on a single host.
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> I know in HBase setups, it's common to either disable the HDFS Balancer
>>> >> or just disable it for a namespace containing the part of the filesystem
>>> >> that handles HBase. Otherwise, when the blocks are moved off to other
>>> >> hosts you get performance degradation until compaction can happen again.
>>> >> I would expect the same thing ought to be done for Accumulo.
>>> >
>>> > AFAIK, HBase also does a lot more to assign Tablets with regard to the
>>> > blocks that serve them, no? To my knowledge, Accumulo doesn't do
>>> > anything like this. I don't want users to think that disabling the HDFS
>>> > balancer is a good idea for Accumulo unless we have actual evidence.
>>>
>>
>>
>
