You may find org.apache.accumulo.server.util.LocalityCheck useful.

-Eric

On Thu, Jan 16, 2014 at 2:12 PM, Arshak Navruzyan <[email protected]> wrote:

> I did some manual testing on this to see where HDFS is placing blocks in
> relation to the location of the tablets. I used the following command to
> determine where HDFS is replicating the various blocks of the RFiles.
>
> hadoop fsck /accumulo/tables/a -locations -blocks -files
>
> From my limited testing, it appears that John's observation that "tserver
> will ultimately end up major compacting its files, ensuring locality" is
> indeed true. In all cases, the node that was responsible for the tablet
> held a copy of all the blocks of that RFile.
>
> More extensive testing in bigger environments would probably still be
> helpful before we write this into the documentation. I'm also not sure
> what happens during tserver failures/reassignments.
>
> One thing that would make testing much easier is if "getsplits -v"
> reported the HDFS location of the tablet. Right now you have to troll
> through !METADATA to figure it out.
>
>
> On Mon, Jan 13, 2014 at 10:25 AM, Arshak Navruzyan <[email protected]> wrote:
>
>> Thanks for all the explanations. Perhaps this is something we should
>> clearly spell out in the documentation once all the facts are in. I'll
>> keep a task open for now.
>> (https://issues.apache.org/jira/browse/ACCUMULO-2185)
>>
>>
>> On Sun, Jan 12, 2014 at 4:26 PM, Donald Miner <[email protected]> wrote:
>>
>>> HDFS-385
>>> (https://issues.apache.org/jira/plugins/servlet/mobile#issue/HDFS-385)
>>> is for custom pluggable block placement policies, and there has been
>>> some talk (I think) about improving mean time to recovery and data
>>> locality in HBase.
>>>
>>> Basically this would allow Accumulo to have a policy for its blocks and
>>> control its own destiny... instead of things like the rebalancer
>>> screwing things up.
>>>
>>> I honestly don't know much else about this. Just thought it might be
>>> relevant to the conversation.
>>>
>>> > On Jan 12, 2014, at 6:42 PM, Josh Elser <[email protected]> wrote:
>>> >
>>> >> On 1/12/14, 6:17 PM, Sean Busbey wrote:
>>> >> On Sun, Jan 12, 2014 at 4:42 PM, William Slacum
>>> >> <[email protected] <mailto:[email protected]>>
>>> >> wrote:
>>> >>
>>> >>     Some data on short circuit reads would be great to have.
>>> >>
>>> >>
>>> >> What kind of data are you looking for? Just HDFS read rates? Or
>>> >> specifically Accumulo when set up to make use of it?
>>> >
>>> > I believe what Bill means, and what I'm also curious about, is
>>> > specifically the impact on performance for Accumulo's workload: a
>>> > merged read over multiple files. An easy test might be to create
>>> > multiple RFiles (1 to 10 files?) which contain interspersed data.
>>> > Test some sort of random-read and random-seek+sequential-read
>>> > workloads, from 1 to 10 RFiles, and with short-circuit reads on and
>>> > off.
>>> >
>>> > Perhaps a slightly more accurate test would be to up the compaction
>>> > ratio on a table, and then bulk import them to a single table, and
>>> > then just use the regular client API.
>>> >
>>> >>     I'm unsure of how correct the "compaction leading to eventual
>>> >>     locality" postulation is. It seems, to me at least, that in the
>>> >>     case of a multi-block file, the file system would eventually
>>> >>     try to distribute those blocks rather than leave them all on a
>>> >>     single host.
>>> >>
>>> >>
>>> >> I know in HBase setups, it's common to either disable the HDFS
>>> >> Balancer or just disable it for a namespace containing the part of
>>> >> the filesystem that handles HBase. Otherwise, when the blocks are
>>> >> moved off to other hosts you get performance degradation until
>>> >> compaction can happen again. I would expect the same thing ought to
>>> >> be done for Accumulo.
>>> >
>>> > AFAIK, HBase also does a lot more with regard to assigning Tablets
>>> > based on the blocks that serve them, no? To my knowledge, Accumulo
>>> > doesn't do anything like this. I don't want users to think that
>>> > disabling the HDFS balancer is a good idea for Accumulo unless we
>>> > have actual evidence.
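
For anyone who wants to repeat the manual fsck check from the thread without reading the output by eye, here is a minimal sketch in Python. It is not the LocalityCheck utility and does not reproduce its logic. It only parses text in the shape that "hadoop fsck <path> -locations -blocks -files" printed on older Hadoop releases (that line format is an assumption and may differ on your version) and reports which datanodes hold a replica of every block of each file. The sample path, block IDs, and IP addresses are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed line shapes (older Hadoop fsck output; newer releases may differ):
#   /path/to/file.rf 1234 bytes, 1 block(s):  OK
#   0. blk_123_456 len=1234 repl=3 [10.0.0.1:50010, 10.0.0.2:50010]
FILE_LINE = re.compile(r"^(/\S+) \d+ bytes, \d+ block\(s\)")
BLOCK_LINE = re.compile(r"^\d+\.\s+(\S*blk_\S+).*?\[(.*)\]\s*$")
HOST = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+")

# Hypothetical sample output; the path, block IDs and IPs are invented.
SAMPLE = """\
/accumulo/tables/a/t-0001/F0001.rf 200 bytes, 2 block(s):  OK
0. blk_1_1001 len=128 repl=3 [10.0.0.1:50010, 10.0.0.2:50010, 10.0.0.3:50010]
1. blk_2_1002 len=72 repl=3 [10.0.0.1:50010, 10.0.0.3:50010, 10.0.0.4:50010]
"""


def parse_fsck(text):
    """Map each file path to a list of per-block sets of datanode IPs."""
    files = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = FILE_LINE.match(line)
        if m:
            current = m.group(1)
            files[current]  # register the file even if it has no blocks
            continue
        m = BLOCK_LINE.match(line)
        if m and current is not None:
            files[current].append(set(HOST.findall(m.group(2))))
    return dict(files)


def fully_local_hosts(block_hosts):
    """Return the hosts that hold a replica of every block of one file."""
    if not block_hosts:
        return set()
    return set.intersection(*block_hosts)


if __name__ == "__main__":
    for path, blocks in parse_fsck(SAMPLE).items():
        print(path, sorted(fully_local_hosts(blocks)))
```

To use it against a real cluster, pipe the fsck output into parse_fsck and compare each file's fully local hosts with the tserver currently hosting the tablet (which you still have to look up in !METADATA, as noted above).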
