You can write your own balancer, and use it just for your table. -Eric
On Wed, Aug 21, 2013 at 7:47 PM, Slater, David M. <david.sla...@jhuapl.edu>wrote: > Hi James,**** > > ** ** > > I already had the data sharded into 7 partitions, and that works well to > distribute the data into 7 tablets. (I have 2 GB tablet sizes, with about > 1.2 TB of data, so there are numerous tablets per server). The difficulty > is that Accumulo seems to decide for itself where each tablet goes. When I > only had 10 GB of data, it nicely divided into 7 tablets, one on each node. > However, with dozens of tablets per tablet server, it assigns tablets to > tablet servers orthogonally to my presplits. **** > > ** ** > > Is there a way to force Accumulo to keep specific ranges on specific > nodes? If not, I suppose that I could have 4x or more shards per tablet > server to ensure that it was more uniformly placed.**** > > ** ** > > D**** > > ** ** > > *From:* James Hughes [mailto:jn...@virginia.edu] > *Sent:* Wednesday, August 21, 2013 7:29 PM > *To:* user@accumulo.apache.org > *Subject:* Re: Straggler problem in Accumulo BatchScans**** > > ** ** > > David,**** > > Each tablet is hosted by one tablet server, and there's no way around > that. (This is actually quite reasonably; otherwise, we would receive > duplicate results from multiple tablet servers.) **** > > One strategy to deal with imbalanced data is to add a random partition > prefix to your row Ids. This does complicate building queries, but in > general, you'll be able to leverage all of your nodes. I did some testing > with the nodes of such random shard ids, and it seems like having 1-2x as > many shards as tablet servers worked pretty well. (I'd suggest 2x in case > you ever grow your cloud.)**** > > In particular, if you can reingest your data, prepend a random "01-14~" to > the beginning of each row Id, and see if that helps. After that, you can > "help" Accumulo decide where it should split tablets with addSplits 01 02 > <etc> 14 from the Accumulo shell (or programmatically with the addSplits). > After that, you can make sure that your 14+ splits are distributed across > the 7 nodes in a reasonable way. **** > > I hope that helps, > > Jim > > > http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#addSplits%28java.lang.String,%20java.util.SortedSet%29 > **** > > ** ** > > On Wed, Aug 21, 2013 at 7:09 PM, Slater, David M. <david.sla...@jhuapl.edu> > wrote:**** > > Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4.**** > > **** > > When I run large BatchScanner operations, the number of tablets scanned > per node is not uniform, leading to the overloaded nodes taking much longer > to finish than the others. For queries that require all of the scans to > finish before returning, this is a major latency issue. What are some > practical means of load-balancing this to reduce delay?**** > > **** > > Is it possible for tablets to be hosted on multiple tablet servers, up to > the replication factor of the underlying hdfs? Are there reasons this might > be an undesirable design?**** > > **** > > Thanks in advance, > David **** > > ** ** >