Thanks, Adam… that's exactly what I was looking for, and gives me a lot to think about.
-Krishmin On Oct 30, 2012, at 4:08 PM, Adam Fuchs wrote: > Krishmin, > > There are a few extremes to keep in mind when choosing a manual partitioning > strategy: > 1. Parallelism and balance at ingest time. You need to find a happy medium > between too few partitions (not enough parallelism) and too many partitions > (tablet server resource contention and inefficient indexes). Probably at > least one partition per tablet server being actively written to is good, and > you'll want to pre-split so they can be distributed evenly. Ten partitions > per tablet server is probably not too many -- I wouldn't expect to see > contention at that point. > 2. Parallelism and balance at query time. At query time, you'll be selecting > a subset of all of the partitions that you've ever ingested into. This subset > should be bounded similarly to the concern addressed in #1, but the bounds > could be looser depending on the types of queries you want to run. Lower > latency queries would tend to favor only a few partitions per node. > 3. Growth over time in partition size. Over time, you want partitions to be > bounded to less than about 10-100GB. This has to do with limiting the maximum > amount of time that a major compaction will take, and impacts availability > and performance in the extreme cases. At the same time, you want partitions > to be as large as possible so that their indexes are more efficient. > > One strategy to optimize partition size would be to keep using each partition > until it is "full", then make another partition. Another would be to allocate > a certain number of partitions per day, and then only put data in those > partitions during that day. These strategies are also elastic, and can be > tweaked as the cluster grows. > > In all of these cases, you will need a good load balancing strategy. The > default strategy of evening up the number of partitions per tablet server is > probably not sufficient, so you may need to write your own tablet load > balancer that is aware of your partitioning strategy. > > Cheers, > Adam > > > > On Tue, Oct 30, 2012 at 3:06 PM, Krishmin Rai <kr...@missionfoc.us> wrote: > Hi All, > We're working with an index table whose row is a shardId (an integer, like > the wiki-search or IndexedDoc examples). I was just wondering what the right > strategy is for choosing a number of partitions, particularly given a cluster > that could potentially grow. > > If I simply set the number of shards equal to the number of slave nodes, > additional nodes would not improve query performance (at least over the data > already ingested). But starting with more partitions than slave nodes would > result in multiple tablets per tablet server… I'm not really sure how that > would impact performance, particularly given that all queries against the > table will be batchscanners with an infinite range. > > Just wondering how others have addressed this problem, and if there are any > performance rules of thumb regarding the ratio of tablets to tablet servers. > > Thanks! > Krishmin >