Thanks, Adam… that's exactly what I was looking for, and gives me a lot to 
think about.

-Krishmin

On Oct 30, 2012, at 4:08 PM, Adam Fuchs wrote:

> Krishmin,
> 
> There are a few extremes to keep in mind when choosing a manual partitioning 
> strategy:
> 1. Parallelism and balance at ingest time. You need to find a happy medium 
> between too few partitions (not enough parallelism) and too many partitions 
> (tablet server resource contention and inefficient indexes). Probably at 
> least one partition per tablet server being actively written to is good, and 
> you'll want to pre-split so they can be distributed evenly. Ten partitions 
> per tablet server is probably not too many -- I wouldn't expect to see 
> contention at that point.
> 2. Parallelism and balance at query time. At query time, you'll be selecting 
> a subset of all of the partitions that you've ever ingested into. This subset 
> should be bounded similarly to the concern addressed in #1, but the bounds 
> could be looser depending on the types of queries you want to run. Lower 
> latency queries would tend to favor only a few partitions per node.
> 3. Growth over time in partition size. Over time, you want partitions to be 
> bounded to less than about 10-100GB. This has to do with limiting the maximum 
> amount of time that a major compaction will take, and impacts availability 
> and performance in the extreme cases. At the same time, you want partitions 
> to be as large as possible so that their indexes are more efficient.
> 
> One strategy to optimize partition size would be to keep using each partition 
> until it is "full", then make another partition. Another would be to allocate 
> a certain number of partitions per day, and then only put data in those 
> partitions during that day. These strategies are also elastic, and can be 
> tweaked as the cluster grows.
> 
> In all of these cases, you will need a good load balancing strategy. The 
> default strategy of evening up the number of partitions per tablet server is 
> probably not sufficient, so you may need to write your own tablet load 
> balancer that is aware of your partitioning strategy.
> 
> Cheers,
> Adam
> 
> 
> 
> On Tue, Oct 30, 2012 at 3:06 PM, Krishmin Rai <kr...@missionfoc.us> wrote:
> Hi All,
>   We're working with an index table whose row is a shardId (an integer, like 
> the wiki-search or IndexedDoc examples). I was just wondering what the right 
> strategy is for choosing a number of partitions, particularly given a cluster 
> that could potentially grow.
> 
>   If I simply set the number of shards equal to the number of slave nodes, 
> additional nodes would not improve query performance (at least over the data 
> already ingested). But starting with more partitions than slave nodes would 
> result in multiple tablets per tablet server… I'm not really sure how that 
> would impact performance, particularly given that all queries against the 
> table will be batchscanners with an infinite range.
> 
>   Just wondering how others have addressed this problem, and if there are any 
> performance rules of thumb regarding the ratio of tablets to tablet servers.
> 
> Thanks!
> Krishmin
> 

Reply via email to