Re: Number of partitions for sharded table

2012-10-30 Thread Krishmin Rai
I should clarify that I've been pre-splitting tables at each shard so that each 
tablet consists of a single row.

On Oct 30, 2012, at 3:06 PM, Krishmin Rai wrote:

 Hi All, 
  We're working with an index table whose row is a shardId (an integer, like 
 the wiki-search or IndexedDoc examples). I was just wondering what the right 
 strategy is for choosing a number of partitions, particularly given a cluster 
 that could potentially grow.
 
  If I simply set the number of shards equal to the number of slave nodes, 
 additional nodes would not improve query performance (at least over the data 
 already ingested). But starting with more partitions than slave nodes would 
 result in multiple tablets per tablet server… I'm not really sure how that 
 would impact performance, particularly given that all queries against the 
 table will be batchscanners with an infinite range.
 
  Just wondering how others have addressed this problem, and if there are any 
 performance rules of thumb regarding the ratio of tablets to tablet servers.
 
 Thanks!
 Krishmin



Re: Number of partitions for sharded table

2012-10-30 Thread dlmarion
Krishmin, 

In the wikisearch example there is a non-sharded index table and a sharded 
document table. The index table is used to reduce the number of tablets that 
need to be searched for a given set of terms. Is your setup similar? I'm a 
little confused since you mention using a sharded index table and that all 
queries will have an infinite range. 

Dave Marion 

- Original Message -
From: Krishmin Rai kr...@missionfoc.us 
To: user@accumulo.apache.org 
Sent: Tuesday, October 30, 2012 3:28:15 PM 
Subject: Re: Number of partitions for sharded table 

I should clarify that I've been pre-splitting tables at each shard so that each 
tablet consists of a single row. 

On Oct 30, 2012, at 3:06 PM, Krishmin Rai wrote: 

 Hi All, 
 We're working with an index table whose row is a shardId (an integer, like 
 the wiki-search or IndexedDoc examples). I was just wondering what the right 
 strategy is for choosing a number of partitions, particularly given a cluster 
 that could potentially grow. 
 
 If I simply set the number of shards equal to the number of slave nodes, 
 additional nodes would not improve query performance (at least over the data 
 already ingested). But starting with more partitions than slave nodes would 
 result in multiple tablets per tablet server… I'm not really sure how that 
 would impact performance, particularly given that all queries against the 
 table will be batchscanners with an infinite range. 
 
 Just wondering how others have addressed this problem, and if there are any 
 performance rules of thumb regarding the ratio of tablets to tablet servers. 
 
 Thanks! 
 Krishmin 



Re: Number of partitions for sharded table

2012-10-30 Thread Krishmin Rai
Hi Dave,
  Our example is simpler than the wiki-search one, and I had forgotten exactly 
how wiki-search is structured: We're using a simple, single-table layout, 
everything sharded. We also have some other stuff in the column family, but 
simplified, it's just:

row: ShardID
colFam: Term
colQual: Document

And then searches will use an iterator extending IntersectingIterator to find 
results matching specified terms etc.

Thanks,
Krishmin

On Oct 30, 2012, at 3:43 PM, dlmar...@comcast.net wrote:

 Krishmin,
 
   In the wikisearch example there is a non-sharded index table and a sharded 
 document table. The index table is used to reduce the number of tablets that 
 need to be searched for a given set of terms. Is your setup similar? I'm a 
 little confused since you mention using a sharded index table and that all 
 queries will have an infinite range.
 
 Dave Marion
 
 From: Krishmin Rai kr...@missionfoc.us
 To: user@accumulo.apache.org
 Sent: Tuesday, October 30, 2012 3:28:15 PM
 Subject: Re: Number of partitions for sharded table
 
 I should clarify that I've been pre-splitting tables at each shard so that 
 each tablet consists of a single row.
 
 On Oct 30, 2012, at 3:06 PM, Krishmin Rai wrote:
 
  Hi All, 
   We're working with an index table whose row is a shardId (an integer, like 
  the wiki-search or IndexedDoc examples). I was just wondering what the 
  right strategy is for choosing a number of partitions, particularly given a 
  cluster that could potentially grow.
  
   If I simply set the number of shards equal to the number of slave nodes, 
  additional nodes would not improve query performance (at least over the 
  data already ingested). But starting with more partitions than slave nodes 
  would result in multiple tablets per tablet server… I'm not really sure how 
  that would impact performance, particularly given that all queries against 
  the table will be batchscanners with an infinite range.
  
   Just wondering how others have addressed this problem, and if there are 
  any performance rules of thumb regarding the ratio of tablets to tablet 
  servers.
  
  Thanks!
  Krishmin
 



Re: Number of partitions for sharded table

2012-10-30 Thread Adam Fuchs
Krishmin,

There are a few extremes to keep in mind when choosing a manual
partitioning strategy:
1. Parallelism and balance at ingest time. You need to find a happy medium
between too few partitions (not enough parallelism) and too many partitions
(tablet server resource contention and inefficient indexes). Probably at
least one partition per tablet server being actively written to is good,
and you'll want to pre-split so they can be distributed evenly. Ten
partitions per tablet server is probably not too many -- I wouldn't expect
to see contention at that point.
2. Parallelism and balance at query time. At query time, you'll be
selecting a subset of all of the partitions that you've ever ingested into.
This subset should be bounded similarly to the concern addressed in #1, but
the bounds could be looser depending on the types of queries you want to
run. Lower latency queries would tend to favor only a few partitions per
node.
3. Growth over time in partition size. Over time, you want partitions to be
bounded to less than about 10-100GB. This has to do with limiting the
maximum amount of time that a major compaction will take, and impacts
availability and performance in the extreme cases. At the same time, you
want partitions to be as large as possible so that their indexes are more
efficient.

One strategy to optimize partition size would be to keep using each
partition until it is full, then make another partition. Another would be
to allocate a certain number of partitions per day, and then only put data
in those partitions during that day. These strategies are also elastic, and
can be tweaked as the cluster grows.

In all of these cases, you will need a good load balancing strategy. The
default strategy of evening up the number of partitions per tablet server
is probably not sufficient, so you may need to write your own tablet load
balancer that is aware of your partitioning strategy.

Cheers,
Adam



On Tue, Oct 30, 2012 at 3:06 PM, Krishmin Rai kr...@missionfoc.us wrote:

 Hi All,
   We're working with an index table whose row is a shardId (an integer,
 like the wiki-search or IndexedDoc examples). I was just wondering what the
 right strategy is for choosing a number of partitions, particularly given a
 cluster that could potentially grow.

   If I simply set the number of shards equal to the number of slave nodes,
 additional nodes would not improve query performance (at least over the
 data already ingested). But starting with more partitions than slave nodes
 would result in multiple tablets per tablet server… I'm not really sure how
 that would impact performance, particularly given that all queries against
 the table will be batchscanners with an infinite range.

   Just wondering how others have addressed this problem, and if there are
 any performance rules of thumb regarding the ratio of tablets to tablet
 servers.

 Thanks!
 Krishmin


Re: Number of partitions for sharded table

2012-10-30 Thread Krishmin Rai
Thanks, Adam… that's exactly what I was looking for, and gives me a lot to 
think about.

-Krishmin

On Oct 30, 2012, at 4:08 PM, Adam Fuchs wrote:

 Krishmin,
 
 There are a few extremes to keep in mind when choosing a manual partitioning 
 strategy:
 1. Parallelism and balance at ingest time. You need to find a happy medium 
 between too few partitions (not enough parallelism) and too many partitions 
 (tablet server resource contention and inefficient indexes). Probably at 
 least one partition per tablet server being actively written to is good, and 
 you'll want to pre-split so they can be distributed evenly. Ten partitions 
 per tablet server is probably not too many -- I wouldn't expect to see 
 contention at that point.
 2. Parallelism and balance at query time. At query time, you'll be selecting 
 a subset of all of the partitions that you've ever ingested into. This subset 
 should be bounded similarly to the concern addressed in #1, but the bounds 
 could be looser depending on the types of queries you want to run. Lower 
 latency queries would tend to favor only a few partitions per node.
 3. Growth over time in partition size. Over time, you want partitions to be 
 bounded to less than about 10-100GB. This has to do with limiting the maximum 
 amount of time that a major compaction will take, and impacts availability 
 and performance in the extreme cases. At the same time, you want partitions 
 to be as large as possible so that their indexes are more efficient.
 
 One strategy to optimize partition size would be to keep using each partition 
 until it is full, then make another partition. Another would be to allocate 
 a certain number of partitions per day, and then only put data in those 
 partitions during that day. These strategies are also elastic, and can be 
 tweaked as the cluster grows.
 
 In all of these cases, you will need a good load balancing strategy. The 
 default strategy of evening up the number of partitions per tablet server is 
 probably not sufficient, so you may need to write your own tablet load 
 balancer that is aware of your partitioning strategy.
 
 Cheers,
 Adam
 
 
 
 On Tue, Oct 30, 2012 at 3:06 PM, Krishmin Rai kr...@missionfoc.us wrote:
 Hi All,
   We're working with an index table whose row is a shardId (an integer, like 
 the wiki-search or IndexedDoc examples). I was just wondering what the right 
 strategy is for choosing a number of partitions, particularly given a cluster 
 that could potentially grow.
 
   If I simply set the number of shards equal to the number of slave nodes, 
 additional nodes would not improve query performance (at least over the data 
 already ingested). But starting with more partitions than slave nodes would 
 result in multiple tablets per tablet server… I'm not really sure how that 
 would impact performance, particularly given that all queries against the 
 table will be batchscanners with an infinite range.
 
   Just wondering how others have addressed this problem, and if there are any 
 performance rules of thumb regarding the ratio of tablets to tablet servers.
 
 Thanks!
 Krishmin