I assume you primarily access your data using a time range query then, so salting on the data tables makes sense. Is this the case for your secondary indexes as well? Do they lead their PK with a date/time column as well? Did you know you can turn salting off for an index over a salted table (by specifying SALT_BUCKETS=0 when you create the index)?
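To make the SALT_BUCKETS=0 suggestion concrete, here is a minimal sketch that only builds the DDL text for an unsalted index over a salted table. The table, index, and column names (EVENTS, EVENTS_BY_HOST, HOST) and the helper function are hypothetical, and the snippet does not connect to Phoenix; treat it as an illustration of where the option goes, not as a tested statement for your schema.

```python
def create_index_ddl(index_name, table_name, columns, salted=True):
    """Build a Phoenix CREATE INDEX statement as a string.

    All identifiers passed in are illustrative placeholders. When
    salted is False, SALT_BUCKETS=0 is appended, which (per the message
    above) turns salting off for an index over a salted table.
    """
    column_list = ", ".join(columns)
    ddl = f"CREATE INDEX {index_name} ON {table_name} ({column_list})"
    if not salted:
        ddl += " SALT_BUCKETS=0"
    return ddl


# Hypothetical index that does not lead with a date/time column,
# so it may not need salting at all.
print(create_index_ddl("EVENTS_BY_HOST", "EVENTS", ["HOST"], salted=False))
```

Whether a given index should stay salted depends on whether its leading PK column is monotonically increasing, which is what the questions above are trying to establish.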
I'd recommend using Pherf to find the optimal number of salt buckets. Perhaps you can start with 2x the number of region servers (+ some expected future growth, as you can't change the salt bucket count on a table without re-writing it).

Thanks,
James

On Mon, Jun 8, 2015 at 10:08 AM, Perko, Ralph J <[email protected]> wrote:
> James,
>
> Thanks for the response.
>
> There could be a dozen or so users accessing the system and the same portions
> of the tables. The motive for salting has been to eliminate hot spotting -
> our data is time-series based and that is what the PK is based on.
>
> Thanks,
> Ralph
>
> On 6/8/15, 10:00 AM, "James Taylor" <[email protected]> wrote:
>
>>Hi Ralph,
>>What kind of workload do you expect on your cluster? Will there be
>>many users accessing many different parts of your table(s)
>>simultaneously? Have you considered not salting your tables? Or do you
>>have hot spotting issues at write time due to the layout of your PK
>>that salting is preventing? With the advent of table stats
>>(http://phoenix.apache.org/update_statistics.html), Phoenix is able to
>>parallelize queries along equal chunks of data, similar to what
>>occurs with salting.
>>
>>The downside of salting is for queries that are only accessing a
>>handful of rows. Because Phoenix doesn't know which salt bucket
>>contains which of these rows, a scan always needs to be run for
>>every salt bucket. If you have 100 salt buckets, this is 100 scans
>>(worst case loading 100 blocks) versus a single scan for the unsalted
>>case (loading a single block). This will impact the throughput you
>>see.
>>
>>I'd encourage you to use Pherf (http://phoenix.apache.org/pherf.html)
>>to test salting (over multiple salt bucket sizes) versus unsalted for
>>realistic scenarios to get an accurate assessment for your workload.
>>
>>Thanks,
>>James
>>
>>On Mon, Jun 8, 2015 at 9:34 AM, Perko, Ralph J <[email protected]> wrote:
>>> Hi – following up on this.
>>>
>>> Is it generally recommended to roughly match the salt bucket count to region
>>> server count? Or is it more arbitrary? Should I use something like 255
>>> because the regions are going to split anyway?
>>>
>>> Thanks,
>>> Ralph
>>>
>>> From: "Perko, Ralph J"
>>> Reply-To: "[email protected]"
>>> Date: Friday, June 5, 2015 at 11:39 AM
>>> To: "[email protected]"
>>> Subject: Salt bucket count recommendation
>>>
>>> Hi,
>>>
>>> We have a 40 node cluster with 8 core tables and around 35 secondary index
>>> tables. The tables get very large – billions of records and terabytes of
>>> data. What salt bucket count do you recommend?
>>>
>>> Thanks,
>>> Ralph
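The two pieces of arithmetic in this thread (the "2x the number of region servers plus growth" starting point, and the one-scan-per-bucket cost of small lookups on a salted table) can be sketched as below. This is a rough illustration only: the function names are hypothetical, reading "growth" as extra buckets added on top of the 2x figure is one interpretation, treating each of the 40 nodes as a region server is an assumption, and the 256 cap reflects my understanding of Phoenix's documented SALT_BUCKETS limit. The resulting number is just a candidate to feed into Pherf runs, not a recommendation.

```python
def starting_salt_buckets(region_servers, growth_buckets=0, factor=2, max_buckets=256):
    """Rule-of-thumb starting point: factor * region servers + headroom.

    max_buckets=256 is assumed to be the upper bound Phoenix accepts
    for SALT_BUCKETS; adjust if your version documents otherwise.
    """
    return min(factor * region_servers + growth_buckets, max_buckets)


def scans_for_small_lookup(salt_buckets):
    """Scans needed when the salt bucket of the target rows is unknown.

    A salted table needs one scan per bucket; an unsalted table
    (salt_buckets == 0) needs a single scan.
    """
    return max(salt_buckets, 1)


# Assumption: the 40 node cluster runs 40 region servers.
candidate = starting_salt_buckets(40, growth_buckets=20)
print(candidate)                           # candidate bucket count to benchmark
print(scans_for_small_lookup(candidate))   # scans per small lookup when salted
print(scans_for_small_lookup(0))           # scans per small lookup when unsalted
```

Comparing the last two numbers shows why a high bucket count such as 255 trades write distribution against throughput for queries that touch only a handful of rows.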
