I have a table for which we create splits of the form yyyymmdd-nnnn, where nnnn
ranges from 0000 to 0840.  The bulk of our data is loaded for the current date,
with no data loaded for days older than 3 days, so from my understanding it
would be wise to merge splits older than 3 days in order to reduce the overall
tablet count.  It would still be optimal to maintain some distribution of
tablets for a day across the cluster, so I'm looking at merging splits in
groups of 10, e.g. merge -b 20130901-0000 -e 20130901-0009, thereby reducing
840 splits per day to 84.
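
As a rough sketch of the grouping described above, the per-day merge commands
could be generated with a small script and then pasted into the shell.  This
assumes 840 split suffixes per day numbered 0000 through 0839 (adjust
splits_per_day if 0840 is itself a split point) and reuses the merge -b/-e
form from the example above.  The function name and its parameters are
hypothetical, and how the shell treats the begin and end rows at the
boundaries has not been checked here.

```python
from datetime import date, timedelta


def merge_commands(day, splits_per_day=840, group_size=10):
    """Build one merge command per group of consecutive split suffixes.

    Assumption: suffixes run from 0000 to splits_per_day - 1.
    """
    prefix = day.strftime("%Y%m%d")
    commands = []
    for start in range(0, splits_per_day, group_size):
        # Last suffix in this group, capped at the final suffix of the day.
        end = min(start + group_size, splits_per_day) - 1
        commands.append(f"merge -b {prefix}-{start:04d} -e {prefix}-{end:04d}")
    return commands


if __name__ == "__main__":
    # Example target: four days back, i.e. outside the 3-day load window.
    target = date.today() - timedelta(days=4)
    for cmd in merge_commands(target):
        print(cmd)
```

With the defaults this yields 84 commands per day, one for each group of 10
suffixes.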

Currently we have 120K tablets (size 1G) on a cluster of 56 nodes, and our
ingest has slowed as the data quantity and tablet count have grown.  Initially
we were achieving 200-300K, now 50-100K.

My question is, what is the best way to do this merge?  Should we use the merge 
command with the size option set at something like 5G, or maybe use the 
compaction command?

From my tests this process could take some time, so I'm keen to understand the
most efficient approach.

Thanks in advance,
Matt Dickson