I'll use ASCII graphics to demonstrate the size of a tablet.
Small:  []
Medium: [  ]
Large:  [    ]
Think of it like this... if you are running age-off... you probably have
lots of little buckets of rows at the beginning and larger buckets at the
end:
[][][][][][][][][]...[  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ]
What you probably want is something like this:
[                ][    ][    ][    ][    ][    ][    ][    ]
One big bucket at the start, holding the old data, and some larger buckets
for everything afterwards. But... this would probably work:
[    ][    ][    ][    ][    ][    ][    ][    ][    ]
Just a bunch of larger tablets throughout.
So you need to set your merge size to "[    ]" (4G), and you can always
keep creating smaller tablets for future rows with manual splits:
[    ][    ][    ][    ][    ][    ][    ][    ][    ][ ][ ][ ][ ][ ]
So increase the split threshold to 4G, and merge on 4G, but continue to
make manual splits for your current days, as necessary. Merge them away
later.
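
For example, in the Accumulo shell (a rough sketch only; "mytable" is a
placeholder, and the addsplits row keys are just illustrative values
following your yyyymmdd-nnnn scheme):

  config -t mytable -s table.split.threshold=4G
  merge -t mytable -s 4G
  addsplits -t mytable 20131003-0000 20131003-0010 20131003-0020

The size-based merge combines adjacent small tablets up to roughly the
target size, so tablets already at 4G or larger are left alone.
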
-Eric
On Wed, Oct 2, 2013 at 6:35 PM, Dickson, Matt MR <[email protected]> wrote:
>
> *UNOFFICIAL*
> Thanks Eric,
>
> If I do the merge with a size of 4G, does the split threshold need to be
> increased to 4G as well?
>
> ------------------------------
> *From:* Eric Newton [mailto:[email protected]]
> *Sent:* Wednesday, 2 October 2013 23:05
> *To:* [email protected]
> *Subject:* Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
>
> The most efficient way is kind of scary. If this is a production
> system, I would not recommend it.
>
> First, find out the size of your 10x tablets. Let's say it's 10G. Set
> your split threshold to 10G. Then merge all old tablets.... all of them
> into one tablet. This will dump thousands of files into a single tablet,
> but it will soon split out again into the nice 10G tablets you are looking
> for. The system will probably be unusable during this operation.
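>
> A rough sketch of that in the Accumulo shell (the table name is a
> placeholder, and the begin/end rows stand in for whatever bounds your
> old data):
>
>   config -t mytable -s table.split.threshold=10G
>   merge -t mytable -b <first-old-split> -e <last-old-split>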
>
> The more conservative way is to specify the merge in single steps (the
> master will only coordinate a single merge on a table at a time anyhow).
> You can do it by range or by size... I would do it by size, especially if
> you are aging off your old data.
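>
> In shell terms, the by-size variant is just (again only a sketch;
> substitute your own table name and target size):
>
>   merge -t mytable -s 5G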
>
> Compacting the data won't have any effect on the speed of the merge.
>
> -Eric
>
>
>
> On Tue, Oct 1, 2013 at 11:58 PM, Dickson, Matt MR <[email protected]> wrote:
>
>>
>> *UNOFFICIAL*
>> I have a table for which we create splits of the form yyyymmdd-*nnnn*,
>> where nnnn ranges from 0000 to 0840. The bulk of our data is loaded for
>> the current date, with no data loaded for days older than 3 days, so from
>> my understanding it would be wise to merge splits older than 3 days in
>> order to reduce the overall tablet count. It would still be optimal to
>> maintain some distribution of tablets for a day across the cluster, so I'm
>> looking at merging splits in increments of 10, e.g. merge -b 20130901-0000
>> -e 20130901-0009, thereby reducing 840 splits per day to 84.
>>
>> Currently we have 120K tablets (size 1G) on a cluster of 56 nodes, and our
>> ingest has slowed as the data quantity and tablet count have grown.
>> Initially we were achieving 200-300K, now 50-100K.
>>
>> My question is, what is the best way to do this merge? Should we use the
>> merge command with the size option set at something like 5G, or maybe use
>> the compaction command?
>>
>> From my tests this process could take some time so I'm keen to understand
>> the most efficient approach.
>>
>> Thanks in advance,
>> Matt Dickson
>>
>
>