Never underestimate the power of ASCII art!

Adam

On Oct 2, 2013 11:28 PM, "Eric Newton" <eric.new...@gmail.com> wrote:
> I'll use ASCII graphics to demonstrate the size of a tablet.
>
>   Small:  []
>   Medium: [  ]
>   Large:  [    ]
>
> Think of it like this... if you are running age-off, you probably have
> lots of little buckets of rows at the beginning and larger buckets at
> the end:
>
>   [][][][][][][][][]...[  ][  ][  ][  ][  ][  ]
>
> What you probably want is something like this:
>
>   [            ][    ][    ][    ][    ]
>
> One big bucket at the start, holding the old data, and some larger
> buckets for everything after it. But this would probably work too:
>
>   [    ][    ][    ][    ][    ][    ][    ]
>
> Just a bunch of larger tablets throughout.
>
> So you need to set your merge size to "[    ]" (4G), and you can always
> keep creating smaller tablets for future rows with manual splits:
>
>   [    ][    ][    ][  ][  ][][]
>
> So: increase the split threshold to 4G, and merge on 4G, but continue
> to make manual splits for your current days as necessary. Merge them
> away later.
>
> -Eric
>
> On Wed, Oct 2, 2013 at 6:35 PM, Dickson, Matt MR
> <matt.dick...@defence.gov.au> wrote:
>
>> *UNOFFICIAL*
>>
>> Thanks Eric,
>>
>> If I do the merge with a size of 4G, does the split threshold need to
>> be increased to 4G as well?
>>
>> ------------------------------
>> *From:* Eric Newton [mailto:eric.new...@gmail.com]
>> *Sent:* Wednesday, 2 October 2013 23:05
>> *To:* user@accumulo.apache.org
>> *Subject:* Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
>>
>> The most efficient way is kind of scary. If this is a production
>> system, I would not recommend it.
>>
>> First, find out the size of your 10x tablets. Let's say it's 10G. Set
>> your split threshold to 10G, then merge all of the old tablets, every
>> one of them, into one tablet. This will dump thousands of files into a
>> single tablet, but it will soon split out again into the nice 10G
>> tablets you are looking for. The system will probably be unusable
>> during this operation.
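[Editor's note: Eric's "raise the threshold, then merge on size" recipe maps onto a couple of Accumulo shell commands. A sketch only; the table name `mytable` is an assumption, and the 4G values come from the thread:]

```
# Raise the split threshold so merged tablets don't immediately re-split.
config -t mytable -s table.split.threshold=4G

# Merge away any tablet smaller than 4G, table-wide.
merge -t mytable -s 4G

# Keep pre-splitting the current day manually (example split points).
addsplits -t mytable 20131003-0000 20131003-0010 20131003-0020
```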
>> The more conservative way is to specify the merge in single steps (the
>> master will only coordinate one merge on a table at a time anyhow).
>> You can do it by range or by size; I would do it by size, especially
>> if you are aging off your old data.
>>
>> Compacting the data won't have any effect on the speed of the merge.
>>
>> -Eric
>>
>> On Tue, Oct 1, 2013 at 11:58 PM, Dickson, Matt MR
>> <matt.dick...@defence.gov.au> wrote:
>>
>>> *UNOFFICIAL*
>>>
>>> I have a table with splits of the form yyyymmdd-nnnn, where nnnn
>>> ranges from 0000 to 0840. The bulk of our data is loaded for the
>>> current date, with no data loaded for days older than 3 days, so from
>>> my understanding it would be wise to merge splits older than 3 days
>>> in order to reduce the overall tablet count. It would still be
>>> optimal to maintain some distribution of each day's tablets across
>>> the cluster, so I'm looking at merging splits in increments of 10,
>>> eg. merge -b 20130901-0000 -e 20130901-0009, thereby reducing 840
>>> splits per day to 84.
>>>
>>> Currently we have 120K tablets (size 1G) on a cluster of 56 nodes,
>>> and our ingest has slowed as the data quantity and tablet count have
>>> grown. Initially we were achieving 200-300K; now 50-100K.
>>>
>>> My question is: what is the best way to do this merge? Should we use
>>> the merge command with the size option set to something like 5G, or
>>> maybe use the compaction command?
>>>
>>> From my tests this process could take some time, so I'm keen to
>>> understand the most efficient approach.
>>>
>>> Thanks in advance,
>>> Matt Dickson
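[Editor's note: Matt's plan of collapsing each day's splits in groups of ten can be sketched as a small script that emits the corresponding shell merge commands. A sketch under stated assumptions: the table name `mytable` is hypothetical, and the 0000-0840 range is taken from his description:]

```python
def merge_commands(day: str, last: int = 840, group: int = 10) -> list[str]:
    """Emit one Accumulo shell merge command per group of `group` splits.

    Splits are named <day>-0000 .. <day>-<last>; each command merges one
    contiguous group of ten into a single tablet.
    """
    cmds = []
    for start in range(0, last + 1, group):
        end = min(start + group - 1, last)
        cmds.append(f"merge -t mytable -b {day}-{start:04d} -e {day}-{end:04d}")
    return cmds

for cmd in merge_commands("20130901")[:2]:
    print(cmd)
# prints:
# merge -t mytable -b 20130901-0000 -e 20130901-0009
# merge -t mytable -b 20130901-0010 -e 20130901-0019
```

Note that with splits running 0000 through 0840 inclusive there are 841 of them, so grouping by ten leaves a final singleton group at 0840.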