In my experience, merging goes faster if you compact the ranges to be merged first.
On Tue, Feb 1, 2022, 07:48 Ligade, Shailesh [USA] <ligade_shail...@bah.com> wrote:

> Thank you for the explanation!
>
> Once I ran getsplits it was clear that the splits were the culprit, so I
> need to do the merge as well as bump the threshold to a higher number, as
> you suggested.
>
> If I have to perform a major compaction, should I do it before the merge
> or after?
>
> Thanks again,
>
> -S
>
> ------------------------------
> *From:* dev1 <d...@etcoleman.com>
> *Sent:* Monday, January 31, 2022 1:14 PM
> *To:* 'user@accumulo.apache.org' <user@accumulo.apache.org>
> *Subject:* [External] RE: tablets per tablet server for accumulo 1.10.0
>
> You can get the hdfs size using standard hdfs commands – count or ls. As
> long as you have not cloned the table, the size of the hdfs files and the
> space occupied by the table are equivalent.
>
> You can also get a sense of the referenced files by examining the
> metadata table – the column qualifier file: will give you just the
> referenced files. Directories named b-xxxxxxx are from a bulk import and
> t-xxxxxxx directories are assigned to tablets. File names also encode
> their origin: bulk import files start with I-xxxxxx, files from a full
> (major) compaction are A-xxxxxx, C-xxxxxx files come from a partial major
> compaction, and F-xxxxxx files are the result of a flush (minor
> compaction). In the metadata entries for the files, the value holds two
> numbers: the number of entries and the file size.
>
> How do you ingest – bulk or continuous? On a bulk ingest, the imported
> files end up in /accumulo/tables/x/b-xxxxx and are then assigned to
> tablets. The directories for the tablets will be created, but will be
> "empty" until a compaction occurs. A compaction will copy from the files
> referenced by the tablets into a new file placed in the corresponding
> /accumulo/tables/x/t-xxxxxx directory.
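The size and file checks described above can be sketched with a couple of commands. This is only a sketch: the table id "3" is a placeholder, and the paths assume Accumulo's default /accumulo volume layout.

```shell
# Total on-disk size of the table's files (table id 3 is a placeholder):
hdfs dfs -du -s -h /accumulo/tables/3
hdfs dfs -count /accumulo/tables/3

# Files currently referenced by the table's tablets, from the metadata
# table: rows for table 3 span "3;..." through "3<", family "file".
# The value of each entry is: number of entries, file size.
accumulo shell -u root -e 'scan -t accumulo.metadata -b "3;" -e "3<" -c file'
```

Files that appear under /accumulo/tables/3 in hdfs but not in the metadata scan are candidates for garbage collection.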
> When a bulk imported file is no longer referenced by any tablets, it
> will be garbage collected; until then the file will exist and inflate the
> apparent space used by the table. A compaction will also remove any data
> that is past the TTL for the records.
>
> Do you ever run a compaction? With a very large number of tablets, you
> may want to run the compaction in parts so that you don't end up
> occupying all of the compaction slots for a long time.
>
> Are you using keys (row ids) that are always increasing? A typical
> example would be a date. Say some of your row ids are yyyy-mm-dd-hh and
> there is a 10 day TTL. What will happen is that new data will continue to
> create new tablets, and on compaction the old tablets will age off and
> have 0 size. You can remove the "unused" splits by running a merge.
> Anything that creates new, ordered row ids can do this – new splits are
> created and the old splits eventually become unnecessary. If the row ids
> are distributed across the splits, this will not happen. It is not
> necessarily a problem if this is what your data looks like, just
> something that you may want to manage with merges.
>
> There is usually not much benefit to having a large number of tablets
> for a single table on a server. You can reduce the number of tablets
> required by setting the split threshold to a larger number and then
> running a merge. This can be done in sections, and you should run a
> compaction on each section first.
>
> If you have recently compacted, you can estimate the rough number of
> tablets necessary as hdfs size / split threshold = number of tablets. If
> you increase the split threshold you will need fewer tablets.
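As a worked example of the size/threshold arithmetic above (the 500 GiB figure is made up for illustration):

```shell
# ~500 GiB of recently compacted table data:
hdfs_size_gib=500

# With the default 1 GiB table.split.threshold -> ~500 tablets:
echo $(( hdfs_size_gib / 1 ))

# Raising the threshold to 5 GiB -> ~100 tablets after merging:
threshold_gib=5
tablets=$(( hdfs_size_gib / threshold_gib ))
echo "$tablets"
```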
> You may also consider setting a split threshold that is larger than your
> target – say you decided that 5G was a good target; setting the threshold
> to 8G during the merge and then lowering it to 5G when the merge
> completes will cause the table to split, and it could give you a better
> distribution of data across the splits.
>
> This can be done while things are running, but it will be a heavy IO
> load (on the files and on the hdfs namenode) and can take a very long
> time. What can be useful is to use the getsplits command with the
> max-splits option and create a script that compacts, then merges, one
> section at a time – using the splits as the start/end rows for the
> compaction and merge commands.
>
> Ed Coleman
>
> *From:* Ligade, Shailesh [USA] <ligade_shail...@bah.com>
> *Sent:* Monday, January 31, 2022 11:16 AM
> *To:* user@accumulo.apache.org
> *Subject:* tablets per tablet server for accumulo 1.10.0
>
> Hello,
>
> table.split.threshold is set to the default 1G (except for the metadata
> and root tables, which are set to 64M).
>
> What can cause the tablets per tablet server count to go high? Within a
> week, that count jumped from 5k per tablet server to 23k per tablet
> server, even though the total size in hdfs has not changed. Is the high
> count a cause for concern?
>
> We didn't apply any splits. I did a dumpConfig and checked all my tables
> and didn't see splits there either.
>
> Is there a way to find tablet size in hdfs? When I look at hdfs
> /accumulo/tables/x/ I see some empty folders, meaning not all folders
> have rf files. Is that normal?
>
> Thanks in advance!
>
> -S
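One possible shape for the section-by-section script described above. Everything here is a hypothetical sketch: the table name "mytable", user "root", and a splits.txt file holding one split point per line (e.g. saved from getsplits in the Accumulo shell) are all assumptions, and the commands need a running cluster.

```shell
#!/bin/sh
# Compact, then merge, one section of the table at a time, using saved
# split points as the section boundaries. All names are placeholders.
TABLE=mytable
prev=""
while read -r split; do
  # Compact the section first and wait for it to finish (-w) ...
  accumulo shell -u root -e "compact -t $TABLE ${prev:+-b $prev} -e $split -w"
  # ... then merge that same range.
  accumulo shell -u root -e "merge -t $TABLE ${prev:+-b $prev} -e $split"
  prev="$split"
done < splits.txt
# Final section: from the last split point to the end of the table.
accumulo shell -u root -e "compact -t $TABLE ${prev:+-b $prev} -w"
accumulo shell -u root -e "merge -t $TABLE ${prev:+-b $prev}"
```

Running the compaction inside the loop keeps only one section's worth of compaction slots busy at a time, which matches the earlier advice about not occupying all slots at once.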