In my experience, merging goes faster if you compact the ranges to be
merged first.

On Tue, Feb 1, 2022, 07:48 Ligade, Shailesh [USA] <ligade_shail...@bah.com>
wrote:

> Thank you for explanation!
>
> Once I ran getSplits it was clear that splits were the culprit, so I need
> to do a merge as well as bump the threshold to a higher number, as you
> suggested.
>
> If I have to perform a major compaction, should I do it before or after
> the merge?
>
> Thanks again,
>
> -S
>
>
> ------------------------------
> *From:* dev1 <d...@etcoleman.com>
> *Sent:* Monday, January 31, 2022 1:14 PM
> *To:* 'user@accumulo.apache.org' <user@accumulo.apache.org>
> *Subject:* [External] RE: tablets per tablet server for accumulo 1.10.0
>
>
> You can get the hdfs size using standard hdfs commands – count or ls.  As
> long as you have not cloned the table, the size of the hdfs files and the
> space occupied by the table are equivalent.
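
As an illustrative sketch (the table id `x` follows the path convention used above; substitute your actual table id and instance path), those standard hdfs commands might look like:

```shell
# Directory count, file count, and total bytes for the table's files:
hdfs dfs -count /accumulo/table/x

# Human-readable total size (add -h for per-directory breakdown without -s):
hdfs dfs -du -s -h /accumulo/table/x
```

These require a running hdfs client configured for your cluster, so they are shown here only as a sketch.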
>
>
>
> You can also get a sense of the referenced files by examining the metadata
> table – the file: column qualifier will list the referenced files.
> Directories named b-xxxxxxx come from bulk imports, while t-xxxxxxx
> directories hold files assigned to tablets.  The file names also encode
> their origin: bulk import file names start with I-xxxxxx, files from a
> full compaction start with A-xxxxxx, files from a partial major compaction
> start with C-xxxxxx, and F-xxxxxx files are the result of a flush (minor
> compaction).  You can look at the entries for the files – the two numbers
> in the value are the number of entries and the file size.
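
A sketch of what that metadata scan can look like in the Accumulo shell (the table id 1a, row, file name, and numbers below are made up for illustration):

```
root@instance> scan -t accumulo.metadata -b 1a; -e 1a< -c file
1a;somerow file:/t-0001234/F0000abc.rf []    2345678,100000
```

The `-b`/`-e` arguments bound the scan to the metadata rows for one table id, and the two comma-separated numbers in the value are the counts described above.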
>
>
>
> How do you ingest? Bulk or continuous?  On a bulk ingest, the imported
> files end up in /accumulo/table/x/b-xxxxx and are then assigned to
> tablets.  The directories for the tablets will be created, but will be
> “empty” until a compaction occurs.  A compaction will copy the data from
> the files referenced by a tablet into a new file placed in the
> corresponding /accumulo/table/x/t-xxxxxx directory.  When a bulk imported
> file is no longer referenced by any tablet, it will be garbage collected;
> until then the file will exist and inflate the apparent space used by the
> table.  The compaction will also remove any data that is past the TTL for
> the records.
>
>
>
> Do you ever run a compaction?  With a very large number of tablets, you
> may want to run the compaction in parts so that you don’t end up occupying
> all of the compaction slots for a long time.
>
>
>
> Are you using keys (row ids) that are always increasing? A typical example
> would be a date.  Say some of your row ids are yyyy-mm-dd-hh and there is
> a 10 day TTL.  What will happen is that new data will continue to create
> new tablets, while on compaction the old tablets will age off and shrink
> to 0 size.  You can remove the “unused splits” by running a merge.
> Anything that creates new, ordered row ids can do this – new splits become
> necessary and the old splits eventually become unnecessary; if the row ids
> are distributed across the splits this will not happen.  It is not
> necessarily a problem if this is what your data looks like, just something
> that you may want to manage with merges.
>
>
>
> There is usually not much benefit in having a large number of tablets for
> a single table on a server.  You can reduce the number of tablets required
> by setting the split threshold to a larger number and then running a
> merge.  This can be done in sections, and you should run a compaction on
> each section first.
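
In the shell, that sequence might look like the following (the table name and boundary rows are placeholders; `-w` makes the compaction wait until it finishes):

```
root@instance> config -t mytable -s table.split.threshold=8G
root@instance> compact -t mytable -b row_a -e row_b -w
root@instance> merge -t mytable -b row_a -e row_b
```

Repeating the compact/merge pair over successive row ranges works through the table in sections rather than all at once.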
>
>
>
> If you have recently compacted, you can estimate the rough number of
> tablets necessary as hdfs size / split threshold = number of tablets.  If
> you increase the split threshold you will need fewer tablets.  You may
> also consider setting a split threshold that is larger than your target –
> say you decide that 5G is a good target: setting the threshold to 8G
> during the merge and then back to 5G when the merge completes will cause
> the table to split again, and it can give you a better distribution of
> data across the splits.
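
As a quick sanity check of that arithmetic (the 3 TiB table size and 5G threshold below are made-up numbers for illustration):

```shell
# Hypothetical example: a 3 TiB table with table.split.threshold=5G.
hdfs_bytes=$((3 * 1024 ** 4))   # total size of the table's rfiles in hdfs
threshold=$((5 * 1024 ** 3))    # split threshold in bytes
# Round up: each tablet holds at most ~threshold bytes after compaction.
tablets=$(( (hdfs_bytes + threshold - 1) / threshold ))
echo "$tablets"   # roughly 615 tablets
```

So raising the threshold from 1G to 5G on a table like this would cut the tablet count by about a factor of five.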
>
>
>
> This can be done while things are running, but it will be a heavy IO load
> (on the files and on the hdfs namenode) and can take a very long time.
> What can be useful is to use the getSplits command with the max-splits
> option and create a script that compacts, then merges, one section at a
> time – using those splits as the start / end rows for the compaction and
> merge commands.
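
One way to sketch such a script (everything here is a placeholder: the table name, credentials, split count, and the assumption that `accumulo shell -e` is used to run each command non-interactively):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: compact then merge one section at a time,
# using getsplits output as the section boundaries.
TABLE=mytable

# Capture ~20 evenly spaced splits to use as section boundaries.
accumulo shell -u root -p secret -e "getsplits -t $TABLE -m 20" > splits.txt

prev=""
while read -r split; do
  # Compact the section first (-w waits for completion), then merge it.
  accumulo shell -u root -p secret -e \
    "compact -t $TABLE ${prev:+-b $prev} -e $split -w"
  accumulo shell -u root -p secret -e \
    "merge -t $TABLE ${prev:+-b $prev} -e $split"
  prev="$split"
done < splits.txt
```

This only works against a live cluster and ignores details like escaped binary split points, so treat it as a starting point rather than a finished tool.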
>
>
>
> Ed Coleman
>
>
>
> *From:* Ligade, Shailesh [USA] <ligade_shail...@bah.com>
> *Sent:* Monday, January 31, 2022 11:16 AM
> *To:* user@accumulo.apache.org
> *Subject:* tablets per tablet server for accumulo 1.10.0
>
>
>
> Hello,
>
>
>
> table.split.threshold is set to the default 1G (except for metadata and
> root, which are set to 64M)
>
> What can cause the tablets per tablet server count to go high? Within a
> week, that count jumped from 5k/tablet server to 23k/tablet server, even
> though the total size in hdfs has not changed.
>
> Is a high count a cause for concern?
>
> We didn't apply any splits. I did a dumpConfig and checked all my tables
> and didn't see splits either.
>
>
>
> Is there a way to find tablet size in hdfs? When I look at hdfs
> /accumulo/table/x/ I see some empty folders, meaning not all folders have
> rf files. Is that normal?
>
>
>
> Thanks in advance!
>
>
>
> -S
>
