You can get the HDFS size using standard HDFS commands - count or ls.  As long 
as you have not cloned the table, the size of the HDFS files and the space 
occupied by the table are equivalent.
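
For example, assuming your table id is 2 (a placeholder - look yours up with 
"tables -l" in the shell) and the default instance layout under /accumulo:

  # total size of the table's files in HDFS
  hdfs dfs -du -s -h /accumulo/tables/2

  # directory count, file count, and content size in one line
  hdfs dfs -count /accumulo/tables/2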

You can also get a sense of the referenced files by examining the metadata 
table - the entries in the file column family give you the referenced files, 
with the file path as the column qualifier. Directories named b-xxxxxxx come 
from a bulk import, while t-xxxxxxx directories belong to tablets.  Bulk import 
file names start with I-xxxxxx; files from compactions will be A-xxxxxx if from 
a full major compaction or C-xxxxxx from a partial one, and F-xxxxxx is the 
result of a flush (minor compaction). If you look at the entries for the files, 
the two numbers in the value are the file size and the number of entries.
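
A minimal example in the Accumulo shell (table id 2 is again a placeholder):

  scan -t accumulo.metadata -b 2; -e 2< -c file

The range 2; to 2< covers every tablet of table 2, including the default 
tablet; each entry returned has a referenced file path as its qualifier and 
the size,entry-count pair as its value.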

How do you ingest? Bulk or continuous?  On a bulk ingest, the imported files 
end up in /accumulo/tables/x/b-xxxxx and are then assigned to tablets - the 
directories for the tablets will be created, but will be "empty" until a 
compaction occurs.  A compaction will copy from the files referenced by the 
tablets into a new file that is placed into the corresponding 
/accumulo/tables/x/t-xxxxxx directory.  When a bulk imported file is no longer 
referenced by any tablet, it will get garbage collected; until then the file 
will exist and inflate the actual space used by the table. The compaction will 
also remove any data that is past the TTL for the records.
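
To check for leftover bulk directories (same placeholder table id):

  hdfs dfs -ls /accumulo/tables/2

b- directories that persist long after an import finished usually mean the 
files in them are still referenced by some tablet and have not been garbage 
collected yet.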

Do you ever run a compaction?  With a very large number of tablets, you may 
want to run the compaction in parts so that you don't end up occupying all of 
the compaction slots for a long time.
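
For example, to compact a single section in the shell (the table name and rows 
here are placeholders):

  compact -t mytable -b 2022-01-01 -e 2022-01-15 -w

The -w flag waits for the compaction to finish, which makes it easy to step 
through sections one at a time.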

Are you using keys (row ids) that are always increasing? A typical example 
would be a date.  Say some of your row ids are yyyy-mm-dd-hh and there is a 10 
day TTL.  What will happen is that new data will continue to create new 
tablets, and on compaction the old tablets will age off and have 0 size.  You 
can remove the "unused splits" by running a merge.  Anything that creates new, 
ordered row ids can do this - new splits keep becoming necessary and the old 
splits eventually become unnecessary. If the row ids are distributed across 
the splits, this will not happen. It is not necessarily a problem if this is 
what your data looks like, just something that you may want to manage with 
merges.
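
As a sketch, to merge away a range of aged-off splits (table name and rows are 
placeholders):

  merge -t mytable -b 2021-12-01-00 -e 2022-01-01-00

There is also a size-based form, merge -t mytable -s 1G, which merges tablets 
across the whole table until they reach roughly the given size.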

There is usually not much benefit in having a large number of tablets for a 
single table on a server.  You can reduce the number of tablets required by 
setting the split threshold to a larger number and then running a merge.  This 
can be done in sections, and you should run a compaction on each section first.
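
A minimal sketch in the shell (the table name, rows, and 8G value are all 
placeholders):

  config -t mytable -s table.split.threshold=8G
  compact -t mytable -b row_a -e row_b -w
  merge -t mytable -b row_a -e row_b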

If you have recently compacted, you can figure out the rough number of tablets 
necessary by taking hdfs size / split threshold = number of tablets.  If you 
increase the split threshold size you will need fewer tablets.  You may also 
consider setting a split threshold that is larger than your target - say you 
decided that 5G was a good target: setting the threshold to 8G during the 
merge and then dropping it to 5G when the merge completes will cause the table 
to split again, and it could give you a better distribution of data across the 
splits.
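
For example, 10 TB in HDFS at a 5G threshold works out to roughly 
10240G / 5G = ~2048 tablets, versus about 10240 tablets at the default 1G 
threshold.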

This can be done while things are running, but it will be a heavy IO load (on 
the files and on the HDFS namenode) and can take a very long time. What can be 
useful is to use the getSplits command with the max-splits option and create a 
script that compacts, then merges, a section - using the splits as the start / 
end rows for the compact and merge commands.
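
A rough sketch of such a script (the user, table name, split count, and file 
path are all placeholders, and the raw getsplits output may need cleanup 
before use):

  #!/usr/bin/env bash
  # Sketch: compact, then merge, one section at a time, using getsplits
  # output as section boundaries. All names here are placeholders;
  # add -p or other authentication options as appropriate.
  accumulo shell -u root -e "getsplits -t mytable -m 20" > /tmp/splits.txt

  prev=""
  while read -r s; do
      if [ -z "$prev" ]; then
          # first section: start of the table up to the first split
          accumulo shell -u root -e "compact -t mytable -e $s -w"
          accumulo shell -u root -e "merge -t mytable -e $s"
      else
          accumulo shell -u root -e "compact -t mytable -b $prev -e $s -w"
          accumulo shell -u root -e "merge -t mytable -b $prev -e $s"
      fi
      prev="$s"
  done < /tmp/splits.txt

  # last section: from the final split to the end of the table
  accumulo shell -u root -e "compact -t mytable -b $prev -w"
  accumulo shell -u root -e "merge -t mytable -b $prev"

If your split points can contain spaces or binary data, use the -b64 option to 
getsplits and decode the values rather than interpolating raw strings.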

Ed Coleman

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Monday, January 31, 2022 11:16 AM
To: user@accumulo.apache.org
Subject: tablets per tablet server for accumulo 1.10.0

Hello,

table.split.threshold is set to the default 1G (except for metadata and root, 
which are set to 64M).
What can cause the tablets-per-tablet-server count to go high? Within a week, 
that count jumped from 5k per tablet server to 23k per tablet server, even 
though the total size in HDFS has not changed.
Is a high count a cause for concern?
We didn't apply any splits. I did a dumpConfig and checked all my tables and 
didn't see splits there either.

Is there a way to find tablet size in HDFS? When I look at HDFS under 
/accumulo/table/x/ I see some empty folders, meaning not all folders have rf 
files. Is that normal?

Thanks in advance!

-S
