Roughly (I don’t have the exact command syntax on hand): I make a file that is 
then executed by passing it to the shell command. To build the command file:

Use the getSplits command with the number of batches that I want – that can 
roughly be calculated as # current tablets / (# tservers * # compaction 
slots * comfort factor). You can specify an output file or tee the command 
output, something like


  getsplits -t tablename -n 20 -o /tmp/my_splits.txt
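
(A purely illustrative sizing example with made-up cluster numbers: 72,000 
tablets across 20 tservers with 6 major compaction slots each and a comfort 
factor of 3 works out to 72,000 / (20 * 6 * 3) = 200 rounds – the 20 in the 
command above is just a placeholder.)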


This would give you the splits for 20 rounds. Using those splits, the compact 
command file then looks like:

compact -w -t tablename -e [first split]
compact -w -t tablename -b [first split] -e [second split]
…
compact -w -t tablename -b [last split]

To do a merge, interleave the merge commands:

compact -w -t tablename -e [first split]
merge -w -t tablename -size 5G -e [first split]
compact -w -t tablename -b [first split] -e [second split]
merge -w -t tablename -size 5G -b [first split] -e [second split]
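
As a rough, untested sketch (file names and the 5G size are placeholders; it 
assumes the splits file holds one plain-text split point per line in sorted 
order, and you should double-check the exact compact / merge option names 
against your shell's help), a small bash loop can turn the getsplits output 
into the interleaved command file:

#!/usr/bin/env bash
# Sketch: build an interleaved compact/merge command file from a splits file.
# Assumes /tmp/my_splits.txt has one plain-text split point per line, in order.
# Split points containing spaces or shell metacharacters would need extra quoting.
TABLE=tablename
SPLITS=/tmp/my_splits.txt
OUT=/tmp/compact_merge.txt

prev=""
> "$OUT"
while IFS= read -r split || [ -n "$split" ]; do
  if [ -z "$prev" ]; then
    # first round: from the start of the table up to the first split
    echo "compact -w -t $TABLE -e $split" >> "$OUT"
    echo "merge -w -t $TABLE -size 5G -e $split" >> "$OUT"
  else
    echo "compact -w -t $TABLE -b $prev -e $split" >> "$OUT"
    echo "merge -w -t $TABLE -size 5G -b $prev -e $split" >> "$OUT"
  fi
  prev="$split"
done < "$SPLITS"
# last round: from the last split to the end of the table
echo "compact -w -t $TABLE -b $prev" >> "$OUT"
echo "merge -w -t $TABLE -size 5G -b $prev" >> "$OUT"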

Then just issue the shell command with (login info) and the switch that passes 
the command file. (I don’t recall offhand whether that switch is -e or -f; I 
believe -f executes a file and -e executes a single command.)
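
For example (user, password, and file path here are just placeholders):

accumulo shell -u root -p secret -f /tmp/compact_merge.txt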

The -w switch makes each command wait, so each round completes before moving on to the next.

The comfort factor is a multiplier that increases the number of tablets in each 
round. This will oversubscribe the compaction slots – but usually some 
compactions are quick for small tablets and the oversubscription drops off 
quickly. It is a balancing act: you want fewer rounds, but you also want to 
limit the oversubscription period.

You may want to increase the number of compaction slots available – depending 
on your hardware and load – I think the default is 3, and 6 is not unreasonable.
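
If you do raise it, I believe the relevant property in 1.10 is 
tserver.compaction.major.concurrent.max (default 3); setting it system-wide 
from the shell would look something like:

config -s tserver.compaction.major.concurrent.max=6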

Using the compact / merge command with only an end row (the first round, which 
starts from the table's first row) and only a begin row (the last round, which 
runs to the table's last row) ensures that the whole table is covered – don’t 
mix them up, or you will compact everything.

A few tablets can take much longer if the row ids are not evenly distributed – 
the time each round takes will be the time of its longest compaction. With 
larger but fewer rounds, you increase the chance that more of the long poles 
land in the same round and run in parallel, which shortens the total time 
needed to complete. Doing it in rounds does take longer overall, though, 
because each round may have a long pole that is essentially being compacted 
serially, round after round.

Ed Coleman


From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Friday, February 4, 2022 8:28 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: Re: tablets per tablet server for accumulo 1.10.0

Thank you,

Will a range compaction (compact -t <> --begin-row <> --end-row <>) be faster 
than just compact -t <>? My worry is that if I somehow issue 72k compact 
commands at once, it will kill the system. 
On that note, what is the best way to issue these compact commands, especially 
because there are so many of them? I saw that accumulo shell -u <> -p <> -e 
'compact ...,compact ...,compact ...' will work, I just don't know how many I 
can tack onto one shell command. Is there a better way of doing all this? I 
want to be as gentle to my production system and yet as fast as possible – I 
don't want to spend days doing compact/merge 🙁

Thanks

-S

________________________________
From: dev1 <d...@etcoleman.com>
Sent: Tuesday, February 1, 2022 8:53 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: [External] RE: tablets per tablet server for accumulo 1.10.0


Before.  That has the benefit that file sizes are reduced (if data is eligible 
for age off) and the merge is operating on current file sizes.



From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Tuesday, February 1, 2022 7:49 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: Re: tablets per tablet server for accumulo 1.10.0



Thank you for explanation!



Once I ran getsplits, it was clear that the splits were the culprit, so I need 
to do a merge as well as bump the threshold to a higher number, as you have 
suggested.



If I have to perform a major compaction, should I do it before the merge or 
after the merge?



Thanks again,



-S





________________________________

From: dev1 <d...@etcoleman.com>
Sent: Monday, January 31, 2022 1:14 PM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: [External] RE: tablets per tablet server for accumulo 1.10.0



You can get the hdfs size using standard hdfs commands – count or ls.  As long 
as you have not cloned the table, the size of the hdfs files and the space 
occupied by the table are equivalent.
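
For example, with the default layout where a table's files live under 
/accumulo/tables/<table id> in hdfs (you can look up the table id in the shell 
with tables -l), the total footprint is something like:

hdfs dfs -du -s -h /accumulo/tables/<table id>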



You can also get a sense of the referenced files by examining the metadata 
table – the file: column will give you just the referenced files. You can also 
look at the directories: b-xxxxxxx directories are from a bulk import and 
t-xxxxxxx directories hold the files assigned to the tablets. Bulk import file 
names start with I-xxxxxx; files from compactions will be A-xxxxxx if from a 
full compaction or C-xxxxxxx from a regular (non-full) major compaction, and 
F-xxxxxx is the result of a flush (minor compaction). You can look at the 
entries for the files – the numbers in the value are the number of entries and 
the file size.
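
A hedged example of looking at those entries from the shell – here the table id 
3 is made up (get yours with tables -l), and a table's metadata rows fall 
between <id>; and <id>< :

scan -t accumulo.metadata -b 3; -e 3< -c file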



How do you ingest – bulk or continuous?  On a bulk ingest, the imported files 
end up in /accumulo/table/x/b-xxxxx and are then assigned to tablets – the 
directories for the tablets will be created, but will be “empty” until a 
compaction occurs. A compaction will copy from the files referenced by the 
tablets into a new file that is placed into the corresponding 
/accumulo/table/x/t-xxxxxx directory. When a bulk imported file is no longer 
referenced by any tablet, it will get garbage collected; until then the file 
will exist and inflate the space used by the table. The compaction will also 
remove any data that is past the TTL for the records.



Do you ever run a compaction?  With a very large number of tablets, you may 
want to run the compaction in parts so that you don’t end up occupying all of 
the compaction slots for a long time.



Are you using keys (row ids) that are always increasing? A typical example 
would be a date. Say some of your row ids are yyyy-mm-dd-hh and there is a 10 
day TTL. What will happen is that new data will continue to create new 
tablets and, on compaction, the old tablets will age off and end up with 0 
size. You can remove the “unused splits” by running a merge. Anything that 
creates new row ids that are ordered can do this – new splits keep becoming 
necessary and the old splits eventually become unnecessary; if the row ids are 
distributed across the splits this will not happen. It is not necessarily a 
problem if this is what your data looks like, just something that you may want 
to manage with merges.



There is usually not much benefit to having a large number of tablets for a 
single table on a server. You can reduce the number of tablets required by 
setting the split threshold to a larger number and then running a merge. This 
can be done in sections, and you should run a compaction on each section first.
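
In the shell, raising the split threshold on the table would look something 
like this (tablename and the size are placeholders):

config -t tablename -s table.split.threshold=8G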



If you have recently compacted, you can figure out the rough number of tablets 
necessary by taking hdfs size / split threshold = number of tablets. If you 
increase the split threshold you will need fewer tablets. You may also 
consider setting a split threshold that is larger than your target – say you 
decided that 5G was a good target; setting the threshold to 8G during the 
merge and then setting it back to 5G when completed will cause the table to 
split, and it could give you a better distribution of data across the splits.
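
As a rough illustration with made-up numbers: a freshly compacted table 
occupying about 1 TB in hdfs with a 5G threshold works out to roughly 
1024G / 5G ≈ 205 tablets. Once the merge completes, you would set the 
threshold back down to the real target, e.g.:

config -t tablename -s table.split.threshold=5G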



This can be done while things are running, but it will be a heavy IO load (on 
files and on the hdfs namenode) and can take a very long time. What can be 
useful is to use the getSplits command with the number-of-splits option and 
create a script that compacts, then merges, a section at a time – using the 
splits as the start / end rows for the compaction and merge commands.



Ed Coleman



From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Monday, January 31, 2022 11:16 AM
To: user@accumulo.apache.org
Subject: tablets per tablet server for accumulo 1.10.0



Hello,



table.split.threshold is set to the default 1G (except for metadata and root, 
which are set to 64M).

What can cause the tablets per tablet server count to go high? Within a week, 
that count jumped from 5k/tablet server to 23k/tablet server, even though the 
total size in hdfs has not changed.

Is a high count a cause for concern?

We didn't apply any splits. I did a dumpConfig and checked all my tables and 
didn't see splits either.



Is there a way to find tablet size in hdfs? When I look at hdfs 
/accumulo/table/x/ I see some empty folders, meaning not all folders have rf 
files. Is that normal?



Thanks in advance!



-S
