On Wed, Jul 18, 2012 at 7:26 PM, Bryan Beaudreault
<[email protected]> wrote:
> I am looking into managing major compactions ourselves, but there doesn't 
> appear to be any mechanisms I can hook in to determine which tables need 
> compacting.  Ideally each time my cron job runs it would compact the table 
> with the next longest time since compaction, but I can't find a way to access 
> this metric.
>

Would suggest you have a region-view rather than a table-view.

Internally, we look at the hdfs modification time when deciding whether
to compact.  If it is older than the major compaction interval
configured for the particular column family, we'll do a major compaction.

Running an external script, you could look at each region in turn on
occasion.  Look at its files.  Check their modification times (and
perhaps how many files there are under the region's column family), and
if they're older than whatever threshold you want, run a major
compaction on the region.

Try to balance how many you'd have running at a time.
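A cron job along those lines could be sketched as below. This is an untested sketch, not HBase's own code: the `/hbase/mytable` path, the one-week threshold, and the cap of two compactions per run are placeholder assumptions; only the `hdfs dfs -ls -R` column layout and the `major_compact` shell command are real interfaces. Assumes GNU `date`.

```shell
#!/usr/bin/env bash
# Read an `hdfs dfs -ls -R` listing on stdin; print (deduped) region
# directories that contain a store file older than $2 seconds, as of
# epoch time $1.  Listing columns assumed:
#   perms replication owner group size date time path
pick_stale_regions() {
  local now=$1 max_age=$2
  awk '$1 ~ /^-/ {print $6, $7, $8}' |
  while read -r day time path; do
    mtime=$(date -d "$day $time" +%s)
    if [ $((now - mtime)) -gt "$max_age" ]; then
      # path is .../region/<cf>/<storefile>; keep the region component
      dirname "$(dirname "$path")"
    fi
  done | sort -u
}

# Cron usage sketch, capped at two compactions per run (the region name
# passed to the shell is the basename of the region directory):
#   hdfs dfs -ls -R /hbase/mytable \
#     | pick_stale_regions "$(date +%s)" $((7 * 24 * 3600)) \
#     | head -n 2 \
#     | while read -r r; do
#         echo "major_compact '$(basename "$r")'" | hbase shell
#       done
```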

> The default major compaction algorithm seems to be able to get the oldest 
> modified time for all store files for a region to determine when it was last 
> major compacted.  I know this is not ideal, but it seems good enough.  
> Unfortunately I don't see an easy way to get this.
>

It's in the stats data structure for an hdfs file.  Scripting, you
could parse it from an hdfs listing.
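For the oldest-modified-time estimate, one way (a sketch; the field positions assume the standard `hdfs dfs -ls` layout, and the path below is a hypothetical example) is to sort the listing's date and time columns:

```shell
# Print the oldest modification time among the files in an
# `hdfs dfs -ls` listing read on stdin.  "YYYY-MM-DD HH:MM" sorts
# correctly as plain text, so no date arithmetic is needed.
oldest_mtime() {
  awk '$1 ~ /^-/ {print $6" "$7}' | sort | head -n 1
}

# e.g.: hdfs dfs -ls /hbase/mytable/someregion/somefamily | oldest_mtime
```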


St.Ack
