Thanks a lot to everyone - very nice point about looking for the oldest file and taking locality into consideration. Going to implement it now :)
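Bryan's three criteria below (locality, store-file count, age of the oldest file) can be combined into a single lexicographic ranking. A minimal sketch in Python - the input format and field names here are hypothetical, not anything HBase exposes under these names; you'd populate them from the HDFS getBlockLocations APIs and the region server metrics yourself:

```python
# Sketch of the selection policy Bryan describes: rank regions by
# locality first, then store-file count, then age of the oldest HFile.
# The dict keys below are made up for illustration.

def pick_compaction_candidate(regions):
    """Return the region most in need of a major compaction.

    `regions` is a list of dicts with hypothetical keys:
      name             - region encoded name
      locality         - fraction of HDFS blocks local to the RS (0.0-1.0)
      storefiles       - number of HFiles across the region's CFs
      oldest_file_age  - age in seconds of the region's oldest HFile
    """
    return min(
        regions,
        key=lambda r: (
            r["locality"],          # worst locality first (lowest fraction)
            -r["storefiles"],       # then most store files
            -r["oldest_file_age"],  # then longest since a compaction
        ),
    )
```

Because the key is lexicographic, store-file count only breaks ties between regions with equal locality, matching the priority order Bryan gives.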
On Wed, Jul 8, 2015 at 10:57 PM Bryan Beaudreault <[email protected]> wrote:

> Our automation uses a combination of the following to determine what to
> compact:
>
> - Which regions have bad locality (% of blocks that are local vs remote,
>   using HDFS getBlockLocations APIs)
> - Which regions have the most HFiles (most files per region/cf directory)
> - Which regions have gone the longest since a compaction (oldest file)
>
> The order here is the priority we have given each, but YMMV. We run in
> EC2, so we value locality over almost everything, to avoid network
> latencies on reads.
>
> On Wed, Jul 8, 2015 at 4:48 PM Jean-Marc Spaggiari
> <[email protected]> wrote:
>
> > Just missing the ColumnFamily at the end of the path. Your memory is
> > pretty good.
> >
> > JM
> >
> > 2015-07-08 16:39 GMT-04:00 Vladimir Rodionov <[email protected]>:
> >
> > > You can find this info yourself, Dejan:
> > >
> > > 1. Locate the table dir on HDFS
> > > 2. List all regions (directories)
> > > 3. Iterate over the files in each directory and find the oldest one
> > >    (creation time)
> > > 4. The region with the oldest file is your candidate for major
> > >    compaction
> > >
> > > /HBASE_ROOT/data/namespace/table/region (if my memory serves me
> > > right :))
> > >
> > > -Vlad
> > >
> > > On Wed, Jul 8, 2015 at 1:07 PM, Dejan Menges <[email protected]>
> > > wrote:
> > >
> > > > Hi Mikhail,
> > > >
> > > > Actually, the reason is quite stupid on my side - to avoid
> > > > compacting one region over and over again while others are waiting
> > > > in line (reading the HTML and sorting only on the number of store
> > > > files gets you, at some point, a bunch of regions having exactly
> > > > the same number of store files).
> > > >
> > > > Thanks for this hint - this is exactly something I was looking for.
> > > > Was trying previously to figure out if it's possible to query meta
> > > > for this information (currently using 0.98.0 and 0.98.4, and
> > > > waiting for HDP 2.3 from Hortonworks to upgrade immediately), but
> > > > for our current version I didn't find that possible, which is why
> > > > I decided to go this way.
> > > >
> > > > On Wed, Jul 8, 2015 at 10:02 PM Mikhail Antonov
> > > > <[email protected]> wrote:
> > > >
> > > > > I totally understand the reasoning behind compacting regions with
> > > > > the biggest number of store files, but I didn't follow why it's
> > > > > best to compact regions which have the biggest store files -
> > > > > maybe I'm missing something? I'd maybe compact regions which have
> > > > > the smallest avg storefile size?
> > > > >
> > > > > You may also want to take a look at
> > > > > https://issues.apache.org/jira/browse/HBASE-12859, and compact
> > > > > regions for which MC was last run the longest time ago.
> > > > >
> > > > > -Mikhail
> > > > >
> > > > > On Wed, Jul 8, 2015 at 10:30 AM, Dejan Menges
> > > > > <[email protected]> wrote:
> > > > > > Hi Behdad,
> > > > > >
> > > > > > Thanks a lot, but this part I do already. My question was more
> > > > > > about what to use (which metrics, exposed or not) to most
> > > > > > intelligently figure out where major compaction is needed the
> > > > > > most.
> > > > > >
> > > > > > Currently, choosing the region which has the biggest number of
> > > > > > store files + the biggest total store file size is doing the
> > > > > > job, but I wasn't sure if there's maybe something better to
> > > > > > choose from.
> > > > > >
> > > > > > Cheers,
> > > > > > Dejan
> > > > > >
> > > > > > On Wed, Jul 8, 2015 at 7:19 PM Behdad Forghani
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > >> To start major compaction for tablename from the CLI, you need
> > > > > >> to run:
> > > > > >> echo major_compact tablename | hbase shell
> > > > > >>
> > > > > >> I do this after bulk loading into the table.
> > > > > >>
> > > > > >> FYI, to avoid surprises, I also turn off the load balancer and
> > > > > >> rebalance regions manually.
> > > > > >>
> > > > > >> The CLI command to turn off the balancer is:
> > > > > >> echo balance_switch false | hbase shell
> > > > > >>
> > > > > >> To rebalance regions after a bulk load or other changes, run:
> > > > > >> echo balancer | hbase shell
> > > > > >>
> > > > > >> You can run these two commands using ssh. I use Ansible to do
> > > > > >> these. Assuming you have defined hbase_master in your hosts
> > > > > >> file, you can run:
> > > > > >> ansible -i hosts hbase_master -a "echo major_compact tablename
> > > > > >> | hbase shell"
> > > > > >>
> > > > > >> Behdad Forghani
> > > > > >>
> > > > > >> On Wed, Jul 8, 2015 at 8:03 AM, Dejan Menges
> > > > > >> <[email protected]> wrote:
> > > > > >>
> > > > > >> > Hi,
> > > > > >> >
> > > > > >> > What's the best way to automate major compactions without
> > > > > >> > enabling them during the off-peak period?
> > > > > >> >
> > > > > >> > What I was testing is a simple script which runs on every
> > > > > >> > node in the cluster, checks if a major compaction is already
> > > > > >> > running on that node, and if not, picks one region and runs
> > > > > >> > a compaction on that one region.
> > > > > >> >
> > > > > >> > It's been running for some time and it helped us get our
> > > > > >> > data into much better shape, but now I'm not quite sure
> > > > > >> > anymore how to choose which region to compact.
> > > > > >> > So far I was reading rs-status#regionStoreStats for that
> > > > > >> > node, first choosing the region with the biggest number of
> > > > > >> > storefiles, and then those with the biggest storefile sizes.
> > > > > >> >
> > > > > >> > Is there maybe something more intelligent I could/should do?
> > > > > >> >
> > > > > >> > Thanks a lot!
> > > > > >> >
> > > > > --
> > > > > Thanks,
> > > > > Michael Antonov
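Vlad's four steps (with JM's correction that the ColumnFamily directory sits at the end of the path) boil down to: for each region, find its oldest HFile, then pick the region whose oldest file is oldest overall. A rough sketch, assuming you have already collected HFile paths and times - e.g. by parsing `hdfs dfs -ls -R` output on the table directory, which is left out here:

```python
# Sketch of Vlad's procedure: under /HBASE_ROOT/data/<namespace>/<table>/,
# each region directory holds <cf>/<hfile> entries. Given a map of HFile
# paths to file times, the region whose oldest file is oldest overall is
# the candidate for major compaction.

def oldest_file_region(hfile_times):
    """hfile_times: {"/hbase/data/<ns>/<table>/<region>/<cf>/<hfile>": time}
    Returns (region, time of its oldest HFile) for the best candidate."""
    oldest_per_region = {}
    for path, ctime in hfile_times.items():
        region = path.split("/")[-3]  # .../<region>/<cf>/<hfile>
        if region not in oldest_per_region or ctime < oldest_per_region[region]:
            oldest_per_region[region] = ctime
    # Candidate = region whose oldest HFile has the smallest timestamp
    return min(oldest_per_region.items(), key=lambda kv: kv[1])
```

One caveat: `hdfs dfs -ls` reports modification time rather than creation time, but since HFiles are write-once, that effectively serves as the creation time Vlad refers to.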
