Hi Gajanan (CC dev@gora), this is something we may wish to implement within Gora's HBase support. If anything I've written below is incorrect, please correct the record. BTW, I found the following article by Enis on region splitting and merging extremely useful: https://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
On Wed, Sep 19, 2018 at 3:55 AM <user-digest-h...@nutch.apache.org> wrote:

> From: Gajanan Watkar <gajananwat...@gmail.com>
> To: user@nutch.apache.org
> Date: Wed, 19 Sep 2018 16:24:52 +0530
> Subject: Re: Nodemanager crashing repeatedly
>
> Hi Lewis,
> It appears that my setup was infected. After studying the ResourceManager
> logs closely I found that a lot of jobs were being submitted to my cluster
> as user "dr.who". Moreover, my crontab listed two wget cron jobs I never
> configured (I suspect a cryptocurrency miner) and one Java app running from
> /var/tmp/java. I configured the firewall, blocked port 8088, purged cron
> (as it kept coming back with every re-install) and removed the Java app
> from /var/tmp/java. That seems to have stabilized my setup. For now it is
> working fine: no more unexpected NodeManager exits. I also applied the
> patch for MalformedURLException.

Good to hear that you were able to debug this. From the description you provided I wondered whether it had anything to do with Nutch 2.x, mainly because I've never experienced anything like this before.

> I am getting uneven region sizes. Can you suggest how to pre-split the
> webpage table, i.e. which split points to use, which split policy, and an
> optimum GC setup for the region servers for efficient Nutch crawling?

Can you provide the version of HBase you are using? Assuming you are running the Nutch 2.x branch from Git, it should be 1.2.6. Can you also provide the logging from HBase which indicates uneven region sizes?

From what I understand (and I am no HBase expert), when Gora first creates the HBase table, by default only one region is allocated for it. This means that initially all requests will go to a single region server, regardless of the number of region servers in your HBase deployment.
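To make the pre-splitting idea concrete: Nutch 2.x keys the webpage table by reversed URL (e.g. "com.example.www:http/"), so split points on common reversed-TLD prefixes are one plausible starting point. Here is a minimal sketch in plain Java (no HBase dependency needed to build the keys themselves); the prefix list is purely illustrative and should really be derived from your own key distribution:

```java
import java.nio.charset.StandardCharsets;

/**
 * Sketch: build split keys for a table whose row keys are Nutch-style
 * reversed URLs ("com.example.www:http/..."). The prefixes below are an
 * illustrative guess, not a recommendation.
 */
public class WebpageSplits {

  /** Convert string prefixes into the byte[][] expected by Admin.createTable. */
  public static byte[][] toSplitKeys(String... prefixes) {
    byte[][] splits = new byte[prefixes.length][];
    for (int i = 0; i < prefixes.length; i++) {
      splits[i] = prefixes[i].getBytes(StandardCharsets.UTF_8);
    }
    return splits;
  }

  public static void main(String[] args) {
    // 4 split points => 5 initial regions:
    // (-inf,"com."), ["com.","edu."), ["edu.","net."), ["net.","org."), ["org.",+inf)
    byte[][] splits = toSplitKeys("com.", "edu.", "net.", "org.");
    for (byte[] s : splits) {
      System.out.println(new String(s, StandardCharsets.UTF_8));
    }
    // With a live HBase Admin handle this would then be passed as:
    //   admin.createTable(tableDesc, splits);
  }
}
```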
A knock-on effect of this is that the initial phases of loading data into the empty webpage table cannot utilize the whole capacity of the HBase cluster; however, I don't think that is by any means your issue. The issue at hand is supplying split points at table creation time, which would hopefully resolve the uneven region sizes. The comment I made above about Gora allocating only one region for the table is correct: take a look at [0] and you will see that we do not pass the additional parameters to Admin.createTable which would explicitly specify, for example, the split points. The overloads which could be used when creating our table are below; they can also be seen at [1]:

  void createTable(HTableDescriptor desc)
      Creates a new table.
  void createTable(HTableDescriptor desc, byte[][] splitKeys)
      Creates a new table with an initial set of empty regions defined by the
      specified split keys.
  void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)
      Creates a new table with the specified number of regions.
  void createTableAsync(HTableDescriptor desc, byte[][] splitKeys)
      Creates a new table but does not block and wait for it to come online.

The other issue you asked about was the split policy... again, we do not currently specify an explicit split policy; instead we rely on the auto-splitting capability made available by HBase (which I believe is IncreasingToUpperBoundRegionSplitPolicy by default in HBase 1.2). If we wanted an explicit split policy, we could implement the code below at the following line [2] within Gora's HBaseStore#createSchema method:

  HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("example-table"));
  tableDesc.setValue(HTableDescriptor.SPLIT_POLICY, AwesomeSplitPolicy.class.getName());
  // add column families etc.
  admin.createTable(tableDesc);

OR, we could make this configurable by exposing 'hbase.regionserver.region.split.policy' within gora.properties.
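To illustrate the configurable route: something along these lines in gora.properties. Note this pass-through does not exist in Gora today; the property name reuses HBase's own key and the policy class is just one real HBase-provided example:

```properties
# Hypothetical: gora-hbase could read this key and apply it via
# tableDesc.setValue(HTableDescriptor.SPLIT_POLICY, ...) during createSchema.
hbase.regionserver.region.split.policy=org.apache.hadoop.hbase.regionserver.KeyPrefixRegionSplitPolicy
```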
There are a few ways we could prototype this. Finally, regarding GC, I am not entirely sure right now; I don't know too much about HBase optimization, but as with any distributed system you can tinker with GC settings until you land on something that works. The above hopefully gets you started in the right direction.

hth
Lewis

[0] https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L182
[1] https://hbase.apache.org/1.2/apidocs/org/apache/hadoop/hbase/client/Admin.html
[2] https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L180
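P.S. on GC: a commonly seen hbase-env.sh starting point for HBase 1.x region servers on JDK 8 is below. The heap size and occupancy threshold are illustrative only, not tuned recommendations; adjust against your own GC logs:

```sh
# hbase-env.sh -- illustrative values, not tuned recommendations
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms8g -Xmx8g \
  -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+ParallelRefProcEnabled \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```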