Hi Gajanan,
CC dev@gora, as this is something we may wish to implement within Gora's
HBase module.
If anything I've provided below is incorrect, then please correct the
record.
BTW, I found the following article, written by Enis, extremely useful:
https://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/

On Wed, Sep 19, 2018 at 3:55 AM <user-digest-h...@nutch.apache.org> wrote:

> From: Gajanan Watkar <gajananwat...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 19 Sep 2018 16:24:52 +0530
> Subject: Re: Nodemanager crashing repeatedly
> Hi Lewis,
> It appears that my setup was infected. After studying ResourceManager logs
> closely I found that lot of jobs were getting submitted to my cluster as
> user "dr who". Moreover my crontab was listing 2 wget cron jobs I never
> configured (Suspect it to be cryptocurrency miner) and one java app running
> from /var/tmp/java. I Configured firewall, blocked port 8088, purged cron
> (as it was coming back with every re-install) and removed java app from
> /var/tmp/java. It seem to have stabilized my setup. For now it is working
> fine. No more unexpected NodeManager Exits. Also applied patch for
> MalformedURLException.
>

Good to hear that you were able to debug this. From the description you
provided, I wondered whether it had anything to do with Nutch 2.x at all,
mainly because I've never experienced anything like this in the past.


>
> I am getting uneven region sizes, can you suggest me on pre-spliting
> webpage table i.e. split points to be used and splitting policy and optimum
> GC setup for regionserver for efficient Nutch crawling.
>
>
Can you provide the version of HBase you are using? Assuming that you are
using Nutch 2.x branch from Git, you should be using 1.2.6.
Can you also provide the logging from HBase which indicates uneven region
sizes?

From what I understand (and I am no HBase expert), when Gora first creates
the HBase table, by default only one region is allocated for the table.
This means that initially all requests go to a single region server,
regardless of the number of region servers in your HBase deployment. A
knock-on effect is that the initial phases of loading data into the empty
webpage table cannot utilize the whole capacity of the HBase cluster.
However, I don't think this is by any means your issue.

The issue at hand is supplying split points at table creation time, which
would hopefully resolve the uneven region sizes. The comment I made above
about Gora creating the table with only one region is correct: take a look
at [0] and you will see that our call to Admin.createTable passes no
additional parameters that would explicitly specify, for example, the split
points. The overloads which accept such parameters are listed below and can
also be seen at [1].

void createTable(HTableDescriptor desc)
Creates a new table.

void createTable(HTableDescriptor desc, byte[][] splitKeys)
Creates a new table with an initial set of empty regions defined by the
specified split keys.

void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey,
    int numRegions)
Creates a new table with the specified number of regions.

void createTableAsync(HTableDescriptor desc, byte[][] splitKeys)
Creates a new table but does not block and wait for it to come online.
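
As a rough illustration of the splitKeys overload, here is a minimal sketch
of how split keys could be prepared. As I understand it, Nutch 2.x row keys
are reversed URLs (e.g. "com.example.www:http/"), so reversed-host prefixes
are one plausible choice of split point. WebpageSplitKeys and the prefixes
used with it are hypothetical and not part of Gora or HBase; the right split
points depend entirely on the hosts in your crawl.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeSet;

/** Hypothetical helper, not part of Gora or HBase. */
final class WebpageSplitKeys {

    /** Compares byte arrays as unsigned bytes, the order HBase uses for row keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Turns reversed-host prefixes into sorted, de-duplicated split keys. */
    static byte[][] fromPrefixes(String... prefixes) {
        TreeSet<byte[]> sorted = new TreeSet<>(WebpageSplitKeys::compareUnsigned);
        for (String prefix : prefixes) {
            // Skip null and empty prefixes; an empty split key is meaningless.
            if (prefix != null && !prefix.isEmpty()) {
                sorted.add(prefix.getBytes(StandardCharsets.UTF_8));
            }
        }
        return sorted.toArray(new byte[0][]);
    }
}
```

The result would then be handed to admin.createTable(tableDesc, splitKeys);
N split keys give N + 1 initial regions. The helper drops duplicates and
empty keys because I believe HBase rejects both.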

The other thing you asked about was the split policy. Again, we do not
currently specify an explicit split policy; instead we rely on the default
auto-splitting made available by HBase (which I believe is
IncreasingToUpperBoundRegionSplitPolicy in HBase 1.x;
ConstantSizeRegionSplitPolicy was the default before 0.94). If we wanted to
set an explicit split policy, we could do so by adding the code below at
the following line [2] within Gora's HBaseStore#createSchema method.

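HTableDescriptor tableDesc =
    new HTableDescriptor(TableName.valueOf("example-table"));
// AwesomeSplitPolicy is a placeholder for your own RegionSplitPolicy subclass
tableDesc.setValue(HTableDescriptor.SPLIT_POLICY,
    AwesomeSplitPolicy.class.getName());
// add column families etc.
admin.createTable(tableDesc);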

OR, we could make this configurable by allowing the
'hbase.regionserver.region.split.policy' property to be set within
gora.properties. There are a few ways we could prototype this.
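
One possible prototype of the configurable route, as a sketch only:
SplitPolicySetting is a hypothetical helper (nothing like it exists in Gora
today), and reusing the HBase property name as the gora.properties key is an
assumption on my part.

```java
import java.util.Optional;
import java.util.Properties;

/** Hypothetical helper, not part of Gora. */
final class SplitPolicySetting {

    // Assumption: reuse the HBase property name as the gora.properties key.
    static final String KEY = "hbase.regionserver.region.split.policy";

    /** Returns the trimmed policy class name, or empty if the key is absent or blank. */
    static Optional<String> policyClassName(Properties goraProps) {
        String value = goraProps.getProperty(KEY);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }
}
```

Within createSchema this could then be applied along the lines of
policyClassName(props).ifPresent(p ->
tableDesc.setValue(HTableDescriptor.SPLIT_POLICY, p)), so that HBase's
default policy still applies whenever the property is not set.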

Finally, regarding GC, I am not entirely sure right now. I don't know too
much about HBase optimization but just like any distributed system you
could tinker with GC values until you land at something which works. The
above however hopefully gets you started in the right direction.

hth
Lewis

[0]
https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L182
[1]
https://hbase.apache.org/1.2/apidocs/org/apache/hadoop/hbase/client/Admin.html
[2]
https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L180
