Re: Uneven HBase region sizes WAS Re: Nodemanager crashing repeatedly

2018-09-22 Thread Gajanan Watkar
Hi Lewis,
Thanks for such a detailed reply.
The problem I was facing had nothing to do with Nutch 2.x directly. It was
happening because of security gaps in my Hadoop cluster setup.

I am using HBase 1.2.3. As far as the uneven HBase region sizes are concerned,
I won't be able to share any logging as I have already merged the smaller
regions manually using the merge_region HBase shell command.
From version 0.94 onwards HBase uses *IncreasingToUpperBoundRegionSplitPolicy*
by default, which creates a lot of regions initially (and is actually good for
large setups). So after merging the regions I changed the split policy to
*ConstantSizeRegionSplitPolicy* and increased *hbase.hregion.max.filesize*
from the default 10 GB to 20 GB, as I have limited memory. Moreover, I noticed
that the uneven region sizes were also a result of my Nutch generate
configuration: I was getting a lot of URLs from the same host, so I changed
*generate.max.count* to 100 per host to get a proper mix, and owing to this
diversity in the fetch list my regions are filling in fair proportion as of
now.
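
For reference, a minimal sketch of making those two table changes through the
HBase 1.2 Java Admin API is below. The table name "webpage" is an assumption
(the Nutch 2.x default; adjust it if you use a crawl ID prefix):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class WebpageSplitTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      TableName table = TableName.valueOf("webpage"); // assumed table name
      HTableDescriptor desc = admin.getTableDescriptor(table);
      // Pin the split policy for this table instead of relying on the
      // cluster-wide default (IncreasingToUpperBoundRegionSplitPolicy).
      desc.setValue(HTableDescriptor.SPLIT_POLICY,
          "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");
      // Raise the per-region maximum store file size to 20 GB.
      desc.setMaxFileSize(20L * 1024 * 1024 * 1024);
      admin.modifyTable(table, desc);
    }
  }
}

The same changes can also be made with 'alter' in the HBase shell; a
table-level attribute like this takes precedence over the cluster-wide
hbase.hregion.max.filesize value in hbase-site.xml.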

In the context of HBase GC configuration I am still experimenting. Right now I
am using G1GC with *-XX:MaxGCPauseMillis=50*, which seems to be working fine.
Anyway, thanks once again for giving such pinpointed directions.

-Gajanan

Uneven HBase region sizes WAS Re: Nodemanager crashing repeatedly

2018-09-19 Thread lewis john mcgibbney
Hi Gajanan,
CC dev@gora, this is something we may wish to implement within HBase.
If anything I've provided below is incorrect, then please correct the
record.
BTW, I found the following article, written by Enis, to be extremely useful:
https://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/

On Wed, Sep 19, 2018 at 3:55 AM  wrote:

> From: Gajanan Watkar 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 19 Sep 2018 16:24:52 +0530
> Subject: Re: Nodemanager crashing repeatedly
> Hi Lewis,
> It appears that my setup was infected. After studying the ResourceManager
> logs closely I found that a lot of jobs were getting submitted to my cluster
> as user "dr.who". Moreover, my crontab was listing 2 wget cron jobs I never
> configured (I suspect a cryptocurrency miner) and one Java app running
> from /var/tmp/java. I configured the firewall, blocked port 8088, purged cron
> (as it was coming back with every re-install) and removed the Java app from
> /var/tmp/java. That seems to have stabilized my setup. For now it is working
> fine. No more unexpected NodeManager exits. I also applied the patch for
> MalformedURLException.
>

Good to hear that you were able to debug this. From the description you
provided I wondered whether it had anything to do with Nutch 2.x, mainly
because I've never experienced anything like this in the past.


>
> I am getting uneven region sizes. Can you advise me on pre-splitting the
> webpage table, i.e. the split points and split policy to use, and on an
> optimum GC setup for the regionserver for efficient Nutch crawling?
>
>
Can you provide the version of HBase you are using? Assuming that you are
using the Nutch 2.x branch from Git, you should be using 1.2.6.
Can you also provide the logging from HBase which indicates uneven region
sizes?

From what I understand (and I am no HBase expert), when Gora first creates
the HBase table, by default only one region is allocated for the table.
This means that initially all requests will go to a single region server,
regardless of the number of region servers in your HBase deployment. A
knock-on effect of this is that the initial phases of loading data into the
empty Webpage table cannot utilize the whole capacity of the HBase cluster;
however, I don't think this is by any means your issue.

The issue at hand is concerned with supplying split points at table creation
time, which would hopefully resolve the uneven region sizes. The comment I
made above, regarding Gora allocating only one region for the table, is
correct: take a look at [0] and you will see that we do not pass additional
parameters to the call to Admin.createTable which would explicitly specify,
for example, the split points. Examples of additional parameters which could
be used when creating our table are below; these can also be seen at [1].

void createTable(HTableDescriptor desc)
Creates a new table.

void createTable(HTableDescriptor desc, byte[][] splitKeys)
Creates a new table with an initial set of empty regions defined by the
specified split keys.

void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int
numRegions)
Creates a new table with the specified number of regions.

void createTableAsync(HTableDescriptor desc, byte[][] splitKeys)
Creates a new table but does not block and wait for it to come online.
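
To make this concrete, below is a hypothetical sketch of the second variant,
pre-splitting the Webpage table with explicit split keys. The table name,
column family and split keys here are illustrative assumptions; since Nutch
2.x row keys are reversed URLs (e.g. "com.example.www:http/"), real split
points should follow the reversed-host distribution of your crawl:

import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitWebpage {
  // Assumes an open Admin handle, as in HBaseStore#createSchema [0].
  static void createPreSplitTable(Admin admin) throws IOException {
    HTableDescriptor desc =
        new HTableDescriptor(TableName.valueOf("webpage"));
    desc.addFamily(new HColumnDescriptor("f")); // add remaining families
    // Placeholder split points over the reversed-host prefix of the row key.
    byte[][] splitKeys = {
        Bytes.toBytes("com."),
        Bytes.toBytes("net."),
        Bytes.toBytes("org.")
    };
    admin.createTable(desc, splitKeys);
  }
}

The createTable(desc, startKey, endKey, numRegions) variant instead computes
evenly spaced split points between two boundary keys, which only balances
well when keys are uniformly distributed (e.g. hashed).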

One other thing you asked about was the split policy... again, we do not
currently specify an explicit split policy; instead we rely on the auto
splitting capability made available by HBase (whose default since 0.94 is
IncreasingToUpperBoundRegionSplitPolicy). However, if we wanted to set an
explicit split policy, we could do so by adding the code below at the
following line [2] within Gora's HBaseStore#createSchema method.

HTableDescriptor tableDesc = new HTableDescriptor("example-table");
// add column families etc., then set a custom split policy for the table
tableDesc.setValue(HTableDescriptor.SPLIT_POLICY,
    AwesomeSplitPolicy.class.getName());
admin.createTable(tableDesc);

Or, we could make this configurable by exposing the
'hbase.regionserver.region.split.policy' property within gora.properties, as
in the sketch below. There are a few ways we could prototype this.
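
A hypothetical sketch of that configurable variant, continuing the snippet
above (the property lookup and the 'properties' handle are assumptions, not
existing Gora API):

// Inside a method like HBaseStore#createSchema, assuming 'properties' holds
// the contents of gora.properties and 'tableDesc' is the table descriptor.
String splitPolicy =
    properties.getProperty("hbase.regionserver.region.split.policy");
if (splitPolicy != null) {
  // Apply the user-supplied policy class name as a table-level attribute.
  tableDesc.setValue(HTableDescriptor.SPLIT_POLICY, splitPolicy);
}
admin.createTable(tableDesc);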

Finally, regarding GC, I am not entirely sure right now. I don't know too
much about HBase optimization, but just like any distributed system you can
tinker with GC values until you land at something which works. The above
hopefully gets you started in the right direction.

hth
Lewis

[0]
https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L182
[1]
https://hbase.apache.org/1.2/apidocs/org/apache/hadoop/hbase/client/Admin.html
[2]
https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L180