Re: Uneven HBase region sizes WAS Re: Nodemanager crashing repeatedly

2018-09-22 Thread Gajanan Watkar
Hi Lewis,
Thanks for such a detailed reply.
The problem I was facing had nothing to do with Nutch 2.x directly. It was
happening because of security gaps in my Hadoop cluster setup.

I am using HBase-1.2.3. As far as uneven HBase region sizes are concerned, I
won't be able to share any logging, as I have already merged the smaller
regions manually using the merge_regions HBase shell command.
HBase, by default, uses *IncreasingToUpperBoundRegionSplitPolicy* from
version 0.94 onwards, which creates a lot of regions initially (and is
actually good for large setups). So after merging regions I changed the
split policy to *ConstantSizeRegionSplitPolicy* and increased
*hbase.hregion.max.filesize* from the default 10 GB to 20 GB, as I have
limited memory. Moreover, I noticed that the uneven region sizes were also a
result of my Nutch generate configuration: I was getting a lot of URLs from
the same host. So I changed *generate.max.count* to 100 per host to get a
proper mix, and owing to this diversity in the fetch list my regions are
filling in fair proportion as of now.
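
For reference, the same per-table settings can also be applied through the
Java Admin API rather than the shell. A minimal sketch, assuming HBase 1.2
and that the Nutch table is named "webpage" (adjust for any table prefix):

import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

// Given an open Admin instance (Connection.getAdmin()): switch the
// webpage table to ConstantSizeRegionSplitPolicy and raise the region
// split threshold from the 10 GB default to 20 GB.
TableName webpage = TableName.valueOf("webpage");
HTableDescriptor desc = admin.getTableDescriptor(webpage);
desc.setValue(HTableDescriptor.SPLIT_POLICY,
    "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");
desc.setMaxFileSize(20L * 1024 * 1024 * 1024); // hbase.hregion.max.filesize, per table
admin.modifyTable(webpage, desc);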

As for HBase GC configuration, I am still experimenting. Right now I am
using G1GC with *-XX:MaxGCPauseMillis=50*, which seems to be working fine.
Anyway, thanks once again for giving such pinpointed directions.

-Gajanan

On Wed, Sep 19, 2018 at 11:06 PM lewis john mcgibbney 
wrote:

> Hi Gajanan,
> CC dev@gora, this is something we may wish to implement within HBase.
> If anything I've provided below is incorrect, then please correct the
> record.
> BTW, I found the following article, written by Enis, to be extremely useful:
> https://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
>
> On Wed, Sep 19, 2018 at 3:55 AM  wrote:
>
> > From: Gajanan Watkar 
> > To: user@nutch.apache.org
> > Cc:
> > Bcc:
> > Date: Wed, 19 Sep 2018 16:24:52 +0530
> > Subject: Re: Nodemanager crashing repeatedly
> > Hi Lewis,
> > It appears that my setup was infected. After studying the
> > ResourceManager logs closely, I found that a lot of jobs were being
> > submitted to my cluster as user "dr who". Moreover, my crontab was
> > listing 2 wget cron jobs I never configured (I suspect a cryptocurrency
> > miner) and one Java app running from /var/tmp/java. I configured a
> > firewall, blocked port 8088, purged cron (as it was coming back with
> > every re-install) and removed the Java app from /var/tmp/java. That
> > seems to have stabilized my setup. For now it is working fine. No more
> > unexpected NodeManager exits. Also applied the patch for
> > MalformedURLException.
> >
>
> Good to hear that you were able to debug this. From the description you
> provided I wondered if it had anything to do with Nutch 2.x, namely because
> I've never experienced anything like this in the past.
>
>
> >
> > I am getting uneven region sizes. Can you advise me on pre-splitting the
> > webpage table, i.e. the split points to be used and the split policy,
> > and on an optimal GC setup for the RegionServer for efficient Nutch
> > crawling?
> >
> >
> Can you provide the version of HBase you are using? Assuming that you are
> using the Nutch 2.x branch from Git, you should be using HBase 1.2.6.
> Can you also provide the logging from HBase which indicates uneven region
> sizes?
>
> From what I understand (and I am no HBase expert), when Gora first creates
> the HBase table, by default only one region is allocated for the table.
> This means that initially all requests will go to a single region server,
> regardless of the number of region servers in your HBase deployment. A
> knock-on effect of this is that the initial phases of loading data into the
> empty Webpage table cannot utilize the whole capacity of the HBase cluster;
> however, I don't think this is your issue by any means.
>
> The issue at hand is concerned with supplying split points at table
> creation time, which would hopefully resolve the uneven region sizes. The
> comment I made above, regarding Gora creating only one region allocation
> for the table, is correct: take a look at [0] and you will see that we do
> not use additional parameters for the call to Admin.createTable which
> would explicitly specify, for example, the split points. Examples of
> additional parameters which could be used when creating our table are
> below; these can also be seen at [1].
>
> void createTable(HTableDescriptor desc)
> Creates a new table.
>
> void createTable(HTableDescriptor desc, byte[][] splitKeys)
> Creates a new table with an initial set of empty regions defined by the
> specified split keys.
>
> void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int
> numRegions)
> Creates a new table with the specified number of regions.
>
> void createTableAsync(HTableDescriptor desc, byte[][] splitKeys)
> Creates a new table but does not block and wait for it to come online.

Uneven HBase region sizes WAS Re: Nodemanager crashing repeatedly

2018-09-19 Thread lewis john mcgibbney
Hi Gajanan,
CC dev@gora, this is something we may wish to implement within HBase.
If anything I've provided below is incorrect, then please correct the
record.
BTW, I found the following article, written by Enis, to be extremely useful:
https://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/

On Wed, Sep 19, 2018 at 3:55 AM  wrote:

> From: Gajanan Watkar 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 19 Sep 2018 16:24:52 +0530
> Subject: Re: Nodemanager crashing repeatedly
> Hi Lewis,
> It appears that my setup was infected. After studying the ResourceManager
> logs closely, I found that a lot of jobs were being submitted to my cluster
> as user "dr who". Moreover, my crontab was listing 2 wget cron jobs I never
> configured (I suspect a cryptocurrency miner) and one Java app running
> from /var/tmp/java. I configured a firewall, blocked port 8088, purged cron
> (as it was coming back with every re-install) and removed the Java app from
> /var/tmp/java. That seems to have stabilized my setup. For now it is
> working fine. No more unexpected NodeManager exits. Also applied the patch
> for MalformedURLException.
>

Good to hear that you were able to debug this. From the description you
provided I wondered if it had anything to do with Nutch 2.x, namely because
I've never experienced anything like this in the past.


>
> I am getting uneven region sizes. Can you advise me on pre-splitting the
> webpage table, i.e. the split points to be used and the split policy, and
> on an optimal GC setup for the RegionServer for efficient Nutch crawling?
>
>
Can you provide the version of HBase you are using? Assuming that you are
using the Nutch 2.x branch from Git, you should be using HBase 1.2.6.
Can you also provide the logging from HBase which indicates uneven region
sizes?

From what I understand (and I am no HBase expert), when Gora first creates
the HBase table, by default only one region is allocated for the table.
This means that initially all requests will go to a single region server,
regardless of the number of region servers in your HBase deployment. A
knock-on effect of this is that the initial phases of loading data into the
empty Webpage table cannot utilize the whole capacity of the HBase cluster;
however, I don't think this is your issue by any means.

The issue at hand is concerned with supplying split points at table
creation time, which would hopefully resolve the uneven region sizes. The
comment I made above, regarding Gora creating only one region allocation
for the table, is correct: take a look at [0] and you will see that we do
not use additional parameters for the call to Admin.createTable which would
explicitly specify, for example, the split points. Examples of additional
parameters which could be used when creating our table are below; these can
also be seen at [1].

void createTable(HTableDescriptor desc)
Creates a new table.

void createTable(HTableDescriptor desc, byte[][] splitKeys)
Creates a new table with an initial set of empty regions defined by the
specified split keys.

void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int
numRegions)
Creates a new table with the specified number of regions.

void createTableAsync(HTableDescriptor desc, byte[][] splitKeys)
Creates a new table but does not block and wait for it to come online.
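
As a rough illustration of the splitKeys variant, here is a sketch of
pre-splitting the table at creation time. It assumes the HBase 1.2 Admin
API, a table named "webpage" with a single column family "f" (just for the
sketch), and purely hypothetical split points; since Nutch 2.x row keys are
reversed URLs, prefixes of common TLDs are one plausible choice:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: create the table with explicit split points so the initial
// load is spread across region servers instead of hitting one region.
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("webpage"));
desc.addFamily(new HColumnDescriptor("f"));
// Hypothetical split points: Nutch 2.x keys are reversed URLs
// (e.g. "org.apache.nutch:http/"), so TLD prefixes partition the key space.
byte[][] splitKeys = new byte[][] {
    Bytes.toBytes("com"),
    Bytes.toBytes("net"),
    Bytes.toBytes("org")
};
admin.createTable(desc, splitKeys);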

One other issue you asked about was the split policy... again, we do not
currently specify an explicit split policy; instead we utilize the
auto-splitting capability (which I believe is ConstantSizeRegionSplitPolicy)
made available by HBase. However, if we wanted to implement an explicit
split policy, we could do so by implementing the code below at the
following line [2] within Gora's HBaseStore#createSchema method.

HTableDescriptor tableDesc =
    new HTableDescriptor(TableName.valueOf("example-table"));
// AwesomeSplitPolicy stands in for a custom RegionSplitPolicy subclass.
tableDesc.setValue(HTableDescriptor.SPLIT_POLICY,
    AwesomeSplitPolicy.class.getName());
// ... add column families etc.
admin.createTable(tableDesc);

Alternatively, we could make this configurable by exposing the
'hbase.regionserver.region.split.policy' property within gora.properties.
There are a few ways we could prototype this.
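
A rough sketch of that option, with purely hypothetical plumbing (the
property lookup is an assumption, not Gora's existing API):

// Hypothetical: pick up an optional split policy from gora.properties
// and apply it inside HBaseStore#createSchema before creating the table.
String splitPolicy =
    properties.getProperty("hbase.regionserver.region.split.policy");
if (splitPolicy != null) {
  tableDesc.setValue(HTableDescriptor.SPLIT_POLICY, splitPolicy);
}
admin.createTable(tableDesc);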

Finally, regarding GC, I am not entirely sure right now. I don't know too
much about HBase optimization, but just like any distributed system you
could tinker with GC values until you land at something which works. The
above hopefully gets you started in the right direction.

hth
Lewis

[0]
https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L182
[1]
https://hbase.apache.org/1.2/apidocs/org/apache/hadoop/hbase/client/Admin.html
[2]
https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L180


Re: Nodemanager crashing repeatedly

2018-09-08 Thread Gajanan Watkar
Thanks Lewis,
I am running on Debian Stretch.
It's a month-old checkout that I am using.
The NodeManager crashes during different phases of the crawl, i.e. sometimes
during generate, sometimes during fetch, sometimes during parse, and
sometimes during updatedb, index and dedupe.
On some occasions it crashes immediately after completing the respective
crawl phase.
Note: It appears that my NodeManager, all the other Hadoop daemons and HBase
were using /tmp for local and temporary storage. Even though my /tmp had
enough space, I configured temp and local directories for everything,
including map reduce tasks, on my /home partition. That seems to have had a
stabilizing effect. Needs more testing. Will report if it stabilizes.

-Gajanan




On Thu, Sep 6, 2018 at 10:31 PM lewis john mcgibbney 
wrote:

> Hi Gajanan,
> Which OS are you running this on?
> I would also suggest that if you want to use the 2.x codebase, you should
> use the most recent from SCM, e.g. check out master and change to the 2.x
> branch.
> Finally, for now at least, you didn't mention the phase at which the crawl
> is failing. Can you provide this?
>
> On Thu, Sep 6, 2018 at 8:58 AM  wrote:
>
> > From: Gajanan Watkar 
> > To: user@nutch.apache.org
> > Cc:
> > Bcc:
> > Date: Wed, 05 Sep 2018 11:27:21 +0530
> > Subject: Nodemanager crashing repeatedly
> > I am running Nutch-2.3.1 over Hadoop-2.5.2 and HBase-1.2.3, with
> > integration to Solr-6.5.1. I have crawled over 10 million pages. But
> > while doing all this I have continuously faced two problems:
> >
> > 1. My NodeManager is crashing repeatedly during different phases of the
> > crawl. It crashes my Linux session and forces a logout, with the
> > NodeManager killed. I log in again, restart the NodeManager, and the
> > same failed crawl phase runs to success. [The NodeManager log has
> > nothing to report.]
> >
> > 2. I am running all my crawl phases one by one, without the crawl
> > script, as with the crawl script my jobs were most of the time exiting
> > with a "WaitForjobCompletion" error at different stages of the crawl.
> > So I decided to go ahead with the one-by-one method, which prevented
> > "WaitForjobCompletion" from occurring.
> >
> > Any help will be highly appreciated. New to the mailing list, new to
> > Nutch.
> >
> > -Gajanan
> >
> >
>


Re: Nodemanager crashing repeatedly

2018-09-06 Thread lewis john mcgibbney
Hi Gajanan,
Which OS are you running this on?
I would also suggest that if you want to use the 2.x codebase, you should
use the most recent from SCM, e.g. check out master and change to the 2.x
branch.
Finally, for now at least, you didn't mention the phase at which the crawl
is failing. Can you provide this?

On Thu, Sep 6, 2018 at 8:58 AM  wrote:

> From: Gajanan Watkar 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 05 Sep 2018 11:27:21 +0530
> Subject: Nodemanager crashing repeatedly
> I am running Nutch-2.3.1 over Hadoop-2.5.2 and HBase-1.2.3, with
> integration to Solr-6.5.1. I have crawled over 10 million pages. But
> while doing all this I have continuously faced two problems:
>
> 1. My NodeManager is crashing repeatedly during different phases of the
> crawl. It crashes my Linux session and forces a logout, with the
> NodeManager killed. I log in again, restart the NodeManager, and the same
> failed crawl phase runs to success. [The NodeManager log has nothing to
> report.]
>
> 2. I am running all my crawl phases one by one, without the crawl script,
> as with the crawl script my jobs were most of the time exiting with a
> "WaitForjobCompletion" error at different stages of the crawl. So I
> decided to go ahead with the one-by-one method, which prevented
> "WaitForjobCompletion" from occurring.
>
> Any help will be highly appreciated. New to the mailing list, new to Nutch.
>
> -Gajanan
>
>


Nodemanager crashing repeatedly

2018-09-04 Thread Gajanan Watkar
I am running Nutch-2.3.1 over Hadoop-2.5.2 and HBase-1.2.3, with
integration to Solr-6.5.1. I have crawled over 10 million pages. But
while doing all this I have continuously faced two problems:

1. My NodeManager is crashing repeatedly during different phases of the
crawl. It crashes my Linux session and forces a logout, with the
NodeManager killed. I log in again, restart the NodeManager, and the same
failed crawl phase runs to success. [The NodeManager log has nothing to
report.]

2. I am running all my crawl phases one by one, without the crawl script,
as with the crawl script my jobs were most of the time exiting with a
"WaitForjobCompletion" error at different stages of the crawl. So I
decided to go ahead with the one-by-one method, which prevented
"WaitForjobCompletion" from occurring.

Any help will be highly appreciated. New to the mailing list, new to Nutch.

-Gajanan