Hi Dennis,

On Thu, Jun 15, 2017 at 1:41 AM, <[email protected]> wrote:

>
> From: Dennis A <[email protected]>
> To: [email protected]
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 20:45:35 +0200
> Subject: Re: Optimize Nutch Indexing Speed
> Hi Lewis,
> thank you for your suggestions!
>

No problem at all.

...

> My current
> investigations point me to the fact that the temporary folder has only
> about 4.5GB of disk space remaining, which might be the reason for a
> collapse, since I managed to estimate the size on at least 2.5-3GB for the
> smaller configuration.
> I plan to move this to another folder where more disk space is remaining.
>

Please also consider the 'hadoop.tmp.dir' configuration parameter... this
should be set to a path with plenty of disk space. Hadoop's intermediate
data structures reside locally on disk, so you need to accommodate this.
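As a sketch, you could add something like the following to your
conf/nutch-site.xml (the path below is purely a placeholder; point it at
whichever volume actually has the space):

```xml
<!-- hadoop.tmp.dir: where Hadoop spills its intermediate data.
     The value below is a hypothetical example path. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>
```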


>
> What I could sadly not find, is the option to increase the number of
> mappers/reducers for the tasks. I deducted (seemingly correct) that the
> actual hadoop-site.xml and mapred-site.xml configurations can (or more:
> have) be done in the nutch-site.xml file?
>

Yes they can... but please make sure they are not also overridden within the
nutch script
https://github.com/apache/nutch/blob/master/src/bin/crawl#L116-L118
Please investigate and adapt it for your particular environment.
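As a rough sketch, assuming a Hadoop 2.x-era Nutch, the task counts would go
into nutch-site.xml along these lines. The values are illustrative only, and
note that in local (non-distributed) mode Hadoop may ignore them:

```xml
<!-- Illustrative values; tune to your number of cores.
     In local mode these may have no effect. -->
<property>
  <name>mapreduce.job.maps</name>
  <value>4</value>
</property>
<property>
  <name>mapreduce.job.reduces</name>
  <value>4</value>
</property>
```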


> My problem now is: For the fetching and generation step, the machine seems
> to utilize many cores in parallel, and htop does show me multiple threads,
> probably the Hadoop mappers.
>

Please see above, please also look at the following

https://github.com/apache/nutch/blob/master/src/bin/crawl#L120-L122

as well as anything else 'fetch'-related within that script.
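For the fetch phase specifically, Nutch has its own thread settings in
nutch-default.xml which you can override in nutch-site.xml, for example
(values here are just a sketch; the crawl script may also pass its own values
on the command line, which is why checking it matters):

```xml
<!-- Fetcher parallelism; sketch values, adjust for your
     bandwidth and politeness requirements. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
</property>
```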


>
> Yet, for the parsing step (which is now the longest part with around 1h), I
> only notice one major thread. Since I do already notice multiple threads
> for the former, I am unsure whether this can be parallelized in the local
> execution mode, or whether this is only possible for
> pseudo-distributed/distributed mode.
> Do the linked properties possibly resolve this problem, too? Or would this
> only further increase the number of executors for the fetch/parse steps?
>

Please investigate the nutch script... it will give you a good deal of
insight into the crawl cycle as well as its limitations. Nutch is a
batch-oriented system, so it has limitations. Once you understand them, you
can mitigate them to a certain extent, or leverage them to your benefit.
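On the single parse thread: in local mode Hadoop runs jobs through the
LocalJobRunner, which historically executed one task at a time. Newer Hadoop
versions can run local map tasks in parallel via the property below; I am not
certain which Hadoop version ships with your Nutch, so please verify this
applies before relying on it (the value is illustrative):

```xml
<!-- Only honoured by newer LocalJobRunner versions;
     check your Hadoop release before depending on it. -->
<property>
  <name>mapreduce.local.map.tasks.maximum</name>
  <value>4</value>
</property>
```

In pseudo-distributed or fully distributed mode this is not an issue, since
parallelism comes from the cluster's task slots instead.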


>
> Sorry that I ask, but I do not yet have so much experience with crawling at
> all :/
>
>
No problem at all.
Lewis
