Hi Dennis,

On Thu, Jun 15, 2017 at 1:41 AM, <[email protected]> wrote:
> From: Dennis A <[email protected]>
> To: [email protected]
> Date: Wed, 14 Jun 2017 20:45:35 +0200
> Subject: Re: Optimize Nutch Indexing Speed
>
> Hi Lewis,
> thank you for your suggestions!

No problems at all.

> ...
> My current investigations point me to the fact that the temporary
> folder has only about 4.5GB of disk space remaining, which might be the
> reason for a collapse, since I managed to estimate the size at at least
> 2.5-3GB for the smaller configuration.
> I plan to move this to another folder where more disk space is
> remaining.

Please also consider the 'hadoop.tmp.dir' configuration parameter... this
should be set to a path where there is plenty of disk space. Hadoop's
intermediate data structures reside locally on disk, so you need to
accommodate this. (See the first sketch in the P.S. below.)

> What I could sadly not find is the option to increase the number of
> mappers/reducers for the tasks. I deduced (seemingly correctly) that
> the actual hadoop-site.xml and mapred-site.xml configurations can (or
> rather: must) be done in the nutch-site.xml file?

Yes they can... but please make sure they are not also overridden within
the nutch script:
https://github.com/apache/nutch/blob/master/src/bin/crawl#L116-L118
Please investigate and adapt for your particular environment. (A sketch
is in the P.S. below.)

> My problem now is: For the fetching and generation step, the machine
> seems to utilize many cores in parallel, and htop does show me multiple
> threads, probably the Hadoop mappers.

Please see above, and please also look at the following
https://github.com/apache/nutch/blob/master/src/bin/crawl#L120-L122
as well as anything else 'fetch'-related within that script.

> Yet, for the parsing step (which is now the longest part, at around
> 1h), I only notice one major thread. Since I do already notice multiple
> threads for the former, I am unsure whether this can be parallelized in
> local execution mode, or whether this is only possible in
> pseudo-distributed/distributed mode.
> Do the linked properties possibly resolve this problem, too? Or would
> this only further increase the number of executors for the fetch/parse
> steps?

Please investigate the nutch script... it will give you a body of insight
into the crawl cycle as well as its limitations. Nutch is a batch-oriented
system... it has limitations. Once you understand them, you can mitigate
them to a certain extent or leverage them to your benefit. (The last
sketch in the P.S. touches on parse parallelism in local mode.)

> Sorry that I ask, but I do not yet have much experience with crawling
> at all :/

No problem at all.

Lewis
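
P.S. A few untested nutch-site.xml sketches for the points above; treat
all values and paths as placeholders for your environment, not as
recommendations.

First, pointing 'hadoop.tmp.dir' at a disk with room to spare (the path
here is an example):

  <property>
    <name>hadoop.tmp.dir</name>
    <!-- Example path only: choose a local disk with several GB free,
         since Hadoop spills intermediate map/reduce data here. -->
    <value>/data/hadoop-tmp</value>
  </property>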
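
Second, default mapper/reducer counts. These are standard Hadoop 2.x
properties and the values are examples; remember that any
-D mapreduce.job.reduces=... the crawl script passes on the command line
will override whatever you set here:

  <property>
    <!-- Hint for the number of map tasks; the number of input splits
         can still take precedence. -->
    <name>mapreduce.job.maps</name>
    <value>4</value>
  </property>
  <property>
    <!-- Number of reduce tasks per job. -->
    <name>mapreduce.job.reduces</name>
    <value>4</value>
  </property>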
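
Third, fetcher parallelism. These are existing Nutch properties; the
values are examples, and the crawl script may also pass its own thread
count to the fetch job, so cross-check the 'fetch'-related lines linked
above:

  <property>
    <!-- Total number of fetcher threads per task. -->
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <property>
    <!-- Threads allowed per host/queue; raising this increases the
         load you place on each target server. -->
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>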
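
Finally, on the single-threaded parse step: in local mode Hadoop's
LocalJobRunner executes one map task at a time by default. Assuming you
are on a Hadoop 2.x-era LocalJobRunner, the property below should let it
run several map tasks in parallel; otherwise pseudo-distributed mode is
the more general answer:

  <property>
    <!-- Max map tasks the LocalJobRunner runs concurrently
         (Hadoop 2.x+; the value is an example). -->
    <name>mapreduce.local.map.tasks.maximum</name>
    <value>4</value>
  </property>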

