Hi Dennis,

On Sun, Jun 11, 2017 at 2:45 AM, <[email protected]> wrote:
> From: Dennis A <[email protected]>
> To: [email protected]
> Date: Fri, 9 Jun 2017 09:59:05 +0200
> Subject: Optimize Nutch Indexing Speed
>
> Hello,
>
> I have recently configured my Nutch crawler to index a whole domain,
> with an estimated number of 1.5M-3M documents.
>
> For this purpose, I wanted to use Nutch 1.13 and Solr 4.10.4 to build a
> search index over these documents. The compute server is a 16-core Xeon
> server with 128GB RAM.
>
> While everything has worked quite well for subdomain crawls, I noticed
> some severe drawbacks once I put it on the whole domain:
>
> - The Solr indexing failed without any obvious reason if I did not
>   lower the -topN value to 40k instead of 50k documents.

Did this possibly fail on the SolrClean/Clean task instead of the indexing
task? If so, then you've encountered
https://issues.apache.org/jira/browse/NUTCH-2269. I would suggest you
upgrade to the master branch to work around this, or else disable the
clean step for the time being (see the first sketch at the end of this
mail).

> - The CrawlDb and LinkDb merging steps take an unreasonably long amount
>   of time after only 150k indexed documents (~7 crawl iterations). For
>   the latest step, it took over 8 hours.

This is way too long. Have you tried profiling the tasks? How are you
running Nutch: local, pseudo-distributed or distributed? I would look more
closely into your logs with DEBUG on to see what is going on, and I would
also profile the tasks to see exactly where they are struggling (see the
logging sketch at the end of this mail). Are you filtering and
normalizing? If so, do you have some complex rules in there which may be
degrading performance?

> I noticed that it does seem to only utilize one core on the machine,
> which seems weird to me. I also already increased the Java heap size to
> 5GB (from the default 1GB), but did not notice any immediate
> improvement.

Please check the following:
https://stackoverflow.com/questions/8357296/full-utilization-of-all-cores-in-hadoop-pseudo-distributed-mode#8359416
See if any of this applies; there is also a hedged configuration sketch at
the end of this mail.

> My questions would be:
>
> - As an alternative to the server, I have access to a cluster of 4/5
>   nodes with 2 cores and 10GB available for Hadoop. Would I benefit
>   from a distributed run at all? It doesn't seem to me that the
>   fetching/generating process is the bottleneck, but rather the
>   (serial?) update of the database.

Generally speaking, yes, parallelizing the task will benefit you. However,
please consider the above responses before diving in with this. Also note
that it is possible to have overlapping crawls on the go, even on a single
machine.

> - Since crawling is not the issue, could I potentially benefit from
>   switching to Nutch 2.x?

There is no reason why Nutch 1.x is not able to scale to this task. Your
dataset is not overly large by any means. I would stick with what you have
got and make an attempt to optimize the configuration.

> - Is there any known reason that Solr might "reject" an indexing step,
>   or was it just some temporary error? I have honestly not tried it
>   again, since I have time constraints on the crawl and do not want to
>   have to start over again.

Understood. Please check whether the 'clean' task killed it off for you.
If so, then please remove it from your crawl process.

> - Is there any way to efficiently "skip" the update steps for most of
>   the time, and only perform them once a certain number of pages has
>   been acquired? Is it even normal that it takes this long, or may I
>   have some configuration errors?

Yes, absolutely. It is not strictly necessary to run the updates after
every crawl cycle (see the batching sketch at the end of this mail).
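Regarding the clean workaround: the quickest way to take 'clean' out of
the loop is to drive the individual steps yourself instead of via the
bin/crawl script, and simply not call "bin/nutch clean". A minimal sketch
of one round, assuming a crawl/ directory layout and a local Solr core
named "nutch" (both placeholders for your own paths):

  # bin/nutch reads NUTCH_HEAPSIZE (in MB); this mirrors your 5GB setting.
  export NUTCH_HEAPSIZE=5000

  bin/nutch generate crawl/crawldb crawl/segments -topN 40000
  SEGMENT=$(ls -d crawl/segments/* | tail -1)   # the segment just generated

  bin/nutch fetch "$SEGMENT" -threads 50
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments

  # Index the round to Solr; note there is deliberately no "bin/nutch
  # clean" afterwards. The -D override must precede the positional args.
  bin/nutch index -D solr.server.url=http://localhost:8983/solr/nutch \
    crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"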
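For the DEBUG logging: in 1.x this is controlled by conf/log4j.properties,
which already defines per-job loggers and the cmdstdout appender. A sketch
that raises the two jobs you mention to DEBUG (later entries in a
properties file override earlier ones):

  printf '%s\n' \
    'log4j.logger.org.apache.nutch.crawl.CrawlDb=DEBUG,cmdstdout' \
    'log4j.logger.org.apache.nutch.crawl.LinkDb=DEBUG,cmdstdout' \
    >> conf/log4j.properties

  # Re-run the slow step, then inspect logs/hadoop.log to see where the
  # time is going.
  bin/nutch updatedb crawl/crawldb -dir crawl/segments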
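Also on the filtering/normalizing point: if the update steps run with
-filter/-normalize, every URL in the CrawlDb goes through the rules in
conf/regex-urlfilter.txt on each round, so one pathological regular
expression can dominate the run time. You can sanity-check your chain with
the filter checker; the exact flags vary a little across 1.x versions, so
check the usage output first. A sketch, where urls.txt is a placeholder
sample of URLs from your crawl:

  # Feed a sample through the combined filter chain and time it.
  time ( head -1000 urls.txt | bin/nutch filterchecker -allCombined )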
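On the single-core observation: if you are running in local mode (the
default when you simply invoke bin/nutch from the runtime), Hadoop's
LocalJobRunner does most of the work in a single thread, which would match
what you are seeing. A hedged sketch; property names differ between Hadoop
versions and between local and pseudo-distributed mode, so treat these as
starting points alongside the Stack Overflow answer above:

  # Local runner (Hadoop 2.x): allow several map tasks in-process.
  bin/nutch updatedb -D mapreduce.local.map.tasks.maximum=8 \
    crawl/crawldb -dir crawl/segments

  # Pseudo-distributed: the older per-tasktracker limits from the linked
  # answer go into mapred-site.xml (then restart the daemons):
  #   mapred.tasktracker.map.tasks.maximum    = 8
  #   mapred.tasktracker.reduce.tasks.maximum = 8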
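On skipping the update steps: you cannot drop updatedb entirely, because
generate needs an up-to-date CrawlDb to build the next fetch list, but you
can batch it: generate several segments in one pass, fetch and parse them
all, then run a single updatedb/invertlinks over the lot. A sketch, again
assuming the crawl/ layout above and that crawl/segments holds only the
new segments (otherwise list them explicitly):

  # Ask generate for up to 5 segments in one pass.
  bin/nutch generate crawl/crawldb crawl/segments -topN 40000 \
    -maxNumSegments 5

  for SEGMENT in $(ls -d crawl/segments/* | tail -5); do
    bin/nutch fetch "$SEGMENT" -threads 50
    bin/nutch parse "$SEGMENT"
  done

  # One CrawlDb update and one LinkDb inversion for all five rounds.
  bin/nutch updatedb crawl/crawldb crawl/segments/*
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments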
I think spending some time on the issues above should resolve your
problems.

Lewis

