Hi Lewis,

thank you for your suggestions! I have already worked out a (partial) solution to my problem, which was mostly my own fault, I guess. Most of the time was actually spent in the "bin/nutch invertlinks" command (not in the updatedb as described above!). I did not investigate further why it took so long, since I only now understand what it does, and simply disabled this step in the bin/crawl script. In my defense, it was not clear to me from the Nutch Tutorial that this is more of an optional step, only needed if I want to investigate the inlinks/anchors.
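For reference, with that step removed, one crawl iteration boils down to roughly the following commands (just a sketch with tutorial-style paths; -topN and the thread count are the values from my runs, and the Solr URL is configured via solr.server.url in nutch-site.xml). The invertlinks call is simply left out and the indexer runs without -linkdb:

  # one crawl iteration without the invertlinks/linkdb step (sketch)
  bin/nutch generate crawl/crawldb crawl/segments -topN 40000
  SEGMENT=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $SEGMENT -threads 50
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  # index from the crawldb and segments only, no -linkdb argument
  bin/nutch index crawl/crawldb -dir crawl/segments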
To answer some of your questions, and possibly provide further insight for others: I am only using Nutch in local mode, with no additional Hadoop setup on the machine. The failure really happened in the indexing step itself, not in the clean task. My current investigation points to the temporary folder, which has only about 4.5GB of disk space left; this might be the reason for the collapse, since I estimated that even the smaller configuration already needs at least 2.5-3GB there. I plan to move this to another folder with more free disk space.

What I sadly could not find is the option to increase the number of mappers/reducers for the tasks. I deduced (seemingly correctly) that the usual hadoop-site.xml and mapred-site.xml settings can (or rather: must) be placed in the nutch-site.xml file? A rough sketch of what I have in mind follows below.

My problem now is: for the fetching and generation steps, the machine seems to utilize many cores in parallel, and htop shows me multiple threads, probably the Hadoop mappers. Yet for the parsing step (which is now the longest part at around 1 hour) I only notice one major thread. Since I already see multiple threads for the former, I am unsure whether parsing can be parallelized in local execution mode at all, or whether this is only possible in pseudo-distributed/distributed mode. Do the properties from your linked StackOverflow answer possibly resolve this problem, too? Or would they only further increase the number of executors for the fetch/parse steps?

Sorry for all the questions, but I do not yet have much experience with crawling at all :/
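Concretely, this is roughly what I am considering for nutch-site.xml (the temp path is just a placeholder for the bigger disk on my machine, and I have not yet verified that the local-mode map property is honored by the Hadoop version bundled with Nutch 1.13):

  <!-- move Hadoop's temporary/job-local data off the small partition -->
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- placeholder path on the larger disk -->
    <value>/data/hadoop-tmp</value>
  </property>

  <!-- fetching already looks parallel because the fetcher runs its own
       threads inside a single map task -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>

  <!-- supposedly lets the local job runner execute several map tasks at
       once (default 1), which is what I hope would speed up parsing;
       not verified yet -->
  <property>
    <name>mapreduce.local.map.tasks.maximum</name>
    <value>8</value>
  </property>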
Best,
Dennis

2017-06-14 17:36 GMT+02:00 lewis john mcgibbney <[email protected]>:

> Hi Dennis,
>
> On Sun, Jun 11, 2017 at 2:45 AM, <[email protected]> wrote:
>
> > From: Dennis A <[email protected]>
> > To: [email protected]
> > Cc:
> > Bcc:
> > Date: Fri, 9 Jun 2017 09:59:05 +0200
> > Subject: Optimize Nutch Indexing Speed
> >
> > Hello,
> >
> > I have recently configured my Nutch crawler to index a whole domain,
> > with an estimated number of 1.5M-3M documents.
> >
> > For this purpose, I wanted to use Nutch 1.13 and Solr 4.10.4 to build
> > a search index over these documents. The compute server is a 16 core
> > Xeon server with 128GB RAM.
> > While everything has worked quite well for subdomain crawls, I noticed
> > some severe drawbacks once I put it on the whole domain:
> > - The Solr indexing failed without any obvious reason if I did not
> > lower the -topN value to 40k instead of 50k documents.
>
> Did this possibly fail on the SolrClean/Clean task instead of the
> indexing task? If so, then you've encountered
> https://issues.apache.org/jira/browse/NUTCH-2269.
> I would suggest you possibly upgrade to the master branch to work around
> this, or else disable Clean for the time being.
>
> > - The CrawlDb and LinkDb merging steps take an unreasonably long
> > amount of time after only 150k indexed documents (~7 crawl
> > iterations). For the latest step, it took over 8 hours.
>
> This is way too long. Have you tried profiling the tasks? How are you
> running Nutch? Local, pseudo-distributed or distributed? I would look
> more closely into your logs with DEBUG on to see what is going on. I
> would also profile the task to see exactly where the tasks are
> struggling. Are you filtering and normalizing? If so, do you have some
> complex rules in there which may be decreasing performance?
>
> > I noticed that it does seem to only utilize one core on the machine,
> > which seems weird to me. I also already increased the Java Heap Size
> > to 5GB (from the default 1GB), but did not notice any imminent
> > improvements.
>
> Please check the following:
> https://stackoverflow.com/questions/8357296/full-utilization-of-all-cores-in-hadoop-pseudo-distributed-mode#8359416
> See if any of this applies.
>
> > My questions would be:
> > - As an alternative to the server, I have access to a cluster of 4/5
> > nodes with 2 cores and 10 GB available for Hadoop. Would I benefit
> > from a distributed run at all? It doesn't seem to me that the
> > fetching/generating process is the bottleneck, but rather the
> > (serial?) update of the database.
>
> Generally speaking, parallelizing the task will benefit you, yes. Please
> consider the above responses I've provided, however, before diving in
> with this. Also note, it is possible to have overlapping crawls on the
> go even on one machine.
>
> > - Since crawling is not the issue, could I potentially benefit from
> > switching to Nutch 2.x?
>
> There is no reason why Nutch 1.X is not able to scale to this task. Your
> dataset is not overly large by any means. I would stick with what you
> have got and make an attempt to optimize the configuration.
>
> > - Is there any known reason that Solr might "reject" an indexing step,
> > or was it just some temporary error? I have honestly not tried it
> > again, since I have temporal limitations regarding the crawl, and do
> > not want to have to start over again.
>
> Understood. Please check if the 'clean' task killed it off for you. If
> so, then please remove this from your crawl process.
>
> > - Is there any way to efficiently "skip" the update steps for most of
> > the time, and only perform them once a certain amount of pages have
> > been acquired?
>
> Yes, absolutely. It is not completely necessary to do this after every
> crawl cycle.
>
> > Is it even normal that it takes this long, or may I have some
> > configurational errors?
>
> I think spending some time on the issues above should resolve your
> issues.
>
> Lewis
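P.S. Regarding running the update steps less often: if I read the CrawlDb usage correctly, several generate/fetch/parse rounds could be followed by a single updatedb over all segments, roughly like this (an untested sketch with my local paths; I believe generate.update.crawldb needs to be set to true so that repeated generates do not select the same URLs again):

  # sketch: three generate/fetch/parse rounds, then one crawldb update
  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 40000
    SEGMENT=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $SEGMENT -threads 50
    bin/nutch parse $SEGMENT
  done
  # update the crawldb once from all segments fetched so far;
  # -dir re-reads older segments too, so listing only the new
  # segments explicitly would be cleaner
  bin/nutch updatedb crawl/crawldb -dir crawl/segments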

