Hello,

I have recently configured my Nutch crawler to crawl a whole domain, with an estimated 1.5M-3M documents.
For this purpose, I am using Nutch 1.13 and Solr 4.10.4 to build a search index over these documents. The compute server is a 16-core Xeon server with 128 GB of RAM.

While everything worked quite well for subdomain crawls, I noticed some severe drawbacks once I put it on the whole domain:

- The Solr indexing failed without any obvious reason unless I lowered the -topN value from 50k to 40k documents (the per-iteration commands I run are listed at the end of this mail).
- The CrawlDb and LinkDb update/merge steps take an unreasonably long time after only 150k indexed documents (~7 crawl iterations); the most recent update took over 8 hours. It also seems to utilize only one core of the machine, which seems weird to me. I already increased the Java heap size from the default 1 GB to 5 GB (the exact setting is shown below), but did not notice any immediate improvement.

My questions would be:

- As an alternative to the single server, I have access to a cluster of 4-5 nodes, each with 2 cores and 10 GB available for Hadoop. Would I benefit from a distributed run at all? It does not seem to me that the fetching/generating process is the bottleneck, but rather the (serial?) update of the databases.
- Since crawling itself is not the issue, could I potentially benefit from switching to Nutch 2.x?
- Is there any known reason why Solr might "reject" an indexing step, or was it just a temporary error? I honestly have not tried it again, since I have time constraints on the crawl and do not want to have to start over.
- Is there any way to efficiently skip the update steps most of the time and only perform them once a certain number of pages has been acquired? A rough sketch of what I mean is below. Is it even normal that these steps take this long, or might I have a configuration error?
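For reference, this is roughly the cycle I run per iteration (step-by-step rather than via the bin/crawl wrapper; paths and the -topN value are illustrative, the Solr URL is configured in nutch-site.xml):

    # generate a fetch list of at most 50k URLs (lowered to 40k as described above)
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    SEGMENT=$(ls -d crawl/segments/2* | tail -1)

    # fetch and parse the new segment
    bin/nutch fetch "$SEGMENT" -threads 16
    bin/nutch parse "$SEGMENT"

    # update CrawlDb and LinkDb -- these are the steps that take hours
    bin/nutch updatedb crawl/crawldb "$SEGMENT"
    bin/nutch invertlinks crawl/linkdb "$SEGMENT"

    # index the segment into Solr (solr.server.url set in nutch-site.xml)
    bin/nutch index crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"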
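The heap increase mentioned above was done via the environment, roughly like this (NUTCH_HEAPSIZE is in MB; I run in local mode, not on Hadoop):

    # raise the JVM heap for bin/nutch from the default ~1000 MB to 5000 MB
    export NUTCH_HEAPSIZE=5000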
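Regarding the last question, what I had in mind is fetching and parsing several generated segments in a row and only then updating the databases once over all of them, assuming updatedb/invertlinks accept multiple segments in one run (segment names here are just placeholders):

    # after fetching/parsing e.g. SEG1..SEG3:
    bin/nutch updatedb crawl/crawldb crawl/segments/SEG1 crawl/segments/SEG2 crawl/segments/SEG3
    bin/nutch invertlinks crawl/linkdb crawl/segments/SEG1 crawl/segments/SEG2 crawl/segments/SEG3

I am not sure whether this has drawbacks for scoring or recrawl scheduling, though.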
Many thanks in advance,
Dennis
