Hello,
I have recently configured my Nutch crawler to index a whole domain,
estimated at 1.5M-3M documents.

For this purpose, I wanted to use Nutch 1.13 and Solr 4.10.4 to build a
search index over these documents. The compute server is a 16-core Xeon
server with 128 GB RAM.
While everything worked quite well for subdomain crawls, I noticed some
severe drawbacks once I ran it on the whole domain:
- The Solr indexing step failed for no obvious reason unless I lowered
the -topN value from 50k to 40k documents.
- The CrawlDb and LinkDb update/merge steps take an unreasonably long
time after only 150k indexed documents (~7 crawl iterations); the latest
one took over 8 hours. They also seem to utilize only one core of the
machine, which seems odd to me. I already increased the Java heap size
to 5 GB (from the default 1 GB), but did not notice any immediate
improvement.

My questions would be:
- As an alternative to the server, I have access to a cluster of 4-5 nodes
with 2 cores and 10 GB each available for Hadoop. Would I benefit from a
distributed run at all? It doesn't seem to me that the fetching/generating
process is the bottleneck, but rather the (serial?) update of the database.
- Since crawling is not the issue, could I potentially benefit from
switching to Nutch 2.x?
- Is there any known reason why Solr might "reject" an indexing step, or
was it just a temporary error? I have honestly not tried it again, since I
am under time constraints for this crawl and do not want to have to start
over.
- Is there any way to efficiently "skip" the update steps most of the
time, and only perform them once a certain number of pages have been
acquired (see the sketch right after this list)? Is it even normal that
they take this long, or might I have some configuration errors?
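
To make that last question more concrete, here is a rough, untested Python
sketch of what I have in mind: drive a few generate/fetch/parse cycles and
only run updatedb/invertlinks once over all the new segments at the end. The
install path, directory layout, -topN value, thread count and cycle count are
just placeholders for my setup, not a working configuration.

import os
import subprocess

NUTCH = "/opt/nutch/bin/nutch"   # wherever bin/nutch lives
CRAWLDB = "crawl/crawldb"
LINKDB = "crawl/linkdb"
SEGMENTS = "crawl/segments"
CYCLES = 5                       # fetch cycles per updatedb/invertlinks pass

def nutch(*args):
    subprocess.run([NUTCH, *args], check=True)

new_segments = []
for _ in range(CYCLES):
    nutch("generate", CRAWLDB, SEGMENTS, "-topN", "40000")
    # generate names the new segment after the current timestamp, so the
    # lexicographically last directory is the one that was just created
    segment = os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])
    nutch("fetch", segment, "-threads", "20")
    nutch("parse", segment)
    new_segments.append(segment)

# one CrawlDb update and one LinkDb inversion over all new segments
nutch("updatedb", CRAWLDB, *new_segments)
nutch("invertlinks", LINKDB, *new_segments)

I am aware that without an updatedb in between, the later generate runs work
from a stale CrawlDb (newly discovered links are missing, and already-fetched
URLs may be selected again unless generate.update.crawldb is enabled), so I am
mainly asking whether something along these lines is a reasonable trade-off.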

Many thanks in advance,
Dennis
