Two observations using the nutch 1.1 nightly build
nutch-2010-04-14_04-00-47:
1) I was using nutch 1.0 to crawl successfully, but had problems w/
parse-pdf. I decided to try nutch 1.1 w/ parse-tika, which appears to
parse all of the 'problem' pdfs that parse-pdf could not handle.
2) The crawl itself, however, breaks with this build.
The Fix.
In line 131 of Crawl.java: generate no longer returns segments like it
used to. Now it returns segs. Line 131 needs to read

    if (segs == null)

instead of the current

    if (segments == null)

After that change and a recompile, crawl is working just fine.
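To picture what the corrected check does, here is a minimal, self-contained sketch. It is not the actual Nutch source; the class name and the stand-in generate() method are invented for illustration. The point is only that the null check must test the variable the generate step actually assigns (segs), since it returns null when nothing is due for fetching:

```java
// Self-contained sketch (NOT the real Crawl.java): illustrates the guard
// around line 131, where generate's result now lives in "segs".
public class CrawlSketch {

    // Stand-in for Generator.generate(): returns segment paths, or null
    // when there is nothing left to fetch at this depth.
    static String[] generate(boolean anythingToFetch) {
        return anythingToFetch
                ? new String[] { "crawl/segments/20100415221103" }
                : null;
    }

    public static void main(String[] args) {
        String[] segs = generate(false);
        if (segs == null) {            // corrected check: "segs", not "segments"
            System.out.println("Stopping - no more URLs to fetch.");
            return;
        }
        System.out.println("Fetching " + segs.length + " segment(s).");
    }
}
```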
I am new to nutch and still trying to figure out the code flow. However,
as a workaround to issue #1, after the crawl finishes you could run the
linkdb and index commands separately from cygwin:
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/
Hi Harry,
Yes indeed. It appears to work for me too. Thank you!
nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding