nutch 1.1 crawl d/n complete issue

2010-04-16 Thread matthew a. grisius
Two observations using the Nutch 1.1 nightly build nutch-2010-04-14_04-00-47: 1) I was using Nutch 1.0 to crawl successfully, but had problems with parse-pdf. I decided to try Nutch 1.1 with parse-tika, which appears to parse all of the 'problem' PDFs that parse-pdf could not handle. The crawldb

Re: nutch 1.1 crawl d/n complete issue

2010-04-16 Thread Phil Barnett
The fix: in line 131 of Crawl.java, Generate no longer returns segments like it used to; it now returns segs. Line 131 needs to read "if (segs == null)" instead of the current "if (segments == null)". After that change and a recompile, crawl is working just fine.
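To illustrate why checking the wrong variable matters, here is a minimal, self-contained sketch of a crawl loop. It is not the actual Crawl.java code: the class CrawlLoopSketch, the generate stand-in, and the emptyAtDepth parameter are all hypothetical, and only the names segments and segs come from the message above. The sketch assumes segments is the never-null parent directory and segs is the value returned by the generate step, which is null once there is nothing left to fetch.

```java
public class CrawlLoopSketch {

    // Hypothetical stand-in for the generate step: returns the newly created
    // segment names, or null once there is nothing left to fetch.
    static String[] generate(int depth, int emptyAtDepth) {
        if (depth >= emptyAtDepth) {
            return null;
        }
        return new String[] { "segment-" + depth };
    }

    // Runs the loop and returns how many segments were processed.
    // checkReturnedValue == true  -> tests segs (the fix described above)
    // checkReturnedValue == false -> tests segments (the stale check)
    static int crawl(int maxDepth, int emptyAtDepth, boolean checkReturnedValue) {
        String segments = "crawl/segments"; // parent directory, never null here
        int processed = 0;
        for (int i = 0; i < maxDepth; i++) {
            String[] segs = generate(i, emptyAtDepth);
            Object checked = checkReturnedValue ? segs : segments;
            if (checked == null) {
                break; // no more URLs to fetch
            }
            // With the stale check this line is reached even when segs is
            // null, so the sketch fails with a NullPointerException.
            processed += segs.length;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("fixed check: " + crawl(5, 2, true));
        try {
            crawl(5, 2, false);
        } catch (NullPointerException e) {
            System.out.println("stale check: failed on a null result");
        }
    }
}
```

In this sketch the stale check goes unnoticed as long as generate keeps returning segments; it only fails in the round where generate returns null, because the loop never sees the stop condition.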

nutch 1.1 crawl d/n complete issue

2010-04-15 Thread matthew a. grisius
Two observations using the Nutch 1.1 nightly build nutch-2010-04-14_04-00-47: 1) Previously I was using Nutch 1.0 to crawl successfully, but had problems with parse-pdf. I decided to try Nutch 1.1 with parse-tika, which appears to parse all of the 'problem' PDFs that parse-pdf could not handle. The

Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread Harry Nutch
I am new to Nutch and still trying to figure out the code flow. However, as a workaround for issue #1, after the crawl finishes you could run the linkdb and index commands separately from Cygwin:
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/

Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread matthew a. grisius
Hi Harry, Yes indeed. It appears to work for me too. Thank you!
nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding