Hi Harry,
Yes indeed. It appears to work for me too. Thank you!
nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding segment: file:/crawl/segments/20100415221122
LinkDb: adding segment: file:/crawl/segments/20100415221141
LinkDb: adding segment: file:/crawl/segments/20100415221032
LinkDb: adding segment: file:/crawl/segments/20100415221019
LinkDb: adding segment: file:/crawl/segments/20100415221046
LinkDb: done
nutch index crawl/indexes crawl/crawldb/ crawl/linkdb
crawl/segments/20100415221103 crawl/segments/20100415221122
crawl/segments/20100415221141 crawl/segments/20100415221032
crawl/segments/20100415221019 crawl/segments/20100415221046
Indexer: starting
Indexer: done
1) After verifying that Harry's Nutch 1.1 workaround completes the work
for my small test crawl: how do I scale the index step above when the
data grows 50x and I can no longer fit all the segments onto one command
line? Maybe this is a non-issue; hopefully it will be fixed before the
Nutch 1.1 Release Candidate #1 is voted in.
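In the meantime, one way to avoid typing every segment path by hand is to let the shell expand them. This is only a sketch assuming the local-filesystem layout shown above (crawl/segments/<timestamp>); the /tmp demo directory and its timestamps are made up purely to show the expansion:

```shell
# Sketch: build the segment list with a glob instead of listing each one.
# Hypothetical demo layout, just to demonstrate the expansion:
mkdir -p /tmp/crawl-demo/segments/20100415221103 \
         /tmp/crawl-demo/segments/20100415221122

# Collect every segment directory into one variable.
SEGMENTS=$(ls -d /tmp/crawl-demo/segments/*)
echo "$SEGMENTS"

# With a real crawl this would become something like (not run here):
#   bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb \
#       $(ls -d crawl/segments/*)
```

This sidesteps typing the list, but of course not any underlying limit on command-line length itself.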
2) Additionally, I have now lost the ability to peer into the index data
structures. Both Luke 0.9.9.1 and Luke 1.0.1 report:
No valid directory at the location, try another location.
Oh well, any suggestions on #1 or #2 would be appreciated. Thanks again!
-m.
On Fri, 2010-04-16 at 08:44 +0800, Harry Nutch wrote:
I am new to Nutch and still trying to figure out the code flow; however,
as a workaround for issue #1, after the crawl finishes you could run the
linkdb and index commands separately from Cygwin.
$bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/ crawl/linkdb
crawl/segments/20100415163946 crawl/segments/20100415164106
This seems to work for me. You may have already tried this workaround, but
just in case.
-Harry
On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius
mgris...@comcast.net wrote:
Two observations using the Nutch 1.1 nightly build
nutch-2010-04-14_04-00-47:
1) Previously I was crawling successfully with Nutch 1.0, but had
problems with parse-pdf. I decided to try Nutch 1.1 with parse-tika,
which appears to parse all of the 'problem' PDFs that parse-pdf could
not handle. The crawldb and segments directories are created and appear
to be valid. However, the overall crawl no longer finishes:
nutch crawl urls/urls -dir crawl -depth 10
...
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100415015102]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Exception in thread "main" java.lang.NullPointerException
at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)
Nutch 1.0 would complete like this:
nutch crawl urls/urls -dir crawl -depth 10
...
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=7 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-0
done merging
crawl finished: crawl
Any ideas?
2) If there is a space in any directory component of the install path,
then $NUTCH_OPTS is built incorrectly and causes this problem:
m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin nutch
crawl urls/urls -dir crawl -depth 10 -topN 10
NUTCH_OPTS: -Dhadoop.log.dir=/home/mag/Desktop/untitled
folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log
-Djava.library.path=/home/mag/Desktop/untitled
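For what it's worth, #2 looks like a classic unquoted-variable expansion in the launcher script. A minimal illustration of why a space in the path splits the options (this is not the actual bin/nutch code, and DIR below is a made-up path):

```shell
# Illustration only: how word splitting breaks a path containing a space.
DIR="/home/mag/Desktop/untitled folder/logs"

# Helper that reports how many arguments it received.
count_args() { echo $#; }

# Unquoted expansion: the shell splits the path at the space -> 2 words.
split=$(count_args $DIR)

# Quoted expansion keeps the path as a single word -> 1 word.
whole=$(count_args "$DIR")

echo "unquoted=$split quoted=$whole"
```

If that's what is happening, the fix would presumably be to quote the log-dir and library-path variables where NUTCH_OPTS is assembled in the script, though I haven't checked the actual bin/nutch source.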