Hello nutchers!
I am trying to compute linkrank scores without spending excessive time on the
task. My version of the crawl script contains the following line, which is
similar to a commented-out line in the bin/crawl script in the 1.12
distribution.
__bin_nutch webgraph $commonOptions -filter -normalize -segmentDir
"$CRAWL_PATH"/segments/ -webgraphdb "$CRAWL_PATH"
I notice that it specifies -segmentDir, rather than -segment. Does that mean it
re-computes the outlinkdb and other information for every existing segment
every time it does a new segment, or does it check and avoid re-doing things it
did before?
If I change it to say -segment "$CRAWL_PATH"/segments/$SEGMENT, will it process
only the new segment and do just what needs doing? The way I have it now, it
spends a lot of time recomputing the outlinkdb.
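For concreteness, the change I am considering would look like this (untested; it assumes $SEGMENT holds the name of the segment generated in the current loop iteration, the same way other per-segment steps in the crawl script use it):

```shell
# Proposed change (untested sketch): update the webgraph from only the
# newly fetched segment, instead of rescanning everything under segments/.
# Assumes $SEGMENT is set to the current segment name earlier in the loop.
__bin_nutch webgraph $commonOptions -filter -normalize \
  -segment "$CRAWL_PATH"/segments/$SEGMENT \
  -webgraphdb "$CRAWL_PATH"
```
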
Thanks for any light you may shed.