Hello,

Start by disabling filtering and normalizing; both were already applied in the 
parser. Re-enable them only once, after you have changed your filters and/or 
normalizers. You can use -segment to update an existing graph. By the way, is 
building the graph really a performance problem? Computing the linkrank is 
much more costly.
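
For example (a sketch only, reusing the $commonOptions, $CRAWL_PATH and 
$SEGMENT variables from your quoted script), the line could become:

  __bin_nutch webgraph $commonOptions -segment "$CRAWL_PATH"/segments/$SEGMENT -webgraphdb "$CRAWL_PATH"

This drops -filter and -normalize, and passes only the newest segment via 
-segment, so the existing webgraphdb is updated instead of being rebuilt 
from every segment under -segmentDir.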

Markus

-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Thursday 2nd March 2017 2:07
> To: [email protected]
> Subject: webgraph speed
> 
> Hello nutchers!
> I am trying to compute linkrank scores without spending excessive time on the 
> task. My version of the crawl script contains the following line, which is 
> similar to a commented-out line in the bin/crawl script in the 1.12 
> distribution.
> __bin_nutch webgraph $commonOptions -filter -normalize -segmentDir 
> "$CRAWL_PATH"/segments/ -webgraphdb "$CRAWL_PATH"
> I notice that it specifies -segmentDir, rather than -segment. Does that mean 
> it re-computes the outlinkdb and other information for every existing 
> segment every time it does a new segment, or does it check and avoid re-doing 
> things it did before?
> If I change it to say -segment "$CRAWL_PATH"/segments/$SEGMENT, will it do 
> just what needs doing? The way I have it now, it spends a lot of time 
> computing outlinkdb.
> Thanks for any light you may shed.
> 
