Hi,

you remove it the same way regardless of whether the crawl runs locally or on a cluster: drop the invertlinks command (or the LinkDb.invert(...) call when invoked from Java). Consequently there will be no linkdb, so you cannot reference it when indexing.
The concrete steps depend on how the crawler is launched:
- bin/crawl
- custom script
- o.a.n.crawl.Crawler (deprecated, removed in 1.8)
- custom Java code

Sebastian

On 03/17/2014 08:51 PM, S.L wrote:
> Thanks Sebastian! I am actually running it as a MapReduce job on Hadoop,
> how would I disable it in this case?
>
>
> On Mon, Mar 17, 2014 at 3:39 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi,
>>
>> in the script bin/crawl (or a copy of it):
>> - comment out/remove the line
>>      $bin/nutch invertlinks $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
>> - remove
>>      -linkdb $CRAWL_PATH/linkdb
>>    from the line
>>      $bin/nutch index ...
>>
>> Sebastian
>>
>> On 03/17/2014 03:43 PM, S.L wrote:
>>> Hi,
>>>
>>> I am building a search engine for Chinese medicine and I know the list of
>>> websites that I need to crawl, which we can think of as isolated islands
>>> with no inter-connectivity between them; this makes every page on the
>>> websites of interest equally important.
>>>
>>> Nutch has a MapReduce phase called LinkInversion, which computes the
>>> importance of a given page from its inlinks. In my case there are no
>>> inter-site inlinks, so I should not even attempt link inversion.
>>>
>>> Can someone please suggest how to disable the LinkInversion phase in
>>> Apache Nutch 1.7?
>>>
>>> Thanks.
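The two edits to a copy of bin/crawl can be sketched with sed. This is a hedged sketch, not a tested patch: the line patterns below are taken from the snippets quoted in this thread, and the stand-in file only approximates the real script (the actual index line in your Nutch 1.7 bin/crawl may carry different options), so verify the patterns against your copy before applying them.

```shell
#!/bin/sh
# Sketch: disable the link-inversion step in a copy of bin/crawl.
# For the demo we edit a two-line stand-in for the script; point
# CRAWL_SCRIPT at your real copy instead. The index line here is an
# assumption -- check it against your actual bin/crawl.

CRAWL_SCRIPT=$(mktemp)   # stand-in; in practice e.g. CRAWL_SCRIPT=bin/crawl
cat > "$CRAWL_SCRIPT" <<'EOF'
  $bin/nutch invertlinks $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
  $bin/nutch index $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
EOF

# 1. Comment out the invertlinks invocation.
sed -i 's|^\(.*\$bin/nutch invertlinks .*\)$|# \1|' "$CRAWL_SCRIPT"

# 2. Drop the "-linkdb $CRAWL_PATH/linkdb" argument from the index command.
sed -i 's| -linkdb \$CRAWL_PATH/linkdb||' "$CRAWL_SCRIPT"

cat "$CRAWL_SCRIPT"
```

After the two substitutions, the invertlinks line is commented out and the index command no longer references the (now absent) linkdb.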

