On Oct 4, 2011, at 4:03 AM, Danicela nutch wrote:

> Hi,
>
> I want to make a ScoringFilter plugin which will give priority to the seeds file.
>
> I mean, I have a crawldb and a seeds file with links. I set topN=5 to test, and I want my seed links to be fetched first, before what I have in the crawldb.
>
> For that, I tried to implement the ScoringFilter methods, particularly injectedScore(Text text, CrawlDatum cd), where I call cd.setScore(100f). The score is correctly set, but it is not used, and my 5-page segment does not contain these links.
>
> Maybe I did something wrong?
>
> Thanks in advance.
If your goal is simply to crawl the seed list first, you can use the FreeGenerator tool to create a fetch segment containing just the URLs from the seed list. Assuming you are using Hadoop to run your crawler ...

1) hadoop jar nutch.job org.apache.nutch.tools.FreeGenerator /hdfs/path/to/seeds/dir /hdfs/path/to/segments/dir
2) hadoop jar nutch.job org.apache.nutch.fetcher.Fetcher /hdfs/path/to/segments/dir/20111004123015 -noParsing
3) hadoop jar nutch.job org.apache.nutch.parse.ParseSegment /hdfs/path/to/segments/dir/20111004123015
4) hadoop jar nutch.job org.apache.nutch.crawl.CrawlDb /hdfs/path/to/crawldb /hdfs/path/to/segments/dir/20111004123015

Apologies for being incredibly verbose there. That will fetch all your seed URLs, parse them, and update the crawl database.

For our crawl setup, we run the FreeGenerator each time we create a new collection of segments to fetch and parse. This ensures we always crawl the home pages of our various websites, since that is where new content is posted each day, and that the latest content gets into nutch/solr as quickly as possible.

Great question. Hope this helps; and I especially hope it helps you avoid the work of writing your own ScoringFilter plugin!

Blessings,
TwP
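The four steps above could be wrapped in one small script. This is only a sketch: it reuses the placeholder HDFS paths and the nutch.job name from the examples, and it prints each command as a dry run instead of executing it (drop the echo to run for real). The timestamped segment name is faked with the current time here; in a real run you would pick the newest directory that FreeGenerator created under the segments dir.

```shell
#!/bin/sh
# Dry-run sketch of the FreeGenerator -> Fetcher -> ParseSegment -> CrawlDb
# pipeline from the message above. Paths are placeholders; adjust them
# to your own HDFS layout.

SEEDS=/hdfs/path/to/seeds/dir
SEGMENTS=/hdfs/path/to/segments/dir
CRAWLDB=/hdfs/path/to/crawldb

# 1) Generate a fetch segment directly from the seed list.
echo hadoop jar nutch.job org.apache.nutch.tools.FreeGenerator "$SEEDS" "$SEGMENTS"

# FreeGenerator names the new segment with a timestamp; for this dry run
# we fake it with the current time (format matches 20111004123015).
SEGMENT=$SEGMENTS/$(date +%Y%m%d%H%M%S)

# 2) Fetch without parsing, 3) parse the fetched content,
# 4) fold the results back into the crawl database.
echo hadoop jar nutch.job org.apache.nutch.fetcher.Fetcher "$SEGMENT" -noParsing
echo hadoop jar nutch.job org.apache.nutch.parse.ParseSegment "$SEGMENT"
echo hadoop jar nutch.job org.apache.nutch.crawl.CrawlDb "$CRAWLDB" "$SEGMENT"
```

Because FreeGenerator skips the crawldb-wide scoring and topN selection of the normal Generator, the seed URLs always end up in their own segment, which sidesteps the injectedScore problem entirely.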

