On Oct 4, 2011, at 4:03 AM, Danicela nutch wrote:

> Hi,
> 
> I want to make a ScoringFilter plugin which will give priority to seeds file.
> 
> I mean, I have a crawldb and a seeds file with links. I set topN=5 to test, 
> and I want my seed links to be fetched first, before what is already in the 
> crawldb.
> 
> For that, I tried to implement the ScoringFilter methods, particularly 
> injectedScore(Text text, CrawlDatum cd), where I call cd.setScore(100f). The 
> score is set correctly, but it is not used, and my 5-page segment does not 
> contain these links.
> 
> Did I do something wrong?
> 
> Thanks in advance.

If your goal is simply to crawl the seed list first, you can use the 
FreeGenerator tool to create a fetch segment containing just the URLs from the 
seed list. Assuming you are using Hadoop to run your crawler ...

1) hadoop jar nutch.job org.apache.nutch.tools.FreeGenerator 
/hdfs/path/to/seeds/dir /hdfs/path/to/segments/dir
2) hadoop jar nutch.job org.apache.nutch.fetcher.Fetcher 
/hdfs/path/to/segments/dir/20111004123015 -noParsing
3) hadoop jar nutch.job org.apache.nutch.parse.ParseSegment 
/hdfs/path/to/segments/dir/20111004123015
4) hadoop jar nutch.job org.apache.nutch.crawl.CrawlDb /hdfs/path/to/crawldb 
/hdfs/path/to/segments/dir/20111004123015

Apologies for being incredibly verbose there. That will fetch all your seed 
URLs, parse them, and update the crawl database.

For our crawl setup, we run the FreeGenerator each time we create a new 
collection of segment files to fetch and parse. This ensures that we always 
crawl the home pages of our various websites, since that is where new content 
is posted each day, so the latest content gets into nutch/solr as quickly as 
possible.

Great question. Hope this helps, and I especially hope it saves you the work 
of writing your own ScoringFilter plugin!

Blessings,
TwP
