Hi Meenakshi,

This can be done by implementing a custom ScoringFilter that propagates the depth from each page to its outlinks via their metadata. The depth would then be stored in the crawldb, and you could write a custom MapReduce job to filter the URLs by the value of that metadata. Note that this requires recrawling, since the depth is only recorded while the crawl runs.

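
To make the idea concrete, here is a minimal plain-Java sketch of the depth bookkeeping only. It does not use the Nutch API: the class DepthPropagationSketch, its methods, the choice of depth 1 for seeds, and the intranet.example URLs are all assumptions made for illustration, and an ordinary map stands in for the crawldb metadata. In a real setup the first method would live in the custom ScoringFilter and the second in the MapReduce job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation; none of these names are Nutch classes or methods.
final class DepthPropagationSketch {

    // Stand-in for the scoring-filter step: each outlink receives
    // parentDepth + 1, and a URL that was already seen keeps its smallest depth.
    static void distributeDepth(Map<String, Integer> db, String parent, List<String> outlinks) {
        Integer parentDepth = db.get(parent);
        if (parentDepth == null) {
            return; // parent has no recorded depth, so there is nothing to propagate
        }
        for (String outlink : outlinks) {
            db.merge(outlink, parentDepth + 1, Integer::min);
        }
    }

    // Stand-in for the custom job over the crawldb: group URLs by recorded depth.
    static Map<Integer, List<String>> groupByDepth(Map<String, Integer> db) {
        Map<Integer, List<String>> byDepth = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : db.entrySet()) {
            byDepth.computeIfAbsent(entry.getValue(), d -> new ArrayList<>()).add(entry.getKey());
        }
        return byDepth;
    }

    public static void main(String[] args) {
        Map<String, Integer> db = new LinkedHashMap<>();
        db.put("http://intranet.example/", 1); // assumption: seed URLs start at depth 1

        distributeDepth(db, "http://intranet.example/",
                Arrays.asList("http://intranet.example/a", "http://intranet.example/b"));
        // /b is linked again from /a but keeps its smaller depth; /c is new.
        distributeDepth(db, "http://intranet.example/a",
                Arrays.asList("http://intranet.example/b", "http://intranet.example/c"));

        groupByDepth(db).forEach((depth, urls) -> System.out.println(depth + " -> " + urls));
    }
}
```

The sketch assumes pages are processed round by round, as in successive crawl iterations, so a parent's depth is already known when its outlinks are handled. It does not revisit outlinks if a parent's depth is lowered later.
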
HTH

Julien

On 24 May 2011 04:17, Meenakshi Kanaujia <[email protected]> wrote:

> Hi Lewis,
>
> Actually I am crawling an intranet site.
> We need to migrate the intranet site from one CMS to another.
> The crawl completed successfully.
> Now I have to decide the migration order of the crawled web pages.
> To do that, I was looking for a way to find the list of URLs at each
> depth.
> Suppose I specified depth 5 at crawl time; I then want to fetch the
> list of URLs at depth 1, 2, 3, and so on.
> Is this possible using the Nutch API?
>
> Thanks,
> Meenakshi
>
> On Mon, May 23, 2011 at 4:02 PM, McGibbney, Lewis John <[email protected]> wrote:
>
> > Hi Meenakshi,
> >
> > Can you expand on this? It is very vague.
> >
> > Lewis
> >
> > ________________________________________
> > From: Meenakshi Kanaujia [[email protected]]
> > Sent: 23 May 2011 05:30
> > To: [email protected]
> > Subject: Fetch list of urls
> >
> > Hi,
> >
> > I have crawled a site using Nutch.
> > Is it possible to fetch the list of URLs depth-wise from the crawldb?
> >
> > Thanks,
> > Meenakshi

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

