What you describe sounds like a complete update and revamp prior to the transition to a new CMS. Although finding the depth of each web page in your site might aid this transition, I would suggest considering a more holistic approach that takes the overall structure of your site into account.
To answer your question, I would say the best way to do what you require is to crawl to a specific depth and then dump the crawldb with readdb -dump. This way you would be able to see the URLs Nutch has fetched at your requested depth. Repeating this for depths 1, 2, 3 and so on, and comparing the dumps, should show which URLs first appear at each depth.

Lewis
________________________________________
From: Meenakshi Kanaujia [[email protected]]
Sent: 24 May 2011 04:17
To: [email protected]
Subject: Re: Fetch list of urls

Hi Lewis,

Actually I am crawling an intranet site. We need to migrate the intranet site from one CMS to another. Crawling has completed successfully. Now I have to decide the migration order of the crawled web pages. For that, I was thinking of a way to find the list of URLs at each depth. Suppose at the time of crawling I have specified depth 5; then I want to fetch the list of URLs at depth 1, 2, 3... and so on. Is this possible using the Nutch API?

Thanks,
Meenakshi

On Mon, May 23, 2011 at 4:02 PM, McGibbney, Lewis John <[email protected]> wrote:
> Hi Meenakshi,
>
> Can you expand any on this? This is very vague.
>
> Lewis
>
> ________________________________________
> From: Meenakshi Kanaujia [[email protected]]
> Sent: 23 May 2011 05:30
> To: [email protected]
> Subject: Fetch list of urls
>
> Hi,
>
> I have crawled a site using Nutch.
> Is it possible to fetch the list of URLs depth-wise from the crawlDB?
>
> Thanks,
> Meenakshi

Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
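As a rough illustration of the readdb -dump suggestion in this thread, the sketch below compares text dumps taken after crawls of increasing depth and reports which fetched URLs first appear at each depth. It is only a sketch under assumptions: it assumes each dump record starts with a line of the form "URL<tab>Version: N" followed by a "Status: N (db_fetched)" style line, which should be checked against the dump your Nutch version actually writes. The function names and the directory names in the comments are hypothetical.

```python
import re

# Assumed record layout in a `readdb -dump` text file (verify against your dump):
#   http://host/page<TAB>Version: 7
#   Status: 2 (db_fetched)
#   ...
URL_LINE = re.compile(r"^(https?://\S+)\s+Version:")
STATUS_LINE = re.compile(r"^Status:\s+\d+\s+\((\w+)\)")


def fetched_urls(dump_text):
    """Return the set of URLs whose record carries the status db_fetched."""
    fetched = set()
    current = None
    for line in dump_text.splitlines():
        url_match = URL_LINE.match(line)
        if url_match:
            current = url_match.group(1)
            continue
        status_match = STATUS_LINE.match(line)
        if status_match and current is not None:
            if status_match.group(1) == "db_fetched":
                fetched.add(current)
            current = None
    return fetched


def new_urls_by_depth(dumps):
    """Map depth -> sorted URLs first fetched at that depth.

    `dumps` is a list of dump texts; index 0 is the depth-1 crawl,
    index 1 the depth-2 crawl, and so on.
    """
    seen = set()
    result = {}
    for depth, text in enumerate(dumps, start=1):
        urls = fetched_urls(text)
        result[depth] = sorted(urls - seen)
        seen |= urls
    return result


# Hypothetical usage, one crawl and one dump per depth, e.g.:
#   bin/nutch crawl urls -dir crawl-d1 -depth 1
#   bin/nutch readdb crawl-d1/crawldb -dump dump-d1
# then read each dump's text and pass the texts, in depth order,
# to new_urls_by_depth().
```

Filtering on db_fetched matters because the crawldb also holds URLs that were discovered but not yet fetched; without the filter, links found during the last round would be counted one depth too early.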

