What you describe sounds like a complete update and revamp prior to the
transition to a new CMS. Although finding the depth of each web page in your
site might aid this transition, I would suggest considering a more holistic
approach, one that takes the overall structure of your site into account.

To answer your question, I would say the best way to do what you require is to
crawl to a specific depth, then dump the crawldb with readdb -dump. This way you
would be able to see the URLs Nutch has fetched at your requested depth.
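
As a rough sketch of how the per-depth lists could be derived from such dumps:
run one crawl per depth into its own directory, dump each crawldb, and take the
URLs that are fetched at depth N but were not fetched at any shallower depth.
The Python below is only an illustration, not part of Nutch. The function names
are hypothetical, the commands in the comments assume the Nutch 1.x one-step
crawl command and placeholder directory names, and the parser assumes the text
dump writes each record as a "<url> TAB Version: n" line followed by a
"Status: n (db_...)" line, so check that against your own dump first.

```python
import re
from typing import Dict, Iterable, List, Set

# Assumed workflow (Nutch 1.x; directory names are placeholders):
#   bin/nutch crawl urls -dir crawl-d1 -depth 1
#   bin/nutch readdb crawl-d1/crawldb -dump dump-d1
#   ...repeat with -depth 2, -depth 3, and so on.

# Assumed record layout in the text dump; verify against a real dump.
RECORD_HEADER = re.compile(r"^(\S+)\s+Version: \d+\s*$")
STATUS_LINE = re.compile(r"^Status: \d+ \((\w+)\)")


def fetched_urls(dump_lines: Iterable[str]) -> Set[str]:
    """Collect the URLs whose record reports the status db_fetched."""
    urls: Set[str] = set()
    current = None
    for line in dump_lines:
        header = RECORD_HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_LINE.match(line)
        if status and current is not None:
            if status.group(1) == "db_fetched":
                urls.add(current)
            current = None  # ignore the rest of this record
    return urls


def new_urls_by_depth(dumps: Dict[int, Iterable[str]]) -> Dict[int, List[str]]:
    """Map each depth to the URLs first fetched at that depth."""
    seen: Set[str] = set()
    result: Dict[int, List[str]] = {}
    for depth in sorted(dumps):
        fetched = fetched_urls(dumps[depth])
        result[depth] = sorted(fetched - seen)
        seen |= fetched
    return result
```

Each value passed to new_urls_by_depth can be an open file handle on the dump
written for that depth, since the parser only iterates over lines.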

Lewis

________________________________________
From: Meenakshi Kanaujia [[email protected]]
Sent: 24 May 2011 04:17
To: [email protected]
Subject: Re: Fetch list of urls

Hi Lewis,

Actually, I am crawling an intranet site.
We need to migrate the intranet site from one CMS to another.
The crawl completed successfully.
Now I have to decide the migration order of the crawled web pages.
To do that, I was thinking of a way to find the list of URLs at each
depth.
Suppose I specified depth 5 at crawl time; I then want to
fetch the list of URLs at depth 1, 2, 3, and so on.
Is this possible using the Nutch API?

Thanks,
Meenakshi

On Mon, May 23, 2011 at 4:02 PM, McGibbney, Lewis John <
[email protected]> wrote:

> Hi Meenakshi,
>
> Can you expand on this at all? It is very vague.
>
> Lewis
>
> ________________________________________
> From: Meenakshi Kanaujia [[email protected]]
> Sent: 23 May 2011 05:30
> To: [email protected]
> Subject: Fetch list of urls
>
> Hi,
>
> I have crawled a site using Nutch.
> Is it possible to fetch the list of URLs depth-wise from the crawlDB?
>
> Thanks,
> Meenakshi
>
> Email has been scanned for viruses by Altman Technologies' email management
> service - www.altman.co.uk/emailsystems
>
> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
>
> Winner: Times Higher Education’s Widening Participation Initiative of the
> Year 2009 and Herald Society’s Education Initiative of the Year 2009.
>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>
> Winner: Times Higher Education’s Outstanding Support for Early Career
> Researchers of the Year 2010, GCU as a lead with Universities Scotland
> partners.
>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
>