Hi Senthil,
I think you should take a look at this website; you will find detailed information there: http://wiki.apache.org/nutch/FrontPage

I will presume you are using Nutch 1.x without Hadoop, so check this page first: http://wiki.apache.org/nutch/NutchTutorial

You should think in terms of link depth rather than the time it takes to crawl a site completely. When Nutch is started with something like 'bin/nutch crawl your_site_urls_dir -dir DIRECTORY_FOR_SITE_k -depth 3', and presuming your_site_urls_dir contains a text file with just the start URL, Nutch runs three loops, which can be seen as crawling the given site to link depth 3. In each loop cycle Nutch takes all URLs, or a part of them (depending on your configuration), from the crawldb and tries to fetch them. It then parses the fetched pages and finds new links, which are fetched in the next cycle, and so on.

Initially you could start with depth 1 to create the crawldb for each site k (1 <= k <= 50) and inject the start URLs you provide. You need to do this for all 50 sites. These 50 directories are what you pass to the crawl command as DIRECTORY_FOR_SITE_k.

Each time Nutch is started it checks the crawldb for unfetched links and fetches them; it does not keep fetching on its own until nothing is left. You have to choose a depth (loop count) high enough that the whole site actually gets visited. Also, most sites do not change all at once after a fixed period but change continuously, and those changes are only noticed when the pages are fetched again. So this is really a combined crawling/re-crawling. You can configure the interval after which already-fetched pages are visited again to check for changes (see the configuration sketch appended after the quoted message below).

I think you will have to start the same crawl job for each site, each time with its own site directory, and you should do it in a way that ensures all your pages get fetched. I can imagine starting Nutch from a script that runs a number of loops for one site after the other (a sketch of such a script is also appended below); how long that takes depends on the number of pages hosted on your sites. Alternatively, you are free to start 50 Nutch instances in parallel, one for each site.

Hope this helps,
Walter

On 18.04.2013 18:46, mesenthil1 wrote:
> Hi,
>
> Can someone please explain how the following scenario works?
>
> I need to crawl a site with 50K URLs. This is a dynamic site and will
> have frequent updates. Assuming it takes 2 days to completely crawl this
> site, can we have some configuration (fetch schedule or something else)
> so that once the crawl cycle is complete, the next crawl cycle will
> start automatically after two days to find the new URLs? If this feature
> is not available, should we manually control the repeated crawling of
> the site through some sort of scripting?
>
> Actually we will have more than 50 sites to be crawled separately. If we
> need to maintain re-crawling of each site, should we have 50 separate
> scripts to handle them? Please let us know if anyone has faced this
> situation.
>
> Thanks,
> Senthil
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Whether-Nutch-AdaptiveFetchSchedule-can-do-recrawling-automatically-tp4056979p4057036.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
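PS: Here is a minimal sketch of the wrapper script idea, assuming Nutch 1.x with the one-shot 'bin/nutch crawl' command. The site names, directory layout (seeds/, crawls/, logs/), NUTCH_HOME and the depth/topN numbers are placeholders of mine, not anything Nutch prescribes:

#!/bin/bash
# Hypothetical wrapper: run one sequential Nutch 1.x crawl per site.
# All paths, site names and numbers are placeholders -- adjust to your setup.

NUTCH_HOME=/opt/nutch      # assumed Nutch 1.x install location
DEPTH=10                   # generate/fetch/update loops per run
TOPN=50000                 # cap on URLs fetched per loop (optional)

mkdir -p logs

for SITE in site_01 site_02 site_03; do    # extend the list to all 50 sites
    URLS_DIR="seeds/$SITE"                 # holds a text file with the start URL(s)
    CRAWL_DIR="crawls/$SITE"               # per-site crawldb/segments/index

    "$NUTCH_HOME/bin/nutch" crawl "$URLS_DIR" \
        -dir "$CRAWL_DIR" \
        -depth "$DEPTH" \
        -topN "$TOPN" \
        >> "logs/$SITE.log" 2>&1
done

To get the "start again after two days" behaviour you asked about, you could run such a script from cron every second day. Whether the one-shot crawl command is willing to reuse an existing crawl directory depends on your Nutch version; if it refuses, script the individual steps (bin/nutch inject / generate / fetch / parse / updatedb) against the per-site crawldb instead.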
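As for the re-fetch interval itself (the AdaptiveFetchSchedule from the subject line), here is a sketch of what the overrides in conf/nutch-site.xml could look like. The values are purely illustrative; check conf/nutch-default.xml of your Nutch version for the exact property names and defaults before copying anything:

<!-- example overrides in conf/nutch-site.xml; all values are illustrative -->
<configuration>
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    <!-- adapts the per-page re-fetch interval depending on whether the
         page changed since the last fetch -->
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>172800</value>
    <!-- initial interval: consider pages due for re-fetch after 2 days (seconds) -->
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>86400</value>
    <!-- never re-fetch a page more often than once a day -->
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>2592000</value>
    <!-- back off to at most 30 days for pages that never seem to change -->
  </property>
</configuration>

With a schedule like this, pages that turn out to change often drift towards the minimum interval and static pages drift towards the maximum, so each of the 50 crawldbs keeps its own per-page re-fetch times without extra scripting.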
--
--------------------------------
Walter Tietze
Senior Software Developer

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T: +49 30 246 27 318
[email protected]
http://www.neofonie.de

Commercial register (Handelsregister) Berlin-Charlottenburg: HRB 67460
Managing Director (Geschäftsführung): Thomas Kitlitschko
--------------------------------

