Hi Senthil,

I think you should take a look at this website. You can find detailed 
information there.

http://wiki.apache.org/nutch/FrontPage


I will presume you are using Nutch 1.x without Hadoop. In that case, check 
this tutorial first: http://wiki.apache.org/nutch/NutchTutorial



You should think in terms of link depth rather than the time needed to 
completely crawl a site.

When Nutch is started with something like 'bin/nutch crawl your_site_urls_dir 
-dir DIRECTORY_FOR_SITE_k -depth 3', and presuming your_site_urls_dir contains 
a text file with just the start URL, Nutch will run three crawl loops, which 
amounts to crawling the given site to link depth 3.
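
For a concrete example (the seed and output directory names below are just 
placeholders), a single-site crawl to depth 3 might look like:

  bin/nutch crawl seeds/site_1 -dir crawl/site_1 -depth 3 -topN 50000

-topN caps how many URLs are selected per loop, so with roughly 50K pages per 
site you may need to raise depth and/or topN until nothing is left unfetched.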


In each loop cycle Nutch takes all URLs, or a subset of them (depending on 
your configuration), from the crawldb and tries to fetch them. Nutch then 
parses the fetched pages and finds new links, which will be fetched in the 
next cycle, and so on.
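
If you want to see what one such loop actually does, the crawl command is 
roughly equivalent to running the individual Nutch 1.x steps yourself (a 
sketch; the crawl/site_1 paths are placeholders):

  # one generate/fetch/parse/updatedb cycle for a single site
  bin/nutch generate crawl/site_1/crawldb crawl/site_1/segments -topN 50000
  SEGMENT=`ls -d crawl/site_1/segments/* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/site_1/crawldb $SEGMENT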


Initially you could start with depth 1 to create the crawldb for site k 
(1 <= k <= n) and inject the start URLs you provide. You need to do this for 
all 50 sites!

You can then pass each of these 50 directories to the crawl command as 
DIRECTORY_FOR_SITE_k.
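
Equivalently, you can inject the seeds directly instead of running a depth-1 
crawl. A sketch, assuming one seed directory per site (seeds/site_1 ... 
seeds/site_50 are placeholder names, each containing a text file with the 
start URL):

  # create a separate crawldb per site by injecting its seed list
  for k in `seq 1 50`; do
    bin/nutch inject crawl/site_$k/crawldb seeds/site_$k
  done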


Each time Nutch is started it will check the crawldb for unfetched links and 
fetch them. It will not keep looping on its own until nothing is left; you 
have to provide a depth (loop count) large enough for the whole site to be 
visited.

Also keep in mind that most sites do not change at fixed intervals but 
continuously, and changes are only noticed when the corresponding pages are 
re-fetched. So this is really a crawl/re-crawl process: you can configure the 
interval after which already fetched pages are visited again to check for 
changes.
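
This is where the AdaptiveFetchSchedule from your original mail comes in: the 
re-fetch interval is configured in conf/nutch-site.xml. A sketch (the two-day 
value of 172800 seconds is only an example):

  <property>
    <name>db.fetch.interval.default</name>
    <value>172800</value><!-- re-fetch pages after 2 days (value in seconds) -->
  </property>
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>

Note that the schedule only marks pages as due for re-fetching; you still have 
to start the crawl again (e.g. from cron) so that the due pages are actually 
fetched.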



I think you will have to start the same crawl job for each site, each time 
with a different site directory, and you should do it in a way that ensures 
all your pages get fetched.

I can imagine starting Nutch from a script that runs a few loops for one site 
after the other; how many loops you need depends on the number of pages hosted 
on your sites.

Alternatively you are free to start 50 Nutch instances in parallel, one per 
site.
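
A minimal wrapper script for the sequential variant could look like this 
(depth, topN and the every-two-days scheduling via cron are assumptions you 
would adapt to your sites):

  #!/bin/bash
  # re-crawl all 50 sites, one after the other
  for k in `seq 1 50`; do
    bin/nutch crawl seeds/site_$k -dir crawl/site_$k -depth 5 -topN 50000
  done

Started from a cron entry every two days, this gives you the automatic 
re-crawling you asked about without maintaining 50 separate scripts.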



Hope this helps, Walter






On 18.04.2013 18:46, mesenthil1 wrote:
> Hi,
> 
> Can someone please explain how the following scenario works?
> 
> I need to crawl a site with 50K URLs.  This site is dynamic and will
> have frequent updates. Assuming it takes 2 days to completely
> crawl this site, can we have some configuration (fetch schedule or something
> else) so that once the crawl cycle is complete, the next crawl cycle will
> start automatically after two days to find the new URLs. If this feature is
> not available, should we manually control the repeated crawling of the site
> through some sort of scripting?
> 
> Actually we will have more than 50 sites to be crawled separately.
> If we need to maintain re-crawling of each site, should we have 50 separate
> scripts to handle them? Please let us know if anyone has faced this
> situation.
> 
> 
> Thanks,
> Senthil
> 


-- 

--------------------------------
Walter Tietze
Senior Software Developer

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T: +49 30 246 27 318

[email protected]
http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460

Geschäftsführung
Thomas Kitlitschko
--------------------------------
