recrawl a single page explicit
Hi there, so far I have not found a way to crawl a specific page manually. Is there a way to manually set the recrawl interval or the crawl date, or any other explicit way to make Nutch invalidate a page? We have 70k+ pages in the index and a full recrawl would take too long. Thanks Jan
Re: recrawl a single page explicit
Hi, we have a similar case and we perform the following:

1. Put all URLs you want to recrawl in the regex-urlfilter.txt
2. Perform a bin/nutch mergedb with the -filter param to strip those URLs from the crawldb *
3. Put the URLs from 1 into a seed file
4. Remove the URLs from 1 from the regex-urlfilter.txt
5. Start the crawl with the seed file from 3

* This is a merge of the crawldb onto itself, for example:

bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter

I don't know whether this is the best way to do it, but since we automated it, it works very well.

Regards
Hannes
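The five steps above could be scripted roughly like this. This is only a sketch: $CRAWLFOLDER, the seed directory name, and the example URL are assumptions, and the -filter merge drops exactly those URLs that the URL filters reject, so step 1 means adding the pages as *exclude* rules.

```shell
# Sketch of the recrawl procedure; paths and URL are placeholders.
CRAWLFOLDER=crawl

# 1) Add the URLs to recrawl as exclude rules in conf/regex-urlfilter.txt,
#    e.g.:  -^http://www.example.com/page-to-recrawl$
#    (URLs rejected by the filter are removed by the -filter merge below)

# 2) Merge the crawldb onto itself with -filter to strip those URLs,
#    then swap the filtered db back into place:
bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter
rm -r $CRAWLFOLDER/crawldb
mv $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb

# 3) Put the same URLs into a seed file:
mkdir -p recrawl-seeds
echo "http://www.example.com/page-to-recrawl" > recrawl-seeds/seeds.txt

# 4) Remove the exclude rules added in step 1 from conf/regex-urlfilter.txt
#    (otherwise the seeds themselves would be filtered out).

# 5) Start a crawl with the seed directory:
bin/nutch crawl recrawl-seeds -dir $CRAWLFOLDER -depth 1
```

Since the stripped URLs are now unknown to the crawldb, injecting them as seeds makes Nutch fetch them immediately instead of waiting for their scheduled recrawl time.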
Re: recrawl a single page explicit
The FreeGenerator tool is the easiest approach.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
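FreeGenerator builds a fetch segment directly from a list of URLs, bypassing the crawldb's scheduling, which avoids the filter-and-merge dance entirely. A rough sketch, where the directory names and URL are assumptions:

```shell
# Sketch of a single-page recrawl via FreeGenerator; paths are placeholders.
mkdir -p recrawl-urls
echo "http://www.example.com/page-to-recrawl" > recrawl-urls/urls.txt

# Generate a fetch segment straight from the URL list:
bin/nutch freegen recrawl-urls crawl/segments

# Fetch and parse the newest segment, then fold results into the crawldb:
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
```

The updatedb step at the end refreshes the fetch time and signature for the page in the crawldb, so the regular crawl cycle stays consistent.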