recrawl a single page explicit

2012-04-02 Thread Jan Riewe
Hi there, so far I have not found a way to crawl a specific page manually. Is there a way to manually set the recrawl interval or the crawl date, or any other explicit way to make Nutch invalidate a page? We have 70k+ pages in the index and a full recrawl would take too long. Thanks, Jan

Re: recrawl a single page explicit

2012-04-02 Thread Hannes Carl Meyer
Hi, we have a similar case and do the following:
1. Put all URLs you want to recrawl in regex-urlfilter.txt.
2. Run bin/nutch mergedb with the -filter param to strip those URLs from the crawl-db. *
3. Put the URLs from step 1 into a seed file.
4. Remove the URLs from step 1 from the
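A minimal sketch of steps 1 and 2 above, not the poster's actual script. It assumes the URLs go into regex-urlfilter.txt as exclusion rules (a leading `-` followed by a regex) and that those rules are placed before any catch-all `+.` line, since the first matching rule decides. The helper name `url_to_rule`, the example URL, and the crawl-db paths are hypothetical. Step 4 is cut off in the message above and is not covered here.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: turn one URL into an exclusion rule for
# regex-urlfilter.txt by escaping regex metacharacters and anchoring it.
url_to_rule() {
  printf '%s\n' "$1" \
    | sed -e 's/[][\.*^$+?(){}|]/\\&/g' -e 's/^/-^/' -e 's/$/$/'
}

# Step 1: print the rule to add near the top of conf/regex-urlfilter.txt.
url_to_rule "http://www.example.com/changed-page.html?id=1"

# Step 2 (shown, not executed): write a filtered copy of the crawl-db.
# Both paths are placeholders for your own layout.
echo "bin/nutch mergedb crawl/crawldb-stripped crawl/crawldb -filter"
```

Writing the filtered crawl-db to a new directory keeps the original untouched until the result has been checked, for example with bin/nutch readdb.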

Re: recrawl a single page explicit

2012-04-02 Thread Markus Jelsma
The FreeGenerator tool is the easiest approach.

On Mon, 2 Apr 2012 11:29:02 +0200, Hannes Carl Meyer hannesc...@googlemail.com wrote:
> Hi, we have kind of a similar case and we perform the following: 1 put all URLs you want to recrawl in the regex-urlfilter.txt 2 perform a bin/nutch mergedb
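For illustration, a FreeGenerator-based recrawl might look like the sketch below. It assumes a Nutch 1.x layout on the local filesystem; the paths, the seed directory, the example URL, and the `run`/`recrawl` helpers are placeholders, not anything stated in the thread. By default the script only prints the commands (DRY_RUN=1), so it can be checked before anything touches the crawl-db.

```shell
#!/bin/sh
set -eu

# All of these defaults are assumptions; adjust to your installation.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
SEED_DIR="${SEED_DIR:-recrawl-seeds}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recrawl() {
  # Write the URLs to a plain-text seed file, one per line.
  mkdir -p "$SEED_DIR"
  printf '%s\n' "$@" > "$SEED_DIR/urls.txt"

  # Generate a segment straight from the seed file, bypassing the
  # crawl-db fetch schedule.
  run "$NUTCH" freegen "$SEED_DIR" "$SEGMENTS"

  # Pick the newest segment (segment names are timestamps, so they sort).
  if [ "$DRY_RUN" = "1" ]; then
    segment="$SEGMENTS/<new-segment>"
  else
    segment="$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -n 1)"
  fi

  run "$NUTCH" fetch "$segment"
  run "$NUTCH" parse "$segment"
  run "$NUTCH" updatedb "$CRAWLDB" "$segment"
}

recrawl "http://www.example.com/changed-page.html"
```

Run with DRY_RUN=0 to execute the commands. Indexing the refreshed segment is left out because the indexing command depends on the Nutch version and the search backend in use.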