Hi there,
so far I have not found a way to crawl a specific page manually.
Is there a way to manually set the recrawl interval or the crawl
date, or any other explicit way to make Nutch invalidate a page?
We have 70k+ pages in the index and a full recrawl would take too
long.
Thanks
Jan
Hi,
we have a somewhat similar case and we do the following:
1. put all URLs you want to recrawl into regex-urlfilter.txt
2. run bin/nutch mergedb with the -filter param to strip those URLs from
the crawldb *
3. put the URLs from 1 into a seed file
4. remove the URLs from 1 from the
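The steps above can be sketched roughly as follows; the paths, the example URL, and the final inject step are my assumptions (the original message is cut off before the last steps), so adjust them to your setup:

```shell
# 1) In conf/regex-urlfilter.txt, add exclude patterns for the URLs to
#    recrawl, so the filters reject them, e.g.:
#    -^http://www\.example\.com/page-to-recrawl\.html$
# 2) Rebuild the crawldb through the URL filters; -filter drops every
#    entry the filters reject (mergedb writes to a new output crawldb):
bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter
# 3) Put the same URLs in a seed directory:
mkdir -p recrawl-seeds
echo "http://www.example.com/page-to-recrawl.html" > recrawl-seeds/urls.txt
# 4) (Assumed continuation) remove the exclude patterns again, then
#    inject the seeds so the pages re-enter the db as fresh entries:
bin/nutch inject crawl/crawldb-filtered recrawl-seeds
```

The next generate/fetch/parse/updatedb cycle will then pick those pages up like newly discovered URLs.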
The FreeGenerator tool is the easiest approach.
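A rough sketch of the FreeGenerator route, with example paths and an example URL of my own choosing: freegen builds a fetch segment directly from a plain-text URL list, bypassing the crawldb's fetch scheduling entirely, so no db surgery is needed.

```shell
# Put the URLs to recrawl in a plain-text file (one per line):
mkdir -p urls
echo "http://www.example.com/page-to-recrawl.html" > urls/seeds.txt
# Generate a segment straight from that list:
bin/nutch freegen urls crawl/segments
# Then run the rest of a normal cycle on the new segment:
SEGMENT=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
```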
On Mon, 2 Apr 2012 11:29:02 +0200, Hannes Carl Meyer
hannesc...@googlemail.com wrote: