recrawl a single page explicitly

2012-04-02 Thread Jan Riewe
Hi there,

Until now I have not found a way to crawl a specific page manually.
Is there a possibility to manually set the recrawl interval or the crawl
date, or any other explicit way to make Nutch invalidate a page?

We have 70k+ pages in the index and a full recrawl would take too
long.

Thanks 
Jan


Re: recrawl a single page explicitly

2012-04-02 Thread Hannes Carl Meyer
Hi,

we have a similar case and perform the following (a rough shell sketch
follows after the list):

1. Add exclusion rules for all URLs you want to recrawl to regex-urlfilter.txt.
2. Perform a bin/nutch mergedb with the -filter param to strip those URLs from
   the crawldb. *
3. Put the URLs from step 1 into a seed file.
4. Remove the URLs from step 1 from regex-urlfilter.txt.
5. Start the crawl with the seed file from step 3.

* This is a merge of the crawldb onto itself, for example: bin/nutch mergedb
$CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter

I don't know whether this is the best way to do it, but since we automated it,
it works very well.
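
For reference, a rough sketch of that sequence as shell commands. The
$CRAWLFOLDER variable comes from the example above; the seeds/ directory,
the mv swap of the merged db, and the single generate/fetch cycle are
assumptions for illustration, not something stock Nutch prescribes:

  # Steps 1 + 4: regex-urlfilter.txt temporarily carries '-' (deny) rules
  # matching the pages to recrawl, e.g. -^http://www.example.com/page.html$
  # Step 2: merge the crawldb onto itself with filtering enabled; URLs
  # rejected by the active filters are dropped from the output db
  bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter
  mv $CRAWLFOLDER/crawldb $CRAWLFOLDER/crawldb.old
  mv $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb
  # Steps 3 + 5: with regex-urlfilter.txt restored, re-inject the same
  # URLs from a seed directory and run one fetch cycle over them
  bin/nutch inject $CRAWLFOLDER/crawldb seeds/
  bin/nutch generate $CRAWLFOLDER/crawldb $CRAWLFOLDER/segments
  SEGMENT=$(ls -d $CRAWLFOLDER/segments/* | tail -1)
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWLFOLDER/crawldb $SEGMENT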

Regards

Hannes



Re: recrawl a single page explicitly

2012-04-02 Thread Markus Jelsma

The FreeGenerator tool is the easiest approach.
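
For context, FreeGenerator (bin/nutch freegen) builds a fetch segment
directly from a plain list of URLs, skipping the crawldb's fetch schedule
entirely, so a single page can be refetched on demand. A rough sketch
follows; the urls/ and crawl/ directory names are assumptions for
illustration:

  # urls/ holds a text file listing the pages to recrawl, one per line
  bin/nutch freegen urls/ crawl/segments
  # fetch and parse the newly generated segment, then fold the fresh
  # fetch data back into the crawldb
  SEGMENT=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT

After that, re-run your indexing step over the new segment so the refreshed
page shows up in the index.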



--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350