Hi Markus,
I am also interested in using a different regex-urlfilter for the Generate step,
because I need to continuously crawl only the homepages of 10 websites and index
all links that appear on each homepage, without crawling recursively.
I think it can be done by putting in the regex-urlfilter file for
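A rough, hypothetical sketch of what such a homepage-only regex-urlfilter file could contain (the hostnames are placeholders, not taken from the thread; in this file format a leading + accepts a URL, a leading - rejects it, and the first matching rule wins):

  # accept only the homepage URLs of the listed sites
  +^http://www\.site1\.example/$
  +^http://www\.site2\.example/$
  # ...one line per site...
  # reject everything else
  -.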
Hi,
Thanks for your response.
If I set -depth 1, this will work only for the first crawl.
But since the initial URLs are very dynamic webpages and their content changes every
hour, I need to crawl the initial URLs continuously (only the initial URLs).
Best Regards,
Marseldi
-----Original Message-----
In that case, you can generate from one database and do the db update to a
different crawldb.
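A rough sketch of how that could look with the Nutch 1.x command-line tools (the paths, the -topN value, and the name of the second crawldb are placeholders, not from the thread):

  # generate and fetch from the main crawldb, which holds only the homepages
  bin/nutch generate crawl/crawldb crawl/segments -topN 10
  s=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  # update a different crawldb with the outlinks found on the homepages,
  # so the main crawldb keeps generating only the initial URLs
  bin/nutch updatedb crawl/crawldb-links $s

The point of the separation is that newly discovered outlinks never enter the first crawldb, so each generate cycle keeps producing only the original homepage URLs.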
On Jan 15, 2011, at 10:06 AM, Marseld Dedgjonaj
marseld.dedgjo...@ikubinfo.com wrote:
Hi,
You will need to create different versions of the regex-urlfilter.txt for the
different jobs. You can have different nutch-site configs where each has a
different setting for urlfilter.regex.file, pointing to the relevant regex-
urlfilter file. Or you can just copy regex-urlfilter-JOB.txt
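For example, one of the per-job nutch-site.xml files might override the property like this (the file name regex-urlfilter-generate.txt is just a placeholder):

  <property>
    <name>urlfilter.regex.file</name>
    <value>regex-urlfilter-generate.txt</value>
  </property>

Each job can then pick up its own configuration, for instance by pointing NUTCH_CONF_DIR at a different conf directory before running that job.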
Hi to all,
Here is my problem. I want to crawl all the pages of certain domains/subdomains
(to a certain depth limit, you know) in order to detect them, but only store the
home pages of those domains. (I don't have the list of the domains.)
Is there an easy way to do this, or do I have to change the source code?