RE: How store only home page of domains but crawl all the pages to detect all different domains

2011-01-15 Thread Marseld Dedgjonaj
Hi Markus, I am also interested in using a different regex-urlfilter for the Generate step, because I need to crawl only the home pages of 10 websites continuously and index all the links found on those home pages, without crawling recursively. I think it can be done by putting in the regex-urlfilter file for

RE: How store only home page of domains but crawl all the pages to detect all different domains

2011-01-15 Thread Marseld Dedgjonaj
Hi, Thanks for your response. If I set -depth 1, this will work only for the first crawl. But since the initial URLs are very dynamic web pages whose content changes every hour, I need to crawl the initial URLs continuously (only the initial URLs). Best Regards, Marseldi

Re: How store only home page of domains but crawl all the pages to detect all different domains

2011-01-15 Thread Charan K
In that case, you can generate from one database and do the db update to a different crawl db.

Re: How store only home page of domains but crawl all the pages to detect all different domains

2011-01-13 Thread Markus Jelsma
Hi, You will need to create different versions of the regex-urlfilter.txt for the different jobs. You can have different nutch-site configs where each has a different setting for urlfilter.regex.file, pointing to the relevant regex-urlfilter file. Or you can just copy regex-urlfilter-JOB.txt
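
To illustrate why two filter files give two different behaviours, here is a minimal Python sketch of how rules in the regex-urlfilter.txt format are usually evaluated: each rule is a + (accept) or - (reject) prefix followed by a regular expression, the first rule that matches anywhere in the URL decides, and a URL matched by no rule is rejected. This is a simulation for reasoning about the rules, not Nutch code. The two rule sets (HOME_ONLY, CRAWL_ALL) and the function names are hypothetical examples, not taken from this thread.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter style rules into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule wins; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

# Hypothetical filter for the job that should keep only home pages:
# accept scheme://host or scheme://host/, reject everything else.
HOME_ONLY = r"""
+^https?://[^/]+/?$
-.
"""

# Hypothetical filter for the job that follows every page:
# reject a few static file types, accept the rest.
CRAWL_ALL = r"""
-\.(gif|jpg|png|css|js)$
+.
"""

if __name__ == "__main__":
    home = parse_rules(HOME_ONLY)
    everything = parse_rules(CRAWL_ALL)
    for url in ("http://example.com/", "http://example.com/news/1.html"):
        print(url, accepts(home, url), accepts(everything, url))
```

Each rule set would live in its own file, with the job's configuration pointing urlfilter.regex.file at the file that job should use.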

How store only home page of domains but crawl all the pages to detect all different domains

2011-01-12 Thread Asier Martínez
Hi to all, here is my problem. I want to crawl all the pages (up to a certain depth limit, you know) of certain domains/subdomains in order to detect them, but store only the home pages of the domains. (I don't have the list of the domains.) Is there an easy way to do this, or do I have to change the source code