subject:"How store only home page of domains but crawl all the pages to detect all different domains"

RE: How store only home page of domains but crawl all the pages to detect all different domains

2011-01-15 Thread Marseld Dedgjonaj

Hi Markus, I am also interested in using a different regex-urlfilter for the Generate step, because I need to crawl only the homepages of 10 websites continuously and index all the links found on those homepages, without crawling recursively. I think it can be done by putting in regex-urlfilter file for

RE: How store only home page of domains but crawl all the pages to detect all different domains

2011-01-15 Thread Marseld Dedgjonaj

Hi, Thanks for your response. If I set -depth 1, this will work only for the first crawl. But since the initial URLs are very dynamic web pages whose content changes every hour, I need to crawl the initial URLs (and only the initial URLs) continuously. Best Regards, Marseldi

Re: How store only home page of domains but crawl all the pages to detect all different domains

2011-01-15 Thread Charan K

In that case, you can generate from one crawl db and run the db update against a different crawl db. On Jan 15, 2011, at 10:06 AM, Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com wrote: Hi, Thanks for your response. If I set -depth 1, this will function only for the first crawl. But since initial
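One way to read the suggestion above is as a generate / fetch / updatedb round in which the fetch list always comes from a seed-only crawl db while discovered links are written to a second crawl db. The sketch below only builds the command lines and does not run them. The `generate`, `fetch` and `updatedb` subcommands are the Nutch 1.x ones; the directory names and the `one_round` helper are hypothetical, and the segment name is assumed to be known (in practice `generate` creates a timestamped directory under the segments directory). A separate `parse` step may also be needed, depending on the fetcher configuration.

```python
def one_round(seed_db, link_db, segments_dir, segment):
    """Build the commands for one crawl round (hypothetical helper).

    seed_db holds only the initial URLs and is never updated, so every
    round regenerates the same fetch list. link_db receives the fetched
    pages and their outlinks.
    """
    return [
        # Select URLs to fetch from the seed-only crawl db.
        ["bin/nutch", "generate", seed_db, segments_dir],
        # Fetch the segment that generate created.
        ["bin/nutch", "fetch", segment],
        # Write results into the other crawl db, leaving seed_db untouched.
        ["bin/nutch", "updatedb", link_db, segment],
    ]


if __name__ == "__main__":
    # All paths below are made-up examples.
    for cmd in one_round("crawl/seeddb", "crawl/linkdb2",
                         "crawl/segments", "crawl/segments/20110115120000"):
        print(" ".join(cmd))
```

Whether the seed pages are actually refetched on every round also depends on the fetch interval settings, which this sketch does not cover.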

Re: How store only home page of domains but crawl all the pages to detect all different domains

2011-01-13 Thread Markus Jelsma

Hi, You will need to create different versions of regex-urlfilter.txt for the different jobs. You can have different nutch-site configs where each has a different setting for urlfilter.regex.file, pointing to the relevant regex-urlfilter file. Or you can just copy regex-urlfilter-JOB.txt
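To illustrate why per-job filter files help, here is a minimal Python sketch of how a regex-urlfilter style rule list is evaluated: each rule is a `+` or `-` followed by a regular expression, the first rule whose pattern is found in the URL decides, and a URL matching no rule is rejected. This is not Nutch code (Nutch uses Java regular expressions), and the two rule sets, their names, and the example URLs are assumptions made up for the illustration.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; no match means the URL is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Hypothetical rules for the generate job: home pages only.
GENERATE_RULES = load_rules(r"""
# accept scheme://host or scheme://host/ and nothing deeper
+^https?://[^/]+/?$
-.
""")

# Hypothetical rules for the update job: keep every discovered link.
UPDATE_RULES = load_rules("+.")

if __name__ == "__main__":
    for url in ("http://example.com/", "http://example.com/news/1.html"):
        print(url, accepts(GENERATE_RULES, url), accepts(UPDATE_RULES, url))
```

With rule sets like these, the update job would record every outlink while the generate job would only ever schedule home pages for fetching.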

How store only home page of domains but crawl all the pages to detect all different domains

2011-01-12 Thread Asier Martínez

Hi all, here is my problem. I want to crawl all the pages (up to a certain depth limit) of certain domains/subdomains in order to detect them, but store only the home pages of those domains. (I don't have the list of domains.) Is there an easy way to do this, or do I have to change the source code
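The "detect all domains, keep only home pages" part of the question can be sketched outside Nutch as a simple reduction over the crawled URLs: map each URL to the root URL of its host and keep one entry per host. This is only an illustration in Python, not a Nutch plugin; the `home_pages` function and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def home_pages(urls):
    """Return one root URL per distinct scheme/host/port, in first-seen order."""
    seen = set()
    result = []
    for url in urls:
        parts = urlsplit(url)
        # Skip anything that is not a web URL with a host (mailto:, etc.).
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue
        port = f":{parts.port}" if parts.port else ""
        # hostname is already lower-cased by urlsplit.
        home = f"{parts.scheme}://{parts.hostname}{port}/"
        if home not in seen:
            seen.add(home)
            result.append(home)
    return result


if __name__ == "__main__":
    crawled = [
        "http://a.example.com/x/y.html",
        "http://a.example.com/",
        "http://B.example.org/page?q=1",
        "mailto:someone@example.com",
    ]
    print(home_pages(crawled))
```

This treats every subdomain as its own site; grouping subdomains under a registered domain would need a public-suffix list, which is left out here.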


5 matches
