Re: How store only home page of domains but crawl all the pages to detect all different domains

Asier Martínez Wed, 12 Jan 2011 17:06:43 -0800

Oh, thank you Markus for your input. I have "solved" the home page thing
in my crawler in Python, but I found that Nutch is much faster
than my original crawler based on the Twisted library.
And I want to learn more :-).


I didn't know about using different URL filters for fetching, updating, etc.
Where can I change those filters?

Thank you,

2011/1/12 Markus Jelsma <[email protected]>:
> Hi,
>
> This is rather tricky. You can crawl a lot but index a little if you use
> different url filters for fetching, updating the db and indexing, so that part
> is rather easy.
>
> The question is how to define a home page in the url filters. For this website
> it's /, for another it's /home.html, another redirects to
> subdomain.domain.extension, and yet another will redirect to a language-based
> url.
>
> Cheers,
>
>> Hi all,
>> here is my problem. I want to crawl "all" (up to a certain depth limit,
>> you know) the pages of certain domains/subdomains to detect them, but
>> only store the home pages of the domains. (I don't have the list of
>> the domains.)
>> Is there an easy way to do this, or do I have to change the source code
>> of some plugin? Where can I start looking?
>>
>> Thanks in advance,
>
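
To make the home page question from the quoted reply concrete, here is a
minimal Python sketch of a path-based heuristic. It is not Nutch code and not
the approach from my crawler; the names HOME_PATHS, is_home_page and
home_pages_by_host are hypothetical, and the list of paths is only an
assumption. It does not follow redirects, so the subdomain and language
redirect cases mentioned above are not covered.

```python
from urllib.parse import urlsplit

# Hypothetical set of paths treated as a "home page"; real sites vary widely.
HOME_PATHS = {"", "/", "/index.html", "/index.htm", "/home.html"}


def is_home_page(url):
    """Return True if the URL path looks like a site root (heuristic only)."""
    parts = urlsplit(url)
    return parts.path.lower() in HOME_PATHS and not parts.query


def home_pages_by_host(urls):
    """Map each host seen in the crawl to the first home-page-like URL found."""
    found = {}
    for url in urls:
        host = urlsplit(url).hostname
        if host and is_home_page(url):
            # Keep the first match per host and ignore later ones.
            found.setdefault(host, url)
    return found
```

Every crawled URL would still be used to discover new hosts; only the URLs
that pass is_home_page would be stored.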
