Thank you, Markus, for your input. I have "solved" the home page problem in my own Python crawler, but I found that Nutch is much faster than my original crawler, which is based on the Twisted library. And I want to learn more :-).
I didn't know about using different URL filters for fetching, updating, etc. Where can I change those filters?

Thank you,

2011/1/12 Markus Jelsma <[email protected]>:
> Hi,
>
> This is rather tricky. You can crawl a lot but index a little if you use
> different url filters for fetching, updating the db and indexing so that part
> is rather easy.
>
> The question is how to define a home page in the url filters. For this website
> its /, for another its /home.html and another redirects to
> subdomain.domain.extension and even another will redirect to language based
> url.
>
> Cheers,
>
>> Hi to all,
>> here is my problem. I want to crawl "all" ( to certain depth limit,
>> you know ) the pages of certain domains/subdomains to detect them, but
>> only store the home pages of the domains.( I don't have the list of
>> the domains )
>> ¿There is a easy way to do this? or I have to change the source code
>> of some plugin? where can I start to looking?
>>
>> Thanks in advance,
>
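
For illustration, here is a minimal Python sketch of the kind of URL-based home page check discussed in the thread. It is only a heuristic and not Nutch code: the function name `is_home_page` and the `HOME_PATHS` set are hypothetical, `/index.html` is an assumed extra entry beyond the `/` and `/home.html` cases mentioned above, and it cannot detect sites whose home page is reached through a redirect to a subdomain or a language-based URL.

```python
from urllib.parse import urlsplit

# Hypothetical set of paths treated as a home page; extend per site.
# "" and "/" cover bare domains, "/home.html" is from the thread,
# "/index.html" is an assumption added for illustration.
HOME_PATHS = {"", "/", "/home.html", "/index.html"}


def is_home_page(url: str) -> bool:
    """Return True if the URL's path looks like a site's home page."""
    parts = urlsplit(url)
    # A URL carrying a query string is treated as not being a home page.
    if parts.query:
        return False
    # Only the path is compared, so the fragment is ignored.
    return parts.path.lower() in HOME_PATHS
```

A crawler could call `is_home_page(url)` just before storing a page, so that every page is still fetched and followed but only the matching ones are kept.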

