Hi,

You will need to create different versions of regex-urlfilter.txt for the different jobs. You can have different nutch-site configs, each with a different setting for urlfilter.regex.file pointing to the relevant regex-urlfilter file. Or you can simply copy regex-urlfilter-<JOB>.txt to regex-urlfilter.txt before executing that job.
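As a minimal sketch of the first option: each job-specific nutch-site.xml would override the urlfilter.regex.file property. The filter filename here is illustrative, not a standard name:

```
<!-- nutch-site.xml used for the index job (filename is an example) -->
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter-index.txt</value>
</property>
```

You would then point each job at its own conf directory (or swap files, per the second option) so the right filter is picked up.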
Cheers,

On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> Thank you Markus for your input. I had "solved" the homepage issue in
> my Python crawler, but I found that Nutch works much faster than my
> original crawler, which was based on the Twisted library. And I want
> to learn more :-).
>
> I didn't know about using different URL filters for fetching, updating,
> etc. Where can I change those filters?
>
> Thank you,
>
> 2011/1/12 Markus Jelsma <[email protected]>:
> > Hi,
> >
> > This is rather tricky. You can crawl a lot but index a little if you
> > use different URL filters for fetching, updating the db, and indexing,
> > so that part is rather easy.
> >
> > The question is how to define a home page in the URL filters. For one
> > website it's /, for another it's /home.html, another redirects to
> > subdomain.domain.extension, and yet another redirects to a
> > language-based URL.
> >
> > Cheers,
> >
> >> Hi to all,
> >> Here is my problem. I want to crawl "all" the pages of certain
> >> domains/subdomains (to a certain depth limit, you know) to detect
> >> them, but only store the home pages of those domains. (I don't have
> >> the list of the domains.)
> >> Is there an easy way to do this, or do I have to change the source
> >> code of some plugin? Where can I start looking?
> >>
> >> Thanks in advance,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
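To illustrate the indexing-side filter, here is a sketch of what a home-page-only regex-urlfilter file might look like. Nutch's regex filter applies rules top to bottom: `+` accepts, `-` rejects, first match wins. The patterns below are illustrative only and, as noted above, will not catch every site's notion of a home page (redirects, language URLs, etc.):

```
# Sketch of a regex-urlfilter-index.txt that keeps only home pages.
# Accept bare domain roots:
+^https?://[^/]+/?$
# Accept a common explicit home page (example pattern):
+^https?://[^/]+/home\.html$
# Reject everything else:
-.
```

The fetch/update filters would stay permissive so the crawl still discovers links, while this stricter file is used only by the indexing job.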

