Hi Markus,

I am also interested in using a different regex-urlfilter for the Generate step, because I need to continuously crawl only the homepages of 10 websites and index all links that appear on those homepages, without crawling recursively. I think it can be done by putting only these 10 websites in the regex-urlfilter file used for generate, and using the default regex-urlfilter in the other steps (fetch, updatedb, invertlinks, index). I see you said it is possible to have different nutch-site configs for each step. How can I configure one nutch-site config file for the generate step and another for the other steps? Do I have to change code for this, or is it just a configuration trick?
Please help me with this. I really need it.

Best regards,
Marseld

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, January 13, 2011 1:51 PM
To: Asier Martínez
Cc: [email protected]
Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains

Hi,

You will need to create different versions of regex-urlfilter.txt for the different jobs. You can have different nutch-site configs where each has a different setting for urlfilter.regex.file, pointing to the relevant regex-urlfilter file. Or you can simply copy regex-urlfilter-<JOB>.txt to regex-urlfilter.txt before executing that job.

Cheers,

On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> Thank you, Markus, for your input. I have "solved" the homepage issue in my
> Python crawler, but I found that Nutch works much faster than my original
> crawler, which was based on the Twisted library. And I want to learn more :-).
>
> I didn't know about different URL filters for fetching, updating, etc.
> Where can I change those filters?
>
> Thank you,
>
> 2011/1/12 Markus Jelsma <[email protected]>:
> > Hi,
> >
> > This is rather tricky. You can crawl a lot but index a little if you use
> > different URL filters for fetching, updating the db, and indexing, so that
> > part is rather easy.
> >
> > The question is how to define a home page in the URL filters. For this
> > website it is /, for another it is /home.html, another redirects to
> > subdomain.domain.extension, and yet another will redirect to a
> > language-based URL.
> >
> > Cheers,
> >
> >> Hi to all,
> >> Here is my problem. I want to crawl "all" (to a certain depth limit,
> >> you know) the pages of certain domains/subdomains to detect them, but
> >> only store the home pages of the domains. (I don't have the list of
> >> the domains.)
> >> Is there an easy way to do this? Or do I have to change the source code
> >> of some plugin? Where can I start looking?
> >>
> >> Thanks in advance,

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

