Hi Markus,

I am also interested in using a different regex-urlfilter for the Generate step, because I need to continuously crawl only the homepages of 10 websites and index all links that appear on those homepages, without crawling recursively. I think it can be done by putting only these 10 websites in the regex-urlfilter file used for generate, and using the default regex-urlfilter in the other steps (fetch, updatedb, invertlinks, index). I see you said it is possible to have different nutch-site configs for each step. How can I configure one nutch-site config file for the generate step and another for the other steps? Do I have to change code for this, or is it just a configuration trick?
Please help me with this. I really need it.

Best regards,
Marseld

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, January 13, 2011 1:51 PM
To: Asier Martínez
Cc: [email protected]
Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains

Hi,

You will need to create different versions of regex-urlfilter.txt for the different jobs. You can have different nutch-site configs where each has a different setting for urlfilter.regex.file, pointing to the relevant regex-urlfilter file. Or you can simply copy regex-urlfilter-<JOB>.txt to regex-urlfilter.txt before executing that job.

Cheers,

On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> Thank you, Markus, for your input. I have "solved" the homepage issue in my
> Python crawler, but I found that Nutch works much faster than my original
> crawler, which was based on the Twisted library. And I want to learn more :-).
>
> I didn't know about different URL filters for fetching, updating, etc.
> Where can I change those filters?
>
> Thank you,
>
> 2011/1/12 Markus Jelsma <[email protected]>:
> > Hi,
> >
> > This is rather tricky. You can crawl a lot but index a little if you use
> > different URL filters for fetching, updating the db, and indexing, so that
> > part is rather easy.
> >
> > The question is how to define a home page in the URL filters. For this
> > website it is /, for another it is /home.html, another redirects to
> > subdomain.domain.extension, and yet another will redirect to a
> > language-based URL.
> >
> > Cheers,
> >
> >> Hi to all,
> >> Here is my problem. I want to crawl "all" (to a certain depth limit,
> >> you know) the pages of certain domains/subdomains to detect them, but
> >> only store the home pages of the domains. (I don't have the list of
> >> the domains.)
> >> Is there an easy way to do this? Or do I have to change the source code
> >> of some plugin? Where can I start looking?
> >>
> >> Thanks in advance,

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

