Can't you do this by specifying -depth 1 in the crawl command?
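(For reference, with the Nutch 1.x all-in-one crawl command that would look something like the following; the seed and output directory names are illustrative.)

```shell
# Fetch only the seed URLs, one level deep, no recursive crawling.
# 'urls' is the seed-list directory, 'crawl' the output dir (illustrative names).
bin/nutch crawl urls -dir crawl -depth 1 -topN 1000
```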
-----Original Message-----
From: Marseld Dedgjonaj <[email protected]>
To: user <[email protected]>; markus.jelsma <[email protected]>; 
'Asier Martínez' <[email protected]>
Sent: Sat, Jan 15, 2011 3:44 am
Subject: RE: How store only home page of domains but crawl all the pages to 
detect all different domains


Hi Markus,

I am also interested in using a different regex-urlfilter for the Generate step,
because I need to crawl only the homepages of 10 websites continuously and index
all links on those homepages, without crawling recursively.

I think it can be done by putting only these 10 websites in the regex-urlfilter
file for generate, and using the default regex-urlfilter in the other steps
(fetch, updatedb, invertlinks, index).

As I see, you said it is possible to have different nutch-site configs for each
step. How can I configure one nutch-site config file for the generate step and
another for the other steps? Should I change the code for this, or is it just a
configuration trick?

Please help me with this. I really need it.

Best regards,

Marseld

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Thursday, January 13, 2011 1:51 PM
To: Asier Martínez
Cc: [email protected]
Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains



Hi,

You will need to create different versions of regex-urlfilter.txt for the
different jobs. You can have different nutch-site configs, each with a
different setting for urlfilter.regex.file pointing to the relevant
regex-urlfilter file. Or you can just copy regex-urlfilter-<JOB>.txt to
regex-urlfilter.txt before executing that job.
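The copy approach can be sketched as follows; the file names and filter contents are illustrative, not taken from a real Nutch setup, and the bin/nutch invocations are commented out since they depend on your installation:

```shell
# Keep one URL filter file per job and copy the right one into place
# before running that job.
mkdir -p conf

# Strict filter for the generate step: accept only the homepage(s),
# reject everything else (one example site shown).
cat > conf/regex-urlfilter-generate.txt <<'EOF'
+^http://www\.example\.com/$
-.
EOF

# Default filter for the other steps: accept everything.
cat > conf/regex-urlfilter-default.txt <<'EOF'
+.
EOF

cp conf/regex-urlfilter-generate.txt conf/regex-urlfilter.txt
# bin/nutch generate crawl/crawldb crawl/segments   # generate sees the strict filter

cp conf/regex-urlfilter-default.txt conf/regex-urlfilter.txt
# bin/nutch fetch <segment>                         # later steps see the default filter
```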



Cheers,



On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> Oh, thank you Markus for your input. I had "solved" the homepage issue in
> my crawler in Python, but I found that Nutch works much faster than my
> original crawler based on the Twisted library. And I want to learn more :-).
> 
> I didn't know about different URL filters for fetching, updating, etc.
> Where can I change those filters?
> 
> Thank you,
> 
> 2011/1/12 Markus Jelsma <[email protected]>:

> > Hi,
> > 
> > This is rather tricky. You can crawl a lot but index a little if you use
> > different URL filters for fetching, updating the db, and indexing, so
> > that part is rather easy.
> > 
> > The question is how to define a home page in the URL filters. For this
> > website it's /, for another it's /home.html, another redirects to
> > subdomain.domain.extension, and yet another redirects to a
> > language-based URL.
> > 
> > Cheers,

> > 

> >> Hi to all,
> >> Here is my problem. I want to crawl "all" the pages of certain
> >> domains/subdomains (to a certain depth limit, you know) to detect them,
> >> but only store the home pages of the domains. (I don't have the list of
> >> the domains.)
> >> Is there an easy way to do this, or do I have to change the source code
> >> of some plugin? Where can I start looking?
> >> 
> >> Thanks in advance,



-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
