In that case, you can generate from one database and do the db update to a different crawl db.
On Jan 15, 2011, at 10:06 AM, "Marseld Dedgjonaj" <[email protected]> wrote:

> Hi,
> Thanks for your response.
> If I set -depth 1, this will work only for the first crawl.
> But since the initial URLs are very dynamic web pages and their content
> changes every hour, I need to crawl the initial URLs continuously (only
> the initial URLs).
>
> Best regards,
> Marseldi
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Saturday, January 15, 2011 6:58 PM
> To: [email protected]
> Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains
>
> Can't you do this by specifying -depth 1 in the crawl command?
>
> -----Original Message-----
> From: Marseld Dedgjonaj <[email protected]>
> To: user <[email protected]>; markus.jelsma <[email protected]>; 'Asier Martínez' <[email protected]>
> Sent: Sat, Jan 15, 2011 3:44 am
> Subject: RE: How store only home page of domains but crawl all the pages to detect all different domains
>
> > Hi Markus,
> > I am also interested in using a different regex-urlfilter for the generate
> > step, because I need to crawl only the home pages of 10 websites
> > continuously and index all the links found on those home pages, without
> > crawling recursively.
> > I think it can be done by putting only these 10 websites in the
> > regex-urlfilter file used for generate, and using the default
> > regex-urlfilter in the other steps (fetch, updatedb, invertlinks, index).
> > As you said, it is possible to have different nutch-site configs for each
> > step. How can I configure one nutch-site config file for the generate step
> > and another for the other steps?
> > Do I need to change code for this, or is it just a configuration trick?
> >
> > Please help me with this. I really need it.
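The generate-only filter Marseldi describes could be a regex-urlfilter file that admits only the target home pages. A minimal sketch (the site names are placeholders, not from this thread):

```
# regex-urlfilter-generate.txt (hypothetical): only the home pages of the
# target sites are eligible for the generate step.
+^http://www\.example1\.com/$
+^http://www\.example2\.com/$
+^http://www\.example3\.com/$
# everything else is excluded
-.
```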
> >
> > Best regards,
> > Marseld
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Thursday, January 13, 2011 1:51 PM
> > To: Asier Martínez
> > Cc: [email protected]
> > Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains
> >
> > Hi,
> >
> > You will need to create different versions of regex-urlfilter.txt for the
> > different jobs. You can have different nutch-site configs where each has a
> > different setting for urlfilter.regex.file, pointing to the relevant
> > regex-urlfilter file. Or you can just copy regex-urlfilter-<JOB>.txt to
> > regex-urlfilter.txt before executing that job.
> >
> > Cheers,
> >
> > On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> >> Oh, thank you Markus for your input. I had "solved" the home-page problem
> >> in my crawler in Python, but I found that Nutch works much faster than my
> >> original crawler based on the Twisted library.
> >> And I want to learn more :-).
> >>
> >> I didn't know about different URL filters for fetching, updating, etc.
> >> Where can I change those filters?
> >>
> >> Thank you,
> >>
> >> 2011/1/12 Markus Jelsma <[email protected]>:
> >>> Hi,
> >>>
> >>> This is rather tricky. You can crawl a lot but index a little if you use
> >>> different URL filters for fetching, updating the db, and indexing, so
> >>> that part is rather easy.
> >>>
> >>> The question is how to define a home page in the URL filters. For this
> >>> website it is /, for another it is /home.html, another redirects to
> >>> subdomain.domain.extension, and yet another will redirect to a
> >>> language-based URL.
> >>>
> >>> Cheers,
> >>>
> >>>> Hi to all,
> >>>> here is my problem.
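The "copy before each job" approach Markus mentions can be sketched as a small wrapper script. This is a sketch under assumptions: the conf layout, the `regex-urlfilter-<JOB>.txt` naming, and the `regex-urlfilter-default.txt` fallback file are all illustrative, not part of the thread.

```shell
#!/bin/sh
# Sketch: before running a Nutch job, copy the job-specific URL filter
# over the active regex-urlfilter.txt. Paths and file names are illustrative.
NUTCH_CONF="${NUTCH_CONF:-conf}"

run_job() {
  job="$1"; shift
  if [ -f "$NUTCH_CONF/regex-urlfilter-$job.txt" ]; then
    # A job-specific filter exists (e.g. regex-urlfilter-generate.txt).
    cp "$NUTCH_CONF/regex-urlfilter-$job.txt" "$NUTCH_CONF/regex-urlfilter.txt"
  else
    # Fall back to the default filter for the remaining steps.
    cp "$NUTCH_CONF/regex-urlfilter-default.txt" "$NUTCH_CONF/regex-urlfilter.txt"
  fi
  bin/nutch "$job" "$@"
}

# Usage, e.g.:
#   run_job generate crawl/crawldb crawl/segments
#   run_job fetch crawl/segments/...
```

The per-job nutch-site config with a different `urlfilter.regex.file` value avoids the copy, but the copy variant needs no extra config directories.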
> >>>> I want to crawl "all" (to a certain depth limit,
> >>>> you know) the pages of certain domains/subdomains to detect them, but
> >>>> only store the home pages of the domains. (I don't have the list of
> >>>> the domains.)
> >>>> Is there an easy way to do this, or do I have to change the source code
> >>>> of some plugin? Where can I start looking?
> >>>>
> >>>> Thanks in advance,
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
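For the index-time filter, one possible reading of "home page" is the bare host with path "/". A hedged sketch of such a filter (which, as Markus notes, will miss sites whose home page is e.g. /home.html or a redirect target):

```
# regex-urlfilter-index.txt (hypothetical): accept only root URLs at
# index time; everything deeper in the site is crawled but not stored.
+^https?://[^/]+/?$
-.
```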

