Thank you, Arkadi. I will check whether Arch satisfies my requirements.

Best regards,
Marseldi
-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Sunday, March 06, 2011 11:19 PM
To: [email protected]
Subject: RE: How to crawl fast a large site

Hello Marseld,

I think you should have a look at Arch:
http://www.atnf.csiro.au/computing/software/arch/

Arch is a free, open source extension of Nutch. Among other added features, it supports partial recrawls.

Regards,
Arkadi

>-----Original Message-----
>From: Marseld Dedgjonaj [mailto:[email protected]]
>Sent: Saturday, March 05, 2011 3:21 AM
>To: [email protected]
>Subject: How to crawl fast a large site
>
>Hello everybody,
>
>I am trying to use Nutch for searching within my site.
>
>I have configured an instance of Nutch and started it crawling the whole
>website. Now that all URLs of my site are crawled (about 150,000 URLs)
>and I only need to crawl the newest URLs (about 10-20 per hour), a crawl
>process with depth = 1 and topN = 50 takes more than 15 hours.
>
>The most time-consuming steps are merging segments and indexing.
>
>I need the newest URLs to be searchable on my website as soon as possible.
>
>I tried configuring another instance of Nutch just to pick up the latest
>articles. In this instance I injected 40 URLs that change very often; any
>new article added to the site will appear on one of these pages (the
>homepage, latest news, etc.).
>
>I set "db.fetch.interval.default" to 3600 (1 hour) so the injected pages
>are recrawled every hour and all the newest URLs are fetched.
>
>I also clear the crawldb, segments, and indexes of this instance every
>24 hours, because by then these URLs should have been crawled by the
>main instance.
>
>When a user searches, I search both instances and merge the results.
>
>My problem is: I need the second instance to fetch only the injected URLs
>and the URLs found on the injected pages, but if I run the crawl
>continually to pick up the newest URLs quickly, the crawl process crawls
>every URL it finds.
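As a side note on the setup described above: in Nutch 1.x the recrawl interval is normally set in conf/nutch-site.xml. The property name db.fetch.interval.default is the one quoted in the message; the snippet below is only a sketch of how that setting would look, not a file from the original thread.

```xml
<!-- conf/nutch-site.xml (sketch): recrawl pages every hour by default -->
<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>Default re-fetch interval in seconds (here: 1 hour).</description>
</property>
```

Values in nutch-site.xml override the defaults shipped in nutch-default.xml, so this only needs to be set in the second, fast-recrawl instance.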
>Please, any suggestion to make it possible that, when updating the
>crawldb, only the URLs that match my requirements are added?
>
>Any other suggestion would be very valuable to me.
>
>Thanks in advance and best regards,
>
>Marseldi
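One common way to get the behavior Marseldi asks for (a suggestion of mine, not something proposed in the thread) is to restrict the second instance with URL filters in conf/regex-urlfilter.txt, so that only the injected hub pages and the article URLs linked from them pass the filter and everything else is rejected before it can enter the crawldb. The domain and paths below are hypothetical placeholders, not the site's real URLs.

```
# conf/regex-urlfilter.txt (sketch) for the second, fast-recrawl instance.
# Paths are hypothetical examples; replace them with the real hub and
# article URL patterns.

# Accept the injected hub pages:
+^http://www\.example\.com/$
+^http://www\.example\.com/latest-news

# Accept article pages linked from the hubs:
+^http://www\.example\.com/articles/

# Reject everything else:
-.
```

Alternatively, setting the property db.update.additions.allowed to false in nutch-site.xml stops the updatedb step from adding any newly discovered URLs to the crawldb at all, which keeps the second instance limited to exactly the injected set.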

