Thank you, Arkadi. I will check whether Arch satisfies my requirements.

Best regards,
Marseldi
-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Sunday, March 06, 2011 11:19 PM
To: [email protected]
Subject: RE: How to crawl fast a large site

Hello Marseld,

I think you should have a look at Arch:
http://www.atnf.csiro.au/computing/software/arch/

Arch is a free, open source extension of Nutch. Among other added features, it supports partial recrawls.

Regards,
Arkadi

>-----Original Message-----
>From: Marseld Dedgjonaj [mailto:[email protected]]
>Sent: Saturday, March 05, 2011 3:21 AM
>To: [email protected]
>Subject: How to crawl fast a large site
>
>Hello everybody,
>
>I am trying to use Nutch for searching within my site.
>
>I have configured an instance of Nutch and started it crawling the whole
>website. Now that all URLs of my site are crawled (about 150,000 URLs)
>and I only need to crawl the newest URLs (about 10-20 per hour), a crawl
>process with depth = 1 and topN = 50 takes more than 15 hours.
>
>The most time-consuming steps are merging segments and indexing.
>
>I need the newest URLs to be searchable on my website as soon as possible.
>
>I tried configuring another instance of Nutch just to pick up the latest
>articles. In this instance I injected 40 URLs that change very often; any
>new article added to the site will appear on one of these pages (the
>homepage, latest news, etc.).
>
>I set "db.fetch.interval.default" to 3600 (1 hour) so the injected pages
>are recrawled every hour and all the newest URLs are fetched.
>
>I also clear the crawldb, segments, and indexes of this instance every
>24 hours, because by then these URLs should have been crawled by the
>main instance.
>
>When a user searches, I search both instances and merge the results.
>
>My problem is: I need the second instance to fetch only the injected URLs
>and the URLs found on the injected pages, but if I run the crawl
>continually to pick up the newest URLs quickly, the crawl process crawls
>every URL it finds.
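As a side note on the setup described above: in Nutch 1.x the recrawl interval is normally set in conf/nutch-site.xml. The property name db.fetch.interval.default is the one quoted in the message; the snippet below is only a sketch of how that setting would look, not a file from the original thread.

```xml
<!-- conf/nutch-site.xml (sketch): recrawl pages every hour by default -->
<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>Default re-fetch interval in seconds (here: 1 hour).</description>
</property>
```

Values in nutch-site.xml override the defaults shipped in nutch-default.xml, so this only needs to be set in the second, fast-recrawl instance.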
>Please, any suggestion to make it possible that, when updating the
>crawldb, only the URLs that match my requirements are added?
>
>Any other suggestion would be very valuable to me.
>
>Thanks in advance and best regards,
>
>Marseldi
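One common way to get the behavior Marseldi asks for (a suggestion of mine, not something proposed in the thread) is to restrict the second instance with URL filters in conf/regex-urlfilter.txt, so that only the injected hub pages and the article URLs linked from them pass the filter and everything else is rejected before it can enter the crawldb. The domain and paths below are hypothetical placeholders, not the site's real URLs.

```
# conf/regex-urlfilter.txt (sketch) for the second, fast-recrawl instance.
# Paths are hypothetical examples; replace them with the real hub and
# article URL patterns.

# Accept the injected hub pages:
+^http://www\.example\.com/$
+^http://www\.example\.com/latest-news

# Accept article pages linked from the hubs:
+^http://www\.example\.com/articles/

# Reject everything else:
-.
```

Alternatively, setting the property db.update.additions.allowed to false in nutch-site.xml stops the updatedb step from adding any newly discovered URLs to the crawldb at all, which keeps the second instance limited to exactly the injected set.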

