Hi - do you see the same URLs written to stdout when fetching? I have seen that too a few times, but in no case was the URL actually downloaded twice, nor did it appear multiple times in the segment or the CrawlDB. Markus
-----Original message-----
> From: Hussain Pirosha <[email protected]>
> Sent: Monday, 25 January 2016 14:30
> To: [email protected]
> Subject: Webpages are fetched multiple times
>
> Hello,
>
> I have been experimenting with Apache Nutch version 1.11 for a few days. My use
> case is to crawl a forum in local mode. The seed URL file contains just one entry:
>
> http://www.flyertalk.com/forum/united-airlines-mileageplus/1736400-have-simple-question-about-united-airlines-mileageplus-ask-here-2016-a.html
>
> The Nutch config is pasted at http://pasted.co/782e59ad
>
> I issue the following commands:
>
> 1. nutch generate ftalk-db/ ftalk-db/segments/ -depth 5 -topN 500
> 2. nutch fetch ftalk-db/segments/20160125154244/
>
> I am struggling to find out why Nutch keeps fetching the same page multiple
> times. Instead of getting unique web pages at the end of the crawl, I get a lot
> of duplicates.
>
> Please suggest what I am doing wrong.
>
> Thanks,
> Hussain
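For comparison, the usual Nutch 1.x crawl loop (as in the official tutorial) also runs parse and updatedb after each fetch, so the CrawlDB records which URLs have already been fetched before the next generate round. A minimal sketch, assuming the tutorial's directory layout (`crawl/crawldb`, `crawl/segments`, a `urls/` seed directory) rather than your `ftalk-db/` paths:

```shell
# One-time: inject the seed URLs into the CrawlDB
bin/nutch inject crawl/crawldb urls

# Repeat one iteration per "depth" level (generate has no -depth option;
# depth is expressed by the number of loop iterations)
for i in 1 2 3 4 5; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 500

  # Pick the segment that generate just created (newest directory)
  SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)

  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"

  # Without this step the CrawlDB never learns what was fetched,
  # and later generate rounds can select the same URLs again
  bin/nutch updatedb crawl/crawldb "$SEGMENT"
done
```

Whether this explains the duplicates you are seeing depends on your actual sequence of commands, but skipping parse/updatedb between generate rounds is a common cause of re-fetching.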

