Hi - do you see the same URLs written to stdout when fetching? I have seen that too a few times, but in no case was the URL actually downloaded twice, nor did it appear multiple times in the segment or the CrawlDB. Markus
-----Original message-----
> From: Hussain Pirosha <[email protected]>
> Sent: Monday, 25 January 2016 14:30
> To: [email protected]
> Subject: Webpages are fetched multiple times
>
> Hello,
>
> I have been experimenting with Apache Nutch version 1.11 for a few days. My use
> case is to crawl a forum in local mode. The seed URL file contains just one entry:
>
> http://www.flyertalk.com/forum/united-airlines-mileageplus/1736400-have-simple-question-about-united-airlines-mileageplus-ask-here-2016-a.html
>
> The Nutch config is pasted at http://pasted.co/782e59ad
>
> I issue the following commands:
>
> 1. nutch generate ftalk-db/ ftalk-db/segments/ -depth 5 -topN 500
> 2. nutch fetch ftalk-db/segments/20160125154244/
>
> I am struggling to find out why Nutch keeps fetching the same page multiple
> times. Instead of getting unique web pages at the end of the crawl, I get a lot
> of duplicates.
>
> Please suggest what I am doing wrong.
>
> Thanks,
> Hussain
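For comparison, the usual Nutch 1.x crawl loop (as in the official tutorial) also runs parse and updatedb after each fetch, so the CrawlDB records which URLs have already been fetched before the next generate round. A minimal sketch, assuming the tutorial's directory layout (`crawl/crawldb`, `crawl/segments`, a `urls/` seed directory) rather than your `ftalk-db/` paths:

```shell
# One-time: inject the seed URLs into the CrawlDB
bin/nutch inject crawl/crawldb urls

# Repeat one iteration per "depth" level (generate has no -depth option;
# depth is expressed by the number of loop iterations)
for i in 1 2 3 4 5; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 500

  # Pick the segment that generate just created (newest directory)
  SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)

  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"

  # Without this step the CrawlDB never learns what was fetched,
  # and later generate rounds can select the same URLs again
  bin/nutch updatedb crawl/crawldb "$SEGMENT"
done
```

Whether this explains the duplicates you are seeing depends on your actual sequence of commands, but skipping parse/updatedb between generate rounds is a common cause of re-fetching.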

