Deduplication is a mechanism where a hash is generated from the contents of one or more fields (usually the title and/or the content). It can be as simple as an MD5 hash or a fuzzier match. Nutch can deduplicate its own crawl data via the command-line option below, it can deduplicate documents you have already pushed to a Solr index, and Solr itself can be configured to deduplicate as well.
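As a language-neutral illustration of the idea (not Nutch's actual implementation), a minimal sketch of field-based hash deduplication might look like this; the field names and document shape here are assumptions for the example:

```python
import hashlib

def content_hash(doc):
    """Build a dedup key by hashing selected fields (title + content here)."""
    h = hashlib.md5()
    for field in ("title", "content"):
        h.update(doc.get(field, "").encode("utf-8"))
    return h.hexdigest()

def deduplicate(docs):
    """Keep only the first document seen for each content hash."""
    seen = {}
    for doc in docs:
        key = content_hash(doc)
        if key not in seen:
            seen[key] = doc
    return list(seen.values())

# Two documents with identical title/content but different URLs
# collapse to a single entry.
docs = [
    {"url": "http://a.example/page", "title": "Avian flu", "content": "WHO actions"},
    {"url": "http://b.example/page", "title": "Avian flu", "content": "WHO actions"},
    {"url": "http://a.example/other", "title": "Other", "content": "Different text"},
]
unique = deduplicate(docs)
```

A fuzzier match would replace `content_hash` with something like a shingle- or simhash-based signature; the exact-hash version above is the simplest case mentioned in the text.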
http://wiki.apache.org/nutch/CommandLineOptions

-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:26
To: [email protected]
Subject: RE: Duplicate URLs

Markus,

Thanks so much. Is there any link that outlines the steps to take that you can forward, or could you just explain if you can? I appreciate your help. I will keep looking online in the meantime.

Thanks
Raj

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, September 23, 2010 4:20 PM
To: [email protected]
Subject: RE: Duplicate URLs

Use deduplication.

-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:12
To: [email protected]
Subject: Duplicate URLs

All,

I just wanted to see if there is a way we can tell Nutch to treat the following URLs as the same:

http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm
http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm

As you know, you can set up web servers such that both of the URLs above resolve to the same endpoint. In other words, the two URLs are actually *the same* even though they are physically different. Is there any way I can tell Nutch to treat these URLs as the same? I cannot use filtering to ignore one or the other (either with DOMAINNAME or without) because I need to allow both patterns for genuine URLs.

Thanks
Raj
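Besides deduplication, the situation in the original question can also be addressed by URL normalization, i.e. rewriting the short hostname to its fully qualified form before URLs are compared (in Nutch this kind of rewriting is typically done with a URL normalizer plugin and regex rules). A minimal, hypothetical sketch of the idea, using the placeholder hostnames from the thread:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from a short hostname to its canonical fully
# qualified form; SITENAME/DOMAINNAME are the placeholders used in
# the original question, not real hosts.
CANONICAL_HOSTS = {
    "sitename": "sitename.domainname.com",
}

def normalize_url(url):
    """Rewrite known short hostnames to their canonical form, so both
    variants of a URL produce the same key before fetching/indexing."""
    parts = urlsplit(url)
    host = CANONICAL_HOSTS.get(parts.hostname, parts.hostname)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

# Both spellings from the thread normalize to the same URL.
a = normalize_url("http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm")
b = normalize_url("http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm")
```

With normalization in place, the two spellings collapse to one crawl entry up front, whereas deduplication removes the duplicate after the fact; either approach (or both) can work depending on where the duplicates hurt.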

