bin/nutch solrdedup
Usage: SolrDeleteDuplicates <solr url>
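As a sketch of the workflow discussed below: index the crawl segments into Solr, then run the dedup job against the same endpoint. The Solr URL here is an assumed placeholder; substitute your own.

```shell
# Index crawl segments into Solr, then remove duplicates.
# http://localhost:8983/solr is an assumed example endpoint.
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch solrdedup http://localhost:8983/solr
```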
You could also handle deduplication in your Solr configuration. It exposes more options and lets you mark duplicates (documents with identical signatures) or overwrite them (deduplicate).
http://wiki.apache.org/solr/Deduplication

-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:48
To: [email protected]
Subject: RE: Duplicate URLs

Thanks again. One final question. I do not create a Nutch index. I just push the crawl segments to Solr using the following command line:

bin/nutch solrindex $solr_endpoint crawl/crawldb crawl/linkdb crawl/segments/*

Do I need to create a Nutch index to get the dedup going? I ask because I saw an online script that submits the Nutch index directory to the dedup command. Can I just pass in the segments directory (as shown in the document from the link you sent) without having to build the Nutch index? I am going to try both ways in the meantime.

Thanks so much again
Raj

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, September 23, 2010 4:33 PM
To: [email protected]
Subject: RE: Duplicate URLs

Deduplication is a mechanism where a hash is generated from the contents of some field (usually title and/or content). It can be as simple as an MD5 hash or a fuzzier match. Nutch can deduplicate its own index with that command line option, it can deduplicate whatever you pushed to a Solr index, and you can configure Solr itself to deduplicate as well.
http://wiki.apache.org/nutch/CommandLineOptions

-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:26
To: [email protected]
Subject: RE: Duplicate URLs

Markus,
Thanks so much. Is there a link that outlines the steps to take that you could forward, or could you just explain them? I appreciate your help. I will keep looking online in the meantime.

Thanks
Raj

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, September 23, 2010 4:20 PM
To: [email protected]
Subject: RE: Duplicate URLs

Use deduplication.

-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:12
To: [email protected]
Subject: Duplicate URLs

All,
I just wanted to see if there is a way we can tell Nutch to treat the following URLs as the same:

http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm
http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm

As you know, you can set up web servers such that both of the URLs above resolve to the same endpoint. In other words, the two URLs are actually the *same* even though they are physically different. Is there any way I can tell Nutch to treat these URLs as the same? I cannot use filtering to ignore one or the other (either with DOMAINNAME or without), because I need to allow both patterns for genuine URLs.

Thanks
Raj
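On the question in the original message: Nutch's urlnormalizer-regex plugin can rewrite one form of a URL into the other before URLs are compared, so both variants collapse to a single entry in the crawldb. A sketch of a rule for conf/regex-normalize.xml, assuming SITENAME and DOMAINNAME stand in for the real hostnames and that the plugin is enabled via plugin.includes:

```xml
<!-- conf/regex-normalize.xml: rewrite the short hostname to the fully
     qualified one so both URL forms normalize to the same string.
     SITENAME and DOMAINNAME are placeholders from the thread. -->
<regex-normalize>
  <regex>
    <pattern>^http://SITENAME/</pattern>
    <substitution>http://SITENAME.DOMAINNAME.com/</substitution>
  </regex>
</regex-normalize>
```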

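For the Solr-side deduplication mentioned at the top of the thread, the linked wiki page describes adding a SignatureUpdateProcessorFactory to an update processor chain in solrconfig.xml. A minimal sketch, where the field names (title, content, signature) are assumptions to be adjusted to your schema:

```xml
<!-- solrconfig.xml: compute a signature over selected fields at index time. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- true overwrites duplicates; false only marks them -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">title,content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be referenced from the update request handler so it runs on incoming documents.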

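As a minimal illustration of the signature idea Markus describes (not Nutch's actual implementation): hash the contents of the chosen fields, so documents with identical field contents get identical signatures and can be marked or dropped as duplicates.

```python
import hashlib

def signature(title: str, content: str) -> str:
    """Compute an MD5 signature over the fields used for deduplication."""
    data = (title + "\n" + content).encode("utf-8")
    return hashlib.md5(data).hexdigest()

# Two pages served under different hostnames but with identical content
# produce the same signature, so one can be treated as a duplicate.
a = signature("WHO Recommended Actions", "Avian flu briefing text")
b = signature("WHO Recommended Actions", "Avian flu briefing text")
c = signature("WHO Recommended Actions", "Different body text")

print(a == b)  # True: identical fields, identical signature
print(a == c)  # False: different content, different signature
```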