bin/nutch solrdedup
Usage: SolrDeleteDuplicates <solr url>
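As a sketch of the workflow discussed below: index the crawl segments into Solr, then run the dedup job against the same endpoint. The Solr URL here is an assumed placeholder; substitute your own.

```shell
# Index crawl segments into Solr, then remove duplicates.
# http://localhost:8983/solr is an assumed example endpoint.
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch solrdedup http://localhost:8983/solr
```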
You could also handle deduplication in your Solr configuration. It exposes more options and lets you mark duplicates (documents with identical signatures) or overwrite them (deduplicate).
http://wiki.apache.org/solr/Deduplication

-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:48
To: [email protected]
Subject: RE: Duplicate URLs

Thanks again. One final question. I do not create a Nutch index. I just push the crawl segments to Solr using the following command line:

bin/nutch solrindex $solr_endpoint crawl/crawldb crawl/linkdb crawl/segments/*

Do I need to create a Nutch index to get the dedup going? I ask because I saw an online script that submits the Nutch index directory to the dedup command. Can I just pass in the segments directory (as shown in the document from the link you sent) without having to build the Nutch index? I am going to try both ways in the meantime.

Thanks so much again
Raj

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, September 23, 2010 4:33 PM
To: [email protected]
Subject: RE: Duplicate URLs

Deduplication is a mechanism where a hash is generated from the contents of some field (usually title and/or content). It can be as simple as an MD5 hash or a fuzzier match. Nutch can deduplicate its own index with that command line option, it can deduplicate whatever you pushed to a Solr index, and you can configure Solr itself to deduplicate as well.
http://wiki.apache.org/nutch/CommandLineOptions

-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:26
To: [email protected]
Subject: RE: Duplicate URLs

Markus,
Thanks so much. Is there a link that outlines the steps to take that you could forward, or could you just explain them? I appreciate your help. I will keep looking online in the meantime.

Thanks
Raj

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, September 23, 2010 4:20 PM
To: [email protected]
Subject: RE: Duplicate URLs

Use deduplication.

-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:12
To: [email protected]
Subject: Duplicate URLs

All,
I just wanted to see if there is a way we can tell Nutch to treat the following URLs as the same:

http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm
http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm

As you know, you can set up web servers such that both of the URLs above resolve to the same endpoint. In other words, the two URLs are actually the *same* even though they are physically different. Is there any way I can tell Nutch to treat these URLs as the same? I cannot use filtering to ignore one or the other (either with DOMAINNAME or without), because I need to allow both patterns for genuine URLs.

Thanks
Raj
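On the question in the original message: Nutch's urlnormalizer-regex plugin can rewrite one form of a URL into the other before URLs are compared, so both variants collapse to a single entry in the crawldb. A sketch of a rule for conf/regex-normalize.xml, assuming SITENAME and DOMAINNAME stand in for the real hostnames and that the plugin is enabled via plugin.includes:

```xml
<!-- conf/regex-normalize.xml: rewrite the short hostname to the fully
     qualified one so both URL forms normalize to the same string.
     SITENAME and DOMAINNAME are placeholders from the thread. -->
<regex-normalize>
  <regex>
    <pattern>^http://SITENAME/</pattern>
    <substitution>http://SITENAME.DOMAINNAME.com/</substitution>
  </regex>
</regex-normalize>
```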

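For the Solr-side deduplication mentioned at the top of the thread, the linked wiki page describes adding a SignatureUpdateProcessorFactory to an update processor chain in solrconfig.xml. A minimal sketch, where the field names (title, content, signature) are assumptions to be adjusted to your schema:

```xml
<!-- solrconfig.xml: compute a signature over selected fields at index time. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- true overwrites duplicates; false only marks them -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">title,content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be referenced from the update request handler so it runs on incoming documents.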

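As a minimal illustration of the signature idea Markus describes (not Nutch's actual implementation): hash the contents of the chosen fields, so documents with identical field contents get identical signatures and can be marked or dropped as duplicates.

```python
import hashlib

def signature(title: str, content: str) -> str:
    """Compute an MD5 signature over the fields used for deduplication."""
    data = (title + "\n" + content).encode("utf-8")
    return hashlib.md5(data).hexdigest()

# Two pages served under different hostnames but with identical content
# produce the same signature, so one can be treated as a duplicate.
a = signature("WHO Recommended Actions", "Avian flu briefing text")
b = signature("WHO Recommended Actions", "Avian flu briefing text")
c = signature("WHO Recommended Actions", "Different body text")

print(a == b)  # True: identical fields, identical signature
print(a == c)  # False: different content, different signature
```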