Deduplication is a mechanism where a hash is generated from the contents of 
some field (usually title and/or content). It can be as simple as an MD5 hash 
or a fuzzier match. Nutch can deduplicate its own crawl data via the command 
line option below. You can also use Nutch to deduplicate whatever you pushed 
to a Solr index, and you can configure Solr to deduplicate as well.
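To illustrate the hash-based mechanism described above, here is a minimal sketch in Python. It is not Nutch's actual code: the `signature` helper is illustrative, and the exact-match MD5 digest only catches byte-identical pages (a fuzzier signature would tolerate small differences). The URLs are the two from this thread.

```python
import hashlib

def signature(title, content):
    # Exact-match signature: MD5 over title + content.
    return hashlib.md5((title + "\n" + content).encode("utf-8")).hexdigest()

# Two URLs that serve the same page yield the same signature,
# so the second one is flagged as a duplicate.
docs = [
    ("http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm",
     "WHO recommended actions", "page body"),
    ("http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm",
     "WHO recommended actions", "page body"),
]

seen = {}        # signature -> first URL seen with it
duplicates = []  # URLs whose content was already indexed
for url, title, content in docs:
    sig = signature(title, content)
    if sig in seen:
        duplicates.append(url)
    else:
        seen[sig] = url

print(duplicates)  # the second URL is reported as a duplicate
```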
http://wiki.apache.org/nutch/CommandLineOptions
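For the Solr-side option mentioned above, deduplication is configured with a SignatureUpdateProcessorFactory update chain in solrconfig.xml. A rough sketch, where the signature field name and the fields list are placeholders you would adapt to your own schema:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <!-- Fields the signature is computed from -->
    <str name="fields">title,content</str>
    <!-- Lookup3Signature is exact; TextProfileSignature is fuzzier -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```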
-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:26
To: [email protected]; 
Subject: RE: Duplicate URLs

Markus,

Thanks so much.
If you can forward a link that outlines the steps to take, or just explain 
them, that would be great.  I appreciate your help.  I will keep looking 
online in the meantime.

Thanks
Raj


-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Thursday, September 23, 2010 4:20 PM
To: [email protected]
Subject: RE: Duplicate URLs

Use deduplication. 
 
-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:12
To: [email protected]; 
Subject: Duplicate URLs

All,



I just wanted to see if there is a way we can tell Nutch to treat the
following URLs as the same.  





http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm



http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm





As you know, you can set up web servers such that both of the URLs above
resolve to the same end point.  In other words, the two URLs are actually
the *same* even though they are physically different.  Is there any way I
can tell Nutch to treat these URLs as the same?

I cannot use filtering to ignore one pattern or the other (either with
DOMAINNAME or without) because both patterns also include genuine URLs
that I need to allow.



Thanks

Raj




