Use deduplication. 
 
-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:12
To: [email protected]; 
Subject: Duplicate URLs

All,



I just wanted to see if there is way we can tell Nutch to treat the
following URLs as same.  





http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec
_action.htm



http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm





As you know you can set up web servers such that both the URLs above
resolve to the same end point.  In other words the two URLs are actually
*same* even though they are physically different.  Is there anyway I can
tell NUTCH to treat these URLs as same?

I cannot use to filtering to ignore one or the other (wither with
DOMAINNAME or without) because I need to allow both patterns to allow
genuine URLs.



Thanks

Raj





Reply via email to