Use deduplication.

-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:12
To: [email protected];
Subject: Duplicate URLs

All,

I just wanted to see if there is a way we can tell Nutch to treat the following URLs as the same:

http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm
http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm

As you know, you can set up web servers such that both of the URLs above resolve to the same endpoint. In other words, the two URLs are actually the *same* even though they are physically different. Is there any way I can tell Nutch to treat these URLs as the same? I cannot use filtering to ignore one or the other (either with DOMAINNAME or without) because I need to allow both patterns in order to admit genuine URLs.

Thanks
Raj
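
For illustration only, the idea behind content-based deduplication can be sketched in a few lines of Python. This is not Nutch code: the function names (content_digest, deduplicate) and the sample page bodies are hypothetical, and the sketch simply assumes that two pages count as duplicates when the MD5 digest of their fetched content is identical, whatever their URLs look like.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Return a hex MD5 digest of the raw page content."""
    return hashlib.md5(body).hexdigest()


def deduplicate(pages):
    """Keep the first URL seen for each distinct content digest.

    pages: iterable of (url, body) pairs, body given as bytes.
    Returns the list of URLs that survive, in first-seen order.
    """
    kept = {}  # digest -> first URL fetched with that content
    for url, body in pages:
        digest = content_digest(body)
        if digest not in kept:
            kept[digest] = url
    return list(kept.values())


# Hypothetical crawl result: both host-name forms return the same bytes.
pages = [
    ("http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm",
     b"<html>WHO recommended actions</html>"),
    ("http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm",
     b"<html>WHO recommended actions</html>"),
    ("http://SITENAME/research/index.htm",
     b"<html>Research index</html>"),
]

unique_urls = deduplicate(pages)
```

With this approach both URL patterns can stay allowed in the URL filters: each page is still fetched under both names, and the second copy is dropped afterwards because its content matches the first.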

