All,

 

I wanted to ask whether there is a way to tell Nutch to treat the
following two URLs as the same.

 

 

http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm

 

http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm

 

 

As you know, a web server can be set up so that both of the URLs above
resolve to the same endpoint.  In other words, the two URLs point to the
*same* page even though the URL strings differ.  Is there any way I can
tell Nutch to treat these URLs as the same?

I cannot use URL filtering to ignore one form or the other (either with
DOMAINNAME or without), because I need to allow both patterns so that
genuine URLs of either form are still accepted.
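
To illustrate the kind of mapping I have in mind, here is a minimal
sketch in plain Java.  It does not use any Nutch API; the class name
HostCanonicalizer and the alias table are hypothetical, and it assumes
the short host name SITENAME should always be rewritten to
SITENAME.DOMAINNAME.com before two URLs are compared.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Map;

public class HostCanonicalizer {
    // Hypothetical alias table: short host name -> fully qualified host name.
    private static final Map<String, String> HOST_ALIASES =
            Map.of("sitename", "sitename.domainname.com");

    // Returns the URL with its host lower-cased and replaced by its
    // canonical form when an alias is known; other parts are kept.
    public static String canonicalize(String url) {
        try {
            URI uri = new URI(url);
            String host = uri.getHost();
            if (host == null) {
                return url; // no host component to rewrite
            }
            String lower = host.toLowerCase(Locale.ROOT);
            String canonical = HOST_ALIASES.getOrDefault(lower, lower);
            return new URI(uri.getScheme(), uri.getUserInfo(), canonical,
                    uri.getPort(), uri.getPath(), uri.getQuery(),
                    uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        String path = "/research/briefing_books/avian_flu/who_rec_action.htm";
        System.out.println(canonicalize("http://SITENAME" + path));
        System.out.println(canonicalize("http://SITENAME.DOMAINNAME.com" + path));
    }
}
```

With a mapping like this applied before URLs are stored or compared,
both forms would remain allowed but would collapse to a single entry.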

 

Thanks

Raj

 

 
