On 23/06/2012 12:14, Markus Jelsma wrote:
Nutch now has a HostURLNormalizer capable of normalizing source hosts to a
target host. This prevents duplication of complete websites and bad hyperlinks.
https://issues.apache.org/jira/browse/NUTCH-1319
But does that normalize subdomains to the main site within the same TLD
(sub.example.org to example.org, etc.) rather than mapping clone sites in
different TLDs to the main site?
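
To make the two cases concrete, here is a rough sketch of what I mean. It is not the NUTCH-1319 code and does not use any Nutch API; the class name HostMapSketch, its methods, and the "*." wildcard convention are all assumptions made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the NUTCH-1319 implementation.
public class HostMapSketch {
    // Source host pattern -> target host, checked in insertion order.
    // Assumed convention: a leading "*." matches any subdomain of the rest.
    private final Map<String, String> rules = new LinkedHashMap<>();

    public void addRule(String source, String target) {
        rules.put(source.toLowerCase(Locale.ROOT), target.toLowerCase(Locale.ROOT));
    }

    // Returns the target of the first matching rule, else the lowercased host.
    public String normalizeHost(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String source = rule.getKey();
            if (source.startsWith("*.")) {
                // "*.example.org" matches sub.example.org, not example.org itself.
                if (h.endsWith(source.substring(1))) {
                    return rule.getValue();
                }
            } else if (h.equals(source)) {
                return rule.getValue();
            }
        }
        return h;
    }

    public static void main(String[] args) {
        HostMapSketch map = new HostMapSketch();
        map.addRule("*.example.org", "example.org"); // case 1: subdomains, same TLD
        map.addRule("example.net", "example.org");   // case 2: clone site, other TLD
        System.out.println(map.normalizeHost("sub.example.org"));
        System.out.println(map.normalizeHost("example.net"));
        System.out.println(map.normalizeHost("unrelated.com"));
    }
}
```

The first rule is the subdomain case and the second is the clone-site case; the question is which of these the new normalizer is meant to cover.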
Regards...jmcc
--
**********************************************************
John McCormac * e-mail: [email protected]
MC2 * web: http://www.hosterstats.com/
22 Viewmount * Domain Registrations Statistics
Waterford * And Historical DNS Database.
Ireland * Over 275 Million Domains Tracked.
IE * http://www.hosterstats.com/blog
**********************************************************