Keep an eye on these open issues:
https://issues.apache.org/jira/browse/NUTCH-1324
https://issues.apache.org/jira/browse/NUTCH-1325
https://issues.apache.org/jira/browse/NUTCH-1326
They are a set of tools capable of deduplicating the various databases via the HostNormalizer. They collect information on hosts, most importantly the link score. They also collect information on duplicates within a host and then produce deduplication rules for the HostNormalizer based on the host and duplicate information. It's limited to domain because that's a larger problem in terms of resources and a bit easier to deal with.

-----Original message-----
> From: John McCormac <[email protected]>
> Sent: Sat 23-Jun-2012 15:11
> To: [email protected]
> Subject: Re: Near Duplicate Detection in nutch /Solr
>
> On 23/06/2012 13:17, Markus Jelsma wrote:
> > Hello,
> >
> > It maps anything to anything and has wildcard support:
> > *.example.com example.org
> > maps all URLs on the example.com domain to example.org.
> >
>
> Thanks.
> The main problem though is still identifying the clone/original sites so
> that the mapping can be determined.
>
> The process I use has the advantage of having the set of websites to be
> indexed predetermined and the clone/original problem is dealt with (for
> the most part) before the main indexing run. It can be a complicated
> approach depending on the number of TLDs and target countries involved.
>
> The logic behind this approach is preventing GIGO as it is easier and
> more efficient to solve the clone problem before it takes cycles and
> bandwidth in the main index run.
>
> What I have seen is that some businesses will use numbers of keyword
> type domains pointing (without a 301 redirect) to their main site.
> However the main clone pair is the ccTLD/.com version of a site (same
> domain but different TLDs). The .net and .org may also exist for older
> businesses. The non-core TLDs (biz/info/mobi/eu/asia etc) are often less
> likely to be properly set up in DNS with a working website as about 85%
> of a country's domain footprint will be concentrated on the ccTLD/.com axis.
>
> Regards...jmcc
> --
> **********************************************************
> John McCormac  * e-mail: [email protected]
> MC2            * web: http://www.hosterstats.com/
> 22 Viewmount   * Domain Registrations Statistics
> Waterford      * And Historical DNS Database.
> Ireland        * Over 275 Million Domains Tracked.
> IE             * http://www.hosterstats.com/blog
> **********************************************************
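
To make the wildcard mapping quoted above a bit more concrete, here is a minimal sketch in Python. It is not the HostNormalizer code from the NUTCH issues. The function names, the rules text, and the choice of shell-style matching via `fnmatch` are all assumptions made for illustration. In particular, whether `*.example.com` should also match the bare host `example.com` is a guess: in this sketch it does not.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules in the "pattern target" form quoted in the thread.
RULES_TEXT = """
*.example.com example.org
"""


def parse_rules(text):
    """Parse 'pattern target' lines into a list of (pattern, target) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, target = line.split()
        rules.append((pattern.lower(), target.lower()))
    return rules


def normalize_host(url, rules):
    """Rewrite the host of url using the first matching rule, if any.

    Assumption: '*' is a shell-style wildcard, so '*.example.com' matches
    'www.example.com' and 'a.b.example.com' but not 'example.com' itself.
    Any user:password part of the URL is dropped in this sketch.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    for pattern, target in rules:
        if fnmatchcase(host, pattern):
            # Keep an explicit port if the original URL had one.
            netloc = target if parts.port is None else f"{target}:{parts.port}"
            return urlunsplit(
                (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
            )
    return url  # no rule matched, leave the URL untouched


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    print(normalize_host("http://www.example.com/page?id=1", rules))
```

The first matching rule wins here, so rule order would matter once several overlapping patterns exist. How the real tool resolves such conflicts is not stated in this thread.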

