On 23/06/2012 14:21, Markus Jelsma wrote:
Keep an eye on these open issues:

https://issues.apache.org/jira/browse/NUTCH-1324
https://issues.apache.org/jira/browse/NUTCH-1325
https://issues.apache.org/jira/browse/NUTCH-1326

They are a set of tools capable of deduplicating the various databases via the 
HostNormalizer. They collect information on hosts, most importantly the link 
score. They also collect information on duplicates within a host and then 
produce deduplication rules for the HostNormalizer based on host and duplicate 
information.

It's limited to domain because that's a larger problem in terms of resources 
and a bit easier to deal with.

The HostDB patch looks interesting. (I'm still very much a novice as regards Nutch and Java.) It might be a good thing to add a DNS lookup field and an IP lookup field. Some hosters have domain graveyard IPs (and PPC parking pages) to which they point undeveloped or unrenewed domains. Recording the IP would help with the blacklisting process, because unrenewed sites could then be identified simply by IP. In DNS terms, if a domain moves to a PPC hoster (sedoparking.com, etc.) or an auction hoster (afternic.com, etc.), then it is no longer worth including in an active index.
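
To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It is not part of the HostDB patch and does not use any Nutch API; the class name ParkedDomainCheck, its methods, and the blacklist entries are all made up for illustration (the IPs are from the reserved documentation ranges, and using afternic.com as a nameserver suffix is only an assumption). It flags a host as parked if its IP is on a graveyard list or if any of its nameservers belongs to a known parking or auction hoster.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch or the HostDB patch.
class ParkedDomainCheck {

    // Assumed blacklist of "graveyard" IPs (documentation-range placeholders).
    static final Set<String> GRAVEYARD_IPS = Set.of("192.0.2.10", "198.51.100.7");

    // Assumed nameserver domains of PPC parking and auction hosters.
    static final Set<String> PARKING_NS_DOMAINS = Set.of("sedoparking.com", "afternic.com");

    // True if the IP is blacklisted or any nameserver sits under a parking domain.
    static boolean isParked(String ip, List<String> nameservers) {
        if (ip != null && GRAVEYARD_IPS.contains(ip)) {
            return true;
        }
        for (String ns : nameservers) {
            String name = ns.toLowerCase(Locale.ROOT);
            if (name.endsWith(".")) {
                // Drop the trailing dot of a fully qualified name.
                name = name.substring(0, name.length() - 1);
            }
            for (String domain : PARKING_NS_DOMAINS) {
                if (name.equals(domain) || name.endsWith("." + domain)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Resolves a host name to an IP string, or returns null if it does not resolve.
    static String resolveIp(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: ParkedDomainCheck <host> [nameserver ...]");
            return;
        }
        String ip = resolveIp(args[0]);
        List<String> nameservers = List.of(args).subList(1, args.length);
        System.out.println(args[0] + " ip=" + ip + " parked=" + isParked(ip, nameservers));
    }
}
```

The nameservers are passed in rather than looked up, since the JDK's InetAddress only resolves addresses; in practice they would come from whatever DNS lookup field the host database stores.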

Regards...jmcc
--
**********************************************************
John McCormac  *  e-mail: [email protected]
MC2            *  web: http://www.hosterstats.com/
22 Viewmount   *  Domain Registrations Statistics
Waterford      *  And Historical DNS Database.
Ireland        *  Over 275 Million Domains Tracked.
IE             *  http://www.hosterstats.com/blog
**********************************************************

