Hm, with some slight changes to the hostdb program we could collect per-protocol 
metrics for each host. A simple resolver program could then emit the proper rules 
for the normalizer. The big disadvantage is that you could not recreate the hostdb: 
the protocol statistics would be lost, the resolver would stop working, and the 
duplicates would be reintroduced.
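
Just to make the idea concrete, here is a rough sketch of such a resolver. It 
assumes the hostdb could be dumped as plain text with hypothetical per-host, 
per-protocol fetch counts ("host protocol count" per line) and that the 
normalizer takes one host/protocol pair per line -- neither format is taken 
from the actual code, it's only to illustrate the idea:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: turn per-protocol fetch counts from a plain-text
 *  hostdb dump ("host protocol count" per line, an assumed format) into
 *  per-host protocol rules for the normalizer. */
public class ProtocolRuleEmitter {
  public static void main(String[] args) throws Exception {
    Map<String, Long> http = new HashMap<>();
    Map<String, Long> https = new HashMap<>();
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 3) continue;
        long n = Long.parseLong(f[2]);
        ("https".equals(f[1]) ? https : http).merge(f[0], n, Long::sum);
      }
    }
    // Prefer http unless the host was fetched mostly over https.
    for (Map.Entry<String, Long> e : http.entrySet()) {
      long s = https.getOrDefault(e.getKey(), 0L);
      System.out.println(e.getKey() + "\t" + (s > e.getValue() ? "https" : "http"));
    }
    // Hosts seen only over https keep https.
    for (String host : https.keySet()) {
      if (!http.containsKey(host)) {
        System.out.println(host + "\thttps");
      }
    }
  }
}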

We've dealt with similar problems before, the slashes problem and the duplicate 
host problem (see the normalizers). Each time it is paramount not to lose or 
recreate the DB, because any rule-making program, or resolver, would have to 
start from scratch. I haven't found a solution for that issue; perhaps just 
never delete the DB, which is already true for the CrawlDB anyway.

Markus

 
-----Original message-----
> From:Arthur Yarwood <[email protected]>
> Sent: Saturday 5th March 2016 23:13
> To: [email protected]
> Subject: Re: http vs https duplicate fetches - host-urlnormalize?
> 
> Ah good stuff. I'll keep an eye out for that 1.12 release.
> Many thanks!
> 
> Arthur
> 
> On 05/03/2016 20:48, Sebastian Nagel wrote:
> > Hi Arthur,
> >
> > this problem has been recently discussed in
> >    https://issues.apache.org/jira/browse/NUTCH-2065
> > and addressed by urlnormalizer-protocol
> >    https://issues.apache.org/jira/browse/NUTCH-2190
> >
> > Of course, you have to decide for every host
> > which protocol shall be used.
> >
> > Cheers,
> > Sebastian
> >
> >
> > On 03/04/2016 08:50 PM, Arthur Yarwood wrote:
> >> I have recently discovered my crawl had fetched a number of sites in 
> >> duplicate - once over http, and again over https. In a similar manner one 
> >> can add a host to the host-urlnormalize file to avoid a similar issue with 
> >> www.example.com vs example.com URLs - is there a tactic to address http vs 
> >> https?
> >>
> >> Ideally always favouring http over https (for efficiency), but not 
> >> discounting https entirely, if an entire host is set up to always serve 
> >> over https. I.e. I don't really want to block all https hosts via a 
> >> regex-urlfilter.
> >>
> >> I have worked around it to some degree via specific regex-urlfilters, but 
> >> it would be nice if there was a global option, rather than having to tweak 
> >> the config every time I discover duplicate content in my crawl.
> >>
> -- 
> Arthur Yarwood
> 
> 
