Ah good stuff. I'll keep an eye out for that 1.12 release.
Many thanks!
Arthur
On 05/03/2016 20:48, Sebastian Nagel wrote:
Hi Arthur,
this problem has been recently discussed in
https://issues.apache.org/jira/browse/NUTCH-2065
and addressed by urlnormalizer-protocol
https://issues.apache.org/jira/browse/NUTCH-2190
Of course, you have to decide for each host which protocol should be used.
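If I recall the patch correctly, you list each host together with the protocol
its URLs should be normalized to, in a plain mapping file read by the plugin
(file name, property, and exact format as per NUTCH-2190 - treat the lines
below as a sketch only):

  www.example.com     http
  secure.example.org  https

URLs of www.example.com would then be rewritten to http, while
secure.example.org keeps https.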
Cheers,
Sebastian
On 03/04/2016 08:50 PM, Arthur Yarwood wrote:
I have recently discovered that my crawl had fetched a number of sites in
duplicate - once over http, and again over https. Just as one can add a host
to the host-urlnormalize file to avoid the www.example.com vs example.com
issue, is there a tactic to address http vs https? Ideally it would always
favour http over https (for efficiency), without discounting https entirely
for hosts that are set up to serve only over https - i.e. I don't really want
to block all https hosts via a regex-urlfilter.
I have worked around it to some degree via specific regex-urlfilters, but it
would be nice if there were a global option, rather than having to tweak the
config every time I discover duplicate content in my crawl.
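For reference, the per-host workaround in my regex-urlfilter.txt looks roughly
like this (patterns are illustrative; the first matching rule wins):

  # drop the https duplicate of a host that also serves over http
  -^https://www\.example\.com/
  # accept anything else
  +.

That works, but obviously doesn't scale across many hosts.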
--
Arthur Yarwood