Ah good stuff. I'll keep an eye out for that 1.12 release.
Many thanks!
Arthur
On 05/03/2016 20:48, Sebastian Nagel wrote:
Hi Arthur,
this problem has been recently discussed in
https://issues.apache.org/jira/browse/NUTCH-2065
and addressed by urlnormalizer-protocol
https://issues.apache.org/jira/browse/NUTCH-2190
Of course, you have to decide for each host which protocol should be used.
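If I recall the patch correctly, you list each host together with the protocol
its URLs should be normalized to, in a plain mapping file read by the plugin
(file name, property, and exact format as per NUTCH-2190 - treat the lines
below as a sketch only):

  www.example.com     http
  secure.example.org  https

URLs of www.example.com would then be rewritten to http, while
secure.example.org keeps https.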
Cheers,
Sebastian
On 03/04/2016 08:50 PM, Arthur Yarwood wrote:
I have recently discovered that my crawl had fetched a number of sites in
duplicate - once over http, and again over https. Just as one can add a host
to the host-urlnormalize file to avoid the www.example.com vs example.com
issue, is there a tactic to address http vs https? Ideally it would always
favour http over https (for efficiency), without discounting https entirely
for hosts that are set up to serve only over https - i.e. I don't really want
to block all https hosts via a regex-urlfilter.
I have worked around it to some degree via specific regex-urlfilters, but it
would be nice if there were a global option, rather than having to tweak the
config every time I discover duplicate content in my crawl.
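For reference, the per-host workaround in my regex-urlfilter.txt looks roughly
like this (patterns are illustrative; the first matching rule wins):

  # drop the https duplicate of a host that also serves over http
  -^https://www\.example\.com/
  # accept anything else
  +.

That works, but obviously doesn't scale across many hosts.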
--
Arthur Yarwood