Thanks Markus, I will open a ticket and submit a patch.
One follow-up question: UpdateHostDb checks for urlnormalizer-host (which can
be used to mitigate the problem I mentioned) and throws an exception if it is
enabled. Is that also an internal decision of OpenIndex that should perhaps be
removed now that the code is part of Nutch, or is there a reason this
normalizer must not be used with UpdateHostDb?

        Yossi.

> -----Original Message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: 05 March 2018 12:22
> To: user@nutch.apache.org
> Subject: RE: Why doesn't hostdb support byDomain mode?
> 
> Hi,
> 
> The reason is simple: we (the company) needed this information based on
> hostname, so we made a hostdb. I don't see any downside to supporting a
> domain mode. Adding support for it through hostdb.url.mode seems like a good
> idea.
> 
> Regards,
> Markus
> 
> -----Original message-----
> > From:Yossi Tamari <yossi.tam...@pipl.com>
> > Sent: Sunday 4th March 2018 12:01
> > To: user@nutch.apache.org
> > Subject: Why doesn't hostdb support byDomain mode?
> >
> > Hi,
> >
> > Is there a reason that hostdb provides per-host data even when
> > generate/fetch are working by domain? This generates misleading
> > statistics for servers that load-balance by redirecting to nodes (e.g.
> > photobucket).
> >
> > If this is just an oversight, I can contribute a patch, but I'm not
> > sure if I should use partition.url.mode, generate.count.mode, one of
> > the other similar properties, or create one more such property,
> > hostdb.url.mode.
> >
> > Yossi.
> >
> >
