Thanks Markus, I will open a ticket and submit a patch.
One follow-up question: UpdateHostDb checks and throws an exception if
urlnormalizer-host (which can be used to mitigate the problem I mentioned) is
enabled. Is that also an internal decision of OpenIndex, and perhaps should be
removed now that the code is part of Nutch, or is there a reason this
normalizer must not be used with UpdateHostDb?
Yossi.
> -----Original Message-----
> From: Markus Jelsma <[email protected]>
> Sent: 05 March 2018 12:22
> To: [email protected]
> Subject: RE: Why doesn't hostdb support byDomain mode?
>
> Hi,
>
> The reason is simple: we (the company) needed this information based on
> hostname, so we made a hostdb. I don't see any downside to supporting a
> domain mode. Adding support for it through hostdb.url.mode seems like a good
> idea.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Yossi Tamari <[email protected]>
> > Sent: Sunday 4th March 2018 12:01
> > To: [email protected]
> > Subject: Why doesn't hostdb support byDomain mode?
> >
> > Hi,
> >
> >
> >
> > Is there a reason that hostdb provides per-host data even when the
> > generate/fetch are working by domain? This generates misleading
> > statistics for servers that load-balance by redirecting to nodes
> > (e.g. photobucket).
> >
> > If this is just an oversight, I can contribute a patch, but I'm not
> > sure if I should use partition.url.mode, generate.count.mode, one of
> > the other similar properties, or create one more such property
> > hostdb.url.mode.
> >
> >
> >
> > Yossi.
> >
> >