Thanks Markus, I will open a ticket and submit a patch. One follow-up question: UpdateHostDb checks for urlnormalizer-host (which can be used to mitigate the problem I mentioned) and throws an exception if it is enabled. Was that also an internal decision at OpenIndex that should perhaps be removed now that the code is part of Nutch, or is there a reason this normalizer must not be used with UpdateHostDb?

Yossi.

> -----Original Message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: 05 March 2018 12:22
> To: user@nutch.apache.org
> Subject: RE: Why doesn't hostdb support byDomain mode?
>
> Hi,
>
> The reason is simple, we (company) needed this information based on
> hostname, so we made a hostdb. I don't see any downside for supporting a
> domain mode. Adding support for it through hostdb.url.mode seems like a good
> idea.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Yossi Tamari <yossi.tam...@pipl.com>
> > Sent: Sunday 4th March 2018 12:01
> > To: user@nutch.apache.org
> > Subject: Why doesn't hostdb support byDomain mode?
> >
> > Hi,
> >
> > Is there a reason that hostdb provides per-host data even when the
> > generate/fetch are working by domain? This generates misleading
> > statistics for servers that load-balance by redirecting to nodes (e.g.
> > photobucket).
> >
> > If this is just an oversight, I can contribute a patch, but I'm not
> > sure if I should use partition.url.mode, generate.count.mode, one of
> > the other similar properties, or create one more such property
> > hostdb.url.mode.
> >
> > Yossi.