Ah, well, that is a good one! It took me a while to figure it out, but having the check there is an error. We had added the same check in an earlier, different Nutch job, where the database could remove itself just through the rules it emitted when the host normalizer was enabled.
I simply reused the job setup code and forgot to remove that check. You can safely remove that check in HostDB.

Regards,
Markus

-----Original message-----
> From:Yossi Tamari <[email protected]>
> Sent: Monday 5th March 2018 11:30
> To: [email protected]
> Subject: RE: Why doesn't hostdb support byDomain mode?
>
> Thanks Markus, I will open a ticket and submit a patch.
> One follow up question: UpdateHostDb checks and throws an exception if
> urlnormalizer-host (which can be used to mitigate the problem I mentioned) is
> enabled. Is that also an internal decision of OpenIndex, and perhaps should
> be removed now that the code is part of Nutch, or is there a reason this
> normalizer must not be used with UpdateHostDb?
>
> Yossi.
>
> > -----Original Message-----
> > From: Markus Jelsma <[email protected]>
> > Sent: 05 March 2018 12:22
> > To: [email protected]
> > Subject: RE: Why doesn't hostdb support byDomain mode?
> >
> > Hi,
> >
> > The reason is simple, we (company) needed this information based on
> > hostname, so we made a hostdb. I don't see any downside for supporting a
> > domain mode. Adding support for it through hostdb.url.mode seems like a good
> > idea.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From:Yossi Tamari <[email protected]>
> > > Sent: Sunday 4th March 2018 12:01
> > > To: [email protected]
> > > Subject: Why doesn't hostdb support byDomain mode?
> > >
> > > Hi,
> > >
> > > Is there a reason that hostdb provides per-host data even when the
> > > generate/fetch are working by domain? This generates misleading
> > > statistics for servers that load-balance by redirecting to nodes (e.g.
> > > photobucket).
> > >
> > > If this is just an oversight, I can contribute a patch, but I'm not
> > > sure if I should use partition.url.mode, generate.count.mode, one of
> > > the other similar properties, or create one more such property
> > > hostdb.url.mode.
> > >
> > > Yossi.

