Thanks, I will submit a patch that removes the check. Since that solves my specific issue, and since Sebastian raised some questions regarding byDomain, I will not pursue the byDomain change for now.
> -----Original Message-----
> From: Markus Jelsma <[email protected]>
> Sent: 05 March 2018 14:41
> To: [email protected]
> Subject: RE: Why doesn't hostdb support byDomain mode?
>
> Ah, well, that is a good one! It took me a while to figure it out, but
> having the check there is an error. We had added the same check in an
> earlier different Nutch job where the database itself could remove itself
> just by the rules it emitted and host normalized enabled.
>
> I simply reused the job setup code and forgot to remove that check. You can
> safely remove that check in HostDB.
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Yossi Tamari <[email protected]>
> > Sent: Monday 5th March 2018 11:30
> > To: [email protected]
> > Subject: RE: Why doesn't hostdb support byDomain mode?
> >
> > Thanks Markus, I will open a ticket and submit a patch.
> > One follow-up question: UpdateHostDb checks and throws an exception if
> > urlnormalizer-host (which can be used to mitigate the problem I
> > mentioned) is enabled. Is that also an internal decision of OpenIndex,
> > and perhaps should be removed now that the code is part of Nutch, or is
> > there a reason this normalizer must not be used with UpdateHostDb?
> >
> > Yossi.
> >
> > > -----Original Message-----
> > > From: Markus Jelsma <[email protected]>
> > > Sent: 05 March 2018 12:22
> > > To: [email protected]
> > > Subject: RE: Why doesn't hostdb support byDomain mode?
> > >
> > > Hi,
> > >
> > > The reason is simple, we (company) needed this information based on
> > > hostname, so we made a hostdb. I don't see any downside for
> > > supporting a domain mode. Adding support for it through
> > > hostdb.url.mode seems like a good idea.
> > >
> > > Regards,
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Yossi Tamari <[email protected]>
> > > > Sent: Sunday 4th March 2018 12:01
> > > > To: [email protected]
> > > > Subject: Why doesn't hostdb support byDomain mode?
> > > >
> > > > Hi,
> > > >
> > > > Is there a reason that hostdb provides per-host data even when the
> > > > generate/fetch are working by domain? This generates misleading
> > > > statistics for servers that load-balance by redirecting to nodes
> > > > (e.g. photobucket).
> > > >
> > > > If this is just an oversight, I can contribute a patch, but I'm
> > > > not sure if I should use partition.url.mode, generate.count.mode,
> > > > one of the other similar properties, or create one more such
> > > > property hostdb.url.mode.
> > > >
> > > > Yossi.

