Thanks, I will submit a patch for this. Since this allows me to solve my 
specific issue, and since Sebastian raised some questions regarding byDomain, I 
will not proceed with that currently.

> -----Original Message-----
> From: Markus Jelsma <[email protected]>
> Sent: 05 March 2018 14:41
> To: [email protected]
> Subject: RE: Why doesn't hostdb support byDomain mode?
> 
> Ah, well, that is a good one! It took me a while to figure it out, but having
> the check there is an error. We had added the same check in an earlier, different
> Nutch job where the database itself could remove itself just by the rules it
> emitted and with the host normalizer enabled.
> 
> I simply reused the job setup code and forgot to remove that check. You can
> safely remove that check in HostDB.
> 
> Regards,
> Markus
> 
> 
> -----Original message-----
> > From:Yossi Tamari <[email protected]>
> > Sent: Monday 5th March 2018 11:30
> > To: [email protected]
> > Subject: RE: Why doesn't hostdb support byDomain mode?
> >
> > Thanks Markus, I will open a ticket and submit a patch.
> > One follow up question: UpdateHostDb checks and throws an exception if
> > urlnormalizer-host (which can be used to mitigate the problem I mentioned) is
> > enabled. Is that also an internal decision of OpenIndex, and perhaps should be
> > removed now that the code is part of Nutch, or is there a reason this
> > normalizer must not be used with UpdateHostDb?
> >
> >     Yossi.
> >
> > > -----Original Message-----
> > > From: Markus Jelsma <[email protected]>
> > > Sent: 05 March 2018 12:22
> > > To: [email protected]
> > > Subject: RE: Why doesn't hostdb support byDomain mode?
> > >
> > > Hi,
> > >
> > > The reason is simple, we (company) needed this information based on
> > > hostname, so we made a hostdb. I don't see any downside for
> > > supporting a domain mode. Adding support for it through
> > > hostdb.url.mode seems like a good idea.
> > >
> > > Regards,
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Yossi Tamari <[email protected]>
> > > > Sent: Sunday 4th March 2018 12:01
> > > > To: [email protected]
> > > > Subject: Why doesn't hostdb support byDomain mode?
> > > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > Is there a reason that hostdb provides per-host data even when the
> > > > generate/fetch are working by domain? This generates misleading
> > > > statistics for servers that load-balance by redirecting to nodes (e.g.
> > > > photobucket).
> > > >
> > > > If this is just an oversight, I can contribute a patch, but I'm
> > > > not sure if I should use partition.url.mode, generate.count.mode,
> > > > one of the other similar properties, or create one more such
> > > > property hostdb.url.mode.
> > > >
> > > >
> > > >
> > > > Yossi.
> > > >
> > > >
> >
> >
