Hi Sebastian,

So do you think this fix should be avoided? I wouldn't want to add something
that will cause problems for users down the line. Frankly, though, I can think
of domains that intend their robots.txt (crawl-delay in particular) to apply
across servers and protocols, but I can't think of any that intend the
opposite, standards aside.

Yossi.

> -----Original Message-----
> From: Sebastian Nagel <[email protected]>
> Sent: 05 March 2018 12:50
> To: [email protected]
> Subject: Re: Why doesn't hostdb support byDomain mode?
> 
> Hi Yossi, hi Markus,
> 
> we should keep on the radar that some features will not work properly with a
> domain-level hostdb:
> - DNS checks in UpdateHostDbReducer
> - SitemapProcessor tries to find sitemaps announced in the host's robots.txt
> (there may be more)
> 
> > Adding support for it through hostdb.url.mode seems like a good idea.
> 
> Yes! We already have three of them:
>   generate.count.mode
>   partition.url.mode
>   fetcher.queue.mode
> Better to keep it as a separate property for the HostDb as well.
> In fact, you may even set them to different values if you know what you are doing.
> 
> Btw., the fact that robots.txt is per host (and also protocol/port) also
> affects the fetcher in domain mode: the robots.txt may define a custom
> crawl-delay, but with multiple hosts per domain there is no guarantee that
> it is used. Also, one large delay could accidentally be applied to the
> entire domain.
> 
> Sebastian
> 
> On 03/05/2018 11:21 AM, Markus Jelsma wrote:
> > Hi,
> >
> > The reason is simple, we (company) needed this information based on
> > hostname, so we made a hostdb. I don't see any downside for supporting a
> > domain mode. Adding support for it through hostdb.url.mode seems like a
> > good idea.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> >> From:Yossi Tamari <[email protected]>
> >> Sent: Sunday 4th March 2018 12:01
> >> To: [email protected]
> >> Subject: Why doesn't hostdb support byDomain mode?
> >>
> >> Hi,
> >>
> >>
> >>
> >> Is there a reason that hostdb provides per-host data even when the
> >> generate/fetch are working by domain? This generates misleading
> >> statistics for servers that load-balance by redirecting to nodes (e.g.
> >> photobucket).
> >>
> >> If this is just an oversight, I can contribute a patch, but I'm not
> >> sure if I should use partition.url.mode, generate.count.mode, one of
> >> the other similar properties, or create one more such property
> >> hostdb.url.mode.
> >>
> >>
> >>
> >> Yossi.
> >>
> >>
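
To make the host-versus-domain concern discussed above concrete, here is a minimal Python sketch. It is not Nutch code. The example.com hosts, the crawl-delay values, and every helper name are hypothetical, and the "last two labels" rule for deriving a domain is a deliberate simplification (a real crawler would need a public suffix list). It shows how per-host statistics for load-balanced nodes collapse into one entry when keyed by domain, and how several per-host crawl-delays can end up competing for a single domain-level queue.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_key(url):
    """Key a URL by its full host name (per-host grouping)."""
    return urlsplit(url).hostname


def naive_domain(host):
    """ASSUMPTION: the domain is the last two labels of the host.
    This is wrong for suffixes such as co.uk; it is only for illustration."""
    return ".".join(host.split(".")[-2:])


def domain_key(url):
    """Key a URL by its (naively derived) domain (per-domain grouping)."""
    return naive_domain(urlsplit(url).hostname)


def group_counts(urls, key):
    """Count URLs per key, as a per-host or per-domain statistic would."""
    counts = defaultdict(int)
    for url in urls:
        counts[key(url)] += 1
    return dict(counts)


def delays_per_domain(delays):
    """Collect the distinct per-host crawl-delays that fall under each domain."""
    by_domain = defaultdict(set)
    for host, delay in delays.items():
        by_domain[naive_domain(host)].add(delay)
    return {domain: sorted(values) for domain, values in by_domain.items()}


# Hypothetical URLs served by load-balanced nodes of one site.
urls = [
    "http://i1.example.com/a.jpg",
    "http://i2.example.com/b.jpg",
    "http://i2.example.com/c.jpg",
    "https://www.example.com/",
]

# Hypothetical crawl-delays (seconds) taken from each host's own robots.txt.
crawl_delay = {"i1.example.com": 1.0, "i2.example.com": 30.0}

by_host = group_counts(urls, host_key)
by_domain = group_counts(urls, domain_key)
conflicts = delays_per_domain(crawl_delay)

print(by_host)
print(by_domain)
print(conflicts)
```

With more than one delay listed for a domain, a domain-level queue has to pick one: either the short delay is ignored for the slow host, or the long delay throttles the whole domain.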

