Hi Sebastian,

So do you think this fix should be avoided? I wouldn't want to add something that will cause problems for users down the line. But frankly, I can think of domains that intend their robots.txt (its crawl-delay, for example) to apply across servers and protocols, and I can't think of any that intend the opposite, standards aside.

Yossi.

> -----Original Message-----
> From: Sebastian Nagel <[email protected]>
> Sent: 05 March 2018 12:50
> To: [email protected]
> Subject: Re: Why doesn't hostdb support byDomain mode?
>
> Hi Yossi, hi Markus,
>
> we should keep on the radar that some features will not work properly with a
> domain-level hostdb:
> - DNS checks in UpdateHostDbReducer
> - SitemapProcessor tries to find sitemaps announced in the host's robots.txt
> (there may be more)
>
> > Adding support for it through hostdb.url.mode seems like a good idea.
>
> Yes! We already have three of them:
>   generate.count.mode
>   partition.url.mode
>   fetcher.queue.mode
> Better to keep it also as a separate property for the HostDb.
> In fact you may even set them to different values if you know what you do.
>
> Btw., the fact that robots.txt is per host (and also protocol/port) also
> affects the fetcher in domain mode: the robots.txt may define a custom
> crawl-delay, with multiple hosts per domain there is no guarantee that it is
> used. Also one large delay could be used accidentally for the entire domain.
>
> Sebastian
>
> On 03/05/2018 11:21 AM, Markus Jelsma wrote:
> > Hi,
> >
> > The reason is simple, we (company) needed this information based on
> > hostname, so we made a hostdb. I don't see any downside for supporting a
> > domain mode. Adding support for it through hostdb.url.mode seems like a
> > good idea.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> >> From:Yossi Tamari <[email protected]>
> >> Sent: Sunday 4th March 2018 12:01
> >> To: [email protected]
> >> Subject: Why doesn't hostdb support byDomain mode?
> >>
> >> Hi,
> >>
> >> Is there a reason that hostdb provides per-host data even when the
> >> generate/fetch are working by domain? This generates misleading
> >> statistics for servers that load-balance by redirecting to nodes (e.g.
> >> photobucket).
> >>
> >> If this is just an oversight, I can contribute a patch, but I'm not
> >> sure if I should use partition.url.mode, generate.count.mode, one of
> >> the other similar properties, or create one more such property
> >> hostdb.url.mode.
> >>
> >> Yossi.

