Hi Yossi, hi Markus,

we should keep on the radar that some features will not work properly
with a domain-level hostdb:
- DNS checks in UpdateHostDbReducer
- SitemapProcessor tries to find sitemaps announced in the host's robots.txt
(there may be more)
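To make the mismatch concrete, here is a minimal Python sketch (not Nutch code) of what happens when hostdb keys switch from hosts to domains. The hostnames, the crawl-delay values, and the helper names `host_key`, `domain_key` and `group` are all made up for illustration, and the "last two labels" rule is a naive stand-in for a proper public-suffix lookup.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_key(url):
    """Key in host mode: the full hostname."""
    return urlsplit(url).hostname


def domain_key(url):
    """Key in domain mode: naively, the last two labels of the hostname.

    Assumption for illustration only; a real crawler would consult a
    public-suffix list so that names like example.co.uk are handled.
    """
    return ".".join(urlsplit(url).hostname.split(".")[-2:])


def group(urls, key_fn):
    """Map each key to the set of hostnames that fall under it."""
    groups = defaultdict(set)
    for url in urls:
        groups[key_fn(url)].add(urlsplit(url).hostname)
    return dict(groups)


# Hypothetical URLs of a site that spreads content over several hosts.
urls = [
    "http://s1.example.com/a.jpg",
    "http://s2.example.com/b.jpg",
    "http://www.example.com/",
]

# Hypothetical crawl delays (seconds), one per host's robots.txt.
crawl_delay = {
    "s1.example.com": 1,
    "s2.example.com": 30,
    "www.example.com": 5,
}

by_host = group(urls, host_key)
by_domain = group(urls, domain_key)

# In domain mode one entry stands for several hosts, so there is no single
# hostname to resolve via DNS and no single robots.txt to read.
delays_by_domain = {
    key: sorted(crawl_delay[h] for h in hosts)
    for key, hosts in by_domain.items()
}

print(len(by_host), "host entries,", len(by_domain), "domain entry")
print(delays_by_domain)
```

With these inputs the three host entries collapse into a single domain entry that carries three different crawl delays, which is the situation in which host-level features such as the DNS check or the robots.txt sitemap lookup no longer have an obvious target.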
> Adding support for it through hostdb.url.mode seems like a good idea.

Yes! We already have three of them:
  generate.count.mode
  partition.url.mode
  fetcher.queue.mode
Better to keep it as a separate property for the HostDb as well. In fact,
you may even set them to different values if you know what you are doing.

Btw., the fact that robots.txt is per host (and also per protocol/port) also
affects the fetcher in domain mode: the robots.txt may define a custom
crawl-delay, but with multiple hosts per domain there is no guarantee that
it is used. Also, one large delay could accidentally be applied to the
entire domain.

Sebastian

On 03/05/2018 11:21 AM, Markus Jelsma wrote:
> Hi,
>
> The reason is simple, we (company) needed this information based on hostname,
> so we made a hostdb. I don't see any downside for supporting a domain mode.
> Adding support for it through hostdb.url.mode seems like a good idea.
>
> Regards,
> Markus
>
> -----Original message-----
>> From:Yossi Tamari <[email protected]>
>> Sent: Sunday 4th March 2018 12:01
>> To: [email protected]
>> Subject: Why doesn't hostdb support byDomain mode?
>>
>> Hi,
>>
>> Is there a reason that hostdb provides per-host data even when the
>> generate/fetch are working by domain? This generates misleading statistics
>> for servers that load-balance by redirecting to nodes (e.g. photobucket).
>>
>> If this is just an oversight, I can contribute a patch, but I'm not sure if
>> I should use partition.url.mode, generate.count.mode, one of the other
>> similar properties, or create one more such property hostdb.url.mode.
>>
>> Yossi.

