Hi Yossi,

please don't take this as a vote against your proposal.
The issue could also be solved by documenting what does not
work when the HostDb contains domains.

Are you only concerned with the statistics, or also with using
the HostDb for the Generator?  For the former use case,
another solution would be to aggregate the counts
by domain. Usually, the HostDb is orders of magnitude
smaller than the CrawlDb, so this should be reasonably
fast.
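
To illustrate the aggregation idea, here is a minimal sketch in Python
rather than actual Nutch code. It assumes the per-host counts have already
been exported into a plain dictionary. The counter names ("fetched",
"unfetched"), the sample hosts, and the helper naive_domain() are all
made up for illustration. In particular, naive_domain() just keeps the
last two labels of the host name, so a real implementation would need a
public-suffix-aware lookup to handle names such as example.co.uk.

```python
from collections import Counter


def naive_domain(host: str) -> str:
    """Hypothetical helper: keep only the last two labels of a host name.

    This is an assumption made for the sketch; it is wrong for suffixes
    such as co.uk, where a public-suffix list would be needed.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def aggregate_by_domain(host_counts):
    """Sum per-host counters into per-domain counters."""
    totals = {}
    for host, counts in host_counts.items():
        # Counter.update() adds the given counts to the running totals.
        totals.setdefault(naive_domain(host), Counter()).update(counts)
    return totals


# Made-up sample data standing in for exported per-host statistics.
host_counts = {
    "i1.photobucket.com": {"fetched": 10, "unfetched": 5},
    "i2.photobucket.com": {"fetched": 7, "unfetched": 1},
    "example.org": {"fetched": 3},
}

by_domain = aggregate_by_domain(host_counts)
for domain, counts in sorted(by_domain.items()):
    print(domain, dict(counts))
```

Because the input is keyed by host, not by URL, a single pass like this
touches far fewer records than a pass over the CrawlDb would.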

Best,
Sebastian

On 03/05/2018 02:03 PM, Yossi Tamari wrote:
> Thanks, I will submit a patch for this. Since this allows me to solve my 
> specific issue, and since Sebastian raised some questions regarding byDomain, 
> I will not proceed with that currently.
> 
>> -----Original Message-----
>> From: Markus Jelsma <markus.jel...@openindex.io>
>> Sent: 05 March 2018 14:41
>> To: user@nutch.apache.org
>> Subject: RE: Why doesn't hostdb support byDomain mode?
>>
>> Ah, well, that is a good one! It took me a while to figure it out, but having
>> the check there is an error. We had added the same check in an earlier, different
>> Nutch job where the database itself could remove itself just by the rules it
>> emitted with the host normalizer enabled.
>>
>> I simply reused the job setup code and forgot to remove that check. You can
>> safely remove that check in HostDB.
>>
>> Regards,
>> Markus
>>
>>
>> -----Original message-----
>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>> Sent: Monday 5th March 2018 11:30
>>> To: user@nutch.apache.org
>>> Subject: RE: Why doesn't hostdb support byDomain mode?
>>>
>>> Thanks Markus, I will open a ticket and submit a patch.
>>> One follow-up question: UpdateHostDb checks and throws an exception if
>>> urlnormalizer-host (which can be used to mitigate the problem I mentioned) is
>>> enabled. Is that also an internal decision of OpenIndex, and perhaps should be
>>> removed now that the code is part of Nutch, or is there a reason this normalizer
>>> must not be used with UpdateHostDb?
>>>
>>>     Yossi.
>>>
>>>> -----Original Message-----
>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>> Sent: 05 March 2018 12:22
>>>> To: user@nutch.apache.org
>>>> Subject: RE: Why doesn't hostdb support byDomain mode?
>>>>
>>>> Hi,
>>>>
>>>> The reason is simple, we (company) needed this information based on
>>>> hostname, so we made a hostdb. I don't see any downside for
>>>> supporting a domain mode. Adding support for it through
>>>> hostdb.url.mode seems like a good idea.
>>>>
>>>> Regards,
>>>> Markus
>>>>
>>>> -----Original message-----
>>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>>>> Sent: Sunday 4th March 2018 12:01
>>>>> To: user@nutch.apache.org
>>>>> Subject: Why doesn't hostdb support byDomain mode?
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> Is there a reason that hostdb provides per-host data even when the
>>>>> generate/fetch are working by domain? This generates misleading
>>>>> statistics for servers that load-balance by redirecting to nodes (e.g.
>>>>> photobucket).
>>>>>
>>>>> If this is just an oversight, I can contribute a patch, but I'm
>>>>> not sure if I should use partition.url.mode, generate.count.mode,
>>>>> one of the other similar properties, or create one more such
>>>>> property, hostdb.url.mode.
>>>>>
>>>>>
>>>>>
>>>>> Yossi.
>>>>>
>>>>>
>>>
>>>
> 
