>> I was thinking about renaming db.max.anchor.length -> linkdb.max.anchor.length
> OK, agreed, but it should also be moved to the LinkDB section in 
> nutch-default.xml.

Yes, of course, plus make the description more explicit.
Could you open a Jira issue for this?

> It should apply to outlinks received from the parser, not to injected URLs, 
> for example.

Maybe it's OK not to apply it to seed URLs, but what about URLs from sitemaps 
and possibly redirects?
But agreed, you could always add a rule to regex-urlfilter.txt if required. 
It should be made clear, though, that only outlinks are checked for length.
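For illustration, such a rule in regex-urlfilter.txt could look like the following (the 512-character limit is an arbitrary choice for the example, and the rule would need to come before the final catch-all accept rule):

```
# Hypothetical example rule: reject any URL longer than 512 characters
-.{513,}
```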
Could you reopen NUTCH-1106 to address this?
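To make the intent concrete, here is a minimal sketch of such a pre-filter length check. The class name, method name, and 512-character default are all illustrative assumptions, not existing Nutch API; the real property name and limit would be decided in the issue:

```java
public class UrlLengthCheck {

  // Hypothetical default; not an existing Nutch property value.
  static final int MAX_URL_LENGTH = 512;

  /**
   * Returns null for over-long (or null) URLs, mirroring URLFilter
   * semantics, so the check could run cheaply before both
   * URLNormalizers.normalize(...) and URLFilters.filter(...).
   */
  static String checkLength(String url) {
    if (url == null || url.length() > MAX_URL_LENGTH) {
      return null;
    }
    return url;
  }

  public static void main(String[] args) {
    // A short URL passes through unchanged.
    System.out.println(checkLength("https://example.com/ok"));
    // A 600+ character URL is rejected before any regex runs.
    StringBuilder sb = new StringBuilder("https://example.com/");
    for (int i = 0; i < 600; i++) {
      sb.append('x');
    }
    System.out.println(checkLength(sb.toString())); // prints "null"
  }
}
```

Doing the check twice (once per entry point) costs almost nothing, since String.length() is O(1), and avoids handing half-megabyte strings to the regex engine at all.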


Thanks!


On 03/12/2018 03:27 PM, Yossi Tamari wrote:
>> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> db.max.anchor.length, I already said that when I wrote 
> "db.max.outlinks.per.page" it was a copy/paste error.
> 
>> I was thinking about renaming db.max.anchor.length -> linkdb.max.anchor.length
> OK, agreed, but it should also be moved to the LinkDB section in 
> nutch-default.xml.
> 
>> Regarding a property to limit the URL length as discussed in NUTCH-1106:
>> - it should be applied before URL normalizers
> Agreed, but it seems to me the most natural place to add it is where 
> db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat. It 
> should apply to outlinks received from the parser, not to injected URLs, for 
> example. The only other place I can think of where this may be needed is 
> after redirect.
> This is pretty much the same as what Semyon suggests, whether we push it down 
> into the filterNormalize method or do it before calling it.
> 
>       Yossi.
> 
>> -----Original Message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>> Sent: 12 March 2018 15:57
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
>> links
>>
>> Hi Semyon, Yossi, Markus,
>>
>>> what db.max.anchor.length was supposed to do
>>
>> it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
>>   <a href="url">anchor text</a>
>> Can we agree to use the term "anchor" in this meaning?
>> At least, that's how it is used in the class Outlink and hopefully throughout
>> Nutch.
>>
>>> Personally, I still think the property should be used to limit outlink
>>> length in parsing,
>>
>> Which property, db.max.outlinks.per.page or db.max.anchor.length?
>>
>> I was thinking about renaming
>>   db.max.anchor.length -> linkdb.max.anchor.length
>> This property was forgotten when making the naming more consistent in
>>   [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
>>
>> Regarding a property to limit the URL length as discussed in NUTCH-1106:
>> - it should be applied before URL normalizers
>>   (that would be the main advantage over adding a regex filter rule)
>> - but probably for all tools / places where URLs are filtered
>>   (ugly because there are many of them)
>> - one option would be to rethink the pipeline of URL normalizers and filters
>>   as Julien did it for Storm-crawler [1].
>> - a pragmatic solution to keep the code changes limited:
>>   do the length check twice at the beginning of
>>    URLNormalizers.normalize(...)
>>   and
>>    URLFilters.filter(...)
>>   (it's not guaranteed that normalizers are always called)
>> - the minimal solution: add a default rule to regex-urlfilter.txt.template
>>   to limit the length to 512 (or 1024/2048) characters
>>
>>
>> Best,
>> Sebastian
>>
>> [1]
>> https://github.com/DigitalPebble/storm-
>> crawler/blob/master/archetype/src/main/resources/archetype-
>> resources/src/main/resources/urlfilters.json
>>
>>
>>
>> On 03/12/2018 02:02 PM, Yossi Tamari wrote:
>>> The other properties in this section actually affect parsing (e.g.
>> db.max.outlinks.per.page). I was under the impression that this is what
>> db.max.anchor.length was supposed to do, and actually increased its value.
>> Turns out this is one of the many things in Nutch that are not intuitive (or,
>> in this case, do nothing at all).
>>> One of the reasons I thought so is that very long links can be used as an 
>>> attack
>> on crawlers.
>>> Personally, I still think the property should be used to limit outlink 
>>> length in
>> parsing, but if that is not what it's supposed to do, I guess it needs to be
>> renamed (to match the code), moved to a different section of the properties
>> file, and perhaps better documented. In that case, you'll need to use Markus'
>> solution, and basically everybody should use Markus' first rule...
>>>
>>>> -----Original Message-----
>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>>>> Sent: 12 March 2018 14:51
>>>> To: user@nutch.apache.org
>>>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
>>>> long links
>>>>
>>>> So, which is the conclusion?
>>>>
>>>> Should it be solved in regex file or through this property?
>>>>
>>>> Though, how is a crawldb/linkdb property supposed to prevent this
>>>> problem in Parse?
>>>>
>>>> Sent: Monday, March 12, 2018 at 1:42 PM
>>>> From: "Edward Capriolo" <edlinuxg...@gmail.com>
>>>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>>>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
>>>> long links
>>>>
>>>> Some regular expressions (those with backtracking) can be very
>>>> expensive for long strings.
>>>>
>>>> https://regular-expressions.mobi/catastrophic.html?wlr=1
>>>>
>>>> Maybe that is your issue.
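To illustrate the effect Edward describes, a small self-contained demo; the pattern and input are made up for demonstration and are not taken from Nutch's actual filter rules:

```java
import java.util.regex.Pattern;

public class BacktrackingDemo {
  public static void main(String[] args) {
    // Nested quantifiers like (a+)+ force the engine to retry
    // exponentially many partitions of the 'a' run once the trailing
    // '!' makes the overall match fail.
    String input = "aaaaaaaaaaaaaaaaaaaaaaaaa!"; // 25 'a's, then a mismatch
    long start = System.nanoTime();
    boolean matched = Pattern.matches("(a+)+", input);
    long millis = (System.nanoTime() - start) / 1_000_000;
    // Each extra 'a' roughly doubles the time; a 500k-character URL
    // against a backtracking-prone pattern would effectively never finish.
    System.out.println("matched=" + matched + " in " + millis + " ms");
  }
}
```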
>>>>
>>>> On Monday, March 12, 2018, Sebastian Nagel
>>>> <wastl.na...@googlemail.com>
>>>> wrote:
>>>>
>>>>> Good catch. It should be renamed to be consistent with other
>>>>> properties, right?
>>>>>
>>>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
>>>>>> Perhaps; however, it starts with db, not linkdb (like the other
>>>>>> linkdb properties), it is in the CrawlDB part of nutch-default.xml,
>>>>>> and the LinkDB code uses the property name linkdb.max.anchor.length.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>>>>> Sent: 12 March 2018 14:05
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>> unrealistically
>>>>> long links
>>>>>>>
>>>>>>> That is for the LinkDB.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>>>>>>> Sent: Monday 12th March 2018 13:02
>>>>>>>> To: user@nutch.apache.org
>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>> unrealistically long links
>>>>>>>>
>>>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
>>>>>>>> paste
>>>>>>> error...
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>>>>>>> Sent: 12 March 2018 14:01
>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>> unrealistically long links
>>>>>>>>>
>>>>>>>>> scripts/apache-nutch-
>>>>>>>>> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
>>>>>>>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page",
>>>>>>>>> 100);
>>>>>>>>> scripts/apache-nutch-
>>>>>>>>> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
>>>>> int
>>>>>>>>> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page",
>>>>>>>>> 100);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original message-----
>>>>>>>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>>>>>>>>> Sent: Monday 12th March 2018 12:56
>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>>> unrealistically long links
>>>>>>>>>>
>>>>>>>>>> nutch-default.xml contains a property db.max.outlinks.per.page,
>>>>>>>>>> which I think is
>>>>>>>>> supposed to prevent these cases. However, I just searched the
>>>>>>>>> code and couldn't find where it is used. Bug?
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>>>>>>>>>>> Sent: 12 March 2018 12:47
>>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
>>>>>>>>>>> unrealistically long links
>>>>>>>>>>>
>>>>>>>>>>> Dear all,
>>>>>>>>>>>
>>>>>>>>>>> There is an issue with UrlRegexFilter and parsing. On average,
>>>>>>>>>>> parsing takes about 1 millisecond, but sometimes websites
>>>>>>>>>>> have crazy links that destroy the parsing (it takes 3+ hours
>>>>>>>>>>> and destroys the next steps of the crawl).
>>>>>>>>>>> For example, below you can see a shortened logged version of a
>>>>>>>>>>> URL with an encoded image; the real length of the link is 532572
>>>>>>>>>>> characters.
>>>>>>>>>>>
>>>>>>>>>>> Any idea what I should do with such behavior? Should I modify
>>>>>>>>>>> the plugin to reject links with length > MAX, or use more
>>>>>>>>>>> complex logic/check extra configuration?
>>>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
>>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
>>>>>>>>>>> and normalization
>>>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
>>>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
>>>>>>>>>>> filter for url
>>>>>>>>>>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS[https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS]
>>>>>>>>>>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
>>>>>>>>>>> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
>>>>>>>>>>> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
>>>>>>>>>>> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
>>>>>>>>>>> dbnu... [532572 characters]
>>>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
>>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing
>>>>>>>>>>> and normalization
>>>>>>>>>>>
>>>>>>>>>>> Semyon.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Sorry this was sent from mobile. Will do less grammar and spell check
>>>> than usual.
>>>
> 
> 
