Re: UrlRegexFilter is getting destroyed for unrealistically long links

Semyon Semyonov Mon, 12 Mar 2018 07:16:59 -0700

Hi Sebastian,

I think the simplest way (and a more solid one than modifying the regex) would
be to modify ParseOutputFormat.filterNormalize.


As far as I can see, all URL modification and filtering happens there.
So if, at the beginning, next to the existing check

    if (fromUrl.equals(toUrl)) {
      return null;
    }

we add the condition

    if (fromUrl.length() > MAX || toUrl.length() > MAX) {
      return null;
    }

that should be it.
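
To make the idea concrete, here is a minimal standalone sketch of that early return. It is not the real ParseOutputFormat code: the class name LinkLengthGuard, the two-argument signature and the limit of 2048 are assumptions made only for illustration (the real method takes more parameters and then runs the normalizers and filters).

```java
public class LinkLengthGuard {
    // Hypothetical limit; the proposal above only calls it MAX.
    public static final int MAX_URL_LENGTH = 2048;

    /** Returns null when the link should be dropped, otherwise toUrl unchanged. */
    public static String filterNormalize(String fromUrl, String toUrl) {
        // Cheap length check first, so that no normalizer or regex filter
        // ever sees an oversized URL.
        if (fromUrl.length() > MAX_URL_LENGTH || toUrl.length() > MAX_URL_LENGTH) {
            return null;
        }
        // Existing check: ignore links that point back to the same URL.
        if (fromUrl.equals(toUrl)) {
            return null;
        }
        // The real method would continue with URL normalizers and filters here.
        return toUrl;
    }

    /** Builds a string of n copies of c (works on older Java versions too). */
    public static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String from = "https://example.com/";
        String huge = from + repeat('A', 532572);
        System.out.println(filterNormalize(from, huge));
        System.out.println(filterNormalize(from, from + "page"));
    }
}
```

With these assumptions the oversized link is dropped (null) before any regex runs, and the ordinary link is passed through.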

Am I missing something?

Semyon

Sent: Monday, March 12, 2018 at 2:57 PM
From: "Sebastian Nagel" <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
Hi Semyon, Yossi, Markus,

> what db.max.anchor.length was supposed to do

it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
<a href="url">anchor text</a>
Can we agree to use the term "anchor" in this meaning?
At least, that's how it is used in the class Outlink and hopefully throughout 
Nutch.

> Personally, I still think the property should be used to limit outlink length 
> in parsing,

Which property, db.max.outlinks.per.page or db.max.anchor.length?

I was talking about renaming
db.max.anchor.length -> linkdb.max.anchor.length
This property was forgotten when making the naming more consistent in
[NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*

Regarding a property to limit the URL length as discussed in NUTCH-1106:
- it should be applied before URL normalizers
(that would be the main advantage over adding a regex filter rule)
- but probably for all tools / places where URLs are filtered
(ugly because there are many of them)
- one option would be to rethink the pipeline of URL normalizers and filters
as Julien did it for Storm-crawler [1].
- a pragmatic solution to keep the code changes limited:
do the length check twice at the beginning of
URLNormalizers.normalize(...)
and
URLFilters.filter(...)
(it's not guaranteed that normalizers are always called)
- the minimal solution: add a default rule to regex-urlfilter.txt.template
to limit the length to 512 (or 1024/2048) characters
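
As an illustration of the minimal solution, the sketch below checks what a length rule would match. It assumes a rule of the form "-^.{513,}" (reject anything longer than 512 characters) and assumes the filter applies rules with find() semantics; the class name LengthRuleDemo is hypothetical and plain java.util.regex is used instead of the Nutch filter classes.

```java
import java.util.regex.Pattern;

public class LengthRuleDemo {
    // Assumed rule body for a "-^.{513,}" line in regex-urlfilter.txt;
    // the leading '-' (not part of the regex) would mean "reject on match".
    public static final Pattern TOO_LONG = Pattern.compile("^.{513,}");

    /** True if the URL has more than 512 characters and would be rejected. */
    public static boolean rejectedByLengthRule(String url) {
        return TOO_LONG.matcher(url).find();
    }

    /** Builds a string of n copies of c. */
    public static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String atLimit = repeat('a', 512);
        System.out.println(rejectedByLengthRule(atLimit));       // 512 chars
        System.out.println(rejectedByLengthRule(atLimit + "a")); // 513 chars
    }
}
```

Such a rule only helps if it is evaluated before the expensive rules, and it is still applied after the normalizers, which is the drawback mentioned in the first point.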


Best,
Sebastian

[1]
https://github.com/DigitalPebble/storm-crawler/blob/master/archetype/src/main/resources/archetype-resources/src/main/resources/urlfilters.json



On 03/12/2018 02:02 PM, Yossi Tamari wrote:
> The other properties in this section actually affect parsing (e.g. 
> db.max.outlinks.per.page). I was under the impression that this is what 
> db.max.anchor.length was supposed to do, and actually increased its value. 
> Turns out this is one of the many things in Nutch that are not intuitive (or 
> in this case, does nothing at all).
> One of the reasons I thought so is that very long links can be used as an 
> attack on crawlers.
> Personally, I still think the property should be used to limit outlink length 
> in parsing, but if that is not what it's supposed to do, I guess it needs to 
> be renamed (to match the code), moved to a different section of the 
> properties file, and perhaps better documented. In that case, you'll need to 
> use Markus' solution, and basically everybody should use Markus' first rule...
>
>> -----Original Message-----
>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>> Sent: 12 March 2018 14:51
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
>> links
>>
>> So, what is the conclusion?
>>
>> Should it be solved in the regex file or through this property?
>>
>> Though how is a crawldb/linkdb property supposed to prevent this problem in
>> Parse?
>>
>> Sent: Monday, March 12, 2018 at 1:42 PM
>> From: "Edward Capriolo" <edlinuxg...@gmail.com>
>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
>> links
>> Some regular expressions (those with backtracking) can be very expensive for
>> long strings
>>
>> https://regular-expressions.mobi/catastrophic.html?wlr=1
>>
>> Maybe that is your issue.
>>
>> On Monday, March 12, 2018, Sebastian Nagel <wastl.na...@googlemail.com>
>> wrote:
>>
>>> Good catch. It should be renamed to be consistent with other
>>> properties, right?
>>>
>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
>>>> Perhaps, however it starts with db, not linkdb (like the other linkdb
>>>> properties), it is in the CrawlDB part of nutch-default.xml, and
>>>> LinkDB code uses the property name linkdb.max.anchor.length.
>>>>
>>>>> -----Original Message-----
>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>>> Sent: 12 March 2018 14:05
>>>>> To: user@nutch.apache.org
>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>> unrealistically long links
>>>>>
>>>>> That is for the LinkDB.
>>>>>
>>>>>
>>>>>
>>>>> -----Original message-----
>>>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>>>>> Sent: Monday 12th March 2018 13:02
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>> unrealistically long links
>>>>>>
>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
>>>>>> paste error...
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>>>>> Sent: 12 March 2018 14:01
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>> unrealistically long links
>>>>>>>
>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
>>>>>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
>>>>>>> int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>>>>>>> Sent: Monday 12th March 2018 12:56
>>>>>>>> To: user@nutch.apache.org
>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>> unrealistically long links
>>>>>>>>
>>>>>>>> nutch-default.xml contains a property db.max.outlinks.per.page,
>>>>>>>> which I think is supposed to prevent these cases. However, I just
>>>>>>>> searched the code and couldn't find where it is used. Bug?
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>>>>>>>>> Sent: 12 March 2018 12:47
>>>>>>>>> To: usernutch.apache.org <user@nutch.apache.org>
>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
>>>>>>>>> unrealistically long links
>>>>>>>>>
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> There is an issue with UrlRegexFilter and parsing. On average,
>>>>>>>>> parsing takes about 1 millisecond, but sometimes websites
>>>>>>>>> have crazy links that wreck the parsing (it takes 3+ hours
>>>>>>>>> and breaks the next steps of the crawl).
>>>>>>>>> For example, below you can see a shortened logged version of a URL
>>>>>>>>> with an encoded image; the real length of the link is 532572
>>>>>>>>> characters.
>>>>>>>>>
>>>>>>>>> Any idea what I should do about such behavior? Should I modify
>>>>>>>>> the plugin to reject links with length > MAX, or use more complex
>>>>>>>>> logic/check extra configuration?
>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
>>>>>>>>> and normalization
>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
>>>>>>>>> filter for url
>>>>>>>>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
>>>>>>>>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
>>>>>>>>> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
>>>>>>>>> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
>>>>>>>>> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
>>>>>>>>> dbnu50253lju... [532572 characters]
>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
>>>>>>>>> normalization
>>>>>>>>>
>>>>>>>>> Semyon.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check than
>> usual.
>
