Hi Yossi,

OK, I see: you need administrator privileges to reopen old issues.
Done: reopened NUTCH-1106.

Opened a new issue NUTCH-2530 instead of reopening NUTCH-2220,
to avoid accidentally modifying the release notes, e.g.

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12340218
when updating the affects/fix versions of resolved issues.

Thanks,
Sebastian


On 03/12/2018 04:50 PM, Yossi Tamari wrote:
> I think the first one should also be handled by reopening NUTCH-2220, which 
> specifically mentions renaming db.max.anchor.length. The problem is that it 
> seems like I am not able to reopen a closed/resolved issue. Sorry...
> 
>> -----Original Message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>> Sent: 12 March 2018 17:39
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
>> links
>>
>>>> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
>>> OK, agreed, but it should also be moved to the LinkDB section in
>>> nutch-default.xml.
>>
>> Yes, of course, plus make the description more explicit.
>> Could you open a Jira issue for this?
>>
>>> It should apply to outlinks received from the parser, not to injected
>>> URLs, for example.
>>
>> Maybe it's ok not to apply it to seed URLs, but what about URLs from
>> sitemaps and possibly redirects?
>> But agreed, you could always add a rule to regex-urlfilter.txt if
>> required. But it should be made clear that only outlinks are checked
>> for length.
>> Could you reopen NUTCH-1106 to address this?
>>
>>
>> Thanks!
>>
>>
>> On 03/12/2018 03:27 PM, Yossi Tamari wrote:
>>>> Which property, db.max.outlinks.per.page or db.max.anchor.length?
>>> db.max.anchor.length, I already said that when I wrote
>>> "db.max.outlinks.per.page" it was a copy/paste error.
>>>
>>>> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
>>> OK, agreed, but it should also be moved to the LinkDB section in
>>> nutch-default.xml.
>>>
>>>> Regarding a property to limit the URL length as discussed in NUTCH-1106:
>>>> - it should be applied before URL normalizers
>>> Agreed, but it seems to me the most natural place to add it is where
>>> db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat.
>>> It should apply to outlinks received from the parser, not to injected
>>> URLs, for example. The only other place I can think of where this may
>>> be needed is after a redirect.
>>> This is pretty much the same as what Semyon suggests, whether we push
>>> it down into the filterNormalize method or do it before calling it.
>>>
>>>     Yossi.
>>>
>>>> -----Original Message-----
>>>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>>>> Sent: 12 March 2018 15:57
>>>> To: user@nutch.apache.org
>>>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
>>>> long links
>>>>
>>>> Hi Semyon, Yossi, Markus,
>>>>
>>>>> what db.max.anchor.length was supposed to do
>>>>
>>>> it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
>>>>   <a href="url">anchor text</a>
>>>> Can we agree to use the term "anchor" in this meaning?
>>>> At least, that's how it is used in the class Outlink and hopefully
>>>> throughout Nutch.
>>>>
>>>>> Personally, I still think the property should be used to limit
>>>>> outlink length in parsing,
>>>>
>>>> Which property, db.max.outlinks.per.page or db.max.anchor.length?
>>>>
>>>> I was about renaming
>>>>   db.max.anchor.length -> linkdb.max.anchor.length
>>>> This property was forgotten when making the naming more consistent in
>>>>   [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
>>>>
>>>> Regarding a property to limit the URL length as discussed in NUTCH-1106:
>>>> - it should be applied before URL normalizers
>>>>   (that would be the main advantage over adding a regex filter rule)
>>>> - but probably for all tools / places where URLs are filtered
>>>>   (ugly because there are many of them)
>>>> - one option would be to rethink the pipeline of URL normalizers and 
>>>> filters
>>>>   as Julien did it for Storm-crawler [1].
>>>> - a pragmatic solution to keep the code changes limited:
>>>>   do the length check twice at the beginning of
>>>>    URLNormalizers.normalize(...)
>>>>   and
>>>>    URLFilters.filter(...)
>>>>   (it's not guaranteed that normalizers are always called)
>>>> - the minimal solution: add a default rule to regex-urlfilter.txt.template
>>>>   to limit the length to 512 (or 1024/2048) characters
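The minimal solution above could be sketched as a single rule near the top of regex-urlfilter.txt; the 512-character limit and the rule itself are illustrative, not an existing Nutch default:

```
# Illustrative only: reject any URL longer than 512 characters.
# Rules are applied top-down, so placing this first rejects over-long
# URLs before any more expensive rules below are ever tried.
-^.{513,}
```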
>>>>
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> [1]
>>>> https://github.com/DigitalPebble/storm-crawler/blob/master/archetype/src/main/resources/archetype-resources/src/main/resources/urlfilters.json
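The pragmatic solution above (a cheap length guard run before any regex work in both URLNormalizers.normalize(...) and URLFilters.filter(...)) could be sketched roughly like this; the class, the property name urlfilter.max.url.length, and the default of 512 are assumptions for illustration, not existing Nutch code or configuration:

```java
// Minimal sketch of a length guard applied before regex-based
// normalizing/filtering. Names and defaults are hypothetical.
public class UrlLengthGuard {

    // Assumed default; would be read from urlfilter.max.url.length.
    public static final int DEFAULT_MAX_URL_LENGTH = 512;

    /**
     * Returns null for over-long (or null) URLs, mimicking the
     * URLFilters.filter() convention of null meaning "rejected".
     */
    public static String checkLength(String url, int maxLength) {
        if (url == null || url.length() > maxLength) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        // A normal URL passes through unchanged.
        System.out.println(checkLength("https://nutch.apache.org/",
                DEFAULT_MAX_URL_LENGTH));
        // An over-long URL is dropped before any regex is ever run on it.
        System.out.println(checkLength("https://example.com/?q=" + "a".repeat(600),
                DEFAULT_MAX_URL_LENGTH));
    }
}
```

Doing the check twice (in both entry points) is cheap, since String.length() is O(1), which is what makes this approach attractive compared to touching every tool that filters URLs.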
>>>>
>>>>
>>>>
>>>> On 03/12/2018 02:02 PM, Yossi Tamari wrote:
>>>>> The other properties in this section actually affect parsing (e.g.
>>>>> db.max.outlinks.per.page). I was under the impression that this is
>>>>> what db.max.anchor.length was supposed to do, and actually increased
>>>>> its value. Turns out this is one of the many things in Nutch that are
>>>>> not intuitive (or, in this case, does nothing at all).
>>>>> One of the reasons I thought so is that very long links can be used
>>>>> as an attack on crawlers.
>>>>> Personally, I still think the property should be used to limit
>>>>> outlink length in parsing, but if that is not what it's supposed to
>>>>> do, I guess it needs to be renamed (to match the code), moved to a
>>>>> different section of the properties file, and perhaps better
>>>>> documented. In that case, you'll need to use Markus' solution, and
>>>>> basically everybody should use Markus' first rule...
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>>>>>> Sent: 12 March 2018 14:51
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: Re: UrlRegexFilter is getting destroyed for
>>>>>> unrealistically long links
>>>>>>
>>>>>> So, which is the conclusion?
>>>>>>
>>>>>> Should it be solved in regex file or through this property?
>>>>>>
>>>>>> Though, how is a crawldb/linkdb property supposed to prevent this
>>>>>> problem in Parse?
>>>>>>
>>>>>> Sent: Monday, March 12, 2018 at 1:42 PM
>>>>>> From: "Edward Capriolo" <edlinuxg...@gmail.com>
>>>>>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>>>>>> Subject: Re: UrlRegexFilter is getting destroyed for
>>>>>> unrealistically long links
>>>>>>
>>>>>> Some regular expressions (those with backtracking) can be very
>>>>>> expensive for long strings.
>>>>>>
>>>>>> https://regular-expressions.mobi/catastrophic.html?wlr=1
>>>>>>
>>>>>> Maybe that is your issue.
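A toy illustration of such catastrophic backtracking (pattern and input invented for demonstration, not taken from Nutch's rules): with a nested quantifier, each extra character roughly doubles the work when the match fails, so a 500,000-character URL would effectively never finish.

```java
import java.util.regex.Pattern;

// Invented example of a backtracking-prone pattern: "(a+)+b" has a
// nested quantifier, so when the input lacks the final 'b' the matcher
// tries exponentially many ways to split the run of 'a's before failing.
public class BacktrackDemo {
    public static void main(String[] args) {
        Pattern evil = Pattern.compile("(a+)+b");
        // 20 'a's already cost on the order of 2^20 backtracking
        // attempts; each additional 'a' roughly doubles the time.
        String input = "a".repeat(20) + "!";
        long start = System.nanoTime();
        boolean matched = evil.matcher(input).matches();
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("matched=" + matched + " after " + ms + " ms");
    }
}
```

This is why a cheap length check before the regex filter (or an anchored, non-backtracking rule) matters more than tuning the individual expressions.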
>>>>>>
>>>>>> On Monday, March 12, 2018, Sebastian Nagel
>>>>>> <wastl.na...@googlemail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Good catch. It should be renamed to be consistent with other
>>>>>>> properties, right?
>>>>>>>
>>>>>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
>>>>>>>> Perhaps. However, it starts with db, not linkdb (like the other
>>>>>>>> linkdb properties), it is in the CrawlDB part of nutch-default.xml,
>>>>>>>> and LinkDB code uses the property name linkdb.max.anchor.length.
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>>>>>>> Sent: 12 March 2018 14:05
>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>> unrealistically
>>>>>>> long links
>>>>>>>>>
>>>>>>>>> That is for the LinkDB.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original message-----
>>>>>>>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>>>>>>>>> Sent: Monday 12th March 2018 13:02
>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>>> unrealistically long links
>>>>>>>>>>
>>>>>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
>>>>>>>>>> paste
>>>>>>>>> error...
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>>>>>>>>> Sent: 12 March 2018 14:01
>>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>>>> unrealistically long links
>>>>>>>>>>>
>>>>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
>>>>>>>>>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
>>>>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
>>>>>>>>>>> int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -----Original message-----
>>>>>>>>>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>>>>>>>>>>> Sent: Monday 12th March 2018 12:56
>>>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
>>>>>>>>>>>> unrealistically long links
>>>>>>>>>>>>
>>>>>>>>>>>> nutch-default.xml contains a property db.max.outlinks.per.page,
>>>>>>>>>>>> which I think is
>>>>>>>>>>> supposed to prevent these cases. However, I just searched the
>>>>>>>>>>> code and couldn't find where it is used. Bug?
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>>>>>>>>>>>>> Sent: 12 March 2018 12:47
>>>>>>>>>>>>> To: usernutch.apache.org <user@nutch.apache.org>
>>>>>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
>>>>>>>>>>>>> unrealistically long links
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dear all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is an issue with UrlRegexFilter and parsing. On
>>>>>>>>>>>>> average, parsing takes about 1 millisecond, but sometimes
>>>>>>>>>>>>> websites have crazy links that destroy the parsing
>>>>>>>>>>>>> (it takes 3+ hours and breaks the next steps of the crawl).
>>>>>>>>>>>>> For example, below you can see a shortened logged version of
>>>>>>>>>>>>> a URL with an encoded image; the real length of the link is
>>>>>>>>>>>>> 532572 characters.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any idea what I should do with such behavior? Should I
>>>>>>>>>>>>> modify the plugin to reject links with length > MAX, or use
>>>>>>>>>>>>> more complex logic / check extra configuration?
>>>>>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
>>>>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
>>>>>>>>>>>>> and normalization
>>>>>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
>>>>>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
>>>>>>>>>>>>> filter for url
>>>>>>>>>>>>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
>>>>>>>>>>>>> [https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS]
>>>>>>>>>>>>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
>>>>>>>>>>>>> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
>>>>>>>>>>>>> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHjm6unp0pd1
>>>>>>>>>>>>> efuqpJ5ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
>>>>>>>>>>>>> dbnu... [532572 characters]
>>>>>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
>>>>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing
>>>>>>>>>>>>> and normalization
>>>>>>>>>>>>>
>>>>>>>>>>>>> Semyon.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sorry this was sent from mobile. Will do less grammar and spell
>>>>>> check than usual.
>>>>>
>>>
>>>
> 
> 
