I think the first one should also be handled by reopening NUTCH-2220, which 
specifically mentions renaming db.max.anchor.length. The problem is that I 
don't seem to be able to reopen a closed/resolved issue. Sorry...

> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 12 March 2018 17:39
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> >> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> > OK, agreed, but it should also be moved to the LinkDB section in
> > nutch-default.xml.
> 
> Yes, of course, plus make the description more explicit.
> Could you open a Jira issue for this?
> 
> > It should apply to outlinks received from the parser, not to injected URLs, 
> > for
> example.
> 
> Maybe it's OK not to apply it to seed URLs, but what about URLs from sitemaps
> and possibly redirects?
> But agreed, you could always also add a rule to regex-urlfilter.txt if
> required. It should be made clear, though, that only outlinks are checked for
> length.
> Could you reopen NUTCH-1106 to address this?
> 
> 
> Thanks!
> 
> 
> On 03/12/2018 03:27 PM, Yossi Tamari wrote:
> >> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> > db.max.anchor.length. I already said that when I wrote
> > "db.max.outlinks.per.page" it was a copy/paste error.
> >
> >> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> > OK, agreed, but it should also be moved to the LinkDB section in
> > nutch-default.xml.
> >
> >> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> >> - it should be applied before URL normalizers
> > Agreed, but it seems to me the most natural place to add it is where
> > db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat.
> > It should apply to outlinks received from the parser, not to injected URLs,
> > for example. The only other place I can think of where this may be needed
> > is after a redirect.
> > This is pretty much the same as what Semyon suggests, whether we push it
> > down into the filterNormalize method or do it before calling it.
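[Editor's note: the guard discussed above could look roughly like the following standalone sketch. This is not Nutch code; the class name, method name, and the 2048-character limit are assumptions for illustration only.]

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch (NOT Nutch code) of the guard discussed above: drop
// overlong outlink URLs right where ParseOutputFormat already caps the
// number of outlinks, i.e. before URL normalizers and filters run.
public class OutlinkLengthFilter {

    /** Keeps only outlink URLs no longer than maxUrlLength characters. */
    public static List<String> filterByLength(List<String> outlinkUrls,
                                              int maxUrlLength) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinkUrls) {
            // A plain string-length check is cheap and, crucially, runs
            // before any (potentially backtracking) regex filter sees the URL.
            if (url != null && url.length() <= maxUrlLength) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```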
> >
> >     Yossi.
> >
> >> -----Original Message-----
> >> From: Sebastian Nagel <wastl.na...@googlemail.com>
> >> Sent: 12 March 2018 15:57
> >> To: user@nutch.apache.org
> >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
> >> long links
> >>
> >> Hi Semyon, Yossi, Markus,
> >>
> >>> what db.max.anchor.length was supposed to do
> >>
> >> it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
> >>   <a href="url">anchor text</a>
> >> Can we agree to use the term "anchor" in this meaning?
> >> At least, that's how it is used in the class Outlink and hopefully
> >> throughout Nutch.
> >>
> >>> Personally, I still think the property should be used to limit
> >>> outlink length in parsing,
> >>
> >> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> >>
> >> I was about renaming
> >>   db.max.anchor.length -> linkdb.max.anchor.length
> >> This property was forgotten when making the naming more consistent in
> >>   [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
> >>
> >> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> >> - it should be applied before URL normalizers
> >>   (that would be the main advantage over adding a regex filter rule)
> >> - but probably for all tools / places where URLs are filtered
> >>   (ugly because there are many of them)
> >> - one option would be to rethink the pipeline of URL normalizers and
> >>   filters, as Julien did for Storm-crawler [1].
> >> - a pragmatic solution to keep the code changes limited:
> >>   do the length check twice at the beginning of
> >>    URLNormalizers.normalize(...)
> >>   and
> >>    URLFilters.filter(...)
> >>   (it's not guaranteed that normalizers are always called)
> >> - the minimal solution: add a default rule to regex-urlfilter.txt.template
> >>   to limit the length to 512 (or 1024/2048) characters
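[Editor's note: the minimal solution mentioned above can be written as a single deny rule near the top of regex-urlfilter.txt, which is evaluated first-match-wins. The exact rule below is a sketch using the 512-character limit floated in this thread, not something shipped with Nutch.]

```
# Reject any URL longer than 512 characters. Rules are applied in order
# and the first match wins, so this must come before the final
# accept-everything rule (+.).
-^.{513,}$
```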
> >>
> >>
> >> Best,
> >> Sebastian
> >>
> >> [1] https://github.com/DigitalPebble/storm-crawler/blob/master/archetype/src/main/resources/archetype-resources/src/main/resources/urlfilters.json
> >>
> >>
> >>
> >> On 03/12/2018 02:02 PM, Yossi Tamari wrote:
> >>> The other properties in this section actually affect parsing (e.g.
> >>> db.max.outlinks.per.page). I was under the impression that this is what
> >>> db.max.anchor.length was supposed to do, and actually increased its
> >>> value. Turns out this is one of the many things in Nutch that are not
> >>> intuitive (or, in this case, do nothing at all).
> >>> One of the reasons I thought so is that very long links can be used as
> >>> an attack on crawlers.
> >>> Personally, I still think the property should be used to limit outlink
> >>> length in parsing, but if that is not what it's supposed to do, I guess
> >>> it needs to be renamed (to match the code), moved to a different section
> >>> of the properties file, and perhaps better documented. In that case,
> >>> you'll need to use Markus' solution, and basically everybody should use
> >>> Markus' first rule...
> >>>
> >>>> -----Original Message-----
> >>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
> >>>> Sent: 12 March 2018 14:51
> >>>> To: user@nutch.apache.org
> >>>> Subject: Re: UrlRegexFilter is getting destroyed for
> >>>> unrealistically long links
> >>>>
> >>>> So, what is the conclusion?
> >>>>
> >>>> Should it be solved in the regex file or through this property?
> >>>>
> >>>> Though, how is a crawldb/linkdb property supposed to prevent this
> >>>> problem in Parse?
> >>>>
> >>>> Sent: Monday, March 12, 2018 at 1:42 PM
> >>>> From: "Edward Capriolo" <edlinuxg...@gmail.com>
> >>>> To: "user@nutch.apache.org" <user@nutch.apache.org>
> >>>> Subject: Re: UrlRegexFilter is getting destroyed for
> >>>> unrealistically long links
> >>>>
> >>>> Some regular expressions (those with backtracking) can be very
> >>>> expensive for long strings:
> >>>>
> >>>> https://regular-expressions.mobi/catastrophic.html?wlr=1
> >>>>
> >>>> Maybe that is your issue.
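[Editor's note: the backtracking behaviour referred to above can be reproduced in a few lines. The pattern and input below are purely illustrative; they are not taken from Nutch's filter rules.]

```java
import java.util.regex.Pattern;

// Illustration of catastrophic backtracking: a nested quantifier such as
// (a+)+ forces the engine to try every way of splitting the run of 'a's
// once the overall match fails, so the work grows exponentially with the
// input length. A 500,000-character URL fed to such a pattern explains a
// parse that "takes 3+ hours".
public class BacktrackDemo {
    public static void main(String[] args) {
        Pattern risky = Pattern.compile("(a+)+b");
        // 20 'a's and no 'b': already ~half a million backtracking steps.
        // Each extra 'a' roughly doubles the time; try 30 at your own risk.
        String input = "aaaaaaaaaaaaaaaaaaaaX";
        long t0 = System.nanoTime();
        boolean matched = risky.matcher(input).matches();
        System.out.println("matched=" + matched + " in "
                + (System.nanoTime() - t0) / 1000 + " us");
    }
}
```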
> >>>>
> >>>> On Monday, March 12, 2018, Sebastian Nagel
> >>>> <wastl.na...@googlemail.com>
> >>>> wrote:
> >>>>
> >>>>> Good catch. It should be renamed to be consistent with other
> >>>>> properties, right?
> >>>>>
> >>>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> >>>>>> Perhaps; however, it starts with db, not linkdb (like the other
> >>>>>> linkdb properties), it is in the CrawlDB part of nutch-default.xml,
> >>>>>> and the LinkDB code uses the property name linkdb.max.anchor.length.
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>>>>>> Sent: 12 March 2018 14:05
> >>>>>>> To: user@nutch.apache.org
> >>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>> unrealistically
> >>>>> long links
> >>>>>>>
> >>>>>>> That is for the LinkDB.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original message-----
> >>>>>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>>>>>>> Sent: Monday 12th March 2018 13:02
> >>>>>>>> To: user@nutch.apache.org
> >>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>>> unrealistically long links
> >>>>>>>>
> >>>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length.
> >>>>>>>> Copy/paste error...
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>>>>>>>> Sent: 12 March 2018 14:01
> >>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>>>> unrealistically long links
> >>>>>>>>>
> >>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> >>>>>>>>>   maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> >>>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
> >>>>>>>>>   int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -----Original message-----
> >>>>>>>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>>>>>>>>> Sent: Monday 12th March 2018 12:56
> >>>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>>>>>>> unrealistically long links
> >>>>>>>>>>
> >>>>>>>>>> nutch-default.xml contains a property db.max.outlinks.per.page,
> >>>>>>>>>> which I think is supposed to prevent these cases. However, I just
> >>>>>>>>>> searched the code and couldn't find where it is used. Bug?
> >>>>>>>>>>
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
> >>>>>>>>>>> Sent: 12 March 2018 12:47
> >>>>>>>>>>> To: usernutch.apache.org <user@nutch.apache.org>
> >>>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for
> >>>>>>>>>>> unrealistically long links
> >>>>>>>>>>>
> >>>>>>>>>>> Dear all,
> >>>>>>>>>>>
> >>>>>>>>>>> There is an issue with UrlRegexFilter and parsing. On
> >>>>>>>>>>> average, parsing takes about 1 millisecond, but sometimes
> >>>>>>>>>>> websites have crazy links that destroy the parsing (it takes
> >>>>>>>>>>> 3+ hours and breaks the next steps of the crawling).
> >>>>>>>>>>> For example, below you can see a shortened logged version of
> >>>>>>>>>>> a URL with an encoded image; the real length of the link is
> >>>>>>>>>>> 532572 characters.
> >>>>>>>>>>>
> >>>>>>>>>>> Any idea what I should do with such behavior? Should I
> >>>>>>>>>>> modify the plugin to reject links with length > MAX, or use
> >>>>>>>>>>> more complex logic/check extra configuration?
> >>>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
> >>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filtering
> >>>>>>>>>>> and normalization
> >>>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
> >>>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> >>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> >>>>>>>>>>> filter for url
> >>>>>>>>>>> https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> >>>>>>>>>>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> >>>>>>>>>>> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> >>>>>>>>>>> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> >>>>>>>>>>> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> >>>>>>>>>>> dbnu50253lju... [532572 characters]
> >>>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
> >>>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filtering
> >>>>>>>>>>> and normalization
> >>>>>>>>>>>
> >>>>>>>>>>> Semyon.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Sorry this was sent from mobile. Will do less grammar and spell
> >>>> check than usual.
> >>>
> >
> >

