The other properties in this section actually affect parsing (e.g. 
db.max.outlinks.per.page). I was under the impression that this is what 
db.max.anchor.length was supposed to do, and actually increased its value. 
It turns out this is one of the many things in Nutch that are not intuitive 
(or, in this case, do nothing at all).
One of the reasons I thought so is that very long links can be used as an 
attack on crawlers.
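To make the attack concrete: the fix that works regardless of filter configuration is a constant-time length check in front of the regex filter, so a pathological link never reaches a potentially backtracking pattern. This is a minimal sketch, not Nutch code; MAX_URL_LENGTH and acceptUrl are made-up names for illustration:

```java
import java.util.regex.Pattern;

public class LongUrlGuard {
    // Hypothetical limit for illustration; Nutch does not define this constant.
    private static final int MAX_URL_LENGTH = 2048;

    static boolean acceptUrl(String url, Pattern filter) {
        // Cheap O(1) length check first: a 500k-character link is rejected
        // before the (possibly backtracking) regex ever sees it.
        if (url.length() > MAX_URL_LENGTH) {
            return false;
        }
        return filter.matcher(url).find();
    }

    public static void main(String[] args) {
        Pattern filter = Pattern.compile("^https?://");

        // A normal URL passes the guard and the filter.
        System.out.println(acceptUrl("https://www.sintgoedele.be/", filter));

        // A huge data-URI-style link is rejected by length alone.
        StringBuilder huge = new StringBuilder("https://example.com/");
        for (int i = 0; i < 600_000; i++) {
            huge.append('a');
        }
        System.out.println(acceptUrl(huge.toString(), filter));
    }
}
```

The same guard could live in a URL filter plugin or in ParseOutputFormat before filtering/normalization; the point is only that the length test must run before any regex does.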
Personally, I still think the property should be used to limit outlink length 
in parsing, but if that is not what it's supposed to do, I guess it needs to be 
renamed (to match the code), moved to a different section of the properties 
file, and perhaps better documented. In that case, you'll need to use Markus' 
solution, and basically everybody should use Markus' first rule...
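Markus' exact rule isn't quoted in this thread, but a typical first rule in regexp-urlfilter.txt that rejects overly long URLs before any later (potentially backtracking) pattern runs would look something like this; the 2048-character bound is an arbitrary example, not a Nutch default:

```
# Reject any URL longer than 2048 characters before applying further rules.
# Pick whatever maximum makes sense for your crawl.
-^.{2049,}
```

Because rules are evaluated in order and this pattern cannot backtrack, a 500,000-character link is rejected cheaply instead of feeding an expensive expression later in the file.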

> -----Original Message-----
> From: Semyon Semyonov <semyon.semyo...@mail.com>
> Sent: 12 March 2018 14:51
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> So, which is the conclusion?
> 
> Should it be solved in regex file or through this property?
> 
> Though, how is a crawldb/linkdb property supposed to prevent this problem in
> the Parse step?
> 
> Sent: Monday, March 12, 2018 at 1:42 PM
> From: "Edward Capriolo" <edlinuxg...@gmail.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> Some regular expressions (those with backtracking) can be very expensive for
> long strings.
> 
> https://regular-expressions.mobi/catastrophic.html?wlr=1
> 
> Maybe that is your issue.
> 
> On Monday, March 12, 2018, Sebastian Nagel <wastl.na...@googlemail.com>
> wrote:
> 
> > Good catch. It should be renamed to be consistent with other
> > properties, right?
> >
> > On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> > > Perhaps; however, it starts with db, not linkdb (like the other
> > > linkdb properties), it is in the CrawlDB part of nutch-default.xml,
> > > and the LinkDB code uses the property name linkdb.max.anchor.length.
> > >
> > >> -----Original Message-----
> > >> From: Markus Jelsma <markus.jel...@openindex.io>
> > >> Sent: 12 March 2018 14:05
> > >> To: user@nutch.apache.org
> > >> Subject: RE: UrlRegexFilter is getting destroyed for
> > >> unrealistically
> > long links
> > >>
> > >> That is for the LinkDB.
> > >>
> > >>
> > >>
> > >> -----Original message-----
> > >>> From:Yossi Tamari <yossi.tam...@pipl.com>
> > >>> Sent: Monday 12th March 2018 13:02
> > >>> To: user@nutch.apache.org
> > >>> Subject: RE: UrlRegexFilter is getting destroyed for
> > >>> unrealistically long links
> > >>>
> > >>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
> > >>> paste error...
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Markus Jelsma <markus.jel...@openindex.io>
> > >>>> Sent: 12 March 2018 14:01
> > >>>> To: user@nutch.apache.org
> > >>>> Subject: RE: UrlRegexFilter is getting destroyed for
> > >>>> unrealistically long links
> > >>>>
> > >>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> > >>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> > >>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
> > >>>> int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> -----Original message-----
> > >>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
> > >>>>> Sent: Monday 12th March 2018 12:56
> > >>>>> To: user@nutch.apache.org
> > >>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> > >>>>> unrealistically long links
> > >>>>>
> > >>>>> nutch-default.xml contains a property db.max.outlinks.per.page,
> > >>>>> which I think is supposed to prevent these cases. However, I just
> > >>>>> searched the code and couldn't find where it is used. Bug?
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
> > >>>>>> Sent: 12 March 2018 12:47
> > >>>>>> To: usernutch.apache.org <user@nutch.apache.org>
> > >>>>>> Subject: UrlRegexFilter is getting destroyed for
> > >>>>>> unrealistically long links
> > >>>>>>
> > >>>>>> Dear all,
> > >>>>>>
> > >>>>>> There is an issue with UrlRegexFilter and parsing. On average,
> > >>>>>> parsing takes about 1 millisecond, but sometimes websites have
> > >>>>>> crazy links that destroy the parsing (it takes 3+ hours and
> > >>>>>> destroys the next steps of the crawling).
> > >>>>>> For example, below you can see a shortened logged version of a URL
> > >>>>>> with an encoded image; the real length of the link is 532572
> > >>>>>> characters.
> > >>>>>>
> > >>>>>> Any idea what I should do with such behavior? Should I modify
> > >>>>>> the plugin to reject links with length > MAX, or use more complex
> > >>>>>> logic / check extra configuration?
> > >>>>>> 2018-03-10 23:39:52,082 INFO [main]
> > >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> > >>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing
> > >>>>>> and normalization
> > >>>>>> 2018-03-10 23:39:52,178 INFO [main]
> > >>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > >>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> > >>>>>> filter for url
> > >>>>>>
> > >>>>>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> > >>>>>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > >>>>>> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > >>>>>> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > >>>>>> efuqpJ5ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > >>>>>> dbnu50253lju... [532572 characters]
> > >>>>>> 2018-03-11 03:56:26,118 INFO [main]
> > >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> > >>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> > >>>>>> normalization
> > >>>>>>
> > >>>>>> Semyon.
> > >>>>>
> > >>>>>
> > >>>
> > >>>
> > >
> >
> >
> 
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
