So, what is the conclusion?

Should it be solved in the regex file or through this property?

Also, how is the crawldb/linkdb property supposed to prevent this problem in
Parse?
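
For reference, the regex-file route would be a reject rule for overly long URLs at the very top of conf/regex-urlfilter.txt, ahead of any pattern that can backtrack (rules are applied in order, first match wins; the 2000-character cutoff below is an arbitrary assumption, not a Nutch default):

```
# Reject URLs longer than 2000 characters before any other rule runs.
-^.{2001,}
```

Since `.{2001,}` scans the string once without backtracking, it stays linear even on a 500k-character URL.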

Sent: Monday, March 12, 2018 at 1:42 PM
From: "Edward Capriolo" <edlinuxg...@gmail.com>
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
Some regular expressions (those with backtracking) can be very expensive for
long strings.

https://regular-expressions.mobi/catastrophic.html?wlr=1

Maybe that is your issue.
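
To illustrate: a pattern with nested quantifiers such as `(a+)+b` can take exponential time to fail on long input, so a cheap length check before the regex runs sidesteps the blowup entirely. A minimal sketch (the class name and cutoff are assumptions for illustration, not Nutch code):

```java
import java.util.regex.Pattern;

public class RegexGuard {
    // Hypothetical cutoff; pick whatever limit fits your crawl.
    static final int MAX_URL_LENGTH = 2048;

    // Apply a filter pattern only to URLs of sane length; reject the rest outright.
    static boolean accept(Pattern p, String url) {
        if (url.length() > MAX_URL_LENGTH) {
            return false; // oversized URL: never hand it to the regex engine
        }
        return p.matcher(url).matches();
    }

    public static void main(String[] args) {
        // (a+)+b backtracks catastrophically on long, non-matching input:
        // each extra 'a' roughly doubles the work before the match fails.
        Pattern p = Pattern.compile("(a+)+b");
        String longUrl = "https://example.com/" + "a".repeat(500000);
        System.out.println(accept(p, longUrl)); // false: length guard, regex never runs
        System.out.println(accept(p, "aaab"));  // true
    }
}
```

With the guard in place the 500k-character URL is rejected in constant time instead of tying up a fetcher thread for hours.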

On Monday, March 12, 2018, Sebastian Nagel <wastl.na...@googlemail.com>
wrote:

> Good catch. It should be renamed to be consistent with other properties,
> right?
>
> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> > Perhaps, however it starts with db, not linkdb (like the other linkdb
> properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB
> code uses the property name linkdb.max.anchor.length.
> >
> >> -----Original Message-----
> >> From: Markus Jelsma <markus.jel...@openindex.io>
> >> Sent: 12 March 2018 14:05
> >> To: user@nutch.apache.org
> >> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> long links
> >>
> >> That is for the LinkDB.
> >>
> >>
> >>
> >> -----Original message-----
> >>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>> Sent: Monday 12th March 2018 13:02
> >>> To: user@nutch.apache.org
> >>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> >>> long links
> >>>
> >>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy-paste
> >> error...
> >>>
> >>>> -----Original Message-----
> >>>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>>> Sent: 12 March 2018 14:01
> >>>> To: user@nutch.apache.org
> >>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> >>>> long links
> >>>>
> >>>> scripts/apache-nutch-
> >>>> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> >>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> >>>> scripts/apache-nutch-
> >>>> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
> int
> >>>> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> -----Original message-----
> >>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>>>> Sent: Monday 12th March 2018 12:56
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> >>>>> unrealistically long links
> >>>>>
> >>>>> Nutch.default contains a property db.max.outlinks.per.page, which
> >>>>> I think is
> >>>> supposed to prevent these cases. However, I just searched the code
> >>>> and couldn't find where it is used. Bug?
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
> >>>>>> Sent: 12 March 2018 12:47
> >>>>>> To: usernutch.apache.org <user@nutch.apache.org>
> >>>>>> Subject: UrlRegexFilter is getting destroyed for unrealistically
> >>>>>> long links
> >>>>>>
> >>>>>> Dear all,
> >>>>>>
> >>>>>> There is an issue with UrlRegexFilter and parsing. On average,
> >>>>>> parsing takes about 1 millisecond, but sometimes websites have
> >>>>>> crazy links that destroy the parsing (it takes 3+ hours and breaks
> >>>>>> the next steps of the crawl).
> >>>>>> For example, below you can see a shortened logged version of a URL
> >>>>>> with an encoded image; the real length of the link is 532572
> >>>>>> characters.
> >>>>>>
> >>>>>> Any idea what I should do with such behavior? Should I modify
> >>>>>> the plugin to reject links with length > MAX, or use more complex
> >>>>>> logic/check extra configuration?
> >>>>>> 2018-03-10 23:39:52,082 INFO [main]
> >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> >>>>>> normalization
> >>>>>> 2018-03-10 23:39:52,178 INFO [main]
> >>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> >>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> >>>>>> filter for url
> >>>>>>
> >>>>
> >> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS[https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS]
> >>>>>>
> >>>>
> >> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> >>>>>>
> >>>>
> >> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> >>>>>>
> >>>>
> >> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> >>>>>>
> >>>>
> >> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> >>>>>> dbnu50253lju... [532572 characters]
> >>>>>> 2018-03-11 03:56:26,118 INFO [main]
> >>>>>> org.apache.nutch.parse.ParseOutputFormat:
> >>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> >>>>>> normalization
> >>>>>>
> >>>>>> Semyon.
> >>>>>
> >>>>>
> >>>
> >>>
> >
>
>

--
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.
