is that it
> seems like I am not able to reopen a closed/resolved issue. Sorry...
>
>> -----Original Message-----
>> From: Sebastian Nagel
>> Sent: 12 March 2018 17:39
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
>> links
>>
>>> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
>> OK, agreed, but it should also be moved to the LinkDB section.
>> -----Original Message-----
>> From: Sebastian Nagel
>> Sent: 12 March 2018 15:57
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
>> links
>>
>> Hi Semyon, Yossi, Markus,
>>
>>> what db.max.
Normalize method or do it before calling it.
Yossi.
> -----Original Message-----
> From: Sebastian Nagel
> Sent: 12 March 2018 15:57
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
> links
>
> Hi Semyon, Yossi, Markus,
cally everybody should use Markus' first rule...
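The snippet above cuts off before showing "Markus' first rule" itself, so the rule text is not recoverable from this thread. As a loudly hedged illustration only, a length-capping rule placed near the top of `conf/regex-urlfilter.txt` would look something like the fragment below (the 400-character threshold is an assumption, not Markus' actual rule); Nutch's regex-urlfilter syntax rejects URLs matching lines that start with `-`:

```
# Hypothetical example (not the actual rule from this thread):
# reject any URL longer than ~400 characters before later,
# potentially backtracking-heavy rules ever see it.
-^.{400,}
```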
>
>> -----Original Message-----
>> From: Semyon Semyonov
>> Sent: 12 March 2018 14:51
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
>> links
>>
>> Though, how is the property of crawldb/linkdb supposed to prevent this
>> problem in Parse?
>>
>> Sent: Monday, March 12, 2018 at 1:42 PM
>> From: "Edward Capriolo"
>> To: "user@nutch.apache.org"
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
> -----Original Message-----
> From: Semyon Semyonov
> Sent: 12 March 2018 14:51
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
> links
>
> So, which is the conclusion?
>
> Should it be solved in the regex file or through this property?
>
> Though, how is the property of crawldb/linkdb supposed to prevent this
> problem in Parse?
To: "user@nutch.apache.org"
Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
Some regular expressions (those with backtracking) can be very expensive for
long strings:
https://regular-expressions.mobi/catastrophic.html?wlr=1
Maybe that is your issue.
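Edward's point can be made concrete. The sketch below is illustrative only (the `(a+)+$` pattern, the class name, and the 400-character cap are assumptions, not code from this thread): a nested quantifier can backtrack exponentially on a long, almost-matching input, which is why filters commonly reject oversized URLs before the regex ever runs.

```java
import java.util.regex.Pattern;

public class SafeUrlFilter {
    // Hypothetical cap; in Nutch the analogous limits come from properties
    // such as db.max.anchor.length discussed in this thread.
    static final int MAX_URL_LENGTH = 400;

    // A pattern with a nested quantifier: harmless on short strings, but it
    // backtracks exponentially on long inputs that almost match.
    static final Pattern RISKY = Pattern.compile("^(a+)+$");

    /** Reject oversized inputs up front so the regex never sees them. */
    static boolean accepts(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return false; // skip the regex entirely
        }
        return RISKY.matcher(url).matches();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) sb.append('a');
        sb.append('b'); // almost matches: worst case for backtracking
        // Without the length guard this match could run for hours;
        // with it, the oversized input is rejected immediately.
        System.out.println(accepts(sb.toString())); // false (too long)
        System.out.println(accepts("aaaa"));        // true
    }
}
```

The design point is that the cheap O(1) length check runs before the potentially exponential regex, so pathological links cost nothing.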
On Monday, March 12, 2018, Sebastian Nagel wrote:
>
> >> -----Original Message-----
> >> From: Markus Jelsma
> >> Sent: 12 March 2018 14:05
> >> To: user@nutch.apache.org
> >> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> >> long links
> >>
> >> That is for the LinkDB.
property name linkdb.max.anchor.length.
>
>> -----Original Message-----
>> From: Markus Jelsma
>> Sent: 12 March 2018 14:05
>> To: user@nutch.apache.org
>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
>> links
>>
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
> links
>
> That is for the LinkDB.
>
> -----Original message-----
> > From: Yossi Tamari
> > Sent: Monday 12th March 2018 13:02
> > To: user@nutch.apache.org
That is for the LinkDB.
-----Original message-----
> From: Yossi Tamari
> Sent: Monday 12th March 2018 13:02
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
> links
>
> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste error...
>> -----Original Message-----
>> From: Semyon Semyonov
>> Sent: 12 March 2018 12:47
>> To: user@nutch.apache.org
>> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>> Dear all,
>>
>> There is an issue with UrlRegexFilter and parsing.
Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste error...
> -----Original Message-----
> From: Markus Jelsma
> Sent: 12 March 2018 14:01
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
> links
>
job.getInt("db.max.outlinks.per.page", 100);
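The `job.getInt("db.max.outlinks.per.page", 100)` line above reads a limit from the Hadoop job configuration with a default of 100. How such a limit gets applied can be sketched roughly as follows; this is plain Java with no Hadoop dependency, and the class and method names are illustrative assumptions, not Nutch's actual parser code:

```java
import java.util.List;

public class OutlinkCap {
    /**
     * Keep at most maxOutlinks entries, mirroring how a limit read via
     * job.getInt("db.max.outlinks.per.page", 100) could be applied.
     * Treating a negative limit as "unlimited" is an assumption here,
     * used only to make the sketch self-contained.
     */
    static List<String> cap(List<String> outlinks, int maxOutlinks) {
        if (maxOutlinks < 0 || outlinks.size() <= maxOutlinks) {
            return outlinks; // nothing to trim
        }
        return outlinks.subList(0, maxOutlinks);
    }

    public static void main(String[] args) {
        List<String> links = List.of("u1", "u2", "u3", "u4");
        System.out.println(cap(links, 2));          // [u1, u2]
        System.out.println(cap(links, -1).size());  // 4
    }
}
```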
-----Original message-----
> From: Yossi Tamari
> Sent: Monday 12th March 2018 12:56
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
> links
>
> Nutch.default contains a property
> To: user@nutch.apache.org
> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>
> Dear all,
>
> There is an issue with UrlRegexFilter and parsing. On average, parsing takes
> about 1 millisecond, but sometimes the websites have crazy links that
> destroy
Hello - see inline.
Regards,
Markus
-----Original message-----
> From: Semyon Semyonov
> Sent: Monday 12th March 2018 11:47
> To: user@nutch.apache.org
> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>
> Dear all,
>
> There is an issue
Dear all,

There is an issue with UrlRegexFilter and parsing. On average, parsing takes
about 1 millisecond, but sometimes the websites have crazy links that destroy
the parsing (it takes 3+ hours and destroys the next steps of the crawl).
For example, below you can see a shortened logged version
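Given the symptom described above (one pathological link stalling the whole parse step), a common defensive move is to drop or truncate links over a fixed length before they ever reach the URL filter chain. The sketch below is an illustration under assumptions, not Nutch code: the 2048-character threshold and the helper name are invented, while the thread itself discusses making such a limit configurable (db.max.anchor.length / linkdb.max.anchor.length).

```java
public class LongLinkGuard {
    // Illustrative threshold; the thread proposes a configurable property
    // (db.max.anchor.length, to be renamed linkdb.max.anchor.length) instead
    // of a hard-coded value like this one.
    static final int MAX_LINK_LENGTH = 2048;

    /** Return null for links too long to be worth filtering or storing. */
    static String guard(String link) {
        if (link == null || link.length() > MAX_LINK_LENGTH) {
            return null; // drop before expensive regex filtering runs
        }
        return link;
    }

    public static void main(String[] args) {
        System.out.println(guard("https://example.org/ok")); // printed as-is
        System.out.println(guard("x".repeat(5000)));         // null
    }
}
```

Dropping before filtering means the cost of a pathological link is a single length check instead of hours of regex backtracking.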