> seems like I am not able to reopen a closed/resolved issue. Sorry...
>
>> -----Original Message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>> Sent: 12 March 2018 17:39
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
>> links
> >> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> > OK, agreed, but
inal Message-
>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>> Sent: 12 March 2018 15:57
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
>> links
>>
>> Hi Semyon, Yossi, Markus,
>&
...terNormalize method or do it before calling it.
Yossi.
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 12 March 2018 15:57
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
> links
...ution, and basically everybody should use Markus' first rule...
>
>> -----Original Message-----
>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>> Sent: 12 March 2018 14:51
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
>> links
>> ...how is the property of crawldb/linkdb supposed to prevent this problem in
>> Parse?
>>
>> Sent: Monday, March 12, 2018 at 1:42 PM
>> From: "Edward Capriolo" <edlinuxg...@gmail.com>
>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>>
>
> So, which is the conclusion?
>
> Should it be solved in regex file or through this property?
>
> Though, how is the property of crawldb/linkdb supposed to prevent this
> problem in Parse?
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
Some regular expressions (those with backtracking) can be very expensive for
long strings:
https://regular-expressions.mobi/catastrophic.html?wlr=1
Maybe that is your issue.
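Edward's point about catastrophic backtracking can be reproduced in a few lines of standalone Java (a hypothetical demo, not Nutch code): a pattern with an alternation inside a repeated group explores exponentially many paths on a non-matching input, while the possessive variant of the same quantifier gives up immediately.

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // 20 'a's and no trailing 'c': the backtracking engine
        // tries on the order of 2^20 alternation paths before failing.
        String input = "a".repeat(20);

        Pattern slow = Pattern.compile("(a|a)*c");   // catastrophic on mismatch
        Pattern fast = Pattern.compile("(a|a)*+c");  // possessive: no backtracking

        long t0 = System.nanoTime();
        boolean m1 = slow.matcher(input).matches();
        long slowMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        boolean m2 = fast.matcher(input).matches();
        long fastMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println(m1 + " " + m2);  // prints "false false"
        System.out.println("slow: " + slowMs + " ms, fast: " + fastMs + " ms");
    }
}
```

Adding a few more 'a' characters roughly doubles the slow pattern's runtime each time, which is why very long URLs can stall a regex URL filter.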
...property name linkdb.max.anchor.length.
> >
> >> -----Original Message-----
> >> From: Markus Jelsma <markus.jel...@openindex.io>
> >> Sent: 12 March 2018 14:05
> >> To: user@nutch.apache.org
> >> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
> >> links
That is for the LinkDB.

-----Original message-----
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Monday 12th March 2018 13:02
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
> links
>
> Sorry, n...
Hi Yossi,
it's used in FetcherThread and ParseOutputFormat:
git grep -F db.max.outlinks.per.page
However, it's not meant to limit the length of a single outlink in characters
but the number of outlinks followed (added to CrawlDb).
There was NUTCH-1106 to add a property to limit the outlink length.
> ...t.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>
>
>
>
> -----Original message-----
> > From:Yossi Tamari <yossi.tam...@pipl.com>
> > Sent: Monday 12th March 2018 12:56
> > To: user@nutch.apache.org
> > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
> > links
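As Sebastian notes above, db.max.outlinks.per.page caps how many outlinks are kept per page, not how long each one may be; a separate length limit is what NUTCH-1106 is about. The two checks can be sketched as follows (a hypothetical helper with made-up limits, not the actual Nutch code):

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkLimiter {
    /**
     * Keep at most maxPerPage outlinks, and drop any single outlink longer
     * than maxUrlLength characters. Illustrative sketch of the two limits
     * discussed in the thread; names and values are assumptions.
     */
    static List<String> limit(List<String> outlinks, int maxPerPage, int maxUrlLength) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (kept.size() >= maxPerPage) break;      // count cap (db.max.outlinks.per.page)
            if (url.length() > maxUrlLength) continue; // length cap (the NUTCH-1106 idea)
            kept.add(url);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
            "http://example.com/a",
            "http://example.com/" + "x".repeat(5000), // unrealistically long
            "http://example.com/b",
            "http://example.com/c");
        System.out.println(limit(links, 2, 1000));
        // prints [http://example.com/a, http://example.com/b]
    }
}
```

Dropping overlong URLs before they ever reach the regex filter also sidesteps the catastrophic-backtracking problem mentioned earlier in the thread.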
nutch-default.xml contains a property db.max.outlinks.per.page, which I think is
supposed to prevent these cases. However, I just searched the code and couldn't
find where it is used. Bug?
> -----Original Message-----
> From: Semyon Semyonov
> Sent: 12 March 2018 12:47
>
Hello - see inline.
Regards,
Markus
-----Original message-----
> From:Semyon Semyonov
> Sent: Monday 12th March 2018 11:47
> To: user@nutch.apache.org
> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>
> Dear all,
>