Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
> seems like I am not able to reopen a closed/resolved issue. Sorry... > >> -Original Message- >> From: Sebastian Nagel <wastl.na...@googlemail.com> >> Sent: 12 March 2018 17:39 >> To: user@nutch.apache.org >> Subject: Re: UrlRegexFilter is gett

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
<wastl.na...@googlemail.com> > Sent: 12 March 2018 17:39 > To: user@nutch.apache.org > Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long > links > > >> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length > > OK, agreed, but

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
inal Message- >> From: Sebastian Nagel <wastl.na...@googlemail.com> >> Sent: 12 March 2018 15:57 >> To: user@nutch.apache.org >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long >> links >> >> Hi Semyon, Yossi, Markus, >

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
terNormalize method or do it before calling it. Yossi. > -Original Message- > From: Sebastian Nagel <wastl.na...@googlemail.com> > Sent: 12 March 2018 15:57 > To: user@nutch.apache.org > Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long >

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
ution, and basically everybody should use Markus' first rule... > >> -Original Message- >> From: Semyon Semyonov <semyon.semyo...@mail.com> >> Sent: 12 March 2018 14:51 >> To: user@nutch.apache.org >> Subject: Re: UrlRegexFilter is getting destroyed for u

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
w the property of crawldb/linkdb suppose to prevent this problem in >> Parse? >> >> Sent: Monday, March 12, 2018 at 1:42 PM >> From: "Edward Capriolo" <edlinuxg...@gmail.com> >> To: "user@nutch.apache.org" <user@nutch.apache.org> >>

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
m> > Sent: 12 March 2018 14:51 > To: user@nutch.apache.org > Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long > links > > So, which is the conclusion? > > Should it be solved in regex file or through this property? > > Though, how the pro

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
user@nutch.apache.org" <user@nutch.apache.org> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links Some regular expressions (those with backtracking) can be very expensive for long strings: https://regular-expressions.mobi/catastrophic.html?wlr=1 Maybe that is your i
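A minimal sketch of the point about backtracking cost, assuming the simplest possible guard: reject a URL by length before any regular expression sees it, so an expensive pattern never runs on an unrealistically long string. The class name LengthGuardedFilter, the method filter, the example pattern and the 2048-character limit are all hypothetical and are not Nutch code. The sketch only borrows the convention of returning the URL when it is accepted and null when it is rejected.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
class LengthGuardedFilter {

    /**
     * Returns the URL if it is short enough and matches the pattern,
     * otherwise null. The length check runs first, so the regex is
     * never evaluated on an overlong input.
     */
    static String filter(String url, Pattern pattern, int maxLength) {
        if (url.length() > maxLength) {
            return null; // rejected without touching the regex
        }
        return pattern.matcher(url).find() ? url : null;
    }

    public static void main(String[] args) {
        Pattern htmlOnly = Pattern.compile("\\.html$"); // assumed example rule

        // Build an unrealistically long URL (about 10,000 characters).
        StringBuilder sb = new StringBuilder("http://example.com/");
        for (int i = 0; i < 10000; i++) {
            sb.append('x');
        }
        sb.append(".html");

        System.out.println(filter("http://example.com/page.html", htmlOnly, 2048));
        System.out.println(filter(sb.toString(), htmlOnly, 2048));
    }
}
```

The same effect can be had inside a regex rule file with a leading length rule, which appears to be what the thread calls Markus' first rule. Doing the check in code instead makes the cost independent of how the later patterns are written.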

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Edward Capriolo
name linkdb.max.anchor.length. > > > >> -Original Message- > >> From: Markus Jelsma <markus.jel...@openindex.io> > >> Sent: 12 March 2018 14:05 > >> To: user@nutch.apache.org > >> Subject: RE: UrlRegexFilter is getting destroyed for unrealist

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
property name linkdb.max.anchor.length. > >> -Original Message- >> From: Markus Jelsma <markus.jel...@openindex.io> >> Sent: 12 March 2018 14:05 >> To: user@nutch.apache.org >> Subject: RE: UrlRegexFilter is getting destroyed for unrealistic

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
Sent: 12 March 2018 14:05 > To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long > links > > That is for the LinkDB. > > > > -Original message- > > From:Yossi Tamari <yossi.tam...@pipl.com>

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
That is for the LinkDB. -Original message- > From:Yossi Tamari <yossi.tam...@pipl.com> > Sent: Monday 12th March 2018 13:02 > To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long > links > > Sorry, n

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
Hi Yossi,

it's used in FetcherThread and ParseOutputFormat:

  git grep -F db.max.outlinks.per.page

However, it does not limit the length of a single outlink in characters; it limits the number of outlinks followed (added to CrawlDb). There was NUTCH-1106 to add a property to limit the outlink length.
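To make the distinction concrete, here is a rough sketch of the two limits side by side: a cap on how many outlinks are kept per page, and a separate cap on the length of each outlink in characters. The class OutlinkLimits, the method limitOutlinks and its parameters are hypothetical names chosen for illustration. This is not the code in FetcherThread or ParseOutputFormat. Skipping overlong links without counting them against the per-page limit is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not taken from Nutch.
class OutlinkLimits {

    /**
     * Keeps at most maxOutlinks links, skipping any link longer than
     * maxUrlLength characters. A negative maxUrlLength disables the
     * length check (an assumption of this sketch).
     */
    static List<String> limitOutlinks(List<String> outlinks,
                                      int maxOutlinks,
                                      int maxUrlLength) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (kept.size() >= maxOutlinks) {
                break; // count limit: how many outlinks are followed
            }
            if (maxUrlLength >= 0 && url.length() > maxUrlLength) {
                continue; // length limit: how long a single outlink may be
            }
            kept.add(url);
        }
        return kept;
    }

    public static void main(String[] args) {
        // An outlink of roughly 5,000 characters.
        String longUrl = "http://a.example/"
                + new String(new char[5000]).replace('\0', 'x');
        List<String> outlinks = Arrays.asList(
                "http://a.example/", longUrl,
                "http://b.example/", "http://c.example/");

        // Prints [http://a.example/, http://b.example/]
        System.out.println(limitOutlinks(outlinks, 2, 1000));
    }
}
```

With only the count limit in place, the 5,000-character link above would still be passed on to the URL filters, which is the case the original question is about.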

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
t.java:118:int > maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100); > > > > > -Original message- > > From:Yossi Tamari <yossi.tam...@pipl.com> > > Sent: Monday 12th March 2018 12:56 > > To: user@nutch.apache.org >

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
job.getInt("db.max.outlinks.per.page", 100); -Original message- > From:Yossi Tamari <yossi.tam...@pipl.com> > Sent: Monday 12th March 2018 12:56 > To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long > lin

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
nutch-default.xml contains a property db.max.outlinks.per.page, which I think is supposed to prevent these cases. However, I just searched the code and couldn't find where it is used. Bug? > -Original Message- > From: Semyon Semyonov > Sent: 12 March 2018 12:47 >

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
Hello - see inline. Regards, Markus -Original message- > From:Semyon Semyonov > Sent: Monday 12th March 2018 11:47 > To: user@nutch.apache.org > Subject: UrlRegexFilter is getting destroyed for unrealistically long links > > Dear all, >