Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
is that it > seems like I am not able to reopen a closed/resolved issue. Sorry... > >> -Original Message- >> From: Sebastian Nagel >> Sent: 12 March 2018 17:39 >> To: user@nutch.apache.org >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistical

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
12 March 2018 17:39 > To: user@nutch.apache.org > Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long > links > > >> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length > > OK, agreed, but it should also be moved to the LinkDB se

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
-Original Message----- >> From: Sebastian Nagel >> Sent: 12 March 2018 15:57 >> To: user@nutch.apache.org >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long >> links >> >> Hi Semyon, Yossi, Markus, >> >>> what db.max.

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
Normalize method or do it before calling it. Yossi. > -Original Message- > From: Sebastian Nagel > Sent: 12 March 2018 15:57 > To: user@nutch.apache.org > Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long > links > > Hi Semyon, Yo

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
cally everybody should use Markus' first rule... > >> -Original Message----- >> From: Semyon Semyonov >> Sent: 12 March 2018 14:51 >> To: user@nutch.apache.org >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long >> links >

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
> Though, how the property of crawldb/linkdb suppose to prevent this problem in >> Parse? >> >> Sent: Monday, March 12, 2018 at 1:42 PM >> From: "Edward Capriolo" >> To: "user@nutch.apache.org" >> Subject: Re: UrlRegexFilter is getting destroyed for un

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
ent: 12 March 2018 14:51 > To: user@nutch.apache.org > Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long > links > > So, which is the conclusion? > > Should it be solved in regex file or through this property? > > Though, how the property of c

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
g" Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links Some regular expressions (those with backtracking) can be very expensive for long strings: https://regular-expressions.mobi/catastrophic.html?wlr=1 Maybe that is your issue. On Monday, March 12, 2018, Sebastian Nagel wrot
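A minimal sketch of the idea discussed in this thread: reject over-long URLs before any regular expression sees them, so a backtracking rule never runs on a pathological input. This is not Nutch's API. The class name `UrlLengthGuard`, the 2048-character cap, and the single pattern are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks URL length first,
// then applies one illustrative regex rule.
public class UrlLengthGuard {
    // Assumed cap for illustration; not a Nutch default.
    static final int MAX_URL_LENGTH = 2048;

    // Illustrative rule only; real filter files contain many rules.
    static final Pattern RULE = Pattern.compile("^https?://[^/\\s]+(/\\S*)?$");

    // Returns false for null or over-long URLs without touching the regex.
    static boolean accept(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return false;
        }
        return RULE.matcher(url).matches();
    }

    // Small helper to build a long string without relying on Java 11's String.repeat.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/page"));
        System.out.println(accept("http://example.com/" + repeat('a', 5000)));
    }
}
```

The length check is a constant-time comparison, so its cost does not depend on how the regex rules are written. Whether such a check belongs in the filter itself, in a rule in the regex file, or behind a configuration property is the question the thread debates.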

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Edward Capriolo
> > >> -Original Message- > >> From: Markus Jelsma > >> Sent: 12 March 2018 14:05 > >> To: user@nutch.apache.org > >> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically > long links > >> > >> That is for the Li

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
property name linkdb.max.anchor.length. > >> -Original Message- >> From: Markus Jelsma >> Sent: 12 March 2018 14:05 >> To: user@nutch.apache.org >> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long >> links >> >>

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
> To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long > links > > That is for the LinkDB. > > > > -Original message- > > From:Yossi Tamari > > Sent: Monday 12th March 2018 13:02 > > To: user@nutch

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
That is for the LinkDB. -Original message- > From:Yossi Tamari > Sent: Monday 12th March 2018 13:02 > To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long > links > > Sorry, not db.max.outlinks.per.page, db.ma

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
inal Message- >> From: Semyon Semyonov >> Sent: 12 March 2018 12:47 >> To: user@nutch.apache.org >> Subject: UrlRegexFilter is getting destroyed for unrealistically long links >> >> Dear all, >> >> There is an issue with UrlRegexFilter and parsing.

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste error... > -Original Message- > From: Markus Jelsma > Sent: 12 March 2018 14:01 > To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long > links > >

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
job.getInt("db.max.outlinks.per.page", 100); -Original message- > From:Yossi Tamari > Sent: Monday 12th March 2018 12:56 > To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long > links > > Nutch.default contains a propert

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
rnutch.apache.org > Subject: UrlRegexFilter is getting destroyed for unrealistically long links > > Dear all, > > There is an issue with UrlRegexFilter and parsing. On average, parsing takes > about 1 millisecond, but sometimes websites have crazy links that > destroy

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
Hello - see inline. Regards, Markus -Original message- > From:Semyon Semyonov > Sent: Monday 12th March 2018 11:47 > To: user@nutch.apache.org > Subject: UrlRegexFilter is getting destroyed for unrealistically long links > > Dear all, > > There is an issue

UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
Dear all, There is an issue with UrlRegexFilter and parsing. On average, parsing takes about 1 millisecond, but sometimes websites have crazy links that destroy the parsing (it takes 3+ hours and destroys the next steps of the crawl). For example, below you can see shortened logged versio