UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
Dear all, There is an issue with UrlRegexFilter and parsing. On average, parsing takes about 1 millisecond, but some websites have pathological links that break the parsing (it takes 3+ hours and derails the next steps of the crawl).  For example, below you can see shortened logged

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage =

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
Hello - see inline. Regards, Markus -Original message- > From:Semyon Semyonov > Sent: Monday 12th March 2018 11:47 > To: usernutch.apache.org > Subject: UrlRegexFilter is getting destroyed for unrealistically long links > > Dear all, >

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
Perhaps, however it starts with db, not linkdb (like the other linkdb properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB code uses the property name linkdb.max.anchor.length. > -Original Message- > From: Markus Jelsma > Sent: 12 March

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
So, what is the conclusion? Should it be solved in the regex file or through this property? And how is a crawldb/linkdb property supposed to prevent this problem in Parse? Sent: Monday, March 12, 2018 at 1:42 PM From: "Edward Capriolo" To: "user@nutch.apache.org"

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
nutch-default.xml contains a property db.max.outlinks.per.page, which I think is supposed to prevent these cases. However, I just searched the code and couldn't find where it is used. Bug? > -Original Message- > From: Semyon Semyonov > Sent: 12 March 2018 12:47 >

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
That is for the LinkDB. -Original message- > From:Yossi Tamari > Sent: Monday 12th March 2018 13:02 > To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long > links > > Sorry, not db.max.outlinks.per.page,

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
Hi Yossi, it's used in FetcherThread and ParseOutputFormat: git grep -F db.max.outlinks.per.page However, it's not to limit the length of single outlink in characters but the number of outlinks followed (added to CrawlDb). There was NUTCH-1106 to add a property to limit the outlink length.
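The distinction Sebastian draws can be sketched in a few lines. This is an illustrative reconstruction, not Nutch's actual code: the property name `db.max.outlinks.per.page` and its default of 100 come from the grep output in this thread, while the negative-means-unlimited convention and the helper below are assumptions for the sake of the example.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkLimitDemo {

    // Illustrative stand-in for the logic behind
    // conf.getInt("db.max.outlinks.per.page", 100): it caps the NUMBER
    // of outlinks kept per page, never the LENGTH of any single URL.
    static List<String> limitOutlinks(List<String> outlinks, int maxOutlinksPerPage) {
        if (maxOutlinksPerPage < 0) {
            return outlinks; // assumed convention: negative means "no limit"
        }
        return new ArrayList<>(outlinks.subList(0,
                Math.min(outlinks.size(), maxOutlinksPerPage)));
    }

    public static void main(String[] args) {
        List<String> links = List.of("http://a/", "http://b/", "http://c/");
        System.out.println(limitOutlinks(links, 2).size()); // prints 2

        // A single absurdly long URL still passes through untouched,
        // which is why this property cannot prevent the reported problem:
        String longUrl = "http://example.com/" + "x".repeat(10000);
        System.out.println(limitOutlinks(List.of(longUrl), 100).get(0).length());
    }
}
```

So even with a strict per-page cap, one pathological URL still reaches the regex filter, which is why NUTCH-1106 (a separate length limit) is relevant here.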

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Edward Capriolo
Some regular expressions (those with backtracking) can be very expensive for long strings https://regular-expressions.mobi/catastrophic.html?wlr=1 Maybe that is your issue. On Monday, March 12, 2018, Sebastian Nagel wrote: > Good catch. It should be renamed to be
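Edward's point is easy to reproduce. A small sketch using the textbook catastrophic pattern from the linked article (not Nutch's actual urlfilter-regex rules): the nested quantifier `(a+)+b` forces Java's backtracking engine to try exponentially many ways to split the run of `a`s once the match fails.

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // Classic catastrophic pattern: nested quantifiers over the same characters.
        Pattern bad = Pattern.compile("(a+)+b");

        // An input that can never match; runtime roughly doubles per extra 'a'.
        // 22 'a's (~4M backtracking states) still finishes quickly; at 60 'a's
        // it would effectively never return -- which is how one very long URL
        // can pin a regex URL filter for hours.
        String input = "a".repeat(22) + "c";

        long t0 = System.nanoTime();
        boolean matched = bad.matcher(input).matches(); // false, after much backtracking
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("matched=" + matched + " in " + ms + " ms");

        // An equivalent pattern without nested quantifiers fails instantly:
        Pattern good = Pattern.compile("a+b");
        System.out.println(good.matcher(input).matches()); // prints false
    }
}
```

This is why either fixing the offending rule in the regex file or rejecting oversized URLs before they reach the filter chain resolves the 3+ hour hangs.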

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste error... > -Original Message- > From: Markus Jelsma > Sent: 12 March 2018 14:01 > To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long >

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
Good catch. It should be renamed to be consistent with other properties, right? On 03/12/2018 01:10 PM, Yossi Tamari wrote: > Perhaps, however it starts with db, not linkdb (like the other linkdb > properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB code > uses the property

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
The other properties in this section actually affect parsing (e.g. db.max.outlinks.per.page). I was under the impression that this is what db.max.anchor.length was supposed to do, and actually increased its value. Turns out this is one of the many things in Nutch that are not intuitive (or in

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
Hi Semyon, Yossi, Markus, > what db.max.anchor.length was supposed to do It's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text). Can we agree to use the term "anchor" in this meaning? At least, that's how it is used in the class Outlink and hopefully throughout

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
Hi Sebastian, I think the simplest (and more solid than modifying the regex) way would be to modify ParseOutputFormat.filterNormalize. As far as I can see, all the URL modifications/filtrations occur there. Therefore, if at the beginning we add to if (fromUrl.equals(toUrl)) {
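Semyon's suggestion amounts to rejecting oversized outlinks cheaply, before the normalizers and regex filters ever see them. A minimal sketch of such a guard, under stated assumptions: the method shape and the `maxOutlinkLength` parameter are illustrative, not the actual Nutch patch tracked in NUTCH-1106, and the real `filterNormalize` does considerably more.

```java
public class OutlinkLengthGuard {

    // Hypothetical simplification of ParseOutputFormat.filterNormalize:
    // returns the (possibly rewritten) URL, or null to drop the outlink.
    static String filterNormalize(String fromUrl, String toUrl, int maxOutlinkLength) {
        // Cheap length check FIRST, so a multi-megabyte URL never reaches
        // the regex filter chain where catastrophic backtracking could occur.
        if (toUrl == null
                || (maxOutlinkLength > 0 && toUrl.length() > maxOutlinkLength)) {
            return null;
        }
        // ... the existing normalizer and URL-filter chain would run here ...
        return toUrl;
    }

    public static void main(String[] args) {
        String longUrl = "http://example.com/" + "x".repeat(100000);
        System.out.println(filterNormalize("http://from/", longUrl, 4096));   // prints null
        System.out.println(filterNormalize("http://from/", "http://ok/", 4096));
    }
}
```

The key design point is ordering: a length comparison is O(1), so placing it ahead of the O(regex) work bounds the worst case regardless of what rules the regex file contains.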

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
> Which property, db.max.outlinks.per.page or db.max.anchor.length? db.max.anchor.length, I already said that when I wrote "db.max.outlinks.per.page" it was a copy/paste error. > I was about renaming db.max.anchor.length -> linkdb.max.anchor.length OK, agreed, but it should also be moved to the

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
>> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length > OK, agreed, but it should also be moved to the LinkDB section in > nutch-default.xml. Yes, of course, plus make the description more explicit. Could you open a Jira issue for this? > It should apply to outlinks received

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
I think the first one should also be handled by reopening NUTCH-2220, which specifically mentions renaming db.max.anchor.length. The problem is that it seems like I am not able to reopen a closed/resolved issue. Sorry... > -Original Message- > From: Sebastian Nagel

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
Hi Yossi, ok, I see, you need administrator privileges to reopen old issues. Done: reopened NUTCH-1106. Opened a new issue NUTCH-2530 instead of reopening NUTCH-2220, to avoid accidentally modifying the release notes, e.g.