Perhaps, but it starts with db, not linkdb (unlike the other LinkDB 
properties): it is defined in the CrawlDB section of nutch-default.xml, while 
the LinkDB code reads the property name linkdb.max.anchor.length.
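
If so, a value set for db.max.anchor.length in nutch-default.xml would never 
reach the LinkDB code, which would silently fall back to its hard-coded 
default. A minimal illustration of such a key mismatch (a sketch, not the 
actual Nutch source):

    // nutch-default.xml defines db.max.anchor.length, but if the code
    // asks for a different key, the configured value is silently ignored:
    int maxAnchorLength = conf.getInt("linkdb.max.anchor.length", 100);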

> -----Original Message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: 12 March 2018 14:05
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> That is for the LinkDB.
> 
> 
> 
> -----Original message-----
> > From:Yossi Tamari <yossi.tam...@pipl.com>
> > Sent: Monday 12th March 2018 13:02
> > To: user@nutch.apache.org
> > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > long links
> >
> > Sorry, I meant db.max.anchor.length, not db.max.outlinks.per.page.
> > Copy/paste error...
> >
> > > -----Original Message-----
> > > From: Markus Jelsma <markus.jel...@openindex.io>
> > > Sent: 12 March 2018 14:01
> > > To: user@nutch.apache.org
> > > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > > long links
> > >
> > > scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> > > maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> > > scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
> > > int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
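> > >
> > > So it is read by both the fetcher and the parse output writer. Note
> > > that it caps how many outlinks are kept per page, not how long each
> > > URL may be; roughly (paraphrased sketch, not the exact source):
> > >
> > >     // a negative value is commonly treated as "keep all outlinks"
> > >     int maxOutlinks = conf.getInt("db.max.outlinks.per.page", 100);
> > >     int toStore = (maxOutlinks < 0) ? links.length
> > >                                     : Math.min(maxOutlinks, links.length);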
> > >
> > > -----Original message-----
> > > > From:Yossi Tamari <yossi.tam...@pipl.com>
> > > > Sent: Monday 12th March 2018 12:56
> > > > To: user@nutch.apache.org
> > > > Subject: RE: UrlRegexFilter is getting destroyed for
> > > > unrealistically long links
> > > >
> > > > nutch-default.xml contains a property db.max.outlinks.per.page,
> > > > which I think is supposed to prevent these cases. However, I just
> > > > searched the code and couldn't find where it is used. Bug?
> > > >
> > > > > -----Original Message-----
> > > > > From: Semyon Semyonov <semyon.semyo...@mail.com>
> > > > > Sent: 12 March 2018 12:47
> > > > > To: user@nutch.apache.org
> > > > > Subject: UrlRegexFilter is getting destroyed for unrealistically
> > > > > long links
> > > > >
> > > > > Dear all,
> > > > >
> > > > > There is an issue with UrlRegexFilter and parsing. On average,
> > > > > parsing takes about 1 millisecond, but sometimes websites have
> > > > > crazy links that destroy the parsing (it takes 3+ hours and
> > > > > derails the next steps of the crawl).
> > > > > For example, below you can see a shortened logged version of a
> > > > > URL with an encoded image; the real length of the link is
> > > > > 532572 characters.
> > > > >
> > > > > Any idea what I should do with such behavior? Should I modify
> > > > > the plugin to reject links with length > MAX (a rough sketch
> > > > > follows the log excerpt below), or use more complex logic /
> > > > > check extra configuration?
> > > > > 2018-03-10 23:39:52,082 INFO [main]
> > > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > > ParseOutputFormat.Write.filterNormalize 4.3. Before filtering and
> > > > > normalization
> > > > > 2018-03-10 23:39:52,178 INFO [main]
> > > > > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > > > > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> > > > > filter for url
> > > > > :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> > > > > UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > > > > Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > > > > X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > > > > efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > > > > dbnu50253lju... [532572 characters]
> > > > > 2018-03-11 03:56:26,118 INFO [main]
> > > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > > ParseOutputFormat.Write.filterNormalize 4.4. After filtering and
> > > > > normalization
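> > > > >
> > > > > As a rough sketch of the "reject by length" idea in a URLFilter
> > > > > (MAX_URL_LENGTH is a hypothetical constant, not a stock Nutch
> > > > > property):
> > > > >
> > > > >     public String filter(String url) {
> > > > >       // Reject absurdly long URLs before any regex runs, so a
> > > > >       // pathological base64 link cannot stall the matcher.
> > > > >       if (url == null || url.length() > MAX_URL_LENGTH) {
> > > > >         return null;
> > > > >       }
> > > > >       // ... existing regex filtering logic ...
> > > > >       return url;
> > > > >     }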
> > > > >
> > > > > Semyon.
> > > >
> > > >
> >
> >
