Hello - see inline.

Regards,
Markus
 
-----Original message-----
> From:Semyon Semyonov <semyon.semyo...@mail.com>
> Sent: Monday 12th March 2018 11:47
> To: user@nutch.apache.org <user@nutch.apache.org>
> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> Dear all,
> 
> There is an issue with UrlRegexFilter and parsing. On average, parsing takes 
> about 1 millisecond, but sometimes websites have crazy links that 
> destroy the parsing (it takes 3+ hours and destroys the next steps of the 
> crawling). 

Regarding "destroys the next steps": do you mean that other jobs then also take 
a long time? In that case you have filtering/normalizing enabled for those jobs, 
which you can safely disable. You already filtered/normalized while parsing, so 
there is no need to do it twice or more (except when you have different filters 
depending on the job).

> For example, below you can see a shortened logged version of a URL with an 
> encoded image; the real length of the link is 532572 characters.
>  
> Any idea what I should do about such behavior? Should I modify the plugin to 
> reject links with length > MAX, or use more complex logic/check extra 
> configuration?

We skip all URLs of 512 characters or more by using -.{512,} as the first rule 
in the regex file. We have not seen any problems with skipping those URLs, and 
have not seen any customer URLs that still make sense but are longer than 512 
characters.
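
To illustrate why the position of that rule matters, here is a minimal 
standalone sketch in Python. It is not Nutch code: it only mimics a rule list 
under the assumption that rules are tried in order, that the first pattern 
found anywhere in the URL decides, and that "-" rejects while "+" accepts. The 
names RULES and filter_url and the catch-all "+." rule are hypothetical.

```python
import re

# Hypothetical rule list in the style of a regex URL filter file.
# Assumption: rules are tried top to bottom and the first match decides.
RULES = [
    ("-", re.compile(r".{512,}")),  # length guard from this thread: 512+ characters
    ("+", re.compile(r".")),        # assumed catch-all: accept everything else
]


def filter_url(url, rules=RULES):
    """Return the URL if a '+' rule matches first, otherwise None."""
    for sign, pattern in rules:
        # search() looks for the pattern anywhere in the URL (unanchored).
        if pattern.search(url):
            return url if sign == "+" else None
    # No rule matched at all: treat the URL as rejected.
    return None


if __name__ == "__main__":
    short_url = "https://example.org/page"
    long_url = "https://example.org/image/" + "A" * 600
    print(filter_url(short_url) is not None)  # short URL is kept
    print(filter_url(long_url) is None)       # long URL is dropped
```

Because the length guard comes first, an over-long URL is rejected before any 
later and possibly more expensive pattern gets to look at it.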

> 2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat: 
> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and 
> normalization 
> 2018-03-10 23:39:52,178 INFO [main] 
> org.apache.nutch.urlfilter.api.RegexURLFilterBase: 
> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url 
> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju...
>  [532572 characters]
> 2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat: 
> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization 
> 
> Semyon.
> 
