UrlRegexFilter is getting destroyed for unrealistically long links

Semyon Semyonov Mon, 12 Mar 2018 03:48:07 -0700

Dear all,

There is an issue with UrlRegexFilter and parsing. On average, parsing takes 
about 1 millisecond, but some websites contain extremely long links that 
stall it (parsing takes 3+ hours and breaks the subsequent crawl steps). 
For example, below is a shortened version of a logged URL with an embedded 
base64-encoded image; the real length of the link is 532572 characters.
 
Any idea how I should handle this? Should I modify the plugin to 
reject links with length > MAX, or use more complex logic or an extra 
configuration check?

2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat: 
ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization 
2018-03-10 23:39:52,178 INFO [main] 
org.apache.nutch.urlfilter.api.RegexURLFilterBase: 
ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url 
:https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju...
 [532572 characters]
2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat: 
ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization
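
To make the "reject links with length > MAX" option concrete, here is a minimal standalone sketch of a length guard that runs before any regex matching. It is not the Nutch plugin code: the class name LengthGuardedUrlFilter, the 2048-character limit, and the sample pattern are all hypothetical, and the null-means-rejected return value is only assumed to mirror how URL filters signal rejection.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: reject over-long URLs before running any regex on them.
 * Not part of Nutch; names and the limit below are illustrative assumptions.
 */
public class LengthGuardedUrlFilter {
    private final int maxLength;    // assumed configurable limit (MAX)
    private final Pattern allowed;  // stand-in for the real regex rules

    public LengthGuardedUrlFilter(int maxLength, String allowedRegex) {
        this.maxLength = maxLength;
        this.allowed = Pattern.compile(allowedRegex);
    }

    /** Returns the URL if accepted, or null if rejected. */
    public String filter(String url) {
        // Cheap O(1) length check first, so the regex never sees huge input.
        if (url == null || url.length() > maxLength) {
            return null;
        }
        return allowed.matcher(url).find() ? url : null;
    }

    public static void main(String[] args) {
        LengthGuardedUrlFilter f = new LengthGuardedUrlFilter(2048, "^https?://");

        // Build a URL of the size reported in the log (532572 characters).
        StringBuilder sb = new StringBuilder("https://example.org/image/");
        while (sb.length() < 532572) {
            sb.append('A');
        }

        System.out.println("short URL accepted: "
                + (f.filter("https://example.org/page") != null));
        System.out.println("long URL accepted:  "
                + (f.filter(sb.toString()) != null));
    }
}
```

The length test costs nothing compared with regex evaluation, so it would presumably bound the worst case regardless of which patterns are configured; whether the limit belongs in the plugin or in a separate configuration property is the open question.
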


Semyon.
