Dear all,

There is an issue with UrlRegexFilter and parsing. On average, parsing takes 
about 1 millisecond, but some websites contain pathological links that stall 
it: parsing takes 3+ hours and breaks the subsequent steps of the crawl. 
For example, below is a shortened logged version of a URL with an encoded 
image; the real length of the link is 532572 characters.
 
Any idea how I should handle this behavior? Should I modify the plugin to 
reject links with length > MAX, or use more complex logic/check extra 
configuration?

2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat: 
ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization 
2018-03-10 23:39:52,178 INFO [main] 
org.apache.nutch.urlfilter.api.RegexURLFilterBase: 
ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url 
:https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju...
 [532572 characters]
2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat: 
ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization 
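
The first option above (reject links with length > MAX) could be sketched 
roughly as follows. This is only an illustration, not existing plugin code: 
the class name MaxLengthUrlFilter and the 2048-character default are 
hypothetical, and the class is standalone rather than wired into Nutch. It 
only borrows the URL filter convention of returning the URL to keep it and 
null to reject it, so the check would run before any regex is applied.

```java
/**
 * Sketch of a length guard to run before regex-based URL filtering.
 * Convention borrowed from URL filters: return the URL to keep it,
 * or null to reject it. Names and limits here are hypothetical.
 */
final class MaxLengthUrlFilter {

    // Hypothetical default limit; tune it to the crawl.
    static final int MAX_URL_LENGTH = 2048;

    private MaxLengthUrlFilter() {
    }

    /** Applies the default limit. */
    static String filter(String url) {
        return filter(url, MAX_URL_LENGTH);
    }

    /**
     * Rejects null URLs and URLs longer than maxLength characters.
     * The check is a constant-time length comparison, so a 532572-character
     * link is dropped without ever reaching a regular expression.
     */
    static String filter(String url, int maxLength) {
        if (url == null || url.length() > maxLength) {
            return null;
        }
        return url;
    }
}
```

With such a guard, MaxLengthUrlFilter.filter(url) would return null for the 
link logged above, and the regex rules would only see URLs that pass it.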

Semyon.
