Dear all,

There is an issue with UrlRegexFilter and parsing. On average, parsing takes about 1 millisecond, but sometimes websites have pathological links that break the parsing (it takes 3+ hours and derails the subsequent crawling steps). For example, below you can see a shortened logged version of a URL with an encoded image; the real length of the link is 532572 characters.

Any idea what I should do about this behavior? Should I modify the plugin to reject links with length > MAX, or use more complex logic / check extra configuration?

2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat: ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization
2018-03-10 23:39:52,178 INFO [main] org.apache.nutch.urlfilter.api.RegexURLFilterBase: ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju... [532572 characters]
2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat: ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization
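For what it's worth, a length pre-check along the lines you suggest seems like the cheapest guard: reject any URL above a configurable maximum before it ever reaches the regex rules, so a half-megabyte data/base64 link can't trigger hours of regex work. Below is a minimal standalone sketch (the class name and the threshold are hypothetical, not part of Nutch); it only mirrors the convention of Nutch URL filters, which return null to reject a URL and the URL itself to accept it:

```java
// Sketch of a length guard to run before regex URL filtering.
// MAX_URL_LENGTH is an assumed example value; in a real plugin it
// would come from the crawl configuration.
public class UrlLengthGuard {

    static final int MAX_URL_LENGTH = 2048;

    // Returns null to reject the URL (the Nutch URLFilter convention),
    // or the URL unchanged so downstream regex filters can process it.
    public static String filter(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            // Oversized links (e.g. inlined base64 images) are dropped
            // here instead of being fed to the regex engine.
            return null;
        }
        return url;
    }
}
```

Ordering such a check before urlfilter-regex in the filter chain (or adding it at the top of the plugin itself) would keep the average 1 ms cost while cutting off the pathological cases.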
Semyon.