I've modified the regular expression in OutlinkExtractor to reject URI 
schemes other than http://, and I can confirm a significant increase in 
throughput. 
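For illustration, the idea is roughly the following: anchor the pattern on an explicit http:// or https:// prefix instead of matching any "word:" scheme. This is a hypothetical, simplified sketch, not the actual Nutch OutlinkExtractor pattern; class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpOnlyOutlinks {
    // Hypothetical simplified pattern: requires a literal http:// or https://
    // prefix, so wiki-style strings like "Anlagen:AdresseAvans" never match
    // and never reach the URL filters as MalformedURLExceptions.
    private static final Pattern HTTP_URL = Pattern.compile(
        "https?://[\\w.-]+(?::\\d+)?(?:/[^\\s\"'<>]*)?");

    public static List<String> extract(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = HTTP_URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "See http://example.org/a and Anlagen:AdresseAvans "
                    + "plus https://example.com:8080/b?q=1";
        // Only the two http(s) links are extracted; the wiki link is ignored.
        System.out.println(extract(page));
    }
}
```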

The previous parse/reduce took ages and had only ~600,000 random internet 
documents to process. Another parse/reduce finished in less than half the time 
and had 33% more documents. Instead of countless exceptions, this produces 
fewer than 10 for all documents.

Wouldn't it be a good idea to connect the various URL filters to Nutch's own 
outlink extractor? It shouldn't be hard to create a partial regex from some 
simple URL filters. Since URLs extracted by the regex are still processed by 
filters and/or normalizers, there would be a huge gain in throughput if we 1) 
simplify the regex and 2) stop unwanted URLs and 'URLs' at the gate.
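As a sketch of "stopping unwanted URLs at the gate": a handful of simple deny rules (of the kind a URL filter config might contain) could be compiled once into a single alternation and applied during extraction, before the full filter chain runs. Everything here is hypothetical, including the class name and the example rules.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class GateFilter {
    // Hypothetical: join several simple deny rules into one pattern,
    // compiled once and reused for every candidate outlink.
    private final Pattern deny;

    public GateFilter(List<String> denyPatterns) {
        String joined = denyPatterns.stream()
            .map(p -> "(?:" + p + ")")
            .collect(Collectors.joining("|"));
        this.deny = Pattern.compile(joined);
    }

    /** Returns true if the URL should pass on to the full filter chain. */
    public boolean accept(String url) {
        return !deny.matcher(url).find();
    }

    public static void main(String[] args) {
        GateFilter gate = new GateFilter(List.of(
            "\\.(?:gif|jpg|png|css|js)$",   // skip static assets
            "^ftp://", "^mailto:"));        // skip non-http schemes
        System.out.println(gate.accept("http://example.org/page.html")); // true
        System.out.println(gate.accept("http://example.org/logo.png"));  // false
    }
}
```

The point is that a single precompiled alternation rejects the obvious junk in one pass, so the per-URL cost of the full filter/normalizer chain is only paid for plausible candidates.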

And how could crawler-commons be fitted into Nutch's outlink extractor, or 
even into Tika for HTML documents?

> Hi,
> 
> The reducer of a huge parse takes forever! It trips over numerous URL
> filter exceptions, mostly stuff like:
> 
> 2011-07-18 15:07:15,360 ERROR
> org.apache.nutch.urlfilter.domain.DomainURLFilter: Could not apply filter
> on url: Anlagen:AdresseAvans
> java.net.MalformedURLException: unknown protocol: anlagen
> 
> I suspect the issue is the OutlinkExtractor, being a bit too eager. How
> about making it a bit more configurable? This is now a real waste of
> CPU cycles.
> 
> Thanks
