I've modified the regular expression in OutlinkExtractor to disallow URI schemes other than http://, and I can confirm a significant increase in throughput.
The previous parse/reduce took ages and had only ~600,000 random internet documents to process. Another parse/reduce finished in less than half the time with 33% more documents. Instead of countless exceptions, this produces fewer than 10 across all documents. Wouldn't it be a good idea to connect the various URL filters to Nutch's own outlink extractor? It shouldn't be hard to derive a partial regex from some simple URL filters. Since URLs extracted by the regex are still processed by filters and/or normalizers, there would be a huge gain in throughput if we 1) simplify the regex and 2) stop unwanted URLs and 'URLs' at the gate. And how could crawler-commons be fitted into Nutch's outlink extractor, or even Tika for HTML documents?

> Hi,
>
> The reducer of a huge parse takes forever! It trips over numerous URL
> filter exceptions, mostly stuff like:
>
> 2011-07-18 15:07:15,360 ERROR
> org.apache.nutch.urlfilter.domain.DomainURLFilter: Could not apply filter
> on url: Anlagen:AdresseAvans
> java.net.MalformedURLException: unknown protocol: anlagen
>
> I suspect the issue is the OutlinkExtractor, being a bit too eager. How
> about making it a bit more configurable? This is now a real waste of
> CPU-cycles.
>
> Thanks
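For illustration, the scheme restriction I'm describing amounts to something like the sketch below. This is not the actual OutlinkExtractor patch, just a minimal standalone example; the class name, the exact pattern, and the inclusion of https are my assumptions. The point is that tokens like "Anlagen:AdresseAvans" never match, so they never reach the URL filters as MalformedURLExceptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleOutlinkExtractor {

    // Hypothetical simplified pattern: only http:// (and, as an assumption,
    // https://) links are extracted, instead of matching arbitrary schemes
    // the way the stock extractor's broader regex does.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "https?://[\\w.-]+(?::\\d+)?(?:/[^\\s\"'<>]*)?",
            Pattern.CASE_INSENSITIVE);

    /** Returns all http(s) links found in the given text. */
    public static List<String> extractOutlinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // "Anlagen:AdresseAvans" looks scheme-like but is not extracted,
        // so no MalformedURLException ever reaches the filter chain.
        String sample = "See http://example.com/page and Anlagen:AdresseAvans.";
        for (String link : extractOutlinks(sample)) {
            System.out.println(link);
        }
    }
}
```

Anything the regex lets through would still pass through the normal filter/normalizer chain, so this only cuts the junk at the gate rather than changing filtering semantics.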

