Hi Markus

On 18 July 2011 23:46, Markus Jelsma <[email protected]> wrote:
> I've modified the regular expression in OutlinkExtractor not to allow URI
> schemes other than http:// and I can confirm a significant increase in
> throughput.

Can't remember how the OutlinkExtractor works, but are relative URLs already
normalised into full form at that stage? Bear in mind that we also handle
other protocols such as file://, ftp:// and https://, so it is not only
about http://.

> The previous parse/reduce took ages and had only ~600,000 random internet
> documents to process. Another parse/reduce did it in less than half the
> time and had 33% more documents. Instead of countless exceptions this
> produces less than 10 for all documents.

JIRA + patch? I am sure the outlink extractor could indeed be improved.

> Wouldn't it be a good idea to connect the various URL filters to Nutch's
> own outlink extractor? It shouldn't be hard to create a partial regex from
> some simple URL filters. Since URLs extracted by the regex are still
> processed by filters and/or normalizers, there would be a huge gain in
> throughput when we 1) simplify the regex and 2) stop unwanted URLs and
> 'URLs' at the gate.

Unlike URLNormalizers, URLFilters don't have a scope, so when you apply the
filtering ALL the filters are used; you can't have a specific set of filters
for that particular stage. Why don't you simply specify the regex-based
URLFilters to be applied BEFORE the domain one? It would simply be a matter
of setting something like http://.+ or whatever protocol you are using. This
way you won't get any issues with the DomainURLFilter.

> And how could crawler-commons be fitted into Nutch's outlink extractor, or
> even Tika for HTML documents?

What for? We don't do URL filtering in CC yet. As for the Tika parser, it
has a ContentHandler which extracts the links; we could use this instead of
OutlinkExtractor and see how it fares.

Julien

> > Hi,
> >
> > The reducer of a huge parse takes forever! It trips over numerous URL
> > filter exceptions, mostly stuff like:
> >
> > 2011-07-18 15:07:15,360 ERROR
> > org.apache.nutch.urlfilter.domain.DomainURLFilter: Could not apply
> > filter on url: Anlagen:AdresseAvans
> > java.net.MalformedURLException: unknown protocol: anlagen
> >
> > I suspect the issue is the OutlinkExtractor being a bit too eager. How
> > about making it a bit more configurable? This is now a real waste of
> > CPU-cycles.
> >
> > Thanks

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
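The kind of change Markus describes (restricting the extraction pattern to
http(s) URLs so that fragments like "Anlagen:AdresseAvans" never reach the
URL filters) could look roughly like the sketch below. This is a minimal
illustration, not the actual Nutch patch or the real OutlinkExtractor regex;
the class name and pattern here are assumptions for demonstration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a simplified outlink pattern that only accepts an
// explicit http:// or https:// scheme, so non-URL tokens such as
// "Anlagen:AdresseAvans" are never extracted and never hit the filters.
public class SimpleOutlinkSketch {

    // Require an http(s) scheme, then consume host/path characters,
    // stopping at whitespace, quotes and angle brackets.
    private static final Pattern HTTP_URL =
        Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = HTTP_URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }
}
```

With a pattern like this, strings that merely contain a colon never reach
DomainURLFilter, so the MalformedURLException noise disappears before
filtering starts; the trade-off, as noted above, is that file:// and ftp://
links would no longer be extracted.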

