Hi Julien,
On Tuesday 19 July 2011 11:20:30 Julien Nioche wrote:
> Hi Markus
>
> On 18 July 2011 23:46, Markus Jelsma <[email protected]> wrote:
> > I've modified the regular expression in OutlinkExtractor to disallow URI
> > schemes other than http://, and I can confirm a significant increase in
> > throughput.
>
> Can't remember how the OutlinkExtractor works but are relative URLs already
> normalised into full form at that stage?
> Bear in mind that we also handle other protocols such as file://, ftp://
> and https://, so it is not only about http://.
Correct. The prefix URL filter's settings could be used. For example, the
current extractor's scheme regex is:
([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]
This can be greatly simplified by tying in the settings from the prefix URL
filter. If one filters for http | https | file, you would get the following
partial regex for the scheme:
(^|[ \t\r\n])((http|https|file):
This would mean no outlinks are extracted at all besides the schemes we want,
which greatly reduces the total number of extracted outlinks. Without this,
the extractor comes up with countless 'URLs' from plain text (or parsed PDFs
etc.) such as:
id:12
says:how
...and other fragments of normal text.
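
For illustration, a rough sketch of how the scheme alternation could be
derived from the prefix filter's configured entries (class and method names
are hypothetical, not the actual patch):

import java.util.List;
import java.util.regex.Pattern;

public class SchemeRegexBuilder {
  // Sketch only: build the scheme part of the outlink regex from the
  // prefixes configured for the prefix URL filter, e.g. http, https, file.
  public static Pattern build(List<String> schemes) {
    StringBuilder alternation = new StringBuilder();
    for (String scheme : schemes) {
      if (alternation.length() > 0) alternation.append('|');
      alternation.append(Pattern.quote(scheme));
    }
    // Only the configured schemes can open a match, so plain-text
    // fragments like "id:12" or "says:how" are never extracted.
    return Pattern.compile("(^|[ \\t\\r\\n])((" + alternation + "):\\S+)");
  }
}

With http, https and file configured this compiles to the partial regex
above, with a crude \S+ tail standing in for the full URL body of the real
regex.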
>
> > The previous parse/reduce took ages with only ~600,000 random internet
> > documents to process. Another parse/reduce finished in less than half
> > the time with 33% more documents. Instead of countless exceptions, this
> > produces fewer than 10 across all documents.
>
> JIRA + patch? I'm sure the outlink extractor could indeed be improved.
Yes, I shall open an issue.
>
> > Wouldn't it be a good idea to connect the various URL filters to Nutch's
> > own outlink extractor? It shouldn't be hard to create a partial regex
> > from some simple URL filters. Since URLs extracted by the regex are
> > still processed by filters and/or normalizers, there would be a huge
> > gain in throughput if we 1) simplify the regex and 2) stop unwanted
> > URLs and 'URLs' at the gate.
>
> Unlike URLNormalisers, URLFilters don't have a realm, so when you apply
> the filtering ALL the filters are used; you can't have a specific set of
> filters for that particular stage.
I know. I shouldn't have mentioned normalizers at all; it only adds confusion.
>
> Why don't you simply specify the regex-based URLFilters to be applied
> BEFORE the domain one? It would simply be a matter of setting something
> like http://.+ or whatever protocol you are using. This way you won't get
> any issues with the DomainFilter.
I can indeed put filters in a specific order, but my goal is to reduce the
number of URLs produced by the extractor. If the extractor produces fewer
unwanted URLs (which are filtered away anyway), we save a lot of cycles.
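
To make the waste concrete, the flow at parse time is roughly the following,
so every bogus candidate still costs a full pass through the filter chain (a
sketch assuming Nutch's URLFilters.filter(String) API; the glue code and
class name are mine):

import java.util.ArrayList;
import java.util.List;
import org.apache.nutch.net.URLFilterException;
import org.apache.nutch.net.URLFilters;

public class FilterPass {
  // Sketch of the wasted work, not the actual parse code: each candidate
  // string the extractor emits, junk or not, runs through ALL filters.
  public static List<String> keepAccepted(URLFilters filters,
      List<String> candidates) {
    List<String> accepted = new ArrayList<String>();
    for (String candidate : candidates) {
      try {
        String url = filters.filter(candidate); // null = rejected by a filter
        if (url != null) {
          accepted.add(url);
        }
      } catch (URLFilterException e) {
        // a filter failed on the candidate; it is dropped either way
      }
    }
    return accepted;
  }
}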
>
> > And how could crawler-commons be fitted into Nutch's own outlink
> > extractor, or even Tika for HTML documents?
>
> What for? We don't do URL filtering in CC yet.
Indeed, not yet. I'll try to come up with a patch, starting by tying the
prefix URL filter's settings into the regex of OutlinkExtractor.
>
> The Tika parser has a ContentHandler which extracts the links; we could
> use this instead of OutlinkExtractor and see how it fares.
>
> Julien
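That sounds worth a try. A minimal sketch of what the Tika side could look
like, using its LinkContentHandler (untested; the glue code is mine, and
relative links would still need resolving against the base URL):

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;

public class TikaLinkExtraction {
  // Sketch only: let Tika's HTML parser collect the links instead of
  // running OutlinkExtractor's regex over the extracted text.
  public static void printLinks(InputStream html) throws Exception {
    LinkContentHandler handler = new LinkContentHandler();
    new HtmlParser().parse(html, handler, new Metadata(), new ParseContext());
    for (Link link : handler.getLinks()) {
      System.out.println(link.getUri()); // may be relative
    }
  }
}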
Thank you for your comments.
Markus
>
> > > Hi,
> > >
> > > The reducer of a huge parse takes forever! It trips over numerous URL
> > > filter exceptions, mostly stuff like:
> > >
> > > 2011-07-18 15:07:15,360 ERROR
> > > org.apache.nutch.urlfilter.domain.DomainURLFilter: Could not apply
> > > filter on url: Anlagen:AdresseAvans
> > > java.net.MalformedURLException: unknown protocol: anlagen
> > >
> > > I suspect the issue is the OutlinkExtractor being a bit too eager. How
> > > about making it a bit more configurable? This is now a real waste of
> > > CPU cycles.
> > >
> > > Thanks