Hi Julien,
On Tuesday 19 July 2011 11:20:30 Julien Nioche wrote:
> Hi Markus
>
> On 18 July 2011 23:46, Markus Jelsma <[email protected]> wrote:
> > I've modified the regular expression in OutlinkExtractor to disallow URI
> > schemes other than http://, and I can confirm a significant increase in
> > throughput.
>
> Can't remember how the OutlinkExtractor works but are relative URLs already
> normalised into full form at that stage?
> Bear in mind that we also handle other protocols such as file://, ftp://
> and https://, so it is not only about http://.
Correct. The prefix URL filter's settings could be used. For example, the
current extractor's scheme regex is:
([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]
This can be greatly simplified by tying in the settings from the prefix URL
filter. If one filters for http | https | file, you would get the following
partial regex for the scheme:
(^|[ \t\r\n])((http|https|file):
This would mean no outlinks are extracted at all besides the schemes we want,
which greatly reduces the total number of extracted outlinks. Without this,
the extractor comes up with countless 'URLs' from plain text (or parsed PDFs
etc.) such as:
id:12
says:how
...and other fragments of normal text.
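
For illustration, a rough sketch of how the scheme alternation could be
derived from the prefix filter's configured entries (class and method names
are hypothetical, not the actual patch):

import java.util.List;
import java.util.regex.Pattern;

public class SchemeRegexBuilder {
  // Sketch only: build the scheme part of the outlink regex from the
  // prefixes configured for the prefix URL filter, e.g. http, https, file.
  public static Pattern build(List<String> schemes) {
    StringBuilder alternation = new StringBuilder();
    for (String scheme : schemes) {
      if (alternation.length() > 0) alternation.append('|');
      alternation.append(Pattern.quote(scheme));
    }
    // Only the configured schemes can open a match, so plain-text
    // fragments like "id:12" or "says:how" are never extracted.
    return Pattern.compile("(^|[ \\t\\r\\n])((" + alternation + "):\\S+)");
  }
}

With http, https and file configured this compiles to the partial regex
above, with a crude \S+ tail standing in for the full URL body of the real
regex.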
>
> > The previous parse/reduce took ages with only ~600,000 random internet
> > documents to process. Another parse/reduce finished in less than half
> > the time with 33% more documents. Instead of countless exceptions, this
> > produces fewer than 10 across all documents.
>
> JIRA + patch? I'm sure the outlink extractor could indeed be improved.
Yes, I shall open an issue.
>
> > Wouldn't it be a good idea to connect the various URL filters to Nutch's
> > own outlink extractor? It shouldn't be hard to create a partial regex
> > from some simple URL filters. Since URLs extracted by the regex are
> > still processed by filters and/or normalizers, there would be a huge
> > gain in throughput if we 1) simplify the regex and 2) stop unwanted
> > URLs and 'URLs' at the gate.
>
> Unlike URLNormalisers, URLFilters don't have a realm, so when you apply
> the filtering ALL the filters are used; you can't have a specific set of
> filters for that particular stage.
I know. I shouldn't have mentioned normalizers at all; it only adds confusion.
>
> Why don't you simply specify the regex-based URLFilters to be applied
> BEFORE the domain one? It would simply be a matter of setting something
> like http://.+ or whatever protocol you are using. This way you won't get
> any issues with the DomainFilter.
I can indeed put filters in a specific order, but my goal is to reduce the
number of URLs produced by the extractor. If the extractor produces fewer
unwanted URLs (which are filtered away anyway), we save a lot of cycles.
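
To make the waste concrete, the flow at parse time is roughly the following,
so every bogus candidate still costs a full pass through the filter chain (a
sketch assuming Nutch's URLFilters.filter(String) API; the glue code and
class name are mine):

import java.util.ArrayList;
import java.util.List;
import org.apache.nutch.net.URLFilterException;
import org.apache.nutch.net.URLFilters;

public class FilterPass {
  // Sketch of the wasted work, not the actual parse code: each candidate
  // string the extractor emits, junk or not, runs through ALL filters.
  public static List<String> keepAccepted(URLFilters filters,
      List<String> candidates) {
    List<String> accepted = new ArrayList<String>();
    for (String candidate : candidates) {
      try {
        String url = filters.filter(candidate); // null = rejected by a filter
        if (url != null) {
          accepted.add(url);
        }
      } catch (URLFilterException e) {
        // a filter failed on the candidate; it is dropped either way
      }
    }
    return accepted;
  }
}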
>
> > And how could crawler-commons be fitted into Nutch's own outlink
> > extractor, or even Tika for HTML documents?
>
> What for? We don't do URL filtering in CC yet.
Indeed, not yet. I'll try to come up with a patch, starting by tying the
prefix URL filter's settings into the regex of OutlinkExtractor.
>
> The Tika parser has a ContentHandler which extracts the links; we could
> use this instead of OutlinkExtractor and see how it fares.
>
> Julien
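That sounds worth a try. A minimal sketch of what the Tika side could look
like, using its LinkContentHandler (untested; the glue code is mine, and
relative links would still need resolving against the base URL):

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;

public class TikaLinkExtraction {
  // Sketch only: let Tika's HTML parser collect the links instead of
  // running OutlinkExtractor's regex over the extracted text.
  public static void printLinks(InputStream html) throws Exception {
    LinkContentHandler handler = new LinkContentHandler();
    new HtmlParser().parse(html, handler, new Metadata(), new ParseContext());
    for (Link link : handler.getLinks()) {
      System.out.println(link.getUri()); // may be relative
    }
  }
}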
Thank you for your comments.
Markus
>
> > > Hi,
> > >
> > > The reducer of a huge parse takes forever! It trips over numerous URL
> > > filter exceptions, mostly stuff like:
> > >
> > > 2011-07-18 15:07:15,360 ERROR
> > > org.apache.nutch.urlfilter.domain.DomainURLFilter: Could not apply
> > > filter on url: Anlagen:AdresseAvans
> > > java.net.MalformedURLException: unknown protocol: anlagen
> > >
> > > I suspect the issue is the OutlinkExtractor being a bit too eager. How
> > > about making it a bit more configurable? This is now a real waste of
> > > CPU cycles.
> > >
> > > Thanks