Re: nutch javascript capabilities

Lewis John Mcgibbney Tue, 08 Jan 2013 19:32:19 -0800

Hi Michael,


On Tue, Jan 8, 2013 at 7:15 AM, Michael Gang <[email protected]> wrote:

> JavaScript (for extracting links only?) (parse-js)
>

Yes, both inlinks and outlinks, if present.


>
> I don't understand what this exactly means.
> Let's say if i have a link
> <a onclick="do_something">
> or a jquery binding in onready
> and in this code i open a new window and show there a result of a form
> submit
> will nutch extract for me the resulting page as link ?
>
>
The idea (taken in part from the class Javadoc) is that the parser
implements a heuristic link extractor for pure JS files and for JS
snippets embedded in (X)HTML. When JS is discovered, the parsing logic
runs a two-pass regex match to pick out links which may be useful to a
Nutch crawl. This plugin is known to act up from time to time, but
basically the two regex matches boil down to the following:
- a 'simple' string-matching pattern which allows invalid URL characters
- an 'alternative' pattern which restricts the allowed URL characters

When attempting to extract URLs from literals embedded in JS, the two
patterns are run in that order.
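
To make the two-pass idea concrete, here is a rough sketch in Python. It is
not the plugin's code: the two regexes and the name extract_links are made
up for illustration, and the real patterns in parse-js may well differ. The
first pass grabs any quoted literal without whitespace, the second keeps
only candidates built from a restricted set of URL characters.

```python
import re

# Hypothetical patterns for illustration only; NOT the regexes used by parse-js.
# Pass 1: any quoted string literal without whitespace. This deliberately
# lets through characters that are not valid in a URL.
STRING_LITERAL = re.compile(r"""(["'])([^\s"']+?)\1""")

# Pass 2: a restricted set of characters that may appear in a URL.
URL_CHARS = re.compile(r"[A-Za-z0-9_\-.~:/?#@!$&*+,;=%]+")


def extract_links(js_source):
    """Return URL-like string literals found in a piece of JavaScript."""
    links = []
    for match in STRING_LITERAL.finditer(js_source):
        candidate = match.group(2)
        # Reject literals with characters outside the restricted set, and
        # plain words that contain neither '/' nor '.'.
        if URL_CHARS.fullmatch(candidate) and ("/" in candidate or "." in candidate):
            links.append(candidate)
    return links


snippet = (
    'window.open("http://example.com/result.html"); '
    'var x = "hello"; '
    "$.get('/search?q=1'); "
    'var bad = "a<b/c";'
)
print(extract_links(snippet))
```

In this sketch only literals that look like URLs survive the second pass. A
handler such as onclick="do_something" contains no such literal, so nothing
would be extracted from it.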

hth
Lewis
