Hi Michael,
On Tue, Jan 8, 2013 at 7:15 AM, Michael Gang <[email protected]> wrote:

> JavaScript (for extracting links only?) (parse-js)
> Yes, both in and outlinks if present.
>
> I don't understand what this exactly means.
> Let's say if i have a link
> <a onclick="do_something">
> or a jquery binding in onready
> and in this code i open a new window and show there a result of a form
> submit
> will nutch extract for me the resulting page as link ?
>
>

The idea (taken in part from the class Javadoc) is that the plugin implements a heuristic link extractor for pure JS files and for JS snippets embedded in (X)HTML. When JS is discovered, the parsing logic runs a two-pass regex match to obtain links which may be useful to a Nutch crawl. This plugin is known to act up from time to time, but basically the two regex matches boil down to the following:

- a 'simple' string-matching pattern which allows invalid URL characters
- an 'alternative' pattern which limits matches to valid URL characters

When attempting to extract URLs from literals embedded in JS, the two patterns are run in that order.

hth
Lewis
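
P.S. A rough sketch of the two-pass idea, in Python for brevity. This is not the plugin's actual code: both regexes, the function name extract_js_links and the sample snippet are assumptions made up purely for illustration, and the real patterns in parse-js may well differ.

```python
import re

# Pass 1 (assumed pattern): grab any quoted string literal that contains no
# whitespace. Deliberately permissive, so it may capture non-URL strings.
STRING_LITERAL = re.compile(r"""(["'])([^\s"']+?)\1""")

# Pass 2 (assumed pattern): keep only candidates that start like an absolute
# URL or a path and consist solely of characters that are legal in a URL.
URL_LIKE = re.compile(r"^(?:https?://|/)[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+$")


def extract_js_links(js_source):
    """Return the URL-like string literals found in js_source, in order."""
    candidates = [m.group(2) for m in STRING_LITERAL.finditer(js_source)]
    return [c for c in candidates if URL_LIKE.match(c)]


# Hypothetical embedded script, loosely based on the onclick example above.
snippet = """
function do_something() {
    var label = "click<me>";
    window.open('/results/show.html');
    $.get("http://example.com/api?q=1");
}
"""
print(extract_js_links(snippet))
# prints ['/results/show.html', 'http://example.com/api?q=1']
```

The first pass also picks up "click<me>", and the second pass drops it because '<' and '>' are not valid URL characters. Note that the sketch never executes the script, it only scans string literals, so a URL that exists only at runtime would not be found by it.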

