Hi Nutch Gurus, I am a Nutch newbie and I would like to ask for help getting a Nutch plugin to execute. I have written a plugin that extracts all the JavaScript URLs and creates outlinks wrapped in a Parse object. Ideally, the generated outlinks would be inserted into the crawldb during one of the crawl phases. Unfortunately, the plugin is not being invoked, and I would appreciate any assistance in this matter.
I have tried to run this on both Windows and Linux machines, but to no avail. The setup for the Windows machine is given below. I referred to:

http://wiki.apache.org/nutch/WritingPluginExample
http://florianhartl.com/nutch-plugin-tutorial.html
http://sujitpal.blogspot.de/2009/07/nutch-custom-plugin-to-parse-and-add.html
http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html

I would like help with two questions:

1. How do I invoke the plugin to generate outlinks?
2. How do I update the crawldb with the updated outlinks?

Any suggestion would be gratefully appreciated.

###################### My nutch-config is given below ########################

<property>
  <name>plugin.folders</name>
  <value>C:\apache-nutch-2.2.1\build</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines if some plugins that are not activated regarding
  the plugin.includes and plugin.excludes properties must be automaticaly
  activated if they are needed by some actived plugins.
  </description>
</property>

<!-- localeextractor is my custom plugin -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|(localeextractor)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

###################### My plugin code ########################

public class LocaleExtractorFilter implements Parser {

    private static final Logger LOG = LoggerFactory.getLogger(LocaleExtractorFilter.class);

    private Configuration configuration;

    private static final Set<Field> FIELDS = new HashSet<Field>();

    static {
        FIELDS.add(WebPage.Field.OUTLINKS);
    }

    @Override
    public Collection<Field> getFields() {
        // TODO Auto-generated method stub
        return FIELDS;
    }

    @Override
    public void setConf(Configuration conf) {
        this.configuration = conf;
    }

    @Override
    public Configuration getConf() {
        return this.configuration;
    }

    /**
     * Extracts the JS links to create outlinks.
     * {@inheritDoc}
     */
    @Override
    public Parse getParse(String url, WebPage page) {
        // TODO Auto-generated method stub
        String stringContent = Bytes.toString(page.getContent());
        Set<Outlink> jsOutlinks = this.addUrlsToBeParsed(stringContent);
        return new Parse(
            page.getText().toString(),
            page.getTitle().toString(),
            jsOutlinks.toArray(new Outlink[0]),
            page.getParseStatus());
    }

    private static final Pattern PATTERN_WITH_ASCII_QUOTES = Pattern.compile(
        "^(?:.*?goto\\('(\\w+)'\\).*|.*?OOLPopUp\\('(.+?'\\)).*)$", Pattern.MULTILINE);

    private static final String REDIRECT = "/accounts/redirect.go?target=";

    /**
     * The implementation parses the URLs from the string content of HTML files. The URLs are of the
     * following format:
     * <ul>
     * <li>{@code goto} links, Example
     * {@code <a href='javascript:goto('billpay');'>Accounts</a>}
     * </ul>
     *
     * @param stringContent from which multiple urls can be constructed
     */
    Set<Outlink> addUrlsToBeParsed(String stringContent) {
        Set<Outlink> outlinks = new TreeSet<Outlink>();
        Matcher matcher = PATTERN_WITH_ASCII_QUOTES.matcher(stringContent);
        while (matcher.find()) {
            String url = "";
            try {
                url = new StringBuilder(REDIRECT).append(
                    matcher.group(1) != null ? matcher.group(1) : matcher.group(2)).toString();
                outlinks.add(new Outlink(url, ""));
            } catch (MalformedURLException mue) {
                LOG.warn("Error generating outlink urls for " + url, mue);
            }
        }
        return outlinks;
    }
}

############## Plugin.xml ###############

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="localeextractor" name="Locale extractor Filter"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="localeextractor">
      <export name="*" />
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="com.bofa.ecom.search.LocaleExtractorFilter"
             name="Nutch Links Generator"
             point="org.apache.nutch.parse.Parser">
    <implementation id="parser-localeextractor"
                    class="com.bofa.ecom.search.LocaleExtractorFilter" />
  </extension>
</plugin>

############## Build.xml ###############

<project name="locale-detector" default="jar-core">
  <import file="../build-plugin.xml" />
</project>

--
View this message in context: http://lucene.472066.n3.nabble.com/Parser-plugin-not-invoked-tp4157840.html
Sent from the Nutch - User mailing list archive at Nabble.com.
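In case it helps to reproduce the link extraction outside Nutch, here is a minimal standalone sketch of the same regex logic. It is only an illustration: the class name JsLinkSketch, the method name extractTargets, and the two sample HTML lines are hypothetical, and plain strings stand in for Outlink objects so that no Nutch classes are needed.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch (hypothetical class): exercises the plugin's regex without Nutch.
class JsLinkSketch {

    // Same pattern and redirect prefix as in the plugin code.
    private static final Pattern PATTERN_WITH_ASCII_QUOTES = Pattern.compile(
        "^(?:.*?goto\\('(\\w+)'\\).*|.*?OOLPopUp\\('(.+?'\\)).*)$", Pattern.MULTILINE);

    private static final String REDIRECT = "/accounts/redirect.go?target=";

    // Returns the redirect URLs built from every matching line of the content.
    static Set<String> extractTargets(String content) {
        Set<String> urls = new TreeSet<String>();
        Matcher matcher = PATTERN_WITH_ASCII_QUOTES.matcher(content);
        while (matcher.find()) {
            // Group 1 is set for goto(...) links, group 2 for OOLPopUp(...) links.
            String target = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
            urls.add(REDIRECT + target);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, one link per line.
        String html = "<a href=\"javascript:goto('billpay');\">Accounts</a>\n"
                    + "<a href=\"javascript:OOLPopUp('help');\">Help</a>\n";
        for (String url : extractTargets(html)) {
            System.out.println(url);
        }
    }
}
```

Running this on the sample input suggests the goto branch yields the bare target (billpay), while the OOLPopUp branch, as the pattern is written, also captures the closing quote and parenthesis (help')), because the second group encloses them. That may or may not be intended, and it is separate from the question of why the plugin is not invoked.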

