Parser plugin not invoked.

kkrishnanand Wed, 10 Sep 2014 05:51:56 -0700

Hi, Nutch Gurus,

I am a Nutch newbie and would like help getting a Nutch plugin to execute.
I have written a plugin that extracts all the JavaScript URLs and creates
outlinks wrapped in a Parse object. Ideally, the generated outlinks would
be inserted into the crawldb during one of the crawl phases.
Unfortunately, the plugin is not being invoked, and I would appreciate any
assistance in this matter.


I have tried to run this on both Windows and Linux machines, but to no
avail. The setup for the Windows machine is given below. I referred to
http://wiki.apache.org/nutch/WritingPluginExample ,
http://florianhartl.com/nutch-plugin-tutorial.html,
http://sujitpal.blogspot.de/2009/07/nutch-custom-plugin-to-parse-and-add.html, 
and http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html

I would like help with two questions:

1.      How do I invoke the plugin to generate outlinks?
2.      How do I update the crawldb with the updated outlinks?

Any suggestions would be greatly appreciated.

######################
My nutch-config is given below
########################

<property>
  <name>plugin.folders</name>
  <value>C:\apache-nutch-2.2.1\build</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines if some plugins that are not activated regarding
  the plugin.includes and plugin.excludes properties must be automatically
  activated if they are needed by some activated plugins.
  </description>
</property>

<!-- localeextractor is my custom plugin -->
<property>
  <name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|(localeextractor)</value>
<description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with
  the underlying commons-httpclient library.
  </description>
</property>
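
Since plugin.includes is described above as a regular expression over plugin directory names, one quick sanity check is to test the value against the names in question. The following is only a sketch: the class name PluginIncludesCheck is hypothetical, and it assumes Nutch requires the whole plugin id to match the expression, which should be confirmed against the Nutch source for the version in use.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // Value copied from the plugin.includes property above.
    static final String INCLUDES =
        "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|urlnormalizer-(pass|regex|basic)|scoring-opic|(localeextractor)";

    // Assumption: a plugin is included only if its whole id matches the regex.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"localeextractor", "locale-detector", "parse-html", "parse-js"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under that assumption, "localeextractor" (the id in plugin.xml) matches, while "locale-detector" (the project name in build.xml) does not.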

######################
My plugin code
########################

public class LocaleExtractorFilter implements Parser {
  
  private static final Logger LOG =
      LoggerFactory.getLogger(LocaleExtractorFilter.class);
  
  private Configuration configuration;
  
  private static final Set<Field> FIELDS = new HashSet<Field>();
  
  static {
    FIELDS.add(WebPage.Field.OUTLINKS);
  }
  
  @Override
  public Collection<Field> getFields() {
    // TODO Auto-generated method stub
    return FIELDS;
  }
  
  @Override
  public void setConf(Configuration conf) {
    this.configuration = conf;
  }

  @Override
  public Configuration getConf() {
    return this.configuration;
  }
  
  /**
   * Extracts the JS links to create outlinks.
   * {@inheritDoc}
   */
  @Override
  public Parse getParse(String url, WebPage page) {
    String stringContent = Bytes.toString(page.getContent());
    Set<Outlink> jsOutlinks = this.addUrlsToBeParsed(stringContent);
    return new Parse(
        page.getText().toString(), page.getTitle().toString(),
        jsOutlinks.toArray(new Outlink[0]), page.getParseStatus());
  }
  
  private static final Pattern PATTERN_WITH_ASCII_QUOTES =
      Pattern.compile(
          "^(?:.*?goto\\(&#39;(\\w+)&#39;\\).*|.*?OOLPopUp\\(&#39;(.+?&#39;\\)).*)$",
          Pattern.MULTILINE);
  
  private static final String REDIRECT = "/accounts/redirect.go?target=";

  /**
   * The implementation parses the URLs from the string content of HTML
   * files. The URLs are of the following format:
   * <ul>
   *   <li>{@code goto} links, Example
   *       {@code &lt;a href='javascript:goto(&#39;billpay&#39;);'&gt;Accounts&lt;/a&gt;}
   * </ul>
   *
   * @param stringContent from which multiple urls can be constructed
   */
  Set<Outlink> addUrlsToBeParsed(String stringContent) {
    Set<Outlink> outlinks = new TreeSet<Outlink>();
    Matcher matcher = PATTERN_WITH_ASCII_QUOTES.matcher(stringContent);
    while (matcher.find()) {
      String url = "";
      try {
        url = new StringBuilder(REDIRECT).append(
            matcher.group(1) != null ? matcher.group(1) : matcher.group(2))
            .toString();
        outlinks.add(new Outlink(url, ""));
      } catch (MalformedURLException mue) {
        LOG.warn("Error generating outlink urls for " + url, mue);
      }
    }
    
    return outlinks;   
  }

}
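
To rule out the extraction logic itself as the cause, the pattern can be exercised outside Nutch. The sketch below reuses the same regular expression and REDIRECT prefix as the class above, but collects plain strings instead of Outlink objects so that it needs no Nutch dependency. The class name JsLinkPatternCheck, the extract method, and the sample HTML are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsLinkPatternCheck {
    // Same pattern and prefix as in LocaleExtractorFilter above.
    static final Pattern PATTERN_WITH_ASCII_QUOTES = Pattern.compile(
        "^(?:.*?goto\\(&#39;(\\w+)&#39;\\).*|.*?OOLPopUp\\(&#39;(.+?&#39;\\)).*)$",
        Pattern.MULTILINE);
    static final String REDIRECT = "/accounts/redirect.go?target=";

    // Returns the redirect URLs built from every matching line of the content.
    static Set<String> extract(String content) {
        Set<String> urls = new LinkedHashSet<String>();
        Matcher m = PATTERN_WITH_ASCII_QUOTES.matcher(content);
        while (m.find()) {
            String target = m.group(1) != null ? m.group(1) : m.group(2);
            urls.add(REDIRECT + target);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; the second line contains no JS link.
        String sample =
            "<a href='javascript:goto(&#39;billpay&#39;);'>Accounts</a>\n"
            + "<p>no link here</p>";
        System.out.println(extract(sample));
    }
}
```

For this sample the only extracted URL is /accounts/redirect.go?target=billpay. Note that the second capture group (the OOLPopUp branch), as written, also captures the closing &#39;) sequence.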

##############
Plugin.xml
###############

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="localeextractor" name="Locale extractor Filter" version="1.0.0"
  provider-name="nutch.org">

  <runtime>
    <library name="localeextractor">
      <export name="*" />
    </library>
  </runtime>

  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>

  <extension id="com.bofa.ecom.search.LocaleExtractorFilter"
    name="Nutch Links Generator"
    point="org.apache.nutch.parse.Parser">
    <implementation id="parser-localeextractor"
      class="com.bofa.ecom.search.LocaleExtractorFilter" />
  </extension>

</plugin>

##############
Build.xml
###############
<project name="locale-detector" default="jar-core">

  <import file="../build-plugin.xml" />

</project>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Parser-plugin-not-invoked-tp4157840.html
Sent from the Nutch - User mailing list archive at Nabble.com.
