Hi

I need to create a simple extension for Nutch that indexes only web pages matching certain criteria.

I followed the explanation on how to set up Nutch in Eclipse and got a basic system running. Then I followed the instructions for writing a simple plugin here: http://wiki.apache.org/nutch/WritingPluginExample. However, after adding the plugin I always get the following exception, which basically tells me nothing:

...
Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
ParseSegment: starting at 2012-08-12 11:06:47
ParseSegment: segment: crawl/segments/20120812110633
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I wanted to simplify the example by using only one extension, which simply logs "test" for every crawled page. Here is the code for my plugin class:

package testplugin;

import java.util.Collection;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class SimpleFilter implements IndexingFilter {

    public static final Logger LOGGER = LoggerFactory.getLogger(SimpleFilter.class);

    private Configuration conf;

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;

        if (conf == null)
            return;
    }

    @Override
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
            throws IndexingException {
        LOGGER.info("test");
        return doc;
    }

}
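
In case it is relevant: the wiki example also has you register the plugin in src/plugin/build.xml so it is built and deployed with the other plugins. A sketch of that entry, assuming the plugin directory is named simpletestplugin:

<!-- src/plugin/build.xml: inside the existing "deploy" target -->
<ant dir="simpletestplugin" target="deploy"/>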

I also adapted the plugin.xml to look like this:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="simpletestplugin" name="URL Meta Indexing Filter" version="1.0.0" provider-name="alaak">
    <runtime>
        <library name="simpletestplugin.jar">
            <export name="*"/>
        </library>
    </runtime>

    <requires>
        <import plugin="nutch-extensionpoints"/>
    </requires>

    <extension id="testplugin" name="Some Simple Test Plugin" point="org.apache.nutch.segment.SegmentMergeFilter">
        <implementation id="page-filter" class="testplugin.SimpleFilter"/>
    </extension>
</plugin>
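
The wiki example also enables the plugin through the plugin.includes property in conf/nutch-site.xml. A sketch of that property, again assuming the plugin directory is named simpletestplugin (the rest of the value is a shortened version of the default list):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|simpletestplugin</value>
  <description>Regular expression naming plugin directory names to include.</description>
</property>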

Can someone please give me a clue as to what I am doing wrong, or tell me what additional information you would need in order to help me?

Thanks and regards.
