Hi,

I found the following exception in hadoop.log:

java.lang.Error: Unresolved compilation problems:
        The import org.cyberneko cannot be resolved
        org.ccil cannot be resolved to a type
        org.ccil cannot be resolved to a type
        org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
        org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
        DOMFragmentParser cannot be resolved to a type
        DOMFragmentParser cannot be resolved to a type

        at org.apache.nutch.parse.html.HtmlParser.<init>(HtmlParser.java:28)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at java.lang.Class.newInstance0(Class.java:372)
        at java.lang.Class.newInstance(Class.java:325)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:1)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

Eclipse does indeed show that cyberneko is missing, but everything worked until I added:

<property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html)|simpletestplugin</value>
</property>

to my nutch-site.xml file. I can only assume that parse-(html) is normally not part of the plugin.includes property. So I see two possible courses of action: either get the default value of plugin.includes from somewhere and add my plugin to that list, or fix the missing dependencies. I do not know exactly how to do the latter, because I usually use Maven and have never worked with Ant or Ivy for dependency management. It would be nice if you could give me a pointer in either direction.

On Sun, 12 Aug 2012 13:11:16 CEST, Alaak wrote:
Hi,

Ah sorry. Both are actually copy and paste errors. Of course I only
have one logger with the correct class name and the extension point
is: "org.apache.nutch.indexer.IndexingFilter"

This is the actual plugin.xml I am using.

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="simpletestplugin" name="URL Meta Indexing Filter"
version="1.0.0" provider-name="alaak">
    <runtime>
        <library name="simpletestplugin.jar">
            <export name="*"/>
        </library>
    </runtime>

    <requires>
        <import plugin="nutch-extensionpoints"/>
    </requires>

    <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
               point="org.apache.nutch.indexer.IndexingFilter">
        <implementation id="page-filter" class="testplugin.SimpleFilter"/>
    </extension>
</plugin>

On Sun, 12 Aug 2012 12:31:46 CEST, Lewis John Mcgibbney wrote:

Hi Alaak,

On Sun, Aug 12, 2012 at 10:58 AM, Alaak <[email protected]> wrote:

I always get output with the following
exception which basically tells me nothing:

...
Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
ParseSegment: starting at 2012-08-12 11:06:47
ParseSegment: segment: crawl/segments/20120812110633
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)


It tells you that there is a problem whilst parsing a particular
segment. This is quite a lot to go on.

All the Java code looks fine. I don't see any problems except that you
have an additional logging variable which seems to point outside of the
class.



<extension id="testplugin" name="Some Simple Test Plugin"
point="org.apache.nutch.segment.SegmentMergeFilter">
<implementation id="page-filter" class="testplugin.SimpleFilter"/>
</extension>
</plugin>


Now we come to the main point of concern. As far as I understand what
you are trying to do, you should not extend the SegmentMergeFilter
point. The extension should instead refer to the IndexingFilter point
you wish to extend. A list of extension points can be seen here [0]

[0]
http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
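
As a rough sketch only, assuming the ids and the class name in your
plugin.xml stay as you posted them, the extension element would then
look something like this. The only change is the point attribute,
which I have written as the IndexingFilter id I would expect to find
in [0]; please check it against that file.

```xml
<!-- Sketch, not a verified descriptor: ids and class copied from the
     plugin.xml quoted above; only the point attribute is changed. -->
<extension id="testplugin" name="Some Simple Test Plugin"
           point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="page-filter" class="testplugin.SimpleFilter"/>
</extension>
```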


hth

Lewis
