Hi,
I found the following exception in hadoop.log:
java.lang.Error: Unresolved compilation problems:
The import org.cyberneko cannot be resolved
org.ccil cannot be resolved to a type
org.ccil cannot be resolved to a type
org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
DOMFragmentParser cannot be resolved to a type
DOMFragmentParser cannot be resolved to a type
at org.apache.nutch.parse.html.HtmlParser.<init>(HtmlParser.java:28)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at java.lang.Class.newInstance0(Class.java:372)
at java.lang.Class.newInstance(Class.java:325)
at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Eclipse does indeed show me that cyberneko is missing, but it worked
until I added:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html)|simpletestplugin</value>
</property>
to my nutch-site.xml file. I can only assume that parse-(html) is
normally not part of the plugin.includes property. So I think I have
two options: either get the default value of plugin.includes from
somewhere and add my plugin to that list, or fix the missing
dependencies. I do not know exactly how to do the latter, because I
usually use Maven and have never worked with Ant or Ivy for dependency
management. It would be nice if you could give me a pointer in either
direction.
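A minimal sketch of how the plugin.includes value above would select
plugins, assuming Nutch treats the value as a regular expression that
must match a plugin id in full (an assumption, not confirmed in this
thread). The class name PluginIncludesCheck and the helper isIncluded
are hypothetical, and the ids parse-tika and index-basic are only
illustrative candidates:

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // Value copied from the nutch-site.xml snippet above.
    public static final String INCLUDES =
        "protocol-http|urlfilter-regex|parse-(html)|simpletestplugin";

    // Assumption: a plugin is activated only if its whole id matches the regex.
    public static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Illustrative plugin ids to check against the configured value.
        String[] ids = {"parse-html", "simpletestplugin", "parse-tika", "index-basic"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(INCLUDES, id));
        }
    }
}
```

Under that assumption, parse-html and simpletestplugin are selected by
this value, while any plugin id not named in the expression is not.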
On Sun, 12 Aug 2012 13:11:16 CEST, Alaak wrote:
Hi,
Ah sorry. Both are actually copy and paste errors. Of course I only
have one logger with the correct class name and the extension point
is: "org.apache.nutch.indexer.IndexingFilter"
This is the actual plugin.xml I am using.
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="simpletestplugin" name="URL Meta Indexing Filter"
version="1.0.0" provider-name="alaak">
<runtime>
<library name="simpletestplugin.jar">
<export name="*"/>
</library>
</runtime>
<requires>
<import plugin="nutch-extensionpoints"/>
</requires>
<extension id="de.effingo.crawler" name="Some Simple Test Plugin"
point="org.apache.nutch.indexer.IndexingFilter">
<implementation id="page-filter"
class="testplugin.SimpleFilter"/>
</extension>
</plugin>
On Sun, 12 Aug 2012 12:31:46 CEST, Lewis John Mcgibbney wrote:
Hi Alaak,
On Sun, Aug 12, 2012 at 10:58 AM, Alaak <[email protected]> wrote:
I always get output with the following
exception which basically tells me nothing:
...
Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
ParseSegment: starting at 2012-08-12 11:06:47
ParseSegment: segment: crawl/segments/20120812110633
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
It tells you that there is a problem whilst parsing a particular
segment. This is quite a lot to go on.
All the Java code looks fine. I don't see any problems except that you
have an additional logging variable which seems to point outside of the
class.
<extension id="testplugin" name="Some Simple Test Plugin"
point="org.apache.nutch.segment.SegmentMergeFilter">
<implementation id="page-filter" class="testplugin.SimpleFilter"/>
</extension>
</plugin>
Now we come to the main point of concern. As far as I understand what
you are trying to do, you should not extend the SegmentMergeFilter
point. This should refer to the IndexingFilter you wish to extend. A
list of extension points can be seen here [0]
[0]
http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
hth
Lewis