Hi Klemens,

Please don't hijack others' threads. It is impolite and your threads
will not be answered.

Thank you
Lewis

On Sun, Aug 12, 2012 at 12:23 PM, Klemens Muthmann
<[email protected]> wrote:
> Hi,
>
> I found the following exception in hadoop.log
>
> java.lang.Error: Unresolved compilation problems:
>         The import org.cyberneko cannot be resolved
>         org.ccil cannot be resolved to a type
>         org.ccil cannot be resolved to a type
>         org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
>         org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
>         DOMFragmentParser cannot be resolved to a type
>         DOMFragmentParser cannot be resolved to a type
>
>         at org.apache.nutch.parse.html.HtmlParser.<init>(HtmlParser.java:28)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>         at java.lang.Class.newInstance0(Class.java:372)
>         at java.lang.Class.newInstance(Class.java:325)
>         at
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
>         at
> org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:1)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> Eclipse does indeed show me that cyberneko is missing, but it worked until I
> added:
>
> <property>
>         <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(html)|simpletestplugin</value>
> </property>
>
> to my nutch-site.xml file. I can only assume that parse-(html) is normally
> not part of the plugin.includes property. So I see two possible directions
> of action: either find the default value of plugin.includes somewhere and
> add my plugin to that list, or fix the missing dependencies, which I do not
> exactly know how to do because I usually use Maven and have never worked
> with Ant or Ivy for dependency management. It would be nice if you could
> give me a pointer in either direction.
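> For the first direction I had something like the sketch below in mind:
> copy the default value from conf/nutch-default.xml into nutch-site.xml and
> append my plugin. The exact default differs between Nutch versions, so the
> value here is only an illustration of what I expect it to look like:
>
> <property>
>         <name>plugin.includes</name>
>         <!-- default copied from nutch-default.xml (version-dependent),
>              with simpletestplugin appended -->
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|simpletestplugin</value>
> </property>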
>
> On Sun, 12 Aug 2012 13:11:16 CEST, Alaak wrote:
>
>> Hi,
>>
>> Ah sorry. Both are actually copy and paste errors. Of course I only
>> have one logger with the correct class name and the extension point
>> is: "org.apache.nutch.indexer.IndexingFilter"
>>
>> This is the actual plugin.xml I am using.
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <plugin id="simpletestplugin" name="URL Meta Indexing Filter""
>> version="1.0.0" provider-name="alaak">
>>     <runtime>
>>         <library name="simpletestplugin.jar">
>>             <export name="*"/>
>>         </library>
>>     </runtime>
>>
>>     <requires>
>>         <import plugin="nutch-extensionpoints"/>
>>     </requires>
>>
>>     <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
>> point="org.apache.nutch.indexer.IndexingFilter">
>>         <implementation id="page-filter"
>> class="testplugin.SimpleFilter"/>
>>     </extension>
>> </plugin>
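>>
>> For completeness, a stripped-down version of the SimpleFilter class the
>> plugin.xml points at looks roughly like this. It uses the Nutch 1.x
>> IndexingFilter signature; the interface differs in other Nutch versions:
>>
>> package testplugin;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.io.Text;
>> import org.apache.nutch.crawl.CrawlDatum;
>> import org.apache.nutch.crawl.Inlinks;
>> import org.apache.nutch.indexer.IndexingException;
>> import org.apache.nutch.indexer.IndexingFilter;
>> import org.apache.nutch.indexer.NutchDocument;
>> import org.apache.nutch.parse.Parse;
>>
>> public class SimpleFilter implements IndexingFilter {
>>
>>     private Configuration conf;
>>
>>     // Nutch 1.x signature; adds a single marker field so I can see that
>>     // the filter ran. Other Nutch versions use a different signature.
>>     @Override
>>     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>>             CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>>         doc.add("simpletest", "true");
>>         return doc;
>>     }
>>
>>     @Override
>>     public Configuration getConf() {
>>         return conf;
>>     }
>>
>>     @Override
>>     public void setConf(Configuration conf) {
>>         this.conf = conf;
>>     }
>> }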
>>
>> On Sun, 12 Aug 2012 12:31:46 CEST, Lewis John Mcgibbney wrote:
>>>
>>>
>>> Hi Alaak,
>>>
>>> On Sun, Aug 12, 2012 at 10:58 AM, Alaak <[email protected]> wrote:
>>>>
>>>>
>>>> I always get output with the following
>>>> exception which basically tells me nothing:
>>>>
>>>> ...
>>>> Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
>>>> ParseSegment: starting at 2012-08-12 11:06:47
>>>> ParseSegment: segment: crawl/segments/20120812110633
>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>>> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
>>>
>>>
>>>
>>> It tells you that there is a problem whilst parsing a particular
>>> segment. This is quite a lot to go on.
>>>
>>> All the Java code looks fine. I don't see any problems except that you
>>> have an additional logging variable which seems to point outside of the
>>> class.
>>>
>>>>
>>>>
>>>> <extension id="testplugin" name="Some Simple Test Plugin"
>>>> point="org.apache.nutch.segment.SegmentMergeFilter">
>>>> <implementation id="page-filter" class="testplugin.SimpleFilter"/>
>>>> </extension>
>>>> </plugin>
>>>
>>>
>>>
>>> Now we come to the main point of concern. As far as I understand what you
>>> are trying to do, you should not extend the SegmentMergeFilter point. The
>>> point attribute should refer to the IndexingFilter extension point you
>>> wish to extend. A list of extension points can be seen here [0]
>>>
>>> [0]
>>>
>>> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
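>>>
>>> The relevant entry in that file looks roughly like this (the name text
>>> may differ slightly between versions):
>>>
>>> <extension-point
>>>     id="org.apache.nutch.indexer.IndexingFilter"
>>>     name="Nutch Indexing Filter"/>
>>>
>>> It is that id which belongs in the point attribute of your extension.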
>>>
>>>
>>> hth
>>>
>>> Lewis



-- 
Lewis
