Hi Klemens,

Please don't hijack others' threads. It is impolite and your threads will not be answered.

Thank you

Lewis

On Sun, Aug 12, 2012 at 12:23 PM, Klemens Muthmann <[email protected]> wrote:
> Hi,
>
> I found the following exception in hadoop.log
>
> java.lang.Error: Unresolved compilation problems:
>         The import org.cyberneko cannot be resolved
>         org.ccil cannot be resolved to a type
>         org.ccil cannot be resolved to a type
>         org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
>         org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
>         DOMFragmentParser cannot be resolved to a type
>         DOMFragmentParser cannot be resolved to a type
>
>         at org.apache.nutch.parse.html.HtmlParser.<init>(HtmlParser.java:28)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>         at java.lang.Class.newInstance0(Class.java:372)
>         at java.lang.Class.newInstance(Class.java:325)
>         at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
>         at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:1)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> Eclipse indeed does show me that cyberneko is missing but it worked until I
> added:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html)|simpletestplugin</value>
> </property>
>
> to my nutch-site.xml file.
> I can only assume that parse-(html) is normally not part
> of the plugin.includes property. So I think I have two possible
> directions of action. Either get the default value of plugin.includes from
> somewhere and add my plugin to that list, or fix the missing dependencies,
> which I do not exactly know how to do because I usually use Maven and have
> never worked with Ant or Ivy for dependency management. It would be nice if
> you could give me a pointer in either direction.
>
> On Sun 12 Aug 2012 13:11:16 CEST, Alaak wrote:
>
>> Hi,
>>
>> Ah sorry. Both are actually copy and paste errors. Of course I only
>> have one logger with the correct class name and the extension point
>> is: "org.apache.nutch.indexer.IndexingFilter"
>>
>> This is the actual plugin.xml I am using.
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <plugin id="simpletestplugin" name="URL Meta Indexing Filter"
>>         version="1.0.0" provider-name="alaak">
>>   <runtime>
>>     <library name="simpletestplugin.jar">
>>       <export name="*"/>
>>     </library>
>>   </runtime>
>>
>>   <requires>
>>     <import plugin="nutch-extensionpoints"/>
>>   </requires>
>>
>>   <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
>>              point="org.apache.nutch.indexer.IndexingFilter">
>>     <implementation id="page-filter"
>>                     class="testplugin.SimpleFilter"/>
>>   </extension>
>> </plugin>
>>
>> On Sun 12 Aug 2012 12:31:46 CEST, Lewis John Mcgibbney wrote:
>>>
>>> Hi Alaak,
>>>
>>> On Sun, Aug 12, 2012 at 10:58 AM, Alaak <[email protected]> wrote:
>>>>
>>>> I always get output with the following
>>>> exception which basically tells me nothing:
>>>>
>>>> ...
>>>> Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
>>>> ParseSegment: starting at 2012-08-12 11:06:47
>>>> ParseSegment: segment: crawl/segments/20120812110633
>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>>>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
>>>
>>> It tells you that there is a problem whilst parsing a particular
>>> segment. This is quite a lot to go on.
>>>
>>> All the Java code looks fine. I don't see any problems except that you
>>> have an additional logging variable which seems to point outside of the
>>> class.
>>>
>>>>
>>>> <extension id="testplugin" name="Some Simple Test Plugin"
>>>>            point="org.apache.nutch.segment.SegmentMergeFilter">
>>>>   <implementation id="page-filter" class="testplugin.SimpleFilter"/>
>>>> </extension>
>>>> </plugin>
>>>
>>> Now we come to the main point of concern. For me (as far as I
>>> understand what you are trying to do) you should not extend the
>>> SegmentMergeFilter point. This should refer to the IndexingFilter you
>>> wish to extend. A list of extension points can be seen here [0]
>>>
>>> [0] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
>>>
>>> hth
>>>
>>> Lewis

--
Lewis
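
On the first of the two options Klemens mentions: the default value of plugin.includes lives in conf/nutch-default.xml, and a property set in nutch-site.xml replaces that default rather than adding to it. A minimal sketch of an override that keeps the stock plugins and appends the custom one might look like the following. The default list shown here is an assumption, recalled from the Nutch 1.x line, and may differ between releases, so copy the actual value from the nutch-default.xml in your own checkout. simpletestplugin is the plugin id used in the thread.

```xml
<!-- nutch-site.xml: this value replaces the plugin.includes default
     from nutch-default.xml; it is not merged with it. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed 1.x default list (verify against your nutch-default.xml),
       with the custom plugin id appended at the end. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|simpletestplugin</value>
</property>
```

The value is a regular expression matched against plugin directory names, which is why the alternatives are joined with | and grouped in parentheses.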

