Hi Sebastian,

I apologise for the long silence on this issue. I have been out of town and will be back on Monday. After that, I will do what you are asking within 2-3 days.

Regards,
Arkadi

________________________________________
From: Sebastian Nagel [[email protected]]
Sent: Friday, 24 July 2015 6:38 AM
To: [email protected]
Cc: Kosmynin, Arkadi (CASS, Marsfield)
Subject: Re: A parser failure on a single document may fail crawling job

Hi Arkadi,

does the problem persist?
Which version of Nutch are you using?
Can you point to one file or URL to reproduce it?

Thanks,
Sebastian

On 06/26/2015 03:26 PM, Sebastian Nagel wrote:
> Hi Arkadi,
>
> thanks for reporting that. Can you open a Jira ticket [1] to address this bug?
>
> It's rather a bug of the plugin parse-tika and should be solved there,
> cf. https://issues.apache.org/jira/browse/TIKA-1240
> A plugin should be able to load all required classes.
>
> Thanks,
> Sebastian
>
> [1] https://issues.apache.org/jira/browse/NUTCH
>
> 2015-06-23 3:59 GMT+02:00 <[email protected] <mailto:[email protected]>>:
>
> Hi,
>
> This is what happened:
>
> java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
>     <...>
> Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
>
> Suggested fix in ParseUtil:
>
> Replace
>
>     if (maxParseTime!=-1)
>         parseResult = runParser(parsers[i], content);
>     else
>         parseResult = parsers[i].getParse(content);
>
> with
>
>     try
>     {
>         if (maxParseTime!=-1)
>             parseResult = runParser(parsers[i], content);
>         else
>             parseResult = parsers[i].getParse(content);
>     } catch( Throwable e )
>     {
>         LOG.warn( "Parsing " + content.getUrl() + " with " + parsers[i].getClass().getName() + " failed: " + e.getMessage() ) ;
>         parseResult = null ;
>     }
>
> Also replace
>
>     if (maxParseTime!=-1)
>         parseResult = runParser(p, content);
>     else
>         parseResult = p.getParse(content);
>
> with
>
>     try
>     {
>         if (maxParseTime!=-1)
>             parseResult = runParser(p, content);
>         else
>             parseResult = p.getParse(content);
>     } catch( Throwable e )
>     {
>         LOG.warn( "Parsing " + content.getUrl() + " with " + p.getClass().getName() + " failed: " + e.getMessage() ) ;
>     }
>
> Regards,
> Arkadi
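
For anyone who wants to try the idea outside Nutch, the suggested fix can be sketched in isolation roughly as below. This is only an illustration under assumptions: the `Parser` interface and the `parseWithFallback` method are hypothetical stand-ins and not the real ParseUtil API, and the failure is simulated rather than triggered by Tika. The point it shows is that IncompatibleClassChangeError is an Error, not an Exception, so only a catch of Throwable (or Error) keeps one bad document from failing the whole loop.

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch; Parser and parseWithFallback are hypothetical stand-ins,
// not the actual Nutch ParseUtil API.
final class ParserFallbackSketch {

    /** Hypothetical minimal parser interface (Nutch's real Parser differs). */
    interface Parser {
        String getParse(String content) throws Exception;
    }

    /**
     * Tries each parser in order and returns the first non-null result.
     * Catches Throwable so that an Error such as IncompatibleClassChangeError
     * is logged and skipped instead of propagating to the caller.
     */
    static String parseWithFallback(List<Parser> parsers, String content) {
        for (Parser p : parsers) {
            String result = null;
            try {
                result = p.getParse(content);
            } catch (Throwable e) {
                // Log and fall through to the next parser.
                System.err.println("Parsing with " + p.getClass().getName()
                        + " failed: " + e);
            }
            if (result != null) {
                return result;
            }
        }
        return null; // no parser succeeded
    }

    public static void main(String[] args) {
        // Simulates the linkage failure from the stack trace above.
        Parser broken = c -> {
            throw new IncompatibleClassChangeError("simulated linkage failure");
        };
        Parser working = c -> "parsed:" + c;
        System.out.println(parseWithFallback(Arrays.asList(broken, working), "doc-1"));
    }
}
```

One trade-off to keep in mind: a catch of Throwable also swallows errors such as OutOfMemoryError, so it may be worth narrowing the catch or rethrowing selected errors.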

