Hi Sebastian,

> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Friday, 24 July 2015 6:39 AM
> To: [email protected]
> Cc: Kosmynin, Arkadi (CASS, Marsfield) <[email protected]>
> Subject: Re: A parser failure on a single document may fail crawling job
>
> Hi Arkadi,
>
> does the problem persist?

Yes.

> Which version of Nutch are you using?

1.9

> Can you point to one file or URL to reproduce it?

To reproduce:
- Remove a jar file that one of your parsers depends on.
- Make Nutch parse any file using this parser.

This results in a NoSuchMethodError being thrown and the crawling job failing.

I've created a JIRA issue, NUTCH-2071, and attached a patch. I believe this problem should be handled at the ParseUtil level, because people may use their own or third-party parsers, and Nutch should be protected from parser problems.

Regards,
Arkadi

>
> Thanks,
> Sebastian
>
> On 06/26/2015 03:26 PM, Sebastian Nagel wrote:
> > Hi Arkadi,
> >
> > thanks for reporting that. Can you open a Jira ticket [1] to address this bug?
> >
> > It's rather a bug of the plugin parse-tika and should be solved there,
> > cf. https://issues.apache.org/jira/browse/TIKA-1240
> > A plugin should be able to load all required classes.
> >
> > Thanks,
> > Sebastian
> >
> > [1] https://issues.apache.org/jira/browse/NUTCH
> >
> > 2015-06-23 3:59 GMT+02:00 <[email protected] <mailto:[email protected]>>:
> >
> >     Hi,
> >
> >     This is what happened:
> >
> >     java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> >         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
> >     <...>
> >     Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
> >         at java.lang.ClassLoader.defineClass1(Native Method)
> >         at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
> >         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> >         at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> >         at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> >         at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> >         at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> >         at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
> >         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
> >         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
> >
> >     Suggested fix in ParseUtil:
> >
> >     Replace
> >
> >     if (maxParseTime!=-1)
> >         parseResult = runParser(parsers[i], content);
> >     else
> >         parseResult = parsers[i].getParse(content);
> >
> >     with
> >
> >     try
> >     {
> >         if (maxParseTime!=-1)
> >             parseResult = runParser(parsers[i], content);
> >         else
> >             parseResult = parsers[i].getParse(content);
> >     } catch( Throwable e )
> >     {
> >         LOG.warn( "Parsing " + content.getUrl() + " with " + parsers[i].getClass().getName() + " failed: " + e.getMessage() ) ;
> >         parseResult = null ;
> >     }
> >
> >     Also replace
> >
> >     if (maxParseTime!=-1)
> >         parseResult = runParser(p, content);
> >     else
> >         parseResult = p.getParse(content);
> >
> >     with
> >
> >     try
> >     {
> >         if (maxParseTime!=-1)
> >             parseResult = runParser(p, content);
> >         else
> >             parseResult = p.getParse(content);
> >     } catch( Throwable e )
> >     {
> >         LOG.warn( "Parsing " + content.getUrl() + " with " + p.getClass().getName() + " failed: " + e.getMessage() ) ;
> >     }
> >
> >     Regards,
> >     Arkadi
> >
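For readers who want to try the idea outside Nutch, here is a minimal, self-contained sketch of the guard suggested above. It is not the NUTCH-2071 patch and does not use Nutch's API: `SafeParseSketch`, `DocParser`, `parseSafely` and `parseWithFallback` are hypothetical names invented for illustration, and `java.util.logging` stands in for the logger used in ParseUtil. The point it illustrates is that `NoSuchMethodError` and `IncompatibleClassChangeError` are `Error`s, so a `catch (Exception e)` would not stop them; only `catch (Throwable t)` does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

class SafeParseSketch {

    /** Hypothetical stand-in for a parser plugin; NOT the Nutch Parser interface. */
    interface DocParser {
        String parse(String content);
    }

    private static final Logger LOG = Logger.getLogger("SafeParseSketch");

    /**
     * Runs one parser and returns its result, or null if the parser threw
     * anything at all, including linkage Errors caused by a missing or
     * incompatible dependency.
     */
    static String parseSafely(DocParser parser, String url, String content) {
        try {
            return parser.parse(content);
        } catch (Throwable t) {
            // Log and continue, so one bad document or plugin cannot fail the whole job.
            LOG.warning("Parsing " + url + " with "
                    + parser.getClass().getName() + " failed: " + t);
            return null;
        }
    }

    /** Tries each parser in order and returns the first non-null result, or null. */
    static String parseWithFallback(List<DocParser> parsers, String url, String content) {
        for (DocParser p : parsers) {
            String result = parseSafely(p, url, content);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulates a plugin whose dependency jar is missing or incompatible.
        DocParser broken = c -> { throw new NoSuchMethodError("missing dependency"); };
        // A trivial parser that succeeds.
        DocParser working = c -> c.trim();

        String parsed = parseWithFallback(
                Arrays.asList(broken, working), "http://example.com/doc", "  some text  ");
        System.out.println("Result: " + parsed);
    }
}
```

One trade-off to keep in mind: catching `Throwable` also swallows conditions such as `OutOfMemoryError`, so whether to rethrow some of them is a design decision this sketch leaves open.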

