Hi Sebastian,

> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Friday, 24 July 2015 6:39 AM
> To: [email protected]
> Cc: Kosmynin, Arkadi (CASS, Marsfield) <[email protected]>
> Subject: Re: A parser failure on a single document may fail crawling job
> 
> Hi Arkadi,
> 
> does the problem persist?

Yes.

> Which version of Nutch are you using?

1.9

> Can you point to one file or URL to reproduce it?

To reproduce:

- Remove a jar file that one of your parsers depends on. 
- Make Nutch parse any file using this parser.

This results in a NoSuchMethodError being thrown, and the crawling job fails.

I've created JIRA issue NUTCH-2071 and attached a patch. I believe that this
problem should be handled at the ParseUtil level, because people may use their
own or third-party parsers and Nutch should be protected from parser problems.
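
To illustrate the idea outside of Nutch, here is a minimal, self-contained
sketch. It is not the attached patch: the DocParser interface and the
parseSafely method are hypothetical names made up for this example, and the
real ParseUtil code looks different. It only shows why catching Throwable
(rather than Exception) keeps one broken parser from aborting the whole run.

```java
import java.util.ArrayList;
import java.util.List;

public class SafeParseSketch {

    /** Hypothetical stand-in for a parser plugin; NOT Nutch's Parser interface. */
    interface DocParser {
        String parse(String content);
    }

    /**
     * Tries each parser in order. A parser that throws anything, including
     * an Error such as NoSuchMethodError, is logged and skipped.
     * Returns the first non-null result, or null if every parser failed.
     */
    static String parseSafely(List<DocParser> parsers, String url, String content) {
        for (DocParser p : parsers) {
            try {
                String result = p.parse(content);
                if (result != null) {
                    return result;
                }
            } catch (Throwable t) {
                // Throwable, not Exception: linkage problems caused by a
                // missing or incompatible jar surface as Errors.
                System.err.println("Parsing " + url + " with "
                        + p.getClass().getName() + " failed: " + t);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<DocParser> parsers = new ArrayList<>();
        // Simulates a parser whose dependency is missing at run time.
        parsers.add(c -> { throw new NoSuchMethodError("simulated missing dependency"); });
        // A parser that works.
        parsers.add(c -> "parsed:" + c);

        // The first parser fails, the second one still gets its turn.
        System.out.println(parseSafely(parsers, "http://example.com/doc", "hello"));
    }
}
```

Running main logs one warning to stderr for the broken parser and then prints
"parsed:hello". The trade-off is that catching Throwable also swallows errors
such as OutOfMemoryError, so a real fix may want to be more selective.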

Regards,
Arkadi

> 
> Thanks,
> Sebastian
> 
> On 06/26/2015 03:26 PM, Sebastian Nagel wrote:
> > Hi Arkadi,
> >
> > thanks for reporting that. Can you open a Jira ticket [1] to address this 
> > bug?
> >
> > It's rather a bug of the plugin parse-tika and should be solved there,
> > cf. https://issues.apache.org/jira/browse/TIKA-1240
> > A plugin should be able to load all required classes.
> >
> > Thanks,
> > Sebastian
> >
> > [1] https://issues.apache.org/jira/browse/NUTCH
> >
> > 2015-06-23 3:59 GMT+02:00 <[email protected]
> <mailto:[email protected]>>:
> >
> >     Hi,
> >
> >     This is what happened:
> >
> >     java.io.IOException: Job failed!
> >             at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> >             at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
> >             <...>
> >     Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
> >                     at java.lang.ClassLoader.defineClass1(Native Method)
> >                     at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
> >                     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> >                     at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> >                     at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> >                     at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> >                     at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> >                     at java.security.AccessController.doPrivileged(Native Method)
> >                     at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> >                     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> >                     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> >                     at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
> >                     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
> >                     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
> >
> >     Suggested fix in ParseUtil:
> >
> >     Replace
> >
> >                 if (maxParseTime!=-1)
> >                            parseResult = runParser(parsers[i], content);
> >                 else
> >                            parseResult = parsers[i].getParse(content);
> >
> >     with
> >
> >           try
> >           {
> >                 if (maxParseTime!=-1)
> >                            parseResult = runParser(parsers[i], content);
> >                 else
> >                            parseResult = parsers[i].getParse(content);
> >           } catch( Throwable e )
> >           {
> >             LOG.warn( "Parsing " + content.getUrl() + " with " +
> >                 parsers[i].getClass().getName() + " failed: " + e.getMessage() ) ;
> >             parseResult = null ;
> >           }
> >
> >     Also replace
> >
> >           if (maxParseTime!=-1)
> >                       parseResult = runParser(p, content);
> >            else
> >                       parseResult = p.getParse(content);
> >
> >     with
> >
> >         try
> >         {
> >           if (maxParseTime!=-1)
> >                       parseResult = runParser(p, content);
> >            else
> >                       parseResult = p.getParse(content);
> >         } catch( Throwable e )
> >         {
> >           LOG.warn( "Parsing " + content.getUrl() + " with " +
> >               p.getClass().getName() + " failed: " + e.getMessage() ) ;
> >         }
> >
> >     Regards,
> >     Arkadi
> >
> >
