Hi Sebastian,

I apologise for the long silence on this issue. I have been out of town and
will be back on Monday. I will then do what you are asking within 2-3 days.

Regards,
Arkadi
________________________________________
From: Sebastian Nagel [[email protected]]
Sent: Friday, 24 July 2015 6:38 AM
To: [email protected]
Cc: Kosmynin, Arkadi (CASS, Marsfield)
Subject: Re: A parser failure on a single document may fail crawling job

Hi Arkadi,

Does the problem persist?
Which version of Nutch are you using?
Can you point to one file or URL to reproduce it?

Thanks,
Sebastian

On 06/26/2015 03:26 PM, Sebastian Nagel wrote:
> Hi Arkadi,
>
> Thanks for reporting that. Can you open a Jira ticket [1] to address this bug?
>
> It's rather a bug in the parse-tika plugin and should be solved there,
> cf. https://issues.apache.org/jira/browse/TIKA-1240
> A plugin should be able to load all required classes.
>
> Thanks,
> Sebastian
>
> [1] https://issues.apache.org/jira/browse/NUTCH
>
> 2015-06-23 3:59 GMT+02:00 <[email protected] 
> <mailto:[email protected]>>:
>
>     Hi,
>
>     This is what happened:
>
>     java.io.IOException: Job failed!
>             at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>             at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
>             <...>
>     Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
>                     at java.lang.ClassLoader.defineClass1(Native Method)
>                     at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>                     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>                     at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>                     at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>                     at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>                     at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>                     at java.security.AccessController.doPrivileged(Native Method)
>                     at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>                     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>                     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>                     at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
>                     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
>                     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
>
>     Suggested fix in ParseUtil:
>
>     Replace
>
>                 if (maxParseTime!=-1)
>                            parseResult = runParser(parsers[i], content);
>                 else
>                            parseResult = parsers[i].getParse(content);
>
>     with
>
>           try
>           {
>                 if (maxParseTime!=-1)
>                            parseResult = runParser(parsers[i], content);
>                 else
>                            parseResult = parsers[i].getParse(content);
>           } catch( Throwable e )
>           {
>             LOG.warn( "Parsing " + content.getUrl() + " with "
>                 + parsers[i].getClass().getName() + " failed: " + e.getMessage() ) ;
>             parseResult = null ;
>           }
>
>     Also replace
>
>           if (maxParseTime!=-1)
>                       parseResult = runParser(p, content);
>            else
>                       parseResult = p.getParse(content);
>
>     with
>
>         try
>         {
>           if (maxParseTime!=-1)
>                       parseResult = runParser(p, content);
>            else
>                       parseResult = p.getParse(content);
>         } catch( Throwable e )
>         {
>           LOG.warn( "Parsing " + content.getUrl() + " with "
>               + p.getClass().getName() + " failed: " + e.getMessage() ) ;
>         }
>
>     Regards,
>     Arkadi
>
>
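
Editorial note: the fix suggested above catches Throwable, not Exception.
This matters because IncompatibleClassChangeError is a LinkageError, i.e. an
Error, so a catch (Exception e) block would not stop it from failing the job.
The following is a minimal standalone sketch of that idea, not Nutch code: the
names ParserIsolationSketch, DocParser and parseWithFallback are hypothetical
stand-ins, and the linkage failure is simulated by throwing the error directly.

```java
import java.util.ArrayList;
import java.util.List;

final class ParserIsolationSketch {

    /** Hypothetical stand-in for a parser plugin; NOT the Nutch Parser interface. */
    interface DocParser {
        String parse(String content);
    }

    /**
     * Tries each parser in turn and returns the first non-null result,
     * or null if every parser failed.
     * Any Throwable (including Errors such as IncompatibleClassChangeError)
     * is logged and treated as "this parser failed", so a single bad
     * document or broken plugin cannot abort the caller.
     */
    static String parseWithFallback(List<DocParser> parsers, String url, String content) {
        for (DocParser p : parsers) {
            String result = null;
            try {
                result = p.parse(content);
            } catch (Throwable e) {
                // Log and fall through to the next parser.
                System.err.println("Parsing " + url + " with "
                        + p.getClass().getName() + " failed: " + e);
            }
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<DocParser> parsers = new ArrayList<>();
        // Simulated linkage failure, similar in kind to the one in the stack trace.
        parsers.add(c -> { throw new IncompatibleClassChangeError("simulated"); });
        // A parser that works.
        parsers.add(c -> "parsed: " + c);

        // The first parser's Error is caught, so the second parser still runs.
        System.out.println(parseWithFallback(parsers, "http://example.com/doc", "hello"));
    }
}
```

Note that catching Throwable also swallows errors such as OutOfMemoryError, so
this trades strictness for robustness of the overall job. Whether that is
acceptable depends on the deployment.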
