Hi,

I tried to upgrade Any23 from 2.1 to 2.2 in the Nutch code base.

Changes:
1. src/plugin/any23/ivy.xml:
<dependency org="org.apache.any23" name="apache-any23-core" rev="2.2" conf="*->default">

2. src/plugin/any23/plugin.xml:
<library name="apache-any23-api-2.2.jar"/>
<library name="apache-any23-core-2.2.jar"/>
<library name="apache-any23-csvutils-2.2.jar"/>
<library name="apache-any23-encoding-2.2.jar"/>
<library name="apache-any23-mime-2.2.jar"/>


After running "ant runtime", the following jar files are present in
runtime/local/plugins/any23:

any23.jar
apache-any23-api-2.2.jar
apache-any23-core-2.2.jar
apache-any23-csvutils-2.2.jar
apache-any23-encoding-2.2.jar
apache-any23-mime-2.2.jar
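
To double-check that the jars in this directory match what plugin.xml declares, a small comparison can help. This is only a rough sketch, under the assumption (which I have not verified in the Nutch source) that the plugin class loader only picks up jars listed as <library> entries. The class name PluginLibraryCheck and its methods are made up for illustration, and the default paths are the ones from this mail.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: compares plugin.xml with the plugin directory.
public class PluginLibraryCheck {

    // Matches <library name="..."> where name is the first attribute.
    // A plain regex is used for brevity; it does not skip XML comments.
    private static final Pattern LIBRARY =
            Pattern.compile("<library\\s+name=\"([^\"]+)\"");

    /** Collects the jar names declared as <library> entries in the given XML text. */
    static Set<String> declaredLibraries(String pluginXml) {
        Set<String> names = new TreeSet<>();
        Matcher m = LIBRARY.matcher(pluginXml);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Returns the elements of a that are not in b. */
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path pluginXml = Paths.get(args.length > 0 ? args[0] : "src/plugin/any23/plugin.xml");
        Path pluginDir = Paths.get(args.length > 1 ? args[1] : "runtime/local/plugins/any23");
        if (!Files.isRegularFile(pluginXml) || !Files.isDirectory(pluginDir)) {
            System.err.println("plugin.xml or plugin directory not found");
            return;
        }
        String xml = new String(Files.readAllBytes(pluginXml), StandardCharsets.UTF_8);
        Set<String> declared = declaredLibraries(xml);

        // Collect the names of all jars physically present in the plugin directory.
        Set<String> onDisk = new TreeSet<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(pluginDir, "*.jar")) {
            for (Path jar : jars) {
                onDisk.add(jar.getFileName().toString());
            }
        }
        System.out.println("declared but missing on disk: " + difference(declared, onDisk));
        System.out.println("on disk but not declared:     " + difference(onDisk, declared));
    }
}
```

Both lists being empty would mean that plugin.xml and the directory agree, which would point the search toward jars that were never copied at all.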




I ran the parse checker on a test HTML file and got the following errors:
1.  java.util.concurrent.ExecutionException:
java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
 ....
Caused by: java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry

2. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
...
Caused by: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
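
To see whether any jar in the plugin directory actually contains the two classes named in these errors, the jars can be scanned for the matching .class entries. This is a minimal sketch using only the JDK; JarClassFinder and findJarsContaining are hypothetical names, and the default directory is the one from this mail.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Nutch or Any23.
public class JarClassFinder {

    /** Returns the file names of all jars directly inside dir that contain the given class. */
    static List<String> findJarsContaining(Path dir, String className) throws IOException {
        // A class a.b.C is stored in a jar under the entry a/b/C.class.
        String entryName = className.replace('.', '/') + ".class";
        List<String> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getEntry(entryName) != null) {
                        hits.add(jar.getFileName().toString());
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "runtime/local/plugins/any23");
        if (!Files.isDirectory(dir)) {
            System.err.println("not a directory: " + dir);
            return;
        }
        // The two classes reported as missing in the errors above.
        String[] missing = {
            "org.eclipse.rdf4j.common.lang.service.ServiceRegistry",
            "org.apache.any23.extractor.ExtractorRegistryImpl"
        };
        for (String className : missing) {
            System.out.println(className + " -> " + findJarsContaining(dir, className));
        }
    }
}
```

An empty list for a class would suggest that the jar providing it never reached the plugin directory.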




The entire log is attached as debug.txt.


Regards,
Govind
2018-04-02 17:09:49,999 INFO  parse.ParserChecker (ParserChecker.java:run(122)) 
- fetching: file:/tmp/exact_code.html
2018-04-02 17:09:50,205 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No 
object cache found for conf=Configuration: core-default.xml, core-site.xml, 
nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,328 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No 
object cache found for conf=Configuration: core-default.xml, core-site.xml, 
nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,366 TRACE file.File (FileResponse.java:<init>(117)) - 
fetching file:/tmp/exact_code.html
2018-04-02 17:09:50,450 INFO  parse.ParseSegment 
(ParseSegment.java:isTruncated(207)) - file:/tmp/exact_code.html skipped. 
Content of size 79433 was truncated to 65536
2018-04-02 17:09:50,450 WARN  parse.ParserChecker (ParserChecker.java:run(187)) 
- Content is truncated, parse may fail!
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: extractor, 
extension-id: ir.co.bayan.simorq.zal.extractor.nutch.ExtractorParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-tika, 
extension-id: org.apache.nutch.parse.tika.TikaParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-ext, 
extension-id: ExtParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-html, 
extension-id: org.apache.nutch.parse.html.HtmlParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-js, 
extension-id: JSParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: feed, 
extension-id: org.apache.nutch.parse.feed.FeedParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-swf, 
extension-id: org.apache.nutch.parse.swf.SWFParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-zip, 
extension-id: org.apache.nutch.parse.zip.ZipParser
2018-04-02 17:09:50,461 INFO  parse.ParserFactory 
(ParserFactory.java:matchExtensions(374)) - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser - 
org.apache.nutch.parse.html.HtmlParser] are enabled via the plugin.includes 
system property, and all claim to support the content type text/html, but they 
are not mapped to it  in the parse-plugins.xml file
2018-04-02 17:09:50,871 DEBUG parse.ParseUtil (ParseUtil.java:parse(91)) - 
Parsing [file:/tmp/exact_code.html] with 
[org.apache.nutch.parse.tika.TikaParser@693fe6c9]
2018-04-02 17:09:50,878 DEBUG tika.TikaParser (TikaParser.java:getParse(101)) - 
Using Tika parser org.apache.tika.parser.html.HtmlParser for mime-type text/html
2018-04-02 17:09:51,205 TRACE tika.TikaParser (TikaParser.java:getParse(152)) - 
Meta tags for file:/tmp/exact_code.html: base=null, noCache=false, 
noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
   - viewport   =       width=device-width, initial-scale=1
   - dc:title   =       I.F. on Kharms – Just a Beginning
   - content-encoding   =       UTF-8
   - generator  =       WordPress 4.9.4
   - content-type       =       text/html; charset=UTF-8
   - robots     =       index,follow
 * http-equiv tags:

2018-04-02 17:09:51,206 TRACE tika.TikaParser (TikaParser.java:getParse(159)) - 
Getting text...
2018-04-02 17:09:51,222 TRACE tika.TikaParser (TikaParser.java:getParse(165)) - 
Getting title...
2018-04-02 17:09:51,224 TRACE tika.TikaParser (TikaParser.java:getParse(183)) - 
Getting links (base URL = file:/tmp/exact_code.html) ...
2018-04-02 17:09:51,227 TRACE tika.TikaParser (TikaParser.java:getParse(193)) - 
found 40 outlinks in file:/tmp/exact_code.html
2018-04-02 17:09:51,248 WARN  parse.ParseUtil (ParseUtil.java:runParser(173)) - 
Error parsing file:/tmp/exact_code.html with 
org.apache.nutch.parse.tika.TikaParser@693fe6c9
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: 
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:206)
        at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:171)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:202)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:268)
Caused by: java.lang.NoClassDefFoundError: 
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
        at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:70)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.any23.Any23.<init>(Any23.java:137)
        at org.apache.any23.Any23.<init>(Any23.java:147)
        at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.parse(Any23ParseFilter.java:109)
        at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.<init>(Any23ParseFilter.java:92)
        at org.apache.nutch.any23.Any23ParseFilter.filter(Any23ParseFilter.java:172)
        at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:46)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:227)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: 
org.eclipse.rdf4j.common.lang.service.ServiceRegistry
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.nutch.plugin.PluginClassLoader.loadClassFromSystem(PluginClassLoader.java:104)
        at org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:92)
        at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:72)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 24 more
2018-04-02 17:09:51,251 DEBUG parse.ParseUtil (ParseUtil.java:parse(91)) - 
Parsing [file:/tmp/exact_code.html] with 
[org.apache.nutch.parse.html.HtmlParser@6a84a97d]
2018-04-02 17:09:51,280 TRACE util.EncodingDetector 
(EncodingDetector.java:guessEncoding(243)) - file:/tmp/exact_code.html: charset 
utf-8 (sniffed)
2018-04-02 17:09:51,280 TRACE util.EncodingDetector 
(EncodingDetector.java:guessEncoding(258)) - file:/tmp/exact_code.html: 
Choosing encoding: utf-8 (sniffed)
2018-04-02 17:09:51,280 TRACE html.HtmlParser (HtmlParser.java:getParse(180)) - 
Parsing...
[Error] :10:44: Missing attribute name.
[Error] :11:56: Missing attribute name.
[Error] :12:43: Missing attribute name.
[Error] :13:55: Missing whitespace before attribute "rel".
[Error] :13:69: Missing attribute name.
[Error] :14:135: Missing attribute name.
[Error] :15:153: Missing attribute name.
[Error] :16:169: Missing attribute name.
[Error] :35:228: Missing attribute name.
[Error] :36:179: Missing attribute name.
[Error] :45:82: Missing attribute name.
[Error] :46:116: Missing attribute name.
[Error] :47:129: Missing attribute name.
[Error] :48:123: Missing attribute name.
[Error] :49:50: Missing attribute name.
[Error] :50:75: Missing attribute name.
[Error] :51:70: Missing attribute name.
[Error] :52:179: Missing attribute name.
[Error] :53:187: Missing attribute name.
[Error] :67:210: Missing attribute name.
[Error] :150:581: Missing attribute name.
[Error] :151:247: Missing attribute name.
[Error] :152:135: Missing attribute name.
[Error] :153:107: Missing attribute name.
[Error] :153:186: Missing attribute name.
[Error] :154:74: Missing attribute name.
[Error] :174:123: Missing attribute name.
[Error] :187:14: Missing attribute name.
[Error] :348:581: Premature end of file encountered.
[Error] :348:581: Premature end of file encountered.
[Warning] :348:581: Element <PATH> not closed properly.
[Warning] :348:581: Element <SYMBOL> not closed properly.
[Warning] :348:581: Element <DEFS> not closed properly.
[Warning] :348:581: Element <SVG> not closed properly.
[Warning] :348:581: Element <BODY> not closed properly.
[Warning] :348:581: Element <HTML> not closed properly.
2018-04-02 17:09:51,377 TRACE html.HtmlParser (HtmlParser.java:getParse(205)) - 
Meta tags for file:/tmp/exact_code.html: base=null, noCache=false, 
noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
   - viewport   =       width=device-width, initial-scale=1
   - generator  =       WordPress 4.9.4
   - robots     =       index,follow
 * http-equiv tags:

2018-04-02 17:09:51,377 TRACE html.HtmlParser (HtmlParser.java:getParse(211)) - 
Getting text...
2018-04-02 17:09:51,385 TRACE html.HtmlParser (HtmlParser.java:getParse(217)) - 
Getting title...
2018-04-02 17:09:51,386 TRACE html.HtmlParser (HtmlParser.java:getParse(235)) - 
Getting links...
2018-04-02 17:09:51,388 TRACE html.HtmlParser (HtmlParser.java:getParse(240)) - 
found 47 outlinks in file:/tmp/exact_code.html
2018-04-02 17:09:51,389 WARN  parse.ParseUtil (ParseUtil.java:runParser(173)) - 
Error parsing file:/tmp/exact_code.html with 
org.apache.nutch.parse.html.HtmlParser@6a84a97d
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: 
org/apache/any23/extractor/ExtractorRegistryImpl
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:206)
        at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:171)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:202)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:268)
Caused by: java.lang.NoClassDefFoundError: 
org/apache/any23/extractor/ExtractorRegistryImpl
        at org.apache.any23.Any23.<init>(Any23.java:137)
        at org.apache.any23.Any23.<init>(Any23.java:147)
        at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.parse(Any23ParseFilter.java:109)
        at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.<init>(Any23ParseFilter.java:92)
        at org.apache.nutch.any23.Any23ParseFilter.filter(Any23ParseFilter.java:172)
        at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:46)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:257)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-04-02 17:09:51,390 WARN  parse.ParseUtil (ParseUtil.java:parse(104)) - 
Unable to successfully parse content file:/tmp/exact_code.html of type text/html
2018-04-02 17:09:51,391 INFO  crawl.SignatureFactory 
(SignatureFactory.java:getSignature(51)) - Using Signature impl: 
org.apache.nutch.crawl.MD5Signature
2018-04-02 17:09:51,400 INFO  parse.ParserChecker (ParserChecker.java:run(214)) 
- parsing: file:/tmp/exact_code.html
2018-04-02 17:09:51,400 INFO  parse.ParserChecker (ParserChecker.java:run(215)) 
- contentType: text/html
2018-04-02 17:09:51,400 INFO  parse.ParserChecker (ParserChecker.java:run(216)) 
- signature: 650db1bac1e2c1c04ad51c0f1b54f379
2018-04-02 17:09:51,401 INFO  parse.ParserChecker (ParserChecker.java:run(244)) 
- ---------
Url
---------------

2018-04-02 17:09:51,401 INFO  parse.ParserChecker (ParserChecker.java:run(246)) 
- 
---------
ParseData
---------

2018-04-02 17:09:51,401 INFO  parse.ParserChecker (ParserChecker.java:run(249)) 
- ---------
ParseText
---------
