Hi,
I tried to upgrade Any23 from 2.1 to 2.2 in the Nutch code base.
Changes:
1. src/plugin/any23/ivy.xml:
<dependency org="org.apache.any23" name="apache-any23-core" rev="2.2"
conf="*->default">
2. src/plugin/any23/plugin.xml:
<library name="apache-any23-api-2.2.jar"/>
<library name="apache-any23-core-2.2.jar"/>
<library name="apache-any23-csvutils-2.2.jar"/>
<library name="apache-any23-encoding-2.2.jar"/>
<library name="apache-any23-mime-2.2.jar"/>
After running "ant runtime", the following jar files are present in
runtime/local/plugins/any23:
any23.jar
apache-any23-api-2.2.jar
apache-any23-core-2.2.jar
apache-any23-csvutils-2.2.jar
apache-any23-encoding-2.2.jar
apache-any23-mime-2.2.jar
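
To cross-check the two lists above, a small script along these lines could compare the <library> entries in plugin.xml with the jars actually present in the plugin directory. This is only a sketch: the function names are made up for illustration, and it assumes each library is declared as <library name="..."> in plugin.xml.

```python
import os
import re


def declared_libraries(plugin_xml_text):
    """Return the jar names declared in <library name="..."> elements."""
    return set(re.findall(r'<library\s+name="([^"]+)"', plugin_xml_text))


def compare_libraries(plugin_xml_text, plugin_dir):
    """Compare declared libraries with the jars found in plugin_dir.

    Returns (missing, undeclared):
      missing    - declared in plugin.xml but not present on disk
      undeclared - present on disk but not declared in plugin.xml
    """
    declared = declared_libraries(plugin_xml_text)
    on_disk = {f for f in os.listdir(plugin_dir) if f.endswith(".jar")}
    return sorted(declared - on_disk), sorted(on_disk - declared)
```

For example, it could be called with the text of src/plugin/any23/plugin.xml and the path runtime/local/plugins/any23. It only compares names, so it says nothing about transitive dependencies that are neither declared nor copied.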
I ran a simple parse check on a test HTML file and got the following errors:
1. java.util.concurrent.ExecutionException:
java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
....
Caused by: java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
2. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
...
Caused by: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
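
One way to see whether any jar in the plugin directory actually provides the classes named in these errors is to scan the jars for the matching .class entry. The sketch below is a hypothetical helper (the function name is my own), and it assumes the jars are ordinary zip archives sitting directly in the plugin directory.

```python
import os
import zipfile


def jars_containing(plugin_dir, class_name):
    """Return the names of jars in plugin_dir that contain class_name.

    class_name may use dots or slashes, e.g.
    "org.apache.any23.extractor.ExtractorRegistryImpl" or
    "org/apache/any23/extractor/ExtractorRegistryImpl".
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for name in sorted(os.listdir(plugin_dir)):
        if not name.endswith(".jar"):
            continue
        with zipfile.ZipFile(os.path.join(plugin_dir, name)) as jar:
            if entry in jar.namelist():
                hits.append(name)
    return hits
```

Calling jars_containing("runtime/local/plugins/any23", "org/eclipse/rdf4j/common/lang/service/ServiceRegistry") would return an empty list if none of the jars there ships that class.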
The entire log file is attached as debug.txt.
Regards,
Govind
2018-04-02 17:09:49,999 INFO parse.ParserChecker (ParserChecker.java:run(122))
- fetching: file:/tmp/exact_code.html
2018-04-02 17:09:50,205 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No
object cache found for conf=Configuration: core-default.xml, core-site.xml,
nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,328 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No
object cache found for conf=Configuration: core-default.xml, core-site.xml,
nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,366 TRACE file.File (FileResponse.java:<init>(117)) -
fetching file:/tmp/exact_code.html
2018-04-02 17:09:50,450 INFO parse.ParseSegment
(ParseSegment.java:isTruncated(207)) - file:/tmp/exact_code.html skipped.
Content of size 79433 was truncated to 65536
2018-04-02 17:09:50,450 WARN parse.ParserChecker (ParserChecker.java:run(187))
- Content is truncated, parse may fail!
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: extractor,
extension-id: ir.co.bayan.simorq.zal.extractor.nutch.ExtractorParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-tika,
extension-id: org.apache.nutch.parse.tika.TikaParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-ext,
extension-id: ExtParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-html,
extension-id: org.apache.nutch.parse.html.HtmlParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-js,
extension-id: JSParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: feed,
extension-id: org.apache.nutch.parse.feed.FeedParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-swf,
extension-id: org.apache.nutch.parse.swf.SWFParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-zip,
extension-id: org.apache.nutch.parse.zip.ZipParser
2018-04-02 17:09:50,461 INFO parse.ParserFactory
(ParserFactory.java:matchExtensions(374)) - The parsing plugins:
[org.apache.nutch.parse.tika.TikaParser -
org.apache.nutch.parse.html.HtmlParser] are enabled via the plugin.includes
system property, and all claim to support the content type text/html, but they
are not mapped to it in the parse-plugins.xml file
2018-04-02 17:09:50,871 DEBUG parse.ParseUtil (ParseUtil.java:parse(91)) -
Parsing [file:/tmp/exact_code.html] with
[org.apache.nutch.parse.tika.TikaParser@693fe6c9]
2018-04-02 17:09:50,878 DEBUG tika.TikaParser (TikaParser.java:getParse(101)) -
Using Tika parser org.apache.tika.parser.html.HtmlParser for mime-type text/html
2018-04-02 17:09:51,205 TRACE tika.TikaParser (TikaParser.java:getParse(152)) -
Meta tags for file:/tmp/exact_code.html: base=null, noCache=false,
noFollow=false, noIndex=false, refresh=false, refreshHref=null
* general tags:
- viewport = width=device-width, initial-scale=1
- dc:title = I.F. on Kharms – Just a Beginning
- content-encoding = UTF-8
- generator = WordPress 4.9.4
- content-type = text/html; charset=UTF-8
- robots = index,follow
* http-equiv tags:
2018-04-02 17:09:51,206 TRACE tika.TikaParser (TikaParser.java:getParse(159)) -
Getting text...
2018-04-02 17:09:51,222 TRACE tika.TikaParser (TikaParser.java:getParse(165)) -
Getting title...
2018-04-02 17:09:51,224 TRACE tika.TikaParser (TikaParser.java:getParse(183)) -
Getting links (base URL = file:/tmp/exact_code.html) ...
2018-04-02 17:09:51,227 TRACE tika.TikaParser (TikaParser.java:getParse(193)) -
found 40 outlinks in file:/tmp/exact_code.html
2018-04-02 17:09:51,248 WARN parse.ParseUtil (ParseUtil.java:runParser(173)) -
Error parsing file:/tmp/exact_code.html with
org.apache.nutch.parse.tika.TikaParser@693fe6c9
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:206)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:171)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:268)
Caused by: java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at
org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:70)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.any23.Any23.<init>(Any23.java:137)
at org.apache.any23.Any23.<init>(Any23.java:147)
at
org.apache.nutch.any23.Any23ParseFilter$Any23Parser.parse(Any23ParseFilter.java:109)
at
org.apache.nutch.any23.Any23ParseFilter$Any23Parser.<init>(Any23ParseFilter.java:92)
at
org.apache.nutch.any23.Any23ParseFilter.filter(Any23ParseFilter.java:172)
at
org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:46)
at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:227)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException:
org.eclipse.rdf4j.common.lang.service.ServiceRegistry
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at
org.apache.nutch.plugin.PluginClassLoader.loadClassFromSystem(PluginClassLoader.java:104)
at
org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:92)
at
org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:72)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 24 more
2018-04-02 17:09:51,251 DEBUG parse.ParseUtil (ParseUtil.java:parse(91)) -
Parsing [file:/tmp/exact_code.html] with
[org.apache.nutch.parse.html.HtmlParser@6a84a97d]
2018-04-02 17:09:51,280 TRACE util.EncodingDetector
(EncodingDetector.java:guessEncoding(243)) - file:/tmp/exact_code.html: charset
utf-8 (sniffed)
2018-04-02 17:09:51,280 TRACE util.EncodingDetector
(EncodingDetector.java:guessEncoding(258)) - file:/tmp/exact_code.html:
Choosing encoding: utf-8 (sniffed)
2018-04-02 17:09:51,280 TRACE html.HtmlParser (HtmlParser.java:getParse(180)) -
Parsing...
[Error] :10:44: Missing attribute name.
[Error] :11:56: Missing attribute name.
[Error] :12:43: Missing attribute name.
[Error] :13:55: Missing whitespace before attribute "rel".
[Error] :13:69: Missing attribute name.
[Error] :14:135: Missing attribute name.
[Error] :15:153: Missing attribute name.
[Error] :16:169: Missing attribute name.
[Error] :35:228: Missing attribute name.
[Error] :36:179: Missing attribute name.
[Error] :45:82: Missing attribute name.
[Error] :46:116: Missing attribute name.
[Error] :47:129: Missing attribute name.
[Error] :48:123: Missing attribute name.
[Error] :49:50: Missing attribute name.
[Error] :50:75: Missing attribute name.
[Error] :51:70: Missing attribute name.
[Error] :52:179: Missing attribute name.
[Error] :53:187: Missing attribute name.
[Error] :67:210: Missing attribute name.
[Error] :150:581: Missing attribute name.
[Error] :151:247: Missing attribute name.
[Error] :152:135: Missing attribute name.
[Error] :153:107: Missing attribute name.
[Error] :153:186: Missing attribute name.
[Error] :154:74: Missing attribute name.
[Error] :174:123: Missing attribute name.
[Error] :187:14: Missing attribute name.
[Error] :348:581: Premature end of file encountered.
[Error] :348:581: Premature end of file encountered.
[Warning] :348:581: Element <PATH> not closed properly.
[Warning] :348:581: Element <SYMBOL> not closed properly.
[Warning] :348:581: Element <DEFS> not closed properly.
[Warning] :348:581: Element <SVG> not closed properly.
[Warning] :348:581: Element <BODY> not closed properly.
[Warning] :348:581: Element <HTML> not closed properly.
2018-04-02 17:09:51,377 TRACE html.HtmlParser (HtmlParser.java:getParse(205)) -
Meta tags for file:/tmp/exact_code.html: base=null, noCache=false,
noFollow=false, noIndex=false, refresh=false, refreshHref=null
* general tags:
- viewport = width=device-width, initial-scale=1
- generator = WordPress 4.9.4
- robots = index,follow
* http-equiv tags:
2018-04-02 17:09:51,377 TRACE html.HtmlParser (HtmlParser.java:getParse(211)) -
Getting text...
2018-04-02 17:09:51,385 TRACE html.HtmlParser (HtmlParser.java:getParse(217)) -
Getting title...
2018-04-02 17:09:51,386 TRACE html.HtmlParser (HtmlParser.java:getParse(235)) -
Getting links...
2018-04-02 17:09:51,388 TRACE html.HtmlParser (HtmlParser.java:getParse(240)) -
found 47 outlinks in file:/tmp/exact_code.html
2018-04-02 17:09:51,389 WARN parse.ParseUtil (ParseUtil.java:runParser(173)) -
Error parsing file:/tmp/exact_code.html with
org.apache.nutch.parse.html.HtmlParser@6a84a97d
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:206)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:171)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:268)
Caused by: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
at org.apache.any23.Any23.<init>(Any23.java:137)
at org.apache.any23.Any23.<init>(Any23.java:147)
at
org.apache.nutch.any23.Any23ParseFilter$Any23Parser.parse(Any23ParseFilter.java:109)
at
org.apache.nutch.any23.Any23ParseFilter$Any23Parser.<init>(Any23ParseFilter.java:92)
at
org.apache.nutch.any23.Any23ParseFilter.filter(Any23ParseFilter.java:172)
at
org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:46)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:257)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-04-02 17:09:51,390 WARN parse.ParseUtil (ParseUtil.java:parse(104)) -
Unable to successfully parse content file:/tmp/exact_code.html of type text/html
2018-04-02 17:09:51,391 INFO crawl.SignatureFactory
(SignatureFactory.java:getSignature(51)) - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2018-04-02 17:09:51,400 INFO parse.ParserChecker (ParserChecker.java:run(214))
- parsing: file:/tmp/exact_code.html
2018-04-02 17:09:51,400 INFO parse.ParserChecker (ParserChecker.java:run(215))
- contentType: text/html
2018-04-02 17:09:51,400 INFO parse.ParserChecker (ParserChecker.java:run(216))
- signature: 650db1bac1e2c1c04ad51c0f1b54f379
2018-04-02 17:09:51,401 INFO parse.ParserChecker (ParserChecker.java:run(244))
- ---------
Url
---------------
2018-04-02 17:09:51,401 INFO parse.ParserChecker (ParserChecker.java:run(246))
-
---------
ParseData
---------
2018-04-02 17:09:51,401 INFO parse.ParserChecker (ParserChecker.java:run(249))
- ---------
ParseText
---------